Parallel Independent Grid Files Based on a Dynamic Declustering Method Using Multiple Error Correcting Codes
Paolo Ciaccia
Technical Report UBLCS-94-21 November 1994
Laboratory for Computer Science University of Bologna Piazza di Porta S. Donato, 5 40127 Bologna (Italy)
The University of Bologna Laboratory for Computer Science Research Technical Reports are available in gzipped PostScript format via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW with the URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS. All local authors can be reached via e-mail at the address
[email protected]. Written requests and comments should be addressed to
[email protected].
Parallel Independent Grid Files Based on a Dynamic Declustering Method Using Multiple Error Correcting Codes1 Paolo Ciaccia2
Technical Report UBLCS-94-21 November 1994

Abstract

Several methods for declustering spatial (i.e., multi-dimensional) data on a set of M disks have been proposed in recent years, in order to reduce response time for large range queries and to increase the level of concurrency for small range and point queries. Some of these methods provide dynamic management of data only within disks, by using M parallel independent (PI) organizations and a static mapping of regions of space to disks that guarantees correct addressability. On the other hand, so-called multiplexed (MX) parallel structures, such as MX R-trees [KF92], dynamically decluster a single-disk spatial structure (e.g., an R-tree) without using any data-independent mapping function. The contribution of this paper to the development of efficient parallel spatial structures is twofold. First, we introduce a new declustering method, called MC2, based on a dynamic mapping function that assigns new regions of space to disks by using Multiple error Correcting Codes. Unlike previous methods based on static mapping functions, MC2 requires no global reorganization, because of the way splits of data regions are dealt with, and can therefore also handle (highly) dynamic data sets and non-stationary data distributions. As a second contribution, we show how dynamic management of data across disks can be achieved even with a PI organization, without losing global addressing capabilities. To this end, a new organization based on the grid file concept, namely PI-GF ('Parallel Independent Grid Files'), is introduced and its behavior formally specified.
1. Partial support for this work was provided by the Italian Ministry of University, Research and Technology and the National Research Council of Italy (CNR). 2. DEIS - CIOC CNR, University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
1 Introduction
Efficient management of spatial data is becoming a key requirement for many advanced database applications, such as those dealing with scientific and statistical data, images, geographic maps, etc. Because such applications often need to analyze large amounts of data resident on secondary memory devices, their requests represent a formidable challenge to the performance of I/O subsystems [DG92]. To avoid the I/O bottleneck, a promising technology available today is the one based on large arrays of small disks, which together can provide increased I/O bandwidth [PGK88]. This approach is also motivated by the consideration that the performance of individual disks is not likely to improve very much in the future, so that significant reductions in disk access time cannot be expected [Pat93]. In order to properly take advantage of parallel I/O subsystems, declustering algorithms for data placement and organization on disks assume primary relevance, the focus being reduction of response time for individual queries or increase of overall system throughput, the latter measure being particularly important in a multi-user environment [SL91]. For throughput maximization, the following, somewhat contrasting, requirements are to be satisfied by a good declustering method [KF92, SL91, ZSC94]:

Load balancing: The access load should be almost the same for each disk, in order to avoid 'hot spots' [WZS91].

Minimum load: Any query should lead to as few I/O operations as possible.

Parallelism degree: Whenever possible, multiple requests should be served in parallel by different disks.

The reference hardware architecture we consider in this paper consists of a single-processor host machine (spatial server) and M independent disks connected to it, where each disk has its own queue of requests.
Within this framework, we will deal with the typical class of multi-dimensional range (window) queries, which select data enclosed in a (hyper)rectangular volume, and for which the above-stated requirements translate into the following ones [KF92]:

Fairness: Large range queries should activate as many disks as possible, retrieving the same amount of data from each disk.

Efficiency: Small range and exact-match queries should activate as few disks as possible (one, in the limit case).
The specific problem we address here is the efficient declustering of a file of d-dimensional points (records), to be stored in fixed-size buckets (i.e., pages, blocks) of capacity c. To this end, in Section 2 we provide the necessary background on previous declustering methods dealing with the same problem. Some of these methods provide dynamic management of data only within disks, by using M parallel independent (PI) organizations and a static mapping of regions of space to disks that guarantees correct addressability. On the other hand, so-called multiplexed (MX) parallel structures dynamically decluster a single-disk spatial structure without using any data-independent mapping function. We recognize that such methods may fail to perform well under certain circumstances, for reasons related either to the specific declustering algorithm or to the underlying data organization. In order to obviate a major drawback of the considered PI organizations, namely the need for reorganization phases, we introduce in Section 3 a new declustering method, MC2, based on a dynamic mapping function that assigns new regions of space to disks by using Multiple error Correcting Codes. Unlike previous methods based on static mapping functions, MC2 requires no global reorganization, because of the way splits of data regions are dealt with, and can therefore also handle (highly) dynamic data sets and non-stationary data distributions. Since the MC2 method allows for both MX and PI organizations, we discuss them in Section 4. Section 5 provides a formalization for one of them, called 'parallel independent grid files' (PI-GF), which is based on the grid file concept and uses a set of 'local directories', one for each disk. Section 6 discusses some specific issues related to the MC2 method and to the management of directories in PI-GF, and Section 7 briefly concludes.
2 Background
In this work we consider a file F of d-dimensional points, where, without loss of generality, each point p = ⟨p_1, …, p_i, …, p_d⟩ ∈ F lies in the normalized unit space S = [0,1)^d. The i-th component of p, p_i, is the value of the i-th coordinate (attribute) of point p. Points of F are to be stored on M independent disks into fixed-size buckets of capacity c. Several declustering methods to distribute data files on M disks have been proposed. In our discussion we consider the most relevant approaches to the declustering problem for spatial data, and qualitatively evaluate them with respect to large and small range queries, precisely defined as follows.
Definition 2.1 (Small and large range queries) Given a set of buckets storing data of file F, a range query is small if it retrieves points from only one bucket; otherwise it is large.

Note that the above distinction is indeed arbitrary, its only aim being to highlight the different problems that such types of queries can raise.

2.1 Multiplexed organizations

We define a multiplexed (MX) organization as a parallel data structure that can be obtained from a corresponding single-disk structure by assigning its index nodes and data buckets to disks according to some specific algorithm. Remarkable examples of MX organizations are the LOB-trees [SL91] and the MX R-trees [KF92], which are multi-disk versions of B-trees and R-trees [Gut84], respectively. Note, however, that LOB-trees are not multi-dimensional data structures. Provided a good declustering algorithm can be found, every single-disk spatial structure can be parallelized into a corresponding MX organization. For instance, MX R-trees are a variant of R-trees where so-called 'cross-disk' pointers are also present. The abstract structure of an MX organization is shown in Figure 1. MX organizations can achieve high-level performance on large range queries,
Figure 1. A generic MX organization.
provided the declustering algorithm meets the fairness requirement, thus evenly distributing accessed pages across disks. As to small range queries, these methods satisfy the minimum load requirement, but the number of activated disks depends on the height of the tree or, should the index trees be unbalanced, on the length of the longest index path. In highly loaded systems, this can penalize the throughput, because each I/O request can involve a different queue.
2.2 Large page organizations

In a large page (LP) organization one starts with a corresponding single-disk structure whose pages, more properly called super pages, are M times larger than usual. A super page thus consists of M pages, each allocated to a different disk. Experimental results comparing parallel organizations for B-trees and R-trees show that the super page approach is not really competitive with multiplexed organizations (see [SL91, KF92] for details). The main reason is that LP organizations fail to satisfy the minimum load requirement and, consequently, have a low efficiency.

2.3 Parallel independent organizations

With parallel independent (PI) organizations we assign data to disks according to some criterion, and then independently organize data within each disk. There is a major problem with PI organizations that can severely limit their applicability. We call it the 'visibility' problem:

Definition 2.2 (Visibility) A parallel organization is visible if, given a point p ∈ F, we can say on which disk this point is stored.

No MX organization is visible. However, this is not a problem, since a single logical structure is involved and retrieval of a point involves a single index path. With PI organizations, on the other hand, loss of visibility is a major drawback, since it requires activating the search process on all disks, which is clearly not efficient for small range queries. In order to solve the visibility problem, one needs to assign data to disks by using some information related to the location of points in space. However, as noticed in [SL91], using some mapping function applied directly to point coordinates is a solution only for exact-match queries, but not for small range queries, since even (very) close points will presumably be mapped to different disks. A more flexible approach is to partition the whole space into a set of M 'subspaces', and assign each subspace to a different disk.
The mapping of subspaces to disks then becomes part of the PI organization, acting as a ‘routing device’, and leads to the abstract structure shown in Figure 2. Although this approach works fine for small range queries, it can violate the load
Figure 2. A generic PI organization. The ‘routing device’ is used to direct the search of a point along a single index path.
balancing requirement on large range queries if space partitioning consists of large connected
regions [KF92]. To this end, space partitioning has to be quite fine-grained, with subspaces resulting from the union of many disjoint small regions. This is the basic idea underlying the declustering methods based on the concept of Cartesian product file, which is defined as follows.

Definition 2.3 (Cartesian product file) Let D_i = [0,1) be the i-th (normalized) value domain (i = 1, 2, …, d) and let F ⊆ D_1 × D_2 × … × D_d be a file of d-dimensional points. Let D_i be partitioned into n_i disjoint intervals D_{i,0}, D_{i,1}, …, D_{i,n_i−1}, and let D_{i,j_i} be one of such intervals. File F is a Cartesian product file if all and only the points in the region D_{1,i_1} × D_{2,i_2} × … × D_{d,i_d}, represented by the ordered d-tuple ⟨i_1, i_2, …, i_d⟩, are stored in a same bucket.

Thus, in a Cartesian product file the space is partitioned by an orthogonal grid into a set of (hyper)rectangular regions, each of which is in correspondence with a data bucket. Declustering algorithms then operate as follows:

Partition the space: For each dimension, determine the intervals D_{i,j} (i ∈ [1,d], j ∈ [0, n_i − 1]);

Map the regions: Given a bucket b, whose region is ⟨i_1, i_2, …, i_d⟩, assign it to disk:
diskOf(⟨i_1, i_2, …, i_d⟩) = f(i_1, i_2, …, i_d, M)    (1)
where f () is a mapping function specific to the declustering method. For instance, for the CMD method it is [LSR92]:
diskOf_CMD(⟨i_1, i_2, …, i_d⟩) = (i_1 + i_2 + … + i_d) mod M    (2)
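To make Eqs. (1)–(2) concrete, the following minimal sketch (interval boundaries and function names are ours, purely illustrative) locates the grid region of a point and applies the CMD mapping:

```python
import bisect

def region_of(point, boundaries):
    """Locate the grid region <i1, ..., id> enclosing a point.

    `boundaries` holds, for each dimension, the sorted interior cut points
    of the intervals D_{i,0}, ..., D_{i,n_i - 1} partitioning [0, 1).
    """
    return tuple(bisect.bisect_right(cuts, x) for x, cuts in zip(point, boundaries))

def disk_of_cmd(region, M):
    """CMD mapping of Eq. (2): (i1 + i2 + ... + id) mod M."""
    return sum(region) % M

# Two dimensions, each axis cut at 0.25, 0.5 and 0.75 (4 intervals per axis).
boundaries = [[0.25, 0.5, 0.75], [0.25, 0.5, 0.75]]
region = region_of((0.6, 0.1), boundaries)
print(region, disk_of_cmd(region, 2))   # -> (2, 0) 0
```

Note that the mapping depends only on the region tuple, not on the data inside it, which is precisely the static character discussed below.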
It has to be observed that any declustering method for Cartesian product files establishes a fixed mapping between regions of space and disks. As a corollary, new data buckets and index nodes are always allocated to the same disk as the bucket (node) they originate from. Declustering algorithms for Cartesian product files date back to the disk modulo (DM) allocation method in [DS82]. More recently, advances have been introduced by the coordinate modulo declustering (CMD) method [LSR92], by the field-wise exclusive or (FX) method [KP88], and by algorithms based on error correcting codes (ECC) [FM91] and on space-filling Hilbert curves (fractals) (HCAM) [FB93]. Results in the above-referenced papers and in [HS94], where the DM, CMD, FX, ECC, and HCAM algorithms have been experimentally compared, show that no mapping function is superior to the others for all possible scenarios (i.e., relation size, number of disks, number of attributes, size of queries, etc.), when the performance metric is the 'bucket response time':3

Definition 2.4 (Bucket response time) Let q = ⟨[l_1, h_1), …, [l_i, h_i), …, [l_d, h_d)⟩ (0 ≤ l_i ≤ h_i ≤ 1, i = 1, …, d) be a range query over S that retrieves all the points in the d-dimensional box [l_1, h_1) × … × [l_i, h_i) × … × [l_d, h_d) ⊆ S. The bucket response time of q is taken to be max_u {B_u(q)}, where B_u(q) (u = 0, …, M − 1) is the number of qualifying buckets on the u-th disk unit, that is, those whose region has a non-empty intersection with q.

Note that the optimal bucket response time for query q is ⌈B(q)/M⌉, where B(q) = Σ_u B_u(q) is the total number of buckets qualifying for q.
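Definition 2.4 can be restated operationally: given the per-disk counts B_u(q), the response time is their maximum, while the optimum spreads B(q) evenly across the M disks. A minimal sketch:

```python
from math import ceil

def bucket_response_time(per_disk_counts):
    """max_u B_u(q): the most-loaded disk determines the response time."""
    return max(per_disk_counts)

def optimal_response_time(per_disk_counts):
    """ceil(B(q) / M), where B(q) is the total number of qualifying buckets."""
    return ceil(sum(per_disk_counts) / len(per_disk_counts))

# All 4 qualifying buckets on disk 0 and none on disk 1 (cf. Example 2.1):
print(bucket_response_time([4, 0]))    # -> 4
print(optimal_response_time([4, 0]))   # -> 2
```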
2.4 A criticism of Cartesian product files

For our purposes, it is important to understand what the limits of PI organizations originating from the Cartesian product file abstraction are. Our claim is that such organizations are appropriate only for almost static data sets, for which neither file size nor data distribution change through time. As long as we deal with (almost) static files, no problems arise, because regions of the Cartesian product file are in 1-to-1 correspondence with data buckets.

3. We speak of 'bucket response time' since, in the usual definition, only accesses to data buckets are concerned, without taking into account accesses to intermediate index nodes.

Because of the adopted
mapping function, these organizations are almost fair, since they lead to optimal bucket response time for many large range queries, and to good performance in all the other, sub-optimal, cases [HS94]. As to efficiency, it can be seen that only one disk is activated in the case of small range queries, since the assignment of regions to disks is known a priori. In the case of (highly) dynamic files things are quite different. Indeed, in this case many buckets can be needed for a region or, alternatively, many regions can become empty. Since, as noticed above, the mapping of regions of space to disks does not change through time, this can require reorganizing the whole structure, which is a costly I/O operation. We see that the problem is due to the lack of dynamic control on the granularity of declustering, which implies that a single unit of declustering can refer to an unpredictable amount of data that can vary through time. This lack of tuning capability is particularly evident in the case of range queries for which a single region qualifies, and which lead to retrieving all the allocated buckets from the same disk.

Example 2.1 Without loss of generality, consider the CMD method (Eq. (2)). Figure 3 shows the case where d = 2, M = 2, n_1 = n_2 = 2, and file F consists of N = 8 points, with 2 points per bucket. Given this 2 × 2 grid, each range query is optimally answered, as is immediate to verify.
Figure 3. Regions ⟨0,0⟩ and ⟨1,1⟩ are mapped to disk 0, the others to disk 1.
Now, assume that the file grows up to N = 32 points, and that bucket capacity is c = 2 points. Therefore, in the most optimistic case of a perfectly uniform distribution of data, each region now encloses 8 points, for which 4 buckets are required. For a query that needs to access only the lower-left region ⟨0,0⟩, 4 buckets on disk 0 have to be retrieved, whereas the optimal bucket response time is ⌈4/2⌉ = 2. In general, it can be seen that performance considerably deteriorates with file growth. In contrast, consider a 4 × 4 grid. If, at a certain instant, the file contains, say, only N = 8 points, most regions are empty, as shown in Figure 4, and performance will depend on the specific placement of the non-empty regions. □
Figure 4. 5 non-empty regions are mapped to disk 1, and only ⟨1,1⟩ is assigned to disk 0.
Our criticism of declustering methods for Cartesian product files is summarized as follows:
1. These methods can suffer from lack of load balancing and, consequently, lead to deterioration of performance, depending on current file size and data distribution. In the limit case, they require a global reorganization phase.
2. They can provide dynamic management of data only within disks4 (i.e., no transfer from one disk to another is possible), but not for the declustering process itself.
The extension of CMD described in [ZSC94] only partially remedies these problems. The idea is to allow refinements of regions by means of so-called logical grid indexes, in such a way that intervals on a coordinate are assigned a progressive integer (timestamp) value that depends on their creation time. This avoids the reallocation of many buckets to different disks. However, since timestamp values are not related to the placement of regions in space, the method cannot prevent degradation of performance, thus requiring periodical reorganization [ZSC94]. Figure 5 provides an illustrative example, for the case M = 2 and c = 2.
Figure 5. The insertion of point p in the file shown in (a) forces the split of interval 1 on the first coordinate (X). Rather than renumbering the intervals on the X axis, the timestamp value 3 is assigned to the new interval, as shown in (b). By doing so, however, both regions ⟨1,0⟩ and ⟨3,0⟩, which are adjacent in data space, are mapped to the same disk. The same holds for the pairs ⟨1,1⟩, ⟨3,1⟩ and ⟨1,2⟩, ⟨3,2⟩.
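The timestamping scheme of [ZSC94] can be illustrated with a small sketch (the data structures are ours, for illustration only): each new interval takes the next free integer id instead of a positional index, so existing regions keep their ids, and hence their disks.

```python
class Axis:
    """Intervals of one coordinate, identified by timestamps (cf. [ZSC94])."""

    def __init__(self, n):
        self.ids = list(range(n))   # interval ids, in positional order
        self.next_id = n            # next free timestamp

    def split(self, pos):
        """Split the interval at position `pos`; the new half gets a fresh id."""
        self.ids.insert(pos + 1, self.next_id)
        self.next_id += 1

x = Axis(3)       # X intervals 0, 1, 2, as in Figure 5(a)
x.split(1)        # inserting point p splits interval 1
print(x.ids)      # -> [0, 1, 3, 2], as in Figure 5(b)

# Under diskOf_CMD on ids with M = 2, regions <1, y> and <3, y> are
# adjacent in space, yet (1 + y) % 2 == (3 + y) % 2: same disk.
```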
3 The MC2 declustering method
We have seen that the major drawback of PI organizations is their inability to automatically move data regions from one disk to another, which stems from the need to avoid the visibility problem, itself mainly a concern for small range queries. Reallocation, on the other hand, is the usual policy in MX organizations, applied when buckets and nodes split. MX structures, however, can be penalized by queueing effects in the case of small range queries, because of the 'zig-zagging' between different disks. From this qualitative analysis it appears that, apart from specific improvements concerning (heuristic) allocation algorithms, investigating PI organizations that are based on a dynamic declustering method and have no visibility problem is worthwhile. In other terms, the challenging question is: can a PI organization move data from one disk to another and, at the same time, efficiently support small range queries? This would allow for a fair and efficient organization able to handle dynamic data sets as well.

4. For instance, in [LSR92] buckets that are assigned to a disk are locally organized as a grid file, that is, a dynamic structure.
The solution we propose is based on a two-step approach. First, we introduce a dynamic mapping function that does not assume any predefined subdivision of the space, and that requires no reorganization phase. Second, we design a PI organization based on this mapping function, and show how it deals with the visibility problem.

3.1 Region identification

We start by specifying how the partitioning of space takes place. To this end, we consider that bucket regions can have arbitrary size, but they are restricted to be binary radix regions, that is, the extension of a region along the i-th coordinate is a binary radix interval, I_i, obtained by repeatedly cutting the axis in halves. The binary interval I_i = [l_i, h_i) can be represented as a finite-length binary string b_1 b_2 b_3 … b_m, that is, the binary encoding of the fractional part of l_i (i.e., l_i = Σ_{j=1}^{m} b_j 2^{−j}). The width of I_i is h_i − l_i = 2^{−m}. The region I_1 × I_2 × … × I_d is then represented as ⟨I_1, I_2, …, I_d⟩. Now, we consider that regions are halved in a cyclic way, starting with the first coordinate, then the second, and so on. On these assumptions, each region can be identified with its z-value [Kum94, Ore86, Ore89], obtained by interleaving the bits of the binary radix intervals I_1, I_2, …, I_d. It is immediate to verify that regions of size 2^{−n} have an identifier of n bits, and that the j-th leftmost bit of any region identifier always pertains to coordinate ((j − 1) mod d) + 1. For instance, the 7-th bit in a 3-dimensional space always refers to the 1-st coordinate.

Example 3.1 Refer to Figure 6. In (a) the first axis (X) is split into 2 intervals, denoted 0 and 1. The region on the left is further subdivided in two by a split on the second axis (Y). Region-ids are shown inside the regions.
Figure 6 (b) shows a refinement of region ⟨0,1⟩, obtained by a split on the X axis, leading to the new regions ⟨00,1⟩ and ⟨01,1⟩, followed by a further partitioning of ⟨01,1⟩, which yields regions ⟨01,10⟩ and ⟨01,11⟩. □
Figure 6. Region identifiers are z-values, defined by binary radix intervals and bit interleaving.
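Under the cyclic-splitting assumption of Section 3.1, a region identifier can be computed by bit interleaving; a minimal sketch, with intervals given as bit strings:

```python
def z_value(intervals):
    """Interleave the bits of the binary radix intervals I_1, ..., I_d.

    Splits are cyclic, so the i-th interval is at most one bit longer than
    the (i+1)-th; shorter strings simply run out during the interleaving.
    """
    out = []
    for j in range(max(len(s) for s in intervals)):
        for s in intervals:
            if j < len(s):
                out.append(s[j])
    return ''.join(out)

# The regions of Figure 6(b):
print(z_value(['00', '1']))    # region <00, 1>  -> '010'
print(z_value(['01', '10']))   # region <01, 10> -> '0110'
print(z_value(['01', '11']))   # region <01, 11> -> '0111'
```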
3.2 Mapping regions to disks

Since region identifiers are z-values of variable length, we need a mapping function that can deal with this case and, at the same time, respects the heuristic declustering principle of assigning close regions to different disks. Finally, when a region splits, thus leading to two child regions, we want one (and only one) of them to be mapped to another disk, and the other to take the place of its parent on the same disk. Our choice has been to start from a well-known mapping function, namely ECC [FM91], and generalize it, indeed in a rather straightforward way, to the case of variable-length region identifiers. In our opinion, this simplifies the presentation and can rely, at least partially, on experimental and analytical results for one of the most effective mapping functions known so far. However, we need to point out that this choice is by no means a strict necessity, and that any
other mapping function able to deal with variable-length identifiers and satisfying the above requirements could do the job as well. For the sake of understanding, we briefly recall how ECC works. More details can be found in [FM91]. When all region identifiers have length n and the number of disks is M = 2^m, ECC defines m parity check equations on n-bit strings. These equations completely define a C(n,k) binary linear code (n is the length and k = n − m is the dimension of the code), and can conveniently be represented in matrix form. The resulting m × n matrix, here denoted H_n, is called the check matrix of the C(n,k) code. Let R be a region of size 2^{−n}, and id_n(R) be its n-bit identifier. Then, in matrix form, the mapping function of ECC is:

diskOf_ECC(id_n(R)) = (H_n · id_n(R)′)_2    (3)
where the prime denotes vector transposition.5 It is a basic property of binary linear codes that the set of all the 2^n region identifiers of length n gets partitioned by the above function into M groups, each containing 2^k elements.

Example 3.2 Consider the case where M = 4 and n = 4, and the following parity check matrix:

H_4 = | 1 0 1 1 |
      | 0 1 1 1 |

The resulting mapping is shown in Figure 7. □
With x along the horizontal axis (labels 00 to 11) and y along the vertical one, the assignment of regions to disks is:

y=11:  2 1 0 3
y=10:  1 2 3 0
y=01:  3 0 1 2
y=00:  0 3 2 1
x:     00 01 10 11
Figure 7. The ECC method. The disk a region is assigned to is shown within each region.
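The ECC mapping of Eq. (3) can be sketched as follows, using the H_4 check matrix of Example 3.2 (taking row 1 as the most significant bit of the disk number is our assumption):

```python
H4 = [[1, 0, 1, 1],
      [0, 1, 1, 1]]   # check matrix of a C(4, 2) code: M = 2^2 = 4 disks

def disk_of_ecc(H, region_id):
    """diskOf_ECC(id_n(R)) = (H . id_n(R)')_2.

    Each row of H is one parity check equation; the m parity bits, read
    top to bottom, form the binary representation of the disk number.
    """
    bits = [sum(h * b for h, b in zip(row, region_id)) % 2 for row in H]
    return int(''.join(map(str, bits)), 2)

print(disk_of_ecc(H4, [0, 0, 0, 0]))   # region <00, 00> -> disk 0
print(disk_of_ecc(H4, [0, 0, 1, 0]))   # its neighbour <01, 00> -> disk 3
```

Note how the two neighbouring regions land on different disks, as the declustering principle demands.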
To introduce the diskOf_MC2() mapping function, first consider the case where all the regions in the above example have been split. Thus, each region R gets replaced by two sub-regions, R0 and R1, respectively called the 0-child and the 1-child of R. The corresponding identifiers are obtained by simply appending either 0 or 1 to the id of R, that is: id_{n+1}(R0) = id_n(R) ∘ 0 and id_{n+1}(R1) = id_n(R) ∘ 1, where ∘ denotes concatenation. To apply the ECC method to this new situation, we need a new parity check matrix, H_5, with 5 columns. In principle, there needs to be no relationship between H_4 and H_5. However, let us consider things from a dynamic point of view, thus starting with H_4 (i.e., with 16 regions) and ending with H_5 (32 regions). In this case it is immediate to see that, in order to avoid a global reorganization, the only possible choice is to have H_5 = [H_4 | h], thus simply extending H_4 with a column vector h. This works because, regardless of h, all the new 0-child regions are not moved at all, since:

H_4 · id_4(R)′ = H_5 · id_5(R0)′
5. We note that in [KF92] a region identifier is obtained by concatenating, rather than interleaving, the binary representations of its coordinates. This distinction is immaterial for the understanding of the ECC method.
trivially holds. On the other hand, all the new 1-child regions are mapped to another disk (because h is not the null vector). Figure 8 shows the effect of adding to H_4 the column vector h = (1, 0)′; with x along the horizontal axis (labels 000 to 111) and y along the vertical one, the new assignment is:

y=11:  2 0 1 3 0 2 3 1
y=10:  1 3 2 0 3 1 0 2
y=01:  3 1 0 2 1 3 2 0
y=00:  0 2 3 1 2 0 1 3
x:     000 001 010 011 100 101 110 111

In
Figure 8. The new mapping of regions to disks, when the column h = (1, 0)′ is added to the H_4 check matrix.
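The key property — 0-children stay put, 1-children move — can be checked directly; a sketch reusing the matrices of Figures 7 and 8:

```python
H4 = [[1, 0, 1, 1],
      [0, 1, 1, 1]]
h = [1, 0]                                   # the added column of Figure 8
H5 = [row + [c] for row, c in zip(H4, h)]    # H5 = [H4 | h]

def disk_of(H, region_id):
    bits = [sum(a * b for a, b in zip(row, region_id)) % 2 for row in H]
    return int(''.join(map(str, bits)), 2)

parent = [0, 0, 1, 0]              # id_4(R)
print(disk_of(H4, parent))         # -> 3
print(disk_of(H5, parent + [0]))   # 0-child: -> 3, same disk as R
print(disk_of(H5, parent + [1]))   # 1-child: -> 1, moved by (h)_2 = 2
```

Every split therefore reallocates exactly one of the two children, and always by the same offset, the XOR with (h)_2.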
the 'transient phase', where some regions have not split yet, the method can be applied as well: for region identifiers of length 4, only the 4 leftmost columns of H_5 are to be used. Equivalently, id_4(R) is padded with one trailing 0. The above discussion and examples are formalized by the following:

Definition 3.1 (MC2 mapping function) The MC2 declustering method allocates a region R of size 2^{−n} to disk:

diskOf_MC2(id_n(R)) = (H_{n+} · id_n(R)′)_2    (4)

where n ≤ n+ and matrix multiplication uses only the n leftmost columns of H_{n+}.

Note that in the above definition n+ is not really needed, since we can always add columns to a check matrix.

Example 3.3 Figure 9 shows how regions of different size are mapped by using the H_4 check matrix, when M = 4 disks are used. □
Figure 9. The MC2 mapping function.
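Definition 3.1 can be sketched as follows (check matrix as in Figures 7–8; using only the n leftmost columns is equivalent to padding shorter identifiers with trailing zeros):

```python
H = [[1, 0, 1, 1, 1],
     [0, 1, 1, 1, 0]]   # current H_{n+}; it grows by one column per new size

def disk_of_mc2(region_id):
    """MC2 mapping (Eq. (4)) for a variable-length region identifier."""
    n = len(region_id)
    bits = [sum(a * b for a, b in zip(row[:n], region_id)) % 2 for row in H]
    return int(''.join(map(str, bits)), 2)

# Regions of different sizes coexist under the same mapping:
print(disk_of_mc2([1, 0]))             # a large region, 2-bit id -> 2
print(disk_of_mc2([0, 1, 0, 1, 1]))   # a 2^-5-sized region     -> 0
```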
In practice, MC2 uses Multiple error Correcting Codes (which justifies the name), whose number equals the number of different region sizes. It should be pointed out that, with respect to the basic ECC method, the following advantages are obtained:

There is no need to predetermine a grid to partition the data space. Thus, the method is robust with respect to changes in file size.

Regions are created and coalesced on the basis of the actual data set. Thus, the method is robust with respect to non-stationary data distributions.

This completes the description of the MC2 method. Next we describe both MX and PI organizations based on MC2.
4 MX and PI organizations for the MC2 method
It is a remarkable fact that both MX and PI organizations are possible with the MC2 method. An example of the first can be obtained by using a multilevel grid file (MLGF) organization [WK85], with index nodes allocated to disks on the basis, say, of the identifier of the region they correspond to. We call this parallel organization MX-GF (Multiplexed Grid File). The second possibility, the one we will analyze in detail, is to resort to a PI organization. The solution we propose makes use of grid files [NHS84] and will be referred to as PI-GF (Parallel Independent Grid Files). It is designed to satisfy the two following, rather contrasting, requirements:
- Dynamic reallocation: regions should be able to move from one disk to another.
- Visibility: at any time we would like to be able to say on which disk any data region is mapped.
For the sake of understanding, we recall that a grid file provides access to data buckets by using a grid directory, where each entry of the directory is called a (grid) cell and refers to a single bucket. In contrast, in order to avoid poor storage utilization, many cells can share a bucket. Their union, restricted to be a d-dimensional rectangle, is called the region of the bucket. Associated with the grid directory is a set of d linear scales that keep track of the current intervals defined for each of the d coordinates. Linear scales are kept in main memory, whereas the grid directory is stored on disk(s). We consider two basic alternatives for PI-GF: the first is to use a single (global) grid directory, whereas the second is to have M (local) grid directories.

4.1 PI-GF with a global directory

With this organization we have a single grid directory and a single set of linear scales. According to the current partitioning of space, we assign regions and cells to disks by using their z-values. Within each disk, an index structure is built to organize the cells.
Note that, if two or more cells share the same bucket and reside on different disks, we assign the bucket to the disk determined by the diskOfMC2() function applied to the bucket's region identifier. Since this requires the use of cross-disk pointers, we do not have full visibility. However, with this scheme we are guaranteed that the search for a point involves at most two disks: one for the cell, and another (possibly the same) for the bucket. Furthermore, the minimum load requirement is respected, since the search follows a single index path, as in MX organizations.

There are at least three problems in having a global directory for PI-GF. They are related to directory growth, directory update, and local organization.6

Directory growth. When using scale-based grid files it is well known that, in the case of (very) skewed data distributions, the directory size (i.e., the number of cells) grows as O(N^d), where N is the number of points. Because of the visibility problem, alternative directory organizations, such as MLGF [WK85], which would alleviate this problem, are not possible here.

6. We observe that this organization seems to be the one adopted in [ZSC94], where, however, considerations on local indexes are missing.
Directory update. With a single global directory, when an interval is halved, all local indexes have to be modified, since each disk will be assigned at least one new cell (see also Example 4.2 below).

Local organization. The cells that, at a certain time, are mapped by the diskOfMC2() function to disk unit u form a subspace S_u of the whole data space S. The problem of efficiently organizing such a subspace is analogous to that of organizing the whole space S when no data in S − S_u is present. Because of the very skewed nature of such a distribution, this does not appear to be a trivial problem to solve. A similar problem has been dealt with by the CMD method, by providing a ‘coordinate transformation function’ that transforms S_u into a rectangular space, so that the grid file organization can still be used [LSR92]. In our context, this can work only in certain cases. In particular, we are done if all cells have the same size, 2^-n, as if we were using the static ECC mapping function, because of the following basic observation:

Observation 4.1 (Information bits) For any binary linear code C(n, k) there exists a set of k bit positions such that any two codewords (i.e., binary strings) differ in the value of at least one bit in these positions. Such a set is called the information set, and its complement of m = n − k positions is called the check set. Coherently, a bit of a codeword in a position belonging to the information set is called an information bit, and a bit in a position from the check set is called a check bit.
If the m leftmost columns of the H_n check matrix are chosen to form an identity matrix, and we drop the first m bits from the cell identifiers mapped to a given disk, then, from the above observation, the resulting k-bit strings are all different. Each of these strings defines a so-called local cell identifier, from which the corresponding (global) cell identifier can easily be reconstructed. With this transformation, cells mapped to disk u can be organized into a grid structure.7

Example 4.1 Consider the 8 cells in Figure 8 that are mapped to disk 0. The set of cell identifiers is shown in Figure 10 (a). By dropping the first 2 bits, we consequently remove one bit from the first coordinate and one from the second. This leads to the local grid structure shown in Figure 10 (b). For instance, cell ⟨011, 10⟩, whose id is 01101, is transformed into ⟨11, 0⟩, whose local identifier is 101. □
Figure 10. Local grid organization when all cells have the same size. In (a), the cell identifiers of Figure 8 that are mapped to disk 0 are shown. In (b), within each cell we show the corresponding local identifier, with the dropped (i.e., check) bits enclosed in parentheses.
7. We are not aware of any other ‘coordinate transformation function’ for the ECC method.
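The local/global identifier transformation of Observation 4.1 can be sketched as follows. The block assumes a systematic check matrix (identity in the m leftmost columns), as in the text; the H5 matrix is our reconstruction matching Figure 10 and the function names are illustrative.

```python
# Sketch of the local/global cell-identifier transformation of
# Observation 4.1, assuming a systematic check matrix whose m
# leftmost columns form an identity. H5 matches Figure 10 (assumed).
H5 = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
]
M_BITS = len(H5)  # m = 2 check bits, M = 4 disks

def to_local(global_id):
    """Drop the m (leftmost) check bits: the local cell identifier."""
    return global_id[M_BITS:]

def to_global(local_id, disk):
    """Reconstruct the global identifier: with an identity prefix in
    the check matrix, each check bit is fixed independently so that
    the full identifier maps to the given disk."""
    n = M_BITS + len(local_id)
    bits = [int(b) for b in local_id]
    check = []
    for i, row in enumerate(H5):
        # parity contributed by the information (local) bits
        parity = sum(r * b for r, b in zip(row[M_BITS:n], bits)) % 2
        # i-th bit of the disk number, most significant first
        disk_bit = (disk >> (M_BITS - 1 - i)) & 1
        check.append(str(parity ^ disk_bit))
    return "".join(check) + local_id

# Example 4.1: cell id 01101 on disk 0 has local identifier 101
assert to_local("01101") == "101"
assert to_global("101", 0) == "01101"
```

The round trip is lossless precisely because the dropped bits are check bits: the disk number plus the information bits determine them uniquely.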
In the general case where cells have different sizes, however, the local grid organization breaks down, as the following example shows.

Example 4.2 Let, say, n = 4 and M = 4, as in Figure 7. Each disk stores 4 cells organized as a 2 × 2 grid. Now, assume that the interval 00 on the X axis is refined into 000 and 001. Each of the 4 new cells gets mapped to a different disk (the resulting mapping can be observed in the second column of Figure 8). But this means that each disk now manages 5 cells, and these cannot be accommodated into a 2-dimensional grid. □

Since the cells mapped to a certain disk cannot be organized into a grid structure, other approaches are needed. One possibility that could be considered here is to use a G-tree [Kum94], which has been explicitly designed for managing regions defined by z-values. Future work should investigate this hybrid organization (i.e., global grid file and local G-trees).

4.2 PI-GF with local directories

Let us now consider the case where each disk has its own local directory, denoted GD^u, and a set of (local) linear scales, L^u. First, let us note that, under the assumption that each disk stores almost the same amount of data, in the worst case each directory grows as O(N^d / M^d), so O(N^d / M^(d-1)) is the growth rate of the M directories together. From this we can see that, other things staying constant, increasing the number of disks tends to reduce the overall size of the directories (since d > 1). In order to preserve the local grid structures, we can start, as in Section 4.1, from the situation where all cells have the same size. The key difference is that, rather than having the cutting hyperplanes act on the global grid, we now let them act only on the local grids. This guarantees that local grid structures are always well defined and, as an additional positive side-effect, the number of new grid cells to be created upon the introduction of a new hyperplane decreases, because cells are partitioned over the M disks.
We first illustrate the mechanism through an example. Then, in Section 5, we formalize this organization, simply referred to as PI-GF in the sequel. For convenience, we no longer distinguish between global and local cell identifiers, since only the latter are significant at this point, and simply speak of cell identifiers.

Example 4.3 Consider the simple case of M = 2 disks. In this case, we have a single parity check equation (m = 1), which maps to disk 0 all the regions whose id has an even number of 1's. For simplicity of exposition, we start from the situation depicted in Figure 11 (a), where each disk stores a 2 × 2 local grid, and where each cell points to a different bucket. Thus, each bucket region consists of a single cell. The corresponding induced partitioning of the whole data space S is shown in Figure 11 (b). Note that, since m = 1, the next split on a local grid has to be on the second (Y) coordinate, since the last bit of the cell identifiers refers to the first (X) coordinate. Now, assume that the bucket of region R = ⟨00, 1⟩ (id3(R) = 010) splits. The two originating buckets will correspond to the sub-regions R0 = ⟨00, 10⟩ (id4(R0) = 0100) and R1 = ⟨00, 11⟩ (id4(R1) = 0101). The 0-child region replaces R in disk 1, whereas the 1-child is mapped to disk 0. The following actions are performed on the two disks:

Disk of R0: since the number of cells in the grid on disk 1 has not changed, we do not modify the grid directory at all. Rather, the new length of the region identifier is recorded in the bucket of R0 (this would be 4 in our example). Below we motivate the need for this choice.

Disk of R1: the grid directory is modified so as to accommodate the new cell ⟨0, 11⟩, corresponding to region R1. As a consequence, cell ⟨1, 11⟩ also has to be created, in order to preserve the grid structure. This would correspond to region ⟨11, 11⟩, which does not exist yet.
Therefore, cell ⟨1, 11⟩ has to share with cell ⟨1, 1⟩ in disk 1 the bucket of region ⟨11, 1⟩ (also stored on disk 1). To this end, cell ⟨1, 11⟩ stores a cross-disk pointer. The resulting situation is shown in Figure 12 (a). Figure 12 (b) shows the new global organization of the data space. For clarity, dashed boxes indicate actually stored data buckets. □

The following example shows how exact-match queries and insertions are supported by PI-GF.
Figure 11. A 2-disk system storing a 2-dimensional file. We show in (a) the local grids, and in (b) the partitioning induced on the whole data space S, with the disk each region is mapped to.
Example 4.4 Let p = ⟨p1, p2⟩ = ⟨0.8, 0.95⟩ be a point to be searched (or inserted) in the file shown in Figure 12. First, we convert p into a binary string by representing both p1 and p2 in binary form, considering only the fractional part, and then interleaving the bits. This yields the identifier of point p. Since (0.8)_10 = (0.110...)_2 and (0.95)_10 = (0.111...)_2, we obtain the string 111101... As a second step, we look at the local scales of disk 1, since diskOfMC2(111101) = 1. Here we check for the existence of the cell identifier 11101. Since this fails, we drop the last bit and are left with 11110. Since diskOfMC2(11110) = 0, we search for the cell identifier 1110 on disk 0. This fails again, and we are left with 1111, for which the search for the cell identifier 111 on disk 0 succeeds. Only at this point do we access the grid directory of disk 0 and, by following the cross-disk pointer stored in cell ⟨1, 11⟩, retrieve (or insert) point p. □

We can see that the above structure indeed allows for the dynamic definition of regions, and for an automatic balance between disks. Data location is based on a set of local grid directories, which are modified as a consequence of bucket splits and merges. In order to provide a sound addressing mechanism, a cell may have to store a cross-disk pointer. Regardless of the way grid directories are implemented, it can be seen that the above scheme never involves more than two disks for the retrieval of a point.

Finally, let us motivate the need for storing within each bucket the length of its region identifier. Although the grid directory of the 0-child of a split region is not modified, in order to avoid wasting space, there must anyway be some means of determining actual region identifiers. This is needed in order to properly choose the next splitting coordinate on the local grids. With reference to Example 4.3 and Figure 12, consider that the data bucket whose region is ⟨00, 10⟩ has to be split.
The two child regions will be ⟨000, 10⟩ and ⟨001, 10⟩, respectively. Since region ⟨00, 10⟩ is represented by cell ⟨0, 1⟩ in the directory of disk 1, we can correctly retrieve it and the associated bucket (according to the algorithm outlined in Example 4.4), but, without other information, we
Figure 12. The new scenario after the split of region ⟨00, 1⟩.
have no means to discover that ⟨0, 1⟩ stands for ⟨00, 10⟩ rather than for ⟨00, 1⟩.8 This is why we have chosen to keep track, within each data bucket, of its region size, through the length, n, of its region identifier. In the above example, the bucket of region ⟨00, 10⟩ would store the value n = 4. This is enough to correctly determine the new child region identifiers. Note that keeping this information up to date incurs no additional I/O cost, since n is modified only when a data bucket is split.
5 PI-GF: A formal description
To properly formalize a Parallel Independent Grid File, we make explicit the precise relationship existing between grid cells and their data regions, as well as between points and cells. Then, we state the properties that a bucket split algorithm has to satisfy, and, finally, define the set of cells that satisfy a query.9

5.1 Regions, scales, and cells
In PI-GF the whole data space S is partitioned into a set, R, of non-overlapping binary radix regions. Each region R = ⟨I1, ..., Id⟩ ∈ R is assigned a unique identifier of length n, id_n(R), 2^-n being the size of R. Each region corresponds to a single data bucket. We write R = reg(B) to denote that R is the region of bucket B. Bucket B stores point data (B.data) and a value, B.n, 2^-B.n being the current size of reg(B). B.data is a set of points p, where p = ⟨p1, ..., pd⟩ ∈ B.data iff, for each i (1 ≤ i ≤ d), I_i is a prefix of p_i, where both I_i and p_i are binary strings, identifiers being obtained by cyclic bit interleaving. We concisely write R = reg(p) iff R = reg(B) and p ∈ B.data. The same
8. Indeed, we could check for the existence of the data bucket with region ⟨00, 11⟩, but this would require additional I/O operations.
9. We need to point out that this kind of formal activity is particularly important for complex spatial structures, where intuition can sometimes be misleading. For a relevant recent example see [Sal94].
relationship can be expressed more concisely by considering region and point identifiers. To this end, let fill_n() be a function that, given a binary string, either returns its n-bit prefix, dropping the exceeding bits, or pads the string with trailing 0's, if its length is less than n. Then, if R has size 2^-n, R = reg(p) iff fill_n(id(p)) = id_n(R). Equivalently, id_n(R) ⪯_z id(p) has to hold.10

Regions in R are assigned to disks by using the diskOfMC2() function. The set of regions, R^u, assigned to disk unit u (0 ≤ u ≤ M − 1) is organized by means of a grid directory, GD^u. GD^u has associated a set L^u = ⟨L^u_1, ..., L^u_d⟩ of d local (linear) scales, where scale L^u_i = ⟨l^u_{i,0}, l^u_{i,1}, l^u_{i,2}, ..., l^u_{i,n^u_i − 1}⟩ (0 < l^u_{i,1} < l^u_{i,2} < ... < l^u_{i,n^u_i − 1} < 1) is a vector that partitions the i-th domain into the n^u_i intervals [0, l^u_{i,1}), [l^u_{i,1}, l^u_{i,2}), ..., [l^u_{i,n^u_i − 1}, 1), where, by definition, l^u_{i,0} = 0. The ‘cut’ value l^u_{i,j} defines the j-th cutting hyperplane on the i-th coordinate in disk unit u. Since only binary radix intervals are considered, and l^u_{i,j} ∈ [0, 1), each cut value is univocally represented by a binary string of finite length.

GD^u is a set of grid cells, each cell C represented as (C.id, C.ptr), where C.id is the cell identifier and C.ptr is a pointer to a data bucket B = *(C.ptr). To denote that R is the region of cell C we write R = reg(C), meaning that R = reg(B) and B = *(C.ptr). Cell C is represented by the tuple of its (local) coordinates, and its identifier is obtained, as with regions and points, by bit interleaving. The basic differences are:
1. If m check bits are used, bit interleaving starts from coordinate (m mod d) + 1, in order to preserve the same bit sequencing of region (and point) identifiers.
2. If the binary string of some coordinate is not long enough to allow for bit interleaving, as many trailing 0's as needed are adjoined.
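The fill_n() function and the bit-interleaved point identifiers are simple string manipulations; a minimal sketch follows (function names are ours, chosen for illustration):

```python
def fill(s, n):
    """fill_n(): the n-bit prefix of s, or s padded with trailing 0's
    if its length is less than n."""
    return s[:n] if len(s) >= n else s + "0" * (n - len(s))

def point_id(p, rounds):
    """Identifier of a point p in [0,1)^d, obtained by cyclic bit
    interleaving of the binary expansions of its coordinates
    ('rounds' bits per coordinate)."""
    coords = list(p)
    bits = []
    for _ in range(rounds):
        for i in range(len(coords)):
            coords[i] *= 2
            b = int(coords[i])  # next binary digit of coordinate i
            coords[i] -= b
            bits.append(str(b))
    return "".join(bits)

# The point of Example 4.4: p = (0.8, 0.95) has identifier 111101...
assert point_id((0.8, 0.95), 3) == "111101"
# fill_n() both pads and truncates:
assert fill("010", 4) == "0100" and fill("010111", 4) == "0101"
```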
Let l2g(C.id, u) be the ‘local to global’ function that adds the m check bits in front of the cell identifier C.id, given the disk unit u where C is mapped and the H_{n+} check matrix used for declustering. The region of cell C is univocally determined by the following equation:

id_{*(C.ptr).n}(reg(*(C.ptr))) = fill_{*(C.ptr).n}(l2g(C.id, u))    (5)

Then, R = reg(C) iff the identifier of cell C, once prefixed with the m check bits, can ‘match’ the identifier of region R. It has also to be observed that, because of the fill() function and of the use of the B.n value, adding or dropping trailing 0's to a cell identifier does not alter the cell-to-region correspondence at all. As to cell pointers, we have that C.ptr is a local pointer iff:

diskOfMC2(fill_{*(C.ptr).n}(l2g(C.id, u))) = u    (6)

otherwise C.ptr is a cross-disk pointer.

5.2 Point location
In order to formalize the search process described in Example 4.4 we introduce the following cell() function, which, given a binary identifier and the M grid directories, returns the corresponding cell.

Definition 5.1 (cell() function) Let ID be a binary identifier. Given the M grid directories, cell(ID) is implicitly defined as:

fill_n(ID) = fill_n(l2g(cell(ID).id, u = diskOfMC2(fill_n(ID)))) ∧ cell(ID) ∈ GD^u ∧ ¬∃ n' > n for which the same holds    (7)

Note that, when ID exactly matches l2g(cell(ID).id, u) (that is, the two strings agree on all the 1-valued bit positions) the n value is unbounded. In this case we assume that it equals the length of l2g(cell(ID).id, u). Retrieval and insertion of a point p are performed by accessing the data bucket referenced by cell C = cell(id(p)). We have the following basic result that relates Equations (5) and (7).
10. Our formalization considers that identifiers are binary strings. An alternative approach would have taken them as fractional numbers. In this case, for instance, we have that R = reg(p) iff id_n(R) is the largest region identifier less than or equal to id(p).
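The search of Definition 5.1 and Example 4.4 can be sketched for the M = 2 case (a single check bit, as in Figures 11-12). Here the directories are modeled simply as sets of local cell identifiers; this reconstruction of the Figure 12 contents is our assumption, and real PI-GF directories are of course grid structures, not flat sets.

```python
# Sketch of cell() / point location for M = 2 disks (m = 1 check bit).
# Directory contents reconstruct the Figure 12 situation (assumed).
GD = [
    {"00", "01", "100", "101", "110", "111"},  # disk 0 (even parity)
    {"00", "01", "10", "11"},                  # disk 1 (odd parity)
]

def disk_of(bits):
    return bits.count("1") % 2  # the single parity-check equation

def locate(point_id):
    """Drop one trailing bit at a time until a cell identifier matches
    (Equation (7)); return (disk unit, local cell identifier)."""
    for n in range(len(point_id), 0, -1):
        prefix = point_id[:n]
        u = disk_of(prefix)
        local = prefix[1:]  # drop the single check bit (m = 1)
        if local in GD[u]:
            return u, local
    raise KeyError("no matching cell")

# Example 4.4: point 111101 is located via cell 111 on disk 0
assert locate("111101") == (0, "111")
```

As in the text, at most one directory per candidate length is probed, and the loop always terminates because the empty-prefix case is excluded by construction of the directories.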
Theorem 5.1 Point p belongs to region R, that is, R = reg(p), iff R = reg(cell(id(p))).

Proof: In the proof we make use of the following notation: n* is the value of n as determined by Equation (7) when applied to id(p), u is the disk unit of cell(id(p)), that is, u = diskOfMC2(fill_{n*}(id(p))), and B is the bucket of region R. Furthermore, we take advantage of Corollary 5.1, to be introduced in Section 5.3, which proves that no two bucket regions can overlap.

(if) Since R = reg(cell(id(p))), it is id_{B.n}(R) = fill_{B.n}(l2g(cell(id(p)).id, u)). Considering the possible relationships between n* and B.n, we have:
- n* = B.n: It is id_{B.n}(R) = fill_{B.n}(id(p)), that is, R = reg(p).
- n* > B.n: Since, for any two strings s1 and s2, fill_{n*}(s1) = fill_{n*}(s2) implies fill_n(s1) = fill_n(s2), for any n ≤ n*, id_{B.n}(R) = fill_{B.n}(id(p)) holds, that is, R = reg(p).
- n* < B.n: Let R' be a region of size 2^{-n*} such that id_{n*}(R') = fill_{n*}(id(p)) = fill_{n*}(l2g(cell(id(p)).id, u)). Therefore, R' = reg(p). Since, by hypothesis, R = reg(cell(id(p))), fill_{n*}(id_{B.n}(R)) = fill_{n*}(l2g(cell(id(p)).id, u)) also holds. But this implies that R ⊂ R', thus violating the hypothesis that regions in R do not overlap (see Corollary 5.1). Therefore region R' ∉ R, and R = reg(p).

(only if) Since R = reg(p), it is id_{B.n}(R) = fill_{B.n}(id(p)). Again, we consider the relationship between n* and B.n:
- n* = B.n: It is id_{B.n}(R) = fill_{B.n}(l2g(cell(id(p)).id, u)), that is, R = reg(cell(id(p))).
- n* > B.n: id_{B.n}(R) = fill_{B.n}(l2g(cell(id(p)).id, u)) holds, that is, R = reg(cell(id(p))).
- n* < B.n: Let R' be a region of size 2^{-n*} such that id_{n*}(R') = fill_{n*}(l2g(cell(id(p)).id, u)) = fill_{n*}(id(p)). Therefore, R' = reg(cell(id(p))). Since, by hypothesis, R = reg(p), fill_{n*}(id_{B.n}(R)) = fill_{n*}(id(p)) also holds. Again, this implies that R ⊂ R', that is, a contradiction.11 □

5.3 Bucket split
Let C = cell(id(p)) be the cell retrieved upon insertion of point p, and assume that the bucket B = *(C.ptr) splits. Let R = reg(B) be the corresponding region of size 2^-n (n = B.n), R0 and R1 the two child regions of R, whose buckets are, respectively, B0 and B1. Finally, let v = diskOfMC2(id_{n+1}(R1)). The following actions are performed:

a) disk of R1:
1. (New bucket) Create a new data bucket, B1, where:
   B1.n = B.n + 1
   B1.data = {p | p ∈ B.data ∧ fill_{n+1}(id(p)) = id_{n+1}(R1)}
2. (New cells) If ¬∃ C' ∈ GD^v such that id_{n+1}(R1) = fill_{n+1}(l2g(C'.id, v)) (that is, R1 = reg(C')) then expand GD^v to include cell C', and set *(C'.ptr) = B1.
3. (Pointers of the new cells) For each new cell C'' ≠ C' of GD^v, let C''' = cell(C''.id), and set C''.ptr = C'''.ptr.

b) disk of R0: Create a new data bucket, B0, that replaces bucket B, with values:
   B0.n = B.n + 1
   B0.data = {p | p ∈ B.data ∧ fill_{n+1}(id(p)) = id_{n+1}(R0)}

c) other disks: Check for other cells C'' ≠ C' in disk w (0 ≤ w ≤ M − 1) such that:
   id_{n+1}(R1) = fill_{n+1}(l2g(C''.id, w))  (i.e., R1 = reg(C''))
For each such C'' (for which C''.ptr = C.ptr necessarily holds) set C''.ptr = C'.ptr.
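The identifier bookkeeping at the heart of a split can be sketched very compactly for the M = 2 case of Example 4.3; directory expansion and pointer updates are omitted, so this is only an illustration of how the child identifiers and their target disks are derived.

```python
# Minimal sketch of the identifier side of a bucket split
# (Section 5.3), for M = 2 disks (single parity-check equation).
def disk_of(bits):
    return bits.count("1") % 2

def split_region(region_id):
    """Return ((id of the 0-child, its disk), (id of the 1-child,
    its disk)); the 1-child's disk is the v of the split algorithm."""
    r0, r1 = region_id + "0", region_id + "1"
    return (r0, disk_of(r0)), (r1, disk_of(r1))

# Example 4.3: splitting R = <00,1> (id 010) leaves the 0-child on
# disk 1 (replacing R) and sends the 1-child to disk v = 0.
assert split_region("010") == (("0100", 1), ("0101", 0))
```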
11. It has to be observed that in single-disk grid files the case n* < B.n can never arise.
In step a.2 we expand the grid directory GD^v so as to accommodate the new cell C', and in step a.3 we compute pointer values for the new cells. To this end, we apply the cell() function to the identifiers of such new cells. Finally, in step c) we update pointer values for any cell C'' already in GD^w whose new region is R1 (and whose old region was R). The precise definition of how GD^v has to be expanded is not considered here, since different cases are possible. They are discussed in Section 6.2. The following is an immediate consequence of the above splitting algorithm:

Corollary 5.1 Let us start with a single bucket region, which encompasses the whole data space S. Then, at any time, no two regions can overlap.
Proof: It easily follows from the observation that each time a bucket splits, its region is partitioned into two disjoint subregions. □

5.4 Range queries

Given a range query q, the set of cells in GD^u that satisfy q is:

C^u(q) = {C | C ∈ GD^u ∧ reg(C) ∩ q ≠ ∅}    (8)
For a more constructive definition, we need to relate the C^u(q) sets to the query region in space S. Although the local scales would suffice to solve a query, it is simpler to refer to a set of global scales, G = ⟨G1, ..., Gd⟩, as they would be defined if using a single-disk grid file. The global scales define what we call virtual cells.

Definition 5.2 (Global scales and virtual cells) Let R be the current partitioning of space S. For each R ∈ R such that R = ⟨I1, I2, ..., Id⟩ there exist cut values g_{1,j1}, g_{2,j2}, ..., g_{d,jd}, defined on the G_i global scales, such that g_{i,ji} = I_i. The Cartesian product of the global scales G = ⟨G1, ..., Gd⟩ defines a set, VC, of virtual cells, that is, VC = G1 × G2 × ... × Gd.
The steps to be performed for determining the sets C^u(q) can be detailed as follows.
1. Given query q = ⟨[l1, h1), ..., [li, hi), ..., [ld, hd)⟩ (0 ≤ li ≤ hi ≤ 1, i = 1, ..., d) and the global scales G, determine the set, VC(q), of virtual cells that intersect q, that is:

   VC(q) = {V = ⟨I1, ..., Id⟩ ∈ VC | ∀i = 1, ..., d : [li, hi) ∩ Ii ≠ ∅}    (9)

2. For each V ∈ VC(q) determine the corresponding cell C = cell(id_x(V)), where id_x(V) is the x-bit identifier of V, obtained by bit interleaving and possibly using additional trailing 0's for some coordinates, as done for cell identifiers. Let C(q) be the set of such cells.
3. Partition the set C(q) into M disjoint subsets, with C^u(q) being the set:

   C^u(q) = {C | C ∈ C(q) ∩ GD^u}
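Step 1 above can be sketched as follows; the scales and query are those of Example 5.1 below, and the function names are illustrative.

```python
# Sketch of Step 1 of the range-query algorithm (Equation (9)):
# enumerate the virtual cells intersecting a query rectangle.
from itertools import product

def cuts(scale):
    """Turn a scale such as ['00', '010', ...] into triples
    (interval label, lower bound, upper bound) over [0, 1)."""
    vals = [int(s, 2) / 2 ** len(s) for s in scale]
    return list(zip(scale, vals, vals[1:] + [1.0]))

def virtual_cells(scales, query):
    """Cartesian product of the per-coordinate intervals that
    intersect the query (all intervals half-open)."""
    per_coord = []
    for scale, (lo, hi) in zip(scales, query):
        per_coord.append([s for s, a, b in cuts(scale) if a < hi and lo < b])
    return list(product(*per_coord))

# Global scales and query of Example 5.1:
G1 = ["00", "010", "011", "10", "11"]
G2 = ["00", "01", "10", "11"]
q = ((0.125, 0.45), (0.125, 0.625))
vc = virtual_cells((G1, G2), q)
assert len(vc) == 9  # the 9 virtual cells listed in Example 5.1
```

Each returned pair is a virtual cell ⟨I1, I2⟩; Steps 2-3 would then translate each to a cell via cell(id_x(V)) and partition by disk.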
The following theorem ensures that this retrieval process yields correct results.

Theorem 5.2 Let R(q) be the set of regions that satisfy query q, that is:

R(q) = {R ∈ R | ∃V ∈ VC(q) : V ⊆ R}    (10)

and R^u(q) the sets (u = 0, ..., M − 1):

R^u(q) = {R ∈ R | ∃C ∈ C^u(q) : R = reg(C)}    (11)

where the C^u(q) sets are obtained as described in the above algorithm. Then, R(q) = ∪_{u=0}^{M−1} R^u(q).
Proof: We separately prove the soundness (i.e., if, for some u, R ∈ R^u(q), then R ∈ R(q)) and the completeness (i.e., if R ∈ R(q), then ∃u such that R ∈ R^u(q)) of the algorithm.

Soundness: If, for some u, R ∈ R^u(q), then ∃C ∈ C^u(q) such that R = reg(C) (from the definition of R^u(q)), and ∃V ∈ VC(q) such that C = cell(id_x(V)). Therefore, ∃R' ∈ R(q) such that V ⊆ R' (from Equation (10)), with R' = reg(B'), for some bucket B'. Note that, because of Definition 5.2, it is B'.n ≤ x, that is, the size of R' is not less than the size of V. Let B be the bucket of region R. Since R = reg(C), it is (see Equation (5)) id_{B.n}(R) = fill_{B.n}(l2g(C.id, u)). However, since B'.n ≤ x, id_{B'.n}(R') = fill_{B'.n}(l2g(C.id, u)) also holds. This implies that R' = reg(C). Therefore R = R', that is, R ∈ R(q).

Completeness: Let R ∈ R(q). Then ∃V ∈ VC(q) such that V ⊆ R. Let C = cell(id_x(V)), and R' = reg(C) = reg(B'). Let B be the bucket of region R. Since B.n ≤ x, id_{B.n}(R) = fill_{B.n}(l2g(C.id, u)) also holds, that is, R = reg(C). This implies that R = R', that is, ∃u such that R ∈ R^u(q). □

The following example summarizes most of the relevant material introduced in this section.
Example 5.1 We start with the situation depicted in Figure 12, and then assume that region R = ⟨01, 0⟩ (id3(R) = 001) has split, as well as its 0-child, R0 = ⟨01, 00⟩ (id4(R0) = 0010). Since both R1 and (R0)1 map to disk 0, only the grid directory on that disk has to be modified. Figure 13 (a) shows the allocated data buckets, denoted by capital letters together with the number of the disk where they are stored (note that R1 = reg(I) and (R0)1 = reg(K)). The new cells needed in GD^0 have ids 011 and 0101, respectively (recall that the first bit of the region ids has to be dropped). The first insertion leads to expanding the grid shown in Figure 12, whose scales are, respectively, L^0_1 = ⟨0, 1⟩ and L^0_2 = ⟨0, 10, 11⟩, into the new grid with scales L^0_1 = ⟨0, 1⟩ and L^0_2 = ⟨00, 01, 10, 11⟩. The second insertion leads to L^0_1 = ⟨0, 10, 11⟩ and L^0_2 = ⟨00, 01, 10, 11⟩. This is represented in Figure 13 (b) where, for the sake of clarity, within each cell the referenced data bucket is also shown. Given the above local scales and those of GD^1, the global scales turn out to be G1 = ⟨00, 010, 011, 10, 11⟩ and G2 = ⟨00, 01, 10, 11⟩, as can be observed in Figure 13 (a). This yields a total of 20 virtual cells.

As an example of the correspondence between cells and regions, consider the cell ⟨10, 10⟩ in GD^0. Since its id is 1100, it is l2g(1100, 0) = 01100. Since D = *(⟨10, 10⟩.ptr) is the bucket it points to, and D.n = 3, the corresponding region identifier is 011, that is, region ⟨01, 1⟩. The same is obtained by considering the cell ⟨11, 11⟩, as can be verified.

Let us now consider the processing of the range query q = ⟨[0.125, 0.45), [0.125, 0.625)⟩, shown as a dashed rectangle in Figure 13 (a). Given the global scales G, the virtual cells that intersect q are:

VC(q) = {⟨00, 00⟩, ⟨00, 01⟩, ⟨00, 10⟩, ⟨010, 00⟩, ⟨010, 01⟩, ⟨010, 10⟩, ⟨011, 00⟩, ⟨011, 01⟩, ⟨011, 10⟩}

For each V ∈ VC(q), we compute cell(id_x(V)) to determine the corresponding cell. Each cell is adorned with the index, u, of the grid directory where it is found. We obtain (for the sake of clarity, duplicate cells are not discarded):

C(q) = {⟨0, 00⟩^0, ⟨0, 00⟩^0, ⟨0, 1⟩^1, ⟨1, 0⟩^1, ⟨10, 01⟩^0, ⟨10, 10⟩^0, ⟨11, 00⟩^0, ⟨10, 01⟩^0, ⟨10, 10⟩^0}

By partitioning cells according to the disk index value and discarding duplicates, we obtain:

C^0(q) = {⟨0, 00⟩, ⟨10, 01⟩, ⟨10, 10⟩, ⟨11, 00⟩}
C^1(q) = {⟨0, 1⟩, ⟨1, 0⟩}

where the 4 cells in C^0(q) refer to buckets C, I, D, and K, respectively, and the 2 cells in C^1(q) to buckets B and J. As an example of the translation from virtual cells to cells, let us consider V = ⟨00, 01⟩, for which id4(V) = 0001. In order to determine cell(0001), we start with the whole identifier, drop the first (check) bit, and look for the cell with identifier 001 on GD^1. Since the local scales are L^1_1 = ⟨0, 1⟩ and L^1_2 = ⟨0, 1⟩, the search fails. Now, we drop the last bit, and search
Figure 13. Data buckets (a) and the grid directory in disk 0 (b). The range query q = ⟨[0.125, 0.45), [0.125, 0.625)⟩, shown as a dashed rectangle in (a), selects buckets B, C, D, I, J, and K.
on GD^0 for a cell whose identifier can match 00. This is cell ⟨0, 00⟩, which addresses bucket C, as expected. □
6 Further Issues
The PI-GF organization based on the MC2 declustering method can be improved, from a logical point of view, by improving the performance of the mapping function itself and by providing an efficient management of the local directories. To this end we provide some basic hints and, finally, discuss how the range query retrieval process can be accelerated by taking advantage of the very nature of region identifiers.

6.1 Gray codes
Like any other declustering function, ECC (and its MC2 generalization) tries to place on different disks regions that are close in space. To this end, error correcting codes are used, the basic idea being to store on the same disk regions that are at least dmin bits apart, dmin being the minimum distance of the code [PW72]. The implicit assumption in doing this is that close regions have a low bit distance, when identifiers are considered. However, this is not always true: it depends on the specific identifiers, that is, on the placement of regions in space.

Example 6.1 Let us consider the case where all regions have the same size, and refer to regions R = ⟨001, 011⟩ and R' = ⟨010, 100⟩. Their Manhattan (i.e., L1 or city-block) distance is 2 (expressed in units of region side) whereas their bit distance is 5. On the other hand, if we take the two regions R = ⟨010, 100⟩ and R' = ⟨011, 101⟩, both their Manhattan and bit distances are 2. □
In order to guarantee that closeness in space translates in closeness in identifier space, we can consider the use of Gray codes, that have the nice property of providing 1 bit distance for the binary representations of any two consecutive integers. Gray codes have been suggested in [Fal88] in order to increase the size of clusters of contiguous buckets that satisfy either partial match or range queries. Here we propose to use them for the encoding of intervals along the d coordinates, and, consequently, for the formation of spatial identifiers. Example 6.2 We qualitatively illustrate the advantage of using Gray codes by considering the case of M = 8 disks, and a 4 4 regular grid. For simplicity, each region is assumed to consist of a single cell. Figure 14 shows the mapping obtained by using the check matrix:
H4 = | 1 0 0 1 |
     | 0 1 0 1 |
     | 0 0 1 1 |
This defines a C(4, 1) code of minimum distance dmin = 4. Therefore, each disk stores 2 out of 16 regions, and the identifiers of the two regions mapped to a disk differ in all of their bits. Figure 14(a) shows the mapping obtained when using the binary encoding for intervals, whereas Figure 14(b) refers to the case where Gray codes are used. It can be verified that, in this example, Gray codes lead to an allocation scheme that always guarantees optimal bucket response time, whereas this is clearly not true for standard binary encoding.
Figure 14. Allocation of cells to disks when using binary encoding (a) and Gray codes (b) for the intervals.
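The mapping of Example 6.2 can be sketched in code. The following Python fragment is our own illustration (the helper names `disk`, `ident_binary`, and `ident_gray` are not from the report): it assigns each 4-bit cell identifier to the disk given by its syndrome under H4, so that the two identifiers sharing a disk differ exactly by the codeword 1111.

```python
# Illustrative sketch of Example 6.2 (helper names are ours, not the report's).
# Cells of a 4 x 4 grid get 4-bit identifiers; the disk of a cell is the
# syndrome of its identifier under the check matrix H4 of the C(4,1) code.

H4 = [0b1001, 0b0101, 0b0011]          # rows of H4 as bit masks

def parity(x):
    return bin(x).count("1") & 1

def disk(ident):
    # syndrome bits of the 4-bit identifier, read as a disk number in 0..7
    return sum(parity(row & ident) << i for i, row in enumerate(H4))

def gray(n):
    # binary reflected Gray code of a small integer
    return n ^ (n >> 1)

def ident_binary(x, y):
    # identifier = 2-bit binary code of x followed by 2-bit binary code of y
    return (x << 2) | y

def ident_gray(x, y):
    # same, but both intervals are Gray-encoded first
    return (gray(x) << 2) | gray(y)
```

Under this sketch each of the 8 disks stores exactly 2 of the 16 cells. With Gray-encoded intervals every 2 × 2 range query touches four distinct disks, while with plain binary encoding the diagonal cells (1, 1) and (2, 2) collide, since their identifiers 0101 and 1010 differ exactly by the codeword 1111.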
Let b1 b2 b3 … bn be the n-bit binary representation of some object of interest, such as an interval, a region identifier, etc., and let g1 g2 g3 … gn be the corresponding codeword in the n-bit Gray code.12 The conversion formulas are:

    g_k = b_{k-1} + b_k (mod 2),    1 ≤ k ≤ n   (with b_0 = 0)

and

    b_j = ∑_{k=1}^{j} g_k (mod 2),    1 ≤ j ≤ n
In practical terms, consider an interval I, and let (I)_G be its Gray code representation. Let I_low and I_high be the lower and higher sub-intervals of I, respectively. Then (I_low)_G = (I)_G 0 and (I_high)_G = (I)_G 1 (where juxtaposition denotes the appending of a bit) if the parity of (I)_G is even, whereas (I_low)_G = (I)_G 1 and (I_high)_G = (I)_G 0 if the parity of (I)_G is odd. For instance, if (I)_G = 1110, then the codes for the two sub-intervals will be (I_low)_G = 11101 and (I_high)_G = 11100. The adoption of Gray codes allows a more direct relationship to be established between distance in space S and bit distance. In turn, by using properties of the code(s) used by the mapping function (related to bit distance), this can lead to performance predictions for range queries (related to distance in space S). Future work will investigate these theoretical aspects.

12. More precisely, we consider binary reflected Gray codes.
6.2 Grid compression
There are two basic compression strategies to reduce the storage overhead of the grid directories in PI-GF. The first strategy, called '0's don't care', avoids the splitting of intervals unless absolutely needed; the second strategy, called 'ghost intervals', relaxes the constraint that the intervals of the local scales have to completely partition the [0, 1) domain. As a beneficial side effect, both strategies reduce the CPU costs to be paid for computing values for the pointers of the new cells.

Both strategies address problems peculiar to PI-GF, which arise because grid modifications are triggered by events that take place on a different disk (i.e., a bucket split). As we have observed in Section 5, when a region splits, its 1-child moves to another disk. For this region a new grid cell is needed, if not already present. In the usual case, only one interval has to be split to refine the grid partitioning, and this happens in a cyclic way, as is usual with single-disk grid files. However, the following cases may also arise:
1. More than one cutting hyperplane may be needed.
2. The interval(s) to be split may need to be partitioned into more than two subintervals.

Example 6.3 Assume that cell C = ⟨001, 10⟩ has to be inserted in GD^u, whose scales are L1^u = ⟨0, 1⟩ and L2^u = ⟨0, 1⟩. The new scales should be L1^u = ⟨000, 001, 01, 1⟩ and L2^u = ⟨0, 10, 11⟩. This amounts to a total of 12 cells. This case includes both above-listed possibilities:
1. Intervals on both coordinates have been split.
2. Interval 0 on the first coordinate has been replaced by 000, 001, and 01.

6.2.1 The '0's don't care' strategy

This first compression strategy takes advantage of the way cell identifiers are formed and of the observation that either adding or removing trailing 0's to a cell identifier does not alter the cell-to-region and the point-to-cell correspondences.
Basically, whenever we need an interval on coordinate i of the form b1 b2 … bm 0 … 0, and interval b1 b2 … bm already exists, we do not modify the corresponding scale.

Example 6.4 Refer to Example 6.3, where cell C = ⟨001, 10⟩ has to be inserted in GD^u. By applying the '0's don't care' strategy we avoid refining the second scale, for which interval 10 is needed, since interval 1 is already present. Therefore, after modification, the new scales are, respectively, L1^u = ⟨000, 001, 01, 1⟩ and L2^u = ⟨0, 1⟩, leading to a total of 8 cells.

6.2.2 The 'ghost intervals' strategy

The 'ghost intervals' strategy relaxes the constraint that the intervals of a scale have to completely partition the [0, 1) domain. Thus, only strictly needed intervals are created.

Example 6.5 Consider again Example 6.4. Although the '0's don't care' strategy saves 4 cells, interval 0 in the first scale has still been replaced by three subintervals. Of these, we need 001, because this is the value of the first coordinate of cell C = ⟨001, 10⟩, and 000, for the already present cells whose first coordinate value was 0. It turns out that interval 01 is not needed at all. Thus, the two combined compression strategies lead to the new scales L1^u = ⟨000, 001, 1⟩ and L2^u = ⟨0, 1⟩, that is, only 6 cells. The resulting situation is shown in Figure 15.
Figure 15. The two combined compression strategies.
It has to be pointed out that the combined strategy (i.e., '0's don't care' together with 'ghost intervals') guarantees that, for each grid GD^u, for each coordinate and each interval, there is at least one cell C such that the bucket C.ptr is stored on disk u. The same does not necessarily hold if at least one of the two strategies is not adopted. We leave as an exercise the proof that the above compression strategies do not affect the soundness of the retrieval mechanism.
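A minimal sketch of how a local scale could be refined under the two combined strategies (scales as sorted lists of bit-string intervals; the function `refine_scale` and its exact policy are our own reading of Examples 6.3–6.5, not code from the report):

```python
def refine_scale(scale, v):
    # Insert the coordinate value v (a bit string) of a new cell into a scale.
    stripped = v.rstrip("0") or "0"
    # '0's don't care': a value differing only by trailing 0's from an
    # existing interval needs no refinement at all
    for w in scale:
        if (w.rstrip("0") or "0") == stripped:
            return scale
    # 'ghost intervals': split only the interval that is a prefix of v,
    # keeping its 0-padded image for the cells already present; the other
    # subintervals of the split are not materialized
    out = []
    for w in scale:
        if v.startswith(w):
            out.extend(sorted({v, w.ljust(len(v), "0")}))
        else:
            out.append(w)
    return out
```

On Example 6.5 this turns the scales ⟨0, 1⟩ and ⟨0, 1⟩ into ⟨000, 001, 1⟩ and ⟨0, 1⟩ when cell ⟨001, 10⟩ is inserted, i.e., only 6 cells.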
6.3 Range query processing

The steps needed for the evaluation of a range query have been detailed in Section 5.4. Here we briefly consider how qualifying cells can be determined starting from the set, VC(q), of selected virtual cells. Since many virtual cells can correspond to a single cell and, in turn, many cells can correspond to the same bucket region, it is useful to avoid wasting time obtaining duplicate information. To this end, we take advantage of the fact that bucket regions are disjoint and completely cover the data space. Before presenting the basic lemma on which our optimization is based, we illustrate the underlying idea through a simple example.
Example 6.6 Assume that V1 ∈ VC(q) has identifier 1011…, and that a match is found with cell C1 ∈ GD^u, for which l2g(C1.id, u) = 101. Let V2 be another qualifying virtual cell, with identifier 10101…, and consider the following two cases, where C2 = cell(id_y(V2)):
1. C2 = C1. In this case both virtual cells V1 and V2 identify the same cell.
2. C2 ≠ C1 (C2 ∈ GD^v). Without loss of generality, we can assume that l2g(C2.id, v) = 10101. Since no match was found for the identifier 1011, region 101 has not been partitioned yet. Therefore, region 10101 does not exist, and both cells C1 and C2 refer to the same bucket region.
The second case in the above example simply exploits the observation that, when one discovers that there is no cell for the 1-child of a region, then any (distinct) cell that appears to refer to the 0-child of the same region (or to further subdivisions of it) is not worth considering. The following lemma formalizes these considerations.

Lemma 6.1 Let V1, V2 ∈ VC(q) be two qualifying virtual cells for query q, and let id_x(V1) and id_y(V2) be their identifiers, assumed to be equal up to the n-th bit, whereas id_x(V1) (resp. id_y(V2)) has value 1 (resp. 0) in the (n + 1)-th leftmost bit. Now, let C1 = cell(id_x(V1)), with C1 ∈ GD^u, let n1 be the value determined by Equation (7), with n1 ≤ n < x, and let R1 = reg(C1). Then, R1 = reg(cell(id_y(V2))) necessarily holds.
Proof: Let n2 be the value determined by Equation (7) when applied to id_y(V2), C2 = cell(id_y(V2)), and R2 = reg(C2). We separately consider the two cases n2 ≤ n and n2 > n.
n2 ≤ n. Since the identifiers of the two virtual cells are the same up to the n-th bit, it is C1 = C2 and, consequently, R1 = R2.
n2 > n. Since trailing 0's do not matter, in order to have C1 ≠ C2 (otherwise we are done), fill_{n2}(id_y(V2)) has to have at least one 1-valued bit in a position greater than n. The size of region R1 is at least 2^{-n}, because the (n + 1)-th leftmost bit of id_x(V1) has value 1 and n1 ≤ n. Therefore, the identifier of R1 has at most n bits, and they coincide with the first n bits of the identifier of R2. This implies that R2 ⊆ R1. However, because regions are pairwise disjoint, R1 = R2 necessarily holds.

In order to take full advantage of the above lemma, virtual cells in VC(q) should be processed in decreasing lexicographic z-value order. This guarantees that any cell in C(q) is never generated more than once and, at the same time, that some cells sharing the same bucket region with an already generated cell are not inserted in C(q). We note that, given a range query and a set of global scales, it is a rather simple programming exercise to generate qualifying virtual cells in decreasing z-value order. Thus, they can be generated and tested on the fly, without the need to allocate memory for the VC(q) set.
Example 6.7 Consider the following identifiers, obtained from virtual cells in VC(q): 01011, 01010, 01001, 01000, 001. We start by processing 01011, since it is the largest z-value in the set. Assume that the matching cell for 01011 satisfies l2g(cell(01011).id, u) = 01. From Lemma 6.1 we have that identifiers 01010, 01001, and 01000 can be discarded without further consideration, since all of them lead to region reg(cell(01011)). As to the identifier 001, it might be the case that reg(cell(001)) = reg(cell(01011)). However, this cannot be discovered at this level, but only by accessing the local grid directories.
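The pruning of Example 6.7 can be sketched as follows (our own illustration; `lookup` is a hypothetical stand-in for the local-directory match l2g):

```python
def qualifying_regions(ids_desc, lookup):
    # ids_desc: virtual-cell identifiers in decreasing lexicographic
    # z-value order; lookup(vid) returns the identifier of the matched
    # bucket region (a prefix of vid)
    regions = []
    for vid in ids_desc:
        # Lemma 6.1: a later identifier extending an already-found region
        # prefix denotes the same bucket region and can be discarded
        if any(vid.startswith(r) for r in regions):
            continue
        regions.append(lookup(vid))
    return regions
```

Duplicates across different local directories (such as 001 vs. 01011 in the example) are still possible and are resolved only when the local grids are accessed.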
7 Conclusions
In this paper we have considered the problem of distributing a set of d-dimensional points over M independent disks so as to efficiently support both small and large range queries. Existing solutions may fail to work well under certain circumstances; this is particularly true for small range queries in highly loaded systems. We have considered both 'multiplexed' (MX) and 'parallel independent' (PI) organizations, and have qualitatively analyzed their merits and drawbacks. Then, focusing on PI organizations, we have discussed the limits of some declustering methods based on Cartesian product files. We have introduced a new dynamic declustering method (MC2) based on multiple error correcting codes, and considered three organizations based on it and on the grid file concept. 'Parallel independent grid files' with local directories (PI-GF) satisfies quite well all the requirements a parallel data structure should fulfill. It represents, as far as we know, a unique case of dynamic PI organization that does not lead to the 'visibility problem' and does not undergo periodic reorganization phases. Current work is devoted to the implementation of a prototype system, in which both the PI-GF and MX-GF organizations will be experimentally evaluated. We also plan to develop an in-depth analysis of the effects of using Gray codes for the encoding of intervals. Finally, we will consider the application of the MC2 method to the declustering of non-point spatial data, the basic idea being to subdivide an object's region into a set of binary radix regions of variable size, as already considered in single-disk organizations based on the redundancy concept [Ore89].
References

[DG92] D. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85–98, June 1992.
[DS82] H.C. Du and J.S. Sobolewski. Disk allocation for Cartesian product files on multiple disk systems. ACM Transactions on Database Systems, 7(1):82–101, March 1982.
[Fal88] C. Faloutsos. Gray codes for partial match and range queries. IEEE Transactions on Software Engineering, 14(10):1381–1393, October 1988.
[FB93] C. Faloutsos and P. Bhagwat. Declustering using fractals. In Proceedings of the 3rd Parallel and Distributed Information Systems (PDIS) International Conference, pages 18–25, January 1993.
[FM91] C. Faloutsos and D. Metaxas. Disk allocation methods using error correcting codes. IEEE Transactions on Computers, 40(8):907–914, August 1991.
[Gut84] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, Boston, MA, June 1984.
[HS94] B. Himatsingka and J. Srivastava. Performance evaluation of grid based multi-attribute record declustering methods. In Proceedings of the 10th International Conference on Data Engineering, pages 356–365, Houston, Texas, February 1994.
[KF92] I. Kamel and C. Faloutsos. Parallel R-tree. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, pages 195–204, San Diego, CA, June 1992.
[KP88] M.H. Kim and S. Pramanik. Optimal file distribution for partial match queries. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 173–182, Chicago, IL, June 1988.
[Kum94] A. Kumar. G-tree: A new data structure for organizing multidimensional data. IEEE Transactions on Knowledge and Data Engineering, 6(2):341–347, April 1994.
[LSR92] J. Li, J. Srivastava, and D. Rotem. CMD: A multidimensional declustering method for parallel database systems. In Proceedings of the 18th VLDB International Conference, pages 3–14, Vancouver, Canada, August 1992.
[NHS84] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, March 1984.
[Ore86] J. Orenstein. Spatial query processing in an object-oriented database system. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, pages 326–336, Washington, DC, May 1986.
[Ore89] J. Orenstein. Redundancy in spatial databases. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, pages 294–305, Portland, Oregon, June 1989.
[Pat93] D.A. Patterson. Massive parallelism and massive storage: Trends and predictions for 1995–2000. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (PDIS), pages 5–6. IEEE Computer Society Press, January 1993.
[PGK88] D.A. Patterson, G. Gibson, and R.H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, Chicago, IL, June 1988.
[PW72] W.W. Peterson and E.J. Weldon. Error Correcting Codes. MIT Press, 1972.
[Sal94] B. Salzberg. On indexing spatial and temporal data. Information Systems, 19(6):447–465, 1994.
[SL91] B. Seeger and P.-A. Larson. Multi-disk B-trees. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data, pages 436–445, Denver, Colorado, May 1991.
[WK85] K.-Y. Whang and R. Krishnamurthy. Multilevel grid files. Research Report RC 11516, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, November 1985.
[WZS91] G. Weikum, P. Zabback, and P. Scheuermann. Dynamic file allocation in disk arrays. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data, pages 406–415, Denver, Colorado, May 1991.
[ZSC94] Y. Zhou, S. Shekhar, and M. Coyle. Disk allocation methods for parallelizing grid files. In Proceedings of the 10th International Conference on Data Engineering, pages 243–252, Houston, Texas, February 1994.