Dynamic Declustering Methods for Parallel Grid Files


Paolo Ciaccia

Arianna Veronesi

Technical Report UBLCS-95-17 November 1995

Department of Computer Science University of Bologna Piazza di Porta S. Donato, 5 40127 Bologna (Italy)

The University of Bologna Department of Computer Science Research Technical Reports are available in gzipped PostScript format via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS. All local authors can be reached via e-mail at the address [email protected]. Written requests and comments should be addressed to [email protected].


Dynamic Declustering Methods for Parallel Grid Files1 Paolo Ciaccia2

Arianna Veronesi3

Technical Report UBLCS-95-17 November 1995

Abstract

Several declustering functions for distributing multi-attribute data on a set of disks have been proposed in recent years. Since these functions map grid regions to disks in a static way, performance deteriorates in the case of dynamic datasets and/or non-stationary data distributions. In this paper we first analyze how declustering functions can be extended in order to deal with dynamic datasets without requiring periodic reorganizations. In order to support dynamic declustering, we propose to organize the directory as a parallel Multilevel Grid File. On this structure we experiment with six different dynamic declustering functions, as well as with index-based allocation methods that only use locally available information. This first comparison between the two approaches reveals that methods based on local criteria always yield better results.

1. This work has been supported by the ESPRIT LTR project no. 9141, HERMES (Foundations of High Performance Multimedia Information Management Systems), and by Italian CNR, Grant no. 95.00443.CT12. 2. DEIS - CIOC-CNR, University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy. 3. CINECA, Casalecchio di Reno (Bologna), Italy.


1 Introduction

Efficient management of large multi-attribute datasets is a key requirement for many advanced database applications dealing with scientific and statistical data, images, geographic maps, etc. Since such applications usually manage large amounts of data resident on secondary memory devices, increasing the performance of the I/O subsystem is a fundamental challenge [DG92]. To this end, a common technique for reducing the response time of queries is to decluster data across a set of disks, so as to exploit parallelism at the I/O level [PGK88]. This approach is also motivated by the fact that the performance of individual disks is not likely to improve very much in the future, so that significant reductions in disk access time cannot be expected [Pat93].

Many declustering algorithms exist today, including well-known single-attribute methods (e.g., range partitioning, round robin, and hashing [DG92]). In this paper we concentrate on multi-attribute declustering, whose major aim is to efficiently support multi-dimensional range queries. Previous approaches can be characterized as follows. Methods based on declustering functions – DM [DS82], CMD [LSR92], FX [KP88], ECC [FM91], HCAM [FB93], and others – operate by first partitioning the data space into an orthogonal grid, and then mapping each cell of the grid to a specific disk. These methods have been explicitly designed for almost static environments, since they establish a fixed mapping between regions of space and disks. Therefore they are not suitable for efficiently managing cases where either the size of the file or the data distribution changes over time. The MaxCut approach described in [LS95] adopts graph partitioning techniques that are sensitive to the type of queries. It is shown to outperform mapping-function methods, the price to be paid being the computational overhead of the graph partitioning algorithm.
Even in this case, however, the focus is on static datasets, and the effects of changes in data (and query) distribution are not investigated. Finally, index-based methods, such as those proposed for R-trees [KF92], perform incremental (i.e., dynamic) declustering by mapping a new index node or data page to a disk according to some heuristic criterion that takes into account only local information.

In this article we propose a new approach to the declustering problem, which consists in using dynamic declustering functions (DDFs), able to adapt to data size and distribution. This implies that no a priori partitioning of the space and no reorganization are needed. Our approach can therefore efficiently parallelize Grid Files. We propose dynamic extensions of five methods, namely DM, FX, Z, ECC, and HCAM. After analyzing issues related to directory management, we argue that a parallel Multilevel Grid File, called "multiplexed" Grid File (MX-GF), is the best alternative for organizing index nodes on disks. Because of the paged organization of MX-GF, we are also able to consider index-based declustering methods, namely Round-Robin (RR) and Proximity Index (PI). We experimentally compare the different methods and conclude that those based on local information are superior to the others. In particular, the PI method is the best choice for either small queries (regardless of data distribution) or uniform distributions (regardless of query size), whereas RR outperforms the other methods in the case of medium-large queries on skewed distributions.

The rest of the paper is organized as follows. Section 2 provides background on declustering functions and emphasizes their limits. Section 3 describes our approach in general terms. Section 4 introduces dynamic versions of specific declustering functions. Section 5 describes MX-GF and the two index-based methods. Section 6 presents experimental results.

2 Background and Motivation

In this Section we briefly review the basic concepts of the approach based on declustering functions, and highlight its intrinsic limits. For simplicity of exposition, we refer to a 2-dimensional space, but the generalization to higher dimensions is immediate. Consider a file of 2-dimensional points (records), and a set of M disk units. The space is partitioned into an S × S orthogonal grid, so that each region of the grid, called a cell, is univocally


identified by its row-column coordinates, I1, I2 ∈ [0, S − 1]. Each declustering method, met, maps cell identifiers (coordinates) into disk identifiers by using the diskOf_met() function, that is:

diskOf_met : [0, S − 1] × [0, S − 1] → [0, M − 1]

Example 2.1 The disk modulo (DM) method [DS82] is defined as⁴

diskOf_DM(I1, I2) = (I1 + I2) mod M

whereas the error correcting codes (ECC) method [FM91], which requires M = 2^m (m ≥ 1) and S = 2^s, declusters grid cells by means of the parity check matrix of a binary code. Cell coordinates are represented using s bits each, which are then interleaved to form the (binary) cell identifier, id(I1, I2). This is multiplied by the m × 2s parity check matrix, H_2s, and the result, which is an m-bit vector, is the disk-id:

diskOf_ECC(I1, I2) = (H_2s · id(I1, I2)')_2

where the prime denotes vector transposition.⁵ Figure 1 shows the mappings on M = 4 disks of a 4 × 4 grid obtained, respectively, by the DM and ECC methods, the latter using the 2 × 4 parity check matrix H4.



Figure 1. The DM (a) and ECC (b) declustering methods.
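The two static mappings of Example 2.1 can be sketched as follows. This is a minimal illustration, not the report's implementation; in particular, the parity-check matrix H4 below is a hypothetical stand-in (m = 2, 2s = 4), since the report's actual matrix is not recoverable from this copy.

```python
# Sketch of the static DM and ECC declustering functions (Example 2.1).
# H4 is a hypothetical 2x4 parity-check matrix, used only for illustration.

def disk_of_dm(i1, i2, m):
    """Disk Modulo: disk = (I1 + I2) mod M."""
    return (i1 + i2) % m

def cell_id(i1, i2, s_bits):
    """Interleave the two s-bit coordinates into a 2s-bit cell identifier."""
    bits = []
    for k in range(s_bits - 1, -1, -1):
        bits.append((i1 >> k) & 1)
        bits.append((i2 >> k) & 1)
    return bits

def disk_of_ecc(i1, i2, h, s_bits):
    """ECC: multiply the cell id by the m x 2s parity-check matrix over GF(2)."""
    idv = cell_id(i1, i2, s_bits)
    out = [sum(r & b for r, b in zip(row, idv)) % 2 for row in h]
    return int("".join(map(str, out)), 2)

H4 = [[1, 0, 1, 1],   # hypothetical parity-check matrix (m = 2, 2s = 4)
      [0, 1, 1, 0]]

# DM on a 4x4 grid with M = 4 disks: cells on a diagonal share a disk.
print([[disk_of_dm(i1, i2, 4) for i1 in range(4)] for i2 in range(4)])
print(disk_of_ecc(1, 0, H4, 2))
```

Note how DM assigns the same disk to every cell of an anti-diagonal, which is why it behaves poorly for diagonal queries.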

Besides DM and ECC, several other function-based declustering methods have been proposed, such as FX (field-wise exclusive or) [KP88] and HCAM (Hilbert curve) [FB93]. The performance of these methods is usually evaluated with respect to the class of (orthogonal) range queries, by measuring how well they map cells to disks in such a way that approximately the same number of cells satisfies the query on each disk. Let q be a range query, and Ci the number of cells that satisfy q on the i-th disk, i.e., cells whose region intersects q. The response time of query q is then defined as max{C0, C1, ..., C_{M−1}}, to be compared with the optimal response time ⌈(Σ_i Ci)/M⌉. Experimental analysis has shown that no method is superior to the others for all possible query and database scenarios (i.e., file size, number of disks, number of dimensions, size and shape of queries, etc.) with respect to the above performance metric [HS94]. On the theoretical side, it is known that no declustering method can guarantee optimal response time for all orthogonal range queries when the number of disks is M > 5 [ZSC94]. The basic assumption which justifies a metric based on "counting the cells" is the existence of a 1-to-1 correspondence between cells and data pages (buckets), that is, between grid regions

4. The more recent Coordinate Modulo Distribution (CMD) [LSR92] is a variant of DM which requires the grid size to be a multiple of M.
5. In [KF92] a cell identifier is obtained by concatenating, rather than interleaving, the binary representations of its coordinates. This distinction is immaterial for the understanding of the ECC method.


and data regions, so that the above-defined response time is indeed a measure of the (parallel) I/O complexity of a query. In practice, in order to avoid poor storage utilization, two or more cells can share the same data page, in which case the usual approach is to store the bucket on a randomly chosen disk (among those the sharing cells are mapped to) [ZSC94, LS95].

2.1 Limits of Static Declustering Functions

Parallel organizations based on known declustering functions are appropriate only for almost static datasets, for which neither file size nor data distribution change through time. When the data distribution is non-stationary, even if the grid is initially designed to minimize the data skew among cells [LRS93], the partitioning soon becomes obsolete. For stationary distributions, problems can arise if the size of the dataset changes, because the granularity of declustering cannot be explicitly controlled. When the file grows, data pages split. Since the regions of the new pages are included in the region of the split page, (permanently) mapped to the i-th disk, the new pages are stored on the i-th disk as well. It follows that the number of data pages which can correspond to a single cell of the declustered grid is unbounded. If the file shrinks, merging of data pages easily leads to situations where no effective declustering control can be exerted.

Example 2.2 Consider the ECC mapping of Figure 1 (b), and assume that each cell corresponds to a data page of capacity c = 2 points. By definition, ECC (as well as any other method) is trivially optimal for each query that retrieves a single cell. However, if the file grows so that each of the 16 cells corresponds to, say, 4 data pages (= 8 points), this is no longer true. On the other hand, assume that the file shrinks and only 4 data pages are needed, whose regions are shown as dashed rectangles in Figure 2. Note that each data region covers 4 cells which are assigned to 4 different disks.
It turns out that, if the assignment of data pages to disks occurs randomly, the method reduces to a pure random declustering strategy, which is known to perform poorly.

Figure 2. ECC with only 4 data regions.
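The "cell counting" metric of this Section can be sketched as follows. The function and the example query below are illustrative, not taken from the report's experiments.

```python
# Sketch of the response-time metric: the response time of a range query is
# the maximum number of qualifying cells on any single disk, compared with
# the optimal value ceil(total / M).
from math import ceil

def response_time(qualifying_cells, disk_of, m):
    """qualifying_cells: iterable of (i1, i2) cells whose region intersects q."""
    per_disk = [0] * m
    for (i1, i2) in qualifying_cells:
        per_disk[disk_of(i1, i2)] += 1
    actual = max(per_disk)
    optimal = ceil(sum(per_disk) / m)
    return actual, optimal

# Example: a 2x2 range query on a DM-declustered grid with M = 4 disks.
dm = lambda i1, i2: (i1 + i2) % 4
cells = [(i1, i2) for i1 in (1, 2) for i2 in (1, 2)]  # the 4 qualifying cells
print(response_time(cells, dm, 4))
```

Here DM places two of the four qualifying cells on the same disk, so the query takes 2 time units against an optimum of 1.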

3 The Approach

To remedy the problems caused by static declustering functions, we propose to extend them in such a way that they can adapt to both changes in data distribution and file size. The key requirement is that no periodic reorganization is needed, since reorganization is costly and reduces data availability. As the first step, we define how a dynamic declustering function should operate on an orthogonal grid.

Definition 3.1 A dynamic declustering function (DDF) based on method met, denoted diskOf_met+(), behaves like diskOf_met() for each S_l × S_l grid such that S_l = 2^l · S_0 (l ≥ 0, S_0 ≥ 1), that is:

diskOf_met+(I1, I2) = diskOf_met(I1, I2)

for all l ≥ 0 and all I1, I2 ∈ [0, S_l − 1], where l is called the grid level and S_0 is the initial grid size.

Thus, what is required of a DDF is to still correctly map, according to the met criterion, all the grid regions which result from a doubling of the number of intervals on each dimension. In the following, without loss of generality, we will assume that S0 = 1.

Example 3.1 Consider the DM mapping of Figure 1 (a). According to the definition, if the 4 × 4 grid expands to an 8 × 8 grid, diskOf_DM+() has to map the cells as shown in Figure 3, since this is the mapping of diskOf_DM() for such a grid.

Figure 3. An 8 × 8 grid declustered by the DM method.

It is not difficult to see that the following holds for any DDF.

Property 3.1 Any DDF has the property that, when an interval I is split, the two new subintervals have values 2I and 2I + 1.

For instance, interval 2 in a 4 × 4 grid corresponds to intervals 4 and 5 in an 8 × 8 grid. Note that any other numbering scheme for the new (sub)intervals does not preserve the correspondence between interval values and their (relative) position in space, thus losing the basic property on which declustering functions are based. For instance, the extension proposed in [ZSC94] does not define a DDF, since new interval values depend on the order of splits, and are therefore not related to position in space.

Let us now consider the general case, where data regions of variable size are present. A DDF should provide a consistent way of mapping such regions to disks, thus avoiding as much as possible any form of "random assignment". Consider a region R, obtained by recursive binary partitioning of the space. Then there is a single grid level, l1, such that there exists an interval, I1 ∈ [0, S_l1 − 1], which exactly spans R's extension on the first coordinate. The same holds for the second coordinate, for which a level, l2, and an interval, I2 ∈ [0, S_l2 − 1], are univocally identified. We call l1 and l2 the "grid levels" of region R. Then the definition of a DDF can be generalized as follows.

Definition 3.2 Let R be a region with coordinate values (I1, I2), where I1 ∈ [0, S_l1 − 1], I2 ∈ [0, S_l2 − 1], and l1 and l2 are the grid levels of R. A dynamic declustering function (DDF) based on method met, denoted diskOf_met+(), behaves like diskOf_met() applied to the S_l1 × S_l2 grid, that is:

diskOf_met+(I1, I2) = diskOf_met(I1, I2)

Example 3.2 Consider the regions in Figure 4. Region R has coordinate values I1 = 2 (at level l1 = 2) and I2 = 0 (at level l2 = 1). Therefore, with M = 4 disks, the DM+ method maps R to disk (2 + 0) mod 4 = 2.
Other disk-id values are obtained in a similar way.
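Definition 3.2 and Property 3.1 can be sketched as follows for the DM+ method; the tuple representation of regions is an illustrative choice, not the report's encoding.

```python
# A minimal sketch of Definition 3.2 and Property 3.1 for DM+: a region
# carries one interval (with its grid level) per coordinate, and a split
# replaces interval I with subintervals 2I and 2I+1 at the next level.

def disk_of_dm_plus(i1, i2, m):
    # DM applied to the S_l1 x S_l2 grid the region lives on; for DM the
    # grid levels do not enter the formula, only the interval values do.
    return (i1 + i2) % m

def split_interval(i):
    """Property 3.1: interval I yields subintervals 2I and 2I+1."""
    return 2 * i, 2 * i + 1

# Example 3.2: region R with I1 = 2 (level 2) and I2 = 0 (level 1), M = 4.
print(disk_of_dm_plus(2, 0, 4))   # disk 2, as in the paper

# Splitting interval 2 (in a 4x4 grid) gives intervals 4 and 5 (8x8 grid).
print(split_interval(2))          # (4, 5)
```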

Figure 4. Allocation of variable-size regions with DM+.

It can easily be verified that, when a single grid level is in use for all coordinates and for all regions, we are back to the "square grid" case. Definition 3.2 is used to decluster data regions. As to the allocation of the directory (i.e., grid cells), we postpone the discussion to Section 5. Until then, it is assumed that declustered data pages are correctly addressed, regardless of the way the directory is organized on disks.⁶ In order to characterize different DDFs, we introduce two requirements that a "good" DDF should satisfy. Both are related to the smoothness of the declustering process.

Definition 3.3 Consider a region R, currently mapped to disk i, which has to be split into two subregions. A DDF has the Mem (memory reuse) property if it maps at least one of the two subregions back to disk i, and has the Dec (declustering) property if it maps the two subregions to different disks.

If a DDF has the Mem property, the disk space freed by R is immediately reused, which is a desirable feature. The Dec property, on the other hand, guarantees that each split results in the declustering of the two new subregions. Figure 5 shows the ideal case where both the Mem and Dec properties hold.
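The two properties of Definition 3.3 can be checked empirically for any candidate DDF. The sketch below is an illustrative harness, assuming (as in Property 3.1 and the cyclic-split convention of Section 4) that the coordinate with the lower grid level is the one split next.

```python
# Hedged sketch: empirically check the Mem and Dec properties of a diskOf
# function. Regions are (I1, l1, I2, l2) tuples; the coordinate with the
# lower level is split next (an assumption matching cyclic splitting).

def check_mem_dec(disk_of, regions, m):
    """Return (mem_ok, dec_ok) over one binary split of each given region."""
    mem_ok = dec_ok = True
    for (i1, l1, i2, l2) in regions:
        parent = disk_of(i1, l1, i2, l2, m)
        if l1 <= l2:   # split the first coordinate: I1 -> 2*I1, 2*I1+1
            kids = [(2 * i1, l1 + 1, i2, l2), (2 * i1 + 1, l1 + 1, i2, l2)]
        else:          # split the second coordinate
            kids = [(i1, l1, 2 * i2, l2 + 1), (i1, l1, 2 * i2 + 1, l2 + 1)]
        d0, d1 = (disk_of(*k, m) for k in kids)
        mem_ok &= parent in (d0, d1)
        dec_ok &= d0 != d1
    return mem_ok, dec_ok

# DM+ ignores the levels; check all regions of a 4x4 grid with M = 4.
dm_plus = lambda i1, l1, i2, l2, m: (i1 + i2) % m
regions = [(a, 2, b, 2) for a in range(4) for b in range(4)]
print(check_mem_dec(dm_plus, regions, 4))
```

On this grid the checker reports that DM+ has Dec but not Mem, matching the classification in Table 1.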

Figure 5. The case where both Mem and Dec properties hold.

6. By the way, we note that issues related to directory management have never been explicitly considered in performance measurements of declustering functions.


      DM+   RDM+   Z+    HCAM+   ECC+   FX+
Mem   no    yes    no    no      yes    no
Dec   yes   yes2   yes   yes1    yes    yes

Table 1. Classification of DDFs. 1: M ≥ 4. 2: M ≠ 2^m.

4 Dynamic Declustering Functions

In this Section we define and analyze the behavior of several DDFs. Table 1 summarizes the results with respect to the Mem and Dec properties. They qualitatively suggest that some of the methods are more suitable for dynamic environments than others.

4.1 DM+ and RDM+

The DM+ method is analogous to static DM, the only new feature being the way new intervals are numbered. Figure 4 is an example of how DM+ works. It can be observed that DM+ enjoys the Dec, but not the Mem property. To obviate this, we introduce a variant, called RDM+ (reverse disk modulo), which works with binary-valued coordinates. Before summing and taking the modulus, RDM+ reverses the binary representations of the coordinates, that is:

diskOf_RDM+(I1, I2) = (rev(I1)_2 + rev(I2)_2) mod M

This guarantees the Mem property since, when an interval I splits (e.g., I = 010), the two subintervals, I(0) and I(1), are obtained by appending "0" and "1", respectively, to I's binary representation (i.e., I(0) = 0100 and I(1) = 0101). Clearly, reversing the representation of I(0) does not change the value of diskOf() as applied to rev(I). Figure 6 shows how diskOf_RDM+() works. Note that region R = (10, 0), which originated from the split of region (1, 0), is still mapped to disk 1, whereas this was not the case with DM+ (see Figure 4). RDM+, however, cannot be efficiently used when M = 2^m. In fact, given I and its subinterval I(1), if the level of I(1) is greater than m, the Dec property is lost. This can be seen in Figure 6, where regions (010, 10) and (011, 10) are both assigned to disk 3.
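The bit-reversal idea can be sketched as follows; the (value, level) encoding of intervals is an illustrative choice, needed because the bit length of a coordinate matters here.

```python
# Sketch of RDM+: reverse each coordinate's binary string before the DM sum.
# Intervals are (value, level) pairs, since the bit length is significant.

def rev_bits(i, l):
    """Reverse the l-bit binary representation of interval I."""
    s = format(i, "0{}b".format(l)) if l > 0 else ""
    return int(s[::-1], 2) if s else 0

def disk_of_rdm_plus(i1, l1, i2, l2, m):
    return (rev_bits(i1, l1) + rev_bits(i2, l2)) % m

m = 4
# Mem: splitting appends a bit to I, which becomes a *leading* bit of rev(I),
# so the "0" child keeps the parent's disk. Region (1, 0) and child (10, 0):
parent = disk_of_rdm_plus(0b1, 1, 0b0, 1, m)
child0 = disk_of_rdm_plus(0b10, 2, 0b0, 1, m)
print(parent, child0)  # same disk: Mem holds

# Dec lost when M = 2^m: regions (010, 10) and (011, 10) share a disk.
print(disk_of_rdm_plus(0b010, 3, 0b10, 2, m),
      disk_of_rdm_plus(0b011, 3, 0b10, 2, m))
```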

Figure 6. The RDM+ method.

4.2 Fractal Methods: Z+ and HCAM+

Space-filling (fractal) curves visit each region of an S_l × S_l (S_l = 2^l) grid exactly once, thus establishing a linear ordering on regions. Declustering functions based on such curves rely on this ordering to map regions to disks in a round-robin fashion. The Z (or Peano) curve of order l, Z_l, is bit interleaving applied to the binary representations of coordinate values. This generates

so-called Z-values of length 2l [Ore86] which, interpreted as binary numbers, are then mapped to disks, i.e.:

diskOf_Z(I1, I2) = Z(I1, I2)_2 mod M

A similar idea inspires the Hilbert Curve Allocation Method (HCAM) [FB93], which uses the Hilbert curve of order l, H_l, to visit regions and assign H-values to them:

diskOf_HCAM(I1, I2) = H(I1, I2) mod M

Figure 7 shows Z and Hilbert curves of order 1 and 2.

Figure 7. Z and Hilbert curves.
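The two orderings can be sketched as follows. The Hilbert routine is the classic coordinate-to-index ("xy2d") algorithm; the curve orientation and which coordinate contributes the high interleaved bit are illustrative assumptions (the report notes the interleaving order is immaterial).

```python
# Sketch of the two space-filling curves of Figure 7: bit-interleaved
# Z-values and Hilbert H-values on a 2^l x 2^l grid.

def z_value(i1, i2, l):
    """Interleave the l-bit coordinates; here I1 contributes the high bit."""
    z = 0
    for k in range(l - 1, -1, -1):
        z = (z << 1) | ((i1 >> k) & 1)
        z = (z << 1) | ((i2 >> k) & 1)
    return z

def h_value(i1, i2, l):
    """Position along the order-l Hilbert curve (classic xy2d rotation)."""
    x, y, d = i1, i2, 0
    s = 1 << (l - 1)
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:              # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order-1 curves on the 2x2 grid (cells as (x, y), y growing upward):
print([z_value(x, y, 1) for (x, y) in [(0, 0), (0, 1), (1, 0), (1, 1)]])
print([h_value(x, y, 1) for (x, y) in [(0, 0), (0, 1), (1, 1), (1, 0)]])
```

Both lists come out as consecutive positions 0..3, matching the visiting orders of Z1 and H1 in Figure 7.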

The Z+ DDF is obtained by allowing variable-length Z-values. Thus, for each region R we consider the corresponding intervals I1 and I2 at levels l1 and l2, interleave them, interpret the result as a binary number, and take the modulus. Figure 8 illustrates how Z+ works. Z+ does not have the Mem property, but it has the Dec property, since the Z-values of regions originated by a split are consecutive integers.
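The variable-length interleaving of Z+ can be sketched as follows; the (value, level) pairs and the choice of which coordinate contributes the first bit are illustrative assumptions.

```python
# Sketch of Z+: a region's variable-length Z-value (the grid levels l1, l2
# may differ by one, since splits alternate) taken as a binary number mod M.

def z_plus_value(i1, l1, i2, l2):
    """Interleave the l1-bit I1 with the l2-bit I2 (I1 first at each step)."""
    bits = []
    for k in range(max(l1, l2)):
        if k < l1:
            bits.append((i1 >> (l1 - 1 - k)) & 1)
        if k < l2:
            bits.append((i2 >> (l2 - 1 - k)) & 1)
    return int("".join(map(str, bits)), 2) if bits else 0

def disk_of_z_plus(i1, l1, i2, l2, m):
    return z_plus_value(i1, l1, i2, l2) % m

# Dec: the two regions resulting from a split have consecutive Z-values.
r = (0b01, 2, 0b1, 1)                             # a region R, Z(R) = 011
kids = [(0b01, 2, 0b10, 2), (0b01, 2, 0b11, 2)]   # split of I2
print(z_plus_value(*r))
print([z_plus_value(*k) for k in kids])           # consecutive integers
```

Splitting appends one bit to the Z-value, so the children's values are 2·Z(R) and 2·Z(R)+1, which is exactly why Dec holds.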

Figure 8. The Z+ method.

The extension of HCAM is only a bit more involved. If a region R has grid levels l1 = l2 = l, then its value is obtained from its position in H_l, the Hilbert curve of order l. The other possible case, considering that splits are done cyclically, is to have l1 = l2 + 1. Among the two square regions of H_l1 which compose R, we select the minimum of the two H-values. Figure 9 illustrates how HCAM+ works. HCAM+ does not have the Mem property. For instance, splitting the region (0, 1), for which H(0, 1) = 1, yields regions (00, 1) (H-value 4) and (01, 1) (H-value 6), which are both mapped to different disks. Note that if M > 4, no way to assign H-values to non-square regions can guarantee memory reuse. As to the Dec property, it holds provided M ≥ 4 or, when M ≤ 3, if the H-value of a non-square region is chosen on a per-case basis.

4.3 ECC+

The ECC+ method works, as static ECC, with M = 2^m. It multiplies the Z-value of a region by a code check matrix. With variable-size regions, we have already seen that Z-values have variable


Figure 9. The HCAM+ method.

length. This would imply using different check matrices (codes), one for each region size. The basic idea used to achieve Mem and to avoid file reorganizations is derived from the technique used in [CTZ] for the dynamic declustering of signature files. It suggests extending the parity check matrix by adjoining non-zero m-bit column vectors to its right. Consider the split of region R, which originates regions R0 and R1, where Z(R0) = Z(R)·0 and Z(R1) = Z(R)·1 ("·" denotes concatenation). Let H_{l1+l2} be the check matrix used to allocate R, whose grid levels are l1 and l2. In order to allocate R0 and R1 we use the matrix

H_{l1+l2+1} = [ H_{l1+l2} | h ]

where h is a non-null vector. Then we have:

H_{l1+l2+1} · Z(R0)' = H_{l1+l2} · Z(R)'
H_{l1+l2+1} · Z(R1)' = H_{l1+l2} · Z(R)' ⊕ h

where ⊕ is sum modulo 2 (exclusive-or). The first equation shows that R0 is mapped back to R's disk (Mem); the second shows that the Dec property holds as well, since h is not null. Figure 10 shows the effect of adding to the H4 matrix of Example 2.1 the column vector h = (1, 0)'.
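The matrix-extension argument can be verified numerically. The matrix H, the column h, and the Z-value below are illustrative assumptions, not the report's exact values; the point is only that adjoining h preserves R0's disk and shifts R1's disk by h.

```python
# Sketch of the ECC+ matrix extension: adjoining a non-null column h to the
# parity-check matrix keeps R0 on R's disk (Mem) and separates R1 (Dec).
# H, h, and Z(R) are illustrative assumptions.

def ecc_disk(h_matrix, z_bits):
    """Multiply the Z-value bit vector by H over GF(2); read the disk id."""
    out = [sum(r & b for r, b in zip(row, z_bits)) % 2 for row in h_matrix]
    return int("".join(map(str, out)), 2)

H = [[1, 0, 1, 0],    # hypothetical m x (l1+l2) check matrix, m = 2
     [0, 1, 0, 1]]
h = [1, 0]            # non-null column adjoined on the right
H_ext = [row + [col] for row, col in zip(H, h)]

z_r = [0, 1, 1, 0]                    # Z(R)
z_r0, z_r1 = z_r + [0], z_r + [1]     # Z(R0) = Z(R).0, Z(R1) = Z(R).1

print(ecc_disk(H, z_r), ecc_disk(H_ext, z_r0), ecc_disk(H_ext, z_r1))
# R0 keeps R's disk; R1's disk differs from it by h (a GF(2) XOR).
```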

Figure 10. The ECC+ method.

4.4 FX+

The last method we consider is Fieldwise eXclusive-or (FX) [KP88]. It works with binary-valued coordinates, which are bitwise xor-ed; the result is taken modulo M:

diskOf_FX(I1, I2) = (I1 ⊕ I2)_2 mod M

For instance, for region (3, 4) of an 8 × 8 grid we first get (011 ⊕ 100)_2 = (111)_2 = 7, which, with M = 4, maps to disk 3.

FX+ applies the same principle to allocate regions for which the grid levels are equal, l1 = l2. When l1 = l2 + 1, the bit strings are aligned "to the left" before xor-ing them. As an example, let R = (110, 01). Aligning to the left yields (110 ⊕ 010)_2 = (100)_2 = 4, after which 4 mod M can be computed. Figure 11 shows a complete example.
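The left alignment can be sketched as a right-shift padding of the shorter coordinate; the (value, level) encoding is an illustrative choice.

```python
# Sketch of FX+: coordinates are XOR-ed bitwise; when l1 = l2 + 1 the
# shorter bit string is aligned "to the left" (padded with 0 on the right).

def disk_of_fx_plus(i1, l1, i2, l2, m):
    lmax = max(l1, l2)
    a = i1 << (lmax - l1)   # left-align: pad the shorter string on the right
    b = i2 << (lmax - l2)
    return (a ^ b) % m

# The paper's examples:
print(disk_of_fx_plus(0b011, 3, 0b100, 3, 4))  # region (3, 4): disk 3
print(disk_of_fx_plus(0b110, 3, 0b01, 2, 4))   # R = (110, 01): 100 = 4, disk 0
```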

Figure 11. The FX+ method.

FX+ has the Dec property (since regions originating from a split differ only in the least significant bit of a coordinate value), but not the Mem property. To see this, let M = 4 and consider the split of region R = (00, 01), mapped to disk 1. We get R0 = (000, 01) and R1 = (001, 01), which are mapped to disks 2 and 3, respectively. Notice that, had we chosen to "align to the right" bit strings of different length, Mem would not hold either. For instance, splitting region R = (01, 10) (mapped to disk 3) yields R0 = (010, 10), which goes to disk 0, and R1 = (011, 10), which is allocated to disk 1.

5 The Directory: MX-GF

We now consider how DDFs can be efficiently supported at the index level. We make the reasonable assumption that the directory is too large to be kept in main memory and must therefore be organized on the M disks. Clearly, as is usual with static mapping functions, in the dynamic case too we want to provide efficient declustering of the directory entries in order to evenly distribute the load on the disks. In the following we consider three basic design alternatives.

5.1 Global Flat Grid File

The Global Flat Grid File (GF-GF) organizes the space using a single (global) orthogonal grid, which is updated as the file evolves. Addressing is based on a set of linear scales, one per dimension, which are resident in main memory and are used to address grid cells. The set of cells mapped to the i-th disk, denoted C^i, is then organized by some single-disk spatial access method (SAM). For instance, the (static) CMD approach organizes C^i using a single-disk Grid File [NHS84], after applying an appropriate "coordinate transformation function" [LSR92]. Figure 12 illustrates the principles of the GF-GF organization.

In the dynamic case GF-GF has some major drawbacks. It inherits the problems of the single-disk Grid File in managing skewed data distributions, for which the number of cells can become as large as O(N^D) for N points and D dimensions. This is because many new cells have to be allocated for a single data page split which requires halving a scale interval. In a multi-disk environment the directory growth propagates to all disks, thus involving all local SAMs. The overhead due to directory growth is further increased if the DDF does not have the Mem property. Consider the case where a cell with region R, mapped to disk i, is split because of interval halving. If the DDF has the Mem property, one of the two subcells, with region R0, is mapped back to disk i, thus reusing memory and requiring minimal changes to the directory of

Figure 12. The GF-GF organization.

such disk.⁷ If the Mem property does not hold, however, we cannot simply update the SAM of disk i: we need to delete the cell from it, and insert the new subcells into the SAMs of other disks.

A semi-dynamic variant of GF-GF is obtained by dynamically declustering only data pages (using a DDF), while using a static set of global linear scales to partition the space. Thus, the SAM of disk i organizes a fixed subspace, but it can provide access to data also stored on other disks. The major disadvantage of this approach is that it cannot efficiently distribute the access load on directory entries, since there is no control to balance the size of the directories.

5.2 Multiplexed Grid File

The final alternative we analyze is based on hierarchical versions of the Grid File. In particular, we consider the Multilevel Grid File described in [WK85]. Each node of the index corresponds to a (hyper-)rectangular space region and is stored on a disk page. The entries of the index node whose region is R are binary-radix subregions of R, and are accessed through so-called "cross-disk" pointers, that is, pointers which can refer to a page stored on a different disk. Each index and data page is declustered by applying a DDF to its region identifier, apart from the root, which is kept in main memory. We term this organization Multiplexed Grid File (MX-GF) since, apart from different space partitioning criteria, it closely resembles a multiplexed R-tree [KF92]. Figure 13 shows how MX-GF organizes the data regions of Figure 11 using the FX+ DDF, assuming that each index node can store 6 entries.

It can be argued that MX-GF does not suffer from any of the problems which plague GF-GF (or its semi-dynamic variant): it can dynamically decluster both data and directory; a data page split leads to the insertion of a new index entry (rather than, possibly, of a new row or column in the global grid); and, apart from the propagation of splits up to higher levels of the tree, only one disk is affected by directory growth. Furthermore, search and maintenance algorithms are equivalent to those of the Multilevel Grid File [WK85], and are therefore not detailed here.

5.2.1 Local Declustering: RR and PI

Once we have chosen MX-GF as the way to implement the directory, index-based (local) declustering methods can also be used. Consider the split of a page P (index node or data bucket) with region R, whose entry is stored in node P_father. When P splits, leading to P0 and P1 (with regions R0 and R1), a local declustering method decides on the allocation of the new pages by only taking into account how the other children of P_father are mapped to disks. Intuitively, a local declustering method should map "close" regions to different disks. We consider two local methods for the declustering of MX-GF pages: Round-Robin (RR) and Proximity Index (PI), the latter being an adaptation of a convenient criterion for the declustering


Figure 13. MX-GF using FX+.

of parallel R-trees [KF92]. In both methods we enforce the Mem property, that is, we impose that diskOf(R0) = diskOf(R).
The RR method maps R1 to the disk holding the minimum number of pages, considering all and only the entries in Pfather. As Figure 14 highlights, this policy does not guarantee the Dec property.

Figure 14. The RR method with M = 2.
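The RR policy above can be sketched as follows; this is a hypothetical illustration (function and variable names are ours, not from the paper), assuming that sibling entries are available as (region, disk) pairs.

```python
# Sketch of the RR (Round-Robin) local declustering method. When a page with
# region R splits into R0 and R1, R0 stays on R's disk (the Mem property),
# while R1 goes to the disk holding the fewest pages, counting only the
# sibling entries stored in the father node.
from collections import Counter

def rr_allocate(father_entries, disk_of_r, num_disks):
    """father_entries: list of (region_id, disk_id) pairs in the father node.
    disk_of_r: disk of the splitting region R (inherited by R0).
    Returns (disk for R0, disk for R1)."""
    load = Counter(disk for _, disk in father_entries)
    # Least-loaded disk among all M disks; ties broken by lowest disk id.
    disk_r1 = min(range(num_disks), key=lambda d: (load[d], d))
    return disk_of_r, disk_r1
```

Note that, as the figure shows, the least-loaded disk may coincide with R's own disk, in which case R0 and R1 end up together; this is precisely why Dec is not guaranteed.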

The PI method works by defining a "proximity measure" between regions. The measure we use is based on the distance between regions' centers, although more elaborate measures could be defined as well.8 Given region R1 and the set of regions mapped to disk i, R_i, we map R1 to the disk for which

    min_{Rj ∈ R_i} { distance(R1, Rj) }

is maximized. With the possible exception of cases where "large" regions split, this almost certainly guarantees that Dec holds (see Figure 15).

8. For instance, in [KF92] the proximity of two regions is derived by considering the probability that both regions qualify for a same query.


Figure 15. The PI method with M = 3. A counterexample to the Dec property.
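The PI criterion can be sketched as follows, using the simplified center-distance proximity measure; the names and the region representation are our own illustration, not the paper's implementation.

```python
# Sketch of the PI (Proximity Index) local declustering method: map R1 to
# the disk i that maximizes min_{Rj in R_i} distance(R1, Rj), where the
# proximity measure is the distance between region centers.
import math

def center(region):
    """Center of a rectangular region given as ((xlo, ylo), (xhi, yhi))."""
    (xlo, ylo), (xhi, yhi) = region
    return ((xlo + xhi) / 2.0, (ylo + yhi) / 2.0)

def distance(r1, r2):
    (cx1, cy1), (cx2, cy2) = center(r1), center(r2)
    return math.hypot(cx1 - cx2, cy1 - cy2)

def pi_allocate(r1, regions_by_disk):
    """regions_by_disk: dict disk_id -> regions already mapped to that disk.
    A disk with no regions is maximally distant, hence preferred."""
    def min_dist(disk):
        return min((distance(r1, r) for r in regions_by_disk[disk]),
                   default=math.inf)
    return max(regions_by_disk, key=min_dist)
```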

Parameter   Description                    Value
M           n. of disks                    8
D           n. of dimensions               2
c           data page capacity (points)    2
psize       disk page size (Kbytes)        4

Table 2. Default parameter values.

6   Experimental Results

In this Section we present experimental results obtained from a prototype implementation of MX-GF, whose pages are declustered using, in turn, each of the considered declustering methods. Without loss of generality, we normalize input data to the unit hypercube (i.e., the unit square in the 2-D case), and use Z-values to represent regions in such a space. For the implementation of static declustering methods we first define a square grid, which establishes the "finest granularity" of declustering. Then, we use MX-GF to organize data and index pages, but we "freeze" the declustering process when splitting a region that has reached the chosen granularity level. Since we use Z-values, this is easily implemented by checking the length of the region identifier. For instance, with a static 32 × 32 grid (= 1,024 grid regions), the split of a region whose Z-value is 10 (or more) bits long maps both originating subregions back to the same disk.

Each result of our experiments is obtained by averaging the response time of 100 square range queries, uniformly distributed over the unit space and having a non-empty result. The response time of a range query q, denoted R(q), is computed as the instant at which the last data page satisfying q is retrieved. We assume that each page retrieval incurs a (constant) unitary time cost and that CPU costs are negligible with respect to I/O costs. The (parallel) retrieval process of the pages which compose the query tree goes as follows. At each time t ≥ 0 we have a queue U of (disk id, page id) requests to pages which must be accessed to answer the query. In case some elements of U are in conflict, i.e., they refer to pages stored on the same disk, we break the tie by using a FIFO scheduling policy. Once a page is accessed, its entry is removed from U. Incoming entries are first compared with q and, if they qualify, inserted into U. The process halts when no more requests are waiting, that is, U = ∅.
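The retrieval model just described can be sketched as a small simulation; the page representation below (objects with disk and children attributes) is our own assumption, introduced only for illustration.

```python
# Sketch of the response-time model: unit cost per page access, one page per
# disk per time step, FIFO order within each disk's queue. Fetching an index
# page enqueues its entries that qualify for the query q.
from collections import deque

def response_time(root_entries, qualifies):
    """root_entries: pages referenced by the (in-memory) root that satisfy q.
    qualifies: predicate testing whether a page's region intersects q.
    Returns the step at which the last page is retrieved."""
    queues = {}                        # one FIFO queue of requests per disk
    for page in root_entries:
        queues.setdefault(page.disk, deque()).append(page)
    t = 0
    while any(queues.values()):
        t += 1                         # one parallel I/O step
        fetched = [q.popleft() for q in queues.values() if q]
        for page in fetched:
            for child in getattr(page, "children", []):
                if qualifies(child):   # compare incoming entries with q
                    queues.setdefault(child.disk, deque()).append(child)
    return t
```

For example, an index page on disk 0 whose two qualifying data children live on disks 0 and 1 costs two steps: one to read the index page, one to read both children in parallel.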
Our analysis concentrates on the following points:
- How DDFs behave with respect to static methods.
- How DDFs behave with respect to index-based methods.
- How results are influenced by non-uniform data distributions.

Unless otherwise stated, we use the parameter values listed in Table 2. The low value of data page capacity, c = 2, is used to obtain, for a given number of points N, the maximum number of index and data pages. The datasets used in the experiments are summarized in Table 3. The zip codes dataset contains the centroids of all USA regions for which a ZIP code is defined, and fractal1.2 is a highly skewed dataset with fractal (i.e., non-integer) dimension 1.2 (see also Section 6.3).


Name          Description                N. points
2D unif xxK   2D uniform distribution    xx × 10^3
zip codes     USA zip codes              26,874
fractal1.2    "Lévy's flight"            20,000

Table 3. Datasets.

6.1 DDFs vs Static Methods
Figure 16 shows the relative response times of static declustering methods with respect to the corresponding dynamic versions, using the 2D unif 10K dataset. The total number of data pages is 6,407, but static methods use a 32 × 32 grid. This means that each grid region (the unit of declustering for static methods) corresponds, on average, to 6.25 data pages. For small queries this results in almost doubling the response time. For larger queries, on the other hand, the overhead decreases, because of the higher chance of getting almost the same number of pages from each disk.
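The granularity "freeze" used to emulate static methods can be sketched as follows; the code assumes region identifiers are Z-value bit strings and takes the DDF as a function parameter (all names are ours, introduced for illustration).

```python
# Sketch of the granularity freeze: once a region identifier is as long as
# the bits of the static grid (10 bits for a 32x32 grid, 5 per dimension),
# both subregions produced by a split map back to the parent's disk.
def split_region(z_value, disk_of_parent, ddf, grid_bits=10):
    """z_value: region identifier as a bit string, e.g. '0110'.
    ddf: dynamic declustering function mapping an identifier to a disk.
    Returns ((z0, disk0), (z1, disk1)) for the two subregions."""
    z0, z1 = z_value + "0", z_value + "1"
    if len(z_value) >= grid_bits:
        # Finest static granularity reached: declustering is frozen.
        return (z0, disk_of_parent), (z1, disk_of_parent)
    return (z0, ddf(z0)), (z1, ddf(z1))
```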

Figure 16. Relative response time of static methods with respect to the corresponding DDF ones (= R_met(q)/R_met+(q)) as a function of the query side. Methods: DM, RDM, FX, ECC, Z, HCAM.

In the following experiment we analyze how the performance of static methods depends on the size of the dataset. This is important in order to understand, among other things, when a static organization should be reorganized in response to changes in data volume. Figure 17 shows only the behavior of HCAM, but similar results have been obtained with other static methods, as well as for other values of the parameters listed in Table 2. The figure plots the ratio R_HCAM(q)/R_opt(q) of HCAM response time, for square queries of size 0.02 × 0.02, to the "optimal response time", which is defined as:

    R_opt(q) = Σ_{l=1}^{L} ⌈ n_pages_l(q) / M ⌉        (1)

In the above definition, L is the number of levels of MX-GF, excluding the root (which is cached in memory) and including data pages, and n_pages_l(q) is the total number of pages retrieved by query q at level l.9 For comparison, the relative response time of the HCAM+ method is also shown. It can be seen that the performance of static methods considerably deteriorates with increasing dataset size, since the number of data pages which correspond to a single grid region


9. For L = 2 the definition indeed corresponds to the optimal response time for query q. When L ≥ 3, Eq. 1 can slightly overestimate the minimum theoretical response time, since it is possible to read in parallel pages at different levels of the query tree.
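Eq. (1) translates directly into code; the sketch below assumes the per-level page counts have already been collected.

```python
# Optimal response time of Eq. (1): at each level l the n_pages_l(q)
# qualifying pages are spread perfectly over the M disks, so level l
# costs ceil(n_pages_l(q) / M) parallel I/O steps.
import math

def optimal_response_time(pages_per_level, num_disks):
    """pages_per_level: [n_pages_1(q), ..., n_pages_L(q)], root excluded."""
    return sum(math.ceil(n / num_disks) for n in pages_per_level)
```

For instance, with M = 8 disks, a query touching 6 index pages and 20 data pages has R_opt = 1 + 3 = 4 steps.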


grows without limits. For instance, when N = 10^5, there are about 64 pages per grid region. On the other hand, HCAM+ performance stabilizes around a value which only depends on the declustering method itself and on the query size.

Figure 17. Relative response time of HCAM and HCAM+ methods with respect to the optimal response time as a function of dataset size.

6.2 DDF vs Index-Based Methods
In the next series of experiments we compare DDF and index-based (RR and PI) methods. Figure 18 plots the relative response time of dynamic methods with respect to the optimal response time, using the uniform 2D unif 10K dataset. We exclude from the analysis the RDM+ method, since it is unsuitable for M = 2^m disks.10 The most remarkable observation concerns the PI method, which consistently outperforms all the others. Although we used a simplified "similarity measure", as compared to the one proposed for R-trees [KF92], declustering based on the analysis of local space configurations proves to be an excellent allocation criterion. At the other extreme we find that RR has almost the worst response time, whereas DDFs exhibit an intermediate behavior. Other experiments in which we considered larger queries (up to 0.6 × 0.6) and non-square queries confirmed that PI is the appropriate choice for declustering uniformly distributed datasets. This is consistent with the observations reported in [KF92].

6.3 Non-Uniform Distributions
In the following experiments we consider the case of non-uniform data distributions and restrict the analysis to the two index-based methods, RR and PI, and to two DDFs, namely ECC+ and HCAM+. Figure 19 shows results obtained with the skewed zip codes dataset, which requires 17,193 data pages. Compared with the results in Figure 18, it appears that: 1) both index-based methods perform substantially better than DDFs; 2) RR outperforms PI for medium-large queries. As to the first point, we observe that, with skewed data distributions, the RR method achieves, by its very nature, a better balance than DDFs in allocating data regions to disks. This is shown in Table 4, where we have computed, for each of the four methods, the standard deviation of the distribution of the number of data pages on disks.
This is enough to guarantee that, on average, RR has a lower response time than DDFs, with the only exception of very small queries (i.e., 0.005 × 0.005 in the figure), which lead to the retrieval of only a few data pages (3.48 on average). As to the reason why RR outperforms PI on medium-large queries, we argue that this is again due to the better balance that RR can achieve. From Table 4 it appears that PI has the

10. In experiments with M ≠ 2^m, which are not reported here, the performance of RDM+ was almost similar to that of other DDFs.



Figure 18. Relative response time (= R_met+(q)/R_opt(q)) of dynamic (DDF and local) methods with respect to the optimal case. Dataset: 2D unif 10K. Methods: DM+, FX+, ECC+, Z+, HCAM+, RR, PI.

worst behavior as to uniformity of data allocation on disks, thus suffering from the presence of skewness in the dataset. Nonetheless, because of PI's smarter monitoring of local configurations, RR cannot do better than PI for small queries (i.e., up to 0.05 in the figure).

Figure 19. Relative response time (= R_met+(q)/R_opt(q)) of dynamic (DDF and local) methods with respect to the optimal case. Dataset: zip codes. Methods: ECC+, HCAM+, RR, PI.

The last experiment examines the four methods on the 20,000-point "fractal" distribution fractal1.2, represented in Figure 20. This corresponds to a so-called "Lévy's flight" with fractal dimension 1.2 [Man77]. Figure 21 confirms that RR is the best choice for (highly) skewed data distributions. Also note that the improvement obtained by PI over RR in the case of small queries has almost vanished.

Method     ECC+      HCAM+     RR        PI
Std dev    31.6248   30.5541   19.8813   38.465

Table 4. Standard deviation of data pages distribution on M = 8 disks, for the zip codes dataset.
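The balance metric of Table 4 can be computed as below; we assume the population form of the standard deviation, a detail the paper does not state.

```python
# Standard deviation of the per-disk distribution of data pages: a low value
# indicates that pages are spread evenly over the M disks.
import statistics

def page_imbalance(pages_per_disk):
    """pages_per_disk: list with one data-page count per disk."""
    return statistics.pstdev(pages_per_disk)
```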


Figure 20. The point distribution fractal1.2, with fractal dimension 1.2.

Figure 21. Relative response time (= R_met+(q)/R_opt(q)) of dynamic (DDF and local) methods with respect to the optimal case. Dataset: fractal1.2. Methods: ECC+, HCAM+, RR, PI.

7   Conclusions

In this work we have considered extensions to well-known declustering functions able to efficiently deal with dynamic datasets. In order to support dynamic declustering functions (DDFs), we have introduced a so-called "multiplexed" Grid File (MX-GF), which essentially is a Multilevel Grid File [WK85] whose index nodes and data pages are distributed over a set of disks. The paged organization of MX-GF also allows the application of index-based declustering methods, among which we have analyzed RR (Round-Robin) and PI (Proximity Index). Our experimental results can be summarized as follows:
- Dynamic declustering functions can indeed achieve good performance levels with time-varying datasets, their overhead being mostly a function of the query size and of the (static) methods from which they are derived, rather than of the data volume. On the other hand, the performance of static declustering functions degenerates when the dataset size does not fit the anticipated number of grid regions.
- For uniform data distributions, the PI index-based method consistently outperforms all the others, whereas RR has the worst performance among the analyzed methods. Although we have used for PI a "similarity measure" different from the one considered in [KF92] for the declustering of R-trees, our results confirm the findings of that paper.
- For skewed data distributions we have, rather surprisingly, observed that RR considerably improves its effectiveness. In particular, for medium-large queries RR is the best method, whereas for small queries the advantage of using PI seems to be distribution-dependent. We note that these results have no counterpart in the analysis of [KF92], where only uniform distributions were considered.

The above results seem to suggest that DDFs are not worth implementing in parallel Grid Files because of the better behavior of index-based methods. In favor of DDFs, however, there are some points which would deserve further analysis, among which:
- a lighter CPU load, because the complexity of index-based methods depends on the number of entries in a node, and
- a mapping of data regions to disks which does not depend on the ordering of data input.

In this paper we have considered a single-processor multi-disk architecture. Our future work will deal with the problem of extending dynamic declustering methods to the case of multi-processor architectures, such as the shared-nothing distributed processing model. This requires taking into account communication costs as well, in order to determine good trade-offs between the cost of moving data and the benefit of load balancing which dynamic declustering methods can achieve. A thorough comparison of the different parallel spatial data structures available today is also planned.

Acknowledgments
Fabio Grandi, Dario Maio, and Pavel Zezula read a preliminary version of the paper and contributed useful suggestions. Paolo Tiberio strongly encouraged the first author to pursue the subject of this work.

References

[CTZ] P. Ciaccia, P. Tiberio, and P. Zezula. Declustering of key-based partitioned signature files. To appear in ACM Transactions on Database Systems.
[DG92] D. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85–98, June 1992.
[DS82] H.C. Du and J.S. Sobolewski. Disk allocation for Cartesian product files on multiple disk systems. ACM Transactions on Database Systems, 7(1):82–101, March 1982.
[FB93] C. Faloutsos and P. Bhagwat. Declustering using fractals. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (PDIS'93), pages 18–25, January 1993.
[FM91] C. Faloutsos and D. Metaxas. Disk allocation methods using error correcting codes. IEEE Transactions on Computers, 40(8):907–914, August 1991.
[HS94] B. Himatsingka and J. Srivastava. Performance evaluation of grid based multi-attribute record declustering methods. In Proceedings of the 10th International Conference on Data Engineering, pages 356–365, Houston, Texas, February 1994.
[KF92] I. Kamel and C. Faloutsos. Parallel R-trees. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, pages 195–204, San Diego, CA, June 1992.
[KP88] M.H. Kim and S. Pramanik. Optimal file distribution for partial match queries. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 173–182, Chicago, IL, June 1988.
[LRS93] J. Li, D. Rotem, and J. Srivastava. Algorithms for loading parallel grid files. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 347–356, Washington, DC, May 1993. SIGMOD Record, 22(2), June 1993.
[LS95] D.-R. Liu and S. Shekhar. A similarity graph-based approach to declustering problems and its application towards parallelizing grid files. In Proceedings of the 11th International Conference on Data Engineering, pages 373–381, Taipei, Taiwan, March 1995.
[LSR92] J. Li, J. Srivastava, and D. Rotem. CMD: A multidimensional declustering method for parallel database systems. In Proceedings of the 18th VLDB International Conference, pages 3–14, Vancouver, Canada, August 1992.
[Man77] B. Mandelbrot. The Fractal Geometry of Nature. W.H. Freeman, New York, 1977.
[NHS84] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, March 1984.
[Ore86] J. Orenstein. Spatial query processing in an object-oriented database system. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, pages 326–336, Washington, DC, May 1986.
[Pat93] D.A. Patterson. Massive parallelism and massive storage: Trends and predictions for 1995–2000. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (PDIS), pages 5–6. IEEE Computer Society Press, January 1993.
[PGK88] D.A. Patterson, G. Gibson, and R.H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, Chicago, IL, June 1988.
[WK85] K.-Y. Whang and R. Krishnamurthy. Multilevel grid files. Research Report RC 11516, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, November 1985.
[ZSC94] Y. Zhou, S. Shekhar, and M. Coyle. Disk allocation methods for parallelizing grid files. In Proceedings of the 10th International Conference on Data Engineering, pages 243–252, Houston, Texas, February 1994.
