Cost Models for Join Queries in Spatial Databases

Yannis Theodoridis

Emmanuel Stefanakis

Timos Sellis

Computer Science Division Department of Electrical and Computer Engineering National Technical University of Athens Zographou 15773, Athens, HELLAS (GREECE) E-mail: {theodor, stefanak, timos}@cs.ntua.gr URL: http://www.dbnet.ece.ntua.gr/

ABSTRACT: The join query is one of the fundamental operations in Data Base Management Systems (DBMSs). Modern DBMSs should be able to support non-traditional data, including spatial objects, in an efficient manner. Towards this goal, spatial data structures can be adopted in order to support the execution of join queries on sets of multidimensional data. This paper introduces analytical models that estimate the cost (in terms of node or disk accesses) of join queries involving two multidimensional indexed data sets using R-tree-based structures. In addition, experimental results are presented, which show the accuracy of the analytical estimations when compared to actual runs on both synthetic and real data sets. It turns out that the relative error rarely exceeds 15% for all combinations, a fact that makes the proposed cost models useful tools for efficient spatial query optimization.

1. Introduction

A Spatial Data Base Management System (SDBMS) should offer appropriate data types and a query language to support spatial data, and provide efficient indexing methods and cost models for the execution of specialized spatial operations, for query processing and optimization purposes. Relevant applications that handle large volumes of data include Geographic Information Systems (GIS), Multimedia Systems, Medical or Satellite Image Bases, etc.

The join query is one of the fundamental but also expensive operations that an SDBMS should support. It combines entities from two data sets into single entities whenever the combination satisfies the join condition (e.g. overlap). In general, the join operation is an important database query operation since it retrieves information from two different data sets based on their Cartesian product. An example of a spatial join query is: “Find all countries in Europe that are crossed by rivers”, or, the more complex: “Find pairs of rivers that cross common countries in Europe and lie west of the 7th meridian”.

© 1998 IEEE. Published in the Proceedings of ICDE'98, February 1998 in Orlando, Florida. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: +Intl. 908-562-3966.

In the first case, the processing of the query is straightforward: the entries of the two spatial data sets C and R (denoting countries and rivers, respectively) are combined on their spatial predicates (polygons ci and lines rj, respectively) using the topological operator cross. In the latter case, however, there exist several alternative strategies for the query processing. One solution consists of the following three-step procedure: (i) selection of the rivers that lie west of the 7th meridian (using the directional operator west) and construction of an intermediate set R1, (ii) spatial join between R1 and C resulting in a set S of pairs (ri, cj) of rivers ri that cross countries cj, and (iii) main memory processing of S for pairs (ri, rj), i ≠ j, such that (ri, ci) and (rj, ci) belong to S. Other solutions, which differ in the execution order of the necessary primitive operations and consequently in efficiency, are also possible and need to be evaluated by a spatial query optimizer.

Whichever strategy is followed by the SDBMS query processor in order to execute a join query, the geometry or spatial representations of the objects need to be combined based on a spatial operator (e.g. overlap). However, the processing cost of complex representations, such as polygons, is high. Thus a two-step (filter - refinement) procedure for query processing is usually applied [Ore89]. The filter step is usually based on multidimensional indexes that organize Minimum Bounding Rectangle (MBR) approximations of spatial objects, while the refinement step usually includes computational geometry techniques for the intersection of geometric shapes. The latter is usually a time-consuming procedure since the actual geometry of the objects needs to be checked.
Although several techniques for speeding up this procedure have been studied in the past [BKSS94], the cost of this step cannot be considered part of index cost analysis and hence it is not taken into consideration in the rest of the paper.

Traditional join methods, such as nested loops, sort-merge, or hash join [ME92], are not efficient when dealing with spatial data. This is due to the lack of ordering, which is an inherent characteristic of entries in a multidimensional space. Because of that, several specialized techniques have been proposed, based on general-purpose or specialized tree indexes (examples in [BKS93, LR94]). However, there is a lack of an efficient cost model which would make an accurate estimation of the I/O cost of a spatial join operation between two data sets (either uniform-like or non-uniform ones), in correspondence to the existing cost models for a range query between a data set and a query window [FK94, TS96]. According to Brinkhoff et al., "... an analytical investigation of the execution time of a spatial join performed with R*-trees seems to be almost impossible ..." [BKS93]. This statement was due to the lack of efficient cost models for R-trees in the literature some years ago. Such models have recently appeared [FK94, TS96] and could serve as the basic platforms for the analysis of the spatial join operation.

Focusing on hierarchical multidimensional tree structures, Gunther's proposal [Gun93] was the earliest attempt to provide an analytical model for spatial join cost estimation. Abstractions of tree indexes, called "generalization trees", were modeled on the support of θ-joins. Implementation algorithms for general θ-joins were presented and evaluated for various probability distributions. Later, Aref and Samet [AS94] proposed analytical formulae for the execution cost and the selectivity of spatial joins, based on Kamel and Faloutsos’ R-tree analysis [KF93]. The basic idea of that work was to treat one data set as the underlying database and the other data set as a source of query windows, in order to estimate the cost of a spatial join query based on the cost of range queries. Experimental results showing the accuracy of the selectivity estimation formula were presented in that paper.

In this paper we propose a model that estimates the cost of a join query between two spatial data sets. The model is based on the analytical formula that estimates the cost of a range query, proposed in [TS96]. We also present comparison results that show the accuracy of the analytical estimations when compared with actual tests on synthetic and real data sets.

The rest of the paper is organized as follows. In Section 2 we provide the definition of the spatial join operation and we present implementation algorithms for spatial join using R-tree-based structures. In Section 3 we extend the analytical model introduced in [TS96], in order to support join queries, based on the algorithms presented in Section 2. Section 4 contains experimental results on synthetic and real data sets in one- and two-dimensional space, which show the accuracy of the proposed model. Section 5 concludes with a discussion on possible extensions and hints for future work.

2. Background

A join operation between two relations REL1 and REL2 using a condition REL1[i] θ REL2[j] is one of the operations supported by the relational model. The result of the operation contains those tuples in the Cartesian product REL1×REL2, where the i-th column of REL1 stands in relation θ to the j-th column of REL2. Obviously, our discussion does not depend on the specific underlying model of the SDBMS. For example, in an object-oriented system the relations REL1 and REL2 can be viewed as sets of objects or class extents and their tuples as spatial objects. In conventional applications, θ is often equality, which leads to the equi-join operation. Its extension for multidimensional (spatial) data, called spatial join, applies a spatial operator θ on the i-th column of REL1 and the j-th column of REL2, which are of some spatial data type. Spatial operators may be topological (e.g. overlap), directional (e.g. north), or distance-related (e.g. close), with overlap being the most common one.

As argued earlier, traditional join implementation techniques are not efficient for spatial data that are characterized by the absence of ordering, which is a necessary condition for all of them. Although several extensions of the traditional methods have been proposed [Ore89, Rot91], recent research efforts have focused on multidimensional indexing methods.

2.1. Join processing using multidimensional indexes

Past studies in the context of spatial join processing are classified in two groups: (i) those that consider the absence of multidimensional indexes on at least one of the data sets, and (ii) those that consider the presence of multidimensional indexes on both data sets. In the first case, relevant proposals adopt the construction of appropriate partitions in space based on data set entries [LR96, PD96, KS97] or specialized tree indexes on the fly, e.g. seeded trees [LR94]. In the rest of the paper we will consider the second case, in which both data sets are supported by spatial indexes. Since the processing of spatial predicates is crucial in an SDBMS, one can argue that indexes on those predicates should necessarily exist in order to efficiently support the basic (spatial) operations of the system. Among others, Grid files have been studied in [Rot91, BHF93] and R-trees in [BKS93]. We select the R-tree structure to be the underlying index since it has been widely recognized as the most effective family of spatial indexing methods.

Figure 1: An example of R-tree index.

Figure 1 shows an example set of two-dimensional data rectangles and the corresponding R-tree index [Gut84, BKSS90]. A range query retrieves all objects of the data set that overlap a query window q and it is implemented by performing a downwards traversal of the R-tree index. The join operation between two spatial data sets can be supported by applying a synchronized traversal on both R-tree indexes. An algorithm based on this idea, called SpatialJoin1, was originally introduced by Brinkhoff et al. in [BKS93]. Two improvements of this algorithm towards the reduction of the CPU- and I/O-cost were also proposed in the same paper by considering faster main-memory algorithms and better read schedule for a given LRU-buffer, respectively. Since (a) CPU cost is not a subject of R-tree cost analysis (in terms of disk accesses), and (b) the adoption of an LRU-buffer is considered to be an extension of pure analysis (see [LL98] for a related work), we study the cost estimation of join queries based on the first algorithm, illustrated in Figure 2.

SJ(R1,R2: R_node); /* SpatialJoin Algorithm */
01 BEGIN
02   FOR (all Er2 in R2) DO
03     FOR (all Er1 in R1) DO
04       IF (overlap(Er1.rect, Er2.rect)) THEN
05         IF (R1 and R2 are leaf pages) THEN
06           output(Er1.oid, Er2.oid)
07         ELSE IF (R1 is a leaf page) THEN
08           ReadPage(Er2.ptr);
09           SJ(Er1.ptr, Er2.ptr)
10         ELSE IF (R2 is a leaf page) THEN
11           ReadPage(Er1.ptr);
12           SJ(Er1.ptr, Er2.ptr)
13         ELSE
14           ReadPage(Er1.ptr); ReadPage(Er2.ptr);
15           SJ(Er1.ptr, Er2.ptr)
16         END-IF
17       END-IF
18     END-FOR
19   END-FOR;
20 END.

Figure 2: Spatial join between two R-trees

In other words, a synchronized traversal of both R-trees rooted by nodes R1 and R2, respectively, is performed with the entries of nodes R1 and R2 playing the roles of data and query rectangles, respectively, in a series of range queries. The cost of a spatial join operation can be measured as the total number of actual disk accesses for both disk-resident indexes (the procedure ReadPage either performs an actual read operation on the disk or reads the corresponding node information from a memory-resident buffer). The relation between spatial join processing and the processing of a set of range queries has already been noted in [BKS93, Gun93, AS94]. This is the starting point of our study, and in Section 3 we propose a cost model for join queries based on the corresponding one for range queries.
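The synchronized traversal of Figure 2 can be illustrated with a minimal Python sketch that counts ReadPage calls per tree. The `Entry` and `Node` classes, the `reads` counter, and the in-memory tree layout are hypothetical helpers introduced only for this illustration; they are not part of [BKS93]. When only one of the two nodes is a leaf, the sketch recurses with the single overlapping leaf entry, which is one possible reading of lines 07-12 of the pseudocode rather than a confirmed detail of the original algorithm.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

Rect = Tuple[Tuple[float, float], ...]  # one (low, high) interval per dimension


@dataclass
class Entry:
    rect: Rect
    oid: Any = None                 # object id, set for leaf entries
    ptr: Optional["Node"] = None    # child node, set for internal entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def overlap(a: Rect, b: Rect) -> bool:
    """True when two MBRs share at least one point in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))


def spatial_join(r1: Node, r2: Node, out: list, reads: dict) -> None:
    """Append overlapping (oid1, oid2) pairs to out; count page reads per tree."""
    for e2 in r2.entries:              # outer loop: 'query' tree R2
        for e1 in r1.entries:          # inner loop: 'data' tree R1
            if not overlap(e1.rect, e2.rect):
                continue
            if r1.is_leaf and r2.is_leaf:
                out.append((e1.oid, e2.oid))
            elif r1.is_leaf:
                reads["R2"] += 1       # ReadPage(Er2.ptr)
                # Assumption: keep probing with the single leaf entry e1.
                spatial_join(Node(True, [e1]), e2.ptr, out, reads)
            elif r2.is_leaf:
                reads["R1"] += 1       # ReadPage(Er1.ptr)
                spatial_join(e1.ptr, Node(True, [e2]), out, reads)
            else:
                reads["R1"] += 1       # ReadPage(Er1.ptr)
                reads["R2"] += 1       # ReadPage(Er2.ptr)
                spatial_join(e1.ptr, e2.ptr, out, reads)
```

Without any buffering, the two counters in `reads` grow together for trees of equal height, which is the symmetry that the node-access analysis of Section 3 relies on.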

2.2. Cost analysis of R-trees

Several proposals about the analytical estimation of the search performance of the R-trees have been presented in the past. Two models that predict the performance of R-trees on the execution of a range query without assuming uniform data distribution were proposed in [FK94, TS96], with the analytical cost formulae being based on two properties of the data set, fractal dimension and density surface, respectively. Both models were shown to be accurate, with the analytical estimates being close to the experimental results. In the following we present the latter proposal and the idea of extending it to estimate the cost of a spatial join operation using R-trees.

According to [TS96], consider a tree with the following characteristics: $h$, $N_j$, and $s_{j,k}$, which denote the height of the tree structure, the expected number of nodes at level j of the tree, and the average node extent at level j on each dimension k, respectively (the root is assumed to be at level j = h, and the leaf nodes at level j = 1). The expected retrieval cost, in terms of node accesses, for an n-dimensional query window q with extents $(q_1, q_2, \ldots, q_n)$ at each direction is given by the following formula, originally proposed in [KF93, PSTW93]:

$$NA(q) = \sum_{j=1}^{h-1} \left\{ N_j \cdot \prod_{k=1}^{n} \left( s_{j,k} + q_k \right) \right\} \qquad (1)$$

i.e., the expected number NA of node accesses is equal to the sum of the total coverage (at each level j) of the R-tree nodes, assuming that their size s has been extended by the size of the query window q at each direction k. The height h of an R-tree is computed by the formula:

$$h = 1 + \left\lceil \log_{c \cdot M} \frac{N}{c \cdot M} \right\rceil \qquad (2)$$

where N is the number of distinct objects in the database, M is the maximum number of entries in an R-tree node, and c is the average node capacity (typically c = 67%; c⋅M denotes the average number of entries per node). The number $N_j$ of nodes at level j is:

$$N_j = \frac{N}{(c \cdot M)^j} \qquad (3)$$

and the average node extent (assuming square node rectangles) at level j [TS96]:

$$s_{j,k} = \left( \frac{D_j}{N_j} \right)^{1/n} \qquad (4)$$

where $D_0 = D$ (i.e., the actual density of the data set), and $D_j$ denotes the density of node rectangles at level j, computed as a function of the density $D_{j-1}$ of node rectangles at level j-1:

$$D_j = \left\{ 1 + \frac{D_{j-1}^{1/n} - 1}{(c \cdot M)^{1/n}} \right\}^{n} \qquad (5)$$

Qualitatively, the cost model of [TS96] estimates the retrieval cost for an n-dimensional query window q, only based on the knowledge of the data set (i.e., the number N of data objects and the density D of their MBRs) and the query window extent (q1, q2, …, qn). As discussed earlier, the join operation between two spatial data sets is considered to be equivalent to a set of range queries, with appropriate selections of the data and the query set during the synchronized traversal of the corresponding trees. Thus Eq. 1 can be repeatedly used in order to estimate the cost of the whole join procedure. In the next two sections a detailed presentation of our analysis for spatial join as well as the experimental evaluation of our proposal using synthetic and real data sets are described.
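To make the dependence on N, D, and the query extent concrete, the following Python sketch evaluates Eqs. 1-5 directly. The function names (`tree_height`, `level_params`, `range_query_cost`) and the default `fill=0.67` argument are chosen here for illustration only; the sketch assumes the ceiling in Eq. 2 and square node rectangles, as stated above.

```python
import math


def tree_height(n_objects, max_entries, fill=0.67):
    """Eq. 2: h = 1 + ceil(log_{c*M}(N / (c*M)))."""
    fanout = fill * max_entries
    return 1 + math.ceil(math.log(n_objects / fanout, fanout))


def level_params(n_objects, density, max_entries, n_dims, fill=0.67):
    """Return {j: (N_j, s_j)} for levels j = 1..h-1, following Eqs. 3-5."""
    fanout = fill * max_entries
    params = {}
    d_prev = density                                   # D_0 = D
    for j in range(1, tree_height(n_objects, max_entries, fill)):
        n_j = n_objects / fanout ** j                  # Eq. 3
        d_j = (1 + (d_prev ** (1 / n_dims) - 1)
               / fanout ** (1 / n_dims)) ** n_dims     # Eq. 5
        s_j = (d_j / n_j) ** (1 / n_dims)              # Eq. 4
        params[j] = (n_j, s_j)
        d_prev = d_j
    return params


def range_query_cost(n_objects, density, max_entries, query_extents, fill=0.67):
    """Eq. 1: expected node accesses NA(q) for a window with the given extents."""
    n_dims = len(query_extents)
    levels = level_params(n_objects, density, max_entries, n_dims, fill)
    return sum(n_j * math.prod(s_j + q for q in query_extents)
               for n_j, s_j in levels.values())
```

For example, `range_query_cost(20000, 0.5, 50, (0.1, 0.1))` estimates the node accesses of a 0.1 × 0.1 window over a two-dimensional data set of 20K rectangles with density 0.5, using nothing but the data set properties and the node capacity.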

3. A cost model for spatial joins

Formally, the problem of the join query analysis using R-trees is the following: Let n be the dimensionality of the work space and $WS = [0,1)^n$ the n-dimensional unit work space. Let us assume two spatial data sets of cardinality $N_{R_1}$ and $N_{R_2}$, respectively, with their entries’ MBRs being stored in two R-tree indexes R1 and R2, respectively. The target of our cost analysis is a set of formulae that would efficiently estimate the average number NA (or DA) of node (or disk) accesses needed to process a join query between the two data sets. The distinction between node and disk accesses is a subject of buffer management; the inequality DA ≤ NA always holds. The cost formulae should be based on the knowledge of the primitive properties of the data sets only, without extracting information from the corresponding R-tree structures.

According to [TS96], the formula that estimates the cost of a range query, defined by a query window q = (q1, ..., qn), over a data set of N objects with density D, is given by Eq. 1, with the average size $s_{j,k}$ of node rectangles being a function of the N and D parameters (Eq. 4). As discussed in Section 2, the processing cost of a join query is equal to the total cost of a set of appropriate range queries, as can also be derived from the algorithm SJ shown in Figure 2. In this paper we follow the distinction of [BKS93] for R-trees of equal or different height; for each case we propose a pair of formulae that estimate the cost in terms of node or disk accesses, respectively. In the rest of the section we use the symbols of Table 1.

Symbol                  Definition
n                       number of dimensions
M                       maximum R-tree node capacity
c                       average R-tree node capacity (in %)
$h_{R_i}$               height of the R-tree Ri
$N_{R_i}$               number of data rectangles indexed by Ri
$D_{R_i}$               density of data rectangles indexed by Ri
$N_{R_i,j}$             number of nodes of Ri at level j
$D_{R_i,j}$             density of node rectangles of Ri at level j
$s_{R_i,j,k}$           average size of node rectangles of Ri at level j on dimension k (k = 1, ..., n)
NA(Ri, j)               number of node accesses for Ri at level j
NA_total(R1, R2)        number of node accesses for a join query between R1 and R2
DA(Ri, j)               number of disk accesses for Ri at level j
DA_total(R1, R2)        number of disk accesses for a join query between R1 and R2

Table 1: List of symbols and definitions

3.1. Cost estimation for R-trees of same height

Suppose that the height of both tree indexes is equal to h and the two root nodes are stored in main memory. At each level j, 1 ≤ j ≤ h-1, tree R1 (R2) contains $N_{R_1,j}$ ($N_{R_2,j}$) nodes of average size $s_{R_1,j}$ ($s_{R_2,j}$) consisting of a set of entries $E_{R_1,j}$ ($E_{R_2,j}$). In order to find which pairs of entries are overlapping and downwards traverse the tree, we compare entries $E_{R_1,j}$ with entries $E_{R_2,j}$ (line 04 of the spatial join algorithm SJ illustrated in Figure 2). The cost (in terms of node accesses) of the above comparison for level j is given by the summation of two factors which express the respective costs for the two R-trees, namely NA(R1, j) and NA(R2, j). In order to estimate these two factors we consider that the entries of R1 (R2) at level j play the role of the data set (a set of query windows q, respectively) and we apply the function:

$$intsect(N, s, q) = N \cdot \prod_{k=1}^{n} \min\{1, (s_k + q_k)\}$$

which returns the number of nodes at level j intersected by a query window q [TS96], in order to estimate the access cost for R1 (R2). Since we consider no buffering scheme, the access costs for both trees R1 and R2 are equal (since an equal number of nodes is accessed, as can be seen from line 14 of the SJ algorithm). Formally (i = 1, 2):

$$NA(R_i, j) = N_{R_2,j} \cdot N_{R_1,j} \cdot \prod_{k=1}^{n} \min\{1, (s_{R_1,j,k} + s_{R_2,j,k})\} \qquad (6)$$

The processing of line 04 of the SJ algorithm is repeatedly executed at each level of the two trees down to the leaf level. Hence the total cost, denoted by NA_total(R1, R2), is the summation of two factors (computed by Eq. 6) for all levels j:

$$NA\_total(R_1, R_2) = \sum_{j=1}^{h-1} \{ NA(R_1, j) + NA(R_2, j) \} \qquad (7)$$

The involved parameters ($h$, $N_{R_i,j}$, and $s_{R_i,j,k}$) are given by Eq. 2, Eq. 3, and Eq. 4, respectively, as functions of the actual population $N_{R_i}$ and the actual density $D_{R_i}$ of the data sets.
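As an illustration of how Eq. 6 and Eq. 7 combine with Eqs. 2-5, the following Python sketch estimates NA_total for two trees of equal height. The helper names (`level_params`, `intsect`, `join_node_accesses`) and the representation of a data set as a `(N, D)` tuple are assumptions made for this example; square node rectangles are assumed, so the product over k collapses to an n-th power.

```python
import math


def level_params(n_objects, density, max_entries, n_dims, fill=0.67):
    """Height h (Eq. 2) and {j: (N_j, s_j)} for j = 1..h-1 (Eqs. 3-5)."""
    fanout = fill * max_entries
    h = 1 + math.ceil(math.log(n_objects / fanout, fanout))
    levels, d_prev = {}, density
    for j in range(1, h):
        n_j = n_objects / fanout ** j
        d_j = (1 + (d_prev ** (1 / n_dims) - 1) / fanout ** (1 / n_dims)) ** n_dims
        levels[j] = (n_j, (d_j / n_j) ** (1 / n_dims))
        d_prev = d_j
    return h, levels


def intsect(n_nodes, s, q, n_dims):
    """Expected number of nodes of extent s intersected by a window of extent q."""
    return n_nodes * min(1.0, s + q) ** n_dims


def join_node_accesses(ds1, ds2, max_entries, n_dims):
    """Eq. 7 for two (N, D) data sets whose trees have the same height."""
    h1, lv1 = level_params(*ds1, max_entries, n_dims)
    h2, lv2 = level_params(*ds2, max_entries, n_dims)
    if h1 != h2:
        raise ValueError("Eq. 7 applies to trees of equal height only")
    total = 0.0
    for j in range(1, h1):
        (n1, s1), (n2, s2) = lv1[j], lv2[j]
        na = n2 * intsect(n1, s1, s2, n_dims)   # Eq. 6, identical for R1 and R2
        total += 2 * na                          # NA(R1, j) + NA(R2, j)
    return total
```

Calling `join_node_accesses((20000, 0.5), (80000, 0.5), 84, 1)` and the same call with the two data sets swapped should give the same estimate up to rounding, since Eq. 6 does not distinguish between the data and the query tree.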

Hence Eq. 7 estimates the cost of a join query between two spatial data sets based on their primitive properties only, namely density and number of objects. Notice that Eq. 7 is symmetric with respect to the two indexes R1 and R2. The same conclusion is drawn by studying the SJ algorithm, since the number of node accesses is equal to the number of ReadPage calls (line 14) which, in turn, are the same for both trees. This symmetry no longer holds when a simple path buffer (i.e., a buffer that keeps the most recently visited path for each tree) is introduced, as we discuss next.

The actual access cost for each tree Ri at level j (denoted by DA(Ri, j)) represents the number of disk accesses needed to answer the join query, with the respective NA(Ri, j) serving as an upper bound; i.e., DA(Ri, j) ≤ NA(Ri, j). In our analysis we introduce a simple buffering mechanism that maintains a path buffer for each tree structure. By examining the algorithm SJ we intuitively see that the existence of such a buffering scheme affects the computation of the cost for the tree that plays the role of the query (data) set in a high (low) degree because its corresponding entries constitute the outer (inner) loop of the algorithm and hence are less (more) frequently updated. This statement is formally explained in the following alternative cases (illustrated in Figure 3):

(i) Suppose that an entry $E_{R_2,j}$ of tree R2 at level j overlaps with m entries of a node $E_{R_1,j+1}$, m´ entries of a different node $E'_{R_1,j+1}$, etc., of the tree R1. $E_{R_2,j}$ is kept in main memory during its comparison with all entries of node $E_{R_1,j+1}$ and will be fetched from disk again, and hence counted again in the DA(R2, j) cost, when its comparison with the entries of $E'_{R_1,j+1}$ starts. As a result, the number of actual disk accesses of the node rooted by $E_{R_2,j}$ is equal to the number of the nodes of R1 at level j+1 (i.e., the parent level) having rectangles intersected by $E_{R_2,j}$; formally: $intsect(N_{R_1,j+1}, s_{R_1,j+1}, s_{R_2,j})$.

(ii) On the other hand, an entry $E_{R_1,j}$ of tree R1 at level j is counted again in DA(R1, j) as soon as it overlaps with an entry $E_{R_2,j}$ of tree R2, with only one exception: $E_{R_1,j}$ being the last member of the intersection set of $E_{R_2,j}$ and, simultaneously, the first member of the intersection set of its consecutive $E'_{R_2,j}$. The above exception rarely happens; moreover, it is hard to model since no order exists among entries of R-tree nodes.

Hypothesis: D2 overlaps with entries {D1, E1} and {H1, I1} of nodes A1 and B1, respectively; E2 overlaps with entries {E1, F1} and {H1} of nodes A1 and B1, respectively.
Case (i): Example of DA(R2, j) computation: 2 hits due to entry D2 (i.e., equal to the number of intersected entries of R1 at level j+1, namely {A1, B1}).
Case (ii): Example of DA(R1, j) computation: Rule: 2 hits due to entry H1 (i.e., equal to the number of intersected entries of R2 at level j, namely {D2, E2}). Exception to the rule: 1 hit due to entry E1 (since the overlapping pairs (E1, D2) and (E1, E2) are consecutively checked).

Figure 3: Alternative cases for estimating DA cost.

The above discussion is formalized in the formulae:

$$DA(R_2, j) = intsect(N_{R_1,j+1}, s_{R_1,j+1}, s_{R_2,j}) \cdot N_{R_2,j} \qquad (8)$$

and

$$DA(R_1, j) \approx NA(R_1, j) \qquad (9)$$

The total cost DA_total(R1, R2) is again the summation of the two factors for all levels j:

$$DA\_total(R_1, R_2) = \sum_{j=1}^{h-1} \{ DA(R_1, j) + DA(R_2, j) \} \qquad (10)$$

Notice that, in contrast to Eq. 7, Eq. 10 is sensitive to the two indexes, R1 and R2. The experimental results of Section 4 strengthen this statement. In the above analysis we have considered two cases: adopting (a) no, or (b) a simple path buffer scheme. A more complex buffering scheme (e.g. an LRU-buffer of predefined size) would surely achieve a lower value for DA_total(R1, R2). However, as argued earlier, its effect is beyond the scope of this paper since it introduces a system oriented parameter, such as the LRU-buffer size.
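The asymmetry introduced by the path buffer can be explored with a short Python sketch of Eqs. 8-10 for trees of equal height. All names below (`level_params`, `intsect`, `join_costs`) are illustrative. One point is an assumption of this sketch and is not stated in the text: for j = h-1 the parent level of R1 is its root, which is treated here as a single node intersecting every entry of R2, instead of extrapolating Eq. 3 to level h.

```python
import math


def level_params(n_objects, density, max_entries, n_dims, fill=0.67):
    """Height h (Eq. 2) and {j: (N_j, s_j)} for j = 1..h-1 (Eqs. 3-5)."""
    fanout = fill * max_entries
    h = 1 + math.ceil(math.log(n_objects / fanout, fanout))
    levels, d_prev = {}, density
    for j in range(1, h):
        n_j = n_objects / fanout ** j
        d_j = (1 + (d_prev ** (1 / n_dims) - 1) / fanout ** (1 / n_dims)) ** n_dims
        levels[j] = (n_j, (d_j / n_j) ** (1 / n_dims))
        d_prev = d_j
    return h, levels


def intsect(n_nodes, s, q, n_dims):
    """Expected number of nodes of extent s intersected by a window of extent q."""
    return n_nodes * min(1.0, s + q) ** n_dims


def join_costs(data_set, query_set, max_entries, n_dims):
    """Return (NA_total, DA_total) for equal-height trees R1 (data), R2 (query)."""
    h1, lv1 = level_params(*data_set, max_entries, n_dims)
    h2, lv2 = level_params(*query_set, max_entries, n_dims)
    if h1 != h2:
        raise ValueError("this sketch covers trees of equal height only")
    na_total = da_total = 0.0
    for j in range(1, h1):
        (n1, s1), (n2, s2) = lv1[j], lv2[j]
        na = n2 * intsect(n1, s1, s2, n_dims)          # Eq. 6
        if j + 1 == h1:
            parents_hit = 1.0   # ASSUMPTION: single root node, always intersected
        else:
            n1_up, s1_up = lv1[j + 1]
            parents_hit = intsect(n1_up, s1_up, s2, n_dims)
        da_r2 = n2 * parents_hit                       # Eq. 8
        da_r1 = na                                     # Eq. 9
        na_total += 2 * na                             # Eq. 7
        da_total += da_r1 + da_r2                      # Eq. 10
    return na_total, da_total
```

Under these assumptions, swapping the roles of a 20K and an 80K one-dimensional data set leaves the NA estimate unchanged but changes the DA estimate, with the smaller data set being the cheaper choice for the 'query' tree R2.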

3.2. Cost estimation for R-trees of different height

Assume that $h_{R_1}$ and $h_{R_2}$ denote the heights of R1 and R2, respectively, and, without loss of generality, $h_{R_1} > h_{R_2}$. For the upper $h_{R_2}$ levels of the two structures, the SJ algorithm does not take the height inequality into account, i.e., the access cost at each level j is given by slightly modifying Eq. 7, since the corresponding level of R2 is not the same as that of R1 (we call it j’ to distinguish it from j). When the leaf nodes of R2 are being processed, j’ is fixed to value 1 (i.e., denoting the leaf level) and the propagation of R1 continues down to its lower $h_{R_1} - h_{R_2}$ levels. Formally, the total cost in terms of node accesses is given by Eq. 11:

$$NA'\_total(R_1, R_2) = \sum_{j=1}^{h_{R_1}-1} \{ NA(R_1, j) + NA(R_2, j') \} \qquad (11)$$

where

$$j' = \begin{cases} j - (h_{R_1} - h_{R_2}), & h_{R_1} - h_{R_2} + 1 \le j \le h_{R_1} - 1 \\ 1, & 1 \le j \le h_{R_1} - h_{R_2} \end{cases}$$

On the other hand, when a path buffer exists, we have to handle two different cases: the tree R1 that plays the role of the data set being taller or shorter than R2. In the first case (where $h_{R_1} > h_{R_2}$) the propagation of R1 down to its lower levels adds no extra cost (in terms of disk accesses) to the ‘query’ tree R2 that has already reached its leaf level. In the second case (where $h_{R_1} < h_{R_2}$) each propagation of the ‘query’ tree R2 down to its lower levels adds equal cost to the ‘data’ tree R1 (in correspondence to Eq. 9 denoting that the buffer existence does not affect the cost of R1). Formally (with $j' = j - |h_{R_1} - h_{R_2}|$):

$$DA'\_total(R_1, R_2) = \sum_{j=|h_{R_1}-h_{R_2}|+1}^{\max(h_{R_1},h_{R_2})-1} \begin{cases} DA(R_1, j) + DA(R_2, j'), & \text{if } h_{R_1} > h_{R_2} \\ DA(R_1, j') + DA(R_2, j), & \text{if } h_{R_1} < h_{R_2} \end{cases} \; + \sum_{j=1}^{|h_{R_1}-h_{R_2}|} \begin{cases} DA(R_1, j), & \text{if } h_{R_1} > h_{R_2} \\ 2 \cdot DA(R_2, j), & \text{if } h_{R_1} < h_{R_2} \end{cases} \qquad (12)$$

Notice that the corresponding cost formulae of subsection 3.1 for two R-trees of the same height (Eq. 7 and Eq. 10) are special cases of the above formulae (Eq. 11 and Eq. 12, respectively) since they are identical for $h_{R_1} = h_{R_2}$.

In this section we proposed analytical formulae for the cost estimation of a join query between two R-tree-indexed spatial data sets. The proposed cost model is based on primitive data properties only, without requiring the corresponding R-trees to be built. In the next section we evaluate our model by comparing the analytical estimations with experimental results on synthetic and real data sets of dimensionality n ≤ 2.
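The level pairing used in Eq. 11 can be sketched in Python as follows. The names `matched_level` and `join_node_accesses_diff_height` are hypothetical, and the sketch makes one assumption that the text leaves open: both NA(R1, j) and NA(R2, j') are evaluated with Eq. 6 applied to the paired levels (j, j'), so each level contributes the same amount for the two trees.

```python
import math


def level_params(n_objects, density, max_entries, n_dims, fill=0.67):
    """Height h (Eq. 2) and {j: (N_j, s_j)} for j = 1..h-1 (Eqs. 3-5)."""
    fanout = fill * max_entries
    h = 1 + math.ceil(math.log(n_objects / fanout, fanout))
    levels, d_prev = {}, density
    for j in range(1, h):
        n_j = n_objects / fanout ** j
        d_j = (1 + (d_prev ** (1 / n_dims) - 1) / fanout ** (1 / n_dims)) ** n_dims
        levels[j] = (n_j, (d_j / n_j) ** (1 / n_dims))
        d_prev = d_j
    return h, levels


def matched_level(j, h1, h2):
    """Level j' of the shorter tree R2 paired with level j of R1 (h1 >= h2)."""
    return max(1, j - (h1 - h2))


def join_node_accesses_diff_height(ds1, ds2, max_entries, n_dims):
    """Eq. 11 for (N, D) data sets ds1 (taller or equal tree R1) and ds2 (R2)."""
    h1, lv1 = level_params(*ds1, max_entries, n_dims)
    h2, lv2 = level_params(*ds2, max_entries, n_dims)
    if h1 < h2:
        raise ValueError("Eq. 11 is written for h_R1 >= h_R2; swap the data sets")
    total = 0.0
    for j in range(1, h1):
        (n1, s1), (n2, s2) = lv1[j], lv2[matched_level(j, h1, h2)]
        # ASSUMPTION: Eq. 6 evaluated on the paired levels (j, j').
        na = n2 * n1 * min(1.0, s1 + s2) ** n_dims
        total += 2 * na                      # NA(R1, j) + NA(R2, j')
    return total
```

For equal heights `matched_level(j, h, h)` returns j, so the sketch reduces to Eq. 7, in line with the remark that the equal-height formulae are special cases of Eq. 11 and Eq. 12.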

4. Evaluation of the model

The evaluation of the proposed cost formulae was based on a variety of experimental tests on synthetic and real data sets. Synthetic data sets consist of (i) random and (ii) skewed distributions of varying cardinality N (20K ≤ N ≤ 80K) and density D (0.2 ≤ D ≤ 0.8), and have been constructed by using random number generators. Real two-dimensional data sets are parts of the TIGER database of the U.S. Bureau of Census [Bur91]. We built R*-tree indexes [BKSS90] on those data sets and performed several spatial joins. All experimental results were computed on an HP700 workstation with 256 Mbytes of main memory and several Gbytes of secondary storage. On the other hand, the analytical estimations of node (disk) accesses were based on Eq. 7 (Eq. 10) for R-trees of the same height and Eq. 11 (Eq. 12) for R-trees of different height with the average capacity of the tree indexes being set to the typical c = 67% value and the maximum node capacity being set to M = 84 (M = 50) for dimensionality n = 1 (n = 2), values that correspond to page size of 1 Kbyte.

4.1. Comparison results on uniform-like data

By evaluating the analytical formulae on random (uniform-like) data we conclude the following:

(i) When no buffering scheme is adopted (i.e., the estimated number of node accesses NA is evaluated) then the accuracy of the estimated cost is always high, with the relative error never exceeding 10%.

(ii) When a path buffer is adopted then the estimated cost of R2 (i.e., the tree that plays the role of the query set) is always very close to the actual cost (relative error usually below 5%), while the estimated cost of R1 (i.e., the tree that plays the role of the data set) is usually 10%-15% away from the experimental result. The accuracy of the cost estimation for R2 is expected since the existence of a buffer has been considered in Eq. 8, while Eq. 9 assumes that the buffer existence does not affect R1, an assumption that lowers the accuracy of its cost estimation. However, as already mentioned in subsection 3.1, the exception to that general rule is hard to model.

(iii) The above conclusions hold for all random data set combinations in both dimensionalities considered (n = 1, 2).

Figures 5a and 5b illustrate the experimental and analytical results of node and disk accesses (denoted by NA and DA) for one- and two-dimensional random data sets, respectively, for all $N_{R_1}$ / $N_{R_2}$ combinations (relevant conclusions also hold for varying density D). The linearity of the plots in Figure 5a is due to the fact that all R-tree indexes on one-dimensional data sets of our tests are of equal height h = 3. On the other hand, this is not the case for the results on two-dimensional data sets (Figure 5b) since the height of the two-dimensional indexes of cardinality 20K ≤ N ≤ 40K (60K ≤ N ≤ 80K) is equal to h = 3 (h = 4). The above conclusion is clearly illustrated in Figures 6a and 6b, which show the NA and DA costs for equally populated indexes of dimensionality n = 1 and n = 2, respectively.

According to our experiments it also turns out that the cost formulae for the estimation of disk accesses DA are non-symmetric with respect to the trees R1 and R2, a fact that has been already mentioned during the presentation of the cost models in Section 3. The comparison results confirm that, for tree indexes of equal height, the choice of the less (more) populated index to play the role of the ‘query’ (‘data’) tree is the best choice for the effectiveness of the SJ algorithm, which however is not a general rule for trees of different height (all areas in Figure 7 follow the rule, except AREA 2 and AREA 3 in Figure 7b).

4.2. Comparison results on non-uniform data

The cost model presented in Section 3 is also shown to be applicable for non-uniform data sets. According to the discussion of [TS96], appropriate transformations are necessary in order to reduce the uniformity assumption of the underlying analytical model from global (i.e., assuming the global workspace) to local (i.e., assuming a small sub-area of the workspace). This is done by transforming the actual density $D_{R_i}$ to a set of local densities (applying sampling procedures) as described in [TS96]. The relative error was always shown to be around 10%-20%. In addition, for the real data sets used in our experiments, a relative error below 15% appeared for all combinations.

In conclusion, the estimated cost of spatial joins is very close to the actual experimental results for uniform-like or non-uniform data distributions. This fact makes the proposed formulae useful tools for SDBMS query processors and optimizers, especially when complex queries (e.g. nested joins) are involved. Compared to related work about estimation models for spatial joins [Gun93, AS94], our work provides robust analytical formulae which: (i) do not need knowledge of the two R-tree structures, since they are only based on data properties (recall that both related studies assumed knowledge of index properties), and (ii) are shown to be accurate by performing a wide set of experiments on both uniform-like and non-uniform data sets (not supported by previous work).

Figure 5: Experimental vs. analytical NA and DA access costs for uniform-like data. Panels: (a) n = 1, (b) n = 2. Horizontal axis: NR1 / NR2 combination (20K / 20K through 80K / 80K). Series: exper(NA), exper(DA), anal(NA), anal(DA).

Figure 6: Behavior of NA and DA plots for trees of the same or different height. Panels: (a) NR1 = NR2 (n = 1), (b) NR1 = NR2 (n = 2). Horizontal axis: NR1 / NR2 combination (20K / 20K through 80K / 80K). Series: anal(NA), anal(DA).

Figure 7: Analytical DA costs for varying cardinality NR1 or NR2. Panels: (a) n = 1, (b) n = 2. Horizontal axis: NR1 or NR2 (20K through 80K). Vertical axis: DA. Series: NR2=20K, NR2=80K, NR1=20K, NR1=80K. Marked regions: AREA 1 through AREA 4.
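The accuracy statements in this section are expressed as relative errors between analytical and experimental costs. A minimal sketch follows, assuming the usual definition |analytical - experimental| / experimental (the exact definition is not restated here) and using purely hypothetical cost values rather than figures from the experiments:

```python
def relative_error(analytical: float, experimental: float) -> float:
    """Relative deviation of the analytical estimate from the measured cost."""
    if experimental == 0:
        raise ValueError("experimental cost must be non-zero")
    return abs(analytical - experimental) / experimental


# Hypothetical values for illustration only (not taken from the paper).
estimated_na = 4600.0   # analytical node accesses
measured_na = 5000.0    # experimental node accesses
error = relative_error(estimated_na, measured_na)
print(f"relative error: {error:.1%}")
```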

5. Conclusion

Traditional join implementation techniques are not efficient in multidimensional space due to the lack of ordering, which is an inherent characteristic of spatial data. Several techniques for the efficient implementation of the spatial join operation exist in the literature. However, cost models that accurately estimate the I/O cost of a spatial join between two data sets of various data distributions (uniform and non-uniform ones) are lacking. In this paper we proposed cost models that estimate the cost of a join query between two spatial data sets indexed by two R-tree-based data structures. The proposed cost formulae can be used without any knowledge of the underlying R-tree indexes since they are functions of data properties only. We also presented comparison results between analytical estimations and actual tests using R*-trees for uniform-like and non-uniform (synthetic and real) data sets of dimensionality n ≤ 2. The comparison showed that the proposed formulae achieve accurate cost estimations (the relative error being usually below 15%) for all data sets. Hence we consider that they would be useful tools for SDBMS query optimizers, especially when complex spatial queries (e.g. nested joins) are involved.

We are currently working on the extension of the proposed model in two directions: (i) the support of spatial join queries using spatial operators other than overlap, and (ii) the accurate estimation of the selectivity of a spatial join query for uniform and non-uniform distributions of data. As discussed in [PT97], a transformed query window Q has to be defined in order to retrieve a multidimensional (topological, directional or distance) operator OP, instead of the ‘classic’ overlap operator. This transformation has already been adopted for the analytical cost estimation of several queries in GIS applications of dimensionality n = 2 [TP95] and n = 3, 4 [PTS97]. On the other hand, based on the corresponding formula for the selectivity estimation of a range query [TS96], we aim at a formula that would estimate the number of overlapping pairs of objects at the leaf level of the two indexes based on the roles of the ‘data’ and the ‘query’ tree.

Future work also includes the following tasks:

• variable buffer size: In the cost analysis of spatial join we have considered a simple path buffer for the implementation of the join algorithms. Although introducing the size of an LRU-buffer mechanism as a parameter into the cost model shifts it to non-pure analysis, we plan to take it into consideration because of its applicability in commercial DBMSs.

• accuracy for data sets in high-dimensional space: R-tree implementations originally designed for n = 2, such as the R*-tree, are not efficient in high-dimensional space, thus leading to specialized R-tree-based methods (e.g. [BKK96]). As a result, the behavior of the proposed cost model should also be studied for n >> 2.

• parallel processing of spatial join: Spatial join implementation algorithms presented in this paper could be parallelized, as proposed in [BKS96]. We plan to work towards the modification of our model in order to support parallel implementations of spatial join processing.
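To illustrate why the buffer size matters for the first task above, the following sketch counts disk accesses (buffer misses) for a sequence of node accesses under an LRU buffer of a given size. It is a simulation under stated assumptions, not part of the proposed analytical model: the function name, the node labels, and the access trace are hypothetical.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def disk_accesses(trace: Iterable[Hashable], buffer_pages: int) -> int:
    """Count buffer misses for a node-access trace under an LRU buffer.

    With buffer_pages == 0 every node access is a disk access, so the
    result equals the number of node accesses (NA).
    """
    if buffer_pages < 0:
        raise ValueError("buffer_pages must be non-negative")
    buffer: "OrderedDict[Hashable, None]" = OrderedDict()
    misses = 0
    for node in trace:
        if node in buffer:
            buffer.move_to_end(node)      # hit: mark as most recently used
            continue
        misses += 1                       # miss: the node is read from disk
        if buffer_pages == 0:
            continue
        if len(buffer) >= buffer_pages:
            buffer.popitem(last=False)    # evict the least recently used node
        buffer[node] = None
    return misses


# Hypothetical access trace over the nodes of a small tree.
trace = ["root", "a", "a1", "a", "a2", "root", "b", "b1"]
for pages in (0, 2, 4):
    print(pages, disk_accesses(trace, pages))
```

The number of misses never exceeds the length of the trace, which mirrors the relation between DA and NA in the comparison results: a buffer can only remove accesses, and how many it removes depends on its size and on the access order.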
Furthermore, the analytical models for range and join queries using R-tree-based structures, proposed in [TS96] and in the current paper, respectively, provide average case estimation of access cost. Their appropriate extension and formalization, according to the indexability theory issues presented in [HKP97], constitute main goals for further research.

Acknowledgements

The research was partially supported by the European Commission funded TMR project “CHOROCHRONOS: A Research Network for Spatiotemporal Database Systems”.

References

[AS94] W.G. Aref, H. Samet, "A Cost Model for Query Optimization Using R-Trees", Proc. 2nd ACM-GIS Workshop, 1994.
[BHF93] L. Becker, K. Hinrichs, U. Finke, "A New Algorithm for Computing Joins with Grid Files", Proc. 9th IEEE Data Engineering Conf., 1993.
[BKK96] S. Berchtold, D.A. Keim, H.-P. Kriegel, "The X-Tree: An Index Structure for High-Dimensional Data", Proc. 22nd VLDB Conf., 1996.
[BKS93] T. Brinkhoff, H.-P. Kriegel, B. Seeger, "Efficient Processing of Spatial Joins Using R-trees", Proceedings of ACM SIGMOD Conf., 1993.
[BKS96] T. Brinkhoff, H.-P. Kriegel, B. Seeger, "Parallel Processing of Spatial Joins Using R-trees", Proc. 12th IEEE Data Engineering Conf., 1996.
[BKSS90] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proceedings of ACM SIGMOD Conf., 1990.
[BKSS94] T. Brinkhoff, H.-P. Kriegel, R. Schneider, B. Seeger, "Multi-Step Processing of Spatial Joins", Proceedings of ACM SIGMOD Conf., 1994.
[Bur91] Bureau of the Census, TIGER/LINE Census Files, March 1991.
[FK94] C. Faloutsos, I. Kamel, "Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension", Proc. 13th ACM PODS Symposium, 1994.
[Gun93] O. Gunther, "Efficient Computations of Spatial Joins", Proc. 9th IEEE Data Engineering Conf., 1993.
[Gut84] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", Proceedings of ACM SIGMOD Conf., 1984.
[HKP97] J.M. Hellerstein, E. Koutsoupias, C.H. Papadimitriou, "On the Analysis of Indexing Schemes", Proc. 16th ACM PODS Symposium, 1997.
[KF93] I. Kamel, C. Faloutsos, "On Packing R-trees", Proc. 2nd CIKM Conf., 1993.
[KS97] N. Koudas, K.C. Sevcik, "Size Separation Spatial Join", Proceedings of ACM SIGMOD Conf., 1997.
[LL98] S.T. Leutenegger, M.A. Lopez, "The Effect of Buffering on the Performance of R-Trees", Proc. 14th IEEE Data Engineering Conf., 1998.
[LR94] M.-L. Lo, C.V. Ravishankar, "Spatial Joins Using Seeded Trees", Proceedings of ACM SIGMOD Conf., 1994.
[LR96] M.-L. Lo, C.V. Ravishankar, "Spatial Hash-Joins", Proceedings of ACM SIGMOD Conf., 1996.
[ME92] P. Mishra, M.H. Eich, "Join Processing in Relational Databases", ACM Computing Surveys, vol. 24(1), pp. 63-113, 1992.
[Ore89] J. Orenstein, "Redundancy in Spatial Databases", Proceedings of ACM SIGMOD Conf., 1989.
[PD96] J.M. Patel, D.J. DeWitt, "Partition Based Spatial-Merge Join", Proceedings of ACM SIGMOD Conf., 1996.
[PSTW93] B.-U. Pagel, H.-W. Six, H. Toben, P. Widmayer, "Towards an Analysis of Range Query Performance", Proc. 12th ACM PODS Symposium, 1993.
[PT97] D. Papadias, Y. Theodoridis, "Spatial Relations, Minimum Bounding Rectangles, and Spatial Data Structures", International Journal of Geographic Information Science, vol. 11(2), pp. 111-138, 1997.
[PTS97] D. Papadias, Y. Theodoridis, E. Stefanakis, "Multi-Dimensional Range Query Processing with Spatial Relations", Geographical Systems, in press, 1997.
[Rot91] D. Rotem, "Spatial Join Indices", Proc. 7th IEEE Data Engineering Conf., 1991.
[TP95] Y. Theodoridis, D. Papadias, "Range Queries Involving Spatial Relations: A Performance Analysis", Proc. 2nd COSIT Conf., 1995.
[TS96] Y. Theodoridis, T. Sellis, "A Model for the Prediction of R-tree Performance", Proc. 15th ACM PODS Symposium, 1996.