
Simple QSF-Trees: An Efficient and Scalable Spatial Access Method

Byunggu Yu
Dept. of Computer Science, Illinois Institute of Technology, 10 W 31st St., Chicago, IL 60616, (312) 567-5152

Ratko Orlandic*
Dept. of Computer Science, Illinois Institute of Technology, 10 W 31st St., Chicago, IL 60616, (312) 567-5343

Martha Evens
Dept. of Computer Science, Illinois Institute of Technology, 10 W 31st St., Chicago, IL 60616, (312) 567-5153

ABSTRACT The development of high-performance spatial access methods that can support complex operations of large spatial databases continues to attract considerable attention. This paper introduces QSF-trees, an efficient and scalable structure for indexing spatial objects, which has some important advantages over R*-trees. QSF-trees eliminate overlapping of index regions without forcing object clipping or sacrificing the selectivity of spatial operations. The method exploits the semantics of topological relations between spatial objects to further reduce the number of index nodes visited during the search. A series of experiments involving randomly-generated spatial objects was conducted to compare the structure with two variations of R*-trees. The experiments show QSF-trees to be more efficient and more scalable to the increase in the data-set size, the size of spatial objects, and the number of dimensions of the spatial universe.

Keywords Database management, spatial database, spatial access methods, point access methods, topological relations.

1. INTRODUCTION A large database of multi-dimensional (spatial) objects is essential to many applications, including medical and geographic information systems, traffic and environmental simulations, and cellular-telephone locator systems. Of particular interest for these applications are the absolute or relative positions of objects in space and their spatial relationships. The underlying database must maintain the knowledge of these properties and relationships of objects. The database must support typical queries involving topological relations, i.e. the relations that stay invariant under translation, rotation, and scaling [20]. For example, the spatial operation of selecting objects that contain a given region uses the topological relation contains in its selection predicate. Other meaningful topological relations between solid objects are: equal, inside, covers, covered_by, overlap, meet, and disjoint [20].

The performance of spatial operations heavily depends on the choice of the underlying spatial access method (SAM). While certain SAMs support more accurate representations of the spatial extent [8, 17, 18], most spatial access methods employ some form of approximation of spatial objects in order to reduce storage overhead and simplify the search and update operations. Typical approximations of regions in space include minimum bounding rectangles (MBRs) [1, 9], minimum bounding circles (MBCs) [16, 25], and minimum bounding polygons (MBPs) [13]. Since MBR-based approximations are characterized by intermediate complexity and accuracy, they tend to be used more frequently than other kinds of approximations. In the rest of the paper, we consider only the methods that employ MBR-based approximations of objects.

* The author’s work was supported in part by the DOE grant no. DEFG02-95-ER25254.

Perhaps the most popular MBR-based spatial access method, used in several commercial database systems, is the R*-tree [1]. R*-trees tend to be more efficient than other schemes based on MBR approximations [6]. However, insertions into R*-trees are rather complex. In addition, the search performance can suffer due to an excessive overlap of MBRs in the upper levels of the tree. This problem becomes more pronounced with growing size or number of objects, and with greater dimensionality of the spatial universe [2]. Furthermore, like other variants of R-trees [2, 9, 24], R*-trees do not discriminate search operations involving different topological relations. This problem is addressed by Papadias et al. [20], who demonstrated that specializing search operations to take advantage of the semantics of topological relations can result in a significant reduction of page accesses during the search. In this paper, we introduce a new MBR-based spatial access method, called QSF-tree (“Query-to-Search-and-Filter”-tree), that completely eliminates overlap of index regions and fully discriminates spatial operations with different topological relations. Unlike R*-trees, which represent object MBRs as an

ordered list of intervals projected on individual axes of the multidimensional space, QSF-trees represent each MBR with its two opposing vertex (point) vectors. However, unlike spatial access methods that apply endpoint transformation of MBRs [6], MBRs are not mapped to points in higher-dimensional space. Instead, QSF-trees apply an original query transformation, calculating search and filtering regions from the given query and using these to search the tree and filter the result set. The structure of QSF-trees is a simple modification of a point access method (PAM) [11, 14, 21] indexing the low endpoints of MBRs and using their high endpoints to filter the result set once the search reaches the lowest level of the tree. The underlying point access method determines the complexity of updates of QSF-trees, which is typically much lower than the update complexity of R*-trees. Furthermore, because typical PAMs partition the spatial universe into non-overlapping regions, the problem of region overlap is automatically eliminated. This increases the performance of search operations without incurring the penalties associated with expensive clipping of spatial objects [24]. Moreover, due to the elimination of overlap, the structure is more scalable than R*-trees to the increase in the data-set size, the size of spatial objects, and the number of dimensions of the spatial universe.

An important property of QSF-trees is that they fully discriminate search operations involving different topological relations. For a given query window and one of seven topological relations other than disjoint, the scheme constructs a search window and a search predicate. Each query involving a different topological relation is mapped onto one of seven different types of (search-window, search-predicate) pairs. As we will see later in the paper, this mapping allows greater selectivity of search operations at the higher levels of the tree, which further reduces the number of page accesses.

The rest of the paper is organized as follows. In Section 2, we introduce the semantics of topological relations between solid spatial objects and their MBRs. Since in this paper we intend to position QSF-trees against R*-trees, in Section 3, we discuss the strengths and weaknesses of R*-trees and other variants of R-trees in more detail. In Section 4, we introduce the basic idea of QSF-trees. Then, in Section 5, we present the design of the structure as well as the criteria used by the search operations. Section 6 presents the results of our experimental study of the search performance of QSF-trees implemented as a simple modification of kdB-trees [21]. There, the original R*-trees [1] as well as the R*-trees that differentiate topological relations [20] are used as benchmarks for the comparison. The results corroborate our claim that even the simple QSF-trees, as presented in this paper, are more efficient and more scalable than either of these variants of R*-trees. Moreover, as we will see in Section 7, which discusses the results and our future work in this area, there are numerous optimizations that can further improve the performance of simple QSF-trees without sacrificing their simplicity or the efficiency of insertions.

2. TOPOLOGICAL RELATIONS Although the discussion in this paper can be generalized to accommodate other kinds of spatial objects, i.e. points and lines, in this paper we assume solid spatial objects (also referred to as regions) representing sets of points with clearly distinguished interior, boundary, and exterior. In [20], Papadias et al. defined eight meaningful topological relations between solid objects: equal, contains, inside, covers, covered_by, overlap, meet, and disjoint. The relations provide complete coverage of all possible configurations of two solid objects [20]. The semantics of topological relations adopted in [20] makes the relations pairwise disjoint, i.e. no two relations refer to the same configuration of two spatial objects. Unfortunately, pairwise-disjoint relations lead to numerous exceptions and the meaning of individual relations cannot be stated as a simple rule. As a result, logical expressions involving pairwise-disjoint topological relations tend to be rather cumbersome.

Topological relation | Semantics of the relation
equal_to (r, q) | All points of r are in the interior or on the boundary of q and vice versa.
contains (r, q) | All points of q are in the interior of r.
covers (r, q) | All points of q are in the interior or on the boundary of r.
contained_by (r, q) | All points of r are in the interior of q.
covered_by (r, q) | All points of r are in the interior or on the boundary of q.
overlaps (r, q) | Some points of r are in the interior of q.
meets (r, q) | Some points of r are on the boundary of q, but no point of r is in the interior of q.
disjoint_with (r, q) | All points of r are in the exterior of q.

Table 1: Semantics of topological relations.

In this paper, we assume the simple semantics of topological relations given in Table 1. The relations are defined in terms of the interior, boundary, and exterior of participating objects. Observe, the relation contained_by is the converse of contains, while covers is the converse of covered_by. To facilitate the interpretation of the relations and to avoid possible confusion, we have slightly renamed most of the relations. With this, the predicate contained_by (r, q), involving the relation contained_by and two spatial objects r and q, is read as “r is contained by q”. Similarly, meets (r, q) is simply read as “r meets q”, etc.

Query predicate | Selection predicate
equal_to (r, q) | equal_to (r’, q’)
contains (r, q) | contains (r’, q’)
covers (r, q) | covers (r’, q’)
contained_by (r, q) | contained_by (r’, q’)
covered_by (r, q) | covered_by (r’, q’)
overlaps (r, q) | overlaps (r’, q’)
meets (r, q) | overlaps (r’, q’) ∨ meets (r’, q’)*

Table 2: Topological relations that MBRs convey about objects.

With respect to the topological relations in [20], only the semantics of the relations covers, covered_by, and overlaps

(called cover, covered_by, and overlap in [20]) is changed. However, with these changes, certain relations are no longer pairwise disjoint. Some relations have subset relationships. That is, the set of configurations between two objects that constitute a relation rel1 could be a subset of configurations constituting a relation rel2, which we denote by rel1 ⊂ rel2. The following subset relationships can be inferred from Table 1: equal_to ⊂ covers, equal_to ⊂ covered_by, contains ⊂ covers, contained_by ⊂ covered_by, covers ⊂ overlaps, and covered_by ⊂ overlaps.

Since MBRs are only approximations of objects, their topological relations do not necessarily reflect the relations between the actual objects. However, with the semantics of topological relations adopted in this paper, one can observe a strong correlation between the topological relations of actual objects and their respective MBRs. In particular, if two objects satisfy the relation equal_to, then their MBRs must satisfy the same relation. The same is true for the relations contains, covers, contained_by, covered_by, and overlaps. Observe, the opposite does not hold. For example, if two MBRs r’ and q’ satisfy the predicate contains (r’, q’), the actual objects may satisfy several different relations [20].

Table 2 shows how the relations between actual objects, other than disjoint_with, are reflected on the relations between their MBRs (footnote a). In the table, r and q denote actual spatial objects, while r’ and q’ represent their MBRs. For each row of the table, if the predicate of the first column involving two spatial objects is satisfied, then their MBRs must satisfy the predicate in the second column. The validity of this table can be verified by observing possible relations between MBRs if the actual objects satisfy the given topological relation. Alternatively, one can verify this by consulting Table 1 in [20] and applying the semantics of topological relations defined in this paper (footnote b).

The table also shows how the queries involving actual objects should be processed by an MBR-based SAM. For a given query involving a topological relation, the spatial access method must retrieve all MBRs satisfying the corresponding predicate in the second column of the table. For example, to answer the query “find all objects r that contain the given query object q”, we must retrieve all MBRs containing the query window q’ (the MBR of q). In the refinement step, the actual objects corresponding to the selected MBRs are consulted to eliminate false hits. For this reason, a predicate in the first column of Table 2 is called a query predicate, while the corresponding predicate in the second column is referred to as a selection predicate. Later, we will introduce the notion of a search predicate used in traversing the internal nodes of a SAM structure.

a. Spatial access methods that apply MBR approximations of objects are rather awkward in processing the relation disjoint_with. As noted in [20], this relation should be processed by a sequential scan of the leaf nodes in the structure.

b. For the relation meets, the predicate of the second column of Table 2 can be restricted further, which we indicated by labeling the predicate with a “*”. As a result of this simplification, processing of the relation meets can result in some unnecessary false hits. Since we did not change the semantics of the relation meets, Table 1 in [20] gives precisely the configurations for which the refinement is not necessary. Using that table, one can eliminate these unnecessary false hits.

In closing this section, we define three other relations involving two points in space or a point and a rectangle. Given two points p1 and p2 and a rectangle r, we define the relations pEqual, pContains and pCovers as follows. If pEqual (p1, p2) is true, then p1 and p2 have the same coordinates in space. If pContains (r, p1) holds, then p1 appears in the interior of r. On the other hand, if pCovers (r, p1) holds, then the point p1 lies in the interior or on the boundary of r.
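The three point relations defined at the end of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: points are coordinate tuples, a rectangle is a (low endpoint, high endpoint) pair, and the names p_equal, p_contains, and p_covers are hypothetical stand-ins for pEqual, pContains, and pCovers.

```python
from typing import Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # assumed layout: (low endpoint, high endpoint)


def p_equal(p1: Point, p2: Point) -> bool:
    """pEqual: both points have the same coordinates."""
    return tuple(p1) == tuple(p2)


def p_contains(r: Rect, p: Point) -> bool:
    """pContains: p lies strictly in the interior of r along every axis."""
    low, high = r
    return all(lo < x < hi for lo, x, hi in zip(low, p, high))


def p_covers(r: Rect, p: Point) -> bool:
    """pCovers: p lies in the interior or on the boundary of r."""
    low, high = r
    return all(lo <= x <= hi for lo, x, hi in zip(low, p, high))
```

The only difference between p_contains and p_covers is the strictness of the comparison, which mirrors the interior versus interior-or-boundary distinction in the definitions.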

3. R-TREES AND THEIR VARIANTS An R-tree [9] is a height-balanced hierarchy of nodes (pages) adapted for multi-dimensional spaces. The MBRs of actual objects are stored in the leaf nodes of the tree. Each entry of an internal node represents the smallest rectangle enclosing all MBRs of its child node (the node indicated by the pointer of that entry). Search starts at the root node and propagates downward, possibly visiting multiple paths in the tree. At each node of the structure, all entries are tested to select those that do not fall outside (that are not disjoint_with) the given query window. Whenever a part of the query region falls within the overlap of two or more MBRs at upper levels of an R-tree, the search must branch into their subtrees, resulting in an increased number of disk accesses. The greater the overlap, the greater the probability that one has to visit multiple paths in the tree [2, 6]. R+-trees [24] generally improve the performance of R-trees by eliminating region overlap. But this is accomplished by clipping MBRs and storing their individual pieces separately. The number of entries in the structure is increased and, with this, its storage overhead. In databases containing large objects, the search performance of R+-trees can even be lower than that of the original R-trees [7]. Even worse, if all MBRs of an overfilled index node overlap on the same point, the R+-tree splitting algorithm breaks down entirely [7]. R*-trees do not necessarily eliminate overlap, but they do not force expensive clipping of objects either. Instead, the idea is to group MBRs more effectively. One optimization of R*-trees, called forced reinsert, is to defer splitting of an index node and instead re-insert some of its entries. This can eliminate the negative impact of certain objects inserted early in the dynamic growth of the structure on the size, shape, and overlap of the bounding rectangles in the upper levels of the tree.
In addition, R*-trees optimize node splitting to take into account not only the size of the resulting MBRs, as is the case in R-trees, but also their overlap and the sum of their intervals projected on individual axes [1]. Through experimental evidence confirmed in later studies [6], Beckmann et al. demonstrated that, as a result of the above optimizations, the search performance of R*-trees can be significantly better than that of the original R-trees [1]. However, the penalty that must be paid is the increased complexity of insertion. In addition, even though the overlap rate of R*-trees is

generally lower than in the original R-trees, it is far from insignificant. Just as with R-trees, the probability that MBRs will overlap increases with more skewed distribution of objects in space, larger objects, and larger numbers of spatial objects stored in the structure. Moreover, as shown in [2], overlap (percentage of space covered by more than one rectangle) in R*-trees increases with the growing number of dimensions of the universe. As a result of their investigation of the overlap in R*-trees, Berchtold et al. proposed a variation of R*-trees, called the X-tree, which is optimized for high-dimensional spaces [2]. Instead of allowing splits that introduce high overlap, index nodes are extended over the usual block size. These clusters of nodes, called supernodes, are searched sequentially. Therefore, the advantages of reduced overlap in higher-dimensional spaces come at the expense of linear scans of nodes constituting the supernodes and an even greater complexity of dynamic updates.

A common characteristic of R-trees and their variants is their failure to discriminate spatial operations with different topological relations. Regardless of the topological relation specified in the spatial query, the search predicate applied to the upper-level index entries is always the same: not_disjoint (R’, q’), where not_disjoint includes the seven topological relations other than disjoint_with, R’ is the rectangle of the tested entry, and q’ is the query window (i.e., the MBR of the query object). As a result, given the same structure, spatial operations involving different topological relations visit exactly the same index nodes.

This problem is attacked by Papadias et al. [20]. Their approach is to use specialized search criteria for different topological relations in traversing the upper levels of R-trees and their variants. For query predicates involving different topological relations, a different search predicate is used to test the entries in the upper levels of the tree. The experimental results presented in [20], confirmed in our experiments (see Section 6 of this paper), show that specializing search operations to discriminate different topological relations while traversing the tree can result in a significant reduction in page accesses [20].

4. BASIC IDEA OF QSF-TREES The discussion of Section 3 suggests two viable avenues one can pursue to improve the performance of R*-trees. First, by eliminating the problem of region overlap without incurring expensive object clipping, one can hope to achieve potentially significant improvements in the performance of search operations. In the same way, one can also hope to improve the ability of the spatial structure to handle large sets of relatively large objects, objects in high-dimensional spaces, and nonuniform distributions of objects in space. Second, by discriminating the search operations with different topological relations, one can improve the selectivity of search predicates and, thereby, further reduce the number of page accesses.

It is important to notice that, while there are some PAMs which represent exceptions (e.g., BANG file [4]), most point access methods [15, 21, 23] incur no region overlap. As an example, consider Figure 1, which illustrates how a kdB-tree [21] partitions a 2-dimensional universe. In general, the N-dimensional space is recursively divided into non-overlapping regions corresponding to the individual nodes of the structure by means of (N-1)-dimensional iso-oriented hyperplanes. Each dividing hyperplane is perpendicular to one of the axes and the directions of the hyperplanes alternate among individual dimensions.

[Figure 1 graphic: a 2-dimensional universe partitioned into regions R1’ through R4’ holding the points a through k, shown with the corresponding kdB-tree.]

Figure 1: A kdB-tree and its partition of space.

Some spatial access methods map N-dimensional rectangles to points in 2N-dimensional space and use a PAM to maintain and search these points [12, 22]. If the underlying PAM partitions the universe into non-overlapping regions, these transformation schemes automatically eliminate region overlapping. Typically, the N-dimensional MBRs are represented by their two opposing points and the coordinates of these points in the N-dimensional space are treated as coordinates of a single point in the dual space. This is called endpoint transformation of MBRs.

Despite its conceptual simplicity and an automatic way to prevent overlap, MBR transformation has serious disadvantages [6]. The transformation often disseminates objects in close proximity in the original space throughout the dual space and produces a skewed distribution of points even though the original objects may be uniformly distributed [3]. Expressing queries in dual space is much more complex than in the original space, which can lead to performance degradations. Some complex queries are not even possible in the dual space [10, 19].

The idea behind QSF-trees can be stated as follows: employ the space-partitioning approach of a point access method which incurs no overlap, and avoid both object clipping and transformation by applying a suitable query transformation in the original space. Like typical MBR-transformation schemes, QSF-trees use two opposing vertices to represent an N-dimensional MBR: the smallest vertex vector (low endpoint) and the largest vertex vector (high endpoint). The low endpoint is the vertex of the MBR (in fact, an N-dimensional vector) with the lowest coordinates along each dimension. In contrast, the high endpoint is the vertex of the MBR with the highest coordinates along each dimension. (Note, due to the geometry of rectangles, each MBR

has exactly one low and one high endpoint.) However, unlike transformation schemes, QSF-trees do not apply transformation of MBRs. An MBR is treated simply as a pair containing its low and high endpoints in the original space. Like other MBR-based SAMs, QSF-trees translate the problem of selecting all objects r that satisfy the given topological relation with respect to the given query object q (search problem SP1) into the problem of selecting all MBRs r’ that satisfy a transformed predicate involving r’ and the MBR q’ of the query object (problem SP2). With the semantics of topological relations adopted in this paper, for any topological relation involving actual objects, the transformed predicates are given in the second column of Table 2 (refer back to Section 2). Thus, the translation from SP1 to SP2 entails a transformation of the given query predicate into the corresponding selection predicate. However, QSF-trees perform an additional step of query translation from the search problem SP2 to a search problem SP3. The problem SP3 can be understood in the light of the following question—where could the low and high endpoints of the MBRs that satisfy the selection predicate of the problem SP2 possibly lie in the space? Without showing how it can be derived, let us call the region in space containing low endpoints of all possible MBRs that could satisfy the given selection predicate the L-region. Similarly, the region in space containing high endpoints of all possible MBRs that could satisfy the selection predicate is called the H-region. Assuming that L- and H-regions are derived from the given query window q’ and the selection predicate of the problem SP2, the problem SP3 can be formulated as follows—find MBRs whose low endpoints lie in the L-region and whose high endpoints lie in the H-region. Thus, the transformed selection predicate of the problem SP3 tests whether the low and high endpoint of an object MBR fall in the L- and H-region, respectively. 
This two-phase query transformation translates the original query into a problem of finding relevant points in space. Since the latter problem can be solved by adopting or reusing a point access method which partitions the space into non-overlapping regions, we can automatically eliminate the possibility of region overlap and, therefore, avoid unnecessary page accesses. Furthermore, since the L- and H-regions can be tuned to the semantics of individual topological relations, with a careful design, we can achieve a differentiation of search operations with different topological relations that could further reduce the average number of page accesses. Of course, the realization of the above possibilities as well as the overall effectiveness of the mechanism depends on the actual design of the structure, which is discussed in the next section. But, before proceeding to the design of QSF-trees, it is important to point out that the two-phase query transformation, as described in this section, is purely conceptual. Its sole purpose is to explain the basic idea of QSF-trees. In reality, the L- and H-regions are computed directly from the query predicate. In other words, QSF-trees translate the problem SP1 directly into the problem SP3.

5. DESIGN OF SIMPLE QSF-TREES In the following, we assume an N-dimensional universe U, which itself is a bounded rectangle, and the universal sets R and P of all rectangles and points in U, respectively. We define two functions l: R → P and h: R → P which, for each rectangle r’∈R, give its low endpoint l(r’) and high endpoint h(r’), respectively. Then, an MBR r’ is represented as an 〈l(r’), h(r’)〉 pair. The values li(r’) and hi(r’) give the coordinates of the low and the high endpoint, respectively, of the MBR r’ along the dimension i=1,..,N. Note, l(U) and h(U) give the low endpoint (origin) and the high endpoint of the universe, respectively.

In the implementation of QSF-trees, for each dimension i of the universe, we dynamically keep track of two values, Mi and mi. Given an MBR r’, let lengthi(r’) = hi(r’) - li(r’) be the length of its interval projected on the axis i. Then, for each i=1,..,N, Mi and mi are the maximum and minimum lengthi(r’) of MBRs r’ of all objects r in the given data set, respectively. For each topological relation between actual spatial objects, Table 3 gives the coordinates of the L- and H-region used in searching a QSF-tree. For the topological relation equal_to, the L- and H-region are just the low and the high endpoint of the query window, respectively. There are two cases when L- and H-regions cannot be generated for the given query, because the length of the region's interval projected on an axis i is negative. For the relations contains and covers, this will happen when hi(q’) > li(q’) + Mi. Similarly, for the relations contained_by and covered_by, the length of the region along one dimension is negative when hi(q’) < li(q’) + mi. In these situations, no MBR in the structure can satisfy the query. Therefore, in principle, these instances need not generate any page access.

Query predicates | L- and H-region coordinates
equal_to (r, q) | li(Le) = hi(Le) = li(q’); li(He) = hi(He) = hi(q’)
contains (r, q), covers (r, q) | li(Lc) = max{(hi(q’) - Mi), li(U)}; hi(Lc) = li(q’); li(Hc) = hi(q’); hi(Hc) = min{(li(q’) + Mi), hi(U)}
contained_by (r, q), covered_by (r, q) | li(Lcb) = li(q’); hi(Lcb) = max{(hi(q’) - mi), li(q’)}; li(Hcb) = min{(li(q’) + mi), hi(q’)}; hi(Hcb) = hi(q’)
overlaps (r, q), meets (r, q) | li(Lom) = max{(li(q’) - Mi), li(U)}; hi(Lom) = hi(q’); li(Hom) = li(q’); hi(Hom) = min{(hi(q’) + Mi), hi(U)}

Table 3: Computation of L- and H-regions.

Figures 2(a) through 2(d) illustrate the L- and H-regions generated for different topological relations with a query window q’, assuming that the origin of the universe is in the lower left corner of each figure. In the figures, vdim is an N-dimensional vector whose component along each dimension i has magnitude

Mi - lengthi(q’). Similarly, vmax and vmin are N-dimensional vectors whose components along each dimension i have magnitude Mi and mi, respectively. For each topological relation, Figure 2 shows the actual selection predicates applied to the low and high endpoints of object MBRs stored in a QSF-tree. For example, the selection predicate for the relation contains tests whether the low and high endpoint of an MBR r’ fall in the interior of the regions Lc (L-region of the relation contains and covers) and Hc (H-region of the relation contains and covers), respectively. On the other hand, the selection predicate for the relation covers tests whether the low and high endpoint of an MBR r’ fall in the interior or on the boundary of Lc and Hc, respectively.

One way to index the low and high endpoints of object MBRs is to store the endpoints in a single index structure as individual points in space. Alternatively, low endpoints could be maintained by one index structure and high endpoints by another. An advantage of either approach is that the spatial access method could simply reuse a point access method with no alteration. But, in either case, the separation of individual endpoints would result in two different search operations to answer a query: one traversal to locate the relevant low endpoints falling in the L-region and one to locate the high endpoints falling within the constructed H-region. This would increase the number of page accesses and force a potentially expensive intersection of the result sets returned by the two individual search operations.
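The region computation of Table 3 and the selection predicates of Figure 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names regions and selects, the tuple-based rectangle layout, and the string relation names are all assumptions made for the example. The early None return encodes the two infeasibility conditions stated in the text.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # assumed layout: (low endpoint, high endpoint)

# Relations tested with pContains (interior only) in Figure 2;
# covers, covered_by, and meets use pCovers (interior or boundary).
STRICT = {"contains", "contained_by", "overlaps"}


def regions(relation: str, q: Rect, universe: Rect,
            M: Sequence[float], m: Sequence[float]) -> Optional[Tuple[Rect, Rect]]:
    """Return the (L-region, H-region) pair of Table 3, or None when the
    text's infeasibility condition holds (no MBR can satisfy the query)."""
    (ql, qh), (ul, uh) = q, universe
    dims = range(len(ql))
    if relation == "equal_to":
        return (ql, ql), (qh, qh)
    if relation in ("contains", "covers"):
        if any(qh[i] > ql[i] + M[i] for i in dims):
            return None
        L = (tuple(max(qh[i] - M[i], ul[i]) for i in dims), ql)
        H = (qh, tuple(min(ql[i] + M[i], uh[i]) for i in dims))
        return L, H
    if relation in ("contained_by", "covered_by"):
        if any(qh[i] < ql[i] + m[i] for i in dims):
            return None
        L = (ql, tuple(max(qh[i] - m[i], ql[i]) for i in dims))
        H = (tuple(min(ql[i] + m[i], qh[i]) for i in dims), qh)
        return L, H
    if relation in ("overlaps", "meets"):
        L = (tuple(max(ql[i] - M[i], ul[i]) for i in dims), qh)
        H = (ql, tuple(min(qh[i] + M[i], uh[i]) for i in dims))
        return L, H
    raise ValueError(f"unsupported relation: {relation}")


def in_region(region: Rect, p: Point, strict: bool) -> bool:
    """Interior test when strict, interior-or-boundary test otherwise."""
    low, high = region
    if strict:
        return all(lo < x < hi for lo, x, hi in zip(low, p, high))
    return all(lo <= x <= hi for lo, x, hi in zip(low, p, high))


def selects(relation: str, q: Rect, universe: Rect,
            M: Sequence[float], m: Sequence[float], r: Rect) -> bool:
    """Apply the Figure 2 selection predicate to the MBR r = (l(r'), h(r'))."""
    regs = regions(relation, q, universe, M, m)
    if regs is None:
        return False
    L, H = regs
    if relation == "equal_to":
        # Both regions are single points, so the test is plain equality.
        return tuple(r[0]) == tuple(L[0]) and tuple(r[1]) == tuple(H[0])
    strict = relation in STRICT
    return in_region(L, r[0], strict) and in_region(H, r[1], strict)
```

For instance, with a universe of ((0, 0), (100, 100)), M = (20, 20), and the query window ((10, 10), (20, 20)), the contains query yields the L-region ((0, 0), (10, 10)) and the H-region ((20, 20), (30, 30)), so an MBR such as ((5, 5), (25, 25)) is selected.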

Figure 3 shows a simple QSF-tree which uses modified kdB-trees as its underlying structure. Note, the space partition of the structure is the same as that of a regular kdB-tree indexing only the low endpoints of object MBRs. Contrasting Figure 3 with Figure 1 showing a regular kdB-tree, one can see that the only structural difference is in the lowest level of the tree. Therefore, if kdB-trees are chosen as the base structure, the update operations of kdB-trees require only a simple modification to accommodate extended entries at the lowest level. The search operations of a simple QSF-tree rely solely on the Lregion in traversing the upper levels of the tree. However, once the search reaches the lowest level, both L- and H-regions are used to select the candidate MBRs. Each object MBR r’ at the lowest level whose low endpoint falls within the L-region is checked to see whether its high endpoint lies within the Hregion. For this reason, the L-region can be regarded as a search region and the H-region as a filtering region. This is where the name ``Query-to-Search-and-Filter''-trees (QSF-trees) comes from.

[Diagram for Figure 3 omitted. Labels recovered from the drawing: regions R1’, R2’, R3’, and R4’, and entries a through k.]

[Diagram for Figure 2 omitted, panels (a) to (d). Labels recovered from the drawing: q’, Le = l(q’), He = h(q’), Lc, Hc, Lcb, Hcb, Lom, Hom, vdim, vmin, vmax. Selection predicates listed in the figure:]

equal_to (r, q) : pEqual (Le, l(r’)) ∧ pEqual (He, h(r’))
contains (r, q) : pContains (Lc, l(r’)) ∧ pContains (Hc, h(r’))
covers (r, q) : pCovers (Lc, l(r’)) ∧ pCovers (Hc, h(r’))
contained_by (r, q) : pContains (Lcb, l(r’)) ∧ pContains (Hcb, h(r’))
covered_by (r, q) : pCovers (Lcb, l(r’)) ∧ pCovers (Hcb, h(r’))
overlaps (r, q) : pContains (Lom, l(r’)) ∧ pContains (Hom, h(r’))
meets (r, q) : pCovers (Lom, l(r’)) ∧ pCovers (Hom, h(r’))

Figure 2: Query transformation of QSF-trees.

In simple QSF-trees, both endpoints of an object MBR are stored in a single entry of the structure, but the entries are indexed solely by the low endpoints. The underlying index structure is a simple modification of a PAM (typically, a height-balanced tree), in which the nodes at the lowest level contain extended entries of the form 〈l(r’), h(r’), r_ptr〉, where r’ is an object MBR and r_ptr is the pointer to the database tuple containing the actual object r. Each upper-level entry of the structure is an 〈R’, c_ptr〉 pair, where R’ represents a rectangle and c_ptr is a pointer to the child node. The rectangle R’ encloses the low endpoints of all MBRs stored in the subtree indicated by c_ptr.
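The entry layout described above, together with the search-and-filter traversal (upper levels tested against the L-region only, leaf entries tested against both the L- and H-regions), can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, regions are assumed to be axis-aligned boxes, r_ptr is modeled as an integer, and the upper-level test uses a closed-box intersection, which is a conservative simplification of the relation-specific search predicates.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (low corner, high corner); assumed representation


@dataclass
class LeafEntry:
    low: Point   # l(r'): the key by which the entry is indexed
    high: Point  # h(r'): stored alongside, used only for filtering
    r_ptr: int   # stand-in for the pointer to the database tuple


@dataclass
class UpperEntry:
    rect: Box      # R': encloses the low endpoints stored in the subtree
    child: "Node"  # c_ptr


@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, UpperEntry]] = field(default_factory=list)


def in_box(box: Box, p: Point, strict: bool) -> bool:
    """Point-in-box test; strict=True excludes the boundary."""
    lo, hi = box
    if strict:
        return all(l < x < h for l, x, h in zip(lo, p, hi))
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))


def boxes_touch(a: Box, b: Box) -> bool:
    """True if the closed boxes share at least one point."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def search(node: Node, L: Box, H: Box, strict: bool) -> Iterator[int]:
    """Yield r_ptr of every candidate MBR: search by L, filter by H."""
    if node.is_leaf:
        for e in node.entries:
            if in_box(L, e.low, strict) and in_box(H, e.high, strict):
                yield e.r_ptr
    else:
        for e in node.entries:
            # Upper levels know only about low endpoints, so only L is used.
            if boxes_touch(e.rect, L):
                yield from search(e.child, L, H, strict)
```

Because the high endpoint travels with the low endpoint in the same leaf entry, a single traversal suffices and no intersection of separate result sets is needed.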

Figure 3: Structure of a QSF-tree based on kdB-trees.

Following the idea of Papadias et al. [20], the search operations are specialized to exploit the semantics of the given topological relation and reduce the number of page accesses by pruning the search space. Table 4 shows the search predicates applied to the upper-level entries of a simple QSF-tree. The first column of the table gives the topological relation between an object r and a query object q. The second column shows the corresponding search predicates used while traversing the upper levels of a QSF-tree. The third column, which is included for later reference, shows the search predicates applied to the upper-level entries of R-trees or their variants. In the table, R’ represents the bounding rectangle corresponding to an upper-level entry of a QSF-tree or an R-tree variant, while q’ denotes the query window. As in Figure 2, Le, Lc, Lcb, and Lom represent the L-regions of different types of queries.
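The rectangle-level tests that the second column of Table 4 applies to upper-level entries can be sketched as below. The definitions of overlaps and meets used here are assumptions made for illustration (overlaps: the interiors share a point; meets: the closed boxes intersect but their interiors do not), and they presuppose non-degenerate boxes; the paper's own definitions are given in its Table 2. The function qsf_search_predicate and the two relation sets are hypothetical names.

```python
from typing import Sequence, Tuple

Point = Sequence[float]
Box = Tuple[Sequence[float], Sequence[float]]  # (low corner, high corner)


def closed_intersect(a: Box, b: Box) -> bool:
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def overlaps(a: Box, b: Box) -> bool:
    """Assumed semantics: the interiors of a and b share a point."""
    return all(al < bh and bl < ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def meets(a: Box, b: Box) -> bool:
    """Assumed semantics: a and b touch, but their interiors are disjoint."""
    return closed_intersect(a, b) and not overlaps(a, b)


def p_covers(a: Box, p: Point) -> bool:
    return all(l <= x <= h for l, x, h in zip(a[0], p, a[1]))


# Relations whose selection predicate is interior-only vs. interior-or-boundary.
STRICT = {"contains", "contained_by", "overlaps"}
NON_STRICT = {"covers", "covered_by", "meets"}


def qsf_search_predicate(relation: str, R, L) -> bool:
    """Should the search branch into the subtree bounded by R?

    L is the L-region of the query: a point for equal_to, a box otherwise.
    """
    if relation == "equal_to":
        return p_covers(R, L)
    if relation in STRICT:
        return overlaps(R, L)
    if relation in NON_STRICT:
        return overlaps(R, L) or meets(R, L)
    raise ValueError(f"unknown relation: {relation}")
```

The non-strict relations must also follow rectangles that merely touch the L-region, because a relevant low endpoint may lie exactly on the shared boundary.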

Intuitively, the search predicates of QSF-trees are derived as follows. Recall that the search operation must locate the low endpoints of object MBRs lying in the interior (or, for some topological relations, on the boundary) of the given L-region. Since any upper-level rectangle R’ that overlaps (or, for some topological relations, meets) the L-region may contain relevant low endpoints, the search operation must branch into the subtree rooted at R’. For the relation equal_to, since the L-region Le is just a point, any upper-level rectangle R’ that covers Le satisfies the search predicate. Similar reasoning explains the third column of Table 4 showing the search predicates used to test the upper-level entries of R-trees or their variants, assuming the semantics of topological relations adopted in this paper. The search predicates of the third column of Table 4 can be derived from Table 2 in [20], applying the semantics of topological relations and Table 2 of this paper.

Query Predicate | Search Predicate of QSF-trees | Search Predicate of R*-trees
equal_to (r, q) | pCovers (R’, Le) | covers (R’, q’)
contains (r, q) | overlaps (R’, Lc) | contains (R’, q’)
covers (r, q) | overlaps (R’, Lc) ∨ meets (R’, Lc) | covers (R’, q’)
contained_by (r, q) | overlaps (R’, Lcb) | overlaps (R’, q’)
covered_by (r, q) | overlaps (R’, Lcb) ∨ meets (R’, Lcb) | overlaps (R’, q’)
overlaps (r, q) | overlaps (R’, Lom) | overlaps (R’, q’)
meets (r, q) | overlaps (R’, Lom) ∨ meets (R’, Lom) | overlaps (R’, q’) ∨ meets (R’, q’)

Table 4: Search predicates of QSF-trees and R*-trees.

In summary, QSF-trees process the query “find all objects r that satisfy the given topological relation with respect to the given query object q” as follows:

1. Initialization. Construct the query window q’ for the query object q and calculate the L- and H-regions based on q’ and the topological relation of the query predicate, as shown in Table 3.
2. Search. Starting from the root node, prune the search space by applying the corresponding search predicate of the second column of Table 4 to the upper-level entries of QSF-trees.
3. Selection. To search a leaf node, use the appropriate selection predicate of Figure 2. Include in the candidate set each object MBR whose low endpoint lies within the L-region and whose high endpoint lies in the H-region.
4. Refinement. For each MBR r’ in the candidate set, using the tuple pointer r_ptr, consult the actual object r to determine whether it satisfies the given topological relation with the query object q.

6. EXPERIMENTAL RESULTS

In this section, we present the results of an extensive set of experiments intended to verify the search performance of QSF-trees and show their practical appeal. The results will be explained in Section 7. The original R*-trees and the R*-trees that differentiate topological relations [20] (which we call topological R*-trees) were used as benchmarks for the comparison. (Note, topological R*-trees differ from the original R*-trees only with respect to the search operations.) As usual, search performance was measured in terms of the number of page accesses.

QSF-trees were implemented by modifying kdB-trees to accommodate extended entries at the lowest level of the tree (see Figure 3). However, unlike the original kdB-trees which use forced splitting of index nodes [21], our implementation uses splitting based on a first-division plane [5] which does not force a partition of rectangles corresponding to the lower-level nodes. As a result, the implementation avoids downward propagation of splitting associated with forced splitting and, thereby, achieves a greater storage utilization than the original kdB-trees [5, 6].

R*-trees were implemented in a way that optimizes their search performance. The implementation uses both optimized splitting and forced reinsertion with the number of reinserted entries from an overfilled node set to 30% of the node capacity, as suggested in [1]. The search operations of topological R*-trees are slightly different from the ones proposed by Papadias et al. [20], because we implemented the semantics of the relations adopted in this paper, rather than the pairwise-disjoint relations. The third column of Table 4 (see Section 5) shows the search predicates applied to the upper-level entries of topological R*-trees.

The experiments were performed on 2- and 4-dimensional MBRs. The node capacity (fan-out) of both QSF-trees and R*-trees was 12 for 2-dimensional and 7 for 4-dimensional objects. The structures were constructed from six files of exactly 16,384 (2^14) randomly generated MBRs. The first, second, and third file contained 2-dimensional MBRs whose lengths along each axis, relative to the length of the universe along the axis, were between (a) 0.1% and 0.5% (small objects), (b) 15% and 30% (large objects), and (c) 0.5% and 30% (widely-varying objects), respectively. The fourth, fifth and sixth file contained 4-dimensional MBRs of small, large, and widely-varying size, respectively (i.e., objects whose lengths along each axis were between (d) 0.1% and 0.5%, (e) 15% and 30%, and (f) 0.5% and 30% the length of the universe along the axis).

Objects of each file were inserted into a QSF-tree and an equivalent R*-tree. The respective structures had similar storage overhead, but it took much more time to build an R*-tree than an equivalent QSF-tree. This is because the insertions into an R*-tree are more complex than the insertions into a kdB-tree which, in turn, determine the insertion complexity of our implementation of QSF-trees. The performance of spatial queries of the original R*-trees, topological R*-trees, and QSF-trees was measured after inserting 2^7, 2^8, 2^9, 2^10, 2^11, 2^12, 2^13, and 2^14 objects. At every point in the growth of the structures, each type of query with different topological relation was performed 500 times. The query windows were generated randomly.

Figure 4 shows the average number of page accesses per query obtained at different stages in the growth of the structures for each of the six files described above. As one can see from the figure, the performance of QSF-trees is generally better than that of the original R*-trees and topological R*-trees, but by a different margin. The margin of performance improvement gets larger as the structures grow, which can be attributed to the fact that, with the larger number of objects, the overlap of R*-trees increases. Since the overlap of R*-trees also increases with a growing number of dimensions (see [2] for explanation), the margin of improvement tends to be much larger for 4-dimensional than for 2-dimensional objects.

[Plots omitted: six panels, (a) to (f), of page accesses versus number of objects, with the series avg. (org. R*-tree), avg. (topo. R*-tree), and avg. (Simple QSF-tree).]

Figure 4: Average number of page accesses per query with (a) small, (b) large, and (c) widely-varying 2-dimensional objects and (d) small, (e) large, and (f) widely-varying 4-dimensional objects.

Table 5 shows the total percentage improvement in the search performance of QSF-trees over the search performance of both the original and topological R*-trees for each of the six input files after inserting all 16,384 objects. The percentages were obtained using the formula: 100×(TR-TQ)/TR, where TR and TQ are the total number of page accesses generated by all queries performed on one of the R*-tree variants and the corresponding QSF-tree, respectively. Depending upon the size of objects, the performance improvements of QSF-trees over the original R*-trees were up to about 65% for 2-dimensional and up to 80% for 4-dimensional objects. With respect to the topological R*-trees, the performance improvements were up to almost 40% and 70% for 2- and 4-dimensional objects, respectively.

File (dims/size) | QSF-trees vs. orig. R*-trees | QSF-trees vs. topo. R*-trees
2D / 0.1 - 0.5% | +45.41% | +5.38%
2D / 15 - 30% | +64.09% | +38.08%
2D / 0.5 - 30% | +54.98% | +22.10%
4D / 0.1 - 0.5% | +80.82% | +67.62%
4D / 15 - 30% | +74.88% | +57.62%
4D / 0.5 - 30% | +69.75% | +49.07%

Table 5: Performance improvements after inserting 16,384 objects.

Figure 5 shows the average performance of different types of queries (involving different topological relations) of QSF-trees and topological R*-trees for each of the six files after inserting all 16,384 objects. One can see that, for the topological relations equal_to, contained_by and covered_by, QSF-trees outperform topological R*-trees in all scenarios. For the relations contains, covers, overlaps and meets, topological R*-trees slightly outperform QSF-trees for large and widely-varying 2-dimensional objects (see Figures 5b and 5c). In all other situations, QSF-trees outperform R*-trees, usually by a big margin.

[Charts omitted: six panels, (a) to (f), of page accesses for the Simple QSF-tree and the topo. R*-tree, one group per query type labeled e, ct, cv, ctb, cvb, o, and m.]

Figure 5: Average performance of queries with different topological relations with (a) small, (b) large, and (c) widely-varying 2-dimensional objects and (d) small, (e) large, and (f) widely-varying 4-dimensional objects after inserting all 16,384 objects.

7. DISCUSSION AND FUTURE WORK

One of the aspects in which simple QSF-trees differ from R*-trees is that they completely eliminate overlap. But, there are other performance-related aspects in which QSF-trees differ from R*-trees. These can be observed by analyzing the search predicates given in Table 4 of Section 5: the average size of rectangles corresponding to the upper-level entries, the size of the search region, and the selectivity of search predicates applied to the upper-level entries. In both QSF-trees and R*-trees, with smaller size of upper-level rectangles and/or the search region, and with a more restrictive search predicate, the probability that the given upper-level entry will satisfy the predicate (and, thereby, the probability that the search will branch into its subtree) decreases. This, in turn, results in a smaller number of page accesses.

An empirical analysis of the structures constructed in our experiments revealed that the average size of upper-level rectangles in QSF-trees is much smaller than that of the corresponding R*-trees. Since in our implementation QSF-trees apply kdB-tree like space partition into non-overlapping

rectangles that completely cover the universe (see Figure 1), this is due to the region overlap in R*-trees. Therefore, in discussing the relative strengths of QSF-trees over topological R*-trees, we can focus on three factors only: the region overlap, the size of the search region, and the selectivity of the search predicate for each individual type of query. Note, the search region of R*-trees (both the original and topological) is the query window, while the search region of QSF-trees is one of the L-regions depicted in Figure 2. In comparing the selectivity of the search predicates of QSF-trees and topological R*-trees, we assume that the parameters of the predicates are the same.

Table 6, obtained by analyzing Figure 2 and Table 4, gives a descriptive summary of the relative advantages and disadvantages of QSF-trees and topological R*-trees with respect to the three major performance-related factors. For each type of query and one of the three aspects, the table indicates whether a QSF-tree has a relative advantage (“+”) or disadvantage (“-”) over the corresponding topological R*-tree, or the two structures are equal (“=”) with respect to the given aspect. The notation “+/-” means that, in some scenarios the aspect may be an advantage, while in others it may be a disadvantage of QSF-trees.

Type of Query | Region overlap | Size of the search region | Selectivity of the search predicate
equal_to | + | + | =
contains | + | +/- | -
covers | + | +/- | -
contained_by | + | + | =
covered_by | + | + | - (small)
overlaps | + | - | =
meets | + | - | =

Table 6: Relative pros/cons of QSF-trees and topo. R*-trees.

As shown in the second column of Table 6, all types of queries performed on QSF-trees benefit from the lack of region overlap. The larger the overlap of R*-trees, the greater the relative advantages of QSF-trees. Analyzing Figure 2, one can see that, for the topological relations equal_to, contained_by and covered_by, the size of the L-region is always smaller than the query window (hence, “+” in the corresponding cells of the third column). However, for the relations overlaps and meets, the L-region is always larger than the query window (hence, “-” in the corresponding cells). Depending upon the size of the query window and the maximum size of objects along each dimension, the L-region for the relations contains and covers could be either smaller or larger than the query window (hence, “+/-” in the corresponding cells of the third column). Analyzing Table 4, one can see that the search predicates of QSF-trees are less restrictive than the corresponding search predicates of topological R*-trees for the relations contains, covers, and covered_by. (Note, in the case of covered_by, this relative disadvantage is rather small.) In all other cases, the search predicates of the two structures are equally restrictive. Obviously, for the relations equal_to, contained_by and covered_by, one should always expect a QSF-tree to outperform the equivalent topological R*-tree. For all other relations,

different factors compete against each other. However, QSF-trees capitalize on the fact that increasing the data-set size or the dimensionality of data aggravates the problem of overlap in R*-trees without having any adverse effect on the other two performance-related factors. This means that simple QSF-trees will scale better than even the best variations of R*-trees.

It is useful to contrast Table 6 with the results presented in Figure 5. Recall, QSF-trees perform better for the relations equal_to, contained_by, and covered_by in all scenarios depicted in Figure 5. This is because, for these relations, QSF-trees are better than the topological R*-trees in almost all of the three performance-related aspects. Since in 4-dimensional spaces, the overlap in R*-trees is rather significant (see [2]), the lack of overlap in QSF-trees offsets any disadvantage they have with respect to the topological R*-trees. However, with 2-dimensional objects and relatively small data sets, as is the case in our experiments, the overlap may not be enough to compensate for some of the relative disadvantages of QSF-trees. This explains why Figures 5a, 5b, and 5c show some fluctuations in the relative performance of the two structures for the relations contains, covers, overlaps and meets. Here, different factors compete against each other. For example, for relatively large 2-dimensional objects (see Figures 5b and 5c), region overlap of R*-trees increases significantly, which works in favor of QSF-trees, but so does the size of the L-region for the relations overlaps and meets, which works against QSF-trees.

Similar reasoning can explain Table 5 which shows different margins of performance improvement in different scenarios. Here, the region overlap and the size of the L-region for the relations overlaps and meets are the most significant factors contributing to the search costs of R*-trees and QSF-trees, respectively. For small 2-dimensional objects and relatively small data sets, region overlap in R*-trees is small and the L-region is only slightly bigger than the average query window. Therefore, the overall performance improvements of QSF-trees over either variant of R*-trees tend to be smaller than in any other scenario. For large and widely-varying objects, the L-region of the relations overlaps and meets is large, but about the same in both cases. However, overlap is larger with large objects than with widely-varying objects. Therefore, in both 2- and 4-dimensional space, the performance improvements for widely-varying objects tend to be smaller than for large objects (see Table 5). The greatest performance improvements were obtained for small 4-dimensional objects. While the overlap in this scenario is smaller than the overlap for large and widely-varying 4-dimensional objects, the average size of the L-region for overlaps and meets is much smaller than in the case of large and widely-varying objects.

In closing, we point out that there are several optimizations that can further improve the performance of QSF-trees. For example, we can improve the selectivity of search predicates of simple QSF-trees by propagating the information about not only low but

also high endpoints of object MBRs to upper levels of the tree. In turn, this would eliminate some false drops into pages containing no candidate MBRs, which occur in simple QSF-trees as a result of the fact that the search operations rely exclusively on low endpoints in searching the upper levels of the tree. The propagation of information about high endpoints of object MBRs to upper levels could result in moderate to significant reduction in page access over the simple QSF-trees. Therefore, in our future work, we plan to investigate relative pros and cons of different optimizations. Further improvements can be obtained by reducing the average size of rectangles in upper levels of QSF-trees, which can be done simply by adopting a different point access method, e.g. a buddy tree, instead of a kdB-tree. In our future work, we also plan to investigate the search operations on QSF-trees for complex queries involving arbitrary logical predicates with more than one topological relation.
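As a purely hypothetical illustration of the first optimization mentioned above, the sketch below contrasts the branching decision of a simple QSF-tree with one possible way of using high-endpoint information at upper levels. It assumes that each upper-level entry would additionally store a bounding box of the high endpoints in its subtree; this representation, and the names boxes_touch, descend_simple, and descend_with_high, are assumptions made for the example and are not taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (low corner, high corner)


def boxes_touch(a: Box, b: Box) -> bool:
    """True if the closed boxes share at least one point."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def descend_simple(low_rect: Box, L: Box) -> bool:
    """Simple QSF-tree: branch whenever the low-endpoint box reaches L."""
    return boxes_touch(low_rect, L)


def descend_with_high(low_rect: Box, high_rect: Box, L: Box, H: Box) -> bool:
    """Assumed variant: also require the high-endpoint box to reach H."""
    return boxes_touch(low_rect, L) and boxes_touch(high_rect, H)


# A subtree whose low endpoints reach the L-region but whose high endpoints
# all fall short of the H-region is a false drop for the simple test only.
low_rect = ((0.0, 0.0), (2.0, 2.0))
high_rect = ((1.0, 1.0), (3.0, 3.0))
L = ((1.0, 1.0), (2.0, 2.0))
H = ((5.0, 5.0), (9.0, 9.0))
simple_visits = descend_simple(low_rect, L)
extended_visits = descend_with_high(low_rect, high_rect, L, H)
```

In this example the simple test visits the subtree while the extended test skips it, at the cost of storing one extra box per upper-level entry.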

8. REFERENCES

[1] N. Beckmann, H. Kriegel, R. Schneider and B. Seeger, “The R*-tree: An Efficient and Robust Access Method for Points and Rectangles,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 322-331, 1990.

[2] S. Berchtold, D.A. Keim and H. Kriegel, “The X-tree: An Index Structure for High-Dimensional Data,” Proc. 22nd Int. Conf. on Very Large Data Bases, 28-39, 1996.

[3] C. Faloutsos, T. Sellis and N. Roussopoulos, “Analysis of Object-Oriented Spatial Access Methods,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 426-439, 1987.

[4] M. Freeston, “The BANG file: A New Kind of Grid File,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 260-269, 1987.

[5] M. Freeston, “A General Solution of the N-dimensional B-tree Problem,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 80-91, 1995.

[6] V. Gaede and O. Gunther, “Multidimensional Access Methods,” ACM Computing Surveys, 30(2):170-231, 1998.

[7] D. Greene, “An Implementation and Performance Analysis of Spatial Data Access Methods,” Proc. 5th IEEE Int. Conf. on Data Engineering, 606-615, 1989.

[8] O. Gunther and J. Bilmes, “Tree-Based Access Methods for Spatial Databases: Implementation and Performance Evaluation,” IEEE Trans. Knowledge and Data Engineering, 3(3):342-356, 1991.

[9] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 47-54, 1984.

[10] A. Henrich, H.W. Six and P. Widmayer, “The LSD-tree: Spatial Access to Multidimensional Point and Non-point Objects,” Proc. 15th Int. Conf. on Very Large Data Bases, 45-53, 1989.

[11] A. Henrich and H.W. Six, “How to Split Buckets in Spatial Data Structures,” in Geographic Database Management Systems, G. Gambosi, M. Scholl, and H.W. Six, eds., Springer-Verlag, Berlin, 212-244, 1991.

[12] K. Hinrichs, “Implementation of the Grid File: Design Concepts and Experience,” BIT, 25:569-592, 1985.

[13] I. Kamel and C. Faloutsos, “Hilbert R-tree: An Improved R-tree Using Fractals,” Proc. 20th Int. Conf. on Very Large Data Bases, 500-509, 1994.

[14] J. Kuan and P. Lewis, “A Study on Data Point Search for HG-trees,” SIGMOD Record, 28(1):90-96, 1999.

[15] J. Nievergelt, H. Hinterberger and K.C. Sevcik, “The Grid File: An Adaptable, Symmetric Multikey File Structure,” ACM Trans. Database Syst., 9(1):38-71, 1984.

[16] P. Oosterom, Reactive Data Structures for Geographic Information Systems, Ph.D. Thesis, University of Leiden, Netherlands, 1990.

[17] J. Orenstein and T.H. Merrett, “A Class of Data Structures for Associative Searching,” Proc. 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 181-190, 1984.

[18] R. Orlandic, “A High-Precision Spatial Access Method Based on a New Linear Representation of Quadtrees,” Proc. 1st Conf. on Information and Knowledge Management CIKM-92, 499-508, 1992.

[19] B.U. Pagel, H.W. Six and H. Toben, “The Transformation Technique for Spatial Objects Revisited,” in Advances in Spatial Databases, D. Abel and B.C. Ooi, eds., Lecture Notes in Computer Science 692, Springer-Verlag, Berlin, 73-88, 1993.

[20] D. Papadias, Y. Theodoridis, T. Sellis and M.J. Egenhofer, “Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 92-103, 1995.

[21] J.T. Robinson, “The K-D-B Tree: A Search Structure for Large Multidimensional Dynamic Indexes,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 10-18, 1981.

[22] B. Seeger and H.P. Kriegel, “Techniques for Design and Implementation of Efficient Spatial Access Methods,” Proc. 14th Int. Conf. on Very Large Data Bases, 10-17, 1988.

[23] B. Seeger and H.P. Kriegel, “The Buddy-tree: An Efficient and Robust Access Method for Spatial Data Base Systems,” Proc. 16th Int. Conf. on Very Large Data Bases, 590-601, 1990.

[24] T. Sellis, N. Roussopoulos, and C. Faloutsos, “The R+-Tree: A Dynamic Index for Multi-Dimensional Objects,” Proc. 13th Int. Conf. on Very Large Data Bases, 507-518, 1987.

[25] D.A. White and R. Jain, “Similarity Indexing with the SS-tree,” Proc. 12th IEEE Int. Conf. on Data Engineering, 516-523, 1996.
