An index structure for efficient reverse nearest neighbor queries

Congjun Yang
Division of Computer Science, Department of Mathematical Sciences
The University of Memphis, Memphis, TN 38152, USA
[email protected]
Abstract The Reverse Nearest Neighbor (RNN) problem is to find all points in a given data set whose nearest neighbor is a given query point. Like Nearest Neighbor (NN) queries, RNN queries arise in many practical situations, such as marketing and resource management, so efficient methods for answering them in databases are required. This paper introduces a new index structure, the Rdnn-tree, that answers both RNN and NN queries efficiently. A single index structure is employed for a dynamic database, in contrast to the multiple indexes used in previous work; this enables significant savings in dynamically maintaining the index structure. Experiments on both synthetic and real world data show that our index structure outperforms the previous method by a significant margin (more than 90% in terms of the number of leaf nodes accessed) on RNN queries. It also shows improvement on NN queries over standard techniques. Furthermore, performance of insertion and deletion is significantly enhanced by the ability to combine multiple queries (NN and RNN) in one traversal of the tree. These facts make our index structure highly preferable in both the static and the dynamic case.
1 Introduction Indexing is an indispensable tool in database systems. Various kinds of indexes are used to speed up query execution. Moreover, new applications and queries continue to demand new and improved indexes and associated algorithms. One type of query that has recently received attention is the Reverse Nearest Neighbor (RNN) Query: Given a data set S and a query point q , an RNN query finds all the points in S having q as their nearest neighbor.
King-Ip Lin Division of Computer Science, Department of Mathematical Sciences The University of Memphis, Memphis, TN 38152, USA
[email protected]
This problem corresponds to a class of problems we call "influence" problems. For instance, suppose a bank is to open a new branch at a particular location. It may want to know which existing branches will be affected by the new branch, assuming people choose the nearest branch to conduct business. Moreover, a rival bank may also assess the influence of putting a new branch at that location and what effect it would have on the existing branches of other banks.

Also, with the advance of the Internet and the Web, people expect systems to deliver (or push) interesting and relevant information to them. While users do not want to be inundated with a large volume of junk messages, it is crucial for them to receive information that is more relevant to them than what they have received before. One way to achieve this balance is to push only the information most pertinent to the interests of the user. For instance, a company can send advertisements about a new product to only those customers who will find this product more relevant than any of the existing products. This allows the users to receive the information they actually need, and at the same time spares them from sorting through the junk cluttering their mailboxes, thus making the advertisement more effective. Hence reverse nearest neighbor queries are a very practical and important class of queries. Korn and Muthukrishnan [8] provide more examples.

A naive solution of the problem requires O(n^2) time with no preprocessing, as the nearest neighbors of all the points in S have to be found. Thus more efficient algorithms are required. One approach, described by Korn and Muthukrishnan [8], is to pre-compute the nearest neighbor of every point in S. Then, given a query point q, one can compare the distance from q to each point against that point's precomputed nearest neighbor distance. Concretely, for each point x in S one can compute and store a spherical region with x as the center and the distance from x to its nearest neighbors as the radius. It can be seen that if a query point q falls into this region, then x is an RNN of q ([8] provides a proof).
All the regions can be organized into a multi-dimensional index structure (for instance, a member of the R-tree family [1, 5, 13]) for effective storage and query performance. This method, while pioneering, has some drawbacks. For instance, it requires two indexes in the dynamic case – where insertions and deletions to the data set occur. Moreover, the regions that are stored tend to overlap significantly, which hampers performance. In this paper we present a new structure, called the Rdnn-tree (R-tree containing Distance of Nearest Neighbors), which is well suited for RNN queries in both the static and the dynamic case. The Rdnn-tree differs from a standard R-tree by storing, in each node, extra information about the nearest neighbors of the points below it. This piece of information yields significant improvement in all of the algorithms. The Rdnn-tree has many advantages, including:
It significantly outperforms the index structures in [8], typically requiring only 1-2 leaf accesses to locate the RNNs.
The Rdnn-tree can perform NN queries efficiently. As a result, we only require one tree in the dynamic case, for both NN and RNN queries.
The Rdnn-tree enables one to execute multiple NN and RNN queries in one traversal of the tree, further enhancing performance in the dynamic case.
The rest of the paper is organized as follows: Section 2 outlines previous work on multi-dimensional indexes and queries. Section 3 describes the previous RNN algorithms in more detail and outlines the potential for improvement. The proposed Rdnn-tree is presented in Section 4. Section 5 provides experimental results, and Section 6 summarizes our work and discusses future directions.

2 Related work

There has been a large body of work on multi-dimensional index structures. For instance, the index structure we propose is based on the popular R-tree family [5, 13, 1], which generalizes the B+-tree to multiple dimensions by storing minimum bounding regions (hyper-rectangles) instead of numbers representing 1-D intervals. Interested readers are referred to [12] and [4] for full surveys of multi-dimensional index structures. Early work on multi-dimensional index structures focused on range queries. Recently the nearest neighbor query problem has received substantial attention. In addition to the work in computational geometry (e.g., see [11]), many algorithms have been proposed to search for nearest neighbors using tree-based indexes like the R-trees. Many such algorithms take the branch-and-bound approach: the tree is traversed from the root, and at each step a certain heuristic is used to determine which branch to traverse next and which branches can be pruned from the search. Various algorithms differ in the order of the search. For instance, Roussopoulos et al. [10] use a depth-first approach, while Hjaltason and Samet [6] propose a "distance-browsing" algorithm, using a priority queue to order the branches to be traversed. Other approaches have also been proposed. One is to modify the index structure itself to enhance the branch-and-bound algorithms; two examples are the SS-tree [15] and the SR-tree [7]. An alternative approach, proposed by Berchtold et al. [2], indexes an approximation of the Voronoi diagram associated with the data set.

3 Definitions and existing algorithms

This section presents existing algorithms for Reverse Nearest Neighbor search and discusses potential improvements. We first provide formal definitions of the Nearest Neighbor and Reverse Nearest Neighbor search problems. In what follows, we assume that S is a set of points in d-dimensional space. D(p, q) is the distance between two points p and q. If T is a subset of S, D(p, T) denotes the minimum distance between p and any point in T. C(p, r) is the circle centered at p with radius r.

DEFINITION (Nearest Neighbor Search (NN Search)): Given a set S of points in some d-dimensional space and a query point q, the Nearest Neighbor Search problem is to find the subset NN_S(q) of S defined as follows:

NN_S(q) = { r ∈ S | ∀ p ∈ S : D(q, r) ≤ D(q, p) }.

DEFINITION (Reverse Nearest Neighbor Search (RNN Search)): Given a set S of points in some d-dimensional space and a query point q, the Reverse Nearest Neighbor Search problem is to find the subset RNN_S(q) of S defined as follows:

RNN_S(q) = { r ∈ S | ∀ p ∈ S : D(q, r) ≤ D(r, p) }.
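To make the two definitions concrete, the following is a minimal brute-force sketch in C++ (the language used for the implementation described in Section 5). It evaluates NN_S(q) and RNN_S(q) by exhaustive scanning; the names Point, dist, nnSearch and rnnSearch are illustrative only and are not part of the index structures discussed below.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;              // a point in d-dimensional space

// Euclidean distance D(p, q).
double dist(const Point& p, const Point& q) {
    double s = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) s += (p[i] - q[i]) * (p[i] - q[i]);
    return std::sqrt(s);
}

// NN_S(q): all points of S whose distance to q is minimum.
std::vector<Point> nnSearch(const std::vector<Point>& S, const Point& q) {
    double best = std::numeric_limits<double>::infinity();
    for (const Point& p : S) best = std::min(best, dist(q, p));
    std::vector<Point> result;
    for (const Point& p : S)
        if (dist(q, p) == best) result.push_back(p);
    return result;
}

// RNN_S(q): all points r of S for which no other point of S is closer to r than q is.
std::vector<Point> rnnSearch(const std::vector<Point>& S, const Point& q) {
    std::vector<Point> result;
    for (std::size_t i = 0; i < S.size(); ++i) {
        double dq = dist(S[i], q);
        bool isRnn = true;
        for (std::size_t j = 0; j < S.size(); ++j) {
            if (j == i) continue;               // a point is not its own neighbor
            if (dist(S[i], S[j]) < dq) { isRnn = false; break; }
        }
        if (isRnn) result.push_back(S[i]);
    }
    return result;
}
```

The O(n^2) cost of rnnSearch is exactly the naive bound mentioned in the introduction; the remainder of this section and Section 4 are about avoiding it.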
Notice that D(p, NN_S(p)) is the distance between p and its nearest neighbors in S. For simplicity we denote it by dnn_S(p). The subscript S will be omitted from these notations where the context is clear. In general, there is no natural relationship between NN_S(q) and RNN_S(q): for instance, r ∈ NN_S(q) does not imply r ∈ RNN_S(q), and vice versa.

The general RNN-search approach is presented by Korn and Muthukrishnan [8]. Let S be a given set of points and q a query point. For any point p in S, p takes q as its nearest
neighbor if and only if D(p, q) ≤ dnn_S(p), i.e., p is at least as close to q as to its nearest neighbors in S. Since S is known, we can pre-compute NN_S(p) for every point p in S and store it in a suitable way. Korn and Muthukrishnan used an RNN-tree, which is essentially an R*-tree: for every point p in the data set S, the RNN-tree stores the minimum bounding rectangle of the circle C(p, dnn_S(p)) in a leaf node. With such an index structure, the RNN search problem becomes a simple point query problem. For any given query point q, p is in RNN_S(q) if and only if q falls inside the circle, and hence only if q falls inside the minimum bounding rectangle of the circle.
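The following C++ fragment sketches this reduction at the leaf level: each point carries its pre-computed nearest neighbor distance and the bounding rectangle of its circle, and an RNN query filters by rectangle containment before applying the exact circle test. It is only an illustration of the point-query reduction over a flat array, not the actual R*-tree organization of [8]; the struct and function names are ours.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// One leaf-level entry: the point, its pre-computed dnn_S(p), and the MBR of C(p, dnn_S(p)).
struct CircleEntry {
    Point  p;
    double dnn;
    Point  mbrLow, mbrHigh;
};

CircleEntry makeEntry(const Point& p, double dnn) {
    CircleEntry e{p, dnn, p, p};
    for (std::size_t i = 0; i < p.size(); ++i) {   // grow the rectangle by dnn in every dimension
        e.mbrLow[i]  -= dnn;
        e.mbrHigh[i] += dnn;
    }
    return e;
}

// RNN query as a point query: q must lie inside the rectangle (cheap filter)
// and inside the circle itself (exact test D(p, q) <= dnn_S(p)).
std::vector<Point> rnnQuery(const std::vector<CircleEntry>& entries, const Point& q) {
    std::vector<Point> result;
    for (const CircleEntry& e : entries) {
        bool inRect = true;
        for (std::size_t i = 0; i < q.size(); ++i)
            if (q[i] < e.mbrLow[i] || q[i] > e.mbrHigh[i]) { inRect = false; break; }
        if (inRect && dist(e.p, q) <= e.dnn) result.push_back(e.p);
    }
    return result;
}
```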
Complications arise when points are inserted into or deleted from the tree; in such cases the RNN-tree has to be updated. Consider first the case of insertion. When a point p' is inserted into S, we need to make two kinds of adjustments: for every point in RNN_S(p'), we need to update the region stored in the RNN-tree, since p' is its new nearest neighbor; also, the region corresponding to p' (i.e., C(p', dnn_S(p'))) needs to be computed and inserted into the RNN-tree. This implies that the insertion algorithm needs to find both NN_S(p') and RNN_S(p'). One would like to use the RNN-tree to find the nearest neighbors. However, the leaf nodes of an RNN-tree contain geometric objects (the regions) instead of the points themselves; this makes the higher-level bounding regions larger and the tree sub-optimal for standard nearest neighbor queries. Thus it is proposed that a second tree, the NN-tree (a plain R*-tree), be created to ensure efficient nearest neighbor search. However, this implies that the second tree also needs to be maintained during insertion. The insertion algorithm can be summarized as follows.

Algorithm 1 RNN-Insert(RNN-tree, NN-tree, p')
1) Perform an RNN search on the RNN-tree for p' to find RNN_S(p').
2) For each p_i ∈ RNN_S(p'), shrink C(p_i, dnn_S(p_i)) to C(p_i, D(p_i, p')).
3) Call the standard NN search algorithm on the NN-tree to find NN_S(p').
4) Insert p' into the RNN-tree using the R*-tree insertion algorithm.
5) Insert p' into the NN-tree using the R*-tree insertion algorithm.

A similar situation arises when a point p'' is deleted. Again, we need to make two kinds of adjustments: deleting the region corresponding to p'' (i.e., C(p'', dnn_S(p''))), as well as finding the new nearest neighbors of all the points in RNN_S(p'') and adjusting their corresponding regions in the RNN-tree. Once again both NN and RNN queries are needed; notice that we might have to perform multiple NN queries in the second step. The deletion algorithm is listed as follows.

Algorithm 2 RNN-Delete(RNN-tree, NN-tree, p'')
1) Delete p'' from the RNN-tree using the R*-tree deletion algorithm.
2) Delete p'' from the NN-tree using the R*-tree deletion algorithm.
3) Perform an RNN search on the RNN-tree for p'' to find RNN_S(p'').
4) For each p_i ∈ RNN_S(p''), call the standard NN search algorithm on the NN-tree to find the new NN_S(p_i), and enlarge C(p_i, D(p_i, p'')) to C(p_i, dnn_S(p_i)).

Thus in the dynamic case one needs to update two trees to maintain the index structure, which leads to inefficiency in both time and space.

While the technique above is a general approach, there are other techniques that work for lower-dimensional points. One such approach is to take advantage of the geometric properties of the problem. Stanoi, Agrawal and El Abbadi [14] introduced an algorithm that works directly on an R*-tree. It transforms the RNN problem into a set of constrained nearest neighbor queries. An interesting fact about RNN queries is that the maximum number of RNNs of a query point is bounded, and if multiple RNNs exist, they have to be distributed fairly evenly around the query point. Thus, upon receiving the query point q, the algorithm divides the entire space into a number of regions based on q, the number of regions being equal to the maximum possible number of RNNs. For each region, the algorithm finds the nearest neighbors of q within that region. It can be shown that the true RNNs are among these points, and selecting the correct answers from them can be done easily. The main drawback of the algorithm is that the number of regions to be searched grows very quickly as the dimensionality increases; for instance, for the L1 norm the growth is exponential. This renders the algorithm ineffective in higher dimensions. Moreover, every region has to be searched whether or not an RNN resides in it, so there can be a lot of wasted effort during the search.

4 The Rdnn-tree

4.1 Motivation

We have discussed the limitations of the RNN-tree approach in the previous section. While storing the spherical region C(p, dnn_S(p)) is necessary, the RNN-tree suffers from the following:
Large overlap between the regions causes increased overlap among the parent nodes' MBRs (minimum bounding rectangles), hampering RNN search performance.
Storing the spherical regions themselves renders the index structure ineffective for NN queries, so a second tree is needed in the dynamic case. This severely adds to the cost of maintaining the index.
Thus we want to find a structure in which point-location and NN queries can be used directly, while the dnn_S(p) information is maintained to ensure that RNN queries are supported properly. We therefore propose the Rdnn-tree (R*-tree with Distance of Nearest Neighbors) to kill two birds with one stone: it uses the R*-tree to store the data points themselves, but enhances the nodes with information about the distance to the nearest neighbors of the points they contain.

4.2 The Rdnn-tree structure

In an Rdnn-tree, a leaf node contains entries of the form (ptid, dnn), where ptid refers to a d-dimensional point in the data set and dnn is the distance from the point to its nearest neighbors in the data set. A non-leaf node contains an array of branches of the form (ptr, Rect, max_dnn). ptr is the address of a child node in the tree. If ptr points to a leaf node, Rect is the minimum bounding rectangle of all points in the leaf node; if ptr points to a non-leaf node, Rect is the minimum bounding rectangle of all rectangles that are entries in the child node. Finally, max_dnn = max{ dnn_S(p) }, where p ranges over the points contained in the subtree rooted at the child node.
4.3 Algorithms

We first present the NN and RNN search algorithms for the Rdnn-tree, as both are needed for the insertion and deletion algorithms.

RNN search. The reverse nearest neighbor search on the Rdnn-tree is similar to a point-location search; the only difference is the criterion used to decide which branch(es) to follow down the tree. Assume that q is the query point:

For a leaf node, we examine each point p in the node. If D(q, p) ≤ dnn_S(p), i.e., p is at least as close to q as to its nearest neighbor, then p is one of the reverse nearest neighbors.

For an internal node, we compare the query point q with each branch B = (ptr, Rect, max_dnn). Here max_dnn plays a crucial role. By definition, all points in the subtree rooted at B are contained in Rect, and the distance from each such point to its nearest neighbor is at most max_dnn. Hence if D(q, Rect) > max_dnn, then branch B need not be visited, because no point in B can be closer to q than to its nearest neighbor in S. Our experiments (cf. Section 5) show that this is very efficient in pruning the search path. To summarize the above description, we have the following formal algorithm.

Algorithm 3 RNN-Search(Node n, Point q)
Input: node n to start the search and query point q
Output: the reverse nearest neighbors of q
If n is a leaf node, then for each entry (ptid, dnn): if D(q, ptid) ≤ dnn, output ptid as one of the RNNs of q.
If n is an internal node, then for each branch B = (ptr, Rect, max_dnn): if D(q, Rect) ≤ max_dnn, call RNN-Search(B.ptr, q).

NN search. As the Rdnn-tree has all the properties of the R*-tree, we can apply a standard nearest neighbor search technique (e.g., [10]) for the NN search. Moreover, the nearest neighbor distance information can help us prune extra branches during the branch-and-bound search. This is due to the following lemma.

LEMMA 4.1 Let q be a query point and p any point from the data set S. If D(p, q) ≤ dnn_S(p)/2, then p is the nearest neighbor of q in S.

The correctness of the lemma is easy to see. C(p, dnn_S(p)) is a circle that contains no points of the data set S other than p. If D(p, q) ≤ dnn_S(p)/2, then for any point x ∉ C(p, dnn_S(p)) we have D(x, q) ≥ D(x, p) − D(p, q) ≥ dnn_S(p) − dnn_S(p)/2 = dnn_S(p)/2 ≥ D(p, q); that is, the distance from q to any point outside the circle is at least D(q, p). Hence p is the nearest neighbor of q. Therefore, when we search a leaf node for the nearest neighbor of a query point q, we can stop the search as soon as the condition D(p, q) ≤ dnn_S(p)/2 is satisfied. This gives the following improved NN search algorithm.

Algorithm 4 NN-Search(Node n, Point q)
Input: a node to start the search and a query point q
Output: the nearest neighbor of q
1) Initialize the candidate nearest neighbor nn.
2) If n is a leaf node, then for each data point p: if D(p, q) < D(p, NN_S(p))/2, output p and stop the search; if D(p, q) < D(q, nn), replace nn by p.
3) If n = (B_1, ..., B_k) is a non-leaf node, where B_i = (ptr_i, Rect_i, max_dnn_i): let d_i = D(q, Rect_i) and sort the B_i according to d_i. For each B_i, if d_i ≤ D(q, nn), call NN-Search(ptr_i, q).
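To make the node layout of Section 4.2 and the two searches above concrete, here is a compact C++ sketch. It is a simplified in-memory rendering, not our disk-based implementation: node fan-out, splitting and the rest of the R*-tree machinery are omitted, and all type and function names (Rect, LeafEntry, Branch, rnnSearch, nnSearch) are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

struct Rect {                                    // axis-aligned minimum bounding rectangle
    Point low, high;
    double minDist(const Point& q) const {       // D(q, Rect)
        double s = 0.0;
        for (std::size_t i = 0; i < q.size(); ++i) {
            double d = std::max({low[i] - q[i], 0.0, q[i] - high[i]});
            s += d * d;
        }
        return std::sqrt(s);
    }
};

struct Node;
struct LeafEntry { Point pt; double dnn; };                    // (ptid, dnn)
struct Branch    { Node* child; Rect rect; double maxDnn; };   // (ptr, Rect, max_dnn)
struct Node {
    bool isLeaf = true;
    std::vector<LeafEntry> entries;              // used when isLeaf
    std::vector<Branch> branches;                // used otherwise
};

// Algorithm 3: report every point at least as close to q as to its own nearest
// neighbor; a branch is pruned whenever D(q, Rect) > max_dnn.
void rnnSearch(const Node& n, const Point& q, std::vector<Point>& out) {
    if (n.isLeaf) {
        for (const LeafEntry& e : n.entries)
            if (dist(q, e.pt) <= e.dnn) out.push_back(e.pt);
    } else {
        for (const Branch& b : n.branches)
            if (b.rect.minDist(q) <= b.maxDnn) rnnSearch(*b.child, q, out);
    }
}

// Algorithm 4: branch-and-bound NN search with the Lemma 4.1 early stop.
// Call with bestDist = infinity and done = false.
void nnSearch(const Node& n, const Point& q, Point& best, double& bestDist, bool& done) {
    if (done) return;
    if (n.isLeaf) {
        for (const LeafEntry& e : n.entries) {
            double d = dist(q, e.pt);
            if (d < bestDist) { best = e.pt; bestDist = d; }
            if (d < e.dnn / 2.0) { done = true; return; }    // p is certainly the answer
        }
    } else {
        std::vector<const Branch*> order;                    // visit closer rectangles first
        for (const Branch& b : n.branches) order.push_back(&b);
        std::sort(order.begin(), order.end(), [&](const Branch* a, const Branch* b) {
            return a->rect.minDist(q) < b->rect.minDist(q);
        });
        for (const Branch* b : order) {
            if (b->rect.minDist(q) <= bestDist) nnSearch(*b->child, q, best, bestDist, done);
            if (done) return;
        }
    }
}
```

A query starts at the root, e.g. rnnSearch(root, q, out), or nnSearch(root, q, best, bestDist, done) with bestDist initialized to infinity.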
Insertion and deletion. Insertion and deletion are similar to those of the RNN-tree approach. The main difference is that we have only one tree, and in the tree we maintain a number, max_dnn, carrying nearest neighbor information, instead of a rectangle.
We first look at insertion. When a point p' is to be inserted into an Rdnn-tree containing a data set S, we first perform an NN and an RNN search to find NN_S(p') and RNN_S(p') respectively. With NN_S(p') we can compute dnn(p') and create the entry for p'. RNN_S(p') gives us the points that are affected: the dnn fields of those points need to be recomputed, and the max_dnn fields of their ancestor nodes also need to be adjusted. This can be done in a way very similar to the RNN-Search algorithm; the only difference is that we adjust the dnn field whenever we find a new RNN point of p' in a leaf node, and propagate the changes to the parent nodes on the way back up. Since we have one index structure for both NN and RNN search, we can combine the two steps into one: we also search for the nearest neighbor of p' while we search for the affected points (RNN_S(p')) and adjust the max_dnn fields of the corresponding nodes. Our experiments show that this combined NN-RNN search has virtually the same cost as the RNN search alone, so it saves us one NN search while maintaining the correctness of the index. Calling this step the pre-insertion phase, we have the following formal Pre-insert algorithm.

Algorithm 5 Pre-insert(Node n, Point p')
Input: the root node of the tree and a point p'
Output: the adjusted tree and NN_S(p')
1) Initialize the candidate nearest neighbor nn.
2) If n is a leaf node, then for each entry (ptid, dnn): if D(p', ptid) < D(p', nn), let nn = ptid; if D(p', ptid) < dnn, output ptid and set its dnn to D(p', ptid).
3) If n is a non-leaf node, then for each branch B = (ptr, Rect, max_dnn): if D(p', Rect) < max_dnn or D(p', Rect) < D(p', nn), call Pre-Insert(ptr, p'); if the subtree under ptr was adjusted, adjust max_dnn for B accordingly.

With the above preparations, we can present the following Insertion algorithm.

Algorithm 6 Insert(Node n, Point p')
Input: root node n and point p' to be inserted
Output: the tree with p' inserted
1) Pre-Insert(n, p').
2) Call the R*-tree insertion algorithm to insert the entry (p', dnn_S(p')) into n.

Now we turn our attention to deletion. Just like in the RNN-tree, deleting a point from the Rdnn-tree affects the reverse nearest neighbors of the deleted point. In order to maintain the integrity of the Rdnn-tree while deleting a point p'', an NN search needs to be done for each point in RNN(p''), which is an expensive step. Observe that the points in RNN(p'') should be physically close to each other in the data set, because they are the reverse nearest neighbors of a single point p''. Moreover, the number of points in RNN(p'') is upper-bounded (the bound depends on the dimensionality). Hence, we can do a batch NN search, finding the nearest neighbors of multiple query points in one pass; let us call it Batch-NN-Search. To delete the point physically, the standard R*-tree deletion algorithm suffices.

Algorithm 7 Delete(Node n, Point p'')
Input: a tree rooted at n and a point p'' to be deleted
Output: the tree with p'' deleted
1) Call the R*-tree algorithm to delete p'' from n.
2) Call RNN-Search(n, p'') to find RNN_S(p'').
3) Call Batch-NN-Search(n, RNN_S(p'')).
4) Adjust dnn for each point in RNN_S(p'') and propagate the change up to the root.

The Batch-NN-Search procedure is a slight modification of the NN-Search algorithm. Formally, it looks as follows.

Algorithm 8 Batch-NN-Search(Node n, Points q_1, ..., q_k)
Input: a tree rooted at node n and an array of query points q_1, ..., q_k
Output: the nearest neighbors of q_1, ..., q_k
1) Initialize the candidate nearest neighbors nn_1, ..., nn_k for q_1, ..., q_k.
2) If n is a leaf node, update each nn_j if the leaf node contains a better candidate for q_j.
3) If n is a non-leaf node, let n = (B_1, ..., B_m), where B_i = (ptr_i, Rect_i, max_dnn_i). Let d_ij = D(q_j, Rect_i) and d_i = max_{j=1,...,k} d_ij. Sort the B_i according to d_i. For each B_i, if d_ij < D(q_j, nn_j) for some j ∈ {1, ..., k}, call Batch-NN-Search(ptr_i, q_1, ..., q_k).

Comparing our insertion and deletion algorithms with those presented in Section 3, we need only a single index, as opposed to the combined NN-tree and RNN-tree approach. Considering the insertion algorithm, inserting one point into the Rdnn-tree and into the RNN-tree are almost equivalent: both have a pre-insertion phase followed by a call to the standard R*-tree insertion algorithm. However, employing one index makes it possible for us to perform a combined NN-RNN search in the pre-insertion phase. Our experiments show
that the combined search saves us one NN search. Better yet, we do not need to insert the point into a second index. Regarding the deletion algorithm, we have the same situation. In addition, we propose to do batch NN searches in the post deletion phase. This provides greater savings.
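For concreteness, the following sketch renders Batch-NN-Search in the same simplified in-memory style as the search sketch in Section 4.3, repeating the minimal node layout so that it stands alone. A branch is descended only if its rectangle could still improve the candidate of some query point, and branches are visited in order of their largest rectangle distance over the batch, mirroring Algorithm 8; all names are again illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

struct Rect {
    Point low, high;
    double minDist(const Point& q) const {       // D(q, Rect)
        double s = 0.0;
        for (std::size_t i = 0; i < q.size(); ++i) {
            double d = std::max({low[i] - q[i], 0.0, q[i] - high[i]});
            s += d * d;
        }
        return std::sqrt(s);
    }
};

struct Node;
struct LeafEntry { Point pt; double dnn; };
struct Branch    { Node* child; Rect rect; double maxDnn; };
struct Node {
    bool isLeaf = true;
    std::vector<LeafEntry> entries;
    std::vector<Branch> branches;
};

// Algorithm 8 (sketch): one traversal serves all k query points.
// best[j] / bestDist[j] hold the current candidate for q[j]; initialize bestDist to infinity.
void batchNnSearch(const Node& n, const std::vector<Point>& q,
                   std::vector<Point>& best, std::vector<double>& bestDist) {
    if (n.isLeaf) {
        for (const LeafEntry& e : n.entries)
            for (std::size_t j = 0; j < q.size(); ++j) {
                double d = dist(q[j], e.pt);
                if (d < bestDist[j]) { bestDist[j] = d; best[j] = e.pt; }
            }
        return;
    }
    // Order branches by their largest distance over the whole batch ...
    std::vector<std::pair<double, const Branch*>> order;
    for (const Branch& b : n.branches) {
        double dmax = 0.0;
        for (const Point& qj : q) dmax = std::max(dmax, b.rect.minDist(qj));
        order.push_back({dmax, &b});
    }
    std::sort(order.begin(), order.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    // ... and descend a branch only if it can still improve some query's candidate.
    for (const auto& db : order) {
        const Branch* b = db.second;
        bool useful = false;
        for (std::size_t j = 0; j < q.size(); ++j)
            if (b->rect.minDist(q[j]) < bestDist[j]) { useful = true; break; }
        if (useful) batchNnSearch(*b->child, q, best, bestDist);
    }
}
```

Since the reverse nearest neighbors of a deleted point are clustered, one such traversal typically touches roughly the same nodes as a single NN search, which is the saving measured in Section 5.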
5 Experimental results

This section presents the results of our experiments. We compare the Rdnn-tree on reverse nearest neighbor (RNN) queries with the RNN-tree method of Korn and Muthukrishnan [8]. We also measure the performance of the Rdnn-tree on nearest neighbor (NN) queries and compare it to standard NN algorithms. Furthermore, we look at two other kinds of queries, combined NN-RNN queries and batch NN queries; these have a significant impact on performance in the dynamic case. We implement both structures in C++ and run our tests on a machine with two 500-MHz Pentium II processors and 512 MB of RAM under SCO UNIX. For the RNN-tree, we use the code provided by Korn and Muthukrishnan. We obtained a large real data set from the US National Mapping Information web site (URL: http://mappings.usgs.gov/www/gnis/); it contains populated places in the USA, represented by latitude and longitude coordinates. We sample different numbers of items from this data set to create the data sets to be indexed, and then sample 500 items from the rest of the data set to form the query set. For higher-dimensional data we generate random points for both the data and the query sets.

Static performance: RNN search. The first set of experiments compares RNN search performance. Figure 1 shows the results; we measure both the number of leaf nodes and the total number of nodes accessed. The Rdnn-tree provides significantly better performance than the RNN-tree approach. For instance, in the 2-D case the RNN-tree approach on average takes 20 leaf accesses for the 100,000-item data set, while our structure requires fewer than 2 leaf accesses on average, an improvement of more than 90%. Significant improvement can also be seen in total disk accesses: the Rdnn-tree is consistently 4 to 5 times better than the RNN-tree in the 2-D case, and even better in the 4-D case. This establishes the effectiveness of the Rdnn-tree.

Dynamic performance: NN queries. One of the main advantages of the Rdnn-tree is the elimination of a second tree in the dynamic case, as the Rdnn-tree itself can perform NN queries effectively. To verify this, we implement the standard NN search algorithm of Roussopoulos et al. [10] and compare it to the Rdnn-tree approach. Table 1 shows the results for total pages accessed (the results for leaf accesses are similar). The Rdnn-tree performs slightly better than the standard R*-tree, because the nearest neighbor information in the Rdnn-tree increases the pruning power of the algorithm. More importantly, this demonstrates the feasibility of the Rdnn-tree for NN queries, enabling us to eliminate the extra index and significantly cut down the maintenance cost.

Dynamic performance: Combined NN-RNN queries. Inserting a data point into the index requires the algorithm to locate both the NNs and the RNNs of the point for update purposes. If one can combine the NN and RNN queries for the point into one pass, there will be significant savings. Thus we run experiments measuring the costs of NN queries, RNN queries, and combined NN-RNN queries. Figure 2 shows the results; we show only the 4-D results, as the 2-D results are similar. The cost of a combined NN-RNN query is essentially the same as that of an RNN query, and much less than the combined cost of a separate NN and RNN query. This shows we can get the NN of the query point nearly for free when we run the RNN query.

Dynamic performance: Batch NN queries. Recall that batch NN queries can be used to speed up deletions. We run experiments to test their effectiveness by measuring the cost of the NN queries involved in deletions. In the experiments, we simulate the delete procedure by picking 500 points from the data set, finding their RNNs, and doing the NN queries for the RNNs of each point. Observe that for any point p we have |RNN(p)| ≥ 0, where |S| denotes the cardinality of a set S. If |RNN(p)| ≤ 1, batch NN and regular NN queries for RNN(p) are the same; only when |RNN(p)| ≥ 2 is a batch NN query necessary. For each data point p, we compare the cost of running the NN queries separately for each point in RNN(p) with that of running the batch NN query for all points in RNN(p).

Figure 3 shows the average results over the 500 points. Doing batch NN queries significantly reduces disk accesses. Not shown in the figure is that the cost of a batch NN query is comparable to that of a single NN query; this means that if |RNN(p)| = k, then batching the NN searches for RNN(p) reduces the cost by a factor of k. Our experiments show that k is usually in the range of 0 to 5. The importance of batch NN queries increases with the dimensionality: in 2-D only 20-30% of the deletions require a batch NN query (i.e., |RNN| > 1), while in 4-D over 60% of the deletions require one.
Figure 1. Comparison of performance for (static) RNN queries. (Left: 2-D real data set; right: 4-D uniform data set. Each panel plots pages accessed against data set size, for the RNN-tree and the Rdnn-tree, counting both leaf accesses and total accesses.)
                     2-D data sets                                4-D data sets
Number of points   10,000   25,000   50,000   75,000   100,000    5,000   50,000
Rdnn-tree           2.098    2.120    3.307    3.388    3.452     4.376    6.464
R*-tree             2.11     2.20     3.360    3.460    3.48      4.436    6.82

Table 1. Comparison of NN query performance (total pages accessed)

Figure 2. Performance of combined NN-RNN queries. (4-D uniform data set; left: leaf nodes accessed, right: total nodes accessed, as functions of data set size, for an RNN query, a combined NN-RNN query, and a separate RNN + NN query.)

Figure 3. Comparison of batch NN queries for 4-D data. (4-D uniform data set; left: leaf nodes accessed, right: total nodes accessed, for non-batch NN versus batch NN.)
6 Conclusion and future work

In this paper, we presented the Rdnn-tree, an R*-tree enhanced by storing nearest neighbor distance information. We demonstrated that this structure is much more efficient in answering RNN queries, that it eliminates the need for a second index, and that it provides superior performance in both the static and the dynamic case.

Our focus in this paper has been the monochromatic reverse nearest neighbor problem. A future direction is to adapt the Rdnn-tree to the bichromatic reverse nearest neighbor problem, in which the data are divided into two types and, given a query point q of one type, the system is required to find all the points of the second type that have q as their nearest neighbor. It will be interesting to see what effect different constraints (such as a single index for both types, or a separate index for each type) have on the algorithms, and how well the Rdnn-tree adapts to the problem.

The Rdnn-tree is based on the R*-tree. While it works well in lower dimensions, its performance degrades in high dimensions. We plan to explore how to adapt the Rdnn-tree techniques to high-dimensional indexing techniques, such as the TV-tree [9] and the X-tree [3].

Acknowledgments

We would like to thank Flip Korn for providing the RNN-tree code, and Flip Korn and Ioana Stanoi for their comments. We would also like to thank Diane Mittelmeier for proofreading the manuscript.

References

[1] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322–331, Atlantic City, NJ, May 1990.
[2] S. Berchtold, B. Ertl, D. A. Keim, H.-P. Kriegel, and T. Seidl. Fast nearest neighbor search in high-dimensional spaces. In Proc. of the 14th IEEE Conference on Data Engineering, pages 23–27, Feb. 1998.
[3] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: an index structure for high-dimensional data. In Proc. of the 22nd International Conference on Very Large Data Bases, pages 28–39, Sept. 1996.
[4] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, June 1998.
[5] A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proc. of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, Boston, MA, June 1984.
[6] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions on Database Systems, 24(2):265–318, June 1999.
[7] N. Katayama and S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proc. of the 1997 ACM SIGMOD International Conference on Management of Data, pages 369–380, June 1997.
[8] F. Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, pages 201–212, May 2000.
[9] K.-I. Lin, H. Jagadish, and C. Faloutsos. The TV-tree: an index structure for high-dimensional data. The VLDB Journal, 3:517–542, Oct. 1994.
[10] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of the 1995 ACM SIGMOD International Conference on Management of Data, pages 71–79, San Jose, CA, May 1995.
[11] J. Sack and J. Urrutia, editors. Handbook of Computational Geometry. North-Holland, 2000.
[12] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.
[13] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. of the 13th International Conference on Very Large Data Bases, pages 507–518, England, Sept. 1987.
[14] I. Stanoi, D. Agrawal, and A. El Abbadi. Reverse nearest neighbor queries for dynamic databases. In Proc. of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 44–53, May 2000.
[15] D. A. White and R. Jain. Similarity indexing with the SS-tree. In Proc. of the 12th International Conference on Data Engineering, pages 516–523, Feb. 1996.