IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 12, NO. 1, JANUARY/FEBRUARY 2000
Indexing the Solution Space: A New Technique for Nearest Neighbor Search in High-Dimensional Space

Stefan Berchtold, Member, IEEE, Daniel A. Keim, Member, IEEE Computer Society, Hans-Peter Kriegel, Member, IEEE Computer Society, and Thomas Seidl

Abstract: Similarity search in multimedia databases requires efficient support of nearest-neighbor search on a large set of high-dimensional points as a basic operation for query processing. As recent theoretical results show, state-of-the-art approaches to nearest-neighbor search are not efficient in higher dimensions. In our new approach, we therefore precompute the result of any nearest-neighbor search, which corresponds to a computation of the Voronoi cell of each data point. In a second step, we store conservative approximations of the Voronoi cells in an index structure efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although our technique is based on a precomputation of the solution space, it is dynamic, i.e., it supports insertions of new data points. An extensive experimental evaluation of our technique demonstrates the high efficiency for uniformly distributed as well as real data. We obtained a significant reduction of the search time compared to nearest-neighbor search in other index structures such as the X-tree.

Index Terms: Nearest neighbor search, high-dimensional indexing, efficient query processing, spatial databases, Voronoi diagrams.
1 INTRODUCTION

An important research issue in the field of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos [1], [15], [17], [22], [28], [30]. However, in contrast to searching data in a relational database, content-based retrieval requires the search for similar objects as a basic functionality of the database system. Most of the approaches addressing similarity search use a so-called feature transformation, which transforms important properties of the multimedia objects into high-dimensional points (feature vectors). Thus, the similarity search corresponds to a search for points in the feature space which are close to a given query point and, therefore, corresponds to a nearest neighbor search. Up to now, a lot of research has been done in the field of nearest neighbor search in high-dimensional spaces [2], [8], [16], [23], [25], [32]. Most of the existing approaches solving the nearest neighbor problem perform a search on an a priori built index while expanding the neighborhood around the query point until the desired closest point is reached. However, as recent theoretical results [5] show, such index-based approaches must access a large portion of the data points in higher dimensions. Therefore, searching an index by
. S. Berchtold is with stb software technologie beratung gmbh, Ulrichsplatz 6, 86150 Augsburg, Germany. E-mail: [email protected].
. D.A. Keim is with the Computer Science Institute, University of Halle-Wittenberg, Kurt-Mothes Str. 1, D-06099 Halle (Saale), Germany. E-mail: [email protected].
. H.-P. Kriegel and T. Seidl are with the Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 Muenchen, Germany. E-mail: {kriegel, seidl}@informatik.uni-muenchen.de.

Manuscript accepted 8 June 1999. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 110013.
expanding the query region is, in general, inefficient in high dimensions. One way out of this dilemma is exploiting parallelism for an efficient nearest neighbor search, as we did in [8]. In this paper, we suggest a new solution to sequential nearest neighbor search which is based on precalculating and indexing the solution space instead of indexing the data. The solution space may be characterized by a complete and overlap-free partitioning of the data space into cells, each containing exactly one data point. Each cell consists of all potential query points which have the corresponding data point as a nearest neighbor. The cells therefore correspond to the d-dimensional Voronoi cells [24]. Determining the nearest neighbor of a query point now becomes equivalent to determining the Voronoi cell in which the query point is located. Since the Voronoi cells may be rather complex high-dimensional polyhedra which require too much disk space when stored explicitly, we approximate the cells by minimum bounding (hyper-) rectangles and store them in a multidimensional index structure such as the X-tree [9]. The nearest neighbor query now becomes a simple point query which can be processed efficiently using the multidimensional index. In order to obtain a good approximation quality for high-dimensional cells, we additionally introduce a new decomposition technique for high-dimensional spatial objects. The paper is organized as follows: In Section 2, we cover related work and briefly discuss the problems that occur in indexing high-dimensional space. Section 3 then introduces our new solution to the nearest neighbor problem, which is based on approximating the solution space. We formally define the solution space as well as the necessary cell approximations and, in Section 4, we outline an efficient algorithm for
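The overall pipeline can be illustrated with a brute-force sketch. This is illustrative only: the paper computes conservative (outer) Voronoi-cell approximations and stores them in an X-tree, whereas the hypothetical stand-in below approximates each cell by the bounding box of sample points assigned to it (a non-conservative approximation), so the point query keeps an exact refinement step over the candidate set:

```python
import random

def sample_voronoi_mbrs(points, dim, n_samples=20000):
    """Approximate each point's Voronoi cell by the bounding box of the
    random samples falling into the cell.  NOTE: this yields an *inner*
    approximation; the paper uses *conservative* bounding rectangles."""
    boxes = [[[1.0] * dim, [0.0] * dim] for _ in points]  # [lower, upper]
    for _ in range(n_samples):
        s = [random.random() for _ in range(dim)]
        i = min(range(len(points)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(s, points[j])))
        lo, hi = boxes[i]
        for k in range(dim):
            lo[k] = min(lo[k], s[k])
            hi[k] = max(hi[k], s[k])
    return boxes

def nn_query(q, points, boxes):
    """Nearest neighbor as a point query on the cell approximations;
    overlapping approximations are refined by exact distance comparison."""
    cand = [i for i, (lo, hi) in enumerate(boxes)
            if all(l <= x <= h for x, l, h in zip(q, lo, hi))] or range(len(points))
    return min(cand, key=lambda j: sum((a - b) ** 2 for a, b in zip(q, points[j])))
```

In the paper's setting, the candidate lookup is a point query on a multidimensional index rather than a linear scan over the boxes.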
1041-4347/00/$10.00 © 2000 IEEE
determining the high-dimensional cell approximations. Section 5 introduces improved algorithms for calculating the approximations and describes the incremental algorithms which are necessary for dynamic databases. In Section 6, we then discuss the problems related to indexing the high-dimensional cell approximations and introduce our solution, which is based on a new decomposition of the approximations. In Section 7, we present an experimental evaluation of our new approach using uniformly distributed, as well as real, data. The evaluation reveals significant speed-ups over R*-tree- and X-tree-based nearest neighbor search.
2 MOTIVATION
In high-dimensional data spaces, a broad variety of mathematical effects can be observed when one increases the dimensionality of the data space. These effects are subsumed by the term "curse of dimensionality" because they are nonintuitive and, also, devastating for the performance of multidimensional index structures. Generally speaking, the problem is that important parameters such as volume and area depend exponentially on the number of dimensions of the data space. Therefore, most index structures proposed so far operate efficiently only if the number of dimensions is fairly small. The effects are nonintuitive because we are used to dealing with three-dimensional spaces in the real world, where these effects do not occur. Many people even have trouble understanding spatial relations in three-dimensional spaces; however, no one can "imagine" an eight-dimensional space. Rather, we always try to find a low-dimensional analogy when dealing with such spaces. To demonstrate how much we stick to our understanding of low-dimensional spaces, consider the following lemma: Consider a cubic-shaped d-dimensional data space of extension [0, 1]^d. We define the center point c of the data space as the point (0.5, ..., 0.5). The lemma "Every d-dimensional sphere touching (or intersecting) the (d - 1)-dimensional boundaries of the data space also contains c" is obviously true for d = 2, as one can see from Fig. 1. Spending some more effort and thinking, we are able to also prove the lemma for d = 3. However, the lemma is definitely false for d = 16, as the following counterexample shows. Define a sphere around the point p = (0.3, ..., 0.3). This point p has a Euclidean distance of sqrt(d * 0.2^2) = 0.8 from the center point. If we define the sphere around p with a radius of 0.7, the sphere will touch (or intersect) all 15-dimensional surfaces of the space. However, the center point is not included in the sphere. We have to be aware of the fact that effects like this are not only nice mathematical properties, but also lead to severe conclusions for the performance of index structures. The most basic effect is the exponential growth of volume. The volume of a cube in a d-dimensional space is given by vol = e^d, where d is the dimension of the data space and e is the edge length of the cube. Now, if the edge length is a number between 0 and 1, the volume of the cube will exponentially decrease when increasing the dimension. Viewing the problem from the opposite side, if we want to define a cube of constant volume for
Fig. 1. Spheres in high-dimensional spaces.
increasing dimensions, the appropriate edge length will quickly approach 1. For example, in a two-dimensional space of extension [0, 1]^d, a cube of volume 0.25 has an edge length of 0.5, whereas, in a 16-dimensional space, the edge length has to be 0.25^(1/16) ≈ 0.917. Another important issue is the space partitioning one can expect in high-dimensional spaces. Usually, index structures split the data space using
(d - 1)-dimensional hyperplanes; for example, in order to perform a split, the index structure selects a dimension (the split dimension) and a value in this dimension (the split value). All data items having a value in the split dimension smaller than the split value are assigned to the first partition, whereas the other data items form the second partition. This process of splitting the data space continues recursively until the number of data items in a partition is below a certain threshold and the data items of this partition are stored in a data page. Thus, the whole process can be described by a binary tree, the split tree. As the tree is a binary tree, the height h of the split tree usually depends logarithmically on the number of leaf nodes, i.e., data pages. On the other hand, the number d' of splits for a single data page is, on average,

    d' = log2(N / C_eff(d)),

where N is the number of data items and C_eff(d) is the capacity of a single data page. Thus, we can conclude that if all dimensions are equally used as split dimensions, a data page has been split at most once or twice in each dimension and, therefore, spans a range between 0.25 and 0.5 in each of the dimensions (for uniformly distributed data). From that, we may conclude that the majority of the data pages is located at the surface of the data space rather than in the interior. Additionally, this obviously leads to a coarse data space partitioning in single dimensions. However, from our understanding of index structures such as the R*-tree, which had been designed for geographic applications, we are used to very fine partitions where the majority of the data pages is in the interior of the space, and we have to be careful not to apply this understanding to high-dimensional spaces. Fig. 2 depicts the different configurations. Note that this effect applies to almost any index structure proposed so far because we only made assumptions about the split algorithm.
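The arithmetic behind these observations is easy to reproduce. The sketch below checks the d = 16 sphere counterexample, the edge length of a constant-volume cube, and the expected number of splits per data page; N and C_eff are hypothetical values chosen only for illustration:

```python
import math

d = 16

# 1. Sphere counterexample: p = (0.3, ..., 0.3), radius r = 0.7.
p, c, r = [0.3] * d, [0.5] * d, 0.7
dist_pc = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
print(dist_pc)           # ~0.8 > 0.7, so the center c lies outside the sphere
# p still touches every (d-1)-dimensional surface: its distances to the
# surfaces x_i = 0 and x_i = 1 are 0.3 and 0.7, both <= r.

# 2. Edge length of a cube of volume 0.25:
print(0.25 ** (1 / 2))   # 0.5 in two dimensions
print(0.25 ** (1 / d))   # ~0.917 in 16 dimensions

# 3. Splits per data page: d' = log2(N / C_eff(d)), with assumed values.
N, C_eff = 1_000_000, 30
splits = math.log2(N / C_eff)
print(splits)            # ~15, i.e., about one split per dimension for d = 16,
                         # so each page spans roughly 0.5 of every dimension
```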
Fig. 2. Space partitioning in high-dimensional spaces.
Additionally, it is not only index structures that show strange behavior in high-dimensional spaces; the expected distribution of the queries is also affected by the dimensionality of the data space. If we assume a uniform data distribution, the selectivity of a query (the fraction of data items contained in the query) directly depends on the volume of the query. In the case of nearest-neighbor queries, the query affects a sphere around the query point which contains exactly one data item, the NN-sphere. According to [5], the radius of the NN-sphere increases rapidly with increasing dimension. In a data space of extension [0, 1]^d, it quickly reaches a value larger than 1 when increasing d. This is a consequence of the above-mentioned exponential relation of extension and volume in high-dimensional spaces. Considering all these effects, we can conclude that if one builds an index structure using a state-of-the-art split algorithm, the performance will deteriorate rapidly when increasing the dimensionality of the data space. This has been realized not only in the context of multimedia systems [5], where nearest-neighbor queries are most relevant, but also in the context of data warehouses, where range queries are the most frequent type of query [3], [4].
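A rough back-of-the-envelope estimate of this growth (ours, not the derivation in [5]) sets the volume of a d-dimensional ball equal to 1/N, the expected fraction of the unit cube per data item; boundary effects are ignored:

```python
import math

def nn_sphere_radius(d, N):
    """Radius r such that a d-dimensional ball of radius r has volume 1/N,
    i.e., is expected to contain one of N uniformly distributed points.
    Ball volume: pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return (math.gamma(d / 2 + 1) / (N * math.pi ** (d / 2))) ** (1 / d)

for d in (2, 4, 8, 16, 32, 64):
    print(d, round(nn_sphere_radius(d, 1_000_000), 3))
# For N = 10^6, the estimated radius climbs steeply with d and exceeds 1
# (the full extension of the data space) between d = 32 and d = 64.
```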
2.1 High-Dimensional Index Structures

A variety of multidimensional index structures has been proposed in the past (e.g., [26], [9], [11]). Most of these index structures have been designed to be efficient in low-dimensional data spaces, i.e., for data items having up to three or four attributes. Therefore, these index structures are preferably used in geographical applications where only two attributes occur. In high-dimensional spaces, a variety of effects arise which deteriorate most of the low-dimensional index structures. For example, the directory of R*-trees degenerates when going to higher dimensions because of massive overlap in the directory, which is due to the inappropriate split algorithm. Other index structures suffer from the exponential growth of the space; e.g., a 16-dimensional Quad-Tree has a fanout of 2^d = 2^16 = 65,536, which leads to underfilled data nodes. Therefore, recently, some index structures have been proposed which especially focus on high-dimensional spaces. In [21], Lin et al. presented the TV-tree, which is an R-tree-like index structure. The TV-tree is based on the concept of telescope vectors (TV). The basic idea is to treat attributes asymmetrically. For example, all data items in a data page may have one attribute value in common so that storing this attribute is redundant. On the other hand, we may achieve enough selectivity in the directory using only a few of the attributes. Telescope vectors therefore divide the attributes into three classes: attributes which are common to all corresponding data items, attributes which are used to build the directory, and attributes which are ignored. The major
drawback of the TV-tree is that we require information about the behavior of the single attributes, e.g., their selectivity. Another R-tree-like high-dimensional index structure is the SS-tree [33], which uses spheres instead of bounding boxes in the directory. Although the SS-tree clearly outperforms the R*-tree, spheres tend to overlap in high-dimensional spaces, too. Thus, recently, an improvement of the SS-tree has been proposed in [20], where the concepts of R-trees and the SS-tree are integrated in one new index structure, the SR-tree. The directory of the SR-tree consists of spheres (SS-tree) and hyper-rectangles (R-tree) such that the area corresponding to a directory entry is the intersection between the sphere and the hyper-rectangle. Therefore, the SR-tree outperforms both the R*-tree and the SS-tree. In [18], Jain and White introduced the VAM-Split R-tree and the VAM-Split KD-tree. Both are static index structures, i.e., all data items must be available at the creation time of the index. VAM-Split trees are rather similar to KD-trees [26]; however, in contrast to KD-trees, splits are not performed at the 50 percent-quantile of the data in the split dimension; instead, the dimension in which the maximum variance occurs is chosen as the split dimension. VAM-Split trees are built in main memory and then written to secondary storage. Therefore, the size of a VAM-Split tree is limited by the available main memory. In [9], the X-tree has been proposed, an index structure which adapts the algorithms of R*-trees to high-dimensional data using two techniques: First, the X-tree introduces an overlap-free split algorithm which is based on the split history of the tree. Second, if the overlap-free split algorithm would lead to an unbalanced directory, the X-tree omits the split and the directory node accordingly becomes a so-called supernode. Supernodes are directory nodes which are enlarged by a multiple of the block size. The X-tree outperforms the R*-tree by a factor of up to 400 (point queries). However, the dynamic construction of an X-tree is very time-consuming. To overcome this drawback, Berchtold et al. recently proposed a bottom-up construction technique for the X-tree [3]. Additionally, they introduced the concept of unbalanced partitioning. This concept is motivated by the fact that queries of a reasonable selectivity have a very large extension in high-dimensional spaces. Therefore, when processing range queries, a split at the 50 percent-quantile is suboptimal and leads to excessive page accesses. Instead, one should split the space at, e.g., the 10 percent-quantile. However, in order to create a correct X-tree having no underfilled nodes, the concept of unbalanced splits is still restricted with respect to the choice of the quantile. Therefore, Berchtold et al. developed the Pyramid-Technique [4]. The Pyramid-Technique is a mapping from a d-dimensional space into a one-dimensional space. By mapping the data items into a one-dimensional space, a B+-tree can be used as an efficient index structure. The Pyramid-Technique partitions the space into partitions shaped like the peels of an onion. As experiments show, the Pyramid-Technique is very efficient for almost cubic-shaped range queries.
NN-Cell(T) := { x ∈ DS | ∀ P ∈ T, ∀ P' ∈ DB − T : d(x, P) ≤ d(x, P') }.
Fig. 3. Voronoi diagram and NN-diagram. (a) Voronoi diagram of order 2 (cf. [24]). (b) NN-diagram.
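In plain terms, a query point x lies in the NN-cell of a set T exactly when every point of T is at least as close to x as every point outside T. A brute-force membership test (illustrative only; Euclidean distance assumed):

```python
import math

def dist(x, p):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

def in_nn_cell(x, T, DB):
    """True iff x is in NN-Cell(T): every P in T is at least as close to x
    as every P' in DB - T (brute force over all point pairs)."""
    rest = [p for p in DB if p not in T]
    return all(dist(x, P) <= dist(x, Pp) for P in T for Pp in rest)

DB = [(0.2, 0.2), (0.8, 0.8), (0.2, 0.8)]
print(in_nn_cell((0.1, 0.1), [DB[0]], DB))  # True: (0.2, 0.2) is nearest
```

For |T| = 1 this is exactly the ordinary Voronoi cell of a single data point; the precomputation in the paper makes such tests unnecessary at query time.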
3 APPROXIMATING THE SOLUTION SPACE
Our new approach to solving the nearest neighbor problem is based on precalculating the solution space. Precalculating the solution space means determining the Voronoi diagram (cf. Fig. 3a) of the data points in the database. In the following, we recall the definition of Voronoi cells as provided in [27].

Definition 1 (Voronoi Cell, Voronoi Diagram). Let DB be a database of points. For any subset A ⊆ DB of size m := |A|, 1 ≤ m < N, and a given distance function d :