Effective Nearest Neighbor Indexing with the Euclidean Metric

Sang-Wook Kim
Division of Computer, Information, and Communications Engineering
Kangwon National University
[email protected]

Charu C. Aggarwal and Philip S. Yu
IBM T. J. Watson Research Center
{charu, psyu}@us.ibm.com

ABSTRACT

Nearest neighbor search is an important operation widely used in multimedia databases. In high dimensions, most previous methods for nearest neighbor search become inefficient and must compute distances to a large fraction of the points in the space. In this paper, we present a new approach for processing nearest neighbor searches with the Euclidean metric that searches over only a small subset of the original space. This approach effectively approximates clusters by encapsulating them into geometrically regular shapes and also computes better upper and lower bounds of the distances from the query point to the clusters. To show the effectiveness of the proposed approach, we perform extensive experiments. The results reveal that the proposed approach significantly outperforms the X-tree as well as the sequential scan.

Keywords: Similarity search, nearest neighbor queries, multimedia databases, high dimensional indexes, Euclidean metric

1. Introduction

Similarity search is an important issue in the field of multimedia databases[5]. Often, it may be desirable to provide the functionality of searching for similar images in a database. The features of images can be represented as points, called feature vectors, in high dimensional space[1][2][12]. These points represent information about color histograms, textures, or other descriptors of the images. The points are often stored in some form of index that facilitates various types of queries on the database. One of the queries that helps in providing the functionality for similarity search is the nearest neighbor query[3][4][14]. The nearest neighbor query is formulated as follows: for a given target query point t, find the point in the database that has the shortest distance from t. The k-nearest neighbor query is the generalization of the nearest neighbor query, and requires us to find the k points closest to the given target point t.

Various distance functions may be used in order to determine the notion of proximity. Recent results show that for many distance functions in high dimensionality, the concept of proximity may not be very well defined[11]. Furthermore, for many applications, the distance function is heuristic to begin with, so the nearest neighbor problem may be viewed from many novel perspectives. For example, locality-specific projections[11] could be used in order to find the nearest neighbors in projections which are based on dimensional selectivity. Another alternative is to redefine the distance function in order to make it more meaningful and effective. For such applications, we show that it is possible to improve the nearest neighbor search both qualitatively and from a performance perspective. For some problems, however, the distance function is pre-defined, and there is no way of avoiding the sparsity effects of high dimensionality in such cases. The most commonly-used distance function is the Euclidean metric, and the results of this paper are tailored to applications that employ this particular metric as a pre-defined measure. In such cases, it becomes important to provide ways of performing the search more efficiently.

For efficient processing of nearest neighbor queries, there have also been many research efforts on high dimensional indexing, such as R*-trees[8], X-trees[6], M-trees[10], SR-trees[13], TV-trees[15], and SS-trees[19]. Weber et al.[18] proved that the sequential scan always outperforms tree-based multidimensional indexes for uniformly-distributed data whenever the dimensionality is above 10. To overcome this problem, they proposed an approximation-based scheme with the VA-file, a set of bit-compressed versions of the points. Recently, Berchtold et al.[7] proposed a hybrid approach combining the VA-file and a tree-based index.

The performance of previous multidimensional indexes, which use multidimensional rectangles and/or spheres to represent the capsule of a point cluster, deteriorates seriously as the number of dimensions grows. In this paper, we first point out that this simple representation of capsules incurs performance degradation in processing nearest neighbor queries. To alleviate the problem, we propose (1) adopting new coordinate systems appropriate to a given cluster, (2) representing various shapes of capsules by using hyperspheres, and (3) maintaining outliers separately. Our approach effectively approximates clusters by encapsulating them into geometrically regular shapes and also quickly computes better upper and lower bounds of the distances from the query point to the clusters. We also propose an efficient algorithm that touches only a small fraction of the original space by exploiting the sparsity of the search space in finding the nearest neighbor for a query point. The proposed approach is also easily extended to the k-nearest neighbor problem.

This paper is organized as follows. In Section 2, we review the fundamentals of the branch and bound method. In Section 3, we discuss how to tightly approximate clusters in high dimensional space by using capsules that encapsulate them. More specifically, we deal with the cases of rectangular and ellipsoidal capsules, and point out their weaknesses. In Section 4, we discuss how to compute the lower bound distances to the clusters. In Section 5, we present a variation on the theme of ellipsoidal capsules that allows a quicker computation of the distance to the surface of the capsule. In Section 6, we present the performance evaluation of the proposed approach. Finally, we briefly summarize and conclude the paper in Section 7.
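For later reference, the sequential scan against which tree-based methods are compared is just a linear pass computing Euclidean distances. The following minimal NumPy sketch (the helper name and interface are illustrative, not from the paper) finds the k nearest neighbors this way:

import numpy as np

def knn_sequential_scan(points, query, k=1):
    """Brute-force k-nearest-neighbor search under the Euclidean metric.
    points: (N, d) array of feature vectors; query: (d,) target point t.
    Returns the indices of the k closest points, nearest first."""
    # Squared distances preserve the nearest-neighbor ordering, so the
    # square root can be skipped during the scan.
    d2 = np.sum((points - query) ** 2, axis=1)
    k = min(k, len(d2))
    idx = np.argpartition(d2, k - 1)[:k]   # the k smallest, unordered, in O(N)
    return idx[np.argsort(d2[idx])]        # sort only those k

This baseline always reads the entire data set; the point of the branch and bound machinery reviewed next is to avoid exactly that.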

2. Branch and Bound Method

An important approach to the nearest neighbor problem proceeds by first arranging the points in the multidimensional space hierarchically, in either an index tree or a cluster tree, and then searching the tree using a branch and bound method[16]. Branch and bound is a classical technique from combinatorial optimization, often used to prune a large portion of the solution space with the confidence that the pruned points are farther away than an estimated upper bound on the nearest neighbor distance. Figure 1 gives a brief overview of the branch and bound method for nearest neighbor search.

The branch and bound method is based on tree traversal utilizing a hierarchical decomposition of the data points into small clusters. For each node in the tree, a lower bound can be computed for the distance between the target query point and all the points within that node. At the same time, a global upper bound is maintained on the true distance from the current nearest neighbor to the target point. When the global upper bound for the nearest neighbor distance is less than the local lower bound for the distance of any possible point in the node to the target point, we can safely prune that node from further contention. The efficiency of the branch and bound method depends upon the quality of the decomposition and the tightness of the lower and upper bounds calculated. The order in which the nodes of the tree are visited also affects the running time of the procedure.

The use of branch and bound methods for nearest neighbor search has been explored in earlier work[1][3][16][17]. The primary weakness of these approaches is that they rely on index structures that decompose the solution space into minimum bounding rectangles along pre-decided directions that are typically parallel to the axes. The use of such minimum bounding rectangles is effective for quick calculation of the lower and upper bounds on the distances. This eases the problem of calculating lower bounds; however, it incurs a new problem: the approximations thus created can occupy volumes much larger than the clusters do in reality.

Algorithm BranchAndBound(TargetPoint: t; DecompositionTree: T)
  Initialize UpperBound to the minimum of the distances of t from a random sample of points in the entire database.
  Examine the nodes n of the tree one by one in a given order (say, in a depth-first fashion):
    If n is a leaf node Then
      For each point p in that node:
        If D(p, t) < UpperBound Then UpperBound = D(p, t).
    Else
      LB(n) = lower bound of the distance from t to any possible point in n.
      UB(n) = upper bound of the distance from t to any possible point in n.
      If UB(n) < UpperBound Then UpperBound = UB(n).
      If LB(n) > UpperBound Then prune n and its descendants from further contention.
      Else recursively call BranchAndBound(t, root of subtree whose parent is n) for each subtree of n.
End

Figure 1. The branch and bound method for nearest neighbor search.
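To make Figure 1 concrete, here is a small self-contained Python rendering of the procedure over a hierarchy of axis-aligned minimum bounding rectangles, the baseline capsule shape discussed above. The Node layout and the MINDIST/MAXDIST bound formulas are standard choices made for this sketch, not the paper's index structure:

import math

class Node:
    """Tree node: a leaf holds points; an internal node holds children.
    lo/hi are the corners of the node's minimum bounding rectangle."""
    def __init__(self, lo, hi, children=(), points=()):
        self.lo, self.hi = lo, hi
        self.children, self.points = list(children), list(points)

def lb(t, n):
    # LB(n): distance from t to the nearest face of n's rectangle
    # (zero along any dimension where t lies within the extent).
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(t, n.lo, n.hi)))

def ub(t, n):
    # UB(n): distance from t to the farthest corner of n's rectangle.
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(t, n.lo, n.hi)))

def branch_and_bound(t, n, best=None):
    """Returns (UpperBound, nearest point found so far)."""
    upper, nn = best if best else (math.inf, None)
    if n.points:                                   # leaf node
        for p in n.points:
            d = math.dist(t, p)
            if d <= upper:
                upper, nn = d, p
    else:
        # Visit children in order of increasing lower bound (best-first).
        for c in sorted(n.children, key=lambda c: lb(t, c)):
            upper = min(upper, ub(t, c))           # UB(n) tightens UpperBound
            if lb(t, c) > upper:
                continue                           # prune c and its descendants
            upper, nn = branch_and_bound(t, c, (upper, nn))
    return upper, nn

The tighter LB(n) and UB(n) are, the earlier whole subtrees are pruned; this is exactly where the arbitrarily aligned capsules developed in Section 3 pay off.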

3. Computing Geometrical Approximations for Clusters of Points

In order to find the lower bound of the distance to any point in a given cluster, we need to maintain a minimum bounding shape of the cluster and then compute the minimum distance from the target point to that minimum bounding shape. Since this is performed during the tree traversal for processing a query, it is important to calculate such distances efficiently and quickly in the course of the algorithm. Most previous methods based on the branch and bound method employ minimum bounding rectangles that are parallel to a given set of axes. In this paper, we explore the use of approximate minimum bounding shapes that can be arbitrarily aligned in the space. This approach has the advantage of greatly improving the lower bound estimates on the distances to the clusters. We investigate the use of different geometrical shapes to encapsulate the clusters. Hereafter, we refer to these shapes as capsules.

In high dimensions, it is often the case that real data points tend to be very sparse and correlated. These correlations cause the points to be aligned along certain hyperplanes in the high dimensional space. Finding these hyperplanes is useful in determining capsules with smaller volumes. Figure 2 shows the effectiveness of introducing such hyperplanes.

Figure 2. Effectiveness of introducing different coordinate systems.

Now, we discuss how the hyperplanes can be found by solving at most k simple optimization problems for each cluster. The k hyperplanes are determined one by one by greedily finding each one that minimizes the mean distance of the points in the cluster to that hyperplane. We note that once r < k hyperplanes have been determined, the next hyperplane is optimally determined in the orthogonal (k-r)-hyperspace. Thus, the last of the k hyperplanes is determined uniquely once the first (k-1) hyperplanes have been determined. The basic algorithm for determining the hyperplanes is illustrated in Figure 3. We now describe how each hyperplane can be determined by solving a simple optimization model; this step is denoted by an asterisk in the description of the algorithm in Figure 3. In order to facilitate the following description, we introduce some notation and terminology.

We denote the hyperplane $\sum_{i=1}^{k} a_i x_i = b$ by $[a_1, a_2, \ldots, a_k, b]$, or more simply $[a, b]$. Let the hyperplanes determined before the $r$-th iteration be $[a^1, b^1], [a^2, b^2], \ldots, [a^{r-1}, b^{r-1}]$. Let $x_1, x_2, \ldots, x_N$ be the points in the cluster C, and let $\bar{x}$ denote the centroid of C. We would like all of the hyperplanes to pass through this centroid $\bar{x}$. In order to find the $r$-th hyperplane that is best aligned, we use the following optimization model:

Minimize $\sum_{j=1}^{N} |a \cdot x_j - b|$

subject to:
$a \cdot a^i = 0 \quad \forall i \in \{1, 2, \ldots, r-1\}$
$a \cdot \bar{x} - b = 0$
$\|a\| = 1$

Once these hyperplanes are determined, we discuss how to find the capsule that contains most of the points in the cluster.
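As a concrete illustration of this hyperplane-fitting step, the sketch below computes a full set of mutually orthogonal hyperplanes through the centroid with NumPy. Note one substitution: it minimizes the mean squared distance rather than the paper's mean absolute distance, because the squared objective has a closed-form greedy solution via the singular value decomposition; the function name and interface are ours, not the paper's:

import numpy as np

def fit_orthogonal_hyperplanes(X):
    """Greedily fit k mutually orthogonal hyperplanes [a, b] through the
    centroid of the cluster X (an N x k array, assuming N >= k), each
    minimizing the mean *squared* distance of the points to the hyperplane
    (a least-squares stand-in for the paper's mean-absolute objective)."""
    centroid = X.mean(axis=0)
    # SVD of the centered points: the rows of Vt are orthonormal directions,
    # ordered by decreasing spread of the points along them.
    _, _, Vt = np.linalg.svd(X - centroid, full_matrices=False)
    # Reverse so the best-fitting hyperplane (normal along the direction of
    # least spread) comes first, matching the greedy order of Figure 3; the
    # orthogonality constraints a . a^i = 0 hold automatically.
    return [(a, float(a @ centroid)) for a in Vt[::-1]]  # [(a^1, b^1), ...]

With the squared objective, each normal a is an eigenvector of the cluster covariance matrix and b = a · x̄, so all k hyperplanes come out of a single decomposition instead of k separate optimizations.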

Algorithm ComputeHyperplanes(Cluster: C; Dimensionality: k)
begin
  r = 0. S(C) = {}.
  while r < k do
    (*) Determine the hyperplane [a, b] through the centroid of C that minimizes the mean distance of the points in C to it, with a orthogonal to the normals of all hyperplanes already in S(C).
    S(C) = S(C) ∪ {[a, b]}. r = r + 1.
  end
end

Figure 3. The algorithm for computing the hyperplanes of a cluster.

3.1. Finding rectangular clusters

We choose the rectangular region R or R'(3.0), whichever is smaller, and denote the resulting rectangular capsule of C by R*(C). In general, R*(C) will contain most of the points in the cluster. Those points within C but outside R*(C) are added to a list of outliers that is maintained separately, and the distances to the points in the outlier list are computed explicitly. For real data points, one expects the size of the outlier list to be just a small fraction of the total number of points.

Since these rectangular capsules are aligned along hyperplanes whose average distances to the points in the clusters are as small as possible, the volume of the capsule tends to be small. This is important because it helps in maintaining the tightness of the upper and lower bound calculations of the distances from the target point to the clusters. Greater tightness in the upper and lower bound estimates substantially improves the pruning behavior of the branch and bound method.

Along with each cluster C, we maintain the coordinate-axis transformation T(C), which is a combination of rotations and linear transformations. T(C) converts the original coordinate system into a new one in which the rectangular capsule is aligned along the axes in the positive quadrant, with one corner at (0, 0, ..., 0). T(C) is uniquely determined by the centroid of the given cluster, the set of orthogonal hyperplanes that were determined to be at the smallest distances to the points, and the size of the rectangular capsule R*(C).
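Putting the pieces of this subsection together, the sketch below builds R*(C) in the transformed coordinates and collects the outlier list. Because the precise definitions of R and R'(3.0) are only partly legible in the source, the clipping rule here (per-axis extent capped at 3.0 standard deviations around the centroid) is our reading, and the helper names are illustrative:

import numpy as np

def rectangular_capsule(X, n_sigma=3.0):
    """Build the rectangular capsule R*(C) of a cluster X (N x k) and its
    outlier list. T(C) is realized here as: translate the centroid to the
    origin, rotate onto the fitted hyperplane directions, then shift so one
    corner of the capsule sits at (0, 0, ..., 0)."""
    centroid = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - centroid, full_matrices=False)
    Y = (X - centroid) @ Vt.T                 # rotate into the cluster's own axes
    # Per axis, take the smaller of the true extent and +/- n_sigma standard
    # deviations (our reading of the "R or R'(3.0), whichever is smaller" rule).
    lo = np.maximum(Y.min(axis=0), -n_sigma * Y.std(axis=0))
    hi = np.minimum(Y.max(axis=0),  n_sigma * Y.std(axis=0))
    inside = np.all((Y >= lo) & (Y <= hi), axis=1)
    outliers = X[~inside]                     # distances to these are computed explicitly
    capsule = hi - lo                         # side lengths of R*(C); corner at the origin
    return capsule, outliers, (centroid, Vt, lo)   # (centroid, Vt, lo) encode T(C)

A query point is then handled by mapping it through T(C) (subtract the centroid, multiply by Vt, subtract lo) and bounding its distance against the axis-aligned box, plus explicit distances to the outlier list.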
