
Hypersphere Indexer

Navneet Panda, Edward Y. Chang, and Arun Qamra
University of California, Santa Barbara, CA

Abstract. Indexing high-dimensional data for efficient nearest-neighbor searches poses interesting research challenges. It is well known that when data dimension is high, the search time can exceed the time required for performing a linear scan on the entire dataset. To alleviate this dimensionality curse, indexing schemes such as locality sensitive hashing (LSH) and M-trees were proposed to perform approximate searches. In this paper, we propose a hypersphere indexer, named Hydex, to perform such searches. Hydex partitions the data space using concentric hyperspheres. By exploiting geometric properties, Hydex can perform effective pruning. Our empirical study shows that Hydex enjoys three advantages over competing schemes for achieving the same level of search accuracy. First, Hydex requires fewer seek operations. Second, Hydex can maintain sequential disk accesses most of the time. And third, it requires fewer distance computations.

1 Introduction

Nearest-neighbor search has generated substantial interest because of a wide range of applications such as text, image/video, and bio-data retrieval. These applications represent objects (text documents, images, or bio-data) as feature vectors in very high-dimensional spaces. A user submits a query to a search engine, which returns objects that are similar to the query. The similarity between two objects is measured by some distance function (e.g., Euclidean) over their feature vectors. The search returns the objects that are nearest to the query object in the high-dimensional vector space. The prohibitive cost of exact nearest-neighbor search has led to interest in approximate nearest-neighbor (A-NN) search, which returns instances (objects) approximately similar to the query instance [1, 10]. The first justification for approximate search is that a feature vector is often an approximate characterization of an object, so we are already dealing with approximations [13]. Second, an approximate set of answers suffices if the answers are relatively close to the query concept. Of late, two approximate indexing schemes, locality sensitive hashing (LSH) [11] and M-trees [8], have become popular. These approximate indexing schemes speed up similarity search significantly (over a sequential scan) by slightly lowering the bar for accuracy. In this paper, we propose a hypersphere indexer, named Hydex, to perform approximate nearest-neighbor searches. First, the indexer finds a roughly central instance among a given set of instances. Next, the instances are partitioned based on their distances from the central instance. Hydex builds an intra-partition indexer (or local indexer) within each partition to efficiently retrieve relevant instances. It also builds an inter-partition indexer to help a query identify a good starting location in a neighboring partition to search for nearest neighbors. A search is conducted by first finding the

partition to which the query instance belongs. Hydex then searches in this and the neighboring partitions to locate nearest neighbors of the query. Through empirical studies on two large, high-dimensional datasets, we show that Hydex significantly outperforms both LSH and M-trees in both IO and CPU time. The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 describes Hydex’s operations. Section 4 presents experimental results. We offer our closing remarks in Section 5.

2 Related Work

Existing indexers can be divided into two categories: coordinate-based and distance-based. The coordinate-based methods work on data instances residing in a vector space by partitioning the space into minimum bounding regions (MBRs). A top-k nearest-neighbor query can be treated as a range query, and ideally, the best matches can be found after examining only a small number of MBRs. Examples of coordinate-based methods are the X-tree [4], the R*-tree [3], the TV-tree [14], and the SR-tree [12], to name a few. A big disadvantage of these tree structures is the exponential decay in performance that accompanies an increase in the dimensionality of data instances [19]. This phenomenon has been reported for the R*-tree, the X-tree, and the SR-tree, among others. The decay in performance can be attributed to the almost exponential number of rectangular MBRs (in the cases of the X-tree and the R*-tree) or spherical MBRs (in the cases of the SR-tree, the TV-tree, and the M-tree)¹ that must be examined to find nearest neighbors. In contrast to these tree structures, each partition of Hydex has only two neighboring partitions, independent of data dimension.

The distance-based methods do not require an explicit vector space, but rely only on the existence of a pairwise distance metric. The M-tree [9] is a representative scheme that uses the distances between instances to build an indexing structure. (Some other distance-based methods are the multi-vantage-point tree [5], the geometric near-neighbor access tree [6], and the spatial approximation tree [16].) Given a query point, the M-tree prunes out instances based on distance. It performs IOs to retrieve relevant data blocks into memory for processing; however, the disk accesses cannot be sequential, since the data blocks can be randomly scattered on disk. In contrast to the M-tree, our proposed indexer, Hydex, can largely maintain a contiguous layout of neighboring partitions on disk, so its IOs can be sequential and hence more efficient.

When data dimension is very high, the cost of supporting exact queries can be higher than that of a linear scan. The work of [11] proposes an approximate indexing strategy using locality sensitive hashing (LSH). This approach attempts to hash nearest-neighbor instances into the same bucket with high probability. A top-k approximate query is supported by retrieving the bucket into which the query point has been hashed. To improve search accuracy, multiple hash functions must be used. Theoretically, the number of hash functions to be used is (n/B)^ρ, where n is the number of instances in the dataset, B is the number of instances that can be accommodated in a single bucket, and ρ is a tunable parameter. As we will show in Section 4, Hydex outperforms LSH in two aspects.

First, Hydex requires fewer IOs than LSH to achieve the same level of search accuracy. Second, when a dataset is relatively static (without a large number of insertions), Hydex can largely maintain its IOs in sequential order, whereas LSH cannot.

¹ To ensure retrieval of the exact set of top-k nearest neighbors, a search needs to examine all 3^d neighboring MBRs [13]. For the tree structures using spherical MBRs, [19] reports that their performance decays almost as rapidly as that of the tree structures using rectangular MBRs.

3 The Hydex Algorithm

Our proposed Hydex aims to improve the performance of nearest-neighbor searches using a three-pronged approach. First, we attempt to reduce the number of IOs. Second, when multiple disk blocks must be retrieved, we make the IO as sequential as possible. Third, when data is finally staged in main memory for processing, we minimize the amount of data to be examined to obtain the top-k approximate nearest neighbors.

3.1 Create — Building the Index

The indexer is created in four steps:
– 1. Finding the instance x_c that is approximately centrally located in the data space,
– 2. Separating the instances into partitions based on their distances from x_c,
– 3. Constructing an intra-partition index in each partition, and
– 4. Creating an inter-partition index.

Choosing the center instance
Input: dataset instances {x_1, ..., x_n}. Output: approximate center instance x_c.
Definition 1. The center instance is the instance at the smallest distance from the centroid of the data instances.
This choice ensures that the instances are roughly uniformly distributed in all directions around the reference point. The centroid of the available instances can be computed in O(n d) time, and finding the instance at the smallest distance from the centroid also takes O(n d) time. Therefore, this step can be accomplished in O(n d) time and O(d) space overall. As reported in http://www.cs.ucsb.edu/~panda/dexa06_long.pdf, choosing a random instance as the center instance leads to only a small degradation in performance.

Partitioning the instances
Input: {x_1, ..., x_n}, x_c, number of instances per partition g. Output: number of partitions n_p, partitions P[1], ..., P[n_p].
Once the center instance has been determined, our next step is to partition the instances in the dataset based on their sorted distances from the center instance. This step requires O(n log n + n d) time and O(n) space. The created partitions (along with the associated index structures developed below) are then placed on disk. Our placement policy aims to achieve two goals. First, we would like to place adjacent partitions contiguously on disk to achieve sequential disk accesses. Second, we need to reserve space within partitions to accommodate insertions of new data instances. Details of the insertion and placement policy may be found at http://www.cs.ucsb.edu/~panda/dexa06_long.pdf.
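To make the construction concrete, the following Python sketch (our illustration, not the authors' code; it assumes the instances are available as a NumPy array X of shape (n, d)) carries out the first two steps: choosing the center instance and partitioning by sorted distance from it. The Delim list it returns is the boundary-distance list used later during search to locate a query's partition.

```python
# Minimal sketch (not the paper's implementation) of center selection and
# distance-based partitioning, assuming X is an (n, d) NumPy array.
import numpy as np

def build_partitions(X: np.ndarray, g: int):
    """Pick an approximate center instance and split the data into
    partitions of g instances by sorted distance from that center."""
    centroid = X.mean(axis=0)                         # O(n d)
    dists_to_centroid = np.linalg.norm(X - centroid, axis=1)
    c = int(np.argmin(dists_to_centroid))             # index of center instance x_c
    dists_to_center = np.linalg.norm(X - X[c], axis=1)
    order = np.argsort(dists_to_center)               # O(n log n)
    # Partitions are consecutive runs of g instances in this sorted order.
    partitions = [order[i:i + g] for i in range(0, len(order), g)]
    # Delim[j] = distance of the first instance of partition j from x_c,
    # searched later to find the partition a query belongs to.
    delim = [float(dists_to_center[p[0]]) for p in partitions]
    return c, partitions, delim
```

Because the partitions are consecutive runs of the distance-sorted order, adjacent partitions can be written out contiguously, which is what later enables sequential IO.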

Intra-partition index
Input: P[1], ..., P[n_p], g. Output: local indices S[1], ..., S[n_p].
An intra-partition index (or local index) is created for each partition and stored on disk along with the instances in the partition. For each instance, this local index stores a sorted list of the other instances in the partition, ranked by their distances from this instance, to allow the algorithm to quickly converge on the instances closest to the query. Maintaining this data structure for all pairs would take O(g^2) space per partition. Instead, in Section 3.2, we detail an approach that stores two reduced sets of information for each instance, lowering the space requirement to O(g log g) per partition. Operations on this local index are explained in Section 3.2, where we discuss query processing.

Inter-partition index
Input: n_p, P[1], ..., P[n_p], g. Output: inter-partition indices I.
We also maintain a data structure that contains neighborhood information across adjacent partitions. This index gives Hydex a good location at which to continue searching for k-NNs when a search transitions from one partition to the next. Intuitively, a good starting point is an instance that is very close to the current set of top-k results. For each instance in a partition, this index stores its nearest neighbor(s) in the adjacent partition(s). Building this index requires O(n g d) time, and the storage required to maintain it is O(n).
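A minimal sketch of the inter-partition index follows (again our illustration, reusing the hypothetical build_partitions output above): for every instance we record its nearest neighbor in the next partition, which is the information the search uses when it crosses a partition boundary.

```python
# Minimal sketch of the inter-partition index: for each instance, its
# nearest neighbor in the adjacent (next) partition.
import numpy as np

def build_inter_partition_index(X, partitions):
    inter = {}  # instance id -> id of its nearest neighbor in the next partition
    for j in range(len(partitions) - 1):
        cur, nxt = partitions[j], partitions[j + 1]
        # O(g^2 d) per pair of adjacent partitions, i.e. O(n g d) overall.
        d = np.linalg.norm(X[cur][:, None, :] - X[nxt][None, :, :], axis=2)
        nearest = nxt[np.argmin(d, axis=1)]
        for i, nn in zip(cur, nearest):
            inter[int(i)] = int(nn)
    return inter
```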

3.2 Search — Querying the Index

Given a query instance, query processing begins by finding the approximate nearest neighbors of the query in the partition to which it belongs. Thereafter, adjacent partitions are retrieved and searched for better nearest neighbors. The set of nearest neighbors improves as we continue to retrieve and process adjacent partitions. Hydex terminates its search for the top-k when the constituents of the top-k set do not change over the evaluation of multiple partitions, or when the query time expires.

Identification of the starting partition
Input: query instance x_q, x_c, Delim. Output: partition number α.

First, we identify the partition to which the query instance x_q belongs. Instead of maintaining the distances of all the instances from the center, we keep a sorted list, Delim, containing only the distance of the starting instance of each partition from x_c; this list is binary-searched for the distance between x_q and x_c. The cost of this binary search is O(log n_p), where n_p denotes the number of partitions.
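In code, this lookup is a standard binary search over Delim; the sketch below (a hypothetical helper, matching the build sketch above) returns the partition number α.

```python
# Minimal sketch of locating the starting partition for a query x_q,
# assuming delim is the sorted list of partition-start distances from x_c.
import bisect
import numpy as np

def starting_partition(x_q, x_center, delim):
    d = float(np.linalg.norm(x_q - x_center))
    # bisect_right counts the partition boundaries <= d, so the query
    # falls into the partition just before that position.
    return max(bisect.bisect_right(delim, d) - 1, 0)
```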

Intra-partition search
Input: x_q, P[α], S[α], I[α]. Output: approximate k-NNs of x_q.

Once the starting partition P[α] has been identified, we retrieve the instances in that partition and the associated local index S[α] from disk (if not yet cached in main memory). We then look for the nearest neighbor of x_q among the instances in P[α], and from that single nearest neighbor we use the intra-partition index S[α] to harvest the approximate k-NNs of x_q. We start by selecting an arbitrary instance x_0 in P[α], and then iteratively find instances closer to x_q than x_0. Figure 1 depicts an example partition P[α]; the local index of x_0 covers nine data instances, x_0, ..., x_8, and the query instance x_q is located at the top of the figure. We use x_0 as an anchor to find instances closer to x_q. Let the distance between x_0 and x_q be a; as Figure 1 shows, a better nearest neighbor than x_0 must lie within a radius of a from x_q. Starting at x_0, we seek to find an instance as close to x_q as possible. The intra-partition index of x_0 contains an ordered list of instances based on their distances

from x_0. For the example in Figure 1, the neighboring points of x_0 appear in the order x_3, x_1, x_4, x_5, x_2, x_6, x_7, x_8 on its sorted list. To find an instance closer than x_0 to x_q, we search this list for instances at a distance of about a from x_0. To do so, we perform a binary search for a on the sorted list and pick the first instance x_n on the list that is closer to x_q than the anchor itself (in this case, x_1 is chosen after x_4 and x_5 are rejected). The next iteration then commences with the new anchor x_n.
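The following sketch illustrates this anchor-hopping iteration (our simplification: the local index is taken to be the full sorted neighbor list with precomputed pairwise distances, rather than the reduced index described later, and the scan order around distance a is one reasonable choice among several). It also encodes the termination rule discussed after the figures: the loop stops when no instance closer to x_q than the current anchor can be found.

```python
# Minimal sketch of the anchor-hopping search inside one partition.
import bisect
import numpy as np

def intra_partition_search(x_q, X, local_index, dist, anchor):
    """local_index[i]: list of instance ids sorted by distance from i;
    dist[(i, j)]: precomputed distance between instances i and j."""
    d_q = lambda i: float(np.linalg.norm(X[i] - x_q))   # distance to the query
    a = d_q(anchor)
    while True:
        neighbors = local_index[anchor]
        anchor_dists = [dist[(anchor, j)] for j in neighbors]
        # An instance closer to x_q lies within radius a of x_q, hence at
        # distance between 0 and 2a from the anchor; start scanning the
        # sorted list near distance a, as the binary search above does.
        start = bisect.bisect_left(anchor_dists, a)
        better = None
        for j in neighbors[start:] + neighbors[:start][::-1]:
            if d_q(j) < a:
                better = j
                break
        if better is None:
            # Converged: the approximate k-NNs of x_q are the k nearest
            # neighbors on this anchor's sorted list.
            return anchor, a
        anchor, a = better, d_q(better)
```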

Fig. 1. Arrangement of instances
Fig. 2. Processing other partitions

The iterations continue until no instance nearer than the current anchor can be found. The approximate top-k instances nearest to x_q are then the k nearest neighbors on the current anchor's intra-partition sorted list. To evaluate the performance of our intra-partition search algorithm, we report the number of distance computations (between instances in the dataset and x_q) in http://www.cs.ucsb.edu/~panda/dexa06_long.pdf.

Size of the local index
Instead of storing the distances of all the instances from x_i in the local index of x_i, if we stored just the distances of log g items, the probability of not finding an instance closer to the query point after evaluating all of them would be 1/2^(log g), which is essentially 1/g. In our local index we therefore store two sets of information for every instance:
– 1. The distances of only O(log g) instances for every instance x_i in the partition. Let L = (l_1, l_2, l_3, ..., l_g) be the ordering of all instances in the partition based on their distances from x_i, and let L_1 = (l_1, l_2, ..., l_{g/2}) and L_2 = (l_{g/2+1}, l_{g/2+2}, ..., l_g) be its two equal halves. In the local index we store the distances of 4 × log(g/2) instances, log(g/2) from each end of L_1 and L_2, selecting exponentially spaced positions: l_1, l_2, l_4, ..., l_{2^⌊log(g/2)⌋} and l_{g/2}, l_{g/2−1}, l_{g/2−3}, ... from L_1, and l_{g/2+1}, l_{g/2+2}, l_{g/2+4}, ..., l_{g/2+2^⌊log(g/2)⌋} and l_g, l_{g−1}, l_{g−3}, ... from L_2. These 4 × log(g/2) instances are chosen so as to cover the entire spectrum of distances of instances in the partition.
– 2. The distances of the t closest instances for every x_i. As the anchor gets closer to the query instance, the number of instances on the list that are examined becomes small (because the distance a to the query instance comes down), and the possibility of not finding an instance closer to the query instance becomes more likely. This situation is addressed by adding the instances closest to x_i in the partition to the index; that is, we add the distances of the instances l_1, l_2, ..., l_t to the index for x_i.
Combining the above two sets, the local index of Hydex uses O(g log g + t g) space per partition. With values of t close to O(log g), the total space required for the local index is therefore O(g log g) per partition and O(n log g) overall. Constructing this local index for all partitions requires O(n g log g + n g d) time.
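The position arithmetic below sketches one way to realize this reduced index (an illustration of the scheme, not the paper's exact index set): exponentially spaced ranks from both ends of each half of an instance's sorted neighbor list, plus the t closest ranks.

```python
# Minimal sketch of which ranks of an instance's sorted neighbor list are
# kept in the reduced local index; the exact position set is our assumption.
import math

def reduced_positions(g: int, t: int = 30):
    half = g // 2
    k = int(math.floor(math.log2(half)))
    front = {2 ** i for i in range(k + 1)}                 # 1, 2, 4, ..., 2^k
    picks = set()
    for offset in (0, half):                               # first half, second half
        picks |= {offset + p for p in front}               # spread from the front
        picks |= {offset + half - (p - 1) for p in front}  # spread from the back
    picks |= set(range(1, t + 1))                          # the t closest neighbors
    return sorted(p for p in picks if 1 <= p <= g)

# For g = 1000 and t = 30 this keeps on the order of 60 ranks per instance,
# i.e. O(log g + t) stored distances instead of g - 1.
print(len(reduced_positions(1000)))
```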

Our empirical studies show that a typical g is on the order of thousands, and t is about 30. For g = 1000, the probability of not finding an instance closer to the query instance after evaluating all the instances in the local index of x_i is less than 10^−10 (it is at most 2^(−4 log(g/2)) = (2/g)^4). With t ≈ 30, this probability, even when the anchor is already close to the query instance, is about 1/2^30 (≈ 10^−9). The number of distances stored in the local index of instance x_i is therefore bounded by 4 × log(g/2) + 30. (Since some of the chosen positions overlap, the number of instances actually stored is lower than this bound.) For g = 1000, the number of instances stored is about 60 (≤ 4 log(1000/2) + 30).

Search in the adjacent partition
As the algorithm proceeds, starting from the partition to which the query instance belongs, we advance to adjacent partitions. Having identified the instance x in a partition close to the query instance, we use the inter-partition index to select a good starting instance in the adjacent partition. Since the inter-partition index contains, for x, the closest instance from the adjacent partition, there is a high probability that it will supply an instance very close to the query instance. This lowers the number of iterations needed to converge to the approximately best instance in the adjacent partition. Figure 2 depicts the scenario when searching adjacent partitions; details of the necessary computations are presented in the appendix of the online version at http://www.cs.ucsb.edu/~panda/dexa06_long.pdf.
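Continuing the search into the next partition then amounts to a lookup in the inter-partition index followed by another run of the intra-partition iteration, as in this short sketch (built on the hypothetical helpers from the previous sketches):

```python
# Minimal sketch of carrying the search into an adjacent partition: the
# inter-partition index supplies a starting anchor close to the current best.
def search_adjacent(x_q, X, best_in_current, inter_index,
                    next_local_index, next_dist):
    start_anchor = inter_index[best_in_current]   # NN of the best instance in the next partition
    return intra_partition_search(x_q, X, next_local_index, next_dist, start_anchor)
```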

4 Experiments

In this section, we present an evaluation of our indexing approach and compare it with existing techniques to demonstrate its effectiveness. We were interested in the following key questions:
– Using random IOs, how does Hydex compare with LSH and the M-tree? (Section 4.2)
– When Hydex maintains sequential IOs, what is the additional gain? (Section 4.2)
Answers to other questions, such as "How do insertions affect sequential access?", "What is the maximum space reservation possible without a decay in performance?", "What percentage of the data does Hydex process compared to competing schemes?", and "What is the impact of a poor center instance?", are presented in the online version of the paper at http://www.cs.ucsb.edu/~panda/dexa06_long.pdf because of space limitations.

4.1 Setup

It would be impossible to compare our method with every published indexing scheme, so we chose to compare Hydex with two representative schemes: LSH and the M-tree. LSH was chosen because it outperforms many traditional schemes and has been used in real applications [18, 7]. An approximate version of the M-tree algorithm was presented in [8]. For a fair comparison, we chose the algorithm presented in [8] and used the code provided by M. Patella; the code is available on request, but it is not part of the basic download available at [17].

Datasets
We used two datasets for our experiments. The first dataset [15] is the same as the largest dataset used by LSH. It contains 275,465 feature vectors. Each of these was a 60-dimensional vector representing texture information of blocks of large aerial photographs. The second dataset contained 314,499 feature vectors. Each

of these was a 144-dimensional vector representing color and texture information for an image. In our experiments on each dataset, we chose 2,000 instances at random and created an index using the rest of the dataset. As in [11], our experiments measure the performance of top-10 approximate NN search, and the results are averaged over all the approximate 10-NN searches.

Comparison metric
The objective of our experiments was to ascertain how quickly approximate results can be obtained using the index. Following [2], the effective error for the 1-nearest-neighbor search problem is defined as

    E = (1/|Q|) Σ_{x_q ∈ Q} (d_index / d* − 1),

where d_index denotes the distance from a query point x_q to the point found by the indexing approach, d* is the distance from x_q to the closest point, and the sum is taken over all queries in the query set Q. For the approximate k-nearest-neighbor problem, as in [11], we measure the distance ratios between the closest point returned and the true nearest neighbor, the 2nd closest returned and the true 2nd nearest neighbor, and so on, and finally average the ratios. In the case of LSH, the sum is calculated over all queries returning at least k results; queries returning fewer than k results are counted as misses.

We compare methods based on the number of disk accesses performed. Our measure of disk accesses takes into account both the seek time and the data transfer time, as outlined below. To gauge the impact of data transfer correctly, we express the disk transfer time in units of the seek time; the number of disk accesses is then the total number of seeks performed, counting both the actual seeks and the seek-equivalents that account for data transfer. To understand the relationship between seek time and data transfer time, consider the time the disk takes to transfer 1 MB of data. Current disk technology can sustain a throughput in the range of 50-70 MB/s, so transferring 1 MB of data takes roughly 14-20 ms, while average seek times are usually around 10 ms. Thus, the ratio of transfer time to seek time ranges from 1.4 to 2. Since the exact time varies from disk to disk, we use this ratio in our calculations and select the higher value (2) to give the competing index structures the maximum benefit. Further, since the other index structures transfer only small chunks of data in each disk access, the time they require to transfer data is assumed to be 0. For completeness, we also present graphs with different ratios between seek and transfer time.

4.2 Performance with disk IOs

We compared the performance of the approaches in the case when the index must be accessed from disk. In this case, the cost of distance computations is insignificant compared to the cost of disk accesses. For LSH we count only the total number of seeks performed; this number is controlled by the parameter l. The space used by our indexing structure is O(log g) per instance, g being the number of instances assigned to a partition. As explained in Section 3.2, when g = 1000, we store approximately 60 distances for each instance. If we limit ourselves to 2 bytes per distance, we get 4 digits of precision. Also, since there are 1000 instances in a partition, their local identifiers need only 10 bits.
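A sketch of this error measure (our illustration; it assumes the per-query lists of returned and exact distances are already sorted by rank):

```python
# Minimal sketch of the effective-error measure: average of d_index/d* - 1
# over ranks and queries, as in [2, 11].
def effective_error(index_dists, exact_dists):
    ratios = []
    for d_idx, d_star in zip(index_dists, exact_dists):   # one pair of lists per query
        for di, ds in zip(d_idx, d_star):                  # rank-by-rank ratio
            ratios.append(di / ds - 1.0)
    return sum(ratios) / len(ratios)

# e.g. effective_error([[1.1, 2.3]], [[1.0, 2.0]]) == 0.125
```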

(a) vs. LSH: random access; (b) vs. LSH: sequential access. Axes: error percentage (x); disk accesses and number of partitions (y).
Fig. 3. Hydex vs. LSH: first dataset

The total space required to store the index entries associated with one instance is therefore 60 × (2 + 10/8) bytes, i.e., 60 × 3.25 bytes. The results are presented in Figures 3 and 4. The x-axis represents the error percentage, and the y-axis the number of disk accesses; the second y-axis, when used, represents the number of partitions. Many of the curves pertaining to Hydex overlap.

Performance under random access
For the first dataset, LSH used l ≈ 70 hash functions to achieve acceptable error rates. Hence the total time taken by LSH is t_LSH = l × (t_seek + t_rot) = 70 × (t_seek + t_rot), where t_seek is the average seek time and t_rot is the average rotational latency. Compared to this, our method has to retrieve whole partitions, so we need to compute the space required to store a partition. For the first dataset, where each instance has 60 features, each partition takes up space g × d + g × β × log g, and, as explained above, β × log g equals 60 × 3.25 bytes per instance when g = 1000. Hence the total space consumed is 1000 × 60 + 1000 × 60 × 3.25 bytes, where, as in LSH, each dimension is stored in 1 byte. The total space required to store a partition is therefore 0.255 MB. The total time consumed is r × t_seek + r × t_rot + r × 0.255 × t_tr, where t_tr is the time taken to transfer 1 MB of data from the disk and r is the number of partitions transferred. The rotational latency t_rot is about half the seek time t_seek, and since t_tr is less than twice the average seek time, the expression simplifies to r × t_seek × (1 + 0.5 + 0.51). (We also present data for other ratios between seek time and transfer time in the graphs shown in Figure 3; in the graph legend, the first value after Hydex is the average seek time in milliseconds and the second value is the data transfer rate in MB/s.) Taking the ratio of the times taken by LSH and by our method, we get

    Speedup = 70 × (t_seek + 0.5 × t_seek) / (r × t_seek × (1 + 0.5 + 0.51)) = (70 × 1.5) / (r × 2.01).

Therefore, if our approach is able to retrieve the approximate set of instances in fewer than (70 × 1.5)/2.01 ≈ 52 partitions, we obtain a speedup. We found (Figure 3(c)) that the average number of partitions that had to be evaluated to achieve the same level of error was 12, so the speedup is ≈ 4 times.

A similar analysis applies to the second dataset. The space required for each partition of the second dataset is 1000 × (144 + 3.25 × 60) bytes

(a) vs. LSH: random access; (b) vs. LSH: sequential access. Axes: error percentage (x); disk accesses and number of partitions (y).
Fig. 4. Hydex vs. LSH: second dataset

(0.339 MB). For the second dataset, LSH required l > 130 to achieve a 15% error rate. Therefore,

    Speedup = 130 × (t_seek + t_rot) / (r × t_seek × (1 + 0.5 + 2 × 0.339)) = (130 × 1.5) / (r × 2.178).

To achieve parity with the time taken by LSH we would need to evaluate r = 89 partitions. We show in Figure 4(c) that in this case the average value of r is 21, so the speedup is again approximately 4.

Performance under sequential access
In this case, we need at most two seek operations, and the rest of the time is spent transferring partitions. Here, we assume that space has been reserved for another g instances in each partition to accommodate insertions. The total time consumed for the first dataset is therefore 2 × t_seek + 2 × t_rot + 2r × 0.255 × t_tr. Since t_tr is less than twice the average seek time and t_rot is about half the average seek time, the expression simplifies to 2 × t_seek × (1.5 + 2r × 0.255). (We also present data for other ratios between seek time and transfer time in the graphs shown in Figure 4; as before, the first value after Hydex in the legend is the average seek time in milliseconds and the second value is the data transfer rate in MB/s.) Taking the ratio of the times taken by LSH and by our method,

    Speedup = 70 × (t_seek + t_rot) / (2 × t_seek × (1.5 + 2r × 0.255)) = (35 × 1.5) / (1.5 + 0.51r).

Therefore, if our approach can retrieve the approximate set of instances in fewer than (34 × 1.5)/0.51 ≈ 100 partitions, we achieve a speedup. Since the average number of partitions that must be evaluated to achieve the same level of error is 12, the speedup is on the order of 8 times. A similar analysis for the second dataset gives

    Speedup = 130 × (t_seek + t_rot) / (2 × t_seek × (1.5 + 2r × 0.339)) = (65 × 1.5) / (1.5 + 0.678r).

To achieve parity with the time taken by LSH we would need to evaluate r = 141 partitions; the average value of r is 21 at 15% error, so the speedup is approximately 6.5 times. We do not present the graph for this case because of space limitations.

Our experiments with the M-tree show that the number of IO operations it performs is much higher on both datasets, which makes using M-trees for these searches prohibitively expensive.
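The cost model above reduces to simple arithmetic in units of the average seek time; the sketch below (our rendering of the model, with rotational latency taken as 0.5 seek and a 1 MB transfer charged at most 2 seeks) reproduces the speedup expressions for the first dataset.

```python
# Disk-cost model sketch, in units of the average seek time.
def speedup_random(l, part_mb, r, trans_ratio=2.0):
    """LSH: l random accesses (seek + 0.5 seek of rotation each).
    Hydex: r partition reads, each a seek, a rotation, and part_mb MB of transfer."""
    t_lsh = l * (1 + 0.5)
    t_hydex = r * (1 + 0.5 + part_mb * trans_ratio)
    return t_lsh / t_hydex

def speedup_sequential(l, part_mb, r, trans_ratio=2.0):
    """Sequential layout: about two seeks plus streaming 2r partitions' worth
    of data (space for another g instances is reserved in each partition)."""
    t_lsh = l * (1 + 0.5)
    t_hydex = 2 * (1 + 0.5) + 2 * r * part_mb * trans_ratio
    return t_lsh / t_hydex

print(speedup_random(70, 0.255, 12))      # first dataset, random IO  -> about 4.4
print(speedup_sequential(70, 0.255, 12))  # first dataset, sequential -> about 6.9
```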

5 Conclusion

We have presented a new approach to efficiently index high-dimensional vector spaces. This approach minimizes IOs to reduce seek overheads. Strategies were also presented

to minimize the number of distance computations performed. We also studied the tradeoff between sequential access with space reservation and random placement of partitions. Experimental comparisons with LSH and the M-tree demonstrated the strong performance of our approach.

References

1. S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proceedings of the 5th SODA, pages 573–582, 1994.
2. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. JACM, 45(6), 1998.
3. N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In ACM SIGMOD Intl. Conf. on Management of Data, pages 322–331, 1990.
4. S. Berchtold, D. Keim, and H. Kriegel. The X-tree: An index structure for high-dimensional data. In 22nd Conference on Very Large Databases, pages 28–39, 1996.
5. T. Bozkaya and M. Ozsoyoglu. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems, 24(3):361–404, 1999.
6. S. Brin. Near neighbor search in large metric spaces. In The VLDB Journal, 1995.
7. J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419–428, 2001.
8. P. Ciaccia and M. Patella. PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In Proceedings of the International Conference on Data Engineering, pages 244–255, 2000.
9. P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. 23rd Int. Conf. on Very Large Databases, pages 426–435, 1997.
10. K. Clarkson. An algorithm for approximate closest-point queries. In Proceedings of the 10th SCG, pages 160–164, 1994.
11. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In The VLDB Journal, pages 518–529, 1999.
12. N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In ACM SIGMOD Int. Conf. on Management of Data, pages 369–380, 1997.
13. C. Li, E. Chang, H. Garcia-Molina, and G. Wiederhold. Clindex: Approximate similarity queries in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering (TKDE), 14(4):792–808, July 2002.
14. K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517–542, 1994.
15. B. S. Manjunath. Airphoto dataset. http://vision.ece.ucsb.edu/download.html.
16. G. Navarro. Searching in metric spaces by spatial approximation. In SPIRE/CRIWG, pages 141–148, 1999.
17. M. Patella. M-tree website. http://www-db.deis.unibo.it/Mtree/download.html.
18. A. Qamra, Y. Meng, and E. Y. Chang. Enhanced perceptual distance functions and indexing for image replica recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27, 2005.
19. R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th Int. Conf. on Very Large Data Bases (VLDB), pages 194–205, 1998.