2017 IEEE 33rd International Conference on Data Engineering

Reverse Query-Aware Locality-Sensitive Hashing for High-Dimensional Furthest Neighbor Search

Qiang Huang
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
[email protected]

Jianlin Feng
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
[email protected]

Qiong Fang
School of Software Engineering, South China University of Technology, Guangzhou, China
[email protected]

Abstract—The c-Approximate Furthest Neighbor (c-AFN) search is a fundamental problem in many applications. However, existing hashing schemes for this problem are designed for internal memory, while older techniques for external memory, such as the furthest-point Voronoi diagram and tree-based methods, are only suitable for low-dimensional data. In this paper, we introduce the novel concept of a Reverse Locality-Sensitive Hashing (RLSH) family, which is directly designed for c-AFN search. Accordingly, we propose two novel hashing schemes, RQALSH and RQALSH∗, for high-dimensional c-AFN search over external memory. Experimental results validate the efficiency and effectiveness of RQALSH and RQALSH∗.

I. INTRODUCTION AND PRIOR WORK

Similar to the problem of Nearest Neighbor (NN) search, the problem of Furthest Neighbor (FN) search is fundamentally important in many applications. For example, in collaborative filtering recommendation systems, FN search has been used to provide more diverse recommendations [1]. Moreover, FN search is a key component of many fundamental problems, such as maximum spanning trees [2], complete-linkage clustering [3], and non-linear dimensionality reduction [4].

Due to the difficulty of finding an efficient method for exact FN search in high-dimensional space, researchers study the approximate version of the problem, named c-Approximate Furthest Neighbor (c-AFN) search [5]–[8]. The most efficient methods are the hashing schemes [6]–[8]. Indyk et al. [6] propose the first hashing scheme based on random projections, but it requires a complicated transformation for c-AFN search. Later, Pagh et al. [7] reuse the data structure of [6] and introduce a query strategy named QDAFN (Query-Dependent Approximate Furthest Neighbor) that solves the c-AFN problem directly. Most recently, Curtin et al. [8] propose a heuristic method named DrusillaSelect, which selects a set of candidates using the data distribution instead of random projections. However, all these hashing schemes are designed for c-AFN search over internal memory. They tend to check a large number of candidates to achieve high accuracy, which would incur a large number of random I/Os if they were adapted for c-AFN search over external memory. To the best of our knowledge, there is no efficient external-memory method for high-dimensional c-AFN search.

Locality-Sensitive Hashing (LSH) [9] and its variants [10], [11] are widely used for high-dimensional c-Approximate NN search due to their theoretical guarantees and excellent practical performance. LSH schemes use a set of "locality-sensitive" hash functions to partition "close" objects into the same buckets with high probability. However, traditional LSH schemes are designed for finding the objects "close" to a query $q$. To find the objects "far apart" from $q$, we may need to check all the objects in the remaining buckets that do not collide with $q$, which makes these schemes too expensive for c-AFN search. In this paper, we introduce the novel concept of a Reverse LSH family directly designed for c-AFN search and develop new reverse query-aware LSH functions accordingly (Section III). We propose two novel hashing schemes, RQALSH and RQALSH∗, for high-dimensional c-AFN search over external memory (Section IV). Extensive experiments show that our proposed methods outperform two state-of-the-art methods, namely QDAFN and DrusillaSelect (Section V).

II. PROBLEM DEFINITION

Let $D$ be a database of $n$ data objects in $d$-dimensional Euclidean space $\mathbb{R}^d$. We focus on the Euclidean distance $\|o - q\|$ between any two objects $o$ and $q$. Given an approximation ratio $c$ ($c > 1$) and a query $q$, c-AFN search finds an object $o \in D$ such that $\|o - q\| \geq \|o^* - q\| / c$, where $o^* \in D$ is the FN of $q$. Correspondingly, c-approximate $k$-furthest-neighbor (c-$k$-AFN) search finds $k$ objects $o_i \in D$ ($1 \leq i \leq k$) such that $\|o_i - q\| \geq \|o_i^* - q\| / c$, where $o_i^* \in D$ is the exact $i$th FN of $q$.

Similar to the hashing scheme proposed by Indyk et al. [6], RQALSH directly solves the decision version of c-AFN search, named $(R, c)$-Far Neighbor ($(R, c)$-FN) search: given a search radius $R$ and an approximation ratio $c$ ($c > 1$), $(R, c)$-FN search returns an object $o \in D$ such that $\|o - q\| \geq R/c$ whenever there exists an object $o' \in D$ such that $\|o' - q\| \geq R$.
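For concreteness, the exact FN baseline used later as LINEAR (Section V) is a brute-force scan. The following is a minimal sketch of exact k-FN search and the c-AFN acceptance check, assuming NumPy; the function names are ours, not the paper's.

```python
import numpy as np

def exact_k_fn(data: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Brute-force exact k-furthest-neighbor search (the LINEAR baseline).

    data: (n, d) array of n objects in R^d; q: (d,) query vector.
    Returns the indices of the k objects furthest from q in Euclidean distance.
    """
    dists = np.linalg.norm(data - q, axis=1)   # ||o - q|| for every o in D
    return np.argsort(-dists)[:k]              # indices by decreasing distance

def is_c_afn(data: np.ndarray, q: np.ndarray, o_idx: int, c: float) -> bool:
    """Check the c-AFN condition ||o - q|| >= ||o* - q|| / c for a returned object."""
    fn_dist = np.linalg.norm(data - q, axis=1).max()  # exact FN distance ||o* - q||
    return np.linalg.norm(data[o_idx] - q) >= fn_dist / c
```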

III. REVERSE QUERY-AWARE LSH FAMILY

A. Reverse LSH Family

A family of RLSH functions is expected to have the following separation property: the probability that two objects $o$ and $q$ are separated into different buckets increases monotonically as their distance $\|o - q\|$ increases. If $o$ and $q$ are separated into different buckets by an RLSH function $h$, we say $o$ and $q$ separate under $h$. Formally, we define the RLSH function family (or simply the RLSH family) as follows:

Definition 1 (RLSH Family): Given a search radius $R$ and an approximation ratio $c$ ($c > 1$), an RLSH function family $\mathcal{H} = \{h : \mathbb{R}^d \to U\}$ is said to be $(R, R/c, p_1, p_2)$-sensitive if $p_1 > p_2$ and, for any objects $o, q \in \mathbb{R}^d$:
- If $\|o - q\| \geq R$, then $\Pr_{h \in \mathcal{H}}[o \text{ and } q \text{ separate under } h] \geq p_1$;
- If $\|o - q\| < R/c$, then $\Pr_{h \in \mathcal{H}}[o \text{ and } q \text{ separate under } h] \leq p_2$.

Fig. 1. Virtual Rehashing of RQALSH for c = 2. (The figure shows the projections of data objects along two random lines a1 and a2, with the nested separation intervals I^4, I^2, and I^1 centered at the projection of q.)

B. $(1, 1/c, p_1, p_2)$-Sensitive Reverse Query-Aware LSH Family

In order to quickly identify the objects "far apart" from a query $q$, we propose a new reverse query-aware hash family, which reuses the concept of query-awareness from [11] and its hash function as the reverse query-aware hash function. Formally, a reverse query-aware hash function $h_a(\cdot) : \mathbb{R}^d \to \mathbb{R}$ maps a $d$-dimensional object $o$ to a value along a real line identified by a random vector $a$, whose entries are drawn independently from the standard normal distribution $N(0, 1)$. For a fixed $a$, $h_a(\cdot)$ is defined as follows:
$$h_a(o) = a \cdot o \qquad (1)$$
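As a minimal illustration of Equation 1 (our own sketch, not the authors' code; the seed and function names are assumptions), drawing the random line $a$ and evaluating $h_a(o)$ takes a few lines:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def sample_hash_vector(d: int) -> np.ndarray:
    """Draw a random line a with entries i.i.d. from N(0, 1)."""
    return rng.standard_normal(d)

def h(a: np.ndarray, o: np.ndarray) -> float:
    """Reverse query-aware hash value h_a(o) = a . o (Equation 1)."""
    return float(np.dot(a, o))
```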

We construct reverse query-aware hash functions in two steps: random projection and query-aware interval identification. We first compute the projections of all data objects in the pre-processing step. When a query $q$ arrives, instead of identifying the "close" objects $o$ that collide with $q$ in the same bucket, i.e., $|h_a(o) - h_a(q)| \leq \frac{w}{2}$, we mainly consider the "far-apart" objects $o$ that are separated from $q$, i.e., $|h_a(o) - h_a(q)| > \frac{w}{2}$. The interval $(-\infty, h_a(q) - \frac{w}{2}) \cup (h_a(q) + \frac{w}{2}, +\infty)$ is called the separation interval, and the way we identify it is called query-aware interval identification. If the projection of an object $o$ falls in the separation interval, i.e., $|h_a(o) - h_a(q)| > \frac{w}{2}$, we say $o$ and $q$ separate under $h_a(\cdot)$. Here, the term "reverse" in reverse query-aware hash functions indicates that we use the "reverse inequality" to identify the "far-apart" objects.
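The separation test is then a single comparison against the query-centered interval; the sketch below (ours) assumes the bucket width w is given as a parameter:

```python
import numpy as np

def separates(a: np.ndarray, o: np.ndarray, q: np.ndarray, w: float) -> bool:
    """True iff o falls in the separation interval of q under h_a,
    i.e., |h_a(o) - h_a(q)| > w / 2 (the 'reverse inequality')."""
    return abs(float(np.dot(a, o)) - float(np.dot(a, q))) > w / 2.0
```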

Let $s = \|o - q\|$ be the Euclidean distance between any two objects $o$ and $q$. Due to the stability of $N(0, 1)$, $(a \cdot o - a \cdot q)$ is distributed as $sX$, where $X$ is a random variable drawn from $N(0, 1)$ [9]. Let $\varphi(x)$ be the Probability Density Function (PDF) of $N(0, 1)$, i.e., $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$. The separation probability between $o$ and $q$ is computed as follows:
$$p(s) = \Pr_a\!\left[\,|h_a(o) - h_a(q)| > \tfrac{w}{2}\,\right] = \Pr\!\left[\,|sX| > \tfrac{w}{2}\,\right] = 2\int_{-\infty}^{-\frac{w}{2s}} \varphi(x)\, dx \qquad (2)$$
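Equation 2 can be evaluated numerically, since $p(s) = 2\,\Phi(-\frac{w}{2s})$ with $\Phi$ the CDF of $N(0,1)$. A small sketch of ours using only the Python standard library:

```python
import math

def std_normal_cdf(x: float) -> float:
    """CDF of N(0, 1): Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def separation_probability(s: float, w: float) -> float:
    """p(s) = 2 * Phi(-w / (2s)) from Equation 2; increases with s."""
    return 2.0 * std_normal_cdf(-w / (2.0 * s))

# p1 = p(1) and p2 = p(1/c) for the (1, 1/c, p1, p2)-sensitive family,
# e.g. with w = 1 and c = 2:
p1, p2 = separation_probability(1.0, 1.0), separation_probability(0.5, 1.0)
assert p1 > p2  # monotonicity in s, formalized by Lemma 1 below
```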

We now show that the family of reverse query-aware hash functions $h_a(\cdot)$, coupled with query-aware interval identification, is locality-sensitive; each such $h_a(\cdot)$ is then said to be a reverse query-aware LSH function.

Lemma 1: The reverse query-aware hash family of all hash functions $h_a(o)$ defined by Equation 1 and coupled with query-aware interval identification is $(1, \frac{1}{c}, p_1, p_2)$-sensitive, where $p_1 = p(1)$ and $p_2 = p(\frac{1}{c})$.

Proof: We rewrite Equation 2 as $p(s) = 2\,\mathrm{norm}(-\frac{w}{2s})$, where $\mathrm{norm}(x) = \int_{-\infty}^{x} \varphi(t)\, dt$. Notice that $\mathrm{norm}(x)$ is the Cumulative Distribution Function (CDF) of $N(0, 1)$, which increases monotonically as $x$ increases. For a fixed $w$, $-\frac{w}{2s}$ increases monotonically as $s$ increases. Thus, $p(s)$ increases monotonically as $s$ increases. Therefore, according to Definition 1, Lemma 1 is proved.

C. Virtual Rehashing

RQALSH does not solve the c-AFN problem directly. Instead, it reduces the problem to a series of $(R, c)$-FN problems with properly decreasing $R \in \{c^k, c^{k-1}, \ldots, 1\}$, because an $(R, \frac{R}{c}, p_1, p_2)$-sensitive reverse query-aware LSH family $H_a^R(\cdot)$ requires $R$ to be pre-specified to compute $p_1$ and $p_2$. Let $I^R$ be the round-$R$ separation interval defined by $H_a^R(\cdot)$, i.e., $(-\infty, H_a^R(q) - \frac{w}{2}) \cup (H_a^R(q) + \frac{w}{2}, +\infty)$. To answer an $(R, c)$-FN query $q$, we can check the interval $(-\infty, h_a(q) - \frac{wR}{2}) \cup (h_a(q) + \frac{wR}{2}, +\infty)$ instead of checking $I^R$. To answer a c-AFN query $q$, we can check these intervals round by round with gradually decreasing $R \in \{c^k, c^{k-1}, \ldots, 1\}$. Since all these intervals are centered at $h_a(q)$ along the same $a$, we can identify all of them along $a$ with properly adjusted interval widths; therefore, we only need to keep one physical copy of the projections of the data objects along $a$. This is the idea of virtual rehashing in RQALSH. Referring to Figure 1, for the random lines $a_1$ and $a_2$, we use the projection of $q$ as the origin (i.e., 0). $I^4$ is identified by $(-\infty, -4) \cup (4, +\infty)$; similarly, $I^2$ and $I^1$ are identified by $(-\infty, -2) \cup (2, +\infty)$ and $(-\infty, -1) \cup (1, +\infty)$, respectively. Virtual rehashing in RQALSH amounts to symmetrically searching $I^R$ from both sides toward the projection of $q$ with decreasing $R$ values.

IV. THE REVERSE QUERY-AWARE LSH SCHEMES

In this section, we propose two novel hashing schemes, RQALSH and RQALSH∗. Before presenting them, we first introduce how to build the hash tables.

A. Building Hash Tables

RQALSH exploits a set of $m$ hash tables for c-AFN search. In the pre-processing step, we build the $m$ hash tables as follows. We first project all objects $o \in D$ from $\mathbb{R}^d$ onto $m$ random lines. For each random line $a_i$, we build a hash table $T_i$ of $h_{a_i}(\cdot)$. Here, $T_i$ is a list of the pairs $(h_{a_i}(o), ID_o)$ for all $o \in D$, where $ID_o$ is the object ID referring to $o$. Then, each $T_i$ is sorted in ascending order of $h_{a_i}(o)$. Finally, we index each $T_i$ by a B$^+$-tree and store it on disk.
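A minimal in-memory sketch of this pre-processing step, assuming sorted NumPy arrays stand in for the disk-resident B$^+$-trees (all names are our own):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def build_hash_tables(data: np.ndarray, m: int):
    """Project all n objects onto m random lines and sort each projection list.

    Returns (A, tables): A is the (m, d) matrix of random lines; tables[i] is
    a list of (h_{a_i}(o), ID_o) pairs sorted in ascending order of hash value,
    i.e., the contents of hash table T_i (a B+-tree on disk in the paper).
    """
    n, d = data.shape
    A = rng.standard_normal((m, d))   # m random lines a_1, ..., a_m
    projections = data @ A.T          # (n, m): column i holds h_{a_i}(o) for all o
    tables = []
    for i in range(m):
        order = np.argsort(projections[:, i])
        tables.append([(float(projections[obj_id, i]), int(obj_id)) for obj_id in order])
    return A, tables
```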

B. RQALSH for (R, c)-FN Search

To find the $(R, c)$-FN of $q$, we first compute the hash value $h_{a_i}(q)$ for each $i \in \{1, 2, \ldots, m\}$. Then, we perform a range search $(-\infty, h_{a_i}(q) - \frac{wR}{2}) \cup (h_{a_i}(q) + \frac{wR}{2}, +\infty)$ on each $T_i$ to identify $I^R$. For each object $o$ falling in $I^R$, we collect its separation number $\#Sep(o)$, which is defined as follows:

$$\#Sep(o) = \left|\{\, i \in \{1, \ldots, m\} \;:\; |h_{a_i}(o) - h_{a_i}(q)| > \tfrac{wR}{2} \,\}\right| \qquad (3)$$

Let $l$ be a pre-specified separation threshold. If an object $o$ has a separation number of at least $l$, i.e., $\#Sep(o) \geq l$, we say $o$ is a candidate. For each $T_i$, we increment $\#Sep(o)$ by 1 if the object $o$ falls in $I^R$, i.e., $|h_{a_i}(o) - h_{a_i}(q)| > \frac{wR}{2}$; we collect separation numbers first for the objects whose projections are furthest from $h_{a_i}(q)$. We only need to check the first $\beta n$ candidates (where $\beta$ is the percentage of false positives and $n$ is the cardinality of $D$) and compute their Euclidean distances to $q$. If we find at least one candidate $o$ such that $\|o - q\| \geq R/c$, we stop early and return YES together with $o$; if, after checking the candidates, some candidate $o$ satisfies $\|o - q\| \geq R/c$, we return YES and the furthest one; otherwise, we return NO.
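The decision procedure can be sketched as follows. This is our own simplified illustration: it scans all projections instead of issuing range searches on the B$^+$-trees, and it omits the early-stop bookkeeping.

```python
import numpy as np
from collections import Counter

def rc_fn_query(data, A, q, R, c, w, l, beta):
    """(R, c)-FN decision sketch: return (True, o) if some candidate o with
    #Sep(o) >= l has ||o - q|| >= R/c, else (False, None)."""
    n = data.shape[0]
    half_width = w * R / 2.0
    sep = Counter()
    for a in A:                                    # one pass per hash table T_i
        hq = float(a @ q)
        for obj_id in range(n):
            if abs(float(a @ data[obj_id]) - hq) > half_width:
                sep[obj_id] += 1                   # Equation 3
    candidates = [o for o, cnt in sep.items() if cnt >= l][: int(beta * n)]
    far = [(np.linalg.norm(data[o] - q), o) for o in candidates]
    far = [(dist, o) for dist, o in far if dist >= R / c]
    return (True, max(far)[1]) if far else (False, None)
```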

C. RQALSH for c-AFN Search

In order to find the c-AFN of $q$, RQALSH first sets up a startup search radius $R$. Then, RQALSH increments $\#Sep(o)$ by 1 for each object $o$ that falls in $I^R$ and collects $o$ as a candidate once $\#Sep(o) \geq l$. If the candidates collected so far are not enough, RQALSH automatically updates $R$ and collects more candidates from the next $I^R$ via virtual rehashing, and so on, until enough candidates have been found or a good enough candidate has been identified. The c-AFN of $q$ is the furthest one among the candidates. Given an approximation ratio $c > 1$, RQALSH returns a $c^2$-AFN with probability at least $\frac{1}{2} - \delta$. The theoretical analysis of RQALSH is similar to that of QALSH [11]; due to space limitations, we omit the proof here.

Terminating Conditions. RQALSH terminates when one of the following two conditions is satisfied:
T1: At round $R$, there is at least one candidate $o$ such that $\|o - q\| \geq R/c$;
T2: At round $R$, at least $\beta n$ candidates have been found.

Setting up the Startup $R$. By utilizing the projections of the data objects, we introduce an automatic way to set up the startup radius $R$. Since each $T_i$ is indexed by a B$^+$-tree, we can quickly identify the left-most and right-most objects and collect the object furthest from $q$ (in terms of projection) on each $T_i$, giving $m$ such furthest objects in total. Suppose their projected distances to $q$ are sorted in descending order and denoted $\{d_1, d_2, \ldots, d_m\}$, and let $d_l$ be the $l$th largest value. The startup radius is set to $R = c^k$ such that $\frac{wR}{2} \leq d_l$, where the integer $k = \lfloor \log_c(2 d_l / w) \rfloor$. Thus, there are at least $l$ objects in $I^R$ at the beginning.

Updating $R$. RQALSH automatically updates $R$ by leveraging the projections of the data objects, in the same way as the startup radius is set. Specifically, by using the B$^+$-tree on each $T_i$, we quickly find the object $o$ that is furthest from $h_{a_i}(q)$ among the objects outside the current $I^R$. We calculate the $l$th largest such distance $d_l$ and use it to set the next radius $R = c^k$ such that $\frac{wR}{2} \leq d_l$, where the integer $k = \lfloor \log_c(2 d_l / w) \rfloor$.
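Putting the pieces together, a simplified c-AFN search loop might look like the following sketch (ours, not the authors' implementation). One simplification: each round shrinks $R$ by a factor of $c$ rather than recomputing it from the $l$th largest projected distance as described above.

```python
import math
import numpy as np

def startup_radius(data, A, q, c, w, l):
    """Startup R = c^k with wR/2 <= d_l, i.e., k = floor(log_c(2 * d_l / w)),
    where d_l is the l-th largest of the m per-line maximum projected
    distances |h_{a_i}(o) - h_{a_i}(q)| (the paper reads these off the
    left-most/right-most B+-tree entries)."""
    per_line_max = [max(abs(float(a @ o) - float(a @ q)) for o in data) for a in A]
    d_l = sorted(per_line_max, reverse=True)[l - 1]
    k = math.floor(math.log(2.0 * d_l / w, c))
    return float(c) ** k

def c_afn_query(data, A, q, c, w, l, beta):
    """c-AFN search via virtual rehashing: run (R, c)-FN rounds with
    decreasing R until one succeeds, then return its witness."""
    R = startup_radius(data, A, q, c, w, l)
    while R >= 1.0:
        found, o = rc_fn_query(data, A, q, R, c, w, l, beta)  # sketch from Sec. IV-B
        if found:
            return o
        R /= c           # move to the next, narrower interval I^R
    return None
```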

D. RQALSH for c-k-AFN Search

For c-$k$-AFN search, RQALSH modifies its terminating conditions as follows:
T1: At round $R$, there exist at least $k$ candidates $o$ such that $\|o - q\| \geq R/c$;
T2: At round $R$, at least $(\beta n + k - 1)$ candidates have been found.

E. RQALSH∗

Finally, we introduce a heuristic variant of RQALSH, named RQALSH∗. Compared with RQALSH, we add a pre-processing step of data-dependent object selection in RQALSH∗ before building the hash tables. The intuition is to collect a small set $S$ of objects that are likely to be the furthest neighbors of queries. By using the small set $S$ instead of the entire dataset $D$ for c-AFN search, we can significantly improve the search efficiency of RQALSH while maintaining its accuracy.

The strategy of data-dependent object selection is similar to DrusillaSelect [8], with two modifications. (1) We do not ignore any objects for future projection directions: even if an object is not selected into $S$ on the current direction, its norm may still be large enough for it to be selected on a future direction. (2) We design a more reasonable way to compute the score, namely score = offset² − distortion², instead of the original score = |offset| − distortion. By using the squared error, we effectively avoid selecting small-norm objects into $S$ when their value of (|offset| − distortion) is equal to or slightly larger than that of large-norm objects.
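A sketch of this modified selection step, reconstructed from the description above; the choice of the largest-norm unselected object as each projection direction follows the spirit of DrusillaSelect [8] but is our assumption, as are the parameter names:

```python
import numpy as np

def select_candidates(data: np.ndarray, num_dirs: int, per_dir: int) -> set:
    """Data-dependent object selection for RQALSH*: for each of num_dirs
    projection directions, score every object by offset^2 - distortion^2
    and keep the per_dir highest-scoring objects. Unlike DrusillaSelect,
    no object is excluded from consideration in later directions."""
    S = set()
    norms = np.linalg.norm(data, axis=1)
    for _ in range(num_dirs):
        remaining = [i for i in range(len(data)) if i not in S]
        pivot = max(remaining, key=lambda i: norms[i])        # assumed direction choice
        direction = data[pivot] / norms[pivot]
        offsets = data @ direction                            # signed projections
        distortions = np.linalg.norm(data - np.outer(offsets, direction), axis=1)
        scores = offsets ** 2 - distortions ** 2              # modified (squared) score
        S.update(int(i) for i in np.argsort(-scores)[:per_dir])
    return S
```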

V. EXPERIMENTS

A. Experiment Setup

1) Datasets and Queries: We use two real datasets, Mnist (http://yann.lecun.com/exdb/mnist/) and Trevi (http://phototour.cs.washington.edu/patches/default.htm), in the experiments. We set a different page size $B$ for each dataset so that at least three objects fit in one page. We randomly remove 1,000 objects from each dataset and use them as queries.

2) Methods: We compare RQALSH with two state-of-the-art hashing schemes, QDAFN [7] and DrusillaSelect [8], which have been adapted for external memory. We also consider the brute-force linear scan method (LINEAR), since it is known to be a strong competitor in high-dimensional FN search. All methods are implemented in C++ and compiled with gcc 4.8 using -w -O3. We conduct all experiments on a machine with an Intel Core i3 3.20GHz CPU, 4 GB of memory, and a 1 TB hard disk, running Linux 3.13.

3) Evaluation Metrics: We use the following metrics for performance evaluation; the overall ratio, I/O cost, and running time are averaged over all queries.

Index Size. Since the size of the datasets is constant across all methods, we measure the size of the index created by each method.

I/O Cost and Running Time. We use I/O cost and running time to measure the efficiency of a method. The I/O cost is defined as the number of pages accessed to answer a c-$k$-AFN query; the running time is defined as the wall-clock time to answer a c-$k$-AFN query.

Overall Ratio. We follow [7], [8] and use the overall ratio to evaluate the accuracy of a method. For c-$k$-AFN search, it is defined as $\frac{1}{k}\sum_{i=1}^{k} \frac{\|o_i^* - q\|}{\|o_i - q\|}$, where $o_i$ is the $i$th furthest object returned by a method and $o_i^*$ is the exact $i$th furthest object.
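For clarity, the overall ratio can be computed as below (our sketch); for furthest-neighbor search it is always at least 1, and values closer to 1 are better:

```python
import numpy as np

def overall_ratio(returned: np.ndarray, exact: np.ndarray, q: np.ndarray) -> float:
    """Overall ratio (1/k) * sum_i ||o_i* - q|| / ||o_i - q|| for c-k-AFN results.
    returned[i] is the i-th furthest object found; exact[i] the true i-th furthest."""
    num = np.linalg.norm(exact - q, axis=1)
    den = np.linalg.norm(returned - q, axis=1)
    return float(np.mean(num / den))
```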

B. Comparison with Theoretical Methods

In this study, we compare RQALSH with two state-of-the-art theoretical methods, namely QDAFN and LINEAR. Partial results are displayed in Figure 2.

Fig. 2. Experimental Results of RQALSH, QDAFN, and LINEAR on Trevi. (a) I/O Cost; (b) Running Time; (c) Overall Ratio.

TABLE I. EXPERIMENTAL RESULTS OF RQALSH∗, QDAFN∗, AND DRUSILLASELECT FOR k = 10

                          Mnist                     Trevi
Methods           Ratio    I/O    Time      Ratio    I/O    Time
RQALSH∗           1.0396   283    2.0       1.0025   30     0.8
QDAFN∗            1.0470   283    2.3       1.0443   37     1.5
DrusillaSelect    1.0948   288    2.3       1.0187   30     0.8

1) Index Size: QDAFN needs a smaller index than RQALSH because it uses fewer projections and stores a smaller number of object IDs per projection. However, compared with the large size of the datasets, the index sizes of RQALSH are relatively small and acceptable for most applications.

2) I/O Cost and Running Time: For the high-dimensional dataset Trevi, the I/O costs of RQALSH are smaller than those of QDAFN and LINEAR by about one to two orders of magnitude. This is because the I/O cost of LINEAR is proportional to $d$, while the I/O cost of RQALSH is independent of $d$. The I/O cost of QDAFN is also independent of $d$, but QDAFN needs to check a large number of candidates for its theoretical guarantee, and hence incurs a large number of random I/Os. For the low-dimensional dataset Mnist, the advantage of RQALSH is less apparent. The running-time results show trends similar to those of the I/O costs.

3) Overall Ratio: All methods achieve satisfactory overall ratios, which are much smaller than the theoretical bound $c^2 = 4$. QDAFN performs better than RQALSH; however, the overall ratios of RQALSH over the two datasets are still smaller than 1.05, which is good enough for most applications. In conjunction with the results on I/O cost and running time, we can see the trade-off between accuracy and efficiency for the three methods. Notice that RQALSH uses only a small fraction of the I/Os of QDAFN (about 3% on Trevi) while still achieving comparable overall ratios (1.0038 vs. 1.0013 on Trevi when k = 1).

C. Comparison with Heuristic Methods

In this study, we compare RQALSH∗ with two state-of-the-art heuristic methods, namely QDAFN∗ and DrusillaSelect, where QDAFN∗ is a heuristic variant of QDAFN with manually set parameters. To make a fair comparison, we tune the parameters of the three methods so that they use almost the same I/O cost. The experimental results are displayed in Table I. For each method, we report the overall ratio (Ratio), I/O cost (I/O), and running time (Time, in milliseconds) for k = 10.

The overall ratios of RQALSH∗ are significantly smaller than those of QDAFN∗ and DrusillaSelect, which indicates that RQALSH∗ outperforms QDAFN∗ and DrusillaSelect over the two datasets under the same I/O cost. Furthermore, the results of QDAFN∗ and DrusillaSelect are consistent with the results in [8]. In addition, for the high-dimensional dataset Trevi, since both RQALSH∗ and DrusillaSelect use the same number of I/Os to check the same number of candidates, the difference in overall ratios reflects the difference in candidate selection. Our results indicate that the data-dependent object selection we propose is more effective than that of DrusillaSelect.

VI. CONCLUSIONS

In this paper, we introduce the novel concept of an RLSH family and propose two efficient hashing schemes, named RQALSH and RQALSH∗, for high-dimensional c-AFN search over external memory. Experimental results demonstrate the superior performance of RQALSH and RQALSH∗. As future work, we plan to adapt our proposed schemes to support other $\ell_p$ norms. It is also interesting to extend the reverse query-aware LSH family to other distance metrics.

ACKNOWLEDGMENT

This work is partially supported by China NSF Grants 60970043 and 61602186. We thank Wilfred Ng (HKUST) for his insightful comments.

REFERENCES

[1] A. Said, B. Kille, B. J. Jain, and S. Albayrak, "Increasing diversity through furthest neighbor-based recommendation," in Proceedings of WSDM, 2012.
[2] P. K. Agarwal, J. Matoušek, and S. Suri, "Farthest neighbors, maximum spanning trees and related problems in higher dimensions," Computational Geometry, vol. 1, no. 4, pp. 189–201, 1992.
[3] P. D. Schloss, S. L. Westcott, T. Ryabin et al., "Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities," Applied and Environmental Microbiology, vol. 75, no. 23, pp. 7537–7541, 2009.
[4] N. Vasiloglou, A. G. Gray, and D. V. Anderson, "Scalable semidefinite manifold learning," in 2008 IEEE Workshop on MLSP, 2008.
[5] S. Bespamyatnikh, "Dynamic algorithms for approximate neighbor searching," in CCCG, 1996.
[6] P. Indyk, "Better algorithms for high-dimensional proximity problems via asymmetric embeddings," in ACM-SIAM SODA, 2003.
[7] R. Pagh, F. Silvestri, J. Sivertsen, and M. Skala, "Approximate furthest neighbor in high dimensions," in SISAP, 2015.
[8] R. R. Curtin and A. B. Gardner, "Fast approximate furthest neighbors with data-dependent candidate selection," in SISAP, 2016.
[9] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in SoCG, 2004.
[10] J. Gan, J. Feng, Q. Fang, and W. Ng, "Locality-sensitive hashing scheme based on dynamic collision counting," in SIGMOD, 2012.
[11] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng, "Query-aware locality-sensitive hashing for approximate nearest neighbor search," Proceedings of the VLDB Endowment, vol. 9, no. 1, pp. 1–12, 2015.
