High-dimensional Spherical Range Reporting by Output-Sensitive Multi-Probing LSH∗ Thomas D. Ahle, Martin Aumüller, and Rasmus Pagh IT University of Copenhagen, Denmark, {thdy, maau, pagh}@itu.dk
arXiv:1605.02673v1 [cs.DS] 9 May 2016
Abstract We present a data structure for spherical range reporting on a point set S, i.e., reporting all points in S that lie within radius r of a given query point q (with a small probability of error). Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from q to the points of S (and in particular on the number of points to report), our data structure is parameter-free. Nevertheless, its time and space usage essentially matches those of an LSH data structure whose parameters have been optimally chosen for the data and query in question. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH parameter settings) and easy queries such as those where the number of points to report is a constant fraction of S, or where almost all points in S are far away from the query point. To our knowledge, this is the first output-sensitive data structure for spherical range reporting in high dimensions. If t is the number of points to report, we propose an algorithm with query time upper bounded by O(t(n/t)^ρ), where ρ ∈ (0, 1) depends on the data distribution and the strength of the LSH family used. We further present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get query time close to O(n^ρ + t), which is the best we can hope to achieve using LSH. This improves on the running time of Ω(tn^ρ) achieved by traditional LSH-based data structures where parameters are tuned for outputting a single point within distance r, and, for large d, on bounds of the type 1/(c−1)^{Ω(d)} + t by Arya et al. (ESA'10). Further, for many data distributions where the intrinsic dimensionality of the point set close to q is low, we give improved upper bounds on query time.
1998 ACM Subject Classification H.3.3 Information Search and Retrieval
Keywords and phrases Spherical range reporting, locality-sensitive hashing, output sensitivity, intrinsic dimensionality.
Digital Object Identifier 10.4230/LIPIcs...
1 Introduction
Finding near neighbors in a high-dimensional space is a key challenge in such diverse areas as machine learning, database management, information retrieval and image recognition [20, 21, 22, 23, 24]. In this paper we consider the spherical range reporting problem (SRR) (or range reporting) [7]: Given a distance parameter r, preprocess a point set S such that given a query point q we report all points in S within distance r from q. Solving the spherical range reporting problem exactly and in time that is truly sublinear in the point set size n = |S| seems to require space exponential in the dimensionality of the point set S. This ∗
∗ The research leading to these results has received funding from the European Research Council under the European Union's 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.
This phenomenon is an instance of the curse of dimensionality, and is supported by popular algorithmic hardness conjectures (see [1, 26]). For this reason, most indexes for near neighbor problems such as LSH [12, 14] involve approximation of distances: for some approximation parameter c > 1 we allow the data structure to only distinguish between distances ≤ r and > cr, while points at a distance in between can either be reported or not. Using such indexes for range reporting means that we get all points closer than distance r, but possibly also some points at distance in the range (r, cr). When c is large, this can have negative consequences for performance: a query could report nearly every point in the data set, and any performance gained from approximation is lost. On the other hand, when the approximation factor c is set close to 1, data structures working in high dimensions usually need many independent repetitions to still guarantee finding the points at distance r. This is another issue with such indexes that makes range reporting hard: very close points show up in every repetition, and we need to remove these duplicates.¹

The natural approach to overcome the difficulties mentioned above is to choose the approximation factor c such that the cost of duplicated points roughly equals the cost of dealing with far points. For LSH-based algorithms, many papers describe an offline approach to finding the optimal value of c for a dataset [2, 8, 11]. In this paper, we provide a query algorithm for an almost standard LSH data structure that adapts to the input and finds a near-optimal c at query time. We manage to do this in time proportional to the number of points eventually returned for the optimal parameters, making the search essentially free.

To compare this approach to standard LSH, we consider "hard datasets" for range reporting in Section 3. In these data sets, we pick t very close points that show up in almost every repetition, one point at distance r, and the remaining points close to distance cr. In this case LSH would need Θ(n^ρ) repetitions to retrieve the point at distance r with constant probability, for ρ = 1/c [14] and ρ = 1/c² [3] in Hamming space and Euclidean space, respectively. This means that the algorithm considers Θ(tn^ρ) candidate points, which could be as large as Θ(n^{1+ρ}) for large t, even worse than a linear scan! We describe algorithms that are aware of this problem. In Section 4 we provide an overview of their running times before we describe and analyze them in Section 5. The basic idea is that our algorithms "notice" the presence of many close points, and respond by choosing c more leniently, allowing t far points to be reported per repetition in addition to the t close points. This in turn allows doing only Θ((n/t)^ρ) repetitions, giving a total candidate set of size Θ(t(n/t)^ρ), which is never larger than n. A very different approach described by [7] achieves query times of the type 1/(c−1)^{Ω(d)} + Ω(t) for fixed c < 2 using a tree-based data structure.

When we stick to the LSH framework, the ideal solution would never consider a candidate set larger than Θ(n^ρ + t), giving the optimal output-sensitive running time achievable by (data independent) LSH. In order to get closer to this bound, we analyze the so-called multi-probing approach for LSH data structures, introduced in [18] and further developed in [17]. The idea is that LSH partitions the space into many buckets, but usually only examines the exact bucket in which the query point falls in each repetition.
Multi-probing considers buckets around the query bucket to increase the likelihood of finding close points. To our knowledge, multi-probing has always been applied in order to save memory by allowing a smaller number of repetitions to be made and trading this for an increase in query time. Our approach is different: We want to take advantage of the fact that each of the t very close points can only be in one bucket per repetition. Hence by probing multiple buckets in each
¹ Even for practical datasets such as WebSpam [25], our initial experiments showed that dealing with these reoccurring points quickly becomes the main bottleneck, even for small distances.
repetition, we not only save memory, but also gain a large improvement in the dependency on t in our running time. We do this by generalizing our algorithm to find not only the optimal c for a query, but also the optimal number of buckets to probe. As we show in Section 5, we are able to do this in time negligible compared to the size of the final candidate set, making it practically free. The algorithm works for any probing sequence supplied by the user, but in Section 6 we provide a novel probing sequence for Hamming space and show that it always improves the query time compared to the non-multi-probing variant. For certain regimes of t, we show that the running time matches the target time O(n^ρ + t).

Our approach to query-sensitivity works by doing an intelligent search over the parameter space. This is different from the recent work of [13], which considers finding an approximate nearest neighbor. A natural alternative would be the use of sampling to estimate the distance distribution around the query point. This is usually done offline at construction time [2, 8, 11], but we could also imagine doing it at query time, for the benefits mentioned above. However, we will show that the search problem comes down to estimating the higher-order moments of the input distance distribution, and estimating higher-order moments, at least in general, requires polynomially many samples [5]. In particular, for queries in areas of the data set with low local intrinsic dimensionality, the query with optimal parameters takes only logarithmic time, so sampling, unless somehow adaptive, would often be a bottleneck.

The reason we are able to do a complete search over the parameter space is that certain parts of the output size can be estimated very quickly when storing the sizes of the hash table buckets in the LSH data structure. For example, when considering very large c, though the output may be large, there are only a few repetitions to check. Gradually decreasing c, we eventually have to check so many repetitions that the bare task of iterating through them would be more work than the size of the best candidate set found so far. Since the number of repetitions grows geometrically as c decreases, the total search cost ends up being bounded by the last check, which is not larger than the returned candidate set. For multi-probing it turns out that a similar strategy works, even though the search problem is now two-dimensional.
2 Preliminaries
Let (X, dist) be a metric space with distance function dist. In this paper, the particular space usually does not matter; only in Section 6 do we consider {0,1}^d under Hamming distance.

▶ Definition 1 (Spherical Range Reporting, SRR). Given a set of points S ⊆ X and a parameter r ≥ 0, construct a data structure such that for a given query q ∈ X, each point p ∈ S with dist(p, q) ≤ r is returned with constant probability.

▶ Definition 2 (Locality-Sensitive Hash Family, [9]). A locality-sensitive hash family H is a family of functions h : X → R such that for all points q, x, y ∈ X and a random choice of h ∈ H, whenever dist(q, x) ≤ dist(q, y) we have Pr[h(q) = h(x)] ≥ Pr[h(q) = h(y)].

Usually the set R is small, like the set {0, 1}. Often we will concatenate multiple independent hash functions from a family, getting functions h^k : X → R^k. We call this a hash function at level k. Having access to an LSH family H, Indyk and Motwani [14] showed the following result.

▶ Theorem 3 ([14, Theorem 4]). Suppose there exists an LSH family such that Pr[h(q) = h(x)] ≥ p_1 when dist(q, x) ≤ r and Pr[h(q) = h(x)] ≤ p_2 when dist(q, x) ≥ cr with p_1 > p_2, for some metric space (X, dist) and some factor c > 1. Then there exists a data structure such that for a given query q, it returns with constant probability a point within distance cr,
if there exists a point within distance r. The algorithm uses O(dn + n^{1+ρ}) space and O(n^ρ) hash function evaluations per query, where ρ = log(1/p_1)/log(1/p_2).

It is instructive for understanding our algorithms to know how the above data structure works. Hence we present a small proof sketch:

Proof. Given access to an LSH family H with the properties stated in the theorem and two parameters L and k (to be specified below), repeat the following process independently for each i in {1, . . . , L}: Choose k hash functions g_{i,1}, . . . , g_{i,k} independently at random from H. For each point p ∈ S, we view the sequence h_i(p) = (g_{i,1}(p), . . . , g_{i,k}(p)) ∈ R^k as the hash code of p, identify this hash code with a bucket in a table, and store a reference to p in bucket h_i(p). To avoid storing empty buckets from R^k, we resort to hashing and build a hash table T_i to store the non-empty buckets for S and h_i. Given a query q ∈ X, we retrieve all points from the buckets h_1(q), . . . , h_L(q) in tables T_1, . . . , T_L, respectively, and report a close point at distance at most cr as soon as we find such a point. Note that the algorithm stops and reports that no close point exists after retrieving more than 3L points, which is crucial to guarantee query time O(n^ρ).

The parameters k and L are set according to the following reasoning. First, set k such that, in expectation, at most one distant point at distance at least cr collides with the query in one of the repetitions. This means that we require n·p_2^k ≤ 1 and hence we define k = ⌈log n / log(1/p_2)⌉. To find a close point at distance at most r with probability at least 1 − δ, the number of repetitions L must satisfy (1 − p_1^k)^L ≤ exp(−p_1^k · L) ≤ δ. This means that L should be at least p_1^{-k} ln(1/δ), and simplifying yields L = O(n^ρ). Note that these parameters are set to work even in a worst-case scenario where there is exactly one point at distance r and all other points have distance cr + 1. ∎

The framework can easily be extended to solve SRR. We just report all the points that are within distance at most r from the query point in the whole candidate set retrieved from all tables T_1, . . . , T_L. For the remainder of this paper, we will denote the number of points retrieved in this way by W ("work"). It is easy to see that this change to the query algorithm would already solve SRR with the guarantees stated in the problem definition. However, we will see in the next section that its running time might be as large as O(n^{1+ρ}), worse than a linear scan over the dataset.

Our query algorithms work on top of the following augmented LSH data structure.

▶ Definition 4 (Multi-level LSH). Given a set S ⊆ X of n points, two parameters r and L, and access to an LSH family H that maps from X to R, we build the following data structure. First, set K = ⌈log L / log(1/p_1)⌉, where p_1 is the probability that points at distance r collide under a random choice of h ∈ H. Next, choose K · L functions g_{i,k} for 1 ≤ i ≤ L and 1 ≤ k ≤ K from H independently at random. For each i ∈ {1, . . . , L}, we build K hash tables T_{i,k} with 1 ≤ k ≤ K. For each k ∈ {1, . . . , K} and each x ∈ S, we concatenate hash values (g_{i,1}(x), . . . , g_{i,k}(x)) ∈ R^k to obtain the hash code h_{i,k}(x), and set up a hash table to store references to all points in S in the standard LSH way. For a point x ∈ X, and for integers 1 ≤ i ≤ L and 1 ≤ k ≤ K, we let |T_{i,k}(x)| be the number of points in bucket h_{i,k}(x) in table T_{i,k}. We assume this value can be retrieved in constant time.
In contrast to a standard LSH data structure, we only accept the number of repetitions L as a parameter. The maximum level K is chosen such that the number of repetitions available suffices to find a close point at distance r with constant probability. The space usage of our data structure is O(n · K · L).
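To make Definition 4 concrete, here is a minimal Python sketch of the multi-level structure for Hamming space with bitsampling. The class MultiLevelLSH and its methods are our own illustration, not part of the paper; buckets are kept in plain dictionaries so that |T_{i,k}(q)| is available in constant time.

```python
import random
from collections import defaultdict

class MultiLevelLSH:
    """Minimal multi-level LSH sketch for {0,1}^d with bitsampling.

    For each repetition i < L it samples K bit positions; table tables[i][k]
    maps the length-k prefix of the hash code to the bucket of point ids."""
    def __init__(self, points, L, K, seed=0):
        rng = random.Random(seed)
        self.points = points
        d = len(points[0])
        self.L, self.K = L, K
        # K sampled coordinates per repetition (with replacement, as in bitsampling)
        self.coords = [[rng.randrange(d) for _ in range(K)] for _ in range(L)]
        self.tables = [[defaultdict(list) for _ in range(K + 1)] for _ in range(L)]
        for idx, p in enumerate(points):
            for i in range(L):
                code = tuple(p[c] for c in self.coords[i])
                for k in range(1, K + 1):
                    self.tables[i][k][code[:k]].append(idx)

    def code(self, q, i, k):
        """Level-k hash code of q in repetition i."""
        return tuple(q[c] for c in self.coords[i][:k])

    def bucket(self, q, i, k):
        """Bucket T_{i,k}(q); its size is available in constant time."""
        return self.tables[i][k].get(self.code(q, i, k), [])
```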
In practice, a good implementation of the multi-level LSH data structure could be the LSH forest described by Bawa et al. [8]. Instead of setting up a factor of K additional hash tables compared to standard LSH, the LSH forest builds a trie as a navigation structure on top of an array of references to data points sorted by hash codes. This allows us to build only one data structure per repetition and to query bucket sizes gradually.

A common technique when working with LSH is multi-probing. This refers to checking more than a single bucket of a hash table in each repetition. The ordering of the buckets depends on the hash family [17, 11, 4, 15]. It is generally assumed that one trades time efficiency for space efficiency [18], but often only experimental arguments are used to determine how many buckets are probed. We formalize probing sequences in multi-probing as follows.

▶ Definition 5 (Probing Sequence). For k ≥ 1 and λ ≥ 1, a probing sequence σ = (σ_{k,ℓ})_{1≤ℓ≤λ} is a sequence of functions R^k → R^k. When the hash family and r are known, we let p_ℓ be the probability Pr[σ_{k,ℓ}(h^k(q)) = h^k(y)], where y is a point with dist(q, y) = r. If p_1 ≥ p_2 ≥ . . . , the probing sequence is called reasonable.
3 Difficult Inputs For SRR
Suppose we want to solve SRR using an LSH family H. Assume that the query point q ∈ X is fixed. Given n and t with 1 ≤ t ≤ n, and c > 1, we generate a data set S by picking t points at distance ε from q, for ε small enough that even when concatenating ⌈log n / log(1/p_2)⌉ hash functions from H we still have collision probability higher than 0.01, one point x ∈ X with dist(q, x) = r, and the remaining n − t − 1 points at distance cr. We call a set S that is generated by the process described above a t-heavy input for SRR on q. We argue that the standard LSH approach is unnecessarily slow on such inputs.

▶ Observation 6. Suppose we want to solve SRR in (X, dist) using LSH with parameters as in Theorem 3 with LSH family H. Let q ∈ X be a fixed query point, and let S be a t-heavy input generated by the process above. Then the expected number of points retrieved from the hash tables on query q in the LSH data structure is O(tn^ρ).

Proof. The standard LSH data structure is set up with k = ⌈log n / log(1/p_2)⌉ and L = O(n^ρ); L repetitions are necessary to find the close point at distance r with constant probability. By the construction of S, each repetition will contribute at least Θ(t) very close points in expectation. So, we expect to retrieve O(tn^ρ) close points from the hash tables in total. ∎
The process described above assumes that the space allows us to pick sufficiently many points at a certain distance. This is, for example, true in R^d with Euclidean distance. In Hamming space {0,1}^d we would change the above process to enumerate the points from distance 1, 2, . . . and distance cr + 1, cr + 2, . . .. If d and r are sufficiently large, the same observation as above also holds for inputs generated according to this process. We also note that this hard instance for LSH is not as artificial as it may seem. In datasets with many small clusters, this behaviour appears naturally. Additionally, our algorithms provide the same guarantees if we had Θ(t) points in the range (r, cr).
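For illustration, here is a small generator (ours, not from the paper) for a t-heavy instance in Hamming space: it plants t near-duplicates of the query, one point at distance r, and the remaining points just outside the approximation radius.

```python
import random

def t_heavy_instance(d, n, t, r, c, seed=0):
    """Plant t near-duplicates of q, one point at distance r,
    and n - t - 1 points at distance ~cr (all in {0,1}^d)."""
    rng = random.Random(seed)
    q = [rng.randint(0, 1) for _ in range(d)]

    def flip(p, dist):
        out = list(p)
        for i in rng.sample(range(d), dist):
            out[i] ^= 1
        return out

    eps = 1                              # "very close": distance 1
    far = min(d, int(c * r) + 1)         # just outside the c-approximate range
    pts = [flip(q, eps) for _ in range(t)]
    pts.append(flip(q, r))
    pts += [flip(q, far) for _ in range(n - t - 1)]
    return q, pts
```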
4 Our Results
This section summarizes our results. We start by showing how to tune standard LSH to t-heavy inputs, given t is known in advance. Then, we show how a multi-probing approach
can give improved query times in Hamming space. Finally, we present the general running time guarantees of our adaptive algorithms that do not require knowledge of t.
4.1 Handling t-heavy Inputs When t is Known
▶ Theorem 7. Suppose we are given a multi-level LSH data structure with a sufficient number of repetitions and levels, and let S be a t-heavy input for SRR on query q. Then we can solve SRR on q in expected time O(n^ρ t^{1−ρ}).

Proof. Set the level k to ⌈log(n/t) / log(1/p_2)⌉ and make Θ(p_1^{-k}) = Θ((n/t)^ρ) repetitions, where p_1 and p_2 are the probabilities that points at distance r and cr collide, respectively. Now p_2^k ≤ t/n, so we expect at most t collisions per bucket with the n − t − 1 points at distance cr. Adding the Θ(t) collisions with very close points, we expect to retrieve O(t(n/t)^ρ) = O(t^{1−ρ} n^ρ) points. ∎
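As a worked example of the parameter choice in this proof, here is a small helper (ours, not from the paper) together with concrete numbers:

```python
from math import log, ceil

def tuned_parameters(n, t, p1, p2):
    """Level k and repetition count used in the proof of Theorem 7:
    k = ceil(log(n/t)/log(1/p2)) so that p2^k <= t/n, and p1^{-k} = Theta((n/t)^rho)."""
    k = ceil(log(n / t) / log(1 / p2))
    reps = ceil(p1 ** (-k))
    rho = log(1 / p1) / log(1 / p2)
    return k, reps, rho

# Example: n = 10**6, t = 10**3, p1 = 0.8, p2 = 0.6 gives k = 14 and about
# 23 repetitions (roughly (n/t)^rho), instead of n^rho ~ 418 for standard LSH.
```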
This result basically shows that there exists a single level of the multi-level LSH data structure that we want to probe when t is known (assuming that K and L are large enough). While this query time provides a big improvement over the naïve query time O(tn^ρ), it is still far away from the target time O(n^ρ + t). Next, we show that—perhaps surprisingly—we get a further improvement when we consider multi-probing. Before we can state the results, we have to set up some notation: We set τ such that n^τ = t. Furthermore, for α ∈ [0, 1] and β ∈ [0, 1] we let H(α) = α log(1/α) + (1 − α) log(1/(1 − α)) and D(α ∥ β) = α log(α/β) + (1 − α) log((1 − α)/(1 − β)) denote the binary entropy of α and the relative entropy between α and β, respectively. We will analyze multi-probing in d-dimensional Hamming space by devising an explicit probing sequence for which we get the following running time guarantees. Details of the static analysis are deferred to Section 6.

▶ Theorem 8. Suppose that S is a t-heavy input for SRR on q in d-dimensional Hamming space {0,1}^d. Suppose again that we preprocess S in the standard way to build an LSH data structure. Then there exists a probing sequence such that the query algorithm solves SRR in expected time O(t^{1 + D(α ∥ 1−p_1)/H(α)}), where α satisfies 1 + D(α ∥ 1−p_2)/H(α) = 1/τ.

We know of no nice inverse of 1 + D(α ∥ 1−p_2)/H(α) in α, so we state the result on the expected query time for two special cases of the value τ ∈ [0, 1] explicitly.

▶ Corollary 9. For τ ≥ H(1−p_2)/(H(1−p_2) + D(1−p_2 ∥ 1−p_1)) ≈ log 4 / log(1/(p_1 − p_1²)), the expected time is O(n^ρ + t).
For τ → 0, the expected time is O(n^ρ t^{(ψ/log(1/τ))(1 + o(1))}), where ψ = ρ log((1−p_2)/p_2) − log((1−p_1)/p_1) ≤ 1 − ρ.
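Since the exponent in Theorem 8 is only given implicitly, it is easy to evaluate numerically. The sketch below (our own, not the paper's code) solves 1 + D(α ∥ 1−p_2)/H(α) = 1/τ for α by bisection and returns the exponent of t from Theorem 8, capping it at 1 in the regime covered by the first part of Corollary 9.

```python
from math import log

def H(a):
    """Binary entropy (natural log)."""
    return a * log(1 / a) + (1 - a) * log(1 / (1 - a))

def D(a, b):
    """Relative entropy D(a || b) between Bernoulli(a) and Bernoulli(b)."""
    if a == b:
        return 0.0
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

def multiprobe_exponent(tau, p1, p2):
    """Exponent e such that the expected work in Theorem 8 is O(t^e).

    Solves 1 + D(alpha || 1-p2)/H(alpha) = 1/tau by bisection (the left-hand
    side decreases from infinity to 1 on (0, 1-p2)), then evaluates
    1 + D(alpha || 1-p1)/H(alpha), capped at 1 once the probing radius alpha
    reaches 1-p1 (the regime of the first part of Corollary 9)."""
    lhs = lambda a: 1 + D(a, 1 - p2) / H(a)
    lo, hi = 1e-12, 1 - p2
    for _ in range(200):
        mid = (lo + hi) / 2
        if lhs(mid) > 1 / tau:
            lo = mid
        else:
            hi = mid
    a = (lo + hi) / 2
    if a >= 1 - p1:
        return 1.0
    return 1 + D(a, 1 - p1) / H(a)

# Total work as a power of n: log_n(W) = tau * multiprobe_exponent(tau, p1, p2),
# which approaches rho = log(1/p1)/log(1/p2) as tau -> 0.
```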
Figure 1 provides a comparison of the different results of this section. We see that the running time guarantees stated in Theorem 7 ("Simple") and Theorem 8 ("Multiprobing") improve substantially on the standard LSH approach for the whole range of τ. Moreover, the multi-probing approach achieves optimal query time in some parameter range. We get these results without any preprocessing of the data, such as clustering to avoid very close points in repetitions.
4.2 Adaptive Search Algorithms When t is Unknown
Section 5 describes adaptive algorithms that find the optimal value of c for the setting where t is unknown. Here we summarize the results. We stress that the proposed algorithms work at least as well as tuned LSH on every distribution, not just on t-heavy input distributions. For this generalization, let q ∈ X be the query point, let S ⊆ X be the data set, let y ∈ S be a data point at distance r, and let h ∈ H be a random hash function from an LSH family.
Figure 1 Overview of the running time guarantees stated in Theorem 7 ("Simple") and Theorem 8 ("Multiprobing") for p_1 = 0.8 and p_2 = 0.6 in d-dimensional Hamming space, such that ρ ≈ 0.436. The x-axis shows the value of t compared to n, the y-axis shows the expected work W. For comparison, we plot the lower bound of O(n^ρ + t), the running time O(tn^ρ) of the simple LSH approach, and the running time O(n) of a linear scan.
Now consider the two quantities:

    W_simple = min_k (1 / Pr[h(q) = h(y)]^k) · (1 + Σ_{x∈S} Pr[h(q) = h(x)]^k)                                     (1)

    W_multi  = min_{k,ℓ} (1 / Σ_{1≤i≤ℓ} Pr[σ_{k,i}(h(q)) = h(y)]) · (ℓ + Σ_{x∈S, 1≤i≤ℓ} Pr[σ_{k,i}(h(q)) = h(x)])   (2)
For fixed k, the right-hand side of W_simple is the expected amount of work we need when searching an LSH index at level k: we need 1/Pr[h(q) = h(y)]^k repetitions to find y with constant probability, and in each repetition we need to look at the bucket plus the collisions inside of it. In the second quantity we assume that we probe up to ℓ buckets using multi-probing. W_simple and W_multi take the minimum over all parameter choices, thus these quantities denote the optimal work for an LSH-based approach.

▶ Theorem 10. Suppose a standard LSH data structure with parameters K and L was used to preprocess a set S ⊆ {0,1}^d with |S| = n. Let q ∈ {0,1}^d be the query point. Let W_simple and W_multi be as above. Then there exists an adaptive query algorithm that solves SRR with expected query time O(W_simple), and for each probing sequence σ there exists a variant of the query algorithm that has expected query time O(W_multi (log W_multi)²).

The theorem shows that we are never more than logarithmic factors away from the ideal query time across all possible parameters. These quantities are query-specific, so we cannot assume that offline tuning achieves these candidate set sizes for all query points. In the rest of the section we discuss two examples to get a sense for quantity (1). Fix a query point q ∈ {0,1}^d and assume that our dataset S consists of n uniform random points from {0,1}^d. Then the distance distribution from our query point is binomial, Bin(d, 1/2). Now we think of Σ_{x∈S} Pr[h(q) = h(x)]^k as the term n·E(P^k), where P is a random variable representing the probability of a collision between q and a randomly chosen point from S, using the LSH family H at hand. (The expectation is taken over the random choice of the data point.)
Algorithm 1 Query(q, p_1, T)
 1: k ← 1; k_best ← 0; W_{k_best} ← n
 2: while p_1^{-k} ≤ min(L, W_{k_best}) do
 3:     W_k ← Σ_{i=1}^{p_1^{-k}} (1 + |T_{i,k}(q)|)
 4:     if W_k < W_{k_best} then
 5:         k_best ← k; W_{k_best} ← W_k
 6:     k ← k + 1
 7: return C ← ⋃_{i=1}^{p_1^{-k_best}} {x ∈ T_{i,k_best}(q) | dist(x, q) ≤ r}
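For concreteness, Algorithm 1 translates directly into a few lines of Python on top of the multi-level structure sketched after Definition 4. The helper names (query_adaptive, index.bucket, dist) are ours, and a linear scan is used as a fallback in the corner case where no level beats it.

```python
from math import ceil

def query_adaptive(index, q, p1, r, dist):
    """Algorithm 1: scan levels k = 1, 2, ... and keep the level whose
    p1^{-k} repetitions touch the fewest buckets+points (W_k); stop once
    the bare number of repetitions exceeds the best work seen so far."""
    k, k_best, W_best = 1, 0, len(index.points)
    while p1 ** (-k) <= min(index.L, W_best):
        reps = ceil(p1 ** (-k))
        W_k = sum(1 + len(index.bucket(q, i, k)) for i in range(reps))
        if W_k < W_best:
            k_best, W_best = k, W_k
        k += 1
    if k_best == 0:                      # no level beat a linear scan
        return {i for i, p in enumerate(index.points) if dist(p, q) <= r}
    reps = ceil(p1 ** (-k_best))
    out = set()
    for i in range(reps):
        for idx in index.bucket(q, i, k_best):
            if dist(index.points[idx], q) <= r:
                out.add(idx)
    return out
```

With the MultiLevelLSH sketch above, dist can simply be the Hamming distance between two bit lists.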
If we choose bitsampling as in [14] as the hash function, we can write P as an average of independent Bernoulli random variables X_i, P = (1/d) Σ_{i=1}^{d} X_i, where X_i indicates that q and the random point agree in coordinate i. Then E(P^k) = (1/d^k) E[(Σ_{i=1}^{d} X_i)^k] = 2^{-k}(1 + O(1/d)), as can be seen by expanding the monomials and using E(X_i^j) = 1/2 for all j. If we take k = log n, we get Σ_{x∈S} Pr[h(q) = h(x)]^k = O(1)
and W = n^{log(1/p_1)/log 2}, as we would expect for LSH with bitsampling and the above parameters in the random case. Another interesting setting to consider is when the data is locally growth-restricted, as considered by Datar et al. [10, Appendix A]. This means that the number of points within distance r of q, for any r > 0, is at most r^c for some small constant c. In [10], the LSH framework is changed by providing the parameter k to the hash function. However, if we fix r = k, our algorithm will find a candidate set of size W = O(log n).² So, our algorithm takes advantage of restricted growth and adapts automatically on such inputs.
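To get a feeling for quantity (1) on concrete data, the following sketch (ours, not from the paper) evaluates W_simple for bitsampling by brute force over k, directly from the query-to-point distances; on uniform random data it reproduces the n^{log(1/p_1)/log 2} behaviour discussed above.

```python
def w_simple(dists, r, d, k_max=64):
    """Brute-force evaluation of quantity (1) for bitsampling in {0,1}^d.

    dists : Hamming distances from the query to all data points.
    A point at distance s collides with the query under one sampled bit
    with probability 1 - s/d, so Pr[h^k collision] = (1 - s/d)^k."""
    p1 = 1 - r / d                                  # collision prob. at distance r
    best = float("inf")
    for k in range(1, k_max + 1):
        reps = (1 / p1) ** k                        # repetitions needed at level k
        expected_bucket = sum((1 - s / d) ** k for s in dists)
        best = min(best, reps * (1 + expected_bucket))
    return best
```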
5 Adaptive Query Algorithm
We assume that a multi-level LSH data structure on S ⊆ X with L repetitions and K levels has been set up, as described in Section 2. We have access to the tables T_{i,k} for 1 ≤ i ≤ L, 1 ≤ k ≤ K. To make the algorithms easier to read, we avoid ceilings for repetition counts p_1^{-k} and also give probability guarantees slightly weaker than the constant probability demanded by the SRR problem. By doing a factor of Θ(log log n · log(1/δ)) more repetitions, the algorithms described in this section solve the SRR problem such that each close point is reported with probability at least 1 − δ.

Given a multi-level LSH data structure, the query algorithm on input q works as follows: For each level 1 ≤ k ≤ K, calculate the work of doing p_1^{-k} repetitions, i.e., the work it would have to do if this k was the parameter chosen for standard LSH. We terminate as soon as we have provably found the optimal level, which may be one that we have considered in the past, and report all close points in the candidate set. The algorithm is given as Algorithm 1.

▶ Theorem 11. Let S ⊆ X with |S| = n and r be given. Then Algorithm 1 solves SRR with probability Ω(1/log log n). The expected running time of the while-loop in Lines (2)–(6) and the expected number of distance computations in Line (7) is O(W_simple), as defined in Sec. 4.
² The proof from [10] works, since they also inspect all colliding points. It is easy to see that the integral ∫_0^{r/√2} e^{−Bc} c^b dc is still bounded by 2^{O(b)} when we start at c = 0 instead of c = 1, since the integrand is less than 1 in this interval.
Proof. The proof is given in two steps. First we show that the algorithm works correctly, then we argue about its running time.

To see that the algorithm works correctly, let y ∈ S be a point with dist(y, q) ≤ r. At each level k, we see y in a fixed bucket with probability at least p_1^k. Hence looking at p_1^{-k} independent repetitions suffices to obtain constant collision probability. The algorithm picks one out of K possible levels. We could do another factor of log K repetitions to guarantee, by a union bound over all levels, that y has constant probability of being found on each level. In the algorithm as described, we do not use these log K extra repetitions, so we get success probability 1/log K = Ω(1/log log n).

For the running time, the work inside the loop is dominated by Line (3), which takes O(p_1^{-k}) time, given constant-time access to the sizes of the buckets. Say the last value of k before the loop terminates is k*. By the loop condition, p_1^{-k*} ≤ W_{k_best}, and so the loop takes time Σ_{k=1}^{k*} O(p_1^{-k}) = O(p_1^{-k*} Σ_{k=0}^{∞} p_1^k) = O(W_{k_best}). In Line (7), the algorithm looks at W_{k_best} points and buckets, so the total expected work is

    E(W_{k_best}) = E[min_{1≤k≤K} W_k] ≤ min_{1≤k≤K} E(W_k),                                   (3)

where we have applied Jensen's inequality to the concave min function. By linearity of expectation, E(W_k) = p_1^{-k} (1 + Σ_{x∈S} Pr[h(q) = h(x)]^k), so (3) is exactly O(W_simple). ∎
5.1 A Multi-probing Version of Algorithm 1
We extend the algorithm from the previous subsection to include multi-probing. When using a probing sequence σ = (σ_{k,ℓ}), we will always know in advance how many elements λ we are going to use. Hence, we assume w.l.o.g. that the collision probabilities at each probe are ordered p_{k,1} ≥ p_{k,2} ≥ · · · ≥ p_{k,λ}, so that σ is reasonable. For each k and ℓ, we will be interested in the probability that q and a fixed point y at distance r collide under the first ℓ probes on level k, and denote this probability by the partial sum P_{k,ℓ} = p_{k,1} + · · · + p_{k,ℓ}. Before describing the algorithm, we note the following useful inequalities:

▶ Lemma 12. Let σ be a reasonable probing sequence. Then for each ℓ with 1 ≤ ℓ < λ and each k it holds that:

    ℓ/P_{k,ℓ} ≤ (ℓ + 1)/P_{k,ℓ+1},                                                             (4)

    Σ_{i=1}^{ℓ} 1/P_{k,i} ≤ ℓ·H_ℓ/P_{k,ℓ} = (ℓ/P_{k,ℓ})·(log ℓ + O(1)),                        (5)

where H_ℓ = 1 + 1/2 + · · · + 1/ℓ is the ℓ-th harmonic number.

Proof. (4): By the definition of P_{k,ℓ} we may write (4) as (ℓ + 1)(p_{k,1} + · · · + p_{k,ℓ}) ≥ ℓ(p_{k,1} + · · · + p_{k,ℓ+1}), which is implied by p_{k,1} + · · · + p_{k,ℓ} ≥ ℓ · p_{k,ℓ+1}, which is true since the values p_{k,i} are non-increasing.

(5): Using the first inequality inductively, we have that i/P_{k,i} ≤ ℓ/P_{k,ℓ} whenever i ≤ ℓ. Hence we can bound the sum term-wise as

    Σ_{i=1}^{ℓ} 1/P_{k,i} ≤ Σ_{i=1}^{ℓ} ℓ/(i·P_{k,ℓ}) = (ℓ/P_{k,ℓ}) Σ_{i=1}^{ℓ} 1/i = (ℓ/P_{k,ℓ})·H_ℓ = (ℓ/P_{k,ℓ})·(log ℓ + O(1)),

where the last step follows from the standard approximation of the harmonic numbers. ∎
As in Algorithm 1, we carefully explore the now two-dimensional space (k, ℓ) of parameters and stop once the expected number of buckets we have to check is as large as the smallest candidate set we have seen so far. Again, let y be a fixed point at distance r from our query. Fix a pair (k, ℓ) with 1 ≤ k ≤ K and 1 ≤ ℓ ≤ λ. To find y with constant probability on level k with ℓ probes, we need P_{k,ℓ}^{-1} repetitions. Since we probe ℓ buckets in each repetition, we associate the cost cost(k, ℓ) = ℓ · P_{k,ℓ}^{-1} with this pair of parameters. The algorithm is simply going to search the space of (k, ℓ) parameters ordered by cost(k, ℓ). For each parameter pair it stores the size of the candidate set and keeps the best, i.e., smallest set. To make the algorithm easier to state, we simply write T_{i,k}(σ_{k,ℓ}(q)) for the bucket in table T_{i,k} with identifier σ_{k,ℓ}(h_{i,k}(q)). The pseudocode of the algorithm is given as Algorithm 2. To obtain a good query time it is necessary to store the values W_{k,ℓ} computed so far and reuse them in Line 8 of the algorithm. Details are given in the proof below.

Algorithm 2 Query-Multiprobe(q, σ, T)
 1: W_best ← n; k_best ← 0; ℓ_best ← 1; PQ ← initialize empty priority queue
 2: for 1 ≤ k ≤ K do
 3:     PQ.insert((k, 1), cost(k, 1))
 4: while PQ is not empty and PQ.min() < W_best do
 5:     (k, ℓ) ← PQ.extractMin()
 6:     if ℓ < λ then
 7:         PQ.insert((k, ℓ + 1), cost(k, ℓ + 1))
 8:     W_{k,ℓ} ← Σ_{i=1}^{1/P_{k,ℓ}} Σ_{j=1}^{ℓ} (1 + |T_{i,k}(σ_{k,j}(q))|)
 9:     if W_{k,ℓ} < W_best then
10:         k_best ← k; ℓ_best ← ℓ; W_best ← W_{k,ℓ}
11: return ⋃_{1≤i≤1/P_{k_best,ℓ_best}, 1≤j≤ℓ_best} {x ∈ T_{i,k_best}(σ_{k_best,j}(q)) | dist(x, q) ≤ r}
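A direct Python rendering of Algorithm 2 using a binary heap could look as follows. This is our sketch, not the paper's code: the probe accessor probes and the table P of partial collision probabilities are assumed to be supplied by the caller, and W_{k,ℓ} is recomputed from scratch rather than incrementally as in the proof of Theorem 13.

```python
import heapq
from math import ceil

def query_multiprobe(index, q, probes, P, r, dist, max_probes):
    """Algorithm 2: explore (k, l) pairs ordered by cost(k, l) = l / P[k][l-1].

    probes(q, i, k, j) returns the bucket of the (j+1)-th probe of repetition i
    at level k (0-based j), and P[k][l-1] = P_{k,l} is the probability that a
    point at distance r is found among the first l probes.  Assumes that
    1 / P_{k,l} never exceeds the number of repetitions L in the index."""
    W_best, best = len(index.points), None
    pq = [(1 / P[k][0], k, 1) for k in range(1, index.K + 1)]
    heapq.heapify(pq)
    while pq and pq[0][0] < W_best:
        cost, k, l = heapq.heappop(pq)
        if l < max_probes:
            heapq.heappush(pq, ((l + 1) / P[k][l], k, l + 1))
        reps = ceil(1 / P[k][l - 1])
        W = sum(1 + len(probes(q, i, k, j))
                for i in range(reps) for j in range(l))
        if W < W_best:
            W_best, best = W, (k, l)
    if best is None:                              # fall back to a linear scan
        return {i for i, p in enumerate(index.points) if dist(p, q) <= r}
    k, l = best
    reps = ceil(1 / P[k][l - 1])
    return {idx for i in range(reps) for j in range(l)
            for idx in probes(q, i, k, j)
            if dist(index.points[idx], q) <= r}
```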
▶ Theorem 13. Let S ⊆ X with |S| = n, r, K and L be given. Assume we have access to a reasonable probing sequence σ. Then Algorithm 2 solves SRR. For the value W_multi as defined in (2), the expected running time of the while-loop in Lines (4)–(10) is O(W_multi (log W_multi)²) with O(W_multi log W_multi) priority queue operations, and the expected number of distance computations in Line (11) is O(W_multi).

Proof. Since we do sufficiently many repetitions for each (k, ℓ), the correctness of the algorithm follows by the same line of reasoning as in the proof of Theorem 11, where we apply the same argument about a union bound over the parameter space.

For the running time, note that it cannot happen that all pairs (k, ℓ) in the priority queue have cost larger than W_best while there exists a pair (k', ℓ') with k' ≥ k and ℓ' ≥ ℓ not inspected so far such that cost(k', ℓ') < W_best. This is because for fixed k the cost is non-decreasing in ℓ by inequality (4), and for fixed ℓ the cost(k, ℓ) is non-decreasing in k. To compute a new value W_{k,ℓ+1} in Line (8) of the algorithm, we take advantage of the work W_{k,ℓ} already computed, and only count the buckets that are new or no longer needed:

    W_{k,ℓ+1} = W_{k,ℓ} + Σ_{i=1}^{1/P_{k,ℓ+1}} |T_{i,k}(σ_{k,ℓ+1}(q))| − Σ_{j=1}^{ℓ} Σ_{i=1+1/P_{k,ℓ+1}}^{1/P_{k,ℓ}} |T_{i,k}(σ_{k,j}(q))|.
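The update rule can be phrased as a small helper (ours, not from the paper); it mirrors the displayed formula with 1-based repetition and probe indices, realizing the amortized O(1/P_{k,ℓ+1}) cost per step claimed in the continuation of the proof.

```python
from math import ceil

def update_work(W_prev, q, k, l, bucket_size, P):
    """Compute W_{k,l+1} from W_{k,l} following the displayed update rule:
    add the bucket of the new probe l+1 in the repetitions still needed, and
    subtract the buckets of probes 1..l in the repetitions no longer needed.
    bucket_size(i, k, j) = |T_{i,k}(sigma_{k,j}(q))|, with 1-based i and j."""
    reps_new = ceil(1 / P[k][l])          # repetitions needed with l+1 probes
    reps_old = ceil(1 / P[k][l - 1])      # repetitions needed with l probes (>= reps_new)
    added = sum(bucket_size(i, k, l + 1) for i in range(1, reps_new + 1))
    removed = sum(bucket_size(i, k, j)
                  for j in range(1, l + 1)
                  for i in range(reps_new + 1, reps_old + 1))
    return W_prev + added - removed
```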
Figure 2 At each of the L repetitions we query the closest ℓ positions. Since the projected distance X to our target point y is distributed as Bin(k, dist(q, y)/d), we find y with constant probability by setting L = O(Pr[X ≤ a]^{-1}).
For each k, we never visit a bucket more than twice, and amortized over all operations the computation of W_{k,ℓ+1} takes time O(1/P_{k,ℓ+1}). For each k with 1 ≤ k ≤ K, let ℓ*_k be the largest value of ℓ such that the pair (k, ℓ) was considered by the algorithm. The total cost of computing W_{k,1}, . . . , W_{k,ℓ*_k} is then at most Σ_{i=1}^{ℓ*_k} 1/P_{k,i} ≤ ℓ*_k (log ℓ*_k + O(1))/P_{k,ℓ*_k} by inequality (5). By the loop condition, we know that ℓ*_k/P_{k,ℓ*_k} is at most W_best, so the algorithm spends time O(W_best log W_best) for each fixed k. Let k_max be the maximum value of k such that a pair (k, ℓ) was considered by the algorithm. Since we stop when every item in the priority queue has cost higher than or equal to W_best, we must have k_max ≤ log(W_best)/log(1/p_1), since we need at least p_1^{-k_max} repetitions for the single probe on level k_max. Thus, the final search time for the while-loop is O(W_best log²(W_best)), and the algorithm makes exactly W_best distance computations in Line (11). The connection between W_multi and E(W_best) is the same as in the proof of Theorem 11. ∎
6 A Probing Sequence in Hamming Space
In this section we prove Theorem 8 by proposing a novel multi-probing sequence in Hamming space and carefully analyzing optimal parameter settings for the t-heavy Spherical Range Reporting input distributions from Section 3. We use bitsampling as hash functions h^k : {0,1}^d → {0,1}^k. For a fixed query point q ∈ {0,1}^d and k ≥ 1, the probing sequence σ is the ordering of {0,1}^k by distance to h^k(q), i.e., σ_{k,ℓ} maps h^k(q) to the ℓ-th closest point in {0,1}^k, where ties are broken arbitrarily. Fix a target close point y ∈ {0,1}^d at distance r, let p be the probability that q and y collide under a single hash function, and let p_{k,ℓ} be the probability that y lands in the ℓ-th bucket that we check. Furthermore, let V(a) = Σ_{i=0}^{a} (k choose i) be the volume of the radius-a Hamming ball. If σ_{k,ℓ}(h^k(q)) is at distance a from h^k(q), we have a collision if q and y differ in exactly a out of the k coordinates chosen by h^k. Hence, p_{k,ℓ} = p^{k−a}(1 − p)^a for the a satisfying V(a − 1) < ℓ ≤ V(a). If p ≥ 1/2, then p_{k,ℓ+1} ≤ p_{k,ℓ} for each integer ℓ ≥ 1, so σ is a reasonable probing sequence. Figure 2 illustrates our approach.
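One way to realize this probing sequence for bitsampling (our sketch, not the paper's code): enumerate the bit masks of {0,1}^k in order of Hamming weight and flip them onto the query's hash code, so that probe ℓ visits the ℓ-th closest code.

```python
from itertools import combinations

def hamming_probes(code, max_radius):
    """Yield codes (tuples of bits) at distance 0, 1, 2, ... from `code`,
    i.e. the probing sequence sigma_{k,1}, sigma_{k,2}, ... for bitsampling."""
    k = len(code)
    for a in range(max_radius + 1):
        for flips in combinations(range(k), a):
            probe = list(code)
            for i in flips:
                probe[i] ^= 1
            yield tuple(probe)

# Example: the first 1 + k probes cover the exact code and all codes at distance 1.
```

Ties within each radius are broken in a fixed lexicographic order, which is one valid choice of the arbitrary tie-breaking above.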
We will now show that the smaller number of repetitions needed by multi-probing leads to fewer collisions with the t very close points in hard instances of SRR. To see this, we bound the value of W_multi from (2) as follows:

    W_multi = min_{k,ℓ} (1 / Σ_{1≤i≤ℓ} Pr[σ_{k,i}(h^k(q)) = h^k(y)]) · (ℓ + Σ_{x∈S, 1≤i≤ℓ} Pr[σ_{k,i}(h^k(q)) = h^k(x)])
            ≤ min_{k,a} (1 / Pr[dist(h(q), h(y)) ≤ a]) · (V(a) + Σ_{x∈S} Pr[dist(h(q), h(x)) ≤ a])
            = O( min_{k,a} exp(k·D(a/k ∥ 1−p_1)) · [ exp(k·H(a/k)) + t + n·exp(−k·D(a/k ∥ 1−p_2)) ] ).

Here we have restricted ℓ to only consider lengths that correspond to searching complete Hamming balls up to radius a. Furthermore, we used that the distances in the projected space are binomially distributed and applied a standard bound on the tail of the binomial distribution, namely Pr[Bin(k, p) ≤ αk] = exp(−k·D(α ∥ p))·Θ(1/√k) for α ∈ (0, 1/2) [19]. The next step is to minimize this bound over the choice of k and a. For simplicity we write α = a/k for the normalized radius. Up to constant factors, the expression is minimized by setting the three terms in the parenthesis equal. So, k·H(α) = log t = log n − k·D(α ∥ 1−p_2), which implies k = log t / H(α) = log n / (H(α) + D(α ∥ 1−p_2)). We can then write down the two equations:

    log W_multi = k·D(α ∥ 1−p_1) + log t + O(1) = (D(α ∥ 1−p_1)/H(α) + 1)·log t + O(1), and      (6)

    1/τ = log n / log t = (H(α) + D(α ∥ 1−p_2))/H(α) = D(α ∥ 1−p_2)/H(α) + 1,                    (7)

which are exactly the values stated in Theorem 8.

Next we prove Corollary 9. For the first statement, observe that if the suggested value of α ends up being larger than 1 − p_1, the search radius is large enough to encompass the expected value of the projected distance X, and so we only have to do a constant number of repetitions. Since the second factor is 3t, everything disappears except O(t). This happens exactly when τ ≥ H(1−p_2)/(H(1−p_2) + D(1−p_2 ∥ 1−p_1)).

For the second part of the corollary, we solve the equation implied by Theorem 8 asymptotically as τ → 0. Details can be found in Appendix 8.1, but the idea is as follows: We first define f_p(α) = 1 + D(α ∥ 1−p)/H(α), and show that f_{p_1}(α) = (ρ + ψα/log(1/p_2) + O(α²))·f_{p_2}(α) for ψ being the constant defined in Corollary 9. Using bootstrapping, we invert f_{p_2} asymptotically and obtain α = f_{p_2}^{-1}(1/τ) = (τ·log(1/p_2)/log(1/α))·(1 + o(1)). Plugging this into (6) proves the corollary.
7 Conclusion
In this article we proposed an adaptive LSH-based algorithm for Spherical Range Reporting that is never worse than a static LSH data structure knowing optimal parameters for the query in advance, and much better on many input distributions where the output is large. The main open problem remaining is to achieve target time O(n^ρ + t) on t-heavy distributions. One approach might be a data-dependent data structure as described in [6]. In the light of our multi-probing results, we however wonder if the bound can be obtained data-independently as well. Here, it would be interesting to analyze other probing sequences. It seems that not all time/space-tradeoff-aware data structures are applicable for this purpose; e.g., the novel filtering approach of [16] does not appear to easily give an improvement over O(tn^ρ). Finally, it would be natural to extend our methods to give better LSH data structures for the k-nearest neighbor problem.
References
1. Josh Alman and Ryan Williams. Probabilistic polynomials and hamming nearest neighbors. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 136–150, 2015. doi:10.1109/FOCS.2015.18.
2. Alexandr Andoni and Piotr Indyk. E2LSH, user manual. 2005. URL: http://www.mit.edu/~andoni/LSH/manual.pdf.
3. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2006, pages 459–468, 2006. doi:10.1109/FOCS.2006.49.
4. Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems 28, NIPS 2015, pages 1225–1233. Curran Associates, Inc., 2015. URL: http://papers.nips.cc/paper/5893-practical-and-optimal-lsh-for-angular-distance.pdf.
5. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, pages 363–372. IEEE, 2011.
6. Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the Forty-Seventh Annual ACM Symposium on the Theory of Computing, STOC 2015, pages 793–801. ACM, 2015.
7. Sunil Arya, Guilherme D. Da Fonseca, and David M. Mount. A unified approach to approximate proximity searching. In European Symposium on Algorithms, ESA 2010, pages 374–385. Springer, 2010.
8. Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH forest: self-tuning indexes for similarity search. In Proceedings of the 14th International Conference on World Wide Web, WWW 2005, pages 651–660. ACM, 2005.
9. Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, STOC 2002, pages 380–388, 2002. doi:10.1145/509907.509965.
10. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SOCG 2004, pages 253–262. ACM, 2004.
11. Wei Dong, Zhe Wang, William Josephson, Moses Charikar, and Kai Li. Modeling LSH for performance tuning. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, pages 669–678. ACM, 2008.
12. Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012. doi:10.4086/toc.2012.v008a014.
13. Sariel Har-Peled and Sepideh Mahabadi. Proximity in the age of distraction: Robust approximate nearest neighbor search. CoRR, abs/1511.07357, 2015. URL: http://arxiv.org/abs/1511.07357.
14. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, STOC 1998, pages 604–613, 1998. doi:10.1145/276698.276876.
15. Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, pages 329–342, 2015. doi:10.1145/2745754.2745761.
16. Thijs Laarhoven. Tradeoffs for nearest neighbors on the sphere. arXiv preprint arXiv:1511.07527, 2015.
17. Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB 2007, pages 950–961. VLDB Endowment, 2007. URL: http://dl.acm.org/citation.cfm?id=1325851.1325958.
18. Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pages 1186–1195, 2006.
19. Valentin Petrov. Sums of independent random variables, volume 82. Springer Science & Business Media, 2012.
20. Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
21. Venu Satuluri and Srinivasan Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. Proceedings of the VLDB Endowment, 5(5):430–441, 2012.
22. Gregory Shakhnarovich, Piotr Indyk, and Trevor Darrell. Nearest-neighbor methods in learning and vision: theory and practice. 2006.
23. Malcolm Slaney, Yury Lifshits, and Junfeng He. Optimal parameters for locality-sensitive hashing. Proceedings of the IEEE, 100(9):2604–2623, 2012.
24. Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment, 6(14):1930–1941, 2013. doi:10.14778/2556549.2556574.
25. Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: Using email spam to identify web spam automatically. In Proceedings of the Third Conference on Email and Anti-Spam, CEAS 2006, 2006.
26. Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357–365, 2005. doi:10.1016/j.tcs.2005.09.023.
8 Appendix
8.1 Proof of Corollary 9, second part
When t is small compared to n, the multiprobing radius α can be made smaller. In this regime, we hence consider the following expansion:

    f(α, p_1) = 1 + D(α ∥ 1−p_1)/H(α)
              = [(H(α) + D(α ∥ 1−p_1)) / (H(α) + D(α ∥ 1−p_2))] · f(α, p_2)
              = [(log(1/p_1) + α·log(p_1/(1−p_1))) / (log(1/p_2) + α·log(p_2/(1−p_2)))] · f(α, p_2)
              = [log(1/p_1)/log(1/p_2) + α·(log(p_1/(1−p_1))·log(1/p_2) − log(1/p_1)·log(p_2/(1−p_2)))/(log(1/p_2))² + O(α²)] · f(α, p_2)
              = (ρ + ψα/log(1/p_2) + O(α²)) · f(α, p_2),                                          (8)

for constants ρ and ψ depending on p_1 and p_2. This already gives us that we are asymptotically optimal, as long as α·f(α, p_2) goes to 0 as f(α, p_2) goes to ∞. To see that this is indeed the
case, we need the following asymptotics:

    H(α) + D(α ∥ 1−p) = α·log(1/(1−p)) + (1 − α)·log(1/p) = log(1/p) + O(α)

    H(α) = α·log(1/α) + (1 − α)·log(1/(1 − α)) = α·log(1/α) + (1 − α)(α + O(α²)) = α·(log(1/α) + 1) + O(α²)

    f(α, p) = (H(α) + D(α ∥ 1−p)) / H(α) = log(1/p) / (α·(log(1/α) + 1)) + O(1/log(1/α))          (9)

We would like to invert (9) to tell us how fast α goes to zero, and plug that into (8). To this end, we let y = f(α, p_2)/log(1/p_2). Then it is clear that, at least asymptotically, 1/y² < α < 1/y. That tells us α = y^{−Θ(1)}, and we can use this estimate to "bootstrap" the inversion:

    α = 1/(y·(log(1/α) + 1)) + O(1/(y²·log(1/α)))
      = 1/( y·( log( 1/( 1/(y·(log(1/α)+1)) + O(1/(y²·log(1/α))) ) ) + 1 ) )
      = 1/( y·[ log(y·(log(1/α)+1)) + log(1/(1 + O(1/y))) + 1 ] ) + O(1/y²)
      = 1/(y·log y + O(y·log log y)) + O(1/y²)
      = 1/(y·log y) + O(log log y/(y·(log y)²))

Plugging the result back into (8) we finally get:

    log E(W_k) = (log t)·f(α, p_1)
               = log t·( ρ + ψ/((log(1/p_2))·y·log y) + O(log log y/(y·(log y)²)) )·f(α, p_2)
               = log t·( ρ·f(α, p_2) + ψ/log f(α, p_2) + O(log log f/(log f)²) )
               = ρ·log n + (ψ/log(log n/log t))·log t + O( log t·log log(log n/log t) / (log(log n/log t))² ),   (10)

as log n/log t goes to ∞, i.e., τ goes to 0.