Probabilistic Hashing for Efficient Search and Learning

Ping Li
Department of Statistical Science
Faculty of Computing and Information Science
Cornell University, Ithaca, NY 14853

MMDS 2012, July 12, 2012
BigData Everywhere

Conceptually, consider a dataset as a matrix of size n × D.

In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare, for example, images, documents, spams, search click-through data.

High-dimensional (image, text, biological) data are common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be arbitrarily high by considering pairwise, 3-way, or higher interactions.
Examples of BigData Challenges: Linear Learning

Binary classification: dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$.

One can fit an L2-regularized linear SVM:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(1 - y_i w^T x_i,\ 0\right),$$

or the L2-regularized logistic regression:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right),$$

where C > 0 is the penalty (regularization) parameter.

The weight vector w has length D, the same as the data dimensionality.
Challenges of Learning with Massive High-dimensional Data
• The data often cannot fit in memory (even when the data are sparse).

• Data loading takes too long. For example, online algorithms require less memory but often need to iterate over the data for several (or many) passes to achieve sufficient accuracy. Data loading in general dominates the cost.

• Training can be very expensive, even for simple linear models such as logistic regression and linear SVM.

• Testing may be too slow to meet the demand, which is especially crucial for applications in search or interactive data visual analytics.

• The model itself can be too large to store; for example, we cannot really store a vector of weights for logistic regression on data of 2^64 dimensions.
A Popular Solution Based on Normal Random Projections

Random projections: replace the original data matrix A by B = A × R.

R ∈ R^{D×k}: a random matrix, with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = AA^T. Therefore, we can simply feed B into (e.g.) SVM or logistic regression solvers.
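As a quick illustration, here is a minimal sketch of this pipeline (Python/NumPy; the function name and the 1/√k scaling are assumptions added here — with i.i.d. N(0,1) entries, the 1/√k factor is what makes E(BB^T) = AA^T hold):

```python
import numpy as np

def normal_random_projection(A, k, seed=0):
    """Project the n x D data matrix A down to n x k dimensions."""
    rng = np.random.default_rng(seed)
    n, D = A.shape
    R = rng.standard_normal((D, k))      # i.i.d. N(0, 1) entries
    return (A @ R) / np.sqrt(k)          # scaling so that E(B B^T) = A A^T

# Toy check: inner products between rows are approximately preserved.
A = np.random.default_rng(1).random((5, 10000))
B = normal_random_projection(A, k=1024)
print(A[0] @ A[1], B[0] @ B[1])          # the two numbers should be roughly equal
```

The projected matrix B (n × k) then replaces A as the input to the linear solver.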
Experiments on Classification, Clustering, and Regression

Dataset      # Examples (n)   # Dims (D)    Avg # nonzeros   Train vs. Test
Webspam      350,000          16,609,143    3,728            4/5 vs. 1/5
MNIST-PW     70,000           193,816       12,346           6/7 vs. 1/7
PEMS (UCI)   440              138,672       138,672          1/2 vs. 1/2
CT (UCI)     53,500           384           384              4/5 vs. 1/5

MNIST-PW = original MNIST + all pairwise features. CT = regression task (reducing the number of examples instead of dimensions).

Instead of using dense (normal) projections, we will experiment with very sparse random projections.
Very Sparse Random Projections

The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

$$r_{ij} = \begin{cases} -1 & \text{with prob. } \frac{1}{2s} \\ 0 & \text{with prob. } 1 - \frac{1}{s} \\ +1 & \text{with prob. } \frac{1}{2s} \end{cases}$$

If s = 100, then on average, 99% of the entries are zero.

If s = 10000, then on average, 99.99% of the entries are zero.

Usually, s = √D or s = k is a good choice.

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD'06.
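A minimal sketch of generating such a sparse projection matrix (Python/NumPy; the function name is illustrative, and a practical implementation would store R in a sparse format and fold in any desired scaling):

```python
import numpy as np

def very_sparse_projection_matrix(D, k, s, seed=0):
    """Entries are -1 w.p. 1/(2s), 0 w.p. 1 - 1/s, and +1 w.p. 1/(2s)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 0.0, 1.0], size=(D, k),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

R = very_sparse_projection_matrix(D=10000, k=512, s=100)
print("fraction of zero entries:", (R == 0).mean())   # about 0.99 for s = 100
```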
Linear SVM Test Accuracies on Webspam

[Figure: classification accuracy (%) vs. C, for k = 32 to 4096; left panel s = 1, right panel s = 1000. Red dashed curves: results based on the original data.]

Observations:

• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.
• The sparsity parameter s matters little unless k is small.
Linear SVM Test Accuracies on Webspam

[Figure: classification accuracy (%) vs. C, for k = 32 to 4096; left panel s = 1, right panel s = 10000. Red dashed curves: results based on the original data.]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse, even with s = 10000.
Linear SVM Test Accuracies on MNIST-PW

[Figure: classification accuracy (%) vs. C, for k = 32 to 4096; left panel s = 1, right panel s = 10000. Red dashed curves: results based on the original data.]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse. For this data, we probably need k > 10^4 for high accuracy.
Linear SVM Test Accuracies

[Figure: eight panels of classification accuracy (%) vs. C on Webspam and MNIST-PW, for s = 1, 100, 1000, 10000 and k = 32 to 4096. Red dashed curves: results based on the original data.]
Logistic Regression Test Accuracies

[Figure: eight panels of classification accuracy (%) vs. C on Webspam and MNIST-PW, for s = 1, 100, 1000, 10000 and k = 32 to 4096. Red dashed curves: results based on the original data.]

Again, as long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse (i.e., very large s).
SVM and Logistic Regression Test Accuracies on PEMS

[Figure: four panels of classification accuracy (%) vs. C on PEMS, for linear SVM and logistic regression with s = 1 and s = 10000, k = 32 to 4096.]

Again, as long as k is large, the projection matrix can be extremely sparse.
K-Means Clustering Accuracies on MNIST-PW

Random projections were applied before the data were fed to k-means clustering. As the classification labels are known, we simply use the classification accuracy to assess the clustering quality.

[Figure: clustering accuracy (%) vs. k on MNIST-PW, for s = 1, 10, 100, 1000, 10000.]

Again, s does not really seem to matter much, as long as k is not small.
Ridge Regression on CT

Random projections were applied to reduce the number of examples before ridge regression. The task becomes estimating the original regression coefficients.

[Figure: regression error vs. k on the CT training and test sets; different curves are for different s values.]

Again, the sparsity parameter s does not really seem to matter much, as long as k is not small.
Advantages of Very Sparse Random Projections (Large s)

Consequences of using a very sparse projection matrix R:

• Matrix multiplication A × R becomes much faster.
• Much easier to store R if necessary.
• Much easier to generate (and re-generate) R on the fly.
• Sparse original data =⇒ sparse projected data. The average number of 0's in the projected data vector would be

$$k \times \left(1 - \frac{1}{s}\right)^{f},$$

where f is the number of nonzero elements in the original data vector.

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD'06.
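A quick numerical check of this count (a simulation sketch assuming NumPy; for continuous-valued nonzeros, a projected coordinate is exactly zero only when all f sampled entries of R are zero):

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, s, f = 10000, 1024, 100, 500

x = np.zeros(D)
x[:f] = rng.random(f) + 0.5                # a data vector with f continuous-valued nonzeros
R = rng.choice([-1.0, 0.0, 1.0], size=(D, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

print("expected zeros:", k * (1 - 1 / s) ** f)        # k (1 - 1/s)^f, about 7 here
print("observed zeros:", int(np.sum(x @ R == 0)))     # should be close to the expected count
```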
Disadvantages of Random Projections (and Variants)
Inaccurate, especially on binary data.
Variance Analysis for Inner Product Estimates

Consider the projection B = A × R again.

First two rows in A: u_1, u_2 ∈ R^D (D is very large):

$$u_1 = \{u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}\}, \qquad u_2 = \{u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D}\}$$

First two rows in B: v_1, v_2 ∈ R^k (k is small):

$$v_1 = \{v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k}\}, \qquad v_2 = \{v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k}\}$$
The estimated inner product is

$$\hat{a} = \frac{1}{k}\sum_{j=1}^{k} v_{1,j}\, v_{2,j} \quad \text{(which is also an inner product)},$$

$$E(\hat{a}) = a, \qquad Var(\hat{a}) = \frac{1}{k}\left(m_1 m_2 + a^2\right), \qquad m_1 = \sum_{i=1}^{D} |u_{1,i}|^2, \quad m_2 = \sum_{i=1}^{D} |u_{2,i}|^2.$$

Random projections may not be good for inner products because the variance is dominated by the marginal l2 norms m_1 m_2, especially when a ≈ 0.

For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).

Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD'06.
Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.
b-Bit Minwise Hashing
• Simple algorithm designed specifically for massive binary data.
• Much more accurate than random projections for estimating inner products.
• Much smaller space requirement than the original minwise hashing algorithm.
• Capable of estimating 3-way similarity, while random projections cannot.
• Useful for large-scale linear learning (and kernel learning of course).
• Useful for sub-linear time near neighbor search.
Massive, High-dimensional, Sparse, and Binary Data

Binary sparse data are very common in the real world:

• For many applications (such as text), binary sparse data are very natural.
• Many datasets can be quantized/thresholded to be binary without hurting the prediction accuracy.
• In some cases, even when the "original" data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.
An Example of Binary (0/1) Sparse Massive Data

A set S ⊆ Ω = {0, 1, ..., D − 1} can be viewed as a 0/1 vector in D dimensions.

Shingling: Each document (Web page) can be viewed as a set of w-shingles.

For example, after parsing, a sentence "today is a nice day" becomes

• w = 1: {"today", "is", "a", "nice", "day"}
• w = 2: {"today is", "is a", "a nice", "nice day"}
• w = 3: {"today is a", "is a nice", "a nice day"}

Previous studies used w ≥ 5, as the single-word (unigram) model is not sufficient.

Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w.

(10^5)^5 = 10^25 ≈ 2^83, although in current practice, it seems D = 2^64 suffices.
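A minimal sketch of w-shingling in Python (illustrative; mapping shingles into {0, ..., D−1} via Python's built-in hash is an assumption made here for the sketch, not the hashing used in practice):

```python
def shingles(text, w):
    """Return the set of w-shingles (w consecutive words) of a parsed document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def shingle_set(text, w, D=2 ** 64):
    """Map each shingle to an id in Omega = {0, ..., D-1}; the set of ids is the 0/1 vector."""
    return {hash(s) % D for s in shingles(text, w)}

print(shingles("today is a nice day", 2))
# {'today is', 'is a', 'a nice', 'nice day'}
```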
Notation

A binary (0/1) vector can be equivalently viewed as a set (the locations of its nonzeros).

Consider two sets S_1, S_2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64), with

$$f_1 = |S_1|, \qquad f_2 = |S_2|, \qquad a = |S_1 \cap S_2|.$$

The resemblance R is a popular measure of set similarity:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}.$$
Minwise Hashing: Stanford Algorithm in the Context of Search

Suppose a random permutation π is performed on Ω, i.e.,

$$\pi : \Omega \longrightarrow \Omega, \qquad \Omega = \{0, 1, ..., D - 1\}.$$

An elementary probability argument shows that

$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R.$$
An Example

D = 5, S_1 = {0, 3, 4}, S_2 = {1, 2, 3}, R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = 1/5.

One realization of the permutation π can be

0 ⟹ 3,  1 ⟹ 2,  2 ⟹ 0,  3 ⟹ 4,  4 ⟹ 1

π(S_1) = {3, 4, 1} = {1, 3, 4},  π(S_2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S_1)) ≠ min(π(S_2)).
Minwise Hashing Estimator

After k permutations, π_1, π_2, ..., π_k, one can estimate R without bias:

$$\hat{R}_M = \frac{1}{k}\sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \qquad Var\left(\hat{R}_M\right) = \frac{1}{k} R (1 - R).$$

We recently realized that this estimator can be written as an inner product in 2^64 × k dimensions (assuming D = 2^64). This means one can potentially use it for linear learning, although the dimensionality would be excessively high.

Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.
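A minimal simulation sketch of this estimator, using explicit random permutations of Ω (feasible only for toy D; names are illustrative):

```python
import numpy as np

def minwise_estimate(S1, S2, D, k, seed=0):
    """Estimate the resemblance R with k independent random permutations."""
    rng = np.random.default_rng(seed)
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)                          # one random permutation of Omega
        matches += min(pi[list(S1)]) == min(pi[list(S2)])
    return matches / k

# The example from the earlier slide: R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.
print(minwise_estimate({0, 3, 4}, {1, 2, 3}, D=5, k=10000))   # close to 0.2
```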
b-Bit Minwise Hashing: the Intuition

Basic idea: only store the lowest b bits of each hashed value, for small b.

Intuition:

• When two sets are identical, the lowest b bits of their hashed values are of course also equal. (b = 1 only stores whether a number is even or odd.)
• When two sets are similar, the lowest b bits of their hashed values "should be" also similar (True?? Needs a proof).
• Therefore, hopefully we do not need many bits to obtain useful information, especially when real applications care about pairs with reasonably high resemblance values (e.g., 0.5).
The Basic Collision Probability Result

Consider two sets, S_1 and S_2,

$$S_1, S_2 \subseteq \Omega = \{0, 1, 2, ..., D - 1\}, \qquad f_1 = |S_1|, \quad f_2 = |S_2|, \quad a = |S_1 \cap S_2|.$$

Define the minimum values under π : Ω → Ω to be z_1 and z_2:

$$z_1 = \min(\pi(S_1)), \qquad z_2 = \min(\pi(S_2)),$$

and their b-bit versions

$$z_1^{(b)} = \text{the lowest } b \text{ bits of } z_1, \qquad z_2^{(b)} = \text{the lowest } b \text{ bits of } z_2.$$

Example: if z_1 = 7 (= 111 in binary), then z_1^{(1)} = 1 and z_1^{(2)} = 3.
Collision probability: Assume D is large (which is virtually always true). Then

$$P_b = \Pr\left(z_1^{(b)} = z_2^{(b)}\right) = C_{1,b} + (1 - C_{2,b})\, R.$$

Recall that (assuming infinite precision, or as many digits as needed) we have

$$\Pr(z_1 = z_2) = R.$$
Collision probability: Assume D is large (which is virtually always true). Then

$$P_b = \Pr\left(z_1^{(b)} = z_2^{(b)}\right) = C_{1,b} + (1 - C_{2,b})\, R,$$

where, with r_1 = f_1/D, r_2 = f_2/D, f_1 = |S_1|, f_2 = |S_2|,

$$C_{1,b} = A_{1,b}\,\frac{r_2}{r_1 + r_2} + A_{2,b}\,\frac{r_1}{r_1 + r_2}, \qquad C_{2,b} = A_{1,b}\,\frac{r_1}{r_1 + r_2} + A_{2,b}\,\frac{r_2}{r_1 + r_2},$$

$$A_{1,b} = \frac{r_1\,[1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \qquad A_{2,b} = \frac{r_2\,[1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}.$$

Ref: Li and König, b-Bit Minwise Hashing, WWW'10.
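A minimal sketch that evaluates this closed-form (large-D) collision probability, following the definitions above (variable names are illustrative, and the constants are as reconstructed on this slide):

```python
def A(r, b):
    """A_{j,b} as defined above, with r = f_j / D."""
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def collision_probability(R, r1, r2, b):
    A1, A2 = A(r1, b), A(r2, b)
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1 + (1 - C2) * R

# For very sparse sets (r1, r2 -> 0), P_b approaches 1/2^b + (1 - 1/2^b) R.
print(collision_probability(R=0.5, r1=1e-6, r2=1e-6, b=1))   # about 0.75
```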
The closed-form collision probability is remarkably accurate even for small D. The absolute errors (approximate − exact) are very small even for D = 20.

[Figure: approximation error vs. a/f_2 for (D = 20, f_1 = 4, b = 1), (D = 20, f_1 = 10, b = 1), (D = 500, f_1 = 50, b = 1), and (D = 500, f_1 = 250, b = 1), for several values of f_2.]

Ref: Li and König, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.
An Unbiased Estimator

An unbiased estimator $\hat{R}_b$ for R:

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \qquad \hat{P}_b = \frac{1}{k}\sum_{j=1}^{k} 1\left\{z_{1,\pi_j}^{(b)} = z_{2,\pi_j}^{(b)}\right\},$$

$$Var\left(\hat{R}_b\right) = \frac{Var\left(\hat{P}_b\right)}{[1 - C_{2,b}]^2} = \frac{1}{k}\,\frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2}.$$
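A minimal end-to-end simulation sketch of this estimator (toy permutations, then debiasing with C_{1,b} and C_{2,b} computed from the closed-form constants above; illustrative only):

```python
import numpy as np

def A(r, b):
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def bbit_estimate(S1, S2, D, k, b, seed=0):
    rng = np.random.default_rng(seed)
    mask = (1 << b) - 1
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)
        matches += (min(pi[list(S1)]) & mask) == (min(pi[list(S2)]) & mask)
    P_hat = matches / k                                    # empirical collision rate

    r1, r2 = len(S1) / D, len(S2) / D
    A1, A2 = A(r1, b), A(r2, b)
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return (P_hat - C1) / (1 - C2)                         # debiased estimate of R

S1, S2 = set(range(0, 60)), set(range(30, 90))             # R = 30/90 = 1/3
print(bbit_estimate(S1, S2, D=1000, k=5000, b=1))          # roughly 0.33
```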
The Variance-Space Trade-off

Smaller b ⟹ smaller space for each sample. However, smaller b ⟹ larger variance.

B(b; R, r_1, r_2) characterizes the variance-space trade-off:

$$B(b; R, r_1, r_2) = b \times Var\left(\hat{R}_b\right).$$

Lower B(b) is better.
The ratio

$$\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)}$$

measures the improvement of using b = b_2 (e.g., b_2 = 1) over using b = b_1 (e.g., b_1 = 64).

B(64)/B(b) = 20 means that, to achieve the same accuracy (variance), using b = 64 bits per hashed value will require 20 times more space (in bits) than using b = 1.
[Figure: B(64)/B(b), the relative improvement of using b = 1, 2, 3, 4 bits compared to 64 bits, as a function of the resemblance R, for r_1 = r_2 = 10^-10, 0.1, 0.5, and 0.9.]
Relative Improvements in the Least Favorable Situation

If r_1, r_2 → 0 (the least favorable situation; there is a proof), then

$$\frac{B(64)}{B(1)} = 64\,\frac{R}{R + 1}.$$

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.
Experiments: Duplicate detection on Microsoft news articles

The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R_0.

[Figure: precision vs. sample size k for thresholds R_0 = 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, comparing b = 1, 2, 4 bits and the original (64-bit) minwise hashing M.]
1/2 Bit Suffices for High Similarity

Practical applications are sometimes interested in pairs of very high similarities. For these data, one can XOR two bits from two permutations into just one bit.

The new estimator is denoted by $\hat{R}_{1/2}$. Compared to the 1-bit estimator $\hat{R}_1$:

• A good idea for highly similar data:

$$\lim_{R \to 1} \frac{Var\left(\hat{R}_1\right)}{Var\left(\hat{R}_{1/2}\right)} = 2.$$

• Not so good an idea when the data are not very similar:

$$Var\left(\hat{R}_1\right) < Var\left(\hat{R}_{1/2}\right), \quad \text{if } R < 0.5774 \text{ (assuming sparse data)}.$$

Ref: Li and König, b-Bit Minwise Hashing, WWW'10.
b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections:

[Figure: Venn diagram of three sets with sizes f_1, f_2, f_3 (normalized r_1, r_2, r_3), pairwise intersections a_12, a_13, a_23 (normalized s_12, s_13, s_23), and 3-way intersection a (normalized s).]

3-way collision probability: Assume D is large. Then

$$\Pr(\text{lowest } b \text{ bits of the 3 hashed values are equal}) = \frac{Z}{u} + R_{123} = \frac{Z + s}{u},$$

where u = r_1 + r_2 + r_3 − s_12 − s_13 − s_23 + s, and ...
$$\begin{aligned}
Z =\ & (s_{12} - s)A_{3,b} + \frac{(r_3 - s_{13} - s_{23} + s)}{r_1 + r_2 - s_{12}}\, s_{12}\, G_{12,b} \\
  +\ & (s_{13} - s)A_{2,b} + \frac{(r_2 - s_{12} - s_{23} + s)}{r_1 + r_3 - s_{13}}\, s_{13}\, G_{13,b} \\
  +\ & (s_{23} - s)A_{1,b} + \frac{(r_1 - s_{12} - s_{13} + s)}{r_2 + r_3 - s_{23}}\, s_{23}\, G_{23,b} \\
  +\ & \frac{(r_1 - s_{12} - s_{13} + s)}{r_2 + r_3 - s_{23}}\Big[(r_2 - s_{23})A_{3,b} + (r_3 - s_{23})A_{2,b}\Big] G_{23,b} \\
  +\ & \frac{(r_2 - s_{12} - s_{23} + s)}{r_1 + r_3 - s_{13}}\Big[(r_1 - s_{13})A_{3,b} + (r_3 - s_{13})A_{1,b}\Big] G_{13,b} \\
  +\ & \frac{(r_3 - s_{13} - s_{23} + s)}{r_1 + r_2 - s_{12}}\Big[(r_1 - s_{12})A_{2,b} + (r_2 - s_{12})A_{1,b}\Big] G_{12,b},
\end{aligned}$$

$$A_{j,b} = \frac{r_j (1 - r_j)^{2^b - 1}}{1 - (1 - r_j)^{2^b}}, \qquad G_{ij,b} = \frac{(r_i + r_j - s_{ij})(1 - r_i - r_j + s_{ij})^{2^b - 1}}{1 - (1 - r_i - r_j + s_{ij})^{2^b}}, \qquad i, j \in \{1, 2, 3\},\ i \neq j.$$

Ref: Li, König, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities, NIPS'10.
Useful messages:

1. Substantial improvement over using 64 bits, just like in the 2-way case.
2. One must use b ≥ 2 bits for estimating 3-way similarities.
b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products.

If we look carefully, the estimator of minwise hashing is indeed an inner product of two (extremely sparse) vectors in D × k dimensions (infeasible when D = 2^64):

$$\hat{R}_M = \frac{1}{k}\sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\},$$

because, with z_1 = min(π(S_1)), z_2 = min(π(S_2)) ∈ Ω = {0, 1, ..., D − 1},

$$1\{z_1 = z_2\} = \sum_{i=0}^{D-1} 1\{z_1 = i\} \times 1\{z_2 = i\}.$$

Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.
Consider D = 5. We can expand numbers into vectors of length 5:

0 ⟹ [0, 0, 0, 0, 1],  1 ⟹ [0, 0, 0, 1, 0],  2 ⟹ [0, 0, 1, 0, 0],
3 ⟹ [0, 1, 0, 0, 0],  4 ⟹ [1, 0, 0, 0, 0].

If z_1 = 2, z_2 = 3, then

0 = 1{z_1 = z_2} = inner product between [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].

If z_1 = 2, z_2 = 2, then

1 = 1{z_1 = z_2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].
Linear Learning Algorithms

Linear algorithms such as linear SVM and logistic regression have become very powerful and extremely popular. Representative software packages include SVM^perf, Pegasos, Bottou's SGD SVM, and LIBLINEAR.

Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, the L2-regularized linear SVM solves the following optimization problem:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(1 - y_i w^T x_i,\ 0\right),$$

and the L2-regularized logistic regression solves a similar problem:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right).$$

Here C > 0 is an important penalty (regularization) parameter.
Integrating b-Bit Minwise Hashing for (Linear) Learning

Very simple:

1. We apply k independent random permutations on each (binary) feature vector x_i and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely nbk bits.

2. At run-time, we expand each new data point into a 2^b × k-length vector, i.e., we concatenate the k vectors (each of length 2^b). The new feature vector has exactly k 1's.
An example with k = 3 and b = 2

For set (vector) S_1:

Original hashed values (k = 3):       12013            25964            20191
Original binary representations:      010111011101101  110010101101100  100111011011111
Lowest b = 2 binary digits:           01               00               11
Corresponding decimal values:         1                0                3
Expanded 2^b = 4 binary digits:       0010             0001             1000

New feature vector fed to a solver:   {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}

Same procedure on sets S_2, S_3, ...
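A minimal sketch of this expansion step in Python, reproducing the numbers above (the one-hot layout — value v sets the (v+1)-th position from the right within its length-2^b block — is inferred from the example):

```python
import numpy as np

def expand_bbit(hashed_values, b):
    """Turn k hashed values into the concatenated 2^b x k binary feature vector."""
    k, width = len(hashed_values), 2 ** b
    x = np.zeros(width * k, dtype=np.int8)
    for j, h in enumerate(hashed_values):
        v = h & (width - 1)                       # keep only the lowest b bits
        x[j * width + (width - 1 - v)] = 1        # one-hot, counted from the right
    return x

print(expand_bbit([12013, 25964, 20191], b=2))
# [0 0 1 0 0 0 0 1 1 0 0 0]  -- exactly k = 3 ones, as on the slide
```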
Datasets and Solver for Linear Learning

This talk presents the experiments on two text datasets.

Dataset            # Examples (n)   # Dims (D)       Avg # Nonzeros   Train / Test
Webspam (24 GB)    350,000          16,609,143       3728             80% / 20%
Rcv1 (200 GB)      781,265          1,010,017,424    12062            50% / 50%

To generate the Rcv1 dataset, we used the original features + all pairwise features + 1/30 of the 3-way features.

We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic, independent of the underlying procedures.

All experiments were conducted on workstations with Xeon(R) CPUs and 48GB RAM, under the Windows 7 system.
Experimental Results on Linear SVM and Webspam

• We conducted extensive experiments for a wide range of regularization values C (from 10^-3 to 10^2), with fine spacings in [0.1, 10].
• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.
Testing Accuracy

[Figure: test accuracy (%) vs. C on Webspam, linear SVM with k = 200 and b = 1 to 16.]

• Solid: b-bit hashing. Dashed (red): the original data.
• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using the original data.
• The results are averaged over 50 runs.
[Figure: test accuracy (%) vs. C on Webspam, linear SVM with k = 50, 100, 150, 200, 300, 500 and b = 1 to 16. Red dashed curves: results based on the original data.]
Stability of Testing Accuracy (Standard Deviation)

Our method produces very stable results, especially for b ≥ 4.

[Figure: standard deviation of test accuracy (%) vs. C on Webspam, linear SVM with k = 50 to 500 and b = 1 to 16.]
Training Time

[Figure: training time (seconds) vs. C on Webspam, linear SVM with k = 50 to 500 and b = 1 to 16.]

• The reported times do not include data loading time (which is small for b-bit hashing).
• The original training time is about 100 seconds.
• b-Bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).
Testing Time

[Figure: testing time (seconds) vs. C on Webspam, linear SVM with k = 50 to 500.]
55
Experimental Results on L2 -Regularized Logistic Regression
100
b=6
b=4 b=2 b=1
logit: k = 100 Spam: Accuracy −1
0
10
1
10
10
2
10
C
b=4 b=8
b=2
95 b=1
80 −3 10
logit: k = 300 Spam:Accuracy −2
10
−1
0
10
10 C
95
b=2
90
b=1
85
logit: k = 150 Spam:Accuracy
80 −3 10
−2
10
−1
0
10
10
1
10
1
10
2
10
2
10
C
100
90
b=4
b=8 Accuracy (%)
100 98 b = 8,10,16 96 94 92 90 88 86 84 82 80 −3 −2 10 10
Accuracy (%)
100 b=4 98 b = 6,8,10,16 96 b=2 94 92 b=1 90 88 86 logit: k = 200 84 82 Spam: Accuracy 80 −3 −2 −1 0 1 2 10 10 10 10 10 10 C
Accuracy (%)
100 98 b=6 b = 8,10,16 96 b=4 94 92 90 b=2 88 86 b=1 84 logit: k = 50 82 Spam: Accuracy 80 −3 −2 −1 0 1 2 10 10 10 10 10 10 C
Accuracy (%)
Accuracy (%)
Accuracy (%)
Testing Accuracy
100 b=4 b = 6,8,10,16 98 b=2 4 96 b=1 94 92 90 88 86 logit: k = 500 84 82 Spam: Accuracy 80 −3 −2 −1 0 1 2 10 10 10 10 10 10 C
Training Time

[Figure: training time (seconds) vs. C on Webspam, logistic regression with k = 50 to 500 and various b.]
Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09; not the VW online learning platform), which has the same variance as random projections.

Consider two sets S_1, S_2, with f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|, and R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = a / (f_1 + f_2 − a). Then, from their variances,

$$Var\left(\hat{a}_{VW}\right) \approx \frac{1}{k}\left(f_1 f_2 + a^2\right), \qquad Var\left(\hat{R}_{MINWISE}\right) = \frac{1}{k} R (1 - R),$$

we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).
Comparing b-Bit Minwise Hashing with VW

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 ∼ 10^6.

[Figure: test accuracy (%) vs. k on Webspam, linear SVM and logistic regression, VW vs. b = 8 hashing, for several C values.]
8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: training time (seconds) vs. k on Webspam, linear SVM and logistic regression, VW vs. b = 8 hashing, for several C values.]
Experimental Results on Rcv1 (200 GB)

Test accuracy using linear SVM (we cannot train on the original data):

[Figure: test accuracy (%) vs. C on Rcv1, linear SVM with k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]
Training time using linear SVM:

[Figure: training time (seconds) vs. C on Rcv1, linear SVM with k = 50 to 500 and various b.]
Test accuracy using logistic regression:

[Figure: test accuracy (%) vs. C on Rcv1, logistic regression with k = 50 to 500 and b = 1 to 16.]
Training time using logistic regression:

[Figure: training time (seconds) vs. C on Rcv1, logistic regression with k = 50 to 500 and various b.]
Comparisons with VW on Rcv1

Test accuracy using linear SVM:

[Figure: test accuracy (%) vs. k on Rcv1, VW vs. b = 1, 2, 4, 8, 12, 16 hashing, for several C values.]
Test accuracy using logistic regression:

[Figure: test accuracy (%) vs. k on Rcv1, VW vs. b = 1, 2, 4, 8, 12, 16 hashing, for several C values.]
Training time:

[Figure: training time (seconds) vs. k on Rcv1, VW vs. b = 8 hashing, linear SVM and logistic regression, for several C values.]
b-Bit Minwise Hashing for Efficient Near Neighbor Search

Near neighbor search is a much more frequent operation than training an SVM.

The bits from b-bit minwise hashing can be directly used to build hash tables to enable sub-linear time near neighbor search.

This is an instance of the general family of locality-sensitive hashing (LSH).
An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^(2×2) = 16.

Table 1                          Table 2
Index   Data Points              Index   Data Points
00 00   8, 13, 251               00 00   2, 19, 83
00 01   5, 14, 19, 29            00 01   17, 36, 129
00 10   (empty)                  00 10   4, 34, 52, 796
...     ...                      ...     ...
11 01   7, 24, 156               11 01   7, 198
11 10   33, 174, 3153            11 10   56, 989
11 11   61, 342                  11 11   8, 9, 156, 879

Then, the data points are placed in the buckets according to their hashed values. Look for near neighbors in the bucket which matches the hash value of the query. Replicate the hash table (twice in this case) to avoid missing good near neighbors. The final retrieved data points (before re-ranking) are the union of the buckets.
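A minimal sketch of this table construction (a plain Python dictionary stands in for the hash table; identifiers are illustrative):

```python
from collections import defaultdict

def bucket_index(bbit_values, b):
    """Concatenate k b-bit values into a single B = b*k bit table index."""
    idx = 0
    for v in bbit_values:
        idx = (idx << b) | v
    return idx

def build_table(hashed_points, b):
    """hashed_points: {point_id: list of k b-bit hashed values}."""
    table = defaultdict(list)
    for pid, values in hashed_points.items():
        table[bucket_index(values, b)].append(pid)
    return table

# Toy usage with b = 2 and k = 2 (table indices 0000 to 1111, as above).
table = build_table({8: [0, 0], 13: [0, 0], 5: [0, 1], 7: [3, 1]}, b=2)
query = [0, 1]
print(table[bucket_index(query, b=2)])   # candidate near neighbors of the query: [5]
```

With L independent tables, the final candidate set (before re-ranking) is the union of the L matching buckets, as described above.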
Experiments on the Webspam Dataset
B = b × k : table size
L: number of tables.
Fractions of retrieved data points (before re-ranking) 0
0
10 Fraction Evaluated
Fraction Evaluated
Webspam: L = 25 b=4
−1
b=2
10
b=1
−2
10
8
SRP b−bit 12
b=4 b=2 −1
10
b=1
−2
16 B (bits)
20
24
10
10
Webspam: L = 50
8
SRP b−bit 12
b=4
Fraction Evaluated
0
10
Webspam: L = 100
b=2 −1
10
b=1
SRP b−bit −2
16 B (bits)
20
24
10
8
SRP (sign random projections) and b-bit hashing (with b
12
16 B (bits)
20
24
= 1, 2) retrieved similar
numbers of data points.
———————— Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary Data, ECML’12
69
Precision-recall curves (the higher the better) for retrieving the top-T data points.

[Figure: precision vs. recall on Webspam for T = 5, 10, 50, with B = 16 and B = 24, L = 100, comparing SRP and b-bit hashing (b = 1, 4).]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections) for table sizes B = 24 and B = 16. Note that there are many LSH schemes, which we did not compare exhaustively.
Simulating Permutations with Universal (2U and 4U) Hashing

[Figure: test accuracy (%) vs. C on Webspam, linear SVM with k = 10, 100, 200, 500 and b = 1 to 8, comparing full permutations (Perm) with 2-universal (2U) and 4-universal (4U) hashing.]

Both 2U and 4U perform well except when b = 1 or k is small.

Ref: Li, Shrivastava, König, b-Bit Minwise Hashing in Practice, arXiv:1205.2958.
Speed Up Preprocessing Using GPUs

CPU: data loading and preprocessing (for k = 500 permutations) times, in seconds. Note that we measured the loading times of LIBLINEAR, which used a plain text data format. Using binary data will speed up data loading.

Dataset    Loading     Permu       2U          4U (Mod)    4U (Bit)
Webspam    9.7 × 10^2  6.1 × 10^3  4.1 × 10^3  4.4 × 10^4  1.4 × 10^4
Rcv1       1.0 × 10^4  –           3.0 × 10^4  –           –

GPU: data loading and preprocessing (for k = 500 permutations) times, in seconds.

Dataset    Loading     GPU 2U      GPU 4U (Mod)   GPU 4U (Bit)
Webspam    9.7 × 10^2  51          5.2 × 10^2     1.2 × 10^2
Rcv1       1.0 × 10^4  1.4 × 10^3  1.5 × 10^4     3.2 × 10^3

Note that the modular operations in 4U can be replaced by bit operations.
Breakdowns of the overhead: a batch-based GPU implementation.

[Figure: time (seconds) vs. batch size for the GPU kernel, CPU→GPU, and GPU→CPU transfers, on Webspam and Rcv1, for 2U and 4U-Bit hashing.]

GPU kernel times dominate the cost. The cost is not sensitive to the batch size.

Ref: Li, Shrivastava, König, b-Bit Minwise Hashing in Practice, arXiv:1205.2958.
Ref: Li, Shrivastava, König, GPU-Based Minwise Hashing, WWW'12 Poster.
Online Learning (SGD) with Hashing

b-Bit minwise hashing can also be highly beneficial for online learning, in that it significantly reduces the data loading time, which is usually the bottleneck for online algorithms, especially if many epochs are required.

[Figure: test accuracy (%) vs. λ on Webspam, SGD SVM with k = 50 and k = 200, b = 1 to 6. Red dashed curve: results based on the original data.]

λ = 1/C, following the convention in the online learning literature.

Ref: Li, Shrivastava, König, b-Bit Minwise Hashing in Practice, arXiv:1205.2958.
Conclusion

• BigData is everywhere. For many operations, including clustering, classification, near neighbor search, etc., exact answers are often not necessary.

• Very sparse random projections (KDD 2006) work as well as dense random projections. They do, however, require a large number of projections.

• Minwise hashing is a standard procedure in the context of search. b-Bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space).

• b-Bit minwise hashing provides a very simple strategy for efficient linear learning, by expanding the bits into vectors of length 2^b × k.

• We can directly build hash tables from b-bit minwise hashing for sub-linear time near neighbor search.

• It can be substantially more accurate than random projections (and variants).
• Major drawbacks of b-bit minwise hashing:

  – It was designed (mostly) for binary, static data.
  – It needs a fairly high-dimensional representation (2^b × k) after expansion.
  – Expensive preprocessing and, consequently, an expensive testing phase for unprocessed data. Parallelization solutions based on (e.g.) GPUs offer very good performance in terms of time, but they are not energy-efficient.