Probabilistic Hashing for Efficient Search and Learning

Ping Li
Department of Statistical Science
Faculty of Computing and Information Science
Cornell University, Ithaca, NY 14853

July 12, 2012 (MMDS 2012)

BigData Everywhere

Conceptually, consider a dataset as a matrix of size n × D. In modern applications, n = 10^6 examples is common and n = 10^9 is not rare; examples include images, documents, spam, and search click-through data.

High-dimensional (image, text, biological) data are also common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be arbitrarily high once we consider pairwise, 3-way, or higher-order interactions.

Examples of BigData Challenges: Linear Learning

Binary classification: dataset $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$.

One can fit an $L_2$-regularized linear SVM:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(1 - y_i w^T x_i,\ 0\right),$$

or the $L_2$-regularized logistic regression:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right),$$

where C > 0 is the penalty (regularization) parameter.

The weight vector w has length D, the same as the data dimensionality.


Challenges of Learning with Massive High-dimensional Data

• The data often cannot fit in memory (even when the data are sparse).
• Data loading takes too long. For example, online algorithms require less memory but often need to iterate over the data for several (or many) passes to achieve sufficient accuracy; data loading in general dominates the cost.
• Training can be very expensive, even for simple linear models such as logistic regression and linear SVM.
• Testing may be too slow to meet demand, which is especially crucial for applications in search or interactive data visual analytics.
• The model itself can be too large to store; for example, we cannot really store a vector of weights for logistic regression on data of 2^64 dimensions.


A Popular Solution Based on Normal Random Projections

Random projections: replace the original data matrix A by B = A × R.

R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = AA^T. Therefore, we can simply feed B into (e.g.) SVM or logistic regression solvers.
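As a side note (not from the slides), here is a minimal NumPy sketch of this idea; the sizes n, D, k are toy values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 100, 10_000, 256          # toy sizes; real D can be 10^6 or far larger

A = rng.random((n, D))              # original data matrix (n x D)
R = rng.standard_normal((D, k))     # i.i.d. N(0, 1) projection matrix (D x k)
B = (A @ R) / np.sqrt(k)            # projected matrix (n x k)

# Inner products between rows are approximately preserved:
exact  = A[0] @ A[1]
approx = B[0] @ B[1]
print(exact, approx)                # approx fluctuates around exact with variance O(1/k)
```

With the 1/sqrt(k) scaling used above, E(BB^T) = AA^T, so distances and inner products are preserved in expectation.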


Experiments on Classification, Clustering, and Regression

Dataset       # Examples (n)   # Dims (D)    Avg # Nonzeros   Train vs. Test
Webspam       350,000          16,609,143    3,728            4/5 vs. 1/5
MNIST-PW      70,000           193,816       12,346           6/7 vs. 1/7
PEMS (UCI)    440              138,672       138,672          1/2 vs. 1/2
CT (UCI)      53,500           384           384              4/5 vs. 1/5

MNIST-PW = original MNIST + all pairwise features. CT = regression task (reducing the number of examples instead of the dimensions).

Instead of using dense (normal) projections, we will experiment with very sparse random projections.


Very Sparse Random Projections

The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

$$r_{ij} = \begin{cases} -1 & \text{with prob. } \frac{1}{2s} \\ 0 & \text{with prob. } 1 - \frac{1}{s} \\ +1 & \text{with prob. } \frac{1}{2s} \end{cases}$$

If s = 100, then on average 99% of the entries are zero.
If s = 10000, then on average 99.99% of the entries are zero.

Usually s = √D or s = k is a good choice.

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD'06.
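A small illustrative sketch (mine, not from the talk) of generating such a sparse projection matrix; the sizes are again toy values:

```python
import numpy as np

def very_sparse_projection(D, k, s, rng):
    """Entries are -1 or +1 with prob. 1/(2s) each, and 0 with prob. 1 - 1/s."""
    p = 1.0 / (2.0 * s)
    return rng.choice([-1.0, 0.0, 1.0], size=(D, k), p=[p, 1.0 - 2.0 * p, p])

rng = np.random.default_rng(0)
R = very_sparse_projection(D=10_000, k=256, s=100, rng=rng)
print((R == 0).mean())   # ~0.99 of the entries are zero when s = 100
```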


Linear SVM Test Accuracies on Webspam

Red dashed curves: results based on the original data.

[Figure: test accuracy (%) vs. C on Webspam, linear SVM, for k = 32 to 4096, with panels for s = 1 and s = 1000.]

Observations:
• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.
• The sparsity parameter s matters little unless k is small.

Linear SVM Test Accuracies on Webspam

Red dashed curves: results based on the original data.

[Figure: test accuracy (%) vs. C on Webspam, linear SVM, for k = 32 to 4096, with panels for s = 1 and s = 10000.]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse, even with s = 10000.

Linear SVM Test Accuracies on MNIST-PW

Red dashed curves: results based on the original data.

[Figure: test accuracy (%) vs. C on MNIST-PW, linear SVM, for k = 32 to 4096, with panels for s = 1 and s = 10000.]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse. For this dataset, we probably need k > 10^4 for high accuracy.

Linear SVM Test Accuracies

[Figure: test accuracy (%) vs. C for linear SVM on Webspam and MNIST-PW, for s = 1, 100, 1000, 10000 and k = 32 to 4096; red dashed curves show results on the original data.]

Logistic Regression Test Accuracies

[Figure: test accuracy (%) vs. C for L2-regularized logistic regression on Webspam and MNIST-PW, for s = 1, 100, 1000, 10000 and k = 32 to 4096; red dashed curves show results on the original data.]

Again, as long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse (i.e., very large s).

SVM and Logistic Regression Test Accuracies on PEMS

[Figure: test accuracy (%) vs. C on PEMS, for linear SVM and logistic regression with s = 1 and s = 10000, k = 32 to 4096.]

Again, as long as k is large, the projection matrix can be extremely sparse.

K-Means Clustering Accuracies on MNIST-PW

Random projections were applied before the data were fed to k-means clustering. As the classification labels are known, we simply use the classification accuracy to assess the clustering quality.

[Figure: clustering accuracy (%) vs. k (from 10 to 10^4) on MNIST-PW, for s = 1, 10, 100, 1000, 10000.]

Again, s does not really seem to matter much, as long as k is not small.

Ridge Regression on CT

Random projections were applied to reduce the number of examples before ridge regression. The task becomes estimating the original regression coefficients.

[Figure: regression error vs. k (from 10 to 10^4) on CT, for training and test data, with different curves for different s values.]

Again, the sparsity parameter s does not really seem to matter much, as long as k is not small.

Advantages of Very Sparse Random Projections (Large s)

Consequences of using a very sparse projection matrix R:

• Matrix multiplication A × R becomes much faster.
• Much easier to store R if necessary.
• Much easier to generate (and re-generate) R on the fly.
• Sparse original data =⇒ sparse projected data. The average number of 0's in a projected data vector is
  $$k \times \left(1 - \frac{1}{s}\right)^{f},$$
  where f is the number of nonzero elements in the original data vector.

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD'06.

Disadvantages of Random Projections (and Variants)

Inaccurate, especially on binary data.


Variance Analysis for Inner Product Estimates

First two rows of A: u1, u2 ∈ R^D (D is very large):

u1 = {u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}}
u2 = {u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D}}

First two rows of B = A × R: v1, v2 ∈ R^k (k is small):

v1 = {v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k}}
v2 = {v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k}}

The estimator of the inner product a is

$$\hat{a} = \frac{1}{k} \sum_{j=1}^{k} v_{1,j} v_{2,j} \quad \text{(which is also an inner product)},$$

$$E(\hat{a}) = a, \qquad Var(\hat{a}) = \frac{1}{k}\left(m_1 m_2 + a^2\right), \qquad m_1 = \sum_{i=1}^{D} |u_{1,i}|^2, \quad m_2 = \sum_{i=1}^{D} |u_{2,i}|^2.$$

Random projections may not be good for inner products because the variance is dominated by the product of marginal l2 norms, m1 m2, especially when a ≈ 0. For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).

Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD'06.
Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.

b-Bit Minwise Hashing

• Simple algorithm designed specifically for massive binary data.
• Much more accurate than random projections for estimating inner products.
• Much smaller space requirement than the original minwise hashing algorithm.
• Capable of estimating 3-way similarity, while random projections cannot.
• Useful for large-scale linear learning (and kernel learning, of course).
• Useful for sub-linear time near neighbor search.


Massive, High-dimensional, Sparse, and Binary Data

Binary sparse data are very common in the real world:

• For many applications (such as text), binary sparse data are very natural.
• Many datasets can be quantized/thresholded to be binary without hurting the prediction accuracy.
• In some cases, even when the "original" data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.

An Example of Binary (0/1) Sparse Massive Data

A set S ⊆ Ω = {0, 1, ..., D − 1} can be viewed as a 0/1 vector in D dimensions.

Shingling: each document (Web page) can be viewed as a set of w-shingles. For example, after parsing, the sentence "today is a nice day" becomes:

• w = 1: {"today", "is", "a", "nice", "day"}
• w = 2: {"today is", "is a", "a nice", "nice day"}
• w = 3: {"today is a", "is a nice", "a nice day"}

Previous studies used w ≥ 5, as a single-word (unigram) model is not sufficient.

Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w. Note that (10^5)^5 = 10^25 ≈ 2^83, although in current practice it seems D = 2^64 suffices.
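For illustration (not part of the slides), a tiny sketch of w-shingling; the tokenization here is a simple whitespace split, which is an assumption:

```python
def shingles(text, w):
    """Return the set of w-shingles (contiguous word w-grams) of a parsed sentence."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

print(shingles("today is a nice day", 2))
# {'today is', 'is a', 'a nice', 'nice day'}
```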


Notation

A binary (0/1) vector can be equivalently viewed as a set (the locations of its nonzeros). Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64), with

f1 = |S1|,   f2 = |S2|,   a = |S1 ∩ S2|.

The resemblance R is a popular measure of set similarity:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}.$$
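A one-line illustration (mine) of computing the resemblance from two sets of nonzero locations:

```python
def resemblance(S1, S2):
    """R = |S1 ∩ S2| / |S1 ∪ S2| for two sets of nonzero locations."""
    a = len(S1 & S2)
    return a / (len(S1) + len(S2) - a)

print(resemblance({0, 3, 4}, {1, 2, 3}))   # 0.2, i.e. R = 1/5
```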


Minwise Hashing: Stanford Algorithm in the Context of Search

Suppose a random permutation π is performed on Ω, i.e.,

π : Ω −→ Ω,

where

Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R.$$


An Example

D = 5. S1 = {0, 3, 4}, S2 = {1, 2, 3}, R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.

One realization of the permutation π can be:

0 =⇒ 3,  1 =⇒ 2,  2 =⇒ 0,  3 =⇒ 4,  4 =⇒ 1

π(S1) = {3, 4, 1} = {1, 3, 4},   π(S2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S1)) ≠ min(π(S2)).


Minwise Hashing Estimator

After k permutations, π1, π2, ..., πk, one can estimate R without bias:

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \qquad Var\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R).$$

We recently realized that this estimator can be written as an inner product in 2^64 × k dimensions (assuming D = 2^64). This means one can potentially use it for linear learning, although the dimensionality would be excessively high.

Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.
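The following is a small simulation sketch (mine, not from the talk) of the estimator R̂_M using explicit random permutations. It is only feasible for toy D, but it reproduces Pr(min(π(S1)) = min(π(S2))) = R:

```python
import numpy as np

def minhash_estimate(S1, S2, D, k, rng):
    """Estimate R with k independent random permutations of {0, ..., D-1}."""
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)                      # one random permutation of Omega
        if min(pi[j] for j in S1) == min(pi[j] for j in S2):
            matches += 1
    return matches / k

rng = np.random.default_rng(0)
S1, S2, D = {0, 3, 4}, {1, 2, 3}, 5
print(minhash_estimate(S1, S2, D, k=10_000, rng=rng))   # close to R = 0.2
```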


b-Bit Minwise Hashing: the Intuition

Basic idea: only store the lowest b bits of each hashed value, for small b.

Intuition:

• When two sets are identical, their lowest b bits of the hashed values are of course also equal. b = 1 only stores whether a number is even or odd.
• When two sets are similar, their lowest b bits of the hashed values "should be" also similar (true?? needs a proof).
• Therefore, hopefully we do not need many bits to obtain useful information, especially since real applications care about pairs with reasonably high resemblance values (e.g., 0.5).


The Basic Collision Probability Result

Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.

Define the minimum values under π : Ω → Ω to be z1 and z2:

z1 = min(π(S1)),   z2 = min(π(S2)),

and their b-bit versions:

z1^(b) = the lowest b bits of z1,   z2^(b) = the lowest b bits of z2.

Example: if z1 = 7 (= 111 in binary), then z1^(1) = 1 and z1^(2) = 3.

Collision probability (assume D is large, which is virtually always true):

$$P_b = \Pr\left(z_1^{(b)} = z_2^{(b)}\right) = C_{1,b} + (1 - C_{2,b})\, R.$$

Recall that (assuming infinite precision, i.e., as many digits as needed) Pr(z1 = z2) = R.

The constants are determined by r1 = f1/D and r2 = f2/D:

$$C_{1,b} = A_{1,b}\,\frac{r_2}{r_1 + r_2} + A_{2,b}\,\frac{r_1}{r_1 + r_2}, \qquad C_{2,b} = A_{1,b}\,\frac{r_1}{r_1 + r_2} + A_{2,b}\,\frac{r_2}{r_1 + r_2},$$

$$A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \qquad A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}.$$

Ref: Li and König, b-Bit Minwise Hashing, WWW'10.

The closed-form collision probability is remarkably accurate even for small D. The absolute errors (approximate minus exact) are very small even for D = 20.

[Figure: approximation error vs. a/f2, with b = 1, for (D = 20, f1 = 4), (D = 20, f1 = 10), (D = 500, f1 = 50), and (D = 500, f1 = 250), over several values of f2; the errors are on the order of 10^-2 to 10^-4.]

Ref: Li and König, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.


An Unbiased Estimator

An unbiased estimator $\hat{R}_b$ for R:

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \qquad \hat{P}_b = \frac{1}{k} \sum_{j=1}^{k} 1\left\{z_{1,\pi_j}^{(b)} = z_{2,\pi_j}^{(b)}\right\},$$

$$Var\left(\hat{R}_b\right) = \frac{Var\left(\hat{P}_b\right)}{[1 - C_{2,b}]^2} = \frac{1}{k}\, \frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2}.$$
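Below is an illustrative sketch (mine) that evaluates C_{1,b}, C_{2,b} and the estimator R̂_b from the formulas above; the transcription of the formulas is my own reading of the slides, so it should be checked against the WWW'10 paper:

```python
import numpy as np

def c_constants(f1, f2, D, b):
    """C_{1,b} and C_{2,b} from the closed-form collision probability (large D)."""
    r1, r2 = f1 / D, f2 / D
    A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
    A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1, C2

def estimate_resemblance_b(z1_bits, z2_bits, f1, f2, D, b):
    """Unbiased estimator R_hat_b from k pairs of lowest-b-bit hashed values."""
    C1, C2 = c_constants(f1, f2, D, b)
    P_hat = np.mean(np.asarray(z1_bits) == np.asarray(z2_bits))  # empirical collision rate
    return (P_hat - C1) / (1 - C2)
```

For instance, when the data are very sparse and b = 1, both constants approach 1/2, so P_1 approaches (1 + R)/2.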

The Variance-Space Trade-off

Smaller b =⇒ smaller space for each sample. However, smaller b =⇒ larger variance.

B(b; R, r1, r2) characterizes the variance-space trade-off:

$$B(b; R, r_1, r_2) = b \times Var\left(\hat{R}_b\right).$$

Lower B(b) is better.

The ratio

$$\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)}$$

measures the improvement of using b = b2 (e.g., b2 = 1) over using b = b1 (e.g., b1 = 64).

For example, B(64)/B(b) = 20 means that, to achieve the same accuracy (variance), using 64 bits per hashed value will require 20 times more space (in bits) than using b = 1.
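As an illustration (mine), the ratio B(64)/B(b) can be evaluated numerically from the variance formula above; this sketch reuses the c_constants helper from the earlier snippet (an assumption) and treats 64-bit hashed values as exact:

```python
def improvement_over_64(R, r1, r2, b, D=2**30):
    """B(64)/B(b): space improvement of b-bit hashing over 64-bit at equal variance."""
    f1, f2 = int(r1 * D), int(r2 * D)
    C1, C2 = c_constants(f1, f2, D, b)          # helper from the previous sketch
    Pb = C1 + (1 - C2) * R                      # collision probability
    var_b = Pb * (1 - Pb) / (1 - C2) ** 2       # k * Var(R_hat_b)
    var_64 = R * (1 - R)                        # k * Var(R_hat_M), 64 bits treated as exact
    return 64.0 * var_64 / (b * var_b)

print(improvement_over_64(R=0.5, r1=1e-6, r2=1e-6, b=1))   # ~21.3, matching 64 R/(R+1)
```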

[Figure: B(64)/B(b), the relative improvement of using b = 1, 2, 3, 4 bits compared to 64 bits, plotted against the resemblance R, for r1 = r2 = 10^-10, 0.1, 0.5, and 0.9.]

Relative Improvements in the Least Favorable Situation

If r1, r2 → 0 (the least favorable situation; there is a proof), then

$$\frac{B(64)}{B(1)} = 64\,\frac{R}{R + 1}.$$

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.

Experiments: Duplicate Detection on Microsoft News Articles

The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0.

[Figure: precision vs. sample size k, for b = 1, 2, 4 bits and the original minwise hashing (M), at thresholds R0 = 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8.]

1/2 Bit Suffices for High Similarity

Practical applications are sometimes interested in pairs of very high similarities. For such data, one can XOR two bits from two permutations into just one bit. The new estimator is denoted by R̂_{1/2}. Compared to the 1-bit estimator R̂_1:

• A good idea for highly similar data:

$$\lim_{R \to 1} \frac{Var\left(\hat{R}_1\right)}{Var\left(\hat{R}_{1/2}\right)} = 2.$$

• Not such a good idea when the data are not very similar: assuming sparse data,

$$Var\left(\hat{R}_1\right) < Var\left(\hat{R}_{1/2}\right) \quad \text{if } R < 0.5774.$$

Ref: Li and König, b-Bit Minwise Hashing, WWW'10.

b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections (from the Venn diagram in the slides): f_i = |S_i|, a_{ij} = |S_i ∩ S_j|, a = |S_1 ∩ S_2 ∩ S_3|, with normalized versions r_i = f_i/D, s_{ij} = a_{ij}/D, s = a/D.

3-way collision probability (assume D is large):

$$\Pr\left(\text{lowest } b \text{ bits of the 3 hashed values are equal}\right) = \frac{Z}{u} + R_{123} = \frac{Z + s}{u},$$

where u = r1 + r2 + r3 − s12 − s13 − s23 + s, and

$$\begin{aligned}
Z =\;& (s_{12} - s)A_{3,b} + \frac{r_3 - s_{13} - s_{23} + s}{r_1 + r_2 - s_{12}}\, s_{12} G_{12,b} \\
 +\;& (s_{13} - s)A_{2,b} + \frac{r_2 - s_{12} - s_{23} + s}{r_1 + r_3 - s_{13}}\, s_{13} G_{13,b} \\
 +\;& (s_{23} - s)A_{1,b} + \frac{r_1 - s_{12} - s_{13} + s}{r_2 + r_3 - s_{23}}\, s_{23} G_{23,b} \\
 +\;& \frac{r_1 - s_{12} - s_{13} + s}{r_2 + r_3 - s_{23}}\left[(r_2 - s_{23})A_{3,b} + (r_3 - s_{23})A_{2,b}\right] G_{23,b} \\
 +\;& \frac{r_2 - s_{12} - s_{23} + s}{r_1 + r_3 - s_{13}}\left[(r_1 - s_{13})A_{3,b} + (r_3 - s_{13})A_{1,b}\right] G_{13,b} \\
 +\;& \frac{r_3 - s_{13} - s_{23} + s}{r_1 + r_2 - s_{12}}\left[(r_1 - s_{12})A_{2,b} + (r_2 - s_{12})A_{1,b}\right] G_{12,b},
\end{aligned}$$

$$A_{j,b} = \frac{r_j (1 - r_j)^{2^b - 1}}{1 - (1 - r_j)^{2^b}}, \qquad G_{ij,b} = \frac{(r_i + r_j - s_{ij})(1 - r_i - r_j + s_{ij})^{2^b - 1}}{1 - (1 - r_i - r_j + s_{ij})^{2^b}}, \quad i, j \in \{1, 2, 3\},\ i \neq j.$$

Ref: Li, König, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities, NIPS'10.

Useful messages:

1. Substantial improvement over using 64 bits, just as in the 2-way case.
2. One must use b ≥ 2 bits for estimating 3-way similarities.

b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products. If we look carefully, the estimator of minwise hashing is indeed an inner product of two (extremely sparse) vectors in D × k dimensions (infeasible when D = 2^64):

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\},$$

because, with z1 = min(π(S1)) and z2 = min(π(S2)) ∈ Ω = {0, 1, ..., D − 1},

$$1\{z_1 = z_2\} = \sum_{i=0}^{D-1} 1\{z_1 = i\} \times 1\{z_2 = i\}.$$

Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.

Consider D = 5. We can expand numbers into vectors of length 5:

0 =⇒ [0, 0, 0, 0, 1],   1 =⇒ [0, 0, 0, 1, 0],   2 =⇒ [0, 0, 1, 0, 0],
3 =⇒ [0, 1, 0, 0, 0],   4 =⇒ [1, 0, 0, 0, 0].

If z1 = 2, z2 = 3, then 0 = 1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].

If z1 = 2, z2 = 2, then 1 = 1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].


Linear Learning Algorithms

Linear algorithms such as linear SVM and logistic regression have become very powerful and extremely popular. Representative software packages include SVMperf, Pegasos, Bottou's SGD SVM, and LIBLINEAR.

Given a dataset $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, the $L_2$-regularized linear SVM solves the following optimization problem:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(1 - y_i w^T x_i,\ 0\right),$$

and the $L_2$-regularized logistic regression solves a similar problem:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right).$$

Here C > 0 is an important penalty (regularization) parameter.
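For concreteness (not from the talk), here is a minimal sketch of fitting these two models with scikit-learn, whose LinearSVC and liblinear-based LogisticRegression expose the same C parameter; the random features below are a stand-in for the hashed features described next:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 50))                  # toy feature matrix
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(1000) > 0.5, 1, -1)

svm = LinearSVC(C=1.0).fit(X, y)                                  # L2-regularized hinge loss
logit = LogisticRegression(C=1.0, solver="liblinear").fit(X, y)   # L2-regularized logistic loss
print(svm.score(X, y), logit.score(X, y))
```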


Integrating b-Bit Minwise Hashing for (Linear) Learning

Very simple:

1. Apply k independent random permutations to each (binary) feature vector xi and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely nbk bits.
2. At run-time, expand each new data point into a 2^b × k-length vector, i.e., we concatenate the k vectors (each of length 2^b). The new feature vector has exactly k ones.


An Example with k = 3 and b = 2

For set (vector) S1:

Original hashed values (k = 3):        12013            25964            20191
Original binary representations:       010111011101101  110010101101100  100111011011111
Lowest b = 2 binary digits:            01               00               11
Corresponding decimal values:          1                0                3
Expanded to 2^b = 4 binary digits:     0010             0001             1000

New feature vector fed to a solver: {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}

The same procedure is applied to sets S2, S3, ...
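A small sketch (mine) reproducing this expansion step; the bit ordering within each 2^b block is chosen to match the example above and is otherwise an arbitrary convention:

```python
import numpy as np

def expand_b_bits(hashed_values, b):
    """Keep the lowest b bits of each of the k hashed values and one-hot expand
    each into a 2^b-dimensional block; the result has exactly k ones."""
    k, width = len(hashed_values), 2 ** b
    out = np.zeros(k * width, dtype=np.uint8)
    for j, h in enumerate(hashed_values):
        low = h & (width - 1)                    # lowest b bits, a value in [0, 2^b)
        out[j * width + (width - 1 - low)] = 1   # one-hot, matching the slide's bit order
    return out

print(expand_b_bits([12013, 25964, 20191], b=2))
# [0 0 1 0 0 0 0 1 1 0 0 0]  -> the feature vector from the example above
```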


Datasets and Solver for Linear Learning

This talk presents the experiments on two text datasets.

Dataset            # Examples (n)   # Dims (D)       Avg # Nonzeros   Train / Test
Webspam (24 GB)    350,000          16,609,143       3,728            80% / 20%
Rcv1 (200 GB)      781,265          1,010,017,424    12,062           50% / 50%

To generate the Rcv1 dataset, we used the original features + all pairwise features + 1/30 of the 3-way features. We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic, independent of the underlying solver. All experiments were conducted on workstations with Xeon(R) CPUs and 48 GB of RAM, under Windows 7.


Experimental Results on Linear SVM and Webspam

• We conducted extensive experiments over a wide range of regularization values C (from 10^-3 to 10^2), with fine spacing in [0.1, 10].
• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.

Testing Accuracy

[Figure: test accuracy (%) vs. C on Webspam, linear SVM with k = 200 and b = 1 to 16.]

• Solid curves: b-bit hashing. Dashed (red) curve: the original data.
• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using the original data.
• The results are averaged over 50 runs.

[Figure: test accuracy (%) vs. C on Webspam, linear SVM, for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16; red dashed curves show results on the original data.]

Stability of Testing Accuracy (Standard Deviation)

Our method produces very stable results, especially for b ≥ 4.

[Figure: standard deviation of test accuracy vs. C on Webspam, linear SVM, for k = 50 to 500 and b = 1 to 16.]

Training Time

[Figure: training time (seconds) vs. C on Webspam, linear SVM, for k = 50 to 500 and b = 1 to 16.]

• The reported times do not include data loading time (which is small for b-bit hashing).
• Training on the original data takes about 100 seconds.
• b-Bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).

Testing Time

[Figure: testing time (seconds) vs. C on Webspam, linear SVM, for k = 50 to 500.]

Experimental Results on L2-Regularized Logistic Regression

Testing Accuracy

[Figure: test accuracy (%) vs. C on Webspam, logistic regression, for k = 50 to 500 and b = 1 to 16; red dashed curves show results on the original data.]

Training Time

[Figure: training time (seconds) vs. C on Webspam, logistic regression, for k = 50 to 500 and b = 1 to 16.]

Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09; not the VW online learning platform), which has the same variance as random projections.

Consider two sets S1, S2 with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a). Then, from their variances,

$$Var(\hat{a}_{VW}) \approx \frac{1}{k}\left(f_1 f_2 + a^2\right), \qquad Var\left(\hat{R}_{MINWISE}\right) = \frac{1}{k}\, R(1 - R),$$

we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).


Comparing b-Bit Minwise Hashing with VW

[Figure: test accuracy (%) vs. k on Webspam, for linear SVM and logistic regression, comparing VW with 8-bit minwise hashing at several C values.]

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 ∼ 10^6.

8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: training time (seconds) vs. k on Webspam, for linear SVM and logistic regression, comparing VW with 8-bit hashing at several C values.]

Experimental Results on Rcv1 (200 GB)

Test accuracy using linear SVM (we cannot train on the original data):

[Figure: test accuracy (%) vs. C on Rcv1, linear SVM, for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]

Training time using linear SVM:

[Figure: training time (seconds) vs. C on Rcv1, linear SVM, for k = 50 to 500 and b = 8, 12, 16.]

Test accuracy using logistic regression:

[Figure: test accuracy (%) vs. C on Rcv1, logistic regression, for k = 50 to 500 and b = 1 to 16.]

Training time using logistic regression:

[Figure: training time (seconds) vs. C on Rcv1, logistic regression, for k = 50 to 500 and b = 1 to 16.]

Comparisons with VW on Rcv1

Test accuracy using linear SVM:

[Figure: test accuracy (%) vs. k on Rcv1, comparing VW with b = 1, 2, 4, 8, 12, 16-bit hashing, at C = 0.01 to 10.]

Test accuracy using logistic regression:

[Figure: test accuracy (%) vs. k on Rcv1, comparing VW with b = 1, 2, 4, 8, 12, 16-bit hashing, at C = 0.01 to 10.]

Training time:

[Figure: training time (seconds) vs. k on Rcv1, comparing VW with 8-bit hashing, for linear SVM and logistic regression at several C values.]

b-Bit Minwise Hashing for Efficient Near Neighbor Search

Near neighbor search is a much more frequent operation than training an SVM.

The bits from b-bit minwise hashing can be directly used to build hash tables to enable sub-linear time near neighbor search.

This is an instance of the general family of locality-sensitive hashing.


An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^{2×2} = 16.

Table 1:
  Index    Data Points
  00 00    8, 13, 251
  00 01    5, 14, 19, 29
  00 10    (empty)
  ...
  11 01    7, 24, 156
  11 10    33, 174, 3153
  11 11    61, 342

Table 2:
  Index    Data Points
  00 00    2, 19, 83
  00 01    17, 36, 129
  00 10    4, 34, 52, 796
  ...
  11 01    7, 198
  11 10    56, 989
  11 11    8, 9, 156, 879

The data points are placed in the buckets according to their hashed values. To search, look for near neighbors in the bucket that matches the hash value of the query. Replicate the hash table (twice in this case) to recover missed good near neighbors. The final retrieved data points (before re-ranking) are the union of the buckets.
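An illustrative sketch (mine, not from the talk) of building such tables and querying them; it assumes the k b-bit values per data point and per table have already been computed:

```python
from collections import defaultdict

def bucket_key(bbit_values, b):
    """Concatenate k lowest-b-bit values into one integer index (table size 2^(b*k))."""
    key = 0
    for v in bbit_values:
        key = (key << b) | (v & ((1 << b) - 1))
    return key

def build_tables(hashes_per_table, b):
    """hashes_per_table[l][i] = list of k b-bit hashed values of data point i in table l."""
    tables = []
    for table_hashes in hashes_per_table:
        buckets = defaultdict(list)
        for i, vals in enumerate(table_hashes):
            buckets[bucket_key(vals, b)].append(i)
        tables.append(buckets)
    return tables

def query(tables, query_hashes_per_table, b):
    """Union of the buckets matching the query in each table (candidates before re-ranking)."""
    candidates = set()
    for table, vals in zip(tables, query_hashes_per_table):
        candidates.update(table.get(bucket_key(vals, b), []))
    return candidates
```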


Experiments on the Webspam Dataset

B = b × k: table size.  L: number of tables.

Fractions of retrieved data points (before re-ranking):

[Figure: fraction of data points evaluated vs. B (bits), for L = 25, 50, 100 tables, comparing SRP with b-bit hashing (b = 1, 2, 4).]

SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar numbers of data points.

Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary Data, ECML'12.

Precision-recall curves (the higher the better) for retrieving the top-T data points:

[Figure: precision (%) vs. recall (%) on Webspam for T = 5, 10, 50, with L = 100 tables and table sizes B = 16 and B = 24, comparing SRP with b-bit hashing (b = 1, 4).]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections) for table sizes B = 24 and B = 16. Note that there are many LSH schemes, which we did not compare exhaustively.

Simulating Permutations with Universal (2U and 4U) Hashing

[Figure: test accuracy (%) vs. C on Webspam, linear SVM with k = 10, 100, 200, 500 and b = 1 to 8, comparing full permutations (Perm) with 2U and 4U universal hashing.]

Both 2U and 4U perform well except when b = 1 or k is small.

Ref: Li, Shrivastava, König, b-Bit Minwise Hashing in Practice, arXiv:1205.2958.

Speed Up Preprocessing Using GPUs

CPU: data loading and preprocessing (for k = 500 permutations) times, in seconds. Note that we measured the loading times of LIBLINEAR, which uses a plain-text data format; using binary data would speed up data loading.

Dataset     Loading      Permu        2U           4U (Mod)     4U (Bit)
Webspam     9.7 × 10^2   6.1 × 10^3   4.1 × 10^3   4.4 × 10^4   1.4 × 10^4
Rcv1        1.0 × 10^4   --           3.0 × 10^4   --           --

GPU: data loading and preprocessing (for k = 500 permutations) times, in seconds.

Dataset     Loading      GPU 2U       GPU 4U (Mod)   GPU 4U (Bit)
Webspam     9.7 × 10^2   51           5.2 × 10^2     1.2 × 10^2
Rcv1        1.0 × 10^4   1.4 × 10^3   1.5 × 10^4     3.2 × 10^3

Note that the modular operations in 4U can be replaced by bit operations.

Breakdowns of the overhead: a batch-based GPU implementation.

[Figure: time (seconds) vs. batch size for the GPU kernel, CPU-to-GPU, and GPU-to-CPU transfers, for 2U and 4U-Bit hashing on Webspam and Rcv1.]

GPU kernel times dominate the cost. The cost is not sensitive to the batch size.

Ref: Li, Shrivastava, König, b-Bit Minwise Hashing in Practice, arXiv:1205.2958.
Ref: Li, Shrivastava, König, GPU-Based Minwise Hashing, WWW'12 Poster.

Online Learning (SGD) with Hashing

b-Bit minwise hashing can also be highly beneficial for online learning: it significantly reduces the data loading time, which is usually the bottleneck for online algorithms, especially if many epochs are required.

[Figure: test accuracy (%) vs. λ on Webspam, SGD SVM with k = 50 and k = 200, for b = 1, 2, 4, 6.]

λ = 1/C, following the convention in the online learning literature. Red dashed curve: results based on the original data.

Ref: Li, Shrivastava, König, b-Bit Minwise Hashing in Practice, arXiv:1205.2958.

Conclusion

• BigData is everywhere. For many operations, including clustering, classification, and near neighbor search, exact answers are often not necessary.
• Very sparse random projections (KDD 2006) work as well as dense random projections; however, they require a large number of projections.
• Minwise hashing is a standard procedure in the context of search. b-Bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space).
• b-Bit minwise hashing provides a very simple strategy for efficient linear learning, by expanding the bits into vectors of length 2^b × k.
• We can directly build hash tables from b-bit minwise hashing for sub-linear time near neighbor search.
• It can be substantially more accurate than random projections (and variants).


• Major drawbacks of b-bit minwise hashing:
  – It was designed (mostly) for binary, static data.
  – It needs a fairly high-dimensional representation (2^b × k) after expansion.
  – Preprocessing is expensive, and consequently the testing phase is expensive for unprocessed data. Parallelization solutions based on (e.g.) GPUs offer very good performance in terms of time, but they are not energy-efficient.
