Optimal hash functions for approximate closest pairs on ... - Google Sites

1 downloads 249 Views 268KB Size Report
Jan 30, 2009 - Use a different hash function, such as mapping n bits to codewords of an. [n, k] error-correcting code. G
Optimal hash functions for approximate closest pairs on the n-cube Daniel Gordon and Victor Miller and Peter Ostapenko IDA/CCR

January 30, 2009

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

1 / 33

Outline

1

Introduction

2

Optimal Regions and Hash Functions

3

Hashing with Projection

4

Hashing with Codes

5

Computing Optimal Regions

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

2 / 33

Introduction

Closest Pair Problem Given a set of n-bit vectors v1 , v2 , . . . , vM , find a pair with minimal distance.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

3 / 33

Introduction

Closest Pair Problem Given a set of n-bit vectors v1 , v2 , . . . , vM , find a pair with minimal distance.

Applications DNA sequence comparison Information retrieval GET MORE EXAMPLES

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

3 / 33

Finding Close Vectors

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

4 / 33

Finding Close Vectors

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

5 / 33

Strategy 0: Check Every Pair

For lists of size M , work is O(M 2 ). Simple, but this becomes too expensive for large M .

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

6 / 33

Strategy 0: Check Every Pair

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

7 / 33

Strategy 0: Check Every Pair

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

8 / 33

Strategy 0: Check Every Pair

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

9 / 33

Strategy 0: Check Every Pair

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

10 / 33

Strategy 1: Projection

Hash on k bits, check for collisions. If there’s an error in those bits, this will fail.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

11 / 33

Strategy 1: Projection

Hash on k bits, check for collisions. If there’s an error in those bits, this will fail.

Work per Success: M · CHash + M 2 /2k+1 · CTest (1 − pk )

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

11 / 33

Strategy 1: Projection

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

12 / 33

Strategy 1: Projection

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

13 / 33

Strategy 1: Projection

01101010100110010110110001100101001101100100100110 01011100010110001001100001100011101001010101100100 10111001110111100100000010001000000010011100111001 01100101100001000010101111011000001001011000111000 11101111000010011101000000000111010111000100110111 10100100010101011000110011010100101110000011010000 00001001000111111101011001110110010000111001111011 00110001011110011101001110100001111001100110011110 11010010110111010111011110000001011110001111010011 11011100000110001001100001100010101001010101110100 10001101000100110000000101101010110100110001001000 01111011111110111010100010010001010100001000011000 11000000001010010010111100100000100010100011000001

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

14 / 33

Strategy 2: Other Hash Functions

Alternate Idea Use a different hash function, such as mapping n bits to codewords of an [n, k] error-correcting code.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

15 / 33

Strategy 2: Other Hash Functions

Alternate Idea Use a different hash function, such as mapping n bits to codewords of an [n, k] error-correcting code.

This uses more bits, but error may not be fatal.

This idea has occurred independently many times, and been patented twice.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

15 / 33

Cost of Hashing

Work per Success: (M · CHash + M 2 /2k+1 · CTest )/Ph where Ph = Ph (p) = Prob(h(vi ) = h(vi + e)

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

16 / 33

Cost of Hashing

Work per Success: (M · CHash + M 2 /2k+1 · CTest )/Ph where Ph = Ph (p) = Prob(h(vi ) = h(vi + e)

The Big Question What hash function minimizes work/success?

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

16 / 33

Example: n = 3, k = 1

Project on one bit Region Q2 maps to a point.

001 000

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

17 / 33

Example: n = 3, k = 1

Project on one bit Region Q2 maps to a point.

001 000

Ph = (1 − p)

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

17 / 33

Example: n = 3, k = 1

Project on one bit

Code C = {000, 111}

Region Q2 maps to a point.

Region B3 (1) maps to a point.

111 001 000

000

Ph = (1 − p)

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

17 / 33

Example: n = 3, k = 1

Project on one bit

Code C = {000, 111}

Region Q2 maps to a point.

Region B3 (1) maps to a point.

111 001 000

Ph = (1 − p)

GMO (IDA/CCR)

000

Ph = (1 − p)(1 − p( 12 − p))

Optimal hash functions

January 30, 2009

17 / 33

Structure of Hamming space around codewords

c2 c1

c3

c0 x

x+e

c5 c4

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

18 / 33

Standard Coding Theory vs. Hashing with Codes I

Coding Theory Correct codewords with errors.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

19 / 33

Standard Coding Theory vs. Hashing with Codes I

Coding Theory Correct codewords with errors.

Hashing with codes Correct anything with errors.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

19 / 33

Optimal Regions

Let S be the points in V that hash to 0.

h(x) = h(x + e) with probability PS (p) =

1 X d(x,y) p (1 − p)n−d(x,y) . |S| x,y∈S

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

20 / 33

Optimal Regions

Let S be the points in V that hash to 0.

h(x) = h(x + e) with probability PS (p) =

1 X d(x,y) p (1 − p)n−d(x,y) . |S| x,y∈S

Definition S is an optimal region if it maximizes this probability for any region of size |S|.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

20 / 33

Standard Coding Theory vs. Hashing with Codes II Definition If S is a code, the probability of undetected error is P(S, p) =

1 X d(x,y) p (1 − p)n−d(x,y) . |S| x,y∈S

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

21 / 33

Standard Coding Theory vs. Hashing with Codes II Definition If S is a code, the probability of undetected error is P(S, p) =

1 X d(x,y) p (1 − p)n−d(x,y) . |S| x,y∈S

Coding Theory S is a code. Minimize this probability.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

21 / 33

Standard Coding Theory vs. Hashing with Codes II Definition If S is a code, the probability of undetected error is P(S, p) =

1 X d(x,y) p (1 − p)n−d(x,y) . |S| x,y∈S

Coding Theory S is a code. Minimize this probability.

Hashing with Codes S is the sphere around a codeword. Maximize this probability.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

21 / 33

Coding Theory Aside Let Ai = #{(x, y) : x, y ∈ S and d(x, y) = i}

Distance Distribution Function A(S, ζ) :=

n X

Ai ζ i

i=0

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

22 / 33

Coding Theory Aside Let Ai = #{(x, y) : x, y ∈ S and d(x, y) = i}

Distance Distribution Function A(S, ζ) :=

n X

Ai ζ i

i=0

PS (p) := =

GMO (IDA/CCR)

1 X d(x,y) p (1 − p)n−d(x,y) |S| 1 |S|

x,y∈S n X

Ai pi (1 − p)n−i =

i=0

  (1 − p)n p A S, . |S| 1−p

Optimal hash functions

January 30, 2009

22 / 33

Projection Pn,k

Project x onto k coordinates S is an n − k subcube. DD function is A(S, ζ) = (2(1 + ζ))n−k Probability of collision is Pn,k

P

(1 − p)n (p) = 2n−k



2 1−p

n−k

= (1 − p)k .

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

23 / 33

Projection Pn,k (cont’d)

For small error rates, projection is optimal:

Theorem Let S be the 2n−k -subcube of V. For any error rate p ∈ (0, 2−2(n−k) ), S is an optimal region, and so k-projection is an optimal hash.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

24 / 33

Hashing with Codes

Perfect Codes A code is perfect if every vertex is distance ≤ e from exactly one codeword.

Perfect Binary Codes [n, n, 1] Repetition Codes [2m − 1, 2m − m − 1, 3] Hamming Codes Hm [23, 12, 7] binary Golay Code G

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

25 / 33

Binary Golay Code

S = 3−sphere The 3-sphere’s DD function is 2048 + 11684ζ + 128524ζ 2 + 226688ζ 3 + 1133440ζ 4 + 672980ζ 5 + 2018940ζ 6 .

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

26 / 33

Binary Golay Code

S = 3−sphere The 3-sphere’s DD function is 2048 + 11684ζ + 128524ζ 2 + 226688ζ 3 + 1133440ζ 4 + 672980ζ 5 + 2018940ζ 6 .

Corollary This beats projection P23,12 for p > 0.2555.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

26 / 33

Hamming Codes

S = 1 − sphere The 1-sphere’s DD function is 2m + 2(2m − 1)ζ + (2m − 1)(2m − 2)ζ 2 ,

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

27 / 33

Hamming Codes

S = 1 − sphere The 1-sphere’s DD function is 2m + 2(2m − 1)ζ + (2m − 1)(2m − 2)ζ 2 ,

Corollary This beats projection for m ≥ 4 and p > αm ≈ (m − 2)/2m

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

27 / 33

Other Linear Codes

p

0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

d=3 d=5 d=7 H4 G H5

0

5

10

15

20

25

30

k

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

28 / 33

Optimal Regions

Alternate Formulation What region of size 2t in F2n has the best P (p)?

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

29 / 33

Optimal Regions

Alternate Formulation What region of size 2t in F2n has the best P (p)?

Previous Results 2n−1 -subcube is optimal for all n, p. 2t -subcube is optimal for t ≤ 3 for all n, p. A subcube is optimal for any t, n if p is small enough.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

29 / 33

Structure of Optimal Regions Definition For x = (x1 , . . . , xn ) ∈ V, let ρi (x) := (x1 , x2 , . . . , xi−1 , 0, xi+1 , . . . xn ) and σij (x) := (x1 , . . . , min(xi , xj ), . . . , max(xi , xj ), . . . xn ).

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

30 / 33

Structure of Optimal Regions Definition For x = (x1 , . . . , xn ) ∈ V, let ρi (x) := (x1 , x2 , . . . , xi−1 , 0, xi+1 , . . . xn ) and σij (x) := (x1 , . . . , min(xi , xj ), . . . , max(xi , xj ), . . . xn ).

Definition A set S ⊂ V is a down-set if ρi (S) ⊂ S for all i ≤ n.

Definition A set S ⊂ V is right-shifted if σij (S) ⊂ S for all i, j ≤ n. GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

30 / 33

Structure of Optimal Regions (cont’d)

111

101

011

110

001

100

010

000

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

31 / 33

Optimal Regions (cont’d)

Theorem If a set S is optimal, then it is isomorphic to a right-shifted down-set.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

32 / 33

Optimal Regions (cont’d)

Theorem If a set S is optimal, then it is isomorphic to a right-shifted down-set.

Computing Right-shifted Downsets We may find all right-shifted downsets, and look for optimal regions. For size 64, there are 4384627. We have compiled tables of optimal regions of up to size 64. Unfortunately, they don’t tile the cube.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

32 / 33

Random Codes

We would expect that for large n, a random code would do well.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

33 / 33

Random Codes

We would expect that for large n, a random code would do well.

Theorem For a fixed error rate p ∈ (0, 1/2), rate R = k/n, and n sufficiently large, a random code of rate R will beat projection.

GMO (IDA/CCR)

Optimal hash functions

January 30, 2009

33 / 33