Optimal hash functions for approximate closest pairs on the n-cube
Daniel Gordon, Victor Miller, and Peter Ostapenko
IDA/CCR
January 30, 2009
GMO (IDA/CCR)
Optimal hash functions
January 30, 2009
1 / 33
Outline
1 Introduction
2 Optimal Regions and Hash Functions
3 Hashing with Projection
4 Hashing with Codes
5 Computing Optimal Regions
Introduction
Closest Pair Problem: Given a set of n-bit vectors $v_1, v_2, \ldots, v_M$, find a pair with minimal Hamming distance.
Applications: DNA sequence comparison, information retrieval.
Finding Close Vectors
01101010100110010110110001100101001101100100100110
01011100010110001001100001100011101001010101100100
10111001110111100100000010001000000010011100111001
01100101100001000010101111011000001001011000111000
11101111000010011101000000000111010111000100110111
10100100010101011000110011010100101110000011010000
00001001000111111101011001110110010000111001111011
00110001011110011101001110100001111001100110011110
11010010110111010111011110000001011110001111010011
11011100000110001001100001100010101001010101110100
10001101000100110000000101101010110100110001001000
01111011111110111010100010010001010100001000011000
11000000001010010010111100100000100010100011000001
Strategy 0: Check Every Pair
For lists of size M, work is $O(M^2)$. Simple, but this becomes too expensive for large M.
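As a minimal sketch of Strategy 0, the following checks every pair and keeps the closest one. The four 7-bit vectors are made-up illustration data, not the vectors shown on the slides, and the function names are hypothetical.

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as integers."""
    return bin(a ^ b).count("1")

def closest_pair(vectors):
    """Check all M*(M-1)/2 pairs; return (distance, i, j) for a closest pair."""
    return min(
        (hamming(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    )

# Illustration data only (assumed, not from the slides).
vectors = [int(s, 2) for s in ["0110101", "1001110", "0110001", "1111000"]]
print(closest_pair(vectors))  # (1, 0, 2): vectors 0 and 2 differ in one bit
```

The generator visits each unordered pair once, which is where the quadratic cost comes from.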
Strategy 1: Projection
Hash on k bits, check for collisions. If there’s an error in those bits, this will fail.
Work per Success: $\dfrac{M \cdot C_{Hash} + \frac{M^2}{2^{k+1}} \cdot C_{Test}}{(1 - p)^k}$
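A small sketch of projection hashing, assuming vectors are given as bit strings and that the k hashed positions are chosen by the caller. The data and the helper names `project` and `candidate_pairs` are illustrative assumptions. Only vectors that land in the same bucket are compared, so a close pair is missed whenever one of its differing bits is among the hashed positions.

```python
from collections import defaultdict
from itertools import combinations

def project(v: str, positions) -> str:
    """Hash a bit string by keeping only the chosen coordinates."""
    return "".join(v[i] for i in positions)

def candidate_pairs(vectors, positions):
    """Bucket vectors by their projection and return the colliding index pairs."""
    buckets = defaultdict(list)
    for idx, v in enumerate(vectors):
        buckets[project(v, positions)].append(idx)
    pairs = []
    for members in buckets.values():
        pairs.extend(combinations(members, 2))
    return pairs

# Illustration data only (assumed): vectors 0 and 2 differ in coordinate 4.
vectors = ["0110101", "1001110", "0110001", "1111000"]
print(candidate_pairs(vectors, positions=[0, 1, 2]))  # [(0, 2)]: the close pair collides
print(candidate_pairs(vectors, positions=[3, 4, 5]))  # []: the differing bit is hashed
```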
Strategy 2: Other Hash Functions
Alternate Idea Use a different hash function, such as mapping n bits to codewords of an [n, k] error-correcting code.
This uses more bits, but error may not be fatal.
This idea has occurred independently many times, and been patented twice.
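To make the idea concrete, here is a sketch that hashes a 7-bit vector to the nearest codeword of the [7,4] Hamming code using syndrome decoding. The choice of this particular code, the convention that the parity-check columns are the binary numbers 1 through 7, and the function name are assumptions for illustration; the slides do not prescribe an implementation.

```python
from itertools import product

def hamming74_hash(bits):
    """Map a 7-bit vector to the nearest codeword of the [7,4] Hamming code.

    Assumed convention: position i (1-based) has parity-check column i, so the
    syndrome is the XOR of the positions holding a 1.
    """
    syndrome = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            syndrome ^= pos
    out = list(bits)
    if syndrome:
        out[syndrome - 1] ^= 1  # flip the single bit that zeroes the syndrome
    return tuple(out)

codeword = (1, 1, 1, 0, 0, 0, 0)          # 1 ^ 2 ^ 3 == 0, so this is a codeword
one_error = (1, 1, 1, 0, 1, 0, 0)         # same vector with position 5 flipped
print(hamming74_hash(one_error) == codeword)  # True: the error was not fatal

buckets = {hamming74_hash(v) for v in product((0, 1), repeat=7)}
print(len(buckets))  # 16 buckets, i.e. k = 4 hash bits computed from all 7 input bits
```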
Cost of Hashing
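Work per Success: $\left(M \cdot C_{Hash} + \frac{M^2}{2^{k+1}} \cdot C_{Test}\right) / P_h$, where $P_h = P_h(p) = \mathrm{Prob}(h(v_i) = h(v_i + e))$ and each bit of the error vector e is set independently with probability p.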
The Big Question What hash function minimizes work/success?
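The cost formula can be evaluated directly. The sketch below assumes unit costs for `c_hash` and `c_test` and uses the projection collision probability $(1-p)^k$; the parameter values and function names are illustrative assumptions.

```python
def work_per_success(M, k, p_h, c_hash=1.0, c_test=1.0):
    """(M * C_Hash + M^2 / 2^(k+1) * C_Test) / P_h."""
    return (M * c_hash + M**2 / 2 ** (k + 1) * c_test) / p_h

def projection_work(M, k, p, c_hash=1.0, c_test=1.0):
    """Work per success when h is projection onto k bits, so P_h = (1 - p)^k."""
    return work_per_success(M, k, (1 - p) ** k, c_hash, c_test)

# Assumed example parameters: scan k to see the trade-off between
# fewer false collisions (large k) and fewer missed pairs (small k).
M, p = 2**20, 0.1
best_k = min(range(1, 41), key=lambda k: projection_work(M, k, p))
print(best_k, projection_work(M, best_k, p))
```

Any other hash function can be compared by passing its own $P_h$ to `work_per_success`.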
Example: n = 3, k = 1
Project on one bit: Region $Q_2$ maps to a point. $P_h = 1 - p$.
Code $C = \{000, 111\}$: Region $B_3(1)$ maps to a point. $P_h = (1 - p)\left(1 - p\left(\tfrac{1}{2} - p\right)\right)$.
[Figure: two copies of the 3-cube, one with vertices 000 and 001 labeled (projection), one with vertices 000 and 111 labeled (code).]
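Both closed forms in this example can be checked by direct enumeration: pick x uniformly in the region S and sum the probability of every error that keeps x + e inside S. The sketch below takes $Q_2$ to be the subcube with first coordinate 0 and $B_3(1)$ to be the radius-1 ball around 000; those concrete choices and the function names are assumptions.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def collision_probability(S, p):
    """(1/|S|) * sum over ordered pairs (x, y) in S of p^d (1-p)^(n-d)."""
    n = len(S[0])
    total = sum(
        p ** hamming(x, y) * (1 - p) ** (n - hamming(x, y)) for x in S for y in S
    )
    return total / len(S)

cube = list(product((0, 1), repeat=3))
subcube = [x for x in cube if x[0] == 0]   # Q_2: one coordinate fixed
ball = [x for x in cube if sum(x) <= 1]    # B_3(1): radius-1 ball around 000

p = 0.1
print(collision_probability(subcube, p), 1 - p)
print(collision_probability(ball, p), (1 - p) * (1 - p * (0.5 - p)))
```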
Structure of Hamming space around codewords
[Figure: a point x and its perturbation x + e in Hamming space, surrounded by codewords c0 through c5.]
Standard Coding Theory vs. Hashing with Codes I
Coding Theory: Correct codewords with errors.
Hashing with codes: Correct anything with errors.
Optimal Regions
Let S be the points in V that hash to 0.
$h(x) = h(x + e)$ with probability
$$P_S(p) = \frac{1}{|S|} \sum_{x,y \in S} p^{d(x,y)} (1 - p)^{n - d(x,y)}.$$
Definition: S is an optimal region if it maximizes this probability over all regions of size |S|.
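For very small cubes the definition can be applied by brute force. The sketch below scores every 4-point subset of the 3-cube at an assumed error rate p = 0.1 and reports the maximizers; the function names are hypothetical and the search is only feasible for tiny n.

```python
from itertools import combinations, product

def collision_probability(S, p):
    """P_S(p) summed over ordered pairs (x, y) in S."""
    n = len(S[0])
    total = 0.0
    for x in S:
        for y in S:
            d = sum(a != b for a, b in zip(x, y))
            total += p**d * (1 - p) ** (n - d)
    return total / len(S)

def best_regions(n, size, p, tol=1e-12):
    """Exhaustively score every subset of the n-cube with the given size."""
    cube = list(product((0, 1), repeat=n))
    scored = [(collision_probability(S, p), S) for S in combinations(cube, size)]
    best = max(score for score, _ in scored)
    return best, [S for score, S in scored if abs(score - best) < tol]

best, regions = best_regions(n=3, size=4, p=0.1)
print(best)          # about 0.9, i.e. 1 - p
print(len(regions))  # 6: the maximizers are the six faces (2-subcubes) of the cube
```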
Standard Coding Theory vs. Hashing with Codes II
Definition: If S is a code, the probability of undetected error is
$$P(S, p) = \frac{1}{|S|} \sum_{x,y \in S} p^{d(x,y)} (1 - p)^{n - d(x,y)}.$$
Coding Theory: S is a code. Minimize this probability.
Hashing with Codes: S is the sphere around a codeword. Maximize this probability.
Coding Theory Aside
Let $A_i = \#\{(x, y) : x, y \in S \text{ and } d(x, y) = i\}$.
Distance Distribution Function
$$A(S, \zeta) := \sum_{i=0}^{n} A_i \zeta^i$$
$$P_S(p) := \frac{1}{|S|} \sum_{x,y \in S} p^{d(x,y)} (1 - p)^{n - d(x,y)} = \frac{1}{|S|} \sum_{i=0}^{n} A_i p^i (1 - p)^{n-i} = \frac{(1 - p)^n}{|S|} A\!\left(S, \frac{p}{1 - p}\right).$$
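The identity above means that $P_S(p)$ depends on S only through the counts $A_i$. A short sketch, assuming S is given as a list of bit tuples and using hypothetical function names, computes the distance distribution and evaluates $P_S$ from it:

```python
from collections import Counter
from itertools import product

def distance_distribution(S, n):
    """Return [A_0, ..., A_n], counting ordered pairs (x, y) in S by distance."""
    counts = Counter(sum(a != b for a, b in zip(x, y)) for x in S for y in S)
    return [counts.get(i, 0) for i in range(n + 1)]

def p_from_dd(A, size, p):
    """Evaluate (1-p)^n / |S| * A(S, p / (1-p)) from the coefficient list A."""
    n = len(A) - 1
    zeta = p / (1 - p)
    return (1 - p) ** n / size * sum(a * zeta**i for i, a in enumerate(A))

n = 3
ball = [x for x in product((0, 1), repeat=n) if sum(x) <= 1]  # radius-1 ball around 000
A = distance_distribution(ball, n)
print(A)                         # [4, 6, 6, 0]
print(p_from_dd(A, len(ball), 0.1))
```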
Projection $P_{n,k}$
Project x onto k coordinates.
S is an (n − k)-subcube.
DD function is $A(S, \zeta) = (2(1 + \zeta))^{n-k}$.
Probability of collision is
$$P_{P_{n,k}}(p) = \frac{(1 - p)^n}{2^{n-k}} \left(\frac{2}{1 - p}\right)^{n-k} = (1 - p)^k.$$
Projection Pn,k (cont’d)
For small error rates, projection is optimal:
Theorem: Let S be the $2^{n-k}$-subcube of V. For any error rate $p \in (0, 2^{-2(n-k)})$, S is an optimal region, and so k-projection is an optimal hash.
Hashing with Codes
Perfect Codes: A code is perfect if, for some e, every vertex is within distance e of exactly one codeword.
Perfect Binary Codes:
$[n, 1, n]$ repetition codes (n odd)
$[2^m - 1, 2^m - m - 1, 3]$ Hamming codes $H_m$
$[23, 12, 7]$ binary Golay code $G$
Binary Golay Code
S = 3-sphere. The 3-sphere's DD function is $2048 + 11684\zeta + 128524\zeta^2 + 226688\zeta^3 + 1133440\zeta^4 + 672980\zeta^5 + 2018940\zeta^6$.
Corollary: This beats projection $P_{23,12}$ for $p > 0.2555$.
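The threshold can be reproduced numerically from the DD coefficients on the slide. The sketch below evaluates the Golay 3-sphere collision probability via $(1-p)^{23}/2048 \cdot A(S, p/(1-p))$, compares it with $(1-p)^{12}$ for projection, and bisects for the crossover. The bracketing interval [0.1, 0.4] and the function names are assumptions.

```python
GOLAY_DD = [2048, 11684, 128524, 226688, 1133440, 672980, 2018940]  # from the slide
N, K, SIZE = 23, 12, 2048

def golay_collision(p):
    """Collision probability for the radius-3 ball around a Golay codeword."""
    zeta = p / (1 - p)
    return (1 - p) ** N / SIZE * sum(a * zeta**i for i, a in enumerate(GOLAY_DD))

def projection_collision(p):
    """Collision probability for projection onto K = 12 bits."""
    return (1 - p) ** K

def crossover(lo=0.1, hi=0.4, iters=60):
    """Bisect for the p where the Golay region starts to beat projection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if golay_collision(mid) > projection_collision(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

print(sum(GOLAY_DD) == SIZE**2)   # True: the coefficients count all ordered pairs
print(round(crossover(), 4))
```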
Hamming Codes
S = 1-sphere. The 1-sphere's DD function is $2^m + 2(2^m - 1)\zeta + (2^m - 1)(2^m - 2)\zeta^2$.
Corollary: This beats projection for $m \ge 4$ and $p > \alpha_m \approx (m - 2)/2^m$.
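The DD function of the 1-sphere can be checked by enumeration for small m. This sketch assumes the sphere is the radius-1 ball around the zero word in $F_2^n$ with $n = 2^m - 1$, stored as integer bitmasks; the function names are hypothetical.

```python
from collections import Counter

def one_sphere_dd(m):
    """Distance distribution [A_0, A_1, A_2] of the radius-1 ball, n = 2^m - 1."""
    n = 2**m - 1
    ball = [0] + [1 << i for i in range(n)]  # centre plus its n neighbours
    counts = Counter(bin(x ^ y).count("1") for x in ball for y in ball)
    return [counts[0], counts[1], counts[2]]

def dd_formula(m):
    """Coefficients 2^m, 2(2^m - 1), (2^m - 1)(2^m - 2) from the slide."""
    q = 2**m
    return [q, 2 * (q - 1), (q - 1) * (q - 2)]

for m in range(2, 6):
    print(m, one_sphere_dd(m), dd_formula(m))
```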
Other Linear Codes
[Figure: error rate p (vertical axis, 0 to 0.5) plotted against k (horizontal axis, 0 to 30) for linear codes with d = 3, d = 5, and d = 7, with H4, G, and H5 marked.]
Optimal Regions
Alternate Formulation: What region of size $2^t$ in $\mathbb{F}_2^n$ has the best $P(p)$?
Previous Results:
The $2^{n-1}$-subcube is optimal for all n, p.
The $2^t$-subcube is optimal for t ≤ 3 for all n, p.
A subcube is optimal for any t, n if p is small enough.
Structure of Optimal Regions
Definition: For $x = (x_1, \ldots, x_n) \in V$, let $\rho_i(x) := (x_1, x_2, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_n)$ and $\sigma_{ij}(x) := (x_1, \ldots, \min(x_i, x_j), \ldots, \max(x_i, x_j), \ldots, x_n)$.
Definition: A set $S \subset V$ is a down-set if $\rho_i(S) \subset S$ for all $i \le n$.
Definition: A set $S \subset V$ is right-shifted if $\sigma_{ij}(S) \subset S$ for all $i, j \le n$.
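The two definitions translate directly into membership tests. The sketch below uses 0-based coordinates and assumes that $\sigma_{ij}$ is applied only for i < j, so that the minimum lands in the earlier position and ones move to the right; that reading, and the function names, are assumptions.

```python
from itertools import combinations

def rho(x, i):
    """Set coordinate i (0-based) of x to 0."""
    return x[:i] + (0,) + x[i + 1:]

def sigma(x, i, j):
    """For i < j: put min(x_i, x_j) at position i and max(x_i, x_j) at position j."""
    y = list(x)
    y[i], y[j] = min(x[i], x[j]), max(x[i], x[j])
    return tuple(y)

def is_down_set(S):
    S = set(S)
    n = len(next(iter(S)))
    return all(rho(x, i) in S for x in S for i in range(n))

def is_right_shifted(S):
    S = set(S)
    n = len(next(iter(S)))
    return all(sigma(x, i, j) in S for x in S for i, j in combinations(range(n), 2))

print(is_down_set({(0, 0, 0), (0, 0, 1)}), is_right_shifted({(0, 0, 0), (0, 0, 1)}))  # True True
print(is_down_set({(0, 0, 0), (1, 0, 0)}), is_right_shifted({(0, 0, 0), (1, 0, 0)}))  # True False
```

In the second example the set fails the right-shifted test because applying sigma to 100 on coordinates 0 and 2 yields 001, which is not in the set.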
Structure of Optimal Regions (cont’d)
[Figure: the eight vertices of the 3-cube, arranged from 111 at the top to 000 at the bottom.]
Optimal Regions (cont’d)
Theorem: If a set S is optimal, then it is isomorphic to a right-shifted down-set.
Computing Right-shifted Down-sets
We may find all right-shifted down-sets and look for optimal regions.
For size 64, there are 4384627.
We have compiled tables of optimal regions of up to size 64.
Unfortunately, they don't tile the cube.
Random Codes
We would expect that for large n, a random code would do well.
Theorem For a fixed error rate p ∈ (0, 1/2), rate R = k/n, and n sufficiently large, a random code of rate R will beat projection.