
External Double Hashing with Choice

Walter A. Burkhard
Gemini Storage Systems Laboratory
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-0404 USA
[email protected]

Abstract

A novel extension to external double hashing providing significant reduction to both successful and unsuccessful search lengths is presented. The experimental and analytical results demonstrate the reductions possible. This method does not restrict the hashing table configuration parameters and utilizes very little additional storage space per bucket. The runtime performance for insertion is slightly greater than for ordinary external double hashing.

1. Introduction

Hashing is a well-known data indexing technique for organizing data stored externally. Numerous schemes exist for handling collisions, such as open addressing, in which each record determines the probe sequence used to store or retrieve it. To store a record, it is placed in the first unfilled bucket of its probe sequence; to search for a record, the buckets designated by its probe sequence are examined in order until either the record is found or an unfilled bucket not containing it is encountered, indicating the record is not present. Uniform hashing, introduced by Peterson [14], is an idealized model that maps records to random permutation probe sequences. Double hashing is an efficient scheme to generate the probe sequence for a record. Recently Lueker and Molodowitch [11] as well as Guibas and Szemeredi [8] have shown double hashing to be asymptotically equivalent to the ideal uniform hashing. In case the bucket capacity is one, the performance of uniform hashing has been analyzed by Peterson [14], Morris [13] and Knuth [9]. Ullman [17] raised an optimality

A preliminary version appears within the Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms & Networks, ISPAN’05

question and presented a model for discussing it. More recently, Yao [19] has shown that uniform hashing is optimal among all open-addressing schemes with respect to the expected successful search length; he poses the question: is uniform hashing also optimal among all open-addressing schemes with respect to the expected unsuccessful search length? Larson [10] analyzes uniform hashing for bucket capacity exceeding one. Blake and Konheim [2] provide an analysis of buckets with capacity exceeding one using linear probing collision resolution. Our approach to significantly improve the search lengths is to utilize more than one hash function. Use of more than one hash function has been previously presented and analyzed for load balancing by Azar et al. [1] and Vöcking [18]. Multiple hash functions were used to improve IP lookups by Broder and Mitzenmacher [3]. Separate chaining collision resolution is utilized within these efforts. We consider open-addressing double hashing in which, during an insertion, the search length for each hash function is determined and the one with the shortest is used to position the record. The probe sequence to follow, for future accesses, is recorded within a predictor bit array. During a fetch, each hash function is evaluated to determine the possible probe sequences, but only those probe sequences marked within the predictor bit array are actually followed. With two hash functions significant improvements are possible. The paper is organized as follows: external double hashing with choice is introduced in section 2, the analysis of the run-time is presented in section 3, and the experimental and analytical results are presented graphically in section 4.

2. External Double Hashing with Choice External double hashing with choice is an extension of double hashing in which d hash functions from an appropriate family of universal hash functions [6] are used rather

than one. In case d is one, the search length performance of the scheme reverts to that of ordinary double hashing. The table, configured with two records per bucket, is augmented with an array of s predictor bits. The hash domain is partitioned into s equal-sized blocks referred to as varieties, and any variety i access will be associated with the ith predictor bit. For any access, each of the d hash functions determines its associated predictor bit. We present two table data type methods, insert and fetch, for double hashing with choice in figures 1 and 3. The algorithms are very similar to those of ordinary double hashing with the exception that the insert method must try all d probe sequences to select a shortest access path and the fetch method must interleave the appropriate subset of the d probe sequences. In both figures, the hash array contains pointers to each of the d hash functions; this array is initialized during table construction. The for statement, in figure 1, determines which of the d probe sequences will locate an unfilled bucket with the fewest probes; the indices of these shortest probe sequences are stored within the array best. The second phase of the insert involves selecting one of the shortest probe sequences j to use to store the record. There are several strategies possible here; we view the best entries to be equally likely to be chosen. Finally the data is copied into bucket[ location[j] ] and the variety[j] predictor is set.

    insert ( Data & data )
        shortest = n + 1 ;
        for ( j = 0 ; j < d ; j++ ) {
            value = hash[j]( data ) ;
            start[j] = value % n ;
            stride[j] = 1 + value % ( n - 1 ) ;
            variety[j] = ( value / ( n * ( n - 1 ) ) ) % s ;
            count = 0 ; location[j] = ( start[j] - stride[j] ) % n ;
            do {
                location[j] = ( location[j] + stride[j] ) % n ;
                if duplicate found in bucket[ location[j] ]
                    return false ;
                count++ ;
            } while bucket[ location[j] ] is full ;
            if count < shortest
                shortest = count ; best[0] = j ; number = 1 ;
            elseif count == shortest
                best[ number++ ] = j ;
        } // for ( ...
        select one shortest path j from best array ;
        copy data into bucket[ location[j] ] ;
        predictor[ variety[j] ] = true ;
        return true ;

    Figure 1. Insert method

Figure 2 contains an example table configuration created by inserting eight records within a table consisting of five buckets and forty-eight predictor bits. Each record, cat, dog, sow, ..., doe, has a pair of probe sequences together with access varieties. The cat record could be inserted either in bucket 1 or 2 as its two probe sequences begin with these two values; both give rise to shortest possible insertion sequences. The first probe sequence, with variety 18, is selected (arbitrarily); the record is inserted in bucket 1, and predictor 18 is set. The predictor array is indexed by variety; a set predictor bit is designated by a gray square. The dog record is similar; both probe sequences give rise to shortest possible insertion sequences. The first probe sequence with variety 3 is selected. The dog record is inserted in bucket 2 and predictor 3 is set. The sow record is similar with both probe sequences reaching an empty slot with one probe; the second sequence is selected. The record is inserted into bucket 4 and predictor 21 is set. When doe is inserted, the candidate insertion sequences are not the same length and the second probe sequence is shorter. The record doe is inserted in bucket 0 and predictor 3 is set (again).

    [Figure 2: an example table of five two-slot buckets holding the records doe, cat, bee, dog, pig, yak, sow, and ape, together with its 48-entry predictor bit array and, for each record, its two probe sequences and their varieties.]

    Figure 2. External Double Hashing with Choice

The records cat, dog, sow, and doe are individually accessed to demonstrate the fetch method. We utilize the table within figure 2 and the fetch method of figure 3. The for statement determines the d probe sequences and the which array contains only those with the associated predictor bit set. Here number counts the number of probe sequences to follow. The while loop interleaves the probe sequences as well as removing from consideration those that end without success. For the cat record, only its first probe sequence must be followed since predictor 18 is set while predictor 1 is not. The search length is one probe. For the dog record, only its first probe sequence must be followed since predictor 3 is set but not predictor 41; the search length is one probe. The sow record is similar with only one probe sequence being followed. When accessing the doe record, since predictors 21 and 3 both are set, we must interleave the probe sequences; the search length is 7/2 probes since the probe sequences are equally likely to begin the search. When accessing ant, a record not within the table, the probe sequences for ant, 2,3,4,0,1 with variety 18 and 4,2,0,3,1 with variety 3, are determined. Since predictors 18 and 3 are both set, the interleaved probe sequences are followed with a total of six probes to determine the record is not within the table.

    fetch ( Data & data )
        for ( j = 0, number = 0 ; j < d ; j++ ) {
            value = hash[j]( data ) ;
            start[j] = value % n ;
            stride[j] = 1 + value % ( n - 1 ) ;
            variety = ( value / ( n * ( n - 1 ) ) ) % s ;
            location[j] = ( start[j] - stride[j] ) % n ;
            if predictor[ variety ] is set
                which[ number++ ] = j ;
        } // for ( ... select possible paths
        while ( number > 0 ) {
            for ( j = 0 ; j < number ; j++ ) {
                k = which[j] ;
                location[k] = ( location[k] + stride[k] ) % n ;
                if bucket[ location[k] ] contains data
                    copy data record and return true ;
                if bucket[ location[k] ] is not full
                    remove probe sequence which[j] from consideration
                    and decrement number ;
            } // for ( ...
        } // while ( ...
        return false ;

    Figure 3. Fetch method

3. Analysis of External Double Hashing with Choice

A table contains n buckets, each with capacity for b records, and possesses s predictor bits. We associate with the table d hash functions from a universal family of hash functions [6]. The hashing process, when applied to a record, produces a permutation of the bucket addresses. Within double hashing, a probe sequence is an arithmetic progression with successive probes differing by a constant referred to as the stride, which is determined via the process. When inserting a record, the probe sequence of each of the d hash functions is created and the number of probes X required to reach an unfilled bucket is determined. The record is inserted by following a probe sequence with the smallest X value; this sequence of probes is referred to as the min-sequence. Of course, the min-sequence and its associated X value will depend upon the number of records currently stored within the table. This smallest X is the minimum order statistic for the d values [7]. As the table fills, the average of individual minimum order statistics, denoted L^min_S, will be of interest. The d hash functions determine the d triples consisting of the initial probe, the stride, and the variety values for a record; we assume that the n(n-1)s triples are equally likely; previous analyses of double hashing also assume the n(n-1) pairs of initial probe and stride values are equally likely [11, 8]. Hashing performance metrics are the successful search length as well as the unsuccessful search length, which are calculated analytically as expected values or measured experimentally as average values. The successful search length is the number of probes to access a record stored within the table. The unsuccessful search length is the number of probes to determine a record is not stored in the table. Within either, for a given record the fetch function first determines whether the predictor bit is set for each hash function. Only the probe sequences associated with set predictor bits need be followed. The insertion method sets exactly one predictor bit; a predictor bit is set with probability 1/s during an insert operation. After m insertions, a predictor bit remains unset with probability ( 1 - 1/s )^m and is set with probability

    p_m = 1 - ( 1 - 1/s )^m ≈ 1 - e^{-m/s}.    (1)

During a successful search access operation, the d hash functions each determine a variety for the desired record; at least one of the associated predictor bits must be set. Let d̂ count the number of set associated predictor bits; the probability d̂ equals j for 1 ≤ j ≤ d is, within tables containing m records,

    Prob{ d̂ = j } = ( d-1 choose j-1 ) p_m^{j-1} ( 1 - p_m )^{d-j}    (2)

and the expected number of set predictor bits is 1 + ( d - 1 ) p_m. Similarly, during an unsuccessful search operation, the probability d̂ equals j for 0 ≤ j ≤ d is, within tables containing m records,

    Prob{ d̂ = j } = ( d choose j ) p_m^j ( 1 - p_m )^{d-j}.    (3)

and the expected number of set predictor bits is d p_m. Both the successful and unsuccessful search lengths are calculated using the probability a bucket is filled together with the expected number of set associated predictor bits. Tables with single record bucket capacity have been analyzed [9, 14]; the analysis appears within data structure and algorithm textbooks circa 2005. This analysis utilizes a sampling without replacement approach to calculate the probability an insertion requires exactly r probes to locate the desired bucket in a table containing m records. Evidently extending this approach to larger buckets is difficult. Tables also have been analyzed using sampling with replacement; this approach readily extends to larger bucket capacities [10, 12] and will be utilized here. We continue by calculating filled_b( α ), the probability an arbitrary capacity b bucket contains b records within a table with loading factor α. Both the expected successful and unsuccessful search lengths will utilize this probability. The occupancy count of the first unfilled bucket encountered on a probe sequence will be arbitrary. The likelihood a probe sequence accesses a bucket containing i records depends only on the number of such buckets. This insight is utilized to calculate the probability a bucket is filled. Let N_i( m ), for 0 ≤ i ≤ b, be the expected number of buckets containing exactly i records when m records have

been inserted into a table configured with n buckets. The recurrences, following the total expectation rule [15], obtain the b + 1 expected values as a function of m and n:

    N_0( m+1 ) = N_0( m ) - N_0( m ) / ( n - N_b( m ) )
    N_i( m+1 ) = N_i( m ) + ( N_{i-1}( m ) - N_i( m ) ) / ( n - N_b( m ) )    (4)
    N_b( m+1 ) = N_b( m ) + N_{b-1}( m ) / ( n - N_b( m ) )

for 0 ≤ m < b n and 0 ≤ i ≤ b. The initial conditions, with all buckets empty, are N_i( 0 ) = n δ_{i,0}. Finally, the buckets are filled completely with b n records and the final values are N_i( b n ) = n δ_{i,b}. Equations (4) are determined as follows. Suppose m records have been inserted; during insertion of the m+1st record the number of buckets containing i records can change by at most one. The probability of these events will vary as the number of unfilled buckets varies; the expression ( n - N_b( m ) ) designates the number of unfilled buckets after m insertions. The probability the number of buckets containing i records increases by one is N_{i-1}( m ) / ( n - N_b( m ) ) and the probability the number decreases by one is N_i( m ) / ( n - N_b( m ) ). Furthermore, the probability the number containing i records does not change is given by 1 - ( N_{i-1}( m ) + N_i( m ) ) / ( n - N_b( m ) ). Then, using the total expectation rule,

    N_i( m+1 ) = N_i( m ) + ( N_{i-1}( m ) - N_i( m ) ) / ( n - N_b( m ) ).

This is the principal equation within (4). The empty bucket count cannot increase as m increases and is a special case

    N_0( m+1 ) = N_0( m ) - N_0( m ) / ( n - N_b( m ) ).

Similarly, the full bucket count cannot decrease as m increases and is a special case

    N_b( m+1 ) = N_b( m ) + N_{b-1}( m ) / ( n - N_b( m ) ).

In general, a bucket is filled with probability

    filled_b( α ) = N_b( m ) / n    (5)

where α = m/( b n ) is the loading factor. Figure 4 presents the probability values calculated using equations (4) for various bucket sizes. These systems of non-linear recurrence equations evidently do not have closed form solutions; however, useful approximations of the form

    filled_b( α ) ≈ α^ψ    (6)

exist, which are also presented within figure 4. The ψ values are determined to minimize the mean square or minmax error; these change only slightly over a wide range of table sizes from 10^2 through 10^6. Table 1 contains ψ values for tables with 257 and 1000000 buckets. However we have no asymptotic results for these approximations.

    [Figure 4: exact and approximate filled bucket probability versus loading factor for b = 1, 2, 3, 10, and 100.]

    Figure 4. The filled bucket probability versus loading factor.

3.1. Successful Search Length

A successful search is conducted for a record residing within the table; the successful search length L_S measures the total number of buckets examined to conduct the search. First we calculate the expected average of the min-sequence length E[ L^min_S ]. Then we can calculate the expected successful search length E[ L_S ]. Since records are not moved within the table once they are inserted, we anticipate

    E[ L_S ] ≥ E[ L^min_S ].

Moreover, when d is one, we note E[ L_S ] is E[ L^min_S ]. Suppose the table with n capacity b buckets has loading factor α; the probability a bucket is filled, filled_b( α ), has been calculated above. The probability an insertion, using d hash functions, will require at least λ + 1 probes is filled_b( α )^{d λ}. That is, each of the d probe sequences must hit a filled bucket λ times. Finally, when a table contains m records, the probability that L^min_S is greater than λ is

    Prob{ L^min_S > λ } = (1/m) Σ_{j=1}^{m} filled_b( j/( n b ) )^{d λ}.

Thus we have

    E[ L^min_S ] = (1/m) Σ_{j=1}^{m} Σ_{λ≥0} filled_b( j/( n b ) )^{d λ}.    (7)

    Table 1. ψ approximation exponents.

    b      n = 257                n = 1000000
           ψ_mean-sq  ψ_minmax    ψ_mean-sq  ψ_minmax
    1      1.000      1.000       1.000      1.000
    2      1.671      1.663       1.668      1.661
    3      2.201      2.190       2.197      2.187
    4      2.653      2.640       2.648      2.636
    5      3.053      3.040       3.048      3.035
    6      3.417      3.403       3.411      3.398
    7      3.752      3.738       3.746      3.731
    8      4.070      4.051       4.062      4.044
    9      4.369      4.345       4.361      4.337
    10     4.652      4.623       4.644      4.615
    20     6.959      6.880       6.943      6.868
    30     8.735      8.619       8.718      8.603
    40     10.238     10.086      10.219     10.067
    50     11.565     11.380      11.543     11.359
    60     12.766     12.550      12.741     12.527
    70     13.870     13.627      13.843     13.601
    80     14.899     14.629      14.871     14.601
    90     15.866     15.570      15.835     15.541
    100    16.781     16.461      16.748     16.430

We can obtain closed form expressions for E[ L^min_S ] when b is one and d either one or two. In case b is one, using equation (4), filled_1( j/n ) = j/n. And we have

    Prob{ L^min_S > λ } = (1/m) Σ_{j=1}^{m} ( j/n )^{d λ} ≈ α^{d λ}/( d λ + 1 ).

If d is one, we have for ordinary double hashing the familiar expression

    E[ L^min_S ] ≈ Σ_{λ≥0} α^λ/( λ + 1 ) = -α^{-1} ln( 1 - α )

which is well-known [9, 10, 13, 14] as the expression for E[ L_S ]. If d is two, we have

    E[ L^min_S ] ≈ Σ_{λ≥0} α^{2λ}/( 2λ + 1 ) = ( 1/( 2α ) ) ln( ( 1 + α )/( 1 - α ) ).

The successful search is conducted by interleaving d̂ probe sequences where d̂ ≤ d is the number of set predictor bits associated with the search. One of these probe sequences will hit the desired record during its L^min_S probe on average. Within the search process, each of the last d̂ probes will access the desired record with equal probability. Accordingly, the successful search length L_S is given by

    L_S = L^min_S                          if d̂ is 1;
        = 2 ( L^min_S - 1 ) + 3/2          if d̂ is 2;
        = 3 ( L^min_S - 1 ) + 2            if d̂ is 3;
          ...
        = d ( L^min_S - 1 ) + ( d + 1 )/2  if d̂ is d.    (8)

Theorem: For tables configured with n capacity b buckets containing m records using d hash functions, the expected successful search length E[ L_S ] is

    E[ L_S ] = E[ L^min_S ] ( ( d - 1 ) p_m + 1 ) - ( d - 1 ) p_m / 2    (9)

where the loading factor α is m/( n b ). The theorem follows by combining equations (2) and (8) via the total expectation rule.

3.2. Unsuccessful Search Length

An unsuccessful search is conducted for a record not residing within the table; the unsuccessful search length L_U measures the total number of buckets examined to conduct the search. First we calculate the expected number of buckets accessed E[ L1_U ] by a single probe sequence to first touch an unfilled bucket and then the expected unsuccessful search length E[ L_U ]. For unsuccessful searches, the probability a single probe sequence accesses i buckets to first touch an unfilled bucket follows the geometric distribution. For a single hash function

    Prob{ L1_U > λ } = filled_b( α )^λ

from which it follows

    E[ L1_U ] = 1 / ( 1 - filled_b( α ) ).

This formula is, of course, a direct extension of the result for ordinary double hashing [9, 10, 13, 14] where the expected value is 1/( 1 - α ). The unsuccessful search length E[ L_U ] is given by

    E[ L_U ] = E[ d̂ L1_U ]    (10)

since all of the d̂ probe sequences must be followed until each locates an unfilled bucket.

Theorem: For tables configured with n capacity b buckets containing m records using d hash functions, the expected unsuccessful search length E[ L_U ] is

    E[ L_U ] = d p_m / ( 1 - filled_b( α ) )    (11)

where the loading factor α is m/( n b ).

The theorem follows by combining equations (3) and (10) via the total expectation rule. Both the successful and unsuccessful search length analyses follow the sampling with replacement paradigm. Since probe sequences are permutations, the analysis would be more accurately served using sampling without replacement; however our expected value models provide very usable results as shown in the experimental results section.

3.3. Asymptotic Choice

Several configurations are considered to demonstrate the significance of the approach. Since there are several parameters, the number of hash functions d and the number of predictor bits s, we determine a parameter space partitioning useful for presenting similar results. The ratio

    ξ = s / ( n b d )

in which n represents the number of buckets within the table provides such a partitioning.

Theorem: For tables configured with n capacity b buckets and s predictor bits containing m records, let n b ξ = s/d remain constant while d and s increase. The asymptotic search lengths are

    E[ L_S ] = 1 + α/( 2 ξ )
    E[ L_U ] = α / ( ξ ( 1 - filled_b( α ) ) )    (12)

where the loading factor α is m/( n b ). The limiting E[ L_S ] is similar to the separate chaining [9] successful search length. Moreover the expected successful search lengths can be arbitrarily close to one by increasing ξ. The E[ L_S ] formula comes from expression (9). Increasing d improves the chances of obtaining a shorter min-probe sequence, and

    lim_{d→∞} E[ L^min_S ] = 1

since the shortest possible min-probe sequence has length one. Then with ( d - 1 ) p_m = ( d - 1 )( 1 - ( 1 - 1/s )^m ) and ξ = s/( n b d ) we have

    lim_{d→∞} ( d - 1 ) p_m = lim_{d→∞} ( d - 1 )( 1 - ( 1 - 1/( n b d ξ ) )^m ) = m/( n b ξ ) = α/ξ.

Combining the two limit expressions in formula (9) yields formula (12). The E[ L_U ] expression has a similar development since filled_b( α ) does not vary with s.

4. Experimental Paradigm and Results

The experimental effort provides average successful and unsuccessful search lengths for comparison with our analytical results as well as individual table-method timings. External double hashing with choice is implemented in C; the code is compiled via the gcc compiler with -O3 optimizations. Timing was done using the gettimeofday function which "ticks" once per microsecond. The computational environment consists of a one GHz Pentium III processor configured as a server. For our experiments, we utilized tables configured with 257 buckets; records were generated via the Linux random function producing long unsigned values with a period of approximately 16 ( 2^31 - 1 ). The approximately uniformly distributed values provide a challenging workload without locality; the analysis also assumes such a workload. Reference locality will provide improvements depending upon the degree of locality; locality is not considered further here. For load factors from zero to one, the successful and unsuccessful search length averages were calculated as well as the associated timings. An average value is determined experimentally together with its 95% confidence interval of width 1% of the to-be-determined average. Since the data sample standard deviation is unknown and will vary with the load factor, the two-stage approach of Stein [16] is utilized to determine the number of samples sufficing to obtain the confidence interval of desired width. For a given load factor, Stein's approach involves obtaining a fixed number of samples (40 in our experiments) to determine the sample mean and standard deviation. The number of additional samples required, for the desired 95% confidence interval, is the larger of zero and

    ⌈ ( ( sample standard deviation × 2.021 ) / ( 0.005 × sample mean ) )^2 ⌉ - 40.

This process is followed for the successful and unsuccessful search lengths as well as the successful and unsuccessful search timings; the largest of the four numbers is used to complete the experiment for the load factor. There is no control on the final number of samples required, especially when the sample standard deviation is large; in our experiments with 257 buckets per table, the number ranged between 40 and a few thousand per data point. All of our experiments obtained a 95% confidence interval of width (1/100) × sample average search length; these confidence intervals, while not shown, are easily determined. Both experimental and analytical results regarding successful and unsuccessful search lengths are presented graphically. We conducted experiments for d = 1, 2, 4, 8, 16, 32, and 64; ξ = 1, 2, 4, 8, 16, 32, and 64; and b = 1, 2, 4, 8, 16, 32, and 64 (7 × 7 × 7 = 343 configurations), providing both successful and unsuccessful search lengths.

Figures 5 and 6 graphically present typical experimental results for the successful and unsuccessful search lengths as well as expected values for b = 2, n = 257, ξ = s/( n b d ) = 1 and 2, and d = 1, 2, 4, 8, 16, 32, and 64. The successful search lengths (requiring at least one access) become flatter as d increases; the unsuccessful search lengths become larger as well as clustering. As d increases, the search lengths more closely approximate the values from formula (12). Both the unsuccessful search and the successful search lengths decrease as ξ moves from 1 to 2.

    [Figure 5: successful and unsuccessful search lengths (buckets accessed) versus loading factor; expected and experimental values, two slots per bucket, ξ = 1.]

    Figure 5. Successful and unsuccessful search lengths for ξ = s/514d = 1, with d = 1, 2, 4, 8, 16, 32 and 64.

    [Figure 6: successful and unsuccessful search lengths (buckets accessed) versus loading factor; expected and experimental values, two slots per bucket, ξ = 2.]

    Figure 6. Successful and unsuccessful search lengths for ξ = s/257d = 2, with d = 1, 2, 4, 8, 16, 32 and 64.

The analysis of the unsuccessful search lengths utilizes the following lemma.

Lemma: For a > b ≥ 1 and c ≥ 0,

    1 ≤ a( 1 - e^{-c/a} ) / ( b( 1 - e^{-c/b} ) ) < a/b.

This can be verified using l'Hospital's rule as well as a continuity argument. For unsuccessful searches with constant ξ, b, n and α, the expected search length increases with increasing s and d. Increasing both s and d by a factor of β ≥ 1, the respective search lengths are approximately

    β d ( 1 - e^{-m/βs} ) / ( 1 - filled_b( α ) )    and    d ( 1 - e^{-m/s} ) / ( 1 - filled_b( α ) ).

The previous lemma indicates the unsuccessful search length increase is less than β; this is seen in figures 5 or 6 with b = 2 and β = 1 or 2 as well as constant n and α. This discussion can be expanded to show that as β enlarges, the unsuccessful search length approaches its asymptotic value. Similarly the unsuccessful search lengths decrease with increasing s with constant b, d, n, and α. Increasing s by a factor of β ≥ 1, the respective search lengths are approximately

    d ( 1 - e^{-m/βs} ) / ( 1 - filled_b( α ) )    and    d ( 1 - e^{-m/s} ) / ( 1 - filled_b( α ) ).

The previous lemma indicates the unsuccessful search length decrease is less than 1/β. Then for example, with b = 2 and β = 2, with constant d, n and α, the expected unsuccessful search length for s predictor bits is less than double the search length for 2s predictor bits as shown in figures 5 and 6. Successful search lengths are improved as well with increased d, b, and s. For one hash function, the successful search length obtains no improvement via predictor bits. However the improvement via additional hash functions and predictor bits is evident in figures 5 and 6 as the curves become flatter, approximating a perfect hash function. The expected successful search lengths range between the min-sequence length E[ L^min_S ] when d is one and the asymptotic length 1 + α/( 2ξ ) as d → ∞. Since ψ monotonically increases with b, the approximate min-sequence length

    Σ_{λ≥0} α^{ψλ}/( ψλ + 1 )

decreases with increasing b. For ψ = ξ = b = 1, the min-sequence length exceeds the asymptotic length for all positive α. Generally, the α value where the min-sequence length equals the asymptotic length depends upon ξ and b. In figure 5, equality occurs at approximately α = 0.7; in figure 6, at approximately 0.25.

Rather than presenting 47 similar figures for the remaining configurations, we present summary figures in which each curve is replaced by its average value. The summary figure for each configuration consists of a pair of bar graphs depicting the average successful or unsuccessful search lengths for given fixed b, n and ξ values with d ranging from 1 through arbitrarily large values. One end of a bar denotes the d is one average and the d → ∞ average at the other end. Experimentally the largest d is 64. Figure 7 conveys these results for ξ = 1; the average asymptotic expected successful and unsuccessful values for arbitrarily large d are specified within the figure. Similar graphs for ξ = 2 and 8 are given in figures 8 and 9; all the graphs are available [5]. The average expected asymptotic successful search length is approximated by the area under the curve 1 + α/(2ξ) for α ranging from 0 to 1. This is calculated as a proper integral to be 1 + 1/(4ξ). For fixed ξ, with increasing b, the average min-sequence length becomes less than the average asymptotic length 1 + α/4ξ. The average expected asymptotic unsuccessful search length is approximated by the area under the curve α/ξ(1 − f illed b( α ) for α between 0 and 1. This obtains a divergent improper integral; however expression 13 yields our approximation

4.15

3

ξ=1 Successful search length 2.32

Unsuccessful search length

2 1.57

1.25 1.15

1 0.893 0.791 0.667

1

2

4

8 16 slots per bucket

32

64

Figure 7. Successful and unsuccessful average search lengths for ξ = 1.

3

ξ=2 Successful search length Unsuccessful search length

2.06

2

1.16

1 ξnb

X

0≤i 1+ . nb λψ + 1 4ξ

1.125

1

1

2

4

8 16 slots per bucket

0.40

32

0.34

64

Figure 8. Successful and unsuccessful average search lengths for ξ = 2.

Expression 13 leads to condition (14): the average min-sequence length exceeds the average asymptotic successful search length,

    average min-sequence length  >  1 + 1/(4ξ).        (14)

In a region where the average min-sequence length is larger, additional predictor bits and hash functions lower the expected successful search length; the unsuccessful search length is (slightly) increased as well. Otherwise, the use of additional predictor bits, as well as the runtime to calculate the hash values, raises both the average expected successful and unsuccessful search lengths, lowering the ensemble performance.

Access calculation timing is an issue when using more than one hash function; figure 10 shows the increase in "internal" calculation time as the number of hash functions increases. The insertion time is estimated using d E[L1U] disk accesses plus the internal time. Table 2 presents some "external" timings using formulae 7, 9, and 11 as follows, in which the expected external latency is 4900 microseconds per disk access:

    INtime = d · E[L1U] · 4900 + internal time
    SStime = E[LS] · 4900 + internal time
    UStime = E[LU] · 4900 + internal time.
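The three access-time estimates can be sketched directly. In the snippet below, the 4900 µs per-access latency is from the text, while the E[L1U], E[LS], E[LU], and internal-time inputs are hypothetical values standing in for quantities the paper measures elsewhere.

```python
LATENCY_US = 4900  # expected external latency per disk access, microseconds

def in_time(d, e_l1u, internal_us):
    """Insertion: d probe sequences, each costing E[L1U] accesses."""
    return d * e_l1u * LATENCY_US + internal_us

def ss_time(e_ls, internal_us):
    """Successful search: E[LS] expected bucket accesses."""
    return e_ls * LATENCY_US + internal_us

def us_time(e_lu, internal_us):
    """Unsuccessful search: E[LU] expected bucket accesses."""
    return e_lu * LATENCY_US + internal_us

# Hypothetical illustration: d = 2 hash functions, E[L1U] = 1.6,
# and 300 microseconds of internal hash-evaluation time.
print(in_time(2, 1.6, 300), ss_time(1.2, 300), us_time(1.3, 300))
```

The estimates make plain why the external latency dominates: even large changes in internal calculation time shift the totals by only a few microseconds.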

Table 2 indicates that with as few as eight hash functions, which require approximately three microseconds to evaluate, the results are very close to what could be obtained using an "unlimited" number of hash functions; evaluation of the hash functions will dominate the access time as d increases without bound.

[Figure 9: pair of bar graphs of the average successful and unsuccessful search lengths for ξ = 8 with 1, 2, 4, 8, 16, 32, and 64 slots per bucket; each bar spans the d = 1 and d → ∞ averages, with the asymptotic averages annotated in the figure.]

Figure 9. Successful and unsuccessful average search lengths for ξ = 8.

ξ    SS      US      IN
2    7351    5244      15926
4    7351    2941      15926
8    7351    1520      15926
2    6273    5832      31851
4    5783    6127      31851
8    5783    1471      31851
2    5734    6275     127403
4    5734    3188     127403
8    5099    1620     127403
2    5851    6341    1019220
4    5361    3206    1019220
8    5117    1638    1019220

Table 2. Expected access times in microseconds; α = 0.8 and the average disk drive seek time is 4900 microseconds.

[Figure 10: internal calculation time in microseconds per operation versus the number of hash functions (1 through 64), for successful and unsuccessful searches.]

Figure 10. Successful and unsuccessful search average internal timings.

5. Conclusions

External double hashing with choice provides a very convenient extension to ordinary double hashing, improving both successful and unsuccessful search lengths. Within external storage environments, the extra space required for predictor bits is extremely modest and realistically could be maintained within cache memory. We have observed two distinct performance modes: one in which additional hash functions improve the runtime performance and another in which additional hash functions degrade the performance. The two modes are determined by whether the average min-sequence length is greater than the average asymptotic successful search length according to formula 14. It will be interesting to combine choice with passbits; passbits improve the unsuccessful search length [4, 12].

References

[1] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. In Proceedings of the ACM Symposium on Theory of Computing (STOC 94), pages 593–602, Montreal, Quebec, 1994.
[2] I. F. Blake and A. G. Konheim. Big buckets are (are not) better. Journal of the Association for Computing Machinery, 24(4):591–606, 1977.
[3] A. Z. Broder and M. Mitzenmacher. Using multiple hash functions to improve IP lookups. In Proceedings of the 20th IEEE Computer and Communications Conference (INFOCOM 01), number 3, pages 1454–1463, 2001.
[4] W. A. Burkhard. Double hashing with passbits. Information Processing Letters, 96:162–166, 2005.
[5] W. A. Burkhard. External double hashing with choice. http://www.cse.ucsd.edu/groups/gemini/papers/choice.pdf, 2006.
[6] J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.
[7] W. Feller. An Introduction to Probability Theory and Its Applications, volume I. John Wiley, 1968.
[8] L. J. Guibas and E. Szemerédi. The analysis of double hashing. Journal of Computer and System Sciences, 16:226–274, 1978.
[9] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, second edition, 1998.
[10] P.-Å. Larson. Analysis of uniform hashing. Journal of the Association for Computing Machinery, 30(4):805–819, October 1983.
[11] G. S. Lueker and M. Molodowitch. More analysis of double hashing. Combinatorica, 13(1):83–96, 1993.
[12] P. M. Martini and W. A. Burkhard. Double hashing with multiple passbits. International Journal of Foundations of Computer Science, 14(6):1165–1182, 2003.
[13] R. Morris. Scatter storage techniques. Communications of the Association for Computing Machinery, 11(1):38–44, January 1968.
[14] W. W. Peterson. Addressing for random-access storage. IBM Journal of Research and Development, 1(2):130–146, 1957.
[15] A. Rényi. Probability Theory. North-Holland Publishers, 1970.
[16] C. Stein. A two sample test for a linear hypothesis whose power is independent of the variance. Annals of Mathematical Statistics, 16(3):243–258, 1945.
[17] J. D. Ullman. A note on the efficiency of hash functions. Journal of the Association for Computing Machinery, 19(3):569–575, July 1972.
[18] B. Vöcking. How asymmetry helps load balancing. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pages 131–141, 1999.
[19] A. C. Yao. Uniform hashing is optimal. Journal of the Association for Computing Machinery, 32(3):687–693, July 1985.

Appendix

[Summary bar graphs, in the style of figures 7–9, of the average successful and unsuccessful search lengths over 1, 2, 4, 8, 16, 32, and 64 slots per bucket for ξ = 0.5, 1, 2, 4, 8, 16, 32, and 64; each bar spans the d = 1 and d → ∞ averages, with the asymptotic averages annotated in the figures.]

[Plots of buckets accessed versus loading factor, comparing SS expected values, US expected values, and experimental values, for ξ = 1, 2, 4, 8, 16, 32, and 64 with one, two, four, eight, sixteen, thirty-two, and sixty-four slots per bucket.]