Jun 27, 2005 - In: Paterson M.S. (ed.) Automata, Languages and Programming. ICALP 1990. Lecture Notes in Computer Science, vol 443. Springer, Berlin.
A new universal class of hash functions and dynamic hashing in real time

Martin Dietzfelbinger*, Friedhelm Meyer auf der Heide*
Fachbereich 17, Mathematik-Informatik, Universität-GH Paderborn, D-4790 Paderborn, Fed. Rep. of Germany

Abstract

The paper presents a new universal class of hash functions which have many desirable features of random functions, but can be (probabilistically) constructed using sublinear time and space, and can be evaluated in constant time. These functions are used to construct a dynamic hashing scheme that performs in real time, i.e., a Monte Carlo type dictionary that uses linear space and needs worst case constant time per instruction. Thus instructions can be given in constant length time intervals. Answers to queries given by the algorithm are always correct, the space bound is always satisfied, and the algorithm fails only with probability O(n^{-k}), where n is the number of data items currently stored. The constant k can be chosen arbitrarily large.

1 Introduction

A dictionary is a data structure that supports (at least) three kinds of instructions, namely Insert x, Delete x, and Lookup x. Here the possible keys x are taken from some universe U. This is to be understood as follows: the instruction "Insert x" causes x, together with some information associated with x, to be stored in the data structure; the instruction "Delete x" causes x to be removed from the data structure; the instruction "Lookup x" returns the information associated with x (or a default message if x is currently not stored). We always assume that the user who provides the instructions waits for one instruction to be processed before entering the following one. We will be interested in time requirements (for single instructions as well as amortized over a sequence of instructions) and space requirements for implementations of such a data structure. In this paper we present a new efficient dictionary. It is based on a dynamic hashing strategy that uses a novel type of hash functions.
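The instruction semantics just described can be sketched in a few lines. This is a hypothetical Python stand-in with a built-in dict in place of the two-level hash structure developed in the later sections; all names are ours:

```python
class Dictionary:
    """Sketch of the dictionary interface: Insert x, Delete x, Lookup x.
    A built-in dict stands in for the paper's two-level hash structure."""

    def __init__(self):
        self._table = {}

    def insert(self, x, info):
        # "Insert x": store x together with its associated information.
        self._table[x] = info

    def delete(self, x):
        # "Delete x": remove x from the structure (no effect if absent).
        self._table.pop(x, None)

    def lookup(self, x):
        # "Lookup x": return the associated information,
        # or a default message if x is currently not stored.
        return self._table.get(x, "not stored")
```

The point of the paper is, of course, to realize exactly this interface with worst case constant time per instruction and linear space.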

New universal classes of hash functions. The basic structure of the new dictionary is similar to the one used in [FKS84] and [DKM88]: a level-1 hash function h : U → {1,...,n} splits the set S ⊆ U, |S| = n, of keys to be stored into buckets B_i^h = h^{-1}(i) ∩ S, 1 ≤ i ≤ n. Each bucket B_i^h is stored in a separate subtable by means of a level-2 hash function h_i. Our level-2 hash tables will be as in the above dictionaries. The main difference of the new scheme in comparison with the previous ones is the structure of the level-1 function h. Usually h is chosen randomly from an l-universal class, i.e., l arbitrary keys x_1,...,x_l satisfy Prob(h(x_i) = y_i for 1 ≤ i ≤ l) = n^{-l}, for l constant. We introduce a new class R of hash functions that can still be evaluated in constant time but share many properties of n-universal hash functions, i.e., hash functions that map n keys independently and uniformly into {1,...,n}.

*Supported in part by DFG Grant ME 872/1-3

The most striking properties of the new class are as follows: there is a constant a > 1 such that, given a set S ⊆ U, |S| = n, a random h ∈ R splits S into buckets of size at most r with probability 1 − O(a^{1−r}). Such functions are used in [GMW90] to derive optimal parallel hashing schemes in several abstract models of parallel hashing. Our new types of hash functions can be used as good breakers which use linear space and have constant evaluation time.

Applications of the new dictionary. (a) Distributed dictionaries. Assume that each one of p processors maintains a dictionary, working independently. It is very useful to have a bound for the time the slowest processor needs to perform n instructions. The best estimate that could be obtained by using previously available schemes was O(n log p) (expected). With our scheme, the probability that all dictionaries execute each instruction in constant time is at least 1 − O(p · n^{−k}). Sequential dictionaries that operate as evenly as that are indispensable when constructing efficient distributed dictionaries (cf. [DM90] for details of such techniques).

(b) "Clocked adversaries." In a recent paper [LN89], Lipton and Naughton showed that several schemes for implementing dynamic hashing strategies by use of chaining are susceptible to attacks by adversaries that are able to time the algorithm. Even though the adversary (who chooses the instructions) does not have access to the random numbers the algorithm chooses, he can draw conclusions about the structure of the hash function being used from the time the execution of certain instructions takes, and subsequently choose keys to be inserted that make the algorithm perform badly. Lipton and Naughton ask whether the algorithm of [DKM88] is susceptible to such attacks. This question is still open; but the algorithm described in the present paper is immune to such attacks in a natural way: normally, each instruction takes constant time, which gives no information to the adversary at all. From time to time the algorithm may crash; but the consequence is that the data structure is built anew, redoing all random choices. This results in a data structure that is independent of all previous events, in particular of the situation that caused the crash.

Structure of the paper. In Section 2, some basic definitions are given and the technical background for our construction is reviewed. Section 3 contains several probability bounds useful for the analysis of the performance of our hash functions and dictionaries. Section 4 presents the new hash functions; Section 5 the new dictionary.

2 Background: Polynomials as hash functions and the FKS-scheme

We will be using the following universal class of hash functions. Let p be prime and U = {0, 1,..., p − 1} the universe. Consider two parameters: d ≥ 1, the degree, and s ≥ 1, the table size. Define H_s^d := { h_a | a = (a_0,...,a_d) ∈ U^{d+1} }, where for a = (a_0,...,a_d) ∈ U^{d+1} we let

h_a(x) := ((Σ_{i=0}^{d} a_i x^i) mod p) mod s,  for x ∈ U.
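The class H_s^d can be sampled and evaluated directly. The following sketch (our own code, using a 0-based range {0,...,s−1} for convenience) draws the coefficients a_0,...,a_d uniformly from U and evaluates h_a by Horner's rule:

```python
import random

def make_poly_hash(p, s, d, rng=None):
    """Draw h_a uniformly at random from H_s^d: coefficients a_0,...,a_d
    from U = {0,...,p-1}; h_a(x) = ((sum_{i=0}^{d} a_i x^i) mod p) mod s."""
    rng = rng or random.Random()
    a = [rng.randrange(p) for _ in range(d + 1)]

    def h(x):
        acc = 0
        for coeff in reversed(a):  # Horner's rule: d multiplications mod p
            acc = (acc * x + coeff) % p
        return acc % s

    return h
```

Evaluation takes O(d) arithmetic operations, i.e., constant time for constant degree d, matching the constant evaluation time used throughout the paper.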

We will have h chosen uniformly at random from H_s^d. All probabilities are with respect to this probability space; no assumptions about the distribution of the input keys are made. We recall some useful facts. Assume for the following that some set S ⊆ U with |S| = n is given.

Definition 2.1 Let h : U → {1,...,s}.
(a) The i-th bucket is B_i^h := { x ∈ S | h(x) = i }.

(b) Its size is b_i^h := |B_i^h|.
(c) h is called c-perfect for S if b_i^h ≤ c for all i ∈ {1,...,s}.
(d) If h is chosen at random from some class, we write B_i for the random set B_i^h and b_i for the random variable b_i^h.

Fact 2.2 ([DKM88]) Let h be randomly chosen from H_s^d. Then
(a) E(Σ_{i=1}^{s} Π_{u=0}^{d}(b_i − u)) ≤ 2n · (2n/s)^d, and
(b) Prob(h is d-perfect for S) ≥ 1 − n · (2n/s)^d.

Proof: (a) is Thm. 3(a) in [DKM88] (proof in the Tech. Report version, Lemma 1(a)).
(b) Recall Markov's inequality: if Y ≥ 0 is a random variable and t > 0, then Prob(Y ≥ t) ≤ E(Y)/t. Since the b_i are integers, b_i > d for some i entails Π_{u=0}^{d}(b_i − u) ≥ (d+1)! ≥ 2. Thus, (a) implies that Prob(h is not d-perfect for S) ≤ Prob(Σ_{i=1}^{s} Π_{u=0}^{d}(b_i − u) ≥ 2) ≤ n · (2n/s)^d. □

We apply this to the case of n^ε keys, where 0 < ε < 1 is fixed and s = n, to obtain a dictionary that has constant access time but still needs superlinear space.

Corollary 2.3 There is a dictionary that uses d · n space and that with probability exceeding 1 − O(n^{ε−(1−ε)d}) executes a sequence of n^ε instructions in such a way that each one takes constant time (in the worst case).

Proof: Use a hash table of size s = n, each table position comprising d slots for one key each. By 2.2, a hash function h chosen randomly from H_n^d maps at most d of the n^ε keys to each of the table positions. If a sequential list of length n^ε is maintained in which all keys are recorded that were ever entered in the dictionary, the memory space used can be cleared in O(n^ε) steps after finishing the n^ε instructions. □

The following fact is an essential ingredient for the construction of the new hash functions below.
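The table with d slots per position from the proof of Corollary 2.3 can be sketched as follows (our own code; the random h is simply redrawn until it is d-perfect for the keys, which by Fact 2.2(b) rarely takes more than one attempt for suitable s and d):

```python
import random

def build_slot_table(keys, p, s, d, rng=None):
    """Table of s positions with d key slots each (sketch of Corollary 2.3).
    Redraws h from H_s^d until no position receives more than d keys."""
    rng = rng or random.Random()
    while True:
        a = [rng.randrange(p) for _ in range(d + 1)]

        def h(x, a=a):
            acc = 0
            for coeff in reversed(a):
                acc = (acc * x + coeff) % p
            return acc % s

        table = [[] for _ in range(s)]
        ok = True
        for x in keys:
            slot = table[h(x)]
            if len(slot) == d:   # position overfull: h is not d-perfect
                ok = False
                break
            slot.append(x)
        if ok:
            return h, table
```

A lookup probes the at most d slots of position h(x), i.e., constant time for constant d.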

Fact 2.4 ([KRS90]) Let 0 < δ < 1 be fixed, and let s = n^{1−δ}. Let h ∈ H_s^d be randomly chosen. Then Prob(b_i ≤ 2n^δ for all i) ≥ 1 − O(n^{1−δ−δd/2}).

(With high probability, all buckets will have size close to the average, which is n^δ.)

Proof: Let S = {x_1,...,x_n}. Define random variables

Y_j := 1 if h(x_j) = i, and Y_j := 0 otherwise,

for 1 ≤ j ≤ n. Corollary 4.20 in [KRS90] implies that

Prob(Σ_{j=1}^{n} (Y_j − 1/s) > n/s) = O(n^{−δd/2});

that means, Prob(b_i > 2n^δ) = O(n^{−δd/2}). By subadditivity, it follows that Prob(b_i > 2n^δ for some i) = O(n^{1−δ} · n^{−δd/2}), which proves the claim. □

We also will make use of the original result of Fredman, Komlós, and Szemerédi, summarized in the following theorem.

Theorem 2.5 ([FKS84]) There is a sequential static hashing strategy (i.e., only lookups are supported), the FKS-scheme, with the following features.
(i) O(n) space is used for storing n keys.
(ii) The lookup time is worst case constant.
(iii) The data structure can be constructed by a probabilistic algorithm whose running time T_n satisfies Prob(T_n > L · c · n) ≤ 2^{−L}, for some constant c and arbitrary L ≥ 1.
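The two-level construction behind Theorem 2.5 can be sketched as follows. This is our own simplified code, using random linear functions ((a·x + b) mod p) mod s in place of the paper's polynomial classes; each random choice is simply retried until it succeeds:

```python
import random

def fks_build(keys, p, rng=None):
    """FKS-style static scheme sketch: a level-1 function splits the n keys
    into n buckets; bucket B_i gets a collision-free level-2 table of size
    |B_i|^2, so that total space stays O(n)."""
    rng = rng or random.Random()
    n = len(keys)

    def rand_hash(s):
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x: ((a * x + b) % p) % s

    # Level 1: retry until the sum of squared bucket sizes is O(n).
    while True:
        h = rand_hash(n)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(B) ** 2 for B in buckets) <= 4 * n:
            break

    # Level 2: per bucket, retry until g_i is injective on B_i.
    level2 = []
    for B in buckets:
        size = max(1, len(B) ** 2)
        while True:
            g = rand_hash(size)
            sub = [None] * size
            for x in B:
                if sub[g(x)] is not None:   # collision: redraw g
                    break
                sub[g(x)] = x
            else:
                level2.append((g, sub))
                break
    return h, level2

def fks_lookup(h, level2, x):
    """Worst case constant time: one level-1 and one level-2 evaluation."""
    g, sub = level2[h(x)]
    return sub[g(x)] == x
```

The quadratic subtable sizes make each level-2 function injective with probability at least 1/2 per attempt, which is what yields the geometrically decaying construction-time bound of (iii).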

3 Probability bounds

In this section, we first quote a classical theorem that estimates the probability for a sum of independent bounded random variables to deviate far from the expected value. This estimate is a generalization of the "Chernoff bounds" that estimate how far sums of independent 0-1-valued random variables deviate from the mean (see, e.g., [AV79]). Next, we prove a theorem in the same spirit that deals with sums of independent random variables that are approximately exponentially distributed, thus unbounded. Both estimates will be used in the next section for establishing the main properties of a new class of hash functions to be defined there.

Theorem 3.1 (Hoeffding, see [Hof87, p. 104]) Let X_1,...,X_n be independent random variables with values in the interval [0, z] for some z > 0. Let X̄ := (1/n) Σ_{i=1}^{n} X_i be the mean and μ := E(X̄). Then, for 0 < t < z − μ,

Prob(X̄ ≥ μ + t) ≤ [ (μ/(μ+t))^{(μ+t)/z} · ((z−μ)/(z−μ−t))^{(z−μ−t)/z} ]^{n}.

[...]

hence, in combination with (2) we get

Prob(S ≥ 3lW) ≤ e^{−(l−1)·3h·W}.  (9)

We let h := (3M)^{−1}. Then clearly (4) and (6) are satisfied, and (9) reads Prob(S ≥ 3lW) ≤ e^{−(l−1)W/M}, which was to be shown. □


4 A new class of hash functions

In this section, we will define a new class of hash functions and establish some of the properties of this class. We fix n and a set S ⊆ U, |S| = n, the set of keys to be stored in the hash table.

Definition 4.1 For r, s ≥ 1 and d_1, d_2 ≥ 1 define

R(r, s, d_1, d_2) := { h : U → {1,...,s} | h = h(f, g, a_1,...,a_r) for some f ∈ H_r^{d_1}, g ∈ H_s^{d_2}, a_1,...,a_r ∈ {1,...,s} },

where h = h(f, g, a_1,...,a_r) is defined as follows:

h(x) := (g(x) + a_{f(x)}) mod s,  for x ∈ U.
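Under Definition 4.1, drawing and evaluating a function from R(r, s, d_1, d_2) might look as follows (our own sketch, with 0-based ranges; f and g are random polynomials over U as in Section 2):

```python
import random

def make_R_hash(p, r, s, d1, d2, rng=None):
    """Sample h = h(f, g, a_1,...,a_r) from R(r, s, d1, d2):
    f in H_r^{d1} picks a bucket, and g in H_s^{d2} plus that bucket's
    independent random offset determine the final position:
        h(x) = (g(x) + a_{f(x)}) mod s."""
    rng = rng or random.Random()

    def poly_hash(d, m):
        # A random degree-d polynomial over U = {0,...,p-1}, reduced mod m.
        a = [rng.randrange(p) for _ in range(d + 1)]

        def h(x):
            acc = 0
            for coeff in reversed(a):
                acc = (acc * x + coeff) % p
            return acc % m

        return h

    f = poly_hash(d1, r)
    g = poly_hash(d2, s)
    offsets = [rng.randrange(s) for _ in range(r)]  # a_1,...,a_r, independent
    return lambda x: (g(x) + offsets[f(x)]) % s
```

Evaluation costs O(d_1 + d_2) arithmetic operations plus one table lookup; storing the r offsets accounts for the O(r) cells mentioned in Remark 4.2.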

Even if h(f, g, a_1,...,a_r) and h(f', g', a'_1,...,a'_r) are extensionally equal, we consider them different if (f, g, a_1,...,a_r) ≠ (f', g', a'_1,...,a'_r).

Remark 4.2 It is clear that a function h ∈ R(r, s, d_1, d_2) can be stored in O(d_1 + d_2 + r) cells, can be generated sequentially in O(d_1 + d_2 + r) steps, and can be evaluated in O(d_1 + d_2) steps. In fact, if the offset a_i is chosen only at the time when h(x) has to be evaluated for some x with f(x) = i for the first time, the setup time for h is constant.

Let us first informally explain the structure of these functions. Obviously, f ∈ H_r^{d_1} splits S into r buckets B_i^f, 1 ≤ i ≤ r. Two-level hashing strategies considered before (e.g., the FKS-scheme) would try to store the keys in each bucket in separate storage areas (as a linked list, a subtable, etc.). The new idea is to apply a second hash function (g(x) + a_i) mod s to the keys in B_i^f, with common range {1,...,s}. The crucial facts that cause the universal class of hash functions thus obtained to behave differently than previously known classes are that, for suitable choices of s and r, (i) g will be d_2-perfect for B_i^f, for all i ∈ {1,...,r} (with high probability), and (ii) the offsets a_i are independent. Property (i) ensures that keys that belong to the same B_i^f do not normally collide, i.e., they are mapped to different values by h; property (ii) (independence of the a_i) implies that collisions between keys that belong to different B_i^f are governed by the laws that apply to truly random functions. We want to single out a subset of the hash functions h that are guaranteed to have property (i).

Definition 4.3 R_S(r, s, d_1, d_2) := { h = h(f, g, a_1,...,a_r) ∈ R(r, s, d_1, d_2) | b_i^f ≤ 2s/r and g is d_2-perfect for B_i^f, for 1 ≤ i ≤ r }.

If 0 < δ < 1 and k ≥ 1 are arbitrary, then for d_1, d_2 large enough, Prob(h ∈ R_S(n^{1−δ}, n, d_1, d_2)) ≥ 1 − O(n^{−k}).

Proof: Let d_1, d_2 ≥ 1; further let r = n^{1−δ} and s = n. By 2.4, a randomly chosen f ∈ H_r^{d_1} satisfies b_i^f ≤ 2n^δ for all i with probability 1 − O(n^{1−δ−δd_1/2}). [...] □

(a) Prob_S(b_j > d_2 · u) ≤ (e^{u−1}/u^u)^{1/d_2}, for all j = 1,...,n.
(a') Prob_S(#{y ∈ S | h(x) = h(y)} > d_2 + u) ≤ e^{u−1}/u^u, for all x ∈ S.
(b) E_S(max{b_j | 1 ≤ j ≤ n}) = O(log n / log log n).
(b') Prob_S(max{b_j | 1 ≤ j ≤ n} > u) = O(n^{−k}) for u ≥ 4d_2 · ln n / ln ln n.

Proof: Assume that |S| = n. (Otherwise add some dummy elements.) Let r = n^{1−δ}. Fix f ∈ H_r^{d_1} and g ∈ H_n^{d_2} with the property that b_i^f ≤ [...] Prob(Z ≥ u), for integral, nonnegative random variables Z. □

As noted in the introduction, the class R already provides a nice basis for strategies that use methods similar to chaining for resolution of collisions. Here, we are interested in perfect hashing schemes, that is, schemes in which at most a constant number of keys is mapped to the same bucket. In preparation for this application, which will be described in the next section, we prove a technical property of the new class R, which is central for our purposes. It states that, with high probability, only O(n^ε) keys from S will collide with some key from an arbitrary fixed subset S' ⊆ S of size n^ε, if the parameters are properly chosen.


Theorem 4.7 Let 0 < δ < ε < 1 be such that ε + δ < 1, and let k ≥ 1 be arbitrary. Let S ⊆ U, |S| = n, and S' ⊆ S, |S'| = n^ε, be fixed. Then, for suitable constants d_1, d_2,

Prob_S(#{x ∈ S | ∃x' ∈ S' (h(x) = h(x'))} ≥ 4d_2 · n^ε) = O(n^{−k}).

Proof: Let r := n^{1−δ}, and let f ∈ H_r^{d_1} be fixed so that b_i^f ≤ 2n^δ for 1 ≤ i ≤ r. We consider the random experiment of choosing g ∈ H_n^{d_2} and a_1,...,a_r ∈ {1,...,n}, denoting the corresponding probabilities and expected values by Prob_f(...) and E_f(...), respectively, and note that it suffices to prove the claim for Prob_f instead of Prob_S. Define I' := I'(S', f) := { i | 1 ≤ i ≤ r, B_i^f ∩ S' ≠ ∅ } [...]
