Theoretical Computer Science 182 (1997) 1-143
Elsevier

Fundamental Study

Perfect hashing

Zbigniew J. Czech a,*, George Havas b, Bohdan S. Majewski c

a Silesia University of Technology, 44-100 Gliwice, Poland
b University of Queensland, Queensland 4072, Australia
c University of Newcastle, Callaghan, NSW 2308, Australia

Communicated by M. Nivat

Contents

Preface
1. Introduction to perfect and minimal perfect hashing
   1.1. Basic definitions
   1.2. Trial and error
   1.3. Space and time requirements
   1.4. Bibliographic remarks
2. Number theoretical solutions
   2.1. Introduction
   2.2. Quotient and remainder reduction
   2.3. Reciprocal hashing
   2.4. A Chinese remainder theorem method
   2.5. A letter-oriented scheme
   2.6. A Chinese remainder theorem based variation
   2.7. Another Chinese remainder theorem based variation
   2.8. Bibliographic remarks
3. Perfect hashing with segmentation
   3.1. Introduction
   3.2. Bucket distribution schemes
   3.3. A hash indicator table method
   3.4. Using backtracking to find a hash indicator table
   3.5. A method of very small subsets
   3.6. Bibliographic remarks
4. Reducing the search space
   4.1. Mapping, ordering, searching
   4.2. Using letter frequency
   4.3. Minimum length cycles
   4.4. Skewed vertex degree distribution
   4.5. Minimum length fundamental cycles
   4.6. A linear-time algorithm
   4.7. Quadratic minimal perfect hashing
   4.8. Large and small buckets
   4.9. Bibliographic remarks

* Corresponding author. E-mail: [email protected], web: http://sun.zo.iinf.polsl.gliwice.pl./~zjc.

0304-3975/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved
PII S0304-3975(96)00146-6



5. Perfect hashing based on sparse table compression
   5.1. Introduction
   5.2. A single displacement method
   5.3. A double displacement method
   5.4. A letter-oriented method
   5.5. Another letter-oriented method
   5.6. Bibliographic remarks
6. Probabilistic perfect hashing
   6.1. Introduction
   6.2. The FKS algorithm and its modifications reinterpreted
   6.3. A random graph approach
   6.4. Random hypergraphs methods
   6.5. Modifications
   6.6. Advanced applications
   6.7. Graph-theoretic obstacles
   6.8. Bibliographic remarks
7. Dynamic perfect hashing
   7.1. Introduction
   7.2. The FKS based method
   7.3. A real-time dictionary
   7.4. Bibliographic remarks
Appendix. Notation index
Acknowledgements
References


Preface

Let $S$ be a set of $n$ distinct keys belonging to some universe $U$. We would like to store the keys of $S$ so that membership queries of the form "Is $x$ in $S$?" can be answered quickly. This searching problem, also called the dictionary problem, is ubiquitous in computer science applications.

Various algorithms to solve the dictionary problem have been devised. Some of them are based on comparisons of keys. For example, binary search using a linear ordering of keys compares the given key $x$ with a key $x_i$ in the table. Depending on the result of the comparison ($x < x_i$, $x = x_i$, $x > x_i$), the search is continued in one of three different ways. The idea of comparisons between keys is also used in the methods involving binary search trees and AVL trees. Another approach to searching makes use of the representation of keys as sequences of digits or alphabetic characters. From the first digit of a given key a subset of keys which begins with that digit is determined. The process can be then repeated for the subset and subsequent digits of the key. This digitally governed branching procedure is called a digital search. Yet another approach to the dictionary problem is to compute a function $h(x)$ which determines the location of a key in a table. This approach has led to a class of very efficient searching methods commonly known as hashing or scatter storage techniques. If a set of keys is static, then it is possible to compute a function $h(x)$ which enables us to find any key in the table in just one probe. Such a function is said to be a perfect hash function. A perfect hash function which allows us to store a set of records in a minimum amount of memory, i.e. in a table of the size equal to the number of keys times the key size, is called a minimal perfect hash function.

Minimal perfect hash functions have a wide range of applications. They are used for memory efficient storage and fast retrieval of items from static sets, such as reserved words in programming languages, command names in interactive systems, or commonly used words in natural languages. Therefore there is application for minimal perfect hash functions in compilers, operating systems, language translation systems, hypertext, hypermedia, file managers, and information retrieval systems.

This work is a monograph on perfect and minimal perfect hashing. It is intended to serve researchers and professionals of computer science. Some parts of this text can be used as an introduction to the subject for students. All three authors have covered material from this work in Algorithms and Data Structures courses taught to higher level undergraduate or to starting postgraduate students. Researchers will find in this work the current state of developments in perfect and minimal perfect hashing. Bibliographical remarks facilitate further reading. With practitioners in mind, we have included many examples of minimal perfect hash functions which can be readily implemented in practice. We also indicate where some implementations are available publicly on the Internet (see Section 6.6). It is assumed that the reader possesses some background in the design and analysis of algorithms.

The work comprises seven chapters plus an appendix and a comprehensive bibliography. Ch. 1 is an introduction to perfect and minimal perfect hashing. Section 1.1 contains basic definitions. In Section 1.2 we present a simple example of finding a minimal perfect hash function by trial and error. This example shows the kinds of difficulties which are encountered in designing such a function. Section 1.3 discusses the space and time requirements for perfect hash functions. This section is included in the Introduction because it is fundamental, but it is quite theoretical and difficult. The reader may safely skim it on first reading without causing difficulties in understanding later sections.

The remaining chapters constitute a detailed study of existing methods of perfect and minimal perfect hashing. Ch. 2 describes the algorithm proposed by Sprugnoli, which came from consideration of theoretical properties of sets of integers. Five different descendants of it are then presented. Even though each tries to overcome drawbacks of its predecessor, we show that none of them has serious practical value. Ch. 3 discusses the methods based on rehashing and segmentation, i.e. breaking up the input set of keys into a number of smaller sets, for which a perfect hash function can be found. Ch. 4 presents Cichelli's method, which uses a simple heuristic approach to cut down the search space. The solutions of Sprugnoli and Cichelli, despite their flaws, inspired a number of improved methods capable of handling larger key sets. In the rest of Ch. 4 we describe Sager's algorithm, which was developed as an attempt to optimize Cichelli's method. It was the basis of various improvements which led to algorithms which have practical application to larger sets. We describe further improvements to Sager's algorithm which speed up generation of minimal perfect hash functions. Ch. 5 discusses several algorithms which construct perfect hash functions based on sparse table compression. The single and double displacement methods proposed by Tarjan and Yao are first presented. Then the modifications of the methods which make them suitable for letter-oriented keys are considered. In Ch. 6 we present three algorithms and their modifications that generate perfect hash functions by using a probabilistic approach. Some advanced applications of these methods are also discussed. Finally, in Ch. 7 we give an overview of dynamic perfect hashing. The Appendix contains the notation index.

Chapter 1. Introduction to perfect and minimal perfect hashing

1.1. Basic definitions

Let $U = \{0, 1, 2, \ldots, u-1\}$ be the universe for some arbitrary positive integer $u$, and let $S$ be a set of $n$ distinct elements (keys) belonging to $U$. A hash function is a function $h : U \to M$ that maps the keys from $S$ into some given interval of integers $M$, say $[0, m-1]$. Given a key $x \in S$, the hash function computes an address, i.e. an integer in $[0, m-1]$, for the storage or retrieval of $x$. The storage area used to store keys is known as a hash table. (In practice, the goal is to handle the information contained in the record associated with a key. We simplify the problem by considering only the table of keys and assuming that once a key is located or placed in the table, associated information can be easily found or stored by use of pointers.) Keys for which the same address is computed are called synonyms. Due to the existence of synonyms, a situation called collision may arise, in which two different keys have the same address. Various schemes for resolving collisions are known.
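The notions of address, synonym and collision can be illustrated with a minimal sketch. The hash function `h(x) = x mod m`, the key set `S` and the table sizes below are assumptions chosen for illustration only; they are not taken from the text.

```python
from collections import defaultdict


def h(x, m):
    """Illustrative hash function (an assumption): address = x mod m."""
    return x % m


def synonym_groups(keys, m):
    """Group keys by computed address and keep only groups of synonyms,
    i.e. addresses shared by two or more keys (collisions)."""
    groups = defaultdict(list)
    for x in keys:
        groups[h(x, m)].append(x)
    return {addr: xs for addr, xs in groups.items() if len(xs) > 1}


S = [3, 14, 25, 38, 47]          # hypothetical key set
print(synonym_groups(S, 7))      # keys 3 and 38 share address 3
print(synonym_groups(S, 10))     # no synonyms for this table size
```

With a table of size 7 the keys 3 and 38 are synonyms, while with size 10 every key of this particular set receives its own address.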

A perfect (or 1-probe) hash function for $S$ is a function $h : U \to [0, m-1]$ that is injective on $S$, i.e. for all $x, y \in S$ such that $x \neq y$ we have $h(x) \neq h(y)$, which implies that $m \geq n$. If $m = n$ and $h$ is perfect, then we say that $h$ is a minimal perfect hash function. As the definition implies, a perfect hash function transforms each key of $S$ into a unique address in the hash table. Since no collisions occur, each key can be retrieved from the table in a single probe. We also say that a minimal perfect hash function perfectly scatters the keys of $S$ in the hash table.

The function $h$ is said to be order preserving if for any pair of keys $x_i, x_j \in S$, $h(x_i) < h(x_j)$ if and only if $i < j$. In other words, we assume that keys in $S$ are arranged in some sequential order, and the function preserves this order in the hash table. Such a function is called an ordered minimal perfect hash function. Perfect and minimal perfect hashing is readily suitable only for static sets, i.e. sets in which no deletion and insertion of elements occurs.

As defined above, the keys to be placed in the hash table are nonnegative integers. However, it is often the case that keys are sequences of characters over some finite and ordered alphabet $\Sigma$. To deal with such a key, a hashing scheme can either convert it to an integer by using the ordinal numbers of characters in $\Sigma$, or apply a character by character approach. An example of using the latter method is given in Section 1.2.

The question arises whether for a given set of keys an ordered minimal perfect hash function always exists. The answer to this question is positive, as it is straightforward to prove that any two finite, equal size and linearly ordered sets, say $X$ and $Y$, are isomorphic. This means that there exists an injective function $h : X \to Y$ transforming $X$ into $Y$ such that for all $x_1, x_2 \in X$, $x_1 \leq x_2$ implies $h(x_1) \leq h(x_2)$.
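The three definitions (perfect, minimal perfect, order preserving) can be written as small predicates. The sketch below is an illustration under stated assumptions: the key set `S`, the function `h_mod10` and the helper names are hypothetical, and `ordered_mphf_by_rank` simply stores the position of every key, which mirrors the existence argument for ordered sets but is not one of the space-efficient constructions studied in this work.

```python
def is_perfect(h, S, m):
    """h is perfect for S if all addresses lie in [0, m-1] and are distinct."""
    addresses = [h(x) for x in S]
    return all(0 <= a < m for a in addresses) and len(set(addresses)) == len(S)


def is_minimal_perfect(h, S, m):
    """A perfect hash function is minimal when the table size m equals n = |S|."""
    return m == len(S) and is_perfect(h, S, m)


def is_order_preserving(h, S):
    """S lists the keys in their assumed sequential order x_1, x_2, ...;
    h preserves it when the addresses are strictly increasing along S."""
    addresses = [h(x) for x in S]
    return all(a < b for a, b in zip(addresses, addresses[1:]))


def ordered_mphf_by_rank(S):
    """Table-driven ordered minimal perfect hash: key -> its position in S."""
    rank = {x: i for i, x in enumerate(S)}
    return rank.__getitem__


def h_mod10(x):
    """Illustrative hash function (an assumption): x mod 10."""
    return x % 10


S = [3, 14, 25, 38, 47]            # hypothetical key set, in its given order
h_rank = ordered_mphf_by_rank(S)

print(is_perfect(h_mod10, S, 10), is_minimal_perfect(h_mod10, S, 10))
print(is_order_preserving(h_mod10, S))
print(is_minimal_perfect(h_rank, S, len(S)), is_order_preserving(h_rank, S))
```

For this key set `h_mod10` is perfect for a table of size 10 but neither minimal nor order preserving, whereas the rank-based function is an ordered minimal perfect hash function at the cost of storing all keys explicitly.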

... $- \log\log m - 1$.

Proof. (a) The number of distinct subsets of $U$ of size $n$ is $\binom{u}{n}$. Any perfect hash function can be perfect for at most $(u/m)^n \binom{m}{n}$ subsets. This result comes from considering the number of elements that can be hashed into a given slot of a hash table by a given hash function, and then maximizing the product of these numbers for all $m$ slots. There are $\binom{m}{n}$ ways to select $n$ out of $m$ places in the hash table. Denote by $h^{-1}(i)$ the subset of elements of $U$ hashed by some hash function $h$ into the $i$th location of the hash table. Thus the number of subsets for which $h$ can be perfect is bounded by

$$\sum_{0 \leq i_1 < i_2 < \cdots < i_n < m} |h^{-1}(i_1)| \times |h^{-1}(i_2)| \times \cdots \times |h^{-1}(i_n)|$$
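The counting step can be checked by brute force on a small instance. The sketch below assumes illustrative values $u = 12$, $m = 4$, $n = 3$ and two hypothetical hash functions (`balanced` and `skewed`); none of these come from the text. It counts the $n$-subsets of $U$ on which a function is perfect, compares the count with the sum of products of slot sizes, and with the bound $(u/m)^n \binom{m}{n}$.

```python
from itertools import combinations
from math import comb, prod


def perfect_subset_count(h, u, n):
    """Brute force: number of n-subsets of U = {0..u-1} on which h is injective."""
    return sum(1 for subset in combinations(range(u), n)
               if len({h(x) for x in subset}) == n)


def product_sum(h, u, m, n):
    """Sum over slot choices i_1 < ... < i_n of |h^-1(i_1)| * ... * |h^-1(i_n)|."""
    sizes = [sum(1 for x in range(u) if h(x) == i) for i in range(m)]
    return sum(prod(chosen) for chosen in combinations(sizes, n))


def upper_bound(u, m, n):
    """The bound (u/m)^n * C(m, n), attained when all slots hold u/m elements."""
    return (u / m) ** n * comb(m, n)


u, m, n = 12, 4, 3                      # illustrative sizes (assumptions)


def balanced(x):
    return x % m                        # every slot receives u/m = 3 elements


def skewed(x):
    return min(x // 5, m - 1)           # slot sizes 5, 5, 2, 0


for h in (balanced, skewed):
    print(perfect_subset_count(h, u, n), product_sum(h, u, m, n), upper_bound(u, m, n))
```

In this instance the evenly spread function reaches the bound exactly, and the skewed one stays well below it, which is consistent with the product being maximized when the slot sizes are equal.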