ELSEVIER
Theoretical Computer Science 182 (1997) 1-143
Fundamental Study
Perfect hashing
Zbigniew J. Czech a,*, George Havas b, Bohdan S. Majewski c
a Silesia University of Technology, 44-100 Gliwice, Poland
b University of Queensland, Queensland 4072, Australia
c University of Newcastle, Callaghan, NSW 2308, Australia
Communicated by M. Nivat
Contents
Preface
1. Introduction to perfect and minimal perfect hashing
1.1. Basic definitions
1.2. Trial and error
1.3. Space and time requirements
1.4. Bibliographic remarks
2. Number theoretical solutions
2.1. Introduction
2.2. Quotient and remainder reduction
2.3. Reciprocal hashing
2.4. A Chinese remainder theorem method
2.5. A letter-oriented scheme
2.6. A Chinese remainder theorem based variation
2.7. Another Chinese remainder theorem based variation
2.8. Bibliographic remarks
3. Perfect hashing with segmentation
3.1. Introduction
3.2. Bucket distribution schemes
3.3. A hash indicator table method
3.4. Using backtracking to find a hash indicator table
3.5. A method of very small subsets
3.6. Bibliographic remarks
4. Reducing the search space
4.1. Mapping, ordering, searching
4.2. Using letter frequency
4.3. Minimum length cycles
4.4. Skewed vertex degree distribution
4.5. Minimum length fundamental cycles
4.6. A linear-time algorithm
4.7. Quadratic minimal perfect hashing
4.8. Large and small buckets
4.9. Bibliographic remarks
5. Perfect hashing based on sparse table compression
5.1. Introduction
5.2. A single displacement method
5.3. A double displacement method
5.4. A letter-oriented method
5.5. Another letter-oriented method
5.6. Bibliographic remarks
6. Probabilistic perfect hashing
6.1. Introduction
6.2. The FKS algorithm and its modifications reinterpreted
6.3. A random graph approach
6.4. Random hypergraphs methods
6.5. Modifications
6.6. Advanced applications
6.7. Graph-theoretic obstacles
6.8. Bibliographic remarks
7. Dynamic perfect hashing
7.1. Introduction
7.2. The FKS based method
7.3. A real-time dictionary
7.4. Bibliographic remarks
Appendix. Notation index
Acknowledgements
References

* Corresponding author. E-mail: [email protected], web: http://sun.zo.iinf.polsl.gliwice.pl/~zjc.
0304-3975/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved
PII S0304-3975(96)00146-6
Preface

Let $S$ be a set of $n$ distinct keys belonging to some universe $U$. We would like to store the keys of $S$ so that membership queries of the form "Is $x$ in $S$?" can be answered quickly. This searching problem, also called the dictionary problem, is ubiquitous in computer science applications.
Various algorithms to solve the dictionary problem have been devised. Some of them are based on comparisons of keys. For example, binary search using a linear ordering of keys compares the given key $x$ with a key $x_i$ in the table. Depending on the result of the comparison ($x < x_i$, $x = x_i$, $x > x_i$), the search is continued in one of three different ways. The idea of comparisons between keys is also used in the methods involving binary search trees and AVL trees. Another approach to searching makes use of the representation of keys as sequences of digits or alphabetic characters. From the first digit of a given key a subset of keys which begins with that digit is determined. The process can then be repeated for the subset and subsequent digits of the key. This digitally governed branching procedure is called a digital search. Yet another approach to the dictionary problem is to compute a function $h(x)$ which determines the location of a key in a table. This approach has led to a class of very efficient searching methods commonly known as hashing or scatter storage techniques. If a set of keys is static, then it is possible to compute a function $h(x)$ which enables us to find any key in the table in just one probe. Such a function is said to be a perfect hash function. A perfect hash function which allows us to store a set of records in a minimum amount of memory, i.e. in a table of the size equal to the number of keys times the key size, is called a minimal perfect hash function.

Minimal perfect hash functions have a wide range of applications. They are used for memory efficient storage and fast retrieval of items from static sets, such as reserved words in programming languages, command names in interactive systems, or commonly used words in natural languages. Therefore there is application for minimal perfect hash functions in compilers, operating systems, language translation systems, hypertext, hypermedia, file managers, and information retrieval systems.
This work is a monograph on perfect and minimal perfect hashing. It is intended to serve researchers and professionals of computer science. Some parts of this text can be used as an introduction to the subject for students. All three authors have covered material from this work in Algorithms and Data Structures courses taught to higher level undergraduate or to starting postgraduate students. Researchers will find in this work the current state of developments in perfect and minimal perfect hashing. Bibliographical remarks facilitate further reading. With practitioners in mind, we have included many examples of minimal perfect hash functions which can be readily implemented in practice. We also indicate where some implementations are available publicly on the Internet (see Section 6.6).

It is assumed that the reader possesses some background in the design and analysis of algorithms. The work comprises seven chapters plus an appendix and a comprehensive bibliography. Ch. 1 is an introduction to perfect and minimal perfect hashing. Section 1.1 contains basic definitions. In Section 1.2 we present a simple example of finding a minimal perfect hash function by trial and error. This example shows the kinds of difficulties which are encountered in designing such a function. Section 1.3 discusses the space and time requirements for perfect hash functions. This section is included in the Introduction because it is fundamental, but it is quite theoretical and difficult. The reader may safely skim it on first reading without causing difficulties in understanding later sections.

The remaining chapters constitute a detailed study of existing methods of perfect and minimal perfect hashing. Ch. 2 describes the algorithm proposed by Sprugnoli which came from consideration of theoretical properties of sets of integers. Five different descendants of it are then presented. Even though each tries to overcome drawbacks of its predecessor, we show that none of them has serious practical value. Ch. 3 discusses the methods based on rehashing and segmentation, i.e. breaking up the input set of keys into a number of smaller sets, for which a perfect hash function can be found. Ch. 4 presents Cichelli's method which uses a simple heuristic approach to cut down the search space. The solutions of Sprugnoli and Cichelli, despite their flaws, inspired a number of improved methods capable of handling larger key sets. In the rest of Ch. 4 we describe Sager's algorithm which was developed as an attempt to optimize Cichelli's method. It was the basis of various improvements which led to algorithms which have practical application to larger sets. We describe further improvements to Sager's algorithm which speed up generation of minimal perfect hash functions. Ch. 5 discusses several algorithms which construct perfect hash functions based on sparse table compression. The single and double displacement methods proposed by Tarjan and Yao are first presented. Then the modifications of the methods which make them suitable for letter-oriented keys are considered. In Ch. 6 we present three algorithms and their modifications that generate perfect hash functions by using a probabilistic approach. Some advanced applications of these methods are also discussed. Finally in Ch. 7 we give an overview of dynamic perfect hashing. The Appendix contains the notation index.
Chapter 1. Introduction to perfect and minimal perfect hashing

1.1. Basic definitions

Let $U = \{0, 1, 2, \ldots, u-1\}$ be the universe for some arbitrary positive integer $u$, and let $S$ be a set of $n$ distinct elements (keys) belonging to $U$. A hash function is a function $h : U \to M$ that maps the keys from $S$ into some given interval of integers $M$, say $[0, m-1]$. Given a key $x \in S$, the hash function computes an address, i.e. an integer in $[0, m-1]$, for the storage or retrieval of $x$. The storage area used to store keys is known as a hash table. (In practice, the goal is to handle the information contained in the record associated with a key. We simplify the problem by considering only the table of keys and assuming that once a key is located or placed in the table, associated information can be easily found or stored by use of pointers.) Keys for which the same address is computed are called synonyms. Due to the existence of synonyms, a situation called collision may arise, in which two different keys have the same address. Various schemes for resolving collisions are known.
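To make the notions of address, synonym and collision concrete, the following Python sketch groups keys by their computed address. The hash function `h(x) = x mod m`, the key set `S` and the table size `m = 7` are illustrative assumptions and are not taken from the text.

```python
from collections import defaultdict


def h(x: int, m: int) -> int:
    """Illustrative hash function (an assumption): key modulo table size m."""
    return x % m


def synonym_classes(keys, m):
    """Group keys by address; every group with two or more keys is a set of synonyms."""
    buckets = defaultdict(list)
    for x in keys:
        buckets[h(x, m)].append(x)
    # Keep only addresses at which a collision occurs.
    return {addr: xs for addr, xs in buckets.items() if len(xs) > 1}


S = [3, 12, 17, 26, 40]  # hypothetical key set
m = 7                    # hypothetical table size
collisions = synonym_classes(S, m)
print(collisions)  # {3: [3, 17], 5: [12, 26, 40]}
```

Here the keys 3 and 17 are synonyms (both hash to address 3), as are 12, 26 and 40 (address 5), so this particular `h` would need a collision resolution scheme for `S`.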
A perfect (or 1-probe) hash function for $S$ is a function $h : U \to [0, m-1]$ that is injective on $S$, i.e. for all $x, y \in S$ such that $x \neq y$ we have $h(x) \neq h(y)$, which implies that $m \geq n$. If $m = n$ and $h$ is perfect, then we say that $h$ is a minimal perfect hash function. As the definition implies, a perfect hash function transforms each key of $S$ into a unique address in the hash table. Since no collisions occur, each key can be retrieved from the table in a single probe. We also say that a minimal perfect hash function perfectly scatters the keys of $S$ in the hash table. The function $h$ is said to be order preserving if for any pair of keys $x_i, x_j \in S$, $h(x_i) < h(x_j)$ if and only if $i < j$. In other words, we assume that keys in $S$ are arranged in some sequential order, and the function preserves this order in the hash table. Such a function is called an ordered minimal perfect hash function. Perfect and minimal perfect hashing is readily suitable only for static sets, i.e. sets in which no deletion and insertion of elements occurs.

As defined above, the keys to be placed in the hash table are nonnegative integers. However, it is often the case that keys are sequences of characters over some finite and ordered alphabet $\Sigma$. To deal with such a key, a hashing scheme can either convert it to an integer by using the ordinal numbers of characters in $\Sigma$, or apply a character by character approach. An example of using the latter method is given in Section 1.2.

The question arises whether for a given set of keys an ordered minimal perfect hash function always exists. The answer to this question is positive as it is straightforward to prove that any two finite, equal size and linearly ordered sets, say $X$ and $Y$, are isomorphic. This means that there exists an injective function $h : X \to Y$ transforming $X$ into $Y$ such that $\forall x_1, x_2 \in X$, $x_1 < x_2 \Rightarrow h(x_1) < h(x_2)$.
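The definitions of this section can be turned into simple checks. The Python sketch below is only an illustration under stated assumptions: the helper names (`is_perfect`, `is_minimal_perfect`, `is_order_preserving`, `rank_hash`) and the key set `S` are hypothetical. `rank_hash` mirrors the isomorphism argument by mapping the $i$th key of $S$ to address $i$; it relies on an explicit lookup table holding every key, so it demonstrates existence only and is not a practical construction.

```python
def is_perfect(h, S, m):
    """True if h maps the keys of S injectively into [0, m - 1]."""
    addresses = [h(x) for x in S]
    in_range = all(0 <= a < m for a in addresses)
    return in_range and len(set(addresses)) == len(S)


def is_minimal_perfect(h, S):
    """A perfect hash function is minimal when the table size m equals n = |S|."""
    return is_perfect(h, S, len(S))


def is_order_preserving(h, S):
    """S lists the keys x_1, ..., x_n in their assumed sequential order.

    h(x_i) < h(x_j) iff i < j holds exactly when addresses strictly increase.
    """
    addresses = [h(x) for x in S]
    return all(a < b for a, b in zip(addresses, addresses[1:]))


def rank_hash(S):
    """Ordered minimal perfect hash by explicit table: the i-th key gets address i."""
    table = {x: i for i, x in enumerate(S)}
    return lambda x: table[x]


S = [4, 9, 15, 22, 31]           # hypothetical key set, in its sequential order
mod10 = lambda x: x % 10         # addresses 4, 9, 5, 2, 1
ranked = rank_hash(S)            # addresses 0, 1, 2, 3, 4

print(is_perfect(mod10, S, 10))        # True: no collisions with m = 10
print(is_minimal_perfect(mod10, S))    # False: addresses exceed n - 1 = 4
print(is_minimal_perfect(ranked, S))   # True
print(is_order_preserving(ranked, S))  # True
```

For this `S`, reduction modulo 10 is perfect for a table of size 10 but neither minimal nor order preserving, whereas the rank-based function satisfies all three properties.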
- loglogm
- 1. >
Proof. (a) The number of distinct subsets of $U$ of size $n$ is $\binom{u}{n}$. Any perfect hash function can be perfect for at most $\left(\frac{u}{m}\right)^n \binom{m}{n}$ subsets. This result comes from considering the number of elements that can be hashed into a given slot of a hash table by a given hash function, and then maximizing the product of these numbers for all $m$ slots. There are $\binom{m}{n}$ ways to select $n$ out of $m$ places in the hash table. Denote by $h^{-1}(i)$ the subset of elements of $U$ hashed by some hash function $h$ into the $i$th location of the hash table. Thus the number of subsets for which $h$ can be perfect is bounded by

$$\sum_{0 \le i_1 < i_2 < \cdots < i_n \le m-1} |h^{-1}(i_1)| \times |h^{-1}(i_2)| \times \cdots \times |h^{-1}(i_n)|.$$
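The counting step above can be checked by brute force on a small instance. The Python sketch below is an illustration under assumed parameters ($u = 12$, $m = 4$, $n = 3$) and two hypothetical hash functions, `h_even` and `h_skew`; none of these come from the text. It counts the $n$-subsets on which each function is injective and compares the count with the sum of products of slot sizes and with the bound $(u/m)^n \binom{m}{n}$.

```python
from itertools import combinations
from math import comb, prod


def count_perfect_subsets(h, u, n):
    """Brute force: number of n-subsets of U = {0, ..., u-1} on which h is injective."""
    return sum(1 for T in combinations(range(u), n)
               if len({h(x) for x in T}) == n)


def bucket_product_sum(h, u, m, n):
    """Sum over slots i_1 < ... < i_n of |h^-1(i_1)| * ... * |h^-1(i_n)|."""
    sizes = [0] * m
    for x in range(u):
        sizes[h(x)] += 1
    return sum(prod(sizes[i] for i in slots)
               for slots in combinations(range(m), n))


u, m, n = 12, 4, 3                               # assumed small parameters
h_even = lambda x: x % m                         # slot sizes 3, 3, 3, 3
h_skew = lambda x: 0 if x < 6 else 1 + x % 3     # slot sizes 6, 2, 2, 2

bound = (u / m) ** n * comb(m, n)                # (u/m)^n * C(m, n) = 108.0

for name, h in (("even", h_even), ("skew", h_skew)):
    exact = count_perfect_subsets(h, u, n)
    formula = bucket_product_sum(h, u, m, n)
    print(name, exact, formula, exact <= bound)
# even 108 108 True
# skew 80 80 True
```

In this instance the evenly spread function attains the bound, while the skewed one falls short of it, which matches the remark that the product is maximized when the slots are filled as equally as possible. Since there are $\binom{12}{3} = 220$ subsets in total, no single function is perfect for all of them here.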