GRAB - Inverted Indexes with Low Storage Overhead. Michael Lesk Bellcore.
ABSTRACT: A searching command (grab) for main- taining indexes combines ...
GRAB
- Inverted Indexes
with Low Storage Overhead Michael Lesk Bellcore
ABSTRACT: A searching command (grab) for maintaining indexes combines acceptably fast searching with very low storage overhead. It looks like grep except that it demands a preindexing pass, looks only for whole words, and runs faster. As an example of performance, consider the time to search for single words in a 7.8 Mbyte file (the Brown corpus of English). The times below are in seconds on a DEC 8600 running Ultrix; the space overhead is given as a percentage ofthe original file.
word
No. uses
Shakespeare
29
tt3?HPu* t,,1"*t0""* 31.3
0.4
ttåJiil.. 0.1
Dickens
J
31.6 0
1.2
Chaucer
I
31.4
0.8
7o/o
0.1 970/o 0.1
Time and Space Costs
In these examples, grab gets 980/o of the time savings for only 7o/o of the space costs of using B-trees. The preprocessing time lor grab is also less, about lß of that required for the B-trees. Grab contains the following three ideas: (a) storing an inverted file by hash codes rather than words; (b) keeping a bit vector of blocks in which the items appear, rather than pointers to bytes; and (c) using a fixed length codeword compression scheme on the sparse vectors. @
Computing Systems, Vol.
I . No. 3 . Summer 1988 207
1. Usage full text of A Study in Scarlet in a file Hl, about 269,000 bytes long. If no one else has made the index, you type imake tlt and about I minute later there is a file Hl.sg which contains 14,000 bytes. Then you type srab tobacco Suppose you happen to have the
and, 0.28 seconds of CPU time later, two lines are typed on your terminal: H1
019#06 the ground.
You don't mind the smell of strong
tobacco, I hope?" 033#23 at a glance the ash of any known brand either cigar or of tobacco. It is just in
of
If, instead, you had used egrep you would not have had to make the index in advance, but the search would take 14 times as long (That may sound bad, but is only 4 seconds. This file is really too short to need grab). If, instead, you had used B-trees (a similar package is available, with indexer named bmake and searcher bgrab), the search would run in 0.08 CPU seconds but the index frle would take 201,750 bytes and making it requires 3 minutes. Admittedly the search is faster, but can you tell the difference between 0.3 seconds and 0.1 seconds when the shell overhead is about 1.0 seconds? More important, would you rather have the search go faster than 0.3 seconds, or save 187,000 bytes of disk space?
There are a few options to these programs. When executing grab, the -y option folds upper and lower case in searching. Specifying -n prints the name of the flle from which each printed line comes. And timing information on the search is printed when -r is specified.
208
Michael Lesk
For imake, the -bH option sets the block size to N, and the option makes N the hash table size (see below for an explanation). The ordinary user shouldn't need these options. hru
There is also a set of library routines using the same software. This permits use of the same kind of data retrieval within a program. The entry points and their usage are:
fd = gopen(fitename); char *fd, *fiIename; gopen opens a file which should have an associatedf lename.sg index file, made by the imake routine. It returns NULL on failure and nonzero on success.
p = gseek (word, fd); int p; char *word, *fd; gseek returns a pointer
to the first byte position within filename where word is found. It returns -t if word is not found in the file.
p=gnext (fd); char *fd; int p; gnext returns a pointer to the next byte position within the file where the key is found. It returns -t when it runs out of places.
gcIose (fd); char *fd; gclose closes the files.
Thus a typical loop to frnd all references to cat in a frle named dog (and on which imake dog has previously been run) is:
int
p;
char *gf; gf = gopen("dogn)-
for (p=gseek(rrcatrr, 9f); p)=0; p=gnext(9f)) g. r;r; (;f ) ; As another example of timings, the full American Academic Encyclopedia is 54 Mbytes long; making a set of indexes that use 4.7 Mbytes, or 80/o more storage, results in the following search costs:
GRAB
-
Inyerted Indexes with Low Storage Overhead
209
CPU time (on an 8650)
No.
Hits
Grab
Egrep Grep
qwerty
0
5.8
131.9
quarto
J
9.8
127.5
224.1
Fgrep
281.2
Unfortunately, although the ration of real time to CPU time (lightly loaded system) for egrep is about 2:l,the ratio for grab is about 6:1, so that the response time advantage for grab is only a factor of 4 or 5 instead of t5. Just finding out the line where the word appears isn't that useful; you'd really like to know the title of the encyclopedia article. Using the grab library routines, a 76 line C program sufficed to look backwards for the article name, and it frnds the article reference plus the contents of the line in a similar 1.1 minutes of CPU time. That may not be real-time response, but it is a lot better than grep, and the output is more useful.
2. Introduction and Discussion We have a number of large text files in which we would like to look up words quickly. These include the Brown corpus of English (7.8 Mbytes), the World Almanac (10.6 Mbytes) the Ameri' can Academic Encyclopedia (56 Mbytes), the LATA Switching System Generic Requirements (4.7 Mbytes), and two collegiate-level dictionaries (15.6 Mbytes for the webster's 7th New collegiate and 6.5 Mbytes for the Oxford Advanced Learner's Dictionary). The traditional UNIX solution is to use grep. The UNIX religion teaches that grep is all you need for a DBMS, except for really large files where you might need egrep. However, "latge" in that sense does not include the files above. Even on a DEC 8600, for example, to egrep through the Brown corpus on an idle machine takes 32 seconds. Thus it is desirable to have some kind of index. This is not exactly new, and various packages have been written in the past (including by the author) to make and maintain such inverted indexes ILesk 1977]. The usual inverted index, however, roughly doubles the storage space required. The Weinberger B-tree package, [Weinberger l98l] for example, is easily adapted to store a word index
210
Michael Lesk
to large texts, but the overhead ofthe tree is roughly equal to the size of the text file originally being stored. Since most UNIX systems are always running out of disk space, it is desirable to avoid such high space overheads. The uses ofbit vectors for such purposes has been studied before, although usually with the aim of imitating edge-notched cards for modern hardware. A good study of a comparable system to this one is given by Choueka [Choueka et al. 19861. Thus we come to grab. It combines low storage overhead (