On the Mapping of Index Compression Techniques on CSR Information Retrieval¹
Sterling Stuart Stein
Computer Science Department and Information Retrieval Laboratory
Illinois Institute of Technology
[email protected]
Abstract
Information retrieval is the selection of documents relevant to a query. The inverted index is the conventional way to store the index of a collection. Because of the large amounts of data involved, compression techniques are commonly used in information retrieval systems to reduce the size of the inverted index. We experimentally evaluate the mapping of such techniques onto Compressed Sparse Row (CSR) information retrieval (IR). Our experimental results, using compression techniques such as Elias Gamma, Golomb, Interpolative, and fixed-length Byte-Aligned, demonstrate that such techniques can easily be applied to compress the index in CSR IR.
1. Introduction
The inverted index was shown to be the processing scheme of choice for ad-hoc queries due to its reduced I/O demands [1]. The storage space for an inverted index structure, however, is not necessarily smaller than the original text if various data and all term offsets are stored. If only the document locations are stored in the index, the size of the index is roughly 30% of the uncompressed text collection. To save storage space and reduce the amount of I/O in query processing, different compression techniques for inverted indexes have been proposed. Among these techniques are fixed-length Byte-Aligned index compression [2], which reduces the index size to roughly 15% of the index, and various variable-length compression techniques [3, 4, 5], which reduce the index size to roughly 10% of the original index. A comparison of the compression ratios and query processing times of several different compression schemes is given in [5].
Nazli Goharian Computer Science Department and Information Retrieval Laboratory Illinois Institute of Technology
[email protected]
CSR Information Retrieval, initially proposed in [6] as an alternative indexing approach to the conventional inverted index, uses a sparse matrix-vector multiplication algorithm to perform query processing. This approach demonstrated efficient results in a parallel implementation of an information retrieval system [7]. In addition to the parallel implementation, the storage requirement for a compressed sparse row matrix implementation compared favorably to that of the inverted index [8]. However, no additional compression, as is commonly applied to inverted indexes, was evaluated.

The remainder of the paper is organized as follows. We present the prior work in Section 2. In Section 3, we describe our experimental framework and results. Finally, we conclude the paper in Section 4.
2. Prior Work
2.1. CSR Information Retrieval
Compressed Sparse Row (CSR) Information Retrieval (IR) was proposed as an alternative approach to the conventional inverted index for storing the text collection index. The inverted index storage structure has two components. One component, the term index or lexicon, stores all unique terms in the text collection, each with a pointer to the first element of the second component of the index, called the posting list. The posting list is the list of all documents that contain a given term. It stores the document identifiers of the documents containing a given term along with the frequency of occurrence of that term in each document. The data structure of the index in CSR IR contains three arrays: a real array non-zero-vector (1:nz) to store the non-zero elements of the sparse matrix row-wise, an
¹ This work is supported in part by the National Science Foundation under contract #0119469.
integer array column-vector (1:nz) to store the column indices of the elements of non-zero-vector, and an array for storing the row information, row-vector (1:m+1). The value of the i-th entry of row-vector indicates the position of the first element of the i-th row in non-zero-vector and column-vector. The parameter nz stands for the number of non-zero elements, and m stands for the number of rows [9]. The text collection was mapped into this index structure by storing the weights of the terms in each document in non-zero-vector, the corresponding term identifiers in column-vector, and the document identifiers in row-vector. The sparse matrix-vector multiplication algorithm was used to perform query processing.
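The three-array layout and the matrix-vector product described above can be illustrated with a minimal sketch. It assumes that each row is a document, each column is a term identifier, and the query is a dense vector of term weights. The function names build_csr and score_documents are hypothetical, and the indices are 0-based rather than the 1-based ranges used in the text.

```python
from typing import Dict, List, Tuple


def build_csr(docs: List[Dict[int, float]]) -> Tuple[List[float], List[int], List[int]]:
    """Build (non_zero, column, row) from per-document {term_id: weight} maps.

    Hypothetical helper: one row per document, one column per term identifier.
    """
    non_zero: List[float] = []   # term weights, stored row-wise
    column: List[int] = []       # term identifier of each stored weight
    row: List[int] = [0]         # row[i] = index of the first entry of document i
    for weights in docs:
        for term_id in sorted(weights):
            non_zero.append(weights[term_id])
            column.append(term_id)
        row.append(len(non_zero))  # also marks the end of the current document
    return non_zero, column, row


def score_documents(non_zero: List[float], column: List[int], row: List[int],
                    query: List[float]) -> List[float]:
    """Sparse matrix-vector product: one relevance score per document."""
    scores: List[float] = []
    for doc in range(len(row) - 1):
        total = 0.0
        for k in range(row[doc], row[doc + 1]):
            total += non_zero[k] * query[column[k]]
        scores.append(total)
    return scores
```

For three documents with weights {0: 1.0, 2: 2.0}, {1: 3.0}, and {} (an empty document), build_csr yields row = [0, 2, 3, 3]. The extra final entry is why the row array has m+1 elements.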
2.2. Index Compression Techniques
We describe several of the index compression techniques that are used to compress an inverted index. Among these methods are the Flat Huffman, Byte-Aligned, Interpolative, Elias Gamma, and Golomb compression schemes.

In a Flat Huffman compression scheme [10, 11], a straight binary encoding of a number in the range 0 to n takes $\lfloor \log_2 n \rfloor + 1$ bits. The last bit is not always necessary. To encode without wasting bits, a flat encoding can be used. It is equivalent to having a Huffman tree with all of the leaves within one level of each other. Given a number n in the range 1 to m (m not a power of 2), n can be encoded and decoded as follows:

Let $n_c = m - n$, the reversed value of n
Let $o = \lfloor \log_2 m \rfloor + 1$, the length of the simple binary encoding
Let $d = 2^o - m$, the number of leaves in the Huffman tree at height o-1
* The number of leaves at height o is $m - d$
* The number of branches at height o-1 is $2^{o-1} - d$

FlatEncode(n, m)
1) If $n_c \ge m - d$, write o-1 bits of $n_c + 2^{o-1} - m$
2) Else, write o bits of $n_c$

FlatDecode(m)
1) Get o-1 bits, put in $n_c$
2) If $n_c \ge 2^{o-1} - d$, $n = 2^{o-1} - n_c$
3) Else $n_c = 2 n_c + (\text{1 more bit})$ (for o bits total), $n = m - n_c$

This is the arithmetic version of using a flat Huffman tree.
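The flat encoding steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the comparison in both routines is read as "greater than or equal", o is taken as $\lfloor \log_2 m \rfloor + 1$, bits are represented as a string of '0' and '1' characters for readability, and the names flat_params, flat_encode, and flat_decode are hypothetical.

```python
from typing import Iterator, Tuple


def flat_params(m: int) -> Tuple[int, int]:
    """Return (o, d) for a range of m values; assumes m >= 2."""
    if m < 2:
        raise ValueError("m must be at least 2")
    o = m.bit_length()        # floor(log2(m)) + 1
    d = (1 << o) - m          # number of short (o-1 bit) codewords
    return o, d


def flat_encode(n: int, m: int) -> str:
    """Encode n in 1..m as a string of '0'/'1' characters."""
    if not 1 <= n <= m:
        raise ValueError("n must be in 1..m")
    o, d = flat_params(m)
    nc = m - n                # reversed value of n
    if nc >= m - d:
        # Short codeword: o-1 bits.
        return format(nc + (1 << (o - 1)) - m, f"0{o - 1}b")
    # Long codeword: o bits.
    return format(nc, f"0{o}b")


def flat_decode(bits: Iterator[str], m: int) -> int:
    """Decode one value in 1..m, consuming characters from an iterator of bits."""
    o, d = flat_params(m)
    nc = 0
    for _ in range(o - 1):
        nc = (nc << 1) | int(next(bits))
    if nc >= (1 << (o - 1)) - d:
        return (1 << (o - 1)) - nc          # short codeword
    nc = (nc << 1) | int(next(bits))        # read one more bit
    return m - nc
```

With m = 5, the values 1 through 5 are encoded as 11, 10, 01, 001, and 000: three 2-bit codewords and two 3-bit codewords, none of which is a prefix of another.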
The Byte-Aligned compression scheme [2] is a fixed-length encoding technique that represents each encoded integer on a byte boundary to achieve faster access. We observed that the gap calculated between every two term identifiers belonging to a given document in our collection is less than $2^{15}-1$; thus each integer needs a maximum of 2 bytes to be encoded. In our implementation of this approach, we used blocks of 7 bits, plus an additional bit to indicate whether an additional byte is needed. For example, the integer 2 would be encoded as 0 0000010. The first bit indicates that no additional byte follows. The following is the pseudocode for our implementation of compression and decompression of Byte-Aligned (BA):

BA encoding of n in the range 0 to infinity:

BAEncode(n)
1) While n > 127
   a: Write the lower 7 bits of n and turn on the topmost bit ($2^7 = 128$): Write((n & 127) + 128)
   b: Subtract 128 from n, then discard the lowest 7 bits: n = (n - 128) >> 7
2) Write n

BADecode(n)
1) n = 0, p = 0
2) Read a byte into y, loop if it is > 127
   a: Left shift y by p (multiply y by $2^p$)
   b: Add that (a) to n: n += y