Assigning Document Identifiers to Enhance Compressibility of Web Search Engines Indexes
Fabrizio Silvestri
Raffaele Perego
Salvatore Orlando
Computer Science Department University of Pisa - ITALY
Information Science and Technology Institute (CNR) PISA - ITALY
Computer Science Department University of Venice - Italy
[email protected]
[email protected]
[email protected]
ABSTRACT
Granting efficient access to the index is a key issue for the performance of Web Search Engines (WSEs). In order to enhance memory utilization and favor fast query resolution, WSEs use Inverted File (IF) indexes where the posting lists are stored as sequences of d gaps (i.e. differences between successive document identifiers) compressed using variable length encoding methods. This paper describes the use of a lightweight clustering algorithm aimed at assigning identifiers to documents in a way that minimizes the average value of the d gaps. The simulations performed on a real dataset, i.e. the Google contest collection, show that our approach yields an IF index which is, depending on the d gap encoding chosen, up to 23% smaller than one built over randomly assigned document identifiers. Moreover, we show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
Categories and Subject Descriptors E.4 [Coding and Information Theory]: Data compaction and compression; H.3.1 [Content Analysis and Indexing]: Indexing methods; H.3.2 [Information Storage]: File organization; I.7.m [Document and Text Processing]: Miscellaneous
General Terms Algorithms, Performance, Experimentation
Keywords Inverted Files, Clustering, Document Identifiers Assignment, Index Compression, Web Search Engines
1. INTRODUCTION
The processing steps of a Web Search Engine (WSE) are carried out by three distinct, cooperating modules. The Crawler
gathers Web documents and stores them into a huge repository. The Indexer abstracts the data contained within the document repository, and creates an Index granting fast access to document contents and structure. Finally, the Searcher module accepts user queries, searches the index for the documents that best match each query, and returns references to these documents in an understandable form. Web and Text Mining techniques are routinely exploited by modern WSEs at many levels in order to enhance both effectiveness and efficiency. For example, crawlers exploit Web Structure and Content Mining techniques that analyze the hyper-textual links included in the documents to drive the visit towards important pages [4], or towards pages focused on a certain topic [2]. Link analysis is nowadays also at the basis of the most effective page ranking algorithms. The Indexing phase itself can be viewed as a Web Content Mining process. Starting from a collection of unstructured documents, the Indexer extracts a large amount of information, such as the list of documents which contain a given term, and the case, the typeface, and the number of occurrences of each term within every document. The Indexer then stores the extracted information in a structured archive (i.e. the index), which is usually represented as an Inverted File (IF). The IF is the most widely adopted format for the index of a WSE due to its relatively small space occupancy and the efficiency of keyword-based query resolution [17]. An IF index consists of an array of posting lists, where each list is associated with a distinct term t and contains the doc ids (which are numerical values) of all the documents containing t. Since the doc ids of each posting list are sorted, they are usually stored with a simple difference coding technique. For example, consider the posting list (apple; 5 : 1, 4, 10, 20, 23) indicating that the term apple occurs in 5 documents having doc ids 1, 4, 10, 20, and 23, respectively. The above posting list can be rewritten as (apple; 5 : 1, 3, 6, 10, 3), where the items of the list represent d gaps, i.e. differences between successive doc ids. Storing a d gap posting list requires the coding of smaller integer values, so it benefits more from the adoption of variable length coding methods which encode smaller numbers with fewer bits. Obviously, if we could devise an assignment of the doc ids that globally minimizes the average value of the d gaps present in all the posting lists, we could reduce the space needed to store the whole index. Unfortunately, since the search space given by the set of all the permutations of n documents is exponential in n, and n is huge in the real case, finding an optimal assignment of doc ids is an intractable problem for a real collection of documents.
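As a small illustration of the d gap mechanism just described, the following Python sketch (our own toy example, not code from the systems cited above) converts a sorted posting list into d gaps and counts the bits an Elias Gamma coder would spend on them:

    def to_dgaps(postings):
        """Convert a sorted list of doc ids into d gaps: the first doc id is
        kept as is, every following entry is the difference from its
        predecessor."""
        return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

    def gamma_bits(n):
        """Number of bits used by the Elias Gamma code for a positive integer n:
        the unary length of n's binary representation plus its binary suffix."""
        return 2 * n.bit_length() - 1

    # Toy posting list for the term "apple" from the running example.
    postings = [1, 4, 10, 20, 23]
    gaps = to_dgaps(postings)                      # [1, 3, 6, 10, 3]
    print(gaps, sum(gamma_bits(g) for g in gaps), "bits with Gamma coding")

Run on the posting list of the running example, the gaps 1, 3, 6, 10, 3 cost 19 bits with Gamma coding, against the 32 or 64 bits per doc id of a fixed-width representation.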
In this paper we focus our attention on finding a scalable heuristic aimed at approximating an optimal reordering in a time proportional to the size n of the collection of documents. The proposed heuristic exploits a text clustering algorithm that reorders the collection on the basis of document similarity. The reordering is then used to assign close doc ids to similar documents, thus reducing d gap values and enhancing the compressibility of the IF index representing the collection.
To the best of our knowledge, the problem of reordering a collection of documents in order to obtain higher compression rates has been addressed by only two other works [13, 1]. Unfortunately, the approaches proposed are expensive in both time and space and seem hardly applicable to huge collections. Since efficiency in time and space is obviously very important, we exploited a lightweight clustering algorithm to assign identifiers to documents. Our clustering algorithm resembles k-means [3, 8]. The k-means algorithm works as follows. It initially chooses k documents as cluster representatives, and assigns the remaining n − k documents to one of these clusters according to a given similarity metric. New centroids for the k clusters are then recomputed, and all the documents are reassigned according to their similarity with the new k centroids. The algorithm iterates until the positions of the k centroids become stable. The main strength of this algorithm is its O(n) space occupancy. On the other hand, computing the new centroids is expensive for large values of n, and the number of iterations required to converge may be high. We thus adopted a simplified version of the k-means algorithm requiring only k steps. At each step i the algorithm selects a document among those not yet assigned and uses it as the centroid of the i-th cluster. Then it chooses among the unassigned documents the n/k − 1 ones most similar to the current centroid and assigns them to the i-th cluster. The time and space complexity of our algorithm is the same as that of a single-pass k-means, but differently from classical clustering algorithms, which produce clusters as sets of unordered items, our algorithm also devises an ordering among the members of the same cluster. Such an ordering is exploited to assign consecutive doc ids to consecutive documents belonging to the same cluster.
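To make the effect of such a doc id reassignment concrete, the short sketch below (a toy illustration under our own naming, not the actual indexing code) applies a permutation pi to the doc ids of a tiny inverted file and rebuilds the sorted posting lists on which the d gaps are then recomputed:

    def remap_index(inverted_file, pi):
        """Apply a reordering pi (old doc id -> new doc id) to every posting
        list, re-sorting each list so that d gaps can be recomputed on the
        new identifiers."""
        return {term: sorted(pi[d] for d in postings)
                for term, postings in inverted_file.items()}

    # Toy index over 5 documents (doc ids 1..5) and a hypothetical reordering
    # that places similar documents (here 1, 4, 5) next to each other.
    index = {"apple": [1, 4, 5], "pear": [2, 3], "plum": [1, 5]}
    pi = {1: 1, 4: 2, 5: 3, 2: 4, 3: 5}
    print(remap_index(index, pi))   # apple -> [1, 2, 3]: its d gaps become 1, 1, 1

In the toy example the postings of "apple" become 1, 2, 3 after remapping, so their d gaps collapse to 1, 1, 1.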
The rest of the paper is structured as follows. Section 2 discusses previous work regarding text clustering and the compression of IF indexes. Section 3 details our clustering algorithm and discusses its complexity. Section 4 describes the experiments conducted and presents the results obtained. Finally, Section 5 reports some conclusions along with a description of the research directions we plan to investigate in the near future.
2. RELATED WORK
Clustering is a widely adopted technique aimed at dividing a collection of data into disjoint groups of homogeneous elements. Clustering has been used in many different applications of Text Mining. For example, when the content of textual documents is considered, the different groups could correspond to the different topics discussed in the collection of documents. A popular example of clustering in Text Mining is the analysis of customer e-mails to find out whether there are common themes that have been overlooked. Unfortunately, the results of clustering strongly depend on the metric and the parameters used to estimate the similarity among documents. For instance, in the above application one could consider a mixture of several features such as the sender address, the set of terms contained in each document, the number and position of term
occurrences, etc. Each of the features considered may contribute in a different way to the quality of the resulting clusters. Moreover, estimating cluster quality is in general very difficult. In their book, Jain and Dubes state: "The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." [8] In our specific case, the quality of the resulting clustering structure is fortunately well-defined and simply measurable: the smaller the resulting compressed index, the better the clustering structure generating the document reordering.
It is widely accepted that data compression can improve the price/performance ratio of computing systems by allowing large amounts of data to be stored in the faster levels of the memory hierarchy [14]. This remains true in the case of WSE indexes. In [12] the authors investigate the performance of different index compression schemes through experiments on large query sets and collections of Web documents. Among the main results discussed in the paper, the authors demonstrate that for a small collection, where the uncompressed index fits in main memory, a bytewise compressed index can still be processed faster than the uncompressed one. Most of the works regarding the compression of IF indexes deal with compression methods which we could call passive. With this term we intend to emphasize the fact that they "simply" exploit the locality present in the posting lists to devise variable-length coding schemes that minimize the average number of bits (or bytes) required to store each doc id [5, 7, 9, 11, 12, 10, 17]. Some of the most effective of such encoding methods are: Gamma [5], Delta [5], Golomb [7], and Binary Interpolative [10]¹.
On the other hand, only two recent papers deal with active compression methods based on a preprocessing of the document collection aimed at reassigning the doc ids in a way that enhances the compression rates achieved by the traditional, i.e. passive, encoding algorithms [13, 1]. In [13] the authors propose a doc id reassignment algorithm that adopts a Travelling Salesman Problem (TSP) heuristic. Each document of the collection is considered a vertex of a similarity graph G = (V, E), where an edge is inserted between two vertices di and dj whenever the two documents associated with di and dj share at least one term. Moreover, each edge e ∈ E is weighted by the number of terms shared by the two documents. The TSP heuristic is then used to find a cycle in the graph having maximal weight and traversing each vertex exactly once. The suboptimal cycle found is finally used to reassign the doc ids. The rationale is that, since the cycle preferably traverses edges connecting documents sharing many terms, if we assign close doc ids to these documents we should expect a reduction in the size of the corresponding compressed IF index. The experiments conducted demonstrated a good improvement in the compression rate. Unfortunately, this technique is also very expensive, even for reordering collections of small size. The reordering of a collection of 131,896 documents requires about 23 hours and occupies 2.17 GBytes of main memory, thus making this approach unusable for real Web collections.
In [1] the authors propose an algorithm that permutes the document identifiers in order to enhance the locality of an IF index. A similarity graph G = (V, E) is built as in the previous case, but the Cosine Measure is used to weight the edges. Indeed, the graph building phase is carried out starting from the term-document bipartite graph relative to the collection considered (in practice, a sort of Inverted File index containing only references to the documents which contain a given term). Once the graph is built, the algorithm recursively splits G into smaller subgraphs Gl,i = (Vl,i, El,i) (where l is the level, and i is the position of the subgraph within the level) representing smaller subsets of the collection. The splitting is performed until the subgraphs become singletons. The identifiers are then assigned according to a depth-first visit of the resulting tree. This approach also presents some drawbacks. First, as the one discussed in [13], it requires storing the whole graph G in main memory. Moreover, the cost of performing a split operation on the graph (at each level) is extremely high, even if, to speed up the algorithm, the authors propose two optimization heuristics that are used at each iteration to reduce the dimension of the subproblems. Finally, the algorithm starts from an already built IF (the term-document bipartite graph), and thus requires a postprocessing step to reorder the IF and all the other linked files. The authors tested their strategy on an IF index built over the TREC-8 ad hoc track collection [15]. The results of several tests, conducted by varying the various parameters of the algorithm, were reported in the paper. The enhancement of the compression ratio obtained is significant, but the approach is highly demanding in terms of both time and space. The execution times reported in the paper regard tests conducted on a subcollection of only 32,000 documents. While these two papers address some relevant issues, the approaches proposed are probably unusable for large collections. Our proposal tries to address this issue by exploiting a scalable clustering algorithm using space and time linear in the number of documents of the collection.
¹ This method only implicitly uses the d gap representation. In fact, it exploits the tendency of terms to appear relatively frequently in some parts of the collection, and infrequently in others.
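For illustration only, the following sketch conveys the general flavour of such a similarity-graph reordering on a tiny collection; it uses a greedy nearest-neighbour heuristic with shared-term edge weights, and should not be taken as the actual TSP algorithm evaluated in [13]:

    def greedy_tsp_order(docs):
        """Order documents by greedily following the heaviest similarity edge.

        docs: dict mapping doc id -> set of terms.  The weight of an edge is
        the number of shared terms.  Returns a list of doc ids; assigning new
        identifiers in this order places similar documents close together.
        """
        remaining = set(docs)
        current = max(remaining, key=lambda d: len(docs[d]))   # start from the longest doc
        order = [current]
        remaining.remove(current)
        while remaining:
            # Pick the unvisited document sharing the most terms with the last one.
            current = max(remaining, key=lambda d: len(docs[current] & docs[d]))
            order.append(current)
            remaining.remove(current)
        return order

    docs = {1: {"a", "b"}, 2: {"c"}, 3: {"a", "b", "c"}, 4: {"b", "c"}}
    print(greedy_tsp_order(docs))   # e.g. [3, 1, 4, 2]: similar documents end up adjacent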
3. THE REORDERING ALGORITHM
Let D = {D1, D2, ..., DN} be a collection of N textual documents to which consecutive integer doc ids d = 1, ..., N are initially assigned. Moreover, let T be the number of distinct terms ti, i = 1, ..., T present in the documents, and μt the average length of terms. The total size CSize(D) of an IF index for D can be written as:

    CSize(D) = CSize_lexicon(T · μt) + Σ_{i=1,...,T} Encode_m(d_gaps(ti))

where CSize_lexicon(T · μt) is the number of bytes needed to code the lexicon, d_gaps(ti) is the d gap representation of the posting list associated with term ti, and Encode_m is a function that returns the number of bytes required to code a list of d gaps according to a given encoding method m.

Definition. A reordering of a collection of documents D is a bijective function π : {1, ..., N} → {1, ..., N} that maps each doc id i into a new integer π(i). Moreover, with π(D) = {D_π(1), D_π(2), ..., D_π(N)} we will indicate the permutation of D resulting from the application of function π.

Definition. A reordering π is defined to be optimal if, when applied, it yields a more compressible index structure. More formally, if ∀π' ≠ π, CSize(π(D)) ≤ CSize(π'(D)).

Since the possible reorderings of D are N!, the problem of finding an optimal reordering is intractable for large values of N. Our algorithm tries to approximate an optimal reordering in O(N) time and space. In the remainder of this section we analyze the different aspects taken into consideration in the design and implementation of our algorithm.

Model of the documents. Our clustering algorithm starts from a transactional representation of the documents: each document Di ∈ D is represented by the set Si ⊆ T of all the terms that the document contains. In the following we will denote the transactional document collection with D̃ = {S1, S2, ..., SN}. To evaluate the similarity among documents, we used the Jaccard measure [6]. The Jaccard measure is a well known metric used in cluster analysis to estimate the similarity between objects described by the presence or absence of attributes. It counts the number of common attributes divided by the number of attributes possessed by at least one of the two objects. More formally, in our algorithm the similarity between two distinct documents Si and Sj is given by:

    jaccard_measure(Si, Sj) = |Si ∩ Sj| / |Si ∪ Sj|

Reordering Step. Our reordering algorithm is outlined in Algorithm 1. It takes as input parameters the set D̃ and the number k of clusters to create. In output it produces the ordered list of all the members of the k clusters. This list univocally defines π, a reordering of D̃ minimizing the average value of the d gaps.

Algorithm 1 The reordering algorithm.
1: Input:
   • The set D̃.
   • The number k of clusters to create.
2: Output:
   • k ordered clusters representing a reordering π of D̃.
3: for i = 1, ..., k do
4:   current_center = longest_member(D̃)
5:   D̃ = D̃ \ current_center
6:   for all Sj ∈ D̃ do
7:     sim[j] = compute_jaccard(current_center, Sj)
8:   end for
9:   M = select_members(sim)
10:  ci = ci ∪ M
11:  D̃ = D̃ \ M
12:  ci = ci ∪ current_center
13:  dump(ci)
14: end for

The algorithm performs k scans of D̃. At each scan i, it chooses the longest document not yet assigned to a cluster as the current_center of cluster ci, and computes the distances between it and each of the remaining unassigned documents. Once all the similarities have been computed, the algorithm selects the (|D̃|/k − 1) documents most similar to the current_center by means of the select_members procedure reported in Algorithm 2, and puts them in ci. It is worth noting that when two documents have the same similarity, the longest one is selected. In fact, since the first doc id of each posting list has to be coded as it is, assigning smaller identifiers to documents containing many distinct terms maximizes the number of posting lists starting with small doc ids. The algorithm returns an ordered list of documents. The corresponding reordering π assigns doc id i to the i-th document of the returned list.

Algorithm 2 The select_members procedure.
1: Input:
   • An array sim: sim[j] contains the similarity between current_center and Sj.
2: Output:
   • The set of the (|D̃|/k − 1) documents most similar to current_center.
3: Initialize a min-heap of size (|D̃|/k − 1)
4: for i = 1, ..., |D̃| do
5:   if (sim[i] > sim[heap_root()]) OR ((sim[i] = sim[heap_root()]) AND (length(i) > length(heap_root()))) then
6:     heap_insert(i)
7:   end if
8: end for
9: M = ∅
10: for i = 1, ..., (|D̃|/k − 1) do
11:   M = M ∪ heap_extract()
12: end for
13: return M

Analysis of the algorithm. Let σ be the cost of computing the Jaccard distance between two documents, and τ the cost of comparing two Jaccard measures. Computing the Jaccard similarity mainly consists in performing the intersection of two sets, while comparing the similarity of two documents only requires comparing two floats. We thus have that σ ≫ τ.
At each iteration i, our algorithm performs (|D̃| − i · |D̃|/k) computations of the Jaccard distance. The total cost of this phase at iteration i is thus:

    σ · (|D̃| − i · |D̃|/k)

Once the vector of similarities sim has been computed, the select_members procedure is called. It performs (|D̃| − i · |D̃|/k) insertions into a heap of size (|D̃|/k − 1). Moreover, since we perform an insertion only if the element in the root of the heap is smaller than the element to be inserted, we should further scale the cost by a factor τ' ≪ 1. The total time spent in the select_members routine is thus given by:

    ω · (|D̃| − i · |D̃|/k) · log2(|D̃|/k − 1),   where ω = τ · τ'

The total time spent by the algorithm is thus given by:

    T(|D̃|, k) = Σ_{i=0,...,k−1} T_i(|D̃|, k)

where:

    T_i(|D̃|, k) = σ · (|D̃| − i · |D̃|/k) + ω · (|D̃| − i · |D̃|/k) · log2(|D̃|/k − 1)

Since σ ≫ τ ≥ ω, we have that:

    T_i(|D̃|, k) ≈ σ · (|D̃| − i · |D̃|/k)

and thus:

    T(|D̃|, k) ≈ Σ_{i=0,...,k−1} T_i(|D̃|, k) ≤ σ · |D̃| · k = O(|D̃| · k)

The space occupied by the reordering algorithm is given by the following equation:

    S(|D̃|, k) = |S| · |D̃| + 8 · |D̃| + 4 · |D̃|/k = O(|D̃|)

where the three terms account for the space needed to store the documents, the array of similarities, and the heap data structure, respectively.
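As a concrete, simplified rendering of Algorithms 1 and 2, the following Python sketch implements the reordering step under the assumptions stated above (documents modelled as term sets, Jaccard similarity, ties broken in favour of longer documents). All names are ours, and the use of heapq.nlargest in place of the explicit min-heap of Algorithm 2 is only an implementation convenience:

    import heapq

    def jaccard(a, b):
        """Jaccard similarity between two term sets."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def reorder(docs, k):
        """Return a list of doc ids; assigning new identifiers 1..N in this
        order approximates the reordering pi computed by Algorithm 1.

        docs: dict mapping doc id -> set of terms.  k: number of clusters.
        """
        unassigned = dict(docs)
        cluster_size = max(1, len(docs) // k)      # roughly |D~|/k documents per cluster
        order = []
        while unassigned:
            # The longest unassigned document becomes the centre of the next cluster
            # and is emitted first, so long documents receive small identifiers.
            center = max(unassigned, key=lambda d: len(unassigned[d]))
            center_terms = unassigned.pop(center)
            # Pick the (cluster_size - 1) documents most similar to the centre,
            # breaking similarity ties in favour of longer documents (cf. Algorithm 2).
            members = heapq.nlargest(
                cluster_size - 1,
                unassigned,
                key=lambda d: (jaccard(center_terms, unassigned[d]),
                               len(unassigned[d])))
            order.append(center)
            for d in members:
                order.append(d)
                del unassigned[d]
        return order

    # Toy usage: pi maps old doc id -> new doc id.
    docs = {1: {"a", "b"}, 2: {"x", "y"}, 3: {"a", "b", "c"}, 4: {"x"}}
    order = reorder(docs, k=2)
    pi = {old: new for new, old in enumerate(order, start=1)}
    print(order, pi)

The list returned by reorder induces the permutation π: the i-th document in the list receives doc id i, so documents clustered around the same centre obtain consecutive identifiers.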
4. EXPERIMENTAL RESULTS
To assess the performance of our algorithm we tested it on a real collection of Web documents, the publicly available Google Programming Contest collection². This collection consists of about 900,000 documents, and occupies about 2 GBytes of space on disk. In this section we present some preliminary results obtained by running our algorithm on a Linux PC with a 2GHz Intel Xeon processor, 1GB of main memory, and an Ultra-ATA 60GB disk. The tests we performed were aimed at evaluating the efficiency and effectiveness of our solution under several configurations. In particular, we varied the number of documents to reorder and the number of clusters generated. In Figure 1.(a) the algorithm execution time is plotted as a function of the size of the collection. The two curves reported correspond to different values of k. In both cases the curves have a linear behavior, as we argued from the theoretical analysis outlined in the previous section. The above analysis also shows that the time spent in reordering depends on k. In order to evaluate the completion times as a function of k, we performed a test on a subset of the collection composed of about 160,000 documents.
² http://www.google.com/programming contest
Figure 1: Execution times as a function of: (a) the size of the collection; (b) the number of clusters.
Figure 2: Space occupied by the reordering algorithm by varying the size of the collection.
Figure 1.(b) shows the results of this test, which confirm our analytical estimate. In Figure 2 the space used by the reordering algorithm is plotted as a function of the size of the collection. As can be seen, the space needed by the algorithm is linear in the number of documents being reordered, and can be estimated to be approximately equal to 1 KByte · |D̃|. Since memory occupancy is dominated by the space used to store the collection in main memory, it does not depend on the number of clusters generated.
The main goal of our reordering algorithm is to produce an assignment of identifiers that increases the number of small d gaps. To verify that our algorithm actually produces such a reordering, Figure 3 plots the number of occurrences of each d gap value. As can be seen, the number of d gaps smaller than 8 increases significantly, while the counts of the other d gaps monotonically decrease. This means that encoding methods privileging small numbers, like Gamma coding [16], should yield better compression improvements when applied to the reordered collections.

Figure 3: Number of occurrences of d-gap values in an inverted list built over either reordered or randomly assigned identifiers. The reordering is carried out by our algorithm with k = 2000.

To verify this deduction we computed the average number of bits needed to store each doc id for randomly assigned document identifiers and for reordered ones. We also varied several parameters such as the size of the collection, the number of clusters k, and the encoding method used. Tables 1 and 2 report the results obtained.

    k        Gamma    Delta    Golomb   Interpolative
    random   9.0157   8.5882   7.1382   6.9644
    100      8.2478   7.7207   6.4676   6.1117
    200      8.1957   7.6790   6.4674   6.0859
    500      8.1146   7.6083   6.4670   6.0449
    1000     8.0634   7.5615   6.4667   6.0231
    2000     8.0162   7.5166   6.4666   6.0050
    4000     7.9699   7.4719   6.4665   5.9924
    8000     7.9293   7.4335   6.4664   5.9874
    9000     7.9225   7.4268   6.4663   5.9876

Table 1: Average number of bits-per-posting for different values of k and different encoding methods. The results reported refer to tests conducted with the whole Google collection. The row labelled random refers to the case in which no reordering is performed.

    k        Gamma    Delta    Golomb   Interpolative
    random   9.0860   8.4330   6.4753   6.5644
    5        7.5552   7.1661   6.3807   5.7129
    100      7.3106   6.9410   6.3505   5.5563
    200      7.2202   6.8555   6.3534   5.5036
    250      7.1858   6.8239   6.3544   5.4843
    500      7.1106   6.7510   6.3531   5.4535
    1000     7.0450   6.6869   6.3525   5.4307
    2000     6.9981   6.6389   6.3523   5.4248
    3500     6.9751   6.6102   6.3524   5.4290

Table 2: Average number of bits-per-posting for different values of k and different encoding methods. The results reported refer to tests conducted with a subset of the Google collection composed of about 160,000 web documents. The row labelled random refers to the case in which no reordering is performed.

From both tables we can observe a significant reduction in the average bits-per-posting value for all of the encoding schemes used. For the whole collection (see Table 1), the maximum compression rate enhancement was achieved by adopting the Binary Interpolative scheme. In this case our algorithm allowed the size of the compressed index to be reduced by an additional 14%. Also from the data reported in Table 2, we can observe that our algorithm produces a considerable improvement in the compression rates. In the case of Gamma encoding, we improve the average number of bits-per-posting from 9.1 to 6.9. This reduction corresponds to an IF index 23.2% smaller than the one built over randomly assigned identifiers. Moreover, from the results reported in both tables we can deduce that acceptably good reorderings are also obtained with relatively small values of k. For example, let us consider the bits-per-posting values shown in Table 2 for the Binary Interpolative Coding method. For values of k equal to 500 and 2000 the bits-per-posting values are 5.4535 and 5.4248, respectively. This means that we lose only about 0.5% in compression rate when we generate 500 instead of 2000 clusters. However, by looking at the plot reported in Figure 1.(b) we can see that reordering the collection with k = 500 takes about one quarter of the time required when k = 2000.
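As a rough illustration of how bits-per-posting figures of this kind can be computed, the sketch below (our own toy code, restricted to Gamma coding) measures the average number of bits per posting for a small inverted file before and after applying a hypothetical reordering pi:

    def gamma_bits(n):
        """Bits used by the Elias Gamma code for the positive integer n."""
        return 2 * n.bit_length() - 1

    def bits_per_posting(inverted_file, pi=None):
        """Average Gamma-coded bits per posting, optionally after remapping
        doc ids with the permutation pi (old id -> new id)."""
        total_bits = total_postings = 0
        for postings in inverted_file.values():
            ids = sorted(pi[d] for d in postings) if pi else sorted(postings)
            gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
            total_bits += sum(gamma_bits(g) for g in gaps)
            total_postings += len(gaps)
        return total_bits / total_postings

    index = {"apple": [1, 4, 5], "pear": [2, 3], "plum": [1, 5]}
    pi = {1: 1, 4: 2, 5: 3, 2: 4, 3: 5}          # hypothetical reordering
    print(bits_per_posting(index), bits_per_posting(index, pi))

On this toy index the average drops from about 2.14 to about 1.86 bits per posting, mirroring on a tiny scale the behaviour reported in Tables 1 and 2.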
5. CONCLUSIONS
In this paper we presented an efficient algorithm for computing a reordering of a collection of textual documents that effectively enhances the compressibility of the IF index built over the reordered collection. The reordering algorithm exploits a k-means-like clustering algorithm which computes an effective reordering with linear space and time complexity. The experimental evaluation conducted on a real-world test collection resulted in improvements of up to ∼23% in the compression rate achieved. The algorithm depends on a parameter k which indicates the number of clusters produced. The quality of the reordering obtained is, however, only marginally sensitive to the value of k. Large values of k result on average in slightly better compression rates. On the other hand, smaller values of k result in a significantly lower completion time. In the near future we plan to extend our work in several directions. First, we will compare our solution with the algorithm proposed in [1]. Even if the algorithm by Blandford and Blelloch should produce slightly better reorderings than ours, we think that our approach may reach comparable compression rates by using a significantly lower amount of computational resources. Moreover, we plan to perform a deeper and more accurate
experimental evaluation on larger collections of Web documents. Finally, since collections coming from real Web repositories typically contain several millions of documents, we plan to design and evaluate an on-line version of our algorithm where the reordering is applied to fixed-size blocks of documents.
6. REFERENCES
[1] D. Blandford and G. Blelloch. Index compression through document reordering. In Proc. of the Data Compression Conference (DCC'02). IEEE, 2002.
[2] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proc. of the 8th International World Wide Web Conference, May 1999.
[3] S. Chakrabarti. Mining the Web - Discovering Knowledge from Hypertext Data. Morgan Kaufmann Pub., 2003.
[4] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. of the 7th World Wide Web Conference (WWW7), Apr 1998.
[5] P. Elias. Universal codeword sets and representation of the integers. IEEE Trans. on Information Theory, 21(2):194–203, Feb 1975.
[6] W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms, chapter on Clustering Algorithms (E. Rasmussen). Prentice Hall, Englewood Cliffs, NJ, 1992.
[7] S. Golomb. Run-length encodings. IEEE Trans. on Information Theory, 12(3):399–401, Jul 1966.
[8] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[9] A. Moffat and T. Bell. In situ generation of compressed inverted files. J. of the American Society for Information Science, 46(7):537–550, 1995.
[10] A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25–47, Jul 2000.
[11] V. N. Anh and A. Moffat. Compressed inverted files with reduced decoding overheads. In Proc. of the 21st Intl. Conf. on Research and Development in Information Retrieval (SIGIR'98), pages 290–297, Oct 1998.
[12] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of SIGIR'02, 2002.
[13] W.-Y. Shieh, T.-F. Chen, J. J.-J. Shann, and C.-P. Chung. Inverted file compression through document identifier reassignment. Information Processing and Management, 39(1):117–131, Jan 2003.
[14] T. B. Smith, B. Abali, D. E. Poff, and R. B. Tremaine. Memory expansion technology (MXT): Competitive impact. IBM J. of Research and Development, 45(2):303–309, Feb 2001.
[15] E. M. Voorhees and D. K. Harman. Overview of the eighth Text REtrieval Conference (TREC-8). In Proc. of TREC-8, pages 1–24, 1999.
[16] H. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193–201, 1999.
[17] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes – Compressing and Indexing Documents and Images. Morgan Kaufmann Pub., 2nd ed., 1999.