On Using Query Logs for Static Index Pruning
Hoang Thanh Lam*, Raffaele Perego† and Fabrizio Silvestri†
* Department of Mathematics and Computer Science, TU Eindhoven, Email: [email protected]
† ISTI, Consiglio Nazionale delle Ricerche, Pisa, Italy, Email: {Raffaele.Perego,fabrizio.silvestri}@isti.cnr.it
Abstract—Static index pruning techniques aim at removing from the posting lists of an inverted file the references to documents that are likely not to be relevant for answering user queries. The reduction in the size of the index results in a better exploitation of memory hierarchies and faster query processing. On the other hand, pruning may affect the precision of the information retrieval system, since pruned entries are unavailable at query processing time. Static pruning techniques proposed so far exploit query-independent measures to evaluate the importance of a document within a posting list. This paper proposes a general framework that aims at enhancing the precision of any static pruning method by exploiting usage information extracted from query logs. Experiments conducted on the TREC WT10g Web collection and a large Altavista query log show that integrating usage knowledge into the pruning process is profitable, and remarkably increases the performance figures obtained with the state-of-the-art Carmel's static pruning method.
Keywords—inverted index; static pruning; query log; information retrieval
I. INTRODUCTION

Static index pruning plays an important role in large-scale web information retrieval systems, which crawl and index hundreds of billions of pages [4], [6], [2]. The key point in designing a good static index pruning technique is the function used to assess the importance of a document within a posting list. Given a good measure of the importance of a document for an associated term, we can expect to be able to choose which entries can be ruled out from the term's posting list without a perceptible loss in result quality. This paper investigates a novel approach that exploits knowledge mined from query logs to assess the importance of posting-list entries. Concretely, the occurrence and co-occurrence of terms within past queries extracted from query logs is used for static index pruning. The problem is formally defined as an optimization problem, which is then reduced to a job scheduling problem for which the exact solution is well known. However, the knowledge extracted from query logs is obviously biased toward the behavior of current users. Therefore, pruning index entries by considering only usage information could strongly affect the effectiveness of the information retrieval system. We thus propose a hybrid pruning approach that allows the level of query-independent and query-dependent contributions to be
finely adjusted in order to maximize the performance figures of the pruned index. This makes our approach general enough to boost any state-of-the-art static pruning technique based on document and term relevance metrics.

II. PROBLEM FORMULATION

Each posting list I_i of an inverted index I is associated with an ordering O_i established by the decreasing relevance of the documents for the term¹. Two partial orderings O_i and O_j are considered equivalent (denoted O_i ≡ O_j) if, and only if, for any pair of documents d′ and d″ contained in both posting lists I_i and I_j, if O_i ranks d′ higher than d″ then O_j also ranks d′ higher than d″. Without loss of generality, let us assume that ∀t_i ∈ T (the set of terms) the posting list I_i is ordered by O_i, and let us denote with O the resulting set of |T| orderings, containing possible repetitions. We say that I follows a global ordering O if and only if ∀i, j: O_i ≡ O_j. We say that I follows a local ordering if, and only if, ∃i, j such that O_i ≢ O_j.

Let Q = {q_1, q_2, ..., q_|Q|} be a query log, i.e. a set of queries submitted in the past to a search engine. For each query q_i ∈ Q, let L^k(q_i) be the list of the top-k documents returned by the search engine as answers to q_i (hereinafter we omit, without loss of generality, the superscript k, unless otherwise specified). Moreover, with I_t(d) we denote the position of document d in I_t. Intuitively, an ordering O that is optimal with respect to the result list L(q_i) arranges the entries of the posting list I_t such that all the important documents of L(q_i) are placed close to the beginning of the posting list. It turns out that we have to minimize simultaneously all the positions I_t(d) for d ∈ L(q_i). This leads to a multi-objective optimization problem which requires finding the Pareto-optimal solutions. However, we can simplify the problem as follows. For any term t of query q_i, let C(q_i, t) = Σ_{d ∈ L(q_i)} I_t(d), i.e.
C(q_i, t) is the sum of the positions occupied in I_t by the documents in L(q_i). For example, given a query q, the result list L(q) = {12, 6, 8, 10}, and the posting list t ↦ {10, 25, 6, 8, 30, 12, 99, 100}, we have C(q, t) = I_t(12) + I_t(6) + I_t(8) + I_t(10) = 6 + 3 + 4 + 1 = 14. For all terms t′ not occurring in a query q we set C(q, t′) = 0. In general, we can argue that the smaller the value of C(q, t), the better the organization of the posting list of t ∈ q for answering query q. More formally, we define the following optimization problem:

Problem Definition 1: Given an inverted index I and a query log Q, find a set of orderings O* such that the aggregate cost over Q is minimized:

    min C(Q) = Σ_{q ∈ Q} Σ_{t ∈ q} C(q, t)    (1)

¹ Hereinafter, we will use O_i (resp. O_t) to denote an ordering of the posting list of a term t_i (resp. of a term t).
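As a quick illustration, the cost C(q, t) of the worked example above can be computed with a few lines of Python (a sketch; the function and variable names are ours):

```python
def posting_cost(posting_list, result_list):
    """C(q, t): sum of the (1-based) positions occupied in the posting
    list of term t by the documents answering query q. Documents absent
    from the posting list contribute nothing."""
    position = {doc: i + 1 for i, doc in enumerate(posting_list)}
    return sum(position[d] for d in result_list if d in position)

# Worked example from the text:
posting = [10, 25, 6, 8, 30, 12, 99, 100]
results = [12, 6, 8, 10]
print(posting_cost(posting, results))  # 6 + 3 + 4 + 1 = 14
```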
Proposition 1: An optimal set of orderings O* exists.

Problem 1 can be reduced to an equivalent problem that is easier to solve. For each term t ∈ T, we denote by Q_t the set of all queries of the query log Q that contain term t. Formally, Q_t = {q ∈ Q | t ∈ q}. Let C(Q_t) = Σ_{q ∈ Q_t} C(q, t).

Proposition 2: C(Q) is minimized if and only if C(Q_t) is minimized for every t ∈ T.

The above proposition allows us to reformulate our optimization problem:

Problem Definition 2: Given an inverted index I, a query log Q and a term t ∈ T, find an optimal ordering O_t* for the posting list I_t such that C(Q_t) is minimized.

III. FINDING THE OPTIMAL ORDERING

Let us show that Problem 2 can be reduced to a well-known job scheduling problem asking for the minimization of the total weighted completion time, denoted in the literature as 1 || Σ w_i C_i. Let Q_t = {q_1, q_2, ..., q_N} be the set of distinct queries of Q containing term t, and f_1, f_2, ..., f_N the frequencies of these queries in Q. Recall that each query q_i is associated with the set of top-k results L_i returned by the search engine as answers to q_i. Besides, let D′_t = {d′_1, d′_2, ..., d′_M} be the set of documents which appear in the posting list I_t of term t. Optimization Problem 2 aims to find an optimal placement of the M documents into M ordered slots numbered 1, 2, ..., M, such that the following sum is minimized:

    C(Q_t) = Σ_{1 ≤ i ≤ N} f_i · C(q_i, t)                (2)
           = Σ_{1 ≤ i ≤ N} f_i · Σ_{d ∈ L_i} I_t(d)        (3)
           = Σ_{1 ≤ i ≤ M} I_t(d′_i) · F_i                 (4)

where F_i is the cumulative frequency of the document d′_i, counted as the number of occurrences of d′_i in the top-k result sets. It is not difficult to show that Problem 2 formulated in this way is equivalent to the job scheduling problem 1 || Σ w_i C_i. Concretely, each document d_i is a job, and we have M jobs, assumed to have unit processing time (p_i = 1).
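The reduction to 1 || Σ w_i C_i can be sanity-checked on a toy instance: with unit processing times, the WSPT rule amounts to placing documents in decreasing order of cumulative frequency F_i, which minimizes the weighted positional sum of Eq. (4). A brute-force sketch with hypothetical frequencies:

```python
from itertools import permutations

def weighted_completion(order, F):
    """C(Q_t) as in Eq. (4): sum of slot(d) * F[d] over the placement
    `order`, with 1-based slots and unit processing times."""
    return sum((slot + 1) * F[d] for slot, d in enumerate(order))

# Hypothetical cumulative frequencies F_i for four documents.
F = {'d1': 5, 'd2': 1, 'd3': 3, 'd4': 0}
wspt = sorted(F, key=lambda d: -F[d])  # WSPT rule: decreasing F_i
best = min(permutations(F), key=lambda o: weighted_completion(o, F))
# The WSPT ordering attains the brute-force optimum.
assert weighted_completion(wspt, F) == weighted_completion(best, F)
```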
Algorithm 1 LocalOrder(t)
1: Input: term t and the subset Q_t of the query log
2: Output: an ordering O_t which ranks the documents in I_t by decreasing order of their importance
3: L_i ← top-k documents answering query q_i
4: Let A : D_t → ℕ be a map from the set of documents D_t to the set of natural numbers ℕ
5: for all d ∈ D_t do
6:     A[d] ← 0
7: end for
8: for all q_i ∈ Q_t do
9:     for all d ∈ L_i do
10:        A[d] ← A[d] + 1
11:    end for
12: end for
13: Let O_t be the ordering of the documents of I_t by decreasing values of A[d]
14: return O_t
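In Python, Algorithm 1 boils down to a frequency count followed by a stable sort; a sketch under the assumption that `result_lists` holds the top-k list L_i of each query q_i ∈ Q_t (repeated according to the query's frequency in the log):

```python
from collections import Counter

def local_order(posting_list, result_lists):
    """Sketch of Algorithm 1 (LocalOrder): rank the documents of a
    term's posting list by decreasing cumulative frequency A[d], i.e.
    the number of top-k result lists in which each document appears.
    Ties keep the original posting-list order (Python's sort is stable)."""
    in_posting = set(posting_list)
    A = Counter()
    for L in result_lists:
        for d in L:
            if d in in_posting:
                A[d] += 1
    return sorted(posting_list, key=lambda d: -A[d])
```

For example, with posting list [10, 25, 6, 8] and result lists [[12, 6, 8], [6, 10]], document 6 appears twice and 8 and 10 once each, so the ordering is [6, 10, 8, 25].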
The completion time C_i of a job can be regarded as the position index of the respective document in I_t. The cumulative frequency F_i can be regarded as the weight w_i of the job d_i. Hence, finding an optimal placement of the M documents into M slots amounts to scheduling M jobs so as to minimize the total weighted completion time. The exact algorithm solving this job scheduling problem is well known: it adopts the Weighted Shortest Processing Time first (WSPT) rule. According to this rule, the jobs are processed in decreasing order of w_i/p_i; since all processing times are unitary, the jobs (documents) are simply ordered decreasingly by the cumulative frequency F_i.

Algorithms solving Problem 2. It turns out that to solve Problem 2 we just need to count the cumulative frequency of each document in the result sets and sort the documents decreasingly by these frequencies. Let Q_t = {q_1, q_2, ..., q_|Q_t|} be the set of all distinct queries of Q that contain term t, and D_t = {d_1, d_2, ..., d_|D_t|} the set of documents returned by at least one query of Q_t, or more formally D_t = ∪_{1 ≤ i ≤ |Q_t|} L_i. The exact solution to Problem 2 is illustrated by Algorithm 1. It counts frequencies locally, i.e., by considering only the occurrences of a document within the result lists answering queries that contain the specific term.

IV. HYBRID APPROACHES FOR STATIC INDEX PRUNING

In the previous section, we have seen how to estimate the importance of documents on the basis of the knowledge extracted from a query log, by considering the co-occurrence of terms in the query log. Co-occurrence of terms in query logs and documents has been used for lossless [7] and lossy [3], [8] compression techniques. Nevertheless, since these patterns can be biased toward the queries used for training, we define a hybrid solution to mitigate the effect of this bias.

The hybrid pruning algorithm experimented with in this work operates as follows. Carmel's pruning method [4] is first applied to build a query-independently pruned index. It is important to note that our approach is flexible enough that any pruning method can be used instead of Carmel's method. In the second step, we measure document importance on
Algorithm 2 L-Carmel(k, δ, λ)
1: Input: inverted file I and the parameters δ, λ, k
2: Output: a pruned inverted file I*
3: I* ← ∅
4: for all t ∈ T do
5:     I′_t ← Carmel(I_t, k, δ)
6:     O_t ← LocalOrder(t)
7:     Sort I_t in the order defined by O_t
8:     Let I″_t be the set of documents belonging to the top λ·|I_t| documents of I_t ranked by the ordering O_t
9:     I*_t ← I′_t ∪ I″_t
10:    I* ← I* ∪ {I*_t}
11: end for
12: return I*
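The merging step of Algorithm 2 can be sketched as follows; `carmel_prune` and `order_fn` are hypothetical stand-ins for the Carmel pruning function and for LocalOrder/GlobalOrder, supplied by the caller:

```python
def l_carmel(postings, carmel_prune, order_fn, lam):
    """Sketch of Algorithm 2 (L-Carmel): union of a query-independent
    Carmel-style pruned list with the top lam*|I_t| documents of the
    usage-based ordering. Only the merging logic is shown."""
    pruned = {}
    for t, I_t in postings.items():
        I1 = set(carmel_prune(I_t))          # query-independent pruning
        O_t = order_fn(t, I_t)               # e.g. LocalOrder / GlobalOrder
        I2 = set(O_t[:int(lam * len(I_t))])  # rescued usage-important docs
        pruned[t] = I1 | I2
    return pruned
```

With a toy index {'a': [1, 2, 3, 4]}, a pruner keeping only the first posting, a reversing order function, and λ = 0.5, the pruned list for 'a' is {1} ∪ {4, 3} = {1, 3, 4}.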
Figure 2. Jaccard similarity (A) and Kendall's tau coefficient (B) for top-10 queries from the Altavista query log, with and without query result caching. The performance of Carmel's method degrades when the most frequent queries are filtered out by result caching, while the hybrid approach remains better than Carmel's method.
The WT10g³ collection was chosen (although old and small) because it is temporally aligned with the Altavista query log⁴; we need the collection and the query log to be aligned. Moreover, we also use the topic queries (topics 501 to 550) coming along with WT10g to measure retrieval precision.

A. Query Results Similarity

Figure 1. Carmel, G-Carmel and L-Carmel similarities for top-10 and top-20 queries from the Altavista query log, measured with Jaccard similarity (A) and Kendall's tau coefficient (B) as a function of the pruning level.
the basis of access-frequency information extracted from a query log. To decide which documents will be added to the pruned index, for each term t we order the documents of I_t by applying Algorithm 1. Then, a cut-off value λ, measured as the percentage of entries of I_t to be rescued, is defined. The pseudo-code of Algorithm 2 shows the above hybrid approach for static index pruning. For each posting list I_t, the algorithm computes the pruned list I′_t returned by the Carmel function (line 5). Then, an ordering O_t that defines the importance of the documents in the posting list I_t is computed, by using either the LocalOrder function described in the previous section or the GlobalOrder one. If LocalOrder is used we refer to the hybrid approach as L-Carmel; otherwise, we call it G-Carmel. In line 8, L-Carmel defines I″_t as the set of the top λ·|I_t| most important documents of I_t. These documents are considered as candidates to be rescued. Finally, by merging the two lists I′_t and I″_t, we obtain the final pruned posting list I*_t, which is added to the pruned inverted file I* (lines 9-10). For brevity, we only show the L-Carmel algorithm; G-Carmel can be straightforwardly derived from it.

V. EXPERIMENTAL RESULTS

To perform tests we used a slightly modified version of Lucy², an open-source search engine which is an older version of Zettair, developed to scale up to GBs of data.
To evaluate the performance of the proposed static pruning methods, we measured the similarity between the top-k results obtained from the original and the pruned index. In order to measure the similarity between the two result sets x and y computed on the two indexes, we used two well-known metrics: the Jaccard similarity (J(x, y) = |x ∩ y| / |x ∪ y|) and Kendall's tau similarity coefficient. In order to perform the experiments, the Altavista query log was split into a training and a testing part. The training log contains 5M queries, while the test log is formed by the remaining 2M. The queries in the training log were submitted to Lucy to obtain the top-20 result list of each query. Then, the various pruned indexes were built as described in Sections II and III by using the training log and the associated result sets. Finally, the queries in the test log were also processed on these pruned indexes, and the result sets obtained on the pruned indexes were compared with the original results achieved on the whole index. Figure 1 plots the Jaccard similarity (A) and the Kendall's tau (B) measured on indexes obtained with the L-Carmel, G-Carmel (description omitted due to space) and Carmel algorithms. The plots in the figure show that the L-Carmel algorithm behaves better than G-Carmel, and that both algorithms outperform Carmel's algorithm, particularly when the pruned index is small.

B. Query Results Precision

Table I reports the precision at different cutoff points (P@20, P@10, P@5) and the Mean Average Precision (MAP) for the 50 short queries as a function of the pruning

³ http://trec.nist.gov/
² http://www.seg.rmit.edu.au/zettair/previous_releases.html
⁴ http://www.altavista.com/
Table I. Precision and MAP measured for Topics 501-550 short queries from WT10g as a function of the pruning level. Since we performed tests with TREC queries but trained the algorithm with the Altavista log, we obtained only slightly significant results.

  %  |        P@20               |        P@10               |        P@5                |        MAP
     | Carmel G-Carmel L-Carmel  | Carmel G-Carmel L-Carmel  | Carmel G-Carmel L-Carmel  | Carmel G-Carmel L-Carmel
 10  | 0.043  0.061    0.058     | 0.108  0.107    0.126     | 0.106  0.120    0.144     | 0.140  0.160    0.188
 20  | 0.073  0.075    0.078     | 0.161  0.174    0.177     | 0.188  0.196    0.198     | 0.212  0.244    0.228
 30  | 0.093  0.096    0.097     | 0.195  0.202    0.200     | 0.232  0.240    0.236     | 0.284  0.296    0.288
 40  | 0.113  0.114    0.115     | 0.224  0.229    0.229     | 0.252  0.262    0.260     | 0.304  0.312    0.312
 50  | 0.123  0.124    0.124     | 0.230  0.235    0.235     | 0.276  0.280    0.278     | 0.308  0.308    0.308
 60  | 0.133  0.134    0.134     | 0.247  0.254    0.254     | 0.286  0.294    0.294     | 0.320  0.324    0.324
 70  | 0.143  0.144    0.144     | 0.249  0.251    0.251     | 0.290  0.294    0.294     | 0.332  0.332    0.332
 80  | 0.148  0.149    0.149     | 0.253  0.254    0.254     | 0.290  0.292    0.292     | 0.336  0.332    0.332
 90  | 0.154  0.155    0.155     | 0.257  0.257    0.257     | 0.302  0.302    0.302     | 0.332  0.332    0.332
100  | 0.167  0.167    0.167     | 0.262  0.262    0.262     | 0.320  0.320    0.320     | 0.370  0.370    0.370
level. The figures obtained with the long queries were very similar, and we thus omit them since they do not add any useful information. The best value obtained in each test is marked in boldface. We can see that the index pruned with the L-Carmel method outperformed the other indexes of the same size in most cases. As before, the difference is greater at high levels of pruning (from 10% to 30%). However, since we performed tests with TREC queries but trained the algorithm with the Altavista log, we obtained only slightly significant results. A real query log with associated relevance judgments would be needed for future investigation.
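The two similarity measures used in these result-set comparisons can be sketched as follows; restricting Kendall's tau to the items shared by the two rankings is our assumption, since the text does not specify the treatment of non-overlapping items:

```python
from itertools import combinations

def jaccard(x, y):
    """J(x, y) = |x ∩ y| / |x ∪ y| on the two top-k result sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def kendall_tau(x, y):
    """Kendall's tau restricted to items common to both rankings:
    (concordant - discordant) / total pairs, in [-1, 1]."""
    pos_y = {d: i for i, d in enumerate(y)}
    common = [d for d in x if d in pos_y]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 1.0  # degenerate case: fewer than two shared items
    concordant = sum(1 for a, b in pairs if pos_y[a] < pos_y[b])
    return (2 * concordant - len(pairs)) / len(pairs)
```

For instance, jaccard([1, 2, 3], [2, 3, 4]) is 0.5, while kendall_tau of a ranking against its reverse is -1.0.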
C. The Impact of Query Result Caching

Query result caching is an important technique employed in all large-scale Web search engines [5], [1]. In this section, we consider the impact of query result caching on static index pruning approaches. Previous work did not take this aspect into account, although caching is nowadays adopted in most modern information retrieval systems. The plots reported in Figure 2 show the Jaccard similarity and the Kendall's tau coefficient obtained with and without caching. We can observe that the similarities obtained in the two cases differ. Both the Jaccard and the Kendall's tau figures of all the pruning methods experimented with decrease when the most frequent queries are filtered out, thus showing that query result caching negatively influences index pruning efficacy. However, all methods seem to be affected in the same proportion, and the results measured on the index built with the L-Carmel algorithm are still the best.

VI. CONCLUSIONS AND FUTURE WORK

This paper proposed a hybrid approach exploiting query-dependent information extracted from the query logs of large-scale search systems for boosting state-of-the-art static pruning techniques based on document relevance metrics. Experiments conducted on the TREC WT10g Web collection and a large Altavista query log show that integrating usage information into the pruning process is profitable. Moreover, as an additional contribution of this paper, we showed that the query result caches commonly interposed between users and large-scale information retrieval systems affect pruning performance. This aspect, even if very important, was never considered before in the literature about static index pruning, and surely needs to be taken into account when evaluating index pruning strategies in a realistic setting. Thus, an interesting direction for future work is studying in depth the relations between query result caching and static pruning, in order to define pruning techniques optimized for the actual query load of an information retrieval system in which a cache filters out many of the frequent queries.

REFERENCES

[1] R. A. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, and F. Silvestri. Design trade-offs for search engine caching. ACM TWEB, 2(4), 2008.
[2] R. Blanco and A. Barreiro. Probabilistic static pruning of inverted files. ACM TOIS, 2009.
[3] S. Büttcher and C. L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In ACM CIKM '06, pages 182-189, New York, NY, USA, 2006. ACM.
[4] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In ACM SIGIR 2001, pages 43-50. ACM Press, 2001.
[5] T. Fagni, R. Perego, F. Silvestri, and S. Orlando. Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst., 24(1):51-78, 2006.
[6] S. Garcia, H. E. Williams, and A. Cannane. Access-ordered indexes. In ACSC '04: 27th Australasian Conference on Computer Science, pages 7-14, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.
[7] H. T. Lam, R. Perego, N. T. Quan, and F. Silvestri. Entry pairing in inverted file. In WISE '09: Proceedings of the 10th International Conference on Web Information Systems Engineering, pages 511-522, Berlin, Heidelberg, 2009. Springer-Verlag.
[8] E. S. de Moura, C. F. dos Santos, B. D. de Araujo, A. S. da Silva, P. Calado, and M. A. Nascimento. Locality-based pruning methods for web search. ACM TOIS, 26(2):1-28, 2008.