Effective Top-K Computation in Retrieving Structured Documents with Term-Proximity Support

Mingjie Zhu (2)*, Shuming Shi (1), Mingjing Li (1), Ji-Rong Wen (1)

(1) Microsoft Research Asia, Beijing, China
(2) University of Science and Technology of China, Hefei, Anhui, China

* This work was done when the author was an intern at Microsoft Research Asia.

{shumings, mjli, jrwen}@microsoft.com, [email protected]

CIKM'07, November 6-8, 2007, Lisboa, Portugal. Copyright 2007 ACM 978-1-59593-803-9/07/0011.

ABSTRACT

Modern web search engines are expected to return top-k results efficiently given a query. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them ignore some especially important factors in ranking functions, such as term proximity (the distance relationship between query terms in a document). The inclusion of term proximity breaks the monotonicity of ranking functions and therefore brings additional challenges for efficient query processing. This paper studies the performance of existing top-k computation approaches under term-proximity-enabled ranking functions. Our investigation demonstrates that, when term proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. Based on our analysis and experimental results, we propose two index structures and their corresponding index pruning strategies, Structured and Hybrid, which perform much better in the new setting. Moreover, the two approaches affect the efficiency of index building and maintenance very little.


Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Search process; H.3.4 [Systems and Software]: Performance evaluation (efficiency and effectiveness)

General Terms

Algorithms, Performance, Experimentation

Keywords

Top-k, Dynamic index pruning, Term proximity, Document structure, Hybrid index structure

1. INTRODUCTION

Major commercial web search engines [12, 17, 27] have grown to index billions of pages. In spite of such a large amount of data, end users expect search results to be retrieved quickly and with high accuracy. In web search, as the top few results rather than all relevant documents are often required, it is possible to speed up the search process by skipping the relevance computation of some documents. Various dynamic index pruning mechanisms [1, 2, 15] have been proposed for this purpose. Most papers studying efficient top-k computation assume the following kind of ranking function,

    F(D,Q) = α·G(D) + β·T(D,Q),  where T(D,Q) = Σ_{t∈Q} T(D,t)    (1.1)

where F(D,Q) represents the overall relevance score of document D to query Q, G(D) is the (query-independent) static rank of the document (e.g. PageRank [4]), T(D,Q) is the overall term score of D for Q, and T(D,t) is the term score of D and query term t, computed via a term-weighting function (e.g. BM25 [22]). α and β are parameters satisfying α, β ≥ 0 and α + β = 1. The above formula is monotonic with respect to the document static rank G(D) and the term scores T(D,t). This property is utilized by existing approaches to estimate the score upper bound of unseen documents, and therefore to skip the score computation of some lower-score documents (refer to Section 2.3).

In addition to the factors in Formula 1.1, the search quality of modern search engines depends heavily on other evidence: document structure, anchor text, and term proximity. Document structure means that a web page often comprises multiple fields (title, URL, body text, etc). Appropriate usage of document field structure can improve search results effectively, as terms appearing in special fields like the title or URL are generally more important. Anchor text is a piece of clickable text that links to a target web page; as it aggregates the (relatively objective) opinions of a potentially large number of other pages, it acts as a specially important field of a web page. Finally, term proximity describes how close query terms appear in a document. Intuitively, a document with high term proximity (i.e. the query terms appear near one another) should be more relevant to the query than a document in which the query terms are far away from one another, other factors being equal. Take the query "knowledge management" as an example: quite a lot of web pages containing both "knowledge" and "management" are actually irrelevant to knowledge management, whereas if the term "knowledge" appears in a document immediately followed by the term "management", the document is much more probably related to knowledge management. Term proximity is more important in web search than in traditional information retrieval systems, because a query usually has millions of relevant results.

Due to the importance of term proximity and document structure in modern commercial search engines, effective top-k computation approaches should be studied with these factors considered. Unfortunately, although quite a few index pruning strategies have been proposed, few (if any) of them consider term proximity in the ranking functions. This motivates us to study the effectiveness of existing approaches when ranking functions are fortified with document structure and term proximity factors, and to propose new index structures and pruning strategies.

With term proximity and document structure information considered, ranking functions become more complex (see subsection 3.1). More importantly, term proximity makes a ranking function no longer monotonic (w.r.t. query term scores). Since term proximity is determined by the relationship between all query terms rather than a single query term, a document with low term scores may have a high term proximity score (and therefore a high overall relevance score), and vice versa. This adds additional challenges to the efficient top-k computation problem. We argue that, partially due to the non-monotonicity of the new ranking functions, most existing top-k strategies become quite inefficient once term proximity information is included. We then propose and study two index structures and their corresponding pruning strategies, which efficiently address the top-k problem by utilizing document structure information, and which affect the efficiency of index building and maintenance very little.

We make the following contributions in this paper:
1) To the best of our knowledge, this is the first work which studies and compares the performance of various top-k computation approaches when ranking functions are extended to support document structure and term proximity.
2) We propose two top-k computation approaches which perform much better than existing ones in the new setting. The relative superiority of the two approaches is also explored (in subsection 4.2.4).

The rest of this paper is organized as follows. Section 2 introduces preliminary knowledge and existing pruning techniques. Section 3 studies pruning with document structure and term proximity support. Section 4 presents experimental results for various pruning approaches and settings. Finally, Section 5 concludes and discusses future work.

2. BACKGROUND AND EXISTING INDEX PRUNING TECHNIQUES

In this section, we first give some background concepts related to top-k computation. Then some existing top-k approaches are studied and categorized.

2.1 Background

Indexing and ranking are two key components of a web search engine. To support efficient retrieval, a web collection needs to be indexed offline. Given a query, a ranking function is adopted to compute a relevance score for each document based on the information stored in the inverted index. The documents are then sorted by their scores, and the top k documents with the highest scores are returned to end users. Although there are often millions of documents somewhat related to a query, end users only care about the top k (e.g. k=10, 20, 50, 100) results.

The primary way of indexing large-scale web collections is the inverted index, which organizes collection information into many inverted lists, one per term. For a term t, its inverted list includes the information (DocIds, occurrence positions, etc) of all documents containing the term. Zobel and Moffat [28] give a comprehensive introduction to inverted indexes for text search engines.
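To make the inverted list structure concrete, here is a toy Python sketch (our illustration, not the paper's implementation); a production index would add compression, an on-disk layout, and much more:

from collections import defaultdict

def build_inverted_index(docs):
    """docs: doc_id -> text. Returns term -> list of (doc_id, positions),
    i.e. the term's inverted list with occurrence positions."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, plist))
    return index

index = build_inverted_index({0: "knowledge management systems",
                              1: "management of knowledge"})
# index["knowledge"] -> [(0, [0]), (1, [2])]

The stored positions are what later make term proximity scores computable at query time.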

Various dynamic index pruning techniques have been proposed for efficient top-k computation. They aim to correctly identify the top k results without completely scanning the inverted lists and/or computing the relevance scores of all documents.

2.2 Related Efforts

In the database community, R. Fagin et al have a series of comprehensive works [9, 10, 11] on efficient top-k computation. As Fagin [11] states, if multiple inverted lists are sorted by attribute value and the combination function is monotonic, then an efficient top-k algorithm exists. In the information retrieval area, work on index pruning goes back to the 1980s; some earlier works are described in [5, 6, 13, 18, 20, 26]. Their main idea is to sort the entries by their contribution to the score of the document and put the important entries at the front of the inverted index for early termination. Much work [1, 14] has attempted to optimize the index structure in various ways. Persin et al [20] propose to partition the inverted list into several parts. After PageRank showed its power in web search, pruning techniques considering PageRank were studied by Long and Suel [15]. Static index pruning [7, 8, 19] aims at reducing index size by keeping only the relatively important information of the inverted index; in contrast, we focus on the dynamic index pruning problem, which skips the computation of some document scores at query execution time. There is also work discussing efficient top-k computation in XML retrieval [16], where documents are structured. However, we are unaware of any existing work discussing index pruning with term proximity support.

Approximate and probabilistic [24] top-k computation are also important to web search: by relaxing result quality requirements, query processing efficiency has more room to improve. We focus on exact top-k processing and leave approximate and probabilistic query processing as future work.


Figure 1. Existing index structures: (a) all documents in an inverted list are sorted by DocId; (b) documents are ordered by static rank or one kind of impact score; (c) each inverted list is divided into segments by impact value, with documents in each segment sorted by static rank.

2.3 Existing Index Structures and Pruning Strategies for Efficient Top-K Computation

The organization of the inverted index is the key to index pruning; efficient top-k strategies are commonly based on specially designed index structures. Here we briefly summarize the index structures utilized in previous work and their corresponding pruning strategies.

The documents in each inverted list are often naturally sorted by document ID, which enables a straightforward implementation of multi-way merging in processing a query. Figure 1(a) illustrates this kind of index organization. As this structure leaves little room for improving top-k computation, other index structures were proposed (see Figure 1(b) and (c)); their primary idea is to put documents with higher scoring factors at the top of an inverted list. These index structures can be roughly divided into three categories.

2.3.1 Single segment sorted by static rank

One simple way of organizing inverted lists for improving top-k computation is to sort the documents in descending order of static rank score, as shown in Figure 1(b). (If two documents have the same static score, they can be sorted in ascending order of DocId. If inverted list compression is critical, documents can be re-assigned new DocIds according to static rank, e.g. the document with the maximal static rank is assigned DocId 0; a hashtable or an array can maintain the mapping between new and old DocIds if necessary.) We call this kind of index structure PR-1. The intuition is that, since static rank is an important factor of the ranking function (Formula 1.1), documents with relatively high static rank are more likely to appear in the final top results and should therefore appear at the top positions of the inverted lists. As a result, an early stop is possible if we have enough evidence that top-k results are unlikely to be found in the remaining parts of the inverted lists.

Let SK be the score of the Kth document (in terms of relevance score) seen so far in processing a query, and ST be the maximum possible score of all unseen documents. If the following inequality is satisfied, no unseen document can be in the top k, and we can terminate early,

    SK ≥ ST    (2.2)

With this index structure, ST can be expressed as,

    ST = α·Gcur + β·Σ_{t∈Q} Tmax(t)    (2.3)

where Gcur is the static rank of the last seen document, and Tmax(t) is the maximum value of all term scores in the inverted list of term t. Figure 2 illustrates how to perform index pruning based on this index structure, utilizing the above threshold criterion.

Algorithm: Top-k computation for index structure PR-1
Input: Inverted lists L1, ..., Lm, for the query terms
Output: The list of top-k documents
Variables: R: current top-k list; SK: the minimal score (i.e. the Kth score) in R

Clear R; Set SK = 0
For each doc d in the inverted lists
  Evaluate d
  if(|R| < k OR d.score > SK)
    R.insert(d)
    SK = the minimal score in R
  EndIf
  Update ST (by Formula 2.3)
  if(|R| >= k AND SK >= ST) return R
EndFor
return R

Figure 2. Pruning strategy for index structure PR-1 (single segment sorted by static rank)
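As a concrete reading of Figure 2, the following Python sketch implements PR-1 pruning for the ranking function of Formula 1.1. The posting layout, the function names, and the parameters are our illustrative assumptions, not the paper's implementation:

import heapq

def topk_pr1(postings, k, alpha, beta, t_max):
    """postings: (g, doc_id, term_scores) tuples merged from the query
    terms' inverted lists, in descending static-rank order.
    t_max: term -> maximum term score in that term's inverted list."""
    heap = []  # min-heap of (score, doc_id): the current top-k list R
    for g, doc_id, term_scores in postings:
        score = alpha * g + beta * sum(term_scores.values())  # Formula 1.1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
        sk = heap[0][0]  # the Kth best score seen so far
        # Formula 2.3: g also bounds the static rank of every unseen doc
        st = alpha * g + beta * sum(t_max.values())
        if len(heap) == k and sk >= st:
            break  # early termination: no unseen document can enter the top k
    return sorted(heap, reverse=True)

Because postings arrive in descending static-rank order, g only decreases, so the bound ST shrinks monotonically and the loop stops as soon as the current top k is provably final.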

2.3.2 Single segment sorted by term impact

In this index structure, documents are sorted in descending order of a specific kind of term impact score. The phrase "term impact" was first adopted by Anh and Moffat [1] to denote term scores, TFs, or other factors related to the relevance score of a term for a document. Fagin et al [9, 10, 11] have studied efficient top-k computation on this kind of index structure extensively. Notice that static rank gives a global ordering of documents, while the orderings determined by term impact scores may differ from term to term; that is, the orders of documents in the inverted lists of different terms are in general not coherent. Due to this difference, any reasonable pruning strategy based on this kind of index structure has to maintain a document score accumulator to hold partial document scores (partial score computation could be avoided if random access to inverted lists were efficient, which does not hold for large-scale web search), leading to more effort than the PR-1 approach. Please refer to Anh and Moffat [1] for a comparison between the impact-sorted index and other index structures. With this kind of index structure, the expression for ST is as follows,

    ST = α·Gmax + β·Σ_{t∈Q} Tcur(t)    (2.4)

where Gmax is the maximum static rank of all documents, and Tcur(t) is term t's score for the latest seen document in term t's inverted list. Different from the strategy in Figure 2, the retrieval process does NOT end as soon as the condition of Formula 2.2 is satisfied: although that condition means no unseen document can be in the top k, the seen documents with partial scores may still belong there. Please refer to [9, 11] for details.

2.3.3 Multi-segment divided by impact scores

The index structure described in the above subsection has some drawbacks. First, it makes index compression harder (the difference between adjacent DocIds is no longer a small integer). Second, inverted list merging becomes much more difficult. Moreover, its performance is not good in practice for queries containing more than one term. To address these problems, some work (Persin, Zobel, and Sacks-Davis [20]; Long and Suel [15]; Anh and Moffat [2]) investigates a multi-segment index structure, in which an inverted list is divided into multiple segments by one specific kind of term impact score (e.g. TF, IDF, term weighting score, etc), and each segment is sorted by DocId or static rank. Figure 1(c) shows such an index structure. Its advantages are twofold. On one hand, as the weight of static scores is commonly smaller than that of term weighting scores in real ranking functions, splitting inverted lists by impact scores can be more effective than purely sorting documents by static rank. On the other hand, as the documents in one segment have a global ordering (by static rank or DocId), it avoids some disadvantages of the single-segment-sorted-by-term-impact structure. Since the information of one document may appear in different segments for different query terms, document score accumulators are still needed to maintain partial scores. Figure 3 illustrates one index pruning strategy based on this kind of index structure, with the following expression for ST,

    ST = α·Gcur_seg + β·Σ_{t∈Q} Tmax_seg(t)    (2.5)

where Tmax_seg(t) is the maximum term score of all documents in the current segment of term t's inverted list, Gcur_seg is the static rank of the last seen document in the current segment, and Gmax, the maximum static rank of all documents, takes the place of Gcur_seg when bounding documents in segments that have not been opened yet.

Algorithm: Top-k computation for the multi-impact-segments index structure
Input: Inverted lists L1, ..., Lm, for the query terms
Output: The list of top-k documents
Variables: R: current top-k list; A: the accumulator for candidate docs; SK: the minimal score (i.e. the Kth score) in R

Clear R and A; Set SK = 0
For each segment seg
  For each doc d in seg
    Evaluate d  // compute d.minscore and d.maxscore
    if(|R| < k OR d.minscore > SK)
      R.insert(d)
      SK = the minimal score in R
    EndIf
    if(!A.contains(d) AND d.maxscore > SK) A.insert(d)
    if(A.contains(d) AND d.maxscore <= SK) A.remove(d)
    Update ST (by Formula 2.5)
    if(|R| >= k AND SK >= ST AND A-R is empty) return R
  EndFor
EndFor
return R

Figure 3. Pruning strategy for the multi-impact-segments index structure
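The accumulator logic of Figure 3 can be sketched in Python as follows. How minscore/maxscore and ST are computed is abstracted behind callbacks, since it depends on the ranking function; all names here are our own illustrative assumptions:

def topk_multisegment(segments, k, score_bounds, unseen_bound):
    """segments: the query terms' segments in processing order, each
    yielding doc ids. score_bounds(d) returns (minscore, maxscore) of d
    given the postings seen so far; unseen_bound() returns ST (Formula 2.5)."""
    topk = {}           # R: doc -> lower-bound score
    candidates = set()  # A: docs whose final score may still exceed SK
    sk = 0.0
    for seg in segments:
        for d in seg:
            lo, hi = score_bounds(d)
            if len(topk) < k or lo > sk:
                topk[d] = lo
                if len(topk) > k:
                    topk.pop(min(topk, key=topk.get))  # evict the weakest
                if len(topk) == k:
                    sk = min(topk.values())
            if d not in candidates and hi > sk:
                candidates.add(d)
            if d in candidates and hi <= sk:
                candidates.discard(d)
            st = unseen_bound()
            if len(topk) == k and sk >= st and candidates <= topk.keys():
                return list(topk)  # A - R is empty: no partial doc can win
    return list(topk)

The extra bookkeeping for candidates is exactly the accumulator cost that Section 2.3.2 attributes to incoherent per-term orderings.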

Now we study, in the following sections, how existing index structures and their corresponding pruning techniques perform when term proximity and document structure are considered in ranking functions.

3. TERM-PROXIMITY-ENABLED TOP-K COMPUTATION

Modern web search engines have been exploiting more and more features to improve the quality of search results, and document structure and term proximity are no doubt among the most important ones. It is common practice to differentiate document fields to improve search results; the fields search engines consider most include title, URL, anchor text and body text. Term proximity means the distance relationship between query terms in a document: a document in which the query terms appear close together is considered more relevant than one in which they are far away from each other. Please refer to [21] for more details about how to compute term proximity scores for documents given a query.

When document structure and term proximity are considered, the ranking function takes a different form from Formula 1.1. In this section, we first define a more practical ranking function (than Formula 1.1) which considers document structure and term proximity in computing document scores. Then we briefly analyze the behavior of existing approaches under the newly defined ranking function. We then propose and study two index structures which improve top-k computation by exploiting document structure information.

3.1 A More Practical Ranking Function

Considering document structure and term proximity factors, a reasonable ranking function could be as follows,

    F(D,Q) = α·G(D) + β·T(D,Q) + γ·X(D,Q)    (3.1)

where G(D) is the static rank of D, T(D,Q) is the overall term score (aggregating the scores of all query terms) of D for query Q, and X(D,Q) is the term proximity score between D and Q. T(D,Q) and X(D,Q) can be expressed as follows,

    T(D,Q) = Σ_{t∈Q} w_t · Σ_i u_i·T(F_i,t)    (3.2)

    X(D,Q) = Σ_i v_i·X(F_i,Q)    (3.3)

where T(F_i,t) is term t's term-weighting score in field F_i (of document D), and X(F_i,Q) is the proximity score of field F_i relative to Q. One sample term weighting function is the BM25 formula [22]; one sample term proximity weighting function is given in [21]. Coefficients α, β and γ are respectively the weights of static scores, term scores, and term proximity scores in the score combination; u_i and v_i are the weights of field F_i in computing term scores and proximity scores respectively; and w_t indicates the relative importance of term t in score computation. T(F_i,t) is computed without considering term proximity information. The weights and coefficients satisfy the following constraints,

    α + β + γ = 1,  α, β, γ ≥ 0,  Σ_i u_i = Σ_i v_i = 1,  Σ_{t∈Q} w_t = 1
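As a simplified, illustrative reading of Formulas 3.1 to 3.3 (the names and data layout are ours; term_score would come from BM25 [22] and proximity from a measure like that of [21], both precomputed per query and normalized to [0, 1]):

from dataclasses import dataclass, field

@dataclass
class Doc:
    static_rank: float                                # G(D) in [0, 1]
    term_score: dict = field(default_factory=dict)    # (field, term) -> T(F_i, t)
    proximity: dict = field(default_factory=dict)     # field -> X(F_i, Q)

def relevance(doc, query, alpha, beta, gamma, w, u, v):
    # Formula 3.2: term score weighted over query terms and fields
    t_dq = sum(w[t] * sum(u[f] * doc.term_score.get((f, t), 0.0) for f in u)
               for t in query)
    # Formula 3.3: proximity score weighted over fields
    x_dq = sum(v[f] * doc.proximity.get(f, 0.0) for f in v)
    # Formula 3.1: linear combination
    return alpha * doc.static_rank + beta * t_dq + gamma * x_dq

Note that x_dq depends on the joint positions of all query terms within each field, which is the source of the non-monotonicity discussed next.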

Like in Formula 1.1, we assume in the rest of this paper that G(D), T(F_i,t), and X(F_i,Q) have all been normalized into the range [0.0, 1.0], so the values of F(D,Q), T(D,Q), and X(D,Q) are also in [0.0, 1.0]. With the above ranking function, there are basically two options for index pruning: simply extending existing approaches, or proposing new index structures and/or pruning strategies. We first briefly analyze how existing approaches will perform, and then present two index structures that exploit document structure information for efficient pruning.


3.2 Analyzing the Performance of Existing Approaches with the New Ranking Function

It is not hard to extend existing top-k computation algorithms to support the new ranking function of Formula 3.1. The main effort is re-estimating ST, the maximum possible score of any unseen document during query processing. As summarized in the previous section, ST is commonly defined with an expression of the following form,

    ST = α0·G_est + β0·T_est    (3.4)

where G_est and T_est are the estimated maximum possible static rank and term weighting score. The weights α0 and β0 are actually the α and β of Formula 1.1; they are subscripted here to differentiate them from the weights in Formula 3.5. When term proximity is included (Formula 3.1), the expression of ST changes to,

    ST = α·G_est + β·T_est + γ·X_est    (3.5)

where X_est is the estimated maximum possible term proximity score. As long as the expression of ST is modified from Formula 3.4 to Formula 3.5, the pruning strategies of the previous section can be applied at once to process queries with the new term-proximity-enabled ranking function. The question is how efficient they will be. We assume all static rank scores, term weighting scores and term proximity scores have been normalized into [0.0, 1.0]. Keep in mind that, as term proximity scores are determined by the distances between query terms, they are not a monotonic function of term weighting scores: small term weighting scores do not imply a low term proximity score. Given a document D and a query Q = {A, B}, the term proximity score of D may be quite small even when the term weighting scores of both A and B are large. Therefore it is hard (if not impossible) for a pruning strategy to find an accurate estimation of the maximum possible term proximity score. Examining all the index structures and pruning strategies described above, we find that they have no choice but to estimate the maximum possible term proximity score as 1.0, i.e. X_est = 1.0 in Formula 3.5. If we assume that the weight of static rank keeps unchanged when term proximity information is added into the ranking function (which means α = α0, and hence β0 = β + γ), then by subtracting Formula 3.4 from Formula 3.5 we get,

    ST(3.5) - ST(3.4) = γ·(X_est - T_est) = γ·(1 - T_est)    (3.6)

Notice that T_est ≤ 1.0, so the threshold ST never decreases, and it grows with the proximity weight γ. A looser ST makes the early-termination condition SK ≥ ST much harder to satisfy, which is why most existing strategies become inefficient once term proximity is included.
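As a hypothetical numeric illustration (our numbers, not from the paper): take α = α0 = 0.2 and γ = 0.3 (so β = 0.5 and β0 = 0.8), and suppose at some point G_est = 1.0 and T_est = 0.6. Formula 3.4 gives ST = 0.2 + 0.8·0.6 = 0.68, while Formula 3.5 with X_est = 1.0 gives ST = 0.2 + 0.5·0.6 + 0.3 = 0.80, an increase of γ·(1 - T_est) = 0.12. Since actual document scores do not grow correspondingly, the condition SK ≥ ST is met considerably later, if at all.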

Figure 4. Two index structures for efficient top-k computation with term proximity support: (a) Structured, with a Fancy and a Body segment per inverted list; (b) Hybrid, with a Fancy, a Body_High and a Body_Low segment; documents in every segment are sorted by PageRank.

3.3 Index Structures that Exploit Document Structure Information

By examining web documents and the new ranking function of Formula 3.1, we make the following observations:

1) For a web document, the term proximity scores of its different fields are independent (refer to the ranking function expressed by Formulas 3.1 to 3.3). That is, in computing the term proximity score of one field, no information from other fields is needed.

2) The fields other than body text are often significantly shorter than body text.

3) The fields other than body text are commonly more important than body text in computing relevance scores.

These observations motivate us to investigate index formats and pruning strategies which utilize document field structure. In the index format of Figure 4(a), each inverted list contains two segments: a fancy segment and a body segment. The body segment contains the body field information of all documents containing the term, while the fancy segment contains the information of the other document fields (title, URL, anchor text, etc). The documents in each segment are sorted by static rank. As body text is commonly much longer than the other fields, the body segment is much longer than the fancy segment. We call this index structure Structured.

This kind of index structure is mentioned in Brin and Page's paper [4], although they sort documents by DocId instead of static rank and do not illustrate how to compute accurate top-k results efficiently using such an index structure.

With this index structure, the fancy segments of the query terms can be processed first. Once they are exhausted, any unseen document matches the query terms only in its body field, so by observation 1 its term proximity score is bounded by the proximity weight v_body of the body field, and X_est can be set to v_body instead of 1.0. By observation 3, v_body is commonly small, so the value of X_est is much less than 1.0 accordingly. A similar argument gives a good estimation of T_est (the upper bound of the overall term weighting score among all unseen documents).

Now we come to the pruning strategies based on this index structure. As with the multi-impact index structure (Section 2.3.3), the Structured index structure needs to maintain a top-k accumulator, containing the documents which are not currently among the top k but still have the possibility of being there. Figure 5 shows the pruning strategy for this kind of index structure.

Algorithm: Top-k computation for the Structured index structure
Input: Inverted lists L1, ..., Lm, for the query terms
Output: The list of top-k documents
Variables: R: current top-k list; A: the accumulator for candidate docs; SK: the minimal score (i.e. the Kth score) in R

Clear R and A; Set SK = 0
ProcessFancySegs(R, A)
ProcessBodySegs(R, A)
return R

// function ProcessBodySegs
ProcessBodySegs(R, A)
  For each doc d in the body segments
    Evaluate d  // compute d.minscore and d.maxscore
    if(|R| < k OR d.minscore > SK)
      R.insert(d)
      SK = the minimal score in R
    EndIf
    if(!A.contains(d) AND d.maxscore > SK) A.insert(d)
    if(A.contains(d) AND d.maxscore <= SK) A.remove(d)
    Update ST
    if(|R| >= k AND SK >= ST AND A-R is empty) return
  EndFor
EndFunc

// function ProcessFancySegs
ProcessFancySegs(R, A)
  // The implementation is similar to ProcessBodySegs. Omitted to remove redundancy.

Figure 5. Pruning strategy for the index structure of Figure 4(a)
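A minimal Python sketch of the two-phase strategy of Figure 5, reusing the segment-scan loop sketched in Section 2.3.3; the concrete body proximity weight and all names are hypothetical assumptions:

def topk_structured(fancy_segs, body_segs, k, scan_segment, v_body=0.2):
    """scan_segment(seg, state, x_hat): scans one static-rank-sorted
    segment with the accumulator loop of Figure 3/5, using x_hat as the
    proximity upper bound of unseen documents; returns True when the
    early-termination test succeeds. v_body: hypothetical proximity
    weight of the body field (v_i of Formula 3.3)."""
    state = {"R": {}, "A": set(), "SK": 0.0}
    for seg in fancy_segs:                 # ProcessFancySegs
        if scan_segment(seg, state, x_hat=1.0):
            return list(state["R"])
    for seg in body_segs:                  # ProcessBodySegs
        # unseen docs now match only in the body field
        if scan_segment(seg, state, x_hat=v_body):
            return list(state["R"])
    return list(state["R"])

The key design point is that the bound handed to the body phase is v_body rather than 1.0, which is what restores effective pruning under the non-monotonic ranking function.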

In addition to dividing inverted lists by document structure, we can further organize the body segment using existing index structures; that is, it is possible to combine traditional index structures with the structure of Figure 4(a) to get a hybrid index structure, shown in Figure 4(b). It contains three segments: fancy, body_high, and body_low. The body_high segment contains documents with relatively high term weighting scores (i.e. T(F_body,t) is larger than a threshold), while documents with low term weighting scores are in segment body_low. The documents in each segment are still sorted in descending order of static rank. The hybrid index structure is expected to combine the merits of the two index structures Structured and multi-impact.

For the hybrid index structure, the corresponding top-k computation strategy is quite similar to the Structured approach; we omit it here to avoid unnecessary redundancy. Our experimental results show that the Structured and Hybrid approaches generally outperform existing approaches when term proximity is considered. The relative performance of the Hybrid approach compared with Structured depends on parameter settings; please refer to Section 4.2 for more details.

3.4 Discussions

3.4.1 Impact on index building and maintenance

Generally speaking, the impact of an efficient top-k approach is twofold: improved query processing performance, the main purpose of top-k processing, and a potential negative impact on index building speed, index compression, and ease of implementation. After implementing all the approaches mentioned above and using them for indexing and query processing, we found the Structured and Hybrid approaches to be quite similar to existing approaches in terms of index building speed, index compression ratio, and implementation difficulty.

3.4.2 Supporting multiple ranking functions with one index

Modern search engines often have practical considerations in ranking function design. For example, it is often desirable for an engine to support multiple ranking functions (or one ranking function with different parameters) with one index structure. Under such requirements, some index structures or pruning strategies may no longer be feasible. For example, for the multi-impact index structure, term scores are no longer appropriate as impact values, because different parameters may result in different term scores. However, TF (term frequency) can still be used as an impact score in this circumstance, as has been done by Anh and Moffat [3]. For the same reason, the hybrid index structure may have to use TFs instead of term scores to split the body segment.
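As a small sketch of such parameter-independent splitting (the TF threshold is an arbitrary assumption):

def split_body_by_tf(postings, tf_threshold=3):
    """postings: (doc_id, static_rank, tf) tuples of one term's body
    field. Splitting on raw TF keeps the body_high/body_low boundary
    valid under any ranking-function parameters, since TF does not
    depend on them."""
    high = [p for p in postings if p[2] >= tf_threshold]
    low = [p for p in postings if p[2] < tf_threshold]
    high.sort(key=lambda p: p[1], reverse=True)  # by static rank, descending
    low.sort(key=lambda p: p[1], reverse=True)
    return high, low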


4. EXPERIMENTS

We now conduct experiments to compare the performance of different index structures and pruning strategies under various system settings, using the new, more practical ranking function of subsection 3.1.

4.1 Experimental Setup

Datasets: We use two TREC [25] test collections, shown in Table 1. The GOV collection consists of about 1.25 million web pages crawled from web sites in the .gov domain in 2002; the GOV2 corpus is a crawl of .gov sites in early 2004 (refer to http://ir.dcs.gla.ac.uk/test_collections/ for more information about the two collections). For the GOV corpus we only keep html pages (about 1 million) and discard documents of other types (pdf, doc, etc), and we conduct our experiments on query set 2004mixed. For the GOV2 corpus we use query set 2004tera, which was used in the Terabyte track of TREC in 2004 and 2005. Details of the query sets are shown in Table 2. We choose data collections of different sizes and queries of different types, which allows us to test and compare the various index structures and pruning strategies under various settings.

Table 1. Test collections used in the experiments

  Collection Name    Documents     Size
  GOV                1 million     18 GB
  GOV2               25 million    426 GB

Table 2. Query sets

                      2004mixed    2004tera
  # of queries        225          50
  Query types         np, hp, td   np, hp, td
  Collection          GOV          GOV2
  Avg query length    3.43         3.16

  (td: topic distillation; np: named page finding; hp: home page finding)

Hardware and Software Environment: For the GOV dataset, experiments are carried out on a single machine with dual 3.06 GHz Intel Xeon CPUs, 4 GB RAM and a 600 GB local SATA disk. For the GOV2 corpus, we distribute its documents to 5 machines (via URL hashing), each indexing about 5 million documents. The 5 nodes exchange link information by network communication, based on which each document's static rank is computed and its anchor text is accumulated. Only one CPU per machine is used in the pruning experiments, i.e. we run a single-thread program on each machine.

Index Structures and Strategies: We compare four kinds of pruning strategies (plus a no-pruning baseline) with their corresponding index structures under various settings, varying the parameters of Formula 3.1 and the result number k (the number of results returned).

BASE: The documents in each inverted list are sorted by DocId (Figure 1(a)), and no pruning is performed.

PR-1: The documents in each inverted list are sorted by PageRank (Figure 1(b)), and the strategy in Figure 2 is adopted.

IMPACT-2: Each inverted list has two segments, a high score segment and a low score segment, with documents in each segment sorted by PageRank (Figure 1(c)). The strategy in Figure 3 is adopted.

STRUCTURED: Each inverted list contains a fancy segment and a body segment, with documents in both segments sorted by PageRank (Figure 4(a)). The pruning strategy in Figure 5 is adopted.

HYBRID: Each inverted list contains a fancy segment, a high impact body segment, and a low impact body segment, with documents in each segment sorted by PageRank (Figure 4(b)). The pruning strategy in Figure 5 is adopted.

Algorithm and Parameters: We use MSRA2000 [23], which achieved some of the best results in the TREC 2004 web track, as our ranking function in all experiments. This ranking function fits Formula 3.1, with T(D,Q) and X(D,Q) expressed by Formulas 3.2 and 3.3 respectively; the term score for each field is computed via the BM25 formula [22]. Given one kind of index structure, the performance of a top-k algorithm depends on many factors, including the number of results required (k) and various parameters in the ranking function. We test different factor values in the experiments, trying to give as comprehensive a comparison as possible between the different pruning strategies. Table 3 shows the factor values tested. Parameter α measures the importance of static rank in computing document scores; its optimal value may vary from corpus to corpus, and common practice tells us that it should not exceed 0.5.

Table 3. Variables and their values tested in the experiments

  Parameter   Description                                                Values
  k           Number of results required                                 1, 5, 10
  α           Weight of static rank (in Formula 3.1)                     0, 0.1, 0.2, 0.5
  wprox       Weight of term proximity scores relative to that of        0, 0.1, 0.2, 0.4, 0.5, 1
              term weighting scores (in Formula 3.1)
  wbody       Body field weight relative to the weight of fancy fields   0.5, 1, 2, 4

Evaluation Metrics: There are basically two kinds of metrics for evaluating a top-k algorithm: search quality (P@10, average precision, etc) and processing performance (average query processing time, throughput, etc). As this paper focuses on exact top-k computation, all strategies are required to return the same results as when no pruning is performed, so we measure pruning performance by average query execution time.

4.2 Experimental Results

Experiments are conducted with the different parameter values of Table 3, and the results are presented as follows. First we study the performance of each pruning strategy when the proximity score is considered; then we study how the values of various parameters affect the performance of each approach; finally we compare the performance of the two proposed approaches, Structured and Hybrid.

4.2.1 With and without term proximity

In the first group of experiments, we test the performance of the various index structures and pruning strategies before and after term proximity information is considered in the ranking function. Figures 6 and 7 show the average query processing time of the various algorithms for selected term proximity weights. In these experiments, we fix the values of k, α, and wbody, and vary the value of wprox. The results are acquired using query set 2004tera (on dataset GOV2) and query set 2004mixed (on the GOV dataset) respectively.


Figure 6. Performance comparison of various top-k algorithms, for different term proximity weights (Dataset: GOV2; Query set: 2004tera; k=10; α=0.2; wbody=1.0). [Plot: query execution time (s) vs. proximity score weight, 0 to 1, for Base, PR-1, IMPACT-2, STRUCTURED and HYBRID.]

Figure 7. Performance comparison of different top-k algorithms, for different term proximity weights (Dataset: GOV; Query set: 2004mixed; k=10; α=0.2; wbody=1.0). [Plot: query execution time (s) vs. proximity score weight, same algorithms as Figure 6.]

Some observations can be made from these results. First, although the IMPACT-2 approach behaves well without term proximity, its performance drops rapidly as the weight of term proximity gets larger; when the term proximity weight is large enough, IMPACT-2 even performs worse than the baseline. Second, the two approaches investigated in Section 3 (STRUCTURED and HYBRID) perform much better than the others when the term proximity weight is higher than a threshold (0.1 in the figures). The STRUCTURED strategy has steady pruning performance in such settings, reducing query execution time by more than 80% for all term proximity weights. Third, no significant performance gain is obtained with the PR-1 strategy, which is consistent with the statements of Long et al [15]. PR-1 performs slightly worse than the baseline because it has the extra overhead of looking up the static rank table when merging inverted lists (in processing multi-term queries); it can improve performance only when the weight of static rank is extremely high (see Figure 9).

4.2.2 Effect of K

To test the effect of K (the maximum number of results needed to be returned by one machine) on the pruning performance of the various approaches, we fix the other parameters and try different values of K. Figure 8 shows the experimental results on the GOV2 dataset, using query set 2004tera.

Modern web search engines often distribute several billion web pages over several thousand machines for indexing and retrieval. To get the global top-k (say 100) results, each indexing node may need to return only 1 to 10 results to the frontier machine (which interacts directly with end users). So we test three values of k in the experiments: 1, 5, and 10.

Figure 8. Performance as a function of k (Dataset: GOV2; Query set: 2004tera; wprox=0.5; α=0.2; wbody=0.5). [Plot: query execution time (s) for k = 1, 5, 10.]

It is not surprising that a smaller k value makes pruning easier. When more results are required, the performance of every pruning approach drops further, but the relative order of IMPACT-2, STRUCTURED, and HYBRID keeps unchanged.

4.2.3 Effect of static rank weights and body weights

Here we study the impact of the static rank weight α and the body weight wbody on the pruning performance of the different approaches. The performance for different static rank weights (with the other parameters fixed) is shown in Figure 9. We only show the results for query set 2004tera here; the results for the other query sets are similar.

Figure 9. Performance as a function of static rank weight (Dataset: GOV2; Query set: 2004tera; k=10; wprox=0.5; wbody=0.5). [Plot: query execution time (s) vs. page rank weight, 0 to 0.5.]

From the figure, we find that PR-1 gets the most significant performance boost when the weight of static rank is large enough. However, a page rank weight as high as 0.5 is unreasonable in practice; we include these results only for trend analysis.

The STRUCTURED approach also benefits from the increase of static rank weights. At first glance, it seems strange that the performance of IMPACT-2 gets better when the weight of static rank increases to 0.5. That is because the weight of term proximity becomes smaller when α is large (keep in mind that α + β + γ = 1 and that we fixed wprox, the ratio of the proximity weight to the term weight, in these experiments). The reason IMPACT-2 performs worse than the baseline can be explained as follows: under such parameter settings, IMPACT-2 cannot achieve much pruning, yet still has to pay the overhead of maintaining the partial score accumulator.

Figure 10. Performance of various top-k approaches, for different body score weights (Dataset: GOV2; Query set: 2004tera; k=10; α=0.2; wprox=0.5). [Plot: query execution time (s) vs. body weight in {0.5, 1, 2, 4}.]

Figure 11. Performance of various top-k approaches, for different body score weights (Dataset: GOV; Query set: 2004mixed; k=10; α=0.2; wprox=0.5). [Plot: query execution time (s) vs. body weight in {0.5, 1, 2, 4}.]

Figure 10 and Figure 11 show the performance of the various top-k strategies for different body weights, on query sets 2004tera and 2004mixed respectively. As expected, when the body field weight becomes higher (which means lower weights for the fancy fields), the performance of the STRUCTURED and HYBRID approaches degrades, because the two approaches rely on document structure information. From Figures 10 and 11, we observe that above a certain body weight threshold the HYBRID approach outperforms STRUCTURED. There is also a threshold in Figure 9, above which STRUCTURED performs better than HYBRID. We give a detailed analysis in the following subsection.

4.2.4 STRUCTURED vs. HYBRID

From Figures 9, 10, and 11 we can get a rough idea of the impact of static rank weight and body weight on the performance comparison between the HYBRID and STRUCTURED approaches: STRUCTURED performs better than HYBRID when the static rank weight is larger than a threshold, or when the body weight is lower than a threshold. In addition, we find in the experiments that the value of wprox has only moderate impact on this comparison. Thus the relative performance of the two strategies is mainly determined by two parameters: static rank weight and body weight. We selected a set of parameter value pairs for query set 2004tera and plot the winning areas of each strategy in Figure 12. In the shadowed area the HYBRID approach performs better than STRUCTURED, while in the blank area STRUCTURED behaves better. The only difference between the two approaches is that HYBRID further splits the body segment into two segments by term impact scores; it is intuitive that such a split pays off only if the body weight is much larger than the weight of static rank. This implies that we can choose appropriate index structures and strategies according to the parameters we intend to use in the ranking function.

Figure 12. The "winning areas" of the STRUCTURED and HYBRID approaches respectively (Dataset: GOV2; Query set: 2004tera; k=10; wprox=0.5). [Plot: page rank weight, 0 to 0.5, vs. body weight, 1 to 4; the shadowed area marks where HYBRID wins.]

4.2.5 Summary

In summary, when ranking functions are fortified with term proximity factors, most traditional index structures (and their corresponding pruning strategies) perform poorly, and, interestingly, simply splitting the index according to document structure achieves good performance in most settings. As for the two approaches proposed in this paper, their relative superiority is determined by two factors: the static rank weight α and the body weight wbody.

5. CONCLUSIONS AND FUTURE WORK

Various index structures and pruning strategies for top-k computation are compared in this paper. Our investigation demonstrates that, when term proximity is incorporated into ranking functions, the top-k problem becomes much more difficult because the ranking functions are no longer monotonic; as a result, most existing index structures and strategies become inefficient. However, pruning can still be performed efficiently if web page structure is considered: both of the well-performing algorithms (Structured and Hybrid) exploit document structure information. We also give some hints and insights on how to choose between the two algorithms.

In this paper we only consider the ranking function of Formula 3.1, i.e. the overall score is a linear combination of static rank, term scores, and term proximity scores. More complex ranking functions may exist in practice. The core ideas of pruning with more complex (for example, non-linear) ranking functions might be similar, although the details won't be exactly the same. One future direction is to address efficient query processing for more complex ranking functions.

The problem studied in this work has not been considered by existing top-k research, so previous research on extensions and applications of the top-k problem needs to be reviewed and further studied; we will address this in future work. This paper mainly focuses on exact top-k computation, i.e. generating the same top-k results as when all documents are evaluated. For a real-world web search engine, approximate top-k results should be enough, and for term-proximity-enabled approximate top-k computation, different conclusions might be drawn about the relative performance of the various approaches. We leave this as future work.

6. REFERENCES

[1] V. N. Anh and A. Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. In Proc. of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 372-379, Seattle, WA, August 2006.

[2] V. N. Anh and A. Moffat. Pruning Strategies for Mixed-Mode Querying. In Proc. of the 2006 CIKM International Conference on Information and Knowledge Management (CIKM 2006).

[3] V. N. Anh and A. Moffat. Simplified Similarity Scoring Using Term Ranks. In Proc. of the 28th Annual International ACM SIGIR Conference (SIGIR 2005), Salvador, Brazil, August 15-19, 2005.

[4] S. Brin and L. Page. The Anatomy of a Large-scale Hypertextual (Web) Search Engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[5] C. Buckley and A. Lewit. Optimization of Inverted Vector Searches. In Proc. of the 8th Annual SIGIR Conference, pages 97-110, 1985.

[6] M. Burrows. Method for Parsing, Indexing and Searching World-Wide-Web Pages. US Patent 5,864,863, 1999.

[7] S. Büttcher and C. Clarke. A Document-centric Approach to Static Index Pruning in Text Retrieval Systems. In Proc. of the 15th ACM International Conference on Information and Knowledge Management (CIKM 2006), pages 182-189.

[8] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In Proc. of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43-50, 2001.

[9] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In Proc. of the Twentieth ACM Symposium on Principles of Database Systems, 2001.

[10] R. Fagin. Combining Fuzzy Information from Multiple Systems. In Proc. of the Fifteenth ACM Symposium on Principles of Database Systems, 1996.

[11] R. Fagin. Combining Fuzzy Information: an Overview. SIGMOD Record, 31(2):109-118, June 2002.

[12] Google: http://www.google.com.

[13] D. Harman and G. Candela. Retrieving Records from a Gigabyte of Text on a Minicomputer Using Statistical Ranking. Journal of the American Society for Information Science, 41(8):581-589, August 1990.

[14] R. Kaushik, R. Krishnamurthy, J. Naughton, and R. Ramakrishnan. On the Integration of Structure Indexes and Inverted Lists. In SIGMOD 2004.

[15] X. Long and T. Suel. Optimized Query Execution in Large Search Engines with Global Page Ordering. In Proc. of the 29th International Conference on Very Large Data Bases, September 2003.

[16] A. Marian et al. Adaptive Processing of Top-K Queries in XML. In ICDE 2005.

[17] Microsoft Live Search: http://www.live.com.

[18] A. Moffat and J. Zobel. Self-Indexing Inverted Files for Fast Text Retrieval. ACM Transactions on Information Systems, 14(4), 1996.

[19] E. de Moura et al. Improving Web Search Efficiency via a Locality Based Static Pruning Method. In Proc. of the 14th International Conference on World Wide Web, pages 235-244, 2005.

[20] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered Document Retrieval with Frequency-Sorted Indexes. Journal of the American Society for Information Science, 47(10):749-764, May 1996.

[21] Y. Rasolofo and J. Savoy. Term Proximity Scoring for Keyword-Based Retrieval Systems. In Proc. of the 25th European Conference on IR Research (ECIR 2003), pages 207-218, April 2003.

[22] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proc. of TREC-3, November 1994.

[23] R. Song, J. Wen, S. Shi, G. Xin, T. Liu, et al. Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004. In the 2004 Text REtrieval Conference.

[24] M. Theobald, G. Weikum, and R. Schenkel. Top-K Query Evaluation with Probabilistic Guarantees. In Proc. of the 30th VLDB Conference, Toronto, Canada, 2004.

[25] TREC home page: http://trec.nist.gov.

[26] H. Turtle and J. Flood. Query Evaluation: Strategies and Optimizations. Information Processing and Management, 31(6):831-850, 1995.

[27] Yahoo Search: http://search.yahoo.com.

[28] J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Computing Surveys, 38(2), July 2006.
