When Query Expansion Fails

Bodo Billerbeck and Justin Zobel

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, Australia, 3001
{bodob,jz}@cs.rmit.edu.au

Abstract The effectiveness of queries in information retrieval can be improved through query expansion. This technique automatically introduces additional query terms that are statistically likely to match documents on the intended topic. However, query expansion techniques rely on fixed parameters. Our investigation of the effect of varying these parameters shows that the strategy of using fixed values is questionable.

Categories & Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance evaluation

General Terms Algorithms, Measurement and Performance

Introduction Query expansion has been widely investigated as a method for improving the performance of information retrieval [1, 2, 5, 7, 8, 10]. It is the only effective automatic method for solving the problem of vocabulary mismatch: queries are often not well formulated, and may be ambiguous, insufficiently precise, or use terminology that is specific to a country (consider for example the US “wrench” versus the UK “spanner”). Alternatives to query expansion, such as thesaurus-based techniques, have not been as successful [5]. Query expansion or QE, also known as pseudo-relevance feedback or automatic query expansion, is based on the observation that the top-ranked documents for a query have a reasonable probability of being relevant. It can heuristically be assumed that the first 10 (say) matches to a query are relevant; terms from these documents can then be used to form a new query. Several alternative approaches to QE have been described. We have focused on a method shown by Robertson and Walker to be successful at TREC 8 [8], where query expansion yielded an average improvement in effectiveness of about 10%. In this approach, documents are initially ranked using the Okapi BM25 measure [7, 9] applied to the original query. (For a discussion of this formulation, see Sparck Jones, Walker, and Robertson [11].) In common with all query expansion methods, the Okapi approach requires several parameters, with values determined by experiments on a particular test data set. Expansion terms are selected from the R top-ranked documents by first giving every term in those R documents a term selection value (TSV). The E expansion terms with the highest TSV (excluding terms from the original query) are then appended to the original query. In most expansion experiments that have been reported, the key parameters are fixed; for example, the Okapi parameters are typically R = 10 and E = 25. Using comprehensive experiments on a test collection, we show that other choices of values can give higher effectiveness, but that no fixed choice is robust: entirely different values are preferable for other collections. Worse, the best choices vary wildly from query to query. The use of fixed parameters appears to be unjustified.

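To make the expansion procedure concrete, the following sketch expresses it in Python. This is illustrative, not the authors' implementation: the index interfaces rank, doc_terms, and df are assumed names, as is the use of Robertson's offer weight (with the usual 0.5 smoothing) as the term selection value, since the paper names the TSV but does not give its formula.

import math

def relevance_weight(r, R, df, N):
    # Robertson/Sparck Jones relevance weight with 0.5 smoothing.
    # r: top-ranked documents containing the term; R: documents used
    # for feedback; df: document frequency; N: collection size.
    return math.log(((r + 0.5) * (N - df - R + r + 0.5)) /
                    ((R - r + 0.5) * (df - r + 0.5)))

def expand_query(query_terms, rank, doc_terms, df, N, R=10, E=25):
    # Rank with the original query and keep the R top documents.
    top_docs = [set(doc_terms(d)) for d in rank(query_terms)[:R]]
    # Candidates: every term in those documents not already in the query.
    candidates = set().union(*top_docs) - set(query_terms)
    def tsv(term):
        r = sum(1 for doc in top_docs if term in doc)
        return r * relevance_weight(r, R, df(term), N)  # offer weight
    # Append the E candidates with the highest TSV to the query.
    return list(query_terms) + sorted(candidates, key=tsv, reverse=True)[:E]

Here rank(terms) is assumed to return document identifiers in decreasing order of BM25 score, doc_terms(d) the terms of document d, df(t) the document frequency of term t, and N the number of documents in the collection.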
Exploring the parameters In much of the Okapi work, fixed values were used for key parameters. In some experiments, fixed values were used for R and E, the number of documents in the initial ranking and the number of expansion terms, respectively; these values (10 and 25) were chosen empirically. In other experiments, a fixed value was used for R and a fixed upper bound was imposed on the TSV. In our experiments we have primarily used TREC disks 4 and 5 [4] with queries 401–450, the data used in TREC 8 in 1999. In some additional experiments we have used the 10-gigabyte TREC 9 web track collection with queries 451–500. We explored all combinations of R and E from 1 to 100; for each of the 10,000 combinations, we ran all 50 queries and measured average precision. Results for TREC 8 are shown in the left-hand graph of Figure 1. In this graph, the darker the area, the greater the increase in average precision compared to no expansion. Thus, roughly, the greatest improvement was seen for R between 8 and 16 and for E between 7 and 42; choosing R of around 50 also gave good results. The original expansion parameters of R = 10 and E = 25 are just within the dark “best” area. The average precision at this point is 0.254, up from 0.216 with no expansion. These are not quite the best choices; R = 13 and E = 15 gave slightly better results overall, of 0.260. However, the original values are impressively close to these settings. Contrasting results are shown in the middle graph of Figure 1, for the TREC 9 10-gigabyte web collection. Expansion has been much less successful, with little overall improvement observed even in the best case (R = 98 and E = 4); the standard parameters degrade performance. Even at the best point, most queries were better without expansion. More generally, the best parameters vary wildly between queries, as illustrated in the right-hand graph of Figure 1. If QE was applied with the optimal parameters for each query, the average precision would increase to 0.330 from 0.260.

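The parameter sweep itself is simple to express. The sketch below builds on the expand_query function above; queries (query id to term list) and qrels (query id to the set of relevant document ids) are assumed names, not the authors' code, and the retrieval plumbing is as before.

def average_precision(ranked_ids, relevant):
    # Standard non-interpolated average precision.
    hits, total = 0, 0.0
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def sweep_parameters(queries, qrels, rank, doc_terms, df, N):
    # Exhaustive grid over R and E, recording mean average precision
    # over all queries for each parameter pair, as plotted in Figure 1.
    grid = {}
    for R in range(1, 101):
        for E in range(1, 101):
            aps = [average_precision(
                       rank(expand_query(terms, rank, doc_terms, df, N,
                                         R=R, E=E)),
                       qrels[qid])
                   for qid, terms in queries.items()]
            grid[(R, E)] = sum(aps) / len(aps)
    return grid

The best fixed setting for a collection is then max(grid, key=grid.get), whereas the per-query optimum behind the 0.330 figure takes the best pair separately for each query.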
[Figure 1 appears here: three panels, each plotting the number of expansion terms added to the query (E, y-axis, 0 to 100) against the number of ranked documents used for extraction of expansion terms (R, x-axis, 0 to 100).]
Figure 1: Average precision for each number of documents and number of expansion terms. Left: TREC 8 data, disks 4 and 5. Middle: TREC 9 10-gigabyte web data. (Dark: high average precision. Light: low average precision. White: average precision worse than or equal to no expansion.) Right: for each query, the best 100 parameter pairs for the TREC 8 data, shown in decreasing mark size.

Intuition suggests that queries that are effective prior to expansion should be good candidates for QE. This intuition is strengthened by the observation that QE based only on the relevant documents among the top R is superior to QE based on all of them, as we have seen in our experiments and as reported, for example, by Mano and Ogawa [6]. However, using the Pearson coefficient we found no correlation between the average precision achieved by the original query and the degree to which QE improves average precision. An interesting question is then whether some property of the query can be used to predict whether expansion will be effective. We explored a range of query metrics, but without clear success. These included the similarity score of the documents fetched in the original ranking; a measure of how distinct these documents were from the rest of the collection; the specificity of the query terms; and an approximation to query clarity [3]. Sakai and Robertson [10] have suggested varying the parameters per query, by classifying queries into one of 10 bins according to measures such as the similarity score of the highly ranked documents. As we did not observe any correlation between such scores and improvements due to expansion, we are not convinced that this strategy is likely to be successful. Carpineto et al. [2] experimented with varying R for some fixed E, and with varying E for some fixed R, considering the impact on average effectiveness for two data sets. Our work generalises their results.

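The correlation test reported above is straightforward to reproduce. Below is a minimal sketch, assuming per-query average precision has already been computed for the unexpanded and expanded runs; the function and dictionary names are illustrative.

import math

def pearson(xs, ys):
    # Pearson product-moment correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def expansion_correlation(baseline_ap, expanded_ap):
    # Correlate baseline effectiveness with the per-query gain from
    # expansion; a coefficient near zero, as found here, means that the
    # quality of the original query does not predict the benefit of QE.
    qids = sorted(baseline_ap)
    gains = [expanded_ap[q] - baseline_ap[q] for q in qids]
    return pearson([baseline_ap[q] for q in qids], gains)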
Conclusions We have quantified the performance of a successful query expansion technique by exploring its behaviour as the parameters are varied. This exploration has identified an upper bound on the improvement available via the Okapi approach to query expansion on two test collections, and has shown that the use of fixed parameters for all queries can be significantly improved upon. What is not clear is how the parameters should be chosen. We have made a preliminary exploration of a range of options, but have not identified a metric that provides a method for guiding expansion.

Acknowledgements. This research is supported by the Australian Research Council and by the State Government of Victoria.

References
[1] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Text REtrieval Conference, 1994.
[2] C. Carpineto, R. de Mori, G. Romano, and B. Bigi. An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27, 2001.
[3] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proc. ACM-SIGIR Annual Int. Conf. on Research and Development in Information Retrieval, Tampere, Finland, 2002.
[4] D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271–289, 1995.
[5] R. Mandala, T. Tokunaga, and H. Tanaka. Combining multiple evidence from different types of thesaurus for query expansion. In Proc. ACM-SIGIR Annual Int. Conf. on Research and Development in Information Retrieval, Berkeley, California, 1999.
[6] H. Mano and Y. Ogawa. Selecting expansion terms in automatic query expansion. In Proc. ACM-SIGIR Annual Int. Conf. on Research and Development in Information Retrieval, pages 390–391, 2001.
[7] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In NIST Special Publication 500-246: The Eighth Text REtrieval Conference (TREC-8), pages 151–161, Gaithersburg, Maryland, 2000.
[8] S. E. Robertson and S. Walker. Microsoft Cambridge at TREC-9: Filtering track. In NIST Special Publication 500-249: The Ninth Text REtrieval Conference (TREC-9), pages 361–368, Gaithersburg, Maryland, 2001.
[9] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. In Text REtrieval Conference, pages 21–30, 1992.
[10] T. Sakai and S. E. Robertson. Flexible pseudo-relevance feedback using optimization tables. In Proc. ACM-SIGIR Annual Int. Conf. on Research and Development in Information Retrieval, pages 396–397, New Orleans, Louisiana, 2001.
[11] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Parts 1 and 2. Information Processing & Management, 36(6):779–840, 2000.