Global term weights for document retrieval learned from TREC data
W. John Wilbur
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

Received 3 April 2001; revised 25 May 2001
Abstract. A key element in modern text retrieval systems is the weighting of individual words for importance. Early in the development of document retrieval methods it was recognized that performance could be improved if weights were based at least in part on the frequencies of individual terms in the database. This observation led investigators to propose inverse document frequency weighting, which has become the most commonly used approach. Inverse document frequency weighting can be given some justification based on probabilistic arguments. However, many different formulas have been tried and it is difficult to distinguish between these on a purely theoretical basis. Witten, Moffat and Bell have proposed a monotonicity condition as fundamental: ‘a term that appears in many documents should not be regarded as more important than a term that appears in a few’. Based on this monotonicity assumption and probabilistic arguments we show here how the TREC data can be used to learn ideal global weights. Using cross-validation we show that these weights are a modest but statistically significant improvement over IDF weights. One conclusion is that IDF weights are close to optimal within the probabilistic assumptions that are commonly made.
Correspondence to: W.J. Wilbur, National Library of Medicine, Bldg 38A, Rm 8S806, 8600 Rockville Pike, Bethesda, MD 20894, USA. E-mail: [email protected]

1. Introduction

Inverse document frequency (IDF) term weights seem to have been first defined and used by Williams and Perriens [1] in the form:

\[ \mathrm{idf\_w}_t = \log\frac{N}{n_t} \tag{1} \]
where N represents the number of documents in the collection and n_t the number of documents in the collection containing the term t. However, Sparck Jones [2] was the first to systematically study IDF weighting for index terms and to show that such weighting is effective in improving retrieval over simple coordination level matching of documents with queries. She noted that high-frequency terms were less important than low-frequency terms, but nevertheless were needed to obtain good retrieval. Based on this she argued that one should weight terms in the reverse order of their frequencies. She appealed to the Zipf distribution to justify the formula:

\[ \log\frac{N}{n_t} + 1 \tag{2} \]
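As a concrete illustration of these two formulas, here is a minimal Python sketch; the collection size and document frequencies are invented for the example:

```python
import math

def idf_weight(N: int, n_t: int) -> float:
    """Eqn (1): idf_wt = log(N / n_t)."""
    return math.log(N / n_t)

def sparck_jones_weight(N: int, n_t: int) -> float:
    """Eqn (2): log(N / n_t) + 1."""
    return math.log(N / n_t) + 1.0

# Invented collection of one million documents; document frequencies
# chosen to span common, medium and rare terms.
N = 1_000_000
for term, n_t in [("the", 950_000), ("retrieval", 12_000), ("zipf", 40)]:
    print(f"{term:10s} idf={idf_weight(N, n_t):6.3f}  "
          f"sj={sparck_jones_weight(N, n_t):6.3f}")
```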
Currently formula (1) has become a rather standard approach in the field of document retrieval and is known as collection frequency weighting (CFW) [3–5]. This formula can be given some justification as an approximation to relevance-based probabilistic weighting by making simplifying assumptions [3, 6]. However, numerous other methods for assigning global term weights have also been examined [7, 8] and the general conclusion has been that no one method can claim to be clearly superior to all others. Greiff [9] specifically examined the question of the optimality of IDF weighting using concepts of mutual information and weight of evidence. Based on several approximations, he was only able to conclude that IDF weighting should perform well.

Here we wish to examine further the question of how best to compute global term weights for document retrieval. We base our approach on the assumption that Bayesian probabilistic weighting is an important ideal model that we should seek to approximate as closely as possible in order to enhance performance. The Bayesian probabilistic approach generally involves
some assumption about the independence of terms as well as a choice of whether term absence as well as term presence should be used in the weighting. These options were considered by Robertson and Sparck Jones [10] in their investigation of the ideal form for weighting of search terms based on relevance data. Borrowing from their formulation, the particular possibilities that we will find useful are:

● Independence Assumption I2 – the distribution of terms in relevant documents is independent and the distribution of terms in non-relevant documents is independent.
● Ordering Principle O1 – that probable relevance is based only on the presence of search terms in documents.
● Ordering Principle O2 – that probable relevance is based on both the presence of search terms in documents and their absence from documents.
The Independence Assumption I2 is the natural assumption that one needs to allow term weighting to distinguish the relevant from the non-relevant material in the database. This assumption is widely invoked in probabilistic treatments of document retrieval either implicitly or explicitly [4, 11–15]. The assumption can be somewhat weakened as pointed out by Cooper [16], but this observation has no effect on how the assumption is used in practice. We will take the standard approach using I2. Of the two ordering principles O2 is the more correct theoretically, as observed by Robertson and Sparck Jones [10]. While Robertson and Sparck Jones also observed that O2 was superior to O1 in testing on real data, we will find in our use that O1 performs as well as O2 and gives a simpler formula.

While I2 and O1 or O2 provide the necessary foundation for relevance-based weighting, they are generally applied to produce a weight specific for each individual index term from each individual query in a test set. Such maximally specific weights yield the best performance in ranking documents in response to queries. However, such specific weights have no, or very little, generality. They can really only be applied in the precise setting in which they were produced. In order to provide weights with a certain level of generality we pool the data from all the terms of a given frequency in all the queries in the test set. In order to have a generous supply of data for this purpose we use queries 50–200 from the TREC (text retrieval conference) collection [17, 18] in their long form (title+description+narrative). In order to pool the data at different term frequencies we make use of an additional assumption. While this assumption is perhaps part of the folklore of the subject, it has been given a concise expression by Witten et al. [14]:

● Monotonicity Assumption M1 – a term that appears in many documents should not be regarded as being more important than a term that appears in a few.
When coupled with I2 and either O1 or O2 this monotonicity assumption will lead to a natural method of pooling the data to obtain weights.
2. Weighting formulas

Natural weighting of terms arises from a Bayesian analysis of the probability of document relevance to a query based on the terms in the query and in the document. Let q represent a query and d an arbitrary document. Further let dRq stand for ‘d relevant to q’ and dR̄q stand for ‘d not relevant to q’. For any t ∈ q let us set:

\[ p_t = p(t \in d \mid dRq), \qquad \bar{p}_t = p(t \in d \mid d\bar{R}q) \tag{3} \]

Then term weights are given by:

\[ w_t = \log\frac{p_t(1-\bar{p}_t)}{\bar{p}_t(1-p_t)} = \log\frac{p_t}{\bar{p}_t} + \log\frac{1-\bar{p}_t}{1-p_t} \tag{4} \]
under the assumptions I2 and O2. If O1 is used instead of O2 the formula simplifies to:

\[ w_t = \log\frac{p_t}{\bar{p}_t} \tag{5} \]
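To make eqns (4) and (5) concrete, the following sketch computes both weights from given estimates of p_t and p̄_t; the probability values are invented for illustration:

```python
import math

def weight_o2(p_t: float, pbar_t: float) -> float:
    """Eqn (4): weight from term presence and absence (I2 and O2)."""
    return math.log((p_t * (1.0 - pbar_t)) / (pbar_t * (1.0 - p_t)))

def weight_o1(p_t: float, pbar_t: float) -> float:
    """Eqn (5): weight from term presence only (I2 and O1)."""
    return math.log(p_t / pbar_t)

# Invented example: a term occurring in 30% of relevant documents
# but only 2% of non-relevant documents.
p_t, pbar_t = 0.30, 0.02
print(weight_o1(p_t, pbar_t))  # log(15) ≈ 2.71
print(weight_o2(p_t, pbar_t))  # log(21) ≈ 3.04; the absence term adds a little
```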
The derivations of eqns (4) and (5) are given in Robertson and Sparck Jones [10], while the derivation of eqn (4) may also be found in many other sources [3, 4, 11, 12, 15]. We will begin with eqn (5) and show how it may be transformed to a form convenient for estimation. We first apply Bayes’ theorem to derive

\[ \frac{p_t}{\bar{p}_t} = \frac{p(dRq \mid t \in d)}{p(d\bar{R}q \mid t \in d)} \cdot \frac{p(d\bar{R}q)}{p(dRq)} \tag{6} \]

This allows us to rewrite eqn (5) as

\[ w_t = \log\frac{p(dRq \mid t \in d)}{1 - p(dRq \mid t \in d)} + \log\frac{1 - p(dRq)}{p(dRq)} \tag{7} \]
We now consider the assumption M1. This requires that w_t be a non-increasing function of term frequency. Because the second term on the right in eqn (7) is a
constant, this condition is fulfilled if and only if p(dRq | t ∈ d) is a non-increasing function of term frequency. Let us denote the constant term in eqn (7) as

\[ C_r = \log\frac{1 - p(dRq)}{p(dRq)} \tag{8} \]
In order to apply eqns (7) and (8) to estimate weights as defined in eqn (5), we assume we are given a set of queries Q on a database D and that for each q ∈ Q and d ∈ D we have available the information as to which of dRq or dR̄q is true. We then follow a two-step estimation procedure.

2.1. Estimation procedure

Step 1. To calculate C_r estimate the value p(dRq) as the average number of relevant documents per query in Q divided by the number of documents in the database D.

Step 2. For each triple (t, d, q) with d ∈ D, q ∈ Q and t ∈ d ∩ q, define n(t, d, q) to equal 1 if dRq and 0 otherwise. Consider the data points (idf_w_t, n(t, d, q)) and determine the non-decreasing probability function pr(idf_w_t) that maximizes the probability of seeing the data:

\[ \prod_{(t,d,q)} pr(\mathrm{idf\_w}_t)^{n(t,d,q)} \, \bigl(1 - pr(\mathrm{idf\_w}_t)\bigr)^{1-n(t,d,q)} \tag{9} \]
Substitute the value pr(idf_w_t) for p(dRq | t ∈ d) in eqn (7) to complete the estimate for w_t. There is a simple and elegant algorithm for the determination of the non-decreasing maximal likelihood estimator pr(x) in step 2. It is known as the pool adjacent violators (PAV) algorithm [19, 20]. In the operation of the PAV algorithm the data is pooled over all term occurrences in query and document pairs so that the final estimate of p(dRq | t ∈ d) only depends on the frequency of t and is a non-increasing function of that frequency. (For an explanation of the PAV algorithm see the Appendix.) Because the result of PAV is a maximal likelihood estimate there is a sense in which it is optimal under the constraint M1.

Thus far we have shown how to estimate w_t from eqn (5) under the assumption O1. If we instead make the assumption O2 we focus on the right side of eqn (4). The first term is estimated just as described by the estimation procedure above. The same approach is also applied to determine the second term. We first apply Bayes’ theorem:

\[ \frac{1 - p_t}{1 - \bar{p}_t} = \frac{p(t \notin d \mid dRq)}{p(t \notin d \mid d\bar{R}q)} = \frac{p(dRq \mid t \notin d)}{p(d\bar{R}q \mid t \notin d)} \cdot \frac{p(d\bar{R}q)}{p(dRq)} \tag{10} \]
We may take the log to obtain:

\[ \log\frac{1 - p_t}{1 - \bar{p}_t} = \log\frac{p(dRq \mid t \notin d)}{1 - p(dRq \mid t \notin d)} + C_r \tag{11} \]
Now we have used M1 as equivalent to the statement that weights must be non-increasing as term frequency increases. Equivalently, by eqn (7), p(dRq | t ∈ d) must be non-increasing as term frequency increases. In fact one may argue directly from M1 to this statement regarding p(dRq | t ∈ d). Higher frequency terms are generally less important and thus a high-frequency term t ∈ d ∩ q provides less evidence that dRq than would be provided by a lower frequency and generally more important term. A related argument suggests that p(dRq | t ∉ d) should also be non-increasing as term frequency increases. While a low-frequency term t ∈ q is more important and more indicative of dRq when it is present in d, it is less surprising when it is absent from such a related d because of its low frequency. Thus its absence is more permissive of dRq than would be the absence of a higher frequency term from the query. We conclude that p(dRq | t ∉ d) should be non-increasing as term frequency increases (more discussion of this point is given in the last section of the paper). This monotonicity allows us to estimate p(dRq | t ∉ d) from the raw data using the PAV algorithm by the same scheme as outlined in the estimation procedure above. By this means we are able to estimate w_t as defined in eqn (4).

What we have said thus far does not guarantee that the weights determined by eqn (4) will be monotonically non-increasing with increasing frequency. However, in practice the decrease in the term in eqn (4) coming from eqn (6) is much greater than the increase in the term coming from eqn (10) as term frequency increases. This results in a function that at least closely approximates the monotonicity required by M1.
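The estimation procedure can be expressed compactly with an off-the-shelf isotonic regression, which is implemented by pool adjacent violators. The sketch below is ours, not the paper’s code: it assumes arrays of observations (idf_w[i], n[i]) as defined in Step 2 and uses scikit-learn’s IsotonicRegression; the toy data at the end are invented. For 0/1 responses the isotonic least-squares fit coincides with the maximum-likelihood fit of eqn (9).

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def estimate_o1_weights(idf_w, n, avg_rel_per_query, db_size):
    """Estimate w_t via eqn (7) from observations (idf_w[i], n[i]),
    where n[i] is the 0/1 relevance indicator n(t, d, q) of Step 2."""
    # Step 1: the constant C_r of eqn (8) from the prior p(dRq).
    p_rel = avg_rel_per_query / db_size
    c_r = np.log((1.0 - p_rel) / p_rel)

    # Step 2: the non-decreasing maximum-likelihood fit of eqn (9),
    # computed by PAV; clamp away from 0 and 1 so the log-odds exist.
    iso = IsotonicRegression(increasing=True, y_min=1e-9, y_max=1.0 - 1e-9)
    p_fit = iso.fit_transform(idf_w, n)

    # Eqn (7): log-odds of relevance given presence, plus the constant.
    return np.log(p_fit / (1.0 - p_fit)) + c_r

# Invented toy data: relevance grows slowly with the idf weight.
rng = np.random.default_rng(0)
idf_w = np.sort(rng.uniform(0.0, 20.0, size=5000))
n = (rng.uniform(size=5000) < 0.001 + 0.01 * idf_w / 20.0).astype(float)
w = estimate_o1_weights(idf_w, n, avg_rel_per_query=200, db_size=600_000)
```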
3. Learning on TREC data

We applied the estimation procedure outlined above to the long form of queries 50–200 from the TREC collection. In carrying out the estimation procedure for p(dRq | t ∈ d) we obtained 217,641,276 data points. Each data point is of the form (idf_w_t, n(t, d, q)) and regression is on the value idf_w_t. A similar procedure is applied to estimate p(dRq | t ∉ d). In principle the PAV algorithm can be applied directly to the idf_w_t, but it is much easier, and loses no important information, to bin
the idf_w_t values into discrete bins. We do this by dividing the range 0–20 for idf_w_t into 1000 equal-sized bins. The data points in each bin are averaged and this average is given a weight equal to the number of points involved. The result is a set of 1000 weighted data points on which PAV may be conveniently applied. In our case the results are shown in Fig. 1. The values for p(dRq | t ∈ d) vary over almost three orders of magnitude while the values of p(dRq | t ∉ d) vary only by about a factor of two. A general rule is:

\[ p(dRq \mid t \notin d) \le p(dRq) \le p(dRq \mid t \in d) \tag{12} \]
For the TREC data we are studying p(dRq) is 0.000340. Values for p(dRq | t ∈ d) begin just a little above p(dRq) at 0.0005 and end at 0.28, while values for p(dRq | t ∉ d) begin at 0.00014 and end at 0.00029. Based on the curves pictured in Fig. 1 we computed the values of w_t using O1 (eqn (5)) and using O2 (eqn (4)). The results of these calculations are displayed in Fig. 2. The curve for O1 is monotonic non-decreasing as it must be. The curve for O2 fails slightly to be monotonic at the lower end, while overall it approaches the O1 curve more closely at its upper end, in harmony with the fact that the additional term in eqn (4) is non-increasing as idf_w_t increases. The straight line in Fig. 2 passes through the origin and shows how close the computed weights come to being a constant multiple of IDF weights.
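The binning step lends itself to a short sketch. Under the same assumptions as the previous code fragment (scikit-learn’s isotonic fit standing in for PAV; array names are ours), idf weights in the range 0–20 fall into 1000 equal bins, each occupied bin contributes one averaged point, and the fit is weighted by the bin counts:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def binned_pav(idf_w, n, n_bins=1000, lo=0.0, hi=20.0):
    """Average the 0/1 responses within equal-width idf bins, then fit a
    non-decreasing function to the weighted bin averages with PAV."""
    bins = np.clip(((idf_w - lo) / (hi - lo) * n_bins).astype(int),
                   0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)           # points per bin
    sums = np.bincount(bins, weights=n, minlength=n_bins)  # relevant per bin
    used = counts > 0
    centers = lo + (np.arange(n_bins)[used] + 0.5) * (hi - lo) / n_bins
    means = sums[used] / counts[used]
    iso = IsotonicRegression(increasing=True)
    fit = iso.fit_transform(centers, means, sample_weight=counts[used])
    return centers, fit
```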
Fig. 1. The probabilities p(dRq | t ∈ d) and p(dRq | t ∉ d) computed as monotonically non-decreasing maximal likelihood estimators. Both are computed as functions of the IDF term weight using the pool adjacent violators algorithm.
We tested the computed weights obtained from the estimation procedure and displayed in Fig. 2 by using them in place of IDF weights. In order to obtain reasonable retrieval results it is necessary to use local weights as well as global weights on the TREC data. We chose two formulas for local weighting that have appeared in the literature and have been used successfully on the TREC collection in the past. The first is a formula published by Robertson and Walker [21]:

\[ lw_{td} = \frac{tf}{2.0 \cdot \dfrac{len_d}{avgdoclen} + tf} \tag{13} \]
Here tf is by convention the local frequency of the term t in the document d. The value len_d is the total number of term occurrences in d of all terms in d and avgdoclen is the average of len_d over the TREC collection. We shall refer to this formula as LW1. The second formula was published by Ponte and Croft [22]:

\[ lw_{td} = \frac{tf}{1.5 \cdot \dfrac{len_d}{avgdoclen} + 0.5 + tf} \tag{14} \]

and is ascribed by them to Robertson. In a general form it appears in Sparck Jones [3, 5]. We refer to this second formula as LW2.

We tested IDF, O1 and O2 weighting in combination with either LW1 or LW2. The results appear in the table that follows. We give the results as interpolated precisions at 11 recall values as well as the interpolated 11-point average precisions [14]. All results are based on queries 50–200 of the TREC collection.
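The two local weights, as we read them from eqns (13) and (14), translate into the following minimal sketch; variable names follow the text:

```python
def lw1(tf: float, len_d: float, avgdoclen: float) -> float:
    """Eqn (13), Robertson and Walker: tf dampened by document length."""
    return tf / (2.0 * (len_d / avgdoclen) + tf)

def lw2(tf: float, len_d: float, avgdoclen: float) -> float:
    """Eqn (14): a 0.5 offset and a 1.5 slope on the length ratio."""
    return tf / (1.5 * (len_d / avgdoclen) + 0.5 + tf)

# A document of exactly average length with tf = 3 scores 0.6 under both:
print(lw1(3, 500, 500))  # 3 / (2.0 + 3) = 0.6
print(lw2(3, 500, 500))  # 3 / (1.5 + 0.5 + 3) = 0.6
```

In schemes of this kind a document is scored against a query by summing, over the matching terms, the product of the local weight and the global weight (IDF, O1 or O2).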
Fig. 2. Term weights based on I2, M1, and O2 or O1 as a function of the IDF term weight. A constant multiple of IDF weighting is included for comparison.
An examination of Table 1 shows that LW2 outperforms LW1 when combined with either IDF or O1 weightings. Based on this finding LW1 was dropped from further testing. The combination LW2&O2 was found to be inferior to LW2&O1.

Because the combination LW2&O1 gave superior performance we used cross-validation to remove any possibility of overtraining in the results based on O1 learning. We removed each of the 150 queries in turn from the estimation procedure of Section 2 and estimated the weights from eqn (5) based on the remaining data. The re-estimated weights were then used to do the retrieval on the query that had been removed from the estimation procedure. The results on the 150 queries were then combined in the standard way to produce the interpolated precisions in the table. It is evident that the results are almost as good as LW2&O1 without cross-validation. We believe this is not surprising, as the cross-validation in each case removed only a small fraction of the data (on average less than 1%). Also the data set is itself very large and this should minimize overtraining.
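The leave-one-query-out protocol just described fits in a few lines. In this sketch, estimate_weights and evaluate_query are hypothetical stand-ins for the Section 2 estimation procedure and the retrieval run:

```python
def leave_one_query_out(queries, estimate_weights, evaluate_query):
    """For each query, learn global weights from the relevance data of all
    other queries, then run retrieval for the held-out query with them."""
    results = []
    for held_out in queries:
        training = [q for q in queries if q is not held_out]
        weights = estimate_weights(training)  # re-run the estimation procedure
        results.append(evaluate_query(held_out, weights))
    # The per-query results are then combined in the standard way into
    # interpolated precisions.
    return results
```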
4. Discussion

Perhaps the first question that is raised by the results in Table 1 is why the LW2&O2 performance is worse than the LW2&O1 performance. In order to confirm that LW2&O2 is in fact worse we performed statistical testing using the bootstrap shift precision test [23]. One million resamplings of the data confirmed the difference between LW2&O2 and LW2&O1 to be significant at the 0.001 level.

We next ask whether the assumption that p(dRq | t ∉ d) is a non-decreasing function of idf_w_t is appropriate. This assumption underlies the estimation of p(dRq | t ∉ d) and the derivation of the O2 weights as pictured in Fig. 2. We claim that the assumption is reasonable based on the results of applying the regression procedure to the data as pictured in Fig. 1. On the left half of the curve for p(dRq | t ∉ d) the regression procedure yields an increasing function. This would not happen if the function were either decreasing or a constant over this range. Thus the left half of the picture would seem to be an appropriate representation of the function. On the right the regression procedure produces a constant. However, the function has already essentially reached its limit at midpoint according to the relationship (12). Thus over the upper half of its range the function can only be a constant or a decreasing function. Yet the upper half of the range, where frequencies are low, is that part of the range where the argument that p(dRq | t ∉ d) should be non-decreasing seems most certain to be true. This follows from the fact that for any query q the set Rq of documents relevant to q is much smaller than the set NRq of non-relevant documents. Thus as the frequency of an already infrequent term t decreases, the fraction of NRq documents not containing t increases much less than the fraction of Rq documents not containing t increases.
Table 1. Interpolated precisions at eleven recall values for six different retrieval strategies. Summary interpolated 11-point average precisions are in the final row.

Recall      LW1&IDF   LW2&IDF   LW1&O1    LW2&O1    LW2&O2    LW2&O1 (cross-validated)
0.0         0.7735    0.7877    0.7666    0.7890    0.7937    0.7887
0.1         0.4969    0.5167    0.5162    0.5381    0.5324    0.5378
0.2         0.4053    0.4203    0.4270    0.4435    0.4366    0.4424
0.3         0.3368    0.3496    0.3571    0.3745    0.3640    0.3736
0.4         0.2774    0.2908    0.2986    0.3154    0.3033    0.3143
0.5         0.2292    0.2382    0.2485    0.2603    0.2487    0.2591
0.6         0.1764    0.1875    0.1931    0.2067    0.1965    0.2054
0.7         0.1297    0.1373    0.1428    0.1535    0.1456    0.1524
0.8         0.0827    0.0914    0.0927    0.1021    0.0964    0.1014
0.9         0.0419    0.0456    0.0486    0.0540    0.0495    0.0536
1.0         0.0030    0.0034    0.0035    0.0040    0.0036    0.0040
11-pt avg   0.2684    0.2790    0.2813    0.2946    0.2882    0.2939
Of course at very low frequencies neither of these fractions changes very much and one expects an almost constant value for p(dRq | t ∉ d).

Another issue that may have some bearing on the relatively poor performance obtained from LW2&O2 is the validity of the independence assumption I2. While this sort of assumption is routinely made, it is recognized as not strictly true. In fact the presence of low-frequency subject-specific terms often implies the presence also of high-frequency terms of a more general nature. For example the phrases ‘city of San Francisco’ or ‘transforming growth factor-beta gene’ have the general high-frequency terms ‘city’ and ‘gene’ coupled with much more specific low-frequency terms. One way to attempt to correct for this coupling or dependence would be to down-weight the high-frequency terms. O1 in effect does this and it is possible that this is actually beneficial to retrieval. The value of such down-weighting may depend on the number of terms in the average query. A larger number of terms per query increases the likelihood of significant dependencies. The TREC data studied here has 49.7 terms per query on average. We may recall that the original study by Robertson and Sparck Jones [10] concluded that O2 was superior to O1 based on relevance judgments. However, the test sets they worked with had relatively few terms per query (7.9 on average for the Cranfield 1400 collection and 5.3 for the Keen collection). What we have put forward here is only a suggested explanation for an incompletely understood phenomenon.

We find that LW2&O1 (cross-validated) achieves a 5.3% improvement over LW2&IDF in 11-point average precision. Again we performed the bootstrap shift precision test with one million resamplings and confirmed that this difference is significant at least at the 0.001 level. While this level of improvement is not large it may be sufficient to have practical implications. However, we believe it is perhaps most important for what it says about improving IDF weighting. In looking at O1 weighting in Fig. 2 and comparing it with IDF weighting one sees a tendency for O1 weights to be a little smaller than IDF weights on the left, but a little greater on the right. We attempted to smooth out the O1 curve to retain just these features in relation to IDF, but in doing so we lost most of the advantage of O1 over IDF weighting. Based on such attempts we have concluded that the information in O1 that gives improvement over IDF is largely in the many individual steps or wiggles demonstrated in Fig. 2. Greiff [9] has also examined the question of improving on IDF weights and gives data suggesting that by flattening the
IDF weights (the straight line in our Fig. 2) at both the right- and left-hand ends improvements are produced. In fact one can see in Fig. 2 such a flattening in the ideal weights learned based on O1. However, we find the improvement based on such flattening is small. It is important to note that Greiff’s results are given for global weighting alone without any local weights and that such results are overall quite poor. He does follow a precedent in giving results without local weighting [24]. On the other hand it is unclear that such results indicate the performance that might be obtained in a realistic setting where local weights as well as global weights are used.

In summary, we are able to produce a modest (>5%) improvement over IDF weighting in a realistic retrieval setting, but we believe there is essentially no chance that this sort of training generalizes to a different database. Such improvements, if they are attainable on another database, would seem to require specific training on that database. Our conclusions, then, are two-fold: first, IDF weighting is close to optimal; second, in specific circumstances where there is sufficient judged material it may be possible to learn improved weights, but IDF may be the most robust generally applicable method of global weighting.
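For readers who wish to reproduce the significance testing, a simplified paired bootstrap over per-query scores is sketched below. This is a generic stand-in, not the exact bootstrap shift precision test of [23]; it centres the per-query differences and estimates a one-sided p-value by resampling:

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often a mean per-query difference at least as large as
    the observed one arises under the null hypothesis of no difference.
    (The paper used one million resamplings.)"""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    centred = diffs - observed  # shift the differences to enforce the null
    samples = rng.choice(centred, size=(n_resamples, centred.size),
                         replace=True)
    return float((samples.mean(axis=1) >= observed).mean())
```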
Acknowledgment

The author would like to thank an anonymous referee for comments that were helpful in the preparation of the paper.
References

[1] J.H. Williams Jr and M.P. Perriens, Automatic full text indexing and searching system, presented at: IBM Information Systems Symposium, Washington, DC (1968).
[2] K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28(1) (1972) 11–21.
[3] K. Sparck Jones, S. Walker and S.E. Robertson, A Probabilistic Model of Information Retrieval: Development and Status, University of Cambridge Technical Report 446 (1998).
[4] K. Sparck Jones, Information retrieval and artificial intelligence, Artificial Intelligence 114 (1999) 257–281.
[5] K. Sparck Jones, S. Walker and S.E. Robertson, A probabilistic model of information retrieval: development and comparative experiments (Part 2), Information Processing and Management 36 (2000) 809–840.
[6] W.B. Croft and D.J. Harper, Using probabilistic models of document retrieval without relevance information, Journal of Documentation 35(4) (1979) 285–295.
[7] J.S. Ro, An evaluation of the applicability of ranking algorithms to improve the effectiveness of full-text retrieval. II. On the effectiveness of ranking algorithms on full-text retrieval, Journal of the American Society for Information Science 39(3) (1988) 147–160.
[8] J. Zobel and A. Moffat, Exploring the similarity space, ACM SIGIR Forum 32(1) (1998) 18–34.
[9] W.R. Greiff, A theory of term weighting based on exploratory data analysis, presented at: W.B. Croft, A. Moffat and C.J. van Rijsbergen (eds), SIGIR’98, Melbourne, Australia (1998) 21.
[10] S.E. Robertson and K. Sparck Jones, Relevance weighting of search terms, Journal of the American Society for Information Science May–June (1976) 129–146.
[11] C.J. van Rijsbergen, Information Retrieval, 2nd edn (Butterworths, London, 1979).
[12] G. Salton, Automatic Text Processing (Addison-Wesley, Reading, MA, 1989).
[13] K. Sparck Jones, S. Walker and S.E. Robertson, A probabilistic model of information retrieval: development and comparative experiments (Part 1), Information Processing and Management 36 (2000) 779–808.
[14] I.H. Witten, A. Moffat and T.C. Bell, Managing Gigabytes, 2nd edn (Morgan Kaufmann, San Francisco, CA, 1999).
[15] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (Addison-Wesley–Longman, Harlow, 1999).
[16] W. Cooper, Some inconsistencies and misidentified modelling assumptions in probabilistic information retrieval, ACM Transactions on Information Systems 13 (1995) 100–111.
[17] D.K. Harman, Overview of the second text retrieval conference, presented at: D.K. Harman (ed.), The Second Text Retrieval Conference (TREC-2), Gaithersburg, MD, Special Publication 500-215 (1994).
[18] D.K. Harman, Overview of the third text retrieval conference, presented at: D.K. Harman (ed.), The Third Text Retrieval Conference (TREC-3), Gaithersburg, MD, Special Publication 500-225 (1995).
[19] M. Ayer, H.D. Brunk, G.M. Ewing, W.T. Reid and E. Silverman, An empirical distribution function for sampling with incomplete information, Annals of Mathematical Statistics 26 (1954) 641–647.
[20] W. Hardle, Smoothing Techniques: with Implementation in S (Springer, New York, 1991).
[21] S.E. Robertson and S. Walker, Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, presented at: 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1994).
[22] J.M. Ponte and W.B. Croft, A language modeling approach to information retrieval, presented at: W.B. Croft, A. Moffat, C.J. van Rijsbergen, R. Wilkinson and J. Zobel (eds), SIGIR’98, Melbourne, Australia (1998).
[23] W.J. Wilbur, Nonparametric significance tests of retrieval performance comparisons, Journal of Information Science 20(4) (1994) 270–284.
[24] S.E. Robertson and S. Walker, On relevance weights with little information, presented at: N.J. Belkin, D. Narasimhalu and P. Willett (eds), SIGIR’97, Philadelphia, PA (1997) 20.
Appendix: the pool adjacent violators algorithm

Consider the set of data points (idf_w_i, n_i), i = 1, …, N, defined in step 2 of the estimation procedure of Section 2. Here it is understood that for each i, n_i = n(t, d, q) for some triple (t, d, q) with d ∈ D, q ∈ Q, t ∈ d ∩ q, and idf_w_i is the IDF weight for t. Without loss of generality we may assume that these points are arranged in order by the first coordinates so that idf_w_i is a non-decreasing function of the index i. Let us begin by setting:

\[ p_i = n_i, \quad w_i = 1, \quad i = 1, 2, \ldots, N \tag{15} \]
Fig. A1. A depiction of the pool adjacent violators algorithm applied to a small data set. Initial weights of all points are taken to be one. See the Appendix for details.
Then the maximum likelihood estimates pr[idf_w_1], pr[idf_w_2], …, pr[idf_w_N], where pr[idf_w_1] ≤ pr[idf_w_2] ≤ … ≤ pr[idf_w_N], can be obtained as follows: if p_1 ≤ p_2 ≤ … ≤ p_N, then pr[idf_w_i] = p_i, i = 1, 2, …, N. Otherwise we must go through an iterative pooling process. Let us assume that the current number of elements in the sequences {p_i} and {w_i} is N′ (the beginning number is N). If p_k > p_{k+1} for some k (k = 1, 2, …, N′ − 1), the numbers p_k and p_{k+1} are pooled in the sequence p_1, p_2, …, p_{N′} and replaced by the single number which is their weighted average,

\[ \frac{w_k p_k + w_{k+1} p_{k+1}}{w_k + w_{k+1}}. \]
Likewise the numbers w_k and w_{k+1} are replaced in {w_i} by the single number w_k + w_{k+1}. The result of this operation is an ordered set of N′ − 1 numbers {p_i} and corresponding weights {w_i}. This procedure is repeated until an ordered set of numbers is obtained which is monotonically non-decreasing. The effect of the pooling of order violators is illustrated in the lines of Fig. A1. Each line of data represents several pooling operations applied to the data of the previous line. When the algorithm has completed, for each i, pr[idf_w_i] is equal to that one of the final set of numbers to which the original p_i has contributed.
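The pooling procedure transcribes directly into code. The following is a sketch, assuming unit initial weights as in eqn (15); it performs the same poolings as the iterative description above, merging backwards as each point arrives:

```python
def pool_adjacent_violators(p, w=None):
    """Weighted PAV: replace each adjacent violating pair (p_k > p_{k+1})
    by its weighted average, summing the weights, until the sequence is
    non-decreasing; then expand each pooled block to its original points."""
    if w is None:
        w = [1.0] * len(p)  # eqn (15): unit initial weights
    # Each block records [pooled value, pooled weight, original point count].
    blocks = []
    for value, weight in zip(p, w):
        blocks.append([value, weight, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2, c1 + c2])
    # pr[idf_w_i] equals the value of the block that p_i contributed to.
    fitted = []
    for value, _, count in blocks:
        fitted.extend([value] * count)
    return fitted

print(pool_adjacent_violators([0, 1, 0, 0, 1, 1]))
# [0, 0.333..., 0.333..., 0.333..., 1, 1]
```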