A Statistical Method for System Evaluation Using Incomplete Judgments

Javed A. Aslam∗, Virgil Pavlu, Emine Yilmaz
College of Computer and Information Science, Northeastern University
360 Huntington Ave, #202 WVH, Boston, MA 02115
{jaa,vip,emine}@ccs.neu.edu

ABSTRACT

We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves. Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance is dependent on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance evaluation

General Terms: Theory, Measurement, Experimentation

Keywords: Evaluation, Sampling, Average Precision

∗ We gratefully acknowledge the support provided by NSF grant CCF-0418390.

1. INTRODUCTION

We consider the problem of large-scale retrieval evaluation and propose a statistical technique for efficiently and effectively estimating standard measures of retrieval performance via random sampling. Standard methods of retrieval evaluation can be quite expensive when conducted on a large-scale. For example, in many years of the annual Text REtrieval Conference (TREC), upwards of 100 runs each consisting of a ranked list of up to 1,000 documents were submitted with respect to each of 50 or more topics. In principle, each document in the collection would need to be assessed for relevance with respect to each query in order to evaluate many standard retrieval measures such as average precision and R-precision; in practice, this is prohibitively expensive. TREC instead employs depth pooling wherein the union of the top k documents retrieved in each run corresponding to a given query is formed, and the documents in this depth k pool are judged for relevance with respect to this query. In TREC, k = 100 has been shown to be an effective cutoff in evaluating the relative performance of retrieval systems [8, 12], and while the depth 100 pool is considerably smaller than the document collection, it still engenders a large assessment effort: in TREC8, for example, 86,830 relevance judgments were used to assess the quality of the retrieved lists corresponding to 129 system runs in response to 50 topics [11]. The table given below shows the relationship between pool depth and the number of judgments required per topic on average for various TRECs.

Pool depth    TREC 7    TREC 8    TREC 10
    1             27        29         25
    3             65        71         64
    5             98       110        100
   10            179       200        184
   20            337       379        339
  100           1596      1712       1414
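As a concrete illustration of the pooling procedure described above, here is a minimal sketch of depth-k pooling for a single topic (the run names and document IDs are invented for illustration, not TREC data):

```python
def depth_k_pool(runs, k):
    """Return the depth-k pool: the union of the top-k documents over all runs for one topic.

    `runs` maps a run identifier to its ranked list of document identifiers."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Two hypothetical runs for one topic; the depth-2 pool contains three distinct documents.
runs = {"runA": ["d1", "d2", "d3"], "runB": ["d2", "d4", "d1"]}
print(sorted(depth_k_pool(runs, 2)))   # ['d1', 'd2', 'd4']
```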

Shallower pools [12] and greedily chosen dynamic pools [6, 2] have also been studied in an attempt to alleviate the assessment effort; however, such techniques tend to produce biased estimates of standard retrieval measures, especially when relatively few relevance judgments are used. Recently, Buckley and Voorhees proposed a new measure, bpref, and they show that when bpref is employed with a judged sample drawn uniformly from the depth 100 pool, the retrieval systems can be effectively ranked [5]. However, bpref is not designed to approximate any standard measure of retrieval performance, and while bpref is correlated with average precision, especially at high sampling rates, these correlations and the system rankings produced degenerate at low sampling rates.

Our goal in this work is to efficiently and accurately estimate standard measures of retrieval performance. Unlike previously proposed methodologies which tend to produce biased estimates of standard measures using few relevance judgments, or methodologies based on estimating measures other than the most widely reported standard measures, our methodology, by statistical design, produces unbiased estimates of the standard measures of retrieval performance themselves. The core of our methodology is the derivation, for each measure, of a distribution over documents such that the value of the measure is proportional to the expectation of observing a relevant document drawn according to that distribution. (In the case of average precision, the distribution is over pairs of documents, and the observation is the product of the relevances for the pair drawn.) Given such distributions, one can estimate the expectations (and hence measurement values) using random sampling. By statistical design, such estimates will be unbiased. Furthermore, through the use of the statistical estimation technique of importance sampling [1], we show how low variance estimates of multiple retrieval measures can be simultaneously estimated for multiple runs given a single sample. In sum, we show how both efficient and effective estimates of standard retrieval measures can be inferred from a random sample, thus providing an alternative to large-scale TREC-style evaluations.

We tested our methodology using the benchmark TREC data collections, and results are reported for TRECs 7, 8, and 10. We compare the performance of TREC-style depth pools to sampling pools of an equivalent size. Generally speaking, as the number of relevance assessments is decreased (either by considering a lower depth pool, in the case of TREC, or by sampling at a lower rate, in the case of the proposed technique), the bias and variance of the estimates inferred from TREC-style depth pools increase, while only the variance of estimates inferred from sampling pools increases. At equivalent levels of judgment effort, the estimates produced by sampling are consistently better than those produced by depth pooling, often dramatically so.

While TREC-style depth pools are used to evaluate the systems from which they were generated, they are also often used to evaluate new runs which did not originally contribute to the pool. Depth 100 TREC-style pools generalize well in the sense that they can be used to effectively evaluate new runs, and this has been a tremendous boon to researchers who use TREC data. We also show that random sampling pools generalize well, achieving estimation errors on unseen runs comparable to the estimation errors on runs from which the sample was drawn.

In the sections that follow, we describe our methodology and the results from experiments using the TREC collections. We conclude with a summary and directions for future research.

2. METHODOLOGY

In this section, we sketch our proposed statistical technique for efficiently and effectively estimating standard retrieval measures from random samples.¹ While many of the details are somewhat complex and/or necessarily omitted for space considerations, the basic ideas can be summarized as follows:

1. For each measure, we derive a random variable and associated probability distribution such that the value of the measure in question is proportional to the expectation of the random variable with respect to the probability distribution. For example, to estimate precision-at-cutoff 500, one could simply uniformly sample documents from the top 500 in a given list and output the fraction of relevant documents seen. Thus, the underlying random variable for precision-at-cutoff c is dictated by the binary relevance assessments, and the associated distribution is uniform over the top c documents. (Since R-precision is precision-at-cutoff R, an identical strategy holds, given the value or an estimate of R.) For average precision, the situation is somewhat more complex: we show that the required sampling distribution is over pairs of documents and the underlying random variable is the product of the binary relevance judgments for that pair.

2. Given that the value of a measure can be viewed as the expectation of a random variable, one can apply standard sampling techniques to estimate this expectation and hence the value of the measure. To implement this methodology efficiently, one would like to estimate all retrieval measures for all runs simultaneously using a single judged sample. As such, one is confronted with the task of estimating the expectation of a random variable with respect to a known distribution by using a sample drawn according to a different (but known) distribution. We employ scaling factors used in the statistical estimation field of importance sampling [1] in order to facilitate this goal.

3. Finally, to implement this methodology effectively, one desires low variance unbiased estimators so that the computed empirical means will converge to their true expectations quickly. While we show that any known sampling distribution can be used to yield unbiased estimators for retrieval measures, we also describe a heuristic for generating a specific sampling distribution which is likely to yield low variance estimators.

In the sections that follow, we describe each of these ideas in more detail, and we then present extensive experiments using TREC data.

¹ A somewhat different treatment of this material may be found in our recent workshop work-in-progress report [4].

2.1 Sum precision as an expected value

While our goal is to simultaneously estimate multiple measures of performance over multiple lists, we begin by considering the problem of estimating average precision from a random sample. Unlike R-precision or precision at standard cutoffs, deriving a sampling distribution for average precision is non-trivial, and it yields a distribution which is empirically quite useful for estimating the other measures of interest.

By definition, average precision (AP) is the average of the precisions at all relevant documents, or equivalently, the sum of the precisions at all relevant documents divided by R, the number of documents relevant to the query. Let SP be this sum of precisions at all relevant documents. In what follows, we first discuss estimating SP, and later we discuss estimating R; our estimate of AP will be the ratio of the estimates of SP and R. One can compute sum precision as follows, where Z is the length of the retrieved list, rel(i) is the binary relevance of the document at rank i, and R is the number of documents relevant to the query:

SP = \sum_{i:\, rel(i)=1} PC(i) = \sum_{i=1}^{Z} rel(i) \cdot PC(i) = \sum_{i=1}^{Z} rel(i) \sum_{j=1}^{i} \frac{rel(j)}{i} = \sum_{1 \le j \le i \le Z} \frac{1}{i} \, rel(i) \, rel(j)

Table 1: (Left) Weights associated with pairs of ranks; normalizing by Z yields an asymmetric joint distribution. (Right) Symmetric weights; normalizing by 2Z yields the symmetric joint distribution JD.

  (Left)       1     2     3    ...    Z
     1         1
     2        1/2   1/2
     3        1/3   1/3   1/3
     ...
     Z        1/Z   1/Z   1/Z   ...   1/Z

  (Right)      1     2     3    ...    Z
     1         2    1/2   1/3   ...   1/Z
     2        1/2    1    1/3   ...   1/Z
     3        1/3   1/3   2/3   ...   1/Z
     ...
     Z        1/Z   1/Z   1/Z   ...   2/Z

[Figure 1: AP evaluation diagram]

[Figure 2: PC evaluation diagram]
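As a quick sanity check of this identity, the following sketch (with an invented binary relevance vector) computes SP both directly, as the sum of precisions at relevant ranks, and via the pair formulation above; the two agree:

```python
def sum_precision_direct(rel):
    """Sum of the precision values at the relevant ranks; rel is a 0/1 list, rank = index + 1."""
    sp, hits = 0.0, 0
    for i, r in enumerate(rel, start=1):
        hits += r
        if r:
            sp += hits / i
    return sp

def sum_precision_pairs(rel):
    """The same quantity via the pair formulation: sum over j <= i of rel(i) * rel(j) / i."""
    Z = len(rel)
    return sum(rel[i - 1] * rel[j - 1] / i
               for i in range(1, Z + 1) for j in range(1, i + 1))

rel = [1, 0, 1, 1, 0, 0, 1, 0]                               # hypothetical judgments for one ranked list
print(sum_precision_direct(rel), sum_precision_pairs(rel))   # both ≈ 2.988
```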

Thus, in order to evaluate SP, one must compute the weighted product of relevances of documents at pairs of ranks, where for any pair j ≤ i, the associated weight is 1/i. (See Table 1, left.) In order to view this sum as an expectation, we define an event space corresponding to pairs of ranks (i, j), a random variable X corresponding to the product of the binary relevances rel(i) · rel(j), and an appropriate probability distribution over the event space. One such distribution corresponds to the (appropriately normalized) weights given in Table 1 (left); for convenience, we shall instead define a symmetrized version of these weights (see Table 1, right) and the corresponding joint distribution JD (appropriately normalized by 2Z). It is not difficult to see that

SP = Z \cdot E[X],

where the expectation is computed with respect to either distribution. Thus, if U is a multiset of pairs drawn according to JD, we obtain the following estimate for SP:

\widehat{SP} = Z \cdot \frac{1}{|U|} \sum_{(i,j) \in U} rel(i) \cdot rel(j)

2.2 Simultaneous estimation for multiple runs

One is often faced with the task of evaluating the average precisions of many retrieval systems with respect to a given query (as in TREC), and in a naive implementation of the technique described, the documents judged for one system will not necessarily be reused in judging another system. In contrast, TREC creates a single pool of documents from the collection of runs to be evaluated, judges that pool, and evaluates all of the systems with respect to this single judged pool. In order to combat this potential inefficiency, we shall construct a single distribution over pairs of documents derived from the joint distributions JD_s associated with every system s. We shall effectively be sampling from a distribution different from the one necessary to estimate the expectations desired. To combat this, we introduce scaling factors as follows. Let D(i, j) be the joint distribution over documents from which we effectively sample. Note that i and j now denote documents, not ranks. Similarly abusing notation, let JD_s(i, j) denote the joint distribution over documents (not ranks) required to estimate the proper expectation for system s. We define scaling factors SF_s(i, j) which correspond to the ratio between the desired and sampling distributions:

SF_s(i, j) = \frac{JD_s(i, j)}{D_s(i, j)}

where D_s is the distribution induced by D over documents retrieved by s. We then have

\widehat{SP} = Z_s \cdot \frac{1}{|U_s|} \sum_{(i,j) \in U_s} rel(i) \cdot rel(j) \cdot SF_s(i, j)

where Z_s is the length of the list returned by system s and U_s ⊆ U is the subset of samples corresponding to documents retrieved by s. Note that the above formulation holds for any sampling distribution D. In what follows, we describe a heuristic for determining a good sampling distribution—one which corresponds to a distribution over documents (for efficiency) and which explicitly attempts to minimize the variance in the estimates produced (for accuracy).
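To make the scaling-factor mechanics concrete, here is a minimal sketch of the estimator above for a single run (an illustrative reimplementation, not the authors' code; the ranked list, relevance judgments, and the uniform proposal distribution are invented). Pairs are drawn from an arbitrary known proposal D and reweighted by SF = JD_s / D_s:

```python
import random

def joint_jd(ranked):
    """Symmetric AP weights over pairs of documents (Table 1, right), normalized by 2Z."""
    Z = len(ranked)
    jd = {}
    for i in range(1, Z + 1):                      # ranks are 1-based
        for j in range(1, Z + 1):
            w = 2.0 / i if i == j else 1.0 / max(i, j)
            jd[(ranked[i - 1], ranked[j - 1])] = w / (2.0 * Z)
    return jd

def estimate_sp(ranked, rel, proposal, n_samples=20000, seed=0):
    """Unbiased estimate of sum precision: Z * mean of rel(i)*rel(j)*JD(i,j)/D(i,j)."""
    rng = random.Random(seed)
    jd = joint_jd(ranked)
    pairs, weights = zip(*proposal.items())
    total = 0.0
    for _ in range(n_samples):
        di, dj = rng.choices(pairs, weights=weights, k=1)[0]
        sf = jd.get((di, dj), 0.0) / proposal[(di, dj)]     # scaling factor JD_s / D_s
        total += rel.get(di, 0) * rel.get(dj, 0) * sf
    return len(ranked) * total / n_samples

# Toy check: 5-document list with "d1" and "d3" relevant, uniform proposal over all pairs.
ranked = ["d1", "d2", "d3", "d4", "d5"]
rel = {"d1": 1, "d3": 1}
uniform = {(a, b): 1.0 / 25 for a in ranked for b in ranked}
print(estimate_sp(ranked, rel, uniform), "vs. exact SP =", 1.0 + 2.0 / 3.0)
```

With a different proposal (for instance, the induced-pair distribution derived in the next subsection), only the scaling factors change; the estimate remains unbiased.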


2.3 Deriving the sampling distribution

The technique described above can be used to estimate the average precision of one or more retrieval runs with respect to any given query. However, it is relatively inefficient: on a per run basis, independent and identically distributed (i.i.d.) pairs of documents are drawn and judged, but the induced pairs of judged documents across i.i.d. samples are not used. In order to combat this potential inefficiency, we shall instead draw a sample from a distribution over documents and consider all induced pairs of judgments. In determining a sampling distribution D, we consider two factors. First, we impose the condition that D be a symmetric product distribution, i.e., D(i, j) = M(i) · M(j) for some (marginal) distribution over documents M. The purpose for this is efficiency: we will sample documents according to M and consider all induced pairs of documents, which will be distributed (approximately) according to D. Second, we seek a D which explicitly attempts to minimize the variance in our estimator, for accuracy. We begin by considering the latter factor.

Variance minimization. For a sampling distribution D and a given system s, let D_s be the distribution induced by D over pairs of documents contained in the list returned by system s. Furthermore, let Y be the random variable rel(i) · rel(j) · SF_s(i, j), so that SP = Z_s · E_{D_s}[Y]. Since Z_s and R are fixed, in order to minimize the variance of AP, we must minimize the variance of Y:

Var[Y] = E[Y^2] - E^2[Y]
       = \sum_{i,j} D_s(i, j) \cdot rel(i)^2 \, rel(j)^2 \, SF_s(i, j)^2 - (SP/Z_s)^2
       = \sum_{i,j:\, rel(i)=rel(j)=1} D_s(i, j) \cdot \frac{JD_s(i, j)^2}{D_s(i, j)^2} - (SP/Z_s)^2
       = \sum_{i,j:\, rel(i)=rel(j)=1} \frac{JD_s(i, j)^2}{D_s(i, j)} - (SP/Z_s)^2

To minimize this variance, it is enough to minimize the first term since SP/Z_s is fixed. Employing importance sampling techniques for minimizing the variance of Monte Carlo estimators, one can derive that the best sampling distribution D is the distribution induced by JD_s over relevant documents. (See Anderson [1] for an example of such a derivation.) Of course, we do not have the complete relevance judgments necessary to calculate the ideal sampling distribution. However, the marginal distribution MD_s(i) = \sum_j JD_s(i, j) associated with the average precision sampling distribution JD_s(i, j) has been shown to be a reasonable prior for relevant documents [3], and using such a prior one can argue that a sampling distribution D_s(i, j) proportional to (MD_s(i) · MD_s(j))^{3/2} (appropriately normalized to form a distribution) is likely to result in low variance. (Details omitted for space considerations.) D_s(i, j) is a product distribution having identical marginals with respect to i and j; let M_{D_s}(i) be the marginal associated with D_s(i, j). If our task were to estimate the performance of only one retrieval system, we could sample documents according to M_{D_s}(i), consider all induced pairs of documents, and estimate AP using appropriate scaling factors. However, in general our task is to simultaneously estimate AP for N systems from a single sample. We obtain a final sampling marginal M(i) by averaging the marginals associated with each system s:

M(i) = \frac{1}{N} \sum_s M_{D_s}(i)

We finally note that in a typical TREC setting, one averages AP over 50 queries to obtain a final estimate of the performance of a system, and this averaging results in a further significant variance reduction.

Exact computation of scaling factors. M(i) is the distribution we use for sampling documents, and given a sample of K such documents, we consider all K^2 induced pairs and estimate the required expectations from these induced pairs and appropriate scaling factors. For sufficiently large K, the distribution over induced pairs will approximate the associated product distribution D(i, j) = M(i) · M(j); however, the actual distribution is multinomial. As a consequence, if K_s is the size of the subset of the K sampled documents which are retrieved by system s, one obtains the following final scaling factors:

SF_s(i, j) = \frac{JD_s(i, j)}{I(i, j) \cdot K^2 / K_s^2}

Finally, we derive the exact form of the multinomial sampling distribution over induced pairs. Sampling K documents (with replacement) from a distribution M and forming all K^2 pairs of documents yields a multinomial distribution I over the possible outcomes. Let k = (k_1, k_2, \ldots, k_W) correspond to counts for the sampled documents, where k_1 + k_2 + \cdots + k_W = K and k_i is the count associated with document i. Then

\Pr(k) = \binom{K}{k_1, k_2, \ldots, k_W} \prod_d M(d)^{k_d}.

For i ≠ j and K > 2, the induced pairs distribution is derived as follows:

I(i, j) = \sum_{k = (\ldots, k_i, \ldots, k_j, \ldots)} \frac{k_i \, k_j}{K^2} \cdot \Pr(k)
        = \frac{1}{K^2} \sum_{k_i + k_j \le K} k_i k_j \binom{K}{k_i, k_j, \ldots} M(i)^{k_i} M(j)^{k_j} (1 - M(i) - M(j))^{K - k_i - k_j}
        = \frac{1}{K^2} \sum_{k_i + k_j \le K} \frac{(K-2)! \, (K-1) \, K}{(k_i - 1)! \, (k_j - 1)! \, (K - k_i - k_j)!} \, M(i) M(j) \cdot M(i)^{k_i - 1} M(j)^{k_j - 1} (1 - M(i) - M(j))^{K - k_i - k_j}
        = \frac{K(K-1) M(i) M(j)}{K^2} \sum_{k_i + k_j \le K} \binom{K-2}{k_i - 1, k_j - 1, \ldots} M(i)^{k_i - 1} M(j)^{k_j - 1} (1 - M(i) - M(j))^{K - k_i - k_j}
        = \frac{K-1}{K} M(i) M(j)

When i = j, a similar derivation yields

I(i, i) = \frac{1}{K} M(i) \left( 1 + (K-1) M(i) + (1 - M(i))^{K-1} \right).
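The following sketch pieces together the construction just described for a set of runs (ranked lists and document IDs are invented; the 3/2 exponent and the averaging of per-run marginals follow the text above):

```python
import numpy as np

def ap_marginal(Z):
    """Marginal MD_s over ranks 1..Z of the symmetric joint distribution JD_s (Table 1, right)."""
    W = np.array([[2.0 / i if i == j else 1.0 / max(i, j)
                   for j in range(1, Z + 1)] for i in range(1, Z + 1)])
    return (W / (2.0 * Z)).sum(axis=1)              # MD_s(rank i) = sum_j JD_s(i, j)

def sampling_distribution(runs):
    """Final document sampling distribution M: average over runs of the marginal of D_s,
    where D_s(i, j) is proportional to (MD_s(i) * MD_s(j))^(3/2)."""
    docs = sorted({d for ranking in runs.values() for d in ranking})
    index = {d: t for t, d in enumerate(docs)}
    M = np.zeros(len(docs))
    for ranking in runs.values():
        m = ap_marginal(len(ranking)) ** 1.5        # marginal of D_s, before normalization
        m /= m.sum()
        for rank, d in enumerate(ranking):
            M[index[d]] += m[rank]
    return dict(zip(docs, M / len(runs)))           # M(i) = (1/N) * sum_s M_{D_s}(i)

# Two hypothetical runs; documents retrieved near the top receive most of the sampling mass.
runs = {"runA": ["d1", "d2", "d3", "d4"], "runB": ["d3", "d1", "d5", "d2"]}
print(sampling_distribution(runs))
```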

2.4 Estimating R and AP

To obtain an estimate for AP, we must know or obtain estimates for SP, R and Z_s, and the expectation described above. We have described in detail how to estimate SP, and Z_s is a known quantity (the length of the system's returned list). However, R, the total number of documents relevant to the given query, is not typically known and must also be estimated. Sophisticated approaches for estimating R exist [9]; however, in this preliminary study we employ techniques similar to those described above. In order to estimate R (as calculated by TREC), one could simply uniformly sample documents from the depth 100 pool. Given that our sample is drawn according to M(i) instead, one can employ appropriate scaling factors to obtain the correct estimate.

[Figure 3: Sampling diagram]

2.5 Estimating PC and RP

To estimate precision-at-cutoff c, one could simply uniformly sample documents from the top c in any given list. Given that we sample documents according to M(i), we again employ appropriate scaling factors to obtain correct estimates for PC(c). R-precision is simply the precision-at-cutoff R. We do not know R; however, we can obtain an estimate R̂ for R as described above. Given this estimate, we simply estimate PC(R̂).

2.6 Practical summary

Below we give a summary for implementing the sampling method; Figure 3 can serve as a guide. (1) For each run s, use the joint distribution over pairs of documents, JD_s(i, j), dictated by SP, such that sampling pairs of documents according to JD_s would yield SP in expectation. (2) To minimize variance, we introduce a prior over relevant documents (Figure 3, step 2). Let MD_s(i) be the marginal distribution of JD_s(i, j), and let D_s(i, j) be the (appropriately normalized) joint distribution corresponding to (MD_s(i) · MD_s(j))^{3/2}. Let M_{D_s}(i) be the marginal distribution of D_s; this is the sampling distribution over documents that would be used for run s. (3) Over all runs, compute M(i), the average of these sampling distributions over documents. M(i) is our single, final sampling distribution for the query (all runs). (4) Sample K documents with replacement according to M(i) (Figure 3, black box) until T unique documents are drawn; judge these documents. (T is the desired a priori judgment effort.) Generate all K^2 pairs of judged documents. (5) Compute the multinomial induced pairs distribution (Figure 1, step 1), I(i, j); this is the "effective" distribution over pairs of documents from which we sampled. From I(i, j) and JD_s(i, j), compute the required scaling factors. (6) From the induced pairs and the scaling factors, compute the estimates of SP for each run (Figure 1, black box). (7) Estimate R using the sampled documents drawn according to M(i) and appropriate scaling factors (Figure 3). (8) Estimate AP by the ratio of the estimates for SP and R. Note that the estimate of a ratio is not necessarily the ratio of estimates. More accurate ratio estimates derived via a second order Taylor series approximation [10] were tested, and they were generally found to be of little benefit for the computational effort required.

2.7 New retrieval runs

A question of particular importance is how we can use the samples generated by our method to evaluate a new run, i.e., a run that did not contribute to the sampling distribution. In order for sampling to work correctly, there should be sufficiently many sampled documents in the new run so that the evaluation using sampling is meaningful. The evaluation (estimation) methodology is independent of whether the run participated in the sampling process; therefore it can be applied to new runs in the same way as to the runs used in producing the sample. In the scaling factor computation, the numerator is a function of the ranks of sampled documents in the new list, and the denominator is computed based on the sampling distribution conditioned on the new run. In TREC data, it is already the case that the actual pools are created from only a subset of the submitted runs. In all our experiments, to compute the average sampling distribution, we only use the runs that contributed to the pool (training systems) and use the runs that did not contribute to the pool as testing systems. In the plots that follow, the dots (·) correspond to the training systems, and the pluses (+) correspond to the testing systems.

3. EXPERIMENTAL RESULTS

We tested the proposed sampling method as a mechanism for estimating the performance of retrieval systems using data from TRECs 7, 8 and 10. We used mean average precision (MAP), mean R-precision (MRP), and mean precision at cutoffs 5, 10, 15, 20, 30, 100, 200, 500, and 1000 (MPC(c)) as evaluation measures. We compared the estimates obtained by the sampling method with the "actual" evaluations, i.e., evaluations obtained by depth 100 TREC-style pooling. The estimates are found to be consistently good even when the total number of documents judged is far less than the number of judgments used to calculate the actual evaluations.

To evaluate the quality of our estimates, we calculated three different statistics: the root mean squared (RMS) error (how different the estimated values are from the actual values, i.e., RMS = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (a_i - e_i)^2}, where the a_i are the actual and the e_i are the estimated values), the linear correlation coefficient ρ (how well the actual and estimated values fit a straight line), and Kendall's τ (how well the estimated measures rank the systems compared to the actual rankings).
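For reference, a short sketch of the three statistics as defined above (the MAP values below are fabricated; pearsonr and kendalltau are standard SciPy functions):

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

def evaluation_stats(actual, estimated):
    """RMS error, linear correlation coefficient rho, and Kendall's tau."""
    a = np.asarray(actual, dtype=float)
    e = np.asarray(estimated, dtype=float)
    rms = np.sqrt(np.mean((a - e) ** 2))
    rho = pearsonr(a, e)[0]
    tau = kendalltau(a, e)[0]
    return rms, rho, tau

# Hypothetical MAP values for four runs: depth-100 ("actual") vs. sampling estimates.
actual = [0.31, 0.27, 0.22, 0.18]
estimated = [0.30, 0.28, 0.21, 0.19]
print(evaluation_stats(actual, estimated))
```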


[Figure 4: Sampling vs. depth pooling mean average precision estimates at depths 1 and 10 in TREC8. Each dot (·) corresponds to a distribution-contributor run and each plus (+) to a distribution-non-contributor run (there are 129 runs in TREC8).]

[Figure 5: Sampling vs. depth pooling mean R-precision estimates at depths 1 and 10 in TREC8.]

Note that in contrast to the RMS error, Kendall's τ and ρ do not measure how much the estimated values differ from the actual values. Therefore, even if they indicate perfectly correlated estimated and actual values, the estimates may still not be accurate. Hence, it is much harder to achieve small RMS errors than to achieve high τ or ρ values. Because of this, we mainly focus on the RMS error values when evaluating the performance of the sampling method. Since the performance of the sampling method varies depending on the actual sample, we sampled 10 times and picked a representative sample that exhibited typical performance based on the three evaluation statistics used.

We report the results of the experiments for MAP, MRP, and MPC(100) on TREC8 in Figure 4, Figure 5, and Figure 6, respectively. As can be seen, on TREC8, for both depth=1 (on average 29 judgments/query) and depth=10 (on average 200 judgments/query), there is a significant improvement in all three statistics when sampling is used versus TREC-style pooling for all the measures. The sampling estimates have reduced variance and little or no bias compared to depth pooling estimates. This can be seen from the great reduction in the RMS error when the estimates are obtained via sampling. Furthermore, the bottom-right plots of all three figures show that with as few as 200 relevance judgments on average per query, the sampling method can very accurately estimate the actual measure values, which were obtained using 1,737 relevance judgments on average per query.

Figure 7 illustrates how MAP estimates using TREC-style depth pooling compare in terms of ρ and Kendall's τ with those obtained using sampling as the depth of the pool changes. For depths 1 to 10, we first calculated the number of documents required to be judged using TREC-style depth pooling. Then, for each depth, we formed 10 different samples of the same size as the required judgment set for the corresponding depth and calculated the average ρ (left column) and τ (right column). As can be seen in the figure, for all TRECs the sampling method significantly outperforms the TREC-style depth pooling method at all depths. For comparison purposes, we also include in the plots in the second column the average Kendall's τ value of bpref obtained using random samples [5] of the given size. The Kendall's τ values for bpref are the average values computed over 10 different random sampling distributions.

[Figure 6: Sampling vs. depth pooling mean precision at cutoff 100 estimates at depths 1 and 10 in TREC8.]
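The samples of a given size used in these comparisons are drawn as described in step (4) of the practical summary—sampling with replacement until a target number of unique documents has been judged. A hypothetical helper (assuming M is a dict mapping document IDs to probabilities):

```python
import random

def sample_until_unique(M, target_unique, seed=0):
    """Draw documents with replacement from M until `target_unique` distinct documents appear.

    Returns the full list of draws (needed for the induced-pair counts) and the unique set."""
    assert target_unique <= sum(1 for p in M.values() if p > 0), "budget exceeds support of M"
    rng = random.Random(seed)
    docs, weights = zip(*M.items())
    draws, unique = [], set()
    while len(unique) < target_unique:
        d = rng.choices(docs, weights=weights, k=1)[0]
        draws.append(d)
        unique.add(d)
    return draws, unique
```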

[Figure 7: Linear correlation coefficient and Kendall's τ error comparisons for mean average precision, in TRECs 7, 8 and 10.]

3.1 Per query and per run results

While the MAP (and MRP and MPC) estimates show improved performance over depth-style pooling in both mean and bias, there are certain situations when one needs the results of a single query, hence not taking advantage of the massive variance reduction achieved by averaging over 50 queries. It is certainly not expected to see the same kind of performance on a per query basis; however, our results show definitely usable query estimates (Figure 8). The method described in this paper is self-contained for a query, i.e., estimates for a query are not dependent on any data from other queries. In a different setup, one may want to analyze only one run over all queries (Figure 9). Note that this setup is not self-contained, as the sampling method requires a set of runs (not only the one plotted) and associated judgments.

[Figure 8: Sampling estimates for a query with mixed system performance. Dots (·) represent training runs; pluses (+) represent testing runs.]

[Figure 9: Sampling estimates for a fixed typical run (Sab8A1) with MAP = 0.25, over all queries. Each dot (·) is the AP estimate for one query (50 total); the MAP estimate is plotted as "×".]

3.2 Generalization on new runs

It is important that the performance of sampling over the testing runs is virtually as good as the performance over the training runs. Note that the testing systems do not directly contribute to the sampling pool; their documents are sampled only because they happen to appear in training runs. The trend of the RMS error, as the sample size increases from the depth 1 to the depth 10 equivalent, for training and testing systems, is shown in Figure 10. On the x-axis, the units are the depth-pool-equivalent number of judgments converted into percentages of the depth-100 pool.

[Figure 10: RMS error train/test cross-validation comparisons for MAP, RP, and PC(100) in TRECs 7, 8 and 10. Equivalent depths are indicated on the plot.]

4. CONCLUSIONS AND FUTURE WORK

We propose a statistical technique for efficiently and effectively estimating standard measures of retrieval performance from random samples, and we demonstrate that highly accurate estimates of standard retrieval measures can be obtained from judged subsamples as small as 4% of the standard TREC-style depth 100 pool.

This work leaves open a number of questions for further research: (1) In standard TREC settings, all documents in the depth 100 pool are judged and no documents outside the pool are judged. Our work indicates that more judging effort should be placed on documents near the top of ranked lists (they have high sampling probabilities) and less judging effort should be placed on documents near the bottom of ranked lists (they have low sampling probabilities). What is the optimal sampling distribution, and how does it change as a function of the collection or systems to be evaluated? Consider evaluating retrieval systems on the web or with respect to the TREC Terabyte track, for example. (2) Given that our technique is based on random sampling, one could in principle derive high probability confidence intervals for the estimates obtained, and such confidence intervals would be quite useful in practice. (3) Finally, we have preliminary results which show that given accurate estimates of retrieval measures obtained from a judged subsample, one can accurately infer relevance assessments for the remaining unjudged documents. Such a technique would have obvious benefits in the production of large test collections.

5. REFERENCES

[1] E. C. Anderson. Monte Carlo methods and importance sampling. Lecture Notes for Statistical Genetics, October 1999.
[2] J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In O. Frieder, J. Hammer, S. Quershi, and L. Seligman, editors, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 484–491. ACM Press, November 2003.

[3] J. A. Aslam, V. Pavlu, and E. Yilmaz. Measure-based metasearch. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, and N. Ziviani, editors, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 571–572. ACM Press, August 2005.
[4] J. A. Aslam, V. Pavlu, and E. Yilmaz. A sampling technique for efficiently estimating measures of query retrieval performance using incomplete judgments. In Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, August 2005.
[5] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25–32, New York, NY, USA, 2004. ACM Press.
[6] G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. Efficient construction of large test collections. In Croft et al. [7], pages 282–289.
[7] W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998. ACM Press, New York.

[8] D. Harman. Overview of the third Text REtrieval Conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1–19, Gaithersburg, MD, USA, April 1995. U.S. Government Printing Office, Washington D.C.
[9] P. Kantor, M.-H. Kim, U. Ibraev, and K. Atasoy. Estimating the number of relevant documents in enormous collections. In Proceedings of the 62nd Annual Meeting of the American Society for Information Science, volume 36, pages 507–514, 1999.
[10] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth and Brooks/Cole, 1988.
[11] E. M. Voorhees and D. Harman. Overview of the seventh Text REtrieval Conference (TREC-7). In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 1–24, 1999.
[12] J. Zobel. How reliable are the results of large-scale retrieval experiments? In Croft et al. [7], pages 307–314.
