Adaptive data fusion methods in information retrieval

Shengli Wu and Jieyu Li, Jiangsu University, Zhenjiang, China
Xiaoqin Zeng, Hohai University, Nanjing, China
Yaxin Bi, University of Ulster, Northern Ireland, UK

September 22, 2013

Abstract

Data fusion is currently used extensively in information retrieval for various tasks. It has proved to be a useful technique because it frequently improves retrieval performance. However, almost all prior research on data fusion has assumed a static search environment; dynamic search environments have generally not been considered. In this paper, we investigate adaptive data fusion methods that can change their behavior when the search environment changes. Three adaptive data fusion methods are proposed and investigated. In order to test the proposed methods properly, we generate a benchmark from a historical TREC data set. Experiments with the benchmark show that two of the proposed methods perform well and may potentially be used in practice.

1 Introduction

Data fusion is currently used extensively in information retrieval for various tasks (Wu, 2012b). The technique combines results from different models, systems, or components to improve effectiveness. There has also been research in related areas such as metasearch (Efron, 2009) and learning to rank (Liu, 2011). Many data fusion methods have been investigated, including CombSum (Fox et al., 1993; Fox and Shaw, 1994), CombMNZ (Fox et al., 1993; Fox and Shaw, 1994), the linear combination method (Bartell


et al., 1994; Vogt and Cottrell, 1998, 1999; Wu, 2012c; Wu et al., 2009; Wu and McClean, 2006), the Borda count (Aslam and Montague, 2001), Condorcet fusion (Montague and Aslam, 2002; Wu, 2013), learning-to-rank methods (Burges et al., 2005; Cao et al., 2006; Herschtal and Raskutti, 2004; Niu et al., 2012; Yue et al., 2007), and others (Chen et al., 2011; Dwork et al., 2001; Farah and Vanderpooten, 2007; Lillis et al., 2006; Renda and Straccia, 2003). Numerous experiments with TREC (http://trec.org/), NTCIR (http://research.nii.ac.jp/ntcir/index-en.html), CLEF (http://www.clef-initiative.eu/), LETOR (http://research.microsoft.com/letor/), and other data sets have found that data fusion can improve retrieval effectiveness. In order to carry out a proper data fusion experiment, we usually require a group of information retrieval systems, a collection of documents, a group of queries, and relevance judgments that indicate which documents are relevant to which queries. For each of the queries, we take results from all component information retrieval systems. We may choose to treat all results equally or give more weight to some than to others during the fusion process. If the former is used, then data fusion can take place immediately when component results are available. On the other hand, if the latter is used, then we need to decide the weights for all the information retrieval systems involved before the fusion process. Usually we divide all the queries into two parts, one part used as training data for obtaining weights and the other used as test data for testing data fusion methods; then we switch the two roles. To sum up, when combining results from all component systems across all the queries, all component information retrieval systems and the collection of documents involved are kept unchanged. However, in the real world, things are always changing, and this is especially true for information retrieval. First, a document collection may be constantly changing, for example, in the case of web search. Other examples include many different types of digital libraries, online information services, blogs, and so on. Second, queries issued by users vary over time. Third, information retrieval systems (search engines) may be upgraded from time to time. Therefore, an information retrieval environment can be very dynamic if all or some of the above-mentioned three aspects change considerably over time. Almost all the research conducted previously on data fusion in information retrieval does not consider the dynamic properties of a search environment, including continuous changes in document collections and component search engines. To our knowledge, only a few papers (Bigot et al., 2011; Diamond and Liddy, 1998) investigate query-specific data



fusion methods. Such methods may be useful for routing tasks in which the underlying document collection is updated regularly but the same group of information retrieval systems is involved and the same group of queries is processed again and again. This is a very special case. In contrast, the primary goal of our current research is to investigate how to use the data fusion technique properly in a more general dynamic search environment. In particular, we focus on the situation in which dynamic changes occur in the information retrieval systems involved. Compared with other data fusion methods, linear combination is very flexible, since different weights can be assigned to different component systems. For those methods that treat all results equally, if a poor retrieval result is included, or if some results are more similar to each other than the rest, then the effectiveness of the fused result will be considerably affected, so a strict restriction has to be applied to the selection of component retrieval systems. For the linear combination, the inclusion of any result will potentially improve effectiveness, and even if poor results are included, no harm will be done as long as the weights are properly assigned. Therefore, the linear combination works well in various kinds of situations, such as when component systems/results vary in effectiveness or when some of the systems/results are more similar to each other than the others. For a dynamic search environment, linear combination is also a good option: the weights of all retrieval systems can be changed incrementally so as to reflect changes in the search environment quickly and accurately. The remainder of this paper is organized as follows: Section 2 discusses related work. In Section 3, we describe a benchmark that is generated from a historical TREC data set; the benchmark is able to reflect the dynamic nature of the component retrieval systems and is used in the experiments for testing adaptive data fusion methods. In Section 4, we introduce three adaptive data fusion methods that are applicable in a dynamic search environment. Section 5 presents the settings and results of a group of data fusion experiments. Section 6 concludes the paper.

2 Related work

A lot of research has been conducted on data fusion in information retrieval (Wu, 2012b). Some early experiments on data fusion were reported in Foltz and Dumais (1992), Saracevic and Kantor (1988), and Turtle and Croft (1991). Later, more data fusion methods were introduced and evaluated. They include CombSum and CombMNZ (Fox et al., 1993; Lee, 1997), Markov chain-based methods (Dwork


et al., 2001; Renda and Straccia, 2003), the multiple criteria approach (Farah and Vanderpooten, 2007), the Borda count (Aslam and Montague, 2001), Condorcet voting (Montague and Aslam, 2002; Wu, 2013), the linear combination method (Thompson, 1993; Vogt and Cottrell, 1999; Wu et al., 2009; Wu, 2012c), and others. The linear combination method is very relevant to the study carried out in this paper, so let us review it in some detail. The method uses the following equation

$$ g(d) = \sum_{i=1}^{t} w_i \, s_i(d) \qquad (1) $$

to calculate scores. Here g(d) is the global score that document d obtains during data fusion, s_i(d) is the score that document d obtains from information retrieval system ir_i (1 ≤ i ≤ t), and w_i is the weight assigned to system ir_i. The linear combination method is a generalized form of CombSum, in which the same weight of 1 is assigned to all component retrieval systems. Weights need to be assigned before data fusion. Usually, we use some training data to work out a solution. There are a few different approaches for doing this: by considering performance, similarity among results, or both. For each retrieval system ir_i, suppose its average performance measured by a given measure over a group of training queries is p_i; then p_i is set as ir_i's weight (w_i = p_i). This is the policy of the simple performance-level weighting scheme. The simple performance-level scheme was used by a few researchers in their data fusion experiments, e.g., in Aslam and Montague (2001), Thompson (1993), and Wu and Crestani (2002). In all but one experiment (Thompson, 1993), the linear combination method with this simple weighting scheme outperformed both CombSum and CombMNZ. The experiment reported by Wu et al. (2009) shows that performance-level weighting can be improved by using a power function of performance with a power value between 2 and 6; the new scheme leads to slightly better fusion results than performance-level weighting. On the other hand, in Wu and McClean (2006)'s work, similarities between systems were considered. In their weighting scheme, each component system's similarity to the other component systems is calculated. If a system is similar to the rest, then a smaller weight is assigned to it; if a system is different from the others, then a greater weight is assigned to it. In such a way, more diversified opinions can be taken into consideration by the fusion process and better results are achievable. The opposite approach is taken by Klementiev et al. (2007), in which ordering agreement between retrieval results is rewarded. Therefore,


greater weights are given to those results that are similar to each other and smaller weights are given to those that are different from the others. In Wu (2009, 2012c), both performance and similarity are considered: one of them (Wu, 2009) presents a heuristic method based on statistical principles, in particular the theory of stratified sampling, and the other (Wu, 2012c) discusses how to use multiple linear regression to obtain suitable weights. A considerable number of researchers use machine learning algorithms, such as support vector machines and gradient descent, to optimize the fused results for a given metric or a surrogate metric. Performance metrics optimized include ROCArea (Herschtal and Raskutti, 2004) or modifications of ROCArea (Cao et al., 2006), NDCG (Burges et al., 2005; Niu et al., 2012), average precision (Yue et al., 2007), and ERR (Niu et al., 2012). Since efficiency may be a concern for the aforementioned machine learning methods, some papers (Wang et al., 2010, 2011) propose methods that take both effectiveness and efficiency into consideration. Score normalization is related to data fusion. If raw scores are provided by all component retrieval systems for their retrieved documents, then a very straightforward normalization method is the zero-one method (Lee, 1997), which normalizes the maximum score in a result to 1, the minimum score to 0, and any other score to a value between 0 and 1 by a linear function. The zero-one method can be improved by using a more suitable interval (a, b), where 0 < a < b < 1. This is because, in any given result, the top-ranked documents are not always relevant and the bottom-ranked documents are not always irrelevant. The improved approach is known as the fitting method (Wu et al., 2006). Two other linear score normalization methods, Sum and z-scores, are investigated in Montague and Aslam (2001). For any result, Sum shifts the minimum score to zero and normalizes the sum of the scores to 1, while z-scores shifts the mean to zero and scales the variance to 1. An improved version of z-scores is presented in Webber et al. (2008). An alternative approach to score normalization is to convert ranking information into scores. This type of method can be used if raw scores are not provided by some component retrieval systems, or if raw scores are not reliable even after they are normalized by a linear normalization method. A simple method of this kind is the one used by the Borda count (Aslam and Montague, 2001): suppose that a result has a ranked list of n documents; then the first is given a score of n, the second a score of n − 1, ..., and the last a score of 1. Another method uses the reciprocal function of rank. In their experiment with 4 TREC data sets, Cormack et al. (2009) observe that the following function


$$ score(rank) = \frac{1}{rank + c} \qquad (2) $$

provides good results, especially with a value of 60 for c. Apart from the reciprocal function, the logistic function may also be used (Calvé and Savoy, 2000). The logistic model uses the following function

$$ score(rank) = \frac{e^{a + b \ln(rank)}}{1 + e^{a + b \ln(rank)}} = \frac{1}{1 + e^{-a - b \ln(rank)}} \qquad (3) $$

to calculate scores for all the documents at different ranks. Here a and b are two parameters that are usually determined, for example, from some training data. There are also some other types of score normalization methods. A CDF-based method is presented in Fernandez et al. (2006), where CDF refers to the cumulative distribution function. A signal-to-noise approach is presented in Arampatzis and Kamps (2009). The method presented in Shokouhi (2007) is a mixture of ranking-related relevance scores and normalized raw scores provided by the information retrieval systems involved. The method presented in Gerani et al. (2012) combines score normalization and weight assignment into one. Finally, there are a few papers, such as Bigot et al. (2011) and Diamond and Liddy (1998), that investigate query-specific data fusion methods. Such methods may be useful for routing tasks in which the underlying document collection is updated regularly but the same group of retrieval systems and the same group of queries are used again and again. The goal of this study is to find efficient and effective adaptive data fusion methods when component retrieval systems evolve over time.
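Both rank-to-score models are straightforward to implement. The following Python sketch is our own illustration (the function names and the sample loop are ours, not from the paper); it uses c = 60 for the reciprocal model, as in Cormack et al. (2009), and the logistic coefficients a and b fitted later in Section 5.

```python
import math

def reciprocal_score(rank, c=60):
    # Equation 2: score(rank) = 1 / (rank + c); c = 60 as in Cormack et al. (2009).
    return 1.0 / (rank + c)

def logistic_score(rank, a, b):
    # Equation 3: score(rank) = 1 / (1 + exp(-a - b * ln(rank))).
    return 1.0 / (1.0 + math.exp(-a - b * math.log(rank)))

# Example: normalized scores for the top five ranks
# (a = 0.718, b = -2.183 are the values fitted in Section 5 of this paper).
for rank in range(1, 6):
    print(rank,
          round(reciprocal_score(rank), 4),
          round(logistic_score(rank, a=0.718, b=-2.183), 4))
```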

3 Generating a benchmark from a historical TREC data set

Although quite a number of information retrieval evaluation events such as TREC, NTCIR, and CLEF have been held annually for some time, it is difficult for them to take dynamic search environments into much consideration (Voorhees, 2008). In those events, a task consists of a group of queries. However, while running through all the queries, the same collection of documents is used and the information retrieval systems involved do not evolve over different queries; that is to say, only one of the three aspects is dynamic. Therefore, the data sets used in those events are not ideal for testing adaptive information retrieval systems and data fusion methods, since we expect adaptive data fusion methods to be able to deal with dynamic search environments in which the collection


of documents and/or the information retrieval systems involved may change over time as well. In such situations, it is desirable to produce a benchmark for testing adaptive data fusion methods. Considering the huge cost of producing new data sets (a data set with dynamic properties is even more costly), it is reasonable to reuse some historical data sets from TREC or other events, with any necessary changes, if at all possible. One good candidate for this is the data set of the TREC 2008 blog opinion task. In most TREC tasks, 50 queries are used. However, the TREC 2008 blog opinion task includes 150 queries (queries 851-950 and queries 1001-1050); later in this paper we re-number them 1-150. As a matter of fact, the 2008 blog opinion task has the second largest number of queries among all past years' TREC tasks, after only the TREC 2004 robust track, in which 250 queries were used. When more queries are involved, we are more likely to generate runs with various levels of performance over different queries; it is also possible to use more queries for training and/or testing so as to make the experimental results more reliable. What we do is make individual runs, each of which performs quite differently across different blocks of queries. This can be done by generating some "artificial" runs, each of which is a mixture of blocks from several different original runs. First of all, we divide all 150 queries into 6 blocks, where each block comprises 25 consecutive queries (1-25, 26-50, ..., or 126-150); 6 is chosen arbitrarily, and 4, 5, 7, 8, 9, or 10 might be reasonable options as well. From all 191 original runs, we choose those runs that include 1,000 documents for each of the 150 queries. A few runs that include almost 150,000 (more than 149,900) documents are also chosen. Thus we obtain 104 runs. Removing runs with fewer documents makes the calculation of some quantities (such as similarity between results) in the experiment straightforward; see later for more details. For every original run chosen, we divide it into 6 equal blocks, where each block includes results for 25 queries. In each block, the average performance measured by average precision is calculated for all the runs. Based on this, we divide all the runs into four groups: top, 2nd-class, 3rd-class, and bottom. As the name of each group suggests, the top quarter of runs go into the top group, the next quarter of runs go into the 2nd-class group, and so on. Any artificial run is made up of 6 blocks from different original runs. First of all, we randomly select an original run and take its first block. Then we randomly choose a run and take its second block if x ≠ y holds, where x and y are the groups that the first and second blocks belong to, respectively. If the condition does not hold, we repeat this step until the condition holds. The same process is repeated for the other four blocks, each time making sure that the newly added block and the previous block are from different groups. In such a way, we can


guarantee that in the generated run, any two adjacent blocks perform differently because they belong to different groups. When generating those artificial runs, one more restriction is necessary: for any block in any original run, it can only be used once. Without this restriction, two or more generated results might include the same ranked lists for 25 queries. With this restriction in place, only a limited number of runs can be generated. We managed to generate 60 artificial runs, a size that should be fine for our purposes. See the Appendix for detailed information about the generated runs. Now let us see how different the benchmark (generated runs) is from the original runs. For every original run and generated run, we first calculate average precision for each of the 6 blocks separately, then the standard deviation of those 6 average precision values is calculated for the whole run. Finally, we average those standard deviation values. For the 104 original runs, the averaged standard deviation comes to 0.1389, while for the 60 generated runs it is 0.2353; the latter is larger than the former by 69%. On the other hand, the mean average precision of all 104 original runs is 0.3202, and the mean average precision of all 60 generated runs is 0.3061; there is not much difference between them in this respect. The generated runs are more dynamic, mainly with respect to the information retrieval systems, while other aspects are kept very much the same. Most of the generated runs are very close in performance: the best has an average precision of 0.3711, while the worst has an average precision of 0.2108. Under these circumstances, we expect that equal-treatment data fusion methods such as CombSum should work well with this generated data set.
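The block-shuffling procedure just described can be summarized in a short sketch. This is only our reading of the procedure, with placeholder data structures (run identifiers and a precomputed group assignment per block) standing in for the actual TREC run files.

```python
import random

NUM_BLOCKS = 6  # query blocks 1-25, 26-50, ..., 126-150

def generate_artificial_run(original_runs, group, used_blocks):
    """Assemble one artificial run from 6 blocks taken from different original runs.

    group[(run, block)] is the performance group (top, 2nd, 3rd, bottom) of a run
    on a block; used_blocks records (run, block) pairs already taken, so that no
    block of an original run is used more than once."""
    chosen = []
    prev_group = None
    for block in range(NUM_BLOCKS):
        candidates = [r for r in original_runs
                      if (r, block) not in used_blocks
                      and group[(r, block)] != prev_group]
        if not candidates:
            return None  # only a limited number of artificial runs can be generated
        run = random.choice(candidates)
        used_blocks.add((run, block))
        prev_group = group[(run, block)]
        chosen.append(run)
    return chosen
```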

4 Adaptive data fusion methods

Suppose that we have a group of component systems ir_1, ir_2, ..., ir_t, and each of them provides a ranked list of documents for any query issued. These ranked lists are combined by data fusion. We also assume that, for any individual query issued, the fused result as well as the component results from the component systems will be evaluated after the fused result is presented to the end user. Therefore, when fusing results for a given query, the performance of the component systems on previous queries is available. It may be argued that this condition is quite difficult to satisfy in certain circumstances; however, it is a reasonable assumption for interactive systems. For other types of systems, it is still possible to use some form of feedback provided by users, or simply click-through data, as a kind of pseudo-relevance feedback, and then estimate the performance of the retrieval systems/results.


Similarity between results can be computed automatically without any user interaction. Such information may not be useful for equal-treatment data fusion methods, but it is certainly useful for biased data fusion methods such as the linear combination method. An adaptive data fusion method works in the following way: at the very beginning, since no knowledge about any of the component systems/results is available, we just treat all component systems equally. After the first query has been processed, we have a little knowledge (e.g., effectiveness) about the results involved, so we can update the weights for the linear combination method accordingly. As more queries are processed, we gradually gain more knowledge about them. Moreover, there are different ways of updating the weights. In this study, we investigate three methods, all of which update the weights of the component systems incrementally, once per query. The first is simple performance-squared updating (referred to as PSU later in this paper); the second is a combined weight update based on performance and dissimilarity between component systems, or combined updating for short (referred to as CU); the third obtains its weight update through linear regression analysis (referred to as LRU). PSU is related to performance-squared weighting, which is investigated in Wu et al. (2009) for the linear combination method. PSU uses the following equation to update the weight of any component system:

$$ w_i' = (1 - c) \cdot w_i + c \cdot p^2 \qquad (4) $$

where w_i and w_i' are the weights before and after the update, respectively, p is the performance of retrieval system ir_i on the most recent query, and c is the update rate, which should be set somewhere between 0 and 1. Two extreme situations are c = 1 and c = 0: if c = 1, then w_i' is decided only by p^2 for the current query; if c = 0, then w_i' is decided only by w_i, and no adaptive updating takes place. More details about how to set the parameters for this method are given later. For any component retrieval system ir_i, CU considers both its performance and its dissimilarity with the other retrieval systems involved (Wu, 2009). For the dissimilarity component, we first need to calculate the dissimilarity of any two component results. Here we use the documents' ranking difference in both results to work out the dissimilarity between them (Bar-Ilan et al., 2006). Suppose that two results A and B each have the same number of n ranked documents, where m (m ≤ n) of these documents appear in both A and B, and n − m appear in only one of them. For those n − m documents that appear in only one of the results, we simply assume that they occupy the places from rank n + 1 to rank 2n − m. Thus we can calculate the average rank


difference of all the documents in both results and use it to measure the dissimilarity between the two results. To summarize, we have

$$ diss(A, B) = \frac{1}{n+m} \Bigg\{ \frac{1}{m} \sum_{\substack{i=1,\dots,m \\ d_u^i \in A \wedge d_u^i \in B}} \big| r_A(d_u^i) - r_B(d_u^i) \big| + \frac{1}{n-m} \sum_{\substack{i=1,\dots,n-m \\ d_v^i \in A \wedge d_v^i \notin B}} \big| r_A(d_v^i) - (n+i) \big| + \frac{1}{n-m} \sum_{\substack{i=1,\dots,n-m \\ d_w^i \notin A \wedge d_w^i \in B}} \big| r_B(d_w^i) - (n+i) \big| \Bigg\} \qquad (5) $$

Here r_A(d_i) and r_B(d_i) denote the rank positions of d_i in A and B, respectively. 1/(n+m) is the normalization coefficient, which guarantees that diss(A, B) is in the range of 0 to 1. Based on Equation 5, the dissimilarity weight of L_i (1 ≤ i ≤ t) is defined as

$$ diss(L_i) = \frac{1}{t-1} \sum_{j=1,\dots,t;\; j \neq i} diss(L_i, L_j) \qquad (6) $$
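For illustration, Equations 5 and 6 can be computed directly from two equally long ranked lists, as in the following Python sketch (our code; documents that appear in only one list are assumed to occupy ranks n+1, ..., 2n−m in the other, as stated above).

```python
def dissimilarity(A, B):
    """diss(A, B) as in Equation 5; A and B are equally long ranked lists of document ids."""
    n = len(A)
    rank_A = {d: i + 1 for i, d in enumerate(A)}   # 1-based ranks
    rank_B = {d: i + 1 for i, d in enumerate(B)}
    common = [d for d in A if d in rank_B]
    only_A = [d for d in A if d not in rank_B]
    only_B = [d for d in B if d not in rank_A]
    m = len(common)
    total = sum(abs(rank_A[d] - rank_B[d]) for d in common) / m if m else 0.0
    if n > m:
        # unmatched documents are assumed to sit at ranks n+1, ..., 2n-m in the other list
        total += sum(abs(rank_A[d] - (n + i + 1)) for i, d in enumerate(only_A)) / (n - m)
        total += sum(abs(rank_B[d] - (n + i + 1)) for i, d in enumerate(only_B)) / (n - m)
    return total / (n + m)

def dissimilarity_weight(results):
    """diss(L_i) as in Equation 6: average dissimilarity of result i to all the others."""
    t = len(results)
    return [sum(dissimilarity(results[i], results[j]) for j in range(t) if j != i) / (t - 1)
            for i in range(t)]
```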

For the performance component of CU, we still use the same scheme as in PSU. The combined weight is then defined as

$$ w_i' = (1 - c) \cdot w_i + c \cdot p^2 \cdot diss(L_i) \qquad (7) $$

As in PSU, c is the update rate. Multiple linear regression has been found to be an effective technique for determining the weights of the linear combination method (Wu, 2012c). In this study, we apply multiple linear regression to the data from a single query, not from a large number of queries as in Wu (2012c). This decision is mainly based on the following two considerations: first, because adaptive data fusion methods are used in a dynamic environment and weight assignment and fusion need to be done at run-time, storing and processing data generated from a large number of queries may be too costly; second, the search environment may change very rapidly, so it is unwise to make too much use of possibly outdated historical data. (Using a few more queries, up to 5, for multiple linear regression was also tried, but there was no significant improvement over using a single query.) Suppose that for a query q, the same n documents d_1, d_2, ..., d_n are retrieved by all t component systems, and every component system ir_i assigns a score s_ij to document d_j. Multiple linear regression tries to minimize e in the following equation


$$ e = \sum_{j=1}^{n} \big[ y_j - (\beta_1 s_{1j} + \beta_2 s_{2j} + \dots + \beta_t s_{tj}) \big]^2 \qquad (8) $$

where y_j is the judged score of document d_j (if binary relevance judgment is used, it is 1 for a relevant document and 0 for an irrelevant document), n is the total number of documents retrieved, and β_1, β_2, ..., β_t are the weights of ir_1, ir_2, ..., ir_t, respectively. When y_j (1 ≤ j ≤ n) and s_ij (1 ≤ i ≤ t, 1 ≤ j ≤ n) are known, β_1, β_2, ..., β_t can be calculated. Furthermore, those coefficients can be normalized by

$$ \beta_i' = \frac{t \cdot \beta_i}{\sum_{i=1}^{t} \beta_i} \qquad (9) $$

where t is the number of systems involved. After normalization, the average of all β' values is 1. Analogously to performance-squared weighting, the linear combination method uses the following equation to update the weight of every information retrieval system involved:

$$ w_i' = (1 - c) \cdot w_i + c \cdot \alpha_i \qquad (10) $$

where α_i = β_i' and c is the update rate, as in Equation 4. Theoretically, multiple linear regression should be a very good method: it calculates optimal weights for all the information retrieval systems involved, in the least-squares sense, by minimizing the error of the relevance score estimates of all the documents.
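Putting the three update rules together, a minimal sketch (our code, with numpy assumed for the least-squares step) looks as follows. It assumes that normalized scores for the current query and, where needed, a performance estimate p and dissimilarity weights are already available.

```python
import numpy as np

def fuse(scores, weights):
    # Equation 1: g(d) = sum_i w_i * s_i(d); scores is a (t x n) array of normalized scores.
    return weights @ scores

def psu_update(w, p, c=0.1):
    # Equation 4: w' = (1 - c) * w + c * p^2 (element-wise over the t systems).
    return (1 - c) * w + c * p ** 2

def cu_update(w, p, diss, c=0.1):
    # Equation 7: w' = (1 - c) * w + c * p^2 * diss(L_i).
    return (1 - c) * w + c * p ** 2 * diss

def lru_update(w, scores, judged, c=0.1):
    # Equations 8-10: least-squares fit of the judged scores, then normalize the
    # coefficients so that their average is 1, and blend with the old weights.
    beta, *_ = np.linalg.lstsq(scores.T, judged, rcond=None)
    alpha = len(beta) * beta / beta.sum()
    return (1 - c) * w + c * alpha
```

Here w, p, and diss are length-t arrays; after each query the chosen update rule replaces w before the next query is fused with Equation 1.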

5 Experimental settings and results

Two score normalization methods, the reciprocal model (see Equation 2) and the logistic model (see Equation 3), are tested in this study. The reciprocal model is straightforward, setting c = 60 as in Cormack et al. (2009), while the logistic model needs to obtain its coefficients by training (Calvé and Savoy, 2000). Using the data in some of the original runs, we obtain the values of the two coefficients: a = 0.718 and b = -2.183. In order to obtain reliable results, we need to try a sufficient number of combinations. Thus, from all 60 generated runs, we randomly choose 3, 4, 5, 6, 7, 8, 9, or 10 of them to perform the data fusion experiment. For each given number (3-10), 200 randomly selected combinations are tested, and each combination is chosen independently from all 60 runs. Apart from the three adaptive methods, CombSum is also included for comparison.



For the three adaptive methods, we give equal weight to all component results initially. More specifically, the initial weight of all component systems is 0.1, 0.05, and 1 for PSU, CU, and LRU, respectively. The reason for using these values is to provide a suitable approximation of the normal values of such weights. For example, the average performance of all 60 runs is about 0.3; after a few queries, the average weight of all the component systems under PSU will be close to 0.1 (0.3 * 0.3 = 0.09), so 0.1 should be a proper initial weight for PSU. In each experiment, we need to set a value of c (see Equations 4, 7, and 10) for the three adaptive data fusion methods. For example, if c is set to 0.10, then at each step the updated weight keeps 90% of the old value and takes 10% from the update. c can be set to different values so as to meet different application requirements; the general principle is that the more dynamic the search environment, the larger the value of c. Different update rates are tried in the experiment; see the details later. Several measures, including precision at given document cut-off levels (10 and 100), average precision over all relevant documents, and recall-level precision, are used for retrieval evaluation. We use P@x (where x denotes a particular document level), AP, and RP to denote them, respectively.
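As a sketch of the experimental protocol described above (the run-loading and evaluation code is omitted; fuse_and_evaluate is a placeholder standing in for fusing one combination over all 150 queries and averaging a chosen measure):

```python
import random

def run_experiment(generated_runs, fuse_and_evaluate, trials=200):
    """For each group size 3-10, fuse 200 randomly chosen combinations of the
    generated runs and average the resulting effectiveness scores."""
    averages = {}
    for k in range(3, 11):
        scores = [fuse_and_evaluate(random.sample(generated_runs, k))
                  for _ in range(trials)]
        averages[k] = sum(scores) / trials
    return averages
```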

5.1 Comparison of two normalization models

First of all, let us have a look at the two score normalization methods. We use these two methods to normalize the same group of results, and then apply the same data fusion method, CombSum. Thus, we can compare the two score normalization methods based on the performance of the fused results, which is shown in Figure 1. In Figure 1, the reciprocal model is always better than the logistic model for all four measures. The difference is the largest when P@10 is used, with AP the second largest. On average, the difference between them is 0.97%, 0.25%, 1.64%, and 0.61% for AP, RP, P@10, and P@100, respectively. Because the largest difference occurs with P@10, this suggests that the reciprocal model is especially good at estimating the scores of top-ranked documents. In the following we only present the experimental results obtained with the reciprocal model, because the results obtained with the logistic model are very similar, though a little worse.

Figure 1: Comparison of two normalization methods of converting ranks to scores

5.2 A typical scenario

In a dynamic search environment, it may be possible to measure how dynamic the search environment is. Such information would be useful for us to design good adaptive data fusion methods. For example, we


need to decide the value of the update rate for any of the above-mentioned adaptive data fusion methods. In this study, we do not take this direction due to the high costs involved; if we can find a fixed update rate that works in most dynamic situations, then that is an acceptable solution. A given update rate determines how fast we forget about older queries and learn from the most recent ones. Let us take a look at the effect of different update rates on the cumulative weights of recent queries, as shown in Figure 2. Figure 2 shows the cumulative weights of recent queries with four different update rates: 0.05, 0.10, 0.25, and 0.50. Each point in Figure 2 is the cumulative weight that the x most recent queries contribute, where x is given on the horizontal axis. For example, the point (3, 0.2710) on the curve for 0.10 indicates that the 3 most recent queries account for 27.10% of the weight if an update rate of 0.10 is used. This is because the most recent query counts for 10% of the weight, the next counts for 0.9 * 10% = 9%, and the third counts for 0.81 * 10% = 8.1%; these sum to 10% + 9% + 8.1% = 27.10%. After some deliberation, we believe that 10%-15% is a safe range for the update rate in many situations. For example, if an update rate of 10% is used, then it takes 7 queries to accumulate half of the weight (learning from new queries at a reasonable speed); on the other hand, after 16 queries, 20% of the weight is still determined by older queries (not very


quickly forgetting about older queries). If a very small update rate, for example 5% or less, is used, then it takes quite a while to adapt to the current situation; for example, it takes 14 queries to accumulate half of the weight. If a very large update rate, for example 30% or more, is used, then older queries are forgotten very quickly and the two most recent queries count for more than half of the weight.

Figure 2: The effect of different update rates on cumulative weights of recent queries

With the reciprocal model, we fuse the selected results using the three adaptive methods. In this experiment, c is set to 0.10, so each query contributes 10% of the weight update for the adaptive data fusion methods. The performances of all three adaptive data fusion methods are shown in Tables 1-3, each with a different measure. From Tables 1-3, we can see that PSU and CU are better than LRU and CombSum all the time. The differences between CombSum and the two data fusion methods PSU and CU are statistically significant at a level of 0.005 (p > 99.95%). Occasionally, LRU is better than CombSum (3 or 4 results with AP, and 3 results with RP); otherwise, LRU is worse, and on average LRU is the worst. Comparing the measures used, AP is more favorable to CU and PSU than the other two measures: they are better than CombSum by 3.25% and 1.55%, respectively. The corresponding figures for RP and P@10 are {2.47%, 1.06%} and {1.46%, 0.77%}.
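The cumulative weights plotted in Figure 2 follow directly from the update rule: with update rate c, the x most recent queries jointly contribute a geometric series. The short derivation below (our notation) reproduces the point (3, 0.2710) discussed above.

```latex
c \sum_{k=0}^{x-1} (1-c)^k \;=\; 1 - (1-c)^x,
\qquad \text{e.g. } c = 0.10,\; x = 3:\; 1 - 0.9^3 = 0.271 .
```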


Table 1: Performance (in AP) of the fused result using different data fusion methods (the figures in parentheses are the percentages of improvement of the data fusion methods over CombSum; the figures in black indicate that the differences between those data fusion methods and CombSum are statistically significant at a level of 0.005)

Number of systems  CombSum  PSU             LRU              CU
3                  0.3855   0.3908 (1.38)   0.3890 (0.91)    0.3960 (2.72)
4                  0.4002   0.4055 (1.32)   0.4023 (0.52)    0.4116 (2.85)
5                  0.4130   0.4180 (1.21)   0.4129 (0.00)    0.4243 (2.74)
6                  0.4227   0.4296 (1.63)   0.4227 (0.00)    0.4365 (3.26)
7                  0.4258   0.4323 (1.53)   0.4246 (-0.28)   0.4399 (3.31)
8                  0.4317   0.4392 (1.74)   0.4297 (-0.46)   0.4470 (3.54)
9                  0.4334   0.4411 (1.78)   0.4310 (-0.55)   0.4493 (3.67)
10                 0.4359   0.4434 (1.72)   0.4314 (-1.03)   0.4517 (3.62)
Average            0.4185   0.4250 (1.55)   0.4180 (-0.12)   0.4321 (3.25)

Table 2: Performance (in RP) of the fused result using different data fusion methods (the figures in parentheses are the percentages of improvement of the data fusion methods over CombSum; the figures in black indicate that the differences between those data fusion methods and CombSum are statistically significant at a level of 0.005)

Number of systems  CombSum  PSU             LRU              CU
3                  0.4166   0.4205 (0.94)   0.4185 (0.46)    0.4252 (2.05)
4                  0.4278   0.4319 (0.96)   0.4288 (0.23)    0.4375 (2.27)
5                  0.4379   0.4415 (0.82)   0.4374 (-0.11)   0.4471 (2.10)
6                  0.4456   0.4507 (1.14)   0.4452 (-0.09)   0.4568 (2.51)
7                  0.4479   0.4523 (0.98)   0.4469 (-0.23)   0.4589 (2.46)
8                  0.4519   0.4572 (1.17)   0.4502 (-0.38)   0.4639 (2.65)
9                  0.4535   0.4590 (1.21)   0.4503 (-0.48)   0.4662 (2.80)
10                 0.4547   0.4603 (1.23)   0.4505 (-0.92)   0.4677 (2.86)
Average            0.4420   0.4467 (1.06)   0.4410 (-0.23)   0.4529 (2.47)


Table 3: Performance (in P@10) of the fused result using different data fusion methods (the figures in parentheses are the percentages of improvement of the data fusion methods over CombSum; the figures in black indicate that the differences between those data fusion methods and CombSum are statistically significant at a level of 0.005)

Number of systems  CombSum  PSU             LRU              CU
3                  0.6547   0.6574 (0.41)   0.6453 (-1.43)   0.6586 (0.60)
4                  0.6686   0.6723 (0.55)   0.6602 (-1.26)   0.6754 (1.02)
5                  0.6812   0.6853 (0.60)   0.6711 (-1.48)   0.6879 (0.98)
6                  0.6918   0.6976 (0.84)   0.6812 (-1.53)   0.7018 (1.45)
7                  0.6932   0.6986 (0.78)   0.6826 (-1.40)   0.7044 (1.62)
8                  0.6991   0.7058 (0.96)   0.6884 (-1.53)   0.7117 (1.80)
9                  0.6986   0.7058 (1.03)   0.6875 (-1.59)   0.7130 (2.06)
10                 0.7012   0.7078 (0.94)   0.6873 (-1.98)   0.7157 (2.07)
Average            0.6860   0.6913 (0.77)   0.6754 (-1.55)   0.6960 (1.46)

5.3 Effect of update rate on fusion performance

In the last subsection, we presented experimental results for the data fusion methods with a fixed update rate of 10%. Obviously, it is of interest to find the effect of changing the update rate on the performance of the adaptive data fusion methods. This time we repeat the above experiment but with various update rates; PSU and CU are involved in the experiment. The result is shown in Figure 3. As expected, all the curves, for both methods CU and PSU and both metrics AP and RP, have the same pattern: there is a plateau in the middle, while the curves go down towards the two ends. Of note is the fact that the plateau is very flat and wide (from 0.2 to 0.8), a most useful property: it suggests that the performance of the data fusion methods is not sensitive to the update rate, and any update rate in the range of 0.2-0.8 should be equally viable. Another observation is that the curves go down more quickly at one end (small update rates) than at the other (large update rates). This observation tells us that we should avoid using very small update rates in order to achieve good fusion results. The third observation concerns the performance difference between CU and PSU. The difference between them is consistent; however, it is larger in the middle and becomes smaller towards both ends. In this case, the best performance occurs when an update rate of around 0.5 is used for both PSU and CU. For PSU, the best AP value is 0.4355, which is better than CombSum by 4.06%, and the best RP value is 0.4564, which is better than CombSum by 3.21%. For CU, the best AP value is 0.4378, which is better than CombSum by 4.61%, and the best RP value is 0.4582, which is better than CombSum by 3.67%.



Figure 3: Performances of CU and PSU with different update rates


Figure 4: Performances (AP) of CU and PSU per query with a fixed update rate of 20%



Figure 5: Performance (AP) comparison of CU and PSU with a fixed update rate of 20%


Figure 6: Performance (AP) comparison of CombSum and CU with a fixed update rate of 20% (CombSum is better than CU at the beginning of query blocks 2-6, or queries 26-50, queries 51-75, ...queries 126-150)



Figure 7: Performances (AP) of CU, PSU, the best and the average of all component results when different numbers of component results are fused (a fixed update rate of 20% is used)

5.4 Other observations

In this subsection, we let the adaptive methods use a fixed update rate of 0.20; however, the observations and conclusions should stand if other reasonable update rates are used. In order to have a close look at the adaptive methods, we investigate the performances of the two adaptive data fusion methods, CU and PSU, for each query. The result is shown in Figure 4. We can see that the performances of CU, PSU, and the average of all component results vary from query to query; the performance is very unpredictable if we only refer to historical queries. However, the differences between CU, PSU, and the average of all component results are quite consistent. On average, the performances are 0.4359 (CU), 0.4331 (PSU), and 0.3065 (average of all component results), respectively. The difference between CU and the average of all component results is 42.22%, and the difference between PSU and the average of all component results is 41.31%.


Table 4: Linear model of the performance of data fusion methods based on the number of component results involved (significance level: 99.95%)

Method  Constant  Coefficient  R²
Best    .327      .003         .930
PSU     .386      .007         .911
CU      .386      .008         .912

We also compare the performances of CombSum and CU per query; Figure 6 shows the result. Out of 150 queries, CU outperforms CombSum in 129 queries, CombSum outperforms CU in 20 queries, and there is a draw between them in the first query. However, this is not the complete story. From this figure, we can learn how the adaptive data fusion method CU performs from query to query, especially at certain places of interest. Recall that every generated run is assembled from 6 equal-sized blocks and each block comprises 25 consecutive queries. Let us focus on the queries that are located at the beginning of a block. For clarity, we list all the queries at which CombSum performs better than CU: 26 (0.0027), 38 (0.0010), 51 (0.0254), 76 (0.0482), 78 (0.0293), 80 (0.0029), 92 (0.0007), 101 (0.0007), 102 (0.0098), 103 (0.0080), 104 (0.0037), 111 (0.0133), 113 (0.0161), 117 (0.0012), 122 (0.0043), 126 (0.0320), 127 (0.0157), 128 (0.0039), 145 (0.0019), 150 (0.0125). The query numbers in bold are those either at the beginning of a block or in a series whose leader is at the beginning of a block; there are 10 of them. This shows that CombSum outperforms CU at the beginning of all five blocks (26, 51, 76, 101, and 126). The first block cannot be counted, because at this stage CU assigns equal weights to all component results and there is no difference between CU and CombSum. Finally, we investigate how CU and PSU perform when different numbers of component results are fused. The best component result and the average of all component results are also presented. Compared with the best component result, CU is better by 26% and PSU


is better by 25%. When different numbers of results are fused, the average performance of all component results does not change very much. This is understandable because all component results are chosen by a randomized selection process. The performance (AP) of CU, PSU, and the best component result increases when more component results are used for fusion. All such increases in performance can be approximated by a linear function of the number of results, p = constant + coefficient * number, where p is the performance of the data fusion method, constant and coefficient are obtained from linear regression analysis, and number is the number of results used for fusion. See Table 4 for detailed information. The coefficients obtained for Best, PSU, and CU are .003, .007, and .008, respectively. This shows that CU has the fastest increase rate and Best has the slowest, while PSU is in second place but very close to CU.
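The linear model p = constant + coefficient * number in Table 4 is an ordinary least-squares fit of fusion performance against the number of component results. As a purely illustrative check (our code; note that Table 4 itself is fitted to the Figure 7 data obtained with a 20% update rate, so the exact values differ), the same form can be fitted to the CU column of Table 1:

```python
import numpy as np

number = np.arange(3, 11)                            # 3 to 10 component results
ap_cu = np.array([0.3960, 0.4116, 0.4243, 0.4365,
                  0.4399, 0.4470, 0.4493, 0.4517])   # CU column (AP) of Table 1

coefficient, constant = np.polyfit(number, ap_cu, 1)  # slope first, then intercept
print(f"p = {constant:.3f} + {coefficient:.4f} * number")
```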

6 Conclusions and future work

In this paper, we have presented our research on adaptive data fusion methods. In order to test them, we have generated a benchmark with a dynamic search environment, which comprises 60 artificial runs generated from 104 original runs submitted to the TREC 2008 blog opinion retrieval task. In this benchmark, every run is a mixture of partial results from up to six original runs. The effect is very much as if the information retrieval systems had been changed five times while running those 150 queries, once after every 25 queries. Two methods of converting rankings into scores, the logistic model and the reciprocal model, have been investigated; experiments show that the reciprocal model is better than the logistic model. Three adaptive methods, PSU, CU, and LRU, have been tested. Experiments on the benchmark show that both PSU and CU are better than CombSum, while LRU is not very good. Moreover, PSU and CU consistently perform better than the best component results. This demonstrates that both PSU and CU have excellent potential for practical use in dynamic search environments. The work reported in this paper can be furthered in several directions. Firstly, we could investigate the adaptive fusion methods in more dynamic situations where all three factors, including document collections, queries, and component retrieval systems, change over time, and see whether the proposed methods still work or some modification is needed. Again, a proper benchmark is needed for this proposed research. The second direction is to consider the efficiency of adaptive data fusion


methods. Because the weights are updated at run-time, efficiency is an important issue, especially when the collection of documents is huge and a lot of documents need to be retrieved and included in the result list. A typical example is the Web: according to worldwidewebsize (http://www.worldwidewebsize.com/), the indexed Web contains at least 3.78 billion pages as of 18 August 2013. Some collections used in TREC are very large as well; for example, "ClueWeb09" (http://lemurproject.org/clueweb09.php/) has over 1 billion documents and "ClueWeb12" (http://lemurproject.org/clueweb12.php/) has over 800 million documents. In such situations, retrieval evaluation and weight training can be very time-consuming tasks due to the large number of documents involved. It is desirable to investigate adaptive data fusion methods with partial results, e.g., some top-ranked documents in a result, so that the efficiency of data fusion can be improved. If the weights obtained from partial results are not as effective as those from complete results, then we have to make a compromise between effectiveness and efficiency. The third direction is to apply adaptive data fusion methods to applications such as blog search (Chen et al., 2008; Huang and Croft, 2009), event-based retrieval (Belew, 2008), etc. In these applications, data fusion is an effective approach (Wu, 2012a; Xu et al., 2009) due to the large number of competitive techniques. Although the general principle of data fusion is the same for most applications, specific aspects need to be examined for each of them. For example, in blog retrieval, a user may be interested in finding out the currently trending topics or discovering popular opinion regarding a given topic; in event-based retrieval, event identification and relations between events are of interest to many users.

References

Arampatzis, A. and Kamps, J. (2009). A signal-to-noise approach to score normalization. In Proceedings of the 18th Annual International ACM CIKM Conference, pages 797–806, Hong Kong, China.

Aslam, J. A. and Montague, M. (2001). Models for metasearch. In Proceedings of the 24th Annual International ACM SIGIR Conference, pages 276–284, New Orleans, Louisiana, USA.

Bar-Ilan, J., Mat-Hassan, M., and Levene, M. (2006). Methods for comparing rankings of search engine results. Computer Networks, 50(10):1448–1463.



Bartell, B. T., Cottrell, G. W., and Belew, R. K. (1994). Automatic combination of multiple ranked retrieval systems. In Proceedings of ACM SIGIR'94, pages 173–184, Dublin, Ireland.

Belew, R. (2008). Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press.

Bigot, A., Chrisment, C., Dkaki, T., Hubert, G., and Mothe, J. (2011). Fusing different information retrieval systems according to query-topics: a study based on correlation in information retrieval systems and TREC topics. Information Retrieval, 14(6):617–648.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 89–96, Bonn, Germany.

Calvé, A. L. and Savoy, J. (2000). Database merging strategy based on logistic regression. Information Processing & Management, 36(3):341–359.

Cao, Y., Xu, J., Liu, T., Li, H., Huang, Y., and Hon, H. (2006). Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference, pages 186–193, Seattle, USA.

Chen, S., Wang, F., Song, Y., and Zhang, C. (2011). Semi-supervised ranking aggregation. Information Processing & Management, 47(3):415–425.

Chen, Y., Tsai, F., and Chan, K. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3):581–590.

Cormack, G. V., Clarke, C. L. A., and Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference, pages 758–759, Boston, MA, USA.

Diamond, T. and Liddy, E. D. (1998). Dynamic data fusion. In Proceedings of the TIPSTER'98 Workshop, pages 123–128, Baltimore, USA.

Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, pages 613–622, Hong Kong, China.


Efron, M. (2009). Generative model-based metasearch for data fusion in information retrieval. In Proceedings of the 2009 Joint International Conference on Digital Libraries, pages 153–162, Austin, USA.

Farah, M. and Vanderpooten, D. (2007). An outranking approach for rank aggregation in information retrieval. In Proceedings of the 30th ACM SIGIR Conference, pages 591–598, Amsterdam, The Netherlands.

Fernandez, M., Vallet, D., and Castells, P. (2006). Probabilistic score normalization for rank aggregation. In Proceedings of the 28th European Conference on Information Retrieval (Advances in Information Retrieval), Lecture Notes in Computer Science, Volume 3936/2006, pages 553–556, London, United Kingdom.

Foltz, P. W. and Dumais, S. T. (1992). Personalized information delivery: an analysis of information-filtering methods. Communications of the ACM, 35(12):51–60.

Fox, E. A., Koushik, M. P., Shaw, J., Modlin, R., and Rao, D. (1993). Combining evidence from multiple searches. In The First Text REtrieval Conference (TREC-1), pages 319–328, Gaithersburg, MD, USA.

Fox, E. A. and Shaw, J. (1994). Combination of multiple searches. In The Second Text REtrieval Conference (TREC-2), pages 243–252, Gaithersburg, MD, USA.

Gerani, S., Zhai, C., and Crestani, F. (2012). Score transformation in linear combination for multi-criteria relevance ranking. In Proceedings of the 34th European Conference on IR Research, pages 256–267, Barcelona, Spain.

Herschtal, A. and Raskutti, B. (2004). Optimising area under the ROC curve using gradient descent. In Proceedings of the Twenty-first International Conference on Machine Learning, Banff, Alberta, Canada.

Huang, X. and Croft, B. (2009). A unified relevance model for opinion retrieval. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 947–956, Hong Kong, China.

Klementiev, A., Roth, D., and Small, K. (2007). An unsupervised learning algorithm for rank aggregation. In Proceedings of the 18th European Conference on Machine Learning (LNCS 4701), pages 616–623, Warsaw, Poland.


Lee, J. H. (1997). Analysis of multiple evidence combination. In Proceedings of the 20th Annual International ACM SIGIR Conference, pages 267–275, Philadelphia, Pennsylvania, USA.

Lillis, D., Toolan, F., Collier, R., and Dunnion, J. (2006). ProbFuse: a probabilistic approach to data fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference, pages 139–146, Seattle, Washington, USA.

Liu, T. (2011). Learning to Rank for Information Retrieval. Springer.

Montague, M. and Aslam, J. A. (2001). Relevance score normalization for metasearch. In Proceedings of the ACM CIKM Conference, pages 427–433, Berkeley, USA.

Montague, M. and Aslam, J. A. (2002). Condorcet fusion for improved retrieval. In Proceedings of the ACM CIKM Conference, pages 538–548, McLean, VA, USA.

Niu, S., Guo, J., Lan, Y., and Cheng, X. (2012). Top-k learning to rank: labeling, ranking and evaluation. In Proceedings of the 35th Annual ACM SIGIR Conference, pages 751–760, Portland, USA.

Renda, M. E. and Straccia, U. (2003). Web metasearch: rank vs. score based rank aggregation methods. In Proceedings of the ACM 2003 Symposium on Applied Computing, pages 841–846, Melbourne, USA.

Saracevic, T. and Kantor, P. (1988). A study of information seeking and retrieving, III: searchers, searches, overlap. Journal of the American Society for Information Science, 39(3):197–216.

Shokouhi, M. (2007). Segmentation of search engine results for effective data-fusion. In Advances in Information Retrieval, Proceedings of the 29th European Conference on IR Research, pages 185–197, Rome, Italy.

Thompson, P. (1993). Description of the PRC CEO algorithms for TREC. In The First Text REtrieval Conference (TREC-1), pages 337–342, Gaithersburg, MD, USA.

Turtle, H. and Croft, W. B. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187–222.

Vogt, C. C. and Cottrell, G. W. (1998). Predicting the performance of linearly combined IR systems. In Proceedings of the 21st Annual ACM SIGIR Conference, pages 190–196, Melbourne, Australia.


Vogt, C. C. and Cottrell, G. W. (1999). Fusion via a linear combination of scores. Information Retrieval, 1(3):151–173.

Voorhees, E. M. (2008). On test collections for adaptive information retrieval. Information Processing & Management, 44(6):1879–1885.

Wang, L., Lin, J., and Metzler, D. (2010). Learning to efficiently rank. In Proceedings of the 33rd International ACM SIGIR Conference, pages 138–145, Geneva, Switzerland.

Wang, L., Lin, J., and Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference, pages 105–114, Beijing, China.

Webber, W., Moffat, A., and Zobel, J. (2008). Score standardization for inter-collection comparison of retrieval systems. In Proceedings of the ACM SIGIR Conference, pages 51–58, Singapore, Singapore.

Wu, S. (2009). Applying statistical principles to data fusion in information retrieval. Expert Systems with Applications, 36(2):2997–3006.

Wu, S. (2012a). Applying the data fusion technique to blog opinion retrieval. Expert Systems with Applications, 39(1):1346–1353.

Wu, S. (2012b). Data Fusion in Information Retrieval. Springer.

Wu, S. (2012c). Linear combination of component results in information retrieval. Data & Knowledge Engineering, 71(1):114–126.

Wu, S. (2013). The weighted Condorcet fusion in information retrieval. Information Processing & Management, 49(1):114–126.

Wu, S., Bi, Y., Zeng, X., and Han, L. (2009). Assigning appropriate weights for the linear combination data fusion method in information retrieval. Information Processing & Management, 45(4):413–426.

Wu, S. and Crestani, F. (2002). Data fusion with estimated weights. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, pages 648–651, McLean, VA, USA.

Wu, S., Crestani, F., and Bi, Y. (2006). Evaluating score normalization methods in data fusion. In Proceedings of the 3rd Asia Information Retrieval Symposium (LNCS 4182), pages 642–648, Singapore.

Wu, S. and McClean, S. (2006). Improving high accuracy retrieval by eliminating the uneven correlation effect in data fusion. Journal of the American Society for Information Science and Technology, 57(14):1962–1973.


Xu, T., Yuan, L., and Niu, B. (2009). Data fusion algorithm based on event-driven and minimum delay aggregation path in wireless sensor network. In Emerging Intelligent Computing Technology and Applications With Aspects of Artificial Intelligence, Proceedings of the 5th International Conference on Intelligent Computing, pages 1028–1038, Ulsan, South Korea.

Yue, Y., Finley, T., Radlinski, F., and Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the 30th Annual International ACM SIGIR Conference, pages 271–278, Amsterdam, The Netherlands.


7 Appendix

Part A. The name list of all 104 initial runs used to generate artificial runs (the character in parentheses is used to represent the components in generated runs in Part B) (a) B2DocOpinAZN (b) B2DocOpinSWN (c) B2PsgOpinAZN (d) B2PsgOpinSWN (e) B3DocOpinAZN (f) B3DocOpinSWN (g) B3PsgOpinAZN (h) B3PsgOpinSWN (i) B4DocOpinAZN (j) B4DocOpinSWN (k) b4dt1mRd (l) b4dt1pRd (m) B4PsgOpinAZN (n) B4PsgOpinSWN (o) B5DocOpinAZN (p) B5DocOpinSWN (q) b5dt1mRd (r) b5dt1pRd (s) B5PsgOpinAZN (t) B5PsgOpinSWN (u) b5wt1mRc (v) b5wt1pRc (w) DCUCDVPgo (x) DCUCDVPgonc (y) DCUCDVPgoo (z) DCUCDVPgoonc (A) DCUCDVPtdo (B) DCUCDVPto (C) DCUCDVPtol (D) DCUCDVPtolnc (E) DUTIR08Run3 (F) DUTIR08Run4 (G) FIUBL2DFR (H) FIUBL2PL2c9 (I) FIUBL3DFR (J) FIUBL3PL2c9 (K) FIUBL4DFR (L) FIUBL4PL2c9 (M) FIUBL5DFR (N) FIUBL5PL2c9 (O) KGPBASE4 (P) KLEDocOpinT (Q) KLEDocOpinTD (R) KLEPsgOpinT (S) KLEPsgOpinTD (T) kuo2 (U) NOpMM21 (V) NOpMM23 (W) NOpMM27 (X) NOpMM2opi (Y) NOpMM31 (Z) NOpMM33 (α) NOpMM37 (β) NOpMM3opi (γ) NOpMM41 (δ) NOpMM43 (ǫ) NOpMM47 (ε) NOpMM4opi (ζ) NOpMM51 (η) NOpMM53 (θ) NOpMM57 (ϑ) NOpMM5opi (ι) NOpMMs23 (κ) NOpMMs33 (λ) NOpMMs43 (µ) NOpMMs53 (ν) prisom1 (ξ)top3dt1mRd (o) top3wt1mRc (π) uams08b2pr (̟) uams08b3pr (ρ) uams08b4pr (̺) uams08b5pr (σ) uams08qm4it2 (ς) UB (τ ) uicop1bl2r (υ) uicop1bl3r (φ) uicop1bl4r (ϕ) uicop1bl5r (χ) uicop2bl2r (ψ) uicop2bl3r (ω) uicop2bl4r (Γ) uicop2bl5r (∆) uogOP2intL (Θ) uogOP2ofL (Λ) uogOP2Pr (Ξ) uogOP2PrintL (Π) uogOP3intL (Σ) uogOP3ofL (Υ) uogOP4intL (Ψ) uogOP4ofL (Ω) uogOP4Pr (0) uogOP4PrintL (1) uogOP5extL (2) uogOP5ofL (3) UWnb2Op (4) UWnb3Op (5) UWnb4Op (6) UWnb5Op (7) wdqbdt1mRd (8) wdqbdt1pRd (9) wdqfdt1mRd (+) wdqfdt1pRd (-) york08bo3b


Part B. The list of all artificial runs generated from original runs, each of which is represented by six characters; for example, the last run doObfB in this part means that the ranked lists of documents for queries 1-25, 26-50, 51-75, 76-100, 101-125, and 126-150 are from the corresponding part of d, o, 0, b, f, and B, respectively (see Part A); the figures in parentheses are performance values measured by average precision gzϕ3n4 bsςΣεb mS̟sιε ΛδAp∆e eπγAPa iϑGeXQ (0.3513) (0.2448) (0.2802) (0.3460) (0.3121) (0.3195) ̟λ∆Fδγ EGδGK8 φMσχµ3 κNR̟ζǫ HCρTξM βjΞIϕW (0.3531) (0.3306) (0.2690) (0.2641) (0.2448) (0.3161) πr2Hα5 aγgBǫ+ WtSΛΣα POϑδσθ ΞΓHgλπ δiCWFN (0.2953) (0.3166) (0.3348) (0.3711) (0.2906) (0.3131) ΠBΩMbx 8R5Dmµ cDEφUw 3-ΣVνA qTβυ1c ϕστ 7oϕ (0.2790) (0.3315) (0.3298) (0.2847) (0.3051) (0.3208) ωcz+QΓ AΣǫCπE z2υZψv fαJ8Yν LdDEωO ̺eΥNZd (0.3432) (0.2963) (0.3461) (0.3028) (0.3274) (0.2436) F3tUβl 9UFJpΘ 7WΠΥuX rhφXGυ 4Ωoρ2ξ onIπVy (0.2940) (0.2921) (0.2727) (0.3176) (0.3046) (0.2999) ςεKYΥβ Ψfπ4+n +Aαj̟V ZEχ9wo 2Kιϑcρ CǫLαγT (0.3008) (0.3011) (0.2648) (0.3214) (0.2905) (0.3337) OFΘOρτ η∆uβθη hk3∆R̟ luYωvϑ XmζεT7 wΘκh̺Σ (0.3536) (0.3074) (0.3146) (0.3126) (0.2964) (0.2931) NχMahσ BHηΘS9 D7WǫηY S8Xιτ oϕλκυ̺ Γυ4̺xΛ (0.2108) (0.3260) (0.3176) (0.2932) (0.3697) (0.2891) uψΨσe1 tζ0ςEι slNKς6 VΛ6Ξ-Υ Υ̟θ-Γφ doObfB (0.3145) (0.2963) (0.2618) (0.2985) (0.2959) (0.3584)

