The linear combination data fusion method in information retrieval

Shengli Wu 1,2, Yaxin Bi 2, and Xiaoqin Zeng 3

1 School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, China
2 School of Computing and Mathematics, University of Ulster, Northern Ireland, UK
3 College of Computer and Information Engineering, Hehai University, Nanjing, China

Abstract. In information retrieval, data fusion has been investigated by many researchers. Previous investigation and experimentation demonstrate that the linear combination method is an effective data fusion method for combining multiple information retrieval results. One of its advantages is its flexibility, since different weights can be assigned to different component systems so as to obtain better fusion results. However, how to obtain suitable weights for all the component retrieval systems is still an open problem. In this paper, we use the multiple linear regression technique to obtain optimum weights for all involved component systems. The weights are optimum in the least squares sense: they minimize the difference between the estimated scores of all documents obtained by linear combination and the judged scores of those documents. Our experiments with four groups of runs submitted to TREC show that the linear combination method with such weights steadily outperforms the best component system and other major data fusion methods, such as CombSum, CombMNZ, and the linear combination method with the performance level or performance square weighting schema, by large margins.

1 Introduction

In information retrieval, the data fusion technique has been investigated by many researchers, and quite a few data fusion methods, such as CombSum [7,8], CombMNZ [7,8], the linear combination method [2,19,20,21,23], the multiple criteria approach [6], the correlation method [26,27], Borda count [1], Condorcet fusion [14], Markov chain-based methods [4,16], and others [5,9,11,17,28], have been presented and investigated. Previous research demonstrates that, generally speaking, data fusion is an effective technique for obtaining better results. Data fusion methods in information retrieval can be divided into two categories: score-based methods and rank-based methods. These two types of methods apply to different situations. If some or all component systems only provide a ranked list of documents as the result of a query, then rank-based methods can be applied. If every component system provides scores for

all retrieved documents, then score-based methods can be applied. Among all these methods, CombSum [7,8], CombMNZ [7,8], Borda count [1] 4, the correlation method [26,27], and the linear combination method [2,19,20,21,23] are score-based methods, while Condorcet fusion [14], Markov chain-based methods [4,16], and the multiple criteria approach [6] are rank-based methods. Among the score-based data fusion methods, the linear combination method is very flexible since different weights can be used for different component systems. It is especially beneficial when the component systems involved vary considerably in performance and in their dissimilarity to each other. However, how to assign suitable weights to the component systems involved is a crucial issue that needs to be carefully investigated. Some previous investigations (e.g., [1,18]) took a simple performance level policy. That is, for every component system, its average performance (e.g., average precision) is measured using a group of training queries, and the value obtained is then used as the weight for that component system. Some alternatives to the simple performance level weighting schema were investigated in [23]. On the other hand, both system performance and dissimilarities between systems were considered in [27] to determine all systems' weights. Although these weighting schemas can usually bring performance improvement over CombSum and CombMNZ – the two commonly used data fusion methods that assign equal weights to all component systems – it is very likely that there is still room for further improvement, since all these weighting schemas are heuristic in nature. In this paper we investigate a new approach, multiple linear regression, to training system weights for data fusion. Our experiments demonstrate that this approach is more effective than other approaches. To our knowledge, multiple linear regression has not been used for such a purpose before, though it has been used in various other tasks in information retrieval. A related issue is how to obtain reliable scores for retrieved documents. Sometimes scores are provided by component retrieval systems, but usually those raw scores from different retrieval systems are not comparable and some kind of score normalization is needed before fusing them. An alternative is to convert ranking information into scores if raw scores are not available or not reliable. Different models [1,3,12,15,22] have been investigated for such a purpose. In this piece of work, we use the logistic regression model for this purpose. This technique has been investigated for distributed information retrieval [3,15], but not for data fusion before. The rest of this paper is organized as follows: in Section 2 we review some related work on data fusion. Then in Section 3 we discuss the linear combination method for data fusion and especially the two major issues involved. Section 4 presents the experiment settings and results for evaluating a group of data fusion methods including the one presented in this paper. Section 5 concludes the paper.

4 Borda count is a little special. It does not require that component systems provide scores for retrieved documents; it converts ranking information into scores instead. Since it uses the same combination algorithm as CombSum (or CombMNZ) for fusion, we classify it as a score-based method.

2 Related work

In this section we review some previous work on the linear combination method and on score normalization methods for data fusion in information retrieval. Probably it was Thompson who reported the earliest work on the linear combination data fusion method, in [18]. The linear combination method was used to fuse results submitted to TREC 1. It was found that the combined results, weighted by performance level, performed no better than a combination using uniform weights (CombSum). Weighted Borda count [1] is a variation of performance level weighting, in which documents at every rank are assigned corresponding scores using Borda count, and then the linear combination method with performance level weighting is used for fusion. In another variation, every system's performance is estimated automatically without any training data, and then the linear combination method is applied [24]. In work by Wu et al. [23], experiments show that using a power function of performance, with a power value between 2 and 6, can lead to slightly better fusion results than performance level weighting. On the other hand, in Wu and McClean's work [26,27], both system performance and dissimilarities among systems were considered. In their weighting schema, any system's weight is a compound weight that is the product of its performance weight and its dissimilarity weight. The combination of performance weights and dissimilarity weights was justified using statistical principles and sampling theories by Wu in [21]. Bartell and Cottrell [2] used a numerical optimisation method, conjugate gradient, to search for good weights for component systems. Because the method is time-consuming, only 2 to 3 component systems and the top 15 documents returned from each system for a given query were considered in their investigation. Vogt and Cottrell [19,20] analysed the performance of the linear combination method. In their experiments, they used all possible pairs of the 61 systems submitted to the TREC 5 ad-hoc track. Another numerical optimisation method, golden section search, was used to search for good system weights. Due to the nature of the golden section search, only two component systems can be considered in each fusion process. Generally speaking, these brute-force search methods are very time-consuming and their usability in many applications is very limited. An associated problem is how to obtain reliable scores for retrieved documents. This problem is not trivial by any means. Sometimes component information retrieval systems provide scores for their retrieved documents. However, those raw scores from different retrieval systems may not be comparable, and some kind of normalization is required for those raw scores. One common linear score normalization method is the following: for any list of scores (associated with a ranked list of documents) for a given topic or query, we map the highest score to 1, the lowest score to 0, and any other score to a value between 0 and 1 by using the formula

n_score = (raw_score − min_score) / (max_score − min_score)    (1)

Here max_score is the highest score, min_score is the lowest score, raw_score is the score to be normalized, and n_score is the normalized score of raw_score. This normalization method was used by Lee [10] and others in their experiments. The above zero-one normalization method can be improved in some situations. For example, in the TREC workshop, each system is usually required to submit 1000 documents for any given query. In such a situation, the top-ranked documents in the list are not always relevant, and the bottom-ranked documents are not always irrelevant. Therefore, Wu, Crestani, and Bi [25] considered that [a, b] (0 < a < b < 1) should be a more suitable range than [0, 1] for score normalization. Experiments with TREC data were conducted to support this idea. Two other linear score normalization methods, SUM and ZMUV (Zero-Mean and Unit-Variance), were investigated by Montague and Aslam [13]. In SUM, the minimal score is mapped to 0 and the sum of all scores in the result to 1. In ZMUV, the average of all scores is mapped to 0 and their variance to 1. If only a ranked list of documents is provided without any scoring information, then we cannot use the linear combination method directly. One usual way of dealing with this is to assign a given score to documents at a particular rank. For example, Borda count [1] works like this: for a ranked list of t documents, the first document in the list is given a score of t, the second document in the list is given a score of t − 1, ..., and the last document in the list is given a score of 1. Thus all documents are assigned corresponding scores based on their rank positions, and CombSum or the linear combination method can be applied accordingly. Some non-linear methods for score normalization have also been discussed in [3,12,15,22]. In [12], an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents were used, and a mixture model was then defined to fit the real score distribution. The cubic regression model for the relation between documents' rank and degree of relevance was investigated in [22]. The logistic regression model was investigated in [3,15] in the environment of distributed information retrieval; however, this model can also be used in data fusion without any problem.
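To make the two rank/score conversions above concrete, here is a minimal sketch of the zero-one normalization of Equation 1 and the Borda-style rank-to-score assignment. The function names and sample inputs are illustrative assumptions, not code from any of the cited papers.

```python
def zero_one_normalize(raw_scores):
    """Equation 1: map raw scores to [0, 1] by min-max normalization."""
    max_score, min_score = max(raw_scores), min(raw_scores)
    if max_score == min_score:        # degenerate case: all scores identical
        return [0.0 for _ in raw_scores]
    return [(s - min_score) / (max_score - min_score) for s in raw_scores]


def borda_scores(ranked_docs):
    """Borda count conversion: t ranked documents get scores t, t-1, ..., 1."""
    t = len(ranked_docs)
    return {doc: t - rank for rank, doc in enumerate(ranked_docs)}


print(zero_one_normalize([12.4, 7.1, 3.3]))   # [1.0, 0.4175..., 0.0]
print(borda_scores(["d3", "d7", "d1"]))       # {'d3': 3, 'd7': 2, 'd1': 1}
```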

3 The linear combination method

Suppose we have n information retrieval systems ir_1, ir_2, ..., ir_n, and for a given query q, each of them provides a result r_i. Each r_i comprises a ranked list of documents with associated scores. The linear combination method uses the following formula to calculate the score of every document d:

M(d, q) = Σ_{i=1}^{n} β_i * s_i(d, q)    (2)

Here s_i(d, q) is the normalized score of document d in result r_i, β_i is the weight assigned to system ir_i, and M(d, q) is the calculated score of d for q. All the documents can then be ranked according to their calculated scores. There are two major issues that need to be considered in order to use the linear combination method properly. Firstly, every document retrieved by a retrieval system should have a reliable score that indicates the degree of relevance of that document to the query. Secondly, appropriate weights need to be assigned to all component systems for the best possible fusion results. Next let us discuss these two issues one by one. In this research, we use the binary logistic regression model to estimate the relation between the ranking position of a document and its degree of relevance, because we consider it a good, reliable method. However, some other score normalization methods may also be able to achieve results similar to those presented in this paper. The logistic model can be expressed by the following equation:

score(t) = e^{a+b*t} / (1 + e^{a+b*t}) = 1 / (1 + e^{-a-b*t})    (3)

In Equation 3, score(t) is the degree to which the document at rank t is relevant to the query, and a and b are two parameters. As in [3], we use ln(t) to replace t in Equation 3:

score(t) = e^{a+b*ln(t)} / (1 + e^{a+b*ln(t)}) = 1 / (1 + e^{-a-b*ln(t)})    (4)

The values of the parameters a and b can be determined from training data, as we shall see later in Section 4. The second issue is how to train weights for all the component systems involved. In our work we use the multiple linear regression model to do this. Suppose there are m queries, n information retrieval systems, and a total of r documents in a document collection D. For each query q^i, all information retrieval systems provide scores for all the documents in the collection. Therefore, we have (s^i_{1k}, s^i_{2k}, ..., s^i_{nk}, y^i_k) for i = 1, 2, ..., m and k = 1, 2, ..., r. Here s^i_{jk} stands for the score assigned by retrieval system ir_j to document d_k for query q^i, and y^i_k is the judged relevance score of d_k for query q^i. If binary relevance judgment is used, then it is 1 for relevant documents and 0 otherwise. Now we want to estimate

Y = {y^i_k; i = 1, 2, ..., m; k = 1, 2, ..., r} by a linear combination of scores from all component systems. The least squares estimates of the β's are the values β̂_0, β̂_1, β̂_2, ..., β̂_n for which the quantity

q = Σ_{i=1}^{m} Σ_{k=1}^{r} [y^i_k − (β̂_0 + β̂_1 s^i_{1k} + β̂_2 s^i_{2k} + ... + β̂_n s^i_{nk})]^2    (5)

is a minimum. This is partly a matter of mathematical convenience and partly due to the fact that many relationships are actually of this form or can be

approximated closely by linear equations. β_0, β_1, β_2, ..., and β_n, the multiple regression coefficients, are numerical constants that can be determined from observed data. In the least squares sense, the coefficients obtained by multiple linear regression give us the optimum fusion results for the linear combination method, since they can be used to make the most accurate estimation of the relevance scores of all the documents to all the queries as a whole. In a sense this technique for fusion performance improvement is general, since it is not directly associated with any particular metric. In information retrieval, all commonly used metrics are rank-based, and therefore efforts have been focused on how to improve rankings directly. This can partially explain why a method like this has not been investigated before. However, theoretically, this approach should work well for any reasonably defined rank-based metric, including average precision, recall-level precision, and others, since more accurately estimated scores for all the documents enable us to obtain better rankings of those documents. In the linear combination method, the multiple regression coefficients β_1, β_2, ..., and β_n are used as weights for ir_1, ir_2, ..., ir_n, respectively. Note that we do not need to use β_0 to calculate scores, since the relative ranking positions of all the documents are not affected if a constant β_0 is added to all the scores.
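To make this step concrete, the following sketch estimates the coefficients of Equation 5 by ordinary least squares and then fuses the normalized scores of a new document with Equation 2, dropping β̂_0 as noted above. It uses made-up data and NumPy as an assumption; it is not the authors' actual implementation (the paper reports using R for the regression).

```python
import numpy as np

# One row per (query, document) pair; one column per component system
# (normalized scores from Equation 4), plus the judged relevance y.
S = np.array([
    [0.77, 0.52, 0.10],
    [0.66, 0.60, 0.05],
    [0.10, 0.45, 0.70],
    [0.05, 0.08, 0.65],
])
y = np.array([1.0, 1.0, 0.0, 0.0])      # 1 = relevant, 0 = irrelevant

# Least squares fit of y ≈ β0 + β1*s1 + ... + βn*sn  (Equation 5)
X = np.hstack([np.ones((S.shape[0], 1)), S])   # prepend a column of 1s for β0
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
weights = beta[1:]                              # β0 is not needed for ranking

# Equation 2: fused score M(d, q) = Σ βi * si(d, q) for a new document
new_doc_scores = np.array([0.70, 0.55, 0.12])
fused_score = float(weights @ new_doc_scores)
print(weights, fused_score)
```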

4 Experiments

In this section we present our experiment settings and results with four groups of runs submitted to TREC (TRECs 5-8, ad hoc track). Their information is summarized in Table 1.

Table 1. Information of the four groups of runs submitted to the ad-hoc track in TREC

Group     Total number of runs   Number of selected runs   Number of queries
TREC 5            61                       53                      50
TREC 6            74                       57                      50
TREC 7           103                       82                      50
TREC 8           129                      108                      50

We only use those runs that include 1000 documents for every query. In each year group, some runs include fewer than 1000 documents for some queries. Removing those runs with fewer documents provides us with a homogeneous environment for the investigation, and we are still left with quite a large number of runs in each year group. In some runs, the raw scores assigned to documents are not reasonable. Therefore, we do not use those raw scores in our experiment. Instead, we use the binary logistic model to generate estimated scores for documents at different ranks. In each year group, all the runs are put together. Then we use ln(t) as the independent variable and score(t) as the dependent variable to obtain the coefficients of the logistic model. The coefficients we obtain are shown in Table 2.

Table 2. Coefficients obtained using binary logistic regression

Group      a       b
TREC 5    .821   -.692
TREC 6    .823   -.721
TREC 7   1.218   -.765
TREC 8   1.362   -.763
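As an illustration of how such coefficients could be obtained, the sketch below fits a binary logistic regression with ln(t) as the independent variable; here we interpret the dependent variable as the binary relevance judgment of the document at rank t, whose fitted probability is score(t). The toy data and the use of scikit-learn are assumptions for illustration only, not the setup actually used on the pooled TREC runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one row per (run, query, rank) with its relevance judgment.
ranks = np.array([1, 2, 3, 5, 10, 50, 100, 500, 1000, 1, 3, 20, 200, 800])
relevant = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0])

X = np.log(ranks).reshape(-1, 1)        # independent variable: ln(t)
model = LogisticRegression().fit(X, relevant)

a = model.intercept_[0]                 # plays the role of a in Equation 4
b = model.coef_[0][0]                   # plays the role of b in Equation 4


def score(t):
    """Equation 4: estimated degree of relevance of the document at rank t."""
    return 1.0 / (1.0 + np.exp(-a - b * np.log(t)))


print(a, b, score(1), score(10))
```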

Fig. 1. Corresponding scores of documents at different ranks based on the logistic model (coefficients are shown in Table 2). The figure plots the assigned score (y-axis, roughly 0 to 0.8) against the ranking position 1-1000 (x-axis), with one curve per group (TREC 5-8).

Note that in a year group, for any query and any result (run), the score assigned to any document is only decided by the ranking position of the document. For example, for TREC 7, we obtain a = 1.218 and b = -0.765. It follows from Equation 4 that

score(t) = 1 / (1 + e^{-1.218 + 0.765*ln(t)})    (6)

Therefore, score(1) = 0.7717, score(2) = 0.6655,..., and so on. On the one hand, such scores are reliable since they come from the logistic regression model with all the data. On the other hand, such scores are not optimum by any means since we treat all submitted runs and all queries equally. As a matter of fact, there is considerable variance from query to query, from one submitted run to another.
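A quick numerical check of Equation 6 can be done by plugging in the TREC 7 coefficients from Table 2; the short sketch below reproduces score(1) ≈ 0.7717 and score(2) ≈ 0.6655 quoted above.

```python
import math

a, b = 1.218, -0.765          # TREC 7 coefficients from Table 2


def score(t):
    """Equation 6: score assigned to the document at rank t (TREC 7)."""
    return 1.0 / (1.0 + math.exp(-a - b * math.log(t)))


print(round(score(1), 4))     # 0.7717
print(round(score(2), 4))     # 0.6655
print(round(score(1000), 4))  # 0.0168 -- low-ranked documents get small scores
```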

In Table 2, it can be noticed that the pair of parameters obtained for TREC 5 and the pair for TREC 6 are very close to each other, as are the pairs for TREC 7 and TREC 8, while the parameters for TRECs 5 & 6 seem quite different from those for TRECs 7 & 8. However, the actual difference between TRECs 5 & 6 and TRECs 7 & 8 may not be as big as the parameters suggest. To give a more intuitive picture, we use a graph (Figure 1) to present the calculated scores based on those parameters. In Figure 1, we can see that all four curves are very close to each other, which demonstrates that they are indeed very similar.
With such normalized scores for the submitted runs, the next step is to use multiple linear regression to train system weights and carry out fusion experiments. For the 50 queries in each year group, we divided them into two groups: odd-numbered queries and even-numbered queries. First we used odd-numbered queries to train system weights and even-numbered queries to test data fusion methods; then we exchanged their positions by using even-numbered queries to train system weights and odd-numbered queries to test data fusion methods.
The input file for multiple linear regression is an m row by (n + 1) column table. Here m is the total number of documents in all results and n is the total number of component systems involved in the fusion process. In the table, each record i is used for one document d_i (1 ≤ i ≤ m), and element s_{ij} (1 ≤ i ≤ m, 1 ≤ j ≤ n) in record i represents the score that document d_i obtains from component system ir_j for a given query. Element s_{i(n+1)} is the judged score of d_i for that query: 1 for relevant documents and 0 for irrelevant documents. The file can be generated by the following steps:

1. For each query, go through steps 2-5.
2. If document d_i occurs in at least one of the component results, then d_i is put into the table as a row.
3. If document d_i occurs in result r_j, then the score s_{ij} is calculated by the logistic model (parameters are shown in Table 2).
4. If document d_i does not occur in result r_j, then a score of 0 is assigned to s_{ij}.
5. Ignore all those documents that are not retrieved by any component system.

The above step 5 implies that any document that is not retrieved will be assigned a score of 0 (a code sketch of this table-construction procedure is given below). Apart from the linear combination method with weights trained by multiple regression (referred to as LCR later in this paper), CombSum, CombMNZ, the linear combination method with performance level weighting (referred to as LCP), and the linear combination method with performance square weighting (referred to as LCP2) are also involved in the experiment. For LCP and LCP2, we also divide all queries into two groups: odd-numbered queries and even-numbered queries. A run's performance measured by average precision on the odd-numbered queries (AP(odd)) is used as its weight for the even-numbered group (weight_LCP(even)), and vice versa.
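The table-construction steps above can be sketched as follows. The data structures and the logistic_score helper are illustrative assumptions about how the inputs might be organized, not the exact implementation used in the paper.

```python
import math


def logistic_score(rank, a, b):
    """Score of a document at a given rank (Equation 4; coefficients from Table 2)."""
    return 1.0 / (1.0 + math.exp(-a - b * math.log(rank)))


def build_training_table(results, qrels, a, b):
    """
    results: {query_id: {system_id: [doc_id, ...]}}  ranked lists, best document first
    qrels:   {query_id: set of relevant doc_ids}
    Returns a list of rows [s_1, ..., s_n, y], one per retrieved document (steps 1-5).
    """
    table = []
    for qid, runs in results.items():                      # step 1: per query
        systems = sorted(runs)
        retrieved = set().union(*(set(r) for r in runs.values()))
        for doc in retrieved:                              # steps 2 and 5
            row = []
            for sys_id in systems:
                ranked = runs[sys_id]
                if doc in ranked:                          # step 3: logistic score
                    row.append(logistic_score(ranked.index(doc) + 1, a, b))
                else:                                      # step 4: missing -> 0
                    row.append(0.0)
            row.append(1.0 if doc in qrels.get(qid, set()) else 0.0)
            table.append(row)
    return table
```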

4.1 Experiment 1

We first investigated the situation with various numbers of component systems, from 3 up to 32. For each given number k (3 ≤ k ≤ 32), 200 combinations were randomly selected and tested. Eight metrics, including average precision, recall-level precision, and precision at the 5, 10, 15, 20, 30, and 100 document levels, were used for performance evaluation. In the rest of this paper, we only report three of them: average precision, recall-level precision, and precision at the 10 document level. Results with the other metrics are not presented because they are very similar to the ones reported.

Fig. 2. Performance comparison of different data fusion methods (average of 4 groups and 200 combinations for each given number of component systems, average precision). The figure plots average precision against the number of runs used for fusion (up to 32) for the best system, CombSum, CombMNZ, LCP, LCP2, and LCR.

Figures 2, 3, and 4 show the average precision (AP), recall-level precision (RP), and precision at the 10 document level (P10), respectively, of all data fusion methods involved over the 4 year groups. From Figures 2-4, we can see that LCR is the best. In the following, we compare the methods using average precision (AP), averaged over the 4 year groups and the 200 combinations for each given number of component systems. When 3 component systems are fused, LCR outperforms LCP, LCP2, CombSum, CombMNZ, and the best component system by 1.78%, -0.14%, 11.02%, 9.95%, and 6.08%, respectively; when 16 component systems are fused, LCR outperforms them by 14.22%, 8.82%, 22.50%, 23.64%, and 14.67%; when 32 component systems are fused, LCR outperforms them by 23.71%, 16.94%, 32.50%, 34.39%, and 18.05%, respectively. From Figures 2-4, we have a few further observations. Firstly, we can see that LCR consistently outperforms all other methods and the best component system by a large margin.

Fig. 3. Performance comparison of different data fusion methods (average of 4 groups and 200 combinations for each given number of component systems, recall-level precision).

Fig. 4. Performance comparison of different data fusion methods (average of 4 groups and 200 combinations for each given number of component systems, precision at 10 document level).

The only exception is LCP2, which is slightly better than LCR when fusing 3 component systems (in terms of AP, LCP2's improvement over LCR is 0.14% for 3 component systems, -0.30% for 4 component systems, and -0.80% for 5 component systems). As the number of component systems increases, the difference in performance between LCR and all other data fusion methods increases accordingly.
Secondly, LCP2 is always better than LCP, which confirms that performance square weighting is better than performance level weighting [23]. The difference between them increases with the number of component systems for a while and then becomes stable. Taking average precision as an example, the differences are 1.92% for 3 component systems, 2.40% for 4 component systems, 2.76% for 5 component systems, ..., and 4.96% for 16 component systems; the difference then stabilizes at 5.50% to 5.80% when the number reaches 20 and over. A two-tailed t test was carried out to compare the difference between these two methods. It shows that the differences are always highly significant (p-value < 2.2e-16) whether 3, 4, 5, or more component systems are fused.
Thirdly, compared with the best component system, LCP and LCP2 perform better when a small number of component systems are fused. However, there are significant differences when different metrics are used for evaluation. Generally speaking, average precision is the metric that favours LCP and LCP2, while precision at the 10 document level favours the best component system. For LCP2, if average precision is used, then it is always better than the best component system; if recall-level precision is used, then it is better than the best component system when 20 or fewer component systems are fused; if precision at the 10 document level is used, then it outperforms the best component system only when 8 or 9 or fewer component systems are fused. For LCP, 16, 12, and 4 are the maximum numbers of component systems that can be involved for it to beat the best component system, if AP, RP, and P10 are used, respectively.
Fourthly, CombSum and CombMNZ perform badly compared with all the linear combination data fusion methods and the best component system. This demonstrates that treating all component systems equally is not a good fusion policy when a few poorly performing component systems are involved.
Fifthly, as to effectiveness, the experimental results demonstrate that we can benefit from fusing a large group of component systems no matter which fusion method we use. Generally speaking, the more component systems we fuse, the better fusion result we can expect. This can be seen from the fact that all the curves in all three figures increase with the number of component systems.
Finally, the results demonstrate that LCR can cope well with all sorts of component systems that differ greatly in effectiveness and in their similarity to each other, even when some of the component systems are very poor. It brings steady effectiveness improvement for different groups of component systems no matter which rank-based metric is used.

4.2 Experiment 2

Compared with some numerical optimisation methods such as conjugate gradient and golden section search, the multiple linear regression technique is much more efficient and much more likely to be used in real applications. However, if

a large number of documents and a large number of information retrieval systems are involved, then efficiency might be an issue that needs to be considered. For example, in some dynamic situations such as the Web, documents are updated frequently, and weights may need to be recalculated quite frequently so as to maintain good performance. Therefore, we consider how to use the linear regression technique in a more efficient way. In Experiment 1, we used all 1000 documents for every query to form two training data sets: the odd-numbered group and the even-numbered group. Multiple linear regression was then applied to these data sets to obtain weights. This time we did not use all 1000 documents, but only the top 20 documents for each query, to form the training data sets. Multiple linear regression was then applied as before. We carried out the experiment to compare the performance of the linear combination data fusion method with the two groups of different weights: one group uses the training data set of all 1000 documents, as in Experiment 1; the other group uses the training data set that only includes the top 20 documents for each query. Apart from that, the methodology in this experiment is just the same as in Experiment 1. As a matter of fact, we reused the results of Experiment 1 and only reran the experiment with the weights obtained from the 20-document training set. The number of combinations and the component systems involved in each combination are just the same as in Experiment 1. The experimental results are shown in Figure 5, where LCR_1000 represents the LCR method using the 1000-document training set and LCR_20 represents the LCR method using the 20-document training set. Note that LCR_1000 is exactly the same as LCR in Experiment 1. In Figure 5, for AP, RP, and P10, the curves of LCR_1000 and LCR_20 overlap with each other and are totally indistinguishable. In almost all cases, the difference between them is less than 0.3%. This demonstrates that using the top 20 documents for each query as the training data set can do the same job, no better and no worse, as using all 1000 documents. Obviously, using fewer documents in the training data set is beneficial from the point of view of efficiency.
We also measured the time needed for the weight calculation by multiple linear regression. A personal computer 5, on which R with the Windows interface package "Rcmdr" was installed, and two groups of data, TREC 5 and TREC 7 (even-numbered queries), were used for this part of the experiment. The time needed for the multiple linear regression is shown in Table 3. Note that in this experiment, for a year group, we imported the data of all component systems into the table and randomly selected 5, 10, 15, and 20 of them to carry out the experiment. It is probably because the TREC 7 group comprises more component systems (82) than the TREC 5 group (53) that it took a little more time for the multiple regression to deal with TREC 7 than with TREC 5 when the same number of component systems (5, 10, 15, or 20) was processed.

5 The machine has an Intel Duo core E8500 3.16GHz CPU and 2 GB of RAM, and runs the Windows operating system.

Fig. 5. Performance comparison of the linear combination fusion methods with two groups of different weights trained using different data sets (average of 4 groups and 200 combinations for each given number of component systems). The figure plots AP, RP, and P10 against the number of runs used for fusion for LCR_1000 and LCR_20; the corresponding curves overlap almost completely.

Table 3. Time (in seconds) needed for each multiple linear regression run

No. of variables   TREC 5 (1000 docs)   TREC 5 (20 docs)   TREC 7 (1000 docs)   TREC 7 (20 docs)
       5                  1.20                0.06                1.40                0.07
      10                  1.60                0.08                1.70                0.09
      15                  2.00                0.10                2.80                0.12
      20                  2.70                0.14                3.60                0.14

We can see that roughly the time needed for the 20-document data set is 1/20 of the time needed for the 1000-document data set, even though the number of records in the 20-document data set is 1/50 of that in the 1000-document data set. Still, a difference like this may be a considerable advantage for many applications. It is also possible to speed up the process further if we only load the data actually needed into the table.
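The only change needed to reproduce the 20-document setting is to truncate every ranked list before building the regression table; a possible sketch, reusing the hypothetical results structure from the earlier table-construction example, is shown below.

```python
def truncate_results(results, top_k=20):
    """Keep only the top_k documents of every ranked list before training weights."""
    return {qid: {sys_id: ranked[:top_k] for sys_id, ranked in runs.items()}
            for qid, runs in results.items()}

# Weights would then be trained on truncate_results(train_results, 20) instead of
# the full 1000-document lists; the fusion step itself is unchanged.
```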

5 Conclusion

In this paper we have investigated a new data fusion method that uses multiple linear regression to obtain suitable weights for the linear combination method. The weights obtained in this way are optimum in the least squares sense of estimating the relevance scores of all related documents by linear combination. Extensive experiments with 4 groups of runs submitted to TREC have demonstrated that it is a very good data fusion method. It outperforms other major data fusion methods, such as CombSum, CombMNZ, the linear combination method with performance level weighting or performance square weighting, and the best component system, by large margins. We believe this piece of work is

a significant step forward for the data fusion technique in information retrieval and can be used to implement more effective information retrieval systems.

References

1. J. A. Aslam and M. Montague. Models for metasearch. In Proceedings of the 24th Annual International ACM SIGIR Conference, pages 276–284, New Orleans, Louisiana, USA, September 2001.
2. B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In Proceedings of ACM SIGIR'94, pages 173–184, Dublin, Ireland, July 1994.
3. A. L. Calvé and J. Savoy. Database merging strategy based on logistic regression. Information Processing & Management, 36(3):341–359, 2000.
4. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, pages 613–622, Hong Kong, China, May 2001.
5. M. Efron. Generative model-based metasearch for data fusion in information retrieval. In Proceedings of the 2009 Joint International Conference on Digital Libraries, pages 153–162, Austin, USA, June 2009.
6. M. Farah and D. Vanderpooten. An outranking approach for rank aggregation in information retrieval. In Proceedings of the 30th ACM SIGIR Conference, pages 591–598, Amsterdam, The Netherlands, July 2007.
7. E. A. Fox, M. P. Koushik, J. Shaw, R. Modlin, and D. Rao. Combining evidence from multiple searches. In The First Text REtrieval Conference (TREC-1), pages 319–328, Gaithersburg, MD, USA, March 1993.
8. E. A. Fox and J. Shaw. Combination of multiple searches. In The Second Text REtrieval Conference (TREC-2), pages 243–252, Gaithersburg, MD, USA, August 1994.
9. A. Juárez-González, M. Montes y Gómez, L. Pineda, D. Avendaño, and M. Pérez-Coutiño. Selecting the n-top retrieval result lists for an effective data fusion. In Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing, pages 580–589, Iasi, Romania, March 2010.
10. J. H. Lee. Analysis of multiple evidence combination. In Proceedings of the 20th Annual International ACM SIGIR Conference, pages 267–275, Philadelphia, Pennsylvania, USA, July 1997.
11. D. Lillis, L. Zhang, F. Toolan, R. Collier, D. Leonard, and J. Dunnion. Estimating probabilities for effective data fusion. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347–354, Geneva, Switzerland, July 2010.
12. R. Manmatha, T. Rath, and F. Feng. Modelling score distributions for combining the outputs of search engines. In Proceedings of the 24th Annual International ACM SIGIR Conference, pages 267–275, New Orleans, USA, September 2001.
13. M. Montague and J. A. Aslam. Relevance score normalization for metasearch. In Proceedings of the ACM CIKM Conference, pages 427–433, Berkeley, USA, November 2001.
14. M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In Proceedings of the ACM CIKM Conference, pages 538–548, McLean, VA, USA, November 2002.
15. H. Nottelmann and N. Fuhr. From retrieval status values to probabilities of relevance for advanced IR applications. Information Retrieval, 6(3-4):363–388, September 2003.
16. M. E. Renda and U. Straccia. Web metasearch: rank vs. score based rank aggregation methods. In Proceedings of the ACM 2003 Symposium on Applied Computing, pages 847–452, Melbourne, USA, April 2003.
17. M. Shokouhi. Segmentation of search engine results for effective data-fusion. In Advances in Information Retrieval, Proceedings of the 29th European Conference on IR Research, pages 185–197, Rome, Italy, April 2007.
18. P. Thompson. Description of the PRC CEO algorithms for TREC. In The First Text REtrieval Conference (TREC-1), pages 337–342, Gaithersburg, MD, USA, March 1993.
19. C. C. Vogt and G. W. Cottrell. Predicting the performance of linearly combined IR systems. In Proceedings of the 21st Annual ACM SIGIR Conference, pages 190–196, Melbourne, Australia, August 1998.
20. C. C. Vogt and G. W. Cottrell. Fusion via a linear combination of scores. Information Retrieval, 1(3):151–173, October 1999.
21. S. Wu. Applying statistical principles to data fusion in information retrieval. Expert Systems with Applications, 36(2):2997–3006, March 2009.
22. S. Wu, Y. Bi, and S. McClean. Regression relevance models for data fusion. In Proceedings of the 18th International Workshop on Database and Expert Systems Applications, pages 264–268, Regensburg, Germany, September 2007.
23. S. Wu, Y. Bi, X. Zeng, and L. Han. Assigning appropriate weights for the linear combination data fusion method in information retrieval. Information Processing & Management, 45(4):413–426, July 2009.
24. S. Wu and F. Crestani. Data fusion with estimated weights. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, pages 648–651, McLean, VA, USA, November 2002.
25. S. Wu, F. Crestani, and Y. Bi. Evaluating score normalization methods in data fusion. In Proceedings of the 3rd Asia Information Retrieval Symposium (LNCS 4182), pages 642–648, Singapore, October 2006.
26. S. Wu and S. McClean. Data fusion with correlation weights. In Proceedings of the 27th European Conference on Information Retrieval, pages 275–286, Santiago de Compostela, Spain, March 2005.
27. S. Wu and S. McClean. Improving high accuracy retrieval by eliminating the uneven correlation effect in data fusion. Journal of the American Society for Information Science and Technology, 57(14):1962–1973, December 2006.
28. D. Zhou, S. Lawless, J. Min, and V. Wade. A late fusion approach to cross-lingual document re-ranking. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, pages 1433–1436, Toronto, Canada, October 2010.
