Learning to Retrieve Information

Brian Bartell

Encyclopaedia Britannica and Institute for Neural Computation, Computer Science & Engineering, University of California, San Diego, La Jolla, California 92093

Garrison W. Cottrell, Institute for Neural Computation, Computer Science & Engineering, University of California, San Diego, La Jolla, California 92093

Rik Belew, Institute for Neural Computation, Computer Science & Engineering, University of California, San Diego, La Jolla, California 92093

Abstract

Information retrieval differs significantly from function approximation in that the goal is for the system to achieve the same ranking function of documents relative to queries as the user: the outputs of the system relative to one another must be in the proper order. We hypothesize that a particular rank-order statistic, Guttman's point alienation, is the proper objective function for such a system, and demonstrate its efficacy by using it to find the optimal combination of retrieval experts. In application to a commercial retrieval system, the combination performs 47% better than any single expert.

1 Introduction

In text retrieval, the goal is to have the system retrieve documents that are relevant to a user's query. Unfortunately, relevance is difficult to determine a priori, and may differ across users. This suggests that the proper approach is to use an adaptive system that learns what is relevant from user feedback. However, most adaptive systems in the neural network literature are useful for classification tasks or for function approximation. Previous experience suggests that a simple classification of documents into "relevant/irrelevant" does not provide sufficient information for the system to learn appropriate parameters. Function approximation is inappropriate because the input to the system is some representation of the document and some representation of the query, while the output is a ranking which forms a partial order over the documents. That is, there is no specific target for the system, only the constraint that one output should be less than or greater than some other output.

Thus the retrieval criteria can most naturally be defined as an objective function over all of the network outputs. In this paper, we report on our work using a particular objective function that matches the user's preferences via a variant of a rank-order statistic called Guttman's point alienation. We show that this objective function can be optimized by gradient descent (we use conjugate gradient), even though it has singularities. Although optimizing such an objective function can be expensive, in this paper we limit the adapted parameters by optimizing only the weights combining retrieval experts, pre-designed rank-order retrieval systems. Thus we have only as many parameters as experts being combined. In rank-order retrieval, we find that two heads are definitely better than one, i.e., a system that combines multiple sources of evidence for relevance assessment is usually better than a single system. Part of the performance benefit arises from the fact that different retrieval systems retrieve widely different sets of documents for the same query. Recently, Harman (1993) observed that the systems in the 1993 Text REtrieval Conference (TREC-1) all performed within the same approximate range, though they retrieved significantly different sets of documents for the same queries. Thus, combining systems leads to higher achievable recall, although this does not account for all of the improvement (Saracevic & Kantor, 1988).

2 Approach

The criterion used in this paper is a variation on Guttman's Point Alienation (Guttman, 1978), a statistical measure of rank correlation. Previous work (Bartell et al., 1994) has demonstrated that this criterion can be highly correlated with average precision [1], a more typical measure of performance in Information Retrieval. Thus, optimization of this criterion is likely to lead to optimized average precision performance. The technique applies to ranked retrieval functions: functions that generate a single numerical score for each document and query pair. This score is interpreted as the degree of relevance of the document to the query. By ordering all documents with respect to their scores, the system provides a ranked list of documents estimating the order of relevance of the documents, most relevant before least relevant. Many current approaches in Information Retrieval are examples of ranked retrieval functions, e.g., the vector space model (Salton & McGill, 1983), the probabilistic model (Bookstein & Swanson, 1974), spreading activation approaches (Mozer, 1984; Belew, 1989) and boolean retrieval. Therefore, this requirement on the functionality of the system does not appear overly restrictive.

[1] Precision is the fraction of documents retrieved that are relevant. Recall is the fraction of relevant documents that are retrieved. Average precision is precision averaged over a number of recall levels.
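To make the notion of a ranked retrieval function concrete, the following minimal Python sketch is our own illustration (not code from the paper): any function R(q, d) that assigns a real-valued score to each query-document pair and ranks documents by that score qualifies. The toy term-count scoring here is only a stand-in for the models cited above.

from collections import Counter

def score(query, document):
    # Toy relevance score: number of query-term occurrences in the document.
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[t] for t in query.lower().split())

def rank(query, documents):
    # Return (doc_id, score) pairs sorted most-relevant-first.
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": "the minotaur myth may derive from bull leaping acrobatics",
    "d2": "ancient greek pottery and sculpture",
}
print(rank("minotaur myth sport", docs))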

Using the notation in (Bartell et al., 1994), we let R_θ(q, d) be a ranked retrieval function which generates a score indicating the relevance of document d to query q. R_θ must be differentiable in its parameters θ. The criterion is:

J(R_\theta) = -\frac{1}{|Q|} \sum_{q \in Q} \frac{\sum_{d \succ_q d'} \left( R_\theta(q, d) - R_\theta(q, d') \right)}{\sum_{d \succ_q d'} \left| R_\theta(q, d) - R_\theta(q, d') \right|}    (1)

Here, Q is the set of training queries and d and d' are documents retrieved by at least one of the experts for query q. It is assumed that we either know the relevance of all retrieved documents for the training queries, or we restrict the optimization to those documents that are known. This knowledge is represented by the binary relation ≻_q (a notation adopted from Wong et al., 1993). ≻_q is a preference relation over document pairs. It is interpreted as:

d \succ_q d' \iff \text{the user prefers } d \text{ to } d'    (2)

This preference relation is much more expressive than common alternatives in Information Retrieval. For example, the typical binary relevant/irrelevant classification of documents is a special case. The goal of the optimization is to find parameters θ so that R_θ(q, d) > R_θ(q, d') whenever d ≻_q d'. That is, the system should rank document d higher than document d' whenever d is preferred by the user to d'. The criterion is an average, over the queries in Q, of the rank match between R_θ and ≻_q. When R_θ perfectly orders the documents with respect to ≻_q, the numerator and denominator are equal, since every difference R_θ(q, d) - R_θ(q, d') is positive and the sum of the differences therefore equals the sum of their absolute values. In this case, the ratio is 1.0, and the criterion takes on its minimum value, -1.0. When R_θ orders the documents completely contrary to ≻_q, the ratio is -1.0, and the criterion is maximized at 1.0. The goal is to minimize the criterion.

We use conjugate gradient with multi-starts to minimize J. For a number of the experiments, all optimizations lead to the same parameter values, so the optimum we have found is robust. In other experiments, the optimized parameters are not identical. In these cases, we generally select the one optimization which results in the best average precision performance on the training set.

To apply the method to combining experts, we often use the following linear model (illustrated for three experts):

R_\theta(q, d) = \theta_1 E_1(q, d) + \theta_2 E_2(q, d) + \theta_3 E_3(q, d)    (3)

The parameters θ_i serve as the scales on the individual experts E_i. The critical feature of such a model is that the score it generates is differentiable with respect to the parameters θ. This is obviously true here.
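The short Python sketch below is our own illustration (not code from the paper): it computes the criterion J of Equation (1) for the linear combination model of Equation (3), given expert scores and a set of preference pairs d ≻_q d' for each training query. Names such as expert_scores and prefs are hypothetical.

import numpy as np

def combined_score(theta, expert_scores):
    # Linear combination model (Eq. 3); expert_scores has shape (n_docs, n_experts).
    return expert_scores @ theta

def criterion_J(theta, queries):
    # Guttman point-alienation criterion (Eq. 1).
    # `queries` is a list of (expert_scores, prefs) pairs, one per training query:
    #   expert_scores: (n_docs, n_experts) array of expert outputs E_i(q, d)
    #   prefs:         list of (d, d_prime) index pairs with d preferred to d_prime
    # Returns a value in [-1, 1]; -1 means every preference is ranked correctly.
    total = 0.0
    for expert_scores, prefs in queries:
        r = combined_score(theta, expert_scores)
        diffs = np.array([r[d] - r[dp] for d, dp in prefs])
        denom = np.abs(diffs).sum()
        if denom > 0:          # skip the singularity when all scores are equal
            total += diffs.sum() / denom
    return -total / len(queries)

A perfect ordering gives J = -1 for every query; conjugate gradient then only needs the gradient of this quantity with respect to θ, which Equations (4) and (5) below supply for the linear model.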

Other models of interest, particularly nonlinear neural network models, are clearly also differentiable with respect to their parameters. This model is quite similar to an earlier model used by Lewis & Croft (1990) in their examination of syntactic phrase indexing and semantic phrase clustering. Unfortunately, Lewis & Croft did not have an automatic method available for determining parameter values. Therefore, they were forced to adjust the parameter values manually, and were unable to find parameters which would result in anything but trivial improvements over a baseline weighting.

The gradient in general is:

\frac{\partial J(R)}{\partial \theta} = \sum_{q \in Q} \sum_{d \in D_q} \frac{\partial J(R)}{\partial R_\theta(q, d)} \, \frac{\partial R_\theta(q, d)}{\partial \theta}    (4)

where D_q is the set of documents retrieved for query q by any of the experts. This equation provides the blueprint for optimizing many different models: simply derive the gradient of the model output with respect to the parameters of the model. There are singularities in the first term; these are easily dealt with or can be ignored (Bartell, 1994). For the linear combination model this is:

\frac{\partial R_\theta(q, d)}{\partial \theta_i} = E_i(q, d)    (5)

In the applications reported in this paper, optimization of a linear model using conjugate gradient requires approximately 20 minutes on a Sparc IPC (a low-end workstation) for a database of 20,000 documents, 3 experts, and 61 training queries. Training time for non-linear models increases to 1 to 3 hours. For some experiments presented later, we restrict the training set to only the highest ranked documents for each query. In these cases, training time is as short as a few seconds and yields performance results better than the full-size experiments. Of course, training time may vary for other text collections and other models. A further discussion of computational complexity is provided in (Bartell, 1994).
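As an illustration of how the minimization can be carried out, the sketch below is ours and assumes the criterion_J function from the earlier sketch; it hands the criterion to SciPy's conjugate-gradient optimizer with multiple random restarts and a finite-difference gradient. The paper uses the analytic gradient of Equations (4) and (5), which would replace the finite-difference estimate.

import numpy as np
from scipy.optimize import minimize

def optimize_weights(queries, n_experts, n_restarts=5, seed=0):
    # Minimize criterion_J over the linear-combination weights theta using
    # conjugate gradient with multi-starts, keeping the best solution found.
    rng = np.random.default_rng(seed)
    best_theta, best_J = None, np.inf
    for _ in range(n_restarts):
        theta0 = rng.uniform(0.0, 1.0, size=n_experts)   # random restart
        result = minimize(criterion_J, theta0, args=(queries,), method="CG")
        if result.fun < best_J:
            best_theta, best_J = result.x, result.fun
    return best_theta, best_J

In a reimplementation, supplying the analytic gradient via the jac argument of minimize would match the paper's procedure more closely and would be considerably faster.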

3 Optimal Performance in a Commercial System

We have applied our method to combining the experts in a commercial retrieval system to validate the usefulness of the technique for real systems. We have been able to achieve very high performance relative to any of the system's individual experts. The optimized system performs on average 47% higher (measured by average precision) than the best individual expert. We have compared our results with an alternative method using supervised learning and found that it performs on average only 2% better than the best individual expert (Bartell, 1994). The commercial system used is SmarTrieve, developed by Compton's New Media and available for the PC.

The SmarTrieve (ST) ranking algorithm consults four experts when determining the relevance of a document to a query. The four experts provide the following relevance estimates: a traditional vector space estimate; a count of the number of query terms in the document; an estimate of the number of query phrases in the document, based on term proximity; and a count of the number of query terms in the title of the document. ST retrieves titles separately from documents; therefore, we have omitted the last expert and have concentrated on the retrieval of only full-text documents using the remaining three experts. The objective is to find the best-performing combination of these three experts. We note that the vector expert does not utilize state-of-the-art weighting schemes (cf. Salton & Buckley, 1988). Therefore, we have found that the performance of the vector expert alone is not as high as could be achieved using the SMART system (Salton & McGill, 1983) and some of its possible weighting mechanisms. However, our emphasis in this work is on achieving the highest level of performance possible by combining existing systems, not on optimizing the individual experts. Therefore, we have not attempted here to modify the vector expert (or any of the other experts) to achieve better performance, treating the experts as black boxes.
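To make the setup concrete, here is a minimal sketch of our own (the function names are hypothetical; this is not SmarTrieve code) of the three experts as scoring functions over a query and a tokenized document: a tf-idf style vector-space score, a query-term count, and a crude proximity-based phrase estimate.

import math
from collections import Counter

def vector_expert(query_terms, doc_terms, doc_freq, n_docs):
    # Rough tf-idf score; stands in for the vector space expert.
    tf = Counter(doc_terms)
    return sum(tf[t] * math.log(n_docs / (1 + doc_freq.get(t, 0)))
               for t in set(query_terms))

def count_expert(query_terms, doc_terms):
    # Number of distinct query terms that occur in the document.
    doc_set = set(doc_terms)
    return sum(1 for t in set(query_terms) if t in doc_set)

def proximity_expert(query_terms, doc_terms, window=5):
    # Crude phrase estimate: count consecutive query-term occurrences that
    # fall within a small window of one another.
    qset = set(query_terms)
    positions = [i for i, t in enumerate(doc_terms) if t in qset]
    return sum(1 for i, j in zip(positions, positions[1:]) if j - i <= window)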

3.1 Test Collection

The Compton's MultiMedia Encyclopedia (CMME) is used for all experiments reported in this section. The CMME consists of 20,701 documents along with additional associated multimedia objects (such as short video, sound clips, etc.). The CMME is edited for a high school level audience. All 20,701 full-text documents are used in the experiments. The training and test queries are derived from a set of questions found in the printed version of the encyclopedia. At the head of each of the 25 volumes, the editors have provided a set of questions which can be answered by one of the articles in that volume. For example, a question at the head of volume 1 is: "What sport may have given rise to the myth of the Minotaur?" (Answer: acrobatics, found in the article titled "Aegean Civilization"). A total of 1417 queries can be constructed from these questions, and subsets of these queries are used for training and testing. Deriving sample queries from these editors' questions is not without its limitations. In particular, we have found a number of cases where the answer is present in a number of documents, not just the single document the editors have identified. In other cases, the answer is not present in the identified document at all. We have decided not to alter the queries to compensate for these factors, because it would be unlikely to find perfect noise-free training queries in any real retrieval setting. It should be noted that these queries are somewhat atypical of queries commonly seen in the text retrieval literature, since each query has only a single correct answer in a single document.

This is in contrast to topic-oriented queries, typified by the queries in the TREC collection (Harman, 1993), which do not have well-defined answers. Rather, a potentially large number of documents, each relating to some aspect of the broad query topic, may be relevant to each query.

Nine groups of training queries were constructed, with approximately 60 queries per group. No query belongs to more than one group. The 9 groups correspond to the queries in each of the first 9 volumes of Compton's Encyclopedia. For all experiments reported below, the networks are trained on one group and then tested on a second. This is repeated 8 times, so that each network is trained and tested using 8 pairs of training and test groups, in order to avoid any bias due to the particular queries in one group.

We examined a number of different models for combining the three experts, starting with the linear model defined previously. The performance of the optimized system and a comparison to the individual experts is illustrated in Table 1. The performance of the optimized system is measured by training 5 linear networks for each of the 8 query partitions. Within each partition, the network having the highest average precision over the training queries is selected as the representative network for that partition. The network is then evaluated over the test queries to determine its performance.

Test Performance By Partition

  Query      Proximity  Count   Vector  Optimized  Improvement
  Partition  Expert     Expert  Expert  System     over Best
  1          .2315      .3235   .2162   .3176      -2%
  2          .2989      .3766   .3093   .4865      +29%
  3          .3761      .4610   .3173   .3957      -14%
  4          .2675      .3344   .2611   .5153      +54%
  5          .3600      .4810   .3013   .5089      +6%
  6          .1729      .3369   .3338   .3622      +8%
  7          .3402      .3248   .2018   .2684      -17%
  8          .3124      .4468   .2999   .3627      -19%
  Average    .2949      .3856   .2801   .4021      +6%

Table 1: Results of optimized system.

The results in the table are not extremely positive. There is only marginal performance improvement (an average of 6%) over the best individual expert, and for some partitions the performance is worse. In addition, this 6% improvement is not statistically significant (with p = 0.625). These results are particularly surprising in light of some preliminary experiments we had performed. In these preliminary experiments, we trained the network using only the first 12 queries from the first partition. The performance of this network on test queries is very high (42% improvement in average precision over the best individual expert).
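All of the performance numbers above and below are average precision values. As a point of reference, the sketch below is our own code; the paper gives only the footnoted definition, so the 11-level interpolation used here is an assumption about one standard variant. For the single-answer encyclopedia queries it reduces to the reciprocal of the rank at which the answer document is retrieved.

def average_precision(ranked_doc_ids, relevant_ids, levels=11):
    # Interpolated precision averaged over evenly spaced recall levels.
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    precisions, recalls, hits = [], [], 0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / len(relevant_ids))
    # Interpolated precision at recall level r: max precision at any recall >= r.
    def interp(r):
        vals = [p for p, rec in zip(precisions, recalls) if rec >= r]
        return max(vals) if vals else 0.0
    points = [i / (levels - 1) for i in range(levels)]
    return sum(interp(r) for r in points) / levels

# A single relevant document at rank 3 gives average precision 1/3.
print(average_precision(["d3", "d7", "d1"], {"d1"}))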

It seems surprising that increasing the size of the training set from 12 to the full 61 queries in the first partition would degrade the results so extremely. On closer examination, we observed that the base retrieval system (unoptimized SmarTrieve) performs quite well for all of the 12 queries in the preliminary set: the correct answer for each query is always in the top 8 documents retrieved. This contrasts with the whole set of queries in the first partition. In the larger set of 61, 6 of the relevant documents are not in the top 100 retrieved, and 3 of those are not in the top 500. A possible conclusion is that these few extremely difficult queries are outliers that make it difficult to find good overall solutions. In order to perform well for these 6 difficult queries, parameters are learned which do not perform especially well either for these 6 or for the remaining 55. This hypothesis is examined next.

We examined the performance of the method when trained on the queries with outliers removed. We defined outliers as those queries for which the one relevant document is not in the top 100 documents retrieved by the unoptimized system. For the first partition, 6 queries are removed as outliers, leaving 55. The performance of this network is much better than the one trained using outliers. Performance improves from a test set average precision of 0.3176 when trained on all 61 queries (2% worse than the count expert) to a test set performance of 0.5029, a 55% improvement over the count expert.

We thus decided to try limiting the training set to the highest ranked 15 documents retrieved by the system for each query. This restriction is actually more realistic than assuming we have a collection of training queries for which we know the relevance and irrelevance of every document in the collection. Rather, a more likely scenario is that the queries used in training are derived from actual sessions with users of the system. In this case, typically only a small number of documents will have been identified as relevant or irrelevant by the user, so complete information on all documents will simply not be available. We used the top 15 documents as ranked by SmarTrieve for each query in training. When the system is evaluated after training, all documents are used. As before, 5 networks are optimized for each of the 8 test partitions in order to avoid local minima. The network with the best average precision performance on the training queries is selected as the representative for each partition.

The results of this experiment, with comparison to the best individual expert and to the previous experiments using all documents, are illustrated in Table 2. The combined system trained on only the top-ranked 15 documents performs much better than either the best individual expert (the count expert) or the network trained on all documents. The improvement is at least 25% over the count expert for all partitions, and the average improvement is 47%. The improvements over the count expert and over the linear combination trained on all documents are significant (with p < 0.0005 and p = 0.003, respectively).

Test Performance By Partition, Top-Ranked 15 Documents

  Query      Count   Optimized  Optimized  Improvement over
  Partition  Expert  All Docs   Top 15     Count Expert
  1          .3235   .3176      .5121      +58%
  2          .3766   .4865      .5559      +48%
  3          .4610   .3957      .6373      +38%
  4          .3344   .5153      .5161      +54%
  5          .4810   .5089      .6011      +25%
  6          .3369   .3622      .5153      +53%
  7          .3248   .2684      .5732      +76%
  8          .4468   .3627      .5629      +26%
  Average    .3856   .4021      .5592      +47%

Table 2: Results of training on only the highest ranked 15 documents. Improvement is calculated with respect to the best individual expert (the count expert).

These results support the conclusion made earlier that removing the outliers can have a very positive effect on performance. Also, restricting the training to 15 documents results in the optimization taking between a few seconds and a few minutes. Interestingly, non-linear networks do have utility in handling outliers. A 3-layer non-linear network trained on all documents and queries (including outliers) has a 42% average improvement over the count expert. This demonstrates that the non-linearities can be exploited to isolate the degrading effects of the outliers.
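To illustrate one plausible reading of the restricted training setup described above, the sketch below is our own (the data structures are hypothetical, not the SmarTrieve implementation): it builds the preference pairs used by the criterion from only the top 15 documents the unoptimized system retrieves for each query, assuming the single answer document for each query is known.

def top_k_training_pairs(ranked_doc_ids, answer_doc_id, k=15):
    # Build (preferred, non-preferred) document pairs from the top-k retrieved
    # list. The answer document is preferred to every other document in the
    # top k; a query whose answer falls outside the top k contributes no pairs,
    # which would also drop the outlier queries discussed above.
    top_k = ranked_doc_ids[:k]
    if answer_doc_id not in top_k:
        return []
    return [(answer_doc_id, d) for d in top_k if d != answer_doc_id]

# Example: the answer document ranked 4th for a query.
pairs = top_k_training_pairs(
    ["d12", "d7", "d3", "aegean-civilization", "d99"], "aegean-civilization")
print(pairs)   # answer preferred to each of the other four retrieved documents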