The Relevance Density Method for Multi-topic Queries in Information Retrieval

Y. Kane-Esrig, L. Streeter, S. Dumais, W. Keese
Bell Communications Research, 331 Newman Springs Road, Red Bank, NJ 07701-7030

G. Casella
Cornell University, 337 Warren Hall, Ithaca, NY 14850

Abstract

A long-standing problem in information retrieval is how to treat queries that are best answered by two or more distinct sets of documents. Existing methods average across the words or terms in a user's query and consequently perform poorly with multimodal queries, such as "Show me documents about French art and American jazz." We propose a new method, the Relevance Density Method, for selecting documents relevant to a user's query. The method can be used whenever the documents and the terms are represented by vectors in a multi-dimensional space such that the vectors corresponding to documents and terms dealing with closely related topics are close to each other. We show that the Relevance Density Method performs better than an averaging method for multimodal as well as single-mode queries. In addition, we show that retrieval is substantially faster for the new method.

Introduction

The task of an information retrieval system is to respond to a user's request for information (a query) by searching a collection of documents (e.g., texts such as books, journal articles, etc.) and selecting those documents that seem to be relevant to the topic(s) of the query. Usually, the documents in the collection are indexed by terms (keywords). It is assumed that the topic(s) of a document or of a query is adequately reflected by its collection of terms. The relevance density method proposed in this paper can be applied whenever terms and documents are represented by vectors in the same multidimensional document-term space, with the similarity of terms and documents reflected by the closeness of their vector representations in that space. In other words, if two vectors are close together, then the corresponding terms or documents can be assumed to be closely related in their topics, and vice versa. Methods for constructing such a space are presented in [1] and [2]; a sketch of one such construction is given below.
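
One way to construct such a space is the latent semantic analysis approach of [1], in which a truncated singular value decomposition of the term-by-document matrix places term vectors and document vectors in the same low-dimensional space. The sketch below is our own toy illustration of that idea, not code from [1]; the matrix, sizes, and retained dimensionality are all invented.

```python
import numpy as np

# Toy term-by-document count matrix standing in for a real collection.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(500, 200)).astype(float)  # 500 terms x 200 docs

# Truncated SVD: keep the r largest singular values and their vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 50
term_vectors = U[:, :r] * s[:r]   # one r-dimensional row per term
doc_vectors = Vt[:r].T * s[:r]    # one r-dimensional row per document

print(term_vectors.shape, doc_vectors.shape)   # (500, 50) (200, 50)
```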

Currently, the method of selecting relevant documents used in conjunction with such vector representations of terms and documents is called vector averaging (VA). VA [2], [3] represents a query by a single vector in the document-term space. This query vector is a weighted average of the term vectors used in the query. Documents in the collection are ranked by the closeness (measured by the cosine or dot product) of their vectors to the query vector, and the top-ranking documents are selected as relevant and returned to the user. Representing the query by a single vector works well when the vectors of the relevant objects (documents and terms) are clustered together in a single region of the document-term space, since the center of that region is a reasonable estimate of the query's content. However, if the vectors of the relevant objects fall into two or more clusters separated by regions of the space containing nonrelevant documents, then averaging performs poorly, since it tends to retrieve documents between the two clusters of relevant documents. One proposed solution [4] was to identify multimodal queries and split them into sub-queries; however, this method was too computationally expensive and has not been used widely. An additional drawback of vector averaging is computational expense: typically, the query vector is compared to every document vector, and if the document collection is large and the dimensionality of the document-term space high, the computational demands can be quite significant. The proposed method can be implemented using table look-up, thereby trading space for time. A minimal sketch of the VA baseline follows.
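
The following sketch of the VA baseline uses randomly generated vectors in place of a real document-term space; the names, sizes, and weighting are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_terms, n_docs = 100, 500, 1000
T = rng.normal(size=(n_terms, dim))   # term vectors, one row per term
D = rng.normal(size=(n_docs, dim))    # document vectors, one row per document

def va_rank(query_term_ids, weights=None, top=10):
    """Rank documents by cosine similarity to the weighted average
    of the query's term vectors."""
    if weights is None:
        weights = np.ones(len(query_term_ids))
    q = weights @ T[query_term_ids] / weights.sum()   # averaged query vector
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:top]                    # best documents first

print(va_rank([3, 17, 42]))
```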

The Relevance Density Method

We propose a new method of ranking documents. The Relevance Density Method (RDM) can be used whenever documents and terms are represented by vectors in the document-term space. We treat relevance as a continuous quantity and model its distribution by a probability density π(D) over the document-term space. The documents in the collection are ranked in the order of the height of the density over their vector representations D: the document with the highest value of π(D) is given rank 1, and so on, so that the best-ranked documents are those estimated to be most relevant. Thus, the density should be high over areas of the document-term space containing the vectors of relevant objects and low over areas of nonrelevant objects. If there is more than one cluster of relevant objects, then the density should be multimodal.

To construct the density π(D), we start with a prior density π_0(D), which reflects the system's a priori guess about the user's interests. If no prior information about the user is available, π_0(D) is a constant and does not affect the ranking. We use Bayes' rule to update the density when the user's query is received. As in vector averaging, the query is treated as a collection of terms.

Let Q = {T_1, . . . , T_k} be the set of vectors corresponding to the terms used in the query, where k is the number of terms in the query. Then¹

    π_1(D | Q) = f(Q | D) · π_0(D)

In some cases, relevance feedback can be obtained from the user after the initial query: the user is presented with a few top-ranking documents and asked which of them s/he considers relevant to her/his query. If such relevance feedback is available, it can be used to update π(D). Let Q_1 = {D_1, . . . , D_m} be the set of vectors corresponding to the documents that the user considered relevant, where m is the number of such documents. Then the relevance density after the feedback is:

    π_2(D | Q, Q_1) = f(Q_1 | D) · π_1(D | Q)

We used:

    f(Q | D) = Σ_{j=1}^{k} w_j · c(b_j) · exp[ b_j · cos(T_j, D) ]

where cos(T_j, D) is the cosine of the angle between the term vector T_j and the document vector D. The above density has the property of being unimodal when the term vectors are in a single cluster and multimodal when there is more than one cluster. The density is a sum of bell-shaped components: the j-th bell is centered over the vector of the j-th term used in the query. The bell is tall and narrow if the parameter of concentration b_j is high, and low and wide if b_j is low. The parameter of concentration differentiates highly specific terms from broad, less specific terms; for example, single-word terms, such as cable, tend to be less specific than multi-word terms, such as fiberoptic cable [3], [6]. The factor c(b_j) normalizes the j-th bell to integrate to 1, making it a proper density. The weights w_j can be used to express the different amounts of importance associated with terms. For example, words can be weighted according to their information value, so that common or frequent words are weighted less heavily than rare words. (A list of desirable qualities of a sampling density, proofs that f(Q | D) has these qualities, and an alternative sampling function are presented in [5].)

The values of w_j · c(b_j) · exp[b_j · cos(T_j, D)] can be precalculated for every document and term and stored. Thus, when a user's query is processed, the system simply looks up the values corresponding to the terms used in the query and adds them up to compute f(Q | D). This table look-up method of computation makes the RDM far less computationally expensive than VA in terms of the number of operations required [5]. However, if the term-by-document matrix is large, having enough space to store the values becomes an issue. A minimal sketch of this look-up scheme is given below.

¹ To make π_1 a proper density, a scaling constant is needed, but since it does not affect the ranking, we omit it.
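
The sketch below assumes invented vectors, constant weights w_j, and all b_j equal (so the normalizing factors c(b_j) may be taken as constants without affecting the ranking). It illustrates the idea, not ADVISOR's actual implementation; the feedback function shows how the update π_2 = f(Q_1 | D) · π_1 might be realized.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_terms, n_docs = 100, 500, 1000
T = rng.normal(size=(n_terms, dim))   # term vectors
D = rng.normal(size=(n_docs, dim))    # document vectors

w = np.ones(n_terms)     # term weights w_j (constant in this sketch)
b = np.ones(n_terms)     # concentration parameters b_j (all equal here)
c_b = np.ones(n_terms)   # c(b_j): a common constant when all b_j are equal

# Precompute table[j, i] = w_j * c(b_j) * exp(b_j * cos(T_j, D_i)).
T_hat = T / np.linalg.norm(T, axis=1, keepdims=True)
D_hat = D / np.linalg.norm(D, axis=1, keepdims=True)
table = (w * c_b)[:, None] * np.exp(b[:, None] * (T_hat @ D_hat.T))

def rdm_scores(query_term_ids):
    """f(Q | D) for every document: add up the precomputed rows for the
    query's terms. A constant prior pi_0 is omitted; it does not affect
    the ranking."""
    return table[query_term_ids].sum(axis=0)

def rdm_feedback_scores(query_term_ids, relevant_doc_ids):
    """pi_2 = f(Q_1 | D) * pi_1(D | Q): here f(Q_1 | D) is computed like
    f(Q | D), treating the vectors of the documents judged relevant as
    the 'terms' (with w = c = b = 1 in this sketch)."""
    pi_1 = rdm_scores(query_term_ids)
    f_q1 = np.exp(D_hat[relevant_doc_ids] @ D_hat.T).sum(axis=0)
    return f_q1 * pi_1

def rdm_rank(scores, top=10):
    return np.argsort(-scores)[:top]   # rank 1 = highest density

print(rdm_rank(rdm_scores([3, 17, 42])))
```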

Results of Testing

Both the RDM and VA methods were tested on Bellcore's ADVISOR system [3], [6]. The system responds to a query by identifying the departments within Bellcore best suited to answer the query. (Bellcore is a large and diverse research and development company.) At the time of the first set of tests, the 104 departments were represented by abstracts of the technical papers they produced in 1987; there were 728 such documents, indexed by 7,100 terms, in ADVISOR's collection. New abstracts were collected in 1987 and in 1989 and used as test queries. (We did not use as queries any of the abstracts in ADVISOR's collection.) In addition, to study performance in cases where the query was likely to have at least two separate topics, we constructed "double" queries by joining the texts of pairs of abstracts produced by two different departments and treating each joined text as a single query. The measure of performance for each test query was the rank of the first retrieved "relevant" document. A document was considered relevant to the query if it was produced by the same department as the one that produced the query; in the case of the double queries, the documents produced by either one of the two departments were considered relevant. If the method of retrieval were perfect, the rank of the first correct document would be 1; on the other hand, if the documents were ranked randomly, the rank would be 52 on average. Each query was ranked by each of the two methods, RDM and VA. VA was used with a root mean squared weighting of the terms and with the cosine as the similarity measure; this weighting scheme and similarity measure were chosen because they produced the best performance in previous tests on this collection. The RDM was used with a constant prior density (i.e., no prior information), with constant weights on the terms, and with b_j = 1 for terms consisting of a single word and b_j = 2 for multi-word terms. A sketch of the evaluation measure is given below.
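
As an illustration of this evaluation measure, the sketch below computes quartiles of the rank of the first relevant document for each method and applies a Wilcoxon signed-rank test to the paired ranks; the rank values are invented placeholders, not the ADVISOR results of Table 1.

```python
import numpy as np
from scipy.stats import wilcoxon

# Rank of the first relevant document for each test query, per method.
# These numbers are made up for illustration.
va_first_rank = np.array([1, 3, 19, 2, 7, 1, 25, 4])
rdm_first_rank = np.array([1, 2, 9, 1, 5, 1, 12, 3])

for name, ranks in [("VA", va_first_rank), ("RDM", rdm_first_rank)]:
    lq, med, uq = np.percentile(ranks, [25, 50, 75])
    print(f"{name}: lower Q = {lq}, median = {med}, upper Q = {uq}")

# Paired comparison of the two methods on the same queries.
stat, p = wilcoxon(va_first_rank, rdm_first_rank)
print(f"Wilcoxon signed-rank: statistic = {stat}, p = {p:.3f}")
```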

The results of these tests are presented in Table 1. We observe that for the 263 new abstracts produced in 1987 and used as queries, both VA and RDM answered at least 25% of the queries correctly on the first try, since the lower quartile of the ranks of the first correct documents is 1 for both methods. VA answered at least 50% of the queries on or before the third try (median rank = 3), while RDM did better, with a median rank of 2. Finally, the upper quartile of the ranks was 19 for VA and 9 for RDM: RDM answered 75% of the queries correctly on or before the 9th try, whereas VA answered 75% of the queries correctly on or before the 19th try. From the user's point of view there is likely to be a big difference between looking at 8 versus 18 nonrelevant documents before getting a relevant one. The statistical significance of the differences in performance was assessed using a Wilcoxon signed-rank test. The value of the z statistic for the 263 queries was -2.14; the p value of the test against the one-sided alternative is 0.016.

Similar comparisons can be made for the two ranking methods based on the queries from 1989 and on the "double" queries from 1987 and 1989. Both methods performed better on the 1987 queries. This is to be expected, since the work of the departments represented in ADVISOR's database is reflected in 1987 documents, and the departments' emphasis and work have undoubtedly shifted over two years. The overall conclusion that can be drawn from the data in Table 1 is that the RDM performed better than the VA (the rank of the first relevant document was closer to 1). The Wilcoxon test statistic ranged from highly significant (p value < 0.0001) to moderately significant (p value < 0.018), but in all four tests the RDM was the superior method. Recently, we compared the two methods in terms of their computational cost. We collected 316 actual queries submitted by users at Bellcore to ADVISOR and measured how long it took to do the computations needed by each method for these queries. (We ignored the time it takes to do the I/O and the sort of the documents, since this is the same for both methods.) The current version of ADVISOR represents documents and terms by 300-dimensional vectors and has 1,023 documents in its collection. The computations were done on a DEC 5000/200 machine. The computation time (the sum of user and system time) is plotted against the number of terms in the query in Figure 1. VA took substantially longer than RDM: the median VA time was 0.53 seconds, while the median RDM time was 0.02 seconds.

Conclusions

The Relevance Density Method of ranking documents for retrieval was designed to overcome two problems of the currently used method, Vector Averaging: (1) poor performance in the case of multimodal queries and (2) high computational cost. The proposed method was tested on Bellcore's ADVISOR system and performed faster and better than Vector Averaging in these tests.

References

[1] Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 1990, 41(6), 391-407.

[2] Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[3] Streeter, L. A. and Lochbaum, K. E. "An Expert/Expert-locating System Based on Automatic Representation of Semantic Structure." Proceedings of the Fourth Conference on Artificial Intelligence Applications, San Diego, CA, March 14-18, 1988, pp. 345-350.

[4] Borodin, A., Kerr, L., and Lewis, F. "Query Splitting in Relevance Feedback Systems." Scientific Report No. ISR-14, Department of Computer Science, Cornell University, Ithaca, NY, October 1968.

[5] Kane-Esrig, Y. Information Retrieval and Estimation with Auxiliary Information. Ph.D. dissertation, Field of Statistics, Cornell University, Ithaca, NY, 1990.

[6] Streeter, L. A. and Lochbaum, K. E. "Who Knows: A System Based on Automatic Representation of Semantic Structure." RIAO 88: User-oriented Content-Based Text and Image Handling, Massachusetts Institute of Technology, Cambridge, MA, March 21-24, 1988, pp. 379-388.

TABLE 1: ADVISOR RESULTS
(Lower Q, Median, and Upper Q are quartiles of the rank of the first relevant document.)

1987 Queries (263 queries; Z_wil = -2.14, p = 0.016)
  Method   Lower Q   Median   Upper Q
  VA          1         3        19
  RDM         1         2         9

1987 Pairs (66 queries; Z_wil = -4.20, p = 0.000)
  Method   Lower Q   Median   Upper Q
  VA          1         5        24
  RDM         1         3        19

1989 Queries (43 queries; Z_wil = -0.920, p = 0.018)
  Method   Lower Q   Median   Upper Q
  VA          2         8        51
  RDM         1         5        29

1989 Pairs (98 queries; Z_wil = -0.918, p = 0.018)
  Method   Lower Q   Median   Upper Q
  VA          3        10        31
  RDM         1         6        33
