Using Term Co-occurrence Data for Document Indexing and Retrieval

Holger Billhardt
Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid. Campus de Montegancedo, 28660 Boadilla del Monte (Madrid), Spain,
[email protected]
Daniel Borrajo
Departamento de Informática, Universidad Carlos III de Madrid. 28911 Leganés (Madrid), Spain,
[email protected]
Victor Maojo
Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid. Campus de Montegancedo, 28660 Boadilla del Monte (Madrid), Spain,
[email protected]
Abstract

In the vector space model for information retrieval, term vectors are pair-wise orthogonal, that is, terms are assumed to be independent. It is well known that this assumption is too restrictive. In this article, we present our work on an indexing and retrieval method that, based on the vector space model, incorporates term dependencies and thus obtains semantically richer representations of documents. First, we generate term context vectors based on the co-occurrence of terms in the same documents. Then we use these vectors to compute context vectors for documents and queries. Experimental results on four text collections (MEDLARS, CRANFIELD, CISI and CACM) show that our method performs better on some collections than the classical vector space model with IDF weights. We also comment on some disadvantages of the method and how we plan to overcome them. We conclude that the method performs well under certain circumstances, as is the case for most techniques, and, therefore, it should be used in combination with other methods rather than on its own.
1 Introduction
In traditional keyword-based IR systems, the documents of a collection are represented by a set of keywords that describe their content. These representations are matched to the words describing the user’s information need. Such systems have two fundamental problems: i) the queries have to be specified using the same set of keywords that has been used during document indexing, and ii) the keywords are usually assigned manually to the documents. More modern IR systems find a way around the problem of a restricted search vocabulary and the subjectiveness of the indexing process by automating this process. There, the representation of a document is based on the words that occur in the text or in some surrogate (e.g., abstract). The set of keywords, or, as it is commonly known in full text indexing methods, the set of index terms, includes all the words occurring in the collection. This set is usually confined to only the significant words by eliminating common functional words (also called stop words). However, this indexing approach has brought with it new problems. The first problem is known in the IR community as the vocabulary or word-matching problem. It refers to the fact that different documents describing the same subject may use different words. In such cases, a simple word-matching approach will probably miss some relevant documents, just because they do not contain the same terms as used in the query. The second problem is the growing need for mechanisms that rank documents in order of relevance, since many more documents are likely to match the words occurring in a query. Probably the best known model in IR is the Vector Space Model (VSM) [14, 16]. It implements full-text automatic indexing and relevance ranking. In VSM, documents and queries are modelled as elements of a vector space. This vector space is generated by a set of basis vectors that correspond to the index terms. Each document can be represented as a linear combination of these term vectors. 
The indexing process thus consists of calculating these document vectors. During the retrieval process, a query is also put through the indexing process and a query vector is obtained. This query vector is then matched against all document vectors and some measure of similarity or
aboutness is calculated (e.g., the cosine coefficient). The result of the retrieval process is a ranked list of documents ordered by their relevance (similarity) to the query. One assumption of the vector space model is that term vectors are pair-wise orthogonal. The interpretation of this is that terms are assumed to be independent of each other, that is, there exist no relationships between different terms. It is well known that this assumption is too restrictive, since words are not actually independent. They do have relationships, because they represent concepts or objects that may be similar or related by some other kind of association. Although the vector space model does not include term dependencies and does not solve the word-matching problem, it has turned out to be very effective, and many other more complex models have not achieved the expected substantial improvement in retrieval performance. In this article, we present a model that is based on the vector space model but relaxes the independence assumption. We use term co-occurrence data for modelling term dependencies. In our model, each index term can be considered from two points of view: i) as the name of a word in the same sense as it is used in VSM, and ii) as the semantic meaning of a word or its context in relation to other words. We represent each term as a context vector in the vector space and use these vectors as a basis for calculating document and query vectors. In section 2, we analyse the possibility of introducing term dependencies in the retrieval process. Section 3 describes the model we use in more detail and discusses the different parameters that play a role in the indexing and retrieval processes. We also discuss various term-weighting techniques that can be applied. In section 4, we show some experimental results that have been obtained on four different test collections (CACM, CISI, MEDLARS and CRANFIELD).
It seems clear that the effectiveness of the model for certain collections, measured in terms of precision and recall, is better than that of the vector space model using IDF weights. These results are discussed, and Section 5 gives some conclusions and directions for future research.
2 Term Dependencies in Document Indexes
One of the fundamental problems in information retrieval is the vocabulary or word-matching problem. It actually refers to two different things: i) the fact that the same objects may be expressed in different ways, so documents about the same issue may use different words, and ii) the existence of words that have several different meanings. The first is called synonymy and the second polysemy in information retrieval. In IR systems, the prevalence of synonyms tends to decrease recall, since not all of the relevant documents may match the terms in a query. On the other hand, the existence of polysemic terms is related to low precision. If a term with an unclear meaning is used in a query, then many irrelevant documents, which contain the term but not with the intended meaning, are likely to be retrieved. Synonymy and polysemy actually describe just the extreme cases; between the two lies a broad spectrum of relationships between words and their meanings. There have been many attempts to solve the vocabulary problem in recent years. It has been argued that the use of term dependencies can help to achieve this aim [1, 11], and almost all methods use such term dependencies in one way or another. Most of them get term relationships from co-occurrence data, that is, from the frequency with which terms co-occur in the same documents. In [18], Schütze represents the semantics of words and contexts in a text as vectors in a vector space where the dimensions correspond to words. These vectors are called context vectors. A context vector for an entity is obtained from the words occurring close to that entity in a text. Therefore, the vectors represent the context of a single occurrence of a word. Because of the high dimensionality of the vectors, dimensionality reduction is carried out by means of singular value decomposition. The vectors are later applied to word sense disambiguation and thesaurus generation tasks.
Even though the use of the methods is only reported for these tasks in [18], they could be used directly in IR systems as well. Many of the methods approach the vocabulary problem by means of query expansion. Query expansion techniques either consider the information search as a process and try to refine a user’s information need in each retrieval step [3, 4] or make use of inter-term relationships to expand a query when it is received. The information on term dependencies may come from manually or automatically generated thesauruses, as in [3, 4, 8], or may be obtained by means of relevance feedback. In relevance feedback, the system analyses the documents a user judged relevant at the previous stage and uses this information for query refinement. It has been shown that automatic query expansion using relevance feedback can add useful words to a query and can improve retrieval performance [15]. However, such techniques require the collaboration of the user who has to judge the documents supplied and such judgements are often not provided. Ad hoc or blind feedback can solve this problem [2, 9]. In this method,
documents are first retrieved for the original query. Then the top n documents are analysed and used in a relevance feedback process to create an expanded query. Afterwards, documents are retrieved again, but using this new query. Another class of methods approaches the vocabulary problem by generating representations of documents and queries that are semantically richer than just vectors based on the occurrence frequency of terms. They use the inherent semantic structure that exists in the association of terms with documents in the indexing process. In Latent Semantic Indexing (LSI) [5], singular value decomposition is used to decompose the original term/document matrix into its orthogonal factors. Of those, only the n highest factors are kept and all others are set to 0. The chosen factors can approximate the original matrix by linear combination. Thus, smaller and less important influences are eliminated from the document index vectors, and terms that did not actually appear in a document may now be represented in its vector. LSI can be seen as a space transformation approach, where the original vector space is transformed into some factor space. A transformation of the vector space is also the basis of the Generalized Vector Space Model (GVSM) [20]. In [20], Wong et al. argue that the orthogonality assumption in VSM is too restrictive and, therefore, another representation has to be found. Analysing term correlations obtained from co-occurrence data, they define a new set of orthogonal basis vectors that spans a (transformed) vector space. This space is then used to obtain semantically richer representations of document and query vectors, since term dependencies are implicitly included. The results they report with their model show a clear improvement over the classical vector space approach. The work presented in this paper is similar to the Distributional Semantics based Information Retrieval (DSIR) method proposed by Rungsawang and Rajman [12, 13].
They also use term co-occurrence information, collected over the whole document collection, to obtain term vectors in the same space as document and query vectors. Then, vectors representing the documents in the system are obtained using the term vectors of the words occurring in those documents. Even though their method is quite similar to ours in its basic aspects, in this paper we introduce some modifications with respect to term weights and query indexing, which lead to better results.
3 Document Indexing and Retrieval Based on Term Contexts
In this section, we introduce our model that, based on a vector space approach, uses term co-occurrence data in the indexing process in order to get semantically richer descriptions of the documents of a collection and of user queries. The basic idea of the model is to represent the index terms in the same vector space as documents and queries. This means that terms are also represented as vectors in the n-dimensional space spanned by all keywords. The basic indexing algorithm that we use is as follows:

1. Compute the term/document matrix
2. Generate the term correlation matrix
3. Calculate document context vectors
In the following, we describe each of these steps in more detail. We will use m to denote the number of documents in a collection and d1, d2, ..., dm to denote these documents. Furthermore, t1, t2, ..., tn denote the index terms used for indexing the documents, and n denotes the total number of index terms.
3.1 Computing the Term/Document Matrix
Normally, the starting point of the indexing process in a vector space based method is a term/document matrix as shown in Table 1.

         t1     t2     ...    tn
   d1    w11    w12    ...    w1n
   d2    w21    w22    ...    w2n
   ...
   dm    wm1    wm2    ...    wmn

Table 1: Term/document matrix

In this matrix, each element wij corresponds to the frequency with which word tj occurs in document di (hereafter called the occurrence frequency). The vector di = (wi1, wi2, ..., win) is the vector that represents the document di and is used as the initial document vector in most information retrieval models that are based on the bag-of-words approach. In the classical VSM these vectors represent the documents of a collection. They are usually only modified for normalisation purposes and through the introduction of term weights that are intended to improve the retrieval performance (e.g., Inverse Document Frequency weights). However, in our approach we use these initial term occurrence vectors as the starting point for calculating semantically richer document context vectors. As will be seen in section 3.3, we transform each initial document vector into a context vector using a term correlation matrix T that gives information about the dependencies between terms.
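As an illustration, the construction of the term/document matrix of Table 1 can be sketched in Python. This is a minimal sketch under our own naming conventions (the function `term_document_matrix` is not part of the paper), and it assumes that documents have already been tokenised, stopped, and stemmed into lists of index terms:

```python
def term_document_matrix(docs, vocab):
    """Build the m x n matrix of Table 1: W[i][j] is the occurrence
    frequency w_ij of index term t_j in document d_i.
    `docs` is a list of token lists; `vocab` is the list of index terms."""
    col = {t: j for j, t in enumerate(vocab)}  # term -> column index
    W = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(docs):
        for tok in tokens:
            j = col.get(tok)
            if j is not None:          # ignore words outside the index vocabulary
                W[i][j] += 1
    return W
```

Each row of the returned matrix is one initial document vector di, ready to be transformed into a context vector as described in section 3.3.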
3.2 Generating the Term Correlation Matrix
The term correlation matrix T, which is later used for calculating document context vectors, is of the form:

       | t11   t12   ...   t1n |
   T = | t21   t22   ...   t2n |
       | ...   ...         ... |
       | tn1   tn2   ...   tnn |

where the i-th column represents a context vector for the index term ti in the n-dimensional term space. It should be noted that this interpretation allows us to consider index terms in two different ways: i) as the basic components or elements that define the vector space, and ii) as vectors or points in this space that are themselves linear combinations of the index terms. The second interpretation, that is, the vectors corresponding to terms, allows us to integrate term dependencies into the model. We call these vectors term context vectors.

In the following we discuss the way the elements of the term correlation matrix T can be obtained. In principle, each element tij with i ≠ j is calculated from the frequency with which the index terms ti and tj co-occur in the same textual units over the whole collection. The elements tij with i = j are obtained from the collection frequency of the word ti, that is, from the number of times term ti occurs in all documents of the collection. However, as will be described below, some normalisation has to be applied to these values before the term correlation matrix T can be further used. Therefore, we start the definition of T by defining a matrix T' with elements t'ij. These values correspond to the frequencies mentioned above. Let p be the number of textual units in a collection and let tuk denote the textual unit k. Furthermore, let vki be the occurrence frequency of term ti in the textual unit tuk. Then the (i,j)-th element in the matrix T' is defined as follows:

   t'ij = sum_{k=1..p} vki             if i = j
   t'ij = sum_{k=1..p} vki * vkj       if i ≠ j        (1)

There are several aspects that should be taken into consideration with respect to formula (1):

• The selection of textual units influences the term correlation matrix. If each document is considered as a textual unit, then the second part of the formula corresponds to the sum of the co-occurrence frequencies of terms ti and tj in the documents of the collection. However, textual units may be defined in different ways. Phrases or paragraphs may be used, or a window may be defined, which is "shifted" over the texts, and the co-occurrence of terms within the same window may be counted. The latter approach is a bit more complicated, because it is necessary to assure that the same pairs of terms are not counted twice. This may occur since the window "slides" word-wise over the text and, therefore, the same pair of terms may be part of two or more windows. The "sliding window" approach is used in our system, since it allows analysing the impact of different window sizes, that is, the closeness of term co-occurrence, on the retrieval results. It should be noted that choosing the window size sufficiently large implies the use of documents as textual units.

• The co-occurrence frequency of two terms is calculated as the product of their occurrence frequencies in the same textual unit. This leads to an undesired property for a term correlation matrix. Using formula (1), the term context vector of a term ti may contain elements that are greater than the collection frequency of ti itself. From an interpretative point of view, this means that other terms have a higher influence on the semantic meaning of the term than the term itself, and that they describe its context better. This, however, seems contradictory, so we established two different ways to overcome the problem. The first one simply "cuts off" the elements that are too high at the value of the collection frequency of the term. In this way there may be terms that have the same influence on the vector as the term itself, but none of them exceeds that influence. The second approach is to set the diagonal elements of the matrix to zero. This changes the way term context vectors are interpreted. There, a term context is defined exclusively through its relationships with all other terms. In other words, a term is defined only by other terms, and the component of the term itself in its semantic meaning is zero. We tried both approaches on the MEDLARS collection, where we also did some runs without treating the problem at all. Both the "cut off" and the "diagonal elimination" approach performed better than the basic approach. Even though the elimination of the diagonal elements seems more theoretically founded in some sense, the "cut off" approach performed slightly better. Thus, we used the "cut off" approach in the experiments reported in this paper.

• The length of the term context vectors in T' varies with the collection frequency of the terms. (Note that we use the Euclidean vector norm to define the length of a vector.) This arises from the fact that terms that occur more often in the whole collection are likely to co-occur with many more other terms. Furthermore, their co-occurrence frequency in textual units with other terms will be much higher than that of rare terms. The length differences are also undesired, since, in the calculation of document context vectors, index terms with a high collection frequency would always have a higher impact than terms that occur less often.

In order to overcome this problem, we normalise all term context vectors in the matrix T' to the length of 1, and the resulting values finally build the term correlation matrix T, which is used in the process of calculating document context vectors. Therefore, the (i,j)-th element in T is defined as follows:

   tij = t'ij / sqrt( sum_{k=1..n} (t'kj)^2 )        (2)
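The construction of T can be sketched as follows. This is a simplified sketch (function name ours) that takes whole documents as textual units rather than implementing the sliding window, and applies the "cut off" strategy and the column normalisation of formula (2):

```python
import math

def term_correlation_matrix(V):
    """V is the p x n matrix of term frequencies per textual unit
    (one row per unit).  Returns T with unit-length columns."""
    p, n = len(V), len(V[0])
    # collection frequency of each term (used for the diagonal and the cut off)
    cf = [sum(V[k][i] for k in range(p)) for i in range(n)]
    Tp = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                Tp[i][j] = cf[i]                                  # t'_ii = CF(t_i)
            else:
                Tp[i][j] = sum(V[k][i] * V[k][j] for k in range(p))  # formula (1)
    # "cut off": no element of t_j's context vector (column j) may exceed CF(t_j)
    for i in range(n):
        for j in range(n):
            Tp[i][j] = min(Tp[i][j], cf[j])
    # normalise each column to unit Euclidean length (formula (2))
    for j in range(n):
        norm = math.sqrt(sum(Tp[i][j] ** 2 for i in range(n))) or 1.0
        for i in range(n):
            Tp[i][j] /= norm
    return Tp
```

A production version would use sparse matrices, since T' contains mostly zeros for realistic vocabularies.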
3.3 Calculating Document Context Vectors
Once the term correlation matrix T has been generated, we transform each initial document vector di into a context vector d'i. This is done with the following formula:

   d'i = ( sum_{j=1..n} wij * tj ) / ( sum_{j=1..n} wij )        (3)

where tj is the context vector of term tj, that is, the j-th column of T. The obtained vectors d'i = (w'i1, w'i2, ..., w'in) correspond to the centroid of the term context vectors, where each vector tj is weighted with the occurrence frequency of term tj in document di. If T is the unit matrix, then the context vectors for the index terms are pair-wise orthogonal and d'i reduces to the initial vector di (up to normalisation). This means that the presented model includes, in some way, the vector space model in its classic form. After having obtained the document context vectors d'i, these vectors are further treated in order to improve the retrieval performance. Two types of modifications are carried out:

1. multiplication of the elements of document context vectors with some appropriate term weights, and
2. normalisation of the resulting vectors.
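The transformation and the two modifications just listed can be sketched together. The sketch below (function name ours) follows the centroid reading of formula (3), optionally applies term weights as in formula (5), and ends with the unit-length normalisation of formula (9):

```python
import math

def document_context_vectors(W, T, weights=None):
    """W: m x n term/document matrix; T: n x n term correlation matrix
    whose columns are the term context vectors; weights: optional list
    of n term weights tw(t_j)."""
    n = len(T)
    D = []
    for row in W:
        s = sum(row) or 1
        # formula (3): weighted centroid of the term context vectors
        d = [sum(row[j] * T[k][j] for j in range(n)) / s for k in range(n)]
        if weights is not None:
            # formula (5): w''_ij = w'_ij * tw(t_j)
            d = [d[k] * weights[k] for k in range(n)]
        # formula (9): normalise to unit Euclidean length
        norm = math.sqrt(sum(x * x for x in d)) or 1.0
        D.append([x / norm for x in d])
    return D
```

With T equal to the unit matrix the result is just the length-normalised initial document vectors, which illustrates how the classical VSM is contained in the model.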
3.3.1 Term Weights

In the IR community, it is well known that the use of appropriate term weights can improve the retrieval performance of a system considerably. In the vector space model, each element of the initial document vectors, that is, of the columns in the document/term matrix, can be multiplied with such weights. Many different term weighting schemes have been proposed over the years (e.g., the signal weight [6] or the discrimination value [17]), but probably the most used, and also one of the most effective, is the Inverse Document Frequency (IDF). The IDF weight of a term ti is calculated as follows:

   IDF(ti) = log2( m / DF(ti) ) + 1        (4)

where m is the number of documents in the collection and DF(ti) is the document frequency of the term ti. The document frequency of a term is simply the number of documents of the collection in which the term occurs.
(Usually, log2 is applied to scale the values down, and 1 is added in order to assure that each weight is greater than or equal to 1.) The motivation for using IDF weights is the fact that terms which occur only in very few documents are better discriminators than terms that occur in the majority of the documents and, therefore, should be weighted higher. It can be expected that term weighting techniques will also improve retrieval performance in the context vector model, and indeed this is the case, as the presented results will show. Basically, we have tried four different term weighting techniques that, applied to the document context vectors, give rise to new vectors d''i, where each w''ij is defined as:

   w''ij = w'ij * tw(tj)        (5)

and tw(tj) denotes the calculated weight for term tj. Below, we briefly describe the weights we have used.

IDF Weights

We used IDF weights in the same way they are used in the classical vector space model; that is, the weight for each index term ti is calculated as in equation (4). It should be noted that the document frequency values are obtained from the initial term/document matrix, and not from the document context vectors calculated in formula (3).

Average Mean Deviation of Occurrence Frequency Values

Even though the classical IDF weight considers the number of different documents in which a term occurs, it does not take into account the differences in the number of times a term occurs in those documents. With the same argumentation as given for IDF weights, it seems obvious that the variations of the occurrence frequency of terms in the documents will also be important. For example, a term that occurs in all documents but with very different frequency values in each one will be more important than a term occurring in all documents just once. The IDF values, however, will be 1 in both cases.
This leads to the conclusion that some kind of deviation measure that assesses the differences of the occurrence frequency of terms over all documents should improve retrieval performance. The motivation for using weights based on deviation values can also be seen from another point of view. In contrast to the document vectors from the initial term/document matrix, the document context vectors calculated in (3) are not very sparse. This is due to the fact that they are calculated using not only the terms that actually occur in the documents, but also other terms that co-occur with those in at least one textual unit in the collection. Therefore, the number of non-zero elements in a document context vector will be much higher than the number of different index terms in the document. Furthermore, the elements of document context vectors have real values and not natural numbers. Because of these two properties, calculating IDF weights based on the document context vectors does not make much sense. However, deviation measures represent a similar idea and can be computed easily. In our experiments we used a modified mean average deviation weight defined as follows:

   mmad(tj) = ( sum_{i=1..m} | w'ij - mean_j | ) / mean_j + 1        (6)

where mean_j denotes the mean of the values w'ij over all document context vectors d'i. We also tried the standard deviation and the variance on the MEDLARS collection, but they performed worse and, therefore, the results have not been included in this paper. However, the difference between using one deviation measure or another is not very big, at least on MEDLARS. There are some aspects with respect to formula (6) that require a more detailed explanation. The normal mean average deviation is calculated as follows:

   mad(tj) = (1/m) * sum_{i=1..m} | w'ij - mean_j |        (7)

Applying formula (7), it turns out that terms with a higher average occurrence frequency will usually have a higher mean average deviation. This is because for such terms it is likelier that the variations from the mean are higher. Therefore, we divide (7) by the mean occurrence frequency of term tj (the mean of w'ij over all document context vectors d'i). At the same time, this "moves" the importance from frequently occurring terms towards rare terms. Furthermore, we multiply the whole formula by the number of documents in the collection in order to upscale the calculated weights. It should be noted, however, that this multiplication has no influence on the final context vectors if those are afterwards normalised to length 1. Finally, we add 1 to the formula in order to obtain weights that are greater than 1. With all these modifications, formula (7) can be written as (6). Nevertheless, different factors or other scaling techniques may be used in order to give more or less importance to rare terms, or to change the impact of the weights on the document context vectors.

Average Mean Deviation of Term Context Vectors

Another weighting technique is based on the deviation of the elements in term context vectors from their mean. These weights are only indirectly based on the document collection, because their calculation does not use the initial term/document matrix, but the term correlation matrix T. Similar to formula (6), we use a modification of the average mean deviation of term context vectors defined by:

   mmadt(tj) = ( sum_{i=1..n} | tij - mean(tj) | ) / mean(tj) + 1        (8)

where tij is the (i,j)-th element in the term correlation matrix T, mean(tj) is the mean of the elements of the context vector of tj, and n is the number of index terms in the collection. The motivation for using this deviation measure is the following. A term which is related in the same way to all other terms is not very descriptive; it does not have a very specific meaning. Such terms should have less weight in the representation of documents. In fact, common function words like "the" would fall into this category, since they have a very similar relation to all other terms and their context vectors are very uniform. On the other hand, terms that are strongly related to only a few other terms represent more specific meanings and should have higher weights. Using the same modifications as applied to the mean average deviation of occurrence frequency values moves the importance from frequently occurring terms towards rare terms.
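The two families of weights can be sketched as follows. This is our own reading of formulas (4), (6) and (8) (function names are ours); the same deviation routine serves for both mmad (applied column-wise to the document context vectors) and mmadt (applied column-wise to the term correlation matrix T):

```python
import math

def idf_weights(W):
    """Formula (4): IDF(t_j) = log2(m / DF(t_j)) + 1, computed from the
    initial term/document matrix W (m documents x n terms)."""
    m, n = len(W), len(W[0])
    return [math.log2(m / sum(1 for i in range(m) if W[i][j] > 0)) + 1
            for j in range(n)]

def mmad_weights(M):
    """Modified mean average deviation of each column of M: the summed
    absolute deviation from the column mean, divided by the mean, plus 1.
    Pass the document context vectors for mmad (formula (6)) or the
    term correlation matrix T for mmadt (formula (8))."""
    rows, cols = len(M), len(M[0])
    out = []
    for j in range(cols):
        col = [M[i][j] for i in range(rows)]
        mean = sum(col) / rows
        dev = sum(abs(x - mean) for x in col)
        out.append(dev / mean + 1 if mean else 1.0)  # guard all-zero columns
    return out
```

The com weighting of section 4 combines idf and mmadt; the paper does not spell out the combination, so e.g. an elementwise product of the two weight lists would be one plausible choice.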
3.3.2 Length Normalisation

It is well known that differences in the length of document vectors may lead to worse retrieval results. As pointed out earlier, we use the Euclidean vector norm to describe the length of a vector. Using, for example, the scalar product as a similarity measure would imply that longer documents have a higher probability of matching the terms in a query than shorter ones. Therefore, they would be ranked higher. This also applies to our model, even though not exactly for the same reason. The document context vectors are, before applying term weights, already quite similar in length. Actually, their length is always less than or equal to 1. This is because the term context vectors used in the indexing process are normalised to a length of 1. In fact, the closer the meanings of the terms occurring in a document, the closer to one will be the length of its vector. Even though this property seems interesting and could be exploited in the retrieval process in some way, we did not study it in more detail. A further variation of the length is caused by the multiplication with term weights. In order to overcome these differences, we perform the following normalisation step:

   w'''ij = w''ij / sqrt( sum_{k=1..n} (w''ik)^2 )        (9)

and finally the context vectors d'''i = (w'''i1, w'''i2, ..., w'''in) represent the documents di in our system.
3.4 Retrieval of Relevant Documents
Once the indexing process has been finished and all document context vectors are stored in the system, it can receive queries. For each query, the system computes a list that presents the documents in order of their relevance. This process is carried out in two steps: 1) calculating the query vector, and 2) computing the similarity, or distance, between the query vector and each document vector. Query vectors are obtained by indexing queries in the same way as documents, that is, using the term correlation matrix T and formulas (3), (5), and (9). This means that a query q is represented in the form of a query context vector q''' = (q'''1, q'''2, ..., q'''n). For calculating this vector, we use the same term context vectors and apply the same term weights as in document indexing. The vector is then used for calculating document/query distances. In our system, instead of using similarity coefficients for measuring the relevance of a document to a query, we employ the Euclidean distance between the points defined by the vectors in the n-dimensional vector space:
   dist(di, q) = sqrt( sum_{j=1..n} (w'''ij - q'''j)^2 )        (10)

As the retrieval result, the documents of the collection are presented in a ranked list ordered by increasing distance. It is worth noting that, if document vectors are normalised to a standard length (e.g., by formula (9)), then this ranking mechanism is equivalent to ordering documents by decreasing cosine similarity. However, the actual values obtained by both measures will be different. In fact, taking into account that the importance lies on the ranking of documents and not on the actual distance values, and that the document vectors are normalised to length 1, formula (10) can be simplified to ranking the documents by decreasing value of:

   sim(di, q) = sum_{j=1..n} w'''ij * q''j        (11)

where w'''ij is the j-th element in the document context vector d'''i and q''j is the j-th element in the query vector as obtained from formula (5). Note that the normalisation of query vectors has no impact on the ranking and may thus be omitted. The only advantage of using (11) instead of (10) is the reduction in calculation time. In on-line retrieval systems, this may be quite important, since the response time is an issue of concern. With respect to query indexing, we found out that our model performs significantly better on some collections when binary query vectors are used that reflect only the existence or non-existence of index terms in a query. This has also been experimented with in [20]. In these cases, the combination of document indexing, based on the incorporation of term co-occurrence data and term weights, with binary query vectors performs best. Using this approach, the applied formula for ranking documents is:

   sim(di, q) = sum_{j=1..n} w'''ij * qj        (12)

where w'''ij is the component of term tj in the document context vector d'''i, and qj = 1 if term tj occurs in the query q and qj = 0 otherwise. Similar to the simplification of the distance measure, the use of binary queries reduces retrieval time drastically, because query indexing is much shorter and, what is even more important, because the query vectors are very sparse. In a query context vector, almost all elements will have a non-zero value; depending on the number of index terms, there may be several thousand of them. In contrast, a binary query vector will only have a couple of non-zero elements. Using binary query vectors may also lead to more efficient methods for the calculation of document/query distances. However, this consideration exceeds the scope of this paper.
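The binary-query ranking just described exploits the sparsity directly: only the columns of the matched query terms need to be touched. A minimal sketch (function name ours) of this ranking step:

```python
def rank_binary(D, query_terms, vocab):
    """Score each document by summing the components of its context
    vector for the terms present in the query (binary query vector),
    and return document indices ranked by decreasing score, which is
    equivalent to increasing Euclidean distance for unit-length
    document vectors."""
    idx = {t: j for j, t in enumerate(vocab)}
    cols = [idx[t] for t in query_terms if t in idx]   # matched query terms only
    scores = [(sum(d[j] for j in cols), i) for i, d in enumerate(D)]
    # sort by decreasing score; ties broken by document index for determinism
    return [i for s, i in sorted(scores, key=lambda x: (-x[0], x[1]))]
```

Because `cols` typically contains only a handful of indices, each document is scored in time proportional to the query length rather than the vocabulary size.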
4 Experimental Results
In this section we present experimental results on four different test collections. All of them have been extensively used for evaluating IR systems. The collection characteristics are the following:

• MEDLARS is a collection that comprises 1033 medical abstracts. The collection has 30 queries.
• CRANFIELD is a collection of 1398 documents from aerodynamics and 225 associated queries.
• CISI contains 1460 documents and 112 queries from library science.
• CACM is a collection of 3204 documents and 64 queries.
All four collections are publicly available on the Internet, and can be obtained from ftp://ftp.cs.cornell.edu/pub/smart/ or http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/.

Before calculating context vectors for documents and queries, we carried out some pre-processing. First, we eliminated common function words (stop words). Then we computed the word stem of each remaining word with the Porter stemming algorithm [10]. From the resulting word stems, we eliminated all those that occur just once in the collection. The remaining word stems form the set of index terms {t1, t2, ..., tn}, which is used to calculate context vectors as described in section 3.

On each collection, we tested different weighting techniques and the use of binary versus context query vectors (as discussed in section 3.4). In all experiments we used the “cut off” strategy to ensure that the diagonal elements of the term correlation matrix T are maximal (see section 3.2). We also tried the “diagonal elimination” strategy on the MEDLARS collection, but it performed slightly worse; therefore, we do not include those results here. As textual units for calculating term co-occurrences we used the sections of a document as defined in the different collections (in practice, we used a window size of 100000, which is equivalent, since no document section has more than 100000 words). Using this baseline, we tested the following five term weighting functions on all four test collections:

1. no: no term weighting function is used
2. idf: inverse document frequency weights as defined in formula (4)
3. mmad: average mean deviation of occurrence frequency values (formula (6))
4. mmadt: average mean deviation of term context vectors as in formula (8)
5. com: the combination of the idf and mmadt weights
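The pre-processing pipeline described above (stop-word removal, stemming, and elimination of stems that occur only once) can be sketched as follows. This is an illustrative outline only: the stop-word list is a toy one, and the `stem` function is a crude stand-in for the Porter stemmer [10], for which a real implementation should be used in practice.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to", "for"}  # tiny toy list

def stem(word):
    # Stand-in for the Porter stemmer [10]; only strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index_terms(documents):
    """Remove stop words, stem the remaining words, and keep only stems
    that occur more than once in the whole collection."""
    counts = Counter()
    for text in documents:
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        counts.update(stem(w) for w in words)
    return sorted(t for t, c in counts.items() if c > 1)

docs = [
    "retrieval of medical documents",
    "indexing and retrieval of documents",
    "aerodynamic flow experiments",
]
print(build_index_terms(docs))  # ['document', 'retrieval']
```

The resulting term list plays the role of {t1, t2, ..., tn} in the context vector calculations of section 3.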
Each of the five weighting schemes was tested twice: once with query vectors calculated in the same way as document vectors, using the measure defined in formula (11), and once with binary query vectors, using formula (12). The results of the 10 test runs are compared to the standard VSM with IDF weights (denoted vsm). In the vsm runs we indexed queries in the same way as documents, that is, we did not use binary query vectors; this is acceptable, since the use of binary vectors in that model performs worse on all four collections. The results for vsm were obtained with our own system, employing the same set of index terms and the same evaluation methods as in all other runs, which ensures a consistent comparison.

For each experimental run we present the precision/recall curve of interpolated recall level precision averages. That is, we calculated the interpolated precision at the recall points 0, 0.1, 0.2, ..., 1 for each query, and averaged the results over all queries in a collection, using the same method as employed in the TREC experiments [19]. Furthermore, we calculated for each query the average precision over all relevant documents (non-interpolated) and report the mean of these values over all queries (presented as “Average Precision over all Relevant Documents (non-interpolated)”) as well as their variance; here, too, we use the same calculation method as in TREC [19].
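The two evaluation measures used in our experiments can be sketched as follows (a minimal implementation of the TREC-style computations [19]; the `ranking` and `relevant` values are illustrative toy data):

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision over all relevant documents:
    precision is taken at the rank of each relevant document and averaged."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def interpolated_precision_at(ranking, relevant, recall_levels):
    """Interpolated precision: at each recall level r, the maximum precision
    at any point in the ranking whose recall is >= r (TREC convention)."""
    hits = 0
    points = []  # (recall, precision) after each retrieved document
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in recall_levels]

ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
levels = [i / 10 for i in range(11)]
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(interpolated_precision_at(ranking, relevant, levels))
```

The per-query values produced this way are then averaged over all queries of a collection, and their variance is reported alongside.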
                              Context Query Vectors                     Binary Query Vectors
                     vsm      no      idf     mmad    mmadt   com       no      idf     mmad    mmadt   com
Average Precision    0.5147   0.5637  0.5671  0.5692  0.5717  0.5706    0.6238  0.6487  0.6554  0.6446  0.6531
Variance             0.0343   0.0569  0.0565  0.0565  0.0564  0.0563    0.0535  0.0493  0.0476  0.0496  0.0461

(Average Precision: “Average Precision over all Relevant Documents (non-interpolated)”, averaged over all queries; Variance: variance of this value over all queries.)
Figure 1: Test Results on the MEDLARS Collection: Precision/Recall Curves for Context Query Vectors (left curves) and Binary Query Vectors (right curves); Average and Variance of “Average Precision over all Relevant Documents”

Figure 1 presents the results obtained on MEDLARS. On this collection our method performs better than the standard VSM with IDF weights. This holds for all term weights employed, and even when no term weights are used at all. Furthermore, a significant improvement is obtained by using binary instead of context query vectors. The influence of the different term weights seems to be small, since all results are in the same range; the best performing weight, however, is mmad, the average mean deviation of occurrence frequency values.

The test results on the CRANFIELD collection are presented in figure 2. There, the behaviour of our method differs from that on MEDLARS. Only the combination weight com with binary query vectors achieves a better average precision over all relevant documents than standard VSM with IDF weights. Moreover, for all runs, the precision values at the first recall levels are lower than for vsm. This changes at higher recall levels (from 0.5/0.6 to 1.0), where com, mmad and idf with binary query vectors performed better; these are also the weights that perform best within our model. The fact that our model performs better at higher recall levels can also be observed in the MEDLARS runs, where the improvement at these levels is even larger.

The results for the CACM and CISI collections are shown in figures 3 and 4, respectively. On both collections our model performs worse than vsm. Only on the CISI collection are some precision values, when using mmadt or no weights, slightly better at recall levels greater than 0.6. We think that the difference between the good performance on MEDLARS and the poor performance on CISI and CACM is due to the different characteristics of these collections. Documents in CACM and CISI are shorter than those in MEDLARS, which may lead to more arbitrary and biased term contexts and, thus, to poorer retrieval results. It can be expected that term correlation matrices obtained from corpora larger than the collection under consideration will lead to better document descriptions. One advantage of our model is that it facilitates such an approach since, in principle, term dependencies are independent of the particular collection. In this sense, the term correlation matrix can be “learned” from large corpora as a form of general language understanding. However, certain types of term relationships will depend on the domains of the corpora from which they have been obtained. It seems clear that obtaining term dependencies from larger corpora can only improve the document descriptions if the domain of those corpora matches the domain of the collection under consideration.
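The raw material for such a term correlation matrix is the co-occurrence of terms in the same textual units. As a rough sketch of the counting step only (plain co-occurrence counts over whole documents; the actual correlation weighting and the “cut off” strategy of section 3.2 are not reproduced here):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_matrix(documents, index_terms):
    """Count, for each pair of index terms, the number of textual units
    (here: whole documents) in which both occur. The matrix is symmetric;
    the diagonal holds the document frequency of each term."""
    term_set = set(index_terms)
    counts = defaultdict(int)
    for text in documents:
        terms = sorted(set(text.lower().split()) & term_set)
        for t in terms:
            counts[(t, t)] += 1          # diagonal: document frequency
        for t1, t2 in combinations(terms, 2):
            counts[(t1, t2)] += 1
            counts[(t2, t1)] += 1        # symmetric entry
    return counts

docs = ["heart disease treatment", "heart attack treatment", "lung disease"]
T = cooccurrence_matrix(docs, ["heart", "disease", "treatment", "lung"])
print(T[("heart", "treatment")])  # 2: the terms co-occur in two documents
```

Counting over a large external corpus instead of `docs` is exactly the “learning from larger corpora” idea discussed above; the domain of that corpus should match the domain of the target collection.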
                              Context Query Vectors                     Binary Query Vectors
                     vsm      no      idf     mmad    mmadt   com       no      idf     mmad    mmadt   com
Average Precision    0.3915   0.3537  0.3644  0.2561  0.3553  0.3629    0.3096  0.3774  0.3821  0.3249  0.3939
Variance             0.0618   0.0717  0.0733  0.0591  0.0721  0.0751    0.0650  0.0692  0.0741  0.0670  0.0703

Figure 2: Test Results on the CRANFIELD Collection: Precision/Recall Curves for Context Query Vectors (left curves) and Binary Query Vectors (right curves); Average and Variance of “Average Precision over all Relevant Documents”
                              Context Query Vectors                     Binary Query Vectors
                     vsm      no      idf     mmad    mmadt   com       no      idf     mmad    mmadt   com
Average Precision    0.3266   0.2531  0.2569  0.2306  0.2537  0.2614    0.1456  0.2231  0.2176  0.1717  0.2454
Variance             0.0564   0.0462  0.0476  0.0457  0.0458  0.0480    0.0221  0.0345  0.0281  0.0267  0.0354

Figure 3: Test Results on the CACM Collection: Precision/Recall Curves for Context Query Vectors (left curves) and Binary Query Vectors (right curves); Average and Variance of “Average Precision over all Relevant Documents”
                              Context Query Vectors                     Binary Query Vectors
                     vsm      no      idf     mmad    mmadt   com       no      idf     mmad    mmadt   com
Average Precision    0.2334   0.1570  0.1513  0.1098  0.1531  0.1456    0.1900  0.1961  0.1729  0.1965  0.1968
Variance             0.0290   0.0142  0.0146  0.0081  0.0142  0.0140    0.0263  0.0198  0.0154  0.0273  0.0196
Figure 4: Test Results on the CISI Collection: Precision/Recall Curves for Context Query Vectors (left curves) and Binary Query Vectors (right curves); Average and Variance of “Average Precision over all Relevant Documents”

Another basic difference between the collections lies in the queries. In CACM, queries ask for very specific information, whereas in MEDLARS they define rather vague information needs. In the former case, direct word-matching approaches will perform better than context vector comparison, since the latter describes the sought concepts only vaguely. However, vagueness in users’ information needs may be intended [7], and the strength of our approach seems to lie in retrieval tasks with vague information needs. Moreover, in contrast to MEDLARS, many queries in CACM specify the user’s need in a human-like way, e.g. “I’m looking for X, but I am not interested in Y, and some examples are Z”. For this kind of query, too, VSM seems to work better.

In general, the main advantage of the context vector model seems to be the performance improvement in retrieval tasks with vague information needs, or when high recall is required. Thus, the context vector method should be used in such environments, as an extension to, or in conjunction with, other methods. In these cases it also seems reasonable to use binary query vectors, which decrease retrieval time drastically.

The main disadvantage of the model is its memory and time consumption. The indexing process takes longer than in the classical VSM, because the term correlation matrix has to be generated. Furthermore, the method requires more memory, since it has to store this matrix and the document context vectors. Even though the correlation matrix is symmetric, so that only slightly more than half of it has to be stored, the memory requirements grow with the number of index terms. The same holds for the document context vectors.
Document vectors in VSM usually have only a few non-zero elements, even though their dimension may be quite high. In contrast, the number of non-zero elements in the document context vectors grows with the number of index terms. However, the main problem regarding memory and time consumption lies not in the indexing process, which can be run in batch mode, but in the retrieval process, where short response times are usually required. If binary query vectors are used, as proposed, then query indexing is very fast and requires little memory; in this case, the calculation of document/query distances by formula (12) will also be fast, because the query vectors are sparse. The only remaining problem is the size of the document context vectors: for big collections it will be difficult to load all vectors into memory. In [12, 13] a solution to this problem is proposed, in which a subset of the set of index terms is selected before the actual indexing process. This subset spans a (reduced) vector space; since the number of selected terms will usually be smaller than the number of index terms, the term, document, and query context vectors have a lower dimension. We propose another solution, in which the initial vector space is maintained, but only the most significant elements of each document context vector are stored and all others are set to 0. This does not reduce the dimension, but generates sparser vectors with only a few non-zero elements. Experiments on MEDLARS, in which we reduced the document context vectors by zeroing all elements smaller than the maximum element divided by 10, still show an average precision over all relevant documents of around 0.61.
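The proposed reduction of document context vectors can be sketched as follows (the function name and the treatment of elements lying exactly at the threshold are our own choices):

```python
def prune_context_vector(vector, ratio=10.0):
    """Zero out all elements smaller than the maximum element divided by
    `ratio`, yielding a sparser vector; the MEDLARS experiment above uses
    ratio 10."""
    threshold = max(vector) / ratio
    return [v if v >= threshold else 0.0 for v in vector]

v = [0.90, 0.30, 0.05, 0.10, 0.02]
print(prune_context_vector(v))  # [0.9, 0.3, 0.0, 0.1, 0.0]
```

The pruned vectors can then be stored in a sparse representation, which addresses the memory problem without changing the dimension of the vector space.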
5
Conclusions
In this paper we presented our work on an information retrieval model that uses term dependencies in the process of indexing documents and queries. The calculated context vectors, which represent the documents of a collection, are semantically richer descriptions of the issues treated in those documents. The use of such vectors implies that the calculation of document/query distances is based on semantic matching rather than on a simple word-matching approach. This brings more uncertainty or vagueness into the retrieval process, and it can be expected that, in comparison to standard VSM, the proposed method will improve recall but may lose precision at low recall levels. The results presented in this paper seem to confirm this hypothesis.

We tested the model, with different term weighting schemes, on four test collections (MEDLARS, CRANFIELD, CACM, and CISI), and compared the results to standard VSM with IDF weights. The model’s behaviour varies considerably across the collections. On MEDLARS, a significant improvement has been shown. Small, but non-significant, improvements have also been obtained on CRANFIELD, using the com weights. However, on CACM and CISI the method performs significantly worse than VSM-IDF. For measuring the significance of the results we applied a non-parametric test known as the sign test, with a significance level of 5%. However, significance tests in IR should be interpreted with care, since many unknown variables influence retrieval performance.

In this sense, the presented results seem to confirm that our method is not generally better than VSM-IDF, but that it may improve retrieval performance for certain types of queries and/or document collections. The same seems to hold for the selection of term weighting techniques and for the use of query context vectors versus binary query vectors. It is a task for further research to find out under which conditions a particular retrieval method should be used. In general, rather than substituting for other models, we believe that our method should be used as a complement to others that work better on high precision tasks. We think that a substantial improvement in retrieval performance goes along with the use of different retrieval models, complemented by a system that, by analysing incoming queries, decides which of them will work best. Our future research is concerned with this issue.
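The sign test used for the significance analysis can be sketched as follows (a minimal two-sided version based on the binomial distribution; queries on which both methods obtain the same per-query score are assumed to be discarded beforehand):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test: under the null hypothesis that neither method is
    better, the number of queries won by method A follows Binomial(n, 0.5)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. method A beats method B on 24 of 30 queries (no ties):
p = sign_test_p(24, 6)
print(p < 0.05)  # significant at the 5% level used here
```

For 24 wins out of 30 queries the two-sided p-value is well below 0.05, so such a result would count as significant under the criterion used in this paper.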
6
References
1. Bollmann-Sdorra P, Raghavan V V. On the Necessity of Term Dependence in a Query Space for Weighted Retrieval. Journal of the American Society for Information Science 1998, 49(13):1161-1168
2. Buckley C, Salton G, Allan J, Singhal A. Automatic Query Expansion Using SMART: TREC-3. In: Harman D K (ed). Overview of the Third Text REtrieval Conference (TREC-3). NIST Special Publication 500-226, 1995, pp 69-80
3. Chen H, Ng T D, Martinez J, Schatz B. A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System. Journal of the American Society for Information Science 1997, 48(1):17-31
4. Chen H, Yim T, Fye D, Schatz B. Automatic Thesaurus Generation for an Electronic Community System. Journal of the American Society for Information Science 1995, 46(3):175-193
5. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 1990, 41(6):391-407
6. Dennis S F. The Design and Testing of a Fully Automated Indexing-Searching System for Documents Consisting of Expository Text. In: Schecter G (ed). Information Retrieval: A Critical Review. Thompson Book Company, 1967, pp 67-94
7. Efthimiadis E N. A User-Centered Evaluation of Ranking Algorithms for Interactive Query Expansion. In: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’93), ACM Digital Library, 1993, pp 146-159
8. Jing H, Tzoukermann E. Information Retrieval Based on Context Distance and Morphology. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), ACM Digital Library, 1999, pp 90-96
9. Mitra M, Singhal A, Buckley C. Improving Automatic Query Expansion. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), ACM Digital Library, 1998, pp 206-214
10. Porter M F. An algorithm for suffix stripping. Program 1980, 14:130-137
11. Raghavan V V, Wong S K M. A Critical Analysis of Vector Space Model for Information Retrieval. Journal of the American Society for Information Science 1986, 37(5):279-287
12. Rungsawang A. DSIR: the First TREC-7 Attempt. In: Voorhees E M, Harman D K (eds). Proceedings of the Seventh Text REtrieval Conference (TREC-7). NIST Special Publication 500-242, 1999, pp 425-433
13. Rungsawang A, Rajman M. Textual Information Retrieval Based on the Concept of the Distributional Semantics. In: Proceedings of the Third International Conference on Statistical Analysis of Textual Data (JADT’95), Rome, Italy, 1995
14. Salton G. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989
15. Salton G, Buckley C. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science 1990, 41(4):288-297
16. Salton G, McGill M. Introduction to Modern Information Retrieval. McGraw-Hill, 1983
17. Salton G, Yang C S. On the Specification of Term Values in Automatic Indexing. Journal of Documentation 1973, 29(4):351-372
18. Schütze H. Dimensions of Meaning. In: Proceedings of Supercomputing ’92, IEEE, 1992
19. Voorhees E M, Harman D K (eds). Proceedings of the Seventh Text REtrieval Conference (TREC-7). NIST Special Publication 500-242, 1999, Appendix A
20. Wong S K M, Ziarko W, Raghavan V V, Wong P C N. On Modeling of Information Retrieval Concepts in Vector Spaces. ACM Transactions on Database Systems 1987, 12(2):299-321