Improving Document Representations Using Relevance Feedback: The RFA Algorithm

Razvan Stefan Bot
Information Systems Department, NJIT, GITC 4323, 23 Martin Luther King Jr. Blvd, Newark, NJ 07102, USA
1 (973) 596-5422
[email protected]

Yi-fang Brook Wu
Information Systems Department, NJIT, GITC 4400, 23 Martin Luther King Jr. Blvd, Newark, NJ 07102, USA
1 (973) 596-5285
[email protected]

ABSTRACT

In this paper we present a document representation improvement technique, named the Relevance Feedback Accumulation (RFA) algorithm. Using prior relevance feedback assessments and a data mining measure called "support", the algorithm's learning function gradually improves document representations, over time and across users. Results show that the modified document representations yield lower dimensionality while improving retrieval effectiveness. The algorithm is efficient and scalable, suited for retrieval systems managing large document collections.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – relevance feedback, retrieval models, document learning, document representation.

General Terms: Algorithms, Theory.

Keywords: Information retrieval, relevance feedback, document representation, document-oriented view.

1. INTRODUCTION

Document representation is a major research focus in Information Retrieval (IR). It is concerned with the transformation of textual documents into surrogate internal data structures that support fast and effective retrieval. Therefore, one of the goals is that every document is replaced by a smaller, specialized data structure. The most common such data structure is the inverted file, also called an index. This study is only concerned with indexes generated in the context of the Vector Space Model (VSM), where documents and queries are represented as multidimensional vectors and each dimension corresponds to a specific word/term. All IR systems consist of three main activities: document gathering, automatic indexing and retrieval. The indexes are generated during

the automatic indexing phase and remain unchanged afterwards. They represent a static structure that does not evolve automatically. If new documents are added to the collection, new indexes must be generated for these documents. It is a very difficult task to design algorithms that automatically generate topical indexes for text documents, because they must select content-bearing words (called terms or concept terms) from the documents and assign them weights to indicate how important they are in describing the topics of the documents. Automatic indexing algorithms usually generate large numbers of unimportant terms that have little or nothing to do with the topics of the documents. Most of them use term frequency (TF) as the main indicator of a term's importance. Having estimated TF, a weighting scheme can be computed. This approach is also called frequency-based analysis or lexicographic analysis. The yet unproven assumption behind this approach is the following: the more occurrences a word/term has in a document, the more important it is in that document [13].

Our motivation to design and test a document representation improvement algorithm is driven by the inherent problems associated with frequency-based automatic indexing algorithms:

- Poor quality: most index terms do not tell anything about the topics of the documents from which they were selected.

- Fixed indexes: the resulting indexes are static structures. Once generated, they cannot be further altered in response to changes in the IR system environment (for example, terminology evolution).

- High dimensionality: the dimension of the generated indexes is usually very large, because of the great number of unwanted index terms. Dimensionality reduction was successfully applied in text categorization [16].

- Word mismatch: the indexes suffer from the word mismatch problem. For example, a document talking about "laptop" will never be retrieved in response to a query containing only the term "notebook", because the index does not contain the latter term, which in fact is a synonym of "laptop".

Our approach consists of an algorithm that collects and uses additional information in order to improve the quality of the document representations. The additional information is obtained by analyzing the relevance feedback (RF) history/corpus. We use a data mining approach to analyze the RF corpus. Relevance feedback is the user feeding back into the system decisions on the relevance of

retrieved documents [11]. The quality of a document representation is defined along two axes: quantitative and qualitative. The quantitative facet is concerned with the efficiency aspect (the dimensionality of a document representation), while the qualitative facet is concerned with the quality of the index terms. This study focuses on evaluating the quantitative aspect of a document's representation quality. Along these lines, we developed a document representation improvement algorithm called Relevance Feedback Accumulation (RFA) that uses a prior RF corpus collected over time and across users in order to improve and maintain a dynamic document representation space. The algorithm can be applied on top of any existing VSM-based IR system. Due to its efficiency and simplicity, our solution is scalable to retrieval systems managing large document collections. This idea was further encouraged and motivated by one of the identified near-future IR challenges: the need for models and tools for incorporating multiple sources of evidence (text, queries, relevance judgments, user context, etc.) [1].

The rest of this paper is organized as follows: Section 2 presents a literature review focused on document representation improvement techniques; Section 3 describes the algorithm; Section 4 presents the evaluation design; Section 5 presents our findings; and Sections 6 and 7 present our future research directions and the conclusions of this study.

2. PRIOR STUDIES

Since the first attempts reported in [4], very little effort has focused on document representation improvement techniques that use relevance feedback across time and users as the main source of modification information. These techniques form the document-oriented view (DOV) and are also referred to as document representation learning techniques. Most research efforts, however, were spent on query-oriented view (QOV) techniques that use relevance feedback to improve queries (e.g., query expansion). One explanation for this situation is that the available test collections are not suitable for document-oriented techniques [3], because not enough relevance data points are available for each document. This section presents a compilation of the most important research under the document-oriented view umbrella.

Friedman et al. [7] describe an algorithm that creates and maintains a dynamic document space. The alteration of the document terms (or concepts) is based on their discrimination power. A good positive discriminator is strong in relevant documents and weak in non-relevant documents. The strength or weakness of a term is judged in terms of frequency-based importance measures (one example is tf-idf). The algorithm uses prior knowledge, built up from the relevance assessment history, in order to separate relevant and non-relevant documents. The evaluation of the algorithm revealed better results when compared to a query expansion IR system.

Brauen et al. [4] propose an algorithm in which document vectors are modified for all documents relevant to a query. The modification is done according to the following formula:

d_n^new = d_n + α (q_0^norm − d_n), where d_n is a document vector and q_0 is a query vector.

The evaluation test was conducted using the Cranfield collection. After several iterations with different values for α, a value of 0.2 was found "optimal." For this value, the methodology revealed higher normalized recall and precision levels.

Ide [10] proposes a method quite similar to that of Brauen et al. [4] described above. The main difference is that Ide's algorithm also modifies weights in non-relevant documents, in order to move them farther from the queries to which the documents were irrelevant. For this operation, only the high-ranking irrelevant documents were considered. The methodology was tested on the Cranfield collection and the results were similar to those of Brauen et al. The levels of normalized recall and precision were improved around the same value of 0.2 for α.

Brauen [5] presents an algorithm that improves the functionality of his previous algorithm [4]. He argues in favor of an extended vector modification function according to the following term/concept classification:

- Terms/concepts present only in the query vector: d_n^new(i) = β

- Terms/concepts present only in the document vector: d_n^new(i) = d_n(i) − (d_n(i)/δ + 1)

- Terms/concepts present in both the query and the document vector: d_n^new(i) = d_n(i) + γ [120 − d_n(i)]

This new methodology shows slightly better performance than the previous two methodologies [4][10]. The optimal parameters reported by the author are beta=30, delta=8 and gamma=0.225.

Parker [12] proposes a solution similar in essence to Brauen's previously described techniques. The idea is framed under the title of document learning. In the author's formulation, document learning is a process having queries, documents, and a set of relevance assessments as inputs. The output of this process is an altered document space.

Belew [2] presents a mechanism that uses connectionist networks (e.g., neural networks) to learn document representations and associations. The neural network generates nodes for documents, index terms and authors. These nodes are interconnected by weighted links, where the weight of a link emphasizes the strength of association between two nodes. The weights are constantly adjusted using back propagation algorithms, and the signal to adjust the weights is obtained from relevance feedback assessments. The system is one of the few to generate associative rules of the type term1 → term2. For example, the system is able to emphasize the association between adaptive and adaptation without using any generalization techniques such as stemming.

Fuhr and Buckley [8][9] describe a method for probabilistic indexing using relevance feedback. They use relevance descriptors in order to replace individual term-document descriptors. A relevance descriptor is composed of a set of features that are considered to be important when assigning a weight to a term-document pair. The reason to build relevance descriptors is to compensate for the lack of relevance feedback. The technique represents an interesting approach to tackling the lack of relevance assessments, but it does not accommodate continuous document representation improvement over time and across users.

Bodoff et al. [3] present a hybrid approach called the unified view. It has the characteristics of both the query-oriented view and the document-oriented view. It uses a maximum likelihood approach in order to estimate both the query and document representations. Both document information and a priori relevance feedback are considered.

3. THE RFA ALGORITHM

This section consists of a detailed presentation of the RFA algorithm. The goal of the algorithm is to create and maintain a dynamic document representation space of low dimensionality without affecting retrieval effectiveness. The RFA algorithm is based on the assumption that every document is best characterized by a set of a few terms called concepts or concept terms. More precisely, a concept term is a word that is semantically related to the topicality of a document. It is not necessary for that exact concept term to be found within a document for which it is relevant. For example, the concept "notebook" is relevant to all documents containing the word "laptop," even though "notebook" is not found in them.

From the RFA algorithm's point of view, the importance of concepts is not obtained by a simple lexicographic analysis of the documents' content. The importance of terms is derived from RF assessments over time and across users (the RF corpus). In our design, a document concept is identified as a term that has reasonable support among all queries from all relevance assessments of this document. Support is a data mining measure emphasizing the occurrence percentage of an item in a set of transactions. In this case, a query term is considered an item, while the query is considered to be the transaction. A query can have one or more query terms. The algorithm alters the weight/importance of a term in a document according to its support throughout the RF corpus, by means of a weight learning function. One can notice that the discovery of concept terms is user-driven. In time, the concepts from each document will reflect the users' general perception regarding which are the most important terms to describe the document. In order to reduce the number of concept terms used for document representation, the RFA algorithm eliminates from a document's representation all the terms that have low support among all the queries that retrieved the document. As a whole, the RFA algorithm tries to identify a small number of high quality concept terms to index each document without affecting retrieval effectiveness.

Before presenting the details of the algorithm, we must introduce a few concepts and notations. We use Q to denote a query vector, D to denote a document vector, and the tuple (Q, D) to denote a relevance feedback assessment. The RFA algorithm manipulates two types of terms: simple and composite. Simple terms are one-word terms, while composite terms are two-word terms. The RFA algorithm looks for composite terms only within a query Q from a relevance assessment (Q, D). The mechanism is the following: consider Q (term_1, term_2, term_3, term_4) to be a query composed of four terms. The single terms of Q are term_1, term_2, term_3 and term_4. The composite terms are derived from the original query Q by using a heuristic called ordered terms pairing. This heuristic takes all pairs of single terms from a query, while maintaining their ordering in the query, and builds composite terms by aggregating them. In the case of Q considered above, the heuristic will generate the following composite terms: [term_1, term_2], [term_1, term_3], [term_1, term_4], [term_2, term_3], [term_2, term_4] and [term_3, term_4]. After generating the composite terms, Q is considered to be composed of the union of all single and composite terms.

The rationale behind the idea is that individual words (single terms) often do not provide enough meaning with respect to the topicality of a document. From this point on, the notion term denotes both single and composite terms.

The RFA algorithm uses activation triggers to preprogram its actions. An activation trigger is defined as "a flag that becomes active whenever a certain condition relative to the relevance assessments collected so far is satisfied." An example of such an activation trigger would be a counter showing how many relevance assessments the system has collected with respect to a certain document. Whenever this counter reaches a certain threshold (for example 1000 relevance assessments), the trigger becomes active. The RFA algorithm associates such an activation trigger with each document in the collection. The UML activity diagram in Figure 1 depicts the sequence of steps of the algorithm.

STEP 1 – Automatic Indexing: during this step, an initial document space is created by automatically indexing the whole document collection. The indexing procedure uses standard lexicographic analysis (e.g., tf-idf based measures) complemented by additional improvement techniques such as stop word removal and/or stemming. One very important remark to be made at this point is that STEP 1 is performed only once, after the document collection is gathered. Afterwards, only STEPs 2, 3 and 4 are repeated indefinitely at pre-programmed intervals (see Figure 1). The pre-programming is implemented using the activation triggers defined above.

STEP 2 – Collecting Relevance Feedback: during this step, the system collects from searchers any relevance feedback assessment(s) of type (Q0, D), where Q0 represents the original query formulated by the searcher. One can observe in Figure 1 that STEP 2 is a parallel/concurrent process with "Perform Retrieval"; that is because the process of collecting relevance feedback is part of the retrieval operation. During this step the result might be a set of (query, document) tuples rather than a single tuple. This happens when the user judges several documents from the result set to be relevant to the same query. For each of the (Q0, D) tuples:

- Q0 is transformed to Q using the ordered terms pairing heuristic. The relevance judgment (Q0, D) then becomes (Q, D).

- (Q, D) is then used to update the data structure designed to accumulate the relevance feedback. This data structure is composed of two matrices, called the Term-Document Matrix (the document vector space) and the Document Matrix. For all the terms that appear in both Q and D, the RF accumulation data structures are updated: the relevance assessment counters for the (term_Q, D) pairs in the Term-Document Matrix, and for the document in the Document Matrix, are increased. The relevance assessment counter for a (term_Q, D) pair in the Term-Document Matrix shows how many times the term term_Q and the document D were involved together in a relevance assessment. The relevance assessment counter for a document in the Document Matrix shows how many times document D was involved in a relevance assessment. Having computed these two values, it is easy to estimate the support for any (term_Q, D) pair. Any term in Q that does not appear in D is introduced as a new term in D's vector; its initial weight is set to the minimum weight among all terms in D's vector. In this way, new potentially high-quality terms are added to the document's vector. This mechanism is aimed at tackling the word mismatch issue presented as motivation in the introduction of this paper (see Section 1). A sketch of these bookkeeping structures follows this list.
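Neither the ordered terms pairing heuristic nor the accumulation matrices are given as code in the paper; the following Python sketch is one possible reading of STEP 2. The names (ordered_terms_pairing, RFAccumulator) are hypothetical, support(term, D) is taken to be the fraction of D's relevance assessments whose expanded query contains the term, and the way counters are handled for newly added terms is an assumption, since the text does not fully specify it.

```python
from collections import defaultdict
from itertools import combinations


def ordered_terms_pairing(query_terms):
    """Expand a query into its single terms plus all composite (two-word) terms,
    keeping the original ordering, as in the four-term example in the text."""
    return list(query_terms) + list(combinations(query_terms, 2))


class RFAccumulator:
    """Hypothetical relevance feedback accumulation structure (not the authors' code).

    term_doc_count[(term, doc_id)]: how many assessments of doc_id involved the term
    doc_count[doc_id]:              how many assessments doc_id received in total
    """

    def __init__(self):
        self.term_doc_count = defaultdict(int)
        self.doc_count = defaultdict(int)

    def record_assessment(self, original_query_terms, doc_id, doc_vector):
        """Process one relevance judgment (Q0, D): expand Q0, update the counters,
        and add missing query terms to D's vector at the minimum existing weight."""
        expanded = ordered_terms_pairing(original_query_terms)
        self.doc_count[doc_id] += 1
        min_weight = min(doc_vector.values()) if doc_vector else 0.0
        for term in expanded:
            if term not in doc_vector:
                # word-mismatch handling: the new term enters D's vector at minimum weight
                doc_vector[term] = min_weight
            self.term_doc_count[(term, doc_id)] += 1  # assumption: counted for new terms too

    def support(self, term, doc_id):
        """Support of a term for a document: share of the document's relevance
        assessments in which the term occurred."""
        total = self.doc_count[doc_id]
        return self.term_doc_count[(term, doc_id)] / total if total else 0.0
```

For the four-term query used as an example in the text, ordered_terms_pairing returns the four single terms plus the six composite pairs listed above.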

Figure 1. UML Activity Diagram for RFA Algorithm

STEP 3 – Document Space Transformation: the document term modification takes place for each document whenever the attached activation trigger becomes active. This is the weight learning function. For all (term, D) pairs:

Step 3.1: compute the new support value S_NEW(term, D) using the information collected during the previous iterations of STEP 2.

Step 3.2: compute the support variation ∆SUP = S_NEW − S_OLD, where S_OLD is the old support for the tuple (term, D), computed during the previous STEP 3.

Step 3.3: modify the weight of the tuple (term, D) as follows:

If ∆SUP > 0, then w_NEW = w_OLD + (1 − w_OLD) ∗ ∆SUP. In this formula, "w" stands for weight; weight values range from 0 to 1. The weight is increased in direct proportion to the increase in support ∆SUP, and the factor (1 − w_OLD) makes sure the value of the weight will not exceed 1.

If ∆SUP < 0, then w_NEW = w_OLD + w_OLD ∗ ∆SUP. The weight is decreased in direct proportion to the decrease in support ∆SUP (in this case ∆SUP is negative).

If ∆SUP = 0, then w_NEW = w_OLD. No change is made if the support of the term does not change.

STEP 4 – Term Classification: following STEP 3, all the terms characterizing D are re-classified into three categories, according to their support values (see Figure 2):

- Type R terms: relevant terms, having high support.
- Type C terms: candidate terms, having moderate support.
- Type N terms: non-relevant terms, having low support.

The type N terms are no longer considered indexing terms, but they are still kept in the data structure, because their support might increase with future relevance assessments and they may become type C or R terms again. This is how the RFA algorithm reduces the dimensionality of the document representations. The support thresholds are parameters to be estimated. For our collection we used ST_N=0.05 and ST_R=0.3 as the best threshold configuration; we estimated them by testing the system for different value pairs. A code sketch of these two steps follows Figure 2.

Figure 2. Types of Index Terms
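As a companion to the sketch of STEP 2 above, the weight learning function of STEP 3 and the classification of STEP 4 can be written directly from the formulas and thresholds given in the text. This is a sketch, not the authors' implementation; the treatment of values falling exactly on a threshold and the helper names are assumptions.

```python
def update_weight(w_old, s_new, s_old):
    """STEP 3 weight learning function, driven by the change in support."""
    delta_sup = s_new - s_old
    if delta_sup > 0:
        return w_old + (1.0 - w_old) * delta_sup   # bounded above by 1
    if delta_sup < 0:
        return w_old + w_old * delta_sup           # bounded below by 0
    return w_old                                   # no support change, no weight change


def classify_term(support, st_n=0.05, st_r=0.3):
    """STEP 4 classification by support (ST_N and ST_R are the thresholds reported
    as optimal for the evaluated collection; boundary handling is an assumption)."""
    if support >= st_r:
        return "R"   # relevant term: kept as an index term
    if support >= st_n:
        return "C"   # candidate term: kept as an index term
    return "N"       # non-relevant term: dropped from the index, kept in the data structure


def transform_document(doc_vector, old_support, support_fn):
    """Apply STEPs 3 and 4 to one document when its activation trigger fires.

    doc_vector:  {term: weight}, weights in [0, 1]
    old_support: {term: support computed at the previous STEP 3}
    support_fn:  callable returning the current support of a term for this document
    Returns the new support snapshot and the reduced index (terms of type R or C).
    """
    new_support = {}
    index_terms = {}
    for term, w_old in doc_vector.items():
        s_new = support_fn(term)
        doc_vector[term] = update_weight(w_old, s_new, old_support.get(term, 0.0))
        new_support[term] = s_new
        if classify_term(s_new) != "N":
            index_terms[term] = doc_vector[term]
    return new_support, index_terms
```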

4. EVALUATION

4.1 Experimental Settings and Procedures

The RFA algorithm's goal is to improve the quality of the document representation without negatively affecting the retrieval effectiveness. This section describes the experimental settings used to assess the behavior of the RFA algorithm. The document collection used for evaluation consists of a portion of the TIPSTER/TREC document corpus. The TIPSTER collection, from TREC disks 1, 2 and 3, contains more than 510,000 documents. From these, around 25,000 documents are selected to form the evaluation document collection for the RFA algorithm. The selection process is performed according to the following procedure: for each topic T to be used during evaluation (i.e., topics 51-100), we first selected all relevant documents using the provided relevance assessments. Then we randomly selected additional non-relevant documents, until the total

reached 500 for each T. With 50 topics, the resulting document collection consists of about 50 × 500 = 25,000 documents. The reason for using only a sub-collection is that the RFA algorithm performs document space transformations in which the only documents whose representations are altered are those relevant to the 50 topics. There is no advantage in choosing a larger sub-collection, since in that case most of the documents' representations would remain unchanged, because the purpose of our algorithm is to modify the document representations of relevant documents. Only the initial effectiveness parameters would be different for a larger collection. The coverage ratio [11] of relevant documents per topic for this collection is 0.012.
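The sub-collection construction just described is straightforward to express in code. The sketch below only illustrates the sampling procedure; the function and variable names are hypothetical, and qrels stands for the provided TREC relevance assessments.

```python
import random


def build_evaluation_subcollection(topics, qrels, all_doc_ids, docs_per_topic=500, seed=42):
    """For each topic keep every relevant document, then pad with randomly chosen
    non-relevant documents until the topic contributes docs_per_topic documents."""
    rng = random.Random(seed)
    selected = set()
    for topic in topics:
        relevant = set(qrels.get(topic, ()))                    # provided relevance assessments
        padding_needed = max(0, docs_per_topic - len(relevant))
        non_relevant = [d for d in all_doc_ids if d not in relevant]
        padding = rng.sample(non_relevant, padding_needed)      # random non-relevant documents
        selected.update(relevant)
        selected.update(padding)
    return selected
```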

The pool of queries used is the one generated for the TREC Query Track. This query set consists of about 43 queries for each of the 50 TREC topics. Both experts and non-experts generated this set [6]. Each query is then considered to share the relevant documents of the TREC topic to which it is associated. The independent variables of the experimental setting were:

- Retrieval system type: Table 1 lists all retrieval systems evaluated and compared for this study. The systems BR, BB and BS are hybrid systems created in order to test three other possible approaches to document representation improvement.

- Stemming: each of the baseline and augmented systems was tested under both stemming and no-stemming conditions. The reason for introducing stemming as an independent variable is that stemming itself is able to reduce document representation dimensionality. It is then interesting to see whether stemming together with RFA can yield even better results than RFA alone. For the stemming conditions we used the Lovins stemmer.

- Browsing batch size: the number of documents, out of the total number of retrieved documents, that are examined by the "user" (the evaluation is done automatically) to provide relevance assessments. In order to better simulate the real user experience, for any given training set query, the RFA algorithm does not modify all relevant returned documents. The relevant document set is restricted to those identified within the first 30 or 50 retrieved documents; all others are ignored. The two browsing batch sizes of 30 and 50 are derived from the findings reported in [17], where the median number of retrieved web pages browsed was found to be 8, while a page usually displays 10 results; at the same time, a large percentage of users (around 48%) only browsed one or two pages. For our study, 30 represents the normal user effort load, while 50 represents the maximum user effort load.

- ST_N classification threshold (see Figure 2): while ST_R was held constant at 0.3, ST_N was tested at 0.03, 0.04 and 0.05. The optimal threshold parameters for our collection were found to be ST_R=0.3 and ST_N=0.05.

Table 1: Retrieval system types (system name, notation, description)

- Standard (ST): Baseline system built on the vector space model, with LTC as the weighting scheme [14][15] and inner product as the similarity measure (a generic sketch of this baseline follows the table).
- RFA (RFA): The standard baseline system augmented with the RFA algorithm.
- Brauen (BR): The standard baseline system augmented with the Brauen algorithm as presented in [5], with the constants beta=30, delta=8 and gamma=0.225.
- BrauenBatch (BB): A hybrid version of the BR system that does not perform a document modification for each relevance assessment, but batches them for efficiency.
- BrauenSmooth (BS): A hybrid version of the BR system that applies exponential smoothing [18] after each document modification operation; the smoothing constant is set to 0.4.
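The baseline ST system is described only at the level of Table 1. The sketch below is a generic reading of an LTC-weighted vector space index with inner product matching (log TF times IDF, cosine-normalized per document); the exact LTC variant of [14][15] may differ in details, so treat this as an illustration rather than the authors' implementation.

```python
import math
from collections import Counter


def ltc_vectors(tokenized_docs):
    """Build LTC document vectors: (1 + log tf) * log(N / df), cosine-normalized."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))                       # document frequency per term
    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        raw = {t: (1.0 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def inner_product(vec_a, vec_b):
    """Similarity measure of the baseline: inner product of two sparse vectors."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    return sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
```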

The measurements were observed for two sets: training and evaluation. They were randomly generated from the TREC data. The testing procedure is inspired by [5], because it is one of the few concerning a document-oriented technique and because it emulates a real-user situation. The procedure works along the following guidelines:

- The pool of queries, previously described, is divided into two distinct query sets: the training set and the evaluation set. The training set contains about 80% of the queries from the pool; the remaining 20% form the evaluation set. The dividing process is repeated two times in a random cross-validation fashion (a minimal sketch of this split follows the list).

- The training set is used to train the system. Training the system means modifying the document vectors according to the RFA algorithm.

- The evaluation set is used to compute average retrieval performance measures.
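As referenced in the list above, the training/evaluation split can be sketched as a simple random holdout repeated twice; the exact splitting code used by the authors is not given, so the names and the seed here are illustrative.

```python
import random


def split_query_pool(query_pool, train_fraction=0.8, repetitions=2, seed=7):
    """Randomly divide the query pool into training (about 80%) and evaluation
    (about 20%) sets, repeating the division in a cross-validation fashion."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repetitions):
        shuffled = list(query_pool)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        splits.append((shuffled[:cut], shuffled[cut:]))  # (training set, evaluation set)
    return splits
```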

4.2 Measures

The following aspects of the evaluated systems are measured: dimensionality reduction, retrieval effectiveness and the quality of the RFA learning curve. To estimate the dimensionality reduction, we computed the average number of index terms per document. To estimate the retrieval effectiveness, we used standard measures such as average precision and recall, 11-point average precision [5], and the F-measure (with alpha=0.5, meaning precision is twice as important as recall). Last but not least, we also measured normalized precision and recall, in order to assess how well the retrieval system presents the documents to the user in the order of their relevance. To estimate the quality of the learning curve, we computed the average mean square error of the learning points with respect to the learned term-document weight.

5. RESULTS AND DISCUSSION

The threshold parameter configuration of all RFA systems for which we report results throughout this section is the following: ST_R=0.3 and ST_N=0.05. This configuration was found optimal after testing several other parameter configurations.

Table 2. Average Index Terms per Document

         | ST, BR, BB, BS (BSIZE N/A) | RFA (BSIZE=50)  | RFA (BSIZE=30)
Stem     | >127.17                    | 19.74 (-84.4%)  | 18.61 (-85.4%)
No Stem  | >140.78                    | 19.64 (-86%)    | 18.41 (-86.9%)

5.1 Dimensionality Reduction

The results presented in Table 2 show that RFA reduces the dimensionality of the affected document space to a great extent. The reduction is obtained while preserving or improving the retrieval effectiveness of the system augmented with RFA (see the results in Sections 5.2 and 5.3). The reduction goes as high as 86.9% in the case of No Stem/RFA/BSIZE=30. In Tables 2, 3, 4 and 5, BSIZE refers to the value of the independent variable "browsing batch size". There is no noticeable difference between the stem and no-stem conditions. The retrieval systems ST, BR, BB and BS are considered together since none of them possesses a dimensionality reduction technique. In fact, for the BR, BB and BS systems the dimensionality is even higher than the one presented in Table 2, because of the introduction of new terms. The first column of Table 2 shows the minimum dimensionality, which corresponds to the ST system.

5.2 Overall Retrieval Effectiveness

Tables 3(a) and (b) show the comparison of the retrieval effectiveness for all five systems evaluated in this study. In the two tables, P is precision, R is recall, F is the F-measure, PN is the normalized precision and RN is the normalized recall. All systems in Table 3(a), except ST, were evaluated with a browsing batch size of 50; all systems in Table 3(b), except ST, were evaluated with a browsing batch size of 30. For the BS system, the smoothing coefficient was set to 0.4. For the computation of the F-measure, the coefficient alphaF was set to 0.5, meaning that precision was considered to be twice as important as recall. The results are averaged over the two evaluation sets. Values marked with an asterisk (*) represent the best value for that specific measure among all systems in the same table column.

RFA systems consistently yield better overall precision P than all other systems (ST, BR, BB and BS): by more than 4% in the BSIZE=50 case and by more than 2% in the BSIZE=30 case. That is, RFA systems are able to perform better than all other systems while using a substantially lower-dimensionality document representation space. The precision improvement is reflected in similar increments in the precision/recall composite measure F and in the normalized precision PN.

Table 3(a). Retrieval effectiveness, BSIZE=50

Stem:
     | P      | R      | F0.5   | PN     | RN
ST   | 0.045  | 0.927  | 0.054  | 0.767  | 0.96
BR   | 0.045  | 0.94*  | 0.054  | 0.785  | 0.96
RFA  | 0.047* | 0.892  | 0.057* | 0.79*  | 0.967*
BB   | 0.045  | 0.94*  | 0.054  | 0.785  | 0.959
BS   | 0.045  | 0.94*  | 0.054  | 0.784  | 0.961

No Stem:
     | P      | R      | F0.5   | PN     | RN
ST   | 0.066  | 0.858  | 0.078  | 0.762  | 0.965
BR   | 0.066  | 0.882* | 0.078  | 0.786  | 0.964
RFA  | 0.069* | 0.814  | 0.082* | 0.797* | 0.97*
BB   | 0.066  | 0.882* | 0.078  | 0.785  | 0.963
BS   | 0.066  | 0.882* | 0.078  | 0.786  | 0.965

As expected the dimensionality reduction influences the overall recall R. For all non-RFA systems, on average, recall is 6% higher in the BSIZE=50 case and 5% higher in the BSIZE=30 case. The slight recall drop does not represent a major drawback, because retrieval systems managing large document collections contain large numbers of relevant documents per topic.

Table 3(b). Retrieval effectiveness, BSIZE=30

Stem:
     | P      | R      | F0.5   | PN     | RN
ST   | 0.045  | 0.927  | 0.054  | 0.767  | 0.96
BR   | 0.045  | 0.936* | 0.054  | 0.775  | 0.959
RFA  | 0.046* | 0.894  | 0.055* | 0.779* | 0.964*
BB   | 0.045  | 0.935  | 0.054  | 0.774  | 0.959
BS   | 0.045  | 0.936* | 0.054  | 0.775  | 0.96

No Stem:
     | P      | R      | F0.5   | PN     | RN
ST   | 0.066  | 0.858  | 0.078  | 0.762  | 0.965
BR   | 0.066  | 0.873* | 0.078  | 0.775  | 0.964
RFA  | 0.068* | 0.818  | 0.08*  | 0.783* | 0.969*
BB   | 0.066  | 0.873* | 0.078  | 0.773  | 0.963
BS   | 0.066  | 0.873* | 0.078  | 0.775  | 0.965

5.3 Precision at the First Three Recall Levels

The results presented throughout the previous section only show the overall picture of the retrieval performance of the augmented systems. This section presents the retrieval improvement that we observed within the first few recall levels (the usually browsed recall levels). Tables 4(a) and (b) show the results. In these tables, the column header RL stands for "recall level". The percentages in each cell represent the relative increase in precision with respect to the ST systems. Values marked with an asterisk (*) indicate the best performance for a recall level column; the recall level columns where the asterisk falls on the RFA row are those where RFA yielded the best results.

First, a very important observation is the substantial increase in precision for the first three recall levels, obtained when using any of the tested systems. The reason why the results presented in Section 5.2 reported just a slight increase in overall precision is that those results were averaged over the 10 recall levels. The first two or three recall levels are the most important from the searcher's point of view: they contain the documents usually browsed by searchers, and very few individuals will browse beyond these first three levels. That is why the substantial increase in precision for these levels is very valuable.

Table 4(a). Precision at first 3 recall levels, BSIZE=50

Stem:
     | RL-1             | RL-2             | RL-3
ST   | 0.58             | 0.55             | 0.518
BR   | 0.811* (+28.4%)  | 0.666 (+17.4%)   | 0.579 (+10.5%)
RFA  | 0.788 (+26.3%)   | 0.683* (+19.4%)  | 0.584* (+11.3%)
BB   | 0.8 (+27.5%)     | 0.671 (+18.0%)   | 0.584* (+11.3%)
BS   | 0.807 (+28.1%)   | 0.653 (+15.7%)   | 0.571 (+9.2%)

No Stem:
     | RL-1             | RL-2             | RL-3
ST   | 0.556            | 0.523            | 0.486
BR   | 0.813 (+31.6%)   | 0.672 (+22.1%)   | 0.581 (+16.3%)
RFA  | 0.781 (+28.8%)   | 0.689* (+24.0%)  | 0.61* (+20.3%)
BB   | 0.811 (+31.4%)   | 0.683 (+23.4%)   | 0.585 (+16.9%)
BS   | 0.816* (+31.8%)  | 0.662 (+20.9%)   | 0.575 (+15.4%)

Table 4(b). Precision at first 3 recall levels, BSIZE=30

Stem:
     | RL-1             | RL-2             | RL-3
ST   | 0.58             | 0.55             | 0.518
BR   | 0.731 (+20.6%)   | 0.593 (+7.2%)    | 0.592* (+12.5%)
RFA  | 0.748* (+22.4%)  | 0.601* (+8.4%)   | 0.522 (+0.2%)
BB   | 0.729 (+20.4%)   | 0.598 (+8.0%)    | 0.53 (+2.2%)
BS   | 0.721 (+19.5%)   | 0.586 (+6.1%)    | 0.519 (+0.1%)

No Stem:
     | RL-1             | RL-2             | RL-3
ST   | 0.556            | 0.523            | 0.486
BR   | 0.735 (+24.3%)   | 0.595 (+12.1%)   | 0.534* (+8.9%)
RFA  | 0.732 (+24.0%)   | 0.621* (+15.7%)  | 0.531 (+8.4%)
BB   | 0.738* (+24.6%)  | 0.603 (+13.2%)   | 0.529 (+8.1%)
BS   | 0.738* (+24.6%)  | 0.589 (+11.2%)   | 0.522 (+6.8%)

The second observation of our analysis emphasizes the ability of the RFA systems to perform better than, or at least as well as, the other systems. In Tables 4(a) and (b), the situations where the RFA system had the best performance are the recall level columns where the asterisk falls on the RFA row. For all the other cases, the performance of RFA was very close to the best performance. The critical advantage of the RFA systems is that they rely on a much lower-dimensional document representation space.

5.4 The Learning Function

This section presents the evaluation results for the document representation modification function, also called the learning function. The quality of a learning function relies on its ability to rapidly learn the weight of a term-document pair and to keep future alterations close to this value. In other words, a poor quality learning function will generate widely different values for the same weight at different points in time. We estimated the average mean square error (MSE) for all terms from about 50 randomly selected documents. The reference point for the MSE calculation was the last term weight, considered to be the learned term weight. Results averaged over the two evaluation sets are presented in Tables 5(a) and (b). The RFA algorithm provides a much smoother learning function: its MSE is considerably lower than that of any of the other systems. This means that term weights do not oscillate up and down with high amplitudes.
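A minimal sketch of this learning-curve MSE is given below, assuming the reference point is the last (learned) weight and that the final point itself is excluded from the average; the function names are hypothetical.

```python
def learning_curve_mse(weight_history):
    """Mean square error of a term-document weight trajectory with respect to the
    final learned weight (the last value recorded for that term-document pair)."""
    if len(weight_history) < 2:
        return 0.0
    learned = weight_history[-1]
    errors = [(w - learned) ** 2 for w in weight_history[:-1]]
    return sum(errors) / len(errors)


def average_mse(histories):
    """Average the MSE over all sampled (term, document) weight histories."""
    if not histories:
        return 0.0
    return sum(learning_curve_mse(h) for h in histories) / len(histories)
```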

Table 5(a). Learning function MSE, BSIZE=50

        | RFA     | BR      | BB      | BS
Stem    | 0.00065 | 0.00625 | 0.00566 | 0.01928
No Stem | 0.00061 | 0.00572 | 0.00458 | 0.02124

Table 5(b). Learning function MSE, BSIZE=30

        | RFA     | BR      | BB      | BS
Stem    | 0.00073 | 0.00627 | 0.00533 | 0.02172
No Stem | 0.0007  | 0.0058  | 0.00491 | 0.02204

Figure 3 presents the learning history of the term "government" with respect to one of the documents in which it occurs (in our collection: docid=3876), for both the BR and RFA systems. BR_ref and RFA_ref in Figure 3 represent the learned weight for the term "government"; this weight is used as the reference for calculating the MSE. It is noticeable that in the BR learning function case, weights rapidly converge in a positive or negative direction. The convergence direction is determined by the appearance pattern of the term "government" in the sequence of relevance assessments: if the term appears in many consecutive relevance assessments, its weight will rapidly increase. For this case MSE(BR)=0.026 while MSE(RFA)=0.009.

A direct effect of the learning functions' characteristics is visible in Tables 4(a) and (b). One can notice that RFA systems have better precision for the second and third recall levels, while BR systems have better precision for the first recall level. The first recall level comprises documents containing terms with the highest support throughout the relevance feedback corpus. Because the BR learning function rapidly converges in a positive direction for such terms, their weights will have values closer to 1 (the maximum). At the same time, the RFA learning function's positive increase is bounded by the support levels of the terms. Therefore, weights are usually lower for similar terms when compared to the BR system weights. That is why BR systems have better precision for the first recall level.

The second and third recall levels comprise documents containing terms with medium support throughout the relevance feedback corpus. In this case the BR learning function generates oscillating term weights (e.g., "government" in Figure 3), depending on how the terms happen to occur throughout the analyzed relevance assessments. The RFA learning function, driven by term support variation, generates steadier weights that smoothly converge to the final learned value. This analysis suggests an obvious possible improvement for the RFA learning function: a larger weight increase for terms with high support throughout the relevance assessments corpus.

Figure 3. Weight learning history for “government”

6. FUTURE RESEARCH

Experimental work to evaluate the quality of the index terms is already in progress. Future research efforts will focus on the following:

- Test a modified version of the RFA learning function that accounts for a more consistent weight increase for terms with high support throughout the relevance assessments corpus.

- Document representation modification coverage: in order for the modification technique to be fair to all documents about the topic described by a user query, it is necessary to modify not only the documents judged relevant by the user but also others that discuss the same topic.

- Development of an implicit/explicit relevance feedback capture model.

- Rare concept terms discovery: it is possible for certain terms to have low support within the whole relevance assessment pool but still be significant (characterized by high support) among the relevance assessments corresponding to a specific interest group. From a data-mining standpoint, these are terms with high levels of confidence. The technique will develop a measure to identify such terms and associate them with the corresponding interest group. This requires a real-user experimental system implementation.

7. CONCLUSIONS

The RFA algorithm proves to be a highly efficient document representation improvement technique. It is able to reduce the index dimensionality by up to 86% without affecting the retrieval effectiveness parameters. The RFA algorithm also provides a smooth and stable learning function. Last but not least, the RFA algorithm is scalable and can be used with retrieval systems managing large document collections.

8. REFERENCES

[1] Allan, J., Aslam, J., Belkin, N., Buckley, C., Callan, J., Croft, B., Dumais, S., Fuhr, N., Harman, D., Harper, D. J., Hiemstra, D., Hofmann, T., Hovy, E., Kraaij, W., Lafferty, J., Lavrenko, V., Lewis, D., Liddy, L., Manmatha, R., McCallum, A., Ponte, J., Prager, J., Radev, D., Resnik, P., Robertson, S., Rosenfeld, R., Roukos, S., Sanderson, M., Schwartz, R., Singhal, A., Smeaton, A., Turtle, H., Voorhees, E., Weischedel, R., Xu, J., Zhai, C. (2003). "Challenges in information retrieval and language modeling." SIGIR Forum 37(1), March 2003.

[2] Belew, R. K. (1989). "Adaptive information retrieval: using a connectionist representation to retrieve and learn about documents." Proceedings of the 12th annual international ACM SIGIR conference on Research and development in information retrieval: 11-20.

[3] Bodoff, D., Enache, D., Kambil, A., Simon, G., Yukhimets, A. (2001). "A Unified Maximum Likelihood Approach to Document Retrieval." Journal of the American Society for Information Science and Technology 52(10): 785-796.

[4] Brauen, T. L., Holt, R. C. (1968). "Document Indexing Based on Relevance Feedback." Report ISR-14 to the National Science Foundation, Section XI, Department of Computer Science, Cornell University, Ithaca, NY (June).

[5] Brauen, T. L. (1969). "Document Vector Modification." Scientific Report ISR-17 (September).

[6] Buckley, C. (2000). "The TREC-9 Query Track." TREC 9.

[7] Friedman, S. R., Maceyak, J. A., Weiss, S. F. (1967). "A Relevance Feedback System Based on Document Transformations." Scientific Report ISR-12 (June): Section X.

[8] Fuhr, N., Buckley, C. (1990). "Probabilistic document indexing from relevance feedback data." Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, Brussels, Belgium: 45-61.

[9] Fuhr, N., Buckley, C. (1991). "A Probabilistic Learning Approach for Document Indexing." ACM Transactions on Information Systems 9(3): 223-248.

[10] Ide, E. (1969). "Relevance Feedback in Automatic Document Retrieval System." Report ISR-15 to the National Science Foundation, Cornell University, Ithaca, NY (January): 81-85.

[11] Korfhage, R. R. (1997). "Information Storage and Retrieval." Wiley Computer Publishing: 221-232.

[12] Parker, L. M. P. (1983). "Towards a Theory of Document Learning." Journal of the American Society for Information Science 34(1): 16-21.

[13] Salton, G. (1971). "The SMART Retrieval System: Experiments in Automatic Document Processing." Prentice-Hall.

[14] Salton, G., Buckley, C. (1988). "Term-weighting approaches in automatic text retrieval." Information Processing and Management 24(5): 513-523.

[15] Savoy, J., Ndarugendamwo, M., Vrajitoru, D. (1996). "Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes." TREC 4 Proceedings (October): 537-548.

[16] Sebastiani, F. (2002). "Machine Learning in Automated Text Categorization." ACM Computing Surveys 34(1): 1-47.

[17] Spink, A., Wolfram, D., Jansen, B., Saracevic, T. "Searching the Web: The public and their queries." Journal of the American Society for Information Science 52(3): 226-234.

[18] URL Exponential Smoothing: http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc431.htm