Strength Pareto fitness assignment for generating expansion features

Ilyes Khennak and Habiba Drias

Laboratory for Research in Artificial Intelligence, Computer Science Department, USTHB, BP 32 El Alia, 16111 Bab Ezzouar, Algiers, Algeria
{ikhennak,hdrias}@usthb.dz

Abstract. Owing to the increasing use of ambiguous and imprecise words in expressing the user's information need, it has become necessary to expand the original query with additional terms that best capture the actual user intent. Selecting the appropriate words to be used as additional terms depends mainly on the degree of relatedness between a candidate expansion term and the query terms. In this paper, we propose two criteria to assess this degree of relatedness: (1) attribute more importance to terms occurring in the largest possible number of documents in which the query keywords appear; (2) assign more importance to terms lying at a short distance from the query terms within documents. We employ the strength Pareto fitness assignment in order to satisfy both criteria simultaneously. Our computational experiments on the OHSUMED test collection show that our approach significantly improves retrieval performance compared to the baseline.

Keywords: Information retrieval, query expansion, co-occurrence and proximity, multi-objective optimization, Pareto dominance

1 Introduction

Nowadays, the amount of data available on the Web is continuously growing and the number of new pages created is constantly increasing. Ranganathan [4] estimated that the amount of on-line data indexed by Google increased from 5 exabytes in 2002 to 280 exabytes in 2009. According to Zhu et al. [10], this amount is expected to double every 18 months. Ntoulas et al. [3] expressed this growth in terms of new pages and showed that their number increases by 8% a week. This rapid growth of the Web has led to the following consequences:

- The entry of new words into the Web, estimated by Williams and Zobel [9] at about one new word in every two hundred words. Studies by Eisenstein et al. [2] and Sun [8] have shown that this influx is mainly due to neologisms, first occurrences of rare personal names and place names, abbreviations, acronyms, emoticons, URLs and typographical errors.


- Users employ these new words to express their information need during the search process. Chen et al. [1] indicated in their study that more than 17% of query words are out-of-vocabulary (non-dictionary) words.

The difficulty of interpreting these new, ambiguous words often causes the search process to fail. For this reason, in this paper, we propose to expand the original user query with additional terms that best express the actual user intent. Selecting the appropriate words to be used as additional terms depends mainly on the degree of relatedness between a candidate expansion term and the query terms. In this work, we propose two criteria to assess the degree of relatedness: (1) co-occurrence, which attributes more importance to terms occurring in the largest possible number of documents in which the query keywords appear; (2) proximity, which assigns more importance to terms lying at a short distance from the query terms within the documents. We adopt the strength Pareto fitness assignment in order to satisfy both criteria simultaneously. We use two well-known Pseudo-Relevance Feedback techniques, Rocchio's method and the Robertson/Sparck Jones term-ranking function, as baselines for comparison, and we evaluate our approach on the OHSUMED test collection.

The main contributions of this paper are the following:

- The adoption of an external correlation measure to evaluate the co-occurrence of words with respect to the query features.
- The definition of an internal correlation measure to assess the proximity and closeness of words relative to the features of the query.

The remainder of this paper is organized as follows: Section 2 briefly reviews Pseudo-Relevance Feedback for query expansion and presents our proposed approach for generating expansion features. Section 3 reports the experimental results. Finally, Section 4 concludes the paper.

2 Pseudo-Relevance Feedback for Query Expansion

One of the most natural and successful techniques for improving the retrieval effectiveness of document ranking is to expand the original query with additional terms that best capture the actual user intent. Many approaches have been proposed to generate and extract these additional terms; Pseudo-Relevance Feedback is one of them. It uses the pseudo-relevant documents to select the most important terms to be used as expansion features. In its simplest version, the approach starts by performing an initial search on the original query using Okapi BM25.

2.1 Probabilistic Relevance Framework: Okapi BM25

The Probabilistic Relevance Framework is a formal framework for document retrieval that led to the development of one of the most successful text-retrieval algorithms, Okapi BM25.


The classic version of the Okapi BM25 term-weighting function, in which the weight wiBM25 is attributed to a given term ti in a document d, is obtained using the following formula:

$$w_i^{BM25} = \frac{tf}{k_1\left((1-b) + b\,\frac{dl}{avdl}\right) + tf}\; w_i^{RSJ} \qquad (1)$$

Where:
tf is the frequency of the term ti in a document d.
k1 and b are constants.
dl is the document length.
avdl is the average document length.
wiRSJ is the well-known Robertson/Sparck Jones weight [6]:

$$w_i^{RSJ} = \log \frac{(r_i + 0.5)(N - R - n_i + r_i + 0.5)}{(n_i - r_i + 0.5)(R - r_i + 0.5)} \qquad (2)$$

Where:
N is the number of documents in the whole collection.
ni is the number of documents in the collection containing ti.
R is the number of documents judged relevant.
ri is the number of judged relevant documents containing ti.

The RSJ weight can be used with or without relevance information. In the absence of relevance information (the more usual scenario), the weight is reduced to a form of classical idf:

$$w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} \qquad (3)$$

The final BM25 term-weighting function is therefore given by:

$$w_i^{BM25} = \frac{tf}{k_1\left((1-b) + b\,\frac{dl}{avdl}\right) + tf}\; \log \frac{N - n_i + 0.5}{n_i + 0.5} \qquad (4)$$

Concerning the internal parameters, Robertson and Zaragoza [5] have indicated that published versions of Okapi BM25 are based on specific values assigned to k1 and b: k1 = 2, b = 0.5. As part of the indexing process, an inverted file is created containing the weight wiBM25 of each term ti in each document d. The similarity score between a document d and a query q is then computed as follows:

$$Score^{BM25}(d, q) = \sum_{t_i \in q} w_i^{BM25} \qquad (5)$$

During the querying process, the relevant documents are selected and ranked using this similarity score.
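To make the weighting and scoring steps concrete, here is a minimal Python sketch of formulas (3), (4) and (5). It is only an illustration: the dictionaries doc_term_freqs (per-document term frequencies) and doc_freqs (the document frequencies ni) are hypothetical stand-ins for the inverted file, and the defaults k1 = 2, b = 0.5 follow the values quoted above.

import math

def bm25_weight(tf, n_i, N, dl, avdl, k1=2.0, b=0.5):
    # BM25 weight of one term in one document (formula 4), using the idf form of formula (3).
    idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
    return (tf / (k1 * ((1 - b) + b * dl / avdl) + tf)) * idf

def bm25_score(query_terms, doc_id, doc_term_freqs, doc_freqs, N, avdl):
    # Similarity score between a document and a query (formula 5):
    # the sum of the BM25 weights of the query terms present in the document.
    dl = sum(doc_term_freqs[doc_id].values())
    score = 0.0
    for t in query_terms:
        tf = doc_term_freqs[doc_id].get(t, 0)
        if tf > 0:
            score += bm25_weight(tf, doc_freqs[t], N, dl, avdl)
    return score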

2.2 Term-scoring functions for retrieval feedback

The retrieval feedback technique starts by performing an initial search on the original query using the BM25 term-weighting and the document-scoring function above (formula 5).


It then assumes the best-ranked documents to be relevant, assigns a score to each term of the top retrieved documents using a term-scoring function, and sorts the terms on the basis of their scores. One of the best-known term-scoring functions is the Robertson/Sparck Jones term-ranking function (formula 2). Another well-known term-scoring function is the Rocchio weight [7]:

$$w_i^{Rocchio} = \sum_{d \in R} w_i^{BM25} \qquad (6)$$

Where R is the set of pseudo-relevant documents. The original query is then expanded by adding the top-ranked terms and re-submitted using the BM25 similarity score (formula 5), in order to obtain more relevant results.
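The feedback loop of this subsection can be sketched as follows, reusing the hypothetical bm25_weight helper and index dictionaries from the previous listing; the Rocchio score of formula (6) is simply the sum of a term's BM25 weights over the pseudo-relevant set R.

from collections import defaultdict

def rocchio_term_scores(pseudo_relevant, doc_term_freqs, doc_freqs, N, avdl):
    # Rocchio weight of every term occurring in the pseudo-relevant documents (formula 6).
    scores = defaultdict(float)
    for d in pseudo_relevant:
        dl = sum(doc_term_freqs[d].values())
        for t, tf in doc_term_freqs[d].items():
            scores[t] += bm25_weight(tf, doc_freqs[t], N, dl, avdl)
    return scores

# Usage sketch: rank the returned scores and append the best unseen terms to the query, e.g.
# ranked = sorted(scores, key=scores.get, reverse=True)
# expanded_query = query_terms + [t for t in ranked if t not in query_terms][:10]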

2.3 Pareto dominance for selecting expansion features

The main goal of the proposed method is to return only the documents that are relevant to the given query. For this reason, we introduce the concept of co-occurrence and closeness during the search process. This concept is based on recovering, for each query term qi, the documents in which it appears, and then assessing the relevance of the terms contained in these documents with respect to the query term qi on the basis of:

1. Co-occurrence, which gives value to words that appear in the largest possible number of those documents.
2. Proximity and closeness, which gives value to words whose distance from the query term qi within a document, measured in number of words, is small.

Before exploring the concepts of co-occurrence and proximity, we characterize each term ti in the vocabulary VR of the top-ranked documents (denoted by R) by a vector Ti of length |R|, where R corresponds to the set of pseudo-relevant documents returned by formula (5) and the k-th element of Ti corresponds to the position(s) of ti in the pseudo-relevant document dk:

$$T_i = \langle pos_1, pos_2, \ldots, pos_{|R|} \rangle$$

It is important to note that the value of posk can be 0, in the case where ti does not occur in dk, or a vector containing all the positions of ti in dk.
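A small sketch of this position-vector representation: tokenized_docs, the list of token sequences of the pseudo-relevant documents of R, is an assumed input produced by whatever tokenizer the indexing step uses.

from collections import defaultdict

def build_position_vectors(tokenized_docs):
    # For every term t_i of the vocabulary V_R, build T_i = <pos_1, ..., pos_|R|>:
    # pos_k is the list of positions of t_i in document d_k, or 0 when t_i does not occur in d_k.
    vocab = {t for doc in tokenized_docs for t in doc}
    T = {t: [0] * len(tokenized_docs) for t in vocab}
    for k, doc in enumerate(tokenized_docs):
        positions = defaultdict(list)
        for pos, t in enumerate(doc):
            positions[t].append(pos)
        for t, pos_list in positions.items():
            T[t][k] = pos_list
    return T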


In the first step, we look for the terms that often appear together with the query terms. Finding these words is done by assigning more importance to the words that occur in the largest number of documents in which each term of the query appears. We interpret this importance via the measurement of the external correlation of each term ti of VR to each term tj(q) of the query q. This correlation computes the rate of appearance of ti with tj(q) in the set of documents R. The external correlation of ti to tj(q) is significant when ti appears in the largest number of documents in which tj(q) occurs, and vice versa. Based on this interpretation, the external correlation ext of ti to tj(q) is calculated using Good-Turing discounting, as follows:

$$ext(t_i, t_{j(q)}) = P(t_i \mid t_{j(q)}) = \frac{1}{C(t_{j(q)})}\left[\left(C(t_i, t_{j(q)}) + 1\right)\frac{N_{C+1}}{N_C}\right] \qquad (7)$$

Where:
P(ti | tj(q)) is the Good-Turing probability that ti appears with tj(q) in R.
C(tj(q)) is the number of times that Tj(q)[k] is non-zero, for k = 1, ..., |R| (i.e., the number of documents of R in which tj(q) occurs).
C(ti, tj(q)) is the number of times that both Tj(q)[k] and Ti[k] are non-zero, for k = 1, ..., |R| (i.e., the number of documents of R in which ti and tj(q) occur together).
NC+1 is the number of pairs of terms which include tj(q) and occur C + 1 times in R.
NC is the number of pairs of terms which include tj(q) and occur C times in R.

After computing the rate of appearance of ti with each query term tj(q), the overall external correlation between ti and the whole query q is described in terms of a vector containing all possible ext between ti and each query term tj(q):

$$ext(t_i, q) = \langle ext(t_i, t_{1(q)}), ext(t_i, t_{2(q)}), \ldots, ext(t_i, t_{|q|(q)}) \rangle$$

The cosine similarity measure is then used to evaluate the quality of each vector ext(ti, q) with respect to the best vector ext(ti*, q), where each of its elements ext(ti*, tj(q)) represents the highest external correlation between a given term ti and tj(q). The following function fext(ti) gives the cosine similarity score between ext(ti*, q) and ext(ti, q):

$$f_{ext}(t_i) = \frac{\sum_{j=1}^{|q|} ext(t_i^{*}, t_{j(q)}) \cdot ext(t_i, t_{j(q)})}{\sqrt{\sum_{j=1}^{|q|}\left[ext(t_i^{*}, t_{j(q)})\right]^2}\,\sqrt{\sum_{j=1}^{|q|}\left[ext(t_i, t_{j(q)})\right]^2}} \qquad (8)$$
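The external correlation and fext can be sketched as follows over the position vectors T built above. The paper does not say how the frequencies-of-frequencies NC and NC+1 behave when a count class is empty, so the back-off to 1 used here is an assumption; the cosine_to_best helper also serves for fint (formula 14).

import math
from collections import Counter

def external_correlations(T, query_term):
    # ext(t_i, t_j(q)) for every t_i of V_R (formula 7), with Good-Turing discounting
    # of the co-occurrence counts C(t_i, t_j(q)).
    q_vec = T[query_term]
    c_q = sum(1 for pos in q_vec if pos != 0)                      # C(t_j(q))
    co = {t: sum(1 for k, pos in enumerate(vec) if pos != 0 and q_vec[k] != 0)
          for t, vec in T.items()}                                 # C(t_i, t_j(q))
    n_of_count = Counter(co.values())                              # N_C: number of pairs occurring C times
    ext = {}
    for t, c in co.items():
        n_c = n_of_count.get(c, 1)
        n_c1 = n_of_count.get(c + 1, 1)                            # assumed back-off when the class is empty
        ext[t] = ((c + 1) * n_c1 / n_c) / c_q if c_q else 0.0
    return ext

def cosine_to_best(vector, best_vector):
    # Cosine similarity of a correlation vector to the best vector (formulas 8 and 14).
    dot = sum(a * b for a, b in zip(vector, best_vector))
    norm = math.sqrt(sum(a * a for a in vector)) * math.sqrt(sum(b * b for b in best_vector))
    return dot / norm if norm else 0.0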

In the second step, we look for the terms that are often neighbors of the query terms. Therefore, we attribute more importance to terms lying at a short distance from the query keywords. We interpret this importance via the measurement of the internal correlation between each term ti of VR and each term tj(q) of the query q. This correlation measures how close ti and tj(q) are within a given document dk in terms of the number of words separating them. The closer ti is to tj(q), the greater its internal correlation. We use the well-known kernel functions to measure the internal correlation:

Gaussian kernel:
$$k(i, j) = \exp\left[\frac{-(i - j)^2}{2\sigma^2}\right] \qquad (9)$$


Triangle kernel:
$$k(i, j) = \begin{cases} 1 - \dfrac{|i - j|}{\sigma} & \text{if } |i - j| \le \sigma \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

Cosine kernel:
$$k(i, j) = \begin{cases} \dfrac{1}{2}\left[1 + \cos\left(\dfrac{|i - j| \cdot \pi}{\sigma}\right)\right] & \text{if } |i - j| \le \sigma \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

Where σ is a parameter to be tuned. The internal correlation int between ti and tj(q) within a given document dk is then calculated as follows:

$$int(t_i, t_{j(q)})(d_k) = K\left(T_i[k], T_{j(q)}[k]\right) \qquad (12)$$

The average internal correlation between ti and tj(q) over the whole R is then determined as follows:

$$int(t_i, t_{j(q)}) = \frac{1}{C(t_{j(q)})} \sum_{d_k \in R} int(t_i, t_{j(q)})(d_k) \qquad (13)$$

The overall internal correlation between ti and the whole query q is described in terms of a vector containing all possible int between ti and each query term tj(q):

$$int(t_i, q) = \langle int(t_i, t_{1(q)}), int(t_i, t_{2(q)}), \ldots, int(t_i, t_{|q|(q)}) \rangle$$

The cosine similarity measure is then used to evaluate the quality of each vector int(ti, q) with respect to the best vector int(ti*, q), where each of its elements int(ti*, tj(q)) represents the highest internal correlation between a given term ti and tj(q). The following function fint(ti) gives the cosine similarity score between int(ti*, q) and int(ti, q):

$$f_{int}(t_i) = \frac{\sum_{j=1}^{|q|} int(t_i^{*}, t_{j(q)}) \cdot int(t_i, t_{j(q)})}{\sqrt{\sum_{j=1}^{|q|}\left[int(t_i^{*}, t_{j(q)})\right]^2}\,\sqrt{\sum_{j=1}^{|q|}\left[int(t_i, t_{j(q)})\right]^2}} \qquad (14)$$
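The kernels and the internal correlation (formulas 9 to 13) can be sketched as follows, again over the position vectors T. The default σ = 5 is only an illustrative value, and taking the closest pair of positions when a term occurs several times in a document is an assumption, since formula (12) leaves the aggregation over multiple positions open.

import math

def gaussian_kernel(i, j, sigma=5.0):
    # Formula (9).
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))

def triangle_kernel(i, j, sigma=5.0):
    # Formula (10).
    d = abs(i - j)
    return 1 - d / sigma if d <= sigma else 0.0

def cosine_kernel(i, j, sigma=5.0):
    # Formula (11).
    d = abs(i - j)
    return 0.5 * (1 + math.cos(d * math.pi / sigma)) if d <= sigma else 0.0

def internal_correlation(T, term, query_term, kernel=gaussian_kernel):
    # Average internal correlation of `term` with `query_term` over R (formulas 12 and 13),
    # using the closest pair of positions within each document where both terms occur.
    t_vec, q_vec = T[term], T[query_term]
    c_q = sum(1 for pos in q_vec if pos != 0)                      # C(t_j(q))
    total = 0.0
    for k in range(len(q_vec)):
        if t_vec[k] != 0 and q_vec[k] != 0:
            total += max(kernel(i, j) for i in t_vec[k] for j in q_vec[k])
    return total / c_q if c_q else 0.0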

Finally, in order to select the appropriate words to be used as expansion features, we adopt the well-known concept of Pareto dominance. Pareto dominance was introduced to solve multi-objective optimization problems, that is, problems of finding a solution that satisfies an objective vector whose elements represent the objective functions.


The solution to such a problem can be described in terms of a decision vector (x1, x2, x3, ..., xn) in the decision space X. A function f : X → Y evaluates the quality of a given solution by assigning it an objective vector (y1, y2, y3, ..., yK), with K > 1, in the objective space Y. We say that a decision vector x1 is better than another decision vector x2 (x1 > x2) if the objective vector y1 dominates the objective vector y2 (y1 > y2), where y1 = f(x1) and y2 = f(x2). The vector y1 is said to dominate the vector y2 if no component of y1 is smaller than the corresponding component of y2, and at least one component of y1 is greater than the corresponding component of y2. One of the most refined dominance-based fitness assignments is the strength Pareto fitness assignment (SPEA2). It assigns each solution xi a strength value S(xi) representing the number of solutions it dominates:

$$S(x_i) = \left|\{x_j \mid x_j \in X \wedge x_i > x_j\}\right|$$

Where |.| is the cardinality of a set and > is the Pareto dominance relation (xi > xj if the objective vector yi assigned to xi dominates the objective vector yj assigned to xj). On the basis of the S values, the raw fitness R(xi) of a solution xi is calculated:

$$R(x_i) = \sum_{x_j \in X,\; x_j > x_i} S(x_j) \qquad (15)$$

It is important to note that the fitness is to be minimized here, i.e., R(xi) = 0 corresponds to a non-dominated individual.

By analogy, a candidate expansion term ti of VR can be described in terms of a decision vector Ti = <pos1, pos2, ..., pos|R|> in the decision space X. A function f : X → Y evaluates the quality of a given candidate term ti by assigning it an objective vector (fext(ti), fint(ti)) in the objective space Y. We say that a candidate term ti is better than another candidate term tj (ti > tj) if the objective vector yi dominates the objective vector yj (yi > yj), where yi = (fext(ti), fint(ti)) and yj = (fext(tj), fint(tj)). The raw fitness R(ti) of a given candidate expansion feature ti is calculated as follows:

$$R(t_i) = \sum_{t_j \in V_R,\; t_j > t_i} S(t_j) \qquad (16)$$
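A compact sketch of the strength and raw-fitness computation (formulas 15 and 16) applied to candidate terms; objectives is assumed to map each candidate term to its objective pair (fext, fint).

def dominates(y1, y2):
    # y1 dominates y2: no component is smaller and at least one component is greater.
    return all(a >= b for a, b in zip(y1, y2)) and any(a > b for a, b in zip(y1, y2))

def raw_fitness(objectives):
    # Strength S and raw fitness R of the strength Pareto assignment (formulas 15 and 16).
    # R(t_i) = 0 marks a non-dominated term; lower values are better.
    strength = {t: sum(dominates(y, objectives[u]) for u in objectives if u != t)
                for t, y in objectives.items()}
    return {t: sum(strength[u] for u, y in objectives.items() if u != t and dominates(y, objectives[t]))
            for t in objectives}

# Candidate terms are then sorted by increasing raw fitness and the best ones are added to the query.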

Using formula (16), we select the best terms to be used as expansion features. The terms are then sorted on the basis of their raw fitness R(ti ) and the top ranked ones are added to the original query q. Based on the BM25 similarity score, presented in Section 2, we retrieve the relevant documents, as follows:

$$Score^{BM25}(d, q') = \sum_{t_i \in q} w_i^{BM25} + \frac{1}{2} \sum_{t_i \in (q' - q)} w_i^{BM25} \cdot \left[f_{ext}(t_i) + f_{int}(t_i)\right] \qquad (17)$$
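A sketch of this final document scoring of formula (17): bm25_w is an assumed accessor over the inverted file, and f_ext / f_int map each selected expansion term to its two objective values.

def expanded_score(doc_id, query_terms, expansion_terms, bm25_w, f_ext, f_int):
    # Score of a document against the expanded query q' (formula 17);
    # expansion_terms corresponds to q' - q.
    score = sum(bm25_w(t, doc_id) for t in query_terms)
    score += 0.5 * sum(bm25_w(t, doc_id) * (f_ext[t] + f_int[t]) for t in expansion_terms)
    return score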

3 Experiments

In order to evaluate the effectiveness of the proposed approach, we carried out a set of experiments. First, we describe the dataset, the software, and the effectiveness measures used. Then, we present the experimental results.

3.1 Dataset, software and effectiveness measures

Extensive experiments were performed on the OHSUMED test collection. The collection consists of 348,566 references from MEDLINE, comprising titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The OHSUMED collection also contains a set of queries and relevance judgments. We divided the OHSUMED collection into 6 sub-collections. Each sub-collection is defined by a set of documents, queries, and a list of relevant documents. Table 1 summarizes the characteristics of each sub-collection in terms of the number of documents (#docs) it contains, the number of queries (Nb Queries), the average query length in number of words (Avr Query Len), and the average number of relevant documents per query (Avr Rel Doc). All the experiments were performed on a Sony Vaio workstation with an Intel i3-2330M 2.20 GHz processor and 4 GB of RAM, running Ubuntu GNU/Linux 12.04. Precision and Mean Average Precision (MAP) were used as measures to evaluate the effectiveness of the systems.
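For reference, the two effectiveness measures can be sketched as follows; ranked_docs is the ranked list returned for one query and relevant the set of documents judged relevant for it.

def precision_at_k(ranked_docs, relevant, k):
    # P@k: fraction of the k best-ranked documents that are relevant.
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

def average_precision(ranked_docs, relevant):
    # Average of the precision values at the ranks where relevant documents are retrieved;
    # MAP is the mean of this value over all the queries of a sub-collection.
    hits, ap = 0, 0.0
    for rank, d in enumerate(ranked_docs, start=1):
        if d in relevant:
            hits += 1
            ap += hits / rank
    return ap / len(relevant) if relevant else 0.0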

Table 1: Some statistics on the OHSUMED sub-collections and queries.

#docs           50000   100000  150000  200000  250000  300000
Nb Queries      82      91      95      97      99      101
Avr Rel Doc     4.23    7       10.94   13.78   15.5    19.24
Avr Query Len   6.79    6.12    5.68    5.74    5.62    5.51

3.2 Results

In the first set of experiments, we evaluated and compared the results of the suggested approach (EXT/INT), which uses both the external and internal correlations, with those of RSJ (the Robertson/Sparck Jones algorithm for relevance feedback) and Rocchio (the Rocchio approach for relevance feedback); we computed the precision values after retrieving 5 (P@5) and 10 (P@10) documents. Figure 1 shows the precision values for the EXT/INT, RSJ and Rocchio techniques. From Figure 1a, we can see a clear superiority of the suggested approach EXT/INT compared with Rocchio, and this superiority is even more pronounced in comparison to the RSJ technique. It is clearly seen from Figure 1a that the proposed approach improved the search results, after retrieving 5 documents, on all the sub-collections compared with the Pseudo-Relevance Feedback methods; e.g., on the 300000 sub-collection, EXT/INT using the Cosine kernel shows an improvement of 42.08% over RSJ and 22.59% over Rocchio.

[Figure 1 shows two plots of Precision versus Collection Size (thousand documents), comparing EXT/INT (Gaussian), EXT/INT (Triangle), EXT/INT (Cosine), RSJ and Rocchio: (a) precision after retrieving 5 documents; (b) precision after retrieving 10 documents.]

Fig. 1: Effectiveness comparison of the EXT/INT approach to the RSJ and Rocchio methods in terms of precision.

Despite the superiority shown in Figure 1b, the results were not as marked as those observed in Figure 1a; nevertheless, the precision values of the proposed approach remained the best on all the sub-collections. In the next phase of testing, we computed the Mean Average Precision score to evaluate the retrieval performance of the EXT/INT and the relevance feedback methods (Table 2), and we used the two-tailed t-test to measure the statistical significance of the differences between the MAP values. Table 2 shows a clear advantage of the EXT/INT approach over the RSJ and Rocchio approaches on all the sub-collections. The improvements over RSJ and Rocchio are statistically significant in 16 out of 18 cases (p < 0.05), and all 18 improvements are positive; e.g., on the 300000 sub-collection, EXT/INT (Gaussian) outperforms RSJ by 20.31% (p = 0.0237) and Rocchio by 21.31% (p = 0.0181), while EXT/INT (Triangle) and EXT/INT (Cosine) outperform RSJ by 19.57% (p = 0.0339) and 19.95% (p = 0.0339), and Rocchio by 20.56% (p = 0.0295) and 20.56% (p = 0.0295), respectively.

4 Conclusion

In this work, we proposed two criteria to assess the degree of relatedness between a candidate expansion term and the query keywords: co-occurrence and proximity. We adopted the strength Pareto fitness assignment in order to satisfy both criteria simultaneously. We thoroughly tested our approach on the OHSUMED test collection. The experimental results show that the proposed approach EXT/INT succeeded in improving the ranking of relevant documents and yielded a substantial enhancement in precision and mean average precision.


Table 2: Mean Average Precision (MAP) results of the EXT/INT, RSJ and Rocchio methods. The rates are the improvements of EXT/INT (Gaussian / Triangle / Cosine) over each baseline.

#docs     EXT/INT MAP (Gaussian / Triangle / Cosine)   RSJ MAP   Rate over RSJ                     Rocchio MAP   Rate over Rocchio
#100000   0.1823 / 0.1781 / 0.1791                     0.1253    +45.49%* / +42.14%* / +42.94%*    0.1524        +19.62%* / +16.86% / +17.52%
#200000   0.1663 / 0.1684 / 0.1686                     0.1174    +41.65%* / +43.44%* / +43.61%*    0.1346        +23.55%* / +25.11%* / +25.26%*
#300000   0.1617 / 0.1607 / 0.1607                     0.1344    +20.31%* / +19.57%* / +19.57%*    0.1333        +21.31%* / +20.56%* / +20.56%*

The * indicates that the difference is statistically significant (p-value < 0.05, two-tailed t-test).

References

1. Chen, Q., Li, M., Zhou, M.: Improving query spelling correction using web search results. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07), pp. 181-189. ACL, Stroudsburg, PA, USA (2007)
2. Eisenstein, J., O'Connor, B., Smith, N.A., Xing, E.P.: Mapping the geographical diffusion of new words. In: Workshop on Social Network and Social Media Analysis: Methods, Models and Applications, NIPS'12 (2012)
3. Ntoulas, A., Cho, J., Olston, C.: What's new on the web? The evolution of the web from a search engine perspective. In: Proceedings of the 13th International Conference on World Wide Web (WWW'04), pp. 1-12. ACM, New York, NY, USA (2004)
4. Ranganathan, P.: From microprocessors to nanostores: Rethinking data-centric systems. IEEE Computer 44(1), 39-48 (2011)
5. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3(4), 333-389 (2009)
6. Robertson, S.E., Jones, K.S.: Relevance weighting of search terms. Journal of the American Society for Information Science 27(3), 129-146 (1976)
7. Rocchio, J.J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313-323. Prentice-Hall, Englewood Cliffs, NJ (1971)
8. Sun, H.M.: A study of the features of internet English from the linguistic perspective. Studies in Literature and Language 1(7), 98-103 (2010)
9. Williams, H.E., Zobel, J.: Searchable words on the web. International Journal on Digital Libraries 5(2), 99-105 (2005)
10. Zhu, Y., Zhong, N., Xiong, Y.: Data explosion, data nature and dataology. In: Proceedings of the 2009 International Conference on Brain Informatics (BI'09), pp. 147-158. Springer-Verlag, Berlin, Heidelberg (2009)