Term Proximity and Data Mining Techniques for Information Retrieval Systems

Ilyes Khennak and Habiba Drias

Computer Science Department, USTHB, Laboratory for Research in Artificial Intelligence (LRIA), Algiers, Algeria
[email protected],
[email protected]
Abstract. Term clustering based on a proximity measure is a strategy for efficiently computing document relevance. Unlike recent studies that investigated term proximity only to improve the matching function between document and query, in this work the whole information retrieval process is revised, covering both the indexing and the interrogation steps. Accordingly, an Extended Inverted File is built by exploiting the term proximity concept and data mining techniques. Three interrogation approaches are then proposed: the first uses query expansion, the second is based on the Extended Inverted File and the last hybridizes both retrieval methods. Experiments carried out on OHSUMED demonstrate the effectiveness and efficiency of our approaches compared to the traditional one.

Keywords: information retrieval, term proximity, word association, fuzzy clustering.
1 Introduction
The progress of information technologies, such as text editors, has led to the daily production of an enormous mass of information. At the same time, the evolution of electronic media has made it possible to store this vast amount of information. With the development of electronic communication systems and the growth of the information available online, it has become important to help users quickly access the information they need. Traditional Information Retrieval (IR) offers tools for users to find the information they need in data sets of reasonable size. It consists in selecting, from a volume of information, the information relevant to a user query. Traditional IR systems interpret documents as sets of words without meaning. They are therefore only able to find documents described by the query words considered separately; in other terms, they treat IR solely on the basis of the morphological aspect of the document text. Intuitively, when further considerations such as semantic features are taken into account, the information retrieval system is expected to work more effectively and faster. Of course, these concerns are more complicated to address and constitute important directions of research.
1.1 Related Works
To overcome the limits of traditional information retrieval, new approaches taking into account word senses, proximity and association have been proposed in the literature [3], [5], [9], [11] and [12]. They use semantic resources and statistical techniques to improve the performance of information retrieval systems. The proximity of query terms is generally intuitive for users, and in recent years a few works aiming to improve document ranking with this heuristic have been published. Most of these papers focus on designing term proximity measures to integrate into the matching process between the document and the query in order to yield better estimates of document relevance. In papers [3], [9] and [11], the authors propose term proximity measures that they combine with the classical term weighting function to better reflect the importance of the terms in a document. All these studies claim the benefit of integrating term proximity scoring into term weighting, based on the success of their experiments performed on known benchmarks. In papers [5] and [12], the authors present a theoretical study on modeling term proximity and show that the use of term proximity considerably enhances system effectiveness.

1.2 Assigned Objectives
Term proximity usually denotes the minimum number of words separating two terms that appear in the same document. Naturally, the smaller this number, the stronger the proximity. When it is equal to one, that is, when the two words are adjacent as in "information retrieval", we speak of word association. In this study we examine the concept of proximity, since it is more general than the concept of association. Moreover, one term may have high proximity with several other words and in many documents; for instance, the word "information" may be associated with the words "retrieval", "science" and "technology". In the approach we propose, and relying on [4], the whole information retrieval process is revised in both the indexing and the interrogation phases, which makes it original relative to the works reported in the literature. The indexing process is designed around the newly added feature, term proximity, and is developed in such a way as to help the interrogation phase answer user queries quickly. To our knowledge, no study has proposed such an approach so far.
2 Traditional Information Retrieval
Traditional Information Retrieval has been widely investigated because of its numerous and important applications. It is well covered in many books such as [1], [2], [8] and [6]. In this section, the major concepts used in the literature are presented. The set of documents on which the search is performed according to the user request forms a Collection. A Document can be a text, a sound, an image or a video; we therefore call a document any unit that can constitute a response to an information need. The latter is introduced to the machine through a query. The Information Retrieval System aims at finding, in the mass of available information, the relevant documents that satisfy the query. A preliminary phase of document analysis is needed and corresponds in practice to the Indexing process. The formulation of the query, the search in the collection and the ranking of documents using a matching function define the Interrogation phase. In traditional IRSs, the document is considered as a set of words represented by descriptors. The only information exploited about these words is the frequency of their occurrences in the document. The content of the query is described by a set of words, as for documents.

2.1 Traditional Indexing
The indexing process starts by recognizing words in the text and eliminating all superfluous entities such as punctuation marks and spaces. Once the words are extracted, they are normalized using the Stemming operation [2], a morphological process that finds a more general form of the words. Finally, the terms of the dictionary are coded by identification numbers, and these identifiers are stored in a file for use during the search. The process then assigns to each term a weight that represents its importance in the document where it appears. To weight the terms, the measure tf*idf is usually used, where tf measures the importance of the term in the document and is computed using formula (1), the Okapi metric [10]. The variable occur_ij is the number of occurrences of term i in document j, and k is introduced to take the length of the document into account. In practice, k is calculated using formula (2), where length_doc_j is the length of document j and average_length_doc is the average document length, so that formula (3) is used to compute tf. idf measures the importance of the term in the collection and is usually computed using formula (4), where n is the number of documents and n_i the number of documents containing term i.

tf = \frac{occur_{ij}}{occur_{ij} + k}    (1)

k = 0.5 + 1.5 \times \frac{length\_doc_j}{average\_length\_doc}    (2)

tf = \frac{occur_{ij}}{occur_{ij} + 0.5 + 1.5 \times \frac{length\_doc_j}{average\_length\_doc}}    (3)

idf = \log\left(\frac{n}{n_i}\right)    (4)
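As an illustration, here is a minimal Python sketch of this weighting scheme; the function names (tf_weight, idf_weight) and the figures in the usage example are ours, not from the paper:

```python
import math

def tf_weight(occur_ij, doc_length, avg_doc_length):
    """Okapi-style term frequency of formulas (1)-(3)."""
    k = 0.5 + 1.5 * (doc_length / avg_doc_length)   # formula (2)
    return occur_ij / (occur_ij + k)                # formulas (1) and (3)

def idf_weight(n, n_i):
    """Inverse document frequency of formula (4)."""
    return math.log(n / n_i)

# Hypothetical example: a term occurring 3 times in a 120-word document,
# with an average document length of 100 words, in a collection of
# 348566 documents of which 1200 contain the term.
weight = tf_weight(3, 120, 100) * idf_weight(348566, 1200)
```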
Once the terms are weighted, the process associates with each term a list of the documents in which it appears, each followed by the corresponding weight. The resulting index is saved in a file called the Inverted File, as sketched below.
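A minimal sketch of this construction, assuming documents and weights are already available as Python dictionaries (the names build_inverted_file, docs and weights are ours):

```python
from collections import defaultdict

def build_inverted_file(docs, weights):
    """Associate each term with a list of (document id, weight) pairs.

    docs: {doc_id: set of terms}; weights: {(term, doc_id): tf*idf weight}.
    """
    inverted = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term].append((doc_id, weights[(term, doc_id)]))
    return dict(inverted)
```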
2.2 Traditional Interrogation
As mentioned earlier, interrogation consists in finding the documents that are relevant to the query. The literature offers a variety of models to represent documents and queries, among which the vector model is the most frequently used. In this context, each document d is described by a vector of term weights d = (w_1, w_2, ..., w_m), where w_i is the weight of term i in document d. As with documents, the query is represented by a vector of weights: a query q is modeled as q = (v_1, v_2, ..., v_m), where v_i is the weight of term i in q for i = 1, ..., m, m being the number of terms in the dictionary. The matching function determines the relevance of a document to a query and allows ranking the documents in order of assumed relevance. To assess the degree of relevance of a document d to a query q, the scalar product of formula (5) calculates the retrieval status value RSV(q, d).

RSV(q, d) = \sum_{i=1}^{m} v_i \times w_i    (5)
RSV(q,d) represents the degree of similarity between the document d and the query q. In order to evaluate the query, all documents containing at least one term of the query are selected and their relevance to the query is calculated. The documents are then ordered according to their degree of relevance.
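A sketch of this evaluation step in Python, assuming the inverted file built above and query vectors stored as term-to-weight dictionaries (the names rsv and evaluate are ours):

```python
from collections import defaultdict

def rsv(query_weights, doc_weights):
    """Retrieval status value of formula (5): scalar product of the vectors."""
    return sum(v * doc_weights.get(term, 0.0) for term, v in query_weights.items())

def evaluate(query_weights, inverted):
    """Select documents sharing at least one query term, then rank them by RSV."""
    doc_vectors = defaultdict(dict)
    for term in query_weights:
        for doc_id, w in inverted.get(term, []):
            doc_vectors[doc_id][term] = w
    return sorted(doc_vectors,
                  key=lambda d: rsv(query_weights, doc_vectors[d]),
                  reverse=True)
```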
3 Indexing with Term Proximity
In the indexing phase, the terms are grouped into clusters according to their proximity to each other in the documents. Term clustering is used to construct an Extended Inverted File. Instead of basic relations of the form [Term, Document], this file contains more informative relations of the format [(Term1, Term2), Document], where Term1 and Term2 belong to the same cluster.

3.1 Term Clustering
Clustering consists in dividing a population of objects into subsets called Clusters so that all objects belonging to the same cluster are similar and objects of different clusters are dissimilar [7]. An Object is an elementary datum presented as input to the clustering algorithm; in general, this unit is represented by a vector of p attributes X = (x_1, x_2, ..., x_p). An Attribute of an object is a scalar x_i contained in the data vector. The Centroid of a cluster C is a vector V, where V[i] is the arithmetic mean of attribute i over the objects belonging to cluster C. V[i] is computed using formula (6), where v_ij is the value of attribute i of object j and |C| is the number of objects in cluster C.

V[i] = \frac{1}{|C|} \sum_{j=1}^{|C|} v_{ij}    (6)
The Distance is a measure that expresses the similarity between two objects: its value is high when the two objects belong to the same cluster and small when they reside in two different clusters. Among the several existing clustering techniques, the k-means algorithm is the most adequate for this application; the fuzzy k-means, in which an object can belong to more than one cluster, is even better adapted to term clustering, and more effectiveness can be expected from it. The distance measurement usually corresponds to a similarity formula in the simple k-means algorithm and to a membership function in the fuzzy k-means algorithm. Since the complexity of the k-means algorithm has been shown to be exponential, we propose another algorithm with lower complexity. For the purpose of grouping terms, we model the space of terms according to the concepts of clustering: a dictionary term is the object, the weight of the term in a document is the attribute, and a set of terms often co-occurring in the same documents is a cluster. The cosine similarity measure is used to calculate the distance between two terms. Formula (7) computes this score, where x_i and y_i are respectively the weights of term x and term y in document i.

Distance(x, y) = \frac{\sum_{i=1}^{n} x_i \times y_i}{\sum_{i=1}^{n} (x_i)^2 + \sum_{i=1}^{n} (y_i)^2}    (7)
To cope with the complexity issue of the fuzzy k-means algorithm, we made some changes to the clustering algorithm. We start by determining the number of clusters relative to the number of terms in the dictionary. Then we assign each term to a class: each term in the dictionary corresponds to a centroid. Once the clusters are defined, we assign to each of them the terms co-occurring with the centroid of the cluster. This assignment is made by computing the conditional probability of proximity between the centroid of the cluster and each term in the dictionary: we replaced the distance scheme by the conditional probability of proximity Proximity(t1|t2) [12], given by formula (8), where Count(t1 ∨ t2) counts the number of documents where t1 and t2 appear together and Count(t2) counts the number of documents where t2 appears. All the counts are computed before launching the clustering algorithm.

Proximity(t_1 | t_2) = \frac{P(t_1 \vee t_2)}{P(t_2)} = \frac{Count(t_1 \vee t_2)}{Count(t_2)}    (8)
The result is a set of clusters, each described by a centroid term from the dictionary and a proximity vector of terms. The modified clustering algorithm is summarized in the following pseudocode:

Algorithm 1. Term Clustering
1: Initialize the number k of clusters relative to the number of terms in the dictionary
2: Initialize the clusters: assign each term in the dictionary to a cluster
3: for each term ti in the dictionary do
4:   for each term tj in the dictionary (i ≠ j) do
5:     Calculate Proximity(ti|tj)
6:     Add the value of Proximity to V[i]
7:   end for
8:   Sort V[i]
9: end for
V[i] is the vector of proximity measures of the centroid. The result is then stored in a file called Clusters. In order to access the cluster that contains a given term, each term of the dictionary is linked to the centroid of the cluster to which it belongs.

3.2 Extended Inverted File
In this step, we retrieve, for each term other than the centroid of a cluster, the documents in which it appears together with the centroid of that cluster. The resulting index is then saved in the Extended Inverted File, as sketched below.
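Under the assumption that term postings (term → set of document ids) are available, a minimal sketch of this construction (the names are ours) might be:

```python
def build_extended_inverted_file(clusters, postings):
    """Map each (centroid, term) pair to the documents containing both terms."""
    extended = {}
    for centroid, vector in clusters.items():
        for term, prox in vector:
            if prox == 0.0:          # the pair never co-occurs, skip it
                continue
            shared = postings[centroid] & postings[term]
            if shared:
                extended[(centroid, term)] = sorted(shared)
    return extended
```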
4 Interrogation Approaches with Term Proximity Scheme
As in a traditional IRS, the documents and the query are represented by weight vectors of terms; thus, the same matching function is used to calculate the similarity between a document and a query. In this section, we present three different methods for extracting relevant documents.

4.1 Query Expansion Based Approach
In this retrieval method, we use the Clusters structure to recover the terms co-occurring with a certain proximity with those of the query. We then add these terms to the query in order to select the documents that contain at least one of them. One possible version of the retrieval algorithm is as follows:
Algorithm 2. Query Expansion Based Approach
1: Select a query Q to satisfy
2: Q'new ← Q
3: for each term t in Q do
4:   Select (from the Clusters file) the best co-occurring term t' with t
5:   Q'new ← Q'new + t'
6: end for
7: Satisfy Q'new using the classic search
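A Python sketch of Algorithm 2, reusing the clusters structure of Section 3.1 (whose vectors are sorted by decreasing proximity); the expanded query is then evaluated with the classic search sketched earlier:

```python
def expand_query(query_terms, clusters):
    """Algorithm 2: add to the query the best co-occurring term of each query term."""
    expanded = list(query_terms)
    for t in query_terms:
        vector = clusters.get(t)
        if vector:                       # sorted by proximity in Algorithm 1
            best_term, _ = vector[0]     # the best co-occurring term t'
            expanded.append(best_term)
    return expanded
```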
4.2 Extended Inverted File Based Approach
In this method, we select all documents that include at least one pair of query terms.

Algorithm 3. Extended Inverted File Based Approach
1: Select a query Q to satisfy
2: for each term pair {ti, tj} in Q do
3:   Select (from the Extended Inverted File) all documents containing ti and tj
4: end for
5: for each selected document di do
6:   Calculate the degree of relevance RSV(Q, di) using the scalar product
7: end for
8: Sort the selected documents according to their relevance
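A sketch of the selection step of Algorithm 3 in Python, assuming the extended structure built in Section 3.2; ranking then proceeds with formula (5) as in the classic search:

```python
from itertools import combinations

def extended_retrieval(query_terms, extended):
    """Algorithm 3: select documents containing at least one pair of query terms."""
    selected = set()
    for ti, tj in combinations(query_terms, 2):
        # Pairs are stored as (centroid, term), so both orders are checked.
        selected.update(extended.get((ti, tj), []))
        selected.update(extended.get((tj, ti), []))
    return selected
```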
4.3 Hybrid Approach
For this search method, we exploit the Inverted File and the Extended Inverted File structures together.

Algorithm 4. Hybrid Approach
1: Select a query Q to satisfy
2: for each term pair {ti, tj} in Q do
3:   Select (from the Extended Inverted File) all documents containing ti and tj
4: end for
5: for each term ti in Q do
6:   Exist ← false
7:   for each term tj in Q (ti ≠ tj) do
8:     if {ti, tj} exists in the Extended Inverted File then
9:       Exist ← true
10:    end if
11:  end for
12:  if Exist = false then
13:    Select (from the Inverted File) all documents that contain ti
14:  end if
15: end for
16: for each selected document di do
17:   Calculate the degree of relevance RSV(Q, di) using the scalar product
18: end for
19: Sort the selected documents according to their relevance
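A Python sketch of the selection step of Algorithm 4, combining both index structures; the selected documents are then ranked by RSV as before (names are ours):

```python
from itertools import combinations

def hybrid_retrieval(query_terms, extended, inverted):
    """Algorithm 4: fall back to the plain Inverted File for unpaired terms."""
    selected = set()
    for ti, tj in combinations(query_terms, 2):
        selected.update(extended.get((ti, tj), []))
        selected.update(extended.get((tj, ti), []))
    for ti in query_terms:                            # steps 5-15
        paired = any((ti, tj) in extended or (tj, ti) in extended
                     for tj in query_terms if tj != ti)
        if not paired:                                # Exist = false
            selected.update(doc_id for doc_id, _ in inverted.get(ti, []))
    return selected
```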
5 Experimental Results

5.1 The OHSUMED Collection
Extensive experiments were performed on the OHSUMED test collection. It is a set of 348,566 references from MEDLINE, the online medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The uncompressed files require about 381 MB of storage. The evaluations were performed using only the titles of the documents. The designed algorithms are implemented in Python on an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz with 3 GB of RAM.

5.2 Indexing Step
To evaluate our proposed approaches, and especially the search methods, we partitioned the collection of documents into sub-collections. Table 1 presents the characteristics of each sub-collection after the preprocessing step.

Table 1. Characteristics of the sub-collections

Size of the collection (# documents)   Size (MB)   Number of terms in the dictionary
50 000                                 3.34        35443
100 000                                6.66        50232
150 000                                10.1        62126
200 000                                13.5        71699
250 000                                17          80717
300 000                                20.6        88764
350 000                                24          96331

5.3 The Proposed IRS versus the Traditional One
In this section, we compare the results of our retrieval methods with those of the traditional IRS. The experiments were performed on the basis of 106 queries. Fig. 1 shows the behavior curves of the classic (traditional) algorithm and the three proposed ones in terms of execution time. We observe that the runtime of the Extended Inverted File approach is not only the shortest for all the sub-collections but also almost constant and close to 0, which allows real-time processing. In terms of retrieval performance, Fig. 2 illustrates the superiority of the query expansion strategy over the other methods. Finally, when comparing the algorithms in terms of the number of extracted documents (Fig. 3), we clearly see the gap separating the result produced by the Extended Inverted File based algorithm from those of the other approaches, in favor of the former; the gap is even larger with respect to classic retrieval.
Fig. 1. Comparison of the four retrieval approaches in terms of execution time
Fig. 2. Comparison of the methods in terms of document relevance quality
Fig. 3. Comparison of the four retrieval approaches in terms of the number of extracted documents
6 Conclusion
This work allowed us to study and develop an information retrieval system founded on the traditional information retrieval background and augmented with developments based on the term proximity concept. As part of this work, we proposed an indexing method based on grouping terms using statistical methods and clustering techniques. This proposal materializes in the creation of the Clusters file and the Extended Inverted File. These files are used in the interrogation phase by several retrieval methods to quickly find the relevant documents. The retrieval methods, relying on effective indexes, are very efficient and give very satisfactory results in terms of robustness and computation time. The system we developed was tested on OHSUMED and the achieved results are highly encouraging. The numerical values make the efficiency of our IRS clearly visible when the Extended Inverted File is used in the interrogation phase, which indeed shows the importance of exploiting the term proximity concept in information retrieval.
References

1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, New York (1999)
2. Chu, H.: Information Representation and Retrieval in the Digital Age. Information Today, New Jersey (2010)
3. Cummins, R., O'Riordan, C.: Learning in a pairwise term-term proximity framework for information retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 251–258 (2009)
4. Drias, H., Khennak, I., Boukhedra, A.: A hybrid genetic algorithm for large scale information retrieval. In: IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS, pp. 842–846 (2009)
5. He, B., Huang, J.X., Zhou, X.: Modeling term proximity for probabilistic information retrieval models. Information Sciences 181(14) (2011)
6. Kowalski, G.: Information Retrieval Architecture and Algorithms. Springer, New York (2011)
7. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2011)
8. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
9. Zhu, M., Shi, S., Li, M., Wen, J.-R.: Effective top-k computation in retrieving structured documents with term-proximity support. In: CIKM 2007 (2007)
10. Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gatford, M., Payne, A.: Okapi at TREC-4. In: TREC (1995)
11. Vechtomova, O., Wang, Y.: A study of the effect of term proximity on query expansion. Journal of Information Science 32(4), 324–333 (2006)
12. Wei, X., Croft, W.B.: Modeling term associations for ad-hoc retrieval performance within language modeling framework. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 52–63. Springer, Heidelberg (2007)