© Springer-Verlag, Lecture Notes in Computer Science (LNCS) #1749 "Internet Applications", Proceedings of ICSC (IEEE), Hong Kong, December 1999, pp. 196-205

Query Length, Number of Classes and Routes through Clusters: Experiments with a Clustering Method for Information Retrieval

Patrice Bellot, Marc El-Bèze
Laboratoire d’Informatique d’Avignon (LIA)
339 ch. des Meinajaries, BP 1228, FR-84911 Avignon Cedex 9 (France)
{patrice.bellot,marc.elbeze}@lia.univ-avignon.fr

Abstract. A classical information retrieval system ranks documents according to distances between texts and a user query. The answer list is often so long that users cannot examine all the retrieved documents, while some relevant ones are badly ranked and thus never seen. To address this problem, the retrieved documents are clustered automatically. We describe an algorithm based on hierarchical and K-means-like clustering methods; it classifies the set of documents retrieved by any IR system. The method is evaluated over the TREC-7 corpora and queries. We show that it improves the results of retrieval by providing users with at least one high-precision cluster. The impact of the number of clusters and of the way to browse them to build a reordered list is examined. Over the TREC corpora and queries, we show that choosing the number of clusters according to the length of the query improves results compared with a prefixed number.

1 Introduction

A classical information retrieval system retrieves documents from a corpus and ranks them according to a similarity function based on word co-occurrences in the texts and in a query. Generally, users cannot examine all the documents retrieved, whereas some relevant items are badly ranked and thus never seen. We have chosen to cluster the retrieved documents automatically in order to help the user locate the relevant items. According to the so-called “Cluster Hypothesis” [12], relevant documents are more likely to be similar to one another than to irrelevant ones. This hypothesis has received some experimental validation: [2], [6], [10], [4] or [11]. For a query, there are roughly two kinds of documents, those that are relevant and those that are not, so we might decide to cluster the retrieved documents into two classes. But this makes sense only if the retrieved documents are distributed over two main topics: relevant and irrelevant. The queries provided by a user are generally short. Given the ambiguity of natural language, they are often related to several different topics.

For example, the title of the TREC-7 query 351 is “Falkland petroleum exploration”¹. Retrieved documents may be about ‘exploration’, ‘petroleum exploration’ or ‘Falkland exploration’. They are not necessarily completely relevant and may not be relevant at all. The irrelevant documents discuss a lot of very different topics and cannot be grouped into a thematically homogeneous cluster. Furthermore, relevant documents can deal with completely or slightly different topics, according to the number of distinct meanings the query words have or to the subjects they are related to. Hence, the retrieved documents may be clustered into several classes, each one possibly containing some relevant documents. Restricting the classification to the retrieved documents, rather than applying it to the collection as a whole, allows a quick clustering. It can be used efficiently on WWW documents and with any classical Internet IR system. Clustering the set of retrieved documents has been described numerous times in the literature, including [1] and [9]. In this article, we examine the impact of the number of clusters and of different ways to browse them to produce a new ranked list of documents. We show that this algorithm creates at least one high-precision cluster, improving precision at 5 or 10 documents as well as precision at low recall levels². We also show that choosing the number of clusters according to query size improves results compared with a prefixed number. We successfully tested this method over the TREC-6 corpora and queries with the parameters learned on TREC-7.

2 A Clustering Algorithm

An important criterion for an IR system is the time the user has to wait for an answer. We have used a K-means-like method [3] to cluster the retrieved documents, in particular because it requires less time than hierarchical algorithms. The list of documents to cluster is obtained with the IR system developed at the LIA, described in [8]. To improve the computation of distances, the Part-Of-Speech (POS) tagger and the lemmatizer developed at the LIA are used. The main classification step is performed as follows:

1. Find an initial partition (see 2.3).
2. Repeat until there is little or no change in cluster membership:
   - compute the centroids, i.e., for each cluster, the set of documents which are the closest to the cluster’s geometric centre (see 2.2);
   - allocate each document to the cluster at the lowest distance.

Since the clusters’ centroids are computed only at the beginning of an iteration, cluster memberships are order-independent. A document cannot be assigned to multiple clusters, and it can be placed in a cluster only if it is sufficiently close. Hence, clusters are not confused by distant items, but not all documents may be assigned to a class at the end of the process. The maximal number of classes is fixed at the start of the procedure. In the clusters, documents are ranked as they were before classification. Since this process may be seen as post-processing, it can be used with any IR system that returns a list of documents for a user query.

¹ For a description of the seventh Text Retrieval Conference, see [13].
² Nevertheless, the global recall level is not modified.
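To make the loop concrete, here is a minimal Python sketch of the procedure just described; it is our illustration, not the authors’ implementation. The helper names, the closeness threshold `max_distance` and the iteration cap are assumptions: `distance` stands for the MinMax distance of Section 2.1 and `centroids` for the centroid choice of Section 2.2.

```python
def cluster_retrieved(documents, initial_partition, distance, centroids,
                      k=3, max_distance=0.9, max_iterations=20):
    """K-means-like reallocation of retrieved documents (sketch)."""
    clusters = [c for c in initial_partition if c]
    for _ in range(max_iterations):
        # Centroids are frozen for the whole iteration, so the result does
        # not depend on the order in which documents are examined.
        reps = [centroids(c, k) for c in clusters]
        new_clusters = [[] for _ in clusters]
        for doc in documents:
            # Distance to a cluster = lowest distance to one of its centroids.
            dists = [min(distance(doc, r) for r in rep) for rep in reps]
            best = min(range(len(reps)), key=dists.__getitem__)
            if dists[best] <= max_distance:      # documents too far from
                new_clusters[best].append(doc)   # every cluster stay unassigned
        if new_clusters == clusters:             # little or no change: stop
            return clusters
        clusters = [c for c in new_clusters if c]  # drop emptied classes
    return clusters
```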

2.1 Distance between Documents

Let D and D′ be two documents, u a term (a lemma with a rough syntactic tag appended), and N(u) the number of documents containing u in the corpus as a whole. Given S, the number of documents in the corpus, the information quantity of a term in a document is based on its occurrences in the corpus, IDF(u), (and not in the set of documents to cluster) and on its frequency in the document, TF(u). The information quantity of a document is the sum of the weights of its terms:

\[ I(D) = \sum_{u \in D} \mathrm{TF}(u)\,\mathrm{IDF}(u) = -\sum_{u \in D} \mathrm{TF}(u)\,\log_2 \frac{N(u)}{S} \tag{1} \]

We assume that the greater the information quantity of the intersection of the term sets of two documents, the closer the documents are. In order to guarantee convergence of the classification process and to measure the quality of the partition, we need a true distance (one verifying the triangular inequality). That is the case³ of the so-called MinMax distance between two documents D and D′:

\[ d(D, D') = 1 - \frac{I(D \cap D')}{\max\big(I(D),\, I(D')\big)} \tag{2} \]
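Equations (1) and (2) translate directly into code. The following sketch is ours; in particular, it assumes that the intersection D ∩ D′ keeps, for each shared term, the smaller of the two term frequencies, a detail the paper does not spell out.

```python
import math

def information_quantity(doc_terms, doc_freq, corpus_size):
    """Equation (1): I(D) = -sum_u TF(u) * log2(N(u) / S).
    doc_terms maps each term of the document to its frequency TF(u);
    doc_freq maps each term to N(u), its document frequency in the corpus."""
    return -sum(tf * math.log2(doc_freq[u] / corpus_size)
                for u, tf in doc_terms.items())

def minmax_distance(d1, d2, doc_freq, corpus_size):
    """Equation (2): 1 - I(D ∩ D') / max(I(D), I(D'))."""
    shared = {u: min(d1[u], d2[u]) for u in d1.keys() & d2.keys()}
    inter = information_quantity(shared, doc_freq, corpus_size)
    return 1 - inter / max(information_quantity(d1, doc_freq, corpus_size),
                           information_quantity(d2, doc_freq, corpus_size))
```

Two identical documents are at distance 0, and two documents sharing no terms are at distance 1, as expected of a normalized distance.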

In order to provide users with a ranked list of documents built from the partition, or with an arranged view of the clusters, we have to compute distances between a cluster and a query (“which cluster is the closest to the query?”). This may be accomplished using the indices given by the IR system, as explained in the next section.

2.2 Cluster Centroids

The distance between a cluster and an item (a document or a query) is equal to the lowest distance between the item and one of the cluster centroids. We have chosen to represent a cluster C by the k documents that are closest to its geometric centre. For each document, we compute the sum of the distances separating it from the other texts in the same cluster, and we choose as centroids the k documents with the k smallest sums. Let Ni (1 ≤ i ≤ k) be a centroid of C and let d be the distance between a document and a cluster:

\[ d(D, C) = \min_{1 \le i \le k} d(D, N_i) \tag{3} \]

³ On average, over the 50 queries of TREC-7, 6 iterations are made before convergence.

Because the centroids are documents, the indices given by the IR system can be used to rank the clusters with respect to the query.
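The centroid choice and equation (3) can be sketched as follows (our names, not the authors’ code; `distance` is the MinMax distance of Section 2.1):

```python
def centroids(cluster, distance, k=3):
    # For each document, sum the distances to the other texts of the cluster;
    # the k documents with the smallest sums represent the cluster.
    def spread(doc):
        return sum(distance(doc, other) for other in cluster if other is not doc)
    return sorted(cluster, key=spread)[:k]

def cluster_distance(item, cluster, distance, k=3):
    # Equation (3): the distance between an item (document or query) and a
    # cluster is the lowest distance between the item and one of its centroids.
    return min(distance(item, rep) for rep in centroids(cluster, distance, k))
```

Bound to a fixed distance function (for instance via functools.partial), this is the `centroids` helper assumed by the clustering loop sketched in Section 2.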

2.3 Initial Partition

The result of this cluster-based method depends on the initial set. We have used a partial hierarchical method to obtain the initial partition (see the sketch after this list):

1. For each pair of documents i and j such that d(i, j) < threshold⁴:
   - if neither i nor j is yet in a class, create a new one containing them;
   - if i and/or j are already allocated, merge all the documents of the class containing i with those of the class containing j.
2. Partial hierarchical classification: after this first step, the number of classes created may be greater than the number of clusters wanted. So, as long as the number of classes exceeds the predefined one:
   a) compute the class representatives;
   b) compute the distances between every pair of classes;
   c) merge the two closest classes (this can be done by using distances between centroids [7]).
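The two steps might be coded as below; this is our reading of the procedure, not the published implementation. Documents are assumed to be hashable identifiers, `distance` is the MinMax distance, `wanted` is the predefined number of clusters, and `centroids` is the helper sketched in Section 2.2.

```python
from itertools import combinations

def initial_partition(documents, distance, threshold, wanted, k=3):
    # Step 1: union-find grouping of every pair closer than the threshold,
    # merging classes whenever they share a document.
    parent = {d: d for d in documents}
    def find(d):
        while parent[d] is not d:
            parent[d] = parent[parent[d]]   # path compression
            d = parent[d]
        return d
    for i, j in combinations(documents, 2):
        if distance(i, j) < threshold:
            parent[find(i)] = find(j)
    groups = {}
    for d in documents:
        groups.setdefault(find(d), []).append(d)
    # Documents never paired with another one stay unassigned after step 1.
    clusters = [g for g in groups.values() if len(g) > 1]
    # Step 2: while there are too many classes, merge the two closest ones,
    # measuring class closeness through their centroids (cf. [7]).
    while len(clusters) > wanted:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: min(
                       distance(x, y)
                       for x in centroids(clusters[ab[0]], distance, k)
                       for y in centroids(clusters[ab[1]], distance, k)))
        clusters[a].extend(clusters.pop(b))
    return clusters
```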

3 Experiments over TREC-7 Corpora

3.1 Routes through the Clusters

To evaluate the quality of the classification, we can explore the clusters in different ways. We can first consider the best-ranked cluster, which should contain most of the relevant documents. We can look at the best-ranked documents of each cluster, i.e. at the documents of each theme which are closest to the query. Finally, we can present each cluster to the user so that he can choose those which contain the largest number of relevant documents [4].

Let LCn be the list of documents constructed from the succession of the clusters ranked according to their distances to the query:

\[ LC_n = C_1 \cdot C_2 \cdot C_3 \cdots \]

Let Ln be the list of documents constructed from the succession of the n first-ranked items of each ranked cluster (Ci,j is the j-th document of the i-th cluster):

\[ L_n = (C_{1,1} \cdot C_{1,2} \cdot C_{1,3} \cdots C_{1,n})(C_{2,1} \cdots C_{2,n}) \cdots (C_{1,n+1} \cdot C_{1,n+2} \cdots C_{1,2n})(C_{2,n+1} \cdots C_{2,2n}) \cdots \]

Let Ln1,n2,… be the list of documents constructed from the succession of the ni first-ranked items⁵ of each ranked cluster Ci:

\[ L_{n_1,n_2,\ldots} = (C_{1,1} \cdot C_{1,2} \cdot C_{1,3} \cdots C_{1,n_1})(C_{2,1} \cdots C_{2,n_2}) \cdots (C_{1,n_1+1} \cdots C_{1,2n_1})(C_{2,n_2+1} \cdots C_{2,2n_2}) \cdots \]

⁴ The threshold value is chosen so that the number of documents assigned at the end of step 1 is greater than half the total number of documents.
⁵ We choose ni > ni+1 to favor the first-ranked classes.
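The routes are simple list constructions. A minimal sketch, with hypothetical names, where each cluster is a list of documents already ranked as they were before classification:

```python
def route_lc(ranked_clusters):
    # LC_n: the clusters, ranked by distance to the query, one after another.
    return [doc for cluster in ranked_clusters for doc in cluster]

def route_interleaved(ranked_clusters, quotas):
    # L_{n1,n2,...}: take the n_i next not-yet-listed documents from each
    # ranked cluster C_i in turn; L_n is the case where every quota equals n.
    result, offsets = [], [0] * len(ranked_clusters)
    while any(o < len(c) for o, c in zip(offsets, ranked_clusters)):
        for i, cluster in enumerate(ranked_clusters):
            result.extend(cluster[offsets[i]:offsets[i] + quotas[i]])
            offsets[i] += quotas[i]
    return result

# For example, with four ranked clusters,
#   route_interleaved([c1, c2, c3, c4], quotas=[40, 30, 20, 10])
# yields the list written L40,30,20,10 in Table 1.
```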


To help measure how well the classification groups relevant documents together, we use the list of relevant documents supplied by NIST for TREC-7. For each query, we select the best clusters according to the number of relevant documents they contain; [6] and [11] used this method of evaluation. Lqrels is defined as the list of documents constructed from the succession of the clusters ranked according to the number of relevant items they contain. Lqrels can be seen as the best route through the clusters. The evaluations presented below have been obtained by means of the trec_eval application over the 50 queries (351 to 400) of TREC-7. The corpus was the one used for TREC-7 (528,155 documents) [13]. We used the IR system developed by the LIA and Bertin & Cie [8] to obtain the lists of documents to cluster. Whenever possible, the first 1000 documents retrieved for each query have been kept for clustering.

3.2 The Same Number of Classes for each Query

Usually, the number of classes is defined at the start of the process. It cannot grow, but it can be reduced when a class empties. In the figures below, the indicated numbers of clusters correspond to the values initially chosen. The documents that are not assigned to a class at the end of the classification are allocated to a new one, placed at the last position in the ranked list of clusters.

Fig. 1. Lqrels with different numbers of classes (x-axis: number of classes, from 2 to 20; curves: precision at 10 documents and average precision; precision at 10 and average precision without classification are respectively indicated by the top and the bottom dashed lines)

Fig. 2. LCi with different numbers of classes (same axes and curves as Fig. 1)

By choosing the same number of classes (from 2 to 13) for all queries, the average precision over all relevant documents is lower than without classification for both lists Lqrels and LCn (Fig. 1 and Fig. 2). The decrease varies from 1.2% to 2% (see the bottom dashed line in Fig. 1). Fig. 1 and Fig. 2 show that

these lists do not globally improve the results of the retrieval: the average precision decreases, since the relevant documents that are not in the first cluster are ranked after all the items of that cluster. The differences between the results shown in Fig. 1 and in Fig. 2 measure how well the above-defined distance ranks the clusters. The average precision decreases by about 5% when clusters are ranked according to the computed distances rather than according to the number of relevant documents they contain. However, the first-ranked cluster (according to the distances to the queries) is very often better than the next ones, as shown in Fig. 3, where we compare the lists C1.C2 and C2.C1. With the second list, the relative decrease of the average precision over the 50 queries equals 18% (from 0.11 to 0.09).

Fig. 3. Precision of lists C1.C2 and C2.C1 (x-axis: number of documents, from 5 to 1000; y-axis: precision)

On the other hand, with the list Lqrels, the precision at 10 documents is greater than the one obtained without classification. With 5 clusters, the relative increase of precision at 10 documents is equal to 12.8% (from 0.33 to 0.374); at 5 documents, the relative increase equals 10.5%. These values confirm (see Fig. 1, where the top dashed line shows the precision value without classification) that the classification helps to group relevant documents together by creating a class containing a large proportion of well-ranked relevant items: precision is increased at a low number of documents. For each query, one class with a high level of precision exists. Those classes often have a low population rate⁶. In fact, they are the ones that should be proposed first. Indeed, a short high-precision list makes a better impression than a long one with a great number of scattered relevant documents: in the first one, users can browse all the documents, and what they are looking for is likely to be quickly accessed.

⁶ The population rate is the number of documents in the class divided by the total number of documents retrieved for the query.


3.3 Experiments with Different Routes

Table 1 and Fig. 4 show some results obtained with 3 clusters. One can see that precision at low levels of recall is better with the lists Ln than with the list LC (the succession of each cluster’s contents). However, only the list Lqrels gives better results than no classification at all. At recall 0.1, the relative increase of precision of list L5 over list LC equals 18.5% (from 0.27 to 0.32).

Fig. 4. Precision for different ways to browse the clusters (3 clusters; x-axis: number of documents, from 5 to 1000; y-axis: precision; curves: L40,30,20,10, L10, LC, Lqrels, L15, L20, L5)

Table 1. Different ways to browse the clusters automatically (3 clusters). Precision at:

                         Recall 0.00   Recall 0.10   5 docs   10 docs   15 docs   20 docs
without classification   0.58          0.38          0.36     0.33      0.32      0.30
Lqrels                   0.63          0.37          0.40     0.37      0.32      0.29
LC                       0.48          0.27          0.25     -         -         -
L5                       0.51          0.32          0.24     -         -         -
L10                      0.50          0.31          -        -         -         -
L15                      -             -             -        -         -         -
L20                      -             -             -        -         -         -
L40,30,20,10             -             -             -        -         -         -

4 Choice of Number of Clusters according to Query Length

4.1 Linear Correlation Deduced from TREC-7

In section 3.2, we showed that the number of classes, when it is the same for all queries, does not strongly influence the global results. However, that is not the case if we examine each query independently. It has been found that the shorter the query, the higher the number of clusters must be. This can be explained by the fact that a short query is more ambiguous, so the topics of the retrieved documents are more varied. Over the 50 TREC-7 queries, the correlation coefficient between the best numbers of classes and the query sizes equals -0.56. The linear correlation is significant at the 5% level according to a Fisher test. The equation deduced from the correlation coefficient, the query sizes and the best numbers of classes is:

\[ \text{class number} = -0.525 \cdot (\text{query size}) + 10.65 \tag{4} \]
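In code, equation (4) is a one-liner. Rounding to the nearest integer and the lower bound of 2 classes are our assumptions; the paper does not say how the real-valued estimate is discretized.

```python
def class_number(query_size):
    # Equation (4), fitted on the 50 TREC-7 queries (query_size in words).
    return max(2, round(-0.525 * query_size + 10.65))
```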

4.2 Experiments over TREC-7

By choosing the number of clusters according to equation (4), we clearly improve the best results obtained previously⁷. Compared with the rates without classification (see Table 2 and Fig. 5), the relative increase of precision at 5 documents is equal to 22% (from 0.36 to 0.44); it equals 10% when compared with the value obtained with a constant number of clusters.

Table 2. Improvements by choosing the number of clusters according to query sizes (TREC-7). Precision at:

                                Recall 0.00   Recall 0.10   5 docs        10 docs       15 docs
without classification          0.58          0.38          0.36          0.33          0.32
Lqrels (2 clusters)             0.65 (+12%)   0.33 (-13%)   0.42 (+17%)   0.37 (+12%)   0.32 (=)
Lqrels (5 clusters)             0.67 (+16%)   0.32 (-16%)   0.39 (+8%)    0.32 (-3%)    0.27 (-15%)
Lqrels with linear regression   0.75 (+29%)   0.33 (-13%)   0.44 (+22%)   0.37 (+12%)   0.31 (-3%)

⁷ In Table 2 and in Table 3, the lists Lqrels are obtained by ranking clusters according to their precision values and not according to the number of relevant documents they contain.


Fig. 5. Precision by choosing the number of clusters according to query sizes (TREC-7); x-axis: number of documents, from 5 to 1000; y-axis: precision; curves: number of clusters according to queries, 5 clusters for each query, without classification

4.3 Test over TREC-6 Corpora and Queries

The results shown in Table 2 were obtained with the same corpus as the one used to compute equation (4). We have therefore tested the parameters of equation (4) on the TREC-6 corpora and queries. The results in Table 3 confirm the improvements shown in 4.2, using the same method over TREC-6 with the equation computed during TREC-7: the absolute improvements of the precision levels at 5, 10 and 15 documents are equal to 7%, 3% and 1% compared with the results obtained with 5 clusters for each query. The mean number of clusters used by this method is equal to 5.

Table 3. Results by choosing the number of clusters according to query sizes (TREC-6). Precision at:

                                Recall 0.00   Recall 0.10   5 docs   10 docs   15 docs
without classification          0.62          0.41          0.41     0.33      0.30
Lqrels (3 clusters)             0.63          0.46          0.39     0.35      0.32
Lqrels (5 clusters)             0.61          0.45          0.41     0.34      0.32
Lqrels with linear regression   0.68          0.47          0.46     0.37      0.33

5 Conclusion

We have shown how classifying the retrieved documents helps to group the relevant ones together. It increases the effectiveness of retrieval by providing users with at least one cluster whose precision is higher than the one obtained without classification. The classification process is quick and can be used with any IR system on the WWW. We have examined, with TREC corpora and queries, how the number of clusters and the way to browse them automatically along different routes affect the classification. The quality of the results can be compared with those reported in [6] and [11]: when the best cluster is selected, the absolute increase in precision is equal to 8% at 5 documents and to 4% at 10 documents. Moreover, we have shown that varying the number of clusters according to query length improves the results over TREC-7 and over TREC-6. Organizing the set of retrieved documents according to the size of the user’s query is an important new result, especially for queries written in natural language.

6 References

1. Allen, R.B., Obry, P., Littman, M.: An Interface for Navigating Clustered Document Sets Returned by Queries. In: Proceedings of COCS (1993), 166
2. Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/Gather: a Cluster-based Approach to Browsing Large Document Collections. In: ACM/SIGIR (1992), 318-329
3. Diday, E., Lemaire, J., Pouget, J., Testu, F.: Eléments d’Analyse des Données. Dunod Informatique (1982)
4. Evans, D.A., Huettner, A., Tong, X., Jansen, P., Subasic, P.: Notes on the Effectiveness of Clustering in Ad-Hoc Retrieval. In: TREC-7, NIST Special Publication (1998)
5. Frakes, W.B., Baeza-Yates, R. (eds.): Information Retrieval, Data Structures & Algorithms. Prentice-Hall (1992), ISBN 0-13-463837-9
6. Hearst, M.A., Pedersen, J.O.: Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In: ACM/SIGIR (1996), 76-84
7. Kowalski, G.: Information Retrieval Systems, Theory and Implementation. Kluwer Academic Publishers (1997), ISBN 0-7923-9926-9
8. de Loupy, C., Bellot, P., El-Bèze, M., Marteau, P.F.: Query Expansion and Automatic Classification. In: TREC-7, NIST Special Publication #500-242 (1999), 443-450
9. Sahami, M., Yusufali, S., Baldonado, M.Q.W.: SONIA: a Service for Organizing Networked Information Autonomously. In: ACM/DL (1998), 200-209
10. Schütze, H., Silverstein, C.: Projections for Efficient Document Clustering. In: ACM/SIGIR (1997), 74-81
11. Silverstein, C., Pedersen, J.O.: Almost-Constant-Time Clustering of Arbitrary Corpus Subsets. In: ACM/SIGIR (1997), 60-66
12. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
13. Voorhees, E.M., Harman, D.: Overview of the Seventh Text REtrieval Conference (TREC-7). NIST Special Publication (1999)
