RIAO'2000 Conference Proceedings – Collège de France, Paris, France, April 12-14, 2000, vol. I, pp. 344-363
Clustering by means of Unsupervised Decision Trees or Hierarchical and K-Means-like Algorithm

Patrice Bellot & Marc El-Bèze
LIA – Université d'Avignon, Agroparc B.P. 1228, 84911 Avignon Cedex 9, France
{patrice.bellot, marc.elbeze}@lia.univ-avignon.fr
Abstract
A classical information retrieval system returns a list of documents in answer to a user query. The answer list is often so long that users cannot explore all the documents retrieved. A classification of the retrieved documents makes it possible to organize them thematically and to improve precision. In this paper, we present and compare two text classification algorithms. The first is a clustering algorithm (K-Means-like) initialized with a partial hierarchical classification. The second is a new algorithm that relies on unsupervised decision trees (UDTs). The indexing methods we use (TF-IDF weighting scheme, cosine similarity in the vector space model) prevent us from really considering all the subjects dealt with in the texts. A better way to take all the themes into account is to cluster sentences from the documents instead of the documents as a whole; this is achieved by the second method we propose. The effectiveness of these methods is evaluated over the Amaryllis'99 corpora and queries. Since these methods are applied during a post-processing phase, they can be used with any IR-system that returns a list of documents. The methods presented here yield significant improvements compared with a search without classification. In order to verify that the improvement is due to the methods and not merely to the sharing out of items into classes, the results obtained are compared with those of a random classification.
Keywords Information retrieval, automatic classification, clustering, decision trees, K-Means, Amaryllis.
1. Introduction
We have chosen to automatically cluster retrieved documents in order to assist the user in locating relevant items. According to the so-called "Cluster Hypothesis" (Van Rijsbergen, 1979), relevant documents are more likely to be similar to one another than to irrelevant ones. This hypothesis has received some experimental validation: (Cutting et al., 1992), (Hearst & Pedersen, 1996), (Schütze & Silverstein, 1997), (Silverstein & Pedersen, 1997), and (Evans et al., 1998). For a query, there are roughly two kinds of documents, those that are relevant and those that are not; we may therefore decide to cluster the retrieved documents into two classes. But this can be done only if the retrieved documents are distributed according to two main topics, relevant and irrelevant. The queries provided by a user are generally short (two or three words) and, given the ambiguity of natural language, they are often related to several topics. Moreover, due to the way an IR-system generally retrieves documents (a similarity function based upon word cooccurrences), many irrelevant documents are retrieved, and these deal with very different topics. Furthermore, the relevant documents themselves can deal with completely or slightly different topics, according to the number of distinct meanings of the query words or to the subjects they are related to. Therefore, the retrieved documents may be clustered into several classes, each one possibly containing some relevant documents. Lastly, the categories usually employed in Web search engines are not sufficiently fine-grained and do not easily allow a document to belong to several categories. The clustering of sets of retrieved documents has often been described in the literature, including (Allen et al., 1993; Sahami et al., 1998).
This paper is organized as follows: in section 2, an algorithm combining hierarchical classification and a cluster-based (K-Means-like) method is presented. It is compared with a second method,
original, faster, and allowing a better understanding of the clusters with a view to using them interactively. In section 3, this new clustering method, which relies on unsupervised decision trees, is described. These methods are experimented with and compared over the Amaryllis'99 corpora and queries. The information retrieval process we use is summarized in Figure 1.
[Figure 1 diagram: the user query is submitted to the IR-system, which returns a ranked list of documents; SIAC then performs the classification.]
Figure 1: Local clustering for information retrieval

1.1. Amaryllis'99
Amaryllis is a TREC-like evaluation campaign for French corpora (Landi et al., 1998; Lespinasse et al., 1999). Amaryllis'99 was organized by INIST (Institut de l'Information Scientifique et Technique) and sponsored by AUPELF-UREF (Agence Francophone pour l'Enseignement Supérieur et la Recherche). Compared with TREC, the six Amaryllis'99 corpora are smaller, but the lists of relevant documents used for the evaluation are built manually by archivists and revised according to the answers obtained by the participants –see (Lespinasse et al., 1999) for more details–. The methods described in this paper are applied to the first OFIL corpus (OD1). This corpus is composed of 11,016 French newspaper articles (from Le Monde) and contains more than 2.4 million words. We use the 26 topics of the first topic set, OT1. Figure 2 shows an example of a topic used during Amaryllis'99. The final wording of a query is automatically deduced from its topic by merging its fields and keeping only the lemmas of the words (Bellot, 2000).

domaine : International
sujet : La séparation de la Tchécoslovaquie
question : Pourquoi et comment avoir divisé la Tchécoslovaquie et quelles ont été les répercussions économiques et sociales ?
compléments : Prendre en compte les différentes versions présentées
concepts : Partition de la Tchécoslovaquie, causes et modalités de la partition, création de la Slovaquie et de la République Tchèque, points de vue, économie

Figure 2: The five fields of "topic 1" extracted from the OT1 set (Amaryllis'99). (In English: domain: International; subject: the separation of Czechoslovakia; question: why and how was Czechoslovakia divided and what were the economic and social repercussions?; complements: take into account the different versions presented; concepts: partition of Czechoslovakia, causes and modalities of the partition, creation of Slovakia and of the Czech Republic, points of view, economy.)

Note that all the evaluations described here were performed with the TrecEval software used during the TREC and Amaryllis campaigns. The evaluation criteria are the usual ones (precision, recall, precision at different recall levels, and their associated curves). For a detailed description of these criteria, see (Voorhees & Harman, 1999; Bellot, 2000)¹.

1.2. The SIAC information retrieval system
The SIAC information retrieval system (Segmentation et Indexation Automatiques de Corpus) has been designed to evaluate the classification and segmentation methods we work on. SIAC is used to retrieve documents for a user query by employing some classical methods (vector space model, cosine similarity and TF-IDF weighting scheme) and some tools created at LIA (a
¹ They are respectively available at http://trec.nist.gov and http://www.lia.univ-avignon.fr/personnel/BELLOT
part-of-speech tagger –see (Spriet & El-Bèze, 1999)–, a lemmatizer, …). SIAC also comprises three further modules: two classification modules and one segmentation module. In this paper, SIAC is used to retrieve the sets of documents and to cluster them. Figure 3 shows the current GUI of the Java version of SIAC. See (Bellot, 2000) for more details about SIAC.
2. Hierarchical and K-Means-like algorithm (HKM)
The first method presented in this paper is a combination of a hierarchical classification and a cluster-based (K-Means-like) method. The cluster-based method allows a quick classification of texts: beginning from an initial partition, it reallocates items until the partition becomes stable, i.e. no document moves from one cluster to another. To obtain a homogeneous initial partition, a partial hierarchical classification is built using a subset of the retrieved documents. This restriction to a subset is necessary to reduce the computation cost as much as possible, since building a global hierarchical classification requires computing all pairwise distances between documents, which is expensive in computational resources. In a second step, the application of the "nuées dynamiques", a K-Means-like method (Diday, 1982), classifies the documents ignored during initialization.

2.1. Algorithm
The main classification step is performed as follows:
• find an initial partition (see 2.1.3);
• repeat:
  • compute the centroids of each cluster (see 2.1.2);
  • allocate each document to the nearest cluster (the one with the lowest distance);
until there is little or no change in cluster membership.
Each iteration is of order O(n). Since the cluster centroids are computed only at the beginning of an iteration, cluster memberships are order-independent. A document cannot be assigned to multiple clusters. On the other hand, a document is placed in a cluster only if the distance between the document and the cluster does not exceed a given threshold. Hence, clusters are not polluted by distant items, but some documents may remain unassigned at the end of the process. The maximal number of classes is fixed at the start of the procedure. Within the clusters, the documents are ranked as they were before classification (we use the similarity indices given by the IR-system). Since this process may be seen as post-processing, it can be used with any IR-system that returns a list of documents for a user query.

2.1.1. Distance between documents
Let D be a document, u a lemma together with its syntactic tag, N(u) the number of documents containing u in the corpus as a whole, and S the number of documents in the corpus. The information quantity of a term in a document is based on its occurrences in the corpus —IDF(u)— (and not in the set of documents to cluster) and on its frequency in the document —TF(u)—. The information quantity of a document is the sum of the weights of its terms:

$$ I(D) = \sum_{u \in D} TF(u)\,IDF(u) = \sum_{u \in D} -TF(u)\,\log_2 \frac{N(u)+1}{S} \qquad (1) $$
We assume that the greater the information quantity of the intersection of the lemma sets from two documents, the closer they are.
[Figure 3 screenshot: the GUI of the current Java version of SIAC, showing the lemmatized query, the answers, and the corpus & documents index.]

Figure 3: The GUI of the current Java version of SIAC
In order to ensure the convergence of the classification process, we need a true distance (one verifying the triangle inequality). This is the case –see the proof in (Bellot, 2000)– of the so-called MinMax distance between two documents D and D':

$$ d(D, D') = 1 - \frac{I(D \cap D')}{\max\bigl(I(D), I(D')\bigr)} \qquad (2) $$
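To make equations (1) and (2) concrete, here is a minimal sketch in Python (illustrative code, not the authors' implementation). Documents are modeled as bags of lemmas; doc_freq maps a lemma to N(u) and corpus_size is S. How TF is counted inside the intersection D ∩ D' is not specified in the paper, so taking the minimum of the two term frequencies is an assumption of this sketch.

```python
import math
from collections import Counter

def information_quantity(lemmas, doc_freq, corpus_size):
    """I(D) = sum over lemmas u of TF(u) * log2(S / (N(u) + 1))  (eq. 1)."""
    tf = Counter(lemmas)
    return sum(f * math.log2(corpus_size / (doc_freq.get(u, 0) + 1))
               for u, f in tf.items())

def minmax_distance(d1, d2, doc_freq, corpus_size):
    """d(D, D') = 1 - I(D ∩ D') / max(I(D), I(D'))  (eq. 2)."""
    tf1, tf2 = Counter(d1), Counter(d2)
    # Intersection of the lemma sets; the TF of a shared lemma is taken as
    # the minimum of its two frequencies (an assumption, see above).
    shared = Counter({u: min(tf1[u], tf2[u]) for u in tf1.keys() & tf2.keys()})
    i_inter = information_quantity(shared.elements(), doc_freq, corpus_size)
    i_max = max(information_quantity(d1, doc_freq, corpus_size),
                information_quantity(d2, doc_freq, corpus_size))
    return 1.0 - i_inter / i_max if i_max > 0 else 1.0
```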
In order to provide users with a ranked list of documents from the partition, or with an arranged view of the clusters, we have to compute distances between the clusters and the query ("which cluster is the closest to the query?"). This may be accomplished using the above distance (D' now being a query) or the indices given by the IR-system; this is explained in the next section.

2.1.2. Cluster centroids
We have chosen to represent a cluster by the k documents that are the closest to its geometric centre. For each document, we compute the sum of the distances separating it from the other texts in the same cluster, and we choose as centroids or "representatives" the k documents with the k smallest sums (see Figure 4). This avoids computing a "mean vector" and allows the same similarity values to be used throughout the K-Means iterations (similarities between documents are computed only once). Let k be the number of representatives of cluster C, and let N_i (1 ≤ i ≤ k) be a representative of C. Distances between centroids can be used to calculate distances between clusters (Kowalski, 1997). Likewise, the distance between a cluster and an item (a document or a query) is the lowest distance between the item and one of the cluster centroids:

$$ d(D, C) = \min_{1 \le i \le k} d(D, N_i) \qquad (3) $$
Because the centroids are documents, the indices given by the IR system (similarities between query and documents) can be used to rank the clusters according to the query.
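The representatives, the cluster-to-item distance of equation (3), and the reallocation loop of section 2.1 can be sketched as follows (assumed, illustrative code; dist is the MinMax distance above and clusters are plain lists of documents):

```python
def representatives(members, dist, k=3):
    """The k documents closest to the cluster's geometric centre: smallest
    summed distance to the other members of the cluster (see Figure 4)."""
    return sorted(members,
                  key=lambda d: sum(dist(d, o) for o in members if o is not d))[:k]

def distance_to_cluster(doc, reps, dist):
    """d(D, C) = min over the representatives N_i of d(D, N_i)  (eq. 3)."""
    return min(dist(doc, n) for n in reps)

def reallocate(docs, clusters, dist, k=3, max_dist=1.0, max_iter=50):
    """K-Means-like loop of section 2.1: recompute the representatives at
    the start of each iteration, then move every document to its nearest
    cluster; documents farther than max_dist from every cluster stay
    unassigned. Stops when the memberships no longer change."""
    assignment = None
    for _ in range(max_iter):
        reps = [representatives(c, dist, k) for c in clusters]
        new_clusters = [[] for _ in clusters]
        new_assignment = {}
        for i, doc in enumerate(docs):
            # Distance to each cluster; empty clusters are skipped.
            dists = [distance_to_cluster(doc, r, dist) if r else float("inf")
                     for r in reps]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= max_dist:
                new_clusters[best].append(doc)
                new_assignment[i] = best
        if new_assignment == assignment:   # stable partition: stop
            break
        clusters, assignment = new_clusters, new_assignment
    return clusters
```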
Figure 4: Each cluster has 3 centroids at most

2.1.3. Initial partition
The result of this cluster-based method depends on the initial partition. Randomly assigning documents to clusters is a simple idea but not a good one: the resulting clusters are close to one another, their representatives are similar, and the number of iterations before convergence is too large. We have used a partial hierarchical method to obtain the initial partition; in this way, the quality of the final classification is improved.
To obtain the initial partition (a sketch in code follows this list):
(a) single-link: for each pair of documents i and j such that d(i,j) < threshold:
• if neither i nor j is yet in a class, create a new one containing both;
• if i and/or j are already allocated, merge all the documents of the class containing i with those of the class containing j;
(b) partial hierarchical classification: after step (a), the number of classes may be greater than the number of clusters wanted. So, as long as the number of classes exceeds the predefined one:
• compute the class representatives;
• compute the distances between every pair of classes (triangular matrix);
• merge the two closest classes.
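A minimal sketch of this initialization (assumed code, not the authors'); for brevity, step (b) here measures the distance between two classes as the smallest pairwise document distance, a simplification of the representative-based distance of section 2.1.2:

```python
import itertools

def initial_partition(docs, dist, threshold, n_clusters):
    """(a) single-link grouping of every pair closer than `threshold`,
    then (b) merging of the two closest classes until `n_clusters` remain.
    Documents that fall in no class are left to the K-Means iterations."""
    clusters = []        # list of sets of document indices
    cluster_of = {}      # document index -> the set that contains it
    # (a) single link
    for i, j in itertools.combinations(range(len(docs)), 2):
        if dist(docs[i], docs[j]) >= threshold:
            continue
        ci, cj = cluster_of.get(i), cluster_of.get(j)
        if ci is None and cj is None:          # neither is classified yet
            c = {i, j}
            clusters.append(c)
            cluster_of[i] = cluster_of[j] = c
        elif cj is None:                       # only i is classified
            ci.add(j); cluster_of[j] = ci
        elif ci is None:                       # only j is classified
            cj.add(i); cluster_of[i] = cj
        elif ci is not cj:                     # merge the two classes
            ci |= cj
            for m in cj:
                cluster_of[m] = ci
            clusters.remove(cj)
    # (b) partial hierarchical step
    def class_distance(a, b):
        return min(dist(docs[i], docs[j]) for i in a for j in b)
    while len(clusters) > n_clusters:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: class_distance(*pair))
        a |= b
        clusters.remove(b)
    return clusters
```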
2.2. Evaluation during Amaryllis'99
2.2.1. Browsing the clusters
To evaluate the quality of the classification, we can explore the clusters in different ways. We can first consider the best ranked cluster, which should contain most of the relevant documents. We can also look at the best ranked documents of each cluster, i.e. the documents of each topic that are the closest to the query. Lastly, we can present each cluster to the user so that he/she can choose those containing the largest number of relevant documents (Evans et al., 1998). Within the clusters, the documents are ranked according to the similarity values computed by SIAC during the initial search (the similarity employed is the cosine measure using TF-IDF weights).
We evaluate the methods described here by ranking the clusters according to their size and the number of relevant documents they contain (i.e. according to the global precision of the documents in the cluster). This can be done by using the lists of relevant documents supplied by the Amaryllis evaluation campaign promoters. This technique is commonly used to evaluate document classification for information retrieval (Hearst & Pedersen, 1996).
The quality of the document list produced after classification depends on the number of clusters. Indeed, K-Means-like methods require an a priori decision about the number of clusters. This choice is critical and not easy to make, even though we have shown that it can be computed effectively from the query size (Bellot & El-Bèze, 1999). We have also shown that the best results are not always obtained with a large number of clusters (at least when the number of clusters is not too large). During Amaryllis'99, the number of retrieved documents for each query was limited to 250. This number is small (smaller than for TREC), and the evaluation method we use (ranking clusters according to their precision) favors a large number of clusters. Figures 5 and 6 show the results obtained after classification, according to the number of clusters, and without classification. The quality of the results is similar to that reported in (Hearst & Pedersen, 1996) over English corpora.
[Figure 5 plot: precision (y-axis, 0 to 0.8) at recall levels 0 to 1 (x-axis) for 2 clusters, 5 clusters, 10 clusters, and without clustering.]

Figure 5: Precision at several recall levels (HKM)
[Figure 6 plot: precision at 10 documents (y-axis, 0.2 to 0.45) against the number of clusters (x-axis, 1 to 10).]
Figure 6: Precision at 10 documents according to the number of clusters (HKM) (1 cluster = no classification)

2.2.2. Random classification
In order to verify that the improvements are due to our classification methods and not merely to the sharing out of items into classes, the results obtained are compared with those of a random classification. Figure 7 shows that random classification (documents randomly shared out among 5 clusters) performs very poorly compared with our classification method.
[Figure 7 plot: precision (y-axis, 0 to 0.8) at recall levels 0 to 1 (x-axis) for the mean random classification with 5 clusters, for HKM with 5 clusters, and without classification.]

Figure 7: Random classification vs HKM
3. Unsupervised Decision Tree (UDT)
3.1. Introduction (supervised decision trees)
The pioneering work on the application of supervised decision trees² to natural language concerned probabilistic language modeling (Bahl et al., 1990). Decision trees were also employed to syntactically tag words according to the surrounding text (Black et al., 1992), and they were applied to the classification of newspaper articles into predefined classes (Crawford et al., 1991) –see Figure 8 and section 3.1.1–. In this paper, we deal with the use of unsupervised decision trees for classification and for the information retrieval task.

3.1.1. Categorization task
For the categorization task (a document has to be thematically tagged), supervised decision trees learn rules for classifying new documents from training data. The possible topics are predefined and may be, for example: politics, sports, arts, etc. The questions in the nodes of the tree may involve words or higher-level constituents. It could be interesting to cluster the set of retrieved documents into such global topics or categories (like Yahoo's categories). However, a more fine-grained classification would be more useful. For example, we would like to group together two documents dealing with the impact of sports results on French political life in 1998. With a classical categorization method, these documents could be tagged either "sports" or "politics" (or "French politics"?). But a document about sports and Italian political life is likely to be considered close to the previous ones. We would like to have a cluster about "sports and French politics in 1998" and a cluster about "sports and Italian politics". These new mixed topics are not predefined and should appear to users only if necessary (i.e. only if some retrieved document deals with them). In fact, the number of possible "mixed topics" is infinite. In our case (domain-independent information retrieval), the
² See (Breiman et al., 1984) for a general description of decision trees for classification.
set of possible topics is unknown. We cannot learn rules to assign documents because we do not have a set of documents dealing with each possible "mixed topic". Therefore, we have to use decision trees in a different way.
[Figure 8 diagram: a decision tree routing a new document through yes/no questions down to a leaf; each leaf carries a probability distribution pT1, pT2, pT3, pT4, … over the topics, e.g. the likelihood of topic T1.]
Figure 8: Document categorization (global learning)

3.1.2. Decision trees for the routing task
For information retrieval, each document has to be tagged as "relevant" or "non-relevant". From training data containing classified retrieved documents with information about their relevance (for each query or for each kind of query), it is possible to learn rules to decide whether a new document is relevant. The aim of the training is to estimate the relevance probability of a document given its topic –see Figure 9–. But this application of decision trees is possible only if we have training data. We cannot expect to obtain such training data because we cannot have a classified document set for each possible kind of query! So, we have to do without training data (see next section).
[Figure 9 diagram: a decision tree routing a new document through yes/no questions to topic leaves T1 … T4; each leaf carries a relevance likelihood, p and 1-p, e.g. for topic T1.]
Figure 9: Decision trees for the routing task

Note however that the supervised decision trees described here may be used for routing and filtering tasks. If the training data are the documents previously retrieved, manually evaluated and thematically tagged (the tags may be arbitrary –Topic 1, Topic 2, …– and do not have to be textual categories), a user profile can be defined.

3.1.3. Clustering by means of UDTs
Questions at any node are usually extracted from training data; the purpose is then to find the questions that best partition the set of documents. Despite the lack of training data, we can still carry out a classification by using decision trees. In our case, questions are chosen according to a selection criterion based on the relevance probability of documents. This probability can be computed in the same way as in a probabilistic IR-system (Croft & Harper, 1979). In fact, our classification consists in finding the text properties that allow documents to be clustered in such a way that the relevance
probabilities of documents are maximum for some clusters and minimum for others. These properties (for example, the words that occur or do not occur in the documents) must depend on the query, since a good classification must depend on the point of view the user has expressed in the query. Thus, our purpose is to group similar documents so as to obtain, according to the user's query, some relevant clusters (the relevant documents retrieved must be located in a few relevant leaves of the tree) and some non-relevant clusters.

3.2. Method to grow the UDTs
A new tree is automatically grown for each set of documents retrieved for a query. The root of a tree contains all the sentences from the retrieved documents. For the experiments described in this paper, the documents are retrieved by the SIAC system developed at LIA (Bellot, 2000). But since this classification is a post-processing step, it can be carried out after any IR-system.

3.2.1. Elements to grow the UDTs
To grow decision trees, one must define a set of possible yes/no questions to be applied to the items (here, the items are the sentences from the documents retrieved for a given query). A rule for selecting the best question at any node must be defined as well (Breiman et al., 1984).

3.2.2. Clustering the set of sentences
Most classification systems compute similarities between documents according to the words they have in common. A criterion to grow the tree could thus be the words occurring in the retrieved documents: at each node of the tree, we would have to select the word that best partitions the documents into two clusters (two child nodes). But a document usually deals with several topics, and these topics cannot be represented by a single word: a whole document may be represented by a set of words, not by a single word. A single word, however, may more easily represent the topic of a sentence. Consequently, we choose to cluster the set of sentences (from the retrieved documents) and not the documents themselves³. Note that the indexing methods we use (TF-IDF weighting scheme, cosine similarity in the vector space model) prevent us from really considering all the subjects dealt with in the texts. A better way to take all the themes into account is precisely to cluster sentences instead of documents as a whole. This choice is confirmed by the experiments we carried out over the Amaryllis'99 corpora. Figure 10 shows that UDTs work better when clustering sentences than when clustering documents⁴. The absolute decreases in precision at 5, 10 and 15 documents are 3.5%, 2.5% and 5% compared with the results obtained by clustering sentences (see 3.6 for an explanation of the evaluation method employed). Note that the previous classification method (HKM) cannot be used to cluster all the sentences from the retrieved documents (about 10,000 sentences for 250 documents) because of its computational complexity and because the units of text are too small for simple surface matching of words (the distance is not effective for sentences) –see (Hatzivassiloglou, Klavans & Eskin, 1999)–.
³ Moreover, the clustering of sentences is also used to segment documents (Bellot, 2000).
⁴ This is verified when the possible questions are composed of single words. If the questions were sets of words, regular expressions –see (Kuhn & De Mori, 1995)– or phrases, we think that UDTs could be used to cluster documents. This will be investigated in future work.
[Figure 10 plot: precision (y-axis, 0 to 0.7) at recall levels 0 to 1 (x-axis) without classification, with sentence classification (UDT), and with document classification (UDT).]
Figure 10: Document or sentence classification with UDTs

3.2.3. Yes/No questions
The questions in the tree ask whether the chosen word occurs in the sentences⁵. If a sentence in a node N contains the question-word w_N, it is sent to the child node N_YES; otherwise, it is sent to the child node N_NO. At any node, the word that best clusters the sentences into two new child nodes has to be selected. Any non-empty word from the original set of retrieved documents may be a question.

3.2.4. Rule for selecting the best questions
An obvious quality criterion for a cluster is the extent to which the sentences it contains are relevant to the user's query. We would like to obtain some leaves/clusters that contain only relevant sentences and some other leaves/clusters that contain only non-relevant sentences. In other words, we would like to group the sentences into highly relevant clusters and barely relevant clusters. One way of computing this cluster relevance is to calculate the probability that the query is produced by the sentences in the cluster. Cluster relevance may then be seen as a probability distribution with two modalities, p and 1-p, which respectively express "the query is produced by the sentences of this cluster" and "the query is not produced by the sentences of this cluster". In order to decide as reliably as possible whether the cluster is relevant, p must be as high or as low as possible. So, the questions that must be chosen are the ones that maximize (or minimize) the probabilities p of the child nodes. For the experiments described in this paper, the values of p are computed according to a unigram model. A better (but more complex) model will be used in future work.
Let q be the query, w_j the j-th word occurring in q, S_i a sentence in node N, and Z(w, S_i) the number of occurrences of word w in sentence S_i. The notation | | denotes the number of word occurrences in a set of sentences.
⁵ This binary dispatching is simple and could be extended: we could dispatch sentences according to the weight of the question-word (n-ary trees) and not only according to a boolean criterion.
p is defined as:

$$ p\Bigl(q = w_1, w_2, \ldots, w_n \,\Big|\, \bigcup_{i:\,S_i \in N} S_i\Bigr) = \prod_{j=1}^{n} p\Bigl(w_j \,\Big|\, \bigcup_{i:\,S_i \in N} S_i\Bigr) = \prod_{j=1}^{n} \frac{\sum_i Z(w_j, S_i)}{\Bigl|\bigcup_{i:\,S_i \in N} S_i\Bigr|} \qquad (4) $$
The cluster entropy H (equation 5) is defined according to the values of the modalities e_i of the probability distribution P:

$$ H(\text{cluster}) = \sum_i -P(e_i)\,\log P(e_i) \qquad (5) $$

In our case, the modalities are p and 1-p. So, the entropy of a cluster N is defined as:

$$ H_N = -p \log p - (1-p) \log(1-p) \qquad (6) $$
In accordance with the definition of entropy, the closer the values p and 1-p are, the higher the cluster entropy is (Jun et al., 1997). Thus, for a node N, the best question is the one that minimizes the new entropy values of the child nodes of N. In other words, the question must maximize the gain in entropy ΔH, defined as the difference between the entropy H_N of node N and the entropy H_{N+1} corresponding to the two child nodes of N:

$$ \Delta H = H_N - H_{N+1} \qquad (7) $$

The entropy H_{N+1} is the weighted average of the entropies of the subsets corresponding to the child nodes of N. It is defined from the entropy H_{YES,N} of child node N_YES and the entropy H_{NO,N} of child node N_NO. Since the probabilities depend on word frequencies, the weights are the sizes of the sentence sets sent to the child nodes relative to the total size of the sentences in N:

$$ H_{N+1} = \frac{\Bigl|\bigcup_{i:\,S_i \in N_{YES}} S_i\Bigr|}{\Bigl|\bigcup_{i:\,S_i \in N} S_i\Bigr|}\, H_{YES,N} + \frac{\Bigl|\bigcup_{i:\,S_i \in N_{NO}} S_i\Bigr|}{\Bigl|\bigcup_{i:\,S_i \in N} S_i\Bigr|}\, H_{NO,N} \qquad (8) $$
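Before the algorithm is spelled out in the next subsection, here is a minimal sketch of the computations of equations (4) to (8) (assumed, illustrative Python; a sentence is a list of words, a node is a list of sentences):

```python
import math
from collections import Counter

def query_likelihood(query_words, sentences):
    """Unigram estimate of p (eq. 4): product over the query words of their
    relative frequency in the node's sentences (zero for unseen words)."""
    bag = Counter(w for s in sentences for w in s)
    total = sum(bag.values())
    p = 1.0
    for w in query_words:
        p *= bag[w] / total if total else 0.0
    return p

def entropy(p):
    """H = -p log p - (1-p) log(1-p)  (eq. 6); zero at the boundaries."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def entropy_gain(query_words, sentences, word):
    """Gain ΔH (eq. 7) of the question 'does `word` occur?': the parent
    entropy minus the size-weighted entropies of the two children; the
    weights are word counts, as in equation (8)."""
    yes = [s for s in sentences if word in s]
    no = [s for s in sentences if word not in s]
    n_yes, n_no = sum(len(s) for s in yes), sum(len(s) for s in no)
    n = n_yes + n_no
    h_parent = entropy(query_likelihood(query_words, sentences))
    h_children = (n_yes / n) * entropy(query_likelihood(query_words, yes)) + \
                 (n_no / n) * entropy(query_likelihood(query_words, no))
    return h_parent - h_children
```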
3.2.5. Algorithm
The figures below give a version of the algorithm we use, inspired by the one described in (Kuhn & De Mori, 1995); a sketch in code is given at the end of section 3.3.

Main:
• segment the retrieved documents into a set of sentences;
• create a new tree;
• put all the sentences in the root;
• compute the initial entropy value;
• expand the root (see Figure 12);
• the leaves are the final clusters.

Figure 11: Main program
Expansion of a node N:
• find the best question (see Figure 13);
• if the stop condition is not verified (see 3.6.3):
  • assign to node N_YES the sentences in which the question-word occurs;
  • assign to node N_NO the sentences in which the question-word does not occur;
  • expand node N_YES;
  • expand node N_NO.

Figure 12: Expansion of a node

Find question (for a node N):
• for each possible question (any word occurring in the sentences of the root may be a question):
  • draw up the list of sentences in N in which the question-word occurs,
  • draw up the list of sentences in N in which the question-word does not occur,
  • compute the entropy values H_YES,N and H_NO,N (see equation 6),
  • compute the average entropy H_{N+1} from H_YES,N and H_NO,N (see equation 8),
  • compute the change in entropy for the question (see equation 7);
• choose the question that maximizes the decrease in entropy.

Figure 13: Find the best question

3.3. Example
At the end of the tree growing, the sentences are distributed in the leaves. Assuming that each leaf corresponds to a particular topic (or rather to a mixed topic), a thematic classification of the items has been completed: each leaf of the tree is a cluster. Figure 14 shows the tree grown for Amaryllis'99 query 23 (query set "OT1") and the documents retrieved from corpus "OFIL OD1". This query is about the political situation in Cambodia: SIAC had to find documents identifying the political forces and describing their attitude towards the Vietnamese minority. The query words are: régime (regime), Phnom Penh, PPC, conseil (council), national, suprême (supreme), CNS (Supreme National Council), pouvoir (power), royal, khmer, rouge (red), APRONUC, immigré (immigrant), vietnamien (Vietnamese), accord (agreement), Paris, autorité (authority), nation, uni (united), FUNCINPEC. The 12,380 sentences from the 250 retrieved documents are partitioned into 6 clusters, F1 to F6. Figure 14 shows the chosen questions, the number of sentences belonging to each cluster and the corresponding number of documents (# of sentences → # of docs). We remark that the majority of the sentences end up in leaf F6. This can be explained by the fact that most sentences contain none of the chosen question-words: F6 is the leaf of the sentences that are not really clustered (the sentences in F6 are not thematically close). This is not a major drawback of our method since the other clusters are good⁶: the other leaves contain sentences that are thematically close and often relevant. Leaves F1 to F5 contain the sentences that answered "yes" to at least one question. It can be seen that the chosen question-words are keywords for the query topic and are often the most important words of the query. For example, F1 contains the sentences in which the words "CNS" and "khmer" occur. The other question-words appear to be semantically linked with the topic of the query (Pailin is a city in the south of Cambodia). The 15 sentences in F1 belong to 10 documents; among these, 7 have been judged relevant. F2 contains 5 sentences, each belonging to a relevant document. Note that the complete set of retrieved documents is represented in F6 (250 documents); among them, 40 documents are relevant. In other words, 25% of the relevant retrieved documents are represented by leaves F1 and F2, which contain only 0.15% of the retrieved sentences. This result confirms the capacity of our classification method for isolating relevant documents.
⁶ However, a different way of choosing the questions (or more complex questions) could allow the sentences of F6 to be partitioned.
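The growing procedure of Figures 11 to 13 can be sketched as a simple recursion (assumed code, reusing entropy_gain from the previous sketch; min_gain is the stop threshold discussed in 3.6.3):

```python
def grow_udt(query_words, sentences, min_gain=0.001):
    """Grow an unsupervised decision tree: at each node pick the word with
    the best entropy gain, split the node's sentences on its presence, and
    recurse. A node stops expanding when it holds a single sentence or when
    no question gains at least min_gain. Returns the leaves (clusters)."""
    vocabulary = {w for s in sentences for w in s}
    def expand(node):
        if len(node) <= 1:
            return [node]
        best_word, best_gain = None, min_gain
        for w in vocabulary:
            g = entropy_gain(query_words, node, w)
            if g > best_gain:
                best_word, best_gain = w, g
        if best_word is None:                  # stop criterion (see 3.6.3)
            return [node]
        yes = [s for s in node if best_word in s]
        no = [s for s in node if best_word not in s]
        if not yes or not no:                  # the question does not split
            return [node]
        return expand(yes) + expand(no)
    return expand(sentences)
```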
[Figure 14 diagram: the decision tree grown for query 23. The internal nodes ask the questions "CNS?", "khmer?", "Pailin?", "Cambodge?" and "socialiste?"; the leaves are the clusters, with F1: 15 sentences → 10 documents, F2: 5 → 3, F3: 2 → 2, F4: 3 → 3, F5: 51 → 15, and F6: 12,304 → 250.]
Figure 14: The tree grown for Amaryllis query 23 (number of sentences → number of different documents)

3.4. Content of clusters
The previous classification method (HKM) did not allow the clusters to be easily described. This is not the case with UDTs: the clusters obtained may be represented by boolean phrases. For a given cluster (i.e. a given leaf), the phrase is the conjunction of the question-words chosen from the root down to this cluster (words occurring or not occurring in the sentences). By using AND and NOT operators, users can understand what the important words of each cluster are. For example, considering query 23 (Figure 14), leaf F3 may be represented by the phrase:

(NOT CNS) AND Pailin AND Cambodia AND Socialist

The union of all the phrases (each one corresponding to a cluster) may be seen as a logical interpretation of the natural language query used by SIAC.
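As a small illustration of these boolean descriptions, the phrase of a leaf can be read off the path from the root (assumed code; the path of leaf F3 below is read off Figure 14 as given in the text):

```python
def leaf_phrase(path):
    """Boolean description of a leaf: the conjunction of the question-words
    on the path from the root, negated when the 'no' branch was taken.
    `path` is a list of (question_word, answered_yes) pairs."""
    return " AND ".join(w if yes else f"(NOT {w})" for w, yes in path)

# Leaf F3 of Figure 14:
print(leaf_phrase([("CNS", False), ("Pailin", True),
                   ("Cambodia", True), ("Socialist", True)]))
# -> (NOT CNS) AND Pailin AND Cambodia AND Socialist
```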
3.5. Several lists of answers for users
We can provide users with at least three lists of answers. The first one is the original list of documents retrieved by SIAC. The second is a list of sentences ranked by relevance; it is built and ranked by computing similarities between the clusters, the sentences and the query. This list may help users by showing relevant information in a few words, and users can access the document as a whole from the sentence they choose. The third list is a list of documents ranked according to the clusters obtained (a re-ranking of the retrieved documents). This new ranking could be carried out according to computed similarities between the clusters and the query. In our case, the clusters are ranked according to their global precision (for a cluster, the global precision is the proportion of relevant documents it contains). We consider that a document belongs to a cluster if at least one of its sentences belongs to this cluster. To avoid presenting a document several times, we have decided that if a document belongs to several clusters, only its membership in the best ranked cluster is taken into account. Because of this, the final number of clusters may be lower than the number of leaves in the tree. Within the clusters, the documents are ranked according to the similarity values computed by SIAC during the initial search (the similarity employed is the cosine measure using TF-IDF weights). The results reported in this section are obtained from the list composed of the documents of the first ranked cluster, followed by those of the second one, and so on. This list can be evaluated like any list produced by an IR-system since it is a list of documents. Finally, we can propose to users a new list for each cluster obtained; in this way, a document may belong to several clusters.

3.6. Evaluation during Amaryllis'99
The lists of answers supplied by the Amaryllis promoters are composed of documents. Thus, we cannot evaluate the clusters directly, because lists of relevant sentences are not provided. We can either evaluate each cluster independently or evaluate the list of documents produced from the clusters (the third list in the previous section)⁷.

3.6.1. Clusters evaluated independently
Our experiments show that we obtain one high-precision cluster for 20 of the 26 queries of Amaryllis set OT1. For a given query, if the best cluster contains R documents (for us, the best cluster is the one with the highest precision among all the clusters), the precision of this cluster is higher than the precision of the original list with a cut-off of R documents. For query 20, 50% of the relevant documents (11 out of 22) belong to one cluster containing only 13 documents. For another query, all the relevant documents retrieved by SIAC are grouped in one cluster. Lastly, for query 21, 5 of the 14 relevant retrieved documents are in one cluster that contains no other document.

3.6.2. Evaluation of the new list of documents
The precision levels of the best clusters containing more than 15 documents mainly explain the improvement reported in Figure 15 (see section 3.6.3) and in Table 1. In this figure and table, the evaluated list is composed of the documents from the clusters ranked according to their precision values (see section 3.5). For this experiment, the threshold value chosen as the stop criterion (see below) is 0.001.

                          Precision     Precision      Precision    Precision    Precision    Average
                          at recall 0   at recall 0.1  at 5 docs    at 10 docs   at 15 docs   precision
  without classification  0.63          0.47           0.38         0.31         0.29         0.235
  UDTs                    0.75 (+12%)   0.55 (+8%)     0.48 (+10%)  0.36 (+5%)   0.32 (+3%)   0.26 (+2.5%)
                          α = 0.004     α = 0.18       α = 0.04     α = 0.02     α = 0.2
Table 1: Results with UDTs or without classification⁸

3.6.3. Stop criterion
For a given node, the conditions that stop expansion are verified when the node contains only one sentence or when the maximal gain in entropy is below a fixed threshold. Note that the gain is always positive or zero; it is zero if and only if the class distribution before and after the partition according to a chosen question remains identical (Fayyad, 1994). The lower the threshold, the higher the final number of clusters (more node expansions are performed). Figure 15 shows the impact this value can have (see section 3.6 for more details on the evaluation method and for more results). A way to choose an efficient threshold value has yet to be found. For K-Means-like algorithms, we have proposed choosing the number of clusters according to the size of the query –see (Bellot & El-Bèze, 1999)–. With this method, better results were obtained than with a constant number fixed for all queries.
⁷ Note that the time required to carry out the classification for all the Amaryllis queries (set OT1 or set OT2) is about 2 minutes (PowerPC 750, 300 MHz, 128 MB RAM). For a study of the computational complexity of decision trees, see (Kuhn, 1993).
⁸ The significance of the improvement is evaluated by means of a Wilcoxon-Mann-Whitney test –see (Keen, 1992; Saporta, 1990, p. 345; Bellot, 2000, p. 38)–.
This can be explained by the fact that the topics of the retrieved documents are more varied, and therefore more ambiguous, when the query is short (in that case, more clusters are needed). In the same way, the threshold value could be chosen according to the query.

[Figure 15 plot: precision (y-axis, 0 to 0.5) at 5, 10, 15, 20, 30, 100, 200 and 500 documents (x-axis) without classification, with threshold 0.02 (UDT), and with threshold 0.001 (UDT).]
Figure 15: Precision according to the minimal change in entropy (a threshold value as a stop criterion)

3.6.4. More complex questions for UDTs
Figure 16 shows that the results of the classification by means of UDTs are significantly improved when the questions are composed of two words. These new questions may be expressed as: "which sentences contain word x and word y?". This kind of question helps resolve some ambiguities; it is a well-known technique of semantic disambiguation –see for example (Smadja, 1989; Brown & Chong, 1998) for the use of lexical affinities–. For example, if the question chosen to partition a tree node is "nation" alone, the sentences containing this word may deal with any nation. On the other hand, if the question is "nation and united", the sentences in the "yes" cluster are likely to be about the "United Nations". By requiring that at least one of the two words occurs in the query, the number of possible questions is reduced, and this technique resolves some ambiguities in the words of the query (this is the case for the words "nation" and "united" occurring in query 23).

[Figure 16 plot: precision (y-axis, 0 to 0.5) at 5, 10, 15, 20, 30, 100, 200 and 500 documents (x-axis) without classification, with questions composed of a couple of words, and with questions composed of only one word.]
Figure 16: Questions composed of one or two words (the stop criterion value employed here is not the optimal one we have found; see more details about this criterion in 3.6.3)
For this first experiment with "complex questions", the order and the proximity of the two words in the sentences are not taken into account. However, the results obtained encourage us to use this kind of question in future work. Figure 16 shows that the absolute improvements in precision at 5, 10 and 15 documents are 5.5%, 5.5% and 3%.
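The two-word questions of this section can be generated as follows (a sketch under the assumptions stated in the text: at least one of the two words must occur in the query, and a sentence answers "yes" when it contains both words, order and proximity being ignored):

```python
from itertools import combinations

def pair_questions(query_words, sentences):
    """Candidate two-word questions: every pair of vocabulary words such
    that at least one of the two occurs in the query (see section 3.6.4)."""
    vocabulary = sorted({w for s in sentences for w in s})
    query_words = set(query_words)
    return [(w1, w2) for w1, w2 in combinations(vocabulary, 2)
            if w1 in query_words or w2 in query_words]

def answers_yes(sentence, question):
    """A sentence is sent to the 'yes' child when it contains both words."""
    w1, w2 = question
    return w1 in sentence and w2 in sentence
```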
4. Conclusion
The methods presented here yield significant improvements compared with a search without classification. A reliable comparison of the results obtained by means of HKM and by means of UDTs is a difficult task: the quality of the document list produced after classification depends on the number of clusters⁹. This effect could be reduced by clustering more than 250 documents for each query (250 is the number required during Amaryllis). We have shown –see (Bellot & El-Bèze, 1999)– that the best results are not always obtained with a large number of clusters (at least when the number of clusters is not too large). Here, the number of documents is too small, and we can say that the greater the number of clusters, the better the results.

[Figure 17 plot: precision at 10 documents (y-axis, 0.25 to 0.45) for UDTs and for HKM with different numbers of clusters.]
Figure 17: UDTs or HKM (precision at 10 documents for different numbers of clusters)
                                        Precision     Precision      Precision    Precision    Precision
                                        at recall 0   at recall 0.1  at 5 docs    at 10 docs   at 15 docs
  without classification                0.63          0.47           0.38         0.31         0.29
  UDTs                                  0.75 (+12%)   0.55 (+8%)     0.48 (+10%)  0.36 (+5%)   0.32 (+3%)
  Hierarchical + K-Means (6 clusters)   0.65 (+2%)    0.51 (+4%)     0.37 (-1%)   0.34 (+3%)   0.32 (+3%)
  Hierarchical + K-Means (11 clusters)  0.75 (+12%)   0.63 (+16%)    0.51 (+13%)  0.44 (+13%)  0.4 (+5%)
  Random classification (11 clusters)   0.71 (+8%)    0.47 (=)       0.35 (-3%)   0.25 (-6%)   0.2 (-9%)

Table 2: Summary of results
⁹ If we have as many clusters as documents, ranking the clusters according to their precision leads to the optimal list of the retrieved documents.
By using UDTs, the average number of clusters is 11 (except for query 9, which produces 187 clusters). If three other queries are also excluded, the mean number of clusters equals 7. Moreover, since a document can appear only once in the list (see 3.6), some clusters containing only sentences from documents already mentioned are excluded. Thus, the final number of clusters when employing UDTs may be lower than the quoted average value. Figures 17 and 18 show that UDTs perform like HKM using 7 or 8 clusters (this is the case when using one-word questions; when using more complex questions, UDTs outperform HKM). Thus, we could think that HKM performs slightly better than UDTs. However:
• decision trees are much faster (when using one-word questions);
• they allow a better understanding of the clusters (see 3.4);
• the results depend on the threshold value chosen as the stop criterion (see 3.6.3); the results shown here are obtained with a threshold of 0.001, which may not be the optimal value;
• UDTs cluster sentences whereas HKM clusters documents. The lists of sentences were not evaluated (we do not have any list of relevant sentences), but they are a new and very interesting kind of answer to propose to users;
• UDTs may still be greatly improved;
• finally, UDTs can be used in interactive information retrieval to quickly cluster several hundred documents.

[Figure 18 plot: precision (y-axis, 0 to 0.8) at recall levels 0 to 1 (x-axis) without classification, with UDT, with HKM and 6 clusters, and with HKM and 8 clusters.]
Figure 18: UDTs or Hierarchical and K-Means-like algorithms (HKM) (precision at different recall levels)
5. References
Allen, R.B., Obry, P., Littman, M. (1993). An interface for navigating clustered document sets returned by queries. In Proceedings of COCS (p. 166).
Bahl, L., Brown, P., de Souza, P., Mercer, R. (1990). A tree-based statistical language model for natural speech recognition. In Readings in Speech Recognition (pp. 507–514). A. Waibel & K.-F. Lee (Eds.), Morgan Kaufmann.
Bellot, P. & El-Bèze, M. (1999). Query Length, Number of Classes and Routes through Clusters: Experiments with a Clustering Method for Information Retrieval. In Proceedings of IEEE ICSC'99 (pp. 196–205), Hong Kong. In Lecture Notes in Computer Science (LNCS 1749, "Internet Applications"), L. Chi-Wong Hui & D. L. Lee (Eds.). Springer-Verlag.
Bellot, P. (2000). Méthodes de classification et de segmentation locales non supervisées pour la recherche documentaire. Thèse de Doctorat en Informatique, Université d'Avignon, France. (http://www.lia.univ-avignon.fr/personnel/BELLOT/Recherche/biblioperso.html)
Black, E., Jelinek, F., Lafferty, J., Mercer, R., Roukos, S. (1992). Decision tree models applied to the labeling of text with parts of speech. In Proceedings of the 1992 DARPA Speech and Natural Language Workshop (pp. 117–121). Morgan Kaufmann.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and Regression Trees. Belmont, CA, USA: Wadsworth.
Brown, G., Chong, H.A. (1998). The GURU System in TREC-6. In Proceedings of the Sixth Text REtrieval Conference TREC-6 (pp. 535–540). Gaithersburg, MD, USA (November 1997). NIST special publication 500-240.
Crawford, S., Fung, R., Appelbaum, L., Tong, R. (1991). Classification trees for information retrieval. In Proceedings of the Eighth International Workshop on Machine Learning. Northwestern University, Illinois.
Croft, W.B., Harper, D.J. (1979). Using probabilistic models of retrieval without relevance information. Journal of Documentation, vol. 35, n° 4 (pp. 285–295).
Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval (pp. 318–329). Copenhagen, Denmark.
De Loupy, C., Bellot, P., El-Bèze, M., Marteau, P.-F. (1999). Query Expansion and Automatic Classification. In Proceedings of the Seventh Text REtrieval Conference TREC-7 (pp. 443–450). Gaithersburg, MD, USA: NIST special publication 500-242.
Diday, E., Lemaire, J., Pouget, J., Testu, F. (1982). Eléments d'Analyse des Données. Dunod Informatique.
Evans, D.A., Huettner, A., Tong, X., Jansen, P., Bennett, J. (1998). Effectiveness of Clustering in Ad-Hoc Retrieval. In Proceedings of the Seventh Text REtrieval Conference (TREC-7) (pp. 143–148). NIST special publication 500-242.
Fayyad, U.M. (1994). On the Induction of Decision Trees for Multiple Concept Learning. PhD Thesis, University of Michigan, USA.
Hatzivassiloglou, V., Klavans, J.L., Eskin, E. (1999). Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. In Proceedings of Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP'99). MD, USA.
Hearst, M.A. & Pedersen, J.O. (1996). Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of ACM-SIGIR 96 (pp. 76–82).
Jun, B.H., Kim, C.S., Song, H.Y., Kim, J. (1997). A New Criterion in Selection and Discretization of Attributes for the Generation of Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 19, n° 12 (pp. 1371–1375).
Keen, M. (1992). Presenting results of experimental comparisons. Information Processing and Management, vol. 28 (pp. 491–502).
Kuhn, R. (1993). Keyword classification trees for speech understanding systems. PhD Thesis, McGill University, Montreal, Canada.
Kuhn, R. & De Mori, R. (1995). The Application of Semantic Classification Trees to Natural Language Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 17, n° 5 (pp. 449–460).
Landi, B., Kremer, P., Schibler, D., Schmitt, L. (1998). Amaryllis: an evaluation experiment on search engines in a French-speaking context. In Proceedings of the First International Conference on Language Resources & Evaluation (LREC) (pp. 1211–1214). Granada, Spain.
Lespinasse, K., Kremer, P., Schibler, D., Schmitt, L. (1999). Evaluation des outils d'accès à l'information textuelle, les expériences américaine (TREC) et française (Amaryllis). Langues, John Libbey, vol. 2, n° 2 (pp. 100–109).
Sahami, M., Yusufali, S., Baldonado, M.Q.W. (1998). SONIA: a service for organizing networked information autonomously. In Proceedings of ACM Digital Libraries (pp. 200–209).
Saporta, G. (1990). Probabilités, analyse des données et statistique. Editions Technip, Paris. ISBN 2-7108-0565-0.
Schütze, H., Silverstein, C. (1997). Projections for Efficient Document Clustering. In Proceedings of the ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 74–81). Philadelphia, USA.
Silverstein, C., Pedersen, J.O. (1997). Almost-Constant-Time Clustering of Arbitrary Corpus Subsets. In Proceedings of the ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 60–66). Philadelphia, USA.
Smadja, F.A. (1989). Lexical co-occurrence: the missing link. Journal of the Association for Literary and Linguistic Computing, vol. 4, n° 3.
Spriet, T., El-Bèze, M. (1999). Introduction of Rules into a Stochastic Approach for Language Modelling. In Computational Models of Speech Pattern Processing, NATO ASI Series F, K.M. Ponting (Ed.), vol. 169 (pp. 350–355).
Van Rijsbergen, C.J. (1979). Information Retrieval. Butterworths, London.
Voorhees, E.M., Harman, D. (1999). Overview of the Seventh Text REtrieval Conference. In Proceedings of the Seventh Text REtrieval Conference TREC-7 (pp. 1–23). Gaithersburg, MD, USA (November 1998). NIST special publication 500-242.