Document Clustering using 3-tuples

Kanagasabai Rajaraman and Hong Pan
Kent Ridge Digital Labs
21 Heng Mui Keng Terrace, Singapore
email: {kanagasa,panh}@krdl.org.sg
http://textmining.krdl.org.sg

Abstract. We address the document clustering problem using a new representation called 3-tuples. We describe the 3-tuples representation and define a novel document similarity measure based on this representation. Using this measure, a new document clustering algorithm is proposed. We analyze the performance of our algorithm empirically using an entropy-based cluster evaluation setup. We also present a comparative study with existing methods.

1 Introduction

Document clustering is the task of grouping similar documents together from a given collection, under the assumption that no a priori knowledge of the grouping is available. This assumption differentiates it from document classification, where class/category labels are provided. Document clustering has applications in many scenarios, such as organizing query-search results, navigating large text collections, and event tracking. Various types of clustering algorithms have been discussed in the literature [9, 13, 4]. Our approach is based on a flat clustering algorithm that produces clusters via an intermediate computation of a similarity measure, and we draw inspiration from both linguistic and statistical approaches.

Clustering via a computed similarity measure is a widely adopted method [9, 13, 4], and several similarity measures are in vogue in the literature [9, 13, 14]. The most widely used among them is the cosine measure, owing to its simplicity and practical utility [8]. However, since the term-vector representation assumes that the terms are independent and treats each text entity simply as a `bag of terms', the measure cannot accurately capture the degree of similarity. Many more vector-based similarity measures have been proposed in the literature [9, 13], but they share the same disadvantage as the cosine measure. Another class of measures makes use of the co-occurrence of term pairs to compute the similarity [10, 7, 11, 6]. Co-occurrence based measures capture some of the term dependencies, but they can be used only if a sufficiently large text collection is available for the domain or if a co-occurrence thesaurus is already available.

US Patent No. 5,297,039 to Kanaegami et al. [5] proposes a similarity measure using syntactic relations between terms. Text is first parsed to extract an `analysis network' consisting of triplets of the form `(relation, element 1, element 2)'. The elements correspond to nouns, and the relation is a term (usually a verb) syntactically close to elements 1 and 2. Similarity is measured by a suitably weighted sum of term agreements, pair agreements and line agreements between the corresponding analysis networks. Since the relations are themselves terms extracted from the text, this method does not overcome the synonymity problem, i.e. the term, pair and line agreements cannot be calculated effectively.

In our work, we propose a method for effectively measuring text similarity based on a new representation called 3-tuples. It is often remarked in the clustering literature that the choice of similarity measure does not appreciably improve clustering performance. This is true as long as the document representation is fixed, since all measures then use the same information. Our results in this paper can be taken as evidence that novel similarity measures based on richer document representations can indeed lead to significant performance improvements in clustering.
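For concreteness, here is a minimal sketch of the cosine baseline over a bag-of-terms representation; the whitespace tokenization and the toy sentences are our own illustration, not from the paper.

```python
import math
from collections import Counter

def cosine_similarity(doc1: str, doc2: str) -> float:
    """Cosine measure over raw term-frequency vectors (bag of terms)."""
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

# No terms are shared, so the score is 0.0 even though the meaning is close:
print(cosine_similarity("firm bought supplier", "company acquired vendor"))
```

This illustrates the weakness noted above: with no term overlap the cosine measure reports zero similarity, regardless of how close the meanings are.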

2 3-tuples Representation

A 3-tuple is a syntactic tuple of the form `relation-term1-term2', where term1 and term2 are terms extracted from the text and relation is one of a predefined set of relationships (see, for example, [3]). The 3-tuples are extracted through a three-step process consisting of text pre-processing, morphological analysis and structural analysis, adopted from [2, 3, 12]. At the end, the edges of the directed acyclic graph constructed during structural analysis are extracted and output as the 3-tuples.

Though the 3-tuples appear to be identical to the triplets used in [5], they differ in the way the relations are defined. The relations in 3-tuples are generic and come from a fixed pool of relations. Hence, even when the same "text meaning" is expressed in different surface forms, it is normalized into the same set of 3-tuples [3]. This feature enables measuring the 3-tuple agreements more effectively. As our similarity computation is a patent-pending method, it cannot be described in this paper. We now describe our clustering algorithm.
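As a purely illustrative sketch of why this normalization helps (the relation labels `agent` and `object` are our own invention, not the actual relation pool of [3], and the real similarity computation is patent-pending), agreement between normalized 3-tuple sets can be gauged by a simple set overlap:

```python
# Hypothetical 3-tuples for two surface forms of the same statement:
#   "The company acquired the supplier."
#   "The supplier was acquired by the company."
# Both normalize to the same tuples under a fixed relation pool.
active_form  = {("agent", "acquire", "company"),
                ("object", "acquire", "supplier")}
passive_form = {("agent", "acquire", "company"),
                ("object", "acquire", "supplier")}

# Jaccard overlap of the tuple sets is maximal despite different wording.
overlap = len(active_form & passive_form) / len(active_form | passive_form)
print(overlap)  # 1.0
```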

3 Clustering Algorithm

Our clustering algorithm is based on a graph-theoretic method, i.e. the clusters are defined in terms of the connected components of a graph generated using the similarity measure. The basic graph-theoretic algorithm may leave too many documents unclustered if the threshold is not chosen small enough [13]. To overcome this problem, we propose below a modified algorithm, called the Recursive Clustering Algorithm.

Definition 1. A connected component is a set of nodes such that each node is connected to at least one other member of the set, and the set is maximal with respect to this property.

Recursive Clustering Algorithm: Consider a set of documents, say D, to be clustered.

Step 1: Compute the similarity of each pair of documents and construct a similarity matrix.
Step 2: Fix a threshold TH and a decay factor δ ∈ (1, ∞).
Step 3: Repeat
    Step 3.1 - Represent each document in D as a node and draw a graph by connecting two nodes if their similarity is above the threshold TH.
    Step 3.2 - Output all connected components of the graph as the clusters generated.
    Step 3.3 - Set the document set D to the documents corresponding to the isolated nodes, i.e. the unclustered documents.
    Step 3.4 - Set TH to TH/δ.
Until D is small enough.
Step 4: Output the remaining documents in D as the last cluster.

It can be seen that Steps 1 and 2 are the initialization steps and Step 3 corresponds to the iteration. In Steps 3.1 and 3.2, the algorithm constructs the clusters for the current iteration, as in Steps 3 and 4 of the one-pass algorithm. In Step 3.3, the document set is revised to those documents that remain unclustered. Step 3.4 is the iterative step, where the threshold is reduced by an amount determined by the decay factor; the smaller threshold allows more documents to join connected components in the graph. The loop stops when only very few unclustered documents remain, and these are output as the last cluster in Step 4. By using our patent-pending similarity measure in Step 1 of the Recursive Clustering Algorithm, we get the 3-tuples based clustering algorithm.
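For concreteness, here is a minimal sketch of the Recursive Clustering Algorithm in Python. The pairwise similarity function is left abstract (the paper's 3-tuples measure is patent-pending), and the `min_size` stopping threshold, the `max_rounds` guard, and the use of NumPy are our own assumptions.

```python
import numpy as np

def recursive_clustering(docs, similarity, th=0.5, delta=1.5,
                         min_size=5, max_rounds=20):
    """Recursive Clustering Algorithm: graph-theoretic clustering
    with a geometrically decaying threshold (delta > 1)."""
    # Step 1: pairwise similarity matrix.
    sim = np.array([[similarity(a, b) for b in docs] for a in docs])
    active = list(range(len(docs)))   # indices of still-unclustered documents
    clusters = []
    rounds = 0
    # Step 3: repeat until D is small enough (max_rounds is our own guard
    # against documents with zero similarity to everything).
    while len(active) > min_size and rounds < max_rounds:
        rounds += 1
        # Step 3.1: connect two active nodes if their similarity exceeds TH.
        adj = {i: [j for j in active if j != i and sim[i, j] > th]
               for i in active}
        # Step 3.2: output connected components of size >= 2 (Definition 1)
        # via depth-first search; isolated nodes are skipped.
        seen = set()
        for i in active:
            if i in seen or not adj[i]:
                continue
            comp, stack = [], [i]
            while stack:
                u = stack.pop()
                if u not in seen:
                    seen.add(u)
                    comp.append(u)
                    stack.extend(adj[u])
            clusters.append(comp)
        # Step 3.3: D becomes the unclustered (isolated) documents.
        active = [i for i in active if i not in seen]
        # Step 3.4: TH <- TH / delta, so the next round admits more edges.
        th /= delta
    # Step 4: the remaining documents form the last cluster.
    if active:
        clusters.append(active)
    return clusters
```

With the 3-tuples similarity plugged in as `similarity`, this corresponds to the 3-tuples clustering algorithm; with the cosine measure it becomes the 1-tuple baseline of Section 4.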

4 Performance Analysis

In this section, we analyze the performance of the 3-tuples clustering algorithm presented above. For comparison purposes, we also implement two more clustering algorithms:

- the cosine measure in Step 1 of the Recursive Clustering Algorithm (as a representative of keyword-based methods)
- the context-vector based clustering algorithm [11] (as a representative of co-occurrence based methods)

For convenience, we will call these the 1-tuple clustering algorithm and the 2-tuples clustering algorithm, respectively. We first describe our benchmarking setup and then present the empirical results.

4.1 Benchmarking Setup

We use an entropy measure to evaluate the quality of the clusters generated. The measure we use is adapted from [1]. Basically, each document is assumed to have a known category label (which is masked during the clustering process). The entropy of a given cluster C is defined by

    e_C = - \sum_i \frac{c(i,C)}{\sum_j c(j,C)} \log \frac{c(i,C)}{\sum_j c(j,C)}    (1)

where c(i,C) is the number of times label i occurs in cluster C. The entropy of a cluster is zero if all the documents in the cluster belong to the same category.

Otherwise it is positive. The total entropy is computed as the weighted sum of the individual cluster entropies. Suppose the number of documents in cluster C is n_C, for C = 1, ..., m; then

    e_{total} = \frac{1}{m} \sum_{C=1}^{m} e_C \cdot n_C    (2)
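As a minimal sketch of this evaluation, assuming clusters are given as lists of their documents' gold category labels (the natural logarithm is our assumption, since the paper does not state a log base, and the 1/m normalizer follows equation (2) as reconstructed above):

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Equation (1): entropy of one cluster from its gold category labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(labels).values())

def total_entropy(clusters):
    """Equation (2): size-weighted sum of cluster entropies, scaled by 1/m."""
    m = len(clusters)
    return sum(cluster_entropy(c) * len(c) for c in clusters) / m

# A pure cluster contributes 0; a mixed one raises the total entropy.
print(total_entropy([["gnp", "gnp", "gnp"], ["sugar", "livestock"]]))
```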

We will use this measure to evaluate the clustering performance. For our experiments, we use subsets of the Reuters-21578¹ collection. The datasets were created by considering only documents carrying one of a particular subset of class labels from the Reuters collection. Specifically, we use the following three datasets:

Dataset  #Docs  #Words  Categories(#Docs)                            Baseline Entropy
D1         461    4836  livestock(114), gnp(163), sugar(184)                   0.4689
D2        1426    5859  cpi(112), interest(513), money-fx(801)                 0.3872
D3        1002    7513  coffee(145), dlr(176), gnp(163), gold(135),            0.8300
                        nat-gas(130), sugar(184), yen(69)

Table 1. Dataset Details

4.2 Empirical Results

In our experiments, we varied the threshold and decay-factor parameters and computed the entropy for every run. The best entropy values are tabulated below. On all three datasets, 3-tuples clustering consistently performed better than 1-tuple clustering; on D2 (containing 1426 documents), it outperformed 1-tuple clustering by 16.69%. Compared to 2-tuples clustering, 3-tuples clustering was better on the larger datasets, while on D1 the two performed almost equally.

¹ The Reuters-21578 collection is publicly available from David Lewis: http://www.research.att.com/~lewis

Dataset  1-tuple     2-tuple     3-tuple     Improvement   Improvement
         clustering  clustering  clustering  over 1-tuple  over 2-tuple
D1        0.032437    0.030573    0.031385       3.24%        -0.02%
D2        0.061149    0.055432    0.050943      16.69%         8.09%
D3        0.051647    0.050958    0.048562       5.97%         4.70%

Table 2. Best Entropy Results

In fact, we observe that 3-tuples clustering does better on larger datasets. The worse performance on smaller datasets could be due to noise in the 3-tuples extraction process; on large datasets this noise is statistically insignificant and would not have had an appreciable effect. Moreover, we have used only a fixed set of weights in our 3-tuples similarity measure (explained in our patent document) throughout the experiments. The weights can be tuned for each dataset to improve the performance further; this will be explored in our future studies.

5 Conclusions

In this paper, we addressed the document clustering problem using a new representation called 3-tuples. We proposed a new clustering algorithm by defining a novel document similarity measure based on the 3-tuples representation. The performance of the algorithm was analyzed empirically using an entropy-based cluster evaluation setup. We also presented a comparative study with two existing algorithms and observed that 3-tuples based clustering is a more accurate approach for real-life datasets.

Acknowledgements

The authors would like to thank Dr. Su Jian, Dr. Zhou Guo Dong and Tey Tong Guan for the MEA API, which was used in the 3-tuples extraction process.

References

1. D. L. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325-344, 1998.
2. Tong Loong Cheong. Message Intermediate Representation. Institute of Systems Science, National University of Singapore, Singapore, 1993.
3. Tong Loong Cheong, Angela Wee Li Kwang, Augustina Gunawan, Goh Ann Loo, Lee Chee Qwun, and Shu Huey Leng. A pragmatic information extraction architecture for the Message Formatting Expert (MFE) system. In Proceedings of the 2nd Intl. Conference on Intelligent Systems, Singapore, 1994.
4. W.B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, NJ, 1992.
5. A. Kanaegami, K. Koike, H. Taki, and H. Ohgashi. US Patent 5,297,039: Text search system for locating on the basis of keyword matching and keyword relationship matching. US Patent and Trademark Office, USA, 1994.
6. Yasutsugu Morimoto, Toshiko Aizono, and Hiroyuki Kaji. Generation of a corpus-dependent thesaurus and interactive text retrieval. In International Conference on Chinese Computing, pages 65-68. National University of Singapore, Singapore, 1998.
7. H.J. Peat and P. Willett. The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science, 42(5):378-383, 1991.
8. G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968.
9. G. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
10. H. Schütze. US Patent 5,675,819: Document information retrieval using global word co-occurrence patterns. US Patent and Trademark Office, USA, 1997.
11. Hinrich Schütze and Jan O. Pedersen. A co-occurrence based thesaurus and two applications to information retrieval. Information Processing & Management, 33(3):307-318, 1997.
12. Wang Tongsheng, Tong Loong Cheong, and Tan Chew Lim. An example-based approach to prepositional phrase attachment in NLP. In International Conference on Chinese Computing, pages 98-105. National University of Singapore, Singapore, 1996.
13. C.J. van Rijsbergen. Information Retrieval, 2nd Edition. Butterworths, London, 1979.
14. E.M. Voorhees and D.K. Harman, editors. Text Retrieval Conference - 7. National Institute of Standards and Technology, Gaithersburg, MD, 1998.
