Research on Incremental Clustering

Yongli Liu1, Qianqian Guo2, Lishen Yang1, Yingying Li1
1: School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan Province, China, 454000
2: Academic Publishing Center, Henan Polytechnic University, Jiaozuo, Henan Province, China, 454000
[email protected]

Abstract— Currently, incremental document clustering is one of the most effective techniques for organizing documents in an unsupervised manner for many Web applications. This paper summarizes the research status and recent progress of incremental clustering algorithms. First, some representative algorithms are analyzed and generalized in terms of algorithmic idea, key techniques, advantages and disadvantages. Second, we select four typical clustering algorithms and carry out simulation experiments to compare their clustering quality in terms of both accuracy and efficiency. The work in this paper can serve as a valuable reference for incremental clustering research.

Keywords— incremental clustering; Web mining; algorithms; experiments

I. Introduction

With the rapid development of the World Wide Web, it becomes more and more difficult for users to find the online information they need. There is therefore huge potential in collecting the valuable information and structure behind Web documents to improve the intelligence and efficiency of the Internet [1]. Over the past few years, Web mining techniques have progressed rapidly and been widely used. One of these techniques is document clustering, which tries to identify inherent groupings of text documents so as to produce a set of clusters with high intra-cluster similarity and low inter-cluster similarity [2]. It is particularly useful in many applications, such as grouping search results and clustering Web documents [2].

There is a large body of work investigating clustering methods. Based on how data objects are accumulated during clustering and the methods employing these rules, clustering methods can be divided into four types: hierarchical clustering, partitional clustering, density- and grid-based clustering, and others [3]. Hierarchical and partitional clustering methods are used in many applications; their representative algorithms include the agglomerative hierarchical clustering algorithm and the K-Means algorithm, respectively.

Currently, to meet the demands of online applications where documents are generated continuously during use, such as blogs and wikis, incremental clustering has become a research focus [4]. Incremental clustering algorithms process data objects one at a time, incrementally assigning them to their respective clusters as they arrive [2][4]. Typical examples include Single-Pass Clustering and K-Nearest Neighbor clustering (KNN).

This paper analyzes recent representative incremental clustering algorithms in terms of algorithmic idea,

This work was supported in part by the National Social Science Fund (No. 11CYY019) and the Natural Scientific Research Project of the Education Department of Henan Province (No. 2011A520015).


key techniques, advantages and disadvantages, and compares their clustering quality through experiments on several datasets. The rest of this paper is organized as follows: Section II briefly introduces existing typical incremental clustering algorithms. Section III describes the experiments carried out and discusses the experimental results. Finally, we conclude our work.

II. Incremental Clustering Algorithms

In current Web applications, there is much User-Generated Content (UGC), which covers a range of media available through modern communications technologies, such as question-answer databases, digital video, blogging, podcasting, forums, review sites, social networking, mobile phone photography and wikis [5]. User-Generated Content has also been characterized as 'Conversational Media', a key characteristic of so-called Web 2.0, which encourages the publishing of one's own content and commenting on other people's.

In such situations the dataset is dynamic, so it is impossible to collect all data objects before clustering starts. When new data arrives, non-incremental clustering has to re-cluster all the data, which decreases efficiency and wastes computing resources. In contrast, incremental clustering only needs to group the new data and merge the new clusters into the previous clustering results. This strategy optimizes the clustering process and is especially suited to applications where time is a critical factor for usability [2][4].

However, incremental clustering faces several challenges. In the early stage only a few documents are available, so it is difficult to obtain high clustering quality. As more and more documents arrive, we may find that some earlier documents were put into the wrong groups, so it can become necessary to reassign some documents to new clusters. In other words, presenting the documents in a different order can produce different clustering results. It can be seen from this analysis that, although incremental clustering seems simple, there is still much work to do.

Generally, a typical document clustering process has two main components, namely a similarity measure and a clustering algorithm [4]. There are many methods for measuring document similarity, such as the Cosine measure, the Jaccard measure, the Dice measure and the Overlap measure [6]. A clustering method can be paired with different similarity measures, and the clustering results produced by the same method under different similarity measures can differ significantly. Therefore, it is crucial to select an appropriate similarity measure.
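As a concrete illustration, the following minimal Python sketch computes three of the measures just mentioned over term-frequency vectors (the vectors and vocabulary are invented for demonstration):

import math

def cosine(a, b):
    # Cosine measure: dot product normalized by the two vector lengths.
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    # Jaccard measure on the sets of terms occurring in each document.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    # Dice measure: twice the shared terms over the total number of terms.
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

# Hypothetical term-frequency vectors for two short documents.
d1 = {"web": 2, "mining": 1, "clustering": 3}
d2 = {"web": 1, "clustering": 2, "search": 1}
print(cosine(d1, d2), jaccard(d1, d2), dice(d1, d2))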

However, some clustering methods rely on the original document vectors rather than pair-wise document similarity [7], such as Suffix Tree Clustering (STC) [8] and DC-Tree Clustering [9]. In this section, we briefly discuss five incremental clustering methods covering both categories.

A. Clustering Based on Similarity

Wan [6] reviews existing similarity measures, including measures in the vector space model, the information-theoretic measure, measures derived from popular retrieval functions and the OM-based measures, and also proposes a novel measure based on the earth mover's distance that evaluates document similarity by allowing many-to-many matching between subtopics. When designing a clustering algorithm, the choice of similarity measure is sensitive to the specific representation, which determines whether the algorithm can accurately reflect the structure of the various components in high-dimensional data [10]. However, there are many methods for measuring pair-wise document similarity and no established selection criteria, so the similarity measure is too often an arbitrary choice, although it is very important to a clustering algorithm.

• Single-Pass Clustering

This is a very simple clustering method. It takes the first object as the centroid of the first cluster, and then computes the similarity between each subsequent object and every existing cluster centroid using some similarity coefficient. If the highest similarity value exceeds a threshold appointed beforehand, the new object is added to the corresponding cluster and the centroid is updated; otherwise, the object starts a new cluster. The method is very efficient, but suffers from the drawback that the resulting clusters depend on the order in which objects are inserted.
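A minimal Python sketch of this procedure, assuming cosine similarity over term-frequency vectors and the 0.2 threshold used later in our experiments (the running-mean centroid update is one common choice, not prescribed by the method):

import math

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold=0.2):
    # Each cluster holds a centroid (mean term-frequency vector) and members.
    clusters = []
    for doc in docs:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(doc, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim > threshold:
            # Absorb the document and update the centroid as a running mean.
            best["members"].append(doc)
            n = len(best["members"])
            cen = best["centroid"]
            for t in set(cen) | set(doc):
                cen[t] = cen.get(t, 0.0) + (doc.get(t, 0.0) - cen.get(t, 0.0)) / n
        else:
            # No cluster is similar enough: the document starts a new cluster.
            clusters.append({"centroid": dict(doc), "members": [doc]})
    return clusters

Each document is examined exactly once, which is what makes the method fast and also why the result depends on insertion order.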

• K-Nearest Neighbor Clustering

KNN is an approach to classifying objects based on the closest training data in pattern recognition, and it can also be used as a clustering method. Under this algorithm, an object is classified by a majority vote of its neighbors, the object being assigned to the class most common amongst its K nearest neighbors [11].
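A sketch of the corresponding assignment step under the same term-frequency representation (the value of K and the absence of a tie-breaking rule are simplifications for illustration):

import math
from collections import Counter

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_assign(doc, labeled, k=5):
    # labeled: list of (vector, cluster_id) pairs for already-grouped documents.
    neighbors = sorted(labeled, key=lambda p: cosine(doc, p[0]), reverse=True)[:k]
    # Majority vote among the K most similar documents decides the cluster.
    votes = Counter(cid for _, cid in neighbors)
    return votes.most_common(1)[0][0] if votes else None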

B. Clustering Not Based on Similarity

Some clustering algorithms are designed without an explicit similarity measure. For example, in Suffix Tree Clustering, documents are combined on the basis of base clusters, which indirectly measure the distance between documents; DC-Tree Clustering is based on the B+-tree structure, and inserting a new document just involves comparing the document feature vector with the cluster vectors. Though these algorithms have no explicit similarity measure, they actually include steps that measure similarity indirectly. In other words, the similarity measures of this type of clustering method differ mainly in form from those of the similarity-based methods. We briefly review the following three algorithms.

• Suffix Tree Clustering

This algorithm was presented by Zamir and Etzioni [8]. Its kernel is a suffix tree, a data structure that keeps track of all n-grams of any length in a set of word strings while allowing strings to be inserted incrementally in time linear in the number of words in each string [12]. STC is composed of three steps: document cleaning, identifying base clusters, and combining base clusters. The algorithm is theoretically fast, but it is only practical when there is enough memory available to create the suffix tree. When creating a suffix tree from a large sequence database, the memory requirement is difficult to satisfy, and the effectiveness of STC decreases.
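A full suffix tree implementation is beyond the scope of this overview; the following simplified sketch captures the base-cluster idea with an n-gram index in place of a true suffix tree (the scoring function and constants are illustrative assumptions, not the exact formulation of [8]):

from collections import defaultdict

def base_clusters(docs, max_len=3, min_docs=2):
    # docs: list of token lists. Map every phrase (n-gram up to max_len) to
    # the set of documents containing it; phrases shared by at least
    # min_docs documents form base clusters.
    index = defaultdict(set)
    for i, tokens in enumerate(docs):
        for n in range(1, max_len + 1):
            for j in range(len(tokens) - n + 1):
                index[tuple(tokens[j:j + n])].add(i)
    clusters = []
    for phrase, members in index.items():
        if len(members) >= min_docs:
            # Score favors clusters that cover many documents via long phrases.
            clusters.append((len(members) * len(phrase), phrase, members))
    return sorted(clusters, reverse=True)

docs = ["web document clustering algorithm".split(),
        "incremental web document clustering".split(),
        "suffix tree clustering algorithm".split()]
for score, phrase, members in base_clusters(docs)[:3]:
    print(score, " ".join(phrase), sorted(members))

In the full algorithm, base clusters with sufficiently overlapping document sets are then merged into the final clusters.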

• Incremental DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm proposed by Ester et al. [13] in 1996. In 2009, Nguyen-Hoang et al. [14] proposed a document clustering approach based on a graph model and an enhanced incremental DBSCAN. In their approach, a graph-based model is used for document representation instead of the traditional vector-based model. It is an effective incremental clustering algorithm suitable for mining dynamically changing databases: with the graph model, the graph structure can easily be updated when a new document is added to the database.
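As a rough illustration of why insertion is cheap, the following much-simplified sketch re-examines only the neighborhood of a new point (eps, min_pts and the flat label bookkeeping are assumptions; the enhanced algorithm of [14] additionally maintains a document graph and distinguishes core from border points):

import math

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class IncrementalDBSCAN:
    def __init__(self, eps=0.6, min_pts=3):
        self.eps, self.min_pts = eps, min_pts
        self.points, self.labels = [], []   # label -1 marks noise
        self.next_id = 0

    def insert(self, doc):
        # Neighbors are points whose cosine distance (1 - similarity) <= eps.
        nbrs = [i for i, p in enumerate(self.points)
                if 1.0 - cosine(doc, p) <= self.eps]
        self.points.append(doc)
        self.labels.append(-1)
        new = len(self.points) - 1
        if len(nbrs) + 1 >= self.min_pts:   # the new point is a core point
            ids = {self.labels[i] for i in nbrs if self.labels[i] != -1}
            cid = min(ids) if ids else self.next_id
            if not ids:
                self.next_id += 1
            # Merge every touched cluster and absorb noise neighbors.
            self.labels = [cid if l in ids else l for l in self.labels]
            for i in nbrs + [new]:
                self.labels[i] = cid
        return self.labels[new]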

• ICIB

ICIB (Incremental Clustering based on Information Bottleneck theory) was presented in 2011 [4]. The method is designed to improve the accuracy and efficiency of document clustering and to resolve the issue that an arbitrary choice of document similarity measure can produce an inaccurate clustering result. ICIB measures document similarity with information bottleneck theory. A first document is selected randomly and taken as the first cluster, and then each remaining document is processed incrementally according to the mutual information loss introduced by merging the document with each existing cluster. If the minimum mutual information loss is below a certain threshold, the document is added to its closest cluster; otherwise it becomes a new cluster. This basic incremental process is order-dependent and of limited precision, so it cannot guarantee accurate clustering results. Therefore, an improved sequential clustering algorithm (SIB [15]) is used to adjust the intermediate clustering results.
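In information bottleneck clustering, the "distance" between a document and a cluster is the mutual information loss their merger would cause, which reduces to a weighted Jensen-Shannon divergence between their word distributions [4][15]. A sketch of this merge cost, assuming uniform document priors so that a cluster's weight is proportional to its size:

import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) over dicts of word probabilities.
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items() if pv > 0)

def merge_cost(p_doc, p_cluster, w_doc, w_cluster):
    # Information loss of merging: (w_d + w_c) * JS_pi(p_doc, p_cluster),
    # where pi weights the two distributions by their probability mass.
    w = w_doc + w_cluster
    pi_d, pi_c = w_doc / w, w_cluster / w
    vocab = set(p_doc) | set(p_cluster)
    # Word distribution of the would-be merged cluster (the mixture).
    m = {t: pi_d * p_doc.get(t, 0.0) + pi_c * p_cluster.get(t, 0.0) for t in vocab}
    js = pi_d * kl(p_doc, m) + pi_c * kl(p_cluster, m)
    return w * js

# A single document (weight 1/N) against a 10-document cluster (weight 10/N).
p_d = {"web": 0.5, "mining": 0.5}
p_c = {"web": 0.25, "cluster": 0.75}
print(merge_cost(p_d, p_c, w_doc=1 / 500, w_cluster=10 / 500))

The document joins the cluster with the smallest merge cost, provided that cost is below the threshold; otherwise it becomes a singleton cluster.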

III. Experiments

In this section, we evaluate the clustering quality of Single-Pass Clustering, STC and ICIB through empirical evaluation, which compares a known cluster structure to the results of clustering the same set of documents algorithmically. We also add the K-Means clustering algorithm to our experiments as a benchmark against which to compare the clustering quality of these algorithms.

A. Experimental Setup

We select the 20NewsGroup corpus, collected by Lang [16], as our test dataset. This corpus contains about 20,000 articles evenly distributed among 20 UseNet discussion groups, and is usually employed for evaluating supervised text classification techniques. Many of these groups have similar topics, and about 4.5% of the documents in the corpus are present in more than one group [17]. Therefore, the "real" clusters are inherently fuzzy. In our experiments, we used 4 different randomly chosen subsets from this corpus; their details are listed in Table I. Our pre-processing includes ignoring all file headers, using a stop-list, and stemming words with the popular Porter Stemmer algorithm [18]. We index all the documents using the well-known search engine tool Lucene [19].

TABLE I. DATASET DETAILS

Dataset | Newsgroups included | #docs per group | Total #docs
DS1 | sci.crypt, sci.electronics, sci.med, sci.space | 125 | 500
DS2 | talk.politics.mideast, talk.politics.misc | 250 | 500
DS3 | comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast | 100 | 500
DS4 | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns | 50 | 500

B. Evaluation Measures

There are several ways to numerically score clustering quality, such as Entropy, F-Measure and Overall Similarity; Entropy and F-Measure are the two most widely used. F-Measure is the weighted harmonic mean of precision and recall, and it is often used to measure clustering quality: the higher the F-Measure, the better the clustering quality. Entropy is essentially a measure of the randomness of molecules in a thermodynamic system. In data mining, Entropy is often used to evaluate the distribution of the clusters produced by a clustering algorithm. If documents are distributed uniformly and there is little difference between clusters, the Entropy value is high; conversely, if there are great differences between clusters, the Entropy value is low. Since the purpose of clustering is to enlarge the differences between clusters, the lower the Entropy, the higher the clustering quality.
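For reference, both measures can be computed from the true class labels and the predicted cluster labels as follows (a sketch following the standard definitions, with classes weighted by size for F-Measure and clusters weighted by size for Entropy):

import math
from collections import Counter

def f_measure(true_labels, pred_labels):
    n = len(true_labels)
    total = 0.0
    for ci in set(true_labels):
        n_i = sum(1 for t in true_labels if t == ci)
        best = 0.0
        for cj in set(pred_labels):
            n_j = sum(1 for p in pred_labels if p == cj)
            n_ij = sum(1 for t, p in zip(true_labels, pred_labels)
                       if t == ci and p == cj)
            if n_ij:
                prec, rec = n_ij / n_j, n_ij / n_i
                best = max(best, 2 * prec * rec / (prec + rec))
        total += (n_i / n) * best   # weight each class by its size
    return total

def entropy(true_labels, pred_labels):
    n = len(true_labels)
    total = 0.0
    for cj in set(pred_labels):
        members = [t for t, p in zip(true_labels, pred_labels) if p == cj]
        e = -sum((c / len(members)) * math.log(c / len(members))
                 for c in Counter(members).values())
        total += (len(members) / n) * e   # weight each cluster by its size
    return total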

C. Experimental Results

We carried out experiments on the above four subsets and compared the clustering quality of four clustering algorithms: Single-Pass Clustering, STC, ICIB and K-Means. We set the similarity threshold of Single-Pass Clustering to 0.2 and the similarity threshold of STC to 0.5. The value of K in K-Means equals the number of resulting clusters generated by ICIB. We also investigated different depths of reassignment in ICIB, that is, the corresponding parameter loop (the sum in [4]) was assigned the values 2 and 4.

Figure 1 illustrates the comparison of the algorithms in terms of F-Measure and Entropy. The clustering quality of ICIB is much better than that of the other methods. The parameter loop denotes the adjustment level, and the result of loop=4 is slightly better than loop=2. As an incremental clustering method, ICIB needs to reassign documents that have already been clustered when many new documents are added; especially in the early phase of clustering, this adjustment process has to be performed frequently. The K-Means method is very efficient. The clustering quality of Single-Pass Clustering, by contrast, is unstable: sometimes it is comparable to K-Means (DS4 in Figure 1), sometimes much lower. Single-Pass Clustering is easily affected by many factors, such as insertion order. The clustering quality of Suffix Tree Clustering is poor in our experiments. Analyzing the algorithm, we found that the resulting clusters of STC can have overlapping documents, that is, one document may appear in many clusters. This ensures that a large number of substantial clusters can be generated, each of which can be labeled fairly accurately. But in our experiments the STC algorithm sometimes generates clusters of poor quality, which drastically lowers its overall effectiveness. Similar experimental conclusions were reported in [12] and [20].


Figure 1. The comparison among Single-Pass Clustering, STC, ICIB and K-Means: (a) F-Measure; (b) Entropy.

Time performance is another important aspect. Table II shows the time performance comparison. K-Means is a classical clustering algorithm with time complexity O(n·k·l), where n is the number of documents, k is the number of classes, and l is the number of iterations. The time complexity of ICIB is also O(n·k·l), where l equals loop+1, so ICIB is comparable to K-Means in terms of time performance. Single-Pass Clustering has a time complexity of O(n·k), and STC has O(m), where m is the total number of words in all combined document snippets. The experimental results listed in Table II confirm the above analysis of time complexity.

TABLE II. THE COMPARISON OF CLUSTERING TIME

Datasets | Single-Pass Clustering (s) | K-Means (s) | STC (s) | ICIB loop=2 (s) | ICIB loop=4 (s)
DS1 | 19.4 | 75.0 | 88.7 | 29.1 | 51.7
DS2 | 17.5 | 45.5 | 123.7 | 32.6 | 55.5
DS3 | 19.8 | 65.2 | 98.4 | 40.4 | 71.0
DS4 | 22.1 | 28.7 | 114.0 | 26.1 | 65.8

IV. Conclusion

This paper first briefly introduces the situation of current Web applications and emphasizes the importance of incremental clustering. We then summarize the research status and recent progress of incremental clustering algorithms. We select and discuss some representative algorithms, such as Single-Pass Clustering, Suffix Tree Clustering and ICIB, in terms of algorithmic idea, key techniques, advantages and disadvantages. To compare their clustering quality, we brought the K-Means algorithm into this paper as a benchmark, carried out simulation experiments, and discussed the experimental results in detail. As mentioned above, though incremental clustering seems simple, it faces several challenges, so there is still much work for researchers to do. The work in this paper can serve as a valuable reference for future incremental clustering research.

References

[1] R. Baraglia, F. Silvestri, "Dynamic personalization of web sites without user intervention", Communications of the ACM, 2007, 50(2), pp. 63-67.
[2] K. M. Hammouda, M. S. Kamel, "Efficient Phrase-Based Document Indexing for Web Document Clustering", IEEE Transactions on Knowledge and Data Engineering, 2004, 16(10), pp. 1279-1296.
[3] J. Sun, J. Liu, L. Zhao, "Clustering Algorithms Research", Journal of Software, 2008, 19(1), pp. 48-61.
[4] Y. Liu, Y. Ouyang, Z. Xiong, "Incremental Clustering using Information Bottleneck Theory", International Journal of Pattern Recognition and Artificial Intelligence, 2011, 25(5), pp. 695-712.
[5] User-generated content [Online]. Available: http://en.wikipedia.org/wiki/User-generated_content
[6] X. Wan, "A novel document similarity measure based on earth mover's distance", Information Sciences, 2007, 177(18), pp. 3718-3730.
[7] K. M. Hammouda, M. S. Kamel, "Incremental document clustering using cluster similarity histograms", in Proc. of Int. Conf. on Web Intelligence, 2003, pp. 597-601.
[8] O. Zamir, O. Etzioni, "Web document clustering: A feasibility demonstration", in Proc. of the 21st Annual Int. ACM SIGIR Conf., 1998, pp. 46-54.
[9] W. Wong, A. Fu, "Incremental document clustering for Web page classification", in Proc. 2000 Int. Conf. Information Society in the 21st Century: Emerging Technologies and New Challenges (IS2000), 2000.
[10] N. Slonim, N. Tishby, "Document clustering using word clusters via the information bottleneck method", in Proc. of the 23rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000, pp. 208-215.
[11] K-nearest neighbor algorithm [Online]. Available: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
[12] S. Branson, A. Greenberg, "Clustering Web Search Results Using Suffix Tree Methods", Stanford University, unpublished.
[13] M. Ester, H. Kriegel, J. Sander, X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", in Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226-231.
[14] T. Nguyen-Hoang, K. Hoang, D. Bui-Thi, A. Nguyen, "Incremental Document Clustering Based on Graph Model", Advanced Data Mining and Applications, 2009, pp. 569-576.
[15] N. Slonim, N. Friedman, N. Tishby, "Unsupervised document classification using sequential information maximization", in Proc. of the 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 129-136.
[16] K. Lang, "Learning to filter netnews", in Proc. of the 12th Int. Conf. on Machine Learning, 1995, pp. 331-339.
[17] R. E. Schapire, Y. E. Singer, "BoosTexter: A System for Multiclass Multi-label Text Categorization", 1998.
[18] M. F. Porter, "An Algorithm for Suffix Stripping", Program, 1980, 14(3), pp. 130-137.
[19] Lucene [Online]. Available: http://lucene.apache.org/
[20] H. Chim, X. Deng, "A New Suffix Tree Similarity Measure for Document Clustering", in Proc. of the WWW Conf., 2007, pp. 121-129.