21st International Conference on Pattern Recognition (ICPR 2012) November 11-15, 2012. Tsukuba, Japan
An Improved K-means Document Clustering using Wikipedia Hierarchical Ontology

Mostafa M. Hassan, Fakhreddine Karray, and Mohamed S. Kamel
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo
m22hassa, karray, [email protected]
Abstract
Abstract

Text document clustering is one of the crucial tasks in text mining, and it is used in many text mining applications. One of the most commonly used algorithms for document clustering is the k-means algorithm, whose main drawback is that its output is very sensitive to the initial clusters' centroids. In this work, we present a technique that initializes the centroids based on a background knowledge structure extracted from one of the largest online knowledge repositories: Wikipedia. Results show that the proposed model is efficient and promising, as it outperforms the accuracy of conventional k-means clustering as well as other conventional algorithms for document clustering.

1. Introduction

Clustering is the process of assigning each input pattern to a group (cluster) such that each group contains similar patterns. Accordingly, text document clustering is the process of grouping text documents into groups of similar documents, where similarity is measured by how semantically related the documents are to each other. Clustering algorithms can be categorized, based on how they structure the clusters, into hierarchical clustering and partitional clustering. The most well-known algorithm for partitional clustering is k-means clustering [1]. Spherical k-means [2] is the most appropriate version of k-means for document clustering, as documents are represented as vectors and the best-fitting similarity measure for such vectors is the cosine similarity (from now on, when we refer to k-means, we always mean the spherical k-means version). It is considered one of the best techniques for document clustering.

The k-means algorithm starts with random centers for the clusters, called the clusters' centroids. It then assigns each data point to the nearest cluster centroid and, for each cluster, recomputes the centroid as the mean of all cluster members. The algorithm repeats the last two steps for a certain number of iterations or until convergence (the cluster assignment ceases to change). The main drawback of the k-means algorithm is that its performance is highly affected by the initial centroid values; therefore, there is no guarantee that it will converge to the global optimum.

Sometimes we have a set of topics that has been marked as being "of interest", and we want to cluster input documents into groups based on these topics. For instance, say we are interested in the topics economics, politics, and sports, and we have a set of documents that we want to group based on those topics. This task is often needed by users such as news agencies, where each news article should automatically be assigned to one of the predefined news topics.

In this paper, we present a novel approach that tackles the problem of assigning the initial cluster centroids for k-means based on a structured form of background knowledge, an ontology. We also briefly show how to extract the knowledge stored in Wikipedia and convert it into that structured Wikipedia Hierarchical Ontology (WHO) in such a way that it can be easily and efficiently used for k-means. The reason for choosing Wikipedia as a background knowledge repository is its size and its wide coverage of different topics, which can overcome the limitations of coverage and scalability in other knowledge bases and ontologies. These features encouraged us to use Wikipedia to build a well-structured knowledge base that can be used in different text-mining tasks. Although our approach needs the set of input topics, one main advantage over conventional document clustering is that the centroids are labeled: once we assign a document to a centroid, we know its topic as well.
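For concreteness, the following is a minimal sketch of the spherical k-means loop described above, assuming numpy and nonzero TF-IDF document vectors; it is an illustration, not the implementation used in our experiments. The random initialization step is exactly what the proposed approach later replaces.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """X: (n_docs, n_terms) TF-IDF matrix. Returns cluster labels and centroids.
    Documents and centroids are kept unit-length, so the dot product below is
    the cosine similarity mentioned in the text."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Random initialization -- the step WHO-k-means replaces.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Assign each document to the most similar (cosine) centroid.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Recompute each centroid as the re-normalized mean of its members.
        for j in range(k):
            members = X[labels == j]
            if members.shape[0] > 0:
                mean = members.mean(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return labels, centroids
```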
The rest of the paper is organized as follows: Section 2 discusses existing techniques that address the initialization problem of k-means clustering, as well as existing techniques that utilize background knowledge in the text document clustering task. Section 3 presents the proposed Wikipedia Hierarchical Ontology (WHO) knowledge representation and the method of utilizing WHO to generate the initial clusters' centroids for k-means. In Section 4, we provide simulations and results of applying the proposed approach and compare them to conventional document clustering results. Finally, Section 5 concludes the paper.

2. Related work

The study of selecting appropriate values for the initial k-means cluster centroids began long ago. One of the earliest works on the issue was by Forgy [3], who suggested picking the initial cluster centroids randomly from the input data points. Most of the methods proposed in the literature for this problem find the initial centroids using either statistical techniques [4] or data mining techniques, such as density estimation [5] and sub-graph division [6]. In our proposed approach, we instead utilize background knowledge to identify the best initial clusters' centroids.

Recently, a new line of research introduced the use of background knowledge to enhance the efficiency of different text mining tasks. In this type of technique, documents are represented in a structure of concepts that reflects the documents' meaning, rather than as a collection of the words found in the document. This structure of concepts is referred to as an "ontology". Researchers use either a well-structured source of knowledge in the form of an ontology, such as WordNet [7, 8], or a raw knowledge repository in the form of a web directory or an online encyclopedia, namely Wikipedia [9, 10]. Most of the work that involves background knowledge to improve text document clustering uses the background knowledge to update the distance measure between the input documents: the background knowledge enriches the document representation, and the clustering algorithm is then applied to that new representation. In contrast, our proposed approach utilizes the background knowledge to assign initial cluster centroids for k-means, and the clustering algorithm then runs in the original term space.

3. Wikipedia Hierarchical Ontology

This section introduces our approach to building a Wikipedia Hierarchical Ontology (WHO) from the Wikipedia knowledge repository. We use this ontology to exploit the knowledge stored in Wikipedia for document representation. We assume that each Wikipedia category represents a unique topic. These topics are the basic building blocks of the ontology; we refer to them as concepts. Wikipedia categories are organized hierarchically, so that root concepts represent abstract ideas while leaf concepts represent very specific ones. This reflects world knowledge in different domains at different levels of granularity. Each category (concept) is associated with a collection of Wikipedia articles that describe and present different ideas related to that concept. Using these articles, we can extract the set of terms that represent each concept. Furthermore, we associate a weight with each of these terms, which expresses how much the term contributes to the meaning of that concept. These weights are calculated based on the frequency of occurrence of the terms in the articles under that concept. This describes the basic idea of the creation process of the Wikipedia Hierarchical Ontology (the details of the algorithm are omitted due to space limitations; for more details we refer to our work in [11]).
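As an illustration of the weighting step, the sketch below builds a concept-term matrix from a hypothetical mapping of Wikipedia categories to their tokenized articles. The plain normalized term frequency used here is an assumption for the sake of illustration; the exact weighting scheme is given in [11].

```python
from collections import Counter

def build_concept_term_matrix(category_articles):
    """category_articles: hypothetical input mapping a Wikipedia category
    (concept) to a list of tokenized articles filed under it, e.g.
    {"Economics": [["market", "trade", ...], ...], ...}.
    Returns {concept: {term: weight}}, with frequency-based weights
    normalized so each concept's weights sum to 1 (an assumed scheme)."""
    matrix = {}
    for concept, articles in category_articles.items():
        counts = Counter()
        for tokens in articles:
            counts.update(tokens)  # aggregate term frequencies over the articles
        total = sum(counts.values())
        # Weight expresses how much the term contributes to the concept's meaning.
        matrix[concept] = {t: c / total for t, c in counts.items()} if total else {}
    return matrix
```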
3.1 Using WHO to Improve K-means Clustering
Here, we assume that the desired topics for the output clusters of the input data set are given as an input to the approach. We use the input topics to extract the matching concepts from our ontology. Let T be the set of input topics, and let C be the WHO concept-term matrix that represents our ontology, extracted in the previous step. The proposed approach proceeds as follows:

1. Extract from C the subset of concepts matching T, forming the matrix C*, where each row of C* represents one topic in the set T.

2. Exclude the columns that represent terms not found in the input document set. Each row of C* can then be considered a vector in the input documents' term space.

3. Treat these concept vectors as the initial clusters' centroids, and apply conventional spherical k-means starting from them.
We refer to this approach as WHO-k-means.
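The following is a minimal sketch of this initialization step, assuming the dictionary-based concept-term matrix from the previous sketch and a fixed document vocabulary; both interfaces are illustrative assumptions, not the paper's actual data structures.

```python
import numpy as np

def who_initial_centroids(concept_term, topics, doc_vocab):
    """concept_term: {concept: {term: weight}} from the WHO extraction step.
    topics: the input topic set T (one row of C* per topic).
    doc_vocab: list of terms occurring in the input document set (the term
    space the clustering runs in). Returns unit-length initial centroids."""
    index = {t: i for i, t in enumerate(doc_vocab)}
    C_star = np.zeros((len(topics), len(doc_vocab)))
    for r, topic in enumerate(topics):
        for term, w in concept_term[topic].items():
            if term in index:  # step 2: drop terms absent from the document set
                C_star[r, index[term]] = w
    # Normalize each row so it is a valid spherical k-means centroid.
    norms = np.linalg.norm(C_star, axis=1, keepdims=True)
    return C_star / np.where(norms == 0, 1.0, norms)
```

The returned matrix can be passed to the spherical k-means sketch in Section 1 in place of its random centroid initialization.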
4. Experiments and Results

To test the performance of the proposed WHO-k-means approach, we conducted a comparative study using a number of real-world benchmark document sets. In our experiments, the final goal is to utilize the knowledge structure we formed (WHO) to initialize the k-means centroids. We compared the performance of three conventional document clustering approaches against the proposed one: (1) hierarchical document clustering with complete linkage, HAC-Com; (2) hierarchical document clustering with average linkage, HAC-Avg; (3) k-means document clustering with random initial centroids, k-means; and (4) the proposed k-means document clustering using the initial centroids from WHO, WHO-k-means. As is well known, the k-means clustering output depends on the initialization step, so the cluster assignments change across runs; we therefore applied k-means clustering 10 times and report the average.

Performance measures. We selected some of the common external and internal performance measures for this experiment (a sketch of one of them follows the data set description below). For internal performance measures, we report the partition and separation indices, SC and SI respectively [12]. For external performance measures, we report the F-measure, Accuracy, Purity, NMI, and Entropy [12].

Data sets. We conducted our experiments on three benchmark data sets that were previously used by Zhao and Karypis [13] to evaluate the performance of different document clustering algorithms. The reviews data set is derived from San Jose Mercury newspaper articles distributed as part of the TREC collection (http://trec.nist.gov). The k1b and wap data sets are from the WebACE project [14]; each document corresponds to a web page listed in the subject hierarchy of Yahoo!. (The original wap data set contained 1560 documents in 20 classes, but we removed the documents that did not have a mapping category in Wikipedia.) These data sets were pre-processed and distributed with the CLUTO Toolkit [15].
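As an example of the external measures, the sketch below computes purity in its common textbook form (each cluster is credited with its majority true class); we assume this matches the definition used in [12].

```python
import numpy as np

def purity(labels, truth):
    """Purity: fraction of documents that fall in their cluster's majority
    true class. labels: predicted cluster ids; truth: gold class ids."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    correct = 0
    for c in np.unique(labels):
        # Size of the largest true class inside cluster c.
        _, counts = np.unique(truth[labels == c], return_counts=True)
        correct += counts.max()
    return correct / len(labels)
```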
4.1. Results and Discussion

Table 1 summarizes the comparison of internal and external output performance, as well as the running time, for the three conventional document clustering approaches and the proposed WHO-k-means. The running time does not include the pre-processing time or the extraction of the topic vectors from WHO, as these pre-processing steps are done once per data set. For each data set and measure, the best output performance is marked with * and the second best with †. The table also compares the overall averaged output performance of the document clustering approaches and the proposed method. From the overall results, we see that WHO-k-means outperformed the other document clustering approaches in all external performance measures, while it was the second best after HAC-Avg in the internal performance measures. We can also see that WHO-k-means is more efficient than conventional k-means, as it converged faster.
Table 1. Comparison of external performance measures, internal performance measures, and running time (secs). * marks the best result per data set and measure; † marks the second best.

Data set   Measure     HAC-Avg    HAC-Com    k-means     WHO-k-means
K1b        F-measure   0.8043†    0.6612     0.7454      0.8778*
           Accuracy    0.8449†    0.7021     0.6141      0.8632*
           Purity      0.8526†    0.7966     0.8341      0.8902*
           NMI         0.6651†    0.4144     0.5879      0.7253*
           Entropy     0.3166     0.4104     0.2098†     0.1401*
           SC          0.834*     1.4172     1.368       1.2131†
           SI          1.2019*    2.1667     1.9273      1.4438†
           Run Time    0.3432†    0.2808*    12.6252     6.2400
Wap        F-measure   0.5698†    0.5211     0.5512      0.7358*
           Accuracy    0.5568†    0.5111     0.4918      0.7231*
           Purity      0.5675     0.5797     0.6739†     0.7399*
           NMI         0.5512†    0.4235     0.5216      0.6420*
           Entropy     0.4404     0.4915     0.3768†     0.3031*
           SC          0.6941*    0.9815     0.9957      0.9026†
           SI          1.3329*    1.7255     1.7976      1.4517†
           Run Time    0.1872†    0.1248*    21.1706     13.9933
Reviews    F-measure   0.4117     0.4538     0.6924†     0.7163*
           Accuracy    0.3463     0.3844     0.6538†     0.6862*
           Purity      0.3475     0.3851     0.7359†     0.7606*
           NMI         0.0339     0.1421     0.5179†     0.5505*
           Entropy     0.8731     0.8245     0.4253†     0.3995*
           SC          0.2303*    0.9724†    1.2588      1.1797
           SI          0.8897*    1.7373     1.7316      1.7226†
           Run Time    1.1700†    1.0452*    9.6777      8.7361
Overall    F-measure   0.5956     0.5625     0.6418†     0.7779*
           Accuracy    0.5827     0.5325     0.5866†     0.7575*
           Purity      0.5892     0.5871     0.7479†     0.7969*
           NMI         0.4167     0.3267     0.5424†     0.6392*
           Entropy     0.5433     0.5754     0.3373†     0.2809*
           SC          0.5861*    1.1237     1.2075      1.0984†
           SI          1.1415*    1.8765     1.8188      1.5394†
           Run Time    0.5668†    0.4836*    14.4911     9.6565

5. Conclusion

In this work we presented a new approach for extracting knowledge from a huge knowledge repository, namely Wikipedia. We stored the extracted background knowledge in the form of a Wikipedia Hierarchical Ontology (WHO), utilizing both the information stored in Wikipedia articles and the hierarchical structure of Wikipedia categories to define the ontological concepts and the relationships between them. We used the proposed WHO to improve k-means document clustering by providing it with the initial centroids. The simulation tests show that the proposed WHO-k-means approach outperforms the hierarchical document clustering approaches and conventional k-means document clustering on the external output performance measures. They also show that using the initial centroids extracted from WHO improves both the internal and external output performance of k-means, and that WHO-k-means is superior to conventional k-means in terms of time efficiency. Lastly, another benefit of WHO-k-means over conventional document clustering is that it produces labeled output clusters.

References
[1] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
[2] I. Dhillon and D. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143–175, 2001.
[3] E. Forgy, "Cluster analysis of multivariate data: efficiency versus interpretability of classifications," Biometrics, vol. 21, pp. 768–769, 1965.
[4] J. Pena, J. Lozano, and P. Larranaga, "An empirical comparison of four initialization methods for the k-means algorithm," Pattern Recognition Letters, vol. 20, no. 10, pp. 1027–1040, 1999.
[5] S. Redmond and C. Heneghan, "A method for initialising the k-means clustering algorithm using kd-trees," Pattern Recognition Letters, vol. 28, no. 8, pp. 965–973, 2007.
[6] H. Suo, K. Nie, X. Sun, and Y. Wang, "One optimized choosing method of k-means document clustering center," Information Retrieval Technology, pp. 490–495, 2008.
[7] J. Sedding and D. Kazakov, "WordNet-based text document clustering," in Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, COLING 2004, V. Pallotta and A. Todirascu, Eds. Geneva, Switzerland: COLING, Aug. 2004, pp. 104–113.
[8] D. Reforgiato Recupero, "A new unsupervised method for document clustering by using WordNet lexical and conceptual relations," Information Retrieval, vol. 10, no. 6, pp. 563–579, 2007.
[9] P. Wang, J. Hu, H. Zeng, and Z. Chen, "Using Wikipedia knowledge to improve text classification," Knowledge and Information Systems, vol. 19, no. 3, pp. 265–281, 2009.
[10] M. Hassan, F. Karray, and M. S. Kamel, "Improving document clustering using a hierarchical ontology extracted from Wikipedia," in SIAM SDM Text Mining Workshop. Columbus, OH, USA: SIAM, May 2010.
[11] M. Hassan, F. Karray, and M. Kamel, "Automatic document topic identification using Wikipedia hierarchical ontology," in The 11th International Conference on Information Sciences, Signal Processing and their Applications (ISSPA 2012), Montreal, Canada, Jul. 2012.
[12] R. Kashef, "Cooperative clustering model and its applications," Ph.D. dissertation, University of Waterloo, 2008.
[13] Y. Zhao and G. Karypis, "Hierarchical clustering algorithms for document datasets," Data Mining and Knowledge Discovery, vol. 10, no. 2, pp. 141–168, 2005.
[14] E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: A web agent for document categorization and exploration," in Proceedings of the Second International Conference on Autonomous Agents. ACM, 1998, pp. 408–415.
[15] G. Karypis, "CLUTO - a clustering toolkit," University of Minnesota, Department of Computer Science, Tech. Rep. 02-017, 2002.