Comparing Document Classification Schemes Using K-means Clustering

Artur Šilić¹, Lovro Žmak¹, Bojana Dalbelo Bašić¹, and Marie-Francine Moens²

¹ University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia
{artur.silic,lovro.zmak,bojana.dalbelo}@fer.hr
² Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, 3001 Heverlee, Belgium
[email protected]
Abstract. In this work, we jointly apply several text mining methods to a corpus of legal documents in order to compare the separation quality of two inherently different document classification schemes. The classification schemes are compared with the clusters produced by the K-means algorithm. In the future, we believe that our comparison method can be coupled with semi-supervised and active learning techniques. This paper also presents the idea of combining K-means and Principal Component Analysis for cluster visualization. The described idea allows the calculations to be performed in a reasonable amount of CPU time.
1 Introduction

Over the past decade, the digitalization of textual data has greatly increased the accessibility of text documents in all areas of human society. This has introduced a strong need for efficient storage, processing, and retrieval of texts, since the number of available documents has become quite large. The first step in solving such large-scale problems involves structuring the data, which can be done by introducing a classification scheme. Since many classification schemes can be used, the question remains of how to choose the best one among many for a designated task. In this work, we design a method for finding the classification scheme that best fits the data given a certain representation. More specifically, we try to quantify how much separation quality is gained through the use of a better but more expensive classification scheme. Our work is related to that of Rosell et al. [12], where the objective was to evaluate clustering using two classification schemes. We visualize the data in order to inspect both the separation of clusters and their relationship to classes. The introduced visualization method is similar to but distinct from that in the work of Dhillon et al. [4], because we use Principal Component Analysis (PCA).

The work is structured as follows. We define the problem in Section 2, and we describe the data utilized in Section 3. The methodology of the experiments performed is covered in Section 4, and the results are presented in Section 5. Section 6 presents a broader scope and possible applications of the proposed method. We conclude the work in Section 7.
2 Problem definition

We want to compare classification schemes in order to see how well they separate the space of text documents. Since there is no perfect scheme, we have to find some intrinsic measure of space separation. Therefore, we compare the schemes against a clustering. It is clear that introducing clustering introduces a strong bias stemming from the selection of the clustering method and its parameters. For a specific classification problem and its data representation, an appropriate clustering method might be well known from experience; in some cases, the clustering parameters can even be found theoretically. In these cases, the aforementioned bias is justified. Additionally, the task of selecting the clustering method and parameters based on a predefined data representation and classification scheme remains a very interesting problem.

A classification scheme divides the given space in a certain way. This division has a number of geometric properties, such as balanced/unbalanced, convex/non-convex, and spherical/non-spherical. We want to generalize these geometric properties of a classification scheme into clustering parameters. Using the new clustering parameters, we would be able to reproduce the geometric properties of clusters on a new set of points or on a new part of the existing space. This investigation has not yet been performed; the experiments in the following sections address only the comparison question posed above.
3 Data

3.1 Document collection

Our document collection, NN9225, consists of 9225 legislative documents from the Republic of Croatia. The documents have been collected by the governmental agency HIDRA [3]. The collection covers dates from 1990 to 2006. Each document is labeled with labels from the two classification schemes described in the following subsection.
3.2 Classification schemes

We seek to compare two classification schemes in order to understand which one is better suited for the division of a large collection of documents. Both schemes seek to represent the semantic content of the documents. The difference between the two schemes is as follows: the first classification scheme (Issuer) is based solely on the issuer of the document, whereas the second classification scheme (Eurovoc) is based on the actual content of the document and was manually assigned by legal experts.
Issuer scheme

The Issuer classification scheme has 25 classes (issuing institutions) and was developed in an ad hoc manner by HIDRA to improve existing online retrieval of official documents. Note that the very same institution maintains the described document collection. If a document was issued by the Ministry of Defense, it is assigned the label Defense, interior affairs, and national security under the Issuer scheme. This scheme assumes that issuing institutions have narrow document topics. This classification scheme has obvious flaws. For example, the Ministry of Finance issues legal documents covering much broader topics than just finance, because this ministry deals with numerous aspects of the state. After reading such documents, human indexers might assign them multiple semantic labels such as livestock, agricultural subsidy, or environmental protection. In spite of the obvious impairment of the Issuer scheme, there was no need for automatic or manual content-based classifiers at the time of document labeling; there was solely a need for information about the issuing institution. Since the document labels were quite easily assigned, the separation of the data set was cheap.

[Fig. 1. Distribution of the number of Eurovoc labels over the NN9225 corpus (histogram; y-axis: number of documents, 0-4000; x-axis: number of class labels on a single document).]
Eurovoc scheme

Eurovoc is a parallel multilingual thesaurus maintained by the Office for Official Publications of the European Communities and officially used by many governmental bodies of the European Union [7]. The Eurovoc thesaurus has a hierarchy with up to eight levels of depth. Since we wanted to compare the flat Issuer scheme with the hierarchical Eurovoc thesaurus classifications, we simplified Eurovoc by flattening it to its first level of depth. Documents are assigned new labels, which are the first-level predecessors of their actual labels. Links in the hierarchy are assumed to be is-a relations between classes. For example, a document labeled with the class administrative science is assigned the grand predecessor class science. The flattened version of the Eurovoc thesaurus is called the Eurovoc scheme. Most of the documents in the NN9225 collection have from one to three first-level Eurovoc class labels (see Fig. 1).
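The flattening step can be sketched as walking each label's is-a links up to its first-level ancestor. The mini-hierarchy below is a made-up illustration (the label names besides the paper's own "administrative science" example are not actual Eurovoc descriptors):

```python
# Hypothetical is-a hierarchy: child -> parent links.
# A first-level class is a label with no recorded parent.
PARENT = {
    "administrative science": "social sciences",
    "social sciences": "science",        # "science" is first-level
    "rail transport": "land transport",
    "land transport": "transportation",  # "transportation" is first-level
}

def flatten(label, parent=PARENT):
    """Replace a label by its first-level predecessor by following
    is-a links upward until no parent remains."""
    while label in parent:
        label = parent[label]
    return label

print(flatten("administrative science"))  # science
```

Applied to every label of every document, this yields the flat Eurovoc scheme used in the experiments.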
Expected tradeoff

Since information about the issuer of legal documents is almost always available, the deployment of the Issuer scheme is very cost-effective. The cost of the Eurovoc scheme is determined either by the cost of maintaining personnel to index the documents or by the cost of constructing an automatic classifier. The two presented schemes are obviously different. Table 1 summarizes the expected tradeoff between the cost and the partition (class separation) quality.
Table 1. Expected tradeoff between the classification schemes

          Separation quality   Cost
Issuer    Low                  Very low
Eurovoc   High                 High
4 Methodology

4.1 Preprocessing

During the experiments presented in the following subsections, each document from the text collection was represented as a point in a vector space. This vector space was constructed using the standard bag-of-words approach, in which each word was linguistically normalized to its lemma [16]. Stop words were removed from the feature space. Our corpus of 9225 documents contains about 1.9 · 10⁵ features, so we used the information gain and χ² measures for feature selection. We chose to select only 3% of the features, since early tests showed that this reduction does not diminish the quality of subsequent processing. At the end of preprocessing, the well-known TF-IDF normalization was performed [14].
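As a minimal sketch of the bag-of-words representation with TF-IDF weighting and L2 normalization (lemmatization, the actual stop-word list, and the IG/χ² feature selection are omitted here; this is an illustration, not the pipeline used on NN9225):

```python
import math
from collections import Counter

def tfidf_vectors(docs, stop_words=frozenset()):
    """Build L2-normalized TF-IDF vectors over a whitespace-tokenized
    bag-of-words vocabulary (lemmatization omitted for brevity)."""
    tokenized = [[w for w in d.lower().split() if w not in stop_words]
                 for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # document frequency of each term
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors
```

Each document thus becomes a unit-length point in the vector space, which is the input representation the clustering below operates on.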
4.2 K-means clustering

Clustering is the division of data into groups of similar objects. Among the many different clustering algorithms, K-means is one of the simplest and most popular [6], [11], [13]. K-means is not capable of dealing with non-convex shapes, as [8] notes and [13] shows experimentally. Formally, the goal of K-means is to find the minimum of the following potential function:
F = \sum_{i=1}^{K} \sum_{x \in C_i} d(c_i, x)^2 , \quad c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x ,   (1)
where C_i is the i-th cluster, c_i its centroid, and d a distance function. Since finding an optimal clustering requires non-polynomial execution time, a heuristic algorithm is used [10]. Different distance measures, such as the Euclidean, Mahalanobis, and cosine-based distances, are used in practice. For the experiments performed here, the cosine distance is used because this choice generates hyperspherical clusters [5]. Additionally, a non-trivial seeding variant is used to enhance the quality and speed of the clustering [1]: the initial cluster centers are positioned in such a way that more clusters are present where the data are dense.
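A minimal sketch of K-means with the cosine distance (so-called spherical K-means, after [5]): on L2-normalized vectors, cosine similarity reduces to a dot product, and each concept vector is the renormalized cluster mean. For brevity, this sketch seeds naively with the first k points rather than with the careful seeding of [1]:

```python
import math

def spherical_kmeans(points, k, iters=20):
    """Lloyd-style K-means with cosine similarity on unit vectors.
    Returns per-point cluster assignments and the concept vectors
    (normalized centroids). Naive seeding; [1] describes a better scheme."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    pts = [normalize(p) for p in points]
    centroids = [list(p) for p in pts[:k]]  # naive seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            # cosine similarity is a dot product on unit vectors
            best = max(range(k),
                       key=lambda i: sum(a * b for a, b in zip(centroids[i], p)))
            clusters[best].append(p)
        for i, c in enumerate(clusters):
            if c:  # concept vector = normalized centroid
                centroids[i] = normalize([sum(col) for col in zip(*c)])
    assignments = [max(range(k),
                       key=lambda i: sum(a * b for a, b in zip(centroids[i], p)))
                   for p in pts]
    return assignments, centroids
```

Maximizing the cosine similarity to the nearest concept vector is equivalent to minimizing the cosine distance in Eq. (1).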
4.3 Comparison approach

The classification schemes are compared with the hyperspherical clusters produced by the K-means algorithm. The experiments presented here did not compare many different clustering methods and their parameters. The bias introduced
[Fig. 2. Clusters 2 and 3 have high precision with respect to class 1.]

[Fig. 3. Cluster 3 has high recall with respect to classes 1, 2, and 3.]
via the selection of a well-known clustering method is tolerated, because we expect the Issuer scheme to be inferior to the Eurovoc scheme in separating the document collection, since it is not based on actual content. The goal of the experiment was to show that our method can detect such separation impairments.

When dealing with texts in the bag-of-words space, we expect semantic classes to be separable by linear hyperplanes. Text categorization problems are usually linearly separable, as noted in [9]. If the classes are linearly separable, then they are convex as well. This justifies the use of K-means clustering as a simple baseline, because it generates hyperspherical clusters that are convex, able to cover the whole vector space of the presented points, and relatively balanced [15]. The more similar the shape of a class is to a union of disjoint and evenly distributed hyperspheres, the better the separation. A union of hyperspheres can of course be very complex; in our case, however, the number of clusters is small. Further, we observed after the completion of the experiments that the majority of documents from a particular class is found in just a few clusters. High recall for a cluster with respect to some class means that the cluster covers a great majority of that class; high recall thus tells us not about the shapes of classes but rather about the shapes of clusters (see Fig. 3). On the other hand, high precision for clusters with respect to classes means that the shapes of the classes in the vector space are spanned by the shapes of the clusters (see Fig. 2). Although we do not generalize this idea for an arbitrary number of classes, we establish a hypothetical link between the purity of hyperspherical clusters and the separation quality of classes in a classification scheme. Of course, this link has yet to be generally shown and theoretically explained.
While our corpus comprises legal documents, we expect that the comparison method can easily be extended to corpora in other subject domains and genres, since the preprocessing and clustering methods employed do not utilize any domain-specific features.
4.4 Comparison measures

In the light of this work, a comparison measure between two sets of classes (or clusters) over a set of examples should indicate their closeness. Like other authors [12], we compare a classification scheme to a clustering by utilizing the standard evaluation measures of information retrieval quality. Furthermore, we use the information-theoretic measure of entropy as well.
Precision

Precision, the standard information retrieval measure, is used to show how much cluster i conforms to class j:

p_{ij} = \frac{n_{ij}}{n_i} ,   (2)

where n_{ij} is the number of documents belonging to both cluster i and class j, and n_i is the number of documents belonging to cluster i.
Purity

The purity of a cluster is defined as the maximum precision over all classes:

\varrho_i = \max_j (p_{ij}) .   (3)
Entropy

Since precision p_{ij} is the probability that a text drawn at random from cluster i belongs to class j, the entropy of a cluster is calculated as follows:

E_i = - \sum_j p_{ij} \log p_{ij} .   (4)
The entropy of the whole clustering is defined as the weighted average over all of the clusters:

E = \sum_{i=1}^{K} \frac{n_i}{n} E_i ,   (5)

where n is the number of documents in the collection.
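Equations (2)-(5) can be computed directly from a contingency table of cluster and class assignments, as in this sketch (per-cluster purity and the weighted clustering entropy; natural logarithm assumed, since the base only rescales the entropy):

```python
import math
from collections import Counter

def cluster_measures(cluster_ids, class_ids):
    """Compute per-cluster purity (Eq. 3) and the weighted clustering
    entropy (Eqs. 4-5) from parallel lists of per-document cluster
    and class assignments."""
    n = len(cluster_ids)
    cluster_sizes = Counter(cluster_ids)
    joint = Counter(zip(cluster_ids, class_ids))  # n_ij counts
    classes = sorted(set(class_ids))
    purity, entropy = {}, 0.0
    for i in sorted(cluster_sizes):
        p = [joint[(i, j)] / cluster_sizes[i] for j in classes]  # Eq. (2)
        purity[i] = max(p)                                       # Eq. (3)
        e_i = -sum(pij * math.log(pij) for pij in p if pij > 0)  # Eq. (4)
        entropy += cluster_sizes[i] / n * e_i                    # Eq. (5)
    return purity, entropy
```

A clustering that exactly reproduces the classes has purity 1 for every cluster and entropy 0; the closer a scheme is to the clustering, the lower the entropy.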
4.5 Visualization approach

It is useful to visualize the data, because this provides direct insights and inferences from the generated illustrations. Among the many approaches to data visualization, a very popular one involves projecting the data onto a pair of components that demonstrate the relationships present in the data. In our case, we want to inspect how well the clusters generated by the K-means algorithm separate the vector space. While Linear Discriminant Analysis (LDA) is well suited for this task [6], we employ a slightly different approach using PCA. For each cluster, we find the concept vector (the normalized centroid). Then, we calculate the principal components of the concept vector space and project the whole corpus onto an arbitrary pair of principal components. Since PCA maximizes the variance of the concept projections, we expect the projected clusters to be
well separated on the visualization plots. This method ignores the within-class scatter of points in the vector space. On one hand, this simplification is a disadvantage, because losing part of the information can lead to suboptimal visualization in comparison to LDA. On the other hand, it is an advantage, because the computational complexity of the proposed method is drastically lower than that of LDA. This lower complexity arises because the matrix computations are performed on a concept-feature matrix rather than on a document-feature matrix, and the number of concepts is much smaller than the number of documents (observations). Recall that even if we keep only 3% of the original features³, the resulting document-feature matrix is still rather large for matrix calculations on ordinary personal computers. Since the visualization method is intended for interactive use on large data sets, its speed is important. By using the concept vectors, we ignore the within-class scatter; our approach is thus similar to the work of Dhillon et al. [4], the difference being that we use PCA instead of other matrix calculations to find the projections.

[Fig. 4. Purity of clusters with respect to the contained categories, sorted in descending order; legend: Eurovoc (IG), Issuer (IG), Eurovoc (χ²), Issuer (χ²); y-axis: purity (%), x-axis: cluster index.]
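The concept-based projection can be sketched as follows: PCA is run on the small k × d matrix of concept vectors (cheap, since k ≪ number of documents), and the whole corpus is then projected onto the resulting components. This is a simplified illustration of the idea, not the exact implementation used in the experiments:

```python
import numpy as np

def concept_pca_projection(X, assignments, k, n_components=2):
    """Compute concept vectors (normalized cluster centroids), run PCA
    on the small concept-by-feature matrix, and project the whole
    corpus X (documents x features) onto the top principal components."""
    concepts = []
    for i in range(k):
        members = X[np.asarray(assignments) == i]
        c = members.mean(axis=0)
        concepts.append(c / (np.linalg.norm(c) or 1.0))
    C = np.vstack(concepts)                 # k x d: cheap to decompose
    C_centered = C - C.mean(axis=0)
    # principal components = right singular vectors of the centered matrix
    _, _, vt = np.linalg.svd(C_centered, full_matrices=False)
    components = vt[:n_components]          # n_components x d
    return X @ components.T                 # n_documents x n_components
```

Because only the k × d concept matrix is decomposed, the cost is independent of the corpus size apart from the final projection, which is what makes interactive use feasible.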
5 Results

5.1 Comparison of the classification schemes

The comparison of the classification schemes can be seen in Fig. 4, which shows a plot of cluster purities. The clusters are sorted by their respective purities. The Eurovoc scheme is closer to the clustering than the Issuer scheme, because the clusters are purer with respect to the Eurovoc scheme. The graphical data presented in Fig. 4 are numerically summarized in Table 2, which reports the entropy measures. To eliminate the randomness of the K-means algorithm, the experiment was run five times; the entropy values presented in Table 2 are thus averages of five runs. Both entropy and purity support

³ 5,799 features
[Fig. 5. Separation of a cluster from the rest of the collection: the NN9225 corpus projected onto principal components PCA1 and PCA2 (legend: Cluster 13 vs. Not Cluster 13).]

[Fig. 6. A cluster spans a part of the Eurovoc class Transportation (No. 17): the same PCA1/PCA2 projection (legend: Eurovoc class 17, Cluster 13).]
our expected hypothesis, which stated that the Eurovoc scheme would be closer to the K-means clustering.
Table 2. Entropy of clusters with respect to the classification schemes

          Entropy        St. dev.
Issuer    18.7 · 10⁻³    1.4 · 10⁻³
Eurovoc   0.86 · 10⁻³    0.11 · 10⁻³
5.2 Visualization of clustering

The visualization method was manually evaluated within the research team. Depending on the principal component pair, we could see that the projections of most clusters were well separated from the other points. One such separation is shown in Fig. 5, in which the NN9225 corpus and one of its clusters are projected onto a pair of principal components. If a cluster is separable in the projected space, then it is separable in the original space as well, because the projection dimensions (principal components) are linear combinations of the original dimensions. Fig. 6 shows the relationship between a class and a generated cluster using the same principal components as in Fig. 5. Cluster 13 has high precision with respect to Eurovoc class 17, which means that Eurovoc class 17 is partly spanned by cluster 13. Since cluster 13 is easily discriminated from the rest of the documents, the spanned part of Eurovoc class 17 is easily discriminated from the rest of the documents as well. A formal evaluation of the visualization is currently being conducted, and manual evaluation with an expert team is in its initial stages.
6 Applications and impact

First, we can compare different directories offered by online services (e.g. Dmoz, Yahoo! Directory, and VLIB) to see how they fit certain text representations and cluster models. This is one of many examples in which the documents are already classified with two or more schemes. Second, our comparison approach is useful for validating additions, deletions, and other alterations of classification schemes suggested by human experts. Given a document set, its class labels, and a text representation, one could find the clustering method and parameters that produce the clusters closest to the classification scheme. We assume that this clustering will divide the space of texts in the same manner (i.e. using shapes with similar geometric properties) as the given classification. Therefore, a system could validate the suggested scheme changes and verify that they are in line with the rest of the scheme. Finally, our approach can be useful in a semi-supervised setting. For example, during self- or co-training [2], the clustering can define additional constraints for deciding whether an unlabeled example belongs to a certain class. Additionally, the geometric space can be explored in a much more principled manner for active learning [17], where the system chooses the examples that are to be manually annotated.
7 Conclusion

This work presents a methodology for comparing different classification schemes using clustering methods. Given a motivated text representation and distance measure, one can choose between several classification schemes according to their fit. The quantitative comparison measures are explained, and their usage is justified. The experiment has confirmed our expectations by showing that the difference in the separation quality of the two presented classification schemes can be detected using the proposed comparison method. We believe that the ideas introduced are relevant for a much broader scope than the present experiment. Possible future work could explore the effects of using different clustering methods. After that, we could generalize the presented methodology in order to compare hierarchical classification schemes instead of flat ones. Additionally, the inverse task, in which we find the clustering method or data representation that fits a given classification scheme, remains to be explored. Finally, the method could be coupled with semi-supervised techniques as discussed in the previous section.

We also propose a computationally efficient visualization method combining K-means clustering and PCA that ignores the within-class scatter. Possible future work could compare our approach to the work of [4]. The method was useful in discriminating clusters, and the visualization helped us to obtain direct insight into the relationships between classes and generated clusters in our collection of text documents.
Acknowledgement

This work has been jointly supported by the Ministry of Science, Education and Sports of the Republic of Croatia and the Government of Flanders under grants No. 036-1300646-1986 and KRO/009/06 (CADIAL).
References

1. Arthur, D., Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding. In: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035 (2007)
2. Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92-100 (1998)
3. Croatian Information Documentation Referral Agency. http://www.hidra.hr/
4. Dhillon, I.S., Modha, D.S., Spangler, W.S.: Class Visualization of High-Dimensional Data with Applications. Computational Statistics and Data Analysis, vol. 41(1), pp. 59-90 (2002)
5. Dhillon, I.S., Modha, D.S.: Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, vol. 42, pp. 143-175, Springer Netherlands (2001)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd Edition. Wiley, New York (2000)
7. EUROVOC thesaurus, European Union publications office. http://europa.eu.int/celex/eurovoc/
8. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On Clustering Validation Techniques. Journal of Intelligent Information Systems, vol. 17, pp. 107-145, Springer Netherlands (December 2001)
9. Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers (2002)
10. Lloyd, S.P.: Least Squares Quantization in PCM. IEEE Transactions on Information Theory, vol. 28, pp. 129-136 (1982)
11. Moens, M.-F.: Note on Clustering Large Document Collections. Technical Report, CADIAL, Katholieke Universiteit Leuven (July 2007)
12. Rosell, M., Kann, V., Litton, J.-E.: Comparing Comparisons: Document Clustering Evaluation Using Two Manual Classifications. In: Proceedings of the International Conference on Natural Language Processing (ICON-2004), Hyderabad, India (2004)
13. Satchidanandan, D., Chinmay, M., Ashish, G., Rajib, M.: A Comparative Study of Clustering Algorithms. Information Technology Journal, vol. 5, pp. 551-559 (2006)
14. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, vol. 34, pp. 1-47, ACM, New York (2002)
15. Su, M.-C., Chou, C.-H.: A K-means Algorithm with a Novel Non-Metric Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 674-680 (June 2001)
16. Šnajder, J., Dalbelo Bašić, B., Tadić, M.: Automatic Acquisition of Inflectional Lexica for Morphological Normalisation. Information Processing & Management, doi:10.1016/j.ipm.2008.03.006 (2008, accepted, to be published)
17. Tong, S., Koller, D.: Support Vector Machine Active Learning with Applications to Text Classification. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML-00), pp. 287-295, Stanford, California (2000)