Pseudo-Supervised Clustering for Text Documents

M. Maggini, L. Rigutini, and M. Turchi
Dipartimento di Ingegneria dell'Informazione
Università di Siena
Via Roma, 56 - Siena, Italy
{maggini,rigutini,turchi}@dii.unisi.it

Abstract

Effective solutions for Web search engines can take advantage of algorithms for the automatic organization of documents into homogeneous clusters. Unfortunately, document clustering is not an easy task, especially when the documents share a common set of topics, as in vertical search engines. In this paper we propose two clustering algorithms which can be tuned by the feedback of an expert. The feedback is used to choose an appropriate basis for the representation of documents, while the clustering is performed in the projected space. The algorithms are evaluated on a dataset containing papers from computer science conferences. The results show that an appropriate choice of the representation basis can yield better performance with respect to the original vector space model.

1 Introduction

Search engines are the most widely used service to access the resources available on the Web. One of the main issues in the design of the search interface is to properly organize the query results in order to ease the selection of the most appropriate result. Thus, ranking schemes, like the PageRank used by the Google search engine, have been proposed to order the result list according to an absolute, user-independent criterion. Other approaches organize the results into groups in order to direct the user's choice to the most interesting result subset (e.g. vivisimo.com). Another interesting feature of a search engine is the possibility to access a list of documents similar to a given one. This feature can be particularly useful for focused search engines, where the documents belong to a restricted set of topics and the formulation of a precise keyword-based query might be difficult. For example, the Citeseer (www.citeseer.com) search engine provides a widely used service to retrieve computer science papers gathered from the Web and in its current version provides

a navigation facility through the document corpus based on a hierarchical directory of topics. Since it is unfeasible to manually organize all the documents in a Web search engine, the application of automatic text processing techniques, like classification and clustering, is increasing. In this paper we propose two clustering methods that can exploit the feedback of an expert. Both methods consist of two steps. In the first step a set of example documents is organized into groups, either automatically or manually by an expert. This set is used to compute a basis for the representation of the documents. Two schemes proposed in the literature have been adopted: the Singular Value Decomposition (SVD) [4] and a variation of the Concept Matrix Decomposition (CMD) [5]. Then, the entire document corpus is represented using the chosen vector basis and is partitioned using a clustering algorithm. Thus the supervision of the expert can be used to bias the document representation to reflect the human clustering criteria. The paper is organized as follows. In the next section we introduce the vector space representation used for documents and the dimensionality reduction techniques proposed for Information Retrieval. In section 3 we describe the proposed pseudo-supervised clustering algorithms. Section 4 defines some indexes that can be used to evaluate the clustering results. Finally, in section 5 the results on a dataset containing about 1000 full papers from computer science conferences are reported and in section 6 the conclusions are drawn.

2 Document representation

In Automatic Text Processing, a widely used representation for text documents is the Vector Space Model [12]. In this model each document is represented by a vector in a |V|-dimensional space, where V is the term vocabulary. The value of the i-th component of the j-th vector is the weight of the i-th word in the j-th document. The most used schemes for word weighting are tf and tf-idf [13]. In the first scheme the weight is the term frequency in the document,

i.e. x_{ij} = f_{ij}, where f_{ij} is the number of occurrences of the i-th word in the j-th document. In the tf-idf scheme the term frequency is weighted by the inverse document frequency, i.e. x_{ij} = f_{ij} \cdot \log(d/d_i), where d_i is the number of documents containing the i-th term and d is the total number of documents in the corpus. This scheme assumes that a word is more informative if it is not common in the set of documents [11]. In both cases the vector can be normalized to obtain a unit norm vector. Given the vector space representation of two documents, we can evaluate their similarity using a distance measure between the two vectors. The most used metric is the cosine correlation

d(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|} = \frac{\sum_{r=1}^{n} x_{i,r} \, x_{j,r}}{\sqrt{\sum_{r=1}^{n} x_{i,r}^2} \, \sqrt{\sum_{r=1}^{n} x_{j,r}^2}}    (1)

which is related to the number of terms shared by the two documents. Thus, two vectors x_i and x_j are similar if d(x_i, x_j) ≈ 1.

The Vector Space Model presents some structural problems. An evident limit of this approach is the high dimensionality of the representation, because even for short texts a vector of |V| dimensions is used. Moreover, each word is removed from its original context and its correlation with adjacent words is lost. Finally, each term is considered as an independent component, without considering the semantic relationships which exist among the terms in the vocabulary (synonymy, hypernymy, hyponymy, etc.). For example, a document containing only the word "pear" is completely uncorrelated with a document consisting of the single word "apricot", even if from a semantic point of view they both deal with the concept of "fruit". The resulting vector is very sparse (most of the documents contain about 1-5% of the total number of terms in the vocabulary) and it contains many common and low informative words. Thus, many feature reduction techniques have been proposed to select the vocabulary for a given corpus. To remove common and rare words, the Luhn reduction [9] is often performed. Following the Zipf law f × pos ≈ k, Luhn derives the importance of a term from its frequency in the documents and suggests that the relevant words belong to an intermediate interval of frequency. In this way it is possible to choose two cut-off frequencies and remove from the vocabulary the words whose frequencies fall outside this interval.

In order to reduce the dimensionality of the vector space representation, some techniques based on the projection of the original vectors to a low dimensional space have been proposed in the literature. Each method defines a different vector basis for the projection. The simplest technique consists of choosing a random basis, yielding the random projection reduction [15]. Other techniques extract the projection basis by an analysis of the matrix representing the documents in the corpus. Given a collection D of d documents represented in a vector space of dimension w, we can define the word-by-document matrix X ∈ R^{w×d} in which each column is a document vector. Two popular methods which use the word-by-document matrix are the Singular Value Decomposition (SVD), also known as Latent Semantic Indexing [4], and the Concept Matrix Decomposition (CMD) [7, 5]. Both approaches use the statistics of the distribution of words in the document corpus to derive a sort of semantically based projection.
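As an illustration of the weighting and similarity measures above, the following sketch builds unit-norm tf-idf vectors with scikit-learn and evaluates the cosine correlation of Eq. (1). The toy corpus and variable names are ours, and scikit-learn's idf uses a smoothed variant of the formula given above.

```python
# Minimal sketch: tf-idf vectors and the cosine correlation of Eq. (1).
# Note: documents are stored as rows here, while the paper uses a
# word-by-document matrix with documents as columns.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "clustering of text documents with k-means",
    "latent semantic indexing for document retrieval",
    "support vector machines for text classification",
]

vectorizer = TfidfVectorizer(norm="l2")          # unit-norm tf-idf vectors
X = vectorizer.fit_transform(corpus).toarray()   # shape: (n_docs, |V|)

def cosine(xi, xj):
    """Cosine correlation d(x_i, x_j) of Eq. (1)."""
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

print(cosine(X[0], X[1]))  # close to 1 only if the documents share many terms
```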

2.1 Singular Value Decomposition (SVD)

This method computes the Singular Value Decomposition (SVD) of X and keeps the first k columns of the left and right singular vector matrices, collected in U_k ∈ R^{w×k} and V_k^T ∈ R^{k×d}, corresponding to the k largest singular values. Thus, the word-by-document matrix is factorized as

X = U \Sigma V^T \;\Rightarrow\; \tilde{X}_k = U_k \Sigma_k V_k^T .

The new matrix \tilde{X}_k is called the k-truncated SVD of the matrix X and it is the reconstruction of the original matrix after a projection of the documents onto the space spanned by the k principal left singular vectors of X, collected in the matrix U_k. This approach is computationally expensive since it requires the computation of the SVD of X.
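The k-truncated SVD can be computed directly with NumPy; a minimal sketch follows, assuming a word-by-document matrix X with tf-idf weights (the random matrix below is only a stand-in for a real corpus).

```python
# Sketch of the k-truncated SVD (LSI) projection described above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 80))        # stand-in for a w x d word-by-document matrix
k = 10

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X_k = Uk @ Sk @ Vkt              # rank-k reconstruction of X
Z = Uk.T @ X                     # k-dimensional document representations (columns)
```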

2.2 Concept Matrix Decomposition (CMD)

The idea of the CMD method is to use a basis which describes a set of concepts represented by reference term distributions. To obtain these reference distributions we need to compute a partition Π = {π_1, π_2, ..., π_k} of the reference document collection D, such that

\bigcup_{j=1}^{k} \pi_j = \{x_1, x_2, ..., x_d\} = D , \qquad \pi_j \cap \pi_l = \emptyset \;\; \text{for} \; j \neq l .

The partition can be obtained by applying a clustering algorithm to the set of documents in D. Each partition π_j corresponds to a Concept Vector c_j, obtained as the normalized centroid of the document vectors in the partition,

c_j = \frac{\sum_{x_i \in \pi_j} x_i}{\left\| \sum_{x_i \in \pi_j} x_i \right\|} .    (2)

The Concept Matrix is the matrix C_k ∈ R^{w×k} containing the k concept vectors,

C_k = [\, c_1 \; c_2 \; ... \; c_k \,] .



Now we can define the Concept Decomposition \tilde{X}_k of X as the projection of X onto the space induced by C_k,

\tilde{X}_k = C_k Z^* , \qquad Z^* = \arg\min_Z \|X - C_k Z\|_F^2 ,

where the minimization is required by the fact that the given basis is not orthonormal. The columns of the matrix Z^* represent the documents in the concept matrix space. If we add the additional constraint that the components of the document representation in the concept space must be non-negative, i.e. z^*_{ij} ≥ 0, we obtain the Positive Concept Matrix Decomposition.
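A minimal sketch of the concept decomposition, assuming the partition labels come from a preliminary clustering of D; the function names are ours.

```python
# Concept vectors as normalized cluster centroids (Eq. 2) and the
# least-squares projection Z* = argmin_Z ||X - C_k Z||_F^2.
import numpy as np

def concept_matrix(X, labels, k):
    """X: w x d word-by-document matrix; labels[j] in {0..k-1} is the cluster of document j."""
    C = np.zeros((X.shape[0], k))
    for j in range(k):
        s = X[:, labels == j].sum(axis=1)
        C[:, j] = s / np.linalg.norm(s)       # Eq. (2): normalized centroid
    return C

def concept_decomposition(X, C):
    # C is not orthonormal, hence a least-squares solve is needed.
    Z, *_ = np.linalg.lstsq(C, X, rcond=None)
    return Z                                   # k x d projected representations
```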

3 Pseudo-Supervised Clustering in Textual Domains

A clustering algorithm is applied to a set of data to group them into a given number of clusters on the basis of common features in their representations. Thus, a clustering algorithm needs a distance measure in order to compare the different items and to group those which are most similar with respect to the given measure. In the case of textual documents we can use the vector space representation and the cosine correlation distance to perform the so-called spherical clustering. Alternatively, we can project the vector space representations to a low dimensional Euclidean space by using one of the techniques described in the previous section, and then apply the clustering algorithm to the document representations in the projected space. A widely used clustering algorithm is K-Means [10, 14].

As shown in the previous section, dimensionality reduction techniques applied to the document domain may have interesting properties derived from the statistical distribution of terms in the document corpus. In particular, the SVD decomposition was shown to improve retrieval performance due to the potential discovery of latent semantic relationships among different words, whilst the concept matrix decomposition exploits the prototypes of certain homogeneous sets of documents. Thus, the projection basis can be chosen in order to extract a representation which biases a clustering algorithm towards more "meaningful" partitions with respect to human criteria. However, notice that the choice of the clusters is not an obvious task even for a human expert, since many partitions can be feasible for a given document set. Moreover, we can choose the set of documents used to extract the projection basis according to the feedback of an expert. This procedure can be embedded in an iterative algorithm in which an expert evaluates the cluster quality and enriches the set used to compute the projection basis with a proper partition of the documents it contains. Using this approach, we designed two pseudo-supervised clustering algorithms, in which the human evaluation can optionally be used to refine the projection basis for the dimensionality reduction. We defined both a pseudo-SVD and a pseudo-CMD using the proposed framework. The first step of both algorithms is a pre-clustering of the set used to compute the projection basis. This step requires choosing the documents to be used to compute the basis and may exploit an expert's feedback to properly partition this set. Then, the whole document corpus is projected to the reduced space and an unsupervised clustering is performed on the projected vectors.

3.1 Pre-clustering

Given a document collection D, we identify a subset T of D and a starting value for k (i.e. the number of clusters). These values can be set using a-priori knowledge (a human pre-clustered set) or by running a clustering algorithm for different values of k and choosing the best one according to a cluster quality measure (e.g. the Silhouette index). It is important that the subset T and k* reflect the statistical properties of the entire data set; therefore, the subset must be wide enough to represent the variability of the entire corpus.
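A possible implementation of this selection step, assuming no human pre-clustering is available, is sketched below: k-means is run on the vectors of T for several candidate values of k and the value with the best Silhouette index is kept (names and parameters are illustrative).

```python
# Choose k* on the subset T via the Silhouette index.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(T_vectors, candidates=range(2, 16)):
    """T_vectors: (n_docs x n_features) array of tf-idf or projected vectors."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(T_vectors)
        score = silhouette_score(T_vectors, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```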

3.2 Pseudo-SVD and Pseudo-CMD

Given the partitioning Π* of T into k* clusters and the whole document set D, we can approximate the word-by-document matrix X of D by projecting it onto the basis vectors collected as columns of the matrix S,

\tilde{X}_{V(k)} = S Z .

The matrix Z contains the projected representations of the documents, while the matrix S depends on the chosen projection technique (SVD or CMD). Then the document set D is partitioned by applying the k-means algorithm to the projected representations.

3.2.1 Pseudo SVD

Given the partitions π_i, i = 1, ..., k*, of the set T, we would like to choose a new reference system to represent the documents, which maximizes the dissimilarity between the different clusters while retaining as much information as possible. Thus, for each cluster π_i we compute the set of principal directions that represent that cluster. If v_i is the number of principal components selected for cluster i, we obtain V = \sum_{i=1}^{k^*} v_i components for the new basis. The principal directions for each cluster are obtained by an SVD of the corresponding word-by-document matrix. Thus the matrix S is composed by juxtaposing the matrices U_i, i = 1, ..., k*, obtained by computing the SVD of the word-by-document matrix of cluster i,

S = [\, U_1 \; U_2 \; ... \; U_{k^*} \,] .

Then we define the Pseudo SVD of X as the matrix \tilde{X}_k = S Z^*, with Z^* computed as the solution of the least squares problem

Z^* = \arg\min_Z \|X - S Z\|_F^2 .
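A sketch of the Pseudo-SVD basis construction under these definitions, with the same number v of components kept for every cluster and illustrative function names, could look as follows.

```python
# Pseudo-SVD basis: per-cluster left singular vectors juxtaposed into S,
# then the documents of D are represented by the least-squares solution Z*.
import numpy as np

def pseudo_svd_basis(cluster_matrices, v):
    """cluster_matrices: list of w x d_i matrices X_i; v: components kept per cluster (v <= d_i)."""
    blocks = []
    for Xi in cluster_matrices:
        Ui, _, _ = np.linalg.svd(Xi, full_matrices=False)
        blocks.append(Ui[:, :v])
    return np.hstack(blocks)                     # S = [U_1 U_2 ... U_k*]

def project(X, S):
    Z, *_ = np.linalg.lstsq(S, X, rcond=None)    # Z* = argmin_Z ||X - S Z||_F^2
    return Z
```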

3.2.2 Pseudo CMD

This method aims at deriving a basis which describes the peculiar features of each partition π_i by searching for the elements common to the documents of each cluster. Let X_i ∈ R^{w×d_i} be the word-by-document matrix for cluster i. For each cluster we can construct the corresponding concept vector as shown in equation (2). Then, we can define the set of word clusters which will be used to represent the distinctive features of each cluster. A word belongs to the word cluster W_j if its weight in the concept vector c_j is greater than the weight of the same word in all the other concept vectors. For each partition X_i we keep only the components related to the words in the word cluster W_i, and we set all the others to zero, obtaining a new w × d_i matrix X'_i. We can further sub-partition each cluster X'_i to obtain more than one direction for each original partition. If v_i is the chosen number of directions for cluster i, then we obtain a set C_i = {c_{i1}, ..., c_{iv_i}} of concept vectors. Each sub-partition corresponds to a word vector W_{ij}, j = 1, ..., v_i, obtained as described previously. Thus, each partition π_i is represented by a set of directions given by the concept vectors c_{ij}, where only the components corresponding to the non-null elements of the associated word vectors are kept, while all the others are set to zero. These vectors are collected in a matrix D_i. Finally, these matrices are juxtaposed to obtain the w × V matrix S,

S = [\, D_1 \; D_2 \; ... \; D_{k^*} \,] .

Thus, the Pseudo CMD can be written as X^* = S Z, where Z is a V × d matrix. To compute Z we can take advantage of a property of S which derives from its construction. Since a word (vector component) is assigned to a unique word cluster, the matrix S contains several disjoint groups of words: each word is associated to a single basis vector and only in the corresponding column does it have a non-null value. Since each column of S has unit norm, we have S^T S = I, i.e. the columns of S form an orthonormal set, and therefore

S^T X^* = S^T S Z = Z ,

so that Z = S^T X^*. Since each word is exclusively associated to one direction, we obtain a reduced model in which all the values are non-negative. This method has a very low computational cost compared to the other methods (SVD, CMD, Positive CMD, Pseudo-SVD); however, it causes a greater loss of information.
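The following sketch illustrates the Pseudo-CMD basis for the simplest case v_i = 1 (one direction per cluster): each word is assigned to the concept vector where its weight is largest, the remaining components are zeroed, and the columns are re-normalized so that the projection reduces to a matrix product; the helper names are ours.

```python
# Pseudo-CMD basis (v_i = 1 case): masked, re-normalized concept vectors.
# Since the columns of S have disjoint supports, S^T S = I and the
# projected representation is simply Z = S.T @ X.
import numpy as np

def pseudo_cmd_basis(C):
    """C: w x k concept matrix (Eq. 2). Returns the masked, re-normalized basis S."""
    word_cluster = C.argmax(axis=1)        # word cluster W_j of each word
    S = np.zeros_like(C)
    for j in range(C.shape[1]):
        mask = word_cluster == j
        S[mask, j] = C[mask, j]            # keep only the words assigned to W_j
        norm = np.linalg.norm(S[:, j])
        if norm > 0:
            S[:, j] /= norm                # unit-norm column
    return S

# Usage sketch: S = pseudo_cmd_basis(C); Z = S.T @ X   (no least squares needed)
```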

4 Evaluation of cluster quality

The evaluation of the validity (quality or accuracy) of the output of clustering algorithms is a difficult task in general. Measures of the quality of the cluster sets can be of two types, internal or external [8, 6]. Internal measures evaluate the quality of the clustering using the internal properties of the data and of the clusters (for example the distance between the cluster centers or the density of the clusters), while external criteria measure how well the clusters match some prior knowledge about the data set. In many applications, in fact, clustering algorithms are considered to have a satisfactory performance if they partition the data according to criteria which are meaningful for humans. Document clustering is one of these cases. In this task an expert (or a group of experts) can label the data using a set of class labels corresponding to different topics. The quality of the clustering algorithm can then be evaluated by analyzing the distribution of the topics in each cluster. Any accuracy assessment based on this procedure is relative to the particular classification chosen by the expert. To introduce the most used external measures, it is necessary to give some definitions. Given a set of labels (topics in the text domain) A = {A_i | i = 1, 2, ..., k} and a set of clusters C = {C_j | j = 1, 2, ..., k} produced by a clustering algorithm f, we call contingency table the matrix H ∈ N^{k×k} whose element h(A_i, C_j) is the number of items with label A_i assigned to the cluster C_j. If we associate the cluster C_j to the topic A_{m(j)} for which C_j has the maximum number of documents, and we rearrange the columns of H such that j' = m(j), we obtain the confusion matrix F_m used in pattern recognition. Finally, h(C_j) indicates the number of items assigned to cluster C_j; this value corresponds to the sum of the elements in the j-th column of H.

4.1 Classification Accuracy and Error

This measure considers the clustering as a classification task by counting the number of patterns correctly (or incorrectly) clustered [6] with respect to the class A_{m(j)}, i.e. the prevalent class in the cluster C_j. Using the contingency table, we can evaluate the Accuracy of the classification for each cluster C_j as the fraction of its documents which belong to the prevalent class,

Acc_j = \frac{h(A_j, C_j)}{h(C_j)} .

A good clustering should yield an average accuracy close to 1, corresponding to clusters containing items from only one class. Analogously, we can define the Classification Error as the fraction of items not belonging to the prevalent class in the cluster C_j,

Err_j = 1 - Acc_j = \frac{\sum_{i \neq j}^{k} h(A_i, C_j)}{h(C_j)} .

The limitation of these measures is that they ignore the actual distribution of the incorrect classifications in each cluster.
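A small sketch of these external measures, assuming integer topic labels and cluster assignments for the d documents; the array and function names are illustrative.

```python
# Contingency table h(A_i, C_j), per-cluster accuracy Acc_j and error Err_j.
import numpy as np

def contingency(true_topics, cluster_ids, k):
    """true_topics, cluster_ids: integer arrays of length d with values in {0..k-1}."""
    H = np.zeros((k, k), dtype=int)
    for a, c in zip(true_topics, cluster_ids):
        H[a, c] += 1
    return H

def accuracy_per_cluster(H):
    h_C = H.sum(axis=0)                        # h(C_j): items assigned to cluster j
    acc = H.max(axis=0) / np.maximum(h_C, 1)   # prevalent class A_m(j) in each cluster
    return acc, 1.0 - acc                      # (Acc_j, Err_j)
```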

4.2 Human evaluation

In many applications requiring the automatic organization of documents, the topic distribution might not be sharp: many documents deal with more than one topic. Thus an evaluation based on a strict assignment to one class might not be precise; in fact, if we evaluate the algorithm accuracy on the table H, we may consider incorrect many results that indeed are correct. We can therefore evaluate the Classification Accuracy (or Error) by asking an expert to browse each cluster and identify the main topics it contains. This approach is equivalent to labeling each document with multiple class labels and considering an assignment to a cluster C_j correct if C_j agrees with any of the labels attached to the document.

4.3 Conditional Entropy

Another very commonly used measure is the conditional entropy E(A|C), which is defined in [6, 3] as

E(A|C) = - \sum_{i=1}^{|A|} \sum_{j=1}^{k} P(A_i, C_j) \log P(A_i|C_j) = E(A, C) - E(C) .

Using the contingency table, we can write

P(A_i, C_j) = \frac{h(A_i, C_j)}{d} , \qquad P(A_i|C_j) = \frac{h(A_i, C_j)}{h(C_j)} .

The conditional entropy is a good measure of clustering purity since it measures the number of bits required to code the composition of each cluster C_j. If each cluster is formed by documents belonging to a single class A_i, the number of necessary bits is E(C) (perfect correspondence between A and C) and E(A|C) = 0. If the distribution of the A_i over the C_j is uniform, the number of required bits is E(A) + E(C) and E(A|C) = E(A). The conditional entropy was used as an external measure in [2].
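The conditional entropy can be computed directly from the contingency table; a sketch follows (base-2 logarithms give the value in bits, any other base only rescales the measure).

```python
# Conditional entropy E(A|C) from the contingency table H,
# with P(A_i, C_j) = h(A_i, C_j)/d and P(A_i|C_j) = h(A_i, C_j)/h(C_j).
import numpy as np

def conditional_entropy(H):
    d = H.sum()
    h_C = H.sum(axis=0)
    E = 0.0
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            if H[i, j] > 0:
                p_joint = H[i, j] / d
                p_cond = H[i, j] / h_C[j]
                E -= p_joint * np.log2(p_cond)   # bits
    return E
```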

5 Experimental Results

We tested the Pseudo-SVD and Pseudo-CMD algorithms described in section 3 on a dataset composed of long documents. Each document in the dataset can cover different topics.

5.1 Data Preparation

We have evaluated our algorithms on a dataset composed of papers from computer science conferences. These documents are usually available in PDF format, so in the first processing step we extracted the text from the PDF files. In order to avoid noise due to limitations of the PDF parsers and to errors in the original documents, we filtered the terms extracted from the dataset using the Aspell 0.50.4.1 library [1]: each string is analyzed by Aspell, which returns the most similar word in its dictionary or null. Then, to reduce the dictionary size, we defined a stop-word list to remove common words and we applied the Luhn reduction (see section 2) with lower threshold ts_low = 0.2 and upper threshold ts_up = 15. To get a view of the topic distribution in the data set, we manually pre-clustered it into 10 topics, removing unknown or ambiguous documents (see table 1).

N.   Name                                                   N. Files
1    Fuzzy Control                                          112
2    Biological Evolutionary Computation                    240
3    Agent Systems                                          118
4    Global Brain Models                                    171
5    Wavelets Applications                                  68
6    Chaotic Systems                                        70
7    Neural Networks                                        134
8    Clustering and Classification                          86
9    Image Analysis and Vision                              114
10   Independent and Principal Component Analysis and SVM   104

Table 1. Distribution of the topics in the dataset.
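A rough sketch of this vocabulary filtering step (stop-word removal plus Luhn-style document-frequency cut-offs) using scikit-learn; the thresholds below are illustrative and do not correspond to the ts_low/ts_up values used in the experiments.

```python
# Vocabulary construction: stop-word removal and Luhn-style frequency cut-offs.
from sklearn.feature_extraction.text import CountVectorizer

def build_vocabulary(docs, stop_words, min_df=5, max_df=0.5):
    # min_df / max_df play the role of the lower / upper Luhn cut-offs:
    # drop terms appearing in fewer than 5 documents or in more than half of them.
    cv = CountVectorizer(stop_words=list(stop_words), min_df=min_df, max_df=max_df)
    cv.fit(docs)
    return cv.get_feature_names_out()
```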

Method    Entropy   Accuracy   Human evaluated
PCMD 4    0.7105    0.3743     0.4199
PCMD 7    0.7084    0.3387     0.4110
PCMD 10   0.7093    0.2892     0.3358
PSVD 4    0.6449    0.4719     0.6609
PSVD 7    0.6229    0.4838     0.7097
PSVD 10   0.6115    0.4691     0.6874
k-means   0.6862    0.3895     0.5731

Table 2. Quality of the different clustering methods when partitioning the dataset into 10 clusters.

Figure 1. Topic distribution in the 10 clusters for the Pseudo-SVD algorithm with v=7.

5.2 Results

We applied the following three clustering algorithms to the dataset: k-means, Pseudo-SVD (PSVD) and Pseudo-CMD (PCMD). Each algorithm was run setting the number of clusters to 10 (the number of chosen topics). For PSVD and PCMD we varied the number of principal components v ∈ {4, 7, 10}. In total, we obtained seven different partitions of the data set and, for each of them, we estimated the quality of the clustering by evaluating the Accuracy, the Conditional Entropy, and the accuracy based on a human evaluation. Table 2 reports the average values of the classification accuracy, the entropy and the human evaluated accuracy for each clustering algorithm. According to accuracy, the best method is PSVD with v = 7, while according to entropy the best performing one is PSVD with v = 10. PSVD 7 is the best according to accuracy and the second best with respect to entropy, so it is the natural candidate as the best method; a similar consideration holds for PSVD 10, which is the best according to entropy and has a good classification accuracy. In any case, the combined accuracy/entropy results for PSVD are better than the other ones, while those for PCMD are the worst. In figure 1 we report the composition of the clusters for the case PSVD 7 when considering the ten classes listed in table 1. Analyzing the results, we note that the values of accuracy and entropy are rather poor (under 50% of correct results and over 0.6 of entropy). This happens because the data set has many transversal topics that distort the results. To point out this fact we performed a manual evaluation of the clusters for each clustering algorithm and we evaluated the accuracy using the expert's judgments. The results are much better and show higher accuracy percentages (about 65-70%). An important fact that appears from table 2 is that the quality ordering of the clustering algorithms is unchanged: the PSVD algorithms seem

to be better than k-means and PCMD and, in particular, PSVD 7 is still the best one. By analyzing the results of the human evaluation (figure 2), we can see that there are some classes with very low accuracy (for example class 5). This class corresponds to the topic 'Wavelets' and its accuracy is so low because many documents dealing with wavelets deal mainly with 'Image Analysis and Vision' and were therefore not assigned to that cluster. The same happens for many other documents in many other clusters, and it derives from the choice of a flat clustering scheme. Moreover, the manual pre-clustering process created clusters according to one possible scheme, i.e. according to the method used in each paper (ANN, Fuzzy Control, PCA, etc.). This partition, however, does not take into account the fact that ANNs, Genetic Algorithms and the other methods can be applied to various fields (aerospace, robotics, bio-informatics and others). Thus, different partitions are consistent with this dataset.

Figure 2. Human expert's evaluation of cluster accuracy.

6 Conclusions

We have presented two clustering algorithms for text documents which use a clustering step also in the definition of the basis for the document representation. By an appropriate choice of the basis, we can exploit the prior knowledge of human experts about the data set and bias the feature reduction step towards more significant representations. The algorithms were compared on a dataset composed of conference papers on computer science. The results show that the PSVD algorithm is able to perform better than the k-means based on the original vector space representations; on the other hand, the PCMD did not show satisfactory performance. However, these algorithms can be used to iteratively refine the clustering by adding or removing items from the set used to compute the document representation basis.

Acknowledgments We thank Carmelo Floriddia for his contribution in the early stages of this work.

References

[1] K. Atkinson. GNU Aspell 0.50.5 - Manual, 2004.
[2] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[3] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[4] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[5] I. Dhillon and D. Modha. Concept decomposition for large sparse text data using clustering. Machine Learning, 42(1/2):143-175, 2001.
[6] B. E. Dom. An information-theoretic external cluster-validity measure. Technical report, IBM Research Division, 2001.
[7] G. Karypis and E.-H. Han. Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. In Proceedings of CIKM-00, pages 12-19, 2000.
[8] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[9] H. Luhn. The automatic derivation of information retrieval encodements from machine-readable text. Information Retrieval and Machine Translation, 3(2):1021-1028, 1959.
[10] E. Rasmussen. Clustering algorithms. In Information Retrieval, Prentice Hall, 1992.
[11] G. Salton and M. J. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[12] G. Salton, A. Wong, and C. S. Yang. A vector space model for information retrieval. Journal of the American Society for Information Science, 18(11):613-620, November 1975.
[13] G. Salton, C. Yang, and C. Yu. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, 26(1):33-44, 1975.
[14] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In Proceedings of the KDD Workshop on Text Mining, 2000.
[15] S. Vempala and R. I. Arriaga. An algorithmic theory of learning: robust concepts and random projection. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 616-623, 1999.
