Document Clustering using Particle Swarm Optimization (Extended Abstract)
Xiaohui Cui, Thomas E. Potok
Applied Software Engineering Research, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6085
cuix, [email protected]

1. Introduction

Document clustering is widely used in many fields, including text mining and information retrieval. Clustering divides a set of objects into a specified number of clusters so that objects in the same cluster are more similar to one another than to objects in other clusters. There are two major styles of clustering techniques: “partitioning” and “hierarchical” [1]. Most document clustering algorithms can be classified into these two groups. Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. The partitioning method seeks to partition a collection of documents into a set of non-overlapping groups so as to maximize the evaluation value of the clustering. Although hierarchical clustering is often portrayed as the better-quality approach, it contains no provision for reallocating entities that may have been poorly classified at an early stage of the text analysis [2]. Moreover, the time complexity of this approach is quadratic [8]. The time complexity of the partitioning technique is almost linear, which makes it widely used. The best-known partitioning clustering algorithms are the k-means algorithm and its variants. In addition to k-means, several algorithms, such as the Genetic Algorithm (GA) [3] and Self-Organizing Maps (SOM) [5], have been used for document clustering. Particle Swarm Optimization (PSO) [4] is another computational intelligence method that has already been applied to image clustering [6]. However, to the best of the authors’ knowledge, PSO has not been used to cluster text documents. In this study, a document clustering algorithm based on PSO is proposed.

PSO is a population-based stochastic optimization technique that can be used to find an optimal, or near-optimal, solution to a numerical and qualitative problem.
A problem space in PSO has as many dimensions as are needed to model the real problem space. A particle’s location in the multi-dimensional problem space represents one solution to the problem. When a particle moves to a new location, a different solution is generated. This solution is evaluated by a fitness function that provides a quantitative value of the solution’s utility. Each particle remembers its current coordinates, its velocity (the speed of its movement along each dimension of the problem space), the best fitness value it has achieved so far, and the coordinates where that value was obtained. This personal best, combined with the best value found by the particle’s neighbors, influences the movement of each particle through the problem space.
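The abstract does not state the update rule that moves a particle. As a minimal sketch, the block below assumes the commonly used inertia-weight form of PSO, in which the new velocity blends the old velocity with random pulls toward the personal best and the neighbors' best. The function name `update_particle` and the default coefficients `w`, `c1`, and `c2` are illustrative assumptions, not values taken from this paper.

```python
import random


def update_particle(position, velocity, personal_best, neighbor_best,
                    w=0.72, c1=1.49, c2=1.49, rng=random):
    """Return (new_position, new_velocity) for one particle.

    Assumed inertia-weight update (not specified in the abstract):
        v <- w*v + c1*r1*(personal_best - x) + c2*r2*(neighbor_best - x)
        x <- x + v
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, neighbor_best):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


# A particle at rest at the origin is pulled toward the two best positions.
random.seed(0)
pos, vel = update_particle([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])
```

A particle that already sits on both best positions with zero velocity stays where it is, which is why converged swarms stop moving.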

2. Description of Algorithm

2.1 Document Data Model

Each text document can be represented using the Vector Space Model (VSM) [7]. In this model, the content of a document is formalized as a point in a multi-dimensional space and represented by a vector $d = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ ($i = 1, 2, \ldots, n$) is the weight of term $t_i$ in the document. The term weight represents the significance of the term in the document. To calculate it, both the frequency of the term within the document and its frequency across the entire set of documents need to be considered. The most widely used weighting scheme combines Term Frequency with Inverse Document Frequency (TF-IDF) [7]. The weight of term i in document j is given in equation 1.

$$ w_{ji} = tf_{ji} \times idf_{ji} = tf_{ji} \times \log_2 (n / df_{ji}) \qquad (1) $$

where $tf_{ji}$ is the number of occurrences of term i in document j, $df_{ji}$ is the document frequency of term i, that is, the number of documents in the collection that contain it, and n is the total number of documents in the collection. This weighting scheme discounts frequent words that have little discriminating power.
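As a small illustration of equation 1, the sketch below computes TF-IDF weights for tokenized documents. The function name `tfidf_weights` and the toy documents are assumptions made for this example, and the document frequency is taken to be the number of documents that contain the term.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of token lists. Returns (vocabulary, weight_vectors)."""
    n = len(documents)
    term_counts = [Counter(doc) for doc in documents]
    # df[t]: number of documents that contain term t at least once
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())
    vocabulary = sorted(df)
    # Equation 1: weight = tf * log2(n / df)
    vectors = [
        [counts[t] * math.log2(n / df[t]) for t in vocabulary]
        for counts in term_counts
    ]
    return vocabulary, vectors


docs = [["swarm", "swarm", "cluster"], ["cluster", "text"]]
vocabulary, vectors = tfidf_weights(docs)
# "cluster" appears in every document, so its weight is 0 in both vectors.
```

The term that occurs in every document receives weight zero, which is the discounting effect described above.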

2.2 Similarity Metric

Clustering analysis requires a measure of the similarity between two documents. Distance measures are widely used for this purpose. The Euclidean distance between two points in the document vector space can be used to compute how similar the corresponding documents are. Because distance ranges vary with the number of dimensions, this algorithm uses the normalized Euclidean distance as the similarity metric between two documents $m_p$ and $m_j$ in the vector space, so that threshold distances remain comparable. Equation 2 is the distance measurement formula.

$$ d(m_p, m_j) = \sqrt{\sum_{k=1}^{d_m} (m_{pk} - m_{jk})^2 / d_m} \qquad (2) $$

where $m_p$ and $m_j$ are two document vectors, $d_m$ is the number of dimensions of the vector space, and $m_{pk}$ and $m_{jk}$ are the weight values of documents $m_p$ and $m_j$ in dimension k.

2.3 PSO Clustering

The objective of the PSO clustering algorithm is to find cluster centroids that minimize the intra-cluster distance while maximizing the distance between clusters. In the PSO clustering algorithm, the multi-dimensional document vector space is modeled as the problem space in PSO. Each term in the documents represents one dimension of the problem space. Each document vector can be represented as a point in the problem space. A single particle in the swarm represents one possible solution for clustering the document collection. Therefore, the swarm represents a number of candidate clustering solutions for the document collection. Each particle maintains a matrix $X_i = (C_1, C_2, \ldots, C_i, \ldots, C_k)$, where $C_i$ represents the ith cluster centroid vector and k is the number of clusters. At each iteration, the particle adjusts the positions of the centroid vectors in the vector space according to its own experience and that of its neighbor particles. The average distance between documents and their cluster centroid is used as the fitness value to evaluate the solution represented by each particle. The fitness value is given by equation 3.

$$ f = \frac{\sum_{i=1}^{N_c} \left\{ \frac{\sum_{j=1}^{p_i} d(o_i, m_{ij})}{p_i} \right\}}{N_c} \qquad (3) $$

where $m_{ij}$ is the jth document vector belonging to cluster i, $o_i$ is the centroid vector of the ith cluster, $d(o_i, m_{ij})$ is the distance between document $m_{ij}$ and the cluster centroid $o_i$, $p_i$ is the number of documents that belong to cluster $C_i$, and $N_c$ is the number of clusters.
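The normalized Euclidean distance of equation 2 and the fitness of equation 3 can be sketched together as follows. Two points are assumptions of this sketch rather than statements from the abstract: each document is assigned to its nearest centroid, and an empty cluster contributes zero to the sum. The function names are illustrative.

```python
import math


def normalized_euclidean(a, b):
    """Equation 2: sqrt(sum_k (a_k - b_k)^2 / d_m), with d_m = len(a)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def fitness(centroids, documents):
    """Equation 3: mean over clusters of the mean document-to-centroid distance."""
    # Assumption: each document belongs to the cluster of its nearest centroid.
    members = [[] for _ in centroids]
    for doc in documents:
        distances = [normalized_euclidean(c, doc) for c in centroids]
        members[distances.index(min(distances))].append(doc)
    total = 0.0
    for centroid, docs in zip(centroids, members):
        if docs:  # Assumption: an empty cluster adds nothing to the sum.
            total += sum(normalized_euclidean(centroid, d) for d in docs) / len(docs)
    return total / len(centroids)


centroids = [[0.0, 0.0], [10.0, 10.0]]
documents = [[1.0, 1.0], [-1.0, -1.0], [10.0, 10.0]]
score = fitness(centroids, documents)  # smaller values mean tighter clusters
```

In a PSO run, `centroids` would be the matrix held by one particle, and `fitness` would be evaluated once per particle per iteration.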

3. Experiment and Result

A collection of 300 documents is randomly selected from the TREC data set. Very common words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”) are stripped out completely, and different forms of a word are reduced to one canonical form using Porter’s algorithm. To reduce the impact of length variations among documents, each document vector is normalized to unit length. For each simulation, 10 particles are generated and the number of iterations is fixed at 100. Ten simulations are performed separately and the fitness values are recorded. Figure 1 illustrates how the fitness value (the average distance between the cluster centroids and the documents assigned to them) improves over time. The smaller the fitness value, the better the clustering solution. After 30 iterations, the fitness value remains at a fixed level, which indicates that the PSO algorithm converges quickly. Unlike the k-means algorithm, which is sensitive to the initial seed selection, PSO can relocate entities, so a poor initial partition can be corrected at a later stage. In each simulation, the initial centroid vectors of each particle are randomly selected, and the final results indicate that all 10 particles arrive at very similar cluster solutions. Most of the time, they reach exactly the same cluster solution after 100 iterations.
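The unit-length normalization mentioned in the preprocessing can be sketched as below. The abstract does not say which norm is used; this example assumes the Euclidean (L2) norm, and the function name `to_unit_length` is illustrative.

```python
import math


def to_unit_length(vector):
    """Scale a vector so that its Euclidean (L2) norm is 1.

    Assumption: all-zero vectors are returned unchanged to avoid division by zero.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


long_doc = to_unit_length([30.0, 40.0])
short_doc = to_unit_length([3.0, 4.0])
# Both documents have the same term proportions, so they map to the same point.
```

After this step, two documents with the same relative term weights but different lengths are at distance zero from each other.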

Figure 1 PSO performance over the iterations (plot of fitness value, axis range 0.09 to 0.15, against iteration, 0 to 100; series: 10 particles)

4. Conclusion

In this study, a document clustering algorithm based on PSO is proposed. The experiment shows that the PSO document clustering algorithm can efficiently converge on an optimal or near-optimal solution. Another advantage of the PSO clustering algorithm is that it is not sensitive to the initial seed selection, so selecting good seeds is not necessary. Future research will focus on implementing the algorithm on our agent-based parallel computing platform for clustering large document collections.

5. References

[1] Berkhin, P. “Survey of clustering data mining techniques”. Accrue Software Research Paper, 2002.
[2] Everitt, B. “Cluster Analysis”. 2nd Edition. Halsted Press, New York, 1980.
[3] Jones, G., Robertson, A. M., Santimetvirul, C. and Willett, P. “Non-hierarchic document clustering using a genetic algorithm”. Information Research, 1(1), 1995.
[4] Kennedy, J., Eberhart, R. C. and Shi, Y. “Swarm Intelligence”. Morgan Kaufmann, New York, 2001.
[5] Merkl, D. “Text mining with self-organizing maps”. Handbook of data mining and knowledge, pp. 903-910, Oxford University Press, Inc., New York, 2002.
[6] Omran, M., Salman, A. and Engelbrecht, A. P. “Image classification using particle swarm optimization”. Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning 2002 (SEAL 2002), Singapore, pp. 370-374, 2002.
[7] Salton, G. and Buckley, C. “Term-weighting approaches in automatic text retrieval”. Information Processing and Management, 24(5), pp. 513-523, 1988.
[8] Steinbach, M., Karypis, G. and Kumar, V. “A Comparison of Document Clustering Techniques”. TextMining Workshop, KDD, 2000.
