Clustering of Text Document based on ASBO


Prakasha S, Dept. of ISE, RNSIT, Bangalore, India. E-mail: [email protected]
H R Shashidhar, Dept. of CSE, RNSIT, Bangalore, India. E-mail: [email protected]
Manoj Kumar Singh, Manuro Tech Research, Bangalore, India. E-mail: [email protected]

G T Raju, Dept. of CSE, RNSIT, Bangalore, India. E-mail: [email protected]

Abstract

Clustering is a powerful and useful technique in data mining that can be utilized in various ways from an application perspective. Clustering text documents by topic is an important task in organizing information, grouping the search engine results obtained from a user query, enhancing web crawling, and information retrieval. Partitional clustering algorithms, such as the family of k-means, are generally reported to perform well on document clustering. In this setting, the clustering problem can be considered as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Existing algorithms for k-means clustering converge to different local minima depending on the initialization and may produce empty clusters as a clustering solution. To solve this problem, we applied a newly developed optimization method based on human social behavior, called adaptive social behavior optimization (ASBO), which has a simple computational model and delivers a global solution. The proposed method is compared with another well-established swarm-based optimization method, particle swarm optimization (PSO), and with the frequently applied K-means algorithm. Performance criteria are critical in judging the quality of clusters; hence two dominant criteria that are well accepted by the research community, the F-measure and cluster purity, are evaluated for the proposed results in all cases. The vector space model is


applied to represent the documents mathematically. Our experimental results demonstrate that the proposed method can significantly improve the performance of document clustering in terms of accuracy and robustness without increasing the execution time much.

Keywords: Document clustering, Vector space model, K-means, PSO, ASBO.

1. Introduction

How to explore and utilize the huge amount of text documents is a major question and challenge for researchers in the areas of information retrieval and text mining. Document clustering can be applied directly to help users effectively navigate, summarize, and organize text documents. By organizing a large number of documents into meaningful clusters, document clustering can be used to browse a collection of documents or to organize the results returned by a search engine in response to a user's query. It can significantly improve the precision and recall of information retrieval systems, and it is an efficient way to find the nearest neighbors of a document. As the number of available web pages grows, it becomes more difficult for users to find documents relevant to their interests. Clustering is the classification of a data set into subsets (clusters), so that the data in each subset ideally share some common trait, often proximity according to some defined distance measure. It can enable users to find relevant documents more easily and also help users form an understanding of the different facets of a query provided to a web search engine. A popular clustering technique is K-means, in which the data is partitioned into K clusters. In this method, the groups are identified by a set of points called the cluster centers, and each data point belongs to the cluster whose center is closest.

The problem of document clustering is generally defined as follows: given a set of documents, we would like to partition them into a predetermined or automatically derived number of clusters, such that the documents assigned to each cluster are more similar to each other than to the documents assigned to different clusters. In other words, the documents in one cluster share the same topic, and the documents in different clusters represent different topics. There are two general categories of clustering methods: agglomerative hierarchical and partitional methods. In previous research, both have been applied to document clustering. Agglomerative hierarchical clustering (AHC) algorithms initially treat each document as a cluster, use various distance functions to compute the similarity between pairs of clusters, and then merge the closest pair. This merging step is repeated until the desired number of clusters is obtained. In contrast to the bottom-up approach of AHC algorithms, the family of k-means algorithms, which belongs to the category of partitional clustering, creates a one-level partitioning of the documents. The k-means algorithm is based on the idea that a centroid can represent a cluster. After selecting k initial centroids, each document is assigned to a cluster based on a distance measure between the document and each of the k centroids; then the k centroids are recalculated. This step is repeated until an optimal set of k clusters is obtained based on a criterion function. Generally speaking, partitional clustering algorithms are


well-suited for the clustering of large text databases due to their relatively low computational requirements and high quality. A key characteristic of partitional clustering algorithms is that a global criterion function is used, whose optimization drives the entire clustering process. The goal of this criterion function is to optimize different aspects of intra-cluster similarity, inter-cluster dissimilarity, and their combinations.

In [1], the authors present a particle swarm optimization (PSO) based document clustering algorithm; they apply a hybrid PSO clustering algorithm that combines PSO with K-means. Most existing techniques for document clustering rely on a "bag of words" document representation, in which each word of the document is considered a separate feature and word order is ignored. The authors in [2] investigate the use of phrases rather than words as document features for document clustering; they present a phrase grammar extraction technique and use the extracted phrases as features in a self-organizing map based document clustering algorithm. In [3], the authors compare and contrast two approaches to document clustering based on the suffix tree data model: the first is an efficient phrase-based document clustering, and the second is a frequent word/word-meaning sequence based document clustering. After researching various methods for document clustering, the authors in [4] put forward a dynamic method based on a genetic algorithm (GA). K-means is a greedy algorithm that is sensitive to the choice of cluster centers and very easily falls into local optima; a genetic algorithm is a globally convergent algorithm that can find good cluster centers more reliably. Paper [5] describes a technique of document clustering based on frequent senses: the proposed graph-based document clustering system works with frequent senses rather than the frequent keywords used in traditional text mining techniques. It represents text documents as hierarchical document-graphs and utilizes an Apriori paradigm to find the frequent subgraphs, which reflect frequent senses. The authors in [6] introduce a topical document clustering method called Document Features Indexing Clustering (DFIC), which can identify topics accurately and cluster documents according to these topics. In DFIC, "topic elements" are defined and extracted for indexing base clusters; additionally, document features are investigated and exploited. Experimental results show that DFIC can attain higher precision than some widely used traditional clustering methods. Paper [7] presents a semi-supervised document clustering algorithm and a method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN; an active learning approach is proposed to select informative document pairs for obtaining user feedback. Paper [8] proposes multi-objective genetic algorithms (MOGA) for document clustering.

The rest of this paper is organized as follows. In Section 2, we review the vector space model of documents, the similarity measure, and the k-means algorithm. Section 3 presents the core concept of PSO, and Section 4 describes ASBO in detail. In Section 5, the experimental results of our clustering algorithm are compared with those of the original algorithms in terms of clustering accuracy, and the last section contains the conclusions of the research.


2. Background

2.1 Vector space model (VSM) for document representation

Document representation is fundamental for data management, information filtering, information retrieval, indexing, classification, and clustering tasks. The Vector Space Model (VSM) [9] represents a document as a vector of terms (or phrases), in which each dimension corresponds to a term (or a phrase). An entry of the vector is non-zero if the corresponding term (or phrase) occurs in the document. Significant progress has been made with the vector space model in many applications. In most clustering algorithms, the data set to be clustered is represented as a set of vectors X = {x1, x2, ..., xn}, where the vector xi corresponds to a single object and is called the feature vector. The feature vector should include proper features to represent the object. In this model, the content of a document is formalized as a point in multi-dimensional space and represented by a vector d = {w1, w2, ..., wn}, where wi (i = 1, 2, ..., n) is the term weight of term ti in the document. The term weight value represents the significance of the term in the document. To calculate the term weight, both the occurrence frequency of the term within a document and its frequency in the entire set of documents must be considered. The most widely used weighting scheme combines Term Frequency with Inverse Document Frequency (TF-IDF). The weight of term i in document j is given in equation (1):

\[ w_{ji} = tf_{ji} \times idf_i = tf_{ji} \times \log\left(\frac{n}{df_i}\right) \tag{1} \]

where tf_ji is the number of occurrences of term i in document j, df_i is the number of documents in the collection that contain term i, and n is the total number of documents in the collection. The TF-IDF weighting scheme increases the weight of terms that occur frequently in a small set of documents and lowers the weight of terms that occur frequently over the entire corpus.

2.2 Similarity measure

The choice of proximity function is significant because it defines what to interpret as clusters in the n-dimensional space in which our documents reside. We would like to achieve a good separation while keeping generality, so that we obtain clusters that are neither too small nor too large. The most intuitive metric is probably the Euclidean distance, also known as the l2 norm of the difference vector. This is the direct distance between two objects in a linear space, defined by equation (2):

\[ d_2(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \tag{2} \]

where p and q are either points or vectors.
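To make equations (1) and (2) concrete, here is a small sketch in Python (the paper's experiments were implemented in MATLAB; this translation and the toy corpus are illustrative assumptions, not the authors' code):

```python
import numpy as np

def tfidf_matrix(term_counts):
    """TF-IDF weighting per equation (1): w_ji = tf_ji * log(n / df_i).

    term_counts: (n_docs, n_terms) array of raw term frequencies.
    """
    tf = np.asarray(term_counts, dtype=float)
    n_docs = tf.shape[0]
    df = np.count_nonzero(tf > 0, axis=0)      # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))   # guard against df = 0
    return tf * idf                            # entry (j, i) holds w_ji

def euclidean_distance(p, q):
    """l2 distance of equation (2): sqrt(sum_k (p_k - q_k)^2)."""
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

# toy corpus: 3 documents over 4 terms
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 1],
                   [1, 1, 0, 0]])
W = tfidf_matrix(counts)
print(euclidean_distance(W[0], W[1]))
```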


2.3 K-means algorithm for document clustering

k-means is a popular algorithm that partitions a data set into k clusters. If the data set contains n documents, denoted by d1, d2, ..., dn, then clustering is the optimization process of grouping them into k clusters so that the global criterion function

\[ \sum_{j=1}^{k} \sum_{i=1}^{n} sim(d_i, c_j) \tag{3} \]

is either minimized or maximized, depending on the definition of sim(d_i, c_j), where c_j represents the centroid of cluster C_j, for j = 1, ..., k, and sim(d_i, c_j) evaluates the similarity between a document d_i and a centroid c_j.

When the vector space model is used to represent the documents and the Euclidean distance is used for sim(d_i, c_j), each document is assigned to the cluster whose centroid vector is more similar to the document than those of the other clusters, and the global criterion function is minimized in that case. This optimization process is known to be an NP-complete problem, and the k-means algorithm was proposed to provide an approximate solution. The steps of k-means are as follows:

(i) Select k initial cluster centroids, each of which represents a cluster.
(ii) For each document in the whole data set, compute the similarity with each cluster centroid, and assign the document to the closest (i.e., most similar) centroid (assignment step).
(iii) Recalculate the k centroids based on the documents assigned to them.
(iv) Repeat steps (ii) and (iii) until convergence.
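A minimal Python sketch of steps (i) through (iv), using the Euclidean distance; keeping the old centroid when a cluster becomes empty is one simple guard against the empty-cluster problem noted in the abstract. The function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def kmeans(docs, k, max_iter=100, seed=0):
    """Basic k-means over an (n_docs, n_terms) TF-IDF matrix."""
    rng = np.random.default_rng(seed)
    # step (i): pick k initial centroids from the documents themselves
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].copy()
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(max_iter):
        # step (ii): assign each document to the closest centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step (iii): recompute each centroid from its assigned documents;
        # if a cluster goes empty, keep its previous centroid
        new_centroids = np.array([
            docs[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # step (iv): stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```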

3. Particle swarm optimization (PSO)

Kennedy and Eberhart introduced the concept of function optimization by means of a particle swarm. The particle swarm model consists of a swarm of particles moving in a d-dimensional search space, where the fitness f can be calculated as a certain quality measure. Each particle has a position represented by a position vector x_i (i is the index of the particle) and a velocity represented by a velocity vector V_i. Each particle remembers its own best position so far in a vector p_i, whose j-th dimensional value is p_ij. The best position vector found so far by the swarm is stored in a vector p_g, whose j-th dimensional value is p_gj. During iteration t, the velocity is updated from the previous velocity by equation (4), and the new position is then determined as the sum of the previous position and the new velocity by equation (5):

\[ v_{ij}(t+1) = \chi \left[ w \, v_{ij}(t) + c_1 r_1 \left( p_{ij} - x_{ij}(t) \right) + c_2 r_2 \left( p_{gj} - x_{ij}(t) \right) \right] \tag{4} \]

\[ x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{5} \]


where r_1 and r_2 are random numbers, uniformly distributed within the interval [0, 1], for the j-th dimension of the i-th particle; c_1 is a positive constant called the coefficient of the self-recognition component, and c_2 is a positive constant called the coefficient of the social component. The variable w is called the inertia factor, whose value is typically set to vary linearly from 1 to near 0 during the iterative processing, and χ is a constant called the constriction factor. From equation (4), a particle decides where to move next by considering its own experience, which is the memory of its best past position, and the experience of the most successful particle in the swarm.

The PSO algorithm:

Input: randomly initialized positions and velocities of the particles, X_i(0) and V_i(0)
Output: position of the approximate global optimum, X_g
Begin
  While the terminating condition is not reached do
    For i = 1 to number of particles
      Evaluate the fitness f(X_i)
      Update p_i and p_g
      Adapt the velocity of the particle using equation (4)
      Update the position of the particle using equation (5)
    End for
  End while
End
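A Python sketch of the loop above, implementing the updates of equations (4) and (5). The search bounds and the default parameter values (borrowed from the experimental settings in Section 5.3) are assumptions:

```python
import numpy as np

def pso(fitness, dim, n_particles=20, iters=500,
        c1=0.5, c2=0.5, chi=0.75, w_start=1.2, w_end=0.0, seed=0):
    """Maximize `fitness` using the updates of equations (4) and (5)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))   # positions (assumed bounds)
    v = np.zeros((n_particles, dim))                 # velocities
    p_best = x.copy()                                # personal bests p_i
    p_fit = np.array([fitness(xi) for xi in x])
    g_best = p_best[p_fit.argmax()].copy()           # global best p_g
    for t in range(iters):
        w = w_start + (w_end - w_start) * t / iters  # linearly decreasing inertia
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # equation (4): constriction chi over inertia + cognitive + social terms
        v = chi * (w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x))
        x = x + v                                    # equation (5)
        fit = np.array([fitness(xi) for xi in x])
        improved = fit > p_fit
        p_best[improved], p_fit[improved] = x[improved], fit[improved]
        g_best = p_best[p_fit.argmax()].copy()
    return g_best
```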

4. Adaptive Social Behavior Optimization (ASBO)

Based on the concept of the interactions and influence taking place in human society, a mathematical model called Adaptive Social Behavior Optimization (ASBO) was developed by Singh [10]. The characteristics of dynamic leadership and dynamic logical neighbors, along with experienced self-capability, are taken as the fundamental social factors that define the growth of an individual and, in turn, of the whole society. For each member of a society, the characteristics and effects of these three factors are not constant over the whole life span; rather, they are functions of time and present status. To capture this dynamic characteristic of social life, ASBO adopts a self-adaptive mutation strategy. With this simplified macro model of the influencing environment in human society, a mathematical model is developed to achieve a particular objective. Assume a population containing a number of individuals, each representing a solution of the problem at hand in direct form (not in coded format). Each individual has a fitness value derived by the


objective function. The individual having the maximum fitness value is treated as the leader at the present time, and a group of individuals having the next-nearest higher values of fitness is treated as the neighbors of a particular individual. The change in existing status caused by these influences is computed by each member of the population using equation (6), and the next status is given by equation (7):

\[ \Delta x(i+1) = C_g R_1 (G_{bi} - X_i) + C_s R_2 (S_{bi} - X_i) + C_n R_3 (N_{ci} - X_i) \tag{6} \]

\[ X(i+1) = X_i + \Delta X(i+1) \tag{7} \]

where Δx(i+1) represents the new change in the i-th dimension of an individual element; C_g, C_s, C_n are adaptive constants ≥ 0; R_i (i = 1, 2, 3) are uniformly distributed random numbers in the range [0, 1]; G_b is the global best individual in the present population; S_b is the self best of an individual; and N_c is the center position of the group formed by an individual and its neighbors. For a D-dimensional problem, G_b, S_b, and N_c are D-dimensional vectors:

\[ G_b = [G_{b1}, G_{b2}, G_{b3}, \ldots, G_{bD}] \]
\[ S_b = [S_{b1}, S_{b2}, S_{b3}, \ldots, S_{bD}] \]
\[ N_c = [N_{c1}, N_{c2}, N_{c3}, \ldots, N_{cD}] \]

4.1 Steps to adapt the new set of progress constants

1. A population of N trial solutions is initialized. Each solution is taken as a pair of real-valued vectors (p_i, σ_i), with three dimensions corresponding to the number of progress constants. The initial components of each p_i are selected in accordance with a uniform distribution ranging over a presumed solution space. The values of σ_i, for all i ∈ {1, ..., N}, the so-called strategy parameters, are initially set to some value.

2. One offspring (p'_i, σ'_i) is generated from each parent (p_i, σ_i) by:

\[ p'_i(j) = p_i(j) + \sigma_i(j) \cdot N(0,1) \tag{8} \]
\[ \sigma'_i(j) = \sigma_i(j) \exp\left( \tau' N(0,1) + \tau N_j(0,1) \right) \tag{9} \]

for all j ∈ {1, 2, 3}, where p_i(j), p'_i(j), σ_i(j), σ'_i(j) denote the j-th component of the vectors p_i, p'_i, σ_i, σ'_i, respectively. N(0,1) is a random number drawn from a Gaussian distribution, and N_j(0,1) is a Gaussian random number sampled anew for each value of the counter j.

A fitness function, defined as the reciprocal of the average intra-cluster distance, is evaluated for each member of the population. The member having the maximum fitness value is declared the global best for the present time. For each member, the neighbor factor is calculated


by taking the mean of the neighbors' positions (in this paper, the three members with the nearest fitness values are taken as neighbors). The self best is initialized for each member as it becomes available. Self-adaptive evolutionary computation is applied to obtain the new set of progress constants, with Gaussian mutation used for the random perturbation. From the set of 2N candidate progress constants, the N members having the maximum fitness are selected, and the corresponding N sets of progress constants are used for searching the new positions; this process is repeated until the terminating criterion is satisfied. Because each member has a very different position, and hence fitness value, each member should have correspondingly adapted values of the progress parameters rather than a single value shared by everyone in all situations. The progress flow of ASBO is shown in Fig. 1.

[Fig. 1. Working flow of ASBO: the population and the progress-constant population are initialized; fitness is estimated; [G, S, N] are defined; the change in position and the new positions are calculated; mutation of the progress constants produces an offspring population; the best half of the positions is selected, giving the new population and the new progress constants; the loop repeats until termination, when the member with the best fitness is taken as the solution.]
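Pulling the pieces together, the following Python sketch implements one reading of the ASBO flow: the influence update of equations (6) and (7), with progress constants self-adaptively mutated per equations (8) and (9). The neighbor selection, the τ and τ' settings, and the parent-versus-offspring selection rule are assumptions where the paper leaves details open; this is not the authors' MATLAB implementation:

```python
import numpy as np

def asbo(fitness, dim, n_members=20, iters=500, n_neighbors=3,
         c_low=0.0, c_high=5.0, sigma0=1e-5, seed=0):
    """Maximize `fitness` via ASBO, equations (6)-(9); a simplified sketch."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, (n_members, dim))     # member positions
    C = rng.uniform(c_low, c_high, (n_members, 3))   # progress constants (Cg, Cs, Cn)
    sigma = np.full((n_members, 3), sigma0)          # self-adaptive std deviations
    S_best = X.copy()                                # self-best positions
    S_fit = np.array([fitness(x) for x in X])
    # assumed evolutionary-programming scalings; the paper does not fix them
    tau = 1.0 / np.sqrt(2.0 * np.sqrt(3.0))
    tau_p = 1.0 / np.sqrt(2.0 * 3.0)

    def step(X, C, fit, G_best):
        """Move every member once using equations (6) and (7)."""
        X_new = X.copy()
        for i in range(n_members):
            # logical neighbors: members with the nearest fitness values
            idx = np.argsort(np.abs(fit - fit[i]))[1:n_neighbors + 1]
            N_c = X[idx].mean(axis=0)                # neighbor-group center
            R = rng.random(3)
            dX = (C[i, 0] * R[0] * (G_best - X[i])        # leader influence
                  + C[i, 1] * R[1] * (S_best[i] - X[i])   # self experience
                  + C[i, 2] * R[2] * (N_c - X[i]))        # neighbor influence
            X_new[i] = X[i] + dX
        return X_new

    for _ in range(iters):
        fit = np.array([fitness(x) for x in X])
        better = fit > S_fit
        S_best[better], S_fit[better] = X[better], fit[better]
        G_best = X[fit.argmax()]                     # leader at present time
        # equations (8) and (9): mutate the progress constants
        C_off = np.clip(C + sigma * rng.standard_normal(C.shape), c_low, c_high)
        sigma_off = sigma * np.exp(tau_p * rng.standard_normal((n_members, 1))
                                   + tau * rng.standard_normal(C.shape))
        # candidate moves with parent and offspring constants; keep the fitter
        X_par, X_off = step(X, C, fit, G_best), step(X, C_off, fit, G_best)
        f_par = np.array([fitness(x) for x in X_par])
        f_off = np.array([fitness(x) for x in X_off])
        use_off = f_off > f_par
        X = np.where(use_off[:, None], X_off, X_par)
        C = np.where(use_off[:, None], C_off, C)
        sigma = np.where(use_off[:, None], sigma_off, sigma)
    final_fit = np.array([fitness(x) for x in X])
    return X[final_fit.argmax()]
```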


5. Experimental results

In order to show that the proposed method can improve on k-means and PSO-based clustering for document clustering, we ran the k-means, PSO, and ASBO algorithms for 10 independent trials. A fitness measure based on the average intra-cluster distance was applied to evaluate the clustering results. The clustering results obtained by ASBO are compared with those of K-means and PSO. We implemented all the algorithms in MATLAB on a Windows workstation.

5.1 Data set

A simulated data set contains three categories of documents (300 documents in total); each category carries 100 documents, and each document contains 30 keywords. The keyword frequencies in all categories are uniformly distributed random numbers. In the first category, the frequencies of the first ten keywords lie in the range [10, 20], the next ten in [0, 1], and the last ten in [0, 3]. In the second category, the first ten keywords lie in [0, 2], the next ten in [10, 20], and the last ten in [0, 2]. In the third category, the first ten keywords lie in [0, 2], the next ten in [0, 3], and the last ten in [10, 20].
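A sketch of how the simulated data set described above could be generated (a Python rendering; integer keyword frequencies drawn uniformly from the stated ranges are an assumption):

```python
import numpy as np

def make_dataset(seed=0):
    """300 documents x 30 keywords, 3 categories of 100 documents each.

    In each category, one block of 10 keywords is frequent ([10, 20])
    and the other two blocks are rare, per the ranges stated above.
    """
    rng = np.random.default_rng(seed)
    # per-category (low, high) frequency range for each block of 10 keywords
    ranges = [
        [(10, 20), (0, 1), (0, 3)],   # category 1
        [(0, 2), (10, 20), (0, 2)],   # category 2
        [(0, 2), (0, 3), (10, 20)],   # category 3
    ]
    docs, labels = [], []
    for cat, blocks in enumerate(ranges):
        for _ in range(100):
            doc = np.concatenate([
                rng.integers(low, high + 1, size=10) for low, high in blocks
            ])
            docs.append(doc)
            labels.append(cat)
    return np.array(docs, dtype=float), np.array(labels)

X, y = make_dataset()
print(X.shape)  # (300, 30)
```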

5.2 Evaluation methods of document clustering

We used the F-measure and purity values to evaluate the accuracy of our clustering algorithms. The F-measure is a harmonic combination of the precision and recall values used in information retrieval. Since our data sets were prepared as described above, each cluster obtained can be considered the result of a query, whereas each pre-classified set of documents can be considered the desired set of documents for that query. Thus, we can calculate the precision P(i,j) and recall R(i,j) of each cluster j for each class i.

5.2.1 F-measure of cluster

If n_i is the number of members of class i, n_j is the number of members of cluster j, and n_ij is the number of members of class i in cluster j, then P(i,j) and R(i,j) can be defined as


\[ P(i,j) = \frac{n_{ij}}{n_j} \tag{10} \]

\[ R(i,j) = \frac{n_{ij}}{n_i} \tag{11} \]


The corresponding F-measure F(i,j) is defined as

\[ F(i,j) = \frac{2 \, P(i,j) \, R(i,j)}{P(i,j) + R(i,j)} \tag{12} \]

The F-measure of the whole clustering result is then defined as

\[ F = \sum_i \frac{n_i}{n} \max_j \left( F(i,j) \right) \tag{13} \]

where n is the total number of documents in the data set. In general, the larger the F-measure, the better the clustering result.

5.2.2 Purity of cluster

The purity of a cluster represents the fraction of the cluster corresponding to the largest class of documents assigned to that cluster; thus the purity of cluster j is defined as

\[ Purity(j) = \frac{1}{n_j} \max_i \left( n_{ij} \right) \tag{14} \]

The purity of the whole clustering result is a weighted sum of the individual cluster purities:

\[ Purity = \sum_j \frac{n_j}{n} \, Purity(j) \tag{15} \]

In general, the larger the purity value, the better the clustering result.
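Equations (10) through (15) translate directly into code. A Python sketch, assuming 0-based integer class and cluster labels:

```python
import numpy as np

def f_measure(classes, clusters):
    """Overall F-measure, equations (10)-(13)."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, total = len(classes), 0.0
    for i in np.unique(classes):
        n_i = np.sum(classes == i)
        best_f = 0.0
        for j in np.unique(clusters):
            n_j = np.sum(clusters == j)
            n_ij = np.sum((classes == i) & (clusters == j))
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i                # equations (10), (11)
            best_f = max(best_f, 2 * p * r / (p + r))    # equation (12)
        total += (n_i / n) * best_f                      # equation (13)
    return total

def purity(classes, clusters):
    """Overall purity, equations (14)-(15)."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, total = len(classes), 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        largest = np.bincount(members).max()             # max_i n_ij
        total += (len(members) / n) * (largest / len(members))
    return total
```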

5.3 Result analysis of clustering

For both the ASBO and PSO algorithms, 10 independent trials were run with a population size of 20, and a total of 500 iterations were taken in each trial to determine the F-measure and purity of the clustering. The parameter values for PSO in all simulations were c_1 = c_2 = 0.5 and χ = 0.75, with the inertia weight decreasing from 1.2 towards 0 over the iterations. For ASBO, the upper and lower bounds of the adaptive constants were taken as 5 and 0, while the initial standard deviation was taken as 1e-5 for all solutions. The centroid population was initialized by random selection from the vector space model. The performances of the two algorithms over all 10 independent trials are shown in Fig. 2 and Fig. 3, respectively. It is clear from observation that PSO starts moving towards the solution at a faster rate but soon converges to a local solution, whereas ASBO improves the result in a much better manner. The mean performance of the two algorithms


is shown in Fig. 4. The performance of ASBO against PSO and K-means with respect to the F-measure is presented in Table 1; both PSO and K-means failed to match ASBO, which always delivered the maximum F-measure value. With respect to cluster purity, the performances are given in Table 2, and here also ASBO outperformed the PSO and K-means algorithms.

Fig. 2. Total square sum of intra-cluster distance for 10 independent trials in PSO.

Fig. 3. Total square sum of intra-cluster distance for 10 independent trials in ASBO.


Fig. 4. Comparison of the mean total square sum of intra-cluster distance over 10 independent trials in PSO and ASBO.

Table 1: Performance comparison w.r.t. F-measure in 10 independent trials between ASBO, PSO and K-means

Trial No.     | ASBO    | PSO           | K-means
1             | 1.0     | 0.7748        | 1.0
2             | 1.0     | 1.0           | 0.8036
3             | 1.0     | 1.0           | 1.0
4             | 1.0     | 1.0           | 1.0
5             | 1.0     | 1.0           | 1.0
6             | 1.0     | 1.0           | 1.0
7             | 1.0     | 0.0           | 1.0
8             | 1.0     | 1.0           | 1.0
9             | 1.0     | 1.0           | 0.8976
10            | 1.0     | 1.0           | 0.6944
Mean/Std.Dev  | 1.0/0.0 | 0.8775/0.3163 | 0.9396/0.1085


Table 2: Performance comparison w.r.t. purity in 10 independent trials between ASBO, PSO and K-means

Trial No.     | ASBO    | PSO           | K-means
1             | 1.0     | 0.6800        | 1.0
2             | 1.0     | 1.0           | 0.6667
3             | 1.0     | 1.0           | 0.6667
4             | 1.0     | 1.0           | 1.0
5             | 1.0     | 1.0           | 0.6667
6             | 1.0     | 1.0           | 0.6667
7             | 1.0     | 0.0           | 1.0
8             | 1.0     | 1.0           | 0.6667
9             | 1.0     | 1.0           | 0.6667
10            | 1.0     | 1.0           | 0.6667
Mean/Std.Dev  | 1.0/0.0 | 0.8680/0.3211 | 0.7667/0.1610

6. Conclusion

In this paper we have proposed and tested a new optimization algorithm that can be used for accurate and robust document clustering. The PSO and K-means algorithms have a relatively high probability of becoming trapped in a local optimal solution. Unlike them, the proposed algorithm has very little chance of becoming trapped in a local optimum and hence converges to a globally optimal solution. We used two well-established measures, the F-measure and cluster purity, to obtain a fair comparison of cluster quality. The ASBO algorithm not only delivers compact clusters; the process is also very robust.

References

[1] Xiaohui Cui, "Document clustering using particle swarm optimization", Swarm Intelligence Symposium (SIS 2005), Proceedings 2005 IEEE, pp. 185-191.
[2] Bakus, J., Hussin, M.F., Kamel, M., "A SOM-based document clustering using phrases", Neural Information Processing (ICONIP '02), Proceedings of the 9th International Conference, 2002, Vol. 5, pp. 2212-2216.
[3] Muhammad Rafi, Maujood, M., Fazal, M.M., Ali, S.M., "A comparison of two suffix tree-based document clustering algorithms", Information and Emerging Technologies (ICIET), 2010 International Conference, 2010, pp. 1-5.
[4] Wei Jian-Xiang, Liu Huai, Sun Yue-hong, Su Xin-Ning, "Application of Genetic Algorithm in Document Clustering", Information Technology and Computer Science (ITCS 2009), International Conference, 2009, Vol. 1, pp. 145-148.
[5] Hossain, M.S., "GDClust: A Graph-Based Document Clustering Technique", Data Mining Workshops (ICDM Workshops 2007), Seventh IEEE International Conference, pp. 417-422.
[6] Yi Ding, Xian Fu, "A Text Document Clustering Method Based on Topical Concept", Advances in Electronic Commerce, Web Application and Communication, Advances in Intelligent and Soft Computing, Vol. 148, 2012, pp. 547-552.
[7] Weizhong Zhao, Qing He, Huifang Ma, Zhongzhi Shi, "Effective semi-supervised document clustering via active learning with instance-level constraints", Knowledge and Information Systems, Springer, March 2012, Vol. 30, Issue 3, pp. 569-587.
[8] Jung Song Lee, Lim Cheon Choi, Soon Cheol Park, "Multi-Objective Genetic Algorithms, NSGA-II and SPEA2, for Document Clustering", Software Engineering, Business Continuity, and Education, Communications in Computer and Information Science, Springer, Vol. 257, 2011, pp. 219-227.
[9] C.J. van Rijsbergen, Information Retrieval, second ed., Butterworths, London, 1979.
[10] Manoj Kr. Singh, "A new optimization method based on adaptive social behavior: ASBO", AISC 174, pp. 823-831, Springer, 2012.
