Available online at www.sciencedirect.com

ScienceDirect IERI Procedia 4 (2013) 370 – 375

2013 International Conference on Electronic Engineering and Computer Science

An Evolutionary Approach for Document Clustering

Ruksana Akter, Yoojin Chung*

Department of Computer Science and Engineering, Hankuk University of Foreign Studies, Wangsan, Mohyun, Yongin, Korea, 449-791

Abstract

We propose an evolutionary approach based on a genetic algorithm for text document clustering. Instead of applying the genetic algorithm to the whole dataset, we partition the dataset into groups and apply a genetic algorithm to each partition separately. Finally, we apply another genetic algorithm phase to the outcomes of the earlier ones. This allows the method to escape local minima, which is one of the major problems in using genetic algorithms. Another desirable feature of our proposal is that, unlike most available methods, it does not require the total number of clusters to be specified in advance. Experimental results also show the superior performance of our approach compared to previous approaches.

© 2013 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license. Selection and peer review under responsibility of Information Engineering Research Institute.

Keywords: Document clustering; Genetic algorithm; Local minima; Cluster evaluation

1. Introduction

Document clustering means grouping a set of documents into different clusters so that two objectives are satisfied: first, similar documents should be grouped in the same cluster, and second, dissimilar documents should be placed in different clusters. The influence of the Internet is continuously increasing due to its impact on people's lifestyles, work practices, and communications. Due to the ease of use of different web services, such as

* Corresponding author. Tel.: +82-31-330-4625; fax: +82-31-330-4120. E-mail address: [email protected].

2212-6678 © 2013 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license. Selection and peer review under responsibility of Information Engineering Research Institute doi:10.1016/j.ieri.2013.11.053


Ruksana Akter and Yoojin Chung / IERI Procedia 4 (2013) 370 – 375

blogs, social networking sites, service-providing sites, and news agencies, a huge amount of data is being added to the World Wide Web. However, most of these data are unstructured. Hence, when a user needs to find an item, s/he needs the support of a search engine. Upon receiving a request, a search engine needs to return documents related to the search topic. Here document clustering plays a great role in improving the results and the complexity of such a search, because it groups similar documents together. Document clustering is, hence, well studied and has attracted many researchers in the information retrieval field [6].

In the literature, many algorithms have been proposed to cluster text documents. K-means clustering [2]-[5], [7] is one of the most popular clustering algorithms. It clusters a given set of documents into K different groups. It is very simple and computationally efficient. However, in K-means clustering, the user needs to specify in advance how many clusters are expected (the value of K), which may sometimes be difficult to set. Moreover, the K-means clustering algorithm starts with an initial (usually random) clustering, on which its performance largely depends [7].

In recent years, a number of proposals that use genetic algorithms have appeared [6]. A genetic algorithm is an optimization technique. It follows the principles of evolution through randomized natural selection [8]. It is known to achieve very good (near optimal) solutions, especially when the search space is large and multi-modal [9]. Though the performance of genetic algorithms in document clustering has been found to be better than that of other available methods [6], they may converge to locally optimal values. This is known as the premature convergence phenomenon (PCP) [10]. To avoid PCP, a Double Layered Genetic algorithm for document Clustering (DLGC) [11] has been proposed. However, it needs very high computation if the number of generations used in the first layer becomes high.
Moreover, it also requires the number of clusters to be specified in advance.

We propose a two-phase genetic algorithm-based evolutionary approach for text document clustering. Instead of applying the genetic algorithm to the whole dataset, we partition the dataset into groups and apply a genetic algorithm to each partition separately. Finally, we apply another genetic algorithm phase to the outcomes of the earlier ones. This allows the method to escape local minima. Unlike most available methods, our proposal also does not require the expected number of clusters to be specified in advance. Experimental results also demonstrate the superior performance of the proposed approach compared to the other available approaches.

The rest of the paper is organized as follows. The proposed procedure is described in detail in Section 2, and Section 3 presents the comparative performance of the proposed method as well as some other available methods. Finally, Section 4 concludes the paper.

2. The Proposed Approach

Text document clustering algorithms basically make use of the statistics (usually the frequencies) of the occurrences of different terms in the documents. Every document is represented using such statistics as a vector in a vector space. The clustering algorithm then partitions this space by putting the closer vectors in the same partition.

The subject and contents of a document are usually thought to be indicated by the terms present in the document. The statistics of the terms in a document are represented using some measure such as term frequency (tf), indicating the number of appearances of a term in the document, or Boolean occurrence, indicating whether a term occurs in the document. Another measure is inverse document frequency (idf), which measures the presence of a term across different documents. For the term t_i, the idf_i is measured by

idf_i = log(N / n),    (1)
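As a concrete illustration, Eq. (1) can be sketched in a few lines of Python (the corpus and function names here are our own, not from the paper):

```python
import math

def inverse_document_frequency(term, documents):
    """Eq. (1): idf = log(N / n), where N is the number of
    documents and n is how many of them contain the term."""
    N = len(documents)
    n = sum(1 for doc in documents if term in doc)
    return math.log(N / n) if n else 0.0

# Toy corpus: each document represented as its set of terms.
corpus = [{"oil", "price"}, {"oil", "trade"}, {"grain", "price"}]
idf_oil = inverse_document_frequency("oil", corpus)  # log(3/2)
```

A rare term thus gets a higher idf than one that appears in most documents.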


where N denotes the number of documents in the dataset at hand, and n represents how many documents contain the term t_i. In our proposal, we use a weighted term frequency based on the Okapi rule [1], which combines tf and idf. The term weight of t_i in document d_j is

tw_ji = (tf_ji / (tf_ji + 0.5 + 1.5 × (l_j / l̄))) · idf_i,    (2)

where l_j represents the document length (total number of terms), and l̄ denotes the average length of all the documents in the dataset. We represent a document d_j using a vector of the weighted term frequencies as

d_j = (tw_j1, tw_j2, ..., tw_jT),    (3)
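A hedged sketch of Eqs. (2) and (3) follows; the function names and the toy tokenized documents are illustrative assumptions, not the authors' code:

```python
import math

def okapi_weight(tf, idf, doc_len, avg_len):
    """Eq. (2): tw = tf / (tf + 0.5 + 1.5 * (l_j / l_bar)) * idf."""
    return tf / (tf + 0.5 + 1.5 * (doc_len / avg_len)) * idf

def document_vector(doc, vocabulary, idf, avg_len):
    """Eq. (3): one weighted-term-frequency component per vocabulary term."""
    return [okapi_weight(doc.count(t), idf.get(t, 0.0), len(doc), avg_len)
            for t in vocabulary]

# Toy tokenized documents.
docs = [["oil", "price", "oil"], ["grain", "price"]]
avg_len = sum(len(d) for d in docs) / len(docs)  # l_bar = 2.5
idf = {"oil": math.log(2 / 1), "price": math.log(2 / 2), "grain": math.log(2 / 1)}
vocab = sorted(idf)  # ["grain", "oil", "price"]
vec = document_vector(docs[0], vocab, idf, avg_len)
```

Terms absent from a document, or present in every document (idf = 0), contribute zero weight to its vector.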

where T denotes the total number of different terms in the dataset. Thus a document vector includes all the terms present in the whole dataset. We do not, however, count a few terms such as stop words, because these terms do not usually have any relationship with the contents or topics of the documents.

Given a set of documents D = {d1, d2, d3, ..., dN}, we aim to assign each document d_j to one cluster so that related documents are put in the same cluster. In other words, the average distance between the documents in a cluster is minimized. A cluster is usually represented by its center, known as the centroid. Thus the ultimate aim is to minimize the distances between the centroid and the documents in a cluster.

At this point, we need to compute the similarity between documents. Different similarity measures, for example Pearson, cosine, and adjusted cosine, are usually used to compute the extent of similarity between any two vectors. In our work we use the cosine similarity to measure the degree of similarity between any two documents. The cosine similarity between two document vectors d_x and d_y is measured by

Cosine_Similarity(d_x, d_y) = (d_x · d_y) / (|d_x| |d_y|),    (4)
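Eq. (4) translates directly to code; a minimal sketch (the function name and the zero-norm guard are our additions):

```python
import math

def cosine_similarity(dx, dy):
    """Eq. (4): dot product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return dot / norm if norm else 0.0

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```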

where the numerator and the denominator are the dot product of the vectors and the product of their norms (Euclidean lengths), respectively.

To apply a genetic algorithm, we first need to define a chromosome, which represents a (candidate) solution. Our work focuses on partitioning the documents into K clusters. In other words, we need to find the K centroids representing the K clusters. Hence, we define a chromosome chr_p as a vector of K centroids:

chr_p = (cen_1, cen_2, ..., cen_K).    (5)

This definition leads to a faster convergence than that proposed by Choi et al. [11], where the chromosome's length is N (the total number of documents in the dataset). Moreover, it is also lighter to handle. Another important point here is that we do not require the exact value of K to be specified. Rather, similar to what is used by Song and Park [6], our method requires a range [Kmin, Kmax] for the value of K. Hence, in our approach, the population may contain chromosomes of different lengths.
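Under our reading of Eq. (5) and the K range above, a chromosome could be sketched as follows (the helper names and the centroid factory are hypothetical, not the authors' implementation):

```python
import random

def random_chromosome(k_min, k_max, make_centroid):
    """Eq. (5): a candidate solution is a vector of K centroids.
    K itself is drawn from [k_min, k_max], so chromosomes in the
    population may differ in length."""
    k = random.randint(k_min, k_max)
    return [make_centroid() for _ in range(k)]

# Hypothetical centroid factory for a 3-term vocabulary.
chrom = random_chromosome(2, 5, lambda: [0.0, 0.0, 0.0])
```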


A genetic algorithm usually starts by generating an initial population of size P, which contains P chromosomes. We pick the values of the centroids from the dataset. A centroid cen_k is a vector formed as

cen_k = (c_1, c_2, ..., c_T),    (6)
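A sketch of the centroid sampling in Eq. (6), assuming the term weights are held in an N × T matrix (our own representation, chosen for illustration):

```python
import random

def random_centroid(term_weights):
    """Eq. (6): component c_v is drawn from the observed weights
    {tw_1v, ..., tw_Nv} of term t_v across the N documents."""
    n_docs = len(term_weights)
    n_terms = len(term_weights[0])
    return [term_weights[random.randrange(n_docs)][v] for v in range(n_terms)]

tw = [[0.1, 0.0], [0.3, 0.2]]  # 2 documents, 2 terms
cen = random_centroid(tw)
```

Each component of the centroid is therefore always a weight actually observed in the dataset.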

where c_v is a value randomly selected from the set {tw_1v, tw_2v, ..., tw_Nv} of the different term weights of the corresponding term t_v across all the documents in the dataset.

We then apply two phases of genetic algorithm to this population. In the first phase, we divide the population into G partitions and apply one genetic algorithm to each of them. This approach leads to a faster convergence than the DLGC [11]. After this phase, we apply a genetic algorithm to the whole population, using the chromosomes updated in the first phase.

For our genetic algorithms, we select the fittest chromosomes for crossover. We use the classical single-point crossover, and mutation is done according to the concept of the Gaussian distribution. This choice of operators is analogous to that used in [12]. We use 1/DB as the fitness function, where DB is the well-known Davies-Bouldin index [13, 14]. We set the termination criteria as ITR and Smax, where ITR denotes the maximum number of iterations and Smax is the number of consecutive iterations making no update to the population.

3. Performance Evaluation

Here, we present the outcomes of our experiments applying our proposed approach as well as two other well-known methods, namely K-means clustering and the genetic algorithm-based approach (referred to as GA hereafter) proposed by Song and Park [6]. To compare the performances, we adopt the well-known benchmark database REUTERS-21578. Our dataset includes 1000 texts from 5 topics: acq, crude, trade, grain and money-fx. After stop word removal and stemming, there are 7142 terms. We set the genetic algorithm parameters, namely the rates of selection, crossover and mutation, to 0.6, 0.2 and 0.2, respectively. We set Smax = 10. For the GA [6], we set ITR = 200. For our proposed algorithm, we make five partitions of the population (G = 5), and set ITR = 25 for the first phase and ITR = 75 for the second phase. This ensures that the total number of iterations used by our algorithm is at most 200 (5 × 25 + 75 = 200), for a fair comparison.
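The two-phase scheme and iteration budgets described above can be sketched as a skeleton (here `ga_run` stands in for a single-population GA routine, which we leave abstract; the function names are our own):

```python
def two_phase_ga(population, G, ga_run, itr_phase1, itr_phase2):
    """Phase 1: split the population into G partitions and evolve each
    independently for itr_phase1 iterations. Phase 2: evolve the merged,
    updated population for itr_phase2 iterations."""
    size = len(population) // G
    evolved = []
    for g in range(G):
        part = population[g * size:(g + 1) * size]
        evolved.extend(ga_run(part, itr_phase1))
    return ga_run(evolved, itr_phase2)

# With G = 5, itr_phase1 = 25 and itr_phase2 = 75, the total iteration
# budget per partition path is 25 + 75 = 100, and at most 200 overall.
```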

Fig. 1. F-measures of three different algorithms.


To quantify the performance of the different methods we adopt the F-measure metric [15]. The F-measure, F, is calculated by

F = (2 × precision × recall) / (precision + recall).    (7)
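Eq. (7) in code (a trivial sketch; the zero-division guard is our addition):

```python
def f_measure(precision, recall):
    """Eq. (7): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```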

Figure 1 presents the F-measures of the different algorithms. It clearly shows the superiority of the proposed approach. We have also applied latent semantic indexing (LSI) [6] to our dataset, reducing the dimensionality to 500, and then run the three algorithms on the resulting dataset. Figure 2 shows the F-measures of the outcomes. Here, too, the performance of the proposed approach is clearly ahead of the others.


Fig. 2. F-measures of three different algorithms when run on the dataset created using the LSI.

4. Conclusion

In this paper, we have proposed a genetic algorithm-based method for document clustering. The proposed method possesses the desired properties of an unsupervised clustering method, such as not requiring pre-specification of the desired number of clusters and the ability to avoid local minima. The experimental results on a benchmark dataset also show that the proposal outperforms the existing document clustering methods. As future work, we hope to develop this algorithm further to make it fully automatic, so that it requires no parameters to be specified.

Acknowledgements This work was supported by the Hankuk University of Foreign Studies research fund of 2012.

References

[1] Salton, G., and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management.
[2] Jain, A.K., Murty, M.N., and Flynn, P.J. 1999. Data clustering: a review. ACM Computing Surveys, vol. 31, no. 3.
[3] Jain, A.K. 2008. Data clustering: 50 years beyond K-means. In 19th International Conference on Pattern Recognition, Tampa, FL.
[4] Jain, A.K., and Dubes, R.C. 1988. Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series. Prentice Hall, NJ.
[5] Duda, R.O., Hart, P.E., and Stork, D.G. 2000. Pattern Classification, 2nd edition. Wiley-Interscience.
[6] Song, W., and Park, S.C. 2009. Genetic algorithm for text clustering based on latent semantic indexing. Computers and Mathematics with Applications, vol. 57, pp. 1901-1907.
[7] Cha, S.M., and Kwon, K.H. 2001. A new migration method of the multipopulation genetic algorithms. The Korea Institute of Information Scientists and Engineers.
[8] Maulik, U., and Bandyopadhyay, S. 2000. Genetic algorithm-based clustering technique. Pattern Recognition, vol. 33, no. 9, pp. 1455-1465.
[9] Srinivas, M., and Patnaik, L.M. 1994. Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, vol. 24, no. 4, pp. 656-667.
[10] Andre, J., Siarry, P., and Dognon, T. 2001. An improvement of the standard genetic algorithm fighting premature convergence in continuous optimization. Advances in Engineering Software, vol. 32, no. 1, pp. 49-60.
[11] Choi, L.C., Lee, J.S., and Park, S.C. 2011. Double layered genetic algorithm for document clustering. Communications in Computer and Information Science, vol. 257, pp. 212-218.
[12] Yao, X., Liu, Y., and Lin, G. 1999. Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, vol. 3, pp. 82-102.
[13] Bandyopadhyay, S., and Maulik, U. 2001. Nonparametric genetic clustering: comparison of validity indices. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, pp. 120-125.
[14] Davies, D.L., and Bouldin, D.W. 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, pp. 224-227.
[15] Fragoudis, D., Meretakis, D., and Likothanassis, S. 2005. Best terms: an efficient feature selection algorithm for text categorization. Knowledge and Information Systems.
