Document Clustering Using Differential Evolution

Ajith Abraham¹, Swagatam Das² and Amit Konar²

¹ IITA Professorship Program, School of Computer Science and Engineering, Chung Ang University (CAU), Seoul
² Dept. of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700032, India

Abstract—This paper investigates a novel approach to partitional clustering of a large collection of text documents using an improved version of the classical Differential Evolution (DE) algorithm. Fast and accurate clustering of documents plays an important role in text mining and automatic information retrieval systems. The k-means algorithm has served as the most widely used partitional clustering algorithm for text documents; however, in most cases it provides only locally optimal solutions. In this work, the clustering problem is formulated as an optimization task and solved with a modified DE algorithm. To reduce the computational time, a hybrid of k-means and DE has also been proposed. The new algorithms were tested on a number of document datasets. Comparison with k-means, a state-of-the-art PSO-based method and a recently proposed real-coded GA-based text clustering method reflects the superiority of the proposed techniques in terms of both speed and quality of clustering.

I. INTRODUCTION

Clustering of text documents plays a vital role in efficient document organization, summarization, topic extraction and information retrieval. Although initially used for improving precision or recall in information retrieval systems [1, 2], clustering has more recently been proposed for browsing a collection of documents [3] or for organizing the results returned by a search engine in response to a user's query [4]. Document clustering has also been used to automatically generate hierarchical clusters of documents [5]. The automatic generation of a taxonomy of Web documents, like that provided by Yahoo! (www.yahoo.com), is often cited as a goal.

Clustering involves the optimal partitioning of a given set of N data points into K subgroups, such that data points belonging to the same group are as similar to each other as possible, whereas data points from two different groups differ as much as possible. Unsupervised document clustering may be broadly classified into two types: 'hierarchical' and 'partitional' [6]. Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The result of a hierarchical clustering algorithm can be displayed graphically as a tree, called a dendrogram, which shows the merging process and the intermediate clusters. In contrast, partitional clustering techniques create a single-level (unnested) partitioning of the data points.

If K is the desired number of clusters, then partitional approaches typically find all K clusters at once. In recent years, it has been recognized that partitional clustering techniques are well suited to clustering large document datasets due to their relatively low computational requirements [7, 8]. The time complexity of partitioning techniques is almost linear, which makes them widely used. The best-known partitional clustering algorithm is the k-means algorithm and its variants [9]. This algorithm is simple, straightforward and based on the firm foundation of analysis of variances. In addition to k-means, several other algorithms, including the Genetic Algorithm (GA) [10, 11], Self-Organizing Maps (SOM) [12] and Particle Swarm Optimization (PSO) [13], have been used for document clustering.

A review of the current literature reveals that DE [14] has not been employed for text document clustering to date. In the present work, we determine the optimal partitioning of a large document dataset by using an improved version of DE. Das et al. [15] presented an improved version of the classical DE scheme called Differential Evolution with RANDom Scale Factor (DERANDSF). In the present paper, DERANDSF with another slight modification is shown to outperform k-means, a state-of-the-art version of the genetic algorithm (GA) and an improved particle swarm optimization (PSO)-based hard clustering algorithm in terms of accuracy, speed and robustness when applied to document clustering. We have also proposed a hybrid clustering algorithm that combines k-means with the modified DE and improves considerably on computational speed. To compare the performance of the various clustering algorithms, we have used a test-suite of six well-known document databases in which the number of documents ranges from 600 to 1700, the number of classes ranges from 7 to 25 and the number of terms per document may be as high as 12,000. Our experimental results indicate that, in the majority of cases, the improved version of DE performed very well in a statistically significant manner when compared to the other algorithms considered.

The rest of the paper is organized as follows. In Section II, we briefly describe the methods of representing text documents as data points in a multi-dimensional space and formulate document clustering as an optimization problem. Section III introduces the modified DE algorithm, while Section IV discusses its use in the domain of document clustering. Section V provides the detailed experimental set-up, the other algorithms considered, a description of the test datasets and the simulation strategies. In Section VI we present the results of the experimental study, and Section VII makes a number of observations based on them. Finally, the paper is concluded in Section VIII.

II. TEXT DOCUMENT CLUSTERING PROBLEM

A. Representation of Documents

To apply any clustering algorithm to a dataset, the documents must first be represented in a suitable form. Documents are represented by the widely used vector-space model introduced by Salton et al. [16]. In this model, each document is treated as a vector $\vec{d}$, and each dimension of $\vec{d}$ stands for a distinct term in the term space of the document collection. We represent each document as the vector

$$\vec{d} = [w_1, w_2, \ldots, w_n], $$

where $w_i$ is the term weight of the term $t_i$ in the document. The term weight value represents the significance of the term in the document. To calculate the term weight, the occurrence frequency of the term within the document and within the entire set of documents must be considered. The most widely used weighting scheme combines the Term Frequency with the Inverse Document Frequency (TF-IDF) [18, 19]. The weight of term $i$ in document $j$ is given by

$$w_{ji} = tf_{ji} \times idf_{ji} = tf_{ji} \cdot \log_2\!\left(\frac{N}{df_{ji}}\right), \tag{1}$$

where $tf_{ji}$ is the number of occurrences of term $i$ in document $j$, $df_{ji}$ is the number of documents in the collection that contain term $i$, and $N$ is the total number of documents in the collection. This weighting scheme discounts frequent words with little discriminating power.
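For concreteness, the following minimal Python sketch computes these TF-IDF weights for a toy corpus of pre-tokenized documents (the corpus, the function name and the implementation details are illustrative assumptions, not code from this study):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights per Eq. (1) for a list of tokenized documents."""
    N = len(docs)
    df = Counter()  # document frequency: number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequencies within this document
        vectors.append({t: tf[t] * math.log2(N / df[t]) for t in tf})
    return vectors

# Toy corpus of pre-tokenized documents (illustrative only).
docs = [["data", "mining", "text"],
        ["text", "clustering", "text"],
        ["data", "clustering"]]
for vec in tfidf_vectors(docs):
    print(vec)
```

Note that a term occurring in every document receives weight zero, reflecting its lack of discriminating power.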

B. Similarity Metric

To use a clustering algorithm, we need some way of judging the similarity between two documents. We used the cosine similarity, which is given by

$$\cos(\vec{d}_1, \vec{d}_2) = \frac{\vec{d}_1 \bullet \vec{d}_2}{\|\vec{d}_1\|\,\|\vec{d}_2\|}, \tag{2}$$

where $\bullet$ denotes the dot product and $\|\cdot\|$ denotes the norm of a vector. Usually, to accommodate documents of various lengths, the document vectors are normalized to unit length. We also define a centroid vector $\vec{m}$ for each set $S$ of documents and their corresponding vector representations:

$$\vec{m} = \frac{1}{N}\sum_{\vec{d}\in S}\vec{d}, \tag{3}$$

where $N$ is the number of documents in the set $S$. We note that calculating the similarity between a document and a cluster centroid is equivalent to calculating the average similarity between that document and all the documents contained in the cluster that the centroid represents. Mathematically,

$$\vec{d}_i \bullet \vec{m} = \frac{1}{N}\sum_{\vec{d}\in S}\vec{d}_i \bullet \vec{d} = \frac{1}{N}\sum_{\vec{d}\in S}\cos(\vec{d}_i, \vec{d}). \tag{4}$$
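A minimal NumPy sketch of these similarity computations is given below (the toy vectors and function names are our own illustrative assumptions); it also verifies the equivalence in Eq. (4) for unit-length vectors:

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two document vectors (Eq. 2)."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

def centroid(docs):
    """Centroid of a set of document vectors (Eq. 3)."""
    return np.mean(docs, axis=0)

# With unit-length document vectors, the dot product with the centroid
# equals the average cosine similarity to every document in the set (Eq. 4).
docs = np.array([[0.6, 0.8, 0.0],
                 [0.0, 0.6, 0.8],
                 [0.8, 0.0, 0.6]])  # rows already normalized to unit length
d = docs[0]
m = centroid(docs)
print(np.dot(d, m))                           # left-hand side of Eq. 4
print(np.mean([np.dot(d, x) for x in docs]))  # right-hand side of Eq. 4
```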


C. A Formal Statement of the Document Clustering Problem

Let $S = \{d_1, d_2, \ldots, d_n\}$ be a set of $n$ document vectors, each having $p$ components. These vectors can also be represented by a profile data matrix $Z_{n\times p}$ having $n$ $p$-dimensional row vectors. The $i$th row vector $\vec{Z}_i$ characterizes the $i$th object from the set $S$, and each element $z_{i,j}$ of $\vec{Z}_i$ corresponds to the $j$th component ($j = 1, 2, \ldots, p$) of the $i$th vector ($i = 1, 2, \ldots, n$). Given such a $Z_{n\times p}$, a partitional clustering algorithm tries to find a partition $C = \{C_1, C_2, \ldots, C_k\}$ such that the similarity of the vectors in the same cluster $C_i$ is maximum and patterns from different clusters differ as much as possible. The partition should maintain the following properties:

1) Each cluster should have at least one vector assigned to it, i.e., $C_i \neq \emptyset \;\;\forall i \in \{1, 2, \ldots, k\}$.

2) Two different clusters should have no document vector in common, i.e., $C_i \cap C_j = \emptyset \;\;\forall i \neq j$ and $i, j \in \{1, 2, \ldots, k\}$.

3) Each pattern must be assigned to some cluster, i.e., $\bigcup_{i=1}^{k} C_i = S$.

Since the given dataset can be partitioned in a number of ways while maintaining all of the above properties, a fitness function (some measure of the adequacy of the partitioning) must be defined. The problem then becomes one of finding a partition $C^*$ of optimal or near-optimal adequacy compared to all other feasible solutions $C = \{C^1, C^2, \ldots, C^{N(n,k)}\}$, where

$$N(n,k) = \frac{1}{k!}\sum_{i=0}^{k}(-1)^i \binom{k}{i}(k-i)^n \tag{5}$$

is the number of feasible partitions. This is the same as

$$\operatorname*{Optimize}_{C}\; f(Z_{n\times p}, C), \tag{6}$$

where $C$ is a single partition from the set $\mathbf{C}$ and $f$ is a statistical-mathematical function that quantifies the goodness of a partition on the basis of the distance measure of the patterns. It has been shown in [20] that the clustering problem is NP-hard when the number of clusters exceeds 3.
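For illustration, the short Python sketch below evaluates Eq. (5) and cross-checks it by brute-force enumeration of the label assignments that satisfy the three partition properties above (the function names and toy problem size are our own assumptions):

```python
from itertools import product
from math import comb, factorial

def num_partitions(n, k):
    """Number of ways to partition n objects into k non-empty clusters (Eq. 5)."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n
               for i in range(k + 1)) // factorial(k)

def brute_force(n, k):
    """Count label assignments that use all k clusters, ignoring cluster order."""
    surjective = sum(1 for labels in product(range(k), repeat=n)
                     if len(set(labels)) == k)
    return surjective // factorial(k)

print(num_partitions(5, 3), brute_force(5, 3))  # both print 25
```

Even for modest n and k this number grows explosively, which motivates treating clustering as an optimization problem rather than an exhaustive search.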

III. MODIFIED DIFFERENTIAL EVOLUTION

In 1995, Storn and Price made an attempt to replace the classical crossover and mutation operators of the GA with alternative operators [14], and found a suitable vector differential operator to handle the problem. They proposed a new algorithm based on this operator and called it Differential Evolution (DE). DE begins with a randomly initialized population of p-dimensional real-valued parameter vectors. Each vector, also known as a 'genome' or 'chromosome', forms a candidate solution to the multidimensional optimization problem. The initial population (at time t = 0) is chosen randomly and should be representative of as much of the search space as possible. Subsequent generations in DE are represented by discrete time steps: t = 1, 2, etc. Since the parameter vectors are likely to change over different generations, the following notation is adopted here for representing the $i$th vector of the population at the current generation (at time $t$):

$$\vec{X}_i(t) = [X_{i,1}(t), X_{i,2}(t), \ldots, X_{i,p}(t)]. \tag{7}$$

For each individual vector $\vec{X}_k$ belonging to the current population, DE randomly samples three other individuals $\vec{X}_i$, $\vec{X}_j$ and $\vec{X}_m$ from the same generation (for distinct $k$, $i$, $j$ and $m$). It then calculates the component-wise difference $\vec{X}_i - \vec{X}_j$, scales it by a scalar $R \in [0,1]$ and creates a trial offspring vector by adding the result to the chromosomes of $\vec{X}_m$. Thus, for the $n$th component of each parameter vector, we have

$$U_{k,n}(t+1) = \begin{cases} X_{m,n}(t) + R\cdot\big(X_{i,n}(t) - X_{j,n}(t)\big) & \text{if } \operatorname{rand}_n(0,1) < CR, \\ X_{k,n}(t) & \text{otherwise}, \end{cases}$$

where $CR \in [0,1]$ is the crossover rate and $\operatorname{rand}_n(0,1)$ is a uniform random number drawn independently for each component $n$.
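As an illustration of this trial-vector construction, a minimal Python sketch is given below (the function name, parameter values and NumPy-based implementation are our own assumptions; the DERANDSF modification of the scale factor is not shown):

```python
import numpy as np

def de_trial_vector(pop, k, R=0.8, CR=0.9, rng=None):
    """Build a DE trial vector for individual k under the scheme above.

    Each component takes the donor value X_m + R * (X_i - X_j) when a
    uniform random draw falls below the crossover rate CR, and keeps the
    parent component X_k otherwise.
    """
    rng = rng or np.random.default_rng()
    pop_size, p = pop.shape
    # Sample three distinct individuals, all different from k.
    i, j, m = rng.choice([idx for idx in range(pop_size) if idx != k],
                         size=3, replace=False)
    donor = pop[m] + R * (pop[i] - pop[j])
    mask = rng.random(p) < CR  # per-component crossover decision
    return np.where(mask, donor, pop[k])

# Toy usage: a population of 6 five-dimensional parameter vectors.
rng = np.random.default_rng(0)
population = rng.random((6, 5))
print(de_trial_vector(population, k=0, rng=rng))
```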