High Performance Clustering with Differential Evolution

Sandra Paterlini
Dept. of Political Economics, University of Modena and Reggio E.
Viale J. Berengario 51, 41100 Modena, Italy
Email: [email protected]

Thiemo Krink
EVALife Group, Dept. of Computer Science, University of Aarhus
Aabogade 34, 8200 Aarhus N, Denmark
Email: [email protected]

Abstract: Partitional clustering poses an NP-hard search problem for non-trivial problems. While genetic algorithms (GAs) have been very popular in the clustering field, particle swarm optimization (PSO) and differential evolution (DE) are rather unknown. In this paper, we report results of a performance comparison between a GA, PSO and DE for a medoid evolution clustering approach. Our results show that DE is clearly and consistently superior compared to GAs and PSO, both with respect to precision and robustness of the results for hard clustering problems. We conclude that DE rather than GAs should be primarily considered for tackling partitional clustering problems with numerical optimization.

I. INTRODUCTION

Cluster analysis is used in many different fields as a tool for preliminary and descriptive analysis and for unsupervised classification of objects characterized by different features. Partitional clustering algorithms aim to identify homogeneous groups by finding similarities between objects regarding their characterizing attributes. The algorithmic task can be stated as an optimization problem for which the objective is to maximize the similarities among objects within the same clusters while minimizing the dissimilarities between different clusters. This can be quantified by a statistical criterion, such as by defining the objective as the minimization of the trace of the within variance matrix. Ideally, a clustering algorithm should be simple, efficient and capable of dealing with huge datasets. Moreover, it should be objective and robust for equivalent samples and able to detect different cluster shapes. Nowadays, the k-means algorithm is one of the most popular partitional clustering algorithms, because it is easy to implement and very efficient, owing to its linear time complexity. However, its main drawbacks are that it converges to arbitrary local optima and that it cannot deal well with non-spherically shaped clusters.

Many partitional clustering algorithms that have been introduced in recent years are based on genetic algorithms (GAs) [8], which are stochastic search heuristics inspired by Darwinian evolution and genetics. An important advantage of these algorithms is their ability to cope with local optima by maintaining, recombining and comparing several candidate solutions simultaneously. In contrast, local search heuristics, such as the stochastic simulated annealing algorithm, only refine a single candidate solution. Deterministic local search, which is used in the k-means algorithm, always converges to the nearest local optimum from the starting position of the search. The only way to explore the search space better is to re-run the algorithm while initializing the search from different starting points. Therefore, GAs are obviously an interesting alternative to k-means and simulated annealing in clustering.

GAs have been applied to partitional clustering in many ways. Most of them can be grouped into three major categories: (i) direct encoding of the object-cluster association, (ii) encoding of cluster separating boundaries, and (iii) centroid/medoid encoding for each cluster. We refer to [10] for a literature review. Here, we summarize only the main contributions for using GAs with centroid/medoid encoding. In this type of approach, each GA individual encodes a representative variable (typically a centroid or medoid) and optionally a set of parameters to describe the extent and shape of the variance for each cluster.



Srikanth et al. [16] proposed an algorithm which encodes the center, extent, and orientation of an ellipsoid for each cluster. Moreover, many authors proposed cluster centroids, barycentres, or medoids as representation points to allocate each object to a specific cluster (e.g. [1, 4, 13, 15]). The idea is to determine a representation point for each cluster and to allocate each object to the cluster with the nearest representation point, where 'nearest' refers to a distance measure, such as the Euclidean distance. The fitness of a candidate solution is then computed as the adequacy of the identified partition according to a statistical criterion, such as Marriott's criterion or the trace of the within matrix criterion (see Section II.B). Many studies have shown that this approach is more robust in converging towards the optimal partition than classic partitional algorithms [1, 4, 13, 15]. Compared to the great number of studies on partitional clustering with GAs, only a couple of applications using PSO (e.g. [18]) and no application using DE (to our knowledge) can be found in the literature. In this study, we compared the performance of GAs with PSO and DE as heuristic search methods for the medoid evolution algorithm previously introduced by Paterlini and Minerva [15] on a set of artificial and real-world machine learning data sets. Moreover, we compared these results with the nominal classification, k-means and random search (RS) as a lower bound technique. The remaining sections of the paper are organized as follows: Section 2 describes the medoid evolution approach and introduces the search heuristics. Section 3 describes the experimental set-up regarding the algorithmic parameters, benchmark problems, and run schedule. Section 4 reports the main results, and finally, Section 5 comments on our results and concludes our study.

II. THE MEDOID EVOLUTION APPROACH

A. The Clustering Problem
Let O = {o_1, o_2, ..., o_n} be a set of n objects and let X_{n×p} be the profile data matrix, with n rows and p columns. Each i-th object is characterized by a real-valued p-dimensional profile vector x_i (i = 1, ..., n), where each element x_{ij} of x_i corresponds to the j-th real-valued feature (j = 1, ..., p) of the i-th object. Given X_{n×p}, the goal of a partitional clustering algorithm is to determine a partition G = {C_1, C_2, ..., C_g} (i.e., C_k ≠ ∅ ∀k; C_k ∩ C_h = ∅ ∀k ≠ h; ∪_{k=1}^{g} C_k = O) such that objects which belong to the same cluster are as similar to each other as possible, while objects which belong to different clusters are as dissimilar as possible. For this, a measure of adequacy of the partition must be defined. The clustering problem is to find the partition G* that has optimal adequacy with respect to all other feasible solutions in {G^1, G^2, ..., G^{N(n,g)}} (i.e., G^i ≠ G^j, i ≠ j). The number of all feasible partitions is:

N(n, g) = \frac{1}{g!} \sum_{k=0}^{g} (-1)^{g-k} \binom{g}{k} k^{n}

The problem can be stated as:

\operatorname{optimise}_{G} \; f(X_{n \times p}, G)

where G corresponds to a single partition in {G^1, ..., G^{N(n,g)}} and f(·) is a statistical function that quantifies the goodness of the partition (see Section II.B). It has been shown that the clustering problem is NP-hard when the number of clusters exceeds three [2].

B. Statistical Clustering Criteria
Different statistical criteria have been proposed to measure the degree of adequacy of a partition and to allow comparison across different partitions [7, 11, 12]. These criteria usually involve transformations, such as the trace or determinant, of the pooled-within groups scatter matrix (W) and of the between groups scatter matrix (B). The pooled-within scatter matrix W is defined as:

W = \sum_{k=1}^{g} W_k

where W_k is the variance matrix of the objects' features allocated to cluster C_k (k = 1, ..., g). Thus, if x_l^{(k)} indicates the l-th object in cluster C_k, n_k the number of objects in cluster C_k, and (·)' the transpose of a vector:

W_k = \sum_{l=1}^{n_k} (x_l^{(k)} - \bar{x}^{(k)})(x_l^{(k)} - \bar{x}^{(k)})', \quad \text{where } \bar{x}^{(k)} = \Big( \sum_{l=1}^{n_k} x_l^{(k)} \Big) / n_k

is the centroid vector of cluster C_k.

The between scatter matrix B is defined as:

B = \sum_{k=1}^{g} n_k (\bar{x}^{(k)} - \bar{x})(\bar{x}^{(k)} - \bar{x})', \quad \text{where } \bar{x} = \Big( \sum_{i=1}^{n} x_i \Big) / n

Then, the total scatter matrix T of the n observations can be decomposed as T = B + W.
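To make these definitions concrete, the following minimal NumPy sketch computes W, B, and T for a labelled dataset and checks the decomposition T = B + W. It is our own illustration, not the paper's implementation; the names scatter_matrices, X, and labels are assumptions, and every cluster is taken to be non-empty.

```python
import numpy as np

def scatter_matrices(X, labels, g):
    """Pooled-within (W), between (B), and total (T) scatter matrices for an
    n x p data matrix X whose rows are assigned to clusters 0..g-1 by labels.
    Assumes every cluster is non-empty (a feasible partition)."""
    grand_mean = X.mean(axis=0)                  # overall centroid x-bar
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for k in range(g):
        Xk = X[labels == k]                      # objects allocated to C_k
        mean_k = Xk.mean(axis=0)                 # cluster centroid x-bar^(k)
        D = Xk - mean_k
        W += D.T @ D                             # within-cluster scatter W_k
        d = (mean_k - grand_mean)[:, None]
        B += len(Xk) * (d @ d.T)                 # between-cluster contribution
    T = (X - grand_mean).T @ (X - grand_mean)    # total scatter
    assert np.allclose(T, B + W)                 # decomposition T = B + W
    return W, B, T
```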


In our study, we consider two statistical criteria to measure the adequacy of the partition and define the optimization problem optimise_G f(X_{n×p}, G) respectively as:

(1) TRW - Trace Within Criterion [7]:

\operatorname{minimise}_{G} \; \operatorname{trace}(W)

This criterion implicitly assumes a low correlation among measurements, gives equal importance to the variance within the groups, tends to create spherical clusters, and allows orthogonal transformations of the data.

(2) MC - Marriott's Criterion [11, 12]:

\operatorname{minimise}_{G} \; g^2 \, \frac{\det(W)}{\det(T)}

The Marriott criterion addresses the correlation between variables, detects elliptical clusters with axes that are not parallel to the coordinates, and allows linear (non-singular) transformations of the data. Marriott's criterion is commonly used to search for clusters characterized by such a strong internal correlation that one or more eigenvalues are equal to zero. For further details about these and related criteria the reader is referred to Everitt [5]. Note that each cluster set C_k of G must contain at least one object, i.e., C_k ≠ ∅.

C. Fitness Evaluation and Search Space
In our approach, we use floating point arrays to encode representation points, hereafter called medoids, which we use to specify the allocation of objects to clusters. For this we use a method inspired by Forgy [6] that maps the medoid search space to the clustering search space {G^1, G^2, ..., G^{N(n,g)}}. The idea is simply to allocate each object to the cluster corresponding to the nearest medoid. 'Nearest' refers to a distance metric, which is the Euclidean distance in our study. Note that this method does not define a one-to-one correspondence between solutions in the medoid search space and the space of feasible partitions, i.e., different medoid vectors can identify the same partition. Moreover, the mapping can result in infeasible partitions, such that one or more clusters do not contain any objects. The quality of a feasible partition H is evaluated by using one of the two statistical criteria (TRW and MC) described in Section II.B. Otherwise, if H is infeasible, i.e., C_k = ∅ for some k, we assign a penalty fitness K to the candidate solution. More formally, the fitness function is defined as:

F(X_{n \times p}, m) =
\begin{cases}
f(X_{n \times p}, H) & \text{if } H \in \{G^1, G^2, \ldots, G^{N(n,g)}\} \\
K & \text{if } H \notin \{G^1, G^2, \ldots, G^{N(n,g)}\}
\end{cases}

where m is the medoid vector encoded in a candidate solution, f(·) is either TRW or MC, and K is 10^8, which corresponds to a fitness worse than the worst feasible fitness. Hence, if X_{n×p} is the profile matrix and g the number of clusters {C_1, C_2, ..., C_g} of the set of n objects O = {o_1, o_2, ..., o_n}, each chromosome in the population consists of p × g genes m_{kj} (k = 1, ..., g, j = 1, ..., p), such that each group of p genes encodes a medoid vector m_k. Figure 1 shows an example for a problem with 3 clusters and 2 features.

[Figure: a chromosome m shown as the concatenated medoid coordinates of clusters 1, 2, and 3.]

Figure 1. Example of a cluster problem encoding with 3 clusters and 2 features.

In principle, any point in R^p could be considered as a possible choice for a medoid. A common choice is the profile matrix domain [x_min, x_max]. However, in our study, we decided to make the medoid domain 40% larger than the profile matrix domain, because good medoid solutions could be located slightly beyond the profile matrix domain border.
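Putting the pieces of this subsection together, the decoding and penalized fitness evaluation could be sketched as follows. The sketch reuses the scatter_matrices helper from Section II.B above; the function names, the per-gene reshaping, and the even per-side split of the 40% domain enlargement are our assumptions, while K = 10^8 and the two criteria follow the text.

```python
import numpy as np

K = 1e8   # penalty fitness, worse than any feasible fitness value

def decode(medoids, X):
    """Allocate each object to the cluster of its nearest (Euclidean) medoid."""
    dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fitness(chromosome, X, g, criterion="TRW"):
    """F(X, m): TRW or MC on the decoded partition H, penalty K if infeasible."""
    medoids = chromosome.reshape(g, -1)        # g medoids of p genes each
    labels = decode(medoids, X)
    if len(np.unique(labels)) < g:             # some cluster C_k is empty
        return K
    W, _, T = scatter_matrices(X, labels, g)
    if criterion == "TRW":
        return np.trace(W)                     # minimise trace(W)
    return g**2 * np.linalg.det(W) / np.linalg.det(T)   # Marriott's criterion

def medoid_bounds(X):
    """Medoid domain 40% larger than the profile matrix domain (assumed here
    to mean 20% added on each side of every feature range)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    pad = 0.2 * (hi - lo)
    return lo - pad, hi + pad
```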

D. Search Heuristics

D.1 The Genetic Algorithm
A GA is an evolutionary algorithm inspired by Darwinian evolution and genetics. In our GA implementation, first a population of individuals containing the candidate solutions (encoded as floating point numbers) is created and the fitness of each individual is evaluated by the fitness function. The chromosomes of the start-up population are each initialized as a concatenation of g randomly chosen object feature vectors from the dataset. After initialization, the population is iteratively refined by selection of individuals for the next iteration, application of mutation and crossover operators, and re-evaluation of the new population according to the fitness function.


For selection we use tournament selection of size 2 and elitism with an elite size of 10. For mutation, we use a Gaussian mutation operator, which alters individuals such that

j_i = j_i + N(0,1) \cdot \sigma_m \cdot (x_{i,\max} - x_{i,\min})

where j_i is the i-th gene of individual j, N(0,1) is the Gaussian normal distribution, and σ_m is the variance parameter of the mutation operator. The crossover operator in our algorithm is arithmetic crossover with

c_i = w_i \cdot a_i + (1 - w_i) \cdot b_i

where c_i is the offspring genome of the parent genomes a_i and b_i, w_i is a random weight in the interval [0, 1] and i = 1, ..., n, with n = g · p (the number of genes). The application of the crossover operator to a genome j means that j becomes parent a, parent b is chosen randomly from the population, and the offspring c substitutes j. Both operators are applied to each individual in the population which is not in the elite, with probability p_m for mutation and p_c for crossover, respectively. The algorithm terminates after a fixed number of iterations.
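In code, the two operators could be sketched as below. This is our rendering, not the authors' code: the per-gene domain-width scaling (x_{i,max} - x_{i,min}) is our reconstruction of the damaged formula, and perturbing every gene of a selected individual is one possible reading of the operator.

```python
import numpy as np

rng = np.random.default_rng()

def mutate(genome, sigma_m, x_min, x_max):
    """Gaussian mutation: j_i += N(0,1) * sigma_m * (x_i,max - x_i,min)."""
    return genome + rng.standard_normal(genome.size) * sigma_m * (x_max - x_min)

def arithmetic_crossover(a, b):
    """Arithmetic crossover: c_i = w_i * a_i + (1 - w_i) * b_i, w_i ~ U(0,1)."""
    w = rng.random(a.size)
    return w * a + (1.0 - w) * b
```

In the main loop, each non-elite individual would undergo mutate with probability p_m and, with probability p_c, be replaced by arithmetic_crossover applied to itself and a randomly chosen partner.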

D.2 Particle Swarm Optimization (PSO)
PSO was introduced by Kennedy and Eberhart [9] and is inspired by the swarming behavior of animals and human social behavior. A particle swarm is a population of particles, where each particle is a moving object that 'flies' through the search space and is attracted to previously visited locations with high fitness. Each particle consists of a position vector x, which represents the candidate solution to the optimization problem, the fitness of solution x, a velocity vector v, and a memory vector p̂ of the best candidate solution encountered by the particle, with its recorded fitness. The position of a particle is updated by

x(t+1) \leftarrow x(t) + v(t+1)

and its velocity according to

v(t+1) \leftarrow \chi \left( w \, v(t) + \varphi_1 (\hat{p}(t) - x(t)) + \varphi_2 (\hat{g}(t) - x(t)) \right)

where φ_1, φ_2 are uniformly distributed random numbers within [φ_min, φ_max] (typically φ_min = 0.0 and φ_max = 2.0) that determine the weight between the particles' tendency to follow their current direction compared to the memorized positions p̂ and ĝ (the best position encountered by the whole swarm). Finally, the so-called constriction factor χ can be used to manipulate the overall velocity of the swarm. In our preliminary parameter tuning experiments (data not shown in this paper), we focused on the control of the inertia weight w, which was decisive for the performance of the PSO. Moreover, the velocity of the particles is limited by a maximum velocity v_max, which is typically half of the domain size for each parameter in x. The initialization of the algorithm corresponds to the description for the GA above, but additionally requires the initialization of the velocity vectors, which are uniformly distributed random numbers in the interval [0, v_max]. After initialization, the memory of each particle is updated and the velocity and position update rules are applied. If the velocity exceeds v_max it is truncated to this value. Moreover, if a new position vector is outside the domain, it is moved back into the search space by adding the negative distance with which it exceeds the search space to the position vector. This process is applied to all particles and repeated for a fixed number of iterations.
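A per-particle update following these rules is sketched below. The constriction factor chi, inertia weight w, and velocity cap follow the text; treating the boundary rule as moving an offending coordinate back by exactly its excess (i.e., onto the border, which a clip reproduces) is our literal reading.

```python
import numpy as np

rng = np.random.default_rng()

def pso_step(x, v, p_best, g_best, w, chi, phi_max, v_max, lo, hi):
    """One velocity and position update for a single particle."""
    phi1 = rng.uniform(0.0, phi_max, size=x.size)
    phi2 = rng.uniform(0.0, phi_max, size=x.size)
    v = chi * (w * v + phi1 * (p_best - x) + phi2 * (g_best - x))
    v = np.clip(v, -v_max, v_max)     # truncate velocities exceeding v_max
    x = x + v
    x = np.clip(x, lo, hi)            # move exceeded coordinates back to the border
    return x, v
```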

D.3 Differential Evolution (DE)
Differential evolution [17] is a rather unknown approach to numerical optimization, which is very simple to implement and requires little or no parameter tuning. After generating and evaluating an initial population, as described for the GA above, the solutions are refined by a DE operator, called "Rand1Exp", as follows: for each individual genome j, choose three other individuals k, l, and m randomly from the population (with j ≠ k ≠ l ≠ m), calculate the difference of the chromosomes in k and l, scale it by multiplication with a parameter f, and create an offspring by adding the result to the chromosome of m. The only additional twist in this process is that not the entire chromosome of the offspring is created in this way, but genes are partly inherited from individual j, such that

j'.\text{gene}_i =
\begin{cases}
m.\text{gene}_i + f \cdot (k.\text{gene}_i - l.\text{gene}_i) & \text{if } U(0,1) < CR \\
j.\text{gene}_i & \text{otherwise}
\end{cases}

where U(0,1) is a uniformly distributed random number in [0, 1] and CR is the crossover rate.
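The offspring construction can be sketched as follows. Because the source text breaks off at the inheritance condition, the per-gene U(0,1) < CR test and the crossover rate CR are our completion of the standard DE scheme; a strict "Exp" variant would copy a contiguous run of genes from the donor rather than testing each gene independently.

```python
import numpy as np

rng = np.random.default_rng()

def de_offspring(pop, j, f, cr):
    """Offspring for individual j from donor m + f * (k - l); genes are
    otherwise inherited from j (per-gene test, see the caveat above)."""
    others = [i for i in range(len(pop)) if i != j]
    k, l, m = rng.choice(others, size=3, replace=False)
    donor = pop[m] + f * (pop[k] - pop[l])   # m.gene_i + f * (k.gene_i - l.gene_i)
    take = rng.random(pop[j].size) < cr      # take the donor gene if U(0,1) < CR
    return np.where(take, donor, pop[j])
```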
