A Genetic EM Algorithm for Learning the Optimal Number of Components of Mixture Models

Wei Lu
German Research Center for Artificial Intelligence
DFKI GmbH, IUPR Research Group
Erwin Schroedinger Strasse, 67663 Kaiserslautern, Germany
[email protected]

Issa Traore
Department of Electrical and Computer Engineering
University of Victoria
P.O. Box 3055 STN CSC, Victoria, V8W 3P6, Canada
[email protected]

Abstract: - Mixture models have been widely used in cluster analysis. Traditional mixture-densities-based clustering algorithms usually predefine the number of clusters through random selection or content-based knowledge. An improper pre-selection of the number of clusters can easily lead to poor clustering outcomes. The expectation-maximization (EM) algorithm is a common approach to estimating the parameters of mixture models. However, EM is prone to converging to a local maximum after a limited number of iterations. Moreover, EM usually assumes that the number of mixture components is known in advance, which is not always the case in practice. To address these issues, we propose in this paper a new genetic EM algorithm that learns the optimal number of components of mixture models. Specifically, the algorithm defines an entropy-based fitness function and two genetic operators for splitting and merging components. We conducted two sets of experiments, using a synthetic dataset and two existing benchmarks, to validate our genetic EM algorithm. The results of the first experiment show that the algorithm can estimate exactly the optimal number of clusters for a set of data. In the second experiment, we computed three major clustering validity indices and compared the results with those obtained using established clustering techniques, and found that our genetic EM algorithm achieves better clustering structures.

Key-Words: - Clustering, Gaussian mixture model, EM algorithm, Genetic algorithm, Clustering validity

1 Introduction

Clustering is the organization of data patterns into groups based on some measure of similarity [11]. The main goal of clustering is to group data objects into classes so that intra-class similarity is maximized and inter-class similarity is minimized. Many clustering algorithms have been proposed over the past decades [20]. One of the most difficult issues in cluster analysis is estimating the optimal number of clusters for a given dataset [9]. Identifying the optimal number of clusters is essential for effective and efficient data clustering [7]. For instance, a traditional clustering algorithm such as k-means may produce a poor clustering result if the initial partitions are not properly chosen, and choosing them properly is not an obvious task.

A popular clustering approach that is sensitive to this problem is based on the Gaussian mixture model (GMM) [2]. GMM assumes that the data to be clustered are drawn from one of several Gaussian distributions. It has been suggested that a Gaussian mixture distribution can approximate any distribution to arbitrary accuracy, provided a sufficient number of components is used [16]. Consequently, the entire data collection is seen as a mixture of several Gaussian distributions, and the corresponding probability density function can be expressed as a weighted finite sum of Gaussian components with different parameters and mixing proportions [18]. The EM algorithm is a common approach for estimating the parameters of a GMM [5]. However, EM may easily converge to a local maximum after a limited number of iterations. Moreover, EM usually assumes that the number of mixture components is known in advance, which is not always the case in practice.

Previous attempts to estimate the number of mixture components of a GMM are mainly based on statistical techniques; examples of such work include [10], [12], [13], [15] and [19]. Although these statistical approaches have proved able to estimate the optimal number of components of mixture models, they are prone to converging to local optima, since they usually stop searching once the corresponding criteria reach certain thresholds. To address this limitation, we propose in this paper a new genetic EM algorithm that learns the optimal number of components of mixture models for a set of data. In contrast with previous statistical approaches, genetic computation schemes have the inherent capability to escape from local maxima, since their search space for optimal solutions can be extended through genetic operations and optimization. Our approach therefore integrates a genetic computation scheme into the EM algorithm on the basis of the GMM.

The rest of the paper is structured as follows. Section 2 reviews the GMM and the EM algorithm. The genetic EM algorithm is proposed in Section 3. Section 4 presents and discusses the clustering validity results, and Section 5 makes some concluding remarks.

2 Gaussian Mixture Model

A random variable $x$ is said to follow a finite mixture distribution when its probability density function $p(x)$ can be represented as a weighted sum of kernels:

$$p(x) = \sum_{i=1}^{k} \alpha_i f_i(x; \mu_i, \sigma_i)$$

where $k$ is the number of mixture components and the $\alpha_i$ ($1 \le i \le k$) are the mixing proportions or mixing coefficients, which satisfy the constraints $\alpha_i \ge 0$ and $\sum_{i=1}^{k} \alpha_i = 1$; $f_i(x; \mu_i, \sigma_i)$ is the component density or kernel, where $\mu_i$ refers to the mean and $\sigma_i$ to the variance of $x$. When each component is Gaussian, we say that $x$ follows a Gaussian mixture model, and the kernel function $f_i(x; \mu_i, \sigma_i)$ can be a multivariate or a univariate Gaussian.

Suppose each mixture component is a univariate Gaussian distribution. The steps of the EM algorithm used to estimate the set of parameters $\{\alpha_i, \mu_i, \sigma_i\}$ are as follows:

- Initialize the parameter vector $\theta^0 = \{\alpha_i^0, \mu_i^0, \sigma_i^0\}$.

- E-step: for every data sample in $x \sim \{x_n,\ n = 1, 2, \ldots, N\}$ and for every mixture component $i$ ($1 \le i \le k$), compute the posterior probability $p(i \mid x_n)$:

$$p(i \mid x_n) = \frac{\alpha_i N(x_n; \mu_i, \sigma_i)}{\sum_{i=1}^{k} \alpha_i N(x_n; \mu_i, \sigma_i)}$$

- M-step: re-estimate the parameters using samples weighted by the posterior probabilities $p(i \mid x_n)$:

$$\alpha_i^{new} = \frac{1}{N} \sum_{n=1}^{N} p(i \mid x_n)$$

$$\mu_i^{new} = \sum_{n=1}^{N} \left( \frac{p(i \mid x_n)}{\sum_{n=1}^{N} p(i \mid x_n)} \right) x_n$$

$$\sigma_i^{new} = \sum_{n=1}^{N} \left( \frac{p(i \mid x_n)}{\sum_{n=1}^{N} p(i \mid x_n)} \right) (x_n - \mu_i^{new})(x_n - \mu_i^{new})^T$$

- Repeat the E-step and M-step until the algorithm converges.
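For concreteness, the E- and M-steps above can be written out in a few lines of NumPy. The following is a minimal sketch for the univariate case with a fixed number of components $k$; the function name, the random initialization and the convergence tolerance are our own choices rather than prescriptions of the paper.

```python
import numpy as np

def em_univariate_gmm(x, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a univariate Gaussian mixture with k components."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    N = x.size
    # Initialization of theta^0 = {alpha_i^0, mu_i^0, sigma_i^0}.
    alpha = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior p(i | x_n) for every sample n and component i.
        dens = np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2 / var[:, None])
        dens /= np.sqrt(2.0 * np.pi * var[:, None])            # N(x_n; mu_i, sigma_i)
        weighted = alpha[:, None] * dens                        # alpha_i * N(...)
        post = weighted / weighted.sum(axis=0, keepdims=True)   # p(i | x_n), shape (k, N)
        # M-step: re-estimate parameters with posterior-weighted samples.
        Nk = post.sum(axis=1)
        alpha = Nk / N
        mu = (post @ x) / Nk
        var = (post * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk
        # Stop when the log-likelihood no longer improves noticeably.
        ll = np.log(weighted.sum(axis=0)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return alpha, mu, var, post
```

Vectorizing over components and samples keeps each EM iteration at O(kN).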

3 Genetic EM Algorithm

The proposed genetic EM algorithm first defines a three-tuple vector to represent the initial individuals. Two special genetic operators, splitting and merging, are then used to produce new clusters (individuals). An entropy-based fitness function is defined and used to fit the Gaussian mixture to the data. New populations are selected and composed of two groups of individuals: one is randomly produced by the genetic operators, and the other always maintains the best individuals by comparing the corresponding fitness function values. The genetic algorithm performs EM steps in order to ensure maximum likelihood, while ensuring the fitness of the mixture model to the data and, at the same time, accelerating convergence. In the following, we first introduce the basic parameters and operators used in our algorithm, and then present the algorithm itself.

3.1 Genetic Parameters and Operators

Encoding individuals: A three-tuple vector $\langle \alpha_i, \mu_i, \sigma_i \rangle$ ($1 \le i \le k$) represents an individual in a given dataset. EM generates an estimate of the set of parameters $\{\alpha_i, \mu_i, \sigma_i\}$ ($1 \le i \le k$) and of the posterior probabilities $p(i \mid x_n)$ ($1 \le n \le N$). The posterior probability characterizes the likelihood that the data pattern $x_n$ is approximated by a specific Gaussian component $i$: the greater the posterior probability that $x_n$ belongs to component $i$, the better the approximation. As a result, each data pattern $x_n$ is assigned to the corresponding Gaussian component $i$ according to $p(i \mid x_n)$.


Operators: Two operators are defined and used to split or merge components during the evolution. The splitting operation replaces a component with two new components, thus increasing the total number of components by 1. For a univariate Gaussian, the means, variances and mixing proportions of the two new components are calculated as follows:

$$\mu_{1,split} = \mu_i + \sigma_i, \qquad \mu_{2,split} = \mu_i - \sigma_i$$

$$\sigma_{1,split} = \sigma_{2,split} = \sigma_i$$

$$\alpha_{1,split} = \alpha_{2,split} = \frac{\alpha_i}{2}$$

For a multivariate Gaussian, all means, covariance matrices, and mixing proportions are re-estimated using the EM algorithm.

The merging operation generates a new component by combining two existing components, thus decreasing the total number of components by 1. For a univariate Gaussian, the mean, variance and mixing proportion of the new component are:

$$\mu_{merge} = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}$$

$$\sigma_{merge}^2 = \frac{\alpha_1}{\alpha_1 + \alpha_2} \left[ \sigma_1 + (\mu_1 - \mu_{merge}) \right]^2 + \frac{\alpha_2}{\alpha_1 + \alpha_2} \left[ \sigma_2 + (\mu_2 - \mu_{merge}) \right]^2$$

$$\alpha_{merge} = \alpha_1 + \alpha_2$$

For a multivariate Gaussian, all means, covariance matrices, and mixing proportions are re-computed using the EM algorithm.

Fitness function: We define an entropy-based fitness function that measures how well the mixture model approximates the unknown density underlying the data samples: the higher the fitness value, the better the approximation. Our entropy-based fitness function EFT is defined as follows:

$$EFT = \sum_{i=1}^{k} \alpha_i \, ef_i$$

where $k$ is the number of mixture components and the $\alpha_i$ ($1 \le i \le k$) are the mixing proportions satisfying the constraints $\alpha_i \ge 0$ and $\sum_{i=1}^{k} \alpha_i = 1$; $ef_i$ denotes the individual fitness of each mixture component, defined as follows:

$$ef_i = \frac{H(C_i)}{H_{max}(C_i)}$$

where $C_i$ stands for the set of data whose probability of belonging to the $i$-th component is higher than for any other component, and $H(C_i)$ is the entropy of these data. $H_{max}(C_i)$ denotes the theoretical maximum entropy of an individual component. Based on results from information theory, the maximum entropy is given by:

$$H_{max}(C_i) = \frac{1}{2} \log \big( (2\pi e)^n \, |\Sigma| \big)$$

where $\Sigma$ is the covariance matrix of $C_i$ and $n$ is the number of dimensions of the data cluster $C_i$. It is established in information theory that Gaussian variables have the maximum entropy among all variables with equal variance. Hence, a high individual fitness value $ef_i$ for the $i$-th kernel corresponds to a high probability that this component is truly Gaussian, and represents a better match between the probability distribution function and the data. Thus, the higher the entropy-based fitness value EFT, the better the approximation of the entire dataset.
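A compact sketch of the univariate split and merge operators and of the entropy-based fitness is given below. The paper does not specify how $H(C_i)$ is estimated from the data, so a simple histogram plug-in estimate is used here as an assumption; the merge rule is implemented as printed above, and all helper names are ours.

```python
import numpy as np

def split_component(alpha_i, mu_i, sigma_i):
    """Split one univariate component into two (k increases by 1)."""
    return [(alpha_i / 2.0, mu_i + sigma_i, sigma_i),
            (alpha_i / 2.0, mu_i - sigma_i, sigma_i)]

def merge_components(c1, c2):
    """Merge two univariate components (alpha, mu, sigma) into one (k decreases by 1)."""
    (a1, m1, s1), (a2, m2, s2) = c1, c2
    a = a1 + a2
    mu = (a1 * m1 + a2 * m2) / a
    # Variance of the merged component, following the merge rule as printed above.
    sigma = (a1 / a) * (s1 + (m1 - mu)) ** 2 + (a2 / a) * (s2 + (m2 - mu)) ** 2
    return (a, mu, sigma)

def entropy_estimate(data, bins=20):
    """Histogram plug-in estimate of the differential entropy H(C_i) of 1-D data."""
    hist, edges = np.histogram(data, bins=bins, density=True)
    p = hist * np.diff(edges)          # probability mass per bin
    nz = p > 0
    return float(-(p[nz] * np.log(hist[nz])).sum())

def max_entropy_1d(data):
    """Theoretical maximum entropy H_max(C_i) = 0.5 * log(2*pi*e*var) for n = 1."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(data))

def fitness_EFT(cluster_data, alphas):
    """Entropy-based fitness EFT = sum_i alpha_i * H(C_i) / H_max(C_i).

    cluster_data is a list holding the samples currently assigned to each component."""
    ef = [entropy_estimate(c) / max_entropy_1d(c) for c in cluster_data]
    return float(np.dot(alphas, ef))
```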

3.2 Genetic EM Algorithm

Table 1 depicts the proposed genetic EM algorithm. The algorithm starts with an initial number of components, which is empirically selected based on the size of the data. An individual is denoted by $\langle \alpha_i^j, \mu_i^j, \Sigma_i^j \rangle$, where $i$ represents the $i$-th component generated and satisfies $1 \le i \le k_j$; $k_j$ is the number of components; and $j$ is the generation number. Each individual represents a cluster of data according to the posterior probability. As a result, the initial population represents an initial partition of the data. A population denoted by $population_j$ corresponds to the set of new individuals yielded by the genetic operators during the $j$-th generation.

Table 1. Genetic EM Algorithm

Function: ECA_Mixture_Model(population) returns clustered data with the optimal number of clusters
Inputs: population and the corresponding dataset $X \sim \{x_n \mid n = 1, 2, \ldots, N\}$
Initialization: $j \leftarrow 0$; $population_j \leftarrow population$;
  $population_{optimal} \leftarrow EM(population_j)$; $EFT_{optimal} \leftarrow EFT_j$;
Repeat: $j \leftarrow j + 1$;
  Apply operators to $population_{optimal}$, yielding $population_j$;
  $population_j \leftarrow EM(population_j)$; compute $EFT_j$;
  If $EFT_j > EFT_{optimal}$, then $population_{optimal} \leftarrow population_j$, $EFT_{optimal} \leftarrow EFT_j$ and $k_{optimal} \leftarrow k_j$
Until: $EFT_{optimal} > th_3$ or $j \ge th_4$ or $k_j = 0$
Return $k_{optimal}$ and $p_j(i \mid x_n)$

We use $population_{optimal}$ and $k_{optimal}$ to represent the best individuals and the optimal number of components, respectively. New individuals are composed of $population_j$ and $population_{optimal}$. $EFT_j$ denotes the entropy-based fitness value associated with $population_j$, while $EFT_{optimal}$ refers to the current optimal fitness value during the evolution. $p_j(i \mid x_n)$ refers to the posterior probability that data instance $x_n$ belongs to the $i$-th cluster in the $j$-th generation. The algorithm iteratively combines EM steps with genetic operations; Table 2 illustrates the canonical EM sub-function.

Table 2. Canonical EM Algorithm Implementation

Sub_Function: EM(population) returns a population and the corresponding posterior probabilities $p_j(i \mid x_n)$
Inputs: a population of size $k_j$
Initialization: $r \leftarrow 0$; $population_r \leftarrow population$; compute the log-likelihood $L_r$;
Repeat: $r \leftarrow r + 1$ ($r$ is the number of EM iterations);
  Generate new posterior probabilities $p_j(i \mid x_n)$ using the E-step and a new group of components $\{(\alpha_i^r, \mu_i^r, \Sigma_i^r) \mid 1 \le i \le k_j\}$ using the M-step;
  $population_r \leftarrow \{\langle \alpha_1^r, \mu_1^r, \Sigma_1^r \rangle, \langle \alpha_2^r, \mu_2^r, \Sigma_2^r \rangle, \ldots, \langle \alpha_{k_j}^r, \mu_{k_j}^r, \Sigma_{k_j}^r \rangle\}$;
  Calculate the log-likelihood $L_r$ for $population_r$;
Until: $L_r - L_{r-1} < th_1$ or $r > th_2$
Return $population_r$ and $p_j(i \mid x_n)$

EM is first used to estimate the best parameters for an initial number of components. The genetic algorithm is then used to estimate the best number of components given the (best) parameters obtained after the EM steps. The outcome of the genetic algorithm is re-submitted to the EM algorithm, and the same two-stage process is repeated until the termination conditions are satisfied. Termination occurs when the number of generated components becomes 0, or when the specified maximum entropy fitness value or maximum number of iterations is reached. Thresholds $th_1$ and $th_2$ correspond to the termination conditions of the EM algorithm: $th_1$ is a measure of the absolute precision required by the algorithm; if the variation in log-likelihood between two consecutive EM steps is less than this value, the algorithm terminates. Threshold $th_2$ is the maximum number of EM iterations. Thresholds $th_3$ and $th_4$ correspond to the termination conditions of the genetic algorithm: $th_3$ is the maximum entropy fitness value and $th_4$ is the maximum number of iterations of the genetic algorithm.

According to Rudolph, a canonical genetic algorithm (CGA) will never converge to the global optimum, but "variants of CGAs that always maintain the best solution in the population, either before or after selection" converge to the global optimum [17]. Our algorithm ensures that the best population is always maintained after each selection and, at the same time, increases the diversity of solutions by randomly selecting the genetic operations. The selection process places our algorithm in the category of (1+1) evolutionary algorithms (EA) [6]. Based on our genetic EM algorithm, we can define a mixture-densities-based clustering algorithm, which is illustrated in Table 3. In the table, $C_m$ stands for the clustering results.

Table 3. Genetic EM based Clustering Algorithm

Call function ECA_Mixture_Model(population), which returns the optimal number of components and the posterior probabilities $p_j(i \mid x_n)$
$C_m = \emptyset$, $1 \le m \le k_{optimal}$
For $1 \le m \le k_{optimal}$, $1 \le n \le N$:
  If $p_{j-1}(m \mid x_n) = \max_m \big( p_{j-1}(m \mid x_n) \big)$, assign $x_n$ to $C_m$
Return $C_m$, $m = 1, 2, \ldots, k_{optimal}$
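The overall loop of Tables 1-3 could be approximated as follows. This sketch deliberately simplifies the paper's procedure: it uses scikit-learn's GaussianMixture for the EM sub-function and refits the model after each split or merge move instead of warm-starting EM from the transformed parameters, and it takes the fitness function as a caller-supplied argument (the placeholder below is not the paper's EFT). All names and defaults are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def genetic_em(X, k_init=30, p_merge=0.95, max_gen=500, fit_threshold=0.98,
               fitness=None, seed=0):
    """Elitist (1+1)-style loop over the number of mixture components:
    alternate EM fitting with random split/merge moves, keeping the best model."""
    rng = np.random.default_rng(seed)
    if fitness is None:
        # Placeholder score in (0, 1); the paper uses the entropy-based EFT instead.
        fitness = lambda gmm, data: 1.0 / (1.0 + np.exp(-gmm.score(data)))
    best_k = k_init
    best = GaussianMixture(n_components=best_k, random_state=seed).fit(X)
    best_fit = fitness(best, X)
    for _ in range(max_gen):
        # Merge move shrinks k by 1, split move grows it by 1 (cf. Section 3.1).
        k = max(best_k - 1, 1) if rng.random() < p_merge else best_k + 1
        cand = GaussianMixture(n_components=k, random_state=seed).fit(X)
        cand_fit = fitness(cand, X)
        if cand_fit > best_fit:          # elitist selection keeps the best population
            best, best_fit, best_k = cand, cand_fit, k
        if best_fit > fit_threshold or best_k <= 1:
            break
    labels = best.predict(X)             # hard assignment by highest posterior
    return best_k, labels, best
```

For example, genetic_em(X, k_init=30, p_merge=0.95) would mirror the probability setting used later in Section 4.3.3, although with the simplified EM sub-step and placeholder fitness noted above.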

4 Evaluation

To evaluate our approach, we use the genetic EM algorithm to cluster datasets. We conduct two sets of experiments with three different datasets to validate the clustering approach based on genetic EM. In the first experiment, we investigate the relationship between the genetic operation probabilities and the algorithm's convergence behaviour. In the second experiment, we compare our algorithm with five existing well-known clustering algorithms. The evaluation is discussed in more detail in the following.

4.1 Datasets

The first dataset is a synthetic dataset containing 300 data items. The data items were independently generated by 3 bi-dimensional normal distributions, $N_1(\mu_1, \Sigma_1)$, $N_2(\mu_2, \Sigma_2)$ and $N_3(\mu_3, \Sigma_3)$, with prior probabilities $\alpha_1 = \alpha_2 = \alpha_3 = 0.33$ and means $\mu_1 = [6, 3]^T$, $\mu_2 = [3, 6]^T$ and $\mu_3 = [2, 2]^T$. The corresponding covariance matrices are:

$$\Sigma_1 = \begin{bmatrix} 0.6 & 0.15 \\ 0.15 & 0.6 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 0.4 & 0.0 \\ 0.0 & 0.25 \end{bmatrix}, \qquad \Sigma_3 = \begin{bmatrix} 0.6 & 0.0 \\ 0.0 & 0.3 \end{bmatrix}$$

The second dataset is the well-known Iris dataset [3], which contains 3 clusters corresponding to different types of Iris plant, namely Versicolor, Virginica and Setosa. Each cluster contains 50 four-dimensional data items. The third dataset is the Pendigit dataset from [3], which contains handwritten digit images of the first 10 digits. We extract 100 data objects based on the first 5 digits (i.e., 0, 1, 2, 3 and 4). Each object is described by 16 features. Hence, the extracted dataset includes 5 clusters consisting of 16-dimensional data points.
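The synthetic dataset can be regenerated from the parameters above. The sketch below assumes an equal split of 100 points per component (consistent with the 0.33 priors and the total of 300 items) and an arbitrary random seed; neither choice is prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed

means = [np.array([6.0, 3.0]), np.array([3.0, 6.0]), np.array([2.0, 2.0])]
covs = [np.array([[0.60, 0.15], [0.15, 0.60]]),
        np.array([[0.40, 0.00], [0.00, 0.25]]),
        np.array([[0.60, 0.00], [0.00, 0.30]])]

# 300 two-dimensional points, 100 per Gaussian component.
X = np.vstack([rng.multivariate_normal(m, c, size=100) for m, c in zip(means, covs)])
true_labels = np.repeat([0, 1, 2], 100)
```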

4.2 Convergence Analysis

In order to evaluate the algorithm's effectiveness and efficiency, we studied three performance metrics, namely the Average number of Evaluations on Success (AES), the Success Rate (SR), and the Mean Best Fitness (MBF) [1]. AES measures the speed of optimization; more specifically, it measures the number of iterations required for the successful runs to converge to the global optimum. SR measures the number of runs in which the algorithm successfully converges to the optimum solution, which evaluates the reliability of the algorithm. When the algorithm is unsuccessful in some runs, MBF is used to measure how close the solutions come to the optimum; it is usually used to evaluate the prematurity of the solutions. In our empirical evaluation, MBF is set to zero if the algorithm converges to the global optimum; otherwise, MBF is calculated from the fitness values of the premature solutions and the best fitness value. In this case, the best fitness value is set to 1, since the upper bound of the fitness function is 1.

During the experiment, we studied 10 different sets of genetic operation probabilities, with merging probabilities varying from 0.30 to 0.99 and splitting probabilities ranging from 0.699 to 0.009. For each set of genetic operation probabilities, we ran the clustering algorithm 10 times. The maximum number of iterations was set to 500 and the fitness threshold to 0.98. During the evolution, the initial number of clusters is empirically set to 30.

Tables 4 and 5 summarize the performance results obtained over the 100 test runs with the synthetic dataset and the Iris dataset, respectively. For the synthetic dataset, the algorithm converges to the global optimum in all test runs with a merging probability of 0.70 or higher, identifying the exact number of clusters (i.e., 3) contained in the test dataset. With the Iris and Pendigit datasets, the algorithm converges to the global optimum in all test runs with a merging probability of 0.60 or higher. The exact numbers of clusters identified by our algorithm are 3 and 5 for the Iris and Pendigit datasets, respectively.

Table 4. Performance Evaluation with Synthetic Dataset
(pmr: merging probability; ps: splitting probability)

  pmr    ps      MBF (Avg./Std.)   AES (Avg./Std.)   SR
  0.99   0.009   0/0               27/1              10
  0.95   0.049   0/0               29/3              10
  0.90   0.099   0/0               35/4              10
  0.85   0.149   0/0               37/5              10
  0.80   0.199   0/0               44/13             10
  0.70   0.299   0/0               65/11             10
  0.60   0.399   0.001/0.003       181/133           9
  0.50   0.499   0.410/0.220       448/120           2
  0.40   0.599   0.013/0.032       500/0             0
  0.30   0.699   0.11/0.003        500/0             0

Table 5. Performance Evaluation with Iris Dataset
(pmr: merging probability; ps: splitting probability)

  pmr    ps      MBF (Avg./Std.)   AES (Avg./Std.)   SR
  0.99   0.009   0/0               29/1              10
  0.95   0.049   0/0               34/4              10
  0.90   0.099   0/0               35/3              10
  0.85   0.149   0/0               44/10             10
  0.80   0.199   0/0               49/9              10
  0.70   0.299   0/0               72/19             10
  0.60   0.399   0/0               120/45            10
  0.50   0.499   0.23/0.084        500/0             0
  0.40   0.599   0.242/0.00        500/0             0
  0.30   0.699   0.242/0.00        500/0             0
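For reference, the SR, AES and MBF figures in Tables 4 and 5 could be produced from per-run records along the following lines. The record fields and the exact MBF computation (the gap between a premature run's fitness and the upper bound 1) are our reading of the description above, not code from the paper.

```python
import numpy as np

def summarize_runs(runs, max_iters=500):
    """Aggregate the 10 runs of one probability setting into SR, AES and MBF.

    Each run is a dict such as {"success": True, "iters": 37, "fitness": 0.99};
    the field names are illustrative, not taken from the paper."""
    successes = [r for r in runs if r["success"]]
    SR = len(successes)
    if successes:
        iters = np.array([r["iters"] for r in successes], dtype=float)
        AES = (iters.mean(), iters.std())
    else:
        AES = (float(max_iters), 0.0)        # all runs hit the iteration cap
    # MBF contribution: 0 for successful runs, otherwise the gap between the
    # premature fitness and the upper bound 1 of the fitness function.
    gaps = np.array([0.0 if r["success"] else 1.0 - r["fitness"] for r in runs])
    MBF = (gaps.mean(), gaps.std())
    return SR, AES, MBF
```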

4.3 Comparison with Other Algorithms

In this section, we compare our algorithm with five existing well-known clustering algorithms, namely k-means, average linkage hierarchical clustering, complete linkage hierarchical clustering, single linkage hierarchical clustering, and k-medoids.

4.3.1 Algorithms Overview

We briefly review each of the selected algorithms in this section; a library-based setup for running these baselines is sketched after the k-medoids steps below.

The k-means algorithm is one of the best-known squared-error-based clustering algorithms. It is usually based on the Euclidean distance and therefore tends to generate hyperspherical clusters. The general process of k-means is as follows:

1. Place K points into the space represented by the objects being clustered; these points represent the initial group centroids.
2. Assign each object to the group with the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.

Hierarchical clustering groups data objects through a sequence of partitions, either from singleton clusters to a single cluster containing all individuals, or vice versa. It usually creates a tree of clusters: each node in the tree is linked to one or several child clusters, while sibling clusters share a common parent node. According to the method used to build the tree, hierarchical clustering approaches are divided into two categories, agglomerative and divisive. Agglomerative methods recursively merge clusters, while divisive methods start from a single cluster containing all the data and gradually split it into smaller clusters. The most widely used hierarchical approach is currently the agglomerative one, owing to its lower complexity compared with divisive methods. Depending on the similarity measure used, agglomerative hierarchical algorithms mainly include single linkage, complete linkage and average linkage hierarchical clustering. These agglomerative algorithms can be summarized as follows:

1. Start by assigning each data object to its own cluster, so that there are N clusters, each containing a single item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, reducing the number of clusters by one.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

The algorithm is single linkage-based if the similarity between two clusters is measured by the shortest distance between their respective members. It is complete linkage-based if the similarity between two clusters is the greatest distance from any member of one cluster to any member of the other. It is average linkage-based if the similarity between two clusters is measured by the average distance between their respective members.

The k-medoids algorithm is a variant of k-means that addresses k-means's sensitivity to noise and outliers. It uses real data points (medoids) as the cluster prototypes and thus avoids the effect of outliers. The general process of k-medoids is as follows:

1. Given the initial partition number k, randomly pick k instances as initial medoids.
2. Assign each data point to the nearest medoid x.
3. Calculate the objective function, that is, the sum of dissimilarities of all points to their nearest medoids.
4. Randomly select a point y, and swap x with y if the swap reduces the value of the objective function.
5. Repeat steps 2 to 4 until no change happens.
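As referenced above, the comparison algorithms are all available in standard libraries; one possible setup (our choice of libraries, not stated in the paper) is sketched below. A k-medoids implementation is not included in scikit-learn's core package, so it is only indicated in a comment.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def run_baselines(X, k):
    """Cluster X into k groups with the comparison algorithms of Section 4.3.1."""
    labels = {"kmeans": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)}
    dists = pdist(X)                        # condensed Euclidean distance matrix
    for method in ("single", "complete", "average"):
        Z = linkage(dists, method=method)   # agglomerative merge tree
        labels[method] = fcluster(Z, t=k, criterion="maxclust")
    # k-medoids (PAM) could be added via scikit-learn-extra's KMedoids or a small
    # swap loop implementing steps 1-5 above; omitted here for brevity.
    return labels
```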

4.3.2 Evaluation Metrics

In order to measure the goodness of our clustering results, we use three validity indices: the Dunn index [8], the Davies-Bouldin index [4] and the Rand index [14].

The Dunn index measures cluster compactness and separation. For any partition of clusters, where $c_i$ represents the $i$-th cluster of the partition, the Dunn validation index is calculated as:

$$D = \min_{1 \le i \le n} \left\{ \min_{1 \le j \le n,\, j \ne i} \left( \frac{d(c_i, c_j)}{\max_{1 \le k \le n} d(c_k)} \right) \right\}$$

where $d(c_i, c_j)$ is the inter-cluster distance between $c_i$ and $c_j$; $d(c_k)$ is the intra-cluster distance of cluster $c_k$; and $n$ is the number of clusters. The goal is to maximize the inter-cluster distances while minimizing the intra-cluster distances; the larger the value of $D$, the better the clustering result.

The Davies-Bouldin index is a function of the ratio of the within-cluster scatter to the between-cluster separation. It is given by:

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{i \ne j} \left( \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right)$$

where $n$ is the number of clusters, $S_n(Q_i)$ is the average distance of all objects in cluster $Q_i$ to the cluster centre, and $S(Q_i, Q_j)$ is the distance between the cluster centres. A small value of DB indicates that the clusters are compact and far from each other.

The Rand index measures the similarity between the resulting clustering structure C and the defined partition P of the original data; the higher the Rand value, the more similar C and P are. The Rand index is computed as:

$$R_s = \frac{CP + \bar{C}\bar{P}}{CP + C\bar{P} + \bar{C}P + \bar{C}\bar{P}}$$

where $CP$ is the number of data pairs belonging to the same cluster of C and to the same group of P; $C\bar{P}$ is the number of data pairs belonging to the same cluster of C and to different groups of P; $\bar{C}P$ is the number of data pairs belonging to different clusters of C and to the same group of P; and $\bar{C}\bar{P}$ is the number of data pairs belonging to different clusters of C and to different groups of P.
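The three indices can be computed directly from their definitions. In the sketch below, the inter-cluster distance $d(c_i, c_j)$ is taken as the minimum pairwise distance between members and the intra-cluster distance $d(c_k)$ as the cluster diameter; the paper does not state which variants it uses, so these are assumptions (scikit-learn also provides davies_bouldin_score and related metrics).

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn index: smallest inter-cluster distance over largest cluster diameter."""
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    diam = max(cdist(c, c).max() for c in clusters if len(c) > 1)
    inter = min(cdist(a, b).min() for a, b in combinations(clusters, 2))
    return inter / diam

def davies_bouldin(X, labels):
    """Davies-Bouldin index from within-cluster scatter and centre separation."""
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    centres = np.array([c.mean(axis=0) for c in clusters])
    scatter = np.array([np.linalg.norm(c - m, axis=1).mean()
                        for c, m in zip(clusters, centres)])
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((scatter[i] + scatter[j]) / np.linalg.norm(centres[i] - centres[j])
                     for j in range(n) if j != i)
    return total / n

def rand_index(labels_c, labels_p):
    """Rand index: fraction of data pairs on which clustering C and partition P agree."""
    labels_c, labels_p = np.asarray(labels_c), np.asarray(labels_p)
    agree = total = 0
    for a, b in combinations(range(len(labels_c)), 2):
        agree += ((labels_c[a] == labels_c[b]) == (labels_p[a] == labels_p[b]))
        total += 1
    return agree / total
```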

4.3.3 Cluster Validity Results

In the second experiment, we study cluster validity by comparing our algorithm with the five selected clustering algorithms. The merging probability and splitting probability are set to 0.95 and 0.049, respectively, and we provide the correct number of clusters to the selected clustering algorithms. Tables 6, 7 and 8 report the clustering validity results for the synthetic dataset, the Iris dataset, and the Pendigit dataset, respectively.

Table 6. Clustering Validity with Synthetic Dataset

  Approach                        Rand    Dunn    DB
  Our algorithm                   0.941   1.786   1.092
  K-means                         0.922   1.547   1.114
  Average linkage hierarchical    0.761   1.167   1.247
  Complete linkage hierarchical   0.833   1.145   1.303
  Single linkage hierarchical     0.739   0.752   1.281
  K-medoids                       0.452   0.517   1.214

Table 7. Clustering Validity with Iris Dataset

  Approach                        Rand    Dunn    DB
  Our algorithm                   0.923   2.213   1.012
  K-means                         0.874   1.872   1.050
  Average linkage hierarchical    0.779   1.139   1.117
  Complete linkage hierarchical   0.794   1.082   1.073
  Single linkage hierarchical     0.777   0.682   1.472
  K-medoids                       0.740   0.907   1.457

Table 8. Clustering Validity with Pendigit Dataset

  Approach                        Rand    Dunn    DB
  Our algorithm                   0.934   2.105   1.927
  K-means                         0.930   0.782   1.738
  Average linkage hierarchical    0.887   0.886   1.511
  Complete linkage hierarchical   0.799   1.000   1.409
  Single linkage hierarchical     0.619   0.826   1.252
  K-medoids                       0.792   0.928   1.770

Our clustering algorithm achieves the highest Rand and Dunn values on all three datasets, which means that the inter-cluster distances are maximized, the intra-cluster distances are minimized, and the obtained clustering structure is closest to the original partition of the dataset. Moreover, our algorithm obtains the smallest DB value on the synthetic and Iris datasets, which indicates that the clusters are compact and far from each other. The evaluation results show that our clustering algorithm not only estimates the number of clusters accurately, but also generates a better clustering structure than the compared algorithms.

Table 9 compares the complexity of our genetic EM based algorithm with that of the considered well-known clustering algorithms. In the table, N stands for the number of data objects; K is the number of clusters; d is the number of iterations in k-means and k-medoids; d1 is the number of genetic iterations and d2 is the number of EM iterations. Compared with the hierarchical algorithms, our algorithm has lower computational complexity when clustering large-scale datasets; moreover, hierarchical algorithms are sensitive to outliers and noisy data and cannot correct previous assignments. Compared with the k-means and k-medoids algorithms, our clustering method has a somewhat higher complexity; however, k-means and k-medoids are prone to converging to local optima, and determining the number of initial partitions k is not easy.

Table 9. Comparison of Algorithm Complexity

  Clustering Algorithm            Complexity
  K-means clustering              O(NKd)
  Single linkage hierarchical     O(N^2)
  Complete linkage hierarchical   O(N^2)
  Average linkage hierarchical    O(N^2)
  K-medoids clustering            O(NKd)
  Genetic EM clustering           O(NKd1d2)

5 Conclusions

We have proposed in this paper a new genetic EM algorithm for learning the optimal number of components of mixture models. Based on this genetic EM algorithm, we have proposed a new clustering method belonging to the category of mixture-densities-based clustering approaches. Three datasets were used to validate the clustering method, namely a synthetic dataset, the standard Iris dataset and the Pendigit dataset. The empirical evaluation shows that, with suitably chosen genetic operation probabilities, our algorithm is able to converge to globally optimal solutions (i.e., the optimal number of clusters or components for a set of data). Moreover, the clustering validity results confirm that the clustered data have a high similarity with the original partition of the dataset.

References
[1] T. Back, A.E. Eiben and N.A.L. Van Der Vaart, "An empirical study on GAs without parameters", Lecture Notes in Computer Science, Vol. 1917, Pages: 315-324, Springer-Verlag, 2000.
[2] P. Berkhin, "Survey of clustering data mining techniques", Accrue Software, 2002.
[3] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Sciences, 1998.
[4] D.L. Davies and D.W. Bouldin, "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1, No. 2, Pages: 224-227, 1979.
[5] A.P. Dempster, N.M. Laird and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)", Journal of the Royal Statistical Society B, Vol. 39, Pages: 1-38, 1977.
[6] S. Droste, T. Jansen and I. Wegener, "On the analysis of the (1+1) evolutionary algorithm", Technical Report, ISSN 1433-3325, Department of Computer Science, University of Dortmund, Germany, 2000.
[7] R.C. Dubes, "Cluster analysis and related issues", in Handbook of Pattern Recognition and Computer Vision, Eds. C.H. Chen, L.F. Pau and P.S.P. Wang, World Scientific Publishing Company, Pages: 3-32, 1993.
[8] J.C. Dunn, "Well separated clusters and optimal fuzzy partitions", Journal of Cybernetics, Vol. 4, Pages: 95-104, 1974.
[9] B.S. Everitt, "Unresolved problems in cluster analysis", Biometrics, Vol. 35, Pages: 169-181, 1979.
[10] J.M.N. Figueiredo and A.K. Jain, "Unsupervised selection and estimation of finite mixture models", in Proceedings of the International Conference on Pattern Recognition (ICPR 2000), Pages: 87-90, 2000.
[11] A.K. Jain, M.N. Murty and P.J. Flynn, "Data clustering: a review", ACM Computing Surveys, Vol. 31, No. 3, Pages: 264-296, 1999.
[12] G. McLachlan, "On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture", Applied Statistics, Vol. 36, Pages: 318-324, 1987.
[13] A. Penalver and J.M.S.F. Escolano, "An entropy maximization approach to optimal model selection in Gaussian mixtures", in Proceedings of the 8th Iberoamerican Congress on Pattern Recognition (CIARP 2003), Havana, Cuba, November 26-29, Pages: 432-439, 2003.
[14] W. Rand, "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, Pages: 846-850, 1971.
[15] S. Richardson and P.J. Green, "On Bayesian analysis of mixtures with an unknown number of components (with discussion)", Journal of the Royal Statistical Society: Series B, Vol. 59, No. 4, Pages: 731-792, 1997.
[16] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K., 1996.
[17] G. Rudolph, "Convergence analysis of canonical genetic algorithms", IEEE Transactions on Neural Networks, Vol. 5, No. 1, Pages: 96-101, 1994.
[18] D. Titterington, A. Smith and U. Makov, Statistical Analysis of Finite Mixture Distributions, John Wiley & Sons, New York, 1985.
[19] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling", IEEE Transactions on Systems, Man, and Cybernetics, Part A, Vol. 29, No. 4, Pages: 393-399, 1999.
[20] R. Xu and D. Wunsch, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, Pages: 645-678, May 2005.
