Procedia Technology 10 (2013) 443–449, doi:10.1016/j.protcy.2013.12.381
International Conference on Computational Intelligence: Modeling, Techniques and Applications (CIMTA) 2013
Clustering Ensemble: A Multiobjective Genetic Algorithm based Approach

Sujoy Chatterjee∗, Anirban Mukhopadhyay
Department of Computer Science and Engineering, University of Kalyani, Kalyani - 741235, India
∗ Corresponding author. E-mail address: [email protected]
Abstract

Clustering ensemble refers to the problem of obtaining a final clustering of a data set from a set of input clustering solutions. In this article, the clustering ensemble problem is modeled as a multiobjective optimization problem, and a multiobjective evolutionary algorithm is used to solve it. The proposed multiobjective evolutionary clustering ensemble algorithm (MOECEA) evolves a clustering solution from the input clusterings by optimizing two criteria simultaneously. The first objective is to maximize the similarity of the resultant clustering with all the input clusterings, where the similarity between two clustering solutions is computed using the adjusted Rand index. The second objective is to minimize the standard deviation of these similarity scores, which prevents the evolved clustering solution from being very similar to one of the input clusterings and very dissimilar to the others. The performance of the proposed algorithm has been compared with that of other well-known existing cluster ensemble algorithms on a number of artificial and real-life data sets.

© 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the University of Kalyani, Department of Computer Science & Engineering.

Keywords: Clustering Ensemble; Validity indices; Multiobjective Genetic Algorithm; Pareto optimality.
1. Introduction

Unsupervised classification has drawn considerable research attention in the fields of data mining, image processing and pattern recognition. Clustering [1] is used to group the elements of a data set according to their similarities: a good clustering is one in which the elements lying in the same cluster are highly similar to each other in some sense, while elements from different clusters are dissimilar. When several clustering algorithms are applied to the same data set, they can generate different clustering results. These differences arise because each algorithm emphasizes different aspects of the input data: every clustering algorithm implicitly or explicitly assumes a particular model of the data, and a mismatched model may produce poor clustering results. Clustering ensemble [2–4] algorithms integrate such clustering solutions to achieve a single stable solution.
It is very hard to identify the optimal clustering solution from a set of clustering solutions. When different clustering solutions are generated from the same data set, complete prior knowledge of the data distribution is not available, and because different elements have different characteristics, different clustering algorithms group them in different ways. For example, the K-means algorithm partitions the data so that the total squared error with respect to each cluster center is minimized, while graph-based partitioning divides the corresponding graph into K parts based on minimum edge-weight cuts. Hence, it is very hard to conclude which clustering result is better. The objective of an ensemble method is therefore to combine the strengths of many individual clustering algorithms. This is the focus of research on clustering ensembles: seeking a combination of multiple partitions that provides an improved overall clustering of the given data. Clustering ensembles can go beyond what is typically achieved by a single clustering algorithm in several respects, such as robustness, novelty, stability and confidence estimation. Therefore, it is useful to obtain a final clustering solution through a consensus among the input clusterings.

In this article, we pose the clustering ensemble problem as an optimization problem whose goal is to obtain a clustering solution that is roughly similar to all the input clustering solutions and is thus expected to reflect a good consensus among them. The problem can readily be modeled as a multiobjective optimization (MOO) problem [5] in which two objectives are optimized simultaneously. The first objective is to maximize the similarity of the resultant clustering with all the input clusterings, where the similarity between two clustering solutions is computed using the adjusted Rand index. The second objective is to minimize the standard deviation of these similarity scores, which prevents the evolved clustering solution from being very similar to one of the input clusterings and very dissimilar to the others. In MOO, the search is performed over a number of, often conflicting, objective functions. Single objective optimization usually yields a single best solution, whereas in MOO the final solution set contains a number of Pareto-optimal solutions, none of which can be improved on any one objective without degrading another. Non-dominated Sorting Genetic Algorithm-II (NSGA-II) [6], a popular elitist MOO algorithm, is used as the underlying optimization strategy, with the adjusted Rand index (ARI) [7] and the standard deviation measure as the objective functions. The proposed multiobjective evolutionary clustering ensemble algorithm (MOECEA) has been applied to a number of artificial and real-life data sets, and its performance has been compared with that of several well-known clustering ensemble techniques to establish its superiority.
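As a point of reference for the Pareto-optimality notion used throughout, the following is a minimal sketch of a dominance test for the two objectives of MOECEA (mean ARI, to be maximized, and standard deviation, to be minimized); the function and variable names are ours, not part of the paper.

```python
def dominates(a, b):
    """Return True if solution a Pareto-dominates solution b.

    Each solution is a pair (mean_ari, std_dev):
    mean_ari is maximized, std_dev is minimized.
    (Illustrative helper; names are ours, not from the paper.)
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

# Higher mean ARI with lower deviation dominates.
assert dominates((0.80, 0.02), (0.75, 0.05))
# A trade-off (better ARI, worse deviation) is incomparable.
assert not dominates((0.80, 0.06), (0.75, 0.05))
```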
2. Proposed Multiobjective Clustering Ensemble Technique

This section describes the use of NSGA-II [6] for evolving a set of near-Pareto-optimal ensemble clustering solutions. The proposed technique is described in detail below.
2.1. Encoding of Chromosomes

Each chromosome is a sequence of integers representing the cluster labels of the n data points, with at most K distinct labels. A chromosome is denoted as {r1, r2, ..., rn}, where ri is the cluster label of the ith data point. The same label may appear at different positions of a chromosome, since all points of a cluster share one label. For example, let two chromosomes be chromosome1: {1,1,1,2,2,2,3,3} and chromosome2: {1,2,2,2,2,3,3,3}. Under this encoding, for the first chromosome the objects {1, 2, 3} are in the first cluster, objects {4, 5, 6} are in the second cluster, and objects {7, 8} are in the third cluster, so the number of clusters is 3.
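A minimal sketch of this label-vector encoding (helper name is ours, not from the paper):

```python
from collections import defaultdict

def clusters_from_chromosome(chrom):
    """Group 1-based object indices by their encoded cluster label."""
    groups = defaultdict(list)
    for obj, label in enumerate(chrom, start=1):
        groups[label].append(obj)
    return dict(groups)

chromosome1 = [1, 1, 1, 2, 2, 2, 3, 3]
print(clusters_from_chromosome(chromosome1))
# {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8]}
```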
2.2. Initial Population

The initial population contains the whole set of input clusterings for which an ensemble is to be generated. In addition, some random clustering solutions are included to avoid any bias toward the inputs.
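A sketch of this initialization, assuming the input clusterings are given as label vectors of equal length (function and parameter names are ours):

```python
import random

def initial_population(input_clusterings, n_random, k_max, rng=random.Random(0)):
    """Seed the population with all input clusterings plus random label vectors."""
    n = len(input_clusterings[0])
    population = [list(c) for c in input_clusterings]
    for _ in range(n_random):
        population.append([rng.randint(1, k_max) for _ in range(n)])
    return population
```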
2.3. Selection

The selection process selects chromosomes for subsequent breeding, guided by the survival-of-the-fittest principle of natural genetic systems. In binary tournament selection, several “tournaments” are run among a few individuals chosen randomly from the population. In the context of multiobjective ensemble clustering, however, selection is based on the crowded binary tournament strategy used in NSGA-II, which compares individuals first by non-domination rank and then by crowding distance.
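A sketch of the crowded comparison at the heart of this tournament, assuming each individual carries a precomputed non-domination rank and crowding distance as in standard NSGA-II (the class and attribute names are ours):

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    labels: list          # chromosome, as in Section 2.1
    rank: int = 0         # non-domination front index (0 = best)
    crowding: float = 0.0 # NSGA-II crowding distance

def crowded_tournament(pop, rng=random.Random(0)):
    """Pick two random individuals; prefer lower rank, then larger crowding distance."""
    a, b = rng.sample(pop, 2)
    if a.rank != b.rank:
        return a if a.rank < b.rank else b
    return a if a.crowding >= b.crowding else b
```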
2.4. Crossover

Crossover is a probabilistic process that exchanges information between two parent chromosomes to generate two child chromosomes. In this article a fixed crossover probability kc is used. In traditional GA-based algorithms, crossover is generally single-point or multi-point. In a cluster ensemble technique, however, this type of crossover may distort the population and hamper convergence toward the optimal solution, because equivalent clusterings can carry different labels. For example, let the two chromosomes be chromosome1: {1,1,1,2,2,3} and chromosome2: {3,3,3,1,1,2}. These two chromosomes represent the same clustering: in both cases the first three objects form one cluster, the 4th and 5th objects form another, and the 6th object forms the last one. Therefore they have the same fitness value. But if a single-point crossover is performed at the 5th position, the resulting labelings are chromosome1: {1,1,1,2,1,2} and chromosome2: {3,3,3,1,2,3}. As can be seen, in the first child the number of clusters has changed. To avoid this, our approach uses a bipartite graph-based relabeling method, as follows. Let the two chromosomes be chromosome1: {1,3,1,2,1,2,3,1,3} and chromosome2: {2,2,2,3,3,3,1,1,1}. We calculate the dissimilarity between two clusters of the two chromosomes using equation (1):
$$\mathrm{dis}(C_i, C_j) = \frac{1}{2}\left(\frac{|C_i| - |C_i \cap C_j|}{|C_i|} + \frac{|C_j| - |C_i \cap C_j|}{|C_j|}\right) \qquad (1)$$
Here the dissimilarity between cluster i of chromosome1 and cluster j of chromosome2 is calculated, where |Ci| denotes the size of cluster Ci and |Ci ∩ Cj| the number of objects common to both clusters. For example, the dissimilarity between cluster 1 of chromosome1 and cluster 1 of chromosome2 is dis(C1, C1) = (3/4 + 2/3)/2 = 0.708: in chromosome1, objects {1,3,5,8} are in the first cluster, whereas in chromosome2 objects {7,8,9} are in the first cluster, so only object 8 is common to both, and the cluster sizes are 4 and 3, respectively. After computing the dissimilarity matrix we construct a bipartite graph based on these dissimilarity scores. For our example, the bipartite graph has 3 vertices in each vertex set: the left-hand set (for chromosome1) and the right-hand set (for chromosome2), with each vertex representing one cluster encoded in the corresponding chromosome. From this bipartite graph we derive a replacement matrix, which stores the final labels of chromosome2 expressed in terms of chromosome1. The steps for constructing the replacement matrix are as follows:

1. Search for the edge with minimum weight in the graph.
2. Store the two vertices corresponding to that edge.
3. After storing that pair, remove all edges incident upon the selected vertex of the right-hand side set.
4. Repeat the above steps for the remaining edges.
The replacement matrix records how the labels of chromosome2 are replaced by the labels of chromosome1. Therefore, when two chromosomes participate in crossover, only the labeling of chromosome2 is changed before the exchange. This crossover operation ensures that two parent chromosomes representing the same solution are not affected by the crossover. Otherwise, i.e., for two parent chromosomes that do not represent the same solution, the same relabeling procedure is applied and two new child chromosomes are generated by exchanging their information.
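The following sketch reflects our reading of this procedure (helper names are ours): it computes the equation-(1) dissimilarities, performs the greedy bipartite matching, relabels chromosome2, and only then exchanges genetic material; the final single-point exchange is an assumption, since the text does not specify the exchange mechanism after relabeling.

```python
import random

def cluster_members(chrom):
    """Map each cluster label to the set of (0-based) object positions."""
    members = {}
    for pos, label in enumerate(chrom):
        members.setdefault(label, set()).add(pos)
    return members

def dissimilarity(ci, cj):
    """Equation (1): average fraction of each cluster not shared with the other."""
    common = len(ci & cj)
    return ((len(ci) - common) / len(ci) + (len(cj) - common) / len(cj)) / 2

def relabel_to_match(chrom1, chrom2):
    """Greedily pair clusters of chrom2 with the most similar clusters of chrom1
    (smallest equation-(1) weight first) and rewrite chrom2 with chrom1's labels."""
    m1, m2 = cluster_members(chrom1), cluster_members(chrom2)
    edges = sorted((dissimilarity(c1, c2), l1, l2)
                   for l1, c1 in m1.items() for l2, c2 in m2.items())
    replacement, used1, used2 = {}, set(), set()
    for _, l1, l2 in edges:
        if l1 not in used1 and l2 not in used2:
            replacement[l2] = l1
            used1.add(l1)
            used2.add(l2)
    return [replacement.get(label, label) for label in chrom2]

def crossover(parent1, parent2, rng=random.Random(0)):
    """Relabel parent2 against parent1, then do an ordinary single-point exchange."""
    p2 = relabel_to_match(parent1, parent2)
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + p2[cut:], p2[:cut] + parent1[cut:]

# Example from the text: after relabeling, chromosome2 adopts chromosome1's labels.
c1 = [1, 3, 1, 2, 1, 2, 3, 1, 3]
c2 = [2, 2, 2, 3, 3, 3, 1, 1, 1]
print(relabel_to_match(c1, c2))  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```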
2.5. Mutation

Each chromosome undergoes mutation with a very small mutation probability Mp. The mutation operator adds or subtracts a small quantity to the label at a chosen position; the resulting float value is then rounded to the nearest integer to yield a valid label.

2.6. Choice of Objectives

Our algorithm uses two objective functions. First, the similarities between the reference (encoded) clustering and the input clustering solutions are calculated, and the sum of these similarities divided by the number of clustering solutions is taken as the first objective. The similarity between two clustering solutions is measured by the adjusted Rand index. To prevent the encoded clustering solution from becoming very similar to just one of the input clusterings, which would inflate the first objective, the standard deviation of the similarity scores between the encoded solution and the input clusterings is used as the second objective. Therefore the first objective is maximized whereas the second one is minimized.

2.7. Selecting a Solution from the Non-dominated Set

In the final generation, the multiobjective clustering method produces a near-Pareto-optimal non-dominated set of solutions, from which a single solution must be chosen. The optimal solution is selected by analyzing the knee region of the non-dominated front. The “knee” is formed by those solutions of the Pareto-optimal front where a small improvement in one objective would lead to a large deterioration in the other. In our algorithm, the most promising solution is therefore chosen from the knee region, where a small improvement in the first objective would cause a large deterioration in the second objective value.
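A sketch of both objectives and a simple knee pick, assuming scikit-learn's adjusted_rand_score for the ARI; the slope-based knee heuristic shown is our own reading of the criterion above, as the paper does not spell out an exact formula:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def objectives(candidate, input_clusterings):
    """Objective 1: mean ARI to the inputs (maximize).
    Objective 2: standard deviation of those ARIs (minimize)."""
    scores = np.array([adjusted_rand_score(candidate, c) for c in input_clusterings])
    return scores.mean(), scores.std()

def pick_knee(front):
    """Choose the point where gaining mean ARI starts costing the most deviation.
    `front` is a list of (mean_ari, std_dev) pairs on the non-dominated front;
    we pick the interior point with the sharpest drop in gain-per-cost slope."""
    front = sorted(front)  # ascending mean ARI
    best_idx, best_drop = 0, -np.inf
    for i in range(1, len(front) - 1):
        left = (front[i][0] - front[i-1][0]) / (front[i][1] - front[i-1][1] + 1e-12)
        right = (front[i+1][0] - front[i][0]) / (front[i+1][1] - front[i][1] + 1e-12)
        if left - right > best_drop:
            best_drop, best_idx = left - right, i
    return front[best_idx]
```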
3. Experimental Design and Results

In this section, we first describe the data sets used in our experiments. We present experiments on various real-life as well as artificial data sets to evaluate the performance of the proposed algorithm. The algorithm is compared with three well-known existing cluster ensemble algorithms, namely CSPA, HGPA and MCLA, and with the single-objective version of the proposed algorithm. The adopted performance metrics are the adjusted Rand index (ARI), Minkowski score (MS), Silhouette index and other cluster validity indices. The single-objective clustering ensemble algorithm optimizes only the first objective function of MOECEA, i.e., the average similarity of the encoded solution to the input clustering solutions. The input clustering solutions are generated by random-subspace clustering using K-means and other clustering algorithms. Experiments were performed in MATLAB 2008a on an Intel(R) CPU 1.6 GHz machine with 1 GB of RAM running Windows XP Professional.

3.1. Data sets

Three real-life data sets are used in the experiments. A short description of these data sets in terms of the number of data points, dimensions and the number of clusters is provided in Table 1; all three are obtained from the UCI Machine Learning Repository. The description of the artificial data set is given in Table 2.

3.2. Performance Metrics

Some external and internal cluster validity indices are used as the performance metrics. External indices, such as the adjusted Rand index (ARI) [7], Rand index (RI) [8], Minkowski score (MS) [9], classification accuracy (CA) [10], Mirkin's index (MI) and Hubert's index (HI) [11], evaluate the performance of the algorithms with respect to the true clustering of the data sets. Larger values of ARI, RI, CA and HI, and smaller values of MS and MI, indicate better clustering.
On the other hand, internal validity indices, such as the DB index [12], Dunn index [13] and Silhouette index [14], evaluate the clustering solutions objectively, without reference to the true clustering. Among these, a smaller value of the DB index and larger values of the Dunn and Silhouette indices indicate better clustering.
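For reference, two of the internal indices named above are directly available in scikit-learn; the sketch below uses synthetic data of our own making (the paper's MATLAB implementation is not shown):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Three well-separated 2-D blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 4.0, 8.0)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:", silhouette_score(X, labels))          # larger is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # smaller is better
```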
Table 1. Description of real-life data sets

Data set    Instances    Number of classes    Number of attributes
Wine        178          3                    13
Iris        150          3                    3
Seed        210          3                    7
Table 2. Description of artificial data sets

Data set    Instances    Number of classes    Number of attributes
R15         600          15                   2
3.3. Parameter Settings

The number of clusters is fixed for each data set. For the proposed algorithm, the crossover rate is 0.9, the mutation rate is 0.01 and the population size is 40.

3.4. Results

Tables 3-6 report the performance metric values obtained by the different cluster ensemble algorithms for the four data sets, respectively. It is evident from the tables that, although in a few cases some of the other algorithms perform slightly better, in most cases the proposed multiobjective algorithm provides consistently good performance. Moreover, it also outperforms its single-objective version, which demonstrates the utility of the multiobjective framework underlying the proposed algorithm.
Table 3. Performance metric values for Wine data set

Algorithm    Silhouette  Adj Rand  Rand    MI      HI      CA      DB      Dunn    Minkowski
Proposed     0.1775      0.7040    0.8681  0.1319  0.7362  0.8933  1.3639  0.5500  0.6301
CSPA         0.1385      0.5481    0.7990  0.2010  0.5980  0.8258  1.6390  0.4667  0.7809
HGPA         -0.0721     0.2040    0.6459  0.3541  0.2918  0.6067  5.5503  0.1586  1.0361
MCLA         0.1383      0.6360    0.8378  0.1622  0.6756  0.8652  1.8400  0.3902  0.6987
Single Obj   -0.4496     0.7161    0.8742  0.1258  0.7485  0.9045  0.4869  0       0.6229
3.5. Pareto Fronts for Different Data Sets

For the purpose of illustration, Figs. 1 and 2 show the final non-dominated fronts obtained by the proposed multiobjective cluster ensemble algorithm for the Wine, Iris, Seed and R15 data sets, respectively. The figures also show the solutions obtained by the other algorithms in the same objective space. Moreover, the selected knee solutions (denoted as Multi-objective) are also marked in the figures.
Table 4. Performance metric values for Iris data set

Algorithm    Silhouette  Adj Rand  Rand    MI      HI      CA      DB      Dunn    Minkowski
Proposed     0.7345      0.7592    0.8923  0.1077  0.7845  0.9067  0.5805  2.3270  0.5577
CSPA         0.6961      0.7415    0.8859  0.1141  0.7718  0.9000  0.6206  2.1600  0.5889
HGPA         0.6320      0.6808    0.8590  0.1410  0.7179  0.8733  0.6988  1.6246  0.6538
MCLA         0.7127      0.7156    0.8737  0.1263  0.7475  0.8867  0.6102  2.2973  0.6129
Single Obj   0.5642      0.7079    0.8709  0.1291  0.7417  0.8867  0.1851  0       0.6247

Table 5. Performance metric values for Seed data set

Algorithm    Silhouette  Adj Rand  Rand    MI      HI      CA      DB      Dunn    Minkowski
Proposed     0.5877      0.6348    0.8383  0.1617  0.6766  0.8619  0.5637  2.0373  0.6984
CSPA         0.5769      0.6687    0.8535  0.1465  0.7069  0.8762  0.8297  1.9937  0.6663
HGPA         0.1349      0.2589    0.6612  0.3388  0.3224  0.6333  2.3407  0.5560  0.9519
MCLA         0.6233      0.6303    0.8352  0.1648  0.6704  0.8524  0.7763  1.9978  0.6957
Single Obj   0.6261      0.6409    0.8398  0.1602  0.6796  0.8571  0.6350  2.2266  0.6848
Table 6. Performance metric values for R15 data set

Algorithm    Silhouette  Adj Rand  Rand    MI          HI      CA      DB      Dunn    Minkowski
Proposed     0.6267      0.7610    0.9635  0.0365      0.9270  0.7317  0.4753  0       0.7487
CSPA         0.8993      0.9964    0.9996  4.3962e-04  0.9991  0.9983  0.2912  3.9081  0.0822
HGPA         0.4503      0.7469    0.9692  0.0308      0.9384  0.8350  2.6545  0.0723  0.6880
MCLA         0.4922      0.9928    0.9991  8.7924e-04  0.9982  0.9967  0.2827  3.9081  0.1162
Single Obj   0.6841      0.8638    0.9813  0.0187      0.9626  0.8633  0.3701  0       0.5357

Fig. 1. (a) Pareto front for Wine data set; (b) Pareto front for Iris data set
Fig. 2. (a) Pareto front for Seed data set; (b) Pareto front for R15 data set
4. Conclusion

In this article, a multiobjective evolutionary cluster ensemble algorithm (MOECEA) has been proposed within the framework of a popular multiobjective genetic algorithm, NSGA-II. The objectives are to maximize the similarity of the evolved ensemble clustering solution with the input clustering solutions while minimizing the standard deviation of these similarities in order to avoid any bias. The performance of the proposed algorithm has been compared with that of other existing clustering ensemble algorithms on several real-life and artificial data sets. The results demonstrate the utility of the proposed technique over the existing approaches.

References

[1] Jain, A.K., Dubes, R.C. Data clustering: A review. ACM Computing Surveys 1999;31.
[2] Strehl, A., Ghosh, J. Cluster ensembles - a knowledge reuse framework for combining partitionings. In: Proc. National Conference on Artificial Intelligence (AAAI). 2002, p. 93-98.
[3] Ghaemi, R., Sulaiman, M., Ibrahim, H., Mustapha, N. A review: accuracy optimization in clustering ensembles using genetic algorithms. Artificial Intelligence Review 2011;35(4):287-318.
[4] Fred, A.L.N., Jain, A.K. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005;27(6):835-850.
[5] Coello, C. A comprehensive survey of evolutionary-based multiobjective optimization techniques. Knowledge and Information Systems 1999;1(3):129-156.
[6] Deb, K., Pratap, A., Agarwal, S., Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182-197.
[7] Yeung, K.Y., Ruzzo, W.L. An empirical study on principal component analysis for clustering gene expression data. Bioinformatics 2001;17(9):763-774.
[8] Rand, W.M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971;66(336):846-850.
[9] Ben-Hur, A., Guyon, I. Detecting stable clusters using principal component analysis. Methods in Molecular Biology 2003;224:159-182.
[10] Bandyopadhyay, S., Maulik, U., Mukhopadhyay, A. Multiobjective genetic clustering for pixel classification in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2007;45(5):1506-1511.
[11] Hubert, L., Arabie, P. Comparing partitions. Journal of Classification 1985;2(1):193-218.
[12] Davies, D.L., Bouldin, D.W. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1979;1:224-227.
[13] Dunn, J.C. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 1974;4:95-104.
[14] Rousseeuw, P. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987;20:53-65.