Artif Intell Rev DOI 10.1007/s10462-010-9195-5

A review: accuracy optimization in clustering ensembles using genetic algorithms

Reza Ghaemi · Nasir bin Sulaiman · Hamidah Ibrahim · Norwati Mustapha

© Springer Science+Business Media B.V. 2010

Abstract  The clustering ensemble has emerged as a prominent method for improving the robustness, stability, and accuracy of unsupervised classification solutions. It combines multiple partitions generated by different clustering algorithms into a single clustering solution. Genetic algorithms are known as methods with a high ability to solve optimization problems, including clustering. To date, significant progress has been made in finding consensus clusterings that yield better results than the existing clusterings. This paper presents a survey of genetic algorithms designed for clustering ensembles. It begins with an introduction to clustering ensembles and clustering ensemble algorithms. Subsequently, it describes a number of suggested genetic-guided clustering ensemble algorithms, in particular their genotypes, fitness functions, and genetic operations. Next, the clustering accuracies of the genetic-guided clustering ensemble algorithms are compared. The paper concludes that using genetic algorithms in clustering ensembles improves clustering accuracy, and it identifies open questions for future research.

Keywords  Accuracy · Clustering ensemble · Genetic algorithms · Unsupervised classification

R. Ghaemi (B)
CE Department, Islamic Azad University, Quchan Branch, Tehran, Iran
e-mail: [email protected]

R. Ghaemi · N. b. Sulaiman · H. Ibrahim · N. Mustapha
Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia (UPM), Selangor, Malaysia

N. b. Sulaiman e-mail: [email protected]
H. Ibrahim e-mail: [email protected]
N. Mustapha e-mail: [email protected]


1 Introduction

The exploratory nature of clustering tasks demands efficient methods that benefit from combining the strengths of many individual clustering algorithms. This is the focus of research on clustering ensembles: seeking a coordinated whole from multiple partitions that provides an improved overall clustering of the given data (Topchy et al. 2004a, 2005). High robustness, accuracy and stability are the most important characteristics of clustering ensembles (Strehl and Ghosh 2002). Clustering ensembles are able to achieve results beyond what is typically achieved by a single clustering algorithm in several respects:

• Robustness: better average performance across domains and datasets (Strehl and Ghosh 2002).
• Novelty: a combined solution unattainable by any single clustering algorithm.
• Stability and confidence estimation: clustering solutions with lower sensitivity to noise, outliers, or sampling variations; clustering uncertainty can be assessed from ensemble distributions (Strehl and Ghosh 2002).
• Parallelization and scalability: parallel clustering of data subsets with subsequent combination of results, and the ability to integrate solutions from multiple distributed sources of data or attributes (features) (Topchy et al. 2004a, 2005).

In classical clustering, different clustering algorithms, or even different runs of the same algorithm, may produce different partitions of the same dataset. The partitions produced are highly influenced by the validity criterion adopted by each algorithm. Clustering algorithms have several disadvantages. Among them is that clustering criteria such as the minimization of the within-cluster variation are usually high-dimensional, nonlinear and multi-modal functions with many locally optimal clustering solutions. The commonly used hill-climbing search methods only guarantee a locally optimal clustering solution. Moreover, the traditional recombination operators of genetic algorithms suffer from clustering invalidity and context insensitivity (Falkenauer 1994; Jones and Beltramo 1991). These lead to the disruption of good building blocks and thus significantly degrade the search capability of genetic algorithms. Another problem associated with genetic-guided clustering algorithms is their slow convergence (Krishna and Murty 2002). A popular approach to speed up the convergence of genetic-guided clustering algorithms is the one-step K-means operator (Krishna and Murty 2002). However, the one-step K-means operator may restrict the genetic algorithm's search capability (Sheng et al. 2004).

The disadvantages of clustering algorithms have motivated the application of more robust heuristic search methods, such as genetic algorithms, to clustering. A number of recent studies have demonstrated that clustering using genetic algorithms is often able to identify a better clustering solution than those obtained by hill-climbing search methods (Franti 2000; Garai and Chaudhuri 2004; Hong and Kwong 2008; Krishna and Murty 2002; Kuncheva and Bezdek 2002; Martnez-Otzeta et al. 2006; Mitra 2004). Although there is a clear motivation for using genetic algorithms in clustering, some existing clustering approaches that employ genetic algorithms suffer from drawbacks. First, redundancy seems to be a problem for the representations used (Falkenauer 1998). Second, the validity of the chromosomes that appear throughout the search must be ensured (Krishna and Murty 2002). Third, the number of clusters has to be specified beforehand in genetic-guided clustering methods (Du et al. 2004; Krishna and Murty 2002).


Multi-objective clustering and clustering ensembles are two approaches designed to reduce such limitations (Hruschka et al. 2009). The main goal of clustering ensembles is to improve overall accuracy or precision by exploiting the best features of each individual clustering algorithm (Kuncheva et al. 2006). Such approaches use either the class label, in the case of classification, or the desired value, in the case of regression (Hruschka et al. 2009).

This paper presents a survey of Genetic-guided Clustering Ensemble Algorithms (GCEAs) and demonstrates the use of genetic algorithms to optimize clustering accuracy. Sections 2 and 3 describe the concept of clustering ensembles, their challenges, and Clustering Ensemble Algorithms (CEAs). Section 4 explains the structure and mechanisms of GCEAs and describes the different genetic operations. Section 5 describes a number of suggested GCEAs together with their features, advantages, disadvantages and future works. Finally, Sect. 6 compares and contrasts the research works on clustering ensembles that use genetic algorithms in an effort to improve clustering accuracy.

2 Clustering ensemble and its challenges

Because different clustering algorithms produce different results on a dataset, we can combine the results of several clustering algorithms and compute the final clusters from the combined results (Minaei-Bidgoli et al. 2004). Clustering combination approaches were summarized in Ghaemi et al. (2009) (Fig. 1, Sect. 4); for more details refer to Ghaemi et al. (2009). Several recent independent studies (Coello et al. 2002; Corne et al. 2001; Deb 2001; Dudoit and Fridlyand 2003; Fischer and Buhmann 2003; Fred 2001; Fred and Jain 2002; Hruschka et al. 2009; Qian and Suen 2000; Strehl and Ghosh 2003) have pioneered clustering ensembles as a new branch in the conventional taxonomy of clustering algorithms (Fern and Brodley 2003). Other related work includes (Analoui and Sadighian 2006; Chiou and Lan 2001; Fischer and Buhmann 2003; Gablentz et al. 2000; Hruschka et al. 2009; Jain et al. 1999; Kellam et al. 2001; Topchy et al. 2004a, 2005; Xu and Wunsch 2005). A clustering ensemble is usually a two-stage algorithm. In the first stage, it stores the results of several independent runs of K-means or other clustering algorithms.

Fig. 1 Clustering ensemble architecture: m clustering algorithms applied to the preprocessed dataset produce partitions (C1, …, Cn1), …, (C1, …, Cnm) as an initial population, which are combined into the best partition (Cbest 1, …, Cbest n)


In the second stage, it uses a specific consensus function to find a final partition from the stored results. Figure 1 presents the clustering ensemble architecture. The problem of clustering ensembles can be defined generally as follows: given multiple clusterings of a particular dataset, find a combined clustering that yields better performance. Whereas the problem of clustering combination bears some traits of a classical clustering problem, it faces three major difficulties: the consensus function, the diversity of the clusterings, and the strength of the constituent clustering models (Azimi et al. 2007; Dudoit and Fridlyand 2003; Fischer and Buhmann 2003; Fred 2001; Fred and Jain 2002; Hong et al. 2008; Kellam et al. 2001; Luo et al. 2007; Strehl and Ghosh 2003; Topchy et al. 2003, 2004a, 2005). More details on these problems can be found in Sect. 2 of Ghaemi et al. (2009). The major hardship in clustering ensembles is the consensus function and the partition-combination algorithm used to produce the final partition, in other words, finding a consensus partition from the output partitions of the various clustering algorithms (Topchy et al. 2003, 2004a). Unlike supervised classification, the patterns in a clustering dataset are unlabeled; therefore, there is no explicit correspondence between the labels delivered by different partitions. The combination of multiple clusterings, as in any optimization problem, can also be viewed as finding a median partition with respect to the given partitions, which is proven to be NP-complete (Topchy et al. 2003).

3 Clustering ensemble algorithms (CEAs)

In this paper, we focus on the consensus function in clustering ensembles. Common consensus functions are based on co-association, graphs, mutual information, and voting. A function based on co-association tries to keep together objects found together in most of the individual partitions (Hruschka et al. 2009). Graph-based functions look for a consensus partition using partitioning techniques developed for graphs (Strehl and Ghosh 2002). Mutual-information-based functions maximize the mutual information between the labels of the initial partitions and the labels of the consensus partition. The voting function, after the clusters have been relabeled, assigns each object according to the number of times it belonged to each cluster (Ghaemi et al. 2009). Ghaemi et al. (2009) summarized existing research works related to types of consensus functions and compared a number of CEAs based on robustness, scalability, and computational complexity (refer to Table 1 in Ghaemi et al. 2009). In all of the mentioned research works, the experiments were carried out using datasets from the UCI benchmark repository, namely Iris, Wine, Soybean, Galaxy, Thyroid, Biochemical, Pending, Yahoo, Glass, and Isolet6. The real-world datasets include 08X, while the artificial datasets include 3-circle, Smile, Half-rings, 2-Spirals, 2D2k, 8D5k, EOS, HRCT, and MODIS (Analoui and Sadighian 2006; Azimi et al. 2007; Fern and Brodley 2004; Fred 2001; Fred and Jain 2002; Luo et al. 2007; Ng et al. 2001; Strehl and Ghosh 2002, 2003; Topchy et al. 2003, 2004a,b, 2005). The experimental results were tabulated in Table 2 of Ghaemi et al. (2009), and comparisons were made between the mean error rates of clustering accuracy. Different consensus functions were reported: Co-association function and Average Link (CAL), Co-association function and K-means (CK), Hyper-Graph Partitioning Algorithm (HGPA), Cluster-based Similarity Partitioning Algorithm (CSPA), Meta-CLustering Algorithm (MCLA), Expectation Maximization Algorithm (EM) and Mutual Information (MI). For more details about CEAs, refer to Sect. 5 in Ghaemi et al. (2009).
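To make the co-association idea above concrete, here is a minimal sketch (our own Python illustration, not the implementation of any cited paper; function names are ours): it builds a co-association matrix from several label vectors and extracts a consensus partition by thresholding the matrix and taking connected components.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components


def co_association(partitions):
    """Fraction of partitions that place each pair of objects in the same cluster."""
    partitions = np.asarray(partitions)          # shape: (n_partitions, n_objects)
    m, n = partitions.shape
    co = np.zeros((n, n))
    for labels in partitions:
        co += (labels[:, None] == labels[None, :]).astype(float)
    return co / m


def consensus_by_threshold(co, threshold=0.5):
    """Objects linked in more than `threshold` of the partitions end up together."""
    _, consensus = connected_components(co > threshold, directed=False)
    return consensus


# Three base clusterings of six objects; the consensus recovers {0,1,2} and {3,4,5}.
base = [[0, 0, 0, 1, 1, 1],
        [1, 1, 0, 0, 2, 2],
        [0, 0, 0, 1, 1, 0]]
print(consensus_by_threshold(co_association(base)))
```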


4 Employing genetic algorithms and their components in clustering ensembles

In this section, the suggested research works on GCEAs are discussed in detail based on genotype encoding, population initialization, fitness function, selection operator, crossover operator, mutation operator, and replacement. Genetic algorithms are highly beneficial for optimization tasks and are highly effective in situations in which many inputs (variables) interact with one another to produce a large number of possible outputs (solutions). A genetic algorithm constitutes a search method that can be used both for solving problems and for modeling evolutionary systems. Since it is heuristic, no one actually knows whether the solution is totally accurate; most scientific problems are therefore addressed via estimates rather than by assuming 100% accuracy (Hong and Kwong 2008).

Approaches using genetic algorithms can be classified broadly into two basic categories: generational genetic algorithms (standard genetic algorithms) and steady-state genetic algorithms (incremental genetic algorithms) (Vavak and Fogarty 1996). The first category uses typical parameters such as roulette selection with elitism. This is a method by which the fittest potential parents are selected from a population; however, it does not guarantee that the fittest member proceeds to the next generation (Hong and Kwong 2008). In generational genetic algorithms, the offspring generated in each generation replace the population of the same generation. This causes the algorithm to lose population diversity at a very fast rate due to convergence to a locally optimal solution. The second category, steady-state genetic algorithms, selects two individual parents (sometimes all individuals are selected; Hong and Kwong 2008) by rank selection. The algorithm then combines both parents to produce one offspring, thereby replacing the worst characteristics (or traits) of a population with better ones (Haupt and Haupt 1998; Hong and Kwong 2008). In general, steady-state genetic algorithms perform better at maintaining the diversity of the population and are more suitable for clustering. Unfortunately, steady-state genetic algorithms are prone to premature convergence, when convergence happens too early (Haupt and Haupt 1998; Hong and Kwong 2008). The major difference between the steady-state and the generational genetic algorithm is that, for each parent of the population generated in the generational genetic algorithm, two parents are selected by the steady-state method. Consequently, selection drift appears twice as fast within a steady-state genetic algorithm, because this method first determines rank in the population and then every member receives its fitness as a result of this ranking. Combining the strengths of various methods counteracts the weaknesses of each clustering system (Yoon et al. 2006a). In the steady-state genetic algorithms, the populations of two successive iterations overlap significantly and only one or two candidate solutions are replaced at each generation; therefore, steady-state genetic algorithms perform better at maintaining the diversity of the population and are more suitable for solving the data clustering problem (Haupt and Haupt 1998; Hong and Kwong 2008).

4.1 Genotype

Genetic-guided clustering algorithms maintain a population of coded candidate clustering solutions during their search.
Several encoding strategies have been proposed, such as the string-of-group encoding (Krishna and Murty 2002), the cluster-centers encoding (Mitra 2004) and the linear linkage encoding (Du et al. 2004). However, no conclusion has been drawn on which encoding strategy is the best, because in the algorithms where the recombination operator is easy to perform, fitness evaluations are very time-consuming.


Fig. 2 Genotype construction: genes 1, 2, …, n (one per instance) with allele values giving each instance's cluster label, e.g. alleles (4 3 2 5 5 2 1 1 2 2 … 3 1)

On the other hand, in the algorithms where fitness evaluations are simple, the recombination operators are complicated (Krishna and Murty 2002). The string-of-group encoding is a suitable strategy because of its simplicity and wide applicability. In genetic-guided clustering algorithms with the string-of-group encoding strategy, each candidate clustering solution is coded as an integer string, and the value of an integer in the string represents the label of the group in which the corresponding instance is classified. For example, if the data set has five instances {x1, x2, x3, x4, x5}, the chromosome (1 2 2 2 1) represents that the instances {x1, x5} are classified in one group while the instances {x2, x3, x4} are classified in the other group, and the partition of the data represented by the chromosome is {{x1, x5}, {x2, x3, x4}} (Hong and Kwong 2008).

Azimi et al. (2007) and Ozyer and Alhajj (2009) use the string-of-group encoding strategy for the chromosomes. They present each individual chromosome as a clustering output that covers all samples. Each individual in the population is represented by a chromosome of length n, where n is the number of instances in the dataset, and each gene is the label assigned by the clustering output. For instance, suppose that a dataset of n instances is to be clustered with k = 5 clusters. The instances are arranged from 1 to n, and each of them receives a cluster number, i.e., each gene stores one of the values k = 1, 2, 3, 4, 5. An example chromosome is shown in Fig. 2.

Mohammadi et al. (2008) propose a genotype in which each individual has two parts: a sample part that represents the partitioned instances and an index part that represents the boundaries of each cluster. The sample part contains all samples of the dataset, and each gene represents one member of the dataset; the length of this part equals the size of the dataset. The boundary of each cluster is determined in the index part, whose length equals the number of clusters + 1. Therefore, the sample and index parts together describe a candidate clustering solution for a problem: the number of clusters and the members of each cluster. Both parts of each individual are created at random during population initialization. In the sample part, the position of each dataset member is selected at random. In the index part, two integer constants Max and Min represent the maximum and minimum number of clusters, respectively; these two constants are determined before running the algorithm. Then, for each individual, an integer number I between Min and Max is generated at random as the number of clusters. The length of the index part varies from individual to individual because I is generated at random. The index part of each individual is then sampled by generating I − 1 integer values between 0 and the size of the dataset (Mohammadi et al. 2008).

In Luo et al. (2007), for a given dataset X = {x1, …, xn}, each chromosome is a sequence of integer numbers representing the class labels of the n objects xi, i = 1, …, n, where the i-th position (or gene) represents the class label of xi. The authors denote by πk the clustering determined by the chromosome Pk, so πk(xj) = Pk(j); in other words, the class label of object xj in clustering πk equals the j-th integer of chromosome Pk.
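As a small, hedged illustration of the string-of-group encoding just described (the helper names are ours, not taken from the cited papers), a chromosome is simply one integer label per instance, and decoding it amounts to grouping instance indices by label:

```python
from collections import defaultdict
import random


def random_chromosome(n_instances, k):
    """String-of-group encoding: one cluster label (1..k) per instance."""
    return [random.randint(1, k) for _ in range(n_instances)]


def decode(chromosome):
    """Group instance indices by their label to recover the partition."""
    clusters = defaultdict(list)
    for idx, label in enumerate(chromosome):
        clusters[label].append(idx)
    return list(clusters.values())


# The example from the text: (1 2 2 2 1) encodes {{x1, x5}, {x2, x3, x4}} (0-based here).
print(decode([1, 2, 2, 2, 1]))   # [[0, 4], [1, 2, 3]]
```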
Handl and Knowles employ a graph-based encoding. Each individual consists of N genes g1, …, gN, where N is the size of the given data set, and each gene gi can take allele values j in the range {1, …, N}. A value j assigned to the i-th gene is then interpreted as a link between data items i and j: in the resulting clustering solution, they will be in the same cluster. Figure 3 shows the construction of the minimum spanning tree and its genotype coding.


Fig. 3 Graph-based genotype: a minimum spanning tree over the data items and its genotype coding (order of connections: 1 to 1, 2 to 3, 3 to 1, 4 to 3, 5 to 4, …)

The data item with label 1 is first connected to itself, and then Prim's algorithm is used to connect the remaining items. In the genotype, each gene (i.e., position in the string) represents the respective data item, and its allele value represents the item it points to (e.g., gene 2 has allele value 3 because data item 2 points to data item 3). The genotype coding of the full MST, as shown in Fig. 3, is used as the first individual in the evolutionary algorithm's population (Handl and Knowles 2005).

4.2 Population initialization

Azimi et al. (2007) and Mohammadi et al. (2008) define the chromosomal population P(t) = {p1, …, pN} as consisting of N chromosomes. Initially, the chromosomes p1, …, pN are randomly generated using values between 1 and the number of clusters. Meanwhile, in Luo et al. (2007), each chromosome in the chromosomal population P(t) = {p1, …, pN} can be regarded as an integer sequence of length |X| representing a possible clustering of the dataset X = {x1, …, xn}; again, the chromosomes p1, …, pN are randomly generated using values between 1 and k. Unlike existing genetic-guided clustering algorithms whose initial population is randomly generated, Hong and Kwong propose an algorithm that initializes its population using the random subspaces method: a subset of the features is randomly selected from the full feature set and a clustering solution is obtained by executing K-means clustering on the selected features. The two steps are iterated until a population of clustering solutions is obtained. The authors claim that using the random subspaces method can significantly speed up the search in genetic-guided clustering algorithms (Hong and Kwong 2008); a small sketch of this idea is given below.

Yoon et al. (2006b) apply different types of clustering algorithms to a dataset and construct paired non-empty subsets with two clusters from all the clustering results. For example, one clustering algorithm generates three clusters (1, 2, 3) and another also generates three clusters (A, B, C) using different parameters. These six clusters create an initial population comprised of 30 paired non-empty subsets, as shown in Fig. 4. This natural reproduction process employs the fitness function as a unique way to determine whether each chromosome survives or not.
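The following is a minimal sketch of the random-subspaces initialization idea attributed above to Hong and Kwong (2008), assuming scikit-learn's KMeans as the base clusterer; the function name and parameters are our own illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans


def init_population_random_subspaces(X, k, pop_size, subspace_size, seed=0):
    """Each individual: K-means labels obtained on a randomly chosen feature subset."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    population = []
    for _ in range(pop_size):
        features = rng.choice(n_features, size=subspace_size, replace=False)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(1_000_000))
                        ).fit_predict(X[:, features])
        population.append(labels + 1)        # 1-based labels, matching the encoding above
    return population


# Toy usage: 100 points with 6 features, 3 clusters, 10 initial individuals.
X = np.random.default_rng(1).normal(size=(100, 6))
pop = init_population_random_subspaces(X, k=3, pop_size=10, subspace_size=3)
print(len(pop), pop[0][:10])
```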


Fig. 4 Initial generation of the population: the cluster results of different clustering algorithms applied to the bio-data (e.g. clusters 1, 2, 3 and A, B, C) are combined into paired non-empty subsets such as (1,2), (1,3), (1,A), (1,B), (1,C), (2,1), (2,3), …, forming the initial population

4.3 Fitness function

Azimi et al. (2007) and Mohammadi et al. (2008) present an algorithm with a two-stage fitness function: an intra-cluster fitness, Eq. (2), and an extra-cluster fitness, Eq. (4). First, based on the co-association matrix values, the average similarity within each cluster is calculated (intra-cluster fitness) using Eqs. (1) and (2); note that the genes with equal index are grouped in the same cluster. Then, the average similarity between all clusters is calculated (extra-cluster fitness) using Eqs. (3) and (4). Finally, the final fitness value is obtained by subtracting the extra-cluster fitness from the intra-cluster fitness. The fitness function is defined as:

$$\text{Fit-Clstr}(C_k) = \sum_{i=1}^{N_{C_k}} \sum_{j=1}^{N_{C_k}} \frac{\text{Co-association}(i,j)}{N_{C_k}-1} \qquad (1)$$

$$\text{Intra-Clstr-Fit} = \frac{1}{C} \sum_{k=1}^{C} \text{Fit-Clstr}(C_k) \qquad (2)$$

$$\text{Pnlt-Clstr}(C_i, C_j) = \sum_{i=1}^{N_{C_i}} \sum_{j=1}^{N_{C_j}} \frac{\text{Co-association}(i,j)}{N_{C_i} \cdot N_{C_j}} \qquad (3)$$

$$\text{Extra-Clstr-Fit} = \sum_{i=1}^{C-1} \sum_{j=i+1}^{C} \frac{\text{Pnlt-Clstr}(C_i, C_j)}{C(C-1)/2} \qquad (4)$$

where:
• Co-association(i, j) is the value of entry (i, j) in the co-association matrix.
• Fit-Clstr(C_k) is the fitness of cluster k.
• N_{C_k} is the number of samples in cluster k.
• C is the number of clusters.
• Pnlt-Clstr(C_i, C_j) is the average similarity between clusters C_i and C_j.
• Extra-Clstr-Fit is the average similarity across all clusters.

Three observations help in judging the suitability of this fitness function. First, if the number of clusters suggested by a chromosome is larger than the real number of clusters, some samples that belong to the same cluster are split across different clusters; in this case, the extra-cluster fitness of the chromosome increases and its final fitness decreases. Second, if the suggested number of clusters is smaller than the real number, samples that do not belong together are placed in the same cluster; in this case, the intra-cluster fitness of that cluster decreases and the final fitness of the chromosome decreases. Third, after some generations of the genetic algorithm, chromosomes whose number of clusters is equal or close to the real number (and which also discriminate the clusters well) are therefore expected to dominate the population (Mohammadi et al. 2008).
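A hedged sketch of our reading of Eqs. (1)-(4) follows (illustrative names, not the authors' implementation): the intra-cluster fitness averages the co-association values inside each cluster, the extra-cluster fitness averages them between clusters, and the final fitness is their difference.

```python
import numpy as np


def chromosome_fitness(chromosome, co):
    """Intra-cluster minus extra-cluster average co-association (cf. Eqs. 1-4)."""
    chromosome = np.asarray(chromosome)
    clusters = [np.flatnonzero(chromosome == label) for label in np.unique(chromosome)]

    # Eqs. (1)-(2): average within-cluster similarity.
    intra = np.mean([co[np.ix_(idx, idx)].sum() / max(len(idx) - 1, 1)
                     for idx in clusters])

    # Eqs. (3)-(4): average between-cluster similarity (a penalty term).
    penalties = [co[np.ix_(a, b)].sum() / (len(a) * len(b))
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]]
    extra = np.mean(penalties) if penalties else 0.0

    return intra - extra


# Toy co-association matrix for six samples (two natural groups).
co = np.array([[1.0, 1.0, 0.7, 0.0, 0.1, 0.2],
               [1.0, 1.0, 0.7, 0.0, 0.1, 0.2],
               [0.7, 0.7, 1.0, 0.3, 0.3, 0.4],
               [0.0, 0.0, 0.3, 1.0, 0.7, 0.7],
               [0.1, 0.1, 0.3, 0.7, 1.0, 0.7],
               [0.2, 0.2, 0.4, 0.7, 0.7, 1.0]])
print(chromosome_fitness([1, 1, 1, 2, 2, 2], co))
print(chromosome_fitness([1, 2, 1, 2, 1, 2], co))   # a worse partition scores lower
```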


Luo et al. (2007) define the fitness measure associated with each chromosome P_k, based on a metric M_f, as shown in Eq. (5):

$$f(P_k) = \begin{cases} \mu & \text{if } M_f(\pi_k) = 0 \\ 1/M_f(\pi_k) & \text{otherwise} \end{cases} \qquad (5)$$

where π_k is the clustering determined by the chromosome P_k and M_f(π_k) is given in Eq. (6):

$$M_f(\pi_k) = \sum_{i=1}^{H} d_f(\pi_k, \pi_i) = \sum_{i=1}^{H} \left[ H_f(\pi_k \mid \pi_i) + H_f(\pi_i \mid \pi_k) \right] \qquad (6)$$

For each chromosome, the associated fitness value expresses how close the chromosome is to the clustering being searched for: the larger the fitness value, the closer the chromosome is to the target clustering (Luo et al. 2007).

In Gablentz et al. (2000), the fitness function y = f(x), x ∈ X, measures the difference of the tested individual from all the other (original) individuals. Because all individuals are given as bit-strings, the Hamming distance, which also accounts for inverted bit-strings, is the appropriate measure. They define the fitness function y(b) as shown in Eq. (7):

$$y(b) = \frac{1}{m} \sum_{i=1}^{m} \min\left\{ \sum_{j=1}^{n} |x_{i,j} - b_j|,\; \sum_{j=1}^{n} |(1 - x_{i,j}) - b_j| \right\} \qquad (7)$$

where n is the length of the bit-string, m is the number of original clustering-strings, and x_{i,j} is the j-th bit of the i-th original clustering-string.

In Handl and Knowles (2005, 2006), MOCK's clustering objectives have been chosen to reflect two fundamentally different aspects of a good clustering solution: the global concept of compactness of clusters, and the more local one of connectedness of data points. In order to express cluster compactness, they calculate the overall deviation of a partitioning. This is simply computed as the overall summed distance between data items and their corresponding cluster centre, as shown in Eq. (8):

$$\text{Dev}(C) = \sum_{C_k \in C} \sum_{i \in C_k} \delta(i, \mu_k) \qquad (8)$$

where C is the set of all clusters, μ_k is the centre of cluster C_k, and δ is the chosen distance function, here the Euclidean distance. As an objective, the overall deviation should be minimized. As an objective reflecting cluster connectedness, they use a measure, connectivity, which evaluates the degree to which neighboring data points have been placed in the same cluster. It is computed using Eqs. (9) and (10):

$$\text{Conn}(C) = \sum_{i=1}^{N} \left( \sum_{j=1}^{L} x_{i,nn_i(j)} \right) \qquad (9)$$

$$x_{i,nn_i(j)} = \begin{cases} 1/j & \text{if } \nexists\, C_k : i \in C_k \wedge nn_i(j) \in C_k \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

where nn_i(j) is the j-th nearest neighbor of datum i, and L is a parameter determining the number of neighbors that contribute to the connectivity measure. As an objective, connectivity should also be minimized (Handl and Knowles 2005, 2006).
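Below is a minimal sketch of the two MOCK objectives just defined, assuming Euclidean distance and a neighborhood size L; it is our own illustration, not the MOCK source code.

```python
import numpy as np


def overall_deviation(X, labels):
    """Eq. (8): summed distance of every point to its cluster centre (minimize)."""
    dev = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        dev += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return dev


def connectivity(X, labels, L=5):
    """Eqs. (9)-(10): penalty 1/j when a point and its j-th neighbour are separated."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, 1:L + 1]   # skip the point itself
    return sum(1.0 / (j + 1)
               for i in range(n)
               for j, nb in enumerate(neighbours[i])
               if labels[i] != labels[nb])


# Two well-separated blobs: deviation is small, connectivity is zero.
X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (20, 2)),
               np.random.default_rng(1).normal(3, 0.3, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(overall_deviation(X, labels), connectivity(X, labels))
```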


Ramanathan and Guan (2006) apply genetic algorithms to clustering problems with good effect. The genetic algorithm applied is simple and retains the form of a Self-Organizing Map (SOM), but with an evolutionary representation of the weights. Since the objective is to maximize, for each pattern x, the value ||W^{(k)T} x||, a population of real-coded chromosomes encodes W^{(k)} for each cluster k. Each chromosome therefore consists of k × d elements, where k is the number of clusters and d is the dimension of the input data. The chromosomes are evaluated in batch mode so as to maximize Eq. (11):

$$\sum_{k=1}^{k} \sum_{x \in C_k} \left\| W^{(k)T} x \right\| \qquad (11)$$

Crossover and mutation are performed and new generations of chromosomes are produced. The process continues until the system stagnates or until a maximum number of epochs is reached.

Ozyer and Alhajj (2009) consider four objectives for a multi-objective genetic algorithm: separateness, homogeneity, number of clusters, and cluster density. For separateness, they use the inter-cluster separability formulas described next, where P and R denote clusters, |P| and |R| are the cardinalities of those clusters, and d(x, y) is the distance metric with x ∈ P, y ∈ R and P ≠ R (Ozyer and Alhajj 2009).

Average linkage between two clusters is the average of the pairwise distances; the cardinalities of P and R may be omitted to reduce the scaling factor. It is given by Eq. (12):

$$D(P, R) = \frac{1}{|P| \cdot |R|} \sum_{x \in P,\, y \in R} d(x, y) \qquad (12)$$

Complete linkage between two clusters is the maximal pairwise distance between their members, Eq. (13):

$$D(P, R) = \max_{x \in P,\, y \in R} d(x, y) \qquad (13)$$

Centroid linkage is the distance between the centroids v_P and v_R of the two clusters P and R, Eq. (14):

$$D(P, R) = d(v_P, v_R) \qquad (14)$$

Average-to-centroid linkage is the distance between the members of one cluster and the other cluster's representative member (centroid), Eq. (15):

$$D(P, R) = \frac{1}{|P| + |R|} \left[ \sum_{x \in P} d(x, v_R) + \sum_{y \in R} d(y, v_P) \right] \qquad (15)$$

Regardless of which of the enumerated separateness criteria is used in the process, the Total Inter-Cluster Distance (TICD) is calculated using Eq. (16):

$$\text{TICD} = \sum_{k=1}^{K} \sum_{l=k+1}^{K} D(k, l) \qquad (16)$$

where D is the inter-cluster distance selected as one of the above four formulas and k ranges over all clusters. For homogeneity, they use an intra-cluster distance formula, the Total Within-Cluster Variation (TWCV), which is calculated using Eq. (17):

$$\text{TWCV} = \sum_{n=1}^{N} \sum_{i=1}^{S} X_{ni}^2 - \sum_{k=1}^{K} \frac{1}{Z_k} \sum_{i=1}^{S} SF_{ki}^2 \qquad (17)$$

where S is the number of features of the N objects X_1, X_2, …, X_N; X_{ni} denotes feature i of pattern X_n (n = 1, …, N); SF_{ki} is the sum of the i-th features of all the patterns in cluster k (G_k); and Z_k denotes the number of patterns in cluster k (G_k). SF_{ki} is computed using Eq. (18):

$$SF_{ki} = \sum_{x_n \in G_k} X_{ni}, \quad (i = 1, 2, \ldots, S) \qquad (18)$$
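As a quick illustration of Eqs. (17) and (18) (our own sketch with hypothetical names), TWCV can be computed as the total of squared feature values minus, for each cluster, the squared per-feature sums divided by the cluster size; for a hard partition this equals the within-cluster sum of squares.

```python
import numpy as np


def total_within_cluster_variation(X, labels):
    """Eq. (17): sum of squares minus per-cluster (feature-sum)^2 / cluster size."""
    twcv = np.sum(X ** 2)
    for k in np.unique(labels):
        members = X[labels == k]
        twcv -= np.sum(members.sum(axis=0) ** 2) / len(members)   # Eq. (18): SF_k
    return twcv


X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.0], [5.2, 4.8]])
print(total_within_cluster_variation(X, np.array([0, 0, 1, 1])))   # small for tight clusters
```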

The objectives are treated as minimization objectives, whereby the separateness value is multiplied by −1 for minimization. The objectives are then normalized by dividing their values by the corresponding maximum values (Ozyer and Alhajj 2009).

Hong and Kwong (2008) denote a data set containing n unlabeled instances as D = {x_1, x_2, …, x_n}. Clustering algorithms classify these n instances into k groups such that the optimal value of a predefined clustering criterion is achieved. There is no single clustering criterion that is valid for all kinds of data sets; a popular criterion is the minimization of the within-cluster variation. Provided that each instance x_j has m features, x_j = {x_{j1}, x_{j2}, …, x_{jm}} with j = 1, …, n, the within-cluster variation of a clustering solution C = {c_1, c_2, …, c_k} of the data set can be calculated using Eqs. (19) and (20):

$$f(C) = \sum_{j=1}^{n} \sum_{k=1}^{k} \delta(x_j, C_k) \cdot \sum_{l=1}^{m} (x_{jl} - c_{kl})^2 \qquad (19)$$

$$c_{kl} = \frac{\sum_{j=1}^{n} \delta(x_j, C_k) \cdot x_{jl}}{\sum_{j=1}^{n} \delta(x_j, C_k)} \qquad (20)$$

where δ(x_j, C_k) = 1 if the instance x_j belongs to the group C_k and 0 otherwise, for k = 1, …, K and l = 1, …, m. The above objective function f(C) is usually high-dimensional, nonlinear and multimodal, with a number of locally optimal clustering solutions, whereas the commonly used hill-climbing search methods can only guarantee a locally optimal clustering solution. To approach the optimal solution, heuristic search methods such as genetic algorithms are widely applied to this combinatorial optimization problem (Hong and Kwong 2008).

4.4 Selection operator

The selection process selects individuals from the mating pool, directed by the survival-of-the-fittest concept of natural genetic systems (Haupt and Haupt 1998). Azimi et al. (2007) and Mohammadi et al. (2008) use tournament selection 2 (with two parents) as the selection method: two individual chromosomes are chosen at random and the individual with the better fitness is selected for the next population. Yoon et al. (2006b,c) select the pair of subsets with the largest number of highly-overlapped elements among all paired subsets; this forms the fitness function that selects a pair for the next crossover operation. For instance, suppose that bio-data with 10 elements, as shown in Fig. 4, generate an initial population through the reproduction operation.


If two subsets (1,2) and (1,3) are selected as paired subsets from Fig. 4, the first cluster (1,2,3) in {(1,2,3) (4,5,6) (7,8,9,10)} is compared with the clusters of the other result, {(1,2) (3,4) (5,6) (7,8,9,10)}. That is, the first cluster (1,2,3) and the cluster (1,2) from the other clustering result have an overlap value of 2, higher than with the {(3,4) (5,6) (7,8,9,10)} clusters. Moreover, the (4,5,6) cluster has an overlap value of 2 with the cluster (5,6), and the (7,8,9,10) cluster has a representative value of 4, derived by comparing it with the cluster (7,8,9,10) of {(1,2) (3,4) (5,6) (7,8,9,10)}. This process adds up the representative values of each cluster and selects a subset for the crossover operation by comparing all population pairs. As shown by items (A) and (B) in Fig. 5, the subsets (1,2) and (1,3) score 17 and 15, respectively; finally, the subset (1,2) is selected with the greater selection probability.

Yoon et al. (2006c) propose a new selection operator to generate the optimal result. Once a suitable chromosome is chosen for analysis, it is necessary to create an initial population to serve as the starting point for the genetic algorithm. The proposed selection method proceeds in the following order, as illustrated in Fig. 6:

1. The first step is to construct paired subsets from two clustering results, out of all the possible clustering results, for the population generation. Because multi-source bio-datasets can lead to different outputs, generating the initial population for the selection operator combines different clustering results.

2. After generating the initial population, the next step involves selecting parents for recombination using the roulette wheel selection method with slots that are sized according to fitness value. This is one method of choosing members from a population of chromosomes with a probability that is proportional to their fitness value.

Fig. 5 An example for the selection method


Fig. 6 Selection method for the evolutionary reproduction process


Parents are selected according to their fitness value: the better the fitness of a chromosome, the higher the probability that it will be selected.

In Ozyer and Alhajj (2009), fitness evaluation is first performed on the initial chromosomes. In the selection part, I_ran individuals are randomly picked from the population. According to homogeneity and separateness, there are two goals. The selection using a pareto-domination tournament then picks two candidate items from the (population size − I_ran) remaining individuals to participate in the pareto-domination tournament against the I_ran individuals for their survival in the population. With two randomly selected chromosome candidates from the (population size − I_ran) individuals, each candidate is compared against each individual in the comparison set I_ran. If a candidate is dominated by the comparison set, it is deleted from the population permanently; otherwise, it stays in the population. At the end, it is necessary to keep only N individuals in the population. This is achieved by ranking the surviving individuals and moving only the top N to the next generation. In other words, after all the operators are applied through the pareto-domination tournament, twice the initial number of individuals exists; half of the individuals are eliminated, which automatically determines the ranking without any need for external parameters. In this way, the best individuals to place in the population for the next generation are picked. The approach picks the first N individuals by considering elitism and diversity among the ranked 2N individuals. Since it attempts to take the first N individuals, the last non-dominated front may contain more individuals than needed to complete the N; hence diversity is handled automatically. Individuals are sorted in descending order in terms of each individual's total difference from its closest individual pair: the one with the closest smaller summed values and the one with the closest greater summed values. After sorting the individuals by this total difference, the top N individuals are moved to the next generation. The main reason is to take the crowding factor into account automatically, so that individuals occurring close to others are unlikely to be picked, while solutions far apart from others are kept to satisfy the diversity requirement. After this operation, if the maximum number of generations is reached or the pre-specified threshold is satisfied, the process terminates; otherwise, the next iteration starts (Ozyer and Alhajj 2009).

4.5 Crossover operator

Crossover is a probabilistic process that exchanges information between two parent individuals to generate at least two child individuals (Haupt and Haupt 1998). Azimi et al. (2007) use multi-point crossover with a fixed crossover probability μ. For individuals of length l, several random integers (crossover points) are generated in the range [1, l − 1], and the corresponding portions of the two individuals are exchanged at the crossover points in order to produce several offspring. Mohammadi et al. (2008) use cut-and-crossfill crossover with a fixed crossover probability μ. In cut-and-crossfill crossover, a random integer is generated as the crossover point in the range [1, l − 1], where l is the size of the dataset, i.e., the length of the sample part. The portions of the two parent individuals lying to the right of the crossover point are exchanged to produce two offspring.
This crossover uses only the sample part and guarantees that no gene is duplicated, which means that two valid children are created every time. In Luo et al. (2007), a number of max{2, rN} chromosomes from the old generation are probabilistically selected to generate new offspring by applying the crossover operator.


The selection method used is fitness-proportionate, in which chromosomes with greater fitness values have a greater chance of being selected. Single-point crossover is used with a fixed crossover probability r: for the two selected parent chromosomes of length n, a random integer is generated in the range [1, n − 1] as the crossover point, and the portions of the chromosomes lying to the right of the crossover point are exchanged to produce two offspring.

Ensemble learning (Dietterich 1997) refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions; in the context of GCEAs it has been introduced as a crossover operator. In ensemble learning, a more reliable result can be obtained by combining the output of multiple experts, and a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (a divide-and-conquer approach). Ensemble learning is a hot topic in machine learning and is regarded as one of its four main directions. Nonetheless, the commonly used recombination operators of genetic algorithms, such as the one-point crossover operator, cannot perform well enough because of the problems of clustering invalidity and context insensitivity. Clustering invalidity occurs when the recombination operator reproduces new clustering solutions whose number of clusters is smaller than the given number of clusters. For example, if the simple one-point crossover operator is executed on the chromosome (1 1 2 2 3 3) and the chromosome (3 1 1 3 2 2), both new clustering solutions (1 1 2 2 2 2) and (3 1 1 3 3 3) have only two clusters, and both are invalid.

Apart from clustering invalidity, a more serious problem associated with commonly used recombination operators such as the one-point crossover operator is context insensitivity, which occurs because one clustering solution can be coded by several different chromosomes. For example, both the chromosome (1 1 2 2) and the chromosome (2 2 1 1) represent the same clustering solution, where instances {x1, x2} are classified into one group and instances {x3, x4} are classified into the other group. In this case, the recombination operator exchanges string blocks of two different chromosomes in the population but may not exchange their clustering contexts when combining new candidate clustering solutions. For example, the chromosome (1 1 1 2 2 2) and the chromosome (2 2 2 1 1 1) represent the same clustering solution, where instances {x1, x2, x3} are classified into one group and instances {x4, x5, x6} are classified into the other group; however, their offspring (1 1 1 1 1 1) and (2 2 2 2 2 2) after executing the one-point crossover operator are significantly different from their parents. Both examples illustrate the fact that the commonly used recombination operators of genetic algorithms are only able to mix string blocks of different chromosomes, but not able to recombine the clustering contexts of different chromosomes into new, better ones. The context insensitivity of the recombination operator often leads to the disruption of good building blocks; if this disruption occurs too frequently, the recombination operator loses its potential and the search of the genetic algorithm becomes a random walk.

In view of this, Hong and Kwong (2008) replace the traditional recombination operators of genetic algorithms with an ensemble learning operator for reproducing new candidate clustering solutions. They provide P^{(s)} = {I^{(1)}, I^{(2)}, …, I^{(M)}}, where M is the number of parental clustering solutions and I_j^{(i)} represents the label of the group in which the instance x_j is classified in the i-th clustering solution. The ensemble learning operator reproduces a new candidate clustering solution I^{(new)} by combining these M clustering solutions without accessing the features of the data. The ensemble learning operator, based on the average-link agglomerative clustering algorithm, obtains a new clustering solution with the following steps.


First, the clustering solution I^{(i)} is transformed into a similarity matrix S^{(i)}, as shown in Eq. (21) (Fred and Jain 2005; Hong and Kwong 2008):

$$S^{(i)}(j_1, j_2) = \begin{cases} 1 & \text{if } I^{(i)}_{j_1} = I^{(i)}_{j_2} \\ 0 & \text{otherwise} \end{cases} \qquad (21)$$

where j_1 = 1, …, n and j_2 = 1, …, n. Accordingly, {S^{(1)}, S^{(2)}, …, S^{(M)}} are obtained from the M available clustering solutions. Second, all similarity matrices {S^{(1)}, S^{(2)}, …, S^{(M)}} are combined into a single consensus similarity matrix S(j_1, j_2), as shown in Eq. (22):

$$S(j_1, j_2) = \frac{1}{M} \sum_{i=1}^{M} S^{(i)}(j_1, j_2) \qquad (22)$$

where j_1 = 1, …, n and j_2 = 1, …, n. The value of S(j_1, j_2) represents the frequency with which the instances x_{j_1} and x_{j_2} are classified into the same group in the parental clustering solutions P^{(s)} = {I^{(1)}, I^{(2)}, …, I^{(M)}}. After the similarity matrix S is calculated, a new similarity matrix S^{(new)} is sampled from S, as shown in Eq. (23):

$$S^{(new)}(j_1, j_2) = \begin{cases} 1 & \text{if } rand(1) < S(j_1, j_2) \\ 0 & \text{otherwise} \end{cases} \qquad (23)$$

where rand(1) is a random number in the range [0, 1], j_1 = 1, …, n and j_2 = 1, …, n. Lastly, a new clustering solution I^{(new)} is obtained by running the average-link agglomerative clustering algorithm on the similarity matrix S^{(new)} (Fred and Jain 2005; Hong and Kwong 2008). Note that the average-link agglomerative clustering algorithm classifies data instances based on their distance matrix: a small value of an element in the distance matrix indicates that two instances have a high probability of being classified into the same group. Unlike the distance matrix, the similarity matrix of data instances describes the similarities among instances, so a small element value indicates that two instances have a small probability of being classified into the same group. The similarity matrix should therefore first be transformed into a distance matrix before the average-link agglomerative clustering algorithm is executed.

The ensemble learning operator mitigates the problem of context insensitivity, because one clustering context has only one similarity matrix and different chromosomes with the same clustering context share the same similarity matrix. For example, both the chromosome (1 1 1 2 2 2) and the chromosome (2 2 2 1 1 1) have the same clustering context {{x1, x2, x3}, {x4, x5, x6}}, which is represented by the same similarity matrix (Hong and Kwong 2008):

$$S = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}$$

Therefore, the ensemble learning operator on this similarity matrix does not suffer from the context insensitivity problem. In addition, since new candidate clustering solutions are directly generated by the average-link agglomerative clustering algorithm, whose number of clusters is fixed to the given number of clusters, the ensemble learning operator is also immune to the problem of clustering invalidity (Hong and Kwong 2008).
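A hedged sketch of the ensemble learning operator described by Eqs. (21)-(23) follows; it uses SciPy's average-linkage routines in place of the authors' agglomerative step, and the helper names are ours, not theirs.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ensemble_learning_operator(parents, k, seed=0):
    """Combine parental label vectors into one offspring clustering (cf. Eqs. 21-23)."""
    rng = np.random.default_rng(seed)
    parents = np.asarray(parents)                     # shape: (M, n)
    # Eqs. (21)-(22): consensus similarity = fraction of parents agreeing on each pair.
    S = np.mean([(p[:, None] == p[None, :]) for p in parents], axis=0)
    # Eq. (23): sample a 0/1 similarity matrix from the consensus matrix.
    S_new = (rng.random(S.shape) < S).astype(float)
    S_new = np.maximum(S_new, S_new.T)                # keep it symmetric
    np.fill_diagonal(S_new, 1.0)
    # Average-link agglomerative clustering on the corresponding distance matrix.
    condensed = squareform(1.0 - S_new, checks=False)
    return fcluster(linkage(condensed, method='average'), t=k, criterion='maxclust')


parents = [[1, 1, 1, 2, 2, 2], [2, 2, 2, 1, 1, 1], [1, 1, 2, 2, 2, 2]]
print(ensemble_learning_operator(parents, k=2))       # e.g. [1 1 1 2 2 2]
```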


In Yoon et al. (2006b), the selected subset produces offspring from two parents such that the offspring inherit as much meaningful parental information as possible. This operator is based on the methodological ideas in Handl and Knowles (2006): it exchanges cluster traits between different clustering results, and elements with highly-overlapped, meaningful information are inherited by the offspring until an optimal final clustering result is achieved.

In the related work of Yoon et al. (2006a,c), a novel crossover approach is used. For example, A and K are two parents selected from the initial population, as shown in Fig. 7. One parent has three clusters (A1, A2, and A3) and the other has five clusters (K1, K2, K3, K4, and K5). First, one cluster is selected from the first parent, say A1, because it has more highly-overlapped traits with the clusters of the second parent K than the other two clusters (A2 and A3). Then, A1 replaces the cluster of the second parent with the largest number of similarities to A1, say K5 (objects 7, 27, 39, 58, 63, 65, 71 and 84). With this replacement, some objects in A1 (objects 63, 71 and 84) no longer appear as overlapping objects in K5; however, objects 63 and 84 of A1 appear in K2 and K4, respectively. Consequently, objects 63 and 84 are removed so that each object belongs to only one cluster. The remaining object of A1 (object 71) is taken from K5, so that these objects do not appear in any other cluster. Finally, the new clustering solution is represented by the first offspring, possessing traits K1, K2, K3 and K4 together with the revised A1. This crossover operation is repeated, with a cluster selected from the second parent, to generate the second offspring. The two parents are replaced by the new offspring in the population during the final stage. After the replacement, the fitness is computed again for the disjoint non-empty subsets using only two elements; a pair of new candidates is then determined for the following parent selection and the process is repeated. This procedure exchanges cluster traits of different clustering results, and objects with highly-overlapped, meaningful information are inherited by the offspring until an optimal final clustering result is achieved. The proposed crossover operation is therefore a stable approach, because of the invariable population of subsets and the process of combining highly-overlapped objects (Yoon et al. 2006a,c).

Fig. 7 Crossover operation to exchange the clustering results


In Ozyer and Alhajj (2009), initial tests with alternative crossover operators from previous research showed that one-point crossover satisfies the target at lower cost. It is therefore applied, with probability p_c, to the chromosomes previously selected from the population by considering ranking and crowding.

4.6 Mutation operator

Mutation takes a chromosome as input and outputs a chromosome obtained by complementing the value at a randomly selected location in the input chromosome (Haupt and Haupt 1998). Azimi et al. (2007) propose two mutation methods to increase the performance of the algorithm and the accuracy of the genetic algorithm: a swap mutation and a special mutation inspired by ant colony clustering. Swap mutation changes the positions of two samples at random. The special mutation is used after some iterations, in the final iterations, when the algorithm has already reached an approximate result. It is known as an intelligent mutation, because it intelligently selects the best candidate samples to mutate. The steps of the special mutation are as follows. First, a cluster C is chosen at random. Second, the member X of C with minimum similarity to the other members (the most dissimilar member) is chosen. Third, the similarity of X to each cluster is calculated. Finally, X is transferred from its previous cluster to the cluster with maximum similarity to it (Azimi et al. 2007).

Mohammadi et al. (2008) implement four mutation methods to increase the performance of their proposed genetic algorithm and to further increase the clustering accuracy: swap mutation, creep mutation, merge-and-split mutation, and special mutation. Swap mutation changes the positions of two samples at random; this mutation uses only the sample part. Creep mutation uses only the index part and works by adding a small (positive or negative) value to a selected gene with probability p, which means that the mutation can change the boundary of the selected cluster, either increasing or decreasing it. Merge-and-split mutation also uses only the index part: when a chromosome is selected for this mutation, a merging process is applied with probability 0.5 and a splitting process with probability 0.5. During the merging process, the two clusters with the greatest mutual association value are merged into one cluster and the cluster bounds are updated in the index part. During the splitting process, a cluster with a small intra-association value is selected and then split into two clusters, and the index part is updated. Special mutation uses only the sample part and is applied during the last generations, when the algorithm has obtained near-optimal results. Given an individual, the steps of the special mutation are as follows. First, a cluster C is chosen at random. Second, the member X of C with minimum similarity to the other members is chosen. Third, the similarity of X to the other clusters is calculated, and X is transferred to the cluster H with maximum similarity. The boundary of H is then updated in the index part.

In Luo et al. (2007), a number of max{1, mN} chromosomes are selected with uniform probability to undergo mutation. The mutation operator is not biased towards the fittest chromosomes, because chromosomes are selected with uniform probability and undergo a mutation once they are selected. Each chromosome undergoes mutation with a fixed probability m. This type of mutation randomly changes max{1, 0.1n} positions in the selected chromosome of length n; each selected chromosome position (or gene) is mutated by simply replacing its value with a randomly selected value from 1, …, k.
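As a small illustration of the label-string mutations discussed above (our own sketch, not the cited implementations), the following shows a swap mutation and a random reassignment of a single gene:

```python
import random

rng = random.Random(0)


def swap_mutation(chromosome):
    """Exchange the cluster labels of two randomly chosen positions."""
    child = list(chromosome)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def reassign_mutation(chromosome, k):
    """Replace the label of one randomly chosen gene with a random label in 1..k."""
    child = list(chromosome)
    child[rng.randrange(len(child))] = rng.randint(1, k)
    return child


parent = [1, 2, 2, 3, 1, 3]
print(swap_mutation(parent), reassign_mutation(parent, k=3))
```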


In Ozyer and Alhajj (2009), after the crossover, mutation is applied to the individuals of the current population. During mutation, a gene value a_n is replaced with a new value with respect to a probability distribution: for n = 1, …, N, the new value is the number of a cluster randomly selected from {1, 2, …, k} with the probability distribution {p_1, p_2, …, p_k} computed using Eq. (24):

$$p_i = \frac{e^{-d(X_n, c_i)}}{\sum_{j=1}^{k} e^{-d(X_n, c_j)}} \qquad (24)$$

where i ∈ [1, …, k], d(X_n, c_i) denotes the Euclidean distance between pattern X_n and c_i, the centroid of the i-th cluster, and p_i represents the probability interval of mutating a gene to cluster i (i.e., a roulette wheel). Eventually, the K-means operator is applied to reorganize the assigned cluster number of each object. This process speeds up convergence because, after every generation, a data point is assigned to the closest cluster with respect to the average inter/intra-cluster distance value.

4.7 Replacement

The elitism method is used as the replacement function to select the best genes for the next population; this method guarantees that the best individual is carried over to the next generation (Azimi et al. 2007; Haupt and Haupt 1998).

5 Genetic-guided clustering ensemble algorithms (GCEAs) and their clustering accuracy

In this section, we summarize existing research works on clustering ensembles that employ genetic algorithms, as shown in Table 1, and investigate their features and advantages, disadvantages and future works, and computational complexity. Table 1 summarizes the existing research works on GCEAs. We focus on the main characteristics of GCEAs from a general point of view and present them in the hope of helping researchers to select a consensus method or an algorithm for clustering ensembles. In this survey, we found that reproducible, reliable and robust clustering results are often achieved by the suggested GCEAs, in addition to high accuracy. Fast convergence and robustness are the main properties of the existing GCEAs. Another important property of GCEAs is hybridization, the capability to combine GCEAs with other global clustering algorithms. In many available algorithms, the researchers identify the most suitable clustering algorithms for an unknown dataset, which helps to obtain a concise and stable set of partitions as the initial population in the first phase of the clustering ensemble. Many recent works on GCEAs offer flexible recombinations that permit the efficient generation of clustering solutions. GCEAs also solve the problems of instability inherent in clustering algorithms, clustering invalidity, and context insensitivity, although scalability is an overall weakness of genetic algorithms. Nonetheless, the literature lacks a comparative analysis of robustness, stability and simplicity across all algorithms; this can be considered future work for researchers. GCEAs also improve clustering accuracy in experimental results. Table 2 compares related research works based on clustering accuracy and genetic operations in GCEAs. The empirical results from Table 2 suggest accuracy problems in the application of GCEAs to some real-world and artificial datasets. The existing GCEAs used datasets from the UCI repository such as Iris, Wine, Soybean and Glass, clinical datasets from CAMADA,

Incorporating classification information into an unsupervised algorithm and using the resulting algorithm for transductive inference

Generate the purest possible clusters regarding the class distribution

Demiriz et al. (1999)

New algorithm with automatic determination of the number of clusters

Handl and Knowles (2005, 2006)

Genetic algorithm to allow for selection of more reliable clustering result and better extraction of optimal clustering Using special mutation called the intelligent mutation

Azimi et al. (2007)

High accuracy

Robustness

Simplicity

Fast convergence

A novel Heterogeneous Clustering Ensemble (HCE)

Yoon et al. (2006a)

To permit to keep the number of clusters dynamic by encoding scheme

A novel and flexible representation that permits efficiency when generating clustering solutions An automated technique for selecting high quality solutions Graph-based encoding as a suitable genetic encoding of a partitioning

To need further comparison against the other classes of SSL algorithms such as genetic based method

Improve accuracy by increasing the amount of labeled or unlabeled data that depends on how the data fits the algorithm assumption. This is the case for EM based Expectation Maximization Technique (GMM) and Seeded Fuzzy C-mean Clustering Algorithm (SFCM)

Bouchachia (2005)

Using one point crossover

Optimize cluster results by combining different bio-data sources in multi-source bio-data sets Applying different clustering algorithms

To require the identification of all sub graph for decoding graph-based representation

Investigate a combination of different semi-supervised classifier to build an ensemble classifier

Clusters all data into only two clusters; no real datasets used

Identify the most suitable clustering algorithm for an unknown dataset

Gablentz et al. (2000)

Achieved reproducible result

Using function as a linear combination of a measure of dispersion of the clusters (unsupervised) and a measure of impurity with respect to the known classes(supervised)

Table 1 Summary of the suggested genetic-guided clustering ensemble algorithms (columns: Authors; Features & advantages; Disadvantages & future works)


Yoon et al. (2006b)

Ramanathan and Guan (2006)

Yoon et al. (2006c)

Good performance of the approach

Luo et al. (2007)

Improve its performance as the number of iterations increase

Efficient searching for possible solutions and improved effectiveness of clusters; overcomes the instability inherent in clustering algorithms

Proposed a novel method based on genetic algorithm, called Heterogeneous Clustering Ensemble (HCE) Generate robust clustering results

To measure algorithm robustness with using of the correlation of the clusters with ground truth information

Training a set of core patterns using SOM that is a type of neural network

No need to remove elements for preprocessing Involve a hybrid combination of a global clustering algorithm followed by a corresponding local clustering algorithm Recursive divide-and-conquer approach to clustering

Generate better cluster results than those obtained using just one data source

Generate different types of multi source data through a variety of different experiments Consider characteristics that present optimal cluster results from different clusters and different clustering algorithms

Influence of different generator functions on the ensemble performance

Favorable comparison with other consensus functions

Features & advantages

Authors

Table 1 continued

Comparison of the approach with other algorithms based on robustness, scalability, stability and simplicity

Build recursive approach on top of other current clustering approaches to improve their performance

Developed only for irregular clusters, not for overlapping clusters; catering for overlapping clusters is left as future work

Identify other bio-information based on genes, in multi source datasets Include multiple biological data types in order to discover optimal cluster results and then to again apply their proposed method Develop a more theoretically and experimentally justified verification system of multi source data

Comparison of the approach with other consensus functions based on robustness, scalability, stability and simplicity

Disadvantages & future works


Ozyer and Alhajj (2009)

Hong and Kwong (2008)

Considers the knowledge of some existing complete classification of such data; based on the Multi-Objective Clustering Ensemble Algorithm (MOCLE)

Faceli et al. (2007)

Achieve scalability by first partitioning a large dataset into subset of manageable sizes based on specifications of the machine to be used in clustering process Obtain the actual intended clustering by a conquer process, where each instance (leaf node) belongs to the final cluster represented by the root of its tree Using multi-objective genetic algorithm combined with validity indices to decide on the number of classes

Considers previous knowledge through frequent selection of the best results among those of the individual algorithms across a range of different data conformation levels. Uses a steady-state genetic algorithm with the ensemble learning method, i.e., the genetic-guided clustering algorithm with ensemble learning operator (GCEL). Avoids the problems of clustering invalidity and context insensitivity by replacing the traditional recombination operator with an ensemble learning operator. Fewer fitness evaluations are required to converge because the initial population of candidate clustering solutions is generated with the random subspaces method. Achieves a comparable or better clustering solution with fewer fitness evaluations. Solves the scalability problem by applying the divide-and-conquer approach iteratively to handle the clustering process

Automatically embedding of the prior knowledge in MOCLE

Features & advantages

Authors

Table 1 continued

Comparison of the approach with other algorithms based on robustness, stability and simplicity

Comparison of the approach with other consensus functions based on robustness, scalability, stability and simplicity

The adaptation of MSH to traditional semi-supervised clustering and its comparison with other existing techniques

Investigation of other functions

Disadvantages & future works


and also real datasets such as 08X, Golub, Leukemia, X8K5D, Ionosphere, Promoters and Segmentation (Azimi et al. 2007; Faceli et al. 2007; Gablentz et al. 2000; Handl and Knowles 2005; Hong and Kwong 2008; Luo et al. 2007; Ozyer and Alhajj 2009; Ramanathan and Guan 2006; Yoon et al. 2006a,b,c). We found that the mean error rates of current algorithms such as the Genetic Algorithm Clustering Ensemble (GACE) and Genetic-guided Clustering Ensemble Learning (GCEL) on UCI datasets lie in the range from 10 to 29.5%, while the accuracy of recent algorithms on real datasets lies between 58.89 and 100%. These values show that GCEAs are able to achieve high accuracy on both real-world and artificial datasets. In the following section, the accuracy of CEAs and GCEAs is compared.

The existing GCEAs were also compared based on their genetic operations. We found that most of the applicable genetic algorithms use a population of size 100, while some use a population of size 1,000. For the generation of the initial population, methods such as K-means, single link, average link, and the random subspace method are often used. These algorithms use tournament selection with tournament size 2 as the selection operator, and single-point, two-point and uniform crossover as recombination operators.
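Based on the configurations summarized above (a population of about 100 individuals, tournament selection of size 2, elitism, and single-point, two-point or uniform crossover on label-vector chromosomes), a generic GCEA main loop might look like the following sketch. It is only an illustrative skeleton under those assumptions, not the implementation of any particular surveyed work; the consensus fitness function and the generator of initial partitions are placeholders:

```python
import random

def tournament_select(population, fitness, k=2):
    """Tournament selection of size k: the fitter of k random individuals wins."""
    contenders = random.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitness[i])]

def uniform_crossover(parent_a, parent_b):
    """Uniform crossover on label-vector chromosomes."""
    return [a if random.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def run_gcea(initial_partitions, consensus_fitness, n_generations=100, pop_size=100):
    """Generic genetic-guided clustering ensemble loop (illustrative only).

    initial_partitions: list of label vectors, e.g. produced by K-means,
                        single link, average link, or the random subspace method.
    consensus_fitness:  callable scoring how well a label vector agrees with
                        the ensemble of initial partitions (higher is better).
    """
    population = [list(random.choice(initial_partitions)) for _ in range(pop_size)]
    for _ in range(n_generations):
        fitness = [consensus_fitness(ind) for ind in population]
        best = max(range(pop_size), key=lambda i: fitness[i])
        offspring = [population[best]]            # elitism: keep the best individual
        while len(offspring) < pop_size:
            pa = tournament_select(population, fitness)
            pb = tournament_select(population, fitness)
            child = uniform_crossover(pa, pb)
            # a mutation step (e.g. the roulette-wheel mutation of Eq. 24) would go here
            offspring.append(child)
        population = offspring
    fitness = [consensus_fitness(ind) for ind in population]
    return population[max(range(pop_size), key=lambda i: fitness[i])]
```

In practice the surveyed algorithms differ mainly in how the consensus fitness is defined (co-association, mutual information, validity indices, and so on) and in the recombination operator used.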

6 Comparison of clustering accuracy between CEA and GCEA algorithms

We investigated the clustering accuracy obtained by CEAs and GCEAs, as shown in Table 2, on benchmark datasets including the UCI repository, artificial datasets, and real-world datasets. All the clustering accuracies on the same datasets, namely Iris, Wine, Soybean, 08X, Half-rings, 2-spiral, X8D5K, Galaxy, Biochemistry, Leukemia, Segmentation, Promoters and Ionosphere, are compared (Azimi et al. 2007; Faceli et al. 2007; Gablentz et al. 2000; Handl and Knowles 2005; Hong and Kwong 2008; Lu et al. 2004; Luo et al. 2007; Mohammadi et al. 2008; Ozyer and Alhajj 2009; Ramanathan and Guan 2006; Yoon et al. 2006a,b,c). The average and maximum clustering accuracy percentages are presented in Table 3.

Based on the accuracy values in Table 3, we illustrate the performance of both CEAs and GCEAs in the form of charts. Figure 8a, b show the average and the maximum clustering accuracy obtained by CEAs and GCEAs on thirteen benchmark datasets, respectively. As shown in Fig. 8a, the average clustering accuracy obtained by GCEAs is better than that of CEAs on most datasets; equivalently, the mean error rate of GCEAs is lower than that of CEAs.

Figure 8a presents the average clustering accuracy obtained by both CEAs and GCEAs. Out of thirteen datasets, nine (Iris, 08X, 2-spiral, X8D5K, Galaxy, Biochemistry, Segmentation, Promoters, and Ionosphere) yielded higher average clustering accuracy with GCEAs than with CEAs. The remaining four datasets (Wine, Soybean, Half-rings, and Leukemia) yielded lower average clustering accuracy with GCEAs than with CEAs. This shows that the average clustering accuracy obtained by GCEAs on most datasets is better than that obtained by CEAs on the same datasets; correspondingly, the mean error rate of the GCEAs that employ genetic algorithms for the clustering ensemble is also improved.

Figure 8b presents the maximum clustering accuracy obtained by both CEAs and GCEAs. Out of thirteen datasets, eleven (Iris, Wine, Soybean, 08X, Half-rings, 2-spiral, Galaxy, Biochemistry, Segmentation, Promoters, and Ionosphere) yielded a higher maximum clustering accuracy with GCEAs than with CEAs. On two datasets (X8D5K and Leukemia), the maximum clustering accuracy obtained by GCEAs is lower than the maximum


Yoon et al. (2006a)

Handl and Knowles (2005)

The largest number of fitness values among 36 disjoint subsets by means of 10,000 crossover operation repetitions Using different fitness operations

Classified clusters of the clinical data set; L = 42, M = 51 and W = 25 mean least symptomatic, moderately symptomatic and most symptomatic patient’s number for CFS, respectively True clusters for: K-means = L; Hierarchical Clustering = L; PCA-based clustering = M; Heterogeneous Clustering Ensemble = L

Constraints: k{1, . . . , 25},cluster size > 2

Recombination: Uniform crossover; pr = 0.7

Mutation rate pm = 1/N

Initialization: Minimum spanning tree

The clinical data set from CAMDA is classified into three cluster groups: least, middle, and worst (most symptomatic) AVADIS analysis tool

Best value for: Single Link = 1.0; Average Link = 0.997; K-means = 0.989003; Clustering Ensemble = 1.0; MOCK = 1.0

External population size = 1,000 Internal population size = max(50, N /20)

Number of generations = 200

Mutation; probability = 0.8

Luo et al. (2007)

Sample median and inter-quartile range F-Measure values for 50 runs of each algorithm on two-dimensional synthetic data sets exhibiting different data properties

70 children in each generation

Got even more results leading again to the first original clustering bit-string with even higher fitness, i.e. 1/y(b_best) = 0.04096, with only four different bits inside the compared strings; one-point crossover

The stop criteria 300 generations 30 parents in each generation

Highest fitness = 1/y(b); 1/y(b_best) = 0.0393256

Gablentz et al. (2000)

Compute 28 original clustering strings with a length of 100 bit

Table 2 Comparison of accuracy in the proposed genetic-guided clustering ensemble algorithms (columns: Authors; Clustering accuracy; Genetic operations)


Five different consensus functions: Genetic Algorithm Clustering (GACE), proposed in this research work; Co-association function and Average Link (CAL); Co-association function and K-means (CK); Hypergraph Partitioning Algorithm (HGPA); Cluster-based Similarity Partitioning Algorithm (CSPA)
Mean error rate on Iris: GACE = 10; CAL = 10.4; CK = 21.4; HGPA = 41.1; CSPA = 11.1
Mean error rate on Wine: GACE = 29.5; CAL = 24.4; CK = 37.3; HGPA = 28.3; CSPA = 27.8
Mean error rate on Soybean: GACE = 23.4; CAL = 24.4; CK = 30.1; HGPA = 28.3; CSPA = 27.8
Mean error rate on 08X: GACE = 18.6; CAL = 36.8; CK = 21.1; HGPA = 14.5; CSPA = 14.7
Classical real-world dataset from the UCI benchmark repository: Iris Plant dataset
Comparison between the Genetic Algorithm (GA) approach existing in this research work and five other consensus functions: Cluster-based Similarity Partitioning Algorithm (CSPA), Hypergraph Partitioning Algorithm (HGPA), Meta-Clustering Algorithm (MCLA), Mutual Information (MI) and EM
The mean error rate (%) of clustering combination from 20 runs on Iris: GA = 18; CSPA = 9.4; HGPA = 41.4; MCLA = 11.1; MI = 13; EM = 11

Applied hierarchical clustering, self organizing maps, and k-means clustering algorithms Compared the results generated using CLUSTER to those of proposed method The clinical dataset from CAMDA was classified into three cluster groups, based upon the overall severity of CFS symptoms- least symptoms (L), mid level symptoms (M), and worst symptoms (W) The proteomics data yielded better experimental results than the microarray data, because the proteomics data more closely agrees with the clusters classified using the clinical data

Azimi et al. (2007)

Yoon et al. (2006c)

Luo et al. (2007)

Clustering accuracy

Authors

Table 2 continued

Hierarchical: all-linkage clustering, based on arrays
SOM: Ydim = 5, 7, 9 and 200–2000 iterations, based on arrays
K-means: max cycles = 100 and k = 3, 4, 5, based on arrays

Consecutive iterations: Imax = 100

Fitness threshold: μ = 1/0.0001

Crossover rate: r = 0.8

Mutation rate: m = 0.1

Values between 1 and the true number of classes to initialize the chromosome population Set the population size: N = 100

50 independent runs

Genetic operations


Faceli et al. (2007)

Mohammadi et al. (2008)

CAMDA-2006 dataset. This dataset contains microarray, proteomics, single nucleotide polymorphisms (SNPs), and clinical data for chronic fatigue syndrome (CFS)
Applied different clustering algorithms: K-means (KM), Hierarchical Clustering (HC), Principal Component Analysis (PCA) and Heterogeneous Clustering Ensemble (HCE); the HCE method mostly agrees with the clusters classified by the two categories of clinical data
Six different consensus functions: Genetic Algorithm Clustering (GACE II), proposed in this research work, Genetic Algorithm Clustering (GACE) proposed in Azimi et al. (2007), Co-association function and Average Link (CAL), Co-association function and K-means (CK), Hypergraph Partitioning Algorithm (HGPA), Cluster-based Similarity Partitioning Algorithm (CSPA)
Mean error rate on Iris: GACE II = 12; GACE = 10; CAL = 16; CK = 21; HGPA = 14; CSPA = 11.1
Mean error rate on Soybean: GACE II = 28; GACE = 27.5; CAL = 28; CK = 35; HGPA = 27; CSPA = 26.5
Mean error rate on 08X: GACE II = 14.6; GACE = 13.5; CAL = 35; CK = 19; HGPA = 14; CSPA = 14.5
Real datasets that contain more than one possible structure: Golub dataset and Leukemia dataset

Misclustered patterns on Glass: SOM = 115; eSOM = 115; rSOM = 111

The same initial configuration, including the initial population was used for both versions

50 independent runs

Crossover manipulation within the population SM(g)

Select two parents as a couple The largest number of highly overlapped elements to fitness function F(t)

Using of SOM results in 9 and 10 unassigned patterns respectively The data partitions P1 to P N that form the clustering ensemble P, as well as their integration to form Popt

Average number of misclustered patterns for the Self Organizing Map (SOM) and Evolutionary SOM (eSOM) algorithms and compares it with the recursive SOM (rSOM) Misclustered patterns on Iris: SOM = 23; eSOM = 16; rSOM = 8 Misclustered patterns on Wine: SOM = 72; eSOM = 57; rSOM = 51

Ramanathan and Guan (2006)

Yoon et al. (2006b)

Genetic operations

Clustering accuracy

Authors

Table 2 continued


Hong and Kwong (2008)

Authors

Table 2 continued

Compare three algorithms: Genetic guided clustering algorithm without recombination operator with K-means Algorithm (GKA); Genetic-guided Clustering algorithm with Mutation and Crossover operator (GCMC); Genetic-guided clustering algorithm with the ensemble learning operator (GCEL) Datasets: X 8K 5D; Ionosphere; Promoters; Segmentation; Leukemia Accuracy for GCEL on X 8K 5D, Ionosphere, Promoter, Segmentation, Leukemia: 100, 58.89, 66.79, 87.86, 95.40% Accuracy for GCMC on X 8K 5D, Ionosphere, Promoter, Segmentation, Leukemia: 91.13, 58.89, 63.21, 87.06, 92.77% Accuracy for GKA on X 8K 5D, Ionosphere, Promoter, Segmentation, Leukemia: 94.29, 58.79, 62.69, 86.08, 94.96%

In most of the cases, MSH performed similarly or better than MUH (92.31%); MSH performed similarly or better than MUH in 100% of the cases

Best performance of algorithms on Golub: SL = 0.315; MUH = 0.315; MSH = 0.9, 0.877, 0.699, 0.315 Best performance of algorithms on Leukemia: KM = 0.677; MOCK = 0.782; The MOCLE obtained similar or better results than MOCK in 76.92% of the cases and outperforms ES in 92.31% of the cases

Simple mutation; mutation rate = 0.005

Selection operator: The tournament (2)

The above steps iterated for N rounds and a population of N clustering solutions Population size = 100

Initial population was generalized as follows: Part of all selected features and classified instances by executing K-means clustering algorithm on these selected features

Initialized by using the random subspaces method and the steady-state genetic algorithm

Randomly selected 100 instances from the data

Structures: E1 classifies the samples in Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML); E2 contains a refinement of the ALL class. E3 classifies the samples according to the institution where the samples came from: DFCI (Dana-Farber Cancer Institute), CALGB (Cancer and Leukemia Group B), SJCRH (St. Jude Children’s Research Hospital) and CCG (Children’s Cancer Group); E4 shows if the samples are from bone marrow (BM) or peripheral blood (PB)

Generate the initial population with the K-means algorithm (KM), Average Link (AL), Single Link (SL) and Shared Nearest Neighbors (SNN)

Compare two versions of MOCLE: One unsupervised (MUH) and one that take prior knowledge into account (MSH)

Genetic operations

Clustering accuracy


Accuracy on Glass: the best and second best number of clusters for each of the utilized indices; these values range between 3 and 7

Ozyer and Alhajj (2009)

Accuracy on Synthetic: as long as the total number of centroids obtained from each clustering stage is greater than 1,000, partitions of size 1,000 are created and individually clustered; best and second best number of clusters is 5 To consider algorithms: multi objective K-means genetic algorithm with divide and conquer (D&C) based on different objectives; K-means as a single run algorithm; CURE as hierarchical clustering; cluster ensembles with Hyper graph Partitioning Algorithm (HGPA); Cluster-based Similarity Partitioning Algorithm (CSPA); Meta-Clustering Algorithm (MCLA)

Accuracy on Pen Digits: not all the subsets got instances from the 10 classes; for each of the subsets, the majority of the indices agree on certain number of clusters; Both the best and second best number of clusters are given for each of the utilized validity indices, and for each of the four separateness criteria used in the genetic algorithm process;

Clustering accuracy

Authors

Table 2 continued

Population size = 100 Tournament size during the increment The no of clusters = no of items/20 (5% of the entire data set, or limited with an upper bound constant value 50 for large datasets) Probability of selection for Crossover = Pcrossover = 0.9; Single and two point crossover Pmutation = 0.05, and for the mutation itself, the allele number is not changed randomly

Genetic operations


Table 3 Average and maximum clustering accuracy (%) obtained by CEAs and GCEAs on the same datasets

Dataset        Average CEA   Average GCEA   Maximum CEA   Maximum GCEA
Iris               82.48         86.61          58.60         82.00
Wine               75.09         70.50          62.70         70.50
Soybean            79.16         76.60          69.90         76.60
08X                78.22         81.40          63.20         81.40
Half-rings         74.22         72.27          59.11         72.27
2-spiral           56.41         56.80          53.85         56.80
X8D5K              94.29         95.56          94.29         91.13
Galaxy             76.84         85.57          50.00         85.57
Biochemistry       55.63         57.37          52.60         57.37
Leukemia           94.60         94.08          94.60         92.77
Segmentation       86.08         87.47          86.08         87.08
Promoters          62.69         65.00          62.69         63.21
Ionosphere         58.79         58.89          58.79         58.89

Fig. 8 Average and maximum clustering accuracy obtained by CEAs and GCEAs on thirteen benchmark datasets


Fig. 9 Improvement of the average and maximum clustering accuracy obtained by GCEAs over CEAs on thirteen benchmark datasets

clustering accuracy obtained by CEAs. This shows that the maximum clustering accuracy obtained by GCEAs on most datasets is better than that obtained by CEAs on the same datasets.

Figure 9 presents the improvement rates of the average and maximum clustering accuracy obtained by GCEAs over CEAs. The improvements in average clustering accuracy obtained by GCEAs over CEAs are 4.13, 3.18, 3.79, 1.27, 8.73, 1.74, 1.39, 2.31, and 0.1% on the Iris, 08X, 2-spiral, X8D5K, Galaxy, Biochemistry, Segmentation, Promoters, and Ionosphere datasets, respectively. Similarly, as shown in Fig. 9, the differences in average clustering accuracy between GCEAs and CEAs are −4.59, −2.56, −0.23, and −0.52% on the Wine, Soybean, Half-rings, and Leukemia datasets, respectively. This shows that the average clustering accuracy obtained by GCEAs is often better than that obtained by CEAs. Meanwhile, the improvements in maximum clustering accuracy obtained by GCEAs over CEAs are 23.4, 7.8, 6.7, 18.2, 13.16, 2.95, 35.57, 4.77, 1, 0.52, and 0.1% on the Iris, Wine, Soybean, 08X, Half-rings, 2-spiral, Galaxy, Biochemistry, Segmentation, Promoters, and Ionosphere datasets, respectively. The differences in maximum clustering accuracy between GCEAs and CEAs are −3.16 and −1.83% on the X8D5K and Leukemia datasets, respectively. This shows that the maximum clustering accuracy obtained by GCEAs is often better than that obtained by CEAs.

Based on the average and maximum percentages of clustering accuracy, we demonstrated that GCEAs achieve higher clustering accuracy than CEAs. In other words, using genetic algorithms in clustering ensemble algorithms is able to improve clustering accuracy.
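The improvement rates plotted in Fig. 9 are simply the per-dataset differences between the GCEA and CEA columns of Table 3. As a small, self-contained illustration of that arithmetic (the dictionary below only re-enters the Table 3 values; it is not code from the surveyed works):

```python
# Average and maximum clustering accuracy (%) from Table 3:
# (average CEA, average GCEA, maximum CEA, maximum GCEA)
table3 = {
    "Iris":         (82.48, 86.61, 58.60, 82.00),
    "Wine":         (75.09, 70.50, 62.70, 70.50),
    "Soybean":      (79.16, 76.60, 69.90, 76.60),
    "08X":          (78.22, 81.40, 63.20, 81.40),
    "Half-rings":   (74.22, 72.27, 59.11, 72.27),
    "2-spiral":     (56.41, 56.80, 53.85, 56.80),
    "X8D5K":        (94.29, 95.56, 94.29, 91.13),
    "Galaxy":       (76.84, 85.57, 50.00, 85.57),
    "Biochemistry": (55.63, 57.37, 52.60, 57.37),
    "Leukemia":     (94.60, 94.08, 94.60, 92.77),
    "Segmentation": (86.08, 87.47, 86.08, 87.08),
    "Promoters":    (62.69, 65.00, 62.69, 63.21),
    "Ionosphere":   (58.79, 58.89, 58.79, 58.89),
}

for name, (avg_cea, avg_gcea, max_cea, max_gcea) in table3.items():
    # positive values mean the genetic-guided ensembles (GCEAs) outperform the plain CEAs
    print(f"{name:13s}  avg improvement: {avg_gcea - avg_cea:+6.2f}%"
          f"  max improvement: {max_gcea - max_cea:+6.2f}%")
```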

7 Conclusions and future works

Clustering ensembles have emerged as a prominent method for improving the robustness, stability and accuracy of unsupervised classification solutions. So far, numerous works have contributed to finding consensus clusterings. In this review paper, we first introduced clustering ensembles and their challenges, and then briefly introduced clustering ensemble algorithms. There are several challenges in clustering ensembles, and one of the major problems is finding the consensus function.

An original contribution of this paper is its discussion of key issues in genetic algorithms and their components, including the genotype, population initialization, fitness function, selection operator, crossover operator, mutation operator, and replacement, for the partition combination problem in clustering ensembles. In particular, for the mutation and crossover operators, which are commonly described in GCEAs, special emphasis was given to genetic operators designed specifically for the clustering ensemble problem. Apart from the introduction, this paper also discussed the features, advantages, disadvantages, and future works of research on clustering ensembles.

We compared the clustering accuracy of CEA and GCEA clustering algorithms on the same datasets. We investigated the clustering accuracy on benchmark datasets from the UCI repository, artificial datasets, and real-world datasets. The average and maximum clustering accuracy obtained by CEAs and GCEAs on the same datasets were compared. The comparison results show that the average and maximum clustering accuracy obtained by GCEAs on most datasets is better than that obtained by CEAs on the same datasets. The improvement rates of the average and maximum clustering accuracy obtained by GCEAs over CEAs were also presented. They show that the clustering accuracy obtained by GCEAs is usually better than that obtained by CEAs. Therefore, using genetic algorithms in clustering ensembles is able to improve the overall clustering accuracy.

Through the investigation and comparison of the suggested genetic-guided clustering ensemble algorithms in current research works, several open questions were identified for future research. In most of the references on genetic algorithms for clustering ensembles, only the quality of the partitions and of the partition combination algorithms is of concern, whereas little attention has been given to population initialization methods. We believe this will be a critical issue for the clustering invalidity and context insensitivity problems. Finally, we also foresee computational efficiency as a critical issue when researchers need to handle serious large-scale data clustering ensembles in the near future.

References

Analoui M, Sadighian N (2006) Solving cluster ensemble problems by correlation's matrix & GA. IFIP Int Fed Inf Process 228:227–231
Azimi J, Abdoos M, Analoui M (2007) A new efficient approach in clustering ensembles. In: Proceedings of the 8th international conference on intelligent data engineering and automated learning. Lecture Notes in Computer Science, vol 4881, pp 395–405
Azimi J, Mohammadi M, Movaghar A, Analoui M (2007) Clustering ensembles using genetic algorithm. In: The international workshop on computer architecture for machine perception and sensing, IEEE, pp 119–123
Bouchachia A (2005) Learning with hybrid data. In: Proceedings of the fifth international conference on hybrid intelligent systems. IEEE Computer Society
Chiou YC, Lan LW (2001) Genetic clustering algorithms. EJOR Eur J Oper Res 135:413–427
Coello CAC, Van Veldhuizen DA, Lamont GB (2002) Evolutionary algorithms for solving multi-objective problems. Kluwer, Norwell
Corne DW, Jerram NR, Knowles JD, Oates MJ (2001) PESA-II: region-based selection in evolutionary multi-objective optimization. In: Proceedings of the genetic and evolutionary computation conference, pp 283–290
Deb K (2001) Multi-objective optimization using evolutionary algorithms. Wiley, ISBN 047187339X
Demiriz A, Bennett KP, Embrechts MJ (1999) Semi-supervised clustering using genetic algorithms. Artif Neural Netw Eng J 809–814
Dietterich TG (1997) Machine-learning research. AI Mag J 18(4):97–136
Du J, Korkmaz E, Alhajj R, Barker K (2004) Novel clustering approach that employs genetic algorithm with new representation scheme and multiple objectives. Data Warehousing Knowl Discov J, Springer, pp 219–228
Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinf J, Oxford University Press, vol 19, no 9, pp 1090–1099
Faceli K, De Carvalho A, De Souto M (2007) Multi-objective clustering ensemble with prior knowledge. Adv Bioinf Comput Biol, Springer, pp 34–45
Falkenauer E (1994) A new representation and operators for genetic algorithms applied to grouping problems. Evol Comput 2:123–144
Falkenauer E (1998) Genetic algorithms and grouping problems. Wiley, USA, ISBN 0471971502
Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML), vol 20, no 1, pp 186–193
Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proceedings of the 21st international conference on machine learning. ACM, p 36
Fischer B, Buhmann JM (2003) Bagging for path-based clustering. IEEE Trans Pattern Anal Mach Intell 25(11)
Fischer B, Buhmann JM (2003) Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans Pattern Anal Mach Intell 25(4)
Franti P (2000) Genetic algorithm with deterministic crossover for vector quantization. Pattern Recogn Lett J 21:61–68
Fred ALN (2001) Finding consistent cluster in data partitions. Springer, Berlin, pp 309–318
Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. Pattern Recogn J 4:835–850
Fred A, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27:835–850
Gablentz V, Koppen M, Dimitriadou E (2000) Robust clustering by evolutionary computation. In: Proceedings of the fifth online world conference on soft computing in industrial applications (WSC5)
Garai G, Chaudhuri BB (2004) A novel genetic algorithm for automatic clustering. Pattern Recogn Lett J 25:173–187
Ghaemi R, Sulaiman MN, Ibrahim H, Mustapha N (2009) A survey: clustering ensembles techniques. In: Proceedings of the international conference on computer, electrical, and systems science, and engineering (CESSE), vol 38, pp 644–653
Handl J, Knowles J (2005) Exploiting the trade-off—the benefits of multiple objectives in data clustering. In: Proceedings of the third international conference on evolutionary multi-criterion optimization. Springer, pp 547–560
Handl J, Knowles J (2006) Multi-objective clustering and cluster validation. Multi Object Mach Learn J, Springer, pp 12–47
Haupt RL, Haupt SE (1998) Practical genetic algorithms. Wiley, ISBN 0-471-45565-2
Hong Y, Kwong S (2008) To combine steady-state genetic algorithm and ensemble learning for data clustering. Pattern Recogn Lett J, Elsevier, vol 29, no 9, pp 1416–1423
Hong Y, Kwong S, Chang Y, Ren Q (2008) Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recogn Soc 41(9):2742–2756
Hruschka ER, Campello RJGB, Freitas AA, De Carvalho A (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern C Appl Rev 39(2):133–155
Jain AK, Murty MN, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Jones DR, Beltramo MA (1991) Solving partitioning problems with genetic algorithm. In: Proceedings of the fourth international conference on genetic algorithms. California University, Morgan Kaufmann Publishers, pp 442–449
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 359–392
Kellam P, Liu X, Martin NJ, Orengo C, Swift S, Tucker A (2001) Comparing, contrasting and combining clusters in viral gene expression data. In: Proceedings of the sixth workshop on intelligent data analysis in medicine and pharmacology, pp 56–62
Krishna K, Murty M (2002) Genetic K-means algorithm. IEEE Trans Syst Man Cybern B 29(3):433–439
Kuncheva LI, Bezdek JC (2002) Nearest prototype classification: clustering, genetic algorithms or random search? IEEE Trans Syst Man Cybern C Appl Rev 28(1):160–164
Kuncheva LI, Hadjitodorov ST, Todorova LP (2006) Experimental comparison of cluster ensemble methods. In: Proceedings of FUSION, Citeseer, pp 105–115
Lu Y, Li S, Fotouhi F, Deng Y, Brown SJ (2004) Incremental genetic K-means algorithm and its application in gene expression data analysis. BMC Bioinform J 5(1):172
Luo H, Jing F, Xie X (2007) Combining multiple clusterings using information theory-based genetic algorithm. In: International conference on computational intelligence and security, IEEE, vol 1, pp 84–89
Martnez-Otzeta JM, Sierra B, Lazkano E, Astigarraga A (2006) Classifier hierarchy learning by means of genetic algorithms. Pattern Recogn Lett J, Elsevier, vol 27, no 16, pp 1998–2004
Minaei-Bidgoli B, Topchy A, Punch WF (2004) A comparison of resampling methods for clustering ensembles. In: Proceedings of the international conference on machine learning: models, technologies and applications, Michigan State University, Citeseer
Mitra S (2004) An evolutionary rough partitive clustering. Pattern Recogn Lett J 25:1439–1449
Mohammadi M, Nikanjam A, Rahmani A (2008) An evolutionary approach to clustering ensemble. In: Fourth international conference on natural computation, IEEE, vol 3, pp 77–82
Ng A, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 849–856
Ozyer T, Alhajj R (2009) Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer. Appl Intell J, Springer, vol 31, no 3, pp 318–331
Qian Y, Suen CY (2000) Clustering combination method. In: Proceedings of the fifteenth international conference on pattern recognition, vol 2, pp 732–735
Ramanathan K, Guan SU (2006) Recursive self-organizing maps with hybrid clustering. In: IEEE conference on cybernetics and intelligent systems, pp 1–6
Sheng W, Tucker A, Liu X (2004) Clustering with niching genetic K-means algorithm. In: Proceedings of the genetic and evolutionary computation conference, Springer, pp 162–173
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining partitionings. In: Proceedings of the 11th national conference on artificial intelligence, pp 93–98
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Mach Learn Res J 3:583–617
Topchy A, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Proceedings of the third IEEE international conference on data mining (ICDM), pp 331–338
Topchy A, Jain AK, Punch WF (2004a) A mixture model for clustering ensembles. In: Proceedings of the SIAM international conference on data mining, Michigan State University
Topchy A, Minaei-Bidgoli B, Jain AK, Punch WF (2004b) Adaptive clustering ensembles. Pattern Recogn J 1:272–275
Topchy A, Jain AK, Punch WF (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1881
Vavak F, Fogarty TC (1996) Comparison of steady-state and generational genetic algorithms for use in nonstationary environments. Lecture Notes in Computer Science, Springer
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3)
Yoon HS, Ahn SY, Lee SH, Cho SB, Kim JH (2006a) Heterogeneous clustering ensemble method for combining different cluster results. Data Min Biomed Appl J, Springer, pp 82–92
Yoon HS, Lee SH, Cho SB, Kim JH (2006b) A novel framework for discovering robust cluster results. Discov Sci, Springer, pp 373–377
Yoon HS, Lee SH, Cho SB, Kim JH (2006c) Integration analysis of diverse genomic data using multi-clustering results. Biomed Med Data Anal J, Springer, pp 37–48
