Knowl Inf Syst (2013) 35:111–130 DOI 10.1007/s10115-012-0542-5 REGULAR PAPER

Finding best algorithmic components for clustering microarray data

Milan Vukićević · Kathrin Kirchner · Boris Delibašić · Miloš Jovanović · Johannes Ruhland · Milija Suknović

Received: 22 September 2011 / Revised: 30 March 2012 / Accepted: 14 August 2012 / Published online: 6 September 2012
© Springer-Verlag London Limited 2012

M. Vukićević (B) · B. Delibašić · M. Jovanović · M. Suknović
Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade, Serbia
e-mail: [email protected]

K. Kirchner · J. Ruhland
Faculty of Economics and Business Administration, Friedrich Schiller University of Jena, Carl-Zeiß-Straße 3, Jena, Germany

Abstract  The analysis of microarray data is fundamental to microbiology. Although clustering has long been recognized as central to the discovery of gene functions and to disease diagnostics, researchers have found the construction of good algorithms a surprisingly difficult task. In this paper, we address this problem by using a component-based approach to clustering algorithm design for class retrieval from microarray data. The idea is to break existing algorithms up into independent building blocks for typical sub-problems, which are in turn reassembled in new ways to generate yet unexplored methods. As a test, 432 algorithms were generated and evaluated on published microarray data sets. We found their top performers to be better than the original, component-providing ancestors and also competitive with a set of recently proposed algorithms. Finally, we identified components that showed consistently good performance for clustering microarray data and that should be considered in the further development of clustering algorithms.

Keywords  Clustering · Microarray data · Component-based algorithms · Bioinformatics

1 Introduction

The molecular state of a cell, measured by the expression of its genes, can be investigated using DNA microarrays. The gene expression data sets used in this paper contain measurements of increasing and decreasing expression levels of a gene set over time points, tissue samples or patients. They are represented as a matrix of numeric values. A typical microarray data set is special for data mining in that it has a small number of samples (records), where the number
of fields, representing the number of genes (attributes), is very large [48]. The analysis of such data is, therefore, a complex problem [58].

Cluster analysis of DNA microarray data is essential for identifying biologically relevant groups of genes [21]. But selecting an effective clustering method for a particular set of gene expression data is a difficult task [34], and general conclusions about the best algorithms cannot be made [29], because clustering performance depends on intrinsic data set characteristics. Thus, Quackenbush [51] states that choosing an appropriate algorithm is a crucial element of experimental design. In particular, representative-based clustering algorithms are often successfully used for clustering microarray data as single algorithms (e.g., [6,27]), as improved algorithms (e.g., [27]) or as part of consensus schemas (e.g., [28,44,71]). Although there are some recommendations for algorithm selection for clustering biological data [3], there is no consensus about the best algorithm for such a hard clustering task.

In this regard, the component-based approach to clustering algorithms could provide a promising alternative direction. The component-based approach, as suggested by Delibašić et al. [20], allows the design of a plethora of representative-based clustering algorithms (432 are analyzed in this paper) that are composed of different parts and improvements of existing algorithms from this family. Component-based clustering has already been used in the area of microarray data analysis [65], but the main focus there was the identification of adequate internal evaluation measures. In this paper, we examine whether the interchange of good problem solution ideas between existing algorithms can produce better-performing algorithms for clustering microarray data. Further, we compare representative-based algorithms designed from components with other types of clustering algorithms recently proposed in the literature. The main contributions of this paper are as follows:

• Application and detailed analysis of RC-based clustering performance on 17 microarray data sets and comparison with other approaches from the literature;
• Proposal of a method for the identification of combinations of components that constitute well-performing algorithms for clustering microarray data and that should be considered in further development of these algorithms;
• Extension of the framework for component-based clustering algorithm design proposed in [20] (the differences are explained in detail in Sect. 3).

The paper is organized in the following manner. In the next section, we review relevant literature to provide a background for our research. Section 3 describes the component-based approach for clustering. Section 4 presents the results of applying the component-based approach to microarray data. Our findings are summarized in Sect. 5.

2 Related work

Clustering is a fundamental problem in different areas of computer science such as machine learning, data mining and pattern recognition [24] and is used in different application fields such as web mining [66], document grouping [6,14], data stream clustering [16], geology [30] and bioinformatics [6,8].

In this section, we give a brief overview of recently proposed methods for clustering microarray data that relate to our experimental evaluation. Good reviews of the state of the art in this area can be found in [71] and [11]. Xu and Wunsch [71] exhaustively reviewed the usage and evaluation of clustering algorithms in biomedical research. They also
emphasize the importance of finding the right algorithms for different biomedical applications. Belacel et al. [11] reviewed clustering techniques used in microarray gene expression data analysis. They distinguish between simple clustering, where each gene belongs to only one cluster, and complex clustering, where a gene can belong to more than one cluster with a certain degree of membership. In this paper, we concentrate on the first group and use components from well-known algorithms like K-means [33] and other representative-based cluster algorithms like K-means++ [5] or G-means [31] to build and evaluate new representative-based, K-means-like algorithms.

K-means is a commonly known, straightforward representative-based cluster algorithm that was identified as one of the top 10 data mining algorithms by Wu et al. [69], and it is often used for clustering microarray data. Kumar and Wasan [37] give a comparative analysis of K-means-based algorithms on microarray data. Dhiraj and Rath [22] performed an analysis of K-means behavior on gene expression data. Bayá and Granitto [9] analyzed the influence of distance measures on PAM and hierarchical clustering algorithms and proposed a new distance measure based on the nearest-neighbors algorithm. Also, Baralis et al. [8] propose a new measure for gene similarity based on the genes' capability to separate samples belonging to different classes. Giancarlo et al. [29] analyzed the influence of different distance measures on the performance of hierarchical and K-means clustering algorithms and suggested the Pearson, Cosine and Euclidean distances as the most appropriate for clustering microarray data.

Cheung [15] proposed a generalized k-means algorithm called k*-means. It is based on rival penalized competitive learning (RPCL), and the generalization is achieved in the sense that users do not have to predetermine the correct cluster number (only an upper bound). Still, all algorithmic components are fixed (i.e., it uses only one way to initialize clusters). The generic algorithm proposed in [20] and extended in this research demands predetermination of the cluster number (the algorithm has to be restarted for each candidate number), and intuitively it would be beneficial to combine these two techniques. This would enable an automatic search for an optimal cluster number together with the simple design of many clustering algorithms, but this integration is out of the scope of this paper. The approach from [20] and the extensions from this research are explained in Sect. 3.

Besides the well-known K-means algorithm, other clustering methodologies have also been used on microarray data. Thalamuthu et al. [63] give a comprehensive experimental evaluation of six popular clustering algorithms applied to microarray data. Nascimento et al. [46] used the GRASP algorithm (greedy randomized adaptive search procedure) for finding partitions in microarray data and examined the influence of different distance metrics on GRASP. Iam-on et al. [34] developed a link-based cluster ensemble method for improved gene expression data analysis. Ayadi et al. [6] proposed BicFinder, a biclustering algorithm for microarray data analysis. Moise et al. [43] gave a systematic evaluation of subspace and projected clustering techniques under a wide range of experimental settings, with a focus on high-dimensional data. Wu et al. [68] showed the importance of applying multiple cluster algorithms to discover relevant biological patterns.

Monti et al. [44] proposed a framework for consensus-based clustering of microarray data that uses resampling schemes and single clustering algorithms for producing multiple clustering solutions. They showed that consensus algorithms give more stable solutions and better-quality partitions than single algorithms without resampling. This research was followed by [73], who proposed an improved graph-based consensus clustering algorithm, and [28], who developed a method for speeding up consensus-based clustering. Pirim et al. [50] used a consensus clustering approach for obtaining a consensus partition by merging different partitions gathered from individual cluster algorithms. The ensemble
clustering result was tested on microarray data sets and showed better results than the individual algorithms.

Another idea is to examine alternative clusterings. While clustering a gene data set, grouping genes based on their functions or on their structures may be equally useful. The goal of alternative clustering is therefore to generate different clusterings, so that the user can view the data from different perspectives and explore new hypotheses [17]. Based on the user's perspective, De Bie [18] proposes a general data mining framework. It models the user's state of mind and guides the data mining process to reduce the user's uncertainty about the data by formalizing pattern set mining as well as iterative and interactive data mining. Co-clustering is also used on gene expression data [59,72]. It allows simultaneous clustering of the rows and columns of a data matrix. In overlapping clustering, an object can belong to more than one cluster simultaneously; for example, genes can participate in different metabolic pathways [12].

Another popular class of clustering algorithms comprises the density-based algorithms [25]. This class of algorithms solves one of the main problems of the original K-means algorithm: the identification of clusters of arbitrary shape. DBScan [56] and OPTICS [4] are among the most popular algorithms in this class. Still, density-based algorithms are not well suited to high-dimensional data, since the relative distances between objects make it harder to perform density analysis [3]. Additionally, DBScan identifies a noise cluster, while representative-based algorithms do not, and because of that, it is hard to compare clustering results [52]. DBScan results cannot be evaluated with external evaluation measures (which are the basis of our experimental evaluation). Because of the aforementioned problems, we did not include this class of algorithms in our experimental evaluation. On the other hand, Fuzzy C-means (FCM) [10] has been successfully applied to microarray data clustering [21] and has a structure compatible with the generic algorithm used in this paper, so we included this algorithm in our experimental evaluation.

3 Component-based clustering approach

Every clustering algorithm is composed of several solutions for specific sub-problems (e.g., the initialization of representatives) in the clustering process. These solutions are assembled into a clustering algorithm. Solutions occurring in one algorithm frequently occur in other clustering algorithms as well (e.g., random initialization). The reusable component (RC) design approach proposes decomposing algorithms into sub-problems that can be solved using different RCs [20]. RCs are well-documented, frequently occurring solutions for specific sub-problems in a family of algorithms (in this case, representative-based clustering). This allows the parts of algorithms discussed in the literature to be reused for reconstructing the original algorithms or for designing new RC-based algorithms by assembling RCs.

In this paper, we apply the component-based approach to clustering microarray data. For the experimental evaluation, we use RCs to construct and evaluate a large number of representative-based algorithms on microarray data. The clustering algorithms in our study are designed by varying RCs for the following four sub-problems that typically occur in representative-based algorithms:

(1) initializing representatives;
(2) measuring distance;
(3) updating representatives; and
(4) evaluating clusters.

Note that here step (4; "evaluating clusters") is used during the execution of the algorithms (as explained later in this section), and it was not included in the framework presented in [20]. The resulting algorithms span the space of algorithms that can be formed by reusing parts of various representative-based (K-means-like) algorithms from the literature. For each sub-problem, several RCs can be defined. Here, we give a brief description of the RCs that are used as solutions for the clustering sub-problems in the experiments in this paper. RCs that were not included in the framework from [20] are marked with *.

In particular, for the initialization of cluster representatives, we considered six RCs:

(1) random initialization (RANDOM) [39];
(2) *the initialization from the SPSS k-means implementation (SPSS) [32];
(3) *hierarchical binary clustering where new representatives "are moved a distance proportional to the size of the region in opposite directions along a randomly chosen vector" (XMEANS) [47];
(4) *hierarchical binary clustering utilizing the data's main principal component and the corresponding eigenvalue (GMEANS) [31];
(5) *hierarchical binary clustering based on principal component analysis (PCA) [23]; and
(6) a distance-based seeding distribution (KMEANS++) [5].

The following four distances (similarities) were considered in this study as RCs:

(1) Euclidean distance (EUCLIDEAN);
(2) City block distance (CITY);
(3) *Correlation similarity (CORREL); and
(4) *Cosine similarity (COSINE).
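As an illustration of how interchangeable such RCs are in practice, the following minimal sketch expresses the four distance (similarity) RCs as plain functions with a common signature. This is our own illustrative code, not the implementation used in the paper; the similarities are turned into dissimilarities so that smaller values always mean "closer":

```python
import numpy as np

def euclidean(x, r):                      # EUCLIDEAN
    return float(np.linalg.norm(x - r))

def city_block(x, r):                     # CITY
    return float(np.abs(x - r).sum())

def correlation(x, r):                    # CORREL: 1 - Pearson correlation
    return 1.0 - float(np.corrcoef(x, r)[0, 1])

def cosine(x, r):                         # COSINE: 1 - cosine similarity
    return 1.0 - float(x @ r / (np.linalg.norm(x) * np.linalg.norm(r)))

# Any of these can be plugged into a representative-based algorithm unchanged.
DISTANCE_RCS = {"EUCLIDEAN": euclidean, "CITY": city_block,
                "CORREL": correlation, "COSINE": cosine}
```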

We used the following three RCs for the update of representatives:

(1) mean (MEAN) [33];
(2) *median (MEDIAN) [36]; and
(3) online update (ONLINE) [13], which updates the representatives each time an object is assigned to a cluster, as opposed to MEAN and MEDIAN, where the representatives are updated only once all objects have been assigned to their clusters.

Sub-problem (4; "evaluating clusters") is used during the execution of the algorithms and during initialization for all RCs that use binary hierarchical division of clusters. The integration of internal cluster evaluation measure RCs into the algorithms influenced the retrieved cluster models, as shown in the experimental section of this paper. Here, we describe the "evaluating clusters" RCs that are used in the experiments:

(1) Compactness (COMPACT) is commonly measured by the overall deviation, that is, the sum of distances of items to their corresponding cluster representatives. It shows good performance in detecting spherical shapes in Euclidean space and well-separated data structures, but lacks robustness for arbitrary data shapes.
(2) *The XB-index (XB) [70] is the ratio of overall deviation to cluster separation. It can therefore detect hyper-spherically shaped clusters well but, similarly to compactness, it does not perform well on arbitrarily shaped clusters.
(3) *Connectivity (CONN) shows whether neighboring items are in the same cluster. This measure efficiently identifies arbitrarily shaped clusters (since it captures local densities), but it is not robust toward overlapping clusters or clusters with small spatial separation.
(4) *The Global silhouette index (SILHOU) [54] checks whether every instance's current cluster is more appropriate than the neighboring one. This index is normalized to the [−1, 1] scale, which allows easy comparison between clusterings on different data sets.
(5) AIC [2] and BIC [57] are closely related information-theoretic measures used for model selection among a class of parametric models with different numbers of parameters, based on maximum likelihood estimation. These measures trade off distortion against model complexity and show good performance in determining the true number of clusters (e.g., in [47]).

As the stop criterion, we used membership stability [15] in all algorithms.

The sub-problems and their corresponding solutions described above can be assembled into a generic representative-based clustering algorithm [20]. With this generic algorithm, several kinds of flexibility are achieved for representative-based clustering: users can decide among several available RCs for resolving each clustering sub-problem in an algorithm. By combining RCs in the generic clustering algorithm, the original algorithms can be reproduced, but new algorithms can also be created. For example, the K-means algorithm can be reconstructed using the RC sequence RANDOM-EUCLIDEAN-MEAN-COMPACT. An example of a new algorithm would be PCA-COSINE-ONLINE-SILHOU, which uses PCA to "initialize representatives," COSINE to "measure distance," ONLINE to "update representatives" and SILHOU to "evaluate clusters." In this paper, we examined the synergistic influence of components from different sub-problems (e.g., "initializing representatives" or "measuring distance") on algorithm performance in the area of microarray data clustering.
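To make the assembly idea concrete, here is a minimal, self-contained sketch of such a generic representative-based loop with pluggable components. It is our own illustration under simplifying assumptions (the evaluation RC is applied only to the final partition, whereas in the paper it also steers the hierarchical initialization RCs, and empty clusters are not handled); it is not the authors' implementation:

```python
import numpy as np

def generic_clustering(X, k, init, distance, update, evaluate, max_iter=100):
    """Generic representative-based clustering: each sub-problem is one pluggable RC."""
    reps = init(X, k)                                  # (1) initialize representatives
    labels = None
    for _ in range(max_iter):
        d = distance(X, reps)                          # (2) measure distance, shape (n, k)
        new_labels = d.argmin(axis=1)                  # assign each object to its nearest representative
        if labels is not None and np.array_equal(new_labels, labels):
            break                                      # stop criterion: membership stability [15]
        labels = new_labels
        reps = update(X, labels, k)                    # (3) update representatives
    return labels, reps, evaluate(X, labels, reps)     # (4) evaluate clusters

# RC sequence RANDOM-EUCLIDEAN-MEAN-COMPACT, i.e., a reconstruction of K-means:
rng = np.random.default_rng(0)
random_init = lambda X, k: X[rng.choice(len(X), size=k, replace=False)]
euclidean   = lambda X, R: np.linalg.norm(X[:, None, :] - R[None, :, :], axis=2)
mean_update = lambda X, y, k: np.array([X[y == j].mean(axis=0) for j in range(k)])
compactness = lambda X, y, R: float(np.linalg.norm(X - R[y], axis=1).sum())

X = rng.normal(size=(60, 600))                         # toy stand-in for a microarray matrix
labels, reps, score = generic_clustering(X, 3, random_init, euclidean, mean_update, compactness)
```

Swapping in, say, a PCA-based initialization or the correlation distance is then a one-argument change to the same call, which is exactly the kind of interchange explored in the experiments below.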

4 Experimental evaluation

4.1 Experimental setting

The aim of the experiments was to provide evidence as to whether component reuse and the interchange of good problem solution ideas between algorithms can lead to performance improvements in representative-based clustering algorithms. We designed 432 RC-based cluster algorithms for the experimental evaluation. These algorithms were built by combining RCs from the four sub-problems described in the previous section (6 × 4 × 3 × 6 = 432). Since the data sets used already contain information about the number of classes, we set K equal to the number of classes, as is common in the literature for class retrieval problems [1,26]. Additionally, we varied the parameter K in the interval between 2 and 2K, but the evaluation measures used in this research showed the best results when the true K was set. All algorithms were run 10 times, and the reported results are averages over these runs.

Besides testing the main hypothesis, that is, whether a newly built RC-based clustering algorithm can lead to better results on a given data set than common, well-known clustering algorithms, we also examined the influence of single RCs and of combinations of RCs on every data set and on all data sets on average.

In order to evaluate the suggested approach, we measured cluster quality with external validation measures, namely the adjusted mutual information (AMI) [64] and the adjusted Rand index (ARI) [40]. Both indices compare two clustering solutions. If the resulting partitions are identical, the value of both indices is 1, and if the partitions are completely different from one another, the indices have an expected value of 0.
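Both measures are available in standard libraries; for instance, a quick check with scikit-learn (our own example, not part of the paper's setup) shows how a permuted but otherwise identical labeling scores 1 under both indices:

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

true_classes   = [0, 0, 1, 1, 2, 2]   # known class labels of the samples
found_clusters = [1, 1, 0, 0, 2, 2]   # partition produced by a clustering algorithm

print(adjusted_rand_score(true_classes, found_clusters))          # 1.0: same partition, labels permuted
print(adjusted_mutual_info_score(true_classes, found_clusters))   # 1.0
```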

External indices for the evaluation of clustering results have recently attracted increasing interest [7]. It has been shown that, besides metric properties, external indices should also be normalized and adjusted for chance. This is necessary because unadjusted measures give better results as the number of clusters increases (even when the number of constructed clusters is larger than the true number of clusters). Vinh et al. [64] conducted an exhaustive comparison of a number of information-theoretic and pair-counting measures and recommended adjusted mutual information (AMI) as a "general purpose" measure for clustering validation, comparison and algorithm design. It is shown in [64] that AMI satisfies the normalization and metric properties and that it identifies the true number of clusters better than the other measures (demonstrated on a number of simulated and real-world data sets). Moreover, in [64], AMI showed cutting-edge performance on 8 microarray data sets that are also used in this paper. This is why we use this measure in the following experiments. Additionally, we report the ARI in order to compare our results with already published work (since we are not aware of any research on clustering microarray data that reports AMI as a validation measure).

4.2 Data sets

Experiments were conducted on 5 simulated and 17 real-world microarray data sets. The data sets were taken from the studies [29,44,45,71], because they are used there for the validation of clustering performance, so we could compare the results. The data sets and a brief description are given in Table 1. A common characteristic of the data sets analyzed in this paper is that they have a large number of attributes and a small number of samples. This fact is often addressed in the literature as the major problem in clustering microarray data, since the clustering results are sensitive to noise and susceptible to overfitting [44]. So, the main idea of the paper was to identify which components (and algorithms) can deal with data sets that share this characteristic. The original sources and more detailed descriptions can be found in the referenced papers.

4.3 Results

4.3.1 Comparison with well-known algorithms

In the first experiment, we compared the newly developed RC-based algorithms with well-known algorithms that are reconstructed by combining RCs. The benchmark algorithms were K-means (RANDOM-EUCLIDEAN-MEAN-COMPACT), K-medians (RANDOM-EUCLIDEAN-MEDIAN-COMPACT), K-means++ (KMEANS++-EUCLIDEAN-MEAN-COMPACT) and G-means (GMEANS-EUCLIDEAN-MEAN-COMPACT). In this way, we provided a fair testing environment, as advocated in [62], on the same platform, where all components have the same implementation. We evaluated the RC-based algorithms on the artificial data sets to see whether the interchange of components can give better results than the original algorithms. The last two columns in Table 2 show the AMI values of the best and worst RC-based algorithms. On every data set, the best performance was achieved with a combination of RCs that differs from the original algorithms. The best algorithms for every data set are shown in Table 3. On the "Gaussian3" and "Simulated6" data sets, FCM and a number of RC-based algorithms achieved the same performance (all of them found the true partitions), but on every other data set, a different combination of RCs gave the best performing algorithm.

Table 1  Data sets for evaluation

Reference   Dataset             No. of classes   No. of attributes   No. of samples
[44,73]     Gaussian3           3                600                 60
            Gaussian 5 delta2   5                2                   500
            Gaussian 5 delta3   4                2                   400
[44]        Gaussian 4          6                600                 60
            Simulated6          3                600                 60
[44,73]     Novartis            4                1,000               103
            Normal              13               1,277               90
            Leukemia            3                999                 38
            Lung cancer         4+               1,000               197
            StJude              6                985                 248
            CNSTumors           5                1,000               48
[46]        BreastA             3                1,213               98
            BreastB             4                1,213               49
            DBLCLA              3                661                 141
            DBLCLB              3                661                 180
            MultiA              4                5,565               103
[29]        CNSRat              6                17                  112
            Leukemia (small)    3                100                 38
            Lymphoma            3                100                 80
            NCI60               8                200                 57
            PBM                 18               139                 2,329
            Yeast_cell          5                72                  698

Table 2  Comparison of component-based with well-known algorithms

Dataset/algorithm   K-means   K-medians   K-means++   G-means   FCM     Best    Worst
Gaussian3           0.662     0.755       0.720       0.565     1       1       0.192
Gaussian4           0.867     0.773       0.869       0.864     0.857   0.883   0.400
Gaussian5 delta2    0.665     0.673       0.678       0.681     0.671   0.700   0.456
Gaussian5 delta3    0.926     0.916       0.924       0.928     0.922   0.928   0.523
Simulated6          0.743     0.787       0.779       0.215     1       1       0.148

Best AMI values are shown in bold

Table 3  Best RC-based algorithms on artificial data sets

Dataset            Algorithms
Gaussian3          166 algorithms
Gaussian4          GMEANS-EUCLIDEAN-MEAN-XB, PCA-EUCLIDEAN-MEAN-XB
Gaussian5 delta2   GMEANS-CITY-MEAN-CONN
Gaussian5 delta3   PCA-EUCLIDEAN-MEAN-COMPACT
Simulated6         26 algorithms


Table 4  Comparison with results from [29]

Dataset            K-means (9 distances)   ALink (9 distances)   CLink (9 distances)   FCM     Best RC-based algorithm
CNSRat             0.266                   0.233                 0.221                 0.223   0.279
Leukemia (small)   0.919                   0.919                 0.919                 0.910   1
Lymphoma           0.591                   0.678                 0.603                 0.601   0.591
NCI60              0.442                   0.498                 0.451                 0.577   0.5138
PBM                0.444                   0.578                 0.589                 0.422   0.449
Yeast_cell         0.496                   0.558                 0.424                 0.424   0.541

Best ARI values are shown in bold

Table 5  Comparison of results from [46]

Dataset    GRASP (Euclidean)   GRASP (city block)   GRASP (Cosine)   GRASP (Pearson)   FCM     Best RC-based alg.
BreastA    0.682               0.682                0.686            0.692             0.628   0.773
BreastB    0.626               0.228                0.626            0.694             0.384   0.483
DBLCLA     0.408               0.800                0.605            0.585             0.664   0.958
DBLCLB     0.481               0.700                0.502            0.527             0.389   0.794
MultiA     0.874               0.899                0.805            0.828             0.901   0.724
Novartis   0.92                0.921                0.920            0.920             0.876   0.966

Best ARI values are shown in bold

These results gave an impulse for the further examination of the performance of RC-based algorithms on real-world data.

4.3.2 Comparison with results from the literature

The second experiment aims to show that representative-based clustering algorithms designed with RCs are competitive with the results of other clustering algorithms recently reported in the literature. We compared the results with different types of clustering algorithms, namely consensus-based, hierarchical and meta-heuristic algorithms. The comparison is made with ARI values, because these are what is reported in the literature.

First, we compared the results with the K-means and hierarchical algorithms (average linkage, ALink, and complete linkage, CLink) reported in [29]. In every algorithm, 9 different distance metrics were varied. Table 4 shows the maximum values for every algorithm, and the last column shows the values of the best RC-based algorithm. The results show that RC-based algorithms had the best performance on 2 data sets, the ALink algorithm on 2, the CLink algorithm on 1 and FCM on 1 data set. On the "Lymphoma" data set, ALink gave the best result. A closer examination of the results from Giancarlo et al. [29] showed that this result was achieved with the Mahalanobis (d3) distance. The same distance measure gave the best result with CLink on the PBM data set. Still, this distance measure was not among the RCs used in the 432 component-based algorithms. This indicates that adding new RCs could be beneficial for further improving the performance of clustering algorithms on microarray data.

The RC-based algorithms were also compared with the GRASP algorithm, which was tested with 4 different distance metrics by [46]. Table 5 shows the ARI values of the GRASP algorithm with every distance metric and the ARI values of the best RC-based algorithm.

Table 6  Comparison with results from [44,73]

Dataset       GCCcorr   GCCKmeans   CChc    CCsom   FCM     Best RC-based algorithm
CNSTumors     0.658     0.718       0.549   0.429   0.452   0.733
Leukemia      0.831     0.831       1       0.721   0.910   1
Lung cancer   0.544     0.562       0.31    0.233   0.651   0.920
Novartis      x         x           0.921   0.897   0.877   0.958
Normal        x         x           0.572   0.487   0.415   0.617
St.Jude       0.873     0.86        0.948   0.825   0.952   0.955

Best ARI values are shown in bold

The generic clustering algorithm was able to find better cluster solutions on 4 out of 6 data sets. It is interesting to note that on the "DBLCLA" data set, the best result of the GRASP algorithm (0.8) was achieved with the city block distance, while the generic algorithm found its best clustering (0.96) with SPSS-CORREL-ONLINE-AIC. On the "Novartis" data set, GRASP gave similar results for all metrics (0.92), and KMEANS++-EUCLIDEAN-ONLINE-COMPACT reached 0.96. On DBLCLB, GRASP showed 0.7, and PCA-CORREL-MEAN-XB showed 0.79. On "BreastB" and "MultiA," the best results were achieved by GRASP with the "Pearson" distance and by FCM, respectively. The results of the two previous experiments show that there is a need for a deeper analysis of the synergistic effect of components from different sub-problems. This will be discussed in the next section.

Ensemble clustering is a very popular method in microarray data analysis and has shown better results than the single clustering algorithms it uses for building a consensus [44,73]. In Table 6, we compare our results with the graph-based consensus clustering (GCC) methods that use correlation clustering and K-means (GCCcorr and GCCKmeans) from [73] and with consensus clustering based on SOM and hierarchical clustering (CCsom and CChc) from [44]. The results show that the algorithms designed with RC interchange produced the best results on all data sets (data sets that are tested in [44], but not in [73], are marked with "x"). Since consensus-based clustering uses single algorithms (often representative-based ones) with resampling techniques, or several different algorithms, for creating consensus partitions and identifying the "true" number of clusters, this experiment indicates that including RC-based algorithms in consensus frameworks could lead to even better results. In our further experiments, we try to identify good RCs (and algorithms) for clustering microarray data that could be used in consensus frameworks.

4.3.3 Identifying good RCs for clustering microarray data

To test how well an RC solves a sub-problem, we evaluated the quality of RCs using the following procedure. First, we picked a sub-problem (e.g., "measuring distance") and formed groups of algorithms differing in only one RC. Then, we tested whether the differences between the groups were significant, using the Wilcoxon signed-rank paired test with 95 % confidence. This test is appropriate because we can pair algorithms from different groups that differ only in the component we test, while the rest of the algorithm is the same. Next, we separated the RCs into two groups, "best" and "rest," similar to [67]. All components are in the "best" group unless they are proved significantly worse than the best component. The components in the "best" group are recommended for use in an algorithm intended to solve the selected data set well.
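The pairing step above can be sketched with SciPy; the AMI values below are made-up placeholders purely to illustrate the shape of the test, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

# Each position i pairs two algorithms that are identical except for the
# "measuring distance" RC, so a paired (signed-rank) test is appropriate.
ami_with_correl = np.array([0.81, 0.64, 0.72, 0.55, 0.90])  # hypothetical AMI values
ami_with_euclid = np.array([0.70, 0.60, 0.65, 0.41, 0.86])  # same algorithms, EUCLIDEAN instead

stat, p = wilcoxon(ami_with_correl, ami_with_euclid)
if p < 0.05:
    print("the two RCs differ significantly; the lower-scoring one leaves the 'best' group")
else:
    print("no significant difference; both RCs stay in the 'best' group")
```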

These components are shown in Table 7. It can be seen from Table 7 that the PCA RC was the only best component on "Leukemia" and "CNSTumors." PCA and GMEANS were the only two best components on "Leukemia (small)," "Lymphoma," "DBLCLB" and "MultiA." XMEANS was often in the "best" group, but in most cases it was not significantly better than PCA and GMEANS (except in the case of "BreastB"). The KMEANS++, RANDOM and SPSS RCs were in the "best" group when the performance of the clustering algorithm did not depend on "initializing representatives" (i.e., all or almost all RCs were in the "best" group). The only exception was the "Normal" data set, where only these 3 components were in the "best" group. From this examination, we conclude that the PCA and GMEANS RCs are the best candidates for building algorithms for clustering microarray data.

Giancarlo et al. [29] identified the "Pearson," "Cosine" and "Euclidean" distances as the most robust for clustering microarray data. The results in Table 7 show that on all data sets (except "Lung Cancer"), the CORREL ("Pearson") RC was among the "best" RCs. Additionally, it was the only "best" RC on 7 data sets. The COSINE RC was in the "best" group 8 times, but always together with CORREL. CITY was in the "best" group 4 times, but also never as a single "best" component. In contrast to [29], our results show that EUCLIDEAN was not a well-performing RC (it was in the "best" group only on the "Lung Cancer" data set).

For the "updating representatives" sub-problem, MEAN was 14 times among the "best" RCs (8 times as the only best). ONLINE was 6 times among the "best" RCs and 2 times the only one in this group. MEDIAN was 5 times among the "best" RCs, but it was never the only "best" one.

AIC and BIC were non-dominated RCs (for the "evaluating clusters" sub-problem) on almost all data sets (except "BreastA" and "MultiA") and were often the single "best" RCs. SILHOU was also often in the "best" group, in the majority of cases together with AIC and BIC. COMPACT is in most cases a dominated RC. This is important to note, because this component is often used as the model selector in the original algorithms and can therefore lead to misleading results.

From this analysis, we identified the following good candidates for building algorithms for clustering microarray data:

– "Initializing representatives": PCA and GMEANS
– "Measuring distance": CORREL and COSINE
– "Updating representatives": MEAN
– "Evaluating clusters": AIC, BIC and SILHOU

After this statistical analysis of the performance of individual RCs on every data set, we tested whether the "good" components assembled into algorithms really produce better results than the other ones. The algorithms were divided into two groups: a "good" group of algorithms assembled from the previously identified RCs (BIC was excluded because it showed performance similar to AIC when integrated into algorithms) and a "bad" group of algorithms designed from the remaining RCs. This separation of components into groups reduces the space of algorithms (RCs are not mixed between the groups for algorithm design) and allows an easier identification of well-performing ones. In our next experiment, we consider algorithm selection in the algorithm space designed from all available RCs.

Figure 1 shows the box plot of the AMI values (y-axis) of every algorithm (x-axis) over all data sets. It can be seen from the diagram that the algorithms composed of "good" components (the first 8 bars, left of the vertical reference line) indeed performed better than the others. Further, we used a t test to compare the independent samples (algorithms from the "bad" and the "good" group), and it showed a significant difference in performance at a significance level below 0.001. Figure 1 and the statistical test show that the RCs identified in the previous experiment are good candidates for the design of algorithms for clustering microarray data and should be considered in the further development of these algorithms.

Table 7  Best performing components on each real data set. For each of the 17 real data sets, the table reports, in four columns, the RCs that belong to the "best" group for each sub-problem: Initialize representatives (best), Measure distance (best), Update representatives (best) and Evaluate clusters (best). The main patterns in these columns are summarized in the text above.

Fig. 1 Box plot of algorithm performance over all data sets

To draw more general conclusions, further testing should be conducted with more RCs, and algorithm performance should be related to data set characteristics; this will be part of our future work.

Additionally, we wanted to see whether some RCs that were not identified as "good" can still constitute well-performing algorithms. We used a regression tree to try to find rules according to which components should be combined to obtain a good AMI value. For this purpose, we considered the results of all algorithms evaluated on every data set (432 × 17). A CHAID tree was applied, and it showed that there is a large difference in performance between data sets. 'DBLCLA', 'Leukemia', 'Leukemia (small)', 'Novartis' and 'St. Jude' showed similar performance (predicted value 0.730) and generally better results than the other data sets (predicted value 0.397). A closer inspection of the data set characteristics (Table 1) did not show that the groups of data sets identified by the decision tree are similar. This implies the need for a further inspection of microarray data characteristics, as suggested in [19], but this is out of the scope of this paper.

From the whole decision tree, we derived rules and compared the predicted AMI values. Although CITY and EUCLIDEAN had the worst individual performance (worse than the average predicted AMI in the tree root), they attain the highest predicted AMI value when combined with other components:

IF "measuring distance" IN {CITY, EUCLIDEAN}
AND "updating representatives" = MEAN
AND "evaluating clusters" IN {AIC, BIC, SILHOU}
AND "initializing representatives" IN {GMEANS, PCA}
THEN predicted AMI = 0.904

The second best combination of components is represented by the following rule:

IF "measuring distance" = CORREL
AND "evaluating clusters" IN {AIC, BIC, SILHOU}
AND "initializing representatives" IN {GMEANS, PCA, XMEANS}
AND "updating representatives" IN {MEAN, MEDIAN}
THEN predicted AMI = 0.893

The third best rule is:

IF "measuring distance" = COSINE
AND "evaluating clusters" IN {AIC, BIC, SILHOU, XB}
AND "initializing representatives" IN {GMEANS, PCA}
AND "updating representatives" IN {MEAN, MEDIAN}
THEN predicted AMI = 0.874
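The rule-derivation step itself can be reproduced in spirit with a few lines of scikit-learn; since CHAID is not available there, the sketch below uses a CART regression tree on one-hot-encoded RC choices instead, and the small `results` frame is invented solely to make the snippet runnable:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# One row per (algorithm, data set) run: the four RC choices and the obtained AMI.
results = pd.DataFrame({
    "init":     ["PCA", "RANDOM", "GMEANS", "SPSS"],
    "distance": ["CORREL", "EUCLIDEAN", "COSINE", "CITY"],
    "update":   ["MEAN", "ONLINE", "MEAN", "MEDIAN"],
    "evaluate": ["AIC", "COMPACT", "SILHOU", "XB"],
    "ami":      [0.90, 0.35, 0.85, 0.30],              # illustrative values only
})
X = pd.get_dummies(results[["init", "distance", "update", "evaluate"]])
tree = DecisionTreeRegressor(max_depth=4).fit(X, results["ami"])
print(export_text(tree, feature_names=list(X.columns)))  # IF-THEN paths with predicted AMI in the leaves
```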

These results confirm the findings from Table 7 and Fig. 1 but, on the other hand, show that a good combination of components, as in the first rule, can outweigh the weaker performance of its individual components. The rules derived from the decision tree should be used as guidance for building and evaluating RC-based algorithms for clustering microarray data. This is important because our RC-based approach allows building many algorithms (432 were analyzed in this paper), and it is expensive to evaluate each of them. So, if the algorithms derived from the first rule (12 algorithms) give satisfactory results, there is no need for further experimentation. The three rules described above show which RCs should be combined in order to get good-quality clusters. From the first rule, it can also be seen that the CITY and EUCLIDEAN distance measures should not be used in the same algorithm with XMEANS initialization, MEDIAN recalculation of representatives or XB evaluation of clusters. The suggested RCs should be used as a starting point in the search for the best algorithm in order to save time.
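As a sketch of how such a rule cuts down the search, the full RC space from Sect. 4.1 can be enumerated and filtered with the first rule; the component names are the ones used throughout the paper, while the code itself is only an illustration:

```python
from itertools import product

inits     = ["RANDOM", "SPSS", "XMEANS", "GMEANS", "PCA", "KMEANS++"]
distances = ["EUCLIDEAN", "CITY", "CORREL", "COSINE"]
updates   = ["MEAN", "MEDIAN", "ONLINE"]
evals     = ["COMPACT", "XB", "CONN", "SILHOU", "AIC", "BIC"]

all_algorithms = list(product(inits, distances, updates, evals))   # 6*4*3*6 = 432 RC sequences

rule_1 = ["-".join(a) for a in all_algorithms
          if a[0] in {"GMEANS", "PCA"} and a[1] in {"CITY", "EUCLIDEAN"}
          and a[2] == "MEAN" and a[3] in {"AIC", "BIC", "SILHOU"}]
print(len(all_algorithms), len(rule_1))   # 432 12
```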

5 Conclusion and future research

In this paper, we used the component-based approach for representative-based clustering [20] to build generic partitioning algorithms that can be applied to microarray data. With this approach, we constructed 432 algorithms and tested them on 22 data sets. Our first set of experiments showed that the reuse and interchange of reusable components (RCs) extracted from existing algorithms can lead to better results than the algorithms they originate from. In a second set of experiments, we showed that representative-based algorithms built from RCs are competitive with other clustering algorithms (like GRASP and hierarchical algorithms). These experiments also indicated that an extension of the RC repository can improve
the quality of clustering algorithms. Further, single algorithms composed of RCs showed better performance than consensus clustering frameworks that use traditional single algorithms (e.g., K-means, SOM) [44,73]. This implies that even better clustering could be achieved if RC-based algorithms were integrated into consensus-based frameworks. Additionally, we evaluated the FCM algorithm on all data sets, and it showed good performance in terms of external evaluation measures. Since this algorithm shares a common structure with the RC-based algorithms evaluated in this research, we intend to include the fuzzy component from FCM in the generic algorithm.

In order to shed additional light on the problem of algorithm selection for microarray data clustering, we proposed a method for the identification of good RCs for the design of clustering algorithms for microarray data analysis. First, we used statistical analysis to identify RCs from every sub-problem showing "good" performance on every data set. Algorithms designed from the "good" RCs (8 algorithms) showed significantly better performance than the algorithms composed of the other RCs. In the last experiment, we used a regression tree to find rules according to which components should be combined to obtain a good clustering. The regression tree confirmed the results of the previous experiments but, on the other hand, showed that a synergy of components (not only those from the "good" group) can give good results. The regression tree also showed that there is a difference in algorithm performance between groups of data sets. This indicates that the performance of RC-based algorithms should be related to data set characteristics, as suggested by [19], which will be part of our future work. We plan to further extend the number of possible RCs for every sub-problem and to integrate RC-based algorithms into crisp (e.g., [44,73]) or soft [49] consensus clustering frameworks that allow the automatic detection of the true number of clusters and could lead to better clustering and more stable solutions.

The major limitation of this approach is that the generic algorithm cannot automatically identify the true number of clusters. We therefore tried settings with different values of K (from 2 to 2 × trueK), and in the majority of cases, AMI identified the best cluster model when the true K was set (this approach is common in many algorithms that automatically find the true K; in [15], for example, the user selects an upper bound for K and the algorithm then reduces K guided by an evaluation measure). In a small number of cases, AMI did not identify the true K, but this can be due to the fact that human experts are not yet confident about the true number of clusters present (this is also discussed in [64]). Another limitation of our approach is that it generalizes only representative-based clustering algorithms (e.g., the algorithm from [45] uses a completely different strategy), and we do not argue that it outperforms other algorithms, since there is no best algorithm for all data sets. But our approach allows the design of many algorithms and, in this way, a better adaptation to the data. Since representative-based algorithms are often successfully used in this area, we argue that the RC-based approach is beneficial for clustering microarray gene expression data.

The experiments conducted in this paper showed that the component-based design of algorithms is a promising approach for clustering microarray data. Still, as new RCs are added, the number of available algorithms grows. Therefore, a manual or brute-force automatic search for the best algorithm in such a large space is practically impossible. This implies an important research task: the definition of an intelligent strategy for automatic algorithm selection in the space of RC-based clustering algorithms. This problem is addressed in [35], where an evolutionary algorithm is used for an automatic search through the space of RC-based decision trees and showed cutting-edge performance. In order to deal with this task, we plan to adapt the evolutionary algorithm proposed in [35] to the search in the space of RC-based clustering algorithms. Another approach to intelligent algorithm selection is meta-learning.

Meta-learning relates the performance of algorithms to the data distribution and enables automatic selection and ranking of algorithms for a given problem. Even though this is a common approach for the selection and ranking of supervised learning algorithms [61], in the area of clustering it is a relatively new and unexplored topic [19]. There are already initial studies in the area of microarray data clustering [19,45], which gave very promising results. The integration of RC-based algorithms into meta-learning frameworks is out of the scope of this paper, but it will be another direction of our future work. After developing an intelligent strategy for automatic algorithm selection in the space of RC-based clustering algorithms and integrating RC-based algorithms into meta-learning frameworks, we plan to apply this approach in different application areas such as the analysis of web usage data [41,66], education [42,53], management models [55,60], document data [7,38], etc.

Acknowledgments  This research was partially funded by a grant from the German Academic Exchange Service (DAAD) and the Serbian Ministry of Science, Project-ID 50453023.

References 1. Ahmad A, Dey L (2007) A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl Eng. doi:10.1016/j.datak.2007.03.016 2. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control. doi:10. 1109/TAC.1974.1100705 3. Andreopoulos B, An A, Wang X et al (2009) A roadmap of clustering algorithms: finding a match for a biomedical application. Br Bioinform 10(3):297–314 4. Ankerst M, Breunig M, Kriegel H, et al (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the ACM SIGMOD’99 international conference on management of data. Philadelphia, pp 49–60 5. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms (SODA ’07), society for industrial and applied mathematics, Philadelphia, pp 1027–1035 6. Ayadi W, Elloumi M, Hao JK (2012) BicFinder: a biclustering algorithm for microarray data analysis. Knowl Inf Syst 30:341–358. doi:10.1007/s10115-011-0383-7 7. Balachandran V, Khemani D (2011) Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowl Inf Syst. doi:10.1007/s10115-011-0446-9 8. Baralis E, Bruno G, Flori A (2011) Measuring gene similarity by means of the classification distance. Knowl Inf Syst 29:81–101. doi:10.1007/s10115-010-0374-0 9. Baya AE, Granitto PM (2011) Clustering gene expression data with a penalized graph-based metric. BMC bioinf 12:1–18 10. Bezdek JC (1981) Pattern recognition With fuzzy objective function algorithms. Plenum Press, New York 11. Belacel N, Wang Q, Cuperlovic-Culf M (2006) Clustering methods for microarray gene expression data. OMICS J Integr Biol 10(4):507–531. doi:10.1089/omi.2006.10.507 12. Bonchi F, Gionis A, Ukkonen, A (2011) Overlapping correlation clustering. In: Proceedings of 11th IEEE international conference on data mining (ICDM), pp 51–60. doi:10.1109/ICDM.2011.114 13. Bottou L, Bengio Y (1995) Convergence properties of the k-means algorithms. In: Tesauro G, Touretzky D (eds) Advances in neural information processing systems 7. MIT Press, Cambridge, pp 585–592 14. Chen C-L, Tseng FSC (2010) An integration of WordNet and fuzzy association rule mining for multi-label document clustering. Data Knowl Eng 69(11):1208–1226. doi:j.datak.2010.08.003 15. Cheung Y (2003) k*-means: a new generalized k-means clustering algorithm. Pattern Recognit Lett 24(15):2883–2893. doi:10.1016/S0167-8655(03)00146-6 16. Da Silva A, Chiky R, Hébrail G (2011) A clustering approach for sampling data streams in sensor networks. Knowl Inf Syst. doi:10.1007/s10115-011-0448-7 17. Dang H-X, Bailey J (2010) A hierarchical information theoretic technique for the discovery of non linear alternative clusterings. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD) 2010, pp 573–582 18. De Bie T (2011) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD) 2011, pp 564–572

19. de Souto MCP, Prudencio RBC, Soares RGF et al (2008) Ranking and selecting clustering algorithms using a meta-learning approach. In: Proceedings of the IEEE international joint conference on neural networks, pp 3729–3735. doi:10.1109/IJCNN.2008.4634333 20. Delibaši´c B, Kirchner K, Ruhland J et al (2009) Reusable components for partitioning clustering algorithms. Artif Intell Rev 32:59–75. doi:10.1007/s10462-009-9133-6 21. Dembélé D, Kastner P (2003) Fuzzy C-means method for clustering microarray data. Bioinformatics 19:973–980 22. Dhiraj K, Rath SK (2009) Gene expression analysis using clustering. In: Proceedings of 3rd international conference on bioinformatics and, biomedical engineering, pp 154–163 23. Ding C, He X (2004) Principal component analysis and effective k-means clustering. In: Proceedings of the SIAM international conference on data mining, pp 497–502 24. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD) 2011, pp 681–689 25. Ester M, Kriegel H, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, pp 226–231 26. Forestier G, Gançarski P, Wemmert C (2010) Collaborative clustering with background knowledge. Data Knowl Eng 69(2):211–228. doi:10.1016/j.datak.2009.10.004 27. Geraci F, Leoncini M, Montangero M et al (2009) K-boost: a scalable algorithm for high-quality clustering of microarray gene expression data. J Comput Biol J Comput Mol Cell Biol 16(6):859–873. doi:10.1089/ cmb.2008.0201 28. Giancarlo R, Utro F (2011) Speeding up the consensus clustering methodology for microarray data analysis. Algorithms Mol Biol AMB 6(1). doi:10.1186/1748-7188-6-1 29. Giancarlo R, Lo Bosco G, Pinello L (2010) Distance functions, clustering algorithms and microarray data analysis. In: Blum C, Battiti R (eds) Learning and intelligent, optimization, vol 6073, pp 125–138 30. Grujic M, Andrejiová M, Marasová D et al (2012) Using principal components analysis and clustering analysis to assess the similarity between conveyor belts. Tech Technol Educ Manag TTEM 7(1):4–10 31. Hamerly G, Elkan C (2003) Learning the k in k-means. In: Proceedings of the neural information processing systems, vol 17 32. Hartigan JA (1975) Clustering algorithms. Probability and mathematical statistics. Wiley, New York 33. Hartigan JA, Wong MA (1979) A K-means clustering algorithm. Appl Stat 28:100–108 34. Iam-on N, Boongoen T, Garrett S (2010) LCE: a link-based cluster ensemble method for improved gene expression data analysis. Bioinformatics 26:1513–1519 35. Jovanovi´c M, Delibaši´c B, Vuki´cevi´c M, et al (2011) Optimizing performance of decision tree componentbased algorithms using evolutionary algorithms in Rapid Miner. In: proceedings of the 2nd RapidMiner community meeting and conference, Dublin 36. Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York 37. Kumar P, Wasan SK (2010) Comparative analysis of k-mean based algorithms. Intl J Comput Sci Netw Secur 10(4):314–318 38. Kalogeratos A, Likas A (2011) Document clustering using synthetic cluster prototypes. Data Knowl Eng 70(3):284–306. doi:j.datak.2010.12.002 39. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137. doi:10.1109/ TIT.1982.1056489 40. 
Milligan GW, Cooper MC (1987) Methodology review: clustering methods. Appl Psychol Meas 11(4):329–354. doi:10.1177/014662168701100401 41. Milovanovi´c M, Minovi´c M, Štavljanin V et al (2012) Wiki as a corporate learning tool: case study for software development company. Behav Inf Technol. doi:10.1080/0144929X.2011.642894 42. Minovi´c M, Milovanovi´c M, Kovaˇcevi´c I, Minovi´c J, Starˇcevi´c D (2011) Game design as a learning tool for the course of computer Networks. Intern J Eng Educ 27(3):498–508 43. Moise G, Zimek A, Kröger P et al (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21(3):299–326. doi:10.1007/s10115-009-0226-y 44. Monti S, Tamayo P, Mesirov J et al (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52:91–118. doi:10.1023/A: 1023949509487 45. Nascimento A, Prudencio R, de Souto M, et al (2009) Mining rules for the automatic selection process of clustering methods applied to cancer gene expression data. In: Proceedings of the 19th international conference on artificial neural networks: Part II, Springer, Berlin 46. Nascimento MCV, Toledo FMB, Carvalho A (2010) Investigation of a new GRASP-based clustering algorithm applied to biological data. Comput Oper Res 37(8):1381–1388. doi:10.1016/j.cor.2009.02.014

47. Pelleg D, Moore A (2000) X-means: extending K-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, vol 17, Morgan Kaufmann, Los Altos, pp 727–734 48. Piatetsky-Shapiro G, Tamayo P (2003) Microarray data mining: facing the challenges. ACM SIGKDD Explor Newsl 5(2):1–5. doi:10.1145/980972.980974 49. Punera K, Ghosh J (2008) Consensus-based ensembles of soft clusterings. Appl Artif Intell 22:780–810 50. Pirim H, Gautam D, Bhowmik T (2011) Performance of an ensemble clustering on biological datasets. Math Comput Appl 16(1):87–96 51. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2:418–427 52. Raczynski L, Wozniak K, Rubel T, Zaremba K (2010) Application of density based clustering to microarray data analysis. Int J Electron Telecommun 56(3):281–286 53. Romero C, Ventura S (2011) Educational data mining: a review of the state-of-the-art. IEEE Trans Syst Man Cybern C Appl Rev 40(6):601–618 54. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. doi:10.1016/0377-0427(87)90125-7 ˇ 55. Savoiu G, Jaško O, Cudanov M (2010) Diversity of specific quantitative, statistical and social methods, techniques and management models in management system. Management 14(52):5–13 56. Sander J, Ester M, Kriegel H et al (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Disc 2(2):169–194 57. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464 58. Shao J, Plant C, Yang Q, Böhm C (2011) Detection of arbitrarily oriented synchronized clusters in high-dimensional data. In: Proceedings of 11th IEEE international conference on data mining (ICDM), pp 607–616, doi:10.1109/ICDM.2011.50 59. Shaham E, Sarne D, Ben-Moshe B (2011) Sleeved co-clustering of lagged data. Knowl Inf Syst. doi:10. 1007/s10115-011-0420-6 60. Sedlak O, Kocic-Vugdelija V, Kudumovic M et al (2010) Management of family farms—Implementation of fuzzy method in short-term planning. Tech Technol Educ Manag TTEM 5(4):710–718 61. Smith-Miles K (2008) Towards insightful algorithm selection for optimization using meta-learning concepts. In: Proceedings of the IEEE international joint conference on neural networks, pp 4118–4124 62. Sonnenburg S, Braun M, Ong CS et al (2007) The need for open source software in machine learning. J Mach Learn Res 8:2443–2466 63. Thalamuthu A, Mukhopadhyay I, Zheng X et al (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22:2405–2412 64. Vinh NX (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854 65. Vukicevic M, Delibasic B, Jovanovic M, Suknovic M, Obradovic Z (2011) Internal evaluation measures as proxies for external indices in clustering gene expression data. In: Proceedings of the 2011 IEEE international conference on bioinformatics and biomedicine (BIBM11). Atlanta, 12–15 Nov 66. Wan M, Jönsson A, Wang C, Li L, Yang Y (2011) Web user clustering and web prefetching using random indexing with weight functions. Knowl Inf Syst. doi:10.1007/s10115-011-0453-x 67. Wijaya A, Kalousis M, Hilario M (2010) Predicting classifier performance using data set descriptors and data mining ontology. In: Proceedings of the 3rd planning to learn workshop 68. 
Wu LF, Hughes TR, Davierwala AP (2002) Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat genet 31:255–265 69. Wu X, Kumar V, Quinlan JR et al (2007) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37. doi:10.1007/s10115-007-0114-2 70. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Patt Anal Mach Intell 13(8):841–847 71. Xu R, Wunsch DC (2010) Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng 3:120–154. doi:10.1109/RBME.2010.2083647 72. Yan Y, Chen L, Tjhi W-C (2011) Semi-supervised fuzzy co-clustering algorithm for document classification. Knowl Inf Syst. doi:10.1007/s10115-011-0454-9 73. Yu Z, Wong H-S, Wang H (2007) Graph-based consensus clustering for class discovery from gene expression data. Bioinformatics 23:2888–2896


Author Biographies

Milan Vukićević is a teaching and research assistant at the Faculty of Organizational Sciences, University of Belgrade, within the Center for Business Decision Making. He is currently a Ph.D. student. His main research interests are clustering and classification algorithm design, data mining, decision support, and meta-learning.

Kathrin Kirchner is a postdoctoral researcher and lecturer at the Department of Business Information Systems, Friedrich Schiller University of Jena, Germany. For her Ph.D., she developed a spatial decision support system for the rehabilitation of gas pipeline networks. Her research interests include data mining, decision support and knowledge management.

Boris Delibašić has been an associate professor at the University of Belgrade, Serbia, at the Faculty of Organizational Sciences, Department for Business Decision Making, since 2007. He holds a Ph.D. on the formalization of the business decision-making process through the identification of reusable components that support decision making.


Miloš Jovanović is a teaching and research assistant at the Faculty of Organizational Sciences, University of Belgrade, within the Center for Business Decision Making. He is currently a Ph.D. student. His main research interests are data mining, decision support systems, data warehousing, optimization and artificial intelligence.

Johannes Ruhland has been a full professor at the Friedrich Schiller University of Jena since 1994. Before that, he was a full professor at the Universities of Ulm and Munich. Professor Ruhland's research addresses data mining algorithms and data mining applications in marketing and analytical CRM, as well as business process management and economic and managerial aspects of the energy economy.

Milija Suknović has been a full professor at the University of Belgrade, Serbia, at the Faculty of Organizational Sciences, Department for Business Decision Making, since 1991. He holds a Ph.D. for "Development of support methodology for group decision making." His research interests are focused on data mining, decision support systems, group decision making and data warehousing.
