Commentary

On Extensions of k-Means Clustering for Automated Gating of Flow Cytometry Data

George Luta*

RECENT advances in high-throughput flow cytometry (FCM) technology require the theoretical development and efficient computational implementation of new methods for the automated identification of cell populations. Compared with manual gating, the current de facto gold standard, these automated methods are expected not only to be faster but also to increase the reproducibility of data analysis pipelines. According to a recent comprehensive survey of FCM data analysis methods by Bashashati and Brinkman (1), automated gating methods can be used to identify both known and unknown cell populations, the latter including subpopulations that cannot be easily identified using two-dimensional manual gating.

To properly perform unsupervised automated gating of FCM data, general clustering methods need to fulfill several criteria: computational efficiency (to handle the very large data sets commonly encountered in a practically reasonable amount of time), robustness to the shape of the clusters (from spherical to concave cell populations, such as "banana-shaped" populations) and to the density of the clusters (from very sparse to very dense cell populations, depending on the type of gating to be performed), and the ability to identify the (generally unknown) number of populations (1).

The article by Aghaeepour et al. (2) published in this issue makes an important contribution to the field by introducing flowMeans, a fast automated gating method based on an extension of k-means clustering. It is important to note that the new method has been specifically developed to address several problematic aspects of applying k-means clustering to FCM data.
These limitations include the identification of the number of populations (k), the sensitivity of the clustering results to the initial values, and an implicit restriction to spherical cell populations, with the last limitation being particularly relevant to FCM data. The flowMeans method solves the two problems of identifying k and dealing with concave cell populations by starting with a larger number of clusters (using a "reasonable" upper bound for k) and merging them, thereby allowing multiple overlapping clusters to represent the same subpopulation.

Department of Biostatistics, Bioinformatics, and Biomathematics, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center
Received 26 September 2010; Accepted 29 September 2010
*Correspondence to: George Luta, 4000 Reservoir Road, NW, Suite 180, Building D, Washington, DC 20057-1484.
Cytometry Part A • 79A: 3–5, 2011

To go into the specific details of the method, flowMeans starts by estimating the number of modes for each one-dimensional projection of the FCM data, using an approach based on kernel density estimation as described by Duong et al. (3). The total number of modes across all dimensions is used as a maximum for k, given that this sum overestimates the number of subpopulations in the multidimensional space. Because there are more clusters than needed, the resulting clusters have to be merged further to determine the number of subpopulations; the process iterates by alternating between calculating the distances between pairs of clusters and merging the closest pair. The underlying empirical principle is that tracking the minimum distance between pairs of clusters allows us to recognize when we move from poorly separated to well separated clusters. A segmented regression algorithm is then used to determine k by detecting the change point in the distances between the merged clusters.

Given that the flowMeans method is an extension of k-means clustering specifically designed for FCM data, we provide further details about k-means and related methods, based on a recent comprehensive review of the topic (4). Although it was proposed about 55 years ago, the k-means algorithm is still one of the most popular clustering algorithms in use today, due in large part to its simplicity. The algorithm minimizes the sum (over all k clusters) of the squared distances between the points in a cluster and the cluster mean, hence the name k-means.
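The over-clustering and merging stage described above can be illustrated with a short sketch. This is a toy version under stated assumptions, not the flowMeans implementation: it uses a plain Lloyd's k-means, Euclidean distances between cluster centers, a midpoint rule for merged centers, and a fixed target k_final instead of the segmented-regression change-point detection; the function names kmeans and merge_closest are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm): alternate nearest-center
    assignment and cluster-mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # distance of every point to every center, then nearest assignment
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def merge_closest(X, k_max, k_final):
    """flowMeans-style merging sketch: over-cluster with k_max, then
    repeatedly merge the closest pair of centers, recording each merge
    distance (flowMeans instead locates a change point in this distance
    sequence; here we simply stop once k_final centers remain)."""
    _, centers = kmeans(X, k_max)
    centers = [c for c in centers]
    merge_dists = []
    while len(centers) > k_final:
        # locate the closest pair of cluster centers
        best = (np.inf, 0, 1)
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                dij = float(np.linalg.norm(centers[i] - centers[j]))
                if dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        merge_dists.append(dij)
        centers[i] = (centers[i] + centers[j]) / 2   # midpoint as a crude merged center
        del centers[j]
    return np.vstack(centers), merge_dists
```

The recorded merge distances stay small while clusters belonging to the same subpopulation are being merged and jump once only well separated clusters remain, which is exactly the signal the change-point detection exploits.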
To perform the required minimization, the algorithm requires the advance specification of the number of clusters k, an initial partition into k clusters, and a distance measure. The use of the Euclidean distance generates spherical clusters, whereas the use of the Mahalanobis distance generates ellipsoidal clusters.

E-mail: [email protected]
Published online in Wiley Online Library (wileyonlinelibrary.com)
DOI: 10.1002/cyto.a.20988
© 2010 International Society for Advancement of Cytometry

It is important to note that ensemble clustering methods can be used as an alternative to specifying the number of clusters k and an initial partition into k clusters (5). Multiple data partitions are first obtained by varying the value of k and using several random partitions into k clusters. They are subsequently combined into a final clustering by using the co-occurrence matrix, i.e., the matrix that records the number of times two data points co-occur in the same cluster across the multiple partitions.

Notable extensions of the k-means algorithm include fuzzy c-means, bisecting k-means, X-means, k-medoids, and kernel k-means (4). The last is of particular importance for FCM data because it can detect clusters of arbitrary shape, being able to describe data distributions more complicated than the Gaussian. Kernel k-means requires the choice of a kernel similarity function, and it performs clustering by maximizing within-cluster similarity (6).

In parallel with advances in FCM technology, advances in other high-throughput technologies, such as DNA microarrays, have also generated large amounts of high-dimensional data requiring automated bioinformatics methods to provide faster and more reproducible data analysis pipelines. Among the methods found to be successful in clustering these types of data are those based on nonnegative matrix factorization (NMF). Although NMF has been applied primarily for unsupervised clustering in image and natural language processing, NMF-based methods have also been used successfully for molecular pattern discovery, class comparison and prediction, cross-platform and cross-species analysis, functional characterization of genes, and biomedical informatics (see Ref. 7 for a recent review of NMF in computational biology).
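The evidence-accumulation idea behind the co-occurrence matrix described earlier can be sketched as follows. This is an assumption-laden toy version, not code from any of the cited papers: the partitions come from a basic k-means run over a few (k, seed) combinations, and the final clustering simply cuts the co-occurrence graph at a fixed threshold and takes connected components.

```python
import numpy as np

def kmeans_labels(X, k, seed, iters=30):
    """Basic Lloyd's k-means returning only the final labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def ensemble_cluster(X, ks=(2, 3), runs_per_k=4, threshold=0.5):
    """Combine several k-means partitions through the co-occurrence
    matrix: entry (i, j) holds the fraction of runs in which points i
    and j fell into the same cluster.  The final partition consists of
    the connected components of the thresholded co-occurrence graph."""
    n = len(X)
    cooc = np.zeros((n, n))
    runs = 0
    for k in ks:
        for r in range(runs_per_k):
            lab = kmeans_labels(X, k, seed=31 * k + r)
            cooc += (lab[:, None] == lab[None, :])
            runs += 1
    cooc /= runs
    adj = cooc >= threshold
    final = -np.ones(n, dtype=int)
    cid = 0
    for i in range(n):                    # depth-first component labeling
        if final[i] == -1:
            stack = [i]
            while stack:
                p = stack.pop()
                if final[p] == -1:
                    final[p] = cid
                    stack.extend(np.flatnonzero(adj[p] & (final == -1)).tolist())
            cid += 1
    return final
```

Note that neither k nor an initial partition has to be fixed in advance; the ensemble averages over both choices, which is precisely the appeal of the approach for FCM data.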
As the name implies, NMF is a matrix factorization method applicable to matrices containing only nonnegative values. It was introduced as a parts-based learning paradigm by Lee and Seung (8). Given a nonnegative n × m matrix X, the original method introduced by Lee and Seung uses a multiplicative-updates algorithm to find two nonnegative matrices, W and H, of dimensions n × k and k × m, respectively, with k < n, m, such that the matrix product W·H is close to X (with respect to a specified metric). If X is an n × m gene expression matrix consisting of observations on n genes from m samples, each column of W defines a "metagene," and each column of H represents the metagene "expression pattern" of the corresponding sample (9). NMF can be used for clustering by exploiting the sparseness of the matrix H and using the expression patterns to assign observations to the k components. Imposing further sparsity constraints on the matrix H within the NMF objective function (which involves the comparison of X with W·H) reveals the natural clustering aspect of NMF, because NMF is essentially equivalent to the k-means algorithm in the limiting case where there is only one nonzero entry per column of H. Treating the objective function of the k-means method as the objective function of a lower-rank approximation with special constraints allows us to obtain the NMF formulation from the k-means formulation by relaxing some of the constraints.

Related to the previous discussion regarding extensions of k-means, it is important to note that NMF clustering and kernel k-means clustering are closely related, being different formulations of the same problem with slightly different constraints (10). As such, NMF-based clustering should also be able to deal with arbitrarily shaped subpopulations. NMF-based clustering methods have performed better than k-means clustering methods when applied to both synthetic and real data (11). These experiments show that an extension of the original NMF method, namely sparse NMF, gives much better and more consistent clustering solutions than k-means clustering. A fast alternating nonnegative least squares algorithm was used to obtain NMF and sparse NMF for these comparisons. It is important to note that even without imposing additional sparsity constraints, NMF can still give competitive clustering results (11).

Having reviewed k-means and NMF-based clustering methods, we are now in a position to suggest possible modifications of the flowMeans method. One proposal is to replace k-means clustering with sparse NMF-based clustering; the other is to use robust versions of k-means clustering. To our knowledge, neither of these methods has been used before to cluster FCM data. Regarding the first suggestion, it should be noted that the smallest dimension of the data matrix, the minimum of n and m, provides an upper bound for the number of clusters that can be identified. For FCM data, because the number of cells is very large, this minimum is the number of cell characteristics. If the number of measured characteristics is larger than the expected number of populations (based on biological knowledge), NMF-based clustering methods can be used directly.
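The basic NMF machinery discussed earlier can be sketched in a few lines. This is a minimal illustration of the Lee and Seung multiplicative updates for the Frobenius-norm objective, not the alternating nonnegative least squares algorithm or the sparse NMF of Ref. 11; the function names nmf and cluster_columns are illustrative.

```python
import numpy as np

def nmf(X, k, iters=500, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||X - W @ H||_F^2,
    with X (n x m), W (n x k), and H (k x m) all nonnegative.  The
    elementwise multiplicative updates keep W and H nonnegative by
    construction."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update expression patterns
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update metagenes
    return W, H

def cluster_columns(H):
    """Cluster the samples (columns of X) by assigning each one to the
    component with the largest coefficient in its column of H."""
    return H.argmax(axis=0)
```

The argmax assignment is exactly the "one dominant entry per column of H" reading described above; driving H toward a single nonzero entry per column is what makes NMF approach k-means in the limit.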
When the number of measured characteristics does not exceed the expected number of populations, a two-stage approach may be considered instead. First, NMF can be used as a dimension reduction method to reduce the dimensionality of the data to two or three dimensions, and then recently developed nonparametric methods (12) can be used to cluster the resulting data without any restrictions on the shape of the cell populations. Although its current computational implementation is restricted to three-dimensional data, the curvHDR method is a nonparametric density-based approach that uses the concepts of high negative curvature region and high-density region to construct automated gates for FCM data (12).

According to the previously referenced review by Bashashati and Brinkman (1), the identification and removal of outliers from subsequent analyses of FCM data is a crucial step in the data analysis pipeline. Here, outliers are generically defined as observations that differ from the rest of the data, with cell debris, dead cells, and doublets often given as typical examples for FCM data. Because flowMeans is not designed to be robust to outliers, one possibility would be to replace the k-means clustering method with robust variants. One such robust method is trimmed k-means clustering (13), in which a known fraction of outliers is trimmed off and the remaining observations are clustered into k groups. In the absence of a known percentage of outliers in the data (a problem similar to that of the unknown number of clusters), several trimming fractions can be tried, and the resulting data partitions can be combined using ensemble clustering methods. There is a definite need for the development of more robust methods that allow the fraction of outliers to be estimated from the data (as opposed to being assumed known) in the absence of strong modeling assumptions.

Although the proposed flowMeans method has been shown to be faster than, and competitive with, several alternative methods for automated gating, the use of robust versions of k-means, such as trimmed k-means clustering (13), may provide the needed protection against the undue influence of outliers present in FCM data. Already proven to be a strong competitor of k-means clustering for data generated by other high-throughput technologies, sparse NMF-based clustering methods deserve further investigation with respect to their usefulness in automated gating of FCM data.
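The trimming idea can be sketched as follows. This is a minimal sketch in the spirit of trimmed k-means, assuming Euclidean distance and a user-supplied trimming fraction alpha; the function trimmed_kmeans and its init argument are illustrative, not the estimator studied in Ref. 13.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, iters=50, seed=0, init=None):
    """Trimmed k-means sketch: at each iteration, the alpha fraction of
    points farthest from their nearest center is set aside, and the
    centers are updated using only the retained points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_keep = int(np.ceil((1.0 - alpha) * n))
    if init is None:
        centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    else:
        centers = np.array(init, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        keep = np.argsort(d.min(axis=1))[:n_keep]    # retain the closest points
        for j in range(k):
            members = keep[labels[keep] == j]
            if len(members):
                centers[j] = X[members].mean(axis=0)
    # final assignment, plus the indices treated as outliers
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    trimmed = np.argsort(d.min(axis=1))[n_keep:]
    return labels, centers, trimmed
```

Because the farthest points never enter the mean updates, a handful of extreme events (debris, doublets) cannot drag the cluster centers away from the bulk of the cells, which is precisely the robustness property motivating the proposal above.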


LITERATURE CITED

1. Bashashati A, Brinkman R. A survey of flow cytometry data analysis methods. Adv Bioinformatics 2009; DOI: 10.1155/2009/584603.
2. Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR. Rapid cell population identification in FCM data. Cytometry Part A 2011;79A:6–13 (this issue).
3. Duong T, Cowling A, Koch I, Wand MP. Feature significance for multivariate kernel density estimation. Comput Stat Data Anal 2008;52:4225–4242.
4. Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognition Lett 2010;31:651–666.
5. Fred AL, Jain AK. Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Machine Intell 2005;27:835–850.
6. Schölkopf B, Smola A, Müller KR. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998;10:1299–1319.
7. Devarajan K. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Comput Biol 2008;4:e1000029.
8. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–791.
9. Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004;101:4164–4169.
10. Ding C, He X, Simon HD. On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the Fifth SIAM International Conference on Data Mining, Newport Beach, CA; 2005. pp 606–610.
11. Kim J, Park H. Sparse nonnegative matrix factorization for clustering. CSE Technical Reports. Atlanta, GA: Georgia Institute of Technology; 2008.
12. Naumann U, Luta G, Wand MP. The curvHDR method for gating flow cytometry samples. BMC Bioinformatics 2010;11:44.
13. Cuesta-Albertos JA, Gordaliza A, Matrán C. Trimmed k-means: An attempt to robustify quantizers. Ann Stat 1997;25:553–576.
