Unsupervised Learning: Clustering
Angela Serra and Roberto Tagliaferri
NeuRoNe Lab, DISA-MIS, University of Salerno, via Giovanni Paolo II 132, 84084, Fisciano (SA), Italy
Abstract

In this chapter an introduction to unsupervised cluster analysis is provided. Clustering is the organisation of unlabelled data into similarity groups called clusters. A cluster is a collection of data items that are similar to each other and dissimilar to the data items in other clusters. Three main elements are needed to perform cluster analysis: a proximity measure, to evaluate the similarity between patterns (the classical choice is the Euclidean distance); a quality measure, to evaluate the results of the analysis; and a clustering algorithm. In this chapter three classical algorithms are described: k-means, hierarchical clustering and Expectation Maximisation. Furthermore, an example of their application to a gene expression dataset for patient sub-typing purposes is provided.

Keywords: clustering, unsupervised learning, k-means, hierarchical clustering, EM, bioinformatics
1 Clustering
Cluster analysis is an exploratory technique whose aim is to identify groups, or clusters, of high density in which observations are more similar to each other than to observations assigned to different clusters. This process requires quantifying the degree of similarity, or dissimilarity, between observations, and the result of the analysis depends strongly on the chosen similarity metric.
A large class of dissimilarities coincides with the class of distance functions. The most widely used distance function is the Euclidean distance, defined as the square root of the sum of the squared differences of the features between two patterns x_i and x_j:

d(x_i, x_j) = \sqrt{\sum_{h=1}^{p} (x_{ih} - x_{jh})^2}

where x_{ih} is the h-th feature of x_i and p is the number of features. Similarly, clustering algorithms can also use similarity measures between observations. The correlation coefficient is widely used as a similarity measure:

\rho(x_i, x_j) = \frac{\sum_{h=1}^{p} (x_{ih} - \bar{x}_i)(x_{jh} - \bar{x}_j)}{\sqrt{\sum_{h=1}^{p} (x_{ih} - \bar{x}_i)^2}\,\sqrt{\sum_{h=1}^{p} (x_{jh} - \bar{x}_j)^2}}

where \bar{x}_i = \frac{1}{p}\sum_{h=1}^{p} x_{ih} is the average value of the features of a single observation. Clustering has been widely applied in bioinformatics to solve a wide range of problems. Two of the main problems addressed by clustering are: (1) identifying groups of genes that share the same pattern across different samples (Jiang et al., 2004); (2) identifying groups of samples with similar expression profiles (Sotiriou and Piccart, 2007). Many different clustering approaches have been proposed in the literature. These methodologies can be divided into two categories, hierarchical and partitive clustering, as described in Theodoridis and Koutroumbas (2008).
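As a small illustration of these two measures, the following R sketch (with two short numeric profiles invented purely for the example) computes the Euclidean distance and the Pearson correlation; dist() and cor() are the corresponding base R functions.

# Two hypothetical expression profiles with p = 5 features each
x1 <- c(2.1, 0.5, 3.3, 1.0, 2.8)
x2 <- c(1.9, 0.7, 2.9, 1.4, 3.1)

# Euclidean distance: square root of the sum of squared feature differences
euclid <- sqrt(sum((x1 - x2)^2))

# The same value via the base R dist() function applied to a 2-row matrix
euclid_check <- dist(rbind(x1, x2), method = "euclidean")

# Pearson correlation coefficient used as a similarity measure
rho <- cor(x1, x2)

c(euclidean = euclid, correlation = rho)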
2 Hierarchical clustering
Hierarchical clustering seeks to build a hierarchy of clusters based on a proximity measure. Hierarchical clustering methods take as inputs a pairwise dissimilarity function and a set of observations. They do not define just a single partition of the observations: they produce a hierarchical representation, in which the clustering at a given level is defined in terms of the clusters of the level below. The lowest level of the hierarchy is made of n clusters, each containing a single observation, whereas at the top level a single cluster contains all the observations. The absence of a single fixed partition allows the investigator to find the most natural clustering, in the sense of comparing the within-cluster similarity to the between-cluster similarity. Strategies for accomplishing hierarchical clustering can be divisive or agglomerative. The former is a "top down" approach in which all the observations start in one cluster and are then split recursively moving down the hierarchy. The latter is a "bottom up" approach in which, initially, each pattern is a "singleton" cluster and pairs of clusters are then merged moving up the hierarchy. Agglomerative clustering is the most commonly used approach to construct clustering hierarchies. It starts with n clusters, each containing only one observation. At each step the two most similar clusters are merged and a new level of the hierarchy is added. The procedure ends in n − 1 steps, when, at the top of the hierarchy, the last two clusters are merged. The most commonly used methods to perform hierarchical cluster analysis are single linkage, complete linkage and average linkage.
Single Linkage In this method, the distance between two clusters is represented by the distance of the pair of closest data patterns belonging to different clusters. More formally, given two clusters G and H and a distance function d, the single linkage dissimilarity is defined as:

d_S(G, H) = \min_{x_i \in G,\, x_j \in H} d(x_i, x_j)

The single linkage method is able to identify clusters of different shapes and to highlight anomalies in the data distribution better than the other hierarchical techniques. On the other hand, it is sensitive to noise and outliers. Moreover, it can result in clusters that form a chain between the patterns; this can happen when clusters are well outlined but not well separated. An illustration of how the algorithm works is in the top-left part of Figure 1.
Complete Linkage In this method the distance between two clusters is represented by the distance of the pair of farthest data points belonging to different clusters. More formally, given two clusters G and H and a distance function d, the complete linkage dissimilarity is defined as:

d_C(G, H) = \max_{x_i \in G,\, x_j \in H} d(x_i, x_j)

This method identifies, above all, clusters of elliptical shape. The algorithm favours the homogeneity of the elements within each group at the expense of the differentiation between groups. An illustration of how the algorithm works is in the centre-left part of Figure 1.
Average Linkage The distance between two clusters is represented by the average distance over all pairs of data points belonging to different clusters. More formally, given two clusters G and H and a distance function d, the average linkage dissimilarity is defined as:

d_A(G, H) = \frac{1}{|G|\,|H|} \sum_{x_i \in G} \sum_{x_j \in H} d(x_i, x_j)

where |G| and |H| are, respectively, the numbers of observations in clusters G and H. The average linkage produces clusters that are relatively compact and relatively separated. As a drawback, if the sizes of the two clusters to be joined are very different, the resulting distance will be very close to that of the largest cluster. Moreover, average linkage is susceptible to strictly increasing transformations of the dissimilarity, which can change the results; on the contrary, single and complete linkage are invariant to such transformations, as described in Friedman et al. (2001). An illustration of how the algorithm works is in the bottom-left part of Figure 1.
Dendrogram representation Regardless of the aggregation criterion, the produced hierarchy of clusters can be graphically represented by a binary tree called a dendrogram. Each internal node of the dendrogram is a cluster, while each leaf node is a singleton cluster (i.e. an input pattern or observation). The height at which two branches merge measures the distance between the corresponding clusters. An example is reported in Figure 1, where each tree corresponds to one of the aggregation criteria defined above. The clustering algorithms are performed on a toy dataset of ten data points. The height of each node is proportional to the dissimilarity between the corresponding merged clusters. Figures were produced with the ggplot2 package in R; an in-depth description of the library can be found in Wickham (2009). The dendrogram is a useful tool to explore the clustering structure of the data at hand. A data clustering can be obtained by cutting the dendrogram at a desired height: each connected component then forms a cluster. Inspection of the dendrogram can help to suggest an adequate number of clusters present in the data. This number and the corresponding data clustering can be used as the initial solution of more elaborate algorithms.
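The whole agglomerative procedure is available in base R through hclust(). The following sketch, on a toy matrix of ten random observations (chosen here only for illustration, not the exact toy data of Figure 1), builds the hierarchy for the three linkage criteria, draws a dendrogram with plot(), and extracts a flat partition with cutree().

set.seed(1)
# Toy dataset: ten observations (s1..s10) with two features,
# drawn from two shifted Gaussian groups
X <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
           matrix(rnorm(10, mean = 3), ncol = 2))
rownames(X) <- paste0("s", 1:10)

d <- dist(X, method = "euclidean")        # pairwise dissimilarities

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

plot(hc_average, main = "Average linkage dendrogram")

# Cut the hierarchy to obtain a flat partition into 2 clusters
clusters <- cutree(hc_average, k = 2)
clusters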
3 k-means
The k-means algorithm, proposed by MacQueen et al. (1967), is one of the most used and versatile clustering algorithms. The goal of k-means is to partition the observations into a number k of clusters, so that each cluster contains patterns that are similar to each other, while dissimilar observations are put into other clusters. Each cluster is represented by a prototype pattern, called centroid, computed as the centre of mass of all the observations belonging to the cluster. Formally, let X = {x_1, ..., x_n} be a set of n points in a p-dimensional feature space and let k be an integer value; the k-means algorithm seeks to find a set of k centroids \mu_1, ..., \mu_k that minimise the Within Cluster Sum of Squares (WCSS):

WCSS = \sum_{h=1}^{k} \sum_{x_i \in C_h} \lVert x_i - \mu_h \rVert^2

where C_h is the h-th cluster and \mu_h is the corresponding centroid. The k-means algorithm is a two-step iterative procedure that starts by defining k random centroids in the feature space, each corresponding to a different cluster. The patterns are first assigned to the nearest centroid according to the Euclidean distance. Afterwards, each centroid is recomputed as the mean (centre of mass) of the observations belonging to its cluster. Each step is proven not to increase the WCSS, so the procedure converges to a local minimum. The assignment and update steps alternate until no observation is reassigned. The k-means algorithm, as described in Jung et al. (2014), is the following:
Algorithm 1: k-means clustering algorithm

Input: the number of clusters k and a dataset X containing n patterns.
Output: a set of k clusters that minimises the squared-error criterion.

1. Randomly choose k patterns as the initial cluster centroids.
2. Repeat:
   (a) (re)assign each pattern to the cluster corresponding to its nearest centroid;
   (b) update the cluster centroids, i.e. for each cluster compute the mean value of its patterns;
   until no change.
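In base R, kmeans() implements this kind of procedure (several variants of the basic algorithm are available through its algorithm argument). A minimal sketch on a small synthetic matrix follows, where nstart requests several random initialisations and the run with the lowest WCSS is returned; the role of the initialisation is discussed below.

set.seed(2)
# Toy data: two Gaussian groups in two dimensions
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

# k-means with k = 2 and 25 random restarts; the run with the
# lowest total within-cluster sum of squares (WCSS) is returned
km <- kmeans(X, centers = 2, nstart = 25)

km$cluster        # cluster assignment of each observation
km$centers        # final centroids
km$tot.withinss   # WCSS of the returned solution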
It must be noted that this algorithm strongly depends on the initial choice of the centroids. Usually, if no prior knowledge about the location of the centroids exists, the algorithm is run many times with different initial solutions, randomly sampled from the data observations, and the best solution is kept. The number k of clusters is a hyper-parameter to be estimated, and different choices of it are usually compared. Note that the WCSS is a decreasing function of k, so care must be taken when comparing solutions with increasing values of k. Different heuristics to find an appropriate k have been proposed, ranging from graphical tools that evaluate how well a point fits within its cluster, like the silhouette described by Rousseeuw (1987), to information-theoretic criteria that penalise the WCSS to find the right trade-off between maximum information compression (one cluster) and minimum loss of information (n clusters, one for each observation).
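As a hedged sketch of these k-selection heuristics, the following R code (reusing the synthetic matrix X of the previous snippet and the silhouette() function of the cluster package) computes the WCSS curve, whose "elbow" can be inspected visually, and the average silhouette width for a range of k values.

library(cluster)   # provides silhouette()

ks   <- 2:8
wcss <- numeric(length(ks))
sil  <- numeric(length(ks))
d    <- dist(X)

for (i in seq_along(ks)) {
  km      <- kmeans(X, centers = ks[i], nstart = 25)
  wcss[i] <- km$tot.withinss                       # always decreases with k
  sil[i]  <- mean(silhouette(km$cluster, d)[, 3])  # average silhouette width
}

# Inspect the elbow in the WCSS curve and the k maximising the silhouette
plot(ks, wcss, type = "b", xlab = "k", ylab = "WCSS")
ks[which.max(sil)]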
4 Expectation Maximisation (EM)
Another important category of clustering algorithms includes model-based approaches. Here the main idea is that each cluster can be represented by a parametric distribution, such as a Gaussian for continuous data or a Poisson for discrete data. More formally, let X = {x_1, ..., x_n} be the n observed data vectors and let Z = {z_1, ..., z_n} be the n values taken by the hidden variables (i.e., the cluster labels). The Expectation Maximisation (EM) algorithm attempts to find the parameters \theta that maximise the log probability L(\theta) = \log P(X|\theta) = \log \sum_{Z} P(X, Z|\theta) of the observed data, where both \theta and Z are unknown. To deal with this problem, the task of optimising \log P(X|\theta) is divided into a sequence of simpler sub-problems whose solutions \hat{\theta}^{(1)}, \hat{\theta}^{(2)}, ... are guaranteed to converge to a local optimum of \log P(X|\theta). More specifically, the EM algorithm alternates between two phases. In the E-step, it chooses a function f_t that lower-bounds \log P(X|\theta) and for which f_t(\hat{\theta}^{(t)}) = \log P(X|\hat{\theta}^{(t)}). In the M-step, it finds a new parameter set \hat{\theta}^{(t+1)} that maximises f_t. Since the value of f_t matches the objective function at \hat{\theta}^{(t)}, it follows that

\log P(X|\hat{\theta}^{(t)}) = f_t(\hat{\theta}^{(t)}) \le f_t(\hat{\theta}^{(t+1)}) \le \log P(X|\hat{\theta}^{(t+1)}),

so the objective function increases monotonically at each iteration, which guarantees the convergence of the algorithm. Similarly to k-means, EM is an iterative procedure: the E-step and the M-step are repeated until the estimated parameters (means and covariances of the distributions) or the log-likelihood no longer change. The EM clustering algorithm, as described in Jung et al. (2014), can be summarised as follows:

Algorithm 2: EM clustering algorithm

Input: the number of clusters k, a dataset X and a stopping tolerance eps.
Output: a set of k clusters, with weights, that maximises the log-likelihood function.

1. Expectation step: for each data record x, compute the membership probability of x in each cluster h = 1, ..., k.
2. Maximisation step: update the mixture model parameters (probability weights).
3. Stopping criterion: if the stopping criteria are satisfied (convergence of the parameters and of the log-likelihood) then stop, else set j = j + 1 and go to step 1.

The EM technique is quite similar to k-means, and it extends this basic approach to clustering in two important ways:

• The EM clustering algorithm computes probabilities of cluster membership based on one or more probability distributions. Therefore, the goal of the algorithm is to maximise the overall probability or likelihood of the data, given the (final) clusters. See Figure 3.

• Unlike the classical implementation of k-means, the general EM algorithm can be applied to both continuous and categorical variables. In contrast to k-means, the EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result, an actual assignment of observations to clusters can be computed on the basis of the (highest) classification probability.
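The chapter does not prescribe a specific implementation, but in R, Gaussian-mixture clustering by EM is provided, for instance, by the mclust package; the following sketch (on the synthetic matrix X used above) fits a two-component mixture and exposes both the soft membership probabilities and the derived hard assignments.

library(mclust)   # Gaussian mixture models fitted by EM

# Fit a mixture with G = 2 components; the E- and M-steps are
# iterated internally until the log-likelihood stops improving
fit <- Mclust(X, G = 2)

fit$classification    # hard assignment: most probable component per observation
head(fit$z)           # soft assignment: membership probabilities
head(fit$uncertainty) # 1 - maximum membership probability
summary(fit)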
5 External clustering quality measures
In order to compare clustering results with prior knowledge about the real class membership values, several indexes can be used. Some examples are the Rand Index, the Purity Index and the Normalised Mutual Information.
5.1 Rand Index
Let S = {o_1, ..., o_n} be a set of n elements and let X and Y be two partitions of S to compare, where X = {X_1, ..., X_r} is a partition of S into r subsets and Y = {Y_1, ..., Y_s} is a partition of S into s subsets. We can define the following quantities: (i) a is the number of element pairs in S that are in the same subset both in X and in Y; (ii) b is the number of element pairs in S that are in different subsets both in X and in Y; (iii) c is the number of element pairs in S that are in the same subset in X and in different subsets in Y; (iv) d is the number of element pairs in S that are in different subsets in X and in the same subset in Y. The Rand Index R is defined as:

R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}
Intuitively, a + b can be considered as the number of agreements between X and Y, and c + d as the number of disagreements between X and Y. Since the denominator is the total number of pairs, the Rand Index represents the frequency of agreements over the total number of pairs.
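A direct R implementation of this definition, written here as a small illustrative helper (not part of the original chapter), counts agreeing and disagreeing pairs given two label vectors of equal length.

# Rand index between two partitions given as label vectors of equal length
rand_index <- function(labels_x, labels_y) {
  n <- length(labels_x)
  pairs <- combn(n, 2)                      # all pairs of observations
  same_x <- labels_x[pairs[1, ]] == labels_x[pairs[2, ]]
  same_y <- labels_y[pairs[1, ]] == labels_y[pairs[2, ]]
  # a: same subset in both partitions; b: different subsets in both
  (sum(same_x & same_y) + sum(!same_x & !same_y)) / ncol(pairs)
}

rand_index(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))   # example usage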
5.2 Purity
Purity is one of the most useful external criteria for clustering quality evaluation when prior knowledge on the real class memberships is available. To compute purity, each cluster is labelled with the class of the majority of the points it contains. The accuracy of this assignment is then measured by counting the number of correctly assigned points and dividing by the total number of points. Given a set of N elements, a clustering partition \omega = {\omega_1, \omega_2, ..., \omega_K} and the real set of classes C = {C_1, C_2, ..., C_J}, the purity measure is defined as follows:

purity(\omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|

where \omega_k is the set of elements in the k-th cluster and c_j is the set of elements of class C_j.
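An analogous illustrative R helper cross-tabulates cluster labels against class labels, takes the majority class in each cluster and divides by the total number of points.

# Purity of a clustering (cluster labels) with respect to known classes
purity <- function(clusters, classes) {
  tab <- table(clusters, classes)      # contingency table: clusters x classes
  sum(apply(tab, 1, max)) / length(clusters)
}

purity(c(1, 1, 1, 2, 2), c("A", "A", "B", "B", "B"))   # example usage: 0.8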
5.3 Normalised Mutual Information
The Mutual Information (MI) of two random variables is a measure of their mutual dependence: it quantifies the amount of information that one variable gives about the other. Formally, the MI between two discrete random variables X and Y is defined as:

I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)

where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y, respectively.
The concept of MI is closely linked to that of the entropy of a random variable, a fundamental notion in Information Theory that quantifies the "amount of information" held in a random variable. Since the MI is not bounded from above, the Normalised Mutual Information (NMI) has been defined as:

NMI(X, Y) = \frac{2\,I(X, Y)}{H(X) + H(Y)}

where H(X) and H(Y) are the entropies of X and Y, respectively. NMI ranges between 0 and 1, with 0 meaning no agreement between the two variables and 1 meaning complete agreement.
6 An example of cluster analysis application to patient sub-typing
As noticed by Saria and Goldenberg (2015), many diseases, such as cancer or neurodegenerative disorders, are difficult to diagnose and to treat because of the high variability among affected individuals. Precision medicine tries to solve this problem by taking into account individual variability in gene expression, lifestyle and environmental factors (Hood and Friend, 2011). The final aim is to better predict disease progression and transitions between disease stages, and to target the most appropriate medical treatments (Mirnezami et al., 2012). A central role in precision medicine is played by patient sub-typing, whose aim is to identify sub-populations of patients that share similar behavioural patterns. This can lead to more accurate diagnostic and treatment strategies. From a clinical point of view, refining the prognosis for similar individuals can reduce the uncertainty in the expected outcome of a treatment. In the last decade, the advent of high-throughput technologies has provided the means for measuring differences between individuals at the cellular and molecular levels. One of the main goals driving the analysis of high-throughput molecular data is the unbiased discovery of disease sub-types via unsupervised techniques, such as cluster analysis (Sørlie et al., 2001; Serra et al., 2015; Wang et al., 2014).

Here we report an example of application of the hierarchical, k-means and EM algorithms to the clustering of patients affected by breast cancer. The dataset consists of 20 breast cancer patients randomly selected from the TCGA repository (https://tcga-data.nci.nih.gov/tcga/ - Breast invasive carcinoma [BRCA]). The patients are divided into four classes (Her2, Basal, LumA, LumB) using the PAM50 classifier described in Tibshirani et al. (2002). Data were pre-processed as in Serra et al. (2015). Since gene expression data are high-dimensional (more than 20k genes), the genes were grouped into 100 clusters and only the cluster prototypes were used as features for the patient clustering task. Moreover, for the sake of clarity, only 5 randomly chosen patients for each class are used in this example. Figure 4 reports the 3D multidimensional scaling projection of the dataset. All the experiments were performed by using the Euclidean distance and by setting the number of clusters equal to the number of classes (k = 4). Figure 5 shows the results of the applied clustering algorithms: for each method, the true patient classes are represented by different shapes and the cluster assignments by different colours. As we can see from the figure, and also from Table 1, the clustering assignment that best resembles the real classes is the one produced by k-means, which has the highest Rand index (0.87), purity (0.85) and NMI (0.72), meaning that each cluster contains a majority of patients of the same class. Figures were produced with the ggplot2 package in R.
             Complete   Single   Average   k-means   EM
Rand Index     0.81      0.38      0.63      0.87    0.81
Purity         0.75      0.40      0.55      0.85    0.75
NMI            0.65      0.23      0.39      0.72    0.65

Table 1: External validation indexes for clustering evaluation. The results of the different clustering algorithms are compared with the true patient classes.
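As a final hedged sketch of the kind of analysis described in this section, the following R code assumes a hypothetical patient-by-feature matrix expr_mat (for example, the 100 gene-cluster prototypes) and a vector subtype of known PAM50 labels; the actual TCGA download and pre-processing of Serra et al. (2015) are not reproduced here, and rand_index(), purity() and nmi() are the helper functions sketched in the previous section.

# expr_mat: numeric matrix, one row per patient, one column per feature (assumed)
# subtype:  vector of known PAM50 classes for the same patients (assumed)

d <- dist(expr_mat, method = "euclidean")
k <- 4                                     # number of clusters = number of classes

hc <- cutree(hclust(d, method = "complete"), k = k)        # hierarchical clustering
km <- kmeans(expr_mat, centers = k, nstart = 50)$cluster   # k-means
em <- mclust::Mclust(expr_mat, G = k)$classification       # EM (Gaussian mixture)

# External validation against the known subtypes
results <- sapply(list(hierarchical = hc, kmeans = km, em = em),
                  function(cl) c(rand   = rand_index(cl, subtype),
                                 purity = purity(cl, subtype),
                                 nmi    = nmi(cl, subtype)))
round(results, 2)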
References

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin.

Hood, L. and Friend, S. H. (2011). Predictive, personalized, preventive, participatory (P4) cancer medicine. Nature Reviews Clinical Oncology, 8(3):184–187.

Jiang, D., Tang, C., and Zhang, A. (2004). Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386.

Jung, Y. G., Kang, M. S., and Heo, J. (2014). Clustering performance comparison using k-means and expectation maximization algorithms. Biotechnology & Biotechnological Equipment, 28(sup1):S44–S48.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA.

Mirnezami, R., Nicholson, J., and Darzi, A. (2012). Preparing for precision medicine. New England Journal of Medicine, 366(6):489–491.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Saria, S. and Goldenberg, A. (2015). Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30(4):70–75.

Serra, A., Fratello, M., Fortino, V., Raiconi, G., Tagliaferri, R., and Greco, D. (2015). MVDA: a multi-view genomic data integration methodology. BMC Bioinformatics, 16(1):261.

Sørlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19):10869–10874.

Sotiriou, C. and Piccart, M. J. (2007). Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nature Reviews Cancer, 7(7):545–553.

Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition. Academic Press.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567–6572.

Wang, B., Mezlini, A. M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., and Goldenberg, A. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11(3):333–337.

Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Further Reading

Handl, J., Knowles, J., and Kell, D. B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201–3212.

Hartigan, J. A. and Hartigan, J. (1975). Clustering Algorithms, volume 209. Wiley, New York.

Zvelebil, M. and Baum, J. (2007). Clustering methods and statistics. In Understanding Bioinformatics, chapter 16. Garland Science, Oxford.
[Figure 1 panels: dendrograms for single, complete and average linkage computed on the toy dataset of ten points s1–s10; see the caption below.]
Figure 1: The distance between any pair of clusters can be defined in different ways, resulting in different hierarchies. The left side of the figure shows two different clusters in two different colours, with the centroids of the clusters represented by circles. The green lines represent the different alternative distances between the clusters. The corresponding dendrograms are shown next to each distance definition.
Figure 2: k-means results with different values of k in input. A black X represents the cluster centroid.
Figure 3: EM clustering results. Panel (a) shows how the points are assigned to each cluster. Panel (b) shows the uncertainty in the assignment: as we can see, there is higher confidence in the assignment of the points to the red cluster, while there is some uncertainty between the green and blue clusters. Smaller points have less uncertainty in the assignment.
Figure 4: 3D-MDS dataset representation of the example described in the text.
Figure 5: k-means, EM and hierarchical clustering results with 4 clusters. In the scatter plots, the shapes represent the true patient classes, while the colours represent the cluster assignments of each algorithm.