Journal of Visual Languages and Computing 14 (2003) 341–362
www.elsevier.com/locate/jvlc
Gene expression data clustering and visualization based on a binary hierarchical clustering framework
Lap Keung Szeto^a, Alan Wee-Chung Liew^a,*, Hong Yan^a,b, Sy-sen Tang^a
^a Department of Computer Engineering and Information Technology, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong
^b School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia
Received 7 January 2003; received in revised form 15 March 2003; accepted 15 April 2003
Abstract

Gene expression data analysis has recently emerged as an active area of research. An important tool for unsupervised analysis of gene expression data is cluster analysis. Although many clustering algorithms have been proposed for this task, problems such as estimating the right number of clusters and adapting to different cluster characteristics are still not satisfactorily addressed. In this paper, we propose a binary hierarchical clustering (BHC) algorithm for the clustering of gene expression data. The BHC algorithm involves two major steps: (i) the fuzzy C-means algorithm and the average linkage hierarchical clustering algorithm are used to partition the data into two classes, and (ii) Fisher linear discriminant analysis is applied to the two classes to refine the partition and assess whether it is acceptable. The BHC algorithm recursively partitions the subclasses until no cluster can be partitioned any further. It does not require the number of clusters to be supplied in advance, nor does it place any assumption on the size of each cluster or the class distribution. The BHC algorithm naturally leads to a tree structure representation in which the clustering results can be visualized easily.
© 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Gene expression data analysis; Fuzzy C-means clustering; Hierarchical clustering; Fisher linear discriminant analysis; Binary hierarchical clustering framework; Tree visualization
*Corresponding author. Tel.: +852-2788-7522; fax: +852-2788-8292. E-mail address: [email protected] (A.W.-C. Liew).
1045-926X/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S1045-926X(03)00033-8
1. Introduction

Research in life science, in particular the study of the genetic codes of living organisms, has reached a major milestone with the sequencing of the full human genome. With the availability of this vast number of genetic sequences, the next major challenge is to study gene activities, or gene expression. DNA microarray technology allows massively parallel, high-throughput profiling of gene expression in a single hybridization experiment [1,2]. It allows the study of tens of thousands of different DNA nucleotide sequences simultaneously on a single microscopic glass slide. This has important implications for pharmaceutical and clinical research. For example, by comparing gene expression data in normal and diseased cells, disease genes can be identified to facilitate drug design and therapy planning.

In a microarray experiment, two samples of cDNA, which are reverse transcribed from mRNA purified from cellular contents, are labeled with different fluorescent dyes (usually Cy3 and Cy5) to constitute the cDNA probes. The two cDNA probes are then hybridized onto a DNA microarray. The microarray holds hundreds or thousands of spots, each of which contains a known, unique DNA sequence. These spots are printed onto a glass slide by a robotic arrayer. The DNA in the spots is bonded to the glass to keep it from washing off during the hybridization reaction. If a probe contains a cDNA whose sequence is complementary to the DNA on a given spot, that cDNA will hybridize to the spot, where it will be detectable by its fluorescence. Spots with more bound probe will have more fluorescent dye and will therefore fluoresce more intensely. Once the cDNA probes have been hybridized to the array and any loose probe has been washed off, the array is scanned by a laser scanner to determine how much of each probe is bound to each spot. The hybridized microarray is scanned at the red wavelength (corresponding to the Cy5 dye) and the green wavelength (corresponding to the Cy3 dye), which produces two images, typically in 16-bit TIFF format. The spots are then segmented from the images and their intensities measured [3–5]. The ratio of the two fluorescence intensities at each spot indicates the relative abundance of the corresponding DNA sequence in the two cDNA samples hybridized to the DNA sequence on that spot. Gene expression study can therefore be done by examining the expression ratio of each spot in the Cy3 and Cy5 images.

Once expression data are obtained from the microarray images, the information embedded in the data needs to be analyzed. In gene expression studies, the data are arranged in matrix form, where the rows correspond to genes and the columns correspond to the genes' responses under different experimental conditions. One can examine the expression profiles of different genes by comparing rows of the expression matrix, or study the responses of genes to different experimental conditions by examining the columns of the expression matrix. Gene expression data can be analyzed in two ways: unsupervised and supervised analysis [6]. In supervised analysis, information about the structure or groupings of the objects is assumed known, or at least partially known. This prior knowledge is then used in the analysis process. However, in many situations, for example for a new drug treatment, prior knowledge about the expression data is not available.
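To make the data layout just described concrete, the short sketch below forms a log-ratio expression matrix from two hypothetical channel-intensity arrays. The array values, variable names, and the use of a log2 ratio are illustrative assumptions, not part of the original processing pipeline.

```python
import numpy as np

# Hypothetical background-corrected spot intensities:
# rows = genes (spots), columns = experimental conditions (arrays).
cy5 = np.array([[1200.0,  950.0], [ 300.0,  800.0], [2100.0, 2300.0]])  # red channel
cy3 = np.array([[ 600.0,  900.0], [ 900.0,  400.0], [2000.0, 2250.0]])  # green channel

# Expression matrix: log2 ratio of the two fluorescence intensities per spot.
# Rows can then be compared to study gene profiles; columns to compare conditions.
expression = np.log2(cy5 / cy3)
print(expression)
```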
In this case, the number of meaningful groupings, as well as their structures, needs to be inferred directly from the given data in an unsupervised manner.

Cluster analysis is a powerful tool in the study of gene expression data. However, in most partition-based clustering algorithms, the number of clusters needs to be set by the user. Even for non-partition-based algorithms such as hierarchical clustering, the number of classes is determined by cutting the tree structure at a certain level chosen subjectively by the user. In addition, the metric used in the clustering algorithm usually imposes a structure on the detected clusters; e.g., K-means and fuzzy C-means (FCM) assume the clusters are ellipsoidal in shape. Unfortunately, for many gene expression data sets, we usually do not know the number of clusters present in the data or their distributions and must estimate them from the data.

In this paper, we propose the binary hierarchical clustering (BHC) algorithm for clustering gene expression data. In BHC, the gene expression data are successively divided into subgroups based on the idea of hierarchical binary subdivision of the data [7]. BHC naturally leads to a hierarchical tree structure, which allows easy visualization of the clustering results.

The paper is divided into seven sections. Section 2 provides an overview of cluster analysis of gene expression data. Section 3 presents the BHC framework and discusses the general principles of the BHC algorithm. In Section 4, the actual implementation of the BHC algorithm is described. Section 5 describes how a cluster tree based on the BHC algorithm can be constructed. Clustering experiments on real gene expression data are reported in Section 6. Finally, Section 7 draws the conclusions.
2. Cluster analysis of gene expression data

A gene expression data set usually contains thousands of genes, where each gene can have a dimension ranging from several sample points to 20 or 30 sample points. The data are usually very noisy and often contain many missing values. Missing values are expression ratios that are not available due to poor spot quality or image artifacts. Gene expression data usually exhibit complex distributions, which are not well modeled by ellipsoidal clusters. The clusters in gene expression data also have highly unequal sizes, i.e., the number of data points contained in each cluster varies greatly.

Cluster analysis has proved to be a useful tool for discovering structures and patterns in gene expression data [6,9]. In general, the goal of cluster analysis is to group data with similar patterns into distinct clusters. For gene expression data, each cluster contains genes that respond similarly to a set of stimuli. Cluster analysis helps to reduce the complexity of the gene expression data, since genes with similar patterns are grouped together. It provides a meaningful ordering of the data when the resulting clusters are arranged in a consistent manner, which facilitates visualization and interpretation. It also aids in the discovery of gene function, because similar gene expression profiles can indicate that the genes participate in the same or related cellular processes.
Many clustering methods have been proposed for gene expression data. Some of the most commonly used clustering algorithms are hierarchical clustering, K-means clustering, self-organizing maps (SOM) and principal component analysis [6,9–17].

In hierarchical clustering [18], each gene expression profile is initially considered a cluster. Then, the pair of clusters with the smallest distance between them is merged to form a single cluster. This process is repeated until only one cluster is left. The hierarchical clustering algorithm arranges the gene expression data into a hierarchical tree structure, which allows a biologist to visualize and interpret the result easily. However, the hierarchical tree cannot indicate the optimal number of clusters in the data set. This problem is left to the observer, who has to interpret the tree topologies and identify branch points that segregate clusters of biological relevance. In addition, once a data point is assigned to a node in the tree, it cannot be reassigned to a different node even if it is later found to be closer to that node. Thus, the hierarchical tree is usually suboptimal.

In partition-based clustering algorithms such as K-means clustering [18], the number of clusters is arbitrarily fixed by the user from the beginning. Guesswork is usually needed to decide how many significant clusters there may be in the data. The basic idea of K-means clustering is to partition the data into a predefined number of clusters such that the variability of the expression data within each cluster is minimized. This is achieved by alternately updating the class assignment of each data vector and the class centroids. The Euclidean distance is usually employed in K-means clustering to measure the closeness of a data vector to the cluster centroids. Such a distance metric imposes an ellipsoidal structure on the resulting clusters. Hence, data that do not conform to this structure are poorly clustered by the K-means algorithm.
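As a concrete illustration of these two standard approaches, the sketch below runs average-linkage hierarchical clustering and K-means on a small synthetic expression matrix using SciPy. The synthetic data, the choice of three clusters, and the Euclidean metric are assumptions made only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Toy expression matrix: 60 "genes" measured at 10 sample points.
X = np.vstack([rng.normal(m, 0.3, size=(20, 10)) for m in (-1.0, 0.0, 1.0)])

# Agglomerative clustering: build the full merge tree, then cut it at a
# user-chosen number of clusters (the subjective step noted above).
Z = linkage(X, method='average', metric='euclidean')
hier_labels = fcluster(Z, t=3, criterion='maxclust')

# K-means: the number of clusters must be fixed in advance.
centroids, km_labels = kmeans2(X, 3, minit='points')

print(np.bincount(hier_labels)[1:], np.bincount(km_labels))
```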
3. The binary hierarchical clustering framework

In a general unsupervised clustering problem, we need to consider three constraints: (i) knowledge about the class distributions is not available, and in fact each cluster may have a different distribution; (ii) assumptions about the size or the number of samples belonging to each class cannot be made, and the sizes of the clusters may vary greatly; (iii) knowledge about the number of clusters is generally not available.

BHC is designed to handle the above constraints [8]. The BHC algorithm uses the idea of a hierarchical binary division clustering framework. For example, consider a two-dimensional data set with three distinct classes, as shown in Fig. 1. The algorithm starts by assuming that the data set consists of one class. The first application of BHC generates two clusters, A and BC. As the projection of classes A and BC onto the A–BC discriminant line has a large enough Fisher criterion, the algorithm splits the original data set into two clusters.
Fig. 1. The BHC framework. (a) Original gene expression data treated as one class. (b) The class is split into two clusters, A and BC. (c) Cluster A cannot be split further, but cluster BC is split into two clusters, B and C. (d) Neither cluster B nor cluster C can be split any more, and we have three clusters, A, B, and C.
Then, BHC is applied to each of the two clusters. The BC cluster is separated into clusters B and C because its Fisher criterion is large. However, the Fisher criterion of cluster A is too low to allow further division, so it remains a single cluster. This hierarchical binary division process is repeated until all clusters have Fisher criteria too low for further splitting, i.e., clusters A, B, and C cannot be further subdivided and the process halts. We can see that the BHC algorithm starts with one cluster and performs a successive binary partition of the data into subgroups until there is no more acceptable splitting. Fig. 2 shows a flow chart describing the various processing steps in the BHC algorithm.

The BHC algorithm can be divided into two main steps: (1) binary partitioning of the data, in which the FCM algorithm and the average linkage hierarchical clustering (ALHC) algorithm are used to partition the expression data into two classes; and (2) Fisher discriminant analysis of the two classes, in which the binary partition is accepted if the Fisher criterion exceeds a set threshold, and the data are not subdivided any further otherwise.

The BHC algorithm is non-parametric in nature, so assumptions about the class distributions and the number of clusters are not needed. In BHC, the only parameter required is the threshold for the Fisher criterion.
[Fig. 2 flow chart: START → data preprocessing → assume all gene data initially belong to the same cluster and place this cluster on the stack → fetch one cluster from the stack and attempt to split it using the BHC step (Step 1: split the cluster into two classes; Step 2: perform LDA and compute the maximum Fisher criterion τ_max) → if τ_max > threshold, split the cluster into two clusters and place each on the stack; otherwise mark this cluster as a single class → repeat while clusters remain on the stack → END]
Fig. 2. Flow chart of the BHC clustering algorithm.
Although the Fisher criterion threshold eventually affects the number of clusters found in the data set, the former has a clear relationship to the spatial structure of the cluster under consideration, whereas the latter is difficult to determine from the given data set without any prior knowledge of the data.
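A minimal sketch of the stack-based control flow of Fig. 2 is given below. The functions split_into_two and max_fisher_criterion are hypothetical placeholders for the FCM/ALHC partitioning and the discriminant analysis described in Section 4, and the fixed threshold stands in for the dynamic rule of Section 4.4.

```python
import numpy as np

def split_into_two(data):
    """Placeholder for the FCM + average-linkage two-class partition (Section 4.3)."""
    median = np.median(data[:, 0])
    mask = data[:, 0] <= median
    return data[mask], data[~mask]

def max_fisher_criterion(a, b):
    """Placeholder for the maximum Fisher criterion tau_max (Section 4.4)."""
    sb = np.sum((a.mean(0) - b.mean(0)) ** 2)
    st = np.var(np.vstack([a, b]), axis=0).sum()
    return sb / st if st > 0 else 0.0

def bhc(data, threshold=1.0):
    """Recursively bisect the data; stop splitting when a split is not accepted."""
    clusters, stack = [], [data]
    while stack:                        # fetch one cluster from the stack
        cluster = stack.pop()
        a, b = split_into_two(cluster)
        if len(a) > 1 and len(b) > 1 and max_fisher_criterion(a, b) > threshold:
            stack.extend([a, b])        # accept the split; examine both halves later
        else:
            clusters.append(cluster)    # mark this cluster as a final class
    return clusters

rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(c, 0.2, size=(30, 5)) for c in (-2.0, 0.0, 2.0)])
print([len(c) for c in bhc(toy)])
```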
4. BHC gene expression data clustering

Gene expression data obtained from a microarray experiment need to be preprocessed prior to BHC clustering. The preprocessing generally involves dimensionality reduction, missing value estimation, noise filtering, and outlier elimination. BHC clustering then involves repeatedly partitioning the data into two classes and checking the validity of each partition using Fisher linear discriminant analysis.
The detailed procedures for BHC gene expression data clustering are described in the subsections below.

4.1. Data preprocessing

Typical gene expression data have very high dimension (i.e., a large number of columns). Clustering in very high dimension is generally much more difficult than in low dimension, since two feature vectors that are clearly dissimilar in a low-dimensional space may become less dissimilar in a high-dimensional space if many of the dimensions do not exhibit clear differences in feature values. Additionally, the local optimization procedure in many partition-based clustering algorithms has a much greater chance of converging to a sub-optimal local minimum in a high-dimensional search space. Another problem associated with gene expression data is that of missing values in the data matrix. Missing values occur for diverse reasons, including insufficient image resolution, image corruption, or simply dust or scratches on the slide. In addition, expression ratios that fail some quality control criteria during image analysis are often excluded from subsequent analysis. Since many analysis algorithms cannot handle incomplete data, such missing values need to be filled in to obtain a complete data matrix prior to data analysis.

Singular value decomposition (SVD) [19,20] is a technique commonly used to reduce data dimensionality. SVD can transform gene expression data from the gene × array space (a high-dimensional space) to a reduced, diagonalized 'eigengenes × eigenarrays' space. Suppose the gene expression data form an M (genes) × N (arrays) matrix A whose number of rows M is usually greater than the number of columns N. The matrix A can be SVD-transformed into the product of an M × N eigengene matrix U, an N × N diagonal matrix D with positive or zero elements (eigenvalues), and the transpose of an N × N eigenarray matrix V, i.e.,

$$A = U D V^T \qquad (1)$$

with

$$U^T U = I, \qquad V^T V = I, \qquad D = \mathrm{diag}(d_1 \ge d_2 \ge \cdots \ge d_N \ge 0),$$
where the superscript T denotes matrix transposition. Although gene expression data are usually of very high dimension, most of the interesting variation is captured in the first few SVD components with large eigenvalues [21]. By discarding SVD components with small eigenvalues, dimension reduction and noise suppression in the data can be achieved. The SVD can also be used to interpolate missing values in the data set [21,22]. In [22], this is done by regressing genes with missing values against the k eigengenes and then using the coefficients of the regression to reconstruct the missing values from a linear combination of the k eigengenes.
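The sketch below illustrates SVD-based dimension reduction together with a simple iterative missing-value fill in NumPy. The row-mean pre-fill, the number of retained components k, and the iterative reconstruction are illustrative assumptions in the spirit of [21,22], not the exact procedure of those references.

```python
import numpy as np

def svd_reduce_and_impute(A, k=3, n_iter=5):
    """Fill missing entries (NaN) using the top-k SVD components, then
    return the data projected onto those components (dimension reduction)."""
    A = np.array(A, dtype=float)
    missing = np.isnan(A)
    # Start from a simple row-mean fill so that an SVD can be computed.
    row_means = np.nanmean(A, axis=1, keepdims=True)
    A_filled = np.where(missing, row_means, A)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(A_filled, full_matrices=False)
        # Rank-k reconstruction from the dominant eigengenes/eigenarrays.
        A_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
        A_filled = np.where(missing, A_hat, A)   # only replace the missing entries
    # Reduced representation: coordinates of each gene in the top-k subspace.
    U, s, Vt = np.linalg.svd(A_filled, full_matrices=False)
    return A_filled, U[:, :k] * s[:k]

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 12))
A[rng.random(A.shape) < 0.05] = np.nan           # inject ~5% missing values
A_complete, A_reduced = svd_reduce_and_impute(A, k=3)
print(A_reduced.shape)                            # (50, 3)
```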
Fig. 3. Euclidean distance and Mahalanobis distance.
Gene expression data usually contain some outliers, which can adversely affect the clustering result. To detect the outliers, we first compute the Mahalanobis distance of each gene expression data vector with respect to the global mean. The Mahalanobis distance between a data vector x and the global mean $\bar{x}$ is defined as $d^2 = (x - \bar{x})^T V^{-1} (x - \bar{x})$, where V is the covariance matrix of the entire data set [18]. In contrast to the simple Euclidean distance, the Mahalanobis distance takes into account the distribution of the data by means of the covariance matrix. In Fig. 3, we see that data point A has a smaller Mahalanobis distance to the mean than data point B, although the Euclidean distance of A to the mean is larger than that of B. This is because A lies along the axis of large variation of the data set and is thus more likely to be a member of the data set than B. Once the Mahalanobis distances are calculated, their standard deviation is computed, and any data vector whose Mahalanobis distance is larger than three standard deviations is considered an outlier and removed from the data set.

4.2. The two-class partitioning problem

In BHC clustering, the data are successively partitioned into two classes. However, direct clustering of the data into two classes using a clustering algorithm such as K-means or FCM may produce poor results, since an actual cluster in the data may be erroneously split into two classes. This is illustrated by the example in Fig. 4, where there are actually three clusters in the data. The desirable two-class partition would be for the two closer clusters to form one partition and the remaining cluster the second partition. However, a direct clustering of the data would favor splitting the middle cluster into two classes, since this configuration results in a smaller sum of within-class dispersions. With this erroneous partition, the subsequent Fisher discriminant analysis would erroneously conclude that the split is not valid and that the data should be considered as just one class.

To avoid this problem, we adopt an over-clustering and merging strategy to generate the two-class partition. The idea is to first cluster the data into several clusters with the FCM algorithm; the clusters are then merged by the ALHC algorithm until only two classes are left. The process is illustrated in Fig. 5. In Fig. 5a, the data are over-clustered into six clusters by the FCM algorithm.
Fig. 4. Erroneous splitting of an actual cluster in the data.
Fig. 5. Partition of the data into two classes. (a) Original data clustered into six clusters, (b) the hierarchical tree constructed from the six clusters, and (c) the final two-class partitioning of the data.
A hierarchical tree is then constructed from the six clusters using the average linkage clustering algorithm, as shown in Fig. 5b. Using the cutoff shown in Fig. 5b, the final partitioning of the data into two classes is shown in Fig. 5c: clusters A and B are merged into one class, and clusters C, E, F, and D are merged into the second class.
Fig. 6. Two-class partitioning of data containing two non-ellipsoidal clusters. (a) Original data, (b) direct clustering into two classes, (c) over-clustering into six classes, and (d) final two-class partition after cluster merging.
Besides solving the mis-partitioning problem of direct two-class clustering, our over-clustering and merging strategy also allows non-ellipsoidal clusters to be partitioned correctly. Fig. 6a shows a data set containing two non-ellipsoidal clusters. A direct clustering into two classes using the FCM algorithm results in the erroneous partition shown in Fig. 6b. This occurs even though the correct number of clusters was supplied to the FCM algorithm. If we over-cluster the data set into six clusters, we obtain the result shown in Fig. 6c. After the merging stage using the ALHC algorithm, we obtain the correct two-class partition shown in Fig. 6d. We can see from this example that the proposed strategy relaxes the assumption about the shape of the clusters in the data.

4.3. The partitioning algorithm

To generate an over-clustering of the data, we adopt the FCM algorithm to cluster the data. We empirically set the number of clusters to six. The six clusters are then merged according to the ALHC rule until two classes are obtained.
The two classes finally constitute an optimal binary partition of the data, in which the between-class distance according to the average linkage rule is maximized.

4.3.1. Centroid initialization

The FCM algorithm requires the initial cluster centroids to be given prior to the actual clustering process. Since the iteration of the FCM algorithm involves a local optimization procedure, the algorithm can be trapped in a local minimum and return a sub-optimal clustering result if poor initial cluster centroids are used. To obtain a reasonable initial clustering of the data, and hence the initial cluster centroids, a forward–backward initialization procedure is used [23]. For the forward stage, a small threshold δ, set equal to 0.1 times the average of the ranges of the feature components in the set of input data vectors, is used to decide whether a new cluster should be formed. The procedure begins by letting the first data vector be the centroid of the first cluster. Then, for each subsequent data vector, the Euclidean distances between it and the set of cluster centroids are computed. If the smallest of these distances is smaller than δ, the data vector is assigned to the cluster with the smallest distance and the centroid of that cluster is updated. If the smallest distance is larger than δ, a new cluster is created with the data vector as its centroid. After the forward stage, a large number of clusters are usually present. The backward stage prunes the number of clusters down to the desired number. Pruning is done by repeatedly merging pairs of clusters, starting with the pair whose centroids are closest together, until the desired number of clusters, six in our case, is reached.

4.3.2. Multiple-class clustering

In many real-world data sets, it is not uncommon for a data vector to belong to multiple clusters. For gene expression data, a gene may participate in more than one biological function, and it may exhibit an expression profile that is a mixture of the profiles belonging to different clusters, where the different clusters reflect the different biological functions of the genes under study. The FCM clustering algorithm allows this property to be reflected in the cluster memberships. Specifically, FCM allows each data point to belong to more than one cluster according to its degree of membership in each cluster, and the clusters in FCM are allowed to overlap each other [24]. This is in contrast to K-means clustering, where a data point either wholly belongs to a cluster or does not. The FCM algorithm attempts to find $\{U, v\}$, where U is the set of membership values and v is the set of cluster centroids, that minimizes the sum of within-class variations $W_i$, i.e.,

$$J_m(U, v) = \sum_{i=1}^{c} W_i, \qquad (2)$$
where the within-class variation for class i is given by $W_i = \sum_{k=1}^{n} u_{ik}^m d_{ik}^2$, subject to the constraint $\sum_{i=1}^{c} u_{ik} = 1$ for all k. The parameter m is called the fuzzy index and determines the fuzziness of the partition; the usual value for m is 2.
The distance $d_{ik}^2$ measures the distance of the data vector $x_k$ to the ith cluster centroid $v_i$. The FCM starts with an initial estimate of the cluster centroids. Then, FCM assigns every data vector a membership grade in each cluster, depending on its distance from the cluster centroids. By iteratively updating the cluster centroids and the membership grades for each data vector, FCM generates a fuzzy partitioning of the data set. The detailed steps of the FCM algorithm are as follows. Given a set of gene expression data $X = \{x_1, x_2, \ldots, x_n\}$, first set the fuzzy index m = 2 and the number of clusters c = 6; then:

(1) Obtain the initial cluster centroids for the six clusters as described in Section 4.3.1.
(2) Calculate the initial ith cluster membership for each gene expression data vector $x_k$ and for each cluster, as
$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{2/(m-1)} \right]^{-1}. \qquad (3)$$
(3) Repeat for iteration $t = 1, 2, \ldots$:
   (a) Calculate the fuzzy cluster centroids $v_i$ as
$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}, \qquad \forall i = 1, 2, \ldots, 6. \qquad (4)$$
   (b) Update the memberships using (3), until the maximum change in the memberships is smaller than a small threshold ε.
(4) Perform a final hard classification by assigning each gene expression data vector to the cluster with the highest membership.

4.3.3. Cluster merging

Since the number of clusters obtained from the FCM is six, the closest clusters need to be successively merged using the ALHC algorithm [18] until the final number of clusters equals two. The average linkage cluster distance between cluster r and cluster s is defined as

$$D(r, s) = \frac{sd_{rs}}{N_r N_s}, \qquad (5)$$
where $sd_{rs}$ is the sum of all pairwise distances between cluster r and cluster s (see Fig. 7), and $N_r$ and $N_s$ are the numbers of data points in cluster r and cluster s, respectively. The average linkage distance is less sensitive to outliers than either the single linkage distance or the complete linkage distance [18]. The ALHC algorithm starts by computing the pairwise cluster distances for all six clusters. The two clusters with the smallest pairwise cluster distance are then merged into one cluster, leaving five clusters. The pairwise cluster distances for the five clusters are recalculated, and the closest pair of clusters is merged. The process is repeated until only two clusters are left. The merging rule in ALHC ensures that the final two clusters are the most distant from each other.
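The following sketch combines a plain FCM loop (m = 2, c = 6) with average-linkage merging of the resulting clusters down to two classes, following Eqs. (3)–(5). The random centroid initialization is a simplification of the forward–backward procedure of Section 4.3.1 and, together with the toy data, is an assumption of this sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fcm(X, c=6, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Basic fuzzy C-means; returns memberships U (n x c) and centroids v (c x d)."""
    rng = np.random.default_rng(seed)
    v = X[rng.choice(len(X), size=c, replace=False)]          # simple init (placeholder)
    U = np.full((len(X), c), 1.0 / c)
    for _ in range(max_iter):
        d = cdist(X, v) + 1e-12                               # distances d_ik
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)  # Eq. (3)
        w = U_new ** m
        v = (w.T @ X) / w.sum(axis=0)[:, None]                # centroid update, Eq. (4)
        if np.abs(U_new - U).max() < eps:
            return U_new, v
        U = U_new
    return U, v

def merge_to_two(X, labels):
    """Average-linkage merging (Eq. (5)) of the hard FCM clusters down to two classes."""
    groups = [np.flatnonzero(labels == i) for i in np.unique(labels)]
    while len(groups) > 2:
        best, best_d = None, np.inf
        for r in range(len(groups)):
            for s in range(r + 1, len(groups)):
                d = cdist(X[groups[r]], X[groups[s]]).sum() / (len(groups[r]) * len(groups[s]))
                if d < best_d:
                    best, best_d = (r, s), d
        r, s = best
        groups[r] = np.concatenate([groups[r], groups[s]])    # merge the closest pair
        del groups[s]
    two_class = np.zeros(len(X), dtype=int)
    two_class[groups[1]] = 1
    return two_class

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(mu, 0.3, size=(40, 4)) for mu in (-1.5, 0.0, 1.5)])
U, v = fcm(X)
hard = U.argmax(axis=1)                    # final hard classification (step 4)
print(np.bincount(merge_to_two(X, hard)))
```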
Fig. 7. Average linkage distance between cluster A and cluster B.
4.4. Fisher linear discriminant analysis

Once the gene expression data are partitioned into two classes, Fisher linear discriminant analysis [25] is performed first to refine the partition and then to validate it. The classical Fisher criterion involves solving for the inverse of the within-class scatter matrix. However, during BHC clustering of gene expression data, we often encounter the singular matrix problem because of the high dimensionality of the data and the fact that the number of genes with similar expression profiles is usually small after several levels of binary subdivision. Moreover, the within-class scatter matrix may still be singular even if the data set is of full dimension, since the data in each of the clusters may span a smaller dimension. To solve the singular matrix problem, we first ensure that the data set is of full dimension. This is done by projecting the data onto the subspace $R^m$ spanned by the set of orthonormal basis vectors $P_m$, where $P_m$ is given by the first m eigenvectors associated with the m non-zero eigenvalues of the data covariance matrix [20]. We then define $S_t = S_b + S_w$, where $S_w$ is the within-class scatter matrix given by $S_w = S_1 + S_2$ with

$$S_i = \frac{1}{N_i - 1} \sum_{x \in C_i} (x - m_i)(x - m_i)^T \qquad (6)$$

and $S_b$ is the between-class scatter matrix given by

$$S_b = (m_1 - m_2)(m_1 - m_2)^T \qquad (7)$$

and $m_1$, $m_2$ are the class means of the two classes. Note that $S_t$ is always full rank (thus invertible) even if $S_w$ is rank deficient, because our data set is always of full dimension. Our modified Fisher criterion is then given by

$$\tau(\omega) = \frac{\omega^T S_b \omega}{\omega^T S_t \omega}. \qquad (8)$$
The optimal discriminant vector ω maximizing (8) is given by

$$\omega = S_t^{-1}(m_1 - m_2). \qquad (9)$$
The optimal discriminant vector ω gives a projection direction that maximally separates the two classes. The vector ω can be used to refine the two-class partitioning of the data as follows. Once ω is obtained from (9), we project the data and the mean vectors of the two classes onto it. The projected data are then reclassified according to their proximity to the projected class means. Once the data are reclassified, the class means and the between-class and within-class scatter matrices are updated. The process is iterated until there is no further significant increase in τ(ω).

The maximum Fisher criterion $\tau_{\max} = \max_{\omega} \tau(\omega)$ indicates the best linear separation of the two classes in the given data set. This value can be used to assess the validity of the two-class partitioning of the data. If $\tau_{\max}$ is smaller than a small threshold η, the two-class partition is not valid and the binary subdivision of the data is halted.

The BHC algorithm assumes the initial gene expression data to be a single cluster and then performs a hierarchical binary subdivision of the data. Initially, each subclass contains many true clusters and the compactness of the subclass is very low. After several binary subdivision stages, the two subclasses become more compact if the binary subdivision is a valid one. Thus, we expect $|S_t|$ to decrease with the subdivision process while $|S_b|$ remains relatively unchanged. This implies that $\tau_{\max}$ will tend to increase with each valid binary subdivision. Based on the above reasoning, a fixed threshold η would not be appropriate. Instead, we dynamically update η at each binary subdivision stage as follows. The first binary subdivision is assumed to be always valid, and η is set equal to the $\tau_{\max}$ at that stage. For each subsequent binary subdivision, η is set to the average of the $\tau_{\max}$ values at the last three binary subdivision stages.
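A small sketch of the modified Fisher criterion and the projection-based refinement loop of Eqs. (6)–(9) is given below. The iteration count, convergence tolerance, rough initial partition, and the omission of the full-dimension projection step are simplifying assumptions of this sketch.

```python
import numpy as np

def fisher_refine(X, labels, n_iter=10, tol=1e-6):
    """Refine a two-class partition with the modified Fisher criterion
    tau(w) = (w^T Sb w) / (w^T St w), where St = Sb + Sw (Eqs. (6)-(8))."""
    labels = labels.copy()
    tau_prev = -np.inf
    for _ in range(n_iter):
        m1, m2 = X[labels == 0].mean(0), X[labels == 1].mean(0)
        S1 = np.cov(X[labels == 0], rowvar=False)     # Eq. (6) for class 1
        S2 = np.cov(X[labels == 1], rowvar=False)     # Eq. (6) for class 2
        Sb = np.outer(m1 - m2, m1 - m2)               # Eq. (7)
        St = Sb + S1 + S2                             # St = Sb + Sw
        w = np.linalg.solve(St, m1 - m2)              # Eq. (9)
        tau = (w @ Sb @ w) / (w @ St @ w)             # Eq. (8)
        # Reassign each point to the nearer projected class mean.
        p, p1, p2 = X @ w, m1 @ w, m2 @ w
        labels = (np.abs(p - p2) < np.abs(p - p1)).astype(int)
        if tau - tau_prev < tol:
            break
        tau_prev = tau
    return labels, tau

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 0.4, size=(40, 6)), rng.normal(1, 0.4, size=(40, 6))])
rough = (X[:, 0] > 0).astype(int)           # a rough initial two-class partition
labels, tau_max = fisher_refine(X, rough)
print(round(tau_max, 3))                    # accept the split if tau_max exceeds eta
```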
5. Cluster tree visualization

The binary hierarchical framework naturally leads to a tree structure representation similar to that of hierarchical clustering. Similar clusters are arranged to be adjacent to each other in the tree. At the root of the tree, the entire data set is treated as a single cluster. At each level down the tree, two branches originate from each node, representing the two classes given by the binary subdivision. Near the root of the tree, only gross structures in the data are shown, whereas near the end branches, fine details in the data can be visualized.

The tree structure can show the relationships between clusters, the adjacency between different clusters, as well as the variation within each cluster. Using the tree, a user can easily decide whether two adjacent clusters should be merged, based on his/her knowledge about the biological process involved and the similarity of the visual patterns in the display. It allows user judgment in finalizing the clustering result using additional biological knowledge, in a manner similar to that in hierarchical clustering.
To ensure that adjacent clusters are more similar than non-adjacent clusters, we rearrange the clusters according to the Mahalanobis distance between their centroids, defined as

$$d_{rs}^2 = (x_r - x_s)^T C_{rs}^{-1} (x_r - x_s), \qquad (10)$$
where $C_{rs} = 0.5\,\mathrm{Cov}_r + 0.5\,\mathrm{Cov}_s$ is the pooled sample covariance matrix for clusters r and s. The procedure for rearranging the tree is as follows. Let the tree have k levels, with the first binary subdivision at level one and the root of the tree at level zero. The root has only one branch and one cluster, which consists of all the data. Then, at each level j, starting from the deepest level k and working backward to level 2 of the tree, do for each branch at level j − 2:

(1) Collect all pairs of clusters at level j originating from different branches at level j − 1. The number of pairs is either four (if both branches at level j − 1 are successfully split), two (if only one branch at level j − 1 is successfully split), or zero (if both branches at level j − 1 fail to split).
(2) If the number of pairs from step 1 is not zero, swap the positions of the two clusters originating from the same branch at level j − 1 such that the two clusters in the closest pair obtained in step 1 are adjacent to each other. The number of swaps can be zero, one, or two, depending on which pair from step 1 is the closest.

We illustrate the above procedure with an example. Fig. 8 shows a 3-level tree in which some non-adjacent clusters are more similar than adjacent clusters. Starting at level 3 and working backward, we collect all possible pairs of clusters. For the top branch at level 1, the possible pairs are (A, C), (A, D), (B, C), (B, D), which lead to the possible arrangements (B, A, C, D), (B, A, D, C), (A, B, C, D), (A, B, D, C), respectively, depending on which pair is the most similar. Suppose the pair (B, D) is the most similar; then we have the arrangement shown in Fig. 8b, which corresponds to the arrangement (A, B, D, C) at the top branch of level 1. The same operation is applied to the lower branch of level 1; in this example, those clusters are already in order and do not need to be rearranged. Next, we consider level 2. We collect all pairs of clusters originating from different branches at level 1 but from the same branch at level 0. As level 0 is the root of the tree, it has only one branch and the rearrangement process terminates after this stage. The possible pairs are (A, E), (A, G), (C, E), (C, G), which lead to the possible arrangements (C, D, B, A, E, F, G), (C, D, B, A, G, F, E), (A, B, D, C, E, F, G), and (A, B, D, C, G, F, E), respectively. Suppose the pair (C, G) is the most similar; then we have the final arrangement shown in Fig. 8c, where all adjacent clusters are more similar than non-adjacent clusters.
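A minimal sketch of the cluster-similarity measure of Eq. (10) is shown below; only the pairwise pooled-covariance Mahalanobis distance is computed, not the level-by-level swapping itself, and the toy clusters are illustrative.

```python
import numpy as np

def pooled_mahalanobis(cluster_r, cluster_s):
    """Mahalanobis distance between two cluster centroids using the
    pooled covariance C_rs = 0.5*Cov_r + 0.5*Cov_s (Eq. (10))."""
    xr, xs = cluster_r.mean(axis=0), cluster_s.mean(axis=0)
    C = 0.5 * np.cov(cluster_r, rowvar=False) + 0.5 * np.cov(cluster_s, rowvar=False)
    diff = xr - xs
    return float(diff @ np.linalg.pinv(C) @ diff)   # pinv guards against a singular C

rng = np.random.default_rng(5)
A = rng.normal(0.0, 0.5, size=(30, 4))
B = rng.normal(1.0, 0.5, size=(30, 4))
C = rng.normal(4.0, 0.5, size=(30, 4))
# Smaller distance = more similar; adjacent leaves in the tree should minimize this.
print(pooled_mahalanobis(A, B) < pooled_mahalanobis(A, C))   # expected: True
```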
6. Clustering results

The gene expression data set for evaluating the performance of the BHC algorithm comes from Spellman et al. (http://cellcycle-www.stanford.edu) [12].
Fig. 8. Tree structure visualization of BHC clustering result. Clusters are sorted according to their Mahalanobis distances such that adjacent clusters are more similar than non-adjacent clusters.
It contains expression profiles for 6178 genes under different experimental conditions, i.e., the cdc15, cdc28, alpha factor, and elutriation experiments. Data with many missing values are filtered out. The BHC algorithm is applied to the data set from each experiment, and the results are displayed using the cluster tree visualization.

In the alpha factor experiment, the gene expression data set contains 4489 genes with 18 sample points each; fourteen clusters are found (Fig. 9). In the cdc15 experiment, the gene expression data set contains 4381 genes with 24 sample points each; fourteen clusters are found (Fig. 10). In the elutriation experiment, the gene expression data set contains 5766 genes with 14 sample points each; nineteen clusters are found (Fig. 11). In the cdc28 experiment, the gene expression data set contains 1383 genes with 17 sample points each; twenty-five clusters are found (Fig. 12).

We see that genes with similar expression profiles are successfully clustered into the same group by the BHC algorithm. The gene expression data appear less random after BHC clustering when compared to the original data. By visualizing the results using the cluster tree, biologists can interpret the clustering results easily and search the tree for meaningful clusters or groups of closely related clusters [17].
Fig. 9. The alpha factor experiment data set. (a) Original gene expression data. (b) Expression data after BHC clustering.
7. Conclusions

In this paper, we have described the unsupervised clustering of gene expression data using the BHC algorithm. BHC uses a binary hierarchical framework to cluster the data. It involves using the FCM and average linkage algorithms to split the data into two classes, and then refining and verifying the validity of the split by Fisher linear discriminant analysis. The main advantages of the BHC clustering algorithm are: (1) the number of clusters can be estimated from the data directly using a binary hierarchical framework; (2) no constraint on the number of samples in each cluster is required; and (3) no prior assumption about the class distribution is needed. Moreover, non-ellipsoidal clusters in the data can also be handled successfully. BHC can be seen as a hybrid between hierarchical-based clustering and partition-based clustering, and it therefore combines the merits of both clustering techniques. We have also described how the singular matrix problem in the Fisher linear discriminant analysis can be handled.
Fig. 10. The cdc15 experiment data set. (a) Original gene expression data. (b) Expression data after BHC clustering.
Fig. 11. The elutriation experiment data set. (a) Original gene expression data. (b) Expression data after BHC clustering.
The binary hierarchical framework naturally leads to a cluster tree representation. By visualizing the clustering results using a cluster tree, the relationships between clusters, the adjacency between different clusters, as well as the variation within each cluster can be observed easily.
Fig. 12. The cdc28 experiment data set. (a) Original gene expression data. (b) Expression data after BHC clustering.
The tree display thus allows user interaction in finalizing the clustering result using additional biological knowledge, in a manner similar to that in hierarchical clustering.
Our experimental results using real gene expression data indicate that the proposed clustering algorithm can successfully group the expression data into visually distinct clusters.
Acknowledgements This work is supported by CityU SRG Grant 7001183 and Interdisciplinary Research Grant 9010003.
References

[1] S.K. Moore, Making chips, IEEE Spectrum, 2001, pp. 54–60.
[2] D.J. Lockhart, E.A. Winzeler, Genomics, gene expression and DNA arrays, Nature 405 (2000) 827–846.
[3] M. Eisen, ScanAlyze User Manual, Stanford University, 1999, http://rana.lbl.gov/EisenSoftware.htm
[4] Axon Instruments Inc., GenePix Pro 3.0, 2001.
[5] A.W.C. Liew, H. Yan, M. Yang, Robust adaptive spot segmentation of DNA microarray images, Pattern Recognition 36 (5) (2003) 1251–1254.
[6] A. Brazma, J. Vilo, Minireview: gene expression data analysis, European Molecular Biology Laboratory, Outstation Hinxton—the European Bioinformatics Institute, Cambridge, UK, 2000.
[7] D.A. Clausi, K-means Iterative Fisher (KIF) unsupervised clustering algorithm applied to image texture segmentation, Pattern Recognition 35 (2002) 1959–1972.
[8] L.K. Szeto, A.W.C. Liew, H. Yan, S.S. Tang, Gene expression data clustering and visualization using a binary hierarchical clustering framework, Proceedings of the First Asia–Pacific Bioinformatics Conference APBC2003, Adelaide, Australia, 4–7 February 2003, pp. 145–152.
[9] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proceedings of the National Academy of Sciences USA 95 (1998) 14863–14868.
[10] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, T.R. Golub, Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proceedings of the National Academy of Sciences USA 96 (1999) 2907–2912.
[11] N.S. Halter, A. Maritan, M. Cieplak, N.V. Fedoroff, J.R. Banavar, Dynamic modelling of gene expression data, Department of Physics and Center for Materials Physics, and Department of Biology and the Life Sciences Consortium, Pennsylvania State University, University Park, PA, 2000.
[12] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, B. Futcher, Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Molecular Biology of the Cell 9 (1998) 3273–3297.
[13] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences USA 96 (1999) 6745–6750.
[14] C.M. Perou, S.S. Jeffrey, M. van de Rijn, C.A. Rees, M.B. Eisen, D.T. Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C.F. Lee, D. Lashkari, D. Shalon, P.O. Brown, D. Botstein, Distinctive gene expression patterns in human mammary epithelial cells and breast cancers, Proceedings of the National Academy of Sciences USA 96 (1999) 9212–9217.
[15] K.P. White, S.A. Rifkin, P. Hurban, D.S. Hogness, Microarray analysis of Drosophila development during metamorphosis, Science 286 (1999) 2179–2184.
[16] K.Y. Yeung, W.L. Ruzzo, Principal component analysis for clustering gene expression data, Bioinformatics 17 (9) (2001) 763–774.
[17] C. Tang, L. Zhang, A. Zhang, Interactive visualization and analysis for gene expression data, IEEE Proceedings of the Hawaii International Conference on System Sciences, Vol. 6, Big Island, HI, 2002, pp. 143–166.
[18] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[19] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd edition, Cambridge University Press, Cambridge, 1992.
[20] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, Cambridge, 1998.
[21] O. Alter, P.O. Brown, D. Botstein, Singular value decomposition for genome-wide expression data processing and modeling, Proceedings of the National Academy of Sciences USA 97 (2000) 10101–10106.
[22] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R.B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics 17 (6) (2001) 520–525.
[23] A.W.C. Liew, S.H. Leung, W.H. Lau, Fuzzy image clustering incorporating spatial continuity, IEE Proceedings - Vision, Image and Signal Processing 147 (2) (2000) 185–192.
[24] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[25] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley Interscience, New York, 2001.