Chapter 13
ON ALIGNMENT-BASED CLUSTERING OF GENE EXPRESSION TIME-SERIES SIGNALS
Numanul Subhani∗, Luis Rueda∗, Alioune Ngom∗, and Conrad Burden†
∗ School of Computer Science, 5115 Lambton Tower, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada. {hoque4,lrueda,angom}@uwindsor.ca.
† Conrad Burden: Centre for Bioinformation Science, Mathematical Sciences Institute and John Curtin School of Medical Research, The Australian National University, Canberra, ACT 0200, Australia.
[email protected]
Abstract A common problem in biology is to partition a set of experimental data into clusters in such a way that the data points within the same cluster are highly similar, while data points in different clusters are as dissimilar as possible. An important process in functional genomic studies is clustering microarray time-series data, where genes with similar expression profiles are expected to be functionally related or co-related. Clustering gene expression data given in terms of time-series is a challenging problem that imposes its own particular constraints, namely exchanging two
or more time points is not possible, as it would deliver quite different results and would also lead to erroneous biological conclusions. Traditional clustering methods based on conventional similarity measures are not suitable for time-series clustering. Clustering genes based on the similarity of their temporal profiles is a daunting task for the large datasets produced in an experiment. Various clustering methods for time-series gene expression data have been proposed, which take the temporal dimension of the data into account. In this chapter, we review alignment-based clustering approaches for time-series profiles. These methods also consider the temporal relationships between and within the time-series profiles. We investigate the performance of these alignment methods on many datasets, compare recently proposed methods, and discuss their strengths and weaknesses.
13.1 Introduction
Clustering is a multivariate analysis technique used to discover unknown patterns or groups in data, and it is appropriate when there is no a priori knowledge about the underlying data. Clustering, the process of grouping similar entities, can be applied to any data, such as genes, samples, or time points in a time-series. The particular type of input makes no difference to the clustering algorithm, which treats every input as an n-dimensional feature vector. To group objects that are similar, we need a precise definition of a similarity measure. Such a measure can be calculated in many different ways, depending on the representation of the gene expression profiles. We discuss the clustering problem of microarray time-series gene expression profiles. To discuss these approaches, we first state the time-series clustering problem more formally. Let $D = \{x_1(t), \ldots, x_s(t)\}$ be a dataset in which each $x_i = [x_{i1}, \ldots, x_{in}]^t$ is an n-dimensional feature vector that represents the expression level of gene $i$ at n different time points, $t = [t_1, \ldots, t_n]^t$. We want to partition
a set of s profiles, $D$, into k disjoint clusters $C_1, \ldots, C_k$, $1 \le k \le s$, such that (i) $C_i \neq \emptyset$, $i = 1, \ldots, k$; (ii) $\bigcup_{i=1}^{k} C_i = D$; and (iii) $C_i \cap C_j = \emptyset$, $i \neq j$, $i, j = 1, \ldots, k$.
In addition, each profile is assigned to the cluster to which its distance is smallest. We consider the specific case of time-series clustering, where the order of the time points cannot be changed, because different permutations give different results that are biologically meaningless. In this chapter, we present a comprehensive review of clustering approaches for analyzing microarray time-series data. In the following sections, we highlight the key features of each method, provide a methodological comparison that covers their main advantages and drawbacks, and perform an experimental comparison of some of these methods.
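The three partition conditions stated above can be checked mechanically. The following is a minimal sketch, not part of the chapter's methods: the function name `is_valid_partition` and the use of gene labels as set elements are assumptions made only for illustration.

```python
from typing import Hashable, Iterable, List, Set


def is_valid_partition(dataset: Iterable[Hashable],
                       clusters: List[Set[Hashable]]) -> bool:
    """Check conditions (i)-(iii) for a candidate clustering of `dataset`.

    Profiles are represented by hashable labels (hypothetical gene names).
    """
    data = set(dataset)
    # (i) every cluster must be non-empty (and there must be at least one)
    if not clusters or any(len(c) == 0 for c in clusters):
        return False
    # (ii) the union of all clusters must equal the whole dataset
    if set().union(*clusters) != data:
        return False
    # (iii) pairwise disjointness: given (ii), the cluster sizes add up
    # to the dataset size exactly when no profile appears in two clusters
    return sum(len(c) for c in clusters) == len(data)
```

For example, `is_valid_partition(["g1", "g2", "g3"], [{"g1"}, {"g2", "g3"}])` holds, whereas a clustering that repeats `"g1"` in two clusters, leaves a gene out, or contains an empty cluster is rejected.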
13.2 Current Clustering Methods for Gene Expression Time-Series Profiles
Many clustering methods for time-series gene expression data have been developed. Most of these approaches are either directly adopted or somewhat modified from the standard clustering methodology available in the fields of classification and pattern recognition. In general, these methods take gene temporal expression profiles and apply conventional metrics and algorithms, like any general clustering approach, and do not perform any profile alignment before applying a similarity measure.

A partitional clustering method based on k-means was applied in [22] to cluster gene expression temporal data. In k-means, a dataset is partitioned into k predefined clusters by iteratively reallocating cluster members such that the overall within-cluster dispersion is minimized. The Euclidean distance was used to compute the dissimilarity between each pair of genes in the feature space. The application of k-means to this dataset revealed new sets of co-regulated genes and their putative cis-regulatory elements. This approach does not require any prior knowledge about the structure, except that
the value of k needs to be known a priori, nor does it make any assumptions about the dynamics of the expression profile. The complete lack of structure in the clustering may result in inconsistent clusters, since the clustering may reinforce local groups [15].

In [15], Tamayo et al. applied self-organizing maps (SOM) to visualize and interpret the patterns of gene temporal expression profiles. The SOM, a type of mathematical cluster analysis, is well suited to exploratory analysis of the data and to revealing relevant patterns in large, high-dimensional datasets. The method maintains a set of nodes, a topology, and a distance function on the nodes, and the algorithm iteratively maps the nodes into the feature space of the genes. Deciding upon the number of nodes is important, however: if the number is too small, the patterns cannot be distinguished properly (due to a large within-cluster scatter), whereas adding nodes beyond a certain value (one that already produces distinct patterns) fails to produce any new patterns.

A Bayesian approach for improving the clustering results of gene expression series using rough knowledge regarding the general shapes of the classes was proposed in [3]. Knowledge about the general shapes can be elementary, for example regarding the change of the mean expression level over time. The information regarding the shape of a class is integrated directly into the model so that classes with the desired profiles are favored. If no such information is available, classical clustering using a Bayesian approach is performed, which intuitively deals with the temporal nature of the data. The effectiveness of the method in recovering a particular class of genes, for which there was prior knowledge, was demonstrated using a dataset composed of a mixture of real and synthetic data.
This dataset was constructed by injecting some synthetic data into the original Fibroblast dataset [24]. The experiments show that the method can recover the synthetic data with very high accuracy.

A Bayesian method was also applied for model-based clustering where the
models are autoregressive curves of fixed order [12]. To search for the most likely set of clusters given the temporal expression data, an agglomerative procedure was used. During clustering, the approach explicitly takes into account the dynamic nature of gene expression time-series data and provides a principled way to identify the number of distinct clusters. This method can be viewed as a specialized version of Bayesian Clustering by Dynamics (BCD), in which two time series are considered similar if they are generated by the same stochastic process. The method models temporal gene expression profiles by autoregressive equations from which a posterior probability is derived, and it agglomeratively picks the models that have maximum posterior probability. It is also able to identify the optimal number of clusters based on the well-known Akaike information criterion.

A Hidden Markov Model (HMM) that accounts for horizontal dependencies in gene expression time-series data was proposed in [23]; that approach focuses on univariate emission probability densities. Model-based clustering approaches are usually effective when grammatical or structural properties of the data can be explicitly expressed, and HMMs can be used by these types of approaches to partition gene expression time-series data into clusters. The process starts with an initial collection of HMMs that encompasses typical qualitative behavior, then iteratively finds cluster models and assigns data points to these models in such a way that the joint likelihood of the clustering is maximized. The approach also provides a method for partially supervised learning, which allows groups of labeled data to be added to the initial collection of clusters. The method can cope effectively with unlabeled data, as well as with additional labeled data; the information contained in the labeled data is used to produce high-quality clusters.
This method is also robust with respect to noisy and frequently missing data.

In [10], the authors proposed a new similarity measure based on the correlation coefficient. Any clustering technique attempts to group genes with
similar expression patterns (i.e., coexpressed genes). A similarity measure is a pairwise measure of coexpression that, ideally, assigns high scores to coexpressed ORFs and low scores to unrelated ORFs. Jackknife correlation gives high scores to gene pairs that exhibit similar behavior throughout the time points and is robust to outliers. It can also reduce false positives, i.e., cases in which the similarity measure gives a high score to a pair of dissimilar profiles.

In [21], an order-restricted inference method was used for selecting and clustering gene expression profiles for time-series or dose-response data. This approach uses known inequalities among parameters and applies the ideas of order-restricted inference. The entire process is carried out in multiple steps. First, potential candidate profiles of interest are defined and expressed in terms of inequalities between the expected gene expression levels at various time points; then the mean expression level of each gene is estimated using the method proposed in [7]. Each gene is assigned to a profile based on a goodness-of-fit criterion and a bootstrap test procedure [20], and the process is repeated for each gene in the dataset under consideration. This method makes use of the ordering in a time-series study and can detect genes more sensitively by using their temporal ordering and finding consistent patterns over time.

Conesa et al. [1] proposed a statistical procedure that identifies genes with different expression profiles across analytical groups in time-series experiments. It is a general regression-based approach suitable for analyzing single or multiple microarray temporal data. This method, referred to as maSigPro (microarray Significant Profiles), is a two-step regression strategy in which the experimental groups are identified by dummy variables. The procedure first adjusts a global regression model with all the defined variables to identify differentially expressed genes.
Profiles with statistically significant differences are then found by applying a variable selection strategy that studies the differences between the groups. The model parameters are adjusted based on the data under study and the specific interests of the researcher. The proposed method does not require multiple pairwise comparisons, yet it is able to detect significant profile
differences. It allows for unbalanced designs and heterogeneous sampling times. This approach can be used to find genes with significant temporal expression changes between experimental groups, and to analyze the magnitude of these differences.

Ernst et al. [8] proposed a clustering algorithm that specifically addresses the challenges inherent to short time-series expression datasets. The algorithm first selects a set of potential distinct patterns that can be expected from the experiment and then assigns each gene to the pre-selected profile that best represents it. They proposed a method for assigning genes to profiles and determining the significance of each profile. Significant profiles are retained for further analysis and can be combined to form clusters. This approach uses the correlation coefficient to measure the similarity between expression profiles, and it identified a more coherent set of genes than contemporary clustering methods based on k-means by grouping together the temporal profiles of relevant functional categories. The method can work with a dataset with no repeats by leveraging the statistical power obtained from the large number of genes being profiled simultaneously.

In [29], Bar-Joseph et al. focused on an analysis of gene temporal expression profiles that can cope with missing values (unobserved time points) and non-uniformly sampled data. Gene temporal expression profiles are represented as continuous curves using statistical spline estimation. They also proposed a model-based clustering algorithm and an alignment procedure. The alignment algorithm also uses the spline representation and warps the time scale of one realization into that of another. The clustering algorithm, which uses a modified Expectation Maximization (EM) algorithm, operates directly on the continuous representations of the gene expression profiles, thus permitting interpolation of missing values.
The alignment algorithm attempts to produce an optimal alignment (one that maximizes the similarity between the two sets of expression profiles) by adjusting the function parameters. This approach is able to reconstruct unobserved time points with 10-15% fewer errors than other methods. The clustering approach was also effective with non-equidistantly sampled time-series data, and the continuous alignment algorithm allows the number of degrees of freedom of the warp to be controlled, which helps avoid overfitting and the difficulties faced by discrete methods. While this method is suitable for the analysis of relatively long time series (10 time points or more), its suitability for shorter time-series experiments has not been demonstrated.

In [19], Déjean et al. also attempted to obtain relevant clusterings of gene expression temporal profiles by identifying homogeneous clusters of genes. This method focuses on the shapes of the curves and not on the absolute levels of expression (or expression ratios). It combines spline smoothing and first-derivative computation with hierarchical and partitional clustering. Smoothing the temporal profiles is required in order to obtain regular and differentiable functions. The approach is therefore based on the framework of functional data analysis [17], with a focus on the first derivative of the curves by means of a priori spline smoothing. Spline smoothing is performed first, and the derivatives of the resulting continuous smoothed curves are then clustered. Both hierarchical clustering and k-means were applied to obtain the clusters. The smoothing parameter in the first step is tuned using a heuristic approach that takes into account both statistical and biological considerations.

In [4], the authors proposed a similarity measure for co-expressed genes based on the rate of change of the expression level across time points. The profiles were considered as piecewise linear functions, and the similarity was calculated by measuring the difference of slopes between the functions. Unequal time intervals are considered and viewed as weights.
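The slope-based idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the exact distance formula of [4] is not reproduced in this chapter, so the sketch assumes the distance is the square root of the summed squared differences between the per-interval slopes of the two piecewise linear profiles, with the interval lengths entering through the slope denominators. The names `interval_slopes` and `slope_distance` are hypothetical.

```python
import math
from typing import List, Sequence


def interval_slopes(values: Sequence[float],
                    times: Sequence[float]) -> List[float]:
    """Slope of each linear piece between consecutive time points."""
    if len(values) != len(times) or len(values) < 2:
        raise ValueError("need matching sequences with at least two time points")
    return [
        (values[k + 1] - values[k]) / (times[k + 1] - times[k])
        for k in range(len(values) - 1)
    ]


def slope_distance(x: Sequence[float], y: Sequence[float],
                   times: Sequence[float]) -> float:
    """Assumed slope-based distance between two profiles sampled at `times`."""
    sx = interval_slopes(x, times)
    sy = interval_slopes(y, times)
    # Compare the profiles piece by piece through their rates of change
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sx, sy)))
```

Under this assumption, two profiles that differ only by a constant vertical offset have identical slopes and therefore a distance of zero, which is the behavior one would expect from a measure that compares rates of change rather than absolute levels.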
The proposed algorithm, motivated by the advantages of fuzzy clustering, is referred to as fuzzy short time-series (FSTS) clustering; it incorporates the distance measure proposed in that work into the fuzzy c-means clustering scheme. The performance of the method was shown to be better than that of the conventional approaches
(fuzzy c-means, k-means, average linkage hierarchical clustering and random clustering) for unevenly distributed time points. The method benefits from the fuzzy approach, which inherently accounts for noise in the data and allows genes to belong to more than one group, and it is designed to cope with short and unevenly sampled time-series data.

The area-based profile alignment proposed in [11] takes two feature vectors and produces two new vectors in such a way that the area between the “aligned” vectors is minimized. The profile alignment method that takes the length of the intervals between the time points into consideration was proposed in [11]. That approach considers the weights of the intervals equally, irrespective of the actual size of the interval of the measurement. The profile-alignment algorithm takes two feature vectors from the original space as input and outputs two feature vectors in the transformed space after aligning them in such a way that the sum of squared errors is minimized. The alignment of the profiles is done using an area-based distance function rather than conventional distance functions. The area-based distance function is defined by computing the integral distance between the two aligned profiles. In both [11] and [18], hierarchical agglomerative clustering is used, where the decision rule is based on the furthest-neighbor (complete linkage) distance between two clusters. The complete linkage approach calculates the distance between the furthest pair of points for each pair of clusters and merges the two clusters that have the minimum such distance among all pairs of clusters under consideration. That clustering approach performs the pairwise alignment before measuring the distance between two profiles during each iteration, which slows down the computation.

In [28], Zong-Xian Yin et al. proposed a variation-based algorithm.
In that approach, gene expressions are translated into gene variation vectors and the cosine values of these vectors are then used to evaluate their similarity over time. The proposed algorithm is designated as the Variation-based Coexpression Detection algorithm (VCD). That algorithm has two main advantages.
First, it is unnecessary to determine the number of clusters in advance since the algorithm automatically detects those genes whose profiles are grouped together and creates patterns for these groups. Second, the algorithm features a new measurement criterion for calculating the degree of change of the expressions between adjacent time points and evaluating their trend similarities. The use of the cosine measure based on variation vectors not only enables the algorithm to evaluate trend similarities in time-series expressions, but also allows for the accurate evaluation of the co-occurring time of the peaks within the expressions. The performance of the proposed algorithm has been verified via application to three real-world microarray data sets. The experimental results have confirmed the enhanced grouping performance of the proposed algorithm.
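The cosine measure on variation vectors used by VCD can be illustrated with a short sketch. The chapter does not give the exact definition of a variation vector, so the code below assumes it is the vector of differences between adjacent time points; the function names are hypothetical and the sketch covers only the similarity computation, not the grouping step of the algorithm in [28].

```python
import math
from typing import List, Sequence


def variation_vector(profile: Sequence[float]) -> List[float]:
    """Assumed variation vector: change in expression between adjacent time points."""
    return [b - a for a, b in zip(profile, profile[1:])]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine is undefined for a constant (flat) profile")
    return dot / norm


def variation_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Trend similarity of two profiles via the cosine of their variation vectors."""
    return cosine_similarity(variation_vector(x), variation_vector(y))
```

With this definition, two profiles that rise and fall together score close to 1 even when their absolute expression levels differ, while profiles with opposite trends score close to -1.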
13.3 Profile Alignment of Gene Expression Time-Series
The aim of clustering gene expression temporal data is to group together profiles with similar patterns. Deciding upon the similarity often involves pairwise distance measures of co-expression. Conventional distance measures include correlation, rank correlation, Euclidean distance, and the angle between vectors of observations, among others. However, clustering algorithms that apply a conventional distance function (e.g., Euclidean distance or correlation coefficient) to a pair of profiles can fail to reflect the temporal information embedded in the underlying expression profiles. Some alignment techniques resolve this issue effectively by aligning the profiles prior to applying the distance function. The basic idea and the effect of profile alignment when computing the distance between two profiles are illustrated in Fig. 13.1 for a pair of genes. Fig. 13.1(a) shows the genes prior to alignment. Using the Pearson correlation distance [21], the genes in (a) are unlikely to be clustered together. If the prime interest is to cluster genes according to the variation of their expression level at different time points, then the genes in (b) (after alignment) would be better clustered together.

Figure 13.1: (a) Unaligned profiles, and (b) aligned profiles. Both panels plot the expression ratio against time in hours.
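The effect shown in Fig. 13.1 can be reproduced numerically with a small sketch. It assumes the simplest discrete form of alignment, in which one profile is shifted vertically by the constant that minimizes the sum of squared differences at the sampled time points (this constant is the mean of the pointwise differences). The function names and the example values are hypothetical and are not taken from the figure.

```python
import math
from typing import List, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Conventional Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def align_by_offset(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Shift y vertically so that the sum of squared differences to x is minimal.

    Minimizing sum((x_i - (y_i - a))**2) over a gives a = mean(y_i - x_i).
    """
    a = sum(yi - xi for xi, yi in zip(x, y)) / len(x)
    return [yi - a for yi in y]


# Hypothetical pair of profiles with the same shape but different levels
x = [0.0, 1.0, 3.0, 2.0, 1.0, 0.5]
y = [2.1, 3.0, 5.1, 3.9, 3.0, 2.6]
y_aligned = align_by_offset(x, y)

print("distance before alignment:", euclidean(x, y))
print("distance after alignment: ", euclidean(x, y_aligned))
```

Because the shift removes only the constant offset, the shapes of the two profiles are untouched, and the distance after alignment reflects differences in their variation over time rather than in their absolute levels.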
Pairwise Gene Expression Profile Alignment

Piecewise Linear Profiles

The idea of profile alignment based on mean-square error, proposed in [18], is to take two feature vectors and produce two new vectors in such a way that the sum of squared differences between the aligned expression ratios at each time point is minimized. This approach weights the intervals between the measurements of an experiment equally, irrespective of the actual length of each interval. In [18], the profile alignment scheme works as follows. Let $x_1, x_2 \in D$ be two m-dimensional feature vectors. The aim is to find two new vectors, $x_1$ and $x_2' = x_2 - a$; that is, to find a scalar $a$. The algorithm takes two feature vectors from the original space as input (two temporal gene expression profiles in this case) and outputs two feature vectors in the transformed space, aligned in such a way that the sum of squared errors is minimized.

Continuous and Integrable Profiles

Given two profiles, $x(t)$ and $y(t)$ (either piecewise linear or continuously integrable functions), where $y(t)$ is to be aligned to $x(t)$, the basic idea of alignment is to shift $y(t)$ vertically towards $x(t)$ in such a way that the squared error between the two profiles is minimal. Let $\hat{y}(t)$ be the result of shifting $y(t)$. Here,
the error is defined in terms of the areas between $x(t)$ and $\hat{y}(t)$ in the interval $[0, T]$. Functions $x(t)$ and $\hat{y}(t)$ may cross each other many times, but we want the sum of all the areas where $x(t)$ is above $\hat{y}(t)$, minus the sum of those areas where $\hat{y}(t)$ is above $x(t)$, to be minimal (see Fig. 13.2). Let $a$ denote the amount of vertical shifting of $y(t)$. Then, we want to find the value $a_{\min}$ of $a$ that minimizes the integrated squared error between $x(t)$ and $\hat{y}(t)$. Once we obtain $a_{\min}$, the alignment process consists of performing the shift on $y(t)$ as $\hat{y}(t) = y(t) - a_{\min}$. The pairwise alignment results of [11] generalize from the case of piecewise linear profiles to profiles that are any integrable functions on a finite interval. Suppose that we have two profiles, $x(t)$ and $y(t)$, defined on the time interval $[0, T]$. The alignment process consists of finding the value $a$ that minimizes

$$ f_a(x(t), y(t)) = \int_0^T \big[ x(t) - \hat{y}(t) \big]^2 \, dt = \int_0^T \big[ x(t) - [y(t) - a] \big]^2 \, dt. \tag{13.1} $$

Differentiating with respect to $a$ yields

$$ \frac{d}{da} f_a(x(t), y(t)) = 2 \int_0^T \big[ x(t) + a - y(t) \big] \, dt = 2 \int_0^T \big[ x(t) - y(t) \big] \, dt + 2aT. \tag{13.2} $$

Setting $\frac{d}{da} f_a(x(t), y(t)) = 0$ and solving for $a$ gives

$$ a_{\min} = -\frac{1}{T} \int_0^T \big[ x(t) - y(t) \big] \, dt. \tag{13.3} $$

Since $\frac{d^2}{da^2} f_a(x(t), y(t)) = 2T > 0$, $a_{\min}$ is a minimum. The integrated error between $x(t)$ and the shifted $\hat{y}(t) = y(t) - a_{\min}$ is then

$$ \int_0^T \big[ x(t) - \hat{y}(t) \big] \, dt = \int_0^T \big[ x(t) - y(t) \big] \, dt + a_{\min} T = 0. \tag{13.4} $$
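Equations (13.3) and (13.4) can be checked numerically for piecewise linear profiles. The sketch below is an illustration only and makes several assumptions: both profiles are sampled at the same time points, the interval $[0, T]$ is generalized to $[t_1, t_n]$ with $T = t_n - t_1$, and the integral is computed with the trapezoidal rule, which is exact for piecewise linear functions. The function names are hypothetical.

```python
from typing import List, Sequence


def integrate_piecewise_linear(times: Sequence[float],
                               values: Sequence[float]) -> float:
    """Exact integral of the piecewise linear function through (times, values)."""
    return sum(
        (times[k + 1] - times[k]) * (values[k] + values[k + 1]) / 2.0
        for k in range(len(times) - 1)
    )


def optimal_shift(times: Sequence[float], x: Sequence[float],
                  y: Sequence[float]) -> float:
    """a_min from Eq. (13.3): minus the mean value of x(t) - y(t) over the interval."""
    total_time = times[-1] - times[0]  # plays the role of T
    diff = [xi - yi for xi, yi in zip(x, y)]
    return -integrate_piecewise_linear(times, diff) / total_time


def align(times: Sequence[float], x: Sequence[float],
          y: Sequence[float]) -> List[float]:
    """Shifted profile y_hat(t) = y(t) - a_min, evaluated at the sampled time points."""
    a_min = optimal_shift(times, x, y)
    return [yi - a_min for yi in y]
```

For instance, with the hypothetical unevenly spaced time points `[0, 1, 3]`, `x = [0, 0, 0]` and `y = [0, 3, 0]`, `optimal_shift` returns 1.5, and integrating `x` minus the aligned `y` over the interval gives zero. Unlike the discrete alignment, longer intervals contribute more to the shift. The zero value of that last integral is the statement of Eq. (13.4).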
In terms of Fig. 13.2, this means that the sum of all the areas where $x(t)$ is above the shifted $\hat{y}(t)$, minus the sum of those areas where $\hat{y}(t)$ is above $x(t)$, is zero.

Given an original profile $x(t) = [e_1, e_2, \ldots, e_n]$ (with n expression values taken at n time points $t_1, t_2, \ldots, t_n$), we use natural cubic spline interpolation with n knots, $(t_1, e_1), \ldots, (t_n, e_n)$, to represent $x(t)$ as a continuously integrable function

$$ x(t) = \begin{cases} x_1(t) & \text{if } t_1 \le t \le t_2 \\ \quad \vdots \\ x_{n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \tag{13.5} $$

where $x_j(t) = x_{j3}(t - t_j)^3 + x_{j2}(t - t_j)^2 + x_{j1}(t - t_j)^1 + x_{j0}(t - t_j)^0$ interpolates $x(t)$ in the interval $[t_j, t_{j+1}]$, with spline coefficients $x_{jk} \in$