Partial Mixture Model for Tight Clustering in Exploratory Gene Expression Analysis

Yinyin Yuan and Chang-Tsun Li
Department of Computer Science, University of Warwick, Coventry, United Kingdom
Email: yina,[email protected]

Abstract—In this paper we demonstrate the inherent robustness of the minimum distance estimator, which makes it a potentially powerful tool for parameter estimation in gene expression time course analysis. To apply the minimum distance estimator to gene expression clustering, a partial mixture model that naturally incorporates replicate information and allows scattered genes is formulated specifically for tight clustering. Tight clustering was recently proposed as a means of obtaining tighter, and thus more informative, clusters in gene expression studies. Through data fitting on simulated data, we provide interesting comparisons with the maximum likelihood estimator. Experiments on real gene expression data validate our proposed partial regression clustering algorithm. Our aim is to provide interpretations, discussions and examples that serve as resources for future research.

I. INTRODUCTION

Clustering gene expression data is based on the assumption that co-expression indicates co-regulation; clustering therefore reveals genes of similar function in the biological pathways. This biological rationale is readily supported both by massive empirical observation and by systematic explanation [1]. Consider, in particular, gene expression time course experiments, where the data comprise tens of thousands of genes, each with measurements taken at either uniformly or unevenly distributed time points, often with several replicates. Clustering algorithms are crucial in reducing the dimensionality of such data and thus provide a good initial investigation leading to biological inference.

The maximum likelihood estimator (MLE) is the most extensively used statistical estimation technique in the literature. For a variety of models [2], [3], [4], likelihood functions, especially the maximum likelihood, are used for making inference about the parameters of the underlying probability distribution of a given dataset. The solution often involves nonlinear optimization such as quasi-Newton methods or, more commonly, expectation-maximization (EM) methods [3]. The problem with the former is that the quantities are estimated only subject to certain constraints, while with the latter all parameters have to be explicitly specified, so the number of clusters K has to be known a priori, which is not practical in microarray data analysis. MLE has many attractive features, including its efficiency. However, the practical deficiencies of MLE, beyond those of its optimization, are its lack of robustness against outliers and its sensitivity to the correctness of the model specification. We discuss in this paper

the performance of an appealing alternative, minimum distance estimation (MDE) [5], which is less explored in this field. Inspired by the work of [6], we propose to incorporate MDE in our algorithm for gene expression time course analysis. MDE provides robust estimation against noise and outliers, which is of particular importance in gene expression data analysis, where the data are often very noisy and there are few replicates.

Tight clustering has been proposed in response to the need for obtaining tight clusters in genomic signal research [7]. It arose from the observation that the most informative clusters are the tight clusters, usually of size 20-60 genes [7]. In this sense, to obtain tight clusters, some genes should not be assigned to clusters but classified as scattered genes, if forcing them into clusters would only result in looser clusters and less obvious biologically relevant patterns. However, current methods for gene expression time course data rarely deal with scattered genes. In the clustering literature, it was proposed in [8] that outliers can be modelled by adding a Poisson process component to the mixture model. However, this method has not been verified in this field, as it relies on correct model specification. To the best of our knowledge, [7] is the first to address this issue, but it relies heavily on resampling. SplineCluster [2] is an efficient hierarchical clustering method based on a regression model with a marginal likelihood criterion. Starting from singleton clusters, the idea is to merge, in each iteration, the clusters with the largest closeness value based on marginal likelihood. It is efficient and straightforward for visualization. However, it does not consider replicates; only the mean of all replicates can be used, which leads to loss of information.

The outline of this paper is as follows. In the second section, we describe the MDE framework and demonstrate how its excellent properties inspire a partial spline regression model to be used in combination with MDE for clustering gene expression time course data. In the experiment section, both MDE and MLE are applied to simulated datasets to reveal their inherent differences. Our proposed partial regression clustering algorithm is then used on a real gene expression dataset to show how it naturally allows scattered genes and determines the number of clusters by itself. Superior performance of our algorithm is demonstrated through comparison with SplineCluster. This study explores the differences between the two estimators in the hope of providing deeper insight into the nature of the data and future research directions.

II. MINIMUM DISTANCE ESTIMATION (MDE)

Given a density function f(\cdot), its corresponding parameters \theta and an n-dimensional variable x, we aim to find the optimal parameters \hat{\theta} that approximate the true parameters \theta_0 by minimizing the integrated squared error

    d(f(\theta), f(\theta_0)) = \int [f(x|\theta) - f(x|\theta_0)]^2 dx    (1)

which gives

    d(f(\theta), f(\theta_0)) = \int f(x|\theta)^2 dx - 2 \int f(x|\theta) f(x|\theta_0) dx + \int f(x|\theta_0)^2 dx    (2)

The last integral, \int f(x|\theta_0)^2 dx, is a constant with respect to \theta and so can be ignored. The second integral can be obtained through the kernel density estimate. Therefore the MDE criterion is given by

    \hat{\theta} = \arg\min_\theta \Big[ \int f(x|\theta)^2 dx - \frac{2}{n} \sum_{i=1}^{n} f(x_i|\theta) \Big]    (3)
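To make the criterion of Eq. (3) concrete, the following is a minimal R sketch, not taken from the paper: it fits a single normal density by minimizing Eq. (3), using the closed form \int \phi(x|\mu,\sigma)^2 dx = 1/(2\sigma\sqrt{\pi}); the data and starting values are illustrative assumptions.

    # Minimal illustration of the MDE criterion in Eq. (3) for one normal
    # density; the first integral of Eq. (3) has the closed form
    # 1 / (2 * sigma * sqrt(pi)). Data and starting values are illustrative.
    set.seed(1)
    x <- c(rnorm(80, 0, 1), rnorm(20, 6, 1))   # 20% of points act as outliers

    mde_normal <- function(par, x) {
      mu    <- par[1]
      sigma <- exp(par[2])                     # log-parameterized to stay positive
      1 / (2 * sigma * sqrt(pi)) - (2 / length(x)) * sum(dnorm(x, mu, sigma))
    }

    fit <- nlm(mde_normal, p = c(median(x), log(mad(x))), x = x)
    fit$estimate[1]                            # close to 0: the outliers are ignored

By contrast, the sample mean mean(x), which is the MLE of \mu under normality, is pulled towards the outlying 20% of points; this is the robustness discussed next.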

There are many interesting features of MDE. First, it comes with the same robustness as all other minimum distance techniques [6]. Second, MDE approximates the data by making the residuals as close to a normal distribution as possible. These properties will be further explained and verified in the experiments later. Last but not least, in principle the finite mixture model methodology assumes that the probability density function can be modeled as the sum of weighted component densities, where the weights are often constrained to sum to 1. For example, a weighted Gaussian mixture model has the form

    f(x|\theta) = \sum_{k=1}^{K} w_k \phi(x|\mu_k, \sigma_k), \quad w_1 + w_2 + \dots + w_K = 1    (4)

where \phi is the density function, \mu_k and \sigma_k are the mean and standard deviation, K is the number of components, and w_k, k = 1, 2, ..., K, are the weight parameters. However, by relaxing the constraint \sum_{k=1}^{K} w_k = 1, the system can be extended for overlapping clustering inference [6]. In both cases, w_k indicates the proportion of data points allocated to the kth component. To further free the system from the constraints imposed by the weight parameters while keeping its weighted-component structure, the idea of partial modelling is presented in the next section. It originates from the fact that incomplete densities are allowed, so the model is fitted to the most relevant part of the data.

III. PARTIAL MIXTURE MODEL

Another unique feature of MDE is that incomplete densities are allowed [6]. This leads to the possibility of setting up a partial mixture model for f(x|\theta). The weight parameters are of particular importance here for understanding partial modelling. They allow the model to estimate the relevant component or components, while their values indicate the proportions of fitted data. This approach to outlier detection was first described in [6].

It is worth noting that the number of components of a partial mixture model can be more than one, and that the fit to the data depends on the initialization. Underestimating K will still fit the major components, while overestimating K leads to overfitting. This is validated in the experiment section.
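As a small illustration of the partial model (a sketch on assumed data, not one of the paper's experiments), a single weighted component w \phi(x|\mu, \sigma) can be fitted by the criterion of Eq. (3); the estimated weight then approximates the proportion of data belonging to that component.

    # Sketch: a one-component partial mixture w * N(mu, sigma) fitted by the
    # MDE criterion of Eq. (3); here int (w*phi)^2 dx = w^2 / (2*sigma*sqrt(pi)).
    # Data and starting values are assumptions for illustration.
    set.seed(2)
    partial_mde <- function(par, x) {
      w <- par[1]; mu <- par[2]; sigma <- exp(par[3])
      w^2 / (2 * sigma * sqrt(pi)) - (2 * w / length(x)) * sum(dnorm(x, mu, sigma))
    }
    x   <- c(rnorm(60, 0, 1), rnorm(40, 5, 2))   # major component: 60% of points
    fit <- nlm(partial_mde, p = c(0.8, median(x), 0), x = x)
    fit$estimate[1]   # roughly 0.6: the proportion captured by the component

The fitted density need not integrate to 1, which is exactly what allows scattered genes to be left out of the clusters.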

A. Spline Regression Model

For analyzing such high dimensional data as gene expression time courses, a spline regression model is set up to take into consideration the inherent time dependence within time course data. Let Y be the variable of interest, here the gene expression data matrix; it can be modeled as

    Y = \alpha + X(t)\beta + \varepsilon    (5)

X(t) is the design matrix made up of a spline basis of time. The error term \varepsilon comprises the residuals, modeled by a weighted distribution w \cdot N(0, \sigma_\varepsilon^2). Here \alpha and \beta = \beta_1, \beta_2, ..., \beta_m, with m depending on the choice of X(t), are the regression parameters. As stated before, an excellent feature of MDE is that it fits the data so that the residuals are close to a normal distribution. Therefore for MDE our model is

    \varepsilon = Y - \alpha - X(t)\beta    (6)

The partial MDE fit for this model has the form

    \hat{\theta} = \arg\min_\theta \Big[ \int (w \phi(\varepsilon|0, \sigma_\varepsilon))^2 d\varepsilon - \frac{2w}{n} \sum_{i=1}^{n} \phi(\varepsilon_i|0, \sigma_\varepsilon) \Big]
                 = \arg\min_\theta \Big[ \frac{1}{2\sqrt{\pi}} w^2 \sigma_\varepsilon^{-1} - \frac{2w}{n} \sum_{i=1}^{n} \phi(\varepsilon_i|0, \sigma_\varepsilon) \Big]    (7)

where \theta = (w, \alpha, \beta_1, ..., \beta_m, \sigma_\varepsilon) and \phi is the density of a normal random variable. Altogether there are m + 3 parameters to be estimated. Numerically we find the solution using the generalized non-linear minimizer nlm() in R.
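A compact R sketch of this fit is given below. The natural-spline basis, knot count and starting values are illustrative assumptions (the experiments later use 13 knots), and pmde_obj() simply transcribes Eq. (7).

    # Sketch: build the spline design matrix X(t) and fit Eq. (7) with nlm().
    # y is one expression profile over T time points; all choices here are
    # illustrative, not prescribed by the paper.
    library(splines)

    t <- 1:17                     # e.g. 17 time points, as in the Y5 data
    X <- ns(t, df = 13)           # natural cubic spline basis for X(t)

    pmde_obj <- function(theta, y, X) {
      m     <- ncol(X)
      w     <- theta[1]
      alpha <- theta[2]
      beta  <- theta[3:(m + 2)]
      sigma <- exp(theta[m + 3])                 # sigma_eps kept positive
      eps   <- y - alpha - X %*% beta            # residuals of Eq. (6)
      w^2 / (2 * sqrt(pi) * sigma) - (2 * w / length(y)) * sum(dnorm(eps, 0, sigma))
    }

    # theta packs (w, alpha, beta_1..beta_m, log sigma_eps): m + 3 parameters.
    # fit <- nlm(pmde_obj, p = c(1, mean(y), rep(0, ncol(X)), log(sd(y))), y = y, X = X)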

B. Internal Validation

In the clustering literature, a method can be evaluated on theoretical grounds by external validation, by internal validation, or by both. Measures of agreement between two partitions, such as the adjusted Rand index, are often used as external indices if the true partition is known. Although public datasets sometimes come with original partitions supplied by their producers, the true partitions of real microarray data are unknown. Recognizing this, we set out to assess the performance of our algorithm from a purely statistical perspective, so only internal validation of the clustering outcome is used. The measure of Calinski and Harabasz (CH) [9] is used as a criterion for the quality of a clustering with K clusters,

    CH(K) = \frac{BSS(K)/(K-1)}{WSS(K)/(N-K)}    (8)

where BSS(\cdot) and WSS(\cdot) are the between-class and within-class distances defined as

    BSS(K) = \frac{1}{2} \sum_{l=1}^{K} \sum_{x_i \notin C_l, x_j \in C_l} d^2(x_i, x_j)    (9)

    WSS(K) = \frac{1}{2} \sum_{l=1}^{K} \sum_{x_i, x_j \in C_l} d^2(x_i, x_j)    (10)

C_l in Eq. (9) and Eq. (10) stands for the lth cluster. The idea behind the CH measure is to compute the pairwise sum of squared errors (distances) between clusters and compare it with the internal sum of squared errors within each cluster. In effect, it is a measure of between-class dissimilarity over within-class dissimilarity. The optimal clustering outcome should be the one that maximizes the CH index in Eq. (8). The CH index was originally meant for squared Euclidean distance. Since the residuals are a natural product of our spline regression model, we use them as the distance measure in BSS(K) and WSS(K), but without the square.
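The index is straightforward to compute; the sketch below (illustrative names, assuming a precomputed pairwise distance matrix) follows Eqs. (8)-(10) literally.

    # Sketch of the CH index of Eq. (8); d is an N x N matrix of pairwise
    # distances (here the unsquared residual distances described above) and
    # labels assigns each of the N points to one of K clusters.
    ch_index <- function(d, labels) {
      N <- nrow(d)
      K <- length(unique(labels))
      bss <- 0; wss <- 0
      for (l in unique(labels)) {
        in_l <- labels == l
        bss  <- bss + sum(d[!in_l, in_l]) / 2   # Eq. (9): x_i outside, x_j inside C_l
        wss  <- wss + sum(d[in_l, in_l]) / 2    # Eq. (10): both inside C_l
      }
      (bss / (K - 1)) / (wss / (N - K))
    }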

C. Partial Regression Clustering

Given an initial clustering, which can be obtained from empirical knowledge or a heuristic clustering method such as k-means, we perform the following procedure:

Algorithm 1: Partial Regression Clustering
Require: an initialization
repeat
  1. Fit a partial spline regression model to each of the clusters;
  2. Identify potential outliers according to a tightness threshold υ and discard them from the clusters;
  3. Fit a partial spline regression model to the outliers to form a new cluster;
  repeat
    4. For all genes, re-evaluate the distances to all existing spline regression models and assign each gene to the closest one;
    5. Fit partial spline regression models to all clusters;
    6. Calculate the CH value of the current partition;
  until the clustering quality measured by the CH value starts to decrease;
  7. Take the partition with the highest CH value;
until no partial spline regression model can be fitted to the outliers;
if Scattered.Genes = True then
  8. Take all remaining outliers as scattered genes
else
  9. Assign the outliers to their closest clusters.
end if

In the main loop, after each new cluster is generated all data points are reassigned in the gene redistribution loop, so the resulting clusters should be of reasonable size. The rationale behind our design is the partial modelling feature and the robustness of the MDE estimator, which we believe can find the relevant component in the data without being distracted by the outliers. The distances between data points and the spline regression models are a natural byproduct of the regression model fitting: the residuals. A sketch of the redistribution step is given below.
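As an illustration of Step 4 only, the following sketch uses assumed names (Y holds genes in rows, X is the spline design matrix, fits is a list of parameter vectors from nlm() in the layout of pmde_obj() above) and the mean absolute residual as the distance, one plausible reading of the residual distance used by the algorithm.

    # Sketch of the redistribution step (Step 4): assign each gene to the
    # spline regression model with the smallest mean absolute residual.
    # Y: genes x time points; X: spline design matrix; fits: list of theta
    # vectors as used by pmde_obj() above. All names are illustrative.
    assign_genes <- function(Y, X, fits) {
      m <- ncol(X)
      dist_to <- function(y, theta) {
        eps <- y - theta[2] - X %*% theta[3:(m + 2)]   # residuals of Eq. (6)
        mean(abs(eps))
      }
      apply(Y, 1, function(y)
        which.min(vapply(fits, function(th) dist_to(y, th), numeric(1))))
    }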

Note that the threshold υ in the algorithm controls the tightness of the clusters and thus, in a sense, also controls their number. A reasonable choice of υ will generate clusters of the desired tightness. An option is provided at the end to be user-specified: if scattered genes are allowed, the outliers from Step 7 are the scattered genes; otherwise, all outliers are assigned to clusters.

IV. EXPERIMENTS

A highlighted feature of our partial regression clustering algorithm is its ability to identify the key component among outliers in order to form a tight cluster. A robust parameter estimator that allows the algorithm to do so is therefore of paramount importance. The experiment section is made up of two parts. First, we empirically validate our points about the nature of partial modelling and MDE in Sections II and III on a simple simulated dataset. In the second part, our proposed partial regression clustering algorithm is applied to a well known dataset and the result is compared with that of a recent work.

A. Simulated datasets

The situation where the number of components K is seriously underestimated is simulated in Figure 1(a). Both PMDE and MLE with the spline regression model are fitted to a set of simulated data and their performance is compared in terms of accurate data fitting. Surprisingly superior performance of PMDE over MLE is found even on such a simple dataset. The data are generated from sine functions with Gaussian noise added: three components generated from three sine waves simulate the gene expression data of three clusters, each with 25 time points. The components comprise 60%, 20% and 20% of the data, respectively. PMDE locates the major component while MLE is biased towards all the data. This is strong evidence that PMDE is superior to MLE in such a scenario. The fact that PMDE can find the key component without compromising towards the others suggests a solution to the vexing problem of an unknown number of components, which is exactly the situation in gene expression clustering. Histograms of the residuals from both fits are plotted in Figures 1(b) and (c) to illustrate that PMDE fits the data in such a way that the residuals are close to normal.
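For reference, simulated data of this kind might be generated as follows; this is a sketch, since the paper does not specify the amplitudes, phases or noise level, which are assumed here.

    # Sketch of the Section IV-A setup: three sine-wave components with
    # Gaussian noise, 25 time points, mixing proportions 60%/20%/20%.
    # Phases and noise level are illustrative guesses.
    set.seed(42)
    tp    <- seq(0, 2 * pi, length.out = 25)
    sizes <- c(60, 20, 20)                     # genes per component (60%/20%/20%)
    phase <- c(0, 2 * pi / 3, 4 * pi / 3)
    sim <- do.call(rbind, lapply(1:3, function(k)
      t(replicate(sizes[k], sin(tp + phase[k]) + rnorm(25, sd = 0.3)))))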


Fig. 3. Heatmaps of the original partition by Spellman (left) and the resulting partition (right).


TABLE I: Cross tabulation of the original partition by Spellman and the resulting partition.

            C1    C2    C3    C4    C5    C6    C7    C8   Total
    G1E     43     1     0     0     1     4     0    18      67
    G1L     18    65     4     0     0    48     0     0     135
    S        3    12    30    12     0    14     4     0      75
    G2       0     0     5    20     9     0    16     2      52
    M        0     0     0     0    43     0     1    11      55
    Total   64    78    39    32    53    66    21    31

−1

0

1

2

Residuals by MLE

Fig. 1. (a) PMDE fit (pink line) and MLE fit (blue line) to simulated data generated from three sine waves; (b) histogram of residuals by PMDE; (c) histogram of residuals by MLE.

B. Y5 Time Course Data

A subset of 384 genes of the Yeast Cell Cycle (Y5) dataset of Spellman et al. [10], [11], measured at 17 time points, was previously clustered into five clusters based on the first peak time in the cell cycle: early G1 (G1E), late G1 (G1L), S, G2 and M. Spellman's original partition indicates ambiguities between groups. This dataset is chosen here not only because of its difficulty in terms of clustering, but also because it is well studied in the gene expression clustering literature. The original partition, reasonable as it is, makes use of only partial information about gene expression, which should partly account for why all algorithms perform poorly when it is used as an external index [12], [13]. Moreover, the average cluster size (see the rightmost column of Table I) is still far larger than desirable to biologists. It has also recently been suggested that clustering based on overall profiles is preferable to Spellman's original partition on a different subset of the same dataset [14]. We employ the proposed partial regression clustering algorithm to partition the dataset into tighter clusters, as shown in Figure 2. By obtaining a tighter clustering we expect to offer more efficient data mining. The tightness threshold υ is set to 8 and the number of knots for the spline basis is set to 13, allowing flexibility of the curves without overfitting. To enable comparison, we allocated the scattered genes into

clusters. Still, it is obvious that a major part of them fall in the seventh cluster. The eight clusters (C1-C8) are then cross-tabulated against the original partition in Table I. As the bottom row indicates, the cluster sizes are greatly reduced. The heatmaps are plotted in Figure 3 for comparison, where an obvious improvement in class distinction can be seen. The two partitions agree on many genes but also differ in inspiring ways. For instance, the sixth cluster in our partition comes mainly from the second original cluster; in the heatmap it reveals a sharp change in the expression pattern, which can hardly be captured by Spellman's partition. We also examined the clustering outcomes given by our algorithm and by SplineCluster. By controlling a parameter in SplineCluster we obtained 8 clusters. As neither algorithm takes any biological knowledge as input, the comparison is conducted in a purely statistical manner, by the CH index. Our algorithm achieves the highest CH value of 608.9607, followed by 523.8909 for SplineCluster and 283.0890 for Spellman's partition.

Fig. 2. The resulting clusters produced by the partial regression clustering algorithm for the Y5 dataset.

V. CONCLUSIONS

The aim of clustering gene profiles is to find possible functional relationships among the tens of thousands of genes on a microarray. We propose that while the models for data fitting should be sensitive enough to discriminate individual genes, the estimators should be robust against noise and possible outliers. We therefore focused on the differences between estimators by providing an experimental comparison. Interestingly, when the minimum distance estimator is applied to a partial mixture model, it is capable of finding the key component in gene expression data, that is, of automatic outlier detection. Our contributions are introducing MDE and the idea of partial modelling to gene expression research, comparing them with the most prevalent estimator in the literature, maximum likelihood, and proposing a novel partial regression clustering algorithm.

There are many ways to apply partial modelling to gene expression time course clustering. Our spline regression model captures the inherent time dependencies within the data. The error term is of particular importance as it is supposed to pick up the noise. The fact that PMDE estimates parameters so that the residuals are as close to a normal distribution as possible makes it a powerful tool for modelling the error term. The tightness of the resulting clusters can be controlled by a threshold, which in a sense decides the number of clusters. The effectiveness of the algorithm also depends on model normality; gene expression data are often transformed after the data extraction step so that normality holds approximately. The proposed algorithm can be applied over an existing clustering to obtain tighter clusters. Other ways of applying partial modelling and MDE to clustering are interesting topics that will be pursued in our future research. Although PMDE demonstrates its effectiveness through comparisons with the maximum likelihood method, it also has its limits, such as relative inefficiency. The aim here is not to prove which estimator is better, but rather to provide analytical examples, discussions and

insights in the hope of further research.

REFERENCES

[1] P. C. Boutros and A. B. Okey, "Unsupervised pattern recognition: An introduction to the whys and wherefores of clustering microarray data," Brief Bioinform, vol. 6, no. 4, pp. 331-343, December 2005.
[2] N. A. Heard, C. C. Holmes, and D. A. Stephens, "A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves," Journal of the American Statistical Association, vol. 101, no. 473, pp. 18-29, 2006.
[3] S. K. Ng, G. J. McLachlan, K. Wang, L. B.-T. Jones, and S.-W. Ng, "A mixture model with random-effects components for clustering correlated gene-expression profiles," Bioinformatics, vol. 22, no. 14, pp. 1745-1752, 2006.
[4] J. Wakefield, C. Zhou, and G. Self, "Modeling gene expression data over time: Curve clustering with informative prior distributions," Bayesian Statistics, 2003.
[5] R. Beran, "Minimum distance procedures," Handbook of Statistics, vol. 4, pp. 741-754, 1984.
[6] D. W. Scott, "Parametric statistical modeling by minimum integrated square error," Technometrics, vol. 43, no. 3, pp. 274-285, 2001.
[7] G. C. Tseng and W. H. Wong, "Tight clustering: A resampling-based approach for identifying stable and tight patterns in data," Biometrics, vol. 61, no. 1, pp. 10-16, March 2005.
[8] C. Fraley and A. E. Raftery, "How many clusters? Which clustering method? Answers via model-based cluster analysis," The Computer Journal, vol. 41, no. 8, pp. 578-588, 1998.
[9] T. Calinski and J. Harabasz, "A dendrite method for cluster analysis," Comm. Statist., vol. 3, pp. 1-27, 1974.
[10] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273-3297, 1998.
[11] K. Yeung, C. Fraley, A. Murua, A. Raftery, and W. Ruzzo, "Model-based clustering and data transformations for gene expression data," Bioinformatics, vol. 17, no. 10, pp. 977-987, 2001.
[12] A. Schliep, I. G. Costa, C. Steinhoff, and A. Schonhuth, "Analyzing gene expression time-courses," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 3, pp. 179-193, 2005.
[13] Y. Yuan and C.-T. Li, "Unsupervised clustering of gene expression time series with conditional random fields," Proceedings of the IEEE Workshop on Biomedical Applications for Digital Ecosystems, 2007.
[14] L. Qin and S. G. Self, "The clustering of regression models method with applications in gene expression data," Biometrics, vol. 62, no. 2, pp. 526-533, 2006.
