A Bayesian Model for Curve Clustering with Application to Gene Expression Data Analysis
Chuan Zhou
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
University of Washington
2003
Program Authorized to Offer Degree: School of Public Health and Community Medicine – Biostatistics
University of Washington Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by Chuan Zhou and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.
Chair of Supervisory Committee:
Jonathan C. Wakefield
Reading Committee:
Steve G. Self
M. Kathleen Kerr
Jonathan C. Wakefield
Date:
University of Washington

Abstract

A Bayesian Model for Curve Clustering with Application to Gene Expression Data Analysis

Chuan Zhou

Chair of Supervisory Committee: Professor Jonathan C. Wakefield
Department of Biostatistics
In this dissertation, we propose a general Bayesian hierarchical mixture model for clustering curve data. Instead of clustering based on the high-dimensional observed curve data, we construct the hierarchy in such a way that lower-dimensional random effects, which characterize the curves, form the basis for clustering. This model provides a flexible framework that can be tuned to the specific context, and allows information regarding curve forms, measurement errors and other prior knowledge to be incorporated. Under this model, the order of observations within each curve is explicitly taken into account, and the number of clusters can be treated as unknown and inferred from the data. Computation is carried out via an implementation of the birth-death MCMC algorithm. A preliminary filtering algorithm is devised in order to reduce the computational burden. We also propose novel quantitative measures of the strength of the resultant clusters in terms of sensitivity and specificity, which are not easily evaluated with traditional approaches. Substantive application of this model to a set of gene expression experiments demonstrates that considerable insight into yeast transcription programs can be gained through such model-based analysis.
TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction
1.1 Overview
1.2 Microarray Technology and Gene Expression
1.2.1 Central dogma
1.2.2 Gene expression
1.2.3 Microarray technology
1.2.4 Special features of microarray data
1.3 Gene Expression Data Analysis
1.4 Gene Expression Time Series
1.4.1 Sporulation data
1.4.2 Cell-cycle data

Chapter 2: Bayesian Hierarchical Models and MCMC Techniques
2.1 A Hierarchical View of Gene Expression Time Series
2.2 Bayesian Hierarchical Models
2.2.1 Population models
2.2.2 Bayesian dynamic models
2.3 Bayesian Mixture Models and Model-based Clustering
2.3.1 Model-free clustering versus model-based clustering
2.3.2 Model-based clustering
2.3.3 Bayesian mixture models
2.4 MCMC Techniques in Bayesian Computation
2.4.1 Introduction to MCMC
2.4.2 Metropolis-Hastings algorithms
2.4.3 Gibbs sampling
2.4.4 MCMC with trans-dimensional moves
2.5 Linear Population Biological Growth Example
2.5.1 Normal-linear population model
2.5.2 Gibbs sampler
2.5.3 Results

Chapter 3: Filtering Based on Bayes Factors and False Discovery Rate
3.1 Motivation
3.2 Gene Filtering Based on Bayes Factors
3.2.1 The hypothesis
3.2.2 Bayes factor
3.2.3 Choice of priors
3.2.4 Computation and importance sampling
3.2.5 Ranking based on posterior probabilities
3.2.6 Thresholding
3.3 Examples
3.3.1 Example 1: Sporulation Data
3.3.2 Example 2: Cell-Cycle Data
3.4 Conclusion and Discussion

Chapter 4: Bayesian Hierarchical Models for Curve Clustering
4.1 The General Hierarchical Mixture Model
4.1.1 Model description
4.1.2 Computation
4.1.3 Label-switching
4.2 Example 1: Simulated Data
4.2.1 Data description
4.2.2 Model description
4.2.3 BDMCMC
4.2.4 Analysis
4.3 Example 2: Sporulation Data
4.3.1 Data description
4.3.2 Model description
4.3.3 BDMCMC
4.3.4 Analysis
4.4 Example 3: Cell-Cycle Data
4.4.1 Data description
4.4.2 Model description
4.4.3 Analysis
4.5 Discussion

Chapter 5: Analysis of Cell Cycle with Gene Expression Data
5.1 Cell Cycle Regulated Gene Expression
5.2 Data Description
5.3 Measurement Error
5.4 Filtering
5.5 Model Development
5.6 The Model
5.7 Computation
5.8 Analysis
5.9 Conclusion and Discussion

Chapter 6: Extensions and Further Work
6.1 Sensitivity of Posterior Distribution of K
6.2 Extension to FDR
6.3 Robust Clustering with t-distribution
6.4 Prior on Cluster Labels
6.4.1 Joint distribution of multivariate discrete responses
6.4.2 Bahadur representation
6.4.3 Markov random field
6.4.4 Dirichlet-multinomial distribution
6.4.5 Direct modelling the dependence
6.5 Conclusion

Bibliography

Appendix A: Tables of Data
A.1 Growth Data

Appendix B: Thresholding with FDR
LIST OF FIGURES

1.1 A schematic of the role of RNA in gene expression and protein production. Major stages include DNA replication, RNA transcription, RNA splicing and translation. Graphics from http://www.accessexcellence.org.
1.2 A schematic of gene expression. mRNA is produced by RNA transcription; after migrating out of the nucleus, it instructs protein synthesis in the ribosome. Graphics from http://www.accessexcellence.org.
1.3 A schematic of cDNA microarrays. Graphics from Duggan et al. (1999).
1.4 A random sample of 100 genes from the sporulation data. The entire data set is available from the SMD website http://genome-www.stanford.edu/microarray/.
1.5 A random sample of 100 genes from the cell-cycle dataset. The entire dataset is available from the SMD website http://genome-www.stanford.edu/microarray/.
2.1 Dental measurements for 11 girls and 16 boys. Solid lines are girls, dashed lines are boys, and the two thick lines are point-wise averages for each population.
2.2 Least squares estimates of intercepts and slopes (the plotting symbol is the individual's number). Girls are labeled from 1 to 11, and boys are labeled 12 to 27.
2.3 Posterior means of individual random effects θi, their least squares counterparts, population means and projections of the ellipsoids defined by the covariance matrices for the two populations. The ellipses correspond to ±2 standard errors in the univariate case.
2.4 Trace plots of population means. The chain was run for 200,000 iterations, with the first 100,000 iterations discarded as burn-in and thinning at every 100th iteration. It appears the chains have converged and mixed well.
2.5 Dental measurements for 11 girls and 16 boys, assuming population information has been lost. It is difficult for the human eye to detect subtle features beyond the linear trend in the curves.
3.1 Random sample of 100 genes (including measurements at t = 0).
3.2 Distribution of gene expression at t = 0.
3.3 Sporulation data from Chu et al. (1998). Expression levels versus time for 200 genes with the highest (left panel) and the lowest (right panel) values of p(M1 | y), where M1 is the model of non-constant level.
3.4 Expected FDR process; the corresponding number of rejections is 711 when the estimated FDR = 0.5, the cutoff in terms of p(M1 | y) is 0.26, and the estimated FNR = 0.07.
3.5 Expression levels versus time for 100 genes with the highest (left panel) and the lowest (right panel) values of p(M1 | y) among the top-ranked 1300 genes.
3.6 Disagreement between the root mean square based filter and the posterior probability based filter. The left panel shows the 85 genes missed by the filter based on posterior probabilities; the right panel shows the 85 genes missed by Chu et al. (1998), among the top 1116 genes.
3.7 Disagreement between the average based filter and the posterior probability based filter. The left panel shows the 215 genes missed by the filter based on posterior probabilities; the right panel shows the 215 genes missed by the filter based on averages, among the top 1116 genes.
3.8 Cell cycle data from Spellman et al. (1998). Expression levels versus time for 100 genes with the highest (left panel) and the lowest (right panel) values of p(M1 | y).
4.1 (a) A total of 50 simulated curves, with t = (−3, −1, 1, 3). (b) The least squares estimates of intercepts and slopes. Units 1–15 are in group one, 16–30 are in group two, 31–50 are in group three. The groups are labelled by their intercepts.
4.2 Trace plot of K from BDMCMC, and its posterior distribution.
4.3 Posterior estimates of cluster centers and their variances. The circles correspond to 2 standard errors in the univariate case. The right panel shows the classification conditional on K = 3.
4.4 Hand-picked genes in each of seven groups (Chu et al., 1998), along with mean trajectories (panel 8).
4.5 Distributions of standard deviation under different priors.
4.6 Four simulations from the random walk prior, with measurement error added. There are K = 10 groups, each of which contains 20 genes.
4.7 Trace plot of the number of clusters K and the posterior distribution of K, from BDMCMC on the sporulation data.
4.8 Mean expression levels as a function of time, for numbers of clusters K = 10, 12, 16, 20.
4.9 Posterior profiles and genes classified to each of these profiles (using MAP classification), conditional on K = 20 clusters.
4.10 Heat map showing pairwise probabilities of common cluster membership of the 30 genes in Figure 4.4. The solid lines separate the different groups. On the left the shaded squares denote those pairwise probabilities greater than 0.5, while on the right the cut-off probability is 0.8.
4.11 Least squares estimates Âi, B̂i from the model E[Yit | Ai, Bi] = Ai sin(2πf t) + Bi cos(2πf t), for gene i (left panel), and log(R̂i), log{(π/2 + φ̂i)/(π/2 − φ̂i)} (right panel), i = 1, ..., 800.
4.12 200 simulations of (Ak, Bk) for the cell-cycle parameters (left); five trajectories without random error (center); 55 simulations from 5 groups, 4 simulated plus a zero group, with random error (right).
4.13 Trace plot of the number of clusters K and the posterior distribution of K (after a burn-in of 100,000 iterations), from BDMCMC analysis of the cell-cycle data.
4.14 Posterior profiles and genes classified to each of these profiles (using MAP classification), conditional on K = 11 clusters.
5.1 Expression of 100 genes known to be cell-cycle regulated.
5.2 Expression of 100 genes randomly selected.
5.3 Expression of 100 randomly selected asynchronized genes.
5.4 Boxplots for the data from each of the six chips.
5.5 Sampling distribution of the pooled reference data from all 6 chips.
5.6 Sampling posterior distribution of σ² with different parameter values. The left panel shows a highly concentrated distribution directly from the posterior analysis; the right panel shows the sample distribution with calibrated parameter values.
5.7 Observed gene expression of 100 known cell cycle regulated genes, and their fitted values based on least squares estimates from model (5.4).
5.8 Histograms of least squares estimates of amplitude Ri and phase φi. The sampling distribution of Ri is skewed to the right, and the distribution of φi is rather flat over its range.
5.9 Scatter plot of least squares estimates of Ri and φi. It suggests Ri and φi are uncorrelated.
5.10 N = 100 simulated gene expression time series based on the following priors: Ri ∼ Exp(1.43), φi ∼ Unif(−0.5, 0.5), σe² = 0.22.
5.11 Expression of the 100 highest ranked genes (left panel) and lowest ranked genes (right panel).
5.12 Optimal solutions to different loss functions of the form cFDR + FNR.
5.13 Optimal solutions to minimizing FNR, subject to FDR ≤ 0.05.
5.14 Residuals and squared residuals from fitting model (5.4) to the CCR genes.
5.15 Mean profile under the model Yit = Ai cos(2πf0 t) + Bi sin(2πf0 t) + eit.
5.16 Multiple mean profiles with different phases under the model Yit = Ai cos(2πf0 t) + Bi sin(2πf0 t) + eit.
5.17 Mean profiles under the model Yit = Ai cos(2πft(φi) t) + Bi sin(2πft(φi) t) + eit.
5.18 Mean profiles under the model Yit = e^(−γi t) {Ai cos(2πft(φi) t) + Bi sin(2πft(φi) t)} + eit.
5.19 CCR data: scatter plots of non-linear least squares estimates of parameters under the model Yit = e^(−γi t) {Ai cos(2πft(φi) t) + Bi sin(2πft(φi) t)} + eit.
5.20 Expression of genes with three or more missing measurements: N = 68.
5.21 Among the 1078 genes identified as periodic using SPM from the Spellman data, only 584 would have passed our filter at 0.8; the other 494 genes would have failed and would not be classified as periodic by our filter.
5.22 Among the 899 genes identified as periodic by SPM using the 38wt data, 745 passed our filter with Pr(M1 | y) ≥ 0.8 (left panel), and 216 did not pass our filter (right panel).
5.23 Observed expression of the 100 known cell-cycle regulated genes, and their fitted values based on non-linear least squares estimates using model (5.12).
5.24 Residuals and squared residuals from fitting model (5.12) to the 100 CCR genes.
5.25 Final clustering with K = 16 fixed, different scales.
5.26 Final clustering with K = 16 fixed, common scale.
5.27 Heat-map of probabilities that two genes share a common label, for clusters 2, 3, 5, 6, 8, 13, 15, and 16. Shaded blocks correspond to pairwise probabilities larger than the chosen cutoff.
5.28 Strength of co-expression through sensitivity and specificity.
5.29 Some genes clustered into group 2 that co-express with cluster 16.
6.1 (a) A total of 50 simulated curves, with t = (−3, −1, 1, 3). (b) The least squares estimates of intercepts and slopes. Units 1–15 are in group one, 16–30 are in group two, 31–50 are in the third group. The groups are labelled by their intercepts.
6.2 Posterior distributions of K: comparison of sensitivity to Poisson priors and hyper-parameters between (a) fixed and (b) random population parameters.
6.3 Expression of genes with very different ranks before and after the clustering. Genes with increased ranks after clustering are shown in the left panel; genes with decreased ranks are in the right panel.
6.4 A total of 60 simulated curves; 50 are from the 3 clusters, 10 are uniform noise.
6.5 Trace plot and posterior distribution of K, with BDMCMC and the normal mixture model.
6.6 Classification and estimation from fitting a three-component normal mixture model to the simulated data.
6.7 Classification and estimation from fitting a three-component Student-t mixture model to the simulated data.
LIST OF TABLES

2.1 Posterior Summaries of Separate Analysis for Boys and Girls
3.1 Sensitivity of Ranking to Prior Specification: Sporulation Data
4.1 Posterior Summaries of Simulated Growth Curve Data with K = 3
6.1 Influence of prior distribution Wishart(g, (gR)⁻¹) for Σ on the posterior distribution of K
ACKNOWLEDGMENTS

I would like to express sincere appreciation to my supervisor, Professor Jon Wakefield, who skillfully coached me in Bayesian statistics with patience, enthusiasm, humor and a warm heart. I am also grateful to Professor Steve Self, who guided me into the field of bioinformatics and kindly provided me with financial support through the entire writing of the dissertation. I would like to thank Professor Linda Breeden, my biology advisor, for explaining the ever-confusing biology to me. Thanks to Katie Kerr, Matthew Stephens and David Haynor for their enlightening discussions on the content of the dissertation. Lixuan Qin deserves a special thanks for carefully proofreading the draft. I also owe many thanks to Yi, for her understanding and encouragement during these years. Finally, I want to thank all my friends for providing me with good food and companionship along the way, particularly Michael, Hao, Lihong, Kristian, Yingye, Jinbo, and Chengcheng (last but not least). It has been fun.
DEDICATION

I would like to dedicate this dissertation to my mom and dad: Cuilan Wu and Xiwen Zhou. Though they may not understand what is written inside, they have brought up someone who was able to write it. To them I owe a great deal and more. This is for them.
Chapter 1 INTRODUCTION
1.1 Overview
The overwhelming focus of biologists in recent years has been on collecting and disseminating DNA sequence information through efforts such as the Human Genome Project. The speed and scale of such data collection have increased dramatically, owing primarily to improved biotechnology, computing power and analytical methods. With the large amount of sequence-based information now available, researchers have started the next task on the agenda: to uncover the functional roles and complex inter-relationships of individual genes and gene products. Remarkable progress has been made in the development of numerous technologies and tools for such purposes; in particular microarrays, widely recognized as the next revolution in molecular biology, have advanced our understanding of genomics. Genetics has traditionally focused on individual genes or small groups of genes, for lack of global measurements. DNA microarrays enable us to assess alterations in gene expression across the whole genome simultaneously and repeatedly, with good precision. Using such global quantitative measurements, scientists are now able to study the interaction pathways among genes and try to reconstruct genomic maps. Such array-based techniques represent a more general trend toward the implementation of systematic and comprehensive methods in biological research. We need to recognize, however, that these technologies, while providing global pictures, can often obscure knowledge within the vast amount of data they generate. Failure to fully understand the scientific questions and to acknowledge the systematic and stochastic variations will lead to a waste of resources, unreliable conclusions, and even more serious consequences in applications such as medicine. Therefore, careful statistical analysis should play a crucial role in extracting information from high-throughput data.
The methodology and applications summarized in this dissertation are largely motivated by statistical problems arising from microarray experiments, especially time course gene expression analysis; we therefore present a brief overview of microarray technology and related gene expression analysis to provide the necessary background for this endeavor. In this dissertation we develop a fully Bayesian framework for the specific application of analyzing curve data, with time course gene expression data as a special case. Such a model-based approach allows simultaneous clustering and estimation. The methods are intended to be largely exploratory, yet with a more reliable measure of uncertainty than the majority of the currently prevalent analytical tools used in gene expression analysis. In the applications of our methodology, we demonstrate the importance of understanding the underlying scientific questions, carefully dealing with systematic signals and random noise, and taking various sources of variation into account. There is no universal solution to all the problems arising from gene expression experiments, and we believe appropriate analytical methods should be chosen to investigate different aspects of particular biological phenomena. Hence we consider our approach complementary, rather than a competitor, to the current body of statistical analyses for gene expression experiments. At this stage, microarray technology is being standardized and is quickly maturing; statistical methods for analyzing microarray data, on the other hand, are still in the development phase. We hope to contribute to the field with the effort summarized in this dissertation. For more detailed discussions and guidelines on microarrays and gene expression analysis, see Chipping Forecast I (Collins et al., 1999) and II (Trent et al., 2002).

The dissertation is structured as follows:

Chapter 1
provides an introduction to gene expression and microarray technologies, and a review of related data analysis methods. We discuss sources of variation in microarray experiments and special features that gene expression data possess; these motivate the novel statistical tools for gene expression data analysis that we develop in the remainder of the dissertation.
Chapter 2
gives a brief review of Bayesian hierarchical modeling, Bayesian mixture models and basic Markov chain Monte Carlo techniques for Bayesian computation.
Chapter 3
describes a simple but powerful filtering approach based on Bayes factors and false discovery rate (FDR).
Chapter 4
describes a general model, in which we apply Bayesian hierarchical models to time course gene expression data with a mixture of population distributions. We allow the number of possible clusters to be fixed or unknown a priori. The proposed methods are illustrated using simulated and real gene expression data.
Chapter 5
describes a detailed analysis of cell-cycle gene expression time series, with a model tailored specifically for such data. We demonstrate that the model is appropriate for scientific questions of interest, and can provide great insight into the cell cycle control mechanism. New ways to visualize and evaluate the clustering are also discussed.
Chapter 6
discusses extensions to the proposed model and related issues, including controlling the FDR while allowing for dependence, modelling via a mixture of t-distributions to account for outliers, and the incorporation of external information through prior distributions. We also mention possible directions for future research.
1.2 Microarray Technology and Gene Expression
Because microarray technology is constantly being refined, and new techniques and analytical tools are appearing at an explosive rate, we cannot give a thorough review of the fast-evolving field of gene expression analysis. This chapter therefore serves only as a brief overview of microarray technology, along with some of the most widely used statistical analysis methods. We also lay out some of the specific problems we are trying to address in this dissertation.
1.2.1 Central dogma
The central dogma of molecular biology summarizes the usual information flow in organisms: the perpetuation of nucleic acid may involve either DNA or RNA as the genetic material, and the expression of cellular genetic information is usually unidirectional. Proteins are the structural components of cells and tissues, and perform many key biological functions. The production of proteins is controlled by genes, which are coded in deoxyribonucleic acid (DNA) sequences, common to all cells in one being, and mostly static over one’s lifetime. For some viruses, the ribonucleic acid (RNA) is the hereditary material. This dogma is represented by four major stages: 1) The DNA replicates its information in a replication process that involves many enzymes; 2) The DNA codes for the production of messenger RNA (mRNA) during transcription; 3) In eukaryotic cells, the mRNA is processed (essentially by splicing) and migrates from the nucleus to the cytoplasm; 4) Messenger RNA carries coded information to ribosomes. The ribosomes “read” this information and use it for protein synthesis. This process is called translation. Figure 1.1 illustrates the roles of replication, transcription, translation, and the information flow viewed from the perspective of the central dogma.
1.2.2 Gene expression
The majority of genomic research is conducted along the lines of the central dogma, studying the various steps involved in the information flow. Among these, gene expression analysis is an area studying the specific process by which a gene gives rise to a protein. The basic stages of gene expression are outlined in Figure 1.2. The first stage is transcription, when an RNA copy of a single strand of the DNA is produced. For the simplest genes, this RNA is in fact messenger ribonucleic acid, or mRNA (and this is always the case with bacteria). For genes of eukaryotes, the immediate transcript of the gene is pre-mRNA that must be processed through RNA splicing to generate the mature mRNA, the template with specific instructions used to produce amino acids which are later assembled into proteins. Transcription and processing of RNA occur in the nucleus. The next stage of gene
Figure 1.1: A schematic of the role of RNA in gene expression and protein production. Major stages include DNA replication, RNA transcription, RNA splicing and translation. Graphics from http://www.accessexcellence.org.
expression is the translation of the mRNA into protein. This occurs in the cytoplasm, so it is necessary for the mRNA to be transported through the nuclear membrane into the cytoplasm. Subsequently, the translation is accomplished by the ribosome, a complex apparatus that includes both protein and RNA components.
Figure 1.2: A schematic of gene expression. mRNA is produced by RNA transcription; after migrating out of the nucleus, it instructs protein synthesis in the ribosome. Graphics from http://www.accessexcellence.org.
Tremendous effort has been spent along the lines of the central dogma to determine how exactly a genome is manifested in the form of living organisms. The ever-increasing rate at which genomes are being sequenced is attracting attention to functional genomics – an area
of genome research concerned with assigning biological function to DNA sequences. Upon completion of the sequencing of a genome, the essential yet formidable task of defining the role of each gene and its product, and of understanding the inter-relationships between sets of genes in the genome and their encoded products, can be attempted. The mRNA plays a unique role in genomic research as it bridges the hereditary information carriers, the genes, and their functional products, the proteins. Proteins are the essential structural and functional components of any living organism. For the efficiency of material management and the accuracy of the assembly process, cells do not produce and store them all the time. Instead, cells frequently employ a more efficient strategy: produce and deliver the proteins only when they are needed. Studies have shown that changes in the state of a cell are related to the presence and amount of proteins participating in that state or process; the identity and amount of proteins are in turn directly related to the abundance of specific mRNA transcripts. Systematic monitoring of the transcriptome could therefore provide valuable insight into the gene-protein-function relationship. More background on gene expression and its biological significance can be found in many basic biochemistry and biology textbooks, such as Lewin (2000) and Alberts et al. (1994).
1.2.3 Microarray technology
Several techniques have been developed for measuring gene expression, including serial analysis of gene expression (SAGE), cDNA library sequencing, differential display, cDNA subtraction, multiplex quantitative RT-PCR, and gene expression microarrays. It is widely believed that thousands of genes and their products (i.e., RNA and proteins) in a given living organism function in a complicated and orchestrated way that creates the mystery of life. Strong supporting evidence for this belief has been obtained from extensive studies of cell cycle control in the yeast S. cerevisiae; see, for example, the review by Kelly and Brown (2000). However, traditional methods in molecular biology generally work on a “one gene in one experiment” basis; even with some newly developed techniques, the number of genes being monitored is still quite small, which means that the resolution is very crude and there is a lack of knowledge of the “whole picture” of the genome. In recent years,
a new technology, DNA microarrays, has attracted tremendous interest not only amongst biologists, but also computer scientists and statisticians. This technology allows monitoring of the gene expression of the whole genome on a single chip, so that researchers may have a better picture of the interactions among thousands of genes simultaneously. The vast amount of data generated by this technology has raised complicated computational and statistical issues which are waiting to be resolved.

Base-pairing, or hybridization, is the underlying principle of DNA microarrays. DNA microarrays, or DNA chips, are fabricated by high-speed robotics, generally on a glass, plastic or nylon support. On the support, also called the matrix, probes¹ with known identity are deposited into separate spots following an arranged pattern to determine complementary hybridization, thus allowing measurements of mRNA representation from the sample under study. Each spot constitutes a separate experiment. Arrays can have hundreds of thousands of spots.

¹ In the literature, there exist at least two confusing nomenclature systems for referring to hybridization partners. Both use the common terms “probe” and “target”, but with reversed meanings. According to the nomenclature recommended by the Supplement to Nature Genetics (Collins et al., 1999), a “probe” is the tethered nucleic acid with known sequence, whereas a “target” is the free nucleic acid sample whose identity and abundance are being detected. We follow this recommendation throughout the dissertation.

There are several versions of microarray technologies. The two most often referred to are cDNA arrays (also called spotted arrays) and oligonucleotide arrays. The early versions of these two types of arrays both exploit hybridization, but differ in the properties of the arrayed DNA sequences with known identity, i.e., the “probes”. In recent years, cDNA arrays using short oligonucleotides as probes have also been developed, so the distinction between the two types of arrays lies more in whether gene expression is measured in absolute quantity or relative to reference samples.

As shown in Figure 1.3, in cDNA arrays, mRNA from both the test and reference samples is reverse-transcribed into cDNA and fluorescently labelled with dyes of different colors, usually green and red. Equal amounts of fluorescent targets are then pooled and allowed to hybridize under stringent conditions to the clones (probes), each of which is deposited on a small region, or spot, on a coated glass slide. After hybridization, a laser scanner measures dye fluorescence of each color at a fine grid of pixels. Higher fluorescence indicates a higher amount of cDNA, which in turn indicates higher gene expression.
Figure 1.3: A schematic of cDNA microarrays. Graphics from Duggan et al. (1999).
The second common approach involves the use of high-density oligonucleotide (20∼80-mer oligos) arrays. Currently, the most widely used oligonucleotide array type is the Affymetrix GeneChip. In Affymetrix arrays, the expression level of each gene is measured by comparing hybridization of the sample mRNA to a set of probes, composed of 11∼20 pairs of oligonucleotides, each of length 25 base pairs. One type of probe in each pair is the actual gene sequence of interest, termed the perfect match (PM) in the array literature. The other type of probe in the pair is known as the mismatch (MM) and is created by changing the middle (13th) base of the PM sequence to reduce the rate of specific binding of mRNA for that gene. Unlike cDNA arrays, where two samples of mRNA (test and reference) compete in hybridization to the probes on the arrays, a single
sample is prepared, labeled with a fluorescent dye, and hybridized to the probe sets on the oligonucleotide arrays. For cDNA arrays, though absolute fluorescence intensities of each color are reported from the image analysis, most gene expression data analyses use a normalized ratio of expression levels from the two samples as the endpoint of each experiment. For oligonucleotide arrays, separate intensities are extracted from hybridizations to PM and MM probes, and the expression levels for each probe-set are summarized using special data analysis approaches; see Li et al. (2003) and references therein. In this research project, we are only concerned with high-level data analysis of cDNA array data; therefore in the following sections and later chapters we restrict our attention to cDNA arrays and related statistical analysis. However, we believe the methods that we have developed can be readily applied to oligonucleotide array data with appropriate modifications. Key papers describing cDNA microarrays include Schena et al. (1995), Schena et al. (1996) and DeRisi et al. (1997). Oligonucleotide arrays are discussed by Lockhart et al. (1996) and Lipshutz et al. (1999); details on Affymetrix arrays can be found in Affymetrix (1999). For more comprehensive biochemical and technical coverage of DNA microarrays and their applications, see the books by Bowtell and Sambrook (2002) and Schena (2002).
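To make the cDNA endpoint concrete, the following minimal sketch (our illustration, not code from any particular image-analysis pipeline) computes background-corrected, median-centered log2(R/G) ratios from hypothetical raw channel intensities; real pipelines use more elaborate normalization, such as print-tip loess.

```python
import numpy as np

def log_ratios(red, green, red_bg=None, green_bg=None):
    """Background-correct two-channel spot intensities and return
    median-centered log2(R/G) expression ratios (a simplified sketch)."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    if red_bg is not None:
        red = np.clip(red - np.asarray(red_bg, float), 1.0, None)      # avoid log(<=0)
    if green_bg is not None:
        green = np.clip(green - np.asarray(green_bg, float), 1.0, None)
    m = np.log2(red / green)
    return m - np.median(m)  # center so a typical gene has log-ratio 0

# Hypothetical intensities for five spots on one array.
r = [1200.0, 340.0, 980.0, 150.0, 2600.0]
g = [600.0, 350.0, 1000.0, 140.0, 650.0]
print(log_ratios(r, g))
```

Median centering is the simplest form of within-array normalization: it removes a global dye imbalance but not intensity- or print-tip-dependent effects.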
1.2.4 Special features of microarray data
Considered the technological advance that would reshape the field of genomics, microarrays have shown great promise in a wide spectrum of applications. Initially, microarrays were used mainly for gene screening and target identification. Now the range of microarray applications has expanded to include disease characterization, developmental biology, pathway mapping, mechanism-of-action studies, diagnosis and prognosis of diseases, drug development and toxicology. As array technology is still an intensive research area – for example, protein chips and other chemical microarrays are currently under development – it will be no surprise to see more and more innovative applications of microarrays. For more discussion of current and potential applications of microarrays, see for example Chipping Forecast II (Trent et al., 2002) and references therein.
Microarray experiments generate large and complex multivariate data sets, and some of the greatest challenges lie not in generating these data but in developing the computational and statistical tools to analyze them. The gene expression data generated from microarrays represent a new type of data to which most available statistical methods cannot be readily applied. To extract information from gene expression data effectively and accurately, either available statistical methods have to be adapted, or novel statistical tools have to be developed. We now summarize some of the key features of gene expression data which merit such special treatment.

First of all, because of the genomic-scale nature of microarrays, the biological questions behind microarray experiments are extremely complex, but may be less clearly defined. Statisticians have to go the extra distance to acquire substantial biological knowledge, as a lack of understanding of the underlying scientific questions will lead to inappropriate statistical analysis. On the other hand, scientists have to face more complex statistical methodology. It should be pointed out that different applications involve different scientific questions, and therefore often require different experimental designs, different data formats, and consequently different statistical analyses. Successful analysis requires the joint effort of biologists, computer scientists and statisticians.

The second challenge is the volume of data. Microarray technology enables us to monitor gene expression at the genomic scale, and the genome of an organism usually consists of thousands to tens of thousands of genes. Each spot on the array corresponds to a gene of interest; currently each array contains 5,000–10,000 spots. There are Affymetrix chips that contain the whole genome of a model organism, and arrays with even more spots are on the horizon. So each chip generates thousands of data points, and since there are often replicates over the same or various experimental conditions, the data generated from a single gene expression study typically consist of hundreds of thousands of data points. It is a difficult task to store, process, and extract information from such large amounts of data.

Thirdly, microarray data can complicate data analysis because of a lack of replicates. Microarray data are typically of the “large p, small n” format, i.e., a high-dimensional parameter space with few replicates. This is partly due to the current high cost of manufacturing microarrays, and partly due to the negligence of researchers. Eisen et al.
(1998) even state that “when designing experiments, it may be more valuable to sample a wide variety of conditions than to make repeat observations on identical condition”. While this may, in some sense, be advantageous for the particular data display presented in that paper, it is surely not a sound general recommendation for experiments with different, less exploratory, and more quantitative analytical goals (Bryan, 2003). The statistical community has been begging for more samples over the years, for the lack of replicates limits our ability to measure the precision of the estimates and the reliability of the conclusions. For example, some of the early findings by microarray analysis published in high-profile science journals were based on only one or two replicates; it later turned out that those signals were from genes with extremely large variations even under identical conditions. They were therefore subject to large noise, and could very likely be false discoveries (Lee et al., 2000; Pritchard et al., 2001).

Another important feature of gene expression data is that they are often coupled with a great deal of extra information. In basic science, experiments are often duplicated by laboratories across the world based on standardized protocols, and microarray experiments are no exception. Such duplicates and previous smaller-scale studies can give us important prior knowledge, such as information on regulatory factors, detailed characterization of certain sets of genes, and protein-DNA interactions, which is not provided by the gene expression data themselves. It is clearly beneficial if we can effectively incorporate such information into the gene expression analysis, but how to achieve this is a challenge. Our decision to choose a fully Bayesian framework is significantly influenced by this feature.

The last yet foremost challenge for gene expression data analysis mentioned here is identifying and accounting for the signals and variations in microarray experiments. As pointed out by many authors, gene expression microarrays are powerful, but variability arising throughout the measurement process can obscure the biological signals of interest; this is a key motivation for statistical analyses, which can take these variations into account. There are a large number of sources of variation operating at different times and levels during the course of a typical experiment. Parmigiani et al. (2003b) classify the sources of variation into five phases of data acquisition: microarray manufacturing, preparation of mRNA from biological samples, hybridization, scanning, and imaging. In the following, we
classify them into three major categories: biological variability, experimental variability and processing variability.
• Biological variability is the variation intrinsic to the cell samples themselves, not from any external sources. In microarray studies, the largest variation often comes from cells from different populations. For example, in the case of human biopsy samples, cells could come from different tissues, differ in genotypes, and even in cell types. Other sources of biological variation include differences in the cells' microenvironment, such as nutrients, temperature gradients, growth phases, changes due to internal or external stochastic stimuli, regulatory chemistry imbalance, etc. Even when genetically “identical” cells cultured under “identical” conditions are compared with each other, non-trivial and sometimes significant differences in gene expression levels are observed (Arfin et al., 2000; Baldi and Hatfield, 2002). Although in some sense there are no absolutely “identical” cells, the biological variations can be greatly reduced with good experimental protocols.
• Experimental variability comes from many sources, including the methods by which cells are cultured and samples are obtained. During the making of the arrays, variation can result from differences in the amplification, purification, and concentration of DNA clones for spotting, in the amount of DNA spotted, in the ability of the clones to attach to the slides, and in the shapes of the deposited spots. Systematic within-chip variation can also result from defects of the print-tips of the spotting robotic equipment. During the preparation of the samples, variation can result from the methods by which cells are cultured, and the methods for mRNA isolation, extraction and amplification. During hybridization, variation arises from hybridization conditions such as temperature and humidity, from cross-hybridization of molecules with similar sequences, and from the differential ability of the dyes to be incorporated into the samples. It has been found for some DNA sequences that only one of the two dyes works efficiently. For a more detailed discussion of these variations and strategies to deal with them, see Yang et al. (2001).
• The processing variability arises from the processes by which numerical values are collected, such as fluorescence scanning, image analysis, and intensity readout.

There are many other sources of variation; although many of them are relatively small, the compounding of the effects across the various stages of an experiment can be substantial. For a more detailed review, see Parmigiani et al. (2003b) and references therein.

We believe that all the features summarized above should be considered as thoroughly as possible in order to conduct gene expression data analysis properly, though this may mean extra effort from data analysts. There is, in fact, a certain amount of skepticism and dissatisfaction even within the biological community when it comes to findings based on gene expression analysis. As commented upon above, many early findings based on microarray data do not hold up under closer scrutiny. We attribute these false discoveries mostly to the fact that the statistical analyses failed to acknowledge these special features of gene expression data. We next briefly review some of the most prevalent statistical methods developed for gene expression data.

1.3 Gene Expression Data Analysis
The high cost, high volume and complicated artifacts associated with the various experimental stages have generated a great need for statistical and data-analytic techniques. Such a need has resulted in an explosion of literature on microarray data analysis. A broad range of topics regarding microarray analysis is reviewed by Collins et al. (1999), Trent et al. (2002) and the articles in the Nature Genetics Supplement issues that contain their papers. For detailed biochemical and technological issues involved in microarrays, see the books by Schena (2002) and Kohane et al. (2002). Two excellent books devoted to statistical methods and software for microarray data analysis, Parmigiani et al. (2003b) and Speed (2003), have been put together by some of the finest statisticians in the field. Although statistical methods for gene expression analysis have been covered in great scope and depth by these authors, less attention has been given to the analysis of gene expression time series
data. The work summarized in this dissertation is intended to fill this void.

In an excellent review by Parmigiani et al. (2003a), the data analysis tasks involved in microarray experiments are classified into four phases. The first phase is the experimental design stage, where, based upon the scientific questions of interest, resources and other considerations, the sample size and the allocation of experimental conditions are decided. Good experimental design ensures the resulting data are amenable to statistical analysis at a low expense of resources. Two good reviews are by Churchill (2002) and Yang and Speed (2002).

The second phase is the signal extraction phase, which includes image analysis, gene filtering, probe-level analysis of oligonucleotide arrays, normalization and removal of artifacts for comparisons across arrays, etc. Many of these issues have been reviewed by Bowtell (1999), Tseng et al. (2001), Holloway et al. (2002) and Quackenbush (2002).

The third phase, the data analysis, has received most of the attention. Major tasks include identifying genes differentially expressed across experimental conditions, clustering and classification of biological samples, and clustering and classification of genes. Microarrays allow scientists to go beyond individual genes to study common features, such as regulatory pathways, shared by sets of genes and/or their products. Even in differential expression analysis, information from other genes can be pooled to provide clues about the behavior of individual genes. Such a “guilt-by-association” assumption forms the premise for most high-level gene expression data analysis, in particular clustering and classification. Currently, clustering algorithms and their variations, such as hierarchical clustering, k-means clustering and self-organizing maps, are the most popular gene expression analysis techniques; so popular that there is a misimpression among many biologists that these clustering algorithms are the definitive methods for clustering genes. As many statisticians have cautioned, see for example Bryan (2003), clustering algorithms are limited in many important ways; they should therefore be treated only as exploratory tools, and the results from cluster analysis should be interpreted with care. For detailed coverage of clustering methods and other statistical methods for gene expression analysis, see the reviews by Quackenbush (2001), Fraley and Raftery (2002) and Pan (2002), the books by Parmigiani et al. (2003b) and Speed (2003), and the references therein. The literature is already huge, and still expanding.
The fourth phase is the validation and interpretation phase. At this stage, results from gene expression analysis are validated either on a smaller scale by using alternative assays such as RNA blotting, or by repeating the experiments and carrying out a separate analysis. People have started looking into ways to combine multiple independent microarray datasets or to use online repositories for validation. It would also be of interest to make comparisons across different species to study gene-function relationships preserved by evolution. This phase is still relatively undeveloped, but will surely receive more attention in the future. Ideally, the signal extraction, data analysis, and validation stages should be integrated, with uncertainty propagated across stages. Because of the complexity and novelty of these tasks, only preliminary progress has been made towards this integration. In this dissertation, we restrict ourselves to high-level data analysis, where the normalized gene expression measures, e.g., normalized ratios of fluorescent intensities, are treated as observed data, although it should be pointed out that they are based on spot-level summaries which are in turn based on pixel-level analysis. We follow the prevalent strategy of excluding spots flagged with high uncertainty from subsequent data analysis.
1.4 Gene Expression Time Series
In this dissertation, we develop a fully Bayesian model of gene clustering for gene expression time series. Although commonly encountered in gene expression experiments, the analysis of such data has received relatively little attention until recently. We refer to “time series” in the following broad sense: a time series is a collection of numerical observations arranged in a natural order. Usually each observation is associated with a particular instant or interval of time, as in time series proper, or with a drug concentration, as in pharmacokinetic studies, or a growth temperature, as in bioassay studies. Although the covariates providing the ordering may not necessarily be actual time, they are referred to conventionally as “time”; in this dissertation, we therefore use “time series” and “curve data” interchangeably. Note that since the ordering imposed by the covariates is one aspect of the natural biological process, and contains important information about the temporal transcription programs, it is critical that it be taken into account in the data analysis. But most clustering algorithms do not allow this; for more discussion of this issue, see Chapter 2.
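As a small demonstration of this limitation (a sketch of our own, not code from the dissertation), distance-based algorithms such as k-means treat the time points as exchangeable coordinates: permuting the time order identically for all genes leaves every pairwise Euclidean distance, and hence any clustering driven by those distances, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 hypothetical genes measured at 7 time points (e.g., log-ratios).
X = rng.normal(size=(20, 7))

# Shuffle the time points the same way for every gene.
perm = rng.permutation(7)
X_shuffled = X[:, perm]

def pairwise_dists(A):
    # Euclidean distance between every pair of rows (genes).
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Identical distance matrices: k-means, hierarchical clustering, and
# other distance-based methods cannot see the time ordering.
assert np.allclose(pairwise_dists(X), pairwise_dists(X_shuffled))
print("pairwise distances unchanged after permuting time points")
```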
In this dissertation, we will analyze three gene expression data sets. Two are widely studied and publicly available data sets, to which we give a short introduction here; both are available for public access at the Stanford Microarray Database (SMD; http://genome-www.stanford.edu/microarray/). A description of SMD is given by Gollub et al. (2003). The third is an unpublished cell-cycle data set, which undergoes detailed analysis in Chapter 5.
1.4.1 Sporulation data
Chu et al. (1998) used DNA microarrays containing 97% of the known or predicted genes of the budding yeast Saccharomyces cerevisiae to explore the temporal program of gene expression during meiosis and spore formation. Changes in the concentration of the mRNA transcripts from each gene were measured at seven successive times (t = 0, 0.5, 2, 5, 7, 9, 11.5 hours) after transfer of wild-type diploid yeast cells to a nitrogen-deficient medium that induces sporulation. The mRNA transcript levels from vegetative samples before transfer to the sporulation medium (t = 0) were used as references, and the normalized log ratios of mRNA levels at the designated times to the reference measurements were reported as expression levels for data analysis. As the measurements at t = 0 compare essentially the same genes under identical experimental conditions, we only use them for evaluation of measurement errors, and do not include them in the statistical analysis of the complete time course. The goals are to identify distinctive temporal transcription patterns and to cluster genes sharing common patterns. The underlying assumption is that genes sharing similar transcription patterns (co-expressed) are likely to be co-regulated; such grouping could therefore provide important information about the regulatory mechanisms during sporulation. In the original paper, the authors subjectively selected seven representative temporal profiles based on a few hand-picked genes. Of the approximately 6,200 protein-encoding genes in the yeast genome, they identified more than 1,000 genes as showing “significant” changes in mRNA levels during sporulation using fold-changes. These significantly changed genes were subsequently grouped according to the representative profile that gave the highest correlation coefficient. As one of the earliest applications of microarrays, this work has proven
to be quite influential in the field, and the data set has been re-analyzed by many other authors. Though the statistical methods used were novel at the time, it is now clear that the study had serious problems in both the experimental design and the data analysis: the temporal resolution was rather low, there were no replicates, the differential analysis was based on fold-changes, and the clustering method was very heuristic. We will use this data set to demonstrate how our methods can overcome some of these problems. A random sample of the normalized sporulation data is shown in Figure 1.4, with the t = 0 measurements excluded.
Figure 1.4: A random sample of 100 genes from the sporulation data. The entire data set is available from SMD website http://genome-www.stanford.edu/microarray/.
1.4.2
Cell-cycle data
Another widely used data set is from the cell-cycle study conducted by Spellman et al. (1998). This work, along with the sporulation study, was among the first few full-scale applications of microarrays. The great potential of microarrays, as demonstrated by these studies, was soon recognized by other scientists, thus starting the “array age”.
Figure 1.5: A random sample of 100 genes from the cell-cycle dataset. The entire dataset is available from SMD website http://genome-www.stanford.edu/microarray/.
Spellman et al. (1998) sought to create a comprehensive catalog of yeast genes whose transcript levels varied periodically within the cell cycle. They used microarrays and budding yeast Saccharomyces cerevisiae cells synchronized by three independent methods: α-factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. After release from synchrony, mRNA transcript levels relative to asynchronous cell samples were measured at fixed time intervals (which differed between synchronization
methods) using cDNA microarrays. More information about cell-cycle regulation and synchronization will be given in Chapter 4. Using periodicity and correlation algorithms, the authors identified about 800 genes that met an objective minimum criterion for cell-cycle regulation. We will illustrate our method using this data set, since it is publicly accessible and has been re-analyzed by many other authors. This is a case where a known pattern, periodicity, is present and therefore should be explicitly modelled. Because the α-factor synchronization method is the best of the three, we restrict our attention to the data from the α-factor experiments. Figure 1.5 shows a random sample of α-factor synchronized yeast genes.
Chapter 2

BAYESIAN HIERARCHICAL MODELS AND MCMC TECHNIQUES

Our model-based clustering approach is both a combination and an extension of Bayesian hierarchical models and mixture models. The computation is carried out via the implementation of novel MCMC sampling techniques.

2.1
A Hierarchical View of Gene Expression Time Series
One of the main aims of gene expression analysis is to cluster genes based on the “similarity” of their expression levels across various (ordered) conditions, such as time points, doses, or temperatures. It has been pointed out that, although two genes within a regulated pair do not necessarily have similar expression patterns, correlated expression levels usually suggest a functional correlation; see, e.g., Eisen et al. (1998), Zhou et al. (2002), and Filkov et al. (2002). A further extension of this idea is that a group of genes with “similar” expression profiles could be co-regulated by a common mechanism, and such regulation mechanisms could interact with each other, or even be regulated at a higher level themselves, in order to carry out specific biological functions. For an example of such a hierarchy in cell-cycle control, see the review by Breeden (2003). For the purpose of clustering, it is helpful to consider the natural hierarchies within gene expression data. The first hierarchy is at the individual gene level, where the expression levels of each gene at different time points can be directly observed. Depending on the focus of the scientific investigation, gene-specific random effects may or may not be directly modeled; for example, if we are more interested in studying how genes are regulated as a group, we may choose to integrate out the random effects and concentrate on the higher-level parameters mentioned next. The next hierarchy involves a finite number of distinct regulation mechanisms (sub-populations) which regulate the expression of various groups of genes; often the number of such mechanisms is unknown, and groups of genes are allowed to overlap since they
could share certain regulation pathways. Higher hierarchies could be placed on top of this finite number of regulation mechanisms, under the assumption that all regulation mechanisms share common features at the population level, or are further regulated by fewer yet more general mechanisms. This view of gene expression data provides a natural motivation for Bayesian hierarchical models and mixture models. In the following we review hierarchical models, with emphasis on Bayesian population models, which have been extensively applied in biomedical growth curve analysis, pharmacokinetic studies and many other fields. We then review mixture models and model-based clustering. Since our fully Bayesian models may involve non-linear components and must deal with parameter spaces of varying dimensions, special effort has to be devoted to appropriate computational tools. We will review the various MCMC sampling techniques used in our computation and illustrate their use in Bayesian inference.
2.2
Bayesian Hierarchical Models
Hierarchical models provide a powerful and flexible approach to the representation of beliefs about observables in extended data structures. They are most suited to data arising from natural hierarchical structures, yet there are also situations where conceptual hierarchical thinking can be fruitful. For example, in statistical applications involving multiple parameters that are related or connected in some way, hierarchical models can be used to provide a joint probability model with dependence placed on some of the parameters, while avoiding problems of over-fitting. An excellent account of hierarchical models is given by Gelman et al. (1995). A large literature exists on Bayesian hierarchical models; some key references are Lindley (1971), Lindley and Smith (1972), Smith (1973), Berger and Robert (1990), Bernardo and Smith (1994) and references therein. The frequentist counterparts of Bayesian hierarchical models are often termed “multilevel models” or “mixed effects models”. As we are only concerned with Bayesian modelling in this dissertation, readers are referred to the following references for more information. Davidian and Giltinan (1995) and Vonesh and Chinchilli (1996) provide good overviews as well as
theoretical developments and examples of non-linear mixed models. Pinheiro and Bates (1995, 2000) are primary references for the theory and computational techniques for such models. Another detailed treatment of multilevel models is given by Goldstein (2003). A parallel literature exists in the area of generalized linear mixed models, in which random effects appear as a part of the linear predictor inside a link function. Methods dealing with such models are discussed in articles such as Stiratelli et al. (1984), Anderson and Aitkin (1985), Lindstrom and Bates (1990), Breslow and Clayton (1993), and Diggle et al. (1994).
2.2.1
Population models
As mentioned in Chapter 1, we focus on curve data in this dissertation. In the Bayesian literature, a class of models dealing specifically with such data is population models. Population models have been widely applied in many fields, including biomedical growth curve analysis (Berkey, 1982), educational research (Rubin, 1981), and econometrics (Swamy, 1970). An area in which population models have found extensive application is pharmacokinetics; see, for example, Wakefield (1996) and Wakefield and Bennett (1996). In recent years, with the advent of simulation techniques and computing power, related models, perhaps described with different terminology, have become increasingly popular for the analysis of large and complex data sets, such as multi-center clinical trials (Skene and Wakefield, 1990; Turner et al., 2001), spatial epidemiology studies (Richardson et al., 2002), disease mapping (Green and Richardson, 2002) and ecological studies (Best et al., 2001), to name a few. From a Bayesian perspective, population models are variations on and extensions of the following hierarchical structure, expressed in terms of generic densities.
Let y = (y_1, · · · , y_n) denote the totality of measurement data on n units in a designated population (e.g., of patients, experimental animals, or genes as in our examples); let θ = (θ_1, · · · , θ_n) denote the parameters defining the n underlying ‘response’ profiles (e.g., weight versus age, drug concentration versus time after administration, or gene expression level versus time after release from synchronization); and let φ denote hyper-parameters defining the relationships among the components of θ. Population models then correspond to the following three-stage hierarchy of distributions:

p(y | θ) = p(y_1, · · · , y_n | θ) = ∏_{i=1}^{n} p(y_i | θ_i),
p(θ | φ) = p(θ_1, · · · , θ_n | φ) = ∏_{i=1}^{n} p(θ_i | φ),        (2.1)
p(φ).

In the context of the above expressions, it may be of interest to make inference about the unit profile characteristics, the θ_i’s; about the population characteristics φ; or about predictions of future observations, either from an individual already included or from a new individual assumed to be exchangeable with the study population. In all cases, straightforward probability manipulations involving Bayes’ theorem provide the required posterior inferences:

p(θ_i | y) = ∫ p(θ_i | φ, y) p(φ | y) dφ,

where

p(θ_i | φ, y) ∝ p(y_i | θ_i) p(θ_i | φ),
p(φ | y) ∝ p(y | φ) p(φ),
p(y | φ) = ∫ p(y | θ) p(θ | φ) dθ,

and

p(ỹ | y) = ∫ p(ỹ | θ̃) p(θ̃ | y) dθ̃.
But the integrals required for a fully Bayesian analysis are typically not available in closed form, and therefore a numerical or analytical approximation is required. Analytical approximation approaches often fail to give entirely satisfactory results, largely due to the high dimensionality of the parameter space involved (Kass et al., 1991). In this dissertation, we will demonstrate that a highly effective Bayesian computation strategy for clustering curve data is available, based on various Markov chain Monte Carlo (MCMC) techniques. We will review the general MCMC sampling techniques, along with the special algorithms used in our methodology, in later sections of this chapter.
Also note that at the first stage of population models, the relationship between responses and covariates is modelled with a specific functional form; see, for example, the non-linear pharmacokinetic model used in Wakefield et al. (1994). We believe special attention should be given to the functional forms of the curves in clustering problems, because the covariates, whether time, drug concentration, or something else, constrain both the order and the magnitude of the responses. We will further elaborate on this point in later examples.
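To make the three-stage structure (2.1) concrete, the following minimal sketch (in Python, with purely illustrative dimensions, distributions and parameter values) simulates curve data by first drawing unit-specific parameters θ_i from a population distribution governed by φ, and then drawing noisy observations y_i around each unit’s curve.

import numpy as np

rng = np.random.default_rng(0)
n_units, n_obs = 5, 7                      # number of curves, observations per curve
times = np.linspace(0.0, 11.5, n_obs)      # the ordering covariate ("time")

# Stage three: hyper-parameters phi (population mean, between-unit sd, error sd)
pop_mean = np.array([1.0, -0.2])           # intercept and slope of a typical curve
pop_sd, error_sd = 0.5, 0.3

# Stage two: unit-specific parameters theta_i ~ p(theta_i | phi)
theta = rng.normal(pop_mean, pop_sd, size=(n_units, 2))

# Stage one: observations y_i ~ p(y_i | theta_i), here linear curves plus noise
X = np.column_stack([np.ones(n_obs), times])
y = X @ theta.T + rng.normal(0.0, error_sd, size=(n_obs, n_units))
print(y.shape)                             # one noisy curve per column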
2.2.2
Bayesian dynamic models
The majority of the curve data we encounter in applications are time series, i.e., repeated measurements over time. Classical time series analysis is a well-developed area with a huge literature; many books are devoted to it, such as Anderson (1994), Fuller (1996), and Bloomfield (2000), to which readers can refer, along with the references therein, for details on classical time series techniques. Gene expression data usually consist of a large number of short time series, for which classical autoregression-type methods usually do not work well, owing to the few observations per curve and the lack of a flexible structure for the time trend. A newer class of models, namely state-space models, also termed “dynamic models” by West and Harrison (1997), has been developed and successfully applied to such time series, as well as to spatial and spatio-temporal data. State-space models relate responses to unobserved “states” or “parameters” through an observation model. The states, which may represent, e.g., an unobserved temporal or spatial trend or time-varying covariate effects, are assumed to follow a latent Markov model. This feature makes state-space models an appealing choice for gene expression time series where there is no clear information about the functional form of the time trend, as we demonstrate using the sporulation data in later chapters. For approximately normal data, linear state-space or dynamic models and the famous linear Kalman filter have found numerous applications in the analysis of time series; see, e.g., Anderson and Moore (1979), Harvey (1989) and West and Harrison (1997). For extensions to non-Gaussian time series, and more general state-space models, see Martin and Raftery (1987), Fahrmeir and Tutz (2001) and references therein.
Given the observations y_1, y_2, · · · , y_t up until time t, estimation of current, future, and past states (“filtering”, “prediction”, and “smoothing”) is a primary goal of inference. The standard linear state-space model can be written as the following hierarchy.

Stage I: The uni- or multivariate observations y_t are related to unobserved state vectors α_t by a linear observation equation

y_t = Z_t α_t + ε_t,   t = 1, 2, · · · ,        (2.2)

where Z_t is an observation or design matrix of appropriate dimension, and {ε_t} is a white noise process, i.e., a sequence of mutually uncorrelated error variables with E(ε_t) = 0 and cov(ε_t) = Σ_t. For univariate observations the design matrix reduces to a design vector z_t′ and the covariance matrix to the variance σ_t². Moreover, the design matrix may depend on covariates and/or past observations, so that Z_t = Z_t(x_t, y*_{t−1}), with y*_{t−1} = (y_{t−1}, · · · , y_1).

Stage II: The sequence of states is defined by a linear transition equation

α_t = F_t α_{t−1} + ξ_t,   t = 1, 2, · · · ,        (2.3)

where F_t is a transition matrix, {ξ_t} is a white noise sequence with E(ξ_t) = 0 and cov(ξ_t) = Q_t, and the initial state α_0 has E(α_0) = a_0 and cov(α_0) = Q_0.

Stage III: The joint and marginal distributions of {y_t, α_t} are completely specified by distributional assumptions on the errors {ε_t}, {ξ_t} and the initial state. Since linear dynamic models, in combination with the linear Kalman filter and smoother, are most useful for analyzing approximately Gaussian data, joint normality is most often assumed for the errors and the initial state:

ε_t ∼ N(0, Σ_t),   ξ_t ∼ N(0, Q_t),   α_0 ∼ N(a_0, Q_0),        (2.4)

with {ε_t}, {ξ_t} and α_0 mutually independent.

Given the above hierarchical model, estimation of α_t is often the primary goal. Under the normality assumption, the optimal solution to the filtering problem is given by the posterior mean E(α_t | y*_t, x*_t). The famous linear Kalman filter and smoother compute the posterior means and covariance matrices in an efficient recursive way. Because the model is linear and Gaussian, it is not difficult to verify that the posterior distributions are Gaussian
too. The usual derivations of the Kalman filter and smoother take advantage of this fact: proofs such as those in Anderson and Moore (1979) and Harvey (1989) repeatedly apply formulas for the expectations and covariance matrices of linear transformations, in combination with Bayes theorem. For more details, see the above references. More recently, researchers have been attracted to a new class of filtering methods based on the sequential Monte Carlo approach (Liu, 2001). In our implementation of the Bayesian dynamic model for gene expression clustering, we too resort to MCMC for the filtering.
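The recursions themselves are short. The following minimal sketch implements the linear Kalman filter for the model (2.2)-(2.4) in the simplest scalar, time-invariant case; the simulated random-walk trend and all variance values are illustrative assumptions only.

import numpy as np

def kalman_filter(y, Z, F, Sigma, Q, a0, Q0):
    """Filtered means E(alpha_t | y_1..y_t) and variances, scalar model."""
    a, P = a0, Q0
    means, variances = [], []
    for yt in y:
        # Prediction: propagate the state through the transition equation (2.3)
        a_pred, P_pred = F * a, F * P * F + Q
        # Update: correct the prediction using the observation equation (2.2)
        K = P_pred * Z / (Z * P_pred * Z + Sigma)      # Kalman gain
        a = a_pred + K * (yt - Z * a_pred)
        P = (1.0 - K * Z) * P_pred
        means.append(a)
        variances.append(P)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(1)
alpha = np.cumsum(rng.normal(0.0, 0.3, size=50))   # latent random-walk trend
y = alpha + rng.normal(0.0, 0.5, size=50)          # noisy observations
m, v = kalman_filter(y, Z=1.0, F=1.0, Sigma=0.25, Q=0.09, a0=0.0, Q0=1.0)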
2.3
Bayesian Mixture Models and Model-based Clustering
Cluster analysis is the identification of groups of observations that are cohesive within groups but separated between groups. As reviewed in Chapter 1, clustering algorithms have helped biologists gain important clues about the function and regulatory control of vast numbers of poorly characterized genes. Depending on the scientific questions at hand, we may be interested in clustering genes, samples, or both. In this dissertation, we concentrate on the clustering of genes; for problems of clustering samples, or genes and samples simultaneously, see Golub et al. (1999) and Chipman et al. (2003).
2.3.1
Model-free clustering versus model-based clustering
Here we make the distinction between “model-free” and “model-based” clustering. Most conventional clustering methods are model-free: they treat the observed data as a fixed input and do not make any parametric assumptions. One widely used class of methods involves hierarchical clustering, in which a nested sequence of clusters is generated, either by merging pairs of sub-clusters or by splitting larger clusters into smaller ones, so as to optimize some criterion. Popular criteria include the distance between centroids, the minimum distance between two clusters (single-linkage), the average distance between clusters (mean-linkage), and the maximum distance between the clusters (complete-linkage). A similarity measure widely used in gene expression clustering is an “empirical correlation”-like measure, which has the advantage of being location and scale invariant (Eisen et al., 1998). Another common class of methods is based on iterative relocation (also called iterative partitioning), in which, given the number of clusters, data points are re-allocated from one cluster to another until there is no further improvement in some criterion. Popular partitioning methods in gene expression analysis include k-means clustering (Quackenbush, 2001) and self-organizing maps, or SOMs (Tamayo et al., 1999). Good references on general model-free clustering methods include Hartigan (1975), Kaufman and Rousseeuw (1989), Everitt (1993) and Gordon (1999). For specific applications of such clustering algorithms to gene expression analysis, and the related statistical issues, see Quackenbush (2001), Goldstein et al. (2002), Chipman et al. (2003), Parmigiani et al. (2003b) and references therein. Model-free clustering methods are based largely on heuristic but intuitively reasonable procedures. This makes software implementation straightforward; in applications they run fast even on large data sets and often give plausible results. But owing to their heuristic nature, these clustering methods suffer serious limitations, especially for complicated problems such as gene expression analysis. Some major drawbacks are as follows:
1. These clustering methods fail to acknowledge the variation inherent in the observed data. As we reviewed in Chapter 1, substantial variation arises at the various stages of a microarray experiment, and the gene expression data input into clustering algorithms are themselves the output of a sequence of data processing steps. But model-free clustering algorithms implicitly assume that the underlying variables measured, and the resultant clusters, are essentially free of stochastic noise. As has been realized by many experts in the field, this can lead to extremely imprecise and sometimes misleading estimates (Kohane et al., 2002; Pritchard et al., 2001).
2. Moreover, the variability in the data makes the resulting clusterings themselves extremely noisy. A gene may be expressed quite differently simply because it plays different roles in various biological processes, so for experiments run under different conditions the clusters are likely to differ according to the genes’ participation. Even under the same conditions, a group of genes clustered together using one sample is rarely the same as the group obtained with another sample. It is therefore important to measure how reproducible the clusters are; but without formal parametric assumptions it is difficult to make any formal inference, including estimation of measures of uncertainty. The uncertainty issue and related remedies have been investigated by Kerr and Churchill (2001) and van der Laan and Bryan (2001), amongst others.

3. For curve data, the measurements are constrained to a specific ordering by a known factor, most often time. This distinguishes curve data from other multidimensional data, say data from a randomized block design, where there is no natural ordering. This ordering should therefore be taken into account during clustering; but for model-free clustering methods, reordering the measurements in time within each gene makes no difference to the final clustering of genes.

4. For gene expression time series, it is often of interest to study how gene expression changes over time, but model-free clustering methods provide no information in this respect. Furthermore, other information, such as covariates, curve forms and measurement error, is not incorporated into the clustering, and it is difficult to do any estimation with these methods.

5. Although sensitivity to outliers is an issue with all clustering methods, the lack of structure in model-free clustering methods makes them much more susceptible to outliers. In model-based approaches, by contrast, robustness to outliers can be achieved by using heavy-tailed distributions such as Student t-distributions.

6. Model-free clustering methods take an N × T matrix as input, corresponding to N genes measured at T time points. It is difficult to visualize the structure of T-dimensional curves even if T is only moderately large, say 10 or 12. The data matrix also has to come from a balanced design, i.e., all gene expression levels measured at the same time points with the same number of time points, which clearly impedes the combined analysis of data from different design protocols. For imputation strategies dealing with missing data in the input matrix, see Chipman et al. (2003).
Attempts to reduce the dimensionality of the input data matrix include singular value decomposition (SVD) by Alter et al. (2000), principal component analysis (PCA) by Yeung and Ruzzo (2001), and the gene shaving algorithm of Hastie et al. (2000). Given these limitations, it should be emphasized that model-free clustering methods, which account for the majority of the clustering algorithms used in gene expression analysis, should be treated only as exploratory tools; their results should be interpreted with caution and subjected to further validation.

2.3.2

Model-based clustering
It has long been realized that basing cluster analysis on a probability model can provide a principled framework for clustering, and that many of the problems associated with the model-free clustering methods mentioned above can then be overcome. This realization has led to a better understanding of when existing methods are likely to perform well, and to the development of new methods. Connections between some of the most popular heuristic clustering methods and model-based clustering procedures have also been established; for a recent review, see Fraley and Raftery (2002). The literature on probability model-based clustering and classification is extensive. Just like model-free methods, model-based clustering methods can often be classified into two basic types, partitioning and hierarchical; see Bock (1996) for a comprehensive survey. Model-based clustering methods worth mentioning include product partition models (PPMs), plaid models, and phylogenetic trees. Product partition models, as the name suggests, are partitioning methods. Introduced by Hartigan (1975), and recently extended through decision-theoretic approaches to situations with an unknown number of clusters, they have the potential for fully Bayesian clustering of gene expression data (Quintana and Iglesias, 2003). Both plaid models and phylogenetic tree models are hierarchical clustering models. Proposed by Lazzeroni and Owen (2002), plaid models decompose the effects of clusters, genes and samples into various layers using ANOVA-like models, and can cluster genes and samples simultaneously. Phylogenetic tree models have been widely used to study the evolution of molecular sequences, and they also provide a convenient way to cluster high-dimensional morphological data. The usual estimation procedure is maximum likelihood over all tree topologies, with different topologies corresponding to different clusterings; for an excellent review, see Felsenstein (1983). Wen et al. (1998) applied phylogenetic tree clustering to a gene expression time series from rats to study central nervous system development. Although many probability models exist for clustering, the most popular model-based clustering methods are based on mixture models. By nature, mixture model-based clustering methods are of the partitioning type. Some of the theoretical difficulties related to likelihood estimation and hypothesis testing may not necessarily apply to hierarchical-type models; for example, there is no clear definition of the likelihood ratio test when parameters can lie on the boundary of the parameter space. The literature on mixture models is large and still growing; readers are referred to the book by McLachlan and Peel (2000) and references therein for more comprehensive coverage. In the following we only attempt a brief review of mixture models.
2.3.3
Bayesian mixture models
Since the first major analysis involving the use of mixture models, by Pearson (1894), finite mixture models have drawn strong and sustained interest, from both practical and theoretical points of view. Their usefulness as a mathematically based approach to the statistical modeling of a wide variety of random phenomena can be attributed primarily to two complementary aspects: first, they provide a natural framework for the modelling of heterogeneity (thereby accommodating cluster analysis easily); second, they provide a convenient and flexible semi-parametric framework for estimating or approximating distributions that are not well modelled by any standard parametric family, e.g., for density estimation or the construction of Bayesian priors. For comprehensive discussions of the field, see the books by Titterington et al. (1985) and McLachlan and Basford (1988); those two books cover largely non-Bayesian analysis of mixture models, including maximum likelihood methods. Coverage of more recent progress, including a chapter on Bayesian analysis of mixture models, can be found in McLachlan and Peel (2000).
Statistical analysis of mixtures has not been straightforward, with non-standard problems posed by the geometry of the parameter space, as well as computational difficulties. It is only in the last twenty or so years that considerable advances have been made in the fitting of finite mixture models, in particular by the method of maximum likelihood, made feasible through the implementation of the EM algorithm (Dempster et al., 1977) and its variants (Celeux et al., 1996). However, we share the view of Richardson and Green (1997) that the Bayesian paradigm is particularly suited to mixture analysis, especially with an unknown number of components. Fully Bayesian approaches usually outperform the EM algorithm when there are many spikes on the likelihood surface (corresponding to small and/or linear clusters); they provide more flexible and precise statistical summaries; they provide better uncertainty assessment, since they take the uncertainty of the parameter estimates into account; and they can easily be extended to incorporate covariates and prior information. For more detailed discussion of these issues, see Stephens (1997) and Fraley and Raftery (2002). Although mixture models have been studied for more than a century, Bayesian analysis of mixture models has only recently become popular, owing to advances in both methodology and computing power, especially the development of MCMC. Chapters on the application of Bayesian methods to mixture models are included in the books by Robert (1994) and Gelman et al. (1995); see also Robert (1996). Some key papers on the Bayesian analysis of mixtures are Diebolt and Robert (1994), Escobar and West (1995), Richardson and Green (1997), and Stephens (2000a). We write the basic mixture model for independent scalar or vector observations y_i as

p(y_i | k, w, θ) = ∑_{j=1}^{k} w_j f_j(y_i | θ_j),   i = 1, · · · , n,        (2.5)

where k is the number of components, w = (w_1, · · · , w_k) are the mixing proportions or weights, constrained to be non-negative and to sum to unity, θ = (θ_1, · · · , θ_k) are the component parameters, and f_j(· | θ) is a given parametric family of densities indexed by a scalar or vector parameter θ. Different parametric families may be assumed for different components, but in practice we often assume a common parametric family for all components, for example mixtures of normal distributions. For the mixture models in
this dissertation, we assume a mixture of distributions from a common parametric family, and subsequently we suppress the index j on f_j(· | θ). The objective of the analysis is inference about the unknowns: the number of components k (if unknown), the component parameters θ and the component weights w. The mixture likelihood for a mixture model with k components is

L_M(k, w, θ) = ∏_{i=1}^{n} ∑_{j=1}^{k} w_j f(y_i | θ_j).        (2.6)
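As an illustration, the following minimal sketch evaluates the mixture log likelihood corresponding to (2.6) for a univariate normal mixture; the data generation and parameter values are illustrative assumptions only.

import numpy as np
from scipy.stats import norm

def mixture_loglik(y, w, mu, sd):
    """log L_M(k, w, theta) = sum_i log sum_j w_j f(y_i | theta_j)."""
    dens = w * norm.pdf(y[:, None], loc=mu, scale=sd)   # n x k matrix of w_j f(y_i | theta_j)
    return np.log(dens.sum(axis=1)).sum()

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 50)])
print(mixture_loglik(y, w=np.array([2 / 3, 1 / 3]),
                     mu=np.array([-2.0, 3.0]), sd=np.array([1.0, 1.0])))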
Such a model arises in two rather distinct contexts. In the first, each observation is assumed to have arisen from one of k (with k possibly unknown) heterogeneous groups, each group being suitably modelled by a density from the parametric family f, characterized by the parameter θ. The mixing proportions w then represent the relative frequencies of occurrence of the groups in the population, and the model provides a framework by which observations may be clustered into groups for discrimination or classification; see, for example, McLachlan and Basford (1988). In the second context, the mixture model (2.5) is thought of as a convenient, parsimonious representation of a non-standard density, and the objective of inference is semi-parametric density estimation. This is not the focus of this dissertation; interested readers can refer to the articles by Roeder (1994), Priebe (1994), and Roeder and Wasserman (1997). In the clustering context, it is convenient to introduce the missing data formulation of the model, in which each observation y_i is assumed to arise from a specific but unknown component z_i ∈ {1, 2, · · · , k} of the mixture. We will refer to the missing group labels z = (z_1, · · · , z_n) as the latent allocation variables, and to (y, z) as the complete data. Assume z_1, · · · , z_n are realizations of independent and identically distributed discrete random variables Z_1, · · · , Z_n with probability mass function

Pr(Z_i = j | w, θ) = w_j,   i = 1, · · · , n;  j = 1, · · · , k.

If we consider Z_i as a k-dimensional indicator vector, with jth element one and all other elements zero when Y_i is drawn from the jth component, then it is easy to see that Z_i is distributed according to a multinomial distribution with one draw from k categories, denoted by Z_i | w ∼ Multin_k(1, w). Conditional on Z_i, the y_i are assumed to be independent observations from the density

p(y_i | Z_i = j, w, θ) = f(y_i | θ_j),   i = 1, · · · , n.
Integrating out the missing data Z_1, · · · , Z_n then yields the model (2.5):

p(y_i | w, θ) = ∑_{j=1}^{k} Pr(Z_i = j | w, θ) p(y_i | Z_i = j, w, θ) = ∑_{j=1}^{k} w_j f(y_i | θ_j).        (2.7)
The likelihood for the complete data (y, z), also called the classification likelihood, has the form

L_C(θ, z | y) = ∏_{i=1}^{n} f(y_i | θ_{z_i}).        (2.8)
If the goals of the mixture analysis include classification, then inference for the allocation variables Z may be of interest in itself, and we may be interested in quantities such as the classification probabilities

Pr(Z_i = j | y_i, w, θ) ∝ Pr(Z_i = j | w, θ) p(y_i | Z_i = j, w, θ) ∝ w_j f(y_i | θ_j),        (2.9)

which gives

Pr(Z_i = j | y_i, w, θ) = w_j f(y_i | θ_j) / ∑_{j′=1}^{k} w_{j′} f(y_i | θ_{j′}).        (2.10)
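The classification probabilities (2.10) are straightforward to compute once the component parameters are given, as in the following minimal sketch for a univariate normal mixture with illustrative parameter values.

import numpy as np
from scipy.stats import norm

def classification_probs(y, w, mu, sd):
    """Row i holds Pr(Z_i = j | y_i, w, theta) for j = 1, ..., k, as in (2.10)."""
    num = w * norm.pdf(y[:, None], loc=mu, scale=sd)   # numerators w_j f(y_i | theta_j)
    return num / num.sum(axis=1, keepdims=True)        # normalize over components

probs = classification_probs(np.array([-2.5, 0.4, 3.1]),
                             w=np.array([0.5, 0.5]),
                             mu=np.array([-2.0, 3.0]),
                             sd=np.array([1.0, 1.0]))
print(probs.round(3))   # each row sums to one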
The simulation of group labels Z_i according to (2.9) is called completing the sample, following EM terminology. Even if inference for the allocation variables is of no direct interest, as may be the case when the mixture model is used purely to approximate a non-standard density, the missing data formulation is still a convenient notational and computational device: it facilitates the EM algorithm and leads to much simplified Bayesian computation. Maximum likelihood estimation of mixture models with known k can be based either on the mixture likelihood (2.6) (McLachlan and Basford, 1988; Celeux and Govaert, 1995) or on the classification likelihood (2.8) (Banfield and Raftery, 1993; Fraley and Raftery, 1998), in both cases through the implementation of the EM algorithm or its variants. In a Bayesian framework, the unknowns w, θ, and k (if assumed unknown) are regarded as drawn from appropriate prior distributions, and estimation and inference are then based on the posterior distribution. In theory, the Bayesian approach appears to have many advantages over the maximum likelihood approaches; in practice it has problems of its own, including computational difficulties, prior choices, model selection, label switching, and summarization (Stephens, 1997; Celeux et al., 2000). In recent years great progress has been made on many of these issues; key works include Diebolt and Robert (1994), Richardson and Green (1997), and Stephens (2000a,b). In this dissertation, we will combine some of the newly developed tools in order to deal with the curve clustering problem. The models will be specified through flexible hierarchical mixture models; informative prior distributions will be used where such information is available, and otherwise “weakly informative” priors in the sense of Richardson and Green (1997). Model selection is carried out naturally by modeling the number of clusters k as an unknown parameter in the joint distribution. Label switching will be dealt with using identifiability constraints and the decision-theoretic approach of Stephens (2000b). More detailed discussion will be given in later chapters as real data sets are considered. Next we review some basic MCMC techniques for computation involving mixture models.

2.4
MCMC Techniques in Bayesian Computation
Suppose we have a parameter of interest θ and observed data y. In Bayesian analysis, θ is treated as a random variable, and inference requires the evaluation of integrals of the form

E_π f = ∫ f(θ) π(θ | y) dθ,        (2.11)

where f is a relevant real-valued function of the parameter, and the expectation is taken with respect to the posterior distribution π(θ | y). For example, if f(θ) = θ, the integral (2.11) gives the posterior mean of θ. In most cases, analytic evaluation of the integral (2.11) (a summation for a discrete parameter) is intractable, and if the integral is over a very high dimensional space, traditional methods of approximate integration, such as the Laplace approximation, are impossible to apply accurately. See Smith (1991) for an overview and detailed references on integration via numerical approximation techniques. If independent samples can be drawn directly from the posterior distribution π(θ | y), or from some other appropriately chosen trial distribution and suitably re-weighted to form samples from π(θ | y), then standard Monte Carlo methods can be applied to evaluate (2.11). Suppose samples {θ^(1), · · · , θ^(n)} are drawn independently from π(θ | y); then the sample path average

f̄_n = (1/n) ∑_{t=1}^{n} f(θ^(t))        (2.12)

is a consistent estimator of E_π f, in that it converges almost surely to E_π f as n → ∞. This result follows directly from the strong law of large numbers. Ripley (1987) provides an excellent account of standard Monte Carlo techniques. For distributions of high complexity and with high dimensional parameters, which we increasingly encounter in applications, directly generating samples from π(θ | y) is often not feasible. In such situations, an alternative approach is provided by MCMC methods (see for example Gilks et al. (1996)), which are close in spirit to standard Monte Carlo methods: correlated samples are generated by running a Markov chain with the distribution of interest, π(θ | y), as its invariant distribution, and various aspects of the posterior distribution, including the integral (2.11), can then be evaluated based on these samples. We now briefly describe the theory underlying MCMC methods, and introduce Metropolis-Hastings algorithms, the Gibbs sampler and other more specific MCMC algorithms for mixture model analysis. We then illustrate the use of the Gibbs sampler in fitting a hierarchical model for growth curve data.
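As a simple numerical check of (2.12), suppose the posterior is π = N(1, 2²) and f(θ) = θ², so that E_π f = 1 + 4 = 5 exactly; a minimal sketch:

import numpy as np

rng = np.random.default_rng(3)
draws = rng.normal(loc=1.0, scale=2.0, size=100_000)   # theta^(t) drawn from pi
f_bar = np.mean(draws ** 2)                            # sample path average (2.12)
print(f_bar)                                           # close to the exact value 5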
2.4.1
Introduction to MCMC
MCMC methods have had a profound influence on statistics over the past dozen years, especially but not only in Bayesian inference. Some key references include Gelfand and Smith (1990), Smith and Roberts (1993), Besag and Green (1993), Tierney (1994), Besag et al. (1995), Gelman et al. (1995), Gilks et al. (1996) and Robert and Casella (1999).
Although the basic concepts of Markov chains are most easily described on a discrete state space, for consistency with the later presentation of algorithms we choose to present some important definitions and results for general state space Markov chain theory in this section. Their counterparts for discrete state spaces can be found in Roberts (1996). A rigorous mathematical discussion has to be set in an appropriate theoretical framework; see for example Tierney (1994, 1996). In our presentation, we mainly use elementary language and avoid measure-theoretic terminology.

Definition 1 (Markov chain) A Markov chain in discrete time and general state space X is a sequence of random variables X^(0), X^(1), · · · , with X^(t) ∈ X, which satisfies the first-order Markov property:

P(X^(t+1) ∈ A | X^(t), · · · , X^(0)) = P(X^(t+1) ∈ A | X^(t)),        (2.13)

for any set A ⊂ X. That is, the value of X^(t+1) depends on its history only through its nearest past, X^(t). Typically for MCMC, the Markov chain takes values in R^d (d-dimensional Euclidean space).

Definition 2 (Transition kernel) The distribution of a Markov chain {X^(t)} on a state space X is specified by the distribution of X^(0) and by its transition kernel

P(x, A) = Pr(X^(t+1) ∈ A | X^(t) = x),        (2.14)

for all x ∈ X and A ⊂ X. If the transition kernel does not change over time t, we call the Markov chain time homogeneous. Different MCMC strategies, such as the Gibbs sampler and Metropolis-Hastings algorithms, give rise to different transition kernels. Transition kernels are general state space versions of the transition matrices of discrete chains. To simplify notation, we introduce the transition function P(x, y) for both cases.

Definition 3 (Invariant distribution) A Markov chain {X^(t)} with transition kernel P has invariant distribution π if π = πP, that is, if

π(y) = ∫_X π(x) P(x, y) dx.        (2.15)
Invariant distributions are often called stationary distributions. By this definition, if X^(0) ∼ π, then X^(t) ∼ π for all t = 1, 2, · · · . Equation (2.15) is also called general balance. In MCMC we already have a target distribution π, and the goal is to construct a proper transition kernel P which ensures that π is the invariant distribution of the Markov chain, in other words, one that satisfies the general balance condition (2.15). But we would like to avoid the generally intractable integration (summation in a discrete state space) over the state space X in (2.15). We can achieve this by demanding a more restrictive, but easier-to-check, condition than (2.15), namely detailed balance.

Definition 4 (Detailed balance) The transition function P(x, y) satisfies the detailed balance condition if

π(x) P(x, y) = π(y) P(y, x).        (2.16)

Clearly, if detailed balance (2.16) holds, we have

∫ π(x) P(x, y) dx = ∫ π(y) P(y, x) dx = π(y) ∫ P(y, x) dx = π(y).

Thus detailed balance is sufficient for invariance; it is not a necessary condition. In the Markov chain literature, chains that satisfy the detailed balance condition are called time reversible. To ensure that the sample path average (2.12) converges to the expectation (2.11) for any starting state of the Markov chain, whenever the expectation exists, we need to require that the chain be able to reach any non-zero probability part of the state space, i.e., that the chain is irreducible.

Definition 5 (Irreducibility) A Markov chain is irreducible if there exists a probability distribution ϕ on X such that, for all A ⊂ X with ϕ(A) > 0,

Pr(X^(t) ∈ A for some t > 0 | X^(0) = x) > 0   for all x ∈ X.

We can now state the following theorem (taken from Tierney, 1996, p. 65, with appropriate notational changes):
Theorem 2.4.1 (Ergodic theorem) Suppose {X^(t)} is an irreducible Markov chain with transition kernel P and invariant distribution π, and let f be a real-valued function on X such that E_π|f| < ∞. Then

Pr(f̄_n → E_π f | X^(0) = x) = 1   for π-almost all x ∈ X,

where f̄_n is a sample path average as in (2.12).

MCMC differs from the analysis of a Markov chain as a stochastic process. In the latter, the transition function P is assumed known and interest is in finding the invariant distribution, if it exists; in MCMC, the invariant distribution is known, and we are interested in constructing a proper transition function P that achieves it. Under suitable regularity conditions on f, the ergodic theorem allows us to generate samples from a Markov chain and then use the sample path average (2.12) as a consistent estimate of E_π f. Stronger distributional results are possible under further conditions, including asymptotic results on the convergence rate (Tierney, 1996).

Definition 6 (Aperiodicity) An m-cycle for an irreducible chain with transition kernel P is a collection {E_0, · · · , E_{m−1}} of disjoint sets such that P(x, E_j) = 1 for j = (i + 1 mod m) and all x ∈ E_i. The period d of the chain is the largest m for which an m-cycle exists. The chain is aperiodic if d = 1.

Theorem 2.4.2 (Equilibrium distribution) Suppose {X^(t)} is an irreducible, aperiodic Markov chain with transition kernel P and invariant distribution π. Then π is an equilibrium distribution for the chain, in that

lim_{n→∞} P(X^(n) ∈ A | X^(0) = x) = lim_{n→∞} P^n(x, A) = π(A),

for π-almost all x. This theorem underpins the use of MCMC techniques for sampling from non-standard distributions: under the conditions given above, it guarantees that the chain will reach the target distribution π regardless of the starting point. A full discussion, including a complete theoretical exposition of the Metropolis-Hastings algorithm and the Gibbs sampler, can be found in Tierney (1996), with proofs given in Tierney (1994) and references therein.
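Theorem 2.4.2 can be seen directly in the discrete case, where the transition kernel is a matrix: for an irreducible, aperiodic chain, the rows of P^n all converge to the invariant distribution π. A minimal sketch with an arbitrary three-state chain:

import numpy as np

P = np.array([[0.5, 0.3, 0.2],      # an irreducible, aperiodic transition matrix
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

Pn = np.linalg.matrix_power(P, 50)  # P^n for large n: rows become identical
print(Pn)

# pi solves pi = pi P: the left eigenvector of P with eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
print(pi / pi.sum())                # matches the common row of P^n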
2.4.2
Metropolis-Hastings algorithms
The Metropolis-Hastings algorithm was first developed by Metropolis et al. (1953) and subsequently generalized by Hastings (1970). The algorithm uses a proposal function q(x, y) to suggest a possible move, and then employs an acceptance-rejection rule. The original Metropolis algorithm uses a symmetric proposal function, q(x, y) = q(y, x); Hastings (1970) later extended it to quite general proposal functions, the only serious restriction being that q(x, y) > 0 if and only if q(y, x) > 0.

Algorithm 2.4.1 (Metropolis-Hastings) Given current state x^(t):

• Draw y from the proposal distribution q(x^(t), ·).

• Draw U ∼ Unif(0, 1) independently of y, and update

x^(t+1) = y if U ≤ α(x^(t), y), and x^(t+1) = x^(t) otherwise,        (2.17)

where Metropolis et al. (1953) and Hastings (1970) give the acceptance probability as

α(x, y) = min{ 1, π(y) q(y, x) / (π(x) q(x, y)) }.

The above algorithm is very general. When the proposal distribution satisfies q(x, y) = q(y), i.e., the candidate state does not depend on the current state, the Metropolis-Hastings algorithm reduces to the independence chain Metropolis algorithm. When the proposal distribution is q(x, y) = q(x, x + ε) with ε ∼ g_σ(·), i.e., the candidate state equals the current state plus a perturbation, it is called the random walk Metropolis algorithm. The distribution g_σ(·) is often chosen to be spherically symmetric, with the parameter σ tuned by the user to maintain a certain acceptance rate (Gilks and Roberts, 1996). We now verify that the Metropolis-Hastings algorithm prescribes a transition rule with respect to which the target distribution π is invariant. For any x ≠ y, the probability that the chain actually moves from x to y is equal to the proposal probability q(x, y) multiplied
by the acceptance probability; that is,

A(x, y) = q(x, y) α(x, y) = q(x, y) min{ 1, π(y) q(y, x) / (π(x) q(x, y)) }.        (2.18)

Hence

π(x) A(x, y) = π(x) q(x, y) min{ 1, π(y) q(y, x) / (π(x) q(x, y)) } = min{ π(x) q(x, y), π(y) q(y, x) },

which is a symmetric function of x and y. Therefore, the detailed balance condition (2.16) is satisfied. As shown by Tierney (1994), the Markov chain constructed by the Metropolis-Hastings algorithm is irreducible and aperiodic almost surely, so the ergodic theorem applies. Note also that, because of the ratio involved in the acceptance probability α(x, y), we only need to know the target distribution π up to a normalizing constant. This is often a necessity in posterior simulation, where the normalizing constant is unknown and difficult to calculate.
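Since α(x, y) involves π only through a ratio, the following minimal sketch of the random walk Metropolis algorithm targets a density known only up to a constant; here the target is standard normal and the step size is an illustrative choice.

import numpy as np

def log_target(x):
    return -0.5 * x ** 2                         # log pi(x) up to an additive constant

def random_walk_metropolis(n_iter, x0=0.0, step=1.0, seed=4):
    rng = np.random.default_rng(seed)
    x, chain = x0, np.empty(n_iter)
    for t in range(n_iter):
        y = x + rng.normal(0.0, step)            # symmetric perturbation, so q cancels
        if np.log(rng.uniform()) <= log_target(y) - log_target(x):
            x = y                                # accept with probability alpha(x, y)
        chain[t] = x
    return chain

chain = random_walk_metropolis(20_000)
print(chain.mean(), chain.std())                 # approximately 0 and 1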
2.4.3
Gibbs sampling
The Gibbs sampler (Geman and Geman, 1984) may be the best known MCMC sampling algorithm in the Bayesian computation literature. Its most prominent feature is that the underlying Markov chain is constructed from a sequence of conditional distributions, which enables the sampler to follow the local dynamics of the target distribution and makes implementation much easier. Suppose we can decompose the random variable x ∈ X = R^n into k components, x = (x_1, · · · , x_k); let π(x) denote its joint distribution, and let π(x_i | x_{−i}) denote the induced full conditional distribution of each component x_i given the values of the other components x_{−i} = {x_j; j ≠ i}, for i = 1, · · · , k, 1 < k ≤ n.

Algorithm 2.4.2 (Gibbs) Let x^(t) = (x_1^(t), · · · , x_k^(t)) denote the state at iteration t. Then at iteration t + 1:

• For i = 1, · · · , k, draw x_i^(t+1) from the full conditional distribution

π(x_i | x_1^(t+1), · · · , x_{i−1}^(t+1), x_{i+1}^(t), · · · , x_k^(t)).
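As a concrete illustration of Algorithm 2.4.2, consider a bivariate normal target with correlation ρ, for which both full conditionals are available in closed form: x_1 | x_2 ∼ N(ρ x_2, 1 − ρ²), and symmetrically for x_2 | x_1. A minimal sketch:

import numpy as np

def gibbs_bivariate_normal(n_iter, rho=0.8, seed=5):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, sd)   # draw x1 from pi(x1 | x2)
        x2 = rng.normal(rho * x1, sd)   # draw x2 from pi(x2 | x1), using the new x1
        chain[t] = (x1, x2)
    return chain

chain = gibbs_bivariate_normal(20_000)
print(np.corrcoef(chain.T)[0, 1])       # approximately rho = 0.8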
Algorithm 2.4.2 as stated is called the systematic scan Gibbs sampler, because all the components of x are updated in a fixed order at each iteration. Another version, the random scan Gibbs sampler, randomly picks a component of x to update at each iteration; there is some evidence to suggest that the random scan Gibbs sampler can be more efficient (Amit and Grenander, 1991). The Gibbs sampler can be considered a special case of the component-wise Metropolis-Hastings algorithm in which the proposal distribution is the full conditional,

q(x, y) = q((x_1, · · · , x_i, · · · , x_k), (x_1, · · · , y_i, · · · , x_k)) = π(y_i | x_{−i}).

The acceptance probability α(x, y) then reduces to one, i.e., we always accept the candidate. Although the maintenance of the target distribution π by the Gibbs sampler is ensured by the general theory for the Metropolis-Hastings algorithm, there is a more direct and intuitive justification. To see this, suppose x^(t) ∼ π. Then x_{−i}^(t) follows its marginal distribution under π, and thus

π(x_i^(t+1) | x_{−i}^(t)) × π(x_{−i}^(t)) = π(x_{−i}^(t), x_i^(t+1)),

which means that after one conditional update the new configuration still follows the distribution π. For more information, see Besag (2000).

2.4.4
MCMC with trans-dimensional moves
In cluster analysis the number of clusters k is most often not known a priori, and therefore itself needs to be inferred from the data. A comprehensive review of the strategies employed by model-free clustering methods is given by Gordon (1999). Much previous work on finite mixture model estimation, Bayesian or otherwise, has separated the issue of estimating the number of components k from estimation with k fixed. A review of mainly frequentist approaches to assessing k in mixture models can be found in McLachlan and Peel (2000). Diebolt and Robert (1994) presented a comprehensive Bayesian treatment of finite mixtures using MCMC methods for the fixed-k case. Assessing the number of components k in mixture models is essentially a model selection problem. Early approaches to the general case where k is unknown typically treat the problem as a “Bayesian non-parametric” one, and base prior distributions on the Dirichlet process; see, for example, Escobar and West (1995). Such approaches are geared more towards density estimation, with k treated as a nuisance parameter to be averaged away (West, 1997). Other researchers, e.g., Mengersen and Robert (1996), Raftery (1996) and Roeder and Wasserman (1997), have proposed using, respectively, a Kullback-Leibler distance, a Laplace-Metropolis estimator and a Schwarz criterion to choose the number of components. A more direct line, as advocated by Richardson and Green (1997), is to model the unknown k directly as a parameter and make fully Bayesian inference; this line has been followed by a number of other researchers, including Phillips and Smith (1996) and Stephens (2000a). We now review two ingenious MCMC algorithms with trans-dimensional moves which make fully Bayesian inference for mixture models with unknown k possible: reversible jump MCMC and birth-death MCMC.
1. Reversible jump MCMC. Reversible jump MCMC (RJMCMC) was proposed by Green (1995), and later applied to finite mixture models by Richardson and Green (1997). In brief, reversible jump MCMC is a random sweep Metropolis-Hastings method adapted for general state spaces. The general formulation is as follows. Let x denote the state variable and π(x) the target distribution. Consider a countable family of move types, indexed by m = 1, 2, · · · . When the current state is x, a move of type m to destination y is proposed from some proposal distribution q_m(x, y), and is then accepted or rejected. Green (1995) showed that, if π(x) q_m(x, y) is dominated by a common symmetric measure and has Radon-Nikodym derivative with respect to this measure given by R(x, y), then detailed balance is preserved if we accept the proposed new state with probability

α_m(x, y) = min{ 1, π(y) q_m(y, x) / (π(x) q_m(x, y)) }.        (2.19)
For a move type that does not change the dimension of the parameter space, (2.19) reduces to the familiar Metropolis-Hastings probability. For dimension-changing moves, more care is needed.

Algorithm 2.4.3 (RJMCMC) Suppose the current state is x. A jump to a higher-dimensional space can be implemented as follows:

• Choose a move of type m with probability r_m(x).

• Generate a random vector u of length d from a density q(u), and set y = f(x, u) for an invertible deterministic function f(·), where d is the difference in dimension between x and y.

• Accept the new state y with probability

α(x, y) = min{ 1, [π(y) r_m(y)] / [π(x) r_m(x) q(u)] · |∂y / ∂(x, u)| }.        (2.20)

Note that the final term in the ratio above is a Jacobian arising from the “dimension-matching” transformation from (x, u) to y.
The reverse move (from y to x) is accomplished using the inverse transformation, so that the proposal is deterministic. In their application of RJMCMC to mixture models, Richardson and Green (1997) proposed four types of move that change dimension: 1) a birth step, in which a new component is born; 2) a death step, in which a component is randomly chosen to die; 3) a combine step, in which two adjacent components are merged; and 4) a split step, in which one component is split into two. It turns out that choosing an appropriate proposal distribution q(·) and dimension-matching transformation f(·) is a challenging task in finite mixture models. The model-changing moves generally involve moving between spaces differing by more than one dimension, so there are more degrees of freedom in the proposal distribution; furthermore, the transformations may result in very complicated Jacobians.
Although the reversible jump methodology has demonstrated great potential in a wide range of problems, its application still remains predominantly within the domain of the MCMC expert, owing both to difficulties in constructing appropriate algorithms and to a common perception that it is difficult to implement. Recently, Brooks et al. (2003), a paper with discussion, has provided a general framework both for constructing jumps (dimension-changing moves) and for automating the process of choosing proposals efficiently.
2. Birth-death MCMC. For our computation we resort to birth-death MCMC (BDMCMC), proposed by Stephens (2000a), because it is easy to implement and conceptually elegant in mixture settings. In this section we review BDMCMC methods in the mixture case. Consider the mixture model (2.5); let x = (k, w, θ) denote the parameters, L(x) = L_M(k, w, θ) the likelihood, and r(x) = r(k, w, θ) the prior distribution on the parameters. The likelihood (2.6) is invariant under permutations of the component labels, and if the prior on the component parameters (w, θ) given k does not depend on the labelling of the components, then the posterior distribution p(x | y) ∝ L(x) r(x) is similarly invariant. We can thus view each component θ_j of the mixture as a point in parameter space, with its weight w_j as its mark, and adapt theory from the simulation of marked point processes to help construct a Markov chain with the posterior distribution of the parameters p(x | y) as its stationary distribution.

Algorithm 2.4.4 (BDMCMC) Suppose the current state of the Markov chain is x, and that new components are born in continuous time at a rate b(x). The death rate for component j is

d_j(x) = [ L(x\(w_j, θ_j)) r(x\(w_j, θ_j)) / (L(x) r(x)) ] · b(x) q(x\(w_j, θ_j), x) · 1 / [ k (1 − w_j)^{k−1} ],        (2.21)

for j = 1, · · · , k, and the total death rate is d(x) = ∑_j d_j(x). Here ‘\’ denotes the operation of removing (w_j, θ_j) from x.

• Simulate the time to the next birth/death event from an exponential distribution with mean {b(x) + d(x)}^{−1}.

• Simulate a birth or a death with Pr(birth) = 1 − Pr(death) = b(x)/(b(x) + d(x)).

• For a birth, increment k by 1, generate (w, θ) ∼ q(x, x ∪ (w, θ)), where q(x, x′) is a proposal density that may or may not depend on the current state x, and re-scale the existing weights by multiplying them by (1 − w).

• For a death, decrease k by 1, select a component j to die with probability d_j(x)/d(x), and re-scale the remaining weights by dividing them by (1 − w_j).
The 1/k term in Equation (2.21) is the ratio of the factorials (k − 1)! and k! arising from the exchangeability assumption on the mixture components, and (1 − w_j)^{k−1} is a Jacobian arising from the renormalization of the weights. Stephens (2000a) proved that, under regularity conditions, the Markov chain with birth and death rates given by (2.21) satisfies the detailed balance condition, and therefore has invariant distribution p(x | y). Furthermore, if we assume exchangeable priors and use the rate of the Poisson prior on k as the birth rate, i.e., k ∼ Poisson(λ) with b(x) = λ, the death rate (2.21) reduces to a likelihood ratio, leading to a much simplified implementation. Note that RJMCMC constructs a discrete-time Markov chain with acceptance/rejection steps. In contrast, BDMCMC constructs a continuous-time Markov chain in which every proposed move is accepted; the acceptance probability of the usual MCMC methods is replaced by differential holding times. In particular, implausible configurations, i.e., configurations for which L(x) r(x) is small, die quickly. Recently, a connection between reversible jump MCMC and birth-death MCMC has been established by Cappé et al. (2003), who also extended the birth-death setting to include other types of continuous-time jumps, such as the split-combine moves of Richardson and Green (1997).
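To fix ideas, the following minimal sketch implements a single birth-death jump under the simplification just noted (exchangeable priors, k ∼ Poisson(λ), b(x) = λ, so that the death rate for component j is the likelihood ratio after removing that component). The components are univariate normals with known unit variance, and the birth proposals for means and weights are illustrative assumptions. A full sampler would also weight each visited state by its exponential holding time, with mean 1/(b(x) + d(x)), and alternate these jumps with fixed-k updates of w and θ.

import numpy as np
from scipy.stats import norm

def log_lik(y, w, mu):
    """Mixture log likelihood (2.6) with unit-variance normal components."""
    return np.log((w * norm.pdf(y[:, None], loc=mu)).sum(axis=1)).sum()

def birth_death_jump(y, w, mu, lam=3.0, rng=None):
    """One jump of the embedded birth-death chain; returns new (w, mu)."""
    if rng is None:
        rng = np.random.default_rng()
    k = len(w)
    if k > 1:
        # Death rate of component j: likelihood ratio with j removed,
        # the remaining weights renormalized by 1/(1 - w_j)
        d = np.array([np.exp(log_lik(y, np.delete(w, j) / (1.0 - w[j]),
                                     np.delete(mu, j)) - log_lik(y, w, mu))
                      for j in range(k)])
    else:
        d = np.zeros(1)                          # never remove the last component
    if rng.uniform() < lam / (lam + d.sum()):    # birth, with Pr = b / (b + d)
        w_new, mu_new = rng.beta(1.0, k), rng.normal(0.0, 5.0)
        return np.append(w * (1.0 - w_new), w_new), np.append(mu, mu_new)
    j = rng.choice(k, p=d / d.sum())             # death: component j dies
    return np.delete(w, j) / (1.0 - w[j]), np.delete(mu, j)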
2.5
Linear Population Biological Growth Example
We now illustrate the use of the Gibbs sampler by fitting a hierarchical model to a classic growth curve data set. Potthoff and Roy (1964) describe a study conducted at the University of North Carolina Dental School on two groups of children (11 girls and 16 boys). At ages 8, 10, 12, and 14, the distance (in millimeters) from the center of the pituitary gland to the pteryo-maxillary fissure was measured. The 27 × 4 observations are given in the appendix to this dissertation (Section A.1). This data set has been analyzed by many researchers using various statistical techniques: for frequentist approaches, see Davis (2002) and Kshirsagar and Smith (1995); Bayesian analyses can be found in Fearn (1975) and Wakefield et al. (1994). Here we follow most of the previous analyses and assume a linear relationship between the dental measurement and age. We further follow Wakefield et al. (1994) and assume homoscedastic normal errors, with different variances within each separate population of girls and boys. The observed growth curves are shown in Figure 2.1. It appears that, on average, the boys have a slightly higher growth rate in the dental measurement, and as a group have much higher measurements at age 11. The irregular set of measurements for boy 20 (23, 20.5, 31, 26) may contain recording errors, but these turn out to have little effect on the population inference, as the fitted straight line stays within its population. Here we give another interpretation of the linear growth curve assumption: in assuming a linear structure for the growth curves, we assume each curve is characterized by two features, an intercept and a slope. We emphasize here, and later in our examples, that special care should be given to finding important features and an appropriate functional form for the curves, as bad choices could give misleading results. Figure 2.2 shows the least squares estimates of the intercepts and slopes from the pooled data. It suggests that the boy labelled 24 could be a ‘slope outlier’, and the boy labelled 21 an ‘intercept outlier’. Wakefield et al. (1994) proposed a modelling strategy for outlier detection and robust inference based on heavy-tailed Student t-distributions. Since we only want to illustrate the ideas of hierarchical modeling and the Gibbs sampler, we will restrict ourselves
48
25 20
Length (mm)
30
Girls Boys
8
9
10
11
12
13
14
Age (years)
Figure 2.1: Dental measurement for 11 girls and 16 boys. Solid lines are girls, dashed lines are boys, and the two thick lines are point-wise averages for each population.
49
2.0
to normal distributions.
Figure 2.2: Least squares estimates of intercepts and slopes (the plotting symbol is the individual’s number). Girls are labeled from 1 to 11, and boys are labeled 12 to 27.
2.5.1 Normal-linear population model
Let x_ij and y_ij denote, respectively, the jth time point (using age 11 years as origin) and the associated measurement on the ith individual, with i = 1, · · · , 11 for the population of girls, i = 12, · · · , 27 for the population of boys, and j = 1, · · · , 4. Given that the number of sub-populations K = 2 is known, we also introduce the sub-population labels Z = {Z_i, i = 1, · · · , 27}, where Z_i = 1 if the ith individual is a girl and Z_i = 2 if the ith individual is a boy. We now specify a more general hierarchical normal-linear model, following Wakefield et al. (1994).
• Stage One (individual level): Suppose that the first-stage model has the form

\[
\prod_{i=1}^{N}\prod_{j=1}^{n_i} p(y_{ij} \mid \theta_i, X_i, \tau) = \prod_{i=1}^{N} \mathrm{N}_{n_i}\!\left(y_i \mid f(X_i, \theta_i),\, \tau_i^{-1} I_{n_i}\right),
\]

where the individual-specific mean structure has the linear functional form f(X_i, θ_i) = X_i θ_i, in other words E(y_ij | x_i, θ_i) = α_i + β_i x_ij for θ_i = (α_i, β_i), so that y_i = (y_i1, · · · , y_in_i), the ith of N observation vectors, is modelled as a linear structure with conditionally independent homoscedastic normal errors having variances τ_i^{-1}.

• Stage Two (population level): Assume the first-stage regression parameter vectors are random samples from one of the K sub-populations, with both the sub-population labels Z and the number of sub-populations K known, that is,

\[
\prod_{i=1}^{N} p(\theta_i \mid Z_i = k, \phi_k) = \prod_{i=1}^{N} \mathrm{N}_2(\theta_i \mid \mu_k, \Sigma_k), \qquad k = 1, \cdots, K.
\]

We assume the variances τ_i^{-1} to be common within sub-populations, but allow them to vary between sub-populations, so that τ_i^{-1} = τ_{0k}^{-1} if Z_i = k, for k = 1, · · · , K. A gamma prior is assumed on the precisions (equivalently, an inverse gamma prior on the variances): for k = 1, · · · , K,

\[
\tau_{0k} \sim_{\text{i.i.d.}} \mathrm{Ga}\!\left(\tfrac{1}{2}\nu_0,\, \tfrac{1}{2}\nu_0 \tau_0\right).
\]

• Stage Three (priors): Suppose that the prior specification is completed by assuming ν_0 and τ_0 to be known and

\[
p(\phi_k) = p(\mu_k)\, p(\Sigma_k) = \mathrm{N}_2(\mu_k \mid \eta, C)\, \mathrm{Wishart}_2\!\left(\Sigma_k^{-1} \mid \rho, (\rho R)^{-1}\right),
\]

independently for k = 1, · · · , K, where η, C, ρ and R are known, and Wishart denotes the Wishart distribution. Further hyper-priors can be specified if necessary; see for example Richardson and Green (1997).

2.5.2 Gibbs sampler
Let z = (z_1, · · · , z_N) denote the known population labels. Define y = (y_1, · · · , y_N), θ = (θ_1, · · · , θ_N), μ = (μ_1, · · · , μ_K), Σ = (Σ_1, · · · , Σ_K), \bar{\theta} = N^{-1}\sum_{i=1}^{N}\theta_i, D_i^{-1} = \tau X_i^T X_i + \Sigma_{z_i}^{-1}, and V_k^{-1} = N\Sigma_k^{-1} + C^{-1}. Treating ψ = (θ, μ, Σ, τ) as the unknown parameters, our inference can be based on samples generated from the posterior distribution p(ψ | y, z). Here and later we use '| · · ·' to denote conditioning on all other variables. It is easy to see that a Gibbs sampler is defined by the following full conditional distributions:

\[
\theta_i \mid \cdots \sim \mathrm{N}_2\!\left(D_i(\tau X_i^T y_i + \Sigma_{z_i}^{-1}\mu_{z_i}),\; D_i\right),
\]
\[
\mu_k \mid \cdots \sim \mathrm{N}_2\!\left(V_k(N\Sigma_k^{-1}\bar{\theta} + C^{-1}\eta),\; V_k\right),
\]
\[
\Sigma_k^{-1} \mid \cdots \sim \mathrm{Wishart}_2\!\left(N + \rho,\; \Big\{\sum_{i=1}^{N}(\theta_i - \mu_k)(\theta_i - \mu_k)^T + \rho R\Big\}^{-1}\right),
\]
\[
\tau \mid \cdots \sim \mathrm{Ga}\!\left(\tfrac{1}{2}\Big(\nu_0 + \sum_i n_i\Big),\; \tfrac{1}{2}\Big\{\sum_{i=1}^{N}(y_i - X_i\theta_i)^T(y_i - X_i\theta_i) + \nu_0\tau_0\Big\}\right),
\]

for i = 1, · · · , N and k = 1, · · · , K. Generation of random variates is straightforwardly achieved for the normal and gamma distributions; generation for the Wishart distribution is achieved using the algorithm given in Odell and Feiveson (1966).
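As a concrete illustration, here is a minimal sketch of this Gibbs sampler for a single sub-population (recall that, given the labels z, the analysis separates into independent per-population analyses). The function name, the least-squares initialization, and the use of SciPy's Wishart sampler in place of the Odell and Feiveson (1966) algorithm are our choices, not part of the original analysis.

    import numpy as np
    from scipy.stats import wishart

    def gibbs_one_group(y, X, eta, Cinv, rho, R, nu0, tau0, n_iter=2000, rng=None):
        """Gibbs sampler for one sub-population of the normal-linear model of
        Section 2.5.1. y: list of response vectors; X: list of design matrices."""
        rng = np.random.default_rng(rng)
        N, p = len(y), X[0].shape[1]
        theta = np.array([np.linalg.lstsq(Xi, yi, rcond=None)[0] for Xi, yi in zip(X, y)])
        mu, Siginv, tau = theta.mean(axis=0), np.eye(p), 1.0
        draws = []
        for _ in range(n_iter):
            # theta_i | ... ~ N(D_i (tau X_i' y_i + Siginv mu), D_i)
            for i in range(N):
                D = np.linalg.inv(tau * X[i].T @ X[i] + Siginv)
                m = D @ (tau * X[i].T @ y[i] + Siginv @ mu)
                theta[i] = rng.multivariate_normal(m, D)
            # mu | ... ~ N(V (N Siginv theta_bar + Cinv eta), V)
            V = np.linalg.inv(N * Siginv + Cinv)
            mu = rng.multivariate_normal(V @ (N * Siginv @ theta.mean(axis=0) + Cinv @ eta), V)
            # Siginv | ... ~ Wishart(N + rho, {sum (theta_i - mu)(theta_i - mu)' + rho R}^{-1})
            S = (theta - mu).T @ (theta - mu) + rho * R
            Siginv = wishart.rvs(df=N + rho, scale=np.linalg.inv(S), random_state=rng)
            # tau | ... ~ Ga((nu0 + sum n_i)/2, (sum ||y_i - X_i theta_i||^2 + nu0 tau0)/2)
            sse = sum(np.sum((y[i] - X[i] @ theta[i]) ** 2) for i in range(N))
            ntot = sum(len(yi) for yi in y)
            tau = rng.gamma(0.5 * (nu0 + ntot), 2.0 / (sse + nu0 * tau0))
            draws.append((theta.copy(), mu, Siginv, tau))
        return draws

For the dental data one would call this once for the girls and once for the boys, with the hyper-parameter values given below.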
2.5.3 Results
Clearly prior specification plays a crucial role in Bayesian modelling. Here we choose non-informative but proper priors on τ and μ_k; for the Wishart distribution, we choose the parameters ρ and R a priori: ρ = p, the dimension of θ_i, is the most non-informative choice in the sense that the distribution is flattest while retaining a proper posterior; the matrix R is chosen to be an approximate prior estimate of Σ_k; and η is chosen to be an approximate prior estimate of μ_k. Following Richardson and Green (1997), we choose data-dependent but weakly informative priors, such as medians of the least squares estimates. More specifically, the values chosen for the parameters in the dental growth curve example are

\[
\nu_0 = 0, \qquad C^{-1} = 0, \qquad \rho = 2, \qquad R = \begin{pmatrix} 1 & 0 \\ 0 & 0.1 \end{pmatrix}, \qquad \eta = \begin{pmatrix} 23.75 \\ 0.68 \end{pmatrix}.
\]

For given population labels z, our analysis is equivalent to analyzing boys and girls separately, as done by Wakefield et al. (1994). The median, 5% and 95% posterior sample percentiles of μ = (α, β) and τ (the population intercept, slope and measurement precision), along with the sample median of the population covariance matrix Σ, are summarized in Table 2.1 for boys and girls. To see the influence of outliers, in addition to the analysis of the full data for the boys (N0), we also include results with boy 21 removed (N1), and with both boys 21 and 24 removed (N2). The parameter inferences for the boy population shift in the expected directions across the various normal models; for example, the intercept in N0 is higher than that in N1 and N2. The population variances of both the intercept and the slope are higher in the N0 model than in the other normal models, showing that the extreme boys' parameters are being accommodated.
Table 2.1: Posterior Summaries of Separate Analysis for Boys and Girls

Model      µ1                     µ2                  τ                   Median of Σ
Boys
  N0       24.96 (24.17, 25.74)   0.79 (0.58, 0.98)   0.39 (0.26, 0.56)   (2.48, 0.00; 0.00, 0.08)
  N1       24.66 (24.02, 25.28)   0.79 (0.58, 1.00)   0.37 (0.25, 0.54)   (1.31, 0.00; 0.00, 0.08)
  N2       24.69 (24.01, 25.37)   0.70 (0.51, 0.89)   0.41 (0.28, 0.61)   (1.51, 0.00; 0.00, 0.06)
Girls
  N0       22.66 (21.54, 23.75)   0.48 (0.33, 0.62)   2.24 (1.34, 3.50)   (3.99, 0.00; 0.00, 0.05)

† The numbers in parentheses denote a 90% sample interval.
The Gibbs sampler was started from a point drawn at random from the prior, and was run for 200,000 iterations, with the first 100,000 iterations discarded as burn-in. After thinning, i.e., keeping the samples from every 10th iteration, the resulting 10,000 samples were used to calculate the results listed in Table 2.1. Estimates of the individual random effects θ_i are shown in Figure 2.3. Figure 2.4 shows the trace plots of the population means (with further thinning). It appears the chains have converged and mixed well, and the results are close to those reported in Wakefield et al. (1994). We also computed the Geweke convergence diagnostic for the long Markov chain; all Z-scores were non-significant, suggesting convergence of the chain. For more discussion on the assessment of Markov chain convergence, see Gelman (1996).
Figure 2.3: Posterior means of the individual random effects θ_i, their least squares counterparts, population means and projections of the ellipsoids defined by the covariance matrices for the two populations. The ellipses correspond to ±2 standard errors in the univariate case.
Figure 2.4: Trace plots of population means. The chain was run for 200,000 iterations, with the first 100,000 iterations discarded as burn-in and thinning at every 100th iteration. It appears the chains have converged and mixed well.

In the above example we have illustrated the idea of using hierarchical models to model truly hierarchical data, or to structure dependence into the parameters, along with various MCMC techniques for Bayesian inference. To motivate our methodology for curve clustering, let us now assume we have lost the population labels, i.e., we do not know which population each curve belongs to; the data then appear as in Figure 2.5. We now face some interesting questions: If we know there are two populations, can we recover the grouping? What if we do not know the number of populations? What if the curves are gene expression time series? These questions lead to our development of methods in subsequent chapters.
Figure 2.5: Dental measurement for 11 girls and 16 boys, assuming population information has been lost. It is difficult for the human eye to detect subtle features beyond the linear trend in the curves.
Chapter 3 FILTERING BASED ON BAYES FACTORS AND FALSE DISCOVERY RATE
3.1 Motivation
Genome sizes vary from organism to organism, ranging from a few hundred genes (e.g., viruses) to tens of thousands (e.g., humans). Recent advances in microarray technology and other high-throughput methods for analyzing complex nucleic acid sequences have made it possible to rapidly and efficiently measure the levels of virtually all of the genes expressed in an organism. On one hand, such capacity enables genomic-scale study of gene interactions; on the other hand, the large amount of data generated by microarray experiments makes it a daunting task to extract information. In recent years, research has been conducted on developing analytic techniques for the various stages of microarray experiments. A brief review of these developments was given in the first chapter.

The high dimensionality of gene expression data has been a prime obstacle for gene expression analysis. It draws attention away from the statistical merits of a method and forces us to focus instead on the efficiency of the computation. Yet even with complicated methods, and despite the substantial computing power and time spent, we still have little control over false positive and false negative discoveries. There are two primary ways in which the dimension of microarray data may be reduced. One is to reduce the dimension of the data for each gene. This is often done through data transformation, such as principal component analysis (Yeung and Ruzzo, 2001), also called singular value decomposition (Alter et al., 2000), or through summary statistics (Efron et al., 2001). But these approaches have been criticized for loss of information, lack of interpretability, and difficulty in assessing the uncertainty of the results. In this chapter we propose a dimension reduction strategy in a different direction.
We seek to restrict our attention to genes that are likely to participate in the biological process under study, by excluding genes whose expression does not appear to be affected by the process, or genes that contain too little signal relative to noise. This approach is a companion to our model-based clustering of genes based on gene expression time series. For such data we have a large number of curves, and this number can be even larger if replicates are taken. Because the models we propose are computationally expensive to implement, an initial screen is helpful. The major goal of our model-based clustering is to characterize and quantify how gene expression levels vary across ordered conditions or time; therefore it is important to maintain the resolution of observations for each gene. On the other hand, it is also important to guard against the potential problem of missing too many interesting genes; we take this into account by controlling the error rates of false discoveries and false non-discoveries.

For any biological process, though many genes regulate or participate in it, the whole genome is not involved, and the amount of participation varies from biological process to biological process, from sample to sample, and from organism to organism. For lower eukaryotes, the genomes tend to be simple and less diversified, so a larger portion of the genome may participate, although the number of genes tends to be smaller too. For higher eukaryotes, a larger number of genes could be involved, but the proportion is usually small. For example, mitotic cell cycle regulation is essential for the survival and propagation of eukaryotic cells. This process has been extensively studied in the model organism budding yeast Saccharomyces cerevisiae. Availability of complete sequence information for the Saccharomyces cerevisiae genome has made it possible to measure mRNA transcript levels for virtually every yeast gene (DeRisi et al., 1997; Wodicka et al., 1997). Yet among the roughly 6,000 genes, Spellman et al. (1998) identified 800 genes as cell cycle dependent using Fourier transformation and correlation coefficients; Cho et al. (1998) identified 416 genes as cell cycle dependent using visual examination; and Zhao et al. (2001) identified 473 genes using a regression-type single-pulse model (SPM). Therefore no more than 7% of yeast genes are predicted to be cell cycle regulated, and many of these are already well known to investigators. What is required are clues to additional genes that are probably differentially expressed and yet so far uncharacterized, and there is generally a willingness to permit several of these to be false positives in order to avoid missing too many of the true positives (Lönnstedt and Speed, 2002).

Some sort of filtering of the gene expression data will simplify the analysis and save computing time; it will also improve the accuracy and reliability of the results. Our filtering approach is based on the assumption that genes with little variation in their expression levels across multiple conditions or time points are either not regulated by the biological process under inspection or contain too much noise for us to identify the signals. This filtering is similar to detection of differential expression, but with less complicated modelling. In early approaches, the filtering was often carried out using simple fold change. For example, Eisen et al. (1998) only included genes whose expression levels deviated from measurements at time zero by at least a factor of three in at least two time points during the serum stimulation of primary human fibroblasts. Tamayo et al. (1999) excluded yeast genes that did not show a relative change above two units and an absolute change above 35 units; values of 3 and 100 were used for human genes, with no justification for these values being given. It is now well recognized that such methods are unreliable because variability is not formally taken into account. Since then, many more sophisticated statistical methods have been proposed; they can be roughly divided into four groups. The first group is based on traditional two-sample t-tests or t-test-like hypothesis tests: test statistics are constructed in the form of average differences over gene-specific variation measures, and significance levels are then calculated based on distributional assumptions or on permutation; see for example Tusher et al. (2001) and Dudoit et al. (2002). The second group fits regression models to each individual gene, and determines differential expression by testing the significance of the corresponding regression coefficient, e.g., Zhao et al. (2001). The third group assumes a mixture model for each gene, then calculates the probability of differential expression via empirical Bayes methods; see for example Efron et al. (2001) and Lönnstedt and Speed (2002). The fourth and final group is based on fully Bayesian models; see Newton et al. (2001) and Ibrahim et al. (2002) for examples of this approach. A comparative review of non-Bayesian statistical methods for differential expression analysis is given by Pan (2002).

Next we describe a simple and quick filtering approach using Bayes factors. The Bayesian framework allows us to incorporate external information, and provides a way of ranking genes with a probabilistic interpretation.

3.2 Gene Filtering Based on Bayes Factors
3.2.1 The hypothesis
In cDNA microarray experiments, the expression level is actually a summary measure of the fluorescence intensity of the red channel relative to the green channel, most often the base 2 logarithm of the intensity ratio, denoted log2(R/G). In what follows, unless otherwise stated, expression levels are assumed to be in the form of log2(R/G). Let y_ij denote the expression level for gene i measured at time¹ t_j, j = 1, · · · , T, and y_i the T × 1 expression vector for gene i. We assume that, independently for all i = 1, · · · , N and j = 1, · · · , T,

\[
y_{ij} \mid \mu_{ij}, \sigma_e^2 \sim \mathrm{N}(y_{ij} \mid \mu_{ij}, \sigma_e^2), \tag{3.1}
\]

where σ_e² is the variance of the measurement error, which is assumed to be common to all observations. Our filtering is based on comparing the following two models for each gene,

\[
M_0: \mu_{i1} = \mu_{i2} = \cdots = \mu_{iT} = \mu \quad \text{versus} \quad M_1: \text{not } M_0, \tag{3.2}
\]

via the Bayes factor p(y_i | M_1)/p(y_i | M_0), where

\[
I_1 = p(y_i \mid M_1) = \prod_{j=1}^{T} \int\!\!\int p(y_{ij} \mid \mu_{ij}, \sigma_e^2)\, \pi(\mu_{ij}, \sigma_e^2)\, d\mu_{ij}\, d\sigma_e^2, \tag{3.3}
\]

and

\[
I_0 = p(y_i \mid M_0) = \int\!\!\int \Big\{\prod_{j=1}^{T} p(y_{ij} \mid \mu_{ij} = \mu, \sigma_e^2)\Big\}\, \pi(\mu, \sigma_e^2)\, d\mu\, d\sigma_e^2. \tag{3.4}
\]

This procedure tries to discriminate differentially expressed genes from those with unaltered expression levels, while allowing a small amount of perturbation. Note that there is a difference between μ = 0 and μ = c, a constant; μ = 0 means there is no difference in expected expression levels between the experimental sample and the reference sample for gene i across time, while μ = c means there is a difference between the two samples, but one that does not depend on the experimental conditions under inspection. In most cases we take μ = 0, because the data are centered by normalization prior to formal data analysis.

Independence between genes and independence across time points within a gene are two strong assumptions, and both are very likely to be violated: co-regulated genes tend to have correlated expression profiles, and expression levels within a gene also tend to be correlated. These assumptions are made for computational convenience; since we only intend to develop an efficient filtering procedure, we avoid more sophisticated models at this stage. Both assumptions can be relaxed to build in more structure if necessary, a point we return to in the discussion section.

¹ Here time is referred to in the broad sense mentioned in Chapter 1; for example, it can be dosage, temperature or another indexing covariate.
3.2.2 Bayes factor
A common Bayesian approach for choosing between two models is to compute the Bayes factor. Consider a more general setting, and suppose the observed data y are generated from a probability model M_k belonging to a finite set of competing models. Suppose θ_k is the parameter vector, with dimension d_k, under model M_k, and that the likelihood function is p(y | θ_k, M_k). Assume further that π(θ_k | M_k) is the prior distribution for θ_k under model M_k. Then

\[
I_k = p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, \pi(\theta_k \mid M_k)\, d\theta_k \tag{3.5}
\]

is the marginal density of the data under the model.

Definition 7 (Bayes Factor) The Bayes factor for comparing two models M_1 and M_0 is defined as

\[
B_{10} = \frac{p(y \mid M_1)}{p(y \mid M_0)} = \frac{I_1}{I_0}, \tag{3.6}
\]

the ratio of the marginal densities of the data y under the two models.

Using Bayes's theorem, the Bayes factor can also be expressed as the ratio of the posterior odds to the prior odds:

\[
B_{10} = \frac{p(y \mid M_1)}{p(y \mid M_0)} = \frac{p(M_1 \mid y)}{p(M_0 \mid y)} \Big/ \frac{p(M_1)}{p(M_0)}. \tag{3.7}
\]

Intuitively, the Bayes factor provides a measure of whether the data y have increased or decreased the odds on M_1 relative to M_0. Thus, B_10 > 1 signifies that M_1 has become relatively more plausible in the light of the data y, while B_10 < 1 signifies that the plausibility of M_0 has increased. Kass and Raftery (1995) provide a comprehensive review of Bayes factors, including their interpretation, computation, approximation, robustness to the model-specific prior distributions, and examples in a variety of scientific applications. For more information on Bayes factors, see Bernardo and Smith (1994) and references therein.
3.2.3 Choice of priors
In order to compute a Bayes factor, the prior distributions π(θ_k | M_k), k = 0, 1, must be specified. Although these priors provide a way to include external information or subjective belief about the values of the parameters, the prior densities are hard to set when no such information exists. In that case, the easiest choice is to use "non-informative" priors such as uniform distributions or Jeffreys priors (Gelman et al., 1995). But in biological research, additional information about the parameters is often available from repeated experiments at other laboratories, from previous studies on the same model organisms, or from the subjective knowledge of biologists. It is clearly beneficial if we can construct prior distributions using the available information. Care must be taken, however, even when such data exist, as a judgment must be made regarding which information to use and how to represent it as a probability distribution. For a few general guidelines, see Kass and Raftery (1995) and references therein.

In our attempt to filter genes based on microarray data, we construct simple priors in problem-specific forms. Conjugate priors are a convenient choice, since in many cases they make it easier to formulate the distributions and carry out the computation. Examining Equation (3.7), we can see that Bayes factors generally require proper priors, as improper priors lead to great difficulties (Berger and Pericchi, 1996). Conjugate priors have been used by Baldi and Long (2001) and Lönnstedt and Speed (2002) for microarray data analysis. As illustrated by the following examples, we formulate the priors in a problem-specific manner, with parameter values decided by separate analyses of related data. Having assumed the distribution (3.1) for the data, we further specify the joint prior distributions on (μ_ij, σ_e²) as follows,

\[
\pi(\mu_{ij}, \sigma_e^2 \mid M_1) = \mathrm{N}(\mu_{ij} \mid m_1, v_1)\, \mathrm{Ga}(\sigma_e^{-2} \mid a_1, b_1), \tag{3.8}
\]

and

\[
\pi(\mu_{ij} = \mu, \sigma_e^2 \mid M_0) = \mathrm{N}(\mu \mid m_0, v_0)\, \mathrm{Ga}(\sigma_e^{-2} \mid a_0, b_0), \tag{3.9}
\]

independently for j = 1, · · · , T and i = 1, · · · , N. Here Ga(a, b) denotes the gamma distribution with mean a/b and variance a/b². A conjugate analysis would require the prior distribution for the mean parameter to depend on the variance parameter in the normal models. Such an implicitly assumed mean-variance relationship is artificial, and in the absence of any known dependence between the variance and the mean (recall the data are on the log scale), independence seems reasonable, so we assume independence between the mean parameter μ_ij and the variance parameter σ_e².

Unlike point estimation, the Bayes factor tends to be sensitive to the choice of priors on the model parameters. Hence it is always advisable to perform some kind of sensitivity analysis. Ideally, the filtering procedure should be robust to the choice of priors on the parameters; in other words, we would like the same genes to be filtered out regardless of the priors.
3.2.4 Computation and importance sampling
Computing or approximating Bayes factors can be challenging, as it often involves high-dimensional integration. In some conjugate cases, the integrals may be evaluated analytically. More often, they are intractable and must be computed via numerical methods. Several techniques, including numerical approximation, importance sampling and MCMC, are available to compute Bayes factors; see Kass and Raftery (1995) and Han and Carlin (2001) for reviews.

The importance sampling idea suggests that one should focus on the region(s) of "importance" to save computational resources. This idea of biasing toward the "important" regions of the sample space is essential for Monte Carlo computation with high-dimensional models such as those in statistical physics, molecular simulation, and Bayesian statistics. The basic importance sampling procedure is as follows. Suppose we are interested in evaluating

\[
\mu = \mathrm{E}_\pi\{h(x)\} = \int h(x)\, \pi(x)\, dx,
\]

where π(x) may be difficult to sample from. Importance sampling suggests first generating independent samples x^{(1)}, · · · , x^{(S)} from an easy-to-sample distribution g(·), and then approximating μ by

\[
\hat{\mu} = \frac{1}{w}\sum_{i=1}^{S} w^{(i)}\, h(x^{(i)}),
\]

where w^{(i)} = π(x^{(i)})/g(x^{(i)}), i = 1, · · · , S, are the importance weights and w = \sum_{i=1}^{S} w^{(i)}. By a judicious choice of g(·), one can reduce the variance of the estimate substantially; a good candidate for g(·) is one that is close in shape to h(x)π(x). For a more thorough review and advanced topics, see the textbook by Liu (2001).

With the idea outlined above, we can sample from the prior and estimate (3.5) by

\[
\hat{I}_k = \frac{1}{S}\sum_{s=1}^{S} p(y \mid \theta_k^{(s)}, M_k), \tag{3.10}
\]

where θ_k^{(s)} ∼_{i.i.d.} π(θ_k | M_k), s = 1, · · · , S, provided π(θ_k | M_k) is easy to sample from. More specifically, we can estimate (3.3) and (3.4) by

\[
\hat{I}_1 = \frac{1}{S}\sum_{s=1}^{S}\prod_{j=1}^{T} p(y_{ij} \mid \mu_{ij}^{(s)}, \sigma_e^{2(s)}), \tag{3.11}
\]

and

\[
\hat{I}_0 = \frac{1}{S}\sum_{s=1}^{S}\prod_{j=1}^{T} p(y_{ij} \mid \mu^{(s)}, \sigma_e^{2(s)}), \tag{3.12}
\]

where μ_ij^{(s)}, μ^{(s)} and σ_e^{2(s)} are independent samples generated from the prior distributions (3.8) and (3.9). Finally, the Bayes factor is estimated as \hat{B}_{10} = \hat{I}_1/\hat{I}_0.
We resort to the importance sampling method for approximating Bayes factors in this filtering procedure because it does not require conjugate priors and is convenient to implement. In microarray experiments there are typically only a small number of observations for each gene, the so-called "large p, small n" data structure (West et al., 2000). So when we calculate the Bayes factor for each gene, the size of the dataset is small, in which case importance sampling can work well because the likelihood is not too peaked. Informative prior distributions can also be easily incorporated if available.
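The following is a minimal sketch of the estimators (3.11)-(3.12), assuming for simplicity the same gamma hyper-parameters (a, b) for σ_e^{-2} under both models. The function names and the log-scale averaging (to avoid numerical underflow when T is large or S is large) are our choices.

    import numpy as np

    def log_marginal(y, mu_draws, sd_draws):
        """Log of the Monte Carlo estimate (3.10): average the likelihood over
        prior draws, computed on the log scale for numerical stability."""
        ll = (-0.5 * np.log(2 * np.pi) - np.log(sd_draws)
              - 0.5 * ((y - mu_draws) / sd_draws) ** 2).sum(axis=1)
        m = ll.max()
        return m + np.log(np.mean(np.exp(ll - m)))

    def log_bayes_factor(y, m1, v1, m0, v0, a, b, S=50_000, rng=None):
        """Estimate log B10 = log I1 - log I0 for one gene via (3.11)-(3.12),
        sampling (mu, sigma_e^2) from the priors (3.8)-(3.9)."""
        rng = np.random.default_rng(rng)
        T = len(y)
        sd = 1.0 / np.sqrt(rng.gamma(a, 1.0 / b, size=(S, 1)))  # sigma_e draws
        mu1 = rng.normal(m1, np.sqrt(v1), size=(S, T))          # M1: one mean per time point
        mu0 = rng.normal(m0, np.sqrt(v0), size=(S, 1))          # M0: one common mean
        return log_marginal(y, mu1, sd) - log_marginal(y, mu0, sd)

With equal prior odds, the posterior probability of (3.13) below is then obtained as B10/(1 + B10), e.g., b10 = np.exp(log_bayes_factor(y, 0.0, 4.0, 0.0, 4.0, 0.93, 0.09)), where the hyper-parameter values in this example call are illustrative only.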
3.2.5 Ranking based on posterior probabilities
The goal of our filtering procedure is to retain genes showing certain patterns, in the presence of large variation across time.

Algorithm 3.2.1 (Gene Filtering) To achieve this goal, we propose a simple algorithm as follows:

• Rank the genes by the marginal posterior probability Pr(M1 | y).

• Exclude genes with Pr(M1 | y) ≤ c, where c is a cutoff point chosen to ensure overall control of misclassification errors.

• Proceed with more complex statistical models on the highly ranked genes.

The posterior probabilities p(M1 | y) and p(M0 | y) provide an intuitive measure of the support for models M1 and M0 given the observed data y. Let π10 = p(M1)/p(M0) denote the prior odds of M1 versus M0. Then from the definition of the Bayes factor (3.7), we have

\[
\Pr(M_1 \mid y) = \frac{B_{10}\,\pi_{10}}{1 + B_{10}\,\pi_{10}}. \tag{3.13}
\]

If the two models M1 and M0 are equally probable a priori, so that p(M1) = p(M0) = 1/2, then Pr(M1 | y) = B10/(1 + B10), which is easily obtained once we have the Bayes factor. In reality, if we have information on the process under study, then each gene will have a different prior probability of differential expression. But since we are more concerned with the relative ranking of the genes in a screening step, we will assume the two models are equally likely a priori. A similar idea of ranking genes has been explored by Lönnstedt and Speed (2002). In contrast to their empirical Bayes approach, we carry out a fully Bayesian analysis and rank the genes based on their posterior probabilities. Such a ranking is more intuitive and has a clearer probabilistic interpretation. Even more plausible results could be obtained through more refined gene-specific prior probabilities for the two models. The parametric model also allows us some control over the potential false positives and false negatives through the choice of the cutoff point in the filtering procedure. We elaborate on this point in the next section.
3.2.6 Thresholding
Essentially, our filtering procedure carries out a hypothesis test for each single gene, independently of the other genes. Two types of errors can be committed at this point. The first is a Type I error (false positive), which occurs when a gene is classified by the procedure as differentially expressed when it truly is not. The second is a Type II error (false negative), which occurs when a gene is classified as not differentially expressed although it truly is. A common frequentist practice is to maximize the power (1 − Type II error rate) while controlling the Type I error rate.

With the large number of genes involved in a microarray experiment, filtering becomes a large multiple hypothesis testing problem, and defining and controlling the error rates becomes much more critical and more complicated. Traditional procedures that control the per-comparison error rate (PCER) or the family-wise error rate (FWER), such as Bonferroni-type methods, have been applied to multiple hypothesis testing problems, but they tend to be too conservative and thus do not fit the exploratory nature of gene expression data analysis (Storey and Tibshirani, 2003). Recently, a new Type I error rate for multiple testing, the false discovery rate (FDR) introduced by Benjamini and Hochberg (1995), has attracted much attention from statisticians working with microarray data. The frequentist FDR of Benjamini and Hochberg (1995) is the expected proportion of Type I errors among the rejected hypotheses. In terms of detection of differentially expressed genes, it is the expected proportion of truly unaltered genes among the genes classified as differentially expressed. A number of recent articles have addressed the question of multiple testing in DNA microarray experiments. A detailed comparison and discussion of different frequentist approaches to multiple testing in the context of microarray experiments can be found in Dudoit et al. (2003). Tusher et al. (2001) and Storey and Tibshirani (2003) discuss thresholding procedures and software for controlling the false discovery rate in microarray differential analysis. Extensions of FDR beyond frequentist approaches have been made by Efron et al. (2001) and Storey (2002), amongst others.

Our filtering procedure can be viewed as a special case of controlling the Bayesian false discovery rate (bFDR) introduced in Genovese and Wasserman (2003). The Benjamini and Hochberg FDR can be generalized in the following way: for each comparison some univariate summary statistic v_i is available, which could be a p-value or any other univariate statistic related to the comparison of interest. All comparisons with v_i beyond a certain threshold t are considered discoveries, and control of FDR is achieved by calibrating the threshold t. This thresholding point of view has also been discussed in Storey (2002) and Genovese and Wasserman (2002). The Bayesian false discovery rate and false negative rate have been formulated nicely in a decision-theoretic framework by Müller et al. (2003). Procedures to control the frequentist FDR based on p-values have been implemented in microarray analysis by some authors, but the concept of Bayesian FDR has only been developed recently. Genovese and Wasserman (2003) concentrate on the frequentist properties of Bayesian FDR and on the agreement between Bayesian and frequentist multiple testing procedures; they make no reference to an underlying probability model for the data. Müller et al. (2003) developed the decision-theoretic framework for bFDR and applied it to sample size determination. To our knowledge, there has not yet been an application of a bFDR-controlling procedure to gene filtering. In this dissertation, we apply bFDR to filtering problems. We propose a parametric model, with mixture components corresponding to altered and unaltered genes, and then use the marginal posterior probabilities as summary statistics to evaluate the FDR. We also extend bFDR to allow dependence among genes.

Consider our filtering as a two-condition comparison problem. Let z_i ∈ {0, 1} indicate the true state of gene i, where z_i = 0 indicates gene i is truly unaltered over time and z_i = 1 indicates gene i is truly altered. Let d_i ∈ {0, 1} denote the classified state of gene i: d_i = 0 (1) indicates the decision that gene i is unaltered (altered). Let v_i = Pr(z_i = 1 | y_i) = p(M1 | y_i) denote the marginal posterior probability (3.13) for the ith comparison. For the filtering, we declare a discovery (reject M0) if the marginal posterior probability v_i is beyond a certain cutoff point t,

\[
d_i = d_i(t) = I(v_i > t), \tag{3.14}
\]
where I(·) is the indicator function.

Definition 8 (Realized Bayesian FDR and FNR) We define the realized false discovery rate and the realized false negative rate respectively as

\[
FDR(d, y, z) = \frac{\sum_i d_i (1 - z_i)}{D + \prod_i I(d_i = 0)}, \qquad FNR(d, y, z) = \frac{\sum_i (1 - d_i)\, z_i}{n - D + \prod_i I(d_i = 1)}, \tag{3.15}
\]

where D = \sum_{i=1}^{n} d_i is the total number of discoveries and n is the total number of hypotheses.
The last terms in the denominators are introduced to ensure FDR and FNR are well defined even when there are no discoveries or when every gene is a discovery. Note that the true states z_i are unobserved, but with some distributional assumptions on the two competing models we can evaluate the average false discovery rate and false negative rate over the model space. In other words, we can evaluate the expected false discovery rate and expected false negative rate.

Definition 9 (Expected Bayesian FDR and FNR) Conditioning on the data y and marginalizing with respect to z, we obtain the posterior expected FDR and FNR,

\[
bFDR(d, y) = \int FDR(d, y, z)\, dp(z \mid y) = \frac{\sum_i d_i (1 - v_i)}{D + \prod_i I(d_i = 0)}, \tag{3.16}
\]

and

\[
bFNR(d, y) = \int FNR(d, y, z)\, dp(z \mid y) = \frac{\sum_i (1 - d_i)\, v_i}{n - D + \prod_i I(d_i = 1)}. \tag{3.17}
\]

Unlike the frequentist FDR discussed in Benjamini and Hochberg (1995), the above expected Bayesian FDR and FNR have natural probabilistic interpretations. The numerators bFD = \sum_i d_i (1 - v_i) = \sum_i d_i \Pr(z_i = 0 \mid y_i) and bFN = \sum_i (1 - d_i)\, v_i = \sum_i (1 - d_i) \Pr(z_i = 1 \mid y_i) are the posterior expected counts of false discoveries and false negatives.
The way we decide upon a threshold follows the general framework given in Benjamini and Hochberg (1995) and Genovese and Wasserman (2003): a common cutoff point is chosen to make as many discoveries as possible while controlling the expected bFDR. More rigorously, since d_i = d_i(t) = I(v_i > t), we can view bFDR as an expected Bayesian false discovery rate process bFDR(t, y), and define the threshold

\[
t_B(y) = \min\{t : bFDR(t, y) \le \alpha,\ 0 \le t \le 1\}. \tag{3.18}
\]

By this definition, bFDR(t_B, y) ≤ α, and the procedure clearly makes as many discoveries as possible. This thresholding procedure can be formulated in a decision-theoretic framework: it can be shown that t_B(y) is the optimal solution minimizing a loss function of the form bFNR + λ × bFDR (see Appendix B), and therefore it provides combined control over both errors. Other choices of threshold are possible with different loss functions; a good account of these choices is given by Müller et al. (2003). However, all decision rules that attempt to control bFDR and bFNR have similar forms, as the optimization is carried out on similar assignment problems (see Appendix B).

All current procedures for controlling FDR, Bayesian or frequentist, assume the hypotheses are independent of each other. Such an assumption is often violated in the multiple hypothesis setting; gene expression data are an example, as co-regulated genes tend to have correlated expression levels, so treating them as independent may reduce our power to detect interesting genes and affect our control of FDR. In this dissertation, we propose an extended bFDR-controlling procedure that takes the dependence among genes into account. As this approach depends on the clustering models proposed in Chapters 4 and 5, we defer the extension to Chapter 6.
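As a concrete illustration of (3.16) and (3.18), the following sketch finds the cutoff t_B from the vector of marginal posterior probabilities. Because the expected bFDR of the "top k" rule is nondecreasing in k when the genes are scanned in decreasing order of v_i, a single pass suffices. The function name is ours.

    import numpy as np

    def bfdr_threshold(v, alpha=0.5):
        """Find the smallest cutoff t (equivalently, the largest number of
        discoveries) such that the expected Bayesian FDR (3.16) is <= alpha.
        v holds the marginal posterior probabilities v_i = Pr(M1 | y_i)."""
        v_sorted = np.sort(np.asarray(v))[::-1]      # most promising genes first
        k = np.arange(1, len(v_sorted) + 1)          # discoveries at each candidate cutoff
        bfdr = np.cumsum(1.0 - v_sorted) / k         # expected proportion of false discoveries
        ok = np.where(bfdr <= alpha)[0]
        if len(ok) == 0:
            return None, 0                           # no cutoff achieves the target level
        j = ok[-1]                                   # bfdr is nondecreasing: take the largest k
        return v_sorted[j], j + 1                    # cutoff t_B and number of discoveries

For instance, bfdr_threshold(v, alpha=0.5) returns the cutoff and the number of discoveries at the 0.5 level used in the sporulation analysis below.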
3.3 Examples
We now apply our filtering procedure to two cDNA microarray experiments.
3.3.1 Example 1: Sporulation Data
We first consider the microarray data reported in Chu et al. (1998); this data set was introduced in Chapter 1. Gene expression during sporulation in budding yeast was measured at t = {0, 0.5, 2, 5, 7, 9, 12} hours. Chu et al. (1998) were interested in genes that exhibited similar expression profiles. To this end, they identified, via expert knowledge and visual inspection, seven profiles with induction/repression at different time points. Each profile was then represented by the average expression of the 3-8 genes known to belong to that profile. After an initial screening in which 80% of the genes were eliminated, they then assigned the remaining genes to profiles using correlation as the distance measure.

The sporulation data of Chu et al. (1998) include transcript measurements at t = 0, which are essentially control sample versus control sample. These measurements can be considered "reference" data in that they were not subjected to experimental conditions and therefore reflect only measurement error, σ_e². Using these data, we were able to specify a highly informative prior on the measurement error σ_e² for the Bayes factor calculation, based on a straightforward posterior analysis. The t = 0 data were used in this posterior analysis, and excluded from the subsequent filtering procedure and clustering analysis.

Figure 3.1 shows a random sample of the sporulation data. Unlike Figure 1.4 in Chapter 1, this figure includes the t = 0 observations. It can be seen that shortly after time zero, many genes showed dramatic changes in their expression levels, whether induced or repressed. It is common practice to normalize data in gene expression analysis before proceeding to formal statistical analysis. In our case, however, after dropping the time zero data, it is important not to center each curve, as the measurements at the first time point now contain vital information concerning the magnitude and direction of the changes in gene expression at the beginning of sporulation.

We note that because the induction of sporulation is invasive to the cells (the cells were transferred to a nitrogen-deficient medium), the dramatic changes in gene expression at the beginning of the experiment are quite likely experimental artifacts rather than effects of sporulation. Therefore, the early responses should be regarded as less reliable indicators of sporulation than those obtained later, when the cells have stabilized. We believe this is a key issue which should be addressed carefully, yet so far it has been largely overlooked by the majority of data analyses. This realization also has important implications for better experimental design. We will further address these issues in the cell cycle data analysis in Chapter 5.

Figure 3.1: Random sample of 100 genes (including measurements at t = 0).
Assume model (3.1) for the sporulation data, and assume the prior distributions (3.8) and (3.9) on the parameters under the two competing models. To specify the priors on the measurement error σ_e², we first carry out a posterior analysis using the t = 0 reference data. Figure 3.2 shows the distribution of the reference data. The reference data were analyzed under the model \bar{y}_0 \mid \mu_0, \sigma_e^2 \sim \mathrm{N}(\mu_0, \sigma_e^2/N) with the improper prior \pi(\mu_0, \sigma_e^2) \propto \sigma_e^{-2}, which leads to the posterior \sigma_e^{-2} \mid y_0 \sim \mathrm{Ga}\{(N-1)/2,\, N s^2/2\}, where s² is the unbiased estimator of σ_e². Since N, the number of data points (genes), is large, the posterior distribution is highly concentrated: the posterior mean of σ_e is estimated to be 0.248, with 95% posterior sample interval (0.243, 0.252). From experience we knew the measurement error was likely to increase once the cells became less controlled, so we calibrated from this posterior to obtain an informative but less stringent prior on the measurement error, as follows: derive the distribution of σ_e implied by the inverse gamma prior on σ_e², set the modal value to 0.25, and set an upper bound of 1.5 so that Pr(0 < σ_e < 1.5) = 0.95, expressing our belief that the measurement error most likely lies in a range around 0.25; then solve these equations for the prior parameter values. This may require a numerical search if no simple analytical form is available. We chose a_0 = 0.93 and b_0 = 0.09 for the prior on σ_e.

Figure 3.2: Distribution of gene expression at t = 0.
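The numerical search mentioned above can be carried out as follows. This is a sketch under our reading that σ_e^{-2} ~ Ga(a, b), so that the density of σ_e is proportional to σ^{-(2a+1)} exp(−b/σ²), with mode \sqrt{2b/(2a+1)}; the function name and the bracketing interval are ours.

    import numpy as np
    from scipy import optimize, stats

    def solve_error_prior(mode=0.25, upper=1.5, prob=0.95):
        """Find (a, b) for sigma_e^{-2} ~ Ga(a, b) such that the implied density
        of sigma_e has its mode at `mode` and Pr(sigma_e < upper) = prob."""
        def tail_gap(a):
            b = mode ** 2 * (2 * a + 1) / 2              # enforce the mode constraint
            # Pr(sigma_e < upper) = Pr(tau > upper^{-2}) for tau ~ Ga(a, b)
            return stats.gamma.sf(upper ** -2, a, scale=1 / b) - prob
        a = optimize.brentq(tail_gap, 1e-3, 50.0)        # root-find on the tail constraint
        b = mode ** 2 * (2 * a + 1) / 2
        return a, b

Solving with mode 0.25 and Pr(σ_e < 1.5) = 0.95 returns values close to the (a_0, b_0) = (0.93, 0.09) used in the text.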
Next we specify the other parameters in the priors. We chose m_0 = 0 to reflect our belief that genes not regulated under sporulation should have constant expression across time, and took v_0 = 2² to reflect the range of variation of the observed data in similar experiments, e.g., DeRisi et al. (1997). For the importance sampling, we used S = 50,000 draws to estimate the Bayes factor for each gene.

We applied the filtering procedure to the total of 6118 genes in the sporulation data set. Figure 3.3 displays the 200 "top" and "bottom" genes ranked by p(M1 | y). It appears the filtering procedure was able to separate genes with large variation in their expression levels from those with small variation across time.

Figure 3.3: Sporulation data from Chu et al. (1998). Expression levels versus time for the 200 genes with the highest (left panel) and the lowest (right panel) values of p(M1 | y), where M1 is the model of non-constant level.

Chu et al. (1998) identified 1116 genes with significantly changed expression levels. Our filter, while controlling bFDR at the 0.5 level, suggests choosing the top 711 genes, with bFNR 0.07 (Figure 3.4). We chose to analyze the top 1300 genes (≈ 21% of the genome) ranked by p(M1 | y); this threshold has bFDR 0.63 and bFNR 0.10. Although these error rates seem high, as we discuss later, the computed bFDR and bFNR depend heavily on the model and priors assumed. Hence they cannot be taken literally without careful evaluation of the model and prior specified; at this stage they serve only as an exploratory tool to guide our decisions. We believe that for filtering, the relative ranking is more important than the absolute values of the marginal posterior probabilities or bFDRs. As can be seen from Figure 3.4, the bFDR is about 80% when all hypotheses are rejected; in other words, under the model and prior specified, 20% of the genes should be true discoveries. So we chose the 20% or so top-ranked genes, hoping to include as many true discoveries as possible while missing only a few. Figure 3.5 shows the 100 top and bottom genes within the chosen subset; genes ranked low in this subset already show relatively smaller variation across time than the higher-ranked genes.
Figure 3.4: Expected FDR process. The corresponding number of rejections is 711 when bFDR = 0.5; the cutoff in terms of p(M1 | y) is 0.26, with bFNR = 0.07.
Figure 3.5: Expression levels versus time for the 100 genes with the highest (left panel) and the lowest (right panel) values of p(M1 | y) among the top-ranked 1300 genes.
Chu et al. (1998) made the following remarks about their filtering procedure: "genes were included for clustering analysis if the root mean square of log2 R for a given time point was greater than 1.13, where R is the measured ratio of each gene's mRNA levels to its mRNA levels in vegetative cells just before transfer to sporulation medium (Xt = 0). In practice, this criterion is essentially equivalent to a three-fold change for a single time point or an average 2.2-fold change across entire time course."

Under the null hypothesis that there is no difference between the two samples, i.e., the true expression is zero across time, the root mean square is defined as

\[
RMS_i = \sqrt{\frac{1}{T}\sum_{t} y_{it}^2}.
\]

It is a single summary of the variation around zero for gene i. So we think the sentence should read "... the root mean square of log2 R for a given gene ...", not time point. Also, the root mean square is a measure of variability, whereas the other two criteria concern only the magnitude of the changes; we therefore believe the "equivalence" remark is not true either. In fact, using the root mean square with the 1.13 cutoff, we were able to recover the number 1116 mentioned in Chu et al. (1998).² On the other hand, using the 'average 2.2-fold change across the entire time course' criterion we were only able to pick out 283 genes, and only 97 with the 'three-fold change at a single time point' criterion.

We now compare our posterior probability based filtering with the criteria suggested by Chu et al. (1998): root mean squares and averages across time. We have found good agreement between our filter and the one based on root mean squares: among the top 1116 genes ordered by root mean squares and by posterior probabilities respectively, the agreement is about 92%. Figure 3.6 shows the 85 genes missed by each of them. It appears that the genes picked out by our filter are more likely to contain signals of sporulation than those picked by root mean squares; the genes picked out by Chu et al. (1998) but missed by our filter show little variation across time, and thus contain little information, although they deviate substantially from zero. The agreement between our filter and that based on the average fold change is smaller still, only about 80%. Figure 3.7 shows the genes missed by each filter. The genes missed by our filter that would have been classified as differentially expressed by fold change varied little across the course of sporulation, and are therefore likely false discoveries.

This performance comparison suggests that for differential expression detection, both the magnitude and the variability of the changes should be considered. Our filter combines both measures naturally through probabilistic models, and provides a measure of error via bFDR and bFNR.
² The actual number is 1148; we think the difference is due to rounding or to ignoring the flags.
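For reference, the root-mean-square criterion as we read it is a one-line computation; the array Y and the function name here are our own notation.

    import numpy as np

    def rms_filter(Y, cutoff=1.13):
        """Root mean square filter of Chu et al. (1998), computed per gene (not per
        time point): keep gene i if sqrt(mean_t y_it^2) exceeds the cutoff.
        Y is an n x T array of log2(R/G) values."""
        rms = np.sqrt(np.mean(Y ** 2, axis=1))
        return rms > cutoff      # boolean mask of genes passing the filter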
Figure 3.6: Disagreement between the root mean square based filter and the posterior probability based filter. The left panel shows the 85 genes missed by the filter based on posterior probabilities; the right panel shows the 85 genes missed by Chu et al. (1998), among the top 1116 genes.
Figure 3.7: Disagreement between the average based filter and the posterior probability based filter. The left panel shows the 215 genes missed by the filter based on posterior probabilities; the right panel shows the 215 genes missed by the filter based on averages, among the top 1116 genes.
3.3.2 Example 2: Cell-Cycle Data
As introduced in Chapter 1, Spellman et al. (1998) describe a number of microarray experiments that were carried out to create a comprehensive list of yeast genes whose expression levels vary periodically within the cell cycle. For illustration purposes, we analyze one of the data sets, in which expression levels were measured for genes synchronized via α factor. Expression levels were measured every 7 minutes for 140 minutes, so that p = 20 measurements were recorded in total for each gene.³ In this case, Spellman et al. (1998) were more interested in finding genes with cyclic patterns across time, so they normalized the data by centering each gene to have mean zero. We carried out the simple filtering assuming the same model and priors as in the example above. The top and bottom ranked genes are shown in Figure 3.8; it appears the filtering procedure separates genes with different amounts of variation very well. Under the assumed model and priors, if we claim the top 1000 ranked genes as discoveries, the expected bFDR is 0.80 and the expected bFNR is 0.13.

³ For illustration purposes, we only considered the N = 4489 genes without any missing observations here. This is not necessary with our approach, as we demonstrate in the next two chapters.

Figure 3.8: Cell cycle data from Spellman et al. (1998). Expression levels versus time for 100 genes with the highest (left panel), and the lowest (right panel) values of p(M1 | y).
3.4 Conclusion and Discussion
In this chapter we have outlined a fast and efficient screening procedure for gene expression analysis. Unlike previous, largely heuristic methods based on fold change or visual inspection, our approach is fully model-based, and therefore provides a more rigorous way to quantify the variability of genes. It further allows us to measure the quality of the filtering via the FDR. Next we make some remarks about our procedure.

First, we want to point out that filtering is not a necessary step in gene expression analysis. The main motivation for a screening step is to reduce the time and cost of computation. Given enough computing power, it may still be preferable to analyze the whole data set, an original objective of gene expression experiments in the first place. As we will demonstrate in later chapters, genes with little variability across time can easily be dealt with via the addition of a "zero" cluster in our mixture models.

It should be emphasized that for gene filtering, relative ranking based on posterior probabilities is more sensible than the absolute values of the posterior probabilities, a view shared by other researchers (Lönnstedt and Speed, 2002). Although the procedure is model-based, little effort was spent on developing models for the data, let alone the priors; the model and priors were chosen largely for computational convenience, with minimal consideration of the available information. Along the way, we have made some quite strong assumptions, including independent observations within each gene, independence between genes, simple independent priors, and an equal prior chance of being differentially expressed or not for each gene. As the expression of each gene is a continuous process, expression levels within a gene are clearly correlated rather than independent; there are possible mean-variance relationships due to the correlation within each gene; and clearly some genes are more likely to be differentially expressed than others. If the filtering procedure does not address these situations, we cannot rely on the face values of the posterior probabilities. For the same reasons, the FDR and FNR should not be taken literally either: bad choices of models and priors could easily lead to unreasonable values, as can be seen from our examples. They only reflect the possible error rates conditional on the model and priors specified. They do provide an important quantitative reference for deciding the cutoff point for filtering, but little can be said based on their absolute values. More accurate evaluation of the FDR would require us to pay much more attention to the models and priors. Fortunately, we have found that the relative ranking based on the posterior probabilities is more robust to the assumptions, and as a consequence the ranking provides more reliable information concerning the variation of the genes. Table 3.1 summarizes the number of genes that remain in the top-ranked 1300 genes under different prior distributions on the measurement error σ_e². It appears that the ranking based on posterior probabilities is robust to the prior specification, in the sense that, though the actual ranks of the genes may vary, the highly ranked genes tend to stay in the high-ranking group, while the low-ranked genes tend to stay in the low-ranking group.
Table 3.1: Sensitivity of Ranking to Prior Specification: Sporulation Data

Priors                  Common genes among top 1300
σe² ~ IG(a0, b0)        1300
σe² = 0.1²              1071
σe² = 0.25²             1285
σe² = 2.5²              1287
Since the filtering is model-based, it can be extended quite easily to incorporate more information. For example, in the cell cycle data we selected genes showing large variation, but in fact we were more interested in finding genes showing periodic patterns; genes showing large variation during the cell cycle are not necessarily cell cycle regulated. We can therefore build the model to accommodate this requirement, and thus specifically select genes showing cyclic patterns. Such more elaborate filtering will be illustrated in the next two chapters.

In our calculation, each gene is assumed a priori to have equal probability of following each of the two competing models. Since this is the exact question we are trying to answer through our analysis, this non-informative prior seems quite reasonable. But from a biological point of view, there are certainly genes that are more likely to be differentially expressed than others. Efron et al. (2001) and Storey (2002) have employed empirical Bayes approaches and mixture models to estimate these probabilities, and then detected differential expression based on the FDR. Similar adjustments could be added to our procedure as well, but the complexity of such approaches would take the procedure beyond simple screening and into the area of differential analysis, which is not the main focus of this dissertation.

Another improvement could be to introduce some dependence among genes through the specification of the model prior p(M_k). Cho et al. (1998) identified 416 of 6220 monitored transcripts as cell cycle-dependent, and more than 25% of them were found directly adjacent to other genes in the genome that displayed induction in the same cell cycle phase. They speculated that these gene pairs could be regulated by the same upstream sequence. Therefore some sort of dependence among genes may be closer to the truth, and consequently the results may be more efficient and accurate. The calculation of the FDR also assumes independence among genes, which could likewise be modified by accounting for the dependence. Clearly there is room for improvement, but for the purpose of ranking and filtering, the method described in this chapter works sufficiently well.
Chapter 4 BAYESIAN HIERARCHICAL MODELS FOR CURVE CLUSTERING
There are various scientific questions that may be addressed via microarray experiments. In this research project we are interested in the situation where the experiments are indexed by a variable which has a natural ordering, such as time, temperature, or the dose level of a toxin. Under this scenario, the aim is to gain insight into those genes that display similar patterns (co-express) over the course of the experiment. By comparing genes of unknown function with those previously characterized, clues regarding their function and regulation mechanisms may be obtained. This "guilt-by-association" has proven to be a useful strategy in biological research, and it is the premise for gene expression clustering.

A variety of approaches to this problem have been proposed. By far the most common approach is to apply generic supervised or unsupervised clustering algorithms to the data. For example, Eisen et al. (1998) used hierarchical clustering with correlation as the distance measure, Chu et al. (1998) clustered genes according to known profiles, again based on correlation, and Tamayo et al. (1999) applied self-organizing maps to cluster gene expression data. These approaches, though useful as exploratory tools, are unsatisfactory in a number of respects. To summarize briefly: first, these algorithms ignore the variability in the data, and do not take the measurement error into account. Second, the data are clustered on the basis of raw measurements, so the classifications can be sensitive to outlying observations; to overcome this, the raw data are often first screened to remove aberrant observations based on fold change, but this procedure is rather ad hoc, and we have proposed a probabilistic alternative in the previous chapter. Third, there is no measure of the uncertainty associated with the clustering. A remedy to this problem has been proposed by Kerr and Churchill (2001), who assess the uncertainty by bootstrapping the residuals from an analysis of variance model that includes the gene-time effects of interest; the proportion of bootstrap samples that cluster to the original classification then provides a measure of the reliability. Fourth, missing data and unbalanced designs are not easily dealt with using these clustering algorithms. Finally, clustering algorithms are generic and are in no way tuned to the application at hand (except perhaps through the measure of dissimilarity used); in particular, they do not allow the incorporation of covariates and prior information. Thus when clustering gene expression time series, because the time ordering is not used, we can interchange the columns of the input data matrix and obtain the same clustering. For a more detailed review and discussion of these clustering algorithms, see Chapter 2.

In this chapter we propose a model-based approach to this problem, in which we explicitly model the trajectory as a function of the ordering variable (e.g., time) and a gene-specific set of parameters. We then cluster on the basis of the latter, with our probabilistic framework providing quantitative measures of classification. This approach is tailored specifically to curve data, combining hierarchical models with mixture models. The former enable us to model the variability in gene expression data at different levels and allow us to make inference on parameters by pooling information across genes; the latter provide a probabilistic framework for clustering with a straightforward measure of uncertainty. We carry out a fully Bayesian analysis, with computation based on MCMC. A brief review of hierarchical models, mixture models and MCMC was given in Chapter 2.
4.1
The General Hierarchical Mixture Model
In this section, we describe the general model we propose for clustering curve data. Although the description is in terms of gene expression time series, we emphasize that the model applies to curve data in general.
4.1.1
Model description
Let $y_{ij}$ denote, for gene i, the log-ratio of mRNA expression level measured at time $x_{ij}$, relative to a reference sample, $i = 1, \cdots, n$, $j = 1, \cdots, T_i$. The experiments may be indexed by any variable with a natural ordering, but for presentation purposes we suppose that variable is time. To further simplify the notation, we assume a balanced design common to all the genes, so that $x_{ij} = x_j$, $j = 1, \cdots, T$ for all $i = 1, \cdots, n$, although this is not
necessary for our approach (unbalanced designs and missing data can be easily dealt with by applying appropriate subscripts in the formulation). We then model these data via the following multi-stage hierarchical model.
• Stage One: For the observed data we assume
$$y_{ij} = h(\theta_i, x_j) + e_{ij}, \qquad (4.1)$$
where $e_{ij} \sim_{\mathrm{iid}} N(0, \sigma_e^2)$ is the measurement error, and $h(\theta_i, x_j)$ denotes the context-specific form of the trajectory, which depends on a set of gene-specific parameters $\theta_i$ and the covariate(s) $x_j$.
• Stage Two: Conditional on K known trajectories, we introduce the trajectory membership labels $z_i$ that indicate the underlying trajectory to which gene i belongs, such that
$$\theta_i \mid z_i = k, \phi, K \sim f(\theta_i \mid \phi_k). \qquad (4.2)$$
Here we model $\theta_i$ as arising from a mixture of distributions $f(\cdot)$ that depends on unknown cluster-specific parameters $\phi = \{\phi_1, \cdots, \phi_K\}$.
• Stage Three: We assume the cluster labels are independent and follow the multinomial distribution,
$$\Pr(z_1, \cdots, z_n \mid \pi, K) = \prod_{i=1}^{n} \Pr(z_i \mid \pi, K), \qquad (4.3)$$
where $\Pr(z_i = k \mid \pi, K) = \pi_k$, $k = 1, \cdots, K$, and $\pi = (\pi_1, \cdots, \pi_K)$.
• Stage Four: The cluster-specific parameters $(\phi, \pi)$ can be viewed as sub-population parameters. At this stage, we place population distributions on them, and assume
$$\phi_k \sim_{\mathrm{iid}} f(\phi_k \mid \psi), \qquad (4.4)$$
for $k = 1, \cdots, K$, where $\psi$ is the population parameter (vector). This is the prior for the collection of trajectories. For the cluster weights $\pi$, we assume a Dirichlet prior with parameter $\delta$,
$$\pi \sim \mathrm{Dirichlet}(\delta), \qquad (4.5)$$
where Dirichlet(·) denotes the Dirichlet distribution, a common choice for mixture models; see, e.g., Richardson and Green (1997).
• Stage Five: Finally, we complete the hierarchy by placing prior distributions on $\sigma_e^2$, $\psi$, and possibly K if it is assumed unknown. Further hyper-priors can also be assumed.
Let $\eta$ denote the collection of $\sigma_e^2$, $\psi$, $\delta$ and all other parameters common to all clusters. Under the above hierarchical mixture model, along with some conditional independence assumptions, the joint distribution of all the variables can be written as
$$p(y, z, \theta, \phi, \pi, K, \eta) = p(y \mid \theta, \eta)\, p(\theta \mid z, \phi, K, \eta)\, p(z \mid \pi, K)\, p(\phi \mid \eta, K)\, p(\pi \mid \eta, K)\, p(\eta)\, p(K), \qquad (4.6)$$
based on which we make inferences. The intuition behind this model is that each curve is characterized by certain features, which are indicative of the underlying regulation mechanism. After identifying these features, we try to group curves sharing similar features, with weights attached to express our uncertainty about the clustering. As the identified features are often fewer than the observed data points for each curve, significant computational efficiency and better visualization can be achieved by clustering on the reduced feature data instead of on the original data. However, this depends on how well the features characterize the curves, as information may be lost in the reduction of the data. Our fully Bayesian hierarchical model provides a way to circumvent this problem: stages one to three enable us to measure the variability associated with the sub-population level features, thus allowing the uncertainty to propagate correctly. In gene expression analysis, interest often lies in detecting and characterizing representative trajectories, which may correspond to underlying regulation mechanisms. In other words, we are more concerned with making inference on the cluster-specific parameters $(\pi, \phi)$, the number of clusters K if it is unknown, and the cluster labels $z_i$. This is a key feature of gene expression clustering – the focus is on sub-population level inference, not on population level nor on individual level inference. This realization can lead to simplified model
representation and computation, as we can integrate out the random effects and work with far fewer parameters. We now provide an interpretation of the random variables $z_i$, which represent the gene expression curve that gene i is following. The change in expression levels describing transcription from DNA to RNA that occurs within the cell nucleus is determined by, amongst other things, the regulatory proteins that are responsible for gene i and the timings at which these proteins act (either to activate or suppress activity). The curve membership indicators can therefore be viewed as summarizing the proteins that are relevant for gene i, and if two genes lie in the same cluster it is evidence of shared transcription factors. With this interpretation, co-expression of a group of genes can be measured simply by the probability that these genes share the same cluster labels. Our approach of assuming a mixture model with flexible mean structures is crucially different from the “model-based” clustering approach of Yeung et al. (2001), who analyzed similar data but simply assumed that the data arose from a mixture of T-dimensional normal distributions, and hence did not acknowledge the time-ordering of the data (the analysis would be unchanged if the time ordering were permuted). In particular, it would be desirable to allow serial dependence within such an approach, but the MCLUST software (Fraley and Raftery, 1998) that is used by Yeung et al. (2001) does not allow for this possibility, and it does not perform well when the dimension T gets large. In their approach, missing data and unbalanced designs also cause complications, whereas in our model no such problems arise. Medvedovic and Sivaganesan (2002) also proposed a Bayesian hierarchical model for clustering microarray data, but again failed to take the time ordering into account.
4.1.2
Computation
In this section we outline the general strategy for computation; specific algorithms will be given in later sections along with their applications. For fixed K, samples from the corresponding posterior distributions can be generated using a Gibbs sampler or Metropolis-Hastings MCMC algorithm. For unknown K, either reversible
jump MCMC (Richardson and Green, 1997) or birth-death MCMC (Stephens, 2000a) may be used (both methods were reviewed in Chapter 2). We resort to the latter since it is relatively straightforward to implement in our context. The general MCMC algorithm consists of two types of sampling steps: one involves dimension-changing moves, the other is conditional on fixed K. The algorithm obtains samples from the posterior $p(K, \pi, \phi, \eta \mid y)$ by combining a continuous-time marked point process, from which the points $\phi_k$ along with their associated “marks” $\pi_k$ are sampled, with a Gibbs sampler (or Metropolis-Hastings algorithm if no simple form of full conditional is available) through which $\eta$, the parameters not depending on K, are updated. The major component of BDMCMC is the generation (“birth”) or removal (“death”) of the “marked points” – the parameters associated with a specific cluster. We follow Stephens (2000a) quite closely and implement the “naive” algorithm he proposed. We take the birth rate simply to be the mean of the Poisson prior that we specify for K. For a birth we simulate a new mixing proportion (“mark”) $\pi \sim \mathrm{Beta}(1, K)$, and the new trajectory vector (“point”) from the fourth-stage prior, which in our examples is a normal distribution. The death rate for component k, $d_k$, is the likelihood for the collection $(\pi, \phi)$ with $(\pi_k, \phi_k)$ removed, divided by the likelihood for $(\pi, \phi)$. Component k is selected to die with probability $d_k / \sum_{k'} d_{k'}$.
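To make these birth-death mechanics concrete, the following is a minimal Python sketch of the death-rate computation and a single birth move. For clarity it treats the random-effect feature vectors θ as directly observed data; this simplification and the function names are illustrative assumptions, not the exact implementation used for this dissertation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_loglik(theta, weights, means, covs):
    """Log-likelihood of the feature vectors theta under the current mixture."""
    dens = np.zeros(len(theta))
    for w, m, c in zip(weights, means, covs):
        dens += w * multivariate_normal.pdf(theta, mean=m, cov=c)
    return np.sum(np.log(dens))

def death_rates(theta, weights, means, covs):
    """d_k = likelihood with component k removed / likelihood of the full mixture."""
    full = mixture_loglik(theta, weights, means, covs)
    K = len(weights)
    d = np.empty(K)
    for k in range(K):
        keep = [j for j in range(K) if j != k]
        w = weights[keep] / (1.0 - weights[k])   # renormalize remaining weights
        d[k] = np.exp(mixture_loglik(theta, w, [means[j] for j in keep],
                                     [covs[j] for j in keep]) - full)
    return d

def birth(weights, K, rng):
    """Draw a new mark pi ~ Beta(1, K) and rescale the existing weights."""
    pi_new = rng.beta(1.0, K)
    return np.append(weights * (1.0 - pi_new), pi_new)
```

In the full sampler a death is executed by removing component k with probability $d_k/d$ and rescaling the remaining weights, mirroring the birth step above.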
Next we outline the algorithm that combines the birth-death process with the Gibbs sampler (or Metropolis-Hastings algorithm), following Stephens (2000a).

Algorithm 4.1.1 Given the state $(K^{(t)}, \pi^{(t)}, \phi^{(t)}, \theta^{(t)}, \eta^{(t)})$ at time t, simulate the new state $(K^{(t+1)}, \pi^{(t+1)}, \phi^{(t+1)}, \theta^{(t+1)}, \eta^{(t+1)})$ as follows:
• Step 1: Starting from $(K^{(t)}, \pi^{(t)}, \phi^{(t)})$, run the birth-death process for a fixed period $t_0$, with the other, non-cluster-specific, parameters fixed. Let $(K^{(t')}, \pi^{(t')}, \phi^{(t')})$ denote the state of the cluster-specific parameters at $t + t_0$.
• Step 2: Set $K^{(t+1)} = K^{(t')}$.
• Step 3: Sample $z^{(t+1)}$ from $p(z \mid K^{(t+1)}, \pi^{(t')}, \phi^{(t')}, \theta^{(t)}, \eta^{(t)}, x)$.
• Step 4: Update the non-cluster-specific parameters to obtain $\theta^{(t+1)}$ and $\eta^{(t+1)}$.
• Step 5: Update $\pi^{(t')}$ and $\phi^{(t')}$ to obtain $\pi^{(t+1)}$ and $\phi^{(t+1)}$.
• Step 6: Return to Step 1.

It should be noted that although we run the birth-death process continuously, we only update the parameters and record the samples at fixed time intervals. It should also be emphasized that parameters should not be updated without ensuring that the other parameters on which they depend have achieved the stationary distribution; this is required so that the defined Markov chain has the desired stationary distribution. It has also been suggested that parameters can be updated at each birth/death move, and then weighted by their staying time in each state. Weighting states by the time spent in them should make more efficient use of the sampled points, at the expense of greater storage requirements. It would be of interest to implement this algorithm and study its performance in the future. In our examples convergence of the BDMCMC was assessed via formal diagnostics and less formal visual inspection. We informally examined the trace plots of parameters and compared results from multiple chains initiated at different starting points. We also ran fixed-K analyses and compared the results to the BDMCMC algorithm. More formal assessment of convergence was done through examination of the Geweke diagnostic (Geweke, 1992). But caution has to be taken because both the changing dimension and label-switching (see below) will complicate the calculation of convergence diagnostics. We chose to evaluate the convergence diagnostics based on re-labelled MCMC samples with K fixed.
4.1.3
Label-switching
As discussed in detail in Richardson and Green (1997) (and the accompanying discussion) and in Stephens (2000b), there is a fundamental non-identifiability associated with mixture problems, in that the posterior contains K! modes of equal height that are indistinguishable from the data alone. The root of the label-switching problem is that the likelihood
$$L(\pi, \phi) = \pi_1 f(y; \phi_1) + \cdots + \pi_K f(y; \phi_K)$$
is invariant under permutation of the mixture component labels. In a Bayesian analysis, if we have no prior information that distinguishes amongst the components of the
mixture, then we can only assume the same prior distribution on all permutations of $(\pi, \phi)$; as a consequence the posterior distribution will be similarly symmetric. This symmetry and multi-modality of the posterior distribution can cause problems when we try to estimate quantities based on simulated samples from MCMC, as we cannot attribute the samples to their corresponding components of the mixture. To identify parameters uniquely some form of “labelling” must be carried out. A common strategy for removing label-switching is to impose an identifiability constraint on the model parameters (such as $\pi_1 < \pi_2 < \cdots < \pi_K$) that can be satisfied by only one permutation of the parameters. Supposedly this breaks the symmetry of the prior distribution, and thus of the posterior distribution. However, as pointed out by Stephens (2000b), many choices of identifiability constraint will not be sufficient to remove the symmetry in the posterior distribution. Richardson and Green (1997) also acknowledged this issue, and suggested that the MCMC output be post-processed according to different identifiability constraints to obtain the clearest picture of the component parameters. Stephens (2000b) proposed a general decision-theoretic approach to re-label the component parameters, and provided a specific re-labelling algorithm for clustering inference based on minimization of the Kullback–Leibler divergence between estimated classification probabilities and true classification probabilities. In our applications, we implement both multiple identifiability constraints and the decision-theoretic approach, and report results based on the re-labelling with the most plausible interpretation. We used a Numerical Algorithms Group C routine with an efficient algorithm for the assignment problem involved in the decision-theoretic approach. The computing time of the decision-theoretic approach is substantially longer than that of the identifiability constraints, especially when K is large. For gene expression analysis, as the features (parameters) used to identify clusters often have clear scientific meaning, we have found that re-labelling according to these features is quite plausible. For example, we can order the groups of cell-cycle regulated genes based on their time to the first peak. Although label-switching may complicate our inference, the fundamental question of co-expression of genes i and i′ can be answered without resolving this problem, since the quantity $\Pr(z_i = k, z_{i'} = k \mid y)$ is invariant to re-labelling.
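As a concrete illustration of the identifiability-constraint strategy, the short Python sketch below permutes each saved MCMC sample so that a chosen coordinate of the cluster means is increasing; the array layout and the choice of feature are assumptions made for the example.

```python
import numpy as np

def relabel_by_constraint(pi_samples, mu_samples, feature=0):
    """Impose an ordering constraint on one coordinate of the cluster means,
    permuting the weights and means of every MCMC sample accordingly.
    pi_samples: (S, K) weights; mu_samples: (S, K, p) means."""
    pi_out, mu_out = pi_samples.copy(), mu_samples.copy()
    for s in range(len(pi_samples)):
        order = np.argsort(mu_samples[s][:, feature])
        pi_out[s], mu_out[s] = pi_samples[s][order], mu_samples[s][order]
    return pi_out, mu_out
```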
4.2
Example 1: Simulated Data
4.2.1
Data description
We first demonstrate the feasibility of our proposed model using a simulated data set. We generated n = 50 linear curves from K = 3 clusters, with cluster centers and covariance matrices given by
$$\mu_1 = (8, 1)', \quad \mu_2 = (10, 0)', \quad \mu_3 = (12, -1)',$$
$$\Sigma_1 = \begin{pmatrix} 0.2 & 0 \\ 0 & 0.2 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}, \quad \Sigma_3 = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}.$$
After labelling the groups by the first dimension of $\mu$ (the intercepts), we specified the cluster labels for each individual curve,
$$z_i = \begin{cases} 1, & i = 1, \cdots, 15, \\ 2, & i = 16, \cdots, 30, \\ 3, & i = 31, \cdots, 50. \end{cases}$$
The random effects (intercept and slope) were assumed to follow a multivariate normal distribution,
$$(\alpha_i, \beta_i)' \sim N_2(\mu_{z_i}, \Sigma_{z_i}).$$
Finally, data were generated as a linear function of the given time points, with measurement error added,
$$y_{ij} \sim_{\mathrm{iid}} N(\alpha_i + \beta_i x_{ij}, \sigma_e^2),$$
where $x_i = (-3, -1, 1, 3)$ for $i = 1, \cdots, 50$, and $\sigma_e^2$, the variance of the measurement error, was chosen to be 1.0. The simulated curves and the corresponding least squares estimates of intercepts and slopes are shown in Figure 4.1. We see no obvious clusters among the observed curves, but when the curves are reduced to lower-dimensional features, intercepts and slopes in this case, there appear to be relatively well separated groups, although the exact number is unknown.
Figure 4.1: (a) A total of 50 simulated curves, with t = (−3, −1, 1, 3). (b) The least squares estimates of intercepts and slopes. Units 1–15 are in group one, 16–30 are in group two, 31–50 are in group three. The groups are labelled by their intercepts.
4.2.2
Model description
Next we tailor the general model to this data set as follows.
• Stage 1: Suppose the first-stage model has the form
$$y_{ij} \sim_{\mathrm{iid}} N(\alpha_i + \beta_i x_{ij}, \sigma_e^2),$$
where $\alpha_i$ is the curve-specific intercept and $\beta_i$ the slope, for $i = 1, \cdots, n$, $j = 1, \cdots, T$, and $\sigma_e^2$ is the variance of the measurement error. So $y_i = (y_{i1}, \cdots, y_{iT})$, the ith of n curves, is modeled as a linear structure in the covariates $x_i$ with conditionally independent homoscedastic normal errors having variance $\sigma_e^2$.
• Stage 2: Next we assume the first-stage curve-specific parameters $\theta_i = (\alpha_i, \beta_i)'$ arise
from a mixture of K sub-populations,
$$\theta_i \mid z_i = k, \mu, \Sigma, K \sim N_2(\mu_k, \Sigma_k), \quad k = 1, \cdots, K.$$
• Stage 3: For the group labels $z = (z_1, \cdots, z_n)$, we assume they are independent and follow the multinomial distribution,
$$\Pr(z_1, \cdots, z_n) = \prod_{i=1}^{n} \Pr(z_i), \qquad \Pr(z_i = k \mid \pi, K) = \pi_k, \quad k = 1, \cdots, K.$$
• Stage 4: We then specify the population distributions on the mixture component parameters,
$$\pi \mid K, \delta \sim \mathrm{Dirichlet}(\delta_1, \cdots, \delta_K),$$
$$\mu_k \mid K, m, V \sim_{\mathrm{iid}} N(m, V),$$
$$\Sigma_k^{-1} \mid K, \rho, R \sim_{\mathrm{iid}} \mathrm{Wishart}(\rho, (\rho R)^{-1}),$$
for $k = 1, \cdots, K$.
• Stage 5: We can finish the hierarchy by imposing the priors
$$\sigma_e^{-2} \sim \mathrm{Ga}(g, h), \qquad K \sim \mathrm{Poisson}(\lambda).$$
• Stage 6 (optional): Alternatively, we can treat (m, V, R) as hyper-parameters on which we place “vague” priors. This is an attempt to represent the belief that the means and variances of the clusters will be similar when viewed on some scale, without being informative about their actual values. We chose to place a uniform prior on m, and “vague” Wishart distributions on V and R,
$$m \sim \mathrm{const}, \qquad V^{-1} \sim \mathrm{Wishart}(a, (aV_0)^{-1}), \qquad R \sim \mathrm{Wishart}(b, (bR_0)^{-1}),$$
with a = b = 2, the dimension of V and R. As we reviewed in Chapter 2, such Wishart priors are the least informative in the sense that the distribution is the flattest while remaining proper. Stage 6 is closely related to the variable-κ priors used in Stephens (2000a). The specification of hyper-priors allows the population distributions, which we use to generate the potential clusters, to be updated by the data. On the other hand, completing the hierarchy at stage 5 corresponds to the fixed-κ priors used by Richardson and Green (1997), which do not allow the population distributions to be updated by the data. Since the focus of our inference is on the sub-population parameters, we have found little difference in the results provided the priors and hyper-priors are “non-informative” or only “weakly informative”. But, as we mention later in the examples, the inference for K can be highly sensitive to the priors used, so sensitivity analysis should be carried out and informative priors would be preferable.

4.2.3
BDMCMC
We follow Stephens (2000a) to construct a Markov chain with stationary distribution $p(K, \pi, \phi, z, \theta, \eta \mid y)$ proportional to the joint likelihood given in (4.6), by combining a birth-death process with Gibbs sampling steps, following Algorithm 4.1.1. The algorithm can be dissected into two steps. One step is the birth-death step, which involves dimension-changing jumps. It follows Algorithm 2.4.
1. For simplicity we take the birth rate $b = \lambda$, the mean of the Poisson prior on K.
2. The death rate for component k is given by
$$d_k = \frac{L(\pi, \mu, \Sigma \setminus (\pi_k, \mu_k, \Sigma_k))}{L(\pi, \mu, \Sigma)},$$
for $k = 1, \cdots, K$, and the total death rate is $d = \sum_k d_k$.
3. Simulate the time to jump from an exponential distribution with mean $1/(b + d)$.
4. Simulate the type of jump: birth or death, with respective probabilities
$$\Pr(\mathrm{birth}) = \frac{b}{b+d}, \qquad \Pr(\mathrm{death}) = \frac{d}{b+d}.$$
5. For a birth, simulate $\pi \sim \mathrm{Beta}(1, K)$, $\mu \sim N_2(m, V)$, and $\Sigma^{-1} \sim \mathrm{Wishart}_2(\rho, (\rho R)^{-1})$. For a death, select a component k to die with probability $d_k/d$.
6. Re-scale: if a birth is chosen, multiply the current weights $(\pi_1, \cdots, \pi_K)$ by $(1 - \pi)$ and increment K by 1; if a death is chosen, divide $\pi_{-k}$ by $(1 - \pi_k)$ and decrement K by 1. Return to step 1.

The other step is updating parameters conditional on fixed K. In this linear example, all full conditionals can be easily derived, allowing straightforward implementation of Gibbs samplers. For the full hierarchical model with hyper-priors (Stages 1–6), the full conditionals are as follows. The full conditionals for the random effects $\{\theta_i = (\alpha_i, \beta_i)'\}$ are
$$\theta_i \mid \cdots \sim N(m^*, \Sigma^*), \quad i = 1, \cdots, n,$$
where
$$\Sigma^* = \left( \frac{1}{\sigma_e^2} X_i' X_i + \Sigma_{z_i}^{-1} \right)^{-1}, \qquad m^* = \Sigma^* \left( \frac{1}{\sigma_e^2} X_i' y_i + \Sigma_{z_i}^{-1} \mu_{z_i} \right),$$
with $X_i = (\mathbf{1}, x_i)$ the design matrix for the ith curve. The full conditionals for the cluster labels $\{z_i\}$ are
$$\Pr(z_i = k \mid \cdots) \propto \pi_k f(\theta_i; \mu_k, \Sigma_k),$$
where $f(\cdot\,; \mu, \Sigma)$ is the normal density. Let $n_k = \#\{i : z_i = k\}$ be the number of curves assigned to cluster k; the full conditionals for the component means $\{\mu_k\}$ are
$$\mu_k \mid \cdots \sim N(m^*, V^*),$$
where
$$(V^*)^{-1} = n_k \Sigma_k^{-1} + V^{-1}, \qquad m^* = V^* \left( n_k \Sigma_k^{-1} \bar\theta_k + V^{-1} m \right),$$
with $\bar\theta_k = \frac{1}{n_k} \sum_{i: z_i = k} \theta_i$.
For the component variances $\{\Sigma_k\}$ we have
$$\Sigma_k^{-1} \mid \cdots \sim \mathrm{Wishart}\left( \rho + n_k, \left( \rho R + \sum_{i: z_i = k} (\theta_i - \mu_k)(\theta_i - \mu_k)' \right)^{-1} \right).$$
For the component weights $\{\pi_k\}$ we have
$$\pi \mid \cdots \sim \mathrm{Dirichlet}(\delta_1 + n_1, \cdots, \delta_K + n_K).$$
For the measurement error, the full conditional is
$$\sigma_e^{-2} \mid \cdots \sim \mathrm{Ga}\left( g + \frac{nT}{2}, \; h + \frac{1}{2} \sum_{i=1}^{n} (y_i - X_i \theta_i)'(y_i - X_i \theta_i) \right).$$
For the hyper-parameters, the full conditionals are
$$m \mid \cdots \sim N\left( \sum_k \mu_k / K, \; V/K \right),$$
$$V^{-1} \mid \cdots \sim \mathrm{Wishart}\left( a + K, \left( aV_0 + \sum_k (\mu_k - m)(\mu_k - m)' \right)^{-1} \right),$$
$$R \mid \cdots \sim \mathrm{Wishart}\left( K\rho + b, \left( bR_0 + \rho \sum_k \Sigma_k^{-1} \right)^{-1} \right).$$
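For concreteness, the sketch below implements one fixed-K sweep over these full conditionals in Python; the $\sigma_e^2$ and hyper-parameter updates are omitted for brevity but follow the same pattern. The argument names and in-place update style are our own conventions, not a description of the code actually used for the dissertation.

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart

def gibbs_sweep(y, X, z, theta, mu, Sigma, pi, sig2e, m, V, rho, R, delta, rng):
    """One fixed-K sweep over the full conditionals derived above.
    y: (n, T) curves; X: (T, 2) design matrix; Sigma: list of K 2x2 matrices."""
    n, K = y.shape[0], len(pi)
    XtX, Vinv = X.T @ X, np.linalg.inv(V)
    for i in range(n):                                    # random effects theta_i
        Skinv = np.linalg.inv(Sigma[z[i]])
        Sstar = np.linalg.inv(XtX / sig2e + Skinv)
        mstar = Sstar @ (X.T @ y[i] / sig2e + Skinv @ mu[z[i]])
        theta[i] = rng.multivariate_normal(mstar, Sstar)
    for i in range(n):                                    # cluster labels z_i
        p = np.array([pi[k] * multivariate_normal.pdf(theta[i], mu[k], Sigma[k])
                      for k in range(K)])
        z[i] = rng.choice(K, p=p / p.sum())
    for k in range(K):                                    # means mu_k, covariances Sigma_k
        idx = np.flatnonzero(z == k)
        Skinv = np.linalg.inv(Sigma[k])
        Vstar = np.linalg.inv(len(idx) * Skinv + Vinv)
        mstar = Vstar @ (Skinv @ theta[idx].sum(axis=0) + Vinv @ m)
        mu[k] = rng.multivariate_normal(mstar, Vstar)
        S = rho * R + sum(np.outer(theta[i] - mu[k], theta[i] - mu[k]) for i in idx)
        prec = wishart.rvs(df=rho + len(idx), scale=np.linalg.inv(S), random_state=rng)
        Sigma[k] = np.linalg.inv(prec)
    pi[:] = rng.dirichlet(delta + np.bincount(z, minlength=K))
    # sig2e and the hyper-parameters (m, V, R) are updated analogously
    return z, theta, mu, Sigma, pi
```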
4.2.4
Analysis
Under mixture models there is always the possibility that no observations are allocated to one or more components, in which case the data are uninformative about those components. Standard choices of improper non-informative priors could then lead to improper posteriors; we therefore follow Richardson and Green (1997) and specify weakly informative priors. We also use informative priors when pertinent information is available.
Here we report the results from analyzing the simulated data using the hierarchical model with stage 6 priors. We chose “non-informative” priors on $\sigma_e^2$ with g = 0.001 and h = 0.001. We chose m = (10.38, −0.24), the midpoint of the least squares estimates. $V_0$ was chosen to be proportional to a diagonal matrix D, with the diagonal being the squared ranges of the least squares estimates, and $R_0$ was chosen to be proportional to 1/D. The multipliers were chosen to make the variances large so that the priors were not too informative. The prior on the number of clusters K was Poisson with λ = 5. Figure 4.2 shows the trace plot of K from BDMCMC. The chain was run for 100,000 iterations, with additional thinning and a burn-in period of 100,000 iterations. The BDMCMC output was re-labelled by the intercepts. The posterior probability p(K = 3 | y) = 0.63 is the largest among the competing models, suggesting that the model is able to pick the right number of clusters. Figure 4.3 displays the estimated cluster centers (intercepts and slopes), superposed with the confidence regions calculated from the estimates of the covariance matrices, and the final classifications. With this model, not only were we able to identify the right number of clusters, we also correctly classified all the simulated curves to their corresponding clusters. Table 4.1 lists the posterior medians along with 90% sampling intervals for some of the parameters. It can be seen that the estimates of the mixing proportions, cluster means and measurement error are all close to the truth, and, given the small sample size and the weakly informative prior, the estimates of the cluster covariance matrices are also acceptable. Uncertainty in the classification is assessed straightforwardly with our model, namely through the posterior classification probabilities $\{p(z_i = k \mid y, K)\}$. For example, observation 1 lies between cluster 1 and cluster 2, and this is reflected by its classification probabilities $p(z_1 = 1 \mid y, K = 3) = 0.54$ and $p(z_1 = 2 \mid y, K = 3) = 0.45$, whereas well separated curves are associated with distinct classification probabilities. The clustering model works extremely well in this example, but this is partly because the curves have a simple linear structure and the clusters are well separated. Overlapping clusters and low data resolution are two key obstacles for clustering, regardless of the clustering method used. But the fully Bayesian cluster model gives us the flexibility to incorporate external information, if available. In this respect, our model may be a better choice than others, especially in the situation of gene expression clustering. This advantage is further
Figure 4.2: Trace plot of K from BDMCMC, and its posterior distribution.
Figure 4.3: Posterior estimates of the cluster centers and their variances. The circles correspond to 2 standard errors in the univariate case. The right panel shows the classification conditional on K = 3.
Table 4.1: Posterior summaries (medians, with 90% sampling intervals) of the simulated growth curve data with K = 3.

Parameter   Cluster 1            Cluster 2             Cluster 3
π           0.29 (0.19, 0.43)    0.33 (0.19, 0.47)     0.38 (0.24, 0.52)
µ1          7.97 (7.58, 8.55)    10.32 (9.91, 10.78)   11.92 (11.47, 12.25)
µ2          0.93 (0.70, 1.16)    -0.07 (-0.40, 0.28)   -1.13 (-1.33, -0.89)
Σ11         0.28                 0.18                  0.18
Σ12         0.00                 0.00                  0.00
Σ22         0.15                 0.23                  0.13
σe²         0.90 (0.72, 1.13), common to all clusters
illustrated with our next two applications.
4.3
Example 2: Sporulation Data
4.3.1
Data description
In this section we analyze the sporulation data of Chu et al. (1998). This data set was introduced in Chapter 1 and underwent an initial screen in Chapter 3. To recap, yeast gene expression during sporulation, relative to time t = 0, was measured at times x = {0.5, 2, 5, 7, 9, 11.5}, so that T = 6 (the t = 0 measurements were dropped; see Chapter 3). Chu et al. (1998) were interested in genes that exhibited similar profiles, and to this end they created seven “characteristic curves” by averaging, via a visual inspection, the genes contained in each profile. Figure 4.4 shows the sets of genes in each hand-picked profile, along with the mean trajectories. After an initial screen in which 80% of genes were eliminated, Chu et al. (1998) clustered each remaining gene to the hand-picked profile with which it had the largest correlation coefficient.
Figure 4.4: Hand-picked genes in each of the seven groups of Chu et al. (1998) (Metabolic, Early I, Early II, Early-Mid, Middle, Mid-Late, Late), along with the mean trajectories (panel 8).
This approach provides a list of genes that appear to conform to each profile, but does not give a measure of the uncertainty of this classification, in common with other distance-based clustering procedures, e.g. Eisen et al. (1998) and Tamayo et al. (1999). The screening procedure of Chu et al. (1998) is rather ad hoc, and we have proposed a model-based filtering procedure. We applied the filter to the n = 6118 genes and decided to include the 1300 top-ranked genes for further analysis. All the genes hand-picked by Chu et al. (1998) were included in this subsample. The filter and its application to the sporulation data were discussed in detail in Chapter 3.
4.3.2
Model Description
In this experiment $y_{ij}$ represents the log2-ratio of expression for gene i at time $t_j$, relative to time t = 0, $i = 1, \cdots, 6118$, $j = 1, \cdots, 6$. We re-iterate that the aim is to discover structure over time, and in particular to determine genes that follow common trajectories. In the absence of further information, we would expect the trajectories to be smooth, but with no clear functional form after the onset of sporulation (that is, following time $t_1 = 0.5$), which occurs rapidly. To this end we assume a first-order random walk model for the cluster-specific trajectories, which is a special case of the dynamic models reviewed in Chapter 2. The full hierarchy is specified next.
• Stage 1: For the observed data, we assume $y_{ij} \mid \theta_{ij}, \sigma_e^2 \sim_{\mathrm{iid}} N(\theta_{ij}, \sigma_e^2)$.
• Stage 2: Assume $\theta_{ij} = \theta_j^k$ if $z_i = k$. Conditional on K, the data are modelled as arising from the K trajectories $\theta^1 = (\theta_1^1, \cdots, \theta_T^1), \cdots, \theta^K = (\theta_1^K, \cdots, \theta_T^K)$.
• Stage 3: Assume the cluster labels are independent and follow the multinomial distribution, with $p(z_i = k) = \pi_k$ for $i = 1, \cdots, n$, $k = 1, \cdots, K$.
• Stage 4: For the smooth mean trajectories, we assume a first-order random walk model
$$\theta_j^k = \theta_{j-1}^k + u_j^k, \qquad u_j^k \sim_{\mathrm{iid}} N(0, \Delta_j \sigma_u^{2,k}),$$
for $j = 2, \cdots, T$, where $\Delta_j = t_j - t_{j-1}$, so that observations closer in time are more likely to be similar. We assume an inverse-gamma prior on the cluster-specific variance, that is,
$$\sigma_u^{-2,k} \sim_{\mathrm{iid}} \mathrm{Ga}(\alpha_u, \beta_u).$$
For the first time point we have $\theta_1^k \sim_{\mathrm{iid}} N(0, \sigma_1^2)$.
• Stage 5: Complete the hierarchy with prior distributions on $\sigma_e^2$, $\sigma_1^2$, and K,
$$\sigma_e^{-2} \sim \mathrm{Ga}(\alpha_e, \beta_e), \qquad \sigma_1^{-2} \sim \mathrm{Ga}(\alpha_1, \beta_1), \qquad K \sim \mathrm{Poisson}(\lambda).$$
Collapsing over Stages 1–3 shows that, conditional on K, we are modelling each observed gene expression curve as arising from a mixture of K underlying multivariate normal distributions with smooth mean trajectories, i.e. $\theta_i = \sum_{k=1}^{K} \pi_k \theta^k$. The between-time variation is allowed to differ across clusters.
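As a small illustration, the Python sketch below draws cluster mean trajectories from this random walk prior at the sporulation time points, using the modal standard deviations (2 for σ1 and 1.5 for σu) discussed in Section 4.3.4; fixing these values, rather than drawing them from their priors, is a simplification made for the example.

```python
import numpy as np

def simulate_rw_trajectory(t, sigma1, sigma_u, rng):
    """One draw from the first-order random walk prior:
    theta_1 ~ N(0, sigma1^2); theta_j = theta_{j-1} + u_j,
    u_j ~ N(0, Delta_j * sigma_u^2) with Delta_j = t_j - t_{j-1}."""
    theta = np.empty(len(t))
    theta[0] = rng.normal(0.0, sigma1)
    for j in range(1, len(t)):
        theta[j] = theta[j - 1] + rng.normal(0.0, sigma_u * np.sqrt(t[j] - t[j - 1]))
    return theta

rng = np.random.default_rng(1)
t = np.array([0.5, 2.0, 5.0, 7.0, 9.0, 11.5])    # sporulation time points (hours)
prior_draws = [simulate_rw_trajectory(t, 2.0, 1.5, rng) for _ in range(10)]
```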
4.3.3
BDMCMC
Dimension-changing moves are made during the birth-death steps following Algorithm 2.4, with new clusters generated from the Stage 4 priors. Conditional on K, the parameters are updated using Gibbs samplers. For the cluster labels $\{z_i\}$,
$$p(z_i = k \mid \cdots) \propto \pi_k N_T(y_i; \theta^k, \sigma_e^2 I).$$
For the measurement error,
$$\sigma_e^{-2} \mid \cdots \sim \mathrm{Ga}\left( \alpha_e + \frac{nT}{2}, \; \beta_e + \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^{z_i})'(y_i - \theta^{z_i}) \right).$$
For the variance of the first time point of the mean trajectories,
$$\sigma_1^{-2} \mid \cdots \sim \mathrm{Ga}\left( \alpha_1 + \frac{K}{2}, \; \beta_1 + \frac{1}{2} \sum_{k=1}^{K} (\theta_1^k)^2 \right).$$
For the between-time variance within mean trajectories,
$$\sigma_u^{-2,k} \mid \cdots \sim \mathrm{Ga}\left( \alpha_u + \frac{T-1}{2}, \; \beta_u + \frac{1}{2} \sum_{j=2}^{T} \frac{(\theta_j^k - \theta_{j-1}^k)^2}{\Delta_j} \right).$$
The mean trajectories are updated as follows. For the first time point, $\theta_1^k \mid \cdots \sim N(m, s^2)$, where
$$s^2 = \left( \frac{1}{\sigma_1^2} + \frac{1}{\Delta_2 \sigma_u^{2,k}} + \frac{n_k}{\sigma_e^2} \right)^{-1}, \qquad m = s^2 \left( \frac{\theta_2^k}{\Delta_2 \sigma_u^{2,k}} + \frac{\sum_{i: z_i = k} y_{i1}}{\sigma_e^2} \right),$$
with $n_k = \#\{i : z_i = k\}$ the number of genes in cluster k. For the time points in between, $1 < j < T$, $\theta_j^k \mid \cdots \sim N(m, s^2)$, where
$$s^2 = \left( \frac{n_k}{\sigma_e^2} + \frac{1}{\Delta_{j+1} \sigma_u^{2,k}} + \frac{1}{\Delta_j \sigma_u^{2,k}} \right)^{-1}, \qquad m = s^2 \left( \frac{\sum_{i: z_i = k} y_{ij}}{\sigma_e^2} + \frac{\theta_{j+1}^k}{\Delta_{j+1} \sigma_u^{2,k}} + \frac{\theta_{j-1}^k}{\Delta_j \sigma_u^{2,k}} \right).$$
And for the last time point, $j = T$, $\theta_T^k \mid \cdots \sim N(m, s^2)$, where
$$s^2 = \left( \frac{n_k}{\sigma_e^2} + \frac{1}{\Delta_T \sigma_u^{2,k}} \right)^{-1}, \qquad m = s^2 \left( \frac{\sum_{i: z_i = k} y_{iT}}{\sigma_e^2} + \frac{\theta_{T-1}^k}{\Delta_T \sigma_u^{2,k}} \right).$$
For the mixing proportions,
$$\pi \mid \cdots \sim \mathrm{Dirichlet}(\delta_1 + n_1, \cdots, \delta_K + n_K).$$
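The trajectory update amounts to a pass of univariate normal draws along each cluster mean. A minimal Python sketch, assuming the curves currently assigned to cluster k are stacked in a matrix, is:

```python
import numpy as np

def update_trajectory(theta_k, y_k, t, sig2e, sig2u_k, sig21, rng):
    """One Gibbs pass over theta^k_1, ..., theta^k_T using the full
    conditionals above. y_k: (n_k, T) matrix of curves in cluster k."""
    T, n_k = len(t), y_k.shape[0]
    d = np.diff(t)                                  # Delta_j = t_j - t_{j-1}
    for j in range(T):
        prec = n_k / sig2e                          # contribution of the data
        mean = y_k[:, j].sum() / sig2e
        if j == 0:                                  # first time point
            prec += 1.0 / sig21 + 1.0 / (d[0] * sig2u_k)
            mean += theta_k[1] / (d[0] * sig2u_k)
        elif j == T - 1:                            # last time point
            prec += 1.0 / (d[-1] * sig2u_k)
            mean += theta_k[T - 2] / (d[-1] * sig2u_k)
        else:                                       # interior time points
            prec += 1.0 / (d[j] * sig2u_k) + 1.0 / (d[j - 1] * sig2u_k)
            mean += (theta_k[j + 1] / (d[j] * sig2u_k)
                     + theta_k[j - 1] / (d[j - 1] * sig2u_k))
        s2 = 1.0 / prec
        theta_k[j] = rng.normal(s2 * mean, np.sqrt(s2))
    return theta_k
```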
4.3.4
Analysis

We now discuss the prior specification for $\sigma_e^2$, $\sigma_1^2$, $\sigma_u^{2,k}$ and K. The prior for $\sigma_e^2$ is identical to that described in Section 3. The priors for each of the variances $\sigma_1^2$ and $\sigma_u^{2,k}$ were specified in a similar manner. We picked a “most likely” value and an “upper bound” for the standard deviation. These values were then converted to the inverse variance scale, and we picked the parameters of the gamma distribution to line up the mode with the most likely point, and the 95% point of the distribution with the upper value (which requires a numerical search).
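One way to implement this numerical search in Python is sketched below. The pairs reported in Figure 4.5, (a = 2.18, b = 10.72) and (a = 1.72, b = 5.00), are consistent with matching the mode of the implied standard-deviation density, which equals sqrt(2b/(2a+1)) under a Ga(a, b) prior on the precision, together with its 95% point; that is the version implemented here, and the exact parameterization used in the original search is an assumption.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gamma

def elicit_gamma(sd_mode, sd_upper):
    """Find (a, b) for a Ga(a, b) prior on the precision tau = sigma^(-2) so
    that the implied density of sigma has its mode at sd_mode, i.e.
    sqrt(2b / (2a + 1)) = sd_mode, and P(sigma < sd_upper) = 0.95."""
    def tail(a):
        b = sd_mode**2 * (2.0 * a + 1.0) / 2.0    # mode constraint fixes b given a
        # P(sigma < sd_upper) = P(tau > sd_upper^(-2))
        return gamma.sf(sd_upper**-2, a, scale=1.0 / b) - 0.95
    a = brentq(tail, 1.001, 100.0)
    return a, sd_mode**2 * (2.0 * a + 1.0) / 2.0

print(elicit_gamma(2.0, 5.0))   # roughly (2.18, 10.72), as used for sigma_1
print(elicit_gamma(1.5, 4.0))   # roughly (1.72, 5.00), as used for sigma_u
```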
Figure 4.5: Distributions of the standard deviation under the chosen priors: (a) σ1, mode 2, upper bound 5 (a = 2.18, b = 10.72); (b) σu, mode 1.5, upper bound 4 (a = 1.72, b = 5.00).
We chose the modal value for $\sigma_1^2$ to be $2^2$ (which is consistent with $\mu_0 \sim N(0, 2^2)$ in the filtering step), and for $\sigma_u^{2,k}$ (which is a conditional variance) we take the most likely value to be $1.5^2$ (so that in one unit time interval we expect the trajectory to move within ±3.0 with probability 0.95). Because the distribution of the standard deviation under an inverse-gamma distribution on the variance is very concentrated, we took relatively large upper values to avoid being too restrictive. For the sporulation data, we chose $5^2$ and $4^2$ for $\sigma_1^2$ and $\sigma_u^{2,k}$, respectively. Figure 4.5 shows the distributions of the standard deviation under the chosen priors. As we can see, some of the features of the sporulation data are being captured by the priors. Figure 4.6 shows four simulations from these priors with K = 10 fixed and 20 genes within each cluster. The prior for K was Poisson with mean 15. The Dirichlet prior was δ = (1, · · · , 1), following Richardson and Green (1997). Figure 4.7 shows the trace plot of K versus iteration number, and the marginal posterior of K. The BDMCMC was run for 100,000 iterations after a burn-in period of another 100,000
Figure 4.6: Four simulations from the random walk prior, with measurement error added. There are K = 10 groups, each of which contains 20 genes.
Figure 4.7: Trace plot of the number of clusters K and the posterior distribution of K, from BDMCMC on the sporulation data.
iterations, and the posteriors were summarized from the simulated samples. The results from BDMCMC were compared to MCMC with K fixed, as convergence is easier to assess for the latter; similar results between the two suggest convergence of the BDMCMC. We found that the posterior for K was highly sensitive to the prior choices, especially to the priors on the variances, adding to the need for informative prior distributions. For illustration, Figure 4.8 shows the curves that result from values of K = 10, 12, 16, 20 (these may be compared with Figure 4B of Chu et al. (1998)). We re-labelled the clusters based on the mean at the second time point (since the curves were relatively well separated at this point). This re-labelling showed good agreement with the decision-theoretic approach of Stephens (2000b). It appears our model was able to identify some key patterns in the curves. As K increases the extreme trajectories remain relatively constant. The posterior medians of the standard deviation σe were 0.57, 0.55, 0.54, 0.52 for K = 10, 12, 16, 20, respectively. The prior mode of the informative prior on σe was 0.25, which suggests there is some model misspecification, or that the measurement error increased after time zero. It is possible that the measurement error was underestimated by our initial analysis because we only used time zero data. Note that the estimates of σe decrease as the number of clusters increases. This is not surprising, because larger variances lead to larger, and therefore fewer, clusters. So it is crucial to have the measurement error carefully evaluated, as we did here. Figure 4.9 shows the mean curves along with the genes clustered to these curves, conditional on K = 20. Some clusters identified by our model closely resemble the hand-picked profiles: for example, compare our cluster 19 (third from left in the last row) to the 5th hand-picked profile (Figure 4.4), cluster 5 with the 7th hand-picked profile, or cluster 12 with the 3rd hand-picked profile. Because some of the hand-picked profiles are highly correlated themselves, we did not expect to recover all of them; however, our model was able to identify some of the key features in the sporulation data. We also note that although the re-labelling based on the second time point worked reasonably well, the label-switching problem was not totally resolved, as can be seen from cluster 10. Under our approach, co-expression is estimated via pairwise probabilities. These are not only invariant under re-labelling, but can also be evaluated across different numbers
Figure 4.8: Mean expression levels as a function of time, for numbers of clusters K = 10, 12, 16, 20.
Figure 4.9: Posterior profiles and genes classified to each of these profiles (using MAP classification), conditional on K = 20 clusters.
Figure 4.10: Heat map showing pairwise probabilities of common cluster membership of the 30 genes in Figure 4.4. The solid lines separate the different groups. On the left the shaded squares denote those pairwise probabilities greater than 0.5, while on the right the cut-off probability is 0.8.
of clusters. A good visualization tool for co-expression is a heat map. In Figure 4.10 we summarize the pairwise probabilities based on the BDMCMC analysis, averaging across all K; a sketch of the computation is given below. For illustration, the genes we examine are the 30 hand-picked genes highlighted by Chu et al. (1998), and reproduced in Figure 4.4. If the cluster membership were consistent with that figure then we would see blocks of shaded areas for genes in the same group (close to the diagonal), and white in the other areas. In fact we see that although there is a greater probability of co-expression close to the diagonal, there are both genes within the same hand-picked collection which do not appear to co-express, and genes in other groups that do co-express. Concentrating on the sixth group in Figure 4.4, gene 26 would not appear to co-express with any of the other 29 genes, while genes 24 and 25 co-express with each other and also with genes 11, 12, 16, 20, 21 and 22. Hence we see that our model offers new insights into co-expression.
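Computing these pairwise probabilities from the MCMC output is straightforward; a minimal Python sketch, assuming the saved label vectors are stacked one row per sample, is:

```python
import numpy as np

def pairwise_coexpression(z_samples):
    """Estimate Pr(z_i = z_j | y) from saved label vectors (one row per
    MCMC sample). The estimate is invariant to re-labelling and can be
    averaged across samples with different K."""
    S, n = z_samples.shape
    P = np.zeros((n, n))
    for z in z_samples:
        P += (z[:, None] == z[None, :])
    return P / S
```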
4.4
Example 3: Cell-Cycle Data
4.4.1
Data Description
These data were introduced in Chapters 1 and 3. Spellman et al. (1998) describe a number of microarray experiments that were carried out to create a comprehensive list of yeast genes whose expression levels vary periodically within the cell cycle. For illustration we analyze one of the data sets, in which expression levels were measured on genes synchronized via α-factor, and we randomly select 800 genes. Expression levels were measured every 7 minutes for 119 minutes (so that T = 18 measurements were recorded in total). The expression levels of the 800 genes are shown in Figure 4.14, after application of our model. We take as our objective the identification of genes that display similar cyclic patterns in terms of amplitude and phase.
Model Description
In this example our model formulation is strongly driven by prior information. Specifically we assume that the data arise from a mixture of functions with periodic structure. Under the general model framework described in Section 4.1, the main features of this model are as follows, • Stage 1: We build the cyclic feature into the mean structure of the curves. Specifically we assume that yit = Ri cos 2π(f t + φi ) + eit = Ai X1t + Bi X2t + eit , where Ri is the amplitude and φi the phase of gene i, X1t = sin(2πf t) and X2t = cos(2πf t), and f = 1/66 minutes as the known frequency (obtained from Spellman et al. (1998)). It may seem more natural to model in terms of amplitude and phase, but because of the irregular constraints on the collection (Ri , φi ) we prefer to formulate our mixture model in terms of θ i = (Ai , Bi ). Figure 4.11 gives the least squares estimates for (Ai , Bi ) (left panel), and (log Ri , log[(π/2 − φi )/(π2 + φi )]) (right panel) and clearly shows the irregularity of the joint distribution of the latter which does not allow us to simply parameterize in terms of functions of the phase and amplitude.
108
• Stage 2: Given cluster label $z_i = k$, we assume that $\theta_i = \theta^k$, the cluster-specific mean vector. In other words, we model the gene-specific effect as a smooth function $\theta_i = \sum_{k=1}^{K} \pi_k \theta^k$.
• Stage 3: Assume cluster labels $z_i \sim_{\mathrm{iid}} \mathrm{Multin}(1, \pi)$.
• Stage 4: We assume the cluster-specific means arise from bivariate normal distributions, $\theta^k \mid m, V \sim_{\mathrm{iid}} N_2(m, V)$, where we assume that m, V are known.
• Stage 5: Priors on $\sigma_e^2$, K and other hyper-parameters.
When the frequency and times are known, Stage 1 reduces to a simple linear structure, under which Gibbs samplers can be easily devised for the posterior simulation. The computation for this example is a special case of the enhanced model described in detail in the next chapter, and is therefore omitted here.
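To illustrate the correspondence between $(A_i, B_i)$ and $(R_i, \phi_i)$ mentioned at Stage 1, the following Python sketch computes the least squares estimates for one gene and converts them; the 7-minute time grid is taken from the data description, and the function name is our own.

```python
import numpy as np

f = 1.0 / 66.0                                    # known frequency, per minute
t = np.arange(0.0, 120.0, 7.0)                    # 7-minute grid, T = 18 points
X = np.column_stack([np.sin(2 * np.pi * f * t),   # X_1t
                     np.cos(2 * np.pi * f * t)])  # X_2t

def ls_amplitude_phase(y):
    """Least squares (A_i, B_i) for one gene's series y, and the implied
    amplitude R_i and phase phi_i, using A_i = -R_i sin(2*pi*phi_i) and
    B_i = R_i cos(2*pi*phi_i)."""
    A, B = np.linalg.lstsq(X, y, rcond=None)[0]
    R = np.hypot(A, B)
    phi = np.arctan2(-A, B) / (2 * np.pi)
    return A, B, R, phi
```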
4.4.3
Analysis
We used “weakly informative” priors, along the lines followed by Richardson and Green (1997) and Stephens (2000a). In particular we take m = (−0.215, −0.005), which corresponds to the means of the least squares estimates, and V diagonal with elements (4.45, 2.13), the squared ranges of the least squares estimates. Given that our model here is exploratory, we are not troubled by the mild dependence of the prior on the data. The priors for $\sigma_e^2$ and K were as for the sporulation data. Figure 4.12 shows simulations from our prior distribution. We are modeling the $(A_i, B_i)$ pairs as arising from a bivariate normal distribution, so in this example we would not want to filter out the constant genes, since this would leave a “hole” close to zero. For these data (although we acknowledge the random error around each curve in our first-stage distribution) we have effectively reduced the dimensionality of the data from 20 to 2.
Figure 4.11: Least squares estimates $(\hat A_i, \hat B_i)$ from the model $E[Y_{it} \mid A_i, B_i] = A_i \sin(2\pi f t) + B_i \cos(2\pi f t)$ for gene i (left panel), and $\log(\hat R_i)$ versus $\log\{(\pi/2 + \hat\phi_i)/(\pi/2 - \hat\phi_i)\}$ (right panel), i = 1, ..., 800.
0
20
40
60 Time
80
100
120
0
20
40
60
80
100
120
Time
Figure 4.12: 200 simulations of (Ak , Bk ) for the cell-cycle parameters (left); five trajectories without random error (center); 55 simulations from 5 groups, 4 simulated plus zero group with random error (right).
Figure 4.13: Trace plot of the number of clusters K and the posterior distribution of K (after a burn-in of 100,000 iterations), from BDMCMC analysis of the cell-cycle data.
The behavior of K as a function of iteration number, and the posterior distribution of K, are shown in Figure 4.13. It can be seen that the posterior for K is concentrated between 8 and 15 for this random sample of 800 genes. The computational overhead was a little less in this example than in the sporulation example, due to the use of a linear model that reduces a number of the required calculations. For the K = 11 analysis we re-labelled on the basis of the $B^k$ (which were relatively distinct); again we found good agreement with the decision-theoretic approach of Stephens (2000b). The means $\theta^k = (A^k, B^k)$, k = 1, ..., 11, are displayed in the left-hand panel of Figure 4.11. Figure 4.14 shows the collection of estimated mean profiles, conditional on K = 11. We classified each gene to a profile based on $\Pr(z_i \mid y)$, and these classifications are displayed in Figure 4.14. The largest cluster is in the third panel of the second row and corresponds to the “zero” cluster, which has a flat profile, supporting our speculation that a relatively small portion of the genome is cell-cycle regulated. We see that the classifications look reasonable, though there is clearly some
model misspecification. In particular, a number of the trajectories seem to have attenuated profiles at the beginning or as time increases. This observation led to our effort to enhance the model for cell cycle data, which is summarized in the next chapter.
4.5
Discussion
In this chapter we have proposed a general hierarchical model for the analysis of curve data, with emphasis on clustering. Such a model, when tailored for gene expression data, allows a quantitative description of gene co-expression, in contrast to the clustering techniques that are currently used. In addition, in our approach, the often unknown number of underlying clusters is directly modelled as a parameter, and estimated through the BDMCMC algorithm proposed by Stephens (2000a). Our approach also provides a natural measure of the uncertainty associated with clustering and classification. A systematic comparison with the more traditional techniques is unfortunately beyond the scope of this dissertation. In investigations not reported here, we have found that the number of clusters, and sometimes the co-expression probabilities, can be highly sensitive to the prior distributions, and we would have far less faith in our quantitative conclusions if the priors we had used were not based on biological and experiment-specific information. Although we have emphasized that posterior probabilities of co-expression are invariant to component re-labelling, for other summaries, such as the reporting of mean trajectories, the problem remains. An alternative to the procedure followed for the examples, and one closer to the scientific context, is to label on the basis of known marker genes, or of features with clear biological interpretation. In our model formulation of Section 4.1 we assumed a priori that the trajectory indicator variables $z_i$ were independent. In practice there will often be substantial information available to place collections of genes in the same cluster with high probability. Some attempts to improve the prior distribution on the $z_i$'s will be discussed in Chapter 6. More substantively, our eventual aim is to combine expression and sequence data. Models for the latter have been extensively developed; see for example Liu (1999). A Bayesian framework is ideally suited to such an endeavor, since it allows a natural combination of
Figure 4.14: Posterior profiles and genes classified to each of these profiles (using MAP classification), conditional on K = 11 clusters.
multiple data sources and the incorporation of prior information, which is likely to be essential in complex problems such as these. In fact, efforts to combine gene expression information with sequence information have already generated new insight into the transcriptional regulatory network in S. cerevisiae (Lee et al., 2002). We emphasize that the general hierarchical mixture model we have proposed is intended for clustering curve data in general. As we demonstrated through the gene expression examples, information regarding the shape and variation of the curves can be straightforwardly incorporated into the model. In cases where no information is available about the curve shape, semi-parametric models such as random walk models can be applied. Our approach therefore provides a flexible model-based framework for automatic curve fitting and clustering, which has the potential for a broad range of applications.
Chapter 5
ANALYSIS OF CELL CYCLE WITH GENE EXPRESSION DATA
5.1
Cell Cycle Regulated Gene Expression
Cells reproduce by duplicating their contents and then dividing into two. The repetition of this process is called the cell cycle. This cell division cycle is the fundamental means by which all living creatures propagate; it defines life. Meanwhile, abnormal cell divisions are responsible for many diseases, the most common form being cancer. Analysis of the control mechanisms and the factors essential for the process should therefore contribute significantly to our understanding of cell replication, malignancy, and reproductive diseases associated with genomic instability and abnormal cell divisions. In eukaryotic cells, the cell cycle is traditionally divided into four successive phases, G1 → S → G2 → M, based on cell events visible under a microscope. The relatively long period covering the G1, S and G2 phases is called interphase, during which the cell grows continuously. DNA replication is confined to the S phase. During the M phase, the duplicated chromosomes segregate and the cell splits in two. The cell cycle is covered in detail by many biology textbooks; see for example Alberts et al. (1994). Previous studies have revealed that the cell cycle is a complex yet highly coordinated process. The general strategy of the cell cycle can be summarized as follows: discrete cell cycle events occur against a continuous background; DNA is replicated during the S phase; and the sequence of cell cycle events is governed by a cell cycle control system, which cyclically triggers the essential processes of cell reproduction such as DNA replication and chromosome segregation. At the heart of this system is a set of protein complexes formed from protein kinase subunits and activating proteins called cyclins, whose inter-linked activation and deactivation ensure that the cell cycle continues with high fidelity. For good biological reviews of cell cycle regulation, see Kelly and Brown (2000), Morgan (1997) and references therein.
For years, biologists have relied on microscopic studies of cell cycles, using the model organism budding yeast Saccharomyces cerevisiae and its mutants. With advances in high-throughput technology, biologists are now able to study the regulation of the cell cycle at the genetic level. Microarray analysis has been used to identify a large number of genes in Saccharomyces cerevisiae that are candidates for cell cycle regulation. Spellman et al. (1998) conducted a set of experiments using different synchronization methods (α-factor arrest, temperature arrest of a temperature-sensitive mutant, elutriation synchronization). mRNA transcripts were extracted at a number of time points following the release of cells from synchronization, and the expression levels of approximately 6,000 yeast genes were measured using two-color cDNA arrays, relative to asynchronized cell samples. Genes expressed in a cell-cycle-specific manner were identified using a Fourier model. These researchers identified about 800 genes as being cell-cycle regulated, among which 300 were further classified as G1-phase genes, 71 as S-phase genes, 121 as G2-phase genes, 195 as M-phase genes and 113 as M/G1-phase genes. Another set of experiments for studying yeast gene expression during the division cycle was carried out by Cho et al. (1998). These researchers employed two temperature-sensitive mutants (cdc28 strains) of S. cerevisiae for synchronization. Using Affymetrix oligonucleotide arrays, gene expression following release from synchronization was determined at a sequence of time points. The set of transcripts was nearly identical to those measured with cDNA arrays by Spellman et al. (1998). Cho et al. (1998) identified 416 genes as cell cycle regulated by visual inspection and subjective knowledge. The cdc28 synchronization results from the Affymetrix experiments were converted to the ratio format of cDNA arrays and included in the cell-cycle analysis of Spellman et al. (1998), providing four comparable experiments. Since the publication of these two experiments, there has been an explosion of biological and analytical research on gene expression during the yeast cell cycle. The paper by Spellman et al. (1998) has currently been referenced by more than three hundred papers. It is also worth mentioning the work by Zhao et al. (2001). These authors developed a regression-type model, termed the single pulse model (SPM), to identify cell cycle regulated genes. In their analysis, they identified a total of 607 genes as periodic using the cdc28 data
set, and a total of 254 genes as periodic in at least two of the cdc28, cdc15 and α-factor experiments carried out by Spellman et al. (1998). Disagreement between different gene expression analyses is quite common, and often substantial. Even though it may be beneficial to compare results from different data sets, care must be taken, as cells may behave quite differently under different experimental conditions, and so may the gene expression. For example, the cell cycle time spans in the cdc28 experiment differ dramatically from those in the cdc15 experiment. It is problematic to apply the same analysis to different data sets without careful adjustment. The different levels of sophistication and focus of the analytical methods may also contribute to the discrepancies. To determine and characterize which (or whether) specific genes are expressed at particular times during the cell cycle, it is important that any uncharacterized sources of experimental variation, or reasonable alternative explanations for the patterns in the experimental data, are brought to light. In other words, with all the effort we spend on microarray experiments, can we really tell the signal from the noise? Are the cyclic patterns shown in the data true cell-cycle-specific patterns, artifacts of synchronization, or just random fluctuations? How much can we trust the grouping of genes from clustering algorithms? Are there ways to categorize cell-cycle regulated genes that are biologically plausible and statistically sensible? Clearly there are no simple answers to these questions, and biologists have been tackling them for at least two decades. With the help of high-throughput technologies such as microarrays, the expression of thousands of genes can now be measured simultaneously during the course of cell division. This allows us to study the functions and interactions of genes involved in cell cycle regulation. With the increasing amount of data generated from microarray experiments, the challenge of quantitative analysis of gene expression data has attracted researchers from many fields to join the effort. In this chapter, we restrict ourselves to data analysis at the higher end, i.e., we assume that the data have been properly processed and normalized. We develop a model-based approach to identify cell cycle regulated genes, and cluster them based on their expression patterns.
5.2
Data Description
The working data were provided by Tata Pramila and Linda Breeden at the Fred Hutchinson Cancer Research Center. Early cell cycle experiments, including Spellman et al. (1998) and Cho et al. (1998), had many defects in their design. They suffered from a range of problems, such as non-standard and immature microarray techniques, low resolution of time points, poor synchronization, lack of quality control, etc. Some of the experiments by Spellman et al. (1998) even failed to cover two full cell cycles. Similar problems exist with some other early published experiments; for example, the experimental time used by Chu et al. (1998) is only about half of the time that a normal sporulation process actually takes. Pramila and Breeden (2003) carefully designed and conducted their experiments to minimize these potential problems, and generated data of relatively high quality. Pramila and Breeden (2003) conducted a set of experiments; here we describe the three data sets we focused on. All three data sets use the same yeast transcripts, 6309 of them, including controls. The main data set (referred to as the 38wt data) consists of data from the following design: cell samples were first synchronized by α-factor; after the cells were released from synchrony, gene expression levels relative to asynchronized cell samples were measured using two-color cDNA microarrays at 5-minute time intervals from t = 0 to t = 120 minutes. This length covers about two full yeast cell cycles, and the 5-minute interval is a much finer resolution than those of Spellman et al. (1998) and Cho et al. (1998). In this experiment the experimental genes were labeled red and the reference genes green, which is the usual convention. We are fortunate to have the luxury of two other data sets. One is from an almost identical design to the 38wt data, and differs only in that the dyes were swapped. This dye-swapping data set provided us with important prior information regarding the magnitude and variability of gene expression. The other data set consists of several arrays with expression measures of asynchronized samples at different time points, relative to asynchronized samples. The same-versus-same nature of this reference data set enables us to gather information about measurement errors. Using the fully Bayesian model-based approach, we were able to incorporate additional information gathered from these data into our main analysis.
Figure 5.1: Expression of 100 genes known to be cell-cycle regulated. (Axes: Time (min), 0–120; Expression.)

Figure 5.2: Expression of 100 genes randomly selected. (Axes: Time (min), 0–120; Expression.)
An initial exploratory analysis, followed by closer examination, revealed that the mRNA sample at 105 minutes was contaminated; the data generated from that array were therefore dropped from subsequent analysis. Figure 5.1 shows the expression of 100 genes known to be cell cycle regulated (CCR). These genes have been identified by traditional methods in the past, and they do appear to demonstrate strong cyclic signals. In contrast, a large portion of the genes do not show strong signals; a random sample of 100 genes is shown in Figure 5.2.
5.3 Measurement Error
As reviewed in Chapter 1, there are various sources of variation involved in microarray experiments, and their identification and evaluation have proven crucial for making accurate inference. The variability remaining after accounting for identifiable systematic sources is often referred to as measurement error. During the span of the experiment, Pramila and Breeden (2003) measured gene expression levels of asynchronized samples at times 0, 25, 35, 45, 60 and 100 minutes into the experiment. The time 0 sample is also referred to as the steady state (ss) sample. Since these data are essentially same-versus-same hybridizations, presumably unaffected by the experimental conditions, their variation is indicative of measurement error only. We now summarize the analysis of these reference data, on which the prior distribution for the measurement error was based.

Figure 5.3 displays the expression of a random sample of 100 genes from these six chips. Overall the differences across the six chips are not significant, which is also suggested by the flatness of the expression time series. There were genes which exhibited large variation across time, but they did not appear to be cyclic under visual inspection. Figure 5.4 shows boxplots of the data from these six chips; again we can see that the average gene expression of these asynchronized samples is close to zero. It should be noted that the samples appear to be more spread out at later times, suggesting the measurement error may increase with time. This observation supports our speculation in the analysis of the sporulation data (see Chapter 4) that using only time zero
data could underestimate the measurement error. We therefore carried out a posterior analysis using the pooled data from all six chips.

Figure 5.3: Expression of 100 randomly selected asynchronized genes. (Axes: Time; Normalized Log Ratios.)
Let y_i denote the ith observation in the pooled reference data, y = {y_1, ..., y_N}. We assume a simple normal model for the data,

$$y_i \mid \mu, \sigma^2 \sim_{\mathrm{i.i.d.}} \mathrm{N}(y_i \mid \mu, \sigma^2), \quad (5.1)$$
where µ is the mean parameter and σ² is the variance parameter. Figure 5.5 shows the sampling distribution of the data, suggesting that the normal assumption is plausible. We assume a "non-informative" prior on (µ, σ²),

$$p(\mu, \sigma^2) \propto 1/\sigma^2. \quad (5.2)$$
This leads to the posterior distribution

$$p(\sigma^{-2} \mid y) = \mathrm{Ga}(\sigma^{-2} \mid a, b), \quad (5.3)$$

where $a = \frac{1}{2}(N-1)$ and $b = \frac{1}{2} N s^2$, with $N s^2 = \sum_{i=1}^{N} (y_i - \bar{y})^2$.
Figure 5.4: Boxplots for the data from each of the six chips (NLR.ss, NLR.t25, NLR.t35, NLR.t45, NLR.t60, NLR.t100).

Figure 5.5: Sampling distribution of the pooled reference data from all 6 chips.
We would like to use the parameter values from this posterior analysis as a way of obtaining a prior specification. But the large sample size from pooling the six chips led to a highly concentrated posterior distribution on the standard deviation σ: the sampling posterior median of σ is 0.151, with 95% sampling interval (0.150, 0.153). To avoid being too restrictive, we calibrated a and b in the same way as in Chapter 4. We set the modal value for σ to be 0.15, and an upper bound of 0.5 so that Pr(0 < σ < 0.5) = 0.95. Solving the resultant equations gave a = 1.52 and b = 0.05, under which the 95% sampling interval is (0.10, 0.68). These values were then used as priors in the subsequent filtering and clustering analyses. The impact of different prior choices on the distribution of σ can be seen in Figure 5.6.
Figure 5.6: Sampling posterior distribution of σ under different parameter values. The left panel (σ⁻² ~ Ga(12980.5, 194.44)) shows the highly concentrated distribution obtained directly from the posterior analysis; the right panel (σ⁻² ~ Ga(0.95, 0.02)) shows the sampling distribution under calibrated parameter values.
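The calibration of (a, b) can be reproduced numerically. A minimal sketch, assuming only the two conditions stated above; the induced density of σ under σ⁻² ~ Ga(a, b) is proportional to σ^{-(2a+1)} exp(-b/σ²), with mode sqrt(2b/(2a+1)):

```python
import numpy as np
from scipy import stats, optimize

# Calibrate Ga(a, b) for sigma^{-2} so that the induced density of sigma has
# its mode at 0.15 and Pr(0 < sigma < 0.5) = 0.95.
def conditions(params):
    a, b = params
    mode = np.sqrt(2.0 * b / (2.0 * a + 1.0))
    # Pr(sigma < 0.5) = Pr(sigma^{-2} > 4), a Gamma survival probability
    tail = stats.gamma.sf(4.0, a, scale=1.0 / b)
    return [mode - 0.15, tail - 0.95]

a, b = optimize.fsolve(conditions, x0=[1.5, 0.05])
print(a, b)  # values close to the calibrated (1.52, 0.05) used in the text
```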
5.4 Filtering
As reviewed above and in Chapter 3, in general only a small portion of genes participate in any specific cell process, largely due to highly specialized gene regulation. In this section,
we extend the filtering procedure outlined in Chapter 3 to cell cycle data. The aim is to first identify candidate periodic genes, and then perform a more reliable analysis on these candidates, using a more sophisticated model tuned to the cell-cycle nature of the data. In cell cycle analysis, our main interest lies in identifying and characterizing genes that are cell-cycle regulated. Genes which show differential expression that does not coincide with cell cycle events are not considered cell cycle regulated, and were consequently excluded from later analysis.

Let y_ij denote the gene expression at time t_j for gene i, for i = 1, ..., n and j = 1, ..., T. We assume a first-order Fourier model for the data,

$$y_{ij} = R_i \cos 2\pi(f_0 t_j + \phi_i) + \epsilon_{ij}, \quad (5.4)$$
where $\epsilon_{ij} \sim_{\mathrm{i.i.d.}} \mathrm{N}(0, \sigma_e^2)$ are the measurement errors, and (R_i, φ_i) are gene-specific parameters: R_i is the amplitude, i.e., the magnitude of the cyclic signal, and φ_i is the phase, governing where the signal peaks. The cell cycle frequency is denoted by f_0, fixed at 1/58 min⁻¹ and assumed common to all genes; the cell cycle span was estimated to be 58 minutes using the known CCR genes (Zhao et al., 2001). This model has been used by many researchers for cell cycle analysis, for example Spellman et al. (1998) and Shedden and Cooper (2002).

For the purpose of filtering, we want to test the following hypothesis independently for each gene i: M_0: R_i = 0 versus M_1: R_i ≠ 0. To carry out the filtering procedure described in Chapter 3, we need to specify the prior distributions. For the measurement error, we assume

$$\sigma^{-2} \sim \mathrm{Ga}(a, b), \quad (5.5)$$
with parameter values a and b determined from the posterior analysis of the reference data described in Section 5.3. We assume models M_0 and M_1 are equally probable a priori. Under M_0, the parameter φ_i is redundant. Under M_1, we assume R_i and φ_i are independent, with the following prior distributions:

$$R_i \sim_{\mathrm{i.i.d.}} \mathrm{Exp}(\lambda), \quad (5.6)$$

$$\phi_i \sim_{\mathrm{i.i.d.}} \mathrm{Unif}(-0.5, 0.5). \quad (5.7)$$
Because the trigonometric functions in the Fourier model are periodic, φ_i is restricted to (−0.5, 0.5) for identifiability, so the uniform prior on φ_i is "non-informative". We chose an exponential prior on the amplitude R_i because it has a simple form and the correct support. The parameter λ was based upon an exploratory analysis of the 100 known CCR genes. We found that the 100 known cell cycle genes showed consistently strong signals in both the main experiment and the dye-swapping experiment, and believed their expression levels were representative of genes with strong signals. So we extracted data for the 100 known cell cycle regulated genes from the dye-swapping experiment, transformed them into the same format as the 38wt data set by changing the signs of the log ratios, and analyzed them. Model (5.4) can be re-parameterized as

$$y_{ij} = A_i \cos 2\pi f_0 t_j + B_i \sin 2\pi f_0 t_j + \epsilon_{ij}, \quad (5.8)$$
with A_i = R_i cos 2πφ_i and B_i = −R_i sin 2πφ_i. Given f_0 and t_j, this is just a simple linear model, for which we can obtain least squares estimates of (A_i, B_i) and transform them back to (R_i, φ_i). We chose λ to be 1.43, so that the mean amplitude is 0.7 with variance approximately 0.5. These prior values correspond to our belief that the amplitudes of these known CCR genes are within the upper range of the signals; we would expect many CCR genes to have smaller amplitudes than these genes.

Figure 5.7 shows the expression of the 100 CCR genes, with fitted curves based on the least squares estimates. Figure 5.8 shows histograms of the least squares estimates of (R_i, φ_i), and Figure 5.9 shows a scatter plot of the estimates. The independence assumption appears reasonable, but the distributional assumptions could be improved. Figure 5.10 shows 100 gene expression time series simulated from the above priors. It suggests our prior choices are reasonable, as the patterns in the simulated data match quite closely what we would expect to see in the main data (compare to Figures 5.1 and 5.2).
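As an illustration of this step, a minimal sketch of the least squares fit for one gene (a hypothetical helper, assuming the convention A_i = R_i cos 2πφ_i, B_i = −R_i sin 2πφ_i used above):

```python
import numpy as np

def fit_fourier(y, t, f0=1.0 / 58.0):
    """Least squares fit of y_j = A cos(2*pi*f0*t_j) + B sin(2*pi*f0*t_j) + e_j,
    returning (A, B) and the implied amplitude/phase (R, phi)."""
    X = np.column_stack([np.cos(2 * np.pi * f0 * t),
                         np.sin(2 * np.pi * f0 * t)])
    (A, B), *_ = np.linalg.lstsq(X, y, rcond=None)
    R = np.hypot(A, B)                        # amplitude R_i
    phi = np.arctan2(-B, A) / (2 * np.pi)     # phase phi_i in (-0.5, 0.5]
    return A, B, R, phi

# e.g. t = np.arange(0, 121, 5) for the 38wt sampling times
```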
Figure 5.7: Observed gene expression of the 100 known cell cycle regulated genes, and their fitted values based on least squares estimates under model (5.4).

Figure 5.8: Histograms of least squares estimates of amplitude R_i and phase φ_i. The sampling distribution of R_i is skewed to the right, and the distribution of φ_i is rather flat over its range.

Figure 5.9: Scatter plot of least squares estimates of R_i and φ_i, suggesting that R_i and φ_i are uncorrelated.

Figure 5.10: N = 100 simulated gene expression time series based on the priors R_i ~ Exp(1.43), φ_i ~ Unif(−0.5, 0.5), σ_e² = 0.2².
We used importance sampling to estimate the posterior probabilities p_i = Pr(M_1 | y_i) (for details, see Section 3.2), and then ordered the genes based on these probabilities. Figure 5.11 displays the 100 highest ranked genes and the 100 lowest ranked genes. It appears that the filter was able to pick out genes with large variation. Because model (5.4) allows cyclic oscillation in the data, genes showing cyclic patterns tend to be ranked higher than genes that are not cyclic, even if the latter show differential expression. So the higher a gene is ranked by this filtering procedure, the more likely it is to be cyclic, and thus a candidate for cell cycle regulation.
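A minimal sketch of this calculation (a plain Monte Carlo average over prior draws, rather than the tuned importance sampler of Section 3.2; the function name and defaults are illustrative only):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(1)

def prob_M1(y, t, f0=1.0 / 58.0, lam=1.43, a=1.52, b=0.05, S=5000):
    """Estimate Pr(M1 | y) under equal prior model probabilities by averaging
    the likelihood over S draws from the priors on (R, phi, sigma^{-2})."""
    sig = 1.0 / np.sqrt(rng.gamma(a, 1.0 / b, size=S))   # sigma^{-2} ~ Ga(a, b)
    R = rng.exponential(1.0 / lam, size=S)               # R ~ Exp(lam)
    phi = rng.uniform(-0.5, 0.5, size=S)                 # phi ~ Unif(-0.5, 0.5)

    mean = R[:, None] * np.cos(2 * np.pi * (f0 * t[None, :] + phi[:, None]))
    log_like1 = stats.norm.logpdf(y[None, :], mean, sig[:, None]).sum(axis=1)
    log_like0 = stats.norm.logpdf(y[None, :], 0.0, sig[:, None]).sum(axis=1)

    log_m1 = logsumexp(log_like1) - np.log(S)    # log estimate of p(y | M1)
    log_m0 = logsumexp(log_like0) - np.log(S)    # log estimate of p(y | M0)
    return 1.0 / (1.0 + np.exp(log_m0 - log_m1))
```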
Figure 5.11: Expression of the 100 highest ranked genes (left panel) and the 100 lowest ranked genes (right panel).
At this point, we can either pick a cutoff point subjectively and proceed with the genes above the threshold, or choose the cutoff point based on more formal criteria, such as controlling the false discovery rate (FDR) and false negative rate (FNR). The concepts of FDR and FNR, and Bayesian procedures for controlling them, were discussed in Section 3.2.6. Note that FDR and FNR are competing quantities: optimal results for both error rates cannot be achieved at the same time. We would miss nothing by rejecting all hypotheses, so FNR = 0, but clearly FDR would then be high, and vice versa. Thus some
compromise has to be made, depending on the scientific question and our preferences. In our analysis we are in a "discovery" mode, and therefore a certain amount of false discovery is tolerable as long as we do not miss too many cell-cycle genes. Figure 5.12 illustrates the thresholds obtained by minimizing the loss function c·FDR + FNR, where c is a positive number chosen to reflect our preference in controlling FDR versus FNR. For example, if we are twice as concerned with FDR as with FNR, we could set c = 2 and consider the top 1340 genes. Of course, choosing an appropriate value of c is not a trivial task.
Figure 5.12: Optimal solutions to loss functions of the form c·FDR + FNR, for c = 0.5, 1, 2 and 4; the corresponding optimal numbers of rejections are 2491, 1790, 1340 and 1019.
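A sketch of how such thresholds can be computed from the filter output, assuming the posterior-expected FDR and FNR definitions of Section 3.2.6 (rejecting the D top-ranked genes, FDR is the mean of 1 − p over the rejected genes and FNR the mean of p over the rest):

```python
import numpy as np

def optimal_rejections(p, c=2.0):
    """Return the number of rejections D minimizing c*FDR(D) + FNR(D),
    given posterior probabilities p_i = Pr(M1 | y_i)."""
    p_sorted = np.sort(p)[::-1]            # rank genes by Pr(M1 | y)
    N = len(p_sorted)
    D = np.arange(1, N)                    # reject the top D genes, 1 <= D < N
    fdr = np.cumsum(1.0 - p_sorted)[:-1] / D
    fnr = (p_sorted.sum() - np.cumsum(p_sorted)[:-1]) / (N - D)
    loss = c * fdr + fnr
    best = int(np.argmin(loss)) + 1
    return best, fdr[best - 1], fnr[best - 1]
```

With c = 2, for instance, a rule of this kind yields the 1340 rejections shown in Figure 5.12.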
Figure 5.13 shows the optimal number of rejections for minimizing Bayesian FNR while controlling Bayesian FDR at the 0.05 level. This is similar to the frequentist practice of maximizing power while controlling the significance level. For this filtering procedure we employed a quite sophisticated model for the data, with carefully chosen priors, so we believe this control of FDR and FNR is more reliable than the filtering described in Section 3.3.1. Based on this result, we identified the top 1680 genes as candidates for cell cycle regulation; the corresponding cutoff for the marginal posterior probability Pr(M1 | y_i) was 0.78.¹

¹ The discrepancy of 5 genes relative to Figure 5.13 is due to the rounding error in 0.78.
Figure 5.13: Optimal solution to minimizing FNR, subject to FDR ≤ 0.05 (optimal cutoff Pr(M1 | y) = 0.78, giving 1685 rejections).
5.5 Model Development
We now give a detailed justification of our extended model for cell cycle data. In Wakefield et al. (2003), we developed a statistical framework for time course gene expression analysis, and applied it to the data set from Spellman et al. (1998). Although the model was able to identify some interesting groups of genes, we found that there was still a large amount
of attenuation in the data, especially at the beginning and the end of the experiment, which had not been accounted for. This can be seen in the residual plots from fitting model (5.4) to the CCR genes (Figure 5.14). There is also evidence suggesting unaccounted-for cyclic patterns. These are our primary motivations for further model development.

Figure 5.14: Residuals and squared residuals from fitting model (5.4) to the CCR genes.
To motivate the model enhancement, we give some relevant biological and experimental information concerning cell cycle control. As mentioned earlier, to study the cell cycle, yeast cells first have to be synchronized. Changes in the concentrations of the mRNA transcripts from each gene during synchronous cell cycles are then monitored after the cells have been released from synchronization. Synchrony can be obtained either by inducing specific arrests, or by collecting small G1 cells on the basis of size. Release from synchrony can be achieved by removing the arrest agents, or by transferring the cells to a growth medium, depending on the synchronization method. An excellent review of synchronization of budding yeast, especially α-factor synchronization, can be found in Breeden (1997); reviews of other synchronization methods appear in the same issue. Ideally, we would expect to see unperturbed cyclic patterns like the one in Figure 5.15, which corresponds to the following model,

$$Y_{it} = A_i \cos 2\pi f_0 t + B_i \sin 2\pi f_0 t + e_{it}, \quad (5.9)$$

where Y_it denotes the expression level of gene i at time t, A_i and B_i are gene-specific parameters determining the amplitude and phase of the cyclic pattern, f_0 is the true cell-cycle frequency, and $e_{it} \sim_{\mathrm{i.i.d.}} \mathrm{N}(0, \sigma_e^2)$ represents the random noise.
Figure 5.15: Mean profile under the model Y_it = A_i cos 2πf₀t + B_i sin 2πf₀t + e_it.
Cell-cycle analysis has its own features which require special attention. We are looking initially for cyclic expression patterns, and subsequently for time to peak during the cell cycle. The former indicates whether a gene is cell cycle regulated; the latter relates the pattern to specific cell cycle events, thus providing information about the gene's role in cell cycle control and in relation to other cell cycle genes. In some biological processes, gene expression profiles like the ones shown in Figure 5.16 would be considered the same, because the only difference among them is a shift in phase.
Figure 5.16: Multiple mean profiles with different phases under the model Y_it = A_i cos 2πf₀t + B_i sin 2πf₀t + e_it.
Statistical methods have been developed to deal with this kind of situation; see Zhang (2002) and Zhou et al. (2002). But for the cell-cycle analysis considered here we would like to distinguish between genes with different phases, because the time to peak contains crucial information about the interactions among cell-cycle regulated genes and the underlying cell-cycle control mechanism.

This first-order Fourier model is the starting point for our model refinement. Its adequacy is questionable, as we have shown that there is unaccounted-for attenuation in the cell-cycle data, and the model assumes a fixed frequency and constant amplitudes for all genes, which seems rather restrictive and not biologically plausible.

The synchronization causes an intrinsic difficulty in a cell-cycle study. To effectively observe the cell cycles, yeast cells have to be synchronized at first; on the other hand, our ability to observe the true cell-cycle span is impeded because the cell cycle can be altered by the synchronization. This fact has long been recognized by biologists, and has recently been brought up in gene expression analysis by Shedden and Cooper (2002).

α-factor synchronization is considered a better choice than other synchronization
methods because of its relative ease, sensitivity and gentleness to the cells. α-factor is a mating pheromone secreted by haploid S. cerevisiae cells of the α mating type. It blocks cell division in G1 and induces mating-specific gene expression. Even while transcription is held at START², cell mass increases and cell wall growth continues, resulting in enlarged and frequently distorted cells. After release, the large cell size leads to near elimination of the G1 phase and hence an abbreviated cell cycle. This is consistent with our observation that cell-cycle spans tend to be shortened early after release, with the difference decreasing over time. Breeden (1997) recommends that with α-factor arrest, the first cycle after release should be considered a recovery cycle, which may differ from the normal mitotic cycle in specific ways; any oscillating activity that persists through the second and third cycles after recovery is most likely a property of the normal mitotic cell cycle. To account for this phenomenon, we allow the cell-cycle frequency to deviate from the fundamental frequency f₀ across time:

$$f_t = f_t(\phi) = \begin{cases} f_0, & t = 0, \\ f_0 \times (t/t_{max})^{\phi}, & t > 0, \end{cases} \quad (5.10)$$
where t_max is the duration of the experiment, and φ is the parameter governing the change in frequency. The separate definition of the frequency at time 0 ensures it is well defined, because φ can take negative values. In application, time 0 usually refers to the time immediately after release from synchronization. Under this model, the cell cycle spans at later cycles are close to the actual ones. The model can now be written as

$$Y_{it} = A_i \cos 2\pi f_t(\phi_i) t + B_i \sin 2\pi f_t(\phi_i) t + e_{it}. \quad (5.11)$$
Typical profiles under this model look like the ones in Figure 5.17.

² An important checkpoint in the eukaryotic cell cycle. Passage through START commits the cells to enter S phase.

Figure 5.17: Mean profiles under the model Y_it = A_i cos 2πf_t(φ_i)t + B_i sin 2πf_t(φ_i)t + e_it: (a) early cycles lengthened; (b) early cycles shortened.

There are also drug-induced cell cycle arrests, which are unnatural, potentially toxic and non-specific. Genetically induced arrests using cdc mutants are more specific, and two such arrests (cdc28 and cdc15) were used by Spellman et al. (1998). However, the arrests evoked by these mutations are abnormal in the sense that they are caused by the loss of a critical gene product. The cells arrest in an apparently uniform state, but it cannot be assumed that all cell-cycle specific processes are halted, or that recovery from the arrest occurs under balanced growth conditions. Even with elutriation synchronization, which collects G1 cells based on size and introduces minimal perturbation, cells need some time before they resume normal mitotic cell cycles. With these synchronization methods, the first cell cycle should also be considered a recovery cycle, as with α-factor synchronization.

So if the first cycle cannot be trusted, why not run the experiments longer and only look at later cell cycles? This brings up our second point: we are unable to observe many cell cycles. Most of the time, the cyclic signals dissipate after three or four cycles. Several factors could contribute to this phenomenon. One is how well the cells are synchronized. But even with perfect synchrony, after two doublings only one in four of the cells has experienced the initial conditions. This, in addition to random fluctuation in the transcription of each gene, means that the cells soon become asynchronous and we can no longer observe the cyclic patterns.

To complicate matters further, certain signals we observe could be artifacts of the synchronization. For example, in the case of α-factor synchrony, because α-factor is a
mating pheromone, it will induce mating-specific gene expression. As a consequence, expression of many mating-related genes will be either elevated or suppressed, leading to increased or decreased transcription. In some extreme cases, the changes in expression level are so dramatic that the cyclic signals are obscured. In order to account for the change in the magnitude of transcription over time, we modify our model to allow a time-dependent amplitude,

$$Y_{it} = e^{-\gamma_i t} \times \{A_i \cos 2\pi f_t(\phi_i) t + B_i \sin 2\pi f_t(\phi_i) t\} + e_{it}. \quad (5.12)$$

Under this model, the shape of the curve is governed by a set of four parameters. Amplitude and phase are determined by A_i and B_i, γ_i governs the attenuation of the amplitude over time, and φ_i perturbs the cell cycle frequency. Note that Models (5.9) and (5.11) are special cases of Model (5.12), obtained when γ_i = 0, φ_i = 0, or both. Typical cyclic profiles under this model look like the ones in Figure 5.18 (compare to Figure 1 in Breeden, 1997).

As explained above, the changing cell-cycle span and magnitude of signals are systematic and correspond to actual biological phenomena. Although a large number of research papers have been published on the topic of cell-cycle control, few have tried to take them into account. Zhao et al. (2001) considered the second issue in their single-pulse model (SPM), in which they allowed the precision to decrease over time. Bar-Joseph et al. (2002) mentioned both issues, but used semi-parametric models instead of directly modeling the phenomena. Here we advocate a science-motivated, model-based approach towards cell-cycle gene expression analysis. We believe that it is less appropriate to rely totally on data-driven approaches, regardless of the biological context and the scientific questions waiting to be addressed.

Because every synchronization protocol has its limitations, a prudent strategy for determining whether a specific process is cell cycle regulated is to employ at least two different synchrony methods. If the oscillation can be observed through two or more mitotic cycles in two different synchrony experiments, it is unlikely the oscillation is induced by the arrest (Breeden, 1997). But combining analyses from different experiments is a difficult task, and has not been fully addressed by researchers. We leave it as future research, and do not attempt this problem here.
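Returning to Model (5.12), the following minimal sketch evaluates its mean trajectory with the time-varying frequency (5.10); the parameter values in the example call are illustrative only:

```python
import numpy as np

def mean_profile(t, A, B, gamma, phi, f0=1.0 / 58.0, t_max=120.0):
    """Mean trajectory e^{-gamma*t} * (A cos(2*pi*f_t*t) + B sin(2*pi*f_t*t)),
    with f_t = f0 at t = 0 and f0 * (t / t_max)**phi for t > 0, as in (5.10)."""
    t = np.asarray(t, dtype=float)
    ratio = np.maximum(t, 1e-12) / t_max      # guard against 0**negative
    ft = np.where(t > 0.0, f0 * ratio ** phi, f0)
    cyc = A * np.cos(2 * np.pi * ft * t) + B * np.sin(2 * np.pi * ft * t)
    return np.exp(-gamma * t) * cyc

# gamma > 0 attenuates the signal; phi < 0 shortens the early cycles
profile = mean_profile(np.arange(0, 121, 5), A=1.0, B=0.5, gamma=0.01, phi=-0.05)
```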
Figure 5.18: Mean profiles under the model Y_it = e^(−γ_i t) × {A_i cos 2πf_t(φ_i)t + B_i sin 2πf_t(φ_i)t} + e_it: panels (a) and (c) show decreasing amplitude; panels (b) and (d) show increasing amplitude.
5.6 The Model
Having extended the model to allow variable frequency and time-dependent amplitude, we now specify the hierarchical model in detail. Let y_ij denote the expression level of gene i measured at time t_j, and let y_i = (y_i1, ..., y_iT_i) denote the expression profile for gene i measured across T_i time points, so that under our model genes are allowed to be measured at different sets of time points or to have missing values.

• Stage 1: We assume each observed gene expression profile independently follows a multivariate normal distribution,

$$y_i \mid \theta_i, S_i \sim \mathrm{N}_{T_i}(\theta_i, S_i), \quad (5.13)$$
where θ_i is the T_i × 1 mean vector and S_i is the T_i × T_i covariance matrix, for i = 1, ..., n.

• Stage 2: We introduce a cluster label z_i, which indicates the cluster to which gene i belongs. Following the general model proposed in Chapter 4, we assume the mean vector is a context-specific function of covariates X_i and a cluster-specific parameter vector µ_k, with θ_i = h(X_i, µ_k) if z_i = k. For the cell cycle data the covariate is time, and the mean structure has the form of Model (5.12),

$$h(t_j, \mu_k) = e^{-\gamma_k t_j} \{A_k \cos 2\pi f_{t_j}(\phi_k) t_j + B_k \sin 2\pi f_{t_j}(\phi_k) t_j\}, \quad (5.14)$$
with µ_k = (A_k, B_k, γ_k, φ_k) characterizing the mean trajectory. We assume the covariance matrix is also characterized by cluster-specific parameter(s), so that S_i = S(ξ_k) if z_i = k. If T_i = T for all i, and there is no restriction on the covariance structure, we can assume S_i = Σ_k given z_i = k.

• Stage 3: We assume the cluster labels z_i are independent and identically distributed, conditional on the total number of clusters K and the mixing proportions π = (π_1, ..., π_K),

$$\Pr(z_1, \cdots, z_n) = \prod_{i=1}^{n} \Pr(z_i), \quad (5.15)$$

with

$$\Pr(z_i = k \mid K, \pi) = \pi_k, \quad (5.16)$$
for k = 1, ..., K and i = 1, ..., n.

• Stage 4: At this stage, we specify the prior distributions for the cluster-specific parameters. Assume

$$\mu_k \mid K, m, V \sim_{\mathrm{i.i.d.}} \mathrm{N}_q(m, V), \quad (5.17)$$

$$\Sigma_k^{-1} \mid K, g, R \sim_{\mathrm{i.i.d.}} \mathrm{Wishart}(g, (gR)^{-1}), \quad (5.18)$$

$$\pi \mid K, \delta \sim \mathrm{Dirichlet}(\delta), \quad (5.19)$$

together with priors on {ξ_k} if they are being used. We also include a "zero" cluster with A_k = B_k = 0; genes showing no cyclic pattern are assigned to this cluster.

• Stage 5: The hierarchy is completed with the specification of prior values and hyperpriors. Throughout the analysis, we choose δ to be a K-vector of 1's for the Dirichlet prior. When the total number of clusters K is considered unknown, we assume it follows a Poisson distribution with parameter λ. We choose g = p, the dimension of Σ_k, as this is the least informative choice in the sense that the distribution is the flattest while remaining proper (Wakefield et al., 1994).

The model described above is quite general, and for the cell-cycle analysis several versions of it can be chosen.

• In the most general form of the model, m, V and R are treated as hyper-parameters on which we place "vague" priors. This is an attempt to represent our belief that the mean trajectories and within-cluster variability of the clusters will be similar when viewed on some scale, without being informative about their actual values. This is similar to the variable-κ prior used by Stephens (1997). So in Stage 4, we can choose to place an improper uniform prior distribution on m, a "vague" Wishart(a, (aV_0)⁻¹) prior on V⁻¹, and a "vague" Wishart(b, (bR_0)⁻¹) prior on R, with a and b equal to the dimensions of V and R. We can choose V_0 and R_0 to be "weakly informative", as suggested by Richardson and Green (1997).
• Alternatively, we can fix m, V and R, and choose their values in a data-dependent but only weakly informative fashion, similar to the fixed-κ priors used by Richardson and Green (1997). Again using the known CCR genes from the dye-swapping experiment, we can obtain non-linear least squares estimates of µ_i = (A_i, B_i, γ_i, φ_i) for each gene. We can then choose m to be the mid-point of these estimates, V to be proportional to the squared range of the parameter estimates, and R proportional to the sample variance of the raw data. Both V and R should be calibrated to prevent them from being too informative; the common practice is to choose the multipliers so that the variances on µ_k and Σ_k are large. Since in the actual computation the cluster-specific parameters are generated from the Stage 4 priors, by fixing m, V and R we do not allow the distributions for generating new clusters to be updated by the data.

• Further simplification can be achieved as follows. We can assume Σ_k = σ_k²I. This prior assumes that, conditional on the mean, the residuals within each curve are independent and have the same cluster-specific variation. We assumed $\sigma_k^{-2} \sim_{\mathrm{i.i.d.}} \mathrm{Ga}(\alpha, \beta)$, with (α, β) chosen to be (1.52, 0.05), calibrated from the posterior analysis of the reference data. We also assumed the γ_k's and φ_k's were independent of the (A_k, B_k)'s, with the following priors at Stage 4:

$$\begin{pmatrix} A_k \\ B_k \end{pmatrix} \sim \mathrm{N}_2\left( \begin{pmatrix} m_A \\ m_B \end{pmatrix}, \begin{pmatrix} V_A & V_{AB} \\ V_{AB} & V_B \end{pmatrix} \right), \qquad \gamma_k \sim \mathrm{N}(0, \sigma_\gamma^2), \qquad \phi_k \sim \mathrm{N}(0, \sigma_\phi^2).$$

The independence assumption is suggested by the scatter plots of non-linear least squares estimates of the parameters from the 100 CCR genes (Figure 5.19), and the priors were chosen to be "weakly informative" along the lines of Richardson and Green (1997) and Stephens (2000a). We applied the last model to the cell-cycle data, and report the results here. More detailed investigation and comparison of these models will be conducted in future work.
Figure 5.19: CCR data: scatter plots of non-linear least squares estimates of the parameters (A_i, B_i, γ_i, φ_i) under Model (5.12).
5.7 Computation
The computation is similar to the algorithms described in detail in Chapter 4. It is a hybrid MCMC sampler, combining Gibbs sampling and a Metropolis-Hastings algorithm with a continuous birth-death process. When the number of clusters K is assumed unknown, dimension-changing moves are made through BDMCMC, following Algorithm 2.4. For a given K, the simulation of parameters from the corresponding posterior distributions can be achieved through Gibbs sampling and the Metropolis-Hastings algorithm. Letting "| ···" denote conditioning on all other variables, our hybrid MCMC sampler iterates through the following steps.

• Gibbs sampling steps: parameters with easy-to-sample full conditionals are updated using the Gibbs sampler:

$$\Pr(z_i = k \mid \cdots) \propto \pi_k f(y_i \mid \mu_k, \sigma_k^2),$$

$$\pi \mid \cdots \sim \mathrm{Dirichlet}(\delta_1 + n_1, \cdots, \delta_K + n_K),$$

with n_k = #{i : z_i = k}, the number of genes in cluster k,

$$\sigma_k^{-2} \mid \cdots \sim \mathrm{Ga}\Big(\alpha + \frac{1}{2} n_k p,\; \beta + \frac{1}{2} \sum_{i: z_i = k} (y_i - \theta_i)^T (y_i - \theta_i)\Big),$$

$$(A_k, B_k)' \mid \cdots \sim \mathrm{N}_2(m^*, V^*),$$

where

$$[V^*]^{-1} = \sum_{i: z_i = k} \sigma_k^{-2} X_i^T X_i + V^{-1}, \qquad m^* = V^* \Big( \sum_{i: z_i = k} \sigma_k^{-2} X_i^T y_i + V^{-1} m \Big),$$

and X_i is the T_i × 2 design matrix with jth row

$$\big(e^{-\gamma_{z_i} t_j} \cos 2\pi f_{t_j}(\phi_{z_i}) t_j,\; e^{-\gamma_{z_i} t_j} \sin 2\pi f_{t_j}(\phi_{z_i}) t_j\big).$$
• Metropolis-Hastings sampling: the γ_k's and φ_k's do not have full conditionals of simple form because of the non-linear functional form involved; however, updating them using random-walk Metropolis samplers is straightforward.

• Birth-death steps: for dimension-changing moves, clusters are born or killed through a continuous birth-death process; see Algorithm 2.4.
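For concreteness, a minimal sketch of the conjugate draw of (A_k, B_k) for one cluster, following the full conditional given above (the input names are illustrative: y_list and X_list hold the profiles and the T_i × 2 design matrices of the genes currently labelled z_i = k):

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_AB(y_list, X_list, sigma2_k, m, V):
    """One Gibbs draw of (A_k, B_k) from N2(m*, V*), with
    [V*]^{-1} = sum_i X_i'X_i / sigma_k^2 + V^{-1} and
    m* = V* (sum_i X_i'y_i / sigma_k^2 + V^{-1} m)."""
    V_inv = np.linalg.inv(V)
    prec = V_inv.copy()
    rhs = V_inv @ m
    for y_i, X_i in zip(y_list, X_list):
        prec += X_i.T @ X_i / sigma2_k
        rhs += X_i.T @ y_i / sigma2_k
    V_star = np.linalg.inv(prec)
    return rng.multivariate_normal(V_star @ rhs, V_star)
```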
5.8 Analysis
We now report the results from applying our enhanced hierarchical mixture model to the cell-cycle expression data. Among all 6309 genes (including controls) on each of the 24 microarrays (t = 105 was dropped due to mRNA contamination), 6141 had no missing data across all chips, 75 had one missing value, 25 had two missing values, and 68 had three or more. A close inspection revealed that genes with many missing values tend to be highly unreliable, so genes with three or more missing values were dropped (Figure 5.20). Some of the measurements were flagged as unreliable at the data processing stage; we nevertheless included them in the subsequent analysis because of the ad hoc nature of the flagging.

We first evaluate our filtering procedure. Applying their single pulse model to Spellman's data, Zhao et al. (2001) identified 1106 yeast genes as candidates for cell cycle regulation: 846 of them passed the SPM threshold in only one of the three data sets (α-factor, cdc28, cdc15), and 259 in at least two data sets. Among these 1106 "periodic" transcripts, 1078 were included in our 38wt data set. Using our filter, we found only about half of these 1078 genes would be classified as candidates for cell cycle regulation. More specifically, only 529 (49%) genes had Pr(M1 | y) ≥ 0.9, 584 (54%) had Pr(M1 | y) ≥ 0.8, and 626 (58%) had Pr(M1 | y) ≥ 0.7. Figure 5.21 shows the genes above and below the threshold 0.8. Genes with Pr(M1 | y) < 0.8 show weaker cyclic patterns than those with higher posterior probabilities, and many of them are likely to be false discoveries.

Applying SPM directly to the 38wt data, we obtained 899 genes with z-scores above the threshold of 5, which would thus be classified as periodic under SPM.
Figure 5.20: Expression of genes with three or more missing measurements: N = 68.
We found relatively high agreement between our filter and SPM: 683 out of the 899 (76%) were among our top 1292 genes with Pr(M1 | y) ≥ 0.9, 745 out of the 899 (83%) were among our 1617 top-ranked genes with Pr(M1 | y) ≥ 0.8, and 774 (86%) were among our top 1839 genes with threshold 0.7. Figure 5.22 shows the 745 genes included among our top 1617 genes, and the 154 genes which did not pass our filter. The 154 genes, which would have been classified as "periodic" by SPM, do not appear periodic under visual inspection, and were removed by our filter.

We next evaluate the extension to the mean structure. Figure 5.23 shows the observed curves and the fitted curves based on non-linear least squares estimates from Model (5.12). Compared to Figure 5.7, the improvement in the attenuation adjustment is clear; the residual plots in Figure 5.24 illustrate the improvement, as compared to Figure 5.14.

We have found that the number of clusters K is highly sensitive to the prior specification: not only the Poisson prior, but also other priors on the variance parameters, which can affect the size and shape of the clusters. This is in agreement with Stephens (2000a).
Figure 5.21: Among the 1078 genes identified as periodic using SPM from the Spellman data, only 584 would have passed our filter at 0.8 (left panel); the other 494 genes (right panel) would have failed and would not be classified as periodic by our filter.
Figure 5.22: Among the 899 genes identified as periodic by SPM using the 38wt data, 745 passed our filter with Pr(M1 | y) ≥ 0.8 (left panel), and 154 did not pass our filter (right panel).
Figure 5.23: Observed expression of the 100 known cell-cycle regulated genes, and their fitted values based on non-linear least squares estimates under Model (5.12).

Figure 5.24: Residuals and squared residuals from fitting Model (5.12) to the 100 CCR genes.
In addition, our enhanced model allows genes to be classified at a finer scale (with more features), which led to a large number of clusters. Given that there is no clear definition of the underlying regulation pathways during the cell cycle, we found this number hard to interpret and highly unreliable, so we decided to restrict our attention to the analysis with K fixed.

Figures 5.25 and 5.26 display the classification and estimated mean profiles from fitting the enhanced model to the 38wt data with K fixed at 16. We re-labelled the clusters on the basis of time to the first peak. This decision is based on the fact that cell cycle events are regulated in an orderly fashion, with the early activation or deactivation of transcription factors often responsible for the next wave of gene expression; this re-labelling therefore has a nice biological interpretation.

Our model was able to identify some interesting cell-cycle gene clusters, and the effect of the model enhancement is obvious. From Figure 5.26, we can see that clusters 3, 6, 8, 13 and 16 have strong cyclic signals, and they all show the dissipation of synchrony over time. In particular, cluster 3 has a greatly heightened first peak, large enough to obscure the later cyclic pattern; without the improvement to the model, we might not have been able to identify this group of genes. We suspect these genes are related to the mating process, so that their expression is induced by the pheromone. Several clusters appear to have shortened first cycles, such as clusters 2, 3 and 11. These are G1 or G2 phase genes, confirming our speculation that the synchronization may shorten the growth phase.

At least 9 of the 13 genes classified into cluster 8 are the S-phase histone coding genes. The products of these genes form a single complex that is used for DNA condensation; these genes are coordinately regulated and have been well characterized. A closer inspection reveals that many genes in cluster 2 are M–G1 genes and share a promoter element called ECB; many genes in clusters 5 and 6 are late G1 genes and share MCB and/or SCB promoter elements; cluster 9 consists of G2-phase genes, many of which also share the MCB/SCB promoter elements; and many genes in cluster 13 appear to share MCM1 and FKH sites. We also identified a potential new element in cluster 11; further investigation through sequence analysis and validation experiments is currently underway.

Note that the time to first peak in cluster 16 is larger than 58 minutes, the normal cell cycle span we used.
Figure 5.25: Final clustering with K = 16 fixed, different scales. Panel titles give the cluster size n and the estimated time to first peak; ordered by time to first peak, the clusters are n = 14 (Tpeak = 0), 68 (9.1), 26 (14.5), 138 (19.1), 130 (22.4), 48 (24.4), 161 (28.8), 13 (34.3), 101 (36.3), 172 (39.7), 106 (43.5), 247 (44.7), 48 (55.3), 282 (56.3), 98 (68.1) and 28 (69.3).

Figure 5.26: Final clustering with K = 16 fixed, common scale.
This is because the attenuation at the beginning of the experiment is so large that the first peak of this cluster is obscured. If we shift the time to peak by 58 minutes, we can see that this group actually coincides with cluster 2, except with much larger amplitude. As we show below, there is indeed strong co-expression between the two clusters (Figure 5.27).

Under Bayesian mixture models, specific clusters are susceptible to the re-labelling problem (see Section 4.1.3). But as suggested in Wakefield et al. (2003), we can examine the probabilities of co-expression p(z_i = z_i′ | y), which are invariant to re-labelling. A good visual display of co-expression is the heat-map. Due to space limits, we show a sub-sample of the clusters. Figure 5.27 shows the co-expression, with dark areas indicating high co-expression; as expected, the shaded areas are close to the diagonal, suggesting strong co-expression within clusters.
Figure 5.27: Heat-map of probabilities that two genes share a common label, for clusters 2, 3, 5, 6, 8, 13, 15, and 16, with (a) cutoff = 0.5 and (b) cutoff = 0.1. Shaded blocks correspond to pairwise probabilities larger than the chosen cutoff.
The posterior classification probability of each gene, p(z_i = k | y), provides a natural measure of uncertainty concerning the clustering of each individual gene. However, it is also of interest to measure the strength of the clusters themselves: how tight the genes are within a cluster, and how much overlap there is between different clusters. We therefore examine the sensitivity and specificity of the clusters, where sensitivity is the probability of co-expression given labelling in the same cluster, and specificity is the probability of non-co-expression given labelling in different clusters. Such quantities cannot be evaluated with traditional clustering approaches. The sensitivity of cluster k is estimated by

$$\mathrm{sensitivity} = \sum_{i, i' \in C_k} p(z_i = z_{i'} = k \mid y) / N_{k1}, \quad (5.20)$$
where C_k denotes cluster k, and N_k1 is the number of distinct gene pairs classified into C_k. The specificity of cluster k is estimated by

$$\mathrm{specificity} = \sum_{i \in C_k,\, i' \in C_{k'},\, k' \neq k} p(z_i = k, z_{i'} = k' \mid y) / N_{k2}, \quad (5.21)$$
where C_k and C_k′ are different clusters, and N_k2 is the number of distinct gene pairs with only one gene classified into C_k.

The sensitivity and specificity of the 16 clusters are shown in Figure 5.28. Cluster 1 is the "zero" cluster for non-cyclic genes, so it is not surprising that it has the lowest sensitivity. Clusters 11 and 12 have only weak signals and overlap with each other, hence their sensitivity and specificity are low. Cluster 8 contains a tight group of histone genes with strong cyclic signals, and it is ranked the highest in terms of both sensitivity and specificity. Other high quality clusters include clusters 3, 6, 13 and 16, as is evident from Figure 5.26. The sensitivity and specificity estimates provide a natural quantitative measure of the quality of the clusters, based on which we can focus on the high quality clusters and proceed with validation or more sophisticated analysis.
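These estimates are easy to compute from MCMC output. A minimal sketch, assuming Z is an (S, n) array of sampled label vectors and labels is the point classification defining C_k (both inputs are illustrative names):

```python
import numpy as np

def cluster_strength(Z, labels, k):
    """Monte Carlo estimates of sensitivity (5.20) and specificity (5.21)
    for cluster k from S sampled label vectors Z."""
    S = Z.shape[0]
    in_k = labels == k
    Zk = (Z == k).astype(float)            # indicators z_i^(s) = k

    # Pr(z_i = k, z_i' = k | y) for i, i' in C_k, averaged over distinct pairs
    co = Zk[:, in_k]
    pair = co.T @ co / S
    n_k = co.shape[1]
    sens = (pair.sum() - np.trace(pair)) / (n_k * (n_k - 1))

    # Pr(z_i = k, z_i' != k | y) for i in C_k and i' outside, averaged over pairs
    cross = co.T @ (1.0 - Zk[:, ~in_k]) / S
    spec = cross.mean()
    return sens, spec
```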
Figure 5.28: Strength of co-expression through sensitivity and specificity.
Studying the co-expression can also provide important information about the relationships between clusters. For example, Figure 5.29 shows several genes identified from the heat-map which had high co-expression with genes in cluster 16, even though they were classified into cluster 2. Examination of the mean trajectories reveals that the peaks of one trajectory appear to coincide with those of the other, suggesting these two clusters could be co-regulated, although the magnitude of the signals differs. One may argue that these genes should be considered co-regulated as long as the peaks and troughs of their oscillations concur, regardless of magnitude. Here we distinguish these genes, for we speculate that genes with higher amplitude may contain more promoter elements, or some other element(s) responsible for increased expression levels, or may have unstable mRNA transcripts. In fact, a sequence search reveals that clusters 16 and 2 do share common MCM1 elements. The relevant motif is tttccnnnnnnggaaa, a palindrome to which two MCM1 proteins bind; such binding is required for transcriptional activation at the M/G1 boundary. And as we suspected, the
Figure 5.29: Some genes classified into cluster 2 which co-express with cluster 16, shown with the mean profiles of clusters 2 and 16.
cluster 16 genes have multiple elements and a larger consensus sequence, while the cluster 2 genes have only one site; many cluster 2 genes do not have the MCM1 site at all. This leads us to suspect that there may be new element(s) in the cluster 2 genes with properties similar to MCM1. We will continue investigating in this direction with our biology collaborators.
5.9 Conclusion and Discussion
In this chapter, we reported a detailed analysis of cell cycle gene expression data. We improved the filtering procedure with more realistic models and carefully chosen priors. The model-based filtering has a natural probabilistic interpretation, and allows FDR and FNR to be controlled during filtering, which is an advantage over other, heuristic methods. Our filter appears to perform well, and produced results with additional benefits over the single pulse model developed by Zhao et al. (2001).

After the initial filtering, we further clustered genes based on the similarity of their expression profiles, through a fully Bayesian model-based approach. The model was improved to allow amplitude and frequency to change over time; both extensions have very plausible biological explanations.

The clustering has important implications for identifying promoter elements. Promoter elements are sequences to which transcription factors, active only at specific points in the cell cycle, bind and activate transcription. They are short sequences, some only 5 base pairs long, and are therefore difficult to identify in the genome because of their low information content. By combining information on groups of coordinately regulated genes with sequence alignment, we have a better chance of finding the promoter elements and mapping out the regulatory circuitry of the cell cycle.

The computation was carried out through MCMC. An unknown number of clusters was dealt with by integrating a birth-death process with conventional MCMC techniques. This model-based approach allowed us to combine clustering with estimation, to quantify the uncertainty of the clustering in a natural way, and to evaluate the strength of co-expression of genes; none of these can be handled easily by traditional approaches. We have demonstrated that our enhanced model can provide further insight into our understanding of the cell-cycle transcription programs.

In our enhanced model, each cluster is characterized by a set of four parameters. Intuitively, the finer we characterize the clusters, the easier it is to distinguish them, and therefore the more clusters we find. So we were not surprised that a large number of clusters were identified under this parameterization. Although many numerical methods for detecting underlying clusters based on gene expression data have been published, none of them is satisfactory. From our experience we have found that without plausible interpretation and biological validation, the number of clusters produced by numerical analysis is highly unreliable, and sometimes even misleading. The clusters are defined by the model, which in turn is motivated by the biology. The ultimate validation of the clustering should be based on scientific investigation, with the data analysis providing numerical support and further hypotheses. In other words, the conclusions should be based on science, not just data analysis.
Chapter 6
EXTENSIONS AND FURTHER WORK

This chapter covers a few miscellaneous issues, including additional sensitivity analysis, extension of the FDR to allow dependence, and robust clustering with t-distributions. We also report some progress in attempting to incorporate dependence information through more flexible prior specification of the cluster labels z_i. Much of the work reported in this chapter is on-going research, subject to improvement and modification.

6.1 Sensitivity of Posterior Distribution of K
Our model-based clustering framework falls between unsupervised and supervised learning, in the sense that it allows characteristics and structures of the clusters to be built in through model and prior parameterizations, while also allowing the number of clusters K to vary and new clusters to be identified. The number of clusters and the clusters themselves can be viewed as the optimal solutions to the partitioning problem under the general hierarchical mixture model (see Section 4.1). When the number of clusters K is fixed, the clusters can be viewed as optimal (in terms of posterior distributions) partitions with respect to the restricted models, and therefore the clusters are still well defined.

Finding the number of underlying clusters K is one of the primary goals of any clustering analysis. As we have discussed in previous chapters, there are many factors that can affect the estimation of the number of clusters with gene expression data, including the size of the data, the resolution of the experiments, the features used to distinguish clusters, measurement errors, and the model and prior specifications. In this section, we carry out a sensitivity analysis to show how, in our model, the number of clusters is related to the prior specification. Due to the difficulty in ascertaining the true number of clusters and cluster labels with real gene expression data, we illustrate using simulated data. Figure 6.1, identical to Figure 4.1, shows the simulated curves, gene-specific parameters (intercept and slope), and cluster-specific means and covariances.
Figure 6.1: (a) A total of 50 simulated curves, with t = (−3, −1, 1, 3). (b) The least squares estimates of intercepts and slopes. Units 1–15 are in group one, 16–30 in group two, and 31–50 in the third group. The groups are labelled by their intercepts.
If the number of clusters K is assumed unknown, we may place a Poisson prior with mean λ on it. Figure 6.2 shows the posterior distributions of K under various values of λ. It can be seen that larger values of λ lead to larger estimates of the number of clusters, as expected; this problem is more serious with fewer, more spread-out data. Given the exploratory nature of the analyses reported here, we think a guideline for choosing the Poisson mean λ is to take an educated guess, and then select a relatively large value so that more clusters are encouraged, reducing the chance of missing interesting features.

We let θ_i = (α_i, β_i)′, where α_i and β_i are the intercept and slope for unit i, i = 1, ..., 50.
Figure 6.2: Posterior distributions of K: comparison of sensitivity to the Poisson priors (λ = 1, 3, 7, 10, 15) and hyper-parameters between (a) fixed and (b) random population parameters.
When fitting the general hierarchical mixture model to these simulated data, we essentially put a mixture of normals on the unit-specific random effects,

$$\theta_i \mid z_i = k \sim \mathrm{N}_2(\mu_k, \Sigma_k),$$

and for the cluster-specific parameters we have the population distributions

$$\mu_k \sim \mathrm{N}_2(m, V), \qquad \Sigma_k^{-1} \sim \mathrm{Wishart}(g, (gR)^{-1}).$$

At this stage, we can either fix the population distributions (the 'fixed-κ' priors), so that the distributions used to generate new clusters are not updated by the data, or we can put hyper-priors on the population distributions (the 'variable-κ' priors) to allow them to be updated:

$$p(m) \propto \mathrm{const}, \qquad V^{-1} \sim \mathrm{Wishart}(a, (aV_0)^{-1}), \qquad R \sim \mathrm{Wishart}(b, (bR_0)^{-1}).$$
The latter is used to express our belief that the cluster means and variances are similar, without being specific about their actual values. Both types of priors have been discussed by Richardson and Green (1997) and Stephens (2000a), and it is of interest to compare the two settings. However, we have found little difference in the posterior distributions of K in our setting when the fixed-κ priors and variable-κ priors are "non-informative" or only "weakly informative". Figure 6.2 shows the posterior distributions of K under the two types of priors. Following Richardson and Green (1997), for the fixed-κ prior, m was chosen to be the mid-point of the least squares estimates (LSEs), V was a diagonal matrix with the squared range of the LSEs as its diagonal, g = 2, and R was also chosen to be proportional to the squared range of the LSEs. For the variable-κ priors, a = b = 2, and both V_0 and R_0⁻¹ were chosen to be proportional to the squared range of the LSEs. As illustrated in Figure 6.2, both priors gave very similar results, with the distributions under the variable-κ priors more spread out. As mentioned before, the robustness of the results to the prior specifications is largely due to the fact that the simulated clusters are tight and well separated.

In fact, the population priors on the cluster-specific parameters can be highly influential on the posterior distribution of K. The parameters m and V govern the location of the cluster centers, so smaller values of V can lead to a stronger shrinkage effect in the estimates. The population prior on the cluster-specific covariance matrices has an even greater effect on the number of clusters K. Table 6.1 lists the posterior distribution of K under different population priors on Σ_k: when the cluster-specific covariance matrices are restricted to be small, the number of clusters increases. This is not surprising, because the shape and size of the clusters are affected by the covariance matrices, so there will be more clusters if only small clusters are allowed.
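A minimal sketch of the fixed-κ construction just described (theta_hat is a hypothetical (n, 2) array of the least squares intercept/slope estimates; the proportionality constants are calibration choices, not fixed by the text):

```python
import numpy as np

def fixed_kappa_prior(theta_hat):
    """Data-dependent, weakly informative fixed-kappa prior values:
    m = mid-point of the LSEs, V and R proportional to their squared range."""
    lo, hi = theta_hat.min(axis=0), theta_hat.max(axis=0)
    m = (lo + hi) / 2.0
    L = np.diag((hi - lo) ** 2)    # squared range of the LSEs on the diagonal
    V = L                          # V proportional to the squared range
    g = 2                          # dimension of Sigma_k
    R = L                          # R proportional to the squared range
    return m, V, g, R
```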
Table 6.1: Influence of the prior distribution Wishart(g, (gR)⁻¹) for Σ on the posterior distribution of K.

R^{1/2}    Range of K with p(K | y) ≥ 0.05    Range of K with p(K | y) ≥ 0.001    K with highest p(K)
L          [2–4]                              [2–6]                               3
L/5        [2–5]                              [2–7]                               3
L/10       [2–5]                              [2–8]                               4
L/20       [3–6]                              [3–9]                               5
L/30       [3–9]                              [3–15]                              7

Simulated data with fixed-κ priors. L denotes the diagonal matrix with the range of the LSEs as its diagonal.
For the hierarchical mixture models applied to gene expression data, we have reached a similar conclusion: the number of clusters can be sensitive to the priors, especially those on the variances. In summary, the posterior distribution of the number of underlying clusters K can be highly sensitive to the priors specified. We therefore recommend that informative priors be used when extra information is available, and that a sensitivity analysis be carried out after the clustering analysis with the initial prior is done.

6.2 Extension to FDR
One limitation of differential detection with FDR is that the hypotheses are assumed to be independent of each other. When dealing with gene expression data or spatial data, such an assumption is not always plausible. Specifically, co-regulated genes tend to have correlated expression profiles, so whether the hypothesis concerning one gene in a co-regulated group is accepted or rejected gives us information about whether other genes in the same group will be accepted or rejected. In other words, the chance that a differentially expressed gene is detected when considered as part of a group should differ from that when the gene is considered in isolation.

In gene expression analysis, the primary motivation for clustering is the notion that
genes sharing similar expression profiles are likely to be co-regulated. It is therefore not consistent to treat genes as independent initially while acknowledging that some of them are correlated. This issue has been recognized by some researchers, e.g., Storey and Tibshirani (2003), but the effort to adjust for dependence while controlling FDR has been limited. Allowing dependence among hypotheses is not straightforward with frequentist FDR, as there is no easy way to impose dependence on the p-values. With Bayesian FDR, on the other hand, dependence can be imposed quite naturally through hierarchical models. In this section we outline such a procedure. The key idea is to calculate marginal posterior probabilities following the fitting of a hierarchical mixture model to the gene expression data.

Here we follow the notation introduced in Chapter 3. Suppose there are K + 1 clusters, where the number of clusters K is either known a priori or estimated from the data. Each cluster corresponds to a model M_k indexed by parameter θ_k, k = 0, ..., K; M_0 is the "zero" cluster, representing the group of genes with constant expression across time, often with θ_0 = 0. Model M_k with k > 0 represents a group of genes showing non-constant expression levels and sharing the cluster-specific parameter θ_k. The marginal posterior probability of gene i belonging to model M_k is given by

p(M_k | y_i) ∝ p(y_i | M_k) p(M_k),        (6.1)

where p(M_k) is the prior probability of model M_k, and

p(y_i | M_k) = ∫ p(y_i | θ_k) p(θ_k | M_k) dθ_k,        (6.2)

for k = 0, ..., K. Let M̄_0 denote the collection of "non-zero" models, M̄_0 = {M_1, ..., M_K}. Differentially expressed genes are identified by testing the hypotheses

H_0: y_i ∈ M_0   versus   H_1: y_i ∈ M̄_0,    i = 1, ..., N.

Reject H_0 if p(M̄_0 | y_i) = 1 − p(M_0 | y_i) ≥ c, where c is a cutoff determined by the thresholding procedure (3.18) (also see Appendix B). During the calculation of the marginal posterior probabilities, information from other genes in the same cluster is "pooled" through the cluster-specific parameters, so the dependence among genes is taken into account.
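As a concrete illustration of the thresholding step, the sketch below computes the probabilities p_i = p(M̄_0 | y_i) from per-model marginal likelihoods and selects the smallest cutoff c whose estimated Bayesian FDR stays below a target level. The bFDR estimate used here, the average of (1 − p_i) over the declared discoveries, is one standard form; the exact procedure (3.18) may differ in detail, and the function names are ours.

```python
import numpy as np

def posterior_nonzero_prob(log_marglik, log_prior):
    """p_i = p(M0-bar | y_i) = 1 - p(M0 | y_i) for each gene.

    log_marglik : (N, K+1) array of log p(y_i | M_k); column 0 is the zero model.
    log_prior   : (K+1,) array of log p(M_k).
    """
    log_post = log_marglik + log_prior                 # unnormalized, eq. (6.1)
    log_post -= log_post.max(axis=1, keepdims=True)    # stabilize before exp
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return 1.0 - post[:, 0]

def bfdr_cutoff(p, alpha=0.05):
    """Smallest cutoff c with estimated Bayesian FDR <= alpha.

    For a cutoff c, bFDR(c) = mean of (1 - p_i) over genes with p_i >= c.
    """
    for c in np.sort(p):                               # candidate cutoffs
        declared = p >= c
        if np.mean(1.0 - p[declared]) <= alpha:
            return c
    return 1.0
```

With c = 0 every gene is declared and the estimate reduces to the average of 1 − p_i over all genes; this is the form of the 22.3% figure quoted later for the 38wt analysis.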
Figure 6.3: Expression over time (min) of genes with very different ranks before and after the clustering. Panel (a): genes whose ranks increased after clustering; panel (b): genes whose ranks decreased.
We illustrate this method using the 38wt data and the clustering model discussed in Chapter 5, with K = 16. Since we did allow the clustering method to find its own "zero" cluster (otherwise that would mean having a hole in the center of the bivariate normal distribution on (A_i, B_i)), it is possible that some clusters other than the imposed "zero" cluster are also non-differentially expressed groups. After examining Figures 5.26 and 5.28, we decided to classify clusters 4 and 12 as "zero" clusters as well, since they have weak signals and relatively low quality in terms of sensitivity and specificity. The posterior estimates of amplitude for clusters 4 and 12 are 0.29 and 0.31, respectively. We then computed the marginal posterior probabilities p_i = p(M̄_0 | y_i) as described above, and ranked the genes based on the p_i's. Figure 6.3 displays genes among the top 200 and bottom 200 whose ranks differ by at least 1000 before and after the clustering. The left panel of Figure 6.3 shows 41 genes which were ranked very low before clustering but high afterwards;
all of them are in clusters 14 or 15. These are genes with relatively weak signals but strong cyclic patterns. When considered in isolation they are not easily detected; when combined with a group of genes sharing similar patterns, however, we have more confidence that they are true signals rather than mere random oscillation. The right panel of Figure 6.3 shows 8 genes which are now ranked low but were ranked high before clustering. All of them are in cluster 1. Although these genes have large variation, their cyclic patterns do not match any of the other clusters well. We speculate that they are either not cell-cycle regulated, or that their expression patterns are so different from those of other genes that they warrant closer inspection. Under the above model, if all 1685 genes were analyzed and declared as discoveries, the bFDR is 22.3%; in other words, about 376 genes would be expected to be false positives. This leaves us with more cell-cycle genes than Spellman et al. (1998) and Cho et al. (1998) identified. This proposal to account for dependence allows us to borrow strength from other genes in the same cluster, and the results from this initial analysis look promising. Future enhancements could include a better parameterization of the "zero" cluster, extension to clustering with unknown K, and analysis of the entire data set without filtering.
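Returning to the rank comparison in Figure 6.3, the bookkeeping is mechanical. The following sketch flags genes whose ranks move by at least 1000; the inputs p_isolated and p_clustered (per-gene probabilities computed without and with the clustering model) are hypothetical names for illustration.

```python
import numpy as np

def rank_changes(p_isolated, p_clustered, min_shift=1000, top=200):
    """Find genes whose evidence ranking changes sharply after clustering.

    p_isolated, p_clustered : (N,) posterior probabilities of differential
    expression, computed gene-by-gene and under the cluster model.
    Rank 1 = strongest evidence. Returns indices of genes now in the top
    (or bottom) `top` whose rank moved by at least `min_shift`.
    """
    rank_before = (-p_isolated).argsort().argsort() + 1
    rank_after = (-p_clustered).argsort().argsort() + 1
    shift = rank_before.astype(int) - rank_after.astype(int)
    promoted = np.where((rank_after <= top) & (shift >= min_shift))[0]
    n = len(p_isolated)
    demoted = np.where((rank_after > n - top) & (-shift >= min_shift))[0]
    return promoted, demoted
```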
6.3 Robust Clustering with the t-Distribution
Statistical inference based on the normal distribution (univariate or multivariate) is known to be vulnerable to outliers. As we illustrated in the growth curve example in Section 2.5, the parameter estimates can be significantly affected by outliers. A strategy often adopted in such situations is to use heavy-tailed distributions such as the t-distribution (O'Hagan, 1987). Wakefield et al. (1994) proposed a modelling strategy based on the t-distribution for growth curve data, which provided both a coherent outlier-detection diagnostic and direct inference with outlier effects accommodated. A good discussion of mixture models with t-distributions can be found in McLachlan and Peel (2000). There are two sorts of outliers with curve data: outliers at the observed-data level, and outliers at the random-effect level. In this section we are concerned with outliers of the second sort. Extending the work of Wakefield et al. (1994) to our hierarchical mixture
model in Section 4.1 with a linear mean structure is straightforward: we simply replace the mixture of normals at Stage 2 with a mixture of multivariate Student t-distributions. One way of representing this assumption is to write Stage 2 as

θ_i | z_i = k, μ, Σ ~ N(μ_k, γ_i^{-1} Σ_k),
γ_i | ν ~ Ga(ν/2, ν/2)   (= χ²_ν / ν).

The remaining hierarchical structure is defined exactly as in Section 4.1. For references establishing that N(θ_i; μ, γ^{-1}Σ) with γ ~ χ²_ν/ν generates St_ν(θ_i | μ, Σ), see Johnson and Kotz (1972) and Lange et al. (1989). Under this setting, the full conditionals defining the Gibbs sampler become the following. For the gene-specific random effects,

θ_i | ··· ~ N(m*, Σ*),    i = 1, ..., n,

where

Σ* = ( X_i′ X_i / σ_e² + γ_i Σ_{z_i}^{-1} )^{-1}

and

m* = Σ* ( X_i′ y_i / σ_e² + γ_i Σ_{z_i}^{-1} μ_{z_i} ),

with X_i the design matrix for the ith curve. The full conditionals for the cluster labels {z_i} are

Pr(z_i = k | ···) ∝ π_k f(θ_i; μ_k, γ_i^{-1} Σ_k),

where f(·; ·) is the normal density. The full conditionals for the component means {μ_k} are μ_k | ··· ~ N(m*, V*), where

(V*)^{-1} = ( Σ_{i: z_i = k} γ_i ) Σ_k^{-1} + V^{-1}

and

m* = V* ( Σ_k^{-1} Σ_{i: z_i = k} γ_i θ_i + V^{-1} m ).
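To make the γ-weighted updates concrete, here is a minimal sketch of the two gene-level moves above: the draw of θ_i and the multinomial draw of z_i. It assumes the scale variables γ_i and all other parameters are fixed at their latest values; the variable names are ours, and the updates for γ_i, μ_k, Σ_k, π and σ_e² given next would complete the sweep.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_theta_i(X_i, y_i, sigma2_e, gamma_i, mu_z, Sigma_z, rng):
    """Draw theta_i | ... ~ N(m*, Sigma*) under the t-mixture model."""
    prec = X_i.T @ X_i / sigma2_e + gamma_i * np.linalg.inv(Sigma_z)
    Sigma_star = np.linalg.inv(prec)
    m_star = Sigma_star @ (X_i.T @ y_i / sigma2_e
                           + gamma_i * np.linalg.solve(Sigma_z, mu_z))
    return rng.multivariate_normal(m_star, Sigma_star)

def update_z_i(theta_i, pi, mu, Sigma, gamma_i, rng):
    """Draw z_i with Pr(z_i = k) proportional to pi_k N(theta_i; mu_k, Sigma_k / gamma_i)."""
    K = len(pi)
    w = np.array([pi[k] * multivariate_normal.pdf(theta_i, mu[k],
                                                  Sigma[k] / gamma_i)
                  for k in range(K)])
    return rng.choice(K, p=w / w.sum())
```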
For the component variances {Σ_k} we have

Σ_k^{-1} | ··· ~ Wishart( ρ + n_k, ( ρR + Σ_{i: z_i = k} γ_i (θ_i − μ_k)(θ_i − μ_k)′ )^{-1} ),

where n_k = #{i : z_i = k} is the number of genes classified into cluster k. For the component weights {π_k} we have

π | ··· ~ Dirichlet(δ_1 + n_1, ..., δ_K + n_K).

For the measurement error, the full conditional is

σ_e^{-2} | ··· ~ Ga( g + n/2, h + (1/2) Σ_i (y_i − X_i θ_i)′ (y_i − X_i θ_i) ).

We next illustrate this robustified clustering using simulated data. In addition to the curves generated from the 3 clusters, we add 10 uniformly distributed random points on the intercept-slope plane; the resultant curves and clusters are shown in Figure 6.4.
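A sketch of how such data can be generated follows; all numeric settings here (cluster centers, spreads, noise scale, uniform bounds) are hypothetical placeholders, not the dissertation's actual simulation values.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-3, 3, 7)                          # common time points

# Hypothetical cluster centers and sizes on the intercept-slope plane.
centers = [(9.0, 1.0), (10.5, -0.5), (12.0, -2.0)]
sizes = [17, 17, 16]                               # 50 clustered curves

effects = np.vstack(
    [rng.multivariate_normal(c, 0.05 * np.eye(2), size=m)
     for c, m in zip(centers, sizes)]
    + [rng.uniform([6, -3], [14, 2], size=(10, 2))]   # 10 uniform noise points
)
curves = effects[:, [0]] + effects[:, [1]] * t     # linear mean structure
curves += rng.normal(scale=0.25, size=curves.shape)  # measurement error
```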
Figure 6.4: A total of 60 simulated curves; 50 are from the 3 clusters and 10 are uniform noise. Panel (a): simulated curves (response against time); panel (b): least squares estimates of intercept and slope.
Figure 6.5: Trace plot and posterior distribution of K, with BDMCMC and the normal mixture model (the posterior mass concentrates on K = 2 and K = 3).
Figure 6.6: Classification and estimation from fitting a three-component normal mixture model to the simulated data. Panel (a): estimates; panel (b): classification, both on the intercept-slope plane.
The addition of random noise makes it harder to detect the right number of clusters. As shown in Figure 6.5, the most favored number of clusters is 2 instead of 3. The results are based on BDMCMC with the same model and prior choices as in Chapter 4. The classification and posterior estimates from the normal mixture model with K = 3 are shown in Figure 6.6. It appears the normal mixture model actually performed well in the presence of outliers. On the other hand, the mixture model with three t-components failed to give better results (see Figure 6.7). We chose 4 as the degrees of freedom for the t-distributions. Figure 6.7 displays the classification and estimates from this model: the clusters appear larger than those that would come out of the normal mixture model, but the classification and estimates are off.
Figure 6.7: Classification and estimation from fitting a three-component Student-t mixture model to the simulated data. Panel (a): estimates; panel (b): classification, both on the intercept-slope plane.
The strategy of using heavy-tailed distributions often provides plausible robustness against outliers, but such a benefit is not easy to achieve with high-dimensional data, such
as the curve data we have here. One possible solution is to put the t-distributions on the observed curves instead of on the random effects. The additional parameter ν, the degrees of freedom, may be viewed as a robustness tuning parameter. It can be fixed in advance, or it can be inferred from the data for each component; however, it is not clear what prior should be used on ν, nor whether it should be allowed to be component specific. The multiplier γ_i also complicates the implementation of BDMCMC when the number of clusters is treated as unknown. A key requirement of the birth-death process is that the problem be formulated as a point process in some way, and the presence of γ_i makes such a formulation questionable. To get around this, we may have to integrate out this random effect and work with a much more complicated likelihood. All of these issues remain to be addressed in future research.

6.4 Prior on Cluster Labels
In this section we focus on the specification of the prior distribution on the cluster labels z. In our general hierarchical mixture model (Section 4.1) and its variants, we assumed the cluster labels were independent and identically distributed,

p(z_1, ..., z_n | π, K) = ∏_{i=1}^{n} p(z_i | π, K),        (6.3)

with p(z_i = k | π, K) = π_k for k = 1, ..., K. Such an assumption is common practice in Bayesian analysis of mixture models. However, the independence assumption is not always plausible in practice, as we often have prior knowledge that some observations are dependent and more likely to be in the same cluster. The independence prior (6.3) on the component labels is really only a technical shortcut to ease the computation, and carries no information about dependence that is known a priori. In order to improve estimation and inference, we attempt to specify a joint distribution for the component labels z_1, ..., z_n that takes into account the dependence among some of the observations.

As gene expression analysis is an important application of our mixture models, we describe our approach in terms of gene expression. The goal of the analysis is to cluster genes based on their expression profiles. Suppose we study the expression of n genes. Among
these n genes, for some we have no prior information at all about their association, so we refer to them as the independent group and assume the independence prior (6.3) on their labels; for others we know, either through earlier studies or through expert knowledge, that they are likely to function together in similar ways. For example, if certain genes share common regulatory elements (transcription factor binding sites), then they have a tendency to be turned on and off at the same time, to participate in the same biological process, and to function in a concomitant way; consequently they may show similar expression patterns and are likely to be clustered together. Introducing dependent cluster labels is thus a way to incorporate additional information, even sequence information, into the analysis of gene expression data, and to provide more insight into the coordination of the genetic network.

We first introduce some notation. Let z = (z_1, ..., z_n) denote the cluster labels for genes 1 through n. To quantify the situation described above, we partition these labels into C + 1 groups, z = (z^(0), z^(1), ..., z^(C)), where z^(c) contains the cluster labels in group c. Group 0 contains the labels for which we have no information; each of the other C groups contains the labels of genes that are dependent within the group. We assume labels in different groups are independent. Note that the "groups" are not the same as the "clusters" (components) which we obtain at the end of the model-based clustering analysis (although they may coincide). We want to incorporate the information that certain genes are correlated, even though we do not know which "cluster" they belong to (which is what the cluster analysis is to find out). The indexing of the C groups is arbitrary, and C is assumed known a priori. The joint distribution of z_1, ..., z_n can then be written as

Pr(z_1, ..., z_n) = ∏_{c=0}^{C} Pr(z^(c)).        (6.4)
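A minimal sketch of this factorization, assuming the label vector has already been split into groups: group 0 gets the independent prior (6.3), while each dependent group is scored by a user-supplied joint probability model (the group_logpmf callable here is hypothetical; e.g. the log-linear model of Section 6.4.1 could be plugged in).

```python
import numpy as np

def log_prior_labels(groups, pi, group_logpmf):
    """Log of the grouped label prior (6.4).

    groups       : list of integer arrays; groups[0] holds the labels of
                   the independent group, groups[1:] the dependent groups.
    pi           : (K,) mixing weights; labels take values 0..K-1.
    group_logpmf : callable(z_group) -> log Pr(z_group) for a dependent group.
    """
    logp = np.sum(np.log(pi)[groups[0]])    # independent prior (6.3)
    for z_c in groups[1:]:                  # dependent groups, eq. (6.4)
        logp += group_logpmf(z_c)
    return logp
```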
The problem is now reduced to specifying the joint distribution of a group of correlated discrete variables, for each of the C groups.

6.4.1 Joint distribution of multivariate discrete responses
This section focuses on probability models for the labels z^(c) = (z_{c1}, ..., z_{cn_c}) in a single group c. To simplify the notation, we suppress the subscript c. Each label can be written
as a multivariate discrete variable, and we intend to find a joint distribution for them, with dependence included. There is a large literature on this topic, mostly in the area of longitudinal data analysis. Good references include Diggle et al. (1994), Cox (1972), Zhao and Prentice (1990), Bishop et al. (1975), Fitzmaurice and Laird (1993), and Zhao et al. (1992). We start with binary labels: given K = 2, z_i takes value 0 or 1, and E(z_i) = Pr(z_i = 1) = π_i is the marginal mean. The joint distribution of n binary responses is multinomial with a probability vector of length 2^n, most commonly represented by the log-linear model (Bishop et al., 1975),

Pr(z) = g(θ) exp( Σ_{j=1}^{n} θ_j z_j + Σ_{j1 < j2} θ_{j1 j2} z_{j1} z_{j2} + ··· + θ_{1...n} z_1 ··· z_n ),        (6.5)

where g(θ) is the normalizing constant.
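Since g(θ) is defined only implicitly, a brute-force sketch for small n may help fix ideas: enumerate all 2^n binary vectors, evaluate the unnormalized log-linear terms, and normalize. This sketch keeps only main effects and pairwise interactions, a common truncation of (6.5); the function name is ours.

```python
import itertools
import numpy as np

def loglinear_pmf(theta_main, theta_pair):
    """Joint pmf of n binary labels under a pairwise log-linear model.

    theta_main : (n,) main-effect parameters theta_j.
    theta_pair : (n, n) interaction parameters theta_{j1 j2}; only the
                 upper triangle (j1 < j2) is used. Higher-order terms
                 of (6.5) are omitted in this sketch.
    Returns a dict mapping each binary tuple z to Pr(z).
    """
    n = len(theta_main)
    states = list(itertools.product([0, 1], repeat=n))
    log_unnorm = []
    for z in states:
        z = np.array(z)
        log_unnorm.append(theta_main @ z + z @ np.triu(theta_pair, k=1) @ z)
    log_unnorm = np.array(log_unnorm)
    probs = np.exp(log_unnorm - log_unnorm.max())
    probs /= probs.sum()     # normalizing; this division plays the role of g(theta)
    return dict(zip(states, probs))
```

A positive pairwise parameter θ_{j1 j2} inflates the probability that both labels equal 1, which is one way prior knowledge of co-regulation can enter the label prior.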
VITA

Chuan Zhou was born in a suburban town of Beijing, China. To him life means to see more of the world and never stop seeking knowledge. He has travelled to a few places and spent quite some time in school. He went to high school at Beijing No. 4 Middle School. In 1996, he graduated from Peking University with a Bachelor of Science degree in Statistics. After that he came to the United States, and two years later he received a Master of Science degree in Statistics from the University of Maryland, Baltimore County. He then moved to Seattle, where he pursued graduate study in the world-acclaimed biostatistics program at the University of Washington. In 2000 he earned a Master of Science degree in Biostatistics, and in 2003 he earned his Doctor of Philosophy in Biostatistics from the University of Washington.