Bayesian approach to estimation and testing in time course microarray experiments

Claudia Angelini, Daniela De Canditiis, Istituto per le Applicazioni del Calcolo 'Mauro Picone', CNR, Italy
Margherita Mutarelli, Istituto di Scienze dell'Alimentazione, CNR, Italy
Marianna Pensky, Department of Mathematics, University of Central Florida, USA

Abstract

The objective of the paper is to develop a truly functional, fully Bayesian method which allows one to identify differentially expressed genes in a time-course microarray experiment. Each gene expression profile is modeled as an expansion over some orthonormal basis, with the coefficients and the number of basis functions estimated from the data. The proposed procedure deals successfully with various technical difficulties which arise in microarray time-course experiments, such as a small number of available observations, non-uniform sampling intervals, and the presence of missing or multiple data, as well as temporal dependence between observations for each gene. The procedure allows one to account for various types of errors, thus offering a good compromise between nonparametric techniques and those based on normality assumptions. The method accounts for multiplicity, selects differentially expressed genes and ranks them. In addition, all evaluations are performed using analytic expressions, hence the entire procedure requires very little computational effort. The quality of the procedure is studied by simulations. Finally, the procedure is applied to a case study of human breast cancer cells stimulated with estrogen, leading to the discovery of some new significant genes which were not identified earlier due to the high variability in the raw data.
1 Introduction
The expression levels of genes in a given cell can be influenced by a pathological status or by a pharmacological or medical treatment. The response to a given stimulus is usually different for different genes and may depend on time. One of the goals of modern genetics is to automatically detect those genes that can be associated with the biological condition under investigation. Microarray experiments allow one to simultaneously monitor the expression levels of thousands of genes. Although various microarray experiments can be used for studying different factors of interest, in what follows, for simplicity, we consider experiments involving comparisons between two biological conditions, such as control and treatment, made in the course of time. In general, the problem can be formulated as follows. Consider data which consist of records on N genes; the record on each gene is taken at n time points t^{(j)}, j = 1, ..., n, and for gene i at time point t^{(j)} there are k_i^{(j)} records available, making the total number of records for gene i

    M_i = Σ_{j=1}^{n} k_i^{(j)}.                                          (1)
Each record can be modeled as a noisy measurement of a function s_i(t) at a time point t^{(j)}. The number of time points is relatively small (n ≈ 10) and there are very few replications at each time point (k_i^{(j)} = 0, 1, ..., K) with K quite small (K = 1, 2 or 3), while the number of genes is very large (N ≈ 10,000). The objective of the analysis is to identify the curves which are different from the identically zero curve (i.e., significant) and to estimate those curves. Subsequently, the curves are usually subject to some kind of clustering in order to group genes by their type of response to the treatment.

In the past, the literature on statistical methods for identification of differentially expressed genes in microarray experiments was usually concerned with static experiments, see e.g. Efron et al. (2001), Lonnstedt and Speed (2002), Dudoit et al. (2002), Kerr et al. (2002), Ishwaran and Rao (2003), among many others. Although data for a few time-series microarray experiments have recently become available, there is still a shortage of statistical methods which are specially designed for time course experiments. For example, the SAM ver 2.0 software package (originally proposed by Tusher, Tibshirani and Chu (2001) and later described in Storey and Tibshirani (2003)) was adapted to handle time course data by regarding the different time points as different groups. In much the same manner, the ANOVA approach of Kerr, Martin and Churchill (2000) and Wu et al. (2003) was applied to time course experiments by treating the time variable as a particular experimental factor; the more recent papers by Park et al. (2003), Conesa et al. (2006) and Di Camillo et al. (2005) and the Limma package by Smyth (2005) take similar approaches. To summarize, these packages
and papers mainly treat the data using old statistical techniques designed for static data, so that the results are often invariant under permutation of the time points. The latter means that the biological role of causality inferred from the temporal response is totally ignored and the importance of the time variable is neglected. Nowadays, as time course microarray experiments are becoming increasingly popular for studying a wide range of biological systems, it is well recognized that the regulation of gene expression is a dynamic process where the time variable plays a crucial role. This realization led to new developments in the area of analysis of time-course microarray data, see e.g. de Hoon et al. (2002), Bar-Joseph et al. (2003), Bar-Joseph (2004), and Storey et al. (2005). When analyzing time course microarray experiments, biologists are faced with various technical difficulties such as the small number of available observations (the time series are usually very short and hence no asymptotic method can be used), non-uniform sampling intervals, the presence of missing data or multiple data, and temporal dependence between observations for each gene. To the best of our knowledge, the most comprehensive approach to detection of differentially expressed genes which addresses these challenges is given in Storey et al. (2005). They consider both the cases of independent (non-temporal) and longitudinal (temporal) data. In the latter case, each of the response curves is expanded over a polynomial or B-spline basis with the coefficients estimated by the least squares procedure. The degree of the polynomial (number of basis functions) used in these expansions is common for all the genes and is determined via the following procedure: identify the set of most significant genes using singular value decomposition, fit the curves for this group and determine the optimal degree of the polynomial using generalized cross-validation, and choose the common degree to be the maximum of the degrees in the above set of curves. To test which curves (genes) are significant, a set of statistics is formed (one per gene) which are somewhat similar to the F-statistics used in ANOVA. However, Storey et al. (2005) do not impose the assumption of normality: the distribution of these statistics is treated as completely unknown and is studied via the bootstrap. Subsequently, the most significant curves are selected by controlling q-values using an FDR-like procedure.

The goal of the present paper is to develop a general statistical approach which is specifically designed for time course microarray data. In what follows, we propose a new, fully Bayesian approach to the problem formulated above. The method treats records as functional data, thus preserving causality and taking into account the temporal nature of the data. Since the response of each gene is treated as a function, no assumption of independence between time points is made. The technique which we propose does not require the number
of records M_i for each gene to be equal, thus avoiding the tiresome problem of missing data which is addressed in Storey et al. (2005) by employing the EM algorithm. The genes for which too few records are available can be discarded, while the rest are treated using the general algorithm. Moreover, no adjustment is necessary even if records for each gene are taken at different time points. Since the response curve for each gene is relatively simple and only a few measurements for each gene are available, each of the curves s_i(t) is estimated globally by expanding it over an orthogonal basis (Legendre polynomials or Fourier) and estimating the first few coefficients. Hence, as a result of the analysis, each function is represented by a short vector of coefficients, and the Bayesian approach delivers the posterior distribution of this vector. This information makes an excellent starting point for a subsequent clustering, which can be viewed as a particular case of clustering of a large number of low-dimensional vectors.

The Bayesian approach is applied at all stages of the analysis. It is used for simultaneous estimation of the curves as well as for testing which of the curves are significant; moreover, the method ranks the curves (genes) according to the level of their significance. As a result, the technique is more uniform and flexible than that of Storey et al. (2005): it allows a different number of basis functions for each curve, which improves the fits, does not require pre-determining the most significant genes in order to select the dimension of the fit, and avoids a somewhat ad-hoc evaluation of the p-values. Also, our method can accommodate various types of error distributions, namely, all scale mixtures of a normal distribution (e.g., normal, Student T and double-exponential), as well as any additional prior information. By not treating the errors in the model non-parametrically as in Storey et al. (2005), we avoid application of resampling methods which may feel rather formidable to a practitioner. In fact, all of the formulae used in the Bayesian computations are explicit and easy to implement.

The closest to our method in spirit is the Bayesian clustering technique of Heard et al. (2006), where the gene profiles are also represented by expansions over a certain basis and a normal-inverse gamma prior is imposed on the unknown coefficients. The number of clusters as well as cluster participation are also treated as random, leading to a full Bayesian model as in our paper. However, since the goal of Heard et al. (2006) is different from ours, they do not address many of the issues which we treat in depth (replications, different time points for different genes, etc.). Nevertheless, the philosophical similarity between the two approaches makes the technique of Heard et al. (2006) an attractive and natural choice for subsequent clustering of the curves which are found to be significant by our algorithm.

In what follows, we consider a data set which contains measurements of the differences in the expression levels between the "treated" and the "control" samples of N genes at n time points in the interval [0, T]. The objective is first to identify the genes which respond
to the treatment (i.e., which show different functional expressions between treatment and control), and then to estimate the type of response (i.e., the effect of the treatment). This experimental setup can be easily realized by the direct hybridization of the two samples on cDNA microarrays and then repetition of the hybridization process at different time points after the treatment. However, the approach proposed in this paper is much more general and is not limited to this kind of microarray experiments: it can be applied to time course microarray experiments in general. For example, by replacing the time variable with the dose variable without changing the model, one can apply the technique developed in this paper to the situation when one studies the effect of changing the dose of a chemical or a medication for a given treatment (provided the number of different doses is relatively large). Also, in some special cases it can be applied to the oligonucleotide microarray experiments. Regardless of the type of the experiment, in what follows we assume that appropriate preprocessing, filtering and normalization have been applied to the data before starting the statistical analysis (see e.g. Yang et al. (2002) or Cui et al. (2002) for the description of the pre-processing procedures). The rest of the paper is organized as follows. In Section 2 we describe hierarchical Bayesian model for our data. Section 3 describes the procedure for estimating gene profiles and testing which genes are significant. Section 4 deals with estimation of global parameters. Section 5 summarizes the algorithm for automatic identification and estimation of the gene expression profiles in a time course microarray experiment. Sections 6 and 7 study performance of the method proposed in the paper using simulated and real life data sets. Section 8 concludes the paper with the discussion. Finally, Section 9 contains derivation of formulae in Section 3.
2 Statistical modeling of gene expression profiles
The experiment consists of a sequence of cDNA microarrays where the control and treated samples are compared at times t^{(j)}, j = 1, ..., n, after the treatment, where the points t^{(j)} are not necessarily uniformly spaced and technical repetitions (i.e., multiple measurements at a single time point) may be present. The expression value of each microarray is the result of competitive hybridization. The mRNA samples are reverse-transcribed into cDNA, one sample is labeled with green fluorescent dye, the other one with red dye, then they are mixed and applied to the microarray. After the cDNA has hybridized to its specific spot, the image is captured using a scanner and the intensities in the two channels are measured. For each spot the log2-ratio of the red to green fluorescence intensity represents the relative expression value of the corresponding gene. After that, the data needs to be pre-processed to remove
systematic sources of variation. In what follows, we assume that such pre-processing has been done accurately and that we have normalized log2-ratio data. For a detailed discussion on the preliminary processing of microarray data we refer the reader to Yang et al. (2002), Cui et al. (2002), McLachlan et al. (2004) or Wit and McClure (2004).

Data are taken at n different time points in [0, T], where the sampling grid t^{(1)}, t^{(2)}, ..., t^{(n)} is not necessarily uniformly spaced on the interval. For each array, the measurements consist of N normalized log2-ratios z_i^{j,k}, where i = 1, ..., N is the gene number, index j corresponds to the time point t^{(j)} and k = 1, ..., k_i^{(j)} accommodates possible replications at time t^{(j)} since k_i^{(j)} ≥ 1. Note that usually, by the structure of the experimental design, k_i^{(j)} is independent of i, i.e. k_i^{(j)} ≡ k^{(j)} and M = Σ_j k^{(j)}. However, since some observations may be missing due to a technical error in the experiment which cannot be adjusted during the preprocessing, we let k_i^{(j)} depend on i. For each gene i, we assume that the evolution in time of its relative expression value is governed by the function s_i(t) and each of the measurements also involves some measurement error, i.e.

    z_i^{j,k} = s_i(t^{(j)}) + ζ_i^{j,k},    i = 1, ..., N,  j = 1, ..., n,  k = 1, ..., k_i^{(j)}.          (2)
The measurement errors ζ_i^{j,k} are assumed to be i.i.d. with zero mean and finite variance. The function s_i(t) represents the temporal differential expression level of gene i over the interval [0, T] and this is the quantity of interest. In particular, if s_i(t) ≡ 0, then gene i is not affected by the treatment since its expression value does not change at any moment after the treatment. On the other hand, if s_i(t) ≢ 0, then gene i changes its biological response due to the treatment. In this case, the estimated value ŝ_i(t) of s_i(t) is a measure of the effect induced by the stimulus and, hence, it is important for a biologist. Therefore, the objective of the analysis is to select the genes for which the expression function s_i(t) ≢ 0 and estimate those functions s_i(t). Since functions s_i(t) are measured only at a few time points, we estimate each of the functions globally. Specifically, we expand each of the functions over some standard orthonormal basis on [0, T],

    s_i(t) = Σ_{l=0}^{L_i} c_i^{(l)} φ_l(t),                                          (3)
and characterize each of the functions by the vector of its coefficients c_i. In the present paper we use Legendre polynomials or the Fourier basis, suitably rescaled and normalized on [0, T]; however, other choices are possible. Due to the fact that the functions s_i(t) model a biological system, these functions are continuous, although discontinuities in the first derivatives are possible. The values of the coefficients c_i^{(l)} as well as the degree of the polynomial L_i are to be estimated from the observations.
We assume that the genes are conditionally independent and hence we can model each gene separately. Using (3), we re-write (2) as

    z_i = D_i c_i + ζ_i,                                          (4)

where z_i = (z_i^{1,1}, ..., z_i^{1,k_i^{(1)}}, ..., z_i^{n,1}, ..., z_i^{n,k_i^{(n)}})^T ∈ R^{M_i} is the column vector of all measurements for gene i (see (1)), c_i = (c_i^{(0)}, ..., c_i^{(L_i)})^T ∈ R^{L_i+1} is the column vector of the coefficients of s_i(t) in the chosen basis, ζ_i = (ζ_i^{1,1}, ..., ζ_i^{1,k_i^{(1)}}, ..., ζ_i^{n,1}, ..., ζ_i^{n,k_i^{(n)}})^T ∈ R^{M_i} is the column vector of random errors, and D_i is the block design matrix whose j-th block of rows is the row vector [φ_0(t^{(j)}) φ_1(t^{(j)}) ... φ_{L_i}(t^{(j)})] replicated k_i^{(j)} times, so that the matrix D_i has dimension M_i × (L_i + 1).
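For illustration, the following sketch (Python with numpy; the function names legendre_basis and design_matrix are ours and not part of the paper or of any package) shows one way to assemble the design matrix D_i of (4) from Legendre polynomials rescaled and normalized on [0, T], with the row for each time point repeated according to the number of replicates.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis(t, L, T):
    """Evaluate phi_0, ..., phi_L at the points t, where phi_l is the Legendre
    polynomial rescaled to [0, T] and normalized so that
    int_0^T phi_l(t) phi_m(t) dt = delta_{lm} (orthonormal basis)."""
    x = 2.0 * np.asarray(t, dtype=float) / T - 1.0      # map [0, T] -> [-1, 1]
    Phi = np.empty((len(x), L + 1))
    for l in range(L + 1):
        coef = np.zeros(l + 1)
        coef[l] = 1.0
        # standard P_l satisfies int_{-1}^{1} P_l^2 dx = 2/(2l+1);
        # rescaling to [0, T] multiplies the squared norm by T/2
        Phi[:, l] = np.sqrt((2 * l + 1) / T) * legendre.legval(x, coef)
    return Phi

def design_matrix(time_points, replicates, L, T):
    """Block design matrix D_i: the row for time t^(j) is repeated k_i^(j) times."""
    rows = np.repeat(np.arange(len(time_points)), replicates)
    return legendre_basis(np.asarray(time_points, dtype=float)[rows], L, T)

# toy example: 4 time points, 2 replicates each, polynomial degree L = 2
D = design_matrix([1, 2, 4, 8], [2, 2, 2, 2], L=2, T=8.0)
print(D.shape)   # (8, 3), i.e. M_i x (L_i + 1)
```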
In the present paper we consider a fully Bayesian model for our data. The latter means that we treat all parameters in the model either as random variables or as nuisance parameters which are recovered from the data. We assume that, given σ², the vector of errors ζ_i is normally distributed, ζ_i | σ² ~ N(0, σ² I_{M_i}), hence z_i | L_i, c_i, σ² ~ N(D_i c_i, σ² I_{M_i}). On the unknown parameters we elicit the following priors:

    L_i ~ Pois*(λ, L_max), the Poisson distribution with parameter λ truncated at L_max;

    c_i | L_i, σ² ~ π_0 δ(0, ..., 0) + (1 − π_0) N(0, σ² τ_i² Q_i^{-1}).

The choice of the truncated Poisson distribution to model the number of terms in the expansion (3) is motivated by the prior belief that each function s_i(t) can be represented by a relatively small number of factors, which can be different for different genes; however, choosing a different prior on L_i will lead to a very minor modification of the theory. The values of L_i are estimated from the observations. Parameter λ is proportional to the average degree of the polynomial and L_max is the maximal possible degree; both are treated as known parameters in this paper. In the simulation study we choose λ = 9, which approximately corresponds to an expected prior degree L_i = 3 of the polynomials in (3), while the maximum degree allowed is L_max = 6 (the default choice for n = 11 time points). In general, the choice of these parameters should be motivated by the number of measurements as well as by the nature of the problem. Simulations show that the choices of λ and L_max do not significantly affect the results of estimation and testing.

Given σ², the prior distribution for the vector of coefficients c_i is chosen to be a mixture of a point mass at zero and a multivariate normal density with covariance matrix σ² τ_i² Q_i^{-1}. This choice reflects the fact that some of the curves are identically zero, while for the others positive and negative coefficients c_i^{(l)} are equally likely. Parameter π_0 is the prior probability that the treatment does not affect a gene; it is a global parameter
which is estimated from the data. The matrix Q_i is a diagonal matrix which accounts for the decay of the coefficients in the chosen basis. This, of course, depends on the smoothness of the underlying functions s_i(t). If the functions s_i(t) are ν_i times continuously differentiable, then the coefficients c_i^{(l)} usually have polynomial decay, c_i^{(l)} ~ (l + 1)^{-ν_i}, which corresponds to Q_i^{ll} = (l + 1)^{2ν_i}. However, choosing a gene-dependent parameter ν_i is difficult and, moreover, unnecessary. First, usually the amount of data is insufficient for reliable estimation of ν_i. Second, as the results of our extensive simulations show (see Section 6), the performance of the method is practically independent of how well ν_i is estimated. For this reason, we choose a single global parameter ν in our further analysis. Note that if no assumption about smoothness is made, then ν = 0 and one can use Q_i = I.

Scaling the covariance matrix for the i-th gene by σ² τ_i² reflects the fact that the scale of the coefficients c_i can be different for different genes (the more a gene is affected by the treatment, the larger the value of τ_i is). Parameters τ_i are estimated from the observations, while the parameter σ² is assumed to be random, σ² ~ ρ(σ²). Treating σ² as a random variable allows one to account for possibly non-Gaussian errors, which are quite common in microarray experiments. Assigning different parametric forms to ρ(σ²) allows one to accommodate various types of errors, while making σ² a factor of the covariance matrix of each c_i leads to closed-form expressions for the estimators and test statistics. In particular, in the present paper we consider the following three types of priors ρ(·); however, other choices of ρ(·) are possible:

case 1: ρ(σ²) = δ(σ² − σ_0²), where the parameter σ_0² is the overall variance of the experiment. In that case, the marginal distribution of the error is N(0, σ_0²).

case 2: ρ(σ²) = IG(γ, b), the Inverse Gamma distribution. In that case, the marginal distribution of the error is Student T.

case 3: ρ(σ²) = c_μ σ^{M_i − 1} e^{−σ² μ/2}. In that case, the marginal distribution of the error is double exponential.

The global hyperparameters, π_0 and the parameters of a particular ρ(σ²) (σ_0² for case 1, γ and b for case 2, and μ for case 3), are estimated from the data. Possible strategies for doing this are discussed in Section 4. Once the hyperparameters are estimated, Bayesian analysis is carried out by combining the prior information and the sample information into the posterior distribution.
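As an illustration of the hierarchical model above, here is a toy generative sketch in Python (numpy/scipy) that draws the data for a single gene under case 2. All numerical values are illustrative, the inverse-Gamma parameterization follows the convention used in Section 4 (1/σ² ~ Gamma(γ, b)), and a plain monomial basis is used for brevity in place of the rescaled Legendre basis.

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# illustrative global settings (not the paper's estimates)
T = 32.0
tpts = np.array([1, 2, 4, 8, 16, 32], dtype=float)
reps = 2
lam, Lmax, pi0, tau2, nu = 9.0, 6, 0.9, 4.0, 0
gamma_, b = 15.0, 0.27 ** 2 * (15.0 - 1.0)     # so that E[sigma^2] = b / (gamma - 1) = 0.27^2

def rtrunc_pois(lam, Lmax):
    """Draw L_i from a Poisson(lam) distribution truncated at Lmax."""
    p = np.array([lam ** l / math.factorial(l) for l in range(Lmax + 1)])
    return int(rng.choice(Lmax + 1, p=p / p.sum()))

# one gene: sigma^2 (case 2), then L_i and c_i from the mixture prior, then z_i via (4)
sigma2 = stats.invgamma.rvs(a=gamma_, scale=b, random_state=1)   # 1/sigma^2 ~ Gamma(gamma, b)
L_i = rtrunc_pois(lam, Lmax)
Q_inv = np.diag(1.0 / (np.arange(L_i + 1) + 1.0) ** (2 * nu))    # Q_i^{ll} = (l+1)^{2 nu}
c_i = np.zeros(L_i + 1)
if rng.random() > pi0:                                           # gene affected with probability 1 - pi0
    c_i = rng.multivariate_normal(np.zeros(L_i + 1), sigma2 * tau2 * Q_inv)
t_all = np.repeat(tpts, reps)                                    # replicated time points
D_i = np.vander(t_all / T, L_i + 1, increasing=True)             # monomial basis, for brevity
z_i = D_i @ c_i + rng.normal(0.0, np.sqrt(sigma2), size=t_all.size)
print(L_i, np.round(z_i, 3))
```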
3 Estimation and testing of gene expression profiles
If the global parameters of the model were known, one could proceed to a gene-by-gene analysis of the gene expression coefficients c_i, i = 1, ..., N. In this section we provide the final formulas, giving the details of the calculations in the Appendix. To deal with different choices of ρ(σ²) we introduce the function

    F(A, B) = ∫_0^∞ σ^{-A} exp{−B/(2σ²)} ρ(σ²) dσ²,                                          (5)

which can be calculated explicitly in the three cases discussed above:

    F(M_i, B) =  σ̂^{-M_i} exp{−B/(2σ̂²)}                                              in case 1,
                 [Γ(M_i/2 + γ)/Γ(γ)] b^{-M_i/2} (1 + B/(2b))^{-(M_i/2 + γ)}            in case 2,
                 c_μ √(2π/μ) exp{−√(Bμ)}                                               in case 3.          (6)

Also, denote

    g_λ(L_i) = [Σ_{l=0}^{L_max} (l!)^{-1} λ^l e^{-λ}]^{-1} (L_i!)^{-1} λ^{L_i} e^{-λ},   L_i = 0, ..., L_max,          (7)

    H_i(z_i) = z_i^T z_i − z_i^T D_i (D_i^T D_i + τ_i^{-2} Q_i)^{-1} D_i^T z_i,                                        (8)

    V(z_i, L_i, M_i) = |τ_i² D_i^T D_i + Q_i|^{-1/2} [(L_i + 1)!]^ν F(M_i, H_i(z_i)).                                  (9)

Here g_λ(L_i) is the pdf of the truncated Poisson distribution, and we suppress the dependence on λ, τ_i and ν. Then the marginal density of z_i is of the following form:

    p(z_i) = (2π)^{-M_i/2} [ π_0 F(M_i, z_i^T z_i) + (1 − π_0) Σ_{L_i=0}^{L_max} g_λ(L_i) V(z_i, L_i, M_i) ].          (10)
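A possible implementation of formulas (6) and (7), working on the log scale for numerical stability (Python with scipy; this is our sketch, not the authors' Matlab routines). The case-3 normalizing constant c_μ is left as an argument because it cancels in ratios of F values computed under the same ρ, such as the Bayes factor (13) below.

```python
import numpy as np
from scipy.special import gammaln

def log_F(A, B, case, sigma2_hat=None, gamma=None, b=None, mu=None, log_c_mu=0.0):
    """log F(A, B) of equation (5), evaluated with formula (6).
    case 1: point mass at sigma0^2 (estimated by sigma2_hat);
    case 2: inverse-Gamma(gamma, b) prior on sigma^2;
    case 3: double-exponential marginal errors; log_c_mu is the log normalizing
            constant of rho, which cancels in Bayes-factor ratios."""
    if case == 1:
        return -0.5 * A * np.log(sigma2_hat) - B / (2.0 * sigma2_hat)
    if case == 2:
        return (gammaln(A / 2.0 + gamma) - gammaln(gamma)
                - (A / 2.0) * np.log(b) - (A / 2.0 + gamma) * np.log1p(B / (2.0 * b)))
    if case == 3:
        return log_c_mu + 0.5 * np.log(2.0 * np.pi / mu) - np.sqrt(B * mu)
    raise ValueError("case must be 1, 2 or 3")

def g_lambda(lam, Lmax):
    """Truncated Poisson pmf g_lambda(L), L = 0, ..., Lmax, as in (7)."""
    logs = np.array([l * np.log(lam) - gammaln(l + 1) for l in range(Lmax + 1)])
    p = np.exp(logs - logs.max())
    return p / p.sum()
```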
Expression (10) contains a gene-dependent parameter τ_i, which is estimated by τ̂_i = arg max_{τ_i} p(z_i). The posterior pdf of the degree L_i given the data z_i is

    p(L_i | z_i) = (2π)^{-M_i/2} g_λ(L_i) [ π_0 F(M_i, z_i^T z_i) + (1 − π_0) V(z_i, L_i, M_i) ] / p(z_i).          (11)

For each gene i, we can estimate L̂_i by either maximizing the posterior pdf (11) (MAP principle) or using its posterior mean. After τ_i and L_i are estimated, we replace them by τ̂_i and L̂_i in all subsequent calculations.

Our main goals are to estimate the coefficients c_i for each gene and to test whether c_i ≡ 0. For this purpose, we evaluate the posterior pdf of c_i given z_i and L̂_i:

    p(c_i | z_i, L̂_i) = [(2π)^{M_i/2} p(z_i | L̂_i)]^{-1} { π_0 F(M_i, (z_i − D_i c_i)^T (z_i − D_i c_i)) δ(0, ..., 0)
        + (1 − π_0) [(L̂_i + 1)!]^ν (2π τ̂_i²)^{-(L̂_i + 1)/2} F(M_i + L̂_i + 1, (z_i − D_i c_i)^T (z_i − D_i c_i) + τ̂_i^{-2} c_i^T Q_i c_i) },          (12)

where

    (2π)^{M_i/2} p(z_i | L̂_i) = π_0 F(M_i, z_i^T z_i) + (1 − π_0) V(z_i, L̂_i, M_i).
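The posterior pmf (11) of the degree L_i can be computed directly; the sketch below does so for case 1 (point-mass prior on σ²), again on the log scale, under our own naming conventions. The MAP estimate L̂_i is then the arg max of the returned vector.

```python
import numpy as np

def log_F1(A, B, sigma2):
    # case 1 of (6): F(A, B) = sigma^{-A} exp(-B / (2 sigma^2)), in log form
    return -0.5 * A * np.log(sigma2) - B / (2.0 * sigma2)

def posterior_degree(z, D_full, Q_full, tau2, sigma2, pi0, lam, Lmax, nu=0):
    """Posterior pmf of L_i given z_i, equation (11), computed in log scale.
    D_full and Q_full are the design matrix and Q_i built for degree Lmax;
    degree L uses their leading (L + 1) columns / block."""
    M = z.size
    log_g = np.array([l * np.log(lam) - np.sum(np.log(np.arange(1, l + 1)))
                      for l in range(Lmax + 1)])
    log_g -= np.logaddexp.reduce(log_g)                      # truncated Poisson, (7)
    logs = np.empty(Lmax + 1)
    for L in range(Lmax + 1):
        D, Q = D_full[:, :L + 1], Q_full[:L + 1, :L + 1]
        A = D.T @ D + Q / tau2
        H = z @ z - z @ D @ np.linalg.solve(A, D.T @ z)      # H_i(z_i), (8)
        logV = (-0.5 * np.linalg.slogdet(tau2 * D.T @ D + Q)[1]
                + nu * np.sum(np.log(np.arange(1, L + 2)))   # log [(L+1)!]^nu
                + log_F1(M, H, sigma2))                      # V(z_i, L, M_i), (9)
        logs[L] = log_g[L] + np.logaddexp(np.log(pi0) + log_F1(M, z @ z, sigma2),
                                          np.log(1 - pi0) + logV)
    post = np.exp(logs - logs.max())
    return post / post.sum()                                 # proportional to (11)

# usage: L_hat = int(np.argmax(posterior_degree(z, D_full, Q_full, tau2, sigma2, pi0, 9.0, 6)))
```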
Introduce the Bayes factors BF_i, the quotient between the posterior odds ratio and the prior odds ratio:

    BF_i = √(|Q_i + τ̂_i² D_i^T D_i|) F(M_i, z_i^T z_i) / { [(L̂_i + 1)!]^ν F(M_i, H_i(z_i)) }.          (13)

Then the posterior probability that c_i = 0 is

    p(c_i = 0 | z_i, L̂_i) = π_0 / [ π_0 + (1 − π_0) (BF_i)^{-1} ].          (14)
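A minimal sketch of (13) and (14) for case 1, with D and Q built for the estimated degree L̂_i (our illustration; for cases 2 and 3 the function log_F of the earlier sketch would replace the local log_F below).

```python
import numpy as np

def log_bayes_factor(z, D, Q, tau2, sigma2, nu=0):
    """log BF_i of equation (13) for case 1; D and Q correspond to degree L_hat."""
    M, L1 = z.size, D.shape[1]                               # L1 = L_hat + 1
    H = z @ z - z @ D @ np.linalg.solve(D.T @ D + Q / tau2, D.T @ z)
    log_F = lambda A, B: -0.5 * A * np.log(sigma2) - B / (2.0 * sigma2)
    return (0.5 * np.linalg.slogdet(Q + tau2 * D.T @ D)[1]
            + log_F(M, z @ z)
            - nu * np.sum(np.log(np.arange(1, L1 + 1)))      # log [(L_hat+1)!]^nu
            - log_F(M, H))

def prob_null(log_bf, pi0):
    """Posterior probability that c_i = 0, equation (14)."""
    return pi0 / (pi0 + (1.0 - pi0) * np.exp(-log_bf))
```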
In the usual Bayesian inference (see Berger (1985)), the Bayes factor BF_i can be used for independent testing of the null hypotheses H_{0i}: c_i = 0 versus H_{1i}: c_i ≠ 0 for i = 1, ..., N. However, the classical Bayesian approach does not account for the multiplicity of comparisons (indeed, placing independent priors on parameters and using additive losses does not imply any control of the familywise error). When N is large, as in microarray experiments, the problem of multiplicity cannot be ignored. To take this problem into account, we use the Bayesian multiple testing procedure proposed by Abramovich and Angelini (2005), where a simple hierarchical prior model is obtained by imposing a prior distribution π(r) > 0, r = 0, 1, 2, ..., on the number r of hypotheses which are a priori assumed to come from the alternative (i.e., the number of significant genes in our case), and the decision is taken by finding the most likely configuration of null and alternative hypotheses. Of particular interest in the microarray context are the "sparse" priors π(r) which force E(r) to be relatively small with respect to N, in order to model the prior belief that relatively few genes are usually differentially expressed. In this paper, instead of using the global maximum a posteriori, we control multiplicity by a procedure which is similar in spirit to the step-up method (see Abramovich and Angelini (2005), Section 2.2): if the Bayes factors are ordered so that BF_(1) ≤ BF_(2) ≤ ... ≤ BF_(N) and the corresponding hypotheses are re-indexed, then one starts from the most plausible null hypothesis BF_(N) and continues accepting the null hypotheses as long as

    BF_(i) > [i / (N − i + 1)] [π(i) / π(i − 1)],          (15)

and then rejects all remaining hypotheses. All genes corresponding to the rejected hypotheses are declared "significant". In a similar manner, one can also implement a step-down controlling procedure. We note that if the prior π(i) in (15) is Binomial with parameter α, then the step-down and step-up procedures provide the same answer as the global maximization procedure (see Abramovich and Angelini (2005) for details). Finally, for the significant genes we estimate the coefficients c_i by the posterior expectation over (12),

    ĉ_i = [(1 − π_0)/π_0] / [BF_i + (1 − π_0)/π_0] · (D_i^T D_i + Q_i/τ̂_i²)^{-1} D_i^T z_i,
and, by (3), the estimator of s_i(t) is

    ŝ_i(t) =  Σ_{l=0}^{L̂_i} ĉ_i^{(l)} φ_l(t)    if H_0 is rejected,
              0                                  if H_0 is accepted.          (16)

4 Estimation of global parameters
Now, to complete the theory, we need to address the problem of estimation of the global parameters of the model: the common parameter π_0 and the case-specific parameters σ_0² for case 1, γ and b for case 2, and μ for case 3. Several options are available in the literature for estimating the global parameters of the model, see for example Efron et al. (2001), Ishwaran and Rao (2003), Storey and Tibshirani (2003), Pounds and Cheng (2004), Pawitan et al. (2005). In this section we describe some of the choices we use in the simulations and the real data analysis; however, we note that a different set of methods can be applied without changing the main algorithm.

First, we observe that according to our model, there may be genes for which a complete set of M observations is available and genes for which some observations may be missing (since a few values are filtered out during the preprocessing). Estimation of the global parameters is based only on the N_c genes for which the complete set of observations is available. Therefore, we have M arrays with N_c observations at each time point t^{(j)}, j = 1, ..., M. Note that here we use a one-index enumeration of time points, so if several observations are made at the same time, then t^{(j)} = t^{(j+1)} = t^{(j+2)} = ... . For each array of observations at a time point t^{(j)}, the standard deviation σ^{(j)} is estimated by the sample standard deviation σ̂^{(j)} using the sparsity assumption (the majority of the components of the array are zeros). If normality of the data can be justified, the sample variance can be replaced by a more robust estimator like the MAD. Note that if a sufficiently large number of arrays is available, then one can also use a gene-specific estimator of σ or an unbiased pooled estimator. The estimator σ̂² is obtained by averaging the (σ̂^{(j)})², j = 1, ..., M.

Given σ̂, we estimate the global parameter π_0 by averaging over the arrays the proportion of data points which fall below the universal threshold σ̂ √(2 log N_c). Note that this method tends to overestimate π_0 when the error is normally distributed. However, this is no longer true when the error distribution has heavier tails, which is very common in real-life applications. Alternatively, π_0 can be estimated using an empirical Bayes approach as in Johnstone and Silverman (2005).

The set of estimates σ̂^{(j)}, j = 1, ..., M, is then treated as a sample of values of σ and is used for estimating the parameters of ρ(σ²). In case 1, σ_0² is estimated by σ̂². If the prior model is given by case 2, one can estimate the hyper-parameters γ and b by using the MLE (note that if (σ̂^{(j)})² ~ IG(γ, b), then (σ̂^{(j)})^{-2} ~ Gamma(γ, b)). An alternative procedure is to fix one of the two parameters, γ or b, and then estimate the other one by matching the mean of IG(γ, b) with σ̂². Similarly, if the prior of case 3 is applied, μ can be estimated by μ̂ = (M_i − 1)/σ̂², so that the mean of the prior distribution ρ(σ²) is centered at σ̂².
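A rough sketch of the global-parameter estimates of this section (Python/numpy). Our choice of the MAD as the robust per-array estimator and the cap that keeps π̂_0 strictly below 1 are assumptions of this illustration, not prescriptions of the paper.

```python
import numpy as np

def estimate_global(Z):
    """Rough estimates of sigma and pi0 from the complete-data matrix Z
    (rows: the N_c complete genes, columns: the M arrays), following Section 4.
    Per-array standard deviations are estimated robustly via the MAD (one
    possible choice); pi0 is the average proportion of values below the
    universal threshold sigma_hat * sqrt(2 log N_c)."""
    Nc, M = Z.shape
    sd_j = 1.4826 * np.median(np.abs(Z - np.median(Z, axis=0)), axis=0)   # robust per-array sd
    sigma2_hat = np.mean(sd_j ** 2)
    thr = np.sqrt(sigma2_hat) * np.sqrt(2.0 * np.log(Nc))
    pi0_hat = np.mean(np.abs(Z) < thr)            # averaged over arrays and genes
    pi0_hat = min(pi0_hat, 1.0 - 1.0 / Nc)        # our cap, so that 1 - pi0 stays positive
    return np.sqrt(sigma2_hat), pi0_hat
```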
5 Algorithm
In this section we describe the algorithm for automatic identification and estimation of the gene expression profiles in a time course microarray experiment. We stress that the data in (4) should be pre-processed and normalized to remove systematic sources of variation and should be expressed in terms of the log2-ratio between treatment and control (or vice versa). The algorithm is carried out in the following steps:

1. A preliminary step is to fix the prior parameters λ, L_max and ν.

2. Estimate the global parameters σ² and π_0, and the additional case-specific hyper-parameters σ_0² (for case 1), γ and b (for case 2) or μ (for case 3).

3. For each gene i, estimate the gene-specific parameter τ_i by maximizing the marginal pdf of the data (10). Subsequently, plug in π̂_0, σ̂_0², γ̂, b̂ or μ̂ instead of π_0, σ_0², γ, b or μ when required.

4. For each gene i, estimate the most appropriate degree L_i as the mean or the mode of the posterior pdf (11).

5. For each gene i, conditionally on L̂_i, compute the Bayes factor BF_i using formula (13).

6. Perform the Bayesian multiple testing procedure for controlling the multiplicity error and rank the genes according to the ordered Bayes factors. For the choice of the prior π(r) in (15) one can use, for example, a Binomial B(N, α) or a truncated Poisson Pois*(α, N) prior, or any other "sparse" distribution which the biologists can impose on the number of hypotheses coming from the alternative. Note that the smaller the parameter α is chosen, the sparser the prior π(r) will be, allowing the user to model the case where fewer genes are differentially expressed. We note that if the Binomial prior B(N, α) is chosen, then (15) becomes BF_(i) > α/(1 − α), and one can immediately see that a smaller α implies stronger control of the multiplicity. Similar reasoning applies to the Pois*(α, N) prior as well. In principle α can be estimated from the data. However, for simplicity, in our simulations we use the Binomial prior with α = 1 − π̂_0 (a sketch of this step is given after the list).

7. Estimate the gene expression profiles by ŝ_i(t) defined in (16).
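The multiple-testing step 6 can be implemented directly from (15). The sketch below (Python with numpy/scipy; the function names are ours) performs the step-up selection for a general prior π(r) and also builds the log-pmf of the Binomial(N, α) prior, for which (15) reduces to BF_(i) > α/(1 − α) as noted above.

```python
import numpy as np
from scipy.special import gammaln

def step_up_select(log_bf, log_prior):
    """Step-up selection of 'significant' genes based on (15).
    log_bf: array of log Bayes factors BF_i (one per gene);
    log_prior: array of log pi(r), r = 0, ..., N, the prior on the number of
    alternatives. Returns a boolean array: True = hypothesis rejected."""
    N = log_bf.size
    order = np.argsort(log_bf)                 # BF_(1) <= ... <= BF_(N)
    sorted_lbf = log_bf[order]
    # start from the most plausible null BF_(N) and accept while (15) holds
    k = N
    while k >= 1:
        thr = (np.log(k) - np.log(N - k + 1)
               + log_prior[k] - log_prior[k - 1])
        if sorted_lbf[k - 1] > thr:
            k -= 1                             # accept H_(k) and move down
        else:
            break
    reject = np.zeros(N, dtype=bool)
    reject[order[:k]] = True                   # reject BF_(1), ..., BF_(k)
    return reject

def log_binom_prior(N, alpha):
    """log pmf of the Binomial(N, alpha) prior on the number of alternatives."""
    r = np.arange(N + 1)
    return (gammaln(N + 1) - gammaln(r + 1) - gammaln(N - r + 1)
            + r * np.log(alpha) + (N - r) * np.log1p(-alpha))
```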
6 Simulations
To investigate the performance of the proposed method, we carried out a simulation study. We generated data according to model (4) with N = 8000, n = 11, and k_i^{(j)} = 2 for all j = 1, ..., 11 except k_i^{(2)} = k_i^{(5)} = k_i^{(7)} = 3, to mimic the structure of the real data set used in the next section. We also used the same time points for the measurements as in the real data set. We randomly chose 600 curves to be "significant" and simulated their profiles according to (3) with the Legendre basis; the other 7400 curves were set to be identically zero. This scenario corresponds to the situation when 7.5% of the total number of genes are "differentially expressed". For each significant curve, we first sampled the degree of the polynomial L_i^{true} from a discrete uniform distribution on [1, L_max]. Starting from degree one is a biology-based choice, since a nonzero constant signal is questionable from a biological point of view. Indeed, a constant response would mean that the treatment had an immediate effect which later would not change anymore. For this reason, in our simulations we avoided polynomials of degree zero and chose L_max = 6. After that, we randomly sampled c_i from N(0, σ² τ_i² Q_i^{-1}), where σ = 0.27, Q_i = diag(1^{2ν_i}, 2^{2ν_i}, ..., L_i^{2ν_i}), ν_i ~ U([0, 1]), and τ_i² was uniformly sampled in order to produce a signal-to-noise ratio (SNR) in the interval [2, 6], mimicking both weak and strong signals, respectively. The choice of the sampling interval [0, 1] for the parameter ν_i was motivated by the belief that the biological response to the treatment may depend on the particular gene and that the profile can be continuous or at most differentiable. Note that in the estimation algorithm we chose a common prior parameter ν; however, our method could be modified to allow gene-dependent prior parameters ν_i. Yet, this choice comes at the price of heavier calculations without much gain in precision.

We performed simulations with three kinds of i.i.d. noise: normal N(0, σ²) and Student T with 5 or 3 degrees of freedom, T(5) and T(3), respectively. The Student noise was rescaled to have the same variance σ² as in the normal case; additionally, very large values were filtered out and substituted with "missing values", mimicking the preprocessing of the data which eliminates unrealistic values. The number of missing values per gene did not exceed 2, and at most 2-3% of the profiles were affected. For each noise distribution, the simulations were repeated with 30 randomly generated sets of profiles s_i in order to make the results independent of the particular choice of functions s_i; to make the results comparable, the set of functions was the same for all three noise scenarios. Then, for each set of profiles, the experiment was replicated with 10 different realizations of the noise. The data were processed using the methods proposed in cases 1-3, and the results were averaged.

Simulations were carried out with various choices of the parameters λ, L_max and ν, and the results turned out to be robust with respect to these choices. Hence, we report the results only for L_max = 6, λ = 9 (which corresponds to an expected degree of about 3) and ν = 0 or ν = 1. In Tables 1 and 2 we present the average number of rejected hypotheses (genes declared differentially expressed) (reje), the average number of correctly rejected hypotheses (corr), the False Discovery Rate (FDR), computed as the average proportion of falsely rejected hypotheses over the total number of rejected hypotheses, and the False Negative Rate (FNR), computed as the average proportion of significant curves which were not detected over the total number of curves declared nonsignificant. The results in the two cases are comparable (with ν = 1 a few more curves are usually detected). The simulations show good performance of the procedure in the case of the normal and the T(5) noise, and acceptable performance for T(3). In addition, the results are robust not only with respect to the number of detected genes but also with respect to the ranking of the curves: for the first 200-300 significant curves the ranks assigned by different methods may vary by at most a few positions. Moreover, in the normal and T(5) noise cases, the false positives enter only after the first 450 and 400 genes, respectively, while in the T(3) noise case there are almost no false positives before the first 300 genes. The simulations reported in the paper used the MAP procedure to estimate the degree L_i for each gene; however, we did not notice any remarkable difference when the degree was estimated with the posterior mean. The estimation of the hyper-parameters was carried out using the methods described in Section 4. In particular, σ was estimated by the sample standard deviation averaged over the arrays. In case 2, various strategies for selecting γ and b were tested, but in the tables we only report case-2-I, where we fix γ = 15 and select b such that the mean of the prior IG distribution coincides with the estimated σ̂², and case-2-II, where γ and b are estimated simultaneously by MLE. Note that although the two methods provide different estimates of γ and b, there was very little difference in the detection of significant genes. However, case-2-I may be preferable if a biologist wants to use a tuning parameter to slightly adjust the number of selected genes. The quality of the selection was also evaluated in terms of functional errors in estimating the curves (i.e., the L2-norm).
                        N noise                        T(5) noise                     T(3) noise
              reje    corr    FDR     FNR     reje    corr    FDR    FNR     reje    corr    FDR    FNR
  case-1      488.8   488.8  .00001  .0148    505.0   491.9  .0260  .0144    575.1   502.1  .1270  .0132
  case-2-I    483.3   483.3  .00005  .0155    497.4   488.0  .0188  .0149    572.1   504.5  .1182  .0129
  case-2-II   485.0   485.0  .00002  .0153    500.7   488.8  .0237  .0148    578.4   501.8  .1325  .0132
  case-3      474.6   474.5  .00003  .0167    502.7   481.8  .0416  .0158    621.7   501.3  .1937  .0134

Table 1: Simulated results for the case L_max = 6, λ = 9, ν = 0.

                        N noise                        T(5) noise                     T(3) noise
              reje    corr    FDR     FNR     reje    corr    FDR    FNR     reje    corr    FDR    FNR
  case-1      502.3   502.3  .00004  .0130    514.2   504.7  .0185  .0127    574.3   515.8  .1018  .0113
  case-2-I    499.1   499.0  .00011  .0135    511.2   503.7  .0155  .0129    582.4   519.4  .1082  .0109
  case-2-II   499.9   499.9  .00004  .0133    513.1   503.1  .0195  .0129    587.6   516.5  .1210  .0113
  case-3      491.4   491.4  .00008  .0145    517.6   497.6  .0385  .0137    634.5   516.9  .1854  .0113

Table 2: Simulated results for the case L_max = 6, λ = 9, ν = 1.

In Table 3 we show several types of functional errors associated with our decisions: errA = ||s_i − ŝ_i||²_2 / ||s_i||²_2 is the relative L2 estimation error, averaged over the functions s_i correctly detected as significant and estimated by ŝ_i; errB = ||ŝ_i||²_2 is the absolute L2 error averaged over the false positives, i.e., functions with s_i ≡ 0 which are incorrectly declared significant and estimated by ŝ_i; finally, errC = ||s_i||²_2 is the absolute L2 error averaged over the false negatives, i.e., functions with s_i ≢ 0 which are not detected as significant. Note that all three methods have comparable errors errA and errC for all three noise scenarios. The only exception is errB, which is larger when the noise has heavier tails, since in this case it is much easier to confuse the noise with the signal. Note also that the results of case-2-II are comparable with those of case-1 and are quite different from those of case-2-I. The reason for this is that the MLEs of the parameters γ and b are higher than in case-2-I; hence, the T-distribution in case-2-II closely approximates the normal distribution of case-1 and is quite far from the T-distribution in case-2-I. ErrC is comparable for all the methods and represents a kind of threshold (in the L2-norm) below which the effect of the treatment cannot be detected by the proposed methods.
                        N noise                           T(5) noise                        T(3) noise
              errA      errB      errC        errA      errB      errC        errA      errB      errC
  case-1      0.07905   0.00086   0.46856     0.08828   0.63378   0.45311     0.10035   0.83435   0.43564
  case-2-I    0.07792   0.00235   0.49161     0.07898   0.29420   0.47501     0.08428   0.32265   0.44137
  case-2-II   0.07799   0.00087   0.48094     0.08162   0.45795   0.46422     0.09078   0.51321   0.43956
  case-3      0.07590   0.00141   0.51823     0.08126   0.26061   0.49212     0.08821   0.32450   0.45070

Table 3: Average relative estimation error (errA), average false positive error (errB) and average false negative error (errC). Results obtained with L_max = 6, λ = 9, ν = 0.

                        N noise                           T(5) noise                        T(3) noise
              errA      errB      errC        errA      errB      errC        errA      errB      errC
  case-1      0.07536   0.00550   0.42738     0.08300   0.71074   0.41384     0.09344   0.90208   0.39634
  case-2-I    0.07524   0.01141   0.45080     0.07695   0.25375   0.43511     0.07620   0.26662   0.40459
  case-2-II   0.07497   0.00560   0.43933     0.07798   0.42431   0.42401     0.08429   0.43825   0.40088
  case-3      0.07426   0.00894   0.47344     0.07922   0.21030   0.45074     0.07918   0.25972   0.41210

Table 4: Average relative estimation error (errA), average false positive error (errB) and average false negative error (errC). Results obtained with L_max = 6, λ = 9, ν = 1.
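For completeness, here is a sketch of how the summary measures reported in Tables 1-4 can be computed from a single simulation run (our implementation of the definitions given above; the profiles are assumed to be evaluated on a common time grid so that L2 norms can be approximated by the trapezoidal rule).

```python
import numpy as np

def simulation_metrics(selected, truly_nonzero, s_true, s_hat, tgrid):
    """Summary measures used in Tables 1-4.
    selected, truly_nonzero: boolean arrays over genes;
    s_true, s_hat: arrays of shape (N, len(tgrid)) with the true and estimated
    profiles evaluated on the grid tgrid over [0, T]."""
    l2sq = lambda f: np.trapz(f ** 2, tgrid, axis=-1)          # squared L2 norm on [0, T]
    reje = int(selected.sum())
    corr = int((selected & truly_nonzero).sum())
    fdr = 0.0 if reje == 0 else (reje - corr) / reje           # falsely rejected / rejected
    accepted = ~selected
    fnr = (accepted & truly_nonzero).sum() / max(int(accepted.sum()), 1)
    tp = selected & truly_nonzero
    fp = selected & ~truly_nonzero
    fn = accepted & truly_nonzero
    errA = np.mean(l2sq(s_true[tp] - s_hat[tp]) / l2sq(s_true[tp])) if tp.any() else 0.0
    errB = np.mean(l2sq(s_hat[fp])) if fp.any() else 0.0
    errC = np.mean(l2sq(s_true[fn])) if fn.any() else 0.0
    return dict(reje=reje, corr=corr, FDR=fdr, FNR=fnr, errA=errA, errB=errB, errC=errC)
```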
7 Real data application
We have applied the proposed methods to the time course microarray study described in Cicatiello et al. (2004) (the data are available from the GEO repository, accession number GSE1864). The objective of the experiment was to identify genes involved in the estrogen response in a human breast cancer cell line. In fact, estrogen is known to promote cell proliferation, hence its role in cancer development, especially when the cancer is located in hormone-responsive tissues such as breast and ovary. In the original experiment, ZR-75.1 cells were stimulated with a mitogenic dose of 17β-estradiol after 5 days of starvation on a hormone-free medium, and samples were taken after t = 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32 hours, with a total of 11 time points covering the completion of a full mitotic cycle in hormone-stimulated cells. For each time point at least two replicates were available (three replicates at t = 2, 8, 16). From the 8400 genes we first removed those which did not pass the image analysis quality control flag. Then, we filtered out single spots with low intensity values in any single channel, red or green (min(log2(R), log2(G)) < 5), or in both channels (log2(R * G) < 11), and declared the corresponding values as missing. Here, R (red) and G (green) are the values measured on the cDNA microarray for the red and green dye, respectively. We
also removed the few spots where there was both a difference in sign and a big difference between the expression values of the replicates at a time point. After that, a gene was removed from further investigation if more than 20% of its values were missing. As a result, a total of 8161 genes remained and were analyzed; among them, about 200 contained at least one missing value. The retained genes were normalized using the standard lowess normalization procedure with span parameter 0.3, in order to remove various nuisance sources of systematic variation in the measured fluorescence intensities (e.g., different labeling efficiencies and scanning properties of the two fluorochromes, or a suboptimal choice of the scanning parameters). We refer the reader to Yang et al. (2002) and Cui et al. (2002) for a review of normalization procedures for microarrays.

We analyzed the data set mentioned above using our proposed method and various choices for the parameters ν and λ. Table 5 shows the number of genes declared affected by the treatment for L_max = 6, ν = 0 and λ ranging between 6 and 12, which corresponds to an expected prior degree of the polynomials from 2.5 to 3.5. From Table 5 one can see that the results are quite robust with respect to the number of detected significant genes, with smaller λ providing larger lists (λ larger than 12 does not produce any noticeable changes in the list). The technique is also very robust with respect to the list of genes detected as significant: 574 genes were common to all 28 lists (combinations of different methods and different parameter values) and the overlap remains very substantial in any sublist of genes (for example, for any i, there is an overlap of about 85% among the top i ranked genes of all lists). Note also that among the 340 genes studied in Cicatiello et al. (2004) our list of 574 common genes includes 267 genes; among the remaining 73 genes, 16 were filtered out in our analysis due to a more stringent selection of quality flags before processing the data. Another 57 genes were detected by our method with some combinations of priors and parameters, but they were not found to be robust with respect to these choices. Subsequently, we examined their raw profiles and discovered that for those 57 genes the possible effect of the treatment is between weak and very weak compared with the noise. On the other hand, our list contains 307 genes not detected in Cicatiello et al. (2004) which seem to be affected by the treatment. While examining the newly selected genes we noticed that these genes have profiles which are very similar to the profiles of the genes selected in Cicatiello et al. (2004), but their raw data show much more variability between replicates. Since the Cicatiello et al. (2004) analysis was carried out manually, point by point, the data for those genes were treated as unreliable and the genes were discarded. However, the functional approach to the microarray data which lies at the core of our method allows the gene profiles to be estimated with sufficient precision even in the presence of missing or less reliable single data points. Moreover, interestingly, 17 of the 307 newly selected genes were replicate spots of some genes already selected in Cicatiello
et al. (2004), and most of them are known to be involved in biological processes related to estrogen response, such as cell cycle and cell proliferation (AREG, NOLC1, cyclin D1), DNA replication (MCM7, RFC5), mRNA processing (SFRS1) and lipid metabolism (APOD and LDHA).

              λ = 6    λ = 7    λ = 8    λ = 9    λ = 10   λ = 11   λ = 12
  case-1       876      808      752      712      692      688      691
  case-2-I     893      823      765      711      679      656      650
  case-2-II    869      810      754      714      694      689      693
  case-3       854      876      726      676      640      617      610

Table 5: Results on real data: total number of genes declared affected by the treatment. Here ν = 0, L_max = 6.

The analysis was also repeated on the same data set using ν = 1 and the same range of λ, leading, as expected, to the detection of a larger number of significant genes, since simulations show that ν = 1 is less conservative. The difference of about 100-200 more genes is not negligible; however, the genes selected for ν = 0 were all present in the list for ν = 1. Finally, we repeated the analysis with different choices at the preprocessing stage (i.e., the size of the span parameter of the lowess method, the cut-off on the intensities, etc.) and observed that the results are quite robust to these choices, with deviations of about 20-40 genes.
8 Discussion
In this paper we developed a fully Bayesian approach for detecting differentially expressed genes in a time course experiment. The proposed method contributes to the new and increasingly popular research area of analysis of time course microarray data (see Figure 1 of Ernst et al. (2005) for examples of experiments in which the proposed procedure may be helpful). To the best of our knowledge, our approach is the first functional, fully Bayesian procedure available in the literature for this kind of problem. Our method can also be complemented by the somewhat similar Bayesian approach to clustering of gene profiles considered by Heard et al. (2006). Indeed, any clustering procedure may benefit from a preliminary step in which differentially expressed genes are detected.

The Bayesian formulation allows one to explicitly use the prior information that biologists may have about the data and successfully handles various technical difficulties which arise in microarray time-course experiments, such as a small number of available observations,
non-uniform sampling intervals, the presence of missing data or multiple data, as well as temporal dependence between observations for each gene. Moreover, the method allows one to accommodate a wide variety of errors, thus avoiding two undesirable extremes: treating the error distribution as completely unknown, which leads to very heavy computational procedures (as e.g. in Storey et al. (2005)), or assuming that the errors are normally distributed, which is unrealistic in real-life situations. As a result, all evaluations are based on explicit expressions, thus leading to fast and simple computational procedures which are attractive to a practitioner. In addition, since the computational cost of the method is relatively small, fine tuning of the prior parameters can easily be done.

The procedure proposed in the paper was applied to the microarray data provided by Cicatiello et al. (2004), which describe the estrogen response in a human breast cancer cell line. In the course of our study, we detected a number of genes which seem to be strongly affected by the treatment but were discarded by Cicatiello et al. (2004), since the data on those genes were treated as unreliable. Application of our technique allowed us to estimate the profiles of those genes with sufficient precision even in the presence of missing or less reliable single data points.

The Matlab routines which were used for carrying out the simulations and for the analysis of the real data are available upon request from the first two authors.
Acknowledgments This research was supported in part by the NSF grant DMS 0505133 and the Short Term Mobility grants 2005 and 2006. We are grateful to Luigi Cicatiello, Angelo Facchiano and Alessandro Weisz for their constructive comments. Also, Marianna Pensky would like to thank Claudia Angelini and Daniela De Canditiis for warm hospitality while visiting Italy to carry out part of this work and Claudia Angelini would like to thank Marianna Pensky for the warm hospitality while visiting Florida.
9 Appendix

9.1 Derivation of p(L_i | z_i)
Combination of the model and the priors leads to the following joint pdf:

    p(z_i, c_i, L_i, σ²) = g_λ(L_i) ρ(σ²) (2π)^{-M_i/2} σ^{-M_i} exp{ −(1/(2σ²)) (z_i − D_i c_i)^T (z_i − D_i c_i) }
        × [ π_0 δ(0, ..., 0) + (1 − π_0) √(|Q_i|) (2π σ² τ_i²)^{-(L_i+1)/2} exp{ −(1/(2σ² τ_i²)) c_i^T Q_i c_i } ].          (17)

Using the equality

    (z_i − D_i c_i)^T (z_i − D_i c_i) + τ_i^{-2} c_i^T Q_i c_i
        = [c_i − (D_i^T D_i + τ_i^{-2} Q_i)^{-1} D_i^T z_i]^T (D_i^T D_i + τ_i^{-2} Q_i) [c_i − (D_i^T D_i + τ_i^{-2} Q_i)^{-1} D_i^T z_i] + H_i(z_i),

where H_i(z_i) is defined in (8), we integrate out c_i:

    p(z_i, L_i, σ²) = [ g_λ(L_i) ρ(σ²) / ((2π)^{M_i/2} σ^{M_i}) ] { π_0 exp{ −z_i^T z_i/(2σ²) }
        + (1 − π_0) [(L_i + 1)!]^ν (τ_i²)^{-(L_i+1)/2} |D_i^T D_i + Q_i/τ_i²|^{-1/2} exp{ −H_i(z_i)/(2σ²) } }.

Integrating out σ² and using the definition of F given in (5), we obtain

    p(z_i, L_i) = (2π)^{-M_i/2} g_λ(L_i) [ π_0 F(M_i, z_i^T z_i) + (1 − π_0) V(z_i, L_i, M_i) ].          (18)
Summing over all possible degrees Li = 0, . . . , Lmax , we derive the marginal pdf of the data (10). The posterior pdf (11) of the degree Li can be obtained by dividing (18) by (10).
9.2 Estimation of c_i
Plugging an estimate L̂_i into (17) and dividing it by g_λ(L̂_i), we obtain p(z_i, c_i, σ² | L̂_i). Integrating out σ², we derive

    p(z_i, c_i | L̂_i) = (2π)^{-M_i/2} { π_0 F(M_i, (z_i − D_i c_i)^T (z_i − D_i c_i)) δ(0, ..., 0)
        + (1 − π_0) [(L̂_i + 1)!]^ν (2π τ̂_i²)^{-(L̂_i+1)/2} F(M_i + L̂_i + 1, (z_i − D_i c_i)^T (z_i − D_i c_i) + τ̂_i^{-2} c_i^T Q_i c_i) }.          (19)

Similarly, plugging L̂_i into (18) and dividing by g_λ(L̂_i), we obtain

    p(z_i | L̂_i) = (2π)^{-M_i/2} [ π_0 F(M_i, z_i^T z_i) + (1 − π_0) V(z_i, L̂_i, M_i) ].          (20)

Finally, we obtain the posterior pdf (12) of c_i by dividing (19) by (20). The estimator ĉ_i of the coefficients c_i is the mean of the pdf (12).
References

[1] F. Abramovich and C. Angelini (2005). Bayesian Maximum a Posteriori Multiple Testing Procedure. Tech. Report IAC-CNR 303/05.

[2] Z. Bar-Joseph (2004). Analyzing time series gene expression data. Bioinformatics, vol. 20, no. 16, 2493-2503.

[3] Z. Bar-Joseph, G. Gerber, T. Jaakkola, D. Gifford, I. Simon (2003a). Comparing the continuous representation of time series expression profiles to identify differentially expressed genes. Proc. Nat. Acad. Sci. USA, 100, 10146-10151.
[4] Z. Bar-Joseph, G. Gerber, T. Jaakkola, D. Gifford, I. Simon (2003b). Continuous representation of time series gene expression data. J. Comput. Biol., 3-4, 341-356.

[5] J.O. Berger (1985). Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics.

[6] Cicatiello, L., Scarfoglio, C., Altucci, L., Cancemi, M., Natoli, G., Facchiano, A., Iazzetti, G., Calogero, R., Biglia, N., De Bortoli, M., Sfiligol, C., Sismondi, P., Bresciani, F. and Weisz, A. (2004). A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive transcriptome. Journal of Molecular Endocrinology, 32, 719-775.

[7] A. Conesa, M.J. Nueda, A. Ferrer and M. Talon (2006). MaSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics, 22, 1096-1102.

[8] X. Cui, M.K. Kerr, G.A. Churchill (2002). Transformation for cDNA Microarray Data. Statistical Applications in Genetics and Molecular Biology, 2.

[9] M.J.L. de Hoon, S. Imoto and S. Miyano (2002). Statistical analysis of a small set of time-ordered gene expression data using linear splines. Bioinformatics, 18, 1477-1485.

[10] B. Di Camillo, F. Sanchez-Cabo, G. Toffolo, S.K. Nair, Z. Trajanosky, C. Cobelli (2005). A quantization method based on threshold optimization for microarray short time series. BMC Bioinformatics, 6.

[11] S. Dudoit, Y.H. Yang, M.J. Callow and T.P. Speed (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-140.

[12] B. Efron, R. Tibshirani, J.D. Storey, V. Tusher (2001). Empirical Bayes Analysis of a Microarray Experiment. J. Amer. Statist. Assoc., 96, 1151-1160.

[13] J. Ernst, G.J. Nau, Z. Bar-Joseph (2005). Clustering short time series gene expression data. Bioinformatics, 21, 1159-1168.

[14] N.A. Heard, C.C. Holmes and D.A. Stephens (2006). A quantitative study of gene regulation involved in the Immune response of Anopheline Mosquitoes: An application of Bayesian hierarchical clustering of curves. J. Amer. Statist. Assoc., 101, 18-29.

[15] M.K. Kerr, M. Martin and G.A. Churchill (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7, 819-837.
[16] M.K. Kerr, C.A. Afshari, L. Bennett, P. Bushel, J. Martinez, N.J. Walker and G.A. Churchill (2002). Statistical analysis of a gene expression microarray experiment with replication. Statistica Sinica, 12, 203-218.

[17] H. Ishwaran and J.S. Rao (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Statist. Assoc., 98, 438-455.

[18] I. Lonnstedt and T. Speed (2002). Replicated microarray data. Statistica Sinica, 12, 31-46.

[19] G. McLachlan, K.A. Do, C. Ambroise (2004). Analyzing Microarray Gene Expression Data. Wiley Series in Probability and Statistics.

[20] T. Park, S.G. Yi, S. Lee, S.Y. Lee, D.H. Yoo, J.I. Ahn, and Y.S. Lee (2003). Statistical tests for identifying differentially expressed genes in time course microarray experiments. Bioinformatics, 19, 694-703.

[21] Y. Pawitan, K.R. Krishna Murthy, S. Michiels, and A. Ploner (2005). Bias in the estimation of false discovery rate in microarray studies. Bioinformatics, 21, 3865-3872.

[22] S. Pounds and C. Cheng (2004). Improving false discovery rate estimation. Bioinformatics, 20, 1737-1745.

[23] G.K. Smyth (2005). Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420.

[24] J.D. Storey, W. Xiao, J.T. Leek, R.G. Tompkins and R.W. Davis (2005). Significance analysis of time course microarray experiments (with supplementary material). Proc. Nat. Acad. Sci. USA, 102, 12837-12842.

[25] J.D. Storey and R. Tibshirani (2003). SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In The Analysis of Gene Expression Data: Methods and Software, G. Parmigiani, E.S. Garrett, R.A. Irizarry, S.L. Zeger (eds.), Statistics for Biology and Health, Springer, 272-290.

[26] V. Tusher, R. Tibshirani and C. Chu (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. USA, 98, 5116-5121.
[27] Y.H. Yang, S. Dudoit, P. Luu, M.D. Lin, V. Peng, J. Ngai, and T.P. Speed (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30.

[28] E. Wit, J. McClure (2004). Statistics for Microarrays: Design, Analysis and Inference. Wiley.

[29] H. Wu, M.K. Kerr, X. Cui, G.A. Churchill (2003). MAANOVA: A software package for analysis of spotted cDNA microarray experiments. In The Analysis of Gene Expression Data: Methods and Software, G. Parmigiani, E.S. Garrett, R.A. Irizarry, S.L. Zeger (eds.), Statistics for Biology and Health, Springer, 313-341.

Corresponding author:
Dr. Marianna Pensky
Department of Mathematics
University of Central Florida
Orlando, FL 32816-1364
e-mail:
[email protected]