JOURNAL OF COMPUTATIONAL BIOLOGY Volume 12, Number 4, 2005 © Mary Ann Liebert, Inc. Pp. 391–406
Bayesian Normalization and Identification for Differential Gene Expression Data DABAO ZHANG,1 MARTIN T. WELLS,2 CHRISTINE D. SMART,3 and WILLIAM E. FRY3
ABSTRACT

Commonly accepted intensity-dependent normalization in spotted microarray studies takes account of measurement errors in the differential expression ratio but ignores measurement errors in the total intensity, although the definitions imply that the same measurement-error components are involved in both statistics. Furthermore, identification of differentially expressed genes is usually considered separately following normalization, which is statistically problematic. By incorporating the measurement errors in both total intensities and differential expression ratios, we propose a measurement-error model for intensity-dependent normalization and identification of differentially expressed genes. This model is also flexible enough to incorporate intra-array and inter-array effects. A Bayesian framework is proposed for the analysis of the proposed measurement-error model to avoid the potential risk of using the common two-step procedure. We also propose a Bayesian identification of differentially expressed genes that controls the false discovery rate, instead of the ad hoc thresholding of the posterior odds ratio. A simulation study and an application to real microarray data demonstrate promising results.

Key words: Bayesian analysis, intensity-dependent normalization, Markov chain Monte Carlo, measurement-error model, spotted microarray.
1. INTRODUCTION
In the study of differentially expressed genes with spotted microarrays, statistical issues arise in experimental design (Yang and Speed, 2002; Kerr and Churchill, 2001a, 2001b), image analysis (Yang et al., 2000), normalization and inference on differential expression on a probe-by-probe basis (Chen et al., 1997; Kepler et al., 2000; Kerr et al., 2000; Finkelstein et al., 2001; Newton et al., 2001; Tseng et al., 2001; Wolfinger et al., 2001; Yang et al., 2001; Schadt et al., 2002; Yang et al., 2002; Dudoit et al., 2002), and finally the ambitious work on regulatory networks of genes (Friedman et al., 2000; Pe’er et al., 2001). Although each issue is, more or less, touched on by statisticians, difficulties in
1 Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642. 2 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853. 3 Department of Plant Pathology, Cornell University, Ithaca, NY 14853.
normalization still confront analysts in setting up any statistical approach as routine. The lack of a reliable statistical protocol may endanger the validity of subsequent analyses. In this article, we revisit the normalization issue and propose a new Bayesian approach that combines the normalization and the inference for differential gene expression data.

The purpose of normalization is to remove the effects of systematic measurement errors in a microarray experiment designed to identify differentially expressed genes. The two most important effects are the dye effect and the print-tip effect. The dye effect comes from the labeling efficiencies and scanning properties of the two fluors and is complicated, perhaps, by the use of different scanner settings. When two identical mRNA samples are labeled with different dyes and hybridized to the same slide in a self–self experiment, the red intensities often tend to be lower than the green intensities, even though there is no differential expression and the red and green intensities are expected to be equal (Smyth et al., 2002). Furthermore, the imbalance between the red and green intensities is usually not constant across the spots within and between arrays and can vary according to overall spot intensity, location on the array, slide origin, and possibly other variables. The print-tip effect comes from the differences between the print-tips on the array printer and from variation over the course of the print run. Although normalization of microarray data is partially carried out during image analysis by removing background noise, the data resulting from image analysis need to be further normalized to remove other systematic measurement errors, such as dye and print-tip effects.
Normalization of differential gene expression data usually refers to normalization after image analysis of the data and assumes the availability of a group of genes believed not to be differentially expressed between the two cell types of interest, some of which are known ahead of the experiment and are therefore called “housekeeping” genes. Global normalization, which adjusts all differential gene expression values by a constant for each array, has been considered and may be incorporated into the model for identification of differentially expressed genes (Chen et al., 1997; Newton et al., 2001; Kerr et al., 2000). Dudoit et al. (2002) consider normalization as a procedure independent of the identification of differentially expressed genes. Instead of global normalization, they consider within-print-tip-group local normalization (intensity-dependent normalization). For each spot within an array, the log-differential expression ratio is M = log2(R/G) and the log-intensity, a measure of the overall brightness of the spot, is A = (1/2) log2(RG) (M is for minus and A is for add). Observing the obvious nonlinear relationship between the log-differential expression ratio M and the log-intensity A for each print-tip group, Dudoit et al. (2002) suggest using a LOWESS (locally weighted scatterplot smoothing) approach (Cleveland, 1979; Cleveland and Devlin, 1988) for each print-tip group to remove the effect of the log-intensity A from M. The residuals of M are then treated as free of systematic error and used to identify differentially expressed genes, possibly via the models proposed by Chen et al. (1997) or Newton et al. (2001), or via other approaches such as those of Efron et al. (2001) and Lönnstedt and Speed (2002). Because the LOWESS smoother is available in many statistical packages, the ideas of Dudoit et al.
(2002) are easily implemented, and the two-step procedure of performing normalization and identification separately has been commonly accepted without question. As shown in Section 2, intensity-dependent normalization ignores measurement errors in the total intensity, and the two-step procedure is also statistically problematic. Therefore, a measurement-error model based on the implicit model in intensity-dependent normalization is proposed. To avoid the potential risk of using the common two-step procedure, a Bayesian framework is constructed for this model in Section 3 by assuming that the true differential gene expression values are distributed a priori as a mixture of a mass at zero and a heavy-tailed t-type distribution. Because of the flexibility of mixture models, there is an extensive literature on using them in microarray data analyses (e.g., Efron et al., 2001; Newton et al., 2001; Broët et al., 2002; Lönnstedt and Speed, 2002; McLachlan et al., 2002; Pan et al., 2002; Newton et al., 2004). Lewin et al. (2004) recently built a hierarchical Bayesian model that includes simultaneous normalization of the data and identification of differentially expressed genes, but it is based on the ANOVA formulation of Kerr et al. (2000). To avoid Lindley’s paradox (Lindley, 1957) with regard to the misbehavior of Bayes factors, we propose an alternative Bayesian identification of differentially expressed genes in Section 3. A simulation study is given in Section 4, which shows that the proposed Bayesian identification statistics perform well and that ignoring measurement errors in the total intensity increases false positives. We close the paper with an application of the proposed measurement-error model and Bayesian approach to microarray data from a study of potato late blight.
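As a concrete illustration of the two-step procedure discussed above, the following sketch (not from the paper; the simulated dye bias, data sizes, and function names are our own) computes M and A from raw channel intensities and removes the intensity-dependent trend with a basic single-pass, tricube-weighted local linear smoother, i.e., LOWESS without its robustness iterations:

```python
import numpy as np

def ma_transform(R, G):
    """M = log2(R/G) (log-differential expression ratio) and
    A = (1/2) log2(R*G) (overall log-intensity) for each spot."""
    M = np.log2(R) - np.log2(G)
    A = 0.5 * (np.log2(R) + np.log2(G))
    return M, A

def lowess_fit(x, y, frac=0.4):
    """Single-pass locally weighted linear regression with tricube
    weights (LOWESS without the robustness iterations)."""
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = max(np.sort(d)[k - 1], 1e-12)            # k-NN bandwidth
        w = np.clip(1 - (d / h) ** 3, 0.0, 1.0) ** 3  # tricube weights
        X = np.column_stack([np.ones(n), x - x[i]])
        WX = X * w[:, None]
        beta = np.linalg.solve(WX.T @ X + 1e-10 * np.eye(2), WX.T @ y)
        fitted[i] = beta[0]                          # local intercept
    return fitted

# Self-self style example: an intensity-dependent dye bias but no true
# differential expression, so the normalized M should hover near zero.
rng = np.random.default_rng(0)
A = rng.uniform(6, 14, 200)
M = 0.8 * np.sin(A / 3) + rng.normal(0, 0.2, 200)
M_norm = M - lowess_fit(A, M)    # step 2 works with these residuals
```

Under the two-step procedure, the residuals `M_norm` would be passed on unchanged to the identification step, which is exactly the practice questioned in Section 2.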
2. WHY A NEW APPROACH?

Denote the log-differential expression ratio by M_ij and the log-intensity by A_ij, for the j-th replication of the i-th gene. Let T_ij denote the print-tip index of the j-th replication of the i-th gene, 1 ≤ j ≤ m_i and 1 ≤ i ≤ n. The implicit model in the intensity-dependent normalization of Dudoit et al. (2002) is

    M_ij = h_{T_ij}(A_ij) + γ_i + ε_ij,

where h_k(·), k = 1, ..., L (assuming there are L different print tips in an array printer), are unknown nonlinear functions, ε_ij is independently and identically distributed with mean zero and variance σ_ε², and γ_i is the individual effect of the i-th gene. Although the above model is semiparametric, it is usually fit with a two-step procedure. First, in the normalization step, a LOWESS smoothing procedure is used to remove the nonlinear effect of A_ij from M_ij, that is, to fit M_ij = h_{T_ij}(A_ij) + ε̃_ij. Then M̃_ij = M_ij − ĥ_{T_ij}(A_ij) is used to infer the differentially expressed genes by considering the model M̃_ij = γ_i + ε_ij. Estimating a semiparametric model with this two-stage strategy requires the important assumption that there is no random measurement error in A_ij. Unfortunately, the observed abundance A of gene expression cannot be the true abundance in the corresponding spot as long as M includes measurement errors. Another potential risk lies in the assumption that, even without measurement error in the abundance, the estimate of the individual gene effects from the above two-stage strategy is unbiased. Let γ = (γ_1, ..., γ_n)^T and stack M_ij, A_ij, ε_ij, and h_{T_ij}(A_ij) into M, A, ε, and H(A), respectively. The above semiparametric model can be rewritten as a partially linear model,

    M = H(A) + Wγ + ε,

where W is the design matrix indicating which gene is observed for each spot. Let P_A be a smoother matrix which transforms M into fitted values M̂ = P_A M when γ = 0. Speckman (1988) showed that projecting M and W into M̃ = (I − P_A)M and W̃ = (I − P_A)W provides the estimate γ̂ = (W̃^T W̃)^{-1} W̃^T M̃, which avoids the bias problem noted by Rice (1986). However, the above two-stage procedure estimates the gene effects by γ̂ = (W^T W)^{-1} W^T M̃, which may have an asymptotically nonnegligible bias when A and W are correlated (Rice, 1986).

For the j-th replication of the i-th gene, assume log2 R_ij = r_ij + ε_{rij} and log2 G_ij = g_ij + ε_{gij}, where r_ij and g_ij are the ideal log-intensities of the individual channels and ε_{rij} and ε_{gij} are the measurement errors. Throughout this article, we assume ε_{rij} and ε_{gij} are normally distributed with var(ε_{rij}) = var(ε_{gij}). Therefore, we have A_ij = (r_ij + g_ij)/2 + ξ_ij and M_ij = (r_ij − g_ij) + ε_ij, where ξ_ij = (ε_{rij} + ε_{gij})/2 is independent of ε_ij = ε_{rij} − ε_{gij} even though ε_{rij} and ε_{gij} may be correlated. With the true abundance ι_ij = (r_ij + g_ij)/2, a reasonable model for both normalization and identification is
    M_ij = h(ι_ij) + Z_ij^T η + γ_i + ε_ij,
    A_ij = ι_ij + ξ_ij,                                          (1)
where Z_ij incorporates all covariates affecting the log-differential expression ratio, such as the spatial contamination caused by print-tips and scanners, η includes all corresponding coefficients, and the measurement errors ε_ij and ξ_ij are assumed to be independently and identically distributed as N(0, σ_ε²) and N(0, σ_ξ²), respectively. Instead of the within-print-tip-group local normalization of Dudoit et al. (2002), we can incorporate the design information of microarrays, which is much easier to retrieve, into Z_ij to account for spatial differences in the gene expressions brought about by differences between print-tip sizes, scanning effects, etc. In practice, more complex models may be developed to include other covariates for A. For example, the aforementioned print-tip effect and other spatial effects may be incorporated into
a model for ι. Throughout this article, we assume that ε_{rij} and ε_{gij} are independently and identically distributed and therefore, based on the definitions of A and M, σ_ε = 2σ_ξ, which guarantees the identifiability of the model in (1). A likelihood approach to estimating the gene effects in the above measurement-error model is difficult, since the dimension of the parameter space is on the order of thousands and h(·) is an unknown nonlinear function. We therefore develop a Bayesian framework to fit the measurement-error model in (1) in the next section. The Bayesian approach is much more natural, since it not only enables us to handle the above measurement-error model through appropriate prior distributions but also enables us to efficiently draw valid conclusions from a small number of samples relative to the large number of parameters.
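The identifiability constraint σ_ε = 2σ_ξ follows directly from the definitions: with independent channel errors of common variance σ₀², var(ε) = var(ε_r − ε_g) = 2σ₀², while var(ξ) = var((ε_r + ε_g)/2) = σ₀²/2, so σ_ε² = 4σ_ξ². A quick Monte Carlo sanity check of this relationship (illustrative only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma0 = 0.3                              # common channel-error sd
er = rng.normal(0, sigma0, 1_000_000)     # red-channel errors
eg = rng.normal(0, sigma0, 1_000_000)     # green-channel errors

eps = er - eg       # error entering M = log2(R/G)
xi = (er + eg) / 2  # error entering A = (1/2) log2(RG)

ratio = np.var(eps) / np.var(xi)          # should be close to 4
corr = np.corrcoef(eps, xi)[0, 1]         # should be close to 0
```

The near-zero correlation confirms the independence of ε and ξ claimed in the text, which holds whenever the two channel errors have equal variances.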
3. BAYESIAN INFERENCE FOR DIFFERENTIAL GENE EXPRESSION DATA

Consider the semiparametric measurement-error model proposed for differential gene expression data,

    M_ij = h(ι_ij) + γ_i + Z_ij^T η + ε_ij,   ε_ij ~ iid N(0, σ_ε²),
    A_ij = ι_ij + ξ_ij,                        ξ_ij ~ iid N(0, σ_ε²/4).      (2)

We use a penalized splines (or simply P-splines) approach to approximate the unknown nonparametric function h(·). See Eilers and Marx (1996) for a discussion of the many benefits of this method. A convenient basis (the truncated power basis), with given knots (t_1, t_2, ..., t_κ), is chosen as

    B(x) = (1, x, x², ..., x^d, (x − t_1)_+^d, ..., (x − t_κ)_+^d)^T.

When h(·) is a smooth function, with appropriately chosen knots (t_1, ..., t_κ), spline degree d, and coefficients β = (β_0, β_1, ..., β_{d+κ})^T, B(x)^T β can approximate h(x) well enough to ignore the estimation error. Therefore, we can focus on developing a Bayesian inference procedure for the following model:

    M_ij = B(ι_ij)^T β + γ_i + Z_ij^T η + ε_ij,   ε_ij ~ N(0, σ_ε²),
    A_ij = ι_ij + ξ_ij,                            ξ_ij ~ N(0, σ_ε²/4),
where 1 ≤ j ≤ m_i and 1 ≤ i ≤ n. Fully Bayesian inference on nonparametric measurement-error models was considered by Berry et al. (2002). Here, we establish fully Bayesian inference on our proposed semiparametric measurement-error model to simultaneously normalize the microarray data and identify the differentially expressed genes.
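The truncated power basis B(x) above is straightforward to construct; a minimal sketch (function name and the quantile-based knot placement, anticipated in Section 3.2, are ours):

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """B(x) = (1, x, ..., x^d, (x-t_1)_+^d, ..., (x-t_k)_+^d):
    the truncated power basis for a degree-d P-spline."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, degree + 1, increasing=True)  # 1, x, ..., x^d
    trunc = np.maximum(x[:, None] - np.asarray(knots), 0.0) ** degree
    return np.hstack([poly, trunc])

# Knots at equally spaced quantiles of the observed log-intensities A.
A = np.random.default_rng(2).normal(10, 3, 500)
knots = np.quantile(A, [0.25, 0.5, 0.75])
B = truncated_power_basis(A, knots)   # shape (500, (d + 1) + kappa) = (500, 7)
```

Each row of `B` is B(A_ij)^T, so the spline part of the model is simply the matrix-vector product `B @ beta`.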
3.1. Prior distributions

Let β = (β_1^T, β_2^T)^T, where β_1 includes the first d + 1 coefficients for the polynomial part and β_2 includes the last κ coefficients for the nonpolynomial part of the spline regression. We set up the prior distributions

    β_1 ~ N(0, λ_1^{-1} I_{(d+1)×(d+1)}),   β_2 ~ N(0, λ_2^{-1} I_{κ×κ}),

where λ_1 = α_1/σ_ε² and λ_2 = α_2/σ_ε². This prior is related to penalized least squares (or penalized likelihood) estimation with the penalty term (α_1/N) β_1^T β_1 + (α_2/N) β_2^T β_2, where N is the total number of observations. Although, in their Bayesian inference for nonparametric measurement-error models, Berry et al. (2002) suggest a diffuse prior on the polynomial coefficients β_1 by letting α_1 = 0, our experience in microarray studies suggests that a nondiffuse prior on β_1 is better for the semiparametric measurement-error models. Note that γ_i is the effect of the i-th gene, and our primary interest is to test whether each individual gene effect is significantly different from zero, i.e., the hypothesis

    H_0: γ_i = 0   versus   H_1: γ_i ≠ 0
for each i. Since most genes are not differentially expressed, we assume a priori that H_0 holds with nonzero probability 1 − p; i.e., P(γ_i = 0) = 1 − P(γ_i ≠ 0) = 1 − p. Although an appropriate heavy-tailed prior for nonzero γ_i may make the Bayes estimator of γ_i minimax, we assume it to be normally distributed for the sake of convenient calculation with conjugate normal priors. Hence, γ_i is assumed to be distributed a priori as a mixture of a mass at zero and a normal distribution,

    γ_i ~ indep (1 − p) δ_0 + p N(0, σ_γ²),                                  (3)
where δ_0 is the Dirac delta function at zero. This mixture prior not only accounts for the nature of differential expression among genes, as discussed later in Section 5, but also alleviates multiple hypothesis testing issues while averaging over models. This prior is essentially the one used by Lönnstedt and Speed (2002); however, they took an empirical Bayes approach rather than the fully Bayesian approach taken here. Clearly, the γ_i corresponding to each housekeeping gene should be set to zero.

Since the true abundance ι is the location parameter of the distribution of the observed abundance, the observed abundance A_ij is distributed as normal with mean ι_ij under the usual normality assumption on the measurement errors. We therefore suggest using the Jeffreys reference prior for ι_ij, i.e., π(ι_ij) ∝ 1, which is an improper noninformative prior. Although Berry et al. (2002) suggest a fixed proper prior for this uncontaminated latent variable, here the true abundance varies too widely to be modeled this way.

Assume the spatial covariates Z_ij can be partitioned into S groups, i.e., Z_ij = (Z_{ij1}^T, ..., Z_{ijS}^T)^T, based on their characteristics. Then η is accordingly partitioned into S vectors η_j, j = 1, 2, ..., S. Each component parameter vector η_j is assumed to have the conjugate prior N(0, ϕ_j^{-1} I_{s_j×s_j}), with I_{s_j×s_j} being an s_j × s_j identity matrix.

For simplicity, conjugate priors are selected for the parameters p and σ_ε² and for the hyperparameters σ_γ², λ_1, λ_2, and ϕ_s, s = 1, 2, ..., S. The prior of p is set as Beta(θ_p, φ_p), and the priors of σ_ε^{-2}, σ_γ^{-2}, λ_1, λ_2, and ϕ_s, s = 1, 2, ..., S, are set as Gamma distributions with parameters (θ_ε, φ_ε), (θ_γ, φ_γ), (θ_{1λ}, φ_{1λ}), (θ_{2λ}, φ_{2λ}), and (θ_{sϕ}, φ_{sϕ}), respectively. For generality, Gamma(0, ∞) is defined as the improper density of a positive τ proportional to 1/τ, which is the Jeffreys reference prior for the variance of a normal distribution.
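Drawing from the mixture prior (3) is simple; the sketch below (the hyperparameter values and the function name are illustrative, not the paper's) makes the spike-and-slab structure explicit:

```python
import numpy as np

def sample_gamma_prior(n, p=0.05, sigma_gamma=1.0, rng=None):
    """Draw n gene effects from the spike-and-slab prior (3):
    gamma_i = 0 with probability 1 - p, else gamma_i ~ N(0, sigma_gamma^2)."""
    if rng is None:
        rng = np.random.default_rng()
    nonzero = rng.random(n) < p              # slab (nonzero) indicators
    gamma = np.zeros(n)                      # the spike: exact zeros
    gamma[nonzero] = rng.normal(0.0, sigma_gamma, nonzero.sum())
    return gamma

gamma = sample_gamma_prior(10_000, p=0.05, rng=np.random.default_rng(3))
```

About a fraction p of the drawn effects are nonzero, mirroring the prior belief that only a small proportion of genes are differentially expressed.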
3.2. Implementation of a Gibbs sampler

Although only p and γ = (γ_1, ..., γ_n)^T are of primary interest, their posterior distributions are difficult to calculate due to the many nuisance parameters involved. Because the full conditionals are available, we use the Gibbs sampling algorithm, which is developed fully in Appendix A. The full conditional distributions of all parameters and hyperparameters, except those of the ι_ij, have convenient forms, so it is straightforward to obtain random draws of these parameters from the corresponding full conditional distributions. For the ι_ij, a Metropolis–Hastings algorithm nested within the Gibbs sampler, also shown in Appendix A, can be developed to obtain random draws from the full conditional distribution.

Usually, setting the number of knots, κ, to at most 8 provides an excellent approximation by cubic spline functions (i.e., d = 3) to the nonlinear function h in our microarray studies. The knots can then be chosen at equally spaced quantiles of the observed log-intensities A_ij. While it is feasible to choose θ_p = φ_p to be 0.5 or an even smaller value, all other unspecified hyperparameters can be set as θ_ε = θ_γ = θ_{1λ} = θ_{2λ} = θ_{jϕ} = 0 and φ_ε = φ_γ = φ_{1λ} = φ_{2λ} = φ_{jϕ} = ∞, j = 1, 2, ..., S, which corresponds to noninformative priors on the corresponding parameters. The hyperparameter values can be adjusted accordingly whenever prior information is available.

Choosing α_1 = 1 and α_2 = 0.1, we can roughly estimate β, γ, η, σ_ε², and σ_γ² by assuming no measurement errors in the log-intensities A_ij. We suggest using these estimates to set up the initial values β^(0), γ^(0), η^(0), σ_ε^(0)², σ_γ^(0)², and λ_1^(0) = α_1/σ_ε^(0)², λ_2^(0) = α_2/σ_ε^(0)². The initial value ϕ^(0) can be set based on η^(0), and p^(0) can be set as the proportion of the γ_i^(0) whose absolute values are larger than 3σ_γ^(0). As for the initial values of ι_ij, we simply set ι_ij^(0) = A_ij.
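For intuition, here is a hedged sketch of one random-walk Metropolis–Hastings update for a single latent ι_ij; the exact full conditionals are in the paper's Appendix A, while this code simply assumes the flat prior π(ι) ∝ 1, fixes all other parameters at their current values, and uses a step size and function names of our own choosing:

```python
import numpy as np

def mh_step_iota(iota, A_ij, resid_target, beta, knots, sigma_eps,
                 step=0.2, rng=None, degree=3):
    """One random-walk Metropolis-Hastings update for the latent true
    abundance iota_ij. Under the flat prior its full conditional is
    proportional to
      N(M_ij - gamma_i - Z'eta | B(iota)'beta, sigma_eps^2)
        * N(A_ij | iota, sigma_eps^2 / 4),
    where resid_target = M_ij - gamma_i - Z_ij'eta is held fixed."""
    if rng is None:
        rng = np.random.default_rng()

    def basis(x):  # truncated power basis evaluated at a scalar x
        powers = np.array([x ** k for k in range(degree + 1)])
        trunc = np.maximum(x - np.asarray(knots), 0.0) ** degree
        return np.concatenate([powers, trunc])

    def log_post(x):  # log full conditional up to an additive constant
        mu = basis(x) @ beta
        return (-0.5 * (resid_target - mu) ** 2 / sigma_eps ** 2
                - 0.5 * (A_ij - x) ** 2 / (sigma_eps ** 2 / 4))

    prop = iota + rng.normal(0, step)          # symmetric proposal
    if np.log(rng.random()) < log_post(prop) - log_post(iota):
        return prop                            # accept
    return iota                                # reject: keep current value
```

In the full sampler this step would be applied to every spot within each Gibbs scan, after the conjugate updates of β, γ, η, and the variance parameters.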
3.3. Identification of differentially expressed genes

From the viewpoint of model selection, identification of differentially expressed genes is equivalent to choosing marginally, for the i-th gene, between the model M_0^(i): γ_i = 0 and the model M_1^(i): γ_i ≠ 0.
Therefore, we have to select one of the following 2^n models for each microarray study with n genes:

    M_c = ∩_{i=1}^{n} M_{c_i}^(i),
where c = (c_1, ..., c_n) with binary indicators c_i. While a single marginal model selection for each gene can be easily implemented even when there are no convenient posterior distributions, the multiple model selections may require explosively intensive computation. Instead of fitting each compound model M_c, the mixture prior in (3) makes it possible to carry out the multiple model selections simultaneously from a single run of MCMC, while also averaging over all other models when testing each hypothesis H_0^(i). It is obvious that the posterior distribution of γ_i is still a mixture of a mass at zero and an absolutely continuous distribution. For simplicity, we refer to the nonzero part of the posterior of γ_i as the alternative posterior of γ_i, denoted F_i(·) = P(γ_i ≤ · | γ_i ≠ 0, Data). Correspondingly, the null posterior of γ_i is a mass at zero. Then, using the tail-area probability of F_i(·), we define the differentiation score (or simply d-score) to be

    d_i = Φ^{-1}(1 − F_i(0)),

where Φ^{-1}(·) is the inverse cumulative distribution function of the standard normal. The d-score equals the standardized mean value of γ_i if the alternative posterior F_i(·) is normally distributed, which may be derived from the asymptotic normality of posteriors in Bayesian theory. As an extension of credible regions defined in Bayesian inference, the d-score measures the distance between different models (the model under the null hypothesis and the model under the alternative hypothesis) after incorporating information from the observed data, i.e., the distance between the null posterior and the alternative posterior F_i(·). This new metric allows the Bayesian test to be understood within the framework of a classical hypothesis test, in terms of type I and type II errors, in contrast to the ad hoc threshold used for the Bayes factor in a Bayesian test.
As shown in Appendix B, this new metric is closely related to the Bayes factor, but it avoids Lindley’s paradox (Lindley, 1957) caused by using the Bayes factor. Indeed, when diffuse priors are used, the d-score is exactly the z-score used in frequentist hypothesis testing, which is as ideal as we can expect, since Bayesian inference on a location parameter with a diffuse prior should be equivalent to its admissible frequentist inference. In the study of spotted microarray data, the d-score for the i-th gene, d_i, should be distributed as a standard normal if the i-th gene is not differentially expressed (i.e., M_0^(i) holds). This d-score will be used to decide whether the corresponding gene is differentially expressed. The false discovery rate (FDR) can therefore be controlled by the method of Benjamini and Hochberg (1995) through the p-values of the d-scores. Once we have determined the i-th gene to be differentially expressed, its differential expression level is estimated by ∫ γ dF_i(γ), which can be approximated using the nonzero values in the Markov chain of γ_i from the Gibbs sampler shown in Appendix A.
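The d-score and FDR control described above can be sketched from the MCMC output as follows. This is illustrative code of our own (in particular, clipping F_i(0) away from 0 and 1 is our numerical safeguard, not part of the paper's definition):

```python
import numpy as np
from statistics import NormalDist

def d_score(gamma_draws):
    """d_i = Phi^{-1}(1 - F_i(0)), where F_i is the posterior of gamma_i
    restricted to its nonzero (slab) MCMC draws; exact zeros are the spike."""
    nz = gamma_draws[gamma_draws != 0]
    if nz.size == 0:
        return 0.0                       # no slab draws: no evidence of DE
    F0 = np.mean(nz <= 0)                # tail-area probability F_i(0)
    # keep F0 strictly inside (0, 1) so the inverse CDF stays finite
    F0 = min(max(F0, 1 / (nz.size + 1)), nz.size / (nz.size + 1))
    return NormalDist().inv_cdf(1 - F0)

def bh_reject(pvals, q=0.1):
    """Benjamini-Hochberg step-up procedure at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Given gamma_chains (one array of MCMC draws per gene):
#   d = np.array([d_score(chain) for chain in gamma_chains])
#   pvals = 2 * (1 - np.array([NormalDist().cdf(abs(x)) for x in d]))
#   de_genes = bh_reject(pvals, q=0.1)
```

The two-sided p-values in the final comment use the fact that d_i is approximately standard normal under M_0^(i).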
4. SIMULATION STUDY

In order to study the performance of our proposed Bayesian framework and the d-score defined above, we simulate 100 groups of data from the following data generating process:

    M_ij = 1 + 3 sin(ι_ij/3 + 2.75) + γ_i + ε_ij,   ε_ij ~ N(0, 0.3²),
    A_ij = ι_ij + ξ_ij,                              ξ_ij ~ N(0, 0.15²),      (4)

where ι_ij is sampled from N(10, 3²) with range restricted to [5, 15]. In each dataset, there are 150 genes in total and two observations per gene, i.e., m_i ≡ 2 and n = 150. We artificially set 10 genes as differentially expressed, with signal–noise ratios of 2, 3, 4, 6, and 8, respectively (see Table 1 for the corresponding γ_i). Bayesian inference for these simulated data is done by choosing the hyperparameters as follows: θ_p = φ_p = 0.01, θ_ε = θ_γ = θ_{1λ} = θ_{2λ} = 0, and φ_ε = φ_γ = φ_{1λ} = φ_{2λ} = ∞. A cubic spline with three equally spaced knots is used to approximate the nonlinear function. The identification results are summarized in Table 1. Under the proposed fully Bayesian framework, each estimator of γ_i performs surprisingly well. To control errors, we consider two different approaches, i.e., controlling the FDR by the method of Benjamini
Table 1. Bayesian Analysis of the Simulated Data Assuming Measurement Errors in Both M and A

                                          FDR                        α (Bonferroni)
  γ_i              γ̂_i (s.e.)         0.1^a    0.01^a   0.001^a     0.1^a    0.01^a
  γ_10 = 0.6       0.5604 (0.2347)     46       19        8          16        8
  γ_20 = −0.6     −0.5706 (0.2370)     46       13        1           8        1
  γ_30 = 0.9       0.8644 (0.2381)     84       53       27          49       25
  γ_40 = −0.9     −0.8822 (0.2181)     88       59       29          53       25
  γ_50 = 1.2       1.1957 (0.2191)     99       95       79          94       76
  γ_60 = −1.2     −1.1446 (0.2179)     99       89       75          88       75
  γ_70 = 1.8       1.7562 (0.2228)    100      100      100         100      100
  γ_80 = −1.8     −1.7315 (0.2260)    100      100      100         100      100
  γ_90 = 2.4       2.3800 (0.2375)    100      100      100         100      100
  γ_100 = −2.4    −2.3493 (0.2141)    100      100      100         100      100
  γ_i = 0          0.0010 (0.2459)  202/140   41/140   15/140      29/140   15/140

^a The numbers shown under these columns are the number of times the corresponding genes were correctly identified as differentially expressed in the 100 datasets. In the last row, the corresponding values are averaged over the 140 genes that are not differentially expressed.
and Hochberg (1995), and controlling the type I error by Bonferroni adjustment. For both approaches, the strongly differentially expressed genes, i.e., the four genes with signal–noise ratios of 6 or 8, are always identified. The smaller the signal–noise ratio of a gene, the more difficult it is to identify. It is interesting to observe that the result of controlling the FDR at 0.01 (or 0.001) is very similar to that of controlling α = 0.1 (or α = 0.01) using Bonferroni’s approach. Indeed, it will always be the case that controlling the α-level using the Bonferroni approach is equivalent to controlling the FDR at a certain level by the method of Benjamini and Hochberg (1995), since both methods are based on finding cutoffs for p-values.

For comparison, the same datasets are reanalyzed using the fully Bayesian approach with the same strategy as above but ignoring the measurement errors in A. The identification results are summarized in Table 2. Comparing Table 2 with Table 1, we observe that ignoring measurement errors in A results in more false positives, which is extremely undesirable in a microarray study. In both tables, all parameters are slightly underestimated (in absolute value). This is expected with the Bayesian approach because it shrinks the estimates toward zero to account for model uncertainty (this is analogous to the advantage of James–Stein estimators in high-dimensional parameter spaces).

Table 2. Bayesian Analysis of the Simulated Data Assuming Measurement Errors in M Only

                                          FDR                        α (Bonferroni)
  γ_i              γ̂_i (s.e.)         0.1^a    0.01^a   0.001^a     0.1^a    0.01^a
  γ_10 = 0.6       0.5293 (0.2364)     86       64       43          51       39
  γ_20 = −0.6     −0.5511 (0.2179)     91       70       49          59       41
  γ_30 = 0.9       0.8545 (0.2515)     98       94       89          93       88
  γ_40 = −0.9     −0.8662 (0.2245)     97       96       89          94       87
  γ_50 = 1.2       1.1935 (0.2201)    100      100      100         100      100
  γ_60 = −1.2     −1.1408 (0.2225)    100      100      100         100      100
  γ_70 = 1.8       1.7533 (0.2222)    100      100      100         100      100
  γ_80 = −1.8     −1.7245 (0.2319)    100      100      100         100      100
  γ_90 = 2.4       2.3785 (0.2407)    100      100      100         100      100
  γ_100 = −2.4    −2.3445 (0.2174)    100      100      100         100      100
  γ_i = 0          0.0021 (0.2295)  2875/140  773/140  293/140     473/140  210/140

^a The numbers shown under these columns are the number of times the corresponding genes were correctly identified as differentially expressed in the 100 datasets. In the last row, the corresponding values are averaged over the 140 genes that are not differentially expressed.
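The data generating process in (4) can be sketched as follows (the truncation of ι_ij to [5, 15] is implemented here by resampling, an assumption on our part since the paper does not specify the mechanism; the indices of the differentially expressed genes follow Table 1):

```python
import numpy as np

def simulate_dataset(rng, n_genes=150, reps=2, sigma_eps=0.3, sigma_xi=0.15):
    """One dataset from the data generating process (4):
      M_ij = 1 + 3 sin(iota_ij/3 + 2.75) + gamma_i + eps_ij,
      A_ij = iota_ij + xi_ij,  iota_ij ~ N(10, 3^2) restricted to [5, 15]."""
    gamma = np.zeros(n_genes)
    # ten DE genes (gamma_10, gamma_20, ..., gamma_100 of Table 1; 0-based here)
    de = [9, 19, 29, 39, 49, 59, 69, 79, 89, 99]
    gamma[de] = [0.6, -0.6, 0.9, -0.9, 1.2, -1.2, 1.8, -1.8, 2.4, -2.4]

    shape = (n_genes, reps)
    iota = rng.normal(10, 3, shape)
    while True:                 # enforce the range restriction by resampling
        out = (iota < 5) | (iota > 15)
        if not out.any():
            break
        iota[out] = rng.normal(10, 3, out.sum())

    M = (1 + 3 * np.sin(iota / 3 + 2.75)
         + gamma[:, None] + rng.normal(0, sigma_eps, shape))
    A = iota + rng.normal(0, sigma_xi, shape)
    return M, A, gamma

M, A, gamma = simulate_dataset(np.random.default_rng(5))
```

Repeating this 100 times with independent generators reproduces the structure (though of course not the exact fitted numbers) of the simulation study summarized in Tables 1 and 2.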
5. APPLICATION

We apply our Bayesian procedure for normalization and identification of differential gene expression data to one array (2,723 genes with two spots for each gene, after discarding genes with bad-quality spots) in a spotted microarray study of potato late blight (see Fig. 1 for the M–A plot). The goal of this study was to determine the subset of genes that are either induced or repressed in a plant while a pathogen is attacking. To do this, we used a compatible interaction between Phytophthora infestans, an oomycete pathogen that causes a disease known as late blight, and potato. An efficient pathogen of potato, P. infestans can kill a plant within seven days (Fry and Goodwin, 1997). In order to identify the largest subset of genes that might be turned on during the infection process, we collected tissue 72 hours after inoculation, since many plant defense response genes have previously been shown to be induced by this time (Vleeshouwers et al., 2000; Smart et al., 2003).

The upper panel of Fig. 2 shows the original log-differential expression ratios against gene IDs. An obvious periodic pattern appears. Since the ID of a gene is assigned based on its location on the array (the column and row within the print-tip group, so its location refers to its column, row, and group), the periodic pattern indicates spatial contamination caused by the printer, the scanner, etc. However, the periodic pattern disappears in the lower panel of Fig. 2, where the differences across columns, rows, and groups have simply been averaged out. Therefore, it is enough to use the column, row, and group as covariates (incorporated into Z in our proposed model) to account for the spatial contamination in our spotted microarray study. A cubic spline with three equally spaced knots is used to approximate the nonlinear function modeling the relation between the true abundance and the log-differential expression ratio.
The proposed Bayesian framework is used to analyze this array. The results are shown in Fig. 3. The upper panel shows the fitted nonlinear curve (M − Z^T η versus ι), and the lower panel shows the identification results, plotting d_i against −log_10{P(γ_i = 0 | Data)}. Controlling the FDR at 0.1 using the procedure proposed by Benjamini and Hochberg (1995), seven genes are identified as up-regulated (marked as circles); controlling the FDR at 0.01, only one gene is identified as up-regulated (the leftmost one, marked as a cross); no gene is identified as down-regulated under either of these two criteria. Alternatively, three genes are identified as up-regulated if we control the α-level at 0.1, and again only one gene if we control the α-level at 0.01; no gene is identified as down-regulated in either case (not shown here). For comparison, we also apply the approach proposed by Newton et al. (2001) to the same data. For that analysis, both M and A are averaged for each gene after averaging out the spatial contamination and are then normalized by the LOWESS smoother with smoothing parameter f = 0.3. The M–A plot with the fitted nonlinear curve and the identification results are shown in Fig. 4. There are 4 genes whose posterior odds ratios are larger than 100, 15 genes whose posterior odds ratios are larger than 10, and 59 genes whose posterior odds ratios are larger than 1.
FIG. 1. The M–A plot for the P. infestans data.
FIG. 2. M versus ID for the P. infestans data: before (upper panel) and after (lower panel) averaging out differences of M across columns, rows, and groups within the array.
FIG. 3. Bayesian analysis for the P. infestans data: (a) the fitted nonlinear curve, with the spatial contamination removed from M, plotted against the true abundance ι; (b) circled genes are detected by controlling the FDR at 0.1, and the crossed gene at the top left is detected by controlling the FDR at 0.01.
FIG. 4. Results from Newton et al.’s approach after LOWESS smoothing for the P. infestans data: (a) the M–A plot with the fitted nonlinear curve; both M and A are averaged for each gene after removing the spatial contamination; (b) the genes detected using Newton et al.’s approach with log-odds ratios of 0, 1, and 2, respectively. The squared genes are detected by the Bayesian approach with the FDR at 0.1, and the crossed gene at the bottom is detected by the Bayesian approach with the FDR at 0.01.
Compared to the approach proposed by Newton et al. (2001), our proposed approach makes it possible to control the false discovery rate. It is also flexible enough to incorporate covariates in modeling the log-differential expression ratio, and it paves a natural way toward pooling the results from single-array analyses.
6. DISCUSSION

The commonly accepted intensity-dependent normalization tries to remove systematic errors from spotted microarray data under the assumption that these errors manifest in the dependence of the differential intensities on the total intensities. Ignoring measurement errors in the observed total intensities, or normalizing the data without providing satisfactory statistics for identification, essentially invalidates all subsequent analyses. As shown in the simulation study, ignoring the measurement errors in the observed total intensities significantly increases false positives in identification. Theoretically, pursuing a two-step procedure that carries out normalization and identification separately may also induce the risks stated in Section 2. Although the statistical analysis of microarray data usually serves as an exploratory tool that provides candidate genes for further biological investigation, unreliable results not only increase research costs but may also invalidate the subsequent investigations. For the proposed semiparametric measurement-error models, selection bias arises in the classical approach, which selects significant gene effects and then estimates them from the same data (see Miller, 1990). Instead, a Bayesian approach provides conservative and robust model-averaged estimates of the gene effects by shrinkage. This is especially important for pooling analyses from multiple arrays and for investigating time-course changes of differential gene expression.
With the assumption that $\sigma_\epsilon^2 = 4\sigma_\xi^2$, a spline-smoothing version of the semiparametric measurement-error model in (1) is identifiable when either $n_i \ge 2$ for some $i$ or sufficiently many housekeeping genes are available. Otherwise, we can fix $p$ at some sensible value such as 0.01 or 0.001, as suggested by Lönnstedt and Speed (2002). For heteroscedastic microarray data, we may further model the variances of the measurement errors by log-splines, which will be investigated in future work. With the mixture prior for $\gamma_i$ as in (3), the d-score $d_i$ can be used to test the hypothesis that the $i$-th gene is not differentially expressed. We can therefore control the false discovery rate by calculating the corresponding p-values (Benjamini and Hochberg, 1995). Alternatively, Newton et al. (2004) recently proposed a natural method for mixture models that controls the FDR by estimating it, conditional on the data, from the posterior probabilities. It would also be interesting to compare the performance of these two approaches in future work.
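The posterior-probability route mentioned above can be operationalized in a standard way: rank genes by their posterior null probability $P(\gamma_i = 0\,|\,\mathrm{Data})$ and flag genes as long as the running average of these null probabilities (an estimate of the realized FDR among the flagged set) stays below the target level. The sketch below is our illustration of that idea, not code from the paper:

```python
import numpy as np

def posterior_fdr_calls(p_null, alpha=0.1):
    """Flag genes as differentially expressed while the estimated FDR
    (the average posterior null probability among flagged genes)
    remains at or below `alpha`."""
    p = np.asarray(p_null, dtype=float)
    order = np.argsort(p)                     # most-likely-alternative genes first
    cum_fdr = np.cumsum(p[order]) / np.arange(1, p.size + 1)
    flagged = np.zeros(p.size, dtype=bool)
    ok = np.nonzero(cum_fdr <= alpha)[0]
    if ok.size:
        flagged[order[:ok.max() + 1]] = True  # flag the largest admissible set
    return flagged
```

With posterior null probabilities of (0.01, 0.05, 0.2, 0.8) and α = 0.1, the first three genes are flagged, since their average null probability is about 0.087.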
APPENDIX A. IMPLEMENTATION OF THE GIBBS SAMPLER

The complete joint distribution for the model defined in Section 3 can be written as
$$[M, A, \iota, \beta, \gamma, \eta, \sigma_\epsilon^2, \lambda_1, \lambda_2, \varphi, \sigma_\gamma^2, p] = [M\,|\,\beta, \gamma, \eta, \sigma_\epsilon^2, \iota] \times [A\,|\,\sigma_\epsilon^2, \iota] \times [\beta\,|\,\lambda_1, \lambda_2] \times [\lambda_1, \lambda_2] \times [\sigma_\epsilon^2] \times [\gamma\,|\,\sigma_\gamma^2, p] \times [\sigma_\gamma^2] \times [\eta\,|\,\varphi] \times [\varphi] \times [p] \times [\iota].$$
Let $D_1 = \mathrm{diag}(1_{1\times(d+1)}, 0_{1\times\kappa})$ and $D_2 = \mathrm{diag}(0_{1\times(d+1)}, 1_{1\times\kappa})$. Then the posterior distribution is as follows:
$$\begin{aligned}
[\iota, \beta, \gamma, \eta, \sigma_\epsilon^2, \lambda_1, \lambda_2, \varphi, \sigma_\gamma^2, p\,|\,M, A]
\propto{}& \exp\Biggl\{-\frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}\bigl[M_{ij} - B(\iota_{ij})^T\beta - \gamma_i - Z_{ij}^T\eta\bigr]^2 - \frac{1}{2}\beta^T(\lambda_1 D_1 + \lambda_2 D_2)\beta - \frac{1}{2}\sum_{j=1}^{S}\varphi_j\,\eta_j^T\eta_j \\
&\qquad - \frac{2}{\sigma_\epsilon^2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}(A_{ij}-\iota_{ij})^2 - \frac{\lambda_1}{\phi_{1\lambda}} - \frac{\lambda_2}{\phi_{2\lambda}} - \sum_{j=1}^{S}\frac{\varphi_j}{\phi_{j\varphi}} - \frac{1}{\phi_\gamma\sigma_\gamma^2} - \frac{1}{\phi_\epsilon\sigma_\epsilon^2}\Biggr\} \\
&\times \lambda_1^{(d+1)/2+\theta_{1\lambda}-1} \times \lambda_2^{\kappa/2+\theta_{2\lambda}-1} \times \prod_{j=1}^{S}\varphi_j^{s_j/2+\theta_{j\varphi}-1} \\
&\times \sigma_\gamma^{-2\theta_\gamma-2} \times \sigma_\epsilon^{-2N-2\theta_\epsilon-2} \times p^{\theta_p-1}(1-p)^{\phi_p-1} \times \prod_{i=1}^{n}[\gamma_i\,|\,\sigma_\gamma^2, p].
\end{aligned}$$
A.1. Full conditionals for β and γ

Let $\tilde M_{ij} = M_{ij} - Z_{ij}^T\eta$. Then the joint full conditional of $\beta$ and $\gamma$ can be written as
$$[\beta, \gamma\,|\,M, \eta, \iota, \sigma_\epsilon^2, \lambda_1, \lambda_2, \sigma_\gamma^2, p] \propto \exp\Biggl\{-\frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}\bigl[\tilde M_{ij} - B(\iota_{ij})^T\beta - \gamma_i\bigr]^2 - \frac{1}{2}\beta^T(\lambda_1 D_1 + \lambda_2 D_2)\beta\Biggr\} \times \prod_{i=1}^{n}\Biggl\{p\,(2\pi\sigma_\gamma^2)^{-1/2}\exp\Bigl(-\frac{\gamma_i^2}{2\sigma_\gamma^2}\Bigr)I[\gamma_i \ne 0] + (1-p)\,I[\gamma_i = 0]\Biggr\}.$$
Further, let $B_{ij} = B(\iota_{ij})$, and define
$$\mathbf{B} = (B_{11}, \ldots, B_{1m_1}, \ldots, B_{n1}, \ldots, B_{nm_n})^T, \qquad \tilde{\mathbf{M}} = (\tilde M_{11}-\gamma_1, \ldots, \tilde M_{1m_1}-\gamma_1, \ldots, \tilde M_{n1}-\gamma_n, \ldots, \tilde M_{nm_n}-\gamma_n)^T.$$
We have the full conditional of $\beta$,
$$\beta\,|\,\gamma, M, \eta, \iota, \lambda_1, \lambda_2, \sigma_\epsilon^2 \sim N\Bigl((\mathbf{B}^T\mathbf{B} + \sigma_\epsilon^2 D)^{-1}\mathbf{B}^T\tilde{\mathbf{M}},\; \sigma_\epsilon^2(\mathbf{B}^T\mathbf{B} + \sigma_\epsilon^2 D)^{-1}\Bigr),$$
where $D = \lambda_1 D_1 + \lambda_2 D_2$. And the full conditional of $\gamma_i$, $1 \le i \le n$, is
$$\gamma_i\,|\,\beta, M, \eta, \iota, \sigma_\epsilon^2, \sigma_\gamma^2, p \sim (1-r_i)\,\delta_0 + r_i\, N\Biggl(\frac{\displaystyle\sum_{j=1}^{m_i}(M_{ij} - B_{ij}^T\beta - Z_{ij}^T\eta)}{m_i + \sigma_\epsilon^2/\sigma_\gamma^2},\; \frac{\sigma_\epsilon^2\sigma_\gamma^2}{m_i\sigma_\gamma^2 + \sigma_\epsilon^2}\Biggr),$$
where the weight $r_i$ is updated as
$$r_i = 1 - \frac{1-p}{1-p + p\Bigl(1 + \dfrac{m_i\sigma_\gamma^2}{\sigma_\epsilon^2}\Bigr)^{-1/2}\exp\Biggl\{\dfrac{\Bigl(\sum_{j=1}^{m_i}[M_{ij} - B(\iota_{ij})^T\beta - Z_{ij}^T\eta]\Bigr)^2}{2(m_i\sigma_\epsilon^2 + \sigma_\epsilon^4/\sigma_\gamma^2)}\Biggr\}}.$$
Denote the mean and variance of the full conditional distribution of a nonzero $\gamma_i$ by $\mu_i$ and $\sigma_i^2$, respectively. It is interesting to observe that the updated nonzero proportion $r_i$ of $\gamma_i$ is determined by $\mu_i$ and $\sigma_i^2$ together with $p$ and $\sigma_\gamma^2$, i.e.,
$$r_i = 1 - \frac{1-p}{1-p + p\exp\{\mu_i^2/(2\sigma_i^2)\}\times\sigma_i/\sigma_\gamma}.$$
A.2. Full conditional for η

Let $V_\eta^{-1} = \mathrm{diag}\{\varphi_1 I_{s_1\times s_1}, \ldots, \varphi_S I_{s_S\times s_S}\}$. Then the full conditional distribution of $\eta$ can be derived from
$$[\eta\,|\,M, \beta, \gamma, \iota, \sigma_\epsilon^2, \varphi] \propto \exp\Biggl\{-\frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}\bigl[M_{ij} - B(\iota_{ij})^T\beta - \gamma_i - Z_{ij}^T\eta\bigr]^2 - \frac{1}{2}\eta^T V_\eta^{-1}\eta\Biggr\}.$$
Redefine
$$\tilde{\mathbf{M}} = (M_{11} - B_{11}^T\beta - \gamma_1, \ldots, M_{1m_1} - B_{1m_1}^T\beta - \gamma_1, \ldots, M_{nm_n} - B_{nm_n}^T\beta - \gamma_n)^T,$$
and let
$$\mathbf{Z} = (Z_{11}, \ldots, Z_{1m_1}, \ldots, Z_{n1}, \ldots, Z_{nm_n})^T.$$
Then we have the full conditional of $\eta$,
$$\eta\,|\,\beta, \gamma, M, \iota, \sigma_\epsilon^2, \varphi \sim N\Bigl((\mathbf{Z}^T\mathbf{Z} + \sigma_\epsilon^2 V_\eta^{-1})^{-1}\mathbf{Z}^T\tilde{\mathbf{M}},\; \sigma_\epsilon^2(\mathbf{Z}^T\mathbf{Z} + \sigma_\epsilon^2 V_\eta^{-1})^{-1}\Bigr).$$
BAYESIAN NORMALIZATION AND IDENTIFICATION
403
A.3. Full conditional for ι

The full conditional for $\iota_{ij}$ is
$$[\iota_{ij}\,|\,M_{ij}, A_{ij}, \sigma_\epsilon^2, \beta, \gamma, \eta] \propto \exp\Biggl\{-\frac{1}{2\sigma_\epsilon^2}\bigl(M_{ij} - B(\iota_{ij})^T\beta - \gamma_i - Z_{ij}^T\eta\bigr)^2 - \frac{2}{\sigma_\epsilon^2}(A_{ij} - \iota_{ij})^2\Biggr\}.$$
Since no convenient distribution is available for drawing $\iota_{ij}$ from its posterior, we consider a Metropolis–Hastings algorithm nested within the Gibbs sampler. Suppose $\iota_{ij}^{(t)}$ has been drawn; we want to draw $\iota_{ij}^{(t+1)}$ from the above full conditional distribution. To construct a good Metropolis–Hastings algorithm, the newly proposed candidate $\tilde\iota_{ij}^{(t)}$ should keep the rejection frequency of the subsequent accept–reject step low so as to accelerate the mixing rate of the sampled chains. An easy strategy is to propose the new candidate from the normal distribution $N(\iota_{ij}^{(t)}, \sigma_\epsilon^2/2)$. Choosing a convenient proposal distribution may, however, conflict with the goal of a low rejection frequency. Gilks et al. (1995) adopt the idea of the accept–reject method and propose adaptive rejection Metropolis sampling (ARMS), which progressively fits a pseudo-envelope to the density function and proposes new candidates from this pseudo-envelope. Although each iteration of ARMS may take much longer than the simple strategy above, it avoids the troublesome search for convenient proposal distributions that guarantee efficient mixing of the sampled chains.
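The simple random-walk strategy above can be sketched as follows. This is our own illustration: `m_resid_fn` is a hypothetical callable returning the residual $M_{ij} - B(\iota)^T\beta - \gamma_i - Z_{ij}^T\eta$ at any candidate $\iota$, so the log target can be evaluated without fixing a spline basis here:

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_update_iota(iota, A_ij, m_resid_fn, sigma2_eps):
    """One random-walk Metropolis step for iota_ij nested within the
    Gibbs sampler, using the N(iota, sigma2_eps/2) proposal."""
    def log_target(x):
        # log of the full conditional up to an additive constant
        return (-m_resid_fn(x) ** 2 / (2 * sigma2_eps)
                - 2 * (A_ij - x) ** 2 / sigma2_eps)
    cand = rng.normal(iota, np.sqrt(sigma2_eps / 2))   # symmetric proposal
    log_alpha = log_target(cand) - log_target(iota)    # Hastings ratio simplifies
    if np.log(rng.random()) < log_alpha:
        return cand                                    # accept
    return iota                                        # reject: keep current value
```

Because the proposal is symmetric, the Hastings correction cancels and only the target ratio appears in the acceptance probability.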
A.4. Full conditionals for other parameters and hyperparameters

Since the priors for the parameters $p$ and $\sigma_\epsilon^2$ and the hyperparameters $\sigma_\gamma^2$, $\lambda_1$, $\lambda_2$, and $\varphi_j$, $j = 1, \ldots, S$, are set up such that each of them is conditionally conjugate, their full conditionals are easy to derive as
$$\sigma_\epsilon^{-2}\,|\,M, A, \iota, \beta, \gamma, \eta \sim \mathrm{Gamma}\Biggl(N + \theta_\epsilon,\; \Biggl[\frac{1}{\phi_\epsilon} + 2\sum_{i=1}^{n}\sum_{j=1}^{m_i}(A_{ij} - \iota_{ij})^2 + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}(M_{ij} - B_{ij}^T\beta - \gamma_i - Z_{ij}^T\eta)^2\Biggr]^{-1}\Biggr),$$
$$\lambda_1\,|\,\beta_1 \sim \mathrm{Gamma}\Biggl(\frac{d+1}{2} + \theta_{1\lambda},\; \Bigl[\frac{1}{\phi_{1\lambda}} + \frac{1}{2}\beta_1^T\beta_1\Bigr]^{-1}\Biggr),$$
$$\lambda_2\,|\,\beta_2 \sim \mathrm{Gamma}\Biggl(\frac{\kappa}{2} + \theta_{2\lambda},\; \Bigl[\frac{1}{\phi_{2\lambda}} + \frac{1}{2}\beta_2^T\beta_2\Bigr]^{-1}\Biggr),$$
$$\varphi_j\,|\,\eta_j \sim \mathrm{Gamma}\Biggl(\frac{s_j}{2} + \theta_{j\varphi},\; \Bigl[\frac{1}{\phi_{j\varphi}} + \frac{1}{2}\eta_j^T\eta_j\Bigr]^{-1}\Biggr), \quad j = 1, \ldots, S,$$
$$\sigma_\gamma^{-2}\,|\,\gamma \sim \mathrm{Gamma}\Biggl(\frac{\tilde n}{2} + \theta_\gamma,\; \Bigl[\frac{1}{\phi_\gamma} + \frac{1}{2}\gamma^T\gamma\Bigr]^{-1}\Biggr),$$
$$p\,|\,\gamma \sim \mathrm{Beta}(\theta_p + \tilde n,\; \phi_p + n - \tilde n),$$
where $N = \sum_{i=1}^{n} m_i$ and $\tilde n = \sum_{i=1}^{n} I[\gamma_i \ne 0]$.
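As a small illustration of these conjugate draws (our own sketch, with hypothetical names for the hyperparameters), the updates for $\sigma_\gamma^{-2}$ and $p$ given the current gene effects can be written directly with NumPy's Gamma and Beta samplers; note that NumPy's `gamma` takes a scale, which is the inverse of the bracketed rate above:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_sigma_gamma_inv2_and_p(gamma, theta_g, phi_g, theta_p, phi_p):
    """Conjugate Gibbs draws for sigma_gamma^{-2} (Gamma) and p (Beta),
    given the current vector of gene effects gamma."""
    n = gamma.size
    n_tilde = np.count_nonzero(gamma)            # number of nonzero effects
    shape = n_tilde / 2 + theta_g
    rate = 1.0 / phi_g + 0.5 * gamma @ gamma     # bracketed term is 1/rate
    sig_g_inv2 = rng.gamma(shape, 1.0 / rate)    # numpy parametrizes by scale
    p = rng.beta(theta_p + n_tilde, phi_p + n - n_tilde)
    return sig_g_inv2, p
```

The remaining Gamma updates for $\sigma_\epsilon^{-2}$, $\lambda_1$, $\lambda_2$, and $\varphi_j$ follow the same shape/scale pattern.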
APPENDIX B. THE DIFFERENTIATION SCORE AND BAYES FACTOR

Assume we have the likelihood function $f(X|\theta)$, which includes all the information needed in a frequentist (Fisherian) inference for the unknown parameter $\theta$ from the observed sample $X$. Let $\theta$ lie in a parameter space $\Theta$, which is partitioned into two disjoint subspaces $\Theta_0$ and $\Theta_1$. We are interested in whether the given sample is taken from a population with $\theta \in \Theta_0$ or a population with $\theta \in \Theta_1$, i.e., in selecting the model between
$$M_0: \theta \in \Theta_0 \quad\text{versus}\quad M_1: \theta \in \Theta_1.$$
For Bayesian inference, we assume $P(M_1) = 1 - P(M_0) = p$ and that the prior of $\theta$ is $\pi_i(\theta)$ under model $M_i$, $i = 0, 1$. The posterior distribution of $\theta$ is
$$\theta\,|\,X \sim \frac{f(X|\theta)[(1-p)\pi_0(\theta) + p\,\pi_1(\theta)]}{\int f(X|\theta)[(1-p)\pi_0(\theta) + p\,\pi_1(\theta)]\,d\theta} = \pi(\theta|X, M_0)\,\frac{(1-p)\,m_0(X)}{(1-p)\,m_0(X) + p\,m_1(X)} + \pi(\theta|X, M_1)\,\frac{p\,m_1(X)}{(1-p)\,m_0(X) + p\,m_1(X)},$$
where $\pi(\theta|X, M_i)$ is the posterior distribution of $\theta$ under model $M_i$, and $m_i(X)$ is the marginal density of $X$ under model $M_i$. Since $P(M_i|X)$ is determined by $m_0(X)$ and $m_1(X)$, which measure the likelihood of $X$ under models $M_0$ and $M_1$, respectively, the usual Bayes factor $B_{01} = m_0(X)/m_1(X)$ is used to measure which model should be chosen.

Let us consider the case $\theta \in \mathbb{R}$, with $\Theta_0 = \{0\}$ and $\Theta_1 = \mathbb{R}\setminus\{0\}$. While $\pi_0(\theta) = \delta_0(\theta)$, the log-likelihood $\log f(X|\theta)$ and $\log\pi_1(\theta)$ can both be expanded about their respective maxima, $\hat\theta_n$ and $\theta = 0$. Therefore, we obtain
$$\log f(X|\theta) = \log f(X|\hat\theta_n) - \frac{1}{2\sigma_n^2}(\theta - \hat\theta_n)^2 + R_n,$$
$$\log\pi_1(\theta) = \log\pi_1(0) - \frac{1}{2\tau^2}\theta^2 + R_0,$$
where $R_n$, $R_0$ denote remainder terms and
$$\sigma_n^2 = -\Biggl(\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Biggr)^{-1}\Bigg|_{\theta=\hat\theta_n}, \qquad \tau^2 = -\Biggl(\frac{\partial^2\log\pi_1(\theta)}{\partial\theta^2}\Biggr)^{-1}\Bigg|_{\theta=0}.$$
Under regularity conditions (see Bernardo and Smith, 1994) which ensure that $R_0$ and $R_n$ are small for large $n$, and ignoring constants of proportionality, we have
$$[X|\theta] \propto \exp\Bigl\{-\frac{1}{2\sigma_n^2}(\theta - \hat\theta_n)^2\Bigr\},$$
$$\theta \sim (1-p)\,\delta_0(\theta) + p\,(2\pi\tau^2)^{-1/2}\exp\Bigl\{-\frac{1}{2\tau^2}\theta^2\Bigr\}.$$
Therefore, the posterior distribution of $\theta$ is a mixture of a point mass at zero with probability $1 - r_n$ and $N(\mu_n, \tau_n^2)$ with probability $r_n$, where
$$\mu_n = \frac{\hat\theta_n\,\tau^2}{\tau^2 + \sigma_n^2}, \qquad \tau_n^2 = \frac{\sigma_n^2\,\tau^2}{\tau^2 + \sigma_n^2},$$
and the probability $r_n = P(M_1|X) = 1 - P(M_0|X)$ is given as
$$r_n = 1 - \frac{1-p}{1-p + p\exp\Bigl\{\dfrac{\mu_n^2}{2\tau_n^2}\Bigr\}\sqrt{\tau_n^2/\tau^2}}.$$
Therefore, we have the following Bayes factor:
$$B_{10} = \frac{P(M_1|X)/p}{P(M_0|X)/(1-p)} = \frac{r_n}{1-r_n}\times\frac{1-p}{p} = \exp\Bigl\{\frac{\mu_n^2}{2\tau_n^2}\Bigr\}\sqrt{\tau_n^2/\tau^2}.$$
In this special case, the Bayes factor is closely related to the statistic $d = \mu_n/\tau_n$, which measures not only the distance between the alternative posterior and the null posterior, but also the direction of the alternative posterior relative to the null posterior. Letting $\tau \to \infty$ gives $\mu_n \to \hat\theta_n$ and $\tau_n^2 \to \sigma_n^2$, and hence $B_{10} \to 0$, so inference based on the Bayes factor will always favor the null hypothesis $H_0$ no matter what data we observe. This is the famous Lindley's paradox. Instead, $d = \mu_n/\tau_n \to \hat\theta_n/\sigma_n$, which is essentially the z-score used in frequentist hypothesis testing.
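The limiting behavior described above can be checked numerically. The sketch below (our own illustration, with hypothetical function names) computes $\mu_n$, $\tau_n^2$, $r_n$, and $B_{10}$ from the formulas derived here, and shows that as the prior variance $\tau^2$ grows, $B_{10}$ shrinks toward zero even for a "significant" z-score, while $d = \mu_n/\tau_n$ stabilizes at the z-score:

```python
import numpy as np

def posterior_quantities(theta_hat, sigma2_n, tau2, p=0.5):
    """Compute mu_n, tau_n^2, r_n, and the Bayes factor B10 for the
    point-null mixture posterior derived above."""
    mu_n = theta_hat * tau2 / (tau2 + sigma2_n)
    tau2_n = sigma2_n * tau2 / (tau2 + sigma2_n)
    b10 = np.exp(mu_n**2 / (2 * tau2_n)) * np.sqrt(tau2_n / tau2)
    r_n = p * b10 / (1 - p + p * b10)     # equivalent form of r_n
    return mu_n, tau2_n, r_n, b10

# A z-score of 3 (theta_hat = 3, sigma2_n = 1): as tau2 grows,
# B10 -> 0 (Lindley's paradox) while d = mu_n / tau_n -> 3.
for tau2 in (1.0, 1e2, 1e6):
    mu, t2, r, b = posterior_quantities(3.0, 1.0, tau2)
    print(f"tau2={tau2:g}  B10={b:.4f}  d={mu / np.sqrt(t2):.4f}")
```

This makes the contrast concrete: the Bayes factor is driven to favor $H_0$ by the diffuse prior, whereas the d-score recovers the frequentist z-score.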
ACKNOWLEDGMENTS

The support of NSF grant DMS 0204252 is acknowledged by M.T.W. The support of the NSF grant, Development of Tools for Potato Functional Genomics: Application to Disease Resistance and Development, is gratefully acknowledged by W.E.F.
REFERENCES Benjamini, Y., and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. Series B 57, 289–300. Bernardo, J.M., and Smith, A.F.M. 1994. Bayesian Theory, John Wiley, New York. Berry, S.M., Carroll, R.J., and Ruppert, D. 2002. Bayesian smoothing and regression splines for measurement error problems. J. Am. Statist. Assoc. 97, 160–169. Broët, P., Richardson, S., and Radvanyi, F. 2002. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comp. Biol. 9, 671–683. Chen, Y., Dougherty, E.R., and Bittner, M.L. 1997. Ratio based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics 2, 364–374. Cleveland, W.S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Am. Statist. Assoc. 74, 829– 836. Cleveland, W.S., and Devlin, S.J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. J. Am. Statist. Assoc. 83, 596–610. Dudoit, S., Yang, Y.H., Callow, M.J., and Speed, T.P. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111–139. Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. 2001. Empirical Bayes analysis of a microarray experiment. J. Am. Statist. Assoc. 96, 1151–1160. Eilers, P.H.C., and Marx, B.D. 1996. Flexible smoothing with B-Splines and penalties (with discussion). Statist. Sci. 11, 89–102. Finkelstein, D.B., Gollub, J., Ewing, R., Sterky, F., Somerville, S., and Cherry, J.M. 2001. Iterative linear regression by sector, in S.M. Lin and K.F. Johnson, eds., Methods of Microarray Data Analysis. Papers from CAMDA 2000, 57–68, Kluwer Academic, New York. Friedman, N., Linial, M., Nachman, I., and Pe’er, D. 2000. Using Bayesian networks to analyze expression data. J. Comp. Biol. 7, 601–620. Fry, W.E., and Goodwin, S.B. 1997. 
Re-emergence of potato and tomato late blight in the United States. Plant Dis. 81, 1349–1357. Gilks, W.R., Best, N.G., and Tan, K.K.C. 1995. Adaptive rejection Metropolis sampling within Gibbs sampling. Appl. Statist. 44, 455–472. Kepler, T.B., Crosby, L., and Morgan, K.T. 2000. Normalization and analysis of DNA microarray data by self-consistency and local regression. Santa Fe Institute Working Paper, Santa Fe, New Mexico. Kerr, M.K., and Churchill, G.A. 2001a. Experimental design for gene expression microarrays. Biostatistics 2, 183–201. Kerr, M.K., and Churchill, G.A. 2001b. Statistical design and the analysis of gene expression microarrays. Genet. Res. 77, 123–128.
Kerr, M.K., Martin, M., and Churchill, G.A. 2000. Analysis of variance for gene expression microarray data. J. Comp. Biol. 7, 819–837. Lewin, A., Richardson, S., Marshall, C., Glazier, A., and Aitman, T. 2004. Bayesian modelling of differential gene expression. Manuscript. Lindley, D.V. 1957. A statistical paradox. Biometrika 44, 187–192. Lönnstedt, I., and Speed, T. 2002. Replicated microarray data. Statistica Sinica 12, 88–99. McLachlan, G.J., Bean, R.W., and Peel, D. 2002. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413–422. Miller, A.J. 1990. Subset Selection in Regression, Chapman & Hall, London. Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui, K.W. 2001. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comp. Biol. 8, 37–52. Newton, M.A., Noueiry, A., Sarkar, D., and Ahlquist, P. 2004. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155–176. Pan, W., Lin, J., and Le, C.T. 2002. How many replicates of array are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3, research0022.1–0022.10. Pe’er, D., Regev, A., Elidan, G., and Friedman, N. 2001. Inferring subnetworks from perturbed expression profiles. Bioinformatics 17, S1, S215–224. Rice, J. 1986. Convergence rates for partially splined models. Statistics and Probability Letters 4, 203–208. Schadt, E.E., Li, C., Ellis, B., and Wong, W.H. 2002. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem. Suppl. 37, 120–125. Smart, C.D., Myers, K.L., Restrepo, S., Martin, G.B., and Fry, W.E. 2003. Partial resistance of tomato to Phytophthora infestans is not dependent upon ethylene, jasmonic acid or salicylic acid signaling pathways. Mol. Plant-Microbe Interact.
16, 141–148. Smyth, G.K., Yang, Y.H., and Speed, T. 2002. Statistical issues in cDNA microarray data analysis. To appear in M.J. Brownstein and A.B. Khodursky, eds., Functional Genomics: Methods and Protocols. Methods in Molecular Biology Series, Humana Press, Totowa, NJ. Speckman, P. 1988. Kernel smoothing in partial linear models. J. Roy. Statist. Soc. B 50, 413–436. Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucl. Acids Res. 29, 2549–2557. Vleeshouwers, V.G.A.A., van Dooijeweert, W., Govers, F., Kamoun, S., and Colon, L.T. 2000. Does basal PR gene expression in Solanum species contribute to non-specific resistance to Phytophthora infestans? Physiol. Mol. Plant Pathol. 57, 35–42. Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules, R.S. 2001. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comp. Biol. 8, 625–637. Yang, Y.H., Buckley, M.J., Dudoit, S., and Speed, T.P. 2000. Comparison of methods for image analysis on cDNA microarray data. www.stat.Berkeley.EDU/users/terry/zarray/Html/image.html. Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002. Normalization of cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucl. Acids Res. 30(4), e15. Yang, Y.H., Dudoit, S., Luu, P., and Speed, T.P. 2001. Normalization for cDNA microarray data, in M.L. Bittner, Y. Chen, A.N. Dorsel, and E.R. Dougherty, eds., Microarrays: Optical Technologies and Informatics, vol. 4266 of Proceedings of SPIE. Yang, Y.H., and Speed, T. 2002. Design issues for cDNA microarray experiments. Nature Genet. 3, 579–588.
Address correspondence to:
Dabao Zhang
Department of Biostatistics and Computational Biology
University of Rochester Medical Center
Rochester, NY 14642
E-mail: [email protected]