Statistical Applications in Genetics and Molecular Biology
Volume 4, Issue 1 (2005), Article 34
Correlation Between Gene Expression Levels and Limitations of the Empirical Bayes Methodology for Finding Differentially Expressed Genes

Xing Qiu∗
Lev Klebanov†
Andrei Yakovlev‡
∗ Department of Biostatistics and Computational Biology, University of Rochester, [email protected]
† Department of Probability and Statistics, Charles University; Institute of Informatics and Control of the National Academy of Sciences of the Czech Republic, [email protected]
‡ Department of Biostatistics and Computational Biology, University of Rochester, [email protected]

Copyright © 2005 by the authors. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, bepress, which has been given certain exclusive rights by the author. Statistical Applications in Genetics and Molecular Biology is produced by The Berkeley Electronic Press (bepress). http://www.bepress.com/sagmb
Correlation Between Gene Expression Levels and Limitations of the Empirical Bayes Methodology for Finding Differentially Expressed Genes∗

Xing Qiu, Lev Klebanov, and Andrei Yakovlev

Abstract

Stochastic dependence between gene expression levels in microarray data is of critical importance for methods of statistical inference that resort to pooling test statistics across genes. The empirical Bayes methodology in its nonparametric and parametric formulations, as well as closely related methods employing a two-component mixture model, are typical examples. It is frequently assumed that the dependence between gene expressions (or associated test statistics) is sufficiently weak to justify the application of such methods for selecting differentially expressed genes. By applying resampling techniques to simulated and real biological data sets, we have studied the potential impact of the correlation between gene expression levels on statistical inference based on the empirical Bayes methodology. We report evidence from these analyses that this impact may be quite strong, leading to a high variance of the number of differentially expressed genes. This study also pinpoints specific components of the empirical Bayes method where the reported effect manifests itself.

KEYWORDS: microarray analysis, gene expression, two-sample tests, empirical Bayes method, correlated data, resampling techniques
∗ We would like to express our gratitude to Dr. B. Efron for fruitful and inspiring discussions. We are also very grateful to the anonymous reviewer and the managing editor for truly helpful comments. This research is supported by NIH grant GM075299 (Yakovlev) and Czech Ministry of Education Grant MSM 113200008 (Klebanov).
1. INTRODUCTION
The idea of pooling information across genes has been engrossing the attention of many investigators since the advent of microarray technology. Indeed, it is very tempting to overcome the tight cost and labor limitations of microarray experiments by treating test statistics associated with each gene, or even the original expression levels, as a sample drawn from some distribution. The introduction of this riveting idea can be dated back to the paper by Chen et al. (1997), which was probably the first methodological publication on microarray data analysis. In its most theoretically sound form, the idea was set forth in the nonparametric empirical Bayes method (NEBM) designed to select differentially expressed genes in replicated microarray experiments (Efron et al. 2001, Efron 2003, 2004). To date, a great many papers have discussed various facets of the NEBM and its parametric counterparts. A comprehensive theoretical treatment of the multiple testing aspect of the NEBM was given by Storey (2002, 2003a) and Storey et al. (2003).

To formulate the specific aims of the present study, we need to recall some basic facts about the NEBM. For simplicity, we limit our consideration of the NEBM to two-sample comparisons. This method starts with choosing a two-sample test statistic T that measures differences between samples and accounts for biological variability between study subjects. The statistic T is computed for each gene, and then all the statistics (or associated p-values) are pooled together and treated as a sample from which to estimate the sampling distribution of this statistic, the false discovery rate (FDR), q-values, etc. More specifically, let Ti,n be the test statistic for the ith gene, i = 1, . . . , m, where n is the sample size. In what follows, the index n will be suppressed but always implied. One can choose a distribution-free statistic for Ti so that, under the complete null hypothesis, the statistics Ti, i = 1, . . . , m, can be thought of as a sequence of identically distributed random variables. In particular, the (unpooled) t-statistic is asymptotically (as n → ∞) distribution-free. The two-sample t-statistic is the most common choice in microarray analysis (Dudoit et al. 2002, 2003) in general and in the NEBM (Efron et al. 2001, Efron 2003, 2004) in particular.

The NEBM is based on the assumption that there are two classes of genes, “Not Different” and “Different”, with prior probabilities π0 and π1 = 1 − π0, respectively. Introducing the class indicator variable J, we can write π0 = Pr{J = 0} and π1 = Pr{J = 1}. Denote the conditional probability density of T given J = 0 by f0(t) and the corresponding density of T given J = 1 by f1(t). Then the density of the random variable (rv) T is given by the two-component mixture

f(t) = π0 f0(t) + (1 − π0) f1(t).  (1)
The posterior probability of J = 1 given T = t is

P(t) = Pr{J = 1 | T = t} = 1 − π0 f0(t)/f(t).  (2)
The simplest Bayes rule is to select a gene with T = t if P(t) ≥ C for this gene, where C < 1 is a pre-set threshold level. The next step is to estimate P(t) as

P̂(t) = 1 − π̂0 f̂0(t)/f̂(t),  (3)
where π̂0, f̂0(t) and f̂(t) are suitable estimators for π0, f0(t) and f(t), respectively. This step is the most crucial one in the NEBM, as it suggests estimating f(t) from the T-observations. The same applies to the density f0 whenever the null distribution is mimicked through permutations. With some distribution-free statistics, the needed null distribution f0 can be derived theoretically (Conover 1999), thereby obviating the need for resampling altogether. Since the mixture model (1) is generally unidentifiable from the T-observations, there are two ways to deal with the prior probability π0. The first is to proceed from the most conservative choice π0 = 1, so that

P̂(t) = 1 − f̂0(t)/f̂(t).  (4)
Efron (2003) provides a convincing rationale for this approach. The second way is to estimate π0 by some other route, which has been the subject matter of several papers (Allison et al. 2002, Storey 2002, Storey and Tibshirani 2003, Pounds and Morris 2003, Tsai et al. 2003, Reiner et al. 2003, Dalmasso et al. 2004, Pounds and Cheng 2004). Having obtained an estimate P̂(t) for the posterior probability P(t), one declares the ith gene differentially expressed if

P̂(ti) ≥ C,  (5)

where ti is an observed value of Ti.

One of the above-described components of this inventive methodology is of real concern, namely, the use of the T-observations to construct the statistical estimator P̂(t). If the genes were independent, the observed test statistics associated with them could be treated as a random sample from some general distribution, and the commonly used estimators for its density would have the usual nice statistical properties. The snag is that the expression levels for different genes and, consequently, the random variables Ti, i = 1, . . . , m, are stochastically dependent, a property which may cause high variability of the associated statistical estimators and even invalidate their consistency. To obtain theoretical results, it is frequently assumed that weak or almost sure convergence holds for an empirical distribution function constructed from the data pooled across genes (see, e.g., Storey 2003b, Storey et al. 2004). However, this assumption has never been validated with real data, so the required convergence to the true distribution function is always questionable; it may or may not be the case depending on the type and strength of stochastic dependence. The same is true for the ergodicity condition employed by Cheng et al. (2004) to characterize a somewhat different but closely related type of weak dependence. Some authors offer biological reasons to substantiate the assumption of weak dependence; their arguments will be disputed in Section 8.
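To make the pipeline of formulas (1)-(5) concrete, the following minimal sketch applies the conservative rule (4) to simulated independent data. It is written in Python for illustration (the computations in this paper were carried out in R); the sample sizes, the shift of 2 for “Different” genes, the use of Gaussian kernel density estimates, and the small number of permutations are all our illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, n1, n2, m_diff = 2000, 21, 19, 200      # illustrative sizes

# Independent expression data; the first m_diff genes are shifted by 2 in group 1.
x1 = rng.standard_normal((m, n1))
x2 = rng.standard_normal((m, n2))
x1[:m_diff] += 2.0

# Two-sample (pooled) t-statistic for each gene.
t_obs = stats.ttest_ind(x1, x2, axis=1).statistic

# Null t-statistics from a few label permutations (the paper uses 20).
pooled = np.hstack([x1, x2])
t_null = []
for _ in range(5):
    perm = rng.permutation(n1 + n2)
    g1, g2 = pooled[:, perm[:n1]], pooled[:, perm[n1:]]
    t_null.append(stats.ttest_ind(g1, g2, axis=1).statistic)
t_null = np.concatenate(t_null)

# Kernel estimates of f and f0, conservative posterior (4), selection rule (5).
f_hat = stats.gaussian_kde(t_obs)(t_obs)
f0_hat = stats.gaussian_kde(t_null)(t_obs)
posterior = 1.0 - f0_hat / f_hat

C = 0.8
selected = np.where(posterior >= C)[0]
```

With independent data and a large shift, almost all of the first m_diff genes are selected and virtually no null genes are; the point of the present paper is that this well-behaved picture breaks down under correlation.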
The most fundamental question still remains: are the correlations strong enough to deteriorate the NEBM performance? In a recent paper (Qiu et al. 2005), we estimated correlation coefficients between the t-statistics associated with all pairs of genes reported in the St. Jude Children's Research Hospital Database on childhood leukemia. The coefficients are quite high even after the application of normalization procedures that substantially reduce the correlation. The paper by Qiu et al. (2005) also addressed the question of consistency of the arithmetic mean of the t-statistics taken over the genes by experimenting with this large set of gene expression data. The study provided evidence for the lack of consistency of this estimator.

The present paper is focused on more direct effects of dependence between the t-statistics associated with different genes on the performance of the NEBM. In particular, the number of genes declared differentially expressed by the NEBM may serve as a criterion of its performance. This number is a rv defined as

Z^C = #{P̂(ti) ≥ C; i = 1, . . . , m}.  (6)
The rv Z^C can also be thought of as an estimator of the true number of differentially expressed genes, and its properties in the presence of dependencies in microarray data can be studied empirically by computer simulations and real data analyses. Another criterion is the FDR, which is essentially estimated from the T-observations as well. Not only is this rate important as a multiple-comparison criterion, but it is also used for other purposes. For example, the FDR served as a performance indicator in a study of several methods for producing Affymetrix expression scores reported by Shedden et al. (2005). If the available methods for estimating the FDR performed poorly, it would be difficult to justify the utility of this indicator in comparison studies of different methods.

Using resampling techniques, we assess the variance of the FDR estimated from actual biological data by indirect estimation procedures. In doing so, we limit our attention to the “Bayesian FDR” (Storey 2001) and its local version (Efron 2003), because they have the most direct bearing on the empirical Bayes methodology. In simulation studies, however, the FDR can be estimated directly as the proportion of false discoveries among all discoveries. We will call this estimate the “true” FDR and report its mean together with the corresponding standard deviation. This is an operational definition applicable only to simulations. We refrain from using indirect procedures for estimating the FDR from simulated data because such procedures introduce an additional variation in the estimates which is impossible to distinguish from that caused by a given selection procedure. It should be noted that our simulation studies play a subsidiary role; their only purpose is to illustrate the effect of correlation between gene expression signals on probabilistic characteristics associated with the true FDR and the number of differentially expressed genes.
In other words, our simulations provide only explanatory insight into the correlation effects. Unfortunately, our study shows that the effect of stochastic dependence between expression levels on the performance of the NEBM can be disastrous, manifesting itself in both a high bias and a high variance of the number of differentially expressed genes. It is a credible speculation that parametric formulations of the empirical Bayes methodology (Newton et al. 2000, Ibrahim et al. 2002, Lee et al. 2002, Newton and Kendziorski 2003, to name a few) suffer from the same problem as well. While offering no constructive alternative to the NEBM, our work suggests the use of resampling tools for validating its applicability in each specific analysis of microarray data.

The paper is organized as follows: Section 2 presents the study design and methods used to achieve the specific goals; Sections 3-7 report the results of real data analyses and simulation studies; Section 8 contains discussion and concluding remarks.

2. STUDY DESIGN AND METHODS

2.1. Biological Data

For the purposes of this study, use was made of the St. Jude Children's Research Hospital (SJCRH) Database on childhood leukemia, which is publicly available at http://www.stjuderesearch.org/data/ALL1/. The SJCRH Database contains gene expression data on 335 subjects, each represented by a separate array
(Affymetrix, Santa Clara, CA) reporting measurements on the same set of m = 12558 genes. We selected two subsets of the data. Data Set 1 is represented by 21 arrays obtained from patients with mixed lineage lymphoblastic leukemia (MLL) and 19 arrays obtained from normal blood samples. Data Set 2 includes a group of 45 patients with T-cell acute lymphoblastic leukemia (TALL) and a group of 79 patients with a special translocation type of acute lymphoblastic leukemia (TEL). Our preliminary analyses with the aid of methods controlling the family-wise error rate (FWER) led us to expect many more differentially expressed genes in the second setting than in the first. Since the nature of our study was purely methodological, the choice of the two data sets was quite arbitrary; it was dictated solely by sample size considerations. Both sets of microarray data were background corrected and normalized using the Bioconductor RMA software. This software implements the quantile normalization procedure (Bolstad et al. 2003, Irizarry et al. 2003) carried out at the probe feature level. After this normalization, each gene is represented in the final data set by the logarithm (base 2) of the expression level of the probe set.
2.2. Simulated Data

To gain better insight into the effects of correlation between gene expression signals on the performance of the NEBM, we simulated several sets of data with an exchangeable correlation structure. By no means are such simulations intended to model the actual correlation structure of gene expression signals, which is clearly far more complex (see Qiu et al. 2005 for relevant data) than the simplistic one employed for explanatory purposes in the present paper. Each data set included two groups of arrays, with 21 arrays pertaining to the first group and 19 arrays to the second. Each array reports on simulated signals from m = 1255 genes.

To produce the needed data sets we proceed as follows. We first generate a 1255 × 40 matrix with each entry being an independent realization of a standard normal random variable. To model a set of “Different” genes, we add a value of 2 to the first 125 genes (rows) in the first group (columns 1 to 21) and denote the resultant matrix by X = {xij}, i = 1, . . . , 1255; j = 1, . . . , 40. All the elements xij of this matrix are stochastically independent, but those with i = 1, 2, . . . , 125 and j = 1, 2, . . . , 21 are normally distributed with mean 2 and unit variance. Expression levels of the genes outside this special set of 125 genes follow the standard normal distribution. Next we generate a 40-dimensional random vector with i.i.d. components, each component having a standard normal distribution. Denote this vector by A = {aj}, j = 1, . . . , 40. Define

yij = √ρ aj + √(1 − ρ) xij, i = 1, . . . , 1255; j = 1, . . . , 40,

so that for any i1 ≠ i2 and j we have Corr(yi1j, yi2j) = ρ. We choose a value of ρ and repeat the process 500 times with different random seeds. The following values of the correlation coefficient were used: ρ = 0, 0.2, 0.4 and 0.6, generating four independent data sets denoted by SIMU00, SIMU02, SIMU04 and SIMU06, respectively.
Given ρ, the 500 independent 1255 × 40 matrices provide simulated expression levels of both “Different” and “Not Different” genes. This setting is an analogue of the jackknife resampling used to analyze the biological (SJCRH) data (Section 2.6). Each of the data sets thus simulated was permuted 500 times to give rise to another collection of simulated data sets on which to study the performance of the NEBM when the complete null hypothesis (no differentially expressed genes) is true. This setting is an analogue of permutation analysis of the biological data (Section 2.6).
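A single replicate of this data-generating scheme can be sketched as follows. The sketch is a Python transcription (the paper's computations used R), and the function name, defaults, and seed handling are ours.

```python
import numpy as np

def simulate(rho, m=1255, m_diff=125, n1=21, n2=19, seed=0):
    """One simulated data set with exchangeable gene-gene correlation rho."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((m, n1 + n2))
    x[:m_diff, :n1] += 2.0                  # "Different" genes, group 1 only
    a = rng.standard_normal(n1 + n2)        # shared array-level factor
    return np.sqrt(rho) * a + np.sqrt(1.0 - rho) * x

y = simulate(rho=0.4)
# Average pairwise correlation among the "Not Different" genes should be near rho.
c = np.corrcoef(y[125:])
mean_offdiag = (c.sum() - np.trace(c)) / (c.shape[0] ** 2 - c.shape[0])
```

Note that, as in the construction above, mixing after the shift scales the mean of the “Different” genes down to 2√(1 − ρ).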
2.3. Estimating Densities
It is common practice to use the t-statistic as a measure of differential expression of genes in conjunction with the NEBM. We followed this practice in our study. Once the 12558 t-statistics had been computed in a given setting, three different methods were used to estimate the corresponding distribution density.

1. Histogram method. This method provides the simplest nonparametric estimate of the density function. We used the R built-in function hist() to compute the histogram estimate.

2. GLM smoothing method. Smoothing techniques are known to reduce the variance of a nonparametric estimator while admitting a small bias in the bias-variance trade-off. A Poisson generalized linear model (GLM) fit was recommended by Efron (2003) as an appropriate smoothing technique for estimating the density of the t-statistic. To implement this method, we partitioned the range of T-observations into 2000 equally spaced intervals. The left end of this partition is the minimum of all observed t-statistics minus one half of the interval length, while the right end is their maximum plus one half of the interval length. The frequencies of the intervals were fit with a fourth-order Poisson GLM using the R built-in command glm(). Then the curve was normalized to obtain the required density estimate. We conducted a separate set of numerical studies to ensure that invoking curves of higher order is not warranted.

3. Kernel smoothing method. Kernel estimation is an alternative smoothing technique for obtaining a stable estimate of the density function. We used the R built-in function density() to obtain such an estimate with the Gaussian kernel. This highly flexible estimation method is computationally much less expensive than the GLM smoothing, which is its distinct advantage.

All three methods were used to provide an estimate f̂ of the density f in formulas (3) and (4).
The same methods were employed to estimate the null distribution density f0 when the complete null hypothesis was modeled through permutations. Following the recommendation by Efron (2003), we used 20 permutations to generate 20 × 12558 sample values of the t-statistic in order to produce the nonparametric estimates f̂0 by each of the three methods. Unbalanced permutations were used for this purpose, because permuting the arrays in a balanced way resulted in a much higher variance of the performance indicators under study. For comparison purposes, we also used a Student's t-distribution (with a pertinent number of degrees of freedom), proceeding from the assumption of normality of gene expression levels. The following notation for the above-described methods will be referred to in Sections 3-5, which present the results of our empirical studies:

KERP: The null hypothesis is modeled through permutations, with both f and f0 estimated by the kernel smoothing method.

GLMP: The null hypothesis is modeled through permutations, with both f and f0 estimated by the GLM smoothing method.

KERT: A Student's theoretical t-distribution is used for f̂0, while f is estimated by the kernel method.
GLMT: A Student's theoretical t-distribution is used for f̂0, while f is estimated by the GLM method.
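The Poisson GLM smoothing of item 2 above (sometimes called Lindsey's method) can be sketched as follows. This Python version is ours (the paper used R's glm()): it fits a degree-4 log-polynomial to binned t-statistics by iteratively reweighted least squares, with far fewer bins than the paper's 2000 to keep the illustration small.

```python
import numpy as np

def glm_density(t, n_bins=60, degree=4, n_iter=25):
    """Lindsey's method: Poisson GLM fit to histogram counts -> density."""
    counts, edges = np.histogram(t, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    # Standardize bin centers for a well-conditioned polynomial basis.
    xs = (centers - centers.mean()) / centers.std()
    X = np.vander(xs, degree + 1)
    # Initialize from ordinary least squares on log counts.
    beta = np.linalg.lstsq(X, np.log(counts + 0.5), rcond=None)[0]
    for _ in range(n_iter):                 # IRLS for Poisson / log link
        eta = np.clip(X @ beta, -30, 30)
        mu = np.exp(eta)
        z = eta + (counts - mu) / mu        # working response
        w = np.sqrt(mu)                     # square root of Poisson weights
        beta = np.linalg.lstsq(X * w[:, None], z * w, rcond=None)[0]
    mu = np.exp(np.clip(X @ beta, -30, 30))
    f = mu / (mu.sum() * width)             # normalize to a density
    return centers, f

rng = np.random.default_rng(0)
centers, f = glm_density(rng.standard_normal(20000))
f_at_zero = f[np.argmin(np.abs(centers))]
```

On independent standard normal draws the fitted curve closely tracks the true density; the sections below examine what happens to such estimates when the underlying observations are correlated.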
2.4. Estimating the Prior Probability π0

We used the spline approximation method of Storey and Tibshirani (2003) to estimate the prior probability π0. They proposed estimating π0 by the limiting value of

π̂0(λ) = #{pi > λ; i = 1, . . . , m} / [m(1 − λ)],  (7)
as the tuning parameter λ tends to 1. In formula (7), pi, i = 1, . . . , m, are the observed p-values calculated from quantiles of the Student distribution. To borrow strength from the whole range of π̂0(λ), the authors suggest fitting a cubic spline to π̂0(λ) as a function of λ defined on the grid (0.01, 0.02, 0.03, . . . , 0.95). The cubic spline evaluated at the point λ = 1 provides the desired estimate of π0. This is an ingenious method, but it is important to see how well it tolerates the correlation between gene expression levels.

2.5. Selecting Differentially Expressed Genes

Once all the functions involved in formula (2) have been estimated, one can use formula (3) or (4) to compute the posterior probability estimate P̂(t). We explored both the method based on formula (3) and the conservative version based on formula (4). To select differentially expressed genes, we used the decision rule (5) with the same threshold level C = 0.8 in all analyses of real and simulated data. There is no well-defined procedure for selecting a specific value of C. It is clear that choosing higher values of C would concurrently reduce the Type I error rate and the overall power. The choice of the threshold C is of little importance for the main purpose of the present paper.

2.6. Resampling Techniques for Assessing the Performance of the NEBM

We used resampling methods to assess the mean and variance of the number of selected genes Z^C given by formula (6). When the rv Z^C serves as a performance criterion, the NEBM proves to be extremely sensitive to ties in the data, which is why the bootstrap methodology is of little utility in this setting and random sampling without replacement represents a better choice. For both biological data sets, we applied a subsampling version of the delete-d jackknife method (Politis and Romano 1994, Shao and Tu 1995), which is technically equivalent to leave-d-out cross-validation.
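The delete-d jackknife variance estimator used for this purpose (formula (8) below) can be illustrated in miniature. The Python sketch below is our simplification: the paper deletes d1 and d2 arrays separately from the two groups, whereas here a single sample and the sample mean are used, for which the estimator should roughly recover the classical s²/n.

```python
import numpy as np

def delete_d_jackknife_var(x, d, stat=np.mean, B=2000, seed=0):
    """Delete-d jackknife variance of stat(x):
       V = ((n - d) / d) * (1 / B) * sum_l (theta_l - theta_bar)^2."""
    rng = np.random.default_rng(seed)
    n = x.size
    thetas = np.empty(B)
    for l in range(B):
        keep = rng.choice(n, size=n - d, replace=False)  # one random subsample
        thetas[l] = stat(x[keep])
    return (n - d) / d * np.mean((thetas - thetas.mean()) ** 2)

rng = np.random.default_rng(1)
x = rng.standard_normal(40)
V = delete_d_jackknife_var(x, d=10)
target = x.var(ddof=1) / x.size     # classical variance of the sample mean
```

For the sample mean the two quantities agree up to Monte Carlo error; for an unstable statistic such as Z^C the same machinery exposes the high variance reported in Sections 5 and 6.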
Although the jackknife methodology and cross-validation serve different purposes, this type of analysis will be referred to as cross-validation analysis for the sake of simplicity. Since the samples under study are of unequal size, different values of d have to be chosen to perturb each data set more evenly. When analyzing Data Set 1 (MLL versus NORMAL), we set d1 = 6 for the group consisting of 21 arrays (MLL) and d2 = 4 for the group of 19 arrays (NORMAL). For Data Set 2, we chose d1 = 12 for the group consisting of 45 arrays (TALL) and d2 = 21 for the group of 79 arrays (TEL). This is the basic scenario for cross-validation analysis referred to in Sections 3.2, 4, 5.2, and 6. For comparison, we also used smaller values of d (i.e., d1 = 7 and d2 = 9) to perturb Data Set 2. As one would expect, reducing the number of deleted arrays decreases the bias and increases the variance of the number of selected genes. A total of 500 cross-validations were conducted to estimate the mean value of Z^C and its variance in each analysis. The latter characteristic was
estimated by a resampling counterpart of the jackknife sample variance (Shao and Tu 1995):

V = [(n − d)/d] (1/B) Σ_{l=1}^{B} ( Z^C_{n−d,l} − (1/B) Σ_{k=1}^{B} Z^C_{n−d,k} )²,  (8)

where B is the total number of subsamples (B = 500) and Z^C_{n−d,l} is the statistic Z^C evaluated at the lth delete-d jackknife subsample.

We resorted to permutations in order to estimate the bias in the number of genes selected by the NEBM in the situation where the actual number of differentially expressed genes is known to be zero. This bias can be reduced by increasing the threshold level C, but with a certain sacrifice in power. Our simulations are instrumental in demonstrating the contribution of correlation to this bias given a fixed value of C. However, the main focus of the permutation analysis is on the variance of the number of rejections under the complete null hypothesis. The number of permutations in this analysis was equal to 500. The chosen number of permutations (as well as cross-validations) seems sufficient for our purposes because the results obtained with just 100 permutations are largely similar.

In our study, the role of the FDR was two-fold. In simulation experiments, it provided additional information on the NEBM performance when the main criterion was the number of selected genes. In this setting, the true FDR was determined by identifying the genes that truly belong to the pre-specified set of “Different” genes. In the analysis of biological data, however, our focus was on the quality of estimation of the FDR. To this end, we conducted cross-validation analyses aimed at estimating the mean and variance of the following empirical counterpart of the local FDR (Efron 2003):
fdr(t) = π̂0 f̂0(t)/f̂(t),  (9)
where the estimates π̂0, f̂0(t) and f̂(t) are obtained by the methods described in Sections 2.3 and 2.4. Another interesting quantity is the Bayesian FDR, giving rise to the q-value (Storey 2001). As recommended by Storey and Tibshirani (2003), this integral version of the FDR is estimated by

FDR(u) = π̂0 m u / #{pi ≤ u},  (10)

where pi is the observed p-value for the ith test and u is a threshold p-value chosen to select significant genes. The observed p-values were computed from the Student t-distribution. Both concepts of the FDR were explored in terms of stability of their estimation from biological data.

Software

The software used in this paper was specially designed for the purposes of the study. The code can be obtained from the corresponding author on request.

3. DENSITY ESTIMATES

We begin by discussing the estimates f̂0 and f̂ produced by the three methods mentioned in Section 2.3.
3.1. Permutation Analysis
Shown in Figure 1 are the results of the permutation analysis of the two sets of biological data described in Section 2.1. The solid curves display the mean over 500 permutations, while the vertical bars represent the corresponding standard deviation. Note that these permutations were carried out to assess the stability of different estimators for f0 and not to model the complete null hypothesis for selecting differentially expressed genes. The variability of f̂0 appears to be quite high both for the Poisson GLM smoothing and for the kernel method. The results produced by the two methods look fairly similar.

The role of correlation can be seen from the simulation study designed in much the same way as the above permutation analysis of biological data. Figure 2 shows the results for the GLM method. It is clear that the variance of f̂0 is quite small in the case of independent expression levels, but it increases dramatically as the correlation coefficient ρ increases. In these experiments, the estimate f̂0 is practically unbiased, a statement we verified (not shown) by comparing its sample mean (over permutations) with the theoretical t-distribution density. The same is true for the kernel estimation procedure. Again, the estimates f̂0 produced by the two methods are largely similar. The histogram method, however, shows a much higher variability even in the case of independent data.

3.2. Cross-validation Analysis

This study was designed similarly to the permutation analysis of Section 3.1. Using the leave-d-out cross-validation procedure of Section 2.6, we obtained the results presented in Figure 3. It is seen in this figure that the standard deviation of f̂ deriving both from the GLM smoothing and from the kernel estimation procedure is quite high; the most direct consequences of this observation will be demonstrated in Sections 5 and 6.
As expected, the estimate f̂ becomes even more variable when smaller values of d1 and d2 are used in the analysis of Data Set 2 (Figure 4). Consistent with the above findings are the results of our simulations presented in Figure 5. Recall that there are 125 “Different” genes in the simulation model used to produce the results reported in this figure. The results show the same effect of correlation as was demonstrated for the GLM estimate f̂0. Notice that the variability of f̂ is practically negligible in the case of independent expression levels. The same appears to be the case for the kernel method. In terms of the standard deviation, the histogram method appears to be the least accurate and should not be recommended for use with the NEBM.

4. VARIABILITY OF THE ESTIMATE π̂0

The estimate π̂0 of the expected proportion of true null hypotheses π0 contributes to the variability of the estimated posterior probability given by (2), and consequently to the variability of the FDR. Therefore, it is interesting to assess its bias and variance. Table 1 presents values of the mean and standard deviation of the estimate π̂0 obtained from simulated data using the method by Storey and Tibshirani (2003), while Table 2 provides the same characteristics resulting from the cross-validation analysis of the biological data under the basic cross-validation scenario. The data in Table 1 show that the estimate π̂0 is practically unbiased but can be quite unstable in the presence of strong
correlations in the data. We cannot check the unbiasedness of π̂0 with real data, but the magnitude of its coefficient of variation (Table 2) raises practical concerns.
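The behavior of π̂0(λ) in formula (7) and its spline extrapolation can be sketched as follows. This Python sketch is ours: a generic smoothing spline stands in for the spline routine of Storey and Tibshirani (2003), and the p-value mixture (uniform nulls with Beta-distributed alternatives), the smoothing level, and the known π0 = 0.8 are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
m, pi0_true = 10000, 0.8
# Null p-values are uniform; alternative p-values pile up near zero.
p = np.concatenate([rng.uniform(size=int(m * pi0_true)),
                    rng.beta(1.0, 20.0, size=m - int(m * pi0_true))])

lam = np.arange(0.01, 0.96, 0.01)          # grid used by Storey and Tibshirani
pi0_lam = np.array([(p > l).sum() / (m * (1.0 - l)) for l in lam])

# Cubic smoothing spline in lambda, evaluated at lambda = 1.
spline = UnivariateSpline(lam, pi0_lam, k=3, s=0.05)
pi0_hat = float(spline(1.0))
```

With independent p-values the extrapolated value lands near the true π0; the instability reported in Tables 1 and 2 arises when the p-values are correlated.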
5. THE NUMBER OF DIFFERENTIALLY EXPRESSED GENES

The number of genes selected by the NEBM is the most basic performance criterion, as it indicates the magnitude of erroneous decisions that may have been made given different realizations of gene expression data.

5.1. Permutation Analysis

In this setting, the true number of differentially expressed genes is known to be equal to zero. This allows us to evaluate the true FDR for each permutation and then average it over the 500 permutations. An additional 20 permutations were carried out (see Section 2.3) to estimate f0 by KERP or GLMP and select differentially expressed genes at each of the 500 permutations. Table 3 presents the mean number and variance of selected genes, and the FDR average, for both biological data sets and different estimation methods. The table shows that the estimate Z^C, defined by formula (6), has a very high variance. All the characteristics given in Table 3 vary only slightly depending on the method used to estimate the null distribution density f0.

To better appreciate the magnitude of the effect under discussion, the number of genes declared differentially expressed at each of the 500 random permutations is plotted in Figure 6. For Data Set 1, one could find more than 1500 differentially expressed genes (although there are none) at a particular permutation when using the NEBM with the GLM estimate f̂0. For Data Set 2, this number could be as high as 650. A similar pattern results from the permutation analysis of simulated data (Figure 7). It is clear that the NEBM works very well if there is no correlation between gene expression levels (SIMU00), but the results for SIMU02 and SIMU04 parallel rather closely those for the biological data (Figure 6) as far as extreme values of Z^C are concerned.

5.2. Cross-validation Analysis

The results of the analysis of biological data are shown in Table 4.
In addition to the mean (over cross-validation samples) and the corresponding standard deviation, Table 4 presents the number of genes declared differentially expressed before resampling, i.e., for the original grouping of the arrays. It is striking how much variability manifests itself in the number of genes selected by the NEBM. The same analysis with smaller values of d1 and d2 yields an even higher standard deviation of the number of selected genes (Table 5). The conservative selection of differentially expressed genes based on formula (4) produces more stable (and probably less biased) results, but the standard deviation is still very high (Table 6). Recall that the estimate π̂0 is not involved in formula (4) and thus contributes no extra variability to the number of selected genes. The results of the same analysis of simulated data (Figures 8 and 9) are consonant with those obtained for the biological data. It is clear that the regular NEBM based on formula (3) performs very well in the case of independent expression levels, but not nearly as well when correlations are present in the data. The standard deviation of Z^C increases with the correlation coefficient ρ. As follows from Figure 9, the conservative version of the NEBM based on formula (4) yields a smaller variance of the number of selected genes than the regular NEBM in this simulation experiment.

6. ESTIMATED FALSE DISCOVERY RATE
Published by The Berkeley Electronic Press, 2005
The false discovery rate is routinely estimated in conjunction with the NEBM. Consider first its local version suggested by the empirical Bayes paradigm. The mean value and the corresponding standard deviation of the estimate fdr(t), given by formula (9), at different values of the observed t-statistic were obtained from both sets of biological data. The results are shown in Figure 10. As would be expected, the standard deviation of fdr(t) is quite high, so that specific FDR values reported in various applications can be unreliable. A similar simulation study (not shown) corroborates this conclusion. As can be seen from Figure 11, the integral version of the Bayesian FDR is also highly variable, which is likely attributable to strong correlation between p-values.
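Formula (9) is not restated here; in the generic empirical Bayes formulation the local false discovery rate is fdr(t) = π0 f0(t)/f(t), with f0 the null density and f the mixture density of all statistics. The sketch below illustrates only this generic form, assuming a theoretical standard normal null, a kernel estimate of f, and a known π0; the simulated statistics are invented for the illustration and are unrelated to the data sets analyzed in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A synthetic mixture: 90% null statistics, 10% shifted "non-null" statistics.
t_null = rng.standard_normal(9000)
t_alt = rng.normal(3.0, 1.0, 1000)
t_all = np.concatenate([t_null, t_alt])
pi0 = 0.9                     # treated as known here; in practice it is estimated

# Local fdr: fdr(t) = pi0 * f0(t) / f(t), with the mixture density f
# estimated by Gaussian kernel smoothing and f0 taken as the N(0,1) density.
f_hat = stats.gaussian_kde(t_all)

def local_fdr(t):
    return np.clip(pi0 * stats.norm.pdf(t) / f_hat(t), 0.0, 1.0)

# Near t = 0 almost every statistic is null; deep in the right tail almost none is.
print(local_fdr(np.array([0.0, 4.0])))
```

The instability reported in Figure 10 enters through f̂ (and, in the full procedure, through π̂0): both are computed from correlated statistics, so fdr(t) inherits their sampling variability.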
7. STABILITY OF GENE RANKING

Another practical problem that deserves discussion is that of gene ranking (Lönnstedt and Speed 2002, Pepe et al. 2003, Smyth 2004). To address this issue within the NEBM framework, we conducted the following study. The delete-d-jackknife procedure was applied to resample the arrays from Data Set 2. All 12558 genes were ranked by their posterior probability values in each of the 500 subsamples produced by the resampling procedure. The top 50 genes were identified in each subsample as the set of interest (target set), and the frequency of occurrence in this set was recorded for each gene. In this case, the size of the set of selected genes is constant, and the stability of ranking manifests itself as the stability of membership in the target set of genes. Only those genes belonging to the union of the 500 subsets of genes with the highest ranks, each of size 50, were considered. None of the genes had a frequency of occurrence above 80%, only 1 gene had a frequency exceeding 70%, and only 15 genes appeared in the target set more than half of the time. Another way to approach this problem is to characterize the variability of ranking for a particular gene belonging to at least one of the target sets identified in the 500 subsamples by the mean and standard deviation of its rank computed from the subsamples. The mean (over genes in the union of target sets) values of these two characteristics were 417 and 345, respectively, both much larger than the size of the target set. Clearly, the stability of ranking relative to the size of the target set is expected to increase with larger target sets. When we defined the target set as the top 3212 genes, which equals the mean number of genes declared differentially expressed by the NEBM, we found 2461 stable genes that appeared in the target set in more than 80% of the subsamples. The mean rank and the corresponding standard deviation were equal to 4323 and 1201 in this case.
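The frequency-of-membership bookkeeping described above can be sketched in a few lines. The scores here are synthetic stand-ins for the NEBM posterior probabilities (a noisy copy of a fixed base score per subsample), and all sizes are illustrative; only the mechanics of ranking each subsample and tallying membership in the top-50 target set mirror the procedure in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# One column of "posterior probabilities" per resampled data set.
m, n_subsamples, top = 3000, 500, 50
base = rng.random(m)
scores = base[:, None] + 0.1 * rng.standard_normal((m, n_subsamples))

# For each subsample, rank the genes and record membership in the target set.
in_target = np.zeros(m, dtype=int)
order = np.argsort(-scores, axis=0)        # gene indices, best score first
for j in range(n_subsamples):
    in_target[order[:top, j]] += 1
freq = in_target / n_subsamples            # frequency of target-set membership

# "Stable" genes appear in the target set in more than 80% of subsamples.
stable = int(np.sum(freq > 0.8))
print(stable, int(np.sum(freq > 0)))       # stable genes vs. size of the union
```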
Obviously, the second example is of little practical importance because the target set thus defined is far too large.

8. DISCUSSION AND CONCLUDING REMARKS

We begin our discussion by quoting from Storey (2003): "Many assumptions that have been made for modeling microarray data have yet to be verified. Hopefully evidence either for or against these assumptions will emerge...". In the present paper, we explored possible violations of the independence assumption and their immediate effect on NEBM-based statistical inference. Our main conclusion is that strong correlation between gene expression measurements can cause highly unstable performance of the NEBM. However, this elegant methodology may still enjoy other applications where the dependence between statistical tests is less critical; the Shakespearean example in Efron
http://www.bepress.com/sagmb/vol4/iss1/art34
(2003) presumably falls into this category. In a recent paper, Efron (2005) proposes a modification of the NEBM that accounts for the correlation structure of the data, using the histogram estimator for f̂0 to rationalize the data-driven choice of a null distribution. The effect of this modification on the major performance indicators has yet to be explored. The following statement of Efron (2005) is worth quoting here: "Purely statistical improvements can also reduce correlations, for instance by more extensive standardization techniques as in Qiu, Brooks, Klebanov, and Yakovlev (2005). None of this will help, however, if microarray correlations are inherent in the way genes interact at the DNA level, rather than a limitation of current methodology." In our opinion, it is exactly this kind of interaction that manifests itself in gene expression data. Although less explicit, the same concern applies to parametric formulations of the empirical Bayes method, which rely on parameter estimates inferred from dependent observations. Problems of the same nature arise in maximum likelihood inference from microarray data (Ideker et al. 2000, Segal et al. 2003, Purdom and Holmes 2005). To fit a parametric model by the method of maximum likelihood, the joint distribution is typically replaced by the product of marginal distributions, which is a valid operation only when the observations are independent. It is unclear whether hierarchical modeling that attempts to fit sample correlations can remedy this difficulty, because such correlations are themselves stochastically dependent variables. This remains an open question. The main lesson we have learned from our study is that extreme caution must be exercised when dealing with dependent variables (test statistics) in the analysis of microarray data. The literature on multiple testing with dependent data is quite scarce. Pertinent references include Dudoit et al. (2004a,b), van der Laan et al.
(2004a,b), and Pollard and van der Laan (2004). Finner and Roters (2001, 2002) studied the expected number of Type 1 errors (ENE) in multiple testing procedures controlling the FWER and the FDR at a pre-set level. In the case of independent tests, they found that the ENE is bounded by a small number regardless of the total number of hypotheses. For dependent test statistics, however, the behavior of the ENE is completely different; the more common situation is that the ENE does not converge to a finite limit as the number of hypotheses tends to infinity. This shows how cautious one should be when extrapolating results valid under independence to dependent data. All the theoretical work related to the empirical Bayes methodology hinges on the validity of the postulated weak dependence between genes. There have been several attempts to justify this postulate on biological grounds. Some authors (e.g., Tsai et al. 2003) allow for dependence between differentially expressed genes while assuming stochastic independence of those genes that do not change their expression between the two conditions under study. The biological rationale for such a hypothesis is unclear, because normal gene function involves numerous biochemical pathways, much like altered gene function. Another questionable argument was offered by Kuznetsov (2001): "Although transcription events of some genes may in fact be correlated in a given cell, most transcription events in a cell population seem to be random, independent events". First of all, it is difficult to come up with a natural probabilistic model to support that claim, whereas it is easy to construct a reasonable counter-example. Consider a homogeneous population of k independent cells and let Xi, Yi be the expression levels of a given pair of genes in the ith cell. Since the population is homogeneous, the random vectors (Xi, Yi), i = 1, ..., k, are identically distributed. Suppose
Cov(Xi, Yi) = r for any i = 1, ..., k. Then it follows that Cov(X1 + ... + Xk, Y1 + ... + Yk) = kr ≠ 0. Secondly, this assumption is inconsistent with existing experimental data on gene expression (Qiu et al. 2005), which can only be generated at the level of cell populations and not single cells. Storey (2003) advocates a form of weak dependence between genes which he terms "clumpy dependence". His main reason is that "... genes tend to work in pathways, that is, small groups of genes interact to produce some overall process". He hypothesizes that such groups can involve from just a few to 50 or more genes and that each group is independent of the others. Clearly, the most critical part of this hypothesis is the actual size of a typical clump. Is it really about 50 genes? By contrast, our empirical study conducted with the SJCRH data (Qiu et al. 2005) suggests that such clumps involve thousands of tightly dependent genes. In summary, all the lines of reasoning presented above are disputable. The NEBM deals with test statistics which are at least approximately identically distributed. Some authors (Chen et al. 1997, Ideker et al. 2000, Kuznetsov 2001, Hoyle et al. 2002, Sidorov et al. 2002, Segal et al. 2003, Purdom and Holmes 2005, to name a few) pool expression levels across genes in order to estimate what they call the "distribution of gene expressions" or the "error distribution for gene expression data". Each gene in such data is represented by a small number of independent and identically distributed copies of expression signals, each associated with a different array (subject), while the expression levels of large numbers of different genes are heavily dependent and have dissimilar distributions.
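The covariance computation in the counter-example above is easy to verify by simulation. The sketch below assumes unit-variance bivariate Gaussian expression levels within each cell; the values of k, r, and the number of replicates are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# k independent, identically distributed cells; within each cell the pair
# (X_i, Y_i) has Cov(X_i, Y_i) = r (unit variances).
k, r, n_rep = 50, 0.3, 50000
cov = np.array([[1.0, r], [r, 1.0]])

# n_rep independent replicates of the whole k-cell population at once.
xy = rng.multivariate_normal([0.0, 0.0], cov, size=(n_rep, k))
X_sum = xy[:, :, 0].sum(axis=1)    # population-level expression of gene X
Y_sum = xy[:, :, 1].sum(axis=1)    # population-level expression of gene Y

# Cov(X_1 + ... + X_k, Y_1 + ... + Y_k) = k * r, not zero: independence of
# the cells does not decorrelate the population-level signals of the two genes.
emp_cov = np.cov(X_sum, Y_sum)[0, 1]
print(emp_cov, k * r)
```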
Not only does this approach suffer from the same problem of correlated observations, but it also seeks to estimate a distribution which is irrelevant to testing significance for individual genes. The fact is that the expression levels of different genes are not identically distributed, and the true distribution behind such data is a mixture of m equally weighted distributions, each associated with a different gene. In other words, this distribution refers to some putative "average gene", and any parametric inference on a concrete gene has little to do with the sample space behind such a model. The relevant information on biological variability for a specific gene is provided by different subjects, not by other genes. The present paper provides direct evidence that the correlation structure of microarray data cannot be ignored when designing methods for selecting differentially expressed genes. This study not only adds complexity to a problem already complicated by the presence of technological noise of unknown structure and the multiplicity of tests, but it also adds clarity to our understanding of what should be required of a sound statistical methodology for microarray data analysis. At this point in time, it is much more important to determine the limits imposed on present-day inferential procedures by the complexity of the data generated by microarray technology than to produce more methods and theoretical results based on overly simplistic assumptions that can be very far from the truth. We believe that the ability to overcome the limitations of the NEBM reported in this paper depends critically on the ability to devise procedures that destroy correlations between test statistics while preserving their marginal distributions. We envision that statistical methodology for microarray analysis will increasingly concentrate on dependencies between gene expressions; the most recent papers by Efron (2005), Dettling et al. (2005), Qiu et al.
(2005), as well as the present paper, are early signs of this trend.

AUTHORS' CONTRIBUTIONS
The primary thrust of this work was suggested by A. Yakovlev and crystallized in his discussions with X. Qiu and L. Klebanov. These discussions led to the final design of the study. X. Qiu carried out the needed computations and simulations.
REFERENCES

Allison, D.B., Gadbury, G.L., Heo, M., Fernández, J.R., Lee, C.-K., Prolla, T.A. and Weindruch, R. (2002) A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis, 39, 1-20.

Bolstad, B.M., Irizarry, R.A., Astrand, M. and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185-193.

Chen, Y., Dougherty, E. and Bittner, M. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics, 2, 364-374.

Conover, W.J. (1999) Practical Nonparametric Statistics, 3rd Ed., Wiley, New York.

Dalmasso, C., Broet, P. and Moreau, T. (2004) A simple procedure for estimating the false discovery rate. Bioinformatics, Advance Access.

Dettling, M., Gabrielson, E. and Parmigiani, G. (2005) Searching for differentially expressed gene combinations. http://www.bepress.com/jhubiostat/paper77

Dudoit, S., Yang, Y.H., Speed, T.P., and Callow, M.J. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-139.

Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2003) Multiple hypothesis testing in microarray experiments. Statistical Science, 18, 71-103.

Dudoit, S., van der Laan, M.J., and Pollard, K.S. (2004a) Multiple testing. Part I. Single-step procedures for control of general Type I error rates. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 13.

Dudoit, S., van der Laan, M.J., and Birkner, M.D. (2004b) Multiple testing procedures for controlling tail probability error rates. Technical Report #166, Division of Biostatistics, UC Berkeley.

Efron, B., Tibshirani, R., Storey, J.D. and Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151-1160.

Efron, B. (2003) Robbins, empirical Bayes and microarrays. The Annals of Statistics, 31, 366-378.

Efron, B. (2004) Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association, 99, 96-104.
Efron, B. (2005) Correlation and Large-Scale Simultaneous Significance Testing. http://www-stat.stanford.edu/~brad/papers/
Finner, H. and Roters, M. (2001) On the false discovery rate and expected Type 1 errors. Biometrical Journal, 43, 985-1005.

Finner, H. and Roters, M. (2002) Multiple hypotheses testing and expected number of Type 1 errors. The Annals of Statistics, 30, 220-238.

Hoyle, D.C., Rattray, M., Jupp, R. and Brass, A. (2002) Making sense of microarray data distributions. Bioinformatics, 18, 576-584.

Ibrahim, J.G., Chen, M.-H. and Gray, R.J. (2002) Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association, 97, 88-99.

Ideker, T., Thorsson, V., Siegel, A.F. and Hood, L.E. (2000) Testing for differentially expressed genes by maximum likelihood analysis of microarray data. Journal of Computational Biology, 7, 805-817.

Irizarry, R.A., Gautier, L., and Cope, L.M. (2003) An R package for analyses of Affymetrix oligonucleotide arrays. In: The Analysis of Gene Expression Data, Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S.L., eds, Springer, New York, 102-119.

Kuznetsov, V.A. (2001) Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. EURASIP Journal on Applied Signal Processing, 4, 285-296.

Lee, M.-L.T., Lu, W., Whitmore, G.A. and Beier, D. (2002) Models for microarray gene expression data. Journal of Biopharmaceutical Statistics, 12, 1-19.

Lönnstedt, I. and Speed, T. (2002) Replicated microarray data. Statistica Sinica, 12, 31-46.

Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R. and Tsui, K.W. (2000) On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8, 37-52.

Newton, M.A. and Kendziorski, C.M. (2003) Parametric empirical Bayes methods for microarrays. In: The Analysis of Gene Expression Data, Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S.L., eds, Springer, New York, 254-271.

Pepe, M.S., Longton, G., Anderson, G.L. and Schummer, M. (2003) Selecting differentially expressed genes from microarray experiments. Biometrics, 59, 133-142.

Pollard, K.S. and van der Laan, M.J. (2004) Choice of a null distribution in resampling-based multiple testing. Journal of Statistical Planning and Inference, 125, 85-100.
Politis, D.N. and Romano, J.P. (1994) Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22, 2031-2050.
Pounds, S. and Cheng, C. (2004) Improving false discovery rate estimation. Bioinformatics, 20, 1737-1745.

Pounds, S. and Morris, S.W. (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics, 19, 1236-1242.

Qiu, X., Brooks, A.I., Klebanov, L., and Yakovlev, A. (2005) The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6, Article 120.

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368-375.

Segal, E., Wang, H. and Koller, D. (2003) Discovering molecular pathways from protein interactions and gene expression data. Bioinformatics, 19, i264-i272.

Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. Springer Series in Statistics, Springer, New York.

Shedden, K., Chen, W., Kuick, R., Ghosh, D., MacDonald, J., Cho, K.R., Giordano, T.J., Gruber, S.B., Fearon, E.R., Taylor, J.M.G. and Hanash, S. (2005) Comparison of seven methods for producing Affymetrix expression scores based on False Discovery Rates in disease profiling data. BMC Bioinformatics, 6, Article 26.

Sidorov, I.A., Hosack, D.A., Gee, D., Yang, J., Cam, M.C., Lempicki, R.A. and Dimitrov, D.S. (2002) Oligonucleotide microarray data distribution and normalization. Information Sciences, 146, 67-73.

Smyth, G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 3.

Storey, J. (2001) The false discovery rate: A Bayesian interpretation and the q-value. Technical Report, Stanford University.

Storey, J.D. (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society, Ser. B, 64, 479-498.

Storey, J.D. (2003a) The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics, 31, 2013-2035.

Storey, J.D. (2003b) Comment on "Resampling-based multiple testing for DNA microarray data analysis" by Ge, Dudoit, and Speed. Test, 12, 1-77.

Storey, J.D., Taylor, J.E. and Siegmund, D. (2003) Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Ser. B, 66, 187-205.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences USA, 100, 9440-9445.
Tsai, C.-A., Hsueh, H.-M. and Chen, J.J. (2003) Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics, 59, 1071-1081.

van der Laan, M.J., Dudoit, S., and Pollard, K.S. (2004a) Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 14.

van der Laan, M.J., Dudoit, S., and Pollard, K.S. (2004b) Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 15.
Figure 1: Permutation analysis of biological data. The mean (solid line) and standard deviation (vertical bars) of the estimate f̂0 obtained by GLM (left panels) and kernel (right panels) smoothing.
Figure 2: Permutation analysis of simulated data. The mean (solid line) and standard deviation (vertical bars) of the estimate f̂0 obtained by GLM smoothing for different values of ρ.
Figure 3: Cross-validation analysis of biological data (basic scenario). The mean (solid line) and standard deviation (vertical bars) of the estimate f̂ obtained by GLM (left panels) and kernel (right panels) smoothing.
Figure 4: Cross-validation analysis of Data Set 2 with d1 = 7 and d2 = 9. The mean (solid line) and standard deviation (vertical bars) of the estimate f̂ obtained by GLM (left panel) and kernel (right panel) smoothing.
Figure 5: Cross-validation analysis of simulated data with 125 "Different" genes. The mean and standard deviation of the GLM estimate f̂ for different values of ρ.
Figure 6: Permutation analysis of biological data. The sizes of sets of genes declared differentially expressed in the course of permutation analysis. The GLM estimation procedure is used in this analysis.
Figure 7: Permutation analysis of simulated data. The sizes of sets of genes declared differentially expressed in the course of permutation analysis. The GLM estimation procedure is used in this experiment.
Figure 8: Results on the regular NEBM procedure. The mean (circles) and variance (vertical bars) of the number of differentially expressed genes Z^C generated by each method for ρ = 0.0, 0.2, 0.4, 0.6. The corresponding mean values of the true FDR are indicated for each method and each ρ-value.
Figure 9: Results on the conservative (π0 = 1) NEBM procedure. The mean (circles) and variance (vertical bars) of the number of differentially expressed genes Z^C generated by each method for ρ = 0.0, 0.2, 0.4, 0.6. The corresponding mean values of the true FDR are indicated for each method and each ρ-value.
Figure 10: Cross-validation analysis (basic scenario) of biological data. Estimation of the local FDR from Data Set 1 and Data Set 2 using the GLM estimation procedure. Solid line: mean fdr(t) over cross-validations; vertical bars: the corresponding standard deviation.
Figure 11: Cross-validation analysis of biological data (basic scenario). Estimation of the Bayesian FDR (q-value) from Data Set 1 and Data Set 2. Solid line: mean FDR(u) over cross-validations; vertical bars: the corresponding standard deviation.
Table 1: Estimation of π0 from simulated data. Cross-validation analysis.
Table 2: Estimation of π0 from biological data. Cross-validation analysis.
Table 3: Permutation analysis of biological data. Mean and standard deviation (SD) of the number of selected genes, and the true FDR average.
Table 4: Cross-validation analysis of biological data (basic scenario). Mean and standard deviation (SD) of the number of selected genes Z^C.
Table 5: Cross-validation analysis of biological data (Data Set 2) with smaller d. Mean and standard deviation (SD) of the number of selected genes Z^C.
Table 6: Cross-validation analysis of biological data (basic scenario). Conservative selection procedure. Mean and standard deviation (SD) of the number of selected genes Z^C.