Communications in Statistics—Theory and Methods, 38: 2733–2747, 2009
Copyright © Taylor & Francis Group, LLC
ISSN: 0361-0926 print / 1532-415X online
DOI: 10.1080/03610910902936281
Control of the FWER in Multiple Testing Under Dependence
D. CAUSEUR, M. KLOAREG, AND C. FRIGUET
Agrocampus, Applied Maths Department, European University of Brittany, Rennes, France

Multiple testing issues have long been considered almost exclusively in the context of the General Linear Model, in which the significance of a rather limited number of contrasts is usually tested simultaneously. Most of the procedures used in this context were designed to control the so-called Family-Wise Error Rate (FWER), defined as the probability of at least one erroneous rejection of a null hypothesis. In the last two decades, large-scale significance tests, encountered for example in microarray data analysis, have renewed the methodology of multiple testing by introducing novel definitions of Type-I error rates, such as the False Discovery Rate (FDR), in order to define less conservative procedures. High dimension has also highlighted the need for improvements that guarantee the control of the error rates in various situations of dependent data. The present article gives motivations for a factor analysis modeling of the covariance between test statistics, both in the situation of simultaneous tests of a small set of contrasts in the General Linear Model and in high-dimensional significance tests. The impact of dependence on the power of multiple testing is first discussed, and a new procedure controlling the FWER, based on factor-adjusted test statistics, is presented as a way to improve the Type-II error rate with respect to existing methods. Finally, the beneficial impact of the new method is shown on simulated datasets.

Keywords: Factor analysis; Family-wise error rate; Multiple-hypothesis testing; Non-discovery rate.

Mathematics Subject Classification: Primary 62H15; Secondary 62P10.
Received October 31, 2008; Accepted February 1, 2009.
Address correspondence to D. Causeur, Agrocampus, Applied Maths Department, European University of Brittany, Rennes, France; E-mail: [email protected]

1. Introduction

Simultaneous inference on a set of linear contrasts in the General Linear Model has long been the main motivation for the study of multiple testing procedures. This framework is generally characterized by a relatively small number of hypotheses to be tested simultaneously, usually some pairwise comparisons of mean levels in analysis of variance models. However, and despite many attempts at a
decision-theoretic approach to multiple hypothesis testing (see Shaffer, 1995, for a comprehensive review), even a limited multiplicity of hypotheses has turned out to raise complex issues, which make a proper theoretical framing of multiple testing much more than a multivariate extension of the Neyman–Pearson theory for single testing. As noticed by Shaffer (1995), Dudoit et al. (2003), and Storey (2007), discussions about an overall Type-I error rate, about the conditions under which multiple testing procedures can guarantee that this error rate stays below a preset level, or about the possible optimality of a given procedure are much more open than in single-hypothesis testing theory, due to the diversity of objectives, the combinatorially large variety of possible decisions, and the need to account for the joint distribution of the test statistics.

In order to handle multiple hypothesis testing of linear contrasts in the General Linear Model, it has often been considered natural to extend the usual Type-I error rate, defined as the probability of a false rejection, to the probability of at least one false rejection among the decisions. Because this overall Type-I error rate applies to the whole family of hypotheses and not to each one separately, it is now well known as the Family-Wise Error Rate (FWER). Most multiple testing procedures were first designed to control the FWER, or in other words to ensure that it remains lower than a given level $\alpha$. It is beyond the purpose of this article to give a comprehensive typology of the numerous existing multiple testing procedures (see Dudoit et al., 2003, for such a review). It should only be kept in mind that most of them are based on the ordered $p$-values of the individual tests and a thresholding method that gives, for each $p$-value, a cut-off under which the corresponding null hypothesis is rejected. The large number of existing procedures can partly be explained by different approximations of the multivariate probability involved in the definition of the FWER, and also by the need for sophisticated thresholding procedures in order to control the FWER under general distributional assumptions.

Accounting for the dependence between test statistics by a factor analysis model to improve multiple testing procedures was initially proposed by Hsu (1992), for simultaneous inference on linear contrasts in the General Linear Model. In the context of microarray data analysis, Hsu et al. (2006) also suggested a one-factor approximation of the correlation between gene expressions, considered as induced by a normalization of the data whose aim is to reduce the technological bias between microarrays. More recently, Friguet et al. (2009) studied the impact of dependence on the variance of the number of false rejections under the assumption of a factor analysis model for the correlation of high-throughput data. By analogy with Efron (2007), they propose a conditional estimate of the FDR and, to reduce the instability due to dependence in the distribution of the error rates, suggest a multiple testing procedure based on factor-adjusted test statistics which controls the FDR.

The present article can be viewed as a transposition of Friguet et al. (2009)'s factor-analytic approach to the control of the FWER. Section 2 gives motivating arguments for a factor analysis modeling of the correlation between test statistics, both for multiple testing of linear contrasts in the General Linear Model and for high-dimensional significance tests. In Sec. 3, a factor-adjusted multiple testing procedure is presented, based on an EM estimation of the factor analysis model. A simulation study is proposed in Sec. 4 to illustrate the properties of the new method with respect to the equivalent procedure with non-adjusted test statistics.
An application to a public gene expression dataset is also provided. Finally, Sec. 5 is dedicated to a discussion and comments focusing on open problems.
2. Settings for the Multiple Testing Procedure

Multiple testing issues have essentially been explored in two main contexts: first, simultaneous tests of linear contrasts in the General Linear Model can be viewed as the classical situation in which most of the usual methods have been developed and, more recently, large-scale significance tests on a single contrast derived for a potentially huge number of response variables have motivated novel and specific approaches.
2.1. Simultaneous Tests of Linear Contrasts in the General Linear Model

Consider the general linear model $Y = X\beta + \varepsilon$, where $Y$ is an $n \times 1$ vector of observations for the response variable, $X$ is the $n \times p$ design matrix with rank $p < n$, and $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of unknown expectation parameters. $\varepsilon$ is an $n \times 1$ vector of error terms assumed to be normally distributed with mean 0 and variance $\sigma^2 I_n$. Let us focus on linear contrasts $\psi = (\psi_1, \ldots, \psi_k)'$ of interest defined by $\psi = L\beta$, where $L$ is a known $k \times p$ matrix. The best linear unbiased estimator of $\psi$ is given by $\hat\psi = L(X'X)^{-1}X'Y$ and its variance, hereafter denoted $V(\hat\psi)$, has expression $V(\hat\psi) = \sigma^2 L(X'X)^{-1}L'$. As an illustration, let us focus on the one-way design with covariates, as proposed by Hsu (1992):
$$Y_{ik} = \mu_i + \beta'(x_{ik} - \bar{x}) + \varepsilon_{ik}, \quad i = 1, \ldots, I, \ k = 1, \ldots, n_i, \qquad (1)$$
where $\mu_1, \ldots, \mu_I$ are treatment effects and $\beta$ is the $p$-vector of common slopes for the covariates $x = (x^{(1)}, x^{(2)}, \ldots, x^{(p)})'$. Suppose the linear contrasts of interest are the $I-1$ treatment-versus-control effects $\psi = (\mu_1 - \mu_I, \ldots, \mu_{I-1} - \mu_I)'$. It can be checked that $V(\hat\psi)$ is given by the following expression:
$$V(\hat\psi) = \sigma^2 \left[ \mathrm{diag}(\delta) + \frac{1}{n_I}\, \mathbf{1}\mathbf{1}' + \frac{1}{n}\, \tilde{x}\, S_x^{-1} \tilde{x}' \right], \qquad (2)$$
where $n = \sum_{i=1}^{I} n_i$ is the sample size, $\delta = (1/n_1, 1/n_2, \ldots, 1/n_{I-1})'$, $\mathrm{diag}(\delta)$ denotes the $(I-1) \times (I-1)$ diagonal matrix with diagonal $\delta$, $\mathbf{1}$ is the $(I-1)$-vector whose entries are all 1, $\tilde{x}$ is the $(I-1) \times p$ matrix with $i$th row $(\bar{x}_i - \bar{x}_I)'$, and $S_x = \sum_{i=1}^{I} \sum_{k=1}^{n_i} (x_{ik} - \bar{x}_i)(x_{ik} - \bar{x}_i)'/n$ is the $p \times p$ within-group variance matrix of the covariates. Therefore, the variance of $\hat\psi$ has a factor structure with $q+1$ factors, where $q$ is the rank of the matrix $\tilde{x} S_x^{-1} \tilde{x}'$ of squared Mahalanobis distances of the covariates between the treatment groups and the control:
$$V(\hat\psi) = \Psi + \sum_{k=1}^{q+1} b_k b_k',$$
where $\Psi = \sigma^2 \mathrm{diag}(\delta)$ stands for the diagonal matrix of specific variances, also called uniquenesses, and the $(I-1)$-vectors $b_k$ are the loadings. In most applications where the covariates are continuous variables not subject to any experimental control,
the rank $q$ shall be equal to $p$, provided $p$ does not exceed the number of hypotheses to be tested simultaneously. Note also that one of the vectors of loadings, say $b_1$, does not depend on the covariates but only on the sampling design: $b_1 = (\sigma/\sqrt{n_I})\,\mathbf{1}$. Moreover, the part of the factor structure defined by the $q$ remaining loadings clearly varies according to the effect of the treatment factor on the covariates.

If $v_{ii}$ stands for the $i$th diagonal term of the unscaled covariance matrix $\sigma^{-2} V(\hat\psi)$, then the multiple testing procedure is based on the usual Student's test statistics $T_i = \hat\psi_i / (s \sqrt{v_{ii}})$, where $s^2$ is the residual mean square error of model (1), or on the corresponding $p$-values $P_i = 2(1 - F(|T_i|))$, where $F$ is the distribution function of the Student distribution with $r = n - p - I$ degrees of freedom. Moreover, for a given threshold $t$ such that the null hypothesis $H_0^{(i)}: \psi_i = 0$ is rejected if $P_i \leq t$, the FWER is defined as follows:
$$\mathrm{FWER}(t) = 1 - P\Big( \bigcap_{i \in \mathcal{M}_0} \{P_i > t\} \Big), \qquad (3)$$
where $\mathcal{M}_0 = \{ i = 1, \ldots, I-1 : \psi_i = 0 \}$. Expression (3) involves the probability function of the joint distribution of the test statistics, namely the multivariate Student distribution $\mathcal{T}_r(R_{\hat\psi})$, where $R_{\hat\psi}$ is the correlation matrix of $\hat\psi$, directly deduced from expression (2) of its variance. This multivariate probability can be evaluated using, for instance, Genz and Bretz (2002)'s method, implemented in the R package mvtnorm (see also Genz et al., 2008). Ignoring the dependence between the contrasts simplifies expression (3) to $\mathrm{FWER}(t) = 1 - (1-t)^{m_0}$, which leads to the so-called Šidák equation for the choice of a threshold $t$ ensuring $\mathrm{FWER}(t) \leq 1 - (1-t)^{m_0} \leq \alpha$.

In order to illustrate the impact of dependence on the control of $\mathrm{FWER}(t)$, consider a one-way analysis of covariance model with $p = 5$ covariates and, for all $i$, $\psi_i = 0$ or, in other words, $\mathcal{M}_0 = \{1, \ldots, I-1\}$, with $I = 10$. Scenarios of dependence between the estimators of the contrasts are considered hereafter, with a growing proportion $\rho = \sum_{k=1}^{q+1} b_k' b_k / \mathrm{trace}(V(\hat\psi))$ of common variance shared in the factor structure: for each scenario, the same matrix $\tilde{x}$ is used, chosen arbitrarily with $\mathrm{rank}(\tilde{x} S_x^{-1} \tilde{x}') = p$, and $S_x = (1/c_x) I_p$, where $c_x$ is a scaling parameter varying from a large value, for a correspondingly large value of $\rho$, to a small value, for a correspondingly small value of $\rho$. For each scenario, the sample sizes per level of the factor are chosen to ensure a power of 0.8 for each individual t-test. Note that, in the limiting case $c_x = 0$, the lower bound is reached for $\rho$, which here equals 0.5. Solving the Šidák equation to control the FWER at level $\alpha = 0.1$ gives $t = 0.012$. Figure 1 shows how the theoretical $\mathrm{FWER}(t)$ varies with $\rho$ for this choice of $t$. The same plot also displays the approximated $\mathrm{FWER}(t)$ deduced from a 1-factor decomposition of $V(\hat\psi)$, as suggested by Hsu (1992). This 1-factor approximation is here obtained by maximum likelihood factor analysis.

First, considering Fig. 1, it is interesting to note that ignoring the dependence in the thresholding procedure leads to a control of the FWER at a much smaller level than expected. Equivalently, ignoring dependence decreases the power of multiple testing procedures, which is also a common conclusion of many recent articles dealing with the properties of multiple testing procedures under dependence (see, for instance, Efron, 2007; Storey, 2007). Moreover, the 1-factor approximation proposed by Hsu (1992) enables a close approximation of the theoretical FWER, which provides a simple method to account for dependence in the choice of a proper threshold.
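To make the evaluation of (3) concrete, the following R sketch computes the theoretical $\mathrm{FWER}(t)$ from the correlation matrix $R_{\hat\psi}$ using the mvtnorm package and compares it with the Šidák threshold obtained under independence. The degrees of freedom, the equicorrelated stand-in for $R_{\hat\psi}$, and the other numerical values are hypothetical and are not the exact settings of the article.

```r
library(mvtnorm)

## Hypothetical ANCOVA-like setting: I - 1 = 9 treatment-vs-control contrasts
m0    <- 9          # number of true null hypotheses
r     <- 50         # residual degrees of freedom (assumed value)
alpha <- 0.10

## Sidak threshold under independence: 1 - (1 - t)^m0 = alpha
t_sidak <- 1 - (1 - alpha)^(1 / m0)   # about 0.012

## Correlation matrix of the contrast estimators, deduced in principle from (2);
## here an equicorrelated matrix is used as a simple stand-in for R_psi_hat.
rho   <- 0.5
R_psi <- matrix(rho, m0, m0); diag(R_psi) <- 1

## FWER(t) = 1 - P(all |T_i| < quantile) under the multivariate Student T_r(R_psi)
fwer <- function(thr, R, df) {
  q_thr <- qt(1 - thr / 2, df = df)
  1 - pmvt(lower = rep(-q_thr, nrow(R)), upper = rep(q_thr, nrow(R)),
           df = df, corr = R)[1]
}

fwer(t_sidak, R_psi, r)   # below alpha = 0.1 when the contrasts are correlated
```

With positive correlation between the contrast estimators, the value returned falls noticeably below the nominal level, which is exactly the loss of power discussed above.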
Figure 1. Theoretical $\mathrm{FWER}(t)$ along the amount of dependence between the estimators $\hat\psi_i$ of linear contrasts, measured by the proportion $\rho$ of common variance in the factor analysis decomposition of the variance of $\hat\psi$. The threshold $t$ is chosen to control $\mathrm{FWER}(t)$ at level $\alpha = 0.1$ under the assumption of independence.
2.2. Large-Scale Significance Tests

The following framing for a multiple testing issue is specific to situations where data are obtained from high-throughput technology, resulting in a much larger number of observed variables than the sample size $n$. First, let $Y = (Y^{(1)}, Y^{(2)}, \ldots, Y^{(m)})'$ be a random $m$-vector whose conditional distribution with respect to some explanatory variables $x = (x^{(1)}, \ldots, x^{(p)})'$, $p \geq 1$, is normal with expectation $E_x(Y^{(k)}) = \beta_0^{(k)} + \beta^{(k)\prime} x$, $k = 1, \ldots, m$, and variance $\Sigma$, assumed to be constant with respect to $x$ and positive definite. For variables with indices $k$ in a subset $\mathcal{M}_0$ of $\{1, 2, \ldots, m\}$ of size $m_0$, a particular linear contrast of interest $\theta_k = c'\beta^{(k)}$, defined by a known $p$-vector $c$ of coefficients, is zero, whereas for $k$ in $\mathcal{M}_1 = \mathcal{M}_0^c$, $\theta_k \neq 0$. Our aim is to find out the response variables for which $H_0^{(k)}: \theta_k = 0$ is not true, or in other words, to test the $m$ null hypotheses simultaneously. Hereafter, the least-squares estimator of $\theta_k$, calculated on a sample of $n$ independent items, is denoted $\hat\theta_k$, and $\hat\theta_k - \theta_k$ is of course normally distributed with expectation 0 and variance $\sigma_k^2\, n^{-1}\, c' S_{xx}^{-1} c$, where $\sigma_k$ is the conditional standard deviation of $Y^{(k)}$ given $x$ and $S_{xx}$ is the sample variance-covariance matrix of the explanatory variables. A usual testing procedure is based on the individual Student's test statistics $T^{(k)} = \sqrt{n}\, \hat\theta_k / (s_k \sqrt{c' S_{xx}^{-1} c})$, where $s_k^2$ is the residual mean square error of the linear model relating $Y^{(k)}$ to $x$.
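As an illustration of these column-by-column statistics, the following R sketch computes the usual Student statistic for a single contrast, here the two-group comparison, for every response variable. The data, dimensions, and the helper name `ordinary_tests` are made up for illustration only.

```r
## Hypothetical example: m responses, two groups of size n/2, contrast = group effect
set.seed(1)
n <- 30; m <- 200
group <- gl(2, n / 2)                       # explanatory variable x
Y <- matrix(rnorm(n * m), n, m)             # n x m matrix of responses

## Ordinary per-response Student statistics T^(k) and their p-values
ordinary_tests <- function(Y, group) {
  t(apply(Y, 2, function(y) {
    fit <- summary(lm(y ~ group))$coefficients
    fit["group2", c("t value", "Pr(>|t|)")]  # statistic and p-value for the contrast
  }))
}
res <- ordinary_tests(Y, group)
head(res)
```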
For $k \neq k'$, the correlation between $\hat\theta_k$ and $\hat\theta_{k'}$ is simply the conditional correlation $\rho_{kk'}$ between $Y^{(k)}$ and $Y^{(k')}$ given $x$. It is deduced that dependence between the estimators of the contrasts, and asymptotically between the test statistics, is here directly inherited from the dependence between the response variables, and not from the sampling design as in the case of multiple tests in the General Linear Model. As proposed by Efron (2007), modeling the correlation between the responses can lead to some improvements of multiple testing procedures for correlated responses. Different types of dependence that can be deduced from biological assumptions on the co-expression of genes are, for example, explored by Gordon et al. (2007) and Klebanov and Yakovlev (2007). Kim and van de Wiel (2008) also proposed graphical covariance structures that introduce partial independence assumptions relating to a network representation of the gene expressions. Hereafter, it is assumed that the conditional covariance matrix of the responses, given the explanatory variables, is represented by a factor analysis model:
$$\Sigma = \Psi + BB', \qquad (4)$$
where $\Psi$ is a diagonal $m \times m$ matrix of uniquenesses $\psi_k^2$ and $B$ is an $m \times q$ matrix of factor loadings. This data reduction technique, although quite popular among sociologists and psychometricians, has only recently appeared as an interesting tool to investigate the dependence structure in high-dimensional microarray data (see Kustra et al., 2006; Pournara and Wernish, 2007). In order to account for the dependence between the test statistics resulting from a very usual pre-processing of microarray data called normalization, a one-factor approximation of the correlation between gene expressions is also proposed by Hsu et al. (2006). Friguet et al. (2009) showed that a large proportion of common variance in the factor analysis decomposition has a negative impact on the stability of FDR-controlling multiple testing procedures and suggested a factor-analytic method to reduce the correlation between test statistics. This method is described hereafter and its impact on the properties of a FWER-controlling procedure is studied.
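To illustrate model (4), the following R sketch generates responses whose conditional covariance has the factor structure $\Sigma = \Psi + BB'$ by adding a small number of latent factors to independent specific errors. The loadings, uniquenesses, and dimensions are arbitrary choices for illustration, not the article's simulation settings.

```r
## Simulate m responses with covariance Sigma = Psi + B B' (no signal here)
set.seed(2)
n <- 30; m <- 500; q <- 5
B   <- matrix(rnorm(m * q, sd = 0.6), m, q)   # arbitrary m x q loadings
psi <- runif(m, 0.3, 1)                       # uniquenesses psi_k^2

Z   <- matrix(rnorm(n * q), n, q)             # latent factors
E   <- matrix(rnorm(n * m), n, m) %*% diag(sqrt(psi))
Y   <- Z %*% t(B) + E                         # n x m responses, cov = Psi + B B'

## Proportion of common variance, as used in the dependence scenarios below
sum(B^2) / (sum(B^2) + sum(psi))
```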
3. A Factor-Analytic Multiple Testing Procedure

Before presenting an improved multiple testing procedure which takes advantage of the factor structure, we start with a similar, yet much simpler, single-testing issue in which it can be assumed that the null hypothesis is true for some auxiliary covariates.

3.1. Likelihood-Ratio Test in the Presence of Covariates Under H0

Let us examine the following single-testing issue in the multivariate context introduced above: for a contrast of interest defined by the $p$-vector of coefficients $c$, our aim is to test the null hypothesis $H_0^{(m)}: \theta_m = 0$ against $H_1^{(m)}: \theta_m \neq 0$, under the assumptions $\theta_j = 0$, $j = 1, \ldots, m-1$. For example, this situation can be encountered in microarray data analysis, where $Y^{(m)}$ is a gene expression of interest and $Y^{(1)}, \ldots, Y^{(m-1)}$ are expressions of so-called housekeeping genes. Such control genes, whose expression has no biological reason to vary from one experimental condition to another, are introduced in some microarray experiments in order to estimate and remove a possible technological bias between microarrays.
The above problem can be restated as a classical testing issue in a general linear model context. Let $\mathbf{Y}$ be the $mn$-vector obtained by concatenating the measurements of $Y^{(j)}$, $j = 1, \ldots, m$, on the sample of size $n$. If $\beta = (\beta_0^{(1)}, \beta^{(1)\prime}, \ldots, \beta_0^{(m)}, \beta^{(m)\prime})'$ is the vector of unknown regression coefficients in the model relating $\mathbf{Y}$ to $x$, the test of $H_0^{(m)}$ can be viewed as a test for the significance of a particular linear combination of $\beta$ under linear restrictions which state the nullity of $\theta_j$, $j = 1, \ldots, m-1$. If the variance parameters are assumed to be known, it can be shown that the Likelihood-Ratio Test statistic resulting from application of normal theory in this special case of general linear hypothesis testing is given by:
$$T = \frac{\sqrt{n}\,\big(\hat\theta_m - c' B_{m-1}\, \beta_{m \cdot m-1}\big)}{\sigma_{m \cdot m-1} \sqrt{c' S_{xx}^{-1} c}}, \qquad (5)$$
where $B_{m-1}$ is the $p \times (m-1)$ matrix whose $j$th column is $\hat\beta^{(j)}$, $\beta_{m \cdot m-1}$ is the $(m-1)$-vector of regression coefficients for $Y^{(1)}, \ldots, Y^{(m-1)}$ in the model relating $Y^{(m)}$ to $x$ and $Y^{(1)}, \ldots, Y^{(m-1)}$, and $\sigma^2_{m \cdot m-1}$ is the residual variance of this model. Of course, if the conditional covariance between $Y^{(m)}$ and $(Y^{(1)}, \ldots, Y^{(m-1)})$, given $x$, is zero, $T$ coincides with the classical Student's test statistic. Plugging in the maximum likelihood estimators of $\beta_{m \cdot m-1}$ and $\sigma^2_{m \cdot m-1}$ in expression (5) leads to an asymptotically optimal test statistic, which can show large improvements with respect to the classical Student's test.

3.2. Application to High-Dimensional Data

Generally, apart from the special case mentioned above of control genes in microarray experiments, direct measurements of auxiliary covariates for which it could be assumed that $H_0$ is true are not available to improve large-scale significance tests. However, in gene expression datasets for example, it can often be assumed that $H_0$ is true for a large fraction of the variables, or in other words, that $\mathcal{M}_0$ is large. The method we propose hereafter consists in taking advantage of this unknown but large set of variables to derive new individual test statistics inspired by expression (5). A crucial issue in such an approach is the handling of the potentially huge size of $\mathcal{M}_0$. This can be addressed by the factor analysis modeling of the conditional variance of the variables. Assumption (4) can indeed be viewed as equivalent to the existence of latent factors $Z = (Z_1, \ldots, Z_q)'$, supposed to concentrate in a small-dimensional space the common information contained in the $m$ responses: for $k = 1, \ldots, m$,
$$Y^{(k)} = \beta_0^{(k)} + \beta^{(k)\prime} x + b_k' Z + \varepsilon_k, \qquad (6)$$
where $b_k$ is the $k$th row of $B$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_m)'$ is a random $m$-vector, independent of $Z$, with mean 0 and variance-covariance $\Psi$. Application of expression (5) using the factors as covariates results in the following factor-adjusted test statistics:
$$T^{(k)}(Z) = \frac{\sqrt{n}\,\big(\hat\theta_k - c'\, \hat\Theta_z\, \hat{b}_k\big)}{\hat\psi_k \sqrt{c' S_{xx}^{-1} c}}, \qquad (7)$$
where $\hat{b}_k$ is the $k$th row of the matrix $\hat{B}$ of estimated loadings, $\hat\psi_k^2$ is the $k$th diagonal element of the matrix $\hat\Psi$ of estimated uniquenesses, and $\hat\Theta_z$ is the least-squares estimator of the $p \times q$ matrix of slope coefficients in the multivariate regression model relating the estimated factors $Z$ to the explanatory variables $x$. In the sequel, we describe an estimation procedure for the parameters of the factor analysis model inspired by the EM approach of Rubin and Thayer (1982).
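Before turning to that estimation procedure, here is a minimal R sketch of how the adjustment behind (7) could be carried out once loadings and factor scores are available. It reuses `group` and `ordinary_tests` from the earlier sketch and the factor-model data Y, B, Z simulated after (4); for simplicity it cheats by using the true loadings and factors in place of estimates, and it subtracts the factor contribution from the responses before re-running the tests. This is a close cousin of expression (7), not the exact formula: the numerator matches, while the residual standard error of the re-fit plays the role of $\hat\psi_k$.

```r
## Sketch: factor-adjusted tests on the simulated data above.
## Assumes loadings B_hat (m x q) and factor scores Z_hat (n x q) are available,
## e.g. from the EM algorithm of Section 3.3; here the true values stand in.
B_hat <- B
Z_hat <- Z

## Remove the factor contribution from each response, then redo the per-response tests
Y_adj <- Y - Z_hat %*% t(B_hat)
res_adjusted <- ordinary_tests(Y_adj, group)

## The adjusted statistics are far less correlated, which stabilizes FWER control
head(res_adjusted)
```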
3.3. Estimation in the Factor Analysis Model

Many estimation methods can be used for the parameters of a factor analysis model, among which Principal Factoring is probably the most famous (see Mardia et al., 1979), especially when factor analysis is used for an exploratory purpose. However, in high-dimensional situations (involving thousands of variables), Principal Factoring can be computationally cumbersome because each step of the iterative algorithm consists in a singular value decomposition (SVD) of a large correlation matrix. Since factor analysis is a particular latent variable model, an EM algorithm (see Rubin and Thayer, 1982) can be implemented to achieve the maximum likelihood solution and avoid SVDs of large matrices. The algorithm we propose slightly modifies the initial EM algorithm to apply to the modeling of a conditional covariance matrix, given $x$. In the following, $\hat\varepsilon^x$ is the $n \times m$ residual matrix for the fixed effects of models (6). After a primary estimation $\hat{\mathcal{M}}_0$ of $\mathcal{M}_0$ by the set of variables for which the $p$-value of the usual Student's test for $H_0$ exceeds 0.05, the $k$th column of $\hat\varepsilon^x$ is derived by fitting either an unrestricted linear model relating $Y^{(k)}$ to $x$ if $k \notin \hat{\mathcal{M}}_0$, or the same model under the restriction $H_0^{(k)}$ if $k \in \hat{\mathcal{M}}_0$. The iterative algorithm is now described through its E and M steps:

• E step. $\hat{Z}$ is first computed: for $i = 1, \ldots, n$, $\hat{Z}_i = G^{-1} B' \Psi^{-1} \hat\varepsilon^x_i$ and $S_i = G^{-1} + \hat{Z}_i \hat{Z}_i'$, where $G = I_q + B' \Psi^{-1} B$, $\hat\varepsilon^x_i$ denotes the $i$th row of $\hat\varepsilon^x$, and $\hat{Z}_i$ denotes the $i$th row of $\hat{Z}$.

• M step. The uniquenesses and factor loadings are derived: $B = \big( \sum_{i=1}^{n} \hat\varepsilon^x_i \hat{Z}_i' \big) \big( \sum_{i=1}^{n} S_i \big)^{-1}$ and $\Psi = \mathrm{diag}\big( S - B\, \frac{1}{n} \sum_{i=1}^{n} \hat{Z}_i \hat\varepsilon^{x\prime}_i \big)$, where diag is here the matrix operator that sets all the off-diagonal elements to zero and $S$ stands for the usual sample estimate of $\Sigma$:
$$S = \frac{1}{n} \sum_{i=1}^{n} \hat\varepsilon^x_i \hat\varepsilon^{x\prime}_i.$$

It is especially important in the present multiple testing context to avoid underestimation of the uniquenesses $\psi_k^2$, because it would artificially inflate the test statistics and consequently increase the FWER. Let us focus on the estimators of the uniquenesses resulting from the above EM algorithm, viewed as residual mean square errors:
$$\hat\psi_k^2 = \frac{1}{n}\, \hat\varepsilon^{x\prime}_k P_z\, \hat\varepsilon^x_k,$$
where $\hat\varepsilon^x_k$ is the $k$th column of $\hat\varepsilon^x$ and $P_z = I_n - \hat{Z} \big( \sum_{i=1}^{n} S_i \big)^{-1} \hat{Z}'$. By analogy with other nonlinear smoothing methods, we propose to account for the parametric complexity of the factor analysis model by replacing the denominator $n$ in the above expression by the trace of $P_z$.
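A minimal R sketch of these E and M steps is given below. It assumes the residual matrix eps_x (n x m) has already been computed, starts from arbitrary values, and runs a fixed number of iterations without a convergence check; it follows the update formulas above but is not the authors' implementation.

```r
## Minimal EM iterations for Sigma = Psi + B B' on a residual matrix eps_x (n x m).
## Starting values and stopping rule are arbitrary; this is only a sketch.
fa_em <- function(eps_x, q, n_iter = 100) {
  n <- nrow(eps_x); m <- ncol(eps_x)
  S   <- crossprod(eps_x) / n               # sample covariance of the residuals
  B   <- matrix(0.1, m, q)                  # arbitrary starting loadings
  psi <- diag(S)                            # starting uniquenesses
  for (iter in seq_len(n_iter)) {
    ## E step: conditional moments of the factors given the data
    G     <- diag(q) + t(B) %*% (B / psi)           # G = I_q + B' Psi^{-1} B
    Z     <- (eps_x %*% (B / psi)) %*% solve(G)     # n x q matrix of E(Z_i | data)
    sum_S <- n * solve(G) + crossprod(Z)            # sum_i S_i
    ## M step: update loadings and uniquenesses
    B   <- t(eps_x) %*% Z %*% solve(sum_S)
    psi <- diag(S - B %*% (t(Z) %*% eps_x) / n)
  }
  list(B = B, psi = psi, Z = Z)
}

## Example on the factor-structured data simulated earlier (only centering is
## applied here, since that simulation has no fixed effects to remove)
fit <- fa_em(scale(Y, scale = FALSE), q = 5)
```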
Another crucial point to avoid overfitting of the factor structure is to properly estimate the number $q$ of factors. In their study of the impact of dependence on the variance of the number of false rejections, Friguet et al. (2009) proposed estimating $q$ by minimization of an ad hoc criterion, which can be viewed as the amount of variance inflation due to the correlation between the test statistics. If, for a given number $j$ of factors, $R_j$ stands for the correlation matrix between the factor-adjusted test statistics $T^{(k)}(Z)$, $k \in \mathcal{M}_0$, Friguet et al. (2009) showed that this variance can be expressed as $[m_0 + M_t(R_j)]\, t(1-t)$, where $M_t(R_j)$ is a sum of U-shaped functions of all pairwise correlations in $R_j$. In the special case $R_j = I_{m_0}$, $M_t(R_j) = 0$ and the variance of the number of false rejections is simply the variance of the binomial distribution $\mathcal{B}(m_0, t)$. The same procedure is suggested in the present situation. For each $j$, the inflation term $M_t(R_j)$ is estimated and the estimator $\hat{q}$ of the number of factors is chosen as the minimizer of $\hat{M}_t(R_j)$ over all possible $j$.

3.4. A Thresholding Procedure for the Factor-Analytic Procedure

The Šidák equation used to find a threshold $t$ ensuring the control of the FWER at a given level $\alpha$ is obtained under the assumption of independence between the $p$-values $P_k$ of the individual test statistics. That is the reason why more sophisticated thresholding methods have been derived, some of them inspired by the Šidák equation, to face situations of dependent data (see Dudoit et al., 2003, for a large review of these methods and a discussion of their conditions of use). In the next section, the factor-analytic multiple testing procedure is compared to two well-known existing methods which are less conservative than the direct Šidák correction for the multiplicity of the tests and yet control the FWER: the step-down Šidák procedure, which consists in rejecting the null hypotheses whose adjusted $p$-values $\tilde{P}_{r_j} = \max_{k=1,\ldots,j} \{1 - (1 - P_{r_k})^{m-k+1}\}$, where $r_k$ is the rank of the $p$-values, are lower than $\alpha$; and the step-down Holm procedure, which can be viewed as analogously based on Bonferroni's inequality, with the adjusted $p$-values $\tilde{P}_{r_j} = \max_{k=1,\ldots,j} \min\{(m-k+1) P_{r_k},\, 1\}$.

Since the factor-analytic procedure aims at reducing the correlation between the test statistics, we propose hereafter to apply a single-step Šidák correction for multiplicity to the $p$-values of the factor-adjusted tests. If $P_k^z$ denotes the $p$-value of the $k$th factor-adjusted test, then the null hypotheses are rejected if the adjusted $p$-values $\tilde{P}_j^z = 1 - (1 - P_j^z)^m$ do not exceed $\alpha$. For large $n$, $P_k^z$ is derived from a normal approximation of the null distribution of the corresponding factor-adjusted test statistic. However, for small $n$, we suggest approximating this null distribution by a Student distribution with $\mathrm{trace}(P_z)$ degrees of freedom.
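The corrections compared in the next section can be written in a few lines of R. The sketch below reuses the raw p-values `res` and the factor-adjusted p-values `res_adjusted` from the earlier sketches (any other p-value vectors would do), applies the step-down Šidák and Holm corrections with the multtest package, and applies the single-step Šidák correction to the factor-adjusted p-values as described above.

```r
library(multtest)   # provides mt.rawp2adjp, as used in the article's simulations

alpha <- 0.05

## Raw p-values of the ordinary Student tests and of the factor-adjusted tests
rawp  <- res[, 2]            # ordinary per-response p-values
rawpz <- res_adjusted[, 2]   # factor-adjusted p-values

## Step-down Sidak and Holm corrections on the ordinary p-values
adj <- mt.rawp2adjp(rawp, proc = c("SidakSD", "Holm"))
rejected_sidakSD <- adj$index[adj$adjp[, "SidakSD"] <= alpha]
rejected_holm    <- adj$index[adj$adjp[, "Holm"]    <= alpha]

## Single-step Sidak correction applied to the factor-adjusted p-values
m     <- length(rawpz)
adjpz <- 1 - (1 - rawpz)^m
rejected_factor <- which(adjpz <= alpha)
```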
4. Simulation Results and Illustration

First, some practical issues concerning the implementation of the above method are addressed and the properties of the factor-analytic multiple testing procedure are investigated on simulated datasets. Finally, a comparison with existing methods is proposed by application to a public gene expression dataset.

4.1. Implementation of the Factor-Analytic Method

As for many recent methods dealing with dependence in multiple testing (for example, Efron, 2007; Storey, 2007), the present procedure involves a primary
estimation of $\mathcal{M}_0$, mainly to estimate the parameters of the factor analysis model. It is suggested above to estimate $\mathcal{M}_0$ by the indices of the non-significant Student's tests, without correction for the multiplicity of the tests. In practice, re-estimating $\mathcal{M}_0$ by the indices of the non-significant factor-adjusted tests, and consequently updating the estimation of the factor parameters, increases the stability of the multiple testing procedure. This two-stage calculation of the factor-adjusted test statistics is implemented in the following simulation study.

Note also that, in the definition of the adjusted $p$-values $\tilde{P}_j^z$ introduced in the previous section, the number $m_0$ of true null hypotheses should be the right exponent of the Šidák correction for multiplicity, instead of $m$. Indeed, the estimation of $m_0$ is known to be a central issue in FDR-controlling methodologies. Ignoring this parameter, by assuming that $m - m_0$ is negligible with respect to $m$, leads in fact to a control of the FDR at a lower level than expected. This point has therefore received increased scrutiny recently (see Langaas et al., 2005, for a large review). Considering $m - m_0$ as negligible in the above expression of $\tilde{P}_j^z$ also implies an underestimation of $t$ and therefore a loss of power. By analogy with many procedures controlling the FDR, we propose to plug the estimator of $m_0$ proposed by Storey and Tibshirani (2003) into the above expression of $\tilde{P}_j^z$.

4.2. Simulation Study

The properties of the present method are studied by means of simulations involving ten scenarios of conditional correlation matrices $C_j = \Psi_j + B_j B_j'$, which differ by their proportion $\rho_j = \mathrm{tr}(B_j B_j') / \mathrm{tr}(C_j)$ of common variance. The number $q = 5$ of factors and the dimension $m = 500$ are the same for each correlation matrix. Here, the multiple testing procedure aims at finding out the variables for which the expectations are different between two groups with equal sample size $n = 30$. For $m_1 = 100$ variables, the difference is chosen so that the usual t-tests have a variable-by-variable power of 0.8. For each intra-group correlation matrix $C_j$, 1,000 datasets are simulated according to a normal distribution. For each dataset, two-sample Student's test statistics are calculated and an estimate $\hat{m}_0$ of $m_0$ is deduced, using Storey and Tibshirani (2003)'s estimator. The factor-adjusted test statistics are also calculated and, for a control of the FWER at level $\alpha = 0.05$, the threshold of the factor-analytic procedure is obtained by solving the Šidák equation with this estimate of $m_0$. For each dataset, a step-down Šidák procedure and a step-down Holm procedure controlling the FWER at level $\alpha = 0.05$ are also implemented to identify the variables with significantly different means in the two groups, using the usual t-tests as variable-by-variable statistics. Implementation is done using the R package multtest (see Pollard et al., 2003).

Figure 2 displays the theoretical FWER estimated using the simulated datasets, for the three multiple testing procedures. It shows that the step-down procedures based on the Student's tests control the FWER at a lower level than $\alpha = 0.05$. For the factor-analytic procedure, the FWER is close to $\alpha$, whatever the amount of dependence between the test statistics, but tends to slightly exceed this level. Figure 3 reproduces boxplots of the distributions of the non-discovery proportion $\mathrm{NDP}(t) = \#\{k \in \mathcal{M}_1 : H_0^{(k)} \text{ is not rejected}\}/m_1$.
It shows that the fraction of common variance generates a high instability in the distribution of $\mathrm{NDP}(t)$ for the step-down procedures based on Student's tests, whereas the variability of $\mathrm{NDP}(t)$ remains constant along the proportion of common variance for the factor-analytic procedure.
Figure 2. Theoretical FWER estimated from the simulated datasets along the proportion $\rho_j$ of common variance, for the procedures based on factor-adjusted tests with single-step Šidák correction, and on Student's tests with step-down Šidák and Holm corrections.
Moreover, the mean $\mathrm{NDP}(t)$, which can be viewed as a Type-II error rate, decreases markedly with the factor-analytic procedure with respect to the procedures based on the Student's tests.

4.3. Application to a Gene Expression Dataset

The factor analysis modeling of the conditional variance is hereafter applied to well-known real microarray data which were primarily analyzed by Golub et al. (1999) in order to identify genes that are differentially expressed in patients with two types of leukemia (ALL, AML). Gene expression levels were measured using Affymetrix high-density chips containing 6,817 human genes. The sample is made of 27 ALL cases and 11 AML cases, and pre-processed data for 3,051 genes are available in the R package multtest. Figure 4 shows graphical displays of the fit by a factor analysis model of the intra-group correlation matrix $R$ of the gene expressions. The left panel reproduces a histogram of the correlations $r_{ij}$ together with density curves for the fitted values $\hat{r}_{ij}$ obtained with various choices of the number of factors (from 1 to 8 factors). The right panel gives the values of the variance inflation criterion $\hat{M}_t(\hat{R})$ used to determine the number of factors, where the threshold $t$ is obtained by solving the Šidák equation with $\alpha = 0.05$. In the following, the factor-analytic procedure is implemented with four factors. Application of the method suggested by Storey and Tibshirani (2003) to estimate $m_0$, implemented in the R package fdrtool (see Strimmer, 2008), gives $\hat{m}_0 = 1521$. Figure 5 plots the number of positive genes with the present method
Figure 3. Distributions of the non-discovery proportion for the factor-analytic procedure (top panel) and for the step-down Šidák and Holm procedures based on the Student's tests (bottom panels) along the proportion $\rho_j$ of common variance.
Figure 4. Factor analysis modeling of the intra-group correlation for the two-sample comparison in Golub’s data.
Figure 5. Numbers of positive genes by the factor-analytic method, step-down Šidák, and step-down Holm, over a range of Type-I levels $\alpha$.
and with the step-down Šidák and Holm methods. The factor-analytic procedure reveals more significant genes than the procedures based on Student's tests.
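For readers who want to reproduce a comparison of this kind, the following R sketch uses the pre-processed Golub data shipped with multtest, computes the raw two-sample t-test p-values, forms a simple Storey–Tibshirani-type estimate of $m_0$ (with an arbitrary $\lambda = 0.5$ rather than the smoothed estimator), and counts the genes declared positive by the step-down Šidák and Holm corrections. It is not the authors' exact pipeline and does not include the factor-adjusted statistics.

```r
library(multtest)
data(golub)            # golub: 3051 x 38 expression matrix, golub.cl: 0/1 class labels

alpha <- 0.05

## Raw p-values of gene-by-gene two-sample t-tests (equal variances assumed)
tstat <- mt.teststat(golub, golub.cl, test = "t.equalvar")
rawp  <- 2 * pt(-abs(tstat), df = ncol(golub) - 2)

## Storey-Tibshirani-type estimate of m0 with a fixed lambda (0.5 chosen arbitrarily)
lambda <- 0.5
m      <- length(rawp)
m0_hat <- min(m, sum(rawp > lambda) / (1 - lambda))

## Step-down Sidak and Holm corrections, and the numbers of positive genes
adj <- mt.rawp2adjp(rawp, proc = c("SidakSD", "Holm"))
colSums(adj$adjp[, c("SidakSD", "Holm")] <= alpha)
```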
5. Discussion

The thresholding methods used in most multiple testing procedures are based on majorizations of the FWER (the Šidák inequality was mentioned frequently in this article, but we could have started from Bonferroni's or Simes' inequality with the same kind of results), with equality under the assumption of independence between the test statistics. Most efforts to correct those procedures for better performance in the presence of correlated data have consisted in sophistications of the thresholding procedure, leaving the individual testing strategy unchanged. The Optimal Discovery Procedure, designed by Storey (2007) to control the FDR, is probably one of the first attempts to modify the individual test statistics to account for dependence. The factor-analytic method proposed above also suggests a transformation of each test statistic which reduces their mutual dependence by taking advantage of the variance shared within the factor structure.

Estimation of $m_0$ seems to be a promising direction for improvements of the single-step Šidák thresholding method. In the present article, this point is simply addressed by using a widely used estimator. However, it can be imagined that dependence also has an impact on the estimation of $m_0$, which would have to be dealt with in a global factor-analytic procedure. Finally, it is important to note that the usability
of the factor-analytic method is restricted to linear testing methods, since it is based on the inheritance of the dependence between the test statistics from the dependence between the response variables themselves. For quadratic test statistics such as in the Fisher Analysis of Variance tests or for rank-based testing strategies, the impact of a factor-analysis model assumption for the covariance between the responses on the dependence between the test statistics is still unclear.
Acknowledgments
The authors are grateful to the Associate Editor and the referee for their constructive comments which helped to improve the manuscript.
References

Dudoit, S., Shaffer, J. P., Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18(1):71–103.
Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102:93–103.
Friguet, C., Kloareg, M., Causeur, D. (2009). A factor model approach to multiple testing under dependence. To appear in J. Amer. Statist. Assoc.
Genz, A., Bretz, F. (2002). Methods for the computation of multivariate t-probabilities. J. Computat. Graph. Statist. 11:950–971.
Genz, A., Bretz, F., Hothorn, T., with contributions by Miwa, T., Mi, X., Leisch, F., Scheipl, F. (2008). mvtnorm: Multivariate Normal and t Distributions. R package version 0.9-2.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537.
Gordon, A., Glazko, G., Qiu, X., Yakovlev, A. (2007). Control of the mean number of false discoveries, Bonferroni and stability of multiple testing. Ann. Appl. Statist. 1(1):179–190.
Hsu, J. C. (1992). The factor analytic approach to simultaneous inference in the general linear model. J. Computat. Graph. Statist. 1:151–168.
Hsu, J. C., Chang, J. Y., Wang, T. (2006). Simultaneous confidence intervals for differential gene expressions. J. Statist. Plann. Infer. 136(7):2182–2196.
Kim, K. I., van de Wiel, M. A. (2008). Effects of dependence in high-dimensional multiple testing problems. BMC Bioinform. 9:114.
Klebanov, L., Yakovlev, A. (2007). Diverse correlation structures in gene expression data and their utility in improving statistical inference. Ann. Appl. Statist. 1(2):538–559.
Kustra, R., Shioda, R., Zhu, M. (2006). A factor analysis model for functional genomics. BMC Bioinform. 7:216.
Langaas, M., Lindqvist, B. H., Ferkingstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. Roy. Statist. Soc. Ser. B 67:555–572.
Mardia, K. V., Kent, J. T., Bibby, J. M. (1979). Multivariate Analysis. Probability and Mathematical Statistics. London: Academic Press.
Pollard, K. S., Ge, Y., Taylor, S., Dudoit, S. (2003). multtest: Resampling-based multiple hypothesis testing. R package version 1.21.1.
Pournara, I., Wernish, L. (2007). Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinform. 8:61.
Rubin, D. B., Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika 47(1):69–76.
Shaffer, J. P. (1995). Multiple hypothesis testing. Ann. Rev. Psychol. 46:561–584.
Storey, J. D. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. J. Roy. Statist. Soc. Ser. B 69:347–368.
Storey, J. D., Tibshirani, R. (2003). Statistical significance for genome-wide studies. Proc. Nat. Acad. Sci. USA 100:9440–9445.
Strimmer, K. (2008). fdrtool: Estimation and Control of (Local) False Discovery Rates. R package version 1.2.5. http://strimmerlab.org/software/fdrtool/.