Predicting False Discovery Proportion Under Dependence

Subhashis Ghosal∗
Department of Statistics
North Carolina State University
Raleigh, NC 27695
E-mail: [email protected]

Anindya Roy†
Department of Mathematics and Statistics
University of Maryland, Baltimore County
Baltimore, MD 21250
E-mail: [email protected]

∗ Research supported by NSF grant number DMS-0803540
† Research supported by NSF grant number DMS-0803531


Abstract We present a flexible framework for predicting error measures in multiple testing situations under dependence. Our approach is based on modeling the distribution of probit transforms of p-values by mixtures of multivariate skew-normal distributions. The model can incorporate dependence among p-values and also allows for shape restrictions on the p-value density. A nonparametric Bayesian scheme for estimating the components of the mixture model is outlined and Markov chain Monte Carlo algorithms are developed. These lead to prediction of the false discovery proportion and related credible bands. An expression for the positive false discovery rate for dependent observations is also derived. The power of the mixture model in estimating key quantities in multiple testing is illustrated by a simulation study. A kidney transplant data set is also analyzed using the methods developed.

KEY WORDS: Dirichlet process mixture; false discovery rate; p-value distribution; shape restriction; skew-normal distribution.


1 Introduction

Multiple hypothesis testing methodologies are among the most important statistical tools for modern biomedical and bioinformatic applications such as DNA microarray analysis, proteomics and functional magnetic resonance imaging (fMRI), where thousands of hypotheses need to be tested simultaneously. A widely used error measure, the false discovery rate (FDR), introduced by Benjamini and Hochberg (1995), is defined as E(V / max(R, 1)), where R stands for the total number of rejections and V for the number of rejections of null hypotheses that are actually true. They devised an FDR controlling procedure which essentially rejects all null hypotheses with p-values smaller than a certain bound. Storey (2002, 2003) argued that often a related measure, known as the positive false discovery rate (pFDR), defined by E(V /R | R > 0) = FDR/P(V > 0), is more appealing. Storey (2002, 2003) adopted a product of independent Bernoulli(π0) super-population model for the occurrence of null hypotheses and gave a Bayesian interpretation of pFDR. In particular, if the resulting test statistics are independent and the nominal level for all tests is γ, then Storey (2002) showed that the pFDR is given by the expression π0γ/F(γ), where F stands for the marginal cumulative distribution function (CDF) of the p-values. His approach proceeds by estimating the pFDR function at every γ and finding a cut-off γ such that the estimated pFDR attains a prescribed value α. In Storey's approach, F is easily estimated by the empirical distribution of the p-values or some refinement of it, and hence the method essentially boils down to estimating π0. To this end, it is observed that the p-values under the null hypotheses are generally distributed uniformly over the unit interval, at least approximately, while p-values under the alternatives tend to be concentrated near zero, and hence they can be classified accordingly.
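As an illustration of Storey's identity pFDR(γ) = π0γ/F(γ) in the independent case, a minimal plug-in sketch in Python (not the paper's method; the π0 estimator and the tuning constant `lam` follow Storey's standard proposal and are assumptions here):

```python
import numpy as np

def storey_pfdr(pvals, gamma, lam=0.5):
    """Plug-in estimate of pFDR(gamma) = pi0 * gamma / F(gamma).

    pi0 is estimated by Storey's estimator #{p > lam} / ((1 - lam) m),
    and F(gamma) by the empirical CDF of the p-values.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    pi0 = min(1.0, np.mean(pvals > lam) / (1.0 - lam))
    F_gamma = max(np.mean(pvals <= gamma), 1.0 / m)  # empirical CDF, kept positive
    return pi0 * gamma / F_gamma

# Example: mostly uniform p-values with some mass concentrated near zero
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.2, 5.0, size=100)])
est = storey_pfdr(pvals, gamma=0.05)
```

For purely uniform p-values the estimate is close to 1, as it should be, since then every rejection at level γ is expected to be a false discovery.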
Sarkar (2002, 2004, 2006, 2007), Genovese and Wasserman (2002, 2004) and Efron (2006, 2007) gave various modifications of the Benjamini-Hochberg FDR controlling procedure.

An objective Bayesian method for normal experiments was developed by Scott and Berger (2005). A wide range of research has focused on the FDR and its control in various applications, such as Genovese et al. (2002) in neuroimaging; Golub et al. (1999) in cancer research; Efron et al. (2001) in DNA microarrays; Miller et al. (2001) and Hopkins et al. (2002) in astrophysics. However, in most present-day applications like fMRI, proteomics (2D-gel electrophoresis, mass-spectroscopy) and DNA microarray analysis, data show clear evidence of dependence. In fMRI examples, tests regarding the activation of different voxels are spatially correlated. In diffusion tensor imaging problems, the diffusion directions are correlated, generating dependent observations along tracts. Hence, a grid-by-grid comparison of such images across patient groups will generate several p-values that are highly dependent. In such data structures, error control procedures based on the working hypothesis of independence may be invalid. Benjamini and Yekutieli (2001), Storey and Tibshirani (2003), Finner et al. (2007), Farcomeni (2007) and Sarkar (2007) have shown that some controlling procedures, or appropriate modifications of them, are still able to control the targeted measure of error under certain dependence scenarios. Clarke and Hall (2009) showed that the error control procedures exhibit some robustness to the dependence assumption provided the number of hypotheses is large and the p-value distribution has light tails. Nevertheless, FDR controlling procedures designed for independent data tend to be too conservative, especially under positive dependence, which is most commonly observed in practice. Storey's super-population formulation for the truth of null hypotheses is best interpreted when one adopts a Bayesian approach by considering the p-values as the basic data.
While the p-value distribution under null hypotheses can be taken to be uniform over the unit interval, it is reasonable to model the p-value distribution


under the alternative by nonparametric mixtures. Nonparametric mixtures of suitable kernels have a high degree of flexibility to fit a variety of shapes. An additional benefit of the mixture model approach is that random effects in the alternative hypotheses are automatically incorporated [Sarkar and Zhou (2008), Wu (2008)]. The mixing distribution can then be given a nonparametric prior like the Dirichlet process, and estimation of the pFDR can proceed using the standard techniques for Dirichlet process mixture models (DPMM). Tang et al. (2007) used mixtures of Beta(a, b) densities, a < 1, b ≥ 1, to model p-value densities and showed that such mixtures can effectively exploit many additional features of the p-value distribution under the alternative, such as a decreasing density and high concentration near zero. Ghosal et al. (2008) showed that the resulting procedure is consistent under mild conditions. The approach of modeling the univariate p-value density faces two main obstacles for dependent data. First, it is difficult to find a flexible multivariate generalization of distributions on the unit interval to the unit cube which can incorporate many different forms of dependence. A simple fix for this problem is to transform the p-values to unrestricted real numbers using a transformation from the unit interval to the real line and then model the transformed p-values by mixtures of multivariate normal distributions. The probit function, that is, the inverse of the standard normal CDF, is a good choice of transformation, since in this case the uniform null distribution is converted into the standard normal distribution. Efron (2007) suggested that in the univariate case (that is, when p-values are independent), probit-transformed p-values (called probit p-values for short) may be modeled by mixtures of normal distributions.
However, as we shall observe later, normal mixtures are unable to produce a decreasing shape for the density of the original p-values, a feature that is often desirable. This difficulty can be resolved by considering the more general skew-normal kernel introduced by Azzalini (1985) and its multivariate analog. The second obstacle is more subtle and arguably more serious. Storey's formula pFDR = π0γ/F(γ) is not valid for dependent data, and hence any approach which relies on estimating π0γ/F(γ) is inapplicable. We shall derive a general formula for the FDR (and hence for the pFDR) valid under any dependence scenario, assuming Storey's super-population model for null hypotheses. Unfortunately, the general expression is too complicated for estimation purposes. As an alternative, we consider the approach of predicting the false discovery proportion (FDP), given by the ratio V / max(R, 1) [Genovese and Wasserman (2004)]. The FDP seems to be a more appealing quantity to Bayesians because it is interpretable conditionally on a given scenario. Moreover, the Bayesian approach has the attractive feature of attaching a posterior probability to any given null hypothesis. Such a measure is extremely relevant in applications, yet is very difficult to obtain using the classical approach. Pawitan et al. (2006) showed that FDP prediction can be sensitive to dependence, and hence incorporation of dependence in the model is essential. The main objective of this article is to describe a black-box model for dependent probit p-values and to predict the FDP function using a DPMM of multivariate skew-normal distributions. We shall describe Markov chain Monte Carlo (MCMC) algorithms for the evaluation of the posterior distribution. In principle, the working correlation structure among all probit p-values corresponding to both null and alternative hypotheses can be taken to be arbitrary. However, for the algorithm to be practically feasible, it is important that we are able to invert the correlation matrix analytically in each MCMC iteration, since numerical inversion of large matrices would take a prohibitively long time. We advocate the use of a simple one-parameter family of correlation matrices such as the autoregressive of order 1 (AR(1)) or intra-class correlation.
Despite the relative simplicity, the presence of an undetermined parameter in the correlation structure, along with the high flexibility of mixtures, makes the resulting model sufficiently flexible. In the limited simulation study we performed, the DPMM of multivariate skew-normal with AR(1) and intra-class correlation appears to predict the FDP very accurately. We also apply our methodology to data on kidney functionality. The organization of the paper is as follows. In Section 2, we describe a mixture model framework for dependent p-values. Posterior computation techniques are described in Section 3. A simulation study and an analysis of real data are presented in Section 4. Proofs are presented in the Appendix. We shall use the notation 1l to denote the indicator function. The notation δ(·; θ0) will stand for the probability measure degenerate at θ0.

2 Modeling dependent p-values

2.1 Issues

We consider m simultaneous hypothesis testing problems H0i versus H1i, i = 1, . . . , m. Let Hi stand for the indicator that H0i is false. Following Storey's (2002) super-population approach, we assume that the Hi are independent and identically distributed (i.i.d.) with P(Hi = 0) = 1 − P(Hi = 1) = π0, i = 1, . . . , m. We consider the corresponding p-values X1, . . . , Xm as the basic data and try to estimate the pFDR or predict the FDP based on a model for X1, . . . , Xm. If Hi = 0, it is reasonable to assume that the marginal density of Xi is uniform over the unit interval. This holds, for instance, if H0i is simple, or is reduced to a simple hypothesis by similarity, or holds approximately for a general composite hypothesis if the partial predictive p-value is considered; see Bayarri and Berger (2000) and Robins et al. (2000). Efron (2004), however, cautioned that in some situations the overall null distribution of all p-values may deviate from the uniform, in which case the empirical null distribution should be considered. If the density of p-values under the alternative is denoted by f1, then under Storey's super-population setting, the overall distribution of p-values is given by the mixture model f(x) = π0 + (1 − π0)f1(x). When the data (and hence the p-values) are independent, Tang et al. (2007) proceeded by modeling f1 as a mixture of Beta(a, b) densities, a < 1, b ≥ 1, that is, f1(x) = ∫_{(0,1)×[1,∞)} (Γ(a + b)/(Γ(a)Γ(b))) x^{a−1}(1 − x)^{b−1} dG(a, b), and then putting a Dirichlet process prior on G and a beta prior on π0. Since the alternatives are often not fixed, the p-values generated from the alternatives arise possibly from different densities. Thus the marginal p-value density under the alternative is best described by a mixture. The main reason for choosing the Beta(a, b) kernel is that the alternative density then has a decreasing shape, which is a priori a desirable feature for the p-value density under the alternative. Tang et al. (2007) developed an MCMC algorithm based on the “no gaps” algorithm of MacEachern and Müller (1998). Their approach generates posterior samples for the latent indicators H1, . . . , Hm as well, and hence can estimate the pFDR, predict the FDP and give credible intervals for them. As p-values are generally dependent, methods based on marginal p-value models have limited applicability. However, suitable expressions for the FDR (pFDR) based on joint mixture modeling of p-values are not currently available. The following result derives the expression for the FDR (and hence the pFDR) under dependence. Let Ii = Ii(γ) = 1l{Xi < γ} denote the indicator that H0i is rejected at nominal level γ, i = 1, . . . , m, and I = (I1, . . . , Im). Observe that the FDP at γ is given by

FDP(γ) = Σ_{i=1}^m (1 − Hi)Ii / {Σ_{i=1}^m Ii + Π_{i=1}^m (1 − Ii)}.   (1)
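Given the truth indicators and the rejection indicators, formula (1) can be evaluated directly; a minimal sketch (illustrative only, not the paper's code):

```python
import numpy as np

def fdp(pvals, H, gamma):
    """False discovery proportion V / max(R, 1) at nominal level gamma.

    H[i] = 1 when the i-th null hypothesis is false; rejections are
    I_i = 1{p_i < gamma}, exactly as in formula (1).
    """
    I = np.asarray(pvals) < gamma            # rejection indicators
    V = np.sum(I * (1 - np.asarray(H)))      # rejections of true nulls
    R = np.sum(I)                            # total rejections
    return V / max(R, 1)

pvals = np.array([0.001, 0.02, 0.20, 0.04, 0.60])
H     = np.array([1,     0,    0,    1,    0])    # truth indicators
val = fdp(pvals, H, gamma=0.05)   # rejects tests 1, 2, 4; one is a true null
```

Note that max(R, 1) makes the ratio zero, rather than undefined, when there are no rejections.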

By definition, FDR(γ) = E(FDP(γ)) and pFDR(γ) = FDR(γ)/P(V > 0). Let B_i^m denote the set of all m-dimensional binary vectors with a one at the ith position.


Theorem 1. For arbitrarily dependent observations,

FDR(γ) = π0 Σ_{i=1}^m bi(γ),   (2)

where

bi(γ) = Σ_{a∈B_i^m} P(I = a | Hi = 0) / Σ_{i=1}^m ai.

If the observations are exchangeable, then

FDR(γ) = π0 m b1(γ).   (3)

When the observations are independent and identically distributed, (3) reduces to the familiar expression FDR(γ) = (π0 γ/F(γ)) P(V > 0).

The proof of the theorem is given in the appendix. Even for the exchangeable case, numerical evaluation of FDR(γ) involves quadrant probabilities of a multivariate density and can be challenging. Thus in the multivariate setting, a much more prudent strategy is to concentrate on predicting the FDP process. While modeling the joint behavior of p-values, it is easier to work with a transformation that removes the restriction on the range of the values. Using the probit transformation Yi = Φ^{−1}(Xi), the mixture model for the marginal density of probit p-values Y1, . . . , Ym will then be given by h(y) = π0 φ(y) + (1 − π0)h1(y), where φ(·) denotes the standard normal density. Then h1 can be conveniently modeled by nonparametric mixtures of some suitable parametric kernel. The versatility of mixtures makes it easy to incorporate features such as bumps and shoulders that appear in some testing problems. As an illustration, consider testing H0 : θ = 0 against H1 : θ > 0 based on an observation U following a Cauchy density with location parameter θ. Suppose a test rejects H0 for large values of U. Then the p-value is X = 1/2 − π^{−1} arctan(U), so that U = cot(πX), and hence the p-value density under an alternative θ > 0 is given by {1 − θ sin(2πx) + θ² sin²(πx)}^{−1}
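For the Cauchy illustration, the p-value density above and its probit transform can be evaluated numerically; a sketch under the formulas in the text (the function names are ours, and the change of variables f1(Φ(y))φ(y) is the standard probit transformation of a density):

```python
import numpy as np
from scipy.stats import norm

def pval_density_cauchy(x, theta):
    """Density of the p-value under a Cauchy(theta) alternative, 0 < x < 1."""
    return 1.0 / (1.0 - theta * np.sin(2 * np.pi * x)
                  + theta**2 * np.sin(np.pi * x)**2)

def probit_pval_density_cauchy(y, theta):
    """Density of Y = Phi^{-1}(X): change of variables gives f1(Phi(y)) * phi(y)."""
    return pval_density_cauchy(norm.cdf(y), theta) * norm.pdf(y)

# sanity check: the p-value density integrates to one (midpoint rule)
x = (np.arange(20000) + 0.5) / 20000
total = pval_density_cauchy(x, theta=2.0).mean()
```

Plotting `probit_pval_density_cauchy` over a grid reproduces the bumpy shape shown in Figure 1(a).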


for 0 < x < 1. The density of the probit transform Y = Φ^{−1}(X) is plotted in Figure 1(a), which shows the existence of bumps that typically cannot be captured by a single parametric density, but can be easily approximated by a mixture. In order to generalize the mixture model to joint densities of probit p-values, we proceed by first conditioning on the latent indicators H1, . . . , Hm. Consider a multivariate family of densities p(y; θ1, . . . , θm, ψ) which includes the family of multivariate normals with means zero, variances one and some correlation structure driven by the additional parameter ψ. Let θ0 stand for a null value of θ which gives rise to the N(0, 1) marginal distribution for the corresponding component. Model the joint density of Y = (Y1, . . . , Ym) given H = (H1, . . . , Hm) by the following hierarchical scheme:

Y | H ∼ p(y; θ1, . . . , θm, ψ),   θi ∼ δ(·; θ0) if Hi = 0,   θi ∼ G if Hi = 1.   (4)

It appears that a multivariate normal family with general mean, variances and a given correlation structure would be a very natural candidate for the family of kernels p(y; θ1, . . . , θm, ψ). In spite of its apparent appropriateness, it turns out that if normal mixtures are used, the p-value density on the original scale will always have bumps, and thus can never be everywhere decreasing (see Theorem 2 below). As the decreasing shape restriction is a common feature of p-value distributions under natural conditions like the monotone likelihood ratio property (cf. Propositions 1 and 2 of Ghosal et al. (2008)), we consider a broader family of skew-normal mixtures in our modeling. The skew-normal distribution was introduced by Azzalini (1985) and generalized to the multivariate situation by Azzalini and Dalla Valle (1996) as a flexible yet very tractable generalization of the normal family that incorporates skewness; it has found a wide range of applications; see Genton (2004) for details. The primary reasons for choosing the skew-normal family are the following:


(i) The skew-normal is a broader family than the normal, and hence in particular includes N(0, 1), which is to be used as the null distribution for probit p-values.

(ii) The joint distribution can model a broad class of correlation structures.

(iii) Individual components of the mixture exhibit appropriate skewness. Even though unrestricted mixtures of symmetric densities with a large number of components can capture skewness, only a few components may be needed if they are already skewed.

(iv) Finally, suitably restricted skew-normal mixtures can produce a decreasing shape of the original p-value density, as shown below.

To illustrate the ideas behind modeling by skew-normal mixtures, we begin with the univariate case and generalize to the multivariate setting afterwards. Let q(y; µ, ω, λ) denote the univariate skew-normal density with location parameter µ, scale parameter ω and shape parameter λ, given by

q(y; µ, ω, λ) = 2φ(y; µ, ω)Φ(−λω^{−1}(y − µ)),   (5)

where φ(y; µ, ω) denotes the N(µ, ω²) density and Φ(·) denotes the standard normal CDF. Then the density of probit p-values under the alternative is modeled as

h1(y) = ∫ q(y; µ, ω, λ) dG(µ, ω, λ),   (6)

where G is the mixing measure on the space of (µ, ω, λ).

Theorem 2. Suppose that Y has density q(y; µ, ω, λ) and X = Φ(Y). Then the density f1(x) of X is decreasing in 0 ≤ x ≤ 1 if and only if

µ ≤ λω^{−1} ϕ(λ^{−2}(ω² − 1)),   ω ≥ 1,   λ ≥ √(ω² − 1),   (7)

where ϕ(β) = inf{HΦ(x) − βx : x ∈ R}, and HΦ(x) = φ(x)/(1 − Φ(x)) is the hazard function of the standard normal distribution.

The proof of the theorem is given in the appendix. Observe that, as the class of decreasing densities is convex, the skew-normal mixtures produce decreasing densities as long as the mixing distribution is supported on the region defined by (7). The plot of ϕ is shown in Figure 1(b).
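The density (5) and the shape condition (7) can be checked numerically; in the sketch below (our own helpers, not the paper's code) `varphi` approximates ϕ(β) by minimizing HΦ(x) − βx over a bounded grid, which is an adequate numerical approximation since the infimum is attained at a moderate x for 0 ≤ β < 1:

```python
import numpy as np
from scipy.stats import norm

def skew_normal_pdf(y, mu, omega, lam):
    """Univariate skew-normal density (5): 2 phi(y; mu, omega) Phi(-lam (y-mu)/omega)."""
    z = (y - mu) / omega
    return 2.0 / omega * norm.pdf(z) * norm.cdf(-lam * z)

def varphi(beta, grid=np.linspace(-10, 10, 4001)):
    """phi(beta) = inf_x { H_Phi(x) - beta x }, H_Phi the standard normal hazard."""
    H = norm.pdf(grid) / norm.sf(grid)
    return np.min(H - beta * grid)

def decreasing_ok(mu, omega, lam):
    """Check the shape-restriction condition (7) for a decreasing p-value density."""
    if omega < 1:
        return False
    if lam < np.sqrt(omega**2 - 1):
        return False
    beta = (omega**2 - 1) / lam**2
    return mu <= lam / omega * varphi(beta)
```

Plotting `varphi` over β ∈ [0, 1] reproduces the shape of ϕ shown in Figure 1(b).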

Figure 1: (a) Density of the probit p-value for the Cauchy model; (b) plot of ϕ(β).

2.2 Multivariate skew-normal mixture model for probit p-values and prior specification

The main advantage of the skew-normal mixture model (6) is that it can be generalized to incorporate dependence while maintaining the salient features of the marginal p-value density, leading to a model for the joint distribution of the p-values. The generalization to the dependent case is achieved by replacing the univariate skew-normal kernel q(y; µ, ω, λ) by the multivariate skew-normal density [Azzalini and Dalla Valle (1996)] given by

qm(y; µ, ω, λ, R) = 2φm(y; µ, Ω)Φ(−α′D^{−1}(y − µ)),   (8)

where µ ∈ R^m, ω, λ ∈ (0, ∞)^m, R is an m × m positive definite correlation matrix, D = diag(ω), Ω = R + λλ′, α = Ω^{−1}D^{−1}λ/√(1 + λ′R^{−1}λ), and φm(·; µ, Ω) stands for the density of Nm(µ, Ω). If R can be easily inverted, Ω may then be inverted using the formula Ω^{−1} = R^{−1} − R^{−1}λλ′R^{−1}/(1 + λ′R^{−1}λ). Properties of the multivariate skew-normal density were neatly reviewed by Dalla Valle (2004). It may be noted that there are other multivariate generalizations of the skew-normal family, such as that of Sahu et al. (2004). An advantage of the latter is that, unlike in (8), the correlation and skewness parameters are treated completely separately, and hence their effects do not mix. Nevertheless, structural properties of (8) make it easier to work with in the present context. Further, let δj = −λj/√(1 + λj²), j = 1, . . . , m, and ∆ = diag(√(1 − δ1²), . . . , √(1 − δm²)) = diag((1 + λ1²)^{−1/2}, . . . , (1 + λm²)^{−1/2}). Many interesting properties of the multivariate skew-normal density, as well as the posterior computational methods in the next section, are driven by the following representation (see Dalla Valle (2004)):

Y = µ + |Z0|Dδ + D∆Z,   (9)

where Z = (Z1, . . . , Zm)′ and (Z0, Z′)′ ∼ Nm+1(0, diag(1, R)), that is, Z0 ∼ N(0, 1) independently of Z ∼ Nm(0, R). Now, from (4) and (8), a multivariate skew-normal mixture model for Y = (Y1, . . . , Ym)′ is given by Y | (µ, ω, λ, R, H) ∼ qm(Y; µ, ω, λ, R), where, given H, (µi, ωi, λi) is (0, 1, 0) if Hi = 0 and is a draw from the mixing distribution G if Hi = 1. The multivariate skew-normal mixture model is not identifiable, that is, the mixing distribution G may not be uniquely recovered from the mixture density, as is common in most mixture models with a scale parameter. Nevertheless, under appropriate restrictions on the support of G, such as the shape restriction defined by (7), Ghosal and Roy (2010) showed that the point mass π0 of the mixing distribution at the point (0, 1, 0) is identifiable.
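The representation (9) yields a direct sampler for the multivariate skew-normal; a sketch (our own helper, assuming the parameterization above; with λ = 0 the draw reduces to Nm(0, R), which the example uses as a sanity case):

```python
import numpy as np

def rmsn(n, mu, omega, lam, R, rng=None):
    """Draw n samples from the multivariate skew-normal via representation (9):
    Y = mu + |Z0| D delta + D Delta Z, with Z0 ~ N(0,1) independent of Z ~ N_m(0,R),
    delta_j = -lam_j / sqrt(1 + lam_j^2), Delta = diag(1 / sqrt(1 + lam_j^2))."""
    rng = np.random.default_rng(rng)
    mu, omega, lam = map(np.asarray, (mu, omega, lam))
    delta = -lam / np.sqrt(1 + lam**2)
    Delta = 1.0 / np.sqrt(1 + lam**2)
    L = np.linalg.cholesky(R)
    Z0 = np.abs(rng.standard_normal((n, 1)))       # half-normal factor
    Z = rng.standard_normal((n, len(mu))) @ L.T    # Z ~ N_m(0, R)
    return mu + Z0 * (omega * delta) + (omega * Delta) * Z

# AR(1) correlation with rho = 0.5 in dimension 3; lam = 0 gives plain N_3(0, R)
rho, m = 0.5, 3
R = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
Y = rmsn(10000, mu=np.zeros(m), omega=np.ones(m), lam=np.zeros(m), R=R, rng=1)
```

With λj > 0 the term |Z0|Dδ pulls each coordinate toward negative values, matching the negative skew implied by the factor Φ(−λω^{−1}(y − µ)) in (5).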
Typically the entries of the correlation matrix R (which necessarily has all diagonal entries 1) will be modeled as functions of a low-dimensional parameter ψ. For instance, if R is the correlation matrix of the AR(1) process with autocorrelation ρ, then all elements of R are functions of ρ. Similarly, R can be the correlation matrix of the intra-class correlated process with all off-diagonal entries equal to ρ, 0 ≤ ρ < 1. Now we need to specify a prior distribution for the parameters π0, ψ and the infinite-dimensional parameter G. Let all of them be a priori independently distributed, with

π0 ∼ Beta(a0, a1),   ψ ∼ pψ(ψ),   G ∼ DP(M, G0),   (10)

where pψ is a density on the parameter space of ψ (like (−1, 1) for AR(1) and [0, 1) for the intra-class correlated process) and DP(M, G0) is the Dirichlet process with precision parameter M > 0 and center measure G0. The center measure G0 is chosen to be supported on the space defined by the inequalities (7) to make the marginal p-value density maintain the decreasing shape restriction. In most applications, the correlation among the probit p-values is known only in a qualitative manner. For example, it may be known that the test statistics have a nearest-neighbor dependence, which translates to a similar relation among p-values. The marginal correlation will be a function of the correlation matrix R and other parameters. For a given value of (µ, ω, λ), the correlation structure of Y is a rank-one modification of R [cf. Azzalini and Dalla Valle (1996)] and hence tends to be very similar to R. While the correlation structure in the model may be simplistic, allowing an undetermined parameter and taking mixtures together can pick up a relatively complex structure. It is essential to keep R simple, since inversion will be necessary in each MCMC iteration of the posterior computation. If the expected value of the rank-one modification is small compared to R, then even after mixing, the correlation structure will tend to be similar to that of R. If the p-values are exchangeable, so are the probit p-values, as when R is the correlation matrix of the intra-class process; thus the intra-class structure is preserved under probit transforms, although the value of the intra-class correlation is likely to change. A similar property holds under stationarity, as the following result shows.

Proposition 1. Let the elements of R be of the form r(s, t) = r(|s − t|) for some stationary autocorrelation function r(·). Then the p-values and probit p-values follow strictly stationary stochastic processes. Moreover, the p-value process is also covariance stationary.

The proof of Proposition 1 is given in the appendix. Observe that, if the observations arise from an AR(1) process, then the probit p-value distribution is stationary, and hence an AR(1) structure may not be an entirely poor fit, even though it is not exactly correct for the probit p-values. Indeed, in the simulation section, we shall observe that prediction of the FDP is fairly accurate assuming the AR(1) correlation structure when the data are generated from an AR(1) process. Finally, we concretely specify the center measure G0 of the Dirichlet process in the form pω(ω)pλ(λ|ω)pµ(µ|ω, λ), where we take pω(ω) to be Gamma(αω, βω) shifted to the right by 1, pλ(λ|ω) to be Gamma(αλ, βλ) shifted to the right by √(ω² − 1), and pµ(µ|ω, λ) to be the negative of a Gamma(αµ, βµ) variable shifted to the right by λω^{−1}ϕ(λ^{−2}(ω² − 1)).

3 Posterior computation

In the DPMM, the random measure G can be integrated out from the prior distribution, leading to the joint prior distribution of the θi = (µi, ωi, λi) given by the generalized Polya urn scheme

θ1 ∼ G0,   θi+1 | (θ1, . . . , θi) ∼ (M/(M + i)) G0 + (1/(M + i)) Σ_{j=1}^i δ(·; θj),   i ≥ 1.   (11)
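The urn scheme (11) is straightforward to simulate, which also illustrates the clustering it induces; a sketch (illustrative only; `base_draw` stands in for sampling from G0):

```python
import numpy as np

def polya_urn(n, M, base_draw, rng=None):
    """Draw theta_1, ..., theta_n from the generalized Polya urn scheme (11):
    theta_{i+1} is a fresh draw from G0 with probability M/(M+i), otherwise
    a uniformly chosen previous value."""
    rng = np.random.default_rng(rng)
    draws = [base_draw(rng)]
    for i in range(1, n):
        if rng.uniform() < M / (M + i):
            draws.append(base_draw(rng))          # new value from the center measure
        else:
            draws.append(draws[rng.integers(i)])  # tie to an earlier value
    return np.array(draws)

theta = polya_urn(500, M=1.0, base_draw=lambda r: r.standard_normal(), rng=0)
n_clusters = len(np.unique(theta))
```

The number of distinct values grows only logarithmically in n (roughly M log n for moderate M), which is the dimension reduction exploited by the algorithms below.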

As a result, the DPMM generates θi values, many of which are likely to be tied, giving a very desirable clustering property and allowing huge dimension reduction. Let the distinct values be denoted by θ1*, . . . , θN* in a typical realization. MCMC algorithms

for posterior computation in DPMMs were developed by Escobar and West (1995) and others, and are particularly useful when the center measure G0 is conjugate to the likelihood of (Y1, . . . , Ym) given the latent variables. In the absence of such a conjugacy property, as in the present case, the “no gaps” algorithm of MacEachern and Müller (1998) or similar refinements are needed. While in principle the “no gaps” algorithm is applicable, the dependence and complexity of the multivariate skew-normal likelihood make it very challenging to implement. An alternative approximate method developed by Ishwaran and Zarepour (2000) fixes N beforehand at a sufficiently high number and approximates the Dirichlet process by the so-called Dirichlet sieve process given by

θi ∼ Σ_{j=1}^N pj δ(·; θj*),   θ1*, . . . , θN* iid ∼ G0,   (p1, . . . , pN) ∼ Dir(N; M/N, . . . , M/N).   (12)
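A draw from the sieve prior (12) can be sketched as follows (illustrative only; `base_draw` stands in for sampling from G0):

```python
import numpy as np

def dirichlet_sieve(m, N, M, base_draw, rng=None):
    """Draw theta_1, ..., theta_m from the Dirichlet sieve (12): N atoms
    theta*_j ~ G0 with symmetric Dirichlet(M/N, ..., M/N) weights."""
    rng = np.random.default_rng(rng)
    atoms = np.array([base_draw(rng) for _ in range(N)])
    p = rng.dirichlet(np.full(N, M / N))
    labels = rng.choice(N, size=m, p=p)    # L_i = j  when  theta_i = theta*_j
    return atoms[labels], labels, atoms, p

theta, labels, atoms, p = dirichlet_sieve(
    m=1000, N=int(np.sqrt(1000)), M=1.0,
    base_draw=lambda r: r.standard_normal(), rng=0)
```

The small Dirichlet parameters M/N concentrate the weights on a few atoms, mimicking the clustering of the full Dirichlet process while keeping the parameter dimension fixed.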

Let Li = j if θi = θj*. As N → ∞, the Dirichlet sieve process converges to the corresponding Dirichlet process. The main advantage of this approach is that the MCMC problem is reduced to a relatively low and fixed dimensional sampling problem, where only (Li : i = 1, . . . , m) and (θj*, pj : j = 1, . . . , N) need to be generated. When the sample size m is large, Ishwaran and Zarepour (2000) recommended taking N to be about √m. Strictly speaking, the approximation is justified only in the prior, although it often gives a fairly accurate approximation to the full DPMM posterior. An alternative would be to view the Dirichlet sieve process itself as the prior distribution, which shares almost all desirable features of the full Dirichlet process for large N. We shall follow the Dirichlet sieve process approach in our computation. We introduce some notation before we describe the steps of posterior computation. Let µ* = (µ1*, . . . , µN*) denote the set of possible alternative values of the µi, while the null value is µ0* = 0. For j = 0, 1, . . . , N, let µi→j denote the vector (µ1, . . . , µi−1, µj*, µi+1, . . . , µm). Similarly define ωi→j and λi→j, whose null values are 1 and 0 respectively. Then in each step of the posterior iteration:

(i) For i = 1, . . . , m, Hi is drawn from a Bernoulli distribution with probability P(Hi = 1 | Y, rest) given by

(1 − π0) qm(Y; µi→Li, ωi→Li, λi→Li, R(ψ)) / [π0 qm(Y; µi→0, ωi→0, λi→0, R(ψ)) + (1 − π0) qm(Y; µi→Li, ωi→Li, λi→Li, R(ψ))].

(ii) For i = 1, . . . , m, the ith component of the configuration vector, Li, is updated as a random draw from the distribution

P(Li = j | Y, rest) = pj qm(Y; µi→j, ωi→j, λi→j, R(ψ)) / Σ_{k=1}^N pk qm(Y; µi→k, ωi→k, λi→k, R(ψ)),   j = 1, . . . , N.

(iii) For i = 1, . . . , m, the parameters are then updated as θi = θLi* if Hi = 1, and θi = θ0* if Hi = 0.

(iv) The assignment probability vector is updated as

(p1, . . . , pN) | (Y, rest) ∼ Dir(N; M/N + c1, . . . , M/N + cN),

where cj = #{i : Li = j and Hi = 1}.

(v) The null probability π0 is updated as a draw from Beta(a0 + m0, a1 + m − m0), where m0 = Σ_{i=1}^m (1 − Hi).

(vi) The parameter of the correlation structure R(ψ) is updated using a Metropolis–Hastings step with target p(ψ | Y, rest) ∝ pψ(ψ) qm(y; µ, ω, λ, R(ψ)).

The posterior sampling algorithm described above can be used to generate a large number K of posterior samples of the parameters π0, θ1*, . . . , θN*, p1, . . . , pN and the hidden labels L1, . . . , Lm, H1, . . . , Hm, and hence also of any quantity of interest. For instance, the probit p-value density function for each sample is computed as

h1^{(r)}(y) = π0^{(r)} φ(y) + (1 − π0^{(r)}) [M ∫ qm(y; θ) dG0(θ) + Σ_{i=1}^m qm(y; θi^{(r)})] / (M + m),

r = 1, . . . , K, and the corresponding posterior mean may be obtained by averaging. Similarly, posterior samples of the p-value density function may be obtained using the relationship between the p-value density and the probit p-value density, while posterior samples of the CDF of probit p-values can be obtained by computing the CDF of a skew-normal distribution numerically using the technique of Bazan et al. (2006). To predict the FDP, one can similarly compute the posterior mean by plugging posterior samples into formula (1). Credible intervals are obtained easily from the posterior samples.
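Given posterior samples of the truth indicators, prediction of the FDP and of credible bands reduces to plugging each sample into formula (1); a sketch with synthetic stand-in draws (the arrays here are hypothetical placeholders, not output of the actual sampler):

```python
import numpy as np

def predict_fdp(pvals, H_samples, gammas):
    """Posterior mean and 90% credible band for FDP(gamma), plugging each
    posterior sample of the truth indicators H into formula (1)."""
    pvals = np.asarray(pvals)
    curves = []
    for H in H_samples:                    # one FDP curve per posterior draw
        row = []
        for g in gammas:
            I = pvals < g                  # rejection indicators at level g
            V = np.sum(I & (H == 0))       # rejections of true nulls
            row.append(V / max(I.sum(), 1))
        curves.append(row)
    curves = np.array(curves)
    return (curves.mean(axis=0),
            np.quantile(curves, 0.05, axis=0),
            np.quantile(curves, 0.95, axis=0))

# toy illustration with synthetic posterior draws of H
rng = np.random.default_rng(0)
pvals = rng.uniform(size=200)
H_samples = rng.integers(0, 2, size=(50, 200))
mean_fdp, lo, hi = predict_fdp(pvals, H_samples, gammas=np.linspace(0.01, 0.5, 10))
```

The 5% and 95% pointwise quantiles of the sampled curves give the credible band reported in the simulations.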

4 Numerical illustration

4.1 Simulation Study

To evaluate the performance of the proposed method we performed some simulation experiments.

4.1.1 Data generation process

We consider the scenario where each observation is an m-dimensional Gaussian data vector and there are two groups containing n1 and n2 such observations, respectively. We assume that both groups have the same covariance structure. Specifically, if xgj denotes the jth observation in group g, then xgj ∼ Nm(µg, Ψ), g = 1, 2; j = 1, . . . , ng, where the dependence structure across the m coordinates is governed by the covariance matrix Ψ. The mean vector for the gth group is µg = (µg1, . . . , µgm), g = 1, 2. We consider testing equality of group means across each of the m coordinates. Thus, we have m simultaneous hypotheses H0i : µ1i = µ2i. Let pi denote the p-value generated from the two-sample t-test of the ith hypothesis and let Yi = Φ^{−1}(pi) be the corresponding probit p-value.

4.1.2 Parameter and prior specification

For the simulation experiment, we choose n1 = n2 = 15 and m = 1000. The mean vector for the first group is zero and the first m0 = ⌊π0 m⌋ coordinates of µ2 are chosen to be zero, reflecting a proportion of about π0 true null hypotheses. Of the remaining m1 = m − m0 coordinates of µ2, 40% are chosen to be 0.5, 50% are chosen to be 1 and 10% are set to 2. The common variance across groups is one, so Ψ is also the correlation matrix. The elements of the correlation matrix Ψ = ((ψij)) are parameterized by a single parameter ρ. We choose two different correlation structures: intra-class, where ψij(ρ) = ρ if i ≠ j and 1 if i = j; and autoregressive of order one, where ψij(ρ) = ρ^{|i−j|}. The prior for the true null proportion π0 is chosen as Beta(a0, a1) with a0 = 5 and a1 = 1. Thus, a priori, the expected proportion of null hypotheses is 83.33%. The prior for the correlation parameter is chosen as Uniform[−1, 1]. The number of alternative values is N = 20 and the number of hypotheses is m = 1000. All parameters of the gamma priors for µ, ω and λ were set to two. The formula for the inverse of the correlation matrix given in Remark 1 in the Appendix is used in the computation.

4.1.3 Summary output

Table 1 summarizes the performance of the proposed methodology in terms of root-mean-squared error (RMSE) values. The RMSPE is the integrated root-mean-squared error in predicting FDP as the nominal level γ varies, given by

RMSPE = [E(∫₀¹ |FDP(γ) − \widehat{FDP}(γ)|² dγ)]^{1/2},

where \widehat{FDP} stands for the posterior mean of FDP. The coverage column gives the empirical coverage proportion of the 90% FDP credible interval when γ = 0.05. When the data dependence is not too strong, the proposed method is accurate with respect to all three measures for both types of correlation structures. When the correlation parameter is high (ρ = 0.5), the true null proportion is still estimated accurately. However, the accuracy of FDP prediction goes down slightly in the AR(1) case. A possible reason is that an autoregressive structure on the p-values does not translate exactly into an autoregressive structure for the probit p-values used in the modeling, whereas for intra-class correlated data the induced correlation structure on the probit p-values is still exchangeable. Nevertheless, the fairly accurate prediction even in the autoregressive case suggests that matching the correlation structure exactly is not critical as long as enough flexibility is left in the modeling. Figure 2 shows typical FDP curves with their predicted values and credible bounds for γ in the interval [0, 0.5]. In most cases, the bounds are quite sharp and yield reasonable coverage.
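The simulation design above can be sketched in a few lines. This is a stdlib-only illustration with hypothetical function names; for simplicity it uses a normal approximation in place of the exact two-sample t reference distribution used in the study.

```python
import math
import random
from statistics import NormalDist

def correlated_normals(m, rho, structure, rng):
    """One mean-zero Gaussian vector with intra-class (rho >= 0) or AR(1) correlation."""
    if structure == "intraclass":
        # X_i = sqrt(rho)*Z0 + sqrt(1 - rho)*Z_i gives corr(X_i, X_j) = rho
        z0 = rng.gauss(0.0, 1.0)
        return [math.sqrt(rho) * z0 + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                for _ in range(m)]
    # AR(1): X_i = rho*X_{i-1} + sqrt(1 - rho^2)*e_i gives corr(X_i, X_j) = rho^|i-j|
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(m - 1):
        x.append(rho * x[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return x

def simulate_probit_pvalues(m=1000, n=15, pi0=0.90, rho=0.10,
                            structure="intraclass", seed=1):
    rng = random.Random(seed)
    m0 = int(pi0 * m)
    m1 = m - m0
    # second-group means: nulls at 0; alternatives 40% at 0.5, 50% at 1, rest at 2
    mu2 = [0.0] * m0 + [0.5] * int(0.4 * m1) + [1.0] * int(0.5 * m1)
    mu2 += [2.0] * (m - len(mu2))
    g1 = [correlated_normals(m, rho, structure, rng) for _ in range(n)]
    g2 = [[mu2[i] + x for i, x in enumerate(correlated_normals(m, rho, structure, rng))]
          for _ in range(n)]
    nd = NormalDist()
    pvals, probit = [], []
    for i in range(m):
        x1 = [row[i] for row in g1]
        x2 = [row[i] for row in g2]
        mean1, mean2 = sum(x1) / n, sum(x2) / n
        v1 = sum((v - mean1) ** 2 for v in x1) / (n - 1)
        v2 = sum((v - mean2) ** 2 for v in x2) / (n - 1)
        t = (mean1 - mean2) / math.sqrt(v1 / n + v2 / n)
        p = 2.0 * nd.cdf(-abs(t))            # normal approximation to the t reference
        p = min(max(p, 1e-15), 1.0 - 1e-15)  # keep the probit finite
        pvals.append(p)
        probit.append(nd.inv_cdf(p))         # Y_i = Phi^{-1}(p_i)
    return pvals, probit
```

A true FDP curve can then be computed from the returned p-values by comparing rejections at level γ against the known null coordinates (the first m0 indices).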

                    Intraclass                          AR(1)
π0     ρ      RMSE(π̂0)  Coverage  RMSPE      RMSE(π̂0)  Coverage  RMSPE
0.90   0.10   0.0050     0.94      0.0084     0.0227     0.93      0.0440
0.90   0.50   0.0066     0.89      0.0141     0.0256     0.88      0.0821
0.95   0.10   0.0032     0.91      0.0067     0.0095     0.92      0.0333
0.95   0.50   0.0067     0.90      0.0188     0.0348     0.79      0.0956

Table 1: Empirical performance measures for the skew-mixture model


Figure 2: Predicted and true FDP and credible bands for correlation structures: (a) intraclass; (b) AR(1).

4.2 Analysis of kidney data

The data used to test our model come from an analysis of isografted kidneys from brain-dead donors and can be obtained from the National Center for Biotechnology Information (NCBI) database [Kisaka et al. (2006)]. Brain death in donors triggers inflammatory events in recipients after kidney transplantation. Inbred male Lewis rats were used in the experiment as both donors and recipients, with the experimental group receiving kidneys from brain-dead donors and the control group receiving kidneys from living donors. Gene expression profiles of isografts from brain-dead donors and grafts from living donors were compared using a high-density oligonucleotide microarray containing approximately 22,000 genes. For illustration, we have chosen every alternate gene, resulting in about 11,000 genes. The p-value range and the probit p-value density histogram of the full set of genes and the sampled half of those genes are almost identical. We fit the skew-normal mixture model to the kidney data, constraining one of the components to be N(0, 1).
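The subsampling and transformation step can be sketched as follows; this is a minimal illustration in which `pvals` is a hypothetical list of per-gene p-values:

```python
from statistics import NormalDist

def probit_subsample(pvals, step=2, eps=1e-15):
    """Keep every `step`-th p-value and apply the probit transform Y = Phi^{-1}(p)."""
    nd = NormalDist()
    sub = pvals[::step]  # every alternate gene when step = 2
    # clamp away from 0 and 1 so the inverse normal CDF is defined
    return [nd.inv_cdf(min(max(p, eps), 1.0 - eps)) for p in sub]
```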


Component      µ         ω        λ        π
    1        0.0000   1.0000   0.0000   0.7040
    2       -1.5873   1.0767   0.4902   0.0248
    3       -2.1821   1.1976   0.9266   0.0004
    4        0.6722   1.5100   3.4430   0.0475
    5        0.2994   1.8521   2.9584   0.0594
    6        0.0516   1.5378   1.9729   0.0238
    7       -0.2822   1.1193   0.5981   0.0101
    8       -2.2818   1.0514   0.3346   0.0663
    9       -2.8501   1.0356   3.0999   0.0519
   10       -1.0813   1.0328   0.7730   0.0119

Table 2: Parameters of the estimated skew-normal components for the kidney data

The AR(1) correlation structure seems to be more reasonable for these data. Because of the high dimension of the data, we generate a chain of length 10,000 after a burn-in of 10,000 iterations. At convergence, there were nine alternative components with significant mixing proportions. The estimated proportion corresponding to the standard normal component was π̂0 = 0.704. The parameters of the mixture components, along with their proportions, are given in Table 2. Figure 3(a) shows that the estimated model fits the observed data very well. The FDP curve is shown in Figure 3(b). From the curve, if an FDP of less than 5% is desired, the corresponding nominal level for the individual tests should be approximately 0.0001.
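As a rough cross-check of the fitted model, one can plug the Table 2 estimates into the independence-based approximation pFDR(γ) ≈ π0 γ/F(γ) from the Introduction, with F the p-value CDF induced by the mixture. The sketch below assumes the standard Azzalini parameterization of the skew-normal density, 2ω⁻¹φ(z)Φ(λz) with z = (y − µ)/ω (the paper's sign convention for λ may differ), and evaluates the skew-normal CDF by Simpson integration; it is a stdlib-only illustration, not the paper's posterior-based FDP prediction.

```python
import math
from statistics import NormalDist

ND = NormalDist()

def skewnormal_cdf(y, mu, omega, lam, steps=4000, lower=-12.0):
    """P(Y <= y) for a skew-normal(mu, omega, lam): density
    (2/omega)*phi(z)*Phi(lam*z), z = (y - mu)/omega, integrated by Simpson's rule."""
    def dens(t):
        z = (t - mu) / omega
        return (2.0 / omega) * math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi) * ND.cdf(lam * z)
    a = mu + lower * omega  # far-left truncation point standing in for -infinity
    if y <= a:
        return 0.0
    h = (y - a) / steps
    s = dens(a) + dens(y)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * dens(a + k * h)
    return s * h / 3.0

# (mu, omega, lambda, pi) estimates from Table 2
COMPONENTS = [
    (0.0000, 1.0000, 0.0000, 0.7040), (-1.5873, 1.0767, 0.4902, 0.0248),
    (-2.1821, 1.1976, 0.9266, 0.0004), (0.6722, 1.5100, 3.4430, 0.0475),
    (0.2994, 1.8521, 2.9584, 0.0594), (0.0516, 1.5378, 1.9729, 0.0238),
    (-0.2822, 1.1193, 0.5981, 0.0101), (-2.2818, 1.0514, 0.3346, 0.0663),
    (-2.8501, 1.0356, 3.0999, 0.0519), (-1.0813, 1.0328, 0.7730, 0.0119),
]

def pvalue_cdf(gamma):
    """F(gamma) = P(p <= gamma) = P(Y <= Phi^{-1}(gamma)) under the fitted mixture."""
    y = ND.inv_cdf(gamma)
    return sum(w * skewnormal_cdf(y, mu, om, lam) for mu, om, lam, w in COMPONENTS)

def pfdr(gamma, pi0=0.704):
    """Storey-type approximation pi0 * gamma / F(gamma)."""
    return pi0 * gamma / pvalue_cdf(gamma)
```

In this sketch pfdr(γ) increases with γ, consistent with the qualitative finding that a very small nominal level is needed to keep the FDP small.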


Figure 3: Left: fitted density and histogram for the kidney data under the skew-mixture model; the theoretical null distribution (standard normal) is also shown for reference. Right: predicted FDP curve for the kidney data.

Appendix

Proof of Theorem 1. By Bayes' theorem, for any a = (a1, . . . , am) ∈ {0, 1}^m, we have

E(Σ_{i=1}^m Ii(1 − Hi) | I = a) = Σ_{i=1}^m ai P(Hi = 0 | I = a) = π0 Σ_{i=1}^m ai P(I = a | Hi = 0)/P(I = a).

Consequently,

FDR(γ) = π0 Σ_{a∈{0,1}^m} P(I = a) [Σ_{i=1}^m ai P(I = a | Hi = 0)/P(I = a)] / [Σ_{j=1}^m aj + Π_{j=1}^m (1 − aj)]
       = π0 Σ_{a∈{0,1}^m} [Σ_{i=1}^m ai P(I = a | Hi = 0)] / [Σ_{i=1}^m ai + Π_{i=1}^m (1 − ai)]
       = π0 Σ_{i=1}^m Σ_{a∈B_i^m} P(I = a | Hi = 0)/(Σ_{i=1}^m ai),

proving the first part of the result.

If the observations are exchangeable, then (Hi, Ii), i = 1, . . . , m, are exchangeable as well. For any pair (i, j), let B^m_{ij,uv} denote the set of all m-dimensional binary vectors with u at the ith position and v at the jth position, where u, v ∈ {0, 1}. Then, noting that under exchangeability

Σ_{a∈B^m_{ij,10}} P(I = a | Hi = 0)/(Σ_{i=1}^m ai) = Σ_{a∈B^m_{ij,01}} P(I = a | Hj = 0)/(Σ_{i=1}^m ai),

we have bi(γ) = bj(γ), which gives the second part of the result. Finally, using the facts that Σ_{i=2}^m Ii = V − I1 ∼ Binomial(m − 1, F(γ)) and E[(1 + V − I1)^{−1}] = (mF(γ))^{−1} P(V > 0), we have the result.

Proof of Theorem 2. If the density of the probit p-value Y is q(y; µ, ω, λ), then the p-value X = Φ(Y) has density

q(Φ^{−1}(x); µ, ω, λ)/φ(Φ^{−1}(x)) = 2ω^{−1} e^{−z²/2} Φ(−λz)/e^{−(µ+ωz)²/2},

where z = z(x) = ω^{−1}(Φ^{−1}(x) − µ). Since z(x) is an increasing function of x, it is enough to investigate when r(z) = exp{½(ω² − 1)z² + µωz} Φ(−λz) is decreasing in z.

If ω < 1, then lim_{z→−∞} r(z) = 0, and hence the density cannot be decreasing in this case.

First consider the case ω > 1. If λ ≤ 0, then lim_{z→∞} r(z) = ∞, and hence the density again cannot be decreasing. So let ω > 1 and λ > 0. Now, r(z) is decreasing, that is, (d/dz) log r(z) ≤ 0 for all z, if and only if

(ω² − 1)z + µω − λH_Φ(λz) ≤ 0 for all z.

The last assertion is equivalent to λ^{−2}(ω² − 1)u + µωλ^{−1} − H_Φ(u) ≤ 0 for all u, that is, µ ≤ λω^{−1} inf{H_Φ(u) − λ^{−2}(ω² − 1)u : u ∈ R} = λω^{−1} ϕ(λ^{−2}(ω² − 1)). In view of Sampford (1953), the hazard function H_Φ of the standard normal CDF is convex, and the right-hand side is finite if and only if λ^{−2}(ω² − 1) ≤ 1, that is, λ ≥ √(ω² − 1). Indeed, under this condition, the right-hand side is nonnegative.

Finally, consider the case ω = 1. In this case, r(z) = e^{µz} Φ(−λz). If λ < 0, then by the estimate Φ(−t) ≤ t^{−1}φ(t), it follows that lim_{z→−∞} r(z) = 0, which again rules out a decreasing density. If λ = 0, the condition for a decreasing density is clearly µ ≤ 0, so assume λ > 0. Arguments similar to those used in the case ω > 1 show that r(z) is decreasing if and only if µ ≤ λ inf{H_Φ(u) : u ∈ R} = 0.
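The ω = 1 boundary case of the proof can be checked numerically: r(z) = e^{µz}Φ(−λz) should be decreasing exactly when λ > 0 and µ ≤ 0 (or λ = 0 and µ ≤ 0). A minimal stdlib sketch, with illustrative helper names:

```python
import math
from statistics import NormalDist

ND = NormalDist()

def r(z, mu, lam):
    """r(z) = exp(mu*z) * Phi(-lam*z), the omega = 1 case in the proof of Theorem 2."""
    return math.exp(mu * z) * ND.cdf(-lam * z)

def is_decreasing(mu, lam, lo=-8.0, hi=8.0, n=2000):
    """Check (non-strict) monotone decrease of r on a grid."""
    zs = [lo + (hi - lo) * k / n for k in range(n + 1)]
    vals = [r(z, mu, lam) for z in zs]
    return all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
```

For example, is_decreasing(-0.5, 1.0) holds (µ ≤ 0, λ > 0), while is_decreasing(0.3, 1.0) fails, matching the condition µ ≤ 0 in the proof.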

Proof of Proposition 1. Using (9), conditionally on µ, ω and λ, we have Yi = µi + ωiδi|Z0| + ωi√(1 − δi²) Zi. Thus we can write Yi = τ(Hi, µi, ωi, λi, Z0, Zi), where

τ(x1, x2, x3, x4, x5, x6) = x1x2 + x1x3 (x4/√(1 + x4²)) |x5| + (x3/√(1 + x4²)) x1x6.

Since (Hi, µi, ωi, λi) are i.i.d. and independent of (Z0, Z), it follows that the Yi's are strictly stationary if and only if Z is a strictly stationary process. Due to Gaussianity, the process Z is strictly stationary if and only if R is a stationary correlation matrix.

The following expressions for the inverses and determinants of the AR(1) and intra-class correlation matrices, used in the computation in Section 4, are well known in the literature.

Proposition 2. For the intra-class correlation matrix R = (1 − ρ)I + ρ11′, the determinant is det(R) = (1 − ρ)^{m−1}(1 + mρ − ρ) and the inverse is

R^{−1} = (1 − ρ)^{−1} I − [ρ/((1 − ρ)(1 + mρ − ρ))] 11′.

For the AR(1) correlation matrix R = ((ρ^|i−j|)), the determinant is det(R) = (1 − ρ²)^{m−1} and the inverse is R^{−1} = (1 − ρ²)^{−1}((r^{ij})), where r^{11} = r^{mm} = 1, r^{ii} = 1 + ρ² for i = 2, . . . , m − 1, and, for i ≠ j, r^{ij} = −ρ if |i − j| = 1 and r^{ij} = 0 if |i − j| > 1.
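The closed forms in Proposition 2 can be verified numerically for small m. The sketch below (pure Python, illustrative helper names) builds each correlation matrix together with its closed-form inverse and determinant, and checks them against direct matrix multiplication and Gaussian elimination:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det(A):
    """Determinant by Gaussian elimination (fine without pivoting for these SPD matrices)."""
    n = len(A)
    A = [row[:] for row in A]
    d = 1.0
    for i in range(n):
        d *= A[i][i]
        for j in range(i + 1, n):
            f = A[j][i] / A[i][i]
            for k in range(i, n):
                A[j][k] -= f * A[i][k]
    return d

def intraclass(m, rho):
    """R = (1 - rho)I + rho 11' with its closed-form inverse and determinant."""
    R = [[1.0 if i == j else rho for j in range(m)] for i in range(m)]
    c = rho / ((1.0 - rho) * (1.0 + m * rho - rho))
    Rinv = [[(1.0 / (1.0 - rho) if i == j else 0.0) - c for j in range(m)]
            for i in range(m)]
    return R, Rinv, (1.0 - rho) ** (m - 1) * (1.0 + m * rho - rho)

def ar1(m, rho):
    """R = ((rho^|i-j|)) with its closed-form tridiagonal inverse and determinant."""
    R = [[rho ** abs(i - j) for j in range(m)] for i in range(m)]
    s = 1.0 - rho * rho
    Rinv = [[0.0] * m for _ in range(m)]
    for i in range(m):
        Rinv[i][i] = (1.0 if i in (0, m - 1) else 1.0 + rho * rho) / s
        if i + 1 < m:
            Rinv[i][i + 1] = Rinv[i + 1][i] = -rho / s
    return R, Rinv, s ** (m - 1)
```

For either structure, matmul(R, Rinv) reproduces the identity matrix and det(R) matches the closed-form determinant to numerical precision.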

References

Azzalini, A. (1985). A class of distributions which includes the normal ones. Scand. J. Statist. 12, 171–178.

Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika 83, 715–726.

Bayarri, M. J. and Berger, J. O. (2000). p-values for composite null models. J. Amer. Statist. Assoc. 95, 1127–1142.

Bazan, J. L., Branco, M. D. and Bolfarine, H. (2006). A skew item response model. Bayesian Analysis 1, 861–892.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc., Ser. B 57, 289–300.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165–1188.

Clarke, S. and Hall, P. (2009). Robustness of multiple testing procedures against dependence. Ann. Statist. 37, 332–358.

Dalla Valle, A. (2004). The skew-normal distribution. In Skew-elliptical Distributions and Their Applications: A Journey Beyond Normality (Genton, M. G., editor), Chapter 1, 3–24. Chapman & Hall/CRC.

Efron, B. (2004). Large-scale simultaneous hypothesis testing. J. Amer. Statist. Assoc. 99, 96–104.

Efron, B. (2007). Size, power and false discovery rates. Ann. Statist. 35, 1351–1377.

Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, 1151–1160.

Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90, 577–588.

Farcomeni, A. (2007). Some results on the control of the false discovery rate under dependence. Scand. J. Statist. 34, 275–297.

Finner, H., Dickhaus, T. and Roters, M. (2007). Dependency and false discovery rate: Asymptotics. Ann. Statist. 35, 1432–1455.

Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. Roy. Statist. Soc., Ser. B 64, 499–517.

Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. Ann. Statist. 32, 1035–1061.

Genovese, C. R., Lazar, N. A. and Nichols, T. E. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage 15, 870–878.

Genton, M. G. (Ed.) (2004). Skew-elliptical Distributions and Their Applications: A Journey Beyond Normality. Chapman & Hall/CRC.

Ghosal, S. and Roy, A. (2010). Identifiability of proportion of null hypotheses in mixture models for p-value distribution in multiple testing. Preprint. www4.stat.ncsu.edu/~ghoshal/papers/identifiability.pdf

Ghosal, S., Roy, A. and Tang, Y. (2008). Posterior consistency of Dirichlet mixtures of beta densities in estimating positive false discovery rates. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen (Balakrishnan, N. et al., Eds.), IMS Collections 1, Institute of Mathematical Statistics, Beachwood, OH.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C. and Lander, E. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Hopkins, A. M., Miller, C. J., Connolly, A. J., Genovese, C., Nichol, R. C. and Wasserman, L. (2002). A new source detection algorithm using the false-discovery rate. The Astronomical Journal 123, 1086–1094.

Ishwaran, H. and Zarepour, M. (2002). Dirichlet prior sieves in finite normal mixture models. Statist. Sinica, 269–283.

Kisaka, M., Yamada, K., Kuroyanagi, Y., Terauchi, A. et al. (2006). Genomewide expression profile in rat model of renal isografts from brain dead donors. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5104

MacEachern, S. N. and Müller, P. (1998). Estimating mixture of Dirichlet process models. J. Comp. Graph. Statist. 7, 223–228.

Miller, C. J., Genovese, C., Nichol, R. C., Wasserman, L., Connolly, A., Reichart, D., Hopkins, A., Schneider, J. and Moore, A. (2001). Controlling the false discovery rate in astrophysical data analysis. The Astronomical Journal 122, 3492–3505.

Pawitan, Y., Calza, S. and Alexander, P. (2006). Estimation of false discovery proportion under general dependence. Bioinformatics 22, 3025–3031.

Sahu, S. K., Dey, D. K. and Branco, M. D. (2003). A new class of multivariate skew distributions with applications to Bayesian regression models. Canad. J. Statist. 31, 129–150.

Sampford, M. R. (1953). Some inequalities on Mill's ratio and related functions. Ann. Math. Statist. 24, 130–132.

Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist. 30, 239–257.

Sarkar, S. K. (2004). FDR-controlling stepwise procedures and their false negatives rates. J. Statist. Plann. Inf. 125, 119–137.

Sarkar, S. K. (2006). False discovery and false non-discovery rates in single-step multiple testing procedures. Ann. Statist. 34, 394–415.

Sarkar, S. K. (2007). Step-up procedures controlling generalized FWER and generalized FDR. Ann. Statist. 35, 2405–2420.

Sarkar, S. K. and Zhou, T. (2008). Controlling Bayes directional false discovery rate in random effects model. J. Statist. Plann. Inf. 138, 682–693.

Scott, J. and Berger, J. O. (2005). An exploration of aspects of Bayesian multiple testing. J. Statist. Plann. Inf. 136, 2144–2162.

Storey, J. D. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc., Ser. B 64, 479–498.

Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Ann. Statist. 31, 2013–2035.

Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc. Nat. Acad. Sci. USA 100, 9440–9445.

Tang, Y., Ghosal, S. and Roy, A. (2007). Nonparametric Bayesian estimation of false discovery rates. Biometrics 63, 1126–1134.

Wu, W. B. (2008). On false discovery control under dependence. Ann. Statist. 36, 364–380.
