Priors for High-Dimensional Sparse Poisson Means

Jyotishka Datta∗ and David B. Dunson†

∗ Department of Statistical Science, Duke University. Email: [email protected]
† Department of Statistical Science, Duke University. Email: [email protected]
October 16, 2015
Abstract

In a variety of application areas, there is growing interest in analyzing high dimensional sparse count data. For example, in cancer genomic studies it is of interest to model the counts of mutated alleles across the genome. Existing approaches for analyzing multivariate count data via Poisson log-linear hierarchical models cannot flexibly adapt to the level and nature of sparsity in the data. We develop a new class of local-global shrinkage priors tailored for sparse counts. Theoretical properties are assessed, including posterior concentration, super-efficiency in estimating the sampling density, and robustness of the posterior mean. Simulation studies illustrate excellent small sample properties relative to competitors, and we apply the method to model rare mutation data from the Exome Aggregation Consortium project.
Keywords: Bayesian; Count data; Gauss Hypergeometric; Global-local shrinkage; High-dimensional data; Rare variants; Shrinkage prior; Zero inflation.
1 Introduction

This paper considers the problem of estimating an n-vector of Poisson means θ = (θ1, . . . , θn)^T based on observations y = (y1, . . . , yn)^T, where the yi ∼ Poi(θi) are independent. Often θ is sparse, with many entries either zero or very small compared to a few relatively large elements, leading to count data with an excess of zeroes. For example, in whole genome sequence analysis, we routinely observe mutation data with zero or very low frequencies in an overwhelming majority of locations, and moderately higher frequencies of rare genomic mutations in a few 'hotspots' that act as drivers or inhibitors for a given phenotype. Analysis of such data is made difficult by zero-inflation and over-dispersion. Our main goal in this paper is to introduce a novel shrinkage prior for the rate parameters that simultaneously accommodates highly sparse and heavy-tailed signals θ, while maintaining computational tractability and theoretical guarantees. The proposed shrinkage prior is built upon the Gauss Hypergeometric distribution (Armero and Bayarri, 1994), originally proposed for modeling the traffic intensity of an M/M/1 queue in equilibrium.

Empirical Bayes approaches have been popular for Poisson compound decision problems, dating back to Robbins' formula (Robbins, 1956) for prediction of a Poisson mean vector. Empirical Bayes methods have seen a recent resurgence in high-dimensional applications, fueled by Efron (2003, 2011, 2014); Zhang (2003); Brown and Greenshtein (2009); Greenshtein and Ritov (2009); Brown et al. (2013), among others. In Robbins' empirical Bayes approach, the
θi's are assumed to be independent draws from a distribution G(·), and the goal is to estimate θ = (θ1, . . . , θn)^T based on the observations y = (y1, . . . , yn)^T. The relative accuracy of an estimator θ̂ = δ(y) = {δ1(y1), . . . , δn(yn)}^T is assessed by the risk W(δ) = E_θ ‖δ(y) − θ‖². The Bayes estimator that minimizes W(δ) assumes a simple form for the Poisson kernel:

δi(yi) = ∫ θi p(yi | θi) dG(θi) / ∫ p(yi | θi) dG(θi) = (yi + 1) P_Y(yi + 1) / P_Y(yi),
where P_Y(·) is the marginal distribution of Y. Robbins' frequency-ratio estimator plugs in the empirical frequencies P̂_Y(y) = n^{−1} Σ_{i=1}^n I(yi = y) to estimate P_Y. Although Robbins' estimator enjoys several asymptotic optimality properties, its performance can deteriorate owing to the slow convergence of P̂_Y to P_Y (Brown et al., 2013). To improve on Robbins' estimator, Brown et al. (2013) proposed a three-stage smoothing adjustment: (i) corrupting the original counts by adding small Poi(h) noise; (ii) Rao–Blackwellization; and (iii) imposing monotonicity constraints on the decision rule. They showed substantial improvement in the total Bayes risk nW(δ) in simulation studies, with the true θ being either n equispaced values between fixed a and b, or sparse in the sense of many θi's being equal to a common non-zero θ0.

A more efficient approach for the Poisson compound decision problem is nonparametric maximum likelihood estimation (Kiefer and Wolfowitz, 1956), which estimates θ by maximizing the likelihood with respect to the unknown distribution G. The resulting estimator δ_KW is monotone non-decreasing over the support of G. Koenker and Mizera (2014) use monotonicity to recast estimation as a convex optimization problem, finding efficient approximate solutions via an interior point approach. The Kiefer and Wolfowitz (1956) estimator is atomic, and is expected to outperform Robbins' estimator when the true G is sparse and discrete.

This article develops a hierarchical Bayesian model that allows for sparsity in G, while maintaining the ability to capture large signals. The proposed prior is adaptive to the degree of sparsity in the data, and is inspired by local-global shrinkage priors for sparse Gaussian means and linear regression (Carvalho et al., 2010; Armagan et al., 2011, 2013; Bhattacharya et al., 2014). Such priors are structured as scale mixtures of Gaussians for convenience in computation, with corresponding theoretical support when the true mean or regression vector is mostly zero (Bhattacharya et al., 2014; van der Pas et al., 2014). Naively, one could apply such priors to the coefficients in Poisson log-linear models, but such formulations lack the computational advantages obtained in the Gaussian case and fail to represent sparsity through zero-inflation. The proposed model instead induces zero-inflation in a continuous manner, which has important advantages over zero-inflated Poisson models and their many variants, such as the zero-inflated generalized Poisson and zero-inflated negative binomial (Yang et al., 2009).
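Robbins' frequency-ratio rule above is a one-line computation, and its instability for isolated large counts — a motivation for the smoother alternatives discussed above — is easy to see. A minimal sketch in Python (numpy assumed; the function name is ours, not from the original paper):

```python
import numpy as np

def robbins_estimate(y):
    """Robbins' frequency-ratio rule: delta_i(y_i) = (y_i + 1) * Phat(y_i + 1) / Phat(y_i),
    where Phat is the empirical frequency of each count in the sample."""
    y = np.asarray(y, dtype=int)
    counts = np.bincount(y, minlength=y.max() + 2)  # counts[k] = #{j : y_j = k}
    # counts[y_i] >= 1 for every observed y_i, so the ratio is well defined
    return (y + 1) * counts[y + 1] / counts[y]

y = np.array([0, 0, 0, 1, 0, 2, 0, 5, 1, 0])
print(robbins_estimate(y))  # heavy shrinkage at y = 0; erratic for isolated large counts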
2 Shrinkage priors for count data

We propose a shrinkage prior that separately adjusts for zero-inflation and over-dispersion in count data. If θi ∼ Ga(α, βi), the marginal distribution of yi is negative binomial with mean αβi^{−1} and variance αβi^{−1}(1 + βi^{−1}), which exceeds the mean. To allow zero-inflation, the usual approach would mix a Poisson or negative binomial distribution with a degenerate distribution at zero. We instead choose a prior for θi with a pole at zero, leading to a spike in the marginal distribution of yi at zero.
The Poisson-Gamma hierarchical model can be expressed as:

yi ∼ Poi(θi),  θi ∼ Ga(α, λi²τ²),  λi² ∼ p(λi²),  τ² ∼ p(τ²),

where p(λi²) and p(τ²) are prior densities for λi² and τ², chosen according to the desiderata in the paragraph above. Marginalizing out θi and writing κi = 1/(1 + λi²τ²), the model can be written as:

p(yi | λi, τ) ∝ {λi²τ² / (1 + λi²τ²)}^{yi} {1 / (1 + λi²τ²)}^{α},   (1)
p(yi | κi) ∝ (1 − κi)^{yi} κi^{α}  ⇒  [yi | κi] ∼ NB(α, 1 − κi).   (2)
In other words, given κi, yi follows a negative binomial distribution with size α and probability of 'success' 1 − κi. The posterior distribution and the posterior mean of θi given yi and κi are, respectively,

θi | yi, κi ∼ Ga(yi + α, 1 − κi),  E(θi | yi, κi) = (1 − κi)(yi + α).
Hence, the parameter κi can be interpreted as a random shrinkage factor pulling the posterior mean towards 0. Priors on shrinkage factors having a U-shaped distribution are appealing: they shrink small signals to zero while avoiding shrinkage of larger signals. In normal mean estimation and linear models, such priors have been widely used and include the horseshoe (Carvalho et al., 2010), generalized double Pareto (Armagan et al., 2013), three-parameter beta (Armagan et al., 2011) and Dirichlet–Laplace (Bhattacharya et al., 2014). In our motivating sparse count applications, we require additional flexibility in the mass of the shrinkage parameter κi around 0 and 1. In genomic sequencing applications, one observes very low counts in addition to zeros; low counts arising from sequencing errors should be segregated in the analysis from true rare events that could be associated with a given trait. This requires careful treatment of the prior for κi. Consider the three-parameter beta prior (Armagan et al., 2011):

p(κi | a, b, z) ∝ (1 − κi)^{a−1} κi^{b−1} {1 + zκi}^{−(a+b)}  for 0 < κi, z < 1, a, b > 0.   (3)
Armagan et al. (2011) recommend a, b ∈ (0, 1), leading to both Cauchy-like tails and a kink at zero. For example, a = b = 1/2 and z = 0 in (3) leads to κi ∼ Be(1/2, 1/2), where Be(a, b) denotes the beta distribution with parameters a and b. The 'horseshoe'-shaped Be(1/2, 1/2) prior combined with the likelihood in (2) produces a (1 − κi)^{yi−1/2} term in the posterior, which leaves all non-zero yi's unshrunk. For a = b = 1/2 and z → −1, the corresponding term in the posterior would be (1 − κi)^{yi−3/2}, which shrinks yi ≤ 1. To extend the flexibility of the prior on κi, while retaining the heavy-tailed property of the induced marginal prior on θi, we make the exponent of the final term in (3) a general non-negative parameter γ. Higher values of γ imply shrinkage of larger observations. The proposed prior has the following functional form:

GH(κi | a, b, z, γ) = C κi^{a−1} (1 − κi)^{b−1} (1 + zκi)^{−γ}  for κi ∈ (0, 1),   (4)
where C^{−1} = β(a, b) 2F1(γ, a, a + b, −z) is the norming constant, with β(a, b) = Γ(a)Γ(b)/Γ(a + b) and 2F1 the Gauss hypergeometric function, defined as:

2F1(a, b, c, z) = Σ_{k=0}^{∞} {(a)_k (b)_k / (c)_k} (z^k / k!)  for |z| < 1,

where (q)_k denotes the rising Pochhammer symbol, defined as (q)_k = q(q + 1) ⋯ (q + k − 1) for k > 0 and (q)_0 = 1. Expression (4) corresponds to the Gauss Hypergeometric (GH) distribution, introduced in Armero and Bayarri (1994) for a specific queuing application. Our sparsity-inducing prior on κi is GH(a = 1/2, b = 1/2, z = τ² − 1, γ), where τ² is a global shrinkage parameter adjusting to the level of sparsity in the data. The GH prior is conjugate to the negative binomial likelihood in (2), leading to the posterior distribution:

p(κi | yi, τ², γ) = κi^{α−1/2} (1 − κi)^{yi−1/2} {1 − (1 − τ²)κi}^{−γ} / [β(α + 1/2, yi + 1/2) 2F1(γ, α + 1/2, yi + α + 1, 1 − τ²)]
⇒ κi | yi, τ², γ ∼ GH(α + 1/2, yi + 1/2, γ, τ² − 1).   (5)
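For reference, the density (4) and its norming constant are directly computable with standard special functions. A minimal sketch (assuming scipy; the function name gh_pdf is ours), which also checks numerically that the density integrates to one:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn, hyp2f1

def gh_pdf(kappa, a=0.5, b=0.5, gamma=1.0, tau2=0.01):
    """Gauss Hypergeometric density (4) with z = tau^2 - 1, so that
    C^{-1} = beta(a, b) * 2F1(gamma, a, a + b, -z)."""
    z = tau2 - 1.0
    C = 1.0 / (beta_fn(a, b) * hyp2f1(gamma, a, a + b, -z))
    return C * kappa**(a - 1) * (1 - kappa)**(b - 1) * (1 + z * kappa)**(-gamma)

mass, _ = quad(gh_pdf, 0.0, 1.0)
print(mass)  # ~1.0, up to quadrature error at the endpoint singularities
```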
Plots of the prior density for different values of the hyper-parameters a, b and γ, for τ² = 0·01, are given in Figure 1. The first column shows the effect of different choices of the parameters a, b when γ = 1/2, and the second column shows the effect of γ on p(κi) for a = b = 1/2.
Figure 1: A comparison of the density of κi under the Gauss Hypergeometric prior for different values of the hyper-parameters: (a) a = b = 0·5 (solid), a = 0·5, b = 1·5 (dashed), a = 1·5, b = 0·5 (dotted), and a = b = 1·5 (dot-dash); (b) γ = 0 (solid), γ = 0·5 (dashed), γ = 1 (dotted), and γ = 2 (dot-dash).

As shown in Figure 1, the GH prior results in a U-shaped prior density of κi for a = b = 1/2 across different values of γ and τ². This is a general class that includes the horseshoe prior: putting a standard half-Cauchy prior, denoted C+(0, 1), on λi in the hierarchical set-up (1)–(2) leads to the same posterior as under the prior GH(1/2, 1/2, γ = 1, z = τ² − 1). The half-Cauchy C+(0, 1) prior, advocated by Gelman (2006) as a default shrinkage prior, leads to the horseshoe prior of Carvalho et al. (2010) for the sparse normal means problem. However, the GH prior leads to lower estimation and misclassification error than the horseshoe prior for sparse counts, as we show in the numerical studies of §4. The kth posterior moment of κi given yi, τ and γ can be written as

E(κi^k | yi, τ, γ) = β(k + α + 1/2, yi + 1/2) 2F1(γ, k + α + 1/2, yi + α + 1 + k, 1 − τ²) / {β(α + 1/2, yi + 1/2) 2F1(γ, α + 1/2, yi + α + 1, 1 − τ²)}.   (6)

Equation (6) provides a way to rapidly calculate an empirical Bayes estimate of the posterior mean E(κi | yi, τ̂, γ̂). We will show in §3 that the posterior distribution of κi under the GH prior concentrates near 0 or 1 for 'large' or 'small' observations, respectively.
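Equation (6) makes the empirical Bayes computation a two-liner. The sketch below (assuming scipy; function names ours) evaluates E(κi | yi) and the resulting posterior mean {1 − E(κi | yi)}(yi + α), illustrating the differential shrinkage of small and large counts:

```python
from scipy.special import beta as beta_fn, hyp2f1

def post_kappa_moment(y, k=1, alpha=0.5, gamma=1.0, tau2=0.01):
    """k-th posterior moment of kappa_i, equation (6)."""
    num = beta_fn(k + alpha + 0.5, y + 0.5) * hyp2f1(gamma, k + alpha + 0.5, y + alpha + 1 + k, 1 - tau2)
    den = beta_fn(alpha + 0.5, y + 0.5) * hyp2f1(gamma, alpha + 0.5, y + alpha + 1, 1 - tau2)
    return num / den

alpha = 0.5
for y in (0, 1, 2, 5, 10):
    kbar = post_kappa_moment(y, alpha=alpha)
    # posterior mean of theta_i: strong shrinkage for y = 0, 1; little for y = 10
    print(y, round((1 - kbar) * (y + alpha), 3))
```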
2.1 Properties of the prior

We use the following lemma to derive the marginal prior on θi for τ² = 1. The proofs are deferred to the Appendix.

Lemma 2.1. Let Z ∼ Ga(α, β). Then E{1/(c + Z)} = β^α c^{α−1} e^{βc} Γ(1 − α, βc), where Γ(s, z) = ∫_z^∞ t^{s−1} e^{−t} dt is the upper incomplete gamma function.
Proposition 2.1. Let τ² = 1. For θ ∼ Gamma(α, 1/λ²) and λ ∼ C+(0, 1), the prior density of θi can be written as:

p(θi) = {2/(πΓ(α))} ∫_0^∞ θi^{α−1} e^{−θi/λ²} λ^{−2α} (λ² + 1)^{−1} dλ = {1/(√π β(1/2, α))} e^{θi} θi^{α−1} Γ(1/2 − α, θi),   (7)

where Γ(s, z) is the upper incomplete gamma function.
An important upshot of Proposition 2.1 is that we can use sharp bounds for the upper incomplete gamma function Γ(s, z) to derive useful bounds on the marginal prior density of θi. As we show in Corollary 2.1, whose proof is in Appendix 1, the bounds on Γ(1/2 − α, θi) determine the bounds on p(θi), and play a crucial role in producing a spike in p(θi) near zero along with a polynomially heavy tail. Plots of the prior density p(θi | α = 1) and p(θi | α = 1/2) near the origin and in the tails are given in Figure 2, along with a few other candidate prior densities.
Figure 2: Comparison of the density of θi under different priors: Gamma (solid), Cauchy (dashed), Gauss Hypergeometric with α = 1/2 (dotted), and Gauss Hypergeometric with α = 1 (dot-dash). Panel (a) shows the prior densities near the origin; panel (b) shows the tails.

Corollary 2.1. The prior density of θi satisfies the following sharp upper and lower bounds for α = 1 and α = 1/2, respectively:

(1/√π) {1/√θi − 2/(√θi + √(θi + 4/π))} ≤ p(θi | α = 1) ≤ (1/√π) {1/√θi − 2/(√θi + √(θi + 2))},
{1/(2π^{3/2})} θi^{−1/2} log(1 + 2/θi) ≤ p(θi | α = 1/2) ≤ (1/π^{3/2}) θi^{−1/2} log(1 + 1/θi).   (8)
As seen in Figure 2, the parameter α controls the behavior of the prior near the origin. For α = 1, the prior is of a similar order to Jeffreys' prior θi^{−1/2} for a Poisson rate parameter, and has both a heavy tail and a pole at the origin. For α = 1/2, the prior p(θi) has a sharper spike as well as a heavier tail. Corollary 2.1 above makes the marginal behavior of p(θi) precise for these two values of the shape parameter α.
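The α = 1/2 bounds in (8) can be verified numerically via the exponential integral representation of the prior, Equation (13) in the Appendix. A small sketch, assuming scipy (the function name is ours):

```python
import numpy as np
from scipy.special import exp1

def p_theta_half(theta):
    """Marginal prior p(theta | alpha = 1/2) = pi^{-3/2} e^theta theta^{-1/2} E1(theta)."""
    return np.pi**-1.5 * np.exp(theta) * theta**-0.5 * exp1(theta)

theta = np.array([1e-4, 1e-2, 1.0, 10.0])
lower = theta**-0.5 * np.log1p(2.0 / theta) / (2.0 * np.pi**1.5)
upper = theta**-0.5 * np.log1p(1.0 / theta) / np.pi**1.5
print(np.all((lower <= p_theta_half(theta)) & (p_theta_half(theta) <= upper)))  # True
```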
2.2 Impact of hyper-parameters

The Gauss Hypergeometric posterior distribution in (5) depends on three hyper-parameters: α, γ and τ². The parameter α controls the behavior of the marginal prior on θi, as shown in §2.1, with smaller values, such as α = 1/2, leading to heavier tails and a higher spike near zero. The value α = 1/2 is recommended as a default, leading to super-efficiency in Kullback-Leibler risk, as shown in §3. The parameter τ controls global shrinkage. Figure 3 shows the combined effect of γ and yi on the posterior distribution of κi for different values of γ. The hyper-parameters are chosen to ensure the prior drives most of its mass to κi = 0 and κi = 1 for appropriate values of yi. For a small value of γ, the posterior distribution of κi given a large yi, such as yi = 10, concentrates near κi = 0, indicating no shrinkage. However, a small value of yi, such as yi = 1, reinforces the shrinkage by τ, and the posterior concentrates near κi = 1, leading to strong shrinkage of low counts. This clearly shows the role of the additional shrinkage parameter γ in controlling the shape of the posterior distribution: γ enables the shrinkage to adapt to the zero inflation in the data.
3 Theoretical Properties

In this section, we study theoretical properties, with the main results showing adaptability to sparsity when the true rate parameter is zero and robustness to large signals. We first formalize the behavior of the posterior distribution of the shrinkage weight κi under the GH prior. Towards this, we provide bounds on the posterior mass in neighbourhoods of 0 and 1, depending on the magnitude of the observation yi and the global shrinkage parameter τ. We shall also see that, when the true parameter is zero, the posterior mean density estimator under the proposed prior has a faster rate of convergence in terms of Cesàro risk than under any prior with bounded mass near zero. Finally, we establish robustness properties of the posterior mean, showing that |E(θi | Yi = yi) − yi| is not too large.
3.1 Flexible Posterior Concentration

Since the posterior mean for θi can be written as (1 − κ̂i)(yi + α), it is natural to expect that the posterior distribution of κi puts increasing mass at zero as yi becomes large relative to the hyper-parameter γ. On the other hand, the posterior mass of κi concentrates near 1 for values of yi small relative to γ. Indeed, the plots in Figure 3 show the concentration of the posterior distribution at either extreme of the κi scale depending on the magnitude of γ. This leads to great flexibility in the shrinkage profile through the posterior mean θ̂i = (1 − κ̂i)(yi + α), by differential shrinkage depending on the value of γ. We prove this formally in the two theorems that follow. Theorem 3.1 shows that the posterior probability of an interval around 1 approaches
0 as yi goes to infinity. Theorem 3.2 shows that the posterior distribution of κi instead concentrates near 1 when yi ≤ γ − 1/2 and τ → 0.

Figure 3: Posterior distribution of the shrinkage parameter κi under the Gauss Hypergeometric prior for different values of the hyper-parameter γ: γ = 0 (solid), γ = 0·5 (dashed), γ = 1 (dotted), γ = 5 (dot-dash), γ = 10 (two-dash). Each row corresponds to a different value of yi ∈ {0, 1, 2, 5, 10}.

Theorem 3.1. Suppose yi ∼ Poi(θi) and let p(κi | yi, τ, γ) denote the posterior density of κi given yi and fixed τ, γ for the GH prior (κi | τ, γ) ∼ GH(1/2, 1/2, γ, τ² − 1). Then the posterior density of κi satisfies:

pr(κi > η | yi, τ, γ) ≤ C(η) {1/(1 − τ²)} {1 + τ²η/(1 − η)}^{−(yi−1/2−γ)}  ⇒  (κi | yi, τ, γ) → δ{0} as yi → ∞,
where δ{0} denotes the point mass at zero.

Theorem 3.2. Suppose yi ∼ Poi(θi) and let p(κi | yi, τ, γ) denote the posterior density of κi given yi and fixed τ, γ for the GH prior (κi | τ, γ) ∼ GH(1/2, 1/2, γ, τ² − 1). Then the posterior density of κi satisfies, for all values of yi ≤ γ − 1/2,

pr(κi < η | yi, τ, γ) ≤ {τ²/(1 − η)}^d  ⇒  (κi | yi, τ, γ) →d δ{1} as τ → 0, where d = (γ − 1/2 − yi) > 0,

and δ{1} denotes the point mass at one.
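Both regimes are easy to observe numerically by integrating the posterior density in (5) directly. A sketch assuming scipy (the function name is ours; α = 1/2 as in the default prior):

```python
import numpy as np
from scipy.integrate import quad

def post_mass_above(y, eta, alpha=0.5, gamma=1.0, tau2=0.01):
    """pr(kappa > eta | y, tau, gamma) under the posterior density (5), by quadrature."""
    dens = lambda k: k**(alpha - 0.5) * (1 - k)**(y - 0.5) * (1 - (1 - tau2) * k)**(-gamma)
    total, _ = quad(dens, 0.0, 1.0)
    tail, _ = quad(dens, eta, 1.0)
    return tail / total

for y in (0, 2, 10, 50):           # Theorem 3.1: mass above eta vanishes as y grows
    print(y, post_mass_above(y, 0.5))
print(post_mass_above(0, 0.5, gamma=5.0, tau2=0.01))  # Theorem 3.2: mass piles up near 1
```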
3.2 Kullback-Leibler Risk Efficiency

An infinite spike near zero in the marginal prior density p(θi) favors efficient handling of sparsity. This can be formalized in terms of the rate of decay of the Kullback-Leibler divergence between the Bayes estimator of the density and the true sampling density when the true parameter is zero. The proof is based on a lemma by Clarke and Barron (1990) that provides an upper bound on the risk in terms of the prior mass in Kullback-Leibler neighbourhoods around the true parameter value.

For any θi in the parameter space and yi ∼ Poi(θi), let the probability mass function be denoted by p(θi) = Poi(yi | θi). Let θ0 denote the true value of the parameter, and let the Kullback-Leibler divergence of p(θi) from the sampling density under the true parameter value, p(θ0), be denoted by Δ_KL{p(θi), p(θ0)}. Let the affinity between p(θi) and p(θ0), measuring the degree of similarity between the two distributions, be defined as Aff{p(θi), p(θ0)} = ∫ p(θi)^{1/2} p(θ0)^{1/2} dµ, and let the chi-square distance between the two distributions be defined as Δ_{χ²}{p(θi), p(θ0)} = ∫ {p(θi) − p(θ0)}² / p(θ0) dµ, where µ denotes the dominating measure. Then it follows that:
−2 log Aff{p(θi), p(θ0)} ≤ Δ_KL{p(θi), p(θ0)} ≤ log[Δ_{χ²}{p(θi), p(θ0)} + 1]
⇒ (√θi − √θ0)² ≤ θ0 − θi + θi log(θi/θ0) ≤ (θi − θ0)²/θi
⇒ {θ : (θi − θ0)²/θi ≤ ε} ⊂ Aε = {θ : Δ_KL{p(θi), p(θ0)} ≤ ε} ⊂ {θ : (√θi − √θ0)² ≤ ε}.   (9)

The key insight is that the prior mass of any weak neighbourhood of zero, Aε = {θ : Δ_KL{p(θi), p(θ0)} ≤ ε}, under any prior that is bounded at zero will be dominated by the corresponding prior mass under the GH prior. This is analogous to the Kullback-Leibler super-efficiency of horseshoe estimators (Polson and Scott (2010), Theorem 2). Let ν(dθi) and νk(dθi | y1, . . . , yk) be the prior and the posterior measure based on k ≤ n observations y1, . . . , yk, respectively, and let p̂k(θi) = ∫ p(yi | θi) νk(dθi | y1, . . . , yk) be the posterior mean estimator of the density. Let the prior ν(dθi) be information dense at the true sampling density p(θ0), in the sense that all relative entropy neighborhoods have positive prior measure: ν(Aε) > 0 for all ε > 0. Then Barron's lemma states that the Cesàro-average risk Rn satisfies the following bound for all n:
Rn = n^{−1} Σ_{k=1}^{n} Δ_KL{p(θ0), p̂k(θi)} ≤ ε − n^{−1} log ν(Aε)  for all ε > 0.
This gives a direct relationship between the Cesàro-average risk of the Bayes estimator and the prior mass of a weak neighbourhood of the true density. The Cesàro-average risk is an important quantity in information-theoretic asymptotics; its decay implies convergence in Kullback-Leibler divergence, which is a stronger mode of convergence than ℓ1 convergence and is preferred in certain applications such as portfolio selection (vide Clarke and Barron (1990)).

Theorem 3.3. Suppose the true sampling model is Poi(θ0), and let Rn^GH be the Cesàro-average risk of the Bayes estimator p̂n^GH, the posterior mean estimator of the density under the Gauss Hypergeometric prior. The rates of convergence of Rn^GH under θ0 = 0 and θ0 ≠ 0 are

Rn^GH = O[n^{−1}{log(n) − log log n}] if θ0 = 0;  Rn^GH = O{n^{−1} log(n)} if θ0 ≠ 0.

The proof follows from a lower bound on the prior mass in a neighbourhood of the true θ0, derived from the lower bound on p(θi) in (8). Theorem 3.3 implies that the GH prior is Kullback-Leibler super-efficient (Polson and Scott (2010)), in that it can reconstruct the true sampling density faster than any prior with a bounded density at the origin when θ0 = 0; for θ0 > 0, Rn^GH decays at least as fast as the risk under any other prior. The extra log log n term has minimal impact.
3.3 Robustness of the Posterior Mean

Efron (2011) introduced the following simple empirical Bayes estimation formula, called Tweedie's formula, for one-parameter exponential families. Suppose f(yi | ηi) = f_{ηi}(yi) = e^{ηi yi − ψ(ηi)} f0(yi), ηi ∼ g(ηi), where ηi is the canonical parameter of the family and f0(yi) is the density when ηi = 0. Then the posterior of ηi given yi is

g(ηi | yi) = e^{ηi yi − λ(yi)} {g(ηi) e^{−ψ(ηi)}},  where λ(yi) = log{f(yi)/f0(yi)},   (10)

with f(·) the marginal density f(yi) = ∫ f_{ηi}(yi) g(ηi) dηi. The posterior distribution of ηi follows a one-parameter exponential family with canonical parameter yi and cumulant generating function λ(yi). It follows that the posterior mean and variance can be written as

E(ηi | yi) = λ′(yi) = l′(yi) − l0′(yi);  V(ηi | yi) = λ″(yi) = l″(yi) − l0″(yi).   (11)
Tweedie's formula, applied to the Poisson mean problem, yields the following expressions for the posterior mean and variance of ηi = log(θi):

E(ηi | yi) = {log Γ(yi + 1)}′ + l′(yi),  V(ηi | yi) = {log Γ(yi + 1)}″ + l″(yi),   (12)

where l(yi) = log m(yi) = log{∫ P_{ηi}(yi) dG(ηi)} denotes the log-marginal density of yi. The following theorem establishes the tail robustness properties for the canonical parameter log(θi).

Theorem 3.4. Suppose yi ∼ Poi(θi), i = 1, . . . , n, and let m(yi) denote the marginal density of yi under the Gauss Hypergeometric prior for fixed τ and γ, with (θi | κi) ∼ Ga{α, κi(1 − κi)^{−1}} and (κi | τ, γ) ∼ GH(1/2, 1/2, γ, z = τ² − 1). Then

lim_{yi→∞} m′(yi)/m(yi) = 0,  and  |E(log θi | yi) − log(yi)| → 0 as yi → ∞.
The importance of Theorem 3.4 lies in establishing the "Bayesian robustness" of the GH prior for the canonical parameter ηi = log θi, which guarantees against under-estimation of ηi for large values of log(yi). The conclusions of Theorems 3.3 and 3.4 also hold for the horseshoe and the three-parameter beta priors, with similar asymptotic rates of decay, since these are special cases of the Gauss Hypergeometric prior and possess the spike at zero that is critical for Kullback-Leibler super-efficiency. The GH prior stands out from the rest by dint of its flexible shrinkage property (Theorem 3.2), which enables it to shrink observations relative to the parameter γ; this mechanism is not available for the other shrinkage priors.
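As an illustration, Theorem 3.4 can be checked numerically. The sketch below (assuming scipy; names ours) computes m(y) by quadrature — dropping constants that do not vary with y, since they cancel in l′(y) — and applies Tweedie's formula (12) with a central-difference approximation of l′(y) that treats y as continuous through the gamma function:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln, digamma

ALPHA, GAMMA, TAU2 = 0.5, 1.0, 0.01

def marginal(y):
    """m(y) up to a y-free constant: NB(alpha, 1 - kappa) likelihood (2) mixed over
    the unnormalized GH(1/2, 1/2, gamma, tau^2 - 1) prior."""
    def integrand(k):
        log_lik = gammaln(y + ALPHA) - gammaln(y + 1) + y * np.log1p(-k) + ALPHA * np.log(k)
        log_pri = -0.5 * np.log(k) - 0.5 * np.log1p(-k) - GAMMA * np.log1p(-(1 - TAU2) * k)
        return np.exp(log_lik + log_pri)
    return quad(integrand, 0.0, 1.0)[0]

def e_log_theta(y, h=1e-3):
    # Tweedie's formula (12): E(log theta | y) = psi(y + 1) + l'(y)
    lprime = (np.log(marginal(y + h)) - np.log(marginal(y - h))) / (2 * h)
    return digamma(y + 1) + lprime

for y in (5, 20, 80):
    print(y, e_log_theta(y) - np.log(y))  # tail robustness: the gap shrinks as y grows
```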
4 Simulation Studies

4.1 Sparse Count Data

In this section, we report simulation studies comparing the performance of different estimators of a high dimensional sparse Poisson mean vector. We compare our GH estimator with the horseshoe estimator, the Kiefer–Wolfowitz nonparametric maximum likelihood estimator, Robbins' frequency-ratio estimator and a global shrinkage Bayes estimator, where the global shrinkage estimator is obtained by putting a standard conjugate gamma prior on the Poisson means. To assess estimation accuracy, we use the total Bayes risk BR(θ) = E(‖θ̂ − θ‖²) as the performance criterion. Some authors (Clevenson and Zidek (1975)) prefer a local Kullback-Leibler loss Δ_KL(θ̂, θ) = Σ_{i=1}^n (θ̂i − θi)²/θi over the squared error loss as a natural loss function for estimating a Poisson mean vector, with Brown et al. (2013) providing a discussion on the choice of loss function.

We generate the parameters θi and the observations from the following model:

θi ∼ (1 − ω)δ{0} + ω|t3|,  yi ∼ Poi(θi),  i = 1, . . . , n,

with |t3| denoting a folded t-distribution with 3 degrees of freedom. The simulation scheme mimics the popular 'spike-and-slab' model for continuous data, with the spike and the slab representing the noise and the signal distributions, respectively; shrinkage priors are built so that the shape of the prior on θi resembles the 'spike-and-slab' density. We generate 1,000 data-sets from the above model for each combination of multiplicity n = 200, 500 and proportion of non-zero parameters ω = 0·1, 0·15, 0·2. For each data-set, we estimate the parameter θ, and we report the means and standard deviations of the squared error loss for each estimator over these 1,000 simulations. The results are given in Table 1, with the smallest entry in each row marking the row winner.

Figure 4 shows the boxplots for ω = 0·1 and n = 200, 500. As expected, the GH estimator outperforms its competitors in this sparse simulation set-up across different values of multiplicity and sparsity. The Kiefer–Wolfowitz method also performs well, and is a close runner-up in terms of accuracy. The frequency-ratio estimator does not do well, as expected, and the difference is more prominent for higher multiplicity. The global shrinkage prior has the lowest accuracy, as it is not built to handle large sparse mean vectors.
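The data-generating mechanism above is straightforward to reproduce; a minimal sketch (numpy assumed; names ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=200, omega=0.10, df=3):
    """theta_i ~ (1 - omega) delta_{0} + omega |t_df|, and y_i ~ Poi(theta_i)."""
    signal = rng.random(n) < omega
    theta = np.where(signal, np.abs(rng.standard_t(df, size=n)), 0.0)
    return theta, rng.poisson(theta)

theta, y = simulate()
print(np.sum((y - theta) ** 2))  # total squared-error loss of the unshrunk estimate y
```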
4.2 Multiple Testing

A nice feature of the shrinkage prior approach is that the shrinkage weights 1 − κ̂i can be interpreted as 'pseudo' posterior inclusion probabilities (Polson and Scott, 2012), which induces a multiple testing rule.
Table 1: Comparison of total Bayes risk (standard deviations in parentheses) for different Bayes and empirical Bayes procedures for θi ∼ (1 − ω)δ{0} + ω|t3|, i = 1, . . . , n.

  n    ω      Horseshoe      Kiefer–Wolfowitz   Gauss Hypergeometric   Robbins        Global
 200  0·10    20·42 (7·4)    23·02 (14·1)       16·42 (11·7)           31·17 (22·7)   59·20 (5·4)
 200  0·15    29·80 (12·5)   12·59 (5·0)        10·94 (4·6)            28·24 (20·0)   18·90 (1·5)
 200  0·20    39·28 (14·1)   27·96 (14·9)       22·79 (11·1)           35·57 (24·7)   56·26 (6·6)
 500  0·10    55·39 (17·1)   43·66 (15·8)       36·97 (14·3)           83·13 (47·5)   101·0 (5·8)
 500  0·15    92·97 (23·4)   40·09 (9·0)        44·10 (10·5)           92·14 (53·3)   76·20 (4·3)
 500  0·20    113·9 (27·2)   69·11 (16·6)       65·58 (16·1)           136·7 (74·7)   130·3 (9·3)
Figure 4: Boxplots of the estimation errors of the competing estimators, namely the Horseshoe (HS), Gauss Hypergeometric (GH), Kiefer–Wolfowitz (KW), Robbins' and Global shrinkage estimators, for n = 200 and n = 500, for a fixed value of ω = 0·1.

Analogous quantities can be calculated for the Kiefer–Wolfowitz estimator using the estimate of the underlying distribution G(·) of the parameters θi. Here we show that the decision rule induced by the GH prior outperforms the decision rules induced by the three-parameter beta prior (Armagan et al. (2011)) and the Kiefer–Wolfowitz estimator. We choose a = b = 1/2 and z = τ² − 1 as the hyper-parameter values for the three-parameter beta. We generate n = 200 observations from a 'contaminated' zero-inflated model: yi is either zero with probability 1 − ω or drawn from a Poi(4) with probability ω, and we contaminate the data by setting a proportion p of the 0's equal to 1. Our goal is to detect the non-null parameters. The non-null parameter value λ = 4 is chosen to be of the order of the maximum order statistic Mn of n independent and identically distributed Poisson random variables: it can be shown that Mn lies inside the interval [In, In + 1] with probability 1 as n → ∞, where In ∼ log n / log log n (Anderson (1970)), and the midpoint of the interval is I200 + 0·5 = 3·67.

The decision rule induced by the shrinkage priors is to reject the ith null hypothesis if 1 − E(κi | yi, τ, γ) > δ for some fixed threshold δ. We calculate this threshold by applying the k-means clustering algorithm to the shrinkage weights with number of clusters k = 2, setting δ to be the mean of the cluster centers. To compare the shrinkage priors, we calculate the number of misclassified hypotheses over 1,000 simulated data-sets for both shrinkage rules, for 10 equidistant values of the proportion of 'sparsity' ω ∈ [0·1, 0·3], with the contamination proportion fixed at p = 0·1. The boxplots of the numbers of misclassification errors, shown in Figure 5, and the mean numbers
of misclassified hypotheses, given in Table 2, suggest that the GH decision rule outperforms the three-parameter beta as well as the Kiefer–Wolfowitz method in all sparse scenarios.
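A sketch of the thresholding rule (assuming scipy ≥ 1.7 for the seed argument of kmeans2; the function name is ours):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def multiple_testing_rule(weights, seed=0):
    """weights[i] = 1 - E(kappa_i | y_i); reject H_{0i} when weights[i] > delta,
    with delta the mean of the two k-means cluster centres."""
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    centres, _ = kmeans2(w, 2, minit='++', seed=seed)
    delta = float(centres.mean())
    return weights > delta, delta
```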
Figure 5: Number of misclassified hypotheses under competing decision rules, the three-parameter beta (TPB), Gauss Hypergeometric (GH), and the Kiefer–Wolfowitz (KW), for different values of the proportion of non-null effects.

Table 2: Mean misclassification errors for the three-parameter beta prior, the Gauss Hypergeometric prior and the Kiefer–Wolfowitz nonparametric maximum likelihood estimator.

 Sparsity               0·3    0·27   0·25   0·23   0·21   0·18   0·16   0·14   0·12   0·1
 Three-parameter beta   7·1    7·1    7·0    6·5    6·5    6·3    6·2    6·3    5·8    5·6
 Gauss Hypergeometric   5·9    5·4    4·7    4·4    4·0    3·5    3·0    2·7    2·1    1·7
 Kiefer–Wolfowitz       13·7   12·7   11·7   10·5   9·0    7·6    6·1    4·8    3·4    2·8
5 Real Data Example

In this section, we apply our methods to count data arising from a massive sequencing study, the Exome Aggregation Consortium (ExAC). The ExAC database reports the total number of mutated alleles or variants along the whole exome for 60,076 individuals, and provides information about genetic variation in the human population. It is widely recognized in the scientific community that such rare variants are responsible for both common and rare diseases (Pritchard, 2001),
and currently there is a lack of statistical methods that incorporate the discrete nature of the point mutations as well as their physical locations. The frequency of mutated alleles is very low or zero for a vast majority of locations across the genome, and substantially higher in certain functionally relevant genomic locations, such as promoters and insulators (Lewin et al. (2011)). An important problem for rare mutations is the identification of mutational 'hotspots', where the mutation rate significantly exceeds the background; such mutational hotspots might be enriched with disease-causing rare variants (Ionita-Laza et al. (2012)). Our goal is to use the shrinkage prior developed in this paper to identify the potential hotspots in a genomic region. Towards this, we consider all variants with minor allele frequency less than 0·05% (Cirulli et al., 2015) on the gene PIK3CA, which has been implicated in ovarian and cervical cancers (Shayesteh et al. (1999)). The mutation dataset contains the number of 'mutated' alleles along the gene PIK3CA for 240 amino acid positions ranging from 0 to 1066. Intuitively, the flexible shrinkage property of the GH prior will lead to better detection of the 'true' hotspots by shrinking low counts that are left unshrunk by the other methods. The number of mutated alleles at the ith position, denoted by yi, ranges from 0 to 58. We model yi ∼ Poi(Ni θi) independently, where θi is the mutation rate and Ni is the total number of alleles at position i, which in our case equals Ni = N = 2 × 60,076 = 120,152.

Figure 6: The histogram of the count data and a comparison of the shrinkage profiles (pseudo inclusion probabilities 1 − κ̂ against amino acid position) for different estimators, namely the Gauss Hypergeometric (GH), the three-parameter beta (TPB) / horseshoe (HS) prior, and the Kiefer–Wolfowitz (KW) nonparametric maximum likelihood estimator.

We compare the shrinkage profile of the GH prior with the three-parameter beta / horseshoe prior and the Kiefer–Wolfowitz estimator in terms of the number of identified mutational hotspots. We apply the multiple testing rule proposed in §4.2 to identify the variants for which the number of mutated alleles is substantially higher than the background. The numbers of variants identified as 'non-null' by the three methods, namely the GH prior, the three-parameter beta/horseshoe prior and the Kiefer–Wolfowitz method, are 7, 81 and 56, respectively. Figure 6 shows the pseudo inclusion probabilities of the competing methods and
clearly demonstrates that the GH prior has a sharper ability to segregate the substantially higher signals from the background noise. The supplementary file contains more details about the data and possible mutation clusters in the gene PIK3CA.
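The empirical Bayes estimates (τ̂, γ̂) used throughout can be obtained in several ways; the sketch below assumes a simple grid search maximizing the marginal likelihood — our choice for illustration, not necessarily the authors' procedure — with the exposure N absorbed into θi:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn, gammaln, hyp2f1

ALPHA = 0.5

def log_marginal(y, gamma, tau2):
    """log m(y | gamma, tau2): NB(alpha, 1 - kappa) likelihood (2) mixed over the
    normalized GH(1/2, 1/2, gamma, tau2 - 1) prior, evaluated by quadrature."""
    logC = -np.log(beta_fn(0.5, 0.5)) - np.log(hyp2f1(gamma, 0.5, 1.0, 1.0 - tau2))
    def integrand(k):
        log_lik = gammaln(y + ALPHA) - gammaln(ALPHA) - gammaln(y + 1) \
                  + y * np.log1p(-k) + ALPHA * np.log(k)
        log_pri = logC - 0.5 * np.log(k) - 0.5 * np.log1p(-k) \
                  - gamma * np.log1p(-(1.0 - tau2) * k)
        return np.exp(log_lik + log_pri)
    return np.log(quad(integrand, 0.0, 1.0)[0])

def select_hyperparams(y, gammas=(0.5, 1.0, 5.0, 10.0), taus2=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Empirical Bayes grid search: maximize sum_i log m(y_i | gamma, tau2)."""
    score = lambda g, t2: sum(log_marginal(yi, g, t2) for yi in y)
    return max(((score(g, t2), g, t2) for g in gammas for t2 in taus2))[1:]
```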
Acknowledgment

The authors thank Dr Sandeep Dave for invaluable help in learning about exome sequencing data. This research was funded by the National Science Foundation through the Statistical and Applied Mathematical Sciences Institute and a grant from the National Institutes of Health.
Supplementary material

The supplementary file contains more details about the dataset used in §5.
Appendix 1

Proof of Lemma 2.1

Proof. Using the Laplace transform 1/(c + z) = ∫_0^∞ e^{−(c+z)t} dt gives the following:

E{1/(c + Z)} = ∫_0^∞ e^{−ct} ∫_0^∞ {β^α/Γ(α)} z^{α−1} e^{−(β+t)z} dz dt = ∫_0^∞ e^{−ct} β^α (β + t)^{−α} dt
= e^{cβ} β^α c^{α−1} ∫_{cβ}^∞ s^{−α} e^{−s} ds = e^{cβ} β^α c^{α−1} Γ(1 − α, βc),

where the last step follows by the change of variable s = c(β + t).
Proof of Proposition 2.1

Proof. For simplicity of notation, we drop the subscript i in the proofs and use θ, y and κ to denote a generic element of the corresponding vector rather than the vector itself. Transforming u = 1/λ², we get:

p(θ) = {2/(πΓ(α))} ∫_0^∞ θ^{α−1} exp(−θ/λ²) λ^{−2α} (λ² + 1)^{−1} dλ
= {1/(πΓ(α))} ∫_0^∞ θ^{α−1} exp(−θu) u^{α−1/2} (1 + u)^{−1} du
= {Γ(α + 1/2)/(πΓ(α))} θ^{−3/2} E{1/(1 + U)}, where U ∼ Ga(α + 1/2, θ),
= {1/(√π β(α, 1/2))} e^θ θ^{α−1} Γ(1/2 − α, θ),

where the last step follows from Lemma 2.1 with c = 1.
Proof of Corollary 2.1

Proof. For the first set of bounds, for α = 1 we use the recurrence relation for incomplete gamma functions, Γ(x + 1, y) = xΓ(x, y) + y^x e^{−y}, to derive the following:

p(θ | α = 1) = {1/(2√π)} e^θ Γ(−1/2, θ) = (1/√π) e^θ {θ^{−1/2} e^{−θ} − Γ(1/2, θ)} = π^{−1/2} θ^{−1/2} − e^θ erfc(√θ).

The bounds then follow from the inequality (Abramowitz and Stegun (1964), 7.1.13):

{x + (x² + 2)^{1/2}}^{−1} ≤ e^{x²} ∫_x^∞ e^{−t²} dt ≤ {x + (x² + 4/π)^{1/2}}^{−1}.

For α = 1/2, the prior density p(θ) can be written as

p(θ | α = 1/2) = {1/(√π β(1/2, 1/2))} e^θ θ^{−1/2} Γ(0, θ) = π^{−3/2} e^θ θ^{−1/2} E1(θ),   (13)

where E1(·) is the exponential integral function, E1(z) = ∫_z^∞ e^{−t}/t dt, which satisfies the tight upper and lower bounds

(1/2) e^{−t} log(1 + 2/t) < E1(t) < e^{−t} log(1 + 1/t).

The bounds on the prior density for α = 1/2 can be easily derived by applying these bounds on the exponential integral to the expression for the marginal prior in Equation (13).
Proof of Theorem 3.1

Proof. To prove the theorem, we use the fact that the posterior density of κ can be written as the product of p1(κ) = √κ, p2(κ) = (1 − κ)^{y−1/2} and p3(κ) = {1 − (1 − τ²)κ}^{−γ}, together with the fact that for any τ² ∈ (0, 1) and γ, (1 − κ)^{−γ} > {1 − (1 − τ²)κ}^{−γ}, which implies

{1 − (1 − τ²)κ}^{y−1/2−γ} ≤ p2(κ) p3(κ) ≤ (1 − κ)^{y−1/2−γ}  for γ ≥ 0, τ² ≥ 0.   (14)

Both bounds in Equation (14) are monotone in κ ∈ (0, 1) for any y > 1/2 + γ, and can be bounded by their values at the boundaries. We derive the following upper bound on the tail probability of the posterior distribution of κ given y, τ and γ:

P(κ > η | y, τ, γ) ≤ ∫_η^1 κ^{1/2} (1 − κ)^{y−1/2} {1 − (1 − τ²)κ}^{−γ} dκ / ∫_0^η κ^{1/2} (1 − κ)^{y−1/2} {1 − (1 − τ²)κ}^{−γ} dκ
≤ ∫_η^1 κ^{1/2} (1 − κ)^{y−1/2−γ} dκ / [(1 − τ²)^{−1/2} ∫_0^{η(1−τ²)} z^{1/2} (1 − z)^{y−1/2−γ} dz]
≤ (1 − η)^{y−1/2−γ} (1 − η^{3/2}) / [{1 − (1 − τ²)η}^{y−1/2−γ} (1 − τ²) η^{3/2}] = C(η) {1/(1 − τ²)} {1 + τ²η/(1 − η)}^{−(y−1/2−γ)}.

The proof follows from the fact that the right-hand side of the last equation goes to zero as y goes to infinity for any non-zero η and τ².
Appendix 2

Proof of Theorem 3.2

Proof. We use the same factorization as in the proof of Theorem 3.1 and the fact that both bounds in (14) are decreasing functions of κ for y ≤ γ − 1/2. We also write y − (γ − 1/2) = −d < 0 for small y. Then

P(κ < η | y, τ, γ) ≤ ∫_0^η κ^{1/2} (1 − κ)^{−d−1} dκ / ∫_η^1 κ^{1/2} {1 − (1 − τ²)κ}^{−d−1} dκ
≤ √η ∫_0^η (1 − κ)^{−d−1} dκ / [√η (1 − τ²)^{−1} ∫_{τ²}^{1−(1−τ²)η} z^{−d−1} dz]
= (1 − τ²){(1 − η)^{−d} − 1} / {τ^{−2d} − (1 − (1 − τ²)η)^{−d}} ≤ (1 − τ²){(1 − η)^{−d} − 1} / {τ^{−2d}(1 + o(1))} ≤ {τ²/(1 − η)}^d.

Hence for any η ∈ (0, 1), the posterior mass of κ in the interval (0, η) shrinks to zero for all values of y ∈ (0, γ − 1/2) as τ → 0.
Appendix 3

Proof of Theorem 3.3

Proof. We substitute ε = 1/n in the general bound for Rn, which corresponds to the risk for n independent and identically distributed observations y1, . . . , yn. From Equation (9), we get the following upper bound on the Cesàro-average risk Rn for any prior ν that is information dense:

Rn ≤ n^{−1} − n^{−1} log ν({θ : θ ≤ n^{−1}}).   (15)

Clearly, for any prior with a bounded density at θ0, the rate of convergence of Rn is O{n^{−1} log(n)}. Hence, for any non-zero θ0, the rate of convergence of Rn for the GH prior equals that of any prior with bounded density at θ0 ≠ 0. On the other hand, if θ0 = 0, we use the bounds from Equation (8) on the prior density at θ to get:

∫_0^{1/n} ν(dθ) ≥ (2π^{3/2})^{−1} ∫_0^{1/n} θ^{−1/2} log(1 + 2/θ) dθ ≥ π^{−3/2} {n^{−1/2} log(1 + 2n) + 2√2 arctan(√(1/2n))} = O(n^{−1/2} log(n)).

Using the bound in Equation (15), we can derive the optimal rate of convergence for the Cesàro-average risk given in Theorem 3.3.
Appendix 4

Proof of Theorem 3.4

Proof. For the first part of Theorem 3.4, we note the following:

m′(y)/m(y) = (d/dy) ∫_0^1 κ^{1/2} (1 − κ)^{y−1/2} {1 − (1 − τ²)κ}^{−γ} dκ / ∫_0^1 κ^{1/2} (1 − κ)^{y−1/2} {1 − (1 − τ²)κ}^{−γ} dκ
= ∫_0^1 κ^{1/2} {(d/dy)(1 − κ)^{y−1/2}} {1 − (1 − τ²)κ}^{−γ} dκ / ∫_0^1 κ^{1/2} (1 − κ)^{y−1/2} {1 − (1 − τ²)κ}^{−γ} dκ = {log(y − 1/2)}^{−1}.

Hence from Tweedie's formula for the Poisson mean (vide Equation (12)), it follows that

|E(η | y) − log(y)| = {ψ(y + 1) − log(y)} + {log(y − 1/2)}^{−1},   (16)

where ψ(z) = (d/dz) log Γ(z) is the digamma function (Abramowitz and Stegun (1964), Ch. 6), which satisfies

ψ(z) ≍ log(z) − 1/(2z) − 1/(12z²) + 1/(120z⁴) + ⋯ = log(z) + O(1/z).   (17)

Using Equations (16) and (17) together, we arrive at

|E(η | y) − log(y)| = log(1 + y^{−1}) + {log(y − 1/2)}^{−1} + O(y^{−1}) = O({log(y)}^{−1}),

where the right-hand side of the above equation goes to zero as y → ∞.
References

Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55 of National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, Washington, D.C.

Anderson, C. W. (1970). Extreme value theory for a class of discrete distributions with applications to some stochastic processes. Journal of Applied Probability, 7:99–113.

Armagan, A., Clyde, M., and Dunson, D. B. (2011). Generalized beta mixtures of Gaussians. Advances in Neural Information Processing Systems, 24:523–531.

Armagan, A., Dunson, D. B., and Lee, J. (2013). Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119–143.

Armero, C. and Bayarri, M. J. (1994). Prior assessments for prediction in queues. Journal of the Royal Statistical Society, Series D (The Statistician), 43(1):139–153.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2014). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, in press.
Brown, L. D. and Greenshtein, E. (2009). Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, 37(4):1685–1704.

Brown, L. D., Greenshtein, E., and Ritov, Y. (2013). The Poisson compound decision problem revisited. Journal of the American Statistical Association, 108(502):741–749.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97:465–480.

Cirulli, E. T., Lasseigne, B. N., Petrovski, S., Sapp, P. C., Dion, P. A., Leblond, C. S., Couthouis, J., Lu, Y.-F., Wang, Q., Krueger, B. J., et al. (2015). Exome sequencing in amyotrophic lateral sclerosis identifies risk genes and pathways. Science, 347(6229):1436–1441.

Clarke, B. S. and Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36(3):453–471.

Clevenson, M. L. and Zidek, J. V. (1975). Simultaneous estimation of the means of independent Poisson laws. Journal of the American Statistical Association, 70(351):698–705.

Efron, B. (2003). Robbins, empirical Bayes and microarrays. The Annals of Statistics, 31(2):366–378.

Efron, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614.

Efron, B. (2014). Two modeling strategies for empirical Bayes estimation. Statistical Science, 29(2):285–301.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–534.

Greenshtein, E. and Ritov, Y. (2009). Asymptotic efficiency of simple decisions for the compound decision problem. In Optimality, volume 57 of IMS Lecture Notes Monograph Series, pages 266–275. Institute of Mathematical Statistics, Beachwood, OH.

Ionita-Laza, I., Makarov, V., Buxbaum, J. D., et al. (2012). Scan-statistic approach identifies clusters of rare disease variants in LRP2, a gene linked and associated with autism spectrum disorders, in three datasets. The American Journal of Human Genetics, 90(6):1002–1013.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27:887–906.

Koenker, R. and Mizera, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685.
Lewin, B., Krebs, J., Kilpatrick, S. T., and Goldstein, E. S. (2011). Lewin's Genes X, volume 10. Jones & Bartlett Learning.

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753.

Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: sparse Bayesian regularization and prediction. Bayesian Statistics, 9:501–538.

Polson, N. G. and Scott, J. G. (2012). Good, great, or lucky? Screening for firms with sustained superior performance using heavy-tailed priors. The Annals of Applied Statistics, 6(1):161–185.

Pritchard, J. K. (2001). Are rare variants responsible for susceptibility to complex diseases? The American Journal of Human Genetics, 69(1):124–137.

Robbins, H. (1956). An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I, pages 157–163. University of California Press, Berkeley and Los Angeles.

Shayesteh, L., Lu, Y., Kuo, W.-L., Baldocchi, R., Godfrey, T., Collins, C., Pinkel, D., Powell, B., Mills, G. B., and Gray, J. W. (1999). PIK3CA is implicated as an oncogene in ovarian cancer. Nature Genetics, 21(1):99–102.

van der Pas, S., Kleijn, B., and van der Vaart, A. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2):2585–2618.

Yang, Z., Hardin, J. W., and Addy, C. L. (2009). Testing overdispersion in the zero-inflated Poisson model. Journal of Statistical Planning and Inference, 139(9):3340–3353.

Zhang, C.-H. (2003). Compound decision theory and empirical Bayes methods. The Annals of Statistics, 31(2):379–390.