Bayesian Analysis (2010)
5, Number 1, pp. 171–188
Inference with normal-gamma prior distributions in regression problems

Jim. E. Griffin∗ and Philip. J. Brown†

Abstract. This paper considers the effects of placing an absolutely continuous prior distribution on the regression coefficients of a linear model. We show that the posterior expectation is a matrix-shrunken version of the least squares estimate where the shrinkage matrix depends on the derivatives of the prior predictive density of the least squares estimate. The special case of the normal-gamma prior, which generalizes the Bayesian Lasso (Park and Casella 2008), is studied in depth. We discuss the prior interpretation and the posterior effects of hyperparameter choice and suggest a data-dependent default prior. Simulations and a chemometric example are used to compare the performance of the normal-gamma and the Bayesian Lasso in terms of out-of-sample predictive performance.

Keywords: Multiple regression, p > n, Normal-Gamma prior, “Spike-and-slab” prior, Bayesian Lasso, Posterior moments, Shrinkage, Scale mixture of normals, Markov chain Monte Carlo
1 Introduction

The standard multiple linear regression model assumes that a vector of responses $y = (y_1, y_2, \dots, y_n)^T$ can be represented as

$$y = \alpha 1 + X\beta + \epsilon \qquad (1)$$

where ε = (ε1, . . . , εn) are independent, p(εi) = N(εi|0, σ²) and X is an n × p matrix of explanatory variables. Here, N(x|µ, σ²) denotes the density of a normal distribution with mean µ and variance σ². The scalar α is the intercept and 1 is an n × 1 vector of ones. This paper is concerned with the Bayesian analysis of this model and, in particular, the choice of the prior distribution of the (p × 1)-dimensional vector of regression coefficients β. A zero mean normal prior leads to the ridge estimator as posterior mean. This estimator performs poorly if there are large differences in the size of regression coefficients. Alternatively, we could perform variable selection and assume that only a subset of the variables have non-zero regression coefficients, which mitigates the problems associated with the normal prior. The standard approach is the “spike-and-slab” prior (Mitchell and Beauchamp 1988). An indicator variable zi is introduced to identify whether the i-th variable is included in the model (zi = 1) or excluded (zi = 0). The prior for βi can be written as

$$\pi(\beta_i) = z_i\, \mathrm{N}(\beta_i \mid 0, \sigma_\beta^2) + (1 - z_i)\, \delta_{\beta_i = 0}, \qquad p(z_i = 1) = w,$$
∗ School of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, UK
† School of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, UK
© 2010 International Society for Bayesian Analysis
DOI:10.1214/10-BA507
172
normal-gamma priors
where $\delta_{\beta_i = 0}$ represents the Dirac delta measure which places all its mass on zero and Mitchell and Beauchamp’s uniform slab is replaced by the now traditional Gaussian “slab”. The independent Bernoulli variables zi have mean w, and so the hyperparameter w can be interpreted as the prior proportion of non-zero regressors. Alternatively, a prior distribution can be used for w and its value inferred from the data. The scale σβ controls the variance of the prior. As we shall show in Section 2, this prior induces adaptive matrix shrinkage (Chamberlain and Leamer 1976) where the norm |E[β|y]| is a decreasing function of |β̂| and larger |β̂| are shrunk less than smaller |β̂|. However, for fixed design X and very large |β̂|, the shrinkage asymptotes to some non-zero value determined by σβ² (and the eigenvalues of XᵀX).

Absolutely continuous prior distributions can also induce adaptive shrinkage and represent an interesting alternative to “spike-and-slab” priors. Park and Casella (2008) describe a Bayesian analysis where the regression coefficients are given independent double exponential prior distributions. The idea is motivated by the equivalence of the maximum a posteriori (MAP) estimator under this prior distribution and the Lasso estimator (Tibshirani 1996). The prior distribution can also be motivated as a member of the scale mixture of normals family. If we assume that βi|Ψi ∼ N(βi|0, Ψi) then a double exponential prior for βi arises when Ψi follows an exponential distribution (Andrews and Mallows 1974). The prior can be seen as a ridge prior where the variance of the normal is allowed to change from regression coefficient to regression coefficient, allowing differences in their scales. Bayesian estimation of regression models with other absolutely continuous priors has been investigated by several authors (MacKay 1996; Tipping 2001; Figueiredo and Jain 2001; Figueiredo 2003; Kiiveri 2008; Caron and Doucet 2008). However, their work was restricted to MAP estimation rather than fully considering the posterior distribution.

The performance of estimators based on absolutely continuous priors depends on the form of the prior. For example, the double exponential prior used in the Bayesian Lasso has a single hyperparameter. In many regression problems this inflexibility can have a considerable effect on the inference since it restricts the prior beliefs that can be expressed. As we will show, this choice fixes the rate of decay of the ordered regression coefficients. In the extreme case where some regression coefficients are zero, the mean of the exponential needs to be set to a small value to shrink those regression coefficients close to zero. However, this also leads to substantial shrinkage of the regression coefficients that are truly non-zero. In other words, the shrinkage is not adaptive enough for the problem. This can lead to both poor prediction accuracy and poor inference about the regression coefficients. This is an extreme case but the problem will also affect analyses where all the regression coefficients are truly non-zero but many are close to zero. The problem is particularly acute if n is small and p is large, when prior assumptions play an important role in posterior inference.

This paper looks at the effect of the prior for β on the posterior expectation and variance of β. We show that the posterior expectation is a shrinkage estimator where the level of shrinkage is directly related to the form of the prior predictive distribution of the least squares estimator. We develop a generalisation of the double exponential
J. E. Griffin and P. J. Brown
173
prior distribution for regression problems. The normal-gamma prior provides a natural extension and we show that it defines estimators that induce a wide range of shrinkage behaviour, including effective selection (by shrinking the posterior expectation of many regression coefficients to values very close to zero).

The paper is organised in the following way. Section 2 discusses a very general expression for the posterior expectation of the regression coefficients showing that the shape of the prior predictive distribution plays a key role (and acts as an analogue to some of the results of Fan and Li (2001) for penalized maximum likelihood estimation). Section 3 considers the use of the normal-gamma distribution as a prior for regression coefficients and the interpretation of its hyperparameters in a Bayesian context. We finish the section by suggesting a simple, data-dependent default prior for the hyperparameters. Section 4 introduces a Gibbs sampling scheme for posterior inference including the use of the singular value decomposition for reduced dimension matrix inversion and computational speed. Section 5 applies the method to two simulation examples and a real problem in chemometrics and compares with results for estimation using the double exponential prior distribution. A discussion follows in section 6.
2 The posterior expectation of regression coefficients in linear regression

Suppose that the regression error variance σ² is known in the linear regression model given by equation (1). Then the posterior expectation and variance of the (p × 1)-dimensional vector β can be expressed in a useful form by extending a univariate result of Pericchi and Smith (1992).

Proposition 1. Suppose that we have the linear regression model given by equation (1) where n ≥ p + 1, the design matrix X is non-singular and its columns have been centred, and the intercept α is independent of the p × 1 vector β a priori. Let β̂ = (XᵀX)⁻¹Xᵀy, the standard least squares estimator, and

$$h(\hat\beta) = \int \mathrm{N}\left(\hat\beta \mid \beta, \sigma^2 (X^T X)^{-1}\right) \pi(\beta)\, d\beta$$

where π(β) is the prior distribution of β. Then

$$E[\beta \mid \hat\beta] = \left(I - S(\hat\beta)\right)\hat\beta \quad\text{and}\quad V[\beta \mid \hat\beta] = \sigma^2 (X^T X)^{-1} - \sigma^4 (X^T X)^{-1} W(\hat\beta) (X^T X)^{-1}$$

where

$$S(\hat\beta) = \sigma^2 (X^T X)^{-1} R(\hat\beta)$$

and R(x) is a diagonal matrix with

$$R_{ii}(x) = -\frac{1}{x_i} \frac{\partial}{\partial x_i} \log h(x) \quad\text{and}\quad W(x) = -\frac{\partial}{\partial x} \frac{\partial}{\partial x^T} \log h(x).$$

The sampling density of β̂ is N(β, σ²(XᵀX)⁻¹) and so h(β̂) is the prior predictive density of β̂. The result may be extended to singular X and p > n − 1 using the
singular value decomposition of X and exploiting the scale mixture of normals, as illustrated in section 4.

The result in Proposition 1 has several implications. The vector posterior expectation is always a matrix-shrunken version of the least squares estimator. The amount of shrinkage is controlled by the shape of h and the standard error of β̂. Interestingly, the penalized maximum likelihood estimator can also be expressed as a shrinkage estimator (Fan and Li 2001). In that case the shrinkage is controlled by the derivative of the penalty function (which is related to the log prior density if the posterior mode is used as our estimator). In contrast, the posterior expectation depends on the derivative of the log predictive distribution. The predictive distribution can be considered a “smoothed” version of the prior distribution and so, in this case, the distinction between absolutely continuous and discrete prior distributions is blurred. Therefore, the posterior mean is a continuous function of β̂, whereas the MAP estimate is not.

If the design matrix is orthogonal, the result can be simply expressed in terms of each regression coefficient:

$$E[\beta_j \mid \hat\beta_j] = \hat\beta_j \left(1 - S^{(j)}(\hat\beta_j)\right) \quad\text{and}\quad V[\beta_j \mid \hat\beta_j] = \frac{\sigma^2}{\sum_{i=1}^n x_{ij}^2} - \frac{\sigma^4}{\left(\sum_{i=1}^n x_{ij}^2\right)^2} W^{(j)}(\hat\beta_j),$$

where

$$W^{(j)}(\hat\beta_j) = -\frac{d^2}{d\hat\beta_j^2} \log h(\hat\beta_j), \qquad S^{(j)}(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^n x_{ij}^2} R^{(jj)}(\hat\beta_j), \qquad R^{(jj)}(\hat\beta_j) = -\frac{1}{\hat\beta_j} \frac{d}{d\hat\beta_j} \log h(\hat\beta_j).$$

In this case the shrinkage factor is simply $S^{(j)}(\hat\beta_j)$. The shrinkage of βj only depends on the univariate predictive density of β̂j and its observed value. Clearly the shape of the prior distribution of βj will directly affect the shrinkage and heavier tails will lead to less shrinkage. Conversely, densities which are more peaked at zero will lead to larger shrinkage of small estimated values. A desirable property for fixed design X is that E[βj|β̂j] → β̂j as β̂j → ∞. If $h(x) \approx \exp\{-\frac{1}{2} c x^2\}$ (i.e. the predictive distribution has normal tails), then $-\frac{1}{x} \frac{d}{dx} \log h(x) \to c$ and so E[βj|β̂j] does not tend to β̂j as β̂j → ∞. The predictive distribution will have these tails if the prior on β is normal or if a normal “slab” is chosen in a “spike-and-slab” prior. If the tails of the prior distribution of β are heavier than normal, then E[βj|β̂j] → β̂j as β̂j → ∞ (see also Dawid (1973)).
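The univariate form of Proposition 1 can be checked numerically. The sketch below is illustrative only: the function names, the finite-difference step and the parameter values are our own assumptions, and `s2` stands for the sampling variance of a single least squares estimate. It evaluates the shrinkage factor −s2 (1/x) d/dx log h(x) for a normal prior, where h is a normal density, and for a double exponential prior, where h has a closed form in terms of the complementary error function.

```python
import math


def log_h_normal(x, s2, tau2):
    """Log prior predictive density of the LS estimate under a N(0, tau2) prior."""
    v = s2 + tau2
    return -0.5 * math.log(2.0 * math.pi * v) - 0.5 * x * x / v


def log_h_laplace(x, s2, gamma):
    """Log prior predictive density under a double exponential prior with scale gamma.

    Closed form of the convolution of exp(-|b|/gamma) / (2*gamma) with N(x | b, s2).
    """
    s = math.sqrt(s2)
    a = math.exp(-x / gamma) * math.erfc((s2 / gamma - x) / (math.sqrt(2.0) * s))
    b = math.exp(x / gamma) * math.erfc((s2 / gamma + x) / (math.sqrt(2.0) * s))
    return math.log(a + b) + s2 / (2.0 * gamma ** 2) - math.log(4.0 * gamma)


def shrinkage(log_h, x, s2, eps=1e-5):
    """S(x) = -s2 * (1/x) * d/dx log h(x), using a central finite difference."""
    dlog = (log_h(x + eps) - log_h(x - eps)) / (2.0 * eps)
    return -s2 * dlog / x


s2, tau2, gamma = 1.0, 2.0, 1.0  # illustrative values, not taken from the paper
for x in (0.5, 2.0, 10.0):
    s_norm = shrinkage(lambda t: log_h_normal(t, s2, tau2), x, s2)
    s_lap = shrinkage(lambda t: log_h_laplace(t, s2, gamma), x, s2)
    print(f"x={x:5.1f}  normal prior S={s_norm:.3f}  double exponential S={s_lap:.3f}")
```

Under these assumptions the normal prior gives the same shrinkage factor s2/(s2 + tau2) at every x, while the factor for the double exponential prior decays roughly like s2/(gamma x) for large x, which is the contrast between normal and heavier-than-normal tails described above.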
3 The normal-gamma prior
A wide and natural class of prior densities for regression coefficients is the scale mixtures of normals (SMN) (see e.g. West (1987)), which we write as

$$\pi(\beta_i) = \int \mathrm{N}(\beta_i \mid 0, \Psi_i)\, dG(\Psi_i)$$

where G is a mixing distribution. The prior can be expressed in a hierarchical form as

$$\beta_i \mid \Psi_i \sim \mathrm{N}(0, \Psi_i), \qquad \Psi_i \sim G. \qquad (2)$$
This hierarchical form for the model shows that the i-th regression coefficient has a normal prior distribution conditional on an idiosyncratic variance (or scale), Ψi. This allows for larger differences in the sizes of the regression coefficients than would be possible under a normal prior. The marginal prior distribution for βi has heavier than normal tails (apart from the degenerate case where G places all its mass at a single point). The “spike-and-slab” prior can be represented in this way by choosing $G(\Psi_i) = z_i \delta_{\Psi_i = \sigma_\beta^2} + (1 - z_i) \delta_{\Psi_i = 0}$. The double exponential prior of the Bayesian Lasso arises if G is an exponential distribution.

An interesting choice of absolutely continuous prior is the normal-gamma distribution, which includes the double exponential prior as a special case. Let Ga(x|c, d) represent the density of a gamma distribution with shape c and rate d so that

$$\mathrm{Ga}(x \mid c, d) = \frac{d^c}{\Gamma(c)} x^{c-1} \exp\{-dx\}.$$

We refer to the distribution as Ga(c, d). The normal-gamma distribution arises by assuming that the mixing distribution in a SMN has the density g(x) = Ga(x|λ, 1/(2γ²)). The density function is expressible as

$$\pi(\beta_i) = \frac{1}{\sqrt{\pi}\, 2^{\lambda - 1/2}\, \gamma^{\lambda + 1/2}\, \Gamma(\lambda)} |\beta_i|^{\lambda - 1/2} K_{\lambda - 1/2}(|\beta_i| / \gamma), \qquad (3)$$

where K is the modified Bessel function of the third kind. The variance of βi is vβ = 2λγ² and the excess kurtosis is 3/λ. The gamma distribution can represent a wide range of shapes. As the shape parameter λ decreases these include distributions that place a lot of mass close to zero but at the same time have heavy tails.

[Figure 1: plot of log f(β) against β.]
Figure 1: The log density of the normal-gamma prior with a variance of 2 and different values of λ. λ = 0.1 (solid line), λ = 0.333 (dot-dashed line) and λ = 1 (dashed line).

Figure 1 shows the effect of shape parameter λ on the marginal prior distribution of βi. The marginal distribution becomes more peaked at zero which places increasing mass close to zero as λ decreases. The distribution has proved a popular choice for modelling fat tails in finance (Bibby and Sorensen 2003), and is a member of the generalized hyperbolic
family (Barndorff-Nielsen and Blaesild 1981). The prior was considered by Griffin and Brown (2007), but the shape of the density made it difficult to obtain MAP estimates. More recently, Caron and Doucet (2008) have looked at MAP estimation and drawn a link to Lévy processes. The density has exponential tails, but the heaviness of the tail is controlled by λ, which can take any positive value.

The choice of λ and γ plays an important role in estimation. Park and Casella (2008) discuss an empirical Bayes estimation strategy for the hyperparameter of the Bayesian Lasso. However, with the normal-gamma prior, the posterior distribution of λ and γ can be highly multimodal and an empirical Bayes approach is very difficult to implement. Therefore we take a fully Bayesian approach and concentrate on choosing a prior distribution for the hyperparameters of the normal-gamma prior. In order to make an informed choice, we consider the effect of λ on both the prior and the posterior.
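The hierarchical form (2) with a gamma mixing distribution gives a direct way to simulate from the normal-gamma prior. The following sketch is illustrative only: the function names, seed and parameter values are our own assumptions. It draws Ψ from Ga(λ, 1/(2γ²)) and then β from N(0, Ψ), and compares the sample variance and excess kurtosis with the values 2λγ² and 3/λ quoted above.

```python
import math
import random


def draw_normal_gamma(lam, gamma, rng):
    """One draw of beta: Psi ~ Ga(shape lam, rate 1/(2 gamma^2)), beta | Psi ~ N(0, Psi)."""
    psi = rng.gammavariate(lam, 2.0 * gamma ** 2)  # second argument is the scale = 1 / rate
    return rng.gauss(0.0, math.sqrt(psi))


def sample_moments(lam, gamma, n, seed=0):
    """Monte Carlo variance and excess kurtosis of beta (the mean is zero by symmetry)."""
    rng = random.Random(seed)
    draws = [draw_normal_gamma(lam, gamma, rng) for _ in range(n)]
    m2 = sum(b * b for b in draws) / n
    m4 = sum(b ** 4 for b in draws) / n
    return m2, m4 / m2 ** 2 - 3.0


lam, gamma = 2.0, 0.5  # illustrative values, not taken from the paper
var_hat, kurt_hat = sample_moments(lam, gamma, 200_000)
print(var_hat, 2 * lam * gamma ** 2)  # sample variance next to 2 * lambda * gamma^2
print(kurt_hat, 3 / lam)              # sample excess kurtosis next to 3 / lambda
```

Note that `random.gammavariate` takes a shape and a scale, so the rate 1/(2γ²) used in the text enters as the scale 2γ². Setting `lam = 1` in this sketch corresponds to the double exponential prior of the Bayesian Lasso.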
3.1 The effect of λ

It follows from the definition of the model in equations (1) and (2) that

$$V[y_i \mid \Psi, \sigma^2] = V[\alpha] + \sum_{j=1}^p \Psi_j + \sigma^2,$$

if the regressors have been standardized so that the sample mean and variance of each regressor are 0 and 1 respectively. The regression total variability is $\sum_{k=1}^p \Psi_k$. Thus $\zeta_j = \Psi_j / \sum_{k=1}^p \Psi_k$ can be interpreted as the proportion of total variability attributable to the j-th regressor. If the mixing distribution is gamma, then ζ follows a Di(λ, λ, . . . , λ) distribution, a Dirichlet distribution with all parameters equal to λ. Consequently, the distribution of ζ is controlled by λ only and γ has no effect. Increasing λ will lead to more evenly distributed values of ζ1, ζ2, . . . , ζp and small values of λ will be associated with large differences between the proportions. We look at the strength of this effect by considering ζ(1) > ζ(2) > · · · > ζ(p), which is the ordered version of ζ, and define rj = log ζ(j) − log ζ(j+1). Plotting r1, r2, . . . , rp−1 gives an indication of the rate at which the ordered proportions decay for a given prior distribution. These plots for various values of λ are shown in Figure 2. The shape of the curve defined by the values r1, r2, . . . , rp−1 is similar for all values of λ. The rate is fairly constant for small values of j but increases for larger values of j. The level is determined by both p and λ. Smaller values of λ and smaller values of p are associated with larger values of rj (i.e. a faster decay). This illustrates a limitation of the Bayesian Lasso prior (λ = 1), which implies a particular set of values for rj for a given p. Extending the prior to the normal-gamma distribution leads to a wider choice of decay rates.

The posterior properties of the regression coefficients can be studied using Proposition 1. The shrinkage factor, S(β̂), for the posterior expectation of a single regression coefficient is plotted against the standard error (SE) of β̂ in Figure 3 for different choices of λ. The shrinkage for small values of β̂ changes markedly with λ (smaller values of λ are associated with larger amounts of shrinkage).
When SE < 1/5 the graphs show a fast transition from a high level of shrinkage to a low level of shrinkage (e.g. if SE = 1/5 this transition occurs at around 1). The positions of these transitions are controlled by the hyperparameter γ.

[Figure 2: panels for p = 10, 30, 100 and λ = 1/10, 1/3, 1; horizontal axis i, vertical axis ri.]
Figure 2: The prior distribution of ri = log ζ(i) − log ζ(i+1) represented by the median (cross) and 95% central region (dots).
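The decay rates rj discussed in this section can be approximated by simulation. The sketch below is illustrative only: the function name, the number of draws and the seed are our own assumptions. It uses the fact that a Di(λ, . . . , λ) vector is a vector of independent Ga(λ, 1) draws divided by their sum, and that this sum cancels in rj, so only the sorted logarithms of the gamma draws are needed.

```python
import math
import random
import statistics


def median_decay_rates(lam, p, n_draws=2000, seed=0):
    """Monte Carlo medians of r_j = log zeta_(j) - log zeta_(j+1), for j = 1, ..., p - 1."""
    rng = random.Random(seed)
    rates = [[] for _ in range(p - 1)]
    for _ in range(n_draws):
        # The normalising sum of the gamma draws cancels in r_j, so sorting the
        # log gamma draws is enough; the floor guards against log(0) on underflow.
        logs = sorted(
            (math.log(max(rng.gammavariate(lam, 1.0), 1e-300)) for _ in range(p)),
            reverse=True,
        )
        for j in range(p - 1):
            rates[j].append(logs[j] - logs[j + 1])
    return [statistics.median(r) for r in rates]


for lam in (0.1, 1.0 / 3.0, 1.0):
    r = median_decay_rates(lam, p=10)
    print(f"lambda={lam:.3f}  r_1={r[0]:.2f}  r_5={r[4]:.2f}  r_9={r[8]:.2f}")
```

Only the medians are reported here; a central region like the one shown in Figure 2 could be obtained from the same simulated values by taking empirical quantiles.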
3.2 Similarity to “spike-and-slab”

The above shrinkage factor results for the normal-gamma prior show an adaptive pattern which can take a wide range of shapes according to the choice of hyperparameters. It is therefore interesting to ask whether the “spike-and-slab” prior leads to forms of shrinkage different from the normal-gamma. The hyperparameters of the normal-gamma prior can be matched to the “spike-and-slab” prior using the following argument. The prior proportion of non-zero coefficients, w, of the “spike-and-slab” can be elicited by choosing a prior guess of the number of non-zero regression coefficients, p⋆, and setting w = p⋆/p. For the normal-gamma prior, we could mimic the “spike-and-slab” prior by choosing λ so that most of the variation in the prior is contained in a small number of regressors. To make this idea operational, we chose λ to solve

$$\operatorname{median}\left(\sum_{i=1}^{p^\star} \zeta_{(i)}\right) = 1 - \epsilon$$

for a pre-specified “small” value of ε. Choosing the variance of the normal component σβ² = 2λγ² p/p⋆ guarantees that V[βi] is the same under the two priors.

[Figure 3: panels for SE = 1, 1/5, 1/25 and γ = 1, 2; vertical axis Shrinkage.]
Figure 3: Shrinkage factor for different values of β̂ for different priors: λ = 0.1 (solid line), λ = 0.333 (dashed line) and λ = 1 (dot-dashed line).
[Figure 4: panels for SE = 1, 1/5, 1/25 and γ = 1, 2; vertical axis Shrinkage.]
Figure 4: Shrinkage factor for different values of β̂ for matched normal-gamma (solid line) and “spike-and-slab” priors (dashed line) with p = 100 and p⋆ = 5.

Figure 4, with ε = 0.1, shows the shrinkage factor S(β̂) for a single regressor with p = 100 and p⋆ = 5 for several values of the standard error (SE). The two priors lead to remarkably similar shrinkage for many values of the least squares estimate (only positive values are shown since this function will be symmetric around 0). The very sharp transition associated with the “spike-and-slab” prior is mimicked by the matching normal-gamma distribution. The main differences occur for very small values and very large values of β̂. When β̂ is large, the “spike-and-slab” prior reverts to the constant shrinkage factor induced by a ridge prior, whereas the shrinkage for the normal-gamma prior tends to zero. If β̂ is small, the shrinkage associated with the “spike-and-slab” prior tends to be larger than for the matching normal-gamma prior.
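The matching rule for λ in this section can be approximated by simulation. The sketch below is illustrative only and rests on several assumptions of our own: the criterion is read as the median of the summed p⋆ largest proportions, that median is estimated by Monte Carlo with a fixed seed, and λ is found by bisection on the log scale between hypothetical bounds `lo` and `hi`. None of the function names or tuning values come from the paper.

```python
import random
import statistics


def median_top_share(lam, p, p_star, n_draws=400, seed=0):
    """Monte Carlo median of the summed p_star largest components of zeta ~ Di(lam, ..., lam)."""
    rng = random.Random(seed)
    shares = []
    for _ in range(n_draws):
        psi = [rng.gammavariate(lam, 1.0) for _ in range(p)]
        total = sum(psi)
        if total > 0.0:  # guard against every draw underflowing to zero for tiny lam
            shares.append(sum(sorted(psi, reverse=True)[:p_star]) / total)
    return statistics.median(shares)


def match_lambda(p, p_star, eps, lo=0.01, hi=1.0, n_iter=20):
    """Bisection on the log scale for the lam whose median top share is close to 1 - eps."""
    target = 1.0 - eps
    if not (median_top_share(lo, p, p_star) > target > median_top_share(hi, p, p_star)):
        raise ValueError("target share is not bracketed by [lo, hi]")
    for _ in range(n_iter):
        mid = (lo * hi) ** 0.5
        if median_top_share(mid, p, p_star) > target:
            lo = mid  # share still too concentrated: move towards larger lam
        else:
            hi = mid
    return (lo * hi) ** 0.5


lam_matched = match_lambda(p=100, p_star=5, eps=0.1)
print(lam_matched, median_top_share(lam_matched, 100, 5))
```

Because the median is itself a Monte Carlo estimate, the value returned is only approximate; increasing `n_draws` would reduce that noise at the cost of run time. Given a matched λ and a chosen γ, the slab variance would then follow from σβ² = 2λγ² p/p⋆ as stated above.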
3.3 Prior hyperparameter settings

The hyperparameters of the normal-gamma distribution could be chosen to match the “spike-and-slab” prior as discussed in Section 3.2. However, we choose a simpler route by directly specifying priors for λ and γ. A prior for λ which seems to work well in the simulations and our example is an exponential distribution with mean 1. This offers variability around the Bayesian Lasso prior (λ = 1). The prior for the scale parameter γ conditional on λ is given by vβ = 2λγ² ∼ IG(2, M), where IG denotes the inverted gamma distribution, the distribution of the inverse of a gamma random variable, so that IG(2, M) has expectation M. When X is non-singular, $M = \frac{1}{p} \sum_{i=1}^p \hat\beta_i^2$ where β̂ is the least squares estimate. When X is singular, as when p > n − 1, $M = \frac{1}{n} \sum_{i=1}^p \hat\beta_i^2$ where β̂ is the Minimum Length Least Squares (MLLS) estimate. In the non-singular X case, this is the same as that used in the Hoerl-Kennard-Baldwin estimate of the variance of β for the constant in ridge regression (apart from a Stein-type dimension correction) (see Hoerl et al. (1975) or Brown (1993), section 4.4). This completes the prior specification for β. The prior for the intercept α in (1) is π(α) ∝ 1. Lastly, we choose a vague prior for the error variance so that σ⁻² ∝ 1.
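The default prior can be sampled directly, which is a convenient way to see what it implies for λ and γ. The sketch below is illustrative only: the function name, the `singular_n` argument and the example estimates are our own assumptions. It draws λ from an exponential distribution with mean 1, draws vβ from IG(2, M) as the reciprocal of a Ga(2, M) variable, and recovers γ from vβ = 2λγ².

```python
import math
import random
import statistics


def default_prior_draw(beta_hat, rng, singular_n=None):
    """Draw (lam, gamma, v_beta) from the default prior given least squares estimates.

    M is the sum of beta_hat^2 divided by p, or by n when X is singular
    (pass singular_n = n in that case, with beta_hat the minimum length solution).
    """
    denom = singular_n if singular_n is not None else len(beta_hat)
    M = sum(b * b for b in beta_hat) / denom
    lam = rng.expovariate(1.0)                     # exponential with mean 1
    v_beta = 1.0 / rng.gammavariate(2.0, 1.0 / M)  # reciprocal of Ga(shape 2, rate M)
    gamma = math.sqrt(v_beta / (2.0 * lam))        # from v_beta = 2 * lam * gamma^2
    return lam, gamma, v_beta


rng = random.Random(0)
beta_hat = [1.0, -2.0, 0.5, 0.0]  # illustrative least squares estimates, not from the paper
draws = [default_prior_draw(beta_hat, rng) for _ in range(20_000)]
print(statistics.median(v for _, _, v in draws))  # compare with M / 1.678, the IG(2, M) median
```

The median rather than the mean is printed because IG(2, M) has expectation M but infinite variance, so a sample mean of vβ converges slowly.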
4 Computational method

The posterior distribution of the parameters can be simulated using a Gibbs sampler with an additional Metropolis-Hastings update. The convergence of the method is improved by augmenting the model with the latent scale parameters Ψ1, Ψ2, . . . , Ψp. The full conditionals used in the updating steps are given below.

Updating α and β

Let φ = (α, β)ᵀ. The full conditional distribution of φ follows a joint normal distribution with mean $(X^{\star T} X^\star + \sigma^2 \Lambda)^{-1} X^{\star T} y$ and variance $\sigma^2 (X^{\star T} X^\star + \sigma^2 \Lambda)^{-1}$, where

$$\Lambda = \operatorname{diag}\left(0, \frac{1}{\Psi_1}, \frac{1}{\Psi_2}, \dots, \frac{1}{\Psi_p}\right)$$

and X⋆ = [1 : X]. It is computationally convenient in problems with p > n − 1 to express the mean and variance of this distribution using the following form which only involves the inversion of an n × n matrix rather than a larger (p + 1) × (p + 1) matrix. The standard MLE will not be defined if p > n − 1. Consequently, the problem is
re-expressed in terms of an n-dimensional parameter, θ, for which the MLE exists. As in West (2003), the singular value decomposition of X⋆ can be written as $X^\star = F^T D A^T$ where A is a ((p + 1) × n)-dimensional matrix such that $A^T A = I_n$, D is an (n × n)-dimensional diagonal matrix and F is an (n × n)-dimensional matrix for which $F^T F = I_n$ and $F F^T = I_n$. Clearly, we can write $X^\star \phi = (F^T D)\theta$. The MLE, θ̂, of θ is well-defined and has the form $\hat\theta = D^{-1} F y$. Let $\Lambda^\star = D^{-2}$ and $\Psi_0 = A^T \Psi A$. After some simplification we can express the posterior mean and covariance in terms of the inverse of an n × n matrix:

$$E\left(\phi \mid \Psi, \hat\theta\right) = \Psi A (\Psi_0 + \sigma^2 \Lambda^\star)^{-1} \hat\theta$$

and

$$V\left(\phi \mid \Psi, \hat\theta\right) = \Psi - \Psi A (\Psi_0 + \sigma^2 \Lambda^\star)^{-1} A^T \Psi.$$
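The two expressions above can be read as ordinary Gaussian conditioning. The following is a sketch of that step rather than the authors' own derivation. It assumes that θ = Aᵀφ, that D is invertible, that Ψ denotes a proper zero-mean normal prior covariance for φ (so the flat prior on α would be treated as a limiting case), and it uses FFᵀ = Iₙ so that Fε has covariance σ²Iₙ.

```latex
\begin{align*}
\theta &= A^T \phi, \qquad
\hat\theta = D^{-1} F y = \theta + D^{-1} F \epsilon
  \;\Rightarrow\;
  \hat\theta \mid \phi \sim \mathrm{N}\!\left(A^T \phi,\ \sigma^2 D^{-2}\right), \\
\operatorname{Cov}(\phi, \hat\theta \mid \Psi) &= \Psi A, \qquad
\operatorname{V}(\hat\theta \mid \Psi) = A^T \Psi A + \sigma^2 D^{-2}
  = \Psi_0 + \sigma^2 \Lambda^\star, \\
\operatorname{E}(\phi \mid \Psi, \hat\theta)
  &= \Psi A \left(\Psi_0 + \sigma^2 \Lambda^\star\right)^{-1} \hat\theta, \\
\operatorname{V}(\phi \mid \Psi, \hat\theta)
  &= \Psi - \Psi A \left(\Psi_0 + \sigma^2 \Lambda^\star\right)^{-1} A^T \Psi .
\end{align*}
```

The only matrix inverted is Ψ₀ + σ²Λ⋆, which is n × n, which is the computational saving described in the text.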
Updating σ²

The full conditional distribution of σ⁻² is Ga(c⋆, d⋆) where c⋆ = n/2 and d⋆ = (y − α − Xβ)ᵀ(y − α − Xβ)/2.

Updating Ψ

The parameter Ψ can effectively be updated in a block since the full conditional distributions of Ψ1, Ψ2, . . . , Ψp are independent. The full conditional distribution of Ψi follows a Generalized Inverse Gaussian distribution GIG(λ − 1/2, 1/γ², βi²) where GIG(m, c, d) has the density

$$\frac{(c/d)^{m/2}}{2 K_m(\sqrt{cd})}\, x^{m-1} \exp\left\{-\frac{1}{2}(cx + d/x)\right\}.$$

An algorithm for simulation of this distribution is described by Devroye (1986). A Matlab implementation is available in the “randraw” toolbox, which is available from Matlab Download Central.

Updating hyperparameters of the normal-gamma prior

In section 3.3 we assigned priors for λ and γ. If we denote the prior for λ by π(λ) then the full conditional of λ is proportional to

$$\pi(\lambda)\, \frac{1}{(2\gamma^2)^{p\lambda}\, (\Gamma(\lambda))^p} \left(\prod_{i=1}^p \Psi_i\right)^{\lambda},$$
which can be updated using a Metropolis-Hastings random walk update on log λ. We propose λ′ = exp{σλ² z}λ, where z is standard normal; then λ′ is accepted with probability

$$\min\left\{1,\ \frac{\pi(\lambda')}{\pi(\lambda)} \left(\frac{\Gamma(\lambda)}{\Gamma(\lambda')}\right)^p \left((2\gamma^2)^{-p} \prod_{i=1}^p \Psi_i\right)^{\lambda' - \lambda}\right\}.$$

The tuning parameter σλ² is chosen to set the average acceptance rate at around 20-30%.

Now turning to the scale parameter γ, with γ⁻² ∼ Ga(2, M/(2λ)) from section 3.3, γ⁻² can be updated directly from its full conditional distribution, which is γ⁻² ∼ Ga(e⋆, f⋆) where e⋆ = 2 + pλ and $f^\star = M/(2\lambda) + \frac{1}{2} \sum_{i=1}^p \Psi_i$.
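The two hyperparameter updates can be sketched in a few lines. The code below is illustrative only: the function names, the `step` argument (standing in for the tuning constant that multiplies z) and the example values of Ψ are our own assumptions. The acceptance ratio follows the expression printed above with an exponential prior of mean 1 on λ. The optional `include_jacobian` flag is not in the text: it adds the factor λ′/λ that is the usual proposal correction for a multiplicative random walk, and is left off by default so that the sketch matches the printed formula.

```python
import math
import random


def log_cond_lambda(lam, gamma, psi, log_prior):
    """Log full conditional of lambda (up to a constant), following the printed expression."""
    p = len(psi)
    return (log_prior(lam)
            - p * lam * math.log(2.0 * gamma ** 2)
            - p * math.lgamma(lam)
            + lam * sum(math.log(x) for x in psi))


def update_lambda(lam, gamma, psi, step, rng,
                  log_prior=lambda l: -l, include_jacobian=False):
    """One random walk Metropolis-Hastings step on log(lambda)."""
    prop = lam * math.exp(step * rng.gauss(0.0, 1.0))
    log_ratio = (log_cond_lambda(prop, gamma, psi, log_prior)
                 - log_cond_lambda(lam, gamma, psi, log_prior))
    if include_jacobian:  # assumption: proposal-ratio term, not part of the printed formula
        log_ratio += math.log(prop) - math.log(lam)
    accept = rng.random() < math.exp(min(0.0, log_ratio))
    return prop if accept else lam


def update_gamma(lam, psi, M, rng):
    """Draw gamma^(-2) ~ Ga(2 + p * lam, M / (2 * lam) + sum(psi) / 2) and return gamma."""
    e_star = 2.0 + len(psi) * lam
    f_star = M / (2.0 * lam) + 0.5 * sum(psi)
    inv_gamma_sq = rng.gammavariate(e_star, 1.0 / f_star)  # scale = 1 / rate
    return 1.0 / math.sqrt(inv_gamma_sq)


rng = random.Random(1)
psi = [0.3, 1.2, 0.05, 2.0]  # illustrative latent variances, not from the paper
lam, gamma = 1.0, 1.0
for _ in range(5):
    lam = update_lambda(lam, gamma, psi, step=0.5, rng=rng)
    gamma = update_gamma(lam, psi, M=1.0, rng=rng)
print(lam, gamma)
```

In a full sampler these two steps would be interleaved with the updates of φ, σ² and Ψ described earlier in this section, and `step` would be tuned towards the 20-30% acceptance rate mentioned above.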
5 Examples
We consider two simulation examples and a real data example from chemometrics. In each example, the MCMC algorithm was run for 40,000 iterations with the first 5,000 discarded as a burn-in.
5.1 Simulation 1

In the first simulation we generate data where the design matrix has independent elements which are standard normally distributed. The regression coefficients are drawn from normal-gamma priors with three possible shape parameters (0.1, 1 and 3). In each case, the prior mean of the regression coefficients was chosen to be 1. Sample sizes considered were n = 200 and n = 50 with p = 100, so that n > p in the first case and p > n in the second. The performance of the double exponential prior is compared to the normal-gamma prior using root mean squared error of the estimation of β by the posterior mean. The double exponential was favored when λ = 1 and the normal-gamma in the other two cases. The posterior distributions for λ, γ are summarised in Table 1 in terms of λ and vβ, where the variance vβ = 2λγ². The values of λ and vβ seem well estimated in each case. The credibility interval for λ widens as the true value of λ increases.

                   Normal-Gamma                                RMSE
λ     n     p      λ                     vβ = 2λγ²             NG      DE
1     200   100    1.02 (0.54, 2.08)     0.59 (0.35, 1.02)     0.107   0.107
1     50    100    0.92 (0.31, 2.40)     0.25 (0.11, 0.89)     0.683   0.677
3     200   100    1.87 (0.98, 7.62)     0.64 (0.38, 1.08)     0.100   0.101
3     50    100    1.09 (0.30, 5.02)     0.19 (0.09, 0.68)     0.739   0.740
0.1   200   100    0.12 (0.07, 0.19)     0.55 (0.26, 1.36)     0.069   0.091
0.1   50    100    0.08 (0.05, 0.16)     0.94 (0.29, 5.16)     0.203   0.452

Table 1: Simulation 1: Posterior mean and 95% credibility interval of the parameters of the Normal-Gamma model and the root mean squared errors (RMSE) of estimating β by its posterior mean with the NG prior and double exponential (DE).
[Figure 5: panels for n = 200, 50 and λ = 1, 3, 0.1; horizontal axis β, vertical axis E[β|y].]
Figure 5: Plots of the true regression coefficients against their posterior means under the NG (crosses) and Lasso priors (circles), simulation 1.

The root mean square errors for estimation of the regression coefficients are given in Table 1. The performance of the double exponential and normal-gamma priors is roughly similar when λ = 1 (when the double exponential is the true distribution of the regression coefficients) and λ = 3. This is reassuring since it demonstrates that there is little lost in inference about the regression coefficients if λ is considered unknown. However, there are substantial differences when λ = 0.1 and, as we would expect, the differences become larger as the sample size, n, becomes smaller. The RMSE is more than doubled by the double exponential prior relative to the normal-gamma prior when
n = 50 and λ = 0.1. Figure 5 shows the estimated coefficients versus their true values. Clearly, the regression coefficients are well estimated when n = 200 with both prior distributions but less well when n = 50. If λ = 0.1, there is a substantial number of regression coefficients whose value is close to zero, which is better captured by estimates using the normal-gamma prior rather than the double exponential prior.
5.2 Simulation 2

In this example, the design matrix is generated in the same way as in Simulation 1 but the regression coefficients have a different structure. In this case, only 10 regression coefficients are non-zero, and these are placed evenly throughout the vector of regression coefficients, which can be written as

$$\beta_i = \begin{cases} \beta^\star & \text{if } \operatorname{mod}(i - 1,\ p/10) = 0, \\ 0 & \text{otherwise.} \end{cases}$$

In these simulations β⋆ = 1, 5, n = 50 and p = 100, 200. This is a challenging example since p is larger than n and our prior places no mass on the regression coefficients being zero. However, it would be reassuring if the prior performs well for this type of data. Table 2 displays the means and 95% credibility intervals for the hyperparameters of the NG model.

β⋆    n     p      λ                        vβ = 2λγ²
1     50    100    0.14 (0.05, 1.22)        0.16 (0.03, 0.84)
5     50    100    0.030 (0.015, 0.054)     4.00 (1.19, 19.65)
1     50    200    0.34 (0.08, 1.56)        0.014 (0.004, 0.075)
5     50    200    0.018 (0.012, 0.0231)    1.95 (0.59, 8.55)

Table 2: Estimates of the hyperparameters of the normal-gamma distribution for simulation 2.

The posterior mean value of λ is much smaller than the Lasso value of 1. The posterior means of the regression coefficients with the NG prior and DE prior are displayed in Figure 6. The coefficient estimates based on the NG prior out-perform those based on the DE prior. For large signal (i.e. both β⋆ and n large), the NG estimates identified all the correct β⋆ with very little attenuation.
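The coefficient structure used in this simulation is easy to reproduce. The sketch below is illustrative only: the function names and the seed are our own, and the error standard deviation `sigma` is an assumed value because the text does not state the one used. It builds the sparse coefficient vector from the mod rule and simulates a design matrix with independent standard normal entries.

```python
import random


def sparse_coefficients(p, beta_star, n_nonzero=10):
    """beta_i = beta_star when mod(i - 1, p / n_nonzero) == 0 for i = 1, ..., p, else 0."""
    spacing = p // n_nonzero
    return [beta_star if (i - 1) % spacing == 0 else 0.0 for i in range(1, p + 1)]


def simulate(n, p, beta_star, sigma=1.0, seed=0):
    """Standard normal design; sigma is an assumed error s.d. (not given in the text)."""
    rng = random.Random(seed)
    beta = sparse_coefficients(p, beta_star)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma) for row in X]
    return X, y, beta


X, y, beta = simulate(n=50, p=100, beta_star=5.0)
print([i + 1 for i, b in enumerate(beta) if b != 0.0])
# -> [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```

With p = 200 the same rule places the ten non-zero coefficients at positions 1, 21, . . . , 181.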
[Figure 6: panels for p = 100, 200 and β⋆ = 1, 5, for the NG and DE priors.]
Figure 6: The posterior mean of β for simulation 2 with normal-gamma (NG) prior and double exponential prior (DE).

5.3 Example: NIR spectroscopy data

The data consists of 215 near-infrared absorbance spectra of meat samples, recorded on a Tecator Infratec Food Analyzer (represented as a 100-channel absorbance spectrum in the wavelength range 850-1050nm) and the composition of each sample in terms of water, fat and protein content. We consider predicting fat content on the basis of its infrared spectrum using the 100 channels. The data is split into a training/monitoring/testing set of 129/43/43 samples. The data, originally used by Borggaard and Thodberg (1992), is available at http://lib.stat.cmu.edu/datasets/tecator. More recently it was analysed in Eilers et al. (2009). We used the training and monitoring data comprising n = 172 = 129 + 43 samples (all data) as our main data set and also took a random subset of 60 of the training data to create a p larger than n data set (small). The RMSEs for prediction of the 43-sample test set, using the normal-gamma and Lasso priors, are given in Table 3.

                All data    Small
normal-gamma    1.94        2.59
Lasso           3.54        3.09

Table 3: RMSEs for fat prediction

The difference in predictive performance is not surprising when one looks at the posterior median of λ in the normal-gamma prior, which is 0.020 with a 95% credibility interval of (0.016, 0.026) on the full data. The differences in the estimate of β under the two priors are shown in Figure 7, which shows the posterior means for the two datasets. Clearly, in the large data set the RMSE is very different for the two methods, as are the posterior means for β. Only a few regression coefficients are estimated to be far from zero with the normal-gamma prior, unlike the Lasso prior which substantially overestimates many of the regression coefficients. The results for the smaller data set are similar. The posterior median of λ is 0.019 with a 95% credible interval of (0.013, 0.032). In the normal-gamma prior there is more selection with only a few regression coefficients estimated to be far from zero. However, the estimates of these regression coefficients are similar to those with a larger sample size, unlike the Lasso prior whose estimates are much smaller.
Figure 7: The posterior means of β for the normal-gamma (solid) and Lasso (dashed) priors for the two data sets. [Plot not reproduced; panels titled All and Small.]
6 Discussion
This paper considers the performance of absolutely continuous distributions as priors for regression coefficients. We demonstrate how the posterior expectation of the regression coefficients depends on the derivative of the prior predictive density of the least squares estimate of the regression coefficients. This allows us to compare the shrinkage induced by different absolutely continuous distributions. A natural class of prior distributions is the class of scale mixtures of normal distributions and we consider, in detail, the particular choice of a normal-gamma distribution. This distribution allows us to control prior beliefs about the decay of the absolute values of the ordered regression coefficients. At one extreme we have normal priors, which promote a similarity between the magnitudes of the regression coefficients, and at the other extreme prior distributions that can promote effective variable selection by extreme shrinkage of “small” regression coefficients to a value close to zero. A gamma mixing distribution can be interpreted through the well-known links between the gamma distribution and the Dirichlet distribution.

We specify a default prior for the hyperparameters, shape λ and scale γ, of the normal-gamma and adopt a fully Bayesian analysis. A Gibbs sampler is proposed to fit the model with a normal-gamma prior, augmented by a Metropolis-Hastings step for updating the gamma shape parameter λ. Recently, there has been interest in regression problems where n is small and p is large. Naive application of Gibbs sampling would lead to the inversion of a (p × p)-dimensional matrix. We exploit the singular value decomposition of the design matrix to avoid this computationally expensive inversion, replacing it by the inversion of an (n × n)-dimensional matrix.

We have shown that the normal-gamma prior can better represent heterogeneity in regression effects relative to standard “spike-and-slab” priors in terms of predictive performance.
The differences in the regression estimates under the priors are linked to the normal distribution commonly used as a “slab”, which sets an upper limit on the shrinkage factors. The normal-gamma could also be used as the “slab” in a “spike-and-slab” prior. The use of such a distribution with heavy tails is an interesting area for future research. The computational methods developed in this paper could be easily extended to other generalized inverse Gaussian mixing distributions, which would generate generalized hyperbolic priors for β.
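The computational point made in the Discussion, that a (p × p) inversion can be traded for an (n × n) one when p > n, can be illustrated with a short sketch. The code below is not the SVD-based scheme of the paper; it uses the closely related Woodbury-type identity (X^T X + σ² D^{-1})^{-1} X^T = D X^T (X D X^T + σ² I_n)^{-1}, and it assumes a conditional prior β_i | ψ_i ~ N(0, ψ_i) with D = diag(ψ_1, ..., ψ_p). The function names and the simulated data are hypothetical and serve only as an illustration.

```python
import numpy as np


def posterior_mean_pxp(X, y, psi, sigma2):
    """Conditional posterior mean of beta via a (p x p) linear solve.

    Assumes y = X beta + eps, eps ~ N(0, sigma2 * I) and
    beta_i | psi_i ~ N(0, psi_i) independently (illustrative assumption).
    """
    A = X.T @ X + sigma2 * np.diag(1.0 / psi)  # p x p system matrix
    return np.linalg.solve(A, X.T @ y)


def posterior_mean_nxn(X, y, psi, sigma2):
    """Same posterior mean, computed with an (n x n) linear solve."""
    n = X.shape[0]
    XD = X * psi                           # X @ diag(psi), by broadcasting
    B = XD @ X.T + sigma2 * np.eye(n)      # n x n system matrix
    return XD.T @ np.linalg.solve(B, y)    # D X^T (X D X^T + sigma2 I)^{-1} y


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 20, 200                          # a "p larger than n" setting
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    psi = rng.uniform(0.1, 2.0, size=p)     # hypothetical mixing variances
    diff = posterior_mean_pxp(X, y, psi, 1.0) - posterior_mean_nxn(X, y, psi, 1.0)
    print(np.max(np.abs(diff)))             # numerically negligible difference
```

Both functions return the same vector up to rounding error; the second only ever factorises an n × n matrix, which is the saving that matters when p is much larger than n.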
Appendix A: Proof of Proposition 1

The proof follows closely that found in Pericchi and Smith (1992), generalised to multiparameter β and multiple regression. The centring of the columns of X implies that the least squares estimates $\hat\alpha$ and $\hat\beta$ are independent under the sampling distribution. The assumed prior independence implies that α and β are independent a posteriori and we can work with β only. From the definition of h(x) as the predictive density of the p × 1 sufficient statistic $\hat\beta$, it follows that

$$\frac{\partial}{\partial x}\log h(x) = \frac{\frac{\partial}{\partial x}\int \mathrm{N}(x|\beta,\sigma^2(X^TX)^{-1})\,\pi(\beta)\,d\beta}{\int \mathrm{N}(x|\beta,\sigma^2(X^TX)^{-1})\,\pi(\beta)\,d\beta} = -\frac{\int \sigma^{-2}(X^TX)(x-\beta)\,\mathrm{N}(x|\beta,\sigma^2(X^TX)^{-1})\,\pi(\beta)\,d\beta}{\int \mathrm{N}(x|\beta,\sigma^2(X^TX)^{-1})\,\pi(\beta)\,d\beta}.$$

Re-arranging we have

$$x + \sigma^2(X^TX)^{-1}\frac{\partial}{\partial x}\log h(x) = \frac{\int \beta\,\mathrm{N}(x|\beta,\sigma^2(X^TX)^{-1})\,\pi(\beta)\,d\beta}{\int \mathrm{N}(x|\beta,\sigma^2(X^TX)^{-1})\,\pi(\beta)\,d\beta}$$

and so the posterior expectation of the (p × 1)-dimensional vector β is given by

$$E[\beta|\hat\beta] = \hat\beta - \sigma^2(X^TX)^{-1}s(\hat\beta) \tag{4}$$

where

$$s(x) = -\frac{\partial}{\partial x}\log h(x).$$
Clearly, this can be written as

$$E[\beta|\hat\beta] = \left(I - S(\hat\beta)\right)\hat\beta$$

where

$$S(\hat\beta) = \sigma^2(X^TX)^{-1}R(\hat\beta),$$

with R as defined in the Proposition. The form of the variance follows from observing that the mean square error is the ‘variance plus squared bias’, that is, in matrix form,

$$\mathrm{Var}(\beta|\hat\beta) = E[(\beta-\hat\beta)(\beta-\hat\beta)^T|\hat\beta] - [E[\beta|\hat\beta]-\hat\beta][E[\beta|\hat\beta]-\hat\beta]^T.$$

From the result (4) we have

$$E[\beta|\hat\beta] - \hat\beta = -\sigma^2(X^TX)^{-1}s(\hat\beta).$$
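Letting p(x|β) denote the sampling density of $\hat\beta$ and differentiating it twice, we can write

$$(\beta-x)(\beta-x)^T p(x|\beta) = \sigma^4(X^TX)^{-1}\frac{\partial^2}{\partial x\,\partial x^T}p(x|\beta)\,(X^TX)^{-1} + \sigma^2(X^TX)^{-1}p(x|\beta).$$

Multiplying by the prior for β, integrating and dividing by h(x) shows that

$$E[(\beta-x)(\beta-x)^T|\hat\beta] = \sigma^4(X^TX)^{-1}\left[\frac{1}{h(x)}\frac{\partial^2}{\partial x\,\partial x^T}h(x)\right](X^TX)^{-1} + \sigma^2(X^TX)^{-1}$$

and so

$$\mathrm{Var}(\beta|\hat\beta) = \sigma^2(X^TX)^{-1} - \sigma^4(X^TX)^{-1}\left[s(\hat\beta)s(\hat\beta)^T - \frac{1}{h(\hat\beta)}\left.\frac{\partial^2}{\partial x\,\partial x^T}h(x)\right|_{x=\hat\beta}\right](X^TX)^{-1}.$$

The result follows from noting that

$$W(x) = -\frac{\partial^2\log h(x)}{\partial x\,\partial x^T} = \left[\left\{\frac{\partial h(x)}{\partial x}\frac{\partial h(x)}{\partial x^T}\right\} - \frac{\partial^2 h(x)}{\partial x\,\partial x^T}h(x)\right]\Big/ h^2(x).$$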
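As an informal numerical check of the two identities proved above, the following sketch works in the scalar case, where the sampling variance σ²(X^T X)^{-1} reduces to a single number v. It compares the posterior mean and variance obtained by direct quadrature with the values β̂ − v s(β̂) and v − v W(β̂) v, where s and W are approximated by finite differences of log h. The double exponential prior with scale b, the integration grid, the step size and all function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Integration grid for beta; 0 is a node, so the kink of the prior is handled.
GRID = np.linspace(-40.0, 40.0, 80001)


def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def laplace_pdf(beta, b):
    """Double exponential prior density with scale b (illustrative choice)."""
    return np.exp(-np.abs(beta) / b) / (2.0 * b)


def integrate(values):
    """Trapezoidal rule on GRID."""
    return np.sum((values[1:] + values[:-1]) * np.diff(GRID)) / 2.0


def log_h(x, v, b):
    """Log prior predictive density of the least squares estimate at x."""
    return np.log(integrate(normal_pdf(x, GRID, v) * laplace_pdf(GRID, b)))


def moments_direct(x, v, b):
    """Posterior mean and variance of beta by direct quadrature."""
    w = normal_pdf(x, GRID, v) * laplace_pdf(GRID, b)
    norm = integrate(w)
    mean = integrate(GRID * w) / norm
    var = integrate((GRID - mean) ** 2 * w) / norm
    return mean, var


def moments_from_h(x, v, b, eps=1e-3):
    """Posterior mean and variance from derivatives of log h (scalar case)."""
    lo, mid, hi = log_h(x - eps, v, b), log_h(x, v, b), log_h(x + eps, v, b)
    s = -(hi - lo) / (2.0 * eps)            # s(x) = -d/dx log h(x)
    W = -(hi - 2.0 * mid + lo) / eps ** 2   # W(x) = -d^2/dx^2 log h(x)
    return x - v * s, v - v * W * v


if __name__ == "__main__":
    print(moments_direct(1.5, 0.7, 1.2))
    print(moments_from_h(1.5, 0.7, 1.2))
```

The two printed pairs should agree up to discretisation error, and the posterior mean lies between zero and the least squares estimate, as expected for a symmetric unimodal prior centred at zero.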
References

Andrews, D. F. and Mallows, C. L. (1974). “Scale mixtures of normal distributions.” Journal of the Royal Statistical Society B, 36: 99–102.
Barndorff-Nielsen, O. E. and Blaesild, P. (1981). “Hyperbolic distributions and ramifications: contributions to the theory and applications.” In Statistical Distributions in Scientific Work, Vol. 4, 19–44. Dordrecht: Reidel.
Bibby, B. M. and Sorensen, M. (2003). “Hyperbolic Processes in Finance.” In Handbook of Heavy Tailed Distributions in Finance, 211–248. Elsevier Science.
Borggaard, C. and Thodberg, H. H. (1992). “Optimal minimum neural interpretation of spectra.” Analytical Chemistry, 64: 545–551.
Brown, P. J. (1993). Measurement, Regression and Calibration. Oxford: Clarendon Press.
Caron, F. and Doucet, A. (2008). “Sparse Bayesian nonparametric regression.” In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.
Chamberlain, G. and Leamer, E. E. (1976). “Matrix weighted averages and posterior bounds.” Journal of the Royal Statistical Society B, 38: 73–84.
Dawid, A. P. (1973). “Posterior expectations for large observations.” Biometrika, 60: 664–667.
Devroye, L. (1986). Non-Uniform Random Variate Generation. New York: Springer.
Eilers, P. H. C., Li, B., and Marx, B. D. (2009). “Multivariate calibration with single index regression.” Chemometrics and Intelligent Laboratory Systems, 96: 196–202.
Fan, J. and Li, R. Z. (2001). “Variable selection via non-concave penalized likelihood and its oracle properties.” Journal of the American Statistical Association, 96: 1348–1360.
Figueiredo, M. A. T. (2003). “Adaptive sparseness for supervised learning.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 25: 1150–1159.
Figueiredo, M. A. T. and Jain, A. K. (2001). “Bayesian learning of sparse classifiers.” In Proceedings IEEE Computer Society Conference in Computer Vision and Pattern Recognition, volume 1, 35–41.
Griffin, J. E. and Brown, P. J. (2007). “Bayesian adaptive lassos with non-convex penalization.” Technical report, IMSAS, University of Kent.
Hoerl, A. E., Kennard, R. W., and Baldwin, K. F. (1975). “Ridge regression: some simulations.” Communications in Statistics, 4: 105–123.
Kiiveri, H. (2008). “A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations.” BMC Bioinformatics, 9:195.
MacKay, D. J. C. (1996). “Bayesian methods for back-propagation networks.” In Models of Neural Networks III, chapter 6, 211–254. New York: Springer.
Mitchell, T. J. and Beauchamp, J. J. (1988). “Bayesian variable selection in linear regression (with Discussion).” Journal of the American Statistical Association, 83: 1023–1036.
Park, T. and Casella, G. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association, 103: 672–680.
Pericchi, L. R. and Smith, A. F. M. (1992). “Exact and Approximate Posterior Moments for a Normal Location Parameter.” Journal of the Royal Statistical Society B, 54: 793–804.
Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society B, 58: 267–288.
Tipping, M. E. (2001). “Sparse Bayesian learning and the relevance vector machine.” Journal of Machine Learning Research, 1: 211–244.
West, M. (1987). “On scale mixtures of normal distributions.” Biometrika, 74: 646–648.
West, M. (2003). “Bayesian Factor regression models in the large p, small n paradigm.” In Bayesian Statistics 7, 733–742. Oxford: Clarendon Press.