arXiv:1807.03439v1 [math.ST] 10 Jul 2018
Submitted to Bernoulli
Bayesian Linear Regression for Multivariate Responses Under Group Sparsity

BO NING and SUBHASHIS GHOSAL

Department of Statistics, North Carolina State University, 4276 SAS Hall, 2311 Stinson Drive, Raleigh, NC 27695, USA
E-mail: [email protected]; [email protected]
We study the frequentist properties of a Bayesian high-dimensional multivariate linear regression model with correlated responses. Two features of the model are distinctive: (i) group sparsity is imposed on the predictors; (ii) the covariance matrix is unknown and its dimension can be large. We choose a product of independent spike-and-slab priors on the regression coefficients and a Wishart prior with increasing dimension on the inverse of the covariance matrix. Each spike-and-slab prior is a mixture of a point mass at zero and a multivariate density involving the $\ell_2/\ell_1$-norm. We first obtain the posterior contraction rate and bounds on the effective dimension of the model with high posterior probability. We then show that the multivariate regression coefficients can be recovered under certain compatibility conditions. Finally, we quantify the uncertainty for the regression coefficients with frequentist validity through a Bernstein-von Mises type theorem, which also yields selection consistency for the Bayesian method. We derive the posterior contraction rate using the general theory by constructing a suitable test from first principles, bounding moments of likelihood ratio statistics around points in the alternative; this shows that the posterior concentrates around the truth with respect to the average log-affinity. The technique for obtaining the posterior contraction rate could be useful in many other problems.

Keywords: Average log-affinity, Bayesian variable selection, covariance matrix, group sparsity, multivariate linear regression, posterior contraction rate, spike-and-slab prior.
1. Introduction

Asymptotic behavior of variable selection methods, such as the lasso, has been extensively studied (Bühlmann and van de Geer, 2011). However, theoretical studies on Bayesian variable selection methods are limited to relatively simple settings (Castillo et al., 2015; Chae et al., 2016; Martin et al., 2017; Ročková, 2018; Belitser and Ghosal, 2017; Song and Liang, 2017). For example, Castillo et al. (2015) studied a sparse linear regression model in which the response variable is one-dimensional and the variance is known. However, it is not straightforward to extend those results to multivariate linear regression models with an unknown covariance matrix (or even to the univariate case with unknown variance). In many applications, predictors are naturally clustered in groups. Below, we give three examples.

∗ Research is partially supported by NSF grant number DMS-1510238.
1. Cancer genomics study. It is important for biologists to understand the relationship between clinical phenotypes and DNA mutations, which are detected by DNA sequencing. Since these mutations are spaced linearly along the DNA sequence, it is often assumed that adjacent DNA mutations on the chromosome have a similar genetic effect and thus should be grouped together (Li and Zhan, 2010).

2. Multi-task learning. When information for multiple tasks is shared, it is preferable to solve these tasks at the same time to improve learning efficiency and prediction accuracy. Relevant information is preserved across different equations by grouping them together (Lounici et al., 2009).

3. Causal inference in advertising. Measuring the effectiveness of an advertising campaign run in stores is an important task for advertising companies. Counterfactuals, which are constructed using the sales data of a few stores, chosen by a variable selection method from a large number of control stores not subject to the advertising campaign, are needed to conduct a causal analysis (Ning et al., 2018). Control stores within the same geographical region, as they share the same demographic information, can be grouped together and selected or not selected at the same time.
Driven by these applications, new variable selection methods designed to select or not select variables as groups, through imposing group sparsity on the regression coefficients, have been developed. For example, the group lasso (Yuan and Lin, 2006) replaces the $\ell_1$-norm in the lasso with the $\ell_2/\ell_1$-norm, where the $\ell_2$-norm is applied to the predictors within each group and the $\ell_1$-norm is applied across groups. Theoretical properties of the group lasso have been studied (Nardi and Rinaldo, 2008) and its benefits over the lasso for the group selection problem have been demonstrated (Lounici et al., 2009, 2011; Huang and Zhang, 2010). Recently, various Bayesian methods for selecting variables as groups were also proposed (Li and Zhan, 2010; Curtis et al., 2014; Ročková and Lesaffre, 2014; Xu and Ghosh, 2015; Chen et al., 2016; Greenlaw et al., 2017; Liquet et al., 2017). However, large-sample frequentist properties of these Bayesian methods have not been studied yet.

In this paper, we study a Bayesian method for the multivariate linear regression model with two distinct features: group sparsity imposed on the regression coefficients and an unknown covariance matrix. To the best of our knowledge, even in the simpler setting without group sparsity, convergence and selection properties of methods for high-dimensional regression with a multivariate response having an unknown covariance matrix have not been studied in either the frequentist or the Bayesian literature. However, it is important to understand the theoretical properties of such models because correlated responses arise in many applications. For example, in the study of the causal effect of an advertising campaign, sales in different stores are often spatially correlated (Ning et al., 2018). Furthermore, when the dimension of the covariance matrix is large, it affects the quality of the estimation of the regression coefficients.

When the covariance matrix is unknown and high-dimensional, the techniques that were developed for deriving posterior concentration rates (Castillo et al., 2015; Martin et al., 2017; Belitser and Ghosal, 2017) cannot be applied. Also, the general theory of posterior
concentration in its basic form (Ghosal and van der Vaart, 2017) is not appropriate to use because it typically deals with the average Hellinger distance, which is not sufficient for our analysis. Thus, in order to apply the general theory to derive a rate, we shall construct the required tests directly by controlling the moments of likelihood ratios in small pieces.

In this study, we consider the multivariate linear regression model
\[ Y_i = \sum_{j=1}^{G} X_{ij}\beta_j + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1.1} \]
where $Y_i$ is a $1 \times d$ response variable, $i = 1, \dots, n$, $X_{ij}$ is a $1 \times p_j$ predictor variable, $j = 1, \dots, G$, $\beta_j$ is a $p_j \times d$ matrix containing the regression coefficients, and $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed (i.i.d.) as $N(0, \Sigma)$ with $\Sigma$ a $d \times d$ unknown covariance matrix. In other words, in the regression model there are $G > 1$ non-overlapping groups of predictor variables with the group structure pre-determined. When $G = p$, the setting reduces to sparsity imposed on the individual coordinates, so the results derived in our paper apply to the ungrouped setting as well. The model can be rewritten in the vector form
\[ Y_i = X_i \beta + \varepsilon_i, \tag{1.2} \]
where $\beta = (\beta_1', \dots, \beta_G')'$ is a $p \times d$ matrix, $p = \sum_{j=1}^{G} p_j$, and $X_i = (X_{i1}, \dots, X_{iG})$ is a $1 \times p$ vector. The dimension $p$ can be very large. The dimension $d$ can also be large, although to a lesser extent, when the sample size is large. The total number of groups $G$ is clearly bounded by $p$. We refer to the groups which contain at least one non-zero coordinate as non-zero groups and to the remaining groups as zero groups. To allow derivation of asymptotic properties of estimation and selection, certain conditions on the growth of $p$, $G$, $d$ and $p_1, \dots, p_G$ need to be imposed. We allow $p \gg n$ (which means that $n/p \to 0$) but require that the total number of coefficients in all non-zero groups together is of order less than $n$. We further assume that the number of coordinates in any single group is of order less than $p$ and that $\log G \ll n$. Finally, to make the covariance matrix consistently estimable, we assume that the dimension $d$ of the covariance matrix satisfies $d^2 \log n \ll n$.

As for the priors, we choose a product of $d$ independent spike-and-slab priors for $\beta$ and a Wishart prior for $\Sigma^{-1}$, the precision matrix. The spike-and-slab prior is a mixture of a point mass for the zero coordinates and a density for the non-zero coordinates. In the ungrouped setting, commonly used densities for the non-zero coordinates are the Laplace density (Castillo et al., 2015), the Cauchy density (Castillo and Mismer, 2018) and a normal density with mean chosen by empirical Bayes methods (Martin et al., 2017; Belitser and Ghosal, 2017). In this paper, we choose a special density for the non-zero coordinates (see (3.3)). This density involves the $\ell_2/\ell_1$-norm, which is used as a penalty to obtain the group lasso in a non-Bayesian setting.
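To fix ideas, the following small simulation sketch generates data from model (1.2) with a group-structured design and correlated errors. All sizes ($n$, $d$, $G$, group sizes) and the choice of $\Sigma$ are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, G, p_j = 50, 3, 200, 4          # assumed sizes for illustration only
p = G * p_j                            # total number of predictors
s0 = 5                                 # number of truly non-zero groups

X = rng.normal(size=(n, p))            # rows are the 1 x p vectors X_i
Sigma0 = 0.5 * np.eye(d) + 0.5         # d x d covariance with correlated components
beta0 = np.zeros((p, d))
active = rng.choice(G, size=s0, replace=False)
for j in active:                       # fill whole groups, imposing group sparsity
    beta0[j * p_j:(j + 1) * p_j, :] = rng.normal(size=(p_j, d))

E = rng.multivariate_normal(np.zeros(d), Sigma0, size=n)
Y = X @ beta0 + E                      # Y_i = X_i beta + eps_i, as in (1.2)
```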
The remainder of the paper is organized as follows. Section 2 introduces notation that will be used in this paper. Section 3 describes the priors, along with the necessary assumptions. Section 4 provides the main results. The proofs of the main results are given in Section 5. Auxiliary lemmas are provided in Appendix A.
2. Notation

We assume that $G_1, \dots, G_G$ are $G$ disjoint groups such that $\cup_{j=1}^G G_j = \{1, \dots, p\}$. Since these groups are given and will be kept the same throughout, they will not be displayed explicitly in subscripts. Clearly, $p_j$ is the number of elements in $G_j$. Let $p_{\max} = \max_{1\le j\le G} p_j$. For each $k = 1, \dots, d$, let $S_k \subseteq \{1, \dots, G\}$ stand for the set which contains the indices of the non-zero groups for the $k$-th component and let $s_k = |S_k|$ be the cardinality of the set $S_k$. Also, define $S = \cup_{k=1}^d S_k$ and $s = \sum_{k=1}^d s_k$. Let $S_{0,k}$ be the set containing the indices of the true non-zero groups, where $S_{0,k} \subseteq \{1, \dots, G\}$. Define $p_{S_{0,k}} = \sum_{j \in S_{0,k}} p_j$ and $p_{S_0} = \sum_{k=1}^d p_{S_{0,k}}$.

For a vector $A$, let $\|A\|_1$, $\|A\|_{2,1}$ and $\|A\|$ be the $\ell_1$-, $\ell_2/\ell_1$- and $\ell_2$-norm of $A$ respectively, where $\|A\|_{2,1} = \sum_{j=1}^G \|A_j\|$ with $A_j$ the subvector of $A$ consisting of the coordinates $k \in G_j$. For a $d \times p$ matrix $B$, we denote by $B_k$ the $k$-th column of $B$ and by $\|B\|_F = \sqrt{\mathrm{Tr}(B^T B)}$ the Frobenius norm of $B$. For a $d \times d$ symmetric positive definite matrix $C$, let $\mathrm{eig}_1(C), \dots, \mathrm{eig}_d(C)$ denote the eigenvalues of $C$ ordered from the smallest to the largest, and let $\det(C)$ stand for the determinant of $C$. For a scalar $c$, we denote by $|c|$ the absolute value of $c$.

Let $\rho(f, g) = -\log \int f^{1/2} g^{1/2}\, d\nu$ be the negative log-affinity between densities $f$ and $g$ and $h^2(f, g) = \int (f^{1/2} - g^{1/2})^2\, d\nu$ be their squared Hellinger distance. The Kullback-Leibler divergence between $f$ and $g$ is given by $K(f, g) = \int f \log(f/g)$ and the Kullback-Leibler variation between $f$ and $g$ is denoted by $V(f, g) = \int f (\log(f/g) - K(f, g))^2$. The symbol $f_0$ stands for the density $f$ with the parameters at their true values. The notation $\|\mu - \nu\|_{TV}$ denotes the total variation distance between two probability measures $\mu$ and $\nu$. We let $N(\epsilon, \mathcal{F}, \rho)$ stand for the $\epsilon$-covering number of a set $\mathcal{F}$ with respect to $\rho$, which is the minimal number of $\epsilon$-balls needed to cover the set $\mathcal{F}$. Let $I_d$ stand for the $d$-dimensional identity matrix and $\mathbb{1}$ stand for the indicator function. The symbols $\lesssim$ and $\gtrsim$ will be used to denote inequality up to and down to a constant, while $a \asymp b$ stands for $C_1 a \le b \le C_2 a$ for two constants $C_1$ and $C_2$. The notations $a \ll b$ and $a \vee b$ stand for $a/b \to 0$ and $\max\{a, b\}$ respectively. The symbol $\delta_0(\cdot)$ stands for a Dirac measure.
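Since the average log-affinity plays a central role in the analysis below, the following sketch evaluates $\rho(f, g)$ for two multivariate normal densities using the standard closed form (the same expression appears in the proofs of Section 5); the inputs are illustrative.

```python
import numpy as np

def neg_log_affinity(mu1, S1, mu2, S2):
    """rho(f, g) = -log int sqrt(f g) for f = N(mu1, S1), g = N(mu2, S2)."""
    Sbar = 0.5 * (S1 + S2)
    quad = 0.125 * (mu1 - mu2) @ np.linalg.solve(Sbar, mu1 - mu2)
    logdet = 0.5 * np.log(np.linalg.det(Sbar)) \
        - 0.25 * (np.log(np.linalg.det(S1)) + np.log(np.linalg.det(S2)))
    return quad + logdet

print(neg_log_affinity(np.zeros(2), np.eye(2), np.ones(2), 2 * np.eye(2)))
```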
3. Prior specifications

In this section, we introduce the priors used in this study. We place two independent priors on $\beta$ and $\Sigma$, as both are unknown: a product of $d$ independent spike-and-slab priors on $\beta$ and a Wishart prior on $\Sigma^{-1}$, the precision matrix.
3.1. Prior for regression coefficients

We denote the $k$-th column of $\beta$ by $\beta_k$, and write $\beta_{S_k}$ and $\beta_{S_k^c}$ for the collections of the regression coordinates in the non-zero groups and the zero groups respectively. Each spike-and-slab prior is constructed as follows. First, a dimension $s_k$ is chosen from a prior $\pi_G$ on the set $\{0, 1, \dots, G\}$. Next, a subset $S_k$ of cardinality $s_k$ is randomly chosen from the set $\{1, \dots, G\}$. Finally, a vector $\beta_{S_k} = \{\beta_j \mathbb{1}(j \in S_k)\}$ is chosen from a probability density $g_{S_k}$ on $\mathbb{R}^{p_{S_k}}$ given by (3.3). The remaining coordinates $\beta_{S_k^c}$ are set to $0$. To summarize, the prior for $\beta$ is
\[ \pi(S_1, \dots, S_d, \beta) = \prod_{k=1}^d \pi(S_k, \beta_k), \tag{3.1} \]
where $\pi(S_k, \beta_k) = \pi_G(|S_k|)\binom{G}{|S_k|}^{-1} g_{S_k}(\beta_{S_k})\, \delta_0(\beta_{S_k^c})$ and $\pi_G(|S_k|)$ is the prior on the dimension $s_k = |S_k|$.
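A draw from this prior can be generated group by group: given $S_k$, each non-zero group $\beta_j$ has the slab density specified in (3.3) below, whose polar form (Lemma A.1) makes the radius $\mathrm{Gamma}(p_j, \lambda_k)$-distributed and the direction uniform on the sphere. The sketch below illustrates this; the group size and $\lambda_k$ are arbitrary illustrative values.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

def log_norm_const_a(p_j):
    """log a_j = 0.5*log(pi) + [log Gamma(p_j+1) - log Gamma(p_j/2+1)] / p_j  (Lemma A.1)."""
    return 0.5 * np.log(np.pi) + (gammaln(p_j + 1) - gammaln(p_j / 2 + 1)) / p_j

def draw_slab_group(p_j, lam):
    """Draw beta_j with density (lam/a_j)^{p_j} exp(-lam * ||beta_j||):
    radius ~ Gamma(shape=p_j, rate=lam), direction uniform on the unit sphere."""
    r = rng.gamma(shape=p_j, scale=1.0 / lam)
    u = rng.normal(size=p_j)
    return r * u / np.linalg.norm(u)

beta_j = draw_slab_group(p_j=4, lam=2.0)       # illustrative values
print(np.exp(log_norm_const_a(4)), beta_j)
```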
Assumption 1 (Prior on dimension). There are positive constants $A_1, A_2, A_3, A_4$ such that
\[ A_1 G^{-A_3}\, \pi_G(s_k - 1) \le \pi_G(s_k) \le A_2 G^{-A_4}\, \pi_G(s_k - 1), \tag{3.2} \]
for $s_k = 1, \dots, G$, $k = 1, \dots, d$. For example, the complexity prior of Castillo et al. (2015), with $p$ replaced by $G$, satisfies the above assumption.

The Laplace density (Castillo et al., 2015) or the Cauchy density (Castillo and Mismer, 2018) is generally chosen as $g$, since the normal density has too sharp a tail and overshrinks the non-zero coefficients, although some empirical Bayes modifications of the mean can overcome the issue (see Martin et al., 2017; Belitser and Ghosal, 2017). However, in our setting, as sparsity is imposed at the group level, a more natural choice of the prior is a density which incorporates the $\ell_2/\ell_1$-norm. We thus consider the prior
\[ g(\beta_k) = \Big( \prod_{j=1}^G \Big(\frac{\lambda_k}{a_j}\Big)^{p_j} \Big) \exp\big(-\lambda_k \|\beta_k\|_{2,1}\big), \tag{3.3} \]
where $a_j = \sqrt{\pi}\, \big(\Gamma(p_j + 1)/\Gamma(p_j/2 + 1)\big)^{1/p_j} \ge 2$ (see Lemma A.1). This density has a lighter tail than the corresponding Laplace density. From Stirling's approximation, it follows that $a_j = O(p_j^{1/2})$. We would like to mention that Xu and Ghosh (2015) developed a posterior computational strategy for a similar prior. They also incorporated the $\ell_2/\ell_1$-norm into their prior, except that they did not provide the explicit expression of the normalizing constant for that prior.

The tuning parameter $\lambda_k$ needs to be bounded both from above and from below. The value of $\lambda_k$ cannot be too large, or it will shrink the non-zero coordinates too much towards $0$.
It should not be too small, because a very small value will fail to prevent many false signals from appearing in the model, making the posterior contract more slowly. The permissible upper and lower bounds are stated below.

Assumption 2. For each $k = 1, \dots, d$, $\underline{\lambda}_k \le \lambda_k \le \bar\lambda$, where
\[ \underline{\lambda}_k = G^{-s_{0,k}/p_{S_{0,k}}} \sqrt{\sum_{i=1}^n \|X_i\|^2}, \qquad p_{S_{0,k}} = \sum_{j\in S_{0,k}} p_j, \qquad \bar\lambda = 3\sqrt{\sum_{i=1}^n \|X_i\|^2 \log G}. \tag{3.4} \]
The lower bound of $\lambda_k$ is derived from (5.12). Suppose that $G = p$ (when each group has only one element); then the lower bound reduces to $\underline{\lambda}_k = \sqrt{\sum_{i=1}^n \|X_i\|^2}/p$, which is analogous to the lower bound displayed in Castillo et al. (2015). The upper bound of $\lambda_k$ is motivated by the following lemma.

Lemma 3.1. Under Assumption 2,
\[ P_0\Big( \sum_{i=1}^n \big\| X_i'(Y_i - X_i\beta_0)\Sigma_0^{-1/2} \big\|_F^2 \ge d\bar\lambda^2 \Big) \to 0. \tag{3.5} \]
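As a quick numerical illustration of Lemma 3.1, the sketch below computes $\bar\lambda$ from (3.4) and checks that the statistic in (3.5) falls below $d\bar\lambda^2$ on simulated data at the truth; the dimensions and the design are arbitrary and only meant to make the event concrete.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p = 200, 3, 500                      # illustrative sizes; ungrouped case G = p
G = p
X = rng.normal(size=(n, p))
Sigma0 = np.eye(d)
lam_bar = 3 * np.sqrt(np.sum(np.linalg.norm(X, axis=1) ** 2) * np.log(G))

# residuals at the truth: Y_i - X_i beta_0 = eps_i ~ N(0, Sigma0)
E = rng.multivariate_normal(np.zeros(d), Sigma0, size=n)
stat = sum(np.linalg.norm(np.outer(X[i], E[i]), 'fro') ** 2 for i in range(n))
print(stat <= d * lam_bar ** 2)            # the event in (3.5) should hold with high probability
```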
3.2. Prior for the covariance matrix

We put a Wishart prior on the precision matrix: $\Sigma^{-1} \sim W_d(\nu, \Phi)$, where $W_d$ stands for a $d$-dimensional Wishart distribution, $\Phi$ is a symmetric positive definite matrix, $\nu > d - 1$, $\nu \asymp d$, and $d \to \infty$. Although other priors can be used, the Wishart prior is the most commonly used prior in practice, as it is conjugate for the multivariate normal likelihood.
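For computation, the conjugacy mentioned above means that, conditionally on $\beta$, the precision matrix has a Wishart full conditional: with residuals $\varepsilon_i = Y_i - X_i\beta$, a $W_d(\nu, \Phi)$ prior updates to $W_d(\nu + n, (\Phi^{-1} + \sum_i \varepsilon_i'\varepsilon_i)^{-1})$. The sketch below illustrates this standard conjugate step with arbitrary inputs; it is not the paper's algorithm.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(2)
n, d, nu = 50, 3, 5                        # illustrative; nu > d - 1
Phi = np.eye(d)                            # prior scale matrix
resid = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # stand-in for Y_i - X_i beta

S = resid.T @ resid                        # sum of eps_i' eps_i
post_df = nu + n
post_scale = np.linalg.inv(np.linalg.inv(Phi) + S)
Omega_draw = wishart.rvs(df=post_df, scale=post_scale, random_state=rng)  # draw of Sigma^{-1}
Sigma_draw = np.linalg.inv(Omega_draw)
```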
4. Main results

4.1. Posterior contraction rate

We study the posterior contraction rate for the model (1.1) and the priors given in Section 3. We write $\beta_0$ and $\Sigma_0$ for the true values of $\beta$ and $\Sigma$ respectively. Recall that $S_{0,k}$ is the set of indices of the true non-zero groups of $\beta_k$ and $s_{0,k} = |S_{0,k}|$ is the cardinality of $S_{0,k}$. Let $S_0 = \cup_{k=1}^d S_{0,k}$ and $s_0 = \sum_{k=1}^d s_{0,k}$. Define the set $\mathcal{S} = \{S : |S_k| = s_{0,k},\ k = 1, \dots, d\}$.

The general theory of posterior contraction for independent non-identically distributed observations (see Theorem 8.23 of Ghosal and van der Vaart, 2017) is often used to derive a posterior contraction rate, and it is based on the average squared Hellinger distance. Since a small average squared Hellinger distance between multivariate normal densities with an unknown covariance does not necessarily imply that the parameters in the
two densities are also close on average, we work directly with the average log-affinity, which is still very tractable in the multivariate normal setting.

To derive a posterior contraction rate, we construct a suitable test from first principles by breaking up the effective parameter space, given by a sequence of appropriate sieves under the alternative hypothesis, into several pieces. For each piece sufficiently separated from the truth, we pick a representative and obtain a most powerful test (i.e., the Neyman-Pearson test) for the truth against that representative. We bound the moments of the likelihood ratio of an arbitrary density in the piece to the density of the representative of that piece, to show that the most powerful test for the truth against the representative has adequate power for any alternative in the corresponding piece. With this approach, we require the true values of $\beta_0$ and $\Sigma_0$ to be restricted to certain regions, to ensure that the prior concentration around the true point is not too small, so that the posterior contraction rate is sufficiently fast. This is unlike Castillo et al. (2015), who obtained results uniformly over the whole space, as their case (univariate with known variance and Laplace prior) allows explicit expressions for a direct treatment. More precisely, we require $\beta_0 \in \mathcal{B}_0$ and $\Sigma_0 \in \mathcal{H}_0$, where $\mathcal{B}_0$ and $\mathcal{H}_0$ are given in the following assumption.

Assumption 3. The true values satisfy $\beta_0 \in \mathcal{B}_0$ and $\Sigma_0 \in \mathcal{H}_0$, where
\[ \mathcal{B}_0 = \big\{\beta : \max_{1 \le k \le d} \|\beta_k\|_{2,1} \le \bar\beta \big\}, \qquad \mathcal{H}_0 = \{\Sigma : b_1 I_d \le \Sigma \le b_2 I_d\}, \tag{4.1} \]
$b_1, b_2$ are two fixed positive values, $\bar\beta = (s_0 \log G \vee p_{\max}\log n)/(d \max_{1\le k\le d}\lambda_k)$, $\epsilon_n$ is given in (4.4), and $\lambda_k$ satisfies Assumption 2.

The largest value of $\bar\beta$ is obtained by taking $\lambda_k = \underline\lambda_k$ for all $k$ and $G = p$, which is the ungrouped case. Then the upper bound becomes $\bar\beta = p s_0 \log p /\big(d\sqrt{\sum_{i=1}^n \|X_i\|^2}\big)$. For the sake of simplicity, we assume that $n^{-1}\sum_{i=1}^n \|X_i\|^2$ is bounded above by a large constant. Then that upper bound increases to infinity very quickly, since $p \gg n$. When $G \ll p$, the upper bound increases at a slower rate, as $p$ increases, than the bound for $G = p$.

Theorem 4.1. For the model (1.1) and the priors given in Section 3, suppose that $\sum_{i=1}^n \|X_i\|^2 \le n b_3$ for a fixed positive number $b_3$, $d^2 \log n \ll n \ll p$, $s_0 p_{\max}\log n \ll n$, where $p_{\max} = \max(p_1, \dots, p_G)$, and that Assumptions 1-3 hold. Then, for $M_1 > 0$ sufficiently large,
\[ \Pi\Big(\beta : \sum_{i=1}^n \|X_i(\beta - \beta_0)\|^2 \ge M_1 n\epsilon_n^2 \,\Big|\, Y_1, \dots, Y_n\Big) \to 0, \tag{4.2} \]
\[ \Pi\big(\Sigma : \|\Sigma - \Sigma_0\|_F^2 \ge M_1 \epsilon_n^2 \,\big|\, Y_1, \dots, Y_n\big) \to 0, \tag{4.3} \]
where
\[ \epsilon_n = \sqrt{\frac{s_0 \log G}{n}} \vee \sqrt{\frac{s_0 p_{\max}\log n}{n}} \vee \sqrt{\frac{d^2 \log n}{n}}. \tag{4.4} \]
Remark 1. A major contribution we make in proving this theorem is the construction of exponentially powerful tests for the truth against the complement of a ball, by splitting the complement into suitable pieces (not necessarily balls) in which we can control a moment of the likelihood ratio between two points within each piece. This gives a general technique for constructing the tests required for the application of the general theory, which can be useful in many other problems.

Remark 2. Instead of using the prior given in (3.3), one can also choose a Laplace density for the coordinates in the non-zero groups. Then the $\ell_2/\ell_1$-norm of $\beta_{0,k}$, $\|\beta_{0,k}\|_{2,1}$, in the set $\mathcal{B}_0$ should be replaced by $\|\beta_{0,k}\|_1$. Clearly, $\|\beta_{0,k}\|_{2,1} \le \|\beta_{0,k}\|_1$, hence in the latter case the set $\mathcal{B}_0$ will be smaller. In fact, one can replace the $\ell_2/\ell_1$-norm with any other $\ell_q/\ell_1$-norm, $1 \le q \le \infty$; then the norm of $\beta_{0,k}$ in $\mathcal{B}_0$ needs to be adjusted accordingly.

Remark 3. When $G = p$, the rate reduces to $\epsilon_n = \sqrt{(s_0 \log p)/n} \vee \sqrt{(d^2 \log n)/n}$. The first part of the rate is the same as the rate obtained when sparsity is imposed at the individual level, as in Bühlmann and van de Geer (2011) and Castillo et al. (2015). When $G \ll p$, the first rate is obtained when $s_0 p_{\max}\log n = O(s_0 \log G)$ (i.e., when the number of coordinates in each group is fixed) and $d$ grows sufficiently slowly.

Remark 4. The second rate in (4.4) reveals that in some situations, by imposing group sparsity, the posterior will contract at a slower rate than when sparsity is imposed at the individual level. This happens when too many zeros are put into non-zero groups, a situation often known as weak group sparsity (Huang and Zhang, 2010).

From Theorem 4.1, if the dimension of the covariance matrix is too large, then the posterior contraction rate can be much slower. In such a situation, the rate may be improved if the precision matrix is known to have special structure. Here we give two examples.

Example 1 (Independent responses). If the responses are independent across components, then the model (1.1) can be written as $d$ independent models, each of the form
\[ \frac{1}{\sigma_k} Y_{ik} = \frac{1}{\sigma_k} X_i \beta_k + \xi_{ik}, \qquad \xi_{ik} \sim N(0, 1). \]
Then one can estimate the parameters in the $d$ models separately. The posterior contraction rate for the corresponding posteriors becomes $\epsilon_n = \sqrt{\sum_{k=1}^d \epsilon_{n,k}^2}$, where
\[ \epsilon_{n,k} = \sqrt{\frac{s_{0,k}\log G}{n}} \vee \sqrt{\frac{s_{0,k} p_{\max}\log n}{n}}, \qquad k = 1, \dots, d. \]

Example 2 (Sparse precision matrix). The third rate in $\epsilon_n$ may be improved if the precision matrix is appropriately sparse. Banerjee and Ghosal (2014) showed that when
the matrix has an exact banding structure with banding size $k$, then using an appropriate $G$-Wishart prior, the posterior for the precision matrix $\Sigma^{-1}$ contracts at the rate $k^{5/2}(\log n/n)^{1/2}$ with respect to the spectral norm. When the sparsity does not possess a specific structure, Banerjee and Ghosal (2015) showed that the rate reduces from $d/\sqrt{n}$ to $\sqrt{(d + m)\log d/n}$ with respect to the Frobenius norm, where $m$ is the number of non-zero off-diagonal elements.

As a consequence of posterior contraction near the truth, the following estimate is easily obtained (see page 200 of Ghosal and van der Vaart, 2017).

Lemma 4.2. For positive constants $C_1$ and $C_2$, and the rate $\epsilon_n^2$ in (4.4), define the event
\[ E_n = \Big\{ \int\!\!\int \prod_{i=1}^n \frac{f_i}{f_{0,i}}(Y_i)\, d\Pi(\beta)\, d\Pi(\Sigma) \ge e^{-C_1 n\epsilon_n^2} \Big\}; \]
then
\[ P_0(E_n^c) \le \exp\big(-(1 + C_2)\, n\epsilon_n^2\big). \tag{4.5} \]
4.2. Dimensionality and recovery

In this section, we study the dimensionality and recovery properties of the marginal posterior of $\beta$.

Lemma 4.3 (Dimension). Let a prior $\pi_G(s_k)$ satisfying (3.2) for all $k = 1, \dots, d$ be given. Assume that $s_0 p_{\max}\log n \ll n$, $s_k^\star = \max\{s_{0,k},\ s_{0,k} p_{\max}\log n/\log G,\ d\log n/\log G\}$, and $\log d < A_4 \log G$. Then for a sufficiently large number $M_2 \ge 2(1 + C_2)/A_4 + 1$,
\[ \sup_{\beta\in\mathcal{B}_0,\,\Sigma\in\mathcal{H}_0} \mathrm{E}_0\, \Pi\big(\beta_k : |S_k| < M_2 s_k^\star,\ k = 1,\dots, d \,\big|\, Y_1, \dots, Y_n\big) \to 1. \tag{4.6} \]
Lemma 4.3 also implies that the sum of the cardinalities of the non-zero groups in the $d$ different columns will not exceed $s^\star = \sum_{k=1}^d s_k^\star$. We state this result in the following corollary.

Corollary 4.4. Under the setup of Lemma 4.3, with $s^\star = n\epsilon_n^2/\log G$,
\[ \sup_{\beta\in\mathcal{B}_0,\,\Sigma\in\mathcal{H}_0} \mathrm{E}_0\, \Pi\big(\beta : |S| \ge M_2 s^\star \,\big|\, Y_1, \dots, Y_n\big) \to 0. \tag{4.7} \]
From Corollary 4.4, $s^\star > s_0$ when either $s_0 p_{\max}\log n/\log G \gg s_0$ or $d^2\log n/\log G \gg s_0$. This means that the support of the posterior can substantially overshoot the true dimension $s_0$. In the next corollary, we show that the posterior is still able to recover $\beta_0$ even when $s^\star > s_0$.
Corollary 4.5 (Recovery). Under Assumption 2, if $s_0 p_{\max}\log n \ll n$, then for a sufficiently large constant $M_3 > 0$,
\[ \sup_{\beta\in\mathcal{B}_0,\,\Sigma\in\mathcal{H}_0} \mathrm{E}_0\, \Pi\Big( \|\beta - \beta_0\|_F^2 \ge \frac{M_3\, n\epsilon_n^2}{\sum_{i=1}^n \|X_i\|^2\, \phi_{\ell_2}^2(s^*)} \,\Big|\, Y_1, \dots, Y_n\Big) \to 0, \tag{4.8} \]
where $\phi_{\ell_2}^2(s^*)$ is the restricted eigenvalue (see Definition 4.6 below).

Definition 4.6 (Restricted eigenvalue). The smallest scaled singular value of dimension $\tilde s$ is defined as
\[ \phi_{\ell_2}^2(\tilde s) = \inf\Big\{ \frac{\sum_{i=1}^n \|X_i\beta\|^2}{\sum_{i=1}^n \|X_i\|^2\, \|\beta\|_F^2} : 0 \le s \le \tilde s \Big\}. \tag{4.9} \]
As $p \gg n$, the smallest eigenvalue of the design matrix must be $0$. The restricted eigenvalue condition assumes that the smallest eigenvalue of the sub-matrix of the design matrix corresponding to the coefficients within non-zero groups is not $0$. Results for other norms of the difference between $\beta$ and $\beta_0$ can also be derived under different assumptions on the smallest eigenvalue of that sub-matrix. For example, using the uniform compatibility condition (Definition 4.7 below), we can conclude that for a sufficiently large number $M_4 > 0$,
\[ \sup_{\beta\in\mathcal{B}_0,\,\Sigma\in\mathcal{H}_0} \mathrm{E}_0\, \Pi\Big( \sum_{k=1}^d \|\beta_k - \beta_{0,k}\|_{2,1}^2 \ge \frac{M_4\, n\epsilon_n^2\, s^\star}{\sum_{i=1}^n \|X_i\|^2\, \phi_{\ell_{2,1}}^2(s^*)} \,\Big|\, Y_1, \dots, Y_n\Big) \to 0. \]
The proof is almost identical to that of Corollary 4.5.

Definition 4.7 (Uniform compatibility, $\ell_2/\ell_1$-norm). The $\ell_{2,1}$-compatibility number in vectors of dimension $\tilde s$ is defined as
\[ \phi_{\ell_{2,1}}^2(\tilde s) = \inf\Big\{ \frac{s \sum_{i=1}^n \|X_i\beta\|^2}{\sum_{i=1}^n \|X_i\|^2\, \sum_{k=1}^d \|\beta_k\|_{2,1}^2} : 0 \le s \le \tilde s \Big\}. \tag{4.10} \]

By the Cauchy-Schwarz inequality, $\|\beta_k\|^2 \ge \|\beta_k\|_{2,1}^2/s_k$, and it follows that $\phi_{\ell_2}(\tilde s) \le \phi_{\ell_{2,1}}(\tilde s)$ for any $\tilde s \ll G$.
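For small problems, the quantities in Definitions 4.6 and 4.7 can be evaluated directly by enumerating group supports. The brute-force sketch below computes $\phi_{\ell_2}^2(\tilde s)$ in this way, interpreting the infimum as ranging over vectors supported on at most $\tilde s$ groups; it is only feasible for a handful of groups and is meant purely as an illustration of the definition.

```python
import numpy as np
from itertools import combinations

def restricted_eig_sq(X, groups, s_tilde):
    """Brute-force phi_{l2}^2(s_tilde): for every support T of at most s_tilde groups,
    take the smallest eigenvalue of X_T' X_T, then scale by sum_i ||X_i||^2 as in (4.9)."""
    scale = np.sum(np.linalg.norm(X, axis=1) ** 2)
    best = np.inf
    for s in range(1, s_tilde + 1):
        for T in combinations(range(len(groups)), s):
            cols = np.concatenate([groups[j] for j in T])
            gram = X[:, cols].T @ X[:, cols]
            best = min(best, np.linalg.eigvalsh(gram)[0])
    return best / scale

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 12))                      # illustrative: 6 groups of 2 columns
groups = [np.arange(2 * j, 2 * j + 2) for j in range(6)]
print(restricted_eig_sq(X, groups, s_tilde=2))
```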
4.3. Distributional approximation

In this section, we show that the posterior distribution can be approximated by a mixture of multivariate normal densities. We first rewrite the model (1.1) as
\[ Y_i = \mathrm{Vec}(\beta)\, \boldsymbol{X}_i + \varepsilon_i, \qquad i = 1, \dots, n, \tag{4.11} \]
where $\mathrm{Vec}(\beta)$ is the $1 \times pd$ vector obtained by stacking all the columns of $\beta$ into a row vector and $\boldsymbol{X}_i = I_d \otimes X_i'$ is a $pd \times d$ block diagonal matrix. The above model can also be written as
\[ Y_i = \mathrm{Vec}(\beta_S)\, \boldsymbol{X}_{i,S} + \varepsilon_i, \qquad i = 1, \dots, n, \]
where $\mathrm{Vec}(\beta_S)$ is a $1 \times (\sum_{k=1}^d \sum_{j\in S_k} p_j)$ vector consisting of the coordinates of $\beta$ in the set $S$, and $\boldsymbol{X}_{i,S}$ is the corresponding $(\sum_{k=1}^d \sum_{j\in S_k} p_j) \times d$ submatrix of $\boldsymbol{X}_i$. The log-likelihood function is given by
\[ \ell_n(\mathrm{Vec}(\beta_S), \Sigma) = \sum_{i=1}^n \log f\big(Y_i \mid \mathrm{Vec}(\beta_S)\boldsymbol{X}_{i,S}, \Sigma\big) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log\det(\Sigma) - \frac{1}{2}\sum_{i=1}^n \big(Y_i - \mathrm{Vec}(\beta_S)\boldsymbol{X}_{i,S}\big)\Sigma^{-1}\big(Y_i - \mathrm{Vec}(\beta_S)\boldsymbol{X}_{i,S}\big)'. \tag{4.12} \]
If $\sum_{k=1}^d \sum_{j\in S_k} p_j \ll n$, then the maximum likelihood estimator (MLE) of $\mathrm{Vec}(\beta_S)$ is unique. We denote the MLE by $\mathrm{Vec}(\hat\beta_S)$ and the Fisher information matrix by $\mathrm{I}_{n,S}$. From (4.12), we obtain $\mathrm{Vec}(\hat\beta_S) = \big(\sum_{i=1}^n \boldsymbol{X}_{i,S}\boldsymbol{X}_{i,S}'\big)^{-1}\sum_{i=1}^n \boldsymbol{X}_{i,S} Y_i'$ and $\mathrm{I}_{n,S} = n^{-1}\sum_{i=1}^n \boldsymbol{X}_{i,S}\Sigma_0^{-1}\boldsymbol{X}_{i,S}'$. Based on the model (4.11), the marginal posterior distribution of $\beta$ is
\[ \Pi(B \mid Y_1, \dots, Y_n) = \frac{\int\!\!\int_B \exp\big\{\ell(\mathrm{Vec}(\beta), \Sigma) - \ell(\mathrm{Vec}(\beta_0), \Sigma_0)\big\}\, d\Pi(\mathrm{Vec}(\beta))\, d\Pi(\Sigma)}{\int\!\!\int \exp\big\{\ell(\mathrm{Vec}(\beta), \Sigma) - \ell(\mathrm{Vec}(\beta_0), \Sigma_0)\big\}\, d\Pi(\mathrm{Vec}(\beta))\, d\Pi(\Sigma)}, \tag{4.13} \]
with
\[ d\Pi(\mathrm{Vec}(\beta)) = \prod_{k=1}^d \sum_{S_k\subseteq\{1,\dots,G\}} \prod_{j\in S_k} \Big(\frac{\lambda_k}{a_j}\Big)^{p_j} \exp\big(-\lambda_k\|\beta_{S_k}\|_{2,1}\big)\, d\beta_{S_k} \otimes \delta_{S_k^c}. \]
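The MLE and Fisher information defined after (4.12) are straightforward to compute for a given support. The sketch below forms the matrices $\boldsymbol{X}_{i,S} = I_d \otimes X_{i,S}'$ and returns $\mathrm{Vec}(\hat\beta_S)$ and $\mathrm{I}_{n,S}$ following the formulas above; the support and all inputs are placeholders, so this is a schematic illustration rather than the paper's implementation.

```python
import numpy as np

def mle_and_info(X_rows, Y_rows, Sigma0_inv, S_cols):
    """Compute Vec(beta_hat_S) and I_{n,S} for the columns in S_cols:
    X_{i,S} = I_d kron X_{i,S}' is (|S| d) x d, and Vec(beta_S) X_{i,S} = X_{i,S} beta_S."""
    n, d = Y_rows.shape
    XS = [np.kron(np.eye(d), X_rows[i, S_cols][:, None]) for i in range(n)]
    A = sum(Xi @ Xi.T for Xi in XS)                       # sum_i X_{i,S} X_{i,S}'
    b = sum(XS[i] @ Y_rows[i] for i in range(n))          # sum_i X_{i,S} Y_i'
    vec_beta_hat = np.linalg.solve(A, b)                  # MLE, columns of beta_S stacked
    I_nS = sum(Xi @ Sigma0_inv @ Xi.T for Xi in XS) / n   # Fisher information
    return vec_beta_hat, I_nS

# illustrative use with a placeholder support of 3 columns
rng = np.random.default_rng(6)
Xr, Yr = rng.normal(size=(40, 10)), rng.normal(size=(40, 2))
bhat, info = mle_and_info(Xr, Yr, np.eye(2), np.array([0, 1, 4]))
```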
It is clear that the posterior distribution is a mixture density over different subsets $S_1, \dots, S_d$. Let $\mathcal{S}_k^\star = \{S_k \subseteq \{1, \dots, G\} : |S_k| \le M_2 s_k^\star\}$, $k = 1, \dots, d$. In the next theorem, we show that the posterior $\Pi(B \mid Y_1, \dots, Y_n)$ can be approximated by the mixture of multivariate normal densities given by
\[ \Pi^\infty(B \mid Y_1, \dots, Y_n) \propto \prod_{k=1}^d \sum_{S_k\in\mathcal{S}_k^\star} w_{S_k}^\infty\, N\big(\mathrm{Vec}(\hat\beta_{S_k}),\, \mathrm{I}_{n,S_k}^{-1}\big) \otimes \delta_{S_k^c}, \tag{4.14} \]
where
\[ w_{S_k}^\infty \propto \pi_G(s_k)\binom{G}{s_k}^{-1} \prod_{j\in S_k}\Big(\frac{\lambda_k}{a_j}\Big)^{p_j} (2\pi)^{\sum_{j\in S_k}p_j/2}\, \det\Big(\sum_{i=1}^n \boldsymbol{X}_{i,S_k}\Sigma_0^{-1}\boldsymbol{X}_{i,S_k}'\Big)^{-1/2} \exp\Big\{\frac{1}{2}\sum_{i=1}^n \big\|\mathrm{Vec}(\hat\beta_{S_k})\boldsymbol{X}_{i,S_k}\Sigma_0^{-1/2}\big\|^2\Big\}\, \mathbb{1}\{S_k\in\mathcal{S}_k^\star\}, \tag{4.15} \]
with $\sum_{S_k} w_{S_k}^\infty = 1$ for all $k = 1, \dots, d$.

Before we state the theorem, one more piece of terminology should be introduced. We recall the notion of the small $\lambda$ region (see Castillo et al., 2015). In our setting, each $\lambda_k$ belongs to the small $\lambda$ region if $\max_{1\le k\le d} \lambda_k \sqrt{s_k^\star \log G}\big/\sqrt{\sum_{i=1}^n \|X_i\|^2} \to 0$. When $\lambda_k$ belongs to this region, the MLE $\mathrm{Vec}(\hat\beta_S)$ is an asymptotically unbiased estimator and does not depend on the choice of the value of $\lambda_k$. When the value of $\lambda_k$ is chosen outside the small $\lambda$ region, this MLE is no longer asymptotically unbiased and will depend on the choice of $\lambda_k$ (cf. Theorem 11 of the supplementary material of Castillo et al., 2015). As a result, the posterior will concentrate near a distribution whose center differs substantially for different values of $\lambda_k$.

Theorem 4.8 (Distributional approximation). For $k = 1, \dots, d$, if $\pi_G(s_k)$ satisfies (3.2) and $\max_k \lambda_k\sqrt{s_k^\star\log G}\big/\sqrt{\sum_{i=1}^n \|X_i\|^2} \to 0$, then for a positive constant $c$,
\[ \sup_{\beta_0\in\mathcal{B}_0 : \phi_{\ell_2}(s^*) > c,\ \Sigma_0\in\mathcal{H}_0} \big\|\Pi(\cdot \mid Y_1, \dots, Y_n) - \Pi^\infty(\cdot \mid Y_1, \dots, Y_n)\big\|_{TV} \to 0. \tag{4.16} \]
4.4. Selection In the previous two sections, we have shown that even if s∗ > s0 , the marginal posterior of β can recover the truth and can be approximated by a mixture of multivariate normal densities. In this section, we derive conditions for selection consistency. Since selection consistency requires s∗ = s0 , we need to assume the dimension of the covariance and the coordinates in the non-zero groups are sufficiently small. We also need to assume that the smallest signal cannot be too small, which is a group sparse version of the Beta-min condition. This condition is stated in below: s ( ) M3 nǫ2n ˜ B= min kβjk k ≥ Pn . (4.17) 2 2 j∈S0,k i=1 kXi k φℓ2 (s0 ) 1≤k≤d
The lower bound displayed in the condition is derived from (4.8). Unlike the Beta-min condition in Castillo et al. (2015) which the individual components are bounded away from 0, our condition allows a zero to be included in a non-zero group.
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
13
Bayesian variable selection for multivariate responses
Theorem 4.9 (Selection consistency). If πG (sk ) satisfies Assumption 1 for all k = √ maxk λk s0,k n log G pPn → 0, d2 log n . s0 log G, and log d ≤ A4 log G, then for 1, . . . , d, 2 kX k i i=1 sn ≤ GA4 −1 with A4 ≥ max(1, 2b2 ), where b2 is defined in (4.1), and a positive number c, sup ˜ 0,k |≤sn ,k=1,...,d, β∈B0 ∩B:|S φℓ2 (s0 )≥c,Σ∈H0
E0 Π(β : Sβ = Sβ0 |Y1 , . . . , Yn ) → 0.
If the conditions of Theorem 4.9 are satisfied, then the marginal posterior distribution of β in non-zero groups can be approximated by a multivariate normal distribution with ˆ ) and the covariance matrix I−1 = n Pn X i,S Σ−1 X ′ )−1 . Theremean Vec(β 0 S0 0 i,S0 i=1 0,S0 fore, credible intervals for β can be obtained directly from the approximating multivariate normal density.
5. Proofs

Proof of Lemma 3.1. Let $\xi_i = (Y_i - X_i\beta_0)\Sigma_0^{-1/2}$; then
\[ \sum_{i=1}^n \big\|X_i'(Y_i - X_i\beta_0)\Sigma_0^{-1/2}\big\|_F^2 = \sum_{i=1}^n \|X_i\|^2\|\xi_i\|^2 = \sum_{i=1}^n \|X_i\|^2 \sum_{k=1}^d \xi_{ik}^2. \]
Therefore, the probability in (3.5) can also be written as
\[ P_0\Big( \frac{\sum_{i=1}^n \|X_i\|^2 \sum_{k=1}^d (\xi_{ik}^2 - 1)}{\sqrt{2d\sum_{i=1}^n \|X_i\|^4}} \ge t \Big), \tag{5.1} \]
where $t = \big(\sum_{k=1}^d \bar\lambda^2 - d\sum_{i=1}^n \|X_i\|^2\big)\big/\big(2d\sum_{i=1}^n \|X_i\|^4\big)^{1/2}$. By (A.4), the probability (5.1) is bounded above by $2\exp\big(-t^2/\big(2(1 + \sqrt{2/d}\, t\, m(X))\big)\big)$, where $m(X) = \max_i \|X_i\|^2\big/\big(\sum_{i=1}^n \|X_i\|^4\big)^{1/2}$. Plugging the expressions for $t$ and $m(X)$ into this bound, the last display is bounded above by $2G^{-q}$ if
\[ \sum_{k=1}^d \bar\lambda^2 \ge 2\Big(dq\log G \sum_{i=1}^n \|X_i\|^4 + q^2\log^2 G\, \max_i \|X_i\|^4\Big)^{1/2} + 2q\log G\, \max_i \|X_i\|^2 + d\sum_{i=1}^n \|X_i\|^2. \tag{5.2} \]
By choosing $q = d$ and $\bar\lambda = 3\sqrt{\sum_{i=1}^n \|X_i\|^2\log G}$, (5.1) is bounded above by $G^{-d}$, which goes to $0$ as $G \to \infty$ or $d \to \infty$.
Ning and Ghosal
Proof of Theorem 4.1. The proof contains two parts. In the first part, we quantify prior concentration around the truth in the sense of Kullback-Leibler divergence from the true density. In the second part, using the results obtained from the first part, we derive (4.2) and (4.3). Part I. The method we use to obtain the posterior contraction rate is described as follows. We construct a test from the first principle by breaking up the effective parameter space into several pieces which are sufficiently separated from the truth. Then for each piece, we consider the likelihood ratio test for the truth against a representative in the piece. We show that this test works for the entire piece by controlling the likelihood ratio. Finally, we consider the maximum of these tests and control its size by estimating the total number of pieces. Let Fn be a suitable “sieve”. We shall verify that (5.3) Π K(f, f0 ) ≤ ǫ2n , V (f, f0 ) ≤ ǫ2n ≥ exp(−C1 nǫ2n ), Π(Fnc ) ≤ exp(−C2 nǫ2n ),
(5.4)
for positive constants C1 and CQ 2 > C1 + 2 and the following condition. Qn n Let f0 = i=1 f0,i and f = i=1 fi , where f0,i = N (Xi β0 , Σ0 ) and fi = N (Xi β, Σ). Then there exists a test φn such that 2
Ef0 φn ≤ e−nǫn ,
f ∈Fn :
Pn
2
sup
2 i=1 kXi (β−β 0 )k ≥nM1 ǫn , or kΣ−Σ0 k2F ≥M1 ǫ2n
where C3 is a positive constant. Consider the sieve n Fn = f : max |Sk | < s¯n , 1≤k≤d
2
Ef (1 − φn ) ≤ e−C3 nǫn ,
(5.5)
max kβjk k < pmax Hn ,
1≤j≤G, 1≤k≤d
o exp(−d log n) < eig1 (Σ−1 ), eigd (Σ−1 ) < n ,
where s¯n = max
s0 pmax log n d log n o , , d d log G log G
ns
0
,
Hn =
3¯ sn (log G ∨ pmax log n) . pmax (mink λk )
(5.6)
Recall that the expression of λk is shown in (3.4). First, we check (5.3). The Kullback-Leibler divergence between f and f0 is n 1X 1 −1 −1 ′ ′ Tr(Σ0 Σ) − d − log det(Σ0 Σ) + Xi (β − β0 )Σ−1 K(f, f0 ) = 0 (β − β 0 ) Xi , 2 n i=1
and the Kullback-Leibler variation between f and f0 is V (f, f0 ) =
1 −1 −1 Tr(Σ−1 0 ΣΣ0 Σ) − 2Tr(Σ0 Σ) + d 2
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
15
Bayesian variable selection for multivariate responses n
+
1X −1 ′ ′ Xi (β − β 0 )Σ−1 0 ΣΣ0 (β − β 0 ) Xi . n i=1
We define the two events A1 and A2 as follows: −1 2 A1 = Σ : Tr(Σ−1 0 Σ) − d − log det(Σ0 Σ) ≤ ǫn ,
−1 −1 2 Tr(Σ−1 0 ΣΣ0 Σ) − 2Tr(Σ0 Σ) + d ≤ ǫn , n n X ′ ′ 2 A2 = (β, Σ) : Xi (β − β0 )Σ−1 0 (β − β 0 ) Xi ≤ nǫn , i=1
n X i=1
o −1 ′ ′ 2 Xi (β − β0 )Σ−1 0 ΣΣ0 (β − β 0 ) Xi ≤ nǫn /2 .
Writing Π(K(f, f0 ) ≤ ǫ2n , V (f, f0 ) ≤ ǫ2n ) = Π(A2 |A1 )Π(A1 ), we shall derive lower bounds for Π(A1 ) and Π(A2 |A1 ) separately. 1/2 1/2 −1/2 −1/2 and note that Σ∗−1 ∼ Wd (ν, Σ0 ΦΣ0 ) as Σ−1 ∼ Define Σ∗ = Σ0 ΣΣ0 Wd (ν, Φ). Then n
A1 = Σ :
d X
k=1
d o X 2 2 eigk (Σ ) − 1 − log(eigk (Σ ) ≤ ǫn , eigk (Σ∗ ) − 1 ≤ ǫ2n . (5.7) ∗
∗
k=1
Td
Furthermore, we define A⋆1 = k=1 Σ : 1 ≤ eigk (Σ∗ ) ≤ 1 + d−1/2 ǫn . It is easy to verify that A1 ⊃ A⋆1 . By (A.10), we obtain that Π(A1 ) ≥ Π(A∗1 ) ≥ exp − c11 d log d − c12 d2 log d − d(d + 1) log(1/ǫn )/2 − c13 d , (5.8)
where c11 , c12 and c13 are positive constants. To derive a lower bound for Π(A2 |A1 ), we need the following two results. First, by Σ0 ≥ b 1 I d , n n 1X 1 X ′ ′ kXi (β − β0 )k2 . Xi (β − β 0 )Σ−1 (β − β ) X ≤ 0 i 0 n i=1 nb1 i=1
Second, we have n
1X −1 ′ ′ Xi (β − β 0 )Σ−1 0 ΣΣ0 (β − β 0 ) Xi n i=1
n n 1X 1X −1/2 −1/2 kXi (β − β0 )Σ0 k2 kΣ∗ − I d kF + kXi (β − β 0 )Σ0 k2 n i=1 n i=1 v ! u d n uX X 1 2 −1/2 2 kXi (β − β0 )Σ0 k 1 + t ≤ eigk (Σ∗ ) − 1 . n i=1
≤
k=1
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
16
Ning and Ghosal
Conditional on A1 and again, by Σ0 ≥ b1 I d , the last expression can be further bounded above by n n 2 X 1 + ǫn X −1/2 kXi (β − β 0 )k2 , kXi (β − β 0 )Σ0 k2 < n i=1 nb1 i=1
provided that ǫn < 1. Thus Π(A2 |A1 ) is bounded below by n n 1 X b1 ǫ2n b1 ǫ2n 1X Π kXi (β − β 0 )k2 ≤ ≥ Π max |Xi (βk − β0,k )|2 ≤ 1≤k≤d n n i=1 4 4d i=1 √ rn b1 ≥ Π max kβk − β0,k k2,1 ≤ , (5.9) 1≤k≤d 2 s nǫ2 Pn n . By (3.1), (5.9) can be further bounded below by where rn = d i=1 kXi k2 d Y
πG (s0,k )
k=1
1
G s0,k
Z
kβk −β0,k k2,1 ≤
rn
√ 2
b1
gsk (βSk )dβSk .
(5.10)
By changing the variable βSk − β0,Sk to βˇSk and using the fact that kxk ≤ kxk1 for any vector x, each integral in (5.10) is bounded below by λ pj Y Z ˇ k e−λk kβj k1 dβˇj e−λk kβ0,k k2,1 √ a ˇj k1 ≤ rn b1 j k β 2s0,k j∈S0,k ! λ r √b pj √ Y 1 k n −λk kβ0,k k2,1 −λk rn b1 /(2s0,k ) 1 ≥e e . pj ! s0,k aj j∈S0,k
The lower bound in the last inequality is obtained by using the result that the integrand equals √ to the probability of the first pj events of a Poisson process happen before time rn b1 /(2s0,k ) (similar to the argument used to derive (6.2) in Castillo et al., 2015). Now, (5.10) is bounded below by ! √ d Y 1 −λk kβ0,k k2,1 Y −λk rn √b1 /(2s0,k ) 1 λk rn b1 pj . πG (s0,k ) G e e pj ! s0,k aj s k=1
0,k
j∈S0,k
By Assumption 1, the last display can be further bounded below by
d d X X p As10 G−(1+A3 )s0 exp − λk kβ0,k k2,1 − λk rn b1 /2 k=1
×
d Y Y
k=1 j∈S0,k
√ 1 λk rn b1 pj . pj ! s0,k aj
k=1
(5.11)
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
17
Bayesian variable selection for multivariate responses
Combining the lower bounds (5.8) and (5.11), log Π(K(f, f0 ) ≤ ǫ2n , V (f, f0 ) ≤ ǫ2n ) is bounded below by − c11 d log d − c12 d2 log d − d(d + 1) log(1/ǫn )/2 − c13 d + s0 log A1 √ d d d X X p λk rn b1 X X − c14 s0 log G − λk kβ0,k k2,1 − pj log(λk rn b1 ) + 2 k=1
−
d X X
k=1 j∈S0,k
k=1
pj log(s0,k aj ) −
d X X
k=1 j∈S0,k
(5.12)
log pj !.
k=1 j∈S0,k
Let ǫ2n be as in (4.4). Since ǫ2n ≥ (d2 log n)/n, the sum of the first four terms in (5.12) is bounded below by a multiple of −nǫ2n . By Assumption 2, d X
λk rn
d X X p p pj log λk rn b1 . s0 log G ≤ nǫ2n , b1 /2 − k=1 j∈S0,k
k=1
as ǫ2n ≥ s0 (log G)/n. Also, since maxk kβ0,k k2,1 ≤ β with the expression of β is displayed Pd in (4.1), then k=1 λk kβ0,k k ≤ nǫ2n . Furthermore, since log(pj !) ≤ pj log pj and aj = 1/2 O(pj ), we obtain that d X X
pj log(s0,k aj ) +
d X X
k=1 j∈S0,k
k=1 j∈S0,k
log(pj !) ≤ 3
d X X
pj log pj . nǫ2n ,
k=1 j∈S0,k
as ǫ2n ≥ s0 pmax log n/n. This completes the verification of (5.3). Next, we verify (5.4). We obtain that Π(Fnc ) ≤
d X
k=1
Π(|Sk | ≥ s¯n ) +
d X X X k=1
Sk ∈Sn j∈Sk
Π(kβj,k k ≥ pmax Hn )
(5.13)
−1 −1 + Π eig1 (Σ⋆ ) ≤ exp(−d log n) + Π eigd (Σ⋆ ) ≥ n ,
where Sn = {S ⊆ {1, . . . , G} : |S| ≤ s¯n }. By (3.2), the first term in (5.13) is bounded above by d X X
sn k=1 sk ≥¯
π(sk ) ≤ dπ(¯ sn )
∞ X A2 j ≤ 2dAs2¯n G−A4 s¯n . A4 G j=0
To derive an upper bound for the second term in (5.13), we apply the the upper bound of the tail √ of a gamma density in page 29 of Boucheron et al. (2013) and the inequality 1 + x − 1 + 2x ≥ x2 /(2(1 + x)), for any x > 0, to obtain that Z λpkmax pmax −1 −λk r Π(kβjk k > pmax Hn ) = r e dr r>pmax Hn Γ(pmax ) imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
18
Ning and Ghosal ≤ exp − pmax (1 + λk Hn − pmax λ2k Hn2 , ≤ exp − 2(1 + λk Hn )
p 1 + 2λk Hn )
(5.14)
where j = 1, . . . , G, k = 1, . . . , d. We then plug-in (5.14) to obtain the upper bound, which is d X pmax λ2k Hn2 exp log d + 2d log s¯n + s¯n log G − . 2(1 + λk Hn ) k=1
For the summation of the third and the fourth terms in (5.13), we apply (A.7) and (A.8), then it is bounded above by exp(c25 d2 log d − c26 d2 log n) + exp(−d2 log d + c23 d2 log n + d2 /2 − c24 n), where c21 , c22 , . . . , c26 are positive constants. Now we combine the upper bounds for each term in (5.13), choose Hn in (5.6) and C2 ≥ C1 + 2. Then (5.13) is bounded above by exp(−C2 nǫ2n ). We thus obtain (5.4). Last, we verify (5.5). Consider the most powerful Neyman-Pearson test φn = 1{f1 /f0 ≥ R 1/2 1/2 1}. If the average negative log-affinity −n−1 log f0 f1 between f0 and f1 is bigger than ǫ2n , then s f Z p 2 1 Ef0 φn = Ef0 f0 f1 ≤ e−nǫn . ≥1 ≤ f0 This gives the first inequality in (5.5). For the second inequality in (5.5), observe that by the Cauchy-Schwarz inequality, f 2 o1/2 1/2 n Ef (1 − φn ) ≤ Ef1 (1 − φn ) Ef1 . f1
By following the similar arguments used in proving the first inequality in (5.5), we obtain that s Z p f 2 0 f0 f1 ≤ e−nǫn . ≥1 ≤ Ef1 (1 − φn ) = Ef1 f1 p For kβ1 − βkF ≤ δn = ǫ2n /(6nb3 ) and kΣ1 − Σk ≤ δn′ = 1/(n2 d), we have Ef1 (f /f1 )2 ≤ P 2 enǫn /2 . Recall that b3 is the upper bound for n−1 ni=1 kXi k2 . To end this, observe that f 2 n/2 −1 −n/2 = det(Σ⋆ ) det(2I − Σ⋆ ) Ef1 f1 (5.15) n X ′ Xi β⋆ Σ−1/2 (2Σ⋆ − I)−1 Σ−1/2 β⋆ Xi′ . × exp i=1
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
19
Bayesian variable selection for multivariate responses Now kΣ1 − Σk ≤ δn′ = 1/(n2 d) implies that kΣ⋆ − Ik ≤ kΣ−1 kkΣ1 − Σk ≤ nkΣ1 − ΣkF ≤ nδn′ .
Therefore, 1 − nδn′ ≤ kΣ⋆ k ≤ 1 + nδn′ . Writing eigk (Σ⋆ ) for the k-th eigenvalue of Σ⋆ , we obtain that n/2 −1 −n/2 det(Σ⋆ ) det(2I − Σ⋆ ) d d n X nX 1 = exp log eigk (Σ⋆ ) − log 2 − ⋆ 2 2 eigk (Σ ) k=1
≤ exp
d n X
2
≤ exp(n
2
k=1
log(1 + nδn′ ) −
k=1 dδn′ /4)
≤ exp(1/4),
n 2
d X
k=1
log 1 +
nδn′ 1 + nδn′
by the choice of δn′ ; here the second line is obtained by applying the inequalities 1−x−1 ≤ log x ≤ x − 1 for x > 0 and 1 + nδn′ < 2. Using the inequalities k(2Σ⋆ − I)−1 k ≤ −1 ≤ 2 and kΣ−1 k = eigd (Σ−1 ) ≤ n, we bound the exponential term of 2(1 − nδn′ ) − 1 (5.15) by n X i=1
kXi k2 kβ1 − βk2F kΣ−1 kk(2Σ⋆ − I)−1 k ≤ 6n2 b3 δn2 ≤ nǫ2n /2.
Hence, (5.15) is bounded above by exp(1/4 + nǫ2n /2). This verifies the second inequality in (5.5). Now, with kβ1 −βkF ≤ δn and kΣ1 −ΣkF ≤ nδn′ , the metric entropy can be calculated as follows: log N (ǫn , Fn , ρ) d Y X G 6pmax Hn Pj∈Sk pj ≤ log + d2 log 6d3/2 log n/(nδn′ ) |Sk | δn k=1 Sk ∈Sn √ 6p max Hn 6nb3 + d2 log(6nd5/2 log n) ≤ d log s¯n + d¯ sn log G + d¯ sn pmax log ǫn . d¯ sn log G + d¯ sn pmax log(pmax Hn ) + d¯ sn pmax log n + d2 log d + d2 log n . nǫ2n . This shows that the posterior Π Part II. Observing that n
Pn
i=1
1X ρ(fi , f0,i ) = − log n i=1
ρ(fi , f0,i ) & nǫ2n |Y1 , . . . , Yn → 0 in P0 -probability. 1/4 1/4 ! det(Σ) det(Σ0 ) 1/2 0 det Σ+Σ 2
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
20
Ning and Ghosal +
Then
Pn
i=1
n Σ + Σ −1 1 X 0 Xi (β − β 0 ) (β − β 0 )′ Xi′ . 8n i=1 2
ρ(fi , f0,i ) ≤ nǫ2n implies both − log
and
1/4 1/4 ! det(Σ) det(Σ0 ) ≤ ǫ2n , 1/2 0 det Σ+Σ 2
n Σ + Σ −1 1X 0 Xi (β − β0 ) (β − β 0 )′ Xi′ ≤ ǫ2n . n i=1 2
(5.16)
(5.17)
We first show the probability of (5.16) goes to 1 implies (4.3). Let 1/4 1/4 det(Σ) det(Σ0 ) 2 2 d (Σ, Σ0 ) = h N (0, Σ), N (0, Σ0 ) = 1 − . 1/2 0 det Σ+Σ 2
Because Σ0 has eigenvalues bounded away from zero and infinity, by Lemma 2 of Suarez and Ghosal (2017), we obtain that −1/2
d2 (Σ, Σ0 ) ≤ kΣ0
−1/2 2 kF
(Σ − Σ0 )Σ0
. d2 (Σ, Σ0 ).
(5.18)
Since − log
1/4 1/4 ! det(Σ) det(Σ0 ) = − log(1 − d2 (Σ, Σ0 )) ≍ d2 (Σ, Σ0 ), 1/2 Σ+Σ0 det 2
we obtain that kΣ − Σ0 k2F . ǫ2n . We now show that the probability (5.17) goes to 1 implies (4.2). Given (4.3) and by Assumption 3, we obtain that kΣ + Σ0 k2 = kΣ − Σ0 + 2Σ0 k2 ≤ 2kΣ − Σ0 k2F + 8kΣ0 k2 ≤ 2ǫ2n + 8b22 . Hence, by eig1
Σ+Σ Σ + Σ0 −1
Σ + Σ0 −1 0 −1 = eigd =
, 2 2 2
where eig1 (A) and eigd (A) are the smallest and the largest eigenvalues of a matrix A respectively. Then (5.17) implies that ǫ2n
n n q
−1 1X 1X 2 Σ + Σ0 ≥ kXi (β − β0 )k kXi (β − β0 )k2 / ǫ2n /2 + 2b22 .
≥ n i=1 2 n i=1
Combining with (4.3), we obtain (4.2).
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
21
Bayesian variable selection for multivariate responses
Proof of Lemma 4.3. Let S = {S : s1 < r1 , . . . , sd < rd }, where rk ≥ M2 s⋆k for k = 1, . . . , d, we need to show that E0 Π(S : S ∈ S c |Y1 , . . . , Yn ) → 0 as n → ∞. The posterior probability Π(S c |Y1 , . . . , Yn ) is given by R R Qn fi i=1 f (Yi )dΠ(β)dΠ(Σ) Sc c . (5.19) Π(S |Y1 , . . . , Yn ) = R R Qn fi0,i i=1 f0,i (Yi )dΠ(β)dΠ(Σ) 2
By Lemma 4.2, the denominator of (5.19) is bounded below by e−(1+C2 )nǫn with a large probability. For the numerator of the posterior (5.19), we have E0
n d Z Y X fi dΠ(β) ≤ Π(sk ≥ rk ). (Yi )dΠ(β)dΠ(Σ) ≤ S c i=1 f0,i Sc
Z Z
(5.20)
k=1
By Assumption 1 and A2 /GA4 ≤ 1/2 as G → ∞, for each k, Π(sk ≥ rk ) =
Therefore,
d X
k=1
∞ X
sk =rk
∞ A rk A rk −s0,k X A2 j 2 2 ≤ 2 . πG (sk ) ≤ πG (s0,k ) A A A4 4 G 4 G G j=0
Π(sk > rk ) ≤ 2
d X
A2 /GA4
k=1
rk
and we obtain that
E0 Π(S c |Y1 , . . . , Yn )
≤ E0 Π(S c |Y1 , . . . , Yn )1En + P0 (Enc ) ! d X r k ≤ exp (1 + C2 )nǫ2n + log 2 + log + o(1). A2 /GA4
(5.21)
k=1
Now we shall show that (5.21) goes to 0 as G → ∞. We write nǫ2n = the expression in the exponential function of (5.21) equals to log 2 +
d X
log
k=1
d X
k=1
Pd
⋆ k=1 sk
log G. Then
⋆ G(1+C2 )sk −A4 rk Ar2k .
(5.22)
For rk = M2 s⋆k , d X
k=1
⋆
⋆
M2 s⋆ k
G(1+C2 )sk −A4 M2 sk A2
⋆
M2 s⋆ k
≤ d max(G(1+C2 −A4 M2 )sk A2 k
) → 0,
as G → ∞ if log d ≤ A4 log G and M2 ≥ 2(1 + C2 )/A4 + 1. Therefore, (5.22) goes to ∞ and thus (5.21) goes to 0. We then complete the proof.
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
22
Ning and Ghosal
Proof of Corollary 4.5. By Lemma 4.3 and Definition 4.6, we have n X i=1
2
kXi (β − β 0 )k ≥
φ2ℓ2 (s⋆ )
n X i=1
kXi k2 kβ − β 0 k2F .
Plugging-in the inequality into (4.2), we obtain (4.8). Proof of Theorem 4.8. Let n o Pn 2 2 ⋆ Θn = β : |S| ≤ M2 s⋆ , kβ − β 0 k2F ≤ (M3 nǫ2n )/ kX k φ (s ) , i ℓ2 i=1 Hn = Σ : kΣ − Σ0 k2F ≤ M1 ǫ2n .
The proof contains two parts. In the first part, we show that the total variation distance between two measures Π(·|Y1 , . . . , Yn ) and ΠΘn (·|Y1 , . . . , Yn ) is small, where the second measure is the renomalized measure of Π(·|Y1 , . . . , Yn ) restricted to the set Θn . We also show that the total variation distance between Π∞ (·|Y1 , . . . , Yn ) and Π∞ Θn (·|Y1 , . . . , Yn ) is also small, where the second measure is the renomalized measure of Π∞ (·|Y1 , . . . , Yn ) that is restricted to the same set. In the second part, we show that the total variation distance between Π(·|Y1 , . . . , Yn ) and Π∞ (·|Y1 , . . . , Yn ) is small. Part I. For any set A, let ΠA (·) be the renormalized measure which restricted to the set A. Then kΠ(·) − ΠA (·)k ≤ 2Π(Ac ). Clearly, / Θn |Y1 , . . . , Yn ) kΠ(·|Y1 , . . . , Yn ) − ΠΘn (·|Y1 , . . . , Yn )kT V ≤ 2Π(β ∈ M3 nǫ2n , . . . , Y ≤ 2Π(|S| ≥ M2 s⋆ |Y1 , . . . , Yn ) + 2Π kβ − β 0 k2F ≥ Pn Y 1 n ⋆ 2 2 i=1 kXi k φℓ2 (s ) → 0,
by (4.2) and (4.7). Now, to show that ∞ c kΠ∞ (·|Y1 , . . . , Yn ) − Π∞ Θn (·|Y1 , . . . , Yn )kT V ≤ 2Π (β ∈ Θn |Y1 . . . , Yn ),
we write exp{ℓ Vec(β), Σ0 − ℓ Vec(β 0 ), Σ0 }dU Vec(β) , Π∞ (β ∈ Θcn |Y1 , . . . , Yn ) = R exp{ℓ Vec(β), Σ0 − ℓ Vec(β 0 ), Σ0 }dU Vec(β) (5.23) R
Θcn
where d X πG (|Sk |) Y λk pj Y dΠ Vec(β S ) ⊗ δS c . dU Vec(β) = G aj ⋆ |S | k=1 Sk ∈Sk
k
j∈Sk
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
23
Bayesian variable selection for multivariate responses Recall that Sk⋆ = {Sk : |Sk | ≤ M2 s⋆k }. By (4.12), ℓ(Vec(β), Σ0 ) − ℓ(Vec(β), Σ0 ) equals to d X k=1
n n X 1X −1/2 ′ ′ . Yi − Vec(β0 )X i Σ−1 X (β − β ) k(βk − β0,k )X i,k Σ0 k2 + k 0,k i,k 0 2 i=1 i=1
−
Note that except for the elements of the first row of X i,k , the rest are 0. By plugging-in the last display into (5.23), the denominator can be bounded below by Z n d Y 1X πG (s0,k ) Y λk pj −1/2 k(βSk − β0,k )X i,S0,k Σ0 k2 ) exp(− G a 2 j s i=1
k=1
j∈S0,k
0,k
× exp
n X i=1
′ ′ dβSk . Yi − Vec(β 0 )X i,S0 Σ−1 X (β − β ) Sk 0,k i,S0,k 0
By changing the variables βSk − β0,k to βˇSk and applying Jensen’s inequality, the last display is bounded below by Z n d 1X Y πG (s0,k ) Y λk pj ˇS X i,S Σ−1/2 k2 dβˇS k β exp − k k 0,k 0 G aj 2 i=1 s
k=1
=
0,k
d Y
k=1
j∈S0,k
P
pj /2 πG (s0,k ) Y λk pj (2π) j∈S0,k 1/2 . P G ′ aj det( ni=1 X i,S0,k Σ−1 s0,k j∈S0,k 0 X i,S0,k )
(5.24)
We thus obtain a lower bound for the denominator. The numerator of (5.23) can be written as follows:
n 1X −1/2 kVec(β S − β0 )X i,S Σ0 k2 exp − 2 i=1 Θcn
Z
× exp
n X i=1
′ ′ Yi − Vec(β 0 )X i,S Σ−1 0 X i,S Vec(β S − β 0 ) dU Vec(β S ) .
By applying the tail bound for a standard multivariate normal distribution, v u n n 2d X u
X −1 kXi k2 ≤ Yi − Vec(β 0 )X i Σ0 X i,k ≥ 2tb1 log n P max → 0, k n i=1 i=1
p p P as d ≪ n. Note that 2 b1 log n ni=1 kXi k2 = λ b1 log n/ log G. By the Cauchy-Schwartz inequality, with probability tending to one, we obtain that n X i=1
′ ′ Yi − Vec(β 0 )X i Σ−1 0 X i Vec(β − β 0 ) imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
24
Ning and Ghosal =
√ d −1 ′ λ b1 log n X ′ √ kβk − β0,k k. Yi − Vec(β 0 )X i Σ0 X i,k (βk − β0,k ) ≤ log G k=1
d X n X
k=1 i=1
q P Pd d By writing x = 3x/2−x/2 for any x and applying the inequality i=1 kAi k ≤ d i=1 kAi k2 for vectors A1 , . . . , Ad , the upper bound in the last display is bounded above by v u d √ √ d X λ b1 log n X 3λ db1 log n u t 2 √ √ kβk − β0,k k. (5.25) kβk − β0,k k − 2 log G 2 log G k=1 k=1 Furthermore, by writing 3x/2 = 2x − x/2 for any x and using the restricted eigenvalue condition in (4.6), the first term in (5.25) can be further bounded above by v u d p Pn √ 2 X 2λ db1 log n i=1 kVec(β − β 0 )X i k λ db1 log n u t q √ kβk − β0,k k2 . − Pn 2 log G log G kXi k2 φ2 (s⋆ ) k=1 i=1
ℓ2
We apply the inequality 2ab ≤ a2 + b2 for any two numbers a and b to the first term in the last display. As β ∈ Θcn , recall that the posterior is concentrated on the set S ⋆ = {S : |S| ≤ M2 s⋆ }, then the last display can be further bounded above by √ 2 n λǫn M3 ndb1 log n 1X 2λ db log n Pn 1 q kVec(β − β 0 )X i k2 + − . 2 i=1 log G i=1 kXi k2 φ2ℓ2 (s⋆ ) 2 log G Pn kX k2 φ2 (s⋆ ) i i=1 ℓ2
(5.26)
Now we combine the upper bounds (5.25) and (5.26), the numerator of the posterior (5.23) is bounded above by √ 2 2λ db log n λǫn M5 ndb1 log n Pn 1 p − Pn 2 ⋆ 2 log G i=1 kXi k φℓ2 (s ) log G i=1 kXi k2 φℓ2 (s⋆ ) √ Z d b1 log n X λk kβk − β0,k k dU (Vec(βS )). × exp − √ 2 log G k=1
exp
The integral part in the last display can be further bounded above by d Z Y G λ √b log n Y X λk pj 1 k √ kβk − β0,k k dβk πG (sk )1{Sk ∈ Sk⋆ } exp − aj 2 log G s =0 j∈S k=1 k
k
=
d Y
Y log G b1 log n
k=1 j∈Sk
G pj /2 X
sk =0
πG (sk )1{Sk ∈ Sk⋆ }.
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
25
Bayesian variable selection for multivariate responses Finally, we obtain the upper bound for the numerator, which is √ 2 λǫn M5 ndb1 log n 2λ db log n Pn 1 p − P n log G i=1 kXi k2 φ2ℓ2 (s⋆ ) log G i=1 kXi k2 φℓ2 (s⋆ ) d G Pd P b1 log n Y X k=1 j∈Sk pj log πG (sk )1{Sk ∈ Sk⋆ } × exp − 2 log G k=1 sk =0 P P √ d 18db log n 3ǫ M ndb log n b1 log n 1 n 5 1 k=1 j∈Sk pj − − log = exp 2 ⋆ ⋆ φℓ2 (s ) φℓ2 (s ) 2 log G
exp
×
d X G Y
k=1 sk =0
πG (sk )1{Sk ∈ Sk⋆ }.
(5.27)
Now, we combine the lower bound of the denominator (5.24) and the upper bound of the numerator (5.27). Then the posterior (5.23) is bounded above by 1/2 P ′ Y aj pj det( ni=1 X i,S0,k Σ−1 0 X i,S0,k ) P pj /2 πG (s0,k ) λk (2π) j∈S0,k j∈S0,k k=1 18db log n 3ǫ √M ndb log n Pd P b1 log n n 1 3 1 k=1 j∈Sk pj − − log × exp φ2ℓ2 (s⋆ ) φℓ2 (s⋆ ) 2 log G d Y
×
G s0,k
G X
sk =0
πG (sk )1{Sk ∈ Sk⋆ }.
Note that Σ0 ≥ b1 I d , by letting ΓS0,k = inequality, 1 det(ΓS0,k ) ≤ P
j∈S0,k
pj
(5.28) Pn
i=1
′ X i,S0,k Σ−1 0 X i,S0,k and applying Jensen’s
Pj∈S pj 0,k Tr(ΓS0,k ) ≤ ≤
n P 1 X pj kX i,S0,k k2 j∈S0,k b1 i=1 n P 1 X pj kXi k2 j∈S0,k . b1 i=1
Thus p Pj∈S pj X Y aj pj p j 0,k det(ΓS0,k )1/2 ≤ √ pj log(pj / b1 ) Gs0,k . Gs0,k = exp λk b1
j∈S0,k
j∈S0,k
P √ By putting the term exp p log(p / b1 ) together with the exponential expresj j j∈S0,k sion in (5.28) and choosing a large enough value of M3 , then the exponential term goes to 0 as n → ∞ (note that we assume that φℓ2 (s⋆ ) is bounded below by a constant c). imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
26
Ning and Ghosal
Also, the prior mass πG (sk ) can be bounded below by G−sk . As a result, the product of the rest terms also goes to 0. Therefore, the posterior goes to 0 as n → ∞. Part II. For a generic set B, d Y X πG (sk ) Y λk pj ΠΘn (B|Y1 , . . . , Yn ) ∝ G aj sk j∈Sk k=1 Sk ∈S ⋆ Z Z n 1X × exp − k(Yi − Vec(β S )X i,S )Σ−1/2 k2 exp(−λk kβk k2,1 )dβ S dΠ(Σ), 2 i=1 (B∩Θn )
and Π∞ Θn (B|Y1 , . . . , Yn ) ∝ ×
Z
(B∩Θn )
exp −
d d Y X πG (sk ) Y X λk pj G aj ⋆ s
k=1 Sk ∈S
1 2
n X i=1
k
j∈Sk
k=1
−1/2 2
k(Yi − Vec(β S )X i,S )Σ0
k
exp(−
d X
k=1
λk kβ0,k k2,1 )dβ S .
In the following, we shall show that the total variation distance between the above two measures goes to 0 as n → ∞. We take an Sk from the corresponding set Sk⋆ for all k and denote the corresponding measure by ΠSΘn (B|Y1 , . . . , Yn ) and ΠS,∞ Θn (B|Y1 , . . . , Yn ). We first show that the total variation distance between the two measures goes to 0. To this end, we use the Bernstein von-Mises theorem (see Theorem 1 of Castillo, 2010) in a semiparametric model. The main difference with Castillo (2010)’s setup is that is the nuissance part in our model is parametric but high-dimensional. This theorem requires calculating the remainder term in the local asymptotic normality (LAN) expansion of the log-likelihood, which is denoted as Remn (Vec(β t ), Σt ), and showing that sup β∈Θn ,Σ∈Hm
|Remn (Vec(β 0 ), Σ) − Remn (Vec(β), Σ)| → 0. 1 + nkVec(β − β 0 )k2
(5.29)
For simplicity, we denote Ω = Σ−1 and define two local paths, Vec(βt ) = Vec(β 0 ) + tb,
Ωt = Ω0 +
tΦ 1/2 1/2 kΣ0 ΦΣ0 ||F
,
where b is a 1 × pd vector, Φ is a d × d symmetric matrix and t = n−1/2 . Then by Lemma A.5, the remainder term in the LAN expansion on the log-likelihood is given by d Z ρk n X (ρk − s)2 ds , Remn (Vec(β t ), Σt ) = 2 (1 − s)3 0 k=1
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
27
Bayesian variable selection for multivariate responses 1/2
1/2
where ρk is the k-th eigenvalue of Σ0 (Ωt − Ω0 )Σ0 . Since the remainder term does not depend on β, |Remn (Vec(β 0 ), Σ) − Remn (Vec(β), Σ)| = 0. Thus we verified (5.29). We now conclude that the total variation distance between the two measures ΠSΘn (B|Y1 , . . . , Yn ) and ΠS,∞ Θn (B|Y1 , . . . , Yn ) goes to 0. Then the total variation distance between the two measures ΠΘn (B|Y1 , . . . , Yn ) and Π∞ Θn (B|Y1 , . . . , Yn ) also goes to 0. Proof of Theorem 4.9. The proof is similar to the proof of Theorem 4 of Castillo et al. (2015), thus we will simplify our proofs when the results can be obtained easily by following their arguments. Let Ξ be a collection of all sets for S such that S ∈ S0 , d ˜ Sk ⊃ SP 0,k and Sk 6= S0,k , where S0 = ∪k=1 {Sk : |Sk | ≤ M2 s0,k , β0,Sk ∈ B} and let n −1 ′ ∞ ΓS k = i=1 X i,Sk Σ0 X i,Sk . We shall show that Π (β : S ∈ Ξ|Y1 , . . . , Yn ) → 0 in probability. By (4.14), we obtain that Π∞ (β : S ∈ S0 |Y1 , . . . , Yn ) ≤ ≤
d Y X
wS∞k
k=1 Sk ∈S0 d Y
M2 s0,k
πG (sk )
X
det(ΓS0,k ) det(ΓSk )
× exp
n
−
1 2
πG (s0,k )
k=1 sk =s0,k +1
×
G s0,k
!1/2
n X i=1
exp
G−s0,k sk −s0,k G sk
n n1 X
2
i=1
P p /2 (2π) j∈Sk j P max Q p ′ /2 ′ Sk ∈Ξ, pj′ (2π) j ∈S0,k j j ′ ∈S0,k (λk /aj ′ ) |Sk |=sk Q
pj j∈Sk (λk /aj )
−1/2 2
ˆ )X i,S Σ kVec(β Sk k 0 −1/2 2
ˆ kVec(β S0,k )X i,S0,k Σ0
k
k
o
o ,
(5.30)
where M2 = 2(1 + C2 )/A4 + 1 (see Lemma 4.3). By using the same techniques as those used to prove (6.11) and (6.12) of Castillo et al. (2015), we can verify that 1/2 9√log G Pj∈S pj −Pj′ ∈S pj′ det(ΓS0,k ) k 0,k ≤ , 1/2 Q p ′ (s ) b φ j 1 ℓ2 0 ′) det(Γ ) (λ /a ′ S k j k j ∈S0,k Q
and
pj j∈Sk (λk /aj )
P
n X i=1
ˆ )X S Σ−1/2 k2 − kVec(β Sk k 0
≤ tk
X
j∈Sk
pj −
X
j ′ ∈S0,k
n X i=1
−1/2 2
ˆ kVec(β S0,k )X S0,k Σ0
k
pj ′ (log G)/b2 → 1,
imsart-bj ver. 2014/10/16 file: Bayesian-variable-selection.tex date: July 11, 2018
28
Ning and Ghosal
where tk > 2b2 (sk − s0,k )/ d Y
M2 s0,k
X
P
j∈Sk
pj −
P
(A1 G−A4 )sk −s0,k
k=1 sk =s0,k +1
×
3√2π log G Pj∈S
k
pj ′ . Then (5.30) is bounded above by pj′ sk Q j ′ ∈S0,k aj s0,k Q max pj Sk ∈Ξ, j∈Sk aj
j ′ ∈S0,k
pj −
|Sk |=sk P
j ′ ∈S0,k
b1 φℓ2 (s0 )
pj′
Gsk −s0,k ,
with probability tending to 1. The last display goes to 0 as n → ∞ since s0,k ≤ GA4 −1 sk and s0,k ≤ (M2 GA4 −1 )sk −s0,k .
Appendix A: Auxiliary results Lemma A.1.
√ Γ(m+1) 1/m For a = π Γ( , m 2 +1) Z m λ exp(−λk(X1 , . . . , Xm )k)dX1 · · · dXm = 1. Rm a
(A.1)
Also, as m → ∞, a ≍ m^{1/2}. Expressing (X_1, . . . , X_m) in spherical polar coordinates with a radius r, a base angle θ_{m−1} ∈ (0, 2π), and m − 2 angles θ_1, . . . , θ_{m−2} ranging over (−π/2, π/2), the density of r is given by
\[
f(r \mid \lambda) = \frac{\lambda^m}{\Gamma(m)}\, r^{m-1} \exp(-\lambda r),
\tag{A.2}
\]
which is a gamma density with shape parameter m and rate parameter λ.

Proof. Applying the polar transformation, evaluating the Jacobian, and applying the results shown in Chapter 1.5.1 of Scott (2015), the integral in (A.1) equals
\[
\left( \frac{\lambda}{a} \right)^m \int_0^{2\pi} \int_{-\pi/2}^{\pi/2} \cdots \int_{-\pi/2}^{\pi/2} \int_0^\infty r^{m-1} \exp(-\lambda r) \prod_{i=1}^{m-2} \cos^i \theta_{m-i-1}\, dr\, d\theta_1 \cdots d\theta_{m-2}\, d\theta_{m-1}
= \frac{2\pi^{m/2}}{\Gamma(m/2)} \frac{\lambda^m}{a^m} \int_0^\infty r^{m-1} \exp(-\lambda r)\, dr.
\tag{A.3}
\]
The second line of the last display is obtained by using the results in Chapter 1.5.2 of Scott (2015). Since ∫ r^{m−1} e^{−λr} dr = Γ(m)/λ^m, the choice
\[
a = \sqrt{\pi} \left( \frac{2\Gamma(m)}{\Gamma(m/2)} \right)^{1/m} = \sqrt{\pi} \left( \frac{\Gamma(m + 1)}{\Gamma(m/2 + 1)} \right)^{1/m}
\]
makes (λ/a)^m exp(−λ‖(X_1, . . . , X_m)‖) a probability density function. Now by Stirling's approximation to the gamma functions, we obtain that
\[
\frac{\sqrt{2\pi}}{e} \left( \frac{2m}{e} \right)^{1/2} \le a \le e \left( \frac{2m}{e} \right)^{1/2},
\]
which implies that a ≍ m^{1/2}. (A.2) is self-evident from (A.3).
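The normalization in Lemma A.1 and the growth rate of a can also be checked numerically. The following is a minimal sketch (an illustration added for this exposition, not part of the proof); it evaluates the total mass 2π^{m/2}Γ(m)/(Γ(m/2) a^m) implied by (A.3) with the stated choice of a, together with the ratio a/√m, working on the log scale via scipy's log-gamma.

import numpy as np
from scipy.special import gammaln

# a = sqrt(pi) * (Gamma(m+1) / Gamma(m/2+1))^(1/m), as in Lemma A.1
for m in (2, 5, 10, 50, 200):
    log_a = 0.5 * np.log(np.pi) + (gammaln(m + 1) - gammaln(m / 2 + 1)) / m
    # log of the total mass from (A.3); the lambda factors cancel exactly
    log_mass = np.log(2) + 0.5 * m * np.log(np.pi) - gammaln(m / 2) + gammaln(m) - m * log_a
    print(f"m = {m:4d}: total mass = {np.exp(log_mass):.6f}, a / sqrt(m) = {np.exp(log_a) / np.sqrt(m):.4f}")

The total mass equals 1 for every m, and a/√m settles near a constant, in line with a ≍ m^{1/2}.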
Lemma A.2 (Lounici et al., 2011). Let ξ_1, . . . , ξ_k ∼ N(0, 1) be i.i.d. random variables and v = (v_1, . . . , v_k) be a non-zero vector. Define η_v = Σ_{i=1}^{k} (ξ_i^2 − 1) v_i / (√2 ‖v‖) and m(v) = ‖v‖_∞ / ‖v‖. Then for t > 0,
\[
\mathrm{P}(|\eta_v| > t) \le 2 \exp\left( -\frac{t^2}{2\big(1 + \sqrt{2}\, t\, m(v)\big)} \right).
\tag{A.4}
\]
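A quick Monte Carlo sketch (an illustration added here; the vector v and the sample sizes are arbitrary choices, not from the paper) comparing the empirical tail of η_v with the bound in (A.4):

import numpy as np

rng = np.random.default_rng(1)
k, n_rep = 50, 200000
v = rng.standard_normal(k)                      # an arbitrary non-zero weight vector
norm_v = np.linalg.norm(v)
m_v = np.max(np.abs(v)) / norm_v                # m(v) = ||v||_inf / ||v||

xi = rng.standard_normal((n_rep, k))
eta = (xi**2 - 1) @ v / (np.sqrt(2) * norm_v)   # eta_v as defined in Lemma A.2

for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(eta) > t)
    bound = 2 * np.exp(-t**2 / (2 * (1 + np.sqrt(2) * t * m_v)))
    print(f"t = {t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")

The empirical tail frequencies sit well below the bound for each t, as expected.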
Lemma A.3 (Muirhead, 1982). If A ∼ W_d(ν, Φ) with ν > d − 1 and ρ_1 < · · · < ρ_d are the eigenvalues of A, then the joint distribution of (ρ_1, . . . , ρ_d) is
\[
\frac{\pi^{d^2/2}\, 2^{-d\nu/2}\, \det(\Phi)^{-\nu/2}}{\Gamma_d(d/2)\, \Gamma_d(\nu/2)}
\prod_{i=1}^{d} \rho_i^{(\nu - d - 1)/2} \prod_{i < j} (\rho_j - \rho_i)
\int \exp\Big( -\frac{1}{2} \mathrm{Tr}\big( \Phi^{-1} H \Delta(\rho) H^{T} \big) \Big)\, (dH),
\tag{A.5}
\]
where Δ(ρ) = diag(ρ_1, . . . , ρ_d), the integral is over the d × d orthogonal matrices H with respect to the normalized Haar measure (dH), Γ_d(a) = ∫_{B > 0} exp(−Tr(B)) det(B)^{a − (d+1)/2} dB with a > (d − 1)/2, and B > 0 means that B is positive definite. Further, if Φ = I_d, then (A.5) reduces to
\[
\frac{\pi^{d^2/2}\, 2^{-d\nu/2}}{\Gamma_d(d/2)\, \Gamma_d(\nu/2)}
\prod_{i=1}^{d} \rho_i^{(\nu - d - 1)/2} \exp\Big( -\frac{1}{2} \sum_{i=1}^{d} \rho_i \Big) \prod_{i < j} (\rho_j - \rho_i).
\tag{A.6}
\]

Lemma A.4. Suppose Σ^{-1} ∼ W_d(ν, Φ) with ν > d − 1 and ν ≍ d. Then for positive constants a_1, a_2, a_3, a_4, b_1, b_2, b_3, b_4, with t_1 > νd, t_2 > 0 and 0 ≤ t_3 ≤ 1,
\[
\begin{aligned}
& \mathrm{P}\big( \mathrm{eig}_d(\Sigma^{-1}) > t_1 \big) \lesssim (b_1 t_1 / d^2)^{d^2/2} \exp(d^2/2 - a_1 t_1), && \text{(A.7)} \\
& \mathrm{P}\big( \mathrm{eig}_1(\Sigma^{-1}) \le t_2 \big) \lesssim (b_2 d)^{a_2 d}\, t_2^{b_3 d^2}, && \text{(A.8)} \\
& \mathrm{P}\Big( \bigcap_{i=1}^{d} \{ 1 \le \mathrm{eig}_i(\Sigma^{-1}) \le 1 + t_3 \} \Big) \gtrsim (b_4 d)^{-d}\, d^{-d^2/2}\, t_3^{d(d+1)/2} \exp(-a_3 d(1 + t_3)), && \text{(A.9)} \\
& \mathrm{P}\Big( \bigcap_{i=1}^{d} \{ 1 \le \mathrm{eig}_i(\Sigma) \le 1 + t_3 \} \Big) \gtrsim (b_4 d)^{-d}\, d^{-d^2/2}\, t_3^{d(d+1)/2} \exp(-a_4 d). && \text{(A.10)}
\end{aligned}
\]
Proof. The proofs of (A.7)–(A.9) follow from the proof of Lemma 9.16 of Ghosal and van der Vaart (2017), except that here we need to express the factors involving d explicitly. To prove (A.10), we need the following inequality:
\[
\mathrm{P}\Big( \bigcap_{i=1}^{d} \{ \Sigma : 1 \le \mathrm{eig}_i(\Sigma) \le 1 + t_3 \} \Big)
\ge \mathrm{P}\Big( \bigcap_{i=1}^{d} \{ \Sigma : 1 - t_3 \le \rho_i \le 1 \} \Big),
\]
for 0 ≤ t_3 ≤ 1 and ρ_i = eig_i(Σ^{-1}), the i-th smallest eigenvalue of Σ^{-1}, i = 1, . . . , d. Consider the interval I_i = [1 − (d − i + 1)t_3/d, 1 − (d − i + 1/2)t_3/d] for each i. It is easy to verify that if ρ_i ∈ I_i, then ρ_i ∈ [1 − t_3, 1]. By (A.5), we obtain that
\[
\mathrm{P}(1 - t_3 \le \rho_i \le 1,\ i = 1, \ldots, d)
\ge \frac{\pi^{d^2/2}\, 2^{-d\nu/2}\, \det(\Phi)^{-\nu/2}}{\Gamma_d(d/2)\, \Gamma_d(\nu/2)}
\int_{I_1} \cdots \int_{I_d} \prod_{i=1}^{d} \rho_i^{(\nu - d - 1)/2} \prod_{i < j} (\rho_j - \rho_i)
\int \exp\Big( -\frac{1}{2} \mathrm{Tr}\big( \Phi^{-1} H \Delta(\rho) H^{T} \big) \Big)\, (dH)\, d\rho_d \cdots d\rho_1,
\]
and for ρ_i ∈ I_i and ρ_j ∈ I_j with j > i, we have ρ_j − ρ_i > t_3/(2d). The integrand is further bounded below by noticing that −Δ(ρ) ≥ −ρ_d I_d ≥ −I_d and HH^T = I_d; (A.10) is then obtained by applying Stirling's approximation to the gamma functions in the resulting lower bound.
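The events controlled by Lemma A.4 are easy to simulate. The sketch below (an illustration only; the values of d, ν and the thresholds are arbitrary, and the constants a_i, b_i of the lemma are not estimated) draws Σ^{-1} ∼ W_d(ν, I_d) and records how often its extreme eigenvalues fall in the regions appearing in (A.7)–(A.9).

import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
d, nu, n_rep = 5, 8, 20000                     # small dimension with nu > d - 1 and nu of order d
t1, t2, t3 = 4.0 * nu * d, 0.05, 1.0           # t1 > nu d, t2 > 0, 0 <= t3 <= 1

draws = wishart(df=nu, scale=np.eye(d)).rvs(size=n_rep, random_state=rng)
eigs = np.linalg.eigvalsh(draws)               # sorted eigenvalues of each draw of Sigma^{-1}

print("P(eig_d(Sigma^{-1}) > t1)  ~", np.mean(eigs[:, -1] > t1))                                   # cf. (A.7)
print("P(eig_1(Sigma^{-1}) <= t2) ~", np.mean(eigs[:, 0] <= t2))                                   # cf. (A.8)
print("P(all eigs in [1, 1 + t3]) ~", np.mean((eigs[:, 0] >= 1) & (eigs[:, -1] <= 1 + t3)))        # cf. (A.9)

The first two frequencies are very small, as the upper bounds (A.7)–(A.8) suggest, while the third event is also rare for these arbitrary choices of ν and t_3, which is compatible with the lower bound in (A.9) decaying quickly in d.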
Lemma A.5. For a multivariate normal density N(Y_i | Vec(β)X_i, Σ), i = 1, . . . , n, let
\[
\mathrm{Vec}(\beta_t) = \mathrm{Vec}(\beta_0) + t b, \qquad
\Omega_t = \Omega_0 + \frac{t\Phi}{\|\Sigma_0^{1/2}\Phi\Sigma_0^{1/2}\|_F},
\]
be local paths for Vec(β) and Ω respectively, where Ω = Σ^{-1}, b is a pd × 1 vector, Φ is a d × d symmetric matrix and 0 < t < 1. Then
\[
\begin{aligned}
\ell_n(\mathrm{Vec}(\beta_t), \Sigma_t) - \ell_n(\mathrm{Vec}(\beta_0), \Sigma_0)
&= - \frac{\sqrt{n}\, \mathrm{Tr}\big( (Q_0 - \Sigma_0)\Phi \big)}{2 \|\Sigma_0^{1/2}\Phi\Sigma_0^{1/2}\|_F}
+ \frac{\sqrt{n}}{n} \sum_{i=1}^{n} b X_i \Omega_t (Y_i - \mathrm{Vec}(\beta_0) X_i)' \\
&\quad - \frac{1}{2n} \sum_{i=1}^{n} b X_i \Omega_t X_i' b'
- \frac{1}{2}
- \frac{1}{2} \sum_{i=1}^{d} \int_0^{\rho_i} \frac{(\rho_i - s)^2}{(1 - s)^3}\, ds,
\end{aligned}
\tag{A.11}
\]
where ρ_i is the i-th eigenvalue of the matrix Σ_0^{1/2}(Ω_t − Ω_0)Σ_0^{1/2}.

Proof. The proof is similar to that of Lemma A.1 in Gao and Zhou (2016). Let Q_t = (1/n) Σ_{i=1}^{n} (Y_i − Vec(β_t)X_i)'(Y_i − Vec(β_t)X_i) and Q_0 = (1/n) Σ_{i=1}^{n} (Y_i − Vec(β_0)X_i)'(Y_i − Vec(β_0)X_i). From (4.12), we have
\[
\begin{aligned}
\ell_n(\mathrm{Vec}(\beta_t), \Sigma_t) - \ell_n(\mathrm{Vec}(\beta_0), \Sigma_0)
&= \frac{n}{2} \log\det(\Sigma_0) - \frac{n}{2} \log\det(\Sigma_t) - \frac{n}{2} \mathrm{Tr}(Q_t \Omega_t) + \frac{n}{2} \mathrm{Tr}(Q_0 \Omega_0) \\
&= \frac{n}{2} \log\det\big( I_d - \Sigma_0^{1/2} (\Omega_0 - \Omega_t) \Sigma_0^{1/2} \big)
- \frac{n}{2} \mathrm{Tr}\big( (Q_0 - \Sigma_0)(\Omega_t - \Omega_0) \big) \\
&\quad - \frac{n}{2} \mathrm{Tr}\big( \Sigma_0^{1/2} (\Omega_t - \Omega_0) \Sigma_0^{1/2} \big)
- \frac{n}{2} \mathrm{Tr}\big( (Q_t - Q_0) \Omega_t \big).
\end{aligned}
\]
To obtain (A.11), we first plug in the expressions of Q_t and Q_0 into Tr((Q_t − Q_0)Ω_t) to obtain that
\[
\frac{n}{2} \mathrm{Tr}\big( (Q_t - Q_0) \Omega_t \big)
= - t \sum_{i=1}^{n} b X_i \Omega_t (Y_i - \mathrm{Vec}(\beta_0) X_i)'
+ \frac{t^2}{2} \sum_{i=1}^{n} b X_i \Omega_t X_i' b'.
\]
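As a small numerical sanity check added here (not part of the original proof; all array sizes are arbitrary), the displayed identity for (n/2) Tr((Q_t − Q_0)Ω_t) can be verified on simulated inputs, with Vec(β_0) and b treated as 1 × pd row vectors, each X_i a pd × d matrix, and Ω_t a d × d symmetric positive definite matrix:

import numpy as np

rng = np.random.default_rng(2)
n, p, d, t = 7, 3, 2, 0.3
pd_ = p * d
X = rng.standard_normal((n, pd_, d))            # X_i stacked along the first axis
Y = rng.standard_normal((n, d))
beta0 = rng.standard_normal(pd_)                # Vec(beta_0)
b = rng.standard_normal(pd_)
A = rng.standard_normal((d, d))
Omega_t = A @ A.T + d * np.eye(d)               # symmetric positive definite

beta_t = beta0 + t * b                          # Vec(beta_t) = Vec(beta_0) + t b
res0 = Y - np.einsum('j,njd->nd', beta0, X)     # rows Y_i - Vec(beta_0) X_i
rest = Y - np.einsum('j,njd->nd', beta_t, X)    # rows Y_i - Vec(beta_t) X_i
Q0 = res0.T @ res0 / n
Qt = rest.T @ rest / n

lhs = 0.5 * n * np.trace((Qt - Q0) @ Omega_t)
bX = np.einsum('j,njd->nd', b, X)               # rows b X_i
rhs = (-t * np.einsum('nd,de,ne->', bX, Omega_t, res0)
       + 0.5 * t**2 * np.einsum('nd,de,ne->', bX, Omega_t, bX))
print(np.allclose(lhs, rhs))                    # prints True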
Next, we apply Taylor's expansion with the integral form of the remainder to the log function to obtain that
\[
\begin{aligned}
\log\det\big( I_d - \Sigma_0^{1/2} (\Omega_0 - \Omega_t) \Sigma_0^{1/2} \big)
+ \mathrm{Tr}\big( \Sigma_0^{1/2} (\Omega_0 - \Omega_t) \Sigma_0^{1/2} \big)
&= \sum_{i=1}^{d} \big( \rho_i + \log(1 - \rho_i) \big) \\
&= - \frac{1}{2} \sum_{i=1}^{d} \rho_i^2 - \frac{1}{2} \sum_{i=1}^{d} \int_0^{\rho_i} \frac{(\rho_i - s)^2}{(1 - s)^3}\, ds \\
&= - \frac{1}{2} \big\| \Sigma_0^{1/2} (\Omega_t - \Omega_0) \Sigma_0^{1/2} \big\|_F^2
- \frac{1}{2} \sum_{i=1}^{d} \int_0^{\rho_i} \frac{(\rho_i - s)^2}{(1 - s)^3}\, ds.
\end{aligned}
\]
Finally, we plug in the expression Ω_t − Ω_0 = tΦ / (√n ‖Σ_0^{1/2}ΦΣ_0^{1/2}‖_F) into the last line of the last display to complete the proof.
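The scalar Taylor identity behind the last display, ρ + log(1 − ρ) = −ρ²/2 − ∫_0^ρ (ρ − s)²/(1 − s)³ ds, can be checked numerically as follows (an illustration added here, not part of the proof):

import numpy as np
from scipy.integrate import quad

for rho in (-0.5, 0.1, 0.4, 0.8):
    lhs = rho + np.log1p(-rho)
    remainder, _ = quad(lambda s: (rho - s) ** 2 / (1 - s) ** 3, 0, rho)
    rhs = -rho**2 / 2 - remainder
    print(f"rho = {rho:+.2f}: lhs = {lhs:.10f}, rhs = {rhs:.10f}")

The two sides agree to within quadrature error for every value of ρ, including negative values, which is relevant since the ρ_i above need not be positive.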
References

Banerjee, S. and S. Ghosal (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electronic Journal of Statistics 8, 2111–2137.
Banerjee, S. and S. Ghosal (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis 136, 147–162.
Belitser, E. and S. Ghosal (2017). Empirical Bayes oracle uncertainty quantification for regression. Preprint.
Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Bühlmann, P. and S. van der Geer (2011). Statistics for High-dimensional Data: Methods, Theory and Applications. Springer-Verlag.
Castillo, I. (2010). A semiparametric Bernstein-von Mises theorem for Gaussian process priors. Probability Theory and Related Fields 152, 53–99.
Castillo, I. and R. Mismer (2018). Empirical Bayes analysis of spike and slab posterior distributions. arXiv:1801.01696.
Castillo, I., J. Schmidt-Hieber, and A. van der Vaart (2015). Bayesian linear regression with sparse priors. The Annals of Statistics 43, 1986–2018.
Chae, M., L. Lin, and D. B. Dunson (2016). Bayesian sparse linear regression with unknown symmetric error. arXiv:1608.02143.
Chen, R.-B., C.-H. Chu, S. Yuan, and Y. N. Wu (2016). Bayesian sparse group selection. Journal of Computational and Graphical Statistics 25, 665–683.
Curtis, S. M., S. Banerjee, and S. Ghosal (2014). Fast Bayesian model assessment for nonparametric additive regression. Computational Statistics and Data Analysis 71, 347–358.
Gao, C. and H. H. Zhou (2016). Bernstein-von Mises theorems for functionals of the covariance matrix. Electronic Journal of Statistics 10, 1751–1806.
Ghosal, S. and A. van der Vaart (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
Greenlaw, K., E. Szefer, J. Graham, M. Lesperance, F. S. Nathoo, and the Alzheimer's Disease Neuroimaging Initiative (2017). A Bayesian group sparse multi-task regression model for imaging genetics. Bioinformatics 33, 2513–2522.
Huang, J. and T. Zhang (2010). The benefit of group sparsity. The Annals of Statistics 38, 1978–2004.
Li, F. and N. R. Zhan (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association 105(491), 1202–1214.
Liquet, B., K. Mengersen, A. N. Pettitt, and M. Sutton (2017). Bayesian variable selection regression of multivariate responses for group data. Bayesian Analysis 12, 1039–1067.
Lounici, K., M. Pontil, A. B. Tsybakov, and S. van de Geer (2009). Taking advantage of sparsity in multi-task learning. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT-2009), 73–82.
Lounici, K., M. Pontil, S. van de Geer, and A. B. Tsybakov (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics 39, 2164–2204.
Martin, R., R. Mess, and S. G. Walker (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli 23, 1822–1857.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. John Wiley & Sons, Inc.
Nardi, Y. and A. Rinaldo (2008). On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics 2, 605–633.
Ning, B., S. Ghosal, and J. Thomas (2018). Bayesian method for causal inference in spatially-correlated multivariate time series. Bayesian Analysis (to appear).
Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics 46, 401–437.
Ročková, V. and E. Lesaffre (2014). Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Analysis 9, 221–258.
Scott, D. W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, Inc.
Song, Q. and F. Liang (2017). Nearly optimal Bayesian shrinkage for high-dimensional regression. arXiv:1712.08964.
Suarez, A. J. and S. Ghosal (2017). Bayesian estimation of principal components for functional data. Bayesian Analysis 12, 311–333.
Xu, X. and M. Ghosh (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis 10, 909–936.
Yuan, M. and Y. Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 68, 49–67.