Bayesian Variable Selection for Ultrahigh-dimensional Sparse Linear Models

Minerva Mukhopadhyay (Bethune College, Kolkata-700006, India) and Subhajit Dutta (Indian Institute of Technology, Kanpur-208016, India)

arXiv:1609.06031v1 [stat.ME] 20 Sep 2016

Abstract. We propose a Bayesian variable selection procedure for ultrahigh-dimensional linear regression models. The number of regressors involved in regression, pn, is allowed to grow exponentially with n. Assuming the true model to be sparse, in the sense that only a small number of regressors contribute to this model, we propose a set of priors suitable for this regime. The model selection procedure based on the proposed set of priors is shown to be variable selection consistent when all the 2^pn models are considered. In the ultrahigh-dimensional setting, selection of the true model among all the 2^pn possible ones involves prohibitive computation. To cope with this, we present a two-step model selection algorithm based on screening and Gibbs sampling. The first step of screening discards a large set of unimportant covariates, and retains a smaller set containing all the active covariates with probability tending to one. In the next step, we search for the best model among the covariates obtained in the screening step. This procedure is computationally quite fast, simple and intuitive. We demonstrate competitive performance of the proposed algorithm for a variety of simulated and real data sets when compared with several frequentist, as well as Bayesian, methods.

Keywords: Model selection consistency, Screening consistency, Gibbs sampling.

1 Introduction

Variable selection in the ultrahigh-dimensional setup is a flourishing area of contemporary research. It has become more important with the increasing availability of data in various fields like genetics, finance, machine learning, etc. Sparsity has frequently been identified as an underlying feature of such data sets. For example, in genome wide association studies (GWAS), “a phenotype is measured for a large panel of individuals, and a large number of single nucleotide polymorphisms (SNPs) throughout the genome are genotyped in all these participants. The goal is to identify SNPs that are statistically associated with the phenotype and ultimately to build statistical models to capture the effect of genetics on the phenotype” (Rosset (2013)). One such data set is the metabolic quantitative trait loci data, which consists of 10000 SNPs that are close to the regulatory regions (predictor variables) over a total of 50 participants (observations). A previous study by Song and Liang (2015) identified two particular SNPs to be important and significant. We study this data set in detail in a later section.

Several methods have been proposed to model high-dimensional data sets in both the frequentist and the Bayesian paradigms. Frequentist solutions to this problem are often based on penalized likelihood, among which variants of LASSO like the elastic net of Zou and Hastie (2005), the group LASSO of Yuan and Lin (2006) and the adaptive LASSO of Zou (2006) are worth mentioning. Another important frequentist approach uses a screening algorithm to first reduce the data dimension, and then applies classical methods to the reduced data. This idea is implemented in sure independence screening (SIS) of Fan and Lv (2008), iterative SIS (ISIS) of Fan and Song (2010), the forward selection based screening of Wang (2009), nonparametric independence screening (NIS) of Fan et al. (2011), iterative varying-coefficient screening (IVIS) of Song et al. (2014), etc. Other ways of approaching this problem include the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), the Dantzig selector of Candès and Tao (2007), the modified EBIC of Chen and Chen (2012), etc. A detailed review of most of these methods is contained in the paper by Fan and Lv (2010).

In the Bayesian literature, popular methods include the empirical Bayes variable selection of George and Foster (2000), and the spike and slab variable selection of Ishwaran and Rao (2005). Among recent developments, the methods of Bondell and Reich (2012), Liang et al. (2013), Song and Liang (2015) and Castillo et al. (2015) use the idea of penalized credible regions to accomplish variable selection in the ultrahigh-dimensional setting. While Castillo et al. (2015) have proved theoretical results related to posterior consistency for the regression parameter, Liang et al. (2013) have shown the equivalence of posterior consistency and model selection consistency under appropriate sparsity assumptions. Narisetty and He (2014) claim to prove the ‘strongest selection consistency result’ using the spike and slab prior in the Bayesian framework. They introduce shrinking and diffusing priors, and establish strong selection consistency of their approach. In all of the above studies, the authors have considered the case where log pn = o(n). Note that the algorithms for computing the posterior distribution for the spike and slab prior are routine for small values of pn and n, but the resulting computations are quite intensive for higher dimensions due to the large number of possible models. Several authors have developed MCMC algorithms that can cope with larger numbers of covariates, but truly high-dimensional models are still ‘out of reach of fully Bayesian methods at the present time’ (see Castillo et al. (2015)).

In this paper, we propose a Bayesian method for model selection, and examine model selection consistency for the same under the assumption of sparsity. In cases where pn >> n, the number of competing models is so large that one first requires a screening algorithm to discard unimportant covariates. We present a two-step model selection procedure based on a screening algorithm and Gibbs sampling. The first step of the algorithm is shown to achieve screening consistency in the sense that it discards a large set of unimportant covariates while retaining all the active covariates with probability tending to one.

The objective of the present work is three-fold. First, to develop a method which is suitable for ultrahigh-dimensional models. Secondly, to provide a faster and intuitive model selection algorithm. Finally, to keep the method and the algorithm as simple as possible. The proposed set of priors has the advantage of generating closed form expressions of marginals, which makes the method as tractable as a simple penalized likelihood method, such as the Bayesian information criterion (BIC). To the best of our knowledge, this is the first work in the area of Bayesian variable selection which can accommodate cases with log pn = O(n). The selection algorithm we adopt is simple and intuitive, and it makes the selection procedure quite fast. Further, its good performance is supported by theoretical results.

In Section 2, the prior setup and the model selection algorithm are described in detail. Section 3 contains the theoretical results, including model selection consistency of the proposed set of priors and consistency of the proposed algorithm. In Sections 4 and 5, we validate the performance of the proposed algorithm using simulated and real data sets, respectively. Proofs of the main results are provided in Section 6; proofs of the remaining results and other mathematical details are provided in a supplementary file.

2 The Proposed Prior and Model Selection Algorithm

2.1 Setup

Suppose we have n data points, each consisting of pn regressors {x_{1,i}, x_{2,i}, ..., x_{pn,i}} and a response yi, with i = 1, 2, ..., n. The response vector yn is modeled as
\[
y_n = X_n \beta + e_n, \tag{2.1}
\]
where Xn is the n × pn design matrix, β = (β1, β2, ..., βpn)′ is the vector of corresponding regression parameters and en is the vector of regression errors. We consider a sparse situation, where only a small number of regressors contributes to the model, while pn >> n. For simplicity, we assume that the design matrix Xn is non-stochastic and en ∼ N(0, σ²In).

The space of all 2^pn models is denoted by A, and indexed by α. Here, each α consists of a subset of size pn(α) (0 ≤ pn(α) ≤ pn) of the set {1, 2, ..., pn}, indicating which regressors are selected in the model. Under Mα, with α ∈ A, yn is modeled as
\[
M_\alpha: \quad y_n = X_\alpha \beta_\alpha + e_n,
\]
where Xα is the sub-matrix of Xn consisting of the pn(α) columns specified by α, and βα is the corresponding vector of regression coefficients. When Mα is true, we assume that all the elements of βα are non-zero. We consider the problem of selecting the model Mα, with α ∈ A, which best explains the data. The true data generating model, denoted by Mαc, is assumed to be an element of A, and is expressed as
\[
M_{\alpha_c}: \quad y_n = \mu_n + e_n = X_{\alpha_c} \beta_{\alpha_c} + e_n,
\]
where μn is the regression of yn given Xn. The dimension of Mαc, denoted by p(αc), is assumed to be a small number, free of n.

In a Bayesian approach, each model Mα is assigned a prior probability p(Mα), and the corresponding set of parameters θα = (β0, βα, σ²) involved in Mα is also assigned a prior distribution p(θα|Mα). Given the set of priors, one computes the posterior probability of each model. The posterior probability of the model Mα is given by
\[
p(M_\alpha \mid y_n) = \frac{p(M_\alpha)\, m_\alpha(y_n)}{\sum_{\alpha \in A} p(M_\alpha)\, m_\alpha(y_n)},
\qquad \text{where} \qquad
m_\alpha(y_n) = \int p(y_n \mid \theta_\alpha, M_\alpha)\, p(\theta_\alpha \mid M_\alpha)\, d\theta_\alpha
\]
is the marginal density of yn, p(yn|θα, Mα) is the density of yn given the model parameters θα, and p(θα|Mα) is the prior density of θα under Mα. We consider the procedure that selects the model in A with the highest posterior probability.

We denote the rank of the design matrix of model Mα by rα, i.e., r(X′αXα) = rα, and also refer to rα as the rank of Mα. For two numbers a and b, the notations a ∨ b and a ∧ b are used to denote max{a, b} and min{a, b}, respectively. For α, α∗ ∈ A, the notations Xα∨α∗ and Xα∧α∗ are used to denote the sub-matrices of X formed by the columns corresponding to either Xα or Xα∗ (or both), and the columns which are common to both Xα and Xα∗, respectively. For two square matrices A and B of the same order, A ≤ B means that B − A is positive semidefinite.

2.2 Prior Specification

On each model Mα with α ∈ A, we assign the Bernoulli prior probability
\[
P(M_\alpha) = q_n^{p_n(\alpha)} (1 - q_n)^{p_n - p_n(\alpha)}, \qquad \text{with} \quad q_n = 1/p_n.
\]
Given a model Mα, we consider the conjugate prior on βα,
\[
\beta_\alpha \mid \sigma^2, M_\alpha \sim N\!\left(0,\; g_n \sigma^2 I_{p_n(\alpha)}\right),
\]
where gn is a hyperparameter which depends on n. When σ² is unknown, we consider the popular Jeffreys prior π(σ²) ∝ 1/σ².

The Bernoulli prior is widely used as a model prior because of its property of favoring or penalizing models of large or small dimensions. The choice qn = 1/pn has previously been considered by Narisetty and He (2014). This prior is particularly useful for sparse regression models, where it is known in advance that the true model is small-dimensional and pn is quite large. The use of the inverse gamma prior for the error variance is fairly conventional in the model selection literature (see, e.g., Johnson and Rossell (2012), Narisetty and He (2014)). The Jeffreys prior is the limit of the inverse gamma prior as both of its hyperparameters approach zero, and its invariance under reparametrization makes it suitable as a prior on the scale parameter.

We choose a simple set of priors. Except for the choice of gn, the priors are completely specified; we do not provide a specific choice of gn, but rather indicate the optimal order necessary to achieve consistency. The posterior probabilities generated using this set of priors are of a closed form, which makes the resulting method easily applicable.
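To make the closed form concrete, here is a minimal Python sketch (an illustration, not the authors' implementation; the function name and interface are our own) of the resulting log posterior score of a model α when σ² is known. It combines the Bernoulli model prior with the Gaussian marginal of yn and, via Sylvester's determinant identity and the Woodbury identity, reduces to the expression displayed later in Section 3.1.

```python
import numpy as np

def log_posterior_score(y, X, alpha, g, sigma2):
    """Log posterior score (up to an additive constant) of the model given by the
    column indices in `alpha`, under the priors of Section 2.2 with known sigma^2:
        p_n(alpha)*log(1/(p_n - 1)) - 0.5*log|I + g*Xa'Xa| - R2*_alpha/(2*sigma^2)."""
    n, p = X.shape
    Xa = X[:, list(alpha)]
    k = Xa.shape[1]
    # |I_n + g*Xa*Xa'| = |I_k + g*Xa'*Xa|   (Sylvester's determinant identity)
    _, logdet = np.linalg.slogdet(np.eye(k) + g * Xa.T @ Xa)
    # R2*_alpha = y'{I - Xa (I_k/g + Xa'Xa)^{-1} Xa'} y   (Woodbury identity)
    Xty = Xa.T @ y
    r2star = float(y @ y - Xty @ np.linalg.solve(np.eye(k) / g + Xa.T @ Xa, Xty))
    return k * np.log(1.0 / (p - 1)) - 0.5 * logdet - r2star / (2.0 * sigma2)
```

Because each evaluation only involves a k × k linear system with k = pn(α) small, the score is cheap to compute even when pn is in the thousands.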

2.3 Model Selection Algorithm

Our model selection procedure is quite simple, as it chooses the model with the highest posterior probability among all competing models. In the next section, we will show that the proposed set of priors is model selection consistent in the sense that the posterior probability of the true model goes to one. However, identifying the model with the highest posterior probability still remains a challenging task for ultrahigh-dimensional data. As pn = exp{O(n)}, it is impossible to evaluate all the 2^pn models in the model space even for small values of n. For example, if n = 5, the model space can be of order exp(45), which is a huge number. Therefore, we need to develop a screening algorithm which reduces the model space to a tractable one. In other words, we need to discard a set of ‘unimportant’ variables at the beginning using some suitable algorithm. After implementation of the algorithm, ideally, we will be left with a smaller set of variables which includes all the active covariates. We describe the algorithm in detail below.

The Two-step Algorithm.

Screening: The first step discards a large set of unimportant covariates. Here, we use the fact that the number of regressors in the true model, p(αc), is very small and free of n. First, we choose an integer d such that d/K0 ≤ p(αc) < d, where K0 is a positive number free of n. We will choose the best model in the class of all models of dimension d. Thus, among the 2^pn models, we only compare the $\binom{p_n}{d}$ models of dimension d. Once d is selected, we proceed along the following steps (a schematic sketch is given after the list):

1. Initialization. Choose any model Mα0 of dimension d, and calculate its marginal.

2. Evaluation. Consider each of the covariates present in Mα0 individually. Let x1 be a covariate of Mα0. Replace x1 with each of the covariates not present in Mα0, one by one, and compute the marginal density. Update Mα0 by replacing x1 with the covariate that yields the highest marginal density, say x∗, if x∗ ≠ x1; retain x1 otherwise. Do the same for the other (d − 1) covariates of Mα0 as well.

3. Replication. Repeat the previous step N times, where N is a moderately large number.

In the next section, we will show that if N is moderately large, the screening algorithm finally selects a supermodel of the true model with probability tending to one.
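The following Python sketch shows one way the screening step could be implemented (an illustration under our own interface, not the authors' code); `log_marginal` stands in for any scorer of a model, for example the closed-form score sketched in Section 2.2.

```python
import numpy as np

def screen(y, X, d, n_passes, log_marginal, seed=0):
    """Coordinate-wise swap search over models of fixed dimension d.
    `log_marginal(y, X, model)` returns the log marginal density of the model
    given by a list of column indices.  Returns the d retained indices."""
    p = X.shape[1]
    rng = np.random.default_rng(seed)
    model = list(rng.choice(p, size=d, replace=False))   # 1. Initialization
    best = log_marginal(y, X, model)
    for _ in range(n_passes):                            # 3. Replication (N passes)
        for pos in range(d):                             # 2. Evaluation, one slot at a time
            candidates = [j for j in range(p) if j not in model]
            for cand in candidates:
                trial = model.copy()
                trial[pos] = cand                        # swap the candidate into this slot
                score = log_marginal(y, X, trial)
                if score > best:                         # keep the best replacement found
                    best, model = score, trial
    return sorted(model)
```

Each pass costs roughly d(pn − d) marginal evaluations, in line with Remark 2.2 below.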

Model Selection: Once the screening algorithm selects a model, say Mα∗, we discard all the regressors that are not present in Mα∗. In the next step, we apply a Gibbs sampling algorithm to select the best model among the 2^d models which can be formed by the d regressors present in Mα∗. The sampling scheme that we use is completely described in Chipman et al. (2001, Section 3.5), in the section on Gibbs sampling algorithms under the conjugate setup. Note that the Gibbs sampling algorithm chooses models directly, following a Markov chain with the ratio of posterior probabilities as the transition kernel, and the set of regressors obtained at the end of the screening step contains all the active covariates with probability tending to one. Therefore, after sufficiently many iterations, the algorithm must select the model with the highest posterior probability, i.e., the true model.
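A minimal sketch of this second step is given below (illustrative Python in the spirit of the conjugate-setup sampler of Chipman et al. (2001); the interface, the initialization and what is returned are our own simplifications). Each inclusion indicator is updated from its full conditional, which depends only on the ratio of the closed-form posterior scores of the two models differing in that coordinate.

```python
import numpy as np

def gibbs_select(y, X_screened, n_iter, log_post_score, seed=0):
    """Gibbs sampling over the 2^d models formed by the d screened covariates.
    `log_post_score(y, X, alpha)` returns the log posterior score of the model
    with column indices `alpha` (it should also handle the empty model).
    The highest-scoring visited model is returned."""
    rng = np.random.default_rng(seed)
    d = X_screened.shape[1]
    gamma = np.ones(d, dtype=int)            # inclusion indicators; start from the full model
    best_model, best_score = None, -np.inf
    for _ in range(n_iter):
        for j in range(d):
            g_in, g_out = gamma.copy(), gamma.copy()
            g_in[j], g_out[j] = 1, 0
            s_in = log_post_score(y, X_screened, np.flatnonzero(g_in))
            s_out = log_post_score(y, X_screened, np.flatnonzero(g_out))
            # full conditional:  P(gamma_j = 1 | rest, y) = 1 / (1 + exp(s_out - s_in))
            gamma[j] = int(rng.random() < 1.0 / (1.0 + np.exp(s_out - s_in)))
        score = log_post_score(y, X_screened, np.flatnonzero(gamma))
        if score > best_score:
            best_model, best_score = list(np.flatnonzero(gamma)), score
    return best_model
```

In practice the scorer sketched in Section 2.2 (with gn and σ² fixed) can be passed as `log_post_score`.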

Remark 2.1. Note that the total number of models among which the algorithm selects the best one is $\binom{p_n}{d} + 2^d$. Thus, if we have some idea about the actual number of active covariates, we can use it to choose d as small as possible. A small choice of d makes the algorithm much faster.


Remark 2.2. In the evaluation step of the screening algorithm, we update the chosen model d(pn − d) times. If we repeat the evaluation step N times, then Nd(pn − d) updates take place. Therefore, it is enough to choose a moderately large N.

Note that Nd(pn − d) is also the computational complexity of the screening step. Even if we consider all the 2^d competing models for comparison in the second stage, the total computational complexity of the proposed algorithm would be Nd(pn − d) + 2^d, which is linear in pn and much smaller than 2^pn.
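As an illustrative calculation (the numbers here are chosen for concreteness, not taken from the paper): with n = 50, pn = 500, d = 12 and N = 20, the screening step requires Nd(pn − d) = 20 × 12 × 488 = 117,120 marginal evaluations and the second step compares at most 2^12 = 4,096 models, i.e., roughly 1.2 × 10^5 evaluations in total, as opposed to the 2^500 ≈ 3 × 10^150 models in the full space.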

3 Consistency of the Proposed Prior

This section is devoted to establishing consistency results for the proposed method of model selection. We consider the cases of known and unknown error variance σ² separately, stating clearly the assumptions required in each case.

3.1 Results for Known Error Variance

Often one has enough data to estimate the variance σ² properly, or, since it does not depend on the design matrix, σ² may have been estimated from earlier data sets. In such cases, σ² may be assumed to be known. In this subsection, we discuss results for the case with known σ². Given σ² and gn, the posterior probability of model Mα is proportional to
\[
P(M_\alpha \mid y_n) \propto \left(\frac{1}{p_n - 1}\right)^{p_n(\alpha)}
\left| I_{p_n(\alpha)} + g_n X_\alpha' X_\alpha \right|^{-1/2}
\exp\left\{ -\frac{R^{2*}_\alpha}{2\sigma^2} \right\},
\]
where $R^{2*}_\alpha = y_n'\left\{ I_n - X_\alpha \left( I_{p_n(\alpha)}/g_n + X_\alpha' X_\alpha \right)^{-1} X_\alpha' \right\} y_n$. Our results for the case with known σ² are based on the following set of assumptions.

(A1) The number of regressors pn = exp{b0 n^{1−r}}, where 0 ≤ r < 1 and b0 is any number free of n.

(A2) The true model Mαc is unique. There exist constants τ*max and τ*min, free of n, such that nτ*min I_{p(αc)} ≤ X′αc Xαc ≤ nτ*max I_{p(αc)}.

(A3) Let τmax and τmin be the highest and lowest non-zero eigenvalues of X′nXn/n; then τmax ≤ pn^{|zn|} with zn → 0, and τmin ≥ pn^{−|wn|} with wn → 0.

(A4) Consider the constants K0 > 6 and ∆0 = {δn^{1−s}} ∨ {6σ²p(αc) log pn} with δ > 0 and 0 < s ≤ 0.5. Let A3 = {α ∈ A : Mαc ⊄ Mα, rα ≤ K0 p(αc)}, let μn = Xαc βαc, and let Pn(α) be the projection matrix onto the span of Xα. We assume that
\[
\inf_{\alpha \in A_3} \mu_n'(I - P_n(\alpha))\mu_n > \Delta_0.
\]

(A5) The hyperparameter gn is such that ngn = pn^{2+δ1}, for some 5/(K0 − 1) ≤ δ1 ≤ 1 free of n, where K0 is as stated in assumption (A4) above.
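As an illustration of the order prescribed by (A5) (the numbers here are ours, chosen for concreteness): with n = 100, pn = 1000 and K0 = 7, δ1 may be any value in [5/6, 1]; taking δ1 = 5/6 gives ngn = pn^{2+5/6} ≈ 3.2 × 10^8, i.e., gn ≈ 3.2 × 10^6.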

Assumptions (A1) and (A5) describe the setup and our choice of the hyperparameter gn, respectively. Assumption (A2) states that the true model is unique, and that it includes a set of independent regressors. The matrix X′αcXαc depends on n, but not on pn; therefore, we allow the eigenvalues of the true model to vary only with n. Assumption (A3) is also quite general, as we allow the eigenvalues of X′nXn to vary with both n and pn. This is reasonable since the dimension of X′nXn depends on both n and pn.

Assumption (A4) is commonly termed an identifiability condition for model selection. The quantity μ′n(I − Pn(α))μn may be interpreted as the distance of the αth model from the true model. For consistent model selection, it is necessary for the true model to keep a distance from other models; otherwise, the true model may not be identifiable. It has been proved in Moreno et al. (2015, Lemma 3) that lim_{n→∞} μ′n(I − Pn(α))μn/n > 0 for any non-supermodel of the true model. We have only assumed a uniform lower bound for μ′n(I − Pn(α))μn over non-supermodels of low rank, and fixed a threshold value for the case where log pn = b0 n with b0 > 0. When log pn = b0 n^{1−r} with r > 0, the threshold is not even of order n, and therefore the condition is trivially satisfied (by Moreno et al. (2015, Lemma 3)).

The consistency results are split into two parts. Model selection consistency of the proposed set of priors is shown in Section 3.1.1, and consistency of the model selection algorithm is shown in Section 3.1.2.

3.1.1 Model Selection Consistency

If the true model is among the candidate models in the model space, it is natural to check whether a model selection procedure can identify the true model with probability tending to one. This property, known as ‘model selection consistency’, requires $P(M_{\alpha_c} \mid y_n) \xrightarrow{p} 1$, which is equivalent to showing that
\[
\sum_{\alpha \in A \setminus \{\alpha_c\}} \frac{P(M_\alpha \mid y_n)}{P(M_{\alpha_c} \mid y_n)} \;\xrightarrow{p}\; 0. \tag{3.1}
\]
We now state the result on model selection consistency for the case where σ² is known.

Theorem 3.1. Consider the model (2.1) with known σ². Under assumptions (A1)-(A5), the method based on the proposed set of priors is model selection consistent.

Remark 3.1. The proof of model selection consistency for known σ² (see Section 6.2) only requires pn → ∞, and does not explicitly require n → ∞. As log pn ≤ b0 n, an appropriately large pn and only a moderately large n are sufficient for good performance of this set of priors in practice. From this point of view, the proposed set of priors is suitable for high-dimensional, medium sample size settings.

3.1.2 Consistency of the Model Selection Algorithm

At the end of the screening step of the proposed model selection algorithm, one is left with a model of dimension d. We claim that this screening step is consistent in the sense that the model chosen at the end of it, say Mα∗, may not be unique, but it is a supermodel of the true model, i.e., Mαc ⊆ Mα∗, with probability tending to one. We now consider the following result.

Theorem 3.2. Let d be an integer such that d/K0 ≤ p(αc) < d, and let Mα1 and Mα2 be a supermodel and a non-supermodel of Mαc of dimension d, respectively. If σ² is known and assumptions (A1)-(A5) hold, then for the proposed set of priors we have
\[
\sup_{\alpha_1, \alpha_2} \frac{P(M_{\alpha_2} \mid y_n)}{P(M_{\alpha_1} \mid y_n)} \;\xrightarrow{p}\; 0,
\quad \text{or, equivalently,} \quad
\sup_{\alpha_1, \alpha_2} \frac{m(M_{\alpha_2} \mid y_n)}{m(M_{\alpha_1} \mid y_n)} \;\xrightarrow{p}\; 0,
\]
under Mαc.

This theorem states that under Mαc, the posterior probability of a supermodel of Mαc of dimension d is much higher than the posterior probability of a non-supermodel of the same dimension. As the algorithm selects a model on the basis of the marginal density, which is equivalent to the posterior probability when models of the same dimension are considered, it is expected that a supermodel will be selected after some iterations. After a supermodel is selected in the first stage, we only consider the d regressors included in the supermodel and find the best model among the 2^d possible ones formed by these d regressors. As the Gibbs sampling algorithm in the second step chooses models on the basis of posterior probabilities, consistency of this part is immediate from the screening consistency and the model selection consistency of the proposed set of priors.

3.2 Results for Unknown Error Variance

When the error variance σ² is unknown, we assign the standard non-informative Jeffreys prior π(σ²) ∝ 1/σ². In this case, the posterior probability of any model Mα satisfies
\[
p(M_\alpha \mid y_n) \propto \left(\frac{1}{p_n - 1}\right)^{p_n(\alpha)}
\left| I + g_n X_\alpha' X_\alpha \right|^{-1/2}
\left( R^{2*}_\alpha \right)^{-n/2}.
\]
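A minimal Python sketch of this score is given below (an illustration with our own interface, not the authors' code); R²*_α is computed exactly as in Section 3.1.

```python
import numpy as np

def log_post_score_unknown_sigma(y, X, alpha, g):
    """Log of the (unnormalized) posterior probability of the model `alpha`
    when sigma^2 is unknown and carries the Jeffreys prior:
        p_n(alpha)*log(1/(p_n - 1)) - 0.5*log|I + g*Xa'Xa| - (n/2)*log(R2*_alpha)."""
    n, p = X.shape
    Xa = X[:, list(alpha)]
    k = Xa.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(k) + g * Xa.T @ Xa)
    Xty = Xa.T @ y
    r2star = float(y @ y - Xty @ np.linalg.solve(np.eye(k) / g + Xa.T @ Xa, Xty))
    return k * np.log(1.0 / (p - 1)) - 0.5 * logdet - 0.5 * n * np.log(r2star)
```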

As σ² is unknown, we need to modify assumptions (A1)-(A5) of the previous subsection. The modified assumptions are stated below:

(B1) For some positive integer M free of n, rank(X′nXn) ≤ M.

(B2) The number of regressors pn = exp{b0 n^{1−r}}, where 0 ≤ r < 1 and b0 is any number free of n. For r = 0, we need b0 < [ξ(1 − ξ){2(1 + ξ)(M − p(αc))}^{−1}] ∧ (4p(αc))^{−1} for some 1/(K0 − 1) < ξ ≤ 0.1 and K0 > 12.

(B3) The hyperparameter gn is such that ngn = pn^{2+δ1}, with 7ξ/(1 − ξ)² + 2/(K0 − 1) ≤ δ1 ≤ 2.

(B4) Assumption (A4) holds with ∆0 = {12σ²p(αc) log pn} ∨ {δn^{1−s}} for 0 < s ≤ 0.5.

Note that assumption (B1) implies that the highest and the lowest non-zero eigenvalues of the design matrix are free of n. Assumption (B2) imposes some additional restrictions on the dimension pn when it is of the order exp{O(n)}. Unlike the case of known σ², here we fail to accommodate every pn of the order exp{O(n)} (recall assumption (A1)); rather, we impose a multiplicative constant b0 such that log pn ≤ b0 n. Assumption (B3) indicates that we need a slightly larger value of gn in order to achieve consistency when the parameter σ² is unknown. Finally, assumption (B4) is the same as assumption (A4) with a partially changed threshold value; nevertheless, the implications of the assumption and its importance remain the same here. We now state the result on model selection consistency for an unknown value of σ².


Theorem 3.3. Consider the model (2.1) with unknown σ². Under assumptions (B1)-(B4), the method based on the proposed set of priors is model selection consistent.

We do not present a separate result for screening consistency of the algorithm stated in Section 2.3 for the case where σ² is unknown. A result similar to Theorem 3.2 can be stated here, and a proof similar to that of Theorem 3.2 (i.e., the case with known σ², see Section 6.2) can also be given, using assumptions (B1)-(B4) instead of (A1)-(A5).

4 Simulation Study

We validate the performance of the proposed method of model selection using a wide variety of simulated data sets. Under different simulation schemes, we present the proportion of times a model selection algorithm selects the true model.

Our method: The model selection algorithm we follow is completely described in Section 2.3. The number of regressors selected at the first stage, d, is taken to be [n/4] in each case. In the screening step we choose gn = pn²/n, and in the second step of model selection we choose gn = d².

Other methods: As mentioned in the Introduction, there are several methods for variable selection from both the classical and the Bayesian perspectives. We consider some of the more competitive methods for comparison. Among the classical methods, we consider three approaches based on iterative sure independence screening (ISIS), namely ISIS-LASSO-BIC, ISIS-SCAD-BIC and ISIS-MCP-BIC. Here an initial set of variables is first selected by ISIS, and then a step of penalized regression is carried out using the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), or minimax concave penalty (MCP, Zhang (2010)), with the regularization parameter tuned by the BIC. Among the Bayesian competitors, we consider methods based on Bayesian credible regions (BCR.marg and BCR.joint, Bondell and Reich (2012)) and the Bayesian shrinking and diffusing prior (BASAD, Narisetty and He (2014)). We have used R codes for all the methods. For ISIS, we have implemented codes from the R package SIS. The R codes for BCR are obtained from the first author’s website, while the first author of Narisetty and He (2014) kindly shared the codes for BASAD with us. There are two versions of BASAD: one is exact, while the other is an approximate one for high-dimensional data. We have implemented the second version for the sake of saving computing time.

Simulation setup. We consider two values for n, namely 50 and 100. For n = 50, we choose pn = 100 and 500, while for n = 100 we choose pn = 500, 1000 and 2000. The model yn = μn + en is considered as the true model, where μn = Xαc βαc. The vector of regression coefficients is assumed to be sparse, i.e., only p(αc) of its components are non-zero.
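The sketch below generates data of the kind used in this study (illustrative Python; the i.i.d. standard normal design, the common value of the nonzero coefficients and the size of the true model are assumptions on our part, not the paper's exact simulation settings).

```python
import numpy as np

def simulate(n, p, active, beta_value=1.0, sigma=1.0, seed=0):
    """Draw (y, X) from y = X beta + e with a sparse beta supported on `active`.
    The N(0,1) design and the common coefficient value are illustrative choices."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[list(active)] = beta_value
    e = sigma * rng.standard_normal(n)
    return X @ beta + e, X

# e.g. one of the (n, p_n) settings mentioned above, with a small true model
y, X = simulate(n=100, p=1000, active=[0, 1, 2])
```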

Lemma 6.2. Let χ²_r(λ) denote a noncentral χ² random variable with df r and noncentrality parameter λ. Then
\[
P\left( \chi^2_r(\lambda) > t \right) \le P\left( \chi^2_r > \frac{t - \lambda}{2} \right) + P\left( Z > \frac{t - \lambda}{4\sqrt{\lambda}} \right),
\]
where χ²_r denotes a central χ² random variable with df r and Z ∼ N(0, 1).

Lemma 6.3. Let Mα, α ∈ A, be any model and let Mα′ be a model such that nτ′min I ≤ X′α′Xα′ ≤ nτ′max I for some τ′min and τ′max, free of n. Then, under assumption (A3),

\[
\frac{\left| I + g X_\alpha' X_\alpha \right|^{-1}}{\left| I + g X_{\alpha'}' X_{\alpha'} \right|^{-1}}
\le c\, (ng)^{r_{\alpha'} - r_\alpha}\, \left(\tau'_{\max}\right)^{r_{\alpha'}}\, \left(\tau'_{\min}\right)^{-(r_\alpha + r_1)}\, \tau_{\max}^{r_1}
\]

for sufficiently large p, where r1 = r(Xα∧α′) and c > 1 is some constant.

Lemma 6.4. If Mαc is the true model, then under the setup (2.1) and assumptions (A2) and (A5), for any ε > 0 the probabilities of the following three events tend to 0 exponentially in n:
(a) $R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} > \epsilon$;  (b) $R^{2}_{\alpha_c} > n(1 + \epsilon)\sigma^2$;  (c) $R^{2}_{\alpha_c} < n(1 - \epsilon)\sigma^2$.

Lemma 6.5. Let A1 be the set of all supermodels of the true model Mαc, let A∗1 be the subset of A1 containing models of rank at most d, for some d > p(αc), and let A2 = {α : Mαc ⊄ Mα, rα > K0 p(αc)}. Then the following statements hold.

(a) For any R > 2, with probability tending to 1,
\[
\max_{\alpha \in A_1} \left( R^{2}_{\alpha_c} - R^{2}_{\alpha} \right) \le R\sigma^2 (r_\alpha - p(\alpha_c)) \log p.
\]

(b) For any α ∈ A∗1 and any ε > 0, the probability that $R^{2*}_{\alpha} - R^{2}_{\alpha} > \epsilon$ tends to 0 exponentially in n.

(c) For R > 2K0/(K0 − 1), with probability tending to 1,
\[
\max_{\alpha \in A_2} \left( R^{2}_{\alpha_c} - R^{2}_{\alpha \vee \alpha_c} \right) \le R\sigma^2 (r_\alpha - p(\alpha_c)) \log p.
\]

Lemma 6.6. Let yn = μn + en with en ∼ N(0, σ²I) and μ′nμn = O(n). For any hn such that hn = n^k for some 0.5 < k < 1, we have |μ′nen| = op(hn).

The proofs of Lemmas 6.1-6.6 are given in the supplementary file.

6.2 Main Results

Proof of Theorem 3.1. The ratio of the posterior probabilities of any model to the true model is given by
\[
\frac{P(M_\alpha \mid y_n)}{P(M_{\alpha_c} \mid y_n)}
= \left(\frac{1}{p_n - 1}\right)^{p_n(\alpha) - p(\alpha_c)}
\frac{\left| I + g_n X_{\alpha_c}' X_{\alpha_c} \right|^{1/2}}{\left| I + g_n X_\alpha' X_\alpha \right|^{1/2}}
\exp\left\{ -\frac{R^{2*}_\alpha - R^{2*}_{\alpha_c}}{2\sigma^2} \right\}. \tag{6.1}
\]

We split A into three subclasses as follows:

(i) Supermodels of the true model, A1 = {α : Mαc ⊂ Mα}.

(ii) Non-supermodels of large dimension, A2 = {α : Mαc ⊄ Mα; rα > K0 p(αc)}, where rα is the rank of Xα.

(iii) Non-supermodels of small to moderate rank, A3 = {α : Mαc ⊄ Mα; rα ≤ K0 p(αc)}.

We prove (3.1) separately for models in A = Ai, for i = 1, 2, 3.

Case I: Supermodels (α ∈ A1). First, we obtain a uniform upper bound for the ratio of the posterior probabilities of any model Mα and Mαc, given in (6.1). Note that
\[
R^{2*}_{\alpha_c} - R^{2*}_{\alpha}
= R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} + R^{2}_{\alpha_c} - R^{2*}_{\alpha}
\le R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} + R^{2}_{\alpha_c} - R^{2}_{\alpha},
\]
where $R^{2}_{\alpha} = y_n'\left\{ I - X_\alpha (X_\alpha' X_\alpha)^{-1} X_\alpha' \right\} y_n$ and $R^{2*}_{\alpha} \ge R^{2}_{\alpha}$. By part (a) of Lemma 6.4 we have $R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} = o_p(1)$. By part (a) of Lemma 6.5, for α ∈ A1 and some R = 2(1 + ε), ε > 0, we have $\max_{\alpha \in A_1}\left( R^{2}_{\alpha_c} - R^{2}_{\alpha} \right) \le R\sigma^2 (r_\alpha - p(\alpha_c)) \log p$. Therefore, for any ε > 0,
\[
\exp\left\{ -\frac{1}{2\sigma^2}\left( R^{2*}_{\alpha} - R^{2*}_{\alpha_c} \right) \right\}
\le p^{R(r_\alpha - p(\alpha_c))/2 + o_p(1)}.
\]
Again, by Lemma 6.3 and assumptions (A2)-(A3), we have
\[
\frac{\left| I + g X_\alpha' X_\alpha \right|^{-1/2}}{\left| I + g X_{\alpha_c}' X_{\alpha_c} \right|^{-1/2}}
\le c^* (n g \tau_{\min})^{-(r_\alpha - p(\alpha_c))/2}\, \tau_{\max}^{p(\alpha_c)/2}
\le c^* (n g \tau_{\min})^{-(r_\alpha - p(\alpha_c))/2}\, p^{o(1)},
\]
where c∗ is some appropriate constant. Therefore, summing the ratio of posterior probabilities over Mα ∈ A1, we have
\[
\sum_{\alpha \in A_1} \frac{p(M_\alpha \mid y_n)}{p(M_{\alpha_c} \mid y_n)}
\le \sum_{\alpha \in A_1} \frac{c^*\, p^{(1+\epsilon)(r_\alpha - p(\alpha_c)) + o_p(1) + o(1)}}
{(p-1)^{p_n(\alpha) - p(\alpha_c)}\, (\tau_{\min}\, n\, g)^{(r_\alpha - p(\alpha_c))/2}}
\le \sum_{\alpha \in A_1} \left( \sqrt{\frac{p^{2+\delta_1/3}}{p^{2+\delta_1}}} \right)^{r_\alpha - p(\alpha_c)}
\frac{1}{(p-1)^{p_n(\alpha) - p(\alpha_c)}},
\]
for some suitably chosen ε > 0. This is due to the fact that we can choose ε so that the term ε + op(1) + o(1) < δ1/3 for sufficiently large p. By assumption (A2), the true model is unique and of full rank, and therefore rα − p(αc) ≥ 1. Thus, the above expression is less than
\[
p^{-\delta_1/3} \sum_{q=1}^{p - p(\alpha_c)} \binom{p - p(\alpha_c)}{q} \frac{1}{(p-1)^q}
\le p^{-\delta_1/3} \left\{ \left( 1 + \frac{1}{p-1} \right)^{p} - 1 \right\},
\]

and this tends to 0 as p → ∞.

Case II: Non-supermodels of large dimension (α ∈ A2). We split $R^{2*}_{\alpha_c} - R^{2*}_{\alpha}$ as before and use the fact that $R^{2}_{\alpha \vee \alpha_c} \le R^{2}_{\alpha}$. Thus, we have
\[
R^{2*}_{\alpha_c} - R^{2*}_{\alpha}
\le R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} + R^{2}_{\alpha_c} - R^{2}_{\alpha}
\le R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} + R^{2}_{\alpha_c} - R^{2}_{\alpha \vee \alpha_c}.
\]
From part (c) of Lemma 6.5, with probability tending to 1, $R^{2}_{\alpha_c} - R^{2}_{\alpha \vee \alpha_c} \le R\sigma^2 (r_\alpha - p(\alpha_c)) \log p$ for R = 2(1 + s) with s > 1/(K0 − 1). Using Lemma 6.3, along with assumptions (A2)-(A3) as in the previous case, we have
\[
\sum_{\alpha \in A_2} \frac{p(M_\alpha \mid y_n)}{p(M_{\alpha_c} \mid y_n)}
\le \sum_{\alpha \in A_2} \frac{c^*\, p^{(1+s)(r_\alpha - p(\alpha_c)) + o_p(1) + o(1)}}
{(p-1)^{p_n(\alpha) - p(\alpha_c)}\, (\tau_{\min}\, n\, g)^{(r_\alpha - p(\alpha_c))/2}}
\le c^*\, p^{p(\alpha_c)} \sum_{\alpha \in A_2}
\left( \frac{p^{1 + 6/(5(K_0 - 1))}}{\sqrt{ng}} \right)^{r_\alpha - p(\alpha_c)}
\frac{1}{(p-1)^{p_n(\alpha)}},
\]
for an appropriately chosen c∗ and s so that s + op(1) + o(1) ≤ 6/(5(K0 − 1)) for sufficiently large n. As rα − p(αc) > (K0 − 1)p(αc), the above expression is less than
\[
c^* \left( \frac{p^{1 + 11/(5(K_0 - 1))}}{\sqrt{ng}} \right)^{(K_0 - 1) p(\alpha_c)}
\sum_{q = K_0 p(\alpha_c) + 1}^{p} \binom{p}{q} \frac{1}{(p-1)^q}. \tag{6.2}
\]
Also δ1 ≥ 5/(K0 − 1), and so the second term in the above expression is no bigger than p^{−3p(αc)/10}, which converges to 0 as p → ∞. However, the third term is dominated by $\sum_{q=1}^{p} \binom{p}{q} (p-1)^{-q}$, which converges to e as p → ∞. These facts together imply that (6.2) converges to 0 as p → ∞.

Case III: Non-supermodels of small to moderate rank (α ∈ A3). As in the previous case, we have
\[
R^{2*}_{\alpha_c} - R^{2*}_{\alpha}
\le R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} + R^{2}_{\alpha_c} - R^{2}_{\alpha}.
\]
By part (a) of Lemma 6.4, we have $R^{2*}_{\alpha_c} - R^{2}_{\alpha_c} = o_p(1)$. Next, consider the term $R^{2}_{\alpha_c} - R^{2}_{\alpha}$ on the right hand side of the above expression:
\[
R^{2}_{\alpha_c} - R^{2}_{\alpha} = y_n'(P_n(\alpha_c) - P_n(\alpha))y_n
= \mu_n'(P_n(\alpha_c) - P_n(\alpha))\mu_n + 2\mu_n'(P_n(\alpha_c) - P_n(\alpha))e_n + e_n'(P_n(\alpha_c) - P_n(\alpha))e_n.
\]
Note that $\mu_n'(P_n(\alpha_c) - P_n(\alpha))\mu_n = \mu_n'(I - P_n(\alpha))\mu_n > \Delta_0$ by assumption (A4). Again,
\[
\mu_n'(P_n(\alpha_c) - P_n(\alpha))e_n
= \mu_n'(P_n(\alpha_c) - P_n(\alpha \vee \alpha_c))e_n + \mu_n'(P_n(\alpha \vee \alpha_c) - P_n(\alpha))e_n
\ge -2|\mu_n' e_n|.
\]
By Lemma 6.6, $|\mu_n' e_n| = o_p(h_n)$ for $h_n = n^k$ with some 0.5 < k < 1. Finally, we get $e_n'(P_n(\alpha_c) - P_n(\alpha))e_n \ge -e_n'(P_n(\alpha \vee \alpha_c) - P_n(\alpha_c))e_n$, as $P_n(\alpha \vee \alpha_c) - P_n(\alpha)$ is an idempotent matrix and hence $e_n'(P_n(\alpha \vee \alpha_c) - P_n(\alpha))e_n \ge 0$. Note that $e_n'(P_n(\alpha \vee \alpha_c) - P_n(\alpha_c))e_n \le e_n' P_n(\alpha_c) e_n$ for any α ∈ A3 (see Section 2.3.2 of Yanai et al. (2011)). Also, $e_n' P_n(\alpha_c) e_n = O_p(1)$, since it follows the σ²χ² distribution with df p(αc). Combining all these facts and using assumption (A4), we have
\[
R^{2*}_{\alpha_c} - R^{2*}_{\alpha} \le -\Delta_0 (1 + o_p(1)).
\]
Further, from Lemma 6.3, the ratio of determinants in (6.1) is less than $c^* \left( \sqrt{ng}\, \tau_{\max} \right)^{p(\alpha_c)}$ for an appropriately chosen c∗ > 0. Therefore,
\[
\sum_{\alpha \in A_3} \frac{p(M_\alpha \mid y_n)}{p(M_{\alpha_c} \mid y_n)}
\le c^* \left( p \sqrt{ng}\, \tau_{\max} \right)^{p(\alpha_c)}
\exp\left\{ -\frac{\Delta_0}{2\sigma^2} (1 + o_p(1)) \right\}
\sum_{q=1}^{p} \binom{p}{q} \frac{1}{(p-1)^q}. \tag{6.3}
\]
For sufficiently large p, we have $c^* \left( p \sqrt{ng}\, \tau_{\max} \right)^{p(\alpha_c)} \le p^{2(1+\delta_1/3)p(\alpha_c)}$. By assumption (A4), $\exp\{-\Delta_0/(2\sigma^2)\} \le p^{-3p(\alpha_c)}$. Thus the product of the first three terms on the right hand side (rhs) of (6.3) converges to zero, whereas the last term converges to e. Using the above facts, it is evident that the rhs of (6.3) is less than $p^{-(1-\delta_1/3)p(\alpha_c)}$. As p → ∞, the result follows.

Proof of Theorem 3.2. First note that Mα1 and Mα2 are of dimension d, and d is a constant free of n. Therefore, the ranks of the two models, rα1 and rα2, are also free of n. We now have
\[
\sup_{\alpha_1, \alpha_2} \frac{m(M_{\alpha_2} \mid y_n)}{m(M_{\alpha_1} \mid y_n)}
= \sup_{\alpha_1, \alpha_2} \exp\left\{ -\frac{1}{2\sigma^2}\left( R^{2*}_{\alpha_2} - R^{2*}_{\alpha_1} \right) \right\}
\frac{\left| I + g X_{\alpha_2}' X_{\alpha_2} \right|^{-1/2}}{\left| I + g X_{\alpha_1}' X_{\alpha_1} \right|^{-1/2}}. \tag{6.4}
\]
By assumptions (A2) and (A3) and Lemma 6.3, we get
\[
\sup_{\alpha_1, \alpha_2} \frac{\left| I + g X_{\alpha_1}' X_{\alpha_1} \right|^{1/2}}{\left| I + g X_{\alpha_2}' X_{\alpha_2} \right|^{1/2}}
\le (\sqrt{ng})^{r_{\alpha_1} - r_{\alpha_2}}\, \tau_{\max}^{r_1 + p(\alpha_c)}\, \tau_{\min}^{r_1 - r_{\alpha_2}}\, (1 + \xi_n)
\le (\sqrt{ng})^{r_{\alpha_1} - r_{\alpha_2} + o(1)},
\]
where r1 = rank(Xα1∧α2). We also have
\[
R^{2*}_{\alpha_1} - R^{2*}_{\alpha_2}
= R^{2*}_{\alpha_1} - R^{2}_{\alpha_1} + R^{2}_{\alpha_1} - R^{2*}_{\alpha_2}
\le R^{2*}_{\alpha_1} - R^{2}_{\alpha_1} + R^{2}_{\alpha_1} - R^{2}_{\alpha_c} + R^{2}_{\alpha_c} - R^{2}_{\alpha_2}.
\]
From part (b) of Lemma 6.5, we get $\sup_{\alpha_1} \left( R^{2*}_{\alpha_1} - R^{2}_{\alpha_1} \right) = o_p(1)$. By the properties of projection matrices, $R^{2}_{\alpha_1} - R^{2}_{\alpha_c} \le 0$. Next consider $R^{2}_{\alpha_c} - R^{2}_{\alpha_2}$, which is equal to
\[
y_n'(P_n(\alpha_c) - P_n(\alpha_2))y_n
= \mu_n'(P_n(\alpha_c) - P_n(\alpha_2))\mu_n + 2\mu_n'(P_n(\alpha_c) - P_n(\alpha_2))e_n + e_n'(P_n(\alpha_c) - P_n(\alpha_2))e_n.
\]
We now have
\[
\mu_n'(P_n(\alpha_c) - P_n(\alpha_2))e_n
= \mu_n'(P_n(\alpha_c) - P_n(\alpha_2 \vee \alpha_c))e_n + \mu_n'(P_n(\alpha_2 \vee \alpha_c) - P_n(\alpha_2))e_n
\ge -2|\mu_n' e_n|.
\]
By Lemma 6.6, $|\mu_n' e_n| = o_p(h_n)$ for $h_n = n^k$ with some 0.5 < k < 1. Finally,
\[
e_n'(P_n(\alpha_c) - P_n(\alpha_2))e_n \ge -e_n'(P_n(\alpha_2 \vee \alpha_c) - P_n(\alpha_c))e_n,
\]
as $e_n'(P_n(\alpha_2 \vee \alpha_c) - P_n(\alpha_2))e_n \ge 0$. Note that $e_n'(P_n(\alpha_2 \vee \alpha_c) - P_n(\alpha_c))e_n \le e_n' P_n(\alpha_c) e_n$, and $e_n' P_n(\alpha_c) e_n = O_p(1)$. Again, by assumption (A4), we have $\mu_n'(P_n(\alpha_c) - P_n(\alpha_2))\mu_n \ge \Delta_0$. Combining the above statements and using assumption (A5), from (6.4) we have
\[
\sup_{\alpha_1, \alpha_2} \frac{m(M_{\alpha_2} \mid y_n)}{m(M_{\alpha_1} \mid y_n)}
\le (\sqrt{ng})^{o(1)}\, p^{p(\alpha_c)} \exp\left\{ -\frac{\Delta_0}{2\sigma^2}(1 + o_p(1)) \right\}
\le p^{-(1 - \delta_1/2 + o_p(1)) p(\alpha_c)},
\]
which converges to 0 as p → ∞.

Supplementary Material (https://drive.google.com/open?id=0By7-ldtnmyfvUjlpWUZnaGpwNWs). The Supplementary Material contains proofs of all the Lemmas and Theorem 3.3. A pdf copy is available at the link mentioned above.


References

Bondell, H. D. and Reich, B. J. (2012). “Consistent high-dimensional Bayesian variable selection via penalized credible regions.” J. Amer. Statist. Assoc., 107(500): 1610–1624.
Candès, E. and Tao, T. (2007). “Rejoinder: ‘The Dantzig selector: statistical estimation when p is much larger than n’ [Ann. Statist. 35 (2007), no. 6, 2313–2351; MR2382644].” Ann. Statist., 35(6): 2392–2404.
Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). “Bayesian linear regression with sparse priors.” Ann. Statist., 43(5): 1986–2018.
Chen, J. and Chen, Z. (2012). “Extended BIC for small-n-large-P sparse GLM.” Statist. Sinica, 22(2): 555–574.
Chipman, H., George, E. I., and McCulloch, R. E. (2001). “The practical implementation of Bayesian model selection.” In Model Selection, volume 38 of IMS Lecture Notes Monogr. Ser., 65–134. Inst. Math. Statist., Beachwood, OH.
Fan, J., Feng, Y., and Song, R. (2011). “Nonparametric independence screening in sparse ultra-high-dimensional additive models.” J. Amer. Statist. Assoc., 106(494): 544–557.
Fan, J. and Li, R. (2001). “Variable selection via nonconcave penalized likelihood and its oracle properties.” J. Amer. Statist. Assoc., 96(456): 1348–1360.
Fan, J. and Lv, J. (2008). “Sure independence screening for ultrahigh dimensional feature space.” J. R. Stat. Soc. Ser. B Stat. Methodol., 70(5): 849–911.
Fan, J. and Lv, J. (2010). “A selective overview of variable selection in high dimensional feature space.” Statist. Sinica, 20(1): 101–148.
Fan, J. and Song, R. (2010). “Sure independence screening in generalized linear models with NP-dimensionality.” Ann. Statist., 38(6): 3567–3604.
George, E. I. and Foster, D. P. (2000). “Calibration and empirical Bayes variable selection.” Biometrika, 87(4): 731–747.
Ishwaran, H. and Rao, J. S. (2005). “Spike and slab variable selection: frequentist and Bayesian strategies.” Ann. Statist., 33(2): 730–773.
Johnson, V. E. and Rossell, D. (2012). “Bayesian model selection in high-dimensional settings.” J. Amer. Statist. Assoc., 107(498): 649–660.
Liang, F., Song, Q., and Yu, K. (2013). “Bayesian subset modeling for high-dimensional generalized linear models.” J. Amer. Statist. Assoc., 108(502): 589–606.
Moreno, E., Girón, J., and Casella, G. (2015). “Posterior model consistency in variable selection as the model dimension grows.” Statist. Sci., 30(2): 228–241.
Narisetty, N. N. and He, X. (2014). “Bayesian variable selection with shrinking and diffusing priors.” Ann. Statist., 42(2): 789–817.
Rosset, S. (2013). “Practical sparse modeling: an overview and two examples from genetics.” Chapter 3 in Practical Applications of Sparse Modeling, I. Rish et al. (eds.), MIT Press.
Song, Q. and Liang, F. (2015). “A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression.” J. R. Stat. Soc. Ser. B Stat. Methodol., 77(5): 947–972.
Song, R., Yi, F., and Zou, H. (2014). “On varying-coefficient independence screening for high-dimensional varying-coefficient models.” Statist. Sinica, 24(4): 1735–1752.
Wang, H. (2009). “Forward regression for ultra-high dimensional variable screening.” J. Amer. Statist. Assoc., 104(488): 1512–1524.
Yanai, H., Takeuchi, K., and Takane, Y. (2011). Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition. Statistics for Social and Behavioral Sciences. Springer, New York.
Yuan, M. and Lin, Y. (2006). “Model selection and estimation in regression with grouped variables.” J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1): 49–67.
Zhang, C.-H. (2010). “Nearly unbiased variable selection under minimax concave penalty.” Ann. Statist., 38(2): 894–942.
Zou, H. (2006). “The adaptive lasso and its oracle properties.” J. Amer. Statist. Assoc., 101(476): 1418–1429.
Zou, H. and Hastie, T. (2005). “Regularization and variable selection via the elastic net.” J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2): 301–320.

Acknowledgments

The authors are sincerely thankful to Prof. Tapas Samanta for his insightful comments and guidance towards this article.