J. Japan Statist. Soc. Vol. 40 No. 1 2010 111–130

AN EMPIRICAL BAYES INFORMATION CRITERION FOR SELECTING VARIABLES IN LINEAR MIXED MODELS

Tatsuya Kubokawa* and Muni S. Srivastava**

The paper addresses the problem of selecting variables in linear mixed models (LMM). We propose the Empirical Bayes Information Criterion (EBIC), which uses partial prior information on the parameters of interest. Specifically, EBIC incorporates a non-subjective prior distribution on the regression coefficients with an unknown hyper-parameter, but it is free from the setup of prior information on nuisance parameters such as the variance components. It is shown that EBIC not only has the desirable asymptotic property of consistency as a variable selection procedure, but also performs better than conventional methods such as AIC, conditional AIC and BIC, in both small and large samples, in terms of selecting the true variables.

Key words and phrases: Akaike information criterion, Bayesian information criterion, consistency, empirical Bayes method, linear mixed model, maximum likelihood estimator, nested error regression model, random effect, restricted maximum likelihood estimator, selection of variables.

1. Introduction

Consider the general linear mixed model (LMM)

(1.1)  $y = X\beta + Zv + \epsilon,$

where $y$ is an $N \times 1$ observation vector of the response variable, $X$ is an $N \times p$ matrix of explanatory variables, $Z$ is an $N \times M$ matrix of known covariates, $\beta$ is a $p \times 1$ unknown vector of regression coefficients, $v$ is an $M \times 1$ vector of random effects, and $\epsilon$ is an $N \times 1$ vector of random errors. Here, $v$ and $\epsilon$ are mutually independently distributed as $v \sim \mathcal{N}_M(0, \sigma^2 G(\psi))$ and $\epsilon \sim \mathcal{N}_N(0, \sigma^2 R(\psi))$, where $\psi = (\psi_1, \ldots, \psi_d)'$ is a $d$-dimensional vector of unknown parameters, and $G = G(\psi)$ and $R = R(\psi)$ are positive definite matrices. Then $y$ has the marginal distribution $\mathcal{N}_N(X\beta, \sigma^2\Lambda(\psi))$ for $\Lambda = \Lambda(\psi) = R(\psi) + ZG(\psi)Z'$. Throughout the paper, we assume that $X$ has full column rank $p$. In LMM, we address the problem of selecting regression variables $x_{(1)}, \ldots, x_{(p)}$ for $X = (x_{(1)}, \ldots, x_{(p)})$, and we propose a new variable selection criterion which is consistent in the sense of selecting the true variables.

Received July 29, 2009. Revised February 2, 2010. Accepted July 5, 2010.
*Faculty of Economics, University of Tokyo, Hongo, Bunkyo-ku, Tokyo 113-0033, Japan. Email: [email protected]
**Department of Statistics, University of Toronto, 100 St George Street, Toronto, Ontario, Canada M5S 3G3. Email: [email protected]
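As a small concrete illustration of (1.1), the following sketch simulates a single draw from the model and forms the marginal covariance $\Lambda(\psi) = R(\psi) + ZG(\psi)Z'$. The dimensions and the simple variance structure $G(\psi) = \psi I_M$, $R(\psi) = I_N$ are our own hypothetical choices, not ones used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: N observations, p regressors, M random effects.
N, p, M = 30, 3, 5
X = rng.standard_normal((N, p))
Z = rng.standard_normal((N, M))
beta = np.array([1.0, -0.5, 2.0])
sigma2 = 1.5

# Simple illustrative variance structure: G = psi * I_M, R = I_N.
psi = 0.8
G = psi * np.eye(M)
R = np.eye(N)
Lam = R + Z @ G @ Z.T            # Lambda(psi) = R(psi) + Z G(psi) Z'

# Draw v ~ N_M(0, sigma^2 G) and eps ~ N_N(0, sigma^2 R), then form (1.1).
v = rng.multivariate_normal(np.zeros(M), sigma2 * G)
eps = rng.multivariate_normal(np.zeros(N), sigma2 * R)
y = X @ beta + Z @ v + eps       # y = X beta + Z v + eps
```

Marginally, $y \sim \mathcal{N}_N(X\beta, \sigma^2\Lambda(\psi))$, which is the distribution on which the marginal-likelihood criteria discussed below are built.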


The variable selection problem in the ordinary linear regression model, which corresponds to the case $\Lambda = I_N$, has been studied extensively in the literature. Among the proposed procedures, the Akaike Information Criterion (AIC) of Akaike (1973, 1974), Mallows' $C_p$ criterion and the Bayesian Information Criterion (BIC) of Schwarz (1978) have been recognized as useful selection procedures, and their properties have been investigated. As shown in Nishii (1984) and the papers referred to therein, AIC and $C_p$ are good procedures in the sense of minimizing prediction errors, but they are not consistent in the sense of selecting the true variables. In contrast, BIC and its generalization GIC are consistent; namely, they have asymptotic properties different from those of AIC and $C_p$ (see Nishii (1984)). Since BIC is derived as an asymptotic approximation of a marginal distribution, the prior information is completely neglected in its derivation. In recent developments of Bayesian statistics and computation, the Bayes factor based on the full prior information has been used when prior information is available. As long as the prior distribution is proper, the Bayes factor is consistent, as seen in Fernandez et al. (2001) and Liang et al. (2008). The linear mixed model given in (1.1), however, has not received much attention in this literature. Vaida and Blanchard (2005) proposed a conditional AIC in the setting where $G(\psi)$ and $R(\psi)$ are known, so that the unknown parameters are only $\beta$ and $\sigma^2$. Although Vaida and Blanchard (2005) showed that their conditional AIC, denoted here by cAIC, is better than the marginal AIC, it does not seem that cAIC is superior to other methods such as BIC for large $N$ when the comparison is made in terms of the frequency with which the true variables are selected; see Table 1. Thus, it is desirable to find a consistent selection procedure which is superior in the sense of selecting the true variables for both large and small $N$.
The marginal likelihood and the Bayes factor are possible selection procedures. However, we face two difficulties in implementing them: one is how to set up the prior distribution of the nuisance parameters $\sigma^2, \psi_1, \ldots, \psi_d$, and the other is how to compute the multi-dimensional integrals required for the marginal distribution. To avoid these difficulties, in this paper we propose a new Bayesian variable selection procedure called the Empirical Bayes Information Criterion (EBIC). The basic idea is to use partial prior information on the parameters: the marginal distribution is approximated so that the non-subjective prior distribution of $\beta$, the parameter of interest, is incorporated, while the prior distribution of $(\sigma^2, \psi)$, the nuisance parameters, is neglected. In particular, we assume that $\beta$ has the prior distribution $\mathcal{N}_p(0, \sigma^2\lambda^{-1}W)$, where $\lambda$ is an unknown scalar hyper-parameter and $W$ is a $p \times p$ known matrix. This is a common prior used in inference on $\beta$; the prior with $W = N(X'X)^{-1}$ is called Zellner's g-prior, and other choices of $W$ are $W = \mathrm{diag}(N/x_{(1)}'x_{(1)}, \ldots, N/x_{(p)}'x_{(p)})$ and $W = I_p$. For an account of Zellner's g-prior, see Liang et al. (2008). Since $\lambda$ is unknown and estimated from the data $(y, X, Z)$, the prior distribution adjusts to the data. In this paper, we derive an explicit expression for EBIC and show that it is consistent in the sense of selecting the true variables.
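The three choices of $W$ just mentioned can be written down directly. The helper function below, including its name and option labels, is our own hypothetical convenience, not part of the paper:

```python
import numpy as np

def prior_scale_matrix(X, kind="g"):
    """Candidate W for the prior beta ~ N_p(0, sigma^2 lambda^{-1} W):
    'g'    : Zellner's g-prior, W = N (X'X)^{-1}
    'diag' : W = diag(N / x_(1)'x_(1), ..., N / x_(p)'x_(p))
    'id'   : W = I_p
    (function name and labels are our own)."""
    N, p = X.shape
    if kind == "g":
        return N * np.linalg.inv(X.T @ X)
    if kind == "diag":
        # N / x_(j)' x_(j) for each column x_(j) of X.
        return np.diag(N / np.einsum("ij,ij->j", X, X))
    return np.eye(p)

X_demo = np.random.default_rng(1).standard_normal((40, 3))
W_g = prior_scale_matrix(X_demo, "g")
W_d = prior_scale_matrix(X_demo, "diag")
W_i = prior_scale_matrix(X_demo, "id")
```

All three choices satisfy $W = O(1)$ as $N \to \infty$ when the columns of $X$ grow like $x_{(j)}'x_{(j)} = O(N)$, which is the normalization assumed later in Section 2.4.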


The organization of the paper is as follows. In Section 2.1, we briefly introduce the conventional procedures proposed in the literature, namely the marginal AIC (mAIC), two kinds of conditional AIC (cAIC and CAIC) and BIC. After giving a motivation for seeking a new procedure in Section 2.2, we explain the concept of EBIC in Section 2.3 and provide an explicit form of EBIC in LMM (1.1) in Section 2.4. The consistency of EBIC is proved in Section 3. In Section 4, EBIC in a nested error regression model is given, and the numerical performance of EBIC is investigated and compared with AIC, cAIC, CAIC and BIC in the sense of selecting the true variables. The simulation results show that the proposed EBIC improves not only on BIC for small and large sample sizes, but also on AIC, cAIC and CAIC for large sample sizes.

2. Empirical Bayes information criterion

2.1. Conventional methods

We begin by briefly describing some conventional methods for variable selection. To state the concepts of the selection procedures, let $\theta = (\sigma^2, \psi)$ and let $f(y\,|\,v, \beta, \theta)$ and $f(v\,|\,\theta)$ be the conditional density of $y$ given $v$ and the marginal density of $v$, respectively, where $y\,|\,v \sim \mathcal{N}(X\beta + Zv, \sigma^2 R(\psi))$ and $v \sim \mathcal{N}(0, \sigma^2 G(\psi))$. Then the marginal density of $y$ is written as $f_m(y\,|\,\beta, \theta) = \int f(y\,|\,v, \beta, \theta) f(v\,|\,\theta)\,dv$, which is the density of the marginal distribution $\mathcal{N}(X\beta, \sigma^2\Lambda(\psi))$.

[1] AIC. The AIC proposed by Akaike (1973, 1974) is based on the idea of choosing a model which minimizes an unbiased estimator of the expected Kullback-Leibler information. The expected Kullback-Leibler information based on the marginal distribution is

$$R(\beta, \theta; \hat\beta, \hat\theta) = E_y\Big[\int \log\Big\{\frac{f_m(y^*\,|\,\beta, \theta)}{f_m(y^*\,|\,\hat\beta(y), \hat\theta(y))}\Big\} f_m(y^*\,|\,\beta, \theta)\,dy^*\Big],$$

where $\hat\beta(y)$ and $\hat\theta(y)$ are estimators of $\beta$ and $\theta$. This yields the marginal Akaike Information (mAI)

$$\mathrm{mAI} = -2\int\!\!\int \log\{f_m(y^*\,|\,\hat\beta(y), \hat\theta(y))\}\, f_m(y^*\,|\,\beta, \theta) f_m(y\,|\,\beta, \theta)\,dy^*\,dy.$$

When $\beta$ and $\theta$ are estimated by the maximum likelihood estimators (MLE) $\hat\beta_M$ and $\hat\theta_M$, AIC is defined as an asymptotically unbiased estimator of mAI:

(2.1)  $\mathrm{AIC} = -2\log f_m(y\,|\,\hat\beta_M, \hat\theta_M) + 2(p + d + 1),$

for $d = \dim(\psi)$, the dimension of $\psi$. Sugiura (1978) and Hurvich and Tsai (1989) suggested using an exact unbiased estimator of mAI and showed that it is better than AIC in the sense of selecting the true variables in ordinary linear regression models. In the case of known $\psi$, the MLEs of $\beta$ and $\sigma^2$ are given by

(2.2)  $\hat\beta(\psi) = (X'\Lambda^{-1}X)^{-1}X'\Lambda^{-1}y,$
(2.3)  $\hat\sigma_M^2(\psi) = (y - X\hat\beta(\psi))'\Lambda^{-1}(y - X\hat\beta(\psi))/N,$
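A direct numerical transcription of (2.2) and (2.3), for known $\psi$, can look as follows (the function names are our own):

```python
import numpy as np

def beta_hat(X, y, Lam):
    """GLS estimator of (2.2): beta(psi) = (X' Lam^{-1} X)^{-1} X' Lam^{-1} y."""
    Li = np.linalg.inv(Lam)
    return np.linalg.solve(X.T @ Li @ X, X.T @ Li @ y)

def sigma2_hat(X, y, Lam):
    """MLE of (2.3): (y - X b)' Lam^{-1} (y - X b) / N."""
    r = y - X @ beta_hat(X, y, Lam)
    return float(r @ np.linalg.solve(Lam, r)) / len(y)

# Sanity check: with Lam = I_N the GLS estimator reduces to ordinary
# least squares.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(50)
b = beta_hat(X, y, np.eye(50))
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```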


for $\Lambda = \Lambda(\psi)$, and an exact bias-corrected AIC based on the marginal likelihood is

(2.4)  $\mathrm{mAIC}(\psi) = N[\log(2\pi\hat\sigma_M^2(\psi)) + 1] + \log|\Lambda(\psi)| + 2N(p+1)/(N - p - 2).$

In the case of unknown $\psi$, a consistent estimator $\hat\psi$ of $\psi$ is substituted into (2.4) to get $\mathrm{mAIC}(\hat\psi)$.

[2] Conditional AIC. In the case that one has an interest in the prediction of a specific random effect, Vaida and Blanchard (2005) considered the expected Kullback-Leibler information based on the conditional density, given by

$$R_c(\beta, \theta; \hat\beta, \hat\theta) = E_{y,v}\Big[\int \log\Big\{\frac{f(y^*\,|\,v, \beta, \theta)}{f(y^*\,|\,\hat{v}(y), \hat\beta(y), \hat\theta(y))}\Big\} f(y^*\,|\,v, \beta, \theta)\,dy^*\Big],$$

where $E_{y,v}[\cdot]$ denotes expectation with respect to the joint distribution of $(y, v)$ and $\hat{v}(y)$ is the empirical Bayes estimator of $v$. This gives the conditional Akaike Information (cAI) defined by

$$\mathrm{cAI} = -2\int\!\!\int\!\!\int \log\{f(y^*\,|\,\hat{v}(y), \hat\beta(y), \hat\theta(y))\}\, f(y^*\,|\,v, \beta, \theta) f(y\,|\,v, \beta, \theta) f(v\,|\,\theta)\,dy^*\,dy\,dv.$$

In the case of known $\psi$, the MLEs of $\beta$ and $\sigma^2$ are given in (2.2) and (2.3), respectively, and the empirical Bayes estimator of $v$ is

(2.5)  $\hat{v}(\psi) = G(\psi)Z'\Lambda^{-1}\{y - X\hat\beta(\psi)\}.$

Then, Vaida and Blanchard (2005) showed that an exact unbiased estimator of cAI is given by

(2.6)  $\mathrm{cAIC}(\psi) = -2\log f(y\,|\,\hat{v}(\psi), \hat\beta(\psi), \hat\sigma_M^2(\psi), \psi) + 2\dfrac{N(N-p-1)}{(N-p)(N-p-2)}(\rho(\psi) + 1) - \dfrac{N(p+1)}{(N-p)(N-p-2)},$

where $\rho(\psi) = \mathrm{tr}[ZGZ'\Lambda^{-1}] + \mathrm{tr}[(X'\Lambda^{-1}X)^{-1}X'\Lambda^{-2}X]$, which is called the effective degrees of freedom, and $f(y\,|\,v, \beta, \sigma^2, \psi)$ is the density function of the conditional distribution $y\,|\,v \sim \mathcal{N}(X\beta + Zv, \sigma^2 R(\psi))$. In the case of unknown $\psi$, Vaida and Blanchard (2005) suggested using $\mathrm{cAIC}(\hat\psi)$ for a consistent estimator $\hat\psi$ of $\psi$.

[3] Another conditional AIC. Srivastava and Kubokawa (2010) recently considered another expected Kullback-Leibler information given by

$$R_C(\beta, \theta, v; \hat\beta, \hat\theta) = E_y\Big[\int \log\Big\{\frac{f(y^*\,|\,v, \beta, \theta)}{f(y^*\,|\,\hat{v}(y), \hat\beta(y), \hat\theta(y))}\Big\} f(y^*\,|\,v, \beta, \theta)\,dy^* \,\Big|\, v\Big],$$


where $E_y[\cdot\,|\,v]$ denotes expectation with respect to the conditional distribution of $y$ given $v$. This gives the conditional Akaike information (CAI)

(2.7)  $\mathrm{CAI}(v) = -2\displaystyle\int\!\!\int \log\{f(y^*\,|\,\hat{v}(y), \hat\beta(y), \hat\theta(y))\}\, f(y^*\,|\,v, \beta, \sigma^2) f(y\,|\,v, \beta, \sigma^2)\,dy^*\,dy.$

For estimating $\sigma^2$, Srivastava and Kubokawa (2008) used the estimator $\hat\sigma_0^2 = (y - W\hat\gamma_0)'(y - W\hat\gamma_0)/N$, where $\hat\gamma_0 = (W'R^{-1}W)^{-1}W'R^{-1}y$ for $W = (X, Z)$. In the case of known $\psi$, Srivastava and Kubokawa (2010) showed that an unbiased estimator of $\mathrm{CAI}(v)$ is

(2.8)  $\mathrm{CAIC}(\psi) = -2\log f(y\,|\,\hat{v}(\psi), \hat\beta(\psi), \hat\sigma_0^2, \psi) + \dfrac{2N}{N-p-2}(\rho(\psi) + 1).$
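The effective degrees of freedom $\rho(\psi)$, defined after (2.6) and reused in the CAIC penalty (2.8), is a pair of traces and is direct to compute. A minimal sketch (function name our own); as a sanity check, with no random effects ($G = 0$) and $\Lambda = I_N$ it collapses to $p$, the usual regression degrees of freedom:

```python
import numpy as np

def rho(X, Z, G, Lam):
    """Effective degrees of freedom:
    rho(psi) = tr[Z G Z' Lam^{-1}] + tr[(X' Lam^{-1} X)^{-1} X' Lam^{-2} X]."""
    Li = np.linalg.inv(Lam)
    t1 = np.trace(Z @ G @ Z.T @ Li)
    t2 = np.trace(np.linalg.solve(X.T @ Li @ X, X.T @ Li @ Li @ X))
    return t1 + t2

# Hypothetical small design for the sanity check.
rng = np.random.default_rng(2)
N, p, M = 20, 3, 4
X = rng.standard_normal((N, p))
Z = rng.standard_normal((N, M))
```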

In the case of unknown $\psi$, a consistent estimator $\hat\psi$ is substituted into $\mathrm{CAIC}(\psi)$.

[4] Bayes factor and BIC. The Bayes factor and BIC are Bayesian variable selection procedures based on the marginal density function

$$f_\pi(y) = \int\!\!\int f_m(y\,|\,\beta, \theta)\pi(\beta, \theta)\,d\beta\,d\theta,$$

where $\pi(\beta, \theta)$ is the density of a prior distribution of $(\beta, \theta)$ for $\theta = (\sigma^2, \psi)$. The Bayes factor is the ratio of the marginal density functions of the candidate model and the full (or null) model. On the other hand, BIC, proposed by Schwarz (1978), is given by an asymptotic approximation of $-2\log\{f_\pi(y)\}$ as $-2\log\{f_\pi(y)\} = \mathrm{BIC} + o_p(\log(N))$, where

(2.9)  $\mathrm{BIC} = -2\log\{f_m(y\,|\,\hat\beta(\hat\psi_M), \hat\sigma_M^2(\hat\psi_M), \hat\psi_M)\} + (p + d + 1)\log(N),$

where $\hat\psi_M$ is the MLE of $\psi$. It is noted that BIC is free from the setup of a proper prior distribution.

2.2. A motivation

Before explaining EBIC, we give some comments on the conventional methods. As seen from (2.1) and (2.9), the distinction between AIC and BIC appears only in the penalty terms. However, AIC and BIC are derived through different paths, as described in the previous subsection, and they have different optimality properties. Namely, BIC is consistent for selecting the true variables, while AIC is not. On the other hand, AIC chooses models which give smaller prediction errors, while BIC does not possess such a property. In this paper, we compare the variable selection procedures in terms of the frequency of selecting the true variables. From this point of view, it is numerically shown that the exact bias-corrected information criteria mAIC, cAIC and CAIC have high frequencies of selecting the true variables for small $N$. However, these frequencies go down as $N$ gets larger, because these information criteria are not consistent. In contrast, the Bayes factor and BIC are consistent procedures for variable selection, and they have high frequencies for large $N$. This suggests that we may find a desirable selection procedure among the Bayesian selection methods.

The Bayes factor and BIC are well-known Bayesian methods for variable selection; the full prior information is used in the Bayes factor, but is neglected in BIC, because the prior information enters only asymptotically negligible terms in the derivation of BIC. Although BIC is consistent asymptotically, this does not necessarily mean that BIC is excellent in the sense of selecting the true variables in small sample sizes. In fact, as seen from Table 1, BIC is inferior to mAIC, cAIC and CAIC for small $N$. Why is BIC inferior for small $N$? One plausible reason is that BIC is far from the exact marginal distribution $f_\pi(y)$ in small sample sizes. This suggests that, instead of BIC, we use the Bayes factor or the marginal distribution $f_\pi(y)$. These Bayesian selection procedures are definitely worth investigating numerically. However, we face the issues of how to set up a full proper prior distribution of $(\beta, \theta)$ and how to compute the multi-dimensional integrals. Concerning the prior information, one may be reluctant to use a subjective prior for model selection, because subjective prior information can control the selection of variables. However, the setup of non-informative prior distributions involves another issue. For example, a usual non-informative prior of $\beta$ is improper, and in this case it is well known that the Bayes factor based on the improper prior does not work for selecting variables. To resolve this issue, Berger and Pericchi (1996) suggested the intrinsic Bayes factor based on the intrinsic prior distribution, and Casella et al. (2009) derived an exact expression of the intrinsic Bayes factor and showed its consistency in the ordinary linear regression model.
Taking the above remarks into account, we propose an empirical Bayes information criterion. The linear mixed model involves two sets of parameters, $\beta$ and $\theta$. The parameter $\beta$ is the vector of regression coefficients involved in selecting variables in $X$, and for this parameter of interest we assume a proper prior distribution with an unknown hyper-parameter. Since this prior distribution includes an unknown quantity, it is not completely subjective, and we shall use empirical Bayes arguments. On the other hand, the parameter $\theta$ includes variance components and correlation coefficients, and it may be hard to set up an appropriate proper prior distribution for it. For the nuisance parameter $\theta$, we shall apply the Laplace approximation to get a selection procedure which is free from the setup of a prior distribution of $\theta$. This is the idea of EBIC, and the details are given in the following subsections.

2.3. Derivation of EBIC

Assume that $(\beta, \theta)$ has a prior distribution of the form

$$(\beta, \theta) \sim \pi(\beta, \theta\,|\,\lambda) = \pi_1(\beta\,|\,\theta, \lambda)\pi_2(\theta),$$

where, given $\theta$, the regression parameter $\beta$ conditionally has the density $\pi_1(\beta\,|\,\theta, \lambda)$ with unknown hyper-parameter $\lambda$. It is noted that $\pi_1(\beta\,|\,\theta, \lambda)$ is not completely


subjective, since $\lambda$ is unknown. Then the marginal density with respect to the prior distribution is given by

$$f_\pi(y\,|\,\lambda) = \int\!\!\int f(y\,|\,\beta, \theta)\pi_1(\beta\,|\,\theta, \lambda)\,d\beta\,\pi_2(\theta)\,d\theta = \int m_1(y\,|\,\theta, \lambda)\pi_2(\theta)\,d\theta,$$

where $m_1(y\,|\,\theta, \lambda)$ is the conditional marginal density based on the partial prior distribution $\pi_1(\beta\,|\,\theta, \lambda)$ on $\beta$, given by

(2.10)  $m_1(y\,|\,\theta, \lambda) = \displaystyle\int f(y\,|\,\beta, \theta)\pi_1(\beta\,|\,\theta, \lambda)\,d\beta.$

Let $\hat\lambda$ be the estimator of $\lambda$ based on the conditional marginal distribution, given by

(2.11)  $\hat\lambda = \arg\max_\lambda\{m_1(y\,|\,\hat\theta, \lambda)\},$

where $\hat\theta$ is a consistent estimator of $\theta$ such that $\hat\theta = \hat\theta_M + O_p(N^{-1})$ for the MLE $\hat\theta_M$. Then, the Empirical Bayes Information Criterion (EBIC) is given by

(2.12)  $\mathrm{EBIC} = -2\log\{m_1(y\,|\,\hat\theta, \hat\lambda)\} + \dim(\theta)\log(N),$

where $\dim(\theta)$ is the dimension of $\theta$. It is noted that EBIC is based on the non-subjective prior distribution $\pi_1(\beta\,|\,\theta, \lambda)$ with unknown $\theta$ and $\lambda$, and it is free from the prior information $\pi_2(\theta)$ on $\theta$. Thus, we do not have to set up a prior distribution for $\theta$, which eases the computation of the criterion.

We now show that EBIC can be derived as an approximation of the marginal density $f_\pi(y\,|\,\lambda)$. Let $\hat\beta$ be a consistent estimator of $\beta$ such that $\hat\beta = \hat\beta_M + O_p(N^{-1})$ for the MLE $\hat\beta_M$. It is noted that $f_\pi(y\,|\,\lambda)$ can be exactly rewritten as

$$\int m_1(y\,|\,\theta, \lambda)\pi_2(\theta)\,d\theta = \int h(\theta; \hat\beta, \lambda)\exp\{\ell(\theta; \hat\beta, y)\}\,d\theta,$$

where $h(\theta; \hat\beta, \lambda) = m_1(y\,|\,\theta, \lambda)\pi_2(\theta)/f(y\,|\,\hat\beta, \theta)$ and $\ell(\theta; \hat\beta, y) = \log f(y\,|\,\hat\beta, \theta)$. Using the Taylor series expansion around $\theta = \hat\theta$ gives

$$h(\theta; \hat\beta, \lambda) = h(\hat\theta; \hat\beta, \lambda)\Big[1 + \frac{\{\nabla_\theta h(\hat\theta; \hat\beta, \lambda)\}'(\theta - \hat\theta)}{h(\hat\theta; \hat\beta, \lambda)} + O_p(N^{-1})\Big],$$
$$\ell(\theta; \hat\beta, y) = \ell(\hat\theta; \hat\beta, y) + \{\nabla_\theta \ell(\hat\theta; \hat\beta, y)\}'(\theta - \hat\theta) - \frac{1}{2}(\theta - \hat\theta)'J(\hat\theta; \hat\beta, y)(\theta - \hat\theta) + O_p(N^{-1/2}),$$

where $\nabla_\theta = \partial/\partial\theta$ and $J(\theta; \hat\beta, y) = -\nabla_\theta\nabla_\theta'\ell(\theta; \hat\beta, y)$. Since $\hat\beta = \hat\beta_M + O_p(N^{-1})$, it follows that

$$\ell(\theta; \hat\beta, y) = \ell(\hat\theta; \hat\beta, y) - \frac{1}{2}(\theta - \hat\theta)'J(\hat\theta; \hat\beta, y)(\theta - \hat\theta) + O_p(N^{-1/2}).$$


Then the Laplace method is applied to approximate $f_\pi(y\,|\,\lambda)$ as

$$f_\pi(y\,|\,\lambda) = h(\hat\theta; \hat\beta, \lambda)\exp\{\ell(\hat\theta; \hat\beta, y)\}\int\Big[1 + \frac{\{\nabla_\theta h(\hat\theta; \hat\beta, \lambda)\}'(\theta - \hat\theta)}{h(\hat\theta; \hat\beta, \lambda)} + O_p(N^{-1})\Big]\{1 + O_p(N^{-1/2})\}\exp\Big\{-\frac{1}{2}(\theta - \hat\theta)'J(\hat\theta; \hat\beta, y)(\theta - \hat\theta)\Big\}\,d\theta$$
$$= h(\hat\theta; \hat\beta, \lambda)\exp\{\ell(\hat\theta; \hat\beta, y)\}(2\pi)^{\dim(\theta)/2}|J(\hat\theta; \hat\beta, y)|^{-1/2}\{1 + O_p(N^{-1/2})\}.$$

Hence,

$$-2\log\{f_\pi(y\,|\,\lambda)\} = -2\log\{m_1(y\,|\,\hat\theta, \lambda)\} - \dim(\theta)\log(2\pi) + \log(|J(\hat\theta; \hat\beta, y)|) - 2\log(1 + O_p(N^{-1/2})) = -2\log\{m_1(y\,|\,\hat\theta, \lambda)\} + \dim(\theta)\log(N) + o_p(\log N).$$

Since $\lambda$ is unknown, we estimate it based on the marginal distribution $m_1(y\,|\,\hat\theta, \lambda)$ as given in (2.11). Hence, we get the empirical Bayes information criterion given in (2.12).

2.4. EBIC in linear mixed models

We now derive EBIC for estimators $\hat\sigma^2$ and $\hat\psi$ of $\sigma^2$ and $\psi$ close to the ML estimators. The MLE $\hat\psi_M = (\hat\psi_{M1}, \ldots, \hat\psi_{Md})'$ is given by minimizing

(2.13)  $-2\log f(y\,|\,\hat\beta(\psi), \hat\sigma_M^2(\psi), \psi) = N\log(2\pi\hat\sigma_M^2(\psi)) + \log|\Lambda(\psi)| + N,$

namely, it is the solution of the equation

(2.14)  $y'P(\hat\psi_M)\{\partial_i\Lambda(\hat\psi_M)\}P(\hat\psi_M)y = \hat\sigma_M^2(\hat\psi_M)\,\mathrm{tr}[\Lambda(\hat\psi_M)^{-1}\{\partial_i\Lambda(\hat\psi_M)\}],$

for $i = 1, \ldots, d$, where $\partial_i = \partial/\partial\psi_i$ and

(2.15)  $P(\psi) = \Lambda^{-1}(\psi) - \Lambda^{-1}(\psi)X(X'\Lambda^{-1}(\psi)X)^{-1}X'\Lambda^{-1}(\psi).$

Then, the EBIC will be derived under the following conditions:

(A1) The elements of $X$, $Z$ and $\Lambda(\psi)$ are uniformly bounded, and $X'\Lambda(\psi)^{-1}X = O(N)$ as $N \to \infty$;
(A2) $\Lambda(\psi)$ is positive definite and continuously differentiable with respect to $\psi$;
(A3) $\hat\psi$ and $\hat\sigma^2$ are consistent estimators of $\psi$ and $\sigma^2$ which satisfy $\hat\psi - \hat\psi_M = O_p(N^{-1})$ and $\hat\sigma^2 - \hat\sigma^2(\hat\psi_M) = O_p(N^{-1})$, where $\hat\sigma^2(\psi)$ is given in (2.3).

As a prior distribution of $\beta$, we consider a common distribution used in inference on $\beta$ in the ordinary linear regression model. Assume that, given $\sigma^2$, the conditional distribution of $\beta$ is the multivariate normal distribution

$$\pi_1(\beta\,|\,\sigma^2, \lambda) = \mathcal{N}_p(0, \sigma^2\lambda^{-1}W)$$


for an unknown scalar $\lambda$ and a $p \times p$ known matrix $W$. It is noted that the prior with $W = W_g = N(X'X)^{-1}$ is called Zellner's g-prior; see Liang et al. (2008) for example. Other choices of $W$ are $W = W_d = \mathrm{diag}(N/x_{(1)}'x_{(1)}, \ldots, N/x_{(p)}'x_{(p)})$, where $X = (x_{(1)}, \ldots, x_{(p)})$, and $W = I_p$. In this paper, it is assumed that $W = O(1)$ as $N \to \infty$ without any loss of generality. Then the marginal density $m_1(y\,|\,\sigma^2, \psi, \lambda)$ is expressed as

(2.16)  $m_1(y\,|\,\sigma^2, \psi, \lambda) = \displaystyle\int \mathcal{N}_N(X\beta, \sigma^2\Lambda(\psi))\,\mathcal{N}_p(0, \sigma^2\lambda^{-1}W)\,d\beta = \dfrac{1}{(2\pi\sigma^2)^{N/2}}\,\dfrac{1}{|\Lambda(\psi) + XWX'/\lambda|^{1/2}}\exp\Big\{-\dfrac{1}{2\sigma^2}y'Q(\psi, \lambda)y\Big\},$

where

(2.17)  $Q(\psi, \lambda) = \Lambda^{-1}(\psi) - \Lambda^{-1}(\psi)X\{X'\Lambda^{-1}(\psi)X + \lambda W^{-1}\}^{-1}X'\Lambda^{-1}(\psi).$

The parameters $\sigma^2$ and $\psi$ are estimated by $\hat\sigma^2$ and $\hat\psi$ under condition (A3), and the hyper-parameter $\lambda$ is estimated by $\hat\lambda$ through the maximization of $m_1(y\,|\,\hat\sigma^2, \hat\psi, \lambda)$ with respect to $\lambda$; namely, $\hat\lambda = \max(\lambda_0, 0)$, where $\lambda_0 = \lambda_0(\hat\sigma^2, \hat\psi)$ is the solution of the equation

(2.18)  $y'\hat\Lambda^{-1}X(X'\hat\Lambda^{-1}X + \lambda_0 W^{-1})^{-1}W^{-1}(X'\hat\Lambda^{-1}X + \lambda_0 W^{-1})^{-1}X'\hat\Lambda^{-1}y = \hat\sigma^2\,\mathrm{tr}[(X'\hat\Lambda^{-1}X + \lambda_0 W^{-1})^{-1}X'\hat\Lambda^{-1}X]/\lambda_0,$

where $\hat\Lambda = \Lambda(\hat\psi)$. As $N \to \infty$, the equation (2.18) converges to $\beta'W^{-1}\beta = p\sigma^2/\lambda_0$, so that we can use the estimator $\hat\lambda_0 = p\hat\sigma^2/\{\hat\beta(\hat\psi)'W^{-1}\hat\beta(\hat\psi)\}$ as an initial value of the iteration to compute the root of the equation.
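Equation (2.18) is a one-dimensional root-finding problem in $\lambda_0$. The sketch below solves it by a bracketed root search with SciPy; the bracketing interval and the use of `brentq` are our own implementation choices, since the paper specifies only the estimating equation, the initial value and $\hat\lambda = \max(\lambda_0, 0)$:

```python
import numpy as np
from scipy.optimize import brentq

def eq218_residual(lam0, X, y, Lam, W, sigma2):
    """Residual of the estimating equation (2.18), multiplied through by
    lambda_0 so that no term is divided by it."""
    Li = np.linalg.inv(Lam)
    A = X.T @ Li @ X                      # X' Lam^{-1} X
    s = X.T @ Li @ y                      # X' Lam^{-1} y
    Winv = np.linalg.inv(W)
    M = np.linalg.inv(A + lam0 * Winv)
    return lam0 * float(s @ M @ Winv @ M @ s) - sigma2 * np.trace(M @ A)

def lambda_hat(X, y, Lam, W, sigma2, bracket=(1e-8, 1e8)):
    """hat-lambda = max(lambda_0, 0); the bracket is our own choice."""
    try:
        lam0 = brentq(eq218_residual, *bracket, args=(X, y, Lam, W, sigma2))
    except ValueError:                    # no sign change: no positive root
        return 0.0
    return max(lam0, 0.0)

# Hypothetical strong-signal example in which a positive root exists.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 2))
y = X @ np.array([5.0, 5.0])
lam = lambda_hat(X, y, np.eye(20), np.eye(2), sigma2=1.0)
```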

The EBIC based on the estimators $\hat\sigma^2$ and $\hat\psi$ is then given by

(2.19)  $\mathrm{EBIC} = -2\log m_1(y\,|\,\hat\sigma^2, \hat\psi, \hat\lambda) + (d + 1)\log(N) = N\log(2\pi\hat\sigma^2) + \log(|\Lambda(\hat\psi) + \hat\lambda^{-1}XWX'|) + \dfrac{y'Q(\hat\psi, \hat\lambda)y}{\hat\sigma^2} + (d + 1)\log(N).$

This can also be rewritten as

(2.20)  $\mathrm{EBIC} = \mathrm{BIC} + \log(|I_p + \hat\lambda^{-1}X'\Lambda(\hat\psi)^{-1}XW|) - p\log(N) + \hat\beta(\hat\psi)'\{(X'\Lambda(\hat\psi)^{-1}X)^{-1} + \hat\lambda^{-1}W\}^{-1}\hat\beta(\hat\psi)/\hat\sigma^2,$

where BIC is given by

(2.21)  $\mathrm{BIC} = N\log(2\pi\hat\sigma^2) + \log|\Lambda(\hat\psi)| + y'P(\hat\psi)y/\hat\sigma^2 + (p + d + 1)\log(N).$

It is noted that EBIC has an explicit expression while incorporating the non-subjective prior distribution $\pi_1(\beta\,|\,\sigma^2, \lambda)$. It is also noted that statistical software packages are available for the computation of the ML and REML estimates of $\sigma^2$ and $\psi$.
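Given estimates $(\hat\sigma^2, \hat\psi, \hat\lambda)$, the criteria (2.19) and (2.21) are direct to compute, and the identity (2.20) provides a numerical self-check of the implementation. The following sketch uses function names and toy inputs of our own:

```python
import numpy as np

def P_mat(X, Lam):
    """P(psi) of (2.15)."""
    Li = np.linalg.inv(Lam)
    return Li - Li @ X @ np.linalg.solve(X.T @ Li @ X, X.T @ Li)

def Q_mat(X, Lam, W, lam):
    """Q(psi, lambda) of (2.17)."""
    Li = np.linalg.inv(Lam)
    inner = X.T @ Li @ X + lam * np.linalg.inv(W)
    return Li - Li @ X @ np.linalg.solve(inner, X.T @ Li)

def bic(X, y, Lam, sigma2, d):
    """BIC of (2.21); d = dim(psi)."""
    N, p = X.shape
    _, ld = np.linalg.slogdet(Lam)
    return (N * np.log(2 * np.pi * sigma2) + ld
            + float(y @ P_mat(X, Lam) @ y) / sigma2 + (p + d + 1) * np.log(N))

def ebic(X, y, Lam, W, sigma2, lam, d):
    """EBIC of (2.19)."""
    N = len(y)
    _, ld = np.linalg.slogdet(Lam + (X @ W @ X.T) / lam)
    return (N * np.log(2 * np.pi * sigma2) + ld
            + float(y @ Q_mat(X, Lam, W, lam) @ y) / sigma2
            + (d + 1) * np.log(N))

# Numerical check of the identity (2.20) on arbitrary toy inputs.
rng = np.random.default_rng(5)
N, p, d = 25, 3, 1
X = rng.standard_normal((N, p))
Z = rng.standard_normal((N, 4))
Lam = np.eye(N) + 0.5 * Z @ Z.T
y = X @ np.array([1.0, 0.0, -1.0]) + rng.standard_normal(N)
W, lam, sigma2 = np.eye(p), 2.0, 1.3

Li = np.linalg.inv(Lam)
A = X.T @ Li @ X
bh = np.linalg.solve(A, X.T @ Li @ y)            # beta_hat(psi) of (2.2)
_, ld = np.linalg.slogdet(np.eye(p) + (A @ W) / lam)
e_val = ebic(X, y, Lam, W, sigma2, lam, d)
check_val = (bic(X, y, Lam, sigma2, d) + ld - p * np.log(N)
             + float(bh @ np.linalg.solve(np.linalg.inv(A) + W / lam, bh))
             / sigma2)
```

Model selection then amounts to evaluating `ebic` for each candidate design matrix $X_\gamma$ and keeping the minimizer, as formalized in Section 3.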


3. Consistency of EBIC

We now show the consistency of EBIC given in (2.19). We regard the model (1.1) as the full model and consider the problem of selecting some of the variables $x_{(1)}, \ldots, x_{(p)}$ for $X = (x_{(1)}, \ldots, x_{(p)})$. A candidate model is indexed by $\gamma$, and the model $M_\gamma$ is described as

$$M_\gamma: \quad y = X_\gamma\beta_\gamma + Zv + \epsilon,$$

where $X_\gamma = (x_{(i_1)}, \ldots, x_{(i_{p_\gamma})})$ is an $N \times p_\gamma$ submatrix of $X$ and $\beta_\gamma$ is a $p_\gamma$-dimensional vector. Denote the set of $\gamma$ by $\Gamma$. It is assumed that the true model is within the class of models $\{M_\gamma;\, \gamma \in \Gamma\}$, and it is described as

$$M_T: \quad y = X_T\beta_T + Zv + \epsilon.$$

From (2.19), the EBIC under the model $M_\gamma$ is written as

$$\mathrm{EBIC}_\gamma = N\log(2\pi\hat\sigma_\gamma^2) + \log(|\hat\Lambda_\gamma|) + \log(|I_{p_\gamma} + \hat\lambda_\gamma^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma W_\gamma|) + y'\hat{Q}_\gamma y/\hat\sigma_\gamma^2 + (d + 1)\log(N),$$

where $\hat\Lambda_\gamma = \Lambda(\hat\psi_\gamma)$ and $\hat{Q}_\gamma = Q_\gamma(\hat\psi_\gamma, \hat\lambda_\gamma) = \hat\Lambda_\gamma^{-1} - \hat\Lambda_\gamma^{-1}X_\gamma\{X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma + \hat\lambda_\gamma W_\gamma^{-1}\}^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}$. The model minimizing $\mathrm{BIC}_\gamma$ or $\mathrm{EBIC}_\gamma$ over $\gamma \in \Gamma$ is denoted by $M_{\hat\gamma}$. The consistency of Bayesian procedures for variable selection in the ordinary linear regression model has been established by Fernandez et al. (2001), Liang et al. (2008) and Casella et al. (2009). We shall prove the consistency of $\mathrm{EBIC}_\gamma$ as a variable selection method in LMM.

Theorem 3.1. Assume the conditions (A1)–(A3) and $W_\gamma = O(1)$ as $N \to \infty$. Then, the variable selection procedure $\mathrm{EBIC}_\gamma$ is consistent; namely, when $y$ is distributed under $M_T$,

$$P[M_{\hat\gamma} \neq M_T] \to 0 \quad \text{as } N \to \infty,$$

where $M_{\hat\gamma}$ is the model minimizing $\mathrm{EBIC}_\gamma$ over $\gamma \in \Gamma$.

Proof. As estimators of $\sigma^2$ and $\psi$, we first handle the ML estimators given in (2.3) and (2.14). The ML estimators of $\sigma^2$ and $\psi$ under the model $M_\gamma$ are denoted by $\hat\sigma_\gamma^2(\hat\psi_\gamma) = y'P_\gamma(\hat\psi_\gamma)y/N$ and $\hat\psi_\gamma$ in this proof. It is noted that

$$P[M_{\hat\gamma} \neq M_T] = P[\cup_{\gamma\in\Gamma, \gamma\neq T}\{\mathrm{EBIC}_\gamma < \mathrm{EBIC}_T\}] \le \sum_{\gamma\in\Gamma, \gamma\neq T} P[\mathrm{EBIC}_\gamma < \mathrm{EBIC}_T].$$

Thus, we need to show that $\lim_{N\to\infty} P[\mathrm{EBIC}_\gamma < \mathrm{EBIC}_T] = 0$ for any $\gamma \in \Gamma$ such that $\gamma \neq T$. The case of $\gamma \neq T$ means either $M_T \not\subset M_\gamma$ or $M_T \subsetneq M_\gamma$.


In both cases, it suffices to show that $\Delta_\gamma \to \infty$ in probability as $N \to \infty$, where $\Delta_\gamma = \mathrm{EBIC}_\gamma - \mathrm{EBIC}_T$.

[1] Case of $M_T \not\subset M_\gamma$. In this case, it is noted that $P_\gamma(\hat\psi_\gamma)X_T \neq 0$, which suggests that $\hat\sigma_\gamma^2$ and $\hat\psi_\gamma$ are not consistent as $N \to \infty$. Let $\psi_\gamma^*$ be the solution of the equation

(3.1)  $\lim_{N\to\infty}\dfrac{1}{N}\{\sigma^2\mathrm{tr}[\Lambda P_\gamma^*\Lambda_{(i)}^* P_\gamma^*] + \beta_T'X_T'P_\gamma^*\Lambda_{(i)}^* P_\gamma^* X_T\beta_T\} = \lim_{N\to\infty}\dfrac{1}{N}\{\sigma^2\mathrm{tr}[\Lambda P_\gamma^*] + \beta_T'X_T'P_\gamma^* X_T\beta_T\} \times \lim_{N\to\infty}\dfrac{1}{N}\mathrm{tr}[(\Lambda^*)^{-1}\Lambda_{(i)}^*],$

for $i = 1, \ldots, d$, where $\Lambda^* = \Lambda(\psi_\gamma^*)$, $P_\gamma^* = P_\gamma(\psi_\gamma^*)$ and $\Lambda_{(i)}^* = \partial\Lambda(\psi_\gamma^*)/\partial\psi_{\gamma,i}^*$ for $\psi_\gamma^* = (\psi_{\gamma,1}^*, \ldots, \psi_{\gamma,d}^*)'$. This equation can be derived as a limiting value of the equation (2.14). In fact, Lemma 3.1 shows that $\hat\psi_\gamma$ converges to $\psi_\gamma^*$ in probability.

Since $\hat\psi_T$ is the MLE of $\psi$ under the true model $M_T$, $\hat\psi_T$ minimizes $N\log(2\pi\hat\sigma_T^2(\psi)) + \log|\Lambda(\psi)|$ from (2.13), so that

$$N\log(2\pi\hat\sigma_T^2(\hat\psi_T)) + \log|\Lambda(\hat\psi_T)| \le N\log(2\pi\hat\sigma_T^2(\hat\psi_\gamma)) + \log|\Lambda(\hat\psi_\gamma)|.$$

This inequality implies that $\Delta_\gamma = \mathrm{EBIC}_\gamma - \mathrm{EBIC}_T$ is evaluated as

$$\Delta_\gamma \ge N\log\frac{\hat\sigma_\gamma^2(\hat\psi_\gamma)}{\hat\sigma_T^2(\hat\psi_\gamma)} + \log\frac{|I_{p_\gamma} + \hat\lambda_\gamma^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma W_\gamma|}{|I_{p_T} + \hat\lambda_T^{-1}X_T'\hat\Lambda_T^{-1}X_T W_T|} + \frac{y'\hat{Q}_\gamma y}{\hat\sigma_\gamma^2(\hat\psi_\gamma)} - \frac{y'\hat{Q}_T y}{\hat\sigma_T^2(\hat\psi_T)}.$$

Since

$$(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma)^{-1} - (X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma + \hat\lambda_\gamma W_\gamma^{-1})^{-1} = \hat\lambda_\gamma(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma)^{-1}W_\gamma^{-1}(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma + \hat\lambda_\gamma W_\gamma^{-1})^{-1},$$

it is observed that

$$y'\hat{Q}_\gamma y - y'\hat{P}_\gamma y = \hat\lambda_\gamma y'\hat\Lambda_\gamma^{-1}X_\gamma(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma)^{-1}W_\gamma^{-1}(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma + \hat\lambda_\gamma W_\gamma^{-1})^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}y.$$

From the arguments around (2.18) and the condition $W_\gamma = O(1)$, it follows that $\hat\lambda_\gamma = O_p(1)$. From the conditions (A1)–(A3) and $W_\gamma = O(1)$, it follows that

$$\hat\lambda_\gamma y'\hat\Lambda_\gamma^{-1}X_\gamma(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma)^{-1}W_\gamma^{-1}(X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma + \hat\lambda_\gamma W_\gamma^{-1})^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}y = O_p(1).$$


Thus, $y'\hat{Q}_\gamma y = y'\hat{P}_\gamma y + O_p(1)$, namely $y'\hat{Q}_\gamma y/\hat\sigma_\gamma^2(\hat\psi_\gamma) = N + O_p(1)$. Similarly, it can be observed that $y'\hat{Q}_T y/\hat\sigma_T^2(\hat\psi_T) = N + O_p(1)$, which is used to see that

$$y'\hat{Q}_\gamma y/\hat\sigma_\gamma^2(\hat\psi_\gamma) - y'\hat{Q}_T y/\hat\sigma_T^2(\hat\psi_T) = O_p(1).$$

Since $W_\gamma = O(1)$ and $\hat\lambda_\gamma = O_p(1)$, it is observed that $\log(|I_{p_\gamma} + \hat\lambda_\gamma^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma W_\gamma|) = p_\gamma\log(N) + o_p(\log(N))$. Thus, we can see that

$$\log\frac{|I_{p_\gamma} + \hat\lambda_\gamma^{-1}X_\gamma'\hat\Lambda_\gamma^{-1}X_\gamma W_\gamma|}{|I_{p_T} + \hat\lambda_T^{-1}X_T'\hat\Lambda_T^{-1}X_T W_T|} = (p_\gamma - p_T)\log(N) + o_p(\log(N)),$$

so that

$$\Delta_\gamma \ge N\log\{\hat\sigma_\gamma^2(\hat\psi_\gamma)\} - N\log\{\hat\sigma_T^2(\hat\psi_\gamma)\} + (p_\gamma - p_T)\log(N) + o_p(\log(N)).$$

From Lemma 3.1 and the Taylor series expansion of $\log\{\hat\sigma_\gamma^2(\hat\psi_\gamma)/\hat\sigma_T^2(\hat\psi_\gamma)\}$, it follows that

$$\log\{\hat\sigma_\gamma^2(\hat\psi_\gamma)/\hat\sigma_T^2(\hat\psi_\gamma)\} = \log\{\hat\sigma_\gamma^2(\psi_\gamma^*)/\hat\sigma_T^2(\psi_\gamma^*)\} + o_p(1),$$

so that

(3.2)  $\Delta_\gamma \ge N[\log\{\hat\sigma_\gamma^2(\psi_\gamma^*)/\hat\sigma_T^2(\psi_\gamma^*)\} + o_p(1)].$

Letting $u = (y - X_T\beta_T)/\sigma$, we see that $u \sim \mathcal{N}(0, \Lambda)$ for $\Lambda = \Lambda(\psi)$. For an $N \times N$ matrix $A$, we have

$$E[y'Ay] = E[(\sigma u + X_T\beta_T)'A(\sigma u + X_T\beta_T)] = \sigma^2\mathrm{tr}[\Lambda A] + \beta_T'X_T'AX_T\beta_T.$$

Thus, $\hat\sigma_\gamma^2(\psi_\gamma^*)$ converges to $\lim_{N\to\infty}N^{-1}\{\sigma^2\mathrm{tr}[\Lambda P_\gamma^*] + \beta_T'X_T'P_\gamma^*X_T\beta_T\}$, while $\hat\sigma_T^2(\psi_\gamma^*)$ converges to $\lim_{N\to\infty}N^{-1}\{\sigma^2\mathrm{tr}[\Lambda P_T^*] + \beta_T'X_T'P_T^*X_T\beta_T\} = \lim_{N\to\infty}N^{-1}\sigma^2\mathrm{tr}[\Lambda P_T^*]$, since $X_T'P_T^* = 0$. From the assumptions (A1) and (A2), it is noted that $P_\gamma^* = (\Lambda^*)^{-1} + O(N^{-1}) = P_T^* + O(N^{-1})$ componentwise, and hence

$$\lim_{N\to\infty}N^{-1}\mathrm{tr}[\Lambda P_\gamma^*] = \lim_{N\to\infty}N^{-1}\mathrm{tr}[\Lambda(\Lambda^*)^{-1}] = \lim_{N\to\infty}N^{-1}\mathrm{tr}[\Lambda P_T^*].$$

This implies that $\hat\sigma_\gamma^2(\psi_\gamma^*)/\hat\sigma_T^2(\psi_\gamma^*) = 1 + c + O_p(N^{-1/2})$, where $c = \lim_{N\to\infty}\beta_T'X_T'P_\gamma^*X_T\beta_T/\{\sigma^2\mathrm{tr}[\Lambda(\Lambda^*)^{-1}]\}$ is a positive constant. Then from (3.2), $\Delta_\gamma$ is evaluated as $\Delta_\gamma \ge N\{\log(1 + c) + o_p(1)\}$, which means that $\Delta_\gamma \to \infty$ as $N \to \infty$.


[2] Case of $M_T \subsetneq M_\gamma$. In this case, we can set $X_\gamma = (X_T, X_2)$ without any loss of generality. Thus, $P_\gamma X_T = P_\gamma(X_T, X_2)(I, 0)' = 0$, which means that $\hat\sigma_\gamma^2$ and $\hat\psi_\gamma$ are consistent as $N \to \infty$. Also note that $p_T < p_\gamma$. The same arguments as in the proof of the case $M_T \not\subset M_\gamma$ can be used to show that

$$\Delta_\gamma = N\log\{\hat\sigma_\gamma^2(\hat\psi_\gamma)/\hat\sigma_T^2(\hat\psi_T)\} + (p_\gamma - p_T)\log(N) + o_p(\log(N)).$$

We here show that $N\log\{\hat\sigma_\gamma^2(\hat\psi_\gamma)/\hat\sigma_T^2(\hat\psi_T)\} = O_p(1)$, which, since $p_T < p_\gamma$, implies that $\Delta_\gamma \to \infty$ as $N \to \infty$.

Using the Taylor series expansions of $\log\hat\sigma_\gamma^2(\hat\psi_\gamma)$ and $\log\hat\sigma_T^2(\hat\psi_T)$ around $\hat\psi_\gamma = \psi$ and $\hat\psi_T = \psi$, respectively, we can observe that

$$N\log\frac{\hat\sigma_\gamma^2(\hat\psi_\gamma)}{\hat\sigma_T^2(\hat\psi_T)} = N\log\frac{\hat\sigma_\gamma^2(\psi)}{\hat\sigma_T^2(\psi)} + I(\psi) + O_p(1),$$

where

$$I(\psi) = N(\hat\psi_\gamma - \psi)'\frac{\partial\log(\hat\sigma_\gamma^2(\psi))}{\partial\psi} - N(\hat\psi_T - \psi)'\frac{\partial\log(\hat\sigma_T^2(\psi))}{\partial\psi}.$$

To evaluate the term $N\log\{\hat\sigma_\gamma^2(\psi)/\hat\sigma_T^2(\psi)\}$, note that $P_\gamma X_T = 0$, $P_\gamma\Lambda P_T = P_\gamma$ and $(P_T - P_\gamma)\Lambda P_\gamma = 0$ for $M_T \subsetneq M_\gamma$, and that $\Lambda^{1/2}(P_T - P_\gamma)\Lambda^{1/2}$ and $\Lambda^{1/2}P_\gamma\Lambda^{1/2}$ are idempotent matrices. Thus,

$$N\hat\sigma_T^2(\psi) = y'P_T y = y'(P_T - P_\gamma)y + y'P_\gamma y,$$

and $y'(P_T - P_\gamma)y \sim \sigma^2\chi^2_{p_\gamma - p_T}$ and $y'P_\gamma y \sim \sigma^2\chi^2_{N - p_\gamma}$ are independently distributed. Since $y'P_\gamma y/N \to \sigma^2$ as $N \to \infty$, it follows that

$$N\log\frac{\hat\sigma_T^2(\psi)}{\hat\sigma_\gamma^2(\psi)} = N\log\{1 + y'(P_T - P_\gamma)y/(y'P_\gamma y)\} = N\frac{y'(P_T - P_\gamma)y}{y'P_\gamma y} + O_p(N^{-1}) = \frac{y'(P_T - P_\gamma)y}{\sigma^2} + O_p(N^{-1}),$$

which is approximately distributed as $\chi^2_{p_\gamma - p_T}$. To complete the proof, we thus need to verify that $I(\psi) = O_p(1)$. The first term in $I(\psi)$ is written as $N\sum_{i=1}^d(\hat\psi_{\gamma,i} - \psi_i)\{\partial_i\hat\sigma_\gamma^2(\psi)\}/\hat\sigma_\gamma^2(\psi)$ for $\partial_i = \partial/\partial\psi_i$. Since $\hat\sigma_\gamma^2(\psi)$ converges to $\sigma^2$, the Taylor expansion gives $1/\hat\sigma_\gamma^2(\psi) = 1/\sigma^2 - \{\hat\sigma_\gamma^2(\psi) - \sigma^2\}/\sigma^4 + O_p(N^{-1})$. Noting that $\hat\psi_{\gamma,i} - \psi_i = O_p(N^{-1/2})$ from Lemma 3.2, and $\{\partial_i\hat\sigma_\gamma^2(\psi)\} - E[\{\partial_i\hat\sigma_\gamma^2(\psi)\}] = O_p(N^{-1/2})$, we can see that

$$N(\hat\psi_{\gamma,i} - \psi_i)\frac{\partial_i\hat\sigma_\gamma^2(\psi)}{\hat\sigma_\gamma^2(\psi)} = N(\hat\psi_{\gamma,i} - \psi_i)\Big[\frac{1}{\sigma^2} - \frac{\hat\sigma_\gamma^2(\psi) - \sigma^2}{\sigma^4} + O_p(N^{-1})\Big]\{E[\partial_i\hat\sigma_\gamma^2(\psi)] + (\partial_i\hat\sigma_\gamma^2(\psi) - E[\partial_i\hat\sigma_\gamma^2(\psi)])\} = N(\hat\psi_{\gamma,i} - \psi_i)E[\partial_i\hat\sigma_\gamma^2(\psi)]/\sigma^2 + O_p(1).$$


It is here noted that $E[\partial_i\hat\sigma_\gamma^2(\psi)]/\sigma^2 = E[u'\{\partial_iP_\gamma(\psi)\}u]/N = \mathrm{tr}[\Lambda\{\partial_iP_\gamma(\psi)\}]/N = -\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N + O(N^{-1})$. Thus, we obtain

$$N(\hat\psi_\gamma - \psi)'\frac{\partial\log(\hat\sigma_\gamma^2(\psi))}{\partial\psi} = -\sum_{i=1}^d N(\hat\psi_{\gamma,i} - \psi_i)\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N + O_p(1).$$

Similarly, we can show that

$$N(\hat\psi_T - \psi)'\frac{\partial\log(\hat\sigma_T^2(\psi))}{\partial\psi} = -\sum_{i=1}^d N(\hat\psi_{T,i} - \psi_i)\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N + O_p(1).$$

Combining these evaluations gives

$$I(\psi) = \sum_{i=1}^d N\{(\hat\psi_{T,i} - \psi_i) - (\hat\psi_{\gamma,i} - \psi_i)\}\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N + O_p(1).$$

From Lemma 3.2, it follows that $N\{(\hat\psi_{T,i} - \psi_i) - (\hat\psi_{\gamma,i} - \psi_i)\} = O_p(1)$, which implies that $I(\psi) = O_p(1)$. Therefore, the consistency of $\mathrm{EBIC}_\gamma$ based on the ML estimators $\hat\psi_\gamma$ and $\hat\sigma_\gamma^2(\hat\psi_\gamma)$ is shown.

Finally, we consider more general estimators $\hat\psi_\gamma$ and $\hat\sigma_\gamma^2 = \hat\sigma_\gamma^2(\hat\psi_\gamma)$ satisfying the assumption (A3), namely $\hat\psi_\gamma - \hat\psi_{\gamma,M} = O_p(N^{-1})$ and $\hat\sigma_\gamma^2 - \hat\sigma_{\gamma,M}^2 = O_p(N^{-1})$ for the ML estimators $\hat\psi_{\gamma,M}$ and $\hat\sigma_{\gamma,M}^2$ given in (2.14) and (2.3) with $X = X_\gamma$. Let us denote the EBICs based on $(\hat\psi_\gamma, \hat\sigma_\gamma^2)$ and on the ML estimators $(\hat\psi_{\gamma,M}, \hat\sigma_{\gamma,M}^2)$ by $\mathrm{EBIC}_\gamma(\hat\psi_\gamma, \hat\sigma_\gamma^2)$ and $\mathrm{EBIC}_\gamma(\hat\psi_{\gamma,M}, \hat\sigma_{\gamma,M}^2)$, respectively. From the Taylor expansion and the assumption (A3), it follows that

(3.3)  $\mathrm{EBIC}_\gamma(\hat\psi_\gamma, \hat\sigma_\gamma^2) = \mathrm{EBIC}_\gamma(\hat\psi_{\gamma,M}, \hat\sigma_{\gamma,M}^2) + O_p(1).$

As stated above, the order necessary for the proof of the consistency is $O_p(N)$ in the case of $M_T \not\subset M_\gamma$ and $O_p(\log(N))$ in the case of $M_T \subsetneq M_\gamma$. Thus, the term of $O_p(1)$ in (3.3) is negligible. Hence, the consistency of $\mathrm{EBIC}_\gamma(\hat\psi_\gamma, \hat\sigma_\gamma^2)$ follows from the consistency of $\mathrm{EBIC}_\gamma(\hat\psi_{\gamma,M}, \hat\sigma_{\gamma,M}^2)$. Therefore, the proof of Theorem 3.1 is complete.

Lemma 3.1. Assume the conditions (A1) and (A2), and consider the case of $M_T \not\subset M_\gamma$. Let $\hat\psi_\gamma$ be the ML estimator defined in (2.14) for $X = X_\gamma$, and let $\psi_\gamma^*$ be the solution of the equation (3.1). Then, $\hat\psi_\gamma$ converges to $\psi_\gamma^*$ in probability as $N \to \infty$.

Proof. From (2.14), $\hat\psi_\gamma = (\hat\psi_{\gamma,1}, \ldots, \hat\psi_{\gamma,d})'$ is the solution of $B_{\gamma,i}(\hat\psi_\gamma) = 0$ for $i = 1, \ldots, d$, where

(3.4)  $B_{\gamma,i}(\psi) = y'P_\gamma(\psi)\{\partial_i\Lambda(\psi)\}P_\gamma(\psi)y - \hat\sigma_\gamma^2(\psi)\,\mathrm{tr}[\Lambda(\psi)^{-1}\{\partial_i\Lambda(\psi)\}].$

EBIC IN LINEAR MIXED MODELS

125

 around ψ ∗ , Using the Taylor series expansion of the equation with respect to ψ γ γ   ) = Bγ,i (ψ ∗ ) + d (∂j Bγ,i (ψ ∗ ))(ψγ,j − ψ ∗ ) + Op (1), it follows that 0 = Bγ,i (ψ γ γ γ γ,j j=1 so that  − ψ ∗ = −{Bγ (ψ ∗ )}−1 (Bγ,1 (ψ ∗ ), . . . , Bγ,d (ψ ∗ )) + Op (N −1 ), ψ γ γ γ γ γ where Bγ (ψ ∗γ ) is the d × d matrix with the (i, j)-th element ∂j Bγ,i (ψ ∗γ ). Then, ∂j Bγ,i (ψ ∗γ ) = y  ∂j {Pγ∗ Λ∗(i) Pγ∗ }y − σ ˆγ2 (ψ ∗γ )∂j {tr[(Λ∗ )−1 Λ∗(i) ]} ˆγ2 (ψ ∗γ )} tr[(Λ∗ )−1 Λ∗(i) ], − {∂j σ

(3.5)

which has the order Op (N ). Since there exists a positive definite matrix Bγ∗ such  − ψ ∗ is expressed as that Bγ (ψ ∗γ )/N converges to Bγ∗ , ψ γ γ (3.6)

 − ψ ∗ = −N −1 (B ∗ )−1 (Bγ,1 (ψ ∗ ), . . . , Bγ,d (ψ ∗ )) + Op (N −1 ). ψ γ γ γ γ γ

Here, Bγ,i (ψ ∗γ ) is expressed as Bγ,i (ψ ∗γ ) = (σ u + XT β T ) Pγ∗ Λ∗(i) Pγ∗ (σ u + XT β T ) − (σ u + XT β T ) Pγ∗ (σ u + XT β T ) tr[(Λ∗ )−1 Λ∗(i) ]/N, for u = (y − XT β T )/σ. From the definition (3.1) of ψ ∗γ , it is seen that limN →∞ E[Bγ,i (ψ ∗γ )]/N = 0, namely, E[Bγ,i (ψ ∗γ )] = o(N ). Then, Bγ,i (ψ ∗γ ) can be expressed as Bγ,i (ψ ∗γ ) = {Bγ,i (ψ ∗γ ) − E[Bγ,i (ψ ∗γ )]} + E[Bγ,i (ψ ∗γ )] = σ 2 tr[Pγ∗ Λ∗(i) Pγ∗ {uu  − Λ}] − σ 2 tr[Pγ∗ {uu  − Λ}] tr[(Λ∗ )−1 Λ∗(i) ]/N + o(N ) + 2σ u  Pγ∗ Λ∗(i) Pγ∗ XT β T − 2σ u  Pγ∗ XT β T tr[(Λ∗ )−1 Λ∗(i) ]/N + o(N ). Using the equality (3.7)

E[u  Cuu  Du ] = 2 tr[ΛC ΛD ] + tr[ΛC ] tr[ΛD ]

for N × N matrices C and D , we can verify that E[{Bγ,i (ψ ∗γ ) − E[Bγ,i (ψ ∗γ )]} · {Bγ,j (ψ ∗γ ) − E[Bγ,j (ψ ∗γ )]}] = O(N ) for i = 1, . . . , d, j = 1, . . . , d. Thus,  − E[Bγ,i (ψ ∗γ )Bγ,j (ψ ∗γ )] = o(N 2 ). Hence from (3.6), it follows that E[(ψ γ ∗   ∗ ∗   ψ γ ) (ψ γ − ψ γ )] = o(1), which means that ψ γ − ψ γ = op (1). Therefore, ψ γ converges to ψ γ in probability.  Lemma 3.2. Assume the conditions (A1) and (A2), and consider the case  be the ML estimator defined in (2.14) for X = Xγ . Then, of MT  Mγ . Let ψ γ  − ψ) − (ψ  − ψ)} = Op (1) and ψ  − ψ = Op (N −1/2 ). N {(ψ γ T γ
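Before turning to the proof of Lemma 3.2, we note that the Gaussian quadratic-form identity (3.7), used repeatedly above and below, is easy to verify by Monte Carlo simulation. The following sketch is ours (not part of the paper); the matrices are arbitrary fixed choices, and it assumes $u \sim N(0, \Lambda)$ with symmetric $C$ and $D$, as in the paper's applications of the identity:

```python
import numpy as np

rng = np.random.default_rng(42)

# fixed positive definite Lambda and symmetric C, D (illustrative values)
Lam = np.array([[2.0, 1.0], [1.0, 2.0]])
C = np.array([[1.0, 0.0], [0.0, 2.0]])
D = np.array([[3.0, 1.0], [1.0, 0.0]])

# draw u ~ N(0, Lambda) and estimate E[u'Cu * u'Du] by Monte Carlo
L = np.linalg.cholesky(Lam)
u = rng.normal(size=(200_000, 2)) @ L.T
mc = np.mean(np.einsum('ij,jk,ik->i', u, C, u) *
             np.einsum('ij,jk,ik->i', u, D, u))

# identity (3.7): E[u'Cu u'Du] = 2 tr(Lam C Lam D) + tr(Lam C) tr(Lam D)
exact = 2 * np.trace(Lam @ C @ Lam @ D) + np.trace(Lam @ C) * np.trace(Lam @ D)
print(round(exact, 1))  # 108.0; mc agrees to within Monte Carlo error
```

With these values the right-hand side is $2\cdot 30 + 6\cdot 8 = 108$, and the sample mean matches it up to Monte Carlo error.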

126

TATSUYA KUBOKAWA AND MUNI S. SRIVASTAVA

Proof. Since $M_T \subseteq M_\gamma$, it is noted that $P_\gamma X_T = 0$. For $B_{\gamma,i}(\psi)$ defined in (3.4), from (3.5), we observe that

\[
\begin{aligned}
\frac{\partial_j B_{\gamma,i}(\psi)}{\sigma^2}
&= u^\top\partial_j\{P_\gamma\Lambda_{(i)}P_\gamma\}u - \frac{u^\top P_\gamma u}{N}\,\partial_j\{\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]\} - \frac{u^\top(\partial_j P_\gamma)u}{N}\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]\\
&= \mathrm{tr}[\Lambda\,\partial_j\{P_\gamma\Lambda_{(i)}P_\gamma\}] - \partial_j\{\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]\} - \mathrm{tr}[\Lambda\{\partial_j P_\gamma\}]\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N + O_p(N^{1/2})\\
&= C_{ij} + O(N^{1/2}),
\end{aligned}
\]

where $C_{ij} = -\mathrm{tr}[\Lambda^{-1}\Lambda_{(j)}\Lambda^{-1}\Lambda_{(i)}] + \mathrm{tr}[\Lambda^{-1}\Lambda_{(j)}]\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N$. Let $C$ be the $d\times d$ matrix with the $(i,j)$-th element $C_{ij}$. From (3.6), it follows that

\[
\widehat\psi_\gamma - \psi = -C^{-1}(B_{\gamma,1}(\psi),\ldots,B_{\gamma,d}(\psi))^\top + O_p(N^{-1}).
\]

Noting that $C$ does not depend on $\gamma$, we see that

\[
\widehat\psi_T - \psi = -C^{-1}(B_{T,1}(\psi),\ldots,B_{T,d}(\psi))^\top + O_p(N^{-1}),
\]

so that $N\{(\widehat\psi_\gamma - \psi) - (\widehat\psi_T - \psi)\} = -NC^{-1}(B_{\gamma,1}(\psi) - B_{T,1}(\psi),\ldots,B_{\gamma,d}(\psi) - B_{T,d}(\psi))^\top + O_p(1)$. Since $NC^{-1} = O(1)$, it is sufficient to show that $B_{\gamma,i}(\psi) - B_{T,i}(\psi) = O_p(1)$ for $i = 1,\ldots,d$. From (3.4), this difference is written as

\[
\frac{B_{\gamma,i}(\psi) - B_{T,i}(\psi)}{\sigma^2} = u^\top Q_2 u - u^\top Q_1 u\,\frac{\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]}{N},
\]

where $Q_1 = P_\gamma - P_T$ and $Q_2 = P_\gamma\Lambda_{(i)}P_\gamma - P_T\Lambda_{(i)}P_T$. Using the equality (3.7), we can evaluate $E[\{B_{\gamma,i}(\psi) - B_{T,i}(\psi)\}^2/\sigma^4]$ as

\[
2\,\mathrm{tr}[\Lambda Q_2\Lambda Q_2] + \{\mathrm{tr}[\Lambda Q_2]\}^2 + \bigl\{2\,\mathrm{tr}[\Lambda Q_1\Lambda Q_1] + \{\mathrm{tr}[\Lambda Q_1]\}^2\bigr\}\{\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N\}^2 - 2\bigl\{2\,\mathrm{tr}[\Lambda Q_1\Lambda Q_2] + \mathrm{tr}[\Lambda Q_1]\,\mathrm{tr}[\Lambda Q_2]\bigr\}\,\mathrm{tr}[\Lambda^{-1}\Lambda_{(i)}]/N.
\]

Let $A_{11} = X_T^\top\Lambda^{-1}X_T$, $A_{12} = X_T^\top\Lambda^{-1}X_2$, $A_{21} = X_2^\top\Lambda^{-1}X_T$, $A_{22} = X_2^\top\Lambda^{-1}X_2$ and

\[
V = X_2^\top\Lambda^{-1}X_2 - X_2^\top\Lambda^{-1}X_T(X_T^\top\Lambda^{-1}X_T)^{-1}X_T^\top\Lambda^{-1}X_2 = A_{22} - A_{21}A_{11}^{-1}A_{12}.
\]

Then, with $X_\gamma = (X_T, X_2)$,

\[
(X_\gamma^\top\Lambda^{-1}X_\gamma)^{-1} = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix}^{-1}
= \begin{pmatrix} A_{11}^{-1} & 0\\ 0 & 0\end{pmatrix} + \begin{pmatrix} -A_{11}^{-1}A_{12}\\ I\end{pmatrix} V^{-1}\begin{pmatrix} -A_{21}A_{11}^{-1} & I\end{pmatrix},
\]

from Corollary 1.4.2 of Srivastava and Khatri (1979). Hence,

\[
(3.8)\qquad
\begin{aligned}
P_\gamma &= \Lambda^{-1} - \Lambda^{-1}X_\gamma(X_\gamma^\top\Lambda^{-1}X_\gamma)^{-1}X_\gamma^\top\Lambda^{-1}\\
&= \Lambda^{-1} - \Lambda^{-1}X_T A_{11}^{-1}X_T^\top\Lambda^{-1} - \Lambda^{-1}X_\gamma\begin{pmatrix} -A_{11}^{-1}A_{12}\\ I\end{pmatrix} V^{-1}\begin{pmatrix} -A_{21}A_{11}^{-1} & I\end{pmatrix}X_\gamma^\top\Lambda^{-1}\\
&= P_T - \Lambda^{-1}(I - X_T A_{11}^{-1}X_T^\top\Lambda^{-1})X_2 V^{-1}X_2^\top(I - \Lambda^{-1}X_T A_{11}^{-1}X_T^\top)\Lambda^{-1}\\
&= P_T - P_T X_2 V^{-1}X_2^\top P_T.
\end{aligned}
\]
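The reduction (3.8) is purely algebraic and can be verified numerically. The following sketch is ours (the dimensions and random matrices are arbitrary choices); it checks $P_\gamma = P_T - P_T X_2 V^{-1} X_2^\top P_T$, using the fact that $V = A_{22} - A_{21}A_{11}^{-1}A_{12} = X_2^\top P_T X_2$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, pT, p2 = 12, 2, 3                      # arbitrary sizes for the check

A = rng.normal(size=(N, N))
Lam = A @ A.T + N * np.eye(N)             # a positive definite Lambda
XT = rng.normal(size=(N, pT))             # regressors of the true model M_T
X2 = rng.normal(size=(N, p2))             # extra regressors of M_gamma
Xg = np.hstack([XT, X2])                  # X_gamma = (X_T, X_2)

Li = np.linalg.inv(Lam)
def P(X):
    """P(X) = Lam^{-1} - Lam^{-1} X (X' Lam^{-1} X)^{-1} X' Lam^{-1}."""
    return Li - Li @ X @ np.linalg.solve(X.T @ Li @ X, X.T @ Li)

PT, Pg = P(XT), P(Xg)
V = X2.T @ PT @ X2                        # equals A22 - A21 A11^{-1} A12
print(np.allclose(Pg, PT - PT @ X2 @ np.linalg.solve(V, X2.T) @ PT))  # True
```

The same run also confirms $P_\gamma X_\gamma = 0$, the property used at the start of the proof.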

Thus, with $G = X_2 V^{-1}X_2^\top$,

\[
Q_1 = -P_T G P_T,\qquad
Q_2 = -P_T G P_T\Lambda_{(i)}P_T - P_T\Lambda_{(i)}P_T G P_T + P_T G P_T\Lambda_{(i)}P_T G P_T.
\]

It is easy to see that $\mathrm{tr}[\Lambda Q_1] = -\mathrm{tr}[\Lambda P_T G P_T] = -\mathrm{tr}[V^{-1}X_2^\top P_T\Lambda P_T X_2] = O(1)$, since $V = O(N)$. Using such arguments, we can see that $E[\{B_{\gamma,i}(\psi) - B_{T,i}(\psi)\}^2] = O(1)$, which means that $B_{\gamma,i}(\psi) - B_{T,i}(\psi) = O_p(1)$. Hence, $N\{(\widehat\psi_\gamma - \psi) - (\widehat\psi_T - \psi)\} = O_p(1)$. From this result and the fact that $\widehat\psi_T - \psi = O_p(N^{-1/2})$, it follows that $\widehat\psi_\gamma - \psi = O_p(N^{-1/2})$. Therefore, Lemma 3.2 is proved. □

4. EBIC in a specific model and simulation studies

4.1. EBIC in a nested error regression model

In this section, we treat the nested error regression model (NERM) as a simple but useful example, which is described as

(4.1)

\[
y_{ij} = x_{ij}^\top\beta + v_i + \varepsilon_{ij},\qquad i = 1,\ldots,k,\ j = 1,\ldots,n_i,
\]

where $v_i \sim N(0, \sigma^2\psi)$ and $\varepsilon_{ij} \sim N(0, \sigma^2)$. Letting $y_i = (y_{i1},\ldots,y_{i,n_i})^\top$, $X_i = (x_{i1},\ldots,x_{i,n_i})^\top$, $\varepsilon_i = (\varepsilon_{i1},\ldots,\varepsilon_{i,n_i})^\top$ and $j_{n_i} = (1,\ldots,1)^\top \in R^{n_i}$, we can express the NERM in matrix form as

\[
y_i = X_i\beta + j_{n_i}v_i + \varepsilon_i,\qquad i = 1,\ldots,k,
\]

which is a special case of the model (1.1). It is assumed that the $n_i$'s are uniformly bounded and that $k \to \infty$, which implies that $N = \sum_{i=1}^k n_i \to \infty$. Let $\bar y_i = n_i^{-1}\sum_{j=1}^{n_i} y_{ij}$, $\bar x_i = n_i^{-1}\sum_{j=1}^{n_i} x_{ij}$ and $\gamma_i = \gamma_i(\psi) = 1/(1 + n_i\psi)$. Let $C_1 = \sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij} - \bar x_i)(x_{ij} - \bar x_i)^\top$, $C_2(\psi) = \sum_{i=1}^k n_i\gamma_i(\psi)\bar x_i\bar x_i^\top$ and $C(\psi) = C_1 + C_2(\psi)$. The ML estimators of $\beta$ and $\sigma^2$ are given by

\[
\widehat\beta(\widehat\psi_M) = C(\widehat\psi_M)^{-1}\Bigl(\sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij} - \bar x_i)(y_{ij} - \bar y_i) + \sum_{i=1}^k n_i\gamma_i(\widehat\psi_M)\bar x_i\bar y_i\Bigr),
\]

\[
\hat\sigma^2_M(\widehat\psi_M) = \frac{1}{N}\Bigl(\sum_{i=1}^k\sum_{j=1}^{n_i}\{(y_{ij} - \bar y_i) - (x_{ij} - \bar x_i)^\top\widehat\beta(\widehat\psi_M)\}^2 + \sum_{i=1}^k n_i\gamma_i(\widehat\psi_M)\{\bar y_i - \bar x_i^\top\widehat\beta(\widehat\psi_M)\}^2\Bigr),
\]

where $\widehat\psi_M$ is the ML estimator of $\psi$, given as the solution of the equation

\[
\sum_{i=1}^k\{n_i\gamma_i(\widehat\psi_M)\}^2\{\bar y_i - \bar x_i^\top\widehat\beta(\widehat\psi_M)\}^2 = \hat\sigma^2_M(\widehat\psi_M)\sum_{i=1}^k n_i\gamma_i(\widehat\psi_M).
\]
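To illustrate how these closed forms are used, the following sketch (ours, not the authors' code; the data sizes, true parameter values, and the grid search over $\psi$ are arbitrary choices) computes $\widehat\beta(\psi)$, $\hat\sigma^2(\psi)$ and a profile-likelihood estimate of $\psi$ on simulated NERM data with a balanced design $n_i = n$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small NERM y_ij = x_ij' beta + v_i + e_ij; the sizes and true
# parameter values below are illustrative choices, not the paper's settings.
k, n, p = 15, 4, 2
N = k * n
beta_true = np.array([2.0, -1.0])
psi_true = 0.5                              # Var(v_i) / Var(e_ij), with sigma2 = 1
groups = np.repeat(np.arange(k), n)
X = rng.normal(size=(N, p))
y = X @ beta_true + np.sqrt(psi_true) * rng.normal(size=k)[groups] + rng.normal(size=N)

def profile_ml(psi):
    """beta_hat(psi), sigma2_hat(psi) and the profile log-likelihood at psi,
    using the closed forms of Section 4.1 (balanced case n_i = n)."""
    gamma = 1.0 / (1.0 + n * psi)           # gamma_i(psi), equal across groups
    C = np.zeros((p, p)); b = np.zeros(p); parts = []
    for i in range(k):
        Xi, yi = X[groups == i], y[groups == i]
        xbar, ybar = Xi.mean(axis=0), yi.mean()
        Xc, yc = Xi - xbar, yi - ybar
        C += Xc.T @ Xc + n * gamma * np.outer(xbar, xbar)      # C1 + C2(psi)
        b += Xc.T @ yc + n * gamma * xbar * ybar
        parts.append((Xc, yc, xbar, ybar))
    beta = np.linalg.solve(C, b)
    rss = sum(np.sum((yc - Xc @ beta) ** 2) + n * gamma * (ybar - xbar @ beta) ** 2
              for Xc, yc, xbar, ybar in parts)
    sigma2 = rss / N
    # profile log-likelihood (log|Lambda| = -sum_i log gamma_i for the NERM)
    loglik = -0.5 * N * np.log(2 * np.pi * sigma2) + 0.5 * k * np.log(gamma) - 0.5 * N
    return beta, sigma2, loglik

psi_hat = max(np.linspace(0.0, 3.0, 301), key=lambda t: profile_ml(t)[2])
beta_hat, sigma2_hat, _ = profile_ml(psi_hat)
print(psi_hat, sigma2_hat, beta_hat)
```

Here $\widehat\psi$ is found by a grid search over the profile log-likelihood rather than by solving the score equation displayed above; both target the same maximum.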

Then, the BIC and the EBIC given in (2.21) and (2.20), respectively, are expressed as

\[
(4.2)\qquad \mathrm{BIC}_M = N\log(2\pi\hat\sigma^2_M(\widehat\psi_M)) + N - \sum_{i=1}^k\log\gamma_i(\widehat\psi_M) + (p+2)\log N,
\]

\[
(4.3)\qquad \mathrm{EBIC}_M = \mathrm{BIC}_M + \log(|I_p + \hat\lambda^{-1}X^\top\Lambda^{-1}(\widehat\psi_M)XW|) - p\log N + \widehat\beta(\widehat\psi_M)^\top\{(X^\top\Lambda(\widehat\psi_M)^{-1}X)^{-1} + \hat\lambda^{-1}W\}^{-1}\widehat\beta(\widehat\psi_M)/\hat\sigma^2_M(\widehat\psi_M).
\]
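For concreteness, (4.2) translates directly into code. The sketch below is our illustration (the function name and the plugged-in numbers are invented, not taken from the paper); it evaluates $\mathrm{BIC}_M$ for a hypothetical fit with $k = 6$ groups of size $n_i = 4$:

```python
import numpy as np

def bic_nerm(sigma2_hat, gamma_hat, N, p):
    """BIC_M of (4.2): N log(2 pi sigma2) + N - sum_i log gamma_i + (p + 2) log N."""
    return (N * np.log(2 * np.pi * sigma2_hat) + N
            - np.sum(np.log(gamma_hat)) + (p + 2) * np.log(N))

# hypothetical fitted values: sigma2_hat = 1, psi_hat = 0.5, n_i = 4, k = 6, p = 3
gamma_hat = np.full(6, 1.0 / (1.0 + 4 * 0.5))    # gamma_i = 1 / (1 + n_i psi_hat)
print(round(bic_nerm(1.0, gamma_hat, 24, 3), 2))  # 90.59
```

In a model-selection run one would compute this value for each candidate set of regressors and pick the minimizer; (4.3) adds the prior-dependent correction terms to the same quantity.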

4.2. Simulation studies

We now investigate the numerical performance of the information criteria described in Sections 2.1 and 2.4 in the NERM through simulation experiments, in terms of the frequencies of selecting the true variables. The criteria we examine are the marginal AIC, the conditional AICs, the BIC and the empirical BIC, which are denoted by mAIC, cAIC, CAIC, BIC and EBIC, respectively. As the matrix $W$ in the EBIC, we here use the diagonal matrix $W = \mathrm{diag}(N/x_{(1)}^\top x_{(1)},\ldots,N/x_{(p)}^\top x_{(p)})$ for $X = (x_{(1)},\ldots,x_{(p)})$.

In the simulation experiments, we consider the two cases $(k = 6,\ n_1 = \cdots = n_6 = 4,\ N = 24)$ and $(k = 15,\ n_1 = \cdots = n_{15} = 4,\ N = 60)$, corresponding to relatively small and large sample sizes, respectively. The dimension of the full model is $p = 7$. For the $N\times p$ matrix $X$ of the regressor variables in the models (1.1) and (4.1), the row vectors $x_1,\ldots,x_N$ of $X = (x_1,\ldots,x_N)^\top$ are generated as mutually independent random variables distributed as $N_p(0, \Sigma_x)$, where $\Sigma_x = (1-\rho_x)I_p + \rho_x J_p$ for $J_p = j_p j_p^\top$ and $\rho_x = 0.3$. In this experiment, we assume that the true model is given by

\[
(p^*)\qquad y = X\beta^* + \mathrm{block\ diag}(j_{n_1},\ldots,j_{n_k})v + \varepsilon,
\]

where $1 \le p^* \le 7$, $\beta^* = (\beta_1,\ldots,\beta_{p^*},0,\ldots,0)^\top$, and $v$ and $\varepsilon$ are mutually independent random variables having $v \sim N_k(0, \sigma^2\psi I_k)$ and $\varepsilon \sim N_N(0, \sigma^2 I_N)$


Table 1. Frequencies selected by the five criteria mAIC, cAIC, CAIC, BIC and EBIC for $n_1 = \cdots = n_k = 4$, $k = 6, 15$ and $\psi = 0.1, 0.5, 1.0$: the dimension of the full model is $p = 7$ and the true model is $(p^*) = \{1,\ldots,p^*\}$.

              k = 6, n = 4, N = 24           k = 15, n = 4, N = 60
 (m)     mAIC  cAIC  CAIC   BIC  EBIC    mAIC  cAIC  CAIC   BIC  EBIC

 p* = 2, psi = 0.1
 (1)     11.3   0.0   0.0  11.3  11.3     2.3   0.0   0.0   2.3   2.3
 (2)     61.0  71.3  74.0  61.0  70.7    69.3  70.0  72.7  83.7  88.7
 (3)     11.0  13.0  13.0  10.0   7.0    12.3  12.0  11.7   6.3   3.0
 (4)      8.7   8.0   7.7   7.0   4.7     6.7   6.7   5.7   3.3   2.7
 (5)      3.7   4.0   2.7   3.7   2.7     4.7   5.7   5.3   2.0   1.7
 (6)      1.3   2.7   2.0   2.3   1.0     3.0   4.0   3.0   1.0   0.7
 (7)      3.0   1.0   0.7   4.7   2.7     1.7   1.7   1.7   1.3   1.0

 p* = 4, psi = 0.5
 (1)      2.7   0.0   0.0   2.7   2.7     0.3   0.0   0.0   0.3   0.3
 (2)      2.0   0.0   0.0   2.0   2.0     0.0   0.0   0.0   0.0   0.0
 (3)      3.3   0.0   0.0   3.3   3.3     0.0   0.0   0.0   0.0   0.0
 (4)     75.7  80.7  83.7  71.3  85.7    80.3  76.0  76.3  94.0  98.7
 (5)      9.7  10.7  10.0  10.3   3.7    10.0  11.0  10.7   4.7   0.7
 (6)      5.0   6.3   4.0   6.3   1.7     6.3   8.7   8.7   1.0   0.3
 (7)      1.7   2.3   2.3   4.0   1.0     3.0   4.3   4.3   0.0   0.0

 p* = 6, psi = 1.0
 (1)      0.0   0.0   0.0   0.0   0.0     0.0   0.0   0.0   0.0   0.0
 (2)      0.0   0.0   0.0   0.0   0.0     0.0   0.0   0.0   0.0   0.0
 (3)      1.0   0.0   0.0   1.0   1.0     0.0   0.0   0.0   0.0   0.0
 (4)      0.7   0.0   0.0   0.7   0.7     0.0   0.0   0.0   0.0   0.0
 (5)      5.7   0.0   0.0   5.7   5.7     0.0   0.0   0.0   0.0   0.0
 (6)     82.7  87.7  89.0  76.7  87.0    85.3  84.0  84.3  91.0  98.7
 (7)     10.0  12.3  11.0  16.0   5.7    14.7  16.0  15.7   9.0   1.3

for $\sigma^2 = 1$ and $\psi = 0.1, 0.5, 1.0$. Also, $\beta_\ell$ for $1 \le \ell \le p^*$ is generated as a random variable distributed as $\beta_\ell = 2(-1)^{\ell+1}\{1 + U(0,1)\}$ for a uniform random variable $U(0,1)$ on the interval $(0,1)$. Let $(m)$ be the set $\{1,\ldots,m\}$, and write the model using the first $m$ regressor variables as $M_m$, or simply $(m)$. Then, the full model is $(7)$ and the true model is $(p^*)$. As candidate models, we consider the nested subsets $(1),\ldots,(7)$ of $\{1,\ldots,7\}$, namely,

\[
(m)\qquad y = X\beta_{(m)} + \mathrm{block\ diag}(j_{n_1},\ldots,j_{n_k})v + \varepsilon,
\]

where $\beta_{(m)} = (\beta_1,\ldots,\beta_m,0,\ldots,0)^\top$. In the simulation experiments, 10 observations of the regressor matrix $X$ are generated, and for each observation of $X$, 30 observations of the response variable $y$ are generated from the true model $(p^*)$ for $p^* = 2, 4, 6$. Thus, we have $10\times 30\,(= 300)$ data sets in total. For each data set, we calculate the values of the information criteria mAIC, cAIC, CAIC, BIC and EBIC for the seven


candidate models $(1),\ldots,(7)$, and we select the model minimizing each criterion. For each criterion and each candidate model $(m)$, the frequency with which the model $(m)$ is selected is counted over the 300 data sets. These frequencies are reported in Table 1.

From Table 1, we can make the following observations about selecting the true variables. In the case of $N = 24$, cAIC and CAIC are better than mAIC, and EBIC is superior to BIC. The three criteria cAIC, CAIC and EBIC perform well, while mAIC and BIC are inferior. This is reasonable because BIC is an asymptotic approximation, whereas EBIC incorporates the prior distribution for the regression parameters. In the case of $N = 60$, on the other hand, EBIC and BIC are much better than mAIC, cAIC and CAIC. This is plausible because EBIC and BIC are consistent while mAIC, cAIC and CAIC are not. The performance of BIC improves as $N$ gets large. Clearly, EBIC performs best in this case. These observations show that EBIC is recommendable for both small and large sample sizes.

5. Concluding remarks

In this paper, we have derived the exact expression of EBIC for the problem of selecting the regression variables in linear mixed models. Bayesian variable selection procedures based on the marginal distribution or on Bayes factors require full prior information, but it may be hard to set up prior distributions for all the parameters and to compute the resulting multi-dimensional integrals. In contrast, BIC is free from any setup of a prior distribution, but it may be far from the marginal distribution for small sample sizes. The EBIC proposed here is an intermediate procedure between BIC and the full Bayes variable selection procedures: EBIC incorporates a partial non-subjective prior distribution for the parameters of interest, while it requires no prior setup for the nuisance parameters.
As a theoretical property, we have shown that EBIC is consistent as $N$ goes to infinity. The performance of EBIC has also been investigated numerically in terms of selecting the true variables, and it has been shown that EBIC is better than BIC for small sample sizes and superior to the marginal AIC and the conditional AICs for large sample sizes. This means that EBIC is recommendable as a useful variable selection procedure for both small and large sample sizes.

Acknowledgements

The authors are grateful to the editor and the referee for their valuable comments. The research of the first author was supported in part by Grant-in-Aid for Scientific Research Nos. 19200020 and 21540114 from the Japan Society for the Promotion of Science. The research of the second author was supported by NSERC.


References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle, in B. N. Petrov and F. Csaki (eds.), 2nd International Symposium on Information Theory, 267–281, Akademiai Kiado, Budapest.

Akaike, H. (1974). A new look at the statistical model identification, IEEE Trans. Autom. Contr., AC-19, 716–723.

Berger, J. O. and Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction, J. Amer. Stat. Assoc., 91, 109–122.

Casella, G., Giron, F. J., Martinez, M. L. and Moreno, E. (2009). Consistency of Bayesian procedures for variable selection, Ann. Stat., 37, 1207–1228.

Fernandez, C., Ley, E. and Steel, M. F. J. (2001). Benchmark prior for Bayesian model averaging, J. Econometrics, 100, 381–427.

Hurvich, C. M. and Tsai, C. L. (1989). Regression and time series model selection in small samples, Biometrika, 76, 297–307.

Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g-priors for Bayesian variable selection, J. Amer. Stat. Assoc., 103, 410–423.

Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression, Ann. Stat., 12, 758–765.

Schwarz, G. (1978). Estimating the dimension of a model, Ann. Stat., 6, 461–464.

Srivastava, M. S. and Khatri, C. G. (1979). An Introduction to Multivariate Statistics, North-Holland, New York.

Srivastava, M. S. and Kubokawa, T. (2010). Conditional information criteria for selecting variables in linear mixed models, J. Multivariate Analysis, to appear.

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections, Commun. Statist. Theory Methods, 1, 13–26.

Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed-effects models, Biometrika, 92, 351–370.