arXiv:1603.09458v2 [math.ST] 7 Apr 2016
Generalized ridge estimator and model selection criterion in multivariate linear regression

Yuichi Mori∗, Taiji Suzuki∗†

∗ Department of Mathematical and Computing Sciences, Graduate School of Information Science and Engineering, Tokyo Institute of Technology. † PRESTO, JST.
Abstract

We propose new model selection criteria based on generalized ridge estimators that dominate the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk in multivariate linear regression. Our model selection criteria have the following favorable properties: consistency, unbiasedness, and uniformly minimum variance. Consistency is proven under an asymptotic structure p/n → c, where n is the sample size and p is the dimension of the response variables. In particular, our proposed class of estimators dominates the maximum likelihood estimator under the squared risk even when the candidate model does not include the true model. Experimental results show that the risks of our model selection criteria are smaller than those based on the maximum likelihood estimator and that our proposed criteria specify the true model under some conditions.
1 Introduction
Model selection criteria such as AIC (Akaike (1971)) and Cp (Mallows (1973)) have been used in various applications, and their theoretical properties have been studied extensively. We consider a model selection problem in multivariate linear regression based on a class of generalized ridge estimators. The multivariate linear regression considered in this paper has p response variables regressed on a subset of k explanatory variables, and the response is contaminated by multivariate normal noise. This model, in which the response is multivariate, is thus an extension of multiple linear regression, where the response is univariate. Applications of multivariate linear regression include genetic data analysis (e.g., Gharagheizi (2008)) and multiple brain scans (e.g., Basser and Pierpaoli (1998)). Multivariate linear regression is written as
\[ Y \sim N_{n\times p}(AB, \Sigma \otimes I_n), \]
where Y is an n × p observation matrix of p response variables, A is an n × k observation matrix of k explanatory variables, B is a k × p unknown matrix
of regression coefficients, Σ is a p × p unknown covariance matrix, k is a non-stochastic number, and n is the sample size. We assume that n − p − k − 1 > 0 for all n ≥ k and that rank(A) = k. The purpose of the model selection problem is to select an appropriate subset of regression coefficients. Suppose that J denotes a subset of the index set of coefficients F = {1, ..., k}, let 𝒥 denote the power set of F, and let k_J denote the number of elements of J, that is, k_J = |J|. Then, the candidate model corresponding to the subset J can be expressed as
\[ Y \sim N_{n\times p}(A_J B_J, \Sigma \otimes I_n), \]
where A_J is an n × k_J matrix consisting of the columns of A indexed by the elements of J, and B_J is a k_J × p unknown matrix of regression coefficients. We assume that the candidate model corresponding to J_* ∈ 𝒥 is the true model.

One way to perform model selection in multivariate linear regression is to apply well-known model selection criteria such as AIC (Akaike (1971)), AICc (Bedrick and Tsai (1994)), Cp (Mallows (1973)), and MCp (Fujikoshi and Satoh (1997)). These criteria are unbiased or asymptotically unbiased estimators of the squared risk and the Kullback-Leibler risk, defined as
\[ R_S(B, \Sigma, \Phi) = \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}(B - \Phi)^\top A^\top A (B - \Phi)\big\}\big], \]
\[ R_{KL}(B, \Sigma, \hat{f}) = \mathrm{E}_{\tilde{Y}, Y}\left[\log\frac{f(\tilde{Y} \mid B, \Sigma)}{\hat{f}(\tilde{Y} \mid Y)}\right], \]
where Φ is an estimator of B, f is the true probability density of Y, and f̂ is a predictive density of Ỹ conditional on Y, with Ỹ an independent copy of Y. Cp and MCp are unbiased estimators of the squared risk of the maximum likelihood estimator, and AIC and AICc are asymptotically unbiased and unbiased estimators, respectively, of the Kullback-Leibler risk of the maximum likelihood estimator. In particular, it is shown that MCp and AICc are uniformly minimum variance unbiased estimators of their corresponding risks (Davies et al. (2006)).

One important property of a model selection criterion is consistency, that is, it selects the true model asymptotically in probability. Fujikoshi et al. (2014) showed consistency of AIC, AICc, Cp, and MCp under an asymptotic structure p/n → c and some conditions in multivariate linear regression, although these criteria are not consistent in the usual univariate settings.

Besides model selection criteria corresponding to the maximum likelihood estimator as introduced above, some authors have studied criteria corresponding to other estimators that might dominate the maximum likelihood estimator. Yanagihara and Satoh (2010) investigated an unbiased estimator of the squared risk of the ridge estimator and developed a model selection criterion to select the candidate model and the ridge parameter simultaneously. Furthermore, although Nagai et al. (2012) proposed a model selection criterion of this type, it is not based on estimators that are rigorously proven to dominate the maximum likelihood estimator.
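To fix ideas, the following Python sketch (ours, not from the paper; all function and variable names are hypothetical) simulates the model Y ∼ N_{n×p}(AB, Σ ⊗ I_n), computes the maximum likelihood estimator B̂_J = (A_J^⊤A_J)^{-1}A_J^⊤Y of a candidate model, and approximates its squared risk by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 5, 8                      # sample size, response dimension, number of regressors
A = rng.normal(size=(n, k))             # n x k design matrix (rank k)
B = rng.normal(size=(k, p))             # true k x p coefficient matrix
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # p x p covariance
L = np.linalg.cholesky(Sigma)

def simulate_Y():
    """One draw of Y ~ N_{n x p}(AB, Sigma (x) I_n): rows are independent N_p((AB)_i, Sigma)."""
    return A @ B + rng.normal(size=(n, p)) @ L.T

def mle(J, Y):
    """Maximum likelihood (least squares) estimator for the candidate model J (a list of columns)."""
    AJ = A[:, J]
    return np.linalg.solve(AJ.T @ AJ, AJ.T @ Y)

def squared_risk(J, n_rep=2000):
    """Monte Carlo approximation of E tr{Sigma^{-1}(A_J B_hat - A B)^T (A_J B_hat - A B)}."""
    Sinv = np.linalg.inv(Sigma)
    vals = []
    for _ in range(n_rep):
        Y = simulate_Y()
        diff = A[:, J] @ mle(J, Y) - A @ B
        vals.append(np.trace(Sinv @ diff.T @ diff))
    return np.mean(vals)

print(squared_risk(list(range(k))))     # full model; the exact risk here is k * p
```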
In this paper, we propose new model selection criteria for multivariate linear regression based on new shrinkage estimators that dominate the maximum likelihood estimator under the given risks. In particular, even when the candidate model does not include the true model, our proposed estimator dominates the maximum likelihood estimator under the squared risk. Moreover, our model selection criteria have the following favorable properties: consistency, unbiasedness, and uniformly minimum variance. Furthermore, our model selection criteria have closed forms obtained by modifying AICc and MCp. We construct a class of Bayes estimators that dominate the maximum likelihood estimator under the risks and have the form of a generalized ridge estimator. The generalized ridge estimator in multivariate linear regression is a class of estimators that can be written as
\[ (A_J^\top A_J + P_J K_J P_J^\top)^{-1} A_J^\top Y, \]
where K_J is a k_J × k_J diagonal matrix and P_J is the k_J × k_J orthogonal matrix of eigenvectors of (A_J^⊤ A_J)^{-1}. In other words,
\[ P_J^\top (A_J^\top A_J)^{-1} P_J = D_J, \qquad P_J^\top P_J = I_{k_J}, \]
where D_J = diag(d_{J,1}, d_{J,2}, ..., d_{J,k_J}) and d_{J,1} ≥ d_{J,2} ≥ ··· ≥ d_{J,k_J}.
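As a minimal numerical illustration (our sketch, with hypothetical variable names), P_J and D_J can be obtained from the eigendecomposition of (A_J^⊤A_J)^{-1}, and the generalized ridge estimator above can equivalently be computed in the rotated coordinates as P_J(D_J^{-1} + K_J)^{-1}P_J^⊤A_J^⊤Y:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, kJ = 40, 4, 3
AJ = rng.normal(size=(n, kJ))                     # design matrix of the candidate model
Y = rng.normal(size=(n, p))                       # response matrix

# Eigendecomposition of (A_J^T A_J)^{-1}: columns of PJ are eigenvectors, DJ the eigenvalues.
evals, PJ = np.linalg.eigh(np.linalg.inv(AJ.T @ AJ))
order = np.argsort(evals)[::-1]                   # d_{J,1} >= ... >= d_{J,kJ}
DJ = np.diag(evals[order])
PJ = PJ[:, order]

KJ = np.diag(rng.uniform(0.1, 1.0, size=kJ))      # any kJ x kJ diagonal "ridge" matrix

# Generalized ridge estimator, (A_J^T A_J + P_J K_J P_J^T)^{-1} A_J^T Y ...
B_ridge = np.linalg.solve(AJ.T @ AJ + PJ @ KJ @ PJ.T, AJ.T @ Y)
# ... and the equivalent rotated form P_J (D_J^{-1} + K_J)^{-1} P_J^T A_J^T Y.
B_rot = PJ @ np.linalg.solve(np.linalg.inv(DJ) + KJ, PJ.T @ AJ.T @ Y)

print(np.allclose(B_ridge, B_rot))                # True: the two forms coincide
```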
Our estimator is related to the generalized Bayes estimator proposed by Maruyama and Strawderman (2005) for linear regression, the Stein-type estimator proposed by Konno (1991) for multivariate linear regression, and the generalized Bayes estimator proposed by Tsukuma (2009) for multivariate linear regression. In contrast to these estimators, our estimators enable us to construct closed-form model selection criteria. Moreover, as stated above, it is shown that the criteria have several favorable statistical properties. Since our model selection criteria are based on generalized ridge estimators dominating the maximum likelihood estimator, it is expected that the risks of our estimators on the models selected by our criteria are smaller than the risks of the maximum likelihood estimator on the models selected by MCp and AICc. We carry out numerical experiments to show the properties of our method.

The contents of this paper are summarized as follows. In Section 2, we present the classes of estimators dominating the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk. Our estimator is given as a Bayes estimator, and it is shown that it has the same form as a generalized ridge estimator. By setting the hyperparameters of our Bayes estimator appropriately, we derive the class of estimators dominating the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk. In Section 3, we construct model selection criteria based on the estimators in the classes proposed in Section 2. It is also shown that our model selection criteria are uniformly minimum variance unbiased estimators of the two risks and are consistent. In Section 4, we give numerical comparisons of our model selection criteria with AIC, AICc, and MCp. In Section 5, we provide the discussion and conclusions.
2 Generalized ridge estimator
In this section, we construct a class of Bayes estimators that have the form of the generalized ridge estimator and dominate the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk. To derive the estimator, we rotate the coordinates and construct a Bayes estimator that can be regarded as a generalized ridge estimator in the rotated coordinates. It is shown that the Bayes estimator with tuned hyperparameters dominates the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk. The Bayes estimator is then minimax because the maximum likelihood estimator is minimax optimal with constant risk. In particular, in the case of the squared risk, it is not necessary that the candidate model include the true model. In the case of the Kullback-Leibler risk, however, this property is shown only for models that include the true model.

First, we give a coordinate transformation of Y to derive the estimator. Let Q_J be an n × n orthogonal matrix such that
\[ Q_J A_J = \begin{pmatrix} D_J^{-1/2} P_J^\top \\ 0 \end{pmatrix}, \]
and let D_{*J} be the n × n diagonal matrix diag(d_{J,1}, d_{J,2}, ..., d_{J,k_J}, 1, ..., 1). We define random matrices X_J = (x_{J,1}, ..., x_{J,k_J})^⊤ ∈ R^{k_J×p} and Z_J = (z_{J,1}, ..., z_{J,n−k_J})^⊤ ∈ R^{(n−k_J)×p} by
\[ \begin{pmatrix} X_J \\ Z_J \end{pmatrix} = D_{*J}^{1/2} Q_J Y. \]
Then (X_J, Z_J) has the joint density
\[
\prod_{i=1}^{k_J} d_{J,i}^{-p/2}\, (2\pi)^{-k_J p/2} |\Sigma|^{-k_J/2} \exp\left(-\frac{1}{2}\operatorname{tr}\big[\Sigma^{-1}(X_J - \Theta_J)^\top D_J^{-1}(X_J - \Theta_J)\big]\right) \times (2\pi)^{-(n-k_J)p/2} |\Sigma|^{-(n-k_J)/2} \exp\left(-\frac{1}{2}\operatorname{tr}\big[\Sigma^{-1} S_J\big]\right),
\]
where Θ_J = P_J^⊤ B_J = (θ_{J,1}, ..., θ_{J,k_J})^⊤ and S_J = Z_J^⊤ Z_J. With this transformation, the squared risk can be rewritten as
\[ \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}(\Phi_J - \Theta_J)^\top D_J^{-1}(\Phi_J - \Theta_J)\big\}\big], \]
where Φ_J is an estimator of Θ_J. Note that the maximum likelihood estimator of Θ_J is given by X_J, and thus X_J is minimax optimal. We construct the Bayes estimator of Θ_J so that it dominates the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk.
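The rotation can be made concrete as follows (a sketch of ours, not code from the paper): the first k_J rows of Q_J may be taken as (A_J P_J D_J^{1/2})^⊤ and the remaining rows as any orthonormal basis of the orthogonal complement of the column space of A_J. The sketch checks numerically that X_J = D_J P_J^⊤ A_J^⊤ Y and that S_J = Z_J^⊤ Z_J equals the residual matrix Y^⊤(I_n − A_J(A_J^⊤A_J)^{-1}A_J^⊤)Y.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n, p, kJ = 30, 4, 3
AJ = rng.normal(size=(n, kJ))
Y = rng.normal(size=(n, p))

evals, PJ = np.linalg.eigh(np.linalg.inv(AJ.T @ AJ))
order = np.argsort(evals)[::-1]
dJ, PJ = evals[order], PJ[:, order]
DJ = np.diag(dJ)

Q1 = (AJ @ PJ @ np.diag(np.sqrt(dJ))).T           # first kJ rows of Q_J: Q1 @ AJ = D_J^{-1/2} P_J^T
Q2 = null_space(AJ.T).T                           # remaining n-kJ rows: orthonormal, orthogonal to col(A_J)
QJ = np.vstack([Q1, Q2])                          # an n x n orthogonal matrix

Dstar_half = np.diag(np.concatenate([np.sqrt(dJ), np.ones(n - kJ)]))
XZ = Dstar_half @ QJ @ Y
XJ, ZJ = XZ[:kJ], XZ[kJ:]

HJ = AJ @ np.linalg.solve(AJ.T @ AJ, AJ.T)        # projection onto col(A_J)
print(np.allclose(QJ @ QJ.T, np.eye(n)))          # Q_J is orthogonal
print(np.allclose(XJ, DJ @ PJ.T @ AJ.T @ Y))      # X_J = D_J P_J^T A_J^T Y (rotated MLE of Theta_J)
print(np.allclose(ZJ.T @ ZJ, Y.T @ (np.eye(n) - HJ) @ Y))  # S_J = Z_J^T Z_J = Y^T (I - H_J) Y
```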
2.1 Derivation of the estimator
In this subsection, we derive a Bayes estimator based on the following prior distribution:
\[ \Theta_J \mid \Sigma \sim N_{k_J \times p}\big(0,\ \Sigma \otimes D_J(\Lambda_J^{-1} C_J - I_{k_J})\big) = \pi(\Theta_J \mid \Sigma), \tag{1} \]
\[ \Sigma \sim \pi(\Sigma), \tag{2} \]
where Λ_J and C_J are diagonal matrices, Λ_J = diag(λ_{J,1}, ..., λ_{J,k_J}) and C_J = diag(c_{J,1}, ..., c_{J,k_J}) (λ_{J,i} > 0, c_{J,i} > 0, i = 1, ..., k_J), and π(Σ) is any distribution on the positive definite matrices such that E[Σ^{-1} | X_J, S_J] and {E[Σ^{-1} | X_J, S_J]}^{-1} exist.

Theorem 1.
The Bayes estimator based on the prior (1) and (2) is given by
\[ \hat{B}_{J,B} = P_J \hat{\Theta}_{J,B} = P_J (I_{k_J} - \Lambda_J C_J^{-1}) D_J P_J^\top A_J^\top Y \]
and is a generalized ridge estimator.

Proof. To derive the Bayes estimator based on this prior, we calculate a part of the exponent of the joint density of (X_J, Z_J, Θ_J, Σ). Let F_J = (Λ_J^{-1} C_J − I_{k_J})^{-1}. Then,
\[
\begin{aligned}
& \operatorname{tr}\{\Sigma^{-1}(X_J - \Theta_J)^\top D_J^{-1}(X_J - \Theta_J)\} + \operatorname{tr}\{\Sigma^{-1}\Theta_J^\top (\Lambda_J^{-1}C_J - I)^{-1} D_J^{-1}\Theta_J\} \\
&= \operatorname{tr}\{\Sigma^{-1}[(X_J - \Theta_J)^\top D_J^{-1}(X_J - \Theta_J) + \Theta_J^\top (\Lambda_J^{-1}C_J - I)^{-1} D_J^{-1}\Theta_J]\} \\
&= \operatorname{tr}\{\Sigma^{-1}[X_J^\top D_J^{-1}X_J - X_J^\top D_J^{-1}\Theta_J - \Theta_J^\top D_J^{-1}X_J + \Theta_J^\top (I + F_J) D_J^{-1}\Theta_J]\} \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big\{\big(\Theta_J - (I + F_J)^{-1}X_J\big)^\top (I + F_J) D_J^{-1}\big(\Theta_J - (I + F_J)^{-1}X_J\big) + X_J^\top D_J^{-1}\big(I - (I + F_J)^{-1}\big)X_J\Big\}\Big],
\end{aligned}
\]
and (I_{k_J} + F_J)^{-1} is given by
\[
\begin{aligned}
(I_{k_J} + F_J)^{-1} &= \big(I_{k_J} + (C_J\Lambda_J^{-1} - I_{k_J})^{-1}\big)^{-1} = \big(I_{k_J} + \Lambda_J(C_J - \Lambda_J)^{-1}\big)^{-1} \\
&= \big(C_J(C_J - \Lambda_J)^{-1}\big)^{-1} = I_{k_J} - \Lambda_J C_J^{-1}.
\end{aligned}
\]
Therefore, the joint density of (X_J, Z_J, Θ_J, Σ) is proportional to
\[
|\Sigma|^{-(n+k_J)/2} \exp\Big[-\frac{1}{2}\operatorname{tr}\Big\{\Sigma^{-1}\Big(\big(\Theta_J - (I - \Lambda_J C_J^{-1})X_J\big)^\top (I - \Lambda_J C_J^{-1})^{-1} D_J^{-1}\big(\Theta_J - (I - \Lambda_J C_J^{-1})X_J\big) + X_J^\top D_J^{-1}\big(I - (I - \Lambda_J C_J^{-1})\big)X_J\Big)\Big\} - \frac{1}{2}\operatorname{tr}\big[\Sigma^{-1}S_J\big]\Big]\,\pi(\Sigma).
\]
The Bayes estimator of Θ_J under the squared risk is obtained from
\[
\begin{aligned}
& \frac{\partial}{\partial \Phi_J}\,\mathrm{E}\big[\operatorname{tr}\{\Sigma^{-1}(\Theta_J - \Phi_J)^\top D_J^{-1}(\Theta_J - \Phi_J)\} \mid X_J, S_J\big] = 0 \\
\Rightarrow\;& \mathrm{E}\big[2 D_J^{-1}(\Theta_J - \Phi_J)\Sigma^{-1} \mid X_J, S_J\big] = 0 \\
\Rightarrow\;& \mathrm{E}\big[D_J^{-1}\Theta_J\Sigma^{-1} \mid X_J, S_J\big] = \mathrm{E}\big[D_J^{-1}\Phi_J\Sigma^{-1} \mid X_J, S_J\big] \\
\Rightarrow\;& \mathrm{E}\big[\Theta_J\Sigma^{-1} \mid X_J, S_J\big] = \Phi_J\,\mathrm{E}\big[\Sigma^{-1} \mid X_J, S_J\big] \\
\Rightarrow\;& \Phi_J = \mathrm{E}\big[\Theta_J\Sigma^{-1} \mid X_J, S_J\big]\big\{\mathrm{E}\big[\Sigma^{-1} \mid X_J, S_J\big]\big\}^{-1} \\
\Rightarrow\;& \Phi_J = \mathrm{E}\big[\Theta_J \mid X_J, S_J\big].
\end{aligned}
\]
Therefore, the Bayes estimator Θ̂_{J,B} based on this prior is E[Θ_J | X_J, S_J]. From the joint density of (X_J, Z_J, Θ_J, Σ),
\[ \hat{\Theta}_{J,B} = (I_{k_J} - \Lambda_J C_J^{-1}) X_J. \]
Moreover, the Bayes estimator B̂_{J,B} of B_J based on this prior can be written as
\[
\begin{aligned}
\hat{B}_{J,B} &= P_J \hat{\Theta}_{J,B} = P_J (I_{k_J} - \Lambda_J C_J^{-1}) D_J P_J^\top A_J^\top Y \\
&= P_J \big(D_J^{-1}(I_{k_J} - \Lambda_J C_J^{-1})^{-1}\big)^{-1} P_J^\top A_J^\top Y \\
&= P_J \big(D_J^{-1}(I_{k_J} + \Lambda_J (C_J - \Lambda_J)^{-1})\big)^{-1} P_J^\top A_J^\top Y \\
&= P_J \big(D_J^{-1} + D_J^{-1}\Lambda_J (C_J - \Lambda_J)^{-1}\big)^{-1} P_J^\top A_J^\top Y.
\end{aligned}
\]
Furthermore, let K_J = D_J^{-1} Λ_J (C_J − Λ_J)^{-1}. Then B̂_{J,B} is regarded as a generalized ridge estimator. The Bayes estimator is close to a multivariate form of that of Maruyama and Strawderman (2005). While they defined a univariate prior for the shrinkage parameter, in this paper Λ_J is a hyperparameter. Since their estimator does not have a closed form, we change the prior so as to construct a Bayes estimator expressed in closed form. We consider a plug-in predictive density based on this Bayes estimator even for the Kullback-Leibler risk because it is easy to handle.
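As a quick numerical sanity check (ours; the values of Λ_J and C_J below are arbitrary, subject only to c_{J,i} > λ_{J,i} > 0 so that the prior covariance is positive definite), the closed-form Bayes estimator of Theorem 1 coincides with the generalized ridge estimator with K_J = D_J^{-1}Λ_J(C_J − Λ_J)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, kJ = 40, 4, 3
AJ = rng.normal(size=(n, kJ))
Y = rng.normal(size=(n, p))

evals, PJ = np.linalg.eigh(np.linalg.inv(AJ.T @ AJ))
order = np.argsort(evals)[::-1]
dJ, PJ = evals[order], PJ[:, order]
DJ = np.diag(dJ)

lam = rng.uniform(0.1, 0.5, size=kJ)              # hypothetical hyperparameters lambda_{J,i}
c = lam + rng.uniform(0.5, 1.0, size=kJ)          # c_{J,i} > lambda_{J,i}: prior covariance is positive definite
Lam, C = np.diag(lam), np.diag(c)

# Closed-form Bayes estimator from Theorem 1.
B_bayes = PJ @ (np.eye(kJ) - Lam @ np.linalg.inv(C)) @ DJ @ PJ.T @ AJ.T @ Y

# Generalized ridge estimator with K_J = D_J^{-1} Lambda_J (C_J - Lambda_J)^{-1}.
KJ = np.linalg.inv(DJ) @ Lam @ np.linalg.inv(C - Lam)
B_ridge = np.linalg.solve(AJ.T @ AJ + PJ @ KJ @ PJ.T, AJ.T @ Y)

print(np.allclose(B_bayes, B_ridge))              # True
```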
2.2 Generalized ridge estimator under the squared risk
In this subsection, we set the parameters of the generalized ridge estimator and show that it dominates the maximum likelihood estimator under the squared risk for any candidate model. In many studies, several generalized ridge estimators have been examined under the assumption that the candidate model includes the true model; this paper does not make that assumption. In the following theorem, we give sufficient conditions on the parameters so that the Bayes estimator B̂_{J,B} dominates the maximum likelihood estimator B̂_J under the squared risk.

Theorem 2. Let c_{J,i} = x_{J,i}^⊤ S_F^{-1} x_{J,i} (i = 1, ..., k_J) and assume that
(i) 0 < λ_{J,i} < 2 d_{J,i}(p − 2)/(n − k_F − p + 3) (i = 1, ..., k_J),
(ii) p − 2 > 0,
(iii) J_* ∈ 𝒥.
Then,
\[ R_S(B_{J_*}, \Sigma, \hat{B}_{J,B}) < R_S(B_{J_*}, \Sigma, \hat{B}_J) \quad (\forall J \in \mathcal{J}). \]
Moreover, R_S(B_{J_*}, Σ, B̂_{J,B}) is minimized at λ_{J,i} = d_{J,i}(p − 2)/(n − p − k_F + 3) (i = 1, ..., k_J).
We employed c_{J,i} = x_{J,i}^⊤ S_F^{-1} x_{J,i} (i = 1, 2, ..., k_J) instead of x_{J,i}^⊤ S_J^{-1} x_{J,i} for the following reason. If we used S_J rather than S_F, we would need to assume that the candidate model J includes the true model in order to show that B̂_{J,B} dominates B̂_J; moreover, in that case we would not obtain a closed form of the model selection criterion. By employing our definition of C_J, we obtain an unbiased estimator of the risk by slightly modifying MCp. While assumption (i) of this theorem gives the condition on Λ_J, we may simply set λ_{J,i} = d_{J,i}(p − 2)/(n − k_J − p + 3) (i = 1, 2, ..., k_J). Assumption (ii) is a restriction on the dimension; in particular, p ≥ 3 is the same condition under which Stein's estimator dominates the maximum likelihood estimator with respect to the squared risk. Assumption (iii) ensures that the set of candidate models contains the true model.

Each row of B̂_{J,B} is similar to Stein's estimator. Furthermore, B̂_{J,B} dominates B̂_J under the squared risk when the model J includes the true model; this is not obvious, however, when the model J does not include the true model. Furthermore, from the form of the rows of B̂_{J,B} and the fact that Stein's estimator is not admissible, B̂_{J,B} is not admissible. Nevertheless, we employ this form rather than pursuing a statistically optimal one because of its computational efficiency; in fact, the model selection criteria based on this estimator can be computed analytically, as shown later.
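For illustration, a sketch (ours; the function name is hypothetical) of the quantities just discussed: the simple choice λ_{J,i} = d_{J,i}(p − 2)/(n − k_J − p + 3), the statistic c_{J,i} = x_{J,i}^⊤ S_F^{-1} x_{J,i} computed from the full-model residual matrix S_F, and the resulting quantity tr(Λ_J C_J^{-1}) = Σ_i λ_{J,i}/c_{J,i} used by the proposed criteria.

```python
import numpy as np

def ridge_quantities(A, Y, J):
    """lambda_{J,i}, c_{J,i} and tr(Lambda_J C_J^{-1}) for a candidate subset J (list of column indices)."""
    n, p = Y.shape
    AJ = A[:, J]
    kJ = AJ.shape[1]

    # P_J, D_J from the eigendecomposition of (A_J^T A_J)^{-1}, eigenvalues in decreasing order.
    evals, PJ = np.linalg.eigh(np.linalg.inv(AJ.T @ AJ))
    order = np.argsort(evals)[::-1]
    dJ, PJ = evals[order], PJ[:, order]

    XJ = np.diag(dJ) @ PJ.T @ AJ.T @ Y                    # rows x_{J,i}^T (rotated MLE of Theta_J)
    HF = A @ np.linalg.solve(A.T @ A, A.T)                # projection onto the full-model column space
    SF = Y.T @ (np.eye(n) - HF) @ Y                       # residual matrix of the full model

    lam = dJ * (p - 2) / (n - kJ - p + 3)                 # lambda_{J,i} = d_{J,i}(p-2)/(n-k_J-p+3)
    c = np.array([x @ np.linalg.solve(SF, x) for x in XJ])  # c_{J,i} = x_{J,i}^T S_F^{-1} x_{J,i}
    return lam, c, np.sum(lam / c)                        # tr(Lambda_J C_J^{-1}) = sum_i lambda_{J,i}/c_{J,i}

# Example with simulated data (n - k_F - p - 1 > 0 and p >= 3, as assumed in the paper).
rng = np.random.default_rng(4)
n, p, kF = 60, 5, 8
A = rng.normal(size=(n, kF))
Y = A @ rng.normal(size=(kF, p)) + rng.normal(size=(n, p))
print(ridge_quantities(A, Y, J=[0, 1, 2]))
```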
Proof. From assumption (iii), without loss of generality, we may regard J_* as F; therefore, let B_{J_*} be B_F and A_{J_*} be A_F. For all J ∈ 𝒥,
\[
\begin{aligned}
R_S(B_{J_*}, \Sigma, \hat{B}_{J,B}) &= \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}(A_J\hat{B}_{J,B} - A_F B_{J_*})^\top (A_J\hat{B}_{J,B} - A_F B_{J_*})\big\}\big] \\
&= R_S(B_{J_*}, \Sigma, \hat{B}_J) - 2\,\mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}(A_J\hat{B}_J - A_F B_{J_*})^\top A_J P_J \Lambda_J C_J^{-1} P_J^\top \hat{B}_J\big\}\big] \\
&\quad + \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}\hat{B}_J^\top P_J \Lambda_J C_J^{-1} P_J^\top A_J^\top A_J P_J \Lambda_J C_J^{-1} P_J^\top \hat{B}_J\big\}\big].
\end{aligned}
\]
Therefore, it is sufficient to show that the sum of the second and third terms is non-positive. From the definitions of X_J and S_F,
\[ X_J \sim N_{k_J \times p}(D_J P_J^\top A_J^\top A_F B_{J_*},\ \Sigma \otimes D_J), \qquad S_F \sim W_p(n - k_F, \Sigma). \]
Let W_J = D_J^{-1/2} X_J = (w_{J,1}, ..., w_{J,k_J})^⊤. Then
\[ W_J = P_J^\top (A_J^\top A_J)^{1/2} \hat{B}_J \sim N_{k_J \times p}(\Psi_J, \Sigma \otimes I), \qquad x_{J,i}^\top S_F^{-1} x_{J,i} = d_{J,i}\, w_{J,i}^\top S_F^{-1} w_{J,i}, \]
where Ψ_J = P_J^⊤ (A_J^⊤ A_J)^{-1/2} A_J^⊤ A_F B_{J_*} = (ψ_{J,1}, ..., ψ_{J,k_J})^⊤. From the definition of W_J and the joint density of (X_J, S_J), the joint density of w_{J,1}, w_{J,2}, ..., w_{J,k_J}, Z_F, and Y^⊤(A_F(A_F^⊤A_F)^{-1}A_F^⊤ − A_J(A_J^⊤A_J)^{-1}A_J^⊤)Y is given by
\[
\begin{aligned}
& (2\pi)^{-k_J p/2} |\Sigma|^{-k_J/2} \prod_{i=1}^{k_J} \exp\left(-\frac{1}{2}(w_{J,i} - \psi_{J,i})^\top \Sigma^{-1} (w_{J,i} - \psi_{J,i})\right) \\
&\quad \times (2\pi)^{-(n-k_J)p/2} |\Sigma|^{-(n-k_J)/2} \exp\left(-\frac{1}{2}\operatorname{tr}\big[\Sigma^{-1} S_F\big] - \frac{1}{2}\operatorname{tr}\big[\Sigma^{-1} Y^\top \big(A_F(A_F^\top A_F)^{-1}A_F^\top - A_J(A_J^\top A_J)^{-1}A_J^\top\big)Y\big]\right).
\end{aligned}
\]
From the Fisher-Cochran theorem and the definitions of W_J and S_F, the quantities w_{J,i} (i = 1, 2, ..., k_J), S_F, and Y^⊤(A_F(A_F^⊤A_F)^{-1}A_F^⊤ − A_J(A_J^⊤A_J)^{-1}A_J^⊤)Y are independent for a fixed model J.
Therefore, from Lemma 2.1 of Kubokawa and Srivastava (2001), the second term is given by
\[
\begin{aligned}
& -2\,\mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}(A_J\hat{B}_J - A_F B_{J_*})^\top A_J P_J \Lambda_J C_J^{-1} P_J^\top \hat{B}_J\big\}\big] \\
&= -2\,\mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}(A_J^\top A_J\hat{B}_J - A_J^\top A_F B_{J_*})^\top P_J \Lambda_J C_J^{-1} P_J^\top \hat{B}_J\big\}\big] \\
&= -2\,\mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}\big((A_J^\top A_J)^{1/2}\hat{B}_J - (A_J^\top A_J)^{-1/2}A_J^\top A_F B_{J_*}\big)^\top (A_J^\top A_J)^{1/2} P_J \Lambda_J C_J^{-1} P_J^\top (A_J^\top A_J)^{-1/2}(A_J^\top A_J)^{1/2}\hat{B}_J\big\}\big] \\
&= -2\,\mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}\big(W_J - P_J^\top(A_J^\top A_J)^{-1/2}A_J^\top A_F B_{J_*}\big)^\top D_J^{-1/2}\Lambda_J C_J^{-1} D_J^{1/2} W_J\big\}\big] \\
&= -2\,\mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}\big(W_J - P_J^\top(A_J^\top A_J)^{-1/2}A_J^\top A_F B_{J_*}\big)^\top \Lambda_J C_J^{-1} W_J\big\}\big] \\
&= -2\sum_{i=1}^{k_J} \mathrm{E}\Big[\lambda_{J,i} c_{J,i}^{-1}\big(W_J - P_J^\top(A_J^\top A_J)^{-1/2}A_J^\top A_F B_{J_*}\big)_{i\cdot}\,\Sigma^{-1} w_{J,i}\Big] \\
&= -2\sum_{i=1}^{k_J} \lambda_{J,i} d_{J,i}^{-1}\,\mathrm{E}\Big[(w_{J,i}^\top S_F^{-1}w_{J,i})^{-1}\big(W_J - P_J^\top(A_J^\top A_J)^{-1/2}A_J^\top A_F B_{J_*}\big)_{i\cdot}\,\Sigma^{-1} w_{J,i}\Big] \\
&= -2\sum_{i=1}^{k_J} \lambda_{J,i} d_{J,i}^{-1}\,\mathrm{E}\Big[\operatorname{tr}\big\{\nabla_{i\cdot}\big((w_{J,i}^\top S_F^{-1}w_{J,i})^{-1}w_{J,i}\big)\big\}\Big] \\
&= -2\sum_{i=1}^{k_J}\sum_{j=1}^{p} \lambda_{J,i} d_{J,i}^{-1}\,\mathrm{E}\left[\frac{\partial}{\partial W_{J,ij}}\Big\{(w_{J,i}^\top S_F^{-1}w_{J,i})^{-1}W_{J,ij}\Big\}\right] \\
&= -2p\sum_{i=1}^{k_J} \lambda_{J,i} d_{J,i}^{-1}\,\mathrm{E}\big[(w_{J,i}^\top S_F^{-1}w_{J,i})^{-1}\big] + 4\sum_{i=1}^{k_J}\sum_{j=1}^{p} \lambda_{J,i} d_{J,i}^{-1}\,\mathrm{E}\big[(w_{J,i}^\top S_F^{-1}w_{J,i})^{-2}(w_{J,i}^\top S_F^{-1})_j W_{J,ij}\big] \\
&= -2p\sum_{i=1}^{k_J} \mathrm{E}\big[\lambda_{J,i} c_{J,i}^{-1}\big] + 4\sum_{i=1}^{k_J} \mathrm{E}\big[\lambda_{J,i} c_{J,i}^{-1}\big] \\
&= -2(p-2)\sum_{i=1}^{k_J} \mathrm{E}\big[\lambda_{J,i} c_{J,i}^{-1}\big].
\end{aligned}
\]
Similarly, from the proof of Proposition 2.1 of Kubokawa and Srivastava (2001), the third term is given by
\[
\begin{aligned}
& \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1}\hat{B}_J^\top P_J \Lambda_J C_J^{-1} P_J^\top A_J^\top A_J P_J \Lambda_J C_J^{-1} P_J^\top \hat{B}_J\big\}\big] \\
&= \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1} X_J^\top \Lambda_J C_J^{-1} D_J^{-1} \Lambda_J C_J^{-1} X_J\big\}\big] \\
&= \mathrm{E}\big[\operatorname{tr}\big\{\Sigma^{-1} W_J^\top \Lambda_J^2 C_J^{-2} W_J\big\}\big] \\
&= \sum_{i=1}^{k_J} \lambda_{J,i}^2 d_{J,i}^{-2}\,\mathrm{E}\big[(w_{J,i}^\top S_F^{-1}w_{J,i})^{-2}\, w_{J,i}^\top \Sigma^{-1} w_{J,i}\big] \\
&= (n - k_F - p + 3)\sum_{i=1}^{k_J} \lambda_{J,i}^2 d_{J,i}^{-2}\,\mathrm{E}\big[(w_{J,i}^\top S_F^{-1}w_{J,i})^{-1}\big] \\
&= (n - k_F - p + 3)\sum_{i=1}^{k_J} \mathrm{E}\big[d_{J,i}^{-1}\lambda_{J,i}^2 c_{J,i}^{-1}\big].
\end{aligned}
\]
Therefore, from assumptions (i) and (ii), the sum of the second and third terms is given by
\[
-2(p-2)\sum_{i=1}^{k_J}\mathrm{E}\big[\lambda_{J,i} c_{J,i}^{-1}\big] + (n - k_F - p + 3)\sum_{i=1}^{k_J}\mathrm{E}\big[d_{J,i}^{-1}\lambda_{J,i}^2 c_{J,i}^{-1}\big]
= \sum_{i=1}^{k_J}\mathrm{E}\Big[d_{J,i}^{-1}\lambda_{J,i} c_{J,i}^{-1}\big\{(n - k_F - p + 3)\lambda_{J,i} - 2(p-2)d_{J,i}\big\}\Big] < 0,
\]
which proves the dominance. Viewed as a function of λ_{J,i}, this sum is minimized at λ_{J,i} = d_{J,i}(p − 2)/(n − k_F − p + 3) (i = 1, ..., k_J).
We next show the consistency of ZMCp, the criterion obtained from MCp by subtracting the correction term (p − 2) tr(Λ_J C_J^{-1}). From the consistency results for MCp (Fujikoshi et al. (2014)), MCp(J) − MCp(J_*) converges, after suitable normalization, to a strictly positive limit in probability as p/n → c for every J ≠ J_*. Therefore, it is sufficient to show that
\[ \lim_{p/n \to c} \frac{p-2}{n}\operatorname{tr}(\Lambda_J C_J^{-1}) = 0, \quad \forall J \in \mathcal{J} \tag{4} \]
in probability, because
\[ \mathrm{ZMCp}(J) - \mathrm{ZMCp}(J_*) = \mathrm{MCp}(J) - \mathrm{MCp}(J_*) + (p-2)\big\{\operatorname{tr}(\Lambda_{J_*} C_{J_*}^{-1}) - \operatorname{tr}(\Lambda_J C_J^{-1})\big\}. \]
From the definitions of x_{J,i} and Z_F, letting η_{J,i} = (D_J P_J^⊤ A_J^⊤ A_{J_*} P_{J_*} Θ_{J_*})_{i,:}, where A_{i,:} denotes the i-th row of a matrix A, we have
\[ Z_F \Sigma^{-1/2} \sim N_{(n-k_F)\times p}(0, I \otimes I), \qquad d_{J,i}^{-1/2}\Sigma^{-1/2} x_{J,i} \sim N_p\big(d_{J,i}^{-1/2}\Sigma^{-1/2}\eta_{J,i},\, I\big), \qquad \Sigma^{-1/2} S_F \Sigma^{-1/2} \sim W_p(n-k_F, I). \]
Therefore, we can bound the magnitude of λ_{J,i} c_{J,i}^{-1} as
\[
\begin{aligned}
\lambda_{J,i} c_{J,i}^{-1} &= \frac{d_{J,i}(p-2)}{n-k_J-p+3}\,\big(x_{J,i}^\top S_F^{-1} x_{J,i}\big)^{-1} \\
&= \frac{p-2}{n-k_J-p+3}\cdot\frac{n-k_F-p-1}{p}\left\{\frac{d_{J,i}^{-1}(n-k_F-p-1)}{p}\, x_{J,i}^\top S_F^{-1} x_{J,i}\right\}^{-1} \\
&\overset{d}{=} \frac{p-2}{n-k_J-p+3}\cdot\frac{n-k_F-p-1}{p}\cdot\frac{\chi^2_{(n-k_F-p-1)}}{n-k_F-p-1}\cdot\frac{d_{J,i}^{-1}\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i}+p}{\chi^2_p(d_{J,i}^{-1}\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i})}\cdot\frac{p}{d_{J,i}^{-1}\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i}+p} \\
&= O_p\!\left(\frac{p}{d_{J,i}^{-1}\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i}+p}\right) = O_p\!\left(\frac{p}{n\,\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i}+p}\right),
\end{aligned}
\]
because
\[ \left\{\frac{d_{J,i}^{-1}(n-k_F-p-1)}{p}\, x_{J,i}^\top S_F^{-1} x_{J,i}\right\}^{-1} \sim F''\big(n-k_F-p-1,\ p,\ 0,\ d_{J,i}^{-1}\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i}\big), \]
where χ²_k(δ) denotes the noncentral chi-squared distribution with k degrees of freedom and noncentrality parameter δ (we simply write χ²_k for χ²_k(0) and call it the chi-squared distribution), and F''(n_1, n_2, γ_1, γ_2) denotes the doubly noncentral F distribution with degrees of freedom (n_1, n_2) and noncentrality parameters (γ_1, γ_2). From assumption (iv) and the fact that η_{J,i}^⊤ Σ^{-1} η_{J,i} is of the same order as \(\sum_{i=1}^{k_J}\sum_{j=1}^{k_J} \Theta_{J,i\cdot}^\top \Sigma^{-1} \Theta_{J,j\cdot}\),
\[ \lim_{p/n \to c} \frac{p-2}{n}\operatorname{tr}(\Lambda_J C_J^{-1}) = \lim_{p/n \to c} \frac{p-2}{n}\sum_{i=1}^{k_J} \lambda_{J,i} c_{J,i}^{-1} = 0 \]
in probability.
3.2 Modified AICc
In this subsection, we propose a model selection criterion that is a modification of AICc under the Kullback-Leibler risk. AICc is an estimator of the Kullback-Leibler risk based on the maximum likelihood estimator and is given by
\[ \mathrm{AICc}(J) = n \log\left|\frac{1}{n}S_J\right| + np\log 2\pi + \frac{np(n+k_J)}{n-k_J-p-1}, \]
where AICc(J) denotes AICc under a model J. As with MCp, we consider the generalized ridge estimator and construct an unbiased estimator of the corresponding Kullback-Leibler risk. By the proof of Theorem 3, R_{KL}(B_{J_*}, Σ, f̂(·|B̂_{J,B}, Σ̂_J)) is given by
\[ 2R_{KL}\big(B_{J_*}, \Sigma, \hat{f}(\cdot \mid \hat{B}_{J,B}, \hat{\Sigma}_J)\big) = 2R_{KL}\big(B_{J_*}, \Sigma, \hat{f}(\cdot \mid \hat{B}_J, \hat{\Sigma}_J)\big) - \frac{n(p-2)}{n-k_J-p+1}\,\mathrm{E}\big[\operatorname{tr}(\Lambda_J C_J^{-1})\big]. \]
In contrast to AICc, which is the unbiased estimator of R_{KL}(B_{J_*}, Σ, f̂(·|B̂_J, Σ̂_J)), we denote by ZKLIC an unbiased estimator of R_{KL}(B_{J_*}, Σ, f̂(·|B̂_{J,B}, Σ̂_J)) under a model J, which is given by
\[
\begin{aligned}
\mathrm{ZKLIC}(J) &= \mathrm{AICc}(J) - \frac{n(p-2)}{n-k_J-p+1}\operatorname{tr}(\Lambda_J C_J^{-1}) \\
&= n \log\left|\frac{1}{n}S_J\right| + np\log 2\pi + \frac{np(n+k_J)}{n-k_J-p-1} - \frac{n(p-2)}{n-k_J-p+1}\operatorname{tr}(\Lambda_J C_J^{-1}).
\end{aligned}
\]
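A minimal computational sketch of ZKLIC (ours; the helper names and the toy value of tr(Λ_J C_J^{-1}) are hypothetical — in practice it would be computed from Λ_J and C_J as in the earlier sketch):

```python
import numpy as np

def aicc(SJ, n, p, kJ):
    """AICc(J) = n log|S_J / n| + np log(2 pi) + np(n + k_J)/(n - k_J - p - 1)."""
    sign, logdet = np.linalg.slogdet(SJ / n)
    return n * logdet + n * p * np.log(2 * np.pi) + n * p * (n + kJ) / (n - kJ - p - 1)

def zklic(SJ, n, p, kJ, tr_lam_cinv):
    """ZKLIC(J) = AICc(J) - n(p - 2)/(n - k_J - p + 1) * tr(Lambda_J C_J^{-1})."""
    return aicc(SJ, n, p, kJ) - n * (p - 2) / (n - kJ - p + 1) * tr_lam_cinv

# Toy usage with a synthetic residual matrix S_J (hypothetical numbers only).
rng = np.random.default_rng(5)
n, p, kJ = 60, 5, 3
E = rng.normal(size=(n - kJ, p))
SJ = E.T @ E
print(zklic(SJ, n, p, kJ, tr_lam_cinv=0.8))
```

The selected model is then the candidate J that minimizes ZKLIC(J) over 𝒥.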
The model selection criterion has the following properties.
Theorem 6. ZKLIC is a uniformly minimum variance unbiased estimator of the Kullback-Leibler risk when J_* ⊂ J.

Theorem 7. Assume that
(i) J_* ∈ 𝒥,
(ii) p → ∞, n → ∞, p/n → c ∈ (0, 1),
(iii) if J_* ⊄ J, then Ω_J = npΞ_J = O_p(np), lim_{p/n→c} Ξ_J = Ξ_J^* and Ξ_J^* is positive definite,
(iv) ∀ J ∈ 𝒥^c, ∃ i, j ∈ J_* such that lim_{p/n→c} Θ_{J,i\cdot}^⊤ Σ^{-1} Θ_{J,j\cdot} = ∞.
Then
\[ \lim_{p/n\to c} P\Big(\operatorname*{arg\,min}_{J\in\mathcal{J}} \mathrm{ZKLIC}(J) = J_*\Big) = 1. \]
Assumption (iv) is made for the same reason as discussed in Theorem 5. We again observe that ZKLIC is consistent, as is ZMCp.

Proof. From the proof of Theorem 2 of Fujikoshi et al. (2012),
\[ \lim_{p/n\to c} \frac{1}{n\log p}\{\mathrm{AICc}(J) - \mathrm{AICc}(J_*)\} = \gamma_J > 0, \quad J_* \not\subset J, \]
\[ \lim_{p/n\to c} \frac{1}{p}\{\mathrm{AICc}(J) - \mathrm{AICc}(J_*)\} = r_J\left\{\frac{1}{c}\log(1-c) + 2\right\} > 0, \quad J_* \subset J,\ J \neq J_*. \]
Therefore, it is sufficient to show that
\[ \lim_{p/n\to c} \frac{1}{n\log p}\cdot\frac{n(p-2)}{n-k_J-p+1}\operatorname{tr}(\Lambda_J C_J^{-1}) = 0, \quad J_* \not\subset J, \]
\[ \lim_{p/n\to c} \frac{1}{p}\cdot\frac{n(p-2)}{n-k_J-p+1}\operatorname{tr}(\Lambda_J C_J^{-1}) = 0, \quad J_* \subset J,\ J \neq J_* \tag{5} \]
in probability, because
\[ \mathrm{ZKLIC}(J) = \mathrm{AICc}(J) - \frac{n(p-2)}{n-k_J-p+1}\operatorname{tr}(\Lambda_J C_J^{-1}). \]
In the case of J_* ⊂ J, J ≠ J_*, (5) can be shown easily from assumption (iv) and the proof of Theorem 5. Let J_* ⊄ J. Then
\[ \left\{\frac{d_{J,i}^{-1}(n-k_J-p-1)}{p}\, x_{J,i}^\top S_J^{-1} x_{J,i}\right\}^{-1} \sim F''\big(n-k_J-p-1,\ p,\ 0,\ d_{J,i}^{-1}\eta_{J,i}^\top\Sigma^{-1}\eta_{J,i}\big) \]
because
\[ d_{J,i}^{-1/2}\Sigma^{-1/2} x_{J,i} \sim N_p\big(d_{J,i}^{-1/2}\Sigma^{-1/2}\eta_{J,i},\, I\big), \qquad \Sigma^{-1/2} S_J \Sigma^{-1/2} \sim W_p(n-k_J, I, \Omega_J). \]
Therefore, from assumption (iv),
\[ \lim_{p/n\to c} \frac{1}{n\log p}\cdot\frac{n(p-2)}{n-k_J-p+1}\operatorname{tr}(\Lambda_J C_J^{-1}) = 0 \]
in probability.
4 Numerical study
In this section, we numerically examine the validity of our propositions. The risk of the selected model and the probability of selecting the true model by MCp, ZMCp, AIC, AICc, and ZKLIC were evaluated by Monte Carlo simulations with 1,000 iterations. The ten candidate models J_α = {1, ..., α} (α = 1, ..., 10) were evaluated. In the experiment evaluating the risks of the selected models, we employed n = 100, 200 and p/n = 0.04, 0.06, ..., 0.8. In the experiment evaluating the probability of selecting the true model, we employed n = 100, 200, 400, 600 and p/n = 0.04, 0.06, ..., 0.8. The true model was determined by B_{J_*} = (1, −2, 3, −4, 5)^⊤ 1_p^⊤ with J_* = {1, 2, 3, 4, 5}, and the (a, b)-th element of Σ was defined as (0.8)^{|a−b|} (a = 1, ..., p; b = 1, ..., p). Here, 1_p is the p-dimensional vector of ones. Thus, J_1, J_2, J_3, and J_4 are underspecified models, and J_6, J_7, J_8, J_9, and J_10 are overspecified models. The matrix of explanatory variables A was generated in two different ways: in Case 1, A_{a,b} = u_a^{b−1} (a = 1, ..., n; b = 1, ..., 10), where u_1, ..., u_n ∼ U(−1, 1) i.i.d., and in Case 2, A_{a,b} ∼ N(0, 1) i.i.d.
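For reference, the data-generating process of this experiment can be sketched as follows (our reconstruction of the setup described above, not the authors' simulation code; it omits the 1,000-iteration loop and the evaluation of the criteria).

```python
import numpy as np

def make_design(n, case, k=10, rng=None):
    """Explanatory variables: Case 1, A[a, b] = u_a^b with u_a ~ U(-1, 1); Case 2, i.i.d. N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    if case == 1:
        u = rng.uniform(-1.0, 1.0, size=n)
        return np.vander(u, N=k, increasing=True)     # columns u^0, u^1, ..., u^(k-1)
    return rng.normal(size=(n, k))

def simulate(n, p, case, rng=None):
    """One data set from the true model J* = {1,...,5}, B_{J*} = (1,-2,3,-4,5)^T 1_p^T, Sigma_{ab} = 0.8^{|a-b|}."""
    rng = rng or np.random.default_rng(0)
    A = make_design(n, case, rng=rng)
    B_true = np.outer([1.0, -2.0, 3.0, -4.0, 5.0], np.ones(p))    # 5 x p coefficient matrix
    Sigma = 0.8 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    E = rng.normal(size=(n, p)) @ np.linalg.cholesky(Sigma).T
    return A, A[:, :5] @ B_true + E

A, Y = simulate(n=100, p=20, case=1)
candidates = [list(range(alpha + 1)) for alpha in range(10)]       # J_alpha = {1, ..., alpha}
print(A.shape, Y.shape, len(candidates))
```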
Figure 1: Comparison between MCp and ZMCp under the squared risk in case 1.
Figure 2: Comparison between MCp and ZMCp under the squared risk in case 2.

Figures 1 and 2 show the squared risks of the models selected by MCp and ZMCp, while Figures 3 and 4 show the Kullback-Leibler risks of the models selected by AIC, AICc, and ZKLIC; in the latter, the constant part depending on the true distribution was subtracted. In Case 1, ZMCp largely improves on MCp. In Case 2, on the other hand, the difference between the squared risks of the selected models is not as large as in Case 1. The reason is that the explanatory variables have larger correlation in Case 1 than in Case 2, and the generalized ridge estimator is more robust against such correlation.
Figure 3: Comparison between AIC, AICc and ZKLIC under the Kullback-Leibler risk in case 1.
Figure 4: Comparison between AIC, AICc and ZKLIC under the Kullback-Leibler risk in case 2.

Figures 5 and 6 show the probability of selecting the true model by each model selection criterion. In Figure 6, the probability of selecting the true model by our model selection criteria is large when the sample size is large. In Figure 5, however, this probability is small; there the matrix of regression coefficients has large correlation. Furthermore, the probabilities of selecting the true model by our model selection criteria are smaller than those of the existing criteria in each case. The reason for this is that the variance of our model selection criteria is larger than that of the existing ones.
Figure 5: Comparison of the probability of selecting the true model by MCp, ZMCp, AIC, AICc, and ZKLIC in case 1.
Figure 6: Comparison of the probability of selecting the true model by MCp, ZMCp, AIC, AICc, and ZKLIC in case 2.
Although the risks attained by our model selection criteria are smaller than those based on the maximum likelihood estimator, the probability of selecting the true model by our criteria is lower than that of the existing criteria. This is because predictive efficiency and consistency are not compatible (Yang (2005)), and our criteria are specialized in making the risk smaller.
5 Conclusion
In this paper, we proposed model selection criteria based on the generalized ridge estimator, which improves on the maximum likelihood estimator under the squared risk and the Kullback-Leibler risk, in multivariate linear regression. Moreover, we showed that our model selection criteria have the same properties as MCp, AIC, and AICc in a high-dimensional asymptotic framework. We demonstrated through numerical experiments that our model selection criteria perform better in terms of the risks than the criteria based on the maximum likelihood estimator, especially when the matrix of regression coefficients has strong correlation.
Acknowledgement
This work was partially supported by MEXT KAKENHI (25730013, 25120012, 26280009, 15H01678 and 15H05707), JST-PRESTO and JST-CREST.
References

Akaike, H. (1971). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, Tsahkadsor, Armenian SSR, pages 267–281.

Basser, P. J. and Pierpaoli, C. (1998). A simplified method to measure the diffusion tensor from seven MR images. Magnetic Resonance in Medicine, 39(6):928–934.

Bedrick, E. J. and Tsai, C.-L. (1994). Model selection for multivariate regression in small samples. Biometrics, 50:226–231.

Davies, S. L., Neath, A. A., and Cavanaugh, J. E. (2006). Estimation optimality of corrected AIC and modified Cp in linear regression. International Statistical Review, 74(2):161–168.

Fujikoshi, Y., Sakurai, T., and Yanagihara, H. (2012). High-dimensional AIC and consistency properties of several criteria in multivariate linear regression. Technical report.

Fujikoshi, Y., Sakurai, T., and Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and Cp-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 123:184–200.

Fujikoshi, Y. and Satoh, K. (1997). Modified AIC and Cp in multivariate linear regression. Biometrika, 84(3):707–716.

Gharagheizi, F. (2008). QSPR studies for solubility parameter by means of genetic algorithm-based multivariate linear regression and generalized regression neural network. QSAR & Combinatorial Science, 27(2):165–170.

Konno, Y. (1991). On estimation of a matrix of normal means with unknown covariance matrix. Journal of Multivariate Analysis, 36(1):44–55.

Kubokawa, T. and Srivastava, M. S. (2001). Robust improvement in estimation of a mean matrix in an elliptically contoured distribution. Journal of Multivariate Analysis, 76(1):138–152.

Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15(4):661–675.

Maruyama, Y. and Strawderman, W. E. (2005). A new class of generalized Bayes minimax ridge regression estimators. The Annals of Statistics, 33(4):1753–1770.

Nagai, I., Yanagihara, H., and Satoh, K. (2012). Optimization of ridge parameters in multivariate generalized ridge regression by plug-in methods. Hiroshima Mathematical Journal, 42(3):301–324.

Tsukuma, H. (2009). Generalized Bayes minimax estimation of the normal mean matrix with unknown covariance matrix. Journal of Multivariate Analysis, 100(10):2296–2304.

Yanagihara, H. and Satoh, K. (2010). An unbiased Cp criterion for multivariate ridge regression. Journal of Multivariate Analysis, 101(5):1226–1238.

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937–950.