Submitted to the Annals of Statistics arXiv: math.PR/0000000
GENERAL ESTIMATING EQUATIONS: MODEL SELECTION AND ESTIMATION WITH DIVERGING NUMBER OF PARAMETERS

By Mehmet Caner† and Hao Helen Zhang∗,†

North Carolina State University†

This paper develops an adaptive elastic net estimator for general estimating equations. We allow the number of parameters to diverge to infinity. The estimator can also handle collinearity among a large number of variables. The method has the oracle property: the nonzero parameters are estimated with their standard limit distribution, and the redundant parameters are simultaneously dropped from the equations. This paper generalizes the least-squares-based adaptive elastic net estimator of Zou and Zhang (2009) to nonlinear equation systems with endogenous variables. The extension is not trivial and involves a new proof technique because the estimator lacks a closed-form solution.
∗Zhang's research is supported by National Science Foundation Grant DMS-0645293 and National Institutes of Health Grant NIH/NCI R01 CA-085848.
AMS 2000 subject classifications: Primary 62J05; secondary 62J07.
Keywords and phrases: Estimating equations, Lasso penalty.
1 imsart-aos ver. 2009/08/13 file: aenet2.tex date: August 17, 2009
1. Introduction. General estimating equations are studied extensively in statistics. They provide a convenient way to describe parameters and statistics. We can estimate the parameters by the two-step efficient Generalized Method of Moments (GMM) of Hansen (1982), or by the empirical likelihood estimator of Owen (2001) and Qin and Lawless (1994). Other estimators in use are the exponential tilting estimator of Kitamura and Stutzer (1997) and the exponentially tilted empirical likelihood of Schennach (2007). They all share the first-order efficiency of GMM. GMM is also used in the economics, finance, accounting, and strategic planning literatures.

In this paper we are concerned with model selection in GMM when the number of parameters diverges. Such situations can arise in labor economics and international finance (see Alfaro, Kalemli-Ozcan and Volosovych (2008)). In linear models where some of the regressors are correlated with the errors and there are many covariates, model selection tools are essential: they can improve the finite sample performance of the estimators.

Model selection techniques are very useful and now widely used in statistics. For example, Knight and Fu (2000) derive the asymptotics of the lasso, and Fan and Li (2001) propose the SCAD estimator. In econometrics, Knight (2008) and Caner (2009) offer Bridge-least squares and Bridge-GMM estimators, respectively. But these are all in finite dimensions. Recently, model selection with a large number of parameters has been analyzed in least squares by Huang, Horowitz, and Ma (2008) and Zou and Zhang (2009); the first article analyzes the Bridge estimator, and the second is concerned with the adaptive elastic net estimator. The adaptive elastic net estimator has the oracle property when the number of parameters diverges with the sample size. Furthermore, this method can handle the collinearity arising from a large number of regressors when the system is linear with endogenous regressors. When some of the parameters are redundant (i.e., when the true model has a sparse representation), this estimator can estimate the zero parameters as zero. When we set parameters to zero, we benefit from a data-dependent method, unlike Bridge estimation.

In this paper we extend the least-squares-based adaptive elastic net of Zou and Zhang (2009) to GMM. GMM has no closed-form solution, unlike the least squares estimator. This results in a new proof technique compared with the least squares case of Zou and Zhang (2009). We need a different and more extensive consistency proof. The nonlinear nature of the functions introduces additional difficulties, and this restricts the growth of the number of parameters compared to the simple linear case. The estimator also has the oracle property: the nonzero coefficients are estimated with their standard limit, converging to a normal distribution; furthermore, the zero parameters are estimated
as zero. Earlier work on diverging numbers of parameters includes Huber (1988) and Portnoy (1984). In recent years, Fan, Peng, and Huang (2005) study a semiparametric model with a growing number of nuisance parameters, and Lam and Fan (2007) analyze profile likelihood ratio inference with a growing number of parameters. As far as we know, this is the first paper to estimate and select the model in GMM with a diverging number of parameters.

Section 2 presents the model and the estimator. Then in Section 3 we derive the asymptotic results for the estimator. Section 4 conducts simulations. The Appendix includes all the proofs.

2. Model. Let β be a p-dimensional parameter vector, where β ∈ B_p, a compact subset of R^p. The true value of β is β0. We allow p to grow with the sample size, so when n → ∞ we have p → ∞, but p/n → 0 as n → ∞. We do not attach a subscript n to the parameter space, so as not to burden the notation. When n converges to infinity the parameter space is denoted B_∞; for example, this may be the infinite cube [0, 1]^∞ in certain examples. The population orthogonality conditions are E[g(X_i, β0)] = 0, where the data are {X_i : i = 1, 2, ..., n}, g(·) is a known function, and the number of orthogonality restrictions is q, with q ≥ p. We also allow q to grow with the sample size, but q/n → 0 as n → ∞. From now on we denote g(X_i, β) by g_i(β). We also assume that the g_i(β) are independent, and we do not write g_{ni}(β), just to simplify the notation.

2.1. The Estimators. We first define the estimators that we use. The estimators that we are interested in answer the following questions.
If we have a large number of control variables, some of them may be irrelevant in the structural equation of a simultaneous equation system (we may also have a large number of endogenous variables and control variables), or we may have a large number of parameters in a nonlinear system with endogenous and control variables: can we select the relevant ones and estimate the selected system simultaneously? If we have a large number of variables, there may be some correlation among them: can this method handle that? Is it also possible for the estimator to achieve the oracle property? The answers to all three questions are affirmative. First of all, the adaptive elastic net estimator simultaneously selects and estimates the model when there is a large number of parameters/regressors. It can also take into account the possible correlation among the variables. By achieving the oracle property, the
nonzero parameters are estimated with their standard limits, and the zero ones are estimated as zero. This method is computationally easy and uses data-dependent methods to set small coefficient estimates to zero. A subcase of the adaptive elastic net estimator is the adaptive lasso estimator, which can handle the first and third questions but does not handle correlation among a large number of variables.

First we introduce the notation: \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\) and \(\|\beta\|_2^2 = \sum_{j=1}^{p} |\beta_j|^2\). We start by introducing the adaptive elastic net estimator, given the positive and diverging tuning parameters λ1*, λ2 (how to choose them in finite samples, and their asymptotic properties, will be discussed below in the Assumptions and then in the Simulation section):
\[
\hat\beta_{aenet} = (1+\lambda_2/n)\Big\{\operatorname*{argmin}_{\beta}\Big[\Big(\sum_{i=1}^{n} g_i(\beta)\Big)' W_n \Big(\sum_{i=1}^{n} g_i(\beta)\Big) + \lambda_2 \|\beta\|_2^2 + \lambda_1^* \sum_{j=1}^{p} \hat w_j |\beta_j|\Big]\Big\},
\tag{1}
\]
where \(\hat w_j = 1/|\hat\beta_{j,enet}|^{\gamma}\), \(\hat\beta_{enet}\) is a consistent estimator explained immediately below, γ is a positive constant, and with p = n^α, 0 < α < 1/3, 0
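For a concrete sense of how an estimator of the form (1) behaves, here is a small numerical sketch in Python (illustrative, not from the paper): a toy exogenous linear moment model with g_i(β) = x_i(y_i − x_i′β), a ridge pilot standing in for the elastic net pilot, and coordinate descent with soft-thresholding for the weighted ℓ1 penalty. All tuning constants are illustrative choices, not the paper's.

```python
import numpy as np

# Toy model: y = x'beta0 + e with beta0 = (1.5, 0, 0); moments g_i(b) = x_i (y_i - x_i'b).
rng = np.random.default_rng(0)
n, p = 400, 3
X = rng.standard_normal((n, p))
beta0 = np.array([1.5, 0.0, 0.0])
y = X @ beta0 + rng.standard_normal(n)

XtX, Xty = X.T @ X, X.T @ y
lam2, lam1, gamma = 1.0, 40.0, 1.0          # illustrative tuning values

# Pilot estimator (a ridge pilot used here in place of the elastic net pilot) and weights.
beta_pilot = np.linalg.solve(XtX + lam2 * np.eye(p), Xty)
w = 1.0 / np.abs(beta_pilot) ** gamma

# Smooth part of the objective (with W_n = I/n): g_n(b)' W_n g_n(b) + lam2 ||b||^2
#   = b'Mb - 2 c'b + const, with M and c below.
M = XtX @ XtX / n + lam2 * np.eye(p)
c = XtX @ Xty / n

# Coordinate descent with soft-thresholding handles the weighted l1 penalty.
b = beta_pilot.copy()
for _ in range(200):
    for j in range(p):
        r = c[j] - M[j] @ b + M[j, j] * b[j]     # partial residual for coordinate j
        thresh = lam1 * w[j] / 2.0
        b[j] = np.sign(r) * max(abs(r) - thresh, 0.0) / M[j, j]

beta_aenet = (1.0 + lam2 / n) * b                # the (1 + lam2/n) rescaling in (1)
print(beta_aenet)
```

With these illustrative tunings the two redundant coordinates are thresholded exactly to zero while the active coordinate stays near its true value 1.5, which is the selection-plus-estimation behavior the paper establishes asymptotically.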
0, α < 1/3. Q.E.D.
Proof of Theorem 2. In this proof we start by analyzing the GMM-Ridge estimator, which is defined as
\[
\hat\beta_R = \operatorname*{argmin}_\beta \Big[\sum_{i=1}^n g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^n g_i(\beta)\Big] + \lambda_2 \|\beta\|_2^2.
\]
Note that this estimator is similar to the elastic net estimator: if we set λ1 = 0 in the elastic net estimator, we obtain the GMM-Ridge estimator. So since the elastic net estimator is consistent, GMM-Ridge is consistent as well. Define also the q × p matrix \(\hat G_n(\hat\beta_R) = \sum_{i=1}^n \partial g_i(\hat\beta_R)/\partial \beta'\). A mean value theorem around β0 applied to the first order conditions provides, with \(g_n(\beta_0) = \sum_{i=1}^n g_i(\beta_0)\) and \(\bar\beta \in (\beta_0, \hat\beta_R)\),
\[
\hat\beta_R = -[\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1} [\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta)\beta_0].
\tag{19}
\]
Also using the mean value theorem with the first order conditions, adding and subtracting λ2β0 yields
\[
\hat\beta_R - \beta_0 = -[\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1} [\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) + \lambda_2 \beta_0].
\tag{20}
\]
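In the linear moment case the mean value expansion (20) is exact, because Ĝ_n does not depend on β. A quick numerical check (illustrative, not from the paper): with g_i(β) = x_i(y_i − x_i′β), the GMM-Ridge estimator has the closed form below, and the identity (20) holds to machine precision.

```python
import numpy as np

# Linear moments: g_n(b) = X'y - X'X b, so Ghat = dg_n(b)/db' = -X'X exactly.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -0.5, 0.0])
y = X @ beta0 + rng.standard_normal(n)

XtX = X.T @ X
W = np.eye(p) / n                      # an illustrative weight matrix
lam2 = 5.0
A = XtX @ W @ XtX                      # Ghat' W_n Ghat

# Closed-form GMM-Ridge: argmin (X'y - X'X b)' W (X'y - X'X b) + lam2 ||b||^2.
beta_R = np.linalg.solve(A + lam2 * np.eye(p), XtX @ W @ (X.T @ y))

# Right hand side of (20): -(A + lam2 I)^{-1} [Ghat' W g_n(beta0) + lam2 beta0].
gn0 = X.T @ y - XtX @ beta0            # g_n(beta0) = X'e
rhs = -np.linalg.solve(A + lam2 * np.eye(p), (-XtX) @ W @ gn0 + lam2 * beta0)

print(np.max(np.abs((beta_R - beta0) - rhs)))   # a machine-precision-size number
```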
We need the following expressions, obtained by using (19):
\[
[\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta)\beta_0]' \hat\beta_R = -[\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta)\beta_0]' [\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1} [\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta)\beta_0],
\tag{21}
\]
\[
\hat\beta_R' [\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p] \hat\beta_R = [\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta)\beta_0]' [\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1} [\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta)\beta_0].
\tag{22}
\]
Next the aim is to rewrite the GMM-Ridge objective function via a mean value expansion:
\[
\Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big]' W_n \Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big] + \lambda_2 \|\hat\beta_R\|_2^2 = g_n(\beta_0)' W_n g_n(\beta_0) + g_n(\beta_0)' W_n \hat G_n(\bar\beta)(\hat\beta_R - \beta_0) + (\hat\beta_R - \beta_0)' \hat G_n(\bar\beta)' W_n g_n(\beta_0) + (\hat\beta_R - \beta_0)' \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)(\hat\beta_R - \beta_0) + \lambda_2 \|\hat\beta_R\|_2^2.
\tag{23}
\]
Furthermore we can rewrite the right hand side of (23) as
\[
g_n(\beta_0)' W_n g_n(\beta_0) + \hat\beta_R' [\hat G_n(\bar\beta)' W_n g_n(\beta_0) - \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0] + [\hat G_n(\bar\beta)' W_n g_n(\beta_0) - \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0]' \hat\beta_R + \hat\beta_R' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p] \hat\beta_R - g_n(\beta_0)' W_n \hat G_n(\bar\beta)\beta_0 - \beta_0' \hat G_n(\bar\beta)' W_n g_n(\beta_0) + \beta_0' \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0
\]
\[
= g_n(\beta_0)' W_n g_n(\beta_0) - \hat\beta_R' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p] \hat\beta_R - g_n(\beta_0)' W_n \hat G_n(\bar\beta)\beta_0 - \beta_0' \hat G_n(\bar\beta)' W_n g_n(\beta_0) + \beta_0' \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0.
\tag{24}
\]
The equality is obtained through (21) and (22); note that the right hand side expression in (21) is just the negative of the right hand side of the expression in (22). As in (24), for the estimator \(\hat\beta_w\) we have the following, with the mean value theorem applied, where \(\bar\beta_w \in (\beta_0, \hat\beta_w)\):
\[
\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big]' W_n \Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big] + \lambda_2 \|\hat\beta_w\|_2^2 = g_n(\beta_0)' W_n g_n(\beta_0) + \hat\beta_w' [\hat G_n(\bar\beta_w)' W_n g_n(\beta_0) - \hat G_n(\bar\beta_w)' W_n \hat G_n(\bar\beta_w)\beta_0] + [\hat G_n(\bar\beta_w)' W_n g_n(\beta_0) - \hat G_n(\bar\beta_w)' W_n \hat G_n(\bar\beta_w)\beta_0]' \hat\beta_w + \hat\beta_w' [\hat G_n(\bar\beta_w)' W_n \hat G_n(\bar\beta_w) + \lambda_2 I_p] \hat\beta_w - g_n(\beta_0)' W_n \hat G_n(\bar\beta_w)\beta_0 - \beta_0' \hat G_n(\bar\beta_w)' W_n g_n(\beta_0) + \beta_0' \hat G_n(\bar\beta_w)' W_n \hat G_n(\bar\beta_w)\beta_0.
\tag{25}
\]
Then see that, by taking β̄ to be the same point in the mean value theorem for both the \(\hat\beta_w\) and \(\hat\beta_R\) analyses without losing any generality (i.e. \(\bar\beta = \bar\beta_w\)),
\[
\hat\beta_w' [\hat G_n(\bar\beta)' W_n g_n(\beta_0) - \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0] = \hat\beta_w' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p][\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1} [\hat G_n(\bar\beta)' W_n g_n(\beta_0) - \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0] = -\hat\beta_w' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]\hat\beta_R + o_p(1).
\tag{26}
\]
The asymptotically negligible remainder in the last equality is due to Assumption 2i and Theorem 1i with λ1 = 0. Next substitute (26) into (25) to have
\[
\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big]' W_n \Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big] + \lambda_2 \|\hat\beta_w\|_2^2 = g_n(\beta_0)' W_n g_n(\beta_0) - \hat\beta_w' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]\hat\beta_R - \hat\beta_R' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]\hat\beta_w + \hat\beta_w' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]\hat\beta_w - g_n(\beta_0)' W_n \hat G_n(\bar\beta)\beta_0 - \beta_0' \hat G_n(\bar\beta)' W_n g_n(\beta_0) + \beta_0' \hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)\beta_0 + o_p(1).
\tag{27}
\]
Now subtract (24) from (27) to have
\[
\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big]' W_n \Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big] + \lambda_2 \|\hat\beta_w\|_2^2 - \Big(\Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big]' W_n \Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big] + \lambda_2 \|\hat\beta_R\|_2^2\Big) = (\hat\beta_w - \hat\beta_R)' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p](\hat\beta_w - \hat\beta_R) + o_p(1).
\tag{28}
\]
After this important result, see that by the definitions of \(\hat\beta_w\), \(\hat\beta_R\),
\[
\lambda_1 \sum_{j=1}^p \hat w_j (|\hat\beta_{j,R}| - |\hat\beta_{j,w}|) \ge \Big(\sum_{i=1}^n g_i(\hat\beta_w)\Big)' W_n \Big(\sum_{i=1}^n g_i(\hat\beta_w)\Big) + \lambda_2 \|\hat\beta_w\|_2^2 - \Big[\Big(\sum_{i=1}^n g_i(\hat\beta_R)\Big)' W_n \Big(\sum_{i=1}^n g_i(\hat\beta_R)\Big) + \lambda_2 \|\hat\beta_R\|_2^2\Big].
\tag{29}
\]
Then also see that
\[
\sum_{j=1}^p \hat w_j (|\hat\beta_{j,R}| - |\hat\beta_{j,w}|) \le \Big(\sum_{j=1}^p \hat w_j^2\Big)^{1/2} \|\hat\beta_R - \hat\beta_w\|_2.
\tag{30}
\]
Next benefit from (29), with (28) and (30), to have
\[
(\hat\beta_w - \hat\beta_R)' [\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p](\hat\beta_w - \hat\beta_R) + o_p(1) \le \lambda_1 \Big(\sum_{j=1}^p \hat w_j^2\Big)^{1/2} \|\hat\beta_R - \hat\beta_w\|_2.
\tag{31}
\]
See that \(\mathrm{Eigmin}(\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p) = \mathrm{Eigmin}(\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)) + \lambda_2\). Use this in (31) to have
\[
\{\mathrm{Eigmin}(\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)) + \lambda_2\} \|\hat\beta_w - \hat\beta_R\|_2^2 \le \lambda_1 \Big(\sum_{j=1}^p \hat w_j^2\Big)^{1/2} \|\hat\beta_R - \hat\beta_w\|_2 + o_p(1).
\]
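The eigenvalue shift identity used here, Eigmin(A + λ2 I) = Eigmin(A) + λ2 for a symmetric matrix A, can be checked directly (an illustrative sketch with random stand-in matrices, not the paper's quantities):

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((8, 5))        # stand-in for Ghat_n(beta_bar), q=8, p=5
W = np.eye(8)                          # stand-in weight matrix
A = G.T @ W @ G                        # symmetric positive semi-definite, p x p
lam2 = 3.0

eig_shifted = np.linalg.eigvalsh(A + lam2 * np.eye(5)).min()
eig_plain = np.linalg.eigvalsh(A).min()
print(eig_shifted, eig_plain + lam2)   # the two numbers coincide
```

Adding λ2 I shifts every eigenvalue by exactly λ2, so the minimum shifts by λ2 as well; this is what makes the ridge term a lower bound on the curvature of the objective.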
This results in
\[
\|\hat\beta_w - \hat\beta_R\|_2 \le \frac{\lambda_1 \big(\sum_{j=1}^p \hat w_j^2\big)^{1/2} + o_p(1)}{\mathrm{Eigmin}(\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta)) + \lambda_2}.
\tag{32}
\]
We also want to modify the last inequality. By the consistency of \(\hat\beta_w\), \(\hat\beta_R\), we have \(\bar\beta \xrightarrow{p} \beta_0\). Then, with the uniform law of large numbers on the partial derivative, we have by Assumptions 2ii and 3,
\[
\Big\| \frac{\hat G_n(\bar\beta)'}{n} W_n \frac{\hat G_n(\bar\beta)}{n} - G(\beta_0)' W G(\beta_0) \Big\| \xrightarrow{p} 0.
\]
The last equation is also true with \(\hat\beta_w\) or \(\hat\beta_R\) replacing \(\bar\beta\). Then
\[
\hat G_n(\bar\beta)' W_n \hat G_n(\bar\beta) = n^2 [G(\beta_0)' W G(\beta_0)] + o_p(n^2).
\tag{33}
\]
Modify (32) in the following way given the last equality, setting W = Ω⁻¹ (since this is the efficient limit weight, as shown by Hansen (1982)):
\[
\|\hat\beta_w - \hat\beta_R\|_2 \le \frac{\lambda_1 \big(\sum_{j=1}^p \hat w_j^2\big)^{1/2} + o_p(1)}{n^2 \mathrm{Eigmin}(G(\beta_0)' \Omega^{-1} G(\beta_0)) + \lambda_2 + o_p(n^2)}.
\tag{34}
\]
Now we consider the second part of the proof of this theorem. We use the GMM-Ridge formula. Note that from (20),
\[
\hat\beta_R - \beta_0 = -\lambda_2 [\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1}\beta_0 - [\hat G_n(\hat\beta_R)' W_n \hat G_n(\bar\beta) + \lambda_2 I_p]^{-1}[\hat G_n(\hat\beta_R)' W_n g_n(\beta_0)].
\tag{35}
\]
We modify the equation above a little. In the same way we obtained (33),
\[
\hat G_n(\hat\beta_R)' W_n g_n(\beta_0) = n[G(\beta_0)' W g_n(\beta_0)] + o_p(n).
\tag{36}
\]
Second, see that by the \(g_i(\beta)\) being independent, \(\|E g_n(\beta_0) g_n(\beta_0)' - n\Omega\| \to 0\). So
\[
E[g_n(\beta_0)' \Omega^{-1} G(\beta_0) G(\beta_0)' \Omega^{-1} g_n(\beta_0)] = \mathrm{tr}\{G(\beta_0)' \Omega^{-1} E[g_n(\beta_0) g_n(\beta_0)'] \Omega^{-1} G(\beta_0)\} = n\,\mathrm{tr}\{G(\beta_0)' \Omega^{-1} G(\beta_0)\} \le n p\, \mathrm{Eigmax}(\Sigma),
\tag{37}
\]
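The trace bound in (37), tr(Σ) ≤ p · Eigmax(Σ) for the p × p matrix Σ = G(β0)′Ω⁻¹G(β0), is a generic fact about symmetric matrices: the trace is the sum of the p eigenvalues, each bounded by the largest. An illustrative check with randomly generated stand-ins (not the paper's quantities):

```python
import numpy as np

rng = np.random.default_rng(3)
q, p = 10, 4
G = rng.standard_normal((q, p))                 # stand-in for G(beta0)
Omega = np.eye(q) + 0.05 * np.ones((q, q))      # a positive definite covariance stand-in
Sigma = G.T @ np.linalg.solve(Omega, G)         # Sigma = G' Omega^{-1} G, p x p

trace = np.trace(Sigma)
bound = p * np.linalg.eigvalsh(Sigma).max()
print(trace, bound)                             # trace never exceeds the bound
```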
where we use \(\Sigma = G(\beta_0)' \Omega^{-1} G(\beta_0)\). Now we modify (35) using (36) and (33):
\[
\hat\beta_R - \beta_0 = -\lambda_2 [n^2 G(\beta_0)' \Omega^{-1} G(\beta_0) + \lambda_2 I_p + o_p(n^2)]^{-1}\beta_0 - [n^2 G(\beta_0)' \Omega^{-1} G(\beta_0) + \lambda_2 I_p + o_p(n^2)]^{-1}[n G(\beta_0)' \Omega^{-1} g_n(\beta_0)].
\tag{38}
\]
Then see that
\[
E(\|\hat\beta_R - \beta_0\|_2^2) \le 2\lambda_2^2 [n^2 \mathrm{Eigmin}(G(\beta_0)' \Omega^{-1} G(\beta_0)) + \lambda_2 + o(n^2)]^{-2} \|\beta_0\|_2^2 + 2[n^2 \mathrm{Eigmin}(G(\beta_0)' \Omega^{-1} G(\beta_0)) + \lambda_2 + o(n^2)]^{-2}\, n^2 E[g_n(\beta_0)' \Omega^{-1} G(\beta_0) G(\beta_0)' \Omega^{-1} g_n(\beta_0)] \le 2[n^2 \mathrm{Eigmin}(G(\beta_0)' \Omega^{-1} G(\beta_0)) + \lambda_2 + o(n^2)]^{-2} [\lambda_2^2 \|\beta_0\|_2^2 + n^3 p\, \mathrm{Eigmax}(\Sigma) + o(n^2)],
\tag{39}
\]
where the last inequality is by (37). Now use (34) and (39) to have
\[
E(\|\hat\beta_w - \beta_0\|_2^2) \le 2E(\|\hat\beta_R - \beta_0\|_2^2) + 2E(\|\hat\beta_w - \hat\beta_R\|_2^2) \le \frac{4\lambda_2^2 \|\beta_0\|_2^2 + 4n^3 p B + 2\lambda_1^2 E\big(\sum_{j=1}^p \hat w_j^2\big) + o(n^2)}{[n^2 b + \lambda_2 + o_p(n^2)]^2},
\tag{40}
\]
where \(b = \mathrm{Eigmin}(G(\beta_0)' \Omega^{-1} G(\beta_0))\) and \(B = \mathrm{Eigmax}(\Sigma)\). Q.E.D.

Proof of Theorem 3i. To prove the theorem we need to show the following (note that these are the Kuhn-Tucker conditions of (1)):
\[
P\Big[\forall j \in A^c,\ \Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| \le \lambda_1^* \hat w_j\Big] \to 1,
\]
where \(\hat G_{n,j}(\tilde\beta)\) denotes the j-th column of the partial derivative matrix \(\hat G_n(\tilde\beta) = \sum_{i=1}^n \partial g_i(\tilde\beta)/\partial \beta'\), corresponding to the zero parameters (i.e. at true values), evaluated at \(\tilde\beta\), and \(A^c = \{j : \beta_{j0} = 0,\ j = 1, 2, \dots, p\}\). Equivalently, we need to show
\[
P\Big[\exists j \in A^c,\ \Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* \hat w_j\Big] \to 0.
\]
Now set \(\eta = \min_{j \in A} |\beta_{j0}|\) and \(\hat\eta = \min_{j \in A} |\hat\beta_{j,enet}|\). So
\[
P\Big[\exists j \in A^c,\ \Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* \hat w_j\Big] \le \sum_{j \in A^c} P\Big[\Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* \hat w_j,\ \hat\eta > \eta/2\Big] + P[\hat\eta \le \eta/2].
\tag{41}
\]
Then, as on p.15 of Zou and Zhang (2009), we can show that
\[
P[\hat\eta \le \eta/2] \le \frac{E\|\hat\beta_{enet} - \beta_0\|_2^2}{\eta^2/4} \le \frac{16}{\eta^2}\, \frac{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2},
\tag{42}
\]
where the second inequality is due to Theorem 2. Then we can also have
\[
\sum_{j \in A^c} P\Big[\Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* \hat w_j,\ \hat\eta > \eta/2\Big] \le \sum_{j \in A^c} P\Big[\Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* \hat w_j,\ \hat\eta > \eta/2,\ |\hat\beta_{j,enet}| \le M\Big] + \sum_{j \in A^c} P[|\hat\beta_{j,enet}| > M]
\]
\[
\le \sum_{j \in A^c} P\Big[\Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* M^{-\gamma},\ \hat\eta > \eta/2\Big] + \sum_{j \in A^c} P[|\hat\beta_{j,enet}| > M],
\tag{43}
\]
where \(M = (\lambda_1^*/n^{3+\alpha})^{1/2\gamma}\). In (43) we consider the second term on the right hand side. Via inequality (6.8) of Zou and Zhang (2009), and Theorem 2 here,
\[
\sum_{j \in A^c} P[|\hat\beta_{j,enet}| > M] \le \frac{E\|\hat\beta_{enet} - \beta_0\|_2^2}{M^2} \le \frac{4}{M^2}\, \frac{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2}.
\tag{44}
\]
Next we consider the first term on the right hand side of (43):
\[
\sum_{j \in A^c} P\Big[\Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* M^{-\gamma},\ \hat\eta > \eta/2\Big] \le \frac{4M^{2\gamma}}{\lambda_1^{*2}}\, E\Big[\sum_{j \in A^c} \Big|\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big|^2 I_{\{\hat\eta \ge \eta/2\}}\Big].
\tag{45}
\]
So we try to simplify the term on the right hand side of (45). Now we evaluate
\[
\sum_{j \in A^c} \Big|\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big|^2 \le 2\sum_{j \in A^c} \Big|\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\beta_{A,0})\Big)\Big|^2 + 2\sum_{j \in A^c} \big|\hat G_{n,j}(\tilde\beta)' W_n \hat G_n(\bar\beta)(\tilde\beta - \beta_{A,0})\big|^2,
\tag{46}
\]
where \(\bar\beta \in (\beta_{A,0}, \tilde\beta)\) and
\[
g_i(\tilde\beta) = g_i(\beta_{A,0}) + \frac{\partial g_i(\bar\beta)}{\partial \beta'}(\tilde\beta - \beta_{A,0}).
\]
Analyze each term in (46). Note that \(\tilde\beta\) is consistent if we go through the same steps as in Theorem 1. Then, applying Assumption 2ii with Assumption 3 (uniform law of large numbers),
\[
2\sum_{j \in A^c} \Big|\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\beta_{A,0})\Big)\Big|^2 \le 2n^2 \Big\|G(\beta_{A,0})' W \sum_{i=1}^n g_i(\beta_{A,0})\Big\|_2^2 + o_p(n^2).
\tag{47}
\]
Then via (46),
\[
E\Big[\sum_{j \in A^c} \Big|\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\beta_{A,0})\Big)\Big|^2\Big] \le n^3 B + o(n^3),
\tag{48}
\]
where we use \(\lim_{n\to\infty} n^{-1} E\big[\big(\sum_{i=1}^n g_i(\beta_{A,0})\big)\big(\sum_{i=1}^n g_i(\beta_{A,0})\big)'\big] = \Omega_*\), and with \(W = \Omega_*^{-1}\),
\[
B \ge \mathrm{Eigmax}(\Sigma) \ge \mathrm{Eigmax}(G(\beta_{A,0})' \Omega_*^{-1} G(\beta_{A,0})).
\tag{49}
\]
In the same manner we have
\[
\sum_{j \in A^c} \big|\hat G_{n,j}(\tilde\beta)' W_n \hat G_n(\bar\beta)(\tilde\beta - \beta_{A,0})\big|^2 \le n^4 \big\|G(\beta_{A,0})' \Omega_*^{-1} G(\beta_{A,0})(\tilde\beta - \beta_{A,0})\big\|_2^2 + o_p(n^4).
\tag{50}
\]
Then by (49), and taking into account (50),
\[
\sum_{j \in A^c} \big|\hat G_{n,j}(\tilde\beta)' W_n \hat G_n(\bar\beta)(\tilde\beta - \beta_{A,0})\big|^2 \le n^4 B^2 \|\tilde\beta - \beta_{A,0}\|_2^2 + o_p(n^4).
\tag{51}
\]
Now substitute (48)-(51) into
\[
E\Big[\sum_{j \in A^c} \Big|\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big|^2 I_{\{\hat\eta > \eta/2\}}\Big] \le 2B^2 n^4 E(\|\tilde\beta - \beta_{A,0}\|_2^2 I_{\{\hat\eta > \eta/2\}}) + 2Bn^3 + o(n^4).
\tag{52}
\]
Define the ridge-based version of \(\tilde\beta\) by imposing λ1* = 0:
\[
\tilde\beta(\lambda_2, 0) = \operatorname*{argmin}_{\beta_A} \Big\{\Big(\sum_{i=1}^n g_i(\beta_A)\Big)' W_n \Big(\sum_{i=1}^n g_i(\beta_A)\Big) + \lambda_2 \sum_{j \in A} \beta_j^2\Big\}.
\tag{53}
\]
Then use the arguments in the proof of Theorem 2 (equation (34)), leading to
\[
\|\tilde\beta - \tilde\beta(\lambda_2, 0)\|_2 \le \frac{\lambda_1^* \max_{j \in A} \hat w_j \sqrt{p} + o_p(1)}{n^2 \mathrm{Eigmin}(G(\beta_{A,0})' \Omega_*^{-1} G(\beta_{A,0})) + \lambda_2 + o_p(n^2)} \le \frac{\lambda_1^* \hat\eta^{-\gamma} \sqrt{p}}{n^2 b + \lambda_2 + o_p(n^2)},
\tag{54}
\]
where \(\mathrm{Eigmin}(G(\beta_{A,0})' \Omega_*^{-1} G(\beta_{A,0})) \ge \mathrm{Eigmin}(G(\beta_0)' \Omega_*^{-1} G(\beta_0)) \ge b\). Then, following the proof of Theorem 2, for the right hand side term in (52),
\[
E(\|\tilde\beta - \beta_{A,0}\|_2^2 I_{\{\hat\eta > \eta/2\}}) \le 4\, \frac{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^{*2} (\eta/2)^{-2\gamma} p + o_p(n^2)}{(b n^2 + \lambda_2 + o_p(n^2))^2}.
\tag{55}
\]
Now combine (42), (43), (44), (45), (52), and (55) into (41):
\[
P\Big[\exists j \in A^c,\ \Big|2\hat G_{n,j}(\tilde\beta)' W_n \Big(\sum_{i=1}^n g_i(\tilde\beta)\Big)\Big| > \lambda_1^* \hat w_j\Big] \le \frac{16}{\eta^2}\Big[\frac{\lambda_2^2 \|\beta_0\|_2^2 + B n^3 p + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2}\Big] + \frac{4}{M^2}\Big[\frac{\lambda_2^2 \|\beta_0\|_2^2 + B n^3 p + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2}\Big] + \frac{4M^{2\gamma}}{\lambda_1^{*2}}\Big[(2n^3 B + o_p(n^3)) + (2B^2 n^4 + o(n^4))\, \frac{4\{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^{*2} (\eta/2)^{-2\gamma} p + o(n^2)\}}{(b n^2 + \lambda_2 + o(n^2))^2}\Big].
\tag{56}
\]
Now we have to show that each square bracketed term on the right hand side of equation (56) converges to zero. We consider each of the right hand side elements in (56). The first square bracketed element converges to zero since, with η² a constant, λ2²/n → 0, λ1²/n → 0, and p/n → 0 provide
\[
\frac{\lambda_2^2 \|\beta_0\|_2^2}{n^4} \to 0, \qquad \frac{B n^3 p}{n^4} \to 0, \qquad \frac{\lambda_1^2 p}{n^4} \to 0.
\]
Next we consider the second square bracketed term on the right hand side of (56). See that the dominating term in that expression is of stochastic order
\[
O\Big(\frac{p}{n}\, \frac{1}{M^2}\Big) \to 0.
\]
The above is true since, by Assumption 6, \(\lambda_1^* n^{\gamma(1-\alpha)}/n^{3+\alpha} \to \infty\) and \(M = (\lambda_1^*/n^{3+\alpha})^{1/2\gamma}\); hence, since γ > 0,
\[
\frac{M^2 n}{p} = \Big(\frac{\lambda_1^*}{n^{3+\alpha}}\Big)^{1/\gamma} n^{1-\alpha} \to \infty.
\]
The other terms in the second term on the right hand side of (56) are
\[
\frac{\lambda_2^2 \|\beta_0\|_2^2}{M^2 n^4} = \frac{\lambda_2^2}{n}\, \frac{1}{n M^2}\, \frac{\|\beta_0\|_2^2}{n^2} \to 0,
\]
by λ2²/n → 0 and \(p/(nM^2) \to 0\) from the analysis of the dominating term above. Then, in the same way,
\[
\frac{\lambda_1^2 p}{M^2 n^4} = \frac{\lambda_1^2}{n}\, \frac{p}{n M^2}\, \frac{1}{n^2} \to 0,
\]
where we use λ1²/n → 0 and, again, the analysis of the dominating term \(p/(nM^2) \to 0\) above. Now we consider the last square bracketed term in (56). For that term the dominating terms are (the last two terms in (56))
\[
O\Big(\frac{M^{2\gamma}}{\lambda_1^{*2}}\, n^3 p\Big) \to 0, \qquad O(M^{2\gamma} p) \to 0.
\]
The other terms in the last square bracketed term are of smaller order. To show the first result, note that by the definition of M,
\[
\frac{M^{2\gamma}}{\lambda_1^{*2}}\, n^3 p = \frac{1}{\lambda_1^*} \to 0,
\]
since λ1* diverges to infinity. Then for the second result,
\[
M^{2\gamma} p = \frac{\lambda_1^*}{n^{3+\alpha}}\, n^\alpha = \frac{\lambda_1^*}{n^3} \to 0,
\]
by \(\lambda_1^* = o(n^{(2-\alpha)+\frac{\gamma}{2}(\alpha-1)})\) and 0 < α < 1/3. Q.E.D.

Proof of Theorem 3ii. Given the result of Theorem 3i, it suffices to prove \(P(\min_{j \in A} |\tilde\beta_j| > 0) \to 1\). With \(\tilde\beta(\lambda_2, 0)\) defined in (53), we can write
\[
\min_{j \in A} |\tilde\beta_j| \ge \min_{j \in A} |\tilde\beta(\lambda_2, 0)_j| - \|\tilde\beta - \tilde\beta(\lambda_2, 0)\|_2.
\tag{57}
\]
Also see that
\[
\min_{j \in A} |\tilde\beta(\lambda_2, 0)_j| \ge \min_{j \in A} |\beta_{j0}| - \|\tilde\beta(\lambda_2, 0) - \beta_{A,0}\|_2.
\tag{58}
\]
Combine (57) and (58) to have
\[
\min_{j \in A} |\tilde\beta_j| \ge \min_{j \in A} |\beta_{j0}| - \|\tilde\beta(\lambda_2, 0) - \beta_{A,0}\|_2 - \|\tilde\beta - \tilde\beta(\lambda_2, 0)\|_2.
\tag{59}
\]
Now we consider the last two terms on the right hand side of (59). Similarly to the derivation of (38)-(39), we have
\[
E\|\tilde\beta(\lambda_2, 0) - \beta_{A,0}\|_2^2 \le \frac{4\lambda_2^2 \|\beta_{A,0}\|_2^2 + 4n^3 p B + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2} = O(p/n) = o(1).
\tag{60}
\]
Then by (54),
\[
\|\tilde\beta(\lambda_2, 0) - \tilde\beta\|_2 \le \frac{\lambda_1^* \hat\eta^{-\gamma} \sqrt{p}}{n^2 b + \lambda_2 + o_p(n^2)}.
\tag{61}
\]
See that
\[
\frac{\lambda_1^* \hat\eta^{-\gamma} \sqrt{p}}{n^2 b + \lambda_2 + o_p(n^2)} = O(1)\, (\hat\eta/\eta)^{-\gamma}\, \eta^{-\gamma}\, \frac{\lambda_1^* \sqrt{p}}{n^2}.
\tag{62}
\]
Since p = n^α, by Assumption 6,
\[
\frac{\lambda_1^* \sqrt{p}}{n^2} = o\big(n^{(2-\alpha)+\frac{\gamma}{2}(\alpha-1)+\alpha/2-2}\big) = o\big(n^{-\alpha/2+\frac{\gamma}{2}(\alpha-1)}\big) = o(1),
\tag{63}
\]
since 0 < α < 1/3 and γ > 0. Then by Theorem 2,
\[
E(\hat\eta/\eta)^2 \le 2 + \frac{2}{\eta^2}\, E[\hat\eta - \eta]^2 \le 2 + \frac{2}{\eta^2}\, E\|\hat\beta(\lambda_2, \lambda_1) - \beta_0\|_2^2 \le 2 + \frac{8}{\eta^2}\, \frac{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2} = O(1),
\tag{64}
\]
by λ1²/n → 0, λ2²/n → 0, and p/n → 0. Substitute (63) and (64) into (62) to have
\[
\frac{\lambda_1^* \hat\eta^{-\gamma} \sqrt{p}}{n^2 b + \lambda_2 + o_p(n^2)} \xrightarrow{p} 0.
\tag{65}
\]
Now use (65) in (61) and combine this with (60) to have \(\min_{j \in A} |\tilde\beta_j| \ge \eta - o_p(1)\). Then we obtain the desired result. Q.E.D.
Proof of Theorem 4. We now prove the limit result. First, define
\[
z_n = \delta'\, \frac{I + \lambda_2 (\hat G(\hat\beta_{aenet,A})' W_n \hat G(\hat\beta_{aenet,A}))^{-1}}{1 + \lambda_2/n}\, (\hat G(\hat\beta_{aenet,A})' W_n \hat G(\hat\beta_{aenet,A}))^{1/2}\, n^{-1/2} (\hat\beta_{aenet,A} - \beta_{A,0}).
\]
Then we need the following results. Following the proof of Theorem 1 and using Theorem 3,
\[
\hat\beta_{aenet,A} \xrightarrow{p} \beta_{A,0}, \qquad \text{and by (60),} \qquad \tilde\beta(\lambda_2, 0) \xrightarrow{p} \beta_{A,0}.
\]
Next, by Assumption 2ii and Assumption 3, and considering the consistency results above, we have
\[
\Big\| \frac{\hat G(\hat\beta_{aenet,A})'}{n} W_n \frac{\hat G(\hat\beta_{aenet,A})}{n} - \frac{\hat G(\tilde\beta(\lambda_2, 0))'}{n} W_n \frac{\hat G(\tilde\beta(\lambda_2, 0))}{n} \Big\| \xrightarrow{p} 0.
\tag{66}
\]
Next, decompose \(n^{-1/2}(\tilde\beta - \beta_{A,0})\) into \(n^{-1/2}(\tilde\beta - \tilde\beta(\lambda_2,0))\), \(n^{-1/2}(\tilde\beta(\lambda_2,0) - \beta_{A,0})\), and the λ2-bias term \(\lambda_2 \beta_{A,0}\, n^{-1/2}/(n + \lambda_2)\), and apply \(\delta'[I + \lambda_2[\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1}](\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0)))^{1/2}\) to each piece (67). To explain the last term in this decomposition, via (20) and simple matrix algebra,
\[
\{\lambda_2 [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1} + I\}\, (\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0)))^{1/2}\, n^{-1/2} [\tilde\beta(\lambda_2,0) - \beta_{A,0}] = -\lambda_2 n^{-1/2} [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1/2} \beta_{A,0} - n^{-1/2} [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1/2} [\hat G(\tilde\beta(\lambda_2,0))' W_n g_n(\beta_{A,0})],
\tag{68}
\]
where \(g_n(\beta_{A,0}) = \sum_{i=1}^n g_i(\beta_{A,0})\). Via Theorem 3, and also using (66), with probability tending to one, \(z_n = T_1 + T_2 + T_3\), where
\[
T_1 = \delta' [I + \lambda_2 [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1}] [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{1/2}\, n^{-1/2}\, \frac{\lambda_2 \beta_{A,0}}{n + \lambda_2} - \delta' \lambda_2 n^{-1/2} [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1/2} \beta_{A,0},
\]
\[
T_2 = \delta' [I + \lambda_2 [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1}] [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{1/2}\, n^{-1/2} (\tilde\beta - \tilde\beta(\lambda_2,0)),
\]
\[
T_3 = \delta' [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1/2}\, \hat G(\tilde\beta(\lambda_2,0))' W_n\, \frac{\sum_{i=1}^n g_i(\beta_{A,0})}{n^{1/2}}.
\]
Consider T1; use Assumptions 4i and 4ii:
\[
T_1^2 \le \frac{2\lambda_2^2}{n(n+\lambda_2)^2}\, \big\|(I + \lambda_2 (G(\beta_{A,0})' W G(\beta_{A,0}))^{-1})(G(\beta_{A,0})' W G(\beta_{A,0}))^{1/2} \beta_{A,0}\big\|_2^2 + \frac{2\lambda_2^2}{n}\, \big\|(G(\beta_{A,0})' W G(\beta_{A,0}))^{-1/2} \beta_{A,0}\big\|_2^2 + o(1) \le \frac{2\lambda_2^2 B n}{(n+\lambda_2)^2}\Big(1 + \frac{\lambda_2}{bn}\Big)^2 \|\beta_{A,0}\|_2^2 + \frac{2\lambda_2^2 \|\beta_{A,0}\|_2^2}{n\, bn} + o_p(1) = o_p(1),
\]
via λ2 = o(n^{1/2}). Consider T2; similarly to the analysis above and using (54),
\[
T_2^2 \le \frac{1}{n}\Big(1 + \frac{\lambda_2}{bn}\Big)^2 (Bn) \|\tilde\beta - \tilde\beta(\lambda_2, 0)\|_2^2 \le B\Big(1 + \frac{\lambda_2}{bn}\Big)^2 \Big[\frac{\lambda_1^* \hat\eta^{-\gamma} \sqrt{p}}{n^2 b + \lambda_2 + o_p(n^2)}\Big]^2 = o_p(1),
\]
by (65) and λ2 = o(n^{1/2}). For the term T3 we benefit from the Liapunov central limit theorem. By Assumptions 2 and 3,
\[
T_3 = \sum_{i=1}^n \frac{\delta' [\hat G(\tilde\beta(\lambda_2,0))' W_n \hat G(\tilde\beta(\lambda_2,0))]^{-1/2} [\hat G(\tilde\beta(\lambda_2,0))' W_n]\, g_i(\beta_{A,0})}{n^{1/2}} = \sum_{i=1}^n \frac{\delta' [G(\beta_{A,0})' W G(\beta_{A,0})]^{-1/2} [G(\beta_{A,0})' W]\, g_i(\beta_{A,0})}{n^{1/2}} + o_p(1).
\]
Next set
\[
R_i = \delta' [G(\beta_{A,0})' W G(\beta_{A,0})]^{-1/2} [G(\beta_{A,0})' W]\, \frac{g_i(\beta_{A,0})}{n^{1/2}}.
\]
So
\[
\sum_{i=1}^n E R_i^2 = n^{-1} \sum_{i=1}^n E[\delta' [G(\beta_{A,0})' W G(\beta_{A,0})]^{-1/2} G(\beta_{A,0})' W g_i(\beta_{A,0}) g_i(\beta_{A,0})' W G(\beta_{A,0}) [G(\beta_{A,0})' W G(\beta_{A,0})]^{-1/2} \delta] = \delta' [G(\beta_{A,0})' W G(\beta_{A,0})]^{-1/2} G(\beta_{A,0})' W \Big[n^{-1} \sum_{i=1}^n E g_i(\beta_{A,0}) g_i(\beta_{A,0})'\Big] W G(\beta_{A,0}) [G(\beta_{A,0})' W G(\beta_{A,0})]^{-1/2} \delta.
\]
Then set \(W = \Omega_A^{-1}\), and use the definition \(\lim_{n\to\infty} n^{-1} \sum_{i=1}^n E g_i(\beta_{A,0}) g_i(\beta_{A,0})' = \Omega_A\). Taking the limit of the term above,
\[
\lim_{n\to\infty} \sum_{i=1}^n E R_i^2 = \delta' [G(\beta_{A,0})' \Omega_A^{-1} G(\beta_{A,0})]^{-1/2} G(\beta_{A,0})' \Omega_A^{-1} \Omega_A \Omega_A^{-1} G(\beta_{A,0}) [G(\beta_{A,0})' \Omega_A^{-1} G(\beta_{A,0})]^{-1/2} \delta = \delta' \delta = 1.
\]
Next we show \(\sum_{i=1}^n E|R_i|^{2+l} \to 0\) for l > 0. See that by Assumptions 4 and 5,
\[
\sum_{i=1}^n \frac{E\big[\delta' (G(\beta_{A,0})' \Omega_A^{-1} G(\beta_{A,0}))^{-1/2} G(\beta_{A,0})' \Omega_A^{-1} g_i(\beta_{A,0})\big]^{2+l}}{n^{1+l/2}} \le \Big(\frac{B}{b}\Big)^{1+l/2} \frac{1}{n^{1+l/2}} \sum_{i=1}^n E\big\|\Omega_A^{-1/2} g_i(\beta_{A,0})\big\|_2^{2+l} \to 0.
\]
The desired result then follows from \(z_n = T_1 + T_2 + T_3\) with probability tending to one. Q.E.D.
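The variance calculation for T3 — that the standardization makes Σ E R_i² converge to δ′δ = 1 when W = Ω_A⁻¹ — reduces to a matrix identity that can be checked numerically (illustrative stand-in matrices, not the paper's quantities):

```python
import numpy as np

rng = np.random.default_rng(4)
q, p = 9, 4
G = rng.standard_normal((q, p))                 # stand-in for G(beta_{A,0})
S = rng.standard_normal((q, q))
Omega = S @ S.T + q * np.eye(q)                 # a positive definite stand-in for Omega_A

# Inverse symmetric square root of Sigma = G' Omega^{-1} G via its eigendecomposition.
Sigma = G.T @ np.linalg.solve(Omega, G)
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

delta = np.ones(p) / np.sqrt(p)                 # any unit vector delta

# The long-run variance: delta' Sigma^{-1/2} G' Omega^{-1} Omega Omega^{-1} G Sigma^{-1/2} delta.
Oinv = np.linalg.inv(Omega)
middle = G.T @ Oinv @ Omega @ Oinv @ G          # collapses to Sigma itself
var = delta @ Sigma_inv_half @ middle @ Sigma_inv_half @ delta
print(var)                                      # equals delta'delta = 1
```

The sandwich G′Ω⁻¹ΩΩ⁻¹G collapses to Σ, and Σ^{-1/2}ΣΣ^{-1/2} = I, so the variance is δ′δ = 1 for any unit vector δ; this is exactly why the limit of T3 is standard normal.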
References

Alfaro, L., Kalemli-Ozcan, S. & Volosovych, V. (2008), "Why doesn't capital flow from rich to poor countries? An empirical investigation", Review of Economics and Statistics 90, 347-368.
Andrews, D.W.K. (1994), "Empirical process methods in econometrics", in Handbook of Econometrics 4, 2247-2294.
Andrews, D.W.K. & Lu, B. (2001), "Consistent model and moment selection procedures for GMM estimation with application to dynamic panel data models", Journal of Econometrics 101, 123-165.
Caner, M. (2008), "Nearly singular design in GMM and generalized empirical likelihood estimators", Journal of Econometrics 144, 511-523.
Caner, M. (2009), "Lasso-type GMM estimator", Econometric Theory 25, 270-290.
Davidson, J. (1994), Stochastic Limit Theory, Oxford University Press.
Fan, J. & Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties", Journal of the American Statistical Association 96, 1348-1360.
Fan, J., Peng, H. & Huang, T. (2005), "Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion)", Journal of the American Statistical Association 100, 781-813.
Hansen, L. P. (1982), "Large sample properties of generalized method of moments estimators", Econometrica 50, 1029-1054.
Huang, J., Horowitz, J. & Ma, S. (2008), "Asymptotic properties of bridge estimators in sparse high-dimensional regression models", The Annals of Statistics 36, 587-613.
Huber, P. (1988), "Robust regression: asymptotics, conjectures and Monte Carlo", The Annals of Statistics 1, 799-821.
Kitamura, Y. & Stutzer, M. (1997), "An information-theoretic alternative to generalized method of moments estimation", Econometrica 65, 861-874.
Knight, K. & Fu, W. (2000), "Asymptotics for lasso-type estimators", The Annals of Statistics 28, 1356-1378.
Knight, K. (2008), "Shrinkage estimation for nearly singular designs", Econometric Theory 24, 323-338.
Lam, C. & Fan, J. (2007), "Profile-kernel likelihood inference with diverging number of parameters", The Annals of Statistics, to appear.
Newey, W. & McFadden, D. (1994), "Large sample estimation and hypothesis testing", in Handbook of Econometrics 4.
Owen, A. (2001), Empirical Likelihood, Chapman and Hall.
Portnoy, S. (1984), "Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency", The Annals of Statistics 12, 1298-1309.
Qin, J. & Lawless, J. (1994), "Empirical likelihood and general estimating equations", The Annals of Statistics 22, 300-325.
Schennach, S. (2007), "Point estimation with exponentially tilted empirical likelihood", The Annals of Statistics 35, 634-672.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge University Press.
van der Vaart, A. & Wellner, J. (1996), Weak Convergence and Empirical Processes, Springer-Verlag.
Wang, H., Li, R. & Tsai, C. (2007), "Tuning parameter selectors for the smoothly clipped absolute deviation method", Biometrika 94, 553-568.
Zhang, H. H. & Lu, W. (2007), "Adaptive Lasso for Cox's proportional hazards model", Biometrika 94, 691-703.
Zou, H., Hastie, T. & Tibshirani, R. (2007), "On the degrees of freedom of the lasso", The Annals of Statistics 35, 2173-2192.
Zou, H. & Zhang, H. H. (2009), "On the adaptive elastic-net with a diverging number of parameters", The Annals of Statistics, to appear.

Mehmet Caner, Department of Economics, North Carolina State University, Raleigh, NC 27695. E-mail: [email protected]

Hao Helen Zhang, Department of Statistics, North Carolina State University, Raleigh, NC 27695. E-mail: [email protected]