Smooth Varying-Coefficient Estimation and Inference for Qualitative and Quantitative Data∗

Qi Li, Department of Economics, Texas A&M University, College Station, TX 77843-4228, USA
Jeffrey S. Racine, Department of Economics, McMaster University, Hamilton, ON L8S 4M4, Canada

March 6, 2008
Abstract

We propose a semiparametric varying-coefficient estimator that admits both qualitative and quantitative covariates, along with a test for correct specification of parametric varying-coefficient models. The proposed estimator is exceedingly flexible and has a wide range of potential applications including hierarchical (mixed) settings, small area estimation, and so forth. A data-driven cross-validatory bandwidth selection method is proposed that can handle both the qualitative and quantitative covariates and that can also handle the presence of potentially irrelevant covariates, each of which can result in finite-sample efficiency gains relative to the conventional frequency (sample-splitting) estimator that is often found in such settings. Theoretical underpinnings including rates of convergence and asymptotic normality are provided. Monte Carlo simulations are undertaken to assess the proposed estimator's finite-sample performance relative to the conventional semiparametric frequency estimator, and to assess the finite-sample performance of the proposed test for correct parametric specification. An application is presented which demonstrates the potential value added by the proposed methods in applied settings.
Keywords: Categorical covariates, nonparametric smoothing, cross-validation, asymptotic normality.
∗ Corresponding author: Jeffrey S. Racine, email: [email protected]; Tel: 905-525-9140 extension 23825. Li's research is partially supported by the Private Enterprise Research Center, Texas A&M University. Racine would like to gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC: www.nserc.ca), the Social Sciences and Humanities Research Council of Canada (SSHRC: www.sshrc.ca), and the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca). Racine would like to thank Tristen Hayfield for his exemplary research assistance.
1 Introduction
The seminal work of Aitchison & Aitken (1976) has spawned a rich literature on the kernel smoothing of categorical data, and has also spawned numerous advances in the kernel smoothing of datasets comprised of both qualitative and quantitative data. One area in which this approach could be particularly helpful is when confronted with data in groups in which potential sparsity can adversely affect one’s analysis but which can be alleviated by ‘borrowing’ information from, say, neighboring cells in a prescribed manner. By way of example, one might encounter students grouped in classes grouped in schools or crops grouped in plots of land grouped in farms and so forth. These situations are frequently encountered in mixed model settings such as when conducting small area estimation, estimating multi-level (hierarchical) models, and the like. The estimator we propose is ideally suited to these settings. The conventional parametric approach to modeling such problems suffers from a number of drawbacks, which has given rise to a growing literature on semiparametric and nonparametric approaches. A variety of flexible methods have been proposed for estimating these models including semiparametric and nonparametric smoothing spline approaches (Zhang, Lin, Raz & Sowers (1998), Gu & Ma (2005)) and semiparametric kernel-based approaches (Zeger & Diggle (1994)), while a recent Journal of Multivariate Analysis (2004) special issue was devoted entirely to such advances. Cai, Fan & Yao (2000) and Fan, Yao & Cai (2003) have further extended varying coefficient models to dependent data settings. However, when confronted with the mix of qualitative and quantitative covariates that we often encounter in applied settings, existing semiparametric and nonparametric approaches rely on sample-splitting in order to handle the presence of qualitative covariates, and this can lead to substantial efficiency losses in finite-sample settings. 
Though existing semiparametric and nonparametric approaches provide the user with much-needed flexibility, the methods proposed in this paper possess a number of features not shared by their peers that ought to be quite appealing to practitioners. Most noteworthy is the fact that the proposed method does not rely upon sample-splitting, since it smooths the qualitative covariates in the manner described in Section 2 below. The rest of the paper proceeds as follows. In Sections 2 and 3 we provide theoretical underpinnings of the proposed estimator, including rates of convergence and asymptotic normality, along with a data-driven cross-validatory method of bandwidth selection. Section 4 presents the proposed test for correct specification of parametric varying-coefficient models. In Section 5, Monte Carlo simulations are undertaken to assess the finite-sample performance of the proposed estimator and the finite-sample properties of the proposed test for correct parametric specification. An application involving the well-known Boston housing data is presented in Section 6, while Section 7 concludes. Proofs of the main theorems are relegated to the appendices.
2 Kernel Estimation of Varying-Coefficient Models
Consider a varying-coefficient regression model given by
$$Y_i = X_i' \beta(\bar Z_i) + u_i, \qquad (1)$$
where $X_i$ is a $p$-dimensional vector of regressors, $\bar Z_i = (\bar Z_i^c, \bar Z_i^d)$, where $\bar Z_i^c$ is a continuous covariate of dimension $q_1$, and $\bar Z_i^d$ is a categorical covariate of dimension $r_1$. The functional form of $\beta(\cdot)$ is not specified, and $u_i$ is an error term satisfying $E(u_i | X_i, \bar Z_i) = 0$. We allow for the possibility that the nonparametric varying-coefficient function $\beta(\cdot)$ is overspecified. That is, instead of using only the relevant covariates $\bar Z_i$ in (1), one may use a possibly larger set of covariates $Z_i = (Z_i^c, Z_i^d)$, where $Z_i^c$ is a $q$-dimensional continuous covariate and $Z_i^d$ is an $r$-dimensional categorical covariate ($q \ge q_1$ and $r \ge r_1$). Therefore, in practice one estimates the following model,
$$Y_i = X_i' \beta(Z_i) + u_i. \qquad (2)$$
Of course, we maintain the assumption that (1) is the correctly specified model. Hence, (2) is an over-specified model if $q > q_1$ and/or $r > r_1$.

We use $Z_{is}^d$ to denote the $s$th component of $Z_i^d$, and we assume that $Z_{is}^d$ assumes $c_s$ unique values ($c_s \ge 2$ is a finite positive integer). We use $S^d$ to denote the support of $Z^d$. We allow for one or more of the nonparametric covariates to be irrelevant, where the specific definition of 'irrelevant covariate' shall be provided below. Without loss of generality, we assume that the first $q_1$ ($1 \le q_1 \le q$) components of $Z^c$ and the first $r_1$ ($0 \le r_1 \le r$) components of $Z^d$ are 'relevant' covariates in the sense defined below. Note that we assume there exists at least one relevant continuous covariate ($q_1 \ge 1$). It can be shown that when all of the continuous covariates are irrelevant ($q_1 = 0$), the asymptotic distribution of the cross-validated smoothing parameters will be quite different from the case treated in this paper. Results for the case in which $q_1 = 0$ cannot be obtained as a special case of those derived below, hence we must treat this case separately and intend to do so in future work.

Let $\bar Z$ consist of the first $q_1$ relevant components of $Z^c$ and the first $r_1$ relevant components of $Z^d$, and let $\tilde Z = Z \setminus \{\bar Z\}$ denote the remaining irrelevant components of $Z$. Following Hall, Li & Racine (2007) we assume that
$$(Y, X, \bar Z) \text{ is independent of } \tilde Z. \qquad (3)$$
Clearly, (3) implies that, for all measurable functions $g(Y, X)$,
$$E(g(Y, X) | Z) = E(g(Y, X) | \bar Z). \qquad (4)$$
Obviously (3) is a stronger assumption than (4). A weaker condition would be to require that
$$\text{conditional on } (X, \bar Z), \text{ the variables } \tilde Z \text{ and } Y \text{ are independent of each other.} \qquad (5)$$
However, using (5) would cause some formidable technical difficulties in the proofs that follow which we are unable to handle at this time. Therefore, in this paper we will only consider the unconditional independence assumption (3). Nevertheless, we have also investigated the case of conditional independence as defined by (5) via simulation, and the simulation results reveal that cross-validation can smooth out irrelevant covariates under either unconditional independence (3) or conditional independence (5).¹

We now discuss the kernel smoothing of discrete covariates. For an unordered discrete covariate, we suggest using a variant of Aitchison & Aitken's (1976) kernel function defined as
$$l(Z_{is}^d, z_s^d, \lambda_s) = \begin{cases} 1, & \text{when } Z_{is}^d = z_s^d, \\ \lambda_s, & \text{otherwise.} \end{cases} \qquad (6)$$
For an ordered discrete covariate, we use the following kernel function,
$$l(Z_{is}^d, z_s^d, \lambda_s) = \begin{cases} 1, & \text{when } Z_{is}^d = z_s^d, \\ \lambda_s^{|Z_{is}^d - z_s^d|}, & \text{when } Z_{is}^d \neq z_s^d, \end{cases} \qquad (7)$$
where $\mathbf{1}(A)$ denotes the usual indicator function, which assumes the value 1 if $A$ holds true and 0 otherwise. Note that for both unordered and ordered discrete covariates, $\lambda_s = 0$ leads to an indicator function while $\lambda_s = 1$ leads to a uniform weight function. Therefore, the range of $\lambda_s$ is $[0, 1]$ for all $s = 1, \dots, r$. Using (6) and (7), we can construct a product kernel function for the discrete covariates via
$$L(Z_i^d, z^d, \lambda) = \prod_{s=1}^{r} l(Z_{is}^d, z_s^d, \lambda_s). \qquad (8)$$
Observe that the kernel weight function we use here does not add up to one when $\lambda_s \neq 0$. However, this does not affect the semiparametric estimator $x'\hat\beta(z)$ defined in (10) below: the kernel function appears in both the numerator and the denominator of (10), so it can be multiplied by any non-zero constant leaving the definition of $\hat\beta(z)$ intact.
¹ We do not report these simulations here for space considerations. These results are available upon request.
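The paper itself ships no code; purely as an illustration, the discrete kernels (6), (7) and the product kernel (8) can be sketched in NumPy as follows (the function names and vectorized signatures are our own, not the authors'):

```python
import numpy as np

def l_unordered(z_is, z_s, lam):
    """Variant of the Aitchison-Aitken kernel (6): 1 if z_is == z_s, lam otherwise."""
    return np.where(z_is == z_s, 1.0, lam)

def l_ordered(z_is, z_s, lam):
    """Ordered-covariate kernel (7): lam ** |z_is - z_s| (equals 1 when z_is == z_s)."""
    return lam ** np.abs(z_is - z_s)

def L_product(Zd_i, zd, lams, ordered):
    """Product kernel (8) over the r discrete covariates of one observation."""
    out = 1.0
    for s, lam in enumerate(lams):
        kern = l_ordered if ordered[s] else l_unordered
        out = out * kern(Zd_i[s], zd[s], lam)
    return out
```

Setting every $\lambda_s = 0$ recovers the indicator (frequency) weights, while $\lambda_s = 1$ smooths covariate $s$ out entirely, matching the discussion above.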
For the continuous covariates $Z_i^c = (Z_{i1}^c, \dots, Z_{iq}^c)$ we use the product kernel given by
$$W_h\!\left(\frac{Z_j^c - Z_i^c}{h}\right) = \prod_{s=1}^{q} \frac{1}{h_s}\, w\!\left(\frac{Z_{js}^c - Z_{is}^c}{h_s}\right),$$
where $w(\cdot)$ is a symmetric univariate density function and where $0 < h_s < \infty$ is the smoothing parameter for $z_s^c$, $s = 1, \dots, q$. The kernel function for the mixed covariate case $z = (z^c, z^d)$ is simply the product of $W_h(\cdot)$ and $L(\cdot)$, i.e.,
$$K_{\gamma,ij} \overset{def}{\equiv} K_\gamma(Z_i, Z_j) = W_h\!\left(\frac{Z_j^c - Z_i^c}{h}\right) L(Z_j^d, Z_i^d, \lambda), \qquad (9)$$
where $\gamma = (h, \lambda) = (h_1, \dots, h_q, \lambda_1, \dots, \lambda_r)$. We use $S$ to denote the support of $Z_i$.²

For $z \in S$, from (1) it is fairly straightforward to show that
$$\beta(z) = \left[E(X_i X_i' | Z_i = z)\right]^{-1} E(X_i Y_i | Z_i = z).$$
Therefore, we estimate $\beta(z)$ by the local constant kernel method, which is given by
$$\hat\beta(z) = \left[n^{-1}\sum_{i=1}^{n} X_i X_i' K_\gamma(Z_i, z)\right]^{-1} \left[n^{-1}\sum_{i=1}^{n} X_i Y_i K_\gamma(Z_i, z)\right]. \qquad (10)$$
When $\lambda_s = 0$ for all $s = 1, \dots, r$, our estimator reverts back to the conventional approach that uses a frequency estimator to deal with the categorical covariates (i.e., one resorts to sample-splitting), while if $\lambda_s = 1$ for some $s$, then $\hat\beta(z)$ becomes unrelated to $z_s^d$ because $l(Z_{is}^d, z_s^d, \lambda_s = 1) \equiv 1$ (a constant which is unrelated to $z_s^d$). That is, when $\lambda_s = 1$, $z_s^d$ is smoothed out from the regression model (it is deemed to be an 'irrelevant' qualitative covariate). Similarly, if $h_s$ is sufficiently large, say, $h_s = \infty$, then $w((Z_{is}^c - z_s^c)/h_s) \simeq w(0)$, a constant that is unrelated to $z_s^c$. Thus, $z_s^c$ is completely smoothed out and $\hat\beta(z)$ becomes unrelated to $z_s^c$ when $h_s$ is sufficiently large.
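As a concrete illustration of (9) and (10), the following sketch implements the local constant estimator with a Gaussian choice for $w(\cdot)$ (our choice, not mandated by the paper) and the unordered kernel (6) for the discrete covariates; all function names are ours:

```python
import numpy as np

def K_gamma(Zc, Zd, zc, zd, h, lam):
    """Mixed-covariate product kernel (9): Gaussian w for Z^c, unordered kernel (6) for Z^d."""
    Wc = np.prod(np.exp(-0.5 * ((Zc - zc) / h) ** 2) / (h * np.sqrt(2 * np.pi)), axis=1)
    Ld = np.prod(np.where(Zd == zd, 1.0, lam), axis=1)
    return Wc * Ld

def beta_hat(zc, zd, X, Y, Zc, Zd, h, lam):
    """Local constant varying-coefficient estimator (10) at the point z = (zc, zd)."""
    n = len(Y)
    K = K_gamma(Zc, Zd, zc, zd, h, lam)
    A = (X * K[:, None]).T @ X / n      # n^{-1} sum_i X_i X_i' K_gamma(Z_i, z)
    b = (X * K[:, None]).T @ Y / n      # n^{-1} sum_i X_i Y_i K_gamma(Z_i, z)
    return np.linalg.solve(A, b)
```

When $Y_i = X_i'\beta$ with $\beta$ constant, $b = A\beta$ exactly, so the estimator returns $\beta$ at any evaluation point; with a genuinely varying $\beta(\cdot)$ it returns a kernel-weighted local fit.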
We choose $\gamma = (h, \lambda) = (h_1, \dots, h_q, \lambda_1, \dots, \lambda_r)$ to minimize³
$$CV(\gamma) = \sum_{i=1}^{n} \left[Y_i - X_i' \hat\beta_{-i}(Z_i)\right]^2 M(Z_i), \qquad (11)$$
where $0 \le M(\cdot) \le 1$ is a weight function which serves to avoid difficulties caused by dividing by zero, or by the slower convergence rate that arises when $Z_i$ lies near the boundary of its support, while
$$\hat\beta_{-i}(Z_i) = \left[n^{-1}\sum_{j\neq i}^{n} X_j X_j' K_\gamma(Z_j, Z_i)\right]^{-1} n^{-1}\sum_{j\neq i}^{n} X_j Y_j K_\gamma(Z_j, Z_i) \qquad (12)$$
is the leave-one-out kernel estimator of $\beta(Z_i)$.

² Similarly, we use $\bar S^d$, $\bar S^c$, $\tilde S^d$ and $\tilde S^c$ to denote the support of $\bar Z_i^d$, $\bar Z_i^c$, $\tilde Z_i^d$ and $\tilde Z_i^c$, respectively.
³ For related work that uses least squares cross-validation for selecting smoothing parameters in a nonparametric regression model with continuous covariates, see Härdle & Marron (1985), Härdle, Hall & Marron (1988) and Härdle, Hall & Marron (1992).
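A direct, deliberately simple O(n²) translation of the cross-validation objective (11)-(12) can be sketched as follows, again with a Gaussian $w(\cdot)$, the unordered kernel (6), and $M(\cdot) \equiv 1$ by default (all choices and names ours):

```python
import numpy as np

def cv_criterion(gamma, X, Y, Zc, Zd, M=None):
    """Leave-one-out CV objective (11); gamma stacks (h_1..h_q, lambda_1..lambda_r)."""
    n, q = len(Y), Zc.shape[1]
    h, lam = gamma[:q], gamma[q:]
    M = np.ones(n) if M is None else M
    sse = 0.0
    for i in range(n):
        # product kernel (9) between observation i and every observation
        Wc = np.prod(np.exp(-0.5 * ((Zc - Zc[i]) / h) ** 2) / (h * np.sqrt(2 * np.pi)), axis=1)
        K = Wc * np.prod(np.where(Zd == Zd[i], 1.0, lam), axis=1)
        K[i] = 0.0                      # leave observation i out, as in (12)
        A = (X * K[:, None]).T @ X
        b = (X * K[:, None]).T @ Y
        beta_i = np.linalg.lstsq(A, b, rcond=None)[0]
        sse += (Y[i] - X[i] @ beta_i) ** 2 * M[i]
    return sse
```

Minimizing this objective over $\gamma \in \Gamma$, e.g. with a box-constrained numerical optimizer restarted from several initial values, yields the data-driven smoothing parameters studied below.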
3 Asymptotic Analysis Allowing for Irrelevant Covariates
We will use $f(x, \bar z)$, $\bar f(\bar z)$ and $\tilde f(\tilde z)$ to denote the joint density of $(X, \bar Z)$ and the marginal densities of $\bar Z$ and $\tilde Z$, respectively. Define $\sigma^2(x, \bar z) = E(u_i^2 | X_i = x, \bar Z_i = \bar z)$. We assume that

The data are i.i.d. and $u_i$ has finite moments of any order; $\beta(\bar z)$, $f(x, \bar z)$ and $\sigma^2(x, \bar z)$ have two continuous derivatives (with respect to $\bar z^c$); $M(\cdot)$ is continuous, non-negative and has compact support; $f$ and $\tilde f$ are bounded away from zero for $z = (z^c, z^d) \in S = S^c \times S^d$. (13)
We impose the following conditions on the bandwidth and kernel functions. Define
$$H = \left(\prod_{s=1}^{q_1} h_s\right) \prod_{s=q_1+1}^{q} \min(h_s, 1). \qquad (14)$$
Letting $0 < \epsilon < 1/(q+4)$ and for some constant $c > 0$, we further assume that

$n^{\epsilon - 1} \le H \le n^{-\epsilon}$; $n^{-c} < h_s < n^c$ for all $s = 1, \dots, q$; the kernel function $w(\cdot)$ is a symmetric, compactly supported, Hölder-continuous probability density; $w(0) > w(\delta)$ for all $\delta \neq 0$. (15)

The above conditions basically require that each $h_s$ does not converge to zero, or to infinity, too fast, and that $n h_1 \cdots h_{q_1} \to \infty$ as $n \to \infty$. We use $\mathcal{H}$ to denote the permissible set for $(h_1, \dots, h_q)$ that satisfies (15). The range for $(\lambda_1, \dots, \lambda_r)$ is $[0, 1]^r$, and we use $\Gamma = \mathcal{H} \times [0, 1]^r$ to denote the range for the smoothing parameter vector $\gamma \equiv (h_1, \dots, h_q, \lambda_1, \dots, \lambda_r)$. We expect that, as $n \to \infty$, the smoothing parameters associated with the relevant covariates will converge to zero, while those associated with the irrelevant covariates will not. It would be convenient to further assume that $h_s \to 0$ for $s = 1, \dots, q_1$, and that $\lambda_s \to 0$ for $s = 1, \dots, r_1$.
However, for practical reasons we choose not to assume that the relevant components are known a priori, but rather assume that assumption (17) given below holds. We write $K_{\gamma,ij} = \bar K_{\bar\gamma,ij} \tilde K_{\tilde\gamma,ij}$, where $\bar\gamma = (h_1, \dots, h_{q_1}, \lambda_1, \dots, \lambda_{r_1})$ and $\tilde\gamma = (h_{q_1+1}, \dots, h_q, \lambda_{r_1+1}, \dots, \lambda_r)$, so that $\bar K$ and $\tilde K$ are the product kernel functions associated with the relevant and the irrelevant covariates, respectively. We define
$$\eta_\beta(\bar z) = E[X_j X_j' K_{\gamma,ij} | Z_i = z]^{-1} E[X_j X_j' \beta(\bar z_j) K_{\gamma,ij} | Z_i = z] = E[X_j X_j' \bar K_{\bar\gamma,ij} | \bar Z_i = \bar z]^{-1} E[X_j X_j' \beta(\bar z_j) \bar K_{\bar\gamma,ij} | \bar Z_i = \bar z], \qquad (16)$$
where the second equality comes from the fact that $\tilde Z$ is independent of $(X, \bar Z)$. Therefore, the term related to $\tilde z$ cancels out since $E[\tilde K_{\tilde\gamma,ij} | \tilde Z_i = \tilde z]^{-1} E[\tilde K_{\tilde\gamma,ij} | \tilde Z_i = \tilde z] = 1$, hence $\eta_\beta(\bar z)$ depends neither on $\tilde z$ nor on $(h_{q_1+1}, \dots, h_q, \lambda_{r_1+1}, \dots, \lambda_r)$. We further assume that

$$\int (\eta_\beta(\bar z) - \beta(\bar z))'\, m(\bar z)\, (\eta_\beta(\bar z) - \beta(\bar z))\, \bar M(\bar z)\, d\bar z, \text{ a function of } h_1, \dots, h_{q_1} \text{ and } \lambda_1, \dots, \lambda_{r_1}, \text{ vanishes if and only if all of the smoothing parameters vanish,} \qquad (17)$$

where $m(\bar z) = E(X_i X_i' | \bar Z_i = \bar z) \bar f(\bar z)$ and $\bar M(\bar z) = \int \tilde f(\tilde z) M(z)\, d\tilde z$. In Lemma A.6 in Appendix A
we show that (15) and (17) imply that as $n \to \infty$, $h_s \to 0$ for $s = 1, \dots, q_1$ and $\lambda_s \to 0$ for $s = 1, \dots, r_1$. Therefore, the smoothing parameters associated with the relevant covariates all vanish asymptotically.

Define an indicator function
$$\mathbf{1}_s(v^d, x^d) = \mathbf{1}(v_s^d \neq x_s^d) \prod_{t \neq s}^{r} \mathbf{1}(v_t^d = x_t^d).$$
Note that $\mathbf{1}_s(v^d, x^d)$ equals one if $v^d$ and $x^d$ differ only in their $s$th component, and is zero otherwise. Let $\int d\bar z = \sum_{\bar z^d \in \bar S^d} \int d\bar z^c$, define $m(\bar z) = E[X_i X_i' | \bar Z_i = \bar z]$, and let $m_s$ and $m_{ss}$ denote the first and second derivatives of $m(\cdot)$ with respect to $z_s^c$, respectively, while $\beta_s$ and $\beta_{ss}$ are similarly defined. Now define
$$B_{1s}(\bar z) = \left[m_s(\bar z^c, \bar z^d) \beta_s(\bar z^c, \bar z^d) + (1/2)\, m(\bar z^c, \bar z^d) \beta_{ss}(\bar z^c, \bar z^d)\right],$$
$$B_{2s}(\bar z) = \sum_{\bar v^d \in \bar S^d} \mathbf{1}_s(\bar z^d, \bar v^d)\, m(\bar z^c, \bar v^d) \left[\beta(\bar z^c, \bar v^d) - \beta(\bar z^c, \bar z^d)\right]. \qquad (18)$$
In Appendix A we show that the leading term of $CV(\gamma)$ is
$$\int \left\{\left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar z) + \sum_{s=1}^{r_1} \lambda_s B_{2s}(\bar z)\right]' m(\bar z)^{-1} \left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar z) + \sum_{s=1}^{r_1} \lambda_s B_{2s}(\bar z)\right]\right\} \bar M(\bar z)\, d\bar z$$
$$+ \frac{\kappa^{q_1}}{n h_1 \cdots h_{q_1}} \int \bar f(\bar z)^2 \tilde f(\tilde z)\, \delta(\bar z, \bar z)\, \tilde R(\tilde z)\, M(z)\, dz + o_p\!\left(\zeta_n^2 + (n h_1 \cdots h_{q_1})^{-1}\right), \qquad (19)$$
where $\zeta_n = \sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1} \lambda_s$, $\kappa = \int w(v)^2\, dv$, $\kappa_2 = \int w(v) v^2\, dv$, $\delta(\bar z, \bar z)$ is defined in the appendix (below (A.9)), and where $\tilde R(\tilde z) = \tilde R(\tilde z, h_{q_1+1}, \dots, h_q, \lambda_{r_1+1}, \dots, \lambda_r)$ is given by
$$\tilde R(\tilde z) = \frac{\nu_2(\tilde z)}{[\nu_1(\tilde z)]^2}, \qquad (20)$$
where for $l = 1, 2$,
$$\nu_l(\tilde z) = E\left[\prod_{s=q_1+1}^{q} \left\{h_s^{-1} w\!\left(\frac{z_{is}^c - z_s^c}{h_s}\right)\right\}^l \prod_{s=r_1+1}^{r} \lambda_s^{\,l\,\mathbf{1}(z_{is}^d \neq z_s^d)}\right]. \qquad (21)$$
In (19) the irrelevant covariate $\tilde z$ appears in $\tilde R(\tilde z)$. By Hölder's inequality, $\tilde R(\tilde z) \ge 1$ for all choices of $\tilde z$, $h_{q_1+1}, \dots, h_q$, and $\lambda_{r_1+1}, \dots, \lambda_r$. Also, $\tilde R(\tilde z) \to 1$ as $h_s \to \infty$ ($q_1+1 \le s \le q$) and $\lambda_s \to 1$ ($r_1+1 \le s \le r$). Therefore, in order to minimize (19), one needs to select $h_s$ ($s = q_1+1, \dots, q$) and $\lambda_s$ ($s = r_1+1, \dots, r$) to minimize $\tilde R(\tilde z)$. In fact, we show that the only smoothing parameter values for which $\tilde R(\tilde z, h_{q_1+1}, \dots, h_q, \lambda_{r_1+1}, \dots, \lambda_r) = 1$ are $h_s = \infty$ for $q_1+1 \le s \le q$ and $\lambda_s = 1$ for $r_1+1 \le s \le r$. To see this, let us define $V_n = \prod_{s=q_1+1}^{q} w\!\left(\frac{z_s^c - Z_{is}^c}{h_s}\right) \prod_{s=r_1+1}^{r} l(z_s^d, Z_{is}^d, \lambda_s)$, where $l(z_s^d, Z_{is}^d, \lambda_s) = \lambda_s^{\mathbf{1}(z_s^d \neq Z_{is}^d)}$ if $z_s^d$ is an unordered covariate, and $l(z_s^d, Z_{is}^d, \lambda_s) = \lambda_s^{|Z_{is}^d - z_s^d|}$ if $z_s^d$ is an ordered covariate. If at least one $h_s$ is finite (for $q_1+1 \le s \le q$), or one $\lambda_s < 1$ (for $r_1+1 \le s \le r$), then by (15) ($w(0) > w(\delta)$ for all $\delta \neq 0$) we know that $var(V_n) = E[V_n^2] - [E(V_n)]^2 > 0$, so that $\tilde R = E[V_n^2]/[E(V_n)]^2 > 1$. Only when, in the definition of $V_n$, all $h_s = \infty$ and all $\lambda_s = 1$, do we have $V_n \equiv w(0)^{q - q_1}$ (a constant) and $var(V_n) = 0$, so that $\tilde R(\tilde z) = 1$ only in this case.

Therefore, in order to minimize (19), the smoothing parameters corresponding to the irrelevant covariates must all converge to their upper bounds so that $\tilde R(\tilde z) \to 1$ as $n \to \infty$ for all $\tilde z \in \tilde S$ ($\tilde S$ is the support of $\tilde Z$). Thus, irrelevant components are asymptotically smoothed out.

To analyze the behavior of smoothing parameters associated with the relevant covariates, we replace $\tilde R(\tilde z)$ by 1 in (19); thus the second term on the right-hand side of (19) becomes
$$\frac{\kappa^{q_1}}{n h_1 \cdots h_{q_1}} \int \bar f(\bar z)^2 \delta(\bar z, \bar z)\, \bar M(\bar z)\, d\bar z, \qquad (22)$$
where
$$\bar M(\bar z) = \int \tilde f(\tilde z) M(z)\, d\tilde z, \qquad (23)$$
where, again, we have used $\int d\tilde z = \sum_{\tilde z^d} \int d\tilde z^c$.
Next, defining $a_s = h_s n^{1/(q_1+4)}$ and $b_s = \lambda_s n^{2/(q_1+4)}$, then (19) (with (22) as its second term since $\tilde R(\tilde z) \to 1$) becomes $n^{-4/(q_1+4)} \bar\chi(a_1, \dots, a_{q_1}, b_1, \dots, b_{r_1})$, where
$$\bar\chi(a_1, \dots, b_{r_1}) = \frac{\kappa^{q_1}}{a_1 \cdots a_{q_1}} \int \bar f(\bar z)^2 \delta(\bar z, \bar z)\, \bar M(\bar z)\, d\bar z$$
$$+ \int \left\{\left[\sum_{s=1}^{q_1} a_s^2 B_{1s}(\bar z) + \sum_{s=1}^{r_1} b_s B_{2s}(\bar z)\right]' m(\bar z)^{-1} \left[\sum_{s=1}^{q_1} a_s^2 B_{1s}(\bar z) + \sum_{s=1}^{r_1} b_s B_{2s}(\bar z)\right]\right\} \bar M(\bar z)\, d\bar z. \qquad (24)$$
Let $a_1^0, \dots, a_{q_1}^0, b_1^0, \dots, b_{r_1}^0$ denote the values of $a_1, \dots, a_{q_1}, b_1, \dots, b_{r_1}$ that minimize $\bar\chi$ subject to each of them being non-negative. We require that

Each $a_s^0$ is positive and each $b_s^0$ non-negative, all are finite and uniquely defined. (25)

The approach of Li & Zhou (2005) can be used to obtain primitive necessary and sufficient conditions that ensure that (25) holds true. The above analysis leads to the main result of this paper, which is stated in the following theorem.

Theorem 3.1. Assume that conditions (13), (15), (17), and (25) hold, and let $\hat h_1, \dots, \hat h_q, \hat\lambda_1, \dots, \hat\lambda_r$ denote the smoothing parameters that minimize $CV(\gamma)$. Then
$$n^{1/(q_1+4)} \hat h_s \to a_s^0 \text{ in probability for } 1 \le s \le q_1,$$
$$P\left(\hat h_s > C\right) \to 1 \text{ for } q_1+1 \le s \le q \text{ and for all } C > 0,$$
$$n^{2/(q_1+4)} \hat\lambda_s \to b_s^0 \text{ in probability for } 1 \le s \le r_1,$$
$$\hat\lambda_s \to 1 \text{ in probability for } r_1+1 \le s \le r, \qquad (26)$$
and
$$n^{-4/(q_1+4)} \left\{CV(\hat h_1, \dots, \hat h_q, \hat\lambda_1, \dots, \hat\lambda_r) - n^{-1}\sum_{i=1}^{n} u_i^2\right\} \to \inf \bar\chi \text{ in probability.}$$

The proof of Theorem 3.1 is given in Appendix A. Theorem 3.1 states that the smoothing parameters associated with the irrelevant covariates all converge to their upper bounds, so that, asymptotically, all irrelevant covariates are smoothed out, while the smoothing parameters associated with the relevant covariates all converge to zero at a rate that is optimal for minimizing asymptotic mean square error (MSE) (i.e., without the presence of irrelevant covariates).
From Theorem 3.1 one can easily obtain the following result.

Theorem 3.2. Under the same conditions given in Theorem 3.1,
$$\sqrt{n \hat h_1 \cdots \hat h_{q_1}}\left(\hat\beta(z) - \beta(\bar z) - \sum_{s=1}^{q_1} \hat h_s^2 B_{1s}(\bar z) - \sum_{s=1}^{r_1} \hat\lambda_s B_{2s}(\bar z)\right) \to N(0, \Omega(\bar z)) \text{ in distribution,}$$
where $\Omega(\bar z) = \{E[X_i X_i' | \bar Z_i = \bar z]\}^{-1} \kappa^{q_1} E[X_i X_i' \sigma^2(X_i, \bar Z_i) | \bar Z_i = \bar z] \{E[X_i X_i' | \bar Z_i = \bar z]\}^{-1}$, $\sigma^2(x, \bar z) = E(u_i^2 | X_i = x, \bar Z_i = \bar z)$, and where $\kappa = \int w^2(v)\, dv$. Note that $\Omega(\bar z)$ can be consistently estimated by $\hat A^{-1} \hat B \hat A^{-1}$, where
$$\hat A = \sum_{i=1}^{n} X_i X_i' K_{\hat\gamma,iz} \Big/ \sum_{i=1}^{n} K_{\hat\gamma,iz}, \qquad \hat B = \sum_{i=1}^{n} X_i X_i' \hat u_i^2 K_{\hat\gamma,iz}^2 \Big/ \sum_{i=1}^{n} K_{\hat\gamma,iz}, \qquad \text{and} \qquad \hat u_i = Y_i - X_i' \hat\beta(Z_i).$$

A sketch of the proof of Theorem 3.2 is provided in Appendix A.
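The plug-in covariance estimator $\hat A^{-1} \hat B \hat A^{-1}$ of Theorem 3.2 is straightforward to compute once the kernel weights are available; a minimal sketch, again assuming a Gaussian $w(\cdot)$ and the unordered discrete kernel (names ours):

```python
import numpy as np

def omega_hat(zc, zd, X, uhat, Zc, Zd, h, lam):
    """A-hat^{-1} B-hat A-hat^{-1}, with A-hat and B-hat as displayed in Theorem 3.2."""
    Wc = np.prod(np.exp(-0.5 * ((Zc - zc) / h) ** 2) / (h * np.sqrt(2 * np.pi)), axis=1)
    K = Wc * np.prod(np.where(Zd == zd, 1.0, lam), axis=1)
    A = (X * K[:, None]).T @ X / K.sum()                   # sum X X' K / sum K
    B = (X * (uhat**2 * K**2)[:, None]).T @ X / K.sum()    # sum X X' u^2 K^2 / sum K
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv
```

Since $\hat A$ and $\hat B$ are symmetric and $\hat B$ is positive semidefinite by construction, the resulting estimate is a symmetric positive semidefinite matrix, as a covariance estimate should be.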
4 Testing for Correct Varying-Coefficient Functional Specification
In this section we propose a test for correct parametric specification of $\beta(\cdot)$. Since all irrelevant covariates can be smoothed out asymptotically (see Theorem 3.1), in this section we shall restrict attention to the relevant covariates case. That is, we will omit the 'bar' notation in $\bar z$, and simply write $z$ to simplify the notation in what follows.

If one believes that $\beta(z)$ may be of a simple parametric functional form, say $\beta(z) = \beta_0$, where $\beta_0$ is a $p \times 1$ vector of parameter constants, then naturally one ought to test this hypothesis. Furthermore, if the hypothesis is believed to be true then the semiparametric varying-coefficient model becomes a fully parametric regression model, and one should therefore estimate the parametric model. Let $\beta_0(z, \alpha_0)$ denote a parametric varying-coefficient model of interest, where $\beta_0(\cdot, \cdot)$ has a known parametric functional form and where $\alpha_0 \in R^k$ is a finite $k$-dimensional unknown parameter vector. Formally, we state the null hypothesis as
$$H_0: \quad Pr[\beta(Z) = \beta_0(Z, \alpha_0)] = 1 \text{ for some } \alpha_0 \in \mathcal{B}, \qquad (27)$$
where $\mathcal{B}$ is a compact subset of $R^k$.

Let $\hat\alpha$ denote a $\sqrt{n}$-consistent estimator of $\alpha^*$, where $\alpha^* = \alpha_0$ under $H_0$. We will use the shorthand notation $\hat\beta_0(z)$ to denote $\beta_0(z, \hat\alpha)$. Following Li, Huang, Li & Fu
(2002), we construct our test statistic based on $\int [\hat\beta(z) - \hat\beta_0(z)]' [\hat\beta(z) - \hat\beta_0(z)]\, dz$, where $\hat\beta(z) = \left[n^{-1}\sum_{i=1}^{n} X_i X_i' K_{\hat\gamma,iz}\right]^{-1}\left[n^{-1}\sum_{i=1}^{n} X_i Y_i K_{\hat\gamma,iz}\right]$ and where $K_{\hat\gamma,iz} = \prod_{s=1}^{q_1} \hat h_s^{-1} w\big((Z_{is}^c - z_s^c)/\hat h_s\big) \prod_{s=1}^{r_1} l(Z_{is}^d, z_s^d, \hat\lambda_s)$, i.e., $\hat\beta(z)$ is the semiparametric estimator of $\beta(z)$ using cross-validated smoothing parameter selection. To avoid a random denominator problem in our test statistic, we modify the statistic by inserting a positive definite matrix $D_n(z)' D_n(z)$ between $\big(\hat\beta(z) - \hat\beta_0(z)\big)'$ and $\big(\hat\beta(z) - \hat\beta_0(z)\big)$, where $D_n(z) = n^{-1}\sum_{i=1}^{n} X_i X_i' K_{\hat\gamma,iz}$, which leads to the weighted statistic
$$\int \left[D_n(z)\big(\hat\beta(z) - \hat\beta_0(z)\big)\right]' \left[D_n(z)\big(\hat\beta(z) - \hat\beta_0(z)\big)\right] dz = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} X_i' X_j \hat u_i \hat u_j \int K_{\hat\gamma,iz} K_{\hat\gamma,jz}\, dz, \qquad (28)$$
where $\hat u_i = Y_i - X_i' \hat\beta_0(Z_i)$. To avoid a non-zero center term in the test statistic, we will remove the $i = j$ term in (28). Also, note that for $i \neq j$,
$$\int K_{\hat\gamma,iz} K_{\hat\gamma,jz}\, dz = \left[\int W_{\hat h, iz^c} W_{\hat h, jz^c}\, dz^c\right] \sum_{z^d \in S^d} L(Z_i^d, z^d, \hat\lambda) L(Z_j^d, z^d, \hat\lambda) = \bar W_{\hat h, ij} \bar L_{\hat\lambda, ij},$$
where $\bar W_{\hat h, ij} = \prod_{s=1}^{q_1} \hat h_s^{-1} \bar w\big((Z_{is}^c - Z_{js}^c)/\hat h_s\big)$, $\bar w(u) = \int w(v) w(u+v)\, dv$ is the two-fold convolution kernel defined from $w(\cdot)$, and $\bar L_{\hat\lambda, ij} = \sum_{z^d \in S^d} L(Z_i^d, z^d, \hat\lambda) L(Z_j^d, z^d, \hat\lambda)$. Since the convolution kernel $\bar w(\cdot)$ is also a (product) second order kernel function, there is no need to use a convolution kernel in practice. Rather, one can simply replace the convolution kernel $\bar w(\cdot)$ by the original kernel $w(\cdot)$. Similarly, one can replace $\bar L_{\hat\lambda, ij}$ by $L_{\hat\lambda, ij} = \prod_{s=1}^{r_1} l(Z_{is}^d, Z_{js}^d, \hat\lambda_s)$. Our modified test statistic is
then given by
$$\hat I_n = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j \neq i}^{n} X_i' X_j \hat u_i \hat u_j K_{\hat\gamma,ij}, \qquad (29)$$
where $K_{\hat\gamma,ij} = \prod_{s=1}^{q_1} \hat h_s^{-1} w\big((Z_{is}^c - Z_{js}^c)/\hat h_s\big) \prod_{s=1}^{r_1} l(Z_{is}^d, Z_{js}^d, \hat\lambda_s)$. The asymptotic null distribution of $\hat I_n$ is given in the following theorem.
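For completeness, a vectorized sketch of the statistic (29) and of its studentized version from Theorem 4.1 below; a Gaussian $w(\cdot)$ and the unordered discrete kernel are our illustrative choices, all covariates passed in are treated as relevant ($q = q_1$), and the function name is ours:

```python
import numpy as np

def t_stat(X, uhat, Zc, Zd, h, lam):
    """I_n of (29) and T_n = n * sqrt(h_1...h_q1) * I_n / sigma0_hat."""
    n = len(uhat)
    K = np.ones((n, n))
    for s in range(Zc.shape[1]):        # continuous part: prod h^{-1} w((Zis - Zjs)/h)
        d = (Zc[:, s, None] - Zc[None, :, s]) / h[s]
        K *= np.exp(-0.5 * d**2) / (h[s] * np.sqrt(2 * np.pi))
    for s in range(Zd.shape[1]):        # discrete part: unordered kernel l(.)
        K *= np.where(Zd[:, s, None] == Zd[None, :, s], 1.0, lam[s])
    np.fill_diagonal(K, 0.0)            # drop the i == j terms
    XX = X @ X.T
    uu = np.outer(uhat, uhat)
    I_n = (XX * uu * K).sum() / n**2
    sigma0_sq = 2.0 * (XX**2 * uu**2 * K**2).sum() / n**2
    return I_n, n * np.sqrt(np.prod(h)) * I_n / np.sqrt(sigma0_sq)
```

Note that $\hat I_n$ depends on the residuals only through the products $\hat u_i \hat u_j$, which is what allows the wild bootstrap below to replace them pairwise.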
Theorem 4.1. Under the conditions given in Theorem 3.1, then, under $H_0$, we have
$$\hat T_n \overset{def}{=} n \sqrt{\hat h_1 \cdots \hat h_{q_1}}\, \hat I_n / \hat\sigma_0 \overset{d}{\to} N(0, 1),$$
where $\kappa = \int w^2(v)\, dv$, $\sigma^2(X_i, Z_i) = E(u_i^2 | X_i, Z_i)$, and where $\hat\sigma_0^2 = 2 n^{-2}\sum_{i=1}^{n}\sum_{j \neq i}^{n} (X_i' X_j)^2 \hat u_i^2 \hat u_j^2 K_{\hat\gamma,ij}^2$ is a consistent estimator of $\sigma_0^2 = 2\kappa^{q_1} E[f(X_i, \bar Z_i)(X_i' X_j)^2 \sigma^4(X_i, Z_i)]$.
The proof of Theorem 4.1 is similar to the proof of Theorem 2.1 of Hsiao, Li & Racine (2007). A sketch of the proof of Theorem 4.1 is provided in Appendix B.

It can also be shown that if the null hypothesis is false then $\hat T_n$ converges to $+\infty$ at the rate $n\sqrt{\hat h_1 \cdots \hat h_{q_1}} = O_p\big(n^{(8+q_1)/(8+2q_1)}\big)$. Hence, the test statistic $\hat T_n$ provides a consistent basis for detecting a parametric null model. However, it is known that kernel-based consistent model specification tests such as $\hat T_n$ that use asymptotic null distributions (such as that given in Theorem 4.1) are plagued by severe size distortions in finite-sample applications. Therefore, we propose a nonparametric bootstrap procedure to approximate the null distribution of $\hat T_n$.

We advocate using the following residual-based wild bootstrap method to approximate the null distribution of $\hat T_n$. The wild bootstrap error $u_i^*$ is generated via a two-point distribution: $u_i^* = [(1-\sqrt{5})/2]\hat u_i$ with probability $(1+\sqrt{5})/(2\sqrt{5})$, and $u_i^* = [(1+\sqrt{5})/2]\hat u_i$ with probability $(\sqrt{5}-1)/(2\sqrt{5})$. From $\{u_i^*\}_{i=1}^n$, we generate the bootstrap sample dependent variable $Y_i^* = X_i' \beta_0(Z_i, \hat\alpha) + u_i^*$ for $i = 1, \dots, n$. Calling $\{Y_i^*, X_i, Z_i\}_{i=1}^n$ the 'bootstrap sample', we use this bootstrap sample to estimate the parametric null model and obtain $\hat\alpha^*$.⁴ Next, we compute the bootstrap residual given by $\hat u_i^* = Y_i^* - X_i' \beta_0(Z_i, \hat\alpha^*)$. The bootstrap test statistic $\hat T_n^*$ is obtained from $\hat T_n$ with $\hat u_i \hat u_j$ replaced by $\hat u_i^* \hat u_j^*$. Note that we use the same cross-validated smoothing parameters $\hat h_s$ ($s = 1, \dots, q_1$) and $\hat\lambda_s$ ($s = 1, \dots, r_1$) when computing the bootstrap statistics. That is, there is no need to rerun cross-validation for each bootstrap resample, hence our bootstrap test is computationally quite simple. In practice, we repeat the above steps a large number of times, say $B = 399$ times; see Davidson & MacKinnon (2000) for the appropriate number of bootstrap replications in these settings. The $B$ bootstrap test statistics approximate the empirical finite-sample null distribution of $\hat T_n$, and these $B$ statistics plus the original test statistic $\hat T_n$ can be used to compute nonparametric $P$-values.

Note that in the above bootstrap test procedure, we generate the bootstrap sample $Y_i^*$ according to the null model. Therefore, even if the null hypothesis is false, our test statistic $\hat T_n^*$ still mimics the null distribution of $\hat T_n$. Hence, the empirical distribution function of $\hat T_n^*$ can be used to approximate the null distribution of $\hat T_n$ in practice, regardless of the validity of the null hypothesis. The next theorem shows that the wild bootstrap works for the cross-validation-based $\hat T_n^*$ test.

Theorem 4.2. Under the same conditions as in Theorem 4.1, except that we do not require that the null hypothesis be true, we have
$$\sup_{v \in R} \left|P\left(\hat T_n^* \le v \mid \{Y_i, X_i, Z_i\}_{i=1}^n\right) - \Phi(v)\right| = o_p(1), \qquad (30)$$
where $\hat T_n^* = n\sqrt{\hat h_1 \cdots \hat h_{q_1}}\, \hat I_n^*/\hat\sigma_0^*$, and $\hat I_n^*$ and $\hat\sigma_0^*$ are defined in the same way as $\hat I_n$ and $\hat\sigma_0$ except that $\hat u_i \hat u_j$ is replaced by $\hat u_i^* \hat u_j^*$ wherever it occurs, and $\Phi(\cdot)$ is the cumulative distribution function of a standard normal random variable.

⁴ The computation of $\hat\alpha^*$ is the same as for computing $\hat\alpha$ except that we use the bootstrap sample $\{Y_i^*, X_i, Z_i\}_{i=1}^n$ rather than the original sample $\{Y_i, X_i, Z_i\}_{i=1}^n$.
The proof of Theorem 4.2 is similar to the proof of Theorem 2.2 of Hsiao et al. (2007). We provide a sketch of the proof of Theorem 4.2 in Appendix B. In the next section we report Monte Carlo simulations designed to examine the finite-sample performance of both the test statistic $\hat T_n^*$ and the semiparametric estimator $\hat\beta(\cdot)$.
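The two-point wild bootstrap draw described above is a one-liner in practice; a minimal sketch (function name ours):

```python
import numpy as np

def wild_bootstrap_errors(uhat, rng):
    """Two-point wild bootstrap of Section 4: E*[u*] = 0 and E*[u*^2] = uhat^2."""
    r5 = np.sqrt(5.0)
    a = (1.0 - r5) / 2.0                          # used with probability (1+sqrt5)/(2 sqrt5)
    b = (1.0 + r5) / 2.0                          # used with probability (sqrt5-1)/(2 sqrt5)
    pick_a = rng.uniform(size=len(uhat)) < (1.0 + r5) / (2.0 * r5)
    return np.where(pick_a, a * uhat, b * uhat)
```

Each bootstrap replication then forms $Y_i^* = X_i'\beta_0(Z_i, \hat\alpha) + u_i^*$, re-estimates the parametric null model, and recomputes $\hat T_n^*$ with the original cross-validated bandwidths held fixed, as described above.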
5 Finite-Sample Performance
5.1 The Finite-Sample Performance of $\hat\beta(z)$
In this section we outline a modest Monte Carlo simulation designed to highlight the finite-sample behavior of the proposed estimator $\hat\beta(\cdot)$. For what follows, we shall consider the behavior of two estimators, namely the proposed kernel method and the conventional frequency-based kernel method that breaks the data into subsets. We simulate data from
$$Y_i = \beta_{0j} + \beta_{1ji} X_{i1} + \beta_2 X_{i2} + u_i, \qquad i = 1, \dots, n, \quad j = 0, \dots, c-1,$$
where $X_{i1}$ and $X_{i2}$ are Uniform$[0, 1]$, $u_i \sim N(0, 1)$, and
$$\beta_{0j} = \beta_0 + \eta_{0j}, \quad \eta_{0j} \sim N(0, 1), \quad \beta_0 = 1, \quad j = 0, \dots, c-1,$$
$$\beta_{1ji} = \beta_{1j} + (Z_{i1}^c)^2, \quad \beta_{1j} = j + 1, \quad j = 0, \dots, c-1,$$
with $Z_{i1}^c \sim N(0, 1)$.
We draw $c$ subsets of size $n/c$ ($n > c$ and divides evenly) so that if, say, $n = 100$ and, say, $c = 2$, then we would have two subsets consisting of 50 observations for which $Y_i = \beta_{00} + \beta_{10i} X_{i1} + \beta_2 X_{i2} + u_i$ and 50 observations for which $Y_i = \beta_{01} + \beta_{11i} X_{i1} + \beta_2 X_{i2} + u_i$. Next we let $Z_{i1}^d$ denote 'group membership', i.e., if, say, $n = 100$ and, say, $c = 2$, then for the 50 observations for which $Y_i = \beta_{00} + \beta_{10i} X_{i1} + \beta_2 X_{i2} + u_i$ we set $Z_{i1}^d = 0$, while for the 50 observations for which $Y_i = \beta_{01} + \beta_{11i} X_{i1} + \beta_2 X_{i2} + u_i$ we set $Z_{i1}^d = 1$. Finally, we generate another discrete covariate $Z_{i2}^d \in \{0, \dots, c-1\}$ that is uncorrelated with $Y_i$, i.e., is 'irrelevant', though we do not presume this is known a priori. One can interpret $Z_1^d$ as a discrete covariate (say, 'group 1 membership') where the model changes with respect to $Z_1^d$, while $Z_2^d$ can be interpreted as a discrete covariate (say, 'group 2 membership') where there is no variation in the model with respect to this group. In other words, the true data generating process (DGP) is a function of $X_1$, $X_2$, $Z_1^c$ and $Z_1^d$ only, namely
$$Y_i = \gamma_0(Z_{i1}^d) + \gamma_1(Z_{i1}^d, Z_{i1}^c) X_{i1} + \gamma_2 X_{i2} + u_i.$$
covariates in the specification. The semiparametric model is of the form d d c c d d c d d Yi = γ0 (Zi1 , Zi2 , Zi1 ) + γ1 (Zi1 , Zi1 , Zi2 )Xi1 + γ2 (Zi1 , Zi1 , Zi2 )Xi2 + ui .
We conduct M = 1, 000 Monte Carlo replications from this DGP, estimate each model, consider settings having 4, 16, and 25 cells of data, and consider samples of size n = 100, 200, 300, 400, 500. In Table 1 we report the relative median MSE for the smooth and frequency-based kernel approaches (i.e., the ratio of the median MSE of the conventional sample-splitting approach to the median MSE of the proposed approach). Numbers greater than one indicate a loss of efficiency relative to the proposed method.
Table 1: Relative median MSE of the frequency c=2 100 1.92 200 1.82 300 1.80 400 1.79 500 1.79
estimator versus the proposed smooth estimator c=4 c=5 4.41 5.50 3.91 4.96 3.65 4.62 3.63 4.51 3.60 4.38
A quick scan of Table 1 reveals that the conventional semiparametric approach that breaks data into subsets results in substantial efficiency losses that worsen as the number of subsets increases and/or the sample size falls. ˆ 1 ), Z d (λ ˆ 2 ), and Next we consider the performance of the cross-validated bandwidths for Z1d (λ 2 ˆ Tables 2 through 4 summarize the median bandwidths over the M = 1, 000 Monte Carlo Z1c (h). replications. ˆ1 Tables 2 through 4 reveal that the cross-validated bandwidths for the relevant covariates (λ ˆ behave as expected converging to zero as n increases, while that for the irrelevant covariate and h) ˆ 2 ) converges instead to its upper bound value of one in probability as n increases. (λ This modest Monte Carlo simulation highlights the fact that the proposed method that smooths both the discrete and continuous covariates is more efficient that the conventional frequency-based approach. It also illustrates how cross-validated bandwidth selection delivers appropriate bandwidths for both relevant and irrelevant covariates. Next we consider the finite-sample performance of the proposed test for correct specification of parametric varying coefficient models.
5.2 The Finite-Sample Performance of the Test $\hat{T}_n$
In this section we report Monte Carlo simulations designed to examine the finite-sample performance of the proposed bootstrap-based test for the statistic $\hat{T}_n$. For what follows, the null model
Table 2: Median bandwidths for the proposed estimator (c = 2)

  n     lambda_1   lambda_2    h
 100      0.46       1.00     0.98
 200      0.34       1.00     0.78
 300      0.29       1.00     0.69
 400      0.28       1.00     0.62
 500      0.25       1.00     0.59

Table 3: Median bandwidths for the proposed estimator (c = 4)

  n     lambda_1   lambda_2    h
 100      0.30       0.99     1.12
 200      0.19       0.99     0.91
 300      0.15       0.99     0.80
 400      0.14       1.00     0.73
 500      0.12       1.00     0.68

Table 4: Median bandwidths for the proposed estimator (c = 5)

  n     lambda_1   lambda_2    h
 100      0.24       0.98     1.26
 200      0.16       0.99     0.95
 300      0.12       1.00     0.84
 400      0.10       1.00     0.77
 500      0.09       1.00     0.73
is a parametric model of the form

$$Y_i = \gamma_0(Z_{i1}^c, Z_{i1}^d, Z_{i2}^d) + \gamma_1(Z_{i1}^c, Z_{i1}^d, Z_{i2}^d)X_i + u_i = (1 + Z_{i1}^c + Z_{i1}^d + Z_{i2}^d) + X_i + u_i,$$

while the alternative model is given by

$$Y_i = \gamma_0(Z_{i1}^c, Z_{i1}^d, Z_{i2}^d) + \gamma_1(Z_{i1}^c, Z_{i1}^d, Z_{i2}^d)X_i + u_i = (1 + Z_{i1}^c + Z_{i1}^d + Z_{i2}^d + \cos(Z_{i1}^c)/2) + (1 + Z_{i1}^c/2)X_i + u_i,$$
where $x \sim \text{Uniform}[0, 1]$, $z^c \sim \text{Uniform}[-\pi, \pi]$, $z_1^d$ and $z_2^d$ are binomial draws taking c values ($z^d \in \{0, 1, \ldots, c-1\}$ with probability of success 1/2), while $u \sim N(0, 1)$. The estimated parametric model is always the null model. We simulate data under the null to
assess the test's empirical size, and then under the alternative to assess its empirical power. We conduct M = 1,000 Monte Carlo replications. For each replication we compute $\hat{T}_n$ using cross-validated bandwidth selection, conduct B = 399 bootstrap replications, and compute the empirical P-value. We then report the empirical rejection frequencies for $\alpha$ = 0.10, 0.05, and 0.01. We consider two variants of the test, namely, the proposed test and that using the traditional frequency (sample-splitting) approach ($\lambda_1 = \lambda_2 = 0$). Results for c = 2 (four sub-samples) and c = 4 (sixteen sub-samples) are presented in Tables 5 through 13.

Table 5: Empirical size for proposed test, c = 2

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.122          0.054          0.012
 200       0.108          0.053          0.011
 300       0.130          0.065          0.014
 400       0.121          0.065          0.011
 500       0.100          0.046          0.008

Table 6: Empirical size for the frequency test (lambda_1 = lambda_2 = 0), c = 2

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.109          0.056          0.015
 200       0.105          0.050          0.009
 300       0.119          0.063          0.017
 400       0.117          0.070          0.015
 500       0.096          0.040          0.008

Table 7: Empirical size for proposed test, c = 4

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.127          0.077          0.011
 200       0.104          0.066          0.010
 300       0.111          0.060          0.013
 400       0.108          0.046          0.008
 500       0.098          0.045          0.007

Table 8: Empirical size for the frequency test (lambda_1 = lambda_2 = 0), c = 4

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.119          0.063          0.012
 200       0.114          0.061          0.013
 300       0.104          0.059          0.018
 400       0.108          0.052          0.008
 500       0.097          0.043          0.008
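The bootstrap P-value computation performed for each replication can be sketched generically as follows. Here `stat_fn` and `fit_null` are hypothetical placeholders standing in for the paper's statistic $\hat{T}_n$ and the parametric null fit; B = 399 follows the Davidson & MacKinnon (2000) recommendation cited in the references.

```python
import numpy as np

def bootstrap_p_value(stat_fn, fit_null, y, x, B=399, seed=0):
    """Residual-bootstrap P-value: refit the null model, resample recentered
    residuals, recompute the statistic B times, and report the fraction of
    bootstrap statistics at least as large as the original statistic."""
    rng = np.random.default_rng(seed)
    t_hat = stat_fn(y, x)
    fitted, resid = fit_null(y, x)
    resid = resid - resid.mean()  # recenter residuals before resampling
    exceed = 0
    for _ in range(B):
        y_star = fitted + rng.choice(resid, size=y.size, replace=True)
        if stat_fn(y_star, x) >= t_hat:
            exceed += 1
    return (exceed + 1) / (B + 1)  # empirical P-value
```

Rejecting when this P-value falls below $\alpha$ yields the empirical rejection frequencies reported in the tables.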
Table 9: c = 2, H1

Table 10: Empirical power for proposed test, c = 2

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.755          0.663          0.448
 200       0.961          0.933          0.826
 300       0.996          0.993          0.959
 400       1.000          0.998          0.990
 500       1.000          1.000          0.999

Table 11: Empirical power for the frequency test (lambda_1 = lambda_2 = 0), c = 2

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.595          0.492          0.242
 200       0.924          0.864          0.665
 300       0.988          0.974          0.906
 400       0.998          0.994          0.982
 500       1.000          0.999          0.997

Table 12: Empirical power for proposed test, c = 4

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.605          0.490          0.247
 200       0.846          0.765          0.481
 300       0.947          0.894          0.710
 400       0.986          0.980          0.897
 500       0.998          0.994          0.957

Table 13: Empirical power for the frequency test (lambda_1 = lambda_2 = 0), c = 4

  n     alpha = 0.10   alpha = 0.05   alpha = 0.01
 100       0.423          0.312          0.110
 200       0.746          0.611          0.315
 300       0.903          0.825          0.573
 400       0.982          0.957          0.840
 500       0.997          0.985          0.929
Tables 5 through 13 reveal that i) the proposed test is correctly sized, ii) it has power that increases with n, and iii) it has substantially higher power than its frequency-based counterpart.
6 Boston Housing Data Application
In this section we consider the 1970s era Boston housing data that have been extensively analyzed by a number of authors; see, e.g., Chaudhuri, Doksum & Samarov (1997). Our model is of the form $Y_i = X_i'\beta(Z_i) + u_i$, where $y$ is the median value of owner-occupied homes in \$1,000s in a town (MEDV), $x$ includes the average number of rooms per dwelling (RM) and the log of the per capita crime rate by town (log(CRIM)), and $z$ includes the percentage of the population of lower status (LSTAT, which lies in the range [0, 100]) and a Charles River dummy variable (CHAS, which equals 1 if a tract bounds the river and 0 otherwise). This dataset contains n = 506 observations in total.

We estimate three models. The first is a simple constant parameter linear parametric model of the form

$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \gamma_3 Z_i^c + \gamma_4 Z_i^d + u_i.$$

The second is a linear parametric varying coefficient specification whose coefficients are linear in LSTAT and CHAS, of the form

$$Y_i = \gamma_0(Z_i^c, Z_i^d) + \gamma_1(Z_i^c, Z_i^d)X_{i1} + \gamma_2(Z_i^c, Z_i^d)X_{i2} + u_i = (a_0 + a_1 Z_i^c + a_2 Z_i^d) + (b_0 + b_1 Z_i^c + b_2 Z_i^d)X_{i1} + (c_0 + c_1 Z_i^c + c_2 Z_i^d)X_{i2} + u_i.$$

The third model is the proposed semiparametric smooth coefficient model.

For the in-sample measure of fit we compare models using the following measure. Letting $Y_i$ denote the outcome and $\hat{Y}_i$ the fitted value for observation $i$, we may define $R^2$ as

$$R^2 = \frac{\left[\sum_{i=1}^n (Y_i - \bar{Y})(\hat{Y}_i - \bar{Y})\right]^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2 \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2},$$
and this measure always lies in the range [0, 1], with the value 1 denoting a perfect fit to the sample data and 0 denoting no predictive power beyond that given by the unconditional mean of $Y$. It can be demonstrated that this method of computing $R^2$ is identical to the standard measure computed as $\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 / \sum_{i=1}^n (Y_i - \bar{Y})^2$ when the model is linear, includes an intercept term, and is fitted by OLS.

We split the sample into $n_1 = 450$ estimation observations and $n_2 = 56$ independent evaluation observations. We compute the out-of-sample predicted square error via $\text{PMSE} = n_2^{-1}\sum_{i=1}^{n_2}(Y_i - \hat{Y}_i)^2$. We repeat this process 1,000 times and compute the median PMSE and $R^2$. We also provide density plots for the PMSE and $R^2$ values of the parametric varying coefficient specification and
the semiparametric specification for all 1,000 splits of the data in Figures 1 and 2 (we omit the simple linear model to avoid cluttering the figures).
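The two fit measures used above are straightforward to compute; a minimal sketch (with illustrative names, implementing exactly the formulas in the text) is:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 as the squared (co)variation measure defined in the text:
    [sum (Y - Ybar)(Yhat - Ybar)]^2 / [sum (Y - Ybar)^2 sum (Yhat - Ybar)^2].
    By Cauchy-Schwarz this always lies in [0, 1]."""
    y_bar = y.mean()
    num = (((y - y_bar) * (y_hat - y_bar)).sum()) ** 2
    den = ((y - y_bar) ** 2).sum() * ((y_hat - y_bar) ** 2).sum()
    return num / den

def pmse(y_eval, y_pred):
    """Out-of-sample predicted mean square error over the hold-out sample."""
    return ((y_eval - y_pred) ** 2).mean()
```

Because $R^2$ is computed this way, it applies equally to the parametric and semiparametric fits, neither of which need be an OLS projection.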
Figure 1: Predicted MSE density plots for the parametric varying coefficient specification and the semiparametric specification (PMSE on the horizontal axis, density on the vertical axis; parametric median = 17.29, semiparametric median = 14.85).
Figure 2: In-sample $R^2$ density plots for the parametric varying coefficient specification and the semiparametric specification ($R^2$ on the horizontal axis, density on the vertical axis; parametric median = 0.787, semiparametric median = 0.873).

Summarizing, the median cross-validated bandwidths for LSTAT (whose standard deviation is 7.14) and CHAS are $\hat{h}_{med} = 0.91$
and $\hat{\lambda}_{med} = 0.12$. The median PMSE for the proposed method is 14.85, for the constant parameter parametric specification is 28.89, and for the varying coefficient parametric specification is 17.29. The median $R^2$ for the semiparametric method is 0.87, for the constant parameter parametric specification is 0.66, and for the parametric varying coefficient specification is 0.79. Both the in-sample and out-of-sample performance of the proposed method dominate that of the parametric methods.

Next, using the full n = 506 observations, we apply the proposed test for correct specification to the simple constant parameter linear parametric model and to the linear parametric varying coefficient specification whose coefficients are linear in LSTAT and CHAS. For the former we obtain $\hat{I}_n = 2.769$ with a P-value < 0.0025 (the 90th, 95th, and 99th percentiles of the null distribution of $\hat{I}_n$, based on the B = 399 bootstrap statistics, are 0.070, 0.162, and 0.410), while for the latter we obtain $\hat{I}_n = 0.380$ with a P-value < 0.0025 (the 90th, 95th, and 99th percentiles of the null distribution of $\hat{I}_n$ are 0.069, 0.133, and 0.253). These results indicate that the proposed test soundly rejects each parametric specification. On the basis of the relative out-of-sample performance of the proposed estimator versus the parametric specifications, and on the basis of the proposed test for correct parametric specification, we conclude that the proposed method yields a more faithful description of the underlying DGP than the linear constant parameter and varying parameter parametric models that are often found in applied research.
7 Summary
In this paper we propose a novel varying-coefficient model that is capable of handling the mix of qualitative and quantitative covariates commonly encountered in applied work. A data-driven bandwidth selection method is proposed, theoretical underpinnings are provided, and a Monte Carlo experiment is undertaken to examine the finite-sample behavior of the proposed estimator. We also propose a consistent test for correct parametric specification of varying coefficient models. An application to the Boston housing data underscores the potential value-added of the proposed estimator for practitioners. In closing, we note that results for the case in which all elements of Z are qualitative cannot be obtained as a special case of those derived herein (which require the presence of at least one continuous variable); hence we must treat this case separately, and we intend to do so in future work.
References

Aitchison, J. & Aitken, C. G. G. (1976), 'Multivariate binary discrimination by the kernel method', Biometrika 63(3), 413-420.

Bhattacharya, R. N. & Rao, R. R. (1986), Normal Approximations and Asymptotic Expansions, R.E. Krieger Publishing Company, Malabar, FL.

Cai, Z., Fan, J. & Yao, Q. W. (2000), 'Functional-coefficient regression models for nonlinear time series', Journal of the American Statistical Association 95, 941-956.

Chaudhuri, P., Doksum, K. & Samarov, A. (1997), 'On average derivative quantile regression', Annals of Statistics 25(2), 715-744.

Davidson, R. & MacKinnon, J. G. (2000), 'Bootstrap tests: How many bootstraps?', Econometric Reviews 19, 55-68.

de Jong, P. (1987), 'A central limit theorem for generalized quadratic forms', Probability Theory and Related Fields 75, 261-277.

Fan, J., Yao, Q. W. & Cai, Z. (2003), 'Adaptive varying-coefficient linear models', Journal of the Royal Statistical Society, Series B 65, 57-80.

Gu, C. & Ma, P. (2005), 'Generalized nonparametric mixed-effect models: Computation and smoothing parameter selection', Journal of Computational and Graphical Statistics 14, 485-504.

Hall, P. (1984), 'Central limit theorem for integrated square error of multivariate nonparametric density estimators', Journal of Multivariate Analysis 14, 1-16.

Hall, P., Li, Q. & Racine, J. S. (2007), 'Nonparametric estimation of regression functions in the presence of irrelevant regressors', The Review of Economics and Statistics 89, 784-789.

Hall, P., Racine, J. S. & Li, Q. (2004), 'Cross-validation and the estimation of conditional probability densities', Journal of the American Statistical Association 99(468), 1015-1026.

Härdle, W., Hall, P. & Marron, J. S. (1988), 'How far are automatically chosen regression smoothing parameters from their optimum?', Journal of the American Statistical Association 83, 86-101.

Härdle, W., Hall, P. & Marron, J. S. (1992), 'Regression smoothing parameters that are not far from their optimum', Journal of the American Statistical Association 87, 227-233.

Härdle, W. & Marron, J. S. (1985), 'Optimal bandwidth selection in nonparametric regression function estimation', The Annals of Statistics 13, 1465-1481.

Hsiao, C., Li, Q. & Racine, J. S. (2007), 'A consistent model specification test with mixed categorical and continuous data', Journal of Econometrics 140, 802-826.

Journal of Multivariate Analysis (2004), Vol. 91, Academic Press, Inc., Orlando, FL, USA. Special issue on semiparametric and nonparametric mixed models.

Lee, J. (1990), U-Statistics: Theory and Practice, Marcel Dekker, New York.

Li, Q., Huang, C. J., Li, D. & Fu, T. T. (2002), 'Semiparametric smooth coefficient models', Journal of Business and Economic Statistics 20, 412-422.

Li, Q. & Zhou, J. (2005), 'The uniqueness of cross-validation selected smoothing parameters in kernel estimation of nonparametric models', Econometric Theory 21(5), 1017-1025.

Masry, E. (1996), 'Multivariate regression estimation: local polynomial fitting for time series', Stochastic Processes and Their Applications 65, 81-101.

Rosenthal, H. P. (1970), 'On the subspace of $l_p$ ($p \ge 1$) spanned by sequences of independent random variables', Israel Journal of Mathematics 8, 273-303.

Zeger, S. L. & Diggle, P. J. (1994), 'Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters', Biometrics 50, 689-699.

Zhang, D., Lin, X., Raz, J. & Sowers, M. (1998), 'Semiparametric stochastic mixed models for longitudinal data', Journal of the American Statistical Association 93(442), 710-719.
A Proofs of Theorems 3.1 and 3.2

A.1 Preliminaries
The proof of Theorem 3.1 is quite tedious. It is therefore necessary to introduce some shorthand notation and preliminary manipulations in order to simplify the derivations that follow. For the reader's convenience we list most of the notation used in this appendix directly below.

1. We use $\beta_i$ to denote $\beta(Z_i)$ and $\hat{\beta}_{-i}$ to denote $\hat{\beta}_{-i}(Z_i)$.

2. We define $\sum_i = \sum_{i=1}^n$, $\sum\sum_{j \ne i} = \sum_{i=1}^n \sum_{j=1, j \ne i}^n$, and $\sum\sum\sum_{l \ne j \ne i} = \sum_{i=1}^n \sum_{j=1, j \ne i}^n \sum_{l=1, l \ne i, l \ne j}^n$.

3. We write $A_n = B_n + (s.o.)$ to denote the fact that $B_n$ is the leading term of $A_n$, where $(s.o.)$ denotes terms having probability orders smaller than $B_n$. $A_i = B_i + (s.o.)$ always means that $n^{-1}\sum_i A_i = n^{-1}\sum_i B_i + (s.o.)$, and $A_{ij} = B_{ij} + (s.o.)$ means that $n^{-2}\sum_i\sum_j A_{ij} = n^{-2}\sum_i\sum_j B_{ij} + (s.o.)$. Also, we write $A_n \sim B_n$ to mean that $A_n$ and $B_n$ have the same order of magnitude in probability.

4. For notational simplicity we often ignore the difference between $n^{-1}$ and $(n-1)^{-1}$, simply because this has no effect on the asymptotic analysis.

In the proofs that follow we make use of Rosenthal's inequality and the U-statistic H-decomposition. We present these results below for the reader's convenience.

Rosenthal's Inequality. Let $p \ge 2$ be a positive constant and let $X_1, \ldots, X_n$ denote i.i.d. random variables for which $E(X_i) = 0$ and $E(|X_i|^p) < \infty$. Then there exists a positive constant (which may depend on $p$)
$C(p)$ such that

$$E\left(\left|\sum_{i=1}^n X_i\right|^p\right) \le C(p)\left\{\sum_{i=1}^n E(|X_i|^p) + \left[\sum_{i=1}^n E(X_i^2)\right]^{p/2}\right\}. \tag{A.1}$$
Equation (A.1) is widely known as 'Rosenthal's inequality' (see Rosenthal (1970)).

The H-decomposition for U-Statistics. Let $(n, k) = n!/[k!(n-k)!]$ denote the number of combinations obtained by choosing $k$ items from $n$ (distinct) items. Then a general $k$th order U-statistic $U_{(k)}$ is defined by

$$U_{(k)} = \frac{1}{(n, k)} \sum_{1 \le i_1 < \cdots < i_k \le n} H_n(X_{i_1}, \ldots, X_{i_k}). \tag{A.2}$$

For a third order U-statistic the H-decomposition is

$$U_{(3)} = \theta + \frac{3}{n}\sum_i (H_{n,i} - \theta) + \frac{6}{n(n-1)}\sum\sum_{j>i}\left(H_{n,ij} - H_{n,i} - H_{n,j} + \theta\right) + \frac{6}{n(n-1)(n-2)}\sum\sum\sum_{l>j>i}\left(H_{n,ijl} - H_{n,ij} - H_{n,jl} - H_{n,li} + H_{n,i} + H_{n,j} + H_{n,l} - \theta\right), \tag{A.3}$$

where $H_{n,ijl} = H_n(X_i, X_j, X_l)$, $H_{n,ij} = E[H_{n,ijl}|X_i, X_j]$, $H_{n,i} = E[H_{n,ij}|X_i]$ and $\theta = E[H_{n,ijl}]$. For a derivation of the above formulas, as well as the H-decomposition of a general $k$th order U-statistic, see Lee (1990, page 26).

Proof of Theorem 3.1. Using (11) and $Y_i = X_i'\beta_i + u_i$, we have (where $\beta_i = \beta(Z_i)$, $\hat{\beta}_{-i} = \hat{\beta}_{-i}(Z_i)$ and $M_i = M(Z_i)$)

$$CV(\gamma) = \frac{1}{n}\sum_i \left[Y_i - X_i'\hat{\beta}_{-i}\right]^2 M_i = \frac{1}{n}\sum_i \left[X_i'(\beta_i - \hat{\beta}_{-i})\right]^2 M_i + \frac{2}{n}\sum_i u_i X_i'(\beta_i - \hat{\beta}_{-i}) M_i + n^{-1}\sum_i u_i^2 M_i. \tag{A.4}$$
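As a small concrete check of definition (A.2), note that the symmetric kernel $h(x_1, x_2) = (x_1 - x_2)^2/2$ makes the second order U-statistic equal to the unbiased sample variance. The sketch below (an illustration, not taken from the paper) computes a U-statistic by brute-force enumeration over all $\binom{n}{k}$ subsets.

```python
import itertools
import numpy as np

def u_statistic(x, h, k):
    """k-th order U-statistic with symmetric kernel h: the average of h
    over all C(n, k) subsets of the sample, as in definition (A.2)."""
    vals = [h(*combo) for combo in itertools.combinations(x, k)]
    return float(np.mean(vals))

# The kernel h(x1, x2) = (x1 - x2)^2 / 2 yields the unbiased sample variance.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
u = u_statistic(x, lambda a, b: 0.5 * (a - b) ** 2, 2)
assert np.isclose(u, np.var(x, ddof=1))
```

Enumeration is exponential in $k$ and is used here only to make the definition concrete; the H-decomposition (A.3) is precisely the device that lets the proofs below avoid working with the full enumeration.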
Below we obtain the leading terms of $CV(\gamma)$. We use $CV_0(\gamma)$ to denote the first two terms on the right-hand side of (A.4). Minimizing $CV(\gamma)$ over $\gamma = (h, \lambda)$ is equivalent to minimizing $CV_0(\gamma)$, as $n^{-1}\sum_i u_i^2 M_i$ does not depend on $\gamma = (h, \lambda)$, where

$$CV_0(\gamma) \stackrel{def}{=} n^{-1}\sum_i \left[X_i'(\beta_i - \hat{\beta}_{-i})\right]^2 M_i + 2n^{-1}\sum_i u_i X_i'(\beta_i - \hat{\beta}_{-i}) M_i = CV_{0,1} + CV_{0,2}, \tag{A.5}$$
where the definitions of $CV_{0,1}$ and $CV_{0,2}$ should be apparent. Replacing $Y_j = X_j'\beta_j + u_j = X_j'\beta_i + X_j'(\beta_j - \beta_i) + u_j$ in (12), we get

$$\hat{\beta}_{-i} = \beta_i + \left[\hat{A}(z_i)\right]^{-1}\left[n^{-1}\sum_{j \ne i} X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij} + n^{-1}\sum_{j \ne i} X_j u_j K_{\gamma,ij}\right] = \beta_i + \left[\hat{A}_i\right]^{-1}\left[\hat{B}_i + \hat{C}_i\right], \tag{A.6}$$

where $\hat{A}_i = n^{-1}\sum_{j \ne i} X_j X_j' K_{\gamma,ij}$, $\hat{B}_i = n^{-1}\sum_{j \ne i} X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}$ and $\hat{C}_i = n^{-1}\sum_{j \ne i} X_j u_j K_{\gamma,ij}$. Substituting (A.6) into $CV_{0,1}$ we get

$$CV_{0,1} = n^{-1}\sum_i \left\{X_i'\hat{A}_i^{-1}\left[\hat{B}_i + \hat{C}_i\right]\right\}^2 M_i = n^{-1}\sum_i \left[X_i'\hat{A}_i^{-1}\hat{B}_i\right]^2 M_i + n^{-1}\sum_i \left[X_i'\hat{A}_i^{-1}\hat{C}_i\right]^2 M_i + 2n^{-1}\sum_i \left[X_i'\hat{A}_i^{-1}\hat{B}_i\right]\left[X_i'\hat{A}_i^{-1}\hat{C}_i\right] M_i \equiv CV_1 + CV_2 + 2CV_3,$$

where the definitions of $CV_j$ ($j = 1, 2, 3$) should be apparent. In Lemmas A.1 through A.3 below we show, uniformly in $(h, \lambda) \in \Gamma$, that
$$CV_1 = \int \left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar{z}) + \sum_{s=1}^{r_1} \lambda_s B_{2s}(\bar{z})\right]' m(\bar{z})^{-1} \left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar{z}) + \sum_{s=1}^{r_1} \lambda_s B_{2s}(\bar{z})\right] \bar{M}(\bar{z})\, d\bar{z} + o_p\left(\zeta_n^2 + (nh_1 \ldots h_{q_1})^{-1}\right), \tag{A.7}$$

$$CV_2 = \frac{\kappa_{q_1}}{nh_1 \ldots h_{q_1}} \int \bar{f}(\bar{z})^2 \tilde{f}(\tilde{z})\, \delta(\bar{z}, \bar{z})\, \tilde{R}(\tilde{z})\, M(z)\, dz, \tag{A.8}$$

$$CV_3 = o_p\left(\zeta_n^2 + (nh_1 \ldots h_{q_1})^{-1}\right), \tag{A.9}$$

where $\bar{M}(\bar{z}) = \int \tilde{f}(\tilde{z})M(z)\, d\tilde{z}$, $\delta(\bar{z}_i, \bar{z}_j) = E[(X_i' m(\bar{z}_i)^{-1} X_j)^2 \sigma^2(X_j, \bar{z}_j)|\bar{z}_i, \bar{z}_j]$, $m(\bar{z}) = E[X_i X_i'|\bar{Z}_i = \bar{z}]\bar{f}(\bar{z})$, $\tilde{R}(\tilde{z}) = \nu_2(\tilde{z})/\nu_1(\tilde{z})^2$, $B_{1s}(\cdot)$ and $B_{2s}(\cdot)$ are defined in (18), and $\zeta_n = \sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1} \lambda_s$. Also, in Lemma A.4 we show that

$$CV_{0,2} = o_p\left(\zeta_n^2 + (nh_1 \ldots h_{q_1})^{-1}\right). \tag{A.10}$$
Hence, the leading term of $CV_0(\gamma)$ is given by $CV_1 + CV_2$. This proves (19). The remaining steps for proving Theorem 3.1 were stated in Section 3; see (20) through (24).

Proof of Theorem 3.2. By Theorem 3.1 we know that $\hat{h}_s \stackrel{p}{\to} +\infty$ for $s = q_1+1, \ldots, q$ and $\hat{\lambda}_s \stackrel{p}{\to} 1$ for $s = r_1+1, \ldots, r$. Therefore, we need only consider the case with all irrelevant covariates removed, i.e., we consider $\hat{\beta}(\bar{z}) = [\sum_i X_i X_i' \bar{K}_{\hat{\gamma},i\bar{z}}]^{-1} \sum_i X_i Y_i \bar{K}_{\hat{\gamma},i\bar{z}}$, where $\bar{K}_{\hat{\gamma},i\bar{z}} = \left[\prod_{s=1}^{q_1} \hat{h}_s^{-1} w((z_{is}^c - z_s^c)/\hat{h}_s)\right]\left[\prod_{s=1}^{r_1} l(z_{is}^d, z_s^d, \hat{\lambda}_s)\right]$.
We first consider the benchmark case whereby we use non-stochastic smoothing parameters. Define $h_s^0 = a_s^0 n^{-1/(4+q_1)}$ for $s = 1, \ldots, q_1$, and $\lambda_s^0 = b_s^0 n^{-2/(4+q_1)}$ for $s = 1, \ldots, r_1$, where $a_s^0$ and $b_s^0$ are defined in (25). Also, define

$$\bar{\beta}(\bar{z}) = \left[\sum_{i=1}^n X_i X_i' \bar{K}_{\gamma^0,i\bar{z}}\right]^{-1} \sum_{i=1}^n X_i Y_i \bar{K}_{\gamma^0,i\bar{z}},$$

where $\bar{K}_{\gamma^0,i\bar{z}} = \left[\prod_{s=1}^{q_1} (h_s^0)^{-1} w((z_{is}^c - z_s^c)/h_s^0)\right]\left[\prod_{s=1}^{r_1} l(z_{is}^d, z_s^d, \lambda_s^0)\right]$. Then, using $Y_i = X_i'\beta(\bar{z}_i) + u_i$ and adding and subtracting terms in $\bar{\beta}(\bar{z})$, we obtain

$$\bar{\beta}(\bar{z}) = \left[\frac{1}{nh_1^0 \ldots h_{q_1}^0}\sum_i X_i X_i' \bar{K}_{\gamma^0,i\bar{z}}\right]^{-1} \frac{1}{nh_1^0 \ldots h_{q_1}^0}\sum_i X_i \left[X_i'\beta(\bar{z}) + X_i'(\beta(\bar{z}_i) - \beta(\bar{z})) + u_i\right]\bar{K}_{\gamma^0,i\bar{z}} = \beta(\bar{z}) + \left[A^0(\bar{z})\right]^{-1}\left[B^0(\bar{z}) + C^0(\bar{z})\right], \tag{A.11}$$

where $A^0(\bar{z}) = (nh_1^0 \ldots h_{q_1}^0)^{-1}\sum_i X_i X_i' \bar{K}_{\gamma^0,i\bar{z}}$, $B^0(\bar{z}) = (nh_1^0 \ldots h_{q_1}^0)^{-1}\sum_i X_i X_i'(\beta(\bar{z}_i) - \beta(\bar{z}))\bar{K}_{\gamma^0,i\bar{z}}$ and $C^0(\bar{z}) = (nh_1^0 \ldots h_{q_1}^0)^{-1}\sum_i X_i u_i \bar{K}_{\gamma^0,i\bar{z}}$.

By the same arguments as we used in the proof of Lemma A.1, one can show that $A^0(\bar{z}) = E[X_i X_i'|\bar{z}_i = \bar{z}]\bar{f}(\bar{z}) + o_p(1) \equiv m(\bar{z}) + o_p(1)$ (it is easily seen that $A^0(\bar{z})$ is a local constant kernel estimator of $E[X_i X_i'|\bar{z}_i = \bar{z}]\bar{f}(\bar{z})$), and that $B^0(\bar{z}) = \bar{f}(\bar{z})\left[\sum_{s=1}^{q_1} (h_s^0)^2 B_{1s}(\bar{z}) + \sum_{s=1}^{r_1} \lambda_s^0 B_{2s}(\bar{z})\right] + o_p(\zeta_n^0)$, where $\zeta_n^0 = \sum_{s=1}^{q_1} (h_s^0)^2 + \sum_{s=1}^{r_1} \lambda_s^0$. Obviously, $C^0(\bar{z})$ has zero mean, and its asymptotic variance, namely $n^{-1}(h_1^0 \ldots h_{q_1}^0)^{-2} E\left[X_i X_i' \sigma^2(X_i, \bar{z}_i) \bar{K}_{\gamma^0,i\bar{z}}^2\right]$, is given by

$$\kappa_{q_1}(nh_1^0 \ldots h_{q_1}^0)^{-1}\left\{\bar{f}(\bar{z})^2 E[X_i X_i' \sigma^2(X_i, \bar{z}_i)|\bar{z}_i = \bar{z}] + o(1)\right\} \equiv (nh_1^0 \ldots h_{q_1}^0)^{-1}\left[V(\bar{z}) + o(1)\right],$$

where $V(\bar{z}) = \kappa_{q_1}\bar{f}(\bar{z})^2 E[X_i X_i' \sigma^2(X_i, \bar{z}_i)|\bar{z}_i = \bar{z}]$. By applying a triangular-array central limit theorem (CLT), we know that

$$\sqrt{nh_1^0 \ldots h_{q_1}^0}\left[\bar{\beta}(\bar{z}) - \beta(\bar{z}) - \sum_{s=1}^{q_1} (h_s^0)^2 B_{1s}(\bar{z}) - \sum_{s=1}^{r_1} \lambda_s^0 B_{2s}(\bar{z})\right] \stackrel{d}{\to} N\left(0, m_z^{-1} V_z m_z^{-1}\right), \tag{A.12}$$

where $m_z = m(\bar{z})$ and $V_z = V(\bar{z})$. Note that the factors of $\bar{f}(\bar{z})$ in $m_z^{-1}$ and in $V_z$ cancel each other, so that $m_z^{-1} V_z m_z^{-1} = \Omega(\bar{z})$ as given in Theorem 3.2.

Next, we consider $\hat{\beta}(\bar{z}) = [\sum_i X_i X_i' \bar{K}_{\hat{\gamma},i\bar{z}}]^{-1}\sum_i X_i Y_i \bar{K}_{\hat{\gamma},i\bar{z}}$, where $\bar{K}_{\hat{\gamma},i\bar{z}} = \left[\prod_{s=1}^{q_1} \hat{h}_s^{-1} w((z_{is}^c - z_s^c)/\hat{h}_s)\right]\left[\prod_{s=1}^{r_1} l(z_{is}^d, z_s^d, \hat{\lambda}_s)\right]$. Therefore, the only difference between $\hat{\beta}(\bar{z})$ and $\bar{\beta}(\bar{z})$ is that the former uses the cross-validated smoothing parameters, while the latter uses the benchmark non-stochastic smoothing parameters. By Theorem 3.1 we know that $\hat{h}_s/h_s^0 \stackrel{p}{\to} 1$ for $s = 1, \ldots, q_1$, and $\hat{\lambda}_s/\lambda_s^0 \stackrel{p}{\to} 1$ for $s = 1, \ldots, r_1$. By using stochastic equicontinuity arguments as in Hall, Racine & Li (2004), or in Hsiao et al. (2007), one can show that $\hat{\beta}(\bar{z})$ and $\bar{\beta}(\bar{z})$ have the same asymptotic distribution, i.e.,

$$\sqrt{n\hat{h}_1 \ldots \hat{h}_{q_1}}\left[\hat{\beta}(\bar{z}) - \beta(\bar{z}) - \sum_{s=1}^{q_1} \hat{h}_s^2 B_{1s}(\bar{z}) - \sum_{s=1}^{r_1} \hat{\lambda}_s B_{2s}(\bar{z})\right] \stackrel{d}{\to} N\left(0, m_z^{-1} V_z m_z^{-1}\right). \tag{A.13}$$

Finally, we show that $\hat{A}^{-1}\hat{B}\hat{A}^{-1}$ (defined in Theorem 3.2) is a consistent estimator of $\Omega(\bar{z})$. Writing $\bar{f}_z = \bar{f}(\bar{z})$, it is straightforward to show that $\hat{A} = m_z/\bar{f}_z + o_p(1)$ and $\hat{B} = V_z/\bar{f}_z^2 + o_p(1)$. Hence, $\hat{A}^{-1}\hat{B}\hat{A}^{-1} = m_z^{-1} V_z m_z^{-1} + o_p(1) \equiv \Omega(\bar{z}) + o_p(1)$. This completes the proof of Theorem 3.2.

In the remaining part of this appendix we prove the lemmas that are used in the proof of Theorem 3.1. For notational convenience we will write $z_i$ for $Z_i$.

Lemma A.1. Equation (A.7) holds true.

Proof. By Lemma A.5 we know that $\hat{A}(z)$ is the kernel estimator of $\mu(z) = m(\bar{z})\nu_1(\tilde{z})$, where $m(\bar{z}) = E[X_i X_i'|\bar{Z}_i = \bar{z}]\bar{f}(\bar{z})$ and $\nu_1(\tilde{z}) = E[\tilde{K}_{\tilde{\gamma},ij}|\tilde{z}_i = \tilde{z}]$. Therefore, we know (see Lemma A.5) that the leading term of $\hat{A}(z_i)^{-1}$ is $\mu(z_i)^{-1}$. Define $CV_1^0$ by replacing $\hat{A}(z_i)^{-1}$ in $CV_1$ by its leading term $\mu(z_i)^{-1}$. Then, using the result of Lemma A.5, it is easy to show that $CV_1 = CV_1^0 + (s.o.)$. Hence,
we only need to consider $CV_1^0$, which is defined by (recall that $\hat{B}_i = n^{-1}\sum_{j \ne i} X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}$)

$$CV_1^0 = n^{-1}\sum_i \left[X_i'\mu(z_i)^{-1}\hat{B}_i\right]^2 M_i = n^{-3}\sum_i \sum_{j \ne i}\sum_{l \ne i} X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}\; X_i'\mu(z_i)^{-1}X_l X_l'(\beta_l - \beta_i)K_{\gamma,il}\, M_i$$
$$= n^{-3}\sum_i \sum_{j \ne i}\left[X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}\right]^2 M_i + n^{-3}\sum_i \sum_{j \ne i}\sum_{l \ne i, l \ne j} X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}\; X_i'\mu(z_i)^{-1}X_l X_l'(\beta_l - \beta_i)K_{\gamma,il}\, M_i \equiv CV_{1,1} + CV_{1,2},$$

where the definitions of $CV_{1,1}$ and $CV_{1,2}$ should be apparent.

By noting that $\mu(z_i) = m(\bar{z}_i)\nu_1(\tilde{z}_i)$ and that $\nu_1(\tilde{z}) = E[\tilde{K}_{\tilde{\gamma},ij}|\tilde{z}_i]$, it is fairly straightforward to show that (note that $|M(\cdot)| \le 1$)

$$E[|CV_{1,1}|] \le n^{-3}n(n-1)\, E\left\{\left[X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\right]^2\right\} E\left\{\tilde{K}_{\tilde{\gamma},ij}^2/\nu_1(\tilde{z}_i)^2\right\} \le C(nh_1 \ldots h_{q_1})^{-1}\left(\sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1} \lambda_s\right) = o\left((nh_1 \ldots h_{q_1})^{-1}\right), \tag{A.14}$$

because it is easy to see that $E\left\{\left[X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\right]^2\right\} = O\left((h_1 \ldots h_{q_1})^{-1}\left(\sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1} \lambda_s\right)\right)$ and that $E\{[\tilde{K}_{\tilde{\gamma},ij}]^2/\nu_1(\tilde{z}_i)^2\} = E\{[\tilde{K}_{\tilde{\gamma},ij}]^2/(E[\tilde{K}_{\tilde{\gamma},ij}|\tilde{z}_i])^2\} \le 1$. Hence, (A.14) implies that $CV_{1,1} = o_p\left((nh_1 \ldots h_{q_1})^{-1}\right)$.
Next, $CV_{1,2}$ can be written as a third order U-statistic. By the U-statistic H-decomposition, one can show that $CV_{1,2} = E[CV_{1,2}] + (s.o.)$. Before evaluating $E[CV_{1,2}]$, we first compute an intermediate quantity. Recalling that $\mu(z) = m(\bar{z})\nu_1(\tilde{z})$ and that $m(\bar{z}) = E[X_j X_j'|\bar{z}_j = \bar{z}]\bar{f}(\bar{z})$, we have (noting that $E[\tilde{K}_{\tilde{\gamma},ij}/\nu_1(\tilde{z}_i)|\tilde{z}_i] = \nu_1(\tilde{z}_i)/\nu_1(\tilde{z}_i) = 1$)

$$E\left[X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}|X_i, z_i\right] = X_i' m(\bar{z}_i)^{-1} E\left[E(X_j X_j'|\bar{z}_j)(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}|X_i, \bar{z}_i\right] E\left[\tilde{K}_{\tilde{\gamma},ij}/\nu_1(\tilde{z}_i)|\tilde{z}_i\right]$$
$$= X_i' m(\bar{z}_i)^{-1}\int \bar{f}(\bar{z}_j)E(X_j X_j'|\bar{z}_j)(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\, d\bar{z}_j \equiv X_i' m(\bar{z}_i)^{-1}\int m(\bar{z}_j)(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\, d\bar{z}_j$$
$$= X_i' m(\bar{z}_i)^{-1}\sum_{\bar{z}_j^d \in \bar{S}^d}\int m(\bar{z}_i^c + hv, \bar{z}_j^d)\left[\beta(\bar{z}_i^c + hv, \bar{z}_j^d) - \beta(\bar{z}_i^c, \bar{z}_i^d)\right]W(v)L(\bar{z}_i^d, \bar{z}_j^d, \lambda)\, dv$$
$$= X_i' m(\bar{z}_i)^{-1}\left\{\sum_{s=1}^{q_1} h_s^2\left[m_s(\bar{z}_i^c, \bar{z}_i^d)\beta_s(\bar{z}_i^c, \bar{z}_i^d) + \tfrac{1}{2}m(\bar{z}_i^c, \bar{z}_i^d)\beta_{ss}(\bar{z}_i^c, \bar{z}_i^d)\right] + \sum_{s=1}^{r_1}\lambda_s \sum_{\bar{v}^d \in \bar{S}^d} \mathbf{1}_s(\bar{z}_i^d, \bar{v}^d)\, m(\bar{z}_i^c, \bar{v}^d)\left[\beta(\bar{z}_i^c, \bar{v}^d) - \beta(\bar{z}_i^c, \bar{z}_i^d)\right]\right\} + o_p(\zeta_n)$$
$$\equiv X_i' m(\bar{z}_i)^{-1}\left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar{z}_i) + \sum_{s=1}^{r_1}\lambda_s B_{2s}(\bar{z}_i)\right] + o_p(\zeta_n) \tag{A.15}$$
uniformly in $(h, \lambda) \in \Gamma$, where $\zeta_n = \sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1}\lambda_s$.

Substituting (A.15) into $CV_{1,2}$ and letting $\xi(\bar{z}) = E[X_i X_i'|\bar{z}_j = \bar{z}]$ (so that $m(\bar{z}) = \xi(\bar{z})\bar{f}(\bar{z})$), we immediately obtain (where $\bar{M}(\bar{z}) = \int \tilde{f}(\tilde{z})M(z)\, d\tilde{z}$)

$$E[CV_{1,2}] = E\left\{\left[\sum_{s=1}^{q_1} h_s^2 B_{1s,i} + \sum_{s=1}^{r_1}\lambda_s B_{2s,i}\right]' m(\bar{z}_i)^{-1}\xi(\bar{z}_i)\, m(\bar{z}_i)^{-1}\left[\sum_{s=1}^{q_1} h_s^2 B_{1s,i} + \sum_{s=1}^{r_1}\lambda_s B_{2s,i}\right]\bar{M}(\bar{z}_i)\right\} + (s.o.)$$
$$= \int \left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar{z}) + \sum_{s=1}^{r_1}\lambda_s B_{2s}(\bar{z})\right]' m(\bar{z})^{-1}\left[\sum_{s=1}^{q_1} h_s^2 B_{1s}(\bar{z}) + \sum_{s=1}^{r_1}\lambda_s B_{2s}(\bar{z})\right]\bar{M}(\bar{z})\, d\bar{z} + (s.o.), \tag{A.16}$$

where in the last equality we used the fact that $E(\cdot) = \int(\cdot)\bar{f}(\bar{z})\, d\bar{z}$ and $\xi(\bar{z})\bar{f}(\bar{z}) = m(\bar{z})$. Hence, the leading term of $CV_1$ is given by (A.16). Note that in the above we have only shown that (A.16) holds true for all fixed values of $(h, \lambda) \in \Gamma$. By utilizing Rosenthal's and Markov's inequalities, it is straightforward to show that (A.16) holds true uniformly in $(h, \lambda) \in \Gamma$.

Lemma A.2. Equation (A.8) holds true.

Proof. We use $CV_2^0$ to denote $CV_2$ with $\hat{A}(z_i)^{-1}$ replaced by its leading term $\mu(z_i)^{-1}$. Using Lemma A.5 it is fairly straightforward to show that $CV_2 = CV_2^0 + (s.o.)$. Hence, we only need to
consider $CV_2^0$, which is given by

$$CV_2^0 = n^{-3}\sum_i \sum_{j \ne i}\sum_{l \ne i} X_i'\mu(z_i)^{-1}X_j u_j K_{\gamma,ij}\; X_i'\mu(z_i)^{-1}X_l u_l K_{\gamma,il}\, M_i$$
$$= n^{-3}\sum_i \sum_{j \ne i} (X_i'\mu(z_i)^{-1}X_j)^2 u_j^2 K_{\gamma,ij}^2 M_i + n^{-3}\sum_i \sum_{j \ne i}\sum_{l \ne i, l \ne j} X_i'\mu(z_i)^{-1}X_j u_j K_{\gamma,ij}\; X_i'\mu(z_i)^{-1}X_l u_l K_{\gamma,il}\, M_i \equiv D_1 + D_2.$$

We first consider $D_2$, which has mean zero. By noting that the terms related to $\tilde{K}_{\tilde{\gamma},ij}$ are bounded, since the kernel functions $W(\cdot)$ and $L(\cdot)$ are bounded and the factors $h_{q_1+1} \ldots h_q$ from $\tilde{K}$ and from $\mu(\cdot)^{-1}$ cancel, it is fairly straightforward to show that $E[D_2^2] = n^{-6}O\left(n^4(h_1 \ldots h_{q_1})^{-1}\right) = O\left(n^{-2}(h_1 \ldots h_{q_1})^{-1}\right)$. Hence, $D_2 = O_p\left(n^{-1}(h_1 \ldots h_{q_1})^{-1/2}\right) = o_p\left((nh_1 \ldots h_{q_1})^{-1}\right)$.

Next, we consider $D_1$. $D_1$ can be written as a second order U-statistic. By the U-statistic H-decomposition it is straightforward, though tedious, to show that $D_1 = E(D_1) + (s.o.)$. Define $\delta(\bar{z}_i, \bar{z}_j) = E[(X_i' m(\bar{z}_i)^{-1}X_j)^2\sigma^2(X_j, \bar{z}_j)|\bar{z}_i, \bar{z}_j]$. By Lemma A.6 we know that $h_s \to 0$ for $s = 1, \ldots, q_1$ and $\lambda_s \to 0$ for $s = 1, \ldots, r_1$. Applying a Taylor expansion, using the law of iterated expectations, and recalling that $\mu(z) = m(\bar{z})\nu_1(\tilde{z})$ and $\nu_2(\tilde{z}) = E[\tilde{K}_{\tilde{\gamma},ij}^2|\tilde{z}_i = \tilde{z}]$, we have the following
intermediate result:

$$E\left[(X_i' m(\bar{z}_i)^{-1}X_j)^2\sigma^2(X_j, \bar{z}_j)K_{\gamma,ij}^2\,|\,z_i\right] = E\left[\delta(\bar{z}_i, \bar{z}_j)\bar{K}_{\bar{\gamma},ij}^2\,|\,\bar{z}_i\right]\nu_2(\tilde{z}_i)/\nu_1(\tilde{z}_i)^2$$
$$= \tilde{R}(\tilde{z}_i)\int \bar{f}(\bar{z}_j)\delta(\bar{z}_i, \bar{z}_j)\bar{K}_{\bar{\gamma},ij}^2\, d\bar{z}_j = (h_1 \ldots h_{q_1})^{-1}\left[\tilde{R}(\tilde{z}_i)\bar{f}(\bar{z}_i)\delta(\bar{z}_i, \bar{z}_i)\int \bar{W}(v)^2\, dv + O(\zeta_n)\right]$$
$$= (h_1 \ldots h_{q_1})^{-1}\kappa_{q_1}\bar{f}(\bar{z}_i)\delta(\bar{z}_i, \bar{z}_i)\tilde{R}(\tilde{z}_i) + O_p\left(\zeta_n^{1/2}(h_1 \ldots h_{q_1})^{-1}\right), \tag{A.17}$$

where $\tilde{R}(\tilde{z}) = \nu_2(\tilde{z})/\nu_1(\tilde{z})^2$ and $\kappa_{q_1} = \int \bar{W}(v)^2\, dv \equiv \left[\int w^2(v)\, dv\right]^{q_1}$. Using (A.17) and the law of iterated expectations, we obtain

$$E(D_1) = n^{-1}E\left\{(X_i' m(\bar{z}_i)^{-1}X_j)^2\sigma^2(X_j, \bar{z}_j)K_{\gamma,ij}^2 M_i\right\} = n^{-1}E\left\{E\left[(X_i' m(\bar{z}_i)^{-1}X_j)^2\sigma^2(X_j, \bar{z}_j)K_{\gamma,ij}^2 M_i\,|\,z_i\right]\right\}$$
$$= (nh_1 \ldots h_{q_1})^{-1}\kappa_{q_1}E\left\{\bar{f}(\bar{z}_i)\delta(\bar{z}_i, \bar{z}_i)\tilde{R}(\tilde{z}_i)M_i\right\} + O\left(\zeta_n^{1/2}(nh_1 \ldots h_{q_1})^{-1}\right)$$
$$= (nh_1 \ldots h_{q_1})^{-1}\kappa_{q_1}\int \bar{f}(\bar{z})^2\tilde{f}(\tilde{z})\delta(\bar{z}, \bar{z})\tilde{R}(\tilde{z})M(z)\, dz + O\left(\zeta_n^{1/2}(nh_1 \ldots h_{q_1})^{-1}\right).$$
Summarizing the above, we have shown that

$$CV_2 = CV_2^0 + (s.o.) = E(D_1) + (s.o.) = (nh_1 \ldots h_{q_1})^{-1}\kappa_{q_1}\int \bar{f}(\bar{z})^2\tilde{f}(\tilde{z})\delta(\bar{z}, \bar{z})\tilde{R}(\tilde{z})M(z)\, dz + (s.o.).$$

Moreover, by utilizing Rosenthal's and Markov's inequalities, one can show that the above result holds uniformly in $(h, \lambda) \in \Gamma$. This completes the proof of Lemma A.2.

Lemma A.3. Equation (A.9) holds true.

Proof. By Lemma A.5 we know that $\hat{A}(z) = \mu(z) + (s.o.)$ uniformly in $z \in \mathcal{M}$ and $(h, \lambda) \in \Gamma$, where $\mathcal{M}$ denotes the support of the weight function $M(\cdot)$. Let $CV_3^0$ denote $CV_3$ with $\hat{A}(z_i)$ replaced by its leading term $\mu(z_i)$. Then it can be shown that $CV_3 = CV_3^0 + (s.o.)$. Hence, we only need to consider $CV_3^0$, which is defined by

$$CV_3^0 = n^{-3}\sum_i\sum_{j \ne i}\sum_{l \ne i} X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}\; X_i'\mu(z_i)^{-1}X_l u_l K_{\gamma,il}\, M_i$$
$$= n^{-3}\sum_i\sum_{j \ne i} X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)\; X_i'\mu(z_i)^{-1}X_j u_j K_{\gamma,ij}^2\, M_i + n^{-3}\sum_i\sum_{j \ne i}\sum_{l \ne i, l \ne j} X_i'\mu(z_i)^{-1}X_j X_j'(\beta_j - \beta_i)K_{\gamma,ij}\; X_i'\mu(z_i)^{-1}X_l u_l K_{\gamma,il}\, M_i \equiv CV_{3,1} + CV_{3,2},$$

where the definitions of $CV_{3,1}$ and $CV_{3,2}$ should be apparent.

We first consider $CV_{3,1}$, which has zero mean. By following the same arguments as we did in the proof of Lemma A.1, one can show that the part of $CV_{3,1}$ that is related to the irrelevant covariates is bounded. This occurs because we have $\nu_1(\tilde{z}_i)^{-2} = \{E[\tilde{K}_{\tilde{\gamma},ij}|\tilde{z}_i]\}^{-2}$ from $\mu(z_i)^{-2} = m(\bar{z}_i)^{-2}\nu_1(\tilde{z}_i)^{-2}$, which, when combined with $\tilde{K}_{\tilde{\gamma},ij}^2$, always gives us a bounded quantity when computing the moments of $CV_{3,1}$. Therefore, when evaluating the order of $CV_{3,1}$ we can ignore the irrelevant covariates part and need only consider

$$\overline{CV}_{3,1} \stackrel{def}{=} n^{-3}\sum_i\sum_{j \ne i} X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\; X_i' m(\bar{z}_i)^{-1}X_j u_j \bar{K}_{\bar{\gamma},ij}^2.$$

Note that $\overline{CV}_{3,1}$ only depends on $(h_1, \ldots, h_{q_1}, \lambda_1, \ldots, \lambda_{r_1})$. By Lemma A.6 we know that these smoothing parameters all converge to zero as $n \to \infty$. Hence, we can use standard change-of-variable and Taylor expansion arguments to deal with the continuous covariate kernel functions, and use the polynomial expansion for the discrete kernel functions. Then, it is straightforward to
show that the second moment of $\overline{CV}_{3,1}$ is of order (noting that $u_j$ has zero mean)

$$E[\overline{CV}_{3,1}^2] = n^{-6}\sum_i\sum_{j \ne i}\sum_{i' \ne j} E\left[X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\; X_i' m(\bar{z}_i)^{-1}X_j u_j^2 \bar{K}_{\bar{\gamma},ij}^2\; X_{i'}' m(\bar{z}_{i'})^{-1}X_j X_j'(\beta_j - \beta_{i'})\; X_{i'}' m(\bar{z}_{i'})^{-1}X_j \bar{K}_{\bar{\gamma},i'j}^2\right] = n^{-6}O\left(n^3\zeta_n^2(h_1 \ldots h_{q_1})^{-2} + n^2\zeta_n(h_1 \ldots h_{q_1})^{-3}\right), \tag{A.18}$$

where the first term corresponds to the three indices $i$, $j$, $i'$ each being different from the others, and the second term corresponds to the case $i' = i$, so that there are only two summations in the second case (recall $\zeta_n = \sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1}\lambda_s$). Hence,

$$CV_{3,1} = O_p\left(\zeta_n n^{-1/2}(nh_1 \ldots h_{q_1})^{-1} + \zeta_n^{1/2} n^{-1/2}(nh_1 \ldots h_{q_1})^{-3/2}\right) = o_p\left(\zeta_n^2 + (nh_1 \ldots h_{q_1})^{-1}\right).$$

Similarly, when evaluating the moments of $CV_{3,2}$, we can ignore the irrelevant covariate part of $CV_{3,2}$ and need only consider

$$\overline{CV}_{3,2} = n^{-3}\sum_i\sum_{j \ne i}\sum_{l \ne i, l \ne j} X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\; X_i' m(\bar{z}_i)^{-1}X_l u_l \bar{K}_{\bar{\gamma},il}.$$
Note that $\overline{CV}_{3,2}$ also has zero mean, and it can be shown that its second moment is

$$E[\overline{CV}_{3,2}^2] = n^{-6}\sum_i\sum_{j \ne i}\sum_{i'}\sum_{j' \ne i'}\sum_{l \ne i,j,i',j'} E\left[X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\; X_i' m(\bar{z}_i)^{-1}X_l u_l^2 \bar{K}_{\bar{\gamma},il} \times X_{i'}' m(\bar{z}_{i'})^{-1}X_{j'} X_{j'}'(\beta_{j'} - \beta_{i'})\bar{K}_{\bar{\gamma},i'j'}\; X_{i'}' m(\bar{z}_{i'})^{-1}X_l \bar{K}_{\bar{\gamma},i'l}\right]. \tag{A.19}$$

The second moment of $\overline{CV}_{3,2}$ has only five summations, owing to the fact that $u_l$ has zero mean. If the five summation indices $i$, $j$, $l$, $i'$, $j'$ all differ from one another, then, by the independent data assumption, we can compute $E\left[X_i' m(\bar{z}_i)^{-1}X_j X_j'(\beta_j - \beta_i)\bar{K}_{\bar{\gamma},ij}\; X_i' m(\bar{z}_i)^{-1}X_l u_l^2 \bar{K}_{\bar{\gamma},il}\right]$ and $E\left[X_{i'}' m(\bar{z}_{i'})^{-1}X_{j'} X_{j'}'(\beta_{j'} - \beta_{i'})\bar{K}_{\bar{\gamma},i'j'}\; X_{i'}' m(\bar{z}_{i'})^{-1}X_l \bar{K}_{\bar{\gamma},i'l}\right]$ separately. Each term gives us an order $\zeta_n = \sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1}\lambda_s$; hence we have an order $n^5\zeta_n^2$ for this case. Similarly, one can easily show that if the five indices take only four different values, we get an order $n^4\zeta_n^2 + n^4\zeta_n(h_1 \ldots h_{q_1})^{-1}$. Other cases give even smaller orders since they involve fewer summations. Therefore, we conclude that

$$E[\overline{CV}_{3,2}^2] = n^{-6}O\left(\zeta_n^2 n^5 + n^4\zeta_n(h_1 \ldots h_{q_1})^{-1}\right) = O\left(n^{-1}\zeta_n^2\right).$$
Hence, $CV_{3,2} = O_p\left(n^{-1/2}\zeta_n\right) = o_p\left(\zeta_n^2 + (nh_1 \ldots h_{q_1})^{-1}\right)$. Thus, we have shown that

$$CV_3 = CV_3^0 + (s.o.) = o_p\left(\zeta_n^2 + (nh_1 \ldots h_{q_1})^{-1}\right). \tag{A.20}$$
Moreover, by utilizing Rosenthal's and Markov's inequalities, one can show that (A.20) holds uniformly in $(h, \lambda) \in \Gamma$. Hence, (A.16) holds true.

Lemma A.4. $CV_{0,2} = o_p\left( \zeta_n^2 + (n h_1 \dots h_{q_1})^{-1} \right)$ uniformly in $(h, \lambda) \in \Gamma$.

Proof. Letting $CV_{0,2}^0$ denote $CV_{0,2}$ with $\mu(z_i)$ replacing $\hat A(z_i)$ in $CV_{0,2}$, we get
\begin{align*}
CV_{0,2}^0 &= n^{-1} \sum_i u_i X_i' \mu(z_i)^{-1} \left[ \hat B_i + \hat C_i \right] \\
&= n^{-2} \sum_i \sum_{j \neq i} u_i X_i' \mu(z_i)^{-1} X_j X_j' (\beta_j - \beta_i) K_{\gamma,ij} + n^{-2} \sum_i \sum_{j \neq i} u_i X_i' \mu(z_i)^{-1} X_j u_j K_{\gamma,ij} \\
&= F_1 + F_2.
\end{align*}
By following exactly the same arguments as we did in the proof of Lemma A.3, one can show that $E[F_1^2] = n^{-4} O\left( n^3 \zeta_n^2 + n^2 \zeta_n (h_1 \dots h_{q_1})^{-1} \right) = O\left( n^{-1} \zeta_n^2 + n^{-1} \zeta_n (n h_1 \dots h_{q_1})^{-1} \right)$. Hence, $F_1 = O_p\left( n^{-1/2} \zeta_n + n^{-1/2} \zeta_n^{1/2} (n h_1 \dots h_{q_1})^{-1/2} \right) = o_p\left( \zeta_n^2 + (n h_1 \dots h_{q_1})^{-1} \right)$ because $n^{-1/2} = o\left( \zeta_n + (n h_1 \dots h_{q_1})^{-1/2} \right)$. Also, $F_2$ is a second order degenerate U-statistic, so it is fairly straightforward to show that $E[F_2^2] = n^{-4} O\left( n^2 (h_1 \dots h_{q_1})^{-1} \right) = O\left( n^{-1} (n h_1 \dots h_{q_1})^{-1} \right)$. Hence, $F_2 = O_p\left( n^{-1/2} (n h_1 \dots h_{q_1})^{-1/2} \right) = o_p\left( (n h_1 \dots h_{q_1})^{-1} \right)$ because $n^{-1/2} = o\left( (n h_1 \dots h_{q_1})^{-1/2} \right)$. Summarizing the above, we have shown that
\[
CV_{0,2} = CV_{0,2}^0 + (s.o.) = o_p\left( \zeta_n^2 + (n h_1 \dots h_{q_1})^{-1} \right). \tag{A.21}
\]
By utilizing Rosenthal’s and Markov’s inequalities, one can show that (A.21) holds uniformly in (h, λ) ∈ Γ. This completes the proof of Lemma A.4. ˜ γ˜,ij |˜ Lemma A.5. Defining m(¯ z ) = E[Xj Xj0 |¯ zj = z¯]f¯(¯ z ), ν1 (˜ z ) = E[K zi = z˜] and µ(z) = m(¯ z )ν1 (˜ z ), then h i ¡ 4 ¢ ¯ + |λ| ¯ 2 + ln(n)(nh1 . . . hq )−1 , ˆ −1 = µ(z)−1 − µ(z)−1 A(z) ˆ − µ(z) µ(z)−1 + Op |h| A(z) 1 uniformly in z ∈ M and (h, λ) ∈ Γ.
Proof. Defining $\hat\mu(z) = E[\hat A(z_i) | z_i = z]$, then by the independence of $\tilde z_i$ and $(y_i, x_i, \bar z_i)$, we have
\begin{align*}
\hat\mu(z) &= E\left[ X_j X_j' \bar K_{\bar\gamma,ij} \,|\, \bar z_i = \bar z \right] E\left[ \tilde K_{\tilde\gamma,ij} \,|\, \tilde z_i = \tilde z \right] \\
&= \left\{ m(\bar z) + O(\zeta_n) \right\} E\left[ \tilde K_{\tilde\gamma,ij} \,|\, \tilde z_i = \tilde z \right] \\
&= \mu(z) + O_p(\zeta_n), \tag{A.22}
\end{align*}
where $\zeta_n = \sum_{s=1}^{q_1} h_s^2 + \sum_{s=1}^{r_1} \lambda_s$. Also, $\hat A(z) - \hat\mu(z)$ has zero mean. Following standard arguments used when deriving uniform convergence rates for nonparametric kernel estimators (e.g., Masry (1996)), we know that
\[
\hat A(z) - \hat\mu(z) = O_p\left( \frac{(\ln(n))^{1/2}}{(n h_1 \dots h_{q_1})^{1/2}} \right) \tag{A.23}
\]
uniformly in $z \in M$ and $(h, \lambda) \in \Gamma$. Combining (A.22) and (A.23) we obtain
\[
\hat A(z) - \mu(z) = O_p\left( \zeta_n + (\ln(n))^{1/2} (n h_1 \dots h_{q_1})^{-1/2} \right) \tag{A.24}
\]
uniformly in $z \in M$ ($M$ is the support of the trimming function $M(\cdot)$) and $(h, \lambda) \in \Gamma$. Using (A.24) we obtain
\begin{align*}
\hat A(z)^{-1} &= \left[ \mu(z) + \hat A(z) - \mu(z) \right]^{-1} \\
&= \mu(z)^{-1} - \mu(z)^{-1} \left[ \hat A(z) - \mu(z) \right] \mu(z)^{-1} + O_p\left( \left| \hat A(z) - \mu(z) \right|^2 \right) \\
&= \mu(z)^{-1} - \mu(z)^{-1} \left[ \hat A(z) - \mu(z) \right] \mu(z)^{-1} + O_p\left( \zeta_n^2 + \ln(n) (n h_1 \dots h_{q_1})^{-1} \right),
\end{align*}
completing the proof of Lemma A.5.

Lemma A.6. $\hat h_s = o_p(1)$ for $s = 1, \dots, q_1$ and $\hat\lambda_s = o_p(1)$ for $s = 1, \dots, r_1$.

Proof. Without assuming that any of the smoothing parameters converge to zero, the only possible non-$o_p(1)$ term in $CV(\gamma)$ is $CV_1$. It is fairly straightforward to see that
\[
CV_1 = n^{-3} \sum_i \sum_{j \neq i} \sum_{l \neq i, l \neq j} X_i' \hat\mu(z_i)^{-1} X_j X_j' (\beta_j - \beta_i) K_{\gamma,ij}\, X_i' \hat\mu(z_i)^{-1} X_l X_l' (\beta_l - \beta_i) K_{\gamma,il} M_i + o_p(1) \equiv G_1 + o_p(1),
\]
where $\hat\mu(z_i) = E\left[ X_j X_j' \bar K_{\bar\gamma,ij} \,|\, \bar z_i \right] E\left[ \tilde K_{\tilde\gamma,ij} \,|\, \tilde z_i \right]$ is as defined in (A.22) in the proof of Lemma A.5.

Note that $G_1$ can be written as a third order U-statistic, hence by the H-decomposition of a U-statistic it is fairly straightforward to show that $G_1 = E(G_1) + o_p(1)$. Furthermore, by the law
of iterated expectations we have
\begin{align*}
E(G_1) &= E\left\{ \left[ X_i' \hat\mu(z_i)^{-1} E\left( X_j X_j' (\beta_j - \beta_i) K_{\gamma,ij} \,|\, X_i, z_i \right) \right]^2 M(z_i) \right\} \\
&= E\left\{ \left( X_i' \left[ E\left( X_j X_j' \bar K_{\bar\gamma,ij} \,|\, \bar z_i \right) \right]^{-1} E\left( X_j X_j' (\beta_j - \beta_i) \bar K_{\bar\gamma,ij} \,|\, X_i, \bar z_i \right) \right)^2 \bar M(\bar z_i) \right\} \\
&= E\left\{ \left( X_i' \left[ \eta_\beta(\bar z_i) - \beta(\bar z_i) \right] \right)^2 \bar M(\bar z_i) \right\} \\
&= E\left\{ \left( \eta_\beta(\bar z_i) - \beta(\bar z_i) \right)' E\left( X_i X_i' \,|\, \bar z_i \right) \left( \eta_\beta(\bar z_i) - \beta(\bar z_i) \right) \bar M(\bar z_i) \right\} \\
&= \int \left( \eta_\beta(\bar z) - \beta(\bar z) \right)' m(\bar z) \left( \eta_\beta(\bar z) - \beta(\bar z) \right) \bar M(\bar z)\, d\bar z, \tag{A.25}
\end{align*}
where $m(\bar z) = E(X_i X_i' \,|\, \bar z_i = \bar z) \bar f(\bar z)$, $\eta_\beta(\bar z)$ is defined above (17), and $\bar M(\bar z)$ is defined in (23). Note that the right-hand side of (A.25) does not depend on $(h_{q_1+1}, \dots, h_q, \lambda_{r_1+1}, \dots, \lambda_r)$ since $E[\tilde K_{\tilde\gamma,ij} \,|\, \tilde z_i]$ in the numerator cancels with the same quantity in the denominator (from $\hat\mu(z_i)^{-1} = E[X_j X_j' \bar K_{\bar\gamma,ij} \,|\, \bar z_i]^{-1} E[\tilde K_{\tilde\gamma,ij} \,|\, \tilde z_i]^{-1}$).
If the smoothing parameters $h_1, \dots, h_{q_1}, \lambda_1, \dots, \lambda_{r_1}$ that minimize $CV(\gamma)$ do not all converge in probability to zero, then by (17), $E(G_1)$ (or $CV_1$) does not converge to zero, which implies that, for some $\delta > 0$, the probability that the minimum of $CV_1$ (over the smoothing parameters) exceeds $\delta$ does not converge to zero as $n \to \infty$. However, choosing $h_1, \dots, h_{q_1}$ to be of size $n^{-1/(q_1+4)}$ and $\lambda_1, \dots, \lambda_{r_1}$ to be of size $n^{-2/(q_1+4)}$, letting $h_{q_1+1}, \dots, h_q$ diverge to infinity, and letting $\lambda_{r_1+1}, \dots, \lambda_r$ converge to 1, one can easily show that $CV_1$ converges in probability to zero. This contradicts the result obtained in the previous paragraph, and thus demonstrates that, at the minimum of $CV(\gamma)$, the smoothing parameters $h_1, \dots, h_{q_1}, \lambda_1, \dots, \lambda_{r_1}$ for the relevant components of $Z$ all converge in probability to zero.
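To make the limiting behavior in Lemma A.6 concrete: relevant components receive shrinking smoothing parameters, while an irrelevant continuous component is smoothed out as $h_s \to \infty$ and an irrelevant discrete component as $\lambda_s \to 1$. The following is a minimal sketch of a mixed-data product kernel, assuming a Gaussian kernel for continuous components and a $\lambda^{\mathbf{1}(z_s \neq z_s')}$-type kernel for unordered discrete components; the function and variable names are illustrative, not the paper's notation, and normalizing constants are omitted since they cancel in ratio-form estimators.

```python
import numpy as np

def product_kernel(zc_i, zc_j, zd_i, zd_j, h, lam):
    """Mixed-data product kernel (illustrative sketch).

    Continuous components use an (unnormalized) Gaussian kernel with
    bandwidths h; unordered discrete components get weight 1 when the
    categories match and lam[s] otherwise.  Setting lam[s] = 1 gives
    every observation equal weight in component s (the component is
    smoothed out, i.e. treated as irrelevant); h[s] -> infinity has the
    same effect for a continuous component.
    """
    zc_i, zc_j, h = map(np.asarray, (zc_i, zc_j, h))
    zd_i, zd_j, lam = map(np.asarray, (zd_i, zd_j, lam))
    cont = np.prod(np.exp(-0.5 * ((zc_i - zc_j) / h) ** 2))
    disc = np.prod(np.where(zd_i == zd_j, 1.0, lam))
    return float(cont * disc)
```

With `lam = 1` the discrete covariate has no effect on the weights, mirroring how cross-validation can effectively remove an irrelevant qualitative covariate rather than splitting the sample by its cells.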
B Proofs of Theorems 4.1 and 4.2
Proof of Theorem 4.1. First, we consider the case where the smoothing parameters are non-stochastic, i.e., $h_s = h_s^0 = a_s^0 n^{-1/(4+q_1)}$ for $s = 1, \dots, q_1$ and $\lambda_s = \lambda_s^0 = b_s^0 n^{-2/(4+q_1)}$ for $s = 1, \dots, r_1$. We use $\hat I_n^0$ ($\hat T_n^0$) to denote the test statistic that uses the smoothing parameter $\gamma^0$. Under $H_0$ we conduct a Taylor expansion of $\beta(z_i, \hat\alpha_0)$ at $\hat\alpha_0 = \alpha_0$ to get
\[
\beta(z_i, \hat\alpha_0) = \beta(z_i, \alpha_0) + \beta^{(1)}(z_i, \alpha_0)(\hat\alpha_0 - \alpha_0) + O_p\left( n^{-1} \right),
\]
where $\beta^{(1)}(z, \alpha_0) = \frac{\partial}{\partial\alpha} \beta(z, \alpha) |_{\alpha = \alpha_0}$. Then from $\hat u_i = Y_i - X_i' \beta(z_i, \hat\alpha_0) = Y_i - X_i' \beta(z_i, \alpha_0) - X_i' \beta^{(1)}(z_i, \alpha_0)(\hat\alpha_0 - \alpha_0) + O_p(n^{-1})$ and the fact that $\hat\alpha_0 - \alpha_0 = O_p(n^{-1/2})$, one can show that the leading term in $\hat I_n^0$ is obtained by replacing $\hat u_i$ with $Y_i - X_i' \beta(z_i, \alpha_0) = u_i$. We use $I_{n1}^0$ to denote it, i.e.,
\[
I_{n1}^0 = \frac{1}{n^2} \sum_i \sum_{j \neq i} X_i' X_j u_i u_j K_{\gamma^0, ij},
\]
where $\gamma^0 = (h_1^0, \dots, h_{q_1}^0, \lambda_1^0, \dots, \lambda_{r_1}^0)$. $I_{n1}^0$ is a second order degenerate U-statistic. It is straightforward to show that $I_{n1}^0$ has zero mean and asymptotic variance $(n^2 h_1^0 \dots h_{q_1}^0)^{-1} (\sigma_0^2 + o(1))$. By the same arguments as in Hsiao et al. (2007), one can show that the conditions for Hall's (1984) CLT hold. Therefore, we have
\[
n (h_1^0 \dots h_{q_1}^0)^{1/2} \hat I_n^0 \overset{d}{\to} N(0, \sigma_0^2),
\]
where $\sigma_0^2 = 2 \kappa^{q_1} E[f(X_i, z_i)(X_i' X_j)^2 \sigma^4(X_i, z_i)]$. It is easy to show that $\hat\sigma_0^2$ (with smoothing parameter $\gamma^0$) is a consistent estimator of $\sigma_0^2$. Hence, we have
\[
\hat T_n^0 = n (h_1^0 \dots h_{q_1}^0)^{1/2} \hat I_n^0 / \hat\sigma_0 \overset{d}{\to} N(0, 1). \tag{B.1}
\]
Next, from the result of Theorem 3.2 we know that $\hat h_s / h_s^0 - 1 = o_p(1)$ for $s = 1, \dots, q_1$ and $\hat\lambda_s / \lambda_s^0 - 1 = o_p(1)$ for $s = 1, \dots, r_1$. By stochastic equicontinuity arguments similar to those in Hsiao et al. (2007), one can show that
\[
\hat T_n - \hat T_n^0 = o_p(1). \tag{B.2}
\]
(B.1) and (B.2) lead to $\hat T_n \overset{d}{\to} N(0, 1)$ under $H_0$, completing the proof of Theorem 4.1.
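As a computational illustration, a statistic of the form $\hat T_n = n(\hat h_1 \cdots \hat h_{q_1})^{1/2} \hat I_n / \hat\sigma_0$ can be sketched for the simplest case: scalar $X$, a single continuous $z$, and a Gaussian kernel. This is a schematic sketch under those simplifying assumptions, with a standard plug-in variance estimator for degenerate kernel U-statistics substituted for $\hat\sigma_0^2$; it is not the paper's exact estimator, and all names are illustrative.

```python
import numpy as np

def kernel_spec_test(X, Z, u_hat, h):
    """Schematic kernel specification test statistic (scalar X, one
    continuous z, Gaussian kernel):

        I_n  = n^{-2} sum_{i != j} X_i X_j u_i u_j K_h(z_i - z_j)
        s2   = 2 h n^{-2} sum_{i != j} (X_i X_j u_i u_j K_h(z_i - z_j))^2
        T_n  = n h^{1/2} I_n / sqrt(s2)
    """
    n = len(u_hat)
    # n x n matrix of Gaussian kernel weights K_h(z_i - z_j)
    K = np.exp(-0.5 * ((Z[:, None] - Z[None, :]) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
    np.fill_diagonal(K, 0.0)  # exclude the i == j terms
    W = (X * u_hat)[:, None] * (X * u_hat)[None, :] * K
    I_n = W.sum() / n**2
    s2 = 2.0 * h * (W**2).sum() / n**2  # plug-in variance estimate
    return n * np.sqrt(h) * I_n / np.sqrt(s2)
```

Under the null the studentized statistic is approximately standard normal, so large positive values indicate departures from the parametric varying-coefficient specification.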
Proof of Theorem 4.2. Since the standard normal random variable's distribution function $\Phi(\cdot)$ is continuous, by Pólya's Theorem (Bhattacharya & Rao (1986)) we know that (30) is equivalent to the following: for any given $v \in \mathbb{R}$,
\[
\left| P\left( \hat T_n^* \leq v \,|\, \{Y_i, X_i, Z_i\}_{i=1}^n \right) - \Phi(v) \right| = o_p(1). \tag{B.3}
\]
The proof of (B.3) is similar to the proof of Theorem 4.1. Using $\hat\alpha_0^* - \hat\alpha_0 = O_p(n^{-1/2})$ and a Taylor expansion argument, one can show that the leading term of $\hat I_n^*$ is given by
\[
I_{n1}^* = \frac{1}{n^2} \sum_i \sum_{j \neq i} X_i' X_j u_i^* u_j^* K_{\hat\gamma, ij}.
\]
It is easy to show that, conditional on $\{Y_i, X_i, Z_i\}_{i=1}^n$, $I_{n1}^*$ has zero mean and asymptotic variance $(n^2 \hat h_1 \dots \hat h_{q_1})^{-1} (\hat\sigma_0^2 + o_p(1))$. Also, $\hat\sigma_0^* = \hat\sigma_0 + o_p(1)$. Hence, conditional on $\{Y_i, X_i, Z_i\}$, $T_{n1}^* \equiv n (\hat h_1 \dots \hat h_{q_1})^{1/2} I_{n1}^* / \hat\sigma_0$ has zero mean and asymptotic unit variance. To show that $T_{n1}^*$ has an asymptotic normal distribution conditional on the random sample, we apply the CLT of de Jong (1987) for generalized quadratic forms to derive the asymptotic distribution of $T_{n1}^* \,|\, \{Y_i, X_i, Z_i\}_{i=1}^n$. The reason for using de Jong's (1987) CLT instead of Hall's (1984) CLT is that in the bootstrap world, i.e., conditional on the random sample $\{Y_i, X_i, Z_i\}_{i=1}^n$, the $u_i^*$ are not identically distributed, as their distribution depends on $i$. Therefore, de Jong's (1987) CLT is the appropriate one to use. By following the same arguments as in the proof of Theorem 2.2 of Hsiao et al. (2007), one can show that the sufficient condition for de Jong's (1987) CLT holds. Hence, (B.3) holds true for $T_{n1}^*$. It also holds true for $T_n^*$ since $T_n^* = T_{n1}^* + o_p(1)$.
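The key point above, that conditional on the sample the $u_i^*$ are independent but not identically distributed, arises naturally under a wild bootstrap, in which each bootstrap error inherits the scale of residual $i$. The sketch below uses the familiar two-point weights as an assumed example of such a scheme; it is not necessarily the exact resampling method used in the paper.

```python
import numpy as np

def wild_bootstrap_errors(u_hat, rng):
    """Two-point wild bootstrap errors u_i* = u_hat_i * v_i (sketch).

    The weights v_i satisfy E[v_i] = 0 and E[v_i^2] = 1, so conditional
    on the sample each u_i* is mean-zero with variance u_hat_i^2 --
    hence the u_i* are independent but NOT identically distributed
    across i, which is why de Jong's (1987) CLT is needed.
    """
    u_hat = np.asarray(u_hat, dtype=float)
    a = (1.0 - np.sqrt(5.0)) / 2.0                    # ~ -0.618
    b = (1.0 + np.sqrt(5.0)) / 2.0                    # ~  1.618
    p = (np.sqrt(5.0) + 1.0) / (2.0 * np.sqrt(5.0))   # P(v_i = a)
    v = np.where(rng.random(u_hat.shape) < p, a, b)
    return u_hat * v
```

Repeating the test-statistic computation on many such bootstrap samples yields the conditional null distribution from which the critical values for $\hat T_n$ are taken.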