Nonparametric Estimation of Regression Functions in the Presence of Irrelevant Regressors

Peter Hall, Center for Mathematics and Its Applications, Australian National University, Canberra, ACT 0200, Australia
Qi Li, Department of Economics, Texas A&M University, College Station, TX 77843-4228, USA
Jeff Racine, Department of Economics, McMaster University, Hamilton, ON L8S 4M4, Canada

May 11, 2005
Abstract

In this paper we consider a nonparametric regression model which admits a mix of continuous and discrete regressors, some of which may in fact be redundant (i.e. irrelevant). We show that, asymptotically, a data-driven least squares cross-validation method can remove these irrelevant regressors. Simulations reveal that this 'automatic dimensionality reduction' feature is very effective in finite-sample settings, while an application to modeling strike volume suggests that the method performs well in applied settings.

Keywords: Irrelevant regressors, discrete regressors, nonparametric smoothing, cross-validation, asymptotic normality.
1 Introduction and Background
The most appealing feature of nonparametric kernel estimation techniques is that they are robust to functional form specification. However, this feature comes with strings attached. In particular, nonparametric methods are known to suffer from the ‘curse of dimensionality’ which prevents their application to high dimensional data unless one has an abundance of data. In addition to the curse of dimensionality issue, another factor which limits the application of nonparametric techniques is the fact that many datasets, particularly in the social sciences, contain both continuous and categorical regressors such as gender, family size, choices made by economic agents, and so on. For example, when evaluating the effectiveness of job training programs by estimating an average treatment effect, one frequently encounters datasets containing a preponderance of categorical variables (Hahn (1998), Hirano, Imbens and Ridder (2003)). The conventional frequency nonparametric estimation method does not handle discrete regressors in a satisfactory manner, as it requires one to split the sample into many subsets or ‘cells’. When the number of cells is large, each cell may not have enough observations to yield sensible nonparametric estimates of the relationships among the remaining continuous regressors. There is, however, a rich literature in statistics involving kernel smoothing for discrete variables which can be leveraged to overcome these issues in mixed regressor settings. For instance see Aitchison and Aitken (1976), Hall (1981), Hall and Wand (1988), Scott (1992), Grund and Hall (1993), Fahrmeir and Tutz (1994), and Simonoff (1996), among others. It is acknowledged that data-driven methods for selecting bandwidths for kernel estimators are indispensable in applied settings, and bandwidth selection procedures for nonparametric kernel regression have been widely studied. 
For instance, Clarke (1975) proposed leave-one-out least squares cross-validation for kernel regression, while Härdle and Marron (1985) demonstrated the asymptotic optimality of this approach. A bootstrap procedure that minimizes mean squared error with respect to the bandwidth was proposed by Härdle and Bowman (1988), various data-driven methods have been studied by Härdle, Hall, and Marron (1988, 1992) in the context of univariate regression models, and Fan and Gijbels (1995) have studied bandwidth selection in the context of local polynomial kernel regression. More recently, Racine and Li (2004) have considered the nonparametric estimation of regression functions with a mixture of discrete and continuous variables. A property shared by each of these approaches is that the bandwidths are assumed to converge to zero as the sample size increases. Another common thread in this literature is that it treats only the case for which the regressors are presumed to be relevant. Suppose, however, that some of the regressors are in fact irrelevant for the model being studied. These regressors should be removed, and in fact doing so will produce a lower-dimensional model having improved finite-sample properties. In order for the kernel method to remove a regressor, however, the bandwidth
must be sufficiently large, and in particular must not converge to zero as the sample size increases. We observe that, in applications, the presence of irrelevant regressors appears to occur surprisingly often (e.g., Li and Racine (2005)). In this paper we show that, by smoothing both the discrete and continuous regressors and by using the least squares cross-validation method to select the smoothing parameters, asymptotically all irrelevant regressors can be smoothed out and hence automatically removed from the resulting estimate. There is related literature on testing for the presence of irrelevant variables in nonparametric settings; see Fan and Li (1996), Lavergne and Vuong (1996), Racine (1996), and Lavergne and Vuong (2000) among others. The beauty of the results outlined in the current paper is simply that we demonstrate that pre-testing is in fact unnecessary when one employs cross-validation bandwidth selection. We view this as a powerful result as it has the potential for extending the reach of nonparametric estimation methods, especially for datasets having mixed categorical and continuous data types. The paper is organized as follows. Section 2 presents the main results of the paper by showing that the cross-validation method has the ability to remove irrelevant variables. Simulations and an empirical application are presented in Section 3. We discuss some possible future research topics and conclude the paper in Section 4, and the Appendices provide proofs of the results presented in Section 2.
2 Regression Models Having Irrelevant Regressors
2.1 Summary
We consider a nonparametric regression model where a subset of regressors is categorical and the remainder are continuous. Let $X_i^d$ denote a $q \times 1$ vector of regressors that assume discrete values and let $X_i^c \in \mathbb{R}^p$ denote the remaining continuous regressors. We use $X_{is}^d$ to denote the $s$th component of $X_i^d$; we assume that $X_{is}^d$ takes $c_s \ge 2$ different values, and we use $S^d$ to denote the support of $X^d$. Letting $X_i = (X_i^d, X_i^c)$, interest lies in estimating $E(Y_i|X_i)$ using nonparametric methods. However, we recognize that often, in applied settings, not all of the $q + p$ regressors in $X_i$ are relevant. We consider cases where one or more of the regressors may be irrelevant. Without loss of generality, we assume that only the first $p_1$ ($1 \le p_1 \le p$) components of $X^c$ and the first $q_1$ ($0 \le q_1 \le q$) components of $X^d$ are "relevant" regressors in the sense defined below. Note that we assume there exists at least one relevant continuous variable ($p_1 \ge 1$). It can be shown that, when all the continuous variables are irrelevant, the asymptotic distribution of the cross-validation-selected smoothing parameters will differ from the result of this paper. We shall treat this case in future research. Note that the regression model case is quite different from the conditional density estimation case considered in Hall et al. (2004), where one also smooths the dependent variable; this calls for substantial changes to the proofs. Let $\bar X$ consist of the first $p_1$ relevant components of $X^c$ and the first $q_1$ relevant components of $X^d$, and let $\tilde X = X \setminus \{\bar X\}$ denote the remaining irrelevant components of $X$.
The way of defining $\bar X$ to be relevant and $\tilde X$ to be irrelevant is to ask that

$(\bar X, Y)$ is independent of $\tilde X$.  (2.1)

Clearly, (2.1) implies that

$E(Y|X) = E(Y|\bar X)$,  (2.2)

and so a standard regression model, of the form $Y = g(X) + \text{error}$, may equivalently be written in the dimension-reduced form, $Y = \bar g(\bar X) + \text{error}$. Obviously (2.1) is a stronger assumption than (2.2). A weaker condition would be to ask that

conditional on $\bar X$, the variables $\tilde X$ and $Y$ are independent.  (2.3)

However, using (2.3) will cause some technical difficulties in the proof of the main result of the paper. Therefore, in this paper we consider only the unconditional independence of (2.1). Nevertheless, in the simulations reported in Section 3 we also investigate the case of conditional independence as defined by (2.3); the simulation results show that the cross-validation method can smooth out irrelevant regressors under either the unconditional independence assumption (2.1) or the conditional independence of (2.3). We shall assume that the true regression model is

$Y_i = \bar g(\bar X_i) + u_i$,  (2.4)
where $\bar g(\cdot)$ is of unknown functional form, and where $E(u_i|\bar X_i) = 0$. We shall consider the case for which the exact number of relevant variables is unknown, and where one estimates the conditional mean function of $Y$ conditional on (possibly) a larger set of regressors $X = (\bar X, \tilde X)$. We use $f(x)$ to denote the joint density function of $X_i$, and we use $\bar f(\bar x)$ and $\tilde f(\tilde x)$ to denote the marginal densities of $\bar X_i$ and $\tilde X_i$, respectively. For the discrete regressors $X_i^d$, we will first consider the case for which there is no natural ordering in $X_i^d$. The extension to the general case whereby some of the discrete regressors have natural orderings will be discussed at the end of this section. For $1 \le s \le q$, we define the kernel function for discrete variables as

$l(X_{is}^d, x_s^d, \lambda_s) = \begin{cases} 1, & \text{if } X_{is}^d = x_s^d, \\ \lambda_s, & \text{if } X_{is}^d \neq x_s^d. \end{cases}$  (2.5)
Therefore, the product kernel for $x^d = (x_1^d, \dots, x_q^d)$ is given by

$K^d(x^d, X_i^d) = \prod_{s=1}^{q} l(X_{is}^d, x_s^d, \lambda_s) = \prod_{s=1}^{q} \lambda_s^{I(X_{is}^d \neq x_s^d)},$

where $I(X_{is}^d \neq x_s^d)$ is an indicator function which equals one when $X_{is}^d \neq x_s^d$, and zero when $X_{is}^d = x_s^d$.
Here, $0 \le \lambda_s \le 1$ is the smoothing parameter for $x_s^d$. Note that when $\lambda_s = 1$, $K^d(x^d, X_i^d)$ is unrelated to $(x_s^d, X_{is}^d)$ (i.e. the $s$th component of $x^d$ is smoothed out). Also note that the kernel function $K^d$ does not sum to one; however, for the nonparametric estimation of regression functions, the kernel function appears in both the numerator and the denominator, so there is no need to normalize $K^d$ as would be necessary were one estimating a probability function. For the continuous variables $x^c = (x_1^c, \dots, x_p^c)$ we use the product kernel given by

$K^c(x^c, X_i^c) = \prod_{s=1}^{p} \frac{1}{h_s} K\left(\frac{x_s^c - X_{is}^c}{h_s}\right),$

where $K$ is a symmetric, univariate density function, and where $0 < h_s < \infty$ is the smoothing parameter for $x_s^c$. The kernel function for the mixed regressor case $x = (x^c, x^d)$ is simply the product of $K^c$ and $K^d$, i.e., $K(x, X_i) = K^c(x^c, X_i^c) K^d(x^d, X_i^d)$. Thus we estimate $E(Y|X = x)$ by

$\hat g(x) = \frac{\sum_{i=1}^{n} Y_i K(x, X_i)}{\sum_{i=1}^{n} K(x, X_i)}.$  (2.6)
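To make the estimator concrete, the product kernel and the local-constant estimator of (2.6) can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the Gaussian choice of $K$ and the function names (`kernel_mixed`, `g_hat`) are our own.

```python
import numpy as np

def kernel_mixed(x_c, x_d, Xc, Xd, h, lam):
    """Product kernel K(x, X_i) for mixed continuous/discrete data.

    x_c : (p,) continuous evaluation point;  x_d : (q,) discrete point
    Xc  : (n, p) continuous data;            Xd  : (n, q) discrete data
    h   : (p,) bandwidths;                   lam : (q,) values in [0, 1]
    """
    # Continuous part: product of scaled Gaussian kernels (one choice of K)
    u = (Xc - x_c) / h                                      # (n, p)
    Kc = np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=1)
    # Discrete part: lambda_s ** I(X_is^d != x_s^d); lambda_s = 1 removes regressor s
    Kd = np.prod(np.where(Xd != x_d, lam, 1.0), axis=1)
    return Kc * Kd

def g_hat(x_c, x_d, Xc, Xd, Y, h, lam):
    """Local-constant (Nadaraya-Watson) estimate of E(Y | X = x), as in (2.6)."""
    K = kernel_mixed(x_c, x_d, Xc, Xd, h, lam)
    return np.sum(Y * K) / np.sum(K)
```

Note that setting a discrete smoothing parameter to its upper bound of 1 makes that regressor's kernel contribution identically one, so the fitted value no longer depends on it, which is exactly the "smoothing out" mechanism described above.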
We choose $(h, \lambda) = (h_1, \dots, h_p, \lambda_1, \dots, \lambda_q)$ by minimizing the cross-validation function given by

$CV(h, \lambda) = \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - \hat g_{-i}(X_i)\right)^2 w(X_i),$  (2.7)

where $\hat g_{-i}(X_i) = \sum_{j \neq i}^{n} Y_j K(X_i, X_j) / \sum_{j \neq i}^{n} K(X_i, X_j)$ is the leave-one-out kernel estimator of $E(Y_i|X_i)$, and $0 \le w(\cdot) \le 1$ is a weight function which serves to avoid difficulties caused by dividing by zero, or by the slower convergence rate arising when $X_i$ lies near the boundary of the support of $X$. Define $\sigma^2(\bar x) = E(u_i^2|\bar X_i = \bar x)$ and let $S^c$ denote the support of $w$. We also assume that

the data are i.i.d. and $u_i$ has finite moments of any order; $g$, $f$ and $\sigma^2$ have two continuous derivatives; $w$ is continuous, nonnegative and has compact support; $f$ is bounded away from zero for $x = (x^c, x^d) \in S = S^c \times S^d$.  (2.8)
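The leave-one-out objective (2.7) can likewise be sketched; again this is our own illustrative code (the name `cv_objective` is hypothetical), using a Gaussian $K$, a uniform weight function by default, and a brute-force $O(n^2)$ kernel matrix. In practice $CV(h, \lambda)$ would be minimized over $(h, \lambda)$ by a numerical optimizer with each $\lambda_s$ constrained to $[0, 1]$.

```python
import numpy as np

def cv_objective(Xc, Xd, Y, h, lam, w=None):
    """Least squares cross-validation CV(h, lambda) of eq. (2.7), built from
    leave-one-out local-constant fits. A sketch, not an optimized routine."""
    n = len(Y)
    w = np.ones(n) if w is None else w
    # n x n matrix of product kernel weights K(X_i, X_j)
    u = (Xc[None, :, :] - Xc[:, None, :]) / h                  # (n, n, p)
    Kc = np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=2)
    Kd = np.prod(np.where(Xd[None, :, :] != Xd[:, None, :], lam, 1.0), axis=2)
    K = Kc * Kd
    np.fill_diagonal(K, 0.0)                                   # leave one out
    g_loo = (K @ Y) / np.maximum(K.sum(axis=1), 1e-300)
    return np.mean((Y - g_loo) ** 2 * w)
```

With $\lambda_s = 1$ for a discrete column, that column contributes a constant factor of one to every kernel weight, so the objective is completely insensitive to the values of that regressor.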
We impose the following conditions on the bandwidth and kernel functions. Define

$H = \left(\prod_{s=1}^{p_1} h_s\right) \prod_{s=p_1+1}^{p} \min(h_s, 1).$  (2.9)

Letting $0 < \epsilon < 1/(p+4)$ and for some constant $c > 0$,

$n^{\epsilon - 1} \le H \le n^{-\epsilon}$; $n^{-c} < h_s < n^{c}$ for all $s = 1, \dots, p$; the kernel $K$ is a symmetric, compactly supported, Hölder-continuous probability density; and $K(0) > K(\delta)$ for all $\delta > 0$.  (2.10)
The above conditions basically ask that each $h_s$ does not converge to zero, or to infinity, too fast, and that $n h_1 \cdots h_{p_1} \to \infty$. We expect that, as $n \to \infty$, the smoothing parameters associated with the relevant regressors will converge to zero, while those associated with the irrelevant regressors will not. It would be convenient to further assume that $h_s \to 0$ for $s = 1, \dots, p_1$ and that $\lambda_s \to 0$ for $s = 1, \dots, q_1$; however, for practical reasons we choose not to assume that the relevant components are known a priori. Therefore, we shall assume that the following condition holds. Defining $\bar\mu_g(\bar x) = E[\hat g(x)\hat f(x)]/E[\hat f(x)]$,¹ we assume that

$\int_{\mathrm{supp}\, w} [\bar\mu_g(\bar x) - \bar g(\bar x)]^2 \bar w(\bar x) \bar f(\bar x)\, d\bar x$, a function of $h_1, \dots, h_{p_1}$ and $\lambda_1, \dots, \lambda_{q_1}$, vanishes if and only if all of the smoothing parameters vanish,  (2.11)

where $\bar w$ is a weight function defined in (2.16) below. In the appendix we show that (2.10) and (2.11) imply that, as $n \to \infty$, $h_s \to 0$ for $s = 1, \dots, p_1$ and $\lambda_s \to 0$ for $s = 1, \dots, q_1$. Therefore, the smoothing parameters associated with the relevant variables all vanish asymptotically. Define an indicator function

$I_s(v^d, x^d) = I(v_s^d \neq x_s^d) \prod_{t=1,\, t\neq s}^{q} I(v_t^d = x_t^d).$  (2.12)
Note that $I_s(v^d, x^d) = 1$ if and only if $v^d$ and $x^d$ differ in their $s$th component only. Define $\int d\bar x = \sum_{\bar x^d} \int d\bar x^c$. Letting $m_s$ and $m_{ss}$ denote the first and second derivatives of $m(x)$ with respect to $x_s^c$ ($m = \bar g, f$), in the appendix we show that the leading term of CV is

$\int \frac{\kappa^{p_1} \sigma^2(\bar x)}{n h_1 \cdots h_{p_1}} w(x) \tilde R(x) \tilde f(\tilde x)\, dx + \int \Bigg[\sum_{s=1}^{q_1} \lambda_s \sum_{\bar v^d} I_s(\bar v^d, \bar x^d) \big\{\bar g(\bar x^c, \bar v^d) - \bar g(\bar x)\big\} \bar f(\bar x^c, \bar v^d) + \frac{1}{2} \kappa_2 \sum_{s=1}^{p_1} h_s^2 \big(\bar g_{ss}(\bar x) \bar f(\bar x) + 2 \bar f_s(\bar x) \bar g_s(\bar x)\big)\Bigg]^2 \bar f(\bar x)^{-1} \bar w(\bar x)\, d\bar x,$  (2.13)

with $\kappa = \int K(v)^2\, dv$, $\kappa_2 = \int K(v) v^2\, dv$, and where $\tilde R(x) = \tilde R(x, h_{p_1+1}, \dots, h_p, \lambda_{q_1+1}, \dots, \lambda_q)$ is given by

$\tilde R(x) = \frac{\nu_2(x)}{[\nu_1(x)]^2},$  (2.14)

where, for $j = 1, 2$,

$\nu_j(x) = E\Bigg[\Bigg\{\prod_{s=p_1+1}^{p} h_s^{-1} K\Big(\frac{X_{is}^c - x_s^c}{h_s}\Big) \prod_{s=q_1+1}^{q} \lambda_s^{I(X_{is}^d \neq x_s^d)}\Bigg\}^j\Bigg],$  (2.15)

and

$\bar w(\bar x) = \int \tilde f(\tilde x) w(x)\, d\tilde x,$  (2.16)

where, again, we define $\int d\tilde x = \sum_{\tilde x^d} \int d\tilde x^c$.

¹ Note that $\bar\mu_g(\bar x)$ does not depend on $\tilde x$, nor does it depend on $(h_{p_1+1}, \dots, h_p, \lambda_{q_1+1}, \dots, \lambda_q)$, because the components in the numerator related to the irrelevant variables cancel with those in the denominator.
In (2.13) the irrelevant variable $\tilde x$ appears in $\tilde R$. By Hölder's inequality, $\tilde R(x) \ge 1$ for all choices of $x$, $h_{p_1+1}, \dots, h_p$, and $\lambda_{q_1+1}, \dots, \lambda_q$. Also, $\tilde R \to 1$ as $h_s \to \infty$ ($p_1+1 \le s \le p$) and $\lambda_s \to 1$ ($q_1+1 \le s \le q$). Therefore, in order to minimize (2.13), one needs to select $h_s$ ($s = p_1+1, \dots, p$) and $\lambda_s$ ($s = q_1+1, \dots, q$) to minimize $\tilde R$. In fact, we show that the only smoothing parameter values for which $\tilde R(x, h_{p_1+1}, \dots, h_p, \lambda_{q_1+1}, \dots, \lambda_q) = 1$ are $h_s = \infty$ for $p_1+1 \le s \le p$ and $\lambda_s = 1$ for $q_1+1 \le s \le q$. To see this, define $Z_n(x) = \prod_{s=p_1+1}^{p} K\big(\frac{x_s^c - X_{is}^c}{h_s}\big) \prod_{s=q_1+1}^{q} \lambda_s^{I(x_s^d \neq X_{is}^d)}$. If at least one $h_s$ is finite (for $p_1+1 \le s \le p$), or one $\lambda_s < 1$ (for $q_1+1 \le s \le q$), then by the condition in (2.10) that $K(0) > K(\delta)$ for all $\delta > 0$, we know that $\mathrm{Var}(Z_n) = E[Z_n^2] - [E(Z_n)]^2 > 0$, so that $\tilde R = E[Z_n^2]/[E(Z_n)]^2 > 1$. Only when, in the definition of $Z_n$, all $h_s = \infty$ and all $\lambda_s = 1$ do we have $Z_n \equiv K(0)^{p-p_1}$ (a constant) and $\mathrm{Var}(Z_n) = 0$, so that $\tilde R = 1$ only in this case.
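The variance argument above can be checked numerically in the simplest purely discrete case: with a single irrelevant binary regressor, $Z_n = \lambda^{I(X \neq x)}$ and the ratio $E[Z_n^2]/[E(Z_n)]^2$ exceeds one for every $\lambda < 1$, reaching one only at $\lambda = 1$. The following sketch (our own illustration, with a hypothetical function name) computes the ratio in closed form:

```python
def R_tilde_discrete(lam, p_neq):
    """E[Z^2] / (E[Z])^2 for Z = lam ** I(X != x), with P(X != x) = p_neq.
    This is the discrete-only analogue of R-tilde in (2.14)."""
    EZ = (1.0 - p_neq) + lam * p_neq           # E[Z]
    EZ2 = (1.0 - p_neq) + lam**2 * p_neq       # E[Z^2]
    return EZ2 / EZ**2
```

For instance, with $P(X \neq x) = 0.3$ the ratio decreases monotonically toward 1 as $\lambda$ rises to 1, mirroring the incentive for cross-validation to push the smoothing parameter of an irrelevant discrete regressor to its upper bound.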
Therefore, in order to minimize (2.13), the smoothing parameters corresponding to the irrelevant regressors must all converge to their upper extremities, so that $\tilde R(x) \to 1$ as $n \to \infty$ for all $x \in S$. Thus, the irrelevant components are asymptotically smoothed out. To analyze the behavior of the smoothing parameters associated with the relevant variables, we replace $\tilde R(x)$ by 1 in (2.13); the first term on the right-hand side of (2.13) then becomes

$\int \frac{\kappa^{p_1} \sigma^2(\bar x)}{n h_1 \cdots h_{p_1}} \bar w(\bar x)\, d\bar x,$  (2.17)

where $\bar w$ is defined in (2.16). Next, defining $a_s = h_s n^{1/(p_1+4)}$ and $b_s = \lambda_s n^{2/(p_1+4)}$, (2.13) (with (2.17) as its first term, since $\tilde R(x) \to 1$) becomes $n^{-4/(p_1+4)} \bar\chi(a_1, \dots, a_{p_1}, b_1, \dots, b_{q_1})$, where

$\bar\chi(a_1, \dots, a_{p_1}, b_1, \dots, b_{q_1}) = \int \Bigg[\frac{1}{2} \kappa_2 \sum_{s=1}^{p_1} a_s^2 \big(\bar g_{ss}(\bar x) \bar f(\bar x) + 2 \bar f_s(\bar x) \bar g_s(\bar x)\big) + \sum_{s=1}^{q_1} b_s \sum_{\bar v^d} I_s(\bar v^d, \bar x^d) \big\{\bar g(\bar x^c, \bar v^d) - \bar g(\bar x)\big\} \bar f(\bar x^c, \bar v^d)\Bigg]^2 \bar f(\bar x)^{-1} \bar w(\bar x)\, d\bar x + \int \frac{\kappa^{p_1} \sigma^2(\bar x)}{a_1 \cdots a_{p_1}} \bar w(\bar x)\, d\bar x.$  (2.18)
Let $a_1^0, \dots, a_{p_1}^0, b_1^0, \dots, b_{q_1}^0$ denote values of $a_1, \dots, a_{p_1}, b_1, \dots, b_{q_1}$ that minimize $\bar\chi$ subject to each of them being nonnegative. We require that

each $a_s^0$ is positive and each $b_s^0$ is nonnegative, and all are finite and uniquely defined.  (2.19)

A sufficient condition that ensures (2.19) holds true is given in Lemma 5.1 in the appendix. The above analyses lead to the following result.

Theorem 2.1 Assume conditions (2.8), (2.10), (2.11), and (2.19) hold, and let $\hat h_1, \dots, \hat h_p, \hat\lambda_1, \dots, \hat\lambda_q$ denote the smoothing parameters that minimize CV. Then

$n^{1/(p_1+4)} \hat h_s \to a_s^0$ in probability, for $1 \le s \le p_1$;
$P(\hat h_s > C) \to 1$ for $p_1+1 \le s \le p$ and for all $C > 0$;
$n^{2/(p_1+4)} \hat\lambda_s \to b_s^0$ in probability, for $1 \le s \le q_1$;
$\hat\lambda_s \to 1$ in probability, for $q_1+1 \le s \le q$;  (2.20)

and $n^{4/(p_1+4)} CV(\hat h_1, \dots, \hat h_p, \hat\lambda_1, \dots, \hat\lambda_q) \to \inf \bar\chi$ in probability.

The proof of Theorem 2.1 is given in the appendix. Theorem 2.1 states that the cross-validated smoothing parameters behave so that the smoothing parameters for the irrelevant components converge in probability to the upper extremities of their respective ranges. Therefore, all irrelevant regressors are (asymptotically) automatically smoothed out, and the smoothing parameters for the relevant regressors are equal to the optimal bandwidths one would obtain were the irrelevant regressors not present. Next we present the asymptotic normality result for $\hat g(x)$.

Theorem 2.2 Under the same conditions as in Theorem 2.1, for $x = (x^c, x^d) \in S = S^c \times S^d$,

$\big(n \hat h_1 \cdots \hat h_{p_1}\big)^{1/2} \Bigg[\hat g(x) - \bar g(\bar x) - \sum_{s=1}^{q_1} B_{1s}(\bar x) \hat\lambda_s - \sum_{s=1}^{p_1} B_{2s}(\bar x) \hat h_s^2\Bigg] \to N(0, \Omega(\bar x))$ in distribution,  (2.21)

where $B_{1s}(\bar x) = \sum_{\bar v^d} I_s(\bar v^d, \bar x^d) \big\{\bar g(\bar x^c, \bar v^d) - \bar g(\bar x)\big\} \bar f(\bar x^c, \bar v^d) \bar f(\bar x)^{-1}$, $B_{2s}(\bar x) = \frac{1}{2} \kappa_2 \big[\bar g_{ss}(\bar x) + 2 \bar f_s(\bar x) \bar g_s(\bar x) / \bar f(\bar x)\big]$, and $\Omega(\bar x) = \kappa^{p_1} \sigma^2(\bar x) / \bar f(\bar x)$ are the terms related to the asymptotic bias and variance, respectively.
Theorem 2.2 follows from Theorem 2.1, and its proof is therefore omitted.

The Presence of Ordered Discrete Regressors

Until now we have only considered the case for which the discrete regressors are unordered. If, however, some of the discrete regressors are ordered, one should use alternative kernel functions that reflect this fact. If $x_s^d$ takes $c_s$ different values, $\{0, 1, \dots, c_s - 1\}$, and is ordered, Aitchison and Aitken (1976, p. 29) suggest using a kernel function given by $K^d(x_s^d, v_s^d, \lambda_s) = \binom{c_s}{t} \lambda_s^t (1 - \lambda_s)^{c_s - t}$, where $|x_s^d - v_s^d| = t$ ($0 \le t \le c_s$) and $\binom{c_s}{t} = c_s! / [t!(c_s - t)!]$. Though these weights sum to one, if $c_s \ge 3$ this weight function is problematic, as there is no value of $\lambda_s$ for which $K^d(x_s^d, v_s^d)$ is a constant function. Thus, even when $x_s^d$ is an irrelevant regressor, it cannot be smoothed out. Therefore, we suggest the use of an alternative simple kernel function for ordered regressors, defined by

$K^d(x_s^d, v_s^d) = \lambda_s^{|x_s^d - v_s^d|}.$  (2.22)

The range of $\lambda_s$ is $[0, 1]$. Again, when $\lambda_s$ assumes its upper extreme value ($\lambda_s = 1$), $K^d(x_s^d, v_s^d) \equiv 1$ for all values of $x_s^d, v_s^d \in S^d$, and $x_s^d$ is completely smoothed out from the regression function. It can easily be shown that when some of the discrete regressors are ordered, if one uses the kernel function given by (2.22) and modifies the definition of $I_s(v^d, x^d)$ given in (2.12) to $I_s(v^d, x^d) = I(|v_s^d - x_s^d| = 1) \prod_{t=1,\, t\neq s}^{q} I(v_t^d = x_t^d)$ (when $x_s^d$ is an ordered categorical regressor), then the results of Theorem 2.1 hold true. We omit this proof for space considerations.
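The contrast between the two ordered-discrete kernels can be illustrated as follows (our own sketch; function names are hypothetical): the kernel of (2.22) becomes identically one at $\lambda_s = 1$, whereas for $c_s = 3$ no value of $\lambda_s$ makes the Aitchison-Aitken weights constant in $t = |x_s^d - v_s^d|$.

```python
import numpy as np
from math import comb

def kernel_ordered(xd, vd, lam):
    """Kernel of eq. (2.22): lam ** |xd - vd|.
    At lam = 1 the kernel is identically 1, so the regressor is smoothed out."""
    return lam ** np.abs(np.asarray(xd) - np.asarray(vd))

def kernel_aa_ordered(xd, vd, lam, c):
    """Aitchison-Aitken ordered kernel: C(c, t) * lam**t * (1-lam)**(c-t),
    with t = |xd - vd|. For c >= 3 no lam makes these weights constant in t."""
    t = abs(xd - vd)
    return comb(c, t) * lam**t * (1.0 - lam)**(c - t)
```

A quick grid search over $\lambda_s \in (0, 1)$ confirms the claim: the three Aitchison-Aitken weights for $t = 0, 1, 2$ are never all equal, so an irrelevant ordered regressor could never be completely neutralized under that kernel.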
3 Finite-Sample Behavior

3.1 Simulations
We now consider a modest simulation exercise designed to assess the effectiveness of our cross-validatory approach to bandwidth selection when there exist irrelevant regressors. We shall consider a mix of continuous and discrete data types, and we focus on three issues: i) out-of-sample predictive performance; ii) the behavior of the cross-validated bandwidths; iii) the performance of cross-validation when relevant and irrelevant regressors are highly correlated. We conduct 1,000 Monte Carlo replications for each experiment, and we consider three models: 1) a parametric model (P); 2) a nonparametric frequency model having cross-validated bandwidths for the continuous regressors (NP CV-FR); 3) the proposed nonparametric approach having cross-validated bandwidths for both the continuous and discrete regressors (NP CV). For $i = 1, \dots, n$ we generate the following random variables: $z_{i1}, z_{i2} \in \{0, 1\}$ with $\Pr[z_{i1} = 1] = 0.69$ and $\Pr[z_{i2} = 1] = 0.73$, $x_{i1} \sim N(0, 1)$, $x_{i2} \sim N(0, 1)$ and $u_i \sim N(0, 1)$. The regressors $x_{i1}, x_{i2}, z_{i1}, z_{i2}$ vary in their degree of correlation,² $\rho \in \{0.0, 0.25, 0.50, 0.75, 0.95\}$, while the regressors and $u_i$ are independent of one another. However, not all the regressors are relevant, as we generate $y_i$ according to

$y_i = z_{i1} + x_{i1} + u_i, \quad i = 1, 2, \dots, n,$

so that $z_2$ and $x_2$ are irrelevant. Note that, when $\rho \neq 0$, the relevant and irrelevant regressors are correlated. The simulation results reported below clearly demonstrate that cross-validation does indeed smooth out irrelevant regressors regardless of whether they are independent ($\rho = 0$) or correlated ($\rho \neq 0$) with the relevant regressors. In order to assess a model's predictive performance, we estimate a given model on samples of size $n_1 = 100, 250, 500$, and then evaluate each model's performance on independent data drawn from the same DGP of size $n_2 = 1{,}000$. Predictive performance is computed as $\mathrm{PMSE} = n_2^{-1} \sum_i (\hat y_i - y_i)^2$.

² To generate the correlated regressors, we first generate a multivariate normal vector $W = (w_1, w_2, w_3, w_4)'$ with zero means and covariance matrix $\Sigma$ of dimension $4 \times 4$ with its diagonal elements equal to one and its off-diagonal elements equal to $\rho$. We then set $z_1 = 1$ if $w_1 < 0.5$, and 0 otherwise, and set $z_2 = 1$ if $w_2 < 0.6$, and 0 otherwise. Finally, we set $x_1 = w_3$ and $x_2 = w_4$.
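For concreteness, the simulation design (including the latent-normal scheme of footnote 2) can be sketched as follows. This is our own illustrative code, not the authors', and the function name `dgp` is hypothetical.

```python
import numpy as np

def dgp(n, rho, rng):
    """Simulation design of Section 3.1: z2 and x2 are irrelevant.
    Correlation is induced through a latent normal vector W, with the
    binary regressors obtained by thresholding (as in footnote 2)."""
    Sigma = np.full((4, 4), rho)
    np.fill_diagonal(Sigma, 1.0)
    W = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
    z1 = (W[:, 0] < 0.5).astype(int)     # Pr[z1 = 1] ~ 0.69
    z2 = (W[:, 1] < 0.6).astype(int)     # Pr[z2 = 1] ~ 0.73
    x1, x2 = W[:, 2], W[:, 3]
    y = z1 + x1 + rng.normal(size=n)     # z2 and x2 do not enter the mean
    return y, z1, z2, x1, x2
```

Each Monte Carlo replication would draw a training sample of size $n_1$ and an evaluation sample of size $n_2 = 1{,}000$ from this generator and compute the PMSE of each fitted model on the latter.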
We consider two parametric models, both of which include quadratic terms in $x_1$ and $x_2$ along with interaction terms, one having interaction terms of order two (P int(2)) and the other of order three (P int(3)). The nonparametric model is a local constant one with a generalized kernel function constructed from a Gaussian kernel for the continuous regressors and the kernel for the discrete regressors introduced in Section 2. All models therefore include the same conditioning information, $(z_{i1}, z_{i2}, x_{i1}, x_{i2})$. PMSE results are presented in Table 1, while cross-validated bandwidth summaries for the NP CV model appear in Tables 2 through 6.

Table 1: Out-of-sample PMSE performance for parametric and nonparametric models containing irrelevant regressors (ρ = 0.00). Median [5th percentile, 95th percentile] of PMSE.

  n1    NP CV           NP CV-FR        P int(2)        P int(3)
  100   1.14            1.23            1.52            15.00
        [1.02, 1.31]    [1.09, 1.44]    [1.19, 2.09]    [3.94, 61.26]
  250   1.07            1.11            1.13            1.69
        [0.98, 1.15]    [1.01, 1.21]    [1.02, 1.24]    [1.27, 2.65]
  500   1.04            1.06            1.05            1.18
        [0.96, 1.11]    [0.98, 1.13]    [0.97, 1.12]    [1.05, 1.36]
Table 1 reports the predicted MSE comparisons. Note that both the parametric and nonparametric models overspecify the true model. Nevertheless, since an overspecified model nests the true model, and given that an estimated parametric model has a faster ($\sqrt{n}$) rate of convergence, a parametric model will outperform a nonparametric model for sufficiently large $n$. However, in finite-sample applications, Table 1 reveals that the proposed nonparametric approach (NP CV) not only has better out-of-sample performance than the cross-validated frequency approach (NP CV-FR), it is also capable of outperforming parametric models containing irrelevant regressors. Qualitatively identical results hold for $\rho > 0.00$ and are omitted for space considerations, but are available from the authors upon request.

Next we consider the behavior of the cross-validated bandwidths, which are summarized in Table 2 (results for $\rho = 0.25, 0.50$ and $n = 500$ are omitted for space considerations, but are available upon request from the authors). An examination of Table 2 reveals that cross-validation indeed displays a tendency to 'smooth out' or remove irrelevant regressors. The irrelevant discrete regressor is smoothed out when its bandwidth $\hat\lambda_{z_2}$ takes on its upper bound value of 1, while the irrelevant continuous regressor is effectively smoothed out when its bandwidth $\hat h_{x_2}$ exceeds just a few standard deviations of the data. Note that the median value of $\hat\lambda_{z_2}$ is indeed 1 for all sample sizes considered, while that of $\hat h_{x_2}$ is orders of magnitude larger than the standard deviation of $x_2$ ($\sigma_{x_2} = 1$). Furthermore, the probability mass for the irrelevant regressor bandwidths appears to be shifting away from zero as the sample size increases. This is best appreciated
Table 2: Summary of cross-validated bandwidths for the NP CV estimator. Median, [10th percentile, 90th percentile] of $\hat\lambda$, $\hat h$.

  n1    λ̂_z1            λ̂_z2            ĥ_x1            ĥ_x2
  ρ = 0.00
  100   0.05            1.00            0.39            3072495.00
        [0.00, 0.14]    [0.13, 1.00]    [0.24, 0.50]    [0.76, 20670300.00]
  250   0.03            1.00            0.33            1633310.00
        [0.00, 0.06]    [0.21, 1.00]    [0.23, 0.40]    [0.87, 11928360.00]
  ρ = 0.75
  100   0.00            1.00            0.36            1938935.00
        [0.00, 0.28]    [0.05, 1.00]    [0.21, 0.49]    [0.52, 12731400.00]
  250   0.00            1.00            0.30            700582.50
        [0.00, 0.03]    [0.15, 1.00]    [0.20, 0.37]    [0.61, 7030172.00]
  ρ = 0.95
  100   0.00            1.00            0.35            675587.50
        [0.00, 1.00]    [0.01, 1.00]    [0.19, 0.61]    [0.29, 7023094.00]
  250   0.00            1.00            0.28            389321.50
        [0.00, 0.05]    [0.05, 1.00]    [0.18, 0.37]    [0.38, 4069998.00]
by noting that the 10th percentiles are indeed uniformly rising as $n$ increases for the irrelevant regressors, regardless of the degree of correlation among the regressors. Table 2 also reveals that cross-validation displays the same tendencies in the presence of irrelevant variables even when the regressors are highly correlated with one another. In general, as the degree of correlation increases, the bandwidths for the relevant and irrelevant variables tend towards one another, as one might expect; however, the tendency of relevant-variable bandwidths to go to zero and of irrelevant-variable bandwidths to be large is remarkably insensitive to the degree of correlation. This tendency can be observed in Table 2 by comparing the 90th percentiles of the relevant variables to the 10th percentiles of the irrelevant variables; the shift in probability mass is best seen, however, by considering Tables 3 through 6, which summarize the shift in the probability mass of the bandwidth distributions as the degree of correlation among the regressors, $\rho$, increases.
Table 3: Empirical $\Pr[\hat\lambda_{z_1} > \lambda^*_{z_1}]$ ($\lambda_{z_1} \in [0, 1]$), $\Pr[Z = 1] = 0.5$.

  n     > 0.10   > 0.25   > 0.50   > 0.75   > 0.90
  ρ = 0.00
  100   0.19     0.03     0.01     0.01     0.00
  250   0.01     0.00     0.00     0.00     0.00
  ρ = 0.75
  100   0.20     0.11     0.07     0.06     0.05
  250   0.01     0.00     0.00     0.00     0.00
  ρ = 0.95
  100   0.29     0.24     0.20     0.19     0.18
  250   0.06     0.03     0.02     0.02     0.02
Table 4: Empirical $\Pr[\hat\lambda_{z_2} > \lambda^*_{z_2}]$ ($\lambda_{z_2} \in [0, 1]$), $\Pr[Z = 1] = 0.5$.

  n     > 0.10   > 0.25   > 0.50   > 0.75   > 0.90
  ρ = 0.00
  100   0.92     0.82     0.70     0.61     0.57
  250   0.96     0.87     0.71     0.62     0.56
  ρ = 0.75
  100   0.86     0.76     0.66     0.59     0.55
  250   0.93     0.85     0.73     0.63     0.59
  ρ = 0.95
  100   0.78     0.70     0.63     0.59     0.57
  250   0.86     0.76     0.68     0.62     0.61
Table 5: Empirical $\Pr[\hat h_{x_1} > h^*_{x_1}]$ ($h_{x_1} \in [0, 1]$), $\Pr[Z = 1] = 0.5$.

  n     > 0.10   > 0.25   > 0.50   > 0.75   > 0.90
  ρ = 0.00
  100   0.99     0.87     0.11     0.00     0.00
  250   1.00     0.87     0.00     0.00     0.00
  ρ = 0.75
  100   0.99     0.84     0.08     0.00     0.00
  250   0.99     0.77     0.00     0.00     0.00
  ρ = 0.95
  100   0.97     0.78     0.16     0.07     0.05
  250   0.99     0.68     0.02     0.00     0.00
Table 6: Empirical $\Pr[\hat h_{x_2} > h^*_{x_2}]$ ($h_{x_2} \in [0, 1]$), $\Pr[Z = 1] = 0.5$.

  n     > 0.10   > 0.25   > 0.50   > 0.75   > 0.90
  ρ = 0.00
  100   1.00     1.00     0.96     0.90     0.87
  250   1.00     1.00     0.98     0.93     0.89
  ρ = 0.75
  100   1.00     0.98     0.90     0.80     0.75
  250   1.00     0.99     0.93     0.85     0.79
  ρ = 0.95
  100   0.99     0.93     0.76     0.66     0.62
  250   1.00     0.96     0.81     0.69     0.65

3.2 Application: Modeling Strike Volume

Next we consider a panel of annual observations used to model strike volume for 18 OECD countries. The data consist of observations on the level of strike volume (days lost due to industrial disputes per 1,000 wage and salary earners) and its regressors in 18 OECD countries from 1951-1985. The average level and variance of strike volume vary substantially across countries. The data distribution also features a long right tail and several large values of volume. The data fields include the following: 1) country code; 2) year; 3) strike volume; 4) unemployment; 5) inflation; 6) parliamentary representation of social democratic and labor parties; 7) a time-invariant measure of union centralization. The data are publicly available on StatLib (http://lib.stat.cmu.edu). As one country had incomplete data, we analyze only the 17 countries having complete records. These data were analyzed by Western (1996), who considered a linear panel model with country-specific fixed effects and a time trend. We consider a nonparametric model which treats country code as categorical, with the remaining regressors being continuous. To assess each model's performance, we estimate each model on data for the period 1951-83, and then evaluate each model based on its out-of-sample predictive performance for the period 1984-85, again using predictive squared error as our criterion. Table 7 summarizes the cross-validated bandwidths.

Table 7: Regressor standard deviations and bandwidths for the proposed method for the training data, n1 = 561.

                       year           unemp.       inflat.   parlia.   union       country
  Standard deviation   9.53           2.84         4.75      13.31     0.31        NA
  Bandwidth (ĥ, λ̂)     102565846.11   4800821.61   5.84      30.56     408328.03   0.12
As can be seen from Table 7, the continuous regressors year, unemployment, and union centralization are effectively smoothed out of the resulting nonparametric estimate, having bandwidths that are orders of magnitude larger than the respective regressors' standard deviations. This suggests that inflation and parliamentary representation are relevant continuous predictors. Country code, treated as a discrete regressor, has a small bandwidth, again suggesting that this regressor is relevant. We then compare the out-of-sample forecasting performance. The relative predictive MSE of the parametric panel model (all variables enter the model linearly) versus the NP CV approach is 1.33. We note that varying the forecast horizon has little impact on the relative predictive performance. We have also tried a parametric model with interaction terms (quadratic in the continuous regressors), and its resulting out-of-sample predictive MSE is larger than even the linear model's (the ratio of its predictive MSE to NP CV's is 1.44), while a parsimonious parametric model having only a constant, unemployment, and inflation has a predictive MSE 15% larger than that of the NP CV approach (the relative predictive MSE is 1.15), suggesting that a linear parametric specification is inappropriate. On the basis of this modest application, it appears that the proposed method is capable of removing irrelevant regressors in applied settings, and of outperforming common parametric models in terms of out-of-sample performance via the automatic dimensionality reduction inherent to the cross-validatory method.
4 Concluding Remarks
In this paper we consider a nonparametric regression model which admits both continuous and categorical data, and we allow for the case in which some of the regressors are redundant (i.e., irrelevant). We show that, asymptotically, the data-driven least squares cross-validation method can automatically remove irrelevant regressors. A natural extension would be to consider local linear/local polynomial regression. Based upon the results derived in this paper, we expect that, when the regression model is linear in some regressors, a local linear least squares cross-validation method will select very large values of the smoothing parameters for those linear regressors, resulting in a partially linear fit. Another research direction is to consider a regression model having only discrete regressors. The asymptotic behavior of the resulting cross-validation-selected smoothing parameters is expected to be quite different from the results presented in this paper.
APPENDIX: PROOF OF THEOREM 2.1

Step (i): Preparations. Letting $g_i = \bar g(\bar x_i)$, $\hat g_{-i} = \hat g_{-i}(x_i)$, $\hat f_{-i} = \hat f_{-i}(x_i)$, and $w_i = w(x_i)$ (with $x_i \equiv X_i$), we have
$$CV(h,\lambda) = \frac{1}{n}\sum_{i=1}^{n}(g_i - \hat g_{-i})^2 w_i + \frac{2}{n}\sum_{i=1}^{n} u_i (g_i - \hat g_{-i}) w_i + \frac{1}{n}\sum_{i=1}^{n} u_i^2 w_i. \qquad (5.1)$$
The third term on the right-hand-side of (5.1) does not depend on (h, λ). Step (vi) below shows 13
that the second term has an order smaller than the first term. Therefore, the first term is the leadP ing term of CV . Define m ˆ 1,−i (xi ) = (n − 1)−1 nj6=i [¯ g (¯ xj ) − g¯(¯ xi )]K(xj , xi ) and let m ˆ 2,−i (xi ) = P (n − 1)−1 nj6=i uj K(xj , xi ). Then the first term of (5.1) can be written as n
n
n
i=1
i=1
i=1
1X 2 1X 2 2X 2 2 2 m ˆ 1,−i wi /fˆ−i + m ˆ 2,−i wi /fˆ−i + m ˆ 1,−i m ˆ 2,−i wi /fˆ−i ≡ S1 + S2 + 2S3 , n n n
(5.2)
where the definition of $S_j$ ($j = 1, 2, 3$) should be apparent. Define
$$\zeta_n = \left(\sum_{s=1}^{p_1} h_s^2 + \sum_{s=1}^{q_1}\lambda_s\right)^2. \qquad (5.3)$$
In Steps (ii) to (iv) below we show that
$$S_1 = \int \left\{\sum_{s=1}^{q_1}\lambda_s \sum_{\bar v^d} I_s(\bar v^d, \bar x^d)\left[\bar g(\bar x^c, \bar v^d) - \bar g(\bar x)\right]\bar f(\bar x^c, \bar v^d) + \frac{\kappa_2}{2}\sum_{s=1}^{p_1} h_s^2\left[\bar g_{ss}(\bar x)\bar f(\bar x) + 2\bar f_s(\bar x)\bar g_s(\bar x)\right]\right\}^2 \bar f(\bar x)^{-1}\bar w(\bar x)\,d\bar x + o_p(\zeta_n), \qquad (5.4)$$
$$S_2 = \frac{\kappa^{p_1}}{n h_1 \cdots h_{p_1}}\int \bar\sigma^2(\bar x)\, w(x)\tilde R(x)\tilde f(\tilde x)\,dx + o_p\left((nH)^{-1}\right), \qquad (5.5)$$
$$S_3 = o_p\left(\zeta_n + (nH)^{-1}\right), \qquad (5.6)$$
where $\tilde R(x)$ is defined in (2.14) and the $o_p(\cdot)$ terms are all uniform in $(h,\lambda)$ such that
$$n^{-1} \le H \le n^{-\epsilon}, \quad n^{-c} < h_s < n^{c} \text{ for all } s = 1,\ldots,p, \quad \text{and } \lambda_s \in [0,1] \text{ for } 1 \le s \le q, \qquad (5.7)$$
for some $\epsilon \in \left(0, \frac{1}{4+p}\right)$.
Therefore, the leading term of $CV$ is given by (2.13), obtained from the leading terms of $S_1$ and $S_2$. Below we prove some results that are used in the proof of Theorem 2.1. In the proofs below we will often ignore the difference between $n^{-1}$ and $(n-1)^{-1}$ (or $(n-k)^{-1}$ for any fixed finite integer $k$). Also, unless otherwise stated, all $o_p(\cdot)$, $o(\cdot)$, $O_p(\cdot)$, and $O(\cdot)$ below are uniform in $(h,\lambda) = (h_1,\ldots,h_p,\lambda_1,\ldots,\lambda_q)$ satisfying (5.7). Letting $\mu_f(x_i) = E(\hat f_{-i}(x_i)|x_i)$, then for integers $k \ge 1$ define
$$H_k = \frac{E\left[K(x_i, x_j)^k\right]}{\left(E[\mu_f(x_i)]\right)^k}. \qquad (5.8)$$
It can easily be shown that $H_k$ is bounded below and above by constant multiples of $H^{-(k-1)}$; that is, for some constant $c > 1$ we have
$$c^{-1}H^{-(k-1)} \le H_k \le c\, H^{-(k-1)}, \qquad (5.9)$$
where $H = h_1\cdots h_{p_1}\prod_{s=p_1+1}^{p}\min(h_s, 1)$ as defined in (2.9).
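The scaling in (5.9) can be checked numerically in a simple special case. The setting below is an assumption for illustration only (one continuous regressor with uniform design density on $[0,1]$ and a Gaussian kernel, so that $H = h$); it is not the general mixed-kernel setting of the theorem:

```python
import numpy as np

def H_k(h, k, m=1000):
    """Quadrature approximation of H_k = E[K_h(X-Y)^k] / (E[K_h(X-Y)])^k
    for X, Y i.i.d. uniform on [0,1] and K_h(u) = phi(u/h)/h (Gaussian)."""
    x = (np.arange(m) + 0.5) / m                  # midpoint grid on [0,1]
    u = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return np.mean(K ** k) / np.mean(K) ** k

# H_k * H^(k-1) (here H = h) remains within constant bounds as h shrinks,
# consistent with c^{-1} H^{-(k-1)} <= H_k <= c H^{-(k-1)} in (5.9):
for k in (2, 3):
    r1 = H_k(0.10, k) * 0.10 ** (k - 1)
    r2 = H_k(0.05, k) * 0.05 ** (k - 1)
    assert 0.5 < r1 / r2 < 2.0
```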
Step (ii): Proof of (5.4). Recall that $\mu_{f,i} = \mu_f(x_i) = E(\hat f_{-i}(x_i)|x_i)$, and define $S_1^0$ by replacing $\hat f_{-i}$ by $\mu_{f,i}$ in $S_1$, i.e., $S_1^0 = n^{-1}\sum_{i=1}^{n}\hat m_{1,-i}^2 w_i/\mu_{f,i}^2$. We will show that (5.4) holds true with $S_1$ replaced by $S_1^0$, and that $S_1 - S_1^0 = o_p(\zeta_n + (nH)^{-1})$ uniformly in $(h,\lambda)$. Letting $K_{ij} = K(x_i, x_j)$, we write $S_1^0 = G_1 + G_2$, where
$$G_1 = \frac{1}{n(n-1)^2}\sum_i\sum_{j\ne i}(g_j - g_i)^2 K_{ij}^2 w_i/\mu_{f,i}^2, \qquad (5.10)$$
and
$$G_2 = \frac{1}{n(n-1)^2}\sum_i\sum_{j\ne i}\sum_{l\ne j,\, l\ne i}(g_j - g_i)K_{ij}(g_l - g_i)K_{il}\, w_i/\mu_{f,i}^2. \qquad (5.11)$$
We first consider $G_2$, which can be written as a third order U-statistic. Define $Q_{ijl}$ as the symmetrized version of $(g_i - g_j)(g_i - g_l)K_{ij}K_{il}w_i/\mu_{f,i}^2$ (symmetric in $i,j,l$), let $Q_{ij} = E(Q_{ijl}|x_i, x_j)$, and let $Q_i = E(Q_{ij}|x_i)$. Then by the U-statistic Hoeffding decomposition (replacing $(n-1)^{-1}$ by $(n-2)^{-1}$),
$$G_2 = EQ_1 + \frac{3}{n}\sum_{i=1}^{n}[Q_i - EQ_1] + \frac{6}{n(n-1)}\sum_i\sum_{j>i}[Q_{ij} - Q_i - Q_j + EQ_1] + \frac{6}{n(n-1)(n-2)}\sum\sum\sum_{l>j>i}[Q_{ijl} - Q_{ij} - Q_{il} - Q_{jl} + Q_i + Q_j + Q_l - EQ_1] \equiv J_0 + J_1 + J_2 + J_3,$$
where the definition of $J_j$ ($j = 0,\ldots,3$) should be apparent, and
$$J_0 = E(Q_i) = E(Q_{ijl}) = E\left\{\left(E[(g_j - g_i)K_{ij}|x_i]\right)^2 w(x_i)/\mu_{f,i}^2\right\}.$$
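The Hoeffding decomposition invoked above is an exact algebraic identity, which can be verified numerically for a simple second order U-statistic. The kernel $h(a,b) = (a+b)^2$ and standard normal data below are illustrative choices only, not the kernels of this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)

def h2(a, b):                       # a symmetric second order kernel
    return (a + b) ** 2

theta = 2.0                         # E[h2(X, Y)] = 2 for X, Y ~ N(0, 1)
g1 = x ** 2 + 1.0                   # g1(a) = E[h2(a, Y)] = a^2 + 1

i, j = np.triu_indices(n, k=1)
U = h2(x[i], x[j]).mean()                              # the U-statistic
proj = (2.0 / n) * (g1 - theta).sum()                  # Hajek projection
D = (h2(x[i], x[j]) - g1[i] - g1[j] + theta).mean()    # degenerate part

# The decomposition U = theta + projection + degenerate remainder is exact:
assert np.isclose(U, theta + proj + D)
```

The degenerate part $D$ has conditional mean zero given any single observation, which is what drives the sharper moment bounds for the $J_2$- and $J_3$-type terms below.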
In Step (vii) we shall show that $h_s$ for $s = 1,\ldots,p_1$ and $\lambda_s$ for $s = 1,\ldots,q_1$ all converge to zero as $n \to \infty$; see (5.46). Therefore, by a Taylor expansion argument, it can be shown that
$$E[(g_j - g_i)K_{ij}|x_i = x] = \left\{\frac{\kappa_2}{2}\sum_{s=1}^{p_1}h_s^2\left[\bar g_{ss}(\bar x)\bar f(\bar x) + 2\bar g_s(\bar x)\bar f_s(\bar x)\right] + \sum_{s=1}^{q_1}\lambda_s\sum_{\bar v^d}I_s(\bar v^d, \bar x^d)\left[\bar g(\bar x^c, \bar v^d) - \bar g(\bar x)\right]\bar f(\bar x^c, \bar v^d)\right\}\nu_1(x) + o\left(\zeta_n^{1/2}\right)$$
uniformly in $x \in S$, with $(h,\lambda)$ as prescribed by (5.7), and where $\zeta_n$ is defined in (5.3). Therefore
$$J_0 = \int\left\{\frac{\kappa_2}{2}\sum_{s=1}^{p_1}h_s^2\left[\bar g_{ss}(\bar x)\bar f(\bar x) + 2\bar g_s(\bar x)\bar f_s(\bar x)\right] + \sum_{s=1}^{q_1}\lambda_s\sum_{\bar v^d}I_s(\bar v^d, \bar x^d)\left[\bar g(\bar x^c, \bar v^d) - \bar g(\bar x)\right]\bar f(\bar x^c, \bar v^d)\right\}^2 \bar w(\bar x)\bar f(\bar x)^{-1}\,d\bar x + o(\zeta_n), \qquad (5.12)$$
where in the above we have also used $\mu_f(x) = \bar f(\bar x)\nu_1(x) + O_p(\zeta_n^{1/2})$ uniformly in $x \in S$ and $(h,\lambda)$, where $f(x) = \bar f(\bar x)\tilde f(\tilde x)$, and where $\bar w(\cdot)$ is defined in (2.16).
Next, we consider $J_1$. Note that for integers $k \ge 2$, the $k$th moment of $Q_i$ has the same order as the $k$th moment of $\zeta_n w_i$. Hence, $E(Q_i^k) = O(\zeta_n^k)$, and by Rosenthal's inequality we have $E|J_1|^{2k} \le n^{-2k}C_k\left\{n^k\zeta_n^{2k} + n\zeta_n^{2k}\right\} = O(n^{-k}\zeta_n^{2k})$. Therefore, by Markov's inequality, for $\delta \in (0, 1/2)$ and for all $C > 0$, we have
$$P\left(|J_1| > n^{-\delta}\zeta_n\right) \le C_k n^{-(1-2\delta)k} = O\left(n^{-C}\right) \qquad (5.13)$$
uniformly in $(h,\lambda)$.

Since $C$ in (5.13) is arbitrarily large, the same result holds uniformly over any set of values of $(h,\lambda)$, as prescribed by (5.7), whose size is no larger than a polynomial in $n$. That is, if $\Lambda$ is any such set of values of $(h,\lambda)$, then
$$P\left(\sup_{(h,\lambda)\in\Lambda}|J_1| > n^{-\delta}\zeta_n\right) = O\left(n^{-C}\right). \qquad (5.14)$$
The regularity conditions asserted in Theorem 2.1 imply that each of $h_1,\ldots,h_p$ is no smaller than $n^{-C_1}$, and no larger than $n^{C_2}$, for constants $C_1, C_2 > 0$. Furthermore, the function $K$ is compactly supported and Hölder continuous. Therefore, taking a polynomially fine mesh of vectors $\xi$ prescribed by (5.7), we deduce that (5.14) continues to hold if $\Lambda$ is replaced by the class of all $(h,\lambda)$ satisfying (5.7), which implies that $J_1 = o_p(\zeta_n)$ uniformly in $(h,\lambda)$.
Next, we consider $J_2$. By noting that $Q_{ij}$ has the same $k$th moment order as does $\zeta_n^{1/2}(g_i - g_j)K_{ij}w_i/\mu_{f,i}$, we have $E(Q_{ij}^k) = O(\zeta_n^{k/2}H_k)$. By Step (v) (a) below we know that
$$E|J_2|^{2k} \le C_k n^{-4k}\left\{n^{2k}\zeta_n^k H_2^k + (\text{s.o.})\right\} = O\left(n^{-2k}\zeta_n^k H_2^k\right) = O\left(\zeta_n^k n^{-k}(nH)^{-k}\right), \qquad (5.15)$$
where the last equality follows from (5.9), and (s.o.) denotes terms having orders smaller than that of the leading term of order $n^{-2k}\zeta_n^k H_2^k$ (see Step (v) for the proof). By Markov's inequality, for $\delta \in (0, \epsilon/2)$ and for all $C > 0$, we have
$$P\left(|J_2| > n^{-\delta}(nH)^{-1}\right) \le C_k n^{2\delta k}n^{-k}(nH)^k = O\left(n^{-(\epsilon-2\delta)k}\right) = O\left(n^{-C}\right) \qquad (5.16)$$
uniformly in $(h,\lambda)$, because $nH < n^{1-\epsilon}$ by (5.7).

By the same arguments used to obtain (5.14) from (5.13), we know that (5.16) leads to
$$P\left(\sup|J_2| > n^{-\delta}(nH)^{-1}\right) = O\left(n^{-C}\right), \qquad (5.17)$$
where the supremum is over $(h,\lambda)$ as prescribed in (5.7). Therefore, $J_2 = o_p\left((nH)^{-1}\right)$ uniformly in $(h,\lambda)$.
The variable $J_3$ is a third order U-statistic. By noting that the $k$th moment of $Q_{ijl}$ has the same order as the $k$th moment of the unsymmetrized quantity $(g_j - g_i)K_{ij}(g_l - g_i)K_{il}w_i/\mu_{f,i}^2$, we know that $E(Q_{ijl}^k) = O(\zeta_n^{k/2}H_k^2)$. By Step (v) (b) below we know that (also using (5.9))
$$E|J_3|^{2k} \le C_k n^{-6k}\left\{n^{3k}\zeta_n^k H_2^{2k} + (\text{s.o.})\right\} = O\left(n^{-3k}\zeta_n^k H_2^{2k}\right) = O\left(\zeta_n^k n^{-k}(nH)^{-2k}\right), \qquad (5.18)$$
where (s.o.) denotes terms having order smaller than $O(n^{-3k}\zeta_n^k H_2^{2k})$. Comparing (5.18) with (5.15), we see that $J_3$ has an order smaller than that of $J_2$.

We now consider $G_1$. Define $Q_{ij} = (1/2)\left[w_i/\mu_{f,i}^2 + w_j/\mu_{f,j}^2\right](g_j - g_i)^2 K_{ij}^2$ and $Q_i = E[Q_{ij}|x_i]$. Then
$$G_1 = \frac{1}{n-1}\left\{EQ_1 + \frac{2}{n}\sum_{i=1}^{n}[Q_i - EQ_1] + \frac{2}{n(n-1)}\sum_i\sum_{j>i}[Q_{ij} - Q_i - Q_j + EQ_1]\right\} \equiv G_{1,0} + G_{1,1} + G_{1,2}.$$
Then by exactly the same arguments as in the analysis of $G_2$ above, one can easily show that
$$G_{1,0} = \frac{1}{n-1}EQ_{ij} = O\left(n^{-1}\zeta_n^{1/2}H_2\right) = O\left(\zeta_n^{1/2}(nH)^{-1}\right) = o\left((nH)^{-1}\right)$$
uniformly in $(h,\lambda)$. Noting that $E[Q_i^k] = O(\zeta_n^{k/2}H_{2k})$, by Rosenthal's inequality we obtain
$$E|G_{1,1}|^{2k} \le C_k n^{-4k}\left\{n^k\zeta_n^k H_4^k + n\zeta_n^k H_{4k}\right\} = O\left(\zeta_n^k n^{-3k}H_4^k\right) = O\left(\zeta_n^k (nH)^{-3k}\right)$$
by (5.9). From this, and using Markov's inequality, it can be shown that $G_{1,1} = o_p\left((nH)^{-1}\right)$ uniformly in $(h,\lambda)$. Similarly, one can show that $G_{1,2} = o_p\left((nH)^{-1}\right)$ uniformly in $(h,\lambda)$. Summarizing the above, we have shown that
$$S_1^0 = G_1 + G_2 = J_0 + o_p\left(\zeta_n + (nH)^{-1}\right),$$
where $J_0$ is given in (5.12). Thus, we have shown that (5.4) holds true with $S_1$ replaced by $S_1^0$. It remains to be shown that $S_1 - S_1^0 = o_p(\zeta_n + (nH)^{-1})$. Defining $\Delta_{f,-i} = \hat f_{-i}(x_i) - E[\hat f_{-i}(x_i)|x_i]$, then by Rosenthal's inequality we have
$$E\left|\frac{\Delta_{f,-i}(x)}{\mu_{f,i}(x)}\right|^{2k} \le C_k n^{-2k}\left\{n^k H_2^k + nH_{2k}\right\} = O\left((nH)^{-k}\right) \qquad (5.19)$$
by (5.9). By (5.19) and Markov's inequality, for $\delta \in (0, \epsilon/6)$ and for all $C > 0$, we have
$$P\left(\left|\frac{\Delta_{f,-i}(x)}{\mu_f(x)}\right| > n^{-\delta}(nH)^{-1/3}\right) \le C_k n^{2\delta k}(nH)^{-k/3} = O\left(n^{-(\epsilon-6\delta)k/3}\right) = O\left(n^{-C}\right) \qquad (5.20)$$
uniformly in $1 \le i \le n$, $x \in S$, and $(h,\lambda)$ as prescribed in (5.7), where we have also used $H_2 = O(H^{-1})$.

By the same arguments used to obtain (5.14) from (5.13), one can further show that
$$P\left(\sup\left|\frac{\Delta_{f,-i}(x)}{\mu_f(x)}\right| > n^{-\delta}(nH)^{-1/3}\right) = O\left(n^{-C}\right), \qquad (5.21)$$
where, with $\xi = (x, h, \lambda)$, the supremum is over $1 \le i \le n$, $x \in S$, and $(h,\lambda)$ as prescribed in (5.7). Also, by Rosenthal's inequality,
$$E\left\{\left|\frac{\hat m_{1,-i}(x) - E\hat m_{1,-i}(x)}{\mu_{f,i}(x)}\right|^{2k}\right\} \le C_k n^{-2k}\left\{\zeta_n^{k/2}n^k H_2^k + nH_{2k}\right\} = O\left(\zeta_n^{k/2}(nH)^{-k}\right). \qquad (5.22)$$
From (5.22) and Markov's inequality, for $\delta \in (0, \epsilon/6)$ and for all $C > 0$, we have
$$P\left(\left|\frac{\hat m_{1,-i}(x) - E\hat m_{1,-i}(x)}{\mu_{f,i}(x)}\right| > n^{-\delta}(nH)^{-1/3}\right) \le C_k n^{2\delta k}\zeta_n^{k/2}(nH)^{-k/3} = O\left(n^{-(\epsilon-6\delta)k/3}\right) = O\left(n^{-C}\right), \qquad (5.23)$$
which can be further strengthened to
$$P\left(\sup\left|\frac{\hat m_{1,-i}(x) - E\hat m_{1,-i}(x)}{\mu_{f,i}(x)}\right| > n^{-\delta}(nH)^{-1/3}\right) = O\left(n^{-C}\right), \qquad (5.24)$$
where the supremum is over $1 \le i \le n$, $x \in S$, and $(h,\lambda)$ as prescribed in (5.7). Also, a simple Taylor expansion yields $\sup_{1\le i\le n,\, x, h, \lambda}|E\hat m_{1,-i}(x)| = O(\zeta_n^{1/2})$. Therefore,
$$\sup|\hat m_{1,-i}(x)| \le \sup|E\hat m_{1,-i}(x)| + \sup|\hat m_{1,-i}(x) - E\hat m_{1,-i}(x)| = O\left(\zeta_n^{1/2} + (nH)^{-1/3}\right), \qquad (5.25)$$
where the supremum is over $1 \le i \le n$, $x \in S$, and $(h,\lambda)$ as prescribed in (5.7). Thus, using (5.21) and (5.25), we immediately obtain
$$S_1 - S_1^0 = \frac{1}{n}\sum_i \hat m_{1,-i}^2 w_i\left(\frac{1}{\hat f_{-i}^2} - \frac{1}{\mu_{f,i}^2}\right) \le C\left[\sup|\hat m_{1,-i}(x)|\right]^2 \sup\left|\frac{\Delta_{f,-i}(x)}{\mu_{f,i}(x)}\right| = o_p\left(\zeta_n + (nH)^{-1}\right).$$

Step (iii): Proof of (5.5). Define $S_2^0$ by replacing $\hat f_{-i}$ by $\mu_{f,i}$ in $S_2$. We will show that (5.5) holds true with $S_2$ replaced by $S_2^0$, and that $S_2 - S_2^0 = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$ as prescribed in (5.7).
Now, $S_2^0 = n^{-1}\sum_i \hat m_{2,-i}^2 w_i/\mu_{f,i}^2$ can be written as
$$S_2^0 = \frac{1}{n(n-1)^2}\sum_i\sum_{j\ne i}u_j^2 K_{x_i,x_j}^2 w_i/\mu_{f,i}^2 + \frac{1}{n(n-1)^2}\sum_i\sum_{j\ne i}\sum_{l\ne i,\, l\ne j}u_j u_l K_{x_i,x_j}K_{x_i,x_l}w_i/\mu_{f,i}^2 \equiv D_1 + D_2.$$
Note that $D_2$ can be written as a third order U-statistic. Let $V_{ijl}$ denote the symmetrized version of $u_j u_l K_{x_i,x_j}K_{x_i,x_l}w_i/\mu_{f,i}^2$, and define $V_{ij} = E[V_{ijl}|z_i, z_j]$, where $z_i = (x_i, u_i)$.³ Then by the U-statistic Hoeffding decomposition, we have
$$D_2 = \frac{6}{n(n-1)}\sum_i\sum_{j>i}V_{ij} + \frac{6}{n(n-1)(n-2)}\sum\sum\sum_{l>j>i}\left[V_{ijl} - V_{ij} - V_{il} - V_{jl}\right] \equiv D_{2,1} + D_{2,2}.$$

³Note that we could define $V_i = E[V_{ij}|z_i]$. But then $V_i = 0$, as well as $EV_i = 0$.
Here, $D_{2,1}$ is a second order degenerate U-statistic, hence it is easy to show that $E[V_{ij}^k] = O(H_k)$. By the same arguments used in the proof of Step (v) (a), it can be shown that
$$E|D_{2,1}|^{2k} \le n^{-4k}C_k\left\{n^{2k}H_2^k + (\text{s.o.})\right\} = O\left(n^{-2k}H_2^k\right) = O\left(n^{-k}(nH)^{-k}\right),$$
where (s.o.) denotes terms smaller than $O(n^{-2k}H_2^k)$. By Markov's inequality, for $\delta \in (0, \epsilon/2)$ and for all $C > 0$, we have
$$P\left(|D_{2,1}| > n^{-\delta}(nH)^{-1}\right) \le C_k n^{2\delta k}(nH)^k n^{-k} = O\left(n^{-(\epsilon-2\delta)k}\right) = O\left(n^{-C}\right)$$
uniformly in $(h,\lambda)$ satisfying (5.7), because $nH < n^{1-\epsilon}$.

By the same arguments which led to (5.14), the above result can be strengthened to
$$P\left(\sup|D_{2,1}| > n^{-\delta}(nH)^{-1}\right) = O\left(n^{-C}\right), \qquad (5.26)$$
where the supremum is over $(h,\lambda)$ satisfying (5.7).

Note that $D_{2,2}$ is a third order U-statistic, and it is easy to show that $E[V_{ijl}^k] = O(H_k^2)$. Therefore, by an argument similar to that used in the proof of Step (v) (b), we know that
$$E|D_{2,2}|^{2k} \le n^{-6k}C_k\left\{n^{3k}H_2^{2k} + (\text{s.o.})\right\} = O\left(n^{-k}(nH)^{-2k}\right),$$
where (s.o.) denotes smaller order terms (smaller than $O(n^{-k}(nH)^{-2k})$). We observe that $D_{2,2}$ has an order smaller than that of $D_{2,1}$. Therefore, we have shown that $D_2 = o_p\left((nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

Next, we consider $D_1$. Define $V_{ij} = (1/2)\left[u_j^2 w_i/\mu_{f,i}^2 + u_i^2 w_j/\mu_{f,j}^2\right]K_{x_i,x_j}^2$, and let $V_i = E(V_{ij}|x_i, u_i)$.
Then we have
$$D_1 = \frac{1}{n-1}\left\{EV_1 + \frac{2}{n}\sum_{i=1}^{n}[V_i - EV_1] + \frac{2}{n(n-1)}\sum_i\sum_{j>i}[V_{ij} - V_i - V_j + EV_1]\right\} \equiv B_0 + B_1 + B_2,$$
where
$$B_0 = \frac{1}{n-1}E(V_{ij}) = \frac{1}{n-1}E\left\{w(x_i)\mu_{f,i}^{-2}E\left[\sigma^2(\bar x_j)K_{ij}^2|x_i\right]\right\}.$$
Applying a Taylor expansion we have
$$E\left[\sigma^2(x_j)K_{ij}^2|x_i = x\right] = \kappa^{p_1}\sigma^2(\bar x)\bar f(\bar x)\nu_2(x)(h_1\cdots h_{p_1})^{-1} + O\left(\zeta_n^{1/2}(h_1\cdots h_{p_1})^{-1}\right)$$
uniformly in $(x, h, \lambda)$. Therefore, using $(n-1)^{-1} = n^{-1} + O(n^{-2})$, we have
$$B_0 = \kappa^{p_1}(nh_1\cdots h_{p_1})^{-1}\int\sigma^2(\bar x)\bar f(\bar x)\nu_2(x)w(x)\mu_f(x)^{-2}f(x)\,dx + o\left((nH)^{-1}\right) = \kappa^{p_1}(nh_1\cdots h_{p_1})^{-1}\int\sigma^2(\bar x)w(x)\left(\nu_2(x)/\nu_1(x)^2\right)\tilde f(\tilde x)\,dx + o\left((nH)^{-1}\right) \qquad (5.27)$$
uniformly in $(x, h, \lambda)$, where in the last equality we have used $\mu_f(x) = \bar f(\bar x)\nu_1(x) + O_p\left(\zeta_n^{1/2}\right)$ uniformly in $(x, h, \lambda)$.

For $B_1$, noting that $E(V_i^2) = O(H_4)$, by Rosenthal's inequality we have
$$E|B_1|^{2k} \le C_k n^{-4k}\left\{n^k H_4^k + nH_{2k}\right\} = O\left(n^{-3k}H_4^k\right) = O\left((nH)^{-3k}\right),$$
where the last equality follows from $H_k = O\left(H^{-(k-1)}\right)$ by (5.9). It follows by Markov's inequality that for $\delta \in (0, \epsilon/2)$ and for all $C > 0$,
$$P\left(|B_1| > n^{-\delta}(nH)^{-1}\right) \le C_k n^{2\delta k}O\left((nH)^{-k}\right) = O\left(n^{-(\epsilon-2\delta)k}\right) = O\left(n^{-C}\right)$$
uniformly in $(h,\lambda)$. This can be further strengthened to
$$P\left(\sup|B_1| > n^{-\delta}(nH)^{-1}\right) = O\left(n^{-C}\right),$$
where the supremum is over $(h,\lambda)$.

Noting that $B_2$ equals $(n-1)^{-1}$ multiplied by a second order degenerate U-statistic, and that $E(V_{ij}^2) = O(H_4)$, by the same arguments as in Step (v) (a) we have
$$E|B_2|^{2k} \le C_k n^{-6k}\left\{n^{2k}H_4^k + (\text{s.o.})\right\} = O\left(n^{-4k}H_4^k\right) = O\left(n^{-k}(nH)^{-3k}\right),$$
where (s.o.) denotes smaller order terms (smaller than $O(n^{-4k}H_4^k)$). We observe that $B_2$ has an order smaller than that of $B_1$. Summarizing the above, we have shown that
$$S_2^0 = \frac{\kappa^{p_1}}{nh_1\cdots h_{p_1}}\int\sigma^2(\bar x)w(x)\left[\nu_2(x)/\nu_1(x)^2\right]\tilde f(\tilde x)\,dx + o_p\left((nH)^{-1}\right) \qquad (5.28)$$
uniformly in $(h,\lambda)$.
∆ (x) −1/3 , one can show that (nH) = o Finally, following the same proof as for sup1≤i≤n,ξ µf,−i p (x) f m ˆ (x) −1/3 , where ξ = (x, h, λ), the supremum is over x ∈ S, and (h, λ) is sup1≤i≤n,ξ µ2,−i (nH) = o p f (x) given in (5.7). Therefore, S2 − S20 ≤ C
uniformly in (h, λ).
"
sup 1≤i≤n,ξ
#2 m ˆ (x) 2,−i µf (x)
sup 1≤i≤n,ξ
∆f,−i (x) −1 µf (x) = op (nH)
Step (iv): Proof that $S_3 = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$ as prescribed in (5.7). Define $S_3^0$ by replacing $\hat f_{-i}$ by $\mu_{f,i}$ in $S_3$; that is,
$$S_3^0 = n^{-1}\sum_i \hat m_{1,-i}\hat m_{2,-i}w_i/\mu_{f,i}^2 = n^{-3}\sum_i\sum_{j\ne i}u_j(g_i - g_j)K_{x_i,x_j}^2 w_i/\mu_{f,i}^2 + n^{-3}\sum_i\sum_{j\ne i}\sum_{l\ne j,\, l\ne i}u_j(g_i - g_l)K_{x_i,x_j}K_{x_i,x_l}w_i/\mu_{f,i}^2 \equiv M_1 + M_2.$$
$M_2$ can be written as a third order U-statistic. Letting $R_{ijl}$ denote the symmetrized version of $u_j(g_i - g_l)K_{x_i,x_j}K_{x_i,x_l}w_i/\mu_{f,i}^2$, $R_{ij} = E[R_{ijl}|z_i, z_j]$, and $R_i = E[R_{ij}|z_i]$ (note that $ER_i = 0$), then
$$M_2 = \frac{3}{n}\sum_{i=1}^{n}R_i + \frac{6}{n(n-1)}\sum_i\sum_{j>i}[R_{ij} - R_i - R_j] + \frac{6}{n(n-1)(n-2)}\sum\sum\sum_{l>j>i}[R_{ijl} - R_{ij} - R_{il} - R_{jl} + R_i + R_j + R_l] \equiv G_1 + G_2 + G_3.$$
By noting that the $k$th moment of $R_i$ has the same order as the $k$th moment of $\zeta_n^{1/2}u_i$, we have, for $k \ge 2$, $E[R_i^k] = O(\zeta_n^{k/2})$. Then, by Rosenthal's inequality,
$$E|G_1|^{2k} \le C_k n^{-2k}\left\{n^k\zeta_n^k + n\zeta_n^k\right\} = O\left(\zeta_n^k n^{-k}\right). \qquad (5.29)$$
By Markov's inequality, for $\delta \in (0, \epsilon/2)$ and for all $C > 0$, we have
$$P\left(|G_1| > n^{-\delta}\zeta_n^{1/2}(nH)^{-1/2}\right) \le C_k n^{2\delta k}(nH)^k n^{-k} = O\left(n^{-(\epsilon-2\delta)k}\right) = O\left(n^{-C}\right) \qquad (5.30)$$
uniformly in $(h,\lambda)$, because $nH < n^{1-\epsilon}$ by (5.7). Result (5.30) implies that $G_1 = o_p\left(\zeta_n^{1/2}(nH)^{-1/2}\right) = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

For $G_2$, noting that the $k$th moment of $R_{ij}$ has the same order as the $k$th moment of $\zeta_n^{1/2}u_i(g_j - g_i)K_{ij}/\mu_{f,i}$, we obtain $E[R_{ij}^k] = O(\zeta_n^{k/2}H_k)$. Therefore, by exactly the same arguments as used in proving Step (v) (a), we have
$$E|G_2|^{2k} \le C_k n^{-4k}\left\{n^{2k}\zeta_n^k H_2^k + (\text{s.o.})\right\} = O\left(\zeta_n^k n^{-2k}H_2^k\right) = O\left(\zeta_n^k n^{-k}(nH)^{-k}\right). \qquad (5.31)$$
By Markov's inequality and the same arguments that lead to (5.14), one can show using (5.31) that $G_2 = o_p\left((nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

For $G_3$, noting that the $k$th moment of $R_{ijl}$ has the same order as the $k$th moment of the unsymmetrized $u_i(g_j - g_i)K_{ij}(g_l - g_i)K_{il}w_i/\mu_{f,i}^2$, we know that $E[R_{ijl}^k] = O(\zeta_n^{k/2}H_k^2)$. Therefore, by exactly the same arguments used in Step (v) (b), we have
$$E|G_3|^{2k} \le C_k n^{-6k}\left\{n^{3k}\zeta_n^k H_2^{2k} + (\text{s.o.})\right\} = O\left(\zeta_n^k n^{-3k}H_2^{2k}\right) = O\left(\zeta_n^k n^{-k}(nH)^{-2k}\right). \qquad (5.32)$$
Comparing (5.32) with (5.31), we see that $G_3$ has an order smaller than that of $G_2$. Therefore, we have shown that $M_2 = G_1 + G_2 + G_3 = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

Next, we consider $M_1$. Defining $R_{ij} = (1/2)\left[u_j(g_j - g_i)w_i/\mu_{f,i}^2 + u_i(g_i - g_j)w_j/\mu_{f,j}^2\right]K_{ij}^2$ and $R_i = E[R_{ij}|x_i, u_i]$ (with $ER_i = 0$), then
$$M_1 = \frac{1}{n-1}\left\{\frac{2}{n}\sum_{i=1}^{n}R_i + \frac{2}{n(n-1)}\sum_i\sum_{j>i}[R_{ij} - R_i - R_j]\right\}.$$
Using Rosenthal's and Markov's inequalities, it can easily be shown that $M_1 = o_p(\zeta_n)$ uniformly in $(h,\lambda)$. Therefore, we have shown that $S_3^0 = M_1 + M_2 = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

Finally, we have, uniformly in $(h,\lambda)$,
$$S_3 - S_3^0 \le C\sup\left|\frac{\hat m_{1,-i}(x)}{\mu_f(x)}\right|\sup\left|\frac{\hat m_{2,-i}(x)}{\mu_f(x)}\right|\sup\left|\frac{\Delta_{f,-i}(x)}{\mu_f(x)}\right| = o_p\left(\zeta_n + (nH)^{-1}\right),$$
where the supremum is over $1 \le i \le n$, $x \in S$, and $(h,\lambda)$ as prescribed in (5.7). This completes the proof of Step (iv).

Step (v):
(a) Proof of (5.15) and (b) Proof of (5.18).

(a) Proof of (5.15). Let $J_2 = \frac{2}{n(n-1)}\sum_i\sum_{j>i}[Q_{ij} - Q_i - Q_j + EQ_1]$. Note that $J_2^{2k}$ contains $4k$ summations. Since $J_2$ is a degenerate U-statistic, the non-zero terms in $E(J_2^{2k})$ must have the property that each summation index is equal to at least one other summation index. Thus, the non-zero terms can contain at most $2k$ summations (each summation index is paired with one, and only one, other index); the next non-zero term contains $2k - 2$ summations, and so on, while the last non-zero term (having the smallest number of summations) has two summations. Also, it is easy to show that the $k$th moment of $Q_{ij}$ has the same order as the $k$th moment of $V_{ij} = \zeta_n^{1/2}(g_i - g_j)K_{ij}w_i/\mu_{f,i}$. Therefore, for integers $k \ge 2$, we have $E(Q_{ij}^k) = O(\zeta_n^{k/2}H_k)$. Hence, straightforward calculation shows that⁴
$$E|J_2|^{2k} \le n^{-4k}\zeta_n^k\left\{C_{k,1}n^{2k}H_2^k + C_{k,2}n^{2k-2}H_2^{k-3}H_3^2 + \cdots + n^2 H_{2k}\right\}. \qquad (5.33)$$
Noting that (5.33) contains a finite number of terms, after excluding the common factor $n^{-4k}\zeta_n^k$ we will show that $n^{2k}H_2^k$ is the leading term of (5.33). A typical term is of the form
$$\prod_{l=2}^{2k}n^{2t_l}H_l^{t_l} = n^{2(t_2 + \cdots + t_{2k})}H_2^{t_2}H_3^{t_3}\cdots H_{2k}^{t_{2k}}, \qquad (5.34)$$
with $t_l \ge 0$ and $\sum_{l=2}^{2k}l\,t_l = 2t_2 + 3t_3 + \cdots + (2k)t_{2k} = 2k$. We want to show that (5.34) has an order no larger than $n^{2k}H_2^k$, i.e., we want to show that
$$n^{2(t_2 + \cdots + t_{2k})}H_2^{t_2}H_3^{t_3}\cdots H_{2k}^{t_{2k}} \le n^{2k}H_2^k. \qquad (5.35)$$

⁴Note that there are different ways of pairing up the indices. For example, for the eight indices in $E[Q_{i_1,j_1}Q_{i_2,j_2}Q_{i_3,j_3}Q_{i_4,j_4}]$ to form four pairs, it could be (i) $E[Q_{i_1,j_1}^2 Q_{i_2,j_2}^2]$, or (ii) $E[Q_{i_1,j_1}Q_{j_1,i_2}Q_{i_2,j_2}Q_{j_2,i_1}]$ (or some other case). Nevertheless, (i) and (ii) have the same order of magnitude. Therefore, we restrict attention to case (i) for expositional simplicity.

Noting that $H_k$ is bounded below and above by constant multiples of $H^{-(k-1)}$ (see (5.9)), we may replace $H_k$ by $H^{-(k-1)}$, and re-arranging terms shows that (5.35) is equivalent to
$$n^{2k - 2\sum_{l=2}^{2k}t_l}\,H^{k - \sum_{l=2}^{2k}t_l} = \left(n^2 H\right)^{k - \sum_{l=2}^{2k}t_l} \ge 1, \qquad (5.36)$$
which certainly holds true since $k - \sum_{l=2}^{2k}t_l \ge 0$ and $n^2 H \ge n \ge 1$ by (5.7); equality holds only when $t_2 = k$ and $t_l = 0$ for all $l = 3,\ldots,2k$.
(b) Proof of (5.18). Note that the $k$th moment of $Q_{ijl}$ has the same order of magnitude as the $k$th moment of the unsymmetrized version $V_{ijl} = (g_j - g_i)K_{ij}(g_l - g_i)K_{il}w_i/\mu_{f,i}^2$. Hence, $E[Q_{ijl}^k] = O(\zeta_n^{k/2}H_k^2)$, while
$$E|J_3|^{2k} \le n^{-6k}\zeta_n^k\left\{C_{k,1}n^{3k}H_2^{2k} + C_{k,2}n^{3(k-1)}H_2^{2(k-3)}H_3^4 + \cdots + n^3 H_{2k}^2\right\}. \qquad (5.37)$$
All terms inside the curly brackets of (5.37) have the form
$$\prod_{l=2}^{2k}n^{3t_l}H_l^{2t_l} = n^{3(t_2 + \cdots + t_{2k})}H_2^{2t_2}H_3^{2t_3}\cdots H_{2k}^{2t_{2k}}, \qquad (5.38)$$
with $t_l \ge 0$ and $\sum_{l=2}^{2k}l\,t_l = 2t_2 + 3t_3 + \cdots + (2k)t_{2k} = 2k$. We want to show that (5.38) has an order no larger than $n^{3k}H_2^{2k}$, i.e., we want to show that
$$n^{3(t_2 + \cdots + t_{2k})}H_2^{2t_2}H_3^{2t_3}\cdots H_{2k}^{2t_{2k}} \le n^{3k}H_2^{2k}. \qquad (5.39)$$
By exactly the same arguments as in the proof of (a) above, one can show that (5.39) is equivalent to
$$\left(n^3 H^2\right)^{k - \sum_{l=2}^{2k}t_l} \ge 1. \qquad (5.40)$$
This completes the proof of (5.18).
Step (vi): Here we show that $A_n \overset{\mathrm{def}}{=} n^{-1}\sum_i u_i(g_i - \hat g_{-i})w_i = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$ as prescribed in (5.7). Recall that $\Delta_{f,-i} = \hat f_{-i}(x_i) - E[\hat f_{-i}(x_i)|x_i]$, and note that
$$\frac{1}{\hat f_{-i}} = \frac{1}{\mu_{f,-i}} - \frac{\Delta_{f,-i}}{\mu_{f,-i}^2} + O_p\left(\Delta_{f,-i}^2\right), \qquad (5.41)$$
and define $A_{1n}$ and $A_{2n}$ by replacing $1/\hat f_{-i}$ in $A_n$ by $1/\mu_{f,i}$ and by $-\Delta_{f,-i}/\mu_{f,-i}^2$, respectively; i.e.,
$$A_{1n} = n^{-1}\sum_i u_i(g_i - \hat g_{-i})w_i \hat f_{-i}/\mu_f(x_i) = \frac{1}{n(n-1)}\sum_i\sum_{j\ne i}u_i(g_i - g_j)K_{ij}w_i/\mu_f(x_i), \qquad (5.42)$$
and
$$A_{2n} = n^{-1}\sum_i u_i(\hat g_{-i} - g_i)w_i \hat f_{-i}\Delta_{f,-i}/\mu_f^2(x_i). \qquad (5.43)$$
We will show that $A_{1n} + A_{2n} = o_p\left(\zeta_n + (nH)^{-1}\right)$ and that $A_n - A_{1n} - A_{2n} = o_p(\zeta_n)$ uniformly in $(h,\lambda)$. Define $V_{ij} = (1/2)\left[u_i(g_i - g_j)w_i/\mu_f(x_i) + u_j(g_j - g_i)w_j/\mu_f(x_j)\right]K_{ij}$, and let $V_i = E[V_{ij}|x_i, u_i]$ (so that $EV_i = 0$). Then
$$A_{1n} = \frac{2}{n}\sum_i V_i + \frac{2}{n(n-1)}\sum_i\sum_{j>i}[V_{ij} - V_i - V_j] \equiv A_{1n,1} + A_{1n,2}.$$
Note that the $k$th moment of $V_i$ has the same order as the $k$th moment of $\zeta_n^{1/2}u_i g_i w_i$, so, by Rosenthal's inequality, we have $E|A_{1n,1}|^{2k} \le C_k n^{-2k}\left\{n^k\zeta_n^k + n\zeta_n^k\right\} = O(n^{-k}\zeta_n^k)$. By Markov's inequality, for $\delta \in (0, \epsilon/2)$ and for all $C > 0$, we have
$$P\left(|A_{1n,1}| > n^{-\delta}(nH)^{-1/2}\zeta_n^{1/2}\right) \le C_k n^{2\delta k}n^{-k}(nH)^k = O\left(n^{-(\epsilon-2\delta)k}\right) = O\left(n^{-C}\right)$$
because $nH < n^{1-\epsilon}$ by (5.7). From this one can further show that $A_{1n,1} = o_p\left((nH)^{-1/2}\zeta_n^{1/2}\right) = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

Similarly, by noting that $E(V_{ij}^2) = O(\zeta_n H_2)$, we have, by an argument similar to that for Step (v), that
$$E|A_{1n,2}|^{2k} \le C_k n^{-4k}\left\{n^{2k}\zeta_n^k H_2^k + (\text{s.o.})\right\} = O\left(n^{-2k}\zeta_n^k H_2^k\right) = O\left(n^{-k}\zeta_n^k(nH)^{-k}\right).$$
From this we know that $A_{1n,2} = o_p(A_{1n,1})$. Hence, $A_{1n} = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

For $A_{2n}$, note that $A_{2n} = \frac{1}{n(n-1)^2}\sum_i\sum_{j\ne i}\sum_{l\ne i}\varepsilon_{ijl}$, where $\varepsilon_{ijl} \equiv u_i(g_j - g_i)K_{ij}w_i\left[K_{il} - E(K_{il}|x_i)\right]/\mu_{f,i}^2$. Let $T_{ijl}$ be the symmetrized version of $\varepsilon_{ijl}$, and let $T_{ij} = E[T_{ijl}|z_i, z_j]$ with $z_i = (x_i, u_i)$. Noting that $E[\varepsilon_{ijl}|z_j, z_l] = E[\varepsilon_{ijl}|z_i, z_j] = 0$, it is easy to see that $E[T_{ijl}|z_i] = 0$, so that $T_i = E(T_{ij}|z_i) = 0$ and $ET_i = 0$. We have
$$A_{2n} = \frac{6}{n(n-1)}\sum_i\sum_{j>i}T_{ij} + \frac{6}{n(n-1)(n-2)}\sum\sum\sum_{l>j>i}\left[T_{ijl} - T_{ij} - T_{il} - T_{jl}\right] \equiv W_2 + W_3.$$
By noting that $E|T_{ij}|^k = O(\zeta_n^{k/2}H_k)$, then by the same proof as in Step (v) one can show that
$$E|W_2|^{2k} \le C_k n^{-4k}\left\{n^{2k}\zeta_n^k H_2^k + (\text{s.o.})\right\} = O\left(n^{-2k}\zeta_n^k H_2^k\right) = O\left(n^{-k}\zeta_n^k(nH)^{-k}\right).$$
Using this result it is easy to show that $W_2 = o_p\left((nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

By noting that $E[|T_{ijl}|^k] = O(\zeta_n^{k/2}H_k^2)$, then by the same proof as in Step (v) one can show that
$$E|W_3|^{2k} \le C_k n^{-6k}\left\{n^{3k}\zeta_n^k H_2^{2k} + (\text{s.o.})\right\} = O\left(n^{-3k}\zeta_n^k H_2^{2k}\right) = O\left(n^{-k}\zeta_n^k(nH)^{-2k}\right).$$
From this we know that $W_3$ has an order smaller than that of $W_2$. Hence, $A_{2n} = W_2 + W_3 = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$.

Summarizing the above, we have shown that $A_{1n} + A_{2n} = o_p\left(\zeta_n + (nH)^{-1}\right)$ uniformly in $(h,\lambda)$. Finally, using (5.41), we have
$$|A_n - A_{1n} - A_{2n}| \le C\sup_{1\le i\le n,\,\xi}\left|\frac{\hat m_{1,-i}(x)}{\mu_f(x)}\right|\left(\sup_{1\le i\le n,\,\xi}\left|\frac{\Delta_{f,-i}(x)}{\mu_f(x)}\right|\right)^2 = o_p\left(\zeta_n + (nH)^{-1}\right)$$
uniformly in $(h,\lambda)$, where $\xi = (x, h, \lambda)$.
Step (vii). Here we show that the cross-validated smoothing parameters associated with the relevant variables all converge to 0 in probability. In the derivations that follow, all the $O_p(\cdot)$ and $o_p(\cdot)$ quantities are uniform in $i = 1,\ldots,n$ and in $x$, $h$, and $\lambda$ satisfying (5.7). Since $E(u_i|\{x_j\}_{j=1}^{n}) = 0$, the only possible non-$o_p(1)$ term in $CV$ that is related to $(h,\lambda)$ is $S_1$ defined in (5.2). Moreover, it can be shown that $S_1 = G_2 + o_p(1)$, so the only possible non-$o_p(1)$ term in $S_1$ is
$$G_2 = \frac{1}{n(n-1)^2}\sum_i\sum_{j\ne i}\sum_{l\ne j,\, l\ne i}(g_i - g_j)K_{ij}(g_i - g_l)K_{il}w_i/\mu_{f,i}^2.$$
Furthermore, it is easy to show that $G_2 = E(G_2) + [G_2 - E(G_2)] = E(G_2) + o_p(1)$ uniformly in $(h,\lambda)$.

By the independence of $(\bar x_i, u_i)$ and $\tilde x_i$, we have (ignoring the difference between $(n-2)^{-1}$ and $(n-1)^{-1}$)
$$E(G_2) = \int\left(E\left[(\bar g(\bar x_i) - \bar g(\bar x))\bar K_{x_i,x}\right]\right)^2 w(x)\nu_1(x)^2\mu_f(x)^{-2}f(x)\,dx. \qquad (5.44)$$
Let $\bar K_{x_i,x} = \prod_{s=1}^{p_1}h_s^{-1}K\left(\frac{X_{is} - x_s}{h_s}\right)\prod_{s=1}^{q_1}\lambda_s^{I(X_{is}^d \ne x_s^d)}$ denote the kernel function related to the relevant variables. Letting $\bar\mu_f(\bar x) = E[\bar K_{x_i,x}]$, then $\mu_f(x) = \bar\mu_f(\bar x)\nu_1(x)$, and we have
$$E(G_2) = \int\left(E\left[(\bar g(\bar x) - \bar g(\bar x_i))\bar K_{x_i,x}\right]\right)^2\bar\mu_f(\bar x)^{-2}w(x)f(x)\,dx = \int\left[\bar g(\bar x) - \bar\mu_g(\bar x)\right]^2\bar w(\bar x)\bar f(\bar x)\,d\bar x,$$
where $\bar w(\bar x)$ is defined in (2.16). Hence, we have shown that
$$S_1 = \int\left[\bar g(\bar x) - \bar\mu_g(\bar x)\right]^2\bar w(\bar x)\bar f(\bar x)\,d\bar x + o_p(1). \qquad (5.45)$$
If the smoothing parameters $h_1,\ldots,h_{p_1},\lambda_1,\ldots,\lambda_{q_1}$, along with the remaining smoothing parameters, minimize $CV$ but do not all converge in probability to zero, then, by (2.11), $S_1$ does not converge to zero; that is, for some $\delta > 0$, the probability that the minimum of $S_1$ over the smoothing parameters exceeds $\delta$ does not converge to zero as $n \to \infty$.

However, choosing $h_1,\ldots,h_{p_1}$ to be of size $n^{-1/(p_1+4)}$ and $\lambda_1,\ldots,\lambda_{q_1}$ to be of size $n^{-2/(p_1+4)}$, letting $h_{p_1+1},\ldots,h_p$ diverge to infinity, and letting $\lambda_{q_1+1},\ldots,\lambda_q$ converge to 1, one can easily show that $S_1$ converges in probability to zero. This contradicts the result obtained in the previous paragraph, and thus demonstrates that at the minimum of $CV$ (the equivalent of $S_1$), the smoothing parameters $h_1,\ldots,h_{p_1},\lambda_1,\ldots,\lambda_{q_1}$ for the relevant components of $X$ all converge in probability to zero.

Lemma 5.1. Consider the following function of positive $z_1,\ldots,z_p$ and general variables $z_{p+1},\ldots,z_{p+q}$:
$$\chi(z_1,\ldots,z_{p+q}) = \int\left(\sum_{s=1}^{p+q}B_s(x)z_s\right)^2 dx + \frac{c_0}{(z_1\cdots z_p)^{1/2}} \qquad (5.46)$$
$$= z'Az + \frac{c_0}{(z_1\cdots z_p)^{1/2}}, \qquad (5.47)$$
where $z = (z_1,\ldots,z_{p+q})'$, $A = \int B(x)B(x)'\,dx$ is a $(p+q)\times(p+q)$ matrix with $B(x) = (B_1(x),\ldots,B_{p+q}(x))'$, and $c_0 > 0$ is a positive constant. Then, if $A$ is positive definite, $\chi(z_1,\ldots,z_{p+q})$ has a unique minimum, at a point where $z_1,\ldots,z_p$ are positive and finite and $z_{p+1},\ldots,z_{p+q}$ are nonnegative and finite.
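A small numerical illustration of Lemma 5.1 under hypothetical choices that are not tied to any model in the paper ($p = 2$, $q = 1$, $A = I$, $c_0 = 1$): the minimum is interior in the first two components, at $z_1 = z_2 = 4^{-1/3} \approx 0.63$, and sits on the boundary $z_3 = 0$ in the third:

```python
import math

# chi(z) = z'Az + c0 / sqrt(z1 * z2) with A = I (positive definite), c0 = 1
def chi(z1, z2, z3):
    return z1 ** 2 + z2 ** 2 + z3 ** 2 + 1.0 / math.sqrt(z1 * z2)

# grid search over z1, z2 > 0 and z3 >= 0
zc = [0.05 + 0.01 * i for i in range(146)]   # grid for the positive components
zd = [0.05 * i for i in range(21)]           # grid for the nonnegative component
best = min((chi(a, b, c), a, b, c) for a in zc for b in zc for c in zd)
_, b1, b2, b3 = best

# analytic minimum: z1 = z2 = 4**(-1/3) (interior), z3 = 0 (boundary)
assert abs(b1 - 4 ** (-1 / 3)) < 0.02
assert abs(b2 - 4 ** (-1 / 3)) < 0.02
assert b3 == 0.0
```

The blow-up of $c_0/(z_1\cdots z_p)^{1/2}$ as any $z_s \to 0$ ($s \le p$) keeps those components strictly positive, while components entering only the quadratic term may be driven to zero.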
References

Aitchison, J. and C.G.G. Aitken (1976), "Multivariate binary discrimination by the kernel method", Biometrika, 63, 413-420.

Clarke, R.M. (1975), "A calibration curve for radiocarbon dates", Antiquity, 49, 251-256.

Fahrmeir, L. and G. Tutz (1994), Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag.

Fan, J. and I. Gijbels (1995), "Data-driven bandwidth selection in local polynomial regression: variable bandwidth and spatial adaptation", Journal of the Royal Statistical Society, Series B, 57, 371-394.

Fan, Y. and Q. Li (1996), "Consistent model specification tests: omitted variables and semiparametric functional forms", Econometrica, 64, 865-890.

Grund, B. and P. Hall (1993), "On the performance of kernel estimators for high-dimensional sparse binary data", Journal of Multivariate Analysis, 44, 321-344.

Hahn, J. (1998), "On the role of the propensity score in efficient semiparametric estimation of average treatment effects", Econometrica, 66, 315-331.

Hall, P. (1981), "On nonparametric multivariate binary discrimination", Biometrika, 68, 287-294.

Hall, P., J. Racine and Q. Li (2004), "Cross-validation and the estimation of conditional probability densities", Journal of the American Statistical Association, 99, 1015-1026.

Hall, P. and M.P. Wand (1988), "On nonparametric discrimination using density differences", Biometrika, 75, 541-547.

Härdle, W., P. Hall, and J.S. Marron (1988), "How far are automatically chosen regression smoothing parameters from their optimum?", Journal of the American Statistical Association, 83, 86-99.

Härdle, W., P. Hall, and J.S. Marron (1992), "Regression smoothing parameters that are not far from their optimum", Journal of the American Statistical Association, 87, 227-233.

Härdle, W. and A. Bowman (1988), "Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bounds", Journal of the American Statistical Association, 83, 102-110.

Härdle, W. and J.S. Marron (1985), "Optimal bandwidth selection in nonparametric regression function estimation", The Annals of Statistics, 13, 1465-1481.

Hirano, K., G.W. Imbens and G. Ridder (2003), "Efficient estimation of average treatment effects using the estimated propensity score", Econometrica, 71, 1161-1189.

Lavergne, P. and Q. Vuong (1996), "Nonparametric selection of regressors: the nonnested case", Econometrica, 64, 207-219.

Lavergne, P. and Q. Vuong (2000), "Nonparametric significance testing", Econometric Theory, 16, 576-601.

Li, Q. and J. Racine (2005), Nonparametric Econometrics: Theory and Practice. Princeton, NJ: Princeton University Press, forthcoming.

Racine, J.S. (1997), "Consistent significance testing for nonparametric regression", Journal of Business and Economic Statistics, 15, 369-379.

Racine, J. and Q. Li (2004), "Nonparametric estimation of regression functions with both categorical and continuous data", Journal of Econometrics, 119, 99-130.

Scott, D.W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley and Sons.

Simonoff, J.S. (1996), Smoothing Methods in Statistics. New York: Springer.

Tutz, G. (1991), "Sequential models in categorical regression", Computational Statistics & Data Analysis, 11, 275-295.

Western, B. (1996), "Vague theory and model uncertainty in macrosociology", Sociological Methodology, 26, 165-192.