EDGEWORTH APPROXIMATIONS FOR SEMIPARAMETRIC INSTRUMENTAL VARIABLE ESTIMATORS AND TEST STATISTICS

Oliver Linton
Department of Economics, Yale University

August 28, 1998

Abstract

We establish the validity of higher order asymptotic expansions to the distribution of a version of the nonlinear semiparametric instrumental variable estimator considered in Newey (1990) as well as to the distribution of a Wald statistic derived from it. We employ local polynomial smoothing with variable bandwidth, which includes local linear, kernel, and [a version of] nearest neighbor estimates as special cases. Our expansions are valid to order n^{-2ε} for some 0 < ε < 1/2, where ε depends on the smoothness and dimensionality of the data distribution and on the order of the polynomial chosen by the practitioner. We use the expansions to define optimal bandwidth selection methods for both estimation and testing problems and apply our methods to simulated data.

Keywords: Bandwidth Choice; Edgeworth Approximation; Instrumental Variable; Kernel Estimation; Second Order; Semiparametric.
Address correspondence to: Oliver Linton, Cowles Foundation for Research in Economics, Yale University, Yale Station Box 208281, New Haven, CT 06520-8281, USA. Phone: (203) 432-3699. Fax: (203) 432-6167. E-mail address: [email protected]; http://www.econ.yale.edu/linton.
1 Introduction

Instrumental variables and the related Generalized Method of Moments estimation procedures are widely used and taught in econometrics courses because of their widespread application and conceptual appeal. In recent years these methods have been viewed as being semiparametric, in the sense that the joint distribution of the data is unspecified apart from a finite number of moment conditions. Frequently, this information is in the form of conditional moments, which implies, generally, an infinite number of unconditional moment conditions, although a certain finite dimensional combination of them gives full efficiency; see Chamberlain (1987) and Newey (1986, 1988, 1990). The efficient instrument function involves an unknown conditional expectation. Therefore, to fully exploit the given moment conditions, it is necessary to use nonparametric regression techniques to estimate the optimal instruments. Newey (1990) established the asymptotic properties of a semiparametric instrumental variable estimator θ̂ based on a nonparametric estimate ĝ [specifically, nearest neighbors and series estimates] of the optimal instrument function g. Under regularity conditions, he showed that n^{1/2}(θ̂ − θ_0) is asymptotically normal with zero mean and asymptotic variance Ω, where n is the sample size and Ω is a positive definite matrix; in fact, θ̂ is asymptotically equivalent to the procedure based on the true unknown optimal instrument function. There are many applications of this estimation procedure in sample selection and binary choice models as well as other microeconometric contexts. We have argued elsewhere, Linton (1995, 1996a,b), that the first-order asymptotics of semiparametric procedures can be misleading and unhelpful.¹ The limiting variance matrix Ω does not depend on the specific details of how ĝ is constructed, and thus sheds no light on how to implement this important part of the procedure. Specifically, bandwidth choice cannot be addressed by using the first-order theory alone.
Also, the relative merits of alternative first-order equivalent implementations, e.g., one-step procedures, cannot be determined by the first-order theory alone. Finally, to show when bootstrap methods can provide asymptotic refinements for asymptotically pivotal statistics requires some knowledge of higher-order properties; see Horowitz (1995). This motivates the study of higher-order expansions. Carroll and Härdle (1989) was to our knowledge the first published paper that developed second-order mean squared error expansions for a semiparametric, i.e., smoothing-based but root-n consistent, procedure, in the context of a heteroskedastic linear regression. Härdle,

¹ See also the Monte Carlo evidence presented in, for example, Hsieh and Manski (1987).
Hart, Marron, and Tsybakov (1992) developed expansions for scalar average derivatives, which were extended to the multivariate case [actually only the simpler situation of density-weighted average derivatives] by Härdle and Tsybakov (1993); these papers used the expansions to develop automatic bandwidth selection routines. This work was extended to the slightly more general case of density-weighted averages by Powell and Stoker (1996). Linton (1995, 1996a) developed similar expansions for the partially linear model and the heteroskedastic linear regression model and provided some results on the optimality of the bandwidth selection procedures proposed therein. Xiao and Phillips (1996) worked out the same approximations for a time series regression model with serial correlation of unknown form; Xiao and Linton (1997) give the analysis for Bickel's (1982) adaptive estimator in the linear regression model; Linton and Xiao (1997) work out the approximations for the nonlinear least squares and profile likelihood estimators in a semiparametric binary choice model. Robinson (1995) revisited the average derivative estimation problem, in particular the density-weighted version of this procedure, which is easier to handle, and proved the validity of the Berry-Esséen theorem for this estimator. This work was recently extended by Nishiyama and Robinson (1997) to include a proof of the validity of an Edgeworth expansion for the same estimator. These last two works have contributed greatly to the rigour of the analysis, albeit in a simple setting. They also point out that under certain circumstances the dominant correction effect on the distribution of the estimator is of order n^{-1/2}, as in parametric situations, and unrelated to the smoothing operation itself.

This point has also been made by Liang and Cheng (1993) in their study of the partially linear regression model; in fact, they take the analysis one step further and argue that the order n^{-1/2} term is optimal in a certain sense. See also Liang (1995) for similar results from the point of view of Bahadur efficiency. Some other important developments include work by Horowitz (1996a,b), who investigates the bootstrap in non- and semiparametric models with a view to determining bandwidth and providing asymptotic refinements. In this paper, we develop second-order approximations for an implicitly defined semiparametric instrumental variable estimator similar to that considered by Newey (1990). Previous work by the author (Linton, 1995, 1996a) developed asymptotic expansions for the cumulants of standardized semiparametric estimators in the heteroskedastic linear regression and the partially linear model. In terms of the asymptotic mean squared error, these are of the form Ω + n^{-2ε}Ω_c, where Ω_c is a positive definite matrix and 0 < ε < 1/2. The correction term is of larger magnitude than in parametric procedures and reflects the method used to estimate g. These estimators are both explicitly defined. Furthermore, a restrictive fixed design assumption was maintained. In this paper we assume the more general random design and derive the second-order properties of the implicitly defined instrumental variable estimator. We work with standard although somewhat strong assumptions relative to those required for the first-order results. We calculate approximations to the first four cumulants [valid to order n^{-2ε}] and prove that the formal Edgeworth approximation based on them provides a valid approximation to the distribution of the standardized estimator correct to the same order of magnitude. As in Nishiyama and Robinson (1997), the order n^{-1/2} term, due to bias and skewness, can dominate in the distributional approximation, while the smoothing-based terms affect only the variance, which is, under some restrictions, of a smaller magnitude. We use the distributional approximation to define an optimal bandwidth for a general class of criteria, and develop practical bandwidth selection methods. We also examine a Wald test statistic of general nonlinear restrictions, providing the Edgeworth approximation to its distribution under the null hypothesis. The second-order properties of the test depend on which standard error and which bandwidths are used. By using less smoothing when estimating the standard error matrix, we obtain better asymptotic performance in terms of null rejection frequency. In this case, the size distortion [the difference between the nominal and actual null rejection frequency] of the test is of the same magnitude, i.e., order n^{-2ε}, as the variance of the estimator. We define an optimal bandwidth in this context as one that minimizes the second-order size distortion and then use our expansions to suggest a feasible bandwidth selection method based on this notion. The optimal bandwidth is of a similar form to that derived in the estimation problem and does not depend on the level of the test. This paper is organized as follows. In section 2 we define the model of interest and introduce the estimators.
In section 3 we derive the higher-order asymptotic properties of the estimator. In section 4 we examine standard errors and a Wald statistic. In section 5 we discuss optimality and bandwidth choice for estimation. In section 6 we present some simulation results. All derivations are given in the appendix.

For any vectors x = (x_1,…,x_d)' and a = (a_1,…,a_d)', define |x| = Σ_{j=1}^d x_j, x! = x_1! ⋯ x_d!, and x^a = x_1^{a_1} ⋯ x_d^{a_d}; also let

(∂^{|a|}g/∂x^a)(x) = ∂^{|a|}g(x) / (∂x_1^{a_1} ⋯ ∂x_d^{a_d}).
Let Φ_{μ,V} [φ_{μ,V}] and F_{p,∞} [f_{p,∞}] be the distribution functions [densities] of a N(μ, V) and a χ²_p random variable respectively, and let ||A|| = tr^{1/2}(A'A) be the Euclidean norm of any p × m matrix A.
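For concreteness, the multi-index conventions above can be written out in code; a minimal sketch (the function names are ours, not the paper's):

```python
from itertools import product
from math import factorial

def multi_indices(q, d):
    """All d-tuples a of non-negative integers with |a| = a_1 + ... + a_d <= q."""
    return [a for a in product(range(q + 1), repeat=d) if sum(a) <= q]

def mi_norm(a):
    """|a| = sum of the components of the multi-index a."""
    return sum(a)

def mi_factorial(a):
    """a! = a_1! * ... * a_d!."""
    out = 1
    for aj in a:
        out *= factorial(aj)
    return out

def mi_power(x, a):
    """x^a = x_1^{a_1} * ... * x_d^{a_d}."""
    out = 1.0
    for xj, aj in zip(x, a):
        out *= xj ** aj
    return out
```

For example, the number of multi-indices with |a| ≤ q in d dimensions equals t(q, d) = Σ_{ℓ=0}^q (ℓ+d−1 choose d−1), the parameter count used in the local polynomial construction of section 2; with q = d = 2 this gives 6.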
2 Estimator and Assumptions

The observed data {Z_i}_{i=1}^n, where Z_i = (Y_i, X_i), are independent and identically distributed and drawn from a population we denote by the random variable Z = (Y, X). We suppose that there is a unique θ_0 ∈ R^p satisfying the conditional moment conditions E{ρ(Z_i, θ_0) | X_i} = 0 with probability one, where ρ(z, θ) is an m × 1 vector of functions. This implies the unconditional moment conditions

E{A(X_i)ρ(Z_i, θ_0)} = 0, i = 1,…,n, (1)

for any p × m matrix A(X_i) [for which the expectation exists]. The sample version of (1) is the basis of estimation as described in many previous papers. Our attention will be directed to the homoskedastic case in which with probability one
Ω = E{ρ(Z_i, θ_0)ρ(Z_i, θ_0)' | X_i}, (2)

although in Linton (1997) we extend some of our results to the heteroskedastic case. When (2) is true, the optimal weighting matrix is proportional to D(X_i)Ω^{-1}, where D(X_i) = D(X_i, θ_0) with D(X_i, θ) = E{∂ρ(Z_i, θ)/∂θ' | X_i}, in which case the asymptotic variance of the standardized estimator is J^{-1} with J = E{D(X)Ω^{-1}D(X)'}; see Newey (1990). The optimal instrument function D(·) is assumed to be of unknown form, but is estimable. We shall first obtain preliminary root-n consistent estimates of θ, say θ̃, which can be obtained as solutions of Σ_{i=1}^n A(X_i)ρ(Z_i, θ̃) = 0 for any known function A(·); see Amemiya (1974). We then consider estimators θ̂ of θ that solve the p equations

ŝ(θ̂) = (1/n) Σ_{i=1}^n D̃(X_i; θ̃)Ω̃^{-1}ρ(Z_i; θ̂) = 0, (3)

where Ω̃ = n^{-1} Σ_{i=1}^n ρ(Z_i; θ̃)ρ(Z_i; θ̃)' and

D̃(X_i; θ̃) = Σ_{j≠i} w_ij ∂ρ(Z_j; θ̃)/∂θ' (4)

is an estimate of the optimal instrument, where w_ij are nonparametric smoother weights. Our results will be proven for a general class of variable bandwidth local polynomial weights, which we now introduce. For a scalar dependent variable {τ_i}_{i=1}^n, let the parameter vector β̂(X_i) minimize the criterion function
Pr[ W > χ²_{p1}(α) | H_0 ] = α + o(1),  Pr[ W > χ²_{p1}(α) | H_A ] = 1 + o(1), (14)
where χ²_p(α) denotes the α'th critical value of the chi-squared(p) random variable χ²_p. In fact, κ_j(W) = κ_j(χ²_{p1}) + O(n^{-1}), j = 1,…, and furthermore, the errors in (14) are actually O(n^{-1}). Phillips and Park (1988) compute explicitly the order n^{-1} correction terms in some special cases. Now consider the semiparametric Wald statistic

Ŵ = n ĝ'{Ĝ Ĵ^{-1} Ĝ'}^{-1} ĝ,

where ĝ = g(θ̂) and Ĝ = G(θ̂), while Ĵ is any consistent estimate of the asymptotic covariance matrix J. Under our regularity conditions, Ŵ and W are first-order equivalent under the null hypothesis, but obviously not second-order equivalent. Our expansions reveal that if a single bandwidth magnitude h_ni = O(n^{-1/(2q+d)}) is used in estimating θ and J, then the bias from estimating J contributes a large second order effect to Ŵ. To avoid this, we suppose that a second bandwidth h̄_ni is used in Ĵ, i.e., Ĵ(h̄_ni), where h̄_ni is smaller in order than h_ni.⁵ In this case, Ŵ has a second order effect which is the same magnitude as in estimation, and is indeed smaller than can be achieved when a single bandwidth is used throughout.

Theorem 2. Suppose that the regularity conditions A1-A7 hold, that the function g is four times
continuously differentiable in a neighborhood of θ_0, and that the matrix Q = G_0'{G_0 J^{-1} G_0'}^{-1} G_0, where G_0 = G(θ_0), is of full rank. Suppose that max_{i≤n} h̄_ni = o(n^{-2/(2q+d)}). Then, conditional on X^n with probability one, we have

Pr[ Ŵ ≤ x | H_0 ] = F_{p1,∞}( x − (x/p_1) tr(Λ̄_n) ) + o(n^{-2ε}), (15)

where Λ̄_n = QΛ_n with Λ_n as defined below Theorem 1.

⁵ Note that an optimal bandwidth, according to the mean squared error of Ĵ, balances the O(h̄^{2q}) term with the O(n^{-2}h̄^{-d}) term, which requires bandwidth h̄_ni = O(n^{-2/(2q+d)}), which is smaller than the bandwidth used in estimation.

Remarks.
1. The actual null rejection frequency of the test based on the asymptotic critical values χ²_{p1}(α) is α + O(n^{-2ε}), where the O(n^{-2ε}) term depends on the bandwidth constant through the matrix Λ̄_n. Since Λ̄_n is positive semi-definite, the test based on the asymptotic critical values will tend to over-reject.

2. When a single bandwidth is used throughout, i.e., h̄_ni = h_ni, then the order of magnitude of the correction term in (15) is larger; specifically it is of order n^{-q/(q+d)} when h_ni = O(n^{-1/(q+d)}). Furthermore, the direction of the effect could take either sign.

One use of Theorem 2 is to define a method of `size correction' which produces size distortion of magnitude o(n^{-2ε}). Define size-adjusted critical values

c_SA = χ²_{p1}(α) + (χ²_{p1}(α)/p_1) tr(Λ̂_n),

where Λ̂_n is an estimate of Λ̄_n. Then, the test based on c_SA has null rejection frequency α + o(n^{-2ε}), provided that for all δ > 0,

Pr[ n^{2ε} |tr(Λ̂_n − Λ̄_n)| > δ ] = o(n^{-2ε}),

extending the argument of Abramowitch and Singh (1985, Theorem 1). In our view this explicit size correction is not so attractive, and similar objectives could be achieved by using a higher order polynomial in the first place. We think it makes more sense to fix a particular `level' of smoothing and then try to optimize within that class, to which problem we now turn.
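The size correction is straightforward to operationalize. A minimal sketch, using the Wilson-Hilferty approximation to the χ² quantile so as to stay within the standard library; the function names and the illustrative values of p_1, α, and tr(Λ̂_n) are ours:

```python
from statistics import NormalDist

def chi2_quantile(alpha, df):
    """Upper-alpha critical value of chi-squared(df), Wilson-Hilferty approximation."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return df * (1.0 - 2.0 / (9.0 * df) + z * (2.0 / (9.0 * df)) ** 0.5) ** 3

def size_adjusted_cv(alpha, p1, tr_lambda_hat):
    """c_SA = chi2_{p1}(alpha) + (chi2_{p1}(alpha)/p1) * tr(Lambda_hat_n)."""
    c = chi2_quantile(alpha, p1)
    return c + (c / p1) * tr_lambda_hat

cv = chi2_quantile(0.05, 1)            # ~3.75 (exact value 3.84; the approximation is weakest at df = 1)
c_sa = size_adjusted_cv(0.05, 1, 0.1)  # shifted upward to offset the over-rejection
```

The approximation improves quickly with the degrees of freedom; for df = 5 it gives about 11.04 against the exact 11.07. In practice one would plug in the estimate tr(Λ̂_n) from the feasible formulas of section 5.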
5 Second-Order Efficiency and Bandwidth Selection

We now suppose that the bandwidth is of the form h_n(x) = γη_0(x)n^{-1/(2q+d)} for a given function η_0(·) and define a general optimality criterion for deciding on a best choice of the constant γ for estimation and testing purposes. For estimation it seems natural to work with a scalar valued risk function ℓ that depends only on the asymptotic variance matrix of the estimator, i.e., ℓ(θ̂, θ_0) = ϱ(Σ_n), where ϱ is some smooth scalar valued function and

Σ_n(γ) = J^{-1} + Λ_n(γ) := J^{-1} + Λ_{n1}(γ) + Λ_{n2}(γ),

where Λ_{n1}(γ) = γ^{2q} J^{-1}Λ_{n1}(η_0)J^{-1} and Λ_{n2}(γ) = γ^{-d} J^{-1}Λ_{n2}(η_0)J^{-1}. For example, ϱ could be any of the widely used criteria for multivariate optimality such as the determinant or trace of the matrix or a particular quadratic form c'Σ_n c. When ϱ is linear, we can equivalently minimize ϱ(Λ_n(γ)) with respect to γ. Without loss of generality, we shall suppose that ϱ(J^{-1}) = 0. We next show how testing can be put in the same framework. Suppose that we define an optimal bandwidth for Ŵ as one that produces the smallest discrepancy between the actual null rejection frequency and the nominal level of the test; that is, we choose the bandwidth constant to solve the following problem
min_γ | Pr[ Ŵ > χ²_{p1}(α) | H_0 ] − α |. (16)

This criterion is not the only one that one might be interested in, but it is quite important in itself. Furthermore, it does at least prove to be tractable. This concentration on the null hypothesis is very similar to that adopted in the bootstrap literature; see Hall (1992, p. 222). By Theorem 2 and a Taylor expansion, we have

Pr[ Ŵ > χ²_{p1}(α) | H_0 ] − α = 1 − F_{p1,∞}( χ²_{p1}(α) − (χ²_{p1}(α)/p_1) tr(Λ̄_n) ) − α + o(n^{-2ε})
 = f_{p1,∞}(χ²_{p1}(α)) (χ²_{p1}(α)/p_1) tr(Λ̄_n) + o(n^{-2ε}).

Therefore, since χ²_{p1}(α) > 0 and f_{p1,∞}(χ²_{p1}(α)) > 0 always, the optimal bandwidth according to (16) equivalently minimizes tr(Λ̄_n(γ)). It is the same magnitude as the optimal bandwidth in the estimation problem; it depends on the restrictions through G_0, but does not depend on α. Note that we can write tr(Λ̄_n(γ)) as ϱ(Σ_n(γ)) for some function ϱ, which puts the estimation and testing problems within a common framework. We now return to the general formulation of the bandwidth selection problem. By Taylor expansion, we can approximate ϱ{Σ_n(γ)} to order n^{-2ε} by λ_n(γ), where
λ_n(γ) = γ^{2q} ϱ'ω_{n1} + γ^{-d} ϱ'ω_{n2}, (17)

in which ϱ = {∂ϱ/∂vech(Σ)}|_{Σ=J^{-1}}, while ω_{n1} = vech(J^{-1}Λ_{n1}(η_0)J^{-1}) and ω_{n2} = vech(J^{-1}Λ_{n2}(η_0)J^{-1}). The optimal bandwidth constant γ_opt minimizes ϱ{Σ_n(γ)} or (approximately) equivalently λ_n(γ), and is (approximately) of the specific form

h_opt(x) = ( dϱ'ω_{n2} / 2qϱ'ω_{n1} )^{1/(2q+d)} η_0(x)n^{-1/(2q+d)} := γ_opt η_0(x)n^{-1/(2q+d)}, (18)

which results in
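The closed form in (18) can be checked numerically: for λ_n(γ) = γ^{2q}a + γ^{-d}b with a = ϱ'ω_{n1} > 0 and b = ϱ'ω_{n2} > 0, the minimizer is γ_opt = (db/2qa)^{1/(2q+d)}. A sketch, with illustrative values of a, b, q, and d:

```python
def lam(gamma, a, b, q, d):
    """Second-order risk lambda_n(gamma) = gamma^{2q} * a + gamma^{-d} * b."""
    return gamma ** (2 * q) * a + gamma ** (-d) * b

def gamma_opt(a, b, q, d):
    """Closed-form minimizer (d*b / (2*q*a))^{1/(2q+d)} from the first-order condition."""
    return (d * b / (2 * q * a)) ** (1.0 / (2 * q + d))

# Check the closed form against a grid search with illustrative a, b and q = 2, d = 1.
a, b, q, d = 1.0, 0.5, 2, 1
g_star = gamma_opt(a, b, q, d)
grid = [0.01 * k for k in range(1, 301)]
g_grid = min(grid, key=lambda g: lam(g, a, b, q, d))
```

The first-order condition 2qaγ^{2q-1} = dbγ^{-d-1} is satisfied exactly at g_star, and the grid minimizer agrees to the grid resolution.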
ϱ{Λ_n(γ_opt)} = κ(d,q){ϱ'ω_{n2}}^{2q/(2q+d)}{ϱ'ω_{n1}}^{d/(2q+d)} + o(n^{-2ε}),

where κ(d,q) is a constant depending just on d and q.⁶ This is the best one can do for this class of smoothers indexed by γ.⁷ More broadly, though, one might want to compare methods with different η_0(·). It may be possible to find an optimal η_0(·) through calculus of variations techniques, but it will depend on the data distribution in a complicated fashion. It is easy to see that no particular η_0(·) method can dominate any other according to ϱ uniformly over the class of joint distributions D of the data Z_i, i = 1,…,n. One way of obtaining an unambiguous ranking might be through the minimax criterion, as in Fan (1993). However, it is to be expected that no method can be found to achieve this bound. In any case, this way of comparing estimators is of questionable value, since it makes nature seem unduly hostile.⁸

The formulae (17) and (18) can be used to select the bandwidth constant used in estimating θ̂ [or in constructing Ŵ]. Let γ̂_opt minimize λ̂_n(γ) with respect to γ ∈ Γ, where Γ = [γ_L, γ_U] with 0 < γ_L < γ_U < ∞ is a set of permitted bandwidth constants, while λ̂_n(γ) is an estimate of λ_n(γ). Then, let ĥ_opt(x) = γ̂_opt η_0(x)n^{-1/(2q+d)}. Provided sup_{γ∈Γ} |λ̂_n(γ) − λ_n(γ)| = o_p(n^{-2ε}), we have h_opt^{-1}(ĥ_opt − h_opt)(x) = o_p(1) uniformly in x. Under additional conditions we can establish the second order efficiency of the estimator θ̂ based on data-based bandwidth selection.

Theorem 3. Suppose that for some φ_1 and φ_2 with 0 < φ_1, φ_2, where φ_1 + φ_2 > 2ε + 1/2:

⁶ The optimal kernel is the Epanechnikov K_opt(u) = 0.75(1 − u²)1(|u| ≤ 1).
⁷ Thus the cross-validation method of bandwidth selection proposed in Newey (1990) at least achieves the right bandwidth rate, although the constants would generally be wrong.
⁸ Herman Chernoff [Bather (1997, p. 339)] gives the following example: If you have a choice between committing suicide or not taking action, in which case you might lead a normal life or you might die a horrible death, the minimax principle tells you to avoid any possibility [however remote] of a horrible death by committing suicide.
γ̂_opt − γ_opt = o_D(n^{-φ_1}; n^{-2ε}) (19)

sup_{|γ−γ_opt|≤n^{-φ_1}} |∂λ̂_n(γ)/∂γ| = o_D(n^{-φ_2}; n^{-2ε}). (20)

Then, the distribution of √n c'(θ̂(γ̂_opt) − θ_0) is the same as the distribution of √n c'(θ̂(γ_opt) − θ_0) to order n^{-2ε}; i.e., θ̂(γ̂_opt) is second order efficient.

Remark. A similar result holds for Ŵ(γ̂_opt). The condition (19) is similar to what we established for the fixed bandwidth estimator θ̂ in section 3, and can be shown [for φ_1 < 1/2] by using the same arguments under additional smoothness conditions. The condition (20) is a sort of distributional tightness which follows from the same sort of arguments used in establishing Theorem 1. See Linton (1995) for similar arguments.

There are many suitable estimates γ̂ of γ_opt; however, most of these involve the selection of a preliminary bandwidth, which we would like to avoid. We suggest a method that involves minimizing the estimated risk λ̂_n(γ). Specifically, let
Λ̂_{n1}(γ) = Ĵ^{-1} { (1/n) Σ_{i=1}^n B̃(X_i)Ω̃^{-1}B̃(X_i)' − [ (1/n) Σ_{i=1}^n D̃(X_i; θ̃)Ω̃^{-1}B̃(X_i)' ] Ĵ^{-1} [ (1/n) Σ_{i=1}^n B̃(X_i)Ω̃^{-1}D̃(X_i; θ̃)' ] } Ĵ^{-1} (21)

Λ̂_{n2}(γ) = Ĵ^{-1} (1/n) { ΣΣ_{j≠i} w_ij² ε̃_j Ω̃^{-1} ε̃_j' + ΣΣ_{j≠i} w_ij w_ji ε̃_j Ω̃^{-1} ẽ_i ẽ_j' Ω̃^{-1} ε̃_i' } Ĵ^{-1}, (22)

where ẽ_i = ρ(Z_i; θ̃), ε̃_j = ∂ρ(Z_j; θ̃)/∂θ' − D̃(X_j; θ̃), and Ĵ is any of the proposed estimates of J given above, while B̃(X_i) is an estimator of B_ni. Finally, let γ̂ minimize

λ̂_n(γ) = ϱ̂'ω̂_{n1}(γ) + ϱ̂'ω̂_{n2}(γ),

where ϱ̂ = {∂ϱ/∂vech(Σ)}|_{Σ=Ĵ^{-1}}, while ω̂_{n1}(γ) = vech(Λ̂_{n1}(γ)) and ω̂_{n2}(γ) = vech(Λ̂_{n2}(γ)).
The main issue arises with how to estimate B_ni, since all the other quantities can be computed from what we know already. The form of the conditional bias B_ni suggests a general method for estimating the bias, which is

B̃(X_i) = Σ_{j≠i} w_ij { D̂(X_j) − D̂(X_i) }, (23)

where the weights {w_ij} are precisely those used in (4), while D̂(X_j) is some estimate of D(X_j). By the triangle inequality, we have for any element α,

max_{1≤i≤n} | B̃_α(X_i) − B_{n,α}(X_i) | ≤ 2 max_{1≤i≤n} Σ_{j≠i} |w_ij| · max_{1≤i≤n} | D̂_α(X_i) − D_α(X_i) |, (24)

where Σ_{j≠i} |w_ij| = O(1) with probability one under our conditions, so that for B̃(X_i) to provide a good approximation to B_n(X_i) it will suffice if D̂(X_i) − D(X_i) is of smaller order than B_n(X_i) uniformly in i. For example, let D̂(X_i) = Σ_ℓ w̄_iℓ ∂ρ(Z_ℓ; θ̃)/∂θ', where the weights {w̄_iℓ} come from a local polynomial regression of order q + s for some s > 0, and have bandwidth sequence h̄_n(x) = η̄_0(x)n^{-1/(2q+2s+d)}. Provided D and f are smooth enough, i.e., satisfy A3 with q + s replacing q, (24) will be o_p(n^{-ε}). The `variance' terms in (21) like ΣΣ_{j≠i} w_ij² ε̃_j Ω̃^{-1} ε̃_j'/n well approximate their estimands, as can be shown by standard arguments for U-statistics; see Fan and Li (1996). In the testing case, a feasible bandwidth selection method can be based on replacing Λ_n(γ) by the estimate Ĝ Λ̂_n(γ) Ĝ'{Ĝ Ĵ^{-1} Ĝ'}^{-1} and proceeding as above. We investigate this below.
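The bias estimate (23) is simple to compute once the smoother weights are in hand. A sketch for the simplest case [d = 1, scalar D, leave-one-out Nadaraya-Watson weights standing in for the local polynomial weights of (4); the design points and the pilot estimate D̂ are illustrative]:

```python
import math

def nw_weights(x, i, h):
    """Leave-one-out Nadaraya-Watson weights w_ij with a Gaussian kernel (d = 1)."""
    ks = [0.0 if j == i else math.exp(-0.5 * ((x[i] - x[j]) / h) ** 2) for j in range(len(x))]
    s = sum(ks)
    return [k / s for k in ks]

def bias_estimate(x, d_hat, h):
    """B~(X_i) = sum_{j != i} w_ij * (D^(X_j) - D^(X_i)), cf. equation (23)."""
    out = []
    for i in range(len(x)):
        w = nw_weights(x, i, h)
        out.append(sum(w[j] * (d_hat[j] - d_hat[i]) for j in range(len(x))))
    return out

x = [i / 100.0 - 1.0 for i in range(201)]   # design points on [-1, 1]
d_hat = [xi ** 2 for xi in x]               # illustrative pilot estimate of D
b = bias_estimate(x, d_hat, h=0.2)          # b[100] is the estimated bias at x = 0
```

For a convex pilot function such as this one, the estimated smoothing bias at an interior point is positive, as the usual bias expansion predicts.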
6 Monte Carlo Experiment

We evaluated our second-order approximations and the bandwidth selection procedure on the sample selection model given below. We report here only results about testing; see Linton (1997) for results on estimation.
Y_i = c + β_00 s_i + Σ_{j=1}^4 β_{j0} X_{ji} + ε_i,  s_i = 1(γ_10 + γ_20 X_{0i} + η_i > 0),

(ε_i, η_i)' ~ N( (0, 0)', [1 0.7; 0.7 1] ),

where (X_{0i}, X_{1i},…,X_{4i}) are mutually independent with marginal distributions standard normal truncated [at ±3]. We take γ_10 = γ_20 = 1 throughout. We computed the Wald statistics for testing the two hypotheses: (a) H_0: β_00 = 0 [in which case we set β_00 = 0 and β_{j0} = 1, j = 1,…,4]; (b) H_0: β_{j0} = 0, j = 1,…,4 [in which case we set β_00 = 1 and β_{j0} = 0, j = 1,…,4]. We shall consider the constant design [in which c = 1], and the no-constant design [in which c = 0 and is not estimated]. In total this makes four designs. We suppose that it is known that selection is determined only by X_0, in which case the optimal instrument for s is τ(x_0) = Pr[s = 1 | X_0 = x_0]. We estimate this function by the Nadaraya-Watson estimator and the local linear method using the following bi-quadratic kernel function: K(u) = 0.9375(1 − u²)²1(|u| ≤ 1), with a constant bandwidth h_n = γn^{-1/5} when used in computing θ̂ and Ŵ, and h̄_n = γ̄n^{-1/5} when used in computing Ĵ in Ŵ. We computed the test statistic at a grid of fixed bandwidths with γ in the range [γ_min, γ_max], taking γ̄ = 0.9γ_min throughout. We also computed the test statistics using our data-based bandwidth selection method to determine γ. Specifically, we estimated the bias by (23) using two different methods to compute D̂(·), which in our case is τ̂(·): nonparametric plug-in and rule-of-thumb plug-in. In the former case, we took τ̃(·) as our estimate of τ(x) in (23). In the latter case, we take a parametric specification {p(·; ϑ), ϑ ∈ Θ} for τ(·), and compute estimates ϑ̃ of ϑ by maximum likelihood. We then take p(·; ϑ̃) in place of D̂(·) in (23) and hence (21).⁹ We take p to be either a quadratic function or a logit with linear index.¹⁰

Throughout, we examine the conditional distribution of Ŵ given X_1,…,X_n. The results of 20,000 replications are given in Figures 1-4 below. First, there is quite substantial over-rejection which decreases with sample size for each bandwidth value. There is some variation in the bandwidth effect across designs. While in Figure 1 there is a substantial effect, in Figures 2 and 4, where there is no constant, the rejection frequency is much less sensitive to bandwidth.
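The design above can be simulated directly; a sketch under the stated design [the rejection sampler for the truncation and all helper names are ours]:

```python
import math
import random

def truncated_normal(rng, bound=3.0):
    """Standard normal truncated at +/- bound, by rejection sampling."""
    while True:
        v = rng.gauss(0.0, 1.0)
        if abs(v) <= bound:
            return v

def simulate_design(n, c=1.0, beta0=0.0, betas=(1.0, 1.0, 1.0, 1.0), rho=0.7, seed=0):
    """One sample: Y = c + beta0*s + sum_j betas[j]*X_j + eps,
    s = 1(1 + X_0 + eta > 0), corr(eps, eta) = rho, unit variances."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [truncated_normal(rng) for _ in range(5)]            # X_0, ..., X_4
        eps = rng.gauss(0.0, 1.0)
        eta = rho * eps + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        s = 1 if 1.0 + x[0] + eta > 0 else 0
        y = c + beta0 * s + sum(bj * xj for bj, xj in zip(betas, x[1:])) + eps
        data.append((y, s, x))
    return data

sample = simulate_design(500)
share_selected = sum(s for _, s, _ in sample) / len(sample)
```

The construction eta = ρ·eps + (1 − ρ²)^{1/2}·z gives (ε, η) jointly normal with unit variances and correlation ρ = 0.7, as in the design.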
Figure 3 is somewhat intermediate, but still exhibits less dependence on bandwidth. The main reason for this lack of sensitivity is that in these cases the matrix J^{-1} is approximately diagonal. Since the bandwidth effects only come in from the estimation of τ(X_{0i}), i.e., the matrix Λ_n is zero except for the corresponding diagonal element, our theory does predict that the second order matrix should be zero or approximately zero in this case. We now turn to Figure 1. The local linear fixed bandwidth results show a poorer performance at small bandwidths than the kernel method, while at larger bandwidths the local linear does better.

⁹ Under suitable conditions, this method provides second-order optimality for the resulting θ̂ when D(·) is as specified, and the right magnitude for h more generally. This method was proposed in Silverman (1986) and further exploited in Andrews (1991) in other contexts.
¹⁰ None of the methods we examine is guaranteed to give consistent estimates of the optimal bandwidth, although they do give the right order of magnitude for the bandwidth.
The automatic bandwidth selection methods generally do pretty well, being in some cases slightly better than the best fixed bandwidth and in some cases slightly worse. There is not much to choose amongst the different bandwidth selection methods, although the nonparametric method appears to do the best and the quadratic rule-of-thumb method does the worst. Figure 5 compares the distribution of the Wald statistic with the corresponding chi-squared distribution; the distributions are quite close.

***Figures Here***
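The instrument estimate τ̂(x_0) used above [a Nadaraya-Watson smooth of s on X_0 with the bi-quadratic kernel] takes only a few lines; a sketch with illustrative data:

```python
def biquadratic(u):
    """K(u) = 0.9375 * (1 - u^2)^2 on |u| <= 1, zero outside; integrates to one."""
    return 0.9375 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def nw_probability(x0_data, s_data, x0, h):
    """Nadaraya-Watson estimate of tau(x0) = Pr[s = 1 | X_0 = x0]."""
    weights = [biquadratic((x0 - xj) / h) for xj in x0_data]
    total = sum(weights)
    if total == 0.0:
        return float("nan")
    return sum(w * sj for w, sj in zip(weights, s_data)) / total

# Illustrative check with s deterministic in X_0 (s = 1 iff X_0 > 0).
xs = [i / 50.0 - 1.0 for i in range(101)]     # grid on [-1, 1]
ss = [1 if xv > 0 else 0 for xv in xs]
tau_hat = nw_probability(xs, ss, 0.8, h=0.3)  # deep inside the s = 1 region
```

Being a convex combination of the s_j, the estimate always lies in [0, 1], which is one practical advantage of Nadaraya-Watson over local linear for estimating a probability.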
7 Concluding Remarks

In concluding, we mention some extensions of this paper. Our theory is restricted to the case where the marginal density of the covariates is bounded away from zero on its compact support. When this assumption is violated, a totally new theory is necessary, which would combine extreme value theory with the methods we use here. This remains to be addressed in future work. Finally, we would like to point out that the naive bootstrap, which draws samples {Y_i, X_i}_{i=1}^n, although first order correct, will fail to capture the second order effects. Specifically, this bootstrap will fail to capture the bias-related terms, as it does in nonparametric density and regression problems; see Härdle and Marron (1991). In order to provide a good approximation at the optimal bandwidth it is necessary to use a more complicated bootstrap algorithm, which is under investigation in joint work with J. Horowitz.
A Appendix

The appendices are organized as follows. In section A.1 we give some background results on kernel regression estimation. In section A.2 we establish an approximation for the first three derivatives of the score function. In appendix B we present the proofs of our theorems. In appendix C we give the proofs of six lemmas given in Appendix A. Throughout, let k denote a positive finite constant which can be different from expression to expression.
A.1 Nonparametric Preliminaries
Let t(q,d) = Σ_{ℓ=0}^q t_ℓ denote the total number of parameters in the vector β(X_i), where t_ℓ = (ℓ+d−1 choose d−1) is the total number of distinct partial derivatives of order ℓ. We will assume that these partial derivatives have been arranged in some order, which will be the same throughout the sequel. Define the real symmetric t × t matrices

M_ni = [ M_{ni,0,0} M_{ni,0,1} ⋯ M_{ni,0,q} ; M_{ni,1,0} M_{ni,1,1} ⋯ ; ⋮ ⋱ ; M_{ni,q,0} M_{ni,q,1} ⋯ M_{ni,q,q} ],  M_i = f(X_i) [ M_{0,0} M_{0,1} ⋯ M_{0,q} ; M_{1,0} M_{1,1} ⋯ ; ⋮ ⋱ ; M_{q,0} M_{q,1} ⋯ M_{q,q} ],

in which each sub-matrix M_{ni,ℓ,k} [and M_{ℓ,k}] has dimensions t_ℓ × t_k. The typical element of the sub-matrix M_{ni,ℓ,k} is m_{n,a+b}(X_i), where |a| = ℓ and |b| = k, in which for each d-tuple a = (a_1,…,a_d)

m_{n,a}(X_i) = (1/(n h_ni^d)) Σ_{j≠i} ((X_i − X_j)/h_ni)^a K((X_i − X_j)/h_ni),

while the typical element in M_{ℓ,k} is μ_{a+b} = ∫ u^{a+b} K(u) du. Define the local polynomial weights as

w_ij = Σ_{a:|a|≤q} w_ij^a,  w_ij^a = (1/(n h_ni^d)) e_1' M_ni^{-1} e_{ℓ(a)} ((X_i − X_j)/h_ni)^a K((X_i − X_j)/h_ni),

where for each vector a with |a| ≤ q, ℓ(a) is its position in the t × 1 vector β(X_i), and e_j = (0,…,1,…,0)' is the j-th elementary vector. Define also the t × t matrix M̄_ni = E(M_ni | X_i). We make use of the following lemmas.

Lemma 1. For all α, β = 1,…,p, with probability one,

max_{1≤i≤n} || M_ni − M̄_ni || = O( (log n)^{1/2}/(nh^d)^{1/2} ) (25)

max_{1≤i≤n} || M̄_ni − M_i || = O(h) (26)

max_{1≤i≤n} | D̂_{αβ}(X_i) − D_{αβ}(X_i) | = O(h^q) + O( (log n)^{1/2}/(nh^d)^{1/2} ). (27)
Lemma 2. Suppose that the conditions A1-A5 hold except that we only require finite moments of order (1 + δ)/δ. Then, for some finite k,

Pr[ max_{1≤i≤n} |(M_ni − M_i)_{αβ}| > k( log n/(nh^d) )^{1/2} ] = o(n^{-2ε}) (28)

Pr[ max_{1≤i≤n} |(B_ni)_α| > k h^q ] = o(n^{-2ε}) (29)

Pr[ max_{1≤i≤n} |(V_ni)_α| > k( log n/(nh^d) )^{1/2} ] = o(n^{-2ε}). (30)

This says that the corresponding random sequence is o_D(n^{-(ε−δ)}; n^{-2ε}) for any δ > 0.

Lemma 3. With probability one, as n → ∞,

max_{1≤i≤n} #{j : w_ij ≠ 0} = O(n^{2ε}) (31)

max_{1≤i,j≤n} |w_ij| = O(n^{-2ε}) (32)

max_{1≤i≤n} Σ_{j≠i} |w_ij| = O(1). (33)
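The weight properties in Lemma 3 can be checked numerically in the local linear case [q = 1, d = 1], where the leave-one-out weights have the closed form w_ij = K_ij{S_2 − (X_j − X_i)S_1}/(S_0S_2 − S_1²) with S_r = Σ_{j≠i} K_ij (X_j − X_i)^r. A sketch with an Epanechnikov kernel and illustrative data:

```python
def epanechnikov(u):
    """K(u) = 0.75 * (1 - u^2) on |u| <= 1, zero outside (compact support)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def local_linear_weights(x, i, h):
    """Leave-one-out local linear weights (q = 1, d = 1):
    w_ij = K_ij * (S2 - (x_j - x_i) * S1) / (S0 * S2 - S1^2)."""
    k = [0.0 if j == i else epanechnikov((x[j] - x[i]) / h) for j in range(len(x))]
    s0 = sum(k)
    s1 = sum(kj * (x[j] - x[i]) for j, kj in enumerate(k))
    s2 = sum(kj * (x[j] - x[i]) ** 2 for j, kj in enumerate(k))
    det = s0 * s2 - s1 * s1
    return [kj * (s2 - (x[j] - x[i]) * s1) / det for j, kj in enumerate(k)]

x = [i / 200.0 for i in range(201)]        # equispaced design on [0, 1]
w = local_linear_weights(x, 100, h=0.1)    # weights at the interior point x = 0.5
nonzero = sum(1 for wj in w if wj != 0.0)
```

These weights reproduce constants (Σ_j w_ij = 1) and linear functions (Σ_j w_ij(X_j − X_i) = 0); with a compactly supported kernel only O(nh) of them are non-zero, and their absolute sum stays bounded, in line with (31)-(33).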
A.2 Standardized Criterion Derivatives

Note that there exists a matrix of random variables ξ(Z_i) whose elements have mean zero [with probability one conditional on X_i] and finite second moments such that

Ω̃ − Ω = (1/n) Σ_{i=1}^n ξ(Z_i) + o_D(n^{-1/2-2ε}), (34)

by Taylor expansion and Assumptions A2 and A7.
Lemma 4. There exist random vectors S_j, j = 0,…,4, with E[||S_j||^v] < ∞ for v = 1, 2,… and j = 0,…,4, such that

√n ŝ(θ_0) = Σ_{j=0}^4 S_j + o_D(n^{-2ε}),

where

S_0 = (1/√n) Σ_{i=1}^n D(X_i)Ω^{-1}ρ(Z_i; θ_0) = O_p(1)

S_1 = −(1/n) Σ_{i=1}^n D(X_i)Ω^{-1} { (1/√n) Σ_{l=1}^n ξ(Z_l) } Ω^{-1} ρ(Z_i; θ_0) = O_p(n^{-1/2})

S_2 = Σ_{α=1}^p { (1/n) Σ_{j=1}^n ψ_α(Z_j) } (1/√n) Σ_{i=1}^n F_α(X_i)Ω^{-1}ρ(Z_i; θ_0) = O_p(n^{-1/2})

S_3 = (1/√n) Σ_{i=1}^n B_ni Ω^{-1} ρ(Z_i; θ_0) = O_p(h^q)

S_4 = (1/√n) Σ_{i=1}^n V_ni Ω^{-1} ρ(Z_i; θ_0) = O_p(n^{-1/2}h^{-d/2}),

where F_α(X_i) is the p × m matrix with typical elements F_{βγ,α}(X_i) = E{ ∂²ρ_γ(Z_i; θ_0)/∂θ_α∂θ_β | X_i }, for α, β, γ = 1,…,p.
Lemma 5. There exist random matrices H_j, j = 1,…,5, with E[||H_j||^v] < ∞ for v = 1, 2,… and j = 1,…,5, such that

∂ŝ(θ_0)/∂θ' − J = Σ_{j=1}^5 H_j + o_D(n^{-2ε}),

where

H_1 = (1/n) Σ_{i=1}^n D(X_i)Ω^{-1}D(X_i)' − J + (1/n) Σ_{i=1}^n D(X_i)Ω^{-1}ε_i' = O_p(n^{-1/2})

H_2 = (1/n) Σ_{i=1}^n V_ni Ω^{-1} D(X_i)' = O_p(n^{-1/2})

H_3 = −(1/n) Σ_{i=1}^n D(X_i)Ω^{-1} { (1/n) Σ_{l=1}^n ξ(Z_l) } Ω^{-1} D(X_i)' = O_p(n^{-1/2})

H_4 = Σ_{α=1}^p { (1/n) Σ_{j=1}^n ψ_α(Z_j) } (1/n) Σ_{i=1}^n F_α(X_i)Ω^{-1}D(X_i)' = O_p(n^{-1/2})

H_5 = (1/n) Σ_{i=1}^n B_ni Ω^{-1} D(X_i)' = O_p(h^q),

where ε_i = ∂ρ(Z_i; θ_0)/∂θ' − D(X_i). Furthermore, E(H_j) = 0, j = 1, 2, and these terms are asymptotically normal.
Lemma 6. For α, β, γ, δ = 1,…,p, we have for some k < ∞:

Pr[ n^{1/2} |ŝ_α(θ_0)| > k(log n)^{1/2} ] = o(n^{-2ε}) (35)

Pr[ n^{1/2} | ŝ_{αβ}(θ_0) − E{ŝ_{αβ}(θ_0)} | > k(log n)^{1/2} ] = o(n^{-2ε}) (36)

Pr[ n^{1/2} | ŝ_{αβγ}(θ_0) − E{ŝ_{αβγ}(θ_0)} | > k(log n)^{1/2} ] = o(n^{-2ε}) (37)

Pr[ sup_{n^{1/2}|θ−θ_0|≤k log n} | ŝ_{αβγδ}(θ) | > k(log n)^{1/2} ] = o(n^{-2ε}). (38)
B Appendix

We use Greek subscripts to denote either an element of a vector or partial differentiation with respect to the parameters; thus ŝ_α(θ) = ∂ŝ(θ)/∂θ_α, ŝ_{αβ}(θ) = ∂²ŝ(θ)/∂θ_α∂θ_β, etc. Let s(θ) = n^{-1} Σ_{i=1}^n D(X_i; θ)Ω^{-1}ρ(Z_i; θ), where s(θ) = (s_1(θ),…,s_p(θ))' and ŝ(θ) = (ŝ_1(θ),…,ŝ_p(θ))'. Let also J_{αβ} = E{s_{αβ}(θ_0)}, J_{αβγ} = E{s_{αβγ}(θ_0)}, and (J^{αβ}) = (J_{αβ})^{-1}, and define the standardized random variables Z_α = √n s_α(θ_0), Z_{αβ} = √n{s_{αβ}(θ_0) − J_{αβ}}, and Z_{αβγ} = √n{s_{αβγ}(θ_0) − J_{αβγ}}, while Ẑ_α = √n ŝ_α(θ_0), Ẑ_{αβ} = √n{ŝ_{αβ}(θ_0) − J_{αβ}}, and Ẑ_{αβγ} = √n{ŝ_{αβγ}(θ_0) − J_{αβγ}}. In the sequel we shall make use of the facts that

o_D(n^{-a}) + o_D(n^{-b}) = o_D( max{n^{-a}, n^{-b}} ) and o_D(n^{-a}) o_D(n^{-b}) = o_D( n^{-(a+b)}; max{n^{-a}, n^{-b}} ). (39)
Proof of Theorem 1. By a Taylor expansion (for $\alpha = 1,\ldots,p$),
$$0 = \widehat s_\alpha(\widehat\theta) = \widehat s_\alpha(\theta_0) + \sum_{\beta=1}^p \widehat s_{\alpha\beta}(\theta_0)(\widehat\theta - \theta_0)_\beta \qquad (40)$$
$$\qquad + \frac12\sum_{\beta,\gamma=1}^p \widehat s_{\alpha\beta\gamma}(\theta_0)(\widehat\theta - \theta_0)_\beta(\widehat\theta - \theta_0)_\gamma \qquad (41)$$
$$\qquad + \frac{1}{3!}\sum_{\beta,\gamma,\delta=1}^p \widehat s_{\alpha\beta\gamma\delta}(\theta^*_\alpha)(\widehat\theta - \theta_0)_\beta(\widehat\theta - \theta_0)_\gamma(\widehat\theta - \theta_0)_\delta, \qquad (42)$$
where the $\theta^*_\alpha$ are intermediate values between $\widehat\theta$ and $\theta_0$. We first establish (12). We can rewrite (40)-(42) as the following system of equations:
$$\sum_{\beta=1}^p J_{\alpha\beta}\lambda_\beta = -\widehat Z_\alpha - \frac{1}{\sqrt n}\sum_{\beta=1}^p \widehat Z_{\alpha\beta}\lambda_\beta - \frac{1}{2\sqrt n}\sum_{\beta,\gamma=1}^p J_{\alpha\beta\gamma}\lambda_\beta\lambda_\gamma - \frac{1}{2n}\sum_{\beta,\gamma=1}^p \widehat Z_{\alpha\beta\gamma}\lambda_\beta\lambda_\gamma \qquad (43)$$
$$\qquad - \frac{1}{3!\,n}\sum_{\beta,\gamma,\delta=1}^p \widehat s_{\alpha\beta\gamma\delta}(\theta^*_\alpha)\lambda_\beta\lambda_\gamma\lambda_\delta, \qquad (44)$$
for $\alpha = 1,\ldots,p$, where $\lambda = n^{1/2}(\widehat\theta - \theta_0) = (\lambda_1,\ldots,\lambda_p)'$. When $\|\lambda\| \le k(\log n)^{1/2}$, the right-hand side of (43)-(44) is less than $k(\log n)^{1/2}$ with probability $1 - o(n^{-2\varepsilon})$, by Lemma 6. By Brouwer's fixed point theorem there exists a solution to (40)-(42) in the disk $\|n^{1/2}(\widehat\theta - \theta_0)\| \le k(\log n)^{1/2}$, which concludes the proof of (12).

We now establish (13). Define the random variable $\bar\theta = (\bar\theta_1,\ldots,\bar\theta_p)'$ that solves the truncated equations determined by setting the term (44) equal to zero, and let $T_n = \sqrt n(\widehat\theta - \theta_0)$ and $\bar T_n = \sqrt n(\bar\theta - \theta_0)$. By the implicit function theorem, we can write $\bar T_n$ as a smooth function of the array $\{\widehat s_\alpha(\theta_0), \widehat s_{\alpha\beta}(\theta_0), \widehat s_{\alpha\beta\gamma}(\theta_0)\}_{\alpha,\beta,\gamma=1}^p$ and $T_n$ as a smooth function of the same array as well as of the array $\{\widehat s_{\alpha\beta\gamma\delta}(\theta^*)\}_{\alpha,\beta,\gamma,\delta=1}^p$. Therefore, by Taylor expansion,
$$\Pr\left[n^{2\varepsilon}\left\|T_n - \bar T_n\right\| > k(\log n)^{1/2}\right] = o(n^{-2\varepsilon}). \qquad (45)$$
Now let
$$T^*_{n\alpha} = -\sum_{\beta=1}^p J^{\alpha\beta}\widehat Z_\beta + \frac{1}{\sqrt n}\sum_{\beta,\gamma,\delta=1}^p J^{\alpha\beta}J^{\gamma\delta}\widehat Z_{\beta\gamma}\widehat Z_\delta - \frac{1}{2\sqrt n}\sum_{\beta,\gamma,\delta,\mu,\nu=1}^p J^{\alpha\beta}J_{\beta\gamma\delta}J^{\gamma\mu}J^{\delta\nu}\widehat Z_\mu\widehat Z_\nu - \frac{1}{2n}\sum_{\beta,\gamma,\delta,\mu,\nu=1}^p J^{\alpha\beta}J^{\gamma\mu}J^{\delta\nu}\widehat Z_{\beta\gamma\delta}\widehat Z_\mu\widehat Z_\nu,$$
where $J^{\alpha} = (J^{\alpha 1},\ldots,J^{\alpha p})'$. By Taylor expansion and application of Lemma 6, the random variable $T^*_n$ satisfies
$$\Pr\left[n^{2\varepsilon}\left\|\bar T_n - T^*_n\right\| > k\log n\right] = o(n^{-2\varepsilon}).$$
The properties of the truncated expansion $T^*_n$ can now be found from the properties of the standardized random variables $\widehat Z_\alpha$, $\widehat Z_{\alpha\beta}$, $\widehat Z_{\alpha\beta\gamma}$ and the probability limits $J_{\alpha\beta}$ and $J_{\alpha\beta\gamma}$. Substituting the expansions found in Lemmas 4-6 in Appendix A, we obtain
$$T^*_{n\alpha} = \sum_{\ell=0}^{5} T_{n\ell,\alpha} + o_D(n^{-2\varepsilon}) = \widetilde T_{n\alpha} + o_D(n^{-2\varepsilon}), \qquad (46)$$
where the leading term is
$$T_{n0,\alpha} = -\sum_{\beta=1}^p J^{\alpha\beta}S_{0\beta} = O_p(1),$$
while
$$T_{n1,\alpha} = -\sum_{\beta=1}^p J^{\alpha\beta}S_{3\beta}, \qquad T_{n2,\alpha} = -\sum_{\beta=1}^p J^{\alpha\beta}S_{4\beta},$$
$$T_{n3,\alpha} = -\sum_{\beta=1}^p J^{\alpha\beta}(S_1 + S_2)_\beta + \sum_{\beta,\gamma,\delta=1}^p J^{\alpha\beta}J^{\gamma\delta}(H_1 + H_2 + H_3 + H_4)_{\beta\gamma}S_{0\delta} - \frac{1}{2\sqrt n}\sum_{\beta,\gamma,\delta,\mu,\nu=1}^p J^{\alpha\beta}J_{\beta\gamma\delta}J^{\gamma\mu}J^{\delta\nu}S_{0\mu}S_{0\nu},$$
$$T_{n4,\alpha} = \sum_{\beta,\gamma,\delta=1}^p J^{\alpha\beta}J^{\gamma\delta}H_{5\beta\gamma}S_{3\delta} - \sum_{\beta,\gamma,\delta,\mu,\nu=1}^p J^{\alpha\beta}J^{\gamma\delta}J^{\mu\nu}H_{5\beta\gamma}H_{5\delta\mu}S_{0\nu},$$
$$T_{n5,\alpha} = \sum_{\beta,\gamma,\delta=1}^p J^{\alpha\beta}J^{\gamma\delta}\left\{H_{2\beta\gamma}S_{3\delta} + H_{5\beta\gamma}S_{4\delta}\right\}.$$
Here, $T_{n1} = O_p(h^q)$, $T_{n2} = O_p(1/\sqrt{nh^d})$, $T_{n3} = O_p(n^{-1/2})$, $T_{n4} = O_p(h^{2q})$, and $T_{n5} = O_p(h^q/\sqrt{nh^d})$. Write $T_{na} = T_{n0} + T_{n1} + T_{n4}$, $T_{nb} = T_{n3}$, and $T_{nc} = T_{n2} + T_{n5}$. The quantities $\mu_n$ and $\nu_n$ in Theorem 1 are the mean and variance of $\widetilde T_n$. Define the random variables
$$U_1 = \frac{1}{\sqrt n}\sum_{i=1}^n \eta_n(Z_i) \qquad (47)$$
$$U_2 = \frac{1}{\sqrt n}\sum_{i=1}^n \sum_{\substack{j=1 \\ j\ne i}}^n w_{ij}\,\varphi_n(Z_i, Z_j). \qquad (48)$$
Here $\eta_n(\cdot)$ is a deterministic vector function satisfying $E[\eta_n(Z_i)] = 0$ and $\sup_n E[\|\eta_n(Z_i)\|^r] < \infty$ for $r = 1, 2, \ldots$, while $\varphi_n(\cdot,\cdot)$ is a deterministic vector function satisfying $E[\varphi_n(Z_i,Z_j)\mid Z_i] = E[\varphi_n(Z_i,Z_j)\mid Z_j] = 0$ with probability one, and $\sup_n E[\|\varphi_n(Z_i,Z_j)\|^r] < \infty$, $r = 1, 2, \ldots$. Note that $T_{na}$ is of the form (47), the random sequence $T_{nb}$ is a homogeneous quadratic polynomial in such sums with magnitude $O_p(n^{-1/2})$, while $T_{nc}$ is like (48) with $\varphi_n(Z_i,Z_j) = \sum_{\beta,\gamma,\delta=1}^p \{J^{\alpha\beta} + J^{\alpha\beta}J^{\gamma\delta}H_{5\gamma\delta}\}\,(\rho_j'\,\Omega^{-1}\rho_i)$. We now establish the validity of the Edgeworth approximation to the distribution of $\widetilde T_n$, from which follows the validity of the distributional approximation for $T_n$. Our method of proof extends the univariate results of Nishiyama and Robinson (1997) to higher dimensions and smaller order of error [they considered only $O(n^{-1/2})$ approximations]. The random sequence $T_{na} + T_{nb}$ has a valid Edgeworth expansion to $o(n^{-2\varepsilon})$, as follows from standard results for parametric estimators such as can be found in Bhattacharya and Ghosh (1978); the main problem arises with $T_{nc}$. Therefore, for expositional reasons we shall assume henceforth that $T_{nb} \equiv 0$. By the so-called smoothing lemma [Bhattacharya and Rao (1976, Lemmas 12.1 and 12.2) and Götze (1987, 3.4)], the left-hand side of (13) is bounded by
$$\max_{|a|\le p+1}\int_0^{n^{2\varepsilon}\log n}\left|\partial^a\left\{\chi_n(s) - \widetilde\chi_n(s)\right\}\right| ds + o(n^{-2\varepsilon}) \qquad (49)$$
for some constant $k$, where $\chi_n(s)$ is the characteristic function of $\widetilde T_n$, while $\widetilde\chi_n(s)$ is the Fourier transform of the signed measure $\widetilde F_{n1}(x)$ [which is $\widehat F_{n1}(x)$ renormalized to have mean vector $\mu_n$ and variance matrix $\nu_n$]. Here, $\int_a^b ds$ denotes integration over the $p$-dimensional region where $a < \|s\| < b$. By the triangle inequality,
$$\int_0^{n^{2\varepsilon}\log n}\left|\partial^a\{\chi_n(s) - \widetilde\chi_n(s)\}\right| ds \le \int_0^{n^{\xi}}\left|\partial^a\{\chi_n(s) - \widetilde\chi_n(s)\}\right| ds + \int_{n^{\xi}}^{n^{2\varepsilon}\log n}\left|\partial^a\chi_n(s)\right| ds + \int_{n^{\xi}}^{\infty}\left|\partial^a\widetilde\chi_n(s)\right| ds$$
for any $\xi > 0$. We take $\xi = (1 - 2\varepsilon)/(p + 5)$. The last integral is $o(n^{-2\varepsilon})$ because of the form of the Edgeworth characteristic function $\widetilde\chi_n$. We must develop some machinery for dealing with $\partial^a\chi_n(s)$ and $\partial^a\widetilde\chi_n(s)$. For any set $L_m = \{i_1,\ldots,i_m\}$ with $m = m(n)$, let
$$\Delta'_n(m) = \frac{1}{\sqrt n}\sum_{i\notin L_m}\sum_{j\notin L_m} w_{ij}\,\varphi_n(Z_i, Z_j) \quad\text{and}\quad \Delta_n(m) = T_{nc} - \Delta'_n(m).$$
Then, $\widetilde T_n - \Delta_n(m) = \sum_{i\in L_m}\eta_n(Z_i)/\sqrt n + L(m)$, where $L(m)$ does not depend on $Z_j$, $j \in L_m$. When $m = n$, $\Delta_n(m) = T_{nc}$ and $L(m) \equiv 0$. Note that under our moment conditions, $E[\|\Delta_n(m)\|^r] = O(m^{r/2}/n^{r(1+2\gamma)/2})$ for any integer $r$. By Taylor expansion we have, for any integer $r$,
$$\chi_n(s) = \sum_{\ell=0}^{r}\chi_{n,m,\ell}(s) + R_{m,r}(s), \qquad (50)$$
where
$$\chi_{n,m,\ell}(s) = E\left[\exp\left\{is'\big(\widetilde T_n - \Delta_n(m)\big)\right\}\frac{(is'\Delta_n(m))^{\ell}}{\ell!}\right]$$
and
$$R_{m,r}(s) = E\left[\exp\left\{is'\big(\widetilde T_n - \Delta_n(m)u\big)\right\}\frac{(is'\Delta_n(m))^{r+1}}{(r+1)!}\right],$$
where $u$ is uniformly distributed on $[0,1]$ independently of the data, while
$$\chi_{n,m,\ell}(s) = \frac{1}{\ell!\,n^{\ell/2}}\sum \zeta_n^{\,m-(J\cap L_m)^{\#}}(s)\,w_{i_1j_1}\cdots w_{i_\ell j_\ell}\,\Gamma_{n\ell}(s; J, m),$$
where $\zeta_n(s) = E[\exp(is'\eta_n(Z_i)/\sqrt n)]$, the summation is over all $2\ell$-tuples $J = (i_1, j_1, \ldots, i_\ell, j_\ell)$ with $i_s \ne j_s$, and $A^{\#}$ is the number of distinct elements in the set $A$, while
$$\Gamma_{n\ell}(s; J, m) = E\left[\exp\left(\frac{i}{\sqrt n}\,s'\sum_{j\in J\cap L_m}\eta_n(Z_j)\right)\exp\big(is'L(m)\big)\,s'\varphi_n(Z_{i_1},Z_{j_1})\cdots s'\varphi_n(Z_{i_\ell},Z_{j_\ell})\right].$$
The analysis of $\zeta_n^{\,m-(J\cap L_m)^{\#}}(s)$ and $\Gamma_{n\ell}(s; J, n)$ and their partial derivatives follows from work similar to Götze (1987); the behaviour of the nonparametric weights follows from Lemma 3. We shall use the following fact: for any $p$-vectors $x = (x_1,\ldots,x_p)'$ and $b = (b_1,\ldots,b_p)'$ and integer $r$, we have $\partial^b e^{is'x} = i^{|b|}e^{is'x}\prod_{j=1}^p x_j^{b_j}$ and $\partial^b(s'x)^r = (r)_{|b|}(s'x)^{r-|b|}\prod_{j=1}^p x_j^{b_j}$, where $(r)_{|b|} = r(r-1)\cdots(r-|b|+1)$. We first take $m = n$. Then for any vector $a$ with $|a| \le p+1$ we have, by the chain rule and the fact that $|\exp(ix)| \le 1$ for all real $x$,
$$\left|\partial^a R_{n,r}(s)\right| \le k\sum_{b+c=a} E\left[\left|\partial^b e^{is'(T_{na}+T_{nc}u)}\right|\left|\partial^c(s'T_{nc})^{r+1}\right|\right] \le k\sum_{b+c=a} E\left[\left|s'T_{nc}\right|^{r+1-|c|}\prod_{j=1}^p\left|(T_{na}+T_{nc}u)_j\right|^{b_j}\left|(T_{nc})_j\right|^{c_j}\right]$$
$$\le k\left(1 + \|s\|^{r+1}\right)E\left[\|T_{nc}\|^{r+1}\left(1 + \|T_{nc}\|^{p+1}\right)\|T_{na}\|^{p+1}\right],$$
by the Cauchy-Schwarz inequality. Then, since $\int_0^{n^{\xi}}(1 + \|s\|^{r+1})\,ds = O(n^{\xi(p+r+2)})$ and $E[\|T_{nc}\|^{r+1}(1 + \|T_{nc}\|^{p+1})\|T_{na}\|^{p+1}] = O(n^{-(r+1)\gamma})$, we have shown that
$$\max_{|a|\le p+1}\int_0^{n^{\xi}}\left|\partial^a R_{n,r}(s)\right| ds = o(n^{-2\varepsilon}),$$
provided $r > \{2\varepsilon + (p+2)\xi\}/(\gamma - \xi)$.
We next deal with the term $\int_0^{n^{\xi}}\sum_{\ell=3}^{r}\left|\partial^a\chi_{n,n,\ell}(s)\right| ds$. First note that
$$\left|\partial^a\chi_{n,n,\ell}(s)\right| \le \frac{k}{n^{\ell/2}}\sum_{b+c=a}\sum\left|\partial^b\zeta_n^{\,n-(J\cap L_n)^{\#}}(s)\right|\left|w_{i_1j_1}\cdots w_{i_\ell j_\ell}\right|\left|\partial^c\Gamma_{n\ell}(s; J, n)\right|,$$
by crude bounding. Furthermore, there is some polynomial $P(\cdot)$ with bounded coefficients and a constant $k > 0$ such that
$$\left|\partial^b\zeta_n^{\,n-(J\cap L_n)^{\#}}(s)\right| \le \exp(-k\|s\|^2)\,P(\|s\|),$$
by Götze (1987, Lemma 3.3). Therefore, it suffices to estimate $\sum|w_{i_1j_1}\cdots w_{i_\ell j_\ell}|\,|\partial^c\Gamma_{n\ell}(s; J, n)|$. Consider the case $a = c = 0$ and $\ell = 3$, in which situation there are five subcases, $J^{\#} = 2, \ldots, J^{\#} = 6$. By Lemma 3,
$$\sum|w_{ij}|^3 = O(n^{1-4\gamma}), \quad \sum|w_{ij}w_{jk}w_{ki}| = O(n^{1-2\gamma}), \quad \sum|w_{ij}w_{jk}w_{kl}| = O(n^{2-2\gamma}), \quad \sum|w_{ij}w_{jk}w_{rs}| = O(n^{3-2\gamma}), \quad \sum|w_{ij}w_{kl}w_{rs}| = O(n^{3}), \qquad (51)$$
where the sums are over two, three, four, five, and six distinct indices respectively. Now let $\zeta_{nj}(s) = \exp(is'\eta_n(Z_j)/\sqrt n)$. We have, on the range $0 \le \|s\| \le o(\sqrt n)$, that
$$E\left[\zeta_{ni}(s)\,\zeta_{nj}(s)\,s'\varphi_n(Z_i,Z_j)\right] = O(\|s\|^3/n), \qquad (52)$$
$$E\left[\prod_{\ell}\zeta_{n\ell}(s)\,\{s'\varphi_n(Z_i,Z_j)\}^2\,s'\varphi_n(Z_j,Z_k)\right] = O(\|s\|^4/\sqrt n),$$
$$E\left[\prod_{\ell}\zeta_{n\ell}(s)\,s'\varphi_n(Z_i,Z_j)\,s'\varphi_n(Z_j,Z_k)\,s'\varphi_n(Z_k,Z_i)\right] = O(\|s\|^3),$$
$$E\left[\prod_{\ell}\zeta_{n\ell}(s)\,s'\varphi_n(Z_i,Z_j)\,s'\varphi_n(Z_j,Z_k)\,s'\varphi_n(Z_k,Z_l)\right] = O(\|s\|^5/n),$$
$$E\left[\prod_{\ell}\zeta_{n\ell}(s)\,s'\varphi_n(Z_i,Z_j)\,s'\varphi_n(Z_j,Z_k)\,s'\varphi_n(Z_l,Z_r)\right] = O(\|s\|^6/n\sqrt n),$$
$$E\left[\prod_{\ell}\zeta_{n\ell}(s)\,s'\varphi_n(Z_i,Z_j)\,s'\varphi_n(Z_k,Z_l)\,s'\varphi_n(Z_r,Z_t)\right] = O(\|s\|^9/n^3). \qquad (53)$$
The magnitude of the expectations (52)-(53) is established as follows. Write $\zeta_{ni}(s) = P_q(s'\eta_n(Z_i)/\sqrt n) + R_q(s'\eta_n(Z_i)/\sqrt n)$, where $P_q(x) = 1 + ix + \cdots + (ix)^q/q!$ is the $q$th order expansion of $\exp(ix)$ about $x = 0$ and $R_q(x) = \exp(ix) - P_q(x)$. Then multiply out and take expectations term by term
[the polynomial terms yield zeros when there is an odd $Z_j$]. In conclusion, we obtain
$$\left|\sum_{\ell=3}^r\partial^a\chi_{n,n,\ell}(s)\right| \le \exp(-k\|s\|^2)\,P^*(\|s\|)\,o(n^{-2\varepsilon})$$
for some polynomial $P^*$ with bounded coefficients. Thus, the integral over $0 \le \|s\| \le n^{\xi}$ is $o(n^{-2\varepsilon})$ as required. The same argument applies to general $a$, because: (a) differentiating the polynomial terms with respect to $s$ simply changes the polynomial $P^*$ in our bound, while (b) as for the remainder terms, we have for any integer $q$ and any vector $a$, $|\partial^a R_q(s'\eta_n(Z_i)/\sqrt n)| \le k\,|R_{q-|a|}(s'\eta_n(Z_i)/\sqrt n)|$ for some constant $k$. Finally, under our conditions, $|\partial^a\chi_{n,n,\ell+1}(s)| \le |\partial^a\chi_{n,n,\ell}(s)|$ for any $\ell$. Therefore, combining (51) and (52)-(53), we get
$$\sum_{\ell=3}^r\int_0^{n^{\xi}}\left|\partial^a\chi_{n,n,\ell}(s)\right| ds = o(n^{-2\varepsilon})$$
as required. We next establish that
$$\int_{n^{\xi}}^{n^{2\varepsilon}\log n}\left|\partial^a\chi_n(s)\right| ds = o(n^{-2\varepsilon}). \qquad (54)$$
For this we use the expansion (50) with $m(n) < n$. We just consider the case $a = 0$; the general case follows similarly. We have
$$|\chi_n(s)| \le k\sum_{\ell=0}^{r}\left|\zeta_n^{\,m-2\ell}(s)\right|\|s\|^{\ell}\frac{m^{\ell}}{n^{\ell(1+2\gamma)}} + k\|s\|^{r+1}\frac{m^{r/2}}{n^{r(1+2\gamma)/2}}, \qquad (55)$$
by an extension of some results of Callaert, Janssen, and Veraverbeke (1984) to the multivariate case with kernel weighting [see also Bickel, Götze, and van Zwet (1986, 2.17)]. Without loss of generality we suppose that $\gamma > 1/4$; when $\gamma \le 1/4$ the argument below is even simpler. We first consider the range $n^{\xi} \le \|s\| \le k\sqrt n$ for some constant $k$, which we split into the subintervals $I_1 = \{n^{\xi} \le \|s\| \le n^{\xi_1}\}$ and $I_2 = \{n^{\xi_1} \le \|s\| \le k\sqrt n\}$. On each interval we take $m(n,s) = O(n^{\delta_j})$, where $\delta_1 > \delta_2 > 0$. On $I_1$, the first term in (55) is $o(n^{-2\varepsilon})$, because $|\zeta_n(s)| \le \exp(-k\|s\|^2/n)$ when $\|s\| \le k\sqrt n$, and $\zeta_n^{\,m-2\ell}(s) \le k\exp(-kn^{\delta_1+2\xi-1}) = o(n^{-d})$ for any $d$, provided $n^{\xi} < \|s\|$ and $\delta_1 > 1 - 2\xi$. The second term in (55) contributes $o(n^{-2\varepsilon})$ provided $r > (8\varepsilon + 2p)/(1 - \delta_1)$, because $\int_{n^{\xi}}^{n^{\xi_1}}\|s\|^{r+1}\,ds = O(n^{\xi_1(p+r+2)})$ for any $\xi_1$. On $I_2$, we must take $\delta_2 > \xi_1 - 2\gamma$ and $r > (4 + 2\varepsilon + p)/(\delta_2 - 2\varepsilon)$. On the range $k\sqrt n \le \|s\| \le n^{2\varepsilon}\log n$ we have $|\zeta_n(s)| \le 1 - \vartheta$ for some $\vartheta > 0$. Take $m(n) = O(\log n)$. Then the first term in (55) is bounded by a constant times $|1-\vartheta|^{k\log n}\sum_{\ell=0}^r\|s\|^{\ell}(\log n)^{\ell}n^{-\ell(1+2\gamma)}$. For large $k$, this is small enough. As for the second term in (55), this is small enough because $n^{(r+2+p)2\varepsilon}(\log n)^k n^{-r(1+2\gamma)/2}$ can be made smaller than order $n^{-2\varepsilon}$ by taking $r > (12\varepsilon + 4p)/(1 - 2\varepsilon)$.
Finally, we show that
$$\int_0^{n^{\xi}}\left|\partial^a\left\{\sum_{\ell=0}^{2}\chi_{n,n,\ell}(s) - \widetilde\chi_n(s)\right\}\right| ds = o(n^{-2\varepsilon}). \qquad (56)$$
Consider again the case $a = 0$. Using the conditioning argument and the magnitudes of the weights, we obtain
$$E\left[e^{is'T_{na}}\,s'T_{nc}\right] = \zeta_n^{\,n-2}(s)\left[\frac{1}{\sqrt n}E\left\{s'\eta_n(Z_1)\,s'\eta_n(Z_2)\,s'\varphi_n(Z_1,Z_2)\right\} + O\left(\frac{\|s\|^3}{n}\right)\right]$$
$$E\left[e^{is'T_{na}}\,(s'T_{nc})^2\right] = \zeta_n^{\,n-2}(s)\left[E\left\{(s'T_{nc})^2\right\} + O\left(\frac{\|s\|^4}{n}\right)\right],$$
and after integrating over the range $[0, n^{\xi}]$ we obtain $o(n^{-2\varepsilon})$ errors as required. The general $a$ case is similar.
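The flavour of the refinement that an Edgeworth expansion delivers over the normal approximation can be seen in the simplest possible setting, a standardized mean of iid skewed data. The following sketch is purely illustrative (classical one-term Edgeworth correction for a scalar mean, not the semiparametric statistic analysed above):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):   # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

rng = np.random.default_rng(1)
n, reps = 20, 200000
kappa3 = 2.0  # third cumulant of Exponential(1), which has unit variance
# T = sqrt(n) * (sample mean - true mean): noticeably skewed for small n
T = sqrt(n) * (rng.exponential(1.0, size=(reps, n)).mean(axis=1) - 1.0)

for x in (-0.5, 0.0, 0.5):
    F_mc = np.mean(T <= x)     # Monte Carlo "truth"
    F_normal = Phi(x)          # first-order (CLT) approximation
    # one-term Edgeworth: Phi(x) - phi(x) * kappa3 * (x^2 - 1) / (6 sqrt(n))
    F_edge = Phi(x) - phi(x) * kappa3 * (x * x - 1.0) / (6.0 * sqrt(n))
    assert abs(F_edge - F_mc) < abs(F_normal - F_mc)
```

At each evaluation point the skewness-corrected approximation is closer to the simulated distribution than the plain normal limit, which is exactly the kind of improvement the higher-order theory formalizes.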
Proof of Corollary 1. This follows by a standard application of de Jong (1987, Theorem 2.1 and Proposition 3.2) and of the Cramér-Wold device, and is worked out in more detail in Linton (1997).

Proof of Theorem 2. By Taylor series expansions we obtain
$$\widehat W = \sqrt n(\widehat\theta - \theta_0)'\,Q\,\sqrt n(\widehat\theta - \theta_0) + O_p(n^{-1/2}) + o_D(n^{-2\varepsilon}),$$
where the $O_p(n^{-1/2})$ term is a homogeneous cubic polynomial in asymptotically normal single sums of independent random variables. The validity of the Edgeworth approximation for $\widehat W$ now follows from the validity of a multivariate expansion for $\sqrt n(\widehat\theta - \theta_0)$ [jointly with the $O_p(n^{-1/2})$ terms], as in Chandra and Ghosh (1979). This expansion follows along the same lines as Theorem 1 [the $O_p(n^{-1/2})$ terms are `standard']. The terms are given in detail in Linton (1997). We next rewrite the expansion in the more convenient form
$$\widehat W = W^0 + \frac{A_n}{\sqrt n} + \frac{B_n}{n} + \frac{C_n}{n^{2\varepsilon}} + o_D(n^{-2\varepsilon}),$$
where $W^0$, $A_n$, $B_n$, and $C_n$ are stochastically bounded sequences, and the distribution of $W^0$ is $\chi^2_{p_1}$ to order $n^{-1}$. In fact, $W^0 = X'QX$, $A_n = 2X'Q\upsilon$, and $C_n = \upsilon'Q\upsilon$, where $X$ [$= T_{na}$] is the leading term of $\sqrt n(\widehat\theta - \theta_0)$ and $\upsilon$ [$= T_{nc}$] is the second-order term. We next apply the algorithm described in Rothenberg (1984), which, with suitable modification for the different orders of magnitude we have, says that
$$\Pr\left[\widehat W \le x\right] = F_{p_1}\left(x - \frac{a(x)}{\sqrt n} - \frac{b(x) + 2a(x)a'(x)}{n} - \frac{q(x)v(x) + v'(x) - 2c(x)}{2n^{2\varepsilon}}\right) + o(n^{-2\varepsilon}),$$
where $q(x) = (p_1 - 2 - x)/2x = d\log F'_{p_1}(x)/dx$, while $a(x) = E[A_n\mid W^0 = x]$, $b(x) = E[B_n\mid W^0 = x]$, $c(x) = E[C_n\mid W^0 = x]$, and $v(x) = \mathrm{var}[A_n\mid W^0 = x]$. The random variable $B_n$ is a homogeneous cubic polynomial in asymptotically normal single sums of independent random variables, while $W^0$ is a quadratic function of similar random variables; this implies that $b(x) = O(n^{-1/2})$ by standard arguments. Also, $a(x) = o(n^{-\varepsilon})$ by the arguments used in Linton (1997, expression (54)). Finally, by the law of iterated expectations,
$$v(x) = 4E\left[X'Q\,E(\upsilon\upsilon'\mid X)\,QX \mid X'QX = x\right] = 4E\left[X'Q\Sigma QX \mid X'QX = x\right] + o(1) = 4\,\frac{\mathrm{tr}(Q\Sigma)}{p_1}\,x + o(1),$$
where $\Sigma$ denotes the asymptotic variance of $\upsilon$, because $\upsilon$ is asymptotically independent of $X$, and because the best prediction of $X'Q\Sigma QX$ by $X'QX$ is linear with coefficient $\mathrm{cov}(X'Q\Sigma QX, X'QX)/\mathrm{var}(X'QX)$. Similarly,
$$c(x) = E\left[\mathrm{tr}(Q\,\upsilon\upsilon')\mid X'QX = x\right] = \mathrm{tr}(Q\Sigma).$$

Proof of Theorem 3. We have
$$\Pr\left[\sqrt n\left\|\widehat\theta(\widehat h_{opt}) - \widehat\theta(h_{opt})\right\| > \frac{k}{n^{2\varepsilon}\log n}\right] \le \Pr\left[\sup_{|h - h_{opt}|\le n^{-\varphi_1}}\sqrt n\left\|\widehat\theta(h) - \widehat\theta(h_{opt})\right\| > \frac{k}{n^{2\varepsilon}\log n}\right] + \Pr\left[n^{\varphi_1}\left|\widehat h_{opt} - h_{opt}\right| > 1\right] = o(n^{-2\varepsilon}),$$
because of (19) and
$$\sup_{|h - h_{opt}|\le n^{-\varphi_1}}\sqrt n\left\|\widehat\theta(h) - \widehat\theta(h_{opt})\right\| \le n^{1/2-\varphi_1}\sup_{|h - h_{opt}|\le n^{-\varphi_1}}\left\|\frac{\partial\widehat\theta}{\partial h}(h)\right\| = o_D\big(n^{1/2-(\varphi_1+\varphi_2)};\,n^{-2\varepsilon}\big),$$
which follows by the Mean Value Theorem, crude bounding, and (20).
C Proofs of Lemmas

Our arguments below will make use of the following exponential inequality.

Lemma Exp. Let $Z_{ni}$ be independent random variables with $E(Z_{ni}) = 0$ and $\mathrm{var}(Z_{ni}) = \sigma^2_{ni} > 0$. Suppose also that $|Z_{ni}| \le m$ for some constant $m$. Then, for any $\eta > 0$,
$$\Pr\left[\left|\sum_{i=1}^n Z_{ni}\right| > \eta\right] \le 2\exp\left(-\frac{\eta^2}{2\left(\sum_{i=1}^n \sigma^2_{ni} + m\eta\right)}\right).$$
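As a quick numerical sanity check on this Bernstein-type bound (an illustrative sketch, not part of the proof; the Uniform(-1,1) summands and the particular thresholds are assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20000
# Z_ni: iid Uniform(-1, 1), so E(Z) = 0, var(Z) = 1/3, and |Z| <= m = 1.
m, var = 1.0, 1.0 / 3.0
sums = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)

for eta in (10.0, 20.0, 30.0):
    empirical = np.mean(np.abs(sums) > eta)
    # Lemma Exp bound: 2 exp(-eta^2 / (2 (n*var + m*eta)))
    bound = 2.0 * np.exp(-eta**2 / (2.0 * (n * var + m * eta)))
    assert empirical <= bound  # the tail probability never exceeds the bound
    print(f"eta={eta:5.1f}  empirical={empirical:.4f}  bound={bound:.4f}")
```

The bound is conservative for moderate thresholds but, crucially for the arguments below, decays exponentially in the threshold.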
See Pollard (1985) for a proof.

Proof of Lemma 1. The proof of a stronger [involving a supremum] version of (25) and (27) for the case where $h_{ni} = h_n$ is given in the recent paper by Masry (1996a). The extension to our case is straightforward under our assumptions A1 and A5. In particular, there is a finite constant $k$ such that
$$\left|\sum_j w_j(x)T_j - \sum_j w_j(x')T_j\right| \le k\,h^{-(d+1)}|x - x'|$$
for any bounded sequence $T_j$ and any points $x, x'$. This leads to the same proof as his 3.22, 3.23, and 4.21.
Proof of Lemma 2. Let $v_{ni} = \widehat M_{ni} - \overline M_{ni} = n^{-1}h_{ni}^{-d}\sum_{j\ne i} Z_{nij}$, where
$$Z_{nij} = K\left(\frac{X_i - X_j}{h_{ni}}\right)\left(\frac{X_i - X_j}{h_{ni}}\right)^a - E\left[K\left(\frac{X_i - X_j}{h_{ni}}\right)\left(\frac{X_i - X_j}{h_{ni}}\right)^a \,\Bigg|\, X_i\right]$$
are mutually independent given $X_i$. We apply Lemma Exp with $Z_{nij}$ as above and $\eta = k n^{\gamma}\log n$. Since $K$ is bounded and has bounded support, $|Z_{nij}| \le 2\sup_u|K(u)u^a| \equiv \bar K$, i.e., we can take $m(n) = \bar K$, while $\mathrm{var}(\sum_{j\ne i} Z_{nij}\mid X_i) = s_{ni}n^{2\gamma}$, where $s_{ni} \le \bar s$ for some finite constant $\bar s$. We have with probability one
$$\Pr\left[|v_{ni}| > k\frac{\log n}{n^{\gamma}}\,\Bigg|\,X_i\right] \le 2\exp\left(-\frac{k^2 n^{2\gamma}(\log n)^2}{2\left(s_{ni}n^{2\gamma} + \bar K k n^{\gamma}\log n\right)}\right) = o(n^{-1-2\varepsilon}),$$
for large $k$. Therefore, by the Bonferroni inequality,
$$\Pr\left[\max_{1\le i\le n}|v_{ni}| > k\frac{\log n}{n^{\gamma}}\,\Bigg|\,X_i\right] \le \sum_{i=1}^n \Pr\left[|v_{ni}| > k\frac{\log n}{n^{\gamma}}\,\Bigg|\,X_i\right] = o(n^{-2\varepsilon})$$
with probability one, as required. The argument for (29) is a straightforward consequence of the smoothness assumptions. The argument for (30) is similar to that for (28), except that we must use a truncation argument to compensate for the unboundedness of $\epsilon_j$. Define $\epsilon'_j = \epsilon_j 1(|\epsilon_j| \le n^{\kappa})$ and $\epsilon''_j = \epsilon_j 1(|\epsilon_j| > n^{\kappa})$. We show that
$$\Pr\left[\max_{1\le i\le n}\left|\sum_j w_{ij}\{\epsilon'_j - E(\epsilon'_j)\}\right| > k\frac{\log n}{n^{\gamma}}\right] = o(n^{-2\varepsilon}) \qquad (57)$$
$$\Pr\left[\max_{1\le i\le n}\left|\sum_j w_{ij}\{\epsilon''_j - E(\epsilon''_j)\}\right| > k\frac{\log n}{n^{\gamma}}\right] = o(n^{-2\varepsilon}). \qquad (58)$$
Note that by Lemma 3, $\max_{1\le i\le n}|w_{ij}| \le K n^{-2\gamma}$ for some positive finite constant $K$. Therefore, using the Markov inequality,
$$\Pr\left[\max_{1\le i\le n}\left|\sum_j w_{ij}\{\epsilon''_j - E(\epsilon''_j)\}\right| > k\frac{\log n}{n^{\gamma}}\right] \le \Pr\left[n^{-2\gamma}\sum_j\left(|\epsilon''_j| + E|\epsilon''_j|\right) > k\frac{\log n}{n^{\gamma}}\right] \le k^{-1}(\log n)^{-1}n^{-2\gamma}n^{1+\gamma}E|\epsilon''_j|$$
$$\le k^{-1}(\log n)^{-1}n^{1-\gamma}E\left[|\epsilon_j|\,1(|\epsilon_j| > n^{\kappa})\right] \le k^{-1}(\log n)^{-1}n^{1-\gamma}E|\epsilon_j|^{\ell+1}\,n^{-\kappa\ell} = o(n^{-2\varepsilon}),$$
provided $\ell \ge (1+\gamma)/\kappa$. Because the $\epsilon'_j$ are bounded (specifically, we have $|w_{ij}\epsilon'_j| \le K n^{\kappa-2\gamma}$), we can apply the exponential inequality. Therefore, we have, using the Bonferroni inequality,
$$\Pr\left[\max_{1\le i\le n}\left|\sum_j w_{ij}\{\epsilon'_j - E(\epsilon'_j)\}\right| > k\frac{\log n}{n^{\gamma}}\right] \le \sum_{i=1}^n \Pr\left[\left|\sum_j w_{ij}\{\epsilon'_j - E(\epsilon'_j)\}\right| > k\frac{\log n}{n^{\gamma}}\right] \le 2n\exp\left(-\frac{k^2 n^{-2\gamma}(\log n)^2}{2\left(\bar v n^{-2\gamma} + kKn^{\kappa-2\gamma}\,n^{-\gamma}\log n\right)}\right) = o(n^{-2\varepsilon}),$$
provided $k > 2K$.

Proof of Lemma 3. Define for each $\tau > 0$
$$k_{ni}(\tau) = \#\left\{j : \left|K\left(\frac{X_i - X_j}{h_{ni}}\right)\right| > \tau\right\}, \qquad k_n(\tau) = \max_{1\le i\le n} k_{ni}(\tau).$$
It suffices to show that for all $\tau$ with $0 < \tau \le \sup_u|K(u)|$, we have with probability one, as $n \to \infty$, $k_n(\tau) \le O(n^{2\gamma})$. Conditional on $X_i$,
$$k_{ni}(\tau) = \sum_{j\ne i} 1\left\{\left|K\left(\frac{X_i - X_j}{h_{ni}}\right)\right| > \tau\right\} = \sum_{j\ne i} Z_{nij}$$
is a Binomial random variable with parameters $n - 1$ and $p_{ni}$, and hence has conditional mean $\mu_{ni} = E[k_{ni}(\tau)\mid X_i] = (n-1)p_{ni}$ and variance $\sigma^2_{ni} = \mathrm{var}[k_{ni}(\tau)\mid X_i] = (n-1)p_{ni}q_{ni}$, with $q_{ni} = 1 - p_{ni}$. By the Markov inequality and assumptions A4 and A5,
$$p_{ni} = \Pr\left[\left|K\left(\frac{X_i - X_j}{h_{ni}}\right)\right| > \tau\,\Bigg|\,X_i\right] \le \frac{E\left[\left|K\left(\frac{X_i - X_j}{h_{ni}}\right)\right|\,\Big|\,X_i\right]}{\tau} = \frac{h_{ni}^d\int|K(u)|\,f(X_i - uh_{ni})\,du}{\tau} \le \frac{h_{ni}^d\int|K(u)|\,du\,\sup_x f(x)}{\tau} = O(n^{-d/(2q+d)})$$
with probability one. We now apply the exponential inequality to $k_{ni}(\tau) - E\{k_{ni}(\tau)\mid X_i\}$, noting that each $Z_{nij}$ is bounded by one. Therefore, for any $\eta > 0$,
$$\Pr\left[\left|k_{ni}(\tau) - E\{k_{ni}(\tau)\mid X_i\}\right| \ge \eta n^{2\gamma}\right] \le 2\exp\left(-\frac{\eta^2 n^{4\gamma}}{2\left((n-1)p_{ni}q_{ni} + \eta n^{2\gamma}\right)}\right) \le 2\exp(-kn^{2\gamma})$$
for some constant $k$. Therefore, by the Bonferroni inequality and identity of distribution, for large $k$,
$$\Pr\left[k_n(\tau) > kn^{2\gamma}\right] \le n\Pr\left[k_{ni}(\tau) > kn^{2\gamma}\right] \le n\Pr\left[E\{k_{ni}(\tau)\mid X_i\} > \frac k2 n^{2\gamma}\right] + n\Pr\left[\left|k_{ni}(\tau) - E\{k_{ni}(\tau)\mid X_i\}\right| > \frac k2 n^{2\gamma}\right] \le n\Pr\left[np_{ni} > \frac k2 n^{2\gamma}\right] + kn\exp(-kn^{2\gamma}),$$
using the triangle inequality. Finally, there exist $n_0$ and $k$ such that for all $n > n_0$,
$$\Pr\left[np_{ni} > \frac k2 n^{2\gamma}\right] = 0,$$
which implies that $\Pr[k_n(\tau) > kn^{2\gamma}] = o(n^{-2\varepsilon})$ as required.

We now turn to the proof of (32). We have
$$|w_{ij}| \le \sum_{\{a : |a|\le q\}}|w^a_{ij}| \le \frac{1}{nh^d_{ni}}\max_{\{a:|a|\le q\}}\left|e_1' M_{ni}^{-1} e_{\ell(a)}\right|\max_{\{a:|a|\le q\}}\left|K\left(\frac{X_i - X_j}{h_{ni}}\right)\left(\frac{X_i - X_j}{h_{ni}}\right)^a\right|$$
$$\le \frac{1}{nh^d_{ni}}\max_{\{a:|a|\le q\}}\left(e_1' M_{ni}^{-1} e_1\right)^{1/2}\left(e_{\ell(a)}' M_{ni}^{-1} e_{\ell(a)}\right)^{1/2}\max_{\{a:|a|\le q\}}\sup_u|K(u)u^a| \le \frac{1}{nh^d_{ni}}\,\lambda_{\max}(M_{ni}^{-1})\max_{\{a:|a|\le q\}}\sup_u|K(u)u^a|.$$
Finally, with probability one $\lambda_{\max}(M_{ni}^{-1}) = \lambda_{\max}(M_i^{-1}) + o(1)$, where the matrix $M_i$ is uniformly positive definite by A1 and A4. The proof of (33) follows similar lines and uses the fact that, uniformly in $i$, $\sum_{j\ne i}\left|K\left(\frac{X_i - X_j}{h_{ni}}\right)\left(\frac{X_i - X_j}{h_{ni}}\right)^a\right| = O(nh_n^d)$ with probability one for all vectors $a$.

Proof of Lemma 4. By Taylor expansion, assumptions A3, A7, and (34), we have
$$\widetilde\Omega^{-1} - \Omega^{-1} = -\frac1n\sum_{i=1}^n \Omega^{-1}\psi(Z_i)\,\Omega^{-1} + o_D(n^{-2\varepsilon}) \qquad (59)$$
$$\frac{\partial\rho}{\partial\theta}(Z_l,\widetilde\theta) - \frac{\partial\rho}{\partial\theta}(Z_l,\theta_0) = \sum_{\alpha=1}^p \frac{\partial^2\rho}{\partial\theta\,\partial\theta_\alpha}(Z_l,\theta_0)\left\{\frac1n\sum_{j=1}^n \psi_\alpha(Z_j)\right\} + o_D(n^{-2\varepsilon}), \qquad (60)$$
which establishes the order of the remainders. The magnitudes of $S_3$ and $S_4$ follow from Lemmas 1-3 using the arguments employed in Linton (1995, Lemmas 6-9), and the moments exist by A2.
Proof of Lemma 5. This follows by the same arguments as used in Lemma 4.
Proof of Lemma 6. The proof of (35)-(38) is based on the expansions of Lemmas 4-5 and on the facts that, under our conditions, $\Pr[|Z_\alpha| > k(\log n)^{1/2}] = o(n^{-2\varepsilon})$, $\Pr[|Z_{\alpha\beta}| > k(\log n)^{1/2}] = o(n^{-2\varepsilon})$, $\Pr[|Z_{\alpha\beta\gamma}| > k(\log n)^{1/2}] = o(n^{-2\varepsilon})$, and $\Pr[\sup_{n^{1/2}|\theta-\theta_0|\le k\log n}|s_{\alpha\beta\gamma\delta}(\theta)| > k(\log n)^{1/2}] = o(n^{-2\varepsilon})$, for $\alpha,\beta,\gamma,\delta = 1,\ldots,p$; see Bhattacharya and Ghosh (1978, 2.32). We just show the arguments for (38). Write
$$\frac{\partial^3\widehat s_\alpha}{\partial\theta_\beta\,\partial\theta_\gamma\,\partial\theta_\delta}(\theta) - \frac{\partial^3 s_\alpha}{\partial\theta_\beta\,\partial\theta_\gamma\,\partial\theta_\delta}(\theta) = \frac1n\sum_{\mu=1}^p\sum_{i=1}^n a_{\mu i}\,b_{\mu i}(\theta),$$
where $b_{\mu i}(\theta) = \partial^3\rho_\mu(Z_i,\theta)/\partial\theta_\beta\,\partial\theta_\gamma\,\partial\theta_\delta$ and $a_{\mu i}$ is the $\mu$th element of the vector $\widetilde D(X_i,\widetilde\theta)\,\widetilde\Omega^{-1} - D(X_i)\,\Omega^{-1}$ [$a_{\mu i}$ depends on the nonparametric estimates, but not on $\theta$]. Further, write $a_{\mu i} = \sum_{\ell=1}^4 a_{\ell\mu i}$, where $a_{1\mu i}$ is the corresponding element of $\{\widetilde D(X_i,\theta_0) - D(X_i)\}\Omega^{-1}$, $a_{2\mu i}$ the corresponding element of $\{\widetilde D(X_i,\widetilde\theta) - \widetilde D(X_i,\theta_0)\}\Omega^{-1}$, $a_{3\mu i}$ the corresponding element of $D(X_i)(\Omega^{-1} - \widetilde\Omega^{-1})$, and $a_{4\mu i}$ the corresponding element of $\{\widetilde D(X_i,\widetilde\theta) - D(X_i)\}(\Omega^{-1} - \widetilde\Omega^{-1})$. Then, using the Cauchy-Schwarz inequality, we have
$$\sup_{n^{1/2}|\theta-\theta_0|\le k\log n}\left|\frac{\partial^3\widehat s_\alpha}{\partial\theta_\beta\,\partial\theta_\gamma\,\partial\theta_\delta}(\theta) - \frac{\partial^3 s_\alpha}{\partial\theta_\beta\,\partial\theta_\gamma\,\partial\theta_\delta}(\theta)\right| \le \left\{\frac1n\sum_{i=1}^n t_i\right\}^{1/2}\sum_{\mu=1}^p\sum_{\ell=1}^4\left\{\frac1n\sum_{i=1}^n a_{\ell\mu i}^2\right\}^{1/2}, \qquad (61)$$
where $t_i = \max_\mu\sup_{n^{1/2}|\theta-\theta_0|\le k\log n}|b_{\mu i}(\theta)|^2$ are independent across $i$ and have finite moments of order greater than two. Therefore, we can apply standard moderate deviation results to $\frac{1}{\sqrt n}\sum_{i=1}^n\{t_i - E(t_i)\}$, such as in Michel (1974), to conclude that
$$\frac1n\sum_{i=1}^n t_i = \frac1n\sum_{i=1}^n E(t_i) + o_D\big(n^{-(1/2-\varepsilon)};\,n^{-2\varepsilon}\big), \qquad (62)$$
i.e., the term $\{\frac1n\sum_{i=1}^n t_i\}^{1/2}$ in (61) can be treated as constant. Furthermore, we have
$$\left\{\frac1n\sum_{i=1}^n a_{1\mu i}^2\right\}^{1/2} \le k\left\{\max_{1\le i\le n}\max_{\beta,\gamma}|(V_{ni})_{\beta\gamma}| + \max_{1\le i\le n}\max_{\beta,\gamma}|(B_{ni})_{\beta\gamma}|\right\} = o_D\big(n^{-(\gamma-\varepsilon)};\,n^{-2\varepsilon}\big), \qquad (63)$$
by Lemma 2. Combining (62) and (63), we get that (38) is true. The term $\{\frac1n\sum_{i=1}^n a_{3\mu i}^2\}^{1/2}$ can be shown to be $o_D(n^{-(1/2-\varepsilon)};\,n^{-2\varepsilon})$ using standard techniques [it only depends on $\widetilde\Omega - \Omega$], while $\{\frac1n\sum_{i=1}^n a_{4\mu i}^2\}^{1/2}$ is of even smaller order.
Acknowledgements

I would like to thank Zhijie Xiao and Moto Shintani for excellent research assistance, Joel Horowitz, Yuichi Kitamura, Peter Phillips, and Tom Rothenberg for helpful comments, and Germán Aneiros for pointing out a mistake in an earlier draft of this paper. I would also like to thank three referees for extensive comments which greatly improved the paper. I am grateful to the National Science Foundation for financial support. Gauss programs which implement the bandwidth selection procedures proposed above are available from the author.
References

[1] Abramovitch, L. and K. Singh (1985): Edgeworth corrected pivotal statistics and the bootstrap, Annals of Statistics 13, 116-132.
[2] Amemiya, T. (1974): The non-linear two-stage least squares estimator, Journal of Econometrics 2, 105-110.
[3] Andrews, D.W.K. (1991): Heteroskedasticity and autocorrelation consistent covariance matrix estimation, Econometrica 59, 817-858.
[4] Bather, J. (1997): A conversation with Herman Chernoff, Statistical Science 11, 335-350.
[5] Bhattacharya, R.N. and J.K. Ghosh (1978): On the validity of the formal Edgeworth expansion, Annals of Statistics 6, 434-451.
[6] Bhattacharya, R.N. and R. Ranga Rao (1976): Normal Approximation and Asymptotic Expansions. Wiley, New York.
[7] Bickel, P.J. (1982): On adaptive estimation, Annals of Statistics 10, 647-671.
[8] Brown, B.W. and W.K. Newey (1996): Bootstrapping for GMM, preprint, Rice University.
[9] Carroll, R.J. and W. Härdle (1989): Second order effects in semiparametric weighted least squares regression, Statistics 2, 179-186.
[10] Chamberlain, G. (1987): Asymptotic efficiency in estimation with conditional moment restrictions, Journal of Econometrics 34, 305-334.
[11] Chandra, T.K. and J.K. Ghosh (1979): Valid asymptotic expansions for the likelihood ratio statistic and other perturbed chi-squared variables, Sankhya, Series A 41, 22-47.
[12] Davies, J.A. (1968): Convergence rates for probabilities of moderate deviations, The Annals of Mathematical Statistics 39, 2016-2028.
[13] De Jong, P. (1987): A central limit theorem for generalized quadratic forms, Probability Theory and Related Fields 75, 261-277.
[14] Fan, J. (1992): Design-adaptive nonparametric regression, Journal of the American Statistical Association 87, 998-1004.
[15] Fan, J. (1993): Local linear regression smoothers and their minimax efficiencies, The Annals of Statistics 21, 196-216.
[16] Fan, J. and I. Gijbels (1996): Local Polynomial Modelling and Its Applications. Chapman and Hall.
[17] Fan, Y. and O.B. Linton (1996): Consistent model specification tests: omitted variables and semiparametric functional forms, Econometrica 64, 865-890.
[18] Fan, Y. and O.B. Linton (1997): Some higher-order theory for a consistent nonparametric model specification test, Cowles Foundation Discussion Paper no. 1148.
[19] Götze, F. (1987): Approximations for multivariate U-statistics, Journal of Multivariate Analysis 22, 212-229.
[20] Gozalo, P. and O.B. Linton (1995): Using parametric information in nonparametric regression, Working Paper no. 95-40, Department of Economics, Brown University.
[21] Härdle, W., J. Hart, J.S. Marron, and A.B. Tsybakov (1992): Bandwidth choice for average derivative estimation, Journal of the American Statistical Association 87, 218-226.
[22] Härdle, W. and A.B. Tsybakov (1993): How sensitive are average derivatives?, Journal of Econometrics 58, 31-48.
[23] Härdle, W. and O.B. Linton (1994): Applied nonparametric methods, in The Handbook of Econometrics, vol. IV, eds. D.F. McFadden and R.F. Engle III. North Holland.
[24] Härdle, W. and J.S. Marron (1991): Bootstrap simultaneous error bars for nonparametric regression, Annals of Statistics 19, 778-796.
[25] Hall, P. (1992): The Bootstrap and Edgeworth Expansion. Springer-Verlag: Berlin.
[26] Hall, P. and J. Horowitz (1996): Bootstrap critical values for tests based on generalized method of moments estimation, Econometrica 64, 891-916.
[27] Horowitz, J. (1995): Bootstrap methods in econometrics: theory and numerical performance, in Advances in Economics and Econometrics: 7th World Congress, eds. D. Kreps and K.W. Wallis. Cambridge: Cambridge University Press, forthcoming.
[28] Horowitz, J. (1996a): Bootstrapping the smoothed maximum score estimator, mimeo, University of Iowa.
[29] Horowitz, J. (1996b): Bootstrapping in median regression, mimeo, University of Iowa.
[30] Hsieh, D.A. and C.F. Manski (1987): Monte Carlo evidence on adaptive maximum likelihood estimation of a regression, Annals of Statistics 15, 541-551.
[31] Jones, M.C., J.S. Marron, and S.J. Sheather (1992): Progress in data-based bandwidth selection for kernel density estimation, University of New South Wales working paper no. 92-04.
[32] Liang, H. (1994): The Berry-Esséen bounds of error variance estimation in semiparametric regression models, Communications in Statistics 23, 3439-3451.
[33] Liang, H. (1995): On Bahadur asymptotic efficiency of the maximum likelihood estimator for a generalized semiparametric model, Statistica Sinica 5, 363-371.
[34] Liang, H. and P. Cheng (1993): Second order asymptotic efficiency in a partially linear model, Statistics and Probability Letters 18, 73-84.
[35] Linton, O.B. (1995): Second order approximation in the partially linear regression model, Econometrica 63, 1079-1112.
[36] Linton, O.B. (1996a): Second order approximation in a linear regression with heteroskedasticity of unknown form, Econometric Reviews 15, 1-32.
[37] Linton, O.B. (1996b): Edgeworth approximation for MINPIN estimators in semiparametric regression models, Econometric Theory 12, 30-60.
[38] Linton, O.B. (1997): Second-order approximation for semiparametric instrumental variable estimators and test statistics, Cowles Foundation Discussion Paper no. 1151.
[39] Linton, O.B. and Z. Xiao (1997): Second order approximation in a semiparametric binary choice model, manuscript, Yale University.
[40] Masry, E. (1996a): Multivariate local polynomial regression for time series: uniform strong consistency and rates, Journal of Time Series Analysis 17, 571-599.
[41] Masry, E. (1996b): Multivariate regression estimation: local polynomial fitting for time series, Stochastic Processes and their Applications 65, 81-101.
[42] Michel, R. (1974): Results on probabilities of moderate deviations, The Annals of Probability 2, 349-353.
[43] Nagar, A.L. (1959): The bias and moment matrix of the general k-class estimator of the parameters in simultaneous equations, Econometrica 27, 575-595.
[44] Newey, W.K. (1986): Efficient estimation of models with conditional moment restrictions, mimeo, Princeton University.
[45] Newey, W.K. (1988): Adaptive estimation of regression models via moment restrictions, Journal of Econometrics.
[46] Newey, W.K. (1990): Efficient instrumental variables estimation of nonlinear models, Econometrica 58, 809-837.
[47] Newey, W.K. (1996): Optimal choice of instruments in nonlinear models, mimeo.
[48] Nishiyama, Y. and P.M. Robinson (1997): Edgeworth expansions for semiparametric averaged derivatives, manuscript, LSE.
[49] Pfanzagl, J. (1980): Asymptotic expansions in parametric statistical theory, in Developments in Statistics, vol. 3, ed. P.R. Krishnaiah. Academic Press.
[50] Phillips, P.C.B. and J.Y. Park (1988): On the formulation of Wald tests of nonlinear restrictions, Econometrica 56, 1065-1083.
[51] Powell, J.L. and T.M. Stoker (1996): Optimal bandwidth choice for density-weighted averages, Journal of Econometrics 75, 291-316.
[52] Rilstone, P., V.K. Srivastava, and A. Ullah (1996): The second-order bias and mean squared error of nonlinear estimators, Journal of Econometrics 75, 369-395.
[53] Robinson, P.M. (1987): Asymptotically efficient estimation in the presence of heteroscedasticity of unknown form, Econometrica 56, 875-891.
[54] Robinson, P.M. (1988a): The stochastic difference between econometric statistics, Econometrica 56, 531-548.
[55] Robinson, P.M. (1988b): Root-N-consistent semiparametric regression, Econometrica 56, 931-954.
[56] Robinson, P.M. (1991a): Automatic frequency domain inference on semiparametric and nonparametric models, Econometrica 59, 1329-1364.
[57] Robinson, P.M. (1991b): Best nonlinear three-stage least squares estimation of certain econometric models, Econometrica 59, 755-786.
[58] Robinson, P.M. (1995): The normal approximation for semiparametric averaged derivatives, Econometrica 63, 667-680.
[59] Rothenberg, T. (1984): Approximating the distributions of econometric estimators and test statistics, ch. 14 in Handbook of Econometrics, vol. 2, eds. Z. Griliches and M. Intriligator. North Holland.
[60] Rubin, H. and J. Sethuraman (1965): Probabilities of moderate deviation, Sankhya, Series A 27, 325-346.
[61] Silverman, B. (1986): Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
[62] Xiao, Z. and O.B. Linton (1997): Second order approximation for an adaptive estimator in a linear regression, manuscript, Yale University.
[63] Xiao, Z. and P.C.B. Phillips (1996): Higher order approximation for a frequency domain regression estimator, manuscript, Yale University.
D Figure Information

In each figure we give the rejection frequency as a function of the bandwidth. The solid line represents the kernel estimator and the dotted line the local linear estimator. The higher line is always the n = 100 case and the lower one the n = 200 case. The triangle symbol represents the logit rule-of-thumb, the square the quadratic probability model rule-of-thumb, and the circle the nonparametric plug-in. Solid symbols represent kernel and hollow symbols local linear. The leftmost symbols are for n = 100 and the rightmost symbols for n = 200. Figure 1 is the case c = 1 with null hypothesis (a); Figure 2 is the case c = 0 with the same null hypothesis; Figure 3 is hypothesis (b) with c = 1; and Figure 4 is the same null with c = 0. The letter A corresponds to the 10% nominal level, B to the 5% level, and C to the 1% level. Figure 5 shows kernel density estimates of $\widehat W$ compared with the same estimates applied to $\chi^2_1$ data, for the case corresponding to Figure 1 with kernel estimation and the nonparametric plug-in bandwidth, with n = 100.