A Computationally Efficient Oracle Estimator for Additive Nonparametric Regression with Bootstrap Confidence Intervals

Woocheol Kim, Oliver B. Linton, and Niklaus W. Hengartner

Abstract

This paper makes three contributions. First, we introduce a computationally efficient estimator for the component functions in additive nonparametric regression, exploiting a different motivation from the marginal integration estimator of Linton and Nielsen (1995). Our method provides a reduction in computation of order n, which is highly significant in practice. Second, we define an efficient estimator of the additive components by inserting the preliminary estimator into a backfitting algorithm but taking one step only, and establish that it is equivalent in various senses to the oracle estimator based on knowing the other components. Our two-step estimator is minimax superior to that considered in Opsomer and Ruppert (1997), due to its better bias. Third, we define a bootstrap algorithm for computing pointwise confidence intervals and show that it achieves the correct coverage.

Key Words: Instrumental variables; Kernel estimation; Marginal integration.

1. INTRODUCTION

It is common practice to study the association between multivariate covariates and responses via regression analysis. While nonparametric models for the conditional mean $m(x) = E(Y|X = x)$ are useful exploratory and diagnostic tools when $X$ is one-dimensional, they suffer from the curse of dimensionality [Härdle (1990) and Wand and Jones (1995)] in that their best possible convergence rate is $n^{q/(2q+d)}$, where $d$ is the dimension of $X$ and $m(\cdot)$ is $q$-times continuously differentiable. Additive regression models of the form

Woocheol Kim is a doctoral candidate in the Department of Economics, Yale University, New Haven, CT 06520 (E-mail: [email protected]). Oliver B. Linton is Professor in the Department of Economics, Yale University (E-mail: [email protected]). Niklaus Hengartner is Assistant Professor in the Department of Statistics, Yale University (E-mail: [email protected]).


$$m(x) = c + m_1(x_1) + m_2(x_2) + \cdots + m_d(x_d), \qquad (1)$$

with $x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d$, offer a compromise between the flexibility of a full nonparametric regression and reasonable asymptotic behaviour. In particular, in such additive regression models, the functions $m_j(\cdot)$ can be estimated with the one-dimensional rate of convergence -- e.g., $n^{q/(2q+1)}$ for $q$-times continuously differentiable functions -- regardless of $d$; see Stone (1985, 1986). The backfitting algorithm of Breiman and Friedman (1985), Buja, Hastie and Tibshirani (1989), and Hastie and Tibshirani (1990) is widely used to estimate the one-dimensional components $m_j(\cdot)$ and the regression function $m(\cdot)$. More recently, Newey (1994), Tjøstheim and Auestad (1994), and Linton and Nielsen (1995) independently introduced the alternative method of marginal integration to estimate $m_\ell(x_\ell)$ [see also earlier work by Auestad and Tjøstheim (1991)]. One advantage of the integration method is that its statistical properties are easier to describe; specifically, one can easily prove central limit theorems and give explicit expressions for the asymptotic bias and variance of the estimators. Hengartner (1997) provides the weakest set of conditions, showing that the curse of dimensionality does not necessarily operate. Extensions to generalized additive models, Linton and Härdle (1996), to transformation models, Linton, Chen, Wang, and Härdle (1997), and to hazard models, Nielsen and Linton (1997), are also readily analyzed using this estimation method.

The marginal integration estimator relies on the following heuristic: if the regression function is additive and if $Q$ is a $(d-1)$-dimensional probability measure, then $\int m(x)\,dQ(x_2, x_3, \ldots, x_d) = c_0 + m_1(x_1)$, where $c_0$ is a constant with respect to $x_1$. For identification, it is traditional to assume that $E\{m_\ell(X_\ell)\} = 0$ for $\ell = 1, \ldots, d$, in which case the constant is $E(Y) = c$ if the integrating measure is exactly the probability distribution of $X_2, \ldots, X_d$. Applied to data, one replaces $m$ by a $d$-dimensional pilot estimator $\hat m$ and uses some known measure to integrate with. The empirical marginal integration method is the most convenient in practice. Partition $X_j = (X_{1j}, X_{2j})$ and $x = (x_1, x_2)$, where $X_{1j}$ and $x_1$ are scalars, while $X_{2j}$ and $x_2$ are $(d-1)$-dimensional, and consider the multidimensional Nadaraya-Watson nonparametric regression estimator

$$\hat m(x) = \frac{1}{nh_o^d}\sum_{j=1}^n K\Big(\frac{x - X_j}{h_o}\Big)Y_j \Big/ \hat p(x), \qquad \hat p(x) = \frac{1}{nh_o^d}\sum_{l=1}^n K\Big(\frac{x - X_l}{h_o}\Big), \qquad (2)$$
where $K(t_1, \ldots, t_d) = \prod_{j=1}^d k(t_j)$ with $k$ a scalar kernel of order $q$, while $h_o$ is a scalar bandwidth. Here, $\hat p(x)$ is an estimator of the joint probability density $p(x)$. Then

$$\hat\alpha_1(x_1) = \frac{1}{n}\sum_{i=1}^n \hat m(x_1, X_{2i}) \qquad (3)$$
is the empirical marginal integration estimator as described in Linton and Nielsen (1995) and Chen, Härdle, Linton, and Severance-Lossin (1996). The latter paper showed that $\hat\alpha_1(x_1)$ converges in probability to

$$\alpha_1(x_1) = \int m(x_1, x_2)\,p_2(x_2)\,dx_2 = m_1(x_1) + c$$

when (1) holds, where $p_2(\cdot)$ is the density function of $X_2$. Assuming that $h_o = O(n^{-1/(2q+1)})$, the rate of convergence of $\hat\alpha_1(x_1)$ is $n^{q/(2q+1)}$, and its asymptotic distribution follows from the stochastic expansion
$$\hat\alpha_1(x_1) - \alpha_1(x_1) = \frac{1}{nh_o}\sum_{i=1}^n k\Big(\frac{x_1 - X_{1i}}{h_o}\Big)\frac{p_2(X_{2i})}{p(X_{1i}, X_{2i})}\,\varepsilon_i + b_n(x_1; h_o) + o_p(n^{-q/(2q+1)}), \qquad (4)$$

where $\varepsilon_i = Y_i - E(Y_i|X_i)$, while the term $b_n(x_1; h_o)$ is a deterministic bias of order $h_o^q$.
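Before turning to the estimator's drawbacks, here is a minimal NumPy sketch of the construction in (2)-(3) for $d = 2$; it is an illustration only, not the authors' implementation, and the Gaussian kernel, bandwidth, and data-generating choices are assumptions made for the example. Evaluating it at every design point requires the nested loops below, which is the source of the computational burden discussed next.

    import numpy as np

    def kern(u):
        # Gaussian kernel, an illustrative stand-in for the order-q kernel k
        return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

    def nw(x1, x2, X1, X2, Y, h):
        # two-dimensional Nadaraya-Watson pilot m-hat of equation (2)
        w = kern((x1 - X1) / h) * kern((x2 - X2) / h)
        return np.sum(w * Y) / np.sum(w)

    def marginal_integration(x1, X1, X2, Y, h):
        # empirical marginal integration estimator alpha-hat_1 of equation (3):
        # average the pilot over the observed X2's, holding the first argument at x1
        return np.mean([nw(x1, x2j, X1, X2, Y, h) for x2j in X2])

    # usage: evaluating alpha-hat_1 at all n design points costs of order n^3 operations
    rng = np.random.default_rng(0)
    n = 200
    X1, X2 = rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n)
    Y = np.sin(X1) + np.sin(X2) + 0.1 * rng.standard_normal(n)
    alpha1_hat = np.array([marginal_integration(x, X1, X2, Y, h=0.5) for x in X1])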

There are two main disadvantages of the integration estimator. First, it is perhaps even more time consuming to compute than the backfitting estimator. To evaluate $\hat\alpha_1(\cdot)$ at the observations $X_{1i}$, $i = 1, \ldots, n$, one must compute $n^2$ regression smooths evaluated at the pairs $(X_{1i}, X_{2j})$, $i, j = 1, \ldots, n$, each of which requires $O(nh_o)$ operations when the kernel $K(\cdot)$ has compact support. Thus, $\{\hat\alpha_1(X_{1i})\}_{i=1}^n$ takes $O(n^3 h_o)$ operations to compute. The required number of operations for many other choices of pilot estimators, including the local linear regression smoother, is of the same order. For the backfitting estimator, computational time is $O(n^2 h_o r)$ -- where $r$ is the number of iterations required to achieve convergence -- which generally requires fewer operations. The second disadvantage of the integration estimator is that it is statistically inefficient. A reasonable standard against which to judge any estimate of the component $m_1(x_1)$ is that provided by the so-called oracle estimator, which is the corresponding one-dimensional smoother of the (unobserved) data $Y_i - c - \sum_{j=2}^d m_j(X_{ji})$ against $X_{1i}$. By this standard, the integration estimator is inefficient, and very much so for correlated data. Linton (1996) considers the hybrid procedure of applying one backfitting iteration to a pilot integration estimator of the additive components. If the pilot estimator undersmooths in all the variables, then this two-step estimator is itself asymptotically normal with the same asymptotic variance and bias as the oracle estimator [undersmoothing is needed to eliminate the bias in the pilot estimator, but plays no role in determining the variance]. Recently, Opsomer and Ruppert (1997) provided conditional mean squared error expansions for one version -- in fact, the exact solution of an $nd$ by $nd$ system of equations based on local linear smoothing -- of the 'backfitting' estimator. Although their procedure has optimal, i.e. oracle, variance, its bias is not oracle, and is not even design adaptive, despite the fact that their procedure is based on local linear estimation throughout. Apart from anything else, this shows that the differences between alternative implementations of the 'backfitting' idea are as significant as, say, the difference between Nadaraya-Watson kernel and local linear regression smoothing estimation.

In this paper, we suggest a state-of-the-art implementation of an efficient estimator of $m_1(\cdot)$. We first introduce a computationally convenient and consistent, but inefficient, estimator of $m_j(\cdot)$,
$j = 2, \ldots, d$. We give an alternative motivation of our initial estimator as being of the instrumental variable type. It takes only $O(n^2 h_o)$ operations to compute. This reduction of $O(n)$ operations is practically significant when implementing computer intensive methods such as the bootstrap or cross-validation. We then define a two-step estimator by plugging the initial estimator into a backfitting-like iteration. The efficient estimator takes just twice as many operations as the initial estimator. We establish the equivalence of the two-step efficient estimator and the oracle estimator in various senses: in probability, with probability one, and in distribution with rate, pointwise as well as uniformly. This allows us to address many of the practical questions concerning implementation of the additive estimator, because we argue that the host of results established for one-dimensional regression can now be applied with impunity to our feasible procedure. Finally, we define a bootstrap confidence interval and establish its large sample coverage.

For presentational simplicity we shall maintain that the additive structure (1) holds. Furthermore, in contradistinction to previous work, for example Linton and Härdle (1996) who use bias reduction in the directions not of interest (at least when the dimensionality is large), we use a common bandwidth $h_o$ and kernel $k$ of order $q$, where $q$ is an even integer, for each direction. Therefore, our conditions below require that $q$ is related to $d$: specifically, when $d > 4$ we must have $q > 2$. Of course, better performance can be achieved by using multiple kernels and multiple bandwidths, but this adds what we consider unnecessary complication.

This paper is organized as follows. Section 2 introduces and motivates a computationally efficient pilot estimator for the additive components, which is used in Section 3 to produce an oracle estimator for the additive components. Bootstrapping the oracle estimator to construct confidence intervals for the individual components is studied in Section 4. Section 5 presents a simulation study of the finite sample performance of the bootstrap algorithm and a rule-of-thumb bandwidth selection method. The proofs are given in the Appendix.

2. A FAST INSTRUMENTAL VARIABLE PILOT ESTIMATOR

We give an alternative heuristic for estimating the additive model in place of the usual integration idea presented in the introduction. We would really like to apply just a one-dimensional smoother, so consider the nonparametric regression of $Y$ on $X_1$:

$$E(Y|X_1 = x_1) = c + m_1(x_1) + E[m_2(X_2)|X_1 = x_1].$$

Unfortunately, $E(Y|X_1 = x_1)$ is a biased estimator of $c + m_1(x_1)$ due to the presence of the unknown function $E[m_2(X_2)|X_1 = x_1]$ on the right hand side. Now suppose we can find some 'instrument'
function $w(x_1, x_2)$ [see Angrist, Imbens and Rubin (1996), and the discussion, for an interesting recent discussion of the econometric concept of instrument and some background literature] for which

$$E[w(X_1, X_2)|X_1 = x_1] = 1, \qquad E[w(X_1, X_2)m_2(X_2)|X_1 = x_1] = 0 \qquad (5)$$

with probability one; then

$$E[w(X_1, X_2)Y|X_1 = x_1] = c + m_1(x_1), \qquad (6)$$

as required. In fact, $w(x_1, x_2) = p_1(x_1)p_2(x_2)/p(x_1, x_2)$ is such a function, because
$$\int w(x_1, x_2)\,\frac{p(x_1, x_2)}{p_1(x_1)}\,dx_2 = \int p_2(x_2)\,dx_2 = 1, \qquad \int w(x_1, x_2)\,m_2(x_2)\,\frac{p(x_1, x_2)}{p_1(x_1)}\,dx_2 = \int m_2(x_2)\,p_2(x_2)\,dx_2 = 0.$$

In practice, we must replace $w(x_1, x_2)$ by an estimate and replace the population conditional expectation by a regression smoother. Thus we can compute a variety of one-dimensional smooths of $\hat w(X_{1j}, X_{2j})Y_j$ on $X_{1j}$, where $\hat w(X_{1j}, X_{2j}) = \hat p_1(X_{1j})\hat p_2(X_{2j})/\hat p(X_{1j}, X_{2j})$, including kernel or local polynomial. We propose the somewhat simpler version of the kernel estimate that eliminates the superfluous explicit estimation of $\hat p_1$,
pi(x )  pi(x ; ho ) = nh1 b

1

1

b

1

n

X



1

1

1



b2

b

=1

2

1

2

1

1

b

b1 2

1

2

1

b2

2

2

b

b1

1 b2

2

b

1

b

2

X

1



1

b

1



b

1

2

=1

Actually, Jones, Davies and Park (1994) only considered the version where pb is replaced by p and its relation to the local linear smoother. 1

5

rather than the Nadaraya-Watson: by interchanging the orders of summation, we obtain

$$\hat\alpha_1^{pi}(X_{1i}; h_o) = \frac{1}{nh_o}\sum_{j=1}^n k\Big(\frac{X_{1i} - X_{1j}}{h_o}\Big)\frac{\hat p_2(X_{2j})}{\hat p(X_{1j}, X_{2j})}\,Y_j$$
$$= \frac{1}{nh_o}\sum_{j=1}^n k\Big(\frac{X_{1i} - X_{1j}}{h_o}\Big)\frac{Y_j}{\hat p(X_{1j}, X_{2j})}\,\frac{1}{nh_o^{d-1}}\sum_{k=1}^n K\Big(\frac{X_{2k} - X_{2j}}{h_o}\Big)$$
$$= \frac{1}{n}\sum_{k=1}^n\bigg\{\frac{1}{nh_o^d}\sum_{j=1}^n k\Big(\frac{X_{1i} - X_{1j}}{h_o}\Big)K\Big(\frac{X_{2k} - X_{2j}}{h_o}\Big)\frac{Y_j}{\hat p(X_{1j}, X_{2j})}\bigg\}$$
$$= \frac{1}{n}\sum_{k=1}^n \widetilde m(X_{1i}, X_{2k}),$$
provided the kernel $k$ is symmetric about zero. Here, $K(s_1, \ldots, s_{d-1}) = \prod_{\ell=1}^{d-1}k(s_\ell)$ is a $(d-1)$-dimensional kernel.

Our procedure (7) is computationally convenient -- it involves just simple smoother matrices and takes only order $n^2$ smoothing operations. It also does not require evaluation of either the density or the regression function outside of the joint support. It has very similar asymptotic properties to the empirical marginal integration estimator -- in particular, it converges in distribution at rate $n^{q/(2q+1)}$ to a normal random variable. We will show later how to improve on the efficiency of this estimate. Define $D^q g(x_1, \ldots, x_d) = \sum_{\ell=1}^d \partial^q g(x_1, \ldots, x_d)/\partial x_\ell^q$ for any positive integer $q$.
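As an illustration of why (7) is cheap, the following sketch (again with a Gaussian kernel, $d = 2$, and illustrative names; not the authors' code) builds the two density estimates from $n\times n$ kernel matrices and then performs a single univariate smooth of the reweighted responses, so the whole pilot costs $O(n^2)$ operations rather than the $O(n^3)$ of the integration estimator.

    import numpy as np

    def kern(u):
        # Gaussian kernel used for illustration; the paper's conditions assume a
        # compactly supported kernel k of order q
        return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

    def iv_pilot(x1_grid, X1, X2, Y, h):
        # alpha-hat_1^pi of equation (7) for d = 2:
        # weight each Y_j by p-hat_2(X_2j) / p-hat(X_1j, X_2j), then smooth on X_1
        n = len(Y)
        K1 = kern((X1[:, None] - X1[None, :]) / h)      # K1[j, l] = k((X1j - X1l)/h)
        K2 = kern((X2[:, None] - X2[None, :]) / h)
        p2_hat = K2.mean(axis=1) / h                    # p-hat_2(X_2j)
        p_hat = (K1 * K2).mean(axis=1) / h**2           # p-hat(X_1j, X_2j)
        adj = p2_hat / p_hat * Y                        # instrumented responses
        W = kern((x1_grid[:, None] - X1[None, :]) / h)  # evaluation weights
        return W @ adj / (n * h)

    # usage: a single O(n^2) pass gives the pilot at all design points
    rng = np.random.default_rng(1)
    n = 200
    X1, X2 = rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n)
    Y = np.sin(X1) + np.sin(X2) + 0.1 * rng.standard_normal(n)
    alpha1_pi = iv_pilot(X1, X1, X2, Y, h=0.5)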

Theorem 1. Suppose the conditions given in the Appendix hold, and that $h_o = \gamma_o n^{-1/(2q+1)}$. Then,

$$n^{q/(2q+1)}\big[\hat\alpha_1^{pi}(x_1) - \alpha_1(x_1)\big] \longrightarrow N\big[b_1(x_1),\, v_1^2(x_1)\big]$$

in distribution, where

$$b_1(x_1) = \frac{\gamma_o^q}{q!}\bigg\{\mu_q(k)\,D^q m_1(x_1) + \mu_q(K)\int m(x)\,D^q p_2(x_2)\,dx_2 - \mu_q(K)\int \frac{m(x)\,p_2(x_2)}{p(x)}\,D^q p(x)\,dx_2\bigg\},$$
$$v_1^2(x_1) = \gamma_o^{-1}\,\|k\|_2^2\int \frac{p_2^2(x_2)}{p(x)}\big\{\sigma^2(x) + m^2(x)\big\}\,dx_2,$$

with $\sigma^2(x) = \mathrm{var}(Y|X = x)$, while $\mu_q(k) = \int k(t)t^q\,dt$ and $\|k\|_2^2 = \int k(t)^2\,dt$.
Remarks

1. This estimator has an additional factor in the variance relative to the marginal integration estimator of Linton and Nielsen (1995), and is less efficient.

2. By using more bias reduction in the directions not of interest, as in Linton and Härdle (1996), one can eliminate some terms from the bias; specifically, those terms involving derivatives with respect to the directions not of interest are not present.

3. To estimate $m_1(x_1)$, we can use either of the following recentred estimates

$$\hat m_1^{cd}(x_1) = \hat\alpha_1^{pi}(x_1) - \overline Y \qquad\text{or}\qquad \hat m_1^{ce}(x_1) = \hat\alpha_1^{pi}(x_1) - \frac{1}{n}\sum_{i=1}^n \hat\alpha_1^{pi}(X_{1i}).$$
Since $\overline Y = n^{-1}\sum_{i=1}^n Y_i = c + O_p(n^{-1/2})$, the bias and variance of $\hat m_1^{cd}(x_1)$ are the same as those of $\hat\alpha_1^{pi}(x_1)$. However, this is not the case for $\hat m_1^{ce}(x_1)$: while the variance is the same as for $\hat\alpha_1^{pi}(x_1)$, the bias of $\hat m_1^{ce}(x_1)$ is $b_1(x_1) - \int b_1(x_1)p_1(x_1)\,dx_1$. Although $\hat m_1^{ce}(\cdot)$ has (sample) mean zero by construction, $\hat m_1^{cd}(\cdot)$ does not.
4. To estimate $m(x)$, we let

$$\widetilde m^{cd}(x) = \overline Y + \sum_{j=1}^d \hat m_j^{cd}(x_j), \qquad \widetilde m^{ce}(x) = \frac{1}{nd}\sum_{j=1}^d\sum_{i=1}^n \hat\alpha_j^{pi}(X_{ji}) + \sum_{j=1}^d \hat m_j^{ce}(x_j).$$

It is easy to see that the bias of $\widetilde m^{cd}(x)$ is the sum of the biases of $\hat\alpha_j^{pi}(x_j)$, while the asymptotic bias of $\widetilde m^{ce}(x)$ is $\sum_{j=1}^d b_j(x_j) - \frac{d-1}{d}\sum_{j=1}^d\int b_j(x_j)p_j(x_j)\,dx_j$. The estimates $\hat m_j^{\ell}(x_j)$ and $\hat m_k^{\ell}(x_k)$ are asymptotically uncorrelated for any $j \neq k$ [this is true for both $\ell = cd$ and $\ell = ce$], so that the asymptotic variance of $\widetilde m^{\ell}(x)$ is just the sum of the asymptotic variances of $\hat m_j^{\ell}(x_j)$.
3. ORACLE ESTIMATION OF ADDITIVE COMPONENTS

Suppose an oracle tells us all but one of the one-dimensional components $\{m_\ell(x_\ell) : \ell \neq j\}$. To estimate $m_j(x_j)$, construct $Y_{ji}^{oracle} = Y_i - \sum_{\ell\neq j} m_\ell(X_{\ell i}) - c$, and consider $\hat m_j^{oracle}(x_j)$, a univariate local polynomial smoother of the pairs $(X_{ji}, Y_{ji}^{oracle})$, i.e., $\hat m_j^{oracle}(x_j) = \hat a_0(x_j)$, where $\hat a_0(x_j), \hat a_1(x_j), \ldots, \hat a_{q-1}(x_j)$ solve

$$\min_{a_0, a_1, \ldots, a_{q-1}} \sum_{i=1}^n k\Big(\frac{x_j - X_{ji}}{h}\Big)\Big[Y_{ji}^{oracle} - \sum_{\ell=0}^{q-1} a_\ell (X_{ji} - x_j)^\ell\Big]^2, \qquad (8)$$

with $h = a_o n^{-1/(2q+1)}$ a bandwidth sequence. Under the conditions given in Fan (1992, 1993),

$$n^{q/(2q+1)}\big\{\hat m_j^{oracle}(x_j) - m_j(x_j)\big\} \longrightarrow N\big[\widetilde b_j(x_j),\, \widetilde v_j^2(x_j)\big]$$

in distribution, where

$$\widetilde b_j(x_j) = a_o^q\,\frac{\bar\mu_q(k)}{q!}\,D^q m_j(x_j), \qquad \widetilde v_j^2(x_j) = a_o^{-1}\,\bar\nu_q(k)\,\frac{\sigma_j^2(x_j)}{p_j(x_j)},$$
in which $\sigma_j^2(x_j) = \mathrm{var}(Y|X_j = x_j)$, while $\bar\mu_q(k)$ and $\bar\nu_q(k)$ are certain constants derived from $k$ -- see Fan and Gijbels (1992) for elucidation, and see Fan (1993) for a discussion of the minimax efficiency of this procedure. In fact, one can improve slightly on this method by using the fact that $\int m_j(x_j)p_j(x_j)\,dx_j = 0$. Let $\hat m_j^{c,oracle}(x_j) = \hat m_j^{oracle}(x_j) - n^{-1}\sum_{i=1}^n \hat m_j^{oracle}(X_{ji})$; then

$$n^{q/(2q+1)}\big\{\hat m_j^{c,oracle}(x_j) - m_j(x_j)\big\} \longrightarrow N\big[\widetilde b_j^c(x_j),\, \widetilde v_j^{c\,2}(x_j)\big]$$

in distribution, where $\widetilde v_j^{c\,2}(x_j) = \widetilde v_j^2(x_j)$, and

$$\widetilde b_j^c(x_j) = a_o^q\,\frac{\bar\mu_q(k)}{q!}\Big\{D^q m_j(x_j) - \int D^q m_j(x_j)\,p_j(x_j)\,dx_j\Big\}.$$

Note that $\hat m_j^{c,oracle}(\cdot)$ is more efficient than $\hat m_j^{oracle}(\cdot)$ according to integrated mean squared error, because

$$\mathrm{var}_{X_j}\{D^q m_j(X_j)\} \leq E_{X_j}\big[\{D^q m_j(X_j)\}^2\big].$$
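To make the oracle benchmark (8) concrete, the following is a small weighted-least-squares sketch of a local polynomial of degree $q-1$ at a point; the local-linear choice $q = 2$, the Gaussian weights, and the simulated data are assumptions used only for illustration.

    import numpy as np

    def local_poly(x0, X, Yo, h, q=2):
        # local polynomial fit of degree q-1 at x0, as in (8); returns a-hat_0(x0)
        u = X - x0
        w = np.exp(-0.5 * (u / h) ** 2)              # illustrative Gaussian weights
        B = np.vander(u, N=q, increasing=True)       # columns 1, u, ..., u^(q-1)
        WB = w[:, None] * B
        coef = np.linalg.solve(B.T @ WB, WB.T @ Yo)  # weighted least squares
        return coef[0]

    # oracle response: subtract the *true* remaining components (known to the oracle)
    rng = np.random.default_rng(2)
    n = 200
    X1, X2 = rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n)
    Y = np.sin(X1) + np.sin(X2) + 0.1 * rng.standard_normal(n)
    Y_oracle = Y - np.sin(X2)                        # Y_i - m_2(X_2i) - c, here c = 0
    m1_oracle = np.array([local_poly(x, X1, Y_oracle, h=0.4) for x in X1])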

We call $\widetilde m_j(x_j)$ an oracle estimator for $m_j(x_j)$ if it behaves (asymptotically) like $\hat m_j^{oracle}(x_j)$. While in parametric regression oracle estimators for the slope parameter only exist when the covariate $X_j$ is uncorrelated with $\{X_\ell : \ell \neq j\}$, we show that for additive regression, applying one step of the backfitting algorithm to an integration estimator produces an oracle estimator. Specifically, denote by $\hat\alpha_\ell^{pi}(x_\ell; h_o)$ the integration estimator of the one-dimensional components described in Section 2 with bandwidth $h_o$, and set

$$Y_{ji}^{2\text{-}step} = Y_i - \sum_{\ell\neq j}\hat\alpha_\ell^{pi}(X_{\ell i}; h_o) + (d-1)\overline Y.$$

Theorem 2 below states that, with appropriate choices for the bandwidths $h_o$ and $h$, $\hat m_j^{2\text{-}step}(x_j; h, h_o) = \hat a_0(x_j)$, in which $\hat a_0(x_j), \hat a_1(x_j), \ldots, \hat a_{q-1}(x_j)$ solve

$$\min_{a_0, a_1, \ldots, a_{q-1}} \sum_{i=1}^n k\Big(\frac{x_j - X_{ji}}{h}\Big)\Big[Y_{ji}^{2\text{-}step} - \sum_{\ell=0}^{q-1} a_\ell(X_{ji} - x_j)^\ell\Big]^2\,\mathbf 1(X_i \in S_{on}), \qquad (9)$$

is an oracle estimator for $m_j(x_j)$. Here, $S_{on}$ is the intersection of the trimmed one-dimensional supports, i.e.,

$$S_{on} = \Big\{x \in \mathbb{R}^d : \underline b_j + h_o \leq x_j \leq \overline b_j - h_o,\ j = 1, \ldots, d\Big\},$$

and where $\underline b_j, \overline b_j$ are the lower and upper bounds of the support of $X_{ji}$ [we have assumed a rectangular support for simplicity]. This trimming is for technical reasons. Our initial estimator

$\sum_{\ell\neq j}\hat\alpha_\ell^{pi}(X_{\ell i}; h_o) - (d-1)\overline Y$ has poor boundary bias behaviour, specifically boundary bias of order $h_o$ compared with interior bias of order $h_o^q$. The trimming eliminates the observations at the boundary at no first order cost, provided the bandwidth $h_o$ converges to zero.
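A minimal sketch of the two-step construction for $d = 2$ (an illustration, not the authors' code): the adjusted responses $Y^{2\text{-}step}$ are built from the fast pilot of Section 2 and then passed through the same kind of local polynomial smoother as the oracle fit; the bandwidths, the Gaussian kernel, and the crude trimming rule below are assumptions for the example.

    import numpy as np

    kern = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # illustrative kernel

    def iv_pilot_at_data(X_own, X_other, Y, h):
        # fast pilot alpha-hat^pi of (7), evaluated at the data points (d = 2)
        K1 = kern((X_own[:, None] - X_own[None, :]) / h)
        K2 = kern((X_other[:, None] - X_other[None, :]) / h)
        p_other = K2.mean(axis=1) / h
        p_joint = (K1 * K2).mean(axis=1) / h**2
        return K1 @ (p_other / p_joint * Y) / (len(Y) * h)

    def locpoly0(x0, X, Yr, h, q=2):
        # degree q-1 local polynomial intercept at x0, same form as the oracle fit (8)
        u = X - x0
        w = np.exp(-0.5 * (u / h) ** 2)
        B = np.vander(u, N=q, increasing=True)
        return np.linalg.solve(B.T @ (w[:, None] * B), B.T @ (w * Yr))[0]

    rng = np.random.default_rng(3)
    n = 200
    X1, X2 = rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n)
    Y = np.sin(X1) + np.sin(X2) + 0.1 * rng.standard_normal(n)

    h0, h = 0.3, 0.4                       # pilot bandwidth h0 of smaller order than h
    alpha2_pi = iv_pilot_at_data(X2, X1, Y, h0)     # pilot for the nuisance direction
    Y_2step = Y - alpha2_pi + Y.mean()              # Y_i - sum_{l != 1} alpha-hat_l + (d-1)*Ybar
    keep = np.abs(X1) <= np.pi - h0                 # crude stand-in for the trimming set S_on
    m1_2step = np.array([locpoly0(x, X1[keep], Y_2step[keep], h) for x in X1[keep]])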

Theorem 2. Suppose the conditions in the Appendix hold, that $h_o = o(n^{-1/(2q+1)})$, and that $h = a_o n^{-1/(2q+1)}$ for some $a_o > 0$. Then, for all $\eta > 0$,

$$\Pr\Big[n^{q/(2q+1)}\big|\hat m_j^{2\text{-}step}(x_j; h, h_o) - \hat m_j^{oracle}(x_j; h)\big| > \eta \,\Big|\, \mathcal X_n\Big] \longrightarrow 0$$

with probability one, as $n \to \infty$. Here, $\mathcal X_n = \{X_1, \ldots, X_n\}$.
A similar theorem for a marginal integration estimator with an externally adjusted pilot can be found in Linton (1996). The immediate reason for this result is that $\hat m_j^{2\text{-}step}$ is asymptotically independent of $\hat m_k^{2\text{-}step}$ for all $k \neq j$; or rather, the correlation between these two random variables is of order $n^{-1}$, which is of smaller order than the variance of either. Thus, $\hat m_j^{2\text{-}step}$ inherits, to first order, the asymptotic properties of $\hat m_j^{oracle}(x_j)$. The reason for the low correlation can be understood from the simpler problem of comparing two marginal smooths on variables $X_j$ and $X_k$: the marginal kernel windows contain $O(nh)$ observations, while the intersection of the marginal windows [which determines the correlation between the two estimators] contains $O(nh^2)$ observations.
Remarks.

1. Under slightly stronger conditions -- specifically $E(|Y|^s) < \infty$ for some $s > 2$ -- it is true that

$$n^{q/(2q+1)}\sup_{x_j}\big|\hat m_j^{2\text{-}step}(x_j; h, h_o) - \hat m_j^{oracle}(x_j; h)\big| \longrightarrow 0$$

with probability one [the supremum is taken over any compact set contained in the support of $X_j$]; see Masry (1996). Combining this with the well known optimal rate of uniform convergence for $\hat m_j^{oracle}$, i.e., $n^{q/(2q+1)}/\log n$, gives the rate for $\hat m_j^{2\text{-}step}$.
2. If one uses the same bandwidth $h = h_o = a_o n^{-1/(2q+1)}$ in both (7) and (9), then

$$n^{q/(2q+1)}\big\{\hat m_j^{2\text{-}step}(x_j) - \hat m_j^{oracle}(x_j)\big\} \longrightarrow -\sum_{k\neq j}\int b_k(x_k)\,\frac{p_{j,k}(x_j, x_k)}{p_j(x_j)}\,dx_k$$

in probability, where $p_{j,k}$ is the joint density of $X_j$ and $X_k$. Hence, $\hat m_j^{2\text{-}step}(x_j)$ has the same variance as $\hat m_j^{oracle}(x_j)$, but has bias

$$\widetilde b_j(x_j) - \sum_{k\neq j}\int b_k(x_k)\,\frac{p_{j,k}(x_j, x_k)}{p_j(x_j)}\,dx_k.$$
This procedure is evidently inferior, according to a minimax criterion, to the case where $h_o/h \to 0$. We conjecture that by iterating to infinity, with $h = h_o$, one obtains the same bias as when $h_o/h \to 0$. Similar calculations can be performed for the centred estimates as well as for the estimates of $m(x)$.

4. BOOTSTRAP CONFIDENCE INTERVALS

The bootstrap distribution of studentized estimators usually improves upon the Normal approximation by capturing the first term of the Edgeworth expansion. In this section, we investigate the bootstrap for the two-stage estimator $\hat m_j(x_j; h, h_o)$ introduced in the previous section, where $h_o$ is the bandwidth of the pilot estimator $\hat\alpha^{pi}(x_j)$ and $h$ is the bandwidth of the local polynomial regression in the second step. Define the additive reconstruction

$$\hat m^{add}(x; h, h_o) = \overline Y + \sum_{\ell=1}^d \hat m_\ell(x_\ell; h, h_o)$$

and the residuals

$$\hat e_i = Y_i - \hat m^{add}(X_i; h, h_o), \quad i = 1, \ldots, n.$$

The naive bootstrap, which samples with replacement from the centred residuals, leads to a bootstrap distribution with a different bias than that of the estimator. As a remedy, Härdle and Marron (1991) adapt an idea of Wu (1986) and propose drawing $\varepsilon_1^*, \varepsilon_2^*, \ldots, \varepsilon_n^*$ independently from the two point distributions $dG_i = \gamma\,\delta_{a_i} + (1-\gamma)\,\delta_{b_i}$, where $\delta$ is the Dirac delta function, $a_i = \widetilde e_i(1-\sqrt 5)/2$, $b_i = \widetilde e_i(1+\sqrt 5)/2$, with $\widetilde e_i = \hat e_i - \sum_j \hat e_j/n$ and $\gamma = (5+\sqrt 5)/10$ [so that $\varepsilon_i^*$ has mean zero and variance $\widetilde e_i^2$], to construct the bootstrap sample

$$Y_j^* = \hat m^{add}(X_j; g, h_o) + \varepsilon_j^*.$$
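A minimal sketch of the two-point residual draw just described; the golden-section constants follow the displayed formulas, while the additive fit $\hat m^{add}$ is taken as given (the placeholder fitted values below are assumptions for illustration).

    import numpy as np

    def wild_bootstrap_errors(residuals, rng):
        # draw eps_i^* from the two-point distribution of Haerdle and Marron (1991):
        # value a_i with probability gamma, b_i otherwise; mean 0, variance e_i^2
        e = residuals - residuals.mean()            # centred residuals
        a = e * (1 - np.sqrt(5)) / 2                # taken with probability gamma
        b = e * (1 + np.sqrt(5)) / 2                # taken with probability 1 - gamma
        gamma = (5 + np.sqrt(5)) / 10
        pick_a = rng.uniform(size=e.shape) < gamma
        return np.where(pick_a, a, b)

    # usage: rebuild a bootstrap sample around an additive fit at the design points
    # (m_add_g stands in for m-hat^add(X_i; g, h_o), computed with the bandwidth g)
    rng = np.random.default_rng(4)
    Y = np.array([0.2, -0.1, 0.4, 0.0, 0.3])
    m_add_g = np.array([0.15, -0.05, 0.35, 0.05, 0.25])   # placeholder fitted values
    eps_star = wild_bootstrap_errors(Y - m_add_g, rng)
    Y_star = m_add_g + eps_star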

To see why yet another bandwidth $g$ is introduced for constructing the bootstrap sample, observe that $\hat m_\ell(x_\ell; h, h_o)$ and its bootstrap counterpart $\hat m_\ell^*(x_\ell; h, h_o)$ have the same variances, but that their biases are

$$E\big[\hat m_\ell(x_\ell; h, h_o) - m_\ell(x_\ell)\big] = h^q\,C(K)\,D^q m_\ell(x_\ell) + o(h^q)$$

and

$$E^*\big[\hat m_\ell^*(x_\ell; h, h_o) - \hat m_\ell(x_\ell; g, h_o)\big] = h^q\,C(K)\,D^q \hat m_\ell(x_\ell; g, h_o) + o(h^q),$$

respectively, where $C(K)$ is a constant depending on the kernel. Here, the expectation $E^*$ is understood to be conditional on the data. Hence, the asymptotic distributions of

$$\sqrt{nh}\,\big\{\hat m_\ell^*(x_\ell; h, h_o) - \hat m_\ell(x_\ell; g, h_o)\big\} \qquad\text{and}\qquad \sqrt{nh}\,\big\{\hat m_\ell(x_\ell; h, h_o) - m_\ell(x_\ell)\big\}$$

will agree if $D^q \hat m_\ell(x_\ell; g, h_o)$ converges to $D^q m_\ell(x_\ell)$. A bandwidth $g(n)$ of larger order than $n^{-1/(2q+1)}$ is needed to ensure this. Our analysis relies on a strengthening of Theorem 2 to show that the bootstrap distributions of the two-step and the oracle estimators are asymptotically equivalent. Properties of the bootstrap of the two-step estimator then follow from established theorems for one-dimensional nonparametric regression, see Härdle and Marron (1991).
Theorem 3. In addition to the conditions in the Appendix, suppose that $E(|Y|^s) < \infty$ for some $s > 2$, and that $h_o^q\sqrt{nh} \to 0$. Then, there exists a random variable $a_1(\mathcal X_n)$ that is positive and finite with a probability tending to one, such that

$$\Pr\Big[n^{q/(2q+1)}\big|\hat m_1^{2\text{-}step}(x_1) - \hat m_1^{oracle}(x_1)\big| > \eta_n \,\Big|\, \mathcal X_n\Big] \leq a_1\,\eta_n^{-s}\,n^{-s/(2(2q+1))}. \qquad (10)$$
We have, using standard arguments [see Hall (1992, p. 292) for example], that for any $\eta_n \to 0$,

$$G(z; h, h_o) = \Pr\Big[n^{q/(2q+1)}\big\{\hat m_j^{2\text{-}step}(x_j; h, h_o) - m_j(x_j)\big\} \leq z \,\Big|\, \mathcal X_n\Big]$$
$$= \Pr\Big[n^{q/(2q+1)}\big\{\hat m_j^{oracle}(x_j; h, h_o) - m_j(x_j)\big\} \leq z \,\Big|\, \mathcal X_n\Big] + O(\eta_n)$$
$$\quad + O\Big(\Pr\Big[n^{q/(2q+1)}\big|\hat m_j^{2\text{-}step}(x_j; h, h_o) - \hat m_j^{oracle}(x_j; h)\big| > \eta_n \,\Big|\, \mathcal X_n\Big]\Big).$$

Applying the theorem with $\eta_n = 1/\log n$, for example, the remainders are $o(1)$ for $s \geq 2$. A similar approximation holds for the bootstrap distribution

$$G^*(z; h, h_o, g) = \Pr^*\Big[n^{q/(2q+1)}\big\{\hat m_j^*(x_j; h, h_o) - \hat m_j(x_j; g, h_o)\big\} \leq z\Big]$$
$$= \Pr^*\Big[n^{q/(2q+1)}\big\{\hat m_j^{oracle,*}(x_j; h, h_o) - \hat m_j^{oracle}(x_j; g, h_o)\big\} \leq z\Big] + o(1),$$
where $\Pr^*$ is computed conditionally on the data. Consistency of the bootstrap distribution for the two-stage estimator then follows from the consistency of the bootstrap of the oracle estimator.

Theorem 4. Suppose that, in addition to the assumptions set forth in the Appendix, $h_o = o(n^{-1/(2q+1)})$, $H_n = [\underline a\,n^{-1/(2q+1)}, \bar a\,n^{-1/(2q+1)}]$, and $G_n = [n^{-1/(2q+1)+\delta}, n^{-\delta}]$, where $0 < \delta < 1/(2q+1)$, while $0 < \underline a \leq \bar a < \infty$. Then,

$$\sup_{g\in G_n}\sup_{h\in H_n}\big|G^*(z; h_o, h, g) - G(z; h_o, h)\big| \longrightarrow 0$$

with probability one.

Remark. The uniformity over $H_n$ and $G_n$ implies that the bootstrap of estimators with data-dependent bandwidths is consistent.
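Putting Theorems 3 and 4 to work requires only the obvious Monte Carlo loop. The sketch below shows how a pointwise interval at a point $x_0$ could be formed from the bootstrap quantiles; the routines `fit_m1` and `fit_additive` are hypothetical caller-supplied stand-ins for the two-step component estimator and the additive fit, and the "basic bootstrap" form of the interval is one reasonable choice, not a prescription from the paper.

    import numpy as np

    def bootstrap_ci(x0, X, Y, fit_m1, fit_additive, g, h, n_boot=200, level=0.95, seed=0):
        # resample with the two-point residual draw around the g-bandwidth additive fit,
        # re-estimate with bandwidth h, and read off quantiles of the recentred draws
        rng = np.random.default_rng(seed)
        m_add_g = fit_additive(X, Y, g)              # m-hat^add(X_i; g, h_o)
        centre_g = fit_m1(x0, X, Y, g)               # m-hat_1(x0; g, h_o)
        centre_h = fit_m1(x0, X, Y, h)               # m-hat_1(x0; h, h_o)
        e = Y - m_add_g
        e = e - e.mean()
        a, b = e * (1 - np.sqrt(5)) / 2, e * (1 + np.sqrt(5)) / 2
        gamma = (5 + np.sqrt(5)) / 10
        draws = np.empty(n_boot)
        for r in range(n_boot):
            eps = np.where(rng.uniform(size=len(Y)) < gamma, a, b)
            draws[r] = fit_m1(x0, X, m_add_g + eps, h) - centre_g
        lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
        return centre_h - hi, centre_h - lo          # basic bootstrap interval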

5. NUMERICAL RESULTS

We investigate the bootstrap confidence intervals on simulated data. We took the following setup:

$$Y_i = \sin(X_{1i}) + \sin(X_{2i}) + \sigma\varepsilon_i, \quad i = 1, \ldots, n,$$

where $X_i$ is uniformly distributed on $[-\pi, \pi]^2$ and $\varepsilon_i$ is standard normal independent of $X_i$, while $\sigma = 0.1$. We computed the estimator using the following matrix commands, which make for speedy processing in programs like Gauss or Matlab. Defining the $n\times n$ smoother matrices

$$[S_1]_{i,j} = \frac{1}{nh_1}\,k\Big(\frac{X_{1i} - X_{1j}}{h_1}\Big), \qquad [S_2]_{i,j} = \frac{1}{nh_2}\,k\Big(\frac{X_{2i} - X_{2j}}{h_2}\Big),$$

we can write the $n\times 1$ vector of estimates $\hat\alpha_1 = (\hat\alpha_1(X_{11}), \ldots, \hat\alpha_1(X_{1n}))^T$ as

$$\hat\alpha_1 = S_1\big\{y \odot (S_2 i)\,{./}\,(n(S_1 \odot S_2)i)\big\},$$

in which $\odot$ and $./$ denote matrix Hadamard product and division respectively, while $i = (1, \ldots, 1)^T$ and $y = (Y_1, \ldots, Y_n)^T$. By comparison, the marginal integration estimate involves $n$ smoother matrices. We also computed the efficient smoothers $\widetilde m_1$ and $\widetilde m_2$, using $\hat\alpha_1$ and $\hat\alpha_2$ as starting values, i.e.,

$$\widetilde m_1 = S_1\widetilde y_1\,{./}\,(S_1 i), \qquad \widetilde y_1 = y - \hat\alpha_2 + \overline Y\,i.$$

We estimated the regression function vector $m = (m(X_1), \ldots, m(X_n))^T$ by $\hat m = \hat\alpha_1 + \hat\alpha_2 - \overline Y\,i$ and by the efficient version $\widetilde m^e = \widetilde m_1 + \widetilde m_2 - \overline Y\,i$. We took $h_j = 0.5\,\hat\sigma_{X_j}\,n^{-1/5}$, $j = 1, 2$, as the bandwidths for estimation, where $\hat\sigma_{X_j}^2 = n^{-1}\sum_{i=1}^n(X_{ji} - \overline X_j)^2$, and we examined a grid of bootstrap bandwidths $g = c\,\hat\sigma_{X_j}\,n^{-1/9}$, for $c$ in the range 0.1-2.0, using $n_b = 200$ bootstrap samples. Below we show the estimation results from a typical sample along with bootstrap confidence intervals.
*** Figure 1 Here ***
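For reference, the matrix recipe just displayed translates directly into NumPy; this is a sketch under illustrative choices (Gaussian kernel, fixed bandwidths), not the authors' GAUSS code.

    import numpy as np

    def smoother_matrix(X, h):
        # [S]_{ij} = k((X_i - X_j)/h) / (n h), with a Gaussian k for illustration
        n = len(X)
        U = (X[:, None] - X[None, :]) / h
        return np.exp(-0.5 * U**2) / np.sqrt(2 * np.pi) / (n * h)

    rng = np.random.default_rng(5)
    n = 100
    X1, X2 = rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n)
    y = np.sin(X1) + np.sin(X2) + 0.1 * rng.standard_normal(n)
    ones = np.ones(n)

    S1, S2 = smoother_matrix(X1, 0.5), smoother_matrix(X2, 0.5)
    alpha1 = S1 @ (y * (S2 @ ones) / (n * ((S1 * S2) @ ones)))  # fast pilot, as displayed
    alpha2 = S2 @ (y * (S1 @ ones) / (n * ((S2 * S1) @ ones)))
    m1_eff = S1 @ (y - alpha2 + y.mean()) / (S1 @ ones)         # one efficient backfitting step

The Hadamard products and elementwise divisions of the display become ordinary `*` and `/` on arrays, and all three vectors are obtained from two $n\times n$ matrices, in line with the operation counts discussed in Section 2.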

The preliminary estimator is badly biased, and the asymptotic confidence intervals seem way too narrow. The bootstrap bands seem to be wide enough in both cases. Finally, the efficient estimator has much smaller bias.

The tables give rejection frequencies for the 1%, 5%, and 10% level bootstrap confidence bands for the fast preliminary estimator, cFB, and the efficient estimator, cEB. The confidence interval is centred at a single random point [which changes from sample to sample] and there were 5000 replications.

***Table 1 and 2***

For cFA, i.e. the asymptotic confidence interval based on the fast additive smoother, the rejection frequencies are 0.5764, 0.6938, and 0.7498 respectively when $n = 50$, and 0.5446, 0.6740, and 0.7368 respectively when $n = 100$. We should point out that no attempt has been made to optimize the estimation bandwidths $h_j$ -- their specific values were chosen by their visual performance for the efficient estimator. No doubt, better performance could be achieved for the fast estimator by careful choice of bandwidth. The main conclusion from these simulations is that the efficient estimator in conjunction with the bootstrap method performs quite well, provided a good resampling bandwidth is chosen.

We now evaluate a practical method for selecting the bootstrap bandwidth based on an elaboration of Silverman's rule-of-thumb idea. More sophisticated bandwidth selection rules, such as those reviewed in Härdle (1990) and Jones, Marron, and Sheather (1996), can also be applied here, but our purpose here is just to investigate one very simple method. The approach is to specify a parametric model for the conditional distribution $L(Y_i|X_i, \theta)$ for the purpose of selecting $g$. One estimates $\theta$ by maximum likelihood and then generates samples $\{\widetilde Y_i\}_{i=1}^n$ from $L(Y_i|X_i, \hat\theta)$, which are used to perform a simulation experiment like we did above to determine an optimal bandwidth. Figure 2 below reports the results of such a method, taking a Gaussian quartic regression model as the conditional model, and compares it with the infeasible bandwidth selection method based on the true design. Everything else is as before. The 'estimated' trade-off quite closely mimics the 'actual' trade-off, although there is a tendency towards slight over-coverage of the interval by this method. This would presumably be reduced by taking a richer parametric model for the pilot.
*** Figure 2 Here ***
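A compact sketch of the rule-of-thumb recipe just described, under the illustrative assumption that the parametric pilot is a Gaussian additive-polynomial regression fitted by least squares (the maximum likelihood estimate under normality); the coverage experiment over the grid of constants c is the same loop as in the earlier bootstrap sketch and is omitted.

    import numpy as np

    def fit_parametric_pilot(X1, X2, Y, degree=4):
        # Gaussian additive-polynomial pilot: Y | X ~ N(poly(X1) + poly(X2), sigma^2)
        D = np.column_stack([X1**p for p in range(degree + 1)] +
                            [X2**p for p in range(1, degree + 1)])
        beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
        sigma = np.std(Y - D @ beta)
        return lambda rng: D @ beta + sigma * rng.standard_normal(len(Y))

    # usage: simulate pseudo-samples from the fitted pilot, rerun the coverage
    # experiment over the grid of constants c, and pick the bootstrap bandwidth g
    rng = np.random.default_rng(6)
    n = 100
    X1, X2 = rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n)
    Y = np.sin(X1) + np.sin(X2) + 0.1 * rng.standard_normal(n)
    simulate = fit_parametric_pilot(X1, X2, Y)
    Y_tilde = simulate(rng)          # one pseudo-sample {Y~_i} from L(Y|X, theta-hat)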

6. CONCLUDING REMARKS

Our work might be interpreted as supporting the use of the standard backfitting method of estimating additive models. However, we point out that there are some substantial differences between the asymptotic properties of different versions of the 'backfitting' method. The one such method that has been successfully analyzed from a statistical point of view [see Opsomer and Ruppert (1997)] differs from the oracle estimator in its bias, and in fact is not design adaptive -- except when the covariates are mutually independent -- even if local linear estimation is used throughout. By contrast, our two-step estimator has the same asymptotic variance and bias as the oracle estimator, i.e., it is design adaptive. It is thus minimax superior to the Opsomer and Ruppert (1997) procedure. Furthermore, our two-step estimator is much less computationally demanding than their method. Finally, while we require a stronger smoothness/dimensionality trade-off than they did, we have not restricted the dependence among the covariates as they did.

Acknowledgements

We would like to thank the National Science Foundation for financial support. We would also like to thank Jens Perch Nielsen and Arthur Lewbel for helpful comments. GAUSS programs are available at http://www.econ.yale.edu/~linton.

APPENDIX: REGULARITY CONDITIONS AND PROOFS

Let $\mathcal X_n = \{X_1, \ldots, X_n\}$. We use the following regularity conditions:

A1. The kernel $k$ is symmetric about zero and of order $q$, i.e., $\int k(u)u^j\,du = 0$, $j = 1, \ldots, q-1$. Furthermore, $k$ is supported on $[-1, 1]$, bounded, and Lipschitz continuous, i.e., there exists a finite constant $c$ such that $|k(u) - k(v)| \leq c|u - v|$ for all $u, v$.

A2. The functions $m(\cdot)$ and $p(\cdot)$ are $q$-times continuously differentiable in each direction, where $q \geq (d-1)/2$.

A3. The joint density $p(\cdot)$ is bounded away from zero and infinity on its compact support, which is $\prod_{j=1}^d[\underline b_j, \overline b_j]$.

A4. The conditional variance $\sigma^2(x) = \mathrm{var}(Y|X = x)$ is Lipschitz continuous, and is bounded away from zero and infinity.

Proof of Theorem 1. We start by decomposing the estimation error as

$$\hat\alpha_1^{pi}(x_1) - \alpha_1(x_1) = \frac{1}{n}\sum_{k=1}^n \frac{1}{h_0}k\Big(\frac{X_{1k}-x_1}{h_0}\Big)\frac{\hat p_2(X_{2k})}{\hat p(X_{1k},X_{2k})}\,Y_k - \alpha_1(x_1) = S_{1n} + S_{2n} + S_{3n},$$

where

$$S_{1n} \equiv \frac{1}{n}\sum_{k=1}^n \frac{1}{h_0}k\Big(\frac{X_{1k}-x_1}{h_0}\Big)\frac{\hat p_2(X_{2k})}{\hat p(X_{1k},X_{2k})}\,\varepsilon_k,$$
$$S_{2n} \equiv \frac{1}{n}\sum_{k=1}^n \frac{1}{h_0}k\Big(\frac{X_{1k}-x_1}{h_0}\Big)\frac{\hat p_2(X_{2k})}{\hat p(X_{1k},X_{2k})}\,[m_1(X_{1k}) - m_1(x_1)],$$
$$S_{3n} \equiv \frac{1}{n}\sum_{k=1}^n \frac{1}{h_0}k\Big(\frac{X_{1k}-x_1}{h_0}\Big)\frac{\hat p_2(X_{2k})}{\hat p(X_{1k},X_{2k})}\,m(x_1, X_{2k}) - \alpha_1(x_1).$$

(1) We first examine the term $S_{1n}$. Note that $E(S_{1n}|\mathcal X_n) = 0$, while

$$\mathrm{var}(S_{1n}|\mathcal X_n) = \frac{1}{nh_0}\,\frac{1}{nh_0}\sum_{k=1}^n k^2\Big(\frac{X_{1k}-x_1}{h_0}\Big)\frac{\hat p_2^2(X_{2k})}{\hat p^2(X_{1k},X_{2k})}\,\sigma^2(X_k) = \frac{1}{nh_0}\,\frac{1}{nh_0}\sum_{k=1}^n k^2\Big(\frac{X_{1k}-x_1}{h_0}\Big)\frac{p_2^2(X_{2k})}{p^2(X_{1k},X_{2k})}\,\sigma^2(X_k)\,[1 + o_p(1)],$$

by the uniform convergence of $\hat p$ to $p$ and $\hat p_2$ to $p_2$ [which follows under our conditions, see Masry (1996)]. Then, applying a law of large numbers and changing variables, we get that

$$v_{01}^2(x_1) \equiv \mathrm{var}\big(\sqrt{nh_0}\,S_{1n}\big) = \|k\|_2^2\int \frac{p_2^2(z_2)}{p(x_1,z_2)}\,\sigma^2(x_1,z_2)\,dz_2\,[1 + o(1)].$$

In fact, applying the Lindeberg Central Limit Theorem,

$$\sqrt{nh_0}\,S_{1n} \longrightarrow N\big[0,\, v_{01}^2(x_1)\big]$$

in distribution -- the Lindeberg condition is satisfied because $k(\cdot)$ has bounded support.

(2) $S_{2n}$: We approximate $S_{2n}$ by $\bar S_{2n} \equiv \frac{1}{n}\sum_{k=1}^n \frac{1}{h_0}k\big(\frac{X_{1k}-x_1}{h_0}\big)\frac{\bar p_2(X_{2k})}{\bar p(X_{1k},X_{2k})}[m_1(X_{1k}) - m_1(x_1)]$, with an error of $o_p(1/\sqrt{nh_0})$, where $\bar p(x_1,x_2) = E(\hat p(x_1,x_2)) = \int h_0^{-d}K\big(\frac{z-x}{h_0}\big)p(z)\,dz$ and $\bar p_2(x_2) = E(\hat p_2(x_2)) = \int h_0^{-(d-1)}K\big(\frac{z_2-x_2}{h_0}\big)p_2(z_2)\,dz_2$. By a Taylor expansion,

$$B_{2n} \equiv E(\bar S_{2n}) = \int \frac{1}{h_0}k\Big(\frac{z_1-x_1}{h_0}\Big)\frac{\bar p_2(z_2)}{\bar p(z_1,z_2)}\,[m_1(z_1) - m_1(x_1)]\,p(z_1,z_2)\,dz_1dz_2$$
$$= \int \frac{1}{h_0}k\Big(\frac{z_1-x_1}{h_0}\Big)\bar p_2(z_2)\,[m_1(z_1) - m_1(x_1)]\,dz_1dz_2 - \int \frac{1}{h_0}k\Big(\frac{z_1-x_1}{h_0}\Big)\frac{\bar p_2(z_2)}{\bar p(z_1,z_2)}\,[m_1(z_1) - m_1(x_1)]\,[\bar p(z_1,z_2) - p(z_1,z_2)]\,dz_1dz_2$$
$$= h_0^q\,\frac{\mu_q(k)}{q!}\,D^q m_1(x_1) + o(h_0^q).$$

The variance of $\bar S_{2n}$ is negligible, being of order $O(h_0/n)$. To see this, consider

$$\mathrm{var}(\bar S_{2n}) = \frac{1}{n}\int \frac{1}{h_0^2}k^2\Big(\frac{z_1-x_1}{h_0}\Big)\frac{p_2^2(z_2)}{p^2(z_1,z_2)}\,[m_1(z_1) - m_1(x_1)]^2\,p(z_1,z_2)\,dz_1dz_2 - \frac{1}{n}B_{2n}^2$$
$$= \frac{h_0}{n}\int k^2(u)\,\frac{p_2^2(z_2)}{p(x_1 - h_0u,\,z_2)}\,\big[m_1^{(1)}(\bar z_1)\big]^2 u^2\,du\,dz_2 + O\Big(\frac{h_0^{2q}}{n}\Big) = O\Big(\frac{h_0}{n}\Big),$$

where $\bar z_1$ is a point between $x_1$ and $x_1 - h_0u$.

(3) $S_{3n}$: Using the fact that

$$\frac{1}{n}\sum_{k=1}^n \frac{1}{h_0}k\Big(\frac{X_{1k}-x_1}{h_0}\Big)\Big\{\frac{\hat p_2(X_{2k})}{\hat p(X_{1k},X_{2k})} - \frac{\bar p_2(X_{2k})}{\bar p(X_{1k},X_{2k})}\Big\}\,m(x_1,X_{2k}) = o_p\Big(\frac{1}{\sqrt{nh_0}}\Big),$$

we get an approximation of $S_{3n}$:

$$S_{3n} = S^{o}_{3n} + \frac{1}{n}\sum_{k=1}^n m_2(X_{2k}) + o_p\Big(\frac{1}{\sqrt{nh_0}}\Big) = S^{o}_{3n} + O_p\Big(\frac{1}{\sqrt n}\Big) + o_p\Big(\frac{1}{\sqrt{nh_0}}\Big),$$

where $S^{o}_{3n} = \frac{1}{n}\sum_{k=1}^n\big\{\frac{1}{h_0}k\big(\frac{X_{1k}-x_1}{h_0}\big)\frac{\bar p_2(X_{2k})}{\bar p(X_{1k},X_{2k})} - 1\big\}\,m(x_1,X_{2k})$. The approximation in the second equation is a direct result of the Lindeberg Central Limit Theorem, which holds under our assumptions A2 and A3. Now write

$$S^{o}_{3n} = S''_{3n} + B_{3n} \equiv \frac{1}{n}\sum_{k=1}^n\tilde\zeta_k + \bar\zeta,$$

where $\zeta_k \equiv \big\{\frac{1}{h_0}k\big(\frac{X_{1k}-x_1}{h_0}\big)\frac{\bar p_2(X_{2k})}{\bar p(X_{1k},X_{2k})} - 1\big\}\,m(x_1,X_{2k})$, $\bar\zeta \equiv E(\zeta_k)$, and $\tilde\zeta_k \equiv \zeta_k - \bar\zeta$. A simple calculation gives the additional bias term as

$$B_{3n} = \int \frac{1}{h_0}k\Big(\frac{z_1-x_1}{h_0}\Big)\frac{\bar p_2(z_2)}{\bar p(z_1,z_2)}\,m(x_1,z_2)\,p(z_1,z_2)\,dz_1dz_2 - \alpha_1(x_1)$$
$$= h_0^q\,\frac{\mu_q(K)}{q!}\int D^q p_2(z_2)\,m(x_1,z_2)\,dz_2 - h_0^q\,\frac{\mu_q(K)}{q!}\int \frac{p_2(z_2)}{p(x_1,z_2)}\,m(x_1,z_2)\,D^q p(x_1,z_2)\,dz_2 + o(h_0^q).$$

Next, we turn to $S''_{3n} = \frac{1}{n}\sum_{k=1}^n\tilde\zeta_k$. It is mean zero by construction. Its variance is simply equal to $E\big(\frac{1}{n^2}\sum_{k=1}^n\tilde\zeta_k^2\big)$, since the covariance terms disappear due to the independence of $X_k$ and $X_l$. That is,

$$\mathrm{var}(S''_{3n}) = \frac{1}{n}E(\zeta_k^2) - \frac{1}{n}\bar\zeta^2 = \frac{1}{nh_0}\,\|k\|_2^2\int \frac{p_2^2(z_2)}{p(x_1,z_2)}\,m^2(x_1,z_2)\,dz_2 + O\Big(\frac{1}{n}\Big).$$

Hence, applying the Lindeberg CLT to $S''_{3n}$, it follows that

$$\sqrt{nh_0}\,S''_{3n} \longrightarrow N\Big(0,\ \|k\|_2^2\int \frac{m^2(x_1,z_2)\,p_2^2(z_2)}{p(x_1,z_2)}\,dz_2\Big)$$

in distribution. Finally, combining all the results above for $S_{1n}$, $B_{2n}$, $B_{3n}$ and $S''_{3n}$, together with $E(S_{1n}S''_{3n}) = E[S''_{3n}\,E(S_{1n}|\mathcal X_n)] = 0$, we get the asymptotic distribution of $\sqrt{nh_0}\,\big[\hat\alpha_1^{pi}(x_1) - \alpha_1(x_1)\big]$.
Proof of Theorem 2. To prove Theorem 2, we need to determine the equivalent kernel of the local polynomial regression. Such a representation is stated in Lemma 5, whose proof can be found in Wand and Jones (1995).

Lemma 5. The weights $W_{n,i}$ of the univariate local polynomial regression satisfy

$$W_{n,i}(h) = \frac{1}{nh}\,\frac{1}{p_1(x_1)}\,k^*\Big(\frac{x_1 - X_{1i}}{h}\Big)\,\{1 + o_p(1)\},$$

where $k^*(u) = \sum_{t=0}^{q-1} s^{1t}\,u^t\,k(u)$, with $s^{st} = [S^{-1}]_{st}$ and $[S]_{st} = \int u^{s+t-2}k(u)\,du$, and the equivalent kernel satisfies

$$\int k^*(u)\,du = 1, \qquad \int u^r k^*(u)\,du = 0 \quad\text{for } 0 < r \leq q-1.$$

We continue with the proof of Theorem 2. Define $\hat m_1^*(x_1) = \hat a_0(x_1)$, where $\hat a_0(x_1), \ldots, \hat a_{q-1}(x_1)$ minimize

$$\sum_{i=1}^n k\Big(\frac{x_1 - X_{1i}}{h}\Big)\Big[Y_i^{oracle} - \sum_{\ell=0}^{q-1}a_\ell(X_{1i} - x_1)^\ell\Big]^2\,\mathbf 1(X_i\in S_{on}).$$

Because the set $S_{on}$ contains of order $n(1 - h_o)$ observations,

$$\hat m_1^{oracle}(x_1) - \hat m_1^*(x_1) = O_p\Big(\frac{h_o}{1-h_o}\,(nh)^{-1/2}\Big) = o_p\big(n^{-q/(2q+1)}\big).$$

Whence it suffices to compare $\hat m_1^{2\text{-}step}(x_1)$ with $\hat m_1^*(x_1)$. We write

$$\hat m_1^{2\text{-}step}(x_1) - \hat m_1^*(x_1) = (d-1)\sum_{i=1}^n W_{n,i}(h)\,(\overline Y - c) - \sum_{j=2}^d\sum_{i=1}^n W_{n,i}(h)\,\big[\hat\alpha_j^{pi}(X_{ji}; h_o) - \alpha_j(X_{ji})\big],$$

where $W_{n,i}(h)$ are the weights of the local polynomial smoother. The first term is $(d-1)\sum_{i=1}^n W_{n,i}(h)(\overline Y - c) = O_p(n^{-1/2})$. From the result of Theorem 1 and Lemma 5, we get

$$\sum_{i=1}^n W_{n,i}(h)\,\big[\hat\alpha_j^{pi}(X_{ji}; h_o) - \alpha_j(X_{ji})\big] = \frac{1}{p_1(x_1)}\,\frac{1}{n}\sum_{i=1}^n \frac{1}{h}k^*\Big(\frac{X_{1i}-x_1}{h}\Big)\big[B_{2n}(X_{ji}) + B_{3n}(X_{ji})\big]$$
$$\quad + \frac{1}{p_1(x_1)}\,\frac{1}{n}\sum_{i=1}^n \frac{1}{h}k^*\Big(\frac{X_{1i}-x_1}{h}\Big)\big[S_{1n}(X_{ji}) + S''_{3n}(X_{ji})\big] + o_p\Big(\frac{1}{\sqrt{nh}}\Big) \equiv B^*_{n,j}(x_1) + V^*_{n,j}(x_1) + o_p\Big(\frac{1}{\sqrt{nh}}\Big).$$

Since $W_{n,i} = 0$ for $X_i\notin S_{on}$, only points away from the boundary contribute to the bias, which, by Theorem 1, is

$$B^*_{n,j}(x_1) = h_o^q\,\frac{1}{p_1(x_1)}\,\frac{1}{n}\sum_{i=1}^n \frac{1}{h}k^*\Big(\frac{X_{1i}-x_1}{h}\Big)\,B_n(X_{ji}) + o_p(h_o^q) = O_p(h_o^q),$$

where

$$B_n(X_{ji}) = \frac{\mu_q(k)}{q!}\,D^q m_j(X_{ji}) + \frac{\mu_q(K)}{q!}\int D^q p_{-j}(z_{-j})\,m(X_{ji}, z_{-j})\,dz_{-j} - \frac{\mu_q(K)}{q!}\int \frac{p_{-j}(z_{-j})}{p(X_{ji}, z_{-j})}\,m(X_{ji}, z_{-j})\,D^q p(X_{ji}, z_{-j})\,dz_{-j} = O(1).$$

To analyze the variance term, recall that

$$S_{1n}(X_{ji}) + S''_{3n}(X_{ji}) = \frac{1}{nh_0}\sum_{k=1}^n k\Big(\frac{X_{jk} - X_{ji}}{h_0}\Big)\frac{p_{-j}(X_{-jk})}{p(X_k)}\,\varepsilon_k + \frac{1}{n}\sum_{k=1}^n\tilde\zeta^j_{ik},$$

where $\tilde\zeta^j_{ik} = \zeta^j_{ik} - \bar\zeta^j_i$, with $\zeta^j_{ik} \equiv \big\{\frac{1}{h_0}k\big(\frac{X_{jk}-X_{ji}}{h_0}\big)\frac{p_{-j}(X_{-jk})}{p(X_{jk}, X_{-jk})} - 1\big\}\,m(X_{ji}, X_{-jk})$ and $\bar\zeta^j_i \equiv E_i(\zeta^j_{ik})$. That is,

$$V^*_{n,j}(x_1) = \frac{1}{p_1(x_1)}\,\frac{1}{n}\sum_{k=1}^n \hat p^*_{1;j}(x_1, X_{jk})\,\frac{p_{-j}(X_{-jk})}{p(X_k)}\,\varepsilon_k + \frac{1}{p_1(x_1)}\,\frac{1}{n^2}\sum_{i=1}^n\sum_{k=1}^n\frac{1}{h}k^*\Big(\frac{X_{1i}-x_1}{h}\Big)\,\tilde\zeta^j_{ik} \equiv V^*_{1n,j}(x_1) + V^*_{2n,j}(x_1),$$

where $\hat p^*_{1;j}(x_1, X_{jk}) = \frac{1}{n}\sum_{i=1}^n\frac{1}{h}k^*\big(\frac{X_{1i}-x_1}{h}\big)\frac{1}{h_0}k\big(\frac{X_{jk}-X_{ji}}{h_0}\big)$. The same argument as used for $S_{2n}$ in Theorem 1 gives

$$V^*_{1n,j}(x_1) = \frac{1}{p_1(x_1)}\,\frac{1}{n}\sum_{k=1}^n \hat p^*_{1;j}(x_1, X_{jk})\,\frac{p_{-j}(X_{-jk})}{p(X_k)}\,\varepsilon_k = O_p\Big(\frac{1}{\sqrt n}\Big).$$

It remains to show that

$$V^*_{2n,j}(x_1) = \frac{1}{p_1(x_1)}\,\frac{1}{n^2}\sum_{i=1}^n\sum_{k=1}^n\frac{1}{h}k^*\Big(\frac{X_{1i}-x_1}{h}\Big)\,\tilde\zeta^j_{ik} = O_p\Big(\frac{1}{\sqrt n}\Big).$$

Note that the diagonal terms satisfy $\frac{1}{p_1(x_1)}\frac{1}{n^2}\sum_{k=1}^n\frac{1}{h}k^*\big(\frac{X_{1k}-x_1}{h}\big)\tilde\zeta^j_{kk} = O_p\big(\frac{1}{n\sqrt{nh_0}}\big)$, for the same reason as in the treatment of $S''_{3n}$ in Theorem 1. Thus, we only need to check that the variance of $\frac{1}{p_1(x_1)}\frac{1}{n^2}\sum\sum_{i\neq k}\tilde\xi_{ik}$ is of smaller order than $O(\frac{1}{nh})$, where $\xi_{ik} \equiv \frac{1}{h}k^*\big(\frac{X_{1i}-x_1}{h}\big)\tilde\zeta^j_{ik}$, $\bar\xi_i \equiv E_i(\xi_{ik})$, $\tilde\xi_{ik} \equiv \xi_{ik} - \bar\xi_i$, and $\bar\xi \equiv E(\bar\xi_i)$. The variance of this double sum consists of the terms $\sum\sum\sum_{i\neq k\neq h}E(\tilde\xi_{ih}\tilde\xi_{kh})$, $\sum\sum_{i\neq k}E(\tilde\xi_{ik}\tilde\xi_{ki})$, $\sum\sum_{i\neq k}E(\tilde\xi_{ik}^2)$, $\sum\sum\sum_{i\neq k\neq h}E(\tilde\xi_{ik}\tilde\xi_{ih})$, and $\sum\sum\sum\sum_{i\neq k\neq h\neq m}E(\tilde\xi_{ik}\tilde\xi_{hm})$. The following direct computations complete the proof:

$$\frac{1}{n^2}E(\tilde\xi_{ik}\tilde\xi_{ki}) = \frac{1}{n^2}\Big[E(\xi_{ik}\xi_{ki}) - 2E(\bar\xi_i\xi_{ki}) + O(h_0^{2q})\Big] = O\Big(\frac{1}{n^2h}\Big),$$
$$\frac{1}{n^2}E(\tilde\xi_{im}\tilde\xi_{km}) = \frac{1}{n^2}\Big[E(\xi_{im}\xi_{km}) + O(h_0^{2q})\Big] = O\Big(\frac{1}{n^2}\Big),$$
$$\frac{1}{n^2}E(\tilde\xi_{ik}^2) = \frac{1}{n^2}\Big[E(\xi_{ik}^2) + O(h_0^{2q})\Big] = O\Big(\frac{1}{n^2hh_0}\Big).$$

Finally, $E(\tilde\xi_{ik}\tilde\xi_{im}) = E[E_i(\tilde\xi_{ik}\tilde\xi_{im})] = E[E_i(\tilde\xi_{ik})E_i(\tilde\xi_{im})] = 0$, and $E(\tilde\xi_{ik}\tilde\xi_{lm}) = E(\tilde\xi_{ik})E(\tilde\xi_{lm}) = 0$, which is immediate from the conditioning argument.

Proof of Theorem 3. First, note that, by the triangle inequality,

$$p^* \equiv \Pr\Big[n^{q/(2q+1)}\big|\hat m_1^{2\text{-}step}(x_1) - \hat m_1^{oracle}(x_1)\big| > \eta_n \,\Big|\, \mathcal X_n\Big]$$
$$\leq \Pr\Big[n^{q/(2q+1)}\Big|\hat m_1^{2\text{-}step}(x_1) - \hat m_1^{oracle}(x_1) - \sum_{j=2}^d B^*_{n,j}(x_1)\Big| > \eta_n - n^{q/(2q+1)}\Big|\sum_{j=2}^d B^*_{n,j}(x_1)\Big| \,\Big|\, \mathcal X_n\Big].$$

Applying the inequalities of Markov and Esseen, we get, for $s > 2$,

$$p^* \leq \frac{E\Big[n^{sq/(2q+1)}\big|\hat m_1^{2\text{-}step}(x_1) - \hat m_1^{oracle}(x_1) - \sum_{j=2}^d B^*_{n,j}(x_1)\big|^s \,\big|\, \mathcal X_n\Big]}{\big[\eta_n - n^{q/(2q+1)}\big|\sum_{j=2}^d B^*_{n,j}(x_1)\big|\big]^s}$$
$$= \eta_n^{-s}\,n^{sq/(2q+1)}\,E\Big[\big|\hat m_1^{2\text{-}step}(x_1) - \hat m_1^{oracle}(x_1) - \sum_{j=2}^d B^*_{n,j}(x_1)\big|^s \,\big|\, \mathcal X_n\Big]\,\Big\{1 - \eta_n^{-1}O\big(\sqrt{nh}\,h_o^q\big)\Big\}^{-s}$$
$$= a_1\,\eta_n^{-s}\,n^{-s/(2(2q+1))},$$

where we used $\sum_{j=2}^d B^*_{n,j}(x_1) = O_p(h_o^q)$ and $\hat m_1^{2\text{-}step}(x_1) - \hat m_1^{oracle}(x_1) - \sum_{j=2}^d B^*_{n,j}(x_1) = O_p(n^{-1/2})$ from Theorem 2, together with the condition $\sqrt{nh}\,h_o^q \to 0$.
REFERENCES

Angrist, J.D., Imbens, G.W., and Rubin, D.B. (1996). Identification of causal effects using instrumental variables (with discussion), Journal of the American Statistical Association, 91, 444-473.

Auestad, B. and Tjøstheim, D. (1991). Functional identification in nonlinear time series. In Nonparametric Functional Estimation and Related Topics, ed. G. Roussas, Kluwer Academic: Amsterdam, pp. 493-507.

Breiman, L. and Friedman, J.H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion), Journal of the American Statistical Association, 80, 580-619.

Buja, A., Hastie, T., and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion), Annals of Statistics, 17, 453-555.

Chen, R., Härdle, W., Linton, O.B., and Severance-Lossin, E. (1996). Estimation in additive nonparametric regression. Proceedings of the COMPSTAT Conference, Semmering, eds. W. Härdle and M. Schimek. Physika Verlag, Heidelberg.

Fan, J. (1992). Design-adaptive nonparametric regression, Journal of the American Statistical Association, 87, 998-1004.

Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies, Annals of Statistics, 21, 196-216.

Fan, J. and Gijbels, I. (1992). Variable bandwidth and local linear regression smoothers, Annals of Statistics, 20, 2008-2036.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag: Berlin.

Härdle, W. (1990). Applied Nonparametric Regression. Econometric Society Monograph Series 19. Cambridge University Press.

Härdle, W. and Marron, J.S. (1991). Bootstrap simultaneous error bars for nonparametric regression, Annals of Statistics, 19, 778-796.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.

Hengartner, N.W. (1996). Rate optimal estimation of additive regression via the integration method in the presence of many covariates. Preprint, Department of Statistics, Yale University. http://www.stat.yale.edu/Preprints.

Jones, M.C., Davies, S.J., and Park, B.U. (1994). Versions of kernel-type regression estimators. Journal of the American Statistical Association, 89, 825-832.

Jones, M.C., Marron, J.S., and Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91, 401-407.

Linton, O.B. (1996). Efficient estimation of additive nonparametric regression models. Biometrika, 84, 469-474.

Linton, O.B. and Härdle, W. (1996). Estimating additive regression models with known links. Biometrika, 83, 529-540.

Linton, O.B. and Nielsen, J.P. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika, 82, 93-100.

Linton, O.B., Wang, N., Chen, R., and Härdle, W. (1997). An analysis of transformation for additive nonparametric regression. Journal of the American Statistical Association, 92, 1512-1521.

Masry, E. (1996). Multivariate local polynomial regression for time series: uniform strong consistency and rates. Journal of Time Series Analysis, 17, 571-599.

Newey, W.K. (1994). Kernel estimation of partial means. Econometric Theory, 10, 233-253.

Nielsen, J.P. and Linton, O.B. (1997). Multiplicative marker dependent hazard estimation. Manuscript.

Opsomer, J.D. and Ruppert, D. (1997). On the existence and asymptotic properties of backfitting estimators. Annals of Statistics, 25, 186-211.

Stone, C.J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13, 685-705.

Stone, C.J. (1986). The dimensionality reduction principle for generalized additive models. Annals of Statistics, 14, 592-606.

Tjøstheim, D. and Auestad, B. (1994). Nonparametric identification of nonlinear time series: projections. Journal of the American Statistical Association, 89, 1398-1409.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.

Wu, C.F.J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis. Annals of Statistics, 14, 1261-1343.


Table 1: Rejection Frequency of Bootstrap Intervals (n = 50)

           ----------- cFB -----------    ----------- cEB -----------
   c         1%       5%       10%         1%       5%       10%
  0.10     0.5144   0.5882   0.6258       0.0010   0.0026   0.0050
  0.30     0.3960   0.4550   0.4960       0.0046   0.0096   0.0168
  0.50     0.2062   0.2598   0.3072       0.0180   0.0324   0.0470
  0.55     0.1754   0.2252   0.2712       0.0212   0.0400   0.0602
  0.60     0.1584   0.2008   0.2402       0.0286   0.0508   0.0694
  0.65     0.1394   0.1798   0.2180       0.0330   0.0588   0.0822
  0.75     0.1262   0.1656   0.1944       0.0538   0.0796   0.1004
  1.00     0.1308   0.1722   0.2108       0.0862   0.1088   0.1282
  2.00     0.4346   0.5208   0.5800       0.1584   0.1664   0.1712

NOTE: cFB denotes the rejection frequency of the bootstrap confidence interval for the preliminary fast additive smoother, while cEB denotes this quantity for the efficient fast additive smoother. The value c in the first column is the bootstrap bandwidth constant, i.e., the actual bandwidth is $g = c\,\hat\sigma_{X_j}\,n^{-1/9}$.

Table 2: Rejection Frequency of Bootstrap Intervals (n = 100)

           ----------- cFB -----------    ----------- cEB -----------
   c         1%       5%       10%         1%       5%       10%
  0.10     0.4838   0.5488   0.5898       0.0002   0.0006   0.0008
  0.30     0.3006   0.3662   0.4078       0.0026   0.0080   0.0118
  0.50     0.1278   0.1682   0.2012       0.0134   0.0274   0.0410
  0.55     0.1088   0.1476   0.1790       0.0170   0.0336   0.0484
  0.60     0.0998   0.1334   0.1614       0.0230   0.0420   0.0582
  0.65     0.0914   0.1216   0.1482       0.0288   0.0518   0.0700
  0.75     0.0848   0.1140   0.1394       0.0442   0.0668   0.0830
  1.00     0.0940   0.1204   0.1464       0.0652   0.0842   0.1002
  2.00     0.3658   0.4686   0.5362       0.0934   0.1004   0.1038

FIGURE CAPTIONS

Figure 1A shows the (preliminary) fast additive smoother $\hat\alpha_1$ along with 95% asymptotic confidence intervals. Figure 1B shows the same estimator with bootstrap confidence intervals. Figure 1C shows the efficient fast additive smoother $\widetilde m_1$ with bootstrap confidence intervals. In each figure, the true function is shown as a solid line.

Figure 2A shows the rejection frequency for the 1% confidence bands constructed with the bandwidth $g = c\,\hat\sigma_{X_j}\,n^{-1/9}$ as $c$ varies. This is the same information given in Table 1, and we use the term 'actual' to denote it. We also give the 'estimated' trade-off, which was obtained using the rule-of-thumb method described in the text. Figure 2B shows the same information for the 5% confidence bands, and Figure 2C shows the same information for the 10% confidence bands. We show both n = 50 [solid symbols] and n = 100 [hollow symbols].