Optimal rate multivariate local polynomial regression∗

Lijian Yang
Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, U.S.A.

Rolf Tschernig
Department of Quantitative Economics, University of Maastricht, Postbus 616, NL-6200 MD Maastricht, The Netherlands

This version: February 2002
Abstract

Local linear estimation of multivariate regression functions does not have the optimal rate of convergence if higher-order partial derivatives of the regression function exist. Local polynomial regression of an appropriate order is required to achieve the optimal rate of convergence. Since the number of terms of polynomial approximations increases drastically with the dimension of the regression function, this approach has little practical relevance for multivariate problems. We show that by using kernels of an appropriate higher order one can eliminate all cross terms and still obtain the optimal convergence rate. It is further shown that for third-order expansions and second-order kernels various schemes for eliminating subsets of cross terms exist that differ in their asymptotic bias but guarantee the convergence rate of local cubic estimators. The proposed estimators are illustrated in a Monte Carlo example.
Key words and phrases: higher-order kernel, local polynomial estimation, partial local polynomial estimation.
JEL-code: C14
1. Introduction

In this paper we are concerned with the nonparametric estimation of the general nonlinear regression model
$$ Y = m(X) + \sigma(X)\varepsilon \tag{1.1} $$
∗Acknowledgements: This research was supported in part by the Deutsche Forschungsgemeinschaft via Sonderforschungsbereich 373 “Quantifikation und Simulation Ökonomischer Prozesse” at Humboldt-Universität zu Berlin, and by NSF grant DMS 9971186.
where Y is a scalar dependent variable, X = (X_1, X_2, ..., X_d) is a vector of explanatory variables, and ε is i.i.d. white noise with E[ε] = 0 and Var(ε) = 1 which is independent of X. A popular tool for the nonparametric estimation of (1.1) is local polynomial estimation, proposed by Stone (1977), Tsybakov (1986), and Cleveland and Devlin (1988). More recently, the asymptotic bias and variance of multivariate local linear and local quadratic estimators were derived by Ruppert and Wand (1994), while the minimax efficiency of the univariate local linear estimator was established by Fan (1992). Plug-in selection of the optimal bandwidth is studied both theoretically and numerically in Ruppert, Sheather and Wand (1995) for univariate local polynomial regression, and in Yang and Tschernig (1999) for multivariate local linear regression. The latter has found applications in marketing research; see, for example, Heerde, Leeflang and Wittink (2001).

While univariate nonparametric regression has become popular, see for example the monograph of Fan and Gijbels (1996), its multivariate counterpart has not generated much interest among practitioners, due to the inaccuracy commonly referred to as the “curse of dimensionality”. Stone (1980, 1982) showed that if the (p+1)-th order derivatives of m(·) are Lipschitz continuous, the optimal rate of convergence of nonparametric estimators for m(·) is $O\left(n^{-(p+1)/(2p+2+d)}\right)$. Obviously, this rate goes to zero rather slowly if d is large and p is small. If one applies a local linear estimator to m(·), then one uses the derivatives of m(·) only up to order 2, i.e. p = 1. Hence, the local linear estimator suffers from the “curse of dimensionality” in high dimensions. Naturally, when m(·) has a higher order of smoothness (say p = 3, 5), higher-order local polynomial estimation is preferable. For multivariate regression functions m(·), however, this has not been applied in practice, since the number of terms in the Taylor approximation explodes with higher d or p; see Table 1 in Section 2. Yang and Tschernig (1999) observed that one may suppress some cross terms in the local cubic expansion without losing the asymptotic convergence rate. They therefore suggested a partial local cubic estimator for estimating second-order derivatives. This idea is expanded in the present paper to much more general settings.

A better-known approach to achieving the optimal rate of convergence is to use a Nadaraya-Watson estimator with a higher-order kernel. The idea of using higher-order kernels to eliminate asymptotic bias terms is not new to nonparametric estimation. Marron and Wand (1992) investigate the asymptotic as well as finite-sample properties of density estimators for normal mixture densities based on higher-order kernels as defined by Wand and Schucany (1990). For multivariate regression, however, using a Nadaraya-Watson estimator with a higher-order kernel leads to an asymptotic distribution which depends on the derivatives of the design density, and hence is not design-adaptive. As we show in this paper, this dependence can be avoided by blending the higher-order kernel approach with that of higher-order local polynomials. This allows one to achieve three objectives: design adaptivity, the optimal rate of convergence, and a small number of terms in the locally weighted least squares problem. Further, we derive the asymptotically optimal bandwidth vector for the proposed estimators. In a small Monte Carlo study we investigate the small sample performance of selected estimators.
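To see the curse of dimensionality in numbers, the following sketch (our own illustration; the function name is ours) tabulates the rate exponent (p+1)/(2p+2+d) from Stone's result for several combinations of d and p.

```python
# Rate exponent r(d, p) = (p + 1) / (2p + 2 + d) from Stone (1980, 1982):
# the optimal convergence rate of nonparametric estimators is n^{-r(d, p)}.

def rate_exponent(d: int, p: int) -> float:
    """Exponent of the optimal rate n^{-(p+1)/(2p+2+d)}."""
    return (p + 1) / (2 * p + 2 + d)

for d in (1, 3, 5, 10):
    for p in (1, 3, 5):
        print(f"d={d:2d}  p={p}  rate = n^(-{rate_exponent(d, p):.3f})")
# For d = 10 the exponent is only 2/14 ~ 0.143 at p = 1, whereas p = 5
# raises it to 6/22 ~ 0.273 -- the motivation for higher-order fits.
```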
The paper is organized as follows. In the following section we present several variants of local cubic estimators, with some or all cross terms eliminated, that all attain the convergence rate of local cubic estimation but differ in their asymptotic bias. In Section 3 we show that all cross terms can be eliminated if kernels of an appropriate higher order are used.
Table 1: Number of terms N(d, p) of the Taylor approximation for a d-dimensional function and order p

p\d      1      2      3      5     10     15
1        2      3      4      6     11     16
2        3      6     10     21     66    136
3        4     10     20     56    286    816
4        5     15     35    126   1001   3876
Section 4 contains formulas for obtaining the global or local asymptotically optimal bandwidths for the presented estimators. Section 5 discusses the asymptotic efficiency of the proposed estimators. The Monte Carlo example is presented in Section 6. All assumptions and proofs are in the Appendix.
2. Partial local cubic estimation

Let (X_i, Y_i), i = 1, 2, ..., n, be a sample following model (1.1). We assume here that the observations are i.i.d., but the method works as well if the X_i's are arithmetically β-mixing and strictly stationary. If the relevant derivatives exist, one may approximate the function m(·) for any fixed x = (x_1, ..., x_d) by a Taylor expansion of order p:
$$ m(z) \approx m(x) + \sum_{\lambda=1}^{p} \frac{1}{\lambda!} \sum_{\lambda_1+\cdots+\lambda_d=\lambda} \frac{\lambda!}{\lambda_1! \cdots \lambda_d!} \, \frac{\partial^{\lambda} m}{\partial^{\lambda_1} x_1 \cdots \partial^{\lambda_d} x_d}(x) \prod_{\alpha=1}^{d} (z_\alpha - x_\alpha)^{\lambda_\alpha}. \tag{2.1} $$
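The terms of this expansion can be enumerated mechanically. The following sketch (an illustration of ours, not part of the paper's method) lists the monomials beyond the constant for d = 2 and p = 3:

```python
from itertools import product

def multi_indices(d, lam):
    """All multi-indices (l_1, ..., l_d) with nonnegative entries summing to lam."""
    return [ix for ix in product(range(lam + 1), repeat=d) if sum(ix) == lam]

d, p = 2, 3
for lam in range(1, p + 1):
    for ix in multi_indices(d, lam):
        term = " * ".join(f"(z{a+1}-x{a+1})^{k}" for a, k in enumerate(ix) if k)
        print(f"order {lam}: {term}")
# 2 + 3 + 4 = 9 monomials beyond the constant, i.e. 10 terms in total
# for d = 2 and p = 3, matching the count N(d, p) derived next.
```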
It can be shown that the total number of different terms in (2.1) is given by
$$ N(d, p) = 1 + \sum_{\lambda=1}^{p} \binom{d+\lambda-1}{\lambda} = \frac{(d+p)!}{d!\,p!}. $$
This implies that for higher orders of approximation p the number of terms explodes with increasing d, since for a given order the number of terms grows as $O(d^p)$. Table 1 displays some examples for various dimensions d and orders p. Such higher-order polynomials therefore become intractable in practice even for moderate dimensions d. In this section we show that, for the case of a local cubic Taylor expansion, p = 3, one can suppress enough terms that the growth in d drops from $O(d^3)$ to $O(d^2)$ without changing the asymptotic distribution. Moreover, we show that one may further reduce the number of terms to $O(d)$ without affecting the rate of convergence, though at the cost of additional asymptotic bias terms.
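The count N(d, p) is easy to verify numerically; here is a minimal sketch reproducing Table 1 (the function name is ours):

```python
from math import comb

def n_terms(d: int, p: int) -> int:
    """Number of terms N(d, p) = (d + p)! / (d! p!) in a full order-p expansion."""
    return comb(d + p, p)

# Reproduce Table 1:
dims = (1, 2, 3, 5, 10, 15)
print("p\\d " + "".join(f"{d:>6d}" for d in dims))
for p in range(1, 5):
    print(f"{p:>3d} " + "".join(f"{n_terms(d, p):>6d}" for d in dims))
# e.g. N(10, 3) = 286 terms for a full local cubic fit in ten dimensions,
# versus O(d^2) columns once the suppressible cross terms are dropped.
```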
We will state these results in turn. Denote by
$$ Z_0 = \begin{pmatrix} 1 & \cdots & 1 \\ (X_{11}-x_1)/h_1 & \cdots & (X_{n1}-x_1)/h_1 \\ \vdots & \ddots & \vdots \\ (X_{1d}-x_d)/h_d & \cdots & (X_{nd}-x_d)/h_d \\ \vdots & \vdots & \vdots \\ (X_{11}-x_1)^3/h_1^3 & \cdots & (X_{n1}-x_1)^3/h_1^3 \\ \vdots & \ddots & \vdots \\ (X_{1d}-x_d)^3/h_d^3 & \cdots & (X_{nd}-x_d)^3/h_d^3 \end{pmatrix}^{T} $$
the matrix containing the terms associated with the function value and all direct partial derivatives. In order not to change the asymptotic distribution implied by the full cubic Taylor expansion, one has to add several terms associated with various partial cross derivatives, which leads to augmenting $Z_0^T$ with the columns $\left\{ (X_{i\alpha}-x_\alpha)/h_\alpha \, (X_{i\beta}-x_\beta)/h_\beta \right\}_{\alpha<\beta}$.
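To make the construction concrete, here is a minimal numerical sketch of a partial local cubic fit at a point: the design matrix stacks the intercept, the direct powers from Z_0, and the quadratic cross terms for all pairs α < β, and the fit is by locally weighted least squares. The Gaussian product kernel, the bandwidths, and the test function below are our illustrative assumptions, not the paper's specification.

```python
import numpy as np
from itertools import combinations

def partial_local_cubic_fit(X, Y, x, h):
    """Estimate m(x) by locally weighted least squares on a reduced cubic basis:
    intercept, direct powers ((X_a - x_a)/h_a)^q for q = 1, 2, 3, and the
    quadratic cross terms for all pairs a < b (O(d^2) columns, not O(d^3))."""
    n, d = X.shape
    U = (X - x) / h                          # scaled deviations, one row per observation
    cols = [np.ones(n)]
    for q in (1, 2, 3):                      # direct terms of Z_0, up to the cubic
        cols.extend(U[:, a] ** q for a in range(d))
    for a, b in combinations(range(d), 2):   # appended quadratic cross terms
        cols.append(U[:, a] * U[:, b])
    Z = np.column_stack(cols)
    w = np.exp(-0.5 * np.sum(U ** 2, axis=1))    # Gaussian product kernel weights
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return beta[0]                           # the intercept estimates m(x)

# Illustrative use on data simulated from model (1.1):
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.uniform(-1, 1, size=(n, d))
m = lambda x: np.sin(x[..., 0]) + x[..., 1] ** 2 * x[..., 2]
Y = m(X) + 0.1 * rng.standard_normal(n)
print(partial_local_cubic_fit(X, Y, x=np.zeros(d), h=0.5 * np.ones(d)), m(np.zeros(d)))
```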