Variable bandwidth and One-step Local M-Estimator

Jiancheng Jiang†
Department of Probability & Statistics, Peking University

Jianqing Fan*
University of North Carolina at Chapel Hill and Chinese University of Hong Kong

June 25, 1997

*Correspondence address: Department of Statistics, University of North Carolina, Chapel Hill, NC 27599-3260. Supported by NSF grant DMS-9504414 and NSA Grant 96-1-0015.
†Supported by the Youth Science Foundation of Peking University and a research grant of the National Natural Science Foundation, China.
Abstract
We study a robust version of local linear regression smoothers augmented with variable bandwidth. The proposed method inherits the advantages of local polynomial regression and overcomes the lack of robustness of least-squares techniques. The use of a variable bandwidth enhances the flexibility of the resulting local M-estimators and makes it possible for them to cope well with spatially inhomogeneous curves, heteroscedastic errors and nonuniform design densities. Under appropriate regularity conditions, it is shown that the proposed estimators exist and are asymptotically normal. Based on the robust estimating equations, we introduce one-step local M-estimators to reduce the computational burden. It is demonstrated that the one-step local M-estimators share the same asymptotic distributions as the fully iterative M-estimators, as long as the initial estimators are good enough. In other words, the one-step local M-estimators significantly reduce the computational cost of the fully iterative M-estimators without degrading their performance. This fact is also illustrated via simulations.
Keywords and Phrases. Local regression, M-estimator, nonparametric estimation, one-step, robustness, variable bandwidth. Abbreviated title: Local M-estimator. AMS 1991 subject classifications. Primary 62G07; secondary 62G35, 62F35, 62E20.
1 Introduction

Local polynomial regression methods have been demonstrated to be effective nonparametric smoothers. They have advantages over popular kernel methods in terms of design adaptation and high asymptotic efficiency. Moreover, local polynomial regression smoothers adapt to almost all regression settings and cope very well with edge effects. For details, see Fan and Gijbels (1996) and references therein. A drawback of these local regression estimators, however, is their lack of robustness, and M-type regression estimators are natural candidates for achieving desirable robustness properties. This paper focuses on establishing the joint asymptotic normality of
the nonparametric M-type estimators of the regression function and its associated derivative, based on local linear regression smoothers implemented with variable bandwidth.

There are many methods for estimating nonparametric functions: kernel, spline, local regression and orthogonal series methods. For an introduction to this subject area, see Härdle (1990), Green and Silverman (1994), Wand and Jones (1995) and Fan and Gijbels (1996), among others. An intuitively appealing method, the local linear smoother, has become popular in recent years because of its attractive statistical properties. As mentioned by Fan (1993), Hastie and Loader (1993) and Ruppert and Wand (1994), local regression provides many advantages over modified kernel methods. Therefore, it is reasonable to expect that the local-regression-based M-type estimators carry over those advantages.

The nonparametric M-type estimators of the regression function have been investigated by several authors, including Cleveland (1979), Cox (1983), Tsybakov (1986), Härdle and Tsybakov (1988), Cunningham et al. (1991), Fan, Hu and Truong (1994) and Welsh (1996), among others. In Cleveland (1979), Tsybakov (1986), Fan, Hu and Truong (1994) and Welsh (1996) local polynomial techniques are employed, while in Cox (1983) and Cunningham et al. (1991) smoothing spline techniques are used. The innovation of Cleveland's approach is to use locally reweighted regression to achieve robustness; the relation of this approach with the local M-regression estimators can be found in Section 2.4.1 of Fan and Gijbels (1996). Tsybakov (1986) pioneered the investigation of the minimax convergence rates of local M-type polynomial fitting. Härdle and Tsybakov (1988) studied simultaneous nonparametric estimation of the regression and scale functions by using a kernel method. Fan, Hu and Truong (1994) and Welsh (1996) pointed out that local M-regression copes well with edge effects and is an effective method for derivative estimation.

Our approach is similar to the local M-regression methods above, and is enhanced by incorporating a variable bandwidth scheme. This scheme includes the nearest neighborhood bandwidth in Cleveland (1979) as a specific example. This allows the resulting estimation procedure to cope well with spatially inhomogeneous curves, heteroscedastic errors and highly nonuniform designs. Our asymptotic analysis of the resulting estimator enables one to find the asymptotically optimal variable bandwidth scheme, and this in turn allows one to estimate the optimal variable bandwidth from data. See, for example, Brockmann, Gasser and Herrmann (1993) and Fan and Gijbels (1995). Those data-driven procedures are confined to the least-squares setting and based on the asymptotic formula for the optimal variable bandwidths. Analogously, our asymptotic formula will enable us to develop data-driven optimal variable bandwidths in the local M-regression context.
The local M-regression inherits many nice statistical properties from the local least-squares regression. However, unlike the local least-squares regression, the local M-regression estimators are defined implicitly, and their numerical implementation requires an iterative scheme. This creates a large computational burden and makes the procedure less attractive. To reduce the computational burden, we follow the idea of Bickel (1975) and propose a one-step local M-regression estimator. This estimator shares the same computational expediency as the local least-squares estimator and possesses the same asymptotic performance as the local M-regression estimator when the initial estimator behaves reasonably well. In other words, the local one-step M-regression estimator, while robustifying the local least-squares estimator, truly inherits all the good properties of the local least-squares estimator, in terms of not only asymptotic performance but also computational expediency. A popular and useful smoother, LOWESS, introduced by Cleveland (1979), can be regarded as a kind of local one-step M-regression, using the Huber bisquare function with the nearest neighborhood variable bandwidth. Thus, it is a specific example of what is studied in Sections 3 and 4. Combining the asymptotic results in these two sections, it is easy to derive the asymptotic normality of LOWESS. Our asymptotic and simulation studies give a theoretical endorsement of Cleveland's LOWESS, namely that the one-step (or few-step) M-regression is as efficient as the fully iterative M-regression.

The outline of this paper is as follows. In Section 2, we outline the local M-type regression estimators with variable bandwidth. Section 3 concentrates on the asymptotic properties of the proposed estimators, including pointwise consistency and asymptotic normality. In Section 4, one-step local M-estimators are proposed and shown to have the same asymptotic behavior as the corresponding fully iterative M-estimators. Section 5 conducts simulation studies to compare the relative efficiencies of the one-step robust estimators and the fully iterative robust estimators. Technical proofs are given in Section 6.
2 Local M-Estimators with Variable Bandwidth

Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be a random sample from a population $(X,Y)$ having a density $f(x,y)$. Let $f_X(x)$ be the marginal density of $X$, and denote the regression function by $m(x)=E(Y\mid X=x)$. For the given random sample, the regression function $m(x)$ and its derivative functions can be estimated via a local polynomial fit. For simplicity of presentation, we focus on the local linear fit and comment on the results for the local polynomial fit. The local linear regression with variable bandwidth is defined as the solution to the following weighted least squares problem: find $a$ and $b$ to minimize
$$\sum_{j=1}^{n}\big(Y_j-a-b(X_j-x)\big)^2\,\alpha(X_j)\,K\!\Big(\frac{x-X_j}{h_n}\,\alpha(X_j)\Big),\tag{2.1}$$
where $K(\cdot)$ is a kernel function, $h_n$ is a sequence of positive numbers tending to zero, and $\alpha(\cdot)$ is a nonnegative function reflecting the variable amount of smoothing at each data point. The quantity $h_n/\alpha(X_j)$ is called a variable bandwidth by Breiman et al. (1977). Some related studies on bandwidth variation can be found in Abramson (1982), Müller and Stadtmüller (1987), Hall and Marron (1988), Hall, Hu and Marron (1995) and Fan and Gijbels (1992, 1995), among others.

Obviously, criterion (2.1) is based on the least-squares principle and is not robust. To overcome this shortcoming, we propose to find $a$ and $b$ to minimize
$$\sum_{j=1}^{n}\rho\big(Y_j-a-b(X_j-x)\big)\,\alpha(X_j)\,K\!\Big(\frac{x-X_j}{h_n}\,\alpha(X_j)\Big)\tag{2.2}$$
or to satisfy the local estimating equations
$$\sum_{j=1}^{n}\psi\big(Y_j-a-b(X_j-x)\big)\,\alpha(X_j)\,K\!\Big(\frac{x-X_j}{h_n}\,\alpha(X_j)\Big)=0\tag{2.3}$$
and
$$\sum_{j=1}^{n}\psi\big(Y_j-a-b(X_j-x)\big)\,\frac{X_j-x}{h_n}\,\alpha(X_j)\,K\!\Big(\frac{x-X_j}{h_n}\,\alpha(X_j)\Big)=0,\tag{2.4}$$
where $\rho(\cdot)$ is a given outlier-resistant function and $\psi(\cdot)$ is the derivative of $\rho(\cdot)$. To facilitate notation, we denote (2.3) and (2.4) by
$$\Psi_{1n}(a,b)=0\qquad\text{and}\qquad\Psi_{2n}(a,b)=0,$$
respectively. The M-type estimators of $m(x)$ and $m'(x)$ are defined as $\hat a$ and $\hat b$, the solution to equations (2.3) and (2.4). We denote them by $\hat m_n(x)$ and $\hat m'_n(x)$, respectively. For a given point $x_0$, the following notation and assumptions are needed.
(A1) The kernel function $K$ is a continuous probability density function with bounded support, $[-1,1]$ say. Let $s_\ell=\int K(u)u^\ell\,du$ for $\ell\ge0$.

(A2) $\min_x\alpha(x)>0$ and $\alpha(\cdot)$ is continuous at the point $x_0$.

(A3) The regression function $m(\cdot)$ has a continuous second derivative at the point $x_0$.

(A4) The sequence of bandwidths $h_n$ tends to zero such that $nh_n\to+\infty$.

(A5) $E[\psi(\varepsilon)\mid X=x]=0$ with $\varepsilon=Y-m(X)$.

(A6) The design density $f_X(\cdot)$ is continuous at the point $x_0$ and $f_X(x_0)>0$.

(A7) The function $\psi(\cdot)$ is continuous and has a derivative $\psi'(\cdot)$ almost everywhere. Further, assume that $\gamma_\varepsilon(x)=E[\psi'(\varepsilon)\mid X=x]$ and $\sigma^2_\varepsilon(x)=E[\psi^2(\varepsilon)\mid X=x]$ are positive and continuous at the point $x_0$, and that there exists $\delta>0$ such that $E[|\psi(\varepsilon)|^{2+\delta}\mid X=x]$ is bounded in a neighborhood of $x_0$.

(A8) The function $\psi'(\cdot)$ satisfies
$$E\Big[\sup_{|z|\le\delta}|\psi'(\varepsilon+z)-\psi'(\varepsilon)|\;\Big|\;X=x\Big]=o(1)$$
and
$$E\Big[\sup_{|z|\le\delta}|\psi(\varepsilon+z)-\psi(\varepsilon)-\psi'(\varepsilon)z|\;\Big|\;X=x\Big]=o(\delta),\qquad\text{as }\delta\to0,$$
uniformly in $x$ in a neighborhood of $x_0$.

The conditions above are mild and are fulfilled in many applications. We do not need monotonicity or boundedness of $\psi(x)$ here. Condition (A8) is weaker than Lipschitz continuity of the function $\psi'(x)$; it appears to be the minimal smoothness assumption on $\psi(x)$. In particular, Huber's $\psi(x)$ function satisfies this requirement. The bounded support restriction on $K(\cdot)$ is not essential; it is imposed only to avoid technicalities in the proofs and can be removed if we put a restriction on the tail of $K(\cdot)$. In contrast with previous results by Fan, Hu and Truong (1994), we do not require convexity of $\rho(\cdot)$. Also, we do not need the symmetry of the conditional distribution of $\varepsilon$ given $X$, which is required by Härdle and Tsybakov (1988).
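For concreteness, the following is a minimal sketch (ours, not part of the paper) of Huber's $\psi$ and its almost-everywhere derivative, the pair used again in Section 5; the threshold `c` is a user-chosen tuning constant.

```python
import numpy as np

def huber_psi(t, c=1.345):
    """Huber's psi: psi(t) = max(-c, min(c, t)); bounded and continuous."""
    return np.clip(t, -c, c)

def huber_psi_prime(t, c=1.345):
    """Almost-everywhere derivative of Huber's psi: 1 on |t| < c, 0 on |t| > c."""
    return (np.abs(t) < c).astype(float)
```

Since this $\psi$ is bounded, odd and piecewise linear, the moment and smoothness requirements in (A7) and (A8) are easy to check; (A5) holds, for instance, whenever the conditional distribution of $\varepsilon$ given $X$ is symmetric about zero.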
3 Asymptotic properties

In this section, we establish the consistency and asymptotic normality of the local M-type estimators.
Theorem 3.1. Under conditions (A1)-(A8), there exist solutions, denoted by $(\hat m_n(x_0),\hat m'_n(x_0))$, to equations (2.3) and (2.4) such that
$$\begin{pmatrix}\hat m_n(x_0)-m(x_0)\\ h_n\big(\hat m'_n(x_0)-m'(x_0)\big)\end{pmatrix}\xrightarrow{\;P\;}0,\qquad n\to\infty.$$

Theorem 3.2. Under assumptions (A1)-(A8), the solutions given in Theorem 3.1 satisfy that
$$\sqrt{nh_n}\begin{pmatrix}\hat m_n(x_0)-m(x_0)\\ h_n\big(\hat m'_n(x_0)-m'(x_0)\big)\end{pmatrix}$$
is asymptotically normally distributed with mean
$$\sqrt{nh_n}\;\frac{m''(x_0)h_n^2}{2(s_0s_2-s_1^2)}\begin{pmatrix}(s_2^2-s_1s_3)/\alpha^2(x_0)\\ (s_1s_2-s_0s_3)/\alpha(x_0)\end{pmatrix}\{1+o(1)\}$$
and covariance matrix
$$\frac{\alpha^5(x_0)\,\sigma^2_\varepsilon(x_0)}{\gamma^2_\varepsilon(x_0)\,f_X(x_0)\,(s_0s_2-s_1^2)^2}\begin{pmatrix}c_{11}&c_{12}\\ c_{12}&c_{22}\end{pmatrix},$$
where $c_{11}=\frac{1}{\alpha^4(x_0)}\int_{-1}^{+1}(s_2-us_1)^2K^2(u)\,du$, $c_{12}=\frac{1}{\alpha^3(x_0)}\int_{-1}^{+1}(s_2-us_1)(s_1-us_0)K^2(u)\,du$, and $c_{22}=\frac{1}{\alpha^2(x_0)}\int_{-1}^{+1}(s_1-us_0)^2K^2(u)\,du$.
By Theorem 3.2, the asymptotic bias $b_n(x_0)$ and variance $v_n^2(x_0)$ are naturally given by
$$b_n(x_0)=\frac{m''(x_0)}{2}\,\frac{s_2^2-s_1s_3}{s_0s_2-s_1^2}\Big(\frac{h_n}{\alpha(x_0)}\Big)^2$$
and
$$v_n^2(x_0)=\frac{\alpha(x_0)\,\sigma^2_\varepsilon(x_0)}{\gamma^2_\varepsilon(x_0)\,f_X(x_0)\,(s_0s_2-s_1^2)^2}\;\frac{\int_{-1}^{+1}(s_2-us_1)^2K^2(u)\,du}{nh_n}.$$
In particular, when $K$ has mean zero, namely $s_1=0$, the above formulae become
$$b_n(x_0)=\frac{s_2\,m''(x_0)}{2}\Big(\frac{h_n}{\alpha(x_0)}\Big)^2,\qquad v_n^2(x_0)=\int_{-1}^{+1}K^2(u)\,du\;\frac{\sigma^2_\varepsilon(x_0)}{\gamma^2_\varepsilon(x_0)}\,\frac{\alpha(x_0)}{nh_n\,f_X(x_0)}.$$
Further, if the kernel function $K$ is symmetric, then by Theorem 3.2, $\hat m_n(x)$ is asymptotically independent of $\hat m'_n(x)$. Therefore, the asymptotic mean squared error of the estimator $\hat m_n(x)$ can be defined as
$$\mathrm{AMSE}(x_0)=b_n^2(x_0)+v_n^2(x_0).$$
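As a quick numerical illustration (ours, with entirely hypothetical local quantities), the sketch below evaluates $b_n(x_0)$, $v_n^2(x_0)$ and AMSE$(x_0)$ for the symmetric Epanechnikov kernel, for which $s_1=0$.

```python
import numpy as np
from scipy.integrate import quad

K = lambda u: 0.75 * (1.0 - u**2)            # Epanechnikov kernel on [-1, 1]
s2 = quad(lambda u: u**2 * K(u), -1, 1)[0]   # second kernel moment (= 0.2)
RK = quad(lambda u: K(u)**2, -1, 1)[0]       # integral of K^2 (= 0.6)

# Hypothetical local quantities at x0 (placeholders, not estimates from data):
m2, fX, alpha = 1.5, 0.25, 1.0               # m''(x0), f_X(x0), alpha(x0)
sigma2_eps, gamma_eps = 0.04, 0.9            # E[psi^2(eps)|X=x0], E[psi'(eps)|X=x0]
n, h = 200, 0.18

bias = 0.5 * s2 * m2 * (h / alpha) ** 2
var = RK * (sigma2_eps / gamma_eps**2) * alpha / (n * h * fX)
print(bias, var, bias**2 + var)              # b_n, v_n^2, AMSE
```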
Remark 3.1. If a local polynomial regression of order $p$,
$$\sum_{j=1}^n\rho\Big(Y_j-\sum_{k=0}^p a_k(X_j-x)^k\Big)\,\alpha(X_j)\,K\!\Big(\frac{x-X_j}{h_n}\,\alpha(X_j)\Big),$$
is implemented, then there exist consistent solutions $\hat a=(\hat a_0,\ldots,\hat a_p)^T$ to the corresponding local M-estimation equations which are asymptotically normal. More precisely, let
$$S=(s_{i+j-2})\qquad\text{and}\qquad S^*=(\nu_{i+j-2}),\qquad 1\le i\le p+1,\ 1\le j\le p+1,$$
be $(p+1)\times(p+1)$ matrices with
$$s_i=\int_{-1}^{+1}u^iK(u)\,du\qquad\text{and}\qquad\nu_i=\int_{-1}^{+1}u^iK^2(u)\,du.$$
Denote by
$$H=\mathrm{diag}\big(1,\,h_n/\alpha(x_0),\,\ldots,\,h_n^p/\alpha^p(x_0)\big),\qquad c_p=(s_{p+1},\ldots,s_{2p+1})^T.\tag{3.1}$$
Then under conditions (A1)-(A8) with $m(\cdot)$ having a continuous $(p+1)$th derivative at the point $x_0$, it can be shown along the same lines of the proofs that
$$\sqrt{nh_n}\Big\{H(\hat a-\hat a_0)-\frac{m^{(p+1)}(x_0)\,h_n^{p+1}}{(p+1)!\,\alpha^{p+1}(x_0)}\,S^{-1}c_p\,(1+o(1))\Big\}\;\xrightarrow{\;\mathcal D\;}\;N\big(0,\;v(x)\,S^{-1}S^*S^{-1}\big),\tag{3.2}$$
where
$$\hat a_0=\big(m(x_0),\,m'(x_0),\,\ldots,\,m^{(p)}(x_0)/p!\big)^T,\qquad v(x)=\frac{\alpha(x_0)\,\sigma^2_\varepsilon(x_0)}{\gamma^2_\varepsilon(x_0)\,f_X(x_0)}.$$
This establishes the joint asymptotic normality of the function and its derivative estimators. In particular, when $p=1$, the above result is the same as Theorem 3.2.
Remark 3.2. We now consider the edge effect of the local M-estimation. A convenient mathematical formulation of this problem is given by Gasser and Müller (1979). Assume that the design density has a bounded support, $[0,1]$ say. Consider the local polynomial fitting at the left-hand point $x_0=dh_n$ for some positive constant $d>0$. Then, the joint asymptotic normality (3.2) continues to hold, with slight modifications in the definition of the moments in (3.1):
$$s_i=\int_{-d}^{+1}u^iK(u)\,du\qquad\text{and}\qquad\nu_i=\int_{-d}^{+1}u^iK^2(u)\,du.\tag{3.3}$$
A similar result holds at right-hand boundary points. This property implies that the local polynomial M-estimation shares a boundary adaptation similar to that of the least-squares local polynomial fitting. For details, see Fan and Gijbels (1992) and Ruppert and Wand (1994).
Now, we discuss the choice of the variable bandwidth $\alpha(x)$ and of $h_n$ for the estimation of $m(x)$ based on the local linear fit. Similar derivations can be made for the optimal choice of variable bandwidth for the estimation of $m'(x)$ or higher order derivatives based on the local polynomial fit. Following Müller and Stadtmüller (1987) and Fan and Gijbels (1992), the optimal bandwidth should minimize
$$\int\mathrm{AMSE}(x)\,w(x)\,dx$$
for some given nonnegative weight function $w(x)$. Then, using the same arguments as those of Theorem 3 in Fan and Gijbels (1992), the optimal variable bandwidth is given by
$$\alpha_{\mathrm{opt}}(x)=\begin{cases}b\Big\{\dfrac{f_X(x)\,[m''(x)]^2}{\tilde\sigma^2(x)}\Big\}^{1/5}, & \text{if }w(x)>0,\\[1.5ex] \text{arbitrary}, & \text{otherwise},\end{cases}\tag{3.4}$$
and the optimal constant bandwidth is given by
$$h_{n,\mathrm{opt}}=b\left(\frac{\int(s_2-us_1)^2K^2(u)\,du}{(s_2^2-s_1s_3)^2}\right)^{1/5}n^{-1/5},\tag{3.5}$$
where $\tilde\sigma^2(x)=\sigma^2_\varepsilon(x)/\gamma^2_\varepsilon(x)$ and $b$ is an arbitrary positive constant; where $w(x)=0$, $\alpha(x)$ can be taken to be any positive value. Note that the optimal variable bandwidth $\alpha_{\mathrm{opt}}(\cdot)$ does not depend on the weight function $w(x)$; that is, $\alpha_{\mathrm{opt}}(x)$ is intrinsic to the problem. From (3.4) and (3.5), it follows that the denser the design points or the larger the curvature near $x_0$, the bigger $\alpha(x_0)$, and hence the smaller the effective bandwidth $h_n/\alpha(x_0)$. This is consistent with our intuition: at dense design regions or high curvature regions a smaller bandwidth should be employed, whereas at low density or low curvature regions a larger bandwidth is required.
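The following is a rough plug-in sketch of (3.4) and (3.5) (ours; it assumes pilot estimates of $f_X$, $m''$ and $\tilde\sigma^2$ are available from some preliminary fit, and uses the symmetric Epanechnikov kernel so that $s_1=s_3=0$).

```python
import numpy as np
from scipy.integrate import quad

def alpha_opt(x, fX_hat, m2_hat, sigma2_tilde_hat, b=1.0):
    """Optimal variable bandwidth (3.4), given pilot estimates (callables)."""
    return b * (fX_hat(x) * m2_hat(x) ** 2 / sigma2_tilde_hat(x)) ** 0.2

def h_opt(n, b=1.0):
    """Optimal constant bandwidth (3.5) for the Epanechnikov kernel (s1 = s3 = 0)."""
    K = lambda u: 0.75 * (1.0 - u**2)
    s2 = quad(lambda u: u**2 * K(u), -1, 1)[0]
    RK = quad(lambda u: K(u) ** 2, -1, 1)[0]
    # With s1 = s3 = 0, (3.5) reduces to b * (s2^2 * RK / s2^4)^(1/5) * n^(-1/5).
    return b * (RK / s2**2) ** 0.2 * n ** (-0.2)
```

In practice the constant $b$ would be chosen by a data-driven rule in the spirit of Fan and Gijbels (1995); the sketch only mirrors the shape of the asymptotic formulas.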
4 One-step Local M-estimators

In Section 3, we have established the asymptotic normality of the local linear M-estimators. But the solution to the nonlinear system (2.3) and (2.4) involves intensive iterations, which can be very time consuming. Clearly, a non-iterative procedure with a similar performance is preferable. One viable approach is to follow the idea of Bickel (1975) and construct a one-step estimation procedure. We now outline the procedure.

Consider solving the system (2.3) and (2.4) by Newton's method with initial values $a_0=\hat m_0(x)$ and $b_0=\hat m'_0(x)$. They can be the local least-squares estimators, namely, the solution to problem (2.1). In this case, our $\hat m_0(x)$ and $\hat m'_0(x)$ are the same as those in Fan and Gijbels (1992), and admit simple and explicit expressions. Then, the first iteration has the form
$$\begin{pmatrix}\tilde m_n(x)\\ \tilde m'_n(x)\end{pmatrix}=\begin{pmatrix}\hat m_0(x)\\ \hat m'_0(x)\end{pmatrix}-W_n^{-1}\begin{pmatrix}\Psi_{1n}(a_0,b_0)\\ \Psi_{2n}(a_0,b_0)\end{pmatrix},\tag{4.1}$$
where
$$W_n=\begin{pmatrix}\dfrac{\partial\Psi_{1n}(a_0,b_0)}{\partial a_0} & \dfrac{\partial\Psi_{1n}(a_0,b_0)}{\partial b_0}\\[2ex] \dfrac{\partial\Psi_{2n}(a_0,b_0)}{\partial a_0} & \dfrac{\partial\Psi_{2n}(a_0,b_0)}{\partial b_0}\end{pmatrix}.$$
The estimators $\tilde m_n(x)$ and $\tilde m'_n(x)$ are the so-called "one-step local M-estimators". The LOWESS of Cleveland (1979) can be regarded as a kind of one-step local M-estimator using Huber's bisquare function. The one-step local estimators have the same computational expediency as the local least-squares estimators. We now show that the one-step local M-estimators have the same asymptotic performance as the fully iterative M-estimators $\hat m_n(x)$ and $\hat m'_n(x)$, as long as the initial estimators are good enough. In other words, the one-step local M-estimators reduce the computational cost without degrading the performance.
Theorem 4.1. Assume that the initial estimators satisfy
$$\hat m_0(x_0)-m(x_0)=O_p\Big(h_n^2+\frac{1}{\sqrt{nh_n}}\Big)\qquad\text{and}\qquad h_n\big(\hat m'_0(x_0)-m'(x_0)\big)=O_p\Big(h_n^2+\frac{1}{\sqrt{nh_n}}\Big).$$
Then, under conditions (A1)-(A8), the normalized one-step local M-estimators
$$\sqrt{nh_n}\begin{pmatrix}\tilde m_n(x_0)-m(x_0)\\ h_n\big(\tilde m'_n(x_0)-m'(x_0)\big)\end{pmatrix}$$
are asymptotically normally distributed with the same mean and covariance matrix as those in Theorem 3.2.
Remark 4.1 The condition on the initial estimators in Theorem 4.1 is mild. All commonly used nonparametric regression estimators satisfy the condition.
Remark 4.2. When $K(\cdot)$ is symmetric, $\tilde m_n(x)$ is asymptotically independent of $\tilde m'_n(x)$ and of the initial estimator $\hat m'_0(x)$. This follows easily from (4.1) and (6.11).
Remark 4.3. The one-step local M-estimator can easily be extended from the local linear fit to the local polynomial fit. The resulting estimators share the same joint asymptotic normality as (3.2), provided that the initial estimators $\hat m_0^{(k)}(x_0)$ ($k=0,\ldots,p$) satisfy
$$h_n^k\big\{\hat m_0^{(k)}(x_0)-m^{(k)}(x_0)\big\}=O_p\Big(h_n^{p+1}+\frac{1}{\sqrt{nh_n}}\Big)\tag{4.2}$$
and conditions (A1)-(A8) are fulfilled.
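To make the construction concrete, here is a minimal sketch of the one-step estimator (4.1) at a single point (ours, assuming Huber's $\psi$ and the Epanechnikov kernel; all function and argument names are illustrative, not from the paper), with the local least-squares fit (2.1) as the initial value.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def one_step_local_m(x, X, Y, h, alpha=None, c=1.345, K=epanechnikov):
    """One-step local linear M-estimator, a sketch of (4.1), at a single point x.

    Returns (m_tilde, m_tilde_prime); the initial value is the weighted
    least-squares fit (2.1) and psi is Huber's function with threshold c."""
    if alpha is None:
        alpha = np.ones_like(X)                      # constant-bandwidth case
    w = alpha * K((x - X) * alpha / h)               # kernel weights with variable bandwidth
    d = X - x

    # Initial estimator: weighted least squares of Y on (1, X - x).
    Z = np.column_stack([np.ones_like(d), d])
    a0, b0 = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))

    # Local estimating equations (2.3)-(2.4) and their Jacobian at (a0, b0).
    r = Y - a0 - b0 * d
    psi = np.clip(r, -c, c)                          # Huber's psi
    dpsi = (np.abs(r) < c).astype(float)             # psi' (a.e.)
    Psi = np.array([np.sum(psi * w), np.sum(psi * d / h * w)])
    W = -np.array([[np.sum(dpsi * w),         np.sum(dpsi * d * w)],
                   [np.sum(dpsi * d / h * w), np.sum(dpsi * d**2 / h * w)]])

    # One Newton step (4.1).
    a1, b1 = np.array([a0, b0]) - np.linalg.solve(W, Psi)
    return a1, b1
```

The fully iterative local M-estimator is obtained by repeating the Newton step until convergence, and the two-step estimator of Section 5 simply feeds the one-step output back into the same update as the initial value.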
5 Simulations

In this section, we compare the relative efficiency of the one-step M-estimators with that of the fully iterative M-estimators via simulation studies. For the robust M-estimators, we take Huber's function $\psi(t)=\max\{-c,\min(c,t)\}$ with $c=1.35\sigma$, where $\sigma$ is a scale parameter specified in each simulation model. There are several possibilities for choosing initial estimators. The most intuitive method is to use the least-squares regression estimator (2.1); let us call the resulting estimator (4.1) the local one-step M-regression estimator. This local estimator can in turn serve as an initial estimator in (4.1); let us call the resulting estimator the local two-step M-regression estimator. Since the least-squares regression estimator satisfies the conditions of Theorem 4.1, so does the one-step local M-estimator. Hence, both the one-step and the two-step local M-regression estimators are asymptotically as efficient as the fully iterative local M-regression. It is intuitively clear that the local two-step M-estimator further robustifies the local one-step M-regression. It is of interest to see, in finite samples, how much the local two-step M-regression can improve on the local one-step estimator and how good both are in comparison with the fully iterative estimator.
For each of the four methods, namely the least-squares, one-step, two-step and fully iterative methods, we assess performance via the mean absolute deviation error (MADE):
$$\mathrm{MADE}(\hat m)=N^{-1}\sum_{j=1}^N|\hat m(x_j)-m(x_j)|,$$
where $\{x_j,\ j=1,\ldots,N\}$ are grid points and $m$ is the true regression function. For each given bandwidth and for each given estimator, the ratio of the MADE of the fully iterative estimator to that of the given estimator is computed. This gives a relative error (efficiency) between them at a given finite sample size. The bandwidths used in our simulation are $h=h_{\mathrm{opt}}/2$, $h_{\mathrm{opt}}$ and $2h_{\mathrm{opt}}$, where $h_{\mathrm{opt}}$ is the optimal bandwidth for the fully iterative M-regression estimator. We regard this range of bandwidths as wide enough to cover most applications. The bandwidth $h_{\mathrm{opt}}$ is computed from 400 simulations: for each simulation, one can compute the optimal bandwidth that minimizes the MADE, and $h_{\mathrm{opt}}$ is the sample average (about the same as the sample median) of those 400 optimal bandwidths.

We now describe our simulation models and present the simulation results.

Example 1. We simulated 400 samples of size $n=200$ from the model
$$Y=X+2\exp(-16X^2)+\varepsilon,$$
where $X\sim\mathrm{Uniform}(-2,2)$ is independent of
$$\varepsilon\sim0.1\,N(0,5^2\sigma^2)+0.9\,N(0,\sigma^2),\qquad\text{with }\sigma=0.217.$$
The parameter $\sigma$ was chosen so that $\mathrm{var}(\varepsilon)=0.4^2$, i.e. $\sigma=0.217$. The model was used in Fan and Gijbels (1995), except that there $\varepsilon\sim N(0,0.4^2)$. For this model, $h_{\mathrm{opt}}=0.18$ was selected. Figure 1 depicts the results based on 400 simulations. Figure 1(d) presents a typical simulated data set along with four different estimates using bandwidth $h=h_{\mathrm{opt}}$. The criterion for choosing a typical simulated data set is that the fully iterative M-regression estimator attains its median performance, in terms of MADE, among the 400 simulations.

Example 2. In this example, we estimate a different function with even more heavily contaminated errors:
$$Y=0.3\exp\{-4(X+1)^2\}+0.7\exp\{-16(X-1)^2\}+\varepsilon,$$
where $X\sim\mathrm{Uniform}(-2,2)$ is independent of
$$\varepsilon\sim0.1\,N(0,10^2\sigma^2)+0.9\,N(0,\sigma^2),\qquad\text{with }\sigma=0.075.$$
Figure 1: Simulation results for Example 1. The relative errors with respect to the fully iterative M-regression estimator for bandwidths (a) $h=h_{\mathrm{opt}}/2$, (b) $h=h_{\mathrm{opt}}$ and (c) $h=2h_{\mathrm{opt}}$. Typical estimated curves are presented in (d). Solid curve: least-squares method; from short to long dashes: one-step, two-step and fully iterative estimators.

This model was used in Fan and Gijbels (1995), except that there $\varepsilon\sim N(0,0.1^2)$. The variance in the current model is $\mathrm{var}(\varepsilon)=0.248^2$, much higher than that in Fan and Gijbels (1995). Simulation results are presented in Figure 2, where $h_{\mathrm{opt}}=0.19$ was selected.

From both Figures 1 and 2, one can easily see that the least-squares method is not very robust. For the small bandwidth $h=h_{\mathrm{opt}}/2$, since the number of local data points is small, the least-squares estimator is very sensitive to outliers and hence has low efficiency in this case. Clearly, the one-step and two-step estimators improve greatly over the least-squares method. For reasonably large bandwidths, the one-step and two-step local M-regression estimators are nearly as efficient as the fully iterative method. This is consistent with the result of Theorem 4.1. When the bandwidth increases, the least-squares method becomes more efficient and robust, as evidenced in Figures 1(c) and 2(c).
Figure 2: Simulation results for Example 2. The relative errors with respect to the fully iterative M-regression estimator for bandwidths (a) $h=h_{\mathrm{opt}}/2$, (b) $h=h_{\mathrm{opt}}$ and (c) $h=2h_{\mathrm{opt}}$. Typical estimated curves are presented in (d). Solid curve: least-squares method; from short to long dashes: one-step, two-step and fully iterative estimators.

Figure 2(d) reveals that even with $h=h_{\mathrm{opt}}$, the estimated curves are somewhat undersmoothed in the visual sense. Thus, it is doubtful that bandwidths much smaller than $h_{\mathrm{opt}}$ would actually be used in practice. For the practical range of bandwidths, the one-step local M-regression estimator is robust enough, and the two-step local M-regression can even be useful for bandwidths somewhat smaller than the practical range. The conclusions here seem to support the default of three iterations in LOWESS, which corresponds to our two-step estimator. We have also conducted simulations with other levels of contaminated errors and other choices of bandwidths; the conclusions are basically the same as above.

The optimal bandwidths in Examples 1 and 2 are small, using only about 9% and 9.5% of the data
points in each local neighborhood. For many real-data applications, larger local neighborhoods are used, and the local one-step and two-step M-regression estimators should be even more efficient than what is presented here.
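A compressed sketch of the kind of Monte Carlo comparison reported in this section (ours; it reuses the `one_step_local_m` sketch given at the end of Section 4, a constant bandwidth, and far fewer replications than the 400 used above):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_example1(n=200, sigma=0.217):
    """Data from the Example 1 model with the contaminated-normal error."""
    X = rng.uniform(-2, 2, n)
    outlier = rng.random(n) < 0.1
    eps = rng.normal(0, np.where(outlier, 5 * sigma, sigma))
    return X, X + 2 * np.exp(-16 * X**2) + eps

def made(m_hat, m_true, grid):
    return np.mean(np.abs(m_hat(grid) - m_true(grid)))

m_true = lambda x: x + 2 * np.exp(-16 * x**2)
grid = np.linspace(-2, 2, 101)
h = 0.18                                   # close to the h_opt reported for Example 1

errors = []
for _ in range(50):                        # a small number of replications, for illustration
    X, Y = simulate_example1()
    fit = lambda x0: one_step_local_m(x0, X, Y, h)[0]   # one-step sketch from Section 4
    errors.append(made(np.vectorize(fit), m_true, grid))
print(np.mean(errors))
```

Replacing the one-step fit by its iterated version (or by the weighted least-squares fit alone) and comparing the resulting MADE ratios reproduces the kind of relative-efficiency comparison displayed in Figures 1 and 2.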
6 Proofs

In this section, we give the proofs of Theorems 3.1, 3.2 and 4.1. The following notation will be used throughout this section. Let $\alpha_j=\alpha(X_j)$, $K_j=K\big(\frac{x_0-X_j}{h_n}\alpha(X_j)\big)$ and $\underline\alpha=\min_x\alpha(x)$, and denote $R(X_j)=m(X_j)-m(x_0)-m'(x_0)(X_j-x_0)$. The following lemmas are needed for our technical proofs.
Lemma 6.1. Assume that conditions (A1)-(A8) hold. For any random sequence $\{\eta_j\}_{j=1}^n$ with $\max_{1\le j\le n}|\eta_j|=o_p(1)$, we have
$$\sum_{j=1}^n\psi'(\varepsilon_j+\eta_j)\,\alpha_jK_j\,(X_j-x_0)^\ell=(-1)^\ell\,\gamma_\varepsilon(x_0)\,nh_n^{\ell+1}f_X(x_0)\,s_\ell/\alpha^\ell(x_0)\,(1+o_p(1))$$
and
$$\sum_{j=1}^n\psi'(\varepsilon_j+\eta_j)\,R(X_j)\,\alpha_jK_j\,(X_j-x_0)^\ell=\frac{(-1)^\ell}{2}\,\frac{\gamma_\varepsilon(x_0)}{\alpha^{\ell+2}(x_0)}\,nh_n^{\ell+3}\,m''(x_0)f_X(x_0)\,s_{\ell+2}\,(1+o_p(1)).$$

Proof. We give only the proof of the first conclusion, because the second one can be shown by the same arguments. It is obvious that
$$\sum_{j=1}^n\psi'(\varepsilon_j+\eta_j)\alpha_jK_j(X_j-x_0)^\ell=\sum_{j=1}^n\psi'(\varepsilon_j)\alpha_jK_j(X_j-x_0)^\ell+\sum_{j=1}^n[\psi'(\varepsilon_j+\eta_j)-\psi'(\varepsilon_j)]\alpha_jK_j(X_j-x_0)^\ell\equiv T_{n,1}+T_{n,2}.$$
By the same lines of argument as in Lemma 4 of Fan and Gijbels (1992), we have
$$ET_{n,1}=E\Big[\sum_{j=1}^n\alpha_jK_j(X_j-x_0)^\ell\,E\big(\psi'(\varepsilon_j)\mid X_j\big)\Big]=\gamma_\varepsilon(x_0)(-1)^\ell\,nh_n^{\ell+1}f_X(x_0)s_\ell/\alpha^\ell(x_0)(1+o(1))$$
and
$$T_{n,1}=\gamma_\varepsilon(x_0)(-1)^\ell\,nh_n^{\ell+1}f_X(x_0)s_\ell/\alpha^\ell(x_0)(1+o_p(1)).$$
It suffices to show that
$$T_{n,2}=o_p(nh_n^{\ell+1}).$$
For any given $\delta>0$, let $\eta_n=(\eta_1,\ldots,\eta_n)^T$,
$$D_\delta=\{\eta_n:|\eta_j|\le\delta,\ \forall j\le n\},\qquad\text{and}\qquad V(\eta_n)=\frac{1}{nh_n^{\ell+1}}\sum_{j=1}^n[\psi'(\varepsilon_j+\eta_j)-\psi'(\varepsilon_j)]\alpha_jK_j(X_j-x_0)^\ell.$$
Then
$$\sup_{D_\delta}|V(\eta_n)|\le\frac{1}{nh_n^{\ell+1}}\sum_{j=1}^n\sup_{D_\delta}|\psi'(\varepsilon_j)-\psi'(\varepsilon_j+\eta_j)|\,\alpha_jK_j|X_j-x_0|^\ell.$$
By condition (A8), and noticing that $|X_j-x_0|\le h_n/\underline\alpha$ in the above expressions, we have
$$E\sup_{D_\delta}|V(\eta_n)|\le a_\delta\,\frac{1}{nh_n^{\ell+1}}\sum_{j=1}^nE\,\alpha_jK_j|X_j-x_0|^\ell\le b_\delta,$$
where $a_\delta$ and $b_\delta$ are two sequences of positive numbers tending to zero as $\delta\to0$. Since $\sup_{1\le j\le n}|\eta_j|=o_p(1)$, it follows that $V(\hat\eta_n)=o_p(1)$ with $\hat\eta_n=(\eta_1,\ldots,\eta_n)^T$. The conclusion follows from the fact that
$$T_{n,2}=nh_n^{\ell+1}V(\hat\eta_n)=o_p(nh_n^{\ell+1}).\qquad\Box$$
Lemma 6.2. Assume that conditions (A1)-(A8) hold. Then
$$\sum_{j=1}^n\psi\big(Y_j-m(x_0)-m'(x_0)(X_j-x_0)\big)\alpha_jK_j(X_j-x_0)^\ell=\frac{(-1)^\ell}{2}\gamma_\varepsilon(x_0)\,nh_n^{\ell+3}s_{\ell+2}\,m''(x_0)f_X(x_0)/\alpha^{\ell+2}(x_0)\,(1+o_p(1))+\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j(X_j-x_0)^\ell.$$

Proof. Recall that $Y_j=m(X_j)+\varepsilon_j$ and $R(X_j)=m(X_j)-m(x_0)-m'(x_0)(X_j-x_0)$. Then,
$$J_n\equiv\sum_{j=1}^n\psi\big(Y_j-m(x_0)-m'(x_0)(X_j-x_0)\big)\alpha_jK_j(X_j-x_0)^\ell=\sum_{j=1}^n\psi\big(\varepsilon_j+R(X_j)\big)\alpha_jK_j(X_j-x_0)^\ell$$
$$=\sum_{j=1}^n\big\{\psi(\varepsilon_j)+\psi'(\varepsilon_j)R(X_j)+[\psi(\varepsilon_j+R(X_j))-\psi(\varepsilon_j)-\psi'(\varepsilon_j)R(X_j)]\big\}\alpha_jK_j(X_j-x_0)^\ell\equiv J_{n1}+J_{n2}+J_{n3}.\tag{6.1}$$
By (A3) and Taylor's expansion, for $|X_j-x_0|\le h_n/\underline\alpha$ ($j=1,\ldots,n$), we obtain
$$\max_{1\le j\le n}|R(X_j)|\le\frac12\sup_{|x-x_0|\le h_n/\underline\alpha}|m''(x)|\,\max_{1\le j\le n}(X_j-x_0)^2=O_p(h_n^2).$$
Then, using condition (A8) and the same argument as in Lemma 6.1, we get
$$J_{n3}=o_p(nh_n^{\ell+3}).$$
Applying the second conclusion of Lemma 6.1 to the second term in (6.1), we conclude that
$$J_{n2}=\gamma_\varepsilon(x_0)\,\frac{(-1)^\ell}{2}\,nh_n^{\ell+3}f_X(x_0)m''(x_0)\,\frac{s_{\ell+2}}{\alpha^{\ell+2}(x_0)}\,(1+o_p(1)).$$
Hence, the result follows. $\Box$
Lemma 6.3. Assume that conditions (A1)-(A7) hold. Let
$$J_n\equiv\begin{pmatrix}J_n^{(1)}\\ J_n^{(2)}\end{pmatrix}\equiv\begin{pmatrix}\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j\\[0.5ex] \sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j(X_j-x_0)/h_n\end{pmatrix}.$$
Then $J_n$ is asymptotically normal with mean zero and covariance matrix
$$D_n=nh_n\,\sigma^2_\varepsilon(x_0)f_X(x_0)\alpha(x_0)\begin{pmatrix}\int K^2(u)\,du & -\int\frac{uK^2(u)}{\alpha(x_0)}\,du\\[0.8ex] -\int\frac{uK^2(u)}{\alpha(x_0)}\,du & \int\frac{u^2K^2(u)}{\alpha^2(x_0)}\,du\end{pmatrix}(1+o(1)).$$

Proof. We first demonstrate that $J_n$ is asymptotically normal. In fact, for any real numbers $k_1$ and $k_2$ that are not both zero, the linear combination
$$k_1J_n^{(1)}+k_2J_n^{(2)}=\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j\Big[k_1+k_2\,\frac{X_j-x_0}{h_n}\Big]$$
is a sum of independent and identically distributed random variables with mean zero and variance $B_n^2$, where
$$B_n^2=nE\big\{\psi^2(\varepsilon_1)\alpha_1^2K_1^2[k_1+k_2(X_1-x_0)/h_n]^2\big\}=nE\big\{\sigma^2_\varepsilon(X_1)\alpha_1^2K_1^2[k_1^2+k_2^2(X_1-x_0)^2/h_n^2+2k_1k_2(X_1-x_0)/h_n]\big\}\equiv W_{n,1}+W_{n,2}+W_{n,3}.$$
By the same lines of argument as in Lemma 4 of Fan and Gijbels (1992), we can easily obtain asymptotic expressions for $W_{n,j}$ ($j=1,2,3$) and verify Lyapounov's condition
$$\frac{1}{B_n^{2+\delta}}\sum_{j=1}^nE|\xi_j|^{2+\delta}\to0,\qquad\text{where }\xi_j=\psi(\varepsilon_j)\alpha_jK_j\big[k_1+k_2(X_j-x_0)/h_n\big],$$
via condition (A7). That is, $J_n$ is asymptotically normal. Using the asymptotic expressions for $W_{n,1}$, $W_{n,2}$ and $W_{n,3}$, one can easily obtain the asymptotic covariance matrix $D_n$. $\Box$
Proof of Theorem 3.1. Let $\tilde a=a$, $\tilde b=h_nb$, $r=(\tilde a,\tilde b)^T$ and $r_0=(m(x_0),h_nm'(x_0))^T$. Denote
$$\ell_n(r)=\sum_{j=1}^n\rho\big(Y_j-a-b(X_j-x_0)\big)\alpha_jK_j=\sum_{j=1}^n\rho\Big(Y_j-\tilde a-\tilde b\,\frac{X_j-x_0}{h_n}\Big)\alpha_jK_j.$$
In order to simplify the notation, denote $\hat m_n(x_0)$, $m(x_0)$, $\hat m'_n(x_0)$ and $m'(x_0)$ by $\hat m_n$, $m$, $\hat m'_n$ and $m'$, respectively. Let $S_\tau$ denote the circle centered at $r_0$ with radius $\tau$. We will show that, for any sufficiently small $\tau$,
$$\lim_{n\to\infty}P\Big\{\inf_{r\in S_\tau}\ell_n(r)>\ell_n(r_0)\Big\}=1.\tag{6.2}$$
In fact, by Taylor's expansion, we have
$$\ell_n(r)-\ell_n(r_0)=\ell'_n(r_0)^T(r-r_0)+\frac12(r-r_0)^T\ell''_n(r^*)(r-r_0),\tag{6.3}$$
where $r\in S_\tau$ and $r^*=(\tilde a^*,\tilde b^*)$ lies between $r_0$ and $r$. Note that
$$\ell'_n(r_0)=-\sum_{j=1}^n\psi\big(Y_j-m-m'(X_j-x_0)\big)\alpha_jK_j\big(1,\,(X_j-x_0)/h_n\big)^T.$$
It follows from Lemma 6.2 that
$$\frac{1}{nh_n}\ell'_n(r_0)=-\frac12h_n^2\,\gamma_\varepsilon(x_0)m''(x_0)f_X(x_0)\Big(\frac{s_2}{\alpha^2(x_0)},\,-\frac{s_3}{\alpha^3(x_0)}\Big)^T(1+o_p(1))-\frac{1}{nh_n}\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j\Big(1,\,\frac{X_j-x_0}{h_n}\Big)^T.$$
By directly calculating the mean and variance, the second term on the right-hand side above is $o_p(1)$. Thus
$$\ell'_n(r_0)=o_p(nh_n).\tag{6.4}$$
Note that
$$\frac{1}{nh_n}\ell''_n(r)=\frac{1}{nh_n}\sum_{j=1}^n\Big\{\Big[\psi'\Big(Y_j-\tilde a-\tilde b\,\frac{X_j-x_0}{h_n}\Big)-\psi'(\varepsilon_j)\Big]+\psi'(\varepsilon_j)\Big\}\,\alpha_jK_j\begin{pmatrix}1 & \frac{X_j-x_0}{h_n}\\[0.5ex] \frac{X_j-x_0}{h_n} & \frac{(X_j-x_0)^2}{h_n^2}\end{pmatrix}\equiv M_{n1}+M_{n2}.$$
It is obvious from Lemma 6.1 that
$$M_{n2}=\begin{pmatrix}s_0 & -\frac{s_1}{\alpha(x_0)}\\[0.5ex] -\frac{s_1}{\alpha(x_0)} & \frac{s_2}{\alpha^2(x_0)}\end{pmatrix}\gamma_\varepsilon(x_0)f_X(x_0)(1+o_p(1))\equiv\gamma_\varepsilon(x_0)f_X(x_0)\,A\,(1+o_p(1)).$$
Note that, for $|X_j-x_0|\le h_n/\underline\alpha$ and $r\in S_\tau$, we have
$$\max_j\Big|Y_j-\tilde a-\tilde b\,\frac{X_j-x_0}{h_n}-\varepsilon_j\Big|\le\max_j|R(X_j)|+(1+1/\underline\alpha)\,\tau.$$
Then, it follows from Lemma 6.1 that
$$\limsup_{\tau\to0}\,\limsup_{n\to\infty}\,\|M_{n1}\|=0\qquad\text{in probability}.$$
Let $a$ be the smallest eigenvalue of the positive definite matrix $A$. Then, for sufficiently small $\tau$,
$$\lim_{n\to\infty}P\Big\{\inf_{r\in S_\tau}\frac{1}{nh_n}(r-r_0)^T\ell''_n(r^*)(r-r_0)>0.5\,a\,f_X(x_0)\gamma_\varepsilon(x_0)\,\tau^2\Big\}=1.$$
This, together with (6.3) and (6.4), establishes (6.2). By (6.2), $\ell_n(r)$ has a local minimum in the interior of $S_\tau$. Since (2.3) and (2.4) must be satisfied at a local minimum, let $(\hat m_n,h_n\hat m'_n)$ be the closest such root to $r_0$. Then
$$\lim_{n\to\infty}P\big\{(\hat m_n-m)^2+h_n^2(\hat m'_n-m')^2\le\tau^2\big\}=1.$$
This implies the conclusion of Theorem 3.1. $\Box$
Proof of Theorem 3.2. Let $\hat\eta_j=R(X_j)-(\hat m_n-m)-(\hat m'_n-m')(X_j-x_0)$. Then,
$$Y_j-\hat m_n-\hat m'_n(X_j-x_0)=\varepsilon_j+\hat\eta_j.$$
It follows from (2.3) and (2.4) that
$$\sum_{j=1}^n\big\{\psi(\varepsilon_j)+\psi'(\varepsilon_j)\hat\eta_j+[\psi(\varepsilon_j+\hat\eta_j)-\psi(\varepsilon_j)-\psi'(\varepsilon_j)\hat\eta_j]\big\}\,\alpha_jK_j\begin{pmatrix}1\\ \frac{X_j-x_0}{h_n}\end{pmatrix}=0.\tag{6.5}$$
Note that the second term in the left-hand side of (6.5) is
$$\sum_{j=1}^n\psi'(\varepsilon_j)R(X_j)\alpha_jK_j\begin{pmatrix}1\\ \frac{X_j-x_0}{h_n}\end{pmatrix}-\sum_{j=1}^n\psi'(\varepsilon_j)\alpha_jK_j\begin{pmatrix}1 & \frac{X_j-x_0}{h_n}\\[0.5ex] \frac{X_j-x_0}{h_n} & \frac{(X_j-x_0)^2}{h_n^2}\end{pmatrix}\begin{pmatrix}\hat m_n-m\\ h_n(\hat m'_n-m')\end{pmatrix}\equiv L_{n1}+L_{n2}.$$
Applying Lemma 6.1, we obtain
$$L_{n1}=\frac{nh_n^3\,\gamma_\varepsilon(x_0)}{2\alpha^2(x_0)}\,f_X(x_0)m''(x_0)\begin{pmatrix}s_2\\ -\frac{s_3}{\alpha(x_0)}\end{pmatrix}(1+o_p(1))$$
and
$$L_{n2}=-\gamma_\varepsilon(x_0)f_X(x_0)\,nh_n\,A\,(1+o_p(1))\begin{pmatrix}\hat m_n-m\\ h_n(\hat m'_n-m')\end{pmatrix},$$
where $A$ is given in the proof of Theorem 3.1. Note that, by the consistency of $(\hat m_n,h_n\hat m'_n)$,
$$\sup_{j:|X_j-x_0|\le h_n/\underline\alpha}|\hat\eta_j|\le\sup_{j:|X_j-x_0|\le h_n/\underline\alpha}|R(X_j)|+|\hat m_n-m|+h_n|\hat m'_n-m'|=O_p\big(h_n^2+(\hat m_n-m)+h_n(\hat m'_n-m')\big)=o_p(1).$$
Then, by condition (A8) and the same argument as in Lemma 6.1, the third term in the left-hand side of (6.5) is
$$o_p(nh_n)\big[h_n^2+(\hat m_n-m)+h_n(\hat m'_n-m')\big],$$
which converges to zero in probability faster than the second term in the left-hand side of (6.5). Let
$$B_n=\frac{m''h_n^2}{2\alpha^2(x_0)}\,A^{-1}\begin{pmatrix}s_2\\ -s_3/\alpha(x_0)\end{pmatrix}(1+o_p(1))=\frac{m''h_n^2}{2(s_0s_2-s_1^2)\alpha^2(x_0)}\begin{pmatrix}s_2^2-s_1s_3\\ \alpha(x_0)(s_1s_2-s_0s_3)\end{pmatrix}(1+o_p(1)).$$
Then, it follows from (6.5) that
$$\begin{pmatrix}\hat m_n-m\\ h_n(\hat m'_n-m')\end{pmatrix}=B_n+\gamma_\varepsilon^{-1}(x_0)f_X^{-1}(x_0)(nh_n)^{-1}A^{-1}(1+o_p(1))\,J_n,\tag{6.6}$$
where $J_n$ is given in Lemma 6.3. The conclusion follows from (6.6), Lemma 6.3 and Slutsky's theorem. $\Box$
Proof of Theorem 4.1. Let $H_n=\mathrm{diag}(1,h_n)$ be a $2\times2$ matrix and write $W_n$ as $\begin{pmatrix}a_{11}&a_{12}\\ a_{21}&a_{22}\end{pmatrix}$. Denote
$$\hat\eta_j=R(X_j)-(\hat m_0-m)-(\hat m'_0-m')(X_j-x_0).$$
Then,
$$\max_{j:|X_j-x_0|\le h_n/\underline\alpha}|\hat\eta_j|\le\max_{j:|X_j-x_0|\le h_n/\underline\alpha}|R(X_j)|+|\hat m_0-m|+h_n|\hat m'_0-m'|=O_p\big(h_n^2+(\hat m_0-m)+h_n(\hat m'_0-m')\big)=O_p\Big(h_n^2+\frac{1}{\sqrt{nh_n}}\Big).$$
By the definition of $\Psi_{1n}(a,b)$ and Lemma 6.1, we have
$$a_{11}=-\sum_{j=1}^n\psi'\big(Y_j-\hat m_0-\hat m'_0(X_j-x_0)\big)\alpha_jK_j=-\sum_{j=1}^n\psi'(\varepsilon_j+\hat\eta_j)\alpha_jK_j=-nh_n\,\gamma_\varepsilon(x_0)f_X(x_0)\,s_0\,(1+o_p(1)).$$
Applying Lemma 6.1 again, we obtain
$$a_{12}=a_{21}=nh_n^2\,\gamma_\varepsilon(x_0)f_X(x_0)\,s_1/\alpha(x_0)\,(1+o_p(1))$$
and
$$a_{22}=-nh_n^3\,\gamma_\varepsilon(x_0)f_X(x_0)\,s_2/\alpha^2(x_0)\,(1+o_p(1)).$$
Combining the last three asymptotic formulae leads to
$$W_n=H_n\begin{pmatrix}-s_0 & \frac{s_1}{\alpha(x_0)}\\[0.5ex] \frac{s_1}{\alpha(x_0)} & -\frac{s_2}{\alpha^2(x_0)}\end{pmatrix}H_n\,nh_n\,\gamma_\varepsilon(x_0)f_X(x_0)\,(1+o_p(1))$$
and
$$W_n^{-1}=-H_n^{-1}\begin{pmatrix}\frac{s_2}{\alpha^2(x_0)} & \frac{s_1}{\alpha(x_0)}\\[0.5ex] \frac{s_1}{\alpha(x_0)} & s_0\end{pmatrix}H_n^{-1}\,\big(\gamma_\varepsilon(x_0)f_X(x_0)nh_n\big)^{-1}(s_0s_2-s_1^2)^{-1}\alpha^2(x_0)\,(1+o_p(1)).$$
In addition, by the definitions of $\hat\eta_j$ and $\Psi_{1n}(a,b)$, we have
$$\Psi_{1n}(\hat m_0,\hat m'_0)=\sum_{j=1}^n\psi\big(Y_j-\hat m_0-\hat m'_0(X_j-x_0)\big)\alpha_jK_j=\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j+\sum_{j=1}^n\psi'(\varepsilon_j)\hat\eta_j\alpha_jK_j+\sum_{j=1}^n\big[\psi(\varepsilon_j+\hat\eta_j)-\psi(\varepsilon_j)-\psi'(\varepsilon_j)\hat\eta_j\big]\alpha_jK_j\equiv I_{n1}+I_{n2}+I_{n3}.\tag{6.7}$$
Further, Lemma 6.1 with $\eta_j=0$ yields
$$I_{n2}=-nh_n\gamma_\varepsilon(x_0)f_X(x_0)s_0(1+o_p(1))(\hat m_0-m)+nh_n^2\gamma_\varepsilon(x_0)f_X(x_0)\frac{s_1}{\alpha(x_0)}(1+o_p(1))(\hat m'_0-m')+\frac{\gamma_\varepsilon(x_0)}{2}nh_n^3s_2m''f_X(x_0)/\alpha^2(x_0)(1+o_p(1))$$
$$=nh_n\gamma_\varepsilon(x_0)f_X(x_0)(1+o_p(1))\Big(-s_0,\ \frac{s_1}{\alpha(x_0)}\Big)\begin{pmatrix}\hat m_0-m\\ h_n(\hat m'_0-m')\end{pmatrix}+\frac{\gamma_\varepsilon(x_0)}{2}nh_n^3s_2m''f_X(x_0)/\alpha^2(x_0)(1+o_p(1)).$$
By (6.7), condition (A8) and the same argument as in Lemma 6.1, we obtain
$$I_{n3}=o_p(nh_n^3)+o_p(nh_n)\big[(\hat m_0-m)+h_n(\hat m'_0-m')\big].$$
Substituting the expressions of $I_{n1}$, $I_{n2}$ and $I_{n3}$, we get
$$\Psi_{1n}(\hat m_0,\hat m'_0)=\frac{\gamma_\varepsilon(x_0)}{2}nh_n^3s_2m''f_X(x_0)/\alpha^2(x_0)(1+o_p(1))+\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j+nh_n\gamma_\varepsilon(x_0)f_X(x_0)(1+o_p(1))\Big(-s_0,\ \frac{s_1}{\alpha(x_0)}\Big)\begin{pmatrix}\hat m_0-m\\ h_n(\hat m'_0-m')\end{pmatrix}.\tag{6.8}$$
Using similar arguments, we can obtain
$$\Psi_{2n}(\hat m_0,\hat m'_0)=-\frac{\gamma_\varepsilon(x_0)}{2}nh_n^4s_3m''f_X(x_0)/\alpha^3(x_0)(1+o_p(1))+\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j(X_j-x_0)+nh_n^2\gamma_\varepsilon(x_0)f_X(x_0)(1+o_p(1))\Big(\frac{s_1}{\alpha(x_0)},\ -\frac{s_2}{\alpha^2(x_0)}\Big)\begin{pmatrix}\hat m_0-m\\ h_n(\hat m'_0-m')\end{pmatrix}.\tag{6.9}$$
It follows from (6.8) and (6.9) that
$$H_nW_n^{-1}\begin{pmatrix}\Psi_{1n}(\hat m_0,\hat m'_0)\\ \Psi_{2n}(\hat m_0,\hat m'_0)\end{pmatrix}=H_nW_n^{-1}\,\frac{\gamma_\varepsilon(x_0)}{2}\,nh_n^3\,m''f_X(x_0)\,\alpha^{-2}(x_0)(1+o_p(1))\,H_n\begin{pmatrix}s_2\\ -s_3/\alpha(x_0)\end{pmatrix}$$
$$+\,H_nW_n^{-1}\,nh_n\,\gamma_\varepsilon(x_0)f_X(x_0)(1+o_p(1))\,H_n\begin{pmatrix}-s_0 & \frac{s_1}{\alpha(x_0)}\\[0.5ex] \frac{s_1}{\alpha(x_0)} & -\frac{s_2}{\alpha^2(x_0)}\end{pmatrix}\begin{pmatrix}\hat m_0-m\\ h_n(\hat m'_0-m')\end{pmatrix}$$
$$+\,H_nW_n^{-1}H_n\begin{pmatrix}\sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j\\[0.5ex] \sum_{j=1}^n\psi(\varepsilon_j)\alpha_jK_j(X_j-x_0)/h_n\end{pmatrix}\equiv J_{n1}+J_{n2}+J_{n3}.\tag{6.10}$$
Simple algebra yields that
$$J_{n1}=-\frac{m''h_n^2}{2(s_0s_2-s_1^2)}\begin{pmatrix}(s_2^2-s_1s_3)/\alpha^2(x_0)\\ (s_1s_2-s_0s_3)/\alpha(x_0)\end{pmatrix}(1+o_p(1))$$
and that
$$J_{n2}=H_n\begin{pmatrix}\hat m_0-m\\ \hat m'_0-m'\end{pmatrix}(1+o_p(1)).$$
Hence, by (4.1) and (6.10), we get
$$\begin{pmatrix}\tilde m_n(x_0)-m(x_0)\\ h_n\big(\tilde m'_n(x_0)-m'(x_0)\big)\end{pmatrix}=\big(nh_n\gamma_\varepsilon(x_0)f_X(x_0)\big)^{-1}\frac{\alpha^2(x_0)}{s_0s_2-s_1^2}\begin{pmatrix}\frac{s_2}{\alpha^2(x_0)} & \frac{s_1}{\alpha(x_0)}\\[0.5ex] \frac{s_1}{\alpha(x_0)} & s_0\end{pmatrix}J_n+\frac{m''h_n^2}{2(s_0s_2-s_1^2)}\begin{pmatrix}(s_2^2-s_1s_3)/\alpha^2(x_0)\\ (s_1s_2-s_0s_3)/\alpha(x_0)\end{pmatrix}+o_P\Big(h_n^2+\frac{1}{\sqrt{nh_n}}\Big),\tag{6.11}$$
where $J_n$ is given in Lemma 6.3. The conclusion follows from (6.11), Lemma 6.3 and Slutsky's theorem. $\Box$
References

[1] Abramson, I. S. (1982). On bandwidth variation in kernel estimation – a square root law. Ann. Statist. 10, 1217–1223.
[2] Bickel, P. J. (1975). One-step Huber estimates in linear models. J. Amer. Statist. Assoc. 70, 428–433.
[3] Breiman, L., Meisel, W. and Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics 19, 135–144.
[4] Brockmann, M., Gasser, T. and Herrmann, E. (1993). Locally adaptive bandwidth choice for kernel regression estimators. J. Amer. Statist. Assoc. 88, 1302–1309.
[5] Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74, 829–836.
[6] Cox, D. D. (1983). Asymptotics for M-type smoothing splines. Ann. Statist. 11, 530–551.
[7] Cunningham, J. K., Eubank, R. L. and Hsing, T. (1991). M-type smoothing splines with auxiliary scale estimation. Comput. Statist. Data Anal. 11, 43–51.
[8] Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196–216.
[9] Fan, J. and Gijbels, I. (1992). Variable bandwidth and local linear regression smoothers. Ann. Statist. 20, 2008–2036.
[10] Fan, J., Hu, T. and Truong, Y. (1994). Robust nonparametric function estimation. Scand. J. Statist. 21, 433–446.
[11] Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. R. Statist. Soc. B 57, 371–394.
[12] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
[13] Gasser, T. and Müller, H.-G. (1979). Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics 757, 23–68. Springer-Verlag, New York.
[14] Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
[15] Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, New York.
[16] Härdle, W. and Tsybakov, A. B. (1988). Robust nonparametric regression with simultaneous scale curve estimation. Ann. Statist. 16, 120–135.
[17] Hall, P. and Marron, J. S. (1988). Variable window width kernel estimates of probability densities. Probab. Theory Related Fields 80, 37–49.
[18] Hall, P., Hu, T. C. and Marron, J. S. (1995). Improved variable window estimators of probability densities. Ann. Statist. 23, 1–10.
[19] Huber, P. J. (1981). Robust Statistics. Wiley, New York.
[20] Hastie, T. and Loader, C. (1993). Local regression: automatic kernel carpentry (with discussion). Statist. Sci. 8, 120–143.
[21] Müller, H.-G. and Stadtmüller, U. (1987). Variable bandwidth kernel estimators of regression curves. Ann. Statist. 15, 182–201.
[22] Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346–1370.
[23] Tsybakov, A. B. (1986). Robust reconstruction of functions by the local-approximation method. Problems of Information Transmission 22, 133–146.
[24] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
[25] Welsh, A. H. (1996). Robust estimation of smooth regression and spread functions and their derivatives. Statist. Sinica 6, 347–366.