Ridging Methods in Local Polynomial Regression

B. Seifert
Th. Gasser
University of Zurich, Department of Biostatistics, Sumatrastrasse 30, CH-8006 Zurich, Switzerland

Abstract

When estimating a regression function r or its ν-th derivative, local polynomials are an attractive choice due to their flexibility and asymptotic performance. Seifert & Gasser (1996) proposed local polynomial ridging to overcome problems of local polynomials with variance for random design while keeping their advantages. In this paper we present a data-adaptive spatial choice of the ridge parameter which outperforms our previously used rule of thumb. The main message is, however, that ridging is a powerful tool for the improvement of local polynomials, whereas the choice of the ridge parameter is not so decisive, but can improve the fit to some extent. Local polynomial ridging is related to methods of shrinking the estimator towards the origin, to adaptive order choice, and to polynomial mixing. Relations between these methods and ridging will also be discussed. The concept of ridging is not restricted to local modifications of estimators. Many penalized estimators like spline smoothers are global ridge estimators. A penalized local polynomial estimator (Seifert & Turlach 1996) shows an attractive performance and shares all asymptotic properties with local polynomials.

1 Introduction

Local polynomials are an attractive method for estimating a regression function or its derivatives. This is due to good asymptotic properties (Fan 1993), to their flexibility, and to the availability of fast algorithms (Seifert, Brockmann, Engel & Gasser 1994). However, problems arise whenever the design becomes sparse or clustered: then the conditional variance is unbounded. As one possible solution for these problems, ridging of local polynomials was proposed by Seifert & Gasser (1996).

This work was supported by project no. 21-52 567.97 of the Swiss NSF.
In contrast to classic ridging, we proposed a singular ridge matrix, leading to unbiased estimators in the case of a polynomial regression function of degree ν. This estimator shares all asymptotic advantages with local polynomials. A further advantage is its simple analytical form and the fact that it offers great flexibility in choosing the degree of polynomial, of kernel, and of bandwidth scheme. For ν = 0 it has bounded conditional variance. A "rule of thumb" for choosing the ridge parameter was successfully used in our software (available at www.unizh.ch/biostat). In this paper we derive and study a locally optimized ridge parameter. An example of the performance is given in figure 1.
Figure 1: Demonstration of ridging. The dashed line is the local linear estimator with Epanechnikov weights (ll), the thick line is the ridge estimator with spatially data-adaptive ridge parameter (ridge), and the thin line is the true regression function (true). n = 50 random uniform design points (tick marks below) and σ = 0.1 were used.

Local polynomial ridging bears a relation to shrinking the local polynomial estimator towards the origin, as done by Fan (1993) to achieve finite variance. Hall & Marron (1997) have generalized his argument. More practically oriented approaches are an adaptive order choice (Fan & Gijbels 1995) and polynomial mixing (Cleveland & Loader 1996). In section 2 we will introduce local linear ridge regression and discuss an optimal choice of the ridge parameter. Asymptotics give some idea when ridging is especially useful. In section 3 we will discuss relations between ridging and other modifications of local polynomials, including penalized smoothing methods. Finally (section 4), we present some numerical work to show that these concepts work.
2 Local Linear Ridge Regression

Data are observed as a set of independent pairs of random variables (X1, Y1), ..., (Xn, Yn), where the Xi are scalar predictors on [0,1] and the Yi are scalar responses. A functional relationship between predictor and response is assumed to be

\[
Y_i = r(X_i) + \varepsilon_i , \tag{1}
\]

where the εi are independent and satisfy E(εi) = 0 and V(εi) = σ²(Xi). The predictors Xi are either "regularly spaced", Xi = F^{-1}(i/n) for some distribution function F (fixed design), or distributed with density f(x) (random design). The local linear estimator of r(x0) is the value of the local regression line a + b(x − x1) at x0, where a and b are chosen to minimize a weighted sum of squares:

\[
\sum_{i=1}^{n} \bigl(Y_i - a - b(X_i - x_1)\bigr)^2 \, K\!\left(\frac{X_i - x_0}{h}\right) \;\to\; \min . \tag{2}
\]
Here K denotes a nonnegative kernel weight function, h is a bandwidth, and x1 is some arbitrary point for centering the local design points. Denote
\[
s_j = \sum_{i=1}^{n} K\!\left(\frac{X_i - x_0}{h}\right) (X_i - x_1)^j \tag{3}
\]

for j = 0, ..., 2, and
\[
T_j = \sum_{i=1}^{n} K\!\left(\frac{X_i - x_0}{h}\right) (X_i - x_1)^j \, Y_i \tag{4}
\]

for j = 0, 1. If we choose
\[
x_1 = \tilde{x} = \sum_{i=1}^{n} K\!\left(\frac{X_i - x_0}{h}\right) X_i \Bigg/ \sum_{i=1}^{n} K\!\left(\frac{X_i - x_0}{h}\right) ,
\]
Figure 2: Tutorial illustration of local linear estimation with 6 points (circles) in the smoothing interval. Radii reflect kernel weights.
we obtain s1 = 0, and the local linear estimator has a particularly simple form:

\[
\hat{r}_{ll}(x_0) = \frac{T_0}{s_0} + \Delta\, \frac{T_1}{s_2} , \tag{5}
\]

where Δ = x0 − x̃. The first summand

\[
\hat{r}_{NW}(x_0) = \frac{T_0}{s_0}
\]
is the locally constant Nadaraya-Watson estimator. The local constant and the local regression line cross at x = x̃, i.e., both estimators coincide for x0 = x̃. The local linear estimator at x = x̃ is the Nadaraya-Watson estimator T0/s0 and has therefore finite variance. If the point of interest x0 is far from the center of the data x̃, and/or the design points in the smoothing interval are lumped in one cluster (s2 small), the local linear estimator r̂_ll(x0) is wiggly, leading to a high variance. If the design is random, we need at least 4 observations in the smoothing window to obtain a finite unconditional variance of the local linear estimator, but even with more points there is no upper bound for the conditional variance (Seifert & Gasser 1996). To overcome the problem with the arbitrarily small denominator s2, we proposed a ridge estimator with ridge parameter R:
\[
\hat{r}(x_0) = \frac{T_0}{s_0} + \Delta\, \frac{T_1}{s_2 + R} . \tag{6}
\]

Obviously, the variance of the ridge estimator r̂(x0) is finite.

The ridge estimator r̂(x0) is a penalized local linear estimator with a penalty on large slopes of the local regression line:

\[
\sum_{i=1}^{n} \bigl(Y_i - a - b(X_i - \tilde{x})\bigr)^2 \, K\!\left(\frac{X_i - x_0}{h}\right) + R\, b^2 \;\to\; \min . \tag{7}
\]

Figure 3: Tutorial illustration of local linear ridge estimation. The ridge parameters R and θ rotate the local regression line around (x̃, r̂_NW(x0)) and hence are equivalent.

Denoting

\[
\theta = \frac{s_2}{s_2 + R} , \tag{8}
\]

it can be seen that the ridge estimator r̂ is a convex combination of the local constant and local linear estimators with weights (1 − θ) and θ:

\[
\hat{r}(x_0) = \frac{T_0}{s_0} + \theta\, \Delta\, \frac{T_1}{s_2}
             = (1 - \theta)\, \frac{T_0}{s_0} + \theta \left( \frac{T_0}{s_0} + \Delta\, \frac{T_1}{s_2} \right) .
\]

For R = 0 we get θ = 1, and the ridge estimator is the local linear one. For R → ∞ we get θ → 0, and ridging yields the Nadaraya-Watson estimator. Thus, ridging provides a compromise between the Nadaraya-Watson estimator with superior variance but problems with bias and the local linear estimator with higher variance but simple bias.

2.1 MSE-optimal Ridge Parameter

Let us for simplicity of notation assume that σ² is independent of x. Noting that the observations Yi are independently distributed, it can easily be verified that the conditional covariance between Tj and Tk given X1, ..., Xn is σ² s̄_{j+k}, where

\[
\bar{s}_j = \sum_{i=1}^{n} K^2\!\left(\frac{X_i - x_0}{h}\right) (X_i - \tilde{x})^j \tag{9}
\]

for j = 0, ..., 2. Thus, the conditional variance of r̂ is

\[
V(\hat{r}(x_0)) = \sigma^2 \left( \frac{\bar{s}_0}{s_0^2}
  + 2\,\Delta\,\theta\, \frac{\bar{s}_1}{s_0 s_2}
  + \Delta^2 \theta^2\, \frac{\bar{s}_2}{s_2^2} \right) , \tag{10}
\]

which is a parabola in θ. For the computation of the bias, let us approximate the regression function locally by a straight line, r(x) = r(x̃) + r'(x̃)(x − x̃) + op(h). Then, the conditional expectation of Tj becomes E(Tj) = r(x̃) sj + r'(x̃) s_{j+1} + op(h). Remembering that s1 = 0 by construction, the conditional bias of the ridge estimator is approximately

\[
\begin{aligned}
B(\hat{r}(x_0)) &= r(\tilde{x}) + \theta\, \Delta\, r'(\tilde{x}) - r(x_0) \\
                &= r(\tilde{x}) + \theta\, \Delta\, r'(\tilde{x}) - \bigl(r(\tilde{x}) + \Delta\, r'(\tilde{x})\bigr) \\
                &= -\, r'(\tilde{x})\, \Delta\, (1 - \theta) ,
\end{aligned} \tag{11}
\]

which is linear in θ. The mean squared error thus is a parabola in θ which has a unique minimum in

\[
\theta_{\min} = \frac{r'(\tilde{x})^2 \Delta^2 \;-\; \sigma^2 \Delta\, \bar{s}_1 / (s_0 s_2)}
                     {r'(\tilde{x})^2 \Delta^2 \;+\; \sigma^2 \Delta^2\, \bar{s}_2 / s_2^2} . \tag{12}
\]

The optimal weight thus is θ_opt = θ_min if θ_min ∈ [0, 1]; otherwise, θ_opt = 0 or 1, respectively.
The optimal weight has a nice interpretation as a variance-bias compromise: the expression σ²Δ²s̄2/s2² in the denominator is the variance of the difference of the local polynomials of degrees 1 and 0. r'(x̃)²Δ² is the squared difference of expectations of these estimators for linear regression functions. σ²Δ s̄1/(s0 s2) is the covariance between the difference of the local polynomials of degrees 1 and 0 and the local polynomial of degree 0. The optimal weight θ_opt is well defined unless the local polynomial estimators of degrees 1 and 0 coincide, i.e. unless x0 = x̃. But in this case the estimator does not depend on θ_opt.
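As an illustration of equations (3)-(6) and (9)-(12), the following Python sketch computes the Nadaraya-Watson, local linear and ridge estimates at a single point together with the MSE-optimal weight. It is a minimal sketch under assumed inputs, not the authors' software: σ² and r'(x̃) are taken as known here, whereas the algorithm described below replaces them by estimates, and all function names are ours.

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel with support [-1, 1]
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def ridge_estimate(x0, X, Y, h, sigma2, r_prime):
    """Local linear ridge estimate at x0 with the MSE-optimal weight of eq. (12).
    sigma2 and r_prime (residual variance and r'(x_tilde)) are assumed known here."""
    w = epanechnikov((X - x0) / h)
    s0 = w.sum()
    x_tilde = (w * X).sum() / s0          # local centre of mass, so that s1 = 0
    delta = x0 - x_tilde                  # Delta = x0 - x_tilde
    d = X - x_tilde
    s2 = (w * d**2).sum()                 # eq. (3), j = 2, centred at x_tilde
    T0 = (w * Y).sum()                    # eq. (4), j = 0
    T1 = (w * d * Y).sum()                # eq. (4), j = 1
    w2 = w**2                             # eq. (9): kernel-squared moments
    sb1, sb2 = (w2 * d).sum(), (w2 * d**2).sum()
    # eq. (12): MSE-optimal convex weight, clipped to [0, 1]
    num = r_prime**2 * delta**2 - sigma2 * delta * sb1 / (s0 * s2)
    den = r_prime**2 * delta**2 + sigma2 * delta**2 * sb2 / s2**2
    theta = min(max(num / den, 0.0), 1.0) if den > 0 else 1.0
    nw = T0 / s0                          # Nadaraya-Watson estimate
    ll = nw + delta * T1 / s2             # local linear estimate, eq. (5)
    # (s2 can be arbitrarily small for clustered designs; this is what ridging guards against)
    ridge = (1.0 - theta) * nw + theta * ll   # convex-combination form of eqs. (6) and (8)
    return nw, ll, ridge, theta

# small illustration on simulated data (setting chosen by us, not from the paper)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, 50))
Y = np.sin(5 * X) + 0.5 * rng.standard_normal(50)
print(ridge_estimate(0.5, X, Y, h=0.15, sigma2=0.25, r_prime=5 * np.cos(2.5)))
```

With θ = 1 this reduces to the local linear estimator and with θ = 0 to the Nadaraya-Watson estimator, as in (8).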
Algorithm for spatially adaptive ridging
In order to keep the algorithm as simple as possible, we decided to propose the following algorithm.

1. Estimate r'(x̃) using local quadratic ridging. Use the bandwidth h1 = 1.2 h n^{−2/35} and the non-adaptive ridge parameter given by the rule of thumb (see below).
2. Estimate σ² with the estimator by Gasser, Sroka & Jennen-Steinmetz (1986).
3. Estimate r(x0) using local linear ridging with the data-adaptive weight θ_opt from (12), i.e. the ridge parameter R = s2 (1 − θ_opt)/θ_opt.

Since the algorithm has to work also for small samples and sparse designs, we included some precautions:

h1 ≥ 5/n ;
replace r̂'(x̃) at the boundary by ∫_{0.1}^{0.9} |r̂'(x)| dx / 0.8 ;
R ≤ R_thumb .

Remarks: The order n^{−2/35} is the asymptotically optimal order for the optimal h1/h; the constant 1.2 is a tuning parameter which should work well for a wide range of applications. Bounding the ridge parameter R from above by R_thumb is an additional precaution.
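A rough sketch of the three-step procedure, reusing epanechnikov and ridge_estimate from the sketch above, may look as follows. It is only meant to fix ideas: the pilot step substitutes a plain weighted quadratic fit for the local quadratic ridging used by the authors, and the boundary precaution and the bound R ≤ R_thumb are omitted.

```python
import numpy as np

def gsjs_variance(X, Y):
    # Difference-based variance estimator of Gasser, Sroka & Jennen-Steinmetz (1986)
    o = np.argsort(X)
    x, y = X[o], Y[o]
    a = (x[2:] - x[1:-1]) / (x[2:] - x[:-2])
    b = (x[1:-1] - x[:-2]) / (x[2:] - x[:-2])
    eps = a * y[:-2] + b * y[2:] - y[1:-1]            # pseudo-residuals
    c2 = 1.0 / (a**2 + b**2 + 1.0)
    return np.sum(c2 * eps**2) / (len(x) - 2)

def pilot_derivative(x0, X, Y, h1):
    # Step 1, simplified: weighted local quadratic fit around x0
    # (the paper uses local quadratic *ridging* with the rule-of-thumb parameter here).
    w = epanechnikov((X - x0) / h1)
    use = w > 0
    if use.sum() < 3:
        return 0.0
    coef = np.polyfit(X[use] - x0, Y[use], deg=2, w=np.sqrt(w[use]))
    return coef[1]                                    # slope of the fitted parabola at x0

def adaptive_ridge_curve(X, Y, h, grid):
    n = len(X)
    h1 = max(1.2 * h * n ** (-2.0 / 35.0), 5.0 / n)   # pilot bandwidth, precaution h1 >= 5/n
    sigma2 = gsjs_variance(X, Y)                      # step 2
    return np.array([ridge_estimate(x0, X, Y, h, sigma2,          # step 3
                                    pilot_derivative(x0, X, Y, h1))[2]
                     for x0 in grid])
```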
Algorithm for simplified data-adaptive ridging
For small n, the estimation of the derivative might be a problem. Hence, a simplified ridge parameter is also of interest. The algorithm for spatially adaptive ridging is then modified in the following way:

1'. Replace r̂'(x̃) everywhere by ∫_{0.1}^{0.9} |r̂'(x)| dx / 0.8.
Rule of thumb for the ridge parameter
The rule of thumb has been used in our software since 1994, but the constants were never published. For p = ν + 1 the rule is as follows:

\[
\theta_{\mathrm{thumb}} = \frac{s_2}{s_2 + R_{\mathrm{thumb}}} ,
\qquad
R_{\mathrm{thumb}} = \frac{5\, (x_0 - \bar{x})^2\, h^2 \sqrt{nh}}{(2\nu + 3)(2\nu + 5)} .
\]

A ridge parameter R ∝ (x0 − x̄)² was proposed by Seifert & Gasser (1996). When x0 is close to x̄, the design is locally well behaving and consequently little ridging is necessary. The other terms have been chosen by asymptotic considerations: R is added to s_{2ν+2} and should be of lower order than this term. For small nh and large |x0 − x̄|, the ridge parameter R should, however, make a sizable contribution.
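A direct transcription of this rule could look as follows; the grouping of the constant 5 follows our reading of the formula above, and x_bar denotes the local centre of the design, as in the formula.

```python
import numpy as np

def ridge_thumb(x0, x_bar, h, n, nu=0):
    # Rule-of-thumb ridge parameter; nu is the order of the derivative to be estimated.
    return 5.0 * (x0 - x_bar) ** 2 * h ** 2 * np.sqrt(n * h) / ((2 * nu + 3) * (2 * nu + 5))
```

The corresponding weight s2/(s2 + ridge_thumb(...)) tends to 1 as nh grows, so the rule intervenes noticeably only for small nh or for x0 far from x̄.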
2.2 Some Asymptotics
Ridging was introduced to overcome problems with the variance of local linear estimators in regions where design points are sparse or clumped. There, classical asymptotic theory does not give an adequate description of the behavior. Asymptotic theory can, however, show whether we have to pay a price in good-natured situations versus a gain in poor situations. Because of the minimax-optimality of local linear estimators (see Fan 1993) we cannot expect that ridging is asymptotically superior. In this section, we will consider the asymptotic variance, bias and MSE of the local linear ridge estimator with a data-adaptive parameter. Consider weights K with support [−1, 1], suppose h → 0 and nh³ → ∞, and assume standard regularity conditions. We will present all formulae with the true σ² > 0 and r'(x0) ≠ 0, but note that they remain unchanged if we replace them by consistent estimates. Denote

\[
\mu_j = \int u^j K(u)\, du \qquad \text{and} \qquad \nu_j = \int u^j K^2(u)\, du .
\]

The integrals go from −1 to 1 in the interior, and from c to 1 and from −1 to c, respectively, at the boundaries. Standard asymptotics (see, e.g., Fan & Gijbels 1996) yield formulae for local polynomials centered at x0. Using the superscript "(0)" for local polynomials centered at x0,

\[
\begin{aligned}
\frac{1}{nh}\, s_j^{(0)}
&= h^j \int u^j K(u)\, f(x_0 + uh)\, du + O_p\!\bigl(h^j (nh)^{-1/2}\bigr) \\
&= h^j f(x_0)\, \mu_j + h^{j+1} f'(x_0)\, \mu_{j+1} + O_p(h^{j+2}) + O_p\!\bigl(h^j (nh)^{-1/2}\bigr) \\
&= \bigl(h^j f(x_0)\, \mu_j + h^{j+1} f'(x_0)\, \mu_{j+1}\bigr)\, (1 + o_p(1)) .
\end{aligned}
\]

Replace μ by ν in these expansions to obtain the asymptotics for the s̄_j^{(0)} in (9). By construction we obtain
\[
\begin{aligned}
s_0 &= s_0^{(0)} , \\
s_1 &= s_1^{(0)} + \Delta\, s_0^{(0)} , \\
s_2 &= s_2^{(0)} + 2\Delta\, s_1^{(0)} + \Delta^2 s_0^{(0)} ,
\end{aligned}
\]

and similarly for the s̄_j. From s1 = 0 it follows that

\[
\Delta = -\, \frac{s_1^{(0)}}{s_0^{(0)}} .
\]

In the interior we have μ1 = μ3 = ν1 = ν3 = 0. Let us omit the dependence of r, r', f and f' on x0. Plugging the above relations into (12), the weight θ_min can be approximated as

\[
\theta_{\min} = 1 - \frac{1}{nh^3}\, \frac{\sigma^2 \nu_0}{r'^2 f\, \mu_0 \mu_2}\, (1 + o_p(1)) .
\]

Thus, a data-adaptive ridge parameter makes sense asymptotically, since θ_opt = θ_min < 1. The corresponding ridge parameter R in (6) becomes

\[
R_{\mathrm{opt}} = s_2\, \frac{1 - \theta_{\mathrm{opt}}}{\theta_{\mathrm{opt}}}
               = \frac{\sigma^2 \nu_0}{r'^2 \mu_0}\, (1 + o_p(1)) ,
\]

i.e. it is independent of n and h. The asymptotic variance of the ridge estimator with data-adaptive parameter is

\[
V_{\mathrm{opt}} = V_{ll} - \frac{1}{(nh)^2}\, \frac{2\, \sigma^4 f'^2 \nu_0^2}{r'^2 f^4 \mu_0^4}
                 + o_p\!\left(\frac{1}{(nh)^2}\right) .
\]

This gain is of second order (V_ll = Op((nh)^{−1})), but can become important for small h. At the boundary, the optimal weight is θ_opt = 1 − Op((nh³)^{−1}), as in the interior. However, the effect of ridging is larger:

\[
V_{\mathrm{opt}} = V_{ll} - O_p\!\left(\frac{1}{n^2 h^4}\right) .
\]

The bias for linear regression functions becomes in the interior

\[
B_{\mathrm{opt}} = \frac{1}{nh}\, \frac{\sigma^2 f' \nu_0}{r' f^2 \mu_0^2}
                 + o_p\!\left(\frac{1}{nh}\right) .
\]

The optimal mean squared error is then

\[
\mathrm{MSE}_{\mathrm{opt}} = \mathrm{MSE}_{ll}
  - \frac{1}{(nh)^2}\, \frac{\sigma^4 f'^2 \nu_0^2}{r'^2 f^4 \mu_0^4}
  + o_p\!\left(\frac{1}{(nh)^2}\right) .
\]

The gain in MSE is of the same order as for the variance, but the squared bias has halved that gain. For general regression functions, the mean squared error of the data-adaptive ridge estimator is

\[
\mathrm{MSE}_{\mathrm{opt}} = \mathrm{MSE}_{ll}
  + \frac{h^2}{nh}\, \frac{\sigma^2 r'' f' \mu_2 \nu_0}{r' f^2 \mu_0^3}
  + o_p\!\left(\frac{h^2}{nh}\right) .
\]
Figure 4: Illustration of the bias of the local constant and local linear estimators in the interior. Points are observations without noise following a fixed design with a standard normal design density f and observations in (−1, 1). Dots are the Nadaraya-Watson estimator (NW), and dashes are the local linear estimator (ll), both with h = 0.5.
The difference between MSE_ll and MSE_opt is of the same order as (V_ll − V_NW). Unfortunately, data-adaptive ridging is asymptotically not always preferable to local linear estimation. For r''f'/r' > 0, the local linear estimator is preferable.
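For a concrete sense of these constants, the following snippet (ours) evaluates the kernel moments μ0, μ2, ν0 for the Epanechnikov kernel by quadrature and the resulting interior value R_opt ≈ σ²ν0/(r'²μ0) for assumed values of σ² and r'.

```python
from scipy.integrate import quad

K = lambda u: 0.75 * (1.0 - u**2)                 # Epanechnikov kernel on [-1, 1]
mu0 = quad(K, -1, 1)[0]                           # = 1
mu2 = quad(lambda u: u**2 * K(u), -1, 1)[0]       # = 1/5
nu0 = quad(lambda u: K(u)**2, -1, 1)[0]           # = 3/5

sigma2, r_prime = 0.25, 2.0                       # assumed values, for illustration only
R_opt = sigma2 * nu0 / (r_prime**2 * mu0)         # asymptotic interior ridge parameter
print(mu0, mu2, nu0, R_opt)                       # 1.0 0.2 0.6 0.0375
```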
2.3 Superefficient Estimation
The reason for the possible asymptotic loss of efficiency is the local linear approximation of the regression function in (11). When approximating r locally by a polynomial of degree 2 or higher, we can use the fact that the Nadaraya-Watson and local linear estimators have the same asymptotic variance. Thus, we can improve the bias without increasing the variance. The tutorial figure 4 illustrates this situation. At the right-hand side of the figure (f' > 0, r' < 0, r'' < 0), ridging increases the bias and hence the MSE. At the left-hand side, ridging reduces the bias. When approximating r locally by a parabola, calculations similar to those in section 2.1 lead to an optimal weight

\[
\theta_{\min} = 1 + \frac{r'' f}{2\, r' f'}\, (1 + o_p(1)) .
\]
If we omit the restriction θ ∈ [0, 1], the resulting estimator is nearly everywhere superefficient, with a gain in MSE of

\[
\mathrm{MSE}_{ll} - \mathrm{MSE}_{\mathrm{opt}} \;\propto\; h^4 r''^2 .
\]

This approach is, however, more a trick than a real improvement. If the regularity conditions necessary for such an improvement are fulfilled, the improved estimator has to be compared with local cubic estimators. This artifact is the main reason why we use a linear approximation of the regression function for the computation of the optimal ridge parameter.
3 Modifications of Local Polynomials

3.1 Choice of Polynomial Degree
Fan & Gijbels (1995) proposed a local choice of the polynomial degree. When choosing between degrees 0 and 1, this corresponds to a choice of the weight θ = 0 or 1 (see (8)). Fan & Gijbels propose to estimate the bias of the estimators using a local approximation of the regression function by a polynomial of degree 3. As discussed in section 2.3 above, this approximation does not always yield a variance-bias compromise. Asymptotically, the estimator with the smallest bias will be chosen. Cleveland & Loader (1996) introduced mixing of local polynomials. The mixing degree is just the weight θ in (8). Cleveland & Loader proposed to estimate the optimal polynomial degree at each x0 using cross-validation. Cross-validation, too, does not always yield a variance-bias compromise. Asymptotically, the estimator with the smallest bias will be chosen. Both modifications do not converge to the local linear estimator (θ ↛ 1).
3.2 Shrinking Towards an Estimator
Fan (1993) modified the local linear estimator to make its variance finite:

\[
\tilde{r} = \frac{s_0 s_2}{s_0 s_2 + n^{-2}}\; \hat{r}_{ll} .
\]
Hall & Marron (1997) generalized this concept to shrinking the local linear estimator towards another estimator r̂0 with finite variance:

\[
\tilde{r} = \frac{s_0 s_2}{s_0 s_2 + \lambda}\; \hat{r}_{ll} + \frac{\lambda}{s_0 s_2 + \lambda}\; \hat{r}_0 .
\]

Hence, when using the Nadaraya-Watson estimator for r̂0, shrinking is equivalent to ridging, and the shrinkage parameter becomes λ = s0 R, where R is the ridge parameter in (6). Hall & Marron show that for small λ the choice of r̂0 is irrelevant. They choose r̂0 ≡ 0. For large shrinkage parameters, however, this leads to substantial bias. Hence, Hall & Marron assume that λ = op(n²h⁶), which is equivalent to

R = op(nh⁵)  or  θ = 1 − op(h²).

The optimal ridge parameter fulfills this assumption only for nh⁵ → ∞, i.e. for oversmoothing. For our rule of thumb, λ_thumb = Op(nh⁴ √(nh)). Thus, our rule of thumb fulfills Hall & Marron's condition.
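The equivalence λ = s0 R can also be checked numerically; the following snippet (ours, on simulated data) shrinks the local linear estimate towards the Nadaraya-Watson estimate with λ = s0 R and recovers the ridge estimate of (6).

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 40))
Y = np.sin(5 * X) + 0.3 * rng.standard_normal(40)
x0, h, R = 0.3, 0.2, 0.05

u = (X - x0) / h
w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)    # Epanechnikov weights
s0 = w.sum()
x_tilde = (w * X).sum() / s0
d = X - x_tilde
delta = x0 - x_tilde
s2, T0, T1 = (w * d**2).sum(), (w * Y).sum(), (w * d * Y).sum()

nw = T0 / s0
ll = nw + delta * T1 / s2
ridge = nw + delta * T1 / (s2 + R)                      # eq. (6)

lam = s0 * R                                            # shrinkage parameter
shrunk = (s0 * s2 * ll + lam * nw) / (s0 * s2 + lam)    # Hall & Marron form with r0 = NW
print(np.isclose(ridge, shrunk))                        # True
```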
3.3 Imputation of Data
Hall & Turlach (1997) proposed to add pseudo-data to overcome problems of local linear smoothing with sparse regions in the design. They proposed interpolation of data. Ridging can also be interpreted as if we temporarily add pseudo-observations at x0 for the estimation of r(x0). Let us see what happens if we add n0 observations Y0 at x0. The local polynomials centered at x0 change in the following way:
\[
\begin{aligned}
s_0^{(\mathrm{add})} &= s_0 + n_0 K(0) , &\qquad T_0^{(\mathrm{add})} &= T_0 + n_0 K(0)\, Y_0 , \\
s_j^{(0)(\mathrm{add})} &= s_j^{(0)} \ \text{ for } j > 0 , &\qquad T_j^{(0)(\mathrm{add})} &= T_j^{(0)} \ \text{ for } j > 0 .
\end{aligned}
\]

The local linear estimator with imputed data then is

\[
\hat{r}_{ll}^{(\mathrm{add})}(x_0)
 = \frac{s_2^{(0)} \bigl(T_0 + n_0 K(0) Y_0\bigr) - s_1^{(0)} T_1^{(0)}}
        {\bigl(s_0 + n_0 K(0)\bigr)\, s_2^{(0)} - \bigl(s_1^{(0)}\bigr)^2} .
\]

We can equate this to the ridge estimator

\[
\hat{r}(x_0)
 = \frac{\bigl(s_2^{(0)} + R\bigr) T_0 - s_1^{(0)} T_1^{(0)}}
        {s_0 \bigl(s_2^{(0)} + R\bigr) - \bigl(s_1^{(0)}\bigr)^2} .
\]

As can easily be seen, the estimators are equal if

\[
Y_0 = \frac{T_0}{s_0} ,
\]

i.e. the Nadaraya-Watson estimator of r, and

\[
n_0 = \frac{s_0\, R}{K(0)\, s_2^{(0)}} .
\]

Thus, ridging can be interpreted as if we impute n0 pseudo-observations r̂_NW(x0) at x0. For the optimal ridge parameter,

\[
n_{0,\mathrm{opt}} \;\propto\; \frac{\sigma^2}{h^2 r'^2} ,
\]

and n_{0,thumb} = Op(h² √(nh)) for our rule of thumb. For bandwidths of optimal order h = Op(n^{−1/5}), optimal ridging thus is equivalent to temporarily adding O(h^{−2}) = O(√(nh)) observations at x0; our rule of thumb is equivalent to adding O(1) observations. Shrinking towards r̂0, as proposed by Hall & Marron, is equivalent to adding

\[
n_0 = \frac{\lambda}{K(0)\, s_2^{(0)}}
\]

observations r̂0 at x0.
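The imputation interpretation can be verified numerically as well. The snippet below (ours) adds n0 = s0 R / (K(0) s2^{(0)}) pseudo-observations with value r̂_NW(x0) at x0 and recovers the ridge estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 1, 40))
Y = np.sin(5 * X) + 0.3 * rng.standard_normal(40)
x0, h, R = 0.7, 0.2, 0.05
K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

w = K((X - x0) / h)
d0 = X - x0                                        # centred at x0: the "(0)" quantities
s0, s1_0, s2_0 = w.sum(), (w * d0).sum(), (w * d0**2).sum()
T0, T1_0 = (w * Y).sum(), (w * d0 * Y).sum()

# ridge estimator, written in the centred-at-x0 form given above
ridge = ((s2_0 + R) * T0 - s1_0 * T1_0) / (s0 * (s2_0 + R) - s1_0**2)

Y0 = T0 / s0                                       # Nadaraya-Watson value at x0
n0 = s0 * R / (K(0.0) * s2_0)                      # number of pseudo-observations
num = s2_0 * (T0 + n0 * K(0.0) * Y0) - s1_0 * T1_0
den = (s0 + n0 * K(0.0)) * s2_0 - s1_0**2
print(np.isclose(ridge, num / den))                # True
```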
3.4 Penalized Local Polynomials

Local polynomial estimators have a lot of advantages, perhaps the most important one being their flexibility. Unfortunately, the estimators look rather rough, even at the optimal bandwidth. Avoiding this roughness by oversmoothing results in a loss of efficiency. Ridge estimators as discussed before reduce that problem, but they do not penalize for roughness. Smoothing splines are the only nonparametric regression estimators which are really smooth by construction. They pay with a reduced flexibility and need "smoother" regression functions. Seifert & Turlach (1996) proposed a local polynomial ridge estimator which keeps all advantages of local polynomials, but which, by choosing a global ridging matrix similar in spirit to that of smoothing splines, avoids the rough look, leading to an improved MISE behavior for finite samples. In this way, it provides an elegant way of solving problems with sparse regions in the realization of the design. For ν = 0 and p = 1, the method can be described as follows. Denote

\[
X = \begin{pmatrix} 1 & (X_1 - x_1) \\ \vdots & \vdots \\ 1 & (X_n - x_1) \end{pmatrix} ,
\qquad
W = \mathrm{diag}\!\left( K\!\left(\frac{X_i - x_0}{h}\right) \right) ,
\qquad
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} ,
\]

\[
S = X'WX = \begin{pmatrix} s_0 & s_1 \\ s_1 & s_2 \end{pmatrix} ,
\qquad
T = X'WY = \begin{pmatrix} T_0 \\ T_1 \end{pmatrix}
\]

with elements sj from (3) and Tj from (4), and let

\[
u = \begin{pmatrix} 1 \\ x_0 - x_1 \end{pmatrix} .
\]

The goal is to estimate r(x) at an output grid x01, ..., x0m. For every x0j, we have a local polynomial estimator r̂_ll(x0j), leading to an additional index j in all formulae. The estimator of r = (r(x01), ..., r(x0m))' = col_{j=1,...,m}(r(x0j)) can be obtained in one formula combining all components into large block matrices and vectors: Denote Sj = Xj'WjXj, Tj = Xj'WjY, and βj = (aj, bj)', and combine them into large blocks S = diag_{j=1,...,m}(Sj), T = col_{j=1,...,m}(Tj), and β = col_{j=1,...,m}(βj) of length 2m. Then, the local linear estimator of β is the minimizer of a weighted sum of squares:

\[
\sum_{j=1}^{m} \sum_{i=1}^{n} \bigl(Y_i - a_j - b_j (X_i - x_{1j})\bigr)^2 \, K\!\left(\frac{X_i - x_{0j}}{h_j}\right) \;\to\; \min ,
\]

which has the solution β̂_ll = S^{−1}T. The local linear estimator of r = col_j(r(x0j)) is r̂_ll = (I_m ⊗ u') β̂_ll. A global ridge estimator is a minimizer of a penalized weighted sum of squares

\[
\sum_{j=1}^{m} \sum_{i=1}^{n} \bigl(Y_i - a_j - b_j (X_i - x_{1j})\bigr)^2 \, K\!\left(\frac{X_i - x_{0j}}{h_j}\right) + \beta' R \beta \;\to\; \min ,
\]

which has the solution

\[
\hat{r} = (I_m \otimes u')\, (S + R)^{-1}\, T .
\]

The ridge estimator considered so far corresponds to a block-diagonal global ridge matrix R with blocks

\[
\begin{pmatrix} 0 & 0 \\ 0 & R_j \end{pmatrix} .
\]

We want to find a ridging matrix R such that β'Rβ is a smoothness penalty for the regression function. Thus, in some sense, β'Rβ should be an estimator of Σ_j (r^{(q)}(x0j))² for some smoothness order q, e.g. for q = 2. This can be done via divided differences (compare e.g. de Boor 1978, pp. 4ff): Define

\[
D^{(k)} = \mathrm{diag}_{j=1,\ldots,(m-k)}\!\left( \frac{1}{x_{0,j+k} - x_{0j}} \right) ,
\]
a weighting matrix of order (m − k) × (m − k), and

\[
B^{(k)} = \begin{pmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix} ,
\]

a bidiagonal matrix of order (m − k) × (m − k + 1). Then, divided differences of order q of r are obtained as

\[
\nabla^{(q)} r = D^{(q)} B^{(q)} \cdots D^{(1)} B^{(1)}\, r .
\]

We propose the ridging matrix R = (∇^{(q)} ⊗ u')'(∇^{(q)} ⊗ u'). Then, the penalized local linear ridge estimator becomes

\[
\hat{r} = (I_m \otimes u') \left( S + c\, (\nabla^{(q)} \otimes u')' (\nabla^{(q)} \otimes u') \right)^{-1} T ,
\]

where c is some ridge parameter. Note that for an equally spaced output grid, D^{(k)} is proportional to the identity matrix irrespective of the design. For q = 2 the estimator is close in spirit to cubic smoothing splines. However, the minimized sum of squares and the smoothness penalty are different from those used for smoothing splines, where ∫ {r''(x)}² dx is used instead of Σ_j {r̂''(x0j)}². Using the fast algorithm for the computation of local polynomials by Seifert, Brockmann, Engel & Gasser (1994) and the band matrix structure of the ridging matrix, Seifert & Turlach developed a fast algorithm for the computation of the smooth estimator. It is similar to the Reinsch algorithm for splines (compare Green & Silverman 1994, pp. 19-27).

Figure 5: The global ridge estimator for the example in figure 1. Dash-dots are the ridge estimator with spatially data-adaptive ridge parameter (opt), the thick line is the global ridge estimator (global).
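The global estimator shown as "global" in figure 5 can be sketched along the following lines. This is our simplified illustration: the local designs are centred at each output point so that u = (1, 0)', a common bandwidth h and a hand-picked ridge parameter c are used, and the fast band-matrix algorithm of Seifert & Turlach is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, h, q, c = 50, 25, 0.15, 2, 1.0
X = np.sort(rng.uniform(0, 1, n))
Y = np.sin(5 * X) + 0.3 * rng.standard_normal(n)
grid = np.linspace(0.05, 0.95, m)                  # output grid x_01, ..., x_0m
K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

# block-diagonal S = diag(S_j) and stacked T = col(T_j)
S = np.zeros((2 * m, 2 * m))
T = np.zeros(2 * m)
for j, x0 in enumerate(grid):
    w = K((X - x0) / h)
    Xj = np.column_stack([np.ones(n), X - x0])     # local design, centred at x0
    S[2*j:2*j+2, 2*j:2*j+2] = Xj.T @ (w[:, None] * Xj)
    T[2*j:2*j+2] = Xj.T @ (w * Y)

def divided_diff(grid, q):
    # nabla^{(q)} = D^{(q)} B^{(q)} ... D^{(1)} B^{(1)} as an (m - q) x m matrix
    op = np.eye(len(grid))
    for k in range(1, q + 1):
        mk = len(grid) - k
        B = np.zeros((mk, mk + 1))                 # bidiagonal B^{(k)}
        B[np.arange(mk), np.arange(mk)] = -1.0
        B[np.arange(mk), np.arange(mk) + 1] = 1.0
        D = np.diag(1.0 / (grid[k:] - grid[:-k]))  # weighting matrix D^{(k)}
        op = D @ B @ op
    return op

u = np.array([[1.0, 0.0]])                         # u' for centred local designs
P = np.kron(divided_diff(grid, q), u)              # nabla^{(q)} (x) u'
beta = np.linalg.solve(S + c * (P.T @ P), T)
r_hat = np.kron(np.eye(m), u) @ beta               # (I_m (x) u') beta
print(r_hat[:5])
```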
4 A Simulation Study

In a small simulation study the performance of the adaptive choice of R_opt was evaluated. We analyzed the following situations.

Regression functions:
1. bimodal: r(x) = 0.3 exp(−4(4x − 1)²) + 0.7 exp(−16(4x − 3)²), σ = 0.1 (Fan & Gijbels 1995)
2. sine: r(x) = sin(5x), σ = 0.5 (Ruppert, Sheather & Wand 1995)
3. peak: r(x) = 2 − 5x + 5 exp(−400(x − 0.5)²), σ = √0.5 (Seifert & Gasser 1996)

Design densities on [0,1]:
1. uniform U(0, 1)
2. truncated normal N(0.5, 0.5²) ∩ [0, 1]
3. truncated normal N(0, 1) ∩ [0, 1]

These 9 examples were analyzed for fixed and random design, and for 3 residual variances: as described above (σ), small (σ/2), and large (2σ). The following estimators were evaluated: local constant (Nadaraya-Watson), local linear, ridging with spatially adaptive parameter, ridging with simplified data-adaptive parameter, and ridging with the rule of thumb. The k-nearest neighbor rule is a bandwidth scheme intended to achieve approximately constant variance for an estimator (be it a density or regression estimator). As a byproduct it alleviates the variance problems of local polynomials at the price of giving up flexibility in bandwidth choice. Hence, the k-nn rule was included in the evaluations. The Epanechnikov kernel was used throughout the numerical computations. All realizations of the design were stretched to min(Xi) = 0, max(Xi) = 1 before generating the Yi. We used a "steady" bandwidth near the boundary, i.e., the window width in [0,1] is always 2h. For each situation, 400 data sets were generated.

Performance for prespecified bandwidth. First, we are interested in whether data-adaptive ridging works for any given bandwidth. Hence, we generated data with random uniform design, n = 50, the bimodal regression function above, and bandwidths ranging from 1/n to 0.5. Figure 6 shows the mean integrated squared error of these simulations on a log-log scale. As is to be expected, the local linear estimator breaks down below a relatively large bandwidth of 0.1.
Figure 6: MISE versus bandwidth for n = 50 random uniformly distributed design points, r(x) = "bimodal", and σ = 0.1. The dotted line is the Nadaraya-Watson estimator (NW), dashes are the local linear estimator (ll), long dashes are the local linear estimator with k-nearest neighbor bandwidth scheme (knn), solid lines are the ridge estimators with spatially adaptive parameter (opt) and rule of thumb (thumb), and dash-dots are the ridge estimator with simplified data-adaptive parameter.

When the expected number of observations is less than 10, the probability of realizations with sparse regions increases, leading to spikes in the estimator and hence to an arbitrarily large MISE. The k-nearest neighbor rule avoids smoothing windows containing too few observations. The local linear estimator with the k-nn bandwidth scheme hence is stable down to k = 5. With k ≤ 4, the probability increases that these design points are lumped together around one point, and the estimator breaks down as well. The Nadaraya-Watson estimator has a small variance but a large bias, leading to a good behavior for small bandwidths. The ridge estimator with spatially adaptive parameter behaves excellently over the whole range of bandwidths. Ridging with the rule of thumb works for any given bandwidth. For undersmoothing, the ridge parameter is, however, too conservative. Ridging with a simplified data-adaptive parameter performed surprisingly well.
Performance with sample size. Next, we were interested in whether data-adaptive ridging works for every sample size n. Hence, we analyzed data with fixed and random design, 3 design densities, 3 regression functions, and 3 residual variances (see above). These 54 examples were analyzed for n ranging from 25 to 1000. The MISE of each generated data set was analyzed at an estimated MISE-optimal bandwidth.
Figure 7: Efficiency depending on sample size of estimators of the bimodal regression function above for random uniform design and σ = 0.1. Dots are the Nadaraya-Watson estimator (NW), dashes are the local linear estimator (ll) and the local linear estimator with k-nearest neighbor bandwidth scheme (knn), solid lines are the ridge estimators with spatially adaptive parameter (thick line) (opt) and rule of thumb (thumb).
For fixed designs, differences between the local linear and ridge estimators were small. In the interior, differences between the MISEs of the local linear and ridge estimators were less than 1%. At the boundary, however, data-adaptive ridge estimators gained about 10% relative to the local linear estimator. In the following, we will concentrate on the random design, where differences are more important. Figure 7 shows the example of the bimodal regression function and random uniform design. Relative efficiency here is the ratio of the asymptotic MISE of the local linear estimator expected from asymptotic theory to the true finite sample MISE of an estimator. In this example, at least 200 observations are necessary for local linear estimation to behave well. Local linear estimation with the k-nn rule worked well for n ≥ 50. Ridging with spatially adaptive parameter as well as with the rule of thumb worked well for any n. Over all 162 examples (3 design densities × 3 regression functions × 3 residual variances × 6 sample sizes), the mean MISE of the local linear estimator was 230 times higher than that of the spatially adaptive ridge estimator. This ratio was highest for n = 25 (MISE_ll / MISE_opt = 1200) and reduced with increasing sample size to 1.01 for n = 1000. The largest loss of MISE for the ridge estimator compared with the local linear one was 2%. The ridge estimator was always better than the Nadaraya-Watson estimator. In 93% of all examples ridging outperformed the k-nn rule, and always for n ≥ 100. On average, the MISEs of the spatially adaptive ridge estimator were 2% smaller than those of ridging using the rule of thumb and the simplified data-adaptive parameter. In the interior, differences were small, but at the boundary the data-adaptive rule outperformed the conservative rule of thumb by 9%.
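To make the simulation setup concrete, the following snippet (ours) generates one replication of the bimodal setting with random uniform design, n = 50 and σ = 0.1, and defines the integrated squared error whose average over 400 replications per setting gives the MISE values reported above.

```python
import numpy as np

def bimodal(x):
    return 0.3 * np.exp(-4 * (4 * x - 1) ** 2) + 0.7 * np.exp(-16 * (4 * x - 3) ** 2)

rng = np.random.default_rng(4)
n, sigma = 50, 0.1
X = rng.uniform(0, 1, n)
X = (X - X.min()) / (X.max() - X.min())            # stretch the design to [0, 1]
Y = bimodal(X) + sigma * rng.standard_normal(n)

grid = np.linspace(0, 1, 101)
def ise(estimate_on_grid):
    # integrated squared error of an estimate evaluated on the output grid
    return np.trapz((estimate_on_grid - bimodal(grid)) ** 2, grid)

print(ise(np.full_like(grid, Y.mean())))           # ISE of a trivial constant fit, for scale
```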
References

de Boor, C. (1978). A Practical Guide to Splines. New York: Springer.
Cleveland, W. S. & Loader, C. (1996). Smoothing by local regression: Principles and methods. In: Statistical Theory and Computational Aspects of Smoothing, W. Härdle & M. G. Schimek (eds), Physica, 10-49.
Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196-216.
Fan, J. & Gijbels, I. (1995). Adaptive order polynomial fitting: Bandwidth robustification and bias reduction. J. Comp. Graph. Statist. 4, 213-227.
Fan, J. & Gijbels, I. (1996). Local Polynomial Modelling and its Applications. London: Chapman & Hall.
Gasser, T., Sroka, L. & Jennen-Steinmetz, C. (1986). Residual variance and residual pattern in nonlinear regression. Biometrika 73, 625-633.
Green, P. J. & Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. A Roughness Penalty Approach. London: Chapman & Hall.
Hall, P. & Marron, S. (1997). On the role of the shrinkage parameter in local linear smoothing. Probab. Theory Relat. Fields 108, 495-516.
Hall, P. & Turlach, B. (1997). Interpolation methods for adapting to sparse design in nonparametric regression. J. Amer. Statist. Assoc. 92, 466-476.
Ruppert, D., Sheather, S. J. & Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 90, 1257-1270.
Seifert, B. & Gasser, T. (1996). Finite sample variance of local polynomials: Analysis and solutions. J. Amer. Statist. Assoc. 91, 267-275.
Seifert, B. & Turlach, B. (1996). Introducing smoothness into local polynomials. (Draft.)
Seifert, B., Brockmann, M., Engel, J. & Gasser, T. (1994). Fast algorithms for nonparametric curve estimation. J. Comp. Graph. Statist. 3, 192-213.