Data Adaptive Ridging in Local Polynomial Regression

Burkhardt Seifert and Theo Gasser

December 9, 1998. Revision August 30, 1999.

When estimating a regression function or its derivatives, local polynomials are an attractive choice due to their flexibility and asymptotic performance. Seifert and Gasser (1996) proposed ridging of local polynomials to overcome problems with variance for random design while retaining their advantages. In this paper we present a data-independent rule of thumb and a data-adaptive spatial choice of the ridge parameter in local linear regression. In a framework of penalized local least squares regression, the methods are generalized to higher order polynomials, to estimation of derivatives, and to multivariate designs. The main message is that ridging is a powerful tool for improving the performance of local polynomials. A rule of thumb offers drastic improvements; data-adaptive ridging brings further but modest gains in mean squared error.
Key Words: Nonparametric estimation; Nonparametric regression; Smoothing.
1 INTRODUCTION

Local polynomials are an attractive method for estimating a regression function or its derivatives. This is due to good asymptotic properties (Fan 1993), to their flexibility, and to the availability of fast algorithms (Seifert, Brockmann, Engel, and Gasser 1994). However, problems arise whenever the design becomes sparse or clustered: then the conditional variance is unbounded. As one possible solution for these problems, ridging of local polynomials was proposed by Seifert and Gasser (1996).

(Burkhardt Seifert is Statistician and Theo Gasser is Professor, Department of Biostatistics, ISPM, University of Zurich, Sumatrastrasse 30, CH-8006 Zurich, Switzerland; e-mail: [email protected]. Journal of Computational and Graphical Statistics, to appear.)

In contrast to classic ridging, we proposed a singular ridge matrix, leading to unbiased estimators
Figure 1. Demonstration of ridging. The dashed line is the local linear estimator with Epanechnikov weights (ll), the thick line is the ridge estimator with spatially data-adaptive ridge parameter (adapt), and the thin line is the true bimodal regression function defined in section 4 below (true). n = 50 random uniform design points (tick marks below), sigma = 0.1, and a bandwidth of 0.063 (1.2 times the asymptotically optimal bandwidth) were used.
in the case of constant regression functions. This estimator shares all asymptotic advantages with local polynomials. A further advantage is its simple analytical form and the fact that it offers great flexibility in choosing the degree of polynomial, the kernel, and the bandwidth scheme. The estimator of the regression function itself has bounded conditional variance. A "rule of thumb" for choosing the ridge parameter was successfully used in our software (available at www.unizh.ch/biostat). In this paper we derive and study a new rule of thumb and a locally optimized, data-adaptive ridge parameter. An example of the performance is given in figure 1. When oversmoothing (i.e., choosing a relatively large bandwidth), the problems of local polynomials with variance become less dramatic. However, it would be desirable to be able to use any bandwidth with the same faith, quite in the spirit of nonparametric statistics. Ridging local polynomials fulfills this at a low computational cost. Algorithms for local polynomial estimation need only minor modifications. Local polynomial ridging bears a relation to shrinking the local polynomial estimator towards the origin, as done by Fan (1993) to achieve finite variance. Hall and Marron (1997) have generalized his argument. More practically oriented approaches are an adaptive order choice (Fan and Gijbels 1995) and polynomial mixing (Cleveland and Loader 1996). For relations between ridging and these methods see Seifert and Gasser (1998).
In section 2 we will introduce local linear ridge regression and discuss an optimal choice of the ridge parameter. Asymptotics give some idea of when ridging is especially useful. Then (section 3) we will discuss the generalization of the optimal choice of the ridge parameter to higher order polynomials, to estimation of derivatives, and to multivariate designs. The general concept will be demonstrated by way of examples for derivative estimation and two-dimensional local linear smoothing. Finally (section 4), some numerical work shows that these concepts work.
2 LOCAL LINEAR RIDGE REGRESSION

Data are observed as a set of independent pairs of random variables (X_1, Y_1), ..., (X_n, Y_n), where the X_i are scalar predictors on [0,1] and the Y_i are scalar responses. A functional relationship between predictor and response is assumed to be

    Y_i = r(X_i) + epsilon_i ,    (1)

where the epsilon_i are independent and satisfy E(epsilon_i) = 0 and V(epsilon_i) = sigma^2(X_i). The predictors X_i are either "regularly spaced", X_i = F^{-1}(i/n) for some distribution function F (fixed design), or distributed with density f(x) (random design). We want to estimate the regression function r at x_0. The local linear estimator of r(x_0) is the value of the local regression line a + b(x - x_1) at x_0, where a and b are chosen to minimize the weighted sum of squares

    sum_{i=1}^n (Y_i - a - b(X_i - x_1))^2 K((X_i - x_0)/h) .    (2)
Here K denotes a nonnegative kernel weight function, h is a bandwidth, and x_1 is some arbitrary point for centering the local design points. Denote

    s_j = sum_{i=1}^n K((X_i - x_0)/h) (X_i - x_1)^j    (3)

for j = 0, 1, 2, and

    T_j = sum_{i=1}^n K((X_i - x_0)/h) (X_i - x_1)^j Y_i

for j = 0, 1. By choosing
    x_1 = x~ = sum_{i=1}^n K((X_i - x_0)/h) X_i / sum_{i=1}^n K((X_i - x_0)/h) ,    (4)

we obtain s_1 = 0, and the local linear estimator has a particularly simple form:

    rhat_ll(x_0) = T_0/s_0 + Delta T_1/s_2 ,    (5)
Figure 2. Tutorial illustration of local linear ridge estimation with 6 points (circles) in the smoothing interval. Radii reflect kernel weights.
where Delta = x_0 - x~.

Figure 2 illustrates the construction of the local linear estimator. The first summand

    rhat_NW(x_0) = T_0/s_0

in (5) is the locally constant Nadaraya-Watson estimator. The local constant and the local regression line cross at x = x~, i.e., both estimators coincide for x_0 = x~. The local linear estimator at x_0 = x~ is the Nadaraya-Watson estimator T_0/s_0 and has therefore finite variance. If the point of interest x_0 is far from the center of the data x~ (Delta large), and/or the design points in the smoothing interval are lumped in one cluster (s_2 small), then the local linear estimator rhat_ll(x_0) is wiggly, leading to a high variance. If the design is random, we need at least 4 observations in the smoothing window to obtain a finite unconditional variance of the local linear estimator, but even with more points there is no upper bound for the conditional variance (Seifert and Gasser 1996). To overcome the problem with the arbitrarily small denominator s_2, we proposed a ridge estimator

    rhat(x_0) = T_0/s_0 + Delta T_1/(s_2 + R)    (6)

with ridge parameter R. Obviously, the variance of the ridge estimator is finite if the kernel K is bounded from above.
The ridge estimator rhat(x_0) is a penalized local linear estimator with a penalty on large slopes of the local regression line, minimizing

    sum_{i=1}^n (Y_i - a - b(X_i - x~))^2 K((X_i - x_0)/h) + R b^2 .    (7)

Denoting

    rho_R = s_2/(s_2 + R) ,    (8)

it can be seen that the ridge estimator is a convex combination of the local constant and local linear estimators with weights (1 - rho_R) and rho_R:

    rhat(x_0) = T_0/s_0 + rho_R Delta T_1/s_2
              = (1 - rho_R) T_0/s_0 + rho_R ( T_0/s_0 + Delta T_1/s_2 ) .

For R = 0 we get rho_R = 1, and the ridge estimator is the local linear one. For R -> infinity we get rho_R -> 0, and ridging yields the Nadaraya-Watson estimator. Thus, ridging provides a compromise between the Nadaraya-Watson estimator, with finite variance but problems with bias, and the local linear estimator, with unbounded variance but simple bias.
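The construction in (3)-(8) can be sketched in a few lines of code. The following is a minimal illustration (not the authors' implementation), assuming Epanechnikov weights; the function names are hypothetical:

```python
def epanechnikov(u):
    # Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1]
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def ridge_local_linear(x, y, x0, h, R=0.0):
    """Local linear ridge estimate at x0, following eqs. (3)-(6)."""
    w = [epanechnikov((xi - x0) / h) for xi in x]
    s0 = sum(w)
    if s0 == 0.0:
        return None  # empty smoothing window
    # centre the local design at the weighted mean x~ of eq. (4), so s1 = 0
    xt = sum(wi * xi for wi, xi in zip(w, x)) / s0
    s2 = sum(wi * (xi - xt) ** 2 for wi, xi in zip(w, x))
    T0 = sum(wi * yi for wi, yi in zip(w, y))
    T1 = sum(wi * (xi - xt) * yi for wi, xi, yi in zip(w, x, y))
    delta = x0 - xt
    # eq. (6): Nadaraya-Watson part plus ridged slope correction
    return T0 / s0 + delta * T1 / (s2 + R)
```

With R = 0 this reproduces the local linear estimator (5); letting R grow moves the estimate continuously toward the Nadaraya-Watson value, as the convex-combination form above shows.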
2.1 MSE-optimal Ridge Parameter

Let us for simplicity of notation assume that sigma^2 is independent of x. Noting that the observations Y_i are independently distributed, it can be easily verified that the conditional covariance between T_j and T_k given X_1, ..., X_n is sigma^2 s*_{j+k}, where

    s*_j = sum_{i=1}^n K^2((X_i - x_0)/h) (X_i - x~)^j    (9)

for j = 0, 1, 2. Thus, the conditional variance of rhat is

    V(rhat(x_0)) = sigma^2 ( s*_0/s_0^2 + 2 rho_R Delta s*_1/(s_0 s_2) + rho_R^2 Delta^2 s*_2/s_2^2 ) ,

which is a parabola in rho_R. For the computation of the bias, let us approximate the regression function locally by a straight line, r(x) = r(x~) + r'(x~)(x - x~) + o_p(h). Then, the conditional expectation of T_j becomes

    E(T_j) = r(x~) s_j + r'(x~) s_{j+1} + o_p(h) .

Remembering that s_1 = 0 by construction, the conditional bias of the ridge estimator is approximately

    B(rhat(x_0)) ≈ r(x~) + rho_R Delta r'(x~) - r(x_0)
                 ≈ r(x~) + rho_R Delta r'(x~) - ( r(x~) + r'(x~) Delta )
                 = -r'(x~) Delta (1 - rho_R) ,    (10)

which is linear in rho_R. The approximated mean squared error thus is a parabola in rho_R which has a unique minimum in

    rho~_opt = ( r'(x~)^2 Delta^2 - sigma^2 Delta s*_1/(s_0 s_2) ) / ( r'(x~)^2 Delta^2 + sigma^2 Delta^2 s*_2/s_2^2 ) .    (11)

The optimal ridge parameter thus is rho_opt = rho~_opt if rho~_opt is in [0, 1]; otherwise, rho_opt = 0 or 1. The corresponding ridge parameter is R_opt = s_2 (1 - rho_opt)/rho_opt. The optimal ridge parameter has a nice interpretation as a variance-bias compromise: the expression sigma^2 Delta^2 s*_2/s_2^2 in the denominator is the variance of the difference of local polynomials of degrees 1 and 0. The term r'(x~)^2 Delta^2 is the squared difference of expectations of these estimators for linear regression functions. Further, -sigma^2 Delta s*_1/(s_0 s_2) is the covariance between the difference of local polynomials of degrees 1 and 0 and the local polynomial of degree 0. The optimal ridge parameter rho_opt is well defined unless the local polynomial estimators of degrees 1 and 0 coincide, i.e., unless x_0 = x~. But in this case the estimator does not depend on rho_opt. Now, we will develop and study rules for the choice of the ridge parameter.
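The optimal mixing weight of (11), its clipping to [0, 1], and the back-transformation R_opt = s_2 (1 - rho_opt)/rho_opt can be sketched as follows (a hypothetical helper, not the paper's software; all inputs are the local quantities defined above):

```python
def rho_opt(s0, s2, s1_star, s2_star, delta, slope, sigma2):
    """MSE-optimal mixing weight rho of eq. (11); returns (rho, R).

    slope is r'(x~), sigma2 the residual variance; s1_star and s2_star
    are the kernel-squared moments s*_1, s*_2 of eq. (9).
    """
    num = slope ** 2 * delta ** 2 - sigma2 * delta * s1_star / (s0 * s2)
    den = slope ** 2 * delta ** 2 + sigma2 * delta ** 2 * s2_star / s2 ** 2
    rho = num / den
    rho = min(1.0, max(0.0, rho))  # clip to [0, 1]
    # back-transform to the ridge parameter of eq. (6)
    R = float("inf") if rho == 0.0 else s2 * (1.0 - rho) / rho
    return rho, R
```

Note that sigma2 = 0 gives rho = 1 and R = 0, i.e., the plain local linear estimator, in line with the variance-bias interpretation above.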
Rule of thumb for the ridge parameter

One might be interested in a linear estimator which does not rely on information about the regression function and the residual variance. Hence, a choice of the ridge parameter is of interest which is independent of Y and takes into account only the realization of the design. The idea of the rule of thumb is to select the ridge parameter R such that the maximal variance of the slope b = T_1/(s_2 + R) of the local regression line in (6) is related to the variance of the intercept, i.e., the Nadaraya-Watson estimator a = T_0/s_0. We decided to use the bound
    V( Delta T_1/(s_2 + R) ) <= sigma^2 gamma_0 |Delta| / h ,    (12)

where gamma_0 = int K^2(u) du. This bound was motivated by the fact that the variance of T_0/s_0 asymptotically tends to

    V(T_0/s_0) -> sigma^2 gamma_0 / (n h f(x_0)) .

In the worst case, n h f(x_0) may be small, going down to 1, corresponding to 2 observations in the smoothing window, and motivating an upper bound of sigma^2 gamma_0 in (12). When x_0 is close to x~ (Delta small), the local linear estimator is well behaved, and consequently a smaller bound than sigma^2 gamma_0 might be appropriate. We chose a bound proportional to |Delta|/h, leading to a shift- and scale-invariant ridge estimator in (6) and to the bound in (12). The variance of Delta b has the following bound:

    V( Delta T_1/(s_2 + R) ) = sigma^2 Delta^2 s*_2/(s_2 + R)^2
        <= sigma^2 Delta^2 max_u(K(u)) s_2/(s_2 + R)^2    (using s*_2 <= max_u(K(u)) s_2)
        <= sigma^2 Delta^2 max_u(K(u)) / (4 R)            (using x/(x + 1)^2 <= 1/4 for x >= 0).

Consequently, the following rule of thumb for the choice of the ridge parameter of local linear regression,

    R_thumb = max_u(K(u)) |Delta| h / (4 gamma_0) ,

fulfills (12). For Epanechnikov weights, we have max_u(K(u)) = 3/4, gamma_0 = 3/5, and hence R_thumb = 5 |Delta| h / 16.
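As a one-line sketch (hypothetical helper name), with defaults set to the Epanechnikov constants so that the rule reduces to 5 |Delta| h / 16:

```python
def rule_of_thumb(delta, h, k_max=0.75, gamma0=0.6):
    """Rule-of-thumb ridge parameter R_thumb = max_u K(u) * |delta| * h / (4 * gamma0).

    Defaults are the Epanechnikov values max K = 3/4 and gamma_0 = 3/5,
    for which the formula simplifies to 5 * |delta| * h / 16.
    """
    return k_max * abs(delta) * h / (4.0 * gamma0)
```

The rule depends only on the realized design (through Delta) and on the bandwidth, not on the responses Y.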
Algorithm for spatially adaptive ridging

Now, we will estimate rho_opt in (11) to derive a data-adaptive ridge parameter. In order to keep the algorithm as simple as possible, we propose the following algorithm.

1. Estimate r'(x~) using local quadratic ridging. Use the bandwidth h_1 = 1.2 h n^{2/35} and the non-adaptive ridge parameter given by the rule of thumb (see section 3.1 below).

2. Estimate sigma^2 with the estimator by Gasser, Sroka, and Jennen-Steinmetz (1986).

3. Estimate r(x_0) using local linear ridging with the data-adaptive ridge parameter R_adapt = Rhat_opt.

Since the algorithm has to work also for small samples and sparse designs, we included some precautions: h_1 >= 5/n, and R_adapt >= R_thumb.

Remark 1: The order n^{2/35} is the asymptotically optimal order for the ratio h_1/h; the constant 1.2 is a tuning parameter which should work well for a wide range of applications. Bounding the ridge parameter R_adapt from below by R_thumb is an additional precaution, which ensures a bounded conditional variance.

Remark 2: We also analyzed a global rule of thumb. However, in simulations it performed slightly worse than R_thumb.

Remark 3: Two choices of the ridge parameter for a given (global or local) bandwidth are studied. As will be seen in section 2.2 below, the resulting ridge estimators are asymptotically equivalent to the local linear estimator. Hence, plug-in methods for bandwidth choice are also valid for ridge estimators. For small sample size, the ridge estimator might be considered as a regularization of the local linear estimator. Hence, bandwidths chosen by plug-in methods for the local linear estimator should work even better for the ridge estimator.

Figure 3. MISE of estimators of r(x) versus bandwidth for n = 50 random uniformly distributed design points, a bimodal regression function, and sigma = 0.1. The dotted line is the Nadaraya-Watson estimator (NW), dashes are the local linear estimator (ll), dash-dots are the local linear estimator with k-nearest neighbor bandwidth scheme (knn), and solid lines are the ridge estimators with spatially adaptive parameter (thick line, adapt) and with the rule of thumb (thumb).
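Step 2 above uses the difference-based residual variance estimator of Gasser, Sroka, and Jennen-Steinmetz (1986). A minimal sketch of that estimator (assuming design points sorted in increasing order; the function name is hypothetical): for each interior point, a pseudo-residual is formed as the gap between Y_i and the line through its two neighbors, and the squared pseudo-residuals are rescaled so the estimator is unbiased for i.i.d. noise.

```python
def gsj_variance(x, y):
    """Gasser-Sroka-Jennen-Steinmetz (1986) estimate of sigma^2.

    x must be sorted in increasing order; y are the responses.
    """
    n = len(x)
    total = 0.0
    for i in range(1, n - 1):
        # interpolate the two neighbors linearly at x[i]
        a = (x[i + 1] - x[i]) / (x[i + 1] - x[i - 1])
        b = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
        eps = a * y[i - 1] + b * y[i + 1] - y[i]   # pseudo-residual
        c2 = 1.0 / (a * a + b * b + 1.0)           # unbiasing constant
        total += c2 * eps * eps
    return total / (n - 2)
```

By construction the estimator is exactly zero for noiseless linear data, since the interpolating line then reproduces Y_i.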
MSE-performance of ridge estimators

We are interested in whether data-adaptive ridging works for any given bandwidth. Hence, we generated data with random uniform design, n = 50, the bimodal regression function shown in figure 1, and bandwidths ranging from 1/n to 0.5. Figure 3 shows the mean integrated squared error (MISE) of these simulations on a log-log scale.

As is to be expected, for this example the local linear estimator breaks down below a relatively large bandwidth of 0.1. When the expected number of observations locally is less than 10, the probability of realizations with sparse regions increases, leading to spikes in the estimator and hence to an arbitrarily large MISE. The k-nearest neighbor rule avoids smoothing windows containing too few observations. The local linear estimator with k-nn bandwidth scheme hence is stable down to k = 5 (h = 0.05). For k <= 4, the probability increases that these design points are lumped together around one point, and the estimator breaks down as well. The Nadaraya-Watson estimator has a small variance but a relatively large bias, even here for uniform design. For general designs, the Nadaraya-Watson estimator can become arbitrarily bad. The ridge estimator with spatially adaptive parameter behaves excellently over the whole range of bandwidths. Ridging with the rule of thumb also works for any given bandwidth. A detailed analysis of the MISE-performance of these estimators will be presented in section 4 below.
2.2 Some Asymptotics

Ridging was introduced to overcome problems with variance when design points are sparse or clustered. There, classical asymptotic theory does not give an adequate description of the statistical behavior. Asymptotic theory can, however, show whether we have to pay a price in well-behaved situations. In this section, we will consider the asymptotic variance, bias, and MSE of the local linear ridge estimator with a data-adaptive parameter. Consider symmetric weights K with support [-1, 1], suppose h -> 0 and n h^3 -> infinity, and assume standard regularity conditions. We will present all formulae with true sigma^2 > 0 and true r'(x_0) != 0, but note that they remain unchanged if we replace them by consistent estimates. Thus, results for the optimal ridge parameter are also true for the data-adaptive choice R_adapt. Denote

    kappa_j = int u^j K(u) du   and   gamma_j = int u^j K^2(u) du .

The integrals go from -1 to 1 in the interior, and from c to 1 and from -1 to c, respectively, at the boundaries. Standard asymptotics (see, e.g., Fan and Gijbels 1996) yield formulae for local polynomials centered at x_0. Using the superscript "(0)" for local polynomials centered at x_0,

    (1/(n h)) s_j^{(0)} = h^j int u^j K(u) f(x_0 + u h) du + O_p( h^j (n h)^{-1/2} )
                        = h^j f(x_0) kappa_j + h^{j+1} f'(x_0) kappa_{j+1} + O(h^{j+2}) + O_p( h^j (n h)^{-1/2} )
                        = h^j f(x_0) kappa_j + h^{j+1} f'(x_0) kappa_{j+1} (1 + o_p(1)) .

Replacing kappa_j by gamma_j gives the asymptotic expansions for s*_j in (9). By construction we obtain

    s_0 = s_0^{(0)} ,
    s_1 = s_1^{(0)} + Delta s_0 ,
    s_2 = s_2^{(0)} + 2 Delta s_1^{(0)} + Delta^2 s_0 ,

and similarly for s*_j. From s_1 = 0 it follows that Delta = -s_1^{(0)}/s_0. In the interior we have kappa_1 = kappa_3 = 0 and gamma_1 = gamma_3 = 0. Let us omit the dependence of r, r', f and f' on x_0. Plugging the above relations into (11), the ridge parameter rho_opt can be approximated as

    rho_opt = 1 - sigma^2 gamma_0 / (n h^3 kappa_2 f r'^2) (1 + o_p(1)) .

The corresponding optimal ridge parameter R in (6) becomes

    R_opt = s_2 (1 - rho_opt)/rho_opt = sigma^2 gamma_0 / r'^2 (1 + o_p(1)) ,

i.e., it is independent of n and h. The asymptotic variance of the ridge estimator with optimal parameter is

    V_opt = V_ll - 2 sigma^4 gamma_0^2 f'^2 / ( (n h)^2 r'^2 f^4 ) + o_p( (n h)^{-2} ) .

This gain is of second order (V_ll = O_p((n h)^{-1})), but can, nevertheless, become important for small h. At the boundary, the optimal ridge parameter is rho_opt = 1 - O_p((n h^3)^{-1}) as in the interior. However, the effect of ridging is larger:

    V_opt = V_ll - O_p( 1/(n^2 h^4) ) .

The bias for linear regression functions becomes in the interior

    B_opt = sigma^2 gamma_0 f' / (n h r' f^2) + o_p( 1/(n h) ) .

The optimal mean squared error is then

    MSE_opt = MSE_ll - sigma^4 gamma_0^2 f'^2 / ( (n h)^2 r'^2 f^4 ) + o_p( (n h)^{-2} ) .

The gain in MSE is of the same order as for the variance, but the squared bias has halved that gain. For general regression functions, the mean squared error of the optimal ridge estimator is

    MSE_opt = MSE_ll + (h^2/(n h)) sigma^2 gamma_0 kappa_2 r'' f' / (r' f^2) + o_p( h^2/(n h) ) .

The difference between MSE_ll and MSE_opt is of the same order as (V_ll - V_NW). Asymptotically, ridging is not necessarily preferable to local linear estimation. For r'' f'/r' > 0, the local linear estimator is asymptotically preferable.
Figure 4. Illustration of the bias of local constant and local linear estimators in the interior. Points are observations without noise following a fixed design with a standard normal design density f and observations in (-infinity, 1). Dots are the Nadaraya-Watson estimator (NW), and dashes are the local linear estimator (ll), both with h = 0.5. Arrows show the direction of ridging the local linear estimator.
2.3 Superefficient estimation?

The reason for the possible asymptotic loss of efficiency is the local linear approximation of the regression function in (10). When approximating r locally by a polynomial of degree 2 or higher, we can use the fact that Nadaraya-Watson and local linear estimators have the same asymptotic variance. Thus, we can improve the bias without increasing the variance. The tutorial figure 4 illustrates this situation. At the right-hand side of the figure (f' > 0, r' < 0, r'' < 0), ridging increases the bias and hence the MSE. At the left-hand side, ridging reduces the bias. When approximating r locally by a parabola, calculations similar to those in section 2.1 lead to an optimal ridge parameter

    rho_opt = 1 + r'' f / (2 r' f') (1 + o_p(1)) .

If we omit the restriction rho in [0, 1], the resulting estimator is nearly everywhere superefficient, with a gain in MSE proportional to h^4 r''^2:

    MSE_ll - MSE_opt ∝ h^4 r''^2 .

This approach is, however, more a trick than a real improvement. If the regularity conditions necessary for such an improvement are fulfilled, the improved estimator has to be compared with locally cubic estimators. This artifact is the main reason why we use a linear approximation of the regression function for the computation of the optimal ridge parameter.
3 LOCAL LEAST SQUARES RIDGE REGRESSION

The results in the previous sections can be generalized to local polynomials of higher degree, to derivative estimation, and to multivariate designs. In this section we will use matrix notation: A^T, tr(A), rk(A), lambda_min(A) and lambda_max(A) denote the transpose, trace, rank, minimal and maximal eigenvalue of A, respectively. Now, the predictors of the scalar responses Y_i may be d-dimensional vectors X_i = (X_{i1}, ..., X_{id})^T. In the framework of local least squares regression we assume that the regression function is locally approximated as a linear combination r(x) = sum_{j=0}^p beta_j p_j(x) of p + 1 basis functions p_0(x), ..., p_p(x), usually polynomials, trigonometric functions, or splines. We want to estimate some functional of r(x_0), usually r(x_0) itself or its nu-th derivative. For estimation purposes, we replace this functional by its approximation in the approximating linear space, i.e., we estimate a linear combination u^T beta = sum_{j=0}^p u_j beta_j of regression coefficients. The local least squares estimator rhat_{u,ls}(x_0) of u^T beta minimizes the weighted sum of squares

    sum_{i=1}^n ( Y_i - sum_{j=0}^p beta_j p_j(X_i) )^2 K_H(X_i) .    (13)

Here K_H denotes a nonnegative kernel weight function, rescaled by a d x d dimensional bandwidth matrix H. Note that the local basis functions p_j(X_i) and the kernel weights K_H(X_i) depend on x_0 and usually on all design points in the smoothing window, but not on Y_i. The local least squares estimator has the explicit form

    rhat_{u,ls}(x_0) = u^T S^{-1} T ,    (14)

where

    S = ( s_jk )_{j,k = 0, ..., p}   and   T = (T_0, ..., T_p)^T    (15)

have elements of the form

    s_jk = sum_{i=1}^n K_H(X_i) p_j(X_i) p_k(X_i)    (16)

and

    T_j = sum_{i=1}^n K_H(X_i) p_j(X_i) Y_i .
T
0 X@
X
i
j
n
=1
Yi ?
p
=0
j
12 p (X )A KH (X ) + R ; j
i
T
i
is a (17)
where R is some (p + 1) (p + 1) {dimensional nonnegative ridge matrix. The estimator has the explicit form rbu (x0 ) = uT (S + R)?1 T :
(18)
Since the observations Y are independently distributed, it can be easily veri ed that the conditional variance of rbu given X 1 ; : : : ; X is i
n
V(rbu (x0 )) = 2 u (S + R)?1 S (S + R)?1 u ; T
0 BB s00. S =B @ ..
where
has elements of the form sjk
=
X n
=1
i
:::
s0p
sp0 : : :
spp
.. .
1 CC CA
2 (X ) p (X ) p (X ) : KH i j i k i
For the computation of the expectation, we use approximations of the regression function in the approximating linear space. Then, the conditional expectation of rbu is approximately
0 X BB s..0 ? 1 B . E(rbu (x0 )) u (S + R) =0 @ j
p
T
j
j
sjp
1 CC CA :
The MSE{optimal ridge matrix within some set of nonnegative matrices for parameters and 2 may be found numerically. The whole space of nonnegative ridge matrices is usually too large to nd a stable minimum of (17). An interesting special case is, therefore, the search for optimal eigenvalues for given eigenvectors of R. Let us suppose that
    R = sum_{ell=1}^q R_ell R^{(ell)}

with scalar ridge parameters R_ell >= 0 and nonnegative basis matrices R^{(ell)} with rk(R^{(ell)}) = 1 for ell = 1, ..., q. In this case we can find an explicit solution for the MSE-optimal ridge parameter R_{ell_0} >= 0 given all other ridge parameters. Denote

    R^{(0)} = sum_{ell != ell_0} R_ell R^{(ell)}

and

    rho_{ell_0} = 1 / ( 1 + R_{ell_0} tr( (S + R^{(0)})^{-1} R^{(ell_0)} ) ) .

Then, standard algebra yields

    (S + R)^{-1} = (1 - rho_{ell_0}) (S + R^{(0)})^{-1} [ S + R^{(0)} - R^{(ell_0)} / tr( (S + R^{(0)})^{-1} R^{(ell_0)} ) ] (S + R^{(0)})^{-1}
                   + rho_{ell_0} (S + R^{(0)})^{-1} .

Thus, the ridge estimator rhat_u(x_0) is a convex combination of two linear estimators with weights (1 - rho_{ell_0}) and rho_{ell_0}. The MSE then is approximately a parabola in rho_{ell_0} and can explicitly be minimized within rho_{ell_0} in [0, 1]. A solution for all parameters R_1, ..., R_q may be found iteratively.

The ridge estimator was designed to fix problems of local polynomials with variance. Hence, the question arises whether its conditional variance is bounded. Let us then assume that K(.) and the p_j(.) are bounded from above in the smoothing window. Then, S* is bounded, and for nonsingular ridge matrices R the variance of rhat_u(x_0) for any u is bounded from above by

    V(rhat_u(x_0)) <= sigma^2 max_X( lambda_max(S*) ) ||u||^2 / ( lambda_min(R) )^2 .
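The rank-one updating identity behind this convex-combination representation is a form of the Sherman-Morrison formula, and it can be checked numerically. A small self-contained sketch (all matrices invented for illustration; R^{(ell_0)} = v v^T is rank one):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
M = rng.normal(size=(p, p))
S = M @ M.T + p * np.eye(p)          # a positive definite stand-in for S
R0 = np.diag([0.0, 1.0, 2.0, 3.0])   # fixed ridge part R^{(0)}
v = rng.normal(size=p)
Rl = 0.7                             # scalar ridge parameter being varied
Rmat = R0 + Rl * np.outer(v, v)      # R = R^{(0)} + R_l * v v^T

A = np.linalg.inv(S + R0)
t = np.trace(A @ np.outer(v, v))     # = v^T A v > 0
rho = 1.0 / (1.0 + Rl * t)
# equivalent compact form of the convex-combination identity:
# (S + R)^{-1} = A - (1 - rho) * A v v^T A / t
lhs = np.linalg.inv(S + Rmat)
rhs = A - (1.0 - rho) * (A @ np.outer(v, v) @ A) / t
err = np.max(np.abs(lhs - rhs))
```

The weight rho always lies in (0, 1) for R_l > 0, which is what makes the one-dimensional minimization over rho well behaved.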
Usually, there is an overall mean (without loss of generality p_0(.)) included in the model. Then, shrinking towards zero, as done when using a nonsingular ridge matrix, is not desirable. The following theorem shows that ridging without shrinking towards zero has a bounded variance.

Theorem: Let p_0(.) be bounded from below away from zero, and let K(.) and p_j(.), j = 0, ..., p, be bounded from above. If the ridge matrix R has the structure

    R = ( 0   0
          0   R_11 ) ,

where R_11 > 0 is some p x p dimensional positive definite matrix, then the variance of rhat_u(x_0) is bounded for any u.

A sketch of the proof is given in the appendix. As a consequence of this theorem, we can construct local polynomial ridge estimators of any polynomial degree with bounded variance which are unbiased for constant regression functions.
3.1 Example: Estimation of the rst derivative Let us now discuss the special case of local quadratic estimation of the rst derivative in more detail. In this case, p0 (x); p1 (x); p2 (x) form a basis of parabolas, and KH (X ) = K ((X ? x0 )=h) is as in section 2. If we choose an orthogonal polynomial basis, the estimator has an especially simple form. Using the notation (3){(4), i
i
p0 (x)
= 1; p1 (x) = x ? xe ; s se s p2 (x) = (x ? xe)2 ? 3 (x ? xe) ? 2 = (x ? xe)2 ? 2 ; s s s 2
0
0
are mutually orthogonal with respect to the weighted discrete design. Here, xe = xe + s3 =(2 s2) is a weighted mean of design points, and se2 is the quadratic form in (3) centered at x1 = ex . Consequently, S in (15) is a diagonal matrix with diagonal elements = s0 ; s11 = s2 ; s2 s2 se2 s22 = s4 ? 3 ? 2 = se4 ? 2 : s2 s0 s0 The local polynomials T in (15) become s00
j
T0(1) T1(1) T2(1)
= T0 ; = T1 ; = T2 ? ss3 T1 ? ss2 T0 = Te2 ? sse2 T0 ; 2
0
0
where the superscript "(1)" is used to avoid confusion. We want to estimate the derivative of the local polynomial beta_0 p_0(x) + beta_1 p_1(x) + beta_2 p_2(x) at x = x_0, i.e., beta_1 + 2 beta_2 (x_0 - x~~). Using Delta~ = x_0 - x~~, the term u in (14) becomes u = (0, 1, 2 Delta~)^T, and the local quadratic estimator of the derivative of r at x_0 becomes

    rhat_{1,lq}(x_0) = T_1/s_2 + 2 Delta~ T_2^{(1)}/s_22 .

This estimator has an interpretation similar to the tutorial figure 2. The first term T_1/s_2 is the local linear Nadaraya-Watson estimator of r', i.e., it is locally constant. The local constant and the local regression line

    T_1/s_2 + (x - x~~) 2 T_2^{(1)}/s_22

cross at x = x~~. For x_0 = x~~, Nadaraya-Watson and local quadratic estimators of r'(x_0) coincide. Hence, a ridge estimator should rotate the regression line around x = x~~. Unfortunately, in contrast to section 2, the Nadaraya-Watson estimator does not have a bounded variance here, and shrinking towards zero is useful:

    rhat_1(x_0) = T_1/(s_2 + R_1) + 2 Delta~ T_2^{(1)}/(s_22 + R_2) .    (19)
2 V Ts 1 ! n hf h22 2 : 2 2 If we use this limit as an upper bound for the rst term of the ridge estimator, and replace n h f by 1 as in section 2.1 (ridging this term results in a bias for linear regression functions, especially for large values of R1 ) we get max(K ()) 22 h2 : R1 thumb = 4 ;
2 Epanechniko kernel, max(K ()) = 3=4, 2 = 1=5, 2 = 3=35 ,
For the and R1 thumb = 7 h2 =80 . The variance of the second term can be estimated from above similarly: ;
!
2 e2 2 e2 e (1) K ()) s22 2 e2 max(K ()) V s2 +T2R = (4s + Rs22)2 4 (s max( : R2 22 2 22 2 22 + R2 )2
Taking the limit of V(T1 =s2 ) jej = h as an upper bound, we get a rule of thumb R2;thumb =
max(K ()) 22 jej h3 ; 2
which is R2 thumb = 7 jej h3 =20 for Epanechniko weights. We generated data with random uniform design, n = 100, a bimodal regression function shown in gure 1, and bandwidths ranging from 3=n to 0.5 . Figure 5 shows the mean integrated squared error of these simulations on a log{log scale. Local quadratic estimation with k{nn rule did not work well for any bandwidth. The reason is that a given number of points in the smoothing interval is not sucient to guarantee that there are 3 well separated design points, necessary for a stable local ;
16
Figure 5. MISE of estimators of r'(x) versus bandwidth for n = 100 random uniformly distributed design points, a bimodal regression function, and sigma = 0.1. The dotted line is the local linear Nadaraya-Watson estimator (NW), dashes are the local quadratic estimator (lq), dash-dots are the local quadratic estimator with k-nearest neighbor bandwidth scheme (knn), and the solid line is the ridge estimator with rule of thumb (thumb).
quadratic fit. The Nadaraya-Watson estimator worked well for bandwidths greater than 0.05. Note that this is a situation which especially suits the Nadaraya-Watson estimator: the design is uniform and the true derivative is close to zero at the boundary. Ridging with the rule of thumb worked well for any not too small bandwidth. Using the explicit form of the ridge estimator in (19), which is similar to (6), a data-adaptive choice of the ridge parameters can be constructed along the lines of section 2. There, a rule of thumb for the estimation of the second derivative is needed, which can be constructed similarly to that for the first derivative.
3.2 Example: Two-dimensional local linear ridge regression

In the case of local linear regression for a two-dimensional design, a natural basis is

    p_0(x) = 1 ,   p_1(x) = x_1 - x~_1 ,   p_2(x) = x_2 - x~_2 ,

where

    x~ = (x~_1, x~_2)^T = sum_{i=1}^n K_H(X_i) X_i / sum_{i=1}^n K_H(X_i) ,

and

    K_H(X_i) = K((X_{i1} - x_{01})/h_1) K((X_{i2} - x_{02})/h_2)   or   K_H(X_i) = K( (X_i - x_0)^T H^{-1} (X_i - x_0) )

are common weighting schemes. The elements of

    S = ( s_00  s_10  s_01
          s_10  s_20  s_11
          s_01  s_11  s_02 )
    and
    T = (T_00, T_10, T_01)^T

in (15) are

    s_jk = sum_{i=1}^n K_H(X_i) (X_{i1} - x~_1)^j (X_{i2} - x~_2)^k   and
    T_jk = sum_{i=1}^n K_H(X_i) (X_{i1} - x~_1)^j (X_{i2} - x~_2)^k Y_i .

By definition, s_01 = s_10 = 0. The local linear estimator of the regression function r at x_0 is u^T S^{-1} T, where u = (1, Delta_1, Delta_2)^T and Delta = (Delta_1, Delta_2)^T = x_0 - x~. For an appropriate choice of the ridge matrix, note that the Nadaraya-Watson estimator and the local linear estimator coincide for x_0 = x~, independently of the observations Y_i. Moreover, we are not interested in a fit of the whole regression plane, but only in its value at x = x_0. Hence, it does not matter whether the regression plane is well fitted outside the connecting line of x~ and x_0. An orthogonal basis of the plane will show how we can use this fact for the construction of a ridge estimator. Denote

    eta~ = sqrt( Delta_1^2 s_02 - 2 Delta_1 Delta_2 s_11 + Delta_2^2 s_20 ) ,

and let

    p_0^{(2)}(x) = 1 ,
    p_1^{(2)}(x) = [ (Delta_1 s_02 - Delta_2 s_11)(x_1 - x~_1) + (Delta_2 s_20 - Delta_1 s_11)(x_2 - x~_2) ] / eta~ ,
    p_2^{(2)}(x) = [ Delta_2 (x_1 - x~_1) - Delta_1 (x_2 - x~_2) ] / eta~ .

Then, with Delta~ = p_1^{(2)}(x_0) - p_1^{(2)}(x~),

    T^{(2)} = ( T_00 ,
                [ (Delta_1 s_02 - Delta_2 s_11) T_10 + (Delta_2 s_20 - Delta_1 s_11) T_01 ] / eta~ ,
                [ Delta_2 T_10 - Delta_1 T_01 ] / eta~ )^T ,

    S^{(2)} = diag( s_00 , s_20 s_02 - s_11^2 , 1 ) ,   and   u^{(2)} = (1, Delta~, 0)^T .
true
thumb, = 0 21 ISE = 0.31
ll, = 0 15 ISE = 1.70
thumb, = 0 15 ISE = 0.23
h
NW, = 0 15 ISE = 0.26 h
ll, = 0 21 ISE = 0.41
:
:
h
h
:
h
:
:
Figure 6. Realizations of estimators for n = 200 random uniformly distributed design points, the regression function left above (true), and = 1 . Above are local linear estimator (ll) at its ISE{ optimal bandwidth and the ridge estimator with rule of thumb (thumb). Below are Nadaraya{Watson estimator (NW), local linear estimator, and ridge estimator at the ISE{optimal bandwidth of the ridge estimator.
The local linear estimator becomes

$$ \hat{r}_{ll}(x_0) = \frac{T_{00}}{s_{00}} + \tilde{\delta}\; \frac{T^{(2)}_1}{s_{20} s_{02} - s_{11}^2} $$
as in (5). The only difference is that now the weights depend on an auxiliary variable $p^{(2)}_1(X_i)$. Consequently, the problem of ridging reduces to the one-dimensional problem, including the rule of thumb and the data-adaptive choice of the ridge parameter.

We implemented the formulae above in Splus for kernel weights $K_H(X_i) = \frac{2}{\pi}\,(1 - \|(X_i - x_0)/h\|^2)_+$ and modified the bandwidth locally to ensure that at least 3 design points are in the smoothing circle. The constant $\kappa_0 = \int K^2(u)\, du$ used in the rule of thumb is now $\kappa_0 = 4/(3\pi)$, leading to $R_{\mathrm{thumb}} = 6\,\hat{\sigma}^2 \|\delta\| / (16\, h)$. Figure 6 shows realizations of several estimators. For the ISE-optimal bandwidth h = 0.15 of the ridge estimator, the local linear estimator has problems at the boundary, which occupies 51% of the design. The ISE-optimal bandwidth h = 0.21 of the local linear estimator clearly leads to oversmoothing. The ridge estimator has, however, still a smaller ISE than the local linear one. It should be obvious by now how to derive a data-adaptive choice of the ridge parameter.
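As a concrete check of the dimension reduction above, the following sketch computes the two-dimensional local linear estimate both directly, as $u^T S^{-1} T$, and via the rotated basis $(\tilde{\delta}, T^{(2)}_1)$, and verifies that the two agree. This is illustrative Python, not the paper's S-Plus code; the data and variable names are assumptions, and the spherical kernel below is the one quoted in the text up to normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x0, h = 200, np.array([0.4, 0.6]), 0.25
X = rng.uniform(0, 1, size=(n, 2))
Y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# spherical Epanechnikov-type weights, zero outside the smoothing circle
K = np.maximum(0.0, 1.0 - np.sum(((X - x0) / h) ** 2, axis=1))

xt = (K[:, None] * X).sum(axis=0) / K.sum()   # local centre of mass x~
Z = X - xt                                     # centred coordinates
s = lambda j, k: (K * Z[:, 0] ** j * Z[:, 1] ** k).sum()
T = lambda j, k: (K * Z[:, 0] ** j * Z[:, 1] ** k * Y).sum()
d = x0 - xt                                    # delta = x0 - x~

# direct local linear fit: u' S^{-1} T with u = (1, d1, d2)
S = np.array([[s(0, 0), 0, 0], [0, s(2, 0), s(1, 1)], [0, s(1, 1), s(0, 2)]])
Tv = np.array([T(0, 0), T(1, 0), T(0, 1)])
direct = np.array([1.0, d[0], d[1]]) @ np.linalg.solve(S, Tv)

# rotated basis: one-dimensional form along the line from x~ to x0
dt = np.sqrt(d[0]**2 * s(0, 2) - 2 * d[0] * d[1] * s(1, 1) + d[1]**2 * s(2, 0))
T1r = ((d[0] * s(0, 2) - d[1] * s(1, 1)) * T(1, 0)
       + (d[1] * s(2, 0) - d[0] * s(1, 1)) * T(0, 1)) / dt
reduced = T(0, 0) / s(0, 0) + dt * T1r / (s(2, 0) * s(0, 2) - s(1, 1) ** 2)

assert np.allclose(direct, reduced)
```

A ridge would then simply be added to the single slope denominator $s_{20} s_{02} - s_{11}^2$, exactly as in the one-dimensional case.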
4 A SIMULATION STUDY

Local linear ridge regression

In a small simulation study the performance of various estimators was evaluated. We analyzed the following situations.

Regression functions:
1. bimodal: r(x) = 0.3 exp(−4 (4x − 1)²) + 0.7 exp(−16 (4x − 3)²), σ = 0.1 (Fan and Gijbels 1995)
2. sine: r(x) = sin(5x), σ = 0.5 (Ruppert, Sheather, and Wand 1995)
3. peak: r(x) = 2 − 5x + 5 exp(−400 (x − 0.5)²), σ = √.5 (Seifert and Gasser 1996)

Design densities on [0, 1]:
1. uniform U(0, 1)
2. truncated normal N(0.5, 0.5²) ∩ [0, 1]
3. truncated normal N(0, 1) ∩ [0, 1]

These situations were analyzed for fixed and random design, and for 3 residual variances: as described above (σ), small (σ/2), and large (2σ). The resulting 54 situations were analyzed for n ranging from 25 to 1000.

The following estimators were evaluated: local constant (Nadaraya-Watson), local linear, ridging with spatially adaptive parameter, and ridging with rule of thumb. The k-nearest neighbor rule is a bandwidth scheme intended to achieve approximately constant variance for an estimator (be it a density or regression estimator). As a byproduct it alleviates the variance problems of local polynomials at the price of giving up flexibility in bandwidth choice. Hence, the k-nn rule with k ≈ 2nh was included in the evaluations. The k-nn estimator then is the local linear estimator with the mean of the smallest and largest possible local bandwidth with k observations in the smoothing interval.

All estimators were computed with a fast algorithm as described in Seifert et al. (1994). The Epanechnikov kernel was used throughout. All realizations of the design were stretched to min(X_i) = 0, max(X_i) = 1 before generating the Y_i. We used a "steady" bandwidth near the boundary, i.e., the total window width in [0,1] is always 2h as in the interior (k of the k-nn rule is not reduced). For each situation, 400 replicates were generated. The MISE for each replicate was analyzed at an estimated MISE-optimal bandwidth.

For fixed designs, differences between local linear and ridge estimators were small. In the interior, differences between MISEs of the local linear and ridge estimators were less than 1%. At the
boundary, however, data-adaptive ridging gained about 8% relative to the local linear estimator.

In the following, we will concentrate on random design, where differences are more important. Figure 7 shows the example of the bimodal regression function and random uniform design.

Figure 7. Efficiency depending on sample size of estimators of the bimodal regression function for random uniform design and σ = 0.1. Dotted line is Nadaraya-Watson estimator (NW), dashes are local linear estimator (ll), dash-dots are local linear estimator with k-nearest neighbor bandwidth scheme (knn), and solid lines are ridge estimators with spatially adaptive parameter (thick line) (adapt) and rule of thumb (thumb).

Relative efficiency here is the ratio of the MISE of the local linear estimator expected from asymptotic theory to the true finite sample MISE of an estimator. In this example, at least 200 observations are necessary for local linear estimation to behave well. Local linear estimation with the k-nn rule worked well for n ≥ 50. Ridging with spatially adaptive parameter as well as with rule of thumb worked well for any n.

Over all 162 situations (3 design densities × 3 regression functions × 3 residual variances × 6 sample sizes), the mean MISE of the local linear estimator was 834 times higher than that of the spatially adaptive ridge estimator. This ratio was highest for n = 25 (maximal MISE_ll / MISE_adapt = 39,029) and reduced for increasing sample size to a mean of 1.02 for n = 1000. The largest loss of MISE for the ridge estimator compared with the local linear one was 3%. The ridge estimator was always better than the Nadaraya-Watson estimator. In 94% of all situations ridging outperformed the k-nn rule, and always for n ≥ 100. On average, the MISEs of the spatially adaptive ridge estimator were 2% smaller than those of ridging using the rule of thumb. In the interior, differences were small, but at the boundary the data-adaptive rule outperformed the rule of thumb by 7%.
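The simulation settings above can be written down directly. The following Python sketch defines the three regression functions and draws from the three design densities; the helper names are hypothetical, the truncation is implemented by simple rejection sampling (an assumption about mechanics the text does not specify), and the actual study was of course run with the fast algorithm of Seifert et al. (1994), not this code.

```python
import numpy as np

# the three regression functions of the simulation study
def bimodal(x):
    return 0.3 * np.exp(-4 * (4 * x - 1) ** 2) + 0.7 * np.exp(-16 * (4 * x - 3) ** 2)

def sine(x):
    return np.sin(5 * x)

def peak(x):
    return 2 - 5 * x + 5 * np.exp(-400 * (x - 0.5) ** 2)

# design densities on [0, 1]; truncation by rejection, then the design is
# stretched to min 0 / max 1 as in the study
def design(kind, n, rng):
    if kind == "uniform":
        x = rng.uniform(0, 1, n)
    elif kind == "normal_mid":            # N(0.5, 0.5^2) truncated to [0, 1]
        x = rng.normal(0.5, 0.5, 4 * n)
    elif kind == "normal_left":           # N(0, 1) truncated to [0, 1]
        x = rng.normal(0.0, 1.0, 8 * n)
    else:
        raise ValueError(kind)
    x = x[(x >= 0) & (x <= 1)][:n]        # keep the first n accepted points
    return (x - x.min()) / (x.max() - x.min())
```

The noise levels then follow the text: σ = 0.1 (bimodal), σ = 0.5 (sine), σ = √.5 (peak), each also run at σ/2 and 2σ.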
Local quadratic ridge estimation of the derivative

The same scheme was applied to estimating the derivative. Here, slight differences between estimators could be observed for fixed designs. The ridge estimator with rule of thumb gained 7% relative to the local quadratic estimator with k-nn rule, 9% relative to the local quadratic estimator, and 18% relative to the local linear Nadaraya-Watson estimator.

For random design, however, differences are more important. Figure 8 shows the example of the bimodal regression function and random uniform design.

Figure 8. Efficiency depending on sample size of estimators of the derivative of the bimodal regression function for random uniform design and σ = 0.1. Dotted line is Nadaraya-Watson estimator (NW), dashes are local quadratic estimator (lq), dash-dots are local quadratic estimator with k-nearest neighbor bandwidth scheme (knn), and solid line is ridge estimator with rule of thumb (thumb).

Local quadratic estimation with the k-nn rule did not work well for small sample size. The Nadaraya-Watson estimator was superior for large n. As explained in section 3.1, this situation is ideal for the Nadaraya-Watson estimator. Ridging with the rule of thumb worked well for any n.

Over all 162 situations, the mean MISE of the local quadratic estimator was 370 times as high as that of the ridge estimator. This ratio was highest for n = 25 (maximal MISE_lq / MISE_thumb = 21,330). The mean MISE of the k-nearest neighbor estimator was 137%, and that of the Nadaraya-Watson estimator 32%, higher than the MISE of the ridge estimator. Differences were still noticeable for n = 1000. There, the mean MISE of the local quadratic estimator was 12%, that of the k-nearest neighbor estimator 11%, and that of the Nadaraya-Watson estimator 40% higher than the MISE of the ridge estimator.
APPENDIX: PROOF OF THE THEOREM

It has to be shown that all components of $(S + R)^{-1} T$ in (18) have bounded variances. Divide

$$ S = \begin{pmatrix} s_{00} & S_{01} \\ S_{10} & S_{11} \end{pmatrix} \qquad\text{and}\qquad T = \begin{pmatrix} T_0 \\ T_1 \end{pmatrix} $$

into blocks analogously to R. Then, standard algebra yields

$$ (S + R)^{-1} T = \begin{pmatrix} \dfrac{T_0}{s_{00}} - \dfrac{1}{s_{00}}\, S_{01} E^{-1} \Big( T_1 - S_{10} \dfrac{T_0}{s_{00}} \Big) \\[2ex] E^{-1} \Big( T_1 - S_{10} \dfrac{T_0}{s_{00}} \Big) \end{pmatrix}, $$

where $E = S_{11} + R_{11} - S_{10} S_{01} / s_{00}$.

The term $T_0 / s_{00}$ essentially is a weighted mean and hence has bounded variance:

$$ V\!\left( \frac{T_0}{s_{00}} \right) = \sigma^2 \sum_{i=1}^n \left( \frac{K_H(X_i)\, p_0(X_i)}{\sum_{i=1}^n K_H(X_i)\, p_0^2(X_i)} \right)^{\!2} \le \frac{\sigma^2}{(\min_x p_0(x))^2}\, . $$

The components $T_j$ of $T_1$ are linear statistics with bounded coefficients and hence have bounded variances:

$$ V(T_j) = \sigma^2 \sum_{i=1}^n K_H^2(X_i)\, p_j^2(X_i) \le \sigma^2\, n\, \big(\max(K(\cdot))\, \max(p_j(\cdot))\big)^2 . $$

The matrix $S_{11} - S_{10} S_{01} / s_{00}$ is nonnegative definite, hence $E \ge R_{11}$, and

$$ E^{-1} \le \frac{1}{\lambda_{\min}(R_{11})}\; I . $$

The components $s_{0j}$ of $S_{01}$ and $S_{10}$ are bounded, and so are the ratios

$$ \frac{|s_{0j}|}{s_{00}} \le \sum_{i=1}^n \frac{K_H(X_i)\, p_0(X_i)}{\sum_{i=1}^n K_H(X_i)\, p_0^2(X_i)}\; |p_j(X_i)| \le \frac{\max(|p_j(\cdot)|)}{\min(p_0(\cdot))}\, . $$
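The block-inversion identity at the start of the proof can be checked numerically. The sketch below (illustrative Python; the Gram matrix and the singular ridge penalizing everything except the constant term are made-up test data) compares the direct solve of $(S + R)^{-1} T$ with the blockwise expressions built from $E = S_{11} + R_{11} - S_{10} S_{01} / s_{00}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
# a Gram matrix S = P' W P, as arises in local polynomial regression
P = rng.standard_normal((n, p))
W = np.diag(rng.uniform(0.1, 1.0, n))
S = P.T @ W @ P
T = P.T @ W @ rng.standard_normal(n)

# singular ridge matrix: penalize all components except the constant term
R = np.diag([0.0, 0.5, 0.7])
R11 = R[1:, 1:]

s00, S01, S10, S11 = S[0, 0], S[0:1, 1:], S[1:, 0:1], S[1:, 1:]
T0, T1 = T[0], T[1:]

# blockwise formulas from the proof
E = S11 + R11 - S10 @ S01 / s00
second = np.linalg.solve(E, T1 - S10[:, 0] * T0 / s00)
first = T0 / s00 - (S01[0] @ second) / s00

# direct solve for comparison
full = np.linalg.solve(S + R, T)
assert np.allclose(full[0], first)
assert np.allclose(full[1:], second)
```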
ACKNOWLEDGMENTS This work was supported by project no. 21-52567.97 of the Swiss NSF.
REFERENCES

Cleveland, W. S., and Loader, C. (1996), "Smoothing by Local Regression: Principles and Methods," in Statistical Theory and Computational Aspects of Smoothing, W. Härdle, M. G. Schimek (eds), Physica, 10-49.

Fan, J. (1993), "Local Linear Regression Smoothers and their Minimax Efficiencies," Annals of Statistics, 21, 196-216.

Fan, J., and Gijbels, I. (1995), "Adaptive Order Polynomial Fitting: Bandwidth Robustification and Bias Reduction," Journal of Computational and Graphical Statistics, 4, 213-227.

--- (1996), Local Polynomial Modelling and its Applications, London: Chapman & Hall.

Gasser, T., Sroka, L., and Jennen-Steinmetz, C. (1986), "Residual Variance and Residual Pattern in Nonlinear Regression," Biometrika, 73, 625-633.

Hall, P., and Marron, S. (1997), "On the Role of the Shrinkage Parameter in Local Linear Smoothing," Probability Theory and Related Fields, 108, 495-516.

Ruppert, D., Sheather, S. J., and Wand, M. P. (1995), "An Effective Bandwidth Selector for Local Least Squares Regression," Journal of the American Statistical Association, 90, 1257-1270.

Seifert, B., and Gasser, T. (1996), "Finite Sample Variance of Local Polynomials: Analysis and Solutions," Journal of the American Statistical Association, 91, 267-275.

--- (1998), "Ridging Methods in Local Polynomial Regression," in Dimension Reduction, Computational Complexity, and Information, S. Weisberg (ed), Vol. 30 of Computing Science and Statistics, Fairfax Station, VA: Interface Foundation of North America, 467-476.

Seifert, B., Brockmann, M., Engel, J., and Gasser, T. (1994), "Fast Algorithms for Nonparametric Curve Estimation," Journal of Computational and Graphical Statistics, 3, 192-213.