We shall consider two specific rules, respectively, local averaging and local. Ž x linear interpolation. Assuming x g X , X. , in the first rule we define l lq1. Ž . Ž .y1.
The Annals of Statistics 1997, Vol. 25, No. 5, 1912 ] 1925
INTERPOLATION METHODS FOR NONLINEAR WAVELET REGRESSION WITH IRREGULARLY SPACED DESIGN BY PETER HALL AND BERWIN A. TURLACH Australian National University We introduce interpolation methods that enable nonlinear wavelet estimators to be employed with stochastic design, or nondyadic regular design, in problems of nonparametric regression. This approach allows relatively rapid computation, involving dyadic approximations to waveletafter-interpolation techniques. New types of interpolation are described, enabling first-order variance reduction at the expense of second-order increases in bias. The effect of interpolation on threshold choice is addressed, and appropriate thresholds are suggested for error distributions with as few as four finite moments.
1. Introduction. Nonlinear wavelet methods in statistics provide a uniquely adaptive tool, offering unsurpassed levels of utility for estimating a wide range of both regular and irregular functions. See, for example, Donoho, Johnstone, Kerkyacharian and Picard Ž1995, 1996., Kerkyacharian and Picard Ž1993. and Donoho and Johnstone Ž1994, 1998.. Despite their adaptivity, however, wavelet methods are typically restricted by assumptions about the type of design that may be employed in problems of nonparametric regression. Usually the design must be not only regularly spaced, but dyadic. This contrasts with other approaches to curve estimation, where a great deal of attention has been paid to the problem of irregularly spaced design points. One of the earliest nonparametric curve estimating techniques, that of Ž1990., page 25, and Wand and Jones Nadaraya and Watson w cf. Hardle ¨ Ž1995., page 119x , is constructed as a ratio that explicitly cancels out most of the effect of stochastic design. Alternative approaches, for example the convoŽ1979. and local polynomial smoothing lution method of Gasser and Muller ¨ w e.g. Hastie and Loader Ž1993.x , conduct the cancellation implicitly. In the present paper we describe an interpolation-based approach to nonlinear wavelet methods in the case of stochastic, or regular but nondyadic, design. We identify threshold parameters that are sufficient for good performance when using interpolation, and describe mean squared error properties for appropriate thresholds. Particular attention is paid to developing results that hold under minimal conditions on the error distribution, so as to demonstrate wide applicability of the interpolation approach. Two different interpolation rules are considered, and their properties elucidated using both theoretical and numerical arguments. It is shown that, in addition to allow-
Received November 1995; revised September 1996. AMS 1991 subject classifications. Primary 62G07; secondary 62G30. Key words and phrases. Bias, mean squared error, nonparametric regression, piecewise smooth, stochastic design, threshold, variance.
1912
INTERPOLATION METHODS FOR WAVELETS
1913
ing wavelets to be employed with stochastic, or regular but nondyadic, design sequences, interpolation methods admit fast computation. This is achieved by approximating the interpolation-based estimator, defined in the continuum, using a dyadic grid. In the context of kernel methods an early interpolation approach was considered by Clark Ž1977, 1980.. A technique for accommodating stochastic design, modelled on the Nadaraya]Watson kernel method, was considered by Hall and Patil Ž1996a.. We approach the analysis of performance rather differently from Donoho, Johnstone, Kerkyacharian and Picard in their work cited earlier, since we wish to stress the impact which different interpolation rules have on performance. Critically, choice of interpolation rule affects the size of the variance term through a constant factor, rather than by changing the rate of convergence. Interpolation has no first-order impact on bias. Such properties cannot be described accurately by deriving only upper bounds. The approach taken in the present paper is to produce concise asymptotic formulas, in effect upper and lower bounds that are asymptotically identical. We do this for a single, piecewise-smooth target function, although our results may be shown to hold simultaneously over a large class of such functions. Our focus on functions that are only piecewise smooth, rather than smooth everywhere, still allows us to demonstrate that wavelet methods achieve a degree of adaptivity not enjoyed by traditional techniques such as those based on kernels or splines, which Žfor example. perform poorly with functions that have jump discontinuities. We should stress that the most distinctive features of different interpolation rules, which are present only in constant-factor changes to variance, do not emerge at all in a traditional minimax description of ‘‘pure thresholded’’ wavelet estimators. This is because pure thresholded estimators are oversmoothed by an order of magnitude w see, e.g., Hall and Patil Ž1996b.x , with the result that the effect of variance is swamped by that of bias. Hence, the main conclusions reached in the present paper do not emerge from more familiar analyses. 2. Methodology. 2.1. Model for data. Let data Y ' Ž X m , Ym ., 1 F m F n4 be generated by the model Ym s g Ž X m . q j m for 1 F m F n, where the design sequence X ' X m , 1 F m F n4 represents the ordered values of a random sample from a distribution with density f having support I s w 0, 1x , and the j m ’s are independent but not necessarily identically distributed random variables. In our asymptotic model we should, strictly speaking, denote X m and j m by X n m and j n m , respectively, since the value of the mth among them will vary with increasing sample size. 2.2. Interpolation rules. Let wm , 1 F m F n, denote weight functions depending on the X m ’s but not on the Ym ’s, and such that for integers n 1 , n 2
1914
P. HALL AND B. A. TURLACH
satisfying n 1 - 0 F n 2 we have wj ' 0 unless n 1 F j F n 2 . Define the interpolant
Ž 2.1.
YŽ x. s
Ý wm Ž x . Ym m
for x g Ž Xy n 1 , X ny n 2 .
At the ends of the design interval, that is, outside the region Ž Xyn 1, X ny n 2 x to which the above definition applies, Y may be defined in any of several ways. For definiteness we shall use horizontal extrapolation, meaning that Y Ž t . ' Y Ž Xyn 1 . on w 0, Xy n 1 x , and Y Ž t . ' Y Ž X ny n 2 . on Ž X ny n 2 , 1x . We shall consider two specific rules, respectively, local averaging and local linear interpolation. Assuming x g Ž X l , X lq1 x , in the first rule we define wm Ž x . s Ž2 n .y1 if yn q 1 F m y l F n , and wm Ž x . s 0 otherwise, and in the second,
¡n ~ Ž x. s n ¢0,
Ž X 2 lymq1 y x . r Ž X 2 lymq1 y X m . , y1 Ž x y X 2 lymq1 . r Ž X m y X 2 lymq1 . , y1
wm
if yn q 1 F m y l F 0, if 1 F m y l F n , otherwise.
Substituting into Ž2.1., we obtain, respectively, for x g Ž X l , X lq1 x , y1 Ž 2.2. Y Ž x . s Ž 2 n .
Ž 2.3. Y Ž x . s ny1
n
n
Ý
Ý
msy n q1
ms1
ž
Ylqm ,
x y X lymq1 X lqm y X lymq1
Ylqm q
X lqm y x X lqm y X lymq1
/
Ylymq1 .
2.3. Empirical wavelet transform for interpolated data. Write f and c for the ‘‘father’’ and ‘‘mother’’ wavelet functions, let p s pŽ n. be the primary resolution level, define pi s 2 i p for i G 0 and let f j Ž x . s p 1r2f Ž px q j . and c i j Ž x . s pi1r2c Ž pi x q j . be the functions that form the orthonormal basis of a wavelet expansion. Put bj s HI g f j and bi j s HI g c i j . We assume that c is of order r, meaning that r G 1 is the smallest integer such that Hx ic Ž x . dx is nonzero. Our estimators of bj and bi j are, respectively, ˆ bj s HI Yf j and ˆbi j s HI Yci j , and lead to the empirical wavelet transform, qy1
gˆ s Ý ˆ bj f j q Ý Ý ˆ bi j I ž < ˆ bi j < G d / c i j .
Ž 2.4.
j
is0
j
The empirical coefficients ˆ bj and ˆ bi j may be approximated to arbitrary accuracy on a dyadic grid. Indeed, taking N s 2 k for an integer k G 1, we may define N
N
ms1
ms1
˜bj s Ny1 Ý Y Ž mrN . f j Ž mrN . and ˜bi j s Ny1 Ý Y Ž mrN . ci j Ž mrN . , which represent series approximations to ˆ bj and ˆ bi j , respectively. Both are calculable using Mallat’s pyramid algorithm. The resulting analogue of gˆ is g, ˜ obtained by replacing each empirical coefficient ˆb in Ž2.4. by ˜b. In taking
1915
INTERPOLATION METHODS FOR WAVELETS
this approach to estimation, we are effectively replacing the data set Y by Y X ' Ž mrN, Y Ž mrN .., 1 F m F N 4 , and applying a relatively standard wavelet estimator to Y X . Provided that Nrn ª `, the first-order asymptotics of g˜ are identical to those of g. ˆ 2.4. Main theoretical result. Assume of g that it enjoys r piecewise-continuous derivatives, in the sense that there exist constants 0 s a1 - a2 - ??? - a k s 1 such that g has r continuous derivatives on each interval Ž a l , a lq1 . for 1 F l F k y 1, with left- and right-hand limits at a l and a lq1 , respectively; of f that it is piecewise-continuous in this sense, possibly with a different k and different a i ’s, and is bounded away from zero on I s w 0, 1x ; of c and f that they are compactly supported and Holder continuous. Then for ¨ some r G 1 and k / 0, and all integers i g w 0, r x and j g Žy`, `.,
Hc
2
Hx c Ž x . dx s k Ž r !.
s 1,
i
Hf s 1,
y1
di r ,
Hf Ž x . f Ž x q j . dx s d
0j,
where d jk is the Kronecker delta; assume of the tuning parameters p and q in the definition of g, ˆ that for some u ) 0 and « ) 0,
Ž 2.5.
py1 s o
½Žn
y1
log n .
1r Ž2 rq1 .
5,
y2 r rŽ2 rq1. py1 ., q s oŽ n
pq s O Ž nminŽ uq1rŽ2 rq1., 1.y « . ;
and of errors j m s j n m in the model of section 2.1 that they may be written as j n m s s Ž X n m . j nX m , where s is a piecewise-continuous function on I , j 1X m , . . . , j nX m are stochastically independent of one another and of X1 , . . . , X n , and each j nX m has, for 1 F m F n - `, the distribution of j X , with EŽ j X . s 0, EŽ j X 2 . s 1, E < j X < 2Ž1 qu. - `, and u as in Ž2.5.. We refer to the conditions in the previous paragraphs collectively as ŽC.. Condition Ž2.5. and the assumption E < j X < 2Ž1 qu. - ` are satisfied if we take p equal to a constant multiple of n1rŽ2 rq1., q equal to the integer part of Ž1 y « . log 2 n for some 0 - « - 2 rrŽ2 r q 1., and u s 2 rrŽ2 r q 1.; and if E < j X < 4 - `. It will follow from Theorem 2.1 that this size of p is optimal. We shall assume that the interpolation rule is given by either Ž2.2. or Ž2.3., although any of several other approaches could be employed. In the case of the rule at Ž2.2. put dn ' 1 q Ž2 n .y1 , and for the rule at Ž2.3., let dn ' Ž 2 n .
y2
E
½
y1
Ý
ž
ly1
Zl 2
lsy n
q
ny1
rs2 lq1
Ý Zl
ls0
Ý
Zr q Zl
ž
/ž
Ý
rslq1
Ý
Zr
rs2 lq1
2l
2
/ /ž Ý / 5 y1
y1
y1
2l
Zr q Zl
Zr
2
,
rs0
where Zr , y` - r - `4 are independent exponentially distributed random variables. For this definition, d1 s 3r2 and dn s 1 q O Ž ny1 . as n ª `.
1916
P. HALL AND B. A. TURLACH
Construct gˆ using tuning parameters p, q satisfying Ž2.5., and employing the threshold d s Ž Dny1 log n.1r2 , where the constant D satisfies D ) 2 u dn sup Ž s 2rf .
Ž 2.6.
and u is as in Ž2.5.. Define D 1 ' dn Hs 2 f y1 and D 2 ' k 2 Ž1 y 2y2 n .y1HŽ g Ž r . . 2 . THEOREM 2.1. Under conditions ŽC.,
Ž 2.7.
HE Ž gˆ y g .
2
s D 1 ny1 p q D 2 py2 r q o Ž ny1 p q py2 r . .
2.5. Discussion. REMARK 2.1 ŽVariance and bias.. The first and second terms on the right-hand side of Ž2.7. represent, respectively, the variance and squared bias contributions to mean integrated squared error ŽMISE.. If we replace p in Ž2.7. by hy1 , where h is a bandwidth, then Ž2.7. is transparently an analogue of the classical formula for mean squared error of a kernel or local polynomial estimator, albeit with different values of the constants D1 and D 2 . However, such formulas for kernel methods fail in the context of g ’s that are only piecewise smooth. Of course, the fact that we can decompose MISE so neatly into variance and squared-bias components is a consequence of our decision to adapt the order of the wavelet to that of the target function in places where the latter is smooth. This has enabled us to give a detailed analysis of the effect that different interpolation rules have on performance. More traditional minimax analysis would not have permitted us to reach such concise conclusions. REMARK 2.2 ŽEffects of n .. Larger values of n produce slightly better mean square performance, since the value of dn , and hence D1 , decreases as n increases. However, first-order properties of mean squared error do not capture all the qualitative features of the estimator, and in particular do not indicate the detrimental second-order effect that using too large a value of n can have on bias. If interpolation is over a wide range, then a wavelet estimator applied to interpolated data will recover a slope rather than a jump. Our numerical work in Section 3 will address this point. The variance inflation represented by the fact that dn ) 1 for finite n is related to that discussed by Chu and Marron Ž1991. in the case of convolution-based kernel methods. If n s n Ž n. ª ` so slowly that n s O Ž n « . for all « ) 0, then Theorem 2.1 remains true without any changes to the regularity conditions. We may then replace dn by d` s 1 in Ž2.6. for the threshold constant. REMARK 2.3 ŽChoice of p .. We see from Ž2.7. that the optimal value of p, in the sense of minimizing the right-hand side, is asymptotic to Ž2 rnD 2rD1 .1rŽ2 rq1.. The first part of condition Ž2.5. asks that p be of larger
INTERPOLATION METHODS FOR WAVELETS
1917
order than un s Ž nrlog n.1rŽ2 rq1., which is only marginally less than the order of the optimal p. However, it may be proved that Ž2.7. fails if p is of size un or smaller; and that, under the conditions of Theorem 2.1 excluding the first part of Ž2.5., and assuming that p ª `,
Ž ny1 log n .
2 rr Ž2 rq1 .
sO
½H
I
E Ž gˆ y g .
2
5
.
REMARK 2.4 ŽLocally varying thresholds.. Theorem 2.1 remains valid, under identical conditions, if we use a varying threshold in which the indicator I Ž< ˆ bi j < G d . in Ž2.4. is replaced by I Ž< ˆ bi j < G d i j ., where d i j s Ž Di j ny1 log n.1r2 and, in analogy to Ž2.6., Di j ) 2 u Ž 1 q « . dn s 2 Ž yjrpi . rf Ž yjrpi . for some « ) 0. ŽThis definition is appropriate if s 2 and f are continuous, but should be modified at jump discontinuities if those functions are discontinuous.. REMARK 2.5 ŽEmpirical choice of threshold.. Theorem 2.1 is readily extended to the case of an empirical chosen threshold, as follows. For simplicity, assume that s is constant on I . Let g ˆ , sˆ denote estimators of g ' inf f, s , ˆ s A sˆ 2rgˆ , and let g˜1 denote respectively, let A ) 2 udn be a constant, put D the version of gˆ in which the threshold is the random variable dˆ s ˆ y1 log n.1r2 . Assume we know a constant B such that < g < F B, and define Ž Dn g˜ s g˜1 if Hg˜12 F B 2 , and g˜ s 0 otherwise. Put n s 2 rrŽ2 r q 1.. THEOREM 2.2. If conditions ŽC. hold, with s constant, and if for all « ) 0 and some C ) 0,
Ž 2.8.
P Ž Cy1 F g ˆ F g q « . s 1 y o Ž nyn . , P Ž s y « F sˆ F C . s 1 y o Ž nyn . ,
then Ž2.7. holds with g˜ replacing g. ˆ For example, let U equal the maximal spacing between adjacent X i ’s, and let V equal the variance of the Yi ’s Žan overestimate of s 2 .. Put g ˆs Žlog n.rŽ nU . and sˆ 2 s V. Then Ž2.8. holds if in addition to ŽC. we assume that f is continuous on I ; see Section 5. The resulting threshold has a particularly simple formula: dˆ s Ž AUV .1r2 . Likewise, one may develop empirical, locally varying thresholds, and prove that they have properties similar to those of their nonempirical counterparts discussed in Remark 2.4. REMARK 2.6 ŽInterpolating nonrandom designs.. If the design variables X i may be written as X i s F y1 Ž irn., where F has a piecewise-continuous derivative f on I , and f is bounded above zero on I and integrates to 1 on I , then Theorem 2.1 holds if we replace dn by 1 throughout. In the case of
1918
P. HALL AND B. A. TURLACH
regularly-spaced design, f ' 1. There is now no advantage in using higher values of n , with the accompanying second-order exacerbation of bias that they entail. Methods for this case have also been studied by Cai Ž1996.. This point bears restating, for emphasis: the cases of stochastic design and deterministic-but-irregular design are distinctly different. They have identical first-order bias properties, but the former produces estimators with greater variance. These points are analysed in more detail in a longer manuscript w Hall and Turlach Ž1995.x , obtainable from the authors, on which this paper is based. 3. Numerical properties. We performed an extensive simulation study using the software package wavethresh w Nason and Silverman Ž1994.x . Three different targets were employed: the polynomial-with-jump function from Nason and Silverman Ž1994., and the functions ‘‘HeaviSine’’ and ‘‘Blocks’’ of Donoho and Johnstone Ž1994.. Following Donoho and Johnstone’s lead we chose the noise level so that the signal-to-noise ratio on a standard deviation scale was seven. The design points X i were drawn from a Uniform or a truncated Normal distribution. We used sample sizes n s 25, 50, 100, 250, 500 and 1000, and varied n between 1 and 9 in steps of 2. We employed the threshold dˆ s Ž AUV .1r2 suggested in Remark 2.5, with A s 3. Computation was performed on a dyadic grid, as suggested in Section 2.3, with N s 2 k and k s maxŽ8, u1.2 log 2 n v., where u x v denotes the smallest integer greater than or equal to x, and ? x @ the largest integer less than or equal to x. We used p s 2 i , q s 1, . . . , k y i Ž i s 1, . . . , ? kr2@ y 1. and Daubechies’ ‘‘extremal phase’’ wavelet of order six w Nason and Silverman Ž1994.x . For each setup we simulated 1000 realizations. In each case we calculated the wavelet decomposition, performed the thresholding and computed the estimator. We calculated integrated squared error, using the trapezoidal rule, and estimated mean integrated squared error ŽMISE. by averaging over these 1000 values. We implemented both interpolation rules, Ž2.2. and Ž2.3.. For small sample sizes, we obtained slightly better mean integrated squared performance by using Ž2.3., but usually, from n s 250 onward, performance of Ž2.2. and Ž2.3. was equivalent. In the following summary of our results we shall concentrate on Ž2.2. and X i uniformly distributed. Results in other cases are available from the authors upon request. Figure 1 shows, for different choices of n and n s 500, five typical realizations of the wavelet estimator for the polynomial-with-jump target. The values used for p and q are those which minimized observed MISE. The positive Ždecreasing. influence on variance for larger values of n is clear. At the same time the negative Žincreasing. influence on Žfinite sample. bias is apparent; smaller jumps are ‘‘smoothed away,’’ while larger jumps turn into slopes. Similar features were observed in other cases. For n G 250 the tendency of bias to increase with n had an effect on MISE only in the case of the ‘‘Blocks’’ target. For the other two targets, the decrease in variance was sufficient to compensate for the increase in bias.
INTERPOLATION METHODS FOR WAVELETS
1919
FIG. 1. Five typical realizations of the wavelet estimator, defined at Ž2.4., for the polynomialwith-jump target. Each realization is based on n s 500 observations, interpolation rule Ž2.2. and n s 1, 3 or 9.
1920
P. HALL AND B. A. TURLACH
4. Outline proof of Theorem 2.1. A fuller account of the argument is available in Hall and Turlach Ž1995., obtainable from the authors. 4.1. Preliminaries. For the sake of brevity we shall give the proof only in the case where f is uniformly continuous on I and the function s 2 is a constant. An additional argument would allow us to overcome the inconvenience of jump discontinuities in f and a varying s 2 . Let C, C1 , C2 , . . . denote generic positive constants. In view of the orthogonality properties of f and c ,
H Ž gˆ y g .
2
s A1 q A 2 q A 3 q A 4 ,
where A1 '
qy1
Ý ž ˆbj y bj / , 2
A2 '
j
Ý Ý ž ˆbi j y bi j /
is0
j
qy1
A3 '
ž
/
I