Nonparametric Statistics Vol. 16(1–2), February–April 2004, pp. 127–151

PLUG-IN BANDWIDTH SELECTOR FOR LOCAL POLYNOMIAL REGRESSION ESTIMATOR WITH CORRELATED ERRORS

MARIO FRANCISCO-FERNÁNDEZ^{a,∗}, JEAN OPSOMER^{b} and JUAN M. VILAR-FERNÁNDEZ^{a}

^{a} Departamento de Matemáticas, Facultad de Informática, La Coruña, 15071, Spain; ^{b} Department of Statistics, Iowa State University, Ames, IA 50011, USA

(Received 5 August 2002; Revised 11 April 2003; In final form 14 May 2003)

Consider the fixed regression model where the error random variables come from a strictly stationary, non-white-noise stochastic process. In such a situation, automated bandwidth selection methods for non-parametric regression break down. We present a plug-in method for choosing the smoothing parameter for local least squares estimators of the regression function. The method takes the presence of correlated errors explicitly into account through a parametric specification of the correlation function. The theoretical performance of the selector for the local linear estimator of the regression function is obtained in the case of an AR(1) correlation function. These results can readily be extended to other settings, such as different parametric specifications of the correlation function, derivative estimation and multiple non-parametric regression. Estimators of regression functionals and of the error correlation based on local least squares ideas are developed in this article. A simulation study and an analysis with real economic data illustrate the proposed selection method.

Keywords: Non-parametric estimation; Kernel regression; Dependent data; Smoothing parameter selection

1 INTRODUCTION

The local polynomial regression estimator has gained wide acceptance as an attractive method for estimating the regression function and its derivatives. Some of the advantages of this non-parametric estimation method over other kernel-based methods are better boundary behavior, adaptability to estimating regression derivatives, easy computation, and good minimax properties. See Wand and Jones (1995) for a discussion of the properties of local polynomial regression when the errors are assumed to be independent. The local polynomial estimator is obtained by locally fitting a pth degree polynomial to the data by weighted least squares and, as with any non-parametric regression procedure, one has to choose an important parameter, called the bandwidth or smoothing parameter, which controls the amount of local averaging performed to obtain the regression estimate. Adequate selection of this parameter is vital for correct estimation of the regression function and its derivatives. Extremely wiggly curves are produced if this parameter is very small, and very

∗ Corresponding author. E-mail: [email protected]

© 2004 Taylor & Francis Ltd; ISSN 1048-5252 print; ISSN 1029-0311 online; DOI: 10.1080/10485250310001622848


smooth curves result if the bandwidth is too big. Therefore, in practice, it would be useful to have available an automated method, based on the sample observations, for the selection of the smoothing parameter.

Bandwidth selection for local polynomial regression has been studied extensively in the independent error setting. While a number of different approaches have been proposed, we only provide some relevant references on the plug-in type of bandwidth estimators, which are extended to the dependent error setting later in this article. The concept of plug-in bandwidth selection is simple and appealing: find the bandwidth that minimizes the theoretical mean squared error (MSE) or mean integrated squared error (MISE) of the estimator, and replace any unknown terms in that optimal bandwidth by estimators. Fan and Gijbels (1995) proposed a plug-in technique for the selection of the bandwidth of the local polynomial estimator of the regression function and its derivatives that can give rise to local or global bandwidths. This technique is based on using a criterion of squared residuals to first obtain a pilot bandwidth, and then using this pilot bandwidth to obtain good approximations of the bias and variance of the estimator. Subsequently, Fan et al. (1996) obtained the convergence rates of a local plug-in bandwidth constructed with the ideas exposed in the 1995 work. They also obtained the asymptotic distribution of the relative error of this bandwidth with respect to the optimal variable bandwidth. Ruppert et al. (1995) proposed three plug-in type selectors with varying degrees of complexity to obtain the smoothing parameter minimizing an asymptotic approximation to the MISE of the local polynomial estimator of the regression function and its derivatives. Ruppert (1997) proposed a plug-in bandwidth selection method that does not require asymptotic approximations for the bias of the estimator.

For the local polynomial estimator of the regression function, research on smoothing parameter selection when dependence exists among the observations is rather limited, especially compared with the amount of research on other kernel-type estimators of the regression function (see Opsomer et al. (2001) for a list of references). We briefly review some relevant results here. In Hart (1996), a review is presented of different methods for smoothing parameter selection under dependence and in different contexts (density, regression), and the time series cross-validation method is applied to the local linear estimator. In Hall et al. (1995), the bandwidths obtained using the methods of modified cross-validation and the moving blocks bootstrap in the context of regression with short-range and long-range dependence are discussed and compared. These selectors are applied to kernel estimators as well as to local linear regression. In Opsomer (1997), one of the bandwidth selection methods proposed in Ruppert et al. (1995) is generalized to the case of dependence and additive bivariate models. Recently, Hall and Van Keilegom (2003) have proposed a differencing-based approach for bandwidth selection for local polynomial regression in the case of time series errors.

The objective of this article is to propose a plug-in bandwidth estimator for the dependent error case when a parametric model for the error structure can be specified, and to study its theoretical and practical properties. We consider a fixed design regression model with dependent observations, this dependence coming through the observation errors.
In this situation, data-driven bandwidth selection methods that do not take the dependence of the observations into account can indeed produce bad results, as shown in Francisco-Fernández and Vilar-Fernández (2001). The estimator proposed here requires estimation of parameters from an assumed correlation model and uses the resulting estimator of the correlation function as a component in a plug-in bandwidth. For this purpose, we propose estimators of a class of regression functionals and of the autocovariances of the errors. The asymptotic bias and variance of these estimators are obtained. We will work within the same asymptotic framework as Francisco-Fernández and Vilar-Fernández (2001), and the method will be based on the work of Altman (1993), Ruppert et al. (1995) and Opsomer (1997).


The organization of the work is as follows: In Section 2, we present the regression model, the relevant theory of local least squares kernel estimators, and the proposed plug-in bandwidth. In Section 3, we describe the theoretical performance of our plug-in bandwidth selector. For this, first we have to provide theoretical background for the estimation of the autocovariances of the errors and of a class of regression function functionals that arise in the formula for the plug-in bandwidth. In Section 4, we show the empirical performance of the bandwidth studied via a simulation study and illustrate its usefulness in the analysis of a time series of Spanish stock prices.

2 THE ESTIMATOR AND SOME ASYMPTOTIC PROPERTIES

We begin by reviewing the framework of Francisco-Fernández and Vilar-Fernández (2001). Assume that univariate data Y_{1,n}, Y_{2,n}, ..., Y_{n,n} are observed, and that

    Y_{t,n} = m(x_{t,n}) + ε_{t,n},   1 ≤ t ≤ n,      (1)

where x_{t,n}, 1 ≤ t ≤ n, are the design points, m(x) is a 'smooth' regression function defined, without any loss of generality, on [0, 1], and ε_{t,n}, 1 ≤ t ≤ n, is a sequence of unobserved random variables with zero mean and finite variance σ_ε². We assume, for each n, that {ε_{1,n}, ε_{2,n}, ..., ε_{n,n}} have the same joint distribution as {ε_1, ε_2, ..., ε_n}, where {ε_t, t ∈ Z} is a strictly stationary stochastic process. Also, it is assumed that the design {x_{t,n}, 1 ≤ t ≤ n} is a regular design generated by a design density f; that is, for each n, the design points are defined by

    \int_0^{x_{t,n}} f(x) dx = \frac{t − 1}{n − 1},   1 ≤ t ≤ n,      (2)

f being a positive function, defined on [0, 1], whose first derivative is continuous. In order to simplify notation, we will not use n in the subindexes, that is, we will write x_t, ε_t and Y_t.

Our goal is to estimate the unknown regression function and its derivatives based on an observed sample {(x_t, Y_t)}_{t=1}^n. For this purpose we use the non-parametric local polynomial estimator. If we assume that the (p + 1)th derivative of the regression function at point x exists and is continuous, local polynomial fitting permits estimating the parameter vector β(x) = (β_0(x), β_1(x), ..., β_p(x))^t, where β_j(x) = m^{(j)}(x)/(j!), j = 0, 1, ..., p, by minimizing the function

    ψ[β(x)] = \sum_{t=1}^n [ Y_t − \sum_{j=0}^p β_j(x)(x_t − x)^j ]² ω_{n,t},      (3)

where ω_{n,t} = n^{-1} K_n(x_t − x) are the weights, K_n(u) = h_n^{-1} K(h_n^{-1} u), K being a kernel function and h_n the bandwidth or smoothing parameter, which controls the size of the local neighborhood and so the degree of smoothing. The estimator of β(x), obtained as a solution to the weighted least squares problem given in Eq. (3), is called the local polynomial kernel estimator, and it is interesting to observe that this class of estimators includes the classical Nadaraya–Watson estimator, which is the minimizer of Eq. (3) when p = 0. Of special interest is also the local linear kernel estimator corresponding to p = 1.


In matrix notation, we have

    Y_{(n)} = (Y_1, ..., Y_n)^t,
    X_{p,(n)} = \begin{pmatrix} 1 & (x_1 − x) & \cdots & (x_1 − x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (x_n − x) & \cdots & (x_n − x)^p \end{pmatrix},

and let W_{(n)} = diag(ω_{n,1}, ..., ω_{n,n}) be the diagonal array of weights. Then, by assuming the invertibility of X_{p,(n)}^t W_{(n)} X_{p,(n)}, standard weighted least squares theory leads to the solution

    β̂_{(n)}(x) = (X_{p,(n)}^t W_{(n)} X_{p,(n)})^{-1} X_{p,(n)}^t W_{(n)} Y_{(n)}.      (4)
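As an illustration of Eq. (4), the following minimal sketch (in Python with NumPy; the function names and the Epanechnikov kernel choice are our own assumptions, not part of the paper) solves the weighted least squares problem at a point and returns derivative estimates.

    import numpy as np
    from math import factorial

    def epanechnikov(u):
        """Epanechnikov kernel, support [-1, 1]."""
        return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

    def locpoly(x, y, x0, h, p=1, kernel=epanechnikov):
        """Solve the weighted least squares problem (3) at the point x0.

        Returns beta_hat(x0) as in Eq. (4); beta_hat[j] estimates m^(j)(x0)/j!.
        Assumes enough design points fall inside the kernel window for
        X^t W X to be invertible.
        """
        d = x - x0
        X = np.vander(d, N=p + 1, increasing=True)    # rows (1, d, ..., d^p)
        w = kernel(d / h) / (len(x) * h)              # weights omega_{n,t}
        XtW = X.T * w                                 # X^t W  (W diagonal)
        return np.linalg.solve(XtW @ X, XtW @ y)      # Eq. (4)

    def m_hat(x, y, grid, h, p=1, deriv=0):
        """Local polynomial estimate of m^(deriv) on a grid: j! * beta_hat_j."""
        return np.array([factorial(deriv) * locpoly(x, y, x0, h, p)[deriv]
                         for x0 in grid])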

Then, the local polynomial estimator of the jth derivative of the regression function, m̂^{(j)}(x), is given by β̂_j(x)(j!), β̂_j(x) being the (j + 1)th component of the vector given in Eq. (4), j = 0, 1, ..., p.

To derive an asymptotically optimal global bandwidth we can use a global measure of the estimation error; for example, the MISE is given by:

    MISE(h) = \int MSE[m̂^{(j)}(x)] f(x) dx.      (5)

Under suitable conditions, it follows from Corollary 1 of Francisco-Fernández and Vilar-Fernández (2001) that, for odd p,

    MSE[m̂^{(j)}(x)] = bias²[m̂^{(j)}(x)] + var[m̂^{(j)}(x)],      (6)

where

    bias[m̂^{(j)}(x)] = h_n^{p+1−j} \frac{m^{(p+1)}(x) j!}{(p + 1)!} \int u^{p+1} K_{(j,p)}(u) du [1 + o(1)]      (7)

and

    var[m̂^{(j)}(x)] = \frac{1}{n h_n^{2j+1}} \frac{c(ε) (j!)²}{f(x)} \int K²_{(j,p)}(u) du [1 + o(1)],      (8)

where c(ε) = σ_ε² [c(0) + 2 \sum_{k=1}^∞ c(k)], with c(k) = E(ε_i ε_{i+k})/σ_ε², k = 0, 1, 2, ..., and K_{(j,p)} a higher-order kernel defined in Section 3. Hence, it is easy to obtain the asymptotically optimal global bandwidth by minimizing the asymptotic expression of the MISE. For the case of the regression function (j = 0), this asymptotically optimal bandwidth is given by:

    h^{opt}_{0,g,as} = C_{0,p}(K) ( \frac{c(ε)}{n \int (m^{(p+1)}(x))² f(x) dx} )^{1/(2p+3)} = C_{0,p}(K) ( \frac{c(ε)}{n θ_{p+1,p+1}} )^{1/(2p+3)},      (9)

where θ_{r,s} = \int m^{(r)}(x) m^{(s)}(x) f(x) dx and C_{0,p}(K) is a real number that depends only on the kernel K.
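The way Eq. (9) arises from Eqs. (7) and (8) is a one-line calculus step; the following sketch (our own reconstruction, not reproduced from the paper) records it for j = 0, writing K for K_{(0,p)} and using the notations μ_{p+1}(K) = \int u^{p+1} K(u) du and R(K) = \int K(u)² du introduced formally in Section 3:

    \mathrm{AMISE}(h) = h^{2(p+1)} \Big( \frac{\mu_{p+1}(K)}{(p+1)!} \Big)^2 \theta_{p+1,p+1} + \frac{c(\varepsilon) R(K)}{nh},

and setting the derivative with respect to h equal to zero gives

    h^{2p+3} = \frac{((p+1)!)^2 R(K)}{2(p+1)\, \mu_{p+1}(K)^2} \cdot \frac{c(\varepsilon)}{n\, \theta_{p+1,p+1}},
    \quad\text{so}\quad
    C_{0,p}(K) = \Big( \frac{((p+1)!)^2 R(K)}{2(p+1)\, \mu_{p+1}(K)^2} \Big)^{1/(2p+3)}.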


In Eq. (9) there are two unknown quantities: c(ε) and θ_{p+1,p+1}. These must be estimated to produce a practical smoothing parameter from Eq. (9). In Francisco-Fernández and Vilar-Fernández (2001), some suggestions on how to do this are presented. Once these unknown quantities are estimated and the estimators plugged into Eq. (9), the resulting asymptotically optimal plug-in global bandwidth to estimate m(x) with a p-degree polynomial is denoted by ĥ^{opt}_{0,g,as}. Section 3 is devoted to describing the theoretical performance of ĥ^{opt}_{0,g,as} in the particular case of local linear fitting. For this, first, some background theory must be given for the estimation of c(ε) and θ_{2,2}.

3 THEORETICAL PERFORMANCE

The aim of this section is to obtain the rate of convergence of ĥ^{opt}_{0,g,as} to the global optimal bandwidth, h_MISE, that minimizes expression (5), when we want to estimate the regression function in model (1) using the local linear estimator, that is, j = 0 and p = 1. As in the plug-in approach of Ruppert et al. (1995) and Opsomer (1997), we will first obtain the rates of convergence of the estimators, ĉ(ε) and θ̂_{2,2}, to the two unknown quantities c(ε) and θ_{2,2} in Eq. (9), and then, using these rates, the rate sought for ĥ^{opt}_{0,g,as} can be computed. The following assumptions will be needed for our analysis:

A.1 The kernel function K(·) is symmetric, with bounded support, and Lipschitz continuous.
A.2 The sequence of bandwidths {h_n} satisfies h_n > 0, h_n ↓ 0, nh_n ↑ ∞.
A.3 Denoting cov(ε_i, ε_{i+k}) = σ_ε² c(k), k = 0, ±1, ..., the errors of the model {ε_i}_{i=1}^n satisfy \sum_{k=1}^∞ k|c(k)| < ∞, E(ε_i ε_j ε_k) = 0 and

    cov(ε_i ε_j, ε_k ε_l) = cov(ε_i, ε_k)cov(ε_j, ε_l) + cov(ε_i, ε_l)cov(ε_j, ε_k),

for all i, j, k, l.
A.4 The density function f satisfies 0 < C_1 ≤ f(x) ≤ C_2 < ∞ for all x ∈ [0, 1], where C_1 and C_2 are real numbers.

Assumption A.3 is satisfied, for example, when the errors have an AR(1) dependence structure and follow a N(0, σ_ε²) distribution. For simplicity of presentation, only this special case will be formally studied in the theoretical portion of this article. In this particular situation, the assumption of an equally spaced design is usually made. Hence, in the rest of the paper, we consider that x_t = (t − 1)/(n − 1), t = 1, 2, ..., n, that is, f(x) = 1 in expression (2). In the case of local linear fitting (p = 1), expression (9) is:

    h^{opt}_{0,g,as} = C_{0,1}(K) ( \frac{c(ε)}{n θ_{2,2}} )^{1/5} = O(n^{-1/5}),      (10)

where, since the errors are assumed to be AR(1),

    c(ε) = σ_ε² [ c(0) + 2 \sum_{k=1}^∞ c(k) ] = σ_ε² \frac{1 + ρ}{1 − ρ}.      (11)
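A minimal numerical sketch of (10)–(11) (our own illustration; the helper names, the Epanechnikov defaults, and the estimates of c(ε) and θ_{2,2} constructed later in this section are assumptions):

    def c_eps_ar1(sigma2, rho):
        """c(eps) under AR(1) errors, Eq. (11)."""
        return sigma2 * (1.0 + rho) / (1.0 - rho)

    def h_plugin_local_linear(n, c_eps, theta22, RK=0.6, mu2K=0.2):
        """Plug-in bandwidth (10) for the local linear estimator (p = 1).

        C_{0,1}(K) = (R(K)/mu_2(K)^2)^{1/5}, as recovered in the AMISE
        sketch above; the defaults RK, mu2K are the Epanechnikov values.
        """
        C01 = (RK / mu2K**2) ** 0.2
        return C01 * (c_eps / (n * theta22)) ** 0.2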

Next, we propose and study estimators for c(ε) and θ2,2 using the weighted local polynomial estimator of the regression function and its derivatives.


3.1 Local Polynomial Estimation of θ_{2,2}

In this subsection we investigate the local polynomial estimation of

    θ_{2,2} = \int [m''(x)]² f(x) dx.      (12)

A natural estimator for this functional is given by:

    θ̂_{2,2}(g) = \frac{1}{n} \sum_{i=1}^n [m̂''_g(x_i)]²,      (13)

where

    m̂''_g(x) = 2! e_3^t (X^t_{p_θ,(n)} W_{(n)} X_{p_θ,(n)})^{-1} X^t_{p_θ,(n)} W_{(n)} Y_{(n)}      (14)

is the p_θ-degree local polynomial estimator of the second derivative of the regression function obtained from Eq. (4), and g is a pilot bandwidth. For simplicity, we assume that p_θ is an integer greater than 2 and that p_θ − 2 is odd. Here e_3 represents a (p_θ + 1) × 1 vector having 1 in the third entry and all other entries equal to 0. Let

    K_{(j,p)}(u) = j! \frac{|M_{j,p}(u)|}{|S|} K(u),      (15)

where S is the (p + 1) × (p + 1) array whose (i + 1, j + 1)th element is \int u^{i+j} K(u) du, 0 ≤ i, j ≤ p, and M_{j,p}(u) is the same as S, except that the (j + 1)th row is replaced by (1, u, ..., u^p)^t. It is easily established that K_{(j,p)}(u) satisfies

    \int u^r K_{(j,p)}(u) du = { 0,        0 ≤ r ≤ s − 1, r ≠ j,
                              { j!,       r = j,
                              { c_{j,p},  r = s,      (16)

where s = p + 1 when p − j is odd, s = p + 2 when p − j is even, and c_{j,p} is some non-zero constant. Therefore, (−1)^j K_{(j,p)} is an order (j, s) kernel as defined by Gasser et al. (1985).
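For concreteness, a sketch of the estimator (13) on top of the earlier locpoly helper (a local cubic fit, p_θ = 3, is the common choice discussed below; the function name is our own):

    def theta22_hat(x, y, g, p_theta=3):
        """Estimate theta_{2,2} = int [m''(x)]^2 f(x) dx, Eq. (13),
        by averaging squared second-derivative estimates, Eq. (14)."""
        m2 = m_hat(x, y, grid=x, h=g, p=p_theta, deriv=2)
        return np.mean(m2**2)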

Let (L_1 ∗ L_2)(x) = \int L_1(u) L_2(x − u) du denote the convolution of two real-valued functions L_1 and L_2. Some other useful notations are: μ_l(L) = \int u^l L(u) du and R(L) = \int L(u)² du. The next result, proven in the Appendix, establishes the asymptotic bias and variance of θ̂_{2,2}(g) as an estimator of θ_{2,2}.

PROPOSITION 1 Under Assumptions A.1, A.3 and A.4, assuming that m and f have 4 continuous derivatives, and that the pilot bandwidth g satisfies g → 0 and ng⁵ → ∞, we have:

    E[θ̂_{2,2}(g)] ≈ θ_{2,2} + 2 \frac{μ_{p_θ+1}(K_{(2,p_θ)})}{(p_θ + 1)!} g^{p_θ−1} θ_{2,p_θ+1} + R(K_{(2,p_θ)}) c(ε) n^{-1} g^{-5}      (17)

and

    var[θ̂_{2,2}(g)] = O( \frac{1}{n} + \frac{1}{n² g⁹} ),      (18)

where A ≈ B denotes A = B[1 + o(1)].


This result generalizes that of Ruppert et al. (1995) to the dependent error case. To estimate the second derivative of the regression function using the local polynomial estimator, it is common to use a cubic fit. In this particular case, p_θ = 3, expressions (17) and (18) are:

    E[θ̂_{2,2}(g)] ≈ θ_{2,2} + 2 \frac{μ_4(K_{(2,3)})}{4!} g² θ_{2,4} + R(K_{(2,3)}) c(ε) n^{-1} g^{-5}      (19)

and

    var[θ̂_{2,2}(g)] = O( \frac{1}{n} + \frac{1}{n² g⁹} ).      (20)

The expressions in Eqs. (17) and (18) [or in Eqs. (19) and (20)] give the rates of convergence of the bias and the variance of θ̂_{2,2}(g), which makes it possible to obtain the order of the optimal pilot bandwidth, that is, the bandwidth that minimizes the MSE. Once this optimal bandwidth is substituted in the bias and variance expressions, the minimum order for the MSE can also be obtained.

3.2 Nonparametric Estimation of c(ε)

The other unknown parameter appearing in Eq. (10) is the sum of the covariances of the errors, denoted by c(ε). Assuming AR(1) errors, this parameter is given by Eq. (11), so to estimate it, only estimators of σ_ε² and ρ are needed. This is in contrast to the result of Opsomer (1997), where c(ε) was not assumed to be parametrically specified, and non-parametric estimation in the frequency domain was used instead. When the parametric assumption for the correlation structure holds (at least approximately), the current approach should provide a more reliable estimator for c(ε). The following estimators, obtained by the method of moments, will be used in our study:

    σ̂_ε² = \frac{1}{n} \sum_{i=1}^n ε̂_i²      (21)

and

    ρ̂ = \frac{\sum_{i=1}^{n−1} ε̂_i ε̂_{i+1}}{\sum_{i=1}^n ε̂_i²},      (22)

where

    ε̂_i = Y_i − m̂_λ(x_i),   i = 1, 2, ..., n,

and m̂_λ(x_i) is the local polynomial estimator of m(x_i) using a p_σ degree polynomial and a pilot bandwidth λ. Now, let us consider the following estimators of γ(s) = E(ε_k ε_{k+s}):

    γ̂_{λ,n}(s) = \frac{1}{n} \sum_{i=1}^{n−s} ε̂_i ε̂_{i+s},   s = 0, 1, 2, ..., k.      (23)
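A sketch of (21)–(23), together with the truncated-sum estimator (28) discussed below (function names and the pilot-fit helper are our own assumptions):

    def error_moment_estimates(x, y, lam, p_sigma=1, s_max=1):
        """Residual-based method-of-moments estimates, Eqs. (21)-(23)."""
        resid = y - m_hat(x, y, grid=x, h=lam, p=p_sigma, deriv=0)  # eps_hat
        n = len(resid)
        gamma = np.array([resid[:n - s] @ resid[s:] / n             # Eq. (23)
                          for s in range(s_max + 1)])
        sigma2_hat = gamma[0]                                       # Eq. (21)
        rho_hat = gamma[1] / gamma[0]                               # Eq. (22)
        return sigma2_hat, rho_hat, gamma

    def c_eps_truncated(gamma, s):
        """Direct estimate (28): sum of gamma_hat(k) over |k| <= s."""
        return gamma[0] + 2.0 * np.sum(gamma[1:s + 1])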

The asymptotic behavior of γˆλ,n (s) is studied in the following Proposition (proven in the Appendix).


PROPOSITION 2 If Assumptions A.1, A.3 and A.4 hold and the pilot bandwidth λ satisfies λ → 0 and nλ → ∞ as n → ∞, we have:

    E[γ̂_{λ,n}(s)] = γ(s) + λ^{2p_σ+2} ( \frac{μ_{p_σ+1}(K_{p_σ})}{(p_σ + 1)!} )² θ_{p_σ+1,p_σ+1} + [R(K_{p_σ}) − 2K_{p_σ}(0)] \frac{c(ε)}{nλ} + o(λ^{2p_σ+2}) + o( \frac{1}{nλ} ) + o( \frac{s}{n} )      (24)

and

    var[γ̂_{λ,n}(s)] = \frac{V}{n} + O( \frac{1}{n²λ} ),      (25)

where K_p denotes the kernel K_{(0,p)} defined in Eq. (15) and V is the corresponding element of the covariance matrix given by Bartlett's formula (Bartlett, 1946). These results generalize those stated in Altman (1993) for the kernel regression case. In the particular case of using a local linear fit to obtain the residuals, that is, p_σ = 1, expressions (24) and (25) become:

    E[γ̂_{λ,n}(s)] = γ(s) + λ⁴ ( \frac{μ_2(K)}{2} )² θ_{2,2} + [R(K) − 2K(0)] \frac{c(ε)}{nλ} + o(λ⁴) + o( \frac{1}{nλ} ) + o( \frac{s}{n} )      (26)

and

    var[γ̂_{λ,n}(s)] = \frac{V}{n} + O( \frac{1}{n²λ} ).      (27)

From Proposition 2, the asymptotic bias and variance of estimators (21) and (22) are obtained. Moreover, the optimal pilot bandwidth λ can be readily deduced. These results generalize immediately to situations in which the parameters of the correlation function can be estimated by a function of terms of the form (23). Another approach, which we do not further explore here, is to note that since c(ε) = \sum_{k=−∞}^∞ γ(k), we could directly estimate c(ε) by

    ĉ(ε) = \sum_{k=−s}^{s} γ̂_{λ,n}(k),      (28)

with s a sufficiently large integer. From Proposition 2, the asymptotic behavior of Eq. (28) can be deduced for fixed s, and could be extended to the case in which s grows slowly as n → ∞.

3.3 Theoretical Performance of the Asymptotically Optimal Bandwidth

In this subsection we present a theorem that shows the behavior of the plug-in bandwidth ĥ^{opt}_{0,g,as} in the case of using a local linear polynomial. This result gives the rate of convergence of that bandwidth to the optimal global bandwidth that minimizes the MISE, denoted by h_MISE.


THEOREM 1 If Assumptions A.1–A.4 hold and we choose λ = Ln^{−1/(2p_σ+3)} and g = Gn^{−1/(p_θ+4)}, with L and G constants, it follows that the relative error of ĥ^{opt}_{0,g,as} satisfies:

    n^{(p_θ−1)/(p_θ+4)} \frac{ĥ^{opt}_{0,g,as} − h_MISE}{h_MISE} −→ D,      (29)

where

    D = − \frac{1}{5} θ_{2,2}^{-1} [ 2 \frac{μ_{p_θ+1}(K_{(2,p_θ)})}{(p_θ + 1)!} G^{p_θ−1} θ_{2,p_θ+1} + R(K_{(2,p_θ)}) c(ε) G^{-5} ].      (30)

In the particular case of p_σ = 1 and p_θ = 3, the pilot bandwidths would be λ = Ln^{−1/5} and g = Gn^{−1/7}, and

    n^{2/7} \frac{ĥ^{opt}_{0,g,as} − h_MISE}{h_MISE} −→ D,      (31)

where

    D = − \frac{1}{5} θ_{2,2}^{-1} [ 2 \frac{μ_4(K_{(2,3)})}{4!} G² θ_{2,4} + R(K_{(2,3)}) c(ε) G^{-5} ].      (32)

The proof of this result is obtained using the same arguments as those presented in Ruppert et al. (1995) for the ĥ_DPI bandwidth proposed in that paper, using the local linear estimator and under independence. In fact, the rate of convergence in both cases is the same, but the constant D in Eq. (32) would change when the data are independent: the variance of the errors, σ_ε², appearing in the independent case is replaced by the sum of the covariances, c(ε). That same rate was also obtained for the bandwidth estimator under dependence discussed in Opsomer (1997). The rate-limiting component of all these plug-in bandwidths is the estimation of θ_{2,2}. The rate achieved for estimating θ_{2,2} remains unchanged across the asymptotic frameworks studied by these authors as well as in the current article (independence, non-parametric and parametric specification of the error structure, respectively). Hence, the overall convergence rate of the bandwidth estimators also remains the same.

The results obtained can be extended to other settings, such as derivative estimation, local bandwidths, multiple non-parametric regression and the generalization to ARMA or m-dependent errors. We refer to Müller and Stadtmüller (1988) for a discussion of how to choose estimators for the covariances in this case with the same order as the one obtained in Subsection 3.2.

4 SIMULATION STUDY AND EXAMPLE

In this section, we present a simulation study to evaluate the practical performance of the bandwidth studied in Sections 1–3. We also apply the methods presented to a time series of stock prices for a Spanish bank.

4.1 Simulation Study

We compare the bandwidth given in Eq. (10) with two other bandwidth estimators: the same bandwidth as Eq. (10) but assuming that the errors are independent, that is, that c(ε) = σ_ε², and the bandwidth proposed in Opsomer (1997), where the estimate of c(ε) is constructed by estimating the spectral density at 0 non-parametrically.


The simulation experiment was set up as follows. In a first step, we simulated 300 samples of size n = 100 following a regression model like Eq. (1), where we consider a design of equally spaced points on the unit interval, x_t = t/n, t = 1, ..., n, with regression function m(x) = (2x − 1) + 2 exp[−16(2x − 1)²] (shown in Fig. 1), and errors following a dependence structure of AR(1) type, ε_t = ρε_{t−1} + e_t, where the ε_t have a N(0, σ_ε²) distribution. For this simulation study, we set σ_ε² = 0.25. To study the influence of the dependence of the observations, we consider different values for the correlation coefficient. In order to find the optimal bandwidth, we calculated

    MISE(h) = E \int [m̂_h(t) − m(t)]² dt      (33)

for every bandwidth on a grid of equally spaced values of h, with the integral approximated by the mean of Riemann sums. By minimizing function (33) numerically in h, a numerical approximation to the value h_MISE is found.

The second step consists in drawing another 300 random samples of sample size n = 100 and computing the plug-in bandwidth estimators for every sample. To simplify notation, we write ĥ_dep for the bandwidth estimator (10), ĥ_ind for the estimator that ignores the correlation and ĥ_ops for the bandwidth proposed in Opsomer (1997). We also evaluate the MISE achieved by these estimators relative to the best possible value MISE(h_MISE). This was done by computing the relative efficiency measure

    ΔMISE = \frac{E[MISE(ĥ)] − MISE(h_MISE)}{MISE(h_MISE)},

with the expectation taken over realizations of the errors and approximated by Monte Carlo replication.

In the computation of the bandwidths ĥ_ind, ĥ_dep and ĥ_ops, we need to estimate the parameters c(ε) and θ_{2,2}. In all three cases, the pilot bandwidth λ is computed by the time series cross-validation method proposed in Hart (1994), and a pilot fit is computed. For ĥ_ind, c(ε) = σ_ε² is estimated by Eq. (21), while for ĥ_dep, c(ε) is estimated by plugging Eqs. (21) and (22) into Eq. (11). For ĥ_ops, c(ε) is estimated non-parametrically on the Fourier transformed residuals (see Opsomer, 1997).
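A condensed sketch of the first simulation step (our own illustration of the setup just described; the helper names from the earlier sketches and the bandwidth grid are assumptions):

    rng = np.random.default_rng(0)

    def ar1_errors(n, rho, sigma2_eps, rng):
        """AR(1) errors with marginal variance sigma2_eps; the innovation
        variance is sigma2_eps * (1 - rho^2)."""
        e = rng.normal(0.0, np.sqrt(sigma2_eps * (1.0 - rho**2)), n)
        eps = np.empty(n)
        eps[0] = rng.normal(0.0, np.sqrt(sigma2_eps))
        for t in range(1, n):
            eps[t] = rho * eps[t - 1] + e[t]
        return eps

    def m_true(x):
        return (2 * x - 1) + 2 * np.exp(-16 * (2 * x - 1) ** 2)

    def mise_curve(h_grid, n=100, rho=0.6, sigma2=0.25, reps=300):
        """Monte Carlo approximation of MISE(h), Eq. (33); the integral is
        approximated by the mean of Riemann sums over the design points."""
        x = np.arange(1, n + 1) / n
        mise = np.zeros(len(h_grid))
        for _ in range(reps):
            y = m_true(x) + ar1_errors(n, rho, sigma2, rng)
            for k, h in enumerate(h_grid):
                mise[k] += np.mean((m_hat(x, y, x, h, p=1) - m_true(x)) ** 2)
        return mise / reps

    # h_MISE is then the minimizer over the grid, e.g.:
    # h_grid = np.linspace(0.04, 0.30, 27)
    # h_mise = h_grid[np.argmin(mise_curve(h_grid))]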

FIGURE 1 m(x) = (2x − 1) + 2 exp[−16(2x − 1)²].


TABLE I Optimal bandwidth according to MISE and the means of the plug-in bandwidths ĥ_dep, ĥ_ind and ĥ_ops.

Bandwidth × 10⁻¹   ρ = −0.9   −0.6    −0.3    0       0.3     0.6     0.9
h_MISE             0.773      0.955   1.045   1.182   1.318   1.500   1.636
ĥ_dep              0.690      0.909   1.045   1.165   1.298   1.526   1.512
ĥ_ind              1.241      1.234   1.225   1.213   1.200   1.198   1.025
ĥ_ops              1.102      1.163   1.316   1.447   1.594   1.803   1.657

For all three bandwidths, an empirically chosen pilot bandwidth, g, was used to carry out a local polynomial fit of third order, from which θ_{2,2} is estimated. Hence, only the estimation of c(ε) varies among the three methods simulated here.

Table I shows the approximated optimal bandwidths with respect to the MISE criterion and the Monte Carlo approximation of the mean of the plug-in selectors. As these results show, ĥ_dep tracks its target h_MISE closely over the whole range of correlation values. Conversely, ĥ_ind appears to slightly decrease as h_MISE increases, a clearly undesirable effect. The bandwidth ĥ_ops increases as the correlation increases, but tends to overestimate the optimal bandwidth for all correlation levels.

Table II includes the efficiency measure ΔMISE as a function of ρ. This table shows that it is important to take the dependence of the observations into account when a bandwidth is chosen and to use this information in the computation of the bandwidth. The bandwidth ĥ_dep is better than ĥ_ind in all cases when the observations are dependent, and its behavior is almost identical in the independent case. For all correlation levels, ĥ_ops performs worse than ĥ_dep, but it is better than ĥ_ind when the dependence is strong (positive or negative). Hence, it appears that parametric estimation of the correlation function provides a more reliable approach for estimating the optimal bandwidth, at least in the context considered here.

Two figures are included for a better interpretation of the results. In Figure 2, we present a multiple box-plot of the bandwidth studied in this paper, ĥ_dep, as a function of ρ. We can see in this figure that the mean and variability of ĥ_dep increase as ρ grows. This same variability behavior is seen for the other two bandwidths considered here. The reason is clear: under positive dependence, the estimator of c(ε), for example, is more variable than under negative dependence, because we have to use a pilot non-parametric estimator of the regression function to obtain the residuals and, if the dependence is positive, this estimator has difficulty distinguishing which part of the observed trend is due to the true mean function and which part to the correlated errors. This is much easier when the dependence of the errors is negative. In Figure 3 we show the estimated density functions of the plug-in bandwidths obtained from the 300 replications of the previous model with ρ = 0.6. This plot shows that ĥ_ind is severely biased. While ĥ_dep is indeed slightly more variable than ĥ_ind (indicated by the 'fatter' density), only a small fraction of its distribution will result in fits that are significantly undersmoothed.

TABLE II ΔMISE as a function of ρ.

ΔMISE    ρ = −0.9   −0.6     −0.3     0        0.3      0.6      0.9
ĥ_dep    0.0476     0.0020   0.0015   0.0030   0.0046   0.0063   0.0065
ĥ_ind    0.7365     0.1834   0.0410   0.0035   0.0163   0.0560   0.0671
ĥ_ops    0.3710     0.1122   0.1007   0.0779   0.0592   0.0516   0.0115


FIGURE 2 Multiple box-plot of ĥ_dep as a function of ρ.

On the other hand, the bandwidth ĥ_ops is more variable than the other two bandwidths and, as noted previously, tends to overestimate the optimal bandwidth.

One possible concern about this method is what happens when the parametric model is misspecified. Clearly, if the AR(1) model does not fit the observed correlation structure, this might negatively affect the behavior of the plug-in bandwidth. To study this situation, several other ARMA-type dependence structures for the errors are considered while the AR(1) assumption is maintained in the bandwidth estimation. Specifically, we generate errors using the following models:

• An MA(2) model with normal distribution: ε_i = e_i + 1.4e_{i−1} + 0.9e_{i−2}.
• An AR(2) model with normal distribution: ε_i = 0.4ε_{i−1} + 0.5ε_{i−2} + e_i.
• An ARMA(1,1) model with normal distribution: ε_i − 0.8ε_{i−1} = e_i + 0.3e_{i−1}.

The rest of the parameters used in this part of the simulation study (regression function, sample size, number of replications, etc.) are the same as those used in the AR(1) case.

FIGURE 3 Non-parametric density estimators of ĥ_dep, ĥ_ind and ĥ_ops, and h_MISE = 0.15.


TABLE III Optimal bandwidth according to MISE and the means of the plug-in bandwidths ĥ_dep, ĥ_ind and ĥ_ops under different true correlation function specifications.

Bandwidth × 10⁻¹   MA(2)   AR(2)   ARMA(1,1)
h_MISE             1.454   1.500   1.636
ĥ_dep              1.565   1.110   1.924
ĥ_ind              1.170   0.965   1.196
ĥ_ops              1.687   1.436   1.954

In these three cases, the values chosen for the parameters provide high autocorrelation coefficients. Table III includes the approximated optimal bandwidths with respect to the MISE criterion and the Monte Carlo approximation of the mean of the plug-in selectors. As shown in the table, ĥ_dep continues to correct at least partly for the correlation in those three cases, as does ĥ_ops. Except in the AR(2) case, the mean of ĥ_dep is closer to the true optimal bandwidth than that of either ĥ_ind or ĥ_ops. Table IV shows the efficiency measure ΔMISE for the three correlation models. These results indicate that ĥ_dep is successful in reducing the MISE relative to the bandwidth that assumes independence, and does so better than ĥ_ops in two of the three considered cases. Overall, this indicates a reasonable robustness against correlation model misspecification.

TABLE IV ΔMISE under different true correlation specifications.

ΔMISE    MA(2)    AR(2)    ARMA(1,1)
ĥ_dep    0.0137   0.0233   0.0311
ĥ_ind    0.0492   0.0412   0.0577
ĥ_ops    0.0432   0.0069   0.0425

4.2 Example

We now illustrate the behavior of the local linear estimator and the bandwidths used in the previous simulation study with a time series of the weekly price (per share, in euros) of the Spanish bank Santander in the Spanish stock market. The studied series covers 103 weeks (years 2001 and 2002), and each observation is the price taken on a Thursday. When a Thursday fell on a holiday, the observation was taken on Wednesday (in studies where the price on Fridays was considered, wrong conclusions would result due to the so-called weekend effect). A fixed regression model can be fitted to these data, considering an equally spaced design on [0, 1], that is,

    Y_t = m( \frac{t}{103} ) + ε_t,   t = 1, 2, ..., 103.

The aim is to estimate the trend, m(·), using the local linear estimator with the global plug-in smoothing parameters proposed before (ĥ_dep, ĥ_ind and ĥ_ops). Assuming that the errors are correlated, the bandwidth ĥ_dep is computed. The study of the estimated autocorrelation function (ACF) and the estimated partial autocorrelation function (PACF) of the residuals of

this fit shows that the assumption that the errors follow an AR(1) model appears reasonable. More specifically, they follow the model ε_t = 0.61058ε_{t−1} + e_t. Figure 4 shows the data and three estimates of the trend: the first obtained using the bandwidth ĥ_dep = 0.1538 (smooth solid line), the second with bandwidth ĥ_ind = 0.0587 (wiggly solid line) and the third with bandwidth ĥ_ops = 0.1861 (dashed line). These results illustrate the behavior previously observed in the simulation experiments.

FIGURE 4 Sample data and m̂_h(x) with ĥ_dep = 0.1538 (smooth solid line), ĥ_ind = 0.0587 (wiggly solid line) and ĥ_ops = 0.1861 (dashed line).
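Putting the pieces together, a sketch of the workflow used in this example (our own reconstruction from the earlier helpers; in practice the pilot bandwidths lam and g would come from time series cross-validation and an empirical choice, as in Section 4.1):

    def h_dep_for_series(y, lam, g):
        """AR(1) plug-in bandwidth h_dep for a series on x_t = t/n."""
        n = len(y)
        x = np.arange(1, n + 1) / n
        s2, rho, _ = error_moment_estimates(x, y, lam)   # Eqs. (21)-(22)
        c_hat = c_eps_ar1(s2, rho)                       # Eq. (11)
        th22 = theta22_hat(x, y, g)                      # Eq. (13)
        return h_plugin_local_linear(n, c_hat, th22)     # Eq. (10)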

5 CONCLUSION

In this article, we have proposed a plug-in bandwidth estimator for local polynomial regression that extends that of Ruppert et al. (1995) by allowing for correlation in the errors. Unlike in Opsomer (1997), this is done by assuming a specific parametric shape for the error correlation function. When that parametric shape is properly specified, the method is successful in correcting for the correlation, in the sense that the bandwidth estimator achieves the same asymptotic distribution as in the independent error case. Simulation experiments also demonstrate that in this case, the proposed method outperforms the estimator in Opsomer (1997), which used a fully non-parametric correlation function estimator. When the assumed correlation function is wrongly specified, the simulations indicate that our method is relatively robust against model mis-specification, in the sense that it continues to correct for the correlation at least partly for the range of correlation functions considered.

Acknowledgements This work was partially supported by the grants PB98-0182-C02-01, PGIDT00PXI10501PN MCyT Grant BFM2002-00265 (European FEDER support included), PGIDT01PX10505PK, and PGIDIT03PXIC10505PN.


References

[1] Altman, N. S. (1993). Estimating error correlation in nonparametric regression. Statistics & Probability Letters, 18, 213–218.
[2] Bartlett, M. S. (1946). On the theoretical specification and sampling properties of auto-correlated time series. Journal of the Royal Statistical Society, 8(Suppl.), 27–41.
[3] Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society, Series B, 57(2), 371–394.
[4] Fan, J., Gijbels, I., Hu, T. C. and Huang, L. S. (1996). A study of variable bandwidth selection for local polynomial regression. Statistica Sinica, 6, 113–127.
[5] Francisco-Fernández, M. and Vilar-Fernández, J. M. (2001). Local polynomial regression estimation with correlated errors. Communications in Statistics: Theory and Methods, 30(7), 1271–1293.
[6] Gasser, T., Müller, H. G. and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B, 47, 238–252.
[7] Hall, P. and Van Keilegom, I. (2003). Using difference-based methods for inference in nonparametric regression with time-series errors. Journal of the Royal Statistical Society, Series B, 65, 443–456.
[8] Hall, P., Lahiri, S. N. and Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. The Annals of Statistics, 23(6), 1921–1936.
[9] Hart, J. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. Journal of the Royal Statistical Society, Series B, 56(3), 529–542.
[10] Hart, J. (1996). Some automated methods of smoothing time-dependent data. Journal of Nonparametric Statistics, 6, 115–142.
[11] Müller, H. G. and Stadtmüller, U. (1988). Detecting dependencies in smooth regression models. Biometrika, 75(4), 639–650.
[12] Opsomer, J. D. (1997). Nonparametric regression in the presence of correlated errors. In: Gregoire, T. G., Brillinger, D. R., Diggle, P. J., Russek-Cohen, E., Warren, W. G. and Wolfinger, R. D. (Eds.), Modelling Longitudinal and Spatially Correlated Data: Methods, Applications and Future Directions. Springer-Verlag, pp. 339–348.
[13] Opsomer, J. D., Wang, Y. and Yang, Y. (2001). Nonparametric regression with correlated errors. Statistical Science, 16, 134–153.
[14] Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. Journal of the American Statistical Association, 92, 1049–1062.
[15] Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90, 1257–1270.
[16] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.

A1 APPENDIX: PROOFS

In this section, we sketch proofs of the results presented in Section 3. Before proving Proposition 1, it is useful to establish results about the covariances of the response variable. Lemma 1 collects some equations that are obtained immediately from the fact that the errors satisfy Assumption A.3.

LEMMA 1 Under Assumption A.3, we have:

    var(Y_i²) = 2σ_ε⁴ + 4σ_ε² m²(x_i),      (34)

    var(Y_i, Y_j) = σ_ε² [m²(x_i) + m²(x_j) + σ_ε²] + cov²(ε_i, ε_j),      (35)

    cov(Y_i², Y_j²) = 2cov²(ε_i, ε_j) + 4cov(ε_i, ε_j) m(x_i) m(x_j),      (36)

    cov(Y_i², Y_i Y_j) = 2σ_ε² cov(ε_i, ε_j) + 2cov(ε_i, ε_j) m²(x_i) + 2σ_ε² m(x_i) m(x_j),      (37)

    cov(Y_i², Y_j Y_k) = 2cov(ε_i, ε_j)cov(ε_i, ε_k) + 2cov(ε_i, ε_j) m(x_i) m(x_k) + 2cov(ε_i, ε_k) m(x_i) m(x_j),      (38)

    cov(Y_i Y_j, Y_i Y_k) = cov(ε_j, ε_k) m²(x_i) + σ_ε² m(x_j) m(x_k) + cov(ε_i, ε_j)cov(ε_i, ε_k) + σ_ε² cov(ε_j, ε_k) + cov(ε_i, ε_j) m(x_i) m(x_k) + cov(ε_i, ε_k) m(x_i) m(x_j),      (39)

    cov(Y_i Y_j, Y_k Y_l) = cov(ε_i, ε_k)cov(ε_j, ε_l) + cov(ε_i, ε_l)cov(ε_j, ε_k) + cov(ε_i, ε_k) m(x_j) m(x_l) + cov(ε_i, ε_l) m(x_j) m(x_k) + cov(ε_j, ε_k) m(x_i) m(x_l) + cov(ε_j, ε_l) m(x_i) m(x_k).      (40)

Proof of Lemma 1 Taking Assumption A.3 into account, we have that, for every i, j, k, l,

    E(ε_i ε_j ε_k) = 0      (41)

and

    cov(ε_i ε_j, ε_k ε_l) = cov(ε_i, ε_k)cov(ε_j, ε_l) + cov(ε_i, ε_l)cov(ε_j, ε_k).      (42)

Using Eqs. (41) and (42), Eqs. (34)–(40) are obtained immediately.
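Identity (42) is the fourth-moment factorization that holds for jointly Gaussian variables (a consequence of Isserlis' theorem); a quick Monte Carlo check for Gaussian AR(1) errors (our own illustration; the chosen indices and parameters are arbitrary):

    # Monte Carlo check of Eq. (42) for Gaussian AR(1) errors.
    rho, reps = 0.6, 200_000
    chk_rng = np.random.default_rng(1)
    cov_ar1 = lambda a, b: rho ** abs(a - b)          # sigma_eps^2 = 1
    Sigma = np.array([[cov_ar1(a, b) for b in range(5)] for a in range(5)])
    eps = chk_rng.multivariate_normal(np.zeros(5), Sigma, size=reps)
    i, j, k, l = 0, 1, 2, 4
    lhs = np.cov(eps[:, i] * eps[:, j], eps[:, k] * eps[:, l])[0, 1]
    rhs = cov_ar1(i, k) * cov_ar1(j, l) + cov_ar1(i, l) * cov_ar1(j, k)
    print(lhs, rhs)   # agree up to Monte Carlo error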

Proof of Proposition 1 For clarity, the notation of m̂''_g(x) given in Eq. (14) is changed to:

    m̂''_g(x) = 2! e_3^t (X_x^t W_x X_x)^{-1} X_x^t W_x Y_{(n)},

where, although it exists, the dependence on the sample size n and the fitted polynomial degree p_θ is not indicated; the point x where the estimation is carried out is, however, indicated in the notation W_x and X_x. Recall that the letter g appearing in m̂''_g(x) denotes the pilot bandwidth used.

Let M_{(n)} = (m(x_1), m(x_2), ..., m(x_n))^t. Taking into account that

    E(Y_{(n)} Y_{(n)}^t) = M_{(n)} M_{(n)}^t + Σ_ε,

where Σ_ε is the variance–covariance matrix of the errors, we have that:

    E[θ̂_{2,2}(g)] = n^{-1} \sum_{i=1}^n [2! e_3^t (X_{x_i}^t W_{x_i} X_{x_i})^{-1} X_{x_i}^t W_{x_i} M_{(n)}]²
                     + n^{-1} \sum_{i=1}^n 4 e_3^t (X_{x_i}^t W_{x_i} X_{x_i})^{-1} X_{x_i}^t W_{x_i} Σ_ε W_{x_i} X_{x_i} (X_{x_i}^t W_{x_i} X_{x_i})^{-1} e_3.

The expressions obtained in Eqs. (7) and (8) for the bias and the variance of the jth derivative, when j = 2, let us conclude that:

    E[θ̂_{2,2}(g)] ≈ n^{-1} \sum_{i=1}^n [ m''(x_i) + \frac{μ_{p_θ+1}(K_{(2,p_θ)})}{(p_θ + 1)!} g^{p_θ−1} m^{(p_θ+1)}(x_i) ]² + n^{-1} \sum_{i=1}^n n^{-1} g^{-5} c(ε) f(x_i)^{-1} \int K²_{(2,p_θ)}(u) du.

Now, Riemann approximations lead to

    E[θ̂_{2,2}(g)] ≈ θ_{2,2} + 2 \frac{μ_{p_θ+1}(K_{(2,p_θ)})}{(p_θ + 1)!} g^{p_θ−1} θ_{2,p_θ+1} + n^{-1} g^{-5} c(ε) \int K²_{(2,p_θ)}(u) du,

so that Eq. (17) is proven. As far as the variance of θ̂_{2,2}(g) is concerned, note that

    θ̂_{2,2}(g) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n L*_{22}(x_i, x_j) Y_i Y_j,      (43)


where

    L*_{22}(x_i, x_j) = \sum_{k=1}^n [2! e_3^t (X_{x_k}^t W_{x_k} X_{x_k})^{-1} X_{x_k}^t W_{x_k} e_i] × [2! e_3^t (X_{x_k}^t W_{x_k} X_{x_k})^{-1} X_{x_k}^t W_{x_k} e_j].

Using a typical approximation, we have:

    r! e_{r+1}^t (n^{-1} X_{x_k}^t W_{x_k} X_{x_k})^{-1} ≈ g^{-r} f(x_k)^{-1} e_{r+1}^t S^{-1} H_{(n)}^{-1},

where S was defined just after Eq. (15) and H_{(n)} = diag(1, g, g², ..., g^p). It is easy to obtain that

    r! e_{r+1}^t (n^{-1} X_{x_k}^t W_{x_k} X_{x_k})^{-1} X_{x_k}^t W_{x_k} e_i ≈ n^{-1} g^{-r-1} f(x_k)^{-1} K_{(r,p)}( \frac{x_i − x_k}{g} ),

and hence,

≈n

−2 −6

g

n 

f (x k )

−2

K (2, pθ )

k=1

xi − xk g



K (2, pθ )

x j − xk g

.

Substituting this approximation in Eq. (43), the variance of θ̂_{2,2}(g) can be approximated in the following way:

    var[θ̂_{2,2}(g)] ≈ \frac{1}{n⁶ g¹²} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{f(x_i)² f(x_j)²}
        × [ \sum_{k=1}^n K²_{(2,p_θ)}(k, i) K²_{(2,p_θ)}(k, j) var(Y_k²)
        + 2 \sum_{k=1}^n \sum_{l≠k} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) K_{(2,p_θ)}(k, j) K_{(2,p_θ)}(l, j) var(Y_k, Y_l)
        + \sum_{k=1}^n \sum_{l≠k} K²_{(2,p_θ)}(k, i) K²_{(2,p_θ)}(l, j) cov(Y_k², Y_l²)
        + 4 \sum_{k=1}^n \sum_{u≠k} K²_{(2,p_θ)}(k, i) K_{(2,p_θ)}(u, j) K_{(2,p_θ)}(k, j) cov(Y_k², Y_k Y_u)
        + 2 \sum_{k=1}^n \sum_{u≠k} \sum_{v≠k} K²_{(2,p_θ)}(k, i) K_{(2,p_θ)}(u, j) K_{(2,p_θ)}(v, j) cov(Y_k², Y_u Y_v)
        + 4 \sum_{k=1}^n \sum_{l≠k} \sum_{u≠k} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) K_{(2,p_θ)}(u, j) K_{(2,p_θ)}(k, j) cov(Y_k Y_l, Y_k Y_u)
        + \sum_{k=1}^n \sum_{l≠k} \sum_{u≠k} \sum_{v≠k} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) K_{(2,p_θ)}(u, j) K_{(2,p_θ)}(v, j) cov(Y_k Y_l, Y_u Y_v) ]
    = A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7,      (44)


where we have denoted

    K_{(2,p_θ)}(k, i) = K_{(2,p_θ)}( \frac{x_k − x_i}{g} ).

The next step is to study the rates of convergence of each of the seven terms of Eq. (44). Using Lemma 1, Riemann approximations of sums by integrals, convolutions between kernels, changes of variables, and considering that the kernel has bounded support and the design is fixed, terms of order O(1/n) can be found by approximating terms in A_6 and A_7. Significant terms of order O(1/(n²g⁹)) appear in A_2, A_6 and A_7. All other terms are of order o(1/(n²g⁹)). As an example, we show how to handle the first two terms of Eq. (44); the rest of the terms are handled in the same way. Let C be a generic constant.

    A_1 = \frac{1}{n⁶ g¹²} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{f(x_i)² f(x_j)²} \sum_{k=1}^n K²_{(2,p_θ)}(k, i) K²_{(2,p_θ)}(k, j) var(Y_k²).

Taking into account that var(Y_k²) = 2σ_ε⁴ + 4σ_ε² m²(x_k), it follows that

    A_1 ≈ \frac{1}{n⁴ g¹²} \sum_{k=1}^n \int \int f(x_1)^{-1} f(x_2)^{-1} K²_{(2,p_θ)}( \frac{x_k − x_1}{g} ) K²_{(2,p_θ)}( \frac{x_k − x_2}{g} ) [2σ_ε⁴ + 4σ_ε² m²(x_k)] dx_1 dx_2
        = \frac{1}{n⁴ g¹²} \sum_{k=1}^n [2σ_ε⁴ + 4σ_ε² m²(x_k)] ( \int f(x_1)^{-1} K²_{(2,p_θ)}( \frac{x_k − x_1}{g} ) dx_1 )²
        = \frac{1}{n⁴ g¹⁰} \sum_{k=1}^n [2σ_ε⁴ + 4σ_ε² m²(x_k)] ( \int f(x_k − gt)^{-1} K²_{(2,p_θ)}(t) dt )²
        ≈ \frac{2σ_ε²}{n⁴ g¹⁰} \sum_{k=1}^n [σ_ε² + 2m²(x_k)] f(x_k)^{-2} ( \int K²_{(2,p_θ)}(t) dt )²
        = O( \frac{1}{n³ g¹⁰} ) = o( \frac{1}{n² g⁹} ).

For the term A_2, taking into account that

    var(Y_k, Y_l) = σ_ε² [m²(x_l) + m²(x_k) + σ_ε²] + cov²(ε_k, ε_l) = τ(x_l, x_k) + cov²(ε_k, ε_l),      (45)

where τ(x_l, x_k) denotes the first summand,


the term A_2 can be split into two terms:

    A_2 = \frac{2}{n⁶ g¹²} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{f(x_i)² f(x_j)²} \sum_{k=1}^n \sum_{l≠k} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) K_{(2,p_θ)}(k, j) K_{(2,p_θ)}(l, j) [τ(x_l, x_k) + cov²(ε_k, ε_l)] = A_{21} + A_{22}.

For A_{21} we have that:

    A_{21} = \frac{2}{n⁶ g¹²} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{f(x_i)² f(x_j)²} \sum_{k=1}^n \sum_{l≠k} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) K_{(2,p_θ)}(k, j) K_{(2,p_θ)}(l, j) τ(x_l, x_k)
        ≈ \frac{2σ_ε²}{n² g¹²} \int\int\int\int \frac{1}{f(x_1) f(x_2)} K_{(2,p_θ)}( \frac{x − x_1}{g} ) K_{(2,p_θ)}( \frac{y − x_1}{g} ) K_{(2,p_θ)}( \frac{x − x_2}{g} ) K_{(2,p_θ)}( \frac{y − x_2}{g} ) τ(x, y) f(x) f(y) dx_1 dx_2 dx dy
        = \frac{2σ_ε²}{n² g¹⁰} \int\int\int\int f(x − gt)^{-1} f(x − gu)^{-1} K_{(2,p_θ)}(t) K_{(2,p_θ)}( t − \frac{x − y}{g} ) K_{(2,p_θ)}(u) K_{(2,p_θ)}( u − \frac{x − y}{g} ) τ(x, y) f(x) f(y) dt du dx dy
        ≈ \frac{2σ_ε²}{n² g¹⁰} \int\int f(x)^{-1} f(y) [ K_{(2,p_θ)} ∗ K_{(2,p_θ)}( \frac{x − y}{g} ) ]² τ(x, y) dx dy
        = \frac{2σ_ε²}{n² g⁹} \int\int f(x)^{-1} f(x − gp) [K_{(2,p_θ)} ∗ K_{(2,p_θ)}(p)]² τ(x, x − gp) dx dp
        ≈ \frac{2σ_ε⁴}{n² g⁹} \int [K_{(2,p_θ)} ∗ K_{(2,p_θ)}(p)]² dp \int [σ_ε² + 2m²(x)] dx = O( \frac{1}{n² g⁹} ).      (46)

On the other hand,

    A_{22} = \frac{2σ_ε⁴}{n⁶ g¹²} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{f(x_i)² f(x_j)²} \sum_{k=1}^n \sum_{l≠k} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) K_{(2,p_θ)}(k, j) K_{(2,p_θ)}(l, j) c²(|k − l|)
        = \frac{2σ_ε⁴}{n⁶ g¹²} \sum_{k=1}^n \sum_{l≠k} c²(|k − l|) ( \sum_{i=1}^n \frac{1}{f(x_i)²} K_{(2,p_θ)}(k, i) K_{(2,p_θ)}(l, i) )²
        ≈ \frac{2σ_ε⁴}{n⁴ g¹²} \sum_{k=1}^n \sum_{l≠k} c²(|k − l|) ( \int \frac{1}{f(x)} K_{(2,p_θ)}( \frac{x_k − x}{g} ) K_{(2,p_θ)}( \frac{x_l − x}{g} ) dx )²
        = \frac{2σ_ε⁴}{n⁴ g¹⁰} \sum_{k=1}^n \sum_{l≠k} c²(|k − l|) ( \int \frac{1}{f(x_k − tg)} K_{(2,p_θ)}(t) K_{(2,p_θ)}( t − \frac{x_k − x_l}{g} ) dt )²
        ≈ \frac{2σ_ε⁴}{n⁴ g¹⁰} \sum_{k=1}^n \sum_{l≠k} c²(|k − l|) \frac{1}{f(x_k)²} [ K_{(2,p_θ)} ∗ K_{(2,p_θ)}( \frac{x_k − x_l}{g} ) ]²
        ≤ \frac{2σ_ε⁴ C}{n³ g¹⁰} \sum_{t=1}^{n−1} \frac{n − t}{n} c²(|t|)
        ≤ \frac{2σ_ε⁴ C}{n³ g¹⁰} \sum_{t=1}^∞ c²(|t|) = O( \frac{1}{n³ g¹⁰} ) = o( \frac{1}{n² g⁹} ).      (47)

Proof of Proposition 2 For simplicity, we will only prove this Proposition for s = 0. In this case,

    γ̂_{λ,n}(0) = σ̂_ε² = \frac{1}{n} \sum_{i=1}^n ε̂_i²,  where  ε̂_i = Y_i − S^t_{λ,x_i} Y_{(n)},  i = 1, 2, ..., n,      (48)

S^t_{λ,x_i} being the smoother at x_i, using a pilot bandwidth λ. That is, S^t_{λ,x_i} = (w_{i,1}, w_{i,2}, ..., w_{i,n}), where

    w_{i,j} = e_1^t (X_{x_i}^t W_{x_i} X_{x_i})^{-1} X_{x_i}^t W_{x_i} e_j ≈ \frac{1}{nλ} \frac{1}{f(x_i)} K_{p_σ}( \frac{x_j − x_i}{λ} ).      (49)

Then, the residual vector ε̂_{(n)} can be written as:

    ε̂_{(n)} = (ε̂_1, ε̂_2, ..., ε̂_n)^t = Y_{(n)} − (S^t_{λ,x_1} Y_{(n)}, S^t_{λ,x_2} Y_{(n)}, ..., S^t_{λ,x_n} Y_{(n)})^t = (I_n − S_λ) Y_{(n)},

where I_n is the identity matrix of order n and S_λ is the (n × n) matrix whose ith row is S^t_{λ,x_i}.


Then, we have that

    ε̂_{(n)} ε̂_{(n)}^t = (I_n − S_λ) Y_{(n)} Y_{(n)}^t (I_n − S_λ^t) = ( ε̂_i ε̂_j )_{i,j=1}^n.      (50)

Taking into account the definition of σ̂_ε² given in Eq. (21), it is clear that

    σ̂_ε² = \frac{1}{n} tr[(I_n − S_λ) Y_{(n)} Y_{(n)}^t (I_n − S_λ^t)],      (51)

where tr A denotes the trace of the matrix A. Using that E(Y_{(n)} Y_{(n)}^t) = M_{(n)} M_{(n)}^t + Σ_ε, the expectation of σ̂_ε² is given by:

    E(σ̂_ε²) = \frac{1}{n} tr[(I_n − S_λ) E(Y_{(n)} Y_{(n)}^t) (I_n − S_λ^t)]
             = \frac{1}{n} tr[(I_n − S_λ) M_{(n)} M_{(n)}^t (I_n − S_λ^t)] + \frac{1}{n} tr[(I_n − S_λ) Σ_ε (I_n − S_λ^t)].      (52)
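A numerical sanity check of the trace identity (51) (our own sketch; smoother_matrix builds S_λ row by row from the exact weights in (49), and the residual sum of squares is compared with the trace form; all names are hypothetical):

    def smoother_matrix(x, lam, p_sigma=1, kernel=epanechnikov):
        """Hat matrix S_lambda whose i-th row is S^t_{lambda, x_i}."""
        n = len(x)
        S = np.empty((n, n))
        e1 = np.zeros(p_sigma + 1)
        e1[0] = 1.0
        for i in range(n):
            d = x - x[i]
            X = np.vander(d, N=p_sigma + 1, increasing=True)
            XtW = X.T * kernel(d / lam)
            S[i] = e1 @ np.linalg.solve(XtW @ X, XtW)   # w_{i,1..n}, Eq. (49)
        return S

    # n = 100; x = np.arange(1, n + 1) / n
    # y = m_true(x) + ar1_errors(n, 0.6, 0.25, rng)
    # r = y - smoother_matrix(x, 0.10) @ y              # (I_n - S_lambda) y
    # (1/n) tr[(I-S) y y^t (I-S^t)] equals (1/n) sum r_i^2, Eq. (51):
    # assert np.isclose(np.trace(np.outer(r, r)) / n, np.mean(r**2))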

Using Eq. (7) for j = 0, we obtain that (I_n − S_λ) M_{(n)} is the vector whose ith entry is

    λ^{p_σ+1} \frac{m^{(p_σ+1)}(x_i)}{(p_σ + 1)!} μ_{p_σ+1}(K_{p_σ}) (1 + o(1)),   i = 1, ..., n.

Then,

    (I_n − S_λ) M_{(n)} M_{(n)}^t (I_n − S_λ^t) = (a_{ij})      (53)

is an (n × n) matrix whose (i, j)th element is

    a_{ij} ≈ λ^{2p_σ+2} ( \frac{μ_{p_σ+1}(K_{p_σ})}{(p_σ + 1)!} )² m^{(p_σ+1)}(x_i) m^{(p_σ+1)}(x_j),

and hence,

    \frac{1}{n} tr[(I_n − S_λ) M_{(n)} M_{(n)}^t (I_n − S_λ^t)] ≈ λ^{2p_σ+2} ( \frac{μ_{p_σ+1}(K_{p_σ})}{(p_σ + 1)!} )² \frac{1}{n} \sum_{i=1}^n [m^{(p_σ+1)}(x_i)]² ≈ λ^{2p_σ+2} ( \frac{μ_{p_σ+1}(K_{p_σ})}{(p_σ + 1)!} )² \int [m^{(p_σ+1)}(x)]² f(x) dx.      (54)

With respect to the second summand of Eq. (52), we have that:

    \frac{1}{n} tr[(I_n − S_λ) Σ_ε (I_n − S_λ^t)] = \frac{1}{n} tr Σ_ε − \frac{2}{n} tr S_λ Σ_ε + \frac{1}{n} tr S_λ Σ_ε S_λ^t.      (55)

It is obvious that the first term of Eq. (55) is:

    \frac{1}{n} tr Σ_ε = σ_ε².      (56)

To obtain approximations for the second and the third summands, we first have to study the matrices S_λ Σ_ε S_λ^t and S_λ Σ_ε. The (i, j)th element of S_λ Σ_ε S_λ^t is S^t_{λ,x_i} Σ_ε S_{λ,x_j}. Using Eq. (8), it is easy to obtain that the ith diagonal element of this matrix is given by:

    S^t_{λ,x_i} Σ_ε S_{λ,x_i} = \frac{1}{nλ} \frac{c(ε) R(K_{p_σ})}{f(x_i)} + o( \frac{1}{nλ} ).

Therefore,

    \frac{1}{n} tr S_λ Σ_ε S_λ^t = \frac{1}{nλ} R(K_{p_σ}) c(ε) \frac{1}{n} \sum_{i=1}^n \frac{1}{f(x_i)} + o( \frac{1}{nλ} ) ≈ \frac{R(K_{p_σ}) c(ε)}{nλ} + o( \frac{1}{nλ} ).      (57)

As far as the matrix S_λ Σ_ε is concerned, similar arguments lead to

    \frac{2}{n} tr S_λ Σ_ε = \frac{2σ_ε²}{nλ} \frac{1}{n} \sum_{i=1}^n \frac{1}{f(x_i)} \sum_{j=1}^n K_{p_σ}( \frac{x_j − x_i}{λ} ) c(|j − i|),

and, applying a Taylor expansion of K_{p_σ}((x_j − x_i)/λ) around 0, we have

    \frac{2}{n} tr S_λ Σ_ε = \frac{2σ_ε²}{nλ} \frac{1}{n} \sum_{i=1}^n \frac{1}{f(x_i)} \sum_{j=1}^n K_{p_σ}(0) c(|j − i|) + o( \frac{1}{nλ} ) ≈ \frac{2K_{p_σ}(0) c(ε)}{nλ} + o( \frac{1}{nλ} ).      (58)


The same kind of proof allows us to prove Eq. (24) for each s = 1, 2, ..., k. To obtain the variance of σ̂_ε² we can use the same result as in Altman (1993). Taking Assumption A.3 into account, it is clear that assumptions (G) and (H) in the paper of Altman are satisfied. So, following the lines of the proof in that work, we have that

    (γ̂_{λ,n}(0), γ̂_{λ,n}(1), ..., γ̂_{λ,n}(k))^t −→ N( E[(γ̂_{λ,n}(0), γ̂_{λ,n}(1), ..., γ̂_{λ,n}(k))^t], \frac{V}{n} + O( \frac{1}{n²λ} ) ),

where V is the covariance matrix given by Bartlett's formula (Bartlett, 1946) for the process ε_t.

Proof of Theorem 1 From the proof of Proposition 2, the orders of the bias and the variance of σ̂_ε² and ρ̂, defined in Eqs. (21) and (22), are O[λ^{2p_σ+2} + (1/nλ)] and O[(1/n) + (1/n²λ)], respectively. So the optimal selection for the pilot bandwidth λ will be of order O(n^{−1/(2p_σ+3)}), and, for each p_σ ≥ 1, the bias will be of order o(n^{−1/2}), while the variance will be O(n^{−1}), the terms of type (n²λ)^{−1} being of order less than o(n^{−1}). The same is verified for ĉ(ε), the estimator of c(ε) obtained by substituting σ_ε² and ρ by the estimators given in Eqs. (21) and (22), respectively.

As far as θ̂_{2,2}(g) is concerned, its bias has an order of O[g^{p_θ−1} + 1/(ng⁵)] and its variance O[(1/n) + (1/n²g⁹)]. Therefore, the optimal pilot bandwidth g is g = O(n^{−1/(p_θ+4)}). For this selection, and for p_θ ≤ 5, the bias is O(n^{−(p_θ−1)/(p_θ+4)}) and the standard deviation is O(n^{−(2p_θ−1)/(2p_θ+8)}) = o(n^{−(p_θ−1)/(p_θ+4)}). Hence, choosing g = Gn^{−1/(p_θ+4)} and taking Proposition 1 into account, we have that:

    n^{(p_θ−1)/(p_θ+4)} (θ̂_{2,2}(g) − θ_{2,2}) −→ 2 \frac{μ_{p_θ+1}(K_{(2,p_θ)})}{(p_θ + 1)!} G^{p_θ−1} θ_{2,p_θ+1} + R(K_{(2,p_θ)}) c(ε) G^{-5}.      (59)

The next step in studying the asymptotic performance of the relative error, (ĥ^{opt}_{0,g,as} − h_MISE)/h_MISE, is to split it into two terms as follows:

    \frac{ĥ^{opt}_{0,g,as} − h_MISE}{h_MISE} = \frac{ĥ^{opt}_{0,g,as} − h^{opt}_{0,g,as}}{h_MISE} + \frac{h^{opt}_{0,g,as} − h_MISE}{h_MISE} = Υ_1 + Υ_2.      (60)

First, we study Υ_2. The technique used is to approximate the MISE of the estimator of the regression function, m̂(x), including more terms in the bias, and then to show that the bandwidth that minimizes this more accurate approximation differs from h^{opt}_{0,g,as} by a small quantity. Taking the kernel properties into account, the expression of the MISE of the local linear estimator with an additional term in the bias is given by:

    MISE(h) = [ ah⁴ + bh⁶ + \frac{c}{nh} ] [1 + o(1)],

a, b and c being constants that can be obtained by using calculations similar to those in Eqs. (7) and (8).

Differentiating the previous expression, we obtain:

    g(h) = \frac{∂MISE(h)}{∂h} ≈ dh³ + eh⁵ − f n^{-1} h^{-2}.

The bandwidth minimizing the asymptotic MISE with only one term in the bias and one in the variance is given by:

    h^{opt}_{0,g,as} = ( \frac{f}{nd} )^{1/5}.      (61)

If we construct a Taylor approximation of g(h) around h^{opt}_{0,g,as}, we obtain:

    g(h) ≈ g(h^{opt}_{0,g,as}) + (h − h^{opt}_{0,g,as}) g'(h^{opt}_{0,g,as})
         = e(h^{opt}_{0,g,as})⁵ + (h − h^{opt}_{0,g,as}) [3d(h^{opt}_{0,g,as})² + 5e(h^{opt}_{0,g,as})⁴ + 2f n^{-1} (h^{opt}_{0,g,as})^{-3}].

Now, choosing the bandwidth

    h* = h^{opt}_{0,g,as} − \frac{e}{5d} (h^{opt}_{0,g,as})³

and substituting it in the previous expression [using the value of h^{opt}_{0,g,as} given in Eq. (61)], it follows that:

    g(h*) = e(h^{opt}_{0,g,as})⁵ − \frac{e}{5d} (h^{opt}_{0,g,as})³ [3d(h^{opt}_{0,g,as})² + 5e(h^{opt}_{0,g,as})⁴ + 2f n^{-1} (h^{opt}_{0,g,as})^{-3}]
          = e(h^{opt}_{0,g,as})⁵ − \frac{3e}{5} (h^{opt}_{0,g,as})⁵ − \frac{e²}{d} (h^{opt}_{0,g,as})⁷ − \frac{2fe}{5dn}
          = O(n^{-7/5}).

This bandwidth h* is better than h^{opt}_{0,g,as}, since

    g(h^{opt}_{0,g,as}) = O(n^{-1}).

Note that h* = h^{opt}_{0,g,as} + O(n^{-3/5}). Obviously, we could repeat this argument by adding higher order terms to the bias approximation and improve the approximation to the minimizer of the MISE. These improvements would be of order less than O(n^{-3/5}), so we can conclude that:

    h_MISE = h^{opt}_{0,g,as} + O(n^{-3/5}).

Then,

    \frac{h^{opt}_{0,g,as} − h_MISE}{h_MISE} = \frac{O(n^{-3/5})}{h^{opt}_{0,g,as} + O(n^{-3/5})} = \frac{O(n^{-3/5})}{O(n^{-1/5})} = O(n^{-2/5}).      (62)


With respect to Υ_1, a Taylor approximation produces:

    ĥ^{opt}_{0,g,as} = h^{opt}_{0,g,as} + \frac{∂h^{opt}_{0,g,as}}{∂c(ε)} [ĉ(ε) − c(ε)] + \frac{∂h^{opt}_{0,g,as}}{∂θ_{2,2}} [θ̂_{2,2}(g) − θ_{2,2}] + ⋯.

Then,

    \frac{ĥ^{opt}_{0,g,as} − h^{opt}_{0,g,as}}{h_MISE} ≈ \frac{ĥ^{opt}_{0,g,as} − h^{opt}_{0,g,as}}{h^{opt}_{0,g,as}} ≈ \frac{1}{h^{opt}_{0,g,as}} \frac{∂h^{opt}_{0,g,as}}{∂c(ε)} [ĉ(ε) − c(ε)] + \frac{1}{h^{opt}_{0,g,as}} \frac{∂h^{opt}_{0,g,as}}{∂θ_{2,2}} [θ̂_{2,2}(g) − θ_{2,2}].

Taking into account the convergence orders of ĉ(ε) and θ̂_{2,2}(g) stated at the beginning of the proof, we have that

    \frac{ĥ^{opt}_{0,g,as} − h^{opt}_{0,g,as}}{h_MISE} ≈ − \frac{1}{5} θ_{2,2}^{-1} [θ̂_{2,2}(g) − θ_{2,2}].      (63)

Now, using the decomposition (60), the expressions obtained in Eqs. (62) and (63), and applying directly Eq. (59), we obtain Eq. (29).
