Local Linear Forecasting

Xiaochun Li, Centre de recherches mathematiques, Montreal, Quebec
Nancy E. Heckman, University of British Columbia, Vancouver, British Columbia

July 14, 1997

The authors wish to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada through grant A7969. Requests for this paper should be directed to the second author at Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z2, Canada, electronic mail address: [email protected], or via anonymous ftp: newton.stat.ubc.ca, in the file forecast.ps, in the directory pub/nancy.
Abstract: A local linear estimator is proposed for forecasting from independent data. The asymptotic bias and variance of the local linear forecast are presented and used to develop procedures for estimating the optimal bandwidth for forecasting. A simulation study shows that cross-validation procedures outperform plug-in procedures, both in choosing a bandwidth and in producing accurate forecasts. The forecasting technique is applied to two data sets.
Keywords: local linear estimation, forecasting, bandwidth selection.
1 Introduction

Suppose bivariate data $(t_i, Y_i)$, $i = 1, \ldots, n$, are observed at times $0 \le t_1 \le t_2 \le \cdots \le t_n \le T$, and any time beyond $T$ is the future. Often it is of interest to estimate the "expected" value of $Y$ at some $t > T$. We propose an estimator based on extrapolation by local linear regression. We make no assumptions about the parametric form of the dependence of $E(Y)$ on $t$. However, our estimate relies on the choice of a bandwidth, which, roughly speaking, tells us how much of the past data we should use in making our forecast. Here, we develop automatic, i.e. data-driven, procedures for choosing the bandwidth.

Figures 1 and 2 contain the annual incidence of malignant melanoma in males from 1936 until 1972. These data are described in Andrews and Herzberg (1985) and can be found in Statlib (web address http://www.stat.cmu.edu). The dots are the actual data points. The other symbols represent our forecasts using three different methods of bandwidth selection: complete plug-in (CAMSE), forecasting cross-validation (FCV), and a mixture of plug-in and cross-validation (PICV). These methods are described in Section 4. In Figure 1, we've forecast melanoma incidence in 1971 and 1972, using data up to and including $T = 1970$. In Figure 2, we've taken $T = 1969$ as the present and have forecast melanoma incidence in 1970, 1971, and 1972. From the figures, we see that the best forecasts are those based on bandwidths chosen by FCV. The superiority of this method is confirmed in our simulation studies in Section 5.

FIGURES 1 AND 2 HERE

Figure 3 shows the results of our FCV-based method applied to forecasting the incidence of acquired immunodeficiency syndrome (AIDS). The data (the dots), provided by the Laboratory Centre for Disease Control, run from the last quarter of 1979 until only the first quarter of 1990. We chose this period to avoid the complicated problem of correcting for reporting delays and under-reporting. As reported by Schecter, Marion, Elmslie and Ricketts (1992), most problems in reporting are resolved within six years, so the data we've used can be considered to be accurate. Predicting the number of future AIDS cases is important to health organizations and governments for estimating the future demand in health care and allocating funds accordingly. This forecasting problem has been studied extensively in the statistical literature, particularly via a technique called backcalculation (see, e.g., Brookmeyer and Gail, 1988, Schecter et al., 1992, or Bacchetti, Segal and Jewell, 1993). Unfortunately, this technique produces forecasts that can vary widely, depending on certain assumptions used. (See Li, 1996, for a discussion of the inadequacies of backcalculation.) The forecasting technique proposed here does not suffer from the same problems, as relatively few assumptions are made. A similar local technique was used by Healy and Tillet (1988). However, they chose the bandwidth subjectively, while we allow the data themselves to indicate the appropriate bandwidth. We have applied our forecasting technique to the Canadian data, treating $T =$ the first quarter of 1988 as the present and forecasting the quarterly number of new AIDS cases after that time. The forecasts are the O's in the figure. Bandwidths were chosen by FCV.

FIGURE 3 HERE

In all of the figures (1-3), the local linear regression weights were determined by a normal kernel truncated to $[-1, 0]$, and a different bandwidth was chosen for each forecast made. The bandwidth tells us, roughly, how much of the data prior to the present we've used in our forecast.

In Section 2, we define our local linear forecast. Our estimate is similar in spirit to exponential smoothing, which forecasts one time point into the future (see, e.g., Harvey, 1989). However, we consider weights that aren't necessarily exponential. Section 3 contains asymptotic results, specifically the asymptotic bias and variance of our estimate, based on the assumption that the data are independent. (The effect of correlation is discussed in the conclusions in Section 6.) Our asymptotic bias and variance are more complicated than those in local linear regression (see, e.g., Hardle, 1989, Ruppert and Wand, 1994, Wand and Jones, 1995, Fan and Gijbels, 1996), since we are forecasting as far ahead as possible, well outside the range of the data. In Section 4, we present four data-driven methods of bandwidth choice, all of which rely in some sense on the asymptotic results of Section 3. Section 5 contains simulation results which compare our four methods of bandwidth choice.
2 A local linear forecasting estimator

In local linear regression, to estimate a function $m$ at a point $t \in [0, 1]$, we assume that $m$ is approximately linear in a neighbourhood of $t$: $m(x) \approx \beta_0 + \beta_1 (x - t)$. We estimate $\beta_0 = \beta_{0,t} = m(t)$ and $\beta_1 = \beta_{1,t} = m'(t)$ by weighted regression. That is, the estimators $\hat\beta_{0,t}$ and $\hat\beta_{1,t}$ satisfy
$$
\hat\beta_t = \begin{pmatrix} \hat\beta_{0,t} \\ \hat\beta_{1,t} \end{pmatrix}
= \mathop{\rm argmin}_{\beta_0, \beta_1} \sum_{i=1}^n \left( Y_i - \beta_0 - \beta_1 (t_i - t) \right)^2 K\!\left( \frac{t_i - t}{h_n} \right), \tag{1}
$$
with $K(\cdot)$ a kernel function that gives more weight to the observations close to $t$ than to the rest of the observations, and $h_n$ the so-called bandwidth, which determines which observations are considered close to $t$. For example, if the kernel function is the indicator function of $[-1, 1]$, then $K((t_i - t)/h_n) = 1$ for those $t_i$'s with $|t_i - t| \le h_n$ and $0$ otherwise. In the non-forecasting setting (e.g. estimating $m(t)$ for $t \in [0, 1]$), properties of $\hat m_{h_n}(t)$ are well studied. Note that, in (1), the kernel is "centered" at $t$ and that this "centering point" is given in the subscript of $\hat\beta_t$.

For forecasting, for simplicity we suppose that $T = 1$, that is, that $0 \le t_1 \le t_2 \le \cdots \le t_n \le 1$, and consider predicting $m(1 + \delta_n)$, $\delta_n \ge 0$. Our $t$ values can always be rescaled to the interval $[0, 1]$. One could let $t = 1 + \delta_n$ in (1) and use $\hat m_{h_n}(1 + \delta_n) = \hat\beta_{0, 1+\delta_n}$ as the forecast. However, since we have no data with $t_i \in (1, 1 + \delta_n]$, we "center" our kernel at $t = 1$ and define $\hat m_{h_n, 1}(1 + \delta_n)$, the local linear estimate of $m(1 + \delta_n)$, as
$$
\hat m_{h_n, 1}(1 + \delta_n) = \hat\beta_{0,1} + \hat\beta_{1,1} \delta_n,
\quad \text{where } \hat\beta_1 = \begin{pmatrix} \hat\beta_{0,1} \\ \hat\beta_{1,1} \end{pmatrix} \text{ is as in (1) with } t = 1. \tag{2}
$$
This centering at $t = 1$ helps us to avoid cumbersome calculations of the asymptotic bias and variance of $\hat m_{h_n}(1 + \delta_n)$.

Since $\hat m_{h_n, 1}(1 + \delta_n)$ is calculated from the $Y_i$'s, it is a random variable and its statistical properties are affected by the bandwidth $h_n$. Asymptotic properties of $\hat m_{h_n, 1}(1 + \delta_n)$ are given in the following section.
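To make (1) and (2) concrete, the weighted least-squares fit and the extrapolation can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are ours, and the kernel is the truncated normal weight used in the figures.

```python
import numpy as np

def tnorm_kernel(u):
    """Standard normal density truncated to [-1, 0]: a one-sided kernel."""
    return np.where((u >= -1.0) & (u <= 0.0), np.exp(-0.5 * u**2), 0.0)

def local_linear_forecast(t, y, h, delta, t0=1.0, kernel=tnorm_kernel):
    """Forecast m(t0 + delta) by the local linear extrapolation (2).

    Solves the weighted least-squares problem (1) with the kernel centred
    at the 'present' t0, then extrapolates the fitted line delta ahead.
    """
    w = kernel((t - t0) / h)
    X = np.column_stack([np.ones_like(t), t - t0])
    # weighted normal equations: (X'WX) beta = X'Wy
    XtW = X.T * w
    beta0, beta1 = np.linalg.solve(XtW @ X, XtW @ y)
    return beta0 + beta1 * delta
```

Because the fit is linear in the data, a noiseless straight line is forecast without error; this is the sense in which $m_1(t) = t$ is "ideal" in the simulations of Section 5.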
3 Asymptotic Results

To better understand our estimate and how the bandwidth affects it, we study the estimate's asymptotic bias and variance. We first present a theorem concerning the asymptotic properties of $\hat\beta_t$, as defined in (1). Then, in the corollary below, the theorem is applied to calculating the asymptotic bias and variance of the estimate of $m(t + \delta_n)$ using data with $t_i \le t$, $t \in [0, 1]$. The proofs of the theorem and corollary can be found in Li (1996). These proofs are similar to those in the literature (see, e.g., Ruppert and Wand, 1994), except that the assumptions are slightly different. In particular, in assumption 2 of the theorem, the $t_{ni}$'s are fixed rather than random, and in assumption 3 we suppose that the kernel $K$ is "one-sided", that is, with support $[-1, 0]$.
Theorem 3.1 Suppose that

1. $Y_i = m(t_i) + \epsilon_i$, where the $\epsilon_i$'s are independent with mean $0$ and variance $\sigma^2$;

2. $t_i = t_{ni}$ satisfies $\int_0^{t_i} f(t)\, dt = i/n$ for some continuous density $f$ which is bounded away from zero;

3. $K$ and $K'$ are continuous on $K$'s compact support $[-1, 0]$, with
$$ u_0 > 0, \qquad u_0 u_2 - u_1^2 > 0, $$
where $u_i = \int_{-1}^0 u^i K(u)\, du$;

4. $m''$ is continuous over $[0, 1 + \delta]$, some $\delta > 0$;

5. $h_n \to 0$ and $n h_n \to \infty$.
Let
$$
B(K) = \begin{pmatrix} u_0 & u_1 \\ u_1 & u_2 \end{pmatrix}^{-1}
\begin{pmatrix} \bar u_0 & \bar u_1 \\ \bar u_1 & \bar u_2 \end{pmatrix}
\begin{pmatrix} u_0 & u_1 \\ u_1 & u_2 \end{pmatrix}^{-1},
$$
where $\bar u_i = \int_{-1}^0 u^i K(u)^2\, du$. Let
$$
\begin{pmatrix} b_0(K) \\ b_1(K) \end{pmatrix}
= \begin{pmatrix} u_0 & u_1 \\ u_1 & u_2 \end{pmatrix}^{-1}
\begin{pmatrix} u_2 \\ u_3 \end{pmatrix}.
$$
Suppose that $\hat\beta_{0,t}$ and $\hat\beta_{1,t}$ are as in (1). Then, as $n \to \infty$,
$$
E(\hat\beta_{0,t}) = m(t) + \frac{h_n^2}{2}\, m''(t)\, b_0(K) + o(h_n^2),
\qquad
E(\hat\beta_{1,t}) = m'(t) + \frac{h_n}{2}\, m''(t)\, b_1(K) + o(h_n),
$$
and
$$
{\rm Var}\begin{pmatrix} \hat\beta_{0,t} \\ h_n \hat\beta_{1,t} \end{pmatrix}
= \frac{\sigma^2}{n h_n f(t)} B(K) + o\!\left( \frac{1}{n h_n} \right) {\bf 1},
$$
uniformly in $t \in [a_n, 1]$ with $\liminf a_n / h_n \ge 1$, where ${\bf 1}$ is a two by two matrix of ones.

Note that $B(K)$ is positive definite, since $\bar u_0 > 0$ and $(\bar u_0 \bar u_2 - \bar u_1^2)/(\int K^2)^2$ is the variance of a random variable with density $K^2/\int K^2$, and thus is always positive.

The theorem can be used to determine how far one can forecast, that is, how large one can take $\delta_n$ and still accurately estimate $m(1 + \delta_n)$. Below, we argue that $\delta_n$ can be as large as $O(n^{-1/5})$, that is, that one can predict $n\delta_n = n^{4/5}$ time points ahead. The usual measure of accuracy of an estimate is its mean squared error, which can be written in terms of the estimate's bias and variance:
$$
MSE(\hat m_{h_n,1}(1+\delta_n)) = {\rm Bias}^2(\hat m_{h_n,1}(1+\delta_n)) + {\rm Var}(\hat m_{h_n,1}(1+\delta_n)).
$$
To accurately estimate $m(1 + \delta_n)$ we require that the $MSE$ converge to $0$. Assuming that $h_n \to 0$ and $n h_n \to \infty$, by Theorem 3.1,
$$
\begin{aligned}
{\rm Bias}(\hat m_{h_n,1}(1+\delta_n)) &= E(\hat\beta_{0,1}) + \delta_n E(\hat\beta_{1,1}) - m(1 + \delta_n) \\
&= m(1) + m'(1)\delta_n - m(1 + \delta_n) + \frac{m''(1)}{2}\left( h_n^2 b_0(K) + h_n \delta_n b_1(K) \right) + o(h_n^2 + h_n \delta_n).
\end{aligned}
$$
Obviously, unless $m$ is a line, the bias converges to zero if and only if $\delta_n \to 0$. Therefore, for accurate estimation, we require that $\delta_n \to 0$. In this case, by a Taylor series expansion of $m(1 + \delta_n)$ about $1$,
$$
{\rm Bias}(\hat m_{h_n,1}(1+\delta_n)) = \frac{m''(1)}{2}\left( h_n^2 b_0(K) + h_n \delta_n b_1(K) - \delta_n^2 \right) + o(h_n^2 + h_n \delta_n + \delta_n^2).
$$
Now consider the variance of our estimate. By Theorem 3.1,
$$
\begin{aligned}
{\rm Var}(\hat m_{h_n,1}(1+\delta_n)) &= {\rm Var}(\hat\beta_{0,1}) + 2\delta_n {\rm Cov}(\hat\beta_{0,1}, \hat\beta_{1,1}) + \delta_n^2 {\rm Var}(\hat\beta_{1,1}) \\
&= (1 + o(1)) \frac{\sigma^2}{n h_n f(1)} \left( B_{11} + 2 B_{12} \frac{\delta_n}{h_n} + B_{22} \frac{\delta_n^2}{h_n^2} \right),
\end{aligned}
$$
where $B_{ij}$ is the $ij$th entry of $B(K)$.

We now combine these results to study the asymptotic behavior of $MSE(\hat m_{h_n,1}(1+\delta_n))$. We treat the sequence $\delta_n$, $n \ge 1$, as given, with $\delta_n \to 0$, and consider the optimal rate of convergence to zero of our $MSE$ in three cases:
(i) $\delta_n/h_n \to 0$; (ii) $\delta_n/h_n \to \rho \in (0, \infty)$; (iii) $\delta_n/h_n \to \infty$.

In all three cases, we also require that $h_n \to 0$ and $n h_n \to \infty$. We want our rate of convergence to be of order $n^{-4/5}$, the usual optimal rate for smoothing estimates of regression functions.

First consider case (i). This includes "one step ahead" prediction ($\delta_n = 1/n$) and one-sided regression ($\delta_n = 0$). If $\delta_n/h_n \to 0$ then
$$
MSE(\hat m_{h_n,1}(1+\delta_n)) = h_n^4 \left( \frac{m''(1) b_0(K)}{2} \right)^2 + \frac{\sigma^2 B_{11}}{n h_n f(1)} + o\!\left( h_n^4 + \frac{1}{n h_n} \right).
$$
Thus, if $m''(1) \ne 0$, the asymptotic $MSE$ will be minimized if we choose $h_n \approx H_{opt} n^{-1/5}$, where
$$
H_{opt}^5 = \frac{\sigma^2 B_{11}}{(m''(1) b_0(K))^2 f(1)}.
$$
In this case (and indeed, for any $h_n \approx H n^{-1/5}$ with $H \ne 0$), the asymptotic $MSE$ is of order $n^{-4/5}$. This asymptotic $MSE$ and the optimal bandwidth $h_n \approx H_{opt} n^{-1/5}$ have appeared in the non-forecasting literature (see, for instance, Fan, 1992). That is, forecasting $m(1+\delta_n)$ with $\delta_n = o(n^{-1/5})$ is, in terms of asymptotic $MSE$, the same as estimating $m(1)$.

Next consider case (ii): if $\delta_n/h_n \to \rho \in (0, \infty)$, then
$$
MSE(\hat m_{h_n,1}(1+\delta_n)) = \delta_n^4 \left( \frac{m''(1)}{2} \right)^2 \left( \frac{b_0(K)}{\rho^2} + \frac{b_1(K)}{\rho} - 1 \right)^2
+ \frac{\sigma^2 \rho}{n \delta_n f(1)} \left( B_{11} + 2 B_{12}\rho + B_{22}\rho^2 \right) + o\!\left( \delta_n^4 + \frac{1}{n \delta_n} \right).
$$
In order to have the $MSE$ converge to zero at rate $n^{-4/5}$ when $m''(1) \ne 0$ and $b_0(K)/\rho^2 + b_1(K)/\rho \ne 1$, we must take $\delta_n \approx D n^{-1/5}$, $D \ne 0$. Any sequence of $\delta_n$'s not satisfying this will result in an asymptotic $MSE$ larger than $n^{-4/5}$.

For the final case (iii), if $\delta_n/h_n \to \infty$, then
$$
MSE(\hat m_{h_n,t}(1+\delta_n)) = \delta_n^4 \left( \frac{m''(1)}{2} \right)^2 + \frac{\sigma^2 \delta_n^2 B_{22}}{n h_n^3 f(1)} + o\!\left( \delta_n^4 + \frac{\delta_n^2}{n h_n^3} \right).
$$
To ensure that the $MSE$ is $O(n^{-4/5})$ we require that the bias term be $O(n^{-4/5})$, that is, that $\limsup n^{1/5}\delta_n < \infty$. However, by assumption,
$$
\frac{\delta_n}{h_n} = \frac{n^{1/5}\delta_n}{n^{1/5}h_n} \to \infty,
$$
and so $n^{1/5}h_n \to 0$. This implies that the variance term in the $MSE$ is larger than $O(n^{-4/5})$:
$$
n^{4/5} \frac{\delta_n^2}{n h_n^3} = \left( \frac{\delta_n}{h_n} \right)^2 \frac{1}{n^{1/5}h_n} \to \infty.
$$
In light of the analysis of these three cases, we restrict ourselves to sequences $\delta_n$ and $h_n$ with $\delta_n/h_n \to \rho < \infty$. Our results are summarized in the following corollary, stated for the more general case of forecasting $m(t + \delta_n)$ using data with $t_i \le t$.
Corollary 3.1 Assume that the conditions of Theorem 3.1 hold and let $B_{ij}$ be the $ij$th entry of $B(K)$. Assume also that $\delta_n = D n^{-1/5}$ and $h_n = H n^{-1/5}$ for some $D \ge 0$, $H > 0$. Let $\rho = D/H$. Then, if $\rho \ne 0$, as $n \to \infty$,
$$
\sup_{t \in [a_n, 1]} \left| \frac{2}{\delta_n^2} {\rm Bias}(\hat m_{h_n,t}(t + \delta_n)) - m''(t) \left( \frac{b_0(K)}{\rho^2} + \frac{b_1(K)}{\rho} - 1 \right) \right| = o(1)
$$
and
$$
\sup_{t \in [a_n, 1]} \left| n h_n f(t)\, {\rm Var}(\hat m_{h_n,t}(t + \delta_n)) - \sigma^2 \left( B_{11} + 2 B_{12}\rho + B_{22}\rho^2 \right) \right| = o(1)
$$
whenever $\liminf a_n/h_n \ge 1$. If $\rho = 0$, then, as $n \to \infty$,
$$
\sup_{t \in [a_n, 1]} \left| \frac{2}{h_n^2} {\rm Bias}(\hat m_{h_n,t}(t + \delta_n)) - m''(t)\, b_0(K) \right| = o(1)
$$
and
$$
\sup_{t \in [a_n, 1]} \left| n h_n f(t)\, {\rm Var}(\hat m_{h_n,t}(t + \delta_n)) - \sigma^2 B_{11} \right| = o(1)
$$
whenever $\liminf a_n/h_n \ge 1$. Furthermore, the above statements hold when $m''(t)$ is replaced by $m''(1)$, $f(t)$ is replaced by $f(1)$, and $a_n \to 1$.
Remarks.

1. Consider the effect of the bandwidth on the asymptotic bias and variance of our estimate. Since we would like to forecast as far ahead as possible and still achieve an asymptotic $MSE$ of order $n^{-4/5}$, we focus discussion on the case that $\delta_n = D n^{-1/5}$, $h_n = H n^{-1/5}$, $D$ and $H$ greater than zero, and set $\rho = D/H$. In this case, we say that the asymptotic bias of $\hat m_{h_n,t}(t+\delta_n)$ is
$$
\text{Asymptotic Bias} = \frac{\delta_n^2}{2}\, m''(t) \left( \frac{b_0(K)}{\rho^2} + \frac{b_1(K)}{\rho} - 1 \right)
$$
and the asymptotic variance of $\hat m_{h_n,t}(t+\delta_n)$ is
$$
\text{Asymptotic Variance} = \frac{\sigma^2 \rho}{n \delta_n f(t)} \left( B_{11} + 2 B_{12}\rho + B_{22}\rho^2 \right).
$$
Here, we're considering $\delta_n$ as fixed, with the value of $\rho$ determining the bandwidth ($h_n = \delta_n/\rho$). In forecasting, the dependence of the asymptotic bias and variance on the bandwidth is more complicated than in the non-forecasting setting (i.e., with $\rho = 0$). In the non-forecasting setting, the bias is of the form $C_1 h_n^2$ and the variance is of the form $C_2/(n h_n)$. Obviously, in the non-forecasting case, the square of the asymptotic bias is increasing in $h_n$ and the variance is decreasing in $h_n$. In the forecasting setting, the asymptotic bias is a quadratic form in $1/\rho = h_n/\delta_n$, and so its monotonicity properties depend on $b_0(K)$ and $b_1(K)$. Similarly, the asymptotic variance's monotonicity properties depend on $B_{11}$, $B_{12}$ and $B_{22}$. The asymptotic variance is a decreasing function of $h_n$ and the square of the asymptotic bias is a decreasing function of $\rho$ (equivalently, an increasing function of $h_n$) for the following kernels $K$: the standard normal density function truncated to $[-1, 0]$; the density of the uniform distribution on $[-1, 0]$; and the Epanechnikov kernel truncated to $[-1, 0]$.

2. The theorem and the corollary also hold in the case that the $t_i$'s are random, independent and identically distributed and independent of the $\epsilon_i$'s. Here, $f$ is the density of the $t_i$'s and the statements of the theorem and corollary hold in probability, with expected values and variances conditional on $t_1, \ldots, t_n$. The proofs can be found in Li (1996).

3. Note that, although $\delta_n \to 0$, we are forecasting roughly $n\delta_n$ steps ahead, that is, of order $n^{4/5}$ steps ahead. Thus we're forecasting much further ahead than "one step" ($\delta_n = 1/n$). This is, in a way, a matter of scaling. In some forecasting applications, we use data $(t_i, Y_i)$ with $t_i = i$, $i = 1, \ldots, n$, to forecast $m(n + \Delta_n)$. To apply the theorem and corollary, we must first rescale the $t_i$'s so that they lie between 0 and 1. That is, we work with the times $t_i/n$ and the function $m_n(t) = m(nt)$, and we forecast $m_n(1 + \Delta_n/n)$. Our asymptotic results will hold provided $\Delta_n = D n^{4/5}$ and $m_n''(1) = n^2 m''(n)$ converges to a constant, that is, provided $m''(n)$ is of order $n^{-2}$. This severe restriction on $m''$ is not an artifact of our method of proof. If one wants to forecast far into the future by fitting a line through the data, then it's clear that $m$ must be close to a line.
4 The choice of a bandwidth

Often, one prefers to choose the bandwidth by a completely automatic, or data-driven, method rather than a subjective one. Typically, one seeks to minimize some measure of the discrepancy between the estimated and the true function and tries to estimate the bandwidth which minimizes this measure. There are two commonly used measures for the choice of a bandwidth for an estimator: the mean squared error and the prediction error. When observations are independent, these two measures are basically the same, but they lead to two different approaches to bandwidth selection, namely the plug-in approach and the cross-validation approach. In the non-forecasting setting, both approaches are commonly used data-driven methods for choosing a bandwidth for a local regression estimator, and there is abundant literature on the merits and limitations of both methods in this setting. For discussions of these methods, see Hardle, Hall and Marron (1988), Gasser, Kneip and Kohler (1991), Hall, Sheather, Jones and Marron (1991), Hastie and Loader (1993), Ruppert, Sheather and Wand (1995) and Fan and Gijbels (1995). We will discuss the two measures and the two approaches in the forecasting setting in the next two subsections. The third subsection contains a method that combines the two approaches. In all three subsections, we assume that the $\delta_n$, $n \ge 1$, are given and satisfy $\delta_n = D n^{-1/5}$, $D \ne 0$.
4.1 A complete plug-in approach, CAMSE

Recall that our measure of accuracy of our forecast is the mean squared error, $MSE(\hat m_{h_n,1}(1+\delta_n)) \equiv MSE(h_n)$. The bias and the variance of $\hat m_{h_n,1}(1+\delta_n)$ are both functions of $h_n$ with, typically, the square of the bias small for $h_n$ small and the variance small for $h_n$ big. It is desirable to choose an "optimal" bandwidth such that the forecasting estimator has a small bias and a small variance. Thus, we would like to find the $h_n$ which minimizes the $MSE$. The quantity $MSE(h_n)$ is fairly complicated so, instead of minimizing the $MSE$, we try to minimize $AMSE$, the asymptotic value of the $MSE$. Using Corollary 3.1 and writing $AMSE$ as a function of $\rho = \delta_n/h_n$, we see that
$$
AMSE(\rho) = m''(1)^2\, b^2(\rho) + \sigma^2 v(\rho), \tag{3}
$$
where
$$
b(\rho) = \frac{\delta_n^2}{2} \left( \frac{b_0(K)}{\rho^2} + \frac{b_1(K)}{\rho} - 1 \right),
\qquad
v(\rho) = \frac{\rho}{n \delta_n f(1)} \left( B_{11} + 2 B_{12}\rho + B_{22}\rho^2 \right).
$$
The asymptotically optimal bandwidth is then equal to $h_{opt} = \delta_n/\rho_{opt}$, where $\rho_{opt}$ minimizes $AMSE$. In the non-forecasting setting we have an explicit formula for the $h_n$ that minimizes $AMSE$: $h_{opt} = C n^{-1/5}$, with $C$ depending on $m''(t)$ and $\sigma^2$. There is no such formula in forecasting. Fortunately, it's easy to show that a minimizer of $AMSE$ exists. Observe that, since $B(K)$ is positive definite, $AMSE(\rho) \to +\infty$ as $\rho \to 0$ or $+\infty$. So a minimum of $AMSE(\rho)$ in $(0, +\infty)$ is guaranteed. The minimum can be found by setting the first derivative of $AMSE$ to zero and choosing the positive root which gives the smallest $AMSE$. However, setting the first derivative of $AMSE$ to zero and solving for $\rho$ calls for solving a seventh degree polynomial equation. An alternative method for minimizing $AMSE$ is a grid search.

$AMSE$ depends on the unknown quantities $\sigma^2$ and $m''(1)$. These quantities can be estimated from the data and then "plugged into" $AMSE$ to yield $\widehat{AMSE}$. We then find $\hat\rho_{opt}$, the minimizer of $\widehat{AMSE}$, and let our optimal bandwidth be $\hat h_{opt} = \delta_n/\hat\rho_{opt}$. Application of the plug-in approach to the forecasting setting is therefore fairly straightforward. Estimates of the unknown quantities $m''(1)$ and $\sigma^2$ can be gotten by standard non-forecasting techniques, since $m''(1)$ and $\sigma^2$ don't depend on the "future". Our estimate of $\sigma^2$ is one proposed by Rice (1984):
$$
\hat\sigma^2_{Rice} = \sum_{i=1}^{n-1} \frac{(Y_{i+1} - Y_i)^2}{2(n-1)}.
$$
Using standard results in local polynomial regression, $m''(1)$ can be estimated by
fitting a local cubic polynomial around $t = 1$. That is, let
$$
(\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3)
= \mathop{\rm argmin}_{\beta_0, \ldots, \beta_3} \sum_{i=1}^n \left( Y_i - \sum_{j=0}^3 \beta_j (t_i - 1)^j \right)^2 K\!\left( \frac{t_i - 1}{h_n} \right), \tag{4}
$$
and let $\hat m''_{h_n,1}(1) = 2\hat\beta_2$. Note that the kernel function $K$ and the bandwidth $h_n$ used to estimate $m''(1)$ may be different from those used for forecasting. The bandwidth that minimizes the asymptotic $MSE$ of this estimate of $m''(1)$ is $h_{m''} = C_0 n^{-1/9}$, where $C_0$ depends on $\sigma^2$ and one higher unknown derivative, $m^{(4)}(1)$. Ordinarily the fourth derivative of a function is difficult to estimate because, roughly, fourth order divided differences of the data are very noisy. One can try to estimate $m^{(4)}(1)$ by fitting a local polynomial of degree 5 as in (4), but then one needs to choose yet another bandwidth, whose optimal value depends on $m^{(6)}(1)$, which is even more difficult to estimate than $m^{(4)}(1)$.

Fan and Gijbels (1995) present an alternative method for choosing a good bandwidth for the local cubic estimate of $m''(1)$. With this method, one avoids estimating higher order derivatives of $m$. The authors introduce a statistic $RSC$,
$$
RSC(t, h_n) = \hat\sigma^2(t, h_n)\{1 + 4V(t, h_n)\},
$$
where we are fitting a kernel centered at $t$, $V(t, h_n)$ depends on the kernel $K$, and $\hat\sigma^2(t, h_n)$ is calculated from residuals based on a local cubic fit using bandwidth $h_n$. The bandwidth $h_{RSC}$ that minimizes the asymptotic value of $E(RSC(1, h_n))$ is related to $h_{m''}$ by $h_{m''} = C h_{RSC}$, where $C$ depends on the kernel and the $t_i$'s and hence can be computed. Thus $h_{RSC}$ can be estimated from the sample by $\hat h_{RSC}$, the minimizer of $RSC(1, \cdot)$. The estimate of $h_{m''}$ is then $\hat h_{m''} = C\hat h_{RSC}$, and $\hat h_{m''}$ is used as the bandwidth in estimating $m''(1)$. So the optimal bandwidth for estimating $m''(1)$ can be estimated without having to estimate higher order derivatives of $m$. The following algorithm summarizes the steps used to choose the bandwidth $\hat h$ for forecasting $m(1 + \delta_n)$.
Algorithm:

1. Fit a local cubic polynomial centered at $t = 1$, as in (4), to the data for each value of $h_n$ on a grid;

2. find $\hat h_{RSC}$, the minimizer of $RSC(1, h_n)$ on that grid of $h_n$;

3. let $\hat h_{m''} = C\hat h_{RSC}$;

4. estimate $m''(1)$ by fitting a local cubic polynomial to the data using (4) with bandwidth $\hat h_{m''}$;

5. estimate $\sigma^2$ by $\hat\sigma^2_{Rice}$;

6. use these estimates of $m''(1)$ and $\sigma^2$ to calculate $\widehat{AMSE}$ and find $\hat\rho_{opt}$, the minimizer of $\widehat{AMSE}$;

7. estimate $m(1 + \delta_n)$ by $\hat m_{h_n,1}(1 + \delta_n)$, as defined in (2), using bandwidth $\hat h_{opt} = \delta_n/\hat\rho_{opt}$.
i
i
n
(In the simulations of Section 5, we take n = n.) The formula in step 3 gives the relationship between ^hARSC and h^ m , where ^hm is the bandwidth for estimating m00(t), t near 1. We call this modi ed method ARSC . 00
00
2. It isn't clear what happens with this method when m(4)(1) = 0. The derivation of the multiplier C in ^hm = C ^hRSC relies on the fact that m(4)(1) 6= 0. However, if m(4)(1) = 0, then h^ RSC will most likely be very large, yielding a large value of ^hm , which is desired. This problem is investigated in our simulation studies in Section 5. 00
00
14
4.2 A forecasting cross-validation approach, FCV A commonly used criterion for bandwidth selection is the prediction error, ?m PE (hn ) = (Y1+ ^ h ;1(1 + n ))2 n
n
where Y1+ = m(1+n)+ is a new observation made at the time 1+n . The optimal bandwidth by this criterion is the one that minimizes PE . Since Y1+ is independent of the Yi 's n
n
E (PE (hn )) = E (m(1 + n) ? m^ h ;1(1 + n))2 + Var() = MSE (hn ) + Var(): n
So, when regression errors are uncorrelated, minimizing the expected prediction error E (PE (hn )) is equivalent to minimizing MSE (hn ). In the non-forecasting setting, the prediction error is de ned to be PE (hn ) = Pn(m^ (t ) ? Y )2 with Y = m(t ) + , a new observation made at t . The datai i 1 h i i i i based cross-validation choice of hn in this setting minimizes the following modi cation of PE . Replace Yi by Yi and m^ h (ti) by m^ h (?i)(ti), where m^ h (?i)(ti) is the estimate of m(ti) based on the data set with (ti; Yi ) excluded. That is, we judge the performance of m^ h (?i)(ti), our estimate of m(ti) using bandwidth hn , by comparing it to Yi. This idea does not lend directly to the forecasting setting. In forecasting we can certainly calculate an estimate of m(1 + n ). However, we haven't observed a Y value at t = 1 + n, so we cannot compare our estimate of m(1 + n) to a Yi. One natural idea for choosing an optimal bandwidth for forecasting is to treat a portion of the most recent data as the future and try to forecast it with the remaining data using bandwidth hn. One then chooses the bandwidth that minimizes some measure of the discrepancy between the \future" Yi's and their estimated values. This idea has the avor of conventional cross-validation in that a statistical model is built on part of the data and then \validated" by the rest of the data. Speci cally, suppose that we try to forecast Yi 's with ti 2 [1 ? n; 1]. To forecast m(ti) based on \past" data with tj 2 [0; ti ?n], we center our kernel K at the \present", ti ? n, letting m^ h ;t ? (ti) = ^0;t ? + ^1;t ? n; n
n
n
n
n
n
i
n
i
15
n
i
n
with ^0;t ? and ^1;t ? as given in (1), with K having support [?1; 0]. Then our CV function for forecasting m(ti)'s with ti 2 [1 ? n; 1] is given by 2 X FCV (hn ) = #t 2 [11? ; 1] m^ h ;t ? (ti) ? Yi : i n t 2[1? ;1] i
n
i
n
n
i
n
n
i
Remarks: 1. We could try to forecast m(ti)'s for ti in an interval larger than [1 ? n ; 1], but we've found in practice that the interval [1 ? n; 1] works well. A smaller interval doesn't contain enough data, resulting in a highly variable FCV . A larger interval will contain ti's far from t = 1. These ti's will not give accurate information about an optimal bandwidth for forecasting m(1 + n ). 2. Cross-validation has been proposed for bandwidth choice when errors are correlated. Gijbels, Pope and Wand (1997) use cross-validation to choose the bandwidth for forecasting a Y one time point ahead of the observed data. Hart (1994) uses a cross-validation idea similar to FCV , but with the goal of estimating m(t) t 2 [0; 1]. This technique can also be used to improve bandwidth selection for uncorrelated data (Hart and Yi, 1996). 3. Recall that the prediction error, PE , and the mean squared error, MSE , of m^ h ;1(1+n) are related by E (PE (hn)) = 2 + MSE (hn ), and that the complete plug-in approach CAMSE chooses the bandwidth by trying to minimize AMSE , the asymptotic mean squared error. Using Corollary 3.1 one can show that n
E (FCV (hn )) ? 2 ? AMSE (hn) = o(n?4=5): (See Li, 1996.) This gives a heuristic justi cation of the statement that the bandwidth that minimizes FCV is approximately the same as the bandwidth that minimizes AMSE . Of course, a much stronger result is necessary to guarantee that the two bandwidths are similar. However, our simulation studies in Section 5 support our heuristic claim. 16
4.3 A mixed approach, PICV: the combination of the plug-in and the CV approaches

Recall that in CAMSE, the plug-in approach for selecting a bandwidth, we want to minimize $AMSE$ in (3). The optimal bandwidth is defined to be $\delta_n/\rho_{opt}$, with $\rho_{opt}$ the minimizer of $AMSE$. To estimate $\rho_{opt}$, we minimize an estimate of $AMSE$. Obviously $\rho_{opt}$ also minimizes $AMSE^*$, where
$$
AMSE^*(\rho) = \frac{AMSE(\rho)}{\sigma^2} = \left( \frac{m''(1)}{\sigma} \right)^2 b^2(\rho) + v(\rho). \tag{5}
$$
To estimate $\rho_{opt}$, we propose minimizing an estimate of $AMSE^*$. Estimating $AMSE^*$ requires estimating one thing, $(m''(1)/\sigma)^2$, while estimating $AMSE$ entails estimating both $m''(1)$ and $\sigma^2$.

Below, we describe a local linear regression estimator whose asymptotically optimal bandwidth $h_{opt}$ satisfies $(m''(1)/\sigma)^2 = C/h_{opt}^5$, with $C$ known. We can estimate $h_{opt}$ by cross-validation, yielding $\hat h_{opt}$, and then use this estimate to calculate an estimate of $(m''(1)/\sigma)^2$, namely $C/\hat h_{opt}^5$.

We define $h_{opt}$ as the minimizer of the asymptotic mean squared error of $\hat m_{h_n,1}(1) = \hat\beta_{0,1}$, as defined in (1). By Theorem 3.1, this asymptotic mean squared error is equal to
$$
AMSE_1(h_n) = \left( \frac{h_n^2}{2}\, m''(1)\, b_0(K) \right)^2 + \frac{\sigma^2 B_{11}}{n h_n f(1)}.
$$
Thus $h_{opt}$ satisfies
$$
\left( \frac{m''(1)}{\sigma} \right)^2 = \frac{C}{h_{opt}^5}, \quad \text{where } C = \frac{B_{11}}{n f(1)\, b_0^2(K)}. \tag{6}
$$
We estimate $h_{opt}$ by one-sided cross-validation of the estimates of $m(t_i)$ for $t_i$ near 1. That is, we let $\hat h_{opt}$ be the minimizer of
$$
CV_1(h_n) = \frac{1}{\#\{t_i \in [1-\delta_n, 1]\}} \sum_{t_i \in [1-\delta_n, 1]} \left( \hat m_{h_n}^{(-i)}(t_i) - Y_i \right)^2,
$$
where $\hat m_{h_n}^{(-i)}(t_i)$ is the usual leave-one-out estimate of $m(t_i)$, that is, $\hat m_{h_n}^{(-i)}(t_i) = \hat\beta_{0,t_i}^{(-i)}$, where
$$
\hat\beta_{t_i}^{(-i)} = \begin{pmatrix} \hat\beta_{0,t_i}^{(-i)} \\ \hat\beta_{1,t_i}^{(-i)} \end{pmatrix}
= \mathop{\rm argmin}_{\beta_0, \beta_1} \sum_{j \ne i} \left( Y_j - \beta_0 - \beta_1 (t_j - t_i) \right)^2 K\!\left( \frac{t_j - t_i}{h_n} \right),
$$
5 Simulations 5.1 Motivation and data In this section we will study the local linear forecasting estimator of m(1+n) for n = 0:1 and 0:2 on simulated data sets, Yi = m(ti)+ i with i; i = 1; : : : ; n, independent and identically distributed normal random variables with mean zero and standard deviation 0:1. Our ti's are equally spaced, that is ti = i=n. Data sets with sample sizes n = 50 and 100 are generated from each of three m's, m1(t) = t, m2(t) = t2 and 8 > < 2?7=2(cos(4t) + 1); t 1=2; m3(t) = > 5=2 : t ; t > 1=2: Methods of estimating an optimal bandwidth by using the complete plug-in procedure CAMSE , its modi cation ARSC , the forecasting cross-validation procedure FCV and the mixed approach PICV are applied to each data set. The kernel function K 18
used throughout is the density function of the standard normal with support truncated to [?1; 0]. (Simulation results using the truncated Epanechnikov kernel are similar.) For the truncated normal kernel function, the asymptotic mean squared error by formula (3) is 00 2 AMSE () = m 4(1) 4n(0:1532668=2 + 0:9663585= + 1)2 2 + n (4:034252 + 12:1699 + 12:2112162 ): n
(7)
The optimal bandwidth for forecasting is hopt = n=opt, where opt minimizes AMSE (). The goal of this simulation study is to compare the estimates of m(1 + n) and the estimates of optimal bandwidths by the three bandwidth estimation procedures. The function m1(t) = t was chosen since it is, in some sense, ideal for local linear estimation. One can show that, in this case, the bias of m^ h ;1(1 + n) is zero and thus, to minimize the AMSE , one minimizes the variance by taking hn as large as possible. The function m2(t) = t2 was chosen since it's not linear, has m00 2 (so perhaps m00(1) is easy to estimate), but has m(4) 2 0. As noted in Remark 2 in Section 4.1, this might cause a problem for the CAMSE and ARSC methods. The function m3 was chosen because it has a non-zero fourth derivative at t = 1, a non-constant second derivative over [0; 1=2) [ (1=2; 1] and a discontinuity point at t = 1=2. Our hope is that our databased bandwidths will be small enough so that our estimates will be based on data with t's in (1=2; 1]. It is also interesting to see how the magnitude of m00(1) will aect the estimated optimal bandwidth and the resulting forecasts. It can be proved from (7) that a larger absolute value of m00(1) will result in a smaller asymptotically optimal bandwidth. The three m's under study have m001 (1) = 0, m002 (1) = 2 and m003 (1) = 3:75. As noted above, the optimal bandwidth for forecasting m1(1 + n) is as large as possible. The asymptotically optimal bandwidths for m1, m2 and m3 for n = 0:1 and 0:2 and for n = 50 and 100 are given in Tables 1-3. n
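The asymptotically optimal bandwidth h_opt = Δn/β_opt can be obtained by a direct grid minimization of AMSE(β) in (7). A minimal sketch, assuming σ = 0.1 and the constants as printed in (7); the function names are ours:

```python
import numpy as np

def amse(beta, m2_at_1, n, delta, sigma=0.1):
    """Asymptotic MSE of the forecast as a function of beta = delta / h,
    for the normal kernel truncated to [-1, 0]; constants from formula (7).
    m2_at_1 is m''(1)."""
    bias2 = (m2_at_1 ** 2 * delta ** 4 / 4.0) * \
            (0.1532668 / beta ** 2 + 0.9663585 / beta + 1.0) ** 2
    var = (sigma ** 2 * beta / (n * delta)) * \
          (4.034252 + 12.1699 * beta + 12.211216 * beta ** 2)
    return bias2 + var

def h_opt(m2_at_1, n, delta, sigma=0.1):
    """Grid-minimize AMSE over beta and return the bandwidth delta / beta."""
    betas = np.linspace(1e-3, 20.0, 200000)
    beta_star = betas[np.argmin(amse(betas, m2_at_1, n, delta, sigma))]
    return delta / beta_star
```

With m''(1) = 2, n = 50 and Δn = 0.1, this reproduces the asymptotically optimal bandwidth of about 0.34 reported for m2 in Table 2; with m''(1) = 0, the minimizing β is pushed to the edge of the grid, reflecting that for m1 the optimal bandwidth is as large as possible.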
5.2 Results and comments

TABLES 1-3 AROUND HERE

Tables 1-3 display summary statistics of the forecasts of m(1 + Δn), Δn = 0.1 and 0.2, and of ĥ, the estimate of the asymptotically optimal bandwidth for the forecasts, for the four procedures. Since the estimates of the optimal bandwidth are highly variable, the median of these estimates over the 100 simulations is a better indicator of their center than the mean, and so the medians are presented in the tables. The tables also give the standard deviation (s.d.) and the mean squared error (m.s.e.) of the one hundred values of m̂_{hn,1}(1 + Δn) and of ĥ, along with the average of m̂_{hn,1}(1 + Δn). Recall that the m.s.e. is equal to the squared bias plus the squared standard deviation. The last line of each table displays the true value of m(1 + Δn), the minimum value of AMSE(β), and h_opt, the asymptotically optimal bandwidth. Note that we present standard deviations, not standard errors, which are equal to the standard deviation divided by √100.

In terms of the mean squared error of our forecasts, we see that the complete plug-in methods (CAMSE and ARSC) produce the worst forecasts. FCV produces the best forecasts, and the m.s.e.'s of these FCV forecasts are closest to the minimum AMSE's. There are a few cases where the m.s.e.'s of forecasts by PICV are the same as the m.s.e.'s of forecasts by FCV. Comparing the means of the forecasts, we see that no single procedure has the lowest bias. However, forecasts by CAMSE and ARSC tend to have much larger standard deviations.

From Tables 1-3, we see that the estimates of the optimal bandwidths calculated by CAMSE and ARSC are the most variable. We do not see clear evidence that the ARSC bandwidths are less variable than those chosen by CAMSE. FCV and PICV have very similar performance in estimating the optimal bandwidth.
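The tabulated summaries follow the decomposition m.s.e. = bias² + s.d.². A small sketch of how each table cell could be computed from the 100 simulated forecasts (names are ours):

```python
import numpy as np

def summarize(forecasts, truth):
    """Mean, s.d. and m.s.e. of a set of simulated forecasts, as in Tables 1-3."""
    forecasts = np.asarray(forecasts, dtype=float)
    mean = forecasts.mean()
    sd = forecasts.std()  # population s.d., so mse = bias^2 + sd^2 exactly
    mse = np.mean((forecasts - truth) ** 2)
    return mean, sd, mse
```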
Tables 1-3 show that, as expected, the m.s.e.'s of the forecasts by all procedures increase as |m''(1)| increases, for fixed n and Δn. Also, as expected, larger values of |m''(1)| tend to result in a smaller median ĥ.
Recall that all four bandwidth selection procedures choose a bandwidth that minimizes an estimate of either AMSE or a function easily related to AMSE: the complete plug-in methods CAMSE and ARSC estimate AMSE by AMSE-hat, the cross-validation approach FCV minimizes FCV, whose expected value is close to AMSE + σ², and the mixed procedure PICV minimizes AMSE*-hat, an estimate of AMSE* = AMSE/2. To gain insight into the performance of the procedures, in Figures 4 and 5 we study the curves AMSE-hat (to study CAMSE and ARSC), FCV - σ̂² (to study FCV), and 2·AMSE*-hat (to study PICV). These three curves are then compared to the "truth", AMSE.

Figure 4 shows the pointwise medians of the three curves from 100 simulations for each of the three functions, for n = 50, Δn = 0.1. The solid line is the "truth", the AMSE curve. We see that, in terms of median curves, FCV - σ̂² and 2·AMSE*-hat are close to the "truth", suggesting that the bandwidths that minimize FCV and AMSE*-hat may be close to h_opt, the bandwidth that minimizes AMSE. On the other hand, AMSE-hat agrees with the AMSE curve only for m = m3. This function is the only one considered that has m^(4)(1) ≠ 0. The poor performance of CAMSE and ARSC for m = m1 and m = m2 can perhaps be explained by the fact that these methods rely heavily on estimates of m'' near 1, and these estimates rely on the fourth derivative of m at 1 being non-zero. Plots (not shown) for the other values of n and Δn are similar.

FIGURE 4 HERE

Figure 5 contains plots of the median and the twenty-fifth and seventy-fifth percentiles of the three curves for the simulations using m = m3, n = 100, and Δn = 0.1. The "truth", AMSE, is also plotted. The FCV - σ̂² curve is the least variable. Although it badly underestimates AMSE for large values of hn, its estimation of AMSE is excellent in the crucial region near h_opt, the h that minimizes AMSE. Both AMSE-hat and 2·AMSE*-hat are fairly variable, with AMSE-hat the more variable of the two. Plots (not shown) for the other m's and other values of n and Δn are similar.

FIGURE 5 HERE

It is easy to see why, with the function m = m3, FCV(hn) - σ̂² under-estimates AMSE(hn) for large values of hn. Recall that the FCV score is an average of the estimates of the prediction errors for the m(ti)'s with ti ∈ [1 - Δn, 1], and that, by Corollary 3.1, the asymptotic mean squared error of m̂_{hn, ti-Δn}(ti) depends on m''(ti - Δn), while AMSE(hn) depends on m''(1). The second derivatives of m1 and m2 are constants, with m1'' ≡ 0 and m2'' ≡ 2. So for ti ∈ [1 - Δn, 1], m''(ti - Δn) is equal to m''(1). Thus, for m1 and m2, one sees an agreement in shape between the shifted FCV scores and the AMSE curve. But for m3, m3''(t) = 15 t^(1/2)/4 for t > 1/2, which is an increasing function of t. Therefore, for ti ∈ [1 - Δn, 1], m3''(ti - Δn) is less than m3''(1), and so FCV will tend to under-estimate the asymptotic mean squared error of a forecast of m(1 + Δn). Asymptotically, this problem disappears, since m''(ti - Δn) converges to m''(1) as n → ∞. However, in a finite sample one would expect FCV - σ̂² to underestimate the AMSE, with the discrepancy increasing as hn increases, the sample size remaining fixed.

Although FCV and PICV have similar performance in terms of mean squared error, the FCV procedure is much faster to compute than the PICV procedure. Based on our analysis and the above observations from our modest simulation study, we conclude that FCV has the best performance and is thus recommended.
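The FCV score described above, forecasting each Yi with ti in [1 - Δn, 1] from the data up to time ti - Δn and averaging the squared errors, can be sketched together with a one-sided local linear forecast using the truncated normal kernel. This is a simplified sketch; the function names and implementation details (such as the weighted least-squares solver) are ours:

```python
import numpy as np

def llin_forecast(t, y, t0, s, h):
    """Local linear fit at the 'present' s, using only data in [s - h, s]
    (weights proportional to the normal density truncated to [-1, 0]),
    extrapolated linearly to t0 > s."""
    u = (t - s) / h
    w = np.where((u >= -1.0) & (u <= 0.0), np.exp(-0.5 * u ** 2), 0.0)
    if np.count_nonzero(w) < 2:
        return np.nan
    sw = np.sqrt(w)
    X = np.column_stack([np.ones_like(t), t - s])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0] + beta[1] * (t0 - s)  # linear extrapolation to t0

def fcv(t, y, h, delta):
    """Forecasting cross-validation score: average squared error of
    forecasting each Y_i with t_i in [1 - delta, 1] from the data
    observed up to time t_i - delta."""
    idx = np.where(t >= 1.0 - delta)[0]
    errs = []
    for i in idx:
        past = t <= t[i] - delta
        if past.sum() < 2:
            continue
        pred = llin_forecast(t[past], y[past], t[i], t[i] - delta, h)
        if not np.isnan(pred):
            errs.append((y[i] - pred) ** 2)
    return np.mean(errs) if errs else np.inf
```

A data-driven bandwidth would then be chosen by minimizing `fcv` over a grid of h values; note that `fcv` estimates a prediction error close to AMSE + σ², so the σ² offset does not affect the minimizer.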
6 Conclusion

In this paper, the local linear forecasting estimator is investigated. The asymptotic theory of this estimator, assuming uncorrelated errors, is presented and then applied to its automatic implementation, i.e., the data-driven estimation of an optimal bandwidth for forecasting. For the estimation of an optimal bandwidth, the complete plug-in approach (CAMSE), its modification (ARSC), the forecasting cross-validation approach (FCV), and a mixture of these approaches (PICV) are investigated. Simulations clearly show the advantage of FCV and PICV over the plug-in procedures, and also indicate that FCV is superior to PICV.

Here, we have only considered regression data with uncorrelated errors. Since, for fixed bandwidth, m̂(1 + Δn) is linear in the Yi's, correlation will not affect its asymptotic bias but may affect its asymptotic variance, in a way that depends on the correlation structure. Thus, for estimating m(1 + Δn), a formula analogous to (3) is needed to calculate appropriate plug-in bandwidth estimates, and this formula will undoubtedly require estimation of the unknown correlation structure. Alternatively, we may want to forecast a Y value at the future time 1 + Δn. Presumably, our forecasting cross-validation would work well here. Modifications of plug-in and CV methods exist for estimation of the trend m(t), t ∈ [0, 1], when regression errors are correlated. See, for instance, Hall and Hart (1990), Altman (1990), Chu and Marron (1991), Hart (1991, 1994), Hermann, Gasser and Kneip (1992), and Opsomer (1996). Bandwidth choice for one-step-ahead forecasting has been studied by Hart (1994) and Gijbels, Pope and Wand (1997).
Acknowledgments. We would like to thank Steve Marion of the Department of Health
Care and Epidemiology, University of British Columbia, for providing us with initial data sets and advice concerning the analysis of AIDS data. In addition, we thank Ping Yan from the Laboratory Centre for Disease Control, who provided us with the Canadian AIDS data analyzed here.
References

Altman, N.S. (1990), "Kernel Smoothing of Data with Correlated Errors," Journal of the American Statistical Association 85, 749-759.

Andrews, D.F. and Herzberg, A.M. (1985), Data: A Collection of Problems from Many Fields for the Student and Research Worker, Springer-Verlag, New York.

Bacchetti, P., Segal, M.R., and Jewell, N.P. (1993), "Backcalculation of HIV Infection Rates," Statistical Science 8, 82-119.

Brookmeyer, R. and Gail, M.H. (1988), "A Method for Obtaining Short-term Predictions and Lower Bounds on the Size of the AIDS Epidemic," Journal of the American Statistical Association 83, 301-308.

Chu, C.K. and Marron, J.S. (1991), "Comparison of Two Bandwidth Selectors with Dependent Errors," Annals of Statistics 19, 1906-1918.

Fan, Jianqing (1992), "Design-adaptive Nonparametric Regression," Journal of the American Statistical Association 87, 998-1004.

Fan, J. and Gijbels, I. (1995), "Data-driven Bandwidth Selection in Local Polynomial Fitting: Variable Bandwidth and Spatial Adaptation," Journal of the Royal Statistical Society B 57, 371-394.

Fan, J. and Gijbels, I. (1996), Local Polynomial Modeling and its Applications, Chapman and Hall, London.

Gasser, T., Kneip, A. and Köhler, W. (1991), "A Flexible and Fast Method for Automatic Smoothing," Journal of the American Statistical Association 86, 643-652.

Gijbels, I., Pope, A., and Wand, M.P. (1997), "Automatic Forecasting via Exponential Smoothing: Asymptotic Properties," manuscript.

Hall, P. and Hart, J.D. (1990), "Nonparametric Regression with Long-range Dependence," Stochastic Processes and their Applications 36, 339-351.

Hall, P., Sheather, S.J., Jones, M.C. and Marron, J.S. (1991), "On Optimal Data-based Bandwidth Selection in Kernel Density Estimation," Biometrika 78, 263-271.

Härdle, W. (1989), Applied Nonparametric Regression, Cambridge University Press, Cambridge.

Härdle, W., Hall, P. and Marron, J.S. (1988), "How Far are Automatically Chosen Regression Smoothing Parameters from Their Optimum?" (with discussion), Journal of the American Statistical Association 83, 86-99.

Hart, J.D. (1991), "Kernel Regression Estimation with Time Series Errors," Journal of the Royal Statistical Society B 53, 173-187.

Hart, J.D. (1994), "Automated Kernel Smoothing of Dependent Data by Using Time Series Cross-validation," Journal of the Royal Statistical Society B 56, 529-542.

Hart, J.D. and Yi, S. (1996), "One-sided Cross-validation," manuscript.

Harvey, A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press, New York.

Hastie, T. and Loader, C. (1993), "Local Regression: Automatic Kernel Carpentry," Statistical Science 8, 120-143.

Healy, M.J.R. and Tillett, H.E. (1988), "Short-term Extrapolation of the AIDS Epidemic," Journal of the Royal Statistical Society A 151, 50-61.

Hermann, E., Gasser, T. and Kneip, A. (1992), "Choice of Bandwidth for Kernel Regression when Residuals are Correlated," Biometrika 79, 783-795.

Li, Xiaochun (1996), "Local Linear Regression Versus Backcalculation in Forecasting," Ph.D. thesis, Statistics Department, University of British Columbia.

Opsomer, Jean (1996), "Estimating an Unknown Function by Local Linear Regression when the Errors are Correlated," Iowa State University Preprint Number 95-44.

Rice, J. (1984), "Bandwidth Choice for Nonparametric Regression," Annals of Statistics 12, 1215-1230.

Ruppert, D. and Wand, M.P. (1994), "Multivariate Weighted Least Squares Regression," Annals of Statistics 22, 1346-1370.

Ruppert, D., Sheather, S.J., and Wand, M.P. (1995), "An Effective Bandwidth Selector for Local Least Squares Regression," Journal of the American Statistical Association 90, 1257-1270.

Schechter, M.T., Marion, S.A., Elmslie, K.D. and Ricketts, M. (1992), "How Many Persons in Canada Have Been Infected with Human Immunodeficiency Virus? An Exploration Using Backcalculation Methods," Clinical and Investigative Medicine 15, 331-345.

Wand, M.P. and Jones, M.C. (1995), Kernel Smoothing, Chapman and Hall, London.
Figure 1: The annual incidence of melanoma in males from 1936 until 1972 (the dots). Local linear forecasts are based on the data up to and including 1970. The forecasts using bandwidths chosen by CAMSE are denoted by C. The forecasts using bandwidths chosen by FCV and PICV (the O's) are the same. The two bandwidths chosen by CAMSE for the two forecasts are the same, 1.4 years. FCV and PICV choose the same bandwidths: 19.0 years (for 1971) and 17.3 years (for 1972).
Figure 2: The annual incidence of melanoma in males from 1936 until 1972 (the dots). Local linear forecasts are based on the data up to and including 1969. Forecasts using CAMSE are denoted by C, FCV by F, and PICV by P. FCV and PICV gave the same forecast for 1970 (denoted O). The three bandwidths chosen by CAMSE for the three forecasts are all equal to 2 years. The three bandwidths chosen by FCV are 9.6, 7.9, and 6.9 years. The three bandwidths chosen by PICV are all equal to 9.6 years.
Figure 3: The quarterly number of new AIDS cases in Canada reported within six years of diagnosis, from the fourth quarter of 1979 to the first quarter of 1990 (the dots). Local linear forecasts (the O's) are based on the data up to and including the first quarter of 1988. The eight bandwidths (chosen by FCV) for the eight forecasts are, in years, 3.2, 2.4, 0.9, 1.0, 1.2, 0.9, 0.9, 0.9.
Figure 4: Each graph shows four curves: the median of the 100 FCV - σ̂² curves (dotted), the median of the 100 2·AMSE*-hat curves (short dashed; used for PICV), the median of the 100 AMSE-hat curves (long dashed; used for CAMSE), and the true AMSE curve (solid), plotted against h, for each of m1, m2 and m3. Here n = 50 and Δn = 0.1.
Figure 5: The AMSE curve (solid), and the 25th, 50th and 75th quantiles of the FCV - σ̂², AMSE-hat and 2·AMSE*-hat curves, respectively, plotted against h, for m = m3, n = 100 and Δn = 0.1.
Table 1: Summary of 100 forecasts (m̂'s) and estimates of the optimal bandwidth by local linear regression for m = m1, with sample sizes n = 50 and 100. Each ĥ entry is the median [M] of the 100 bandwidth estimates, with their standard deviation in parentheses. In the "true value" row, the minimum of AMSE is displayed in the (m.s.e.) column and h_opt in the ĥ column.

m = m1, n = 50
                        Δn = 0.1                          Δn = 0.2
method       m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)    m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)
CAMSE        1.08 (0.19)  (0.04)    0.30 (0.27)    1.17 (0.34)  (0.12)    0.27 (0.26)
ARSC         1.11 (0.18)  (0.03)    0.24 (0.25)    1.21 (0.27)  (0.07)    0.27 (0.27)
FCV          1.11 (0.07)  (0.01)    0.51 (0.21)    1.20 (0.08)  (0.01)    0.51 (0.12)
PICV         1.10 (0.08)  (0.01)    0.56 (0.18)    1.21 (0.10)  (0.01)    0.42 (0.09)
true value   1.10         0.00      ∞              1.20         0.00      ∞

m = m1, n = 100
                        Δn = 0.1                          Δn = 0.2
method       m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)    m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)
CAMSE        1.10 (0.08)  (0.01)    0.33 (0.28)    1.21 (0.14)  (0.02)    0.30 (0.25)
ARSC         1.10 (0.18)  (0.03)    0.24 (0.24)    1.21 (0.30)  (0.09)    0.26 (0.24)
FCV          1.10 (0.05)  (0.00)    0.51 (0.21)    1.20 (0.06)  (0.00)    0.51 (0.10)
PICV         1.10 (0.06)  (0.00)    0.53 (0.17)    1.20 (0.09)  (0.01)    0.40 (0.11)
true value   1.10         0.00      ∞              1.20         0.00      ∞
Table 2: Summary of 100 forecasts (m̂'s) and estimates of the optimal bandwidth by local linear regression for m = m2, with sample sizes n = 50 and 100. Each ĥ entry is the median [M] of the 100 bandwidth estimates, with their standard deviation in parentheses. In the "true value" row, the minimum of AMSE is displayed in the (m.s.e.) column and h_opt in the ĥ column.

m = m2, n = 50
                        Δn = 0.1                          Δn = 0.2
method       m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)    m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)
CAMSE        1.16 (0.24)  (0.06)    0.24 (0.21)    1.33 (0.43)  (0.20)    0.22 (0.20)
ARSC         1.17 (0.23)  (0.05)    0.22 (0.22)    1.37 (0.40)  (0.16)    0.22 (0.21)
FCV          1.16 (0.09)  (0.01)    0.29 (0.12)    1.33 (0.11)  (0.02)    0.29 (0.09)
PICV         1.17 (0.13)  (0.02)    0.35 (0.15)    1.32 (0.16)  (0.04)    0.31 (0.11)
true value   1.21         0.01      0.34           1.44         0.02      0.32

m = m2, n = 100
                        Δn = 0.1                          Δn = 0.2
method       m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)    m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)
CAMSE        1.16 (0.12)  (0.02)    0.24 (0.26)    1.34 (0.20)  (0.05)    0.23 (0.22)
ARSC         1.19 (0.18)  (0.03)    0.22 (0.19)    1.38 (0.33)  (0.11)    0.20 (0.20)
FCV          1.18 (0.08)  (0.01)    0.26 (0.09)    1.36 (0.11)  (0.02)    0.24 (0.08)
PICV         1.20 (0.10)  (0.01)    0.30 (0.13)    1.35 (0.14)  (0.03)    0.26 (0.09)
true value   1.21         0.01      0.30           1.44         0.02      0.27
Table 3: Summary of 100 forecasts (m̂'s) and estimates of the optimal bandwidth by local linear regression for m = m3, with sample sizes n = 50 and 100. Each ĥ entry is the median [M] of the 100 bandwidth estimates, with their standard deviation in parentheses. In the "true value" row, the minimum of AMSE is displayed in the (m.s.e.) column and h_opt in the ĥ column.

m = m3, n = 50
                        Δn = 0.1                          Δn = 0.2
method       m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)    m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)
CAMSE        1.21 (0.28)  (0.08)    0.24 (0.21)    1.45 (0.51)  (0.28)    0.23 (0.19)
ARSC         1.23 (0.28)  (0.08)    0.18 (0.19)    1.44 (0.41)  (0.19)    0.20 (0.20)
FCV          1.19 (0.10)  (0.02)    0.26 (0.09)    1.37 (0.12)  (0.06)    0.25 (0.10)
PICV         1.20 (0.11)  (0.02)    0.29 (0.11)    1.39 (0.16)  (0.06)    0.26 (0.09)
true value   1.27         0.02      0.26           1.58         0.05      0.24

m = m3, n = 100
                        Δn = 0.1                          Δn = 0.2
method       m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)    m̂ (s.d.)     (m.s.e.)  ĥ[M] (s.d.)
CAMSE        1.20 (0.20)  (0.04)    0.24 (0.26)    1.43 (0.33)  (0.13)    0.22 (0.22)
ARSC         1.20 (0.24)  (0.06)    0.21 (0.22)    1.41 (0.33)  (0.14)    0.20 (0.24)
FCV          1.19 (0.09)  (0.01)    0.21 (0.08)    1.42 (0.15)  (0.05)    0.21 (0.09)
PICV         1.21 (0.11)  (0.02)    0.21 (0.08)    1.41 (0.17)  (0.06)    0.19 (0.07)
true value   1.27         0.01      0.22           1.58         0.04      0.21