International Journal of Forecasting 22 (2006) 239 – 247 www.elsevier.com/locate/ijforecast
Exponential smoothing model selection for forecasting

Baki Billah a, Maxwell L. King b, Ralph D. Snyder b,*, Anne B. Koehler c,1

a Department of Epidemiology and Preventive Medicine, Monash University, VIC 3800, Australia
b Department of Econometrics and Business Statistics, Monash University, VIC 3800, Australia
c Department of Decision Sciences and Management Information Systems, Miami University, Oxford, OH 45056, USA
Abstract

Applications of exponential smoothing to forecasting time series usually rely on three basic methods: simple exponential smoothing, trend corrected exponential smoothing, and a seasonal variation thereof. A common approach to selecting the method appropriate to a particular time series is based on prediction validation on a withheld part of the sample, using criteria such as the mean absolute percentage error. A second approach is to rely on the most general of the three methods: for annual series this is trend corrected exponential smoothing; for sub-annual series it is the seasonal adaptation of trend corrected exponential smoothing. The rationale for this approach is that a general method automatically collapses to its nested counterparts when the pertinent conditions are present in the data. A third approach may be based on an information criterion when maximum likelihood methods are used in conjunction with exponential smoothing to estimate the smoothing parameters. In this paper, these approaches to selecting the appropriate forecasting method are compared in a simulation study. They are also compared on real time series from the M3 forecasting competition. The results indicate that the information criterion approaches provide the best basis for automated method selection, with the Akaike information criterion having a slight edge over its information criterion counterparts.

© 2005 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Model selection; Exponential smoothing; Information criteria; Prediction; Forecast validation
* Corresponding author. Fax: +61 613 9905 5474.
E-mail addresses: [email protected] (B. Billah), [email protected] (M.L. King), [email protected] (R.D. Snyder), [email protected] (A.B. Koehler).
1 Fax: +1 513 529 9689.

0169-2070/$ - see front matter © 2005 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.ijforecast.2005.08.002

1. Introduction

The exponential smoothing methods are relatively simple but robust approaches to forecasting. They are widely used in business for forecasting demand for inventories (Gardner, 1985). They have also performed surprisingly well in forecasting competitions against more sophisticated approaches (Makridakis et al., 1982; Makridakis & Hibon, 2000). Three basic variations of exponential smoothing are commonly used: simple exponential smoothing (Brown, 1959); trend-corrected exponential smoothing (Holt, 1957); and Holt–Winters' method (Winters, 1960). A distinctive feature of these approaches is that (a) time series are assumed to be built from unobserved components such as level, growth and seasonal effects; and (b) these components need to be adapted over time when demand series display the effects of
structural changes in product markets. As these components may be combined by addition or multiplication operators, 24 variations of the exponential smoothing methods may be identified (Hyndman, Koehler, Snyder, & Grose, 2002). Given this proliferation of options, an automated approach to method selection becomes most desirable (Gardner, 1985; McKenzie, 1985). Hyndman et al. (2002) provided a statistical framework for exponential smoothing based on the earlier work of Ord, Koehler, and Snyder (1997). The framework incorporated stochastic models underlying the various forms of exponential smoothing and enabled the calculation of maximum likelihood estimates of smoothing parameters. It also enabled the use of Akaike’s information criterion (Akaike, 1973) for method selection. One issue not addressed was the preference for Akaike’s information criterion over possible alternatives such as Schwarz (1978), Hannan and Quinn (1979), Mallows (1964), Golub, Heath, and Wahba (1979), and Akaike (1970). One aim, therefore, is to determine whether Akaike’s information criterion (AIC) has a superior performance compared to its alternatives. Given that it was developed to minimise the forecast mean squared error, it might be hypothesised that the AIC has a natural advantage over the alternatives in forecasting applications, except possibly for Akaike’s FPE which is asymptotically equivalent to the AIC. The exponential smoothing methods were traditionally implemented without reference to a statistical framework so that other approaches were devised to resolve the method selection problem. Prediction validation (Makridakis, Wheelwright, & Hyndman, 1998) is one such approach. The sample is divided into two parts: the fitting sample and the validation sample. The fitting sample is used to find sensible values for the smoothing parameters, often with a sum of squared one-step ahead prediction error criterion. 
The validation sample is used to evaluate the forecasting capacity of a method with a criterion such as the mean absolute percentage error (MAPE). Another approach applies a general version of exponential smoothing on the assumption that it effectively reduces to an appropriate nested method when this is warranted by the data. Trend corrected exponential smoothing is applied to annual time series; Winters' method is applied to sub-annual time series. A second
aim is to gauge the effectiveness of these traditional approaches relative to the information criterion approach to method selection. The plan of this paper is as follows. State space models for exponential smoothing and an approach to their estimation are introduced in Section 2. Criteria to be used in model selection and a measure for comparing resulting forecast errors are explained in Section 3. A simulation study is discussed in Section 4. An application of the model selection criteria to the M3 competition data (Makridakis & Hibon, 2000) is given in Section 5. The paper ends with some concluding remarks in Section 6.
2. State space models

The state space framework in Snyder (1985), and its extension in Ord et al. (1997), provides the basis of an efficient method of likelihood evaluation, a sound mechanism for generating prediction distributions, and the possibility of model selection with information criteria. Important special cases, known as structural models, that capture common features of time series such as trend and seasonal effects, provide the foundations for simple exponential smoothing, trend corrected exponential smoothing and Holt–Winters' seasonal exponential smoothing. Of the 24 versions of exponential smoothing found in Hyndman et al. (2002), the scope of this study is limited to three linear cases. The focus is on a time series that is governed by the innovations model (Snyder, 1985):

    y_t = h′x_{t−1} + e_t        (2.1)

    x_t = F x_{t−1} + a e_t      (2.2)
Eq. (2.1), called the measurement equation, relates an observable time series value y_t in typical period t to a random k-vector x_{t−1} of unobservable components from the previous period. h is a fixed k-vector, while the e_t, the so-called innovations, are independent and normally distributed random variables with mean zero and a common variance σ². The inter-temporal dependencies in the time series are defined in terms of the unobservable components with the so-called transition equation (2.2). F is a fixed k × k 'transition' matrix and a is a k-vector of smoothing parameters.
The following special cases, termed structural models, provide the statistical underpinning for common forms of exponential smoothing. The models most naturally relate to the error correction versions of exponential smoothing (Gardner, 1985), but also underpin the more traditional and equivalent 'weighted average' versions of the method.

- Local Level Model (LLM): y_t = S_{t−1} + e_t, where S_t is a local level governed by the recurrence relationship S_t = S_{t−1} + αe_t with 0 ≤ α ≤ 1. It underpins the simple exponential smoothing method (Brown, 1959).
- Local Trend Model (LTM): y_t = S_{t−1} + b_{t−1} + e_t, where b_t is a local growth rate. The local level and local growth rate are governed by the equations S_t = S_{t−1} + b_{t−1} + αe_t and b_t = b_{t−1} + βe_t, respectively, where 0 ≤ α ≤ 1 and 0 ≤ β ≤ α. Note that a′ = [α β]. This model underpins trend corrected exponential smoothing (Holt, 1957).
- Additive Seasonal Model (ASM): y_t = S_{t−1} + b_{t−1} + s_{t−m} + e_t, where s_t is the local seasonal component. The local level, growth and seasonal components are governed by S_t = S_{t−1} + b_{t−1} + αe_t, b_t = b_{t−1} + βe_t and s_t = s_{t−m} + γe_t, respectively, where m is the number of seasons in a year, 0 ≤ α ≤ 1, 0 ≤ β ≤ α and 0 ≤ γ ≤ 1 − α. In this case x′_t = [S_t b_t s_t … s_{t−m+1}] and a′ = [α β γ 0 … 0]. This model is the basis of Holt–Winters' additive method (Winters, 1960).

Traditionally, the smoothing parameters α, β and γ were set to fixed values determined subjectively by users on the basis of personal experience. The studies of Chatfield (1978) and Bartolomei and Sweet (1989) show that this can be problematic and that the parameters are best estimated from the data. Ord et al. (1997) recommend that estimates of the parameters be obtained by maximizing the conditional log-likelihood.
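To make these recurrences concrete, the sketch below simulates a series from the local trend model. The function and its interface are illustrative, not from the paper; the LLM is the special case β = 0 with zero seed growth, and the ASM adds the seasonal recursion in the same fashion.

```python
import numpy as np

def simulate_ltm(n, alpha, beta, level0, growth0, sigma, seed=0):
    """Simulate the local trend model (LTM):
        y_t = S_{t-1} + b_{t-1} + e_t,
        S_t = S_{t-1} + b_{t-1} + alpha * e_t,
        b_t = b_{t-1} + beta * e_t,   e_t ~ N(0, sigma^2).
    """
    rng = np.random.default_rng(seed)
    level, growth = level0, growth0
    y = np.empty(n)
    for t in range(n):
        e = rng.normal(0.0, sigma)
        y[t] = level + growth + e            # measurement equation
        level = level + growth + alpha * e   # update local level
        growth = growth + beta * e           # update local growth
    return y
```

Setting sigma = 0 recovers the deterministic trend S_0 + t·b_0, which makes a handy sanity check.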
For the class of linear state space models (2.1) and (2.2), the conditional likelihood function based on a sample of size n is given by

    log L = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σ_{t=1}^{n} e_t²    (2.3)

where the errors are calculated recursively with a general linear form of exponential smoothing defined by the relationships e_t = y_t − h′x_{t−1} and x_t = F x_{t−1} + a e_t for t = 1, 2, …, n.
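The conditional likelihood (2.3) can be evaluated directly from this error recursion. The sketch below (function name illustrative) works for any linear model specified by (h, F, a); it concentrates σ² out at its maximum-likelihood value, a standard step not spelled out in the text.

```python
import numpy as np

def conditional_loglik(y, h, F, a, x0):
    """Evaluate the conditional log-likelihood (2.3) for the linear
    innovations model y_t = h'x_{t-1} + e_t, x_t = F x_{t-1} + a e_t,
    with sigma^2 replaced by its MLE (the mean squared error)."""
    x = np.asarray(x0, dtype=float)
    errors = []
    for yt in y:
        e = yt - h @ x       # one-step prediction error
        errors.append(e)
        x = F @ x + a * e    # transition equation update
    e = np.array(errors)
    n = len(e)
    sigma2 = np.mean(e ** 2)  # concentrated (ML) innovation variance
    return -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2) + 1.0)
```

Maximizing this function over the smoothing parameters (e.g. with a numerical optimizer) yields the estimates used throughout the paper.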
This conditional likelihood is not only a function of a but also of the unobserved random vector x_0. Following a tradition from exponential smoothing, the seed state vector x_0 is approximated with plausible heuristics such as the following:

- Local level model: the seed level estimate equals the average of the first three observations in the sample.
- Local trend model: fit a linear trend line to the first five observations; the seed level is set equal to its intercept and the seed rate to its slope.
- Local seasonal model: fit a linear trend line with seasonal dummy variables to the first two years of data; the seed level and rate are chosen as before, and the seasonal factors are set equal to the coefficients of the seasonal dummies after normalisation to sum to zero.
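The first two heuristics translate into a few lines of code; the helper names below are illustrative, and the trend seed uses an ordinary least squares line fitted to the first five observations.

```python
import numpy as np

def seed_local_level(y):
    """Seed state for the LLM: average of the first three observations."""
    return np.array([np.mean(y[:3])])

def seed_local_trend(y):
    """Seed state for the LTM: intercept and slope of a straight line
    fitted by least squares to the first five observations."""
    t = np.arange(1.0, 6.0)
    slope, intercept = np.polyfit(t, np.asarray(y[:5], dtype=float), 1)
    return np.array([intercept, slope])  # x0 = [S_0, b_0]
```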
The resulting heuristic estimates x̂_0 of x_0 are entered into the conditional likelihood function, so that the latter is then maximized only with respect to the parameter vector a. The exact likelihood function could potentially be obtained by integrating x_0 out of (2.3). The conditional likelihood is used instead because the state variables in the models are generated by non-stationary processes, so that the seed state vector has an improper unconditional distribution. The situation is similar to one considered in Bartlett's paradox (Bartlett, 1957) from Bayesian statistics. Exact likelihood values for models with different state dimensions are non-comparable, and information criteria based on them are therefore also non-comparable. This is not an issue for the conditional likelihood, because the use of an improper unconditional distribution of the seed state vector is avoided. It is for this reason that the conditional likelihood is used instead of the exact likelihood for estimating the parameter vector a.

On obtaining the estimates x̂_0 and â, the exponential smoothing algorithm is used to obtain the corresponding estimate x̂_n of the state vector at the end of the sample. Point forecasts are then generated recursively with the equations ŷ_n(j) = h′x̂_{n+j−1} and x̂_{n+j} = F x̂_{n+j−1} for j = 1, 2, …, r, where r is the prediction horizon.
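The forecasting recursion translates directly into code; a minimal sketch in the same notation (helper name illustrative):

```python
import numpy as np

def point_forecasts(xn, h, F, r):
    """r-step-ahead point forecasts from the final state estimate xn:
    yhat_n(j) = h' x_{n+j-1},  x_{n+j} = F x_{n+j-1}
    (future innovations are set to their zero mean)."""
    x = np.asarray(xn, dtype=float)
    forecasts = []
    for _ in range(r):
        forecasts.append(h @ x)  # conditional mean forecast
        x = F @ x                # advance the state, zero innovation
    return np.array(forecasts)
```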
3. Model selection approaches and a measure for comparing them

An information criterion has the general form log L(â) − p(n, q), where p(n, q) is the so-called penalty function and q is the number of free parameters. Various forms of the penalty have been suggested, as may be seen from Table 1. Note that q* is the number of free parameters in the smallest model that nests all models under consideration, and that c = n − q*. No clear theory exists for deciding which of these information criteria is best suited to choosing the appropriate method of exponential smoothing. Thus, a simulation study was undertaken to compare them. The simulation also included a comparison with two other approaches to model selection: the prediction validation approach (Val), which selects the model with the smallest mean absolute percentage error (MAPE) for forecasting withheld data, and the encompassing model approach (Enc), which always selects the LTM for annual data and the ASM for quarterly and monthly data. The performance of each approach was gauged in terms of the median absolute prediction error as a percentage of the standard deviation, given by

    MdAPES = median( 100 |y_{n+j} − ŷ_n(j)| / √( Σ_{t=1}^{n} (y_t − ȳ)² / (n − 1) ) )

where the median is taken over all absolute prediction errors in the evaluation samples.
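As a concrete sketch (helper name illustrative), the APES for a single series, and their median, can be computed as follows:

```python
import numpy as np

def mdapes(actuals, forecasts, fitting_sample):
    """Median absolute prediction error as a percentage of the
    fitting sample's standard deviation (MdAPES)."""
    y = np.asarray(fitting_sample, dtype=float)
    sd = np.std(y, ddof=1)  # sqrt(sum((y_t - ybar)^2) / (n - 1))
    apes = 100.0 * np.abs(np.asarray(actuals, dtype=float)
                          - np.asarray(forecasts, dtype=float)) / sd
    return np.median(apes)
```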
This measure was chosen for several reasons. It is a unit-free measure that permits comparisons between real time series measured in different units. It avoids the problems encountered with the MAPE (and its variations) for series values close to zero. Most importantly, it gives a fair comparison when applied to time series with different standard deviations: forecast errors tend to be larger for time series with large standard deviations, but the APES does not produce larger values just because there is more variability in the series. More variable time series can therefore still have APES values near the median and play a central role in the evaluation process. In the comparisons, both simulated data and real data with different amounts of variability are included. The median was used instead of the average to eliminate the distorting effect of outlier APES.

Table 1. Alternative penalty functions

Criterion   Penalty function                        Source
AIC         q                                       Akaike (1973)
BIC         q log(n) / 2                            Schwarz (1978)
HQ          q log(log(n))                           Hannan and Quinn (1979)
MCp         n log(1 + 2q/c) / 2                     Mallows (1964)
GCV         −n log(1 − q/n)                         Golub et al. (1979)
FPE         (n log(n + q) − n log(n − q)) / 2       Akaike (1970)

q is the number of free parameters; q* is the number of free parameters in the smallest model nesting all other models; c = n − q*.
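The penalties in Table 1 translate directly into code. The sketch below is illustrative; it assumes the largest candidate model is the one nesting all others, so that q* = max q.

```python
import numpy as np

def penalty(criterion, n, q, q_star):
    """Penalty functions p(n, q) from Table 1; a model is chosen by
    maximizing logL - p(n, q). Here c = n - q*."""
    c = n - q_star
    return {
        "AIC": q,
        "BIC": q * np.log(n) / 2,
        "HQ":  q * np.log(np.log(n)),
        "MCp": n * np.log(1 + 2 * q / c) / 2,
        "GCV": -n * np.log(1 - q / n),
        "FPE": (n * np.log(n + q) - n * np.log(n - q)) / 2,
    }[criterion]

def select_model(loglikelihoods, qs, n, criterion):
    """Index of the model with the largest penalized log-likelihood."""
    q_star = max(qs)  # assumes the largest model nests the others
    scores = [ll - penalty(criterion, n, q, q_star)
              for ll, q in zip(loglikelihoods, qs)]
    return int(np.argmax(scores))
```

Note that for moderate n the FPE penalty is already close to the AIC's, consistent with their asymptotic equivalence mentioned in the Introduction.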
4. Simulation study

A simulation study was conducted to determine whether any of the approaches to model selection displayed a superior performance, measured in terms of forecast accuracy rather than the more usual criterion of the proportion of correctly selected models. It consisted of many experiments carried out under a wide variety of conditions. Depending on the type of data, the time series were generated by the three models: the local level model (LLM), the local trend model (LTM) and the additive seasonal model (ASM). For the ASM, the seed seasonal components were generated from the equation s_{j−m} = A sin(2jπ/m), j = 1, 2, …, m, where A is the seasonal amplitude and m is the number of seasons in a year. Candidate values for the various factors in these models were:

σ = 10, 20 (for all models)
n = 24, 40 (for LLM and LTM)
n = 24, 60 (for quarterly, m = 4, ASM)
n = 48, 96 (for monthly, m = 12, ASM)
α = 0.1, 0.5, 0.9; S_0 = 100 (for LLM)
(α, β) = (0.1, 0.05), (0.4, 0.1), (0.7, 0.1); (S_0, b_0) = (100, 1), (100, 5) (for LTM)
(α, β, γ) = (0.1, 0.05, 0.1), (0.4, 0.1, 0.2), (0.7, 0.1, 0.3); (S_0, b_0) = (100, 1), (100, 5); A = 0, 25, 50 (for ASM)

The various combinations of these factors lead to 180 scenarios for the simulation study. The 180 scenarios were repeated 10 times (i.e., 10 trials), so that the study consisted of 1800 simulation experiments. Each experiment consisted of the following four steps:

1. Generate a time series from a specified model, consisting of (a) a tuning sample of a specified size n, and (b) an evaluation sample for r succeeding periods: r = 6 for annual data, r = 8 for quarterly data and r = 18 for monthly data.
2. Fit a collection of models to the tuning sample using the conditional likelihood function: the LLM and LTM for annual data, and additionally the ASM for quarterly and monthly data.
3. Select the best model by one of the model selection approaches:
   (a) For the six information criteria approaches, choose the model that is best according to each information criterion.
   (b) For the prediction validation approach (Val): (i) withhold the last r periods of the tuning sample and fit the LLM, LTM and ASM to the first n − r values by maximizing the conditional likelihood function; (ii) choose the model with the smallest MAPE for the forecasts of the r periods of withheld data.
   (c) For the encompassing approach (Enc): (i) choose the local trend model for annual data; (ii) choose the additive seasonal model for quarterly and monthly data.
4. Using the estimates from Step 2 for the model chosen in Step 3: (i) generate predictions for each of the r periods in the evaluation sample;
   (ii) calculate the absolute prediction error as a percentage of the standard deviation of the tuning sample (APES) for each of the time periods in the evaluation sample.

Overall, 20,880 APES were calculated for each model selection method. As their relative frequency distribution displayed a strong positive skew, their central tendency was summarised by the median. The effect of sampling error on the median was gauged with a bootstrap study. The 20,880 APES were treated as a surrogate population, and 500 samples of size 20,880 were randomly selected with replacement. The distribution of the medians of these 500 samples is summarised by its median and a 90% confidence interval, shown for each model selection method in Fig. 1a. It may be observed that:

(a) The confidence intervals are quite tight, so the effect of sampling error has largely been eliminated by the use of the large surrogate population, thereby reducing room for ambiguity in the comparison of the various model selection methods.
(b) There are no marked differences between the various information criteria.
(c) Information criteria are better for model selection than the prediction validation approach.
(d) The encompassing approach, to the surprise of the authors, is just as effective as the information criteria.

Observation (c) is important because the prediction validation approach is often recommended and used in practice (Makridakis et al., 1998). The most likely reason for its inferior performance is that it does not rely on the entire sample for the fitting phase, so its parameter estimates lack the statistical efficiency of those from the other approaches.

The effects of sample size, collation period and prediction horizon are shown in Fig. 1b, c and d. To interpret these results, it should be understood that:

a. a small sample is n = 24 for annual and quarterly series, and n = 48 for monthly series;
b. a large sample is n = 40 for annual series, n = 60 for quarterly series and n = 96 for monthly series;
c. the short run consists of future periods 1–3 for annual series, 1–4 for quarterly series and 1–9 for monthly series;
d. The long run comprises future periods 4–6 for annual data, 5–8 for quarterly data and 10–18 for monthly data.

Fig. 1. Simulation results (median APES and 90% confidence intervals for the criteria given in Table 1). [Four panels: (a) All; (b) Sample Size; (c) Collation Period; (d) Prediction Horizon. Vertical axis: MdAPES.]
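The bootstrap procedure described above can be sketched as follows; the function name and defaults are illustrative.

```python
import numpy as np

def bootstrap_median_ci(apes, n_boot=500, level=0.90, seed=0):
    """Bootstrap the sampling distribution of the median APES:
    resample the surrogate population with replacement, take the
    median of each resample, and report the median of medians with
    a percentile confidence interval."""
    rng = np.random.default_rng(seed)
    apes = np.asarray(apes, dtype=float)
    medians = np.array([
        np.median(rng.choice(apes, size=apes.size, replace=True))
        for _ in range(n_boot)
    ])
    tail = (1.0 - level) / 2
    return (np.quantile(medians, tail),
            np.median(medians),
            np.quantile(medians, 1.0 - tail))
```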
Table 2. Surrogate population sizes for the bootstrap

                      Simulation    M3 data
Collation period
  Annual              2160          3870
  Quarterly           5760          6048
  Monthly             12,960        25,704
Sample size
  Small               10,440        19,472
  Large               10,440        16,150
Prediction horizon
  Short run           10,440        17,811
  Long run            10,440        17,811
Partitioning the original sample of 20,880 APES according to these factors results in many smaller surrogate populations, their sizes being those shown in the second column of Table 2. The bootstrap was applied as described above to these smaller populations to again obtain the median and a 90% confidence interval for the median APES. The results obtained for sample size and prediction horizon were as expected: larger sample sizes lead to smaller APES, and there is more volatility in long run predictions. A key point, however, is that the basic conclusions above remain unchanged in these cases. The poorer performance of the prediction validation method does not change with sample size or prediction horizon. The results for the collation period tell a slightly different story. They should be highly correlated with the sample size results, because the monthly time series are longer than the quarterly time series, which in turn are longer than the annual series. Interestingly, in the case of the annual series, the median for prediction validation is lower than the medians of the other approaches, but the wider confidence intervals suggest that this difference is not statistically significant.
5. Application to the M3 competition data

The use of simulated data raises the criticism that, in real life, the true model is unknown. Furthermore, real series are rarely as well behaved as simulated series, even when random errors and outliers are included in the simulations. In this section, we investigate how forecasting performance on real data is affected by the eight approaches to model selection. The eight approaches were applied to the M3 competition data (Makridakis & Hibon, 2000) to see whether the results for simulated data carry through to real data.

It was necessary to remove time series that were too short. Each time series had a tuning sample of a specified size n and an additional evaluation sample of size r, where again r = 6 for annual data, r = 8 for quarterly data and r = 18 for monthly data. For the prediction validation approach, models had to be fitted to n − r observations, so n must exceed r. Moreover, to obtain plausible results with a satisfactory level of statistical reliability, the study was restricted to series with n ≥ 20 for annual data, n ≥ 28 for quarterly data and n ≥ 72 for monthly data. After culling series that were too short by this definition, 1452 of the 2829 time series in the M3 data remained for the comparative study.

The procedures described in Steps 2 and 3 of Section 4 for the simulation study were applied to the 1452 time series from the M3 competition data. The results involved 35,622 APES for each method. The bootstrapped sampling distribution of the medians is summarised in Fig. 2a. The results are similar to those from the simulation study, except that the encompassing approach is no longer competitive with the information criteria. When examined by sample size, collation period and prediction horizon, as shown in Fig. 2b, c and d, the results are again similar to those from the simulation study. Now, however, the quarterly time series have the lowest MdAPES. Moreover, although the validation method appears to be better than its competitors for this particular case, the difference is barely significant at the 90% level. The distinction between small and large samples in this study depends on the collation period: a small sample is one whose length does not exceed the median length of all samples with the same collation period.

Fig. 2. M3 data results (median APES and 90% confidence intervals for the criteria given in Table 1). [Four panels: (a) All; (b) Sample Size; (c) Collation Period; (d) Prediction Horizon. Vertical axis: MdAPES.]
6. Conclusions In this paper various approaches to model selection were compared using time series simulated from statistical models underlying exponential smoothing. They were also evaluated using a subset of the time series from the M3 competition database. Their relative performances were judged in terms of the prediction capacities of their selected models. Results from the simulated and real time series data proved to be remarkably consistent. They indicated that there is little to distinguish the various information criteria. The most important finding was that the information criteria approaches appear to be superior to the commonly used prediction validation approach. Moreover, there was some evidence, using the real data, that suggested that the information criteria approaches are better than the encompassing approach.
Overall, these studies indicate that the best practice would be to adopt an information criteria approach for choosing between the various common exponential smoothing methods. They indicate that the AIC probably has a slight edge over its counterparts. As expected, however, the FPE had a similar performance to the AIC, reflecting the asymptotic equivalence of these criteria.
References

Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22, 203–217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.
Bartlett, M. S. (1957). A comment on D. V. Lindley's statistical paradox. Biometrika, 44, 533–534.
Bartolomei, S. M., & Sweet, A. L. (1989). A note on a comparison of exponential smoothing methods for forecasting seasonal series. International Journal of Forecasting, 5, 111–116.
Brown, R. G. (1959). Statistical forecasting for inventory control. New York: McGraw-Hill.
Chatfield, C. (1978). The Holt–Winters forecasting procedure. Applied Statistics, 27, 264–279.
Gardner, E. S. (1985). Exponential smoothing: The state of the art. Journal of Forecasting, 4, 1–28.
Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215–223.
Hannan, E. J., & Quinn, B. G. (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B, 41, 190–195.
Holt, C. C. (1957). Forecasting trends and seasonals by exponentially weighted averages. ONR Memorandum No. 52, Carnegie Institute of Technology, Pittsburgh, USA (published in International Journal of Forecasting, 2004, 20, 5–13).
Hyndman, R. J., Koehler, A. B., Snyder, R. D., & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18, 439–454.
Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., et al. (1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting, 1, 111–153.
Makridakis, S., & Hibon, M. (2000). The M3-Competition: Results, conclusions and implications. International Journal of Forecasting, 16, 451–476.
Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (1998). Forecasting: Methods and applications. New York: John Wiley & Sons.
Mallows, C. L. (1964). Choosing variables in a linear regression: A graphical aid. Paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas.
McKenzie, E. (1985). Comments on 'Exponential smoothing: The state of the art' by E. S. Gardner. Journal of Forecasting, 4, 32–36.
Ord, J. K., Koehler, A. B., & Snyder, R. D. (1997). Estimation and prediction for a class of dynamic nonlinear statistical models. Journal of the American Statistical Association, 92, 1621–1629.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Snyder, R. D. (1985). Recursive estimation of dynamic linear statistical models. Journal of the Royal Statistical Society, Series B, 47, 272–276.
Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science, 6, 324–342.