Vol. 18 No. 1
Journal of Systems Science and Complexity
Jan., 2005
TIME SERIES FORECASTING WITH MULTIPLE CANDIDATE MODELS: SELECTING OR COMBINING?∗

YU Lean (Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China; School of Management, Graduate School of Chinese Academy of Sciences, Beijing 100049, China)
WANG Shouyang (Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China; Institute of Policy and Planning Sciences, University of Tsukuba, Tsukuba, Ibaraki 305–8573, Japan. Email: [email protected])
K. K. Lai (Department of Management Sciences, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong)
Y. Nakamori (School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Ishikawa 923–1292, Japan)

Abstract. Various mathematical models have been widely used in time series analysis and forecasting. In applying them, academic researchers and business practitioners often face two important problems. One is whether to select a single appropriate modeling approach for prediction or to combine different/dissimilar modeling approaches into a single forecast. The other is whether to select the best candidate model or to mix candidate models with different parameters of the same/similar modeling approach into a new forecast. In this study, we propose a set of computational procedures that resolve these two issues via two judgmental criteria. In view of problems reported in the literature, a novel modeling technique is also proposed to overcome the drawbacks of existing combined forecasting methods. To verify the efficiency and reliability of the proposed procedure and modeling technique, simulations and real data examples are conducted. The results reveal that the proposed procedure and modeling technique offer a feasible solution for time series forecasting with multiple candidate models.

Key words. Time series forecasting, model selection, stability, robustness, combining forecasts.
Received July 20, 2004. *This paper was partially supported by NSFC, CAS, RGC of Hong Kong and Ministry of Education and Technology of Japan.
1 Introduction

With the increasing advancement of computational technology, a variety of models (both linear and nonlinear) have been widely used for time series modeling and forecasting. For example, the autoregressive integrated moving average (ARIMA) model proposed by Box and Jenkins[1], a typical linear model, has proven effective in many time series forecasts. Likewise, the artificial neural network (ANN) model, a typical nonlinear model and a relatively new intelligent computational technique, has also been shown to be a very promising approach to time series modeling and forecasting. Because different time series have different properties and features, the model selected by a selection criterion (e.g., the Akaike information criterion (AIC)[2]), hypothesis testing, or graphical inspection is often mis-specified; this may cause unexpectedly high variability in the final prediction and harm forecasting accuracy and reliability. In practice, academic researchers and business practitioners therefore face two important dilemmas. One is whether to select an appropriate modeling approach for prediction purposes or to combine different individual approaches into a single forecast, for dissimilar modeling approaches. The other is whether to select the best candidate model or to mix the various candidates with different parameters into a new forecast, for the same/similar modeling approaches. In other words, for time series forecasting the focus is on whether to select a model from multiple candidates or to combine these different models into a single forecast. The two issues are highly non-trivial and have received considerable attention, with different approaches being studied. Here we briefly discuss some of these approaches that are closely related to our work.
In time series forecasting applications, the general practice is to select one appropriate model from multiple candidates by some selection method; final estimation, interpretation, and prediction are then based on the selected model. Generally speaking, there are three methods for multiple candidate model selection. The first is graphical inspection together with examination of simple summary statistics (such as autocorrelations (AC) and partial autocorrelations (PAC)), which is very useful for preliminary modeling analysis. The second is hypothesis testing, a formal technique for model selection in statistics. The third is to use a well-defined, formal model selection criterion, such as the AIC[2] or the Bayesian information criterion (BIC)[3]. However, these methods exhibit some defects. The first method is too subjective and, in general, too rough for model selection. The second approach runs into the challenging issue of multiple testing (e.g., there is no objective guideline for the choice of the size of each individual test, and it is unclear how such a choice affects forecasting accuracy). Furthermore, Breiman[4] pointed out that estimators based on model selection are unstable, and Yang[5] also considered instability a major drawback of model selection. For example, with a small or moderate number of observations, it is usually hard to distinguish models that are close to each other (their model selection criterion values are generally quite close); the choice of the model with the smallest criterion value is then unstable, in that a slight change in the data may lead to a different model being chosen. As a consequence, forecasts based on the selected model are highly variable[5,6]. Furthermore, Yang[5] pointed out that identifying the true model is not necessarily optimal for forecasting.
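As a minimal illustration of criterion-based selection (a toy sketch, not the paper's procedure), the following fits AR(p) candidates by least squares and picks the order with the smallest AIC; the AIC form used and the simulated AR(2) series are illustrative assumptions.

```python
import numpy as np

def fit_ar_aic(y, p):
    """Fit an AR(p) model by least squares and return its AIC.

    Uses AIC = n_eff * log(RSS / n_eff) + 2 * (p + 1), one common form;
    n_eff is the number of usable observations after conditioning on p lags.
    """
    n = len(y)
    # Column i holds lag (i+1): row r corresponds to time t = p + r.
    X = np.column_stack([y[p - i - 1:n - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(n - p), X])          # intercept + p lags
    target = y[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    n_eff = n - p
    return n_eff * np.log(rss / n_eff) + 2 * (p + 1)

# Select the AR order with the smallest AIC among candidates p = 1..4.
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(2, 200):                               # simulate an AR(2) process
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
best_p = min(range(1, 5), key=lambda p: fit_ar_aic(y, p))
```

As the text warns, `best_p` can flip between close candidates under small perturbations of `y`; this is exactly the instability that motivates combining.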
To reduce variability in model selection, researchers have turned to combining or mixing the different candidate models for prediction purposes, and a sizable literature on combined forecasting has been reported. In the combination of the same/similar methods, Draper[7] and George and McCulloch[8] proposed an "averaging" method to obtain a stabilized estimator. Raftery[9] suggested using a BIC approximation for Bayesian model averaging. Madigan and York[10] used a Markov chain Monte Carlo approximation to obtain a stable forecast. Breiman[11] proposed a "bagging" method to generate multiple versions of the estimator and then average them into a
stable estimator. In a similar manner, Buckland et al.[12] proposed a plausible model weighting method based on the value of a model selection criterion (e.g., AIC). Cross-validation and bootstrapping have also been used to linearly combine different models, with the intention of improving accuracy by finding the best linear combination (see [4,13,14]). Juditsky and Nemirovski[15] proposed a stochastic approximation method to combine k forecasts in the best linear combination. Yang[6] applied the adaptive regression method (ARM), mixing different candidates from the same model for regression. Likewise, Yang[5] used the aggregated forecast through exponential re-weighting (AFTER) method to combine forecasts from different individual autoregressive moving average (ARMA) models. However, these works are limited to linear combinations of the same methods with different parameters. Readers can refer to the review by Hoeting et al.[16] for more details; additional references are given in [17–37]. In the combination of different/dissimilar models, the documented research is quite diverse. Combining forecasts from dissimilar methods has been studied for over three decades, starting with the pioneering work of Reid[38,39] and Bates and Granger[40], and various methods have been involved. For work pre-dating 1989, readers can refer to Clemen[41] for a comprehensive review. Since then, with the increasing development and application of new computational technology, many artificial intelligence (AI) techniques, such as artificial neural networks (ANNs), have been introduced, and a few examples in the existing literature combine neural network forecasting models with conventional time series forecasting techniques.
For example, Wedding II and Cios[42] constructed a combination model integrating radial basis function neural networks (RBFNN) and the univariate Box-Jenkins (UBJ) model to predict three time series. Luxhoj et al.[43] presented a hybrid econometric and ANN approach for sales forecasting. Likewise, Voort et al.[44] introduced a hybrid method called KARIMA, using a Kohonen self-organizing map and the ARIMA method, to predict short-term traffic flow. Recently, Tseng et al.[45] proposed a hybrid model (SARIMABP) that combines the seasonal ARIMA (SARIMA) model and the back-propagation (BP) neural network model to predict seasonal time series data. Zhang[46] applied a hybrid methodology combining both ARIMA and ANN models for time series forecasting. For more details, readers can refer to References [17–37]. The idea of combining forecasts implicitly assumes that one cannot identify the underlying process (i.e., one cannot select an appropriate model for a specified time series), but that different forecasting models are able to capture different aspects of the information available for prediction. This is also one of the starting points for this study. As Clemen[41] concludes: "Using a combination of forecasts amounts to an admission that the forecaster is unable to build a properly specified model. To try ever more elaborate combining models seems to add insult to injury, as the more complicated combinations do not generally perform all that well." Hence, combining different candidate models has become common practice in applications to reduce the variability and uncertainty of selecting an individual model. It should be noted, however, that two main problems arise in the existing combined forecast models mentioned previously. One is the form of the combined forecasting model: in the existing literature, combined forecasting models are limited to a linear combination form.
But a linear combination approach is not necessarily appropriate for all circumstances. The other issue is the number of individual forecasting models used: how should this number be determined? It is well known that "the more, the better" does not hold in all circumstances, so it is necessary to determine the number of individual models used in the combined forecasts. With respect to the first problem (the linear combination drawback), a nonlinear combined forecasting model utilizing the ANN technique is introduced in this paper. With regard to
the second problem (determining the number of individual models used in combined forecasting), the principal component analysis (PCA) technique is introduced; we use PCA to choose the appropriate number of individual models. This study therefore proposes a novel nonlinear combined forecasting model utilizing the ANN and PCA techniques to achieve a reduction in model uncertainty and an increase in forecast accuracy. We refer to this approach as the PCA- and ANN-based nonlinear combination (PANC) forecasting model. However, the fact that we use a combined method for time series prediction does not mean that we are against the practice of model selection. Generally, identifying the true model (when it makes good sense) is an important task in understanding the relationships among time series variables. For deterministic linear time series, it is often observed that selection may outperform combining when one model is very strongly preferred, in which case there is little instability in selection. As Yang[5] claimed, in the time series context combining does not necessarily lead to prediction improvement when model selection is stable. That is to say, the use of combined forecasts does not mean that the individual forecasts should be suppressed. In fact, individual forecasts can provide evidence about the relative merits of their respective models, methods, or forecasters, as well as about the relationships among the multiple forecasts. Furthermore, for purposes of reporting and comparison, it is helpful to give the individual forecasts as well as the combined forecast, to provide an indication of the variability among the forecasts. Decision makers using the forecasts may want to see if and how their decisions change when different forecasts are used[47].
In addition, it should be pointed out that the combining approach we take is related to (although different from) formal Bayesian methods; in particular, no prior distribution is specified for the model parameters. The aims of this study are three-fold: (1) to show how to deal with the two dilemmas of selecting and combining; (2) to show how to construct a novel nonlinear combination model for time series forecasting; and (3) to compare the accuracy of various methods and models on real data examples. In view of these aims, this study first proposes a double-phase-processing procedure to treat the two dilemmas of selecting or combining, then describes the building process of the proposed nonlinear combination model, and finally presents the application of the proposed procedure and nonlinear combined approach to real data examples. The rest of the study is organized as follows. Section 2 describes in detail the procedure for dealing with the two dilemmas of selecting and combining. In Section 3, a novel nonlinear combined forecasting model is proposed to overcome the two problems of existing combined forecasts. To verify the effectiveness and efficiency of the proposed procedure and model, simulation experiments and empirical analysis of real time series are reported in Section 4. Conclusions and future research directions are given in Section 5.
2 Dilemmas of Selecting or Combining and Solutions to Them

Assume {y_1, y_2, ..., y_t} is a time series. At time t, for t > 1, we are interested in forecasting the next value y_{t+1} based on the past observations y_1, y_2, ..., y_t. Here we focus on one-step-ahead point forecasting. As mentioned previously, two dilemmas are considered in this study. For the same time series, the first is whether to select a "true" model or to combine multiple individual models for different/dissimilar methods; the other is whether to select a "true" model or to combine multiple models with different parameters for the same/similar method. That is, one faces two types of choice for a specified time series. If we do not decide on a
determined method for the time series, then multiple different/dissimilar methods are tested by trial and error, and one faces the first choice. If a method is confirmed to be appropriate for the time series, multiple versions of the determined method with different parameters are generated, and one faces the second choice. This is represented graphically in Fig. 1, from which the two dilemmas of time series forecasting can be seen clearly. In short, this is a problem of selecting or combining. We now explore the solution of the two dilemmas. Many selection methods are utilized in solving the selecting-or-combining problem for multiple candidate models. However, as mentioned earlier, a potentially large variability in the resulting forecast is common to all model selection methods. When there is substantial uncertainty in finding the best model, alternative methods such as a combination model should be considered. An interesting question is: how should we measure the model selection criteria or the resulting prediction? This is the key to solving the selecting-or-combining dilemma. Two judgmental criteria are required for dealing with the two types of dilemma. To resolve them, a double-phase-processing procedure is proposed.
[Fig. 1 Two dilemmas of time series forecasting: a time series modeled by different methods (Models A, B, ..., K) leads to Dilemma I (selection or combining), while a single method with different parameters (Models A1, A2, ..., An) leads to Dilemma II; each dilemma is resolved by its judgmental criterion (I or II) into the corresponding decision.]
In the first phase, a solution to the first dilemma is provided. For the choice of whether to select or combine different/dissimilar methods, the relative prediction performance stability (S) is introduced. The concept of "prediction performance stability" means that the time series prediction performance across different/dissimilar models is stable. In other words, for a specified time series, if the difference between the performance indicator of the best candidate model and the performance indicators of the remaining models is very large, the prediction performance for the time series is unstable and another strategy, such as a combining method, should be considered. Otherwise, the prediction performance is stable and selecting the "best" model from the different/dissimilar candidates may be a wise decision. To be concrete, assume that the different models {A, B, ..., K} are fitted based on all the observations and used as predictors. Accordingly, over the same prediction period, the prediction performance indicators (e.g., root mean squared error (RMSE)), denoted {PI_i} (i = 1, 2, ..., k), and their volatilities σ_i (i = 1, 2, ..., k) with the different evaluation
sets, are calculated. The relative prediction performance stability (S) is then calculated within the framework of the different evaluation sets:

S_i = \frac{\big(\frac{1}{m}\sum_{j=1}^{m} PI_{ij}\big)/\sigma_i}{\min_i\{\overline{PI}_i\}/\sigma_{PI_i}}, \quad i = 1, 2, \cdots, k, \qquad (1)

where \overline{PI}_i = \frac{1}{m}\sum_{j=1}^{m} PI_{ij}.
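The stability measure of Equation (1) and the subsequent threshold decision can be sketched as follows; the performance-indicator values and the threshold `S_theta` are hypothetical illustrations, not from the paper's experiments.

```python
import numpy as np

def stability(pi):
    """Relative prediction performance stability, Equation (1).

    pi: (k, m) array; pi[i, j] is the performance indicator (e.g. RMSE)
    of candidate model i on evaluation set j.
    Returns S of shape (k,): each model's mean-PI/volatility ratio,
    relative to that of the best (smallest mean-PI) model.
    """
    mean_pi = pi.mean(axis=1)           # \bar{PI}_i
    sigma = pi.std(axis=1, ddof=1)      # volatility over evaluation sets
    best = np.argmin(mean_pi)           # model attaining min_i \bar{PI}_i
    return (mean_pi / sigma) / (mean_pi[best] / sigma[best])

# Three candidate models evaluated on four evaluation sets (made-up RMSEs);
# the third model is erratic, so its volatility drags its S value down.
pi = np.array([[0.90, 1.00, 0.95, 1.05],
               [1.10, 1.20, 1.15, 1.25],
               [2.50, 0.40, 3.00, 0.60]])
S = stability(pi)
S_theta = 0.5                            # assumed threshold
decision = "selecting" if np.all(S >= S_theta) else "combining"
```

By construction the best model gets S = 1; the decision rule compares every S_i against the threshold, mirroring Equation (2).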
By comparing S_i with S_θ, we can judge whether to select a "true" model or to combine multiple different/dissimilar models, i.e.,

\text{Decision} = \begin{cases} \text{Selecting}, & \text{if } S_i \ge S_\theta, \\ \text{Combining}, & \text{if } S_i < S_\theta, \end{cases} \quad i = 1, 2, \cdots, k. \qquad (2)

From Equation (2), if all the S_i are larger than the threshold value S_θ, the model with the smallest value of PI is selected; otherwise, the combining strategy is recommended. In other words, once the value of S_θ is determined, we can make the corresponding decision for the first dilemma. It should be noted that determining a rational value of S_θ is hard. Furthermore, once S_θ is determined, a mixed situation may arise in which some values of S_i are larger than S_θ and others smaller; how to handle this situation remains an open problem.

When the selecting decision is made in the first phase, the second dilemma arises. In the second phase, we use the "robustness" judgmental criterion. When the robustness of the model selected by some model selection criterion is strong (i.e., when one model is strongly preferred in terms of the corresponding selection criterion), that model should be selected, rather than combining the multiple candidate models with different parameters of the same method. On the other hand, when the robustness of the selected model is weak, model combining should be considered in order to obtain good prediction performance. The definition of model robustness is a central issue here. Model robustness can be defined as follows: the selected model should be robust in the sense that it is indifferent to a radical change in a small portion of the data or a slight change in all of the data[48]. Simply speaking, if a model is robust, a minor change in the data should not change the outcome dramatically. The idea behind measuring model robustness is to examine its consistency across different data sizes.

Suppose that the model \hat{M}_n is selected by the model selection criterion based on the observations {y_i} (i = 1, 2, ..., n). Let k be an integer between 1 and n−1. For each j in {n−k, n−k+1, ..., n−1}, apply the model selection method to the data {y_i} (i = 1, 2, ..., j) and let \hat{M}_j denote the selected model. Then let R (the robustness metric) be the fraction of truncated series for which the same model \hat{M}_n is selected, i.e.,

R = \frac{1}{k}\sum_{j=n-k}^{n-1} I\{\hat{M}_j = \hat{M}_n\}, \qquad (3)

where I{·} denotes the indicator function: I\{\hat{M}_j = \hat{M}_n\} = 1 if \hat{M}_j = \hat{M}_n, and 0 otherwise. The rationale behind R is clear: removing a few observations should not cause much change for a stable model. In addition, the integer k should be chosen appropriately. On one hand, we want a small k so that the selection problems
for j in {n−k, n−k+1, ..., n−1} are similar to the real problem with the full set of observations. On the other hand, we want k not to be too small, so that R is reasonably stable[6]. Once a rational threshold value R_θ is determined, we can decide whether to select or combine by judging the robustness of the selected model in the second dilemma, i.e.,

\text{Decision} = \begin{cases} \text{Selecting}, & \text{if } R \ge R_\theta, \\ \text{Combining}, & \text{if } R < R_\theta. \end{cases} \qquad (4)

That is, if the robustness value of the selected model is larger than the predefined threshold, the selecting strategy is preferred; otherwise the combining strategy is preferred. Although Yang[6] raised the same question, his work was limited to selection within the same method; the dilemmas we present are wider and the proposed solution is more general, so the procedure differs from previous work. To summarize, the proposed double-phase-processing procedure is executed as follows. In the first sub-procedure, we use the performance stability indicator to deal with the first dilemma (Equations (1)–(2)). In the second sub-procedure, we use the robustness metric to treat the second dilemma (Equations (3)–(4)). For a specified time series, common practice is to model the problem with multiple dissimilar methods by trial and error and then apply the judgmental criteria to make the corresponding decisions. That is, across many different/dissimilar models, if the time series model has stable performance, a selecting decision is made; otherwise the combining strategy is recommended. If the selected model is robust, we select a "best" model from the multiple candidates with different parameters; otherwise we combine the candidates into a single time series forecast. A flow chart of the proposed procedure is depicted in Fig. 2.
[Fig. 2 The double-phase-processing procedure of time series forecasting: the time series is first modeled by multiple dissimilar candidate methods; if the stability test fails, a combining strategy is used; otherwise one method is selected and multiple candidate models with different parameters are generated; if the robustness test succeeds, one model is chosen by a selection criterion, otherwise the candidates are combined for forecasting.]
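The robustness metric R of Equation (3) can be sketched as follows; the selection rule used here (picking a polynomial trend degree by BIC) and all numbers are purely illustrative stand-ins for whatever criterion-based selection is actually in use.

```python
import numpy as np

def robustness(y, select, k):
    """Robustness metric R, Equation (3).

    select(y) must return an identifier of the model chosen for series y.
    R is the fraction of truncated series y[:j], j = n-k .. n-1, for which
    the same model as on the full series is selected.
    """
    n = len(y)
    m_full = select(y)
    hits = sum(select(y[:j]) == m_full for j in range(n - k, n))
    return hits / k

def select_by_bic(y, degrees=(1, 2, 3)):
    """Toy selection rule: pick a polynomial trend degree by BIC."""
    n = len(y)
    t = np.arange(n)
    best_d, best_bic = None, np.inf
    for d in degrees:
        coef = np.polyfit(t, y, d)
        rss = np.sum((y - np.polyval(coef, t)) ** 2)
        bic = n * np.log(rss / n) + (d + 1) * np.log(n)
        if bic < best_bic:
            best_d, best_bic = d, bic
    return best_d

rng = np.random.default_rng(1)
t = np.arange(120)
y = 0.02 * t ** 2 + rng.normal(scale=5.0, size=120)   # quadratic trend + noise
R = robustness(y, select_by_bic, k=10)
R_theta = 0.8                                          # assumed threshold
decision = "selecting" if R >= R_theta else "combining"
```

The choice k = 10 reflects the trade-off discussed above: small enough that the truncated problems resemble the full one, large enough for R to be stable.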
Relying on the proposed double-phase-processing procedure, the above two types of dilemma can be solved. This gives a clear decision on whether to select an appropriate model from multiple candidates or to combine these models for a specified time series. In practical application, however, the combining strategy is often selected due to the uncertainty of time series formation. Thus one has to confront combination forecast problems. As mentioned in Section 1, several drawbacks of combination models are found in the existing literature. Subsequently, a novel nonlinear combination model is proposed in the following section to overcome the shortcomings of existing combination methods.
3 Nonlinear Combination Forecasting Model
The idea of a combined forecasting model is not new. Since the pioneering work of Reid[38,39] and Bates and Granger[40], a variety of studies have been conducted and many combined methods have been proposed. The main problem of combined forecasting can be described as follows. Suppose there are n forecasts \hat{y}_1(t), \hat{y}_2(t), ..., \hat{y}_n(t) (including similar and dissimilar forecasts). The question is how to combine these different forecasts into a single forecast \hat{y}(t), which is assumed to be more accurate. The general form of such a combined forecast is

\hat{y}(t) = \sum_{i=1}^{n} w_i \hat{y}_i(t), \qquad (5)

where w_i denotes the weight assigned to \hat{y}_i(t); in general the weights sum to one, i.e., \sum_i w_i = 1. In some situations the weights may take negative values and their sum may exceed 1[49]. A variety of methods are available to determine the weights used in combined forecasts. For comparison, three main methods are presented here. The first is the equal weights (EW) method, which uses a simple arithmetic average of the individual forecasts and is relatively easy and robust:

w_i = \frac{1}{n}, \qquad (6)

where n is the number of forecasts. The second is the minimum-variance (MV) method proposed by Dickinson[50,51], whose main idea is

(P_1) \quad \min_w \; w V w^T \quad \text{s.t. } \sum_{i=1}^{n} w_i = 1, \; w_i \ge 0, \; i = 1, 2, \cdots, n, \qquad (7)

where V is the error variance-covariance matrix. By solving the quadratic program (P_1), an optimal weight set can be obtained for the combined forecast. The third is the time-varying minimum-square-error (TVMSE) method proposed by Yu et al.[52], whose main idea is

(P_2) \quad \min_w \; \sum_{i=1}^{n} \big(w_i e_i(t)\big)^2 \quad \text{s.t. } \sum_{i=1}^{n} w_i = 1, \; w_i \ge 0, \; i = 1, 2, \cdots, n; \; t = 1, 2, \cdots, N, \qquad (8)
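A minimal sketch of the EW weights of Equation (6) and the MV weights of Equation (7): for simplicity the sketch drops the nonnegativity constraint of (P_1), which admits the closed form w = V^{-1}1 / (1'V^{-1}1); enforcing w_i ≥ 0 would require a quadratic-programming solver. The error series below are synthetic.

```python
import numpy as np

def equal_weights(n):
    """Equal-weights (EW) combination, Equation (6)."""
    return np.full(n, 1.0 / n)

def mv_weights(errors):
    """Minimum-variance (MV) weights, cf. Equation (7).

    errors: (T, n) array of forecast errors of the n individual models.
    Without the w_i >= 0 constraint, minimizing w V w^T subject to
    sum(w) = 1 gives w = V^{-1} 1 / (1' V^{-1} 1).
    """
    V = np.cov(errors, rowvar=False)   # error variance-covariance matrix
    ones = np.ones(V.shape[0])
    w = np.linalg.solve(V, ones)
    return w / w.sum()                 # normalize so the weights sum to 1

# Synthetic forecast errors of three models with different error scales.
rng = np.random.default_rng(2)
errors = rng.normal(size=(100, 3)) * np.array([1.0, 2.0, 0.5])
w_ew = equal_weights(3)
w_mv = mv_weights(errors)
combined = errors @ w_mv               # combined forecast error series
```

Because every unit weight vector is feasible, the MV solution's in-sample error variance can never exceed that of the best individual model.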
where t represents the evaluation/prediction period. As time t changes, the weights vary with t; this method is thus a new and robust method for combining forecasts. (Refer to Clemen[41] and the References for more methods.) However, there are two main problems in existing combined forecasting approaches. One is the number of individual forecasts. Theoretical proof[53] shows that the total forecasting error of combined forecasts does not necessarily decrease with an increase in the number of individual forecasting models. When the number grows, there may be redundant or repeated (even noisy) information among the individual forecasts, and this redundant information often directly impacts the effectiveness of the combined forecasting method. It is therefore necessary to eliminate individual forecasts that are not required; in other words, one of the key problems of combined forecasting is to determine a reasonable number of individual forecasts. The other problem is that the relationships among individual forecasts are modeled in a linear manner. The three methods mentioned above, and many other weighting methods, are all developed under linear assumptions. A combined forecast should merge the individual forecasts according to the natural relationships existing between them, including, but not limited to, linear relationships. Because a linear combination of existing information cannot represent the relationship between individual forecasting models in some situations, it is necessary to introduce a nonlinear combination methodology into combined forecasting. We set out to solve these two problems in the following. For the first problem, the question is how to extract the effective information that reflects the substantial characteristics of the series from all selected individual forecasts, and how to eliminate redundant information.
The principal component analysis (PCA) technique (see [54,55]) is used as an alternative tool to solve this problem. PCA, an effective feature extraction method, is widely used in signal processing, statistics, and neural computing. The basic idea of PCA is to find p linearly transformed components (s_1, s_2, ..., s_p) that explain the maximum possible amount of variance of a data vector with q dimensions. The mathematical technique used in PCA is eigen analysis. The basic goal of PCA is to reduce the dimensionality of the data, so one usually chooses p ≤ q; indeed, it can be proved that the representation given by PCA is an optimal linear dimension reduction technique in the mean-square sense[54]. Such a reduction in dimensions has important benefits. First, the computation of subsequent processing is reduced. Second, noise can be reduced and the meaningful underlying information identified. The following presents the PCA process for combined forecasting. Assuming that there are n individual forecasting models in the combined forecast and that every forecasting model produces m forecasting results, the forecasting matrix Y can be represented as
Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm} \end{pmatrix}, \qquad (9)
where y_{ij} is the jth forecast of the ith forecasting model. Next, we process the forecasting matrix with the PCA technique. First, the eigenvalues (λ_1, λ_2, ..., λ_n) and corresponding eigenvectors A = (a_1, a_2, ..., a_n) are obtained from the forecasting matrix. Then the new principal components are calculated as

Z_i = a_i^T Y, \quad i = 1, 2, \cdots, n. \qquad (10)
Subsequently, we choose m (m ≤ n) principal components from the existing n components. The retained information content is judged by

\theta = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_n}. \qquad (11)
If θ is sufficiently large (e.g., θ > 0.8), enough information has been retained by the PCA feature extraction, and re-combining the new information can further improve the prediction performance of the combined forecast. For the second problem, we propose a nonlinear combined forecasting model as a remedy. A nonlinear combined forecasting model can be viewed as a nonlinear information processing system,

y = f(I_1, I_2, \cdots, I_n), \qquad (12)

where f(·) is a nonlinear function and I_i denotes the information provided by the individual forecasting models. If each individual forecasting model provides an individual forecast \hat{y}_i, then Equation (12) can be written as

\hat{y} = f(\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n). \qquad (13)
Determining the function f(·) is quite challenging. In this study, an ANN is employed to realize the nonlinear mapping; the ability of back-propagation neural networks to represent nonlinear models has been established by previous work. All individual forecasts are used as inputs of the ANN model, and the output of the ANN model is taken as the nonlinear combined forecast. ANN training is then a search for the weights that minimize the sum of squared errors, i.e.,

\min\,\big[(y - \hat{y})(y - \hat{y})^T\big], \qquad (14)

or

\min\,\big\{[y - f(\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n)][y - f(\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n)]^T\big\}. \qquad (15)

By replacing the constrained programs of the linear combination models with this unconstrained nonlinear program solved by an ANN, a PCA- and ANN-based nonlinear combination (PANC) model is generated for nonlinear combined forecasting. Although Shi et al.[49] proposed a nonlinear combined forecasting method, they solved only the second problem of existing combined methods; the PANC model is therefore a novel method. To summarize, the proposed nonlinear combination model consists of four stages. In the first stage, we construct an original forecasting matrix from the selected individual forecasts. In the second stage, the PCA technique is applied to the original forecasting matrix and a new forecasting matrix is obtained. In the third stage, based upon the PCA results, the number of individual forecasting models is determined. In the final stage, an ANN model is developed to combine the different individual forecasts, and the corresponding forecasting results are obtained. To verify the efficiency of the proposed procedures and models, several real data examples are presented in the following section.
4 A Simulation Study
4.1 Experimental Plan, Data Description and Forecasting Evaluation Criterion

In this study, many experiments are conducted to verify the efficiency of the proposed procedure and nonlinear combined forecasting model, and two real time series are used. One is a housing sales time series, which is used to check the efficiency of the proposed procedure. The other is an exchange rate series, U.S. dollars (USD) against the British pound (GBP), which is used to test the reliability and effectiveness of both the proposed procedure and the nonlinear combined model. Four forecasting methods, the ARIMA model, the exponential smoothing (ES) model, the simple moving average (SMA) model and an artificial neural network (ANN) model with a three-layer network structure, are used in the experiments. The housing sales data are monthly and are obtained from Pankratz's book[56], covering the period from January 1965 to December 1975, a total of n = 132 observations. We take the monthly data from January 1965 to June 1973 as the in-sample (training period) data set (102 observations) and the remainder as the out-of-sample (testing period) data set (30 observations). The foreign exchange (USD/GBP) data used in this paper are monthly and are obtained from the Pacific Exchange Rate Service (http://fx.sauder.ubc.ca), provided by Professor Werner Antweiler, University of British Columbia, Vancouver, Canada. We take the monthly data from January 1971 to December 1998 as the in-sample data set (336 observations) and the data from January 1999 to December 2003 as the out-of-sample data set (60 observations), which is used to evaluate prediction performance. To save space, the original data are not listed here; the detailed data can be obtained from the website or from the authors. In order to compare prediction performance, it is necessary to introduce the forecasting evaluation criterion.
It is known that one of the most important forecasting evaluation criteria is the root mean square error (RMSE), defined as

RMSE = [ (1/N) Σ_{t=1}^{N} (y_t − ŷ_t)² ]^{1/2},  (16)
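Equation (16) translates directly into code; a minimal stdlib-only helper (the function name is ours):

```python
import math

def rmse(actual, predicted):
    """Root mean square error, Equation (16)."""
    n = len(actual)
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / n)

# A forecast that is off by 1.0 at every period has RMSE 1.0.
print(rmse([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))  # → 1.0
```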
where y_t (t = 1, 2, ..., N) are the actual values, ŷ_t (t = 1, 2, ..., N) are the predicted values, and N is the number of evaluation periods. In this study we use the RMSE to measure the forecasting performance of the different models.

4.2 Empirical Results

In this study, two classes of experiments are conducted to verify the effectiveness and efficiency of the proposed procedure and nonlinear model.

4.2.1 Experiments for the proposed procedure

In accordance with the procedure proposed in Section 2, we use the housing sales series and the exchange rate series to verify and test the proposed procedure. Several different forecasting methods are used to fit the specified time series, so we face the first dilemma (whether to select one method from the alternatives or to combine these candidates into a single forecast). If the selection strategy is preferred, then the second dilemma arises (whether to select one model from multiple candidate models with different parameters or to combine these candidate models with different parameters for prediction purposes). As mentioned in the experimental plan, the ARIMA, ES, SMA and ANN models are used to fit the time series, and an appropriate model is selected for each method. Then we use the
selected model to predict the future values of the time series. For convenience, the testing sets (or evaluation sets) are used for verification, and the performance indicators are calculated accordingly. In this experiment, two examples are presented. In view of the judgment rule of performance stability, the corresponding decision can be made. Here we assume the threshold values Sθ = 4.00 and Rθ = 0.80.

Example I: Sales forecasting

In this example, the sales forecasting performance indicators with different evaluation sets are shown in Table 1.

Table 1  The RMSE of different methods with different evaluation sets

PI_ij   RMSE (10 obs)  RMSE (20 obs)  RMSE (30 obs)  PI_i.     σ_i
ARIMA   24.3445        24.0810        24.2595        24.2283   0.1345
ES       6.6162         6.0732         6.1381         6.2758   0.2965
SMA     12.4916        11.4301        11.1456        11.6891   0.7094
ANN      1.86813        3.34299        2.86363        2.6916   0.7523

PI: performance indicator; ARIMA: autoregressive integrated moving average; ES: exponential smoothing; SMA: simple moving average; ANN: artificial neural network; RMSE: root mean square error.
From Table 1, we can calculate the performance stability indicator in accordance with Equation (1). First, we find that the smallest value of PI_i (i = 1, 2, 3, 4) is 2.6916, i.e., min_i {PI_i} = 2.6916. Then the values of S_i (i = 1, 2, 3) are calculated as follows:

S_1 = (24.2283/0.1345) / (2.6916/0.7523) = 50.3479;
S_2 = (6.2758/0.2965) / (2.6916/0.7523) = 5.9160;
S_3 = (11.6891/0.7094) / (2.6916/0.7523) = 4.6054.
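The stability indicators above (each method's mean-to-dispersion ratio, normalized by that of the best-performing method) can be reproduced from the Table 1 values; the helper name is ours:

```python
def stability_indicators(pi_mean, pi_std, best):
    """Performance stability S_i = (PI_i / sigma_i) / (PI_best / sigma_best),
    as used in the calculations following Table 1."""
    base = pi_mean[best] / pi_std[best]
    return {m: (pi_mean[m] / pi_std[m]) / base for m in pi_mean if m != best}

pi_mean = {"ARIMA": 24.2283, "ES": 6.2758, "SMA": 11.6891, "ANN": 2.6916}
pi_std  = {"ARIMA": 0.1345,  "ES": 0.2965, "SMA": 0.7094,  "ANN": 0.7523}
S = stability_indicators(pi_mean, pi_std, best="ANN")
# S["ARIMA"] ≈ 50.35, S["ES"] ≈ 5.92, S["SMA"] ≈ 4.61: all above S_theta = 4.00,
# so the selection strategy is recommended.
```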
Because all S_i > Sθ = 4.00 (i = 1, 2, 3) for the different evaluation sets, the selection strategy is recommended, and the ANN model is selected as the "true" model for prediction purposes. However, if we assume Sθ to be 60.00, then the combination strategy is preferred; and if Sθ is equal to 10.00, it is hard to judge between the two strategies and further exploration is needed. This implies that determining an appropriate Sθ is extremely important. When the selection strategy is preferred, as mentioned earlier, we face the second dilemma. Here we take the ANN model as an example for further explanation. First, we select an appropriate ANN model based on the full set of observations by training and learning. Then we change the sample size and fit the series repeatedly, calculating the robustness metric (R) each time. Finally, the corresponding decision can be made by comparing the value of R with the threshold value Rθ. The experimental results are presented in Tables 2–3.

Table 2  The process of method selection

        ANN (input node, hidden node, output node)
        (3,2,1)  (3,3,1)  (3,4,1)  (3,5,1)  (4,2,1)  (4,3,1)  (4,4,1)*  (4,5,1)
RMSE    0.0522   0.0568   0.0661   0.0686   0.0513   0.0539   0.0503    0.0748

ANN: artificial neural network; RMSE: root mean square error; *: the selected model.
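The architecture choice in Table 2 amounts to picking the candidate with the smallest RMSE; in code (the variable names are ours):

```python
# Table 2: candidate ANN architectures (input, hidden, output) and their RMSEs.
table2 = {(3, 2, 1): 0.0522, (3, 3, 1): 0.0568, (3, 4, 1): 0.0661, (3, 5, 1): 0.0686,
          (4, 2, 1): 0.0513, (4, 3, 1): 0.0539, (4, 4, 1): 0.0503, (4, 5, 1): 0.0748}
best = min(table2, key=table2.get)  # architecture with the smallest RMSE
print(best)  # → (4, 4, 1)
```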
Table 3  The robustness testing process (k = 30)

Sample size    102      103      104      105      106      107      108      109      110      111
ANN(I,H,O)     (3,3,1)  (4,3,1)  (4,4,1)  (4,4,1)  (4,3,1)  (4,4,1)  (4,4,1)  (3,5,1)  (4,4,1)  (4,4,1)
I_j{M_j=M_n}   0        0        1        1        0        1        1        0        1        1

Sample size    112      113      114      115      116      117      118      119      120      121
ANN(I,H,O)     (4,4,1)  (4,5,1)  (4,4,1)  (3,4,1)  (4,4,1)  (4,4,1)  (4,5,1)  (4,4,1)  (4,4,1)  (4,4,1)
I_j{M_j=M_n}   1        0        1        0        1        1        0        1        1        1

Sample size    122      123      124      125      126      127      128      129      130      131
ANN(I,H,O)     (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)  (4,4,1)
I_j{M_j=M_n}   1        1        1        1        1        1        1        1        1        1

If M_j = M_n, then I_j{M_j = M_n} = 1; otherwise I_j{M_j = M_n} = 0. ANN(I, H, O): ANN (input node, hidden node, output node).
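The robustness metric R is simply the fraction of the k refits whose selected architecture matches the model chosen on the full sample. The Table 3 indicator values reproduce R = 23/30 ≈ 0.7667 (the function name is ours):

```python
def robustness_metric(selected, reference):
    """R = (1/k) * sum_j I_j{M_j = M_n}: the share of the k refits whose
    selected model M_j equals the reference model M_n."""
    return sum(m == reference for m in selected) / len(selected)

# Architectures selected at sample sizes 102..131 (Table 3); M_n = (4,4,1).
table3 = [(3,3,1),(4,3,1),(4,4,1),(4,4,1),(4,3,1),(4,4,1),(4,4,1),(3,5,1),(4,4,1),(4,4,1),
          (4,4,1),(4,5,1),(4,4,1),(3,4,1),(4,4,1),(4,4,1),(4,5,1),(4,4,1),(4,4,1),(4,4,1),
          (4,4,1),(4,4,1),(4,4,1),(4,4,1),(4,4,1),(4,4,1),(4,4,1),(4,4,1),(4,4,1),(4,4,1)]
R = robustness_metric(table3, (4, 4, 1))  # 23/30 ≈ 0.7667 < R_theta = 0.80
```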
From Table 2, we select ANN(4,4,1) as the appropriate model using the corresponding model selection criteria. By changing the sample size, we can make the corresponding decisions. From Table 3, we find that different k values often lead to different robustness metric values. When k is equal to 20, the value of R is 0.85 (R > Rθ), so the decision to select ANN(4,4,1) as the predictor is rational. However, when k is equal to 30, the value of R is 0.7667 (R < Rθ), so ANN(4,4,1) should be rejected for this time series, and combining multiple models with different parameters into a new forecast is a good choice for prediction purposes. Thus it is very important to choose an appropriate value of k, as this affects our decisions to a great extent. Unfortunately, the determination of k still depends on trial and error.

Example II: Exchange rate forecasting

In this example, we use the foreign exchange series to check the efficiency of the proposed procedure. The exchange rate forecasting performance indicators with different evaluation sets are shown in Table 4.

Table 4  The RMSE of different methods with different evaluation sets

PI      RMSE (30 obs)  RMSE (45 obs)  RMSE (60 obs)  PI_i      σ_i
ARIMA   0.0196         0.0193         0.0195         0.01947   0.00015
ES      0.0111         0.0113         0.0117         0.01137   0.00031
SMA     0.0265         0.0250         0.0269         0.02613   0.00100
ANN     0.0033         0.0032         0.0031         0.0032    0.0001

PI: performance indicator; ARIMA: autoregressive integrated moving average; ES: exponential smoothing; SMA: simple moving average; ANN: artificial neural network; RMSE: root mean square error; σ: volatility.
From Table 4, we find that the performance indicators of the ANN model are the best of all for the different evaluation sets, indicating that the ANN model is a very promising modeling technique. According to the first judgmental rule and the results in Table 4, we can calculate the performance stability indicators for the three different evaluation sets:

S_1 = (0.01497/0.00015) / (0.0032/0.0001) = 3.1188;
S_2 = (0.01137/0.00031) / (0.0032/0.0001) = 1.1462;
S_3 = (0.02613/0.00100) / (0.0032/0.0001) = 0.8166.
Because S_i < Sθ = 4.00 (i = 1, 2, 3) for the different evaluation sets, the combination strategy is recommended. Verification of the combination forecasts is the task of the following section.

4.2.2 Experiments for the proposed nonlinear combined model

As mentioned before, the combination strategy faces either the first dilemma or the second dilemma. To address the shortcomings of existing combined forecast models, a novel nonlinear combined model has been proposed. In this section, four individual models (the ARIMA, ES, SMA and ANN models) and four combined forecast methods (the EW, MV, TVMSE and ANC methods) are used to verify the effectiveness of the proposed nonlinear combination method. A three-sided comparison is presented. For space reasons, we use only the exchange rate series for our tests.

A. Individual models vs. combination models

First, we compare the four individual models with the four combination methods. The comparison results are shown in Table 5.

Table 5  The comparison between individual models and combination methods

Models/Methods  ARIMA   ES      SMA     ANN     EW      MV      TVMSE   ANC
RMSE            0.0195  0.0117  0.0269  0.0031  0.0112  0.0017  0.0033  0.0016
Rank            7       6       8       3       5       2       4       1

ARIMA: autoregressive integrated moving average; ES: exponential smoothing; SMA: simple moving average; ANN: artificial neural network; EW: equal weights; MV: minimum variance; TVMSE: time-varying minimum-squared-error; ANC: ANN-based nonlinear combination; RMSE: root mean square error.
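For reference, the two simplest linear combination schemes in Table 5 can be sketched as follows: EW averages the individual forecasts with equal weights, while MV weights them to minimize the variance of the combined error. This is the standard minimum-variance formulation, written here with numpy; the data and function names are illustrative, not the paper's.

```python
import numpy as np

def equal_weights(F):
    """EW: combine the n individual forecasts (columns of F) with weights 1/n."""
    return F.mean(axis=1)

def min_variance_weights(errors):
    """MV: weights w = S^{-1} 1 / (1' S^{-1} 1), where S is the covariance
    matrix of the individual forecast errors (estimated in-sample)."""
    S = np.cov(errors, rowvar=False)
    inv1 = np.linalg.solve(S, np.ones(errors.shape[1]))
    return inv1 / inv1.sum()

# Illustrative data: two forecasts of the same series with correlated errors.
rng = np.random.default_rng(0)
y = rng.normal(size=500)
e = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 2.0]], size=500)
F = y[:, None] + e                      # individual forecasts
w = min_variance_weights(F - y[:, None])
mv = F @ w                              # MV combination
ew = equal_weights(F)                   # EW combination
```

By construction the MV weights sum to one, and in-sample the MV combination's error variance is never larger than EW's, since equal weights are one feasible choice in the variance minimization.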
From Table 5, we find that (1) the performance of the combination methods is generally better than that of the individual models; (2) among the individual models, the ANN model performs best; (3) among the combination methods, the ANN-based nonlinear combination (ANC) method outperforms the three linear combination methods (EW, MV and TVMSE); and (4) the ANC method performs the best of all the individual models and combination methods.

B. Combination of the same/similar methods vs. combination of different/dissimilar methods

Next, we compare the performance of combining the same method with that of combining different methods. Here we take the ARIMA model as the proxy for the same method. The comparison is presented in Table 6.

Table 6  The comparison between combination with the same method and combination with different methods

Methods  EW_ARIMA  EW      MV_ARIMA  MV      TVMSE_ARIMA  TVMSE   ANC_ARIMA  ANC
RMSE     0.0086    0.0112  0.0091    0.0017  0.01305      0.0033  0.0023     0.0016
Rank     5         7       6         2       8            4       3          1

ARIMA: autoregressive integrated moving average; EW: equal weights; MV: minimum variance; TVMSE: time-varying minimum-squared-error; ANC: ANN-based nonlinear combination; RMSE: root mean square error.
From Table 6, we can conclude that (1) the performance when combining different models is better than that when combining the same method, with the exception of
EW; and (2) according to the rank, ANC with different models performs best, followed by MV with different models, ANC with the same method, and TVMSE with different models.

C. PCA-based linear (nonlinear) combination models vs. linear (nonlinear) combination models

Finally, we compare the forecasting performance of the combination models with that of the PCA-based combination models. The results are shown in Table 7.

Table 7  The comparison between combination models and PCA-based combination models

Models/Methods  PEW     EW      PMV     MV      PTVMSE  TVMSE   PANC    ANC
RMSE            0.0092  0.0112  0.0060  0.0017  0.0029  0.0033  0.0015  0.0016
Rank            7       8       6       3       4       5       1       2

PEW: PCA-based equal weights; EW: equal weights; PMV: PCA-based minimum variance; MV: minimum variance; PTVMSE: PCA-based time-varying minimum-squared-error; TVMSE: time-varying minimum-squared-error; PANC: PCA & ANN-based nonlinear combination; ANC: ANN-based nonlinear combination; RMSE: root mean square error.
As Table 7 reveals, the PCA-based combination models generally outperform the other combination models in terms of RMSE and rank. Of all the combination models, the PANC performs best, revealing that it is a robust and promising model that is worth generalizing and applying in all kinds of forecasting practice. To summarize, we can conclude the following: (1) among the individual forecasting models, the ANN model is the best in terms of RMSE (Table 5); (2) the combined forecasting models generally perform better than the individual forecasting models (Table 5); (3) the forecasting accuracy of nonlinear combination is better than that of linear combination (Table 5); (4) a combination of dissimilar methods outperforms a combination of similar methods (Table 6); (5) the PCA-based combination models perform better than the other combination models and the individual models (Tables 5–7); and (6) the PANC model performs the best of all the linear (nonlinear) combination models and individual models (Tables 5–7). This leads to the final conclusion: (7) in view of the empirical results, the novel nonlinear combination model can be used as an alternative tool for time series forecasting to obtain greater forecasting accuracy and further improve prediction quality.
5 Conclusions
In this study, we propose a double-phase processing procedure to treat the problem of whether to select or to combine. Based on the proposed procedure, a nonlinear combination model is also proposed to overcome the drawbacks of existing combined methods. The results obtained reveal that (1) the proposed procedure is useful, reliable and efficient for dealing with the problem of selecting or combining; (2) the proposed nonlinear combined model is useful for improving the prediction performance of time series; and thus (3) the proposed procedure and modeling technique can be a feasible solution for time series forecasting with multiple candidate models. However, several issues remain to be addressed in the future. One important issue is the determination of the threshold values, such as the thresholds for the performance stability and robustness metrics, since these affect the decision-making judgments of researchers and practitioners. Another is the determination of the value of k in the computation of the robustness metric, the value of θ in the PCA process, and the size of the evaluation period in
the calculation of performance stability. In addition, other reasonable judgmental rules for selecting one method or combining multiple methods, and more rational nonlinear combination techniques, are worth exploring in further development of the combination method.

References

[1] G. E. P. Box and G. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, CA, 1970.
[2] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Proceedings of the 2nd International Symposium on Information Theory (ed. by B. N. Petrov and F. Csaki), Akademia Kiado, Budapest, 1973, 267–281.
[3] G. Schwartz, Estimating the dimension of a model, The Annals of Statistics, 1978, 6: 461–464.
[4] L. Breiman, Stacked regressions, Machine Learning, 1996, 24: 49–64.
[5] Y. Yang, Combining time series models for forecasting, Working Paper, Department of Statistics, Iowa State University, 2001.
[6] Y. Yang, Regression with multiple candidate models: Selecting or mixing? Working Paper, Department of Statistics, Iowa State University, 2000.
[7] D. Draper, Assessment and propagation of model uncertainty, Journal of the Royal Statistical Society: Series B, 1995, 57: 45–97.
[8] E. I. George and R. E. McCulloch, Approaches for Bayesian variable selection, Statistica Sinica, 1997, 7: 339–373.
[9] A. E. Raftery, Bayesian model selection in social research (with discussion), in: Sociological Methodology (ed. by P. V. Marsden), Blackwells, Cambridge, Massachusetts, 1995, 111–196.
[10] D. Madigan and J. York, Bayesian graphical models for discrete data, International Statistical Review, 1995, 63: 215–232.
[11] L. Breiman, Bagging predictors, Machine Learning, 1996, 24: 123–140.
[12] S. T. Buckland, K. P. Burnham and N. H. Augustin, Model selection: An integral part of inference, Biometrics, 1995, 53: 603–618.
[13] D. H. Wolpert, Stacked generalization, Neural Networks, 1992, 5: 241–259.
[14] M. LeBlanc and R. Tibshirani, Combining estimates in regression and classification, Journal of the American Statistical Association, 1996, 91: 1641–1650.
[15] A. Juditsky and A. Nemirovski, Functional aggregation for nonparametric estimation, The Annals of Statistics, 2000, 28(3): 681–712.
[16] J. A. Hoeting, D. Madigan, A. E. Raftery and C. T. Volinsky, Bayesian model averaging: A tutorial, Statistical Science, 1999, 14(4): 382–417.
[17] N. Meade, Evidence for the selection of forecasting methods, Journal of Forecasting, 2000, 9: 515–535.
[18] A. Jessop, Sensitivity and robustness in selection problems, Computers and Operations Research, 2004, 31: 607–622.
[19] F. Diebold, Forecast combination and encompassing: Reconciling two divergent literatures, International Journal of Forecasting, 1989, 5: 589–592.
[20] C. K. Chan, B. G. Kingsman and H. Wong, Determining when to update the weights in combined forecasts for product demand: An application of the CUSUM technique, European Journal of Operational Research, 2004, 153: 757–768.
[21] M. Beccali, M. Cellura, V. Lo Brano and A. Marvuglia, Forecasting daily urban electric load profiles using artificial neural networks, Energy Conversion & Management, article in press, 2004.
[22] N. Harvey and C. Harries, Effects of judges' forecasting on their later combination of forecasts for the same outcomes, International Journal of Forecasting, article in press, 2003.
[23] R. Lahdelma and H. Hakonen, An efficient linear programming algorithm for combined heat and power production, European Journal of Operational Research, 2003, 148: 141–151.
[24] A. J. Haklander and A. V. Delden, Thunderstorm predictors and their forecast skill for the Netherlands, Atmospheric Research, 2003, 67–68: 273–299.
[25] Y. Fang and D. Xu, The predictability of asset returns: An approach combining technical analysis and time series forecasts, International Journal of Forecasting, 2003, 19: 369–385.
[26] C. Guermat and R. D. F. Harris, Forecasting value at risk allowing for time variation in the variance and kurtosis of portfolio returns, International Journal of Forecasting, 2002, 18: 409–419.
[27] V. Kumar, A. Nagpal and R. Venkatesan, Forecasting category sales and market share for wireless telephone subscribers: A combined approach, International Journal of Forecasting, 2002, 18: 583–603.
[28] N. Terui and H. K. Van Dijk, Combined forecasts from linear and nonlinear time series models, International Journal of Forecasting, 2002, 18: 421–438.
[29] G. J. Chen, K. K. Li, T. S. Chung, H. B. Sun and G. Q. Tang, Application of an innovative combined forecasting method in power system load forecasting, Electric Power Systems Research, 2001, 59: 131–137.
[30] M. Tamimi and R. Egbert, Short-term electric load forecasting via fuzzy neural collaboration, Electric Power Systems Research, 2000, 56: 243–248.
[31] M. Y. Hu and C. Tsoukalas, Combining conditional volatility forecasts using neural networks: An application to the EMS exchange rates, Journal of International Financial Markets, Institutions and Money, 1999, 9: 407–422.
[32] C. K. Chan, B. G. Kingsman and H. Wong, The value of combining forecasts in inventory management: A case study in banking, European Journal of Operational Research, 1999, 117: 199–210.
[33] I. Fischer and N. Harvey, Combining forecasts: What information do judges need to outperform the simple average? International Journal of Forecasting, 1999, 15: 227–246.
[34] F. L. Chu, Forecasting tourism: A combined approach, Tourism Management, 1998, 19(6): 515–520.
[35] L. M. De Menezes and D. W. Bunn, The persistence of specification problems in the distribution of combined forecast errors, International Journal of Forecasting, 1998, 14: 415–426.
[36] L. M. De Menezes, D. W. Bunn and J. W. Taylor, Review of guidelines for the use of combined forecasts, European Journal of Operational Research, 2000, 120: 190–204.
[37] G. Q. Zhang, B. E. Patuwo and M. Y. Hu, Forecasting with artificial neural networks: The state of the art, International Journal of Forecasting, 1998, 14: 35–62.
[38] D. J. Reid, Combining three estimates of gross domestic product, Economica, 1968, 35: 431–444.
[39] D. J. Reid, A comparative study of time series prediction techniques on economic data, Ph.D. thesis, University of Nottingham, Nottingham, 1969.
[40] J. M. Bates and C. W. J. Granger, The combination of forecasts, Operational Research Quarterly, 1969, 20: 451–468.
[41] R. T. Clemen, Combining forecasts: A review and annotated bibliography, International Journal of Forecasting, 1989, 5: 559–583.
[42] D. K. Wedding II and K. J. Cios, Time series forecasting by combining RBF networks, certainty factors, and the Box-Jenkins model, Neurocomputing, 1996, 10: 149–168.
[43] J. T. Luxhoj, J. O. Riis and B. Stensballe, A hybrid econometric-neural network modeling approach for sales forecasting, International Journal of Production Economics, 1996, 43: 175–192.
[44] M. V. D. Voort, M. Dougherty and S. Watson, Combining Kohonen maps with ARIMA time series models to forecast traffic flow, Transportation Research Part C: Emerging Technologies, 1996, 4: 307–318.
[45] F. M. Tseng, H. C. Yu and G. H. Tzeng, Combining neural network model with seasonal time series ARIMA model, Technological Forecasting and Social Change, 2002, 69: 71–87.
[46] G. P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing, 2003, 50: 159–175.
[47] R. L. Winkler, Combining forecasts: A philosophical basis and some current issues, International Journal of Forecasting, 1989, 5: 605–609.
[48] L. A. Yu, S. Y. Wang and K. K. Lai, Double robustness analysis for determining optimal feedforward neural model architecture, submitted to IEEE Transactions on Neural Networks, 2003.
[49] S. M. Shi, L. D. Xu and B. Liu, Improving the accuracy of nonlinear combined forecasting using neural networks, Expert Systems with Applications, 1999, 16: 49–54.
[50] J. P. Dickinson, Some statistical results on the combination of forecasts, Operational Research Quarterly, 1973, 24: 253–260.
[51] J. P. Dickinson, Some comments on the combination of forecasts, Operational Research Quarterly, 1975, 26: 205–210.
[52] L. A. Yu, S. Y. Wang and K. K. Lai, A novel nonlinear ensemble forecasting model incorporating GLAR and ANN for foreign exchange rates, to appear in Computers and Operations Research, 2003.
[53] G. Y. Yang, X. W. Tang and Y. K. Ma, The discussions of many problems about combined forecasting with non-negative weights, Journal of XiDian University, 1996, 25(2): 210–215.
[54] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, Berlin, 1986.
[55] J. Karhunen and J. Joutsensalo, Generalizations of principal component analysis, optimization problems and neural networks, Neural Networks, 1995, 8(4): 549–562.
[56] A. Pankratz, Forecasting with Dynamic Regression Models, John Wiley & Sons, New York, 1991.