A Machine Learning Approach to Define Weights for Linear Combination of Forecasts

Ricardo Prudêncio¹ and Teresa Ludermir²

¹ Department of Information Science, Federal University of Pernambuco, Av. dos Reitores, s/n - CEP 50670-901 - Recife (PE) - Brazil
² Center of Informatics, Federal University of Pernambuco, P.O. Box 7851 - CEP 50732-970 - Recife (PE) - Brazil
Abstract. The linear combination of forecasts is a procedure that has improved the forecasting accuracy of different time series. In this procedure, each method being combined is associated with a numerical weight that indicates its contribution to the combined forecast. We present the use of machine learning techniques to define the weights for the linear combination of forecasts. In this paper, a machine learning technique uses features of the series at hand to define the adequate weights for a pre-defined number of forecasting methods. In order to evaluate this solution, we implemented a prototype that uses an MLP network to combine two widespread methods. The experiments performed revealed that the combined forecasts were significantly more accurate than benchmark forecasts.
1 Introduction
Combining time series forecasts from different methods is a procedure commonly used to improve forecasting accuracy [1]. In the linear combination of forecasts, a weight is associated with each available method, and the combined forecast is the weighted sum of the forecasts individually provided by the methods. An approach that uses knowledge to define the weights in the linear combination of forecasts is based on the development of expert systems, such as the Rule-Based Forecasting system [2]. The expert rules deployed by the system use descriptive features of the time series (such as length, basic trend, ...) in order to define the adequate combining weight associated with each available forecasting method. Despite its good results, developing rules in this context may be unfeasible, since good forecasting experts are not always available [3]. In this paper, we propose the use of machine learning for predicting the linear weights for combining forecasts. In the proposed solution, each training example stores the description of a series (i.e. the series features) and the combining weights that empirically obtained the best forecasting performance for the series. A machine learning technique uses a set of such examples to relate time series features and adequate combining weights. In order to evaluate the proposed solution, we implemented a prototype that uses MLP neural networks [4] to define the weights for two widespread methods:
the Random Walk and the Autoregressive model [5]. The prototype was evaluated on 645 yearly series, and the combined forecasts were compared to the forecasts provided by benchmark methods. The experiments revealed that the forecasts generated with the predicted weights were significantly more accurate than the benchmark forecasts. Section 2 presents methods for the linear combination of forecasts, followed by Section 3, which describes the proposed solution. Section 4 presents the implemented prototype, experiments and results. Section 5 presents related work. Finally, Section 6 presents the conclusions and future work.
2 Combining Forecasts
The combination of forecasts from different methods is a well-established procedure for improving forecasting accuracy [1]. Empirical evidence has shown that procedures that combine forecasts often outperform the individual methods used in the combination [1][6]. The linear combination of K methods can be described as follows. Let $Z_t$ $(t = 1, \ldots, T)$ be the available data of a series Z and let $Z_t$ $(t = T+1, \ldots, T+H)$ be the H future values to be forecasted. Each method k uses the available data to calibrate its parameters and then generates its individual forecasts $\hat{Z}_t^k$ $(t = T+1, \ldots, T+H)$. The combined forecasts $\hat{Z}_t^C$ are defined as:

$$\hat{Z}_t^C = \sum_{k=1}^{K} w_k \hat{Z}_t^k \quad (t = T+1, \ldots, T+H) \qquad (1)$$
The combining weights $w_k$ $(k = 1, \ldots, K)$ are numerical values that indicate the contribution of each individual method in the combined forecasts. Constraints may be imposed on the weights in such a way that:

$$\sum_{k=1}^{K} w_k = 1 \quad \text{and} \quad w_k \geq 0 \qquad (2)$$
Different approaches for defining adequate combining weights can be identified in the literature [6]. A very simple approach is to define equal weights (i.e. $w_k = 1/K$) for the combined methods, usually referred to as the Simple Average (SA) combination method. Despite its simplicity, the SA method has proved robust in the forecasting of different series [7]. A more sophisticated approach for defining the combining weights is to treat the problem within the regression framework [8]. In this context, the individual forecasts are viewed as explanatory variables and the actual values of the series as the response variable. In order to estimate the best weights, each combined method generates in-sample forecasts (i.e. forecasts for the available data $Z_t$). Then, a least squares method computes the weights that minimize the squared error of the combined in-sample forecasts. Empirical results revealed that the regression-based methods are almost always less accurate than the SA [9].
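For illustration, the sketch below contrasts the two approaches in Python; it is a minimal rendering of the regression framework of [8] under our own naming, not the code used in any of the cited studies.

import numpy as np

def simple_average_weights(K):
    """SA method: equal weights w_k = 1/K for the K combined methods."""
    return np.full(K, 1.0 / K)

def regression_weights(in_sample_forecasts, actuals):
    """Regression framework: treat each method's in-sample forecasts
    as an explanatory variable and the actual series values as the
    response, then compute least squares weights.

    in_sample_forecasts: array of shape (T, K), one column per method.
    actuals:             array of shape (T,), the observed values Z_t.
    """
    # Unconstrained least squares: unlike SA, the resulting weights
    # may be negative and need not sum to one.
    w, *_ = np.linalg.lstsq(in_sample_forecasts, actuals, rcond=None)
    return w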
An alternative approach that has shown promising results in the combination of forecasts is based on the development of expert systems, such as the Rule-Based Forecasting system [2]. In that work, an expert system with 99 rules is used to weight four forecasting methods. The rule base was developed based on guidelines provided by human experts. The authors used time series features in the rules (such as basic trend and presence of outliers) to modify the weight associated with each method. Figure 1 shows an example of a rule used in the system; it defines how to modify the weights for a series that presents an insignificant trend. In the experiments performed with the expert system, the improvement in accuracy over the SA method was shown to be significant.

Rule: Insignificant Basic Trend - IF NOT a significant basic trend, THEN add 0.05 to the weight on the random walk and subtract it from that on the regression.

Fig. 1. Example of a rule implemented in the Rule-Based Forecasting system
3 The Proposed Solution
As seen, expert systems have been successfully used to define weights for the linear combination of forecasts. Unfortunately, the knowledge acquisition in these systems depends on the availability of good human forecasting experts, who are often scarce and expensive [3]. In order to minimize this difficulty, we propose the use of machine learning to define the weights for combining forecasts. Figure 2 presents the architecture of the proposed solution. The system has two phases: training and use. In the training phase, the Intelligent Combiner (IC) uses a supervised algorithm to acquire knowledge from a set of examples in the Database (DB). Each training example stores the descriptive features of a particular series and the best configuration of weights used to combine K methods for that series. The learner implemented in the IC module generalizes the experience stored in the examples by associating time series features with the most adequate combining weights. In the use phase, given a time series to be forecasted, the Feature Extractor (FE) module extracts the values of the time series features. According to these values, the IC module predicts an adequate configuration of weights for combining the K methods. The solution proposed here contributes to two different fields: (1) time series forecasting, since we provide a new method for combining forecasts; and (2) machine learning, since we apply its concepts and techniques to a problem that had not yet been tackled. In the following subsections, we provide a more formal description of each module of the proposed solution. Section 4 presents an implemented prototype and the experiments that evaluated the proposed solution.
Fig. 2. System's architecture. The Feature Extractor (FE) receives the time series data and the contextual information and produces the time series features (x); the Intelligent Combiner (IC), trained on the examples (E) stored in the Database (DB), outputs the combining weights $w_1, \ldots, w_K$.
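To make the two phases concrete, the following Python skeleton mirrors the architecture of Figure 2. All names (IntelligentCombiner, describe, the attributes of the stored examples) are hypothetical, since the paper does not prescribe a particular implementation.

import numpy as np

class IntelligentCombiner:
    def __init__(self, learner):
        self.learner = learner  # any supervised regression algorithm

    def train(self, database):
        # Training phase: learn the mapping from series features to
        # combining weights using the examples (x_i, w_i) in the DB.
        X = np.array([example.features for example in database])
        W = np.array([example.best_weights for example in database])
        self.learner.fit(X, W)

    def combine(self, series, methods, feature_extractor):
        # Use phase: extract the features of the series, predict the
        # weights and return the weighted sum of individual forecasts.
        x = feature_extractor.describe(series)
        weights = self.learner.predict([x])[0]
        forecasts = np.array([m.forecast(series) for m in methods])
        return weights @ forecasts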
3.1 The Feature Extractor
As said, the combining weights are predicted for a time series Z based on its description. Formally, a time series Z is described as a vector $\mathbf{x} = (x^1, \ldots, x^p)$, where each $x^j$ $(j = 1, \ldots, p)$ corresponds to the value of a descriptive feature $X_j$. A time series feature can be either: (1) contextual information directly provided by the user, such as the domain of the series, the time interval or the forecasting horizon; or (2) a descriptive statistic automatically calculated from the available time series data $Z_t$ $(t = 1, \ldots, T)$. In the proposed solution, the FE module extracts those features that are computed from the available data. These features may be, for example, the length, the presence of trend or seasonality, the autocorrelations, or the number of turning points. Obviously, the choice of appropriate features is highly dependent on the type of series at hand.
3.2 The Database
An important aspect to be considered is the generation of the set of training examples used by the IC module. Let $E = \{e_1, \ldots, e_n\}$ be a set of n examples, where each example stores: (1) the values of the p features for a particular series; and (2) the adequate weights for combining the K forecasting methods on that series. An example $e_i \in E$ is then defined as a pair $e_i = (\mathbf{x}_i, \mathbf{w}_i)$, where $\mathbf{x}_i = (x_i^1, \ldots, x_i^p)$ is the description of the series and $\mathbf{w}_i = (w_i^1, \ldots, w_i^K)$ is the best configuration of weights associated with the K methods. In order to generate each example $e_i \in E$, we simulate the forecasting of a series $Z_i$ using the K methods and then compute the configuration of weights that would have obtained the best forecasting result if it had been used to combine the forecasts provided by the K methods. For this, the following tasks have to be performed. First, given a sample series $Z_i$ $(i = 1, \ldots, n)$, its data is divided into two parts: a fit period $Z_{i,t}$ $(t = 1, \ldots, T_i)$ and a forecasting period $Z_{i,t}$ $(t = T_i+1, \ldots, T_i+H)$. The fit period corresponds to the data available at time $T_i$, used to calibrate the K methods. The forecasting period in turn corresponds to the H observations to be forecasted by the K calibrated methods. Hence, for each series $Z_i$, this task
results in the forecasts $\hat{Z}_{i,t}^k$ $(k = 1, \ldots, K)$ $(t = T_i+1, \ldots, T_i+H)$ individually provided by the K calibrated methods for the forecasting period of the series. In the second task, we define the combining weights $(w_i^1, \ldots, w_i^K)$ that minimize a chosen forecasting error measure E, computed for the combined forecasts $\hat{Z}_{i,t}^C$ $(t = T_i+1, \ldots, T_i+H)$. The measure E may be, for example, the Mean Absolute Error or the Mean Squared Error. The task of defining the best combining weights can be formulated as an optimization problem as follows:

Minimize:
$$E(\hat{Z}_{i,t}^C) = E\left(\sum_{k=1}^{K} w_i^k \hat{Z}_{i,t}^k\right) \qquad (3)$$

Subject to:
$$\sum_{k=1}^{K} w_i^k = 1 \quad \text{and} \quad w_i^k \geq 0 \qquad (4)$$
Different optimization methods may be used to solve this problem, considering the characteristics of the measure E to be minimized. For each series $Z_i$, this optimization process results in the weights $(w_i^1, \ldots, w_i^K)$ that would minimize the forecasting error E if they were used to combine the K methods on that series. Finally, in the third task, the features $(x_i^1, \ldots, x_i^p)$ are extracted to describe the fit period of the series (as defined in the FE module). These features and the optimized weights $(w_i^1, \ldots, w_i^K)$ are stored in the DB as a new example.
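As an illustration of the second task, the sketch below solves problem (3)-(4) with a standard constrained optimizer in Python. It is our approximation under a default MAE measure, not the routine actually used by the authors (Section 4.2 reports a line search in Matlab).

import numpy as np
from scipy.optimize import minimize

def best_weights(forecasts, actuals, error_fn=None):
    """Solve problem (3)-(4): find the weights minimizing a chosen
    error measure E of the combined forecasts, subject to the
    constraints that the weights are non-negative and sum to one.

    forecasts: array of shape (H, K), one column per method.
    actuals:   array of shape (H,), the observed forecasting period.
    error_fn:  the measure E; defaults to the Mean Absolute Error.
    """
    if error_fn is None:
        error_fn = lambda e: np.mean(np.abs(e))
    K = forecasts.shape[1]

    objective = lambda w: error_fn(actuals - forecasts @ w)

    res = minimize(
        objective,
        x0=np.full(K, 1.0 / K),                     # start from SA weights
        bounds=[(0.0, 1.0)] * K,                    # w_k >= 0
        constraints=[{"type": "eq",                 # sum_k w_k = 1
                      "fun": lambda w: np.sum(w) - 1.0}],
        method="SLSQP",
    )
    return res.x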
3.3 The Intelligent Combiner
The Intelligent Combiner (IC) module implements a supervised learning algorithm that acquires knowledge from the set of training examples E. Given the set E, the algorithm builds a learner, a regression model $L: X \to [0,1]^K$ that receives as input a time series description and predicts the best configuration of combining weights, considering the error criterion E. The final output of the system for a time series described as $\mathbf{x} = (x^1, \ldots, x^p)$ is then:

$$\hat{\mathbf{w}} = (\hat{w}^1, \ldots, \hat{w}^K) = L(\mathbf{x}) \qquad (5)$$
Algorithms that can be used to generate the learner L include neural network models, algorithms for the induction of regression trees and the k-nearest neighbour algorithm, among others.
4 The Implemented Prototype
In order to verify the viability of the proposal, a prototype was implemented to define the combining weights of K = 2 methods: the Random Walk (RW) and the Auto-Regressive (AR) model [5]. The prototype was applied to forecast the yearly series of the M3-Competition [10], which provides a large set of time series related to economic and demographic domains.
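For concreteness, a Python sketch of the two individual forecasters is given below; the use of statsmodels and the AR order of 1 are our assumptions, since the paper does not report how the models were calibrated.

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def random_walk_forecast(z, horizon):
    # RW: the forecast for every future point is the last observed value.
    return np.full(horizon, z[-1])

def ar_forecast(z, horizon, lags=1):
    # AR: calibrate an autoregressive model on the fit period and
    # predict the next `horizon` observations.
    model = AutoReg(z, lags=lags).fit()
    return model.predict(start=len(z), end=len(z) + horizon - 1)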
In the next subsections, we provide the implementation details as well as the experiments that evaluated the prototype. A short presentation of the implemented prototype can also be found in [11].
4.1 The Feature Extractor
In this module, the following features were used to describe the yearly series:

1. Length of the time series (L): number of observations of the series;
2. Basic Trend (BT): slope of the linear regression model;
3. Percentage of Turning Points (TP): $Z_t$ is a turning point if $Z_{t-1} < Z_t > Z_{t+1}$ or $Z_{t-1} > Z_t < Z_{t+1}$; this feature measures the oscillation in a series;
4. First Coefficient of Autocorrelation (AC): large values of this feature suggest that the value of the series at a point influences the value at the next point;
5. Type of the time series (TYPE): represented by 5 categories: micro, macro, industry, finance and demographic.

The first four features are directly computed from the series data; TYPE in turn is contextual information provided by the authors of the M3-Competition. A sketch of the computation of the numeric features is given below.
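The sketch is one possible rendering in Python; the exact formulas (e.g. expressing TP as a percentage of the series length) are our assumptions where the paper leaves details open.

import numpy as np

def extract_numeric_features(z):
    # L: length of the time series.
    length = len(z)

    # BT: slope of a linear regression of the series on time.
    basic_trend = np.polyfit(np.arange(length), z, 1)[0]

    # TP: percentage of points that are local maxima or minima.
    inner = ((z[1:-1] > np.maximum(z[:-2], z[2:])) |
             (z[1:-1] < np.minimum(z[:-2], z[2:])))
    turning_points = 100.0 * inner.sum() / length

    # AC: first coefficient of autocorrelation.
    dz = z - z.mean()
    autocorr = (dz[:-1] * dz[1:]).sum() / (dz ** 2).sum()

    return np.array([length, basic_trend, turning_points, autocorr])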
4.2 The Database
In this case study, we used the 645 yearly series of the M3-Competition to generate the set of examples. Each time series was used to generate a different example as defined in Section 3.2. First, given a series $Z_i$, the last H = 6 years of the series were defined as the forecasting period and the remaining data of the series was defined as the fit period (as suggested in the M3-Competition). After calibration, the RW and AR models provided their individual forecasts $\hat{Z}_{i,t}^k$ $(k = 1, 2)$ $(t = T_i+1, \ldots, T_i+6)$. Second, we defined the combining weights $w_i^k$ $(k = 1, 2)$ that minimized the Mean Absolute Error (MAE) of the combined forecasts $\hat{Z}_{i,t}^C$ $(t = T_i+1, \ldots, T_i+6)$. This task was formulated as the optimization problem:

Minimize:
$$MAE(\hat{Z}_{i,t}^C) = \frac{1}{6}\sum_{t=T_i+1}^{T_i+6} |Z_{i,t} - \hat{Z}_{i,t}^C| = \frac{1}{6}\sum_{t=T_i+1}^{T_i+6} \Big|Z_{i,t} - \sum_{k=1}^{2} w_i^k \hat{Z}_{i,t}^k\Big| \qquad (6)$$

Subject to:
$$\sum_{k=1}^{2} w_i^k = 1 \quad \text{and} \quad w_i^k \geq 0 \qquad (7)$$
This optimization problem was treated using a line search algorithm implemented in the Optimization Toolbox for Matlab [14]. Finally, the example associated with the series $Z_i$ is composed of the values of the five features (defined in the FE module, Section 4.1) computed on the fit data and the optimum weights $\mathbf{w}_i = (w_i^1, w_i^2)$.
4.3 The Intelligent Combiner
In the implemented prototype, the IC module uses a Multi-Layer Perceptron (MLP) network [4] with one hidden layer as the learner. The MLP input layer has 9 units that represent the 5 time series features described in the FE module. The first four input units receive the values of the numeric features (i.e. L, BT, TP, AC). The feature TYPE was represented by 5 binary attributes (value 1 or 0), each one associated with a different category of series (see Fig. 3). The output layer has two nodes that represent the weights associated with the RW and AR models. The output nodes use sigmoid functions, which ensures that the predicted weights are non-negative. In order to ensure that the predicted weights sum to one (see Eq. 2), the outputs of the MLP were normalized. Formally, let $O_1$ and $O_2$ be the values provided as output for a given time series description $\mathbf{x}$. The predicted weights $\hat{w}_1$ and $\hat{w}_2$ are defined as:

$$\hat{w}_k = \frac{O_k}{\sum_{l=1}^{2} O_l} \quad (k = 1, 2) \qquad (8)$$
The MLP training was performed by the BackPropagation (BP) algorithm [4] and followed the benchmark training rules provided in [12]. The BP algorithm was implemented using the Neural Network Toolbox for Matlab [13].

Fig. 3. MLP used to define the combining weights for the RW and AR models. The nine input units receive L, BT, TP, AC and the five binary indicators of TYPE (MI, MA, IN, FI, DE); the two outputs $O_1$ and $O_2$ are normalized according to Eq. (8).
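The sketch below emulates this setup with scikit-learn in place of the Matlab toolbox. Note that MLPRegressor has a linear output layer, so we clip the outputs to (0, 1] before applying the normalization of Eq. (8), whereas the original network used sigmoid output nodes; the hyperparameter choices are ours.

import numpy as np
from sklearn.neural_network import MLPRegressor

# One hidden layer with logistic (sigmoid) units, trained by gradient
# descent; X_train holds the 9-dimensional series descriptions and
# W_train the optimum weight pairs (w_i^1, w_i^2) stored in the DB.
mlp = MLPRegressor(hidden_layer_sizes=(8,), activation="logistic",
                   solver="sgd", max_iter=2000)
# mlp.fit(X_train, W_train)

def predicted_weights(mlp, x):
    """Apply Eq. (8): normalize the raw outputs so the predicted
    weights are non-negative and sum to one."""
    o = np.clip(mlp.predict([x])[0], 1e-6, 1.0)
    return o / o.sum()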
4.4 Experiments and Results
In the performed experiments, the set of 645 examples in the DB was equally divided into training, validation and test sets. We trained the MLP using 2, 4, 6, 8 and 10 nodes in the hidden layer (30 runs for each value). The optimum number
of hidden nodes was chosen as the value that obtained the lowest average SSE on the validation set over the 30 runs. Table 1 summarizes the MLP results. As can be seen, the optimum number of hidden nodes according to the validation error was 8. The gain obtained by this value was also observed in the test set.

Table 1. Training results (SSE over 30 runs: average and standard deviation)

Hidden Nodes   Training Avg (Dev)   Validation Avg (Dev)   Test Avg (Dev)
2              65.15 (1.87)         69.46 (0.80)           68.34 (1.03)
4              64.16 (1.97)         69.49 (0.87)           67.78 (1.20)
6              64.22 (1.81)         69.34 (0.86)           67.14 (1.28)
8              64.13 (2.28)         69.29 (0.67)           66.74 (1.35)
10             64.37 (2.62)         69.56 (1.03)           67.54 (1.32)
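This selection protocol can be summarized by the loop below; train_mlp and validation_sse are hypothetical helpers standing in for the Matlab training and evaluation routines.

import numpy as np

def select_hidden_nodes(train_set, valid_set,
                        candidates=(2, 4, 6, 8, 10), runs=30):
    # Keep the hidden-layer size with the lowest average SSE on the
    # validation set over `runs` independent trainings.
    best_h, best_avg = None, np.inf
    for h in candidates:
        sse = [validation_sse(train_mlp(train_set, hidden_nodes=h, seed=r),
                              valid_set)  # hypothetical helpers
               for r in range(runs)]
        if np.mean(sse) < best_avg:
            best_h, best_avg = h, np.mean(sse)
    return best_h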
We further investigated the quality of the combined forecasts generated by using the weights predicted by the selected MLP (with 8 hidden nodes). Let TEST be the set of 215 series that were used to generate the test set of the above experiments, and let $\hat{w}_i^1$ and $\hat{w}_i^2$ be the weights predicted for the series $Z_i \in TEST$. The forecasting error produced by the combination of methods at time t is:

$$e_{i,t}^C = Z_{i,t} - \hat{Z}_{i,t}^C = Z_{i,t} - \sum_{k=1}^{2} \hat{w}_i^k \hat{Z}_{i,t}^k \qquad (9)$$
In order to evaluate these errors across all series $Z_i \in TEST$ and all forecasting points $t \in \{T_i+1, \ldots, T_i+6\}$, we considered the Percentage Better (PB) measure [15]. Given a reference method R that serves for comparison, the PB measure is computed as follows:

$$PB_R = 100 \cdot \frac{1}{m} \sum_{Z_i \in TEST} \; \sum_{t=T_i+1}^{T_i+6} \delta_{i,t} \qquad (10)$$

where

$$\delta_{i,t} = \begin{cases} 1, & \text{if } |e_{i,t}^R| < |e_{i,t}^C| \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$
In the above definition, $e_{i,t}^R$ is the forecasting error obtained by the method R in the i-th series at forecasting time t, and m is the number of times in which $|e_{i,t}^R| \neq |e_{i,t}^C|$. Hence, $PB_R$ indicates, in percentage terms, the number of times that the error obtained by the method R was lower than the error obtained using the combined forecasts; values lower than 50 indicate that the combined forecasts are more accurate than the forecasts obtained by the reference method. The PB measure was computed for three reference methods: the first is to use RW for forecasting all series, the second is to use AR for all series, and the third is the Simple Average (SA). Table 2 summarizes the results over the 30 runs of the best MLP. The average PB measure was lower than 50% for all reference methods, which indicates that the combined forecasts were more accurate. The confidence intervals suggest that the obtained gain is statistically significant.

Table 2. Comparative forecasting performance measured by PB

Reference Method   PB Average   Deviation   Conf. Interv. (95%)
RW                 42.20        0.26        [42.11; 42.29]
AR                 40.20        0.44        [40.04; 40.36]
SA                 43.24        1.45        [42.72; 43.76]
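A direct transcription of Eqs. (10)-(11) in Python, assuming the errors of the reference and combined forecasts over all series and forecasting points have been collected into flat arrays:

import numpy as np

def percentage_better(ref_errors, comb_errors):
    # PB: percentage of points where the reference method's absolute
    # error is strictly smaller than the combined forecast's; ties
    # are excluded from the denominator m.
    ref, comb = np.abs(ref_errors), np.abs(comb_errors)
    differ = ref != comb
    return 100.0 * (ref[differ] < comb[differ]).sum() / differ.sum()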
5 Related Work
The proposed solution is closely related to previous work that used machine learning to select forecasting methods [3][16][17][18][19]. In the selection approach, time series features are used by a learning technique to predict the best method for forecasting among a set of candidates. In the solution presented here, the learner uses the features to define the best combination of methods. Our approach is more general, since the selection problem can be seen as a special case of combination where $w_k = 1$ if k is the best method and 0 otherwise. In a previous work [18], we adopted the selection approach and applied an MLP network to select among the RW and AR models. Experiments were performed on the same 645 yearly series and the same forecasting period adopted in the present work. Table 3 shows the PB value computed for the combination approach, using the selection approach developed in [18] as reference. As can be seen, the combination procedure obtained a performance gain over the simple selection approach adopted in the previous work.

Table 3. Forecasting performance (PB) of the combined forecasts, using the selection approach as a basis for comparison

Reference Method           PB Average   Deviation   Conf. Interv. (95%)
Selection Approach (MLP)   48.28        0.74        [48.01; 48.54]
6 Conclusion
In this work, we proposed the use of machine learning to define linear weights for combining forecasts. In order to evaluate the proposal, we applied an MLP network to support the combination of two forecasting methods. The performed experiments revealed a significant gain in accuracy compared to benchmark forecasting procedures. Modifications to the current implementation may be considered, such as augmenting the set of features, optimizing the MLP design and considering new forecasting error measures to be minimized.
Although the focus of our work is forecasting, the proposed method can also be adapted to other situations, for example, to combine classifiers. In that case, instead of time series features, characteristics of classification tasks should be considered (such as the number of examples). Our solution then becomes related to the field of Meta-Learning [20], which studies how the performance of learning algorithms can be improved through experience. The use of the proposed solution to combine classifiers will be investigated in future work.
References

1. Hibon, M., Evgeniou, T.: To combine or not to combine: selecting among forecasts and their combinations. International Journal of Forecasting 21(1) (2004) 15–24
2. Adya, M., Armstrong, J.S., Collopy, F., Kennedy, M.: Automatic identification of time series features for rule-based forecasting. International Journal of Forecasting 17(2) (2001) 143–157
3. Arinze, B.: Selecting appropriate forecasting models using rule induction. Omega - International Journal of Management Science 22(6) (1994) 647–658
4. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323 (1986) 533–536
5. Harvey, A.: Time Series Models. MIT Press, Cambridge, MA (1993)
6. De Menezes, L., Bunn, D., Taylor, J.: Review of guidelines for the use of combined forecasts. European Journal of Operational Research 120 (2000) 190–204
7. Armstrong, J.: Findings from evidence-based forecasting: methods for reducing forecast error. (2005) Available at: http://www.jscottarmstrong.com/. Accessed on March 20, 2006
8. Granger, C.W.J., Ramanathan, R.: Improved methods of combining forecasts. Journal of Forecasting 3 (1984) 197–204
9. Aksu, C., Gunter, S.I.: An empirical analysis of the accuracy of SA, OLS, ERLS and NRLS combination of forecasts. International Journal of Forecasting 8 (1992) 27–43
10. Makridakis, S., Hibon, M.: The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16(4) (2000) 451–476
11. Prudêncio, R., Ludermir, T.B.: Using machine learning techniques to combine forecasting methods. Lecture Notes in Artificial Intelligence 3339 (2004) 1122–1127
12. Prechelt, L.: Proben1: a set of neural network benchmark problems and benchmarking rules. Tech. Rep. 21/94, Fakultät für Informatik, Karlsruhe (1994)
13. Demuth, H., Beale, M.: Neural Network Toolbox for Use with Matlab. The Mathworks Inc. (2003)
14. The Mathworks: Optimization Toolbox User's Guide. The Mathworks Inc. (2003)
15. Flores, B.E.: Use of the sign test to supplement the percentage better statistic. International Journal of Forecasting 2 (1986) 477–489
16. Chu, C.-H., Widjaja, D.: Neural network system for forecasting method selection. Decision Support Systems 12(1) (1994) 13–24
17. Venkatachalam, A.R., Sohl, J.E.: An intelligent model selection and forecasting system. Journal of Forecasting 18 (1999) 167–180
18. Prudêncio, R., Ludermir, T.B.: Meta-learning approaches for selecting time series models. Neurocomputing 61 (2004) 121–137
19. Prudêncio, R., Ludermir, T.B., De Carvalho, F.: A modal symbolic classifier for selecting time series models. Pattern Recognition Letters 25(8) (2004) 911–921
20. Giraud-Carrier, C., Vilalta, R., Brazdil, P.: Introduction to the special issue on meta-learning. Machine Learning 54 (2004) 187–193