Data Preprocessing and Augmentation for Multiple Short Time Series Forecasting with Recurrent Neural Networks
Slawek Smyl, Karthik Kuber
36th International Symposium on Forecasting, Santander, Spain, June 2016
Keywords: Long Short-Term Memory, LSTM, M3 competition, Bayesian, Markov Chain Monte Carlo,
MCMC, Stan, Exponential Smoothing, Local and Global Trend, LGT, CNTK.
Abstract: This paper describes attempts to use statistical time series algorithms for data
preprocessing and augmentation for time series forecasting with recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks. When applied to the yearly time series of the M3 competition, the results are encouraging: the neural network system improves on almost all published results, with the exception of a new statistical algorithm called Local and Global Trend (LGT).
1. Introduction and Motivation

Artificial Neural Networks (NNs) are a family of diverse, usually nonlinear and complex, models that have been successfully applied to many Machine Learning tasks. However, NNs have a rather disappointing track record in time series forecasting, perhaps with the exception of forecasting some chaotic time series and long series, e.g. of electrical power load (Hong, et al., 2016). For example, in the M3 time series competition (Makridakis, et al., 2000) there was only one NN submission, and it was still far from the best solutions. The main issue appears to be the complexity of NNs: they have many (from dozens to thousands and millions of) parameters, and a single time series, even if preprocessed with a moving window, is typically just too short to provide enough examples for an NN to learn. This data shortage problem can be dealt with in two ways:

1. If using a group of related time series, we can train an NN on data from all of them, rather than building one model per series.
2. We can apply data augmentation to the time series, so as to increase the size of the data set.

This paper uses both approaches. The first one is facilitated by using recurrent NNs, in particular a network called Long Short-Term Memory (LSTM), which has proved very capable in natural language processing and speech recognition tasks (Graves, et al., 2013). The second approach uses output from a Markov Chain Monte Carlo-fitted statistical algorithm. Additionally, output from a standard time series algorithm (Exponential Smoothing) was used for data preprocessing.
2. Data Preprocessing for Forecasting with NN

Neural networks, like the vast majority of Machine Learning algorithms, expect constant-size inputs. A time series needs to be preprocessed to create a number of (input, output) pairs, where both input and output can be vectors. For non-recurrent NNs the inputs need to provide all the information needed to make a forecast, but creating features that have the same size and encode a short and a long time series equally well is difficult. Instead, a moving window approach is typically used: the features are extracted from a window of constant size covering the most recent part of the time series. A time series of length tsLength is converted into tsLength - outputSize - inputSize records, each of length inputSize + outputSize. Therefore, in the case of non-recurrent NNs, data outside of the current window does not influence the forecast, which may be problematic, as we clearly lose some information. Typically, normalization is also needed, as neural networks that utilize squashing functions, like tanh and sigmoid, cannot operate well with inputs substantially outside a limited range. Even if a network does not have this limitation, normalization is needed when learning from several series of different amplitude. Extracting good features from the input window is an art and an important skill. Providing an NN with just a number of recent time series values is usually not good practice. In this work, we also investigated whether some outputs of the Exponential Smoothing algorithm (ETS) could act as useful features. Auto ETS (from the Forecast package of R) was applied to all time series data that preceded and included the current moving window, and the following outputs were then used:

1. Error type (Additive or Multiplicative).
2. Trend type (Additive, Additive damped, Multiplicative). (These categorical values were encoded with one-hot encoding.)
3. The forecast vector (normalized by the last value of the input window).
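To make the moving-window preprocessing described above concrete, here is a minimal Python sketch. The function and variable names are illustrative, and the normalization is done by subtracting the logged last value of the input window (equivalent to dividing by it in the original scale); this is an assumption about the exact normalization step, not the authors' code.

```python
import numpy as np

def make_windows(series, input_size=7, output_size=6):
    """Slice one series into (input, output) pairs, log-transform the values,
    and normalize each pair by the last value of its input window."""
    log_series = np.log(np.asarray(series, dtype=float))   # assumes positive values
    records = []
    n_windows = len(log_series) - input_size - output_size + 1
    for start in range(n_windows):
        window = log_series[start:start + input_size]
        target = log_series[start + input_size:start + input_size + output_size]
        anchor = window[-1]                                 # last value of the input window
        records.append((window - anchor, target - anchor))
    return records

# Example: one short series of 14 values, as in the shortest M3 yearly series
series = [10.0, 11.2, 12.1, 13.5, 14.2, 15.8, 16.9,
          18.1, 19.0, 20.5, 21.7, 23.0, 24.1, 25.9]
pairs = make_windows(series)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```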
4. RNNs, LSTMs, and the Network Architecture

Recurrent neural networks (RNNs) have directed cycles, so they can "remember" and use some past information. This is important, as it can potentially fix the issue of losing information when preprocessing with a constant-size moving window. Unfortunately, most RNNs have a rather short memory (a few steps), so they do not improve on non-recurrent NNs. Fortunately, there is a type of RNN that does not have this problem: Long Short-Term Memory (LSTM). These networks can remember over 100 steps (Längkvist, et al., 2014), and the forgetting of old information is of an exponentially decaying type. A lot of recent progress in natural language processing and speech recognition has happened using LSTMs. RNNs are also suitable for training on many time series at the same time: they learn from all of them, but during the forecasting (replay) stage, when fed data step by step from one series, they can "zero in" on this particular series (Prokhorov, et al., 2002). It is perhaps worth pointing out here that learning across many time series will be beneficial if at least some subset of them is somehow related. Fortunately, this happens often in real life; e.g., demand for computing power across datacenters in the same geographical region is likely to be related, and the shape of a time series from new datacenters, even across different regions and starting dates, may be similar too.

An LSTM (Hochreiter, et al., 1997) can be viewed as a complicated, but single-layer, network. It contains a state vector, a mechanism to add new information to it, "forget" part of it, and output a modified version of it. There are many versions of LSTMs; this work used the one described by the following formulas:

$$i_t = \sigma\left(W^{(xi)} x_t + W^{(hi)} h_{t-1} + W^{(ci)} c_{t-1} + b^{(i)}\right)$$
$$f_t = \sigma\left(W^{(xf)} x_t + W^{(hf)} h_{t-1} + W^{(cf)} c_{t-1} + b^{(f)}\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W^{(xc)} x_t + W^{(hc)} h_{t-1} + b^{(c)}\right)$$
$$o_t = \sigma\left(W^{(xo)} x_t + W^{(ho)} h_{t-1} + W^{(co)} c_t + b^{(o)}\right)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\sigma$ is the logistic sigmoid function; $i_t$, $f_t$, and $o_t$ are vectors of the same size representing the input, forget, and output gates; and $c_t$ and $h_t$ are vectors of the same size representing the "normal" (cell) state and the hidden state of the LSTM layer. The $W$'s are weight matrices, but in this particular formulation $W^{(ci)}$, $W^{(cf)}$, and $W^{(co)}$ are diagonal (so effectively vectors). This is done so that element m of each gate vector receives input only from element m of the state vector (Graves, et al., 2013). The presence of these three matrices makes this an LSTM with "peephole connections", introduced in (Gers, et al., 2002). As mentioned above, the LSTM can be seen as a complicated, but nonetheless single, hidden layer, and it can be stacked and mixed with other, e.g. standard, layers. The network architecture used in this work additionally borrows from ResNet (He, et al., 2015) and is shown in Fig. 1.
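For concreteness, here is a minimal NumPy sketch of a single forward step of an LSTM cell with diagonal peephole connections, following the formulas above. The weights are random placeholders; in practice a toolkit such as CNTK handles the training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_peephole_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell with diagonal peephole connections.
    Input/recurrent weights are matrices; peephole weights are vectors."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

state_size, input_size = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(state_size, input_size)) for k in ("xi", "xf", "xc", "xo")}
W.update({k: rng.normal(scale=0.1, size=(state_size, state_size)) for k in ("hi", "hf", "hc", "ho")})
W.update({k: rng.normal(scale=0.1, size=state_size) for k in ("ci", "cf", "co")})  # diagonal peepholes
b = {k: np.zeros(state_size) for k in ("i", "f", "c", "o")}

h, c = np.zeros(state_size), np.zeros(state_size)
h, c = lstm_peephole_step(rng.normal(size=input_size), h, c, W, b)
print(h)
```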
Figure 1. The RNN architecture
5. Data Augmentation

In the image recognition area, where data sets are large (e.g., in recent ImageNet competitions they are on the order of 100k images), researchers have still created methods of automatically "multiplying" the data set, by slightly rotating images, changing contrast and brightness, adding noise, etc. This procedure is called data augmentation and does improve NN learning (Krizhevsky, et al., 2012). A similar idea can be applied to time series: after all, in the statistical view, a time series is considered a random realization of some underlying data-generating process (Chatfield, 2000, p. 25). One way to do this would be to use residuals from a statistical time series algorithm to generate new time series, e.g. as in (Bergmeir, et al., 2016). For Bayesian models fitted with Markov Chain Monte Carlo, another approach suggests itself: we can sub-sample (e.g., repeatedly take the mean of a few hundred out of a few thousand samples of) parameters, residuals, forecasts, etc. Here we use samples of parameters and forecast paths produced by a statistical algorithm called LGT (Local and Global Trend), a new algorithm created by S. Smyl that extends the AAZ model of the ETS classification (Additive error, Additive trend) into a flexible nonlinear model and uses a Student's t error distribution. It is fitted with a Probabilistic Programming tool called Stan (Stan Development Team, 2015), which employs a fast Markov Chain Monte Carlo engine (Hoffman & Gelman, 2014). LGT improves on all published results on the M3 data set. For more information, please refer to Appendix 1.
Figure 2. An example LGT forecast. The parameter values shown are averages over thousands of samples. Also shown are 200 sample paths and some quantiles.
2000 samples of parameters and forecast paths were saved and then subsampled, and the aggregated results were used as NN features. The parameters were normalized with "quantile normalization", i.e. the values were transformed to their empirical quantiles. Medians of the sample paths (calculated per prediction horizon) were first transformed with log() and then normalized by the last value of the input window (also "logged"). The subsample size varied from 2000 down to 100, as seen in Table 2.
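A rough sketch of how such subsampled features could be computed, assuming the MCMC draws are available as plain arrays. The function name, the exact quantile-normalization step, and the feature layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def subsampled_features(param_draws, path_draws, last_log_value,
                        subsample_size=250, n_subsamples=5, rng=None):
    """Average random subsamples of MCMC draws and turn them into NN features.

    param_draws: (n_draws, n_params) posterior parameter samples.
    path_draws:  (n_draws, horizon) sampled forecast paths (positive values).
    """
    rng = rng or np.random.default_rng()
    n_draws = param_draws.shape[0]
    features = []
    for _ in range(n_subsamples):
        idx = rng.choice(n_draws, size=subsample_size, replace=False)
        params = param_draws[idx].mean(axis=0)
        # "quantile normalization": replace each averaged parameter by its
        # empirical quantile among all draws of that parameter
        quantiles = np.array([(param_draws[:, j] <= params[j]).mean()
                              for j in range(param_draws.shape[1])])
        # median of the logged sample paths per horizon, normalized by the
        # (logged) last value of the input window
        paths = np.median(np.log(path_draws[idx]), axis=0) - last_log_value
        features.append(np.concatenate([quantiles, paths]))
    return features

rng = np.random.default_rng(0)
feats = subsampled_features(rng.normal(size=(2000, 5)),          # fake parameter draws
                            np.exp(rng.normal(size=(2000, 6))),  # fake forecast paths
                            last_log_value=0.0, rng=rng)
print(len(feats), feats[0].shape)
```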
6. Results

M3 yearly time series were used for testing. There are 645 of them, relatively short, with lengths from 14 to 40 values. The maximum prediction horizon is 6 (so the shortest time series have 20 values altogether). To measure forecasting performance, we used sMAPE, which was the main metric of the M3 competition, and MASE (Hyndman, 2006). Their formulas are as follows:

$$sMAPE = \frac{200}{h} \sum_{t=1}^{h} \frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}$$

$$MASE = \frac{h^{-1} \sum_{t=1}^{h} |y_t - \hat{y}_t|}{(n-s)^{-1} \sum_{\tau=s+1}^{n} |y_\tau - y_{\tau-s}|}$$
where h is the maximum prediction horizon, s is the seasonality (1 for non-seasonal series and e.g., 4 for quarterly series), and n is the size of in-sample data (length of past time series). Summation generally happens over prediction horizons, except for the divisor of the MASE equation, where it happens over past (in-sample) data.
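These two metrics translate directly into code; a straightforward Python implementation for a single series is sketched below (illustrative only; the numbers reported in the tables are averages over all series).

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric MAPE over the prediction horizon, on the 0..200 scale."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 200.0 / len(y) * np.sum(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mase(y, y_hat, y_insample, s=1):
    """Out-of-sample MAE scaled by the in-sample MAE of the seasonal naive
    forecast (s=1 for non-seasonal series such as the M3 yearly data)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    y_in = np.asarray(y_insample, float)
    naive_mae = np.mean(np.abs(y_in[s:] - y_in[:-s]))
    return np.mean(np.abs(y - y_hat)) / naive_mae

print(smape([10, 12], [11, 11]), mase([10, 12], [11, 11], [6, 7, 8, 9]))
```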
Table 1 lists the average values of the metrics for several algorithms:

- ETS (ZZZ) is the Exponential Smoothing algorithm from the Forecast package of R. 'ZZZ' is the model specification that allows trying all ETS versions for the best fit. No preprocessing was done, as ETS is a time series algorithm and does not require any data preprocessing.
- ARIMA is the auto.arima() function from the Forecast package of R. As in the previous case, no restrictions were placed on the choice of ARIMA models and no preprocessing was done.
- Auto NN was the single neural network entry in the M3 competition. The metrics are recalculated from the historical record.
- RBF and ROBUST-Trend were the two entries in the competition that achieved the best results (recalculated here) on yearly series as measured by sMAPE and MASE, respectively. However, across all time series, including seasonal ones, they did not perform that well.
- 'LSTM with minimal preprocessing' is the LSTM-based neural network depicted in Fig. 1. It used a moving window of size 7, with values normalized (after a log() transformation) by the last value of the input window. It is not obvious how to create better features when the window is so short, and the moving window cannot really be made larger, because the shortest time series have just 14 known values (apart from the 6 that are the answer and cannot be used for back-testing).
- 'LSTM using ETS for preprocessing' uses the same NN architecture, moving window, and normalized values as above, but adds some features from the output of the ETS algorithm, as described in Section 2.
- 'LSTM using LGT for preprocessing' uses the same NN architecture, moving window, and normalized values as 'LSTM with minimal preprocessing', but adds some features from the output of the LGT algorithm, as described in Section 5.
- LGT uses the time series data without any preprocessing.
It is disappointing that 'LSTM using LGT for preprocessing' did not beat pure LGT, although it did achieve the second best result in this comparison. However, and this is perhaps the most important finding of this paper, 'LSTM using ETS for preprocessing' did improve on 'LSTM with minimal preprocessing' and 'ETS (ZZZ)', so here using a statistical algorithm (ETS) for preprocessing was beneficial for the LSTM-based neural network and lifted the final result above the best entry of the M3 competition (as measured by sMAPE) and above ETS and ARIMA (for both metrics).
Table 1. Average performance metrics on M3 yearly time series
Algorithm/Metric                      sMAPE                 MASE
ETS (ZZZ)                             17.27                 2.86
ARIMA                                 17.12                 2.96
Auto NN (M3 participant)              18.57                 3.06
Best algorithm in M3                  16.42 (RBF)           2.63 (ROBUST-Trend)
LSTM with minimal preprocessing       16.4                  2.88
LSTM using ETS for preprocessing      15.94                 2.73
LSTM using LGT for preprocessing      15.46                 2.65
LGT                                   15.25 (sd = 0.015)    2.52 (sd = 0.004)
The row for 'LSTM using LGT for preprocessing' above is actually the best case from Table 2, where the subsample size was varied. (In every case the subsampling was repeated several times, yielding several close, but different, values.)

Table 2. Average performance metrics of 'LSTM using LGT for preprocessing' on M3 yearly time series
Subsample size (out of 2000)    sMAPE    MASE
2000                            15.65    2.77
500                             15.67    2.77
250                             15.46    2.65
100                             15.68    2.76
The smaller the subsample size, the more "fuzzified" the features. As expected, there appears to be a "sweet spot" in the degree of "fuzzification". The varying input causes the NN to produce varying output, as illustrated in Figs. 3 and 4.
Figure 3. Example NN forecasts (red) and LGT inputs (green). The time series is drawn in black and blue.
Figure 4. Example NN forecasts (red) and LGT inputs (green). The time series is drawn in black and blue.
7. Tool used: Computational Network Toolkit (CNTK)

CNTK is a powerful and easy-to-use neural network toolkit from Microsoft, open-source licensed since April 2015 and available for Linux and Windows at https://github.com/Microsoft/CNTK. The computation engine uses either Intel's MKL library (when running on a CPU) or NVidia libraries (when running on a GPU). CNTK enables easy creation of (almost) arbitrary models, including recurrent ones. As in Probabilistic Programming, it separates the model creation/description code from the learning code, which is supplied automatically. It was used by the winning system (152 layers deep) of the ImageNet 2015 competition (Huang, 2015).
8. Further work

While the results on the yearly time series from the M3 competition are good and the idea of using statistical algorithms for data preprocessing and data augmentation appears promising, the situation was quite different for the monthly M3 series. Firstly, because of their larger number and longer average length, data augmentation with an MCMC-fitted algorithm similar to LGT turned out to be impractical because of its large computational requirements. Secondly, data preprocessing with ETS did not improve the neural network performance. This is not unusual: standard NN architectures tend to have problems dealing with seasonal data (Zhang, et al., 2007) (Zhang, et al., 2005). It is possible to deseasonalize first, but then the results will depend heavily on the deseasonalization algorithm. Instead, we would like to research new neural network architectures designed for relatively short seasonal time series. CNTK allows easy experimentation with novel architectures. The search for suitable architectures could perhaps be broadened and automated by using computational intelligence approaches. Instead of using MCMC-fitted algorithms for data augmentation, one can try bootstrapping from residuals of statistical algorithms a la (Bergmeir, et al., 2016) to create new, derived time series.
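As a rough illustration of that last idea, here is a plain residual-bootstrap sketch built around a hand-rolled Holt (additive trend) fit. This is a simplified stand-in for illustration only, not the STL/Box-Cox moving-block-bootstrap procedure of (Bergmeir, et al., 2016), and the smoothing parameters are arbitrary.

```python
import numpy as np

def holt_fit(y, alpha=0.3, beta=0.1):
    """Simple Holt (additive trend) smoothing in error-correction form;
    returns one-step-ahead fitted values and residuals."""
    y = np.asarray(y, dtype=float)
    level, trend = y[0], y[1] - y[0]
    fitted = np.empty_like(y)
    for t in range(len(y)):
        fitted[t] = level + trend
        error = y[t] - fitted[t]
        level = level + trend + alpha * error
        trend = trend + alpha * beta * error
    return fitted, y - fitted

def bootstrap_series(y, n_new=10, rng=None):
    """Create derived series by resampling residuals and adding them back to
    the fitted values (i.i.d. resampling here; a block bootstrap would better
    preserve autocorrelation)."""
    rng = rng or np.random.default_rng(0)
    fitted, resid = holt_fit(y)
    return [fitted + rng.choice(resid, size=len(resid), replace=True)
            for _ in range(n_new)]

derived = bootstrap_series([10.0, 11.2, 12.1, 13.5, 14.2, 15.8, 16.9, 18.1])
print(len(derived), derived[0])
```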
9. Summary

We used some outputs of ETS for preprocessing time series. We also used averages of subsamples from the parameters and forecasts of an MCMC-fitted Bayesian time series algorithm for data augmentation. Both of these methods appear to improve the forecasting performance of an LSTM-based neural network, as tested on the 645 yearly series of the M3 competition. There are many interesting possibilities for improving RNN-based time series forecasting systems with alternative data preprocessing, data augmentation, and neural network architectures.
Appendix 1

Local and Global Trend (LGT) is a time series forecasting model derived from Exponential Smoothing's AAN (Additive error, Additive trend, No seasonality) version in the ETS classification. It can be viewed as a generalization of the additive and multiplicative trend and error Exponential Smoothing models. The model has the following features which differentiate it from the AAN model:
a. Student's t-distribution of error

Student's t-distribution is flexible: it can have fat tails when the parameter ν (degrees of freedom) is sufficiently small (close to 1, where it approaches the Cauchy distribution), but for ν over, say, 20 it approaches the normal distribution. The ν parameter is just one of the parameters being fitted, so the model automatically adjusts both to relatively calm time series, like those of the M3 competition, and to more volatile ones, such as those found in online services.
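A quick numerical illustration of this tail behavior (probability of an observation more than three scale units above the mean, for a few values of ν chosen purely for illustration):

```python
from scipy import stats

# Tail mass beyond 3 for Student's t with various degrees of freedom vs. the normal
for nu in (1, 3, 20):
    print(f"nu={nu:>2}: P(T > 3) = {stats.t.sf(3, df=nu):.4f}")
print(f"normal: P(Z > 3) = {stats.norm.sf(3):.4f}")
```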
b. Nonlinear global trend and reduced local linear trend

The following function is used to encode a global trend:

$$\mu_t = l_{t-1} + \gamma \, l_{t-1}^{\rho}$$

where $l_{t-1}$ is the previous level, $\rho$ varies within [0, 1], and $\rho$ and $\gamma$ are scalars, constant for the whole time series. This flexible trend function is useful in cloud services, where time series tend to grow faster than linearly but slower than exponentially. If the $\rho$ parameter is allowed to take negative values (e.g., to vary from -0.5 to 1), then for a growing trend we get a damped behavior. Additionally, the model uses a local linear trend in the form $\lambda \, b_{t-1}$, where $\lambda$ is a value between -1 and 1 and in practice is almost always between 0 and 1. This has the effect of a reduced-strength local trend $b_{t-1}$, so the model is sensitive to the latest changes, but to a lesser extent than the original AAN model. (A negative $\lambda$ coefficient would indicate counter-trending behavior, but it is rare.) To sum up, the expected value is

$$\mu_t = l_{t-1} + \gamma \, l_{t-1}^{\rho} + \lambda \, b_{t-1}$$

It is assumed, and enforced during the simulation (forecasting) stage, that the levels and forecast values are positive. All the parameters, such as $\gamma$, $\rho$, and $\lambda$, are fitted by MCMC; only prior distributions are assumed.
c. Nonlinear heteroscedasticity

The scale parameter of the Student's t-distribution is also expressed by a global, nonlinear function of a similar form: $\sigma \, l_{t-1}^{\tau} + \varsigma$, where $\sigma$ and $\varsigma$ are positive and $\tau$ varies within [0, 1]. This is quite a general function: when $\tau$ is small, close to 0, the scale becomes constant and the model therefore becomes homoscedastic, while for large values of $\tau$, close to 1, the scale at each step increases proportionally to the level, which approaches the behavior of the multiplicative-error models. The full likelihood specification is as follows:

$$y_t \sim \mathrm{Student}\left(\nu,\; l_{t-1} + \gamma \, l_{t-1}^{\rho} + \lambda \, b_{t-1},\; \sigma \, l_{t-1}^{\tau} + \varsigma\right) \qquad (4)$$
It is perhaps worth repeating that all parameters, including $\tau$ and $\nu$, are fitted, not assumed.
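To make the model concrete, below is a small simulation of the data-generating process implied by the likelihood above. The expected value and the error scale follow the formulas in the text; the level and trend updates are assumed to follow a Holt-style (AAN-like) recursion, since the paper does not spell them out, and all parameter values are arbitrary illustrations (the actual model is fitted with Stan/MCMC, not simulated this way).

```python
import numpy as np

def simulate_lgt(n, l0=10.0, b0=0.2, alpha=0.3, beta=0.1, gamma=0.5,
                 rho=0.5, lam=0.4, sigma=0.05, tau=0.5, zeta=0.1, nu=6,
                 seed=0):
    """Simulate one path from the LGT likelihood (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    level, trend = l0, b0
    y = np.empty(n)
    for t in range(n):
        mu = level + gamma * level ** rho + lam * trend      # expected value
        scale = sigma * level ** tau + zeta                  # heteroscedastic error scale
        y[t] = mu + scale * rng.standard_t(nu)               # Student's t error
        # assumed Holt-style level/trend updates (not given explicitly in the text)
        new_level = max(alpha * y[t] + (1 - alpha) * (level + trend), 1e-6)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return y

print(np.round(simulate_lgt(12), 2))
```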
d. Performance of the model on the M3 dataset

The M3 time series and forecasts are from the Mcomp R package (Hyndman, et al., 2013). The following metrics are reported in the table below: symmetric MAPE (sMAPE), which was the main metric of the M3 competition (Makridakis, et al., 2000); Mean Absolute Scaled Error (MASE) (Hyndman, 2006); weighted MAPE (wMAPE); and MAPE.

$$MAPE = \frac{100}{h} \sum_{t=1}^{h} \frac{|y_t - \hat{y}_t|}{|y_t|}$$

$$sMAPE = \frac{200}{h} \sum_{t=1}^{h} \frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}$$

$$wMAPE = 100 \cdot \frac{\sum_{t=1}^{h} |y_t - \hat{y}_t|}{\sum_{t=1}^{h} |y_t|}$$

$$MASE = \frac{h^{-1} \sum_{t=1}^{h} |y_t - \hat{y}_t|}{(n-s)^{-1} \sum_{\tau=s+1}^{n} |y_\tau - y_{\tau-s}|}$$
where h is the maximum prediction horizon, s is the seasonality (1 for non-seasonal series and e.g., 4 for quarterly series), and n is the size of in-sample data (length of past time series). Summation generally happens over prediction horizons, except for the divisor of the MASE equation, where it happens over past (in-sample) data. In Table 3, the metrics are further averaged over all series belonging to a particular category, and the LGT model is compared to the best per-category/per-metric algorithms of the M3 competition; auto ETS; auto ARIMA from the R Forecast package (Hyndman, et al., 2008); and the mean of auto ETS and auto ARIMA, called Hybrid. Additionally, for yearly, quarterly, and monthly series, we reproduce sMAPE and MASE for the best variants of bagged algorithms in (Bergmeir, et al., 2016).
Table 3. Overall performance on M3 non-seasonal time series
Yearly series
Algorithm               sMAPE               MASE                  wMAPE               MAPE
LGT                     15.23, sd=0.015     2.50, sd=0.004        15.72, sd=0.015     19.73, sd=0.027
Best algorithm in M3    16.42 (RBF)         2.63 (ROBUST-Trend)   16.60 (ForcX)       19.95 (AutoBox2)
Hybrid                  16.73               2.85                  17.40               21.50
ETS (ZZZ)               17.37               2.86                  17.65               21.92
ARIMA                   17.12               2.96                  18.10               22.07
Best of Bagged ETS      17.80 (BLD.Sieve)   3.15 (BLD.MBB)        -                   -

Other series
Algorithm               sMAPE               MASE                  wMAPE               MAPE
LGT                     4.26, sd=0.017      1.72, sd=0.002        4.35, sd=0.018      4.66, sd=0.020
Best algorithm in M3    4.38 (ARARMA)       1.86 (AutoBox2)       4.47 (ARARMA)       4.68 (ARARMA)
Hybrid                  4.33                1.79                  4.39                4.70
ETS (ZZZ)               4.33                1.79                  4.43                4.76
ARIMA                   4.46                1.83                  4.54                4.82
Fig. 5 displays the two main metrics, sMAPE and MASE, calculated per prediction horizon (wMAPE cannot be aggregated over unrelated series, and for the sake of space, MAPE is skipped). The comparison includes LGT; the best per-category/per-metric algorithms of the M3 competition; auto ETS; auto ARIMA; Hybrid; and a few other algorithms that were among the best performers in the M3 competition.
[Figure 5: two panels plotting MASE and sMAPE against forecast horizon (1-6), comparing LGT, ROBUST-Trend / RBF, THETA, hybrid, ForecastX, ETS, ARIMA, and ForecastPro.]
Figure 5. Average error per horizon, M3 yearly series
Bibliography

Bergmeir, C., Hyndman, R. J., and Benítez, J. R. (2016). Bagging exponential smoothing methods using STL decomposition and Box-Cox transformation. International Journal of Forecasting, 32, 303-312.

Chatfield, C. (2000). Time-Series Forecasting. CRC Press.

Gers, F., Schraudolph, N., and Schmidhuber, J. (2002). Learning Precise Timing with LSTM Recurrent Networks. Journal of Machine Learning Research, 115-143.

Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

He, K., et al. (2015). Deep Residual Learning for Image Recognition. Dec 10, 2015. https://arxiv.org/abs/1512.03385.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Hong, T. and Fan, S. (2016). Probabilistic electric load forecasting: A tutorial review. International Journal of Forecasting, 32(3).

Huang, X. (2015). Microsoft Computational Network Toolkit offers most efficient distributed deep learning computational performance. Microsoft Research Blog, Dec 7, 2015. https://www.microsoft.com/en-us/research/microsoft-computational-network-toolkit-offers-most-efficient-distributed-deep-learning-computational-performance/ (accessed Jul 11, 2016).

Hyndman, R. J. (2006). Another Look at Forecast-Accuracy Metrics for Intermittent Demand. Foresight, 43-46.

Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3), 1-22.

Hyndman, R. J., Akram, M., and Bergmeir, C. (2013). Mcomp package for R. http://robjhyndman.com/software/mcomp/ (accessed Feb 14, 2016).

ImageNet (2014). Large Scale Visual Recognition Challenge 2014 (ILSVRC2014). http://image-net.org/challenges/LSVRC/2014/index#data (accessed Feb 17, 2016).

International Institute of Forecasters (2015). M3-Competition. http://forecasters.org/resources/time-series-data/m3-competition/ (accessed Feb 17, 2016).

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25, 1097-1105.

Makridakis, S. and Hibon, M. (2000). The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16, 451-476.

Prokhorov, D. V., Feldkamp, L. A., and Tyukin, I. Y. (2002). Adaptive behavior with fixed weights in RNN: an overview. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN).

Stan Development Team (2015). Stan Modeling Language Users Guide and Reference Manual, Version 2.9.0. http://mc-stan.org/ (accessed Feb 17, 2016).

Zhang, P. G. and Kline, D. M. (2007). Quarterly Time-Series Forecasting With Neural Networks. IEEE Transactions on Neural Networks, 1800-1814.

Zhang, P. G. and Qi, M. (2005). Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 501-514.