Proceedings of the Fourth IASTED International Conference Advances in Computer Science and Technology (ACST 2008) April 2-4, 2008 Langkawi, Malaysia ISBN CD: 978-0-88986-730-7
A COMPARISON OF NEURAL NETWORK METHODS AND BOX-JENKINS MODEL IN TIME SERIES ANALYSIS

Ong Hong Choon and Javin Lee Tze Chuin
School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Pulau Pinang, Malaysia.
[email protected], [email protected]
ABSTRACT
Many neural network methods have been proposed and applied to time series forecasting over the past decade. To gain a better understanding, a comparison is made with the more conventional time series method by describing the advantages and disadvantages of using neural networks. The time series prediction capabilities of the multi-layered perceptron (MLP) neural network model and the Box-Jenkins time series model (SARIMA) are also compared in this paper. The data used for analysis and forecasting is a 10-year monthly time series (January 1996 - December 2005) of temperature records for Bayan Lepas, Penang, Malaysia. To show the effectiveness of prior data processing, four sets of MLP simulation programs are used: the original data (O), the linearly detrended model (DTL), the deseasonalized model (DS), and the model that is both detrended and deseasonalized (DSTL). Results show that neural network methods with prior data processing perform better in time series forecasting than the Box-Jenkins method.
KEY WORDS
Time series, neural network, multi-layered perceptron, forecasting

1. Introduction

Recently, neural networks have been widely used in many fields such as finance, medical sciences, engineering, survival analysis and more. Various types of neural network models have been proposed and modified by researchers, and most of them have provided promising results. In time series analysis, besides the Box-Jenkins methods which are commonly used for forecasting, neural networks are also widely applied, and in most cases they have been shown to produce more accurate forecasts than the Box-Jenkins methods. An Artificial Neural Network (ANN) is an information processing paradigm that is akin to the parallel architecture of the brain. It is composed of a large number of interconnected entities called neurons working in unison to solve specific tasks. A biological neuron may receive up to 10,000 different inputs, and may send its output to many other neurons (see [1]). The neurons in an ANN likewise receive information as input from other neurons and send their output to other neurons, just as biological neurons do.

2. A Comparison Study

2.1 Neural Network Methods in Time Series Analysis
Time series forecasting models have been widely used to predict future values. However, the use of neural network models is also increasing due to their flexibility and accurate forecasts. Therefore a comparison is made in this section between neural network models, the MLP in particular, and time series models. The performance of the standard MLP with the error back-propagation algorithm is determined by the learning rate and momentum factor. Usually these two values are determined heuristically in order to obtain an optimal result. Du et al. (see [2]) applied an optimum block-adaptive learning rate back-propagation algorithm for training feedforward neural networks to overcome the difficulty in determining the optimum values of these two parameters. The results in their study showed that this method can determine the parameters effectively, thus providing a significant improvement in learning speed over the standard back-propagation algorithm.

2.2 Advantages of Neural Networks
A few benefits of using neural networks in time series are as follows:
(a) Nonlinearity: Linearity of a process allows researchers to make certain assumptions and approximations so that results can be computed simply. These assumptions cannot be made easily for a nonlinear process, so the behaviour of a nonlinear process is often difficult or impossible to model or predict. The neurons in a neural network, however, are able to approximate a nonlinear process easily and effectively, thus producing promising results.
(b) Adaptivity: An ANN is an adaptive system that changes its structure based on the information that flows through the network; in short, an ANN learns by example (see [3]).
(c) Graceful degradation: Also known as fault tolerance, this is the ability of a system to continue operating at some reduced level of performance after one of its components fails to work.
In a neural network, if one of the neurons is damaged, the network's performance will not degrade dramatically.
(d) More like a real nervous system: The structure of a neural network draws its inspiration from the real nervous system. Therefore a better understanding of the nervous system allows researchers to make improvements to neural networks so that they perform more effectively.
(e) Parallel organization: The computation in a neural network is parallel in nature, which permits fast solutions to certain problems. Special hardware devices are being designed to take advantage of this capability (see [4]).
(f) Input-output mapping: A neural network is trained on the input data, and the training is repeated until it reaches a stable state where the error between the desired response and the actual response is minimized. This helps to ensure that the prediction is more accurate.
(g) Different modifications produce different results: In neural networks, many types of algorithms and different activation functions can be used. By utilizing those algorithms effectively, accurate results can be obtained (see [5]).
2.3 Disadvantages of Neural Networks
Researchers have found some disadvantages in using neural networks:
(a) Neural networks are "slow learners"; they may take a long time to train and converge to a solution (see [6]).
(b) Neural networks may not always produce better results than other methods.
(c) The parameters of a neural network need to be tuned by hand or by hill-climbing methods in order to obtain an optimized result, which is time consuming (see [7]).
(d) A neural network can produce accurate forecasts, but it is not able to provide explanations of the data.
2.4 Comparison between Neural Network Models and Time Series Models
A comparison is made between neural network models and time series models, and is summarized in Table 1.

Standard time series methods are limited by the underlying assumptions of the model, such as stationarity, seasonality or the length of the time series. Neural network models, however, are not limited by such model assumptions, nor by noise, irregular sampling or the shortness of the time series (see [8]).

Artificial neural networks have been widely accepted as a potentially useful method for modelling complex nonlinear and dynamic systems (see [9]). For time series models, nonlinear processes are difficult to handle compared with neural networks. Nonlinear processes, which may produce chaotic time series, are normally modelled using the Autoregressive Conditional Heteroskedasticity (ARCH) model, which is more time consuming and troublesome than the usual ARMA model (see [10]).

In neural networks, the minimization of an error function is used to determine the best parameters or weights in order to produce an optimal result. In time series, the maximum likelihood function is used to estimate the parameters when fitting a Box-Jenkins model (see [11]). In neural networks, different types of networks can be used when forecasting time series data to produce various outcomes, and different training algorithms within a particular network can also produce different results. Most of these outcomes have been shown to be reliable, so by using different neural network architectures researchers can obtain very accurate forecasts of time series data (see [5]). With the Box-Jenkins approach, only a limited number of models can be obtained, and in most cases only one model gives accurate forecast values. Therefore, unlike neural networks, time series models are unable to provide various reliable outcomes for researchers to choose from.

Most of the unknowns in a neural network, such as the learning rate and the number of neurons, are determined by hill-climbing methods, which is time consuming (see [7]). In time series, the parameters of the model are estimated using the maximum likelihood function, which is less time consuming than the neural network approach.

3. Methodology

3.1 Neural Network Model
In this paper, the Multi-layered Perceptron (MLP) model will be used as it is the most popular network (see [12]). The MLP is a feedforward neural network which is trained with the standard back-propagation training method. With an input layer, one or more hidden layers and an output layer, the network is able to approximate virtually any input-output map (see [8]).

The MLP is a class of network that consists of multiple layers of computational units, usually interconnected in a feedforward way which only allows the data to flow forward to the output without any feedback (see [13]). The MLP is normally trained with the standard error back-propagation training method.

The error back-propagation training method is a gradient descent technique for minimizing the error. This method uses error-correction learning, where the desired response of the system must be known in order to train the network. Once trained, the weights are held fixed and are used with new input samples to generate output values (see [14]).
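The following is a minimal sketch, not the authors' simulation program, of the kind of MLP and back-propagation training just described; the 12-month input window, the 8 hidden tanh neurons, the fixed learning rate and the absence of a momentum term are all illustrative assumptions.

```python
import numpy as np

def make_windows(series, lag=12):
    """Use the previous `lag` observations as inputs and the next value as target."""
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    y = np.array(series[lag:])
    return X, y

def train_mlp(X, y, hidden=8, lr=0.01, epochs=2000, seed=0):
    """One-hidden-layer MLP with tanh units, trained by batch gradient descent
    on the mean squared error (a simplified stand-in for standard back-propagation)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)              # hidden activations
        pred = h @ W2 + b2                    # linear output neuron
        err = pred - y                        # forecast error
        # Back-propagate the error to each layer's weights.
        gW2 = h.T @ err / len(y)
        gb2 = err.mean()
        gh = np.outer(err, W2) * (1 - h ** 2)
        gW1 = X.T @ gh / len(y)
        gb1 = gh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Illustrative usage with a monthly series `temps` (scaled before training):
# z = (temps - temps.mean()) / temps.std()
# X, y = make_windows(z, lag=12)
# params = train_mlp(X[:96], y[:96])      # first 108 observations -> 96 training windows
# preds = predict(X[96:], *params)
```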
In this study, four sets of simulation programs for the MLP will be used: the original data (O), the linearly detrended model (DTL), the model deseasonalized with the smoothing moving average technique (DS), and the model that is both detrended with a linear time model and deseasonalized (DSTL).

The O model is the original model, where the past data are fed as inputs to generate the forecast values as outputs. Prior data processing is then applied to the O model to develop the DTL, DS and DSTL models. The DTL model deals with the trend component, whereas the DS model deals with the seasonal component. Lastly, both detrending and deseasonalizing are applied to develop the DSTL model. In time series, the trend and seasonal components of the data are large-scale deterministic components which should be removed before fitting the model. The same idea is applied to the neural network model so that the best forecast model can be obtained. A sketch of such preprocessing is given below.
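The following sketch illustrates one plausible way of carrying out the detrending and deseasonalizing steps (the exact smoothing details used by the authors are not specified, so these are assumptions): a least-squares line is removed for DTL, a 12-month moving-average seasonal estimate is removed for DS, and both are applied for DSTL.

```python
import numpy as np

def detrend_linear(series):
    """DTL: fit y = a + b*t by least squares and subtract the fitted line."""
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series))
    b, a = np.polyfit(t, series, 1)
    trend = a + b * t
    return series - trend, trend

def deseasonalize_moving_average(series, period=12):
    """DS: smooth with a moving average of length `period`, then remove the
    average deviation of each calendar month from the smoothed series."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(period) / period
    smooth = np.convolve(series, kernel, mode="same")   # rough trend-cycle estimate
    deviation = series - smooth
    seasonal = np.array([deviation[i::period].mean() for i in range(period)])
    seasonal_full = np.tile(seasonal, int(np.ceil(len(series) / period)))[:len(series)]
    return series - seasonal_full, seasonal_full

# DSTL: apply both steps, e.g. detrend first, then deseasonalize the residual.
```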
3.2 Box-Jenkins Time Series Model
In time series, the most popular forecasting method is the Box-Jenkins family of models, or the ARIMA models. Box-Jenkins models predict a variable's present or future values from its past values, and have been shown to provide accurate forecasts in many applications. The original data are used to build seasonal ARIMA models, since the monthly temperature data involve a seasonal component.

There are three stages in ARIMA modelling: (1) identification of the initial p, d, q, P, D and Q parameters by using the autocorrelation function (ACF) and the partial autocorrelation function (PACF); (2) estimation of the parameters that have been identified; (3) diagnostic checking of the tentatively identified model to ensure that it is adequate. If it is not adequate, the model has to be modified and improved (see [15]).

3.2.1 Identification of the Parameters
The identification procedure is applied to the time series data to obtain some idea of the values of the p, d, q, P, D and Q parameters in the SARIMA (p, d, q) x (P, D, Q) model. This is done by examining the ACF and the PACF.

In most time series software, two types of graphs can be computed: the autocorrelation function (ACF) versus lag k, and the partial autocorrelation function (PACF) versus lag k. The behaviour of the ACF and PACF is shown in Table 2.

The Akaike Information Criterion (AIC) is a measure of the goodness of fit of an estimated statistical model. It is an operational way of trading off the complexity of an estimated model against how well the model fits the data. It is defined as AIC = -2 ln(L) + 2r, where L is the maximized likelihood function and r denotes the number of parameters. A smaller value of AIC indicates a better model. The maximum likelihood function will be discussed in the next section.

3.2.2 Estimation of the Parameters
Once the tentative model is identified, the second step of the Box-Jenkins method is to estimate the parameters of the model. A general ARMA(p, q) model with unknown parameters φ = {φ_1, φ_2, ..., φ_p}, θ = {θ_1, θ_2, ..., θ_q} and σ_ε² = E(ε_t²) is defined as

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \ldots + \phi_p Y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \ldots - \theta_q \varepsilon_{t-q}$$

Here Y_t is a stationary time series and {ε_t} is normally distributed, i.e. {ε_t} ~ N(0, σ_ε²). There are a few criteria for estimating the parameters and fitting the best model; popular ones are Maximum Likelihood Estimation and Least Squares Estimation.

3.2.3 Diagnostic Checking
After the model has been identified and the parameters have been estimated, diagnostic checking has to be done to verify that the model is adequate.

3.2.3.1 Box-Pierce Test and Ljung-Box Test
The Box-Pierce test and the Ljung-Box test are used to test for correlation in the residuals. The first k (k > 10) autocorrelations of the residuals can be tested with the Box-Pierce statistic

$$Q(k) = N \sum_{i=1}^{k} r_i^2$$

or with the Ljung-Box statistic

$$Q^*(k) = N(N+2) \sum_{i=1}^{k} \frac{r_i^2}{N - i}$$

Here N represents the number of observations, k is the number of autocorrelation coefficients being tested, and r_i is the autocorrelation coefficient at lag i. The null hypothesis is that none of the autocorrelation coefficients up to lag k differs from zero, which means that the residuals are independent and the model is adequate.
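To make the computation concrete, the following minimal sketch (illustrative, not taken from the paper) evaluates the Ljung-Box statistic directly from the formula above; the residual array and the degrees-of-freedom correction for the number of estimated parameters are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import chi2

def autocorr(x, lag):
    """Sample autocorrelation coefficient r_i at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(x[lag:] * x[:-lag]) / np.sum(x ** 2)

def ljung_box(residuals, k=12, n_params=2):
    """Q*(k) = N(N+2) * sum_{i=1..k} r_i^2 / (N - i), compared with a
    chi-square distribution on k - n_params degrees of freedom."""
    N = len(residuals)
    r = np.array([autocorr(residuals, i) for i in range(1, k + 1)])
    q_star = N * (N + 2) * np.sum(r ** 2 / (N - np.arange(1, k + 1)))
    p_value = chi2.sf(q_star, df=k - n_params)
    return q_star, p_value
```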
3.3 Criteria for Comparison
There are a few common types of error measurements used to compare the model fitting and forecasting performance of the Box-Jenkins model and the neural network models. Three of these measurements will be used in this study: the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the mean square error (MSE), defined by

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \times 100$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2$$
Here y_t is the true value from the data and ŷ_t is the estimated or forecast value from the neural network or Box-Jenkins model. The model with the smaller values of the error measurements is the better model.
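A direct translation of these three measures into code (a simple sketch; the array names are illustrative):

```python
import numpy as np

def error_measures(y_true, y_pred):
    """Return MAE, MAPE (in percent) and MSE for a forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    mse = np.mean((y_true - y_pred) ** 2)
    return mae, mape, mse
```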
4. Results

To compare the two methods, 120 monthly temperature observations were obtained from the Department of Meteorology Malaysia. The 120 observations cover January 1996 to December 2005 at Bayan Lepas, Penang. The monthly temperature data, in degrees Celsius, are presented in Figure 1. The data are divided into two subsets: a training set consisting of the first 108 observations and a testing set consisting of the last 12 observations.
To obtain the Box-Jenkins model, we first compute the ACF of the data. It can be clearly observed from Figure 1 that there is a seasonal pattern in the series, which suggests that a transformation is needed. The ACF is therefore generated and is shown in Figure 2. Figure 2 shows that the ACF tails off in a sine-wave pattern, suggesting that the series is not stationary; the sine-wave pattern also suggests that there are negative coefficients in the model. Since the series is not stationary, first differencing should be done. Moreover, the ACF shows a large value at lag 12, suggesting that the series may contain a seasonal component. The ACF after first differencing is shown in Figure 3; it shows a large spike at lag 12, suggesting a seasonal pattern with period 12 in the series. Thus, a seasonal differencing of lag 12 should be applied to the series. The ACF and PACF after seasonal differencing of 12 lags are shown in Figure 4 and Figure 5. Both of them tail off rapidly to zero, indicating that the series is now stationary after a differencing of lag 1 and a seasonal differencing of lag 12. This suggests a seasonal ARIMA model, SARIMA (p, d, q) x (P, D, Q) with seasonal period 12, where d and D are both equal to 1.

Both the ACF and PACF show large spikes at lag 1 and at lag 12. It is not sensible to assume that both the ACF and the PACF cut off, so we can either assume that the ACF tails off and the PACF cuts off, or that the ACF cuts off and the PACF tails off. In the first case, the PACF cutting off at lag 1 with the ACF tailing off suggests an AR(1) process, and the PACF cutting off at lag 12 with the ACF tailing off suggests a seasonal AR (SAR) process, giving the model SARIMA (1, 1, 0) x (1, 1, 0)12. In the second case, the ACF cutting off with the PACF tailing off suggests MA(1) and seasonal MA (SMA) components, giving SARIMA (0, 1, 1) x (0, 1, 1)12. Since there are two candidate models, we choose between them by comparing their AIC values. SARIMA (0, 1, 1) x (0, 1, 1)12 has the smaller AIC value, 0.7729, compared with 1.1423 for SARIMA (1, 1, 0) x (1, 1, 0)12. Therefore the tentative model is chosen to be SARIMA (0, 1, 1) x (0, 1, 1)12, which can be written as

$$(1 - B)(1 - B^{12})Y_t = (1 - \theta_1 B)(1 - \theta_{12} B^{12})\varepsilon_t$$
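The identification and AIC comparison above were carried out in Minitab. As a rough illustration of the same workflow (not the authors' procedure), the two candidate models could be fitted and compared with statsmodels; note that statsmodels reports AIC on a different scale from the values quoted above, and the data file name and variable `temps` are assumptions.

```python
import numpy as np
import statsmodels.api as sm

temps = np.loadtxt("bayan_lepas_monthly_temps.txt")   # hypothetical file with the 120 readings
train, test = temps[:108], temps[108:]

candidates = [((1, 1, 0), (1, 1, 0, 12)),   # SARIMA (1,1,0) x (1,1,0)12
              ((0, 1, 1), (0, 1, 1, 12))]   # SARIMA (0,1,1) x (0,1,1)12

for order, seasonal_order in candidates:
    fit = sm.tsa.SARIMAX(train, order=order,
                         seasonal_order=seasonal_order).fit(disp=False)
    print(order, seasonal_order, "AIC =", fit.aic)

# Fit the chosen model and forecast the 12 held-out months.
best = sm.tsa.SARIMAX(train, order=(0, 1, 1),
                      seasonal_order=(0, 1, 1, 12)).fit(disp=False)
forecast = best.forecast(steps=12)
```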
After the model has been identified and the parameters estimated, diagnostic checking has to be done to verify that the model is adequate. This can be done using the Box-Pierce or the Ljung-Box test; in this study we only consider the Ljung-Box test. For diagnostic checking, the null hypothesis is that the residuals are independent and the model is adequate. From Figure 6, the Ljung-Box test gives p-values greater than 0.05 for lags 12, 24, 36 and 48. Therefore we do not reject the null hypothesis, and we conclude that the residuals are independent and the model is adequate. The error measurements are then computed using the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the mean square error (MSE).
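For completeness, a hedged sketch of the same diagnostic using the statsmodels Ljung-Box implementation is given below; the data file name is again an assumption, and the numbers will not reproduce the Minitab output exactly.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

temps = np.loadtxt("bayan_lepas_monthly_temps.txt")   # hypothetical file (see above)
fit = sm.tsa.SARIMAX(temps[:108], order=(0, 1, 1),
                     seasonal_order=(0, 1, 1, 12)).fit(disp=False)

# Ljung-Box test on the residuals at the lags reported in Figure 6.
print(acorr_ljungbox(fit.resid, lags=[12, 24, 36, 48]))
```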
Table 1: Comparison between Neural Network Models and Time Series Models

Neural Network Models                            | Time Series Models
Not limited by certain assumptions.              | Limited by certain assumptions.
Handle nonlinear processes easily.               | Difficult to handle nonlinear processes.
Unable to provide explanation of the data.       | Able to provide explanation of the data.
Use the minimization of an error function.       | Use the maximum likelihood function.
More reliable outcomes can be obtained.          | Fewer reliable outcomes can be obtained.
More time consuming in determining parameters.   | Less time consuming in determining parameters.
Widely used in various areas.                    | Usage is limited.
Table 2: Behaviour of ACF and PACF

Process    | ACF                                                             | PACF
AR(p)      | Tails off towards zero (exponential decay or damped sine wave)  | Cuts off after lag p
MA(q)      | Cuts off after lag q                                             | Tails off towards zero (exponential decay or damped sine wave)
ARMA(p,q)  | Tails off after lag (q-p)                                        | Tails off after lag (p-q)
Table 3: Error Measurements for each Model

Model Type | MAE    | MAPE   | MSE
SARIMA     | 0.3159 | 1.1453 | 0.1806
ANN (DSTL) | 0.2990 | 1.0854 | 0.1807
ANN (DS)   | 0.7951 | 2.8252 | 0.7350
ANN (DTL)  | 0.5680 | 2.0544 | 0.5086
ANN (O)    | 0.7612 | 2.7152 | 0.6638
Figure 1: Time Series Plot for Monthly Temperature in Bayan Lepas, 1996 – 2005
Figure 2: ACF for Monthly Temperature (autocorrelation function for Y(t), with 5% significance limits, lags 1-50)
Figure 3: ACF after First Differencing (with 5% significance limits, lags 1-50)

Figure 4: ACF after Seasonal Differencing (with 5% significance limits, lags 1-50)

Figure 5: PACF after Seasonal Differencing (with 5% significance limits, lags 2-26)

Figure 6: Ljung-Box Result from Minitab (modified Box-Pierce (Ljung-Box) chi-square statistic)

Lag        | 12    | 24    | 36    | 48
Chi-Square | 10.7  | 15.8  | 25.6  | 31.3
DF         | 9     | 21    | 33    | 45
P-Value    | 0.294 | 0.783 | 0.816 | 0.940
5. Conclusion

Table 3 shows the error measurements for the various models. The MAE, MAPE and MSE values for the testing set range from 0.2990 to 0.7951, from 1.0854 to 2.8252 and from 0.1806 to 0.7350 respectively. The best two models are the SARIMA model and the DSTL model. The DSTL model has lower MAE and MAPE than SARIMA, and the two models have almost the same MSE; we therefore conclude that DSTL is the best forecasting model. Comparing the DTL and DS models, DTL has lower MAE, MAPE and MSE than DS, which suggests that detrending may be a more important factor in forecasting than deseasonalizing. It can also be seen that the O, DTL and DS models do not perform better than the SARIMA model, so not every neural network model outperforms the Box-Jenkins model. Prior data processing allows the neural network to perform better and more effectively. Overall, DSTL and SARIMA are the best two forecasting models, and DSTL performs better than SARIMA, suggesting that a neural network can give more accurate forecasts than the SARIMA model.
Time series models are often limited by underlying assumptions such as stationarity, trend and seasonal patterns, and the noise process. In contrast, neural networks are not limited by these assumptions. However, determining the parameters of a neural network in order to obtain an optimized result is time consuming.

Acknowledgements
We would like to thank the Chief Director of the Department of Meteorology Malaysia for providing us with the monthly temperature data. This work was supported in part by the U.S.M. Fundamental Research Grant Scheme (FRGS) grant no. 203/PMATH/671041.

References
[1] S. Leslie, An introduction to neural networks, 1996. [Online]. [Accessed 20th March 2007]. Available from World Wide Web: http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html
[2] L. M. Du, Z. Q. Hou & Q. H. Li, Optimum block-adaptive learning algorithm for error back-propagation networks, IEEE Transactions on Signal Processing, 40(12), 1992, 3032-3042.
[3] Wikipedia, Artificial neural network, 2007. [Online]. [Accessed 29th April 2007]. Available from World Wide Web: http://en.wikipedia.org/wiki/Artificial_neural_network
[4] C. Stergiou & D. Siganos, Neural networks, 1996. [Online]. [Accessed 2nd April 2007]. Available from World Wide Web: http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
[5] C. Harpham & C. W. Dawson, The effect of different basis functions on a radial basis function network for time series prediction: A comparative study, Neurocomputing, 69, 2006, 2161-2170.
[6] D. D. Olmsted, History and principles of neural networks from 1960 to 1990, 2006. [Online]. [Accessed 15th April 2007]. Available from World Wide Web: http://www.neurocomputing.org/NNHistoryTo1990.aspx
[7] V. M. Rivas, J. J. Merelo, P. A. Castillo, M. G. Arenas & J. G. Castellano, Evolving RBF neural networks for time series forecasting with EvRBF, Information Sciences, 165, 2004, 207-220.
[8] M. Ture & I. Kurt, Comparison of four different time series methods to forecast Hepatitis A virus infection, Expert Systems with Applications, 31, 2006, 41-46.
[9] M. Traeger, A. Eberhart, G. Geldner, A. M. Morin, C. Putzke, H. Wulf & L. H. Eberhart, Prediction of postoperative nausea and vomiting using an artificial neural network, Anaesthesist, 52, 2003, 1132-1138.
[10] Wikipedia, Time series, 2007. [Online]. [Accessed 31st May 2007]. Available from World Wide Web: http://en.wikipedia.org/wiki/Time_series
[11] G. Janacek, Practical time series (New York: Oxford University Press Inc, 2001).
[12] M. Ture, I. Kurt, A. T. Kurum & K. Ozdamar, Comparing classification techniques for predicting essential hypertension, Expert Systems with Applications, 29(3), 2005, 583-588.
[13] A. Azadeh, S. F. Ghaderi & S. Sohrabkhani, Forecasting electrical consumption by integration of neural network, time series and ANOVA, Applied Mathematics and Computation, 186(2), 2007, 1753-1761.
[14] K. Mehrotra, C. K. Mohan & S. Ranka, Elements of artificial neural networks (Cambridge: The MIT Press, 1996).
[15] G. E. P. Box & G. Jenkins, Time series analysis, forecasting and control (San Francisco: Holden-Day, 1976).