Forecasting Method Selection Using ANOVA and Duncan Multiple Range Tests on Time Series Dataset
Adhistya Erna Permanasari, Dayang Rohaya Awang Rambli, P. Dhanapal Durai Dominic
Computer and Information Science Dept., Universiti Teknologi PETRONAS, Bandar Seri Iskandar, Tronoh, Perak, Malaysia
[email protected], {roharam, dhanapal_d}@petronas.com.my
Abstract—Selecting a suitable forecasting technique is of prime importance for obtaining better prediction results. This paper demonstrates the use of two statistical approaches, Analysis of Variance (ANOVA) and the Duncan multiple range test, for assessing the performance of different forecasting methods. Three forecasting methods were chosen and compared: regression, decomposition, and ARIMA. Data on the monthly incidence of Salmonellosis in the US from 1993 to 2006 were collected and used for the analysis. ANOVA was first used to identify significant differences between the actual data and the three forecasting methods. Based on the ANOVA results, the appropriate method was then selected using the Duncan multiple range test. The results showed that both regression and ARIMA could be used for the Salmonellosis data. In contrast, the decomposition method performed worst and is not suitable for the available dataset.
Keywords - forecasting; regression; decomposition; ARIMA; ANOVA; Duncan Multiple Range Test
I. INTRODUCTION
Different forecasting methods have been employed to predict the future number of disease occurrences. The results can assist policy making aimed at reducing disease incidence. Techniques that have been applied to predict disease incidence in humans include a multivariate Markov chain model to project tuberculosis (TB) incidence in the United States from 1980 to 2010 [1], exponential smoothing to forecast human incidence of Schistosoma haematobium in Mali [2], an ARIMA model to forecast the SARS epidemic in China [3], a Bayesian dynamic model to monitor influenza surveillance as one factor of the SARS epidemic [4], seasonal autoregressive models to analyze cutaneous leishmaniasis (CL) incidence in Costa Rica from 1991 to 2001 [5], and the decomposition method [6] and seasonal ARIMA to predict the number of human Salmonellosis incidences [7].
978-1-4244-6716-7/10/$26.00 (c) 2010 IEEE
While some studies have employed several forecasting models, earlier approaches to forecasting were often based on a single forecasting technique. Because many forecasting techniques are available and their performance differs, selecting the most appropriate prediction model becomes critical for obtaining better prediction results. Moreover, a single technique may not yield the same prediction accuracy for different types of datasets. This further highlights the importance of selecting an appropriate forecasting model for each specific dataset.

Analysis of Variance (ANOVA) is one of the commonly used statistical methods for comparing several groups. In this paper, ANOVA was used for selecting a forecasting method based on a comparison of means between forecasting results, that is, for testing for a significant difference between group means. ANOVA is often used when a user needs to compare performance involving more than two parameters. The advantage of ANOVA over other tests such as simple t-tests is that ANOVA can detect the effect of interaction between variables. It can also be used to test more complex hypotheses [8]. When differences between groups exist, a post hoc test can then be conducted to identify which group differs from the others. In this paper, the Duncan multiple range test was used.

This paper aims to provide empirical results for evaluating and finding an appropriate method for estimating the future number of Salmonellosis incidences. Three forecasting methods were used: regression, decomposition, and ARIMA. The results were evaluated using ANOVA and the Duncan multiple range test. For this purpose, the monthly number of Salmonellosis incidences was selected. The Salmonellosis dataset for the United States was collected for the 168-month period from January 1993 to December 2006.
The data were obtained from the summary of notifiable diseases in the United States in the Morbidity and Mortality Weekly Report (MMWR), published by the Centers for Disease Control and Prevention (CDC). The seasonal variation of the original data is presented in Fig. 1. The plot shows a peak season of incidence in August, while the minimum number of incidences occurs in January. Since the time series plot of the historical data exhibited seasonal variation with a similar pattern every year, SARIMA was chosen as the appropriate approach for developing a prediction model.
Figure 1. Monthly number of Salmonellosis incidence in the US (1993-2006). [Plot: number of incidences (0 to 7000) against t (month), 0 to 168.]

The remainder of the paper is structured as follows. Section 2 presents the model framework. Section 3 reports the forecasting results. Section 4 presents the ANOVA results. Section 5 evaluates the Duncan multiple range test. Finally, Section 6 presents the conclusion.

II. MODEL FRAMEWORK

The future number of Salmonellosis incidences was forecast using three forecasting methods, namely regression, decomposition, and ARIMA. The flow of model development is presented in Fig. 2.

As illustrated in Fig. 2, the first step was the collection of Salmonellosis data. This was followed by data processing using each method (regression, decomposition, ARIMA). Next, the results of each method were compared with the actual data using a blocked ANOVA design. Finally, the Duncan multiple range test was conducted to identify which method produced forecasts closer to the actual data. The Duncan test result was also used to find the appropriate method among them.

Figure 2. Flow of model development.

III. FORECASTING RESULTS

A. Regression
A time series regression model is used when the independent variable is time and the model focuses on predicting future values. Twelve seasonal components were determined from the monthly data. Only the first eleven months can be represented by their own seasonal variables; the twelfth month carries no separate term and is used as the baseline for comparison. A monthly trend variable is also applied. The final regression formula was:

$$y_t = TR_t + SN_t + \varepsilon_t = \beta_0 + \beta_1 t + \beta_{s1} x_{s1,t} + \beta_{s2} x_{s2,t} + \beta_{s3} x_{s3,t} + \beta_{s4} x_{s4,t} + \beta_{s5} x_{s5,t} + \beta_{s6} x_{s6,t} + \beta_{s7} x_{s7,t} + \beta_{s8} x_{s8,t} + \beta_{s9} x_{s9,t} + \beta_{s10} x_{s10,t} + \beta_{s11} x_{s11,t} + \varepsilon_t \qquad (1)$$

with fitted values

$$\hat{y}_t = 4815.034 + 0.470t - 3053.470 x_{s1,t} - 2967.297 x_{s2,t} - 2603.768 x_{s3,t} - 2435.738 x_{s4,t} - 1902.637 x_{s5,t} - 1245.321 x_{s6,t} - 51.220 x_{s7,t} + 416.095 x_{s8,t} + 151.482 x_{s9,t} - 165.417 x_{s10,t} - 1264.030 x_{s11,t}$$

where $y_t$ is the observed value in time period t, $TR_t$ is the trend in time period t, $SN_t$ is the seasonal factor in time period t, and $\varepsilon_t$ is the error term in time period t.

B. Decomposition
The decomposition method is one of the seasonal smoothing methods. With this method, a series is broken into its component parts: trend, seasonal, cyclical, and irregular (error).

Decomposition models fall into two categories: multiplicative decomposition and additive decomposition. When a time series exhibits increasing or decreasing seasonal variation, the multiplicative decomposition model should be selected. The additive decomposition model is used only when the time series displays constant seasonal variation.

The time series in Fig. 1 exhibits constant seasonal variation, so additive decomposition is used [9]. The additive decomposition model is:
$$y_t = TR_t + SN_t + CL_t + IR_t \qquad (2)$$

where $y_t$ is the observed value of the time series in time period t, $TR_t$ is the trend component (or factor) in time period t, $SN_t$ is the seasonal component (or factor) in time period t, $CL_t$ is the cyclical component (or factor) in time period t, and $IR_t$ is the irregular component (or factor) in time period t. The component $CL_t$ was removed from Eq. (2) because no cyclical component was identified in the time series. To calculate $SN_t$, its estimate $sn_t$ was used. First, the average $\overline{sn}_t$ was calculated for each month. With L = 12 (the number of periods in a year), the seasonal factor is:

$$sn_t = \overline{sn}_t - \left( \sum_{t=1}^{L} \overline{sn}_t / L \right) = \overline{sn}_t - 4.719 \qquad (3)$$

Finally, the estimate $tr_t$ of the trend $TR_t$ was obtained by fitting a regression equation to the deseasonalized data. The resulting function is

$$tr_t = b_0 + b_1 t = 4069.887 - 3.563t \qquad (4)$$
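The fitted equations can be evaluated directly. The following is a minimal sketch, not the authors' implementation: it uses the coefficients reported in Eqs. (1) and (4), and it assumes that t = 1 corresponds to January 1993 and that the seasonal variables s1 to s11 map to January through November, with December as the baseline. The function names are hypothetical, and the monthly seasonal factors of Eq. (3) are not reported in the text, so they must be supplied by the caller.

```python
# Coefficients of the fitted regression reported in Eq. (1).
B0, B1 = 4815.034, 0.470
# Seasonal coefficients for months 1..11; month 12 is the baseline (0.0).
# ASSUMPTION: s1..s11 correspond to January..November.
SEASONAL = [-3053.470, -2967.297, -2603.768, -2435.738, -1902.637,
            -1245.321, -51.220, 416.095, 151.482, -165.417, -1264.030, 0.0]


def month_of(t):
    """Month index 1..12 for period t, ASSUMING t = 1 is January 1993."""
    return (t - 1) % 12 + 1


def regression_forecast(t):
    """Fitted value of Eq. (1) for period t."""
    return B0 + B1 * t + SEASONAL[month_of(t) - 1]


def decomposition_trend(t):
    """Trend estimate tr_t of Eq. (4)."""
    return 4069.887 - 3.563 * t


def normalize_seasonal(monthly_averages):
    """Eq. (3): subtract the mean of the monthly averages from each average,
    so that the resulting seasonal factors sum to zero."""
    mean = sum(monthly_averages) / len(monthly_averages)
    return [a - mean for a in monthly_averages]


def additive_forecast(t, seasonal_factors):
    """Additive point forecast tr_t + sn_t, given twelve normalized factors."""
    return decomposition_trend(t) + seasonal_factors[month_of(t) - 1]
```

Under these assumptions, `regression_forecast(169)` gives the regression forecast for the first month after the sample (about 1841 incidences), and `additive_forecast` combines the trend of Eq. (4) with whatever seasonal factors are passed in.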
C. ARIMA
Seasonal ARIMA (SARIMA) is used when the time series exhibits seasonal variation. A seasonal autoregressive term (P) and a seasonal moving average term (Q) form the multiplicative SARIMA process, written (p,d,q)(P,D,Q)s. The subscript 's' denotes the length of the seasonal period. For example, s = 7 for daily data, s = 4 for quarterly data, and s = 12 for monthly data. Thus, the general SARIMA (p,P,q,Q) model is

$$\phi_p(B)\,\phi_P(B^s)\,z_t = \delta + \theta_q(B)\,\theta_Q(B^s)\,a_t \qquad (5)$$

where $B$ is the non-seasonal backshift operator, $B^s$ is the seasonal backshift operator, $\phi$ is the autoregressive operator of order p, and $\theta$ is the moving average operator.

To develop the ARIMA model, the Box-Jenkins (BJ) methodology was used. BJ consists of four iterative steps: identification, estimation, diagnostic checking, and forecasting. Different ARIMA models were fitted to find the best model. The most appropriate model was then selected using the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) values. The model AR(9), SAR(12), MA(14), SMA(24) can be written as SARIMA(9,0,14)(12,1,24)12. For this model, the parameter estimates were AR(9) = 0.154, SAR(12) = -0.513, MA(14) = 0.255, and SMA(24) = -0.8604. The final model is expressed as Eq. (6):

$$(1 - 0.154B^{9} + 0.513B^{12} + 0.078B^{21})\,z_t = (1 + 0.255B^{14} - 0.860B^{24} - 0.219B^{38})\,a_t \qquad (6)$$
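As a reading aid, Eq. (6) can be unrolled into a recursion by applying the backshift rule $B^k z_t = z_{t-k}$ to each term and moving the lagged values of $z_t$ to the right-hand side. The sketch below assumes that $z_t$ denotes the working (differenced) series and $a_t$ the random shock; it only rearranges the coefficients exactly as they are reported in Eq. (6).

```latex
z_t = 0.154\,z_{t-9} - 0.513\,z_{t-12} - 0.078\,z_{t-21}
      + a_t + 0.255\,a_{t-14} - 0.860\,a_{t-24} - 0.219\,a_{t-38}
```

In this form, each value of the series depends on its own values 9, 12, and 21 months earlier and on the shocks 14, 24, and 38 months earlier.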
IV. ANALYSIS OF VARIANCE (ANOVA)
ANOVA is a technique first developed by R.A. Fisher in the 1920s. It is used to compare group means [10]. ANOVA uses two hypotheses to determine the result, namely the null hypothesis and the alternative hypothesis. It is called analysis of variance because the test compares two variance estimates: the variance within groups (the unsystematic variation or error in the data) and the variance between groups (effects due to the experiment).

In this paper, the actual data and the forecast results of linear regression, decomposition, and ARIMA were compared using ANOVA. The experiment has variability from a systematically controlled source: time (months) is a common source of variability that can be controlled by blocking. Thus, a blocked ANOVA design was applied with months as the blocking parameter. The estimated values from the three techniques (regression, decomposition, and ARIMA) were compared with the actual data. The hypotheses were:
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$$

$$H_1: \mu_i \neq \mu_j \quad \text{for at least one pair } i, j = 1, 2, 3, 4,\ i \neq j$$
where μ1, μ2, μ3, and μ4 are the means obtained from the actual data, regression, decomposition, and ARIMA, respectively. The hypotheses were tested at a significance level (α) of 0.05. The ANOVA results are shown in Table I. If the F value for the methods is greater than the corresponding critical value Fcrit, it can be concluded that the methods differ significantly (the null hypothesis is rejected). Otherwise, it can be concluded that there is no significant difference between the methods (the null hypothesis is not rejected).

From the ANOVA results in Table I, the error sum of squares (SS) was 37282552 and the error mean square (MS) was 92742.67. At α = 0.05, the null hypothesis was rejected because Fcrit = 2.627103 and F = 3.868407, so F > Fcrit. In addition, the P value for the between-group (method) effect also gave evidence to reject the null hypothesis, since P = 0.009503 < 0.05.

The blocked ANOVA tests whether any of the population means differ from each other, whereas a multiple comparison test checks which population means differ from the others. The following section reports the application of Duncan's multiple range test.
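The blocked design described above can be sketched in a few lines. This is a minimal illustration of a randomized complete block ANOVA, not the software used by the authors; the function name and the toy data are hypothetical. Rows play the role of blocks (months) and columns the role of treatments (actual data and forecasting methods).

```python
def blocked_anova(data):
    """Randomized complete block ANOVA.

    data[i][j] is the value for block i (e.g. a month) and
    treatment j (e.g. the actual data or one forecasting method).
    """
    b, k = len(data), len(data[0])  # number of blocks, number of treatments
    grand = sum(sum(row) for row in data) / (b * k)
    block_means = [sum(row) / k for row in data]
    treat_means = [sum(row[j] for row in data) / b for j in range(k)]

    # Partition the total sum of squares into blocks, treatments, and error.
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_blocks = k * sum((m - grand) ** 2 for m in block_means)
    ss_treat = b * sum((m - grand) ** 2 for m in treat_means)
    ss_error = ss_total - ss_blocks - ss_treat

    df_treat = k - 1
    df_error = (b - 1) * (k - 1)
    ms_treat = ss_treat / df_treat
    ms_error = ss_error / df_error
    return {
        "ss_blocks": ss_blocks, "ss_treat": ss_treat, "ss_error": ss_error,
        "df_treat": df_treat, "df_error": df_error,
        "ms_treat": ms_treat, "ms_error": ms_error,
        "F": ms_treat / ms_error,
    }


# Toy data (hypothetical, not from the paper): 3 blocks x 2 treatments.
toy = [[1, 2], [3, 5], [5, 8]]
result = blocked_anova(toy)
```

With 135 blocks and 4 groups, the same formulas give 134 × 3 = 402 error degrees of freedom, and the ratio of the reported mean squares, 358766.4 / 92742.67, is approximately 3.868, in line with the F value quoted in the text. The critical value Fcrit is not computed here; it would come from an F table or a statistics library.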
TABLE I. ANOVA RESULT

| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| Actual | 135 | 488458 | 3618.207 | 1829864 |
| Regression | 135 | 488677.6 | 3619.834 | 1564833 |
| Decomposition | 135 | 503236.4 | 3727.677 | 1541855 |
| ARIMA | 135 | 491894.9 | 3643.666 | 1792091 |

| Source of Variation | SS | df | MS | F | P | F crit (α=0.05) |
|---|---|---|---|---|---|---|
| Blocks (month) | 8.64E+08 | 134 | 6450416 | 69.55176 | 2E-216 | 1.25215 |
| Between Group (Method) | 1076299 | 3 | 358766.4 | 3.868407 | 0.009503 | 2.627103 |
| Within Groups (Error) | 37282552 | 402 | 92742.67 | | | |
| Total | 9.03E+08 | 539 | | | | |

V. DUNCAN MULTIPLE RANGE TEST
When the null hypothesis is rejected, a post hoc test can be conducted to identify which groups have different means. In this study, the Duncan multiple range test was chosen. The Duncan multiple range test can maintain a low overall type I error [11] and can also be applied to identify groups that are not significantly different [12]. The Duncan test uses a studentized range statistic within a multiple-stage test, referred to as a multiple range test. A least significant range (LSR) is calculated for each number of means in a subset, and it increases with that number. Two population means are significantly different if the range of the subset they span is greater than the corresponding LSR.

For the significance level α = 0.05, the significant ranges were taken from Duncan's table. The LSR values were computed as shown below, with error mean square MS = 92742.67 and n = 135. The standard error of each average is

$$S = \sqrt{\frac{MS}{n}} = 26.21 \qquad (7)$$

Since the total number of selected groups was 4, the total number of ranges was equal to 3. From the table of significant ranges in Montgomery for 402 degrees of freedom (the number of degrees of freedom for within groups in the ANOVA table) and α = 0.05, the three ranges were:

$$r_2 = r_{\alpha}(p, df) = r_{0.05}(2, 402) = 2.772$$

$$r_3 = r_{\alpha}(p, df) = r_{0.05}(3, 402) = 2.918$$

$$r_4 = r_{\alpha}(p, df) = r_{0.05}(4, 402) = 3.017$$

The LSR can be calculated from the equation:

$$R_p = r_p \times S \qquad (8)$$
By applying (8), LSR results were R2 = 72.655, R3 = 76.486, and R4 = 79.077. The mean of forecasting results is presented in Table II.

TABLE II. MEAN OF EACH GROUP

| Methods | Mean Symbol | Mean Value |
|---|---|---|
| Actual | M1 | 3618.207 |
| Regression | M2 | 3619.834 |
| Decomposition | M3 | 3727.677 |
| ARIMA | M4 | 3643.666 |
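Equations (7) and (8) can be checked numerically. The sketch below is a minimal illustration with hypothetical function names; the error mean square, group size, and significant ranges are the values quoted in the text, not values looked up by the code.

```python
import math

MS_ERROR = 92742.67  # error mean square from the ANOVA table
N = 135              # number of observations per group
# Significant ranges r_p for alpha = 0.05 and 402 error degrees of freedom,
# as quoted in the text (taken from a table, not computed here).
SIG_RANGES = {2: 2.772, 3: 2.918, 4: 3.017}


def standard_error(ms_error, n):
    """Eq. (7): standard error of each group average."""
    return math.sqrt(ms_error / n)


def least_significant_ranges(ms_error, n, sig_ranges):
    """Eq. (8): R_p = r_p * S for each subset size p."""
    s = standard_error(ms_error, n)
    return {p: r * s for p, r in sig_ranges.items()}


S = standard_error(MS_ERROR, N)
LSR = least_significant_ranges(MS_ERROR, N, SIG_RANGES)
```

The computed values agree with the reported S, R2, R3, and R4 to within about 0.01; the small remaining differences are presumably due to rounding of the tabulated ranges.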
The means in Table II are first sorted in ascending order: M1, M2, M4, M3. Each pair is then compared against the least significant range Rp, where p is the number of means spanned by the pair in the sorted order:

|M1 – M2| = 1.626 < 72.655 (R2)
|M1 – M3| = 109.470 > 79.077 (R4)
|M1 – M4| = 25.459 < 76.486 (R3)
|M2 – M3| = 107.843 > 76.486 (R3)
|M2 – M4| = 23.832 < 72.655 (R2)
|M3 – M4| = 84.011 > 72.655 (R2)
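The pairwise comparison can be sketched as follows. This is a simplified illustration with a hypothetical function name: it sorts the means, takes p as the number of means spanned by each pair in the sorted order, and compares the difference with the reported Rp. It omits the rule of the full Duncan procedure that pairs inside a non-significant larger subset are not tested further, which does not change the outcome for these data.

```python
# Group means from Table II and least significant ranges from Eq. (8).
MEANS = {"Actual": 3618.207, "Regression": 3619.834,
         "Decomposition": 3727.677, "ARIMA": 3643.666}
LSR = {2: 72.655, 3: 76.486, 4: 79.077}


def duncan_pairs(means, lsr):
    """Compare every pair of means against the range R_p for its span p.

    Returns {(lower_group, higher_group): (difference, p, significant)}.
    """
    ordered = sorted(means, key=means.get)  # group names, ascending by mean
    results = {}
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            p = j - i + 1  # number of means spanned in the sorted order
            diff = means[ordered[j]] - means[ordered[i]]
            results[(ordered[i], ordered[j])] = (diff, p, diff > lsr[p])
    return results


for pair, (diff, p, significant) in duncan_pairs(MEANS, LSR).items():
    verdict = "significant" if significant else "not significant"
    print(f"{pair[0]} vs {pair[1]}: diff = {diff:.3f}, R{p} = {LSR[p]}, {verdict}")
```

Every pair involving decomposition is flagged as significant, while the pairs among actual data, regression, and ARIMA are not, which is the same outcome as the comparisons listed above.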
The decomposition method performs poorly compared with the other techniques. The decomposition method (mean value M3 = 3727.677) shows the largest differences from the other groups: its difference from the actual data, 109.470, exceeds even the largest least significant range (R4 = 79.077). The mean differences between decomposition and the other forecasting methods (M2 – M3 and M3 – M4) were also greater than their least significant ranges. This indicates that the decomposition method is not an appropriate technique for the current dataset. In contrast, the mean comparison between the actual data and the other methods (regression and ARIMA) reveals small differences. The differences for regression and ARIMA were less than the least significant range (1.626