Int. J. Industrial and Systems Engineering, Vol. 4, No. 4, 2009
An integrated GA-time series algorithm for forecasting oil production estimation: USA, Russia, India, and Brazil

A. Azadeh*
Department of Industrial Engineering and Center of Excellence for Intelligent Based Experimental Mechanics, College of Engineering, University of Tehran, P.O. Box 11365-4563, Tehran, Iran
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

M. Aramoon
Department of Industrial Engineering, College of Engineering, University of Tehran, Iran
E-mail: [email protected]

M. Saberi
Department of Industrial Engineering, University of Tafresh, Iran
E-mail: [email protected]

Abstract: This study presents an integrated algorithm for forecasting oil production based on a Genetic Algorithm (GA) with variable parameters using stochastic procedures, time series and Analysis of Variance (ANOVA). The significance of the proposed algorithm is twofold. First, it is flexible and identifies the best model based on the results of ANOVA and MAPE, whereas previous studies select the best fitted GA model based on Mean Absolute Percentage Error (MAPE) or relative error results alone. Second, the proposed algorithm may identify conventional time series as the best model for future oil production forecasting because of its dynamic structure, whereas previous studies assume that the GA always provides the best solutions and estimations. To show the applicability and superiority of the proposed algorithm, data for oil production in the USA, Russia, India and Brazil from 2001 to 2006 are applied to the proposed algorithm.

Keywords: integrated genetic algorithm; time series; oil production; ANOVA; analysis of variance; Duncan's multiple range test; MAPE; mean absolute percentage error.
Copyright © 2009 Inderscience Enterprises Ltd.
Reference to this paper should be made as follows: Azadeh, A., Aramoon, M. and Saberi, M. (2009) 'An integrated GA-time series algorithm for forecasting oil production estimation: USA, Russia, India, and Brazil', Int. J. Industrial and Systems Engineering, Vol. 4, No. 4, pp.368–387.

Biographical notes: Ali Azadeh is an Associate Professor, the founder of the Department of Industrial Engineering and a co-founder of RIEMP at the University of Tehran. He graduated with a first-class honours BS in Applied Mathematics from the University of San Francisco and obtained his MS and PhD in Industrial and Systems Engineering from San Jose State University and the University of Southern California. He received the 1992 Phi Beta Kappa Alumni Award for excellence in research and innovation in his doctoral dissertation in the USA. He is the recipient of the 1999–2000 Applied Research Award and has published more than 290 academic papers.

Malihe Aramoon is currently a Graduate student in Industrial Engineering at the University of Tehran. She has been selected for the Dean's Honor Roll for six semesters. Her current research interests include evolutionary computation and its applications in industrial engineering.

Morteza Saberi is an Instructor of Industrial Engineering at the University of Tafresh, Iran. He earned his BS in Applied Mathematics from Amir Kabir University of Technology and his MS in Industrial Engineering from Bu Ali Sina University, Iran. His current research interests include systems modelling, data mining and econometric modelling and forecasting of supply and demand via artificial intelligence tools.
1 Introduction
In an era of vigorous globalisation and rapid change, countries have been striving to improve their ability to estimate and predict oil production. With such estimates, countries can plan their procedures several years into the future and control some economic fluctuations. Evolutionary algorithms are suitable for modelling this kind of problem with unknown factors; the target is to find the essential structure of the data in order to forecast future production with less error. In this paper we report a Genetic Algorithm (GA) to estimate and predict oil production. A GA is a class of iterative procedures that simulates the evolution of a population of structures subject to the competitive forces prescribed in Darwin's survival-of-the-fittest principle. Nowadays, the ability to forecast the future based only on past data leads to strategic advantages, which may be the key to success in organisations. In response to this issue, recent literature documents different methods and techniques. Time series forecasting allows the modelling of complex systems as black boxes, and is a focus of attention in several research arenas such as Operational Research, Statistics and Computer Science. On the other hand, Genetic and Evolutionary Algorithms (GEAs) are novel techniques increasingly used in optimisation and machine learning tasks (Cortez et al., 2001). Mirmirani and Li (2004) applied VAR and ANN techniques to make ex-post forecasts of US oil price movement. Lagged oil price, lagged oil supply and lagged energy consumption were used as three endogenous variables for the VAR-based forecast; the GA-based ANN used oil supply, energy consumption and money supply. They have
considered root mean squared error and Mean Absolute Error (MAE) as the evaluation criteria. An application of Artificial Neural Networks (ANNs) for short-term forecasting of GDP, using oil prices and utilising cascaded learning, is proposed by Malik and Nasereddin (2006). They presented the newly developed cascaded ANN in which the hidden nodes are not enforced by the decision-maker but are determined endogenously. Their computational experiments suggested utilising oil prices in a non-linear fashion in real output forecasting models. Ye et al. (2006) provided a model to forecast crude oil spot prices in the short run using high- and low-inventory variables. They showed that the non-linear-term model better captures price responses at very high or very low inventory levels and improves forecasting capability. Kermanshahi and Iwamiya (2002) used ANNs to predict the peak electric loads in Japan up to the year 2020. They focused on economic factors rather than weather conditions for long-term load forecasting. An energy consumption forecast system using the fuzzy logic approach is introduced by Lau et al. (2008) for supporting a manufacturing plant's operations. They demonstrated how to apply the fuzzy logic system by using the fuzzy rule reasoning mechanism in a clothing manufacturing plant. Tsekouras et al. (2007) described a method developed for midterm energy forecasting using a non-linear multivariable regression model. The proposed method performed an extensive search in order to select the most appropriate functions, weighting factors and training periods to be used in the model. Results obtained by the proposed method for the Greek power system were compared with the results of standard regression methods. Abdel-Aal (2007) demonstrated the use of abductive and neural networks for modelling and forecasting the univariate time series of monthly energy demand. Sözen et al.
(2007) developed equations for the estimation of GHG emissions in Turkey using the ANN approach in order to plan the use of energy by sector. The equations obtained were used to determine the future level of GHG emissions and to take measures to control the share of each sector in total emissions. The estimation of Turkey's energy demand based on economic indicators using the GA was reported by Ceylan and Ozturk (2004). Ozturk et al. (2003) estimated industrial electricity demand using the GA. Osman et al. (2005) combined a GA with a Fuzzy Logic Controller (FLC) so that the search region adapts to the promising area: the boundary intervals are monitored by the FLC and modified at each step. Tang et al. (2005) used a GA to tune the parameters of a Takagi-Sugeno-Kang fuzzy neural network. Canyurt et al. (2004) recently carried out research to estimate energy consumption using the GA. Hasheminia and Akhavan Niaki (2006) introduced a new type of GA to find the best regression model among several alternatives and assessed its performance on an economic case study. Azadeh and Tarverdian (2007) presented an integrated algorithm for forecasting monthly electrical energy consumption based on GA, computer simulation and design of experiments using stochastic procedures. Azadeh et al. (2007) showed the integration of GA and neural networks to estimate and predict electrical energy consumption in the short term. According to the literature review, there is no previous research work on oil production forecasting of this kind. The proposed algorithm, which is based on GA, conventional time series, ANOVA and Mean Absolute Percentage Error (MAPE), is discussed in the next section. The input variable used to estimate the best conventional model and, consequently, the GA model is a function of past data.

The GA applied in this study has been tuned for all its parameters and the best coefficients with minimum error are identified, while all parameter values are tested concurrently. The proposed algorithm
uses ANOVA to select either the GA or the conventional time series model for future production estimation. Furthermore, if the null hypothesis in the ANOVA F-test is rejected, Duncan's Multiple Range Test is used to identify which model is closer to the actual data at the α level of significance. MAPE is used to select between the GA and the time series model when the null hypothesis in ANOVA is accepted.
2 Conventional time series models
Time series models are quite well known for predicting a variable's future behaviour from its behaviour in the past. One of the most famous time series models is the Autoregressive Integrated Moving Average (ARIMA) model. The ARIMA model belongs to a family of flexible linear time series models that can be used to model many different types of seasonal as well as non-seasonal time series. In its most popular form, the ARIMA model can be expressed as:

Φ_p(L) y_t = θ_q(L) ε_t.    (1)
With

Φ_p(L) = 1 − Φ_1 L − ⋯ − Φ_p L^p
θ_q(L) = 1 − θ_1 L − ⋯ − θ_q L^q

where s is the seasonal length in the seasonal variant of the model, L is the backshift operator defined by L^k y_t = y_{t−k}, and ε_t is a sequence of white noises with zero mean and constant variance. Equation (1) is often referred to as the ARIMA(p, q) model. Box et al. (1994) proposed a set of effective model building strategies for identification, estimation, diagnostic checking and forecasting of ARIMA models. In the identification stage, the sample Auto Correlation Function (ACF) (the correlation between two variables) is plotted. A slowly decaying ACF suggests non-stationary behaviour; in such circumstances Box et al. (1994) recommend differencing the data. A common practice is to use a logarithmic transformation if the variance does not appear to be constant. After preprocessing, if needed, the ACF and PACF of the preprocessed data are examined to determine all plausible ARIMA models. Some non-linear time series models were also developed, mainly by Granger and Priestley. One of these non-linear models is referred to as bilinear; the first-order bilinear model is shown in equation (2):

X_t = aX_{t−1} + bZ_t + cZ_{t−1}X_{t−1}    (2)

in which Z_t is the stochastic process and a, b and c are the model parameters. It should be noted that only the last term of the above equation is non-linear. Another type of non-linear model is the Threshold Auto Regressive (TAR) model, in which the parameters depend on the past values of the process. One example of such models is described by equation (3):

X_t = α_1 X_{t−1} + Z_t^(1)  if X_{t−1} < d
X_t = α_2 X_{t−1} + Z_t^(2)  if X_{t−1} ≥ d.    (3)
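To make the non-linear forms above concrete, the following sketch simulates a TAR process of the kind in equation (3). The parameter values (a1, a2, the threshold d and the noise level) are illustrative assumptions, not values from the paper.

```python
import random

def simulate_tar(n, a1, a2, d, sigma=1.0, x0=0.0, seed=42):
    """Simulate the TAR model of equation (3): the AR coefficient switches
    between a1 and a2 depending on whether X_{t-1} is below the threshold d."""
    rng = random.Random(seed)
    x = [x0]
    for _ in range(n - 1):
        prev = x[-1]
        z = rng.gauss(0.0, sigma)  # white-noise term Z_t
        x.append((a1 if prev < d else a2) * prev + z)
    return x

series = simulate_tar(200, a1=0.5, a2=-0.4, d=0.0)
```

Fitting such a model amounts to choosing a1, a2 and d so that the one-step prediction error over the observed series is minimised, which is exactly the kind of search the GA in Section 3 performs.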
Furthermore, the proposed algorithm fits the best linear or non-linear model to the data set. This is quite important because most studies assume that linear time series such as ARIMA provide the best fit.
2.1 Data preprocessing

In time series methods, making the process covariance stationary is one of the basic assumptions, and using preprocessed data is also more effective in most heuristic methods (Zhang and Qi, 2005); the stationarity assumption should therefore be checked for the models. If a model is not covariance stationary, the most suitable preprocessing method should be identified and applied. In forecasting models, a preprocessing method should be capable of transforming the preprocessed data back to their original scale (called post-processing). So, in time series forecasting, an appropriate preprocessing method should have two main properties: it should make the process stationary and it should have post-processing capability. The most useful preprocessing methods are presented in the following sections.
2.1.1 The first difference method

The difference method was proposed by Box et al. (1994). Tseng et al. (2002) also used this method in their paper on the estimation of time series functions using a heuristic approach. In this method, the following transformation is applied:

y_t = x_t − x_{t−1}.    (4)
For the first difference of the logarithm method, the transformation is adjusted as follows:

y_t = log(x_t) − log(x_{t−1}).    (5)
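The first difference transformation of equation (4) and its post-processing inverse can be sketched as follows; this is a minimal illustration and the function names are ours.

```python
def first_difference(x):
    # equation (4): y_t = x_t - x_{t-1}
    return [x[t] - x[t - 1] for t in range(1, len(x))]

def inverse_first_difference(y, x0):
    # post-processing: cumulatively add the differences back onto
    # the first raw observation to recover the original scale
    x = [x0]
    for d in y:
        x.append(x[-1] + d)
    return x

raw = [120.0, 125.0, 123.0, 130.0]
diffed = first_difference(raw)                       # [5.0, -2.0, 7.0]
restored = inverse_first_difference(diffed, raw[0])  # recovers raw
```

Keeping the first raw value x0 is what gives the method the post-processing capability required above.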
2.1.2 Normalisation

There are different normalisation algorithms: Min-Max Normalisation, Z-Score Normalisation and Sigmoid Normalisation. Min-Max normalisation scales the numbers in a data set to improve the accuracy of the subsequent numeric computations. Nayak et al. (2004), Karunasinghe and Liong (2006), Tseng et al. (2002) and Gareta et al. (2006) used this method in their papers to estimate time series functions using heuristic approaches. If x_old, x_max and x_min are the original, maximum and minimum values of the raw data, respectively, and x′_max and x′_min are the maximum and minimum of the normalised data, respectively, then the normalisation of x_old, called x′_new, can be obtained by the following transformation function:

x′_new = ((x_old − x_min) / (x_max − x_min)) (x′_max − x′_min) + x′_min.    (6)
In Z-Score Normalisation the data are transformed so that their mean and variance are 0 and 1, respectively. The transformation function is as follows, where std is the standard deviation of the raw data:

x_new = (x_old − mean) / std.    (7)

Sigmoid Normalisation uses a sigmoid function to scale the data into the range [−1, 1]. The transformation function is as follows:

x_new = (1 − e^α) / (1 + e^α),  α = (x_old − mean) / std.    (8)
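The three normalisation schemes of equations (6)–(8) can be sketched in Python as follows; equation (8) is implemented exactly as printed, with α being the z-score of each observation.

```python
import math

def min_max(x, new_min=0.0, new_max=1.0):
    # equation (6): rescale raw data into [new_min, new_max]
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in x]

def z_score(x):
    # equation (7): zero mean, unit variance
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]

def sigmoid_norm(x):
    # equation (8): squash the z-score into (-1, 1)
    return [(1 - math.exp(a)) / (1 + math.exp(a)) for a in z_score(x)]

data = [1.0, 2.0, 3.0, 4.0]
mm, zs, sg = min_max(data), z_score(data), sigmoid_norm(data)
```

All three are invertible given the stored statistics (min/max or mean/std), so each satisfies the post-processing requirement of Section 2.1.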
2.2 Open and close simulation

Suppose that the inputs for the GA are y(t), t = 1, …, m, and that a set of lagged samples y(t − i), the ith lag of y(t), i = 1, …, k, is used as the model input. In open simulation, each output is generated from actual input data, whereas in close simulation each output is generated from the previously generated outputs y(t + j − i). In fact, to test the simulation power of a GA, we use samples m − k + 1 to m and then let the model generate all the succeeding samples y(t), t = m + 1, …, m + n.
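The distinction between open and close simulation can be sketched as below; `naive_model` is a hypothetical stand-in for the fitted GA or time series model, not anything from the paper.

```python
def open_simulation(model, series, k, n_test):
    # one-step-ahead: every forecast is computed from actual lagged data
    start = len(series) - n_test
    return [model(series[t - k:t]) for t in range(start, len(series))]

def close_simulation(model, series, k, n_test):
    # recursive: each forecast is fed back as an input for the next step
    history = list(series[:len(series) - n_test])
    preds = []
    for _ in range(n_test):
        preds.append(model(history[-k:]))
        history.append(preds[-1])
    return preds

def naive_model(lags):  # hypothetical model: average of the lag window
    return sum(lags) / len(lags)

series = [10.0] * 20
po = open_simulation(naive_model, series, 3, 6)
pc = close_simulation(naive_model, series, 3, 6)
```

Close simulation is the harder test: forecast errors compound as predictions are reused as inputs.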
3 The integrated algorithm
The proposed algorithm may be used to estimate future oil production by optimising parameter values. It uses ANOVA to select either the GA or the conventional time series model for future production estimation. Furthermore, if the null hypothesis in the ANOVA F-test is rejected, Duncan's Multiple Range Test is used to identify the model which is closer to the actual data at the α level of significance. MAPE is used to select between the GA and the time series model when the null hypothesis in ANOVA is accepted. The significance of the proposed algorithm is twofold. First, it is flexible and identifies the best model based on the results of ANOVA and MAPE, whereas previous studies select the best fitted GA model based on MAPE or relative error results alone. Second, the proposed algorithm may identify conventional time series as the best model for future oil production forecasting because of its dynamic structure, whereas previous studies assume that the GA always provides the best solutions and estimations. Figure 1 depicts the proposed algorithm of this study. The reader should note that all steps of the integrated algorithm are based on standard methodologies: GA, conventional time series, ANOVA, Duncan's Multiple Range Test and MAPE. Furthermore, the GA modelling is based on the time series model selected for the data set. The best model is distinguished by modelling, running and testing various time series models and selecting the model with the lowest error.
3.1 Genetic Algorithm

The GA is a part of evolutionary computing, which is a rapidly growing area of artificial intelligence. These algorithms were described by Goldberg (1989) and have attracted attention for solving optimisation problems. The most important advantage of GAs is their ability to use accumulated information about an initially unknown search space in order to direct subsequent searches into useful subspaces (Ceylan and Ozturk, 2004). The fundamental principle of GAs was first introduced by Holland (1975). In GAs, the better chromosome is the one that is closer to the optimal solution. In the application of GAs, the population of chromosomes is created randomly; the population size differs from one problem to another. GAs differ from conventional non-linear optimisation techniques by preserving a population of solutions. The key feature of such algorithms is the manipulation of a population whose individuals are characterised by possessing a chromosome. A chromosome is composed of strings of symbols called bits. Each bit is attached to a position within the string representing the chromosome to which it belongs. If, for example, the strings are binary, then each bit can take the value 0 or 1. The link between the GA and the problem at hand is provided by the fitness function (F), which establishes a mapping from the chromosomes to some set of real numbers. The GA procedure is generative: each generation of the GA makes a new population from the existing one. Suppose that the population size is initially P. P individuals are assigned values for their chromosomes, where the assignment can be either random or deterministic. A permutation of such strings can be introduced to construct a population of designs where each design has its own fitness value.

Figure 1
The integrated GA-time series-DOE algorithm for oil production forecasting
A group of chromosomes is called a population. One of the features of GAs is that, instead of focusing on one point of the search space, the algorithm works on a population of chromosomes. This way, at each stage the algorithm has a population of chromosomes which possess the desired properties to a greater degree than the previous one. Each population, or generation, of chromosomes has the same size, referred to as the population size. If the number of chromosomes is too low, the possibility of the movement operations of the GA will also be low and it will search only a small part of the search space. According to the literature, a suitable population size is about 20–30 chromosomes, although populations of 50–100 have sometimes led to better answers. The GA works with a 'population' of possible answers (e.g., sets of parameter values). Because of this, it does not require initial estimates of the fitting parameters, but only the allowable range of each parameter. The goodness or badness of an answer is determined by the value returned by the goal function: more suitable answers have higher fitness. To increase its chance of survival, a chromosome's survival probability is set according to its fitness value; therefore, the fittest chromosomes take part in producing offspring with higher probability. The GA encompasses three main operators, selection, crossover and mutation, which are described briefly below.
3.1.1 Solution coding (chromosome structure)

In this research, the initial population of individuals is generated randomly. To do so, a set of N chromosomes is generated at random. The chromosome ch_i is represented by a sequence of genes (g_j, j = 1, …, m). Each gene contains one unit of information: the jth coefficient is represented by the value of gene g_j in the sequence of the chromosome. Since the output results are strongly sensitive to the initial set, the initial coefficients are generated randomly in a predetermined interval, i.e., [a, b]. The bounds a and b are obtained from the conventional time series model.
3.1.2 The fitness function

To introduce the fitness function, the variables are put into the model, the difference between the estimated values and the actual data is calculated for each chromosome, and in each generation the individual with the minimum difference is returned. Individual parameters are selected randomly and, after being put into the model, the fitness is calculated. The fitness function covering this goal is the MAPE error shown below:

min f = (1/n) Σ_{j=1}^{n} |D_actual − D_estimated| / D_actual    (9)

where D_actual and D_estimated denote actual and estimated oil production, respectively, and n is the number of observations. As the fitness function is a minimisation, individuals with lower fitness values are returned at each generation. The next section presents the most important error estimation methods; however, owing to MAPE's widespread use and comprehensiveness in this field, we chose it for the proposed algorithm.
3.1.2.1 Error estimation methods

There are four basic error estimation methods, listed below:

• Mean Absolute Error (MAE)
• Mean Square Error (MSE)
• Root Mean Square Error (RMSE)
• Mean Absolute Percentage Error (MAPE).

They can be calculated by the following equations, respectively:

MAE = Σ_{t=1}^{n} |x_t − x′_t| / n
MSE = Σ_{t=1}^{n} (x_t − x′_t)² / n
RMSE = sqrt( Σ_{t=1}^{n} (x_t − x′_t)² / n )
MAPE = (1/n) Σ_{t=1}^{n} |x_t − x′_t| / x_t    (10)
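The four criteria of equation (10) translate directly into code; a minimal sketch, with x′_t written as `pred`:

```python
import math

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mse(actual, pred):
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    return math.sqrt(mse(actual, pred))

def mape(actual, pred):
    # relative errors, as used for the GA fitness in equation (9)
    return sum(abs(a - p) / a for a, p in zip(actual, pred)) / len(actual)

errors = (mae([100.0, 200.0], [110.0, 190.0]),
          mse([100.0, 200.0], [110.0, 190.0]),
          rmse([100.0, 200.0], [110.0, 190.0]),
          mape([100.0, 200.0], [110.0, 190.0]))
```

Note that MAPE is scale-free, which is why it is comparable across the four countries despite their very different production levels.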
3.1.3 Local search

In every iteration, chromosomes in the current population are improved by using the pair-wise exchange procedure (XP, or 2-Opt method). In XP, the positions of every pair of coefficients are exchanged. Finally, the chromosome with the lowest MAPE is chosen.
3.1.4 Crossover

In this paper, we propose a new, direction-based crossover which shapes the search space by using the fitness value. The procedure of applying this crossover is as follows:

IF MAPE(P1) < MAPE(P2)
    f1 = λ1 × P1 + λ2 × P2
    f2 = λ1 × P1 − λ2 × P2
ELSE
    f1 = λ1 × P2 + λ2 × P1
    f2 = λ1 × P2 − λ2 × P1
END

with λ1 + λ2 = 1.
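The direction-based crossover above can be sketched on real-coded chromosomes as follows; treating the chromosome as a coefficient vector, and the particular value of λ1, are our illustrative assumptions.

```python
def directional_crossover(p1, p2, mape1, mape2, lam1=0.7):
    """Weight the fitter parent (lower MAPE) by lam1 and the other by
    lam2 = 1 - lam1, following the IF/ELSE rule above."""
    lam2 = 1.0 - lam1
    better, worse = (p1, p2) if mape1 < mape2 else (p2, p1)
    f1 = [lam1 * b + lam2 * w for b, w in zip(better, worse)]
    f2 = [lam1 * b - lam2 * w for b, w in zip(better, worse)]
    return f1, f2

f1, f2 = directional_crossover([1.0, 2.0], [0.0, 1.0], mape1=0.10, mape2=0.25)
```

Because λ1 > λ2 weights the fitter parent more heavily, offspring are biased toward the better region of the search space.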
3.1.5 Mutation

This operator is designed to move in the direction of finding the best, or near-best, solutions. Chromosomes are improved by using the 3-Opt method, in which the positions of every three coefficients are exchanged. Finally, the chromosome with the lowest MAPE is chosen.
3.1.6 Selection

The selection of individuals to produce successive generations plays an extremely important role in a GA. A probabilistic selection is performed based upon each individual's fitness such that better individuals have an increased chance of being selected; we use a roulette wheel for the mating pool. In iteration t, P_t is used for selection, crossover and mutation to create Q_t. Then a combined population R_t = P_t ∪ Q_t of size 2N is formed. The population R_t is sorted according to the normalised strategy: chromosomes are first normalised according to equation (11), and the chromosomes whose normalised values are less than or equal to zero are selected as the new population P_{t+1}:

z_i = (f_i − µ) / σ    (11)

where z_i is the normalised value of chromosome i, f_i is the fitness of chromosome i, and µ and σ are the average and standard deviation of the chromosome values in R_t.
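The normalised survivor selection of equation (11) can be sketched as below (a simplified illustration of the "keep at-most-average MAPE" rule; the tie-handling when σ = 0 is our assumption):

```python
import math

def normalised_selection(population, fitnesses):
    """Keep chromosomes whose z-normalised fitness z_i = (f_i - mu) / sigma
    is <= 0, i.e., those with at-most-average MAPE in R_t."""
    n = len(fitnesses)
    mu = sum(fitnesses) / n
    sigma = math.sqrt(sum((f - mu) ** 2 for f in fitnesses) / n)
    if sigma == 0.0:  # all fitnesses equal: keep everyone
        return list(population)
    return [ind for ind, f in zip(population, fitnesses)
            if (f - mu) / sigma <= 0.0]

survivors = normalised_selection(["c1", "c2", "c3", "c4"],
                                 [0.01, 0.02, 0.03, 0.08])
```

Since z_i ≤ 0 exactly when f_i ≤ µ, this step keeps the below-average (i.e., better) half-ish of the combined population R_t.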
3.1.7 Stopping criteria

In this case, the algorithm terminates when the number of generations reaches a specified number. The GA parameters in all problems are as follows:

• the number of populations in each generation = 100
• the maximum number of generations = 50
• the crossover rate is 0.8
• the mutation rate, as a local search, is 0.2.
Pseudo code for the Proposed Genetic Algorithm:

{
  For (i = 1 : pop_size)
  {
    i = Generate_solution();
    i = Local_search(i);
    Add individual i to P
  }
  While (terminate ~= true)
  {
    For (j = 1 : #crossover)
    {
      Select two parents ia, ib ∈ P randomly;
      ic = crossover(ia, ib);
      Add ic to P;
    }
    For (j = 1 : #mutation)
    {
      Select one individual im ∈ P randomly;
      im = mutate(im);
      Add im to P;
    }
    P = select(P)
  }
  Result
}
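The pseudo code above can be translated into a minimal runnable sketch. The linear-in-lags model, the convex-combination crossover and the truncation selection used here are simplifying assumptions, not the paper's exact operators.

```python
import random

def run_ga(series, lags, pop_size=30, generations=40, bounds=(-1.0, 1.0), seed=0):
    rng = random.Random(seed)
    k = max(lags)

    def mape(actual, pred):
        return sum(abs(a - p) / a for a, p in zip(actual, pred)) / len(actual)

    def fitness(coefs):  # MAPE of a linear model over the selected lags
        pred = [sum(c * series[t - i] for c, i in zip(coefs, lags))
                for t in range(k, len(series))]
        return mape(series[k:], pred)

    # Generate solution: random coefficient vectors within the bounds
    pop = [[rng.uniform(*bounds) for _ in lags] for _ in range(pop_size)]
    for _ in range(generations):
        # crossover: convex combinations of two random parents
        for _ in range(pop_size):
            p1, p2 = rng.sample(pop, 2)
            lam = rng.random()
            pop.append([lam * a + (1 - lam) * b for a, b in zip(p1, p2)])
        # select: keep the pop_size fittest of the combined population
        pop.sort(key=fitness)
        pop = pop[:pop_size]
    return pop[0], fitness(pop[0])

best, best_mape = run_ga([100.0 + t for t in range(30)], lags=[1])
```

On this synthetic upward-trending series the GA drives the lag-1 coefficient toward 1, so the resulting MAPE is small.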
4 The case studies
The proposed algorithm was applied to 69 sets of actual data, the monthly oil production values from January 2001 to September 2006 for India, Russia, Brazil and the USA. Each step of the algorithm is discussed in the following sections. The reader may refer to Figure 1, which shows the infrastructure of the proposed algorithm. It is used to identify the preferred model to forecast and estimate oil production by the integrated mechanism of the proposed algorithm, which is based on GA, conventional time series, ANOVA and MAPE.

Step 1: The 69 sets of raw data are divided into 63 input data and six test data. Similarly, the preprocessed data are divided into 62 input data and six test data.

Step 2: It can be seen from Figure 2(a)–(d) that the raw data for Russia, Brazil and the USA have a trend but the raw data for India do not. Since removing the trend gives more precise estimation in time series methods, we applied the first difference method as the best preprocessing method to generate a covariance stationary process. The results of applying this preprocessing method to the data for Russia, Brazil and the USA are shown in Figure 2(e)–(g), respectively.
Figure 2  Raw and preprocessed data: (a) raw data for Russia; (b) raw data for Brazil; (c) raw data for USA; (d) raw data for India; (e) preprocessed data by first difference method for Russia; (f) preprocessed data by first difference method for Brazil and (g) preprocessed data by first difference method for USA (see online version for colours)
Step 3: The input variables for the model must be determined; in our model, input variables were selected using the ACF. The results of applying the ACF to the given data are shown in Figures 3–6 for Brazil, Russia, India and the USA, respectively. Figure 3 shows that the lagged inputs (independent variables) y(t − i) for Brazil are i = 8 and i = 12; Figure 4 shows lags i = 5 and i = 12 for Russia; Figure 5 shows lags i = 1, 2, 3 and 4 for India; and Figure 6 shows lags i = 4 and i = 12 for the USA.

Step 4: After running the GA, Tables 1–4 show the results of close and open simulation for the four countries.

Figure 3  ACF charts for Brazil (see online version for colours)
Figure 4  ACF charts for Russia (see online version for colours)
Figure 5  ACF charts for India (see online version for colours)
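The ACF-based lag selection of Step 3 can be sketched as follows; the 2/√n significance band is the usual large-sample approximation, not a detail given in the paper.

```python
import math

def acf(x, max_lag):
    # sample autocorrelation r_k for k = 1..max_lag
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    return {k: sum((x[t] - mean) * (x[t - k] - mean)
                   for t in range(k, n)) / n / c0
            for k in range(1, max_lag + 1)}

def candidate_lags(x, max_lag):
    # lags whose |r_k| exceeds the approximate 95% band 2/sqrt(n)
    band = 2.0 / math.sqrt(len(x))
    return [k for k, r in acf(x, max_lag).items() if abs(r) > band]

# a 63-point series with a 12-month cycle, mimicking the training window
seasonal = [math.sin(2 * math.pi * t / 12) for t in range(63)]
lags = candidate_lags(seasonal, 12)
```

On such a series, lag 12 shows up as significant, which is consistent with the 12-month lags selected for Brazil, Russia and the USA above.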
Figure 6  ACF charts for USA (see online version for colours)

Table 1  The MAPE between estimated and actual values for India

                        GA (open simulation)      GA (close simulation)
Month   Actual data     Estimated    MAPE         Estimated    MAPE
64      685             685.84       0.0165       685.84       0.0146
65      689             685.71                    656.70
66      704             688.51                    679.60
67      691             691.47                    703.18
68      650             692.38                    690.50
69      701             700.19                    669.08
Table 2  The MAPE between estimated and actual values for Russia

                        GA (open simulation)      GA (close simulation)
Month   Actual data     Estimated    MAPE         Estimated    MAPE
64      9170            9170         0.0013       9145.2       0.0023
65      9160            9164.5                    9136.1
66      9260            9275.3                    9278.0
67      9260            9259.7                    9260.0
68      9330            9276.1                    9266.6
69      9280            9280.0                    9280.0
Table 3  The MAPE between estimated and actual values for USA

                        GA (open simulation)      GA (close simulation)
Month   Actual data     Estimated    MAPE         Estimated    MAPE
64      5067            5115.6       0.0074       5101.9       0.0074
65      5100            5160.1                    5140.7
66      5219            5154.5                    5134.6
67      5171            5120.9                    5103.3
68      5155            5155                      5153.5
69      5120            5124.9                    5120.7
Table 4  The MAPE between estimated and actual values for Brazil

Month   Actual data     GA Estimated    MAPE
64      1737            1737.30         0.0137
65      1748            1730.30
66      1630            1732.60
67      1725            1725.20
68      1703            1703.00
69      1733            1716.40
Note that for Brazil, open simulation and close simulation coincide, because the smallest lag is i = 8 while only six test points are forecast, so previously generated outputs are never needed as inputs.
4.1 Analysis of Variance

The estimated results of the proposed GA for the four countries, the time series methods and the actual data are compared by Analysis of Variance (ANOVA). The experiment was designed such that variability arising from extraneous sources can be systematically controlled. Time is the common source of variability in the experiment and can be systematically controlled through blocking (Montgomery, 1999). Therefore, a one-way blocked design of ANOVA was applied. The results are shown in Tables 5–8. The test of hypothesis is defined as:

H0: µ1 = µ2 = µ3
H1: µi ≠ µj for some i, j = 1, 2, 3, i ≠ j

where µ1, µ2 and µ3 are the average estimations obtained from the actual data, the GA and the time series, respectively. If the null hypothesis is accepted, the preferred model is the one with the lower MAPE error. Otherwise, if the null hypothesis is rejected, Duncan's Multiple Range Test is used to compare treatment means and to select the preferred model. It can be seen from Table 5 that at α = 0.25 the null hypothesis is rejected because f0.25,2,10 = 1.49 and f0 = 1.51. In Table 6 the null hypothesis is accepted; Table 7 shows that at α = 0.25 the null hypothesis is rejected because f0.25,2,10 = 1.49 and f0 = 2.08; and in Table 8 the null hypothesis is accepted. Therefore, we can conclude that the treatment means differ for Russia and the USA. Now, in order to find which of the treatment means (GA or time series) is closer to the actual data, Duncan's Multiple Range Test, discussed in the next section, is used.

Table 5  ANOVA table for comparison of time series, actual data and GA for Russia

Source of variation          Sum square   Degrees of freedom   Mean square   F       F(0.25)
Between groups (treatment)   1302.00      2                    651.00        1.51    1.49
Blocks (month)               73554.485    5                    14710.90      34.16
Within groups                4306.52      10                   430.65
Total                        79163.00     17
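The one-way blocked ANOVA used in Tables 5–8 can be sketched as below; the small data table here is synthetic, purely to exercise the sum-of-squares decomposition (rows = treatments: actual, GA, time series; columns = blocks: months).

```python
def blocked_anova(table):
    """table[i][j]: value for treatment i in block (month) j.
    Returns the treatment F statistic, the error mean square and the dfs."""
    a, b = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (a * b)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_treat = b * sum((sum(row) / b - grand) ** 2 for row in table)
    block_means = [sum(table[i][j] for i in range(a)) / a for j in range(b)]
    ss_block = a * sum((m - grand) ** 2 for m in block_means)
    ss_error = ss_total - ss_treat - ss_block
    df_treat, df_error = a - 1, (a - 1) * (b - 1)
    ms_error = ss_error / df_error
    return (ss_treat / df_treat) / ms_error, ms_error, (df_treat, df_error)

data = [[685, 689, 704, 691, 650, 701],   # synthetic 3 x 6 layout
        [686, 686, 689, 691, 692, 700],
        [690, 680, 700, 695, 660, 705]]
f0, ms_error, dfs = blocked_anova(data)
```

With three treatments and six monthly blocks, the degrees of freedom are (2, 10), matching the F(0.25, 2, 10) = 1.49 critical value used in the tables.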
Table 6  ANOVA table for comparison of time series, actual data and GA for Brazil

Source of variation          Sum square   Degrees of freedom   Mean square   F       F(0.25)
Between groups (treatment)   13118.00     2                    826.00        1.21    1.49
Blocks (month)               3493.77      5                    698.75        1.02
Within groups                6830.23      10                   683.02
Total                        23442.00     17
Table 7  ANOVA table for comparison of time series, actual data and GA for USA

Source of variation          Sum square   Degrees of freedom   Mean square   F       F(0.25)
Between groups (treatment)   232163.00    2                    11527.00      2.08    1.49
Blocks (month)               33233.89     5                    6646.78       1.20
Within groups                55551.11     10                   5555.11
Total                        320948.00    17

Table 8  ANOVA table for comparison of time series, actual data and GA for India

Source of variation          Sum square   Degrees of freedom   Mean square   F       F(0.25)
Between groups (treatment)   48.00        2                    24.00         0.12    1.49
Blocks (month)               613.88       5                    122.78        0.61
Within groups                2006.12      10                   200.61
Total                        2668.00      17
4.1.1 Duncan's Multiple Range Test

We perform Duncan's Multiple Range Test for Russia and the USA. To perform the test, we first find the standard deviation of each treatment mean, calculated as

S_ȳi = sqrt( MS(error) / b )

where b is the number of blocks. We then find the R_p values, calculated as

R_p = r_α(p, f) × S_ȳi
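The quantities used in Table 9 can be reproduced directly; the r values 3.15 and 3.30 are the tabulated significant ranges for α = 0.05 and f = 10, and MS(error) = 5555.11 is taken from Table 7 (USA).

```python
import math

def duncan_ranges(ms_error, b, r_values):
    # S_yi = sqrt(MS(error) / b); R_p = r_alpha(p, f) * S_yi
    s_yi = math.sqrt(ms_error / b)
    return s_yi, {p: r * s_yi for p, r in r_values.items()}

# USA: MS(error) = 5555.11 from Table 7, b = 6 monthly blocks
s_yi, R = duncan_ranges(5555.11, b=6, r_values={2: 3.15, 3: 3.30})
# round(s_yi, 2) -> 30.43, matching the USA column of Table 9
```

Two treatment means that are p ranks apart are declared different when their absolute difference exceeds R_p, which is exactly how the comparisons in Table 9 are decided.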
where rα (p, f) is driven from the Duncan’s Multiple Range Test. After sorting the mean treatment we can compare each treatment for each country. As illustrated in Table 9, averages of the first (actual data) and the second treatment (the GA) for the USA are equal at α = 0.05. This shows that the average estimated values of oil production of the GA and actual data are equal at 95% confidence level. Hence, the GA outputs outperform the conventional time series significantly for the USA. The averages of the first (actual data) and the second treatment (the GA) are equal at α = 0.05 and also the averages of the first (actual data) and the third treatment
(time series) are equal at α = 0.05 for Russia. However, the interval between the first and second treatment means is smaller than that between the first and third, so we can conclude that the GA outputs outperform the conventional time series for Russia as well. For India and Brazil, we use MAPE, because the null hypothesis in ANOVA is accepted. Table 10 indicates that the GA outputs outperform the conventional time series for India and Brazil because of their lower MAPE. Both GA and conventional time series provide good results for Russia, whereas GA provides much better results than the conventional time series for the USA. Table 9
Duncan’s Multiple Range Test for USA and Russia

                                         USA                          Russia
Average of the first treatment (ȳ1)      5138.67                      9243.3
Average of the second treatment (ȳ2)     5138.50                      9228.3
Average of the third treatment (ȳ3)      4897.67                      9223.3
ȳ1 − ȳ2                                  5138.67 − 5138.50 = 0.17     9243.3 − 9228.3 = 15
R2 = r0.05(2, 10) × S_ȳi                 3.15 × 30.43 = 95.85         3.15 × 8.47 = 26.69
Comparing treatments 1 and 2             0.17 < 95.85 → µ1 = µ2       15 < 26.69 → µ1 = µ2
ȳ1 − ȳ3                                  5138.67 − 4897.67 = 241      9243.3 − 9223.3 = 20
R3 = r0.05(3, 10) × S_ȳi                 3.30 × 30.43 = 100.41        3.30 × 8.47 = 27.96
Comparing treatments 1 and 3             241 > 100.41 → µ1 ≠ µ3       20 < 27.96 → µ1 = µ3
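The USA column of Table 9 can be reproduced with a short sketch. MS(error) = 5555.11 comes from Table 7, b = 6 is the number of monthly blocks, and the r_0.05(p, 10) values are read from Duncan’s table of significant ranges:

```python
import math

# Duncan's Multiple Range Test for the USA column of Table 9.
ms_error, b = 5555.11, 6
s_mean = math.sqrt(ms_error / b)        # standard error of a treatment mean, approx. 30.43

r = {2: 3.15, 3: 3.30}                  # r_0.05(p, f=10) from Duncan's table
R = {p: rp * s_mean for p, rp in r.items()}   # least significant ranges R2, R3

y1, y2, y3 = 5138.67, 5138.50, 4897.67  # means: actual data, GA, time series
same_1_2 = abs(y1 - y2) < R[2]          # True  -> mu1 = mu2 (GA matches actual data)
same_1_3 = abs(y1 - y3) < R[3]          # False -> mu1 != mu3 (time series differs)
```

The comparisons reproduce the table: the GA mean is statistically indistinguishable from the actual data, while the time series mean is not.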
Table 10	Comparing treatments for Brazil and India

                                   Brazil    India
MAPE (GA)                          0.0138    0.01584
MAPE (conventional time series)    0.205     0.0244
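MAPE, used in Table 10, is the mean of the absolute percentage errors between forecast and actual values. A minimal sketch (the production figures below are illustrative, not the paper’s data):

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error over paired observations (as a fraction)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Illustrative monthly production figures, just to show the calculation.
actual = [1720, 1735, 1750, 1762]
forecast = [1700, 1740, 1745, 1780]
err = mape(actual, forecast)   # multiply by 100 for a percentage
```

The model with the lower MAPE is selected, which is how the GA wins for both Brazil and India in Table 10.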
5	Conclusion
This paper presented an integrated algorithm to estimate and predict oil production. The proposed algorithm uses ANOVA to select either GA or conventional time series for future production estimation. If the null hypothesis in the ANOVA F-test is rejected, Duncan’s Multiple Range Test is used to identify which model is closer to the actual data at the α level of significance; when the null hypothesis is accepted, MAPE is used instead to select between the GA and the time series model. The significance of the proposed algorithm is two-fold. First, it is flexible and identifies the best model based on the results of ANOVA and MAPE, whereas previous studies consider the best-fitted GA model based on MAPE or relative-error results. Second, because of its dynamic structure, the proposed algorithm may identify conventional time series as the best model for future oil production forecasting, whereas previous studies assume that GA always provides the best solutions and estimates. Figure 1 depicts the
proposed algorithm of this study. The reader should note that all steps of the integrated algorithm are based on standard, scientific methodologies: GA, conventional time series, ANOVA, Duncan’s Multiple Range Test and MAPE. Furthermore, the GA modelling is based on the time series model selected for the data set; the best model is distinguished by modelling, running and testing various time series models and selecting the one with the lowest error. To show the applicability of the GA and time series approach, data on monthly oil production in Russia, Brazil, the USA and India from January 2001 to September 2006 were used. The data were preprocessed with the first-difference method to improve the output. We used two strategies, open and closed simulation, to estimate and predict oil production. ANOVA was then applied to the open simulation to compare the proposed GA, the time series model and the actual data. The null hypothesis was accepted for India and Brazil, so MAPE was used to identify which model is closer to the actual data; the proposed GA gave better estimates for both countries. On the other hand, the null hypothesis was rejected for Russia and the USA, so Duncan’s Multiple Range Test was used to identify which model is closer to the actual data. The proposed GA gave better estimates of oil production for the USA, whereas for Russia either GA or time series may be used. Overall, the proposed algorithm provides better estimation than the conventional approach for the USA, India and Brazil, and is indifferent for Russia. We therefore conclude that the proposed approach is suitable for oil production prediction worldwide.
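The model-selection logic summarised above can be sketched as follows (function and argument names are illustrative, not from the paper):

```python
def select_model(f_statistic, f_crit, mape_ga, mape_ts, duncan_prefers_ga):
    """Choose between GA and conventional time series, following the
    ANOVA -> Duncan / MAPE decision flow of the integrated algorithm."""
    if f_statistic > f_crit:
        # Null hypothesis rejected: Duncan's Multiple Range Test decides
        # which model's mean is closer to the actual data.
        return "GA" if duncan_prefers_ga else "time series"
    # Null hypothesis accepted: fall back to the MAPE comparison.
    return "GA" if mape_ga < mape_ts else "time series"

# USA: ANOVA null rejected (F = 2.08 > 1.49), Duncan favours GA.
# Brazil: null accepted (F = 1.21 < 1.49), MAPE_GA = 0.0138 < 0.205.
```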
Future research should deal with the utilisation of other intelligent tools, such as Adaptive Network-based Fuzzy Inference Systems (ANFIS) and ANNs, to see whether they can provide a better solution than GA in particular and the integrated approach in general.
Acknowledgements

The authors are grateful to the reviewers for their valuable comments and suggestions, which have enhanced the strength and significance of this paper.
Notes

1	Partial Auto-Correlation Function.
2	By definition, a preprocessed series is covariance stationary if it has a finite and time-invariant mean and covariance.
3	Mean Absolute Percentage Error.