IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 62, NO. 7, JULY 2015
Improving the Quality of Prediction Intervals Through Optimal Aggregation

Mohammad Anwar Hosen, Student Member, IEEE, Abbas Khosravi, Member, IEEE, Saeid Nahavandi, Senior Member, IEEE, and Douglas Creighton, Member, IEEE
Abstract—Neural networks (NNs) are an effective tool for modeling nonlinear systems. However, their forecasting performance drops significantly in the presence of process uncertainties and disturbances. NN-based prediction intervals (PIs) offer an alternative solution that appropriately quantifies the uncertainties and disturbances associated with point forecasts. In this paper, an NN ensemble procedure is proposed to construct quality PIs. A recently developed lower–upper bound estimation method is applied to develop NN-based PIs, and the PIs constructed by the NN ensemble members are then combined using a weighted averaging mechanism. Simulated annealing and a genetic algorithm are used to optimally adjust the weights of the aggregation mechanism. The proposed method is examined on three different case studies. Simulation results reveal that the proposed method improves the average PI quality of the individual NNs by 22%, 18%, and 78% for the first, second, and third case studies, respectively. The simulation study also demonstrates that a 3%–4% improvement in PI quality can be achieved using the proposed method compared to the simple averaging aggregation method.

Index Terms—Aggregation, neural network (NN), prediction interval (PI), simulated annealing (SA), uncertainty and disturbance, weighted average.

Manuscript received January 13, 2014; revised May 2, 2014, July 21, 2014, and September 18, 2014; accepted October 12, 2014. Date of publication December 18, 2014; date of current version May 15, 2015. This work was supported by the Centre for Intelligent Systems Research at Deakin University, Australia. The authors are with the Centre for Intelligent Systems Research, Deakin University, Burwood Vic. 3125, Australia (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIE.2014.2383994
I. INTRODUCTION
In recent years, neural networks (NNs) have emerged as an efficient, effective, and reliable tool for modeling nonlinear systems because of their ability to capture unspecified nonlinear relationships between system outputs (targets) and input variables [1]–[3]. However, the performance of NNs drops significantly in the presence of large process disturbances and uncertainties [4], [5]. In addition, a traditional NN model gives no indication of the accuracy of its own predictions. A new NN modeling strategy, named the lower–upper bound estimation (LUBE) method, was proposed in the literature to quantify the effects of uncertainties and disturbances on point forecasts generated by NNs [6]. In this technique, an NN predicts an interval, rather than a single value, for every input sample (all traditional NNs predict single values as forecasts). Fig. 1 shows the basic idea of NN-based prediction intervals (PIs) in the LUBE method. A PI consists of an upper and a lower bound with a prescribed confidence level (CL), and this CL indicates the accuracy of the NN performance.

Fig. 1. PI-based NN model in the LUBE method.

As described in [7], PIs are developed to capture the variance associated with the difference between the measured value and the model-predicted value. This difference is decomposed as

$$t_i - \hat{y}_i = (y_i - \hat{y}_i) + \epsilon_i \tag{1}$$

where $t_i$, $\hat{y}_i$, and $y_i$ are the $i$th measured target (output), the model output, and the true regression mean of the target, respectively, and $\epsilon_i$ is a noise term (random variable) with zero expectation. The two terms on the right-hand side of (1) are statistically independent, so the total variance associated with the model outcome can be written as [7]

$$\sigma_i^2 = \sigma_{\hat{y}_i}^2 + \sigma_{\epsilon_i}^2. \tag{2}$$

The first term of (2), $\sigma_{\hat{y}_i}^2$, originates from epistemic uncertainty (model mismatch and parameter estimation errors), and the second term, $\sigma_{\epsilon_i}^2$, originates from aleatory uncertainty (noise variance or unmeasured disturbances). Therefore, PIs cover both the uncertainty associated with model mismatch and the noise in the data. Studies on the construction, optimization, and application of PIs in different fields of science and engineering can be found in the literature [8]–[13]. Through comprehensive case studies and simulations, Khosravi et al. [6] showed that the LUBE method produces higher quality PIs in terms of coverage probability (calibration) and width (sharpness) than other available techniques, such as the bootstrap [14], Bayesian [15], mean-variance estimation [16],
and delta [17] methods. Moreover, its computational burden is significantly lower than that of alternative PI construction methods. The LUBE method uses a PI-based cost function, instead of a traditional error-based cost function, to train the NN models. This PI-based cost function directly targets PI quality rather than minimizing forecasting errors, the procedure followed by traditional PI construction methods; it not only increases the PI coverage probability but also decreases the PI widths to ensure informative PIs.

It is well known that an ensemble of NNs is an efficient way to improve the accuracy, generalization power, and robustness of NNs [18]–[20]. Usually, the outputs of a set of individually trained NNs are combined to form a single prediction [21], [22]. This is because NN performance depends strongly on the initial parameters and on perturbations of the NN parameters, so the prediction performance fluctuates from one training replicate to another [23], [24]. Moreover, a trained NN is not always optimal over the whole range of data or across different data sets, and even retraining an NN will yield a different set of parameters, as the search space is often nonconvex with several local minima. Given these observations, it is reasonable to study further how the quality of PIs generated by NNs can be improved through optimal aggregation methods.

Several NN ensemble methods (using forecast combination) can be found in the literature [25]. However, the use of ensemble methods for PI-based forecasts is still limited. Simple averaging and weighted averaging are the methods most often used in the forecast ensemble procedure [26]. The former is the most popular aggregation method because of its simplicity and ease of implementation. In a recent study, Khosravi and Nahavandi [18] applied this method to combine NN-based PIs for wind power generation; to the best of our knowledge, this is the only existing published work on PI aggregation. They showed that the method improves the quality of PIs by approximately 50% in the majority of cases compared to the PIs generated by individual NNs. However, its main limitation is that it assumes an equal contribution from every NN to the ensemble output. Weighted averaging of NN forecasts is the most effective method for forecast combination [26]–[29]. As the key contribution of this paper, we extend the concept of weighted average forecast aggregation to the field of PI construction. To the best of our knowledge, this is one of the first studies investigating the possibility of improving the quality of PIs through their aggregation. In the literature, an error-based cost function is often used to optimize the weight of each NN ensemble member [25], [30], [31]. In contrast to the traditional weighted average ensemble procedure, we use a PI-based cost function to construct high-quality combined PIs. Two optimization algorithms, namely, simulated annealing (SA) and a genetic algorithm (GA), are applied to minimize this cost function [32]. Finally, the performance of the proposed method is examined on three different case studies, and comparisons are made between the quality of the individual and the combined PIs to quantitatively measure the superiority of the proposed method in generating quality intervals.

The rest of this paper is structured as follows. Section II briefly describes the LUBE method; a PI-based cost function is also introduced in that section.
Section III presents the proposed method for the aggregation of PIs. The case studies used in this paper are described in Section IV. Section V presents the simulation results for the proposed method. Section VI concludes the paper with some remarks.

II. DEVELOPMENT OF PI-BASED NN

PI-based forecasting is a promising tool for quantifying the effects of the uncertainties and disturbances present in process systems. In contrast to traditional point forecasting, its main advantage is that it carries extra information, such as the prediction accuracy. PIs are generated with a prescribed probability called the CL, $(1 - \alpha)$, which denotes the accuracy or trustworthiness of the constructed PI values. As mentioned in Section I, several methods exist in the literature for constructing PIs using NNs [4]; however, all of them have their own limitations. Existing PI construction techniques attempt to estimate the mean and variance of the targets to construct the PIs (essentially, these are parametric methods). The LUBE method, in contrast, directly approximates the upper and lower bounds of the PIs from the set of inputs, offering an effective, reliable, and fast way to construct high-quality PIs. The LUBE method trains the NN models with a PI-based cost function rather than an error-based one. This cost function, known as the coverage width criterion (CWC), consists of the two quality indexes of PIs, the coverage probability and the width:

$$CWC = PINAW \left( 1 + \gamma(PICP)\, e^{-\eta (PICP - \phi)} \right) \tag{3}$$

where

$$\gamma(PICP) = \begin{cases} 0, & PICP \ge \phi \\ 1, & PICP < \phi. \end{cases}$$
Here, $\eta$ is a hyperparameter that scales the PI coverage error ($PICP - CL$), assigning a larger penalty to undercoverage, and $\phi$ corresponds to the nominal CL $(1 - \alpha)$. The two objective terms of CWC, the PI coverage probability (PICP) and the PI normalized average width (PINAW), are defined as

$$PICP = \frac{1}{n} \sum_{j=1}^{n} c_j \tag{4}$$

where

$$c_j = \begin{cases} 1, & t_j \in [L_j, U_j] \\ 0, & t_j \notin [L_j, U_j] \end{cases}$$

and

$$PINAW = \frac{1}{R} \left( \frac{1}{n} \sum_{j=1}^{n} (U_j - L_j) \right). \tag{5}$$

Here, $t_j$, $L_j$, and $U_j$ are the target value, lower bound, and upper bound for the $j$th sample, respectively, and $R$ is the range of the underlying target values.
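To make these definitions concrete, the following minimal sketch computes PICP, PINAW, and CWC for a set of intervals per (3)–(5). The function name and the NumPy-based implementation are our own illustration, not code from the paper.

```python
import numpy as np

def pi_metrics(t, lower, upper, eta=50.0, phi=0.90):
    """Compute PICP, PINAW, and CWC for a set of prediction intervals.

    A minimal sketch of (3)-(5); array names and this helper are
    illustrative, not the authors' code.
    """
    t, lower, upper = map(np.asarray, (t, lower, upper))
    covered = (t >= lower) & (t <= upper)      # c_j in (4)
    picp = covered.mean()                      # coverage probability, (4)
    R = t.max() - t.min()                      # range of the targets
    pinaw = (upper - lower).mean() / R         # normalized average width, (5)
    gamma = 0.0 if picp >= phi else 1.0        # piecewise gamma in (3)
    cwc = pinaw * (1.0 + gamma * np.exp(-eta * (picp - phi)))
    return picp, pinaw, cwc
```

With $\eta = 50$ and $\phi = 0.9$, as used later in the paper, even a small coverage shortfall inflates CWC sharply, which is the intended penalty behavior.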
From (4), it can be seen that PICP measures the fraction of samples covered by the PIs. Note that one can easily obtain a 100% PICP by simply increasing the widths of the PIs (measured by PINAW). To eliminate this problem, the cost function CWC in (3) covers both key characteristics of PIs (PICP and PINAW). In general, the smaller the CWC value, the better the PI quality; PIs with a low coverage probability always have a large CWC value, regardless of their widths. As seen in (3), $\gamma = 0$ when $PICP \ge \phi$ (i.e., PICP satisfies the predefined CL), and the exponential term is eliminated (thus $CWC = PINAW$). In this case, the optimization algorithm deals only with PINAW and tries to reduce the width of the PIs. If $PICP < \phi$ (i.e., PICP is not satisfactory), the CWC value increases exponentially. Gradient descent-based methods are not appropriate for minimizing CWC, as it is nonlinear, complex, and nondifferentiable. In contrast, global optimization techniques such as GA and SA are ideal tools for minimizing this cost function and adjusting the NN parameters. The detailed procedure of the LUBE method can be found in [6].

The main intention of this paper is to aggregate the PI-based forecasts from NNs to obtain better quality and more stable PIs. To serve this purpose, N_total PI-based NNs (PI-NNs) are developed using the LUBE method for a 90% CL. The first step of NN training is to prepare the sample data. It has already been shown that randomized sampling improves the generalization power of NN models [33]. The prepared sample data are randomly split into three subsets as follows (a minimal split sketch is given at the end of this section).

1) The training data set (S_train) contains 60% of the total data.
2) The validation data set (S_vald) contains 20% of the total data.
3) The testing data set (S_test) contains 20% of the total data.

Only the training data are used to train the PI-NN models; the other two data sets are reserved for the NN forecast (PI) combination procedure. Random initialization of the training parameters and different NN structures are used in the training process to diversify the PI-NN models. Five distinct NN structures are obtained by changing the number of neurons (N_u) in the hidden layer: N_u is set to 6, 8, 10, 12, and 14, with each class contributing an equal number of NN models (N_total/5). The cost function CWC is used to optimize the NN parameters. The SA optimization algorithm is employed to minimize CWC by adjusting the NN parameters. The initial cooling temperature (T_SA0) for SA is set to 5 to allow uphill moves in the early iterations, and a geometric cooling schedule with a cooling factor of 0.9 is used to update the cooling temperature (T_SA). The parameter $\eta$ in (3) is set to 50 to penalize PIs whose PICP is less than the desired CL; the selection of the $\eta$ value and the sensitivity of CWC to $\eta$ are discussed in detail in [6]. The parameters used in the LUBE method are listed in Table I.

TABLE I. PARAMETERS USED IN THE LUBE METHOD TO TRAIN THE PI-NN MODEL.
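As a concrete illustration of the data preparation step described above, the sketch below performs the 60/20/20 random split. The helper name and NumPy usage are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def split_data(X, y, seed=0):
    """Randomly split samples into 60% train, 20% validation, 20% test,
    as described in Section II. The 60/20/20 ratios are from the paper;
    everything else here is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)                       # randomized sampling
    n_tr, n_va = int(0.6 * n), int(0.2 * n)
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```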
III. COMBINATION OF PI-BASED FORECAST

The combination of point forecasts from several models to improve overall prediction accuracy has received considerable attention in the NN community. It has been reported in [34] that the NN ensemble technique is an effective way to improve prediction performance, even with simple averaging (the combined forecast is the simple average of the forecasts of the individual members). NN ensembles have been applied in many domains, such as electricity load forecasting, machine learning, finance and economics, and medical science [34]–[36]. However, all of these applications deal with point forecasting. Recently, Khosravi and Nahavandi [18] proposed a PI-based NN ensemble method based on the simple averaging of the PIs generated by distinct PI-NN models. Compared with simple averaging, weighted averaging is better suited to reflecting the individual NN performance in the ensemble process [26]. In this paper, we extend the PI aggregation method proposed in [18] by using a weighted averaging mechanism. Fig. 2 shows the ensemble development procedure for the PI-based NNs. The proposed PI combination procedure is as follows.

Fig. 2. Training and ensemble procedure of PI-based NNs.

1) Determination of ensemble members: The ensemble members for the PI combination process are selected from the PI-NN models developed in Section II using the LUBE method. As described before, random training initialization and different NN structures are used to diversify the PI-NN models. First, the diversified PI-NN models are ranked according to their individual prediction performance, which is determined on the S_vald data set using the cost function CWC. N_total sets of PIs are constructed by the developed LUBE PI-NN models on the S_vald data set, and PICP_vald, PINAW_vald, and CWC_vald are calculated for these sets of PIs. The PI-NN models are then ranked in ascending order of CWC_vald, and the first N_best models (PI-NN_ranked) are selected as the ensemble members out of the N_total NN models; the remaining NN models are discarded. The PICP, PINAW, and CWC values of the sorted PI-NN_ranked models are denoted PICP_ranked, PINAW_ranked, and CWC_ranked, respectively.

2) Weighted averaging of PIs: N_best sets of PIs (PI_NNbest,r, r = 1, 2, ..., N_best) are constructed by the N_best NN models on the S_test data set, which is used neither in the training process nor in the ensemble member determination process. PICP_NNbest,r and PINAW_NNbest,r are determined from PI_NNbest,r and used to calculate CWC_NNbest,r. The constructed PI_NNbest,r are then combined using the weighted averaging mechanism. The general equation of the weighted averaging combined PIs (PI_comb) is

$$PI_{comb} = \sum_{r=1}^{N_{best}} w_r\, PI_{NNbest,r} \tag{6}$$

where $r$ ($r = 1, 2, \ldots, N_{best}$) is the ranking position of the N_best PI-NN models, and PI_NNbest,r and $w_r$ are the corresponding PI set and weight (the combiner parameter), respectively.

The most challenging task in the weighted average forecast combination method is determining the weights $w_r$ of the individual NN ensemble members. Traditionally, error-based cost functions such as the mean square error or the mean absolute percentage error are used to optimize the weights in point forecast problems [30]. However, there are no prior values of the desired PIs in our case, and hence error-based cost functions cannot be used to optimize the combiner parameters. Instead, the cost function CWC of the LUBE method is employed as the objective function for the PI combination process. High-quality combined PIs can be constructed by minimizing the CWC value, as this cost function covers both key indexes of PI quality (PICP and PINAW). Two optimization techniques, SA and GA, are used to optimize the weights of the NN ensemble members by minimizing the CWC value under the constraints

$$0 \le w_r \le 1 \quad \text{and} \quad \sum_{r=1}^{N_{best}} w_r = 1.$$
These two constraints allow all ensemble members to contribute to the PI combination process.

The SA algorithm is well known for nonlinear optimization problems [37]. Khosravi et al. [6] described in detail how this method optimizes a set of variables using the PI-based cost function. The SA optimization process is terminated if one of the following criteria is met.
1) The maximum number of iterations is reached.
2) No improvement is achieved for a specific number of consecutive iterations.
3) The cooling temperature T_SA becomes smaller than a predefined threshold.

GA is an evolutionary algorithm inspired by Darwin's theory of evolution and its mechanisms of inheritance, mutation, selection, and crossover. GA is very popular for nonlinear and complex optimization problems [38]. The basic steps of GA are as follows (a code sketch of the crossover and mutation operators appears at the end of this section).
1) The GA starts by generating a random population of n chromosomes (corresponding to n sets of weights). The chromosome type used in this paper is a vector of real numbers.
2) Evaluate the cost function CWC for each set of weights in the population and rank the sets by their CWC values.
3) Generate a new population by the following three steps.
a) Selection: Select the best chromosomes from the parents of the previous population, based on their fitness value CWC, as children for the next generation. These children are called elite children.
b) Crossover: Crossover children are created by combining the vectors of two parents according to a crossover probability. Here, the scattered crossover function is used: it creates a random binary vector of the same length as the chromosome. For example, if the two parents are p1 = [a b c d e f g h] and p2 = [1 2 3 4 5 6 7 8], and the random crossover vector is [1 0 1 1 0 0 0 1], then the child is [a 2 c d 5 6 7 h].
c) Mutation: These children are created by random changes to the genes (positions in the chromosome) of individual parents.
4) Evaluate the fitness function to accept or reject the new population.
5) Check the stopping conditions: the algorithm terminates if the maximum number of iterations/generations is reached, or if the change in the fitness value over the stall generations is less than the function tolerance; otherwise, it returns to step 2.
6) After the termination of the optimization algorithm, the final set of weights is taken as the optimized weights of the ensemble members, and PI_comb is calculated using (6).

The initial combiner parameters for the optimization process are determined from the CWC values. Both CWC data sets (CWC_ranked and CWC_NNbest), determined using the S_vald (ranking stage) and S_test (combining stage) data sets, are used to obtain the initial combiner parameters as

$$w_{r,initial} = \frac{1}{2}\left[\frac{1/CWC_{ranked,r}}{\sum_{m=1}^{N_{best}} 1/CWC_{ranked,m}} + \frac{1/CWC_{NNbest,r}}{\sum_{m=1}^{N_{best}} 1/CWC_{NNbest,m}}\right]. \tag{7}$$
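The following minimal sketch implements the inverse-CWC initialization of (7) and the weighted combination of (6); setting every weight to 1/N_best instead recovers the simple averaging baseline used for comparison in Section V. The array shapes and helper names are illustrative assumptions.

```python
import numpy as np

def initial_weights(cwc_ranked, cwc_nnbest):
    """Inverse-CWC initial combiner parameters, as in (7)."""
    a = 1.0 / np.asarray(cwc_ranked)       # ranking stage (S_vald)
    b = 1.0 / np.asarray(cwc_nnbest)       # combining stage (S_test)
    return 0.5 * (a / a.sum() + b / b.sum())

def combine_pis(lowers, uppers, w):
    """Weighted average of N_best PI sets, as in (6).

    lowers, uppers: arrays of shape (N_best, n_samples) holding the
    lower/upper bounds produced by each ensemble member.
    """
    w = np.asarray(w)[:, None]             # broadcast weights over samples
    return (w * lowers).sum(axis=0), (w * uppers).sum(axis=0)
```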
The other initial optimization parameters for SA are set as per Table I. The GA parameters, namely the population size, crossover probability, and mutation probability, are set to 30, 65%, and 4%, respectively; a trial-and-error method is used to determine the GA parameters [39]. The maximum number of generations for GA is also set to 100. After the optimization process is complete, the final combiner parameters (w_r,opt) are used to generate PI_comb via (6). Finally, PICP_comb, PINAW_comb, and CWC_comb are determined from PI_comb using (3)–(5) and stored for examining and comparing the performance of the combined PIs with the PIs generated by the individual NNs.
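To illustrate the GA operators described above, here is a minimal sketch of scattered crossover and mutation on weight vectors. The Gaussian perturbation size and the clip-and-renormalize step that restores the sum-to-one constraint after each operation are our assumptions; the paper does not state how the constraints are enforced inside the GA.

```python
import numpy as np

rng = np.random.default_rng(1)

def scattered_crossover(p1, p2):
    """Scattered crossover: a random binary mask picks each gene from
    parent 1 (mask == 1) or parent 2 (mask == 0), as in the paper's
    [a 2 c d 5 6 7 h] example."""
    mask = rng.integers(0, 2, size=p1.shape).astype(bool)
    return np.where(mask, p1, p2)

def mutate(parent, rate=0.04):
    """Randomly perturb genes with the stated 4% mutation probability.
    The Gaussian perturbation is an illustrative assumption."""
    child = parent.copy()
    hit = rng.random(parent.shape) < rate
    child[hit] += rng.normal(0.0, 0.1, size=hit.sum())
    return child

def project(w):
    """Clip to [0, 1] and renormalize so the weights sum to one
    (assumed constraint-handling step, not specified in the paper)."""
    w = np.clip(w, 0.0, 1.0)
    return w / w.sum()
```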
IV. CASE STUDIES

The performance of the proposed NN ensemble method is examined on three different case studies. The first case study is a nonlinear exothermic polystyrene (PS) batch reactor [40]. In a recent study, Hosen et al. [41] developed a traditional NN model for a control study of this PS reactor. They generated the training data using a published hybrid model of the reactor [42]; the target data set (reactor temperature) was generated by changing the effective operating variable, the heater power [43]. We follow the same procedure to generate the training data, but disturbances are included in our training data by changing the effective operating variable [44], and white noise (variance = 1) is added to the reactor temperature to account for unknown or unmeasured disturbances.
Fig. 4. NN structure with four inputs [Q(t), T(t−3), T(t−2), T(t−1)] and one output [T(t)] for the PS reactor. This NN model is used in a model predictive control system to optimize the control signal, the heater duty [41].
A total of 12 000 samples (a time series) are generated for the whole batch run of the PS reactor. The time-series data set is randomly split into three subsets: training (60%), validation (20%), and testing (20%). Fig. 3 shows the first 100 samples of temperature and heater power in the training set. The input–output vector for the PI-NN PS model is [Q(t), T(t−3), T(t−2), T(t−1); T(t)], as shown in Fig. 4 [41], where T(t−1), T(t−2), and T(t−3) are three lagged values of the reactor temperature added to the set of inputs.

Fig. 3. Prepared data for NN training for the first case study (first 100 samples of the total data set).

The nonlinear time-delay Mackey–Glass differential equation is used to generate the data for the second case study. The classical Mackey–Glass equation is

$$\frac{dx(t)}{dt} = \frac{\beta x(t-\tau)}{1 + x^n(t-\tau)} - \gamma x(t) \tag{8}$$
where $\beta$, $n$, $\tau$, and $\gamma$ are positive constants. The time series is chaotic and nonconvergent for $\tau > 17$ and has been widely used as a benchmark in prediction studies [45]. The parameters $\beta$, $n$, $\tau$, and $\gamma$ are set to 0.2, 10, 17, and 0.1, respectively, and x(0) is set to 1.2. A total of 1000 data samples are generated. Four lagged values are used as inputs to train the PI-NN model; the input–output vector for the NN training data is [x(t−4), x(t−3), x(t−2), x(t−1); x(t)].
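A minimal sketch of generating the Mackey–Glass series by forward-Euler integration of (8) with the parameters above follows. The unit step size and the zero history before t = 0 are illustrative assumptions, since the paper does not specify the integration scheme.

```python
import numpy as np

def mackey_glass(n_samples=1000, beta=0.2, gamma=0.1, n=10, tau=17,
                 x0=1.2, dt=1.0):
    """Forward-Euler integration of the Mackey-Glass equation (8).
    The unit step dt and zero history before t = 0 are assumptions."""
    x = np.zeros(n_samples + tau)
    x[tau] = x0                                # x(0) = 1.2
    for t in range(tau, n_samples + tau - 1):
        x_tau = x[t - tau]                     # delayed state x(t - tau)
        dx = beta * x_tau / (1.0 + x_tau ** n) - gamma * x[t]
        x[t + 1] = x[t] + dt * dx
    return x[tau:]                             # n_samples values from t = 0
```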
Fig. 5. Training data set for the second case study.
Fig. 6. Training data set for the third case study.
The third case study is a nonlinear plant whose output depends nonlinearly on both its past output values and its input values [46]. The plant difference model is

$$y(t+1) = \frac{y(t)}{1 + y^2(t)} + u^3(t) \tag{9}$$

where

$$u(t) = \sin\!\left(\frac{2\pi t}{100}\right). \tag{10}$$

y(1) is set to 0.05, and 500 samples are generated. Both y(t) and u(t) are used to predict y(t + 1); the input–output vector for the NN training data is [u(t), y(t); y(t + 1)].
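A minimal sketch of simulating the plant of (9) and (10), using the initial condition and sample count stated above; the variable names are illustrative.

```python
import numpy as np

def simulate_plant(n_samples=500, y1=0.05):
    """Simulate the nonlinear plant of (9) driven by the sinusoid (10).
    Pairs (u[k], y[k]) -> y[k+1] form the NN training vectors."""
    t = np.arange(1, n_samples + 1)
    u = np.sin(2.0 * np.pi * t / 100.0)                  # input, (10)
    y = np.zeros(n_samples + 1)
    y[0] = y1                                            # y(1) = 0.05
    for k in range(n_samples):
        y[k + 1] = y[k] / (1.0 + y[k] ** 2) + u[k] ** 3  # plant, (9)
    return u, y
```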
The training data sets for the second and third case studies are shown in Figs. 5 and 6, respectively.

V. RESULTS AND DISCUSSION

The prediction performance of the individual ranked PI-NN models and of the ensemble PI-NN models is presented in this section. The cost function CWC and the PI quality indexes (PI width and PI coverage probability) are used to examine the performance of the PI-NN models. Finally, the prediction performance of the ensemble NNs is compared with that of the individual ensemble members.
First of all, 150 (N_total = 150) NNs are developed using the S_train data set with a CL of 90% for all case studies, as described in Section II. N_best is set to 30; that is, only the best 30 of the 150 NN models are kept as ensemble members. The validation data set (S_vald) is used to determine the NN models in the ensemble; as described in Section II, the cost function CWC serves as the performance index for ranking the PI-NN models. After forming the ensemble, 30 sets of PIs (PI_NNbest) are constructed on the S_test data set for all ensemble members, and CWC_NNbest is calculated.

Fig. 7. CWC_ranked and CWC_NNbest for the 30 ranked PI-NN models. (a), (b), and (c) are for the first, second, and third case studies, respectively.

Fig. 7 shows the CWC_ranked (using S_vald data) and CWC_NNbest (using S_test data) values of the 30 ranked models for all case studies. As the ensemble members are selected from the total of 150 PI-NN models, the prediction performance of the ranked models is expected to be consistent in terms of
CWC_ranked values. This holds for the second and third case studies, as shown in Fig. 7(b) and (c); however, the CWC_ranked values vary from 20 to 40 for the first case study, as seen in Fig. 7(a). This is because NN performance fluctuates from one training replicate to another with respect to the training initialization, and because of the nonlinear behavior of the training data (the PS reactor is highly nonlinear in nature). Fig. 7 also shows the individual prediction capability of the ranked NN models on the S_test data set. The prediction performance of the ranked NN models is less consistent on the S_test data set than on the S_vald data set. Ideally, CWC_ranked,r ≈ CWC_NNbest,r for corresponding NN models. However, only 46% (14 of 30) and 76% (23 of 30) of the PI-NN models produced PIs of similar or near-similar quality on both data sets for the first and second case studies, respectively. The worst results are observed in the third case study, where only 10% (3 of 30) of the PI-NN_ranked models produced similar or near-similar CWC values on both data sets. This indicates that the best performing NN model on one data set does not always produce the best result on another data set; consequently, it is not practically wise to make a decision by relying on an individual NN prediction.

Following the proposed ensemble method, the PI_NNbest sets from the ranked PI-NN models are combined using the weighted averaging method. Two optimization algorithms, SA and GA, are employed to optimize the weights in (6) by minimizing the PI-based cost function CWC. For ease of explanation, the weighted averaging methods using SA and GA are denoted WASA and WAGA, respectively. Equation (7) is used to calculate the initial weights of the NN ensemble members. The PIs combined using SA and GA are denoted PI_comb,WASA and PI_comb,WAGA. Combined PIs are also generated by the simple averaging method (PI_comb,mean) for comparison:

$$PI_{comb,mean}(k) = \frac{1}{N_{best}} \sum_{j=1}^{N_{best}} PI_{NNbest,j}(k) \tag{11}$$

where $k = 1, 2, \ldots$ indexes the samples of the PI_NNbest sets (i.e., of the S_test data set).

The performance of the proposed aggregation method is now compared with that of the individual ensemble members in terms of the cost function CWC and the coverage probability PICP. Fig. 8 shows the CWC values obtained using the PI-NN_ranked models and the proposed method for the three case studies. As seen in Fig. 8, the proposed method produces better quality PIs than the ranked ensemble members in the majority of cases: CWC_comb < CWC_NNbest,r is observed for 21, 27, and 30 of the 30 PI-NN_ranked models in the first, second, and third case studies, respectively. A lower CWC means better PI quality in terms of PICP and width. The proposed WAGA ensemble method produces better results than the other PI combination methods (WASA and simple averaging). The CWC_comb,WAGA values for the first, second, and third cases are 19.64, 15.26, and 13.28, respectively, which are significantly lower than the corresponding mean values of CWC_NNbest,r (34, 22.52, and 226.68). In the third case, the proposed method produces higher quality PIs than all
Fig. 8. CWC_NNbest,r and CWC_comb for the three case studies. (a), (b), and (c) are for the first, second, and third case studies, respectively.
of the ensemble members. This indicates that the proposed ensemble method produces better quality PIs and that its performance remains consistent from one experiment to another. The quality of the ensemble PIs is also better than that of the individual NNs in terms of the coverage probability PICP, as demonstrated in Fig. 9. For the first two cases, the PIs from the simple averaging method have the highest coverage (96.33% and 93.5% for the first and second case studies, respectively), whereas the aggregation method with SA produces a higher PICP value for the third case study than any other method. The PIs combined with GA cover at least 90% of the target values in all three case studies, and the results are more stable and consistent for this method. Although PICP_comb,WAGA is less than PICP_comb,WASA and PICP_comb,mean, WAGA constructs the highest quality PIs in terms of CWC, as CWC_comb,WAGA is the smallest (see Fig. 8). A high PICP with a large CWC value indicates that the PIs are too wide and accordingly less
informative. Note that the predefined CL is 90%; intervals that are too wide are not practically desirable, as they carry little information about the target values and their fluctuations.

Fig. 9. Coverage probability (PICP) of the PIs generated by the PI-NN_ranked,r models and by the ensemble methods. (a), (b), and (c) are for the first, second, and third case studies, respectively.

It is also important to check how much improvement the aggregation methods achieve in terms of CWC. The improvement I_{m,r} over the PI-NN_ranked,r model is defined as

$$I_{m,r} = \frac{CWC_{NNbest,r} - CWC_{comb}}{CWC_{NNbest,r}}. \tag{12}$$

Fig. 10. Improvement over the individual best 30 NN models in terms of PI quality using the proposed method.

Fig. 10 shows the improvement of the WAGA PIs over the PIs of the best individual NNs. The quality of the PIs improves by at least 13%, 5%, and 90% in 50% of the ensemble members for the first, second, and third case studies, respectively. The highest improvements are achieved for PI-NN_ranked,10 (80%), PI-NN_ranked,25 (75%), and PI-NN_ranked,8 (100%) in the three case studies. The proposed WAGA method also improves the PI quality by 3%–4% compared to the simple averaging method, as seen in Table II. Furthermore, Table II shows the improvement over the top-ranked PI-NN ensemble member (PI-NN_ranked,1) for all case studies: improvements of 59%, 54%, and 72% are achieved by the proposed WAGA ensemble method. Moreover, the average improvement (Im_NNbest) over the individual NN models (PI-NN_ranked) is 22%, 18%, and 78% for the first, second, and third case studies, respectively. These results indicate that the proposed ensemble method significantly improves the quality of PIs.

TABLE II. IMPROVEMENT OF PI QUALITY BY THE WAGA METHOD.

It is worth noting the computational load of the proposed method, which involves two types of computation.
1) Offline: developing the NN models and determining the optimal aggregation parameters.
2) Online: employing the NN models to generate intervals and then aggregating them using the predetermined parameters.
In practice, the offline computation is of little concern: PI-NN models can be developed offline using existing data. Once the models are developed, they can generate intervals in a few milliseconds, and the aggregation process can then be executed, again in a few milliseconds. The total elapsed time for the online computations of the proposed method is less than 3 s.

Combining the forecasts of individual PI-NN models is an effective way to improve PI quality in terms of both coverage probability and width, even with simple averaging. However, the weighted averaging aggregation method produced better quality PIs than the individual ensemble members and the simple averaging ensemble. The most significant improvement in PI quality using the proposed ensemble method is observed for the third case study. As seen in Fig. 9, 90% (27 of 30) of the PICP_NNbest,r values are lower than the nominal CL (90%), and the mean
of the CWC_NNbest,r values is greater than 200. The proposed WAGA method not only increases the coverage probability to 90% but also decreases the CWC value to 13.28 for this case, which is lower than that of any individual ensemble member.

VI. CONCLUSION

The quality of PIs constructed using NN models varies significantly from one experiment to another because of the unstable performance of the trained models. To remedy this problem, an optimal procedure for developing PI-based NN ensembles has been proposed in this paper. The main aim is to improve the quality of LUBE PIs in terms of their coverage probability and width. The LUBE method is used to develop the PI-based NNs, which are ranked based on their performance to form the PI-NN ensemble. The weighted average aggregation method is applied to combine the PIs, where two optimization methods, namely, SA and GA, are employed to optimally tune the weights by minimizing a PI-based cost function, CWC. The performance of the proposed method is examined on three different case studies: a nonlinear chemical reactor, the nonlinear time-delay Mackey–Glass equation, and a nonlinear plant. Simulation results demonstrate that aggregating PIs significantly improves their quality in terms of coverage probability (reliability) and width (informativeness). The average improvements based on CWC are 22%, 18%, and 78% for the first, second, and third case studies, respectively. The WAGA method also improves the PI quality by 3%–4% over the simple averaging aggregation method in all cases.

REFERENCES

[1] J. Mazumdar and R. Harley, "Recurrent neural networks trained with backpropagation through time algorithm to estimate nonlinear load harmonic currents," IEEE Trans. Ind. Electron., vol. 55, no. 9, pp. 3484–3491, Sep. 2008.
[2] M. Radac, R. Precup, E. Petriu, and S. Preitl, "Iterative data-driven tuning of controllers for nonlinear systems with constraints," IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6360–6368, Nov. 2014.
[3] Y. Lu, D. Li, Z. Xu, and Y. Xi, "Convergence analysis and digital implementation of a discrete-time neural network for model predictive control," IEEE Trans. Ind. Electron., vol. 61, no. 12, pp. 7035–7045, Dec. 2014.
[4] A. Khosravi, S. Nahavandi, D. Creighton, and A. F. Atiya, "Comprehensive review of neural network-based prediction intervals and new advances," IEEE Trans. Neural Netw., vol. 22, no. 9, pp. 1341–1356, Sep. 2011.
[5] A. Khosravi, S. Nahavandi, D. Srinivasan, and R. Khosravi, "Constructing optimal prediction intervals by using neural networks and bootstrap method," IEEE Trans. Neural Netw. Learn. Syst., to be published.
[6] A. Khosravi, S. Nahavandi, D. Creighton, and A. Atiya, "Lower upper bound estimation method for construction of neural network-based prediction intervals," IEEE Trans. Neural Netw., vol. 22, no. 3, pp. 337–346, Mar. 2011.
[7] A. Khosravi, S. Nahavandi, D. Creighton, and D. Srinivasan, "Optimizing the quality of bootstrap-based prediction intervals," in Proc. IEEE Int. Joint Conf. Neural Netw., 2011, pp. 3072–3078.
[8] J. H. Zhao, Z. Y. Dong, Z. Xu, and K. P. Wong, "A statistical approach for interval forecasting of the electricity price," IEEE Trans. Power Syst., vol. 23, no. 2, pp. 267–276, May 2008.
[9] H. Quan, D. Srinivasan, and A. Khosravi, “Uncertainty handling using neural network-based prediction intervals for electrical load forecasting,” Energy, vol. 73, pp. 916–925, Aug. 2014. [10] A. Khosravi, S. Nahavandi, and D. Creighton, “Construction of optimal prediction intervals for load forecasting problems,” IEEE Trans. Power Syst., vol. 25, no. 3, pp. 1496–1503, Aug. 2010. [11] M. A. Hosen, A. Khosravi, S. Nahavandi, and D. Creighton, “Prediction interval-based neural network modelling of polystyrene polymerization reactor: A new perspective of databased modelling,” Chem. Eng. Res. Des., vol. 92, no. 11, pp. 2041–2051, 2014. [12] C. Teljeur et al., “Using prediction intervals from random-effects metaanalyses in an economic model,” Int. J. Tech. Assess. Health Care, vol. 30, no. 1, pp. 44–49, Jan. 2014. [13] P. L. Espinheira, S. L. Ferrari, and F. Cribari-Neto, “Bootstrap prediction intervals in beta regressions,” Comput. Stat., vol. 29, pp. 1–15, 2014. [14] T. Heskes, “Practical confidence and prediction intervals,” Adv. Neural Inf. Process. Syst., pp. 176–182, 1997. [15] C. M. Bishop, Neural Networks for Pattern Recognition. New York, NY, USA: Oxford Univ. Press, 1995. [16] D. A. Nix and A. S. Weigend, “Estimating the mean and variance of the target probability distribution,” in Proc. IEEE Int. Conf. Neural Netw., 1994, vol. 1, pp. 55–60. [17] J. G. Hwang and A. A. Ding, “Prediction intervals for artificial neural networks,” J. Amer. Stat. Assoc., vol. 92, no. 438, pp. 748–757, Jun. 1997. [18] A. Khosravi and S. Nahavandi, “Combined nonparametric prediction intervals for wind power generation,” IEEE Trans. Sustain. Energy, vol. 4, no. 4, pp. 849–856, Oct. 2013. [19] S. Han, Y. Liu, and J. Yan, “Neural network ensemble method study for wind power prediction,” in Proc. APPEEC, 2011, pp. 1–4. [20] A. Chaouachi, R. M. Kamel, R. Andoulsi, and K. Nagasaka, “Multiobjective intelligent energy management for a microgrid,” IEEE Trans. Ind. Electron., vol. 60, no. 4, pp. 1688–1699, Apr. 2013. [21] L. Hansen and P. Salamon, “Neural network ensembles,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 10, pp. 993–1001, Oct. 1990. [22] Z. Zhihua and C. Shifu, “Neural network ensemble,” Chin. J. Comput., vol. 25, no. 1, pp. 1–8, 2002. [23] M. A. Hosen, A. Khosravi, D. Creighton, and S. Nahavandi, “Prediction interval-based modelling of polymerization reactor: A new modelling strategy for chemical reactors,” J. Taiwan Inst. Chem. Eng., vol. 45, no. 5, pp. 2246–2257, Sep. 2014. [24] Y. Shatnawi and M. Al-khassaweneh, “Fault diagnosis in internal combustion engines using extension neural network,” IEEE Trans. Ind. Electron., vol. 61, no. 3, pp. 1434–1443, Sep. 2014. [25] B. E. Hansen, “Least-squares forecast averaging,” J. Econometrics, vol. 146, no. 2, pp. 342–350, 2008. [26] S. Yang and A. Browne, “Neural network ensembles: Combining multiple models for enhanced performance using a multistage approach,” Expert Syst., vol. 21, no. 5, pp. 279–288, Nov. 2004. [27] J. W. Taylor and S. Majithia, “Using combined forecasts with changing weights for electricity demand profiling,” J. Oper. Res. Soc., vol. 51, no. 1, pp. 72–82, Jan. 2000. [28] S. Chen, W. Wang, G. Qu, and J. Lu, “Application of neural network ensembles to incident detection,” in Proc. IEEE Int. Conf. Integr. Technol., 2007, pp. 388–393. [29] J. Yang, X. Zeng, S. Zhong, and S. Wu, “Effective neural network ensemble approach for improving generalization performance,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 
878–887, Jun. 2013. [30] D. W. Opitz and J. W. Shavlik, “Actively searching for an effective neural network ensemble,” Connection Sci., vol. 8, no. 3/4, pp. 337–354, 1996. [31] J. Smith and K. F. Wallis, “A simple explanation of the forecast combination puzzle,” Oxford Bull. Econ. Stat., vol. 71, no. 3, pp. 331–355, 2009. [32] J. Hey, T. J. Teo, V. P. Bui, G. Yang, and R. Martinez-Botas, “Electromagnetic actuator design analysis using a two-stage optimization method with coarse-fine model output space mapping,” IEEE Trans. Ind. Electron., vol. 61, no. 10, pp. 5453–5464, Oct. 2014.
[33] F. Fernandes and L. Lona, "Neural network applications in polymerization processes," Braz. J. Chem. Eng., vol. 22, no. 3, pp. 401–418, Jul./Sep. 2005.
[34] X. Yao and M. Islam, "Evolving artificial neural network ensembles," IEEE Comput. Intell. Mag., vol. 3, no. 1, pp. 31–42, Jan. 2008.
[35] H. Song, S. F. Witt, K. F. Wong, and D. C. Wu, "An empirical study of forecast combination in tourism," J. Hosp. Tour. Res., vol. 33, no. 1, pp. 3–29, Feb. 2009.
[36] M. H. Pesaran and A. Pick, "Forecast combination across estimation windows," J. Bus. Econ. Stat., vol. 29, no. 2, pp. 307–318, 2011.
[37] R. Precup, R.-C. David, E. M. Petriu, S. Preitl, and M. Radac, "Fuzzy control systems with reduced parametric sensitivity based on simulated annealing," IEEE Trans. Ind. Electron., vol. 59, no. 8, pp. 3049–3061, Aug. 2012.
[38] H. Dawid, Adaptive Learning by Genetic Algorithms: Analytical Results and Applications to Economic Models. New York, NY, USA: Springer-Verlag, 1996.
[39] C. Silva and E. Biscaia, Jr., "Genetic algorithm development for multiobjective optimization of batch free-radical polymerization reactors," Comput. Chem. Eng., vol. 27, no. 8, pp. 1329–1344, Sep. 2003.
[40] M. A. Hosen and M. A. Hussain, "Optimization and control of polystyrene batch reactor using hybrid based model," Comput. Aided Chem. Eng., vol. 31, pp. 760–764, 2012.
[41] M. A. Hosen et al., "Performance analysis of three advanced controllers for polymerization batch reactor: An experimental investigation," Chem. Eng. Res. Des., vol. 92, no. 5, pp. 903–916, May 2014.
[42] M. A. Hosen, M. A. Hussain, and F. S. Mjalli, "Hybrid modelling and kinetic estimation for polystyrene batch reactor using artificial neutral network (ANN) approach," Asia-Pacific J. Chem. Eng., vol. 6, no. 2, pp. 274–287, Mar./Apr. 2011.
[43] Z. Zeybek, S. Çetinkaya, H. Hapoglu, and M. Alpbaz, "Generalized delta rule (GDR) algorithm with generalized predictive control (GPC) for optimum temperature tracking of batch polymerization," Chem. Eng. Sci., vol. 61, no. 20, pp. 6691–6700, 2006.
[44] M. A. Hosen, M. A. Hussain, and F. S. Mjalli, "Control of polystyrene batch reactors using neural network based model predictive control (NNMPC): An experimental investigation," Control Eng. Practice, vol. 19, no. 5, pp. 454–467, May 2011.
[45] A. Khosravi, S. Nahavandi, and R. Khosravi, "Evaluation and comparison of type reduction algorithms from a forecast accuracy perspective," in Proc. IEEE Int. Conf. Fuzzy Syst., 2013, pp. 1–7.
[46] C.-F. Juang and Y.-W. Tsao, "A self-evolving interval type-2 fuzzy neural network with online structure and parameter learning," IEEE Trans. Fuzzy Syst., vol. 16, no. 6, pp. 1411–1424, Dec. 2008.
Mohammad Anwar Hosen (S’13) received the B.Sc.(Hons.) degree in chemical engineering from the Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2004 and the M.Eng.Sc. degree in chemical engineering from the University of Malaya, Kuala Lumpur, Malaysia, in 2010. His specialization was artificial intelligence (AI)-based modeling and control. He is currently working toward the Ph.D. degree in the Centre for Intelligent Systems Research, Deakin University, Burwood, Australia. His current research interests include the development and application of AI techniques for the modeling and control of industrial nonlinear plants.
Abbas Khosravi (M’07) received the B.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2002, the M.Sc. degree in electrical engineering from Amirkabir University of Technology, Tehran, in 2005, and the Ph.D. degree from Deakin University, Burwood, Australia, in 2010. He was with the eXiT Group, University of Girona, Girona, Spain, from 2006 to 2007, conducting research in artificial intelligence. He is currently a Research Fellow with the Centre for Intelligent Systems Research, Deakin University. His current research interests include the theory and application of neural networks and fuzzy logic systems for the modeling, analysis, control, and optimization of operations within complex systems. Mr. Khosravi was a recipient of the Alfred Deakin Postdoctoral Research Fellowship in 2011.
Saeid Nahavandi (SM'07) received the Ph.D. degree from Durham University, Durham, U.K. He is an Alfred Deakin Professor, the Chair of Engineering, and the Director of the Centre for Intelligent Systems Research at Deakin University, Burwood, Australia. He has published over 450 papers in various international journals and conference proceedings. His research interests include the modeling of complex systems, robotics, and haptics. Prof. Nahavandi is a Co-Editor-in-Chief of the IEEE SYSTEMS JOURNAL and an Editor (South Pacific Region) of the International Journal of Intelligent Automation and Soft Computing. He is a Fellow of Engineers Australia (FIEAust) and the Institution of Engineering and Technology (FIET).
Douglas Creighton (M'10) received the B.Eng. (with honors) degree in systems engineering and the B.Sc. degree in physics from the Australian National University, Canberra, Australia, in 1997. He spent several years as a software consultant before receiving the Ph.D. degree in simulation-based optimization from Deakin University, Burwood, Australia, in 2004. He is currently a Research Academic and Stream Leader with the Centre for Intelligent Systems Research, Deakin University. He has been actively involved in modeling, discrete event simulation, intelligent agent technologies, human–machine interface and visualization, and simulation-based optimization research.