Partial Mutual Information Criterion For Modelling Time Series Via Neural Networks

I. Luna and S. Soares
DENSIS-FEEC-UNICAMP, 13083-970 Campinas, Brazil
{iluna,dino}@cose.fee.unicamp.br
R. Ballini
DTE-IE-UNICAMP, 13083-852 Campinas, Brazil
[email protected]
Abstract
In this paper, a strategy for modelling time series is proposed. The approach is based on the partial mutual information criterion, which is evaluated for selecting relevant inputs for a time series model. This criterion considers not only input-output relations but also the information already provided by previously selected inputs. The methodology is applied to identify a linear time series model and a non-linear model based on artificial neural networks, which is used for modelling Brazilian monthly streamflow series. Simulation results show the usefulness of the proposed method.

Keywords: Partial mutual information, input selection, time series, streamflow forecasting, FIR neural network.

1 Introduction

Real-world modelling problems and system identification involve a large number of potential inputs [16]. In the case of neural networks, inputs are usually determined using a priori knowledge or selected by trial and error [21]. Traditional linear methods, like the Bayesian Information Criterion (BIC) [12], or other criteria based on correlation analysis, are also applied to non-linear problems. The drawback of applying these methods to non-linear problems is that they are largely unable to capture non-linear dependence that may exist between inputs and outputs, resulting in the possible omission of important model inputs [14]. Furthermore, generalization performance on a fixed-size training set is closely related to the number of free parameters in the model [2]. One source of unnecessary weights is input variables that provide little or no information about the output to be learned, increasing model complexity and the number of local minima in the error surface.
In the present study, a methodology for selecting inputs for modelling time series is proposed. It is based on the partial mutual information criterion, originally proposed in [14]. Calculating the partial mutual information requires estimating marginal and joint probability density functions, as well as expectation operators. In [14], a normal distribution was assumed for the data in order to approximate these values. To avoid this assumption, this work approximates all probability density functions and conditional expected values with multivariate product kernel estimators [13], using kernel functions based on the city-block distance [6]. The advantage of this kernel function over the alternatives is that no assumption about the sample distribution is required; the city-block kernel lets the data speak for themselves. Moreover, partial mutual information is able to capture both linear and non-linear dependence between variables. The idea of approximating probability functions with kernel estimators and the city-block function was recently proposed in [4], where, differently from [14], expected values were estimated with a general regression neural network (GRNN) [17]. In this work, product kernel estimators are applied not only to approximate marginal and joint probability density functions but also to estimate conditional expected values, reducing the complexity of the approximations made in [14] and avoiding the GRNN parameter adjustment required at each iteration in [4].

The proposed method is first applied to a synthetic data set whose dependence attributes are known a priori. The methodology is then applied to determine suitable inputs for a one-step-ahead forecasting model of the monthly streamflow series of a Brazilian river. The results are compared with those obtained by a periodic autoregressive moving average (PARMA) model with inputs selected by the BIC criterion. The performance of the input selection method is assessed through numerical results.

The paper is organized as follows. Section 2 introduces the partial mutual information criterion and describes some modifications relative to the approaches proposed in [14] and [4]. Section 3 discusses results and applications. Finally, Section 4 presents the conclusions.
2 Input Selection for Time Series Models
Mutual information is one of the most fundamental information measures in information theory [7]. Moreover, it is currently applied in many other areas, such as pattern recognition, image processing and identification of non-parametric models [2], [20], [21]. This criterion is generally considered a measure of dependence between two variables. It can also be considered a measure of the information stored in one variable about another, or a measure of the degree of predictability of the output variable given knowledge of the input variable [21].
Given two discrete random variables X and Y, the mutual information between these two variables is defined as follows:

$$\mathrm{MI} = \frac{1}{N} \sum_{i=1}^{N} \log_e \frac{f_{X,Y}(x_i, y_i)}{f_X(x_i)\, f_Y(y_i)} \qquad (1)$$
where (x_i, y_i) is the i-th bivariate sample data pair, i = 1, ..., N; f_{X,Y}(x_i, y_i) is the joint probability density of x_i and y_i; and f_X(x_i) and f_Y(y_i) are the univariate probability densities estimated at each sample data point. If X and Y are independent, the joint probability density equals the product of the marginal densities, and the mutual information between them is zero. If the two variables are strongly dependent, the joint density is greater than the product of the marginals and the mutual information between them is large. As can be seen, mutual information is a good measure for selecting significant attributes when modelling the system under study. However, this criterion is not able to deal with redundant inputs. If a third variable Z is highly correlated with X (for example, Z = 3X), it will also have a high MI value, and both X and Z would be selected as significant inputs. In this case, Z would be a redundant variable, since it can be completely described by X. To overcome this problem, the partial mutual information criterion, an extension of mutual information, was proposed in [14]. It is able to capture all dependence between two variables and, being a model-free strategy [16], does not require a model structure to be defined a priori. The approach proposed in this paper is based on partial mutual information (PMI), which is presented below.
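As a quick illustration of the definition itself, the following is a minimal Python sketch that plugs kernel density estimates into Eq. (1). It uses scipy's gaussian_kde as a stand-in density estimator (the paper itself uses city-block kernels, introduced below); the function name and test data are ours.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mutual_information(x, y):
    """Sample estimate of Eq. (1): mean of log(f_xy / (f_x * f_y)) over the data."""
    xy = np.vstack([x, y])          # (2, N) array: the joint sample
    f_xy = gaussian_kde(xy)(xy)     # joint density at each (x_i, y_i)
    f_x = gaussian_kde(x)(x)        # marginal density at each x_i
    f_y = gaussian_kde(y)(y)        # marginal density at each y_i
    return float(np.mean(np.log(f_xy / (f_x * f_y))))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(mutual_information(x, 0.8 * x + 0.2 * rng.normal(size=500)))  # dependent: large MI
print(mutual_information(x, rng.normal(size=500)))                  # independent: near zero
```

For dependent pairs the score is clearly positive, while for independent pairs it hovers near zero, matching the discussion above.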
2.1 Partial Mutual Information Criterion
Partial mutual information [14] is a measure of the partial or additional dependence that a new input can add to an existing prediction model [3]. Given a dependent discrete variable Y (the output of the model) and an input X (the independent discrete variable), for a set of pre-existing inputs Z, the discrete version of the PMI criterion is defined as:
$$\mathrm{PMI} = \frac{1}{N} \sum_{i=1}^{N} \log_e \frac{f_{X',Y'}(x'_i, y'_i)}{f_{X'}(x'_i)\, f_{Y'}(y'_i)} \qquad (2)$$

where

$$x'_i = x_i - E(x_i \mid Z) \qquad (3)$$

and

$$y'_i = y_i - E(y_i \mid Z) \qquad (4)$$
where E(·) denotes the expectation operator; x'_i and y'_i represent the residual components corresponding to the i-th sample data pair, i = 1, ..., N; and f_{X'}(x'_i), f_{Y'}(y'_i) and f_{X',Y'}(x'_i, y'_i) are the respective marginal and joint probability densities. Through Eqs. (3) and (4), the resulting variables X' and Y' retain only the residual information in X and Y once the effect of the existing predictors in Z has been taken into account [14].

A good estimate of the expectation operator is necessary for calculating the PMI criterion. In [14], an approximation based on Gaussian kernel functions was used, assuming a normal distribution of the samples, while in [4] a general regression neural network [17] was proposed. In this work, the expected values are approximated via the Nadaraya-Watson estimator [13] with the city-block kernel function instead of Gaussian functions. Using the city-block kernel and the kernel estimators, it is not necessary to assume any particular form for the regression function, as considered in [1] or [14]. Moreover, the computational complexity diminishes, since it is not necessary to calculate the covariance matrix and its inverse, reducing the processing time required. Let r(x) = E(Y | X = x). An approximation of r(x) is defined by [13]:

$$\hat{r}(x) = \sum_{i=1}^{N} w_{\lambda_x}(x, x_i)\, y_i \qquad (5)$$

where

$$w_{\lambda_x}(x, x_i) = \frac{K_{\lambda_x}(x - x_i)}{\sum_{j=1}^{N} K_{\lambda_x}(x - x_j)} \qquad (6)$$

and K_{λ_x}(x − x_i) is the kernel function for x, which in this work is the city-block function. Many other kinds of kernel functions and their characteristics can be found in [15].
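As an illustration, the following is a small sketch of Eqs. (5)-(6) with the city-block kernel, assuming the samples are rows of an (N, p) matrix X; the kernel's normalizing constant cancels in the weights of Eq. (6), so it is omitted. Function names are ours.

```python
import numpy as np

def city_block_kernel(u, lam):
    # K(u) is proportional to exp(-||u||_1 / lambda); the constant cancels in Eq. (6)
    return np.exp(-np.abs(u).sum(axis=-1) / lam)

def nadaraya_watson(x, X, Y, lam):
    """Eq. (5): r_hat(x) = sum_i w_i(x) y_i, with the weights w_i(x) of Eq. (6).
    X is an (N, p) sample matrix, Y the (N,) responses, x a single (p,) query."""
    k = city_block_kernel(x - X, lam)   # kernel evaluated against every sample x_i
    return (k / k.sum()) @ Y            # weighted average of the responses
```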
There are different ways of approximating both marginal and joint probabilities. For example, [8] used frequency histograms as probability estimates. This work approximates marginal and joint probability functions via kernel estimators, an efficient and robust tool, as shown in [14] and [11]. Let [x(k), y(k)], k = 1, ..., N, be N pairs of sample data, with x(k) an input variable and y(k), without loss of generality, its corresponding scalar output. The classic multivariate probability density estimator, introduced in [6], is given by:

$$\hat{f}(x) = \frac{1}{N\lambda} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{\lambda}\right) = \frac{1}{N} \sum_{i=1}^{N} K_\lambda(x - x_i) \qquad (7)$$
where K_λ(t) is the kernel function and λ is the bandwidth parameter. The kernel function is required to be a valid probability density function [11]. As discussed in [4], using kernel functions based on the city-block distance, Eq. (7) may be rewritten as:

$$\hat{f}_x(x) = \frac{1}{N(2\lambda)^p} \sum_{i=1}^{N} \prod_{j=1}^{p} e^{-|x_j - x_{ij}|/\lambda} \qquad (8)$$

that is,

$$\hat{f}_x(x) = \frac{1}{N(2\lambda)^p} \sum_{i=1}^{N} \exp\!\left(-\frac{1}{\lambda} \sum_{j=1}^{p} |x_j - x_{ij}|\right) \qquad (9)$$

where p is the dimension of each x_i, i = 1, ..., N. The bandwidth parameter λ is calculated by:

$$\lambda = \left(\frac{4}{p+2}\right)^{1/(p+4)} N^{-1/(p+4)} \qquad (10)$$
Although Eq. (10) was derived assuming a Gaussian distribution [13], it is widely applied in the literature because of its efficiency and simplicity [4], and for that reason it is adopted in this work.
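For concreteness, a sketch of Eqs. (9) and (10) might look as follows; it assumes the data are arranged as an (N, p) sample matrix (and, since the rule of Eq. (10) implicitly assumes roughly unit-variance data, standardizing first is a common companion step). Names are ours.

```python
import numpy as np

def bandwidth(N, p):
    """Gaussian-reference bandwidth rule of Eq. (10)."""
    return (4.0 / (p + 2)) ** (1.0 / (p + 4)) * N ** (-1.0 / (p + 4))

def city_block_density(x, X, lam):
    """Eq. (9): product city-block kernel density estimate at a (p,) point x,
    given an (N, p) sample matrix X."""
    N, p = X.shape
    l1 = np.abs(x - X).sum(axis=1)      # city-block distance to every sample
    return np.exp(-l1 / lam).sum() / (N * (2.0 * lam) ** p)
```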
Finally, the joint probability density of x and y is estimated by [1]:

$$\hat{f}_{xy}(x, y) = \frac{1}{N} \sum_{i=1}^{N} K_{\lambda_x}\!\left(\frac{x - x_i}{\lambda_x}\right) K_{\lambda_y}\!\left(\frac{y - y_i}{\lambda_y}\right) \qquad (11)$$

where λ_y is the bandwidth parameter associated with the kernel function for y.

Once the probability functions and expectation operators are defined, a stopping condition is necessary. The partial mutual information criterion is applied to the data samples, via Eqs. (2)-(11), until the PMI value of a selected input falls below a confidence limit. The confidence limit is calculated assuming independence between the inputs and the output variable. This independence is enforced by generating p different arrangements of the independent variable, built by bootstrapping. The PMI value is calculated for each arrangement and a null-hypothesis test is constructed: if the PMI value of the selected input is greater than the PMI value at the γ-th percentile of the randomized samples, there is significant dependence between the input and the output variable, and the null hypothesis is rejected. In this work, we used p = 100 different arrangements of the independent variable and the percentile γ = 95%.

2.2 The Algorithm

The algorithm for selecting model inputs can be summarized as follows (a code sketch is given after the list):

1. Build an initial set of possible inputs for the model, called z*. Also define a set of selected inputs, denoted by z, which is empty at the beginning of the algorithm;
2. Evaluate the partial mutual information between each candidate input in z* and the dependent variable y, taking into account the inputs already selected in z;
3. Identify the variable with the highest PMI from the previous step;
4. Calculate the confidence limit for the selected input. If its PMI value is higher than the confidence limit, include it in z; otherwise, go to step 6;
5. Repeat steps 2-4 as many times as necessary;
6. End of the algorithm: the most relevant inputs have been selected.
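The sketch below strings the pieces together: residuals via the Nadaraya-Watson estimator (Eqs. (3)-(6)), city-block densities (Eqs. (9) and (11)), the PMI score (Eq. (2)), and the bootstrap 95th-percentile stopping rule. It is a self-contained simplification, not the authors' code: candidates are scalar lags stored as columns of X, the shuffle is applied to the candidate's residuals, and all names are hypothetical.

```python
import numpy as np

def bw(N, p):  # bandwidth rule of Eq. (10)
    return (4.0 / (p + 2)) ** (1.0 / (p + 4)) * N ** (-1.0 / (p + 4))

def cond_exp(Z, t, lam):
    """Nadaraya-Watson estimate of E(t | Z) at every sample row of Z (Eqs. 5-6)."""
    K = np.exp(-np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2) / lam)  # (N, N)
    return (K / K.sum(axis=1, keepdims=True)) @ t

def kde1(a, lam):  # marginal density of Eq. (9) with p = 1, at every sample point
    return np.exp(-np.abs(a[:, None] - a[None, :]) / lam).mean(axis=1) / (2.0 * lam)

def kde2(a, b, lam):  # joint density of Eq. (11), at every sample pair
    Ka = np.exp(-np.abs(a[:, None] - a[None, :]) / lam) / (2.0 * lam)
    Kb = np.exp(-np.abs(b[:, None] - b[None, :]) / lam) / (2.0 * lam)
    return (Ka * Kb).mean(axis=1)

def mi(a, b, lam):  # Eq. (2), applied to the residual series a and b
    return float(np.mean(np.log(kde2(a, b, lam) / (kde1(a, lam) * kde1(b, lam)))))

def select_inputs(X, y, n_boot=100, q=95, seed=0):
    """Greedy PMI selection (steps 1-6). Columns of X are the candidate inputs."""
    rng = np.random.default_rng(seed)
    N = len(y)
    selected, remaining = [], list(range(X.shape[1]))
    lam1 = bw(N, 1)
    while remaining:
        if selected:  # Eqs. (3)-(4): keep only what the chosen inputs z do not explain
            Z = X[:, selected]
            lam_z = bw(N, Z.shape[1])
            v = y - cond_exp(Z, y, lam_z)
            res = {j: X[:, j] - cond_exp(Z, X[:, j], lam_z) for j in remaining}
        else:
            v = y - y.mean()
            res = {j: X[:, j] - X[:, j].mean() for j in remaining}
        best = max(remaining, key=lambda j: mi(res[j], v, lam1))   # steps 2-3
        score = mi(res[best], v, lam1)
        # step 4: bootstrap null -- shuffling destroys any dependence on the output
        null = [mi(rng.permutation(res[best]), v, lam1) for _ in range(n_boot)]
        if score <= np.percentile(null, q):
            break                       # step 6: no remaining input is significant
        selected.append(best)           # step 5: loop with the enlarged set z
        remaining.remove(best)
    return selected
```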
3 Simulations and Applications
3.1 Synthetic Data
The described methodology was applied to identify a linear model with known attributes, also presented in [4] and [14]. This result is shown to ascertain the utility of the algorithm. The example is an auto-regressive model defined by:

$$x_t = 0.3 x_{t-1} - 0.6 x_{t-4} - 0.5 x_{t-9} + e_t \qquad (12)$$

where e_t is a random variable with normal distribution, zero mean and unit standard deviation. A total of 420 data points were generated for the auto-regressive model, and the first 20 were discarded so as to reduce the effects of an arbitrary initialization. The algorithm summarized in Section 2.2 was applied to the first 15 lags, considered as potential inputs for the model.

Table 1 shows the results obtained for this example. As can be seen, the algorithm is capable of selecting the correct attributes for the model. Lags 4, 9 and 1 were the attributes selected. Lag 4 obtained the highest PMI value, indicating that it is the most significant input for the model. This can be verified in Eq. (12), where the coefficient for lag 4 (-0.6) is larger in magnitude than the others (-0.5 for lag 9 and 0.3 for lag 1). Lag 5 was not selected because its PMI value was not greater than the corresponding PMI value of the 95th-percentile randomized sample; consequently, all remaining lags have lower PMI values, and the first three lags with the highest PMI values are enough to build the model. Note that, apart from selecting the right inputs for the model, the algorithm also provided important information about the system by ranking the inputs according to their relevance.

Table 1: Results of applying the PMI criterion to the auto-regressive model defined by Eq. (12).

Lag    PMI       Percentile (95%)
4      0.3146    0.2166
9      0.2174    0.1477
1      0.1310    0.1006
5      0.0413    0.0429
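A sketch for reproducing this experiment under the stated setup (420 samples, first 20 discarded, candidate matrix of the first 15 lags); variable names are ours. Feeding (lags, y) to the select_inputs sketch of Section 2.2 should then recover lags 4, 9 and 1 (columns 3, 8 and 0).

```python
import numpy as np

rng = np.random.default_rng(42)
n_keep, warmup = 400, 20
e = rng.normal(0.0, 1.0, size=n_keep + warmup)
x = np.zeros(n_keep + warmup)
for t in range(9, n_keep + warmup):
    # Eq. (12): x_t = 0.3 x_{t-1} - 0.6 x_{t-4} - 0.5 x_{t-9} + e_t
    x[t] = 0.3 * x[t - 1] - 0.6 * x[t - 4] - 0.5 * x[t - 9] + e[t]
x = x[warmup:]  # discard the first 20 samples to wash out the initialization

# Candidate matrix: columns are lags 1..15 of the series, aligned with the target y
lags = np.column_stack([x[15 - k : len(x) - k] for k in range(1, 16)])
y = x[15:]
```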
3.2 Monthly Streamflow Forecasting
Planning and operation of water resources and energy systems is a complex and difficult task, since it involves non-linear production characteristics and depends on numerous variables, one of them being streamflow. Most hydroelectric systems span geographically distinct regions, with hydrometric data collected through very sparse data acquisition networks. This results in considerable uncertainty in the available hydrological information. Furthermore, the inherently non-linear relationship between input and output flow challenges streamflow forecasters considerably. Another difficulty in streamflow forecasting is its non-stationary nature, due to wet and dry periods over the year [10].

Given samples of a streamflow time series v_{t−h}, h = 1, ..., the aim is to estimate the value of v_t using information from a set of past values of v_t from historical data; that is, we consider one-step-ahead prediction. In this section, we build two prediction models. The first is based on FIR (Finite Impulse Response) neural networks [18]. This architecture was chosen because of its ability to deal with non-linear problems and its temporal, dynamic and time-delay response characteristics. Its parameters are adjusted using the temporal backpropagation algorithm; for details, see [19]. The proposed input selection methodology was applied to this FIR model. The second prediction model is a periodic autoregressive moving average (PARMA) model, whose AR order (p) and MA order (q) were selected using the Bayesian Information Criterion.

Both models are applied to forecast monthly inflows of a hydroelectric plant, namely Furnas, situated in the Southeast region of Brazil. Hydrologic data covering the period from 1931 to 1990 were used to adjust the models; data from 1991 to 1998 were used to test the performance of the adjusted models. The inflows oscillate between minimum and maximum values following the seasonal variation over the twelve-month period. The seasonality of the monthly flows thus suggests the use of twelve different models.

First, the respective input determination procedure was applied to find an appropriate set of inputs for each model, using only the training set. The initial set of possible model inputs was composed of the first 15 lags of the dependent variable. Table 2 shows the inputs obtained for each FIR model and their corresponding PMI values, as well as the orders of the PARMA models. The second column lists the selected lags for the FIR models. For example, the FIR model that predicts the April inflow of a given year has lags 1 and 5 as inputs, corresponding to the March inflow of the same year (lag 1) and the December inflow of the previous year (lag 5). The respective PMI scores are shown in column 3. Notice that, for each FIR model, the PMI value diminishes rapidly after each input is selected (Table 2): the PMI value of the first selected input is higher than that of the second, and so on. The last lag listed for each model marks the end of the selection process: its PMI value was no greater than its confidence limit (column 4), so every input not yet considered has a lower PMI value and, therefore, is not significant for modelling the streamflow series.
Table 2: Input selection for the FIR and PARMA models. For each month, the lags before the semicolon were selected; the final lag is the first one whose PMI fell below its confidence limit, ending the selection. The percentile column is the 95th-percentile randomized-sample PMI for the FIR model.

Month  Inputs (lags)             FIR PMI scores                                            Percentile (95%)                                          PARMA (p,q)
Jan    1; 13                     0.321; 0.169                                              0.223; 0.186                                              (1,0)
Feb    1; 13                     0.321; 0.166                                              0.230; 0.168                                              (1,0)
Mar    1, 4, 14, 13; 15          0.314; 0.187; 0.130; 0.052; 0.011                         0.228; 0.179; 0.111; 0.034; 0.011                         (2,0)
Apr    1, 5; 7                   0.429; 0.186; 0.099                                       0.241; 0.161; 0.100                                       (1,1)
May    1, 3; 12                  0.469; 0.195; 0.068                                       0.232; 0.120; 0.087                                       (2,1)
Jun    1, 2; 14                  0.612; 0.095; 0.054                                       0.224; 0.066; 0.058                                       (1,0)
Jul    1, 13; 4                  0.591; 0.136; 0.053                                       0.231; 0.118; 0.063                                       (1,1)
Aug    1, 8, 14; 9               0.589; 0.111; 0.059; 0.020                                0.242; 0.099; 0.054; 0.022                                (1,0)
Sep    1; 13                     0.465; 0.129                                              0.230; 0.144                                              (4,0)
Oct    2, 10, 14, 9, 1, 8, 13; 4 0.295; 0.174; 0.120; 0.061; 0.014; 0.006; 0.004; 0.002    0.228; 0.159; 0.115; 0.045; 0.010; 0.005; 0.001; 0.004    (1,1)
Nov    1; 7                      0.359; 0.153                                              0.205; 0.157                                              (1,0)
Dec    1, 11; 15                 0.302; 0.179; 0.108                                       0.228; 0.167; 0.113                                       (1,1)

Applying the PMI criterion, October streamflow was the model with the most selected inputs. Examining the PMI scores of all the selected attributes, only lags 2, 10, 14 and 9 have PMI values that are high compared with the others, as can easily be observed in Figure 1. The last selected inputs contribute little information compared with lag 2 or lag 10; because of that, only the first four significant lags were considered for modelling the October streamflow. The last column of Table 2 shows the orders of the PARMA models, which were adjusted in [9]. Note that p varied from 1 to 4, whereas q was equal to 0 or 1. Using the characteristics described in Table 2, forecasts produced by the FIR model for one-step-ahead prediction are depicted in Figure 2.
As can be observed, prediction in wet months, represented by the top regions of the streamflow curve, is more difficult than in dry ones. Figure 2 illustrates this difficulty, since peak regions may not recur with similar magnitudes, even though the streamflow series is, in general, periodic. The prediction models are evaluated on the testing data using the RMSE (m³/s), MPE (%) and MAE (m³/s) criteria. Table 3 shows the global errors of the one-step-ahead forecasts of the two prediction models for the Furnas hydroelectric plant. In general, the FIR model produced better results than the PARMA model.

Table 3: Global one-step-ahead prediction errors.

Model    MPE (%)    RMSE (m³/s)    MAE (m³/s)
FIR      16.58      364.46         184.64
PARMA    28.12      430.12         279.70
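The paper does not spell out its error formulas; a plausible reading, used in the sketch below, takes MPE as the mean absolute percentage error and RMSE/MAE in their usual forms. Names are ours.

```python
import numpy as np

def global_errors(v_true, v_pred):
    """RMSE and MAE in m^3/s, MPE in %, over the test period.
    MPE is taken here as the mean absolute percentage error (assumption)."""
    err = v_true - v_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mpe = float(100.0 * np.mean(np.abs(err) / v_true))
    mae = float(np.mean(np.abs(err)))
    return rmse, mpe, mae
```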
Autocorrelation and cross-correlation checks were applied to each FIR model as another way of evaluating its performance [5]. The first 20 lags of the residual autocorrelations were calculated.
[Figure 1: PMI values obtained for the first 15 lags, considered as possible inputs for the October streamflow FIR model. Note that only lags 2, 10, 14 and 9 have a considerable PMI when compared with the other ones.]

[Figure 3: FIR prediction model. Residual analysis for December inflows of the Furnas reservoir: (a) estimated autocorrelation function, (b) estimated cross-correlation function.]
[Figure 2: Inflow one-step-ahead prediction for the Furnas reservoir from 1991 to 1998.]
Fig. 3 illustrates the autocorrelation and cross-correlation estimates for the trained FIR network corresponding to December inflows. Fig. 3(a) shows the autocorrelation estimates together with an upper bound of 1/√m for their standard errors, where m is the size of the training set. Fig. 3(b) shows the estimated cross-correlation function for the first 20 lags between the desired output and the estimated residuals, together with the same upper bound 1/√m; this bound is represented by the dotted line in both Fig. 3(a) and (b). Based on this analysis for all 12 adjusted models, there was no evidence of inadequate modelling.
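A sketch of these diagnostics under common conventions (standardized residuals, one sign convention for the cross-correlation); the 1/√m bound follows [5]. Names are ours.

```python
import numpy as np

def residual_checks(y, y_hat, n_lags=20):
    """Residual autocorrelations r_k and output/residual cross-correlations s_k
    for k = 1..n_lags, plus the 1/sqrt(m) standard-error bound."""
    e = y - y_hat
    e = (e - e.mean()) / e.std()        # standardized residuals
    d = (y - y.mean()) / y.std()        # standardized desired output
    m = len(e)
    r = np.array([np.mean(e[k:] * e[:m - k]) for k in range(1, n_lags + 1)])
    s = np.array([np.mean(d[k:] * e[:m - k]) for k in range(1, n_lags + 1)])
    return r, s, 1.0 / np.sqrt(m)       # values outside the bound flag misfit
```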
4 Conclusions
In this work, an input selection method for time series models has been established. The method is based on the partial mutual information criterion, which is capable of detecting both linear and non-linear relations between the inputs and the output of a model. The algorithm was first tested on a linear auto-regressive model with known attributes to verify its efficiency. The approach was then applied to a monthly streamflow series, where the inputs used to build the model were selected with the described method. The model obtained a low prediction error when compared with a PARMA model whose inputs were selected with the BIC criterion. In general terms, the numerical results showed good performance and the usefulness of the algorithm.

Acknowledgements

This work was partially supported by the Research Foundation of the State of São Paulo (FAPESP), the Research and Projects Financing (FINEP) and the National Council for Scientific and Technological Development (CNPq).
References

[1] S. Akaho. Conditionally independent component analysis for supervised feature extraction. Neurocomputing, 49:139-150, 2002.
[2] B. V. Bonnlander and A. S. Weigend. Selecting input variables using mutual information and nonparametric density estimation. In Proc. of the 1994 International Symposium on Artificial Neural Networks, pages 42-50, Tainan, Taiwan, 1994.
[3] G. J. Bowden. Forecasting Water Resources Variables Using Artificial Neural Networks. PhD thesis, University of Adelaide, Australia, February 2003.
[4] G. J. Bowden, H. R. Maier, and G. C. Dandy. Input determination for neural network models in water resources applications. Part 1: background and methodology. Journal of Hydrology, 301:75-92, 2005.
[5] G. E. P. Box and G. M. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day Inc., revised edition, 1976.
[6] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18:179-189, 1966.
[7] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[8] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2):1134-1140, 1986.
[9] M. Magalhães, R. Ballini, R. Gonçalves, and F. Gomide. Predictive fuzzy clustering model for natural streamflow forecasting. In Proceedings of the IEEE International Conference on Fuzzy Systems, pages 390-394, Budapest, Hungary, July 2004.
[10] H. Maier and G. Dandy. Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environmental Modelling & Software, 15:101-124, 2000.
[11] Y.-I. Moon, B. Rajagopalan, and U. Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318-2321, September 1995.
[12] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
[13] D. W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley & Sons Inc., 1992.
[14] A. Sharma. Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 1. A strategy for system predictor identification. Journal of Hydrology, 239:232-239, 2000.
[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[16] R. Sindelar and R. Babuska. Input selection for nonlinear regression models. IEEE Transactions on Fuzzy Systems, 12(5):688-696, October 2004.
[17] D. F. Specht. A general regression neural network. IEEE Transactions on Neural Networks, 2(6):568-576, November 1991.
[18] E. Wan. Temporal backpropagation for FIR neural networks. In International Joint Conference on Neural Networks, volume 1, pages 575-580, June 1990.
[19] A. S. Weigend and N. A. Gershenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. 1992.
[20] M. Zaffalon and M. Hutter. Robust feature selection by mutual information distributions. In 18th International Conference on Uncertainty in Artificial Intelligence, pages 577-584, 2002.
[21] G. L. Zheng and S. A. Billings. Radial basis function network configuration using mutual information and the orthogonal least squares algorithm. Neural Networks, 9(9):1619-1637, 1995.