Robust Estimator for the Learning Process in Neural Networks Applied in Time Series

Héctor Allende 1,4, Claudio Moraga 2,3, and Rodrigo Salas 1

1 Universidad Técnica Federico Santa María, Dept. de Informática, Casilla 110-V, Valparaíso, Chile; {hallende,rsalas}@inf.utfsm.cl
2 Technical University of Madrid, Dept. of Artificial Intelligence, E-28660 Boadilla del Monte, Madrid, Spain
3 University of Dortmund, Department of Computer Science, D-44221 Dortmund, Germany; [email protected]
4 Universidad Adolfo Ibáñez, Facultad de Ciencia y Tecnología
Abstract. Artificial Neural Networks (ANN) have been used to model non-linear time series as an alternative to the ARIMA models. In this paper Feedforward Neural Networks (FANN) are used as non-linear autoregressive (NAR) models. NAR models are shown to lack robustness to innovative and additive outliers; a single outlier can ruin an entire neural network fit. Neural networks are shown to model well in regions far from the outliers, in contrast to linear models, where the entire fit is ruined. We propose a learning algorithm for NAR models that is robust to innovative and additive outliers. The algorithm is based on generalized maximum likelihood (GM) type estimators, which show advantages over conventional least squares methods. The sensitivity to outliers is demonstrated on a synthetic data set.

Keywords: Feedforward ANN; Nonlinear Time Series; Robust Learning.
1 Introduction
FANN are very good universal approximators of functions, first used in the field of engineering. Typical applications involve the analysis of spatio-temporal data, called time series analysis in the field of statistics. Recently, ANN have had a great impact on research in nonlinear time series. Researchers in neurocomputing use feedforward networks, e.g. [6], [11], to predict future values of a time series from knowledge of its past alone. Since ANN are universal approximators of unknown functions (see [12]), no assumptions need to be hypothesized about the data set; that is, no prior model is imposed on the unknown function. This characteristic is quite
This work was supported in part by Research Grant Fondecyt 1010101, in part by Research Grant BMBF CHL-99/023 (Germany) and in part by Research Grant DGIP-UTFSM. Work of C. Moraga was supported by Grant SAB2000-0048 of the Ministry of Education, Culture and Sport (Spain) and the Social Fund of the European Union.
different from classical time series analysis, where the series is assumed to be represented by a linear and stationary fractional ARIMA model. The goal of this paper is to propose a neurocomputing technique that performs robust forecasting for non-linear time series.
2 Problem Formulation
Time series analysis is often perturbed by occasional unpredictable events that generate aberrant observations. The analyst has to rely on the data to detect which points in time are outliers and to estimate the appropriate corrective actions. The case of unknown location and type of the outlying observations has been considered extensively in the literature on outlier isolation. Fox [8] introduced the additive and innovative types of outliers. The impact of outliers on parameter estimates has been studied in [2], and their impact on forecasting in [4]. Theory and practice are mostly concerned with linear methods, such as ARIMA models (see [3]). However, many series exhibit features which cannot be explained in a linear framework, and more recently there has been increasing interest in non-linear models. Many types of non-linear models have been proposed in the literature, for example bilinear models [7] and non-linear ARMA (NARMA) models [5]. There is a large statistical literature on the topic of robustness toward outliers. A robust statistical method is one whose results are not unduly affected by outliers. [2], [7] and [6] show that least squares (LS) methods are quite non-robust. Furthermore, the LS method lacks robustness not only for classical statistical linear models, but also for non-linear time series neural network predictor models. Under the LS procedure, ANN modelling shows that outliers have a local and a semi-global impact (see [5]). In this work we focus on robust neural network modelling of non-linear time series which contain outliers. We present some results on the lack of robustness of LS fitting of ANN models for time series. Section 6 introduces a new robust learning process for fitting feedforward predictive models for time series. The proposed method uses a robust algorithm that limits the influence of gross outliers upon the learning process (parameter estimation). Synthetic data studies are considered.
3 Time Series Processing
In formal terms, a time series is a sequence of vectors depending on time $t$: $x_t$, $t = 0, 1, \ldots$; theoretically, $x_t$ can be seen as a function of the time variable $t$. For practical purposes, however, time is usually viewed in terms of discrete time steps, leading to an instance of $x$ at the end point of every (usually fixed-size) time interval. The problem of forecasting is stated as follows: find a function $h : \mathbb{R}^{d \times n + l} \rightarrow \mathbb{R}^{d}$, where $d$ is the dimension of the sampling space, $n$ is the size of the sample and $l$ gives the number of exogenous variables.
The aim is to obtain an estimate $\hat{x}(t+k)$ of the vector $x$ at time $t + k$, given the values of $x$ up to time $t$, plus a number of additional time-independent variables (exogenous features) $\pi_i$:

$$x_{t+k} = h(x_t, x_{t-1}, \ldots, \pi_1, \ldots, \pi_l)$$

where $k$ is called the lag for prediction. Typically $k = 1$, meaning that the next vector is to be estimated, but $k$ can take any value larger than 1 as well. For the sake of simplicity, we will neglect the additional variables $\pi_i$ throughout this paper. Viewed in this way, forecasting becomes a problem of function approximation, where the chosen method has to approximate the continuous-valued function $h$ as closely as possible. In this sense it can be compared to function approximation or regression problems involving static data vectors, and many methods from this domain can be applied here as well (see [1] and [5]). This observation will turn out to be important when discussing the use of ANN for forecasting. Usually the evaluation of forecasting performance is done by computing an error measure $E$ over a number of time series elements, such as a validation or test set:

$$E = \sum_{i=0}^{N} e(\hat{x}(t-i), x(t-i))$$

where $e$ is a function measuring a single error between the estimated forecast and the actual sequence element. Typically a distance measure is used here, but depending on the problem other functions can be used.
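To make the evaluation concrete, the following is a minimal Python/numpy sketch of the error measure $E$; the predictor `h_hat` and the default choice of squared error for $e$ are illustrative assumptions, not part of the text above.

```python
import numpy as np

def forecast_error(h_hat, series, q, e=lambda a, b: (a - b) ** 2):
    """Accumulate E = sum_i e(x_hat(t-i), x(t-i)) over a test series.

    h_hat  : callable mapping the q most recent values to a one-step
             forecast (hypothetical; any predictor of that form fits)
    series : 1-d array holding the validation or test portion
    q      : number of lagged values the predictor consumes
    e      : single-step error function (squared error by default)
    """
    errors = [e(h_hat(series[t - q:t]), series[t])
              for t in range(q, len(series))]
    return np.sum(errors)
```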
4 Neural Networks
ANN are computational models consisting of elementary processing elements, called neurons, that are connected to each other by weighted links. The elementary processor is a rough mathematical representation of the biological neuron: it receives an excitation signal as input and processes it by applying a transfer function to obtain an output signal. Depending on how the neurons are connected to each other, we obtain different types of ANN architectures or topologies. The neurons are organized in three kinds of layers: input (or sensory), hidden and output layers. ANN have received great attention because they are able to learn from data and then generalize when facing new, unknown data. Suppose we have a sample $D_n = \{x_1, \ldots, x_t, \ldots, x_n\}$ belonging to some sample space $\chi \subset \mathbb{R}^d$ and generated by an unknown function $h(x)$ with the addition of a stochastic component $\varepsilon$. The task of "neural learning" is to construct an estimator $g_{ANN}(\mathbf{x}_t, w, D_n) \equiv \hat{h}(\mathbf{x}_t)$ of $h(\mathbf{x}_t)$, where $w = (w_1, \ldots, w_p)^T$ is a set of free parameters (known as "connection weights") obtained from the parameter space $\Theta \subset \mathbb{R}^p$, $\mathbf{x}_t = (x_{t-1}, \ldots, x_{t-q})$, and $D_n$ is a finite set of observations. Since no a priori assumptions are made regarding the functional form of $h(\mathbf{x}_t)$, the neural model $g_{ANN}(x_{t-1}, \ldots, x_{t-q}, w)$ is a non-parametric estimator of the conditional expectation $E[x_t \mid x_{t-1}, \ldots, x_{t-q}]$.
5 ANN for Time Series
In this paper we deal with a non-linear autoregressive (NAR) time series model. The central problem is to construct a function $h : \mathbb{R}^q \rightarrow \mathbb{R}$ in a dynamical
system of the form $x_t = h(x_{t-1}, x_{t-2}, \ldots, x_{t-q}) + \varepsilon_t$, where $h$ is an unknown smooth function and $\varepsilon_t$ denotes noise. A FANN provides a nonlinear approximation $\hat{h}$ to $h$ given by

$$\hat{x}_t = \hat{h}(x_{t-1}, x_{t-2}, \ldots, x_{t-q}) = \sum_{j=1}^{\lambda} w_j^{[2]} \, \gamma_1\!\left(\sum_{i=1}^{p} w_{ij}^{[1]} x_{t-i} + w_{p+1,j}^{[1]}\right) \qquad (1)$$
where the function $\gamma_1(\cdot)$ is a smooth, bounded, monotonic function. The estimated parameter $\hat{w}$ is obtained by iteratively minimizing a cost functional $L_n(w)$, i.e., $\hat{w} = \arg\min\{L_n(w) : w \in \Theta\}$, $\Theta \subset \mathbb{R}^p$, where $L_n(w)$ is for example the ordinary least squares function, i.e.,

$$L_n(w) = \frac{1}{2n} \sum_{i=q+1}^{n} \left(x_i - g(\mathbf{x}_{i-1}, w)\right)^2 \qquad (2)$$

where $\mathbf{x}_{i-1} = (x_{i-1}, \ldots, x_{i-q})$.
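As a concrete reading of equations (1) and (2), the following Python/numpy sketch implements the single-hidden-layer forward pass and the least squares cost. Taking $\gamma_1 = \tanh$ and identifying the number of lagged inputs with $q$ are our assumptions; the text only requires a smooth, bounded, monotonic $\gamma_1$.

```python
import numpy as np

def fann_forward(x_lags, W1, b1, w2):
    """Eq. (1): sum_j w_j^[2] * gamma_1(sum_i w_ij^[1] x_{t-i} + w_{p+1,j}^[1]).

    x_lags : (q,) array of lagged inputs (x_{t-1}, ..., x_{t-q})
    W1     : (lambda, q) first-layer weights w_ij^[1]
    b1     : (lambda,)   first-layer biases  w_{p+1,j}^[1]
    w2     : (lambda,)   output weights      w_j^[2]
    """
    return w2 @ np.tanh(W1 @ x_lags + b1)  # gamma_1 = tanh (assumed)

def ls_loss(series, q, W1, b1, w2):
    """Eq. (2): L_n(w) = (1/2n) * sum_{i=q+1}^{n} (x_i - g(x_{i-1}, w))^2."""
    n = len(series)
    residuals = np.array([
        series[i] - fann_forward(series[i - q:i][::-1], W1, b1, w2)
        for i in range(q, n)])  # series[i-q:i][::-1] = (x_{i-1}, ..., x_{i-q})
    return residuals @ residuals / (2 * n)
```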
6 Robust Neural Network Training
In this section we develop a robust method for fitting feedforward neural networks to NAR type models. The robust procedure we use is an adaptation to the neural network setting of a procedure known to be highly robust for fitting linear AR and ARMA models in the presence of outliers. The loss function given in (2) is very sensitive to outliers: the error produced by an outlier takes a large value compared with the errors of the remaining observations and thus dominates the training algorithm. We propose a robust method for fitting NAR models to time series with innovative or additive outliers, as proposed in [2] for AR and ARMA models. The method replaces the traditional measure of the fit of the ANN to the data with a robust one. The M estimator of the parameter $\hat{w}$ is obtained by iteratively minimizing the cost functional

$$RL_n(w) = \frac{1}{n} \sum_{t=q+1}^{n} \rho\!\left(\frac{r_t}{s_t}\right),$$

i.e., $\hat{w} = \arg\min\{RL_n(w) : w \in W \subseteq \mathbb{R}^p\}$, where $\rho$ is a robustifying function that bounds the influence of the outliers on the loss function, and $s_t$ is a data-dependent robust scale estimate whose objective is to make the parameter estimates invariant to scale transformations. Alternatively, the estimated parameter can be obtained by solving the first order equation

$$\frac{1}{n} \sum_{t=q+1}^{n} \psi(x, r_t/s_t)\, D_t(w) = 0 \qquad (3)$$

where $D_t(w) = \left(\frac{\partial}{\partial w_1} g_{ANN}(\mathbf{x}_{t-1}, w), \ldots, \frac{\partial}{\partial w_p} g_{ANN}(\mathbf{x}_{t-1}, w)\right)$, $r_t = x_t - g_{ANN}(x_{t-1}, \ldots, x_{t-q}, w)$ and $\psi(x, r) := \partial \rho(r)/\partial r$.
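The sketch below turns the M-estimation objective $RL_n(w)$ into code, using Huber's $\rho$ and the MAD-based scale estimate quoted later in this section. The tuning constant $c = 1.345$ (the standard 95%-efficiency choice) and the use of a single global scale $s$ for all residuals are our assumptions.

```python
import numpy as np

def huber_rho(r, c=1.345):
    """Huber's rho: quadratic near zero, linear in the tails, so one
    gross residual contributes O(|r|) to the loss rather than O(r^2)."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)

def robust_scale(r):
    """s = 1.483 * med|r - med(r)|, the scale estimate of this section."""
    return 1.483 * np.median(np.abs(r - np.median(r)))

def robust_loss(series, q, predict):
    """RL_n(w) = (1/n) * sum_{t=q+1}^{n} rho(r_t / s), given a predictor."""
    n = len(series)
    r = np.array([series[t] - predict(series[t - q:t][::-1])
                  for t in range(q, n)])
    s = robust_scale(r)
    return huber_rho(r / s).sum() / n
```

Minimizing this functional with any gradient-based optimizer corresponds to the first order condition (3), since $\psi = \partial \rho / \partial r$.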
A general class of robust estimators, known as GM estimators, which are an extension of the M estimators, is obtained by assigning a weight function to the observations. The GM estimate of $\hat{w}$ is obtained by solving the equation

$$\frac{1}{n} \sum_{t=q+1}^{n} \eta(x, r_t/s_t)\, D_t(w) = 0 \qquad (4)$$
The conditions that $\eta(\cdot, \cdot) : \mathbb{R}^p \times \mathbb{R} \rightarrow \mathbb{R}$ must satisfy to have nice asymptotic properties can be found in [9]. There are different classes of GM estimators, for example Mallows' GM estimators, given by

$$\eta(x, r, c) = \nu_x(x)\, \psi(r, c) \qquad (5)$$

where $\nu_x(x)$ is a weight function, $\nu_x : \mathbb{R}^p \rightarrow [0, 1]$. A popular choice of the scale parameter is $s_t = 1.483\, \mathrm{med}\left[\,|r_t - \mathrm{med}[r_t]|\,\right]$. When $c$ is kept fixed, the learning process of the ANN, i.e., the estimation of the model parameters, has some problems: a good initial model of the parameters is needed to find the final parameters, and the efficiency of the algorithm is reduced. We propose a dynamic GM estimator by letting the learning process change the parameter $c$. The objective of changing $c$ is to start from a GM estimator close to the LS estimator, so that a carefully initialized ANN is no longer needed, and the efficiency is improved in the early stage of the algorithm, accelerating convergence. There are several ways of changing the parameter $c$ through time; some proposals can be found in [10], which surveys stochastic techniques for global optimization.
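A minimal sketch of the dynamic Mallows GM ingredients follows. The particular leverage weight $\nu$ (a Huber-type weight on the norm of the lagged inputs) and the linear schedule for $c$ are illustrative choices of ours; the text only requires $\nu_x : \mathbb{R}^p \rightarrow [0, 1]$ and some rule, as in [10], for shrinking $c$ over time.

```python
import numpy as np

def huber_psi(r, c):
    """psi(r, c) = d rho / d r for Huber's function."""
    return np.clip(r, -c, c)

def mallows_eta(x_lags, r, c, x_scale=1.0):
    """Eq. (5): eta(x, r, c) = nu(x) * psi(r, c), downweighting points
    whose lagged inputs are extreme (one admissible choice of nu)."""
    norm = np.linalg.norm(x_lags) / x_scale
    nu = min(1.0, 1.0 / norm) if norm > 0 else 1.0  # nu maps into [0, 1]
    return nu * huber_psi(r, c)

def anneal_c(c0, c_final, epoch, n_epochs):
    """Start with a large c (GM estimator close to LS, so no careful
    initialization is needed), then shrink toward the robust value."""
    frac = epoch / max(1, n_epochs - 1)
    return c0 + frac * (c_final - c0)
```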
7 Simulation Results Using Synthetic Data
To illustrate the ideas presented in Section 6, a NAR(1) process with outliers is considered:

$$x_t = h(x_{t-1}) + a_t \qquad (6)$$
where $h(x_{t-1}) = 5 x_{t-1} \exp[-x_{t-1}^2/4]$. For the innovative outlier case, $a_t$ is formed from a white Gaussian noise process and an outlier generating process, so $a_t$ has distribution $F_a = (1-\alpha)N(0, \sigma_a^2) + \alpha N(0, \sigma_i^2)$. For the additive outlier case, $a_t$ is a Gaussian process with zero mean and variance $\sigma_a^2$, and the observed data are obtained as $z_t = x_t + u_t v_t$, where $v_t$ is a zero-one process with $P[v_t = 1] = \alpha$ and $u_t$ has distribution $F_u = N(0, \sigma_u^2)$, with $0 < \alpha < 1$. A GM estimator was used with Huber's $\psi$ function,

$$\psi_H(r) = \begin{cases} r & |r| \le c \\ c\, \mathrm{sign}(r) & |r| > c \end{cases} \qquad (7)$$

and with Tukey's bisquare $\psi$ function,

$$\psi_B(r) = \begin{cases} r\left[1 - (r/c)^2\right]^2 & |r| \le c \\ 0 & |r| > c \end{cases} \qquad (8)$$

labelled GMH and GMB respectively. The results obtained are shown in Table 1. Note that the apparently weaker performance of the GM estimators against the LSE estimator in the training process for the outlier cases is due to their not overfitting the training data, which yields better prediction performance.

Table 1. MSE results for the robust learning process of the ANN in one-step prediction $x_{t+1}$ for different types of outliers

            (1) No outlier  (2) Innovative  (3) Additive        (4) Additive
            α = 0           α = 0.05        α = 0.05, σu² = 1   α = 0.1, σu² = 10
Estimates   Train   Pred    Train   Pred    Train   Pred        Train   Pred
LSE         0.1826  0.1970  0.1858  0.2094  0.4327  0.4107      0.3830  0.2674
GMB         0.1253  0.1303  0.1103  0.1253  0.3713  0.3391      0.3169  0.1986
GMH         0.1609  0.1706  0.1463  0.1707  0.4037  0.3737      0.3517  0.2299
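For reproducibility, here is one way to generate the synthetic series of equation (6) in Python; the innovative-outlier variance $\sigma_i^2$ and the random seed are unspecified in the text, so the defaults below are placeholders.

```python
import numpy as np

def generate_nar1(n, alpha=0.05, sigma_a=1.0, sigma_i=3.0,
                  sigma_u=1.0, outlier="innovative", seed=0):
    """Simulate eq. (6): x_t = 5 x_{t-1} exp(-x_{t-1}^2 / 4) + a_t.

    innovative: a_t ~ (1 - alpha) N(0, sigma_a^2) + alpha N(0, sigma_i^2)
    additive  : clean a_t ~ N(0, sigma_a^2); observed z_t = x_t + v_t u_t
                with v_t zero-one, P[v_t = 1] = alpha, u_t ~ N(0, sigma_u^2)
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        if outlier == "innovative" and rng.random() < alpha:
            a_t = rng.normal(0.0, sigma_i)   # outlier-generating component
        else:
            a_t = rng.normal(0.0, sigma_a)   # ordinary Gaussian noise
        x[t] = 5 * x[t - 1] * np.exp(-x[t - 1] ** 2 / 4) + a_t
    if outlier == "additive":
        v = rng.random(n) < alpha            # zero-one contamination process
        x = x + v * rng.normal(0.0, sigma_u, size=n)
    return x
```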
8 Concluding Remarks
ANN fitted by the classical least squares estimator were shown to be sensitive to innovative and additive outliers: the estimate is badly affected by the outliers and thus provides very poor predictions for some values. Our study shows that, in the presence of a single outlier, an ANN tends to give a good predictor as long as the prediction is not made near the outlier; very bad predictions, however, can occur in localized regions. After observing the results on the synthetic time series, we conclude that the robust learning procedure based on the GM estimator performs better than the LS estimator. Moreover, the robust learning algorithm based on a redescending function (bisquare) performs better than the estimator based on Huber's function.
References

1. Allende H., Moraga C.: Time Series Forecasting with Neural Networks. Forschungsbericht No. 727/2000, Universität Dortmund, Fachbereich Informatik (2000)
2. Allende H., Heiler S.: Recursive Generalized M-Estimates for Autoregressive Moving Average Models. Journal of Time Series Analysis 13 (1992) 1–18
3. Box G.E., Jenkins G.M., Reinsel G.C.: Time Series Analysis, Forecasting and Control. Prentice Hall (1994)
4. Chen C., Liu L.M.: Forecasting time series with outliers. Journal of Forecasting 12 (1993) 13–35
5. Connor J.T., Martin R.D.: Recurrent Neural Networks and Robust Time Series Prediction. IEEE Transactions on Neural Networks 5(2) (1994) 240–253
6. Connor J.T.: A Robust Neural Network Filter for Electricity Demand Prediction. Journal of Forecasting 15 (1996) 437–458
7. Gabr M.M.: Robust estimation of bilinear time series models. Communications in Statistics: Theory and Methods 27(1) (1998) 41–53
8. Fox A.J.: Outliers in time series. Journal of the Royal Statistical Society, Series B 34 (1972) 350–363
9. Hampel F.R., Ronchetti E.M., Rousseeuw P.J., Stahel W.A.: Robust Statistics. Wiley Series in Probability and Mathematical Statistics (1986)
10. Schoen F.: Stochastic techniques for global optimization: a survey of recent advances. Journal of Global Optimization 1(3) (1991) 207–228
11. Weigend A., Gershenfeld N.: Time Series Prediction: Forecasting the Future and Understanding the Past. Proceedings of the NATO Advanced Research Workshop on Comparative Time Series (1993)
12. White H.: Artificial Neural Networks: Approximation and Learning Theory. Basil Blackwell, Oxford (1992)