A UNIFYING VIEW OF STOCHASTIC APPROXIMATION, KALMAN FILTER AND BACKPROPAGATION

Enrico Capobianco
University of Padua, Statistics Department

Abstract
In this paper the relationships between the Stochastic Approximation, the Kalman Filter and the Backpropagation algorithms are investigated. We show that when the Neural Network architecture at hand can be formalized so that the problem to be solved is the approximation of the optimum of a nonlinear objective function, then both Stochastic Approximation techniques and appropriate Kalman Filters can be employed to reach the goal; the latter, moreover, can handle various structural characteristics of the stochastic processes involved and suggest a more efficient two-step estimator.
1 Introduction

Recent developments of neural networks have prompted comparisons with statistical and control-systems approaches, in order to verify the potential gains from using neural nets for statistical parameter estimation, nonlinear dynamic system modelling and identification, time series prediction, optimal and suboptimal filtering, etc. Various algorithms have been proposed and tested on real or simulated data, and they showed interesting results. Other studies have considered the opportunity of unifying the theoretical concepts behind the construction of the practical algorithms; this paper belongs to this second category and is devoted to synthesizing the most important learning procedure, i.e. the backpropagation algorithm, with stochastic approximation techniques and Kalman Filter algorithms. The paper proceeds as follows. Section 2 describes how the backpropagation algorithm is easily embedded in the stochastic approximation framework. Section 3 introduces the extended and iterated versions of the Kalman Filter and shows their relationship with the Gauss-Newton method. Section 4 points out the basic advantages that can derive from working with a state space representation of the neural network and introduces a two-step parameter estimation procedure. Section 5 concludes.

* This paper was prepared at Stanford University while the author was a visiting research scholar with the PDP research group.
2 Backpropagation and Stochastic Approximation algorithms

Following [8], suppose we have a nonlinear objective function f(X_t; θ), where f : R^k → R, X_t is a k×1 random input vector and θ ∈ R^p represents the vector of unknown parameters. We want to use this function for forecasting the random variable y_t, and to do so we adopt the following single hidden layer feedforward network structure for f(X_t; θ):

f(X_t; θ) = x̃_t'α + β_0 + Σ_{j=1}^{r} β_j F(x̃_t'γ_j)    (1)

where x̃ = (1, x')', θ = (α', β', γ')', β = (β_0, …, β_r)', γ = (γ_1', …, γ_r')', r ∈ N and F : R → R is a bounded and continuously differentiable function. Consider that f(X_t; θ) is an approximation of the objective function g(X_t) = E(y_t | X_t). In this nonlinear least squares set-up we seek a solution to min_θ E([y_t − f(X_t; θ)]²), or equivalently to E(∇_θ f(X_t; θ)[y_t − f(X_t; θ)]) = 0, with ∇_θ representing the gradient vector calculated w.r.t. θ. A very straightforward way to solve this problem is to employ the Robbins-Monro (RM) Stochastic Approximation algorithm [11], whose structure is given by:

θ̂_{t+1} = θ̂_t + η_t ∇_θ f(X_t; θ̂_t)[y_t − f(X_t; θ̂_t)]    (2)

This recursion is that of a Stochastic Gradient method and therefore generalizes the Backpropagation (BP) algorithm [12] for neural network learning by allowing for a time-varying learning rate η_t. In [8] some modifications to the RM algorithm are presented in order to speed up the convergence rate; they include a Gauss-Newton step at each updating stage, giving a Modified Robbins-Monro algorithm:

θ̂_{t+1} = θ̂_t + η_t L̂_{t+1}^{-1} ∇_θ f(X_t; θ̂_t)[y_t − f(X_t; θ̂_t)]    (3)

L̂_{t+1} = L̂_t + η_t [∇_θ f(X_t; θ̂_t) ∇_θ f(X_t; θ̂_t)' − L̂_t]    (4)

where now λ = ((vec L)', θ')' is the new augmented parameter vector. As a result of the above modification, and of other technical devices employed in order to avoid numerical problems and deal with the computational burden involved, the new algorithm, and therefore the corresponding generalized
backpropagation scheme, are able to perform with a better convergence rate than the simple RM scheme, also when moderate dependence is present in the data. Of course the approximation to E(y_t | X_t) is only locally optimal, but it is nevertheless important to relax the usually retained "i.i.d." assumption about the stochastic process generating the data. The final important contribution given in [8] is the proof of consistency and asymptotic normality of the designed estimator, under appropriate conditions on the learning rate (e.g. η_t = (t + 1)^{-1}).
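As a concrete illustration, recursion (2) can be sketched in a few lines. Everything below (the one-hidden-unit architecture, the simulated target, the step-size exponent 0.6) is our own illustrative choice, not taken from [8]:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta):
    # f(x; theta) = a0 + a1*x + b * F(g0 + g1*x), with F = tanh
    a0, a1, b, g0, g1 = theta
    return a0 + a1 * x + b * np.tanh(g0 + g1 * x)

def grad_f(x, theta):
    # gradient of f w.r.t. theta, at a scalar input x
    a0, a1, b, g0, g1 = theta
    h = np.tanh(g0 + g1 * x)
    dh = 1.0 - h ** 2
    return np.array([1.0, x, h, b * dh, b * dh * x])

def g(x):
    # "true" conditional mean E(y|x); exactly representable by the network
    return 0.5 + 0.3 * x + 0.8 * np.tanh(1.0 - 2.0 * x)

theta = 0.5 * rng.normal(size=5)       # small random start (breaks symmetry)
for t in range(20000):
    x = rng.uniform(-2.0, 2.0)
    y = g(x) + 0.1 * rng.normal()
    eta = (t + 1) ** -0.6              # slowly decaying learning rate
    theta = theta + eta * grad_f(x, theta) * (y - f(x, theta))   # recursion (2)

xs = np.linspace(-2.0, 2.0, 201)
mse = float(np.mean((g(xs) - f(xs, theta)) ** 2))
```

The update is exactly the stochastic gradient (backpropagation) step, with the RM conditions Σ η_t = ∞, Σ η_t² < ∞ satisfied by the chosen rate.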
3 Extended and Iterated Kalman Filter algorithms

Artificial Neural Networks can be conveniently cast in a state space representation. A general nonlinear state space model is given by a system equation x_{t+1} = g_t(x_t) + w_t and a measurement equation y_t = h_t(x_t) + v_t, where g_t(.) and h_t(.) are nonlinear functions. Consider now a k-layered feedforward network structure described as in [3]:

i^k_j = Σ_{l=1}^{N_{k-1}} w^{k-1,k}_{lj} o^{k-1}_l + θ^k_j    (5)

where the input of the jth node in the kth layer is given by the sum of the products of the connection weights w^{k-1,k}_{lj} with the outputs of the previous layer, plus a bias parameter θ^k_j, and

o^k_j = F(i^k_j)    (6)

where the output is a function of the input through F : R → R. Given the standard BP procedure:

w^{k-1,k}_{lj}(t+1) = w^{k-1,k}_{lj}(t) − η δ^k_j(t) o^{k-1}_l(t)    (7)

where η is the learning rate and δ^k_j = (o^k_j − y^k_j) F'(i^k_j), we can rewrite the network in state space (and suitably compact notation) form as:

W_{t+1} = W_t + G_t ε_t    (8)

y_t = O_t + ν_t    (9)

where ν_t is the output error, G_t ε_t corresponds to the correction term¹ of the BP recursion (7), but with only ε_t characterized by pure erratic behavior, and where the signal in the measurement equation is actually O_t(W_t). As a matter of fact, in this framework the Extended Kalman Filter (EKF) and/or the Iterated Kalman Filter (IKF) represent the master estimation scheme,

¹ The matrix G separates in a convenient way the deterministic components from the purely random ones; see [3] for details.
since standard approximations are introduced in order to derive a suboptimal filter for the signal or to forecast the observed random variable. The general nonlinear functions g_t(.) and h_t(.), when sufficiently smooth, are expanded in Taylor series about the conditional means x̂_{t|t} and x̂_{t|t−1}, thus obtaining g_t(x_t) = g_t(x̂_{t|t}) + G_t(x_t − x̂_{t|t}) + … and h_t(x_t) = h_t(x̂_{t|t−1}) + H_t(x_t − x̂_{t|t−1}) + …. A formal way to obtain a more accurate algorithm is the following²; starting from the EKF update:

x̂⁺ = x̂⁻ + K(y − h(x̂⁻))    (10)

P⁺ = (I − K H) P    (11)

H = h'(x̂⁻)    (12)

K = P H' (H P H' + R)^{-1}    (13)

and given x_0 = x̂⁻, we can obtain the Iterated Kalman Filter as follows:

x_{i+1} = x̂⁻ + K_i (y − h(x_i) − H_i (x̂⁻ − x_i))    (14)

P_{i+1} = (I − K_i H_i) P    (15)

H_i = h'(x_i)    (16)

K_i = P H_i' (H_i P H_i' + R)^{-1}    (17)

Recently [2] has shown that the IKF algorithm is an application of the Gauss-Newton (GN) method. It is common in statistics and econometrics to work with estimators that aim at minimizing some sum-of-squares function like S(θ) = Σ_t ε_t², where ε_t represents a residual term from an estimated linear/nonlinear regression or time series model. The vector of first derivatives, or gradient, in this case is v(θ) = ∂S(θ)/∂θ = 2 Σ_t (∂ε_t/∂θ) ε_t and the Hessian is given by V(θ) = ∂²S(θ)/∂θ∂θ' = 2 Σ_t [(∂ε_t/∂θ)(∂ε_t/∂θ') + (∂²ε_t/∂θ∂θ') ε_t]. Several schemes are able to find iteratively a solution to the initial minimization problem. The most general one is the Newton-Raphson (NR) method, which is given by:

θ* = θ̂ − [Σ_t ((∂ε_t/∂θ)(∂ε_t/∂θ') + (∂²ε_t/∂θ∂θ') ε_t)]^{-1} Σ_t (∂ε_t/∂θ) ε_t    (18)

Since the term involving second derivatives is usually small when compared to the first-derivative product term, the GN scheme approximates the above iterative solution with a formula that is identical to NR apart from the term with second derivatives. In [2] this last approximate scheme is

² A different filter can be derived by including more terms in the Taylor series expansions, thus obtaining second order Extended Kalman Filters; alternatively, when h_t(x_t) can be linearized about the updated conditional mean estimate x̂_{t|t}, it is possible to improve the linearization and thus the final estimate of the state variable (see [1]).
applied to the log-likelihood criterion function l(θ) = ½ ε'Q^{-1}ε, derived from a state space model whose components are z = [y, x̂]', m(x) = [h(x), x]', z ~ N(m(x), Q), with the 2×2 matrix Q having zero off-diagonal elements and the variances R and P on the main diagonal. Given that in this case S(x) = ||ε||² = ε'Q^{-1}ε, and given the factorization B'B = Q^{-1}, we have ε(·) = B(z − m(x)) and thus ∂ε(·)/∂x = −B m'(x); replacing this last term in the GN formula, it is shown in [2] that, after some algebra, the very updating equation employed by the IKF to estimate the state variable is obtained. Thus, by induction, the iterates from the GN method correspond to those from the IKF algorithm.
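The equivalence can be checked numerically in a scalar example: below we run both the IKF recursion (14)-(17) and Gauss-Newton on S(x) = ε'Q^{-1}ε with ε = z − m(x), and the two sequences of iterates coincide. The measurement function h and all the numbers are our own illustrative choices:

```python
import numpy as np

h  = lambda x: np.sin(x) + x          # scalar measurement function
dh = lambda x: np.cos(x) + 1.0        # its derivative H

y, xbar, P, R = 1.3, 0.5, 0.8, 0.2    # observation, prior mean/variance, noise variance

# IKF iterates, eqs (14)-(17), starting from the prior mean
xi, ikf = xbar, []
for _ in range(6):
    H = dh(xi)
    K = P * H / (H * P * H + R)       # eq. (17), scalar case
    xi = xbar + K * (y - h(xi) - H * (xbar - xi))   # eq. (14)
    ikf.append(xi)

# Gauss-Newton on S(x) = eps' Q^{-1} eps, eps = z - m(x),
# with z = [y, xbar]', m(x) = [h(x), x]', Q = diag(R, P)
z = np.array([y, xbar])
Qinv = np.diag([1.0 / R, 1.0 / P])
xg, gn = xbar, []
for _ in range(6):
    J = np.array([dh(xg), 1.0])       # Jacobian of m at the current iterate
    eps = z - np.array([h(xg), xg])
    xg = xg + (J @ Qinv @ eps) / (J @ Qinv @ J)     # GN step
    gn.append(xg)
```

A little algebra confirms why: for the scalar case the GN step equals x_i + R(x̂⁻ − x_i)/(H²P + R) + PH(y − h(x_i))/(H²P + R), which is exactly (14) rearranged.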
4 Some extensions and generalizations

Since we have shown how to cast a neural network architecture in a state space representation, we should try to exploit its properties. The most important fact is that from the Kalman Filter algorithm and its variants we obtain, in a very elegant and immediate way, the likelihood function through the well-known prediction error decomposition device [13]. The likelihood function for the whole set of observations is obtained from the joint density, i.e. L(y; ψ) = Π_{t=1}^N p(y_t | Y_{t−1}), where Y_{t−1} denotes the set of observations up to and including y_{t−1}; since the innovation, or prediction error, computed by the filter is ν_t = y_t − E(y_t | Y_{t−1}) with var(ν_t) = var(y_t | Y_{t−1}) = D_t, when the observations are normally distributed the likelihood function can be expressed in terms of the innovations. Therefore the likelihood function in prediction error decomposition form is:

log L = −(kN/2) log 2π − ½ Σ_{t=1}^N log|D_t| − ½ Σ_{t=1}^N ν_t' D_t^{-1} ν_t    (19)

considering a k×1 vector ν_t and N observations³. For the case we study, where a nonlinear model is the object of investigation and some approximations were used, the solution is of course suboptimal, and close to the optimal one according to the accuracy of the approximation involved. But there are other aspects which deserve to be mentioned:

- the state space set-up can deal with data from stochastic processes which are stationary or not, and with parameters (those which build up the functional form of the neural network, for instance) that are fixed or time-varying;

³ Under Gaussianity the filter delivers an optimal Minimum Mean Squared Error solution for the estimation problem; under hypotheses different from the Gaussian, the filter gives only a Minimum Mean Square Linear solution and the values which are computed are Quasi Maximum Likelihood estimates, less efficient but consistent (and therefore useful to start a recursive procedure).
- apart from the typical markovian structure of the system equation, which characterizes an autoregressive dynamic evolution for the state variable to be estimated, various other aspects can be accommodated by this model representation: fixed and random exogenous effects, switching regimes, missing data, stochastic variance models, qualitative observations, etc.;

- the disturbance terms allow for the analysis of real input data, which are usually noisy; when their distribution is not Gaussian, various nonlinear filtering techniques are available (see [7], for instance).

Returning to the issue concerning the implementation of recursive schemes, consider what we obtain from the KF running over the state space model: the likelihood function as given by (19), whose derivatives can be calculated either analytically or numerically, in the latter case through the same filter. For instance the ith element of the score vector is given by (see [5]):

∂log L/∂ψ_i = −½ Σ_t tr[(D_t^{-1} ∂D_t/∂ψ_i)(I − D_t^{-1} ν_t ν_t')] − Σ_t (∂ν_t'/∂ψ_i) D_t^{-1} ν_t    (20)

therefore requiring the evaluation of the k×k matrices of derivatives ∂D_t/∂ψ_i and the k×1 vectors of derivatives ∂ν_t/∂ψ_i, for i = 1, …, p and t = 1, …, N. These derivatives may be computed through p additional passes of the KF: if we consider a new run of the filter with ψ = [ψ_1, …, ψ_i + δ_i, …, ψ_p], we obtain a new set of innovations ν_t(i) and variances D_t(i), and the numerical approximations of the derivatives⁴ are δ_i^{-1}[ν_t(i) − ν_t] and δ_i^{-1}[D_t(i) − D_t].
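Both the decomposition (19) and this finite-difference scheme for the score can be sketched on a scalar local level model (our own example, with ψ_i taken to be the measurement noise variance r; every name and number below is an illustrative choice):

```python
import numpy as np

def kf(ys, q, r, a1=0.0, p1=1.0):
    # Kalman filter for the local level model x_{t+1} = x_t + w_t, y_t = x_t + v_t;
    # returns innovations nu_t, their variances D_t, and log L accumulated as in (19)
    a, p = a1, p1
    nus, Ds, loglik = [], [], 0.0
    for y in ys:
        d = p + r                      # innovation variance D_t
        nu = y - a                     # innovation nu_t
        loglik += -0.5 * (np.log(2 * np.pi) + np.log(d) + nu ** 2 / d)
        k = p / d
        a, p = a + k * nu, p * (1 - k) + q
        nus.append(nu); Ds.append(d)
    return np.array(nus), np.array(Ds), loglik

rng = np.random.default_rng(1)
q, r, delta, N = 0.25, 0.49, 1e-6, 40
x = rng.normal()                       # x_1 ~ N(0, 1), matching the filter's prior
ys = np.empty(N)
for t in range(N):
    ys[t] = x + np.sqrt(r) * rng.normal()
    x = x + np.sqrt(q) * rng.normal()

nu0, D0, ll0 = kf(ys, q, r)
nu1, D1, ll1 = kf(ys, q, r + delta)    # one extra filter pass with r perturbed

dnu = (nu1 - nu0) / delta              # numerical d(nu_t)/dr
dD = (D1 - D0) / delta                 # numerical d(D_t)/dr

# scalar version of the score formula (20)
score = np.sum(-0.5 * (dD / D0) * (1.0 - nu0 ** 2 / D0) - nu0 * dnu / D0)
```

Here `ll0` reproduces the joint Gaussian log-density of the sample, and `score` agrees with a direct finite difference of the log-likelihood, which is the point of the p extra filter passes.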
Then the Berndt, Hall, Hall and Hausman (BHHH) algorithm, which suggests approximating the Hessian by the well-known outer-product formula, i.e. Σ_{t=1}^N (∂log L_t/∂ψ)(∂log L_t/∂ψ'), is the most indicated choice we have in order to improve, at least asymptotically, the efficiency of the initial GN (or, equivalently, IKF) estimator ψ̂, through the recursion:

ψ* = ψ̂ + λ [Σ_{t=1}^N (∂log L_t/∂ψ)(∂log L_t/∂ψ')]^{-1} Σ_{t=1}^N ∂log L_t/∂ψ    (21)

where λ is a variable step length chosen so that the likelihood function is maximized in a given direction.
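One BHHH step (21) can be sketched for the same scalar local level model, with the per-observation scores obtained numerically through extra filter passes and a simple halving rule for λ; the model, starting values and halving rule are our own illustrative choices:

```python
import numpy as np

def loglik_terms(ys, psi, a1=0.0, p1=1.0):
    # per-observation log-likelihood terms log L_t of the local level model,
    # computed by the Kalman filter; psi = (q, r)
    q, r = psi
    a, p, terms = a1, p1, []
    for y in ys:
        d = p + r                      # innovation variance
        nu = y - a                     # innovation
        terms.append(-0.5 * (np.log(2 * np.pi) + np.log(d) + nu ** 2 / d))
        k = p / d
        a, p = a + k * nu, p * (1 - k) + q
    return np.array(terms)

def ll(psi):
    # total log-likelihood, guarded against nonpositive variances
    return loglik_terms(ys, psi).sum() if min(psi) > 0 else -np.inf

rng = np.random.default_rng(3)
ys = np.cumsum(0.6 * rng.normal(size=80)) + 0.8 * rng.normal(size=80)

psi = np.array([0.2, 0.9])             # starting values for psi = (q, r)
delta = 1e-6

# numerical per-observation scores: one extra filter pass per parameter
base = loglik_terms(ys, psi)
S = np.empty((len(ys), 2))
for i in range(2):
    shifted = psi.copy(); shifted[i] += delta
    S[:, i] = (loglik_terms(ys, shifted) - base) / delta

score = S.sum(axis=0)
B = S.T @ S                            # BHHH outer-product approximation to the Hessian
direction = np.linalg.solve(B, score)

lam = 1.0                              # step length lambda: halve until log L improves
while not (ll(psi + lam * direction) > ll(psi)) and lam > 1e-8:
    lam *= 0.5
psi_new = psi + lam * direction        # recursion (21)
```

Since B is positive definite (a sum of outer products over 80 observations), the BHHH direction is an ascent direction, so a small enough λ always increases the likelihood.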
5 Conclusions

In this paper we showed that the Backpropagation formula, the Stochastic Approximation method and the Extended and Iterated Kalman Filter algorithms have many common properties, and we underlined the fact that most of the powerful state space machinery for the neural network representation has not yet been exploited. From the inferential side, it is important to point out the possibility of employing two-step estimators that use likelihood information (through the functionals involved) in order to reach asymptotically more efficient final estimates.

⁴ The same derivatives can be used to compute the Information Matrix (see [5]).
References

[1] B.D.O. Anderson, J.B. Moore: Optimal Filtering. Prentice Hall, Englewood Cliffs, NJ (1979).

[2] B.M. Bell, F.W. Cathey: "The Iterated Kalman Filter update as a Gauss-Newton method". IEEE Transactions on Automatic Control, 38, 294-297 (1993).

[3] G. Chen, H. Ogmen: "Modified extended Kalman filtering for supervised learning". International Journal of Systems Science, 24, 1207-1214 (1993).

[4] S. Chen, S.A. Billings: "Neural Networks for nonlinear dynamic system modelling and identification". International Journal of Control, 56, 319-346 (1992).

[5] A.C. Harvey: Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge (1989).

[6] K.J. Hunt, D. Sbarbaro, R. Zbikowski, P.J. Gawthrop: "Neural Networks for Control Systems - A Survey". Automatica, 28, 1083-1112 (1992).

[7] G. Kitagawa: "Non-Gaussian state space modelling of nonstationary time series". Journal of the American Statistical Association, 82, 1032-1063 (1987).

[8] C.M. Kuan, H. White: "Artificial Neural Networks: an econometric perspective". Econometric Reviews, 13, 1-92 (1994).

[9] O. Nerrand, P. Roussel-Ragot, L. Personnaz, G. Dreyfus: "Neural Networks and Nonlinear Adaptive Filtering: unifying concepts and new algorithms". Neural Computation, 5, 165-199 (1993).

[10] I. Poli, R.D. Jones: "A neural net model for prediction". Journal of the American Statistical Association, 89, 117-121 (1994).

[11] H. Robbins, S. Monro: "A Stochastic Approximation Method". Annals of Mathematical Statistics, 22, 400-407 (1951).

[12] D.E. Rumelhart, G.E. Hinton, R.J. Williams: "Learning internal representations by error propagation". In D.E. Rumelhart and J.L. McClelland (eds): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA, 318-362 (1986).

[13] F.C. Schweppe: "Evaluation of likelihood functions for Gaussian signals". IEEE Transactions on Information Theory, IT-11, 61-70 (1965).