Internal paper (available at http://www.robotic.dlr.de/Friedrich.Lange)

Fast and Accurate Training of Multilayer Perceptrons Using an Extended Kalman Filter (EKFNet)

Friedrich Lange
Institute for Robotics and Systems Dynamics
DLR (German Aerospace Research Establishment)
Oberpfaffenhofen, D-82234 Wessling, Germany
e-mail: [email protected]

September 1995

Abstract

The training algorithm EKFNet uses an Extended Kalman Filter for supervised learning of feedforward neural nets. The big difference with respect to ordinary backpropagation methods is the calculation of an N × N covariance matrix which considers the interdependence of the N weights that have to be optimised. Computing time therefore increases quadratically with the number of weights, which restricts fast learning to problems with up to about 200 weights. In contrast to other optimization methods, training is performed as learning-by-pattern to allow fast convergence for long sets of training data. EKFNet is demonstrated for the task of decoupling an industrial robot as well as for a problem of the PROBEN1 benchmark set. In most cases EKFNet proves to learn faster and more accurately than any other method tested.

1 Introduction

In contrast to linear mappings, where the parameters can be calculated easily, it is still an unsolved problem to tune the weights of nonlinear neural nets in such a way that a given training data set is represented with minimal error. Two problems are associated with the well known standard backpropagation algorithm [RMtP86]: First, appropriate learning parameters need to be chosen, and their tuning is not trivial. Second, the performance is deteriorated by couplings between the weights. Especially for the learning-by-pattern methods treated in this paper, previous training information is forgotten in subsequent training steps, yielding slow convergence even for trivial problems. Other training methods for neural networks (e.g. Quickprop [Fah88], RPROP [RB93], or adaptive control of learning parameters [Sal91]) weaken the demand for carefully tuned learning parameters without changing the problem of couplings. Methods which originally come from optimization theory [NN91] [Sun93] [BH94] promise better convergence. In the following the Extended Kalman Filter [May82] is applied to the training of neural nets. A similar approach was already proposed and applied to classification problems by [SW89] but has not been generally accepted [BH94].

2 Development of the training method

2.1 Extended Kalman Filters

The extended Kalman algorithm assumes a dynamical system

θ(k+1) = f(θ(k), k) + v(θ; k)                                                        (1)

which is observed by

y(k) = h(θ(k), k) + w(θ; k).                                                         (2)

v(θ; k) and w(θ; k) represent stochastic noise that disturbs the measurement of the vector of unknown variables θ(k). The variable y(k) as well as the functions f(θ(k), k) and h(θ(k), k) are known at time instant k. For dynamical systems with unknown or stochastic changes in the unknown parameters, f(θ; k) can be replaced by θ(k). This yields a simplified version of the Kalman filter equations for the estimated values θ̂(k+1) and the assumed covariance matrix of the estimation error P(k+1):

P⁻(k+1) = P(k) + Q                                                                   (3)

θ̂(k+1) = θ̂(k) + P⁻(k+1) · h_θ(θ̂(k), k+1) · [ h_θᵀ(θ̂(k), k+1) · P⁻(k+1) · h_θ(θ̂(k), k+1) + σ²(k+1) ]⁻¹ · [ y(k+1) − h(θ̂(k), k+1) ]          (4)

P(k+1) = P⁻(k+1) − P⁻(k+1) · h_θ(θ̂(k), k+1) · [ h_θᵀ(θ̂(k), k+1) · P⁻(k+1) · h_θ(θ̂(k), k+1) + σ²(k+1) ]⁻¹ · h_θᵀ(θ̂(k), k+1) · P⁻(k+1)      (5)

Please note the difference between the function h(θ̂(k), k+1) and the (transposed) vector

h_θ(θ̂(k), k+1) = ∂h(θ̂(k), k+1) / ∂θ̂(k).                                             (6)

For linear functions h(θ(k), k) the estimates θ̂(k) are known to be optimal for the N unknown variables θ(k), if Q is the covariance matrix of the stochastic changes of the parameters,

Q = E{ v · vᵀ } = I · q                                                              (7)

and σ² is the variance of the measurement noise,

σ² = E{ w · w }.                                                                     (8)

Please notice that y(k) and h(θ(k), k) are scalars, since multi-output measurements can always be replaced by a set of scalar measurements. Otherwise the divisions in equations (4) and (5) would be matrix inversions.

Because the covariance matrix P of the estimation error is symmetrical, the algorithm of equations (3) to (5) can be implemented with about (3/2)·N² + 4·N multiplications and additions. The computational effort of equation (6) depends on the function h(θ(k), k). In general this part increases only linearly with N and is thus negligible for N ≫ 1.
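To make the update concrete, the following is a minimal sketch of equations (3) to (5) for a single scalar measurement, assuming NumPy; the function name ekf_update and its argument names are illustrative and not part of the original paper.

```python
import numpy as np

def ekf_update(theta, P, h, h_grad, y, q, sigma2):
    """One scalar-measurement EKF step for parameter estimation (eqs. 3-5).
    theta: (N,) parameter estimate, P: (N, N) error covariance,
    h: model output for the current pattern, h_grad: (N,) partial
    derivatives of the output with respect to theta."""
    P_minus = P + q * np.eye(len(theta))            # eq. (3): P^-(k+1) = P(k) + Q with Q = I*q
    s = float(h_grad @ P_minus @ h_grad + sigma2)   # scalar innovation variance
    k_gain = P_minus @ h_grad / s                   # Kalman gain, shape (N,)
    theta_new = theta + k_gain * (y - h)            # eq. (4): correct with the output error
    P_new = P_minus - np.outer(k_gain, h_grad @ P_minus)  # eq. (5): reduce the covariance
    return theta_new, P_new
```

Because the measurement is scalar, the bracketed term of equations (4) and (5) is a scalar division rather than a matrix inversion, which is exactly why the paper treats multi-output measurements as a set of scalar measurements.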

2.2 Application of the Extended Kalman Filter to the training of neural nets

Training of the weights of a neural network can be treated as an estimation problem. The optimal weights and biases represent the unknown vector θ, whereas the partial derivatives with respect to the unknowns, h_θ(θ̂(k), i), have to be derived by a recursive algorithm similar to the backpropagation algorithm. This can be found in the appendix. It should be noted that for the Kalman algorithm the output error influences only the update of θ̂ in equation (4), whereas the adaptation of P uses the partial derivatives and is thus independent of the error.

The learning parameters of the Kalman filter, q and σ², have to be adapted in each epoch, i.e. when all M patterns have been trained once. This is proposed here in a different way than in [SW89]. The variance of the "noise" is estimated from the mean representational error as

σ² = 0.1 · (1/M) · Σ_{j=1}^{M} ( y(j) − h(θ̂(k), j) )².                               (9)

For numerical reasons a minimum value of

σ²_min = 10⁻⁶ · ( Σ_{i=1}^{N} √( P⁻_ii · max_{i,j} h_θi²(θ̂(k), j) ) )²               (10)

has to be kept in relation to the diagonal elements of the covariance matrix P⁻ and to the elements of the vectors h_θ(θ̂).

The "change" of the optimal values θ due to the changing linearization of equation (6) is calculated roughly by

q = (0.01 / N) · (θ̂ − θ̂_old)ᵀ · (θ̂ − θ̂_old).                                         (11)

This change is assumed to occur from one epoch to another and thus weights the difference of the estimated values between two epochs. So equation (11) is evaluated only at the end of every epoch; otherwise q = 0 is used.

Initialization of the weights and biases is proposed to be random, except for the output layer, where the parameters are set to zero. This yields a minimal a priori error.

As mentioned earlier, training converges faster if the update of the weights is performed after each training step. It can be accelerated further if the sequence of training patterns is changed. It turned out to be favourable to skip √M − 1 patterns in each step when searching for the next training pattern. Thus for M = 10 patterns the training sequence would be 1, 4, 7, 10, 2, 5, 8, 3, 6, 9, 1, 4, ... This is advantageous especially for measured training data, where consecutive values are normally similar.

To prevent overfitting, the adaptation of σ² can be performed using a set of validation data which is different from the training set. Training should be stopped when a minimal error on the validation set is reached.
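As an illustration of the learning-by-pattern procedure described above, the following sketch combines the per-pattern EKF step with the per-epoch adaptation of σ² and q from equations (9) and (11). It is only an outline of one possible implementation: the helpers net_output and net_gradient (forward pass and Jacobian row, see the appendix) are assumed to exist, ekf_update refers to the sketch in Section 2.1, and the handling of q follows the interpretation that Q = I·q is applied once per epoch, with q = 0 for the steps in between.

```python
import numpy as np

def train_epoch(theta, P, patterns, targets, sigma2, net_output, net_gradient):
    """One learning-by-pattern epoch. net_output(theta, x) -> scalar and
    net_gradient(theta, x) -> (N,) array are assumed helpers; ekf_update
    is the sketch from Section 2.1."""
    theta_start = theta.copy()
    M = len(patterns)
    stride = max(int(round(np.sqrt(M))), 1)
    # visit the patterns with a stride of about sqrt(M):
    # for M = 10 this yields the sequence 1, 4, 7, 10, 2, 5, 8, 3, 6, 9
    order = sorted(range(M), key=lambda j: (j % stride, j))
    errors = np.zeros(M)
    for j in order:
        h = net_output(theta, patterns[j])
        g = net_gradient(theta, patterns[j])
        errors[j] = targets[j] - h
        theta, P = ekf_update(theta, P, h, g, targets[j], 0.0, sigma2)  # q = 0 within the epoch
    sigma2 = 0.1 * np.mean(errors ** 2)              # eq. (9): adapt the "noise" variance
    d = theta - theta_start
    q = 0.01 / len(theta) * float(d @ d)             # eq. (11): parameter change per epoch
    P = P + q * np.eye(len(theta))                   # applied once, at the epoch boundary
    return theta, P, sigma2
```

The lower bound of equation (10) on σ² is omitted here for brevity; in practice it would be enforced after the update of σ².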

3 Learning to decouple the dynamical behaviour of robots

The quality of learning can be demonstrated best by replacing EKFNet with other training methods. This is done for the learning of a feedforward controller of an industrial robot in [LH95]. In this case a linear mapping is learned first. Then a neural net is trained to reduce the remaining errors. Two nets are trained, with net structures of 14-7-3-1 and 24-8-3-1 nodes, meaning 133 and 231 weights, respectively.

                     random weights                   pretrained weights
N           131     131     231     231      131     131     231     231
M           196    1176     196    1176      196    1176     196    1176
0         0.461   0.489   0.249   0.275    0.062   0.037   0.035   0.077
bp        0.460   0.489   0.246   0.266    0.054   0.034   0.038   0.082
tn        0.336   0.417   0.196   0.221    0.036   0.033   0.031   0.074
EKFNet    0.212   0.231   0.245   0.202    0.032   0.027   0.028   0.070

Table 1: RMS error in mrad for the training set of robot joint angle commands. Training is performed for the same CPU time using standard backpropagation (bp), EKFNet, and the truncated Newton method (tn). "0" stands for the initial error. For the examples with a pretrained net the covariance matrix P is provided as well.

EKFNet is compared with standard backpropagation and with the truncated Newton method [NN91], which was the best among the optimization methods in [Sun93]. Table 1 shows that EKFNet is best for all examples of the smaller net, whereas for the bigger net it is sometimes inferior to the optimization method. The differences between the two methods are small compared with the backpropagation algorithm. The latter cannot improve as much within the given CPU time. In addition, it cannot even preserve the result in the case of very good initial weights.

4 Application to the PROBEN1 set of benchmark problems

Similar results can be reported when using benchmark problems of PROBEN1 [Pre94]. For the sake of brevity only the prediction of the energy consumption in a building is reported here. Table 2 compares EKFNet with standard backpropagation (with learning parameters 0.05 and 0.1) and with the RPROP method as reported in [Pre94]. The net structure is chosen as 14-8-4-3, resulting in 171 weights including biases. The training set comprises 2104 patterns.

method     epochs        training data              test data
0               -     0.362  0.306  0.303     0.206  0.298  0.307
RPROP    307-1380     0.179  0.123  0.119     0.311  0.131  0.126
bp            650     0.088  0.104  0.105     0.135  0.105  0.106
EKFNet          1     0.066  0.103  0.104     0.167  0.104  0.105
EKFNet         10     0.058  0.096  0.097     0.163  0.101  0.100

Table 2: RMS error for different data sets of the prediction of energy consumption in buildings. Note that the error measure is not the squared error used in [Pre94]. All numbers are averages over several runs with different initializations. One epoch of EKFNet needs as much CPU time as about 65 (30) epochs of bp for this problem (the number in parentheses holds for PCs under MS-DOS).

It turns out that the learning-by-epoch method RPROP is inferior on these problems. The result using EKFNet is best because the covariance matrix ensures that previously learned information is not destroyed in subsequent training steps. So even training for a single epoch is superior to the other methods. The good result of backpropagation on the first test set seems to be pure chance. The same can be reached with EKFNet for a modified adaptation of σ², causing slower convergence on the training data.


5 Conclusion

Problems of standard training methods, such as the troublesome adaptation of learning parameters or the destruction of previously learned information by subsequent training, do not occur when training with EKFNet. Compared with other optimization methods, the learning-by-pattern approach of EKFNet proves to be advantageous for learning continuous functions.

EKFNet can be recommended for the training of feedforward nets with up to about 200 weights. For those problems it learns faster than the other methods tested and in most cases the final results are more accurate. For larger problems it is inferior to other optimization methods because its costs increase quadratically with the number of weights.

For the sake of brevity the algorithm and the experiments cannot be described here in more detail. Interested readers are invited to contact the author by email to obtain a software implementation of EKFNet for their own experiments.

References

[BH94] E. Barnard and J. E. W. Holm. A comparative study of optimization techniques for backpropagation. Neurocomputing, 6:19-30, 1994.

[Fah88] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, Carnegie Mellon University, September 1988.

[LH95] F. Lange and G. Hirzinger. Application of multilayer perceptrons to decouple the dynamical behaviour of robot links. In Int. Conf. on Artificial Neural Networks (ICANN'95), Paris, France, October 9-13, 1995.

[May82] P. S. Maybeck. Stochastic Models, Estimation and Control, volume 2, volume 141-2 of Mathematics in Science and Engineering. Academic Press, 1982.

[NN91] S. G. Nash and J. Nocedal. A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. SIAM Journal on Optimization, 1(3):358-372, August 1991.

[Pre94] L. Prechelt. PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, September 1994. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.Z on ftp.ira.uka.de.

[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In H. Ruspini, editor, IEEE Int. Conf. on Neural Networks (ICNN), pages 586-591, San Francisco, 1993.

[RMtP86] D. Rumelhart, J. McClelland, and the PDP Research Group. Parallel Distributed Processing, volume 1. The MIT Press, Cambridge, Massachusetts, 1986.

[Sal91] R. Salomon. Verbesserung konnektionistischer Lernverfahren, die nach der Gradientenmethode arbeiten. PhD thesis, Fachbereich Informatik der Technischen Universität Berlin, 1991.

[Sun93] J. Sundermann. Anwendung von Optimierungsverfahren auf das Training von neuronalen Netzen. Master's thesis, Fakultät für Informatik, Technische Universität München, 1993.

[SW89] S. Singhal and L. Wu. Training feed-forward networks with the extended Kalman algorithm. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 1187-1190, Glasgow, Scotland, 1989.

Appendix: Equations of forward and backward propagation for a strictly layered net

For this appendix a net architecture is assumed in which each neuron i1 of layer i has inputs from all neurons i2 of layer i−1, with layer 0 denoting the input layer. For such nets the forward propagation is expressed by

out(i, i1) = 1 / (1 + e^(−net(i, i1)))                                                  (12)

net(i, i1) = weight(i, i1, 0) + Σ_{i2=1}^{l(i−1)} weight(i, i1, i2) · out(i−1, i2)      (13)

where l(i−1) is the number of neurons in layer i−1 and weight(i, i1, 0) is the bias term of neuron (i, i1). In this notation out(n_l, i_out) is the output of the last layer n_l, i.e. the net output h(θ). For the elements of the vector of partial derivatives h_θ (equation 6), here denoted by J(i, i1, i2), a recursive algorithm can be derived using a variable δ. This is similar but not identical to the standard backpropagation algorithm.

Using the well known derivative of the sigmoid function,

∂out(i, i1) / ∂net(i, i1) = out(i, i1) − out(i, i1) · out(i, i1),                       (14)

and the chain rule for the derivation of recursive functions yields the recursion

δ(n_l, i1) = 0                                             for i1 ≠ i_out
δ(n_l, i1) = out(n_l, i1) − out(n_l, i1) · out(n_l, i1)    for i1 = i_out               (15)

δ(i, i1) = Σ_{i2=1}^{l(i+1)} δ(i+1, i2) · weight(i+1, i2, i1) · [ out(i, i1) − out(i, i1) · out(i, i1) ]    for i ≠ n_l    (16)

The partial derivatives are

J(i, i1, 0) = δ(i, i1)                                                                  (17)

and

J(i, i1, i2) = δ(i, i1) · out(i−1, i2).                                                 (18)

This recursion differs from the standard backpropagation algorithm in that the output error is not included. Besides, for nets with multiple output neurons the derivatives relate to only one output. (J(i, i1, i2) is row i_out of the Jacobian of the total multidimensional mapping.)

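As a check of this recursion, here is a minimal sketch in Python/NumPy. The layer layout (weights[i] holding the biases in column 0) and the function names are illustrative assumptions, not part of the paper; the sketch computes the forward pass of equations (12) and (13) and one Jacobian row according to equations (15) to (18).

```python
import numpy as np

def forward(weights, x):
    """Forward propagation through a strictly layered sigmoid net (eqs. 12-13).
    weights[i] has shape (l(i+1), l(i)+1); column 0 holds the biases."""
    outs = [np.asarray(x)]
    for W in weights:
        net = W[:, 0] + W[:, 1:] @ outs[-1]         # eq. (13): bias + weighted inputs
        outs.append(1.0 / (1.0 + np.exp(-net)))     # eq. (12): sigmoid activation
    return outs

def jacobian_row(weights, outs, i_out):
    """Partial derivatives of output neuron i_out with respect to all weights
    and biases (eqs. 15-18), i.e. one row of the Jacobian used as h_theta."""
    o = outs[-1]
    delta = np.zeros_like(o)
    delta[i_out] = o[i_out] * (1.0 - o[i_out])      # eq. (15): only the selected output
    blocks = []
    for i in range(len(weights) - 1, -1, -1):
        # eqs. (17)-(18): bias derivative in column 0, weight derivatives beside it
        blocks.append(np.hstack([delta[:, None], np.outer(delta, outs[i])]))
        if i > 0:
            d_act = outs[i] * (1.0 - outs[i])
            delta = (weights[i][:, 1:].T @ delta) * d_act   # eq. (16): propagate backwards
    return blocks[::-1]                              # same layer order as `weights`
```

Note that, unlike standard backpropagation, no output error appears here; the error enters only through the EKF update of equation (4).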