
REGULARIZATION TOOLS FOR TRAINING FEED-FORWARD NEURAL NETWORKS
PART I: Theory and basic algorithms

JERRY ERIKSSON, MÅRTEN GULLIKSSON, PER LINDSTRÖM and PER-ÅKE WEDIN

Department of Computing Science, Umeå University, S-901 87 Umeå, Sweden
E-mail: [email protected], [email protected], [email protected], [email protected]

We present regularization tools for training small, medium and large feed-forward artificial neural networks. The determination of the weights leads to very ill-conditioned nonlinear least squares problems, and regularization is often suggested to control the network complexity, to keep the variance error small, and to obtain a well-behaved optimization problem. The algorithms proposed explicitly use a sequence of Tikhonov-regularized nonlinear least squares problems. The Gauss-Newton method is applied to the regularized problem, which is much less ill-conditioned than the original problem, and exhibits far better convergence properties than a Levenberg-Marquardt method. The numerical results presented confirm that the proposed implementations are more reliable and efficient than the Levenberg-Marquardt method implemented in the Matlab Neural Network toolbox. The proposed algorithms are tested using the benchmark problems and guidelines of Lutz Prechelt's Proben1 package. All software is written in Matlab and gathered in a toolbox.

KEY WORDS: Neural network training, Tikhonov regularization, Nonlinear least squares.

1 INTRODUCTION

The training phase of supervised feed-forward neural networks leads to very difficult unconstrained nonlinear least squares problems. It is known from Saarinen, Bramley and Cybenko [21] that the Jacobian computed in each step may become ill-conditioned. Applying a standard optimization method to these problems generally gives poor results, since an ill-conditioned Jacobian gives a very poor search direction. A possible technique in optimization is to incorporate regularization to prevent the parameters from moving freely. From the perspective of neural networks, it is generally believed that regularization is the most useful approach in the context of training.

Financial support has been received from the Swedish National Board for Industrial and Technical Development under grant NUTEK 8421-94-4603.
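The effect of such ill-conditioning is easy to reproduce. The sketch below (illustrative Python, not the paper's Matlab software; the matrix, residual and damping value are made up for the example) shows how a nearly rank-deficient Jacobian produces an enormous Gauss-Newton step, while a Tikhonov term keeps the step moderate:

```python
import numpy as np

J = np.array([[1.0, 0.0],
              [0.0, 1e-8]])        # nearly rank-deficient Jacobian
f = np.array([1.0, 1.0])           # residual vector

# Plain Gauss-Newton direction: solve min ||J p + f||_2
p_gn = np.linalg.lstsq(J, -f, rcond=None)[0]

# Tikhonov-damped direction: solve min ||J p + f||^2 + mu^2 ||p||^2
mu = 1e-2                          # illustrative regularization parameter
J_aug = np.vstack([J, mu * np.eye(2)])
f_aug = np.concatenate([f, np.zeros(2)])
p_reg = np.linalg.lstsq(J_aug, -f_aug, rcond=None)[0]

# ||p_gn|| is of order 1e8, while ||p_reg|| stays of order 1
```

The damped direction sacrifices a tiny amount of model reduction in the weak subspace in exchange for a step of reasonable length, which is exactly the trade-off exploited throughout the paper.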


J. ERIKSSON, M. GULLIKSSON et al.

Bishop [3, 4, 5] argues that regularization is preferable to architecture selection, early stopping, or training with noise as a means of complexity control (number of nodes, etc.). In Ljung [16] it is stated that regularization is a very useful tool for selecting "the efficient number of parameters", but also for obtaining a small variance. This kind of smoothing is known as ridge regression. In this paper, we propose optimization methods explicitly applied to small and medium size nonlinear regularized problems. To be specific, we directly formulate and solve nonlinear Tikhonov-regularized problems. It will be shown theoretically and practically that this approach is superior to the standard optimization regularization techniques, such as the Levenberg-Marquardt (LM) and trust region methods [18], or truncated QR-methods as in subspace minimization approaches [7, 8, 15]. In a companion paper [9] (this issue) we propose new algorithms for large-scale problems of feed-forward neural network type. For the present neural network computations, the difference between the output vector A and the desired target vector T is named the error vector E = T - A, see Section 1.1. However, throughout we will use some of the conventions from nonlinear optimization and write f instead of E. Then the nonlinear least squares problem is written as

    min_x F(x) = (1/2) * sum_{i=1}^{m} f_i(x)^2.    (1)


FIGURE 4: The inner steplength algorithm in Matlab code.
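In outline, the approach of the paper applies Gauss-Newton directly to a Tikhonov-regularized problem of the form min_x (1/2)||f(x)||^2 + (1/2) mu^2 ||x - c||^2. The Python sketch below is only a schematic of that idea: the names f, jac, mu and c, the fixed regularization parameter, and the plain unit step in place of the paper's inner steplength algorithm are all simplifications, not the authors' Matlab implementation.

```python
import numpy as np

def gauss_newton_tikhonov(f, jac, x0, mu, c, iters=50, tol=1e-10):
    """Gauss-Newton on the augmented residual [f(x); mu*(x - c)]."""
    x = x0.copy()
    for _ in range(iters):
        r, J = f(x), jac(x)
        # Residual and Jacobian of the Tikhonov-regularized problem
        r_aug = np.concatenate([r, mu * (x - c)])
        J_aug = np.vstack([J, mu * np.eye(x.size)])
        p = np.linalg.lstsq(J_aug, -r_aug, rcond=None)[0]  # GN direction
        if np.linalg.norm(p) < tol:
            break
        x = x + p   # unit step; the paper instead uses a steplength algorithm
    return x

# Toy usage: fit y = exp(a*t) to synthetic data
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t)
f = lambda x: np.exp(x[0] * t) - y
jac = lambda x: (t * np.exp(x[0] * t)).reshape(-1, 1)
x = gauss_newton_tikhonov(f, jac, np.array([0.0]), mu=1e-3, c=np.zeros(1))
# x[0] is close to 0.7, mildly biased toward c by the regularization term
```

The point of the construction is that the augmented Jacobian J_aug always has full column rank for mu > 0, so the least squares subproblem is well-posed even when J itself is nearly rank-deficient.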

All problems except one are based on the UCI machine learning databases repository. The strength of Proben1 is the unifying approach that has been used to make it possible for other researchers to reproduce numerical tests. The benchmark problems cover two kinds of neural network applications: classification problems and their continuous counterparts, which represent approximation problems. Our tests involve the approximation problems. The test problems are thoroughly described in Proben1. Some background facts are given, and the number of input and output signals (nodes) is listed, as is the total number of input examples. The definition of a specific problem can be found in Proben1 given the name of the problem. The leftmost column in Table 3 contains some problem/network specific data for the problems used in this study. For example, building1 (2+2) l 6312*45 tells us the name of the problem (building1), and (2+2) l defines the network to have two hidden layers with 2 and 2 tansig nodes, respectively. The letter l tells us that the output nodes use the purelin activation function (an s instead of an l means that the output nodes use the tansig activation function). 6312*45 is the size of the corresponding Jacobian matrix (6312 is the number of components in f and 45 is the number of weights in the neural net to be determined). (Note that the total data for problems xxx1, xxx2 and xxx3 are the same. The difference is the partitioning of the total data into training set, validation set, and test set, respectively.)
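The bookkeeping behind an entry such as building1 (2+2) l 6312*45 can be sketched as follows (Python for illustration; the input/output counts 14 and 3 and the 2104 training examples for building1 are our reconstruction from the Proben1 problem descriptions, not stated in this section):

```python
def jacobian_shape(n_in, hidden, n_out, n_examples):
    """Rows and columns of the Jacobian for a fully connected feed-forward net."""
    sizes = [n_in] + list(hidden) + [n_out]
    # each layer contributes (fan_in + 1 bias) * fan_out weights
    n_weights = sum((a + 1) * b for a, b in zip(sizes[:-1], sizes[1:]))
    # f has one component per output node and training example
    return n_examples * n_out, n_weights

# building1: 14 inputs, hidden layers (2+2), 3 outputs, 2104 training
# examples -> a 6312*45 Jacobian, matching the table entry
```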

TRAINING NEURAL NETWORKS, PART I


4.2 Convergence criteria

Early stopping convergence criteria. We use the objective function F_aug(x) defined by (3) in the training procedure. In F_aug(x) the centre c is the zero vector and f(x) is the vector with elements o_i(x) - t_i for actual output value o_i and target value t_i, i = 1, 2, ..., q*outs, where q is the number of input examples and outs is the number of output nodes. However, the reported results in Tables 3-5 for the training set, validation set and test set are all based on the corresponding mean squared error percentage, defined as

    E = (100 / (q * outs)) * sum_{i=1}^{q*outs} (o_i - t_i)^2.

Before giving the termination criteria we define some quantities. Let E_va(t) and E_tr(t) be the squared error function on the validation set and the training set, respectively, after epoch t. For example, E_va is the error percentage E above evaluated using the validation data points. The value E_opt(t) is defined to be the lowest validation set error obtained in the epochs up to t, i.e.

    E_opt(t) = min_{t' <= t} E_va(t').

Now we define the generalization loss at epoch t to be the relative increase of the validation error over the lowest validation set error (in percent):

    GL(t) = 100 * (E_va(t) / E_opt(t) - 1).

A high generalization loss is an indication to stop the training in order to avoid overfitting of the network. This leads to one class of stopping criteria: the generalization loss exceeds a certain threshold a (the criterion GL_a is fulfilled if GL(t) > a for some epoch t). To follow Proben1, we formalize the notion of training progress by defining a training strip of length k to be a sequence of k epochs numbered n+1, ..., n+k, where n is a multiple of k. The training progress (measured in parts per thousand) after such a training strip is

    P_k(t) = 1000 * ( sum_{t' = t-k+1}^{t} E_tr(t') / (k * min_{t' = t-k+1, ..., t} E_tr(t')) - 1 ).

Just like the progress, GL is normally evaluated only at the end of each training strip. As proposed in Proben1, we have used a rather complicated criterion to stop the training at an early point. Training is stopped when either P_5(t) < 1, or GL_10 is fulfilled, or more than 5000 epochs were used, or the following condition was satisfied: the GL_5 stopping criterion was fulfilled at least once, the validation error had increased for at least 8 successive strips at least once, and the quotient 0.1*GL_5(t)/P_5(t) had been larger than 3 at least once.
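The two quantities GL(t) and P_k(t) can be sketched directly (Python for illustration; E_va and E_tr are assumed to be lists of per-epoch errors, with index 0 holding epoch 1):

```python
def generalization_loss(E_va, t):
    # GL(t) = 100 * (E_va(t)/E_opt(t) - 1), with E_opt(t) = min over epochs <= t
    E_opt = min(E_va[:t])
    return 100.0 * (E_va[t - 1] / E_opt - 1.0)

def training_progress(E_tr, t, k=5):
    # P_k(t) over the training strip of epochs t-k+1, ..., t (parts per thousand)
    strip = E_tr[t - k:t]
    return 1000.0 * (sum(strip) / (k * min(strip)) - 1.0)
```

For a strip on which the training error is already flat, P_k(t) is 0; a strip that still descends steeply gives a large value, which is what keeps the early-stopping rule from firing too soon.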

Optimization convergence criterion.

The training progress defined by P_k(t) is dangerous as a criterion on its own, since it might be small only because the actual optimization method makes very little progress in the training process. A natural termination criterion for nonlinear least squares problems solved by the Gauss-Newton method is the quotient Q(mu, x) = ||J_aug p_GN|| / ||f_aug||, where mu denotes the regularization parameter. If Q(mu, x) is small, then the current iterate x is close to a local minimizer for the current mu. It is clear from Section 3 that mu is reduced as long as the Gauss-Newton method makes sufficient progress (i.e., the measure of progress exceeds 0.5) and, thus, mu will converge to some mu_s > 0. We define the optimization termination by Q(mu_s, x) < 0.005. For the Levenberg-Marquardt and backpropagation methods the quotient Q(mu, x) is not available, and there is no such termination criterion in the corresponding Matlab Neural Network toolbox. The "absolute termination" value is the training set error function at the point where the optimization termination criterion is satisfied, see Table 5.

4.3 Numerical results

The results reported are all obtained by traintr, given in Figure 3, and by trainlm and trainbpx from the Matlab Neural Network toolbox. All codes have been extended to handle the early stopping convergence criteria. The initial value x_0 is generated at random but is the same for the different optimization methods used on the same problem. The centre c is the zero vector and the initial value mu_init of the regularization parameter is set to 0.1 in all runs. The abbreviations in Tables 3-5 are the following:

Training set. Mean and standard deviation (sdev) of the minimum squared error percentage on the training set reached at the termination point.
Validation set. Ditto, on the validation set.
Test set. Mean and sdev of the squared test error percentage at the point where the minimum of E_va(t) occurs.
Total epochs. The total number of epochs until termination.
Relevant epochs. The number of epochs until the minimum of E_va(t) occurs.
2-norm of solution. ||x_sol - c||_2, where x_sol is the termination point and c is the zero vector (in this study).
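The termination quotient described above is cheap to evaluate once a Gauss-Newton direction is known. A hedged sketch (illustrative Python; in the actual toolbox these quantities fall out of the regularized Gauss-Newton solve rather than a separate computation):

```python
import numpy as np

def termination_quotient(J_aug, f_aug):
    # Q = ||J_aug p_GN|| / ||f_aug||; a small Q means the Gauss-Newton model
    # predicts almost no further reduction, i.e. we are near a local minimizer.
    p_gn = np.linalg.lstsq(J_aug, -f_aug, rcond=None)[0]
    return np.linalg.norm(J_aug @ p_gn) / np.linalg.norm(f_aug)
```

When f_aug is orthogonal to the range of J_aug (a stationary point of the regularized problem), p_GN vanishes and Q is zero; far from a minimizer Q is of order one.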

Results with benchmark rule convergence criteria.

Tables 3 and 4 show the results for traintr, trainlm, and trainbpx, corresponding to the first, second, and third row in each block. The results indicate that backpropagation is generally a bad choice of training method, so we concentrate on the results for traintr and trainlm. For the flare problems, traintr and trainlm converge with almost identical sum-of-squares results, but traintr requires far fewer epochs. traintr also converges to solutions of smaller 2-norm. The most striking differences between the methods are obtained for the hearta problems. traintr converges with much smaller


Problem                     Total epochs    Relevant epochs   2-norm of solution
                            mean    sdev    mean    sdev      mean     sdev

flare1 (4+0) s 1599*115       18       6       9       5       5.37     2.03
                              86      55      60      47       6.28     1.91
flare2 (2+0) s 1599*59        19       6      10       5       5.59     1.82
                              58      46      48      46       5.99     3.32
flare3 (2+0) l 1599*59        20       8       6       3       6.09     1.54
                              23      14       7      10       9.61     3.08
hearta1 (32+0) s 460*1185   #  9       0       7       1      10.23     0.27
                            # 29      21      12       4      115.9    136.1
hearta2 (2+0) l 460*75        19       9       8       3      16.19     8.70
                              28      22       7       3      21.42    10.85
hearta3 (4+0) s 460*149       14       3       8       2       9.96     4.35
                             105      73      69      66      25.97    28.17
heartac1 (2+0) l 152*75       12       4       7       2      10.38     5.34
                              33      34      11      12      21.29    11.88
heartac2 (8+4) l 152*329       9       3       6       2       5.78     0.40
                              15      21       8      15      29.56    16.23
                              72      41      62      38      10.67     0.59
heartac3 (4+4) l 152*169      10       2       6       2       5.85     0.75
                              17      18       6       5      12.97     4.98
                              81      44      65      41       7.81     1.12
building1 (2+2) l 6312*45     24      12      19      13       6.61     1.57
                              43      44      38      42       8.61     2.94
                              53      43      43      41      11.58    21.29
building2 (16+8) s 6312*403   20      37      13      37      92.98    78.69
building3 (8+8) s 6312*219  # 64      18      64      18      20.31     4.06
                            #191      15     191      15      14.12     5.92
                              27      33      21      34      70.75    73.48

TABLE 3: Results obtained for traintr, trainlm and trainbpx, which correspond to the three rows for each problem. "Early stopping" is used to terminate. The mean and standard deviation are computed from 30 runs (except for the # marked rows, where only three runs are used) using different starting points.


Problem                     Training set    Validation set    Test set
                            mean    sdev    mean    sdev      mean    sdev

flare1 (4+0) s 1599*115     0.29    0.04    0.35    0.02      0.57    0.02
                            0.37    0.04    0.35    0.02      0.55    0.03
flare2 (2+0) s 1599*59      0.37    0.02    0.45    0.01      0.29    0.02
                            0.45    0.05    0.47    0.04      0.29    0.02
flare3 (2+0) l 1599*59      0.33    0.01    0.48    0.01      0.34    0.01
                            0.36    0.03    0.48    0.02      0.35    0.01
hearta1 (32+0) s 460*1185 # 2.58    1.65    8.29    0.13      8.00    0.25
                          # 22.5    29.1    26.3    29.7      23.1    25.4
hearta2 (2+0) l 460*75      3.35    0.31    4.12    0.24      4.30    0.20
                            3.54    0.16    4.21    0.27      4.34    0.23
hearta3 (4+0) s 460*149     2.65    0.26    4.27    0.25      4.88    0.23
                            5.85    8.46    7.28    8.07      7.67    7.92
heartac1 (2+0) l 152*75     2.76    0.53    4.84    0.69      2.90    0.69
                            2.83    0.49    5.04    0.62      3.21    1.11
heartac2 (8+4) l 152*329    1.77    2.11    6.95    1.29      7.26    2.10
                            2.20    1.89    6.35    1.16      5.91    1.48
                            5.66    4.09    7.13    2.42      6.71    3.28
heartac3 (4+4) l 152*169    1.38    1.71    6.50    1.20      6.72    1.34
                            1.32    1.04    6.50    1.13      6.42    1.35
                           17.38   54.81   20.85   56.50     22.07   59.08
building1 (2+2) l 6312*45   0.22    0.21    1.06    0.60      0.82    0.44
                            0.29    0.20    1.69    1.57      1.02    0.46
                            5.72   17.51    7.93   14.93      6.96   16.12
building2 (16+8) s 6312*403
                           31.02    9.37   31.22    9.45     30.61    9.23
building3 (8+8) s 6312*219
                          # 0.20    0.01    0.24    0.01      0.24    0.01
                          # 0.52    0.23    0.53    0.24      0.53    0.24
                           27.58   14.14   27.52   14.12     27.38   14.06

TABLE 4: Results obtained for traintr, trainlm and trainbpx continued from Table 3.


sum-of-squares and smaller 2-norm, using far fewer epochs. The methods have equal sum-of-squares on the heartac problems, but traintr uses fewer epochs. The methods also differ in the 2-norm of their solutions. Similar results are obtained on the building1 problem. To summarize, traintr generally converges in fewer epochs to solutions of smaller 2-norm. The sum-of-squares is often smaller for traintr, and mostly the corresponding standard deviation is much smaller too. Results are missing from the tables where the problems are too large for our computer. For these problems we refer to [9] (this issue), where we propose methods that handle large-scale problems.

Results with optimization convergence criteria.

building1 (2+2) l 6312*45
              Training set    Total epochs    Relevant epochs   2-norm of solution
              mean    sdev    mean    sdev    mean    sdev      mean    sdev
TR(rel)       0.13    0.001     24      10      19       8       6.11    0.73
LM(abs)       0.16    0.07     139      67     107      73       8.56    2.86
BPP(abs)      0.23    0.16    5000       0    3681    1371       5.36    4.69
TR(early)     0.19    0.14      25      12      19      13       6.54    1.66
LM(early)     0.28    0.23      47      50      42      48       8.10    2.52
BPP(early)    0.89    0.39      69      50      58      47       4.63    4.85

heartac1 (2+0) l 152*75
TR(rel)       1.76    0.45      29       7       6       2      47.99   33.96
LM(abs)       2.40    0.48     173      65       7       5      32.79   19.01
TR(early)     2.77    0.57      12       4       6       2      10.49    5.64
LM(early)     2.80    0.39      24      31       7       5      19.37   12.15

flare3 (2+0) l 1599*59
TR(rel)       0.31    0.01      27       6       5       2       8.88    2.65
LM(abs)       0.34    0.03     168      67      21      53      14.17    5.40
TR(early)     0.32    0.01      20      10       5       2       6.35    1.69
LM(early)     0.36    0.02      26      14       8      12      10.12    3.12

TABLE 5: Results obtained by the different toolboxes TR (Tikhonov regularization), LM (Levenberg-Marquardt) and BPP (fast back-prop.) on three problems. rel/abs/early is short for the "optimization termination" criterion, the corresponding "absolute termination" criterion, and the "early stopping" criteria, respectively. The mean and standard deviation are computed from 20 runs using different starting points.


As can be seen in Table 5, a tighter termination criterion such as the optimization termination criterion results, for some applications, in better values for the training, validation and test sets for all the methods. The total epochs using the optimization convergence criterion for traintr and trainlm on the three selected problems are (24, 29, 27) and (139, 173, 168), respectively. It is clear that if more training is required, traintr outperforms the Levenberg-Marquardt method. The results of the backpropagation-based code are mentioned just out of curiosity. Since all methods should stop at the same training set value, the mean value of the training set error should be equal for all methods. However, trainlm and trainbpx do not converge within 200/5000 epochs for some trainings and terminate with a higher training set value. Therefore, the mean values are a bit higher for trainlm and trainbpx.

5 CONCLUSIONS

The traintr and trainlm implementations are intended for problems where the full Jacobian matrix easily fits in the main memory. The limit of the problem size depends on the hardware capacity and speed. In our case the output vector should contain at most a few thousand elements and the number of weights at most a few hundred. For larger problems the main memory is too small. In the Proben1 package, this is the case for the building2 problem, as indicated by the lack of results in Tables 3 and 4. The trainbpx implementation produced an answer, but a useless one. In Part II [9] we describe algorithms for problems that are large, mainly due to many observation points. By using a truncated conjugate-gradient method and new automatic differentiation tools for solving the regularized problem, we can avoid storing the augmented Jacobian J_aug and can compute a search direction much faster than by using a QR-factorization.

REFERENCES

1. T. J. Abazoglou. The minimum norm projection on C^2-manifolds. Trans. Amer. Math. Soc., 243:115-122, 1978.
2. L. Andersson. Best approximations from Hilbert submanifolds. Journal of Approximation Theory, 44(3), 1985.
3. C. M. Bishop. Curvature-driven smoothing: A learning algorithm for feed-forward networks. IEEE Transactions on Neural Networks, 4(5):882-884, 1993.
4. C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7:108-116, 1994.
5. C. M. Bishop. Regularization and complexity control in feed-forward networks. Technical Report NCRG/95/022, Neural Computing Research Group, Dept. of Computer Science and Applied Mathematics, Aston University, Birmingham, UK, 1995.
6. H. Demuth and M. Beale. Matlab Neural Network Toolbox, Ver. 2.0. The MathWorks Inc., 1994.
7. P. Deuflhard and V. Apostolescu. An underrelaxed Gauss-Newton method for equality constrained nonlinear least squares. In Balakrishnan and Thoma, editors, Proceedings IFIP Conf. Opt. Tech.: Part 2 (1977), Lecture Notes in Control and Information Science 7, pages 22-32, Berlin, Heidelberg, 1978. Springer Verlag.


8. L. C. W. Dixon and D. J. Mills. Neural network and nonlinear optimization 1. The representation of continuous functions. Optimization Methods and Software, 1:141-151, 1992.
9. J. Eriksson, M. Gulliksson, P. Lindström, and P.-Å. Wedin. Regularization tools for training feed-forward neural networks. Part II: Large-scale problems. Technical Report UMINF-96.05, Department of Computing Science, Umeå University, Umeå, Sweden, 1996.
10. J. Eriksson and P.-Å. Wedin. Regularization methods for nonlinear least squares. Part I: Exactly rank-deficient problems. Technical Report UMINF-96.03, Department of Computing Science, Umeå University, Umeå, Sweden, 1996.
11. J. Eriksson and P.-Å. Wedin. Regularization methods for nonlinear least squares. Part II: Almost rank-deficient problems. Technical Report UMINF-96.04, Department of Computing Science, Umeå University, Umeå, Sweden, 1996.
12. S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, 1994.
13. P. Lindström. Two user guides, one (ENLSIP) for constrained and one (ELSUNC) for unconstrained nonlinear least squares problems. Technical Reports UMINF-109 and 110.84, Inst. of Information Processing, Umeå University, Umeå, Sweden, 1984.
14. P. Lindström and P.-Å. Wedin. A new linesearch algorithm for unconstrained least squares problems. Math. Prog., 29:268-296, 1984.
15. P. Lindström and P.-Å. Wedin. Methods and software for nonlinear least squares problems. Technical Report UMINF-133.87, Inst. of Information Processing, Umeå University, Umeå, Sweden, 1988.
16. L. Ljung. Tutorial 11: Nonlinear black-box modeling in system identification. Neuronimes, ICANN, 1995. Tutorial given at the 8th international conference on neural networks and their applications, Paris, France.
17. D. G. Luenberger. Linear and Nonlinear Programming (Second edition). Addison Wesley, 1984.
18. J. J. Moré. The Levenberg-Marquardt algorithm: Implementation and theory. In G. A. Watson, editor, Proceedings of the 1977 Dundee Conference on Numerical Analysis, Lecture Notes in Mathematics 630, pages 105-116, Berlin, Heidelberg, New York, Tokyo, 1978. Springer Verlag.
19. J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York, 1970.
20. L. Prechelt. PROBEN 1 - A set of neural network benchmarking problems and benchmarking rules. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, 76128 Karlsruhe, Germany, 1994.
21. S. Saarinen, R. Bramley, and G. Cybenko. Ill-conditioning in neural network training problems. SIAM J. Sci. Stat. Comput., 14:693-714, 1993.
