An Adaptive Training Method of Back-Propagation Algorithm

Jang-Hee Yoo and Jae-Woo Kim
Artificial Intelligence Division, Systems Engineering Research Institute, P.O. Box 1, Yoosung, Taejeon, 305-600, Korea. email:
[email protected]
Jong-Uk Choi
Dept. of Information Engineering, Sangmyung Women's University, 7 Hongji-Dong, Jongro-Gu, Seoul, 110-743, Korea
Abstract
The back-propagation algorithm is currently the most widely applied neural network algorithm. However, its slow learning speed and its local minima problem are often cited as its major weaknesses. This paper describes an adaptive training algorithm that combines selective retraining of patterns, based on error analysis, with dynamic adaptation of the learning rate and momentum, based on oscillation detection, to improve the performance of the back-propagation algorithm. The usefulness of the proposed algorithms is demonstrated in experiments with the XOR and Encoder problems.
1 Introduction
The back-propagation algorithm has been known to be useful in training multi-layered neural networks, and thus has been effectively applied to various fields such as pattern recognition, signal and image processing, forecasting, and robot control. However, since the number of weight updates per iteration is on the order of n^2 when each layer has n nodes, the required computation time becomes enormous as the number of nodes increases. In addition, it is quite possible for the algorithm to converge to a local minimum. To overcome these problems, much research on improving convergence speed has been done [2, 10, 12, 14]. The research efforts can be classified into the following three categories. First, heuristic knowledge obtained through repeated experiments can be embedded into the algorithm. For example, many iterations are required when the learning rate is small and the error curve declines very slowly, because the differential values become very small. In contrast, when the learning rate is large, an overshooting problem arises, caused by the large curvature in the differential values. Therefore, dynamic adaptation of the learning rate, or reuse of the gradient depending on the size of its variation, is frequently employed [5, 6]. Second, as the learning algorithms of neural networks are nothing but methods for solving non-linear
optimization problems [13], attempts have been made to improve algorithm performance based on numerical methods. Since the differential values of higher-order equations include more information about the search space, well-formed mathematical theories such as the Newton method, the quasi-Newton method, and differential values combined with the gradient are frequently employed when calculating the weights of the next step [1, 4, 9]. Third, in contrast to modifying the learning algorithm itself, external factors such as the training set and the training order are modified: in the review technique, the patterns identified as difficult to train by tracing the learning degree of each category are given more chances of training than the others [7], while in the preparation technique the number of training samples gradually increases to prevent overfitting [8]. In this paper an adaptive back-propagation algorithm is proposed, based on selective retraining of patterns through analysis of the error curve and on dynamic adaptation of the learning rate and momentum through detection of oscillations. The usefulness of the proposed algorithm was tested on the XOR and Encoder problems.
2 Error Analysis and Selective Retraining
The back-propagation algorithm is a gradient descent algorithm in which the MSE (mean square error) is employed [3, 12, 15] to minimize the error in weight-error space. An error measure, the RMS (root-mean-square) error [12, 15], is derived by normalizing the MSE as in equation (1):

    $E_{rms} = \sqrt{\frac{1}{PK} \sum_{p=1}^{P} \sum_{k=1}^{K} (d_{pk} - o_{pk})^2}$    (1)
In the equation, $P$ is the number of training patterns and $K$ denotes the number of nodes in the output layer. The value $d_{pk}$ is the desired value of the $k$-th output node for the $p$-th input pattern, and $o_{pk}$ is the actual output. The value of $E_{rms}$ is more descriptive than the MSE when comparing the training results of algorithms and thus is more effective in measuring the accuracy of mapping and association [6]. $E_{rms}$ can be used as the error measure in a back-propagation algorithm which continues the training process until the value of $E_{rms}$ becomes less than a predetermined tolerance value. An algorithm which uses $E_{rms}$ with a fixed predetermined tolerance has two serious problems. First, some input patterns, even though they are not responsible for the error, must undergo further training because of the error caused by other patterns, especially when the number of input patterns is large. Second, as the value of $E_{rms}$ is used as the error measurement, the degree of learning attained for each pattern is not accurately reflected. One solution to these problems is to calculate the average RMS over all training patterns and the individual RMS for each pattern, and then to train only those patterns whose RMS is larger than the average RMS. In many cases, the weights in the back-propagation algorithm incorrectly fit the actual output of specific patterns. This incorrect fitting can be detected by identifying the output node $k$ which has the maximum error for pattern $p$, defined
as follows:

    $E_{pk}^{max} = \max_{k=1}^{K} |d_{pk} - o_{pk}|$    (2)

In conclusion, retraining that reflects the characteristics of each pattern can be done by detecting incorrect fittings and by utilizing the error measurements $E_{pk}^{max}$ (max_output) and $E_{rms}$ (average_rms). Figure-1 describes the selective retraining algorithm proposed in this research.

    current_tss = max_output = 0.0;
    Loop over the patterns of an epoch
        compute_actual_output();
        compute_error();
        current_tss = current_tss + current_error;
        If number_of_output > 1 Then
            current_rms = current_error / number_of_output;
        Else
            current_rms = 0.0;
        If (max_output - current_rms) > average_rms Then
            adjust_weights();
    End of loop
    average_rms = current_tss / (no_of_pattern * no_of_output);
Figure 1: Selective Retraining

Application of the algorithm may not only reduce training time but also increase the recognition rate through selective retraining.
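As an illustration, the selective-retraining test of Figure-1 could be coded in C roughly as follows. This is a minimal sketch, not the authors' implementation: the array errors[p][k], assumed to hold |d_pk - o_pk| for each pattern and output node after the forward pass, and the callback adjust_weights() are hypothetical placeholders for the network's forward computation and the usual back-propagation weight update.

    /* Sketch of one selective-retraining epoch (after Figure-1).
     * errors[p][k] is assumed to hold |d_pk - o_pk| for pattern p and
     * output node k; adjust_weights() stands in for the back-propagation
     * weight update of that pattern. Returns the new average RMS used as
     * the retraining threshold in the next epoch. */
    double selective_retraining_epoch(double **errors,
                                      int num_patterns, int num_outputs,
                                      double average_rms,
                                      void (*adjust_weights)(int pattern))
    {
        double current_tss = 0.0;               /* total sum of squared errors */

        for (int p = 0; p < num_patterns; p++) {
            double current_error = 0.0;         /* squared error of pattern p  */
            double max_output = 0.0;            /* E_pk^max for pattern p      */

            for (int k = 0; k < num_outputs; k++) {
                double e = errors[p][k];
                current_error += e * e;
                if (e > max_output)
                    max_output = e;
            }
            current_tss += current_error;

            double current_rms =
                (num_outputs > 1) ? current_error / num_outputs : 0.0;

            /* Retrain only the patterns whose worst output-node error stands
             * out against the average RMS of the previous epoch. */
            if ((max_output - current_rms) > average_rms)
                adjust_weights(p);
        }

        return current_tss / ((double)num_patterns * (double)num_outputs);
    }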
3 Dynamic Adaptation of Learning Rate and Momentum
In the back-propagation algorithm, the weights are recursively adjusted with a set of pairs (input values and corresponding output values) until the difference between the desired output and the actual output is less than a predetermined tolerance value. Weight adjustment is based on the generalized equation (3) [3, 11, 12]:

    $\Delta W_{ji}(t) = \eta\,(\delta_j o_i) + \alpha\,\Delta W_{ji}(t-1)$    (3)
In the equation, $t$ is the time step and $\eta$ denotes the learning rate. As the learning rate becomes larger, the change in the weights becomes larger, so training with a larger learning rate may finish earlier. However, in that case convergence is not guaranteed, because oscillation can arise. It is desirable that the learning rate be maximized for speedy convergence, but within a range that prevents oscillations. The variable $\alpha$ is a momentum term introduced to provide speedy training while preventing oscillations, indicating the size of the weight adjustment based on the previous changes of the weights.
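As a concrete reading of equation (3), a single weight adjustment with momentum might be written in C as below; the arrays w and dw_prev and the arguments delta_j and o_i are illustrative names introduced here, not part of the original paper.

    /* One weight adjustment following equation-(3):
     *   Delta W_ji(t) = eta * delta_j * o_i + alpha * Delta W_ji(t-1).
     * w[j][i] is the weight from node i to node j; dw_prev[j][i] stores the
     * previous adjustment, which supplies the momentum term. */
    void update_weight(double **w, double **dw_prev, int j, int i,
                       double delta_j,   /* error signal of node j */
                       double o_i,       /* output of node i       */
                       double eta,       /* learning rate          */
                       double alpha)     /* momentum               */
    {
        double dw = eta * delta_j * o_i + alpha * dw_prev[j][i];
        w[j][i]       += dw;
        dw_prev[j][i]  = dw;
    }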
Current research on improving training speed mainly focuses on modifications of the terms included in equation (3) [10, 12, 15]. In the back-propagation algorithm of equation (3), oscillation can be detected by analyzing the error curve. Oscillation is indicated by irregular fluctuations, alternating decreases and increases of the error measure $E_{rms}$, and it should be detected within a predetermined interval (a number of epochs) so that the learning rate and momentum can be adapted dynamically. In the analysis of the error curve, the error is likely to be converging to a global minimum, or escaping from a local minimum, when it monotonously decreases or increases over the predetermined interval. Therefore, whenever the frequency of error decreases falls below the minimum value or jumps above the maximum value predetermined in the error analysis, the learning rate and momentum are modified; otherwise, the initial learning rate and initial momentum are assigned. Figure-2 describes the algorithm for dynamic adaptation of the learning rate and momentum.

    delta_error = average_rms(t-1) - average_rms(t);
    If delta_error < 0.0 Then
        oscillation = oscillation + 1;
    If reference_interval = TRUE Then
        If (oscillation > max_freq) || (oscillation < min_freq) Then
            oscillation = 0;
        learning_rate = initial_learning_rate * (interval_size - oscillation) / interval_size;
        momentum = initial_momentum * ((1.0 - initial_learning_rate) + learning_rate);
        oscillation = 0;
    End If
Figure 2: Dynamic Adaptation of Learning Rate and Momentum

To apply the algorithm of Figure-2, the initial learning rate and initial momentum must be determined, in addition to the interval for error analysis (reference_interval) and the error-decrease frequencies (min_freq, max_freq) used for detecting oscillations. The proposed algorithm may be effective in training relatively complex patterns by detecting oscillations and quickly adapting to them.
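One possible C rendering of Figure-2 is sketched below. It follows the statement order of the figure; the struct adapt_state_t and the epochs_seen counter used to decide when the reference interval has elapsed are assumptions added here, since the figure only tests "reference_interval = TRUE".

    /* State carried across epochs for oscillation detection (sketch). */
    typedef struct {
        double prev_average_rms;   /* average RMS of the previous epoch         */
        int    oscillation;        /* number of error increases in the interval */
        int    epochs_seen;        /* epochs elapsed in the reference interval  */
    } adapt_state_t;

    /* Called once per epoch with the new average RMS; mirrors the statement
     * order of Figure-2. When the reference interval has elapsed, the learning
     * rate is scaled by the fraction of the interval free of error increases,
     * and the momentum is rescaled accordingly. */
    void adapt_learning_parameters(adapt_state_t *s, double average_rms,
                                   int interval_size, int min_freq, int max_freq,
                                   double initial_learning_rate,
                                   double initial_momentum,
                                   double *learning_rate, double *momentum)
    {
        double delta_error = s->prev_average_rms - average_rms;
        if (delta_error < 0.0)                 /* error increased this epoch */
            s->oscillation++;
        s->prev_average_rms = average_rms;

        if (++s->epochs_seen >= interval_size) {        /* reference interval */
            if (s->oscillation > max_freq || s->oscillation < min_freq)
                s->oscillation = 0;            /* out of range: fall back to
                                                  the initial values below   */
            *learning_rate = initial_learning_rate *
                             (double)(interval_size - s->oscillation) /
                             (double)interval_size;
            *momentum = initial_momentum *
                        ((1.0 - initial_learning_rate) + *learning_rate);
            s->oscillation = 0;
            s->epochs_seen = 0;
        }
    }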
4 Experiments and Analysis of Results
In this research, the proposed algorithm was tested on the XOR problem and the 8-3-8 Encoder problem. The algorithms of Figure-1 and Figure-2 were programmed in the C language running on a SUN SPARCstation. The test patterns were uniformly distributed and were obtained with a random number generator. The initial learning rate was set to 0.5 and the initial momentum to 0.9. The interval for oscillation detection was set to the number of training patterns. Table-1 compares the results of the standard algorithm and the proposed algorithm. In the test, after training with 100 patterns, we measured the average RMS values when training reaches 500, 1000, and 2000 epochs, the average recognition rate after 2000 epochs, and the recognition rate on 1000 new patterns (generalization test).

Table 1: Experimental Result of Algorithm Performance

    Algorithm     Task     Topology   Average RMS                Average   Gen.
                                      500      1000     2000     Correct   Test
    Standard BP   XOR      2x3x1      0.0174   0.0145   0.0133   99%       92.0%
                  Encoder  8x3x8      0.0133   0.0100   0.0093   98%       72.4%
    Proposed BP   XOR      2x3x1      0.0083   0.0040   0.0024   100%      95.3%
                  Encoder  8x3x8      0.0150   0.0138   0.0120   97%       74.3%
Performance results such as the number of iterations and convergence speed are sensitive to the initial weights. Therefore, the same set of initial weights was employed when comparing the two algorithms, and the performance tests were repeated multiple times to prevent statistical bias. As shown in Table-1, the proposed back-propagation algorithm demonstrated better performance than the standard algorithm in solving the XOR problem, which has only a single output node. In solving the Encoder problem, convergence was a little slower, but the degree of generalization increased. Modification of the weights through selective retraining reduced the computational complexity and eventually decreased training time. An interesting result was that, for some initial weight settings, the standard algorithm performed much better where the proposed algorithm performed poorly, and vice versa. This result suggests that the initial weights may be a critical factor in determining convergence speed.
5 Conclusions
In this paper, a back-propagation algorithm is proposed which combines dynamic adaptation of the learning rate and momentum with a selective retraining method. The proposed algorithm may contribute to improving convergence speed and enhancing the degree of generalization, by decreasing computational complexity through selectively training patterns and by effectively preventing oscillation through dynamic adaptation of the learning rate and momentum. Future research should address determining the interval and the frequency thresholds used in the error analysis for detecting oscillations, in addition to determining the initial learning rate and initial momentum. Also, more research should be done on the assignment of appropriate initial weights and on improving the degree of generalization, which is one of the most important goals of neural network learning.
References
[1] Becker, S., and Le Cun, Y., "Improving the Convergence of Back-Propagation Learning with Second Order Methods," in Proceedings of the 1988 Connectionist Models Summer School, Carnegie Mellon University, pp. 29-37, 1989.
[2] Fahlman, Scott E., "Faster-Learning Variations on Back-Propagation: An Empirical Study," in Proceedings of the 1988 Connectionist Models Summer School, Carnegie Mellon University, pp. 38-51, 1988.
[3] Hecht-Nielsen, R., "Theory of the Backpropagation Neural Network," in Proceedings of the 1989 International Conference on Neural Networks, Washington D.C., Vol. I, pp. 593-601, June 1989.
[4] Himmelblau, D. M., "Introducing Efficient Second Order Effects into Back Propagation Learning," in Proceedings of the International Joint Conference on Neural Networks, Washington D.C., Vol. I, pp. 631-634, Jan. 1990.
[5] Hush, D. R., and Salas, J. M., "Improving the Rate of Back-Propagation with the Gradient Algorithm," in Proceedings of the IEEE International Conference on Neural Networks, San Diego, Vol. I, pp. 441-447, July 1988.
[6] Jacobs, R. A., "Increased Rates of Convergence Through Learning Rate Adaptation," Journal of Neural Networks, World Scientific, Vol. 1, pp. 295-307, 1988.
[7] Mori, Y., and Yokosawa, K., "Neural Networks that Learn to Discriminate Similar Kanji Characters," in Advances in Neural Information Processing Systems 1, pp. 332-347, 1989.
[8] Ohnishi, N., Okamoto, A., and Sugie, N., "Selective Presentation of Learning Samples for Efficient Learning in Multi-Layer Perceptron," in Proceedings of the International Joint Conference on Neural Networks, Washington D.C., Vol. I, pp. 688-691, Jan. 1990.
[9] Parker, D. B., "Optimal Algorithms for Adaptive Networks: Second Order Back-Propagation, Second Order Direct Propagation, and Second Order Hebbian Learning," in Proceedings of the IEEE International Conference on Neural Networks, San Diego, Vol. II, pp. 593-600, June 1987.
[10] Pfister, M., and Rojas, R., "Speeding-up Backpropagation - A Comparison of Orthogonal Techniques," in Proceedings of the 1993 International Joint Conference on Neural Networks, Nagoya, Vol. I, pp. 517-523, Oct. 1993.
[11] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., "Learning Internal Representations by Error Propagation," in Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (Eds.), Parallel Distributed Processing, Vol. 1, MIT Press, Cambridge, Mass., 1986, pp. 318-362.
[12] Smith, M., Neural Networks for Statistical Modeling, Van Nostrand Reinhold, 1993.
[13] Watrous, R. L., "Learning Algorithms for Connectionist Networks: Applied Gradient Methods of Nonlinear Optimization," in Proceedings of the IEEE International Conference on Neural Networks, San Diego, Vol. II, pp. 619-627, June 1987.
[14] Werbos, Paul J., "Backpropagation Through Time: What It Does and How to Do It," Proceedings of the IEEE, Vol. 78, No. 10, pp. 1550-1560, Oct. 1990.
[15] Zurada, Jacek M., Introduction to Artificial Neural Systems, West Publishing Company, 1992.