Comparative Study between Different Versions of the Backpropagation and Optical Backpropagation

Mohammed A. Otair
Department of Management Information Systems
Al-Ahliyya Amman University
Amman, Jordan
Email Address: [email protected]

Walid A. Salameh
Department of Computer Science-RSS
Princess Summaya University for Science and Technology
11941 Al-Jubaiha, Amman, Jordan
Email Address: [email protected]

Abstract
This paper focuses on gradient-based backpropagation algorithms that use either a common adaptive learning rate for all weights or a separate adaptive learning rate for each weight. The learning-rate adaptation is based on descent techniques and on estimates of the local constant that are obtained without additional error function and gradient evaluations. The proposed algorithms improve backpropagation training in terms of both convergence rate and convergence characteristics, such as stable learning and robustness to oscillations. Experiments are conducted to compare and evaluate the convergence behavior of these gradient-based training algorithms on the XOR problem, a popular benchmark training problem.

Key Words
Neural Networks, Backpropagation, Momentum, Delta-Bar-Delta, QuickProp, Optical Backpropagation.

1. Introduction
The goal of supervised training is to update the network weights iteratively so as to globally minimize the difference between the actual outputs of the network and the desired outputs. The backpropagation (BP) algorithm [15] is widely recognized as a powerful tool for training feedforward neural networks. However, since it applies the steepest descent method to update the weights, it suffers from slow convergence. A variety of approaches adapted from numerical analysis have been applied in an attempt to use not only the gradient of the error function but also its second derivative in constructing efficient supervised training algorithms that accelerate the learning process. Research usually focuses on heuristic methods for dynamically adapting the learning rate during training in order to accelerate convergence.

A common approach to avoid slow convergence in the flat directions and oscillations in the steep directions, as well as to exploit the parallelism inherent in the BP algorithm, consists of using a different learning rate for each direction in weight space [2, 4, 13, 16]. However, attempts to find a proper learning rate for each weight usually result in a trade-off between the convergence speed and the stability of the training algorithm. For example, the delta-bar-delta method [4] and the QuickProp method [2] introduce additional, highly problem-dependent heuristic coefficients to alleviate the stability problem.

This paper presents three proposed algorithms that provide stable learning, robustness to oscillations, and an improved convergence rate. The reported results are based on the XOR training set; it is doubtful whether these results carry over to much more complicated practical applications. The paper also gives an overview of several different speed-up techniques, all of which are tested on the XOR training set.

The paper is organized as follows. In section 2 the BP algorithm and two speed-up gradient-based BP methods are presented. In section 3 the proposed algorithms are presented. Experimental results are presented in section 4 to evaluate and compare the performance of these algorithms with several other BP methods. Section 5 presents the conclusions.

2. Backpropagation
Basically, backpropagation [Rumelhart, 1986] is a gradient descent technique that minimizes some error criterion E. In the batched-mode variant, the descent is based on the gradient computed over the total training set:

    ∆W_ij(n) = −η · (∂E/∂W_ij) + α · ∆W_ij(n−1)    (1)

where η and α are two non-negative constant parameters called the learning rate and the momentum. The momentum can speed up training in very flat regions of the error surface and suppresses weight oscillations in steep valleys. A good choice of learning rate and momentum is essential for training success and speed. Adjusting these parameters by hand can be very difficult and may take a very long time for more complicated tasks. One way to optimize the backpropagation algorithm is to find proper values for the learning rate automatically. The following techniques try to adjust these parameters during training. Most of them also adjust the momentum parameter, so that both the step size and the search direction are altered. In addition, local adaptations use independent learning rates for every adjustable parameter (every weight); they are therefore able to find an appropriate learning rate for each weight.
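To make the update rule of equation (1) concrete, the following minimal Python sketch applies one batch gradient-descent step with momentum to a weight matrix. The gradient values, layer sizes and parameter settings are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def momentum_step(W, grad_E, prev_dW, eta=0.5, alpha=0.8):
    """One batch update per equation (1):
    dW(n) = -eta * dE/dW + alpha * dW(n-1)."""
    dW = -eta * grad_E + alpha * prev_dW
    return W + dW, dW

# Illustrative usage with a placeholder gradient for a 2x3 weight matrix.
W = np.random.uniform(-1, 1, size=(2, 3))
prev_dW = np.zeros_like(W)
grad_E = 0.1 * np.ones_like(W)        # stands in for the true gradient of E
W, prev_dW = momentum_step(W, grad_E, prev_dW)
```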

2.1 Delta-Bar-Delta
Robert A. Jacobs uses a local learning rate adaptation [4]. In contrast to the former approaches, his delta-bar-delta algorithm controls the learning rates by observing the sign changes of an exponentially averaged gradient, and it increases the learning rates by adding a constant value instead of multiplying by a factor:

1. Choose some small initial value for every learning rate.
2. Adapt the learning rates:

    η_ij(n+1) = η_ij(n) + k        , if ∆bar_ij(n−1) · ∆_ij(n) > 0    (2.a)
    η_ij(n+1) = (1 − γ) · η_ij(n)  , if ∆bar_ij(n−1) · ∆_ij(n) < 0    (2.b)
    η_ij(n+1) = η_ij(n)            , otherwise                        (2.c)

The delta value for each weight is calculated as:

    ∆_ij = −δ_j · x_i    (3)

where δ_j, the error at a single output unit, is defined as:

    δ_j = (Y − O) · f′(Σ w_ij x_i)    (4)

where Y is the desired output, O is the actual output, W_ij are the weights from the hidden units to the output units, and x_i are the inputs to each output unit. The bar-delta value for each weight is calculated as:

    ∆bar_ij(n) = (1 − β) · ∆_ij(n) + β · ∆bar_ij(n−1)    (5)

where ∆_ij(n) is the derivative of the error surface and β is the smoothing constant.

3. Update the weights:

    ∆W_ij(n) = −η_ij(n) · δ_j    (6)
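A minimal sketch of the learning-rate adaptation of equations (2)-(6), assuming per-weight arrays for the rates and the smoothed gradients; the variable names and the parameter values (k, gamma, beta) are illustrative, not prescribed by the paper.

```python
import numpy as np

def delta_bar_delta_step(W, eta, grad, bar_delta, k=0.001, gamma=0.001, beta=0.6):
    """One delta-bar-delta update. `grad` holds the current error-surface
    derivatives Delta_ij(n); `bar_delta` holds the exponential averages
    bar(Delta)_ij(n-1) from the previous step (equation 5)."""
    sign = bar_delta * grad
    eta = np.where(sign > 0, eta + k,                 # same sign: additive increase (2.a)
          np.where(sign < 0, (1 - gamma) * eta,       # sign change: multiplicative decrease (2.b)
                   eta))                              # otherwise: unchanged (2.c)
    bar_delta = (1 - beta) * grad + beta * bar_delta  # equation (5)
    W = W - eta * grad                                # per-weight step (cf. equation 6)
    return W, eta, bar_delta
```

In use, `eta` would be initialized to a small constant array and `bar_delta` to zeros, then the function is called once per epoch with the freshly computed gradients.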

2.2 QuickProp
One method to speed up the learning process is to use information about the curvature of the error surface, which requires the computation of the second-order derivatives of the error function. QuickProp [2] assumes the error surface to be locally quadratic and attempts to jump in one step from the current position directly into the minimum of the parabola. QuickProp computes the derivatives in the direction of each weight. After computing the first gradient with regular backpropagation, a direct step to the error minimum is attempted by:

    ∆W_ij(n+1) = [ (∂E/∂w_ij)(n) / ( (∂E/∂w_ij)(n−1) − (∂E/∂w_ij)(n) ) ] · ∆W_ij(n)    (7)

where:
    W_ij              is the weight between units i and j,
    ∆W_ij(n+1)        is the actual weight change,
    (∂E/∂w_ij)(n)     is the current partial derivative of the error function with respect to w_ij,
    (∂E/∂w_ij)(n−1)   is the previous partial derivative.
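The following sketch illustrates the quadratic jump of equation (7) for a single weight. The guard against a vanishing denominator and the fallback gradient step are illustrative additions of this sketch, not details given in the paper.

```python
def quickprop_step(w, dw_prev, grad, grad_prev, eta=0.5, eps=1e-12):
    """Weight update per equation (7):
    dW(n+1) = grad(n) / (grad(n-1) - grad(n)) * dW(n)."""
    denom = grad_prev - grad
    if abs(denom) > eps and dw_prev != 0.0:
        dw = grad / denom * dw_prev   # jump toward the minimum of the fitted parabola
    else:
        dw = -eta * grad              # fall back to a plain gradient step
    return w + dw, dw
```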

3. Proposed Algorithms
3.1 Optical Backpropagation (OBP)
In this section, the adjustment of the new OBP [7] is described, which improves the performance of the BP algorithm. The convergence speed of the learning process can be improved significantly by OBP through adjusting the error that is transmitted backward from the output layer to each unit in the intermediate layer. In BP, the error at a single output unit is defined as:

    δ_j = (Y_pk − O_pk) · f′(Σ w_ij x_i)    (8)

where the subscript "p" refers to the pth training vector and "k" refers to the kth output unit. In this case, Y_pk is the desired output value and O_pk is the actual output of the kth unit; δ_j then propagates backward to update the output-layer weights and the hidden-layer weights.

Optical Backpropagation (OBP) applies a non-linear function to the error of each output unit before the backpropagation phase, using the formulas:

    Newδ_j = (1 + e^((Y−O)^2)) · f′(Σ w_ij x_i)     , if (Y − O) ≥ 0    (9)
    Newδ_j = −(1 + e^((Y−O)^2)) · f′(Σ w_ij x_i)    , if (Y − O) < 0    (10)

OBP uses two forms of Newδ_j because the exponential term is always positive (and the update for some output units needs to decrease the actual output rather than increase it). Newδ_j then propagates backward to update the output-layer weights; that is, the deltas of the output layer change, but all other techniques and equations of BP remain unchanged. This Newδ_j drives down the error of each output unit more quickly, and the weights on certain units may change considerably from their starting values.

The steps of OBP are:
1. Apply the input example to the input units.
2. Calculate the net-input values to the hidden layer units.
3. Calculate the outputs from the hidden layer.
4. Calculate the net-input values to the output layer units.
5. Calculate the outputs from the output units.
6. Calculate the error term δ_j for the output units, as in standard BP.
7. Calculate the modified error term Newδ_j for the output units as well.
8. Update the weights on the output layer using the following equation:

    ∆W_ij(n) = −η_ij(n) · Newδ_j    (11)

9. Update the weights on the hidden layer.
10. Repeat steps 1 to 9 until the error (Y_pk − O_pk), or the mean square error (MSE), is acceptably small for each training vector pair.
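A minimal sketch of the modified output-layer error of equations (9) and (10), assuming a sigmoid output unit; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def obp_delta(y_desired, net_input):
    """Newdelta_j of equations (9)-(10): the usual (Y - O) factor is replaced
    by a sign-preserving exponential of the squared error."""
    o = sigmoid(net_input)                 # actual output O
    fprime = o * (1.0 - o)                 # derivative of the sigmoid at the net input
    err = y_desired - o
    magnitude = (1.0 + np.exp(err ** 2)) * fprime
    return np.where(err >= 0, magnitude, -magnitude)
```

Because the exponential of the squared error is always at least 1, the modified delta has a larger magnitude than the standard delta, which is what accelerates the backward pass.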

3.2 Optical Backpropagation with Momentum Factor (OBP-M)
The proposed OBP-M algorithm applies the OBP algorithm with a momentum factor to speed up the training process, in the same way that standard backpropagation with momentum (BPM) extends BP. OBP-M uses the same techniques and equations as OBP, except that equation (11) in step 8 is changed as follows:

    ∆W_ij(n) = −η_ij(n) · Newδ_j + α · ∆W_ij(n−1)    (12)

The OBP-M algorithm thus adds a momentum term to the weight change that is proportional to the previous weight change. Because the OBP algorithm [8, 9, 10] already improves the performance of the BP algorithm, the convergence speed of the training process can be improved significantly further when a momentum factor is used.
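Equation (12) only adds the familiar momentum term to the OBP update; the following sketch shows that one change, with `delta_term` assumed to carry whatever per-weight factors the chosen BP variant requires.

```python
def obpm_update(W, eta, delta_term, prev_dW, alpha=0.8):
    """Output-layer update per equation (12):
    dW(n) = -eta * Newdelta + alpha * dW(n-1)."""
    dW = -eta * delta_term + alpha * prev_dW
    return W + dW, dW
```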

3.3 Enhanced Version of Delta-Bar-Delta (EVDBD)
The Enhanced Version of Delta-Bar-Delta (EVDBD) algorithm is an extension of the Delta-Bar-Delta algorithm and a natural outgrowth of Jacobs' work [4]. EVDBD is the same as the Delta-Bar-Delta algorithm introduced by Jacobs, as outlined above, except that the proposed algorithm uses an Optical Backpropagation (OBP) network rather than a BP network.

4. Comparative Study

4.1 The XOR problem
The most popular form of this problem in the literature on back-propagation and related learning architectures is the "2-2-1 network", which has a single hidden layer of two units, each connected to both inputs and to the output. There are 9 trainable weights in all, including the 3 bias connections for the hidden and output units; a sketch of this topology is given below.
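As a concrete illustration of the 2-2-1 topology, the sketch below allocates the weight and bias arrays and confirms the count of 9 trainable parameters; the initialization range mirrors the uniform (−1, 1) scheme described in section 4.2, while the array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

# 2-2-1 network: 2 inputs -> 2 hidden units -> 1 output unit.
W_hidden = rng.uniform(-1, 1, size=(2, 2))  # 4 input-to-hidden weights
b_hidden = rng.uniform(-1, 1, size=2)       # 2 hidden biases
W_output = rng.uniform(-1, 1, size=(2,))    # 2 hidden-to-output weights
b_output = rng.uniform(-1, 1, size=1)       # 1 output bias

n_params = W_hidden.size + b_hidden.size + W_output.size + b_output.size
assert n_params == 9                        # matches the 9 trainable weights in the text
```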

4.2 Experimental results
In all experiments, the network was trained to solve the XOR problem using back-propagation (BP) and backpropagation with momentum (BPM) [15]. In addition, the performance results of the proposed algorithms are compared with the following learning algorithms: QuickProp (QP) [2], Delta-Bar-Delta (DBD) [4], Optical Backpropagation (OBP), Optical Backpropagation with Momentum (OBPM), and the Enhanced Version of Delta-Bar-Delta (EVDBD). In order to obtain a fair evaluation of the performance of the algorithms, the experiments were conducted using the same initial weight vectors, randomly chosen from a uniform distribution in (−1, 1). The network was trained with standard BP, BP with a momentum factor (BPM), OBP, DBD and EVDBD, respectively. Each training process was continued until the mean square error (MSE) reached a value less than or equal to 0.001. Different learning rates were used, ranging from 0.1 to 1. In BPM and OBPM the value 0.8 was used for the momentum factor, while DBD and EVDBD used the parameters β = 0.6, γ = 0.001, k = 0.001. In the DBD and EVDBD algorithms there is no constant learning rate; instead, the learning rate is initialized for each training process with the value shown in the first column of Table 1 and is then adapted over the training epochs. Table 1 and Figure 1 show the results of each training process in terms of the number of epochs for every training algorithm.

Table 1: Experimental results for the XOR problem

η     BP      BPM 0.8   QP     DBD   OBP    EVDBD   OBPM 0.8
0.1   21304   8894      8022   981   1640   457     379
0.2   10339   4450      2995   956   911    450     191
0.3   6772    2969      2008   930   713    456     129
0.4   5018    2229      1492   905   656    472     97
0.5   3980    1784      1210   881   659    484     79
0.6   3295    1488      985    857   671    479     66
0.7   2810    1277      870    834   651    454     58
0.8   2448    1118      868    812   593    416     51
0.9   2169    995       1025   791   521    373     46
1     1947    896       990    770   443    326     43

5. Conclusion
This paper introduced three algorithms, OBP, OBPM and EVDBD, which have been proposed for the training of multilayer neural networks. The proposed algorithms provide stable learning, robustness to oscillations, and an improved convergence rate.

The algorithms have been tested by training on the XOR problem. These results cannot be directly transferred to more complicated training sets; it is precisely for such training sets that speed-up optimization is necessary, whereas speeding up XOR learning is of little practical importance. Nevertheless, most of the algorithms are superior to standard backpropagation running in batched mode. On the other hand, backpropagation that updates the weights after every pattern presentation outperforms all global adaptive learning algorithms. The OBPM and EVDBD algorithms appear to be superior to all other training algorithms using fixed topologies. As seen in Table 1, the XOR problem has been solved using different values of the learning rate with BP, BPM, OBP, DBD, and EVDBD. Figure 1 shows the effectiveness of each algorithm.

Figure 1: Solving the XOR problem using different algorithms (number of epochs versus learning rate, from 0.1 to 1, for BP, BPM 0.8, QP, DBD, OBP, EVDBD and OBPM 0.8).

Comparing the results shown in Figure 1, OBPM and EVDBD required the fewest epochs of all the algorithms.

The use of a momentum term with OBP accelerates BP training. The algorithm was superior to the other methods in terms of requiring fewer epochs to converge, as well as using fewer parameters (i.e., it does not rely on many problem-dependent parameters, unlike several of the algorithms mentioned above).

Future Work
Future work will study the effect of using the OBP algorithm with the RPROP algorithm.

References
[1] Carling, A., Back Propagation. Introducing Neural Networks, 1992, 133-154.
[2] Fahlman, S. E., Faster-learning variations on back-propagation: An empirical study, 1989.
[3] Freeman, J. A., Skapura, D. M., Backpropagation. Neural Networks: Algorithms, Applications and Programming Techniques, 1992, 89-125.
[4] Jacobs, R. A., Increased rates of convergence through learning rate adaptation, Neural Networks, 1, 1988, 169-180.
[5] Hagan, M. T., Demuth, H. B., Neural Network Design, 1996, 11.1-12.52.
[6] Caudill, M., Butler, C., Understanding Neural Networks: Computer Explorations, Volume 1, 1993, 155-218.
[7] Otair, M. A., Salameh, W. A., An Improved Back-Propagation Neural Networks using a Modified Non-linear Function, Proceedings of the IASTED International Conference, 2004, 442-447.
[8] Otair, M. A., Salameh, W. A., Speeding Up Back-propagation Neural Networks, under preparation, 2004.
[9] Otair, M. A., Salameh, W. A., Online Handwritten Character Recognition Using An Optical Backpropagation Neural Network, under preparation, 2004.
[10] Otair, M. A., Salameh, W. A., Solving The Exclusive-OR Problem Using An Optical Backpropagation Algorithm, under preparation, 2004.
[11] Hagiwara, M., Theoretical derivation of momentum term in back-propagation, Int. Joint Conf. on Neural Networks, 1992, 682-686.
[12] Minai, A. A., Williams, R. D., Acceleration of back-propagation through learning rate and momentum adaptation, Proceedings of the International Joint Conference on Neural Networks, 1990, 1676-1679.
[13] Riedmiller, M., Braun, H., A direct adaptive method for faster backpropagation learning: The RPROP algorithm, Proceedings of the IEEE International Conference on Neural Networks, 1993, 586-591.
[14] Callan, R., The Essence of Neural Networks, Southampton Institute, 1999, 33-52.
[15] Rumelhart, D. E., Hinton, G. E., Williams, R. J., Learning internal representations by error propagation, in D. E. Rumelhart and J. L. McClelland (eds), Parallel Distributed Processing, 1986, 318-362.
[16] Silva, F., Almeida, L., Acceleration techniques for the backpropagation algorithm, Lecture Notes in Computer Science, 412, 1990, 110-119.
[17] Haykin, S., Neural Networks: A Comprehensive Foundation, 2nd Edition, 1999, 161-184.
