Discovery of Neural Network Learning Rules Using Genetic Programming

Amr Mohamed Radi and Riccardo Poli
School of Computer Science
The University of Birmingham
Birmingham B15 2TT, UK
E-mail: {A.M.Radi, [email protected]}

Technical Report CSRP-97-21
September 3, 1997

Abstract
The development of the backpropagation learning rule has been a landmark in neural networks. It provides a computational method for training multilayer networks. Unfortunately, backpropagation suffers from several problems. In this paper, a new technique based upon Genetic Programming (GP) is proposed to overcome some of these problems. We have used GP to discover new supervised learning algorithms. A set of such learning algorithms has been compared with the Standard BackPropagation (SBP) learning algorithm on different problems and has been shown to provide better performance. This study indicates that there exist many supervised learning algorithms better than, but similar to, SBP and that GP can be used to discover them.
1 Introduction

Supervised learning algorithms are by far the most frequently used methods to train artificial neural networks. The Standard BackPropagation (SBP) algorithm represents a computationally effective method for the training of multilayer networks which has been applied to a number of learning tasks in science, engineering, finance and other disciplines. The SBP learning algorithm has indeed emerged as the standard algorithm for the training of multilayer networks, against which other learning algorithms are often benchmarked [11, 33].

The standard backpropagation algorithm suffers from a number of problems. The main problems are: it is extremely slow, training performance is sensitive to the initial conditions, and it may be trapped in local minima before converging to a solution. For these reasons, in the past few years a number of improvements to SBP have been proposed in the literature. They can be divided into five main groups [30]:

1. Methods based on dynamic learning rate adjustments. There are two kinds of adjustments: direct and indirect. Examples of methods for direct learning rate adjustment are Rprop, SAB, SuperSAB, Quickprop, etc. [9, 29, 31, 15, 34]. Examples of indirect adjustment methods are the momentum method [5] and the conjugate gradient method [2].

2. Methods based on the rescaling of variables [14].

3. Methods based on using the expected output of neurons instead of the current output [10].

4. Methods based on the minimisation of different error functions [26, 1, 13, 22].

5. Methods for determining the learning rate on the basis of the size of the training set [8].

We will review the SBP rule and some of the recent improvements in Section 2. All these algorithms are generally faster than the SBP rule (up to one order of magnitude) but tend to suffer from some of the problems of SBP. Efforts continue in the direction of solving these problems to produce faster supervised learning algorithms and, more importantly, to improve their reliability. However, progress is extremely slow because any new rule has to be imagined/designed first (using engineering and/or mathematical principles) by a human expert and then has to be tested extensively to verify its functionality and efficiency. Also, most newly proposed algorithms are not very different from, or much better than, the previously known types, as scientists tend to search the space of possible learning algorithms for neural nets by using a kind of "gradient descent". This way of searching may take a long time to
lead to significant breakthroughs in the field. Indeed, by looking critically at the huge literature on this topic, it can be inferred that only a few really novel algorithms which demonstrated much better performance than SBP have been produced in the last 10 years [18, 7, 32, 25, 23, 3].

This has led some researchers to use optimisation algorithms to explore, at least locally, the space of the possible learning rules. Given the limited knowledge of such a space, the tools of choice have been evolutionary algorithms [6] which, although not optimal for some domains, offer the broadest possible applicability. Very often the strategy adopted has been to use Genetic Algorithms (GAs) [12] to find the optimal parameters for prefixed classes of learning rules. The few results obtained to date are very promising. We recall them in Section 3. In preliminary work (not reported) we have replicated part of these results and applied GAs to find the optimal parameters for the SBP algorithm, the SBP algorithm with momentum, and the Rprop algorithm. However, in that work we realised that by fixing the class of rules that can be explored we biased the search and prevented the evolutionary algorithm from exploring the much larger space of rules which we, humans, have not thought about.

So, in line with some recent work by Bengio [3], which is also summarised in Section 3, we decided to use a technique known as Genetic Programming (GP) developed by Koza [20]. GP allows the direct evolution of symbolic learning rules with their coefficients (if any) rather than the simpler evolution of parameters for a fixed learning rule. We briefly summarise how GP works and describe our approach in Section 4. Section 5 reports the experimental results obtained on two classes of standard benchmark problems, the parity problems and the encoder-decoder problem. We discuss these results and draw some conclusions in Section 6.
2 Standard Backpropagation Algorithm and Recent Improvements

A multilayer perceptron is a fully connected feed-forward neural network in which an arbitrary input vector is propagated forward through the network, causing an activation vector to be produced in the output layer. The network behaves like a function which maps the input vector onto the output vector. This function is determined by the connection weights of the net. The backpropagation algorithm follows the principle of gradient descent to perform a kind of error-correction learning. In SBP, the objective is to tune the weights of the network so that the network performs a desired input/output mapping.

Let u^l_i be the ith neuron in the lth layer (the input layer is the 0th layer and the output layer is the kth layer). Let n_l be the number of neurons in the lth layer. The weight of the link between neuron u^l_j and neuron u^{l+1}_i is denoted by w^{l+1}_{ij}. Let {x_1, x_2, ..., x_m} be the set of input patterns that the network is supposed to learn and let {t_1, t_2, ..., t_m} be the corresponding target output patterns. The pairs (x_p, t_p), p = 1, ..., m, are called training patterns. The output o^0_{ip} of a neuron u^0_i of the input layer when pattern x_p is presented coincides with its net input net^0_{ip}, i.e. it is the ith element, x_{ip}, of x_p. For the other layers, the net input net^{l+1}_{ip} of a neuron u^{l+1}_i for the input pattern x_p is usually computed as follows:

net^{l+1}_{ip} = \sum_{j=1}^{n_l} w^{l+1}_{ij} o^l_{jp} - \theta^{l+1}_i,
where o^l_{jp} is the output of the neuron u^l_j, and \theta^{l+1}_i is the bias of neuron u^{l+1}_i. For the sake of a homogeneous representation, \theta^{l+1}_i is often substituted by a weight to a 'bias unit' with a constant output of 1. This means that biases can be treated like weights. We assume this in the rest of the paper. The output o^l_{jp} is computed by passing the net input through a non-linear activation function. Usually, the sigmoid logistic function is used, so that

o^l_{jp} = 1 / (1 + e^{-\beta net^l_{jp}}),

where \beta determines the steepness of the activation function. A lower value of \beta gives a smoother transition from a lower to a higher activation level as net^l_{jp} increases; a higher value of \beta will cause a step-like transition in the activation level. In the rest of the report we assume that \beta = 1.
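As an aside, the forward pass just described can be sketched in a few lines of Python. This is a minimal illustration under the stated conventions (one weight matrix per layer, with the bias folded in as a weight to a constant bias unit); the function names are ours, not the report's.

```python
import numpy as np

def sigmoid(net, beta=1.0):
    # Sigmoid logistic activation; beta sets the steepness (beta = 1 here).
    return 1.0 / (1.0 + np.exp(-beta * net))

def forward(weights, x):
    # weights[l] is an (n_{l+1} x (n_l + 1)) matrix whose last column plays
    # the role of -theta, i.e. the bias treated as a weight to a 'bias unit'
    # with constant output 1.
    outputs = [np.asarray(x, dtype=float)]   # o^0 coincides with the input pattern
    for W in weights:
        o_prev = np.append(outputs[-1], 1.0)  # append the constant bias unit
        net = W @ o_prev                      # net^{l+1}_{ip}
        outputs.append(sigmoid(net))
    return outputs                            # layer outputs o^0, ..., o^k
```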
"jp = tjp ? okjp: The SBP rule uses this error to adjust the weights in such a way that the error gradually reduces. The training process stops when the error of every neuron for every training pair is reduced to an acceptable level, or when no further improvement is obtained. The network performance can be assessed using the total sum of squared (TSS) errors given by the following function: nk m X X "ip: E= 2
p=1 i=1
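Continuing the sketch above (and reusing its forward function), the TSS error over a training set can be computed as follows:

```python
def tss_error(weights, patterns, targets):
    # E = sum over patterns p and output neurons i of (t_ip - o^k_ip)^2.
    E = 0.0
    for x, t in zip(patterns, targets):
        o_k = forward(weights, x)[-1]   # output-layer activations o^k for x_p
        E += np.sum((np.asarray(t) - o_k) ** 2)
    return E
```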
The weights in the network are changed according to the following equation:

w^l_{ij}(s+1) = w^l_{ij}(s) + \Delta w^l_{ij}(s),

where \Delta w^l_{ij}(s) is the weight update of w^l_{ij} in the sth learning step. In the batched variant of SBP, which is based on the gradient of E, the weight update is given by:

\Delta w^l_{ij}(s) = -\eta \partial E / \partial w^l_{ij},

where \eta is a parameter called learning rate. The choice of \eta is essential for training success and speed. A large value of \eta will produce big weight changes and quicker learning. The danger with too large values of \eta is that the learning process may diverge or oscillate.

The SBP algorithm suffers from several problems [30, 28, 29, 31, 9, 11]:

1. If it does converge (this usually happens when the learning rate is small), it will be very slow.

2. Training performance is sensitive to the initial conditions.

3. Oscillations may occur during learning (this usually happens when the learning rate is high).

4. If the error function is shallow, the gradient is very small, leading to small weight changes. This is known as the "flat spot" problem.

5. Convergence to a local minimum of the error function can occur under certain circumstances, and there is no proof that the SBP algorithm finds a global minimum.

Several methods of speeding up the SBP algorithm have been proposed [17, 30, 28, 31, 16, 27, 36]. For example, the momentum method modifies the SBP algorithm as follows:

\Delta w^l_{ij}(s) = -\eta (\partial E / \partial w^l_{ij})(s) + \alpha \Delta w^l_{ij}(s-1),

where \alpha is a parameter called momentum. The momentum determines the effect of the previous \Delta w^l_{ij} value on the current one. This helps the gradient descent escape from local minima of the error function and has the effect of smoothing the error surface in weight space by filtering out high-frequency variations.
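A minimal sketch of this update rule in Python follows; it assumes the gradients \partial E / \partial w have already been obtained by backpropagation, and the names eta, alpha and sbp_momentum_update are illustrative:

```python
def sbp_momentum_update(weights, gradients, prev_updates, eta=0.5, alpha=0.9):
    # One batched SBP step with momentum:
    #   dw(s) = -eta * dE/dw (s) + alpha * dw(s-1)
    new_weights, new_updates = [], []
    for W, g, prev in zip(weights, gradients, prev_updates):
        dW = -eta * g + alpha * prev
        new_weights.append(W + dW)
        new_updates.append(dW)
    return new_weights, new_updates
```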
Rprop is one of the fastest variations of the SBP algorithm [17, 30, 28]. Rprop stands for 'resilient backpropagation'. It is a local adaptive learning scheme, performing supervised batch learning in multilayer perceptrons. For a detailed discussion see also [28, 29]. The basic principle of Rprop is to eliminate the harmful influence of the size of the partial derivative \partial E / \partial w^l_{ij} on the weight changes. The sign of the derivative is only used to indicate the direction of the weight update, while the size of the weight change is exclusively determined by a weight-specific update-value \Delta^l_{ij}(s), as follows:

\Delta w^l_{ij}(s) =
  -\Delta^l_{ij}(s)   if (\partial E / \partial w^l_{ij})(s) > 0,
  +\Delta^l_{ij}(s)   if (\partial E / \partial w^l_{ij})(s) < 0,
  0                   otherwise.

The update-values \Delta^l_{ij}(s) are modified according to the following equation:

\Delta^l_{ij}(s) =
  \eta^+ \Delta^l_{ij}(s-1)   if (\partial E / \partial w^l_{ij})(s-1) \cdot (\partial E / \partial w^l_{ij})(s) > 0,
  \eta^- \Delta^l_{ij}(s-1)   if (\partial E / \partial w^l_{ij})(s-1) \cdot (\partial E / \partial w^l_{ij})(s) < 0,
  \Delta^l_{ij}(s-1)          otherwise,

where \eta^- and \eta^+ are constants such that 0 < \eta^- < 1 < \eta^+. The Rprop algorithm has three parameters: the initial update-value \Delta_0, which directly determines the size of the first weight step (default setting \Delta_0 = 0.1), and the limits for the step update-values, \Delta_max and \Delta_min, which prevent the weights from becoming too large or too small. Typically \Delta_max = 50.0 and \Delta_min = 0.0000001. Convergence with Rprop is rather insensitive to the choice of these parameters. Nevertheless, for some problems it can be advantageous to allow only very cautious (namely small) steps, in order to prevent the algorithm from getting stuck too quickly in local minima. Rprop suffers from the same problems as any adaptive learning algorithm [28].
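For illustration, here is a sketch of the Rprop update for one weight matrix, following the equations above. The defaults eta_plus = 1.2 and eta_minus = 0.5 are the usual values from the Rprop literature rather than figures quoted in this report:

```python
import numpy as np

def rprop_update(W, grad, prev_grad, delta,
                 eta_plus=1.2, eta_minus=0.5, delta_max=50.0, delta_min=1e-7):
    # Adapt the per-weight update-values from the sign of grad(s-1) * grad(s),
    # clipped to the interval [delta_min, delta_max].
    sign_change = grad * prev_grad
    delta = np.where(sign_change > 0,
                     np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(sign_change < 0,
                     np.maximum(delta * eta_minus, delta_min), delta)
    # Only the sign of the derivative sets the direction of the step;
    # the step size comes entirely from delta.
    dW = -np.sign(grad) * delta
    return W + dW, delta
```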
3 Previous Work on the Evolution of NN Learning Rules

A relatively small amount of previous work has been reported on the evolution of learning rules for NNs. Given the topology of the network, GAs have been used to find the optimal learning rules. For example, Kitano [19] combined GAs with neural networks and proposed the NeuroGenetic Learning algorithm. This consists of five stages: 1) evolution, 2) development, 3) weight distribution, 4) network pruning and reduction, and 5) gradient-descent training. He demonstrated that the method provides an order of magnitude speed-up with respect to other methods. Montana [24] used GAs for training feedforward networks and created a new method of training which is similar to SBP. Whitley [35] showed that GAs can make a positive and competitive contribution in the neural networks area: although they have trouble finding an exact solution, all the solutions they produce are nearly correct. Chalmers [4] applied GAs to discover supervised learning rules for single-layer neural networks. He discussed the role of different kinds of connection systems and verified the optimality of the Delta rule. This author noticed that discovering more complex learning rules like SBP using GAs is not easy because either a) one uses a highly complex genetic coding, or b) one uses a simpler coding which allows SBP as a possibility. In the first case the search space would be huge; in the second case we bias the search using our own prejudices.

All these methods choose a fixed number of parameters and a rigid form for the learning rule. Koza [20] has demonstrated that GP may be a good way of getting around the limitations inherent to the fixed genetic coding which GAs suffer from. GP is a very natural way of encoding dynamic algorithmic processes of the kind we are investigating here. We will briefly summarise how GP works and describe our approach to discovering learning rules in the next section.

Figure 1: Parse-tree representation of the expression max(x * x, x + 3 * y).
4 Evolution of NN Learning Rules with GP

GP is an extension of GAs in which the structures that make up the population under optimisation are not fixed-length character strings that encode possible solutions to a problem, but programs that, when executed, are the candidate solutions to the problem [20, 21]. Programs are expressed in GP as parse trees, rather than as lines of code. For example, the simple expression max(x * x, x + 3 * y) would be represented as shown in Figure 1.

The set of possible internal (non-leaf) nodes used in GP parse trees is called the function set, F = {f_1, ..., f_{N_F}}. All functions of F have an arity (number of arguments) of at least one. The set of terminal (leaf) nodes in the parse trees representing programs in GP is called the terminal set, T = {t_1, ..., t_{N_T}}. Table 1 shows some typical functions and terminals used in GP. F and T can be merged into a uniform group C = F \cup T if terminals are considered as 0-arity functions. The search space of GP is the set of all the possible (recursive) compositions of the functions in C. The basic search algorithm used in GP is a classical GA with mutation and crossover specifically designed to handle parse trees.

Functions:
  Kind          Examples
  Arithmetic    +, *, /
  Mathematical  sin, cos, exp
  Boolean       AND, OR, NOT
  Conditional   IF-THEN-ELSE, IFLTE
  Looping       FOR, REPEAT

Terminals:
  Kind               Examples
  Variables          x, y
  Constant values    3, 0.45
  0-arity functions  rand, go left
  Random constants   random

Table 1: Typical functions and terminals used in genetic programming.

GP has been applied successfully to a large number of difficult problems like automatic design, pattern recognition, robotic control, synthesis of neural networks, symbolic regression, music and picture generation, etc., but only one attempt to use GP to induce new learning rules for NNs has been reported to date. Bengio [3] used GP to find learning rules which had an optimal number of parameters. Bengio used the output of the input neuron, the error of the output neuron, and the first derivative of the activation function of the output neuron as the terminals in T, and algebraic operators as the functions in F. GP found a better learning rule compared to the rules discovered by simulated annealing and GAs. The new learning rule suffers from the same problems as SBP and was only tested on special problems.

Our work is an extension of Bengio's work: our objective is to explore a larger space of rules using different parameters and different rules for the hidden and output layers. We hope to obtain a rule which is general like SBP but faster, more stable, and which can work in different conditions. We want to discover learning rules of the following form:
\Delta w^l_{ij}(s) = f(w^l_{ij}, x^l_{jp}, t_{ip}, o^l_{ip})

Using GP, the fitness of each learning rule is f = E_max - E, where E is the TSS error function and E_max is a constant such that f \geq 0. The value of E is measured at the maximum number of learning epochs. The tasks that the networks are supposed to learn with each learning rule in the population are mappings from input patterns to output patterns. These are described in the next section, together with the function and terminal nodes used by GP.
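To make this concrete, the following is an illustrative sketch of how a candidate rule can be encoded as a parse tree and assigned a fitness. The nested-tuple encoding, the terminal names, and the train_rule hook are our own illustrative choices, not the report's actual implementation:

```python
import operator

# Illustrative function set: name -> (callable, arity).
FUNCTIONS = {'+': (operator.add, 2), '-': (operator.sub, 2),
             '*': (operator.mul, 2)}

def eval_tree(tree, env):
    # A rule is a nested tuple, e.g. ('*', 'eta', ('*', 'err', 'o')),
    # read as eta * (err * o); env maps terminal names to values.
    if isinstance(tree, tuple):
        fn, _arity = FUNCTIONS[tree[0]]
        return fn(*(eval_tree(arg, env) for arg in tree[1:]))
    return env[tree] if isinstance(tree, str) else tree  # variable or constant

def fitness(rule_tree, train_rule, E_max=100.0):
    # f = E_max - E, where E is the TSS error measured at the maximum number
    # of learning epochs; E_max is chosen large enough that f >= 0.
    E = train_rule(rule_tree)   # trains a network with the candidate rule
    return E_max - E
```

For example, eval_tree(('*', 'eta', ('*', 'err', 'o')), {'eta': 0.5, 'err': 0.1, 'o': 0.8}) evaluates a delta-rule-like weight change for one weight.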
5 Experimental Results
5.1 Test Problems
It is not easy to establish a fair comparison between the many variants of supervised learning techniques. There are as many benchmark problems reported in the literature as there are new learning algorithms. Here, we consider two problems which have been widely used (a data-generation sketch is given after this list):

1. The 'exclusive or' (XOR) problem and its more general form, the N-input parity problem. The network consists of N units in the input layer, one in the output layer, and M neurons in the hidden layers. The target output is 1 if an odd number of inputs are 1, and 0 otherwise.

2. The family of N-M-N encoder problems, which forces the network to generalise and to map input patterns onto similar output activations [9]. The network consists of N units in both the input and output layers, and M neurons in the hidden layer. The output (target) vector is identical to the input vector. The objective is to learn a mapping of N input units to M hidden units (encoding) and a mapping of M hidden units to N output units (decoding), where in general M < N.
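The training patterns for both benchmark families are easy to generate; the following sketch (with illustrative function names) builds them:

```python
import itertools
import numpy as np

def parity_patterns(n):
    # N-input parity: target is 1 if an odd number of inputs are 1, else 0.
    inputs = np.array(list(itertools.product([0, 1], repeat=n)))
    targets = (inputs.sum(axis=1) % 2).reshape(-1, 1)
    return inputs, targets  # XOR is the special case n = 2

def encoder_patterns(n):
    # N-M-N encoder: the N one-hot input patterns double as the targets.
    inputs = np.eye(n)
    return inputs, inputs.copy()
```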