FAST LEARNING IN MULTILAYERED NEURAL NETWORKS BY MEANS OF HYBRID EVOLUTIONARY AND GRADIENT ALGORITHMS † Alexander P. Topchy, Oleg A. Lebedko, Victor V. Miagkikh Research Institute for Multiprocessor Computer Systems Chekhov str. 2, GSP-284, Taganrog, 347928, RUSSIA Phone: (+7)(86344)-64488, e-mail:
[email protected] †Taganrog State University of Radio-Engineering Nekrasovsky L. 44, GSP-17A, Taganrog, 347928, RUSSIA Phone: (+7)(86344)-61768, e-mail:
[email protected] Abstract: This paper describes two algorithms based on cooperative evolution of internal hidden network representations and on a combination of global evolutionary and local search procedures. The experimental results obtained are better than those of the prototype methods. It is demonstrated that applying pure gradient or pure genetic algorithms to the network training problem is much less effective than hybrid procedures, which reasonably combine the advantages of global and local search.
1. INTRODUCTION Artificial Neural Networks (ANN) make it possible to approach effectively a large class of applications, including pattern recognition, visual perception, signal processing and control systems. Most progress in this field is related to the invention of the error backpropagation algorithm by Rumelhart et al. [1]. Backpropagation is now the conventional procedure for ANN training. However, backpropagation, as well as its numerous modifications, often suffers from the problems typical of gradient descent algorithms: a slow convergence rate and susceptibility to local minima. In practice, many real-world training problems have no reasonably fast guaranteed algorithms for their solution. Moreover, the NP-hardness of the training problem defined on a fixed network architecture was shown by Judd [2]. A promising approach to hard optimization problems, including machine learning, is evolutionary algorithms (EA) based on simulated evolution. Evolutionary algorithms have been employed by many researchers for neural network training with varying degrees of success. Genetic Algorithms (GA) and Evolutionary Programming (EP) have been used for training weights and thresholds [3,4], for the search for efficient or minimal ANN structures and topologies [5,6], and for optimization of the learning algorithms themselves [7]. The best results with GA were obtained in constructing optimal network structures, because connectivity learning itself is a considerably more difficult task for the current backpropagation paradigm. However, the application of conventional evolutionary optimization techniques to pure parametric learning, i.e. finding the weights and thresholds for a given fixed network architecture, is limited. Some reasons for these restrictions are: • The relatively large computational complexity of EA procedures processing a population of ANNs, because the number of network parameters reaches N ∼ 10³–10⁵ in real-world applications.
• Insufficient performance of the recombination operators used by genetic algorithms when the network is directly encoded into binary strings. This often leads to premature convergence of the population and is related to the ambiguity of the interpreting function, which maps strings from the recombination space into the network space. • Strong dependence of the objective function (network error) on the problem being solved. This does not allow general recipes to be proposed for constructing a genetic representation and for choosing crossover operators that would be adequate for an arbitrary neural network structure and training data set in supervised learning. This paper describes two parametric learning algorithms that are free of the mentioned drawbacks. They are based on cooperative evolution of internal hidden network representations and a combination of global evolutionary and local search procedures. In the first proposed algorithm, the learning process can be considered as evolutionary adaptation of network parameters toward an optimal internal representation of the information being processed. This type of evolution proceeds in a single network, in contrast to conventional EA, which operate on a population of ANNs. The proposed procedure is related to the recent work by Prados [8,9], who trained network parameters by gradient descent toward desired hidden neuron activities. The ensemble of these target intermediate patterns is defined in the course of natural selection over a set of randomly generated activities; the tuning of all weights was performed exclusively by the gradient method. In contrast to that work, a different search-space encoding and search methodology are employed by the authors, namely phenotypic weight variations based on EP. The second algorithm focuses on a direct combination of a genetic algorithm and the delta rule for training two subsets of a multilayered perceptron's parameters, i.e. the hidden- and output-layer weights respectively. Experimental results of applying both proposed algorithms to test problems are presented and discussed. 2.
THE STATEMENT OF THE PARAMETRIC LEARNING PROBLEM The authors proceed from the standard definition of the feedforward ANN training problem: training input / desired output pairs (xk, tk), xk ∈ ℜ^I, tk ∈ ℜ^N, k = 1,...,K are given, and it is required to obtain a nonlinear mapping y: ℜ^I → ℜ^N such that y(xk) ≈ tk, ∀k. In the case at hand the mapping y is a two-layer network, which depends on synaptic weights and thresholds:
y(xk) = F(W · F(V · xk)),   (1)
where the matrices V and W contain the weights and thresholds of the first and second layers respectively, and F(·) is a non-linear sigmoidal activation function applied to each component of its vector argument. The goal is to minimize the objective function E(V,W) corresponding to the network error E on the training set:
E = ∑_{k=1}^{K} ‖tk − y(xk)‖²,   (2)
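As a minimal illustration of Eqs. (1)–(2), the two-layer mapping and its error can be sketched as follows. This is a hypothetical NumPy sketch, not the authors' code: the function names are illustrative, and the thresholds are assumed to be folded into V and W via a constant bias input.

```python
import numpy as np

def sigmoid(z):
    # Componentwise sigmoidal activation F
    return 1.0 / (1.0 + np.exp(-z))

def forward(V, W, x):
    # y(x) = F(W F(V x)), Eq. (1)
    return sigmoid(W @ sigmoid(V @ x))

def network_error(V, W, X, T):
    # E = sum over k of ||t_k - y(x_k)||^2, Eq. (2)
    return sum(np.sum((t - forward(V, W, x)) ** 2) for x, t in zip(X, T))
```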
Parametric learning consists in the search for the optimal weights V and W in a fixed architecture, i.e. with a given number of neurons and a given network connectivity. 3. PREMISES FOR COMBINED LEARNING HEURISTICS
A functional role performed by network parameters may serve as a basis for hybridization of different search methods for ANN training. It is known that each hidden neuron in a feedforward ANN draws a simple hypersurface in the input space. Each output neuron then combines these distinguished regions to construct a final region corresponding to a class [10]. Hence, the hidden layer provides a global separation of the input space, which can be performed in many ways because of fuzzy boundaries between correlated classes, hidden-neuron symmetry and so on. The hidden-layer units extract essential features and form prototypes for the processed patterns, but do not make the final decision on their membership in the chosen classes. On the other hand, the output layer must provide the best combination of the prototypes formed by the hidden layer. Besides, this solution for every output neuron does not depend on the other output neurons, which is not true of hidden neurons. At the same time, the desired class memberships of the training patterns are known for output-layer learning. Therefore, given the output activity of the hidden units, the problem can be reduced to one-layer network (simple perceptron) learning, which can easily be achieved by fast delta-rule learning. It is clear that the main difficulty in this case consists in the formation of the required prototypes or, equivalently, in the computation of the weight values V of the first layer, which define the hidden activity. This difficulty can be overcome with the help of a special fitness function, which estimates the outputs of the hidden neurons. Though the definition of such a function is ambiguous, it can be introduced to guide the course of the evolutionary search. Concrete examples of the fitness function depend on the chosen EA variant as well as on the particular statement of the problem being solved (e.g. classification, function approximation, etc.).
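With fixed hidden activities, output-layer training reduces to the simple perceptron problem mentioned above, and the delta rule takes a very compact form. The sketch below is a hypothetical illustration (the names, the learning-rate value and the use of sigmoid output units are assumptions of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule(W, H, T, lr=0.5, iters=30):
    # Train output weights W toward targets T on *fixed* hidden activities H
    # (one row of H per training pattern); a purely local, fast gradient step.
    for _ in range(iters):
        for h, t in zip(H, T):
            y = sigmoid(W @ h)
            # delta rule with the sigmoid derivative y * (1 - y)
            W += lr * np.outer((t - y) * y * (1.0 - y), h)
    return W
```

Because the hidden activities are fixed, each output neuron is trained independently, exactly as the argument above requires.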
Two types of simulated evolution, in different environments and with different instances in the population, are investigated here: cooperative evolution of groups of weights in a single network, and competitive genetic evolution of networks. In both cases the evolutionary search is able to find approximate, but good enough, starting points for further gradient fine tuning in each iteration. Taking this into account, we consider global search methods to be the most appropriate for training the hidden-neuron parameters, and a fast gradient descent procedure for output-neuron learning. 4. HYBRID ALGORITHM BASED ON EVOLUTIONARY COOPERATION 4.1. Background of the method The main idea of this method is the transition from a population of independent networks to a population of cooperating individuals (i.e. hidden units) that together form the network fitness. This approach demands a specific fitness function that determines how well an individual complements the other individuals in the population [11]. Such an evaluation function and a corresponding algorithm (GenLearn) were proposed by Prados [9]. If the output vectors tk belong to {-1,1}^N, then we can define this function for every i-th hidden neuron as follows:
e_i = (1 / (K·N)) ∑_{k=1}^{K} tkᵀ · wi F(vi · xk),   (3)
where K is the number of training patterns, N is the number of network outputs, wi is the i-th column of W and vi is the i-th row of V. It may be shown that the goal of any learning procedure is to maximize e_i if t_ik ∈ {F(−∞), F(+∞)} and F(x) is a sigmoidal function. The GenLearn algorithm is based on the evolution of hidden-neuron activity patterns; the global search therefore proceeds in the space of activities corresponding to all input learning images. For many real applications this search space is excessively large in comparison with the space of the hidden units' weights and thresholds, which fully define the internal representation of the hidden layer. Also, the technique suggested by Prados trains the parameters of the network by a deterministic gradient rule on the basis of randomly generated hidden activities. It is clear that this step can be excluded and a direct search performed in the space of weights and thresholds. In the proposed algorithm, the evolution is based on phenotypic changes in different groups of weights in the hidden layer and is driven by a variant of the evolutionary programming technique [12]. By phenotypic changes we mean modifications of parental solutions that are strictly mutation-based and do not involve underlying variations in a genotype. Thus, encoding into a string and the definition of genetic recombination operators are not required. Offspring are created by altering the parental parameters by uniformly distributed random values whose variances depend on the fitness function. The output-layer weights and thresholds are trained by a fast gradient method, which allows the quality of the hidden-layer representation to be estimated and the evolution to be directed. 4.2. Description of the algorithm The evolution proceeds in accordance with the following algorithm in pseudo-code:
1. Generation of the initial network:
   1.1. Generation of the hidden-layer parameters, uniformly distributed on the interval [-1.0, 1.0].
   1.2. Calculation of the hidden units' activities.
   1.3. Training of the output-layer parameters by means of 25-30 delta-rule iterations.
2. Acquisition of the total network error.
3. Calculation of the hidden neurons' fitness e_i.
4. Repeat
5. Generation of offspring neurons by mutation of the hidden units with the lowest fitness.
6. Acquisition of the hidden-layer activities.
7. Output-layer training by the delta rule.
8. Calculation of the total network error E_offsp and of the fitness of each hidden unit.
9. If E_offsp < E_parent then accept the offspring neurons, otherwise return to the parental population.
10. Until E_offsp is satisfactory.
11. Stop.
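The loop above can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' code: the exact per-neuron fitness and the mutation-width cap are assumptions made in the spirit of the text, and offspring are accepted only when the total error decreases.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_output(V, X, T, iters=25, lr=0.5):
    # Steps 1.2-1.3 / 6-7: hidden activities, then delta-rule output training.
    H = sigmoid(X @ V.T)                     # one row of activities per pattern
    W = np.zeros((T.shape[1], V.shape[0]))
    for _ in range(iters):
        Y = sigmoid(H @ W.T)
        W += lr * ((T - Y) * Y * (1.0 - Y)).T @ H
    E = np.sum((T - sigmoid(H @ W.T)) ** 2)  # total network error
    return W, E

def ep_step(V, X, T, E_parent, alpha=0.5, n_mut=2):
    # Per-neuron fitness: mean correlation of a hidden unit's activity with
    # the targets (an assumed stand-in for the paper's fitness in this sketch).
    H = sigmoid(X @ V.T)
    e = np.abs(T.T @ H).mean(axis=0) / len(X)
    worst = np.argsort(e)[:n_mut]            # step 5: mutate least fit units
    V_off = V.copy()
    for i in worst:
        width = min(alpha / max(e[i], 1e-6), 10.0)  # capped mutation width
        V_off[i] += rng.uniform(-width, width, size=V.shape[1])
    _, E_off = train_output(V_off, X, T)     # steps 6-8
    # Step 9: accept the offspring only if the total error decreased.
    return (V_off, E_off) if E_off < E_parent else (V, E_parent)
```

Iterating `ep_step` until the error is satisfactory reproduces steps 4-10 of the pseudo-code.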
Mutation is implemented as the addition of a uniformly distributed value with zero expectation and a variance inversely proportional to the fitness of the particular hidden neuron:
V_ij = V_ij + U(α / e_i),   (4)
where α/e_i is the half-width of the mutation interval; the coefficient α may vary from 0.1 to 1 (in the experimental simulations it was fixed at 0.5). Usual EP techniques use normally distributed additions, but this is not necessary in the proposed algorithm. This can be motivated by the fact that the argument of a neuron's activation function is a sum of weighted inputs, so uniformly distributed weight perturbations result in approximately normally distributed changes of the neuron activities. Experiments show that both modifications of this algorithm, with normal and with uniform distributions, have the same rate of convergence. Also, the width of the interval from which the additions are generated does not depend on the total network error explicitly; however, as the total network error decreases, the average fitness of the hidden neurons increases. It is reasonable to mutate fewer hidden neurons as the network error decreases in order to speed up the learning rate; during the first iterations, half of all hidden units are changed. 4.3. Experimental Results The simulations were carried out for two test problems. The results were averaged and compared with the standard backpropagation algorithm and the GenLearn procedure. The total network mean square error was normalized with respect to the number of output neurons and the size of the training data set. In the first test, a network with 10 input, 20 hidden and 10 output neurons was trained on 20 binary, randomly generated input-output pairs. The learning process of the considered algorithm (the abbreviation EP-plus-BP is used) is depicted in Fig. 1. The GenLearn procedure surpasses the proposed algorithm during the first iterations, because the uniform distribution of the hidden-layer activities provided by GenLearn leads to a lower initial error level. However, the proposed technique shows better results after about four hundred iterations in solving this "random mapping" problem. Besides, this algorithm is 1.25 times faster than the GenLearn procedure.
[Figure: learning curves, MSE (0.02-0.1) versus iterations (100-400), for the BP, GenLearn, GA+BP and EP+BP algorithms.]
Fig. 1. Comparison of the algorithms on the random mapping problem. As the second test problem, a network with 2 input, 10 hidden and 1 output neuron was trained to classify 80 points on the 2D plane belonging to two classes (see Fig. 2). The points of the first class form a non-convex region surrounded by the points of the other class. The GenLearn procedure fails to train the network on this rather realistic training data set (see Fig. 3), while the proposed technique successfully solves the problem, and the zero error level was always reached, in contrast to both backpropagation and the algorithm proposed by Prados. GenLearn has a very low convergence rate because it works in the excessively large space formed by the hidden-neuron activities.
[Figure: the two classes of training points, plotted on the square [-1.0, 1.0] × [-1.0, 1.0].]
Fig. 2. The training data set (the second test problem). 5. GENETIC PLUS GRADIENT LEARNING ALGORITHM 5.1. The Main Principles of Hybridization This hybrid algorithm combines the global search of the genetic algorithm (GA) technique with the gradient delta rule. It works with a population of networks but, in contrast to the usually applied network representations, where all parameters form a binary string, we operate with a string encoding of the hidden-layer parameters only. This reduces the global search space, which boosts the learning rate. In order to evaluate such a string, it is necessary to calculate the hidden-unit patterns, to train the output neurons by several delta-rule iterations and to compute the global network error, which is the fitness of the instance. The population of such individuals, together with the gradient descent algorithm, forms the GA, so it is possible to apply the efficient genetic operators of selection, crossover and mutation. A particular genetic string (the hidden-layer parameters) defines the subspace for the local search, where the fast gradient method always finds the best possible combination. This allows the last-layer weights to be considered as implicitly encoded in the hidden-layer representation. The dimension of each local search subspace is fixed by the number of parameters in the output layer. Therefore, the GA works with a population of subspaces, which can be interpreted in terms of a conventional GA as schemata with dummy bits corresponding to the output-layer parameters. The evaluation of each schema is performed during one iteration of the algorithm, because gradient descent for one layer always finds the same solution starting from an arbitrary point within the subspace. In a conventional GA, by contrast, the "evaluation" of schemata proceeds over numerous iterations along with the evolution of these subspaces. 5.2. Description of the Algorithm The following pseudo-code describes the general structure of the proposed algorithm. 1.
Generation of the doubled initial population of networks (generation of the hidden-layer parameters and their encoding into genetic strings)
2. Repeat
3. Gradient tuning
4. Fitness evaluation
5. Selection
6. Crossover
7. Mutation
8. Until the error of the best network in the population is satisfactory
9. Stop.
The hidden layer was encoded into a genetic string by placing the parameters of each hidden neuron in neighboring positions. A two-point crossover operator proved to be efficient for such an encoding. During gradient tuning, the output layer of each offspring is trained by means of several (about 30) delta-rule iterations, which gives an estimate E_i of the string. To avoid premature convergence, the probability p_i of including an instance in the next generation was calculated as follows:
p_i = 1 − Index(i) / (2P),   (5)
where Index(i) is the index of the instance in the sorted population (index = 0 corresponds to the fittest individual) and 2P is the size of the population before selection. This formula automatically supports an elitist strategy, which is required in a GA for guaranteed convergence to the global optimum. A simple heuristic was used to reduce the computational complexity: if, after training the one-layer networks by 10 delta-rule iterations, the errors satisfy E1(10) < E2(10), then, as a rule, E1(30) ≤ E2(30). This allows some unpromising strings to be rejected after a preliminary examination. 5.3. Experimental Simulations Experiments with this algorithm were carried out for the same test problems described above. The convergence rate is stable and does not depend strongly on the problem being solved. The main disadvantage of this algorithm (GA-plus-BP) in comparison with the other considered algorithms is that the network error after the first iteration is higher than for the EP-plus-BP and GenLearn algorithms. This may be explained by the random generation of the initial weights on a wide interval, which leads to saturated outputs of the hidden neurons. These occasional outputs, which are far from the optimal ones, determine the relatively low fitness of the instances in the initial population. Narrowing the interval decreases the convergence rate of the algorithm, because in this case the necessary bits of high-order schemata can appear only as a result of the mutation operator. For a population of 16 instances (after selection), as used in the simulations, one iteration of the GA-plus-BP algorithm is approximately equal to 100 iterations of the standard backpropagation procedure and to about 12 iterations of the EP-plus-BP algorithm. However, the proposed technique converges to the global optimum in complex tasks, and it may be implemented effectively on multiprocessor systems, where the local optimization of each instance can take place on a particular processor.
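The rank-based survival rule of Eq. (5) can be sketched as follows (a hypothetical sketch; the names are illustrative, and `errors` plays the role of the E_i estimates):

```python
import numpy as np

rng = np.random.default_rng(1)

def rank_select(population, errors, P):
    # Sort so that sorted index 0 is the fittest (lowest-error) individual;
    # each instance then survives with probability p_i = 1 - Index(i) / (2P).
    order = np.argsort(errors)
    survivors = []
    for rank, idx in enumerate(order):
        if rng.random() < 1.0 - rank / (2.0 * P):
            survivors.append(population[idx])
    return survivors
```

The fittest individual survives with probability 1, which is the implicit elitist strategy; with 2P instances entering selection, the expected number of survivors is approximately P.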
[Figure: learning curves, MSE (0.02-0.18) versus iterations (100-400), for the GenLearn, BP, EP+BP and GA+BP algorithms.]
Fig. 3. Comparison of the algorithms on the 2-D horseshoe problem. 6. CONCLUSION The results obtained with the help of the described algorithms are better than those of the prototype methods. The proposed approaches suggest that the application of pure gradient or pure genetic algorithms to ANN training is much less effective than hybrid procedures, which reasonably combine global and local search. In the first algorithm, such a hybridization allows the search to be implemented in a single network, which differs from the traditionally employed evolutionary simulations. This algorithm is less time-consuming than those working with a population of ANNs. Moreover, the idea behind the proposed algorithm can be applied more widely, i.e. it is possible to form a population from other groups of ANN parameters. In the second algorithm, a similar combination of genetic and gradient fine-tuning algorithms speeds up the convergence to the global optimum, because the chosen genetic string representation forms a population of search subspaces, where each subspace is estimated by the fast gradient procedure. This estimate defines the lower bound of the fitness for each solution in the subspace. In general, these subspaces may be interpreted as schemata of a conventional GA. Hence, the evaluation of such a schema in the proposed algorithm is performed during one iteration, because gradient descent for the second layer always finds the same solution starting from an arbitrary point within the subspace. In contrast, the evaluation of schemata in a conventional GA proceeds over numerous iterations along with the evolution of these subspaces. Also, this algorithm considerably reduces the length of the genetic string, which is very important for large ANNs.
REFERENCES
[1] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Chapter 8, Cambridge, MA: MIT Press, 1986.
[2] J.S. Judd, Neural Network Design and the Complexity of Learning, Cambridge, MA: MIT Press, 1990.
[3] D. Whitley, T. Hanson, Optimizing neural networks using faster, more accurate genetic search, in Proceedings of the Third International Conference on Genetic Algorithms, pp. 391-395, San Mateo, CA, 1989.
[4] D.B. Fogel, L.J. Fogel, V.W. Porto, Evolving neural networks, Biological Cybernetics, vol. 63, pp. 487-493, 1990.
[5] P.J. Angeline, G.M. Saunders, J.B. Pollack, An evolutionary algorithm that constructs recurrent neural networks, IEEE Transactions on Neural Networks, vol. 5, no. 1, pp. 54-65, 1994.
[6] V. Maniezzo, Genetic evolution of the topology and weight distribution of neural networks, IEEE Transactions on Neural Networks, vol. 5, no. 1, pp. 39-53, 1994.
[7] D. Chalmers, The evolution of learning: an experiment in genetic connectionism, in Proceedings of the 1990 Connectionist Models Summer School, San Mateo, CA, 1990.
[8] D. Prados, New learning algorithm for training multilayered neural networks that uses genetic-algorithm techniques, Electronics Letters, vol. 28, no. 16, 1992.
[9] D. Prados, A fast supervised learning algorithm for large multilayered neural networks, in Proceedings of the 1993 IEEE International Conference on Neural Networks, San Francisco, vol. 2, pp. 778-782.
[10] N. Nilsson, Learning Machines, New York, NY: McGraw-Hill, 1965.
[11] K. De Jong, W. Spears, On the state of evolutionary computation, in Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 618-623, San Mateo, CA, 1993.
[12] L.J. Fogel, A.J. Owens, M.J. Walsh, Artificial Intelligence through Simulated Evolution, John Wiley & Sons, 1966.