Optimal Design of Neural Nets Using Hybrid Algorithms

Ajith Abraham & Baikunth Nath
Gippsland School of Computing & Information Technology, Monash University, Churchill 3842, Australia
{Email: Ajith.Abraham, [email protected]}

Abstract: Selecting the topology of a network and the correct parameters for the learning algorithm is a tedious task when designing an optimal Artificial Neural Network (ANN) that is smaller, faster and has better generalization performance. The genetic algorithm (GA) is an adaptive search technique based on the principles and mechanisms of natural selection and survival of the fittest from natural evolution. Simulated annealing (SA) is a global optimization algorithm that can process cost functions with quite arbitrary degrees of nonlinearity, discontinuity and stochasticity, while statistically assuring an optimal solution. In this paper we explain how a hybrid algorithm integrating the desirable aspects of GA and SA can be applied to the optimal design of an ANN. The paper is concerned with current theoretical developments in Evolutionary Artificial Neural Networks (EANNs) using GAs and other heuristic procedures, and with how the proposed hybrid and these heuristic procedures can be combined to produce an optimal ANN.

1. Introduction

Many conventional ANNs being designed today are statistically quite accurate, but they still leave a bad taste with users who expect computers to solve their problems exactly. An important drawback is that the designer has to specify the number of neurons, their distribution over several layers and the interconnections between them. Interest in evolutionary search procedures for designing ANN architectures has grown in recent years, since such procedures can evolve towards an optimal architecture without outside interference, eliminating the tedious trial-and-error work of manually finding an optimal network [1]. Genetic algorithms and simulated annealing, two of the most general-purpose optimization procedures, are increasingly being applied independently to a diverse spectrum of problem areas. Comparisons of the relative performance of GA and SA have been primarily confined to empirical evaluations on test problems, and previous work has clearly shown that each technique has its own merits on different classes of problems: in certain situations GA outperforms SA, and vice versa. For some time, theoretical investigators of SA and GA have focused on developing a hybrid algorithm that employs the desirable properties and performance of both [2]. GA is not designed to be ergodic and to cover the search space in a maximally efficient way; its prime benefit is its capability for parallelization. In contrast, SA is largely sequential in moving from one optimal value to the next: states must be sampled one after another, both to test acceptability and to identify the current local minimum about which new trial parameters are chosen.

2. Genetic Algorithm (GA)

GAs are adaptive methods, based on the genetic processes of biological organisms, which may be used to solve optimization problems. Over many generations, natural populations evolve according to the principles of natural selection and "Survival of the Fittest", first clearly stated by Charles Darwin in "The Origin of Species". By mimicking this process, GAs are able to "evolve" solutions to real-world problems, provided the problems have been suitably encoded. The procedure may be written as the difference equation:

x[t + 1] = s(v(x[t]))    (1)

where x[t] is the population at time t under a representation x, v is a random variation operator, and s is the selection operator.

Figure 1. Flowchart of a GA iteration.

GAs deal with parameters of finite length, coded using a finite alphabet, rather than directly manipulating the parameters themselves [20]. This means that the search is constrained neither by the continuity of the function under investigation nor by the existence of a derivative. Figure 1 illustrates the functional block diagram of a GA. It is assumed that a potential solution to a problem may be represented as a set of parameters. These parameters (known as genes) are joined together to form a string of values (known as a chromosome). The particular values a gene can take are called its alleles, and the position of the gene in the chromosome is its locus. Encoding issues deal with representing a solution in a chromosome; unfortunately, no one technique works best for all problems. A fitness function must be devised for each problem to be solved: given a particular chromosome, it returns a single numerical fitness, or figure of merit, which reflects the ability of the individual that the chromosome represents. Reproduction is another critical attribute of GAs, whereby two individuals selected from the population are allowed to produce offspring, which comprise the next generation. Having selected two parents, their chromosomes are recombined using the mechanisms of crossover and mutation. The traditional view is that crossover is the more important of the two mechanisms for rapidly exploring a search space, while mutation provides a small amount of random search and helps ensure that no point in the search space has a zero probability of being examined. If the GA has been correctly implemented, the population should evolve over successive generations so that the fitness of the best and the average individual in each generation moves towards the global optimum. Selection is the conservation of the fittest individuals for the next generation and consists of three parts: first, the individual's fitness is determined by the fitness function; second, the fitness is converted into an expected value; and third, the expected value is converted into a discrete number of offspring. To avoid premature convergence of GAs due to interference from mutation and genetic drift, sharing and crowding may be used to decrease the amount of duplicate schemata in the population. Elitism may be incorporated to keep the most superior individuals (and superior schemata) within the population. Parallel genetic algorithms let sub-populations converge to superior schemata of low order and then propagate individuals between sub-populations to achieve global optimization.
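As a concrete illustration of Eq. (1), the following is a minimal sketch of one GA variant: elitist truncation selection, one-point crossover and bitwise mutation on a toy "one-max" fitness function. The problem, operators and parameter values are our own assumptions, not taken from the paper.

```python
import random

def fitness(bits):
    # Toy fitness: number of 1-bits (the "one-max" problem).
    return sum(bits)

def variation(pop, p_cross=0.7, p_mut=0.01):
    # v in Eq. (1): one-point crossover followed by bitwise mutation.
    out = []
    for _ in range(len(pop) // 2):
        a, b = random.sample(pop, 2)
        if random.random() < p_cross:
            cut = random.randrange(1, len(a))
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        for child in (a, b):
            out.append([bit ^ (random.random() < p_mut) for bit in child])
    return out

def selection(parents, offspring, n):
    # s in Eq. (1): keep the n fittest of parents + offspring (elitist truncation).
    return sorted(parents + offspring, key=fitness, reverse=True)[:n]

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for t in range(50):
    pop = selection(pop, variation(pop), 30)  # x[t+1] = s(v(x[t]))
print(fitness(pop[0]))  # best fitness after evolution
```

Elitist truncation is only one possible choice for s; fitness-proportionate or tournament selection would fit Eq. (1) equally well.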

3. Simulated Annealing (SA)

SA exploits an analogy between the way in which a metal cools and freezes into a minimum-energy crystalline structure (the annealing process) and the search for a minimum in a more general system. The algorithm can in theory converge to the global optimum, but in practice usually finds a near-optimum. SA's major advantage over other methods is its ability to avoid becoming trapped in local minima. Figure 2 shows a flowchart of an SA iteration. The annealing schedule, i.e., the temperature-decreasing rate used in SA, is an important factor affecting SA's rate of convergence.

Figure 2. Flow chart of SA iteration.

The algorithm employs a random search which accepts not only changes that decrease the objective function f, but also some changes that increase it. The latter are accepted with probability

p = exp(−δf / T)

where δf is the increase in the objective function and T is a control parameter (the temperature). Several variants of SA have been developed, with annealing schedules inversely linear in time (Fast SA), an exponential function of time (Very Fast SA), etc. We explain an SA algorithm [5] that is exponentially faster than Very Fast SA; its annealing schedule is given by

T(k) = T0 / exp(e^k)

where T0 is the initial temperature and T(k) is the temperature, which we wish to approach zero, for k = 1, 2, .... The generation function of the simulated annealing algorithm is

g_k(Z) = ∏_{i=1}^{D} g_k(z_i) = ∏_{i=1}^{D} 1 / [ 2(|z_i| + 1/ln(1/T_i(k))) ln(1 + ln(1/T_i(k))) ]    (2)

where T_i(k) is the temperature in dimension i at time k and D is the dimension of the state space. The generation probability is given by

G_k(Z) = ∫_{−1}^{z_1} ∫_{−1}^{z_2} ... ∫_{−1}^{z_D} g_k(Z) dz_1 dz_2 ... dz_D = ∏_{i=1}^{D} G_{ki}(z_i)    (3)

where

G_{ki}(z_i) = 1/2 + (sgn(z_i)/2) · ln(1 + |z_i| ln(1/T_i(k))) / ln(1 + ln(1/T_i(k)))    (4)

It is straightforward to prove that with the annealing schedule

T_i(k) = T_{0i} exp(−exp(b_i k^{1/D}))    (5)

a global minimum can (statistically) be obtained; that is,

∑_{k=k_0}^{∞} g_k = ∞    (6)

where b_i > 0 is a constant parameter and k_0 is a sufficiently large constant to satisfy (6).
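The acceptance rule p = exp(−δf/T) and the doubly exponential schedule of Eq. (5) (taken here with D = 1) can be sketched as below. The objective function, the step distribution (a plain Gaussian move rather than the generation function of Eq. (2)) and all parameter values (T0, b, number of steps) are illustrative assumptions.

```python
import math
import random

def objective(x):
    # Assumed multimodal toy objective with several local minima.
    return x * x + 10.0 * math.sin(3.0 * x)

def anneal(T0=10.0, b=0.05, steps=200):
    x = random.uniform(-5.0, 5.0)
    best = x
    for k in range(1, steps + 1):
        T = T0 * math.exp(-math.exp(b * k))       # schedule of Eq. (5), D = 1
        x_new = x + random.gauss(0.0, 1.0)        # simple Gaussian move
        df = objective(x_new) - objective(x)
        # Accept downhill moves always; uphill moves with p = exp(-df / T).
        if df <= 0 or (T > 0 and random.random() < math.exp(-df / T)):
            x = x_new
            if objective(x) < objective(best):
                best = x
    return best

# A few restarts guard against freezing in a poor basin.
best = min((anneal() for _ in range(10)), key=objective)
print(round(objective(best), 2))
```

Note how quickly the schedule cools: exp(−exp(bk)) drives T towards zero after only a few dozen steps, after which the search behaves like a greedy local descent.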

4. Genetic Annealing Algorithm (GAA)

Genetic annealing is a hybrid random-search technique fusing SA and GA methodologies into a more efficient algorithm. Such hybrid algorithms can inherit the convergence properties of simulated annealing and the parallelization capability of GA. Each genotype is assigned an energy threshold, initially equal to the energy of the randomized bit string to which it is assigned. It is the genotype's threshold, not its energy, that determines which trial mutations constitute acceptable improvements. If the energy of the mutant exceeds the threshold of the parent that spawned it, the mutant is rejected and a new genotype is considered. If, however, the energy of the new genotype is less than or equal to the threshold of its parent, the mutant is accepted as a replacement for its progenitor. GAA uses an Energy Bank (EB) to keep track of the energy liberated by successful mutants. Whenever a mutant passes the threshold test, the difference between the threshold and the mutant's energy is added to the EB for temporary storage. Once this quantum of energy is accounted for, the threshold is reset to the energy of the accepted mutant and the algorithm moves on to the next member of the population. After each member has been subjected to a random mutation, the entire population is reheated by raising the thresholds. The rate of reheating is directly proportional to the amount of energy accumulated in the EB (from each member of the population) and to the designer's choice of cooling rate (Section 3). Annealing results from repeated cycles of collecting energy from successful mutants and then redistributing nearly all of it by raising the threshold energy of each population member equally [14].

5. Back-Propagation (BP) Algorithm

BP is one of the most famous training algorithms for multilayer perceptrons [8]. Basically, BP is a gradient-descent technique that minimizes the error E for a particular training pattern. For adjusting the weight w_ij from the i-th input unit to the j-th output unit, in the batched-mode variant the descent is based on the gradient ∇E (∂E/∂w_ij) for the total training set:

Δw_ij(n) = −ε · ∂E/∂w_ij + α · Δw_ij(n − 1)    (7)

The gradient gives the direction in which the error E increases most rapidly. The parameters ε and α are the learning rate and the momentum, respectively; a good choice of both is required for training success and speed of the ANN. Empirical research has shown that the backpropagation algorithm used for training an ANN has the following problems:

• BP often gets trapped in a local minimum, mainly because of the random initialization of the weights.

• BP usually generalizes quite well, detecting the global features of the input, but after prolonged training the network starts to recognize individual input/output pairs rather than settling for weights that generally describe the mapping for the whole training set.
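The batched update of Eq. (7) can be sketched on the simplest possible case, a single linear unit trained by least squares. The data, the error definition (E taken as half the mean squared error) and the values of ε and α are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # assumed training inputs
w_true = np.array([1.5, -2.0, 0.5])    # assumed target weights
y = X @ w_true

w = np.zeros(3)
delta_w = np.zeros(3)                  # delta_w(n-1), initially zero
epsilon, alpha = 0.01, 0.9             # learning rate and momentum

for n in range(200):
    err = X @ w - y
    grad = X.T @ err / len(X)          # dE/dw for E = mean squared error / 2
    delta_w = -epsilon * grad + alpha * delta_w   # Eq. (7)
    w = w + delta_w

print(np.round(w, 2))                  # approaches w_true
```

With α = 0 this reduces to plain batched gradient descent; the momentum term reuses the previous step to accelerate movement along consistent gradient directions.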

Design of ANNs by Global Optimization Algorithms (GOAs) is attractive because they can handle the above-mentioned problems more effectively. Moreover, a design based on global search can be extended to a wide range of ANNs, not just feedforward networks.

6. A General Framework for Optimal Design of Neural Networks

An optimal design of an ANN can only be achieved by the adaptive evolution of connection weights, architecture and learning rules, which progress on different time scales [1]. Figure 3 illustrates the general interaction mechanism, with the architecture of the ANN evolving at the highest level, on the slowest time scale.

Figure 3. Interaction of various search mechanisms in the design of an optimal ANN.

For every architecture, there is an evolution of learning rules that proceeds on a faster time scale, in an environment decided by the architecture. For each learning rule, the evolution of connection weights proceeds on a still faster time scale, in an environment decided by the problem, the learning rule and the architecture. The relative hierarchy of architecture and learning rules depends on the prior knowledge available: if there is more prior knowledge about the learning rules than about the architecture, it is better to implement the learning rules at a higher level.

6.1. Hybrid algorithm for global search of connection weights (Algorithm 1)

The shortcomings of the BP algorithm mentioned in Section 5 can be overcome if the training process is treated as a global search of connection weights towards an optimal set defined by the GA. Finding optimal connection weights can be formulated as a global search problem in which the architecture and learning rule of the ANN are predefined and fixed during the evolution.

Figure 4. Genotype of a binary representation of weights.

Connection weights may be represented as binary strings of a certain length. The whole network is encoded by concatenating all the connection weights of the network into the chromosome. A useful heuristic concerning the order of concatenation is to put connection weights to the same node together. Figure 4 illustrates the binary representation of connection weights, where each weight is represented by 4 bits. Real numbers have also been proposed to represent connection weights directly [10]; such a representation of an ANN could be (4.0, 8.0, 7.0, 3.0, 1.0, 5.0). In either case, proper genetic operators must be chosen depending on the representation used.
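A 4-bit-per-weight encoding like that of Figure 4 can be sketched as below. The particular decoding chosen here (offset binary mapping each code to an integer weight in [−8, 7]) is an assumption; the paper does not fix a specific mapping from bits to real values.

```python
def encode(weights, bits=4):
    # Clip each weight to the representable integer range and write it as a
    # fixed-width offset-binary code; concatenation forms the chromosome.
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    chrom = ""
    for w in weights:
        v = min(max(int(round(w)), lo), hi) - lo
        chrom += format(v, f"0{bits}b")
    return chrom

def decode(chrom, bits=4):
    # Split the chromosome back into fixed-width genes and undo the offset.
    lo = -(2 ** (bits - 1))
    return [int(chrom[i:i + bits], 2) + lo for i in range(0, len(chrom), bits)]

chrom = encode([4.0, -8.0, 7.0, 3.0])
print(chrom, decode(chrom))  # '1100000011111011' [4, -8, 7, 3]
```

A finer quantization (more bits, or a fixed-point scale factor) trades chromosome length against weight precision, which is exactly the scalability concern raised for direct encodings.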

Global search of connection weights using the hybrid heuristic can be formulated as follows:

1) Generate an initial population of N weight vectors, and for i = 1 to N initialize the i-th threshold, Th(i), with the energy of the i-th configuration.

2) Begin the cooling loop.
   • Set the Energy Bank (EB) to zero, and for i = 1 to N randomly mutate the i-th weight vector.
   • Compute the energy E of the resulting mutant weight vector.
     ♦ If E > Th(i), restore the old configuration.
     ♦ If E ≤ Th(i), add the energy difference to the Energy Bank, EB = EB + Th(i) − E, and replace the old configuration with the successful mutant.
   End cooling loop.

3) Begin the reheating loop.
   • Compute the reheating increment eb = EB · T_i(k) / N, for i = 1 to N (T_i(k) is the cooling constant).
   • Add the computed increment to the threshold of each weight vector.
   End reheating loop.

4) Go to step 2 and continue the annealing and reheating process until an optimum weight vector is found (the required minimum error is achieved).

5) Check whether the network has achieved the required error rate. If the required error rate is not achieved, skip steps 1 to 4, restore the weights and switch to the backpropagation algorithm for fine-tuning of the weights.

6) End.

While gradient-based techniques are very much dependent on the initial setting of weights, the proposed algorithm can be considered generally much less sensitive to initial conditions. It is worthwhile to incorporate a gradient-based search (BP) during the final stages of the global search, to fine-tune the solution locally and to avoid the global search being trapped in some local minimum [11].
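The threshold/energy-bank cycle of steps 1-4 can be sketched as follows. The energy function (a stand-in for the network error on the training set), the Gaussian mutation operator and the parameter values are illustrative assumptions; a real application would plug in the ANN's error as the energy.

```python
import random

def energy(w):
    # Assumed stand-in for network training error: a toy quadratic bowl.
    return sum((wi - 1.0) ** 2 for wi in w)

def genetic_annealing(N=20, dim=5, cooling=0.9, iters=300):
    pop = [[random.uniform(-2, 2) for _ in range(dim)] for _ in range(N)]
    th = [energy(p) for p in pop]            # step 1: thresholds = initial energies
    for _ in range(iters):
        eb = 0.0                             # step 2: cooling loop
        for i in range(N):
            mutant = [w + random.gauss(0, 0.1) for w in pop[i]]
            e = energy(mutant)
            if e <= th[i]:                   # accept; bank the liberated energy
                eb += th[i] - e
                pop[i], th[i] = mutant, e
        reheat = eb * cooling / N            # step 3: reheating loop
        th = [t + reheat for t in th]
    return min(pop, key=energy)

best = genetic_annealing()
print(round(energy(best), 3))  # small residual error
```

Because reheating returns most of the banked energy to the thresholds, the population cools only as fast as successful mutations stop liberating energy, which is what gives the method its annealing character.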

6.2. Hybrid algorithm for global search of optimal architecture (Algorithm 2)

Evolutionary architecture adaptation can be achieved by constructive and destructive algorithms. Constructive algorithms start from a very simple architecture and add complexity to the network until the entire network is able to learn the task [3-4, 9]. Destructive algorithms start with large architectures and remove nodes and interconnections until the ANN is no longer able to perform its task [15]; then the last removal is undone. Figure 5 demonstrates how a typical neural network architecture can be directly encoded and how the genotype is represented. We assume that the node transfer functions are fixed before the architecture is decided. For an optimal network, the choice of node transfer function (Gaussian, sigmoidal, tangent, etc.) can itself be formulated as a global search problem, which runs on a faster time scale than the search for architectures.

  To | From 1  From 2  From 3  From 4  From 5  Bias | Gene
   1 |   0       0       0       0       0      0   | 000000
   2 |   0       0       0       0       0      0   | 000000
   3 |   1       1       0       0       0      1   | 110001
   4 |   1       1       0       0       0      1   | 110001
   5 |   0       0       1       1       0      1   | 001101

Complete genotype: 000000000000110001110001001101

Figure 5. Direct coding and genotype representation of a neural network architecture.

Scalability is often an issue when a direct (low-level) coding scheme is used: the genotype string grows very large as the ANN size increases, which in turn increases the computation time of the evolution. To minimize the size of the genotype string and improve scalability, it is efficient to use an indirect (high-level) coding scheme when a priori knowledge of the architecture is available. For example, if two neighboring layers are fully connected, the architecture can be coded simply by the number of layers and nodes. The blueprint representation is a popular indirect coding scheme; it assumes that the architecture consists of various segments or areas, each of which defines a set of neurons, their spatial arrangement and their efferent connectivity. Several high-level coding schemes, such as the graph generation system [13], Symbiotic Adaptive Neuro-Evolution (SANE) [12], marker-based genetic coding [16], L-systems [6], cellular encoding [17], fractal representation [18] and the Evolutionary Network Optimizing System (ENZO) [7], are among the robust techniques available. Global search of the transfer function and the connectivity of the ANN using the hybrid algorithm can be formulated as follows:

1) Generate an initial population of N architecture vectors, and for i = 1 to N initialize the i-th threshold, Th(i), with the energy of the i-th configuration. Depending on the coding scheme used, each vector should represent the architecture and the node transfer function.

2) and 3) are the same as in Algorithm 1, with the weight vector replaced by the architecture vector.

4) Go to step 2 and continue the annealing and reheating process until an optimum architecture vector is found.

5) End.
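The direct coding of Figure 5 (one 6-bit gene per node: the connection bits from nodes 1-5, then a bias bit) can be sketched as a small decoder. The data layout below follows the figure; the helper names are our own.

```python
NODES = 5
GENE_LEN = NODES + 1  # 5 connection bits + 1 bias bit per node

def decode(genotype):
    # Returns (conn, bias): conn[j][i] == 1 means a connection from node i+1
    # to node j+1; bias[j] is the bias bit of node j+1.
    genes = [genotype[k:k + GENE_LEN] for k in range(0, len(genotype), GENE_LEN)]
    conn = [[int(b) for b in g[:NODES]] for g in genes]
    bias = [int(g[NODES]) for g in genes]
    return conn, bias

# The complete genotype from Figure 5.
conn, bias = decode("000000000000110001110001001101")
print(conn[2], bias[2])  # node 3: fed by nodes 1 and 2, with a bias
```

The genotype length grows quadratically with the number of nodes, which is the scalability problem that motivates the indirect coding schemes discussed above.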

6.3. Hybrid algorithm for global search of learning rules (Algorithm 3)

For a neural network to be fully optimal, the learning rules must be adapted dynamically according to its architecture and the given problem. Adapting the learning rate and momentum can be considered a first attempt at evolving learning rules [19]. The basic learning rule can be generalized by the function [1]:

Δw(t) = ∑_{k=1}^{n} ∑_{i_1, i_2, ..., i_k = 1}^{n} θ_{i_1, i_2, ..., i_k} ∏_{j=1}^{k} x_{i_j}(t − 1)

where t is the time, Δw is the weight change, x_1, x_2, ..., x_n are local variables, and the θ's are real-valued coefficients to be determined by the global search algorithm.

In the above equation, different values of the θ's determine different learning rules. In deriving the equation it is assumed that the same rule applies at every node of the network and that the weight update depends only on the input/output activations and the connection weights at a particular node. The genotypes (the θ's) can be encoded as real-valued coefficients, and the global search for learning rules using the hybrid algorithm can be formulated as follows:

1) Generate an initial population of N θ vectors, and for i = 1 to N initialize the i-th threshold, Th(i), with the energy of the i-th configuration.

2) and 3) are the same as in Algorithm 1, with the weight vector replaced by the θ vector.

4) Go to step 2 and continue the annealing and reheating process until an optimal θ vector is obtained.

5) End.

It may be noted that a BP algorithm with an adaptive learning rate and momentum can be regarded as a simple special case of such adaptive learning rules.
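The generalized learning rule above can be sketched by evaluating Δw for a given coefficient table θ, truncated at a small order n. The choice of local variables x and of the θ values below is an illustrative assumption; note that with θ indexed by (pre, post) the rule reduces to the classical Hebb update Δw = η · pre · post.

```python
from itertools import product

def delta_w(x, theta, n=2):
    # Evaluate sum over k = 1..n of sum over all index tuples (i1, ..., ik)
    # of theta[i1, ..., ik] * x[i1] * ... * x[ik]; absent coefficients are 0.
    total = 0.0
    for k in range(1, n + 1):
        for idx in product(range(len(x)), repeat=k):
            term = theta.get(idx, 0.0)
            for i in idx:
                term *= x[i]
            total += term
    return total

# Hebbian special case: x = (pre-synaptic, post-synaptic activation),
# single second-order coefficient theta[(0, 1)] = learning rate 0.1.
x = (0.5, 2.0)
theta = {(0, 1): 0.1}
print(delta_w(x, theta))  # 0.1 * 0.5 * 2.0 = 0.1
```

Evolving the θ table rather than hand-picking it is exactly what Algorithm 3 does: each θ vector is one candidate learning rule whose energy is the training error it produces.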

7. Conclusion

ANNs are no longer a concept confined to the academic environment; they have become part of the harsher world of users who simply want their tasks completed. Unfortunately, few applications tolerate the level of error produced mainly by the trial-and-error design of ANNs. In this paper we have presented how the optimal design of an ANN can be achieved using a three-tier global search process based on a meta-heuristic hybrid algorithm. Compared to a pure genetic evolutionary search, the proposed hybrid algorithm has better convergence, and by incorporating other algorithms (BP for fine-tuning of the weights), optimality can be further ensured. However, real success in modeling such systems will depend directly on the genotype representation of the connection weights, architecture and learning rules. Depending on the a priori information available about the architecture and learning rules, the global search procedures must be formulated accordingly. GOAs demand considerable computational effort; fortunately, GAs work with a population of independent solutions, which makes it easy to distribute the computational load among several processors. As computers continue to deliver accelerated performance, the global search of large ANNs becomes increasingly feasible. The authors are currently working on the implementation of the proposed algorithm.

8. Acknowledgements

The authors wish to thank the three anonymous referees for their constructive comments, which improved the clarity of the paper.

References

1. Yao X.: Evolving Artificial Neural Networks, Proceedings of the IEEE, 87(9), pp. 1423-1447, (1999).
2. Hart W.E.: A Theoretical Comparison of Evolutionary Algorithms and Simulated Annealing, Proceedings of the Fifth Annual Conference on Evolutionary Programming, MIT Press, (1996).
3. Frean M.: The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks, Neural Computation, Vol. 2, pp. 198-209, (1990).
4. Mezard M., Nadal J.P.: Learning in Feedforward Layered Networks: The Tiling Algorithm, Journal of Physics A, Vol. 22, pp. 2191-2204, (1989).
5. Yao X.: A New Simulated Annealing Algorithm, International Journal of Computer Mathematics, Vol. 56, pp. 161-168, (1995).
6. Boers E.J.W., Kuiper H., Happel B.L.M., Sprinkhuizen-Kuyper I.G.: Designing Modular Artificial Neural Networks, in H.A. Wijshoff (ed.), Proceedings of Computing Science in The Netherlands, pp. 87-96, (1993).
7. Gutjahr S., Ragg T.: Automatic Determination of Optimal Network Topologies Based on Information Theory and Evolution, Proceedings of the 23rd EUROMICRO Conference, IEEE, (1997).
8. Schiffmann W., Joost M., Werner R.: Comparison of Optimized Backpropagation Algorithms, Proceedings of the European Symposium on Artificial Neural Networks, Brussels, pp. 97-104, (1993).
9. Mascioli F., Martinelli G.: A Constructive Algorithm for Binary Neural Networks: The Oil Spot Algorithm, IEEE Transactions on Neural Networks, 6(3), pp. 794-797, (1995).
10. Porto V.W., Fogel D.B., Fogel L.J.: Alternative Neural Network Training Methods, IEEE Expert, Vol. 10, No. 4, pp. 16-22, (1995).
11. Topchy A.P., Lebedko O.A.: Neural Network Training by Means of Cooperative Evolutionary Search, Nuclear Instruments & Methods in Physics Research, Section A, Vol. 389, No. 1-2, pp. 240-241, (1997).
12. Polani D., Miikkulainen R.: Fast Reinforcement Learning Through Eugenic Neuro-Evolution, Technical Report AI99-277, Department of Computer Sciences, University of Texas at Austin, (1999).
13. Kitano H.: Designing Neural Networks Using Genetic Algorithms with Graph Generation System, Complex Systems, Vol. 4, No. 4, pp. 461-476, (1990).
14. Price K.V.: Genetic Annealing, Dr. Dobb's Journal, Vol. 220, pp. 127-132, (1994).
15. Stepniewski S.W., Keane A.J.: Pruning Back-propagation Neural Networks Using Modern Stochastic Optimization Techniques, Neural Computing & Applications, Vol. 5, pp. 76-98, (1997).
16. Fullmer B., Miikkulainen R.: Using Marker-Based Genetic Encoding of Neural Networks to Evolve Finite-State Behavior, Proceedings of the First European Conference on Artificial Life, France, pp. 255-262, (1992).
17. Gruau F.: Genetic Synthesis of Modular Neural Networks, in S. Forrest (ed.), Genetic Algorithms: Proceedings of the 5th International Conference, Morgan Kaufmann, (1993).
18. Merril J.W.L., Port R.F.: Fractally Configured Neural Networks, Neural Networks, Vol. 4, No. 1, pp. 53-60, (1991).
19. Kim H.B., Jung S.H., Kim T.G., Park K.H.: Fast Learning Method for Back-Propagation Neural Network by Evolutionary Adaptation of Learning Rates, Neurocomputing, Vol. 11, No. 1, pp. 101-106, (1996).
20. Goldberg D.E.: Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, (1989).
