Artificial Neural Networks Optimization by means of Evolutionary Algorithms

I. De Falco(1), A. Della Cioppa(1), P. Natale(2) and E. Tarantino(1)

(1) Research Institute on Parallel Information Systems, National Research Council of Italy (CNR), Via P. Castellino 111, Naples, Italy.
(2) Department of Mathematics and Applications, University of Naples "Federico II", Monte S. Angelo, Via Cintia, Naples, Italy.
Keywords: Evolutionary Algorithms, Breeder Genetic Algorithms, Artificial Neural Networks.
Abstract. In this paper Evolutionary Algorithms are investigated in the field of Artificial Neural Networks. In particular, Breeder Genetic Algorithms are compared against Genetic Algorithms in simultaneously facing (i) the optimization of the design of a neural network architecture and (ii) the choice of the best learning method, for nonlinear system identification. The performance of the Breeder Genetic Algorithms is further improved by a fuzzy recombination operator. The experimental results for the two mentioned evolutionary optimization methods are presented and discussed.
1. Introduction
Evolutionary Algorithms have been applied successfully to a wide variety of optimization problems. Recently a novel technique, the Breeder Genetic Algorithm (BGA) [1, 2, 3], which can be seen as a combination of Evolution Strategies (ESs) [4] and Genetic Algorithms (GAs) [5, 6], has been introduced. BGAs use truncation selection, which is very similar to the (μ, λ)-strategy in ESs, while the search process is mainly driven by recombination, which makes BGAs similar to GAs. In this paper the ability of BGAs in the field of Artificial Neural Networks (ANNs) [7] has been investigated. Differently from GAs, which have been widely applied to design these networks [8], BGAs have not yet been examined on this task. The most popular neural network, the Multi-Layer Perceptron (MLP) [9], for which the best known and most successful training method is Back Propagation (BP) [10], has been considered. An answer must be given on how to construct the network and reduce the learning time. The BP method is a gradient descent method making use of derivative information to descend along the steepest slope and reach a minimum. This technique has two drawbacks: it may get stuck in local minima and it requires the specification of a number of parameters, so that the training phase can take a very long time. In fact, learning neural network weights can be considered a hard optimization problem for which the learning time scales exponentially, becoming prohibitive as the problem size grows [10]. Besides, in any new problem a lot of time can be wasted in finding an appropriate network architecture. It seems natural to devote attention to heuristic methods capable of facing both problems satisfactorily. Several heuristic optimization techniques have been proposed. In most cases the heuristic techniques have been used for the training process [11, 12, 13]. Whitley has faced the above tasks separately [14]. A different approach, in which GAs have been utilized both for the architecture optimization and to choose the best variant of the BP method among four proposed in the literature, is followed in [15]. In [16] GAs and Simulated Annealing have been employed with the aim of optimizing the neural network architecture and training the network with a particular model of BP. Our approach is to use BGAs to perform the network architecture optimization and, at the same time, to choose the best technique to update the weights in BP, optimizing the related parameters. In this way, an automatic procedure to train the neural network and to handle the network topology optimization at the same time is provided. Nonlinear system identification has been chosen as the test problem to verify the efficiency of the proposed approach. This test is representative of a very intriguing class of problems for control systems in engineering and allows significant experiments in combining evolutionary methods and neural networks.
The paper is organized as follows. In section 2 BGAs are briefly described. Section 3 is dedicated to the MLP and the BP methods, while in section 4 an explanation of the aforementioned identification problem is reported. In section 5 some implementation details are outlined. In section 6 the experimental results are presented and discussed. Section 7 contains final remarks and prospects of future work.
2. Breeder Genetic Algorithms

BGAs are a class of probabilistic search strategies particularly suitable for continuous parameter optimization. Differently from GAs, which model natural evolution, they are based on a rational scheme 'driven' by the breeding selection mechanism. This consists in the selection, at each generation, of the best τ elements within the current population of λ elements (τ is called the truncation rate and its typical values lie within the range 10% to 50% of λ). The selected elements are let free to mate (self-mating is prohibited) so that they generate a new population of λ − 1 individuals. The former best element is then inserted in this new population (elitism) and the cycle of life continues. In such a way the best elements are mated together, in the hope that this can lead to a fitter population. These concepts are taken from other sciences and mimic animal breeding. A wide set of appropriate genetic recombination and mutation operators has been defined to take all these aspects into account. Typical recombination operators are the Discrete Recombination (DR), the Extended Intermediate Recombination (EIR) and the Extended Line Recombination (ELR) [2]. As concerns mutation operators, the continuous and discrete mutation schemes (CM) are considered. A comprehensive explanation of these operators can be found in [2]. Moreover, in our approach the fuzzy recombination operator described below has also been considered [17]. Let us consider the genotypes x = {x_1, ..., x_n} and y = {y_1, ..., y_n}, where the generic x_i and y_i are real variables. The probability of obtaining the i-th value of the offspring z is given by a bimodal distribution p(z_i) ∈ {φ(x_i), φ(y_i)}, where φ(r) is a triangular probability distribution with modal values x_i and y_i, with
$$x_i - d\,|y_i - x_i| \le r \le x_i + d\,|y_i - x_i| \qquad \text{and} \qquad y_i - d\,|y_i - x_i| \le r \le y_i + d\,|y_i - x_i|$$

for x_i ≤ y_i, and d is usually chosen in the range [0.5, 1.0]. For our aims the following triangular probability distribution is introduced:

$$\varphi_s(r) = \frac{1}{b}\left(1 - \frac{2\,|s - r|}{b}\right)$$

where s is the centre and b is the base of the distribution.
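To make the operator concrete, the following is a minimal sketch of fuzzy recombination for one offspring, with d as above and the base of each triangle taken as 2d|y_i − x_i|; the function name and the handling of the degenerate case x_i = y_i are choices of this sketch, not part of the original formulation.

```python
import random

def fuzzy_recombination(x, y, d=0.5):
    """Generate one offspring z from real-valued parents x and y.

    For each gene i, one of the two parent values is chosen as the mode of a
    triangular distribution whose half-base is d * |y_i - x_i| (the bimodal
    p(z_i) described in the text), and z_i is sampled from it.
    """
    z = []
    for xi, yi in zip(x, y):
        centre = random.choice((xi, yi))      # either parent value may be the mode
        half_base = d * abs(yi - xi)
        if half_base == 0.0:                  # identical parent genes: copy the value
            z.append(centre)
        else:
            z.append(random.triangular(centre - half_base, centre + half_base, centre))
    return z

# Example: child = fuzzy_recombination([0.2, 1.5, 3.0], [0.4, 1.1, 3.0])
```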
A BGA can be formally described by:

$$BGA = (P^0, \lambda, \tau, R, M, F, T)$$

where P^0 is the initial random population, λ the population size, τ the truncation threshold, R the
recombination operator, M the mutation operator, F the fitness function and T the termination criterion. A general scheme of a BGA is outlined in the following:

Procedure Breeder Genetic Algorithm
begin
  randomly initialize a population of λ individuals;
  while (termination criterion not fulfilled) do
    evaluate goodness of each individual;
    save the best individual in the new population;
    select the τ best individuals;
    for i = 1 to λ − 1 do
      randomly select two elements among the τ;
      recombine them so as to obtain one offspring;
      perform mutation on the offspring;
    od
    update variables for termination;
  od
end
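As an illustration of the scheme above, the following is a minimal BGA loop for a minimization problem; the parameter names and the callables init_individual, fitness, recombine and mutate are placeholders to be supplied by the user, not part of the original formulation.

```python
import random

def breeder_ga(init_individual, fitness, recombine, mutate,
               pop_size=50, trunc_rate=0.3, generations=100):
    """Minimal BGA loop (minimization): truncation selection of the tau best,
    random mating without self-mating, one offspring per mating, 1-elitism."""
    population = [init_individual() for _ in range(pop_size)]
    tau = max(2, int(trunc_rate * pop_size))
    for _ in range(generations):
        population.sort(key=fitness)            # evaluate goodness of each individual
        best = population[0]
        parents = population[:tau]              # select the tau best individuals
        new_population = [best]                 # elitism: keep the former best element
        for _ in range(pop_size - 1):
            p1, p2 = random.sample(parents, 2)  # self-mating is prohibited
            new_population.append(mutate(recombine(p1, p2)))
        population = new_population
    return min(population, key=fitness)
```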
3. Artificial Neural Networks

ANNs represent an important area of research which opens a variety of new possibilities in different fields, including control systems in engineering. MLPs are ANNs consisting of a number of elementary units arranged in a hierarchical layered structure, with each internal unit receiving inputs from all the units in the previous layer and sending outputs to all the units in the following layer. In the reception phase of a unit, the sum of its inputs is evaluated. Except in the input layer, each incoming signal is the synaptically weighted output of another unit in the network. Let us consider an MLP composed of L layers, with each layer l containing N^l neurons. Moreover, let us suppose that the neurons in the first layer receive N^0 = m inputs each, every such neuron receiving the same number of inputs, and that the output layer contains N^L = n neurons. The total activation of the i-th neuron in the hidden layer l, denoted by h^l_i, is calculated as follows:
$$h^l_i(\mathbf{w}^l_i, \mathbf{y}^{l-1}) = \sum_{j=1}^{N^{l-1}} w^l_{ij}\, y^{l-1}_j + \theta_i \qquad i = 1,\dots,N^l \quad l = 1,\dots,L \qquad (1)$$
where w^l_{ij} represents the connection weight between the j-th neuron in layer l − 1 and the i-th neuron in layer l, y^{l-1}_j is the input arriving from layer l − 1 to the i-th neuron, and θ_i is the threshold of the neuron. In the transmission phase, the output of the neuron is an activation value y^l_i = f(h^l_i), computed as a function of its total activation. In the experiments performed, three forms have been used for this function; in particular the 'sigmoid':
$$y^l_i = f(h^l_i) = \frac{1}{1 + e^{-k h^l_i}} \qquad k > 0 \qquad (2)$$

the related form symmetric about the h-axis:

$$y^l_i = f(h^l_i) = \tanh h^l_i \qquad (3)$$

and a further semi-linear function introduced here:

$$y^l_i = f(h^l_i) = \begin{cases} 1 & h^l_i > 1 \\ h^l_i & -1 \le h^l_i \le 1 \\ -1 & h^l_i < -1 \end{cases} \qquad (4)$$
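For reference, the three activation functions (2)-(4) can be written as small helpers; the slope k of the sigmoid is left as a parameter, and the function names are illustrative.

```python
import math

def sigmoid(h, k=1.0):
    """Eq. (2): logistic sigmoid with slope k > 0, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k * h))

def tanh_activation(h):
    """Eq. (3): hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(h)

def semi_linear(h):
    """Eq. (4): identity clipped to [-1, 1]."""
    return max(-1.0, min(1.0, h))
```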
The learning phase consists in finding an appropriate set of weights from the presentation of the inputs and the corresponding outputs. The most popular method for this phase is BP. The system is trained by using a set containing p example patterns. When a single pattern is presented to the network, the output of the first layer is evaluated. Then the neurons in the hidden layers compute their outputs, propagating them until the last layer is reached. The output of the network is given by the neurons at this level, i.e. y^L_i for i = 1, ..., N^L. In supervised learning, the goal is to minimize, with respect to the weight vector, the sum E of the squares of the errors observed at the output units after the presentation of the chosen patterns:
$$E = \sum_{s=1}^{p} E^s = \frac{1}{2} \sum_{s=1}^{p} \sum_{i=1}^{n} \left(y^{s,L}_i - d^s_i\right)^2 \qquad (5)$$
where d^s_i represents the desired output at the i-th output unit in response to the s-th training case. The weight changes are chosen so as to reduce the output error by an approximation to gradient descent, until an acceptable value is attained. The last step is to "change" the weights by a small amount in the direction which reduces the error, i.e. the direction opposite to the partial derivative of (5) with respect to the weights in the hidden layers. Hence the following update procedure is introduced:
$$w^l_{ij}(t+1) = w^l_{ij}(t) - \eta\, \frac{\partial E(t)}{\partial w^l_{ij}(t)} \qquad (6)$$
where η is the training balance, which is conventionally chosen in the interval [0, 1]. There are a number of other heuristic modifications to the basic approach which may speed up the training time or enable completion of the learning process. These approaches propose different ways to change the weights. The rationale behind the suggested modifications can be found in the reported references. In [9] a momentum term α ∈ [0, 1] is suggested to be included in (6), so that the formula becomes:
$$\Delta w^l_{ij}(t+1) = -\eta\, \frac{\partial E(t)}{\partial w^l_{ij}(t)} + \alpha\, \Delta w^l_{ij}(t) \qquad (7)$$
where Δw^l_{ij}(t) is the previous weight change. In [18] an alternative strategy, known as exponential smoothing, is proposed, which modifies (6) in the following way:
$$\Delta w^l_{ij}(t+1) = -(1-\alpha)\,\eta\, \frac{\partial E(t)}{\partial w^l_{ij}(t)} + \alpha\, \Delta w^l_{ij}(t) \qquad (8)$$
Besides, other modifications to the basic backpropagation method have been introduced. A statistical technique presented in [18] leads to the following equation:
$$\Delta w^l_{ij}(t+1) = \beta\left(-(1-\alpha)\,\eta\, \frac{\partial E(t)}{\partial w^l_{ij}(t)} + \alpha\, \Delta w^l_{ij}(t)\right) + (1-\beta)\,\omega \qquad (9)$$
where ω is determined by a Cauchy distribution and its usage is governed by the Simulated Annealing technique. Thus, an initial temperature t_0 must be fixed together with the value β, which simply extends the dimension of the training parameter space. Since the values of the parameters involved depend on the problem under examination, it is not possible to decide a priori, by a qualitative analysis, which of the proposed variants is the best.
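The three update rules (7)-(9) can be summarized, for a single weight, as in the following sketch; the way the Cauchy term ω is sampled and scaled by the current temperature is a simplification assumed here, since the annealing schedule is not detailed in the text.

```python
import math
import random

def delta_momentum(grad, prev_delta, eta, alpha):
    """Eq. (7): gradient step plus a momentum fraction of the previous change."""
    return -eta * grad + alpha * prev_delta

def delta_exp_smoothing(grad, prev_delta, eta, alpha):
    """Eq. (8): exponential smoothing between the gradient step and the previous change."""
    return -(1.0 - alpha) * eta * grad + alpha * prev_delta

def delta_statistical(grad, prev_delta, eta, alpha, beta, temperature):
    """Eq. (9): smoothed step combined with a random Cauchy term omega,
    here scaled by the annealing temperature (an assumption of this sketch)."""
    omega = temperature * math.tan(math.pi * (random.random() - 0.5))  # Cauchy sample
    return beta * (-(1.0 - alpha) * eta * grad + alpha * prev_delta) + (1.0 - beta) * omega
```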
4. Neural Networks for System Identification

Neural networks, thanks to their ability to learn, approximate and classify examples, represent an effective response to the modern demand for complex control systems with a high degree of precision. It is desirable that a trained network produce correct responses not only for the training patterns but also for hitherto unseen data. Therefore, in the simplest case, to assess the generalization ability of a trained network, a further data set is computed in the same way as the training set but for different data points, and the unseen patterns are assumed to be located in the region of the evaluated data. In system identification, to model the input-output behavior of a dynamic system, the network is trained using input-output data and the weights are adjusted by the BP algorithm. The network is provided with information related to the system history so that it can adequately represent the dynamic behavior of the system in fixed ranges of a particular application. The training can be performed by observing the input-output behavior of the system, with the neural network receiving the same input as the system together with a fixed number of delayed inputs and outputs. The system output is the desired output of the network. The system output and the network output are compared to drive the weight updates so as to reduce the error until the required precision is reached. Fig. 1 illustrates the principle of modeling a nonlinear SISO (Single Input Single Output) system by using a neural network, assuming that the output depends only on the input and the output at the previous time step. Thus, one assumes that the unknown system is discrete in time and continuous with respect to the input. It is evident that the identification is performed off-line, because the neural network operation is relatively slow.
Fig. 1. Modeling a nonlinear system with the MLP (z^{-1} denotes the time delay of a unit).
Fig. 2. Response surface of the nonlinear system (10) to be modeled by the neural network.

This problem has been chosen as a benchmark to compare GAs and BGAs both for its applicative interest and because the large size of the search space allows a statistically significant test of the relative performance of different algorithms. To verify the performance of a neural network optimized by GAs and BGAs for nonlinear system identification, the following test problem has been chosen:
$$y(k) = 2.5\, y(k-1)\, \sin\!\left(e^{-u^2(k-1) - y^2(k-1)}\right) + u(k-1)\left[1 + u^2(k-1)\right] \qquad (10)$$

where u and y represent the input and the output respectively [19]. Eq. (10) shows that the output y(k) depends only on the previous input u(k − 1) and the previous output y(k − 1). This problem can be represented in a three-dimensional space as illustrated in Fig. 2.
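A direct transcription of (10) makes the dependence on u(k − 1) and y(k − 1) explicit; the function name plant is, of course, an arbitrary choice of this sketch.

```python
import math

def plant(u_prev, y_prev):
    """Nonlinear SISO benchmark of Eq. (10): returns y(k) given u(k-1) and y(k-1)."""
    return (2.5 * y_prev * math.sin(math.exp(-u_prev ** 2 - y_prev ** 2))
            + u_prev * (1.0 + u_prev ** 2))

# Example: y1 = plant(0.5, 0.0)   # one step of the system starting from y(0) = 0
```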
5. Multi-Layer Perceptron Optimization by Evolutionary Algorithms

The basic idea is to provide an automatic procedure to find the most appropriate neural network structure. The optimization of an MLP which must be trained to solve a problem P is characterized by the need to determine:

1. The architecture:
   (a) the number of hidden layers NHL,
   (b) the number of nodes NHl for each layer l.
2. The activation function f^l_i to be used for each layer l.
3. The presence or absence of the bias b.
4. The most appropriate technique for weight change TRM and the training balance η. On the basis of this technique the following parameters are to be fixed:
   (a) the momentum term α,
   (b) the initial temperature t_0,
   (c) the statistical training term β.

We wish to point out that in our approach the training phase is performed with the BP algorithm. Analytical procedures to determine the exact configuration of these factors for a given application do not exist. Each time, it is necessary to fix a particular configuration on the basis of empirical considerations, then to train the network with the patterns and to evaluate its quality. Algorithmic analysis shows that this search has exponential complexity if performed with an exhaustive technique. Furthermore, in addition to the time complexity, there is the possibility that the training fails because it runs into a configuration which is a local minimum. We have utilized heuristic methods to reduce the search time and the probability of getting stuck in local minima. Namely, GAs and BGAs have been used to determine the appropriate set of parameters listed above. The aim is to provide the most general possible technique to determine the network structure. At the end of the evolutionary process, not only the best network architecture for a particular application but also the trained network will be provided.
5.1. Encoding

The neural network is defined by a "genetic encoding" in which the genotype is the encoding of the different characteristics of the MLP and the phenotype is the MLP itself. Therefore, the genotype contains the parameters related to the network architecture, i.e. NHL and NHl, and other genes representing the activation function type f^l_i, the different BP methods and the related parameters. For both of the Evolutionary Algorithms considered, the chromosome structure x = (x_1, ..., x_n), constituted by 17 loci, is reported in Table 1. Each allele is defined in one of the subsets A_i, i ∈ {1, ..., 7}, reported in the third row of Table 1.
Locus  x1   x2  x3  x4  x5   x6  x7  x8   x9   x10  x11  x12  x13  x14  x15  x16  x17
Gene   NHL  η   α   b   TRM  β   t0  fi0  NH1  fi1  NH2  fi2  NH3  fi3  NH4  fi4  fi5
Set    A1   A2  A2  A3  A4   A2  A5  A7   A6   A7   A6   A7   A6   A7   A6   A7   A7

Table 1. The chromosome structure.

Since in most applications it is sufficient for a network to have few hidden layers, the value of NHL has been allowed to vary from 1 to 4, while the maximum value for NHl has been fixed equal to 20. Namely, the loci are defined within the following subsets: A1 = {1, ..., 4}; A2 = [0, 1] ⊂ IR; A3 = {0, 1}; A4 = {1, 2, 3} with {1 ≡ momentum, 2 ≡ exponential smoothing, 3 ≡ statistical technique}; A5 = [0, 100] ⊂ IR; A6 = {1, ..., 20}; A7 = {1, 2, 3} with {1 ≡ f1, 2 ≡ f2, 3 ≡ f3}, where the locus defined on A4 selects the way to change the weights, while the genes which select the activation function can assume the value f1 for the sigmoid, f2 for tanh and f3 for the semi-linear function; it is worth noting that the activation function can be different for each layer.
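A sketch of how such a chromosome could be drawn and decoded is given below, assuming the locus layout of Table 1; the mapping of the activation genes fi0-fi5 to the layers actually present, and all identifier names, are illustrative choices rather than part of the original encoding.

```python
import random

ACTIVATIONS = {1: "sigmoid", 2: "tanh", 3: "semi-linear"}   # f1, f2, f3
TECHNIQUES  = {1: "momentum", 2: "exponential smoothing", 3: "statistical"}

def random_chromosome():
    """Draw one individual according to the subsets A1..A7."""
    x = [random.randint(1, 4),        # x1:  NHL, number of hidden layers (A1)
         random.uniform(0.0, 1.0),    # x2:  eta, training balance (A2)
         random.uniform(0.0, 1.0),    # x3:  alpha, momentum term (A2)
         random.randint(0, 1),        # x4:  b, bias absent/present (A3)
         random.randint(1, 3),        # x5:  TRM, weight-change technique (A4)
         random.uniform(0.0, 1.0),    # x6:  beta, statistical training term (A2)
         random.uniform(0.0, 100.0),  # x7:  t0, initial temperature (A5)
         random.randint(1, 3)]        # x8:  fi0, activation of the first level (A7)
    for _ in range(4):                # x9..x16: (NHl, fil) for the four possible hidden layers
        x += [random.randint(1, 20), random.randint(1, 3)]
    x.append(random.randint(1, 3))    # x17: fi5, activation of the last level (A7)
    return x

def decode(x):
    """Keep only the first NHL of the four encoded hidden layers."""
    nhl = x[0]
    hidden = [(x[8 + 2 * i], ACTIVATIONS[x[9 + 2 * i]]) for i in range(nhl)]
    return {"hidden_layers": hidden,
            "bias": bool(x[3]),
            "first_activation": ACTIVATIONS[x[7]],
            "last_activation": ACTIVATIONS[x[16]],
            "technique": TECHNIQUES[x[4]],
            "eta": x[1], "alpha": x[2], "beta": x[5], "t0": x[6]}
```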
5.2. The fitness function

To evaluate the goodness of an individual, the network is trained with a fixed number of patterns and then evaluated according to determined parameters. The parameters which seem to describe best the goodness of a network configuration are the mean square error E at the end of the training and the number of epochs ep needed for the learning. Clearly it is desirable to attain values as low as possible for both E and ep: in fact, a neural network must learn as fast as possible (small values for ep) and with a good approximation of the desired output (small values for E). It is necessary to establish an upper limit ep_max on the number of epochs ep utilized by the network for the training, thus 0 < ep ≤ ep_max. Moreover, it is desirable that 0 < e_min ≤ E, where e_min represents the minimum error required. Since the heuristic techniques are implemented for minimization, the fitness function has been chosen as a function which increases when ep and E increase. Specifically, the fitness function is:
$$F(\mathbf{x}) = \frac{ep}{ep_{max}} + E \qquad (11)$$
The choice of this fitness function is justified by the fact that it takes into account both E and the learning speed ep, weighting these contributions in a dynamic manner. Note that e_min is generally chosen equal to 10^{-a} with a > 2.
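Written out, (11) amounts to the following one-liner; the argument names are illustrative.

```python
def fitness(ep, error, ep_max):
    """Eq. (11): lower is better; rewards fast learning (small ep) and a small final error E."""
    return ep / ep_max + error
```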
5.3. The optimization algorithm for the neural network structure

Let MLP(x_i) be the algorithm for training the MLP related to the individual x_i representing a neural network configuration. The general schema for the BGA is the following (for the GA, which produces two offspring per recombination, the inner cycle is repeated (λ − 1)/2 times):

Given a pattern set P for the network training;
Procedure Breeder Genetic Algorithm
begin
  randomly initialize a population of λ neural network structures;
  while (termination criterion not fulfilled) do
    train each x_i by means of MLP(x_i) on P;
    evaluate the trained network x_i;
    save the best trained network in the new population;
    select the τ best neural network configurations;
    for i = 1 to λ − 1 do
      randomly select two structures among the τ;
      recombine them so as to obtain one offspring;
      perform mutation on the offspring;
    od
    update variables for termination;
  od
end

The procedure is constituted by an evolutionary algorithm, a BGA, and by a procedure for training the MLP encapsulated in the main program. This procedure returns the values for the fitness evaluation. Its action is completely transparent to the main program, which can only evaluate the goodness of the result.
6. Experimental results

Since the choice of the best selection method and genetic operators would require very long execution times if the preliminary tests were performed directly on the nonlinear system identification, we have decided to use as benchmark a subset of the classical test functions known as F6, F7 and F8 [20]. For the GA these experiments have been performed to establish the best selection method. In particular, truncation selection has turned out to perform better than the proportional, the tournament and the exponential selections. For the BGA the objective of the tests has been to determine the best combination of mutation and recombination operators. The experimental results have shown that the discrete mutation in combination with the fuzzy recombination operator leads to the best performance. It has to be pointed out that the fuzzy recombination allows a reduction of about 40% in convergence time with respect to the extended intermediate and the extended line recombination operators. Other preliminary tests for the MLP have been conducted by using as application the well-known "exclusive OR" (XOR) function. This task is commonly considered an initial test to evaluate the ability of a network to classify data. The GA has a population of 30 individuals and it utilizes truncation selection with a 1-elitist strategy. The truncation percentage is set equal to 30%. The algorithm has been executed 10 times, fixing as termination criteria ep_max = 500 and e_min = 10^{-8}. A one-point crossover operator with probability p_c = 0.6 and a mutation operator with probability p_m = 0.06 have been employed.
The minimum value for the error has been obtained at the second generation, demonstrating the effectiveness of the proposed approach. This result is due to the presence of good individuals in the initial population, which can be explained by the fact that the fixed number of individuals is probably large for the problem under consideration. Moreover, the result achieved implies the presence in the population of neural network structures able to learn in a number of epochs lower than ep_max. The best individual in the population has learned in 80 epochs. The architecture of this individual is 2-16-5-1, without bias, and the momentum technique with η = 4.72656·10^{-1} and α = 4.375·10^{-1} has turned out to be the best. In ninety per cent of the cases the momentum technique with no bias has given the best result. The statistical and the exponential techniques produce individuals with bad fitness values, which disappear during the evolution. For the XOR we have only established that the network has learned the patterns, as the network has been trained with the complete space of the four possible input patterns. The application of BGAs to this problem has provided results which are not much different, because the test is very simple. For this reason we consider a more complex problem to compare GAs and BGAs. The problem derives from a real-world application in engineering: the nonlinear system identification described by (10). The first problem is to determine the size of the pattern set used to train the network. Since (10) is a continuous function with respect to u, the number of samples sufficient to approximate it with the desired precision must be found. In fact, a small set of patterns may not be sufficient, while a large set requires a long training phase. The size of the pattern set A has been fixed equal to 100, which has turned out to be a number of samples sufficient to train the network. The system has been excited by setting u to be a random signal uniformly distributed between −2.0 and 2.0. Considering that the output at time k = 0 is equal to zero, Eq. (10) has been evaluated on the set of patterns. In this way, we have obtained the triples (u(k), y(k), y(k + 1)), in which the first two elements represent the input and the previous output of the system and the third the desired output. Moreover, a pattern set B constituted by 500 samples, different from the previous ones, has been determined in order to verify how much the network has learned about the SISO system. To evaluate the goodness attained by the trained networks, the network has been excited by the first two elements of the triples of set B and the output ŷ(k + 1) computed by the network has been compared with the desired output, evaluating the value of the error. For the GA the number of individuals in the population has been fixed equal to 50 and as termination criterion we have established a value of 100 for the maximum number of generations. It utilizes truncation selection with a 1-elitist strategy and a truncation percentage equal to 30%. The operators are the one-point crossover with p_c = 0.8 and the mutation with p_m = 0.06. The value of the error e_min has been set equal to 10^{-3}. We have executed ten trials of the algorithm. It has been necessary to train the neural networks for 1000 epochs to allow the GA to converge in a not too high number of generations, so as to reduce the network error at the end of the training phase.
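A sketch of how the pattern sets A and B described above can be built, reusing the plant function from the sketch of Eq. (10) in section 4, is the following; the seeding and the function name are illustrative choices.

```python
import random

def make_patterns(n_samples, seed=None):
    """Triples (u(k), y(k), y(k+1)) obtained by driving the system of Eq. (10)
    with a uniformly distributed random input u in [-2, 2], starting from y(0) = 0."""
    rng = random.Random(seed)
    y, patterns = 0.0, []
    for _ in range(n_samples):
        u = rng.uniform(-2.0, 2.0)
        y_next = plant(u, y)              # plant() as sketched after Eq. (10)
        patterns.append((u, y, y_next))   # network inputs: u(k), y(k); target: y(k+1)
        y = y_next
    return patterns

# set A (training, 100 samples) and set B (verification, 500 samples)
training_set = make_patterns(100, seed=1)
verification_set = make_patterns(500, seed=2)
```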
As concerns the techniques for the weight variation during the training phase, the best method found is that based on the momentum. The best architecture achieved is 2-19-19-18-1, the technique for changing the weights is that of the momentum with η = 1.95312·10^{-2} and α = 4.98047·10^{-2}, and the bias is present. The value of the error obtained is 7.89309·10^{-3}. The activation functions determined by the evolution from the first to the last level are respectively: sigmoid, semi-linear, tanh, tanh and semi-linear. In Fig. 3 it is possible to observe the response surface of the MLP achieved by the evolutionary process, after the verification of the network performed on the pattern set B. Furthermore, the error surface, computed as the difference between the network output ŷ(k + 1) and the system output y(k + 1), is shown. The results obtained with the BGA are better than those achieved by the GA. The number of individuals in the population and the termination criterion for the BGA have been the same used for the GA. The truncation rate is 30% and e_min = 10^{-4}. The discrete mutation with p_m = 0.06 has been employed. It has to be pointed out that the performance of the BGA has allowed the choice of a value of e_min lower than that of the GA. The evolution terminates after 50 generations, half of those needed by the GA. In particular, the results for the BGA have been improved in terms of convergence speed by using the fuzzy recombination instead of the extended intermediate and the extended line recombination operators. The best architecture achieved is 2-20-16-17-1, the technique for changing the weights is that of the momentum with η = 1.57313·10^{-2} and α = 7.03549·10^{-1}. The value of the error obtained is 5.21883·10^{-4}. The activation functions determined by the evolution from the first to the last level are
respectively: sigmoid, tanh, tanh, sigmoid and semi-linear.

Fig. 3. The response surface of the MLP on the pattern set B and the error surface obtained by the GA and the BGA.

It is possible to note that the BGA converges faster than the GA, thus reducing the time needed to complete the evolutionary process. Suffice it to say that the GA has needed on average six days to terminate its execution on a RISC 6000 workstation, while the BGA has taken on average five days when the discrete and the extended line recombination operators have been employed, and three days with the fuzzy operator. In Fig. 3 it is also possible to observe the approximation of function (10) performed by the neural network at the end of the evolutionary process of the BGA. As can be noted, in this case the error is much lower than that achieved by the GA. Since the set B utilized to verify the effectiveness of the neural network optimization performed by the BGA is larger than the set A used for the training, it is evident that not only has the trained network learned the patterns, but it has also shown the ability to generalize, being able to appropriately simulate the characteristics of the system to be identified.
7. Conclusions

In this paper the effectiveness of Evolutionary Algorithms in the field of ANNs has been demonstrated. The BGAs have been compared against the GAs in facing the simultaneous optimization of the design of the architecture and the choice of the best learning method. In particular, the fuzzy recombination operator, which improves the performance of the BGA, has been employed. The test problem used is a nonlinear system identification task. The experimental results have proved the superiority of the BGA with respect to the GA in terms of both solution quality and speed of convergence. The BGA method can also be easily distributed among several processors, as it operates on populations of solutions that may be evaluated concurrently. Further work will concern the implementation of a parallel version of the proposed evolutionary approach, with the aim of improving the performance.
References

[1] Mühlenbein, H., Schlierkamp-Voosen, D., 1993, Analysis of Selection, Mutation and Recombination in Genetic Algorithms, Neural Network World, 3, pp. 907-933.
[2] Mühlenbein, H., Schlierkamp-Voosen, D., 1993, Predictive Models for the Breeder Genetic Algorithm I. Continuous Parameter Optimization, Evolutionary Computation, 1(1), pp. 25-49.
[3] Mühlenbein, H., Schlierkamp-Voosen, D., 1994, The Science of Breeding and its Application to the Breeder Genetic Algorithm, Evolutionary Computation, 1, pp. 335-360.
[4] Bäck, T., Hoffmeister, F., Schwefel, H.-P., 1991, A Survey of Evolution Strategies, Proceedings of the Fourth International Conference on Genetic Algorithms, San Mateo CA, USA, Morgan Kaufmann, pp. 2-9.
[5] Holland, J.H., 1975, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.
[6] Goldberg, D.E., 1989, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, Massachusetts.
[7] Hertz, J., Krogh, A., Palmer, R.G., 1991, Introduction to the Theory of Neural Computation, Addison-Wesley Publishing.
[8] Kuscu, I., Thornton, C., 1994, Designing Neural Networks using Genetic Algorithms: Review and Prospect, Cognitive and Computing Sciences, University of Sussex.
[9] Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986, Learning Internal Representations by Error Propagation, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, VIII, MIT Press.
[10] Rumelhart, D.E., McClelland, J.L., 1986, Parallel Distributed Processing, I-II, MIT Press.
[11] Montana, D.J., Davis, L., 1989, Training Feedforward Neural Networks using Genetic Algorithms, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 762-767.
[12] Heistermann, J., 1990, Learning in Neural Nets by Genetic Algorithms, Parallel Processing in Neural Systems and Computers, North-Holland, pp. 165-168.
[13] Battiti, R., Tecchiolli, G., 1995, Training Neural Nets with the Reactive Tabu Search, IEEE Trans. on Neural Networks, 6(5), pp. 1185-1200.
[14] Whitley, D., Starkweather, T., Bogart, C., 1990, Genetic Algorithms and Neural Networks: Optimizing Connections and Connectivity, Parallel Computing, 14, pp. 347-361.
[15] Reeves, C.R., Steele, N.C., 1992, Problem-solving by Simulated Genetic Processes: a Review and Application to Neural Networks, Proceedings of the Tenth IASTED Symposium on Applied Informatics, pp. 269-272.
[16] Stepniewski, S., Keane, A.J., Pruning Backpropagation Neural Networks using Modern Stochastic Optimization Techniques, to appear in Neural Computing & Applications.
[17] Voigt, H.M., Mühlenbein, H., Cvetkovic, D., 1995, Fuzzy Recombination for the Continuous Breeder Genetic Algorithm, Proceedings of the Sixth International Conference on Genetic Algorithms, Morgan Kaufmann.
[18] Wasserman, P.D., 1989, Neural Computing: Theory and Practice, Van Nostrand Reinhold, New York.
[19] Yaw-Terng Su, Yuh-Tay Sheen, 1992, Neural Networks for System Identification, Int. J. of Systems Science, 23(12), pp. 2171-2186.
[20] Mühlenbein, H., Schomisch, M., Born, J., 1991, The Parallel Genetic Algorithm as Function Optimizer, Parallel Computing, 17, pp. 619-632.