ENZO-M - a Hybrid Approach for Optimizing Neural Networks by Evolution and Learning

Heinrich Braun and Peter Zagorski
Institute for Logic, Complexity and Deductive Systems, University of Karlsruhe, D-76128 Karlsruhe, Germany
Internet: [email protected], Tel.: +49-721-608-4211
Abstract. ENZO-M combines two successful search techniques operating on two different timescales: learning (gradient descent) for the fine-tuning of each offspring, and evolution for coarse optimization steps on the network topology. Our evolutionary algorithm is therefore a metaheuristic built on the best available local heuristic. By training each offspring with fast gradient methods, the search space of the evolutionary algorithm is reduced considerably, to the set of local optima. Using the parental weights to initialize the weights of each offspring, gradient descent (learning) is sped up by 1-2 orders of magnitude, and the expected value of the local minimum (the fitness of the trained offspring) lies far above the mean value for randomly initialized offspring. Thus ENZO-M takes full advantage both of the knowledge transferred from the parental genstring by the evolutionary search and of the efficiently computable gradient information used for fine-tuning. Through the cooperation of the discrete mutation operator and the continuous weight decay method, ENZO-M impressively thins out the topology of feedforward neural networks. In particular, ENZO-M also tries to cut off the connections to possibly redundant input units; it therefore not only supports the user in the network design but also recognizes redundant input units.
1 Introduction

Our evolutionary algorithm is a hybrid approach in which selection and reproduction correspond to a genetic algorithm, but each offspring is tuned by an efficient local optimization heuristic before its fitness is evaluated, i.e. an inner optimization loop of the local heuristic is nested within the outer evolutionary loop. Our approach can therefore be characterized as a metaheuristic (the outer loop) using an efficient local optimization heuristic as a basic operator (the inner loop). We use the term "evolutionary algorithm" because our algorithm is neither a genetic algorithm [9] nor an evolution strategy [11], [16] or evolutionary program [8], [7], but shares the same common paradigm. This approach has also been applied successfully to other difficult combinatorial optimization problems [1], [2]. For an efficient recombination (crossover) operator it is crucial that the exchanged substrings (genes) have similar semantics. Unfortunately, this is
not achieved by the standard operators for the strong representation scheme, due to the problem of permutable internal representations [4]. That is, two successfully trained networks may extract the same features and use the same internal neural representations to solve the given task, but distribute these internal representations in a totally different way among their hidden neurons, i.e. the contributions of the hidden neurons to the overall solution may be internally permuted. This problem is therefore also referred to as the phenomenon of different structural mappings coding the same functional mapping. In our former network evolution system ENZO [4] we presented a first successful solution to this problem by additionally optimizing the total connection length in a two-dimensional layout. Although this approach was very promising, the global optimum obeying the additional optimization criterion will in general not be the global optimum for the original task, because long connections are avoided even though they are sometimes useful. In order to clarify the effects, and since the mutation operator is by itself sufficient for an efficient evolutionary search [7], [16], we omitted the crossover operator in ENZO-M and thereby circumvented the above-mentioned problem. In addition to ENZO, we use shortcut connections in the topology, a faster gradient heuristic, RPROP [6], specially adapted to the evolutionary algorithm, weight decay methods for improving generalization, and a more sophisticated mutation operator. The reintroduction of the crossover operator is a current research issue and will be tackled in a forthcoming paper.
For evaluating the performance of our evolutionary approach we tested ENZO-M on three different kinds of applications: (1) three smaller benchmark problems allowing a statistically significant evaluation of the impact of our design decisions (a digit recognition problem, the T-C problem and the chess board problem), (2) a real-world problem (the classification of medical data for a thyroid gland diagnosis [14], [13]) and (3) a hard touchstone, where the best handcrafted network uses a (120-60-20-2-1) topology with over 5600 connections (a scoring function for the two-player game Nine Men's Morris [3]). At least the last of these is much more complex than the problems considered in comparable publications so far.
2 Optimization of feedforward neural networks

There are two levels of optimization problems: the optimization of the connectivity (design problem) and the optimization of the weights (learning problem). In most applications the learning problem is solved by a gradient descent method such as backpropagation. But there are some demanding applications which cannot be solved by gradient descent, e.g. recurrent neural networks. Since evolutionary algorithms use no gradient information, but only a measure of the behaviour (fitness) of each neural network, they can handle even such problems. On the other hand, they are inferior, i.e. in particular computationally more expensive, when gradient heuristics can already solve the learning problem sufficiently well.
Evolutionary algorithms designed to solve the design problem have to use a learning algorithm for evaluating the fitness of each offspring. As learning algorithm, gradient descent can be used as well as evolutionary algorithms (see above). Since efficient gradient descent algorithms such as RPROP [6] solve the learning problem for feedforward networks more efficiently, we prefer gradient descent for this task. But instead of training the offspring from scratch, we propose to use the weight matrix of the trained parent network for constructing the weight initialization. By that, learning is sped up by 1-2 orders of magnitude and the expected fitness of the offspring is far above the average for its topology. For the representation of the network, two types of schemes are used: strong and weak. The representation scheme is called strong if the chromosome is a blueprint of the network, i.e. every connection corresponds directly to a substring of the chromosome, whereas the scheme is called weak if the chromosome embeds just rules for constructing the network. Obviously, the latter seems biologically more plausible, since natural chromosomes also do not describe the exact connectivity but only the coarse structure. But for most application-oriented problems the networks are not so large that the evolution of the coarse structure suffices; rather, a detailed, fully specified solution is required. Therefore we use a strong representation scheme in our approach: a linear representation of the weight matrix with additional control information for each connection (weight). As optimization criterion, any computable fitness function may be used, e.g. learning error, test set error (generalization capability), network size, or a weighted sum of several single criteria. Additionally, we want to remark that there may be a tradeoff between network size and generalization capability.
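A weighted-sum fitness criterion of the kind mentioned above can be sketched as follows; the function name and the particular weighting coefficients are our own illustrative choices, not values from the paper.

```python
# A minimal sketch of a weighted-sum fitness criterion: learning error,
# test set error and network size combined linearly (lower is better).
# The coefficients are hypothetical and would be set per problem domain.
def fitness(learning_error, test_error, n_connections,
            w_learn=1.0, w_test=1.0, w_size=0.01):
    """Weighted sum of several single optimization criteria."""
    return (w_learn * learning_error
            + w_test * test_error
            + w_size * n_connections)

# A small net trading a little accuracy for far fewer connections can
# still win under such a criterion:
small = fitness(0.05, 0.10, 60)
large = fitness(0.03, 0.08, 300)
```

With `w_size = 0.01`, the 60-connection network scores 0.75 and the 300-connection network 3.11, so the smaller one is preferred despite its slightly larger errors; this is exactly the size/accuracy tradeoff the text remarks on.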
Although it is widely believed that the smallest network solving the learning task has the best generalization capability, the contrary may be true (e.g. confer our experimental investigations or see [3]). Moreover, there may be a tradeoff between the accuracy of the fitness function and its efficient computability. Especially for optimizing the generalization capability as a measure of the expected error on untrained inputs, it may be computationally expensive to average over all possible inputs. Therefore a small but representative test set has to be used instead. Using a test set can cause overfitting to the test set (over-evolving effect), similar to the well-known overfitting to the training set (over-learning effect): the generalization capability can deteriorate due to better fitting of the test set caused by the selection pressure of the fitness criterion. In our experimental investigations (cf. section 4.2) we could indeed ascertain that fitting the test set is not the very same as improving the generalization capability, since the test set error decreased significantly faster than the overall error (i.e. the generalization capability). But we could not ascertain an increase of the overall error as an overfitting effect. Since every offspring is tuned only on the training set (totally disjoint from the test set), this possible overfitting on the evolutionary level is much more subtle (cf. fig. 4). Nevertheless, whenever this effect makes trouble, we can use a dynamically
changing test set. Finally, we want to add some remarks on the optimization of feedforward networks. The goal of our system is to provide a powerful design tool for a systematic search for highly optimized neural networks according to user-defined specifications. Therefore, the user should be able to constrain the search by specifying both the maximal number of hidden layers and, in each layer, the maximal number of hidden units. We call the limiting case of this specification the 'maximal network'; in particular, the optimization task is to find an optimal subnetwork of this maximal network. This approach has two advantages. Firstly, although the represented neural networks may dynamically grow or shrink, a static representation scheme is possible by using a representation of the maximal network and enhancing the representation of each connection by an 'existing' bit. Secondly, the depth of the network is restricted (by the depth of the maximal network), which may guarantee a fast parallel evaluation. Furthermore, our system ENZO-M handles networks with shortcut connections, since these do not slow down the parallel evaluation time but may significantly reduce the network complexity: e.g. the well-known XOR problem can be solved with 1 hidden neuron and 4 connections using shortcuts instead of 2 hidden neurons and 6 connections without shortcuts.
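The static genome with 'existing' bits can be sketched as follows; the matrix encoding and all names are our own illustration of the strong representation scheme, not the system's actual data layout.

```python
import numpy as np

# Sketch of the strong representation scheme: a genome sized for the
# maximal network, where every connection of the maximal topology carries
# a weight plus an 'existing' bit.  Offspring can grow or shrink without
# changing the representation length.
rng = np.random.default_rng(0)
n_units = 8                                    # units of the maximal network
weights = rng.normal(0.0, 0.1, (n_units, n_units))
exists = rng.random((n_units, n_units)) < 0.5  # initial connection density 50%

# The effective network only uses connections whose bit is set; a deleted
# connection keeps its slot in the genome but contributes nothing.
effective = np.where(exists, weights, 0.0)
```

Because the genome length is fixed by the maximal network, mutation only flips bits and perturbs weights in place, which keeps the chromosome layout stable across generations.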
3 Our Evolutionary Approach

Every heuristic for searching the global optimum of difficult optimization problems has to handle the dilemma between exploration and exploitation. Prioritizing exploitation (as hill-climbing strategies do) bears the danger of getting stuck in a poor local optimum. On the other hand, a fully explorative search, which guarantees to find the global optimum, uses vast computing power. Evolutionary algorithms avoid getting stuck in a local optimum by parallelizing the search using a population of search points (individuals) and by stochastic search steps, i.e. stochastic selection of the parents and stochastic generation of the offspring (mutation, crossover). This explorative search is in turn biased towards exploitation by preferring the fitter individuals in the selection of the parents. This approach has proven to be a very efficient tool for solving many difficult combinatorial optimization problems. A big advantage of this approach is its general applicability: there are only two problem-dependent issues, the representation of the candidate solutions as a string (genstring = chromosome) and the computation of the fitness. Even though the choice of an adequate representation seems to be crucial for the efficiency of the evolutionary algorithm, in principle both conditions are fulfilled for every computable optimization problem. On the other hand, this problem independence neglects problem-dependent knowledge, e.g. gradient information. Therefore the pure use of evolutionary algorithms may yield only modest results in comparison to other heuristics which can exploit this additional information. For the problem of optimizing feedforward neural networks we can easily compute the gradient by backpropagation. Using a gradient descent algorithm, we can diminish the search space tremendously by restricting the search to the set of local optima. This hybrid approach uses two time scales: each coarse step of the evolutionary algorithm is intertwined with a period of fine steps for the local optimization of the offspring. There seems to be biological evidence for this approach, since at least for higher animals nature uses the very same strategy: before their fitness for mating is evaluated, the offspring undergo a longer period of fine-tuning called learning. Since the evolutionary algorithm uses the fine-tuning heuristic as a subtask, we can call it a metaheuristic. Obviously, this metaheuristic is at least as successful as the underlying fine-tuning heuristic, because the offspring are optimized by it. Our experimental investigations will show that the results of this metaheuristic are not merely as good but impressively superior to the underlying heuristic. Furthermore, any improvement of the underlying fine-tuning heuristic will improve the whole system. However, there is a tradeoff in the computing time of the local heuristic: more computing time for the local heuristic means fewer offspring, as long as the overall computing time is fixed. Therefore we used the fastest gradient heuristic [6] and augmented it with weight decay and weight elimination techniques in such a way that this augmentation did not increase the computation time unfavourably (e.g. by less than a factor of two). In the natural paradigm the genotype is an algorithmic description for developing the phenotype, which seems not to be an invertible process, i.e. it is not possible to use the improvements stemming from learning (fine-tuning) to improve the genotype, as Lamarck erroneously believed.
In our application, however, there is no difference between genotype and phenotype, because the matrix of weights, which determines the neural network, can be noted linearly and interpreted as a chromosome (genstring). In this case Lamarck's idea is fruitful, because the whole knowledge gained by learning in the fine-tuning period can be transferred to the offspring (Lamarckism). The strength of our approach stems mainly from this effect, in two ways. Firstly, since the topology of the offspring is very similar to the topology of the parent, transferring the weights from the parent to the offspring diminishes the learning time impressively, by 1-2 orders of magnitude (in comparison to learning from scratch with random starting weights). This also implies that we can generate 1-2 orders of magnitude more offspring in the same computation time. Secondly, the average fitness of these offspring is much higher: the fitness distribution for the training of a network topology with random initial weights will be more or less Gaussian with a modest mean fitness value, whereas starting near the parental weights (recall that the topology of the offspring is similar to, but not the same as, that of its parent) will result in a network with a fitness near the parental fitness (maybe worse or better, cf. section 4.1). That means, whenever the parental fitness is well above the average fitness (with respect to its topology), the same may be expected for its offspring (in case the parental weights are used). Moreover, our experiments showed that for the highly
evolved "sparse" topologies the gradient descent heuristic did not find an acceptable local optimum (solving the learning task) with random starting weights, but only by inheriting the parental knowledge and initializing the weights near the parental weights. Summarizing, our algorithm briefly works as follows (cf. fig. 1). Taking into account the user's specifications, ENZO-M generates a population of different networks with a given connection density. Then the evolution cycle starts: a parent is selected, preferring the ones with a high fitness ranking in the current population, and an offspring is generated as a mutated duplicate of this parent. Each offspring is trained by the best available efficient gradient descent heuristic (RPROP [6]) using weight decay methods for better generalization. By removing negligible weights, trained offspring may be pruned and then retrained. After being evaluated, an offspring is inserted into the sorted population according to its determined fitness value, thereby removing the last population element. Fitness values may incorporate any design criterion considered important for the given problem domain.
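The evolution cycle just summarized can be sketched as a short loop. Here `mutate()`, `train()` and `evaluate()` are our own placeholders standing in for the mutation operators, RPROP training (started from the inherited parental weights) and the fitness computation.

```python
import random

# Minimal sketch of the evolution cycle: rank-based parent selection,
# mutated duplication, training, and sorted reinsertion with removal of
# the worst population element.
def evolve(population, n_offspring, mutate, train, evaluate):
    """population: list of trained networks; evaluate: lower = fitter."""
    population.sort(key=evaluate)
    for _ in range(n_offspring):
        # rank-based selection: prefer parents with a high fitness ranking
        ranks = list(range(len(population), 0, -1))
        parent = random.choices(population, weights=ranks, k=1)[0]
        offspring = train(mutate(parent))   # offspring starts from parent's weights
        population.append(offspring)
        population.sort(key=evaluate)       # insert by fitness ...
        population.pop()                    # ... and drop the worst element
    return population

# Toy usage: 'networks' are numbers, fitness is distance from 0.
random.seed(1)
pop = evolve([5.0, -3.0, 8.0], n_offspring=50,
             mutate=lambda x: x + random.uniform(-1.0, 1.0),
             train=lambda x: x,
             evaluate=abs)
```

Since the worst individual is removed after each insertion, the population size stays constant and the best individual is never lost, mirroring the sorted-population scheme described above.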
[Figure 1 diagram: selection; mutation (unit mutation / bypass heuristic, weight elimination / addition); training (Lamarckian algorithm, RPROP / weight decay); evaluation (ranking, error on test set); strategies to maintain diversity.]
Fig. 1: Evolution cycle of ENZO-M
3.1 Mutation
Fig. 2: Bypass algorithm: a) original network b) after deletion of the middle unit c) with added bypass connections
Mutation is the only genetic operator used in ENZO-M. Therefore, we tried to enlarge its sphere of activity. Besides the widely used link mutation, we also realized unit mutation, which is well suited to significantly changing the network topology, allowing the evolutionary algorithm to explore the search space faster and more exhaustively. Our experiments show that unit mutation efficiently and reliably directs the search towards small architectures. Within link mutation, every connection can be changed by chance. There are two probabilities: p+ for adding an absent connection and p− for deleting an existing one. The evolutionary search can be influenced by varying the ratio of p+ and p−. Choosing p+ < p− speeds up network dilution, leading to sparsely connected networks. On the other hand, the contrary setting p+ > p− may be more favourable, since connections are also removed by the local heuristic through the weight decay and pruning operator (see section 3.2).
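Link mutation over the 'existing' bits can be sketched as follows; the boolean-matrix encoding and function names are our own illustration.

```python
import numpy as np

# Sketch of link mutation: each absent connection is added with
# probability p_add, each existing one deleted with probability p_del.
def mutate_links(exists, p_add, p_del, rng):
    r = rng.random(exists.shape)
    added = ~exists & (r < p_add)     # candidate additions
    deleted = exists & (r < p_del)    # candidate deletions
    return (exists | added) & ~deleted

rng = np.random.default_rng(1)
exists = rng.random((6, 6)) < 0.5
# p_add > p_del biases the search towards growth, giving connections
# removed by weight decay and pruning a second chance.
mutated = mutate_links(exists, p_add=0.10, p_del=0.02, rng=rng)
```

Varying the ratio of `p_add` to `p_del` shifts the balance between dilution and regrowth exactly as the text describes for p+ and p−.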
In contrast to link mutation, unit mutation has a strong impact on the network's performance. Thus we do not mutate each unit itself but change the number of active units by a small amount a. If a is negative, we randomly delete a units, preferring units with few connections; if a is positive, we insert a units fully connected by small random weights to the succeeding and preceding layers. Changing the number of active units avoids destructive mutations that delete a trained unit and insert an untrained one at the same time. To improve our evolutionary algorithm we developed two heuristics which support unit mutation: the prefer-weak-units (PWU) strategy and the bypass algorithm. The idea behind the PWU heuristic is to rank all hidden units of a network according to their relative connectivity (actual connections / maximal connections) and to delete sparsely connected units with a higher probability than strongly connected ones. This strategy successfully complements other optimization techniques implemented in ENZO-M, like weight elimination. The bypass algorithm is the second heuristic we realized. As mentioned before, unit mutation is a large disruption of the network topology, so it seems sensible to alleviate the resulting effects. Adding a unit to a well-trained network is not very critical: if this new unit is not necessary, all weights from and to it will be adjusted in such a way that the network performance is not affected by that unit. Unlike the addition of a unit, the deletion of a unit can result in a network which is not able to learn the given training patterns. This can happen because there are too few hidden units to store the information available in the training set. But even if there are enough hidden units, deleting a unit can destroy important data paths within the network (fig. 2a and b). In contrast to the addition of units and connections, the effects of deletion cannot be repaired by the training procedure.
For that reason we restore deleted data paths before training by inserting bypass connections (fig. 2c). The application of the bypass algorithm significantly increases the proportion of successful networks generated by unit mutation: the number of networks with devastated topologies decreases, and the generated networks need fewer epochs to learn the training set, because they are not forced to simulate missing bypass connections by readjusting the weights of the remaining hidden units.
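The bypass repair of fig. 2 can be sketched on a toy graph encoding of our own: a network as a dict mapping each unit to the set of its successors. Deleting a hidden unit re-links every predecessor to every successor, so no data path through the deleted unit is destroyed.

```python
# Sketch of the bypass algorithm: before removing a unit, connect each
# of its predecessors directly to each of its successors.
def delete_with_bypass(succ, unit):
    predecessors = [u for u, outs in succ.items() if unit in outs]
    for p in predecessors:
        succ[p].discard(unit)
        succ[p] |= succ[unit]      # add the bypass connections
    del succ[unit]
    return succ

# a) two inputs feed one hidden unit feeding the output; deleting the
# hidden unit without repair (b) would disconnect inputs from output,
# while the bypass (c) restores direct input-to-output connections.
net = {"i1": {"h"}, "i2": {"h"}, "h": {"o"}, "o": set()}
delete_with_bypass(net, "h")
```

After the call, both inputs connect directly to the output, which is the repaired topology of fig. 2c for this toy case.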
3.2 Training
To train our networks we use a modified gradient descent technique called weight elimination [17]. We add a penalty term accounting for the network complexity to the usual cost function. The new objective function (E = TSS + λC) is then optimized by RPROP, a very robust and fast gradient descent method [6]. After training, part of the network weights are close to zero and can be deleted: we delete all weights whose magnitudes are below a specific threshold (pruning). Thereafter the error produced by pruning is minimized by retraining. The parameter λ is essential for the success of weight elimination. But there is no universal value, so we have to adjust λ automatically. Unlike
Rumelhart et al., we use very fast algorithms to train the networks and have only a few epochs to find a suitable λ value. Thus the heuristic for adjusting λ recommended in [17] was further developed by adding a dynamic step size adaptation [18]. The heuristic is controlled by monitoring TSS: as long as TSS is decreasing sufficiently, λ as well as its step size are increased. If λ exceeds its optimal value, as can be seen from TSS, the last growing step is withdrawn and we begin a new growing phase with a smaller step size. Weight elimination and mutation cooperate in two ways. Firstly, by setting p+ > p−, link mutation can add more connections than it deletes; thus important connections deleted by weight elimination get a second chance. Secondly, weight elimination is not well suited to deleting units; this part can be done by unit mutation.
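A sketch of the objective and the λ-growing heuristic described above: the penalty follows the weight-elimination cost of Weigend et al. [17], while the adaptation rule, its thresholds and all names are our own reconstruction, not the exact scheme of [18].

```python
import numpy as np

# Weight-elimination penalty: each weight's cost saturates at 1, so the
# penalty counts 'effective' connections rather than raw magnitude.
def complexity(w, w0=1.0):
    s = (w / w0) ** 2
    return np.sum(s / (1.0 + s))

def objective(tss, w, lam, w0=1.0):
    return tss + lam * complexity(w, w0)   # E = TSS + lambda * C

# Grow lambda while the training error TSS keeps decreasing; otherwise
# withdraw the last growing step and restart with a smaller step size.
def adapt_lambda(lam, step, tss, prev_tss, grow=2.0, shrink=0.5):
    if tss <= prev_tss:
        return lam + step, step * grow
    return lam - step, step * shrink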
4 Results

We evaluated ENZO-M in two ways. On the one hand, we investigated the impact of our design decisions on small benchmark problems in order to allow a significant statistical evaluation: the effect of Lamarckism (inheriting parental weights), the overfitting effect (evaluated by meta-cross-validation) and the cooperation of discrete unit mutation and the continuous weight decay method (especially the removal of redundant input units). On the other hand, we tested ENZO-M on two difficult benchmark problems: Nine Men's Morris, a hard touchstone requiring a high generalization ability given a relatively small training set, and the thyroid gland problem, a real-world problem requiring high classification accuracy given a large training set.
4.1 Digit recognition - the effect of Lamarckism

The first benchmark we used to test ENZO-M was a digit recognition task. We used a training set of 10 patterns (one per digit) and a second fitness set of 70 noisy patterns to measure the generalization ability of the created networks. Following the first results obtained with a previous version of ENZO [4], we started our experiments with a 23-12-10-10 topology. ENZO-M was able to minimize the network down to a 20-10 topology without any hidden units, showing that this problem is linearly separable. Moreover, ENZO-M disconnected three input neurons from the architecture. We also used digit recognition to investigate the effects of Lamarckism on the fitness distribution of offspring. From one parent network we generated 100 offspring by mutation and Lamarckism and then computed the fitness distribution of the networks. We obtained a sharp peak, which indicates that the offspring are not far away from the parent network in search space. We compared our results with another 100 networks created from the same parent topology by instantiating all weights randomly (fig. 3). Now the networks are disadvantageously scattered in the search
space, with a much greater variance and a lower mean value of the obtained network fitnesses. Moreover, other experiments showed that for hard problems the evolved sparse topologies did not even learn the training set after random initialization, because the learning procedure got trapped in poor local minima. Thus using Lamarckism is crucial to obtain very small networks. Further, Lamarckism shortens training times by 1-2 orders of magnitude and makes the evolutionary search more efficient.
Fig. 3: Comparison of the fitness distributions of networks generated by Lamarckism and by random initialization. The fitness of the parent network was 0.6
Fig. 4: Error on the fitness and meta-cross-validation sets, plotted over training epochs
4.2 Chess board problem - overfitting the test set
The second benchmark we used was a 3*3 chess board classification problem¹. This experiment was designed to investigate the effects of overfitting the test set used for the fitness evaluation (fitness set). In order to cross-validate the generalization ability, we compared the error on this fitness set with the error on a meta-cross-validation set (MCV set) which was disjoint from both the training set and the fitness set. We generated three distinct pattern sets: a training set containing 8 patterns per square, and a fitness set and an MCV set each containing 32 patterns per square. Fig. 4 shows the error values on the fitness and MCV sets. As we can see, optimization on the fitness set does not deteriorate the generalization ability.
4.3 T-C problem - removing redundant inputs
A third set of experiments was designed to show the effectiveness of the cooperation of the discrete mutation operator and the continuous weight decay method [12]: a 'T' and a 'C' must be recognized on a 4*4 grid independent of their position or rotation (in 90° steps). Starting with a 16-16-1 architecture (as described in [10]), ENZO-M generated very small networks. Two different 'minimal' solutions were found: an 11-1-1 and a 10-2-1 topology. Both architectures are smaller than the solutions obtained by other approaches. Whereas the evolutionary program in [10] removed 79% of the weights and one input unit, ENZO-M surpasses this result significantly by removing 92% of the weights and five input units (i.e. 31% of the input vector).
¹ i.e. pixels situated in a black square should be classified with 1, the others with 0
               topology  #weights  classification
start network  16-16-1   288       100%
McDonnell      15-7-1    60        100%
ENZO-M         11-1-1    22        100%
4.4 Nine Men's Morris - a hard touchstone

Nine Men's Morris is the benchmark with the largest initial topology (over 5600 connections). In our previous research we investigated networks learning a scoring function for the endgame of Nine Men's Morris with a view to best performance [4]. Now we try to optimize both the network's topology and its performance. The initial topology we used was a 120-60-20-2-1 network. ENZO-M was able to minimize the network to a 14-2-2-1 architecture, deleting not only hidden units but also the majority of the input neurons. So ENZO-M can be used to recognize redundant input features and to eliminate them. The table below shows the performance of three networks: the first network (SOKRATES) was the best handcrafted network developed just by using backpropagation (described in [3]), the second network was generated by ENZO (described in [4]), and the third network was obtained by ENZO-M.

System    topology       #weights  performance
SOKRATES  120-60-20-2-1  4222      0.826
ENZO      120-60-20-2-1  2533      1.218
ENZO-M    14-2-2-1       24        1.040

Networks optimized by ENZO and ENZO-M show a significantly better performance than the original SOKRATES network. Further, that superior performance is achieved with smaller networks: the ENZO network contains the same number of units but fewer connections than the original network, whereas the ENZO-M network is a radically new and very small topology. It may be worth emphasizing that our former system ENZO found a network with a performance slightly surpassing the ENZO-M network, but using the full input vector. This difference in performance between ENZO and ENZO-M results from the different optimization goals used in each program (performance vs. topology). From that the reader may conclude that the input units selected evolutionarily by ENZO-M contain the crucial components, whereas the remaining ones bear only little (but not zero) information.
4.5 Thyroid gland - a real-world benchmark

The classification of medical data for a thyroid gland diagnosis [14], [13] is a real-world benchmark which requires a very good classification². A further challenge
² 92% of the patterns belong to one class. Thus a useful network must classify much better than 92%.
is the vast number of training patterns (nearly 3800), which exceeds the size of toy problems' training sets by far. A set of experiments is described by Schiffmann et al. in [14], who compare several algorithms for training feedforward neural networks. They used a constant 21-10-3 topology, which we took as the maximal starting architecture for ENZO-M. We used a population size of 25 individuals and generated 300 offspring. The resulting network had a 17-1-3 topology with 66 weights and correctly classified 98.4% of the test set.

            topology  #weights  performance
Schiffmann  21-10-3   303       94.8%
ENZO-M      17-1-3    66        98.4%
5 Conclusion

ENZO-M combines two successful search techniques: gradient descent for an efficient local weight optimization and evolution for a global topology optimization. By that it takes full advantage of the efficiently computable gradient information without being trapped in local minima. Through the knowledge transfer of inheriting the parental weights, learning is sped up by 1-2 orders of magnitude and the expected fitness of the offspring is far above the average for its topology. Moreover, ENZO-M impressively thins out the topology through the cooperation of the discrete mutation operator and the continuous weight decay method. For this the knowledge transfer is again crucial, because the evolved topologies are mostly too sparse to be trained with random initial weights. Additionally, ENZO-M also tries to cut off the connections to possibly redundant input units. Therefore ENZO-M not only supports the user in the network design but also determines the influence of each input component: for the thyroid gland problem a network was evolved with better performance but 20% fewer input units, and for the Nine Men's Morris problem ENZO-M even found a network with better performance using only 12% of the input units originally used. In spite of our restriction to mutation as the only reproduction operator, ENZO-M has proven to be a powerful design tool for evolving multilayer feedforward networks. Nevertheless, it remains a challenging research issue to incorporate a crossover operator in order to efficiently recombine complex parental submodules (schemata, genes).
References

1. H. Braun: Massiv parallele Algorithmen für kombinatorische Optimierungsprobleme und ihre Implementierung auf einem Parallelrechner, Dissertation, TH Karlsruhe, 1990
2. H. Braun: On solving travelling salesman problems by genetic algorithms, in: Parallel Problem Solving from Nature, LNCS 496, Berlin, 1991
3. H. Braun, J. Feulner, V. Ullrich: Learning strategies for solving the problem of planning using backpropagation, in: Proc. 4th ICNN, Nimes, 1991
4. H. Braun, J. Weisbrod: Evolving neural feedforward networks, in: Proc. Int. Conf. on Artificial Neural Nets and Genetic Algorithms, R. F. Albrecht, C. R. Reeves, N. C. Steele (eds.), Springer, Wien, 1993
5. H. Braun, J. Weisbrod: Evolving neural networks for application oriented problems, in: Proc. of the Second Annual Conference on Evolutionary Programming, Evolutionary Programming Society, San Diego, 1993, pp. 62-71
6. M. Riedmiller, H. Braun: A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in: Proc. of the ICNN 93, San Francisco, 1993
7. D. B. Fogel: System Identification through Simulated Evolution: a Machine Learning Approach to Modeling, Ginn Press, Needham Heights, 1991
8. L. J. Fogel, A. J. Owens, M. J. Walsh: Artificial Intelligence through Simulated Evolution, John Wiley, New York, 1962
9. J. H. Holland: Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, 1975
10. J. R. McDonnell, D. Waagen: Neural Network Structure Design by Evolutionary Programming, in: Proc. of the Second Annual Conf. on Evolutionary Programming, San Diego, 1993
11. I. Rechenberg: Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart, 1973
12. D. E. Rumelhart, J. McClelland: Parallel Distributed Processing, 1986
13. W. Schiffmann, M. Joost, R. Werner: Application of genetic algorithms to the construction of topologies for multilayer perceptrons, in: Proc. Int. Conf. on Artificial Neural Nets and Genetic Algorithms, R. F. Albrecht, C. R. Reeves, N. C. Steele (eds.), Springer, Wien, 1993
14. W. Schiffmann, M. Joost, R. Werner: Optimization of the backpropagation algorithm for training multilayer perceptrons, Techn. report, University of Koblenz, 1993
15. H. P. Schwefel: Numerische Optimierung von Computermodellen mittels der Evolutionsstrategie, Interdisciplinary Systems Research (vol. 26), Birkhäuser, Basel, 1977
16. H. P. Schwefel: Evolutionsstrategie und numerische Optimierung, Dissertation, TU Berlin, 1975
17. A. S. Weigend, D. E. Rumelhart, B. A. Huberman: Generalization by Weight-Elimination with Application to Forecasting
18. P. Zagorski: Entwicklung evolutionärer Algorithmen zur Optimierung der Topologie und des Generalisierungsverhaltens von Multilayer Perceptrons, Diplomarbeit, Universität Karlsruhe, 1993