\[ p_d' = \min\left\{ 1,\ p_d\,\frac{s_x}{s_0} \right\} \quad \text{if } s_x \ge s_0 \tag{4.27} \]
where s_0 is a predefined parameter playing the role of a threshold complexity value (see
Figure 4.8: Example of a probability mass function for choosing the depth of crossover points. The distribution plotted is the negative binomial (Pascal) with r = 2 and p = 0.23, whose cumulative distribution at the maximum tree depth of 17 is 0.999.
Figure 4.9). With this change, disruption of schema H occurs with constant probability
\[ p_d'\,\frac{\|H\|}{s_x} = p_d\,\frac{\|H\|}{s_0} = \text{constant} \tag{4.28} \]
Figure 4.9: Rule for the automatic adaptation of the probability of disruption. pm and pc are
updated proportionately so that their sum follows this rule.
Even if s_x increases, the probability of disruption of x remains constant. The number of instances of x in the next generation will be exclusively influenced by its fitness and by the appearance in the population of better individuals.
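The adaptation rule of equation 4.27 can be sketched in a few lines; the function name and argument names are illustrative stand-ins, and the below-threshold branch assumes the rule of Figure 4.9 (keep the base probability p_d unchanged while s_x < s_0):

```python
def adapted_disruption_prob(p_d, s_x, s_0):
    """Eq. 4.27: scale the disruption probability with complexity s_x
    once it exceeds the threshold s_0, capping the result at 1."""
    if s_x < s_0:
        return p_d
    return min(1.0, p_d * s_x / s_0)
```

With this scaling, the per-node disruption rate p_d' * ||H|| / s_x stays constant for any schema H once s_x >= s_0, which is exactly the invariant of equation 4.28.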
4.4 A complex test case: the Pac-Man game

We consider the problem of controlling an agent in a dynamic environment, similar to the well-known Pac-Man game [Koza, 1992]. The Pac-Man game is a typical RL task. An agent, called Pac-Man, can be controlled to act in a maze of corridors. Up to four monsters chase Pac-Man most of the time. Food pellets, energizers and fruit objects result in rewards of 10, 50 and 2000 points respectively, when reached by Pac-Man (see Figure 4.11). After each capture of an energizer (also called "pill"), Pac-Man can chase monsters in its turn, for a limited time of 25 steps. During this period monsters are "blue." The rewards are 500 points for capturing the first monster, 1000 points for the next, etc., up to four monsters. However, a monster re-emerges from a central den shortly after it is captured by the agent. A snapshot of the Pac-Man world is presented in Figure 4.10.

A solution (also called policy or agent function) to the problem is a program that controls Pac-Man's movements based on current sensor readings, and possibly past sensor readings and internal state (memory). It maps states into actions or sequences of actions. Such a program is an implicit representation of the agent policy and can be evolved by means of GP. But how good can evolved solutions get?

The problem is to learn a controller that drives the Pac-Man agent so as to acquire as many points as possible. The agent has five integer-valued perception primitives, one Boolean perception, and eight overt action primitives (see Figures 4.12 and 4.13). Pac-Man can sense when monsters are blue. The other perception primitives are smell-like senses. Pac-Man can sense the Manhattan distance to the closest food pellet, pill, fruit, and closest or second closest monster. The overt action primitives move the agent along the maze corridors towards or away from the nearest object of a given type.
The function set contains two conditional operators, five perception primitives, and eight action primitives. ifb (if-blue) senses whether monsters are blue; when true it executes its first argument, otherwise it executes its second argument. iflte (if-less-than-or-equal) compares its first argument to its second argument. For a "less-than" result the third argument is executed; for a "greater-or-equal" result the fourth argument is executed.
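The semantics of the two conditionals can be illustrated with a minimal tree interpreter. This is a sketch, not the thesis's implementation: the `world` dictionary of callables is a hypothetical stand-in for the game simulator interface, and the stub primitives simply return fixed distances.

```python
def eval_tree(node, world):
    """Recursively interpret a program tree. A node is either a
    primitive name or a tuple (operator, *argument-subtrees).
    `world` maps primitive names to callables (hypothetical stand-in
    for the simulator interface)."""
    if not isinstance(node, tuple):
        return world[node]()             # perception or action primitive
    op, *args = node
    if op == "ifb":                      # monsters blue -> arg 1, else arg 2
        return eval_tree(args[0] if world["blue"]() else args[1], world)
    if op == "iflte":                    # arg1 < arg2 -> arg3, else arg4
        a = eval_tree(args[0], world)
        b = eval_tree(args[1], world)
        return eval_tree(args[2] if a < b else args[3], world)
    raise ValueError("unknown operator: " + op)

# Example program: chase the nearest monster only while monsters are blue.
world = {"blue": lambda: False,
         "act-a-mon1": lambda: 1,       # stub actions returning distances
         "act-r-mon1": lambda: 5}
prog = ("ifb", "act-a-mon1", "act-r-mon1")
```

With `blue` returning False, evaluating `prog` executes the second argument (retreat); flipping `blue` to True executes the first (advance).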
Figure 4.10: An example of the Pac-Man trajectory for an evolved program. The trace of Pac-Man is marked with vertical dotted lines. The monster traces are marked with horizontal dotted lines. Pac-Man started between the two upside-down T-shaped walls (bottom) while the four monsters were in the central den. Pac-Man headed North-East, captured a fruit and the pill there, and then attracted the monsters into the South-West corner. There it ate the pill and captured three of the monsters (to be reborn in the den). Next it will closely chase the fourth monster.
Object            | Game Points
------------------|--------------------------
Captured Monster  | 500 / 1000 / 1500 / 2000
Fruit             | 2000
Food              | 10
Energizer         | 50
Figure 4.11: Pac-Man game rewards.
Perception      | Result
----------------|---------
SENSE-DIS-FOOD  | distance
SENSE-DIS-PILL  | distance
SENSE-DIS-FRUIT | distance
SENSE-DIS-MON1  | distance
SENSE-DIS-MON2  | distance
IFB             | boolean
Figure 4.12: Pac-Man perceptions are deictic, smell-like primitives. ifb returns a true boolean value only if monsters are blue. Every other primitive returns the distance to the closest object of a given type in the world. Distance is an integer in the range [0, 43]. This and the function primitives determine the total number of possible perception states that can be experienced by the agent.
Action      | Result
------------|---------
ACT-A-MON1  | distance
ACT-A-MON2  | distance
ACT-A-PILL  | distance
ACT-A-FRUIT | distance
ACT-A-FOOD  | distance
ACT-R-MON1  | distance
ACT-R-MON2  | distance
ACT-R-PILL  | distance
ACT-R-FRUIT | distance
Figure 4.13: Pac-Man actions are based on deictic routines: advance towards or retreat from the closest (and second closest, for monsters) object of a given type. Each action primitive returns the distance to the corresponding object.
The perception primitives return the Manhattan distance to the closest food pellet, pill, fruit and closest or second closest monster. They are, respectively: sense-dis-food, sense-dis-pill, sense-dis-fruit, sense-dis-mon1, sense-dis-mon2. The terminal set has no elements. The action primitives move the agent along maze corridors. All return a number encoding the direction faced by the agent. For instance, act-a-pill advances the agent on the shortest path to the nearest uneaten energizer, while act-r-pill retreats the agent from the nearest uneaten energizer. The other actions have analogous functions with respect to the closest monster, the second closest monster, fruit, and food: act-a-mon1, act-r-mon1, act-a-mon2, act-r-mon2, act-a-fruit, act-a-food. If the shortest path or closest monster or food is not uniquely defined, then a random choice from the valid ones is returned by the corresponding function. A program is evaluated based on the performance of the agent on an initial world configuration.
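The random tie-breaking among equally near targets can be sketched as follows; the function name and the coordinate representation are hypothetical, and real corridor distances would be path distances rather than raw Manhattan distances:

```python
import random

def nearest_target(agent_pos, targets):
    """Return a target at minimal Manhattan distance from the agent,
    chosen uniformly at random when the minimum is not unique."""
    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    dmin = min(manhattan(agent_pos, t) for t in targets)
    nearest = [t for t in targets if manhattan(agent_pos, t) == dmin]
    return random.choice(nearest)
```

This randomization is one of the sources of nondeterminism discussed in Section 4.4.1.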
4.4.1 Problem Difficulty, Fitness Cases and the Fitness Measure

The Pac-Man problem has a number of major sources of difficulty. First, Pac-Man is an active agent in a dynamic environment, so the number of world states it can encounter during one simulation is huge. Second, the degree of perceptual aliasing (or hidden state) is high. Third, the problem has several sources of nondeterminism:
- monster moves are random 20% of the time;
- fruit moves are occasionally random.

Finding an optimal control policy would be intractable even for a deterministic environment. This difficulty raises questions such as what good solutions are and how much training is needed to evolve them. Training effort is measured by the number of simulations executed or the number of primitives executed. In the game simulator, any control decision (ifb, iflte, ifte) or agent perceptual action (sense) takes zero time, and any agent movement action (act) takes one time unit. Monsters and fruit move synchronously with the agent. A solution is interpreted repeatedly until Pac-Man is captured by a monster or eats all food pellets.

Each simulation of an evolved program controller starts in the same initial world state. A number of training, or fitness, cases is considered. Each training case corresponds to one simulation. Multiple simulations of the same program have different outcomes due to random events in the external environment. The actual sequence of random events for a simulation is controlled by a random number generator.

The fitness of a program is the average number of points, or "hits," accumulated by the agent under the control of the program on the set of fitness cases. The standardized fitness of program i, StdFitness(i), is the difference maxpoints − Fitness(i), where maxpoints is the theoretical maximum number of points that can be obtained if the agent gathers all food pellets, all fruit, all energizers and the maximum number of monsters (four) in each of the four blue periods generated by eating all four energizers.

Random events in the simulation of a program result in huge variations in problem difficulty and thus determine huge variations in fitness. In order to evolve general
good solutions, a large number of fitness cases should be considered. This requires an increased computational effort for each individual fitness evaluation. Reynolds [1992] discussed this typical speed-accuracy trade-off for the problem of evolving a "corridor following" control program. He defined Fitness(i) as the minimum number of simulation steps, over several fitness (i.e. training) cases, taken before the first collision of the agent.
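The fitness and standardized fitness defined above reduce to a short computation; the argument names are illustrative, with `points_per_case` holding the points scored in each simulated fitness case:

```python
def standardized_fitness(points_per_case, maxpoints):
    """Fitness = average points over the fitness cases (simulations);
    StdFitness = maxpoints - Fitness, so lower is better."""
    fitness = sum(points_per_case) / len(points_per_case)
    return maxpoints - fitness
```

For example, a controller scoring 1000, 2000 and 3000 points over three simulations has Fitness 2000 and, for maxpoints = 10000, StdFitness 8000.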
4.5 Experimental results

This section presents experiments aimed at reinforcing conclusions suggested by the previous theoretical analysis. The interpretations given take into account the interplay between two factors: fitness and complexity.

One cannot know in advance which particular tree-schema will be preferred by the evolutionary process in a run of GP. Moreover, it is extremely hard to trace a tree-schema property for all competing tree-schemata of any given shape and large size. However, one can examine relevant properties globally, for the entire population, and narrowly, for the best individual in the population.

The experiments below provide a unitary view of GP dynamics by looking at a common set of measures over three types of runs of the standard GP engine. The measures are quantities that have appeared throughout our derivations and have been used to qualitatively explain results, such as the averages over the population (f, s, f_s) and, for the best individual in a generation, f_best, s_best, or f_s best. The types of runs performed are:

1. Standard GP with raw fitness being only a measure of performance.

2. Standard GP with parsimony pressure. The fitness function combines raw fitness and a linear parsimony component to penalize a size increase.

3. Adaptive GP, where the probabilities of mutation, crossover and reproduction are updated dynamically in order to impose constant parsimony pressure on competing tree-schemata regardless of the complexity of evolved structures.
Two test problems are used. The first problem is the induction of a Boolean formula (circuit) that computes parity on a set of bits [Koza, 1994b]. The raw fitness function has access to fitness cases that show the parity value for all inputs, and counts the number of correct parity computations. These experiments use the following parameters: population size M = 4000, number of generations N = 50, crossover rate pc = 89% (20% on leaf nodes), mutation rate pm = 1%, reproduction rate pr = 10%, number of fitness cases = 2^n, where n is the order of the parity problem.

The second problem is the induction of a controller for a robotic agent in a dynamic and nondeterministic environment, as in the Pac-Man game [Koza, 1992; Rosca and Ballard, 1996a]. Raw fitness here is computed from the performance of evolved controllers over a number of simulations. The parameters for these runs are M = 500, N = 150, pc = 89%, pm = 1%, pr = 10%, with three simulations determining individual fitness.

Each of Figures 4.14-4.18 contains three plots: (a) variation of average complexity s = AvgS and the complexity of the best-of-generation individual s_best = S(Best) (top); (b) variation of the ratio of averages f_s = AvgF/S and of f_s best = F/S(Best) (middle); (c) the fitness learning curve f_best = F(Best) and variation of average fitness f = AvgF (bottom).
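The parity raw fitness (the number of "hits" over all 2^n fitness cases) can be sketched as follows; `program` is a hypothetical callable standing in for an evolved Boolean expression:

```python
from itertools import product

def parity_hits(program, n):
    """Raw fitness ('hits') for even-n-parity: the number of the 2**n
    input vectors on which `program` returns the even-parity bit."""
    return sum(program(bits) == (sum(bits) % 2 == 0)
               for bits in product((0, 1), repeat=n))
```

A perfect even-5-parity solution scores 2^5 = 32 hits; a program that always answers True scores only on the 16 even-weight inputs.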
Fitness based on pure performance

Over the time span of evolution in a GP run, there are often long periods of time when no fitness improvements are noticed. Section 4.3.2 has proved that an increase in an individual's survival rate can be accomplished by an increase in its complexity, but not in its fitness. Can this effect generalize to the entire population? We suggested that a "yes" answer is plausible, which indicates that the performance of the GP engine can be seriously degraded. Here we present experimental evidence.

The variation in the complexity of evolved structures can be seen in plots correlating the learning curves and complexity curves for the two test problems. When fitness remains constant, both the best-of-generation complexity S(Best) and the average complexity s = AvgS indeed increase over time. Plateaus of F(Best) can be observed in Figure 4.14(c) between generations 33 and 59, or Figure 4.15(c) between generations
15 and 47, or 53 and 101. During the corresponding time intervals, size almost doubles in Figure 4.14(a) and significantly increases in Figure 4.15(a), while average fitness also increases. The increase in average size is explained by the predominant increase in survival rate of above-average individuals of increased size in the absence of any fitness improvements.
Parsimonious fitness

Next we present experiments where parsimony pressure is applied during selection, in order to confine the survival of individuals of ever-increasing complexity and thus guard against the apparent loss of efficiency of GP search. An important question is whether parsimony would deter GP search from finding fit solutions at the expense of finding parsimonious solutions. This could happen because of the artificial distortion in fitness created by the parsimony component.

We expect to see a decrease in size over spans of time with no improvement in fitness. Three such intervals can be noticed in Figure 4.16(a) and (c): generations 18 to 57, 58 to 72, and 73 to 99. The GP algorithm discovers a new best solution, having a complexity higher than every previous individual, at the beginning of each of these intervals. Following the complexity curves towards the end of the intervals, we note a gradual decrease in S(Best). The same tendency is conspicuous in the average complexity plot AvgS, which has a shape similar to S(Best) and is delayed by about four to five generations. The delay period is the time needed by selection to pick up on the opportunities created at the beginning of the above intervals.

One remarkable feature of the AvgF plot in Figure 4.16(c) is that average fitness decreases in correlation with AvgS. The explanation is that parsimony pressure determines a decrease in complexity, which makes mutation and crossover operations more disruptive. This generates a decrease in average fitness over the population. The effect is even clearer when the value of the weighting factor increases (see Figure 4.17).

Note also the rapid increases in fitness in early generations in Figures 4.16(b) and 4.17(b). They show that the following relation holds in very early generations, in contrast
Figure 4.14: Variation of size (a) (topmost), fitness/size (b) (middle), and raw-fitness learning curve (c) (bottom) in a run of GP on even-5-parity without parsimony pressure.
Figure 4.15: Variation of size (a) (topmost), fitness/size (b) (middle), and raw-fitness learning curve (c) (bottom) in a run of GP on the Pac-Man problem without parsimony pressure.
to Figures 4.14(b) and 4.15(b):
\[ \frac{f_{best}}{s_{best}} > \frac{f}{s} \]
i.e., that the stronger selection pressure towards more effective individuals due to parsimony is useful to rapidly focus search on good structures (as in the discussion of equations 4.20 and 4.25). The ability of the GP engine to find fit solutions is improved considerably when using a parsimonious fitness function.
Adaptive probability of destruction

In this third experiment we modify the standard GP engine to adapt the probability of disruption of structures (mutation and crossover) as in equation 4.27. The standard GP procedure varies selected structures in proportion to pc and pm and keeps surviving selected structures around (in the next generation) in proportion to pr = 1 − pc − pm. This is done globally, in the sense that a pc fraction of the next generation is obtained through crossover on selected structures, etc. In contrast, the size-adaptive procedure decides which genetic operation to apply for each selected individual. In this way, the complexity of the individual can be used in the decision between mutation, crossover and survival. The procedure globally records the proportions of the next generation obtained with each genetic operation.

An example is given in Figure 4.19, where one can see the variations in the probability of crossover. The variations can be correlated with the variations in the average complexity AvgS. Although size increases over time, the higher disruption appears to limit the size increase without disrupting the search process.
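The per-individual decision can be sketched as below. This is an interpretation, not the exact procedure of the thesis: it scales the combined disruption probability p_c + p_m with complexity as in equation 4.27, splits it between crossover and mutation in their original proportion, and gives the remainder to reproduction.

```python
import random

def choose_operation(s_x, s_0, p_c, p_m):
    """Pick a genetic operation for one selected individual of
    complexity s_x, scaling the disruption probability (crossover +
    mutation) as in eq. 4.27; the remainder goes to reproduction."""
    p_d = p_c + p_m
    if s_x >= s_0:
        p_d = min(1.0, p_d * s_x / s_0)
    r = random.random()
    if r < p_d * p_c / (p_c + p_m):      # crossover share of disruption
        return "crossover"
    if r < p_d:                          # mutation share of disruption
        return "mutation"
    return "reproduction"
```

For very large individuals p_d saturates at 1, so such structures are always disrupted and never merely copied, which is what limits the size increase.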
Summary of experiments

The experiments above have attained three main goals in relation to the theoretical analysis of tree-schema growth:
Figure 4.16: The curves from Figure 4.15 repeated for GP with parsimony pressure = 0.1.
Figure 4.17: The curves from Figure 4.15 repeated for GP with parsimony pressure = 1.0.
Figure 4.18: Pac-Man learning curve and variation of size in a run of GP with auto-adaptation of the crossover, mutation and reproduction rates.
Figure 4.19: Adaptation in the probability of crossover.
- Traced the GP-specific variable complexity during evolution and interpreted its variations from the perspective of the size-dependent growth formula 4.13. Complexity increases can derail the search effort of the GP engine. This is shown by the long stable periods with no improvements in the most fit structures.

- Traced the influence of an additive parsimony component to fitness. Experiments show a considerable increase in efficiency and offer insight into the choice of the value of the weighting factor of the parsimony component.

- Observed at work the second proposed alternative for imposing parsimony pressure. More experiments are needed to clearly assess the advantage of this adaptive method.
4.6 Statistical dynamics of GP

4.6.1 Analogy with a physical system

Ludwig Boltzmann introduced the distinction between micro state and macro state, which enabled him to give a statistical interpretation to thermodynamics [Thompson, 1988]. The micro state description of a physical system would include a specification of state variables (such as position and velocity) for each particle. Theoretically, this could completely define the state of the system. In contrast, a macro state is a macroscopic description, i.e. one that is defined in terms of observable properties of the system (such as mass, volume, or velocity).

By analogy to a physical system, consider that the macro state of the stochastic system represented by a GA/GP system is defined by its entire population at a given time. We can observe properties that define global measures such as average fitness or best-of-generation fitness. In GP in particular, many genotypes may correspond to the same phenotype. We may not be interested in particular genotypes exactly, but rather in the course of evolution. In this analogy, a particular genotype would correspond to a micro state.
We extend the analogy by interpreting fitness as energy. The energy of an individual i is in this case:

\[ H(i) = \text{StdFitness}(i) \]

The principle of natural selection is strongly tied to the idea of energy, as individuals in a population compete for the effective utilization of energy resources [Wicken, 1988]. Ideally, there would be no uncertainty regarding the state if the entire population were made up of copies of a single individual (one having the minimum energy, for a global optimum state). However, genetic search starts with a randomly generated state. During genetic search, micro states fluctuate, determining a variation of the state in time.

In thermodynamics, the energy of a system depends on the absolute temperature T, another macroscopic state variable. We could also use temperature in our interpretations. However, here we will only consider that the temperature has a constant fixed value T.

The above analogy enables us to apply some of the results from statistical mechanics in order to qualitatively interpret state changes and convergence. One extensive property of a system's state is entropy; it is defined below. The probability of a state i in thermal equilibrium is given by the Boltzmann-Gibbs distribution:
\[ \text{Prob}(i) = p_i = Z^{-1} e^{-H(i)/T} \]

where Z is a normalizing constant needed in order to make p a probability distribution. Z actually plays a very important role in statistical physics and is called the partition function:

\[ Z = \sum_i e^{-H(i)/T} \]
If we define the free energy of the system as

\[ F = -T \log Z \]

it can easily be shown that

\[ F = \langle H \rangle - T S \tag{4.29} \]
where \( \langle H \rangle \) represents the average value of the random variable H and S is the entropy of the system:

\[ S = -\sum_i p_i \log p_i \tag{4.30} \]
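The identity 4.29 can be checked numerically for any finite set of micro-state energies; the energies and temperature below are arbitrary made-up values used only to exercise the formulas:

```python
import math

# Numerical check of eq. 4.29, F = <H> - T*S, for a made-up
# set of micro-state energies H at temperature T.
H = [0.0, 1.0, 2.5, 4.0]
T = 1.3
Z = sum(math.exp(-h / T) for h in H)            # partition function
p = [math.exp(-h / T) / Z for h in H]           # Boltzmann-Gibbs probabilities
F = -T * math.log(Z)                            # free energy
avg_H = sum(pi * h for pi, h in zip(p, H))      # <H>
S = -sum(pi * math.log(pi) for pi in p)         # entropy (4.30)
assert abs(F - (avg_H - T * S)) < 1e-12
```

The identity follows because log p_i = -H(i)/T - log Z, so T * sum_i p_i log p_i = -<H> - T log Z = -<H> + F.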
The free energy can be interpreted as the sum of the probabilities of individual states, according to the following identity:
\[ \frac{e^{-F/T}}{Z} = \sum_i p_i = 1 \]

In the free energy formula (4.29), estimations of H and S would result in an estimation of F, which can be interpreted as the probability of finding the system in a subset of states [Hertz et al., 1991].

The classical interpretation of entropy comes from thermodynamics. The entropy function was introduced by Clausius to represent the change of state when an increment of energy is added to a body as heat during a reversible process. It was later interpreted statistically by Boltzmann. The entropy of a system whose micro states are uncertain and have probabilities of occurrence p_i dependent on energy is defined by relation 4.30, up to a constant. The entropy has a maximum value when all micro states are equiprobable. Entropy represents the disorder in the system of particles and tends to increase for irreversible processes (such as the ones in nature), according to the second law of thermodynamics [Thompson, 1988].

Shannon used the same formula to define an information measure representing one's ignorance of which of a number of possibilities actually holds, given the a priori probability distribution represented by P [Shannon, 1949]. Yet another interpretation of entropy is complexity [Chaitin, 1987], or the information content of an individual structure. In this context, order means compressibility. Redundancies subtract from an individual's complexity. All these interpretations use the same formula (4.30) but assign different meanings to the probabilities fed into the formula.

This generalization tendency in interpreting entropy led researchers to search for a unifying view between the statistical interpretations of the second law of thermodynamics in physics and evolutionary principles in biology [Bruce H. Weber and Smith, 1988]. Schrodinger [Schrodinger, 1945] and others have noticed the following paradox: the increase in entropy in physical systems brings about a disorganization of the systems. Equivalently, systems evolve from less probable to more probable states. In contrast, natural evolution is described as progress, a transformation from simple to complex, or from more to less probable states. Schrodinger explained the paradox by looking at the flux of energy in a living system and suggesting that it does not conform to the basic assumptions of classical thermodynamics.

Among the various claims about the role of the second law of thermodynamics in biological evolution [Bruce H. Weber and Smith, 1988], Wicken proposed that genetic variation is due to the probabilistic nature of the second law [Wicken, 1988]. One measure that quantifies variation is diversity. Johnson defined diversity in terms of the distribution of the energy within the system based on Shannon's information entropy measure, but pointed out that diversity is not perfectly synonymous with either information or statistical entropy [Johnson, 1988].
4.6.2 Population entropy as a diversity measure

A rule of thumb in the GA literature postulates that population diversity is important for avoiding premature convergence. The problem is how to capture heterogeneity. A straightforward definition of diversity, or non-similarity, for GA string-based representations is based on the Hamming distance between encodings of individuals. [Eshelman and Schaffer, 1993] discuss strategies for maintaining GA population diversity by controlling how mates are selected, how children are created by recombination, and how parents are replaced. Eshelman and Schaffer propose a method called "incest prevention," in which individuals are randomly paired for mating provided that their Hamming distance is above a certain threshold. Their method is shown to be superior in examples based on elitist selection.

In GP, diversity may be defined as the percentage of structurally distinct individuals at a given generation. Two individuals are structurally distinct if they are not
isomorphic trees. However, such a definition is not practically useful. It is computationally expensive to test for tree isomorphisms. Moreover, associativity of functions is extremely difficult to take into account. In contrast, similarity between structures can be easily tested in GAs.

[Ryan, 1994] uses an intuitive measure of diversity, based on performance, and shows that maintaining increased diversity in GP leads to better performance. His algorithm is called "disassortative mating." It selects parents for crossover from two different lists of individuals. One list of individuals is ranked based on fitness, while the other is ranked based on the sum of size and weighted fitness. The individuals from the second list are presumably different in structure and fitness from the ones in the first list. The goal is to evolve solutions of minimal size that solve the problem. By directly using the size constraint, the GP algorithm would be prevented from finding solutions. In contrast, the disassortative mating algorithm improves convergence to a better optimum while maintaining speed.

Two other diversity measures discussed in [Rosca, 1995b] are the distribution of complexity of individuals (expanded structural complexity) and the distribution of fitness values. The latter is a more direct and easily observable type of variation in the population. Two individuals are different if they score differently. Such information can be succinctly described using Shannon's information entropy formula and represents a global measure for describing the state of the dynamical system represented by the population, in analogy to the state of a physical or informational system:

\[ E(P) = -\sum_k p_k \log p_k \]
where p_k is the proportion of the population P occupied by population partition k at a given time.

Entropy has been used as a measure of diversity of an evolving ecological community in [Ray, 1993]. Partitions were defined as individuals having the same genotype. In a functional approach such as GP, an appropriate measure of diversity is obtained by grouping individuals into classes according to their behavior or phenotype and computing the population entropy based on the number of individuals belonging to each of these classes.
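Partitioning by fitness score, the population entropy E(P) reduces to a few lines; the function name is illustrative, and `scores` stands for any hashable phenotype descriptor (here, fitness values):

```python
import math
from collections import Counter

def population_entropy(scores):
    """E(P) = -sum_k p_k log p_k, where individuals are partitioned
    by their score (a proxy for phenotype/behavior) and p_k is the
    fraction of the population in partition k."""
    n = len(scores)
    counts = Counter(scores)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A population in which everyone scores identically has entropy 0; a population split evenly between two scores has entropy log 2, and entropy is maximal when all partitions are equally occupied.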
4.6.3 Entropy experiments

This section examines the relation between diversity, as measured by population entropy, and fitness variation. Four examples are presented, from two problem domains: Boolean regression and controlling an agent in a dynamic environment (similar to the Pac-Man problem described in [Koza, 1992]). Each example discusses the relationships between the best-of-generation fitness, the average population fitness (called energy in our earlier discussion), and diversity, as measured by the entropy formula.

The GP setup for the parity problem was described in Table 3.1. The setup for Pac-Man is described in Section 4.4. Other GP parameters were chosen as in [Koza, 1994b]. The GP termination criterion did not take into account whether a solution was found. The plots shown in this section represent three measures of interest: the best-of-generation individual (hits), the average population fitness, and the population entropy. The fitness and hits graphs (best-of-generation number of hits and average population fitness) have the value axis on the left, while the entropy plots have the value axis on the right.

Figure 4.20 presents a three-dimensional plot with fitness distributions for a typical run of GP on the even-5-parity problem. These plots offer a compact representation of the fitness histograms used in [Koza, 1992]. Koza pointed out that fitness histograms "give a highly informative view of the progress of the learning process for the population as a whole." The 3-D plot clearly shows the global improvement in fitness over the population. New features that allow improvements are probably synthesized and transmitted from parents to offspring, as suggested by the wave-like advance of the distributions. Once one of the best individuals is discovered, the number of individuals with similar behaviors increases exponentially, until undermined by a similar increase of individuals with an even better behavior (hits).
Figure 4.20: Fitness distributions over a run of GP on the even-5-parity problem.
Even-5-parity in standard GP

The variation in entropy is jointly represented with the learning curves (hits and standardized fitness) in Figure 4.21. Figures 4.23 and 4.22 show the long-term evolution in this run, up to generation 200. Notice that GP continues to improve over time. This is also supported by the preservation of an entropy around the value of 1.5 (Figure 4.23), as opposed to a drop in entropy, which would signal a decrease in diversity and in the likelihood of discovering better solutions. GP improves slowly, and if run for a long enough number of generations it could reach the optimum solution of 32 hits.
GP on the Pac-Man problem

A similar analysis of entropy can be performed for the Pac-Man problem, and is presented in Figure 4.24. Although the entropy has the general features from the previous examples, it is much noisier. This can be due to the increased instability of solutions
Figure 4.21: Entropy reflects population diversity. In a run for even-5-parity, entropy clearly increases in the first 15 generations, when significant improvements over the random initial population are achieved. Then entropy remains at a relatively constant level. Will it decrease and signal a freezing of any further evolution? See Figures 4.22 and 4.23 for the answer.
Figure 4.22: Best-of-generation number of hits and average fitness in a run of the parity example for 200 generations. The first part of this run is detailed in Figure 4.21.
Figure 4.23: Entropy variation in the run of the even-5-parity example from Figure 4.21.

in the Pac-Man domain, very large distributions of fitness values, and to the smaller population size (eight times smaller than in the parity example). This time entropy decreases from a high initial value to lower and lower values during the run. Entropy decreases in spite of improvements in fitness (both average and best-of-generation). This indicates a selection pressure towards good individuals that is too high.
Discussion

The examples above present common patterns and suggest the following conclusions:

1. Monotonic decreases in population entropy over an increasing number of generations indicate possible local search optima. These are associated with plateaus on the best-of-generation hit plots.

2. Entropy decreases correspond to decreases in population diversity, but not necessarily to decreases in fitness. This situation indicates a selection pressure higher than optimal.

3. An improvement in average fitness may be caused by the selection of above-average individuals in larger proportions, and does not necessarily show that beneficial changes have been made to the population composition.
Figure 4.24: Best-of-generation hits, average fitness, and entropy in a run of GP on the Pac-Man problem.

4. The correlation between entropy (i.e. population diversity) and maximum energy (i.e. best-of-generation fitness) suggests when computational effort is wasted due to local minima. This information can be used to control perturbations in the population or to stop the simulation.

In GP, computational effort should be spent so that diversity is increased only when there is clear evidence that search has reached a local minimum. A description of phenotypic diversity based on the entropy formula appears to be useful when correlated with other statistical measures extracted from the population.
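The idea of using the entropy trace to control perturbations or to stop the simulation can be sketched as a simple monitor. The window length and tolerance below are illustrative parameters, not values taken from the experiments:

```python
def should_perturb(entropy_history, window=10, tol=1e-6):
    """Heuristic trigger: flag a likely local optimum when entropy has
    decreased monotonically over the last `window` generations, at which
    point the run could be perturbed or stopped. `window` and `tol` are
    illustrative, not values from the text."""
    if len(entropy_history) < window:
        return False
    recent = entropy_history[-window:]
    # strictly decreasing (beyond the tolerance) at every step
    return all(b < a - tol for a, b in zip(recent, recent[1:]))

# A steady slide in entropy trips the trigger ...
print(should_perturb([2.0 - 0.1 * i for i in range(12)]))  # True
# ... while an oscillating but non-collapsing entropy does not.
print(should_perturb([1.5, 1.4] * 6))                      # False
```

In a real controller one would pair such a trigger with a concrete response, e.g. raising mutation rates or injecting fresh random individuals.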
4.7 Related work

The problem of learning a non-parametric model without a priori biasing for particular structures has been tackled in the area of non-parametric statistical inference. In statistical terms this is the problem of learning with low bias, or tabula-rasa learning. However, low bias in the choice of models is paid for by a high variance (see [Geman et al., 1992] for an excellent introduction to the bias/variance dilemma). Methods for
balancing bias and variance include techniques that rely on a complexity penalty function which is added to the error term in order to promote parsimonious solutions. The basic idea is to trade the complexity of the model for its accuracy. This idea resonates with one of the fundamental principles in inductive learning, Ockham's razor, which is interpreted as: "Among the several theories that are consistent with the observed phenomena, one should pick the simplest theory" [Li and Vitanyi, 1992]. What is simple often turns out to be more general.

One common approach to dealing with a variable-complexity model within the Bayesian estimation framework is Rissanen's minimum description length (MDL) principle [Li and Vitanyi, 1993]. The MDL principle trades off the model code length, i.e. the complexity term, against the error code length, i.e. the data not explained by the model, or error term. Complexity is naturally expressed as the size of code or data in bits of information. Informed approaches that include a parsimony component, such as the MDL principle, implicitly expect that the capability of the learned model is a smooth function of its complexity. This is not true for Genetic Programming, which furthermore cannot afford to exploit a large number of training examples or use infinite populations in order to overcome the problem.9 A small change in a program can entirely destroy its performance. The capability of a model specified with a program is not a smooth function of its complexity. Nonetheless GP manages to sample the space of programs and to discover automatically satisfiable models of variable complexity.

The MDL principle has also been applied in GP to extend the fitness function of hybrid classification models [Iba et al., 1993; Iba et al., 1994]. For example, [Iba et al., 1994] applied the MDL principle in the learning rule of a GP-regression tree hybrid.
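The MDL trade-off can be sketched as a two-term score. The per-node encoding cost below is an illustrative assumption; the actual code-length computations in the MDL literature are more involved:

```python
def mdl_fitness(error_bits, model_size_nodes, bits_per_node=4.0):
    """MDL-style score (lower is better): total description length equals
    the error code length (data not explained by the model) plus the
    model code length. `bits_per_node` is an illustrative encoding cost
    per tree node, not a value prescribed by the MDL literature."""
    return error_bits + bits_per_node * model_size_nodes

# Two candidate models: a small one leaving some data unexplained, and a
# large one that explains everything. MDL prefers the better trade-off.
small = mdl_fitness(error_bits=32.0, model_size_nodes=5)   # 52.0
large = mdl_fitness(error_bits=0.0, model_size_nodes=40)   # 160.0
print(min(small, large))  # 52.0 -> the small model wins here
```

The sketch makes the section's point concrete: such a score only guides search well when capability varies smoothly with complexity, which is precisely what fails for programs.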
[Zhang and Muhlenbein, 1995] used an adaptive parsimony strategy in a GP-neural net hybrid. In both cases GP manipulates tree structures corresponding to a hierarchical multiple regression model of variable complexity, decision trees, or sigma-pi neural networks, rather than programs. MDL-based fitness functions have been unsuccessful in the case of GP evolving pure program structures. Iba outlined that the MDL-based fitness measure can be applied to problems satisfying the "size-based performance" criterion [Iba et al., 1994], where the more the tree structure grows, the better its performance becomes. [Rosca and Ballard, 1994b] used the MDL principle to assess the suitability of an extension of GP with subroutines called adaptive representation (AR).

9 An alternative approach to program induction, based on an iterative deepening exhaustive search, is taken in ADATE [Olsson, 1995]. However, ADATE cannot hope to solve problems for which the complexity of a solution is large.

The most common approach to circumventing complexity-induced limitations in GP has been the use of a parsimonious fitness function. Parsimony imposes constraints on the complexity of learned solutions. However, the effects of such constraints in GP have not been elucidated. Parsimony pressure, if well designed, clearly improves the efficiency of search and the understandability of solutions. The quality, and in particular the generality, of solutions may also be improved in inductive problems. However, adding the right parsimony pressure has been more of an art. One example of avoiding this decision in an ad-hoc algorithm is "disassortative mating" [Ryan, 1994]. This GP algorithm selects parents for crossover from two different lists of individuals. One list of individuals is ranked based on fitness, while the other is ranked based on the sum of size and weighted fitness. The goal is to evolve solutions of minimal size that solve the problem. However, it was recognized that directly using the size constraint prevents the GP algorithm from finding solutions. The disassortative mating algorithm is reported to improve convergence to a better optimum while maintaining speed.

Related to the size problem (also called the bloating phenomenon), GP research has focused on the analysis of introns. Introns are pieces of code with no effect on the output. An analysis of introns goes hand in hand with an analysis of bloating.
[Nordin et al., 1995] tracked introns in an assembly-language GP system based on a linear (sequential) but variable-length program representation. The analysis suggested that the increase in size is a "defense against crossover." A similar conclusion is reached here in Theorem 1 (see Section 4.13). In the linear representation, the observed increase in the size of programs was attributed to introns. Based on experiments with controlled crossover or mutation rates within intron fragments, [Nordin et al., 1995] suggested that a representation which generates introns leads to better search effectiveness. Thus, introns
may have a positive role in GP search, protecting against destructive genetic operations. For hierarchical GP representations, [Rosca, 1996] showed that much of the size increase is due to ineffective code too. However, the role of introns has been disputed in the case of GP using tree representations [Andre and Teller, 1996]. For one thing, the overhead introduced by exponentially increasing tree sizes may offset any protective effects of introns. Tackett pointed out that bloating cannot be selection-neutral [Tackett, 1994]. He presented experiments suggesting that the average growth in size is proportional to the selection pressure. In our analysis, selection pressure itself is complexity dependent. Tackett also suggested that the larger programs selected by GP contain expressions which are inert overall (introns) but contain useful subexpressions, thus correlating bloating with hitchhiking.

Another suggestion for confining the increase in complexity is to employ modular GP extensions such as algorithms based on the evolution of the architecture [Koza, 1994b], heuristic extensions for the discovery of subroutines [Rosca and Ballard, 1994b; Rosca and Ballard, 1996a], or GP with architecture-modifying operations using code duplication [Koza, 1995]. Evolved modular programs theoretically have a lower descriptional complexity [Rosca and Ballard, 1994b], and also appear to present better generality [Rosca, 1996; Rosca and Ballard, 1996b].

The problem that evolved expressions tend to drift towards large and slow forms without necessarily improving the results was recognized in some excellent early work in GP, applied to the simulation of textures for use in computer graphics [Sims, 1991]. The solution devised was heuristic. Mutation frequencies were adjusted so that a decrease in complexity was slightly more probable than an increase. This did not prevent increases towards larger complexity, but more complex solutions were due to the selection of improvements.
It is not apparent how this was done. Interestingly, the solution to controlling complexity presented in Section 4.3.4 achieves exactly this effect and has a theoretical foundation.
5 Modularity in GP: The Adaptive Representation Approach

The previous two chapters have presented key ideas in interpreting GP processing and in understanding its characteristics and limitations. Starting with this chapter we address the second goal of the dissertation: extending the capabilities of GP. The first step in this direction is to present a GP extension called Adaptive Representation (AR). AR offers a heuristic solution to the problem of architecture discovery. It extends the search component of GP with a heuristic component that: (1) can learn good subexpressions from problem-solving traces; (2) can abstract subexpressions into subroutines; (3) can use subroutines to bias future search. Evolved solutions assume a modular and hierarchical organization.

First, we review the main modular approaches to program synthesis from the GP literature, along the criteria used for analyzing GP in the previous chapter. Then we propose the AR approach to automatic problem decomposition. Further improvements of this GP extension will be presented in the next chapter.
5.1 Review of modular approaches to genetic programming

The idea of using subroutines in genetic programming (GP) is drawn from the genetic algorithm (GA) building block hypothesis. Building blocks are relevant pieces of a partial solution that can be assembled together in order to generate better partial solutions to the problem at hand. Holland [1992] (see also [Goldberg, 1989]) hypothesized that GAs achieve their search capabilities by means of "block" processing. This led to several attempts to explicitly identify and use blocks in GA algorithms. For example, the messy genetic algorithm (mGA) [Goldberg et al., 1989] explicitly attempted to discover useful blocks of code guided by the string structure of individuals. The structure is apparent in the mGA representation, which takes the form of a string having each gene tagged with an index representing the gene's original position. After filtering useful blocks, mGA employs typical GA operations to combine those blocks. Perhaps owing to the purely structural nature of block definitions, the improvements of these experiments were somewhat modest, and in fact the building block hypothesis has not gained conclusive support in the GA literature so far. Nor is it clear how blocks can best be combined: recent GA experimental work disputes the usefulness of crossover as a means of communication of building blocks between the individuals of a population [Jones, 1995a].

In GP, [O'Reilly and Oppacher, 1995] made an analogy to the GA schemata theory. A major goal of that work was to understand whether GP problems have building block structure, but the results were also inconclusive. A structural approach is also at the basis of "constructional problems" [Tackett, 1995], i.e. problems in which the evolved trees are not semantically evaluated. Instead, program fitness is determined by matching a set of patterns against the program and adding up the predefined fitness of each pattern. By ignoring the semantic evaluation step, the analysis of constructional problems is not generalizable to typical GP problems.

GP presents a challenging picture due to the functional representation it generally uses. An analysis of block processing in GP has to rely on the function of blocks of code.
GP modularization approaches consider the effect of encapsulating and possibly generalizing blocks of code in order to create modules. Modules correspond to (parts of) evolved subexpressions, and will be defined more precisely below, according to the approaches various researchers have taken. The first approach to modularization in GP was the encapsulation operation introduced in [Koza, 1992]. Refinements or extensions of the encapsulation concept have
focused on different aspects of function definition. The main approaches to modularization discussed in the GP literature are essentially extensions of the standard GP engine. Three early approaches are automatically defined functions (ADF) [Koza, 1994b], module acquisition (MA) [Angeline, 1994b], and adaptive representation (AR) [Rosca and Ballard, 1994a]. Another approach, contrasted to ADFs, is automatically defined macros (ADMs) [Spector, 1996]. ADFs have been used in another extension to GP called cellular encoding [Gruau, 1994]. These approaches will be briefly described next, with the exception of AR and ARL, which are the main subjects of the rest of this chapter and the next chapter, respectively.
Encapsulation

The encapsulation operation, originally called "define building block," was viewed as a genetic operation that identifies a potentially useful subtree and gives it a name so that it can be referenced (as a function with zero arguments) and used later [Koza, 1992].
Automatically Defined Functions

Automatic definition of functions is an extension of the GP paradigm to cope with the automatic decomposition of a solution function [Koza, 1992]. In this approach individuals are represented by a set of subroutines, called automatically defined functions (ADFs), and a main function, called the result-producing branch. Each subroutine component has a fixed number of parameters. Each of the subroutine and main function components is defined based on its specific alphabet (function and terminal sets). The architecture of a program is defined by the number of subroutines, the number of arguments of each subroutine, and the nature of hierarchical references among the components. GP using the ADF-based representation of individuals (called ADF-GP henceforth) co-evolves representations for all these components implementing a program.

Let us take a program pattern with two automatically defined functions (ADF0 and ADF1) and a result-producing branch with one body. Then one distinguishes between terminal sets and function sets for ADF0, ADF1 and the program body. In the example presented in Figure 5.1, terminals from the initial terminal set are not included in the terminal sets for the function branches. The primitive function and terminal sets are defined such that the components form a fixed hierarchy. Genetic operations are constrained depending on the components on which they operate, a constraint called branch typing. For example, crossover can only be performed between components of the same type. Note that the hierarchy of components is fixed at the outset of running GP.
Figure 5.1: (a) An individual program with two automatically defined functions. It consists of three branches: ADF0, ADF1 and a result-producing branch with one body. Each branch has a set of arguments A (only for ADFs), a function set F and a terminal set T, which are established in the problem definition. (b) Hierarchy of components.
Also, note that subroutines are not shared between individual programs. Subroutines may have no clear meaning from the point of view of the problem solved; they may not correspond to specific subgoals related to the problem at hand. We do not know a priori what a subproblem is. Subroutines are not explicitly associated to problem subgoals even in the case when we do know what a problem subgoal is. Ultimately, the effort to tune the architecture may not be negligible.

Two main differences between ADF-GP and standard GP are the following. First, ADF-GP can develop much more complex programs. The virtual size of the program body, after an inline substitution of ADFs down to the basic primitives in the program body, can be very large (see the definition of the expanded structural complexity notion in Section 5.4.2). Second, ADF-GP is able to make larger jumps in the search space. For example,
a mutation in the lowest ADF level, ADF0, called in higher-level ADFs, radically changes the behavior of the body of the program. This may be a big disadvantage in late stages of evolution, when the algorithm tunes solutions. ADF-GP is theoretically more powerful than standard GP because it can evolve more complex solutions than would be allowed by the resource constraints of GP, such as expression size or depth. ADF-GP may or may not be more efficient depending on the application. The intuition is that ADF-GP may be more efficient, especially for problems with regularity patterns in their solutions. The advantage of the ADF approach is its generality. The greatest disadvantage is that it requires the specification of the architecture for decomposition in advance.
GLIB and Module Acquisition

The module acquisition (MA) approach is applied more generally to GP and EP [Angeline and Pollack, 1993; Angeline, 1994a]. In its GP implementation, called GLIB, pieces of code called modules can be randomly frozen from manipulation as a result of compress operations and are kept in a global genetic library. More precisely, a module is a piece of code obtained by randomly choosing a subtree and possibly randomly chopping off its branches to introduce arguments (see the left side of Figure 5.2). A module can be decompressed by an additional genetic operation, called expand, which has a complementary effect to compress. The genetic library passively preserves definitions of modules.

The compress and expand operators affect the size of evolved expressions. Thus, they may also positively affect the course of evolution. For example, consider the case when an evolved solution calls many modules, many times. The virtual size (i.e. equivalent GP size) can grow huge. It is harder to evolve equivalent big expressions with standard GP, which is confined to size or depth limits. If the application is such that only large solutions are acceptable, and moreover those solutions can be modular, then the chances that GP would be able to evolve a monolithic structure with these features are extremely small. The approach may help, although experimental evidence is scarce.
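The compress operation can be sketched on a tuple-based tree representation. The representation, the module-naming scheme, and the 0.5 chop probability are illustrative, not GLIB's actual implementation:

```python
import random

# Trees are nested tuples (function, child, ...); leaves are strings.
LIBRARY = {}   # the global "genetic library" of frozen module definitions

def compress(subtree, rng=random):
    """Sketch of a compress operation: freeze a chosen subtree as a
    module, randomly chopping off some of its branches so that they
    become formal arguments of the new module."""
    params, actuals, body_children = [], [], []
    for child in subtree[1:]:
        if rng.random() < 0.5:                 # chop this branch into an argument
            param = f"arg{len(params)}"
            params.append(param)
            actuals.append(child)              # chopped code becomes the actual argument
            body_children.append(param)
        else:
            body_children.append(child)        # this branch stays frozen inside the module
    name = f"Mod{len(LIBRARY)}"
    LIBRARY[name] = (params, (subtree[0], *body_children))
    return (name, *actuals)                    # call site that replaces the subtree

call = compress(("F1", ("F2", "T1", "T2"), "T3"), random.Random(1))
params, body = LIBRARY[call[0]]
print(len(call) - 1 == len(params))  # True: one actual per formal parameter
```

An expand operation would do the inverse: look the module up in `LIBRARY` and substitute the actuals back into the body at the call site.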
Compression protects what may be useful genetic material from the destructive effect of other genetic operators. It also helps in decreasing the average size of individuals in the population, while maintaining the same power. As a side effect, it may also generate a loss of diversity in the population, a problem presumably repaired by the expand operator (see Figure 5.2).
Figure 5.2: Additional GP operators in the module acquisition (GLIB) approach.

It is interesting that in [Angeline and Pollack, 1994] the authors talk about the worth of a module, but attribute to it a rather passive role. The module's worth is the number of times the module has been used since its birth, in subsequent generations. If a module is not frequently used, it means it is not viable in the competition with other individuals.
Automatically Defined Macros

The essential difference between a subroutine and a macro is in the way code is evaluated. For a subroutine, arguments are first evaluated and then the subroutine code is invoked on the actual argument values. For a macro, the macro definition is expanded with the argument definitions before any evaluation, and then the resulting code is evaluated. Thus, most often a subroutine and an identically defined macro will
produce entirely different results. The order of code execution is changed for a macro. Moreover, the code corresponding to some macro arguments may not be executed at all, if the macro definition contains lazy-evaluation functions, such as if. Whenever the primitives involved in the macro definition have side effects, the changed order of execution and the different code activation pattern will generate different results for a macro invocation than for a subroutine invocation.

Automatically defined macros [Spector, 1996] implement the idea of ADFs by working with macros instead of subroutines. The goal is to attempt an improvement of GP efficiency for problems with side-effecting primitives that control, for instance, an agent in a simulated world (Obstacle Avoiding Robot, Lawn Mower, etc.). One of the conclusions in [Spector, 1996] is that ADMs are likely to be useful in environments within which "context sensitive or side-effect-producing operators play important roles."
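The subroutine/macro distinction can be illustrated in Python by modeling macro arguments as unevaluated thunks. The `beep` logging primitive is a made-up side-effecting primitive used only to make the evaluation order observable:

```python
LOG = []

def beep(tag):
    LOG.append(tag)   # side effect: record that this piece of code ran
    return tag

def subroutine_if(cond, then_val, else_val):
    # by the time we get here, the caller has evaluated ALL arguments
    return then_val if cond else else_val

def macro_if(cond_thunk, then_thunk, else_thunk):
    # macro-style lazy `if`: only the taken branch is ever evaluated
    return then_thunk() if cond_thunk() else else_thunk()

LOG.clear()
subroutine_if(beep("yes") == "yes", beep("then"), beep("else"))
print(LOG)  # ['yes', 'then', 'else'] -- every argument was evaluated

LOG.clear()
macro_if(lambda: beep("yes") == "yes",
         lambda: beep("then"),
         lambda: beep("else"))
print(LOG)  # ['yes', 'then'] -- the else-branch code never ran
```

With side-effecting primitives, the two activation patterns produce different logs (and, in a GP agent, different behavior), which is exactly the difference ADMs exploit.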
Architecture evolution and code duplication

The ADF approach presents the disadvantage of working with a fixed program architecture. This problem was ingeniously addressed in [Koza, 1994a] using the biologically inspired idea of code duplication (see also [Koza, 1995]). The architecture of evolved programs can be modified by means of new operations for duplicating parts of the genome. Six new genetic operations were introduced for altering the architecture of an individual program: branch duplication, argument duplication, branch deletion, argument deletion, branch creation and argument creation. Duplication operations are performed such that they preserve the semantics of the resulting programs. They increase the potential for the refinement of the programs. The duplication of an element of the program architecture (a branch or an argument) is done in conjunction with a random replacement of the invocations of the corresponding element by invocations of the duplicated copy. Such an operation decreases the probability that a future random change will drastically change the behavior of the program. A similar conclusion can be drawn for the creation operations. The deletion operations do not possess the nice properties mentioned above. They have the antagonistic role of confining the increase in size of the evolved programs.
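Semantics-preserving branch duplication can be sketched as follows. The dict-of-trees representation and the names are illustrative of the idea, not Koza's implementation:

```python
import random

def duplicate_branch(program, old, new, rng=random):
    """Sketch of branch duplication: copy the definition of branch `old`
    under the fresh name `new`, then flip each call site to `old`
    independently at random to call `new` instead. Because both names
    initially share one definition, program behavior is unchanged.
    `program` maps branch names to nested-tuple trees."""
    program = dict(program)
    program[new] = program[old]                # identical copy of the definition
    def rewrite(tree):
        if isinstance(tree, str):
            return tree
        head = tree[0]
        if head == old and rng.random() < 0.5: # random reassignment of this call
            head = new
        return (head, *map(rewrite, tree[1:]))
    program["BODY"] = rewrite(program["BODY"])
    return program

prog = {"ADF0": ("AND", "Arg0", "Arg1"),
        "BODY": ("OR", ("ADF0", "D0", "D1"), ("ADF0", "D1", "D2"))}
dup = duplicate_branch(prog, "ADF0", "ADF0b", random.Random(0))
print(dup["ADF0b"] == dup["ADF0"])  # True: the copy starts out identical
```

After duplication the two copies can diverge under mutation, and any single change touches fewer call sites, which is why a future random change is less likely to alter behavior drastically.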
Duplication increases the chances of survival of pieces of code, and thus virtually protects evolved code against the destructive effects of genetic operations. It also increases the likelihood of maintaining higher diversity in the population. Therefore, code duplication transformations seem to be useful in general. The point, though, is that they also modify the number of arguments of a function, or the number of subroutines, therefore altering the architecture of solutions. Decomposition is a result of the evolutionary process.
Morphogenesis

In standard GP, the genotype is a program which is directly interpreted in order to generate some behavior or a fitness value. Another approach is to first perform a transcription or development of the genotype into another structure that implements a model (viewed as the phenotype), and then perform fitness evaluation on the resulting model. Transcription can be controlled by the primitives of the application.

Gruau considered a language for transcription, called cellular encoding, that specifies the transformation of "neural cells." Primitives in the language affect the cell and its interconnections with neighboring cells. For example, a cell with inputs and outputs can divide in series into two cells, the first inheriting all mother-cell inputs and the second inheriting all the mother-cell outputs. Other primitives include parallel division, increasing or decreasing weights of connections, removing connections, modifying threshold parameters in the cell, stopping development, etc. [Gruau, 1994]. Development starts with a mother cell. However, after each operation resulting in a cell division (such as serial or parallel division above), each daughter cell continues its own development path. Full development of an embryonic cell into a neural network can be specified using a tree structure labeled with the cellular encoding primitives. This process resembles biological development, where after cell division each resulting cell has its own copy of the mother cell's chromosome. Once created, the neural network can be used for data interpretation or prediction.
Its fitness is actually determined by how good a model it is for the application task. Now, instead of optimizing over neural network architectures, the algorithm optimizes over genotypic encodings, i.e. cellular development programs. This is done using GP search on the space of tree structures representing development programs.

Normally, during program interpretation, functions require that all arguments be evaluated first to find the actual argument values. Then the function can be applied. With lazy functions, only the needed parameters are evaluated. In both cases execution proceeds bottom-up. In contrast, the evaluation of cellular encodings proceeds top-down and in parallel.

One desirable property of the encoding is locality. A change of a subtree in a development tree will only affect some local part of the grown neural network. A module of the subtree corresponds to a module of the neural network. With this interpretation, the effect of genetic operations can be visualized as local changes in the neural network. This makes it easy to implement the development of modular neural networks. The cellular encoding only has to include primitives that reiterate interpretations of some module from the beginning, by means of jump or recurse primitives. In particular, the development program can contain ADF-like components. Development of a cell continues with the program defined by the ADF when the name of the ADF is executed during development. The execution of the same ADF for two different cells in the growing neural network will determine similar structures to appear in two different parts of the network.

The idea has been proven feasible on applications ranging from generating families of neural networks for computing parity on a generic number of inputs, to the development of the control system of the various motor subsystems (such as legs) of an animat.
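The serial-division primitive described above can be sketched on a minimal graph model. This is a hypothetical toy model: real cellular encoding also tracks connection weights, thresholds, and the per-cell reading head in the development tree:

```python
def serial_divide(net, cell):
    """Sketch of serial division: `cell` splits into itself plus a fresh
    child cell; the original keeps all inputs, the child inherits all
    outputs, and a single new link connects the two. `net` maps each cell
    name to the list of cells it feeds into."""
    child = cell + "'"
    net = {c: list(targets) for c, targets in net.items()}  # copy the network
    net[child] = net[cell]   # child inherits the mother cell's outputs
    net[cell] = [child]      # mother now feeds only the child
    return net

# a three-node chain: input cell -> developing cell "c" -> output cell
net = serial_divide({"in": ["c"], "c": ["out"], "out": []}, "c")
print(net["c"], net["c'"])  # ["c'"] ['out']
```

Because only the divided cell and its new child change, the operation is local, which is the property that makes genetic changes to a development tree correspond to local changes in the grown network.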
Furthermore, phenotypes can be exposed to learning in an environment for fitness determination, thus allowing for the study of the interaction between learning and evolution [Hinton and Nowlan, 1987; Gruau and Whitley, 1993].
Summary

A performance comparison of ADF and module acquisition (MA), as well as other variations of the two methods, is presented in [Kinnear Jr., 1994]. ADF consistently shows better performance. This is attributed to the repeated use of calls to automatically defined functions and to the multiple use of formal parameters in ADFs.

In the above methods, selection of programs to participate in reproduction operations is fitness-proportional. However, in ADF and MA the selection of blocks of code within programs is purely random or "uninformed." Uniform random changes at all levels may determine a loss of beneficial evolutionary changes. For instance, ADF samples the space of subroutines by modifying automatically defined functions at randomly chosen crossover points. Similarly, MA randomly selects a subtree from an individual and randomly chops its branches to define a module, thereby deciding which genetic material is frozen. All points of a tree, either active (i.e. effective during evaluation) or inactive (i.e. introns; for a discussion of introns in GP see [Nordin et al., 1995]), are equally likely to be the source of a compress operation.

Random changes will not be an efficient strategy if the bottom-up evolution hypothesis [Rosca and Ballard, 1995] holds. This theory conjectures that ADF subroutine representations become stable in a bottom-up fashion. Early in the process, changes are focused towards the evolution of low-level subroutines. Later in the process, changes are focused towards the evolution of program control structures, that is, structures at higher levels in the hierarchy of subroutines [Rosca and Ballard, 1995]. For this reason we have focused on heuristic or adaptive measures to guide the focus of attention during search for the creation and modification of subroutines.1

For problems with symmetries and patterns of regularity, modularity should bring a number of advantages, such as increased search efficiency and easier scale-up behavior [Koza, 1994b; Koza, 1995].
Later, we will examine how modularity helps decouple independent parts of problems, thus facilitating problem decomposition. Also, modularity facilitates reuse of code as an alternative to repeatedly evolving the same code fragments multiple times. Had such repeated use of code been necessary in the design of a solution, then a GP algorithm extended with modularity mechanisms could be considerably more efficient than standard GP.

1 See also Chapters 3, 4, and 5, authored by Teller, Angeline, Iba and de Garis, in [Angeline and Kinnear, Jr., 1996].

An attempt to explain the course of evolution in GP based on an understanding of what building blocks are appears in [Tackett, 1993]. The idea that frequent subtrees in one individual correspond to synthesized features suggests the conclusion that those subtrees comprise "building blocks".
5.2 Characteristics and biases of modular GP

This section contrasts characteristics of the ADF-GP modular representation with the standard GP representation. The discussion follows the main ideas presented in Chapter 4: distribution of fitness values of random representations, transformation of representations, complexity of evolved representations, and statistical dynamics. Some of the arguments support the decision to use a modular representation, while others do not. New concepts needed to characterize the complexity of evolved structures for modular representations will be introduced in Section 5.4.
5.2.1 Biases in the random generation of expressions

In the ADF representation, randomly created ADFs are equivalent to random functions over the corresponding input variables. The main program can invoke the random ADF functions, besides the primitives in F. Thus we expect to see a distribution of fitness values close to the case of an enlarged function set, as presented in Section 4.1, Figure 4.1 (c) and (d). Experimental results confirm this hypothesis.

Figure 5.3 shows an extended analysis for ADF-GP solutions to the parity problem. The distribution of the number of hits is observed similarly to Section 4.1 for several alternatives in the choice of the function set. Thus, for ADF-2 (Figure 5.3 (a)), the function set used in all modules is F0 (defined
129 in relation 3.3). For ADF-2-8 the function set includes four additional Boolean functions of two variables, besides the primitives in F0, while for ADF-2-16 it includes all sixteen Boolean functions of two variables. The analysis outlines two conclusions. First, the distribution for ADF-2 is wider than the distribution for GP (see Figure 4.1). This indicates the better potential for ADF-GP to create and maintain increased diversity which can be exploited by genetic search (see also [Koza, 1994b]). Second, the distributions for ADF-2-8 and ADF-2-16 are even wider. More subroutines in the representation positively aect the diversity of behaviors in a random population of programs. However, if there are too many useless subroutines the positive eect is limited. Fortunately, selection will also in uence what subroutines and subexpressions, in general can be found in the population at any given time. 100,000 (a) ADF-2
Figure 5.3: Probability mass function of the random variable X representing the number of hits for ADF-GP even-5-parity functions. The ADF-GP architecture uses two ADFs of two arguments each. The function sets of the main program and of the ADFs contain respectively four (a), eight (b), and sixteen (c) distinct Boolean functions of two variables. The function sets necessarily contain the four primitives of two variables: and, or, nand, nor.
These results support the idea that subroutines can lead to a wider diversity of behaviors on which selection can act, leading to a more effective search.
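The kind of measurement behind distributions like those of Figure 5.3 can be reproduced in miniature. The sketch below is an illustration, not the thesis code: the tree depth, leaf probability, and sample count are arbitrary choices. It samples random expression trees over the primitives of F0 and histograms their hits against even-5-parity:

```python
import random

FUNCS = {
    "and":  lambda a, b: a and b,
    "or":   lambda a, b: a or b,
    "nand": lambda a, b: not (a and b),
    "nor":  lambda a, b: not (a or b),
}
N_VARS = 5  # even-5-parity

def random_tree(depth):
    """Grow a random expression tree over d0..d4 and the primitives of F0."""
    if depth == 0 or random.random() < 0.3:
        return random.randrange(N_VARS)            # leaf: a variable index
    op = random.choice(sorted(FUNCS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, case):
    if isinstance(tree, int):
        return case[tree]
    op, left, right = tree
    return FUNCS[op](evaluate(left, case), evaluate(right, case))

def hits(tree):
    """Fitness cases (out of 32) on which the tree agrees with even parity."""
    count = 0
    for x in range(2 ** N_VARS):
        case = [bool((x >> i) & 1) for i in range(N_VARS)]
        if evaluate(tree, case) == (sum(case) % 2 == 0):
            count += 1
    return count

random.seed(0)
histogram = {}
for _ in range(10000):
    h = hits(random_tree(4))
    histogram[h] = histogram.get(h, 0) + 1
```

The wider distributions observed for ADF-2-8 and ADF-2-16 would correspond to repeating this sampling with a richer function set.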
5.2.2 Transformational biases

ADF-GP has to discover the definitions of the ADFs and the main program body, and also has to find a good composition of ADFs and primitives (one that solves the problem). This corresponds roughly to discovering a way to decompose the problem and solving the subproblems, given only the maximum number of subproblems and the general structure of the subproblems (i.e. the number of parameters and the subproblem "alphabets"). Due to the imposed ordering of ADFs we can consider each ADF as a different structural level. The ADF approach simultaneously attacks the search problem at different structural levels. During GP search, modifications are alternately made at each of the structural levels. A code fragment brought from another individual changes its function entirely if it contains calls to ADFs. For example, consider a piece of code with calls to lower-order ADFs that is pasted into a higher-order function or the main body as a result of a crossover operation, and suppose the definitions of the ADFs in the two parents are entirely different. Lexical scope dictates the definition to be used when invoking a sub-function, so the calls to ADFs from the transplanted piece of code will refer to the definition of a totally different function from the new lexical scope. This quite frequent situation is depicted in Figure 5.4 and demonstrates the non-causality of ADF-GP. The non-causality property of ADF-GP is in total opposition to the principle of strong causality previously stated in Chapter 4. It is useful to visualize how the search for a solution may generally proceed in ADF-GP. Each of the ADF functions represents a different sub-function. Consider the last modification imposed on a program tree before it becomes an acceptable solution. It is very unlikely, but not impossible, that this last change was a change with a large influence, for example a change in one of the functions at the base of the hierarchy. Such a situation represents a lucky change.
Most probably, though, it was a change at the
Figure 5.4: The non-causality of ADF-GP: definitions of ADFs are local. Thus, a fragment of code copied from donor parent 1 into receiving parent 2 will be evaluated in the new lexical environment of parent 2.
highest level, in the program body. We conjecture the following general principle: as evolution progresses, changes respecting the principle of strong causality become more important and should be supported by the representation and processing primitives. In other words, as better and better individuals are found, selection most often favors small, causal changes, which have the biggest chance of turning out successful. The effect of this principle is a stabilization of useful lower-level ADFs. Evolution will freeze good subroutines and will eventually find beneficial changes at higher levels.
5.2.3 Bottom-up evolution

In order to test the hypothesis that causality plays an increasing role as evolved individuals become more complex and fit, we have studied the most recent part of the genealogy tree for even-n-parity problems. This was done by giving each individual a birth certificate that specifies its parents and the method of birth, corresponding to the branch type where crossover was performed (one of ADF0, ADF1, or the main program body), or to reproduction. In this experiment the mutation rate was zero. We hoped that an analysis of the birth certificates, starting
with the final solution and tracing its origin backwards, would shed light on the GP dynamics. In order to determine the effect of the different types of birth operations, we have computed a temporally discounted frequency factor bcf for a given solution tree T and a type of birth:
bcf(T, type, d) = \frac{1-\gamma}{1-\gamma^{d}} \sum_{i=0}^{k_T} \gamma^{depth(T_i)} \, \chi_{\{type\}}(T_i)
where k_T is the number of programs in the genealogy tree of T down to depth d, and \chi_{\{type\}}(T_i) is the characteristic function of ancestor T_i of T, returning 1 if T_i has a birth certificate of birth-type type and 0 otherwise. Table 5.1 presents the results for several successful runs of ADF-GP for even-5-parity, with two ADFs of three arguments each. These results show that ADF-GP search relies in most cases on changes at higher and higher structural levels, which make it possible to exploit good code fragments that appear in the population.

Table 5.1: Statistics of birth certificates in successful runs of even-5-parity using ADF-GP with a zero mutation rate and a population size of 4000. Each certificate of a given type counts one unit and is temporally discounted with a discount factor gamma = 0.8 based on its age. Only certificates at most 8 generations old have been considered. The last line shows the averages of the bcf values of the three types.

    Run   GP Programs Explored   ADF0    ADF1    Body    Final Gen.
    1     123,009                0.295   0.0     0.704   32
    2     110,892                0.221   0.472   0.416   32
    3     62,699                 0.077   0.526   0.397   17
    4     35,162                 0.447   0.102   0.451   9
    5     55,748                 0.1     0.214   0.685   15
    6     55,438                 0.093   0.202   0.704   15
    Avg.                         0.205   0.252   0.559   -
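Under this reconstruction of the bcf formula (normalized geometric discounting with gamma = 0.8 over ancestors at most 8 generations old), the statistic can be sketched as follows. The genealogy fragment is hypothetical:

```python
def bcf(ancestors, birth_type, gamma=0.8, max_age=8):
    """
    Temporally discounted frequency of a birth-certificate type among the
    ancestors of a solution tree. `ancestors` is a list of (age, type)
    pairs, where age is the generation distance from the solution and type
    is one of "ADF0", "ADF1", "body", "reproduction". The normalization
    (1 - gamma) / (1 - gamma**max_age) follows the reconstructed formula.
    """
    norm = (1 - gamma) / (1 - gamma ** max_age)
    return norm * sum(gamma ** age
                      for age, t in ancestors
                      if t == birth_type and age <= max_age)

# Hypothetical genealogy fragment: (age, birth type) pairs.
genealogy = [(0, "body"), (1, "body"), (1, "ADF1"), (2, "ADF0"),
             (2, "body"), (3, "reproduction")]
scores = {t: bcf(genealogy, t) for t in ("ADF0", "ADF1", "body")}
```

In this fragment, changes to the main body dominate the recent genealogy, mirroring the pattern reported in Table 5.1.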
The above numerical results have taken into account only a small time window compared to the entire number of generations. A more detailed picture of the importance
of various types of crossover during the entire GP evolution is based on a complete analysis of birth certificates, going back to the initial generation. Such an analysis is depicted in Figure 5.5 for a typical case. The overlapping distribution trends of birth certificates suggest both the overall importance of a birth certificate type and its trend over the entire evolution period, from generation 0 until the solution is found.
Figure 5.5: Distribution trend of the percentage of birth certificate types over generations, while looking for a solution to even-5-parity that was found in generation 15. "Random" indicates the propagation of random individuals from the initial population due to reproduction.
The stabilization of changes in the hierarchy occurs bottom-up. Crossover changes in automatically discovered functions are highly non-causal and are performed in early generations. This is in sharp contrast with changes in the main program body. Those changes are mostly changes at low tree heights (as discussed in Section 3.4.1) and are performed in the late generations. An interesting point is that in very early generations the most frequent genetic operations in the genealogy tree were reproduction operations (see Figure 5.5). The results presented confirm that, as the population evolves, increasingly causal changes (i.e. changes in the main body) become more important and are selected. In conclusion, the pre-imposed hierarchical ordering among ADFs biases search in the space of programs. The resulting bias is expressed by the bottom-up evolution
hypothesis, which conjectures that ADF representations become stable in a bottom-up fashion. Early in the process, changes are focused towards the evolution of low-level functions. Later, changes are focused towards higher levels in the hierarchy of functions.
5.2.4 Statistical dynamics

In Chapter 4, Section 4.6 we described interpretations of the entropy of a system. One such interpretation, when the system is represented by a population of individuals, is that entropy offers a measure of the diversity of behaviors in the population. We traced the variation of entropy and the fitness histograms over the time of evolution, and this allowed us to interpret how evolution progressed. Here we do the same analysis for ADF-GP. Figure 5.6 presents the fitness distributions for a run of ADF-GP on the even-5-parity problem. Figure 5.7 shows the variation in entropy over a run of ADF-GP on the same problem. The most obvious difference between Figure 5.6 and Figure 5.7 is the increased exploration of the search space [Rosca, 1995b] in ADF-GP. In the first case, the use of subroutines positively affects the efficiency of GP. Entropy has the tendency to decrease as the system becomes more "organized", i.e. converges and does not discover better solutions due to loss of diversity. This can be seen in Figure 5.7. Entropy steeply increases for about twelve generations, correlated with the initial increase in the number of hits of the best-of-generation individual. After that period, entropy starts to decrease until a new best-of-generation individual is discovered. After that, a new stable regime is reached, and entropy further decreases.
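The entropy measure used in this analysis can be sketched as the Shannon entropy of a binned fitness histogram. The bin count and fitness range below are illustrative assumptions, not the thesis's exact settings:

```python
from collections import Counter
from math import log2

def population_entropy(fitness_values, num_bins=16, max_fitness=32):
    """
    Shannon entropy of the fitness histogram, a proxy for the diversity of
    behaviors in the population (the Section 4.6 interpretation).
    """
    bin_width = max_fitness / num_bins
    bins = Counter(min(int(f / bin_width), num_bins - 1)
                   for f in fitness_values)
    n = len(fitness_values)
    return -sum((c / n) * log2(c / n) for c in bins.values())

# A spread-out population (high diversity) vs. a converged one.
uniform = population_entropy([i % 32 for i in range(320)])
converged = population_entropy([16.0] * 320)
```

A fully converged population occupies a single bin and has zero entropy, while a population spread uniformly over all bins attains the maximum log2(num_bins).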
5.3 Expanding the function set: a formal view

The results concerning the bias in the random generation of expressions from Sections 3.4.1 and 5.2 suggest the following formal view of an enlarged function set. Consider the standard GP procedure operating on the language of expressions created from problem primitives P, where P = T ∪ F. The primitives, terminals T and functions F, comprise methods for accessing and modifying domain-dependent information (possibly state information), and various processing primitives (both domain
Figure 5.6: Fitness distributions over a run of ADF-GP on the even-5-parity problem.
(Curves shown in Figure 5.7: Hits, AvgFitness, and Entropy, plotted against generations.)
Figure 5.7: Entropy variation over a run of ADF-GP on the even-5-parity problem.
dependent and independent). The closure requirement for full generality of genetic operators is that any function call be well defined for any combination of arguments that it may encounter (primitives or other subexpressions). Suppose that all subexpressions in the language return a value from a domain D (for example, D = R). Define F_total to be the set of function compositions over D in terms of the elements of F, actually mapping the input space onto D. Every sub-expression of an expression evolved by GP implements some function which belongs to F_total. Over generations, the change of subexpressions corresponding to any fixed point in a surviving tree is equivalent to a dynamic sampling of functions from F_total. A GP run usually converges after some time, i.e. the population does not appear to improve any longer. This means that the expressions used in the population represent some small attractor subset of F_total. The problem of induction of a solution can alternatively be formulated as the problem of determining a subset of F_total that supplies all the information needed to easily assemble a solution. This formulation is not practical or constructive, but it offers fertile conceptual ground. One approach to simplifying the induction problem is to provide GP with a large set of primitives P developed by hand from a much smaller set of bare primitives. The idea is that the primitives have some relevance to the application. For example, Tackett uses this approach in an application of GP to feature discovery in images. The set of GP primitives includes bare primitives such as area and range, but also compositions of these, such as area range2. If the set of primitives is much larger than needed, GP has to select the right primitives instead of evolving them. Naturally, F ⊆ F_total. It may be difficult to manually determine an appropriate extension E of the function set (F ⊆ E ⊆ F_total) necessary to solve a given problem.
Also, it may be unrealistic to consider huge primitive sets. The question is whether a set of useful subexpressions could be automatically determined while solving the problem. Note, however, that some sub-expressions may depend on the context of evaluation, and in their turn may have side-effects on the state of the simulation used for fitness evaluation. Side-effecting depends on the nature of the primitives used. Nonetheless, certain subexpressions may still turn out to be helpful.
The GA building block hypothesis (BBH) [Goldberg, 1989] is one additional motivation for automatically trying to detect useful subexpressions. The GP schema definition by Koza ([Koza, 1992], page 117) suggests the intuitive idea that subtrees may play the role of functional features. We conjecture that good features may be functionally combined to create good representations [Rosca and Ballard, 1996a].
5.4 Complexity measures for modular evolved expressions

This section proposes a theoretical basis for analyzing the size of modular structures. It applies to any system that can use subroutines. These may be part of the representation (ADF-GP), or may be created dynamically, as will be the case with the AR algorithm. I define the notions of structural complexity, evaluation complexity, expanded structural complexity, and stochastic complexity. The first measure gives an idea of the amount of memory used; the second is an upper bound on the number of primitives evaluated during execution; the third is a measure of the virtual size of an individual that can be directly compared with the structural complexity of standard GP solutions. Finally, stochastic complexity offers both a justification for biasing search towards decompositions into small subroutines and a possible way to extend the fitness function for inductive problems.
5.4.1 Structural and evaluation complexity

Suppose an individual is represented by a program which calls discovered functions, which may in turn call other discovered functions. Nonetheless, the call graph based on the caller-callee relation has no cycles. Let Size(F) be the number of nodes in the tree representing a program F. Let F_0 be the program tree representing an individual T_0 which contains direct or indirect calls to F_1, F_2, ..., F_m. We define structural complexity SC(F_0) and evaluational complexity EC(F_i) as follows:
SC(F_0) = \sum_{0 \le j \le m} Size(F_j)    (5.1)

EC(F_i) = Size(F_i) + \sum_{j \in J_i} EC(F_j) \cdot |Calls(F_i, F_j)|    (5.2)
where J_i = { j | i < j ≤ m and F_i calls F_j } and |Calls(F_i, F_j)| is the number of times F_i calls F_j. In standard GP, where no functions other than the primitive ones are used, the structural and evaluational complexities are equal to the program size Size(F). Assuming that functions from the initial function set are executed in unit time, the evaluational complexity shows how many time units it takes to execute an individual program.
5.4.2 Expanded structural complexity

A true measure of the virtual size of a modular individual, if we had to build it from primitive functions, is obtained by counting all the nodes in the tree resulting from an "inline" expansion of all called functions down to the primitive functions. This complexity measure is called expanded structural complexity. It is computed from the structural complexity (i.e. the number of tree nodes) of all the functions in the hierarchy which are called directly or indirectly in the main program body of the individual. The expanded structural complexity of a program F, denoted IC(F), can be computed in a bottom-up manner starting with the lowest functions in the call graph of F. For each subfunction G, called directly or indirectly by F, IC(G) can be defined using a recursive formula (see Appendix C). Note that EC(F) differs from the expanded structural complexity. Expanded structural complexity corresponds to the notion of circuit size complexity from complexity theory. The following inequalities hold between the introduced complexity measures:
Size(F) ≤ SC(F) ≤ EC(F),    Size(F) ≤ Expanded-SC(F)
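Definitions 5.1 and 5.2 and the inline-expansion idea translate directly into code. The sketch below is illustrative, over tuple-encoded trees; its expansion replaces each call node by the subroutine body plus the expanded argument subtrees, ignoring the parameter substitution that the full definition in Appendix C handles:

```python
def size(tree):
    """Size(F): number of nodes in a tree encoded as (op, child, ...) or a leaf."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(c) for c in tree[1:])

def calls(tree, name):
    """|Calls(F_i, F_j)|: number of call sites of subroutine `name` in `tree`."""
    if not isinstance(tree, tuple):
        return 0
    return (tree[0] == name) + sum(calls(c, name) for c in tree[1:])

def sc(main, defs):
    """SC (5.1): size of the main body plus sizes of all subroutine bodies."""
    return size(main) + sum(size(body) for body in defs.values())

def ec(tree, defs):
    """EC (5.2): Size(F_i) + sum of EC(F_j) * |Calls(F_i, F_j)|; acyclic calls."""
    total = size(tree)
    for name, body in defs.items():
        n = calls(tree, name)
        if n:
            total += n * ec(body, defs)
    return total

def expanded_sc(tree, defs):
    """Approximate expanded structural complexity after inlining every call."""
    if not isinstance(tree, tuple):
        return 1
    if tree[0] in defs:
        return expanded_sc(defs[tree[0]], defs) + \
               sum(expanded_sc(c, defs) for c in tree[1:])
    return 1 + sum(expanded_sc(c, defs) for c in tree[1:])

# Hypothetical individual: a main body calling one discovered function F1.
defs = {"F1": ("and", "x", ("or", "x", "y"))}     # Size = 5
main = ("F1", ("nor", "a", "b"), "c")             # Size = 5
```

For this individual, Size(main) = 5 and SC = 10, illustrating how the measures diverge once subroutines are present.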
5.4.3 Minimum descriptional complexity

We can view the problem of determining a program that explains a set of examples or optimizes a fitness function as one of hypothesis formation: we look for the best program that explains the data. Rissanen's minimum description length (MDL) principle offers an approach to the problem. It states that the best theory to explain a set of data is the one which minimizes the length of the data description together with the hypothesis description. In general, problems such as the inference of a decision tree that best explains a set of examples [Quinlan and Rivest, 1989], the construction of a finite automaton, or the inference of a Boolean function that satisfies a set of constraints all match the described pattern and can be approached using the MDL principle [Li and Vitanyi, 1993]. MDL is also called stochastic complexity. The MDL principle advocates a hierarchical representation of evolved programs (see Appendix B). Moreover, by biasing discovery of subroutines towards small sizes we also bias search towards solutions with smaller descriptional complexity. If a hierarchical organization is discovered, the size of individuals and discovered functions is kept within reasonable bounds while the structural complexity of individuals can be much bigger. Moreover, the descriptional complexity could be used as a measure of the fitness of an individual T, driving GP towards discovering solutions of smaller descriptional complexity [Iba et al., 1994]. It balances the requirement of getting a simple description of a solution tree T against the requirement of minimizing the number of misses. The mechanism for adapting the representation while searching for solutions represents a natural way to generate a hierarchy of more and more complex functions (a hierarchical representation) and to make possible the discovery of a solution of small (or even minimal) description length.
Unfortunately, MDL may not work well with GP in general. Informed approaches to including a parsimony component, such as the MDL principle, implicitly expect that the capability of the learned model varies smoothly with its complexity. This is not true for genetic programming: a small change in a program can entirely destroy its performance. However, with parsimony pressure individuals would tend to take on a hierarchical
organization, because this is a way of working towards achieving a minimum descriptional complexity. We can achieve good results with a parsimony component designed as discussed in Section 4.3.3.
5.5 Adaptive representation

The ADF-GP modular representation presents several advantages: modularization, reuse, and increased diversity, which are likely to improve the performance of the GP engine for applications where a modular decomposition simplifies the problem. Unfortunately, it suffers from the problem of non-causality. Much effort is spent trying to evolve ADF expressions while totally changing the behavior of programs, and not exploiting accumulated changes in the other program components. Another problem is the requirement to predefine the architecture of solutions. In this section we present a new approach that relies on the idea of using subroutines, but changes the strategy for subroutine discovery for problem decomposition. The approach is called adaptive representation (AR).
5.5.1 The AR algorithm

The central idea of an adaptive representation system is to find and use subroutines based on measures of their function. Reusing good building blocks has obvious advantages in terms of economizing the search process. A larger set of functions positively affects the fitness distribution of programs created through initialization or genetic operators. Thus the use of subroutines focuses search in the space of programs. This idea is implemented in a simple form in the adaptive representation approach. AR uses GP to search for good individuals (representations), while adapting the architecture (representation system) through subroutine invention, to facilitate the creation of better representations. These two activities are performed on two distinct tiers (see Figure 5.8). GP search acts at the bottom tier. Due to the fitness-proportionate selection mechanism of GP, more fit program structures pass their substructures to offspring. At
the top tier, the subroutine discovery algorithm selects, generalizes, and preserves good substructures. Discovered subroutines are reflected back in programs from the memory (the current population) and thus adapt the architecture of the population of programs.
Figure 5.8: Two-tier architecture of the adaptive representation algorithm.

The subroutine discovery algorithm creates new subroutines that extend the problem representation, as a result of three steps (see Figure 5.9 for a more formal description):

1. Identify useful blocks of code that appear as a result of genetic operations. Either an informed or a heuristic technique can be employed in specifying what could be useful blocks of code.

2. Generalize the blocks that withstand the selection criterion above using inductive generalization [Michalski, 1983]. The result is a set of new subroutines which extend the current function set.

3. Create a number of random individuals from the extended function set and replace low-fitness individuals (thus exploiting the newly created functions).

The critical problem in AR is the evaluation of the usefulness of a block of code (the first step above). Evaluation should be based on additional domain knowledge whenever such knowledge is available. However, domain-independent methods are more desirable for this goal. The evaluation of subexpressions will be explored in this chapter by means
Adapt-representation(P_i, F_i, F_i+1)
  1. Discover candidate building blocks BB_i by evaluating each block's merit in P_i;
  2. Prune the set of candidates BB_i;
  3. For each block b in the candidate set BB_i, repeat:
     (a) Determine the terminal subset T_b used in the block b;
     (b) Create a new function f having as parameters the subset of terminals T_b and as body the block b;
     (c) Extend the function set F_i+1 with the new function f.
Figure 5.9: Subroutine discovery in the adaptive representation algorithm.

of user-supplied criteria called block fitness functions, and using domain-independent methods in the next chapter.
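The discovery step of Figure 5.9 can be sketched as follows. The names, the threshold-based pruning rule, and the block encoding are illustrative assumptions, and step 3 of the outline above (seeding the population with random individuals over the extended set) is omitted:

```python
def terminals_of(tree):
    """Set of terminal symbols appearing in a block encoded as (op, child, ...)."""
    if not isinstance(tree, tuple):
        return {tree}
    result = set()
    for child in tree[1:]:
        result |= terminals_of(child)
    return result

def adapt_representation(function_set, candidate_blocks, block_fitness, threshold):
    """One subroutine-discovery epoch: prune weak candidates (step 2), then
    generalize each surviving block into a new function whose parameters are
    the terminals it uses (steps 3a-3c)."""
    extended = list(function_set)
    for block in candidate_blocks:
        if block_fitness(block) < threshold:
            continue                              # pruned candidate
        params = sorted(terminals_of(block))      # step (a): used terminals
        extended.append((params, block))          # steps (b) and (c)
    return extended

# Hypothetical candidates with an assumed block fitness assignment.
candidates = [("and", "d0", "d1"), ("or", "d0", "d0")]
fitness = {("and", "d0", "d1"): 8, ("or", "d0", "d0"): 2}
new_set = adapt_representation(["and", "or", "nand", "nor"], candidates,
                               fitness.get, threshold=5)
```

Only the fit block survives pruning; the useless identity block (or d0 d0) is discarded rather than promoted to a subroutine.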
5.5.2 Frequent and fit candidate building blocks

There exist two obvious choices for determining the usefulness of blocks: frequent blocks and fit blocks. The block with the highest usefulness becomes a candidate. Frequent blocks can be determined by keeping track of how often a block appears in the entire population. Surprisingly, frequent blocks are not necessarily useful building blocks. By analogy with the GA schema theorem, a good building block spreads rapidly in the population, and this determines a high frequency count for the block. For example, in the even parity problem, if we augment the function set with the exclusive-or function (xor), then xor (a building block for computing the parity of 2 input bits) will soon become dominant in fit program trees. The problem is that the converse of the GA schema theorem is not true. Poor blocks, i.e. blocks that are identities or add no functionality, may be frequent and should not be considered as candidates, although they may have a role in preserving recessive features (they are introns in [Angeline, 1994b]).
Blocks that appear in a final solution may be discovered very late in the search process. They are not necessarily responsible for the evolution process, usually having a low frequency count. Frequent blocks in early generations may become rare in late generations. Similarly, simply considering the frequency of a block in an individual [Tackett, 1993], the block's constructional or schema fitness [Altenberg, 1994], or conditional expected fitness [Tackett, 1995] is not sufficient. The above arguments are supported by experimental evidence obtained by monitoring frequent blocks in the population. This indicated the unsuitability of the criterion for estimating block usefulness. A much better choice for discovering building blocks is to consider fit blocks. We can incrementally check for new fit blocks instead of relying on expensive statistics over the population. Block evaluation is done with one or more block fitness functions based on supplementary domain knowledge. Block fitness functions are supplied in the definition of the GP problem. Each of the block fitness functions exerts "environmental pressure" for the selection of viable blocks. In a co-evolutionary framework such pressure could come from co-evolving species [Hillis, 1990]. Several other methods can be used to evaluate the fitness of a block. First, one can use the program fitness function to evaluate blocks, or can compute the correlation between the program output value and the subexpression value. This has the advantage of requiring no more domain knowledge than the knowledge built into the fitness function, but is not a general method [Iba and de Garis, 1996]. Second, one can use a slightly modified version of the fitness function, corresponding to a lower-dimensionality problem of the same type. For example, the block fitness may be measured only on a reduced set of fitness cases, dependent on the variables used in the block.
This method actually scales the fitness function down to cope with a smaller-size problem (see the example in Section 5.5.3). As expected, fit blocks are very useful in dynamically extending the problem representation by means of the definition of new global subroutines.
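Monitoring most frequent blocks, as discussed above, amounts to counting subtrees of bounded height across the population. A minimal sketch, where the height bounds follow the 2 ≤ h ≤ 4 setting used in the experiments of Section 5.5.3:

```python
from collections import Counter

def subtrees(tree, min_height=2, max_height=4):
    """Yield every subtree of `tree` whose height is within the given bounds.
    Trees are encoded as (op, child, ...) tuples; leaves have height 0."""
    def height(t):
        if not isinstance(t, tuple):
            return 0
        return 1 + max(height(c) for c in t[1:])
    def walk(t):
        if isinstance(t, tuple):
            yield t
            for c in t[1:]:
                yield from walk(c)
    for sub in walk(tree):
        if min_height <= height(sub) <= max_height:
            yield sub

def most_frequent_blocks(population, k=5):
    """Top-k blocks by raw frequency across the whole population."""
    counts = Counter()
    for program in population:
        counts.update(subtrees(program))
    return counts.most_common(k)

# Hypothetical population sharing one height-2 block.
block = ("and", ("or", "d0", "d1"), "d2")
population = [("nand", block, "d3"), ("nor", "d4", block)]
top = most_frequent_blocks(population, k=1)
```

As the section argues, a high count from such a tally does not by itself certify a useful building block; identities and introns can score just as high.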
5.5.3 Experimental results

The even-n-parity problem is solvable by problem decomposition into simpler subproblems and thus represents a good test bench for the discovery of more and more
complex building blocks. The problem was described in Section 4.1. Table 3.1 (Section 3.4.1) summarizes the parameters used in the GP experiments here, while Table 5.2 summarizes the additional parameters for AR.

Table 5.2: GP setup for solving parity problems of order n, and additional AR parameters.

    Block fitness function       fitness measure applied on a subset of inputs
    Block selection              best two blocks (if any)
    Epoch-replacement-fraction   1/2 or 1/4
The cost, or standardized fitness, of a program i having Hits(i) hits and size Size(i) is:
C_standardized(i) = [2^n - Hits(i)] \cdot C_1 + Size(i) \cdot C_2
where C_1, C_2 are constants. We used both C_1 = 1 and C_2 = 0, and C_1/C_2 = n. We have also tested a formula derived directly from the MDL principle (see Appendix B), with poorer results. One possible explanation is the greater pressure on eliminating dead regions of code, which may prove to be a reservoir for diversity later on. Additional control parameters for GP were as follows. The maximum depth of individuals was D = 17, while new individuals may have a maximum depth of 6. We used both fitness-proportionate and tournament selection (with similar results). We have not experimented with other values of these parameters, but rather have used the values reported in [Koza, 1992] for reasons of result comparability. Other specific AR parameters were as follows. The block fitness function was the same as the fitness function (C_1 = 1 and C_2 = 0), with the change that hits are evaluated on a subset of the set of fitness cases, determined after fixing the values of variables not used in the block to arbitrary values (zero here). This is weaker than computing parity on a subset of inputs. Only blocks with the maximum number of hits are considered as candidates. No pruning function was initially considered (step 2 in Figure 5.9). An epoch-replacement-fraction of 1/2 gave good results when solving even-n-parity with n up to 8. For bigger orders, we chose a smaller value to keep lower the computational overhead due to adapting the representation. In general, the
Table 5.3: Comparison of results (rounded figures): AR-GP vs. results reported in [Koza, 1994b], marked (#). sc is the structural complexity; g is the number of generations.

    Method      even-3     even-4     even-5     even-8
                g    sc    g    sc    g    sc    g    sc
    GP#         5    45    23   113   50   300   -    -
    ADF-GP#     3    48    10   60    28   157   24   186
    AR-GP       2    17    3    15    5    32    10   41

bigger its value, the larger are the computational effort and memory requirements for a run, so one has to trade off the power obtained against the costs incurred. We solved all parity problems up to order 11 on a Sun SPARCstation 10 by adapting the representation based on fit blocks. A comparison of results among AR, GP, and ADF-GP is presented in Table 5.3. The AR-GP row shows the number of generations needed to find a solution with 99% probability in one run and the average structural complexity of solution trees obtained over ten runs of AR-GP on even-parity problems with population size M = 4000. The GP# and ADF-GP# rows present comparative results taken from [Koza, 1994b] for sample runs with similar parameter values, but M = 16000.
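The standardized fitness defined earlier in this subsection is straightforward to compute; a sketch, where the nonzero C_2 value in the second call is an arbitrary illustration of parsimony pressure:

```python
def standardized_fitness(hits, size, n, c1=1.0, c2=0.0):
    """Cost of a program on even-n-parity: missed fitness cases weighted by
    C1 plus a size-based parsimony term weighted by C2."""
    return (2 ** n - hits) * c1 + size * c2

# A perfect even-5-parity program of size 32, without and with parsimony.
no_pressure = standardized_fitness(hits=32, size=32, n=5)
with_pressure = standardized_fitness(hits=32, size=32, n=5, c2=0.2)
```

With C2 = 0 a perfect program has zero cost; any positive C2 makes larger perfect programs strictly worse than smaller ones.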
Emergence of hierarchical representations

It is important to point out the hierarchical structure of the dynamically created functions and their semantics. Table 5.4 presents the main steps of the trace of a run of even-8-parity, and Figure 5.10 presents the final call graph induced in that run. Different levels in the call graph correspond to higher epochs of the evolutionary process. Functions on the same level do not call one another. Only functions in different epochs may take advantage of the genetic material discovered previously. The foundation of the hierarchy is made of the primitive functions included in F0. We have argued from a theoretical point of view that hierarchical structures are more powerful than structures based on the initial function set. Table 5.3 brings some experimental evidence. With ADF or AR, the scalability of the even parity problem
Table 5.4: Important steps in the evolutionary trace for a run of even-8-parity.

    Generation 0. New functions:
        [F681]: (LAMBDA (D3) (NOR D3 (AND D3 D3)))
        [F682]: (LAMBDA (D4 D3) (NAND (OR D3 D4) (NAND D3 D4)))
    Generation 1. New function:
        [F683]: (LAMBDA (D4 D5 D7) (F682 (F681 D4) (F682 D7 D5)))
    Generation 3. New functions:
        [F684]: (LAMBDA (D4 D5 D0 D1 D6)
                  (F683 (F683 D0 D6 D1) (F681 D4) (OR D5 D5)))
        [F685]: (LAMBDA (D1 D7 D6 D5)
                  (F683 (F681 D1) (AND D7 D7) (F682 D5 D6)))
    Generation 7. The solution found is:
        (OR (F682 (F682 (F683 D4 D2 D6)
                        (NAND (NAND (AND D6 D1) (F681 D5)) D1))
                  (F682 (F683 D5 D0 D3) (NOR D7 D2)))
            D5)

improves significantly. Figure 5.11 presents the evaluational complexity and the size of the best-of-generation individual, and averages over the entire population. The structural complexity values are bounded from below and above by Size and EC, respectively. The best-of-generation individual becomes simpler and simpler due to size pressure and the variety of useful, more powerful blocks that appear in the population. Starting with generation 3, discovered functions begin to dominate the structure of the best-of-generation individual as they gradually replace the primitive functions. A statistical analysis of the frequent blocks after the function set is extended on the basis of fit blocks outlines the
Figure 5.10: Call graph for the extended function set in the even-8-parity example. [Figure: the hierarchy, bottom to top: primitives OR, NAND, AND, NOR; F681 (NOT) and F682 (Even-2-Parity) created at generation 0; F683 (Even-3-Parity) at generation 1; F684 (Even-4-Parity) and F685 (Even-5-Parity) at generation 3; EVEN-8-PARITY solved at generation 7.]

importance of the new functions created, and thus of the hypothesized building blocks. The new functions rapidly become dominant in the population, if they are useful. We can thus evaluate extensions of the function set. Figure 5.12 shows that the complexity of individuals increases considerably for a simpler problem (even-5-parity) when standard GP is applied, without bringing noticeable improvements in the standardized fitness. A similar problem would appear if the new functions created by AR did not correspond to good building blocks.
Fit and frequent blocks

This section presents experiments that support the discussion about fit and frequent blocks in Section 5.5.2. All experiments track the presence of small blocks of height h, 2 ≤ h ≤ 4. An example of the evolution of the most frequent blocks (MFB) when disabling the adaptation of the function set in a run of the even-3-parity problem is presented in Figures 5.13 and 5.14. None of the MFBs appeared in a final solution. The most frequent blocks at the end of generation 22, when a solution is found, are:

1. (nand (or (nand d2 d1) (or d2 d2)) (nand (and d0 d1) (nor d2 d2))) appeared 132 times and has 6 hits;
Figure 5.11: Complexity of the best-of-generation individual and average values over the entire population in the even-8-parity example. [Figure: complexity (points), 0 to 140, vs. generation, 0 to 10; curves for EC(Best-of-gen), Size(Best-of-gen), Primitives(Best-of-gen), NewFun(Best-of-gen), AvgEC, AvgSize.]
Figure 5.12: With AR inhibited, an even-5-parity run shows a permanent increase in structural complexity but a plateau in fitness. [Figure: standardized fitness / structural complexity, 10 to 35, vs. generation, 0 to 50; curves for AvgFitness, Fitness(Best-of-gen), SC(Best-of-gen).]
2. (nand (nand (nor d0 d1) d2) (nand (and d0 d1) (nor d2 d2))) appeared 42 times and has 4 hits;

3. (nor (nor d2 d1) (and d0 d2)) appeared 41 times and has 4 hits;

4. (and d0 (nor d0 d2)) appeared 35 times and has 4 hits;

5. (or (nand (or d2 d0) d2) (or (and d1 d0) (nor d1 d0))) appeared 31 times and has 2 hits (MFB5).

Hits were measured over the total number of 8 fitness cases. Note that MFB5 contains a useful sub-block which is nothing but XOR applied to D0 and D1 (the right subtree of its root). This explains why MFB5 has become frequent.
Figure 5.13: Final distribution of block frequencies in a run of even-3-parity. [Figure: count, 0 to 140, vs. block number, 0 to 1800.]

There are 13 blocks with a fitness of 6 at generation 22. One example is (nand (or (nand d2 d1) d0) (nand (and d0 d1) (nor d2 d2))). Runs with AR using discovery of subroutines based on frequent blocks have failed. In contrast, by adapting the representation based on fit blocks, solutions are obtained in at most two generations. Performance is improved dramatically in this case as well as in much more complex ones. There is no such improvement if the function set is extended based on frequent blocks. It is interesting to note that a statistical analysis of the frequent blocks and functions used after the function set is extended on the basis of fit blocks outlines the importance
Figure 5.14: Evolution of the most frequent blocks in the even-3-parity example. [Figure: number of modules, 0 to 140, vs. generation, 0 to 25; curves for MFB1 through MFB5.]

of the new functions created, and thus of the hypothesized building blocks. The new functions rapidly become dominant in the population, if they are useful. We can thus evaluate extensions of the function set. This is of high importance especially when the rules used in establishing the merit of building blocks are heuristics rather than precise ones. With AR, the analysis of the evolution of the most frequent final blocks shows a different picture. The three most frequent blocks in a run of AR on even-5-parity are:

1. (f893 d3 (f894 d4 d0 d1)) appears 131 times and has 16 hits, out of 16.

2. (f895 d3 (f894 d2 d1 d3) d3 (or d1 d1)) appears 102 times and attains 16 hits too.

3. (f894 (f895 d3 (f894 d2 d1 d3) d3 (or d1 d1)) (f892 (f895 d1 d4 d1 d1)) d0) appears 22 times and has 32 hits (it would be chosen for generating a 5-input subfunction).

All MFBs compute either parity or its inverse on a subset of input bits (the last one for all inputs). If the evolutionary path is good, then the population has a high potential to contain and invent new useful building blocks. The GP algorithm will disseminate them in the population.
5.5.4 Discussion

AR does not delete subroutines. Provided that the block selection heuristics are good, the procedure creates stable, useful subroutines. This focuses search towards desirable regions of the search space, leading to improved overall efficiency and scalability. In general, some subroutines may be bad guesses or may only present a temporary advantage. GP has the potential to select useful primitives. Ideally, the algorithm should learn which subroutines to delete and which to keep around, but this is not attempted in AR. There are two important differences between AR and the module acquisition approach [Angeline, 1994a]. First, the created subroutines in AR expand the set of primitives instead of just being recorded into a genetic library. Second, in GLiB encapsulation is done randomly, while in AR blocks selected for subroutine creation are evaluated using either user-supplied heuristic information (block fitness functions) or statistical properties from the population.
5.6 Summary and other related work

The adaptive representation idea departs from the principles of natural evolution and attempts to heuristically speed up the evolution of procedural representations. In contrast to nature, a simulated evolution algorithm has access to a wealth of history information on which it can reflect. The idea is then to use the experience gained in simulations in order to improve GP search. The AR algorithm implements this simple idea in a two-tier architecture. The bottom tier uses GP as a search engine. The top tier implements a meta-level learning or optimization algorithm specific to the representation used. The key idea of the learning algorithm is to extract features from the evolution trace of GP in order to be able to focus search towards new regions of the state space that may be more promising. With procedural representations, a natural choice for what to learn is subroutines, i.e., procedural abstractions. An analysis of randomly enlarged function sets shows that
enlarged function sets are advantageous. AR dynamically creates what can be useful abstractions. AR attempts to implement this idea in a domain-independent way by relying on frequent expressions. This approach has not been successful. The rescue is to use domain-dependent block fitness functions. The next chapter will present domain-independent solutions to selecting genetic material for future abstractions. At the discovery level, AR can follow a reinforcement learning strategy. The discussion in Section 2.3 is relevant from this perspective. An example of a search-focusing strategy very similar to AR's appears in the STAGE algorithm [Boyan and Moore, 1997]. STAGE solves optimization problems in three stages. First, an optimization algorithm such as simulated annealing is used. Pairs formed by the suboptimal solutions and the corresponding values of the objective function can be used to train a function approximator to the objective function. The approximator is used to predict regions of the search space where the optimization algorithm can be run again.
6 Adaptive Representation through Learning

The Adaptive Representation through Learning (ARL) algorithm copes with problems not addressed by the approaches presented in the previous chapter. The most important of these problems is the domain-independent characterization of the value of subexpressions. This chapter discusses in detail the ARL extension to GP. ARL further improves on AR. The distinctive ideas are to evaluate the utility of subexpressions in domain-independent ways, and to dynamically manage the global library of subroutines created by AR by learning what is good from evolution traces. The chapter discusses several domain-independent heuristics for selecting and determining the value of subexpressions. It presents experimental results and discusses the role of discovered subroutines.
6.1 Learning good subroutines: the ARL algorithm

The adaptive representation technique could be further improved by solving two issues. The first is the domain-independent characterization of the value of subexpressions. Previous GP extensions do not attempt to decide what is relevant, i.e., which blocks of code or subroutines may be worth giving special attention, but employ genetic operations at random points. The second issue is the time course of the generation of new subroutines. When should new subroutines be created, and when could subroutines be deleted? Other techniques, including the AR approach in the previous chapter, do
not make informed choices to automatically decide when creation, deletion, or modification of subroutines is advantageous or necessary. Actually, AR has no mechanisms to delete subroutines, but is rather thrifty in creating too many subroutines at one generation. The "what" issue is addressed by relying on local measures such as parent-offspring differential fitness and block activation in order to discover useful subroutines and by learning which subroutines are useful. The "when" issue is addressed by learning evaluations for subroutines and by relying on global population measures, such as population entropy, in order to predict when search reaches local optima and escapes them. This section describes the ARL algorithm. It answers the above questions using both local and global information implicitly stored in the population. Local information is brought to bear based on the notions of differential fitness and block activation. Global information is used to define subroutine utility.
6.1.1 The ARL Strategy

The central idea of the ARL algorithm, as well as of AR, is the dynamic adaptation of the problem vocabulary. The vocabulary at generation t is given by the union of the terminal set T, the function set F, and a set of evolved subroutines St.
T ∪ F represents the set of primitives, which is fixed throughout the evolutionary process. In contrast, St is a set of subroutines whose composition may vary from one generation to another. St may be viewed as a population of subroutines that extends the representation vocabulary in an adaptive manner. Subroutines compete against one another, but may also cooperate for survival, as will be described below. New subroutines are discovered and the "least useful" ones die out. St is used as a pool of additional problem primitives, besides T and F, for randomly generating some individuals in the next generation, t + 1. The ARL algorithm attempts to automatically discover useful subroutines and adapt the set St by applying the heuristic "pieces of useful code may be generalized and successfully applied in more general contexts."
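The adaptive vocabulary can be sketched as three sets, of which only St changes between generations. The sketch below is purely illustrative; the set contents and names are hypothetical stand-ins, not part of any real GP system.

```python
# Sketch of ARL's adaptive vocabulary: T and F are fixed, S_t evolves.
# All names are illustrative, not from a real GP library.
TERMINALS = {"d0", "d1", "d2"}            # T: problem inputs
FUNCTIONS = {"and", "or", "nand", "nor"}  # F: fixed primitives

def vocabulary(subroutines):
    """Primitives available for generating individuals at generation t."""
    return TERMINALS | FUNCTIONS | set(subroutines)

s_t = {"f681", "f682"}             # subroutines discovered so far
s_t = (s_t - {"f681"}) | {"f683"}  # deletion and creation between generations
assert "f683" in vocabulary(s_t) and "f681" not in s_t
```

The key design point mirrored here is that deletion and creation touch only St; the primitive sets T and F are never modified.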
6.1.2 Discovery of Useful Subroutines

New subroutines are created using blocks of genetic material from the pool given by the current population. The major issue here is the detection of "useful" blocks of code. The notion of usefulness in the subroutine discovery heuristic is defined by two concepts: differential fitness and block activation. The subroutine discovery algorithm is presented in Figure 6.1. The major steps in the discovery of useful blocks are described next.
Differential Fitness

The nature of GP is that programs that contain useful code will tend to have a higher fitness and consequently their offspring will tend to dominate the population. The concept of differential fitness is a heuristic which anticipates this trend and focuses on blocks from such individuals. Thus blocks of code are selected from programs that have the biggest fitness improvement over their least fit parent, i.e. the highest differential fitness. Let i be a program in the population having raw fitness Fitness(i). Its differential fitness is defined as:

DiffFitness(i) = Fitness(i) − min_{p ∈ Parents(i)} {Fitness(p)}    (6.1)

We focus on programs i having the following property:

DiffFitness(i) > 0    (6.2)
Large differences in fitness are presumably created by useful combinations of pieces of code appearing in the structure of an individual. This is exactly what the algorithm should discover. Figure 6.2 shows the histogram of the differential fitness defined above for a run of ARL on the Pac-Man problem. Each slice of the plot for a fixed generation represents the number of individuals (in a population of size 500) vs. differential fitness values. The figure shows that only a small number of individuals improve on the fitness of their parents. ARL will focus on such individuals in order to discover salient blocks of code.
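A minimal sketch of Equation 6.1 and the selection rule of Equation 6.2, assuming each individual records its own fitness together with its parents' fitnesses (the record layout is hypothetical):

```python
# Differential fitness (Eq. 6.1): improvement over the least fit parent.
# Individuals are (fitness, parent_fitnesses) pairs; layout is illustrative.
def diff_fitness(fitness, parent_fitnesses):
    if not parent_fitnesses:          # generation-0 individuals have no parents
        return 0.0
    return fitness - min(parent_fitnesses)

def promising(population):
    """Select individuals that improved on their least fit parent (Eq. 6.2)."""
    return [ind for ind in population if diff_fitness(*ind) > 0]

pop = [(12.0, [10.0, 15.0]), (8.0, [9.0]), (20.0, [20.0, 25.0])]
# Only the first individual improves: 12.0 - 10.0 = 2.0 > 0
assert [ind[0] for ind in promising(pop)] == [12.0]
```

Note how the third individual, despite having the highest raw fitness, is excluded: it merely matched its best parent, so it contributes no differential-fitness signal.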
Subroutine-Discovery(Pt, S_new, Pt_dup)

1. Initialize the set of new subroutines S_new = ∅. Initialize the set of duplicate individuals Pt_dup = ∅.
2. Select the subset of promising individuals: B = {i | DiffFitness(i) > 0}.
3. Label each node of every program i ∈ B with the number of activations in the evaluation of i on all fitness cases.
4. Create a set of candidate building blocks BBt by selecting all blocks of small height and high activation from B.
5. Prune the candidate set BBt by eliminating all blocks having inactive nodes.
6. For each block b ∈ BBt in the candidate set, repeat:
   (a) Let b belong to program i. Generalize the code of block b:
       i. Determine the terminal subset Tb used in the block b;
       ii. Create a new subroutine s having as parameters a random subset of Tb and as body the block b with those terminals replaced;
       iii. Create a new program p_dup as a copy of i with block b replaced by an invocation of the new subroutine s. The actual parameters of the call to s are given by the replaced terminals.
   (b) Update S_new and Pt_dup:
       i. S_new = S_new ∪ {s}
       ii. Pt_dup = Pt_dup ∪ {p_dup}
7. Results: S_new, Pt_dup.

Figure 6.1: ARL extension to GP: the subroutine discovery algorithm for adapting the problem representation.
Figure 6.2: Differential fitness distributions over a run of ARL with representation A on the Pac-Man problem. At each generation, only a small fraction of the population has DiffFitness > 0. [Figure: number of programs, 0 to 500, vs. fitness class, −60 to 60, over generations 0 to 50.]
Block Activation

Once candidate parents have been selected, the next step is to identify useful blocks of code within those parents. During repeated program evaluation, some blocks of code are executed more often than others. The more active blocks become candidate blocks. Block activation is defined as the number of times the root node of the block is executed. Salient blocks are active blocks of code from individuals with the highest differential fitness. In contrast to [Tackett, 1995], salient blocks have to be detected efficiently, online. This is possible because candidate blocks are only searched for among the blocks of small height (between 3 and 5 in the current implementation) present in individuals with the highest differential fitness. Nodes with the highest activation value are considered as candidates. In addition, we require that all nodes of the subtree be activated at least once, or a minimum percentage of the total number of activations of the root node. This condition is imposed in order to eliminate from consideration blocks containing introns and hitch-hiking phenomena [Tackett, 1995]. It is represented by the pruning step (5) in Figure 6.1.
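One way to obtain activation counts online is to increment a per-node counter during evaluation; conditionals make activation non-uniform across the tree. The sketch below uses a hypothetical tuple-based tree representation, not the thesis implementation:

```python
# Count node activations while evaluating a tree with a short-circuit
# conditional; node layout and counter scheme are illustrative only.
def evaluate(node, env, counts):
    counts[id(node)] = counts.get(id(node), 0) + 1
    op, args = node[0], node[1:]
    if op == "var":
        return env[args[0]]
    if op == "if":                         # only one branch is activated
        cond, then_, else_ = args
        branch = then_ if evaluate(cond, env, counts) else else_
        return evaluate(branch, env, counts)
    if op == "and":
        return evaluate(args[0], env, counts) and evaluate(args[1], env, counts)

tree = ("if", ("var", "c"), ("var", "x"), ("and", ("var", "x"), ("var", "y")))
counts = {}
for case in ({"c": True, "x": 1, "y": 0}, {"c": False, "x": 1, "y": 1}):
    evaluate(tree, case, counts)
# The 'and' subtree was activated only on the second fitness case.
assert counts[id(tree[3])] == 1
```

This illustrates why inactive-node pruning matters: a subtree shielded by a conditional may never run on any fitness case, making it an intron rather than a useful block.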
Generalization of Blocks

The final step is to formalize the selected blocks as new subroutines and add them to the GP vocabulary. Blocks are generalized by replacing some random subset of terminals in the block with variables (see Step 6a in Figure 6.1). Variables become formal arguments of the subroutine created. The generalization operation makes sense in the case when the primitive symbols satisfy the closure condition [Koza, 1992], i.e. they can be functionally combined in every possible way. In strongly-typed GP [Montana, 1994] each variable or constant has a type, and each function has a signature. The function signature is defined by the type of the function result and by the formal argument types. Block generalization in typed GP additionally assigns a signature to each subroutine created. The subroutine signature is defined by the type of the function that labels the root of the block and the types of the terminals selected to be substituted by variables. Signatures of all primitives and new subroutines represent the new genetic composition constraints.
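Generalization can be sketched as substituting a chosen subset of a block's terminals with fresh parameter names. Here s-expressions are nested Python lists and the parameter-naming scheme is hypothetical:

```python
# Generalize a block: replace selected terminals with formal parameters.
# S-expressions are nested Python lists; all names are illustrative.
def generalize(block, terminals_to_abstract):
    params = {t: f"arg{k}" for k, t in enumerate(sorted(terminals_to_abstract))}
    def subst(node):
        if isinstance(node, list):
            return [subst(child) for child in node]
        return params.get(node, node)       # terminal -> formal parameter
    return list(params.values()), subst(block)

block = ["nand", ["or", "d3", "d4"], ["nand", "d3", "d4"]]
formals, body = generalize(block, {"d3", "d4"})
assert formals == ["arg0", "arg1"]
assert body == ["nand", ["or", "arg0", "arg1"], ["nand", "arg0", "arg1"]]
```

Abstracting both terminals of this block recovers exactly the shape of F682 from Table 5.4: an even-2-parity (XNOR) subroutine over two formal arguments.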
Subroutine Utility

ARL expands the set of subroutines St whenever it discovers new subroutine candidates. All subroutines in St are assigned utility values which are updated every generation. A subroutine's utility is estimated by observing the outcome of using it. This is done by accumulating, as reward, the average fitness values of all programs that have invoked s over the past generations, directly or indirectly (by calling other subroutines that call s directly or indirectly). The subroutine utility is analogous to schema fitness. However, reward accumulation is done over a fixed time window of W generations.¹

¹ Undiscounted past rewards are currently used to estimate the fitness of subroutines. Reinforcement learning (RL) algorithms such as Q-learning [Watkins, 1989] use temporal discounting of future expected rewards. Temporal discounting could also be used here. Undiscounted rewards have been used because the method is simpler, is analogous to schema fitness, and does not favor current mediocre programs that may invoke a subroutine. R-learning is an undiscounted RL algorithm [Schwartz, 1993] that outlines some of the advantages of undiscounted dynamic programming methods.

Thus for a subroutine s, its utility U(s) is:

U(s) = K⁻¹ Σ_{g=t−W}^{t} Σ_j Fitness(j)    (6.3)
where j is a program that invokes s and K is a normalizing constant. In a hierarchy of subroutines, good subroutines higher in the hierarchy may reinforce other subroutines lower in the hierarchy, so programs may "cooperate" for survival. If we define the raw utility of s, Û(s), as the average fitness of all programs directly invoking it, the utility of s, U(s), is equivalent to the following algebraic form:

U(s) = α₀ Û(s) + Σ_j α_j U(s_j)    (6.4)
where s_j is a subroutine that invokes s and α_j is a subunitary weighting factor representing the fraction of all programs calling s indirectly through s_j (j ≥ 1) or directly (j = 0). This formula shows that if s is a good subroutine and a particular subroutine s_j invokes s often, then its utility will also be higher. The set of subroutines co-evolves with the main population of solutions through creation and deletion operations. New subroutines are automatically created based on active blocks as described before. Low-utility subroutines are deleted in order to keep the total number of subroutines below a given number. In order to preserve the functionality of those programs invoking a deleted subroutine, calls to the deleted subroutine are substituted with the actual body of the subroutine, as in an in-line substitution operation.
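A sketch of the windowed, undiscounted utility of Equation 6.3, assuming we log the per-generation fitnesses of each subroutine's invokers; the data layout and the choice of K are hypothetical:

```python
# Windowed subroutine utility (Eq. 6.3): sum of invoker fitnesses over the
# last W generations, normalized by K = number of invocations logged.
# invoker_fitness[g] lists fitnesses of programs invoking s at generation g.
def utility(invoker_fitness, t, window):
    rewards = [f for g in range(max(0, t - window), t + 1)
               for f in invoker_fitness.get(g, [])]
    if not rewards:
        return 0.0
    return sum(rewards) / len(rewards)

log = {0: [5.0, 7.0], 1: [9.0], 3: [11.0]}
# Only generations 1..3 fall in the window, so generation 0 is forgotten.
assert utility(log, t=3, window=2) == 10.0
```

The fixed window makes the utility forget old rewards entirely rather than discount them, matching the undiscounted scheme described in the footnote.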
6.1.3 The ARL Algorithm

The general structure of the ARL algorithm is given in Figure 6.3. It extends a standard GP algorithm with one new step (3b), implementing the adaptation of the problem vocabulary.
6.1.4 When to Create Subroutines: Using Entropy

The use of functional subroutines in the GP population has the effect of preserving a higher population diversity over generations in comparison to standard GP [Rosca, 1995b]. An increased diversity may result in a more effective search, by escaping local optima. The notion of diversity in GP is involved in an answer to the "when" question. In general, there are several possible answers to the "when" question. New subroutines can be created:

1. Whenever the subroutine discovery algorithm suggests new subroutines.
2. At the end of epochs, i.e. periods of consecutive generations throughout which the GP system works like standard GP using a fixed representation system.
3. Adaptively, in response to long-term decreases in population diversity.

There are several justifications for not attempting to adapt the problem representation all the time: (1) the computational overload introduced for tracking blocks of code; (2) the goal to exploit the current set of subroutines St (the vocabulary would be changed only if no progress is made using it); (3) the poor performance of candidate solutions in early generations (progress is probable in early generations, anyway). An appropriate measure of diversity is population entropy [Rosca, 1995a]. The entropy measure represents a global measure for describing the state of the dynamical system represented by the population, in analogy to the state of a physical or informational system. Population entropy can be computed by grouping individuals into classes according to their behavior or phenotype and determining the number of individuals that belong to each class, according to Shannon's information entropy formula [Shannon, 1949]:

E(P) = − Σ_k p_k log(p_k)    (6.5)

where p_k is the proportion of the population P occupied by population partition k at a given time. In GP, a useful class of the partition is a fixed interval of fitness values. Individuals are regarded as equivalent if their fitness values lie in the same interval regardless of differences in their code. Entropy provides a way to track diversity during a GP run. Decreases in population diversity can be correlated with a plateau in the best-of-generation fitness or a plateau in the average fitness over a fraction of the best individuals in the population. Such
ARL Algorithm

Define problem: terminal set T, function set F, fitness function F, set of training cases E. Denote the population at a given generation t by Pt.

1. Initial generation: evolution time (generation) t = 0, discovered subroutines S0 = ∅.
2. Randomly initialize population P0(T ∪ F; P0).
3. Repeat until the termination criterion is met:
   (a) Evaluate population(Pt; E; F).
   (b) Adapt representation(Pt; St; St+1; Pt_dup; Pt_new):
       i. Discover new subroutines S_new and create duplicate individuals Pt_dup by calling Subroutine-Discovery(Pt; S_new; Pt_dup).
       ii. Update subroutine utilities(St; S_new).
       iii. Create the next-generation subroutine set St+1:
            A. Select subroutines of low utility to be deleted: S_old(St; S_new).
            B. St+1 = (St − S_old) ∪ S_new.
       iv. Randomly generate newborns Pt_new(T ∪ F ∪ St+1; Pt_new; E; F).
       v. Evaluate population of newborns(Pt_new; E; F).
   (c) Generate a new population Pt+1 by fitness-proportionate reproduction, crossover, and mutation of individuals(Pt; Pt_dup; Pt_new; Pt+1):
       i. Create intermediate population Pt′ = Pt ∪ Pt_dup ∪ Pt_new.
       ii. Select genetic operation O(pr; pc; pm; O).
       iii. Select winning individuals W from the intermediate population(O; Pt′; W).
       iv. Generate offspring Pt+1(O; W; Pt+1).
   (d) Next generation: t = t + 1.

Figure 6.3: The ARL algorithm extends the standard GP algorithm with Step 3b, which adapts the problem representation system (vocabulary) by creating new subroutines, eventually deleting old ones, and creating new individuals to be entered in the selection competition.
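The skeleton of the generational loop in Figure 6.3 might be organized as below. Every helper is a toy stand-in for the mechanisms the chapter describes (discovery, utility-based pruning, breeding), not a real GP engine:

```python
# Skeleton of the ARL generational loop (Fig. 6.3). All helpers below are
# illustrative stubs standing in for the mechanisms described in the text.
def arl_loop(population, subroutines, generations, discover, prune, breed):
    for t in range(generations):
        new_subs, duplicates = discover(population)        # step 3(b)i
        subroutines = prune(subroutines) | new_subs        # steps 3(b)ii-iii
        population = breed(population + duplicates, subroutines)  # step 3(c)
    return population, subroutines

# Toy stand-ins: no real evolution happens here.
discover = lambda pop: ({f"f{len(pop)}"}, [])
prune = lambda subs: subs          # keep everything (no utility pruning)
breed = lambda pop, subs: pop      # identity "breeding"
pop, subs = arl_loop(["p1", "p2"], set(), 3, discover, prune, breed)
assert subs == {"f2"}
```

The structural point is that vocabulary adaptation (step 3b) happens inside the same loop as selection and breeding, so duplicates carrying new subroutine calls immediately compete with the rest of the population.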
correlations suggest convergence towards a local optimum [Rosca, 1995a]. They can be used to decide when to create new subroutines.
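Population entropy over fitness intervals (Equation 6.5) can be computed as below; the bin width is a hypothetical parameter standing in for the fixed fitness interval mentioned in the text:

```python
import math
from collections import Counter

# Population entropy (Eq. 6.5) over fixed fitness intervals; individuals in
# the same interval are treated as phenotypically equivalent.
def population_entropy(fitnesses, bin_width=1.0):
    bins = Counter(int(f // bin_width) for f in fitnesses)
    n = len(fitnesses)
    return -sum((c / n) * math.log(c / n) for c in bins.values())

# A converged population has zero entropy; a spread-out one does not.
assert population_entropy([5.0, 5.2, 5.4], bin_width=1.0) == 0.0
assert population_entropy([1.0, 2.0, 3.0], bin_width=1.0) > 1.0
```

A long-running decrease of this quantity, combined with a fitness plateau, would be the trigger for creating new subroutines under answer 3 above.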
6.2 Representation approaches

We have tested the ARL algorithm on the problem of controlling an agent in a dynamic environment, described in detail in Section 4.4. Here we present an alternative problem representation, and discuss the results obtained with several search approaches, among which ARL is central. We also highlight differences with human-designed solutions.
6.2.1 Representation approaches in the Pac-Man Problem

The Pac-Man problem primitives chosen for the GP implementation in [Koza, 1992] are sufficiently high level to focus on a single game aspect, that of task prioritization. In this chapter we describe experiments with three related representations of the Pac-Man problem. The first is the Pac-Man problem representation from [Koza, 1992] and is called A1 here. The second is the representation described in Section 4.4, called A2 here. It uses the same vocabulary as representation A1, but changes the result returned by some primitives. More precisely, A2 modifies A1 by making all the action primitives return the distance from the corresponding element. This addresses the problem that distances and directions in A1 are mixed without making much resulting sense, at least for the human designer. GP using A2 cannot mix distances and directions. In contrast, GP using A1 is free to mix them if this provides an evolutionary advantage. We refer to either A1 or A2 as A. The third will be called representation B. It uses a typed vocabulary by taking into account the signature of each primitive, i.e. the return type of each function (subroutine) as well as the types of its arguments. It introduces primitives for evolving explicit logical conditions under which actions can be executed. This representation will be described in more detail next.
Representation B - Typed GP

In problem representations A1 and A2, actions may appear both in the condition and in the action part of an iflte expression. The evaluation of the condition changes the context where the action is executed. This makes it extremely difficult to understand what an evolved program really does without executing it. For analyzing GP, it is desirable that evolved programs express explicit conditions under which certain actions are prescribed. To do this, we used a typed GP system [Montana, 1994; Johnson et al., 1994] based on an extended set of primitives obtained from A2. The problem representation A2 is extended with relational operators (