\[ p_d' = \min\left\{ 1,\ p_d\,\frac{s_x}{s_0} \right\} \quad \text{if } s_x \ge s_0 \tag{4.27} \]
where s_0 is a predefined parameter playing the role of a threshold complexity value (see
Figure 4.8: Example of a probability mass function for choosing the depth of crossover points. The distribution plotted is the negative binomial (Pascal) with r = 2 and p = 0.23, whose cumulative distribution at the maximum tree depth of 17 is 0.999.
Figure 4.9). With this change, disruption of schema H occurs with constant probability
\[ p_d'\,\frac{\|H\|}{s_x} = p_d\,\frac{\|H\|}{s_0} = \text{constant} \tag{4.28} \]
Figure 4.9: Rule for the automatic adaptation of the probability of disruption. pm and pc are
updated proportionately so that their sum follows this rule.
Even if s_x increases, the probability of disruption of x remains constant. The number of instances of x in the next generation will be exclusively influenced by its fitness and by the appearance in the population of better individuals.
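The adaptation rule of equation 4.27 can be sketched in a few lines; the function name and argument names are illustrative stand-ins, and the below-threshold branch assumes the rule of Figure 4.9 (keep the base probability p_d unchanged while s_x < s_0):

```python
def adapted_disruption_prob(p_d, s_x, s_0):
    """Eq. 4.27: scale the disruption probability with complexity s_x
    once it exceeds the threshold s_0, capping the result at 1."""
    if s_x < s_0:
        return p_d
    return min(1.0, p_d * s_x / s_0)
```

With this scaling, the per-node disruption rate p_d' * ||H|| / s_x stays constant for any schema H once s_x >= s_0, which is exactly the invariant of equation 4.28.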
4.4 A complex test case: the Pac-Man game

We consider the problem of controlling an agent in a dynamic environment, similar to the well-known Pac-Man game [Koza, 1992]. The Pac-Man game is a typical RL task. An agent, called Pac-Man, can be controlled to act in a maze of corridors. Up to four monsters chase Pac-Man most of the time. Food pellets, energizers and fruit objects result in rewards of 10, 50 and 2000 points respectively, when reached by Pac-Man (see Figure 4.11). After each capture of an energizer (also called "pill"), Pac-Man can chase monsters in its turn, for a limited time of 25 steps. During this period monsters are "blue." The rewards are 500 points for capturing the first monster, 1000 points for the next, etc., up to four monsters. However, a monster re-emerges from a central den shortly after it is captured by the agent. A snapshot of the Pac-Man world is presented in Figure 4.10.

A solution (also called policy or agent function) to the problem is a program that controls Pac-Man's movements based on current sensor readings, and possibly past sensor readings and internal state (memory). It maps states into actions or sequences of actions. Such a program is an implicit representation of the agent policy and can be evolved by means of GP. But how good can evolved solutions get?

The problem is to learn a controller that drives the Pac-Man agent so as to acquire as many points as possible. The agent has five integer-valued perception primitives, one Boolean perception, and eight overt action primitives (see Figures 4.12 and 4.13). Pac-Man can sense when monsters are blue. The other perception primitives are smell-like senses. Pac-Man can sense the Manhattan distance to the closest food pellet, pill, fruit, and closest or second closest monster. The overt action primitives move the agent along the maze corridors towards or away from the nearest object of a given type.
The function set contains two conditional operators, five perception primitives, and eight action primitives. ifb (if-blue) senses whether monsters are blue; when true it executes its first argument, otherwise it executes its second argument. iflte (if-less-than-or-equal) compares its first argument to its second argument. For a "less-than" result the third argument is executed; for a "greater-or-equal" result the fourth argument is executed.
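The semantics of the two conditionals can be illustrated with a minimal tree interpreter. This is a sketch, not the thesis's implementation: the `world` dictionary of callables is a hypothetical stand-in for the game simulator interface, and the stub primitives simply return fixed distances.

```python
def eval_tree(node, world):
    """Recursively interpret a program tree. A node is either a
    primitive name or a tuple (operator, *argument-subtrees).
    `world` maps primitive names to callables (hypothetical stand-in
    for the simulator interface)."""
    if not isinstance(node, tuple):
        return world[node]()             # perception or action primitive
    op, *args = node
    if op == "ifb":                      # monsters blue -> arg 1, else arg 2
        return eval_tree(args[0] if world["blue"]() else args[1], world)
    if op == "iflte":                    # arg1 < arg2 -> arg3, else arg4
        a = eval_tree(args[0], world)
        b = eval_tree(args[1], world)
        return eval_tree(args[2] if a < b else args[3], world)
    raise ValueError("unknown operator: " + op)

# Example program: chase the nearest monster only while monsters are blue.
world = {"blue": lambda: False,
         "act-a-mon1": lambda: 1,       # stub actions returning distances
         "act-r-mon1": lambda: 5}
prog = ("ifb", "act-a-mon1", "act-r-mon1")
```

With `blue` returning False, evaluating `prog` executes the second argument (retreat); flipping `blue` to True executes the first (advance).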
Figure 4.10: An example of the Pac-Man trajectory for an evolved program. The trace of Pac-Man is marked with vertical dotted lines. The monster traces are marked with horizontal dotted lines. Pac-Man started between the two upside-down T-shaped walls (bottom) while the four monsters were in the central den. Pac-Man headed North-East, captured a fruit and the pill there, and then attracted the monsters into the South-West corner. There it ate the pill and captured three of the monsters (to be reborn in the den). Next it will closely chase the fourth monster.
Object            | Game Points
------------------|--------------------------
Captured Monster  | 500 / 1000 / 1500 / 2000
Fruit             | 2000
Food              | 10
Energizer         | 50
Figure 4.11: Pac-Man game rewards.
Perception      | Result
----------------|---------
SENSE-DIS-FOOD  | distance
SENSE-DIS-PILL  | distance
SENSE-DIS-FRUIT | distance
SENSE-DIS-MON1  | distance
SENSE-DIS-MON2  | distance
IFB             | boolean
Figure 4.12: Pac-Man perceptions are deictic, smell-like primitives. ifb returns a true boolean value only if monsters are blue. Every other primitive returns the distance to the closest object of a given type in the world. Distance is an integer in the range [0, 43]. This and the function primitives determine the total number of possible perception states that can be experienced by the agent.
Action      | Result
------------|---------
ACT-A-MON1  | distance
ACT-A-MON2  | distance
ACT-A-PILL  | distance
ACT-A-FRUIT | distance
ACT-A-FOOD  | distance
ACT-R-MON1  | distance
ACT-R-MON2  | distance
ACT-R-PILL  | distance
ACT-R-FRUIT | distance
Figure 4.13: Pac-Man actions are based on deictic routines: advance towards or retreat from the closest (and second closest, for monsters) object of a given type. Each action primitive returns the distance to the corresponding object.
The perception primitives return the Manhattan distance to the closest food pellet, pill, fruit and closest or second closest monster. They are, respectively: sense-dis-food, sense-dis-pill, sense-dis-fruit, sense-dis-mon1, sense-dis-mon2. The terminal set has no elements. The action primitives move the agent along maze corridors. All return a number encoding the direction faced by the agent. For instance, act-a-pill advances the agent on the shortest path to the nearest uneaten energizer, while act-r-pill retreats the agent from the nearest uneaten energizer. The other actions have analogous functions with respect to the closest monster, the second closest monster, fruit, and food: act-a-mon1, act-r-mon1, act-a-mon2, act-r-mon2, act-a-fruit, act-a-food. If the shortest path or closest monster or food is not uniquely defined, then a random choice from the valid ones is returned by the corresponding function. A program is evaluated based on the performance of the agent on an initial world configuration.
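The random tie-breaking among equally near targets can be sketched as follows; the function name and the coordinate representation are hypothetical, and real corridor distances would be path distances rather than raw Manhattan distances:

```python
import random

def nearest_target(agent_pos, targets):
    """Return a target at minimal Manhattan distance from the agent,
    chosen uniformly at random when the minimum is not unique."""
    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    dmin = min(manhattan(agent_pos, t) for t in targets)
    nearest = [t for t in targets if manhattan(agent_pos, t) == dmin]
    return random.choice(nearest)
```

This randomization is one of the sources of nondeterminism discussed in Section 4.4.1.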
4.4.1 Problem Difficulty, Fitness Cases and the Fitness Measure

The Pac-Man problem has a number of major sources of difficulty. First, Pac-Man is an active agent in a dynamic environment, so the number of world states it can encounter during one simulation is huge. Second, the degree of perceptual aliasing (or hidden state) is high. Third, the problem has several sources of nondeterminism:
- monster moves are random 20% of the time;
- fruit moves are occasionally random.

Finding an optimal control policy would be intractable even for a deterministic environment. This difficulty raises questions such as what good solutions are and how much training is needed to evolve them. Training effort is measured by the number of simulations executed or the number of primitives executed. In the game simulator, any control decision (ifb, iflte, ifte) or agent perceptual action (sense) takes zero time, and any agent movement action (act) takes one time unit. Monsters and fruit move synchronously with the agent. A solution is interpreted repeatedly until Pac-Man is captured by a monster or eats all food pellets.

Each simulation of an evolved program controller starts in the same initial world state. A number of training, or fitness, cases is considered. Each training case corresponds to one simulation. Multiple simulations of the same program have different outcomes due to random events in the external environment. The actual sequence of random events for a simulation is controlled by a random number generator.

The fitness of a program is the average number of points, or "hits," accumulated by the agent under the control of the program on the set of fitness cases. The standardized fitness of program i, StdFitness(i), is the difference maxpoints − Fitness(i), where maxpoints is the theoretical maximum number of points that can be obtained if the agent gathers all food pellets, all fruit, all energizers and the maximum number of monsters (four) in each of the four blue periods generated by eating all four energizers.

Random events in the simulation of a program result in huge variations in problem difficulty and thus determine huge variations in fitness. In order to evolve general
good solutions, a large number of fitness cases should be considered. This requires an increased computational effort for each individual fitness evaluation. Reynolds [1992] discussed this typical speed-accuracy trade-off for the problem of evolving a "corridor following" control program. He defined Fitness(i) as the minimum number of simulation steps, over several fitness (i.e. training) cases, taken before the first collision of the agent.
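The fitness and standardized fitness defined above reduce to a short computation; the argument names are illustrative, with `points_per_case` holding the points scored in each simulated fitness case:

```python
def standardized_fitness(points_per_case, maxpoints):
    """Fitness = average points over the fitness cases (simulations);
    StdFitness = maxpoints - Fitness, so lower is better."""
    fitness = sum(points_per_case) / len(points_per_case)
    return maxpoints - fitness
```

For example, a controller scoring 1000, 2000 and 3000 points over three simulations has Fitness 2000 and, for maxpoints = 10000, StdFitness 8000.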
4.5 Experimental results

This section presents experiments aimed at reinforcing conclusions suggested by the previous theoretical analysis. The interpretations given take into account the interplay between two factors: fitness and complexity.

One cannot know in advance which particular tree-schema will be preferred by the evolutionary process in a run of GP. Moreover, it is extremely hard to trace a tree-schema property for all competing tree-schemata of any given shape and large size. However, one can examine relevant properties globally, for the entire population, and narrowly, for the best individual in the population.

The experiments below provide a unitary view of GP dynamics by looking at a common set of measures over three types of runs of the standard GP engine. The measures are quantities that have appeared throughout our derivations and have been used to qualitatively explain results, such as the averages over the population (f, s, f_s) and, for the best individual in a generation, f_best, s_best, or f_s best. The types of runs performed are:

1. Standard GP with raw fitness being only a measure of performance.

2. Standard GP with parsimony pressure. The fitness function combines raw fitness and a linear parsimony component to penalize a size increase.

3. Adaptive GP, where the probabilities of mutation, crossover and reproduction are updated dynamically in order to impose constant parsimony pressure on competing tree-schemata regardless of the complexity of evolved structures.
Two test problems are used. The first problem is the induction of a Boolean formula (circuit) that computes parity on a set of bits [Koza, 1994b]. The raw fitness function has access to fitness cases that show the parity value for all inputs, and counts the number of correct parity computations. These experiments use the following parameters: population size M = 4000, number of generations N = 50, crossover rate pc = 89% (20% on leaf nodes), mutation rate pm = 1%, reproduction rate pr = 10%, number of fitness cases = 2^n, where n is the order of the parity problem.

The second problem is the induction of a controller for a robotic agent in a dynamic and nondeterministic environment, as in the Pac-Man game [Koza, 1992; Rosca and Ballard, 1996a]. Raw fitness here is computed from the performance of evolved controllers over a number of simulations. The parameters for these runs are M = 500, N = 150, pc = 89%, pm = 1%, pr = 10%, with three simulations determining individual fitness.

Each of Figures 4.14-4.18 contains three plots: (a) variation of average complexity s = AvgS and the complexity of the best-of-generation individual s_best = S(Best) (top); (b) variation of the ratio of averages f_s = AvgF/S and of f_s best = F/S(Best) (middle); (c) the fitness learning curve f_best = F(Best) and variation of average fitness f = AvgF (bottom).
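The parity raw fitness (the number of "hits" over all 2^n fitness cases) can be sketched as follows; `program` is a hypothetical callable standing in for an evolved Boolean expression:

```python
from itertools import product

def parity_hits(program, n):
    """Raw fitness ('hits') for even-n-parity: the number of the 2**n
    input vectors on which `program` returns the even-parity bit."""
    return sum(program(bits) == (sum(bits) % 2 == 0)
               for bits in product((0, 1), repeat=n))
```

A perfect even-5-parity solution scores 2^5 = 32 hits; a program that always answers True scores only on the 16 even-weight inputs.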
Fitness based on pure performance

Over the time span of evolution in a GP run, there are often long periods of time when no fitness improvements are noticed. Section 4.3.2 has proved that an increase in an individual's survival rate can be accomplished by an increase in its complexity, but not in its fitness. Can this effect generalize to the entire population? We suggested that a "yes" answer is plausible, which indicates that the performance of the GP engine can be seriously degraded. Here we present experimental evidence.

The variation in the complexity of evolved structures can be seen in plots correlating the learning curves and complexity curves for the two test problems. When fitness remains constant, both the best-of-generation complexity S(Best) and the average complexity s = AvgS indeed increase over time. Plateaus of F(Best) can be observed in Figure 4.14(c) between generations 33 and 59, or Figure 4.15(c) between generations
15 and 47, or 53 and 101. During the corresponding time intervals, size almost doubles in Figure 4.14(a) and significantly increases in Figure 4.15(a), while average fitness also increases. The increase in average size is explained by the predominant increase in survival rate of above-average individuals of increased size in the absence of any fitness improvements.
Parsimonious fitness

Next we present experiments where parsimony pressure is applied during selection, in order to confine the survival of individuals of ever-increasing complexity and thus guard against the apparent loss of efficiency of GP search. An important question is whether parsimony would deter GP search from finding fit solutions at the expense of finding parsimonious solutions. This could happen because of the artificial distortion in fitness created by the parsimony component.

We expect to see a decrease in size over spans of time with no improvement in fitness. Three such intervals can be noticed in Figure 4.16(a) and (c): generations 18 to 57, 58 to 72, and 73 to 99. The GP algorithm discovers a new best solution, having a complexity higher than every previous individual, at the beginning of each of these intervals. Following the complexity curves towards the end of the intervals, we note a gradual decrease in S(Best). The same tendency is conspicuous in the average complexity plot AvgS, which has a shape similar to S(Best) and is delayed by about four to five generations. The delay period is the time needed by selection to pick up on the opportunities created at the beginning of the above intervals.

One remarkable feature of the AvgF plot in Figure 4.16(c) is that average fitness decreases in correlation with AvgS. The explanation is that parsimony pressure determines a decrease in complexity, which makes mutation and crossover operations more disruptive. This generates a decrease in average fitness over the population. The effect is even clearer when the value of the weighting factor increases (see Figure 4.17).

Note also the rapid increases in fitness in early generations in Figures 4.16(b) and 4.17(b). They show that the following relation holds in very early generations, in contrast
Figure 4.14: Variation of size (a) (topmost), fitness/size (b) (middle), and raw-fitness learning curve (c) (bottom) in a run of GP on even-5-parity without parsimony pressure.
Figure 4.15: Variation of size (a) (topmost), fitness/size (b) (middle), and raw-fitness learning curve (c) (bottom) in a run of GP on the Pac-Man problem without parsimony pressure.
to Figures 4.14(b) and 4.15(b):
\[ \frac{f_{best}}{s_{best}} > \frac{f}{s} \]
i.e., that the stronger selection pressure towards more effective individuals due to parsimony is useful to rapidly focus search on good structures (as in the discussion of equations 4.20 and 4.25). The ability of the GP engine to find fit solutions is improved considerably when using a parsimonious fitness function.
Adaptive probability of destruction

In this third experiment we modify the standard GP engine to adapt the probability of disruption of structures (mutation and crossover) as in equation 4.27. The standard GP procedure varies selected structures in proportion to pc and pm and keeps surviving selected structures around (in the next generation) in proportion to pr = 1 − pc − pm. This is done globally, in the sense that a pc fraction of the next generation is obtained through crossover on selected structures, etc. In contrast, the size-adaptive procedure decides which genetic operation to apply for each selected individual. In this way, the complexity of the individual can be used in the decision between mutation, crossover and survival. The procedure globally records the proportions of the next generation obtained with each genetic operation.

An example is given in Figure 4.19, where one can see the variations in the probability of crossover. The variations can be correlated with the variations in the average complexity AvgS. Although size increases over time, the higher disruption appears to limit the size increase without disrupting the search process.
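The per-individual decision can be sketched as below. This is an interpretation, not the exact procedure of the thesis: it scales the combined disruption probability p_c + p_m with complexity as in equation 4.27, splits it between crossover and mutation in their original proportion, and gives the remainder to reproduction.

```python
import random

def choose_operation(s_x, s_0, p_c, p_m):
    """Pick a genetic operation for one selected individual of
    complexity s_x, scaling the disruption probability (crossover +
    mutation) as in eq. 4.27; the remainder goes to reproduction."""
    p_d = p_c + p_m
    if s_x >= s_0:
        p_d = min(1.0, p_d * s_x / s_0)
    r = random.random()
    if r < p_d * p_c / (p_c + p_m):      # crossover share of disruption
        return "crossover"
    if r < p_d:                          # mutation share of disruption
        return "mutation"
    return "reproduction"
```

For very large individuals p_d saturates at 1, so such structures are always disrupted and never merely copied, which is what limits the size increase.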
Summary of experiments

The experiments above have attained three main goals in relation to the theoretical analysis of tree-schema growth:
Figure 4.16: The curves from Figure 4.15 repeated for GP with parsimony pressure = 0.1.
Figure 4.17: The curves from Figure 4.15 repeated for GP with parsimony pressure = 1.0.
Figure 4.18: Pac-Man learning curve and variation of size in a run of GP with auto-adaptation of the crossover, mutation and reproduction rates.
Figure 4.19: Adaptation in the probability of crossover.
- Traced the GP-specific variable complexity during evolution and interpreted its variations from the perspective of the size-dependent growth formula 4.13. Complexity increases can derail the search effort of the GP engine. This is shown by the long stable periods with no improvements in the most fit structures.

- Traced the influence of an additive parsimony component to fitness. Experiments show a considerable increase in efficiency and offer insight into the choice of the value of the weighting factor of the parsimony component.

- Observed at work the second proposed alternative for imposing parsimony pressure. More experiments are needed to clearly assess the advantage of this adaptive method.
4.6 Statistical dynamics of GP

4.6.1 Analogy with a physical system

Ludwig Boltzmann introduced the distinction between micro state and macro state, which enabled him to give a statistical interpretation to thermodynamics [Thompson, 1988]. The micro state description of a physical system would include a specification of state variables (such as position and velocity) for each particle. Theoretically, this could completely define the state of the system. In contrast, a macro state is a macroscopic description, i.e. one that is defined in terms of observable properties of the system (such as mass, volume, or velocity).

By analogy to a physical system, consider that the macro state of the stochastic system represented by a GA/GP system is defined by its entire population at a given time. We can observe properties that define global measures such as average fitness or best-of-generation fitness. In GP in particular, many genotypes may correspond to the same phenotype. We may not be interested in particular genotypes exactly, but rather in the course of evolution. In this analogy, a particular genotype would correspond to a micro state.
We extend the analogy by interpreting fitness as energy. The energy of an individual i is in this case:

\[ H(i) = \text{StdFitness}(i) \]

The principle of natural selection is strongly tied to the idea of energy, as individuals in a population compete for the effective utilization of energy resources [Wicken, 1988]. Ideally, there would be no uncertainty regarding the state if the entire population were made up of copies of a single individual (one having the minimum energy, for a global optimum state). However, genetic search starts with a randomly generated state. During genetic search, micro states fluctuate, determining a variation of the state in time.

In thermodynamics, the energy of a system depends on the absolute temperature T, another macroscopic state variable. We could also use temperature in our interpretations. However, here we will only consider that the temperature has a constant fixed value T.

The above analogy enables us to apply some of the results from statistical mechanics in order to qualitatively interpret state changes and convergence. One extensive property of a system's state is entropy; it is defined below. The probability of a state i in thermal equilibrium is given by the Boltzmann-Gibbs distribution:
\[ \text{Prob}(i) = p_i = Z^{-1} e^{-H(i)/T} \]

where Z is a normalizing constant needed in order to make p a probability distribution. Z actually plays a very important role in statistical physics and is called the partition function:

\[ Z = \sum_i e^{-H(i)/T} \]
If we define the free energy of the system as

\[ F = -T \log Z \]

it can easily be shown that

\[ F = \langle H \rangle - T S \tag{4.29} \]
where \( \langle H \rangle \) represents the average value of the random variable H and S is the entropy of the system:

\[ S = -\sum_i p_i \log p_i \tag{4.30} \]
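The identity 4.29 can be checked numerically for any finite set of micro-state energies; the energies and temperature below are arbitrary made-up values used only to exercise the formulas:

```python
import math

# Numerical check of eq. 4.29, F = <H> - T*S, for a made-up
# set of micro-state energies H at temperature T.
H = [0.0, 1.0, 2.5, 4.0]
T = 1.3
Z = sum(math.exp(-h / T) for h in H)            # partition function
p = [math.exp(-h / T) / Z for h in H]           # Boltzmann-Gibbs probabilities
F = -T * math.log(Z)                            # free energy
avg_H = sum(pi * h for pi, h in zip(p, H))      # <H>
S = -sum(pi * math.log(pi) for pi in p)         # entropy (4.30)
assert abs(F - (avg_H - T * S)) < 1e-12
```

The identity follows because log p_i = -H(i)/T - log Z, so T * sum_i p_i log p_i = -<H> - T log Z = -<H> + F.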
The free energy can be interpreted as the sum of the probabilities of individual states, according to the following identity:
\[ \frac{e^{-F/T}}{Z} = \sum_i p_i = 1 \]

In the free energy formula (4.29), estimations of H and S would result in an estimation of F, which can be interpreted as the probability of finding the system in a subset of states [Hertz et al., 1991].

The classical interpretation of entropy comes from thermodynamics. The entropy function was introduced by Clausius to represent the change of state when an increment of energy is added to a body as heat during a reversible process. It was later interpreted statistically by Boltzmann. The entropy of a system whose micro states are uncertain and have probabilities of occurrence p_i dependent on energy is defined by relation 4.30, up to a constant. The entropy has a maximum value when all micro states are equiprobable. Entropy represents the disorder in the system of particles and tends to increase for irreversible processes (such as the ones in nature), according to the second law of thermodynamics [Thompson, 1988].

Shannon used the same formula to define an information measure representing one's ignorance of which of a number of possibilities actually holds, given the a priori probability distribution represented by P [Shannon, 1949]. Yet another interpretation of entropy is complexity [Chaitin, 1987], or the information content of an individual structure. In this context, order means compressibility. Redundancies subtract from an individual's complexity. All these interpretations use the same formula (4.30) but assign different meanings to the probabilities fed into the formula.

This generalization tendency in interpreting entropy led researchers to search for a unifying view between the statistical interpretations of the second law of thermodynamics in physics and evolutionary principles in biology [Bruce H. Weber and Smith, 1988]. Schrodinger [Schrodinger, 1945] and others have noticed the following paradox: the increase in entropy in physical systems brings about a disorganization of the systems. Equivalently, systems evolve from less probable to more probable states. In contrast, natural evolution is described as progress, a transformation from simple to complex, or from more to less probable states. Schrodinger explained the paradox by looking at the flux of energy in a living system and suggesting that it does not conform to the basic assumptions of classical thermodynamics.

Among the various claims about the role of the second law of thermodynamics in biological evolution [Bruce H. Weber and Smith, 1988], Wicken proposed that genetic variation is due to the probabilistic nature of the second law [Wicken, 1988]. One measure that quantifies variation is diversity. Johnson defined diversity in terms of the distribution of the energy within the system based on Shannon's information entropy measure, but pointed out that diversity is not perfectly synonymous with either information or statistical entropy [Johnson, 1988].
4.6.2 Population entropy as a diversity measure

A rule of thumb in the GA literature postulates that population diversity is important for avoiding premature convergence. The problem is how to capture heterogeneity. A straightforward definition of diversity, or non-similarity, for GA string-based representations is based on the Hamming distance between encodings of individuals. [Eshelman and Schaffer, 1993] discuss strategies for maintaining GA population diversity by controlling how mates are selected, how children are created by recombination, and how parents are replaced. Eshelman and Schaffer propose a method called "incest prevention," in which individuals are randomly paired for mating provided that their Hamming distance is above a certain threshold. Their method is shown to be superior in examples based on elitist selection.

In GP, diversity may be defined as the percentage of structurally distinct individuals at a given generation. Two individuals are structurally distinct if they are not
isomorphic trees. However, such a definition is not practically useful. It is computationally expensive to test for tree isomorphisms. Moreover, associativity of functions is extremely difficult to take into account. In contrast, similarity between structures can be easily tested in GAs.

[Ryan, 1994] uses an intuitive measure of diversity, based on performance, and shows that maintaining increased diversity in GP leads to better performance. His algorithm is called "disassortative mating." It selects parents for crossover from two different lists of individuals. One list of individuals is ranked based on fitness, while the other is ranked based on the sum of size and weighted fitness. The individuals from the second list are presumably different in structure and fitness from the ones in the first list. The goal is to evolve solutions of minimal size that solve the problem. By directly using the size constraint, the GP algorithm would be prevented from finding solutions. In contrast, the disassortative mating algorithm improves convergence to a better optimum while maintaining speed.

Two other diversity measures discussed in [Rosca, 1995b] are the distribution of complexity of individuals (expanded structural complexity) and the distribution of fitness values. The latter is a more direct and easily observable type of variation in the population. Two individuals are different if they score differently. Such information can be succinctly described using Shannon's information entropy formula and represents a global measure for describing the state of the dynamical system represented by the population, in analogy to the state of a physical or informational system:

\[ E(P) = -\sum_k p_k \log p_k \]
where p_k is the proportion of the population P occupied by population partition k at a given time.

Entropy has been used as a measure of diversity of an evolving ecological community in [Ray, 1993]. Partitions were defined as individuals having the same genotype. In a functional approach such as GP, an appropriate measure of diversity is obtained by grouping individuals into classes according to their behavior or phenotype and computing the population entropy based on the number of individuals belonging to each of these classes.
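Partitioning by fitness score, the population entropy E(P) reduces to a few lines; the function name is illustrative, and `scores` stands for any hashable phenotype descriptor (here, fitness values):

```python
import math
from collections import Counter

def population_entropy(scores):
    """E(P) = -sum_k p_k log p_k, where individuals are partitioned
    by their score (a proxy for phenotype/behavior) and p_k is the
    fraction of the population in partition k."""
    n = len(scores)
    counts = Counter(scores)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A population in which everyone scores identically has entropy 0; a population split evenly between two scores has entropy log 2, and entropy is maximal when all partitions are equally occupied.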
4.6.3 Entropy experiments

This section examines the relation between diversity, as measured by population entropy, and fitness variation. Four examples are presented, from two problem domains: Boolean regression and controlling an agent in a dynamic environment (similar to the Pac-Man problem described in [Koza, 1992]). Each example discusses the relationships between the best-of-generation fitness, the average population fitness (called energy in our earlier discussion), and diversity, as measured by the entropy formula.

The GP setup for the parity problem was described in Table 3.1. The setup for Pac-Man is described in Section 4.4. Other GP parameters were chosen as in [Koza, 1994b]. The GP termination criterion did not take into account whether a solution was found. The plots shown in this section represent three measures of interest: the best-of-generation individual (hits), the average population fitness, and the population entropy. The fitness and hits graphs (best-of-generation number of hits and average population fitness) have the value axis on the left, while the entropy plots have the value axis on the right.

Figure 4.20 presents a three-dimensional plot with fitness distributions for a typical run of GP on the even-5-parity problem. These plots offer a compact representation of the fitness histograms used in [Koza, 1992]. Koza pointed out that fitness histograms "give a highly informative view of the progress of the learning process for the population as a whole." The 3-D plot clearly shows the global improvement in fitness over the population. New features that allow improvements are probably synthesized and transmitted from parents to offspring, as suggested by the wave-like advance of the distributions. Once one of the best individuals is discovered, the number of individuals with similar behaviors increases exponentially, until undermined by a similar increase of individuals with an even better behavior (hits).
Figure 4.20: Fitness distributions over a run of GP on the even-5-parity problem.
Even-5-parity in standard GP

The variation in entropy is jointly represented with the learning curves (hits and standardized fitness) in Figure 4.21. Figures 4.23 and 4.22 show the long-term evolution in this run, up to generation 200. Notice that GP continues to improve over time. This is also supported by the preservation of an entropy around the value of 1.5 (Figure 4.23), as opposed to a drop in entropy, which would signal a decrease in diversity and in the likelihood of discovering better solutions. GP improves slowly, and if run for a long enough number of generations it could reach the optimum solution of 32 hits.
GP on the Pac-Man problem

A similar analysis of entropy can be performed for the Pac-Man problem, and is presented in Figure 4.24. Although the entropy has the general features from the previous examples, it is much noisier. This can be due to the increased instability of solutions
Figure 4.21: Entropy reflects population diversity. In a run for even-5-parity, entropy clearly increases in the first 15 generations, when significant improvements over the random initial population are achieved. Then entropy remains at a relatively constant level. Will it decrease and signal a freezing of any further evolution? See Figures 4.22 and 4.23 for the answer.
Figure 4.22: Best-of-generation number of hits and average fitness in a run of the parity example for 200 generations. The first part of this run is detailed in Figure 4.21.
Figure 4.23: Entropy variation in the run of the even-5-parity example from Figure 4.21.

in the Pac-Man domain, very large distributions of fitness values, and to the smaller population size (eight times smaller than in the parity example). This time entropy decreases from a high initial value to lower and lower values during the run. Entropy decreases in spite of improvements in fitness (both average and best-of-generation). This indicates a selection pressure towards good individuals that is too high.
Discussion

The examples above present common patterns and suggest the following conclusions:

1. Monotonic decreases in population entropy over an increasing number of generations indicate possible local search optima. These are associated with plateaus on the best-of-generation hit plots.

2. Entropy decreases correspond to decreases in population diversity, but not necessarily to decreases in fitness. This situation indicates a selection pressure higher than optimal.

3. An improvement in average fitness may be caused by the selection of above-average individuals in larger proportions, and does not necessarily show that beneficial changes have been made to the population composition.
Figure 4.24: Best-of-generation hits, average fitness, and entropy in a run of GP on the Pac-Man problem.

4. The correlation between entropy (i.e. population diversity) and maximum energy (i.e. best-of-generation fitness) suggests when computational effort is wasted due to local minima. This information can be used to control perturbations in the population or to stop the simulation.

In GP, computational effort should be spent so that diversity is increased only when there is clear evidence that search has reached a local minimum. A description of phenotypic diversity based on the entropy formula appears to be useful when correlated with other statistical measures extracted from the population.
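The idea of using the entropy trace to control perturbations or to stop the simulation can be sketched as a simple monitor. The window length and tolerance below are illustrative parameters, not values taken from the experiments:

```python
def should_perturb(entropy_history, window=10, tol=1e-6):
    """Heuristic trigger: flag a likely local optimum when entropy has
    decreased monotonically over the last `window` generations, at which
    point the run could be perturbed or stopped. `window` and `tol` are
    illustrative, not values from the text."""
    if len(entropy_history) < window:
        return False
    recent = entropy_history[-window:]
    # strictly decreasing (beyond the tolerance) at every step
    return all(b < a - tol for a, b in zip(recent, recent[1:]))

# A steady slide in entropy trips the trigger ...
print(should_perturb([2.0 - 0.1 * i for i in range(12)]))  # True
# ... while an oscillating but non-collapsing entropy does not.
print(should_perturb([1.5, 1.4] * 6))                      # False
```

In a real controller one would pair such a trigger with a concrete response, e.g. raising mutation rates or injecting fresh random individuals.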
4.7 Related work

The problem of learning a non-parametric model without a priori biasing for particular structures has been tackled in the area of non-parametric statistical inference. In statistical terms this is the problem of learning with low bias, or tabula-rasa learning. However, low bias in the choice of models is paid for by a high variance (see [Geman et al., 1992] for an excellent introduction to the bias/variance dilemma). Methods for
balancing bias and variance include techniques that rely on a complexity penalty function which is added to the error term in order to promote parsimonious solutions. The basic idea is to trade the complexity of the model for its accuracy. This idea resonates with one of the fundamental principles in inductive learning, Ockham's razor, which is interpreted as: "Among the several theories that are consistent with the observed phenomena, one should pick the simplest theory" [Li and Vitanyi, 1992]. What is simple often turns out to be more general.

One common approach to dealing with a variable-complexity model within the Bayesian estimation framework is Rissanen's minimum description length (MDL) principle [Li and Vitanyi, 1993]. The MDL principle trades off the model code length, i.e. the complexity term, against the error code length, i.e. the data not explained by the model, or error term. Complexity is naturally expressed as the size of code or data in bits of information. Informed approaches that include a parsimony component, such as the MDL principle, implicitly expect that the capability of the learned model is a smooth function of its complexity. This is not true for Genetic Programming, which furthermore cannot afford to exploit a large number of training examples or use infinite populations in order to overcome the problem.9 A small change in a program can entirely destroy its performance. The capability of a model specified with a program is not a smooth function of its complexity. Nonetheless GP manages to sample the space of programs and to discover automatically satisfiable models of variable complexity.

The MDL principle has also been applied in GP to extend the fitness function of hybrid classification models [Iba et al., 1993; Iba et al., 1994]. For example, [Iba et al., 1994] applied the MDL principle in the learning rule of a GP-regression tree hybrid.
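The MDL trade-off can be sketched as a two-term score. The per-node encoding cost below is an illustrative assumption; the actual code-length computations in the MDL literature are more involved:

```python
def mdl_fitness(error_bits, model_size_nodes, bits_per_node=4.0):
    """MDL-style score (lower is better): total description length equals
    the error code length (data not explained by the model) plus the
    model code length. `bits_per_node` is an illustrative encoding cost
    per tree node, not a value prescribed by the MDL literature."""
    return error_bits + bits_per_node * model_size_nodes

# Two candidate models: a small one leaving some data unexplained, and a
# large one that explains everything. MDL prefers the better trade-off.
small = mdl_fitness(error_bits=32.0, model_size_nodes=5)   # 52.0
large = mdl_fitness(error_bits=0.0, model_size_nodes=40)   # 160.0
print(min(small, large))  # 52.0 -> the small model wins here
```

The sketch makes the section's point concrete: such a score only guides search well when capability varies smoothly with complexity, which is precisely what fails for programs.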
[Zhang and Muhlenbein, 1995] used an adaptive parsimony strategy in a GP-neural net hybrid. In both cases GP manipulates tree structures corresponding to a hierarchical multiple regression model of variable complexity, decision trees, or sigma-pi neural networks, rather than programs. MDL-based fitness functions have been unsuccessful in the case of GP evolving pure program structures. Iba outlined that the MDL-based fitness measure can be applied to problems satisfying the "size-based performance" criterion [Iba et al., 1994], where the more the tree structure grows, the better its performance becomes. [Rosca and Ballard, 1994b] used the MDL principle to assess the suitability of an extension of GP with subroutines called adaptive representation (AR).

9 An alternative approach to program induction, based on an iterative deepening exhaustive search, is taken in ADATE [Olsson, 1995]. However, ADATE cannot hope to solve problems for which the complexity of a solution is large.

The most common approach to circumventing complexity-induced limitations in GP has been the use of a parsimonious fitness function. Parsimony imposes constraints on the complexity of learned solutions. However, the effects of such constraints in GP have not been elucidated. Parsimony pressure, if well designed, clearly improves the efficiency of search and the understandability of solutions. The quality, and in particular the generality, of solutions may also be improved in inductive problems. However, adding the right parsimony pressure has been more of an art. One example of avoiding this decision in an ad-hoc algorithm is "disassortative mating" [Ryan, 1994]. This GP algorithm selects parents for crossover from two different lists of individuals. One list of individuals is ranked based on fitness, while the other is ranked based on the sum of size and weighted fitness. The goal is to evolve solutions of minimal size that solve the problem. However, it was recognized that directly using the size constraint prevents the GP algorithm from finding solutions. The disassortative mating algorithm is reported to improve convergence to a better optimum while maintaining speed.

Related to the size problem (also called the bloating phenomenon), GP research has focused on the analysis of introns. Introns are pieces of code with no effect on the output. An analysis of introns goes hand in hand with an analysis of bloating.
[Nordin et al., 1995] tracked introns in an assembly-language GP system based on a linear (sequential) but variable-length program representation. The analysis suggested that the increase in size is a "defense against crossover." A similar conclusion is reached here in Theorem 1 (see Section 4.13). In the linear representation, the observed increase in the size of programs was attributed to introns. Based on experiments with controlled crossover or mutation rates within intron fragments, [Nordin et al., 1995] suggested that a representation which generates introns leads to better search effectiveness. Thus, introns
may have a positive role in GP search, protecting against destructive genetic operations. For hierarchical GP representations, [Rosca, 1996] showed that much of the size increase is due to ineffective code too. However, the role of introns has been disputed in the case of GP using tree representations [Andre and Teller, 1996]. For one thing, the overhead introduced by exponentially increasing tree sizes may offset any protective effects of introns. Tackett pointed out that bloating cannot be selection-neutral [Tackett, 1994]. He presented experiments suggesting that the average growth in size is proportional to the selection pressure. In our analysis, selection pressure itself is complexity dependent. Tackett also suggested that the larger programs selected by GP contain expressions which are inert overall (introns) but contain useful subexpressions, thus correlating bloating with hitchhiking.

Another suggestion for confining the increase in complexity is to employ modular GP extensions such as algorithms based on the evolution of the architecture [Koza, 1994b], heuristic extensions for the discovery of subroutines [Rosca and Ballard, 1994b; Rosca and Ballard, 1996a], or GP with architecture-modifying operations using code duplication [Koza, 1995]. Evolved modular programs theoretically have a lower descriptional complexity [Rosca and Ballard, 1994b], and also appear to present better generality [Rosca, 1996; Rosca and Ballard, 1996b].

The problem that evolved expressions tend to drift towards large and slow forms without necessarily improving the results was recognized in some excellent early work in GP, applied to the simulation of textures for use in computer graphics [Sims, 1991]. The solution devised was heuristic. Mutation frequencies were adjusted so that a decrease in complexity was slightly more probable than an increase. This did not prevent increases towards larger complexity, but more complex solutions were due to the selection of improvements.
It is not apparent how this was done. Interestingly, the solution to controlling complexity presented in Section 4.3.4 achieves exactly this effect and has a theoretical foundation.
5 Modularity in GP: The Adaptive Representation Approach

The previous two chapters have presented key ideas in interpreting GP processing and in understanding its characteristics and limitations. Starting with this chapter we address the second goal of the dissertation: extending the capabilities of GP. The first step in this direction is to present a GP extension called Adaptive Representation (AR). AR offers a heuristic solution to the problem of architecture discovery. It extends the search component of GP with a heuristic component that: (1) can learn good subexpressions from problem-solving traces; (2) can abstract subexpressions into subroutines; (3) can use subroutines to bias future search. Evolved solutions assume a modular and hierarchical organization.

First, we review the main modular approaches to program synthesis from the GP literature, along the criteria used for analyzing GP in the previous chapter. Then we propose the AR approach to automatic problem decomposition. Further improvements of this GP extension will be presented in the next chapter.
5.1 Review of modular approaches to genetic programming

The idea of using subroutines in genetic programming (GP) is drawn from the genetic algorithm (GA) building block hypothesis. Building blocks are relevant pieces of a partial solution that can be assembled together in order to generate better partial solutions to the problem at hand. Holland [1992] (see also [Goldberg, 1989]) hypothesized that GAs achieve their search capabilities by means of "block" processing. This led to several attempts to explicitly identify and use blocks in GA algorithms. For example, the messy genetic algorithm (mGA) [Goldberg et al., 1989] explicitly attempted to discover useful blocks of code guided by the string structure of individuals. The structure is apparent in the mGA representation, which takes the form of a string having each gene tagged with an index representing the gene's original position. After filtering useful blocks, mGA employs typical GA operations to combine those blocks. Perhaps owing to the purely structural nature of block definitions, the improvements of these experiments were somewhat modest, and in fact the building block hypothesis has not gained conclusive support in the GA literature so far. Nor is it clear how blocks can best be combined: recent GA experimental work disputes the usefulness of crossover as a means of communication of building blocks between the individuals of a population [Jones, 1995a].

In GP, [O'Reilly and Oppacher, 1995] made an analogy to the GA schemata theory. A major goal of that work was to understand whether GP problems have building block structure, but the results were also inconclusive. A structural approach is also at the basis of "constructional problems" [Tackett, 1995], i.e. problems in which the evolved trees are not semantically evaluated. Instead, program fitness is determined by matching a set of patterns against the program and adding up the predefined fitness of each pattern. By ignoring the semantic evaluation step, the analysis of constructional problems is not generalizable to typical GP problems.

GP presents a challenging picture due to the functional representation it generally uses. An analysis of block processing in GP has to rely on the function of blocks of code.
GP modularization approaches consider the effect of encapsulating and possibly generalizing blocks of code in order to create modules. Modules correspond to (parts of) evolved subexpressions, and will be defined more precisely below, according to the approaches various researchers have taken. The first approach to modularization in GP was the encapsulation operation introduced in [Koza, 1992]. Refinements or extensions of the encapsulation concept have
focused on different aspects of function definition. The main approaches to modularization discussed in the GP literature are essentially extensions of the standard GP engine. Three early approaches are automatically defined functions (ADF) [Koza, 1994b], module acquisition (MA) [Angeline, 1994b], and adaptive representation (AR) [Rosca and Ballard, 1994a]. Another approach, contrasted to ADFs, is automatically defined macros (ADMs) [Spector, 1996]. ADFs have been used in another extension to GP called cellular encoding [Gruau, 1994]. These approaches will be briefly described next, with the exception of AR and ARL, which are the main subjects of the rest of this chapter and the next chapter, respectively.
Encapsulation

The encapsulation operation, originally called "define building block," was viewed as a genetic operation that identifies a potentially useful subtree and gives it a name so that it can be referenced (as a function with zero arguments) and used later [Koza, 1992].
Automatically Defined Functions

Automatic definition of functions is an extension of the GP paradigm to cope with the automatic decomposition of a solution function [Koza, 1992]. In this approach individuals are represented by a set of subroutines, called automatically defined functions (ADFs), and a main function, called the result-producing branch. Each subroutine component has a fixed number of parameters. Each of the subroutine and main function components is defined based on its specific alphabet (function and terminal sets). The architecture of a program is defined by the number of subroutines, the number of arguments of each subroutine, and the nature of hierarchical references among the components. GP using the ADF-based representation of individuals (called ADF-GP henceforth) co-evolves representations for all these components implementing a program.

Let us take a program pattern with two automatically defined functions (ADF0 and ADF1) and a result-producing branch with one body. Then one distinguishes between terminal sets and function sets for ADF0, ADF1 and the program body. In the example presented in Figure 5.1, terminals from the initial terminal set are not included in the terminal sets for the function branches. The primitive function and terminal sets are defined such that the components form a fixed hierarchy. Genetic operations are constrained depending on the components on which they operate, a constraint called branch typing. For example, crossover can only be performed between components of the same type. Note that the hierarchy of components is fixed at the outset of running GP.
Figure 5.1: (a) An individual program with two automatically defined functions. It consists of three branches: ADF0, ADF1 and a result-producing branch with one body. Each branch has a set of arguments A (only for ADFs), a function set F and a terminal set T, which are established in the problem definition. (b) Hierarchy of components.
Also, note that subroutines are not shared between individual programs. Subroutines may have no clear meaning from the point of view of the problem solved; they may not correspond to specific subgoals related to the problem at hand. We do not know a priori what a subproblem is. Subroutines are not explicitly associated to problem subgoals even in the case when we do know what a problem subgoal is. Ultimately, the effort to tune the architecture may not be negligible.

Two main differences between ADF-GP and standard GP are the following. First, ADF-GP can develop much more complex programs. The virtual size of the program body, after an inline substitution of ADFs down to the basic primitives in the program body, can be very large (see the definition of the expanded structural complexity notion in Section 5.4.2). Second, ADF-GP is able to make larger jumps in the search space. For example,
a mutation in the lowest ADF level, ADF0, called in higher-level ADFs, radically changes the behavior of the body of the program. This may be a big disadvantage in late stages of evolution, when the algorithm tunes solutions. ADF-GP is theoretically more powerful than standard GP because it can evolve more complex solutions than would be allowed by the resource constraints of GP, such as expression size or depth. ADF-GP may or may not be more efficient depending on the application. The intuition is that ADF-GP may be more efficient, especially for problems with regularity patterns in their solutions. The advantage of the ADF approach is its generality. The greatest disadvantage is that it requires the specification of the architecture for decomposition in advance.
GLIB and Module Acquisition

The module acquisition (MA) approach is applied more generally to GP and EP [Angeline and Pollack, 1993; Angeline, 1994a]. In its GP implementation, called GLIB, pieces of code called modules can be randomly frozen from manipulation as a result of compress operations and are kept in a global genetic library. More precisely, a module is a piece of code obtained by randomly choosing a subtree and possibly randomly chopping off its branches to introduce arguments (see the left side of Figure 5.2). A module can be decompressed by an additional genetic operation, called expand, which has a complementary effect to compress. The genetic library passively preserves definitions of modules.

The compress and expand operators affect the size of evolved expressions. Thus, they may also positively affect the course of evolution. For example, consider the case when an evolved solution calls many modules, many times. The virtual size (i.e. equivalent GP size) can grow huge. It is harder to evolve equivalent big expressions with standard GP, which is confined to size or depth limits. If the application is such that only large solutions are acceptable, and moreover those solutions can be modular, then the chances that GP would be able to evolve a monolithic structure with these features are extremely small. The approach may help, although experimental evidence is scarce.
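The compress operation can be sketched on a tuple-based tree representation. The representation, the module-naming scheme, and the 0.5 chop probability are illustrative, not GLIB's actual implementation:

```python
import random

# Trees are nested tuples (function, child, ...); leaves are strings.
LIBRARY = {}   # the global "genetic library" of frozen module definitions

def compress(subtree, rng=random):
    """Sketch of a compress operation: freeze a chosen subtree as a
    module, randomly chopping off some of its branches so that they
    become formal arguments of the new module."""
    params, actuals, body_children = [], [], []
    for child in subtree[1:]:
        if rng.random() < 0.5:                 # chop this branch into an argument
            param = f"arg{len(params)}"
            params.append(param)
            actuals.append(child)              # chopped code becomes the actual argument
            body_children.append(param)
        else:
            body_children.append(child)        # this branch stays frozen inside the module
    name = f"Mod{len(LIBRARY)}"
    LIBRARY[name] = (params, (subtree[0], *body_children))
    return (name, *actuals)                    # call site that replaces the subtree

call = compress(("F1", ("F2", "T1", "T2"), "T3"), random.Random(1))
params, body = LIBRARY[call[0]]
print(len(call) - 1 == len(params))  # True: one actual per formal parameter
```

An expand operation would do the inverse: look the module up in `LIBRARY` and substitute the actuals back into the body at the call site.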
Compression protects what may be useful genetic material from the destructive effect of other genetic operators. It also helps in decreasing the average size of individuals in the population, while maintaining the same power. As a side effect, it may also generate a loss of diversity in the population, a problem presumably repaired by the expand operator (see Figure 5.2).
Figure 5.2: Additional GP operators in the module acquisition (GLIB) approach.

It is interesting that in [Angeline and Pollack, 1994] the authors talk about the worth of a module, but attribute to it a rather passive role. The module's worth is the number of times the module has been used since its birth, in subsequent generations. If a module is not frequently used, it means it is not viable in the competition with other individuals.
Automatically Defined Macros

The essential difference between a subroutine and a macro is in the way code is evaluated. For a subroutine, arguments are first evaluated and then the subroutine code is invoked on the actual argument values. For a macro, the macro definition is expanded with the argument definitions before any evaluation, and then the resulting code is evaluated. Thus, most often a subroutine and an identically defined macro will
produce entirely different results. The order of code execution is changed for a macro. Moreover, the code corresponding to some macro arguments may not be executed at all, if the macro definition contains lazy-evaluation functions, such as if. Whenever the primitives involved in the macro definition have side effects, the changed order of execution and the different code activation pattern will generate different results for a macro invocation than for a subroutine invocation.

Automatically defined macros [Spector, 1996] implement the idea of ADFs by working with macros instead of subroutines. The goal is to attempt an improvement of GP efficiency for problems with side-effecting primitives that control, for instance, an agent in a simulated world (Obstacle Avoiding Robot, Lawn Mower, etc.). One of the conclusions in [Spector, 1996] is that ADMs are likely to be useful in environments within which "context sensitive or side-effect-producing operators play important roles."
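The subroutine/macro distinction can be illustrated in Python by modeling macro arguments as unevaluated thunks. The `beep` logging primitive is a made-up side-effecting primitive used only to make the evaluation order observable:

```python
LOG = []

def beep(tag):
    LOG.append(tag)   # side effect: record that this piece of code ran
    return tag

def subroutine_if(cond, then_val, else_val):
    # by the time we get here, the caller has evaluated ALL arguments
    return then_val if cond else else_val

def macro_if(cond_thunk, then_thunk, else_thunk):
    # macro-style lazy `if`: only the taken branch is ever evaluated
    return then_thunk() if cond_thunk() else else_thunk()

LOG.clear()
subroutine_if(beep("yes") == "yes", beep("then"), beep("else"))
print(LOG)  # ['yes', 'then', 'else'] -- every argument was evaluated

LOG.clear()
macro_if(lambda: beep("yes") == "yes",
         lambda: beep("then"),
         lambda: beep("else"))
print(LOG)  # ['yes', 'then'] -- the else-branch code never ran
```

With side-effecting primitives, the two activation patterns produce different logs (and, in a GP agent, different behavior), which is exactly the difference ADMs exploit.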
Architecture evolution and code duplication

The ADF approach presents the disadvantage of working with a fixed program architecture. This problem was ingeniously addressed in [Koza, 1994a] using the biologically inspired idea of code duplication (see also [Koza, 1995]). The architecture of evolved programs can be modified by means of new operations for duplicating parts of the genome. Six new genetic operations were introduced for altering the architecture of an individual program: branch duplication, argument duplication, branch deletion, argument deletion, branch creation and argument creation. Duplication operations are performed such that they preserve the semantics of the resulting programs. They increase the potential for the refinement of the programs. The duplication of an element of the program architecture (a branch or an argument) is done in conjunction with a random replacement of the invocations of the corresponding element by invocations of the duplicated copy. Such an operation decreases the probability that a future random change will drastically change the behavior of the program. A similar conclusion can be drawn for the creation operations. The deletion operations do not possess the nice properties mentioned above. They have the antagonistic role of confining the increase in size of the evolved programs.
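Semantics-preserving branch duplication can be sketched as follows. The dict-of-trees representation and the names are illustrative of the idea, not Koza's implementation:

```python
import random

def duplicate_branch(program, old, new, rng=random):
    """Sketch of branch duplication: copy the definition of branch `old`
    under the fresh name `new`, then flip each call site to `old`
    independently at random to call `new` instead. Because both names
    initially share one definition, program behavior is unchanged.
    `program` maps branch names to nested-tuple trees."""
    program = dict(program)
    program[new] = program[old]                # identical copy of the definition
    def rewrite(tree):
        if isinstance(tree, str):
            return tree
        head = tree[0]
        if head == old and rng.random() < 0.5: # random reassignment of this call
            head = new
        return (head, *map(rewrite, tree[1:]))
    program["BODY"] = rewrite(program["BODY"])
    return program

prog = {"ADF0": ("AND", "Arg0", "Arg1"),
        "BODY": ("OR", ("ADF0", "D0", "D1"), ("ADF0", "D1", "D2"))}
dup = duplicate_branch(prog, "ADF0", "ADF0b", random.Random(0))
print(dup["ADF0b"] == dup["ADF0"])  # True: the copy starts out identical
```

After duplication the two copies can diverge under mutation, and any single change touches fewer call sites, which is why a future random change is less likely to alter behavior drastically.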
Duplication increases the chances of survival of pieces of code, and thus virtually protects evolved code against the destructive effects of genetic operations. It also increases the likelihood of maintaining higher diversity in the population. Therefore, code duplication transformations seem to be useful in general. The point, though, is that they also modify the number of arguments of a function, or the number of subroutines, therefore altering the architecture of solutions. Decomposition is a result of the evolutionary process.
Morphogenesis

In standard GP, the genotype is a program which is directly interpreted in order to generate some behavior or a fitness value. Another approach is to first perform a transcription or development of the genotype into another structure that implements a model (viewed as the phenotype), and then perform fitness evaluation on the resulting model. Transcription can be controlled by the primitives of the application.

Gruau considered a language for transcription, called cellular encoding, that specifies the transformation of "neural cells." Primitives in the language affect the cell and its interconnections with neighboring cells. For example, a cell with inputs and outputs can divide in series into two cells, the first inheriting all mother-cell inputs and the second inheriting all the mother-cell outputs. Other primitives include parallel division, increasing or decreasing weights of connections, removing connections, modifying threshold parameters in the cell, stopping development, etc. [Gruau, 1994]. Development starts with a mother cell. However, after each operation resulting in a cell division (such as serial or parallel division above), each daughter cell continues its own development path. Full development of an embryonic cell into a neural network can be specified using a tree structure labeled with the cellular encoding primitives. This process resembles biological development, where after cell division each resulting cell has its own copy of the mother cell's chromosome. Once created, the neural network can be used for data interpretation or prediction.
Its fitness is actually determined by how good a model it is for the application task. Now, instead of optimizing over neural network architectures, the algorithm optimizes over genotypic encodings, i.e. cellular development programs. This is done using GP search on the space of tree structures representing development programs.

Normally, during program interpretation, functions require that all arguments be evaluated first to find the actual argument values. Then the function can be applied. With lazy functions, only the needed parameters are evaluated. In both cases execution proceeds bottom-up. In contrast, the evaluation of cellular encodings proceeds top-down and in parallel.

One desirable property of the encoding is locality. A change of a subtree in a development tree will only affect some local part of the grown neural network. A module of the subtree corresponds to a module of the neural network. With this interpretation, the effect of genetic operations can be visualized as local changes in the neural network. This makes it easy to implement the development of modular neural networks. The cellular encoding only has to include primitives that reiterate interpretations of some module from the beginning, by means of jump or recurse primitives. In particular, the development program can contain ADF-like components. Development of a cell continues with the program defined by the ADF when the name of the ADF is executed during development. The execution of the same ADF for two different cells in the growing neural network will determine similar structures to appear in two different parts of the network.

The idea has been proven feasible on applications ranging from generating families of neural networks for computing parity on a generic number of inputs, to the development of the control system of the various motor subsystems (such as legs) of an animat.
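The serial-division primitive described above can be sketched on a minimal graph model. This is a hypothetical toy model: real cellular encoding also tracks connection weights, thresholds, and the per-cell reading head in the development tree:

```python
def serial_divide(net, cell):
    """Sketch of serial division: `cell` splits into itself plus a fresh
    child cell; the original keeps all inputs, the child inherits all
    outputs, and a single new link connects the two. `net` maps each cell
    name to the list of cells it feeds into."""
    child = cell + "'"
    net = {c: list(targets) for c, targets in net.items()}  # copy the network
    net[child] = net[cell]   # child inherits the mother cell's outputs
    net[cell] = [child]      # mother now feeds only the child
    return net

# a three-node chain: input cell -> developing cell "c" -> output cell
net = serial_divide({"in": ["c"], "c": ["out"], "out": []}, "c")
print(net["c"], net["c'"])  # ["c'"] ['out']
```

Because only the divided cell and its new child change, the operation is local, which is the property that makes genetic changes to a development tree correspond to local changes in the grown network.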
Furthermore, phenotypes can be exposed to learning in an environment for fitness determination, thus allowing for the study of the interaction between learning and evolution [Hinton and Nowlan, 1987; Gruau and Whitley, 1993].
Summary

A performance comparison of ADF and module acquisition (MA), as well as other variations of the two methods, is presented in [Kinnear Jr., 1994]. ADF consistently shows better performance. This is attributed to the repeated use of calls to automatically defined functions and to the multiple use of formal parameters in ADFs.

In the above methods, selection of programs to participate in reproduction operations is fitness-proportional. However, in ADF and MA the selection of blocks of code within programs is purely random or "uninformed." Uniform random changes at all levels may determine a loss of beneficial evolutionary changes. For instance, ADF samples the space of subroutines by modifying automatically defined functions at randomly chosen crossover points. Similarly, MA randomly selects a subtree from an individual and randomly chops its branches to define a module, thereby deciding which genetic material is frozen. All points of a tree, either active (i.e. effective during evaluation) or inactive (i.e. introns; for a discussion of introns in GP see [Nordin et al., 1995]), are equally likely to be the source of a compress operation.

Random changes will not be an efficient strategy if the bottom-up evolution hypothesis [Rosca and Ballard, 1995] holds. This theory conjectures that ADF subroutine representations become stable in a bottom-up fashion. Early in the process, changes are focused towards the evolution of low-level subroutines. Later in the process, changes are focused towards the evolution of program control structures, that is, structures at higher levels in the hierarchy of subroutines [Rosca and Ballard, 1995]. For this reason we have focused on heuristic or adaptive measures to guide the focus of attention during search for the creation and modification of subroutines.1

For problems with symmetries and patterns of regularity, modularity should bring a number of advantages, such as increased search efficiency and easier scale-up behavior [Koza, 1994b; Koza, 1995].
Later, we will examine how modularity helps decouple independent parts of problems, thus facilitating problem decomposition. Also, modularity facilitates reuse of code as an alternative to repeatedly evolving the same code fragments multiple times. Had such repeated use of code been necessary in the design of a solution, then a GP algorithm extended with modularity mechanisms could be considerably more efficient than standard GP.

1 See also Chapters 3, 4, and 5, authored by Teller, Angeline, Iba and de Garis, in [Angeline and Kinnear, Jr., 1996].

An attempt to explain the course of evolution in GP based on an understanding of what building blocks are appears in [Tackett, 1993]. The idea that frequent subtrees in one individual correspond to synthesized features suggests the conclusion that those subtrees comprise "building blocks".
5.2 Characteristics and biases of modular GP

This section contrasts characteristics of the ADF-GP modular representation with the standard GP representation. The discussion follows the main ideas presented in Chapter 4: distribution of fitness values of random representations, transformation of representations, complexity of evolved representations, and statistical dynamics. Some of the arguments support the decision to use a modular representation, while others do not. New concepts needed to characterize the complexity of evolved structures for modular representations will be introduced in Section 5.4.
5.2.1 Biases in the random generation of expressions

In the ADF representation, randomly created ADFs are equivalent to random functions over the corresponding input variables. The main program can invoke the random ADF functions, besides the primitives in F. Thus we expect to see a distribution of fitness values close to the case of an enlarged function set, as presented in Section 4.1, Figure 4.1 (c) and (d). Experimental results confirm this hypothesis.

Figure 5.3 shows an extended analysis for ADF-GP solutions to the parity problem. The distribution of the number of hits is observed similarly to Section 4.1 for several alternatives in the choice of the function set. Thus, for ADF-2 (Figure 5.3 (a)), the function set used in all modules is F0 (defined
129 in relation 3.3). For ADF-2-8 the function set includes four additional Boolean functions of two variables, besides the primitives in F0, while for ADF-2-16 it includes all sixteen Boolean functions of two variables. The analysis outlines two conclusions. First, the distribution for ADF-2 is wider than the distribution for GP (see Figure 4.1). This indicates the better potential for ADF-GP to create and maintain increased diversity which can be exploited by genetic search (see also [Koza, 1994b]). Second, the distributions for ADF-2-8 and ADF-2-16 are even wider. More subroutines in the representation positively aect the diversity of behaviors in a random population of programs. However, if there are too many useless subroutines the positive eect is limited. Fortunately, selection will also in uence what subroutines and subexpressions, in general can be found in the population at any given time. 100,000 (a) ADF-2
Figure 5.3: Probability mass function of the random variable X representing the number of hits for ADF-GP even-5-parity functions. The ADF-GP architecture uses two ADFs of two arguments each. The function sets of the main program and of the ADFs contain respectively four (a), eight (b), and sixteen (c) distinct Boolean functions of two variables. The function sets necessarily contain the four primitives of two variables: and, or, nand, nor.
These results support the idea that subroutines can lead to a wider diversity of behaviors on which selection can act, leading to a more effective search.
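The kind of measurement behind distributions like those of Figure 5.3 can be reproduced in miniature. The sketch below is an illustration, not the thesis code: the tree depth, leaf probability, and sample count are arbitrary choices. It samples random expression trees over the primitives of F0 and histograms their hits against even-5-parity:

```python
import random

FUNCS = {
    "and":  lambda a, b: a and b,
    "or":   lambda a, b: a or b,
    "nand": lambda a, b: not (a and b),
    "nor":  lambda a, b: not (a or b),
}
N_VARS = 5  # even-5-parity

def random_tree(depth):
    """Grow a random expression tree over d0..d4 and the primitives of F0."""
    if depth == 0 or random.random() < 0.3:
        return random.randrange(N_VARS)            # leaf: a variable index
    op = random.choice(sorted(FUNCS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, case):
    if isinstance(tree, int):
        return case[tree]
    op, left, right = tree
    return FUNCS[op](evaluate(left, case), evaluate(right, case))

def hits(tree):
    """Fitness cases (out of 32) on which the tree agrees with even parity."""
    count = 0
    for x in range(2 ** N_VARS):
        case = [bool((x >> i) & 1) for i in range(N_VARS)]
        if evaluate(tree, case) == (sum(case) % 2 == 0):
            count += 1
    return count

random.seed(0)
histogram = {}
for _ in range(10000):
    h = hits(random_tree(4))
    histogram[h] = histogram.get(h, 0) + 1
```

The wider distributions observed for ADF-2-8 and ADF-2-16 would correspond to repeating this sampling with a richer function set.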
5.2.2 Transformational biases

ADF-GP has to discover the definitions of the ADFs and the main program body, and also has to find a good composition of ADFs and primitives (one that solves the problem). This corresponds roughly to discovering a way to decompose the problem and solving the subproblems, given only the maximum number of subproblems and the general structure of the subproblems (i.e. the number of parameters and the subproblem "alphabets"). Due to the imposed ordering of ADFs we can consider each ADF as a different structural level. The ADF approach simultaneously attacks the search problem at different structural levels. During GP search, modifications are alternately made at each of the structural levels. A code fragment brought from another individual changes its function entirely if it contains calls to ADFs. For example, consider a piece of code with calls to lower-order ADFs that is pasted into a higher-order function or the main body as a result of a crossover operation, and suppose the definitions of the ADFs in the two parents are entirely different. Lexical scope dictates the definition to be used when invoking a sub-function, so the calls to ADFs from the transplanted piece of code will refer to the definition of a totally different function from the new lexical scope. This quite frequent situation is depicted in Figure 5.4 and demonstrates the non-causality of ADF-GP. The non-causality property of ADF-GP is in total opposition to the principle of strong causality previously stated in Chapter 4. It is useful to visualize how the search for a solution may generally proceed in ADF-GP. Each of the ADF functions represents a different sub-function. Consider the last modification imposed on a program tree before it becomes an acceptable solution. It is very unlikely, but not impossible, that this last change was a change with a large influence, for example a change in one of the functions at the base of the hierarchy. Such a situation represents a lucky change.
Most probably, though, it was a change at the
Figure 5.4: The non-causality of ADF-GP: definitions of ADFs are local. Thus, a fragment of code copied from donor parent 1 into receiving parent 2 will be evaluated in the new lexical environment of parent 2.
highest level, in the program body. We conjecture the following general principle: as evolution progresses, changes respecting the principle of strong causality become more important and should be supported by the representation and processing primitives. In other words, as better and better individuals are found, selection most often favors small, causal changes, which have the biggest chance of turning out successful. The effect of this principle is a stabilization of useful lower-level ADFs. Evolution will freeze good subroutines and will eventually find beneficial changes at higher levels.
5.2.3 Bottom-up evolution

In order to test the hypothesis that causality plays an increasing role as evolved individuals become more complex and fit, we have studied the most recent part of the genealogy tree for even-n-parity problems. This was done by giving each individual a birth certificate that specifies its parents and the method of birth, corresponding to the branch type where crossover was performed (one of ADF0, ADF1, or the main program body), or to reproduction. In this experiment the mutation rate was zero. We hoped that an analysis of the birth certificates, starting
with the final solution and tracing its origin backwards, would shed light on the GP dynamics. In order to determine the effect of the different types of birth operations, we have computed a temporally discounted frequency factor bcf for a given solution tree T and a type of birth:
bcf(T, type, d) = \frac{1-\gamma}{1-\gamma^{d}} \sum_{i=0}^{k_T} \gamma^{depth(T_i)} \, \chi_{\{type\}}(T_i)
where k_T is the number of programs in the genealogy tree of T down to depth d, and \chi_{\{type\}}(T_i) is the characteristic function of ancestor T_i of T, returning 1 if T_i has a birth certificate of birth-type type and 0 otherwise. Table 5.1 presents the results for several successful runs of ADF-GP for even-5-parity, with two ADFs of three arguments each. These results show that ADF-GP search relies in most cases on changes at higher and higher structural levels, which make it possible to exploit good code fragments that appear in the population.

Table 5.1: Statistics of birth certificates in successful runs of even-5-parity using ADF-GP with a zero mutation rate and a population size of 4000. Each certificate of a given type counts one unit and is temporally discounted with a discount factor gamma = 0.8 based on its age. Only certificates at most 8 generations old have been considered. The last line shows the averages of the bcf values of the three types.

    Run   GP Programs Explored   ADF0    ADF1    Body    Final Gen.
    1     123,009                0.295   0.0     0.704   32
    2     110,892                0.221   0.472   0.416   32
    3     62,699                 0.077   0.526   0.397   17
    4     35,162                 0.447   0.102   0.451   9
    5     55,748                 0.1     0.214   0.685   15
    6     55,438                 0.093   0.202   0.704   15
    Avg.                         0.205   0.252   0.559   -
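Under this reconstruction of the bcf formula (normalized geometric discounting with gamma = 0.8 over ancestors at most 8 generations old), the statistic can be sketched as follows. The genealogy fragment is hypothetical:

```python
def bcf(ancestors, birth_type, gamma=0.8, max_age=8):
    """
    Temporally discounted frequency of a birth-certificate type among the
    ancestors of a solution tree. `ancestors` is a list of (age, type)
    pairs, where age is the generation distance from the solution and type
    is one of "ADF0", "ADF1", "body", "reproduction". The normalization
    (1 - gamma) / (1 - gamma**max_age) follows the reconstructed formula.
    """
    norm = (1 - gamma) / (1 - gamma ** max_age)
    return norm * sum(gamma ** age
                      for age, t in ancestors
                      if t == birth_type and age <= max_age)

# Hypothetical genealogy fragment: (age, birth type) pairs.
genealogy = [(0, "body"), (1, "body"), (1, "ADF1"), (2, "ADF0"),
             (2, "body"), (3, "reproduction")]
scores = {t: bcf(genealogy, t) for t in ("ADF0", "ADF1", "body")}
```

In this fragment, changes to the main body dominate the recent genealogy, mirroring the pattern reported in Table 5.1.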
The above numerical results have taken into account only a small time window compared to the entire number of generations. A more detailed picture of the importance
of various types of crossover during the entire GP evolution is based on a complete analysis of birth certificates, going back to the initial generation. Such an analysis is depicted in Figure 5.5 for a typical case. The overlapping distribution trends of birth certificates suggest both the overall importance of a birth certificate type and its trend over the entire evolution period, from generation 0 until the solution is found.
Figure 5.5: Distribution trend of the percentage of birth certificate types over generations, while looking for a solution to even-5-parity that was found in generation 15. "Random" indicates the propagation of random individuals from the initial population due to reproduction.
The stabilization of changes in the hierarchy occurs bottom-up. Crossover changes in automatically discovered functions are highly non-causal and are performed in early generations. This is in sharp contrast with changes in the main program body. Those changes are mostly changes at low tree heights (as discussed in Section 3.4.1) and are performed in the late generations. An interesting point is that in very early generations the most frequent genetic operations in the genealogy tree were reproduction operations (see Figure 5.5). The results presented confirm that, as the population evolves, increasingly causal changes (i.e. changes in the main body) become more important and are selected. In conclusion, the pre-imposed hierarchical ordering among ADFs biases search in the space of programs. The resulting bias is expressed by the bottom-up evolution
hypothesis, which conjectures that ADF representations become stable in a bottom-up fashion. Early in the process, changes are focused towards the evolution of low-level functions. Later, changes are focused towards higher levels in the hierarchy of functions.
5.2.4 Statistical dynamics

In Chapter 4, Section 4.6 we described interpretations of the entropy of a system. One such interpretation, when the system is represented by a population of individuals, is that entropy offers a measure of the diversity of behaviors in the population. We traced the variation of entropy and the fitness histograms over the time of evolution, and this allowed us to interpret how evolution progressed. Here we do the same analysis for ADF-GP. Figure 5.6 presents the fitness distributions for a run of ADF-GP on the even-5-parity problem. Figure 5.7 shows the variation in entropy over a run of ADF-GP on the same problem. The most obvious difference between Figure 5.6 and Figure 5.7 is the increased exploration of the search space [Rosca, 1995b] in ADF-GP. In the first case, the use of subroutines positively affects the efficiency of GP. Entropy has the tendency to decrease as the system becomes more "organized", i.e. converges and does not discover better solutions due to loss of diversity. This can be seen in Figure 5.7. Entropy steeply increases for about twelve generations, correlated with the initial increase in the number of hits of the best-of-generation individual. After that period, entropy starts to decrease until a new best-of-generation individual is discovered. After that, a new stable regime is reached, and entropy further decreases.
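The entropy measure used in this analysis can be sketched as the Shannon entropy of a binned fitness histogram. The bin count and fitness range below are illustrative assumptions, not the thesis's exact settings:

```python
from collections import Counter
from math import log2

def population_entropy(fitness_values, num_bins=16, max_fitness=32):
    """
    Shannon entropy of the fitness histogram, a proxy for the diversity of
    behaviors in the population (the Section 4.6 interpretation).
    """
    bin_width = max_fitness / num_bins
    bins = Counter(min(int(f / bin_width), num_bins - 1)
                   for f in fitness_values)
    n = len(fitness_values)
    return -sum((c / n) * log2(c / n) for c in bins.values())

# A spread-out population (high diversity) vs. a converged one.
uniform = population_entropy([i % 32 for i in range(320)])
converged = population_entropy([16.0] * 320)
```

A fully converged population occupies a single bin and has zero entropy, while a population spread uniformly over all bins attains the maximum log2(num_bins).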
5.3 Expanding the function set: a formal view

The results concerning the bias in the random generation of expressions from Sections 3.4.1 and 5.2 suggest the following formal view of an enlarged function set. Consider the standard GP procedure operating on the language of expressions created from problem primitives P, where P = T ∪ F. The primitives, terminals T and functions F, comprise methods for accessing and modifying domain-dependent information (possibly state information), and various processing primitives (both domain
Figure 5.6: Fitness distributions over a run of ADF-GP on the even-5-parity problem.
(Curves shown in Figure 5.7: Hits, AvgFitness, and Entropy, plotted against generations.)
Figure 5.7: Entropy variation over a run of ADF-GP on the even-5-parity problem.
dependent and independent). The closure requirement for full generality of genetic operators is that any function call be well defined for any combination of arguments that it may encounter (primitives or other subexpressions). Suppose that all subexpressions in the language return a value from a domain D (for example, D = R). Define F_total to be the set of function compositions over D in terms of the elements of F, actually mapping the input space onto D. Every sub-expression of an expression evolved by GP implements some function which belongs to F_total. Over generations, the change of subexpressions corresponding to any fixed point in a surviving tree is equivalent to a dynamic sampling of functions from F_total. A GP run usually converges after some time, i.e. the population does not appear to improve any longer. This means that the expressions used in the population represent some small attractor subset of F_total. The problem of induction of a solution can alternatively be formulated as the problem of determining a subset of F_total that supplies all the information needed to easily assemble a solution. This formulation is not practical or constructive, but it offers fertile conceptual ground. One approach to simplifying the induction problem is to provide GP with a large set of primitives P developed by hand from a much smaller set of bare primitives. The idea is that the primitives have some relevance to the application. For example, Tackett uses this approach in an application of GP to feature discovery in images. The set of GP primitives includes bare primitives such as area and range, but also compositions of these, such as area range2. If the set of primitives is much larger than needed, GP has to select the right primitives instead of evolving them. Naturally, F ⊆ F_total. It may be difficult to manually determine an appropriate extension E of the function set (F ⊆ E ⊆ F_total) necessary to solve a given problem.
Also, it may be unrealistic to consider huge primitive sets. The question is whether a set of useful subexpressions could be automatically determined while solving the problem. Note, however, that some sub-expressions may depend on the context of evaluation, and in their turn may have side-effects on the state of the simulation used for fitness evaluation. Side-effecting depends on the nature of the primitives used. Nonetheless, certain subexpressions may still turn out to be helpful.
The GA building block hypothesis (BBH) [Goldberg, 1989] is one additional motivation for automatically trying to detect useful subexpressions. The GP schema definition by Koza ([Koza, 1992], page 117) suggests the intuitive idea that subtrees may play the role of functional features. We conjecture that good features may be functionally combined to create good representations [Rosca and Ballard, 1996a].
5.4 Complexity measures for modular evolved expressions

This section proposes a theoretical basis for analyzing the size of modular structures. It applies to any system that can use subroutines. These may be part of the representation (ADF-GP), or may be created dynamically, as will be the case with the AR algorithm. I define the notions of structural complexity, evaluation complexity, expanded structural complexity, and stochastic complexity. The first measure gives an idea of the amount of memory used; the second is an upper bound on the number of primitives evaluated during execution; the third is a measure of the virtual size of an individual that can be directly compared with the structural complexity of standard GP solutions. Finally, stochastic complexity offers both a justification for biasing search towards decompositions into small subroutines and a possible way to extend the fitness function for inductive problems.
5.4.1 Structural and evaluation complexity

Suppose an individual is represented by a program which calls discovered functions, which may in turn call other discovered functions. Nonetheless, the call graph based on the caller-callee relation has no cycles. Let Size(F) be the number of nodes in the tree representing a program F. Let F_0 be the program tree representing an individual T_0 which contains direct or indirect calls to F_1, F_2, ..., F_m. We define structural complexity SC(F_0) and evaluational complexity EC(F_i) as follows:
SC(F_0) = \sum_{0 \le j \le m} Size(F_j)    (5.1)

EC(F_i) = Size(F_i) + \sum_{j \in J_i} EC(F_j) \cdot |Calls(F_i, F_j)|    (5.2)
where J_i = { j | i < j ≤ m and F_i calls F_j } and |Calls(F_i, F_j)| is the number of times F_i calls F_j. In standard GP, where no functions other than the primitive ones are used, the structural and evaluational complexities are equal to the program size Size(F). Assuming that functions from the initial function set are executed in unit time, the evaluational complexity shows how many time units it takes to execute an individual program.
5.4.2 Expanded structural complexity

A true measure of the virtual size of a modular individual, if we had to build it from primitive functions, is obtained by counting all the nodes in the tree resulting from an "inline" expansion of all called functions down to the primitive functions. This complexity measure is called expanded structural complexity. It is computed from the structural complexity (i.e. the number of tree nodes) of all the functions in the hierarchy which are called directly or indirectly in the main program body of the individual. The expanded structural complexity of a program F, denoted IC(F), can be computed in a bottom-up manner starting with the lowest functions in the call graph of F. For each subfunction G, called directly or indirectly by F, IC(G) can be defined using a recursive formula (see Appendix C). Note that EC(F) differs from the expanded structural complexity. Expanded structural complexity corresponds to the notion of circuit size complexity from complexity theory. The following inequalities hold between the introduced complexity measures:
Size(F) ≤ SC(F) ≤ EC(F),    Size(F) ≤ Expanded-SC(F)
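Definitions 5.1 and 5.2 and the inline-expansion idea translate directly into code. The sketch below is illustrative, over tuple-encoded trees; its expansion replaces each call node by the subroutine body plus the expanded argument subtrees, ignoring the parameter substitution that the full definition in Appendix C handles:

```python
def size(tree):
    """Size(F): number of nodes in a tree encoded as (op, child, ...) or a leaf."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(c) for c in tree[1:])

def calls(tree, name):
    """|Calls(F_i, F_j)|: number of call sites of subroutine `name` in `tree`."""
    if not isinstance(tree, tuple):
        return 0
    return (tree[0] == name) + sum(calls(c, name) for c in tree[1:])

def sc(main, defs):
    """SC (5.1): size of the main body plus sizes of all subroutine bodies."""
    return size(main) + sum(size(body) for body in defs.values())

def ec(tree, defs):
    """EC (5.2): Size(F_i) + sum of EC(F_j) * |Calls(F_i, F_j)|; acyclic calls."""
    total = size(tree)
    for name, body in defs.items():
        n = calls(tree, name)
        if n:
            total += n * ec(body, defs)
    return total

def expanded_sc(tree, defs):
    """Approximate expanded structural complexity after inlining every call."""
    if not isinstance(tree, tuple):
        return 1
    if tree[0] in defs:
        return expanded_sc(defs[tree[0]], defs) + \
               sum(expanded_sc(c, defs) for c in tree[1:])
    return 1 + sum(expanded_sc(c, defs) for c in tree[1:])

# Hypothetical individual: a main body calling one discovered function F1.
defs = {"F1": ("and", "x", ("or", "x", "y"))}     # Size = 5
main = ("F1", ("nor", "a", "b"), "c")             # Size = 5
```

For this individual, Size(main) = 5 and SC = 10, illustrating how the measures diverge once subroutines are present.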
5.4.3 Minimum descriptional complexity

We can view the problem of determining a program that explains a set of examples or optimizes a fitness function as one of hypothesis formation: we look for the best program that explains the data. Rissanen's minimum description length (MDL) principle offers an approach to the problem. It states that the best theory to explain a set of data is the one which minimizes the length of the data description together with the hypothesis description. In general, problems such as the inference of a decision tree that best explains a set of examples [Quinlan and Rivest, 1989], the construction of a finite automaton, or the inference of a Boolean function that satisfies a set of constraints all match the described pattern and can be approached using the MDL principle [Li and Vitanyi, 1993]. MDL is also called stochastic complexity. The MDL principle advocates a hierarchical representation of evolved programs (see Appendix B). Moreover, by biasing discovery of subroutines towards small sizes we also bias search towards solutions with smaller descriptional complexity. If a hierarchical organization is discovered, the size of individuals and discovered functions is kept within reasonable bounds while the structural complexity of individuals can be much bigger. Moreover, the descriptional complexity could be used as a measure of the fitness of an individual T, driving GP towards discovering solutions of smaller descriptional complexity [Iba et al., 1994]. It balances the requirement of getting a simple description of a solution tree T against the requirement of minimizing the number of misses. The mechanism for adapting the representation while searching for solutions represents a natural way to generate a hierarchy of more and more complex functions (a hierarchical representation) and to make possible the discovery of a solution of small (or even minimal) description length.
Unfortunately, MDL may not work well with GP in general. Informed approaches to including a parsimony component, such as the MDL principle, implicitly expect that the capability of the learned model varies smoothly with its complexity. This is not true for genetic programming: a small change in a program can entirely destroy its performance. However, with parsimony pressure individuals would tend to take on a hierarchical
organization, because this is a way of working towards achieving a minimum descriptional complexity. We can achieve good results with a parsimony component designed as discussed in Section 4.3.3.
5.5 Adaptive representation

The ADF-GP modular representation presents several advantages: modularization, reuse, and increased diversity, which are likely to improve the performance of the GP engine for applications where a modular decomposition simplifies the problem. Unfortunately, it suffers from the problem of non-causality. Much effort is spent trying to evolve ADF expressions while totally changing the behavior of programs, and not exploiting accumulated changes in the other program components. Another problem is the requirement to predefine the architecture of solutions. In this section we present a new approach that relies on the idea of using subroutines, but changes the strategy for subroutine discovery for problem decomposition. The approach is called adaptive representation (AR).
5.5.1 The AR algorithm

The central idea of an adaptive representation system is to find and use subroutines based on measures of their function. Reusing good building blocks has obvious advantages in terms of economizing the search process. A larger set of functions positively affects the fitness distribution of programs created through initialization or genetic operators. Thus the use of subroutines focuses search in the space of programs. This idea is implemented in a simple form in the adaptive representation approach. AR uses GP to search for good individuals (representations), while adapting the architecture (representation system) through subroutine invention, to facilitate the creation of better representations. These two activities are performed on two distinct tiers (see Figure 5.8). GP search acts at the bottom tier. Due to the fitness-proportionate selection mechanism of GP, more fit program structures pass their substructures to offspring. At
the top tier, the subroutine discovery algorithm selects, generalizes, and preserves good substructures. Discovered subroutines are reflected back in programs from the memory (the current population) and thus adapt the architecture of the population of programs.
Figure 5.8: Two-tier architecture of the adaptive representation algorithm.

The subroutine discovery algorithm creates new subroutines that extend the problem representation, as a result of three steps (see Figure 5.9 for a more formal description):

1. Identify useful blocks of code that appear as a result of genetic operations. Either an informed or a heuristic technique can be employed in specifying what could be useful blocks of code.

2. Generalize the blocks that withstand the selection criterion above using inductive generalization [Michalski, 1983]. The result is a set of new subroutines which extend the current function set.

3. Create a number of random individuals from the extended function set and replace low-fitness individuals (thus exploiting the newly created functions).

The critical problem in AR is the evaluation of the usefulness of a block of code (the first step above). Evaluation should be based on additional domain knowledge whenever such knowledge is available. However, domain-independent methods are more desirable for this goal. The evaluation of subexpressions will be explored in this chapter by means
Adapt-representation(P_i, F_i, F_i+1)
  1. Discover candidate building blocks BB_i by evaluating each block's merit in P_i;
  2. Prune the set of candidates BB_i;
  3. For each block b in the candidate set BB_i, repeat:
     (a) Determine the terminal subset T_b used in the block b;
     (b) Create a new function f having as parameters the subset of terminals T_b and as body the block b;
     (c) Extend the function set F_i+1 with the new function f.
Figure 5.9: Subroutine discovery in the adaptive representation algorithm.

of user-supplied criteria called block fitness functions, and using domain-independent methods in the next chapter.
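The discovery step of Figure 5.9 can be sketched as follows. The names, the threshold-based pruning rule, and the block encoding are illustrative assumptions, and step 3 of the outline above (seeding the population with random individuals over the extended set) is omitted:

```python
def terminals_of(tree):
    """Set of terminal symbols appearing in a block encoded as (op, child, ...)."""
    if not isinstance(tree, tuple):
        return {tree}
    result = set()
    for child in tree[1:]:
        result |= terminals_of(child)
    return result

def adapt_representation(function_set, candidate_blocks, block_fitness, threshold):
    """One subroutine-discovery epoch: prune weak candidates (step 2), then
    generalize each surviving block into a new function whose parameters are
    the terminals it uses (steps 3a-3c)."""
    extended = list(function_set)
    for block in candidate_blocks:
        if block_fitness(block) < threshold:
            continue                              # pruned candidate
        params = sorted(terminals_of(block))      # step (a): used terminals
        extended.append((params, block))          # steps (b) and (c)
    return extended

# Hypothetical candidates with an assumed block fitness assignment.
candidates = [("and", "d0", "d1"), ("or", "d0", "d0")]
fitness = {("and", "d0", "d1"): 8, ("or", "d0", "d0"): 2}
new_set = adapt_representation(["and", "or", "nand", "nor"], candidates,
                               fitness.get, threshold=5)
```

Only the fit block survives pruning; the useless identity block (or d0 d0) is discarded rather than promoted to a subroutine.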
5.5.2 Frequent and fit candidate building blocks

There exist two obvious choices for determining the usefulness of blocks: frequent blocks and fit blocks. The block with the highest usefulness becomes a candidate. Frequent blocks can be determined by keeping track of how often a block appears in the entire population. Surprisingly, frequent blocks are not necessarily useful building blocks. By analogy with the GA schema theorem, a good building block spreads rapidly in the population, and this determines a high frequency count for the block. For example, in the even parity problem, if we augment the function set with the exclusive-or function (xor), then xor (a building block for computing the parity of 2 input bits) will soon become dominant in fit program trees. The problem is that the converse of the GA schema theorem is not true. Poor blocks, i.e. blocks that are identities or add no functionality, may be frequent and should not be considered as candidates, although they may have a role in preserving recessive features (they are introns in [Angeline, 1994b]).
Blocks that appear in a final solution may be discovered very late in the search process. They are not necessarily responsible for the evolution process, usually having a low frequency count. Frequent blocks in early generations may become rare in late generations. Similarly, simply considering the frequency of a block in an individual [Tackett, 1993], the block's constructional or schema fitness [Altenberg, 1994], or conditional expected fitness [Tackett, 1995] is not sufficient. The above arguments are supported by experimental evidence obtained by monitoring frequent blocks in the population. This indicated the unsuitability of the criterion for estimating block usefulness. A much better choice for discovering building blocks is to consider fit blocks. We can incrementally check for new fit blocks instead of relying on expensive statistics over the population. Block evaluation is done with one or more block fitness functions based on supplementary domain knowledge. Block fitness functions are supplied in the definition of the GP problem. Each of the block fitness functions exerts "environmental pressure" for the selection of viable blocks. In a co-evolutionary framework such pressure could come from co-evolving species [Hillis, 1990]. Several other methods can be used to evaluate the fitness of a block. First, one can use the program fitness function to evaluate blocks, or can compute the correlation between the program output value and the subexpression value. This has the advantage of requiring no more domain knowledge than the knowledge built into the fitness function, but is not a general method [Iba and de Garis, 1996]. Second, one can use a slightly modified version of the fitness function, corresponding to a lower-dimensionality problem of the same type. For example, the block fitness may be measured only on a reduced set of fitness cases, dependent on the variables used in the block.
This method actually scales the fitness function down to cope with a smaller-size problem (see the example in Section 5.5.3). As expected, fit blocks are very useful in dynamically extending the problem representation by means of the definition of new global subroutines.
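Monitoring most frequent blocks, as discussed above, amounts to counting subtrees of bounded height across the population. A minimal sketch, where the height bounds follow the 2 ≤ h ≤ 4 setting used in the experiments of Section 5.5.3:

```python
from collections import Counter

def subtrees(tree, min_height=2, max_height=4):
    """Yield every subtree of `tree` whose height is within the given bounds.
    Trees are encoded as (op, child, ...) tuples; leaves have height 0."""
    def height(t):
        if not isinstance(t, tuple):
            return 0
        return 1 + max(height(c) for c in t[1:])
    def walk(t):
        if isinstance(t, tuple):
            yield t
            for c in t[1:]:
                yield from walk(c)
    for sub in walk(tree):
        if min_height <= height(sub) <= max_height:
            yield sub

def most_frequent_blocks(population, k=5):
    """Top-k blocks by raw frequency across the whole population."""
    counts = Counter()
    for program in population:
        counts.update(subtrees(program))
    return counts.most_common(k)

# Hypothetical population sharing one height-2 block.
block = ("and", ("or", "d0", "d1"), "d2")
population = [("nand", block, "d3"), ("nor", "d4", block)]
top = most_frequent_blocks(population, k=1)
```

As the section argues, a high count from such a tally does not by itself certify a useful building block; identities and introns can score just as high.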
5.5.3 Experimental results

The even-n-parity problem is solvable by problem decomposition into simpler subproblems and thus represents a good test bench for the discovery of more and more
complex building blocks. The problem was described in Section 4.1. Table 3.1 (Section 3.4.1) summarizes the parameters used in the GP experiments here, while Table 5.2 summarizes the additional parameters for AR.

Table 5.2: GP setup for solving parity problems of order n, and additional AR parameters.

    Block fitness function       fitness measure applied on a subset of inputs
    Block selection              best two blocks (if any)
    Epoch-replacement-fraction   1/2 or 1/4
The cost, or standardized fitness, of a program i having Hits(i) hits and size Size(i) is:
C_standardized(i) = [2^n - Hits(i)] \cdot C_1 + Size(i) \cdot C_2
where C_1, C_2 are constants. We used both C_1 = 1 and C_2 = 0, and C_1/C_2 = n. We have also tested a formula derived directly from the MDL principle (see Appendix B), with poorer results. One possible explanation is the greater pressure on eliminating dead regions of code, which may prove to be a reservoir for diversity later on. Additional control parameters for GP were as follows. The maximum depth of individuals was D = 17, while new individuals may have a maximum depth of 6. We used both fitness-proportionate and tournament selection (with similar results). We have not experimented with other values of these parameters, but rather have used the values reported in [Koza, 1992] for reasons of result comparability. Other specific AR parameters were as follows. The block fitness function was the same as the fitness function (C_1 = 1 and C_2 = 0), with the change that hits are evaluated on a subset of the set of fitness cases, determined after fixing the values of variables not used in the block to arbitrary values (zero here). This is weaker than computing parity on a subset of inputs. Only blocks with the maximum number of hits are considered as candidates. No pruning function was initially considered (step 2 in Figure 5.9). An epoch-replacement-fraction of 1/2 gave good results when solving even-n-parity with n up to 8. For bigger orders, we chose a smaller value to keep lower the computational overhead due to adapting the representation. In general, the
Table 5.3: Comparison of results (rounded figures): AR-GP vs. results reported in [Koza, 1994b], marked (#). sc is the structural complexity; g is the number of generations.

    Method      even-3     even-4     even-5     even-8
                g    sc    g    sc    g    sc    g    sc
    GP#         5    45    23   113   50   300   -    -
    ADF-GP#     3    48    10   60    28   157   24   186
    AR-GP       2    17    3    15    5    32    10   41

bigger its value, the larger are the computational effort and memory requirements for a run, so one has to trade off the power obtained against the costs incurred. We solved all parity problems up to order 11 on a Sun SPARCstation 10 by adapting the representation based on fit blocks. A comparison of results among AR, GP, and ADF-GP is presented in Table 5.3. The AR-GP row shows the number of generations needed to find a solution with 99% probability in one run and the average structural complexity of solution trees obtained over ten runs of AR-GP on even-parity problems with population size M = 4000. The GP# and ADF-GP# rows present comparative results taken from [Koza, 1994b] for sample runs with similar parameter values, but M = 16000.
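The standardized fitness defined earlier in this subsection is straightforward to compute; a sketch, where the nonzero C_2 value in the second call is an arbitrary illustration of parsimony pressure:

```python
def standardized_fitness(hits, size, n, c1=1.0, c2=0.0):
    """Cost of a program on even-n-parity: missed fitness cases weighted by
    C1 plus a size-based parsimony term weighted by C2."""
    return (2 ** n - hits) * c1 + size * c2

# A perfect even-5-parity program of size 32, without and with parsimony.
no_pressure = standardized_fitness(hits=32, size=32, n=5)
with_pressure = standardized_fitness(hits=32, size=32, n=5, c2=0.2)
```

With C2 = 0 a perfect program has zero cost; any positive C2 makes larger perfect programs strictly worse than smaller ones.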
Emergence of hierarchical representations

It is important to point out the hierarchical structure of the dynamically created functions and their semantics. Table 5.4 presents the main steps of the trace of a run of even-8-parity, and Figure 5.10 presents the final call graph induced in that run. Different levels in the call graph correspond to higher epochs of the evolutionary process. Functions on the same level do not call one another. Only functions in different epochs may take advantage of the genetic material discovered previously. The foundation of the hierarchy is made of the primitive functions included in F0. We have argued from a theoretical point of view that hierarchical structures are more powerful than structures based on the initial function set. Table 5.3 brings some experimental evidence. With ADF or AR, the scalability of the even parity problem
Table 5.4: Important steps in the evolutionary trace for a run of even-8-parity.

    Generation 0. New functions:
        [F681]: (LAMBDA (D3) (NOR D3 (AND D3 D3)))
        [F682]: (LAMBDA (D4 D3) (NAND (OR D3 D4) (NAND D3 D4)))
    Generation 1. New function:
        [F683]: (LAMBDA (D4 D5 D7) (F682 (F681 D4) (F682 D7 D5)))
    Generation 3. New functions:
        [F684]: (LAMBDA (D4 D5 D0 D1 D6)
                  (F683 (F683 D0 D6 D1) (F681 D4) (OR D5 D5)))
        [F685]: (LAMBDA (D1 D7 D6 D5)
                  (F683 (F681 D1) (AND D7 D7) (F682 D5 D6)))
    Generation 7. The solution found is:
        (OR (F682 (F682 (F683 D4 D2 D6)
                        (NAND (NAND (AND D6 D1) (F681 D5)) D1))
                  (F682 (F683 D5 D0 D3) (NOR D7 D2)))
            D5)

improves significantly. Figure 5.11 presents the evaluational complexity and the size of the best-of-generation individual, and averages over the entire population. The structural complexity values are bounded from below and above by Size and EC, respectively. The best-of-generation individual becomes simpler and simpler due to size pressure and the variety of useful, more powerful blocks that appear in the population. Starting with generation 3, discovered functions begin to dominate the structure of the best-of-generation individual as they gradually replace the primitive functions. A statistical analysis of the frequent blocks after the function set is extended on the basis of fit blocks outlines the
Figure 5.10: Call graph for the extended function set in the even-8-parity example. [Figure: the hierarchy, bottom to top: primitives OR, NAND, AND, NOR; F681 (NOT) and F682 (Even-2-Parity) created at generation 0; F683 (Even-3-Parity) at generation 1; F684 (Even-4-Parity) and F685 (Even-5-Parity) at generation 3; EVEN-8-PARITY solved at generation 7.]

importance of the new functions created, and thus of the hypothesized building blocks. The new functions rapidly become dominant in the population, if they are useful. We can thus evaluate extensions of the function set. Figure 5.12 shows that the complexity of individuals increases considerably for a simpler problem (even-5-parity) when standard GP is applied, without bringing noticeable improvements in the standardized fitness. A similar problem would appear if the new functions created by AR did not correspond to good building blocks.
Fit and frequent blocks

This section presents experiments that support the discussion about fit and frequent blocks in Section 5.5.2. All experiments track the presence of small blocks of height h, 2 ≤ h ≤ 4. An example of the evolution of the most frequent blocks (MFB) when disabling the adaptation of the function set in a run of the even-3-parity problem is presented in Figures 5.13 and 5.14. None of the MFBs appeared in a final solution. The most frequent blocks at the end of generation 22, when a solution is found, are:

1. (nand (or (nand d2 d1) (or d2 d2)) (nand (and d0 d1) (nor d2 d2))) appeared 132 times and has 6 hits;
Figure 5.11: Complexity of the best-of-generation individual and average values over the entire population in the even-8-parity example. [Figure: complexity (points), 0 to 140, vs. generation, 0 to 10; curves for EC(Best-of-gen), Size(Best-of-gen), Primitives(Best-of-gen), NewFun(Best-of-gen), AvgEC, AvgSize.]
Figure 5.12: With AR inhibited, an even-5-parity run shows a permanent increase in structural complexity but a plateau in fitness. [Figure: standardized fitness / structural complexity, 10 to 35, vs. generation, 0 to 50; curves for AvgFitness, Fitness(Best-of-gen), SC(Best-of-gen).]
2. (nand (nand (nor d0 d1) d2) (nand (and d0 d1) (nor d2 d2))) appeared 42 times and has 4 hits;

3. (nor (nor d2 d1) (and d0 d2)) appeared 41 times and has 4 hits;

4. (and d0 (nor d0 d2)) appeared 35 times and has 4 hits;

5. (or (nand (or d2 d0) d2) (or (and d1 d0) (nor d1 d0))) appeared 31 times and has 2 hits (MFB5).

Hits were measured over the total number of 8 fitness cases. Note that MFB5 contains a useful sub-block which is nothing but XOR applied to D0 and D1 (the right subtree of its root). This explains why MFB5 has become frequent.
Figure 5.13: Final distribution of block frequencies in a run of even-3-parity. [Figure: count, 0 to 140, vs. block number, 0 to 1800.]

There are 13 blocks with a fitness of 6 at generation 22. One example is (nand (or (nand d2 d1) d0) (nand (and d0 d1) (nor d2 d2))). Runs with AR using discovery of subroutines based on frequent blocks have failed. In contrast, by adapting the representation based on fit blocks, solutions are obtained in at most two generations. Performance is improved dramatically in this case as well as in much more complex ones. There is no such improvement if the function set is extended based on frequent blocks. It is interesting to note that a statistical analysis of the frequent blocks and functions used after the function set is extended on the basis of fit blocks outlines the importance
Figure 5.14: Evolution of the most frequent blocks in the even-3-parity example. [Figure: number of modules, 0 to 140, vs. generation, 0 to 25; curves for MFB1 through MFB5.]

of the new functions created, and thus of the hypothesized building blocks. The new functions rapidly become dominant in the population, if they are useful. We can thus evaluate extensions of the function set. This is of high importance especially when the rules used in establishing the merit of building blocks are heuristics rather than precise ones. With AR, the analysis of the evolution of the most frequent final blocks shows a different picture. The three most frequent blocks in a run of AR on even-5-parity are:

1. (f893 d3 (f894 d4 d0 d1)) appears 131 times and has 16 hits, out of 16.

2. (f895 d3 (f894 d2 d1 d3) d3 (or d1 d1)) appears 102 times and attains 16 hits too.

3. (f894 (f895 d3 (f894 d2 d1 d3) d3 (or d1 d1)) (f892 (f895 d1 d4 d1 d1)) d0) appears 22 times and has 32 hits (it would be chosen for generating a 5-input subfunction).

All MFBs compute either parity or its inverse on a subset of input bits (the last one for all inputs). If the evolutionary path is good, then the population has a high potential to contain and invent new useful building blocks. The GP algorithm will disseminate them in the population.
5.5.4 Discussion

AR does not delete subroutines. Provided that the block selection heuristics are good, the procedure creates stable, useful subroutines. This focuses search towards desirable regions of the search space, leading to improved overall efficiency and scalability. In general, some subroutines may be bad guesses or may only present a temporary advantage. GP has the potential to select useful primitives. Ideally, the algorithm should learn which subroutines to delete and which to keep around, but this is not attempted in AR. There are two important differences between AR and the module acquisition approach [Angeline, 1994a]. First, the created subroutines in AR expand the set of primitives instead of just being recorded into a genetic library. Second, in GLiB encapsulation is done randomly, while in AR blocks selected for subroutine creation are evaluated using either user-supplied heuristic information (block fitness functions) or statistical properties from the population.
5.6 Summary and other related work

The adaptive representation idea departs from the principles of natural evolution and attempts to heuristically speed up the evolution of procedural representations. In contrast to nature, a simulated evolution algorithm has access to a wealth of history information on which it can reflect. The idea is then to use the experience gained in simulations in order to improve GP search. The AR algorithm implements this simple idea in a two-tier architecture. The bottom tier uses GP as a search engine. The top tier implements a meta-level learning or optimization algorithm specific to the representation used. The key idea of the learning algorithm is to extract features from the evolution trace of GP in order to be able to focus search towards new regions of the state space that may be more promising. With procedural representations, a natural choice for what to learn is subroutines, i.e., procedural abstractions. An analysis of randomly enlarged function sets shows that
enlarged function sets are advantageous. AR dynamically creates what can be useful abstractions. AR attempts to implement this idea in a domain-independent way by relying on frequent expressions. This approach has not been successful. The rescue is to use domain-dependent block fitness functions. The next chapter will present domain-independent solutions to selecting genetic material for future abstractions. At the discovery level, AR can follow a reinforcement learning strategy. The discussion in Section 2.3 is relevant from this perspective. An example of a search-focusing strategy very similar to AR's appears in the STAGE algorithm [Boyan and Moore, 1997]. STAGE solves optimization problems in three stages. First, an optimization algorithm such as simulated annealing is used. Pairs formed by the suboptimal solutions and the corresponding values of the objective function can be used to train a function approximator to the objective function. The approximator is used to predict regions of the search space where the optimization algorithm can be run again.
6 Adaptive Representation through Learning

The Adaptive Representation through Learning (ARL) algorithm copes with problems not addressed by the approaches presented in the previous chapter. The most important of these problems is the domain-independent characterization of the value of subexpressions. This chapter discusses in detail the ARL extension to GP. ARL further improves on AR. The distinctive ideas are to evaluate the utility of subexpressions in domain-independent ways, and to dynamically manage the global library of subroutines created by AR by learning what is good from evolution traces. The chapter discusses several domain-independent heuristics for selecting and determining the value of subexpressions. It presents experimental results and discusses the role of discovered subroutines.
6.1 Learning good subroutines: the ARL algorithm

The adaptive representation technique could be further improved by solving two issues. The first is the domain-independent characterization of the value of subexpressions. Previous GP extensions do not attempt to decide what is relevant, i.e., which blocks of code or subroutines may be worth giving special attention, but employ genetic operations at random points. The second issue is the time course of the generation of new subroutines. When should new subroutines be created, and when could subroutines be deleted? Other techniques, including the AR approach in the previous chapter, do
not make informed choices to automatically decide when creation, deletion, or modification of subroutines is advantageous or necessary. Actually, AR has no mechanisms to delete subroutines, but is rather thrifty in creating too many subroutines at one generation. The "what" issue is addressed by relying on local measures such as parent-offspring differential fitness and block activation in order to discover useful subroutines and by learning which subroutines are useful. The "when" issue is addressed by learning evaluations for subroutines and by relying on global population measures, such as population entropy, in order to predict when search reaches local optima and escapes them. This section describes the ARL algorithm. It answers the above questions using both local and global information implicitly stored in the population. Local information is brought to bear based on the notions of differential fitness and block activation. Global information is used to define subroutine utility.
6.1.1 The ARL Strategy

The central idea of the ARL algorithm, as well as of AR, is the dynamic adaptation of the problem vocabulary. The vocabulary at generation t is given by the union of the terminal set T, the function set F, and a set of evolved subroutines St.
T ∪ F represents the set of primitives, which is fixed throughout the evolutionary process. In contrast, St is a set of subroutines whose composition may vary from one generation to another. St may be viewed as a population of subroutines that extends the representation vocabulary in an adaptive manner. Subroutines compete against one another, but may also cooperate for survival, as will be described below. New subroutines are discovered and the "least useful" ones die out. St is used as a pool of additional problem primitives, besides T and F, for randomly generating some individuals in the next generation, t + 1. The ARL algorithm attempts to automatically discover useful subroutines and adapt the set St by applying the heuristic "pieces of useful code may be generalized and successfully applied in more general contexts."
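The adaptive vocabulary can be sketched as three sets, of which only St changes between generations. The sketch below is purely illustrative; the set contents and names are hypothetical stand-ins, not part of any real GP system.

```python
# Sketch of ARL's adaptive vocabulary: T and F are fixed, S_t evolves.
# All names are illustrative, not from a real GP library.
TERMINALS = {"d0", "d1", "d2"}            # T: problem inputs
FUNCTIONS = {"and", "or", "nand", "nor"}  # F: fixed primitives

def vocabulary(subroutines):
    """Primitives available for generating individuals at generation t."""
    return TERMINALS | FUNCTIONS | set(subroutines)

s_t = {"f681", "f682"}             # subroutines discovered so far
s_t = (s_t - {"f681"}) | {"f683"}  # deletion and creation between generations
assert "f683" in vocabulary(s_t) and "f681" not in s_t
```

The key design point mirrored here is that deletion and creation touch only St; the primitive sets T and F are never modified.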
6.1.2 Discovery of Useful Subroutines

New subroutines are created using blocks of genetic material from the pool given by the current population. The major issue here is the detection of "useful" blocks of code. The notion of usefulness in the subroutine discovery heuristic is defined by two concepts: differential fitness and block activation. The subroutine discovery algorithm is presented in Figure 6.1. The major steps in the discovery of useful blocks are described next.
Differential Fitness

The nature of GP is that programs that contain useful code will tend to have a higher fitness and consequently their offspring will tend to dominate the population. The concept of differential fitness is a heuristic which anticipates this trend and focuses on blocks from such individuals. Thus blocks of code are selected from programs that have the biggest fitness improvement over their least fit parent, i.e. the highest differential fitness. Let i be a program in the population having raw fitness Fitness(i). Its differential fitness is defined as:

DiffFitness(i) = Fitness(i) − min_{p ∈ Parents(i)} {Fitness(p)}    (6.1)

We focus on programs i having the following property:

DiffFitness(i) > 0    (6.2)
Large differences in fitness are presumably created by useful combinations of pieces of code appearing in the structure of an individual. This is exactly what the algorithm should discover. Figure 6.2 shows the histogram of the differential fitness defined above for a run of ARL on the Pac-Man problem. Each slice of the plot for a fixed generation represents the number of individuals (in a population of size 500) vs. differential fitness values. The figure shows that only a small number of individuals improve on the fitness of their parents. ARL will focus on such individuals in order to discover salient blocks of code.
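A minimal sketch of Equation 6.1 and the selection rule of Equation 6.2, assuming each individual records its own fitness together with its parents' fitnesses (the record layout is hypothetical):

```python
# Differential fitness (Eq. 6.1): improvement over the least fit parent.
# Individuals are (fitness, parent_fitnesses) pairs; layout is illustrative.
def diff_fitness(fitness, parent_fitnesses):
    if not parent_fitnesses:          # generation-0 individuals have no parents
        return 0.0
    return fitness - min(parent_fitnesses)

def promising(population):
    """Select individuals that improved on their least fit parent (Eq. 6.2)."""
    return [ind for ind in population if diff_fitness(*ind) > 0]

pop = [(12.0, [10.0, 15.0]), (8.0, [9.0]), (20.0, [20.0, 25.0])]
# Only the first individual improves: 12.0 - 10.0 = 2.0 > 0
assert [ind[0] for ind in promising(pop)] == [12.0]
```

Note how the third individual, despite having the highest raw fitness, is excluded: it merely matched its best parent, so it contributes no differential-fitness signal.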
Subroutine-Discovery(Pt, S_new, Pt_dup)

1. Initialize the set of new subroutines S_new = ∅. Initialize the set of duplicate individuals Pt_dup = ∅.
2. Select the subset of promising individuals: B = {i | DiffFitness(i) > 0}.
3. Label each node of every program i ∈ B with the number of activations in the evaluation of i on all fitness cases.
4. Create a set of candidate building blocks BBt by selecting all blocks of small height and high activation from B.
5. Prune the candidate set BBt by eliminating all blocks having inactive nodes.
6. For each block b ∈ BBt in the candidate set, repeat:
   (a) Let b belong to program i. Generalize the code of block b:
       i. Determine the terminal subset Tb used in the block b;
       ii. Create a new subroutine s having as parameters a random subset of Tb and as body the block b with those terminals replaced;
       iii. Create a new program p_dup as a copy of i with block b replaced by an invocation of the new subroutine s. The actual parameters of the call to s are given by the replaced terminals.
   (b) Update S_new and Pt_dup:
       i. S_new = S_new ∪ {s}
       ii. Pt_dup = Pt_dup ∪ {p_dup}
7. Results: S_new, Pt_dup.

Figure 6.1: ARL extension to GP: the subroutine discovery algorithm for adapting the problem representation.
Figure 6.2: Differential fitness distributions over a run of ARL with representation A on the Pac-Man problem. At each generation, only a small fraction of the population has DiffFitness > 0. [Figure: number of programs, 0 to 500, vs. fitness class, −60 to 60, over generations 0 to 50.]
Block Activation

Once candidate parents have been selected, the next step is to identify useful blocks of code within those parents. During repeated program evaluation, some blocks of code are executed more often than others. The more active blocks become candidate blocks. Block activation is defined as the number of times the root node of the block is executed. Salient blocks are active blocks of code from individuals with the highest differential fitness. In contrast to [Tackett, 1995], salient blocks have to be detected efficiently, online. This is possible because candidate blocks are only searched for among the blocks of small height (between 3 and 5 in the current implementation) present in individuals with the highest differential fitness. Nodes with the highest activation value are considered as candidates. In addition, we require that all nodes of the subtree be activated at least once, or a minimum percentage of the total number of activations of the root node. This condition is imposed in order to eliminate from consideration blocks containing introns and hitch-hiking phenomena [Tackett, 1995]. It is represented by the pruning step (5) in Figure 6.1.
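One way to obtain activation counts online is to increment a per-node counter during evaluation; conditionals make activation non-uniform across the tree. The sketch below uses a hypothetical tuple-based tree representation, not the thesis implementation:

```python
# Count node activations while evaluating a tree with a short-circuit
# conditional; node layout and counter scheme are illustrative only.
def evaluate(node, env, counts):
    counts[id(node)] = counts.get(id(node), 0) + 1
    op, args = node[0], node[1:]
    if op == "var":
        return env[args[0]]
    if op == "if":                         # only one branch is activated
        cond, then_, else_ = args
        branch = then_ if evaluate(cond, env, counts) else else_
        return evaluate(branch, env, counts)
    if op == "and":
        return evaluate(args[0], env, counts) and evaluate(args[1], env, counts)

tree = ("if", ("var", "c"), ("var", "x"), ("and", ("var", "x"), ("var", "y")))
counts = {}
for case in ({"c": True, "x": 1, "y": 0}, {"c": False, "x": 1, "y": 1}):
    evaluate(tree, case, counts)
# The 'and' subtree was activated only on the second fitness case.
assert counts[id(tree[3])] == 1
```

This illustrates why inactive-node pruning matters: a subtree shielded by a conditional may never run on any fitness case, making it an intron rather than a useful block.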
Generalization of Blocks

The final step is to formalize the selected blocks as new subroutines and add them to the GP vocabulary. Blocks are generalized by replacing some random subset of terminals in the block with variables (see Step 6a in Figure 6.1). Variables become formal arguments of the subroutine created. The generalization operation makes sense in the case when the primitive symbols satisfy the closure condition [Koza, 1992], i.e. they can be functionally combined in every possible way. In strongly-typed GP [Montana, 1994] each variable or constant has a type, and each function has a signature. The function signature is defined by the type of the function result and by the formal argument types. Block generalization in typed GP additionally assigns a signature to each subroutine created. The subroutine signature is defined by the type of the function that labels the root of the block and the types of the terminals selected to be substituted by variables. Signatures of all primitives and new subroutines represent the new genetic composition constraints.
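Generalization can be sketched as substituting a chosen subset of a block's terminals with fresh parameter names. Here s-expressions are nested Python lists and the parameter-naming scheme is hypothetical:

```python
# Generalize a block: replace selected terminals with formal parameters.
# S-expressions are nested Python lists; all names are illustrative.
def generalize(block, terminals_to_abstract):
    params = {t: f"arg{k}" for k, t in enumerate(sorted(terminals_to_abstract))}
    def subst(node):
        if isinstance(node, list):
            return [subst(child) for child in node]
        return params.get(node, node)       # terminal -> formal parameter
    return list(params.values()), subst(block)

block = ["nand", ["or", "d3", "d4"], ["nand", "d3", "d4"]]
formals, body = generalize(block, {"d3", "d4"})
assert formals == ["arg0", "arg1"]
assert body == ["nand", ["or", "arg0", "arg1"], ["nand", "arg0", "arg1"]]
```

Abstracting both terminals of this block recovers exactly the shape of F682 from Table 5.4: an even-2-parity (XNOR) subroutine over two formal arguments.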
Subroutine Utility

ARL expands the set of subroutines St whenever it discovers new subroutine candidates. All subroutines in St are assigned utility values which are updated every generation. A subroutine's utility is estimated by observing the outcome of using it. This is done by accumulating, as reward, the average fitness values of all programs that have invoked s over the past generations, directly or indirectly (by calling other subroutines that call s directly or indirectly). The subroutine utility is analogous to schema fitness. However, reward accumulation is done over a fixed time window of W generations.¹

¹ Undiscounted past rewards are currently used to estimate the fitness of subroutines. Reinforcement learning (RL) algorithms such as Q-learning [Watkins, 1989] use temporal discounting of future expected rewards. Temporal discounting could also be used here. Undiscounted rewards have been used because the method is simpler, is analogous to schema fitness, and does not favor current mediocre programs that may invoke a subroutine. R-learning is an undiscounted RL algorithm [Schwartz, 1993] that outlines some of the advantages of undiscounted dynamic programming methods.

Thus for a subroutine s, its utility U(s) is:

U(s) = K⁻¹ Σ_{g=t−W}^{t} Σ_j Fitness(j)    (6.3)
where j is a program that invokes s and K is a normalizing constant. In a hierarchy of subroutines, good subroutines higher in the hierarchy may reinforce other subroutines lower in the hierarchy, so programs may "cooperate" for survival. If we define the raw utility of s, Û(s), as the average fitness of all programs directly invoking it, the utility of s, U(s), is equivalent to the following algebraic form:

U(s) = α₀ Û(s) + Σ_j α_j U(s_j)    (6.4)
where s_j is a subroutine that invokes s and α_j is a subunitary weighting factor representing the fraction of all programs calling s indirectly through s_j (j ≥ 1) or directly (j = 0). This formula shows that if s is a good subroutine and a particular subroutine s_j invokes s often, then its utility will also be higher. The set of subroutines co-evolves with the main population of solutions through creation and deletion operations. New subroutines are automatically created based on active blocks as described before. Low-utility subroutines are deleted in order to keep the total number of subroutines below a given number. In order to preserve the functionality of those programs invoking a deleted subroutine, calls to the deleted subroutine are substituted with the actual body of the subroutine, as in an in-line substitution operation.
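A sketch of the windowed, undiscounted utility of Equation 6.3, assuming we log the per-generation fitnesses of each subroutine's invokers; the data layout and the choice of K are hypothetical:

```python
# Windowed subroutine utility (Eq. 6.3): sum of invoker fitnesses over the
# last W generations, normalized by K = number of invocations logged.
# invoker_fitness[g] lists fitnesses of programs invoking s at generation g.
def utility(invoker_fitness, t, window):
    rewards = [f for g in range(max(0, t - window), t + 1)
               for f in invoker_fitness.get(g, [])]
    if not rewards:
        return 0.0
    return sum(rewards) / len(rewards)

log = {0: [5.0, 7.0], 1: [9.0], 3: [11.0]}
# Only generations 1..3 fall in the window, so generation 0 is forgotten.
assert utility(log, t=3, window=2) == 10.0
```

The fixed window makes the utility forget old rewards entirely rather than discount them, matching the undiscounted scheme described in the footnote.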
6.1.3 The ARL Algorithm

The general structure of the ARL algorithm is given in Figure 6.3. It extends a standard GP algorithm with one new step (3b), implementing the adaptation of the problem vocabulary.
6.1.4 When to Create Subroutines: Using Entropy

The use of functional subroutines in the GP population has the effect of preserving a higher population diversity over generations in comparison to standard GP [Rosca, 1995b]. An increased diversity may result in a more effective search, by escaping local optima. The notion of diversity in GP is involved in an answer to the "when" question. In general, there are several possible answers to the "when" question. New subroutines can be created:

1. Whenever the subroutine discovery algorithm suggests new subroutines.
2. At the end of epochs, i.e. periods of consecutive generations throughout which the GP system works like standard GP using a fixed representation system.
3. Adaptively, in response to long-term decreases in population diversity.

There are several justifications for not attempting to adapt the problem representation all the time: (1) the computational overload introduced for tracking blocks of code; (2) the goal to exploit the current set of subroutines St (the vocabulary would be changed only if no progress is made using it); (3) the poor performance of candidate solutions in early generations (progress is probable in early generations, anyway). An appropriate measure of diversity is population entropy [Rosca, 1995a]. The entropy measure represents a global measure for describing the state of the dynamical system represented by the population, in analogy to the state of a physical or informational system. Population entropy can be computed by grouping individuals into classes according to their behavior or phenotype and determining the number of individuals that belong to each class, according to Shannon's information entropy formula [Shannon, 1949]:

E(P) = − Σ_k p_k log(p_k)    (6.5)

where p_k is the proportion of the population P occupied by population partition k at a given time. In GP, a useful class of the partition is a fixed interval of fitness values. Individuals are regarded as equivalent if their fitness values lie in the same interval regardless of differences in their code. Entropy provides a way to track diversity during a GP run. Decreases in population diversity can be correlated with a plateau in the best-of-generation fitness or a plateau in the average fitness over a fraction of the best individuals in the population. Such
ARL Algorithm

Define problem: terminal set T, function set F, fitness function F, set of training cases E. Denote the population at a given generation t by Pt.

1. Initial generation: evolution time (generation) t = 0, discovered subroutines S0 = ∅.
2. Randomly initialize population P0(T ∪ F; P0).
3. Repeat until the termination criterion is met:
   (a) Evaluate population(Pt; E; F).
   (b) Adapt representation(Pt; St; St+1; Pt_dup; Pt_new):
       i. Discover new subroutines S_new and create duplicate individuals Pt_dup by calling Subroutine-Discovery(Pt; S_new; Pt_dup).
       ii. Update subroutine utilities(St; S_new).
       iii. Create the next-generation subroutine set St+1:
            A. Select subroutines of low utility to be deleted: S_old(St; S_new).
            B. St+1 = (St − S_old) ∪ S_new.
       iv. Randomly generate newborns Pt_new(T ∪ F ∪ St+1; Pt_new; E; F).
       v. Evaluate population of newborns(Pt_new; E; F).
   (c) Generate a new population Pt+1 by fitness-proportionate reproduction, crossover, and mutation of individuals(Pt; Pt_dup; Pt_new; Pt+1):
       i. Create intermediate population Pt′ = Pt ∪ Pt_dup ∪ Pt_new.
       ii. Select genetic operation O(pr; pc; pm; O).
       iii. Select winning individuals W from the intermediate population(O; Pt′; W).
       iv. Generate offspring Pt+1(O; W; Pt+1).
   (d) Next generation: t = t + 1.

Figure 6.3: The ARL algorithm extends the standard GP algorithm with Step 3b, which adapts the problem representation system (vocabulary) by creating new subroutines, eventually deleting old ones, and creating new individuals to be entered in the selection competition.
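The skeleton of the generational loop in Figure 6.3 might be organized as below. Every helper is a toy stand-in for the mechanisms the chapter describes (discovery, utility-based pruning, breeding), not a real GP engine:

```python
# Skeleton of the ARL generational loop (Fig. 6.3). All helpers below are
# illustrative stubs standing in for the mechanisms described in the text.
def arl_loop(population, subroutines, generations, discover, prune, breed):
    for t in range(generations):
        new_subs, duplicates = discover(population)        # step 3(b)i
        subroutines = prune(subroutines) | new_subs        # steps 3(b)ii-iii
        population = breed(population + duplicates, subroutines)  # step 3(c)
    return population, subroutines

# Toy stand-ins: no real evolution happens here.
discover = lambda pop: ({f"f{len(pop)}"}, [])
prune = lambda subs: subs          # keep everything (no utility pruning)
breed = lambda pop, subs: pop      # identity "breeding"
pop, subs = arl_loop(["p1", "p2"], set(), 3, discover, prune, breed)
assert subs == {"f2"}
```

The structural point is that vocabulary adaptation (step 3b) happens inside the same loop as selection and breeding, so duplicates carrying new subroutine calls immediately compete with the rest of the population.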
correlations suggest convergence towards a local optimum [Rosca, 1995a]. They can be used to decide when to create new subroutines.
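Population entropy over fitness intervals (Equation 6.5) can be computed as below; the bin width is a hypothetical parameter standing in for the fixed fitness interval mentioned in the text:

```python
import math
from collections import Counter

# Population entropy (Eq. 6.5) over fixed fitness intervals; individuals in
# the same interval are treated as phenotypically equivalent.
def population_entropy(fitnesses, bin_width=1.0):
    bins = Counter(int(f // bin_width) for f in fitnesses)
    n = len(fitnesses)
    return -sum((c / n) * math.log(c / n) for c in bins.values())

# A converged population has zero entropy; a spread-out one does not.
assert population_entropy([5.0, 5.2, 5.4], bin_width=1.0) == 0.0
assert population_entropy([1.0, 2.0, 3.0], bin_width=1.0) > 1.0
```

A long-running decrease of this quantity, combined with a fitness plateau, would be the trigger for creating new subroutines under answer 3 above.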
6.2 Representation approaches

We have tested the ARL algorithm on the problem of controlling an agent in a dynamic environment, described in detail in Section 4.4. Here we present an alternative problem representation, and discuss the results obtained with several search approaches, among which ARL is central. We also highlight differences with human-designed solutions.
6.2.1 Representation approaches in the Pac-Man Problem

The Pac-Man problem primitives chosen for the GP implementation in [Koza, 1992] are sufficiently high level to focus on a single game aspect, that of task prioritization. In this chapter we describe experiments with three related representations of the Pac-Man problem. The first is the Pac-Man problem representation from [Koza, 1992] and is called A1 here. The second is the representation described in Section 4.4, called A2 here. It uses the same vocabulary as representation A1, but changes the result returned by some primitives. More precisely, A2 modifies A1 by making all the action primitives return the distance from the corresponding element. This addresses the problem that distances and directions in A1 are mixed without making much resulting sense, at least for the human designer. GP using A2 cannot mix distances and directions. In contrast, GP using A1 is free to mix them if this provides an evolutionary advantage. We refer to either A1 or A2 as A. The third will be called representation B. It uses a typed vocabulary by taking into account the signature of each primitive, i.e. the return type of each function (subroutine) as well as the types of its arguments. It introduces primitives for evolving explicit logical conditions under which actions can be executed. This representation will be described in more detail next.
Representation B - Typed GP

In problem representations A1 and A2, actions may appear both in the condition and in the action part of an iflte expression. The evaluation of the condition changes the context where the action is executed. This makes it extremely difficult to understand what an evolved program really does without executing it. For analyzing GP, it is desirable that evolved programs express explicit conditions under which certain actions are prescribed. To do this, we used a typed GP system [Montana, 1994; Johnson et al., 1994] based on an extended set of primitives obtained from A2. The problem representation A2 is extended with relational operators (