General Schema Theory for Genetic Programming with Subtree-Swapping Crossover

Riccardo Poli
School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK
Email: [email protected]  Telephone: +44-121-414-3739  FAX: +44-121-414-4281

Technical Report: CSRP-00-16
November 2000 (Revised December 2000)

Abstract

In this paper a new general schema theory for genetic programming is presented. Like other recent GP schema theory results (Poli 2000a, Poli 2000b), the theory gives an exact formulation (rather than a lower bound) for the expected number of instances of a schema at the next generation. The theory is based on a Cartesian node reference system which makes it possible to describe programs as functions over the space N^2 and allows one to model the process of selection of the crossover points of subtree-swapping crossovers as a probability distribution over N^4. The theory is also based on the notion of variable-arity hyperschema, which generalises previous definitions of schema or hyperschema introduced in GP. The theory includes two main theorems describing the propagation of GP schemata: a microscopic schema theorem and a macroscopic one. The microscopic version is applicable to crossover operators which replace a subtree in one parent with a subtree from the other parent to produce the offspring. Therefore, this theorem applies equally to standard GP crossover (Koza 1992) with and without uniform selection of the crossover points, to one-point crossover (Poli and Langdon 1997b, Poli and Langdon 1998), size-fair crossover (Langdon 1999, Langdon 2000), strongly-typed GP crossover (Montana 1995), context-preserving crossover (D’haeseleer 1994) and many others. The macroscopic version is applicable to crossover operators in which the probability of selecting any two crossover points in the parents depends only on their size and shape. In the paper we provide examples which show how the theory can be specialised to specific crossover operators and how it can be used to derive other general results such as an exact definition of effective fitness and a size-evolution equation for GP with subtree-swapping crossover.
1 Introduction

Schema theories are often macroscopic models of genetic algorithms, in the sense that they provide information about the properties of a population at the next generation in terms of macroscopic quantities, like schema fitnesses, population fitness, or number of individuals in a schema, measured at the current generation. Some schema theories are not fully macroscopic in that they express macroscopic properties of schemata using only microscopic quantities, such as the fitnesses of all the individuals in the population, or a mixture of microscopic and macroscopic quantities. We will refer to these as microscopic schema theories to differentiate them from the purely macroscopic ones. Usually macroscopic schema equations are simpler and easier to analyse than microscopic ones, but in the past they tended to be only approximate or worst-case-scenario models, although, as shown by recent work (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999), they do not have to be so.
The theory of schemata in genetic programming has had a difficult childhood. After some excellent early efforts leading to different worst-case-scenario schema theorems (Koza 1992, Altenberg 1994, O’Reilly and Oppacher 1995, Whigham 1995, Poli and Langdon 1997b, Rosca 1997), only very recently have exact schema theories become available (Poli 2000a, Poli 2000b) which give exact formulations (rather than lower bounds) for the expected number of instances of a schema at the next generation.1 These exact theories are applicable to GP with one-point crossover (Poli and Langdon 1997a, Poli and Langdon 1997b, Poli and Langdon 1998). Since one-point crossover is not widely used,2 so far this work has had a limited impact. However, the work in (Poli 2000b) remains the only proposal of an exact macroscopic schema theory for GP, since no exact macroscopic schema theory for standard crossover (or any other GP crossover) has ever been proposed. This paper fills this theoretical gap and presents a new exact general schema theory for genetic programming which is applicable to standard crossover as well as many other crossover operators. The theory includes two main results describing the propagation of GP schemata: a microscopic schema theorem and a macroscopic one. The microscopic version is applicable to crossover operators which replace a subtree in one parent with a subtree from the other parent to produce the offspring. So, the theorem covers standard GP crossover (Koza 1992) with and without uniform selection of the crossover points, one-point crossover (Poli and Langdon 1997b, Poli and Langdon 1998), size-fair crossover (Langdon 1999, Langdon 2000), strongly-typed GP crossover (Montana 1995), context-preserving crossover (D’haeseleer 1994) and many others. The macroscopic version is valid for a large class of crossover operators in which the probability of selecting any two crossover points in the parents depends only on their size and shape.
So, for example, it holds for all the above-mentioned crossover operators except strongly-typed GP crossover. The paper is organised as follows. Firstly, we provide a review of earlier relevant work on schemata in Section 2. Then, in Section 3, we introduce the notion of node reference systems. We use it to define the concepts of functions and probability distributions over such reference systems in Sections 4 and 5, respectively. Then, we show how these can be used to build probabilistic models of different crossover operators (Section 6), which we use to derive a general microscopic schema theorem for GP with subtree-swapping crossover in Section 7. We transform this into a macroscopic schema theorem valid for a large set of subtree-swapping crossover operators used in practice in Section 8. In Section 9 we give examples that show how the theory can be specialised to obtain schema theorems for specific types of crossover operators, and we illustrate how it can be used to obtain other general results, such as an exact definition of effective fitness and a size-evolution equation for GP with subtree-swapping crossover. Some conclusions are drawn in Section 10.
2 Background

Schemata are sets of points of the search space sharing some syntactic feature. For example, in the context of GAs operating on binary strings, syntactically a schema (or similarity template) is a string of symbols from the alphabet {0, 1, *}. The character * is interpreted as a “don’t care” symbol, so that, semantically, a schema represents a set of bit strings. For example the schema *10*1 represents four strings: 01001, 01011, 11001 and 11011. Typically schema theorems are descriptions of how the number (or the fraction) of members of the population belonging to a schema varies over time. As we noted in (Poli et al. 1998), for a given schema H the selection/crossover/mutation process can be seen as a Bernoulli trial (a newly created individual either samples or does not sample H) and, therefore, the number of individuals sampling H at the next generation, m(H, t + 1), is a binomial stochastic variable. So, if we denote with α(H, t) the success probability of each trial (i.e. the probability that a newly created individual samples H), which we term the total transmission probability of H, an exact schema theorem is simply
  E[m(H, t + 1)] = M α(H, t),        (1)

1 To be precise, one of the models proposed in (Altenberg 1994) is an exact model. However, this is a microscopic model of the propagation of program components (subtrees) in GP with standard crossover, rather than a theorem about schemata as sets. This model allows one to compute the exact proportion of subtrees of a given type in an infinite population in the next generation given information about the programs in the current generation.
2 In fact, probably there are no public domain GP implementations offering one-point crossover as an option yet.
where M is the population size and E[·] is the expectation operator. Holland’s (Holland 1975) and other worst-case-scenario schema theories normally provide a lower bound for α(H, t) or, equivalently, for E[m(H, t + 1)]. One of the difficulties in obtaining theoretical results on GP using the idea of schema is that its definition is much less straightforward than for GAs. A few alternative definitions have been proposed in the literature. Syntactically all of them define schemata as composed of one or multiple trees or fragments of trees. In some definitions (Koza 1992, Altenberg 1994, O’Reilly and Oppacher 1995, Whigham 1995, Whigham 1996) schema components are non-rooted and a schema is seen as a set of subtrees that can be present multiple times within the same program. The focus in these theories is on predicting how the number or the frequency of such subtrees varies over time. In more recent definitions (Poli and Langdon 1997b, Rosca 1997) schemata are syntactically represented by rooted trees or tree fragments. These definitions make schema theorem calculations easier. For brevity, in the next subsection we will describe only the definition introduced in (Poli and Langdon 1997b, Poli and Langdon 1998), because this is used in the rest of this paper and in a number of other recent theoretical developments (Poli 2000b, Poli 2000a, Poli and McPhee 2000). In the next section, we will also describe two schema theorems obtained using this definition. Then, in Section 2.2 we will introduce the concept of effective fitness and we will show how this is related to schema theories.
2.1 Exact GP Schema Theory for One-point Crossover

In (Poli and Langdon 1997b) a definition of schema for GP was proposed in which a schema is a tree composed of functions from the set F ∪ {=} and terminals from the set T ∪ {=}, where F and T are the function and terminal sets used in a GP run.3 The primitive = is a “don’t care” symbol which stands for a single terminal or function. A schema H represents programs having the same shape as H and the same labels for the non-= nodes. For example, if F = {+, *} and T = {x, y} the schema (+ x (= y =)) represents the four programs (+ x (+ y x)), (+ x (+ y y)), (+ x (* y x)) and (+ x (* y y)). Schemata of this kind partition the program space into subspaces of programs of fixed size and shape. For this reason in the following we will refer to them as fixed-size-and-shape schemata. In order to derive a GP schema theorem for fixed-size-and-shape schemata, in (Poli and Langdon 1997b, Poli and Langdon 1998) we used non-standard forms of mutation and crossover, namely point mutation and one-point crossover. Point mutation is the substitution of a node in the tree with another node with the same arity. One-point crossover works by selecting a common crossover point in the parent programs and then swapping the corresponding subtrees, like standard crossover. To account for the possible structural diversity of the two parents, one-point crossover analyses the two trees starting from the root nodes and considers for the selection of the crossover point only the parts of the two trees, called the common region, which have the same topology (i.e. the same arity in the nodes encountered traversing the trees from the root node).4 The resulting schema theorem (see (Poli and Langdon 1997b, Poli and Langdon 1998)) is a generalisation of Holland’s schema theorem (as discussed in (Poli 2000c)) to variable size structures. Like Holland’s theorem it is a worst-case-scenario model (i.e. it provides only a lower bound for α(H, t)).
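The semantics of fixed-size-and-shape schemata can be made concrete with a small sketch (not part of the original theory): programs are represented here as nested tuples, and, as an assumption for illustration, all functions are taken to be binary.

```python
# Sketch (assumptions: tuple-encoded programs, binary functions only):
# enumerating the programs matched by a fixed-size-and-shape schema,
# with "=" as the don't-care symbol from the text.
from itertools import product

F = ["+", "*"]          # function set (arity 2)
T = ["x", "y"]          # terminal set

def instances(schema):
    """Enumerate all programs matched by a fixed-size-and-shape schema."""
    if isinstance(schema, tuple):                    # internal node
        head = F if schema[0] == "=" else [schema[0]]
        kids = [instances(c) for c in schema[1:]]
        return [(h, *ks) for h in head for ks in product(*kids)]
    return T if schema == "=" else [schema]          # leaf

# The schema (+ x (= y =)) from the text matches exactly four programs:
progs = instances(("+", "x", ("=", "y", "=")))
```

Running this reproduces the four programs listed above: (+ x (+ y x)), (+ x (+ y y)), (+ x (* y x)) and (+ x (* y y)).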
In (Poli 2000a, Poli 2000b) we were able to improve on this, producing an exact schema theory for GP with one-point crossover, thanks to the introduction of a generalisation of the definition of GP schema: the hyperschema. A GP hyperschema is a rooted tree composed of internal nodes from the set F ∪ {=} and leaves from T ∪ {=, #}. The operator = is a “don’t care” symbol which stands for exactly one node, while the operator # stands for any valid subtree. For example, the hyperschema (* # (= x =)) represents all the programs with the following characteristics: a) the root node is a product, b) the first argument of the root node is any valid subtree, c) the second argument of the root node is any function of arity two, d) the first argument of this function is the variable x, e) the second argument of the function is any valid node in the terminal set.

3 For symmetry, in this paper we use the convention that also the root node of a program or a schema has an output link. In this way there is always a one-to-one correspondence between nodes and links, and, therefore, we can interchangeably consider crossover points as nodes or as links. In the paper we will assume that the root node (or its output link) can be chosen for crossover.
4 The common region can be defined more formally using the notion of arity-match set membership function introduced in Section 4.
Thanks to hyperschemata, it is possible to obtain the following exact microscopic expression for the total transmission probability for a fixed-size-and-shape GP schema H under one-point crossover and no mutation:
  α(H, t) = (1 − p_xo) p(H, t)
            + p_xo Σ_{h1} Σ_{h2} [ p(h1, t) p(h2, t) / NC(h1, h2) ] Σ_{i ∈ C(h1, h2)} δ(h1 ∈ U(H, i)) δ(h2 ∈ L(H, i))        (2)
where: p_xo is the crossover probability; p(H, t) is the selection probability of the schema H;5 the first two summations are over all the individuals in the population; p(h1, t) and p(h2, t) are the selection probabilities of parents h1 and h2, respectively; NC(h1, h2) is the number of nodes in the tree fragment representing the common region between program h1 and program h2; C(h1, h2) is the set of indices of the crossover points in the common region; δ(x) is a function which returns 1 if x is true, 0 otherwise; L(H, i) is the hyperschema obtained by replacing all the nodes on the path between crossover point i and the root node with = nodes, and all the subtrees connected to those nodes with # nodes; U(H, i) is the hyperschema obtained by replacing the subtree below crossover point i with a # node.6 If a crossover point i is in the common region between two programs but it is outside the schema H, then L(H, i) and U(H, i) are empty sets. The hyperschemata L(H, i) and U(H, i) are important because, if one crosses over at point i any individual in L(H, i) with any individual in U(H, i), the resulting offspring is always an instance of H. Let us consider an example which shows how L(H, i) and U(H, i) are built. If H = (* = (+ x =)) then, as indicated in the second column of Figure 1, L(H, 1) is obtained by first replacing the root node with a = symbol and then replacing the subtree connected to the right of the root node with a # symbol, obtaining (= = #). The schema U(H, 1) is instead obtained by replacing the subtree below the crossover point with a # symbol, obtaining (* # (+ x =)), as illustrated in the third column of Figure 1. The fourth and fifth columns of Figure 1 show how L(H, 3) = (= # (= x #)) and U(H, 3) = (* = (+ # =)) are obtained. In (Poli 2000b) we transformed Equation 2 into a macroscopic model, obtaining
  α(H, t) = (1 − p_xo) p(H, t)
            + p_xo Σ_k Σ_l [ 1 / NC(G_k, G_l) ] Σ_{i ∈ C(G_k, G_l)} p(U(H, i) ∩ G_k, t) p(L(H, i) ∩ G_l, t)        (3)

where the first two summations are over all the possible program shapes G_1, G_2, …, i.e. all the possible fixed-size-and-shape schemata containing = signs only.7 The sets U(H, i) ∩ G_k and L(H, i) ∩ G_l either are (or can be represented by) fixed-size-and-shape schemata or are the empty set ∅. So, the result expresses the total transmission probability of H only using the selection probabilities of a set of lower- (or same-) order schemata. As discussed in (Poli 2000c), it is possible to show that, in the absence of mutation, Equation 3 generalises and refines not only the GP schema theorem in (Poli and Langdon 1997b, Poli and Langdon 1998) but also Holland’s (Holland 1975) and more modern GA schema theory (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999).
2.2 Effective Fitness

The concept of effective fitness was introduced in GP in (Nordin and Banzhaf 1995, Nordin et al. 1995) to explain the reasons for bloat and active-code compression. In that work the effective fitness of program j

5 In fitness proportionate selection this is given by p(H, t) = m(H, t) f(H, t) / (M f̄(t)), where m(H, t) is the number of strings matching the schema H at generation t, f(H, t) is the mean fitness of such strings, and f̄(t) is the mean fitness of the strings in the population.
6 Equation 2 is in a slightly different form than the result in (Poli 2000a). However, the two results are equivalent since C(h1, h2) = C(h2, h1) and NC(h1, h2) = NC(h2, h1).
7 Equation 3 is in a slightly different form than the result in (Poli 2000b). However, the two results are equivalent since C(Gk, Gl) = C(Gl, Gk) and NC(Gk, Gl) = NC(Gl, Gk).
Figure 1: Example of schema and some of its potential hyperschema building blocks. The crossover points in H are numbered as shown in the top left.

was defined as follows:
  f_j^e = f_j ( 1 − p · (C_j^e / C_j^a) · p_j^d ),        (4)

assuming fitness proportionate selection. In this equation C_j^a
is the number of nodes in program j, C_j^e is the number of nodes in the active part (in contrast to the intron part) of program j, p is the crossover probability, p_j^d is the probability that crossover in an active block of program j leads to worse fitness for the offspring of j and f_j is the fitness of individual j. Then, letting P_j^t be the proportion of programs j at generation t, P_j^{t+1} the average proportion of offspring of j which behave like j at generation t + 1, and f̄_t the average population fitness at generation t, one can write:

  P_j^{t+1} = P_j^t f_j^e / f̄_t        (5)
which describes “the proliferation of individuals from one generation to the next” (Nordin et al. 1995).8 Equation 5 clearly indicates that an alternative way of interpreting the effects of crossover is to imagine a GP system in which selection only is used, but in which each individual is given a fitness f_j^e rather than the original fitness f_j. The concept of effective fitness is very similar to the concept of operator-adjusted fitness (not to be confused with Koza’s adjusted fitness (Koza 1992)) introduced for GAs a few years earlier by Goldberg in (Goldberg 1989, page 155). Goldberg defined the adjusted fitness of a schema f_adj(H, t) in such a way as to allow Holland’s schema theorem to be reformulated as

  E[m(H, t + 1)] ≥ [ m(H, t) / f̄(t) ] f_adj(H, t).        (6)
8 The “=” sign in the equation should really be “≥”, but it was used with the justification that the reconstruction of individuals with the same behaviour as j (due to crossover applied to individuals different from j) was a rare event.
This shows that Nordin and Banzhaf’s notion of effective fitness and Goldberg’s notion of operator-adjusted fitness are essentially the same idea, although specialised for different representations and operators. An interesting way of interpreting the operator-adjusted fitness is that a GA could be thought of as being attracted towards the optima of f_adj rather than those of the fitness function f. If, for a particular problem, these optima do not coincide, then the problem can be said to be deceptive (Goldberg 1989). Again, the definitions of adjusted fitness and deception are only approximations since they are based on Holland’s schema theorem. So, f_adj is only a lower bound for the true effective/adjusted fitness of a schema in a binary GA. Stephens and Waelbroeck (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999) independently rediscovered the notion of effective fitness. The effective fitness of a schema is implicitly defined through the equation

  E[ m(H, t + 1) / M ] = [ m(H, t) / M ] · f_e(H, t) / f̄(t),

assuming that fitness proportionate selection is used. This has basically the same form as Equation 5. Indeed the two equations represent nearly the same idea, although in different domains. The equation is also very similar to Equation 6. Since E[m(H, t + 1)] / M = α(H, t) and m(H, t) / M = p(H, t) f̄(t) / f(H, t), one obtains

  f_e(H, t) = [ α(H, t) / p(H, t) ] f(H, t),        (7)

where α(H, t) should be substituted with the expression obtained in (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999). There are clear similarities between f_j^e, f_adj(H, t) and f_e(H, t), but there are also important differences: f_j^e and f_adj(H, t) are approximations (of unknown accuracy, being in fact lower bounds) of the true effective fitness of an individual in a standard GP system and of a schema in a GA, respectively, while f_e(H, t) is the true effective fitness for a schema in a binary GA. In addition, f_e(H, t) of a schema can be bigger than f(H, t) if the building blocks for H are abundant and relatively fit. On the contrary the estimates/bounds given by f_j^e and f_adj(H, t) ignore the constructive effects of crossover and therefore are always smaller than f_j and f(H, t), respectively. Thanks to the exact schema result reported in the previous section, in (Poli 2000b) we extended to GP with one-point crossover the exact notion of effective fitness provided in (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999), obtaining
  f_e(H, t) = f(H, t) [ 1 − p_xo ( 1 − Σ_k Σ_l Σ_{i ∈ C(G_k, G_l)} p(U(H, i) ∩ G_k, t) p(L(H, i) ∩ G_l, t) / ( NC(G_k, G_l) p(H, t) ) ) ]
(see (Poli 2000b, Poli 2000c) for more details). This result gives the true effective fitness for a GP schema under one-point crossover: it is not an approximation or a lower bound.
3 Node Reference Systems

Given a syntax tree like, for example, the one in Figure 2, which represents the S-expression (A (B C D) (E F (G H))), there can be different methods to indicate unambiguously the position of one particular node in the tree. One method is to use the path from the root node (D’haeseleer 1994). The path can be specified by indicating, for every node encountered starting from the root of the tree, which branch to select to find the target node. This corresponds to using a variable-dimensionality relative coordinate system. For example, as illustrated in Figure 3, it would be possible to indicate node H in the syntax tree in Figure 2 using the list of coordinates [2 2 1], which means “select the second argument of node A, then the second of node E, then the first of node G”. This method, by definition, allows one to determine exactly the path followed to reach a particular node in the reference system. However, it has the disadvantage of not corresponding to our typical notion of a Cartesian reference system, because the number of coordinates necessary to locate a node grows with the depth of the node in the tree.
Figure 2: A sample syntax tree.
Figure 3: Variable-dimensionality relative node reference system.
Figure 4: A tree-dependent Cartesian node reference system.
Figure 5: Tree-independent Cartesian node reference system. Nodes and links of the maximal tree are drawn with dashed lines. Only four layers are shown.

A better alternative from this point of view is to use a coordinate system in which the nodes in the tree are organised into layers of increasing depth (this is done very frequently when trees are drawn). The nodes are then aligned to the left and an index is assigned to each node in a layer. The layer number d and the index i can then be used to define a Cartesian coordinate system. This is illustrated in Figure 4. So, for example, the position of node G in Figure 2 could be represented with the coordinates (2,3) to mean that it belongs to layer 2 (node A being in layer 0) and that it is the fourth node in the layer (assuming that the index starts from 0). This reference system presents the problem that it is not possible to infer the structure of a tree from the coordinates of its nodes. For example, if one knows that node H is at coordinates (3,0), it is impossible to determine at which coordinates its parent node is (node G at coordinates (2,3)). A coordinate system similar to this one but without this problem can be defined by assuming that the trees to represent have nodes with arities not bigger than a predefined maximum value a_max. Then one could define a node-reference system like the previous one for the largest possible tree that can be created with nodes of arity a_max. This maximal tree would include 1 node of arity a_max at depth 0, a_max nodes of arity a_max at depth 1, a_max^2 nodes of arity a_max at depth 2, etc. Finally one could use the maximal system also to locate the nodes of non-maximal trees. This is possible because a non-maximal tree can always be described using a subset of the nodes and links in the maximal tree. This is illustrated in Figure 5 assuming a_max = 3. So, for example, node G in Figure 2 would have coordinates (2,4) while node H would have coordinates (3,12).
In this reference system it is always possible to find the route to the root node from any given valid pair of coordinates. This reference system is entirely equivalent to the one shown in Figure 3 but corresponds more closely to the standard notion of a Cartesian reference system. If one chooses a_max to be the maximum arity of the functions in the function set, it is possible to use this reference system to represent the structure of any program in any population. Because of these properties, we will use this reference system in the rest of the paper. As clarified in the following sections, it is sometimes necessary to be able to express the location of nodes in different trees at the same time. In this case we can extend the node-reference systems just introduced by concatenating the coordinates of the nodes in each reference system into a single multi-dimensional coordinate vector. For example, we can indicate two nodes, (d1, i1) in one tree and (d2, i2) in another tree, at the same time using the point (d1, i1, d2, i2) in a four-dimensional coordinate system. Finally, it should be noted that in the Cartesian reference system in Figure 5 it is possible to transform pairs of coordinates into integers by counting the nodes in the reference system in breadth-first order. So, node (d, i) would correspond to the integer Σ_{x=0}^{d−1} (a_max)^x + i. For example, if a_max = 3, (2,1) corresponds to 5. Clearly, it is also possible to map integers into node coordinates unambiguously. We will use this property to simplify the notation in some of the following sections.
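The breadth-first correspondence between coordinates and integers can be sketched as the following pair of hypothetical helper functions (an illustration, assuming a_max > 1):

```python
# Sketch: mapping between (depth, column) coordinates in the maximal tree
# of Figure 5 and breadth-first integer indices (assumes a_max > 1).

def coord_to_index(d, i, a_max=3):
    """Nodes above layer d number 1 + a_max + ... + a_max^(d-1)."""
    return (a_max**d - 1) // (a_max - 1) + i

def index_to_coord(n, a_max=3):
    """Invert the mapping by peeling off complete layers."""
    d = 0
    while n >= a_max**d:
        n -= a_max**d
        d += 1
    return (d, n)

coord_to_index(2, 1)   # -> 5, as in the text
index_to_coord(5)      # -> (2, 1)
```

The two functions are inverses of each other over every valid coordinate pair.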
Figure 6: Syntax tree for the program (IF (AND x1 x2) (OR x1 x3) x1) represented in a node reference system for nodes with maximum arity 3.
4 Functions over Node Reference Systems

Given a node reference system it is possible to define functions over it. An example of such a function is one which represents a particular computer program: given a pair of coordinates, this function returns the primitive stored at those coordinates in a given syntax tree. For a program h, one could define the function h(d, i) which returns the node in h stored at position (d, i) if (d, i) is a valid pair of coordinates. If (d, i) is not in h then a conventional default value, ∅, is returned to represent the absence of a node. For example, for the program h = (IF (AND x1 x2) (OR x1 x3) x1) represented in Figure 6, this function would return the following values: h(0,0) = IF, h(1,0) = AND, h(1,1) = OR, h(1,2) = x1, h(2,0) = x1, h(2,1) = x2, h(2,2) = ∅, h(2,3) = x1, h(2,4) = x3, h(2,5) = ∅, h(2,6) = ∅, etc. Obviously, this function could be represented as the following table:
          h(d, i)
  d \ i |  0    1    2    3    4   ...
  ------+-----------------------------
    0   |  IF   ∅    ∅    ∅    ∅
    1   |  AND  OR   x1   ∅    ∅
    2   |  x1   x2   ∅    x1   x3
    3   |  ∅    ∅    ∅    ∅    ∅
   ...  |
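In code, such a sparse table is naturally a dictionary; here is a minimal sketch of this representation (rendering the paper's ∅ as Python's None is an implementation choice, not part of the paper):

```python
# Sketch: the name function for the program in Figure 6, stored sparsely
# as a dict; missing coordinates map to None, which plays the role of the
# paper's absent-node value.
h = {
    (0, 0): "IF",
    (1, 0): "AND", (1, 1): "OR", (1, 2): "x1",
    (2, 0): "x1",  (2, 1): "x2", (2, 3): "x1", (2, 4): "x3",
}

def N(d, i, h):
    """Name function: primitive at (d, i), or None for absent nodes."""
    return h.get((d, i))

N(1, 1, h)   # -> 'OR'
N(2, 2, h)   # -> None
```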
So, programs can be seen as functions over the space N^2. Below we will refer to h(d, i) as the name function for h and we will denote it with N(d, i, h). Another useful function is the size function S(d, i, h) which represents the number of nodes present in the subtree rooted at coordinates (d, i) in tree h, with the convention that S(d, i, h) = 0 if (d, i) indicates a non-existent node. For example the tree in Figure 2 has the following size function:
9
          S(d, i, h)
  d \ i |  0   1   2   3   4   5   ...  12
  ------+----------------------------------
    0   |  8   0   0   0   0   0   ...   0
    1   |  3   4   0   0   0   0   ...   0
    2   |  1   1   0   1   2   0   ...   0
    3   |  0   0   0   0   0   0   ...   1
    4   |  0   0   0   0   0   0   ...   0
   ...  |
The size function can be defined using the name function:

  S(d, i, h) = Σ_{x=d}^{∞} Σ_{y = i (a_max)^{x−d}}^{(i+1) (a_max)^{x−d} − 1} δ( N(x, y, h) ≠ ∅ ),
where δ(a) is a function which returns 1 if a is true, 0 otherwise. Other examples of node functions include:
- The arity function A(d, i, h), which returns the arity of the node at coordinates (d, i) in program h. For example, for the tree in Figure 5, A(0,0,h) = 2, A(1,0,h) = 2, A(2,1,h) = 0 and A(2,4,h) = 1. Clearly this function could be defined as the composition of the name function N(d, i, h) mentioned above with a function a(x) which returns the arity of primitive x: A(d, i, h) = a(N(d, i, h)), where a(∅) could be defined to be ∅ or −1 by convention.
- The type function T(d, i, h), which returns the data type of the node at coordinates (d, i) in h. This could be defined as T(d, i, h) = t(N(d, i, h)), where the function t(x) returns the type of primitive x (with t(∅) = ∅).
- The function-node function F(d, i, h), which returns 1 if the node at coordinates (d, i) is in h and it is a function, 0 otherwise. The number of internal nodes in h is therefore given by Σ_d Σ_i F(d, i, h).
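The size function's double summation can be transcribed almost literally; here is a sketch assuming a_max = 3 and the sparse dictionary representation used above (truncating the outer sum at the deepest occupied layer is an implementation shortcut, since all deeper terms are zero):

```python
# Sketch (a_max = 3): the size function S(d, i, h) computed from a sparse
# (depth, column) -> primitive dict, following the double summation above.
h = {(0, 0): "IF",
     (1, 0): "AND", (1, 1): "OR", (1, 2): "x1",
     (2, 0): "x1", (2, 1): "x2", (2, 3): "x1", (2, 4): "x3"}

def S(d, i, h, a_max=3):
    if (d, i) not in h:
        return 0                              # non-existent node
    total, max_d = 0, max(c[0] for c in h)    # deeper layers contribute 0
    for x in range(d, max_d + 1):
        lo = i * a_max ** (x - d)
        hi = (i + 1) * a_max ** (x - d)       # exclusive upper bound
        total += sum(1 for y in range(lo, hi) if (x, y) in h)
    return total

S(0, 0, h)   # -> 8 (the whole program)
S(1, 1, h)   # -> 3 (the subtree (OR x1 x3))
```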
Similarly, it is possible to define functions on multi-dimensional node-coordinate systems. An interesting function of this kind is the crossover-offspring size function S_o(d1, i1, d2, i2, h1, h2) which returns the size of the offspring produced by replacing the subtree rooted at coordinates (d1, i1) in parent h1 with the subtree rooted at coordinates (d2, i2) in parent h2. So, using the size function defined above, we can write:

  S_o(d1, i1, d2, i2, h1, h2) = S(0, 0, h1) − S(d1, i1, h1) + S(d2, i2, h2).
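The identity above can be sketched directly; trees here are sparse (depth, column) → primitive dicts in the tree-independent reference system, with a_max = 3 as an assumption matching Figure 5:

```python
# Sketch: the crossover-offspring size function S_o expressed via the
# size function S, as in the text.

def size(d, i, h, a_max=3):
    """S(d, i, h): nodes of h inside the subtree rooted at (d, i)."""
    return sum(1 for (x, y) in h
               if x >= d and i * a_max**(x - d) <= y < (i + 1) * a_max**(x - d))

def offspring_size(d1, i1, d2, i2, h1, h2):
    """S_o: h1's size, minus the excised subtree, plus the inserted one."""
    return size(0, 0, h1) - size(d1, i1, h1) + size(d2, i2, h2)

# Replacing (OR x1 x3) in h1 with the whole of h2 gives 8 - 3 + 3 = 8 nodes.
h1 = {(0, 0): "IF", (1, 0): "AND", (1, 1): "OR", (1, 2): "x1",
      (2, 0): "x1", (2, 1): "x2", (2, 3): "x1", (2, 4): "x3"}
h2 = {(0, 0): "+", (1, 0): "x", (1, 1): "y"}
offspring_size(1, 1, 0, 0, h1, h2)   # -> 8
```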
Other useful functions of this kind include:

- The position match function:
    PM(d1, i1, d2, i2, h1, h2) = 1 if (d1, i1) = (d2, i2), N(d1, i1, h1) ≠ ∅ and N(d2, i2, h2) ≠ ∅; 0 otherwise.
- The arity match function:
    AM(d1, i1, d2, i2, h1, h2) = 1 if A(d1, i1, h1) = A(d2, i2, h2); 0 otherwise.
- The type match function:
    TM(d1, i1, d2, i2, h1, h2) = 1 if T(d1, i1, h1) = T(d2, i2, h2); 0 otherwise.
- The name match function:
    NM(d1, i1, d2, i2, h1, h2) = 1 if N(d1, i1, h1) = N(d2, i2, h2); 0 otherwise.
- The arity-match set membership function or common region membership function:
    AMSM(d1, i1, d2, i2, h1, h2) = 1 if (d1, i1) = (d2, i2), A(d1, i1, h1) = A(d2, i2, h2) and either (d1, i1) = (0, 0) or AMSM(d1 − 1, ⌊i1/a_max⌋, d2 − 1, ⌊i2/a_max⌋, h1, h2) = 1; 0 otherwise,
  where ⌊·⌋ is the integer-part function.
- The name-match set membership function:
    NMSM(d1, i1, d2, i2, h1, h2) = 1 if (d1, i1) = (d2, i2), N(d1, i1, h1) = N(d2, i2, h2) and either (d1, i1) = (0, 0) or NMSM(d1 − 1, ⌊i1/a_max⌋, d2 − 1, ⌊i2/a_max⌋, h1, h2) = 1; 0 otherwise.
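The common-region recursion can be transcribed as follows. This is a sketch: the ARITY table is a hypothetical helper (not in the paper) giving each primitive's arity, and trees are sparse (depth, column) → primitive dicts in the a_max = 3 reference system.

```python
# Sketch: the arity-match set (common region) membership function AMSM,
# transcribing the recursion above. ARITY is a hypothetical arity table.
ARITY = {"IF": 3, "AND": 2, "OR": 2, "+": 2,
         "x1": 0, "x2": 0, "x3": 0, "x": 0, "y": 0}

def A(d, i, h):
    """Arity function; None stands for the absent-node value."""
    node = h.get((d, i))
    return None if node is None else ARITY[node]

def AMSM(d1, i1, d2, i2, h1, h2, a_max=3):
    if (d1, i1) != (d2, i2) or A(d1, i1, h1) is None \
            or A(d1, i1, h1) != A(d2, i2, h2):
        return 0
    if (d1, i1) == (0, 0):
        return 1
    # A node belongs to the common region only if its parent does too.
    return AMSM(d1 - 1, i1 // a_max, d2 - 1, i2 // a_max, h1, h2, a_max)

h1 = {(0, 0): "IF", (1, 0): "AND", (1, 1): "OR", (1, 2): "x1",
      (2, 0): "x1", (2, 1): "x2", (2, 3): "x1", (2, 4): "x3"}
h2 = {(0, 0): "+", (1, 0): "x", (1, 1): "y"}

AMSM(2, 3, 2, 3, h1, h1)   # -> 1: every node of a tree is in its common region with itself
AMSM(0, 0, 0, 0, h1, h2)   # -> 0: root arities differ (3 vs 2)
```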
5 Probability Distributions over Node Reference Systems

Most genetic operators used in GP require the selection of a node where a transformation is to be performed (e.g. the insertion of a random subtree, or of a subtree taken from another parent), which leads to the creation of an offspring. In most cases the selection of the node is performed with a stochastic process of some sort. It is possible to model this process by assuming that a probability distribution is defined over the nodes of each individual. If we use the node-reference system introduced in the previous section, this can be expressed as the function:
p(d; ijh) = PrfA node at depth d and column i is selected in program hg;
(8)
where we assume that p(d, i|h) is zero for all the coordinates (d, i) which represent nonexistent nodes in h.⁹ For example, if we consider the tree in Figure 5 and we select nodes with uniform probability, then p(d, i|h) = 1/8 if (d, i) is a node, p(d, i|h) = 0 otherwise. If instead we select functions with a probability 0.9 and any node with a probability 0.1, like in standard GP crossover (Koza 1992), then p(0,0|h) = p(1,0|h) = p(1,1|h) = p(2,4|h) = 0.2375, p(2,0|h) = p(2,1|h) = p(2,3|h) = p(3,12|h) = 0.0125, and p(d, i|h) = 0 for all other coordinate pairs, i.e. p(d, i|h) is given in the following table:

    p(d,i|h)    i=0      i=1      i=2      i=3      i=4      i=12
    d=0       0.2375   0.0000   0.0000   0.0000   0.0000   0.0000
    d=1       0.2375   0.2375   0.0000   0.0000   0.0000   0.0000
    d=2       0.0125   0.0125   0.0000   0.0125   0.2375   0.0000
    d=3       0.0000   0.0000   0.0000   0.0000   0.0000   0.0125
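The table above can be reproduced programmatically. The tree used below is a hypothetical reconstruction consistent with the table (the paper's Figure 5 is not reproduced here), and the encoding and helper names are ours; the layout uses $a_{max} = 3$.

```python
AMAX = 3  # maximum arity of the node reference system

def nodes(tree, d=0, i=0, out=None):
    """Map coordinates (d, i) -> arity, for every node of the tree."""
    if out is None:
        out = {}
    args = tree[1:] if isinstance(tree, tuple) else ()
    out[(d, i)] = len(args)
    for k, arg in enumerate(args):
        nodes(arg, d + 1, AMAX * i + k, out)
    return out

def p_90_10(coords):
    """p(d,i|h) for the 90%-function / 10%-any-node selection policy."""
    n_funcs = sum(1 for a in coords.values() if a > 0)
    n_all = len(coords)
    return {c: 0.9 * (a > 0) / n_funcs + 0.1 / n_all
            for c, a in coords.items()}

# 8 nodes: 4 functions (f, g, h, u) and 4 terminals (a, b, c, d)
h = ('f', ('g', 'a', 'b'), ('h', 'c', ('u', 'd')))
p = p_90_10(nodes(h))
# functions get 0.9/4 + 0.1/8 = 0.2375, terminals get 0.1/8 = 0.0125
```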
There are many possible uses for p(d, i|h). For example, it is possible to compute the probability p(d|h) that a node at depth d is selected in h:

$$p(d|h) = \sum_i p(d,i|h).$$
For example, for the tree in Figure 2 and for the function p(d, i|h) given in the previous table, the function p(d|h) would be:

    d     p(d|h)
    0     0.2375
    1     0.4750
    2     0.2750
    3     0.0125
    4     0.0000

Footnote 9: For this probability distribution we use the notation p(d, i|h) rather than p(d, i, h) to emphasise the fact that p(d, i|h) can be seen as the conditional probability of selecting node (d, i) if (or given that) the program being considered is h. In the rest of the paper, we will do the same for other probability distributions.
Once p(d|h) is defined it is possible to compute the expected depth of the nodes being selected:

$$E[D|h] = \sum_d p(d|h)\, d,$$

where D is a stochastic variable representing the depth of the node selected. For example, if we consider the function p(d|h) in the previous table we obtain:
$$E[D|h] = 0.2375 \cdot 0 + 0.4750 \cdot 1 + 0.2750 \cdot 2 + 0.0125 \cdot 3 + 0.0000 \cdot 4 + \cdots = 1.0625.$$

Similarly it is possible to compute the probability p(i|h) that a node is in column i, and the expected value of I, a stochastic variable representing the column of the node being selected, even if this quantity is perhaps less interesting. The probability distribution p(d, i|h) can also be used to compute the expected value of real-valued node functions, like for example the expected value of the size function for a particular tree h:

$$E[S(d,i,h)] = \sum_d \sum_i S(d,i,h)\, p(d,i|h),$$

or other descriptors like the variance of the size function, i.e. $E[(S(d,i,h) - E[S(d,i,h)])^2]$. However, for the purpose of developing a general schema theory for GP, an even more important use of probability distributions over node reference systems is their use in modelling crossover operators, as discussed in the following section.
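The expected-depth computation above is a one-liner over the tabulated marginal; the dictionary below simply restates the p(d|h) values from the example table.

```python
# marginal depth distribution p(d|h) from the example table above
p_depth = {0: 0.2375, 1: 0.4750, 2: 0.2750, 3: 0.0125, 4: 0.0000}

# E[D|h] = sum over d of d * p(d|h)
expected_depth = sum(d * p for d, p in p_depth.items())  # ≈ 1.0625
```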
6 Modelling Subtree-swapping Crossovers

It is quite easy to extend the previous ideas and model subtree-swapping crossover operators as conditional probability distribution functions over the space N⁴. We can do that by considering the probability distribution

$$p(d_1,i_1,d_2,i_2|h_1,h_2) = \Pr\left\{\begin{array}{l}\text{a node at depth } d_1 \text{ and column } i_1 \text{ is selected in parent } h_1 \text{ and}\\ \text{a node at depth } d_2 \text{ and column } i_2 \text{ is selected in parent } h_2\end{array}\right\},$$
with the convention p(d₁, i₁, d₂, i₂|h₁, h₂) = 0 if (d₁, i₁) is a nonexistent node in h₁ or (d₂, i₂) is a nonexistent node in h₂. Like for the single-node case, it is possible to compute the marginal probabilities:

$$p(d_1,i_1|h_1,h_2) = \sum_{d_2}\sum_{i_2} p(d_1,i_1,d_2,i_2|h_1,h_2)$$

and

$$p(d_2,i_2|h_1,h_2) = \sum_{d_1}\sum_{i_1} p(d_1,i_1,d_2,i_2|h_1,h_2).$$

The first one of these gives the probability that the crossover point in the first parent will be (d₁, i₁) when the parents are h₁ and h₂. The second one does the same for the crossover point in the second parent.
These probability distributions can be used to compute important statistical descriptors like the expected value (or the variance) of the crossover-offspring size function for parents h₁ and h₂:

$$\begin{aligned} E[S_o(d_1,i_1,d_2,i_2,h_1,h_2)] &= \sum_{d_1}\sum_{i_1}\sum_{d_2}\sum_{i_2} S_o(d_1,i_1,d_2,i_2,h_1,h_2)\, p(d_1,i_1,d_2,i_2|h_1,h_2) \quad (9)\\ &= S(0,0,h_1) - \sum_{d_1}\sum_{i_1}\sum_{d_2}\sum_{i_2} S(d_1,i_1,h_1)\, p(d_1,i_1,d_2,i_2|h_1,h_2)\\ &\qquad\qquad\; + \sum_{d_1}\sum_{i_1}\sum_{d_2}\sum_{i_2} S(d_2,i_2,h_2)\, p(d_1,i_1,d_2,i_2|h_1,h_2)\\ &= S(0,0,h_1) - \sum_{d_1}\sum_{i_1} S(d_1,i_1,h_1)\, p(d_1,i_1|h_1,h_2) + \sum_{d_2}\sum_{i_2} S(d_2,i_2,h_2)\, p(d_2,i_2|h_1,h_2). \quad (10) \end{aligned}$$

This result indicates that E[S_o(d₁,i₁,d₂,i₂,h₁,h₂)] is given by the size of the first parent, minus the mean size of the subtree removed from h₁, plus the mean size of the subtree removed from h₂, where the means are calculated using the probability distributions p(d₁,i₁|h₁,h₂) and p(d₂,i₂|h₁,h₂) (not p(d₁,i₁|h₁) and p(d₂,i₂|h₂)). If the selection of the crossover points is performed independently in the two parents, then
$$p(d_1,i_1,d_2,i_2|h_1,h_2) = p(d_1,i_1|h_1)\; p(d_2,i_2|h_2),$$

where p(d, i|h) is defined in Equation 8. We will call separable the crossover operators for which this relation is true. Standard crossover is a separable operator. Indeed, assuming uniform selection of the crossover points,

$$p_{\mathrm{StdUnif}}(d_1,i_1,d_2,i_2|h_1,h_2) = p_{\mathrm{StdUnif}}(d_1,i_1|h_1)\; p_{\mathrm{StdUnif}}(d_2,i_2|h_2)$$

with

$$p_{\mathrm{StdUnif}}(d,i|h) = \frac{\delta(N(d,i,h)\neq\emptyset)}{N(h)},$$

where N(h) = S(0,0,h) is the number of nodes in h and N(d,i,h) is the name function defined in Section 4. So,

$$p_{\mathrm{StdUnif}}(d_1,i_1,d_2,i_2|h_1,h_2) = \frac{\delta(N(d_1,i_1,h_1)\neq\emptyset)\,\delta(N(d_2,i_2,h_2)\neq\emptyset)}{N(h_1)\,N(h_2)}. \quad (11)$$
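Equation 11 just says that for standard crossover with uniform point selection the joint distribution is uniform over all pairs of nodes, and that each marginal is uniform over one parent's nodes. A minimal sketch (representing nodes simply by integer indices; the function name is ours):

```python
def p_std_unif(n1, n2):
    """Joint crossover-point distribution of Equation 11 for parents with
    n1 and n2 nodes: every pair (i, j) has probability 1/(n1*n2)."""
    return {(i, j): 1.0 / (n1 * n2) for i in range(n1) for j in range(n2)}

joint = p_std_unif(5, 3)
# separability: marginalising over j recovers the uniform 1/n1 per node
marginal_i0 = sum(p for (i, j), p in joint.items() if i == 0)
```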
For standard crossover with a 90%-function/10%-any-node selection policy, it is easy to show that

$$p_{\mathrm{Std90/10}}(d,i|h) = 0.9\, \frac{F(d,i,h)}{\sum_d \sum_i F(d,i,h)} + 0.1\, \frac{\delta(N(d,i,h)\neq\emptyset)}{N(h)},$$

where F(d, i, h) is the function-node function defined in Section 4. Thus,

$$p_{\mathrm{Std90/10}}(d_1,i_1,d_2,i_2|h_1,h_2) = \left[0.9\, \frac{F(d_1,i_1,h_1)}{\sum_d \sum_i F(d,i,h_1)} + 0.1\, \frac{\delta(N(d_1,i_1,h_1)\neq\emptyset)}{N(h_1)}\right] \times \left[0.9\, \frac{F(d_2,i_2,h_2)}{\sum_d \sum_i F(d,i,h_2)} + 0.1\, \frac{\delta(N(d_2,i_2,h_2)\neq\emptyset)}{N(h_2)}\right]. \quad (12)$$

Note that for separable operators, the marginal distributions are p(d₁,i₁|h₁,h₂) = p(d₁,i₁|h₁) and p(d₂,i₂|h₁,h₂) = p(d₂,i₂|h₂), whereby Equation 9 becomes

$$E[S_o(d_1,i_1,d_2,i_2,h_1,h_2)] = S(0,0,h_1) - E[S(d_1,i_1,h_1)] + E[S(d_2,i_2,h_2)].$$
In some crossover operators the selection of the crossover points in the two parents is not performed independently. For example, in one-point crossover the first and second crossover points must have the same coordinates. In this case, if we assume a uniform probability of node selection, then

$$p_{\mathrm{1pt}}(d_1,i_1,d_2,i_2|h_1,h_2) = \begin{cases} 1/NC(h_1,h_2) & \text{if } (d_1,i_1)=(d_2,i_2) \text{ and } (d_1,i_1)\in C(h_1,h_2),\\ 0 & \text{otherwise,}\end{cases} \quad (13)$$

where NC(h₁, h₂) is the number of nodes in the common region C(h₁, h₂) between program h₁ and program h₂ (see Section 2.1). This can also be expressed using the arity-match set membership function AMSM(d₁,i₁,d₂,i₂,h₁,h₂) defined in Section 4:

$$\begin{aligned} p_{\mathrm{1pt}}(d_1,i_1,d_2,i_2|h_1,h_2) &= \frac{\mathrm{AMSM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1}\sum_{d_2}\sum_{i_2} \mathrm{AMSM}(d_1,i_1,d_2,i_2,h_1,h_2)}\\ &= \frac{\mathrm{AMSM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)}\\ &= \frac{\mathrm{AMSM}(d_1,i_1,d_2,i_2,h_1,h_2)}{NC(h_1,h_2)}, \end{aligned}$$

where we used the fact that AMSM(d₁,i₁,d₂,i₂,h₁,h₂) ≠ 0 only if (d₁,i₁) = (d₂,i₂). Incidentally, given this definition it is easy to show that

$$p_1(d_1,i_1|h_1,h_2) = \frac{\mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)}$$

and

$$p_2(d_2,i_2|h_1,h_2) = \frac{\mathrm{AMSM}(d_2,i_2,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)},$$

whereby for one-point crossover

$$\begin{aligned} E[S_o(d_1,i_1,d_2,i_2,h_1,h_2)] &= S(0,0,h_1) - \frac{\sum_{d_1}\sum_{i_1} S(d_1,i_1,h_1)\,\mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)}\\ &\qquad + \frac{\sum_{d_2}\sum_{i_2} S(d_2,i_2,h_2)\,\mathrm{AMSM}(d_2,i_2,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{AMSM}(d_1,i_1,d_1,i_1,h_1,h_2)}. \end{aligned}$$
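The one-point-crossover expectation above can be sanity-checked in the special case of two parents with identical shape, where the common region is the whole tree and the means are simple averages of subtree sizes. The encoding and helper names below are ours, and the identical-shape assumption is ours too (it sidesteps computing AMSM explicitly).

```python
def subtree_sizes(tree):
    """Return (total size, list of subtree sizes, one per node)."""
    if not isinstance(tree, tuple):
        return 1, [1]
    total, sizes = 1, []
    for arg in tree[1:]:
        n, sub = subtree_sizes(arg)
        total += n
        sizes += sub
    return total, [total] + sizes

def expected_offspring_size_1pt(h1, h2):
    """E[S_o] under one-point crossover, assuming h1 and h2 have the same
    shape (common region = all nodes): size of h1, minus the mean subtree
    size removed from h1, plus the mean subtree size taken from h2."""
    s1, sizes1 = subtree_sizes(h1)
    _, sizes2 = subtree_sizes(h2)
    nc = len(sizes1)  # number of nodes in the common region
    return s1 - sum(sizes1) / nc + sum(sizes2) / nc

h1 = ('+', 'x', ('*', 'y', 'z'))
h2 = ('-', 'a', ('/', 'b', 'c'))
# identical shapes: one-point crossover preserves the shape, so the
# expected offspring size equals the parents' size (5)
```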
Similarly, it is possible to model strict one-point crossover (Poli and Langdon 1997a), strongly-typed GP crossover (Montana 1995),¹⁰ and context-preserving crossover (D'haeseleer 1994) using the functions defined in Section 4, obtaining:

$$p_{\mathrm{strict\text{-}1pt}}(d_1,i_1,d_2,i_2|h_1,h_2) = \frac{\mathrm{NMSM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1}\sum_{d_2}\sum_{i_2} \mathrm{NMSM}(d_1,i_1,d_2,i_2,h_1,h_2)} = \frac{\mathrm{NMSM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{NMSM}(d_1,i_1,d_1,i_1,h_1,h_2)},$$

$$p_{\mathrm{stgp}}(d_1,i_1,d_2,i_2|h_1,h_2) = \frac{1}{N(h_1)} \cdot \frac{\mathrm{TM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_2}\sum_{i_2} \mathrm{TM}(d_1,i_1,d_2,i_2,h_1,h_2)},$$

$$p_{\mathrm{context}}(d_1,i_1,d_2,i_2|h_1,h_2) = \frac{\mathrm{PM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1}\sum_{d_2}\sum_{i_2} \mathrm{PM}(d_1,i_1,d_2,i_2,h_1,h_2)} = \frac{\mathrm{PM}(d_1,i_1,d_2,i_2,h_1,h_2)}{\sum_{d_1}\sum_{i_1} \mathrm{PM}(d_1,i_1,d_1,i_1,h_1,h_2)}.$$

Footnote 10: In STGP crossover, the crossover point in the first parent is selected randomly. The crossover point in the second parent is selected only among the nodes of the same type as the first crossover point. If no nodes of that type exist in the second parent then two alternatives are available: one can return one of the parents, or nothing is returned. In this second case crossover has to be attempted again. In this paper we model a version of this second type of crossover in which a new attempt is made using the same two parents. Since the root node is of the same type in all individuals, a pair of valid crossover points always exists.
The model for size-fair crossover (Langdon 1999, Langdon 2000) is more complicated:

$$p_{\mathrm{size\text{-}fair}}(d_1,i_1,d_2,i_2|h_1,h_2) = p_{\mathrm{Std90/10}}(d_1,i_1|h_1)\; p_{\mathrm{size\text{-}fair}}(d_2,i_2|h_2,\, S(d_1,i_1,h_1)), \quad (14)$$

where

$$p_{\mathrm{size\text{-}fair}}(d,i|h,s) = \begin{cases} f(h,s) & \text{if } 1 \le S(d,i,h) \le 2s+1,\\ 0 & \text{otherwise.}\end{cases}$$

The function f(h, s) is a relatively complicated function of the following quantities: the number of nodes (n₊) for which S(d,i,h) > s in the tree h being considered, the number of nodes (n₋) for which S(d,i,h) < s, those (n₀) for which S(d,i,h) = s, and the mean differences

$$\mathrm{mean}^+ = \sum_d \sum_i (S(d,i,h) - s)\,\delta(S(d,i,h) > s),$$

$$\mathrm{mean}^- = \sum_d \sum_i (s - S(d,i,h))\,\delta(S(d,i,h) < s)$$

(see (Langdon 1999, Langdon 2000) for more details).¹¹ Other subtree-swapping crossover operators could be modelled using probability distributions over node reference systems. Thanks to these probabilistic models of crossover, it is possible to develop a general schema theory for GP as described in the following sections.
Footnote 11: To be precise, the expression of p_size-fair(d₁,i₁,d₂,i₂|h₁,h₂) is only an approximate description of the behaviour of size-fair crossover, because it does not model the atypical events in which n₊ = n₀ = 0 or n₋ = n₀ = 0. In these cases the selection of the crossover point in the first parent is repeated, which means that it is chosen according to a probability distribution slightly different from p_Std90/10(d₁,i₁|h₁). To be precise this makes the choice of the crossover points in the two parents dependent, like for one-point crossover. It is possible to derive an exact model of size-fair crossover in terms of conditional probability distributions, but this is beyond the scope of this paper.

7 Microscopic Exact GP Schema Theorem for Subtree-swapping Crossovers

For simplicity, in this and the following sections we will use a single index to identify nodes unless otherwise stated. We can do this because, as indicated in Section 3, there is a one-to-one mapping between pairs of coordinates and natural numbers. In order to obtain a schema theory valid for subtree-swapping crossovers, we need to extend the notion of hyperschema summarised in Section 2.1. We will call this new form of hyperschema a Variable Arity Hyperschema, or VA hyperschema for brevity.

Definition 1. A Variable Arity hyperschema is a rooted tree composed of internal nodes from the set F ∪ {=, #} and leaves from T ∪ {=, #}, where F and T are the function and terminal sets. The operator = is a "don't care" symbol which stands for exactly one node, the terminal # stands for any valid subtree, while the function # stands for exactly one function of arity not smaller than the number of subtrees connected to it.

For example, the VA hyperschema (# x (+ = #)) represents all the programs with the following characteristics: a) the root node is any function in the function set with arity 2 or higher, b) the first argument of the root node is the variable x, c) the second argument of the root node is +, d) the first argument of the + is any terminal, e) the second argument of the + is any valid subtree. If the root node is matched by a function of arity greater than 2, the third, fourth, etc. arguments of such a function are left unspecified, i.e. they can be any valid subtree.

VA hyperschemata generalise most previous definitions of schemata in GP. For example, they generalise hyperschemata (Poli 2000a, Poli 2000b) (which are VA hyperschemata without internal # symbols). These in turn generalise the notion of fixed-size-and-shape schemata (Poli and Langdon 1997b, Poli and
Langdon 1998) (which are hyperschemata without # symbols) and Rosca's schemata (Rosca 1997) (which are hyperschemata without = symbols).¹² Thanks to VA hyperschemata and to the notion of probability distributions over node reference systems, it is possible to obtain the following general result:

Theorem 2 (Microscopic Exact GP Schema Theorem). The total transmission probability for a fixed-size-and-shape GP schema H under a subtree-swapping crossover operator and no mutation is

$$\alpha(H,t) = (1-p_{xo})\,p(H,t) + p_{xo} \sum_{h_1}\sum_{h_2} p(h_1,t)\,p(h_2,t) \sum_{i\in H}\sum_j p(i,j|h_1,h_2)\,\delta(h_1 \in U(H,i))\,\delta(h_2 \in L(H,i,j)), \quad (15)$$
where: pxo is the crossover probability; p(H, t) is the selection probability of the schema H; the first two summations are over all the individuals in the population; p(h₁, t) and p(h₂, t) are the selection probabilities of parents h₁ and h₂, respectively; the third summation is over all the crossover points (nodes) in the schema H; the fourth summation is over all the crossover points in the node reference system; p(i, j|h₁, h₂) is the probability of selecting crossover point i in parent h₁ and crossover point j in parent h₂; δ(x) is a function which returns 1 if x is true, 0 otherwise; L(H, i, j) is the hyperschema obtained by rooting at coordinate j in an empty reference system the subschema of H below crossover point i, then labelling all the nodes on the path between node j and the root node with # function nodes, and labelling the arguments of those nodes which are to the left of such a path with # terminal nodes; U(H, i) is the hyperschema obtained by replacing the subtree below crossover point i with a # node.

The hyperschema L(H, i, j) represents the set of all programs whose subtree rooted at crossover point j matches the subtree of H rooted in node i. The idea behind its definition is that, if one crosses over at point j any individual matching L(H, i, j) and at point i any individual matching U(H, i), the resulting offspring is always an instance of H. Before we proceed with the proof of the theorem, let us try to understand with examples how L(H, i, j) is built (refer to Section 2 and Figure 1 to see how U(H, i) is built). In the examples, we will use 2-D coordinates to designate the nodes i and j.

Let us consider the schema H = (* = (+ x =)). As indicated in Figure 7, L(H, 1, 2) = L(H, (1,0), (1,1)) is obtained through the following steps: a) we root the subschema below crossover point (1,0), i.e. the symbol =, at coordinates (1,1) in an empty reference system, b) we label the node at coordinates (0,0) with a # function node (in this case this is the only node on the path between node (1,1) and the root), and c) we label node (1,0) with a # terminal (this is because the node is to the left of the path between (1,1) and the root, and it is an argument of one of the nodes replaced with #).

Another example is provided in Figure 8, where L(H, 2, 5) = L(H, (1,1), (2,2)) is obtained through the following steps. Firstly, we root the subschema below crossover point (1,1), i.e. the tree (+ x =), at coordinates (2,2) in an empty reference system. Note that this is not just a rigid translation: while the + is translated to position (2,2), its arguments need to be translated further, i.e. to positions (3,4) and (3,5), because of the nature of the reference system used. Then, we label the nodes at coordinates (0,0) and (1,1) with # functions (these two nodes are on the path between node (2,2) and the root). Finally, we label node (1,0) with a # terminal (this node is the only argument of one of the nodes replaced with # to be to the left of the path between (2,2) and the root node).¹³ Once the concept of L(H, i, j) is available, the theorem can easily be proven.

Footnote 12: Sets of VA hyperschemata without = symbols can also be used to represent the programs in O'Reilly's GP schemata (O'Reilly and Oppacher 1995). Syntactically O'Reilly's schemata are multisets of subtrees and tree fragments, which represent all programs including such subtrees and tree fragments. All the programs including a given subtree can be represented using a collection of VA hyperschemata without = symbols and without # terminals, while all the programs including a given tree fragment can be represented using a collection of VA hyperschemata without = symbols. So, the programs in one of O'Reilly's schemata can be represented using the intersection of appropriate sets of VA hyperschemata.

Footnote 13: One might think that L(H, i, j) is a generalisation of the hyperschema L(H, i) defined in Section 2. In fact, L(H, i) is very similar to L(H, i, i). However, there are differences between these two hyperschemata. In L(H, i) the nodes on the path to the root node in H are replaced with = symbols. Each of these stands for a function of a pre-fixed arity, which one can determine from the structure of H. In L(H, i, i) the nodes in the path to (0,0) are filled with # symbols. Each of these stands for a function of arity not smaller than a prefixed value which can be determined from the particular column occupied by the node (but obviously not bigger than amax).
[Figure 7: Phases in the construction of the VA hyperschema building block L(H, (d₁,i₁), (d₂,i₂)) of the schema H = (* = (+ x =)) for (d₁,i₁) = (1,0) and (d₂,i₂) = (1,1) within a node coordinate system with amax = 2.]
[Figure 8: Phases in the construction of the VA hyperschema building block L(H, (d₁,i₁), (d₂,i₂)) of the schema H = (* = (+ x =)) for (d₁,i₁) = (1,1) and (d₂,i₂) = (2,2) within a node coordinate system with amax = 2.]
Proof. Let us consider the function

$$g(h_1,h_2,i,j,H) = \delta(h_1 \in U(H,i))\,\delta(h_2 \in L(H,i,j)).$$

Given two parent programs, h₁ and h₂, and a schema of interest H, this function returns the value 1 if by crossing over h₁ at position i and h₂ at position j one obtains an offspring in H. It returns 0 otherwise. This function can be considered as a measurement function (see (Altenberg 1995)) that we want to apply to the probability distribution of parents and crossover points at time t, p(h₁, h₂, i, j, t). The quantity p(h₁, h₂, i, j, t) represents the probability that, at generation t, the selection/crossover process will choose parents h₁ and h₂ and crossover points i and j. If a population and a crossover operator are characterised by the distribution p(h₁, h₂, i, j, t), the expected value of g(h₁, h₂, i, j, H) is simply:

$$E[g(h_1,h_2,i,j,H)] = \sum_{h_1}\sum_{h_2}\sum_i\sum_j g(h_1,h_2,i,j,H)\,p(h_1,h_2,i,j,t). \quad (16)$$

We can write

$$p(h_1,h_2,i,j,t) = p(i,j|h_1,h_2)\,p(h_1,t)\,p(h_2,t), \quad (17)$$

where p(i, j|h₁, h₂) is the conditional probability that crossover points i and j will be selected when the parents are h₁ and h₂, while p(h₁, t) and p(h₂, t) are the selection probabilities for the parents. Substituting Equation 17 into Equation 16 and noting that if crossover point i is outside the schema H, then L(H, i, j) and U(H, i) are empty sets, leads to

$$E[g(h_1,h_2,i,j,H)] = \sum_{h_1}\sum_{h_2} p(h_1,t)\,p(h_2,t) \sum_{i\in H}\sum_j g(h_1,h_2,i,j,H)\,p(i,j|h_1,h_2). \quad (18)$$

Since g(h₁, h₂, i, j, H) is a binary stochastic variable, its expected value also represents the proportion of times the variable takes the value 1. This corresponds to the proportion of times the offspring produced by crossover is in H. So, the contribution to α(H, t) due to selection followed by crossover is E[g(h₁, h₂, i, j, H)]. By multiplying this by pxo (the probability of doing selection followed by crossover) and adding the term (1 − pxo) p(H, t) due to selection followed by cloning, one obtains the r.h.s. of Equation 15. □
8 Macroscopic Exact GP Schema Theorem

In order to transform Equation 15 into an exact macroscopic description of schema propagation we need to make one additional assumption on the nature of the subtree-swapping crossovers used: that their behaviour depends only on the shape of the parents, not on the actual labels of the nodes in the parents. More formally, we will assume that the choice of the crossover points in any two parents, h₁ and h₂, depends only on their shapes, G(h₁) and G(h₂), i.e. that p(i, j|h₁, h₂) = p(i, j|G(h₁), G(h₂)). We will term such operators node-invariant crossovers.

Theorem 3 (Macroscopic Exact GP Schema Theorem). The total transmission probability for a fixed-size-and-shape GP schema H under a node-invariant subtree-swapping crossover operator and no mutation is

$$\alpha(H,t) = (1-p_{xo})\,p(H,t) + p_{xo} \sum_{k,l}\sum_{i\in H}\sum_j p(i,j|G_k,G_l)\,p(U(H,i)\cap G_k,t)\,p(L(H,i,j)\cap G_l,t), \quad (19)$$

where the schemata G₁, G₂, … are all the possible fixed-size-and-shape schemata of order 0 (program shapes) and the other symbols have the same meaning as in Theorem 2.

Proof. The schemata G₁, G₂, … represent disjoint sets of programs. Their union represents the whole search space. So, $\sum_k \delta(h_1 \in G_k) = 1$. Likewise, $\sum_l \delta(h_2 \in G_l) = 1$. If we append the l.h.s. of these
equations at the end of the triple summation in Equation 15 and reorder the terms, we obtain:
$$\begin{aligned} &\sum_{k,l}\sum_{h_1,h_2} p(h_1,t)\,p(h_2,t) \sum_{i\in H}\sum_j p(i,j|h_1,h_2)\,\delta(h_1\in U(H,i))\,\delta(h_1\in G_k)\,\delta(h_2\in L(H,i,j))\,\delta(h_2\in G_l)\\ &\qquad = \sum_{k,l}\;\sum_{h_1\in G_k}\;\sum_{h_2\in G_l} p(h_1,t)\,p(h_2,t) \sum_{i\in H}\sum_j p(i,j|h_1,h_2)\,\delta(h_1\in U(H,i))\,\delta(h_2\in L(H,i,j)). \end{aligned}$$
For node-invariant crossover operators p(i, j|h₁, h₂) = p(i, j|G(h₁), G(h₂)), which substituted into the previous equation gives:

$$\begin{aligned} &\sum_{k,l}\;\sum_{h_1\in G_k}\;\sum_{h_2\in G_l} p(h_1,t)\,p(h_2,t) \sum_{i\in H}\sum_j p(i,j|G(h_1),G(h_2))\,\delta(h_1\in U(H,i))\,\delta(h_2\in L(H,i,j))\\ &\qquad = \sum_{k,l}\;\sum_{h_1\in G_k}\;\sum_{h_2\in G_l} p(h_1,t)\,p(h_2,t) \sum_{i\in H}\sum_j p(i,j|G_k,G_l)\,\delta(h_1\in U(H,i))\,\delta(h_2\in L(H,i,j))\\ &\qquad = \sum_{k,l}\sum_{i\in H}\sum_j p(i,j|G_k,G_l)\; \underbrace{\sum_{h_1\in G_k} p(h_1,t)\,\delta(h_1\in U(H,i))}_{p(U(H,i)\cap G_k,\,t)}\;\; \underbrace{\sum_{h_2\in G_l} p(h_2,t)\,\delta(h_2\in L(H,i,j))}_{p(L(H,i,j)\cap G_l,\,t)}. \quad\Box \end{aligned}$$

The sets U(H, i) ∩ Gₖ and L(H, i, j) ∩ Gₗ either are (or can be represented by) fixed-size-and-shape schemata or are the empty set ∅. So, the theorem expresses the total transmission probability of H only using the selection probabilities of a set of lower- (or same-) order schemata.
9 Applications and Specialisations

In this section we give examples that show how the theory can be specialised to obtain schema theorems for specific crossover operators, and we illustrate how it can be used to obtain other general theoretical results, such as an exact definition of effective fitness and a size-evolution equation for GP with subtree-swapping crossover.
9.1 Macroscopic Exact Schema Theorem for GP with Standard Crossover

Let us specialise Theorem 3 to standard crossover. For simplicity we will do this for the case in which the crossover points are selected uniformly at random. In order to use Theorem 3 we need to check whether standard crossover is node-invariant, i.e. whether it is true that p(i, j|h₁, h₂) = p(i, j|G(h₁), G(h₂)). The notion of name function introduced in Section 4 is applicable to any tree, including schemata. If G is an order-0 schema (i.e. a program shape), N(d, i, G) returns = if (d, i) is in the schema, ∅ otherwise. So, it is easy to convince oneself that if h is a program and G(h) is its shape, N(d, i, h) ≠ ∅ if and only if N(d, i, G(h)) ≠ ∅. Also, clearly N(h) = N(G(h)). So, using the integers i and j to represent the coordinates (d₁, i₁) and (d₂, i₂), respectively, the expression of p_StdUnif(d₁, i₁, d₂, i₂|h₁, h₂) in Equation 11 can be transformed as follows:

$$p_{\mathrm{StdUnif}}(i,j|h_1,h_2) = \frac{\delta(N(i,h_1)\neq\emptyset)\,\delta(N(j,h_2)\neq\emptyset)}{N(h_1)\,N(h_2)} = \frac{\delta(N(i,G(h_1))\neq\emptyset)\,\delta(N(j,G(h_2))\neq\emptyset)}{N(G(h_1))\,N(G(h_2))} = p_{\mathrm{StdUnif}}(i,j|G(h_1),G(h_2)).$$
So, we can substitute the expression of p(i, j|Gₖ, Gₗ) in Equation 19 with the expression

$$\frac{\delta(N(i,G_k)\neq\emptyset)\,\delta(N(j,G_l)\neq\emptyset)}{N(G_k)\,N(G_l)},$$

obtaining

$$\alpha(H,t) = (1-p_{xo})\,p(H,t) + p_{xo} \sum_{k,l}\sum_{i\in H}\sum_j \frac{\delta(N(i,G_k)\neq\emptyset)\,\delta(N(j,G_l)\neq\emptyset)}{N(G_k)\,N(G_l)}\; p(U(H,i)\cap G_k,t)\,p(L(H,i,j)\cap G_l,t).$$
From this we obtain:

Theorem 4 (Macroscopic Exact GP Schema Theorem for Standard Crossover). The total transmission probability for a fixed-size-and-shape GP schema H under standard crossover with uniform selection of crossover points is

$$\alpha(H,t) = (1-p_{xo})\,p(H,t) + p_{xo} \sum_{k,l} \frac{1}{N(G_k)\,N(G_l)} \sum_{i\in H\cap G_k}\;\sum_{j\in G_l} p(U(H,i)\cap G_k,t)\,p(L(H,i,j)\cap G_l,t). \quad (20)$$
As an example let us further specialise this result to the case of linear structures.

9.1.1 Exact Schema Theorem for GP with Standard Crossover Acting on Linear Structures

In the case of function sets including only unary functions, programs and schemata can be represented as linear sequences of symbols like h₁ h₂ … h_N. In this case:
The hyperschema U (h1 h2 :::hN ; i) = h1 h2 :::hi #. The hyperschema L(h1 h2 :::hN ; i; j ) = (#)j hi+1 :::hN , where the notation (x)n means x repeated n times. This is equivalent to (=)j hi+1 :::hN since only linear structures are allowed.
The set U (h1 h2 :::hN ; i) \ Gk =
The set L(h1 h2 :::hN ; i; j ) \ Gl =
h1 h2 :::h (=)
k
i
;
for i < k , otherwise.
i
(=)j hi+1 :::hN
if l = j i + N , otherwise.
;
By substituting these quantities into Equation 20 and performing some simplifications one obtains the following result
$$\alpha(h_1 \dots h_N, t) = (1-p_{xo})\,p(h_1 \dots h_N, t) + p_{xo} \sum_{k\ge 1} \frac{1}{k} \sum_{i=0}^{\min(N,k)-1} p(h_1 \dots h_i (=)^{k-i}, t) \sum_{j\ge 0} \frac{p((=)^j h_{i+1} \dots h_N, t)}{j-i+N}, \quad (21)$$
which can be shown to be entirely equivalent to the schema theorem for linear structures reported in (Poli and McPhee 2000), where we provided a direct proof for it.
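For linear structures the building blocks U and L reduce to simple prefix and suffix operations on symbol sequences, which can be transcribed directly. The tuple encoding and function names below are ours; the crossover point i is 0-based, as in the node reference system.

```python
def U_linear(h, i):
    """U(h1...hN, i) = h1 ... h_i #  (prefix of i symbols, then '#')."""
    return h[:i] + ('#',)

def L_linear(h, i, j):
    """L(h1...hN, i, j) = (=)^j h_{i+1} ... h_N  (j don't-cares, then
    the suffix rooted at crossover point i)."""
    return ('=',) * j + h[i:]

h = ('A', 'B', 'C')
# U_linear(h, 1) -> ('A', '#'); L_linear(h, 1, 2) -> ('=', '=', 'B', 'C')
```

Note that L_linear(h, i, j) has length j − i + N when h has N symbols, which is exactly the shape-index condition l = j − i + N appearing in the intersection with Gₗ above.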
9.2 A Different Form of Macroscopic Schema Theorem for One-point Crossover

As an additional example, in this section we will derive a GP schema theorem for one-point crossover equivalent to the one described in Section 2.1. Again, in order to use Theorem 3 we need to first check whether one-point crossover is node-invariant, i.e. whether it is true that p(i, j|h₁, h₂) = p(i, j|G(h₁), G(h₂)). It is easy to see that NC(h₁, h₂) = NC(G(h₁), G(h₂)), and that i ∈ C(h₁, h₂) if and only if i ∈ C(G(h₁), G(h₂)). Therefore, the expression of p_1pt(d₁, i₁, d₂, i₂|h₁, h₂) in Equation 13 can be transformed as follows:

$$\begin{aligned} p_{\mathrm{1pt}}(i,j|h_1,h_2) &= \begin{cases} 1/NC(h_1,h_2) & \text{if } i=j \text{ and } i \in C(h_1,h_2),\\ 0 & \text{otherwise,}\end{cases}\\ &= \begin{cases} 1/NC(G(h_1),G(h_2)) & \text{if } i=j \text{ and } i \in C(G(h_1),G(h_2)),\\ 0 & \text{otherwise,}\end{cases}\\ &= p_{\mathrm{1pt}}(i,j|G(h_1),G(h_2)). \end{aligned}$$
So, we can substitute the expression of p(i, j|Gₖ, Gₗ) in Equation 19 with the expression

$$\frac{\delta(i \in C(G_k,G_l))\,\delta(i=j)}{NC(G_k,G_l)},$$

obtaining

$$\begin{aligned} \alpha(H,t) &= (1-p_{xo})\,p(H,t) + p_{xo} \sum_{k,l}\sum_{i\in H}\sum_j \frac{\delta(i\in C(G_k,G_l))\,\delta(i=j)}{NC(G_k,G_l)}\; p(U(H,i)\cap G_k,t)\,p(L(H,i,j)\cap G_l,t)\\ &= (1-p_{xo})\,p(H,t) + p_{xo} \sum_{k,l}\sum_{i\in H} \frac{\delta(i\in C(G_k,G_l))}{NC(G_k,G_l)}\; p(U(H,i)\cap G_k,t)\,p(L(H,i,i)\cap G_l,t),\end{aligned}$$
which leads to:

Theorem 5 (Macroscopic Exact GP Schema Theorem for One-point Crossover). The total transmission probability for a fixed-size-and-shape GP schema H under one-point crossover is

$$\alpha(H,t) = (1-p_{xo})\,p(H,t) + p_{xo} \sum_{k,l} \frac{1}{NC(G_k,G_l)} \sum_{i\in H\cap C(G_k,G_l)} p(U(H,i)\cap G_k,t)\,p(L(H,i,i)\cap G_l,t). \quad (22)$$

This result is entirely equivalent to Equation 3. For brevity, here we provide only a sketch of the proof of this fact. Firstly, one should note that the sum over i ∈ C(Gₖ, Gₗ) in Equation 2 can be replaced with a sum over i ∈ H ∩ C(Gₖ, Gₗ) because U(H, i) ∩ Gₖ = L(H, i) ∩ Gₗ = ∅ if i ∉ H. Then, all we need to show is that p(U(H, i) ∩ Gₖ, t) p(L(H, i, i) ∩ Gₗ, t) = p(U(H, i) ∩ Gₖ, t) p(L(H, i) ∩ Gₗ, t) for i ∈ H ∩ C(Gₖ, Gₗ). If U(H, i) ∩ Gₖ = ∅ this is true, since the l.h.s. and the r.h.s. are both zero. So, let us concentrate on the case in which U(H, i) ∩ Gₖ ≠ ∅. In this case, the arities of the nodes on the path between node i and the root node must be the same for H and Gₖ. Since i ∈ C(Gₖ, Gₗ), it must also be true that the arities of the nodes on the path between node i and the root node are the same for H and Gₗ. For this reason, when U(H, i) ∩ Gₖ ≠ ∅, L(H, i, i) ∩ Gₗ = L(H, i) ∩ Gₗ, and so, whether or not U(H, i) ∩ Gₖ ≠ ∅, p(U(H, i) ∩ Gₖ, t) p(L(H, i, i) ∩ Gₗ, t) = p(U(H, i) ∩ Gₖ, t) p(L(H, i) ∩ Gₗ, t) for i ∈ H ∩ C(Gₖ, Gₗ).
9.3 Size-evolution Equation for GP with Subtree-swapping Crossover

Let us call symmetric any crossover operator for which p(i, j|h₁, h₂) = p(j, i|h₂, h₁).

Theorem 6. The expected mean size of the programs at generation t+1, E[μ(t+1)], in a GP system with a symmetric subtree-swapping crossover operator in the absence of mutation can equivalently be expressed in microscopic form as

$$E[\mu(t+1)] = \sum_{h\in P} N(h)\,p(h,t), \quad (23)$$

where P denotes the population, or in macroscopic form as

$$E[\mu(t+1)] = \sum_l N(G_l)\,p(G_l,t). \quad (24)$$

Proof. By definition, the expected mean size of the individuals in the population at the next generation is given by

$$E[\mu(t+1)] = \sum_{h\in S} N(h)\,\alpha(h,t),$$
where the summation is over all possible computer programs in the search space S. By substituting Equation 15 (specialised for H = h) into this equation, one obtains

$$E[\mu(t+1)] = (1-p_{xo}) \sum_{h\in S} N(h)\,p(h,t) + p_{xo} \sum_{h\in S} N(h) \sum_{h_1\in P}\sum_{h_2\in P} p(h_1,t)\,p(h_2,t) \sum_i\sum_j p(i,j|h_1,h_2)\,\delta(h_1\in U(h,i))\,\delta(h_2\in L(h,i,j)),$$

where we did not limit i to range only over the crossover points in h because, if i is outside h, then L(h, i, j) and U(h, i) are empty sets and so, for such i's, δ(h₁ ∈ U(h,i)) δ(h₂ ∈ L(h,i,j)) = 0. Because p(h, t) ≠ 0 only for h ∈ P, $\sum_{h\in S} N(h)\,p(h,t) = \sum_{h\in P} N(h)\,p(h,t)$. Therefore,

$$E[\mu(t+1)] = (1-p_{xo}) \sum_{h\in P} N(h)\,p(h,t) + p_{xo} \sum_{h_1\in P}\sum_{h_2\in P} p(h_1,t)\,p(h_2,t) \sum_i\sum_j p(i,j|h_1,h_2) \sum_{h\in S} N(h)\,\delta(h_1\in U(h,i))\,\delta(h_2\in L(h,i,j)).$$

The quantity N(h) δ(h₁ ∈ U(h,i)) δ(h₂ ∈ L(h,i,j)) is non-zero only for the program h made up from the upper part of h₁ with respect to crossover point i and the lower part of h₂ with respect to crossover point j. These include N(U(h₁, i)) and N(h₂) − N(U(h₂, j)) nodes, respectively. Therefore,

$$E[\mu(t+1)] = (1-p_{xo}) \sum_{h\in P} N(h)\,p(h,t) + p_{xo} \sum_{h_1\in P}\sum_{h_2\in P} p(h_1,t)\,p(h_2,t) \sum_i\sum_j p(i,j|h_1,h_2) \left[ N(U(h_1,i)) + N(h_2) - N(U(h_2,j)) \right].$$

However, it is easy to show that for symmetric crossovers

$$\sum_{h_1\in P}\sum_{h_2\in P} p(h_1,t)\,p(h_2,t) \sum_i\sum_j p(i,j|h_1,h_2)\,N(U(h_1,i)) = \sum_{h_1\in P}\sum_{h_2\in P} p(h_1,t)\,p(h_2,t) \sum_i\sum_j p(i,j|h_1,h_2)\,N(U(h_2,j)),$$

by simply reordering the terms in the summations and renaming their indices. So,

$$\begin{aligned} E[\mu(t+1)] &= (1-p_{xo}) \sum_{h\in P} N(h)\,p(h,t) + p_{xo} \sum_{h_1\in P}\sum_{h_2\in P} p(h_1,t)\,p(h_2,t)\,N(h_2)\, \underbrace{\sum_i\sum_j p(i,j|h_1,h_2)}_{=1}\\ &= (1-p_{xo}) \sum_{h\in P} N(h)\,p(h,t) + p_{xo} \sum_{h_2\in P} p(h_2,t)\,N(h_2)\, \underbrace{\sum_{h_1\in P} p(h_1,t)}_{=1}\\ &= (1-p_{xo}) \sum_{h\in P} N(h)\,p(h,t) + p_{xo} \sum_{h_2\in P} p(h_2,t)\,N(h_2)\\ &= \sum_{h\in P} N(h)\,p(h,t)\\ &= \sum_l \;\sum_{h\in G_l\cap P} N(h)\,p(h,t)\\ &= \sum_l N(G_l) \sum_{h\in G_l\cap P} p(h,t)\\ &= \sum_l N(G_l)\,p(G_l,t). \quad\Box \end{aligned}$$
The theorem is important because it indicates that for symmetric subtree-swapping crossover operators the mean program size evolves as if selection alone were acting on the population. This means that any variation in mean size, for example in the presence of bloat, can only be attributed to some form of positive or negative selective pressure on some or all of the shapes $G_l$. From this theorem it follows that on a flat landscape
\[
E[\mu(t+1)] = \sum_l N(G_l)\,p(G_l,t) = \sum_l N(G_l)\,\frac{m(G_l,t)}{M} = \mu(t)
\]
(or $E[\mu(t+1)] = \sum_l N(G_l)\,\alpha(G_l,t-1)$ for infinite populations), whereby
Corollary 7. On a flat landscape,
\[
E[\mu(t+1)] = \mu(t). \tag{25}
\]
For infinite populations, the expectation operator can be removed from the previous theorem and corollary. In (Poli and McPhee 2000) we obtained more specific versions of these results for one-point crossover and standard crossover acting on linear structures.
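The size-evolution result lends itself to a direct numerical check. The sketch below (an illustration added here, not part of the original derivation) models programs as linear structures identified only by their size, fixes an arbitrary selection distribution $p(h,t)$, and computes exactly the expected offspring size under a symmetric crossover with uniformly selected crossover points; the names `sizes`, `probs` and `p_xo` are hypothetical.

```python
from itertools import product

# Toy population: "programs" are linear structures known only by their size N(h).
sizes = [3, 5, 7, 10]          # N(h) for each program h (hypothetical values)
probs = [0.1, 0.4, 0.3, 0.2]   # selection probabilities p(h, t) (hypothetical)
p_xo = 0.9                     # crossover probability

# Mean size under selection only: sum_h N(h) p(h, t).
selection_only = sum(n * p for n, p in zip(sizes, probs))

# Exact expected offspring size under a symmetric crossover:
# p(i, j | h1, h2) = 1 / (N(h1) N(h2)) (uniform choice of crossover points),
# and the offspring keeps the first i-1 nodes of h1 plus the last
# N(h2) - (j - 1) nodes of h2, so N(child) = (i - 1) + N(h2) - (j - 1).
xo_mean = 0.0
for (n1, p1), (n2, p2) in product(zip(sizes, probs), repeat=2):
    for i, j in product(range(1, n1 + 1), range(1, n2 + 1)):
        child_size = (i - 1) + n2 - (j - 1)
        xo_mean += p1 * p2 * child_size / (n1 * n2)

# Overall expected mean size at the next generation.
expected_next = (1 - p_xo) * selection_only + p_xo * xo_mean

# As the theorem predicts, crossover leaves the expected mean size unchanged.
assert abs(expected_next - selection_only) < 1e-9
```

With the values above both quantities come out to 6.4 (up to floating-point rounding): the expected mean size after crossover equals the mean size under selection alone, for any choice of `p_xo`.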
9.4 Effective Fitness for GP with Subtree-Swapping Crossovers

Once an exact schema theorem is available, it is easy to extend the notion of effective fitness provided in (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999) (see Section 2.2) to GP with subtree-swapping crossovers. Indeed, by using the definition in Equation 7 and the value of $\alpha(H,t)$ in Equation 19, we obtain the following

Theorem 8. The effective fitness of a fixed-size-and-shape GP schema $H$ under a node-invariant subtree-swapping crossover operator and no mutation is
\[
f_e(H,t) = f(H,t) \left[ 1 - p_{xo} \left( 1 - \sum_{k,l} \sum_{i \in H} \sum_j p(i,j|G_k,G_l)\,
\frac{p(U(H,i) \cap G_k, t)\, p(L(H,i,j) \cap G_l, t)}{p(H,t)} \right) \right] \tag{26}
\]
This result gives the true effective fitness for a GP schema under subtree-swapping crossover: it is not an approximation or a lower bound. It is possible to specialise this definition to standard crossover. Again, for simplicity, we will consider the case in which crossover points are selected uniformly at random. In this case, from Equation 20 we obtain:

Corollary 9. The effective fitness of a fixed-size-and-shape GP schema $H$ under standard crossover with uniform selection of crossover points is
\[
f_e(H,t) = f(H,t) \left[ 1 - p_{xo} \left( 1 - \sum_{k,l} \sum_{i \in H \cap G_k} \sum_{j \in G_l}
\frac{p(U(H,i) \cap G_k, t)\, p(L(H,i,j) \cap G_l, t)}{N(G_k)\,N(G_l)\,p(H,t)} \right) \right] \tag{27}
\]
In future work we intend to compare this definition with the approximate notions of effective fitness and operator-adjusted fitness introduced in (Nordin and Banzhaf 1995, Nordin et al. 1995) and (Goldberg 1989), respectively.
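The structure of the effective-fitness definition can be sketched in code. Assuming fitness-proportionate selection as in the paper, Equation 26 is algebraically equivalent to $f(H,t)\,\alpha(H,t)/p(H,t)$ with $\alpha(H,t)$ from Equation 19. The hypothetical function below exploits this, taking the double "schema creation" sum as a precomputed argument; the function and argument names are illustrative, not from the paper.

```python
def effective_fitness(f_H, p_H, p_xo, creation_sum):
    """Effective fitness of a fixed-size-and-shape schema H (sketch of Eq. 26).

    creation_sum stands for the double sum in Eq. 26, i.e. the total
    probability that crossover assembles a new instance of H from the
    upper and lower parts of the selected parents.
    """
    # Transmission probability alpha(H, t) as in the macroscopic theorem (Eq. 19).
    alpha = (1 - p_xo) * p_H + p_xo * creation_sum
    # Eq. 26 rearranged: f(H, t) * alpha(H, t) / p(H, t).
    return f_H * alpha / p_H

# If crossover creates instances of H exactly as often as it samples them
# (creation_sum = p(H, t)), effective and ordinary fitness coincide.
f_neutral = effective_fitness(1.0, 0.2, 0.9, 0.2)    # approximately f(H, t) = 1.0
# Net disruption (creation_sum < p(H, t)) lowers the effective fitness.
f_disrupted = effective_fitness(1.0, 0.2, 0.9, 0.1)  # below 1.0
```

The sketch makes the qualitative content of the theorem visible: schemata that crossover disrupts more often than it recreates have an effective fitness below their ordinary fitness, and vice versa.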
10 Conclusions

In this paper a new general schema theory for genetic programming is presented. The theory includes two main results describing the propagation of GP schemata: a microscopic schema theorem and a macroscopic one. The microscopic version is applicable to crossover operators which replace a subtree in one parent with a subtree from the other parent to produce the offspring. The macroscopic version is valid for subtree-swapping crossover operators in which the probability of selecting any two crossover points in the parents depends only on their size and shape. Therefore, these theorems are very general and can be applied to model most GP systems used in practice. Like other recent schema theory results (Stephens and Waelbroeck 1997, Stephens and Waelbroeck 1999, Poli 2000a, Poli 2000b), this theory gives an exact formulation (rather than a lower bound) for the expected number of instances of a schema at the next generation. One special case of this theory is the exact schema theorem for standard crossover: a result that has been awaited for many years. As shown by some recent explorations reported in (Poli and McPhee 2000), exact schema theories can be used, for example, to study the exact evolution of schemata in infinite populations over multiple generations, to compare different operators and identify their biases, to study the evolution of size, and to investigate bloat. Also, as discussed in (Poli 2000c) for one-point crossover, exact macroscopic theories open the way to future work on GP convergence, population sizing, and deception, to mention only a few possibilities. In the future we hope to use the general schema theory reported in this paper to obtain new general results in at least some of these exciting directions.
Acknowledgements

The author would like to thank the members of the EEBIC (Evolutionary and Emergent Behaviour Intelligence and Computation) group at Birmingham, Nic McPhee, Jon Rowe, Julian Miller, Xin Yao, and W. B. Langdon for useful discussions and comments on various parts of the work reported in this paper.
References

Altenberg, Lee (1994). Emergent phenomena in genetic programming. In: Evolutionary Programming — Proceedings of the Third Annual Conference (A. V. Sebald and L. J. Fogel, Eds.). World Scientific Publishing. pp. 233–241.

Altenberg, Lee (1995). The Schema Theorem and Price’s Theorem. In: Foundations of Genetic Algorithms 3 (L. Darrell Whitley and Michael D. Vose, Eds.). Morgan Kaufmann. Estes Park, Colorado, USA. pp. 23–49.

D’haeseleer, Patrik (1994). Context preserving crossover in genetic programming. In: Proceedings of the 1994 IEEE World Congress on Computational Intelligence. Vol. 1. IEEE Press. Orlando, Florida, USA. pp. 256–261.

Goldberg, D. E. (1989). Genetic algorithms and Walsh functions: II. Deception and its analysis. Complex Systems 3(2), 153–171.

Holland, J. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press. Ann Arbor, USA.

Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press. Cambridge, MA, USA.

Langdon, W. B. (1999). Size fair and homologous tree genetic programming crossovers. In: Proceedings of the Genetic and Evolutionary Computation Conference (Wolfgang Banzhaf, Jason Daida, Agoston E. Eiben, Max H. Garzon, Vasant Honavar, Mark Jakiela and Robert E. Smith, Eds.). Vol. 2. Morgan Kaufmann. Orlando, Florida, USA. pp. 1092–1097.
Langdon, William B. (2000). Size fair and homologous tree genetic programming crossovers. Genetic Programming And Evolvable Machines 1(1/2), 95–119.

Montana, David J. (1995). Strongly typed genetic programming. Evolutionary Computation 3(2), 199–230.

Nordin, Peter and Wolfgang Banzhaf (1995). Complexity compression and evolution. In: Genetic Algorithms: Proceedings of the Sixth International Conference (ICGA95) (L. Eshelman, Ed.). Morgan Kaufmann. Pittsburgh, PA, USA. pp. 310–317.

Nordin, Peter, Frank Francone and Wolfgang Banzhaf (1995). Explicitly defined introns and destructive crossover in genetic programming. In: Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications (Justinian P. Rosca, Ed.). Tahoe City, California, USA. pp. 6–22.

O’Reilly, Una-May and Franz Oppacher (1995). The troubling aspects of a building block hypothesis for genetic programming. In: Foundations of Genetic Algorithms 3 (L. Darrell Whitley and Michael D. Vose, Eds.). Morgan Kaufmann. Estes Park, Colorado, USA. pp. 73–88.

Poli, R. (2000a). Hyperschema theory for GP with one-point crossover, building blocks, and some new results in GA theory. In: Genetic Programming, Proceedings of EuroGP 2000 (Riccardo Poli, Wolfgang Banzhaf and et al., Eds.). Springer-Verlag.

Poli, Riccardo (2000b). Exact schema theorem and effective fitness for GP with one-point crossover. In: Proceedings of the Genetic and Evolutionary Computation Conference (D. Whitley, D. Goldberg, E. Cantu-Paz, L. Spector, I. Parmee and H.-G. Beyer, Eds.). Morgan Kaufmann. Las Vegas. pp. 469–476.

Poli, Riccardo (2000c). Microscopic and macroscopic schema theories for genetic programming and variable-length genetic algorithms with one-point crossover, their use and their relations with earlier GP and GA schema theories. Technical Report CSRP-00-15. School of Computer Science, The University of Birmingham.

Poli, Riccardo and Nicholas F. McPhee (2000). Exact schema theorems for GP with one-point and standard crossover operating on linear structures and their application to the study of the evolution of size. Technical Report CSRP-00-14. School of Computer Science, The University of Birmingham.

Poli, Riccardo and W. B. Langdon (1997a). Genetic programming with one-point crossover. In: Soft Computing in Engineering Design and Manufacturing (P. K. Chawdhry, R. Roy and R. K. Pant, Eds.). Springer-Verlag London. pp. 180–189.

Poli, Riccardo and W. B. Langdon (1997b). A new schema theory for genetic programming with one-point crossover and point mutation. In: Genetic Programming 1997: Proceedings of the Second Annual Conference (John R. Koza, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max Garzon, Hitoshi Iba and Rick L. Riolo, Eds.). Morgan Kaufmann. Stanford University, CA, USA. pp. 278–285.

Poli, Riccardo and William B. Langdon (1998). Schema theory for genetic programming with one-point crossover and point mutation. Evolutionary Computation 6(3), 231–252.

Poli, Riccardo, William B. Langdon and Una-May O’Reilly (1998). Analysis of schema variance and short term extinction likelihoods. In: Genetic Programming 1998: Proceedings of the Third Annual Conference (John R. Koza, Wolfgang Banzhaf, Kumar Chellapilla, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max H. Garzon, David E. Goldberg, Hitoshi Iba and Rick Riolo, Eds.). Morgan Kaufmann. University of Wisconsin, Madison, Wisconsin, USA. pp. 284–292.

Rosca, Justinian P. (1997). Analysis of complexity drift in genetic programming. In: Genetic Programming 1997: Proceedings of the Second Annual Conference (John R. Koza, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max Garzon, Hitoshi Iba and Rick L. Riolo, Eds.). Morgan Kaufmann. Stanford University, CA, USA. pp. 286–294.
Stephens, C. R. and H. Waelbroeck (1997). Effective degrees of freedom in genetic algorithms and the block hypothesis. In: Proceedings of the Seventh International Conference on Genetic Algorithms (ICGA97) (Thomas Bäck, Ed.). Morgan Kaufmann. East Lansing. pp. 34–40.

Stephens, C. R. and H. Waelbroeck (1999). Schemata evolution and building blocks. Evolutionary Computation 7(2), 109–124.

Whigham, P. A. (1995). A schema theorem for context-free grammars. In: 1995 IEEE Conference on Evolutionary Computation. Vol. 1. IEEE Press. Perth, Australia. pp. 178–181.

Whigham, Peter Alexander (1996). Grammatical Bias for Evolutionary Learning. PhD thesis. School of Computer Science, University College, University of New South Wales, Australian Defence Force Academy.