Crossover Bias in Genetic Programming

Maarten Keijzer (1) and James Foster (2)

(1) [email protected]
(2) University of Idaho, [email protected]

Abstract. Path length, or search complexity, is an understudied property of trees in genetic programming. Unlike size and depth measures, path length directly measures the balancedness or skewedness of a tree. Here a close relative to path length, called visitation length, is studied. It is shown that a population undergoing standard crossover will introduce a crossover bias in the visitation length. This bias is due to inserting variable length subtrees at various levels of the tree. The crossover bias takes the form of a covariance between the sizes and levels in the trees that form a population. It is conjectured that the crossover bias directly determines the size distribution of trees in genetic programming. Theorems are presented for the one-generation evolution of visitation length both with and without selection. The connection between path length and visitation length is made explicit.

1 Introduction

Nodes in trees are usually distinguished as being either internal nodes or external nodes. In genetic programming these are named functions and terminals respectively. Important characteristics of such trees are the internal path length, defined as the sum of path lengths (number of edges) to reach every internal node from the root, and the external path length, similarly defined as the sum of path lengths to reach every external node from the root. Many recurrences and generating functions are known for these path lengths for trees of a particular shape, usually binary trees.

[Figure 1: example tree. Values shown: π(T) = 4 (internal path length), ζ(T) = 12 (external path length), f(T) = 4 (number of internal nodes), e(T) = 5 (number of external nodes), s(T) = 9 (size), φ(T) = 25 (visitation length).]

Fig. 1. Example tree and example values for the various functions defined on the tree that are used throughout this paper. Circles are internal nodes and crossed boxes external nodes.

In genetic programming, internal nodes and external nodes are simply considered nodes, and operators are defined in terms of such general nodes. The size of a tree is defined as the number of nodes, regardless of whether they are internal or external. With the notable exception of Koza’s 90/10% rule of selecting internal nodes over external nodes, many studies assume that node selection for subtree crossovers and mutations is done uniformly. This is followed here: the subtree crossover studied in this paper selects nodes uniformly, and is called standard crossover.

In the genetic programming literature, the shape of a tree is usually characterized by its size and depth. Here an alternative measure is examined: a close relative of the total path length of a tree, called the visitation length. The visitation length gives a natural distinction between balanced and skewed trees of the same size and is directly related to the average size of the subtrees in a tree. A tree’s visitation length has a number of important analytical properties that are directly related to the evolution of sizes and shapes in genetic programming. In particular, it will be shown that a crossover bias is present in the evolution of this visitation length. This crossover bias is fully determined by a covariance between sizes and levels in a tree, and it is conjectured that it determines the distribution of sizes in a population of such trees.

The paper presents theorems for the evolution of visitation length both with and without selection, alongside some empirical investigations into the effect of crossover bias in genetic programming. It thereby provides a mathematical foundation for inquiries into the evolution of tree topology both with and without selection.

2 Sizes and Levels in Trees

Definition 1 (Notation and Size). Given a tree T, the function s(T) gives the number of nodes (both external and internal) in the tree. The subtree relation is denoted by the ∈s symbol: t ∈s T is true if and only if tree t is a subtree of tree T. This covers all subtrees, not only the direct descendants. Uppercase (T) denotes a rooted tree (used for selection and variation), while lowercase (t) designates subtrees. A tree is considered to be a subtree of itself. To sum over all subtrees in a tree, the notation $\sum_{t \in_s T}$ is used. The direct descendants (immediate subtrees) of a tree T are addressed by indexing, $T_i$, with c(T) the number of children of T. Internal and external path length are denoted by π and ζ respectively, while f and e give the number of internal and external nodes. Using this notation, the size of a tree can be defined in various ways:

$$s(T) = \sum_{t \in_s T} 1 = 1 + \sum_{i=1}^{c(T)} s(T_i) = f(T) + e(T)$$

Similarly, using the δ function, defined to return 1 when its argument is true and 0 otherwise, the size of a subtree t of T can be defined as the number of subtrees of T that are a subtree of t:

$$s(t) = \sum_{u \in_s T} \delta(u \in_s t)$$

Definition 2 (Level). The level of a subtree t in a particular tree T is defined as the number of nodes that need to be traversed to reach the root node, including t itself. Thus the level l(T) of a root node equals 1, while the nodes accessible directly from the root are found at level 2, etc. The level of a subtree t of a tree T can thus be defined as the number of subtrees of T of which t is a subtree:

$$l(t) = \sum_{u \in_s T} \delta(t \in_s u)$$

Lemma 1 (Symmetry between sizes and levels). The sum of sizes of all subtrees of T, $\sum_{t \in_s T} s(t)$, is equal to the sum of levels, $\sum_{t \in_s T} l(t)$.

Proof. Using Definitions 1 and 2:

$$\sum_{t \in_s T} s(t) = \sum_{t \in_s T} \sum_{u \in_s T} \delta(u \in_s t) = \sum_{t \in_s T} \sum_{u \in_s T} \delta(t \in_s u) = \sum_{t \in_s T} l(t)$$

□

Definition 3 (Total Visitation Length and Mean Subtree Size). The sum of the sizes (and equivalently, by Lemma 1, the sum of levels) has symmetric properties and is directly related to the average subtree size. This sum is denoted here with the function φ, and is called the visitation length:

$$\phi(T) = \sum_{t \in_s T} s(t) = \sum_{t \in_s T} l(t) = s(T) + \sum_{i=1}^{c(T)} \phi(T_i)$$

Using φ, both the mean subtree size $\bar{s}$ and the mean subtree level $\bar{l}$ of a tree T can be defined as:

$$\bar{s}(T) = \bar{l}(T) = \frac{\sum_{t \in_s T} s(t)}{s(T)} = \frac{\phi(T)}{s(T)}$$

The visitation length φ measures the total number of nodes that need to be visited starting at the root. It has a number of important properties, and much of the remainder of this paper is devoted to the study of φ. The visitation length φ is directly related to the total path length:

Theorem 1 (Total Path Length). The visitation length φ is, for trees, directly related to the total (internal and external) path length through:

$$\phi(T) = \pi(T) + \zeta(T) + s(T)$$

Proof. The level of a subtree is the path length to the root plus 1 (Definition 2). For all internal nodes the sum of levels therefore equals π(T) + f(T), while for all external nodes it equals ζ(T) + e(T). By definition, s(T) = e(T) + f(T). □

The definitions of visitation length and path length differ solely in the manner of counting: nodes (vertices) and edges respectively. The functional φ as defined on trees has some important properties. For a tree of a given size s, φ takes on smaller values the more balanced the tree is. Figure 2 gives an example of this relation. The visitation length is used by, for instance, [7] to define an alternative to size-based parsimony pressure in order to steer the population towards small, balanced trees.

[Figure 2: four differently shaped trees of size 7, with φ = 17, φ = 19, φ = 20 and φ = 22.]

Fig. 2. Total visitation length for a number of differently shaped trees of size 7. The less balanced the tree, the larger the visitation length.
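The definitions of this section can be checked mechanically. The following Python sketch is an illustration and not part of the original paper: the `Node` class and all helper names are assumptions, and the nine-node tree is merely one binary shape that reproduces the values listed for Fig. 1 (the shape actually drawn there is not recoverable from the text). It computes φ as a sum of sizes, as a sum of levels, and through the recursion of Definition 3, and it compares balanced and skewed binary trees of size 7.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node; a node without children is external."""
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """s(t): number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def subtrees(t: Node, level: int = 1) -> Iterator[Tuple[Node, int]]:
    """Yield every subtree of t together with its level (root = 1)."""
    yield t, level
    for c in t.children:
        yield from subtrees(c, level + 1)


def phi_sizes(T: Node) -> int:
    """Visitation length as the sum of subtree sizes."""
    return sum(size(t) for t, _ in subtrees(T))


def phi_levels(T: Node) -> int:
    """Visitation length as the sum of levels (Lemma 1)."""
    return sum(level for _, level in subtrees(T))


def phi_rec(T: Node) -> int:
    """Visitation length through the recursion of Definition 3."""
    return size(T) + sum(phi_rec(c) for c in T.children)


def path_lengths(T: Node) -> Tuple[int, int]:
    """Internal and external path length, counted in edges."""
    pi = sum(level - 1 for t, level in subtrees(T) if t.children)
    zeta = sum(level - 1 for t, level in subtrees(T) if not t.children)
    return pi, zeta


def leaf() -> Node:
    return Node()


def fig1_like() -> Node:
    """A binary tree with f = 4, e = 5 (assumed shape, see lead-in)."""
    return Node([Node([Node([leaf(), leaf()]), leaf()]),
                 Node([leaf(), leaf()])])


def balanced7() -> Node:
    return Node([Node([leaf(), leaf()]), Node([leaf(), leaf()])])


def skewed7() -> Node:
    return Node([Node([Node([leaf(), leaf()]), leaf()]), leaf()])


T = fig1_like()
pi, zeta = path_lengths(T)
print(size(T), pi, zeta, phi_rec(T))          # 9 4 12 25
print(phi_rec(T) == pi + zeta + size(T))      # Theorem 1: True
print(phi_rec(balanced7()), phi_rec(skewed7()))  # 17 19
```

The three ways of computing φ agree on every tree, which is exactly the content of Lemma 1 and Definition 3; the recursive form is the cheapest to evaluate if subtree sizes are computed in the same pass.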

3 Empirical behaviour of φ in genetic programming

Although the expected change in tree size for a population undergoing standard crossover is zero, this does not necessarily hold for the visitation length: φ is related to both the shape and the size of the tree. A simple experiment is set up to investigate the behaviour of a population undergoing only crossover, without selection. A population of 5000 binary trees is created through one of three methods: grow, ramped-half-and-half and skew (footnote 3), each governed by a single parameter, the maximum depth of the tree. For all methods, when the depth limit of 7 is reached, only terminals are selected. Figure 3 shows the relationship between the average size of the population and the average visitation length $\bar\phi$ for a number of runs of genetic programming using different initialization strategies. A clear convergence to a particular relationship between the two variables is observed. The transient can, however, be long.

Footnote 3: Skew creates a binary tree with at least one terminal at each node.

[Figure 3: plot of mean visitation length (vertical axis, 0 to 20000) against mean size (horizontal axis, 0 to 500).]

Fig. 3. Phase space plot of $\bar\phi$ and $\bar{s}$. Circles, crosses and plusses depict the beginning of runs initialized using grow, skew, and ramped-half-and-half respectively.
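A crossover-only experiment of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' setup: trees are nested tuples, only a simplified `grow` initializer with a hypothetical leaf probability `p_leaf` is shown, parents are paired at random, and both offspring of each crossover are kept (so the total number of nodes is conserved exactly). Crossover points are chosen uniformly over all nodes, as in standard crossover.

```python
import random
import sys

sys.setrecursionlimit(10_000)  # trees are traversed recursively

LEAF = ()  # a terminal; an internal node is a tuple (left, right)


def grow(depth, rng, p_leaf=0.3):
    """Simplified 'grow': at the depth limit only terminals are chosen."""
    if depth <= 1 or rng.random() < p_leaf:
        return LEAF
    return (grow(depth - 1, rng, p_leaf), grow(depth - 1, rng, p_leaf))


def size_phi(t):
    """Return (s(t), phi(t)) in one pass, using phi = s + sum of child phi."""
    s, p = 1, 0
    for c in t:
        cs, cp = size_phi(c)
        s += cs
        p += cp
    return s, p + s


def paths(t, path=()):
    """Yield the index path to every node of t (root = empty path)."""
    yield path
    for i, c in enumerate(t):
        yield from paths(c, path + (i,))


def get(t, path):
    for i in path:
        t = t[i]
    return t


def put(t, path, new):
    """Return a copy of t with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return t[:i] + (put(t[i], path[1:], new),) + t[i + 1:]


def crossover(a, b, rng):
    """Swap uniformly chosen subtrees of a and b; return both offspring."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))


def generation(pop, rng):
    """One generation of crossover without selection."""
    pop = pop[:]
    rng.shuffle(pop)
    nxt = []
    for a, b in zip(pop[0::2], pop[1::2]):
        nxt.extend(crossover(a, b, rng))
    if len(pop) % 2:          # odd population: carry the last tree over
        nxt.append(pop[-1])
    return nxt


rng = random.Random(0)
pop = [grow(7, rng) for _ in range(500)]
for g in range(10):
    stats = [size_phi(t) for t in pop]
    mean_size = sum(s for s, _ in stats) / len(pop)
    mean_phi = sum(p for _, p in stats) / len(pop)
    print(g, round(mean_size, 2), round(mean_phi, 2))
    pop = generation(pop, rng)
```

In this variant the mean size stays constant by construction, so any drift in the printed mean visitation length comes from the redistribution of subtrees over levels and trees alone.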

4 Analysis

To study the evolution of the average visitation length $\bar\phi$ without selection, and to shed some light on this apparent convergence to a fixed relationship with the average size, the microscopic mechanics of standard crossover are examined. It is well known that this operation does not alter the expected average tree size of the population in the next generation. However, this is not necessarily the case for the average visitation length.

Notation. Below, P is used to denote a population of n trees. T ∈ P denotes that the rooted tree T is part of the population, and summations such as $\sum_{T \in P}$ denote a summation over all trees (not subtrees) in the population. For summing over all subtrees and levels in a population, the double summation $\sum_{T \in P} \sum_{t \in_s T}$ is used.

Definition 4 (Size-Level Covariance). The size-level covariance on a population P is defined as:

$$\mathrm{Cov}_{sl}(P) = \frac{1}{n} \sum_{T \in P} \frac{1}{s(T)} \sum_{t \in_s T} (s(t) - \bar{s}(P))(l(t) - \bar{s}(P)) = \frac{1}{n} \sum_{T \in P} \frac{\sum_{t \in_s T} s(t)\, l(t)}{s(T)} - \bar{s}(P)^2$$

with

$$\bar{s}(P) = \frac{1}{n} \sum_{T \in P} \frac{\phi(T)}{s(T)}$$

Note that due to Lemma 1, this average subtree size is equal to the average level. The definition thus defines a true covariance.

Lemma 2 (Microscopic interaction). The net effect on the visitation length φ(U) of inserting a single (sub)tree t at the place of u in tree U (denoted by u ← t) is given by:

$$\phi(U \mid u \leftarrow t) = \phi(U) - \phi(u) + \phi(t) + (l(u) - 1)(s(t) - s(u))$$

Proof. Consider the recursion $\phi(U) = s(U) + \sum_i \phi(U_i)$ from Definition 3. For the subtree u of U that is replaced by t, Δφ = φ(t) − φ(u). This term is transmitted unaltered through the recursion. The change in size Δs = s(t) − s(u) affects the size of all ancestors of the node (i.e., all subtrees v of U other than u itself for which u ∈s v). By the definition of the level as one plus the path length to the root node, exactly l(u) − 1 ancestors are affected, leading to an additional change in φ of (l(u) − 1)Δs. □

Theorem 2 (Crossover Bias). The expected value of the average visitation length in a population undergoing standard crossover without selection equals the average visitation length in the current population minus the covariance between sizes and levels of the subtrees in the population:

$$\bar\phi(P') = \bar\phi(P) - \mathrm{Cov}_{sl}(P)$$

Proof. Average over all pairs of trees in a population P consisting of n trees, and over all possible crossover points, using Lemma 2. The terms that depend on φ alone contribute, with normalization factors suppressed, $\tfrac{1}{2}\sum_{T,U \in P} (\phi(T) + \phi(U)) = \bar\phi(P)$ and $\tfrac{1}{2}\sum_{T,U \in P} \sum_{u \in_s U,\, t \in_s T} (\phi(t) - \phi(u)) = 0$. This leads to a total change in $\bar\phi(P)$ of:

$$\Delta\phi(P) = \frac{1}{n^2} \sum_{T \in P} \sum_{U \in P} \frac{\sum_{t \in_s T} \sum_{u \in_s U} (l(u) - 1)(s(t) - s(u))}{s(T)\, s(U)}$$

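After some algebra (in Appendix), this reduces to:

$$\Delta\phi(P) = \frac{1}{n^2} \left[ \sum_{T \in P} \frac{\sum_{t \in_s T} s(t)}{s(T)} \right]^2 - \frac{1}{n} \sum_{T \in P} \frac{\sum_{t \in_s T} l(t)\, s(t)}{s(T)} = -\mathrm{Cov}_{sl}(P)$$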

□

Although the expected change in size over one generation is zero for a population undergoing standard crossover and no selection, this is not the case for the expected visitation length of the trees composing the population. The size-level covariance $\mathrm{Cov}_{sl}$ introduces a bias in the expected visitation length, and evolves this macroscopic quantity toward a region where this bias is no longer present. When $\mathrm{Cov}_{sl} = 0$, this bias disappears. Theorem 2 relates three quantities: the expected value of the visitation length, the average subtree size, and the product of sizes and levels.
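Lemma 2 and Theorem 2 can be verified exactly on small populations by brute-force enumeration. The sketch below is an illustration under assumptions: trees are nested tuples, all function names are hypothetical, and "standard crossover without selection" is read as in the proof above, namely a uniform average over all ordered (donor, receiver) pairs, including a tree paired with itself, and over all pairs of crossover points. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

LEAF = ()


def size(t):
    return 1 + sum(size(c) for c in t)


def phi(t):
    return size(t) + sum(phi(c) for c in t)


def points(t, path=(), level=1):
    """Yield (path, subtree, level) for every subtree of t."""
    yield path, t, level
    for i, c in enumerate(t):
        yield from points(c, path + (i,), level + 1)


def replace(t, path, new):
    """Return a copy of t with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return t[:i] + (replace(t[i], path[1:], new),) + t[i + 1:]


def lemma2_holds(U, donors):
    """Check Lemma 2 for every insertion point of U and every donor subtree."""
    for path, u, level in points(U):
        for t in donors:
            lhs = phi(replace(U, path, t))
            rhs = phi(U) - phi(u) + phi(t) + (level - 1) * (size(t) - size(u))
            if lhs != rhs:
                return False
    return True


def mean_phi(pop):
    return Fraction(sum(phi(T) for T in pop), len(pop))


def cov_sl(pop):
    """Size-level covariance of Definition 4."""
    n = len(pop)
    s_bar = sum(Fraction(phi(T), size(T)) for T in pop) / n
    mean_sl = sum(
        Fraction(sum(size(t) * l for _, t, l in points(T)), size(T))
        for T in pop
    ) / n
    return mean_sl - s_bar ** 2


def expected_phi_after_crossover(pop):
    """Exact expectation of phi of one offspring (assumed averaging scheme)."""
    total = Fraction(0)
    for T, U in product(pop, repeat=2):          # donor T, receiver U
        weight = Fraction(1, size(T) * size(U))  # uniform crossover points
        for (_, t, _), (path, _, _) in product(points(T), points(U)):
            total += weight * phi(replace(U, path, t))
    return total / len(pop) ** 2


pop = [LEAF, (LEAF, LEAF), (((LEAF, LEAF), LEAF), LEAF), ((LEAF,),)]
print(all(lemma2_holds(U, [t for _, t, _ in points(U)]) for U in pop))
print(expected_phi_after_crossover(pop) == mean_phi(pop) - cov_sl(pop))
```

For the two-tree population consisting of a single terminal and a three-node binary tree, the covariance is −1/9, so the expected visitation length rises from 3 to 28/9 in one generation even though the expected size is unchanged.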

The main effect shown here is the effect on the visitation length of inserting variable-sized subtrees at various levels in a tree. This effect is captured in the covariance between sizes and levels in the tree. Lemma 2 shows the microscopic basis of this covariance, and holds for any replacement of a subtree in a tree. Other crossovers and mutations that insert variable-length subtrees at various locations in the tree will have a similar derivation associated with them, where the uniform probabilities are replaced by non-uniform probabilities. At this point it is not clear which operator, if any, has no crossover bias. Analyzing other operators and designing an unbiased operator is left for future work.

The crossover bias is the product of sizes and levels, both compared with the mean subtree size in the population. For populations consisting of identical trees, this quantity appears to be nonzero in general (footnote 4). If this is the case, then to make the covariance disappear, the mean subtree size in the population needs to be significantly smaller than the subtree size of the larger trees. To make this happen, small trees need to be sampled more frequently than large trees.

4.1 Binary Trees and the Catalan Distribution

It is conjectured here that the crossover bias is mainly related to the distribution of tree sizes in the population, not to a particular preference for balanced or skewed trees. As is conjectured elsewhere [3], a population not undergoing selection will evolve towards the most common shapes. Here it is empirically investigated whether shape, as measured by total path length or visitation length, indeed evolves towards its expected (common) value. If so, crossover bias does not affect the shape at all, and its full effect must lie in influencing the distribution of sizes in the population. As short trees have a smaller visitation length than larger trees, the relationship being non-linear, a particular size distribution can have an effect on expected visitation length without changing expected size. For binary trees, the expected path lengths under the Catalan distribution can be found in [6], chapter 5:

Theorem 3 (Sedgewick and Flajolet). For a binary tree T with n internal nodes, selected at random with the Catalan distribution (inside the square roots, π denotes the circle constant):

– $E(\pi(T)) = n\sqrt{\pi n} - 3n + O(\sqrt{n})$
– $E(\zeta(T)) = n\sqrt{\pi n} - n + O(\sqrt{n})$

And thus,

Corollary 1. For a binary tree T with n = (s(T) − 1)/2 internal nodes, selected at random with the Catalan distribution:

$$E(\phi(T)) = 2n\sqrt{\pi n} - 2n + 1 + O(\sqrt{n})$$

Proof. Direct, by using Theorem 1 in Theorem 3. □

Footnote 4: No proof is available at this point; it is, however, experimentally verified for all trees consisting of unary, binary and ternary nodes up to depth 7.
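Corollary 1 can be compared against exact values for moderate n. The sketch below is an illustration, not taken from the paper or from [6]: it counts binary trees by the usual Catalan recurrence and accumulates the total visitation length over all trees with n internal nodes using φ(T) = s(T) + φ(left) + φ(right) from Definition 3, with s(T) = 2n + 1. The function names are assumptions.

```python
import math
from fractions import Fraction


def catalan_phi_totals(n_max):
    """Return lists C and Phi, where C[n] is the number of binary trees with
    n internal nodes and Phi[n] is the sum of phi over all those trees."""
    C = [1]    # n = 0: the single-terminal tree
    Phi = [1]  # phi of a single terminal is 1
    for n in range(1, n_max + 1):
        count = 0
        total = 0
        for k in range(n):        # k internal nodes left, n-1-k right
            m = n - 1 - k
            count += C[k] * C[m]
            total += Phi[k] * C[m] + C[k] * Phi[m]
        total += (2 * n + 1) * count  # root term s(T) for every tree counted
        C.append(count)
        Phi.append(total)
    return C, Phi


def expected_phi_exact(n, C, Phi):
    """Exact expectation of phi under the uniform (Catalan) distribution."""
    return Fraction(Phi[n], C[n])


def expected_phi_asymptotic(n):
    """Leading terms of Corollary 1, without the O(sqrt(n)) remainder."""
    return 2 * n * math.sqrt(math.pi * n) - 2 * n + 1


C, Phi = catalan_phi_totals(400)
for n in (1, 2, 3, 100, 400):
    exact = float(expected_phi_exact(n, C, Phi))
    approx = expected_phi_asymptotic(n)
    print(n, exact, approx, (exact - approx) / math.sqrt(n))
```

The last printed column is the gap divided by √n; if the O(√n) remainder of Corollary 1 is tight, this ratio should stay bounded as n grows.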

The experiment in Figure 4 indicates that a population undergoing standard crossover apparently samples trees from this Catalan distribution. Given a particular size of a tree, the expected value of φ of such trees is determined by the expected tree shapes, up to a term of order $\sqrt{n}$. The figure shows a close fit, giving some indication that the conjecture is true and that no particular shapes are preferred by standard crossover. The main effect of the crossover bias would then be manifested as a preference for a particular limit distribution of tree sizes. Figure 4 also depicts this distribution of sizes for binary trees without selection as a histogram. The experiment exhibits a strong preference for trees of small size. The relationship between crossover bias and size distribution has been examined before [4], albeit for unary trees, and led to the discovery of a limit distribution for such unary trees in the form of a gamma distribution. For binary trees, a different distribution is induced, one where small trees, in particular terminals, are over-abundant.

[Figure 4: visitation length (0 to 120000) and number of trees sampled, plotted against the number of internal nodes (0 to 1000). Series: number of trees, measured visitation length, expected visitation length, balanced trees, skewed trees.]

Fig. 4. Visitation length given size, both measured and expected through Corollary 1 for binary trees, given a particular size. The graph is the result of 100 runs of 5000 trees, operating without selection. Also shown is the distribution of the sizes of the trees that are sampled, and the visitation length for maximally balanced and skewed trees.

5 Size-level Covariance and Fitness

Because the effect of crossover on φ is non-zero in general, it can be expected to affect the composition of a population undergoing selection. Following Altenberg [1, 2], the effect of fitness on an evolving population can be studied using a canonical genetic algorithm. The expected value of any measurement function F in such a canonical algorithm can be formulated as:

$$\bar{F}' = \sum_x F(x)\, p(x)' = \sum_x \sum_{y,z \in P} F(x)\, T(x \leftarrow y, z)\, \frac{w(y)\, w(z)}{\bar{w}^2}\, p(y)\, p(z) \qquad (1)$$

where p(x)′ is the frequency of genotype (tree) x in the next generation, based on the current frequencies p, the fitness w, and the transmission function T that gives the probability of creating genotype x from y and z. This equation is directly related to Price’s Covariance and Selection Theorem [5] (see Altenberg [1] for a complete derivation).

Theorem 4 (Visitation Length and Fitness). The expected visitation length $\bar\phi$ in a population undergoing standard crossover and selection over one generation is given by:

$$\bar\phi(P') = \bar\phi(P) + \underbrace{\mathrm{Cov}\!\left(\phi(T), \frac{w(T)}{\bar{w}}\right)}_{\text{Fitness/}\phi\text{ covariance}} - \underbrace{\sum_{T \in P} \frac{w(T) \sum_{t \in_s T} (s(t) - \bar{s}(P))(l(t) - \bar{s}(P))}{\bar{w}\, s(T)\, |P|}}_{\text{Crossover bias}}$$

Proof. In Appendix. □
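The fitness/φ covariance term in Theorem 4 is the selection part of Price's theorem: after fitness-proportional selection alone, the mean of any trait equals its current mean plus the covariance between the trait and relative fitness. The following sketch checks this identity with exact arithmetic; the trait values and fitness values are made-up illustrative numbers, and the function names are assumptions.

```python
from fractions import Fraction


def mean(xs):
    xs = list(xs)
    return sum(xs, Fraction(0)) / len(xs)


def cov(xs, ys):
    """Population covariance (divides by n, matching the paper's averages)."""
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) * (y - my) for x, y in zip(xs, ys))


def mean_after_selection(values, fitness):
    """Expected trait mean under fitness-proportional selection."""
    return sum(Fraction(v) * w for v, w in zip(values, fitness)) / sum(fitness)


# Hypothetical population: visitation lengths and fitness values that
# favour trees with a lower visitation length.
phis = [1, 5, 11, 17, 19]
ws = [Fraction(5), Fraction(4), Fraction(3), Fraction(2), Fraction(1)]

w_bar = mean(ws)
relative = [w / w_bar for w in ws]

lhs = mean_after_selection(phis, ws)
rhs = mean(phis) + cov(phis, relative)
print(lhs == rhs, cov(phis, relative) < 0)
```

With fitness favouring low φ, the covariance is negative, which is the direction of the selection term reported in the experiment of this section.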

This result on visitation length translates directly to total path length:

Corollary 2 (Path length and Fitness). The evolution of the total path length p(x) = π(x) + ζ(x) under standard crossover and selection is given by:

$$\bar{p}(P') = \bar\phi(P') - \bar{s}(P) - \mathrm{Cov}\!\left(s(T), \frac{w(T)}{\bar{w}}\right)$$

where $\bar{s}(P)$ here denotes the mean size of the trees in the population.

Proof. This is direct by using Theorems 1 and 4 and Price’s Covariance and Selection theorem applied to size. □

As can be expected, the bias induced by standard crossover without selection gets transmitted when using selection. The evolution of visitation length is determined by the crossover bias without selection from Theorem 2, and by covariances between fitness and visitation length, and between fitness and the product of size and level of individual trees.

To investigate the effect of crossover bias on a population undergoing selection, an experiment is performed. The problem is defined by a terminal set consisting of the constant 1, and a function set consisting of the unary minus operator − and the binary addition operator +. The goal of the problem is to find an expression that evaluates to the value 0. The smallest solutions for this problem are two trees consisting of 4 nodes: 1 + −1 and −1 + 1. To keep bloat under control and to make the smallest solutions global optima, a small amount of parsimony pressure is used. The population performs proportional selection with crossover only (thus no reproduction), using a population of size 50,000. The fitness value used in proportional selection is $w(T) = e^{-\mathrm{eval}(T)^2} \times 2^{-|T|/1000}$. Initialization is ramped-half-and-half, and invariably leads to finding the global optimum after initialization. With the global optimum already found, failed runs are effectively eliminated and the transient to the equilibrium state of maximal average fitness can be studied.
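The test problem and its fitness function are small enough to write down directly. The sketch below is an illustration under assumptions: the nested-tuple encoding and the helper names (`ONE`, `neg`, `add`, `evaluate`, `fitness`) are hypothetical, and the fitness follows the formula above read as exp(−eval(T)²) · 2^(−|T|/1000).

```python
import math

ONE = ("1",)  # the only terminal: the constant 1


def neg(a):
    """Unary minus node."""
    return ("-", a)


def add(a, b):
    """Binary addition node."""
    return ("+", a, b)


def evaluate(t):
    """Evaluate an expression tree over {1, unary -, binary +}."""
    op = t[0]
    if op == "1":
        return 1
    if op == "-":
        return -evaluate(t[1])
    if op == "+":
        return evaluate(t[1]) + evaluate(t[2])
    raise ValueError(f"unknown node type: {op!r}")


def size(t):
    """|T|: number of nodes in the tree."""
    return 1 + sum(size(c) for c in t[1:])


def fitness(t):
    """w(T) = exp(-eval(T)^2) * 2^(-|T| / 1000)."""
    return math.exp(-(evaluate(t) ** 2)) * 2.0 ** (-size(t) / 1000)


smallest = add(ONE, neg(ONE))        # 1 + -1
mirrored = add(neg(ONE), ONE)        # -1 + 1
bloated = add(smallest, mirrored)    # also evaluates to 0, but 9 nodes

for tree in (ONE, smallest, mirrored, bloated):
    print(evaluate(tree), size(tree), fitness(tree))
```

The parsimony factor differs by well under one percent between the 4-node and the 9-node solution, which illustrates how weak the size pressure is relative to the penalty for a non-zero evaluation.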

[Figure 5: plot against generations (0 to 2000), vertical axis in nodes (−60 to 60). Series: crossover bias; Cov(fitness, visitation length); Cov(fitness, visitation length) + crossover bias; mean size.]

Fig. 5. Transient to the equilibrium for a population that already found a global optimum, for 100 runs. Depicted are the mean size of the trees in the population, the covariance between fitness and φ alone, the additional terms induced by the crossover bias, and finally the resulting change in φ, Δφ. 95% confidence intervals are included.

Figure 5 depicts the evolution of the population towards equilibrium. Although the optimum is a four-node solution, the mean size of the trees in the population grows: first slowly, but when a critical value of approximately 20 nodes is reached, the population undergoes rapid growth until, at 50 nodes, equilibrium is reached due to parsimony pressure. Longer runs do not show any growth after this point is reached. Interestingly, bloat does occur, even when there is a fitness disadvantage for larger trees.

Also depicted is the balance that is achieved between the covariance of fitness w and visitation length φ on the one hand, and the crossover bias on the other. The experiment consistently shows that the fitness function w is correlated with individuals of lower visitation length. In particular, this holds when the population is truly converged and the expected size differential becomes zero. It cannot, however, be concluded that the fitness function prefers balanced trees and that crossover bias hinders this. Although lower visitation length indicates more balanced trees, this only holds for trees of the same size. Smaller trees usually have a smaller visitation length than larger trees, and this relationship is non-linear: the experimental observation may then simply indicate that the fitness function prefers a different size distribution, with crossover bias hindering this.

The experiment does show, however, that the crossover bias is not necessarily aligned with the information obtained by measuring fitness. A dynamic balance is found for the expected visitation length $\bar\phi$ during the run, and the covariance between φ and w is counteracted exactly by the size-level covariance. It can be expected that such a conflict of forces has an effect on the optimization capabilities of an evolving population. At this point, a clear view of the exact effect of the crossover bias in an evolving population is unavailable.

6 Conclusion

This work examines the one-generation evolution of path length and visitation length in genetic programming for standard crossover, both with and without selection. A crossover bias is derived that works against the covariance between fitness and visitation length. The crossover bias takes the form of a covariance between the sizes and levels in the tree. Theorems are presented for the exact effect this crossover bias has on the evolution of shape, both with and without selection. It is hypothesized that this crossover bias mainly introduces a preference for a particular size distribution, as empirical evidence indicates that no preference for a particular visitation length is induced by crossover. The paper gives a theoretical foundation for further inquiry into these matters.

References

1. L. Altenberg. The evolution of evolvability in genetic programming. In K. E. Kinnear, Jr., editor, Advances in Genetic Programming, chapter 3, pages 47–74. MIT Press, 1994.
2. L. Altenberg. The Schema Theorem and Price’s Theorem. In L. D. Whitley and M. D. Vose, editors, Foundations of Genetic Algorithms 3, pages 23–49, Estes Park, Colorado, USA, 31 July–2 Aug. 1994. Morgan Kaufmann, 1995.
3. W. B. Langdon, T. Soule, R. Poli, and J. A. Foster. The evolution of size and shape. In L. Spector, W. B. Langdon, U.-M. O’Reilly, and P. J. Angeline, editors, Advances in Genetic Programming 3, chapter 8, pages 163–190. MIT Press, Cambridge, MA, USA, June 1999.
4. N. F. McPhee and R. Poli. A schema theory analysis of the evolution of size in genetic programming with linear representations. In J. F. Miller, M. Tomassini, P. L. Lanzi, C. Ryan, A. G. B. Tettamanzi, and W. B. Langdon, editors, Genetic Programming, Proceedings of EuroGP’2001, volume 2038 of LNCS, pages 108–125, Lake Como, Italy, 18-20 Apr. 2001. Springer-Verlag.
5. G. Price. Selection and covariance. Nature, 227:520–521, 1970.
6. R. Sedgewick and P. Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, 1996.
7. G. Smits and M. Kotanchek. Pareto-front exploitation in symbolic regression. In U.-M. O’Reilly, T. Yu, R. L. Riolo, and B. Worzel, editors, Genetic Programming Theory and Practice II, chapter 17. Kluwer, Ann Arbor, 13-15 May 2004.

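Appendix

Proof of Theorem 2. The terms −s(t) + s(u) that arise when expanding the product cancel when averaged over all pairs of trees, which gives the second line:

$$
\begin{aligned}
\Delta\phi(P) &= \frac{1}{n^2} \sum_{T \in P} \sum_{U \in P} \frac{1}{s(T)\, s(U)} \left[ \sum_{t \in_s T} \sum_{u \in_s U} (l(u) - 1)(s(t) - s(u)) \right] \\
&= \frac{1}{n^2} \sum_{T \in P} \sum_{U \in P} \frac{1}{s(T)\, s(U)} \sum_{t \in_s T} \sum_{u \in_s U} \left[ s(t)\, l(u) - l(u)\, s(u) \right] \\
&= \frac{1}{n^2} \sum_{T \in P} \sum_{U \in P} \frac{1}{s(T)\, s(U)} \left[ \sum_{t \in_s T} \sum_{u \in_s U} s(t)\, l(u) - s(T) \sum_{u \in_s U} l(u)\, s(u) \right] \\
&= \frac{1}{n^2} \sum_{T \in P} \sum_{U \in P} \left[ \frac{\sum_{u \in_s U} l(u) \sum_{t \in_s T} s(t)}{s(T)\, s(U)} - \frac{\sum_{u \in_s U} l(u)\, s(u)}{s(U)} \right] \\
&= \frac{1}{n^2} \sum_{T \in P} \sum_{U \in P} \frac{\sum_{u \in_s U} s(u) \sum_{t \in_s T} s(t)}{s(T)\, s(U)} - \frac{1}{n} \sum_{U \in P} \frac{\sum_{u \in_s U} l(u)\, s(u)}{s(U)} \\
&= \frac{1}{n^2} \left[ \sum_{T \in P} \frac{\sum_{t \in_s T} s(t)}{s(T)} \right]^2 - \frac{1}{n} \sum_{T \in P} \frac{\sum_{t \in_s T} l(t)\, s(t)}{s(T)}
\end{aligned}
$$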

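Proof of Theorem 4. Insert Theorem 2 into Equation 1:

$$
\begin{aligned}
\bar\phi(P') &= \frac{1}{2} \sum_{T,U \in P} \frac{w(T)\, w(U)}{\bar{w}^2} \left[ \phi(T \cup U) - \mathrm{Cov}_{sl}(T \cup U) \right] \\
&= \frac{1}{|P|} \sum_{T \in P} \frac{w(T)}{\bar{w}}\, \phi(T) - \frac{1}{2} \sum_{T,U \in P} \Bigg[ \frac{w(T)}{s(T)\, \bar{w}} \sum_{t \in_s T} (s(t) - \bar{s}(T \cup U))(l(t) - \bar{s}(T \cup U)) \\
&\qquad\qquad + \frac{w(U)}{s(U)\, \bar{w}} \sum_{u \in_s U} (s(u) - \bar{s}(T \cup U))(l(u) - \bar{s}(T \cup U)) \Bigg] \\
&= \bar\phi(P) + \mathrm{Cov}\!\left(\phi(T), \frac{w(T)}{\bar{w}}\right) - \sum_{T \in P} \frac{w(T) \sum_{t \in_s T} (s(t) - \bar{s}(P))(l(t) - \bar{s}(P))}{\bar{w}\, s(T)\, |P|}
\end{aligned}
$$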