Duplication of Coding Segments in Genetic Programming

Thomas Haynes
Department of Mathematical & Computer Sciences
The University of Tulsa
600 South College Avenue
Tulsa, OK 74104-3189
e-mail: [email protected]

Abstract
Research into the utility of non-coding segments, or introns, in genetic-based encodings has shown that they expedite the evolution of solutions in domains by protecting building blocks against destructive crossover. We consider a genetic programming system where non-coding segments can be removed, and the resultant chromosomes returned into the population. This parsimonious repair leads to premature convergence, since as we remove the naturally occurring non-coding segments, we strip away their protective backup feature. We then duplicate the coding segments in the repaired chromosomes, and place the modified chromosomes into the population. The duplication method significantly improves the learning rate in the domain we have considered. We also show that this method can be applied to other domains.
1 Introduction

Researchers in both genetic algorithms (GA) and genetic programming (GP) have recently begun to examine the utility of non-coding segments (footnote 1) in chromosomes [2, 5, 10, 19, 20, 24, 25, 26, 35, 36, 37]. The research into the utility of non-coding segments in genetic-based encodings has shown that they facilitate the evolution of solutions in domains. A bit is the atomic element of a chromosome, and a non-coding segment is a bit or a contiguous collection of bits that does not contribute to the overall fitness of a chromosome. Non-coding segments guard against destructive crossover by providing bits where the exchange of genetic material will not affect the fitness of the chromosome.

We investigate the effect of duplication of coding segments in the chromosome on the evolutionary process. This duplication is found in GP chromosomes, and the duplicated segments are conjectured to be the building blocks for GP [32]. Multiple appearances of these building blocks increase the probability that the building block will survive reproduction. A difficulty in GP research is in identifying the building blocks for a domain [27, 29]. We have implemented a domain in which all building blocks can easily be enumerated. This capability allows us to investigate the effects of duplication of building blocks within chromosomes.

We present an approach in which all non-coding segments are first removed from a GP chromosome. A non-coding segment is then created which is a duplicate of the coding segment and concatenated onto the chromosome. At least one of the resultant children is guaranteed to be at least as fit as the repaired parent. We then examine the utility of increasing the number of copies of the coding segment within the chromosome.

The rest of this paper is organized as follows: Section 2 presents natural and genetic-based non-coding segments. Section 3 introduces the clique detection problem. Section 4 reveals how the clique detector facilitates our research into non-coding segments. Section 5 presents our experimental results. Section 6 ties everything together. Section 7 shows how this work can be extended.

Footnote 1: Non-coding segments are computational models of intragenic regions or introns.
2 Non-coding segments

Non-coding segments model the intragenic regions reported in the biological literature [1, 8, 11] and are the intron segments seen in genetic-based encoding literature [24, 35]. The non-coding sequences account for a large fraction of the DNA [11, 35]. It is hypothesized that these segments are backup material for the coding segments. For example, the frog Xenopus laevis has 450 copies of the gene coding for 18S and 28S rRNA and 24,000 copies of the gene for 5S rRNA [11].

Another conjecture is that the non-coding sequences can act as a library for adaptation. During RNA splicing the non-coding sequences are stripped, producing a smaller RNA molecule. As the gene can be spliced in a variety of ways, the non-coding sequence for one mRNA could be a coding sequence for another [1]. Besides this code reuse, as a protein evolves to meet changes in the environment, it can resort to the non-coding segments instead of evolving entirely new genetic material.
2.1 Genetic Algorithms
In the genetic-based encoding literature, most of the emphasis on non-coding segments has focused on how these extra bits provide a buffer against destructive crossover. The canonical GA chromosome, or string, representation utilizes a binary alphabet. If a don't care or wildcard symbol is utilized, we have the alphabet $\{0, 1, *\}$. A schema is a template describing a subset of strings. The defining length $\delta$ of a schema is the distance between the outermost bits defined on the binary alphabet. Building blocks have a small defining length and are highly fit. They are integral to the schema theorem, which defines how the implicit parallel search of a genetic algorithm "builds" better solutions over time. The interested reader is referred to Goldberg [14] for an introduction to the basics of GA theory.

It is shown in [19] that adding non-coding segments to chromosomes, to separate building blocks, protects those building blocks from being sliced by crossover. Wu et al. [36, 37] have applied Levenick's work to the Royal Road functions [10], which are designed as testbeds for studying building blocks.

In GA research, chromosomes are typically represented as fixed length bit strings. With a string of length $l$, and a building block of defining length $\delta$, any crossover operation has a probability,

$$P_l = \frac{\delta}{l - 1}$$

of destroying a building block [14]. If non-coding segments, of a total length of $i$, are added to the chromosome, then the length of the chromosome increases to $l + i$, and the probability of destructive crossover breaking up a building block of defining length $\delta$ decreases to

$$P_{l+i} = \frac{\delta}{l + i - 1}.$$

A simple example will suffice to show the power of inserting non-coding segments into a chromosome. In Figure 1(a) we are given a string of length $l = 15$ and a building block, b1, of defining length $\delta = 6$. The probability of crossover tearing apart the building block is $P_l = 0.43$. In Figure 1(b) we add an intron segment of length $i = 5$ and we find that the probability of destructive crossover decreases to $P_{l+i} = 0.32$. It should be noted that while adding the non-coding segment at the tail of the chromosome decreases the probability of destructive crossover, it does not aid the recombination of building blocks as much as placing the non-coding segments between the building blocks [34].

(a)  ***1*0***0*1*1*         (positions 0 to 14; building blocks b1 and b2)
(b)  ***1*0***0*1*1******    (positions 0 to 19; building blocks b1 and b2, followed by a non-coding segment)

Figure 1: Example binary string chromosomes showing the power of non-coding segments to prevent destructive crossover. (a) Without the non-coding segment. (b) With the non-coding segment. Note that multiple segments could be added anywhere but between the start and end of any building block.

The key point to inserting non-coding segments into the GA chromosome is that non-coding segments reduce the chance of destructive crossover. An "artificial" constraint which is enforced is that only the coding segments are examined to determine the fitness of a chromosome (footnote 2). Therefore material within a non-coding segment cannot be mixed with that within the coding segment. Thus the non-coding segment material is meaningless, and in particular, selection pressure does not drive it to be backup material.

Footnote 2: From our understanding of the biological interactions of transcription of DNA to mRNA [11], natural selection has determined that certain regions are triggered to be transcription points based on the appearance of certain sequences of nucleotides.
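The two disruption probabilities in the Figure 1 example can be checked numerically. The following is a minimal sketch, assuming one-point crossover with each of the $l - 1$ cut points equally likely; the function name `disruption_probability` is illustrative and not part of the original work.

```python
def disruption_probability(defining_length, coding_length, intron_length=0):
    """Chance that a single crossover cut falls inside a building block.

    A string of n bits has n - 1 cut points, and `defining_length` of them
    lie between the outermost defined bits of the block.
    """
    total_length = coding_length + intron_length
    return defining_length / (total_length - 1)


# Values from the Figure 1 example: l = 15, delta = 6, i = 5
p_without = disruption_probability(6, 15)
p_with = disruption_probability(6, 15, intron_length=5)
print(f"P_l = {p_without:.2f}, P_l+i = {p_with:.2f}")
```

Appending the five non-coding bits lowers the ratio from 6/14 to 6/19, which matches the 0.43 and 0.32 quoted in the text.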
Wu and Lindsay [36] also point out that there is a drawback to inserting non-coding segments: they retard the growth of building blocks. A little thought will show that it will be hard for evolution to recombine the building blocks if non-coding segments are there to prevent destructive crossover. However, once those building blocks are formed, they are quite difficult to break up.
2.2 Genetic Programming
The "basic" theory of genetic programming is borrowed from that of genetic algorithms. Due to the difficulties in detecting building blocks in GP chromosomes, research is ongoing into formally connecting the theory as to why GP works with that of why GAs work [27, 29, 30, 33]. The canonical GP chromosome representation is a parse tree, and is quite often called an S-expression. The interested reader is referred to Koza [18] for an introduction to the basics of GP theory.

The differences between GA and GP are more than the fixed versus variable genotype representation. In GA, due to the fixed length of chromosomes, there is a close relationship between the genotype and phenotype structure of a chromosome (footnote 3). Thus the building blocks of GAs are usually represented at the genotype level. It is relatively easy to detect building blocks in a GA chromosome. With GP, building blocks are at the phenotype or semantic level. As such, they are difficult to represent, detect, and capture. Also, there can be a duplication of building blocks in a GP chromosome, whereas there may not be any such duplication in a GA chromosome.

Tackett [32] compares the difficulty in researching building blocks between GP and GA: different notations of schemata and a non-binary alphabet. His conjecture is that "small" subtrees which appear frequently in S-expressions are GP's building blocks. These subtrees are prevalent due to their contribution to the fitness of the chromosomes in which they appear. He also suggests that parsimony may be of interest not only because of aesthetic considerations, but because of a natural bound on the "appropriate size" of a solution tree for a given problem.

Altenberg [2] has a different stance on why duplications appear inside GP chromosomes: there are two selection forces which add blocks of code to the population.
The genetic operators spread a block to different chromosomes, and an emergent selection pressure, which he calls "constructional selection", causes the formation of duplication within a chromosome. The duplication is a result of the fitness of the block being replicated.

Angeline [4] reports that while there is redundancy in chromosomes, the benefit of these semantically extraneous components is in the prevention of destructive crossover. He also highlights a difference between GAs and GPs with regards to non-coding segments: in GAs, they are added by design, whereas in GPs they evolve naturally.

Nordin et al. [24, 25, 26] investigate the dynamics of non-coding segments in GP evolution. Unlike the typical tree representation, Nordin et al. utilize chromosomes comprised of linear genomes which are 32 bit strings and represent binary code for a SUN-4 [23]. Non-coding bits are defined to be those that, when replaced by a NOP instruction, i.e. perform no operation, do not change the semantics or phenotype of the chromosome. Using this capability, Nordin et al. investigated the effects of non-coding segments on destructive crossover. They reached the same conclusions regarding the utility of non-coding segments as did the GA researchers. While their chromosome representation is nonstandard, they report some promising preliminary results with the more traditional representation.

Footnote 3: There is always an exception; in this case it is messy GAs.
3 Clique Detection
Given a graph $G = (V, E)$, a subgraph $G'$ of $G$ is a clique if [22] $G' = (V', E')$ where $V' \subseteq V$, $E' \subseteq E$, and

$$V' \times V' - \{(v_i, v_i) \mid v_i \in V'\} = E'.$$

Less formally, a clique of $G$ is a complete subgraph of $G$. A clique is denoted by the set of vertices in the complete subgraph. The goal is to find all cliques of $G$. Since the subgraph of $G$ induced by any subset of the vertices of a complete subgraph of $G$ is also complete, it is sufficient to find all maximal complete subgraphs of $G$ [17]. A maximal complete subgraph of $G$ is referred to as a maximal clique. Figure 2 is a 10 node graph, with cliques:

$$C = \{\{0, 3, 4\}, \{0, 1, 4\}, \{1, 4, 5\}, \{1, 2, 5\}, \{2, 5, 6\}, \{3, 4, 7\}, \{4, 7, 8\}, \{4, 5, 8\}, \{5, 8, 9\}, \{5, 6, 9\}\}.$$

[Figure 2 image: graph with nodes labeled 0 through 9.]

Figure 2: Example 10 node graph.

Finding the maximum clique of a graph has been addressed within the genetic algorithm community [6, 7, 22]. The problem of finding the set of all cliques has not been addressed within the GA community. This is due in part to the fact that a variable length structure is needed to find all the cliques, while a fixed length chromosome can be used to find the maximum clique. Our preliminary research on the clique detection problem is found in [15].

A variable length chromosome is necessary because in general there will be an unknown number of cliques per graph. Potential cliques are denoted as candidate cliques, which need to be evaluated to determine if they actually form a clique. The chromosome will then be a collection of candidate cliques. Candidate cliques need to be tested to determine if:

- they contain duplicate nodes.
- they are subsumed by another candidate clique.
- they are completely connected.
- they are maximal.
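The clique list given for Figure 2 can be cross-checked with a short enumeration of maximal cliques. The sketch below is not the genetic programming method of this paper; it uses the basic Bron-Kerbosch recursion purely as a reference check. Since the figure itself is not reproduced here, the edge set is an assumption: it contains exactly the edges implied by the ten listed cliques.

```python
from itertools import combinations

# Cliques listed in the text for the 10 node graph of Figure 2
LISTED_CLIQUES = [
    {0, 3, 4}, {0, 1, 4}, {1, 4, 5}, {1, 2, 5}, {2, 5, 6},
    {3, 4, 7}, {4, 7, 8}, {4, 5, 8}, {5, 8, 9}, {5, 6, 9},
]


def edges_from_cliques(cliques):
    """Assumption: the graph has exactly the edges implied by the cliques."""
    return {frozenset(pair)
            for clique in cliques
            for pair in combinations(sorted(clique), 2)}


def adjacency(edges):
    """Build a node -> set of neighbours map from undirected edges."""
    adj = {}
    for edge in edges:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration (no pivoting)."""
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found


EDGES = edges_from_cliques(LISTED_CLIQUES)
FOUND = maximal_cliques(adjacency(EDGES))
print(sorted(sorted(c) for c in FOUND))
```

Under that assumption the enumeration returns the same ten three-node cliques, so no listed clique is subsumed by a larger one.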
3.1 Fitness Measure
Each chromosome in the population will represent sets of candidate cliques. The function and terminal sets are F = {ExtCon, IntCon} and T = {1, ..., #nodes}. ExtCon connects two sets of candidate cliques, while IntCon connects nodes inside a candidate clique. The fitness evaluation will be composed of two parts:

- A reward for clique size.
- A reward for number of cliques in the chromosome.

Since we want to gather the maximal connected subgraphs, the reward for size must be greater than that for the number of cliques. We also ensure that we do not reward for the same clique either being expressed in the chromosome twice or being subsumed by another clique expressed in the chromosome. The first falsely inflates the fitness of the individual, while the second invalidates the goals of the problem. The algorithm for the fitness evaluation is:

1. Parse the S-expression into a series of ordered cliques.
2. Throw away any invalid candidate cliques, i.e., nodes are repeated.
3. Throw away any candidate cliques that are subsumed by other candidate cliques.
4. Throw away any candidate cliques that are not completely connected.

The formula for measuring the fitness is:

$$F = \alpha c + \sum_{i=1}^{c} \beta^{n_i},$$

where $c$ = # of valid candidate maximal cliques and $n_i$ = # nodes in clique $i$. Both $\alpha$ and $\beta$ are constants which are configurable by the user. Note that $\beta$ must exceed a minimum bound; otherwise smaller connected components overwhelm the maximal clique. For example, consider the graph in Figure 3. It is clear that there is only one clique:

$$C_4 = \{\{1, 2, 3, 4\}\}.$$

But there are four candidate cliques of cardinality three:

$$C_3 = \{\{1, 2, 3\}, \{1, 2, 4\}, \{1, 3, 4\}, \{2, 3, 4\}\}.$$

The genetic search should favor $C_4$ over $C_3$, and the proper choice of $\beta$ will allow this.

[Figure 3 image: graph with nodes labeled 1 through 4.]

Figure 3: Example four node graph.
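The evaluation steps and the effect of the size reward can be illustrated with a small sketch. Several things here are assumptions rather than the author's implementation: candidate cliques are passed in already parsed (step 1 is skipped), the Figure 3 graph is taken to be the complete graph on nodes 1 to 4, the size reward is read as $\beta$ raised to the clique size, and the values of `alpha` and `beta` are arbitrary illustrations.

```python
from itertools import combinations


def evaluate(candidates, edges, alpha, beta):
    """Return (surviving cliques, fitness) for a list of candidate cliques."""
    # Step 2: throw away candidates in which a node is repeated
    valid = [frozenset(c) for c in candidates if len(set(c)) == len(c)]
    # The same clique expressed twice is rewarded only once
    valid = list(dict.fromkeys(valid))
    # Step 3: throw away candidates subsumed by another candidate
    kept = [c for c in valid
            if not any(c != other and c.issubset(other) for other in valid)]
    # Step 4: throw away candidates that are not completely connected
    kept = [c for c in kept
            if all(frozenset(pair) in edges for pair in combinations(c, 2))]
    fitness = alpha * len(kept) + sum(beta ** len(c) for c in kept)
    return kept, fitness


# Assumed edge set for Figure 3: every pair of the four nodes is connected
K4_EDGES = {frozenset(pair) for pair in combinations([1, 2, 3, 4], 2)}
C4 = [[1, 2, 3, 4]]
C3 = [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]]

for beta in (2, 6):
    f4 = evaluate(C4, K4_EDGES, alpha=1, beta=beta)[1]
    f3 = evaluate(C3, K4_EDGES, alpha=1, beta=beta)[1]
    print(f"beta={beta}: F(C4)={f4}, F(C3)={f3}")
```

With `alpha = 1`, a small `beta` of 2 scores the four three-node candidates higher (36 against 17), while a `beta` of 6 favors the single four-node clique (1297 against 868), which is the minimum-bound behavior described above. Note that a reward that grew only linearly in clique size could never favor $C_4$ here, since four cliques of three nodes would always outscore one clique of four.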
3.2 Role of Strong Typing
We utilize a strongly typed genetic programming (STGP) [21] system to investigate clique detection [16]. We use a STGP system instead of a canonical GP system to force type inheritance. One serious constraint on the user-defined terminals and functions in GP systems is closure, i.e. all of the functions must accept arguments of a single data type and return values of the same data type. This constraint forces all functions to return values that can be used as arguments for any other function. STGP allows for an additional level of typing to be added. We have extended STGP by adding type inheritance to allow for more than two levels of typing [16]. In the context of the clique detection domain, the types in essence force the chromosome to evolve "lists" of nodes.
4 Approach

An interesting observation is that the fitness function for the clique detector pares the chromosome down to the coding segments. The list of candidate cliques for a given chromosome succinctly encapsulates the content of that chromosome. Indeed each candidate clique can be thought of as a building block from which "better" chromosomes can be constructed. This method of paring down the chromosome is similar to the RNA splicing, mentioned earlier, in that non-coding segments are stripped out of the RNA transcript from DNA [1].

In GA research, if there are invalid bits in a chromosome and some algorithm exists to translate those bits into valid bits, then they can be repaired and the resultant chromosome evaluated to determine the fitness of the original chromosome. Some issues are whether or not to return the repaired chromosome into the population and at what rate of return [9, 28]. The repair operation is done at chromosome evaluation, not during the reproduction stage; there is no assurance that the repaired chromosome will even be selected for reproduction.

The evaluation function transforms chromosomes from GP space to clique set space, i.e. genotype to phenotype. The repair process is to take the phenotype and map it back into a genotype. In Section 3.1, we showed the evaluation function removes nodes that do not contribute to the fitness of the chromosome. Thus the resultant chromosome is likely to be smaller than the original. If the set of candidate cliques had been the empty set, and the chromosome had been selected for repair, then the repair would not take place.
4.1 Example Repair Process
An example chromosome for the 10 node graph (Figure 2) is presented in S-expression form in Figure 4. It has five candidate cliques, three of which are invalid. With the algorithm presented in Section 3.1, the only cliques from this list of candidate cliques are #2 and #5:

$$C = \{\{4, 8, 7\}, \{5, 6\}\}.$$

The other candidate cliques are eliminated because they violate one or more of the rules presented in Section 3: candidate clique #4 contains duplicate nodes, i.e. the node 7 is repeated; candidate clique #3 is subsumed by candidate clique #2; and candidate clique #1 is not completely connected.

[Figure 4 image: S-expression tree of ExtCon and IntCon nodes over terminal nodes, with subtrees labeled Candidate Clique #1 through Candidate Clique #5.]

Figure 4: S-expression for example 10 node graph.

At this point we have mapped the chromosome from GP space to clique set space. If this chromosome had been selected for repair, then the resultant mapping back into GP space would produce the S-expression shown in Figure 5. The repair process prunes dead branches of the S-expression.

[Figure 5 image: S-expression tree with an ExtCon root and IntCon nodes over the terminals 5, 6, 7, 4, and 8.]

Figure 5: Repaired S-expression for example 10 node graph.
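The two mappings involved in repair, genotype to phenotype and back, can be sketched with nested tuples standing in for S-expressions. Everything below is an illustrative assumption: the tuple encoding, the helper names, the exact tree shapes, and the sample chromosome, which is a hypothetical tree with the same five kinds of candidate cliques as the example and is not a transcription of Figure 4.

```python
def nodes_of(tree):
    """Flatten an IntCon subtree (or a bare node) into a list of nodes."""
    if isinstance(tree, int):
        return [tree]
    _, left, right = tree
    return nodes_of(left) + nodes_of(right)


def candidates_of(tree):
    """Genotype to phenotype: ExtCon separates candidates, IntCon joins nodes."""
    if isinstance(tree, tuple) and tree[0] == "ExtCon":
        return candidates_of(tree[1]) + candidates_of(tree[2])
    return [nodes_of(tree)]


def build(cliques):
    """Phenotype to genotype: a compact tree holding only the given cliques."""
    if not cliques:
        return None  # an empty clique set is not repaired

    def int_con(nodes):
        tree = nodes[-1]
        for node in reversed(nodes[:-1]):
            tree = ("IntCon", node, tree)
        return tree

    trees = [int_con(list(c)) for c in cliques]
    result = trees[-1]
    for sub in reversed(trees[:-1]):
        result = ("ExtCon", sub, result)
    return result


# Hypothetical chromosome with five candidate cliques
CHROMOSOME = (
    "ExtCon",
    ("ExtCon", ("IntCon", 3, ("IntCon", 5, 4)), ("IntCon", 4, ("IntCon", 8, 7))),
    ("ExtCon", ("IntCon", 7, 4),
     ("ExtCon", ("IntCon", 7, ("IntCon", 7, 8)), ("IntCon", 5, 6))),
)

print(candidates_of(CHROMOSOME))
# Cliques that survive evaluation in the example: {4, 8, 7} and {5, 6}
REPAIRED = build([[4, 8, 7], [5, 6]])
print(REPAIRED)
```

The repaired tree parses back to exactly the two surviving cliques, so its phenotype is unchanged while the dead branches are gone.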
4.2 Simple Repair
As the extraction of candidate cliques from the chromosome can be viewed as a repair process, we investigated various rates of return of the repaired chromosomes into the population. It was our conjecture that the genotypes of chromosomes which succinctly capture the phenotype of the chromosome were more elegant and natural. It should be noted that non-coding segments can be inserted and deleted by evolution in DNA [8].

For all our experiments, we utilize a population size of 2000, and each run consists of 600 generations. The graph, for which cliques are detected, is the 10 node graph shown in Figure 2. All chromosomes undergo repair, and we investigate repair rates, i.e. the percentage of repaired chromosomes returned into the population, of 0%, 0.5%, 1.5%, 3%, 5%, and 10%. We average our fitness curves over 10 runs of the clique detector by providing different initial seeds to the random number generator.

We found that repair rates greater than 0.5% (footnote 4) either degraded the performance or caused premature convergence, see Figure 6. (These results are statistically significant with a two-tail t-test and a confidence level of 0.001.) A natural question at this point is why does repairing work in GA applications, but not in GP applications? Perhaps it is just this particular domain for which repair fails in GP applications. Or perhaps the repair process is actually damaging the chromosome instead of fixing it up.

[Figure 6 plot: Fitness (0 to 8000) versus Generation (0 to 600); curves R0, R.5Q0, R1.5Q0, R3, R5, and R10.]

Figure 6: Best fitness for base case and repair rates of 0.5%, 1.5%, 3%, 5%, and 10%.

Exactly what is it that the repair process is doing? It is removing "dead" or non-coding bits from the chromosome, i.e. those bits which do not contribute, either positively or negatively, to the calculation of the fitness of the chromosome. In GA research, by contrast, the repair process does not remove any bits. Also, the repair process is removing genetic diversity. Finally, the repair process is removing any naturally occurring duplicate non-coding segments. Thus the protective backup feature of these segments is being negated.

As discussed in Section 2, genetic-based encoding research has shown that this non-coding material protects building blocks from the effects of destructive crossover. In this paper, we discuss experiments in which we insert non-coding segments into the chromosome to investigate whether this results in an increase in the fitness of chromosomes.

Footnote 4: We know from Orvosh and Davis [28] that small repair rates are desirable.
4.3 Repair with Duplication
Further research was performed in which the repaired chromosome was duplicated before it was returned to the population. For example, the chromosome represented in Figure 7 has been duplicated into the chromosome in Figure 8. Note that while the genotypes of these two chromosomes are different, the phenotypes are exactly the same, i.e. both chromosomes evaluate to the same fitness. In effect, a non-coding segment has been added to the chromosome.

[Figure 7 image: S-expression tree of ExtCon and IntCon nodes over the terminals 6, 2, 7, 4, and 8.]

Figure 7: S-expression for generation 0's best chromosome.

[Figure 8 image: S-expression tree whose root ExtCon joins two copies of the tree in Figure 7.]

Figure 8: S-expression for generation 0's best chromosome doubled.
If we examine Figure 7, it is evident that crossover will be destructive for this chromosome. Any point selected for crossover will break up a building block. However, it is equally evident that for Figure 8 crossover cannot be completely destructive. If the crossover mechanism selects a point anywhere to the left of the root, i.e., in the left subtree of the root, then the right subtree of the root will remain intact. The child which "inherits" the right subtree will have a fitness greater than or equal to that of the parent. A similar argument holds if the crossover point is in the right subtree. If the root is selected as the crossover point, then the child inheriting the whole tree will still have a lower bound of the fitness of this parent. Note that while the non-coding segment is redundant in the parent, it will probably not be in the child. Indeed it will only be redundant if the other parent already contains the coding segment.
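The survival argument can be checked by brute force over every crossover point. This is a minimal sketch under stated assumptions: S-expressions are nested tuples, the coding segment is a hypothetical repaired tree, extra copies are attached with ExtCon in a left-deep chain, and the string `"FOREIGN"` stands in for whatever subtree arrives from the other parent.

```python
def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from subtrees(tree[i], path + (i,))


def replace(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` swapped for `new`."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)


def contains(tree, block):
    """True if an intact copy of `block` appears somewhere in `tree`."""
    return any(sub == block for _, sub in subtrees(tree))


def duplicate(block, copies=1):
    """Join the coding segment with `copies` extra copies of itself."""
    tree = block
    for _ in range(copies):
        tree = ("ExtCon", tree, block)
    return tree


def survives_every_crossover(tree, block):
    """For each crossover point, is the block intact in the part that stays
    or in the part that is handed to the other child?"""
    return all(
        contains(replace(tree, path, "FOREIGN"), block) or contains(sub, block)
        for path, sub in subtrees(tree)
    )


# Hypothetical coding segment (a repaired chromosome)
BLOCK = ("ExtCon", ("IntCon", 5, 6), ("IntCon", 7, ("IntCon", 4, 8)))
DOUBLED = duplicate(BLOCK)

print(survives_every_crossover(BLOCK, BLOCK))    # single copy
print(survives_every_crossover(DOUBLED, BLOCK))  # one duplicate added
```

For the single copy, any crossover point below the root breaks the block in both resulting pieces, whereas in the doubled tree at least one piece always carries an intact copy.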
5 Experimental Results

Our conjecture from Section 4.3 is that the chromosome in Figure 8 should aid in the genetic search for all of the cliques in the graph. At least one of the children will be as fit as the repaired parent. The curve R0 in Figure 9 presents the learning curve for the clique detector with no repairs taking place. On the average, the solution first appears in generation 354.

The first experiment we conducted was to inject repairs with a 0.5% probability into the population. The curve R.5Q1 in Figure 9 is the result after adding one duplicate of the coding segment during the repair process. The solution is found at about generation 335. With a single duplicate, the benefit of duplication does not appear to be significant.

If we examine what the process is doing, we see that if the repaired chromosome is selected for crossover, the building block should last for at least one generation. Can we force the building block to last longer, i.e. can we cause the building block to propagate through more than one generation? Yes, if we add more than one copy of the building block to the repaired chromosome.

If we assume that we create only non-coding segments such that the total number of instances of the coding block is a power of two, then we can perform some worst and best case analysis as to the survivability of the coding segment. In both cases, we assume that only one parent has copies of the coding segment. If we consider the tree formed having the "roots" of the coding subtrees as terminals, then we have a complete binary tree of depth $\log_2 cs$, with $cs$ being the number of instances of coding segments. The worst case is that the block will survive for $\log_2 cs - 1$ generations (footnote 5). In the best case, the block will survive for a number of generations equal to the sum of the number of edges and the number of terminal nodes (footnote 6). This is simply $3cs$.

We conducted further experimentation in which we added three and seven copies of the coding segment.
The curve R.5Q3 in Figure 9 utilizes three backups of the coding segment. Notice that the solution appears around generation 246. This is a savings of about 108 generations. In curve R.5Q7 in Figure 9 we present the results of having seven duplicates. The solution appears around generation 171. Again, the computational savings would be significant, i.e. 183 generations. (All results are significant as indicated by two-tail t-tests with a confidence level of 0.001.)

Footnote 5: We also do not account for mutation in these cases.
Footnote 6: Each block lasts until all of the crossover points above it have been chosen, and none have been chosen inside it.

[Figure 9 plot: Fitness (0 to 8000) versus Generation (0 to 600); curves R0, R.5Q0, R.5Q1, R.5Q3, and R.5Q7.]

Figure 9: Best fitness for base case and a repair rate of 0.5% with 0, 1, 3, and 7 duplications.

Finally, in Figure 10, we present the results for a repair rate of 10%. We also show the average appearance of the optimal solution in Table 1. At a repair rate of 10% and with 7 duplicates of the coding segment, the optimal solution appears on average at generation 56, which is a significant savings of 298 generations over no repair (paired-sample test with a confidence level of 0.001), and 115 generations over 0.5% repair with 7 duplications (two-tail t-test with a confidence level of 0.001).

Duplicates    Appearance (generation)
0             Not found
1             223
3             63
7             56

Table 1: Average appearance of optimal solution for a repair rate of 10%.

[Figure 10 plot: Fitness (0 to 8000) versus Generation (0 to 600); curves R0, R10Q0, R10Q1, R10Q3, and R10Q7.]

Figure 10: Best fitness for base case and a repair rate of 10% with 0, 1, 3, and 7 duplications.

We have found that in general:

- Complete removal of non-coding segments causes premature convergence.
- As duplicates of the coding segment are added to the chromosome, the learning improves.
- As the repair rate increases, and more than one duplicate of the coding segment is added to the chromosome, the learning increases. This contradicts the findings reported by Orvosh and Davis [28].
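The generation savings quoted in this section follow directly from the reported average appearance of the optimal solution, and can be recomputed as a quick consistency check. The sketch below only restates numbers given in the text; the dictionary layout and variable names are illustrative.

```python
# Average generation at which the optimal solution first appears (from the text)
BASELINE = 354  # no repair, curve R0
APPEARANCE = {
    ("0.5%", 3): 246,
    ("0.5%", 7): 171,
    ("10%", 7): 56,
}

# Generations saved relative to no repair, and the same as a fraction of the baseline
savings = {key: BASELINE - gen for key, gen in APPEARANCE.items()}
relative = {key: saved / BASELINE for key, saved in savings.items()}

for key in APPEARANCE:
    rate, copies = key
    print(f"repair {rate}, {copies} duplicates: "
          f"{savings[key]} generations saved ({relative[key]:.0%})")

# Gain of 10% repair over 0.5% repair, both with 7 duplicates
print(APPEARANCE[("0.5%", 7)] - APPEARANCE[("10%", 7)])
```

The computed savings are 108, 183, and 298 generations, and the difference between the two seven-duplicate settings is 115 generations, in agreement with the figures reported above. The last setting corresponds to roughly 84% fewer generations than the no-repair baseline.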
6 Conclusions

We are in effect utilizing the tree structure of GP chromosomes to conduct experimentation into variable length GA chromosomes. Our function and terminal sets, presented in Section 3.1, are not programming structures, but rather connectors for data structures. The "programming" aspect is handled by the fitness function, which is similar to how the GA fitness function translates the binary strings into the problem domains. The variable length nature of the GP chromosome combined with the position independence imparted by the fitness function provides us with an approximation of a messy GA [12, 13].

We have seen that the duplication of three or more copies of the coding segments significantly speeds up the learning process for the clique detection problem. In particular, we have shown that with seven copies of the coding segment, we can at least halve the computational effort of finding the optimal solution, and at best we have shown an 84% reduction in the number of generations needed to find the optimal solution compared with no repair and duplication at all.

While the clique detection domain readily lends itself to the study of building blocks in the genetic programming chromosome, our results are not domain dependent. The duplication of subtrees has been mentioned in passing within the GP literature (see Section 2.2) and discussed informally within the GP mailing list. Some analysis will show that this method can work for any GP domain. Simple editing rules for GP chromosomes have been identified [18]. The methods used by compiler writers to optimize code are also applicable to "optimizing" the GP chromosome, which is after all a parse tree in canonical form.

An example of the repair and duplication process for other domains is shown in Figure 11. The parse tree to be evaluated is shown in Figure 11(a). The left subtree of the root node is True, which causes the middle subtree to be a coding segment and the right subtree to be a non-coding segment. The tree could be pruned, leaving only the middle subtree. The IFTE (footnote 7) function can be used to add a duplicate of the coding segment. This is shown in Figure 11(b). The utility of the IFTE function in creating duplication is discussed in Angeline [4].

[Figure 11 image: two parse trees rooted at IFTE. (a) First argument True, with a subtree multiplying Y and .007 and a subtree applying sin to X. (b) First argument False, with two copies of the subtree multiplying Y and .007.]

Figure 11: GP parse trees revealing how duplication can be accomplished. (a) The right subtree of the IFTE node is non-coding. (b) A duplicate of the coding segment from (a) has been created. Note that whether the first argument evaluates to either True or False is immaterial.
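The pruning and duplication steps of Figure 11 can be sketched with a tiny expression evaluator. The tuple encoding, the helper names, and the exact operand order of the trees are assumptions made for illustration; only the function set (IFTE, multiplication, sin) and the constants come from the figure.

```python
import math


def evaluate(tree, env):
    """Evaluate a parse tree given variable bindings in `env`."""
    if isinstance(tree, tuple):
        op, *args = tree
        if op == "IFTE":
            cond, then_branch, else_branch = args
            chosen = then_branch if evaluate(cond, env) else else_branch
            return evaluate(chosen, env)
        if op == "*":
            return evaluate(args[0], env) * evaluate(args[1], env)
        if op == "sin":
            return math.sin(evaluate(args[0], env))
        raise ValueError(f"unknown function {op!r}")
    if isinstance(tree, str):
        return env[tree]  # variable lookup
    return tree  # numeric or boolean constant


def repair(tree):
    """Prune an IFTE whose condition is a constant, keeping the coding branch."""
    if isinstance(tree, tuple) and tree[0] == "IFTE" and isinstance(tree[1], bool):
        return repair(tree[2] if tree[1] else tree[3])
    return tree


def duplicate(coding, condition=False):
    """Wrap the coding segment in an IFTE whose two branches are identical."""
    return ("IFTE", condition, coding, coding)


# Assumed reading of Figure 11(a): the sin subtree is never evaluated
TREE_A = ("IFTE", True, ("*", "Y", 0.007), ("sin", "X"))
CODING = repair(TREE_A)
TREE_B = duplicate(CODING)

ENV = {"X": 1.0, "Y": 2.0}
print(CODING, evaluate(TREE_A, ENV), evaluate(TREE_B, ENV))
```

Because both branches of the duplicated tree hold the coding segment, the result is the same whichever value the condition takes, which is the point made in the caption of Figure 11.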
7 Future Work

This research can be extended on several fronts. As mentioned in Section 4, the repair process takes place during chromosome evaluation, and thus there is no guarantee that the repaired chromosome will survive into the next generation. A new crossover function could be introduced which takes a single parent and produces a repaired child. Another such function could generate two children: one by straight selection, and the other by the repair process.

We are planning to investigate this extension in a propositional inference domain. Our preliminary research shows that linear non-coding segments, similar to those reported by McPhee and Miller [20], naturally appear in this domain. Furthermore, building blocks are not as easy to identify in this domain as in the clique detection, and the optimal solution should have no duplicates of the coding segment.

Another research thread is to investigate the utility of duplicated chromosomes as a library for evolution as discussed in Section 2. Some preliminary work in utilizing already evolved code can be found in Seront [31]. Rosca [29, 30] also utilizes evolved code in the form of evolved subroutines in his adaptive representation through learning (ARL) mechanism, which allows him to automatically detect building blocks in GP.

A final thread is how to incorporate explicit duplication of building blocks into GA chromosomes. Wu [35] has investigated floating building blocks inside GA chromosomes, and has reported the duplication of building blocks inside the chromosome.

Footnote 7: IFTE is short for IfThenElse, i.e. if the first argument is True, then evaluate the second argument, else evaluate the third argument.
Acknowledgements

I want to thank Cory Hoelting for some discussions on this research. I also thank the anonymous reviewers, Justinian Rosca, Sandip Sen and Annie Wu for reviewing draft copies of the paper. Finally, I thank Mark Lindsay for allowing me access to workstations in his computer lab.
References [1] Bruce Alberts, Dennis Bray, Julian Lewis, Martin Ra, Keith Roberts, and James D. Watson. Molecular Biology of the Cell. Garland Publishing, Inc., 1989. [2] Lee Altenberg. The evolution of evolvability in genetic programming. In Kenneth E. Kinnear, Jr., editor, Advances in Genetic Programming. MIT Press, 1994. [3] P. J. Angeline and J. B. Pollack. Coevolving high-level representations. In C. G. Langton, editor, Arti cial Life III, SFI Studies in the Sciences of Complexity, volume XVII. Addison-Wesley, 1991. [4] Peter John Angeline. Genetic programming and emergent intelligence. In Kenneth E. Kinnear, Jr., editor, Advances in Genetic Programming. MIT Press, 1994. [5] Tobias Blickle and Lothar Thiele. Genetic programming and redundancy. In J. Hopf, editor, Genetic Algorithms within the Framework of Evolutionary Computation (Workshop at KI-94, Saarbrucken), pages 33{38. Max-Planck-Institut fur Informatik (MPI-I94-241), 1994. [6] Thang Nguyen Bui and Paul H. Eppley. A hybrid genetic algorithm for the maximum clique problem. In Larry Eshelman, editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 478{484, San Francisco, CA, 1995. Morgan Kaufmann Publishers, Inc. [7] R. Chandraasekharam, S. Subhramanian, and S. Chaudhury. Genetic algorithm for node partioning problem and applications in VLSI design. IEE Proceedings, Part E: Computers and Digital Techniques, 140(5):255{260, Sep 1993. [8] James Darnell, Harvey Lodish, and David Baltimore. Molecular Cell Biology. Scienti c American Books, 1990. 15
[9] Lawrence Davis, David Orvosh, Anthony Cox, and Yuping Qiu. A genetic algorithm for survivable network design. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 408-415, Champaign, IL, 1993. Morgan Kaufmann.
[10] Stephanie Forrest and Melanie Mitchell. Relative building block fitness and the building block hypothesis. In D. Whitley, editor, Foundations of Genetic Algorithms 2, pages 109-126. Morgan Kaufmann, 1992.
[11] Douglas J. Futuyma. Evolutionary Biology. Sinauer Associates, Sunderland, MA, 1986.
[12] David Goldberg, Kalyanmoy Deb, and Bradley Korb. Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems, 3:493-530, 1989.
[13] David Goldberg, Kalyanmoy Deb, and Bradley Korb. Messy genetic algorithms revisited: Studies in mixed size and scale. Complex Systems, 4:415-444, 1990.
[14] David E. Goldberg. Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, Reading, MA, 1989.
[15] Thomas Haynes. Clique detection via genetic programming. Technical Report UTULSA-MCS-95-02, The University of Tulsa, April 24, 1995.
[16] Thomas Haynes, Dale Schoenefeld, and Roger Wainwright. Type inheritance in strongly typed genetic programming. In Kenneth E. Kinnear, Jr. and Peter J. Angeline, editors, Advances in Genetic Programming 2, chapter 18. MIT Press, 1996.
[17] Kenneth Kalmanson. An Introduction to Discrete Mathematics and its Applications. Addison-Wesley, 1986.
[18] John R. Koza. Genetic Programming: On the Programming of Computers by Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
[19] James R. Levenick. Inserting introns improves genetic algorithm success rate: Taking a clue from biology. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 123-127. Morgan Kaufmann, 1991.
[20] Nicholas Freitag McPhee and Justin Darwin Miller. Accurate replication in genetic programming. In L. Eshelman, editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 303-309. Morgan Kaufmann, 1995.
[21] David J. Montana. Strongly typed genetic programming. Evolutionary Computation, 3(2):199-230, 1995.
[22] Ammanamanchi Srinivasa Murthy, Guturu Parthasarthy, and V. U. K. Sastry. Clique finding - a genetic approach. In Proceedings of the First IEEE Conference on Evolutionary Computation, pages 18-21, Piscataway, NJ, 1995. IEEE.
[23] Peter Nordin. A compiling genetic programming system that directly manipulates the machine code. In Kenneth E. Kinnear, Jr., editor, Advances in Genetic Programming. MIT Press, 1994.
[24] Peter Nordin. Explicitly defined introns and destructive crossover in genetic programming. In P. Angeline and K. E. Kinnear, Jr., editors, Advances in Genetic Programming 2. MIT Press, 1996.
[25] Peter Nordin and Wolfgang Banzhaf. Complexity compression and evolution. In L. Eshelman, editor, Genetic Algorithms: Proceedings of the Sixth International Conference (ICGA95), pages 310-317, San Francisco, CA, USA, July 1995. Morgan Kaufmann.
[26] Peter Nordin, Frank Francone, and Wolfgang Banzhaf. Explicitly defined introns and destructive crossover in genetic programming. In Justinian P. Rosca, editor, Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications, pages 6-22, July 1995.
[27] Una-May O'Reilly. An Analysis of Genetic Programming. PhD thesis, Carleton University, Ottawa-Carleton Institute for Computer Science, Ottawa, Ontario, Canada, 22 September 1995.
[28] David Orvosh and Lawrence Davis. Shall we repair? Genetic algorithms, combinatorial optimization, and feasibility constraints. In Proceedings of the Fifth International Conference on Genetic Algorithms, page 650. Morgan Kaufmann, 1993.
[29] Justinian Rosca. Towards automatic discovery of building blocks in genetic programming. In E. S. Siegel and J. R. Koza, editors, Working Notes for the AAAI Symposium on Genetic Programming, pages 78-85, Menlo Park, CA, 10-12 November 1995. AAAI.
[30] Justinian Rosca and Dana H. Ballard. Discovery of subroutines in genetic programming. In P. Angeline and K. E. Kinnear, Jr., editors, Advances in Genetic Programming 2. MIT Press, 1996.
[31] Gregory Seront. External concepts reuse in genetic programming. In E. S. Siegel and J. R. Koza, editors, Working Notes for the AAAI Symposium on Genetic Programming, pages 94-98, Menlo Park, CA, 10-12 November 1995. AAAI.
[32] Walter Alden Tackett. Genetic programming for feature discovery and image discrimination. In Proceedings of the 5th International Conference on Genetic Algorithms, ICGA-93. Morgan Kaufmann, 1993.
[33] Walter Alden Tackett. Mining the genetic program. IEEE Expert, 10(3):28-38, June 1995.
[34] Annie S. Wu. http://netq.rowland.org/wu/p.html. (NetQ is a question/answer forum for authors of Evolutionary Computation papers), 1995.
[35] Annie S. Wu. Non-coding segments and floating building blocks for the genetic algorithm. PhD thesis, University of Michigan, December 1995.
[36] Annie S. Wu and Robert K. Lindsay. Empirical studies of the genetic algorithm with non-coding segments. Evolutionary Computation, 3(2), 1995.
[37] Annie S. Wu, Robert K. Lindsay, and Michael D. Smith. Studies on the effect of non-coding segments on the genetic algorithm. In Proceedings of the 6th IEEE International Conference on Tools with Artificial Intelligence, New Orleans, LA, November 1994.
[38] Byoung-Tak Zhang and Heinz Muehlenbein. Balancing accuracy and parsimony in genetic programming. Evolutionary Computation, 3(1):17-38, 1995.