Evolving Self-Reproducing Programs

R. I. (Bob) McKay and D. Essam
School of Computer Science, University College, University of New South Wales, ADFA, Northcott Drive, Campbell, ACT, Canberra, Australia
Ph: 61-2-6268 8169, 61-2-6268 8167; Fx: 61-2-6268 8581
{rim, daryl}@cs.adfa.edu.au

Abstract- This paper investigates the problem of evolving a self-reproducing program. It is hoped that the analysis of this process might aid the understanding of the process by which the first self-reproducing molecules gave rise to life. We see this work as the meta-level of the well-known investigations into the biochemical precursors to life. This paper presents a sample self-reproducing program, defines the grammar for that program, and then presents the results obtained when a grammar-guided genetic programming system attempts to find a self-reproducing program for itself.

Keywords: Genetic Programming, Self Reproduction, Grammar Guided Genetic Programming

1 Introduction

This paper investigates self-reproducing programs. By self-reproduction we mean a program which outputs, or returns, the source code of the program itself. The goal of this work is to determine whether a program can evolve to be self-reproducing from some initial random population of programs, and ultimately to study the conditions that are necessary for that to be successful. From this process we hope to gain insight into the algorithmic aspects of the chemical process whereby self-reproducing molecules were first generated.

Von Neumann first presented a self-reproducing architecture in 1949. This architecture was a description of a process whereby a Turing Machine could produce a replicate of itself. In contrast, this work attempts to evolve a self-reproducing program written in a simple functional language.

The following sections of the paper cover aspects of the chemical process of self-reproduction. The paper then presents a grammar describing the functional programming language, and hence the search space of the evolutionary process. We then present a sample self-reproducing program, followed by the design, and the results, of our attempts to evolve a self-reproducing program. We conclude with a discussion of those results and possible future work.

2 The Origins of Life

The origins of life are still a topic of considerable debate. However, there appears to be a consensus amongst biochemists that the first identifiable forms of life were self-reproducing molecules of either RNA or protein, or perhaps a hybrid, with the former apparently more favoured (Voet & Voet, 1995). From the work of Miller and Urey (1953) on, there has been strong interest in the chemical processes by which life might have emerged. While this work is highly speculative, one thing which conclusively emerges is that life on Earth, in anything like its present form, could not have arisen by random search alone.

This can be seen from a simple combinatorial argument. The molecules involved occupy a volume of at least some cubic angstrom units. The Earth's volume is of the order of 10^36 cubic angstrom units. The available time from the Earth's beginnings to the first evidence of cellular life is at most 2×10^9 years, or 10^23 microseconds. The reactions creating new RNA or protein molecules take at least some microseconds. Hence the number of alternatives tested is at most 10^59 combinations. Given the 4 base pairs of RNA, breadth-first search in this time could only test chains up to 100 base pairs long. Given the 22 amino acids occurring in proteins, the situation there is even more restricting, with the search extending only to chains of length 54. In neither case is it conceivable that a self-replicating machinery could be encoded in chains of this length.

The situation is more complex if Hoyle and Wickramasinghe's (1978) controversial argument of intergalactic seeding is accepted. In that case, the available search mechanism is many orders of magnitude larger, but the bounds in the arguments above can be tightened considerably, particularly by restricting attention to the available carbon atoms in carbon-rich environments, so that even so, it seems unlikely that random search could be involved.
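The arithmetic behind the combinatorial argument can be reproduced with a short calculation. The following is only a sketch under the figures quoted above (10^36 sites, 10^23 time steps, one trial per site per step); the function name `max_chain_length` is introduced here for illustration. It counts how long a chain can get before an exhaustive, shortest-first enumeration exceeds the trial budget.

```python
def max_chain_length(alphabet_size: int, trials: int) -> int:
    """Largest L such that all chains of length 1..L fit within `trials` tests."""
    tested = 0
    length = 0
    # Extend the search by one more chain length while the budget allows it.
    while tested + alphabet_size ** (length + 1) <= trials:
        length += 1
        tested += alphabet_size ** length
    return length


# Figures taken from the text (orders of magnitude only).
SITES = 10 ** 36          # Earth's volume in cubic angstrom units
TIME_STEPS = 10 ** 23     # available time in microseconds
TRIALS = SITES * TIME_STEPS  # at most 10^59 alternatives tested

if __name__ == "__main__":
    print("4-letter alphabet (RNA):", max_chain_length(4, TRIALS))
    print("22-letter alphabet (protein):", max_chain_length(22, TRIALS))
```

Under these exact inputs the sketch gives 97 for the 4-letter alphabet, in line with the figure of about 100 above, and 43 for the 22-letter alphabet, which is shorter than the quoted 54. Either way the conclusion stands: chains of this length are far too short to encode self-replicating machinery.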
If random search is unlikely to explain the emergence of life, then we need to look more closely at non-random search mechanisms to see if they can offer an explanation. We see this work as a first step in that direction. Biologists have suggested that the non-random nature of the search involved a semi-evolutionary process, in which molecules which tended to create molecules of similar structure would first become more common. Then increased fidelity of copying would be selected for, until eventually full self-reproduction, one of the primary characteristics of life, would be achieved.

Our long-term intention is to evaluate the sufficiency of this search mechanism experimentally. Of course, we have nowhere near the search resources available to us that were available in the emergence of life. However, we have chosen to search a much smaller search space, using a carefully crafted functional language in which self-reproduction is much more readily expressed. As a first step, to impose a bound on the scale of the work, we have attempted to evolve self-reproduction making use of a perfectly functioning evolutionary system (i.e. genetic programming). Presumably this search will be more efficient than the biochemical semi-evolution described above. If we could not evolve self-reproduction in this system, then it would be highly unlikely that we could successfully simulate the semi-evolutionary search previously described. This paper reports on the results of the first step.

3 Details of Approach

This work first identified a sample self-reproducing program, and a small programming language in which that program could be encoded. The grammar for the language in which the program is written is as follows:

S -> M
M -> IF M M M
M -> JOIN M M
M -> QUOTE M
M -> FUN M
M -> X
M -> UNDEF

Table 1: Grammar for functional language

The language defined by the grammar is a programming language whose terms may be interpreted as trees:

        IF
      /  |  \
     X QUOTE X
         |
       UNDEF

Its semantics are as follows:

IF A B C: if A is null, execute B, else execute C.
JOIN M M2: use a depth-first search to find the first UNDEF in the tree M, then return the tree that is the result of replacing that UNDEF with M2.
QUOTE M: return the tree M.
FUN M: return the result of executing the program again recursively, with X set to the value returned by M. FUN UNDEF returns UNDEF. A recursive call deeper than a predefined level of recursion (5) returns a special value LOOP, which is passed through unchanged by all other functions.
X: return the value of X. The program is assumed to start with the initial value of X being null.
UNDEF: return UNDEF.

Therefore the sample tree above will be evaluated as:

UNDEF

This is because X is initially null, so the first IF branch is executed; that branch is QUOTE UNDEF, which returns UNDEF. A slightly more complex example:

(JOIN (QUOTE (IF X UNDEF UNDEF)) X)

where (QUOTE (IF X UNDEF UNDEF)) returns (IF X UNDEF UNDEF) and the JOIN then searches for the first UNDEF and then returns (IF X X UNDEF).
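The semantics above are compact enough to sketch as a small interpreter. What follows is a minimal illustration, not the authors' implementation: the tuple encoding of trees, the use of Python's `None` for null, and the names `parse`, `run` and `replace_first_undef` are all assumptions made here. The depth-first search is taken to be pre-order and left to right, and a null value passed to JOIN is spliced in unchanged, since the text does not specify that case.

```python
NULL = None        # assumed encoding of the initial "null" value of X
LOOP = "LOOP"      # special value returned when recursion goes too deep
MAX_RECURSION = 5  # recursion limit stated in the text


def parse(text):
    """Parse an s-expression such as '(IF X (QUOTE UNDEF) X)' into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").upper().split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume ")"
        return tuple(items)

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def replace_first_undef(tree, new):
    """Replace the first UNDEF found by a pre-order, left-to-right search."""
    if tree == "UNDEF":
        return new, True
    if isinstance(tree, tuple):
        for i in range(1, len(tree)):  # index 0 holds the operator name
            child, done = replace_first_undef(tree[i], new)
            if done:
                return tree[:i] + (child,) + tree[i + 1:], True
    return tree, False


def run(program, x=NULL, depth=0):
    """Execute `program` with the given value of X; FUN re-runs the whole program."""
    if depth > MAX_RECURSION:
        return LOOP

    def ev(node):
        if node == "X":
            return x
        if node == "UNDEF":
            return "UNDEF"
        op = node[0]
        if op == "QUOTE":
            return node[1]  # the quoted tree is returned unevaluated
        if op == "IF":
            cond = ev(node[1])
            if cond == LOOP:
                return LOOP
            return ev(node[2]) if cond is NULL else ev(node[3])
        if op == "FUN":
            arg = ev(node[1])
            if arg == "UNDEF" or arg == LOOP:
                return arg  # FUN UNDEF returns UNDEF; LOOP passes through
            return run(program, arg, depth + 1)
        if op == "JOIN":
            left, right = ev(node[1]), ev(node[2])
            if left == LOOP or right == LOOP:
                return LOOP
            return replace_first_undef(left, right)[0]
        raise ValueError(f"unknown node: {node!r}")

    return ev(program)


if __name__ == "__main__":
    sample = parse("(IF X (QUOTE UNDEF) X)")
    print(run(sample))  # the sample tree evaluates to UNDEF
    evolved = parse(
        "(join (quote (join (quote undef)"
        " (fun (if (if x x x) (quote x) (if undef x undef)))))"
        " (fun (if (if x x x) (quote x) (if undef x undef))))"
    )
    print(run(evolved) == evolved)
```

The second check uses the evolved program reported in Section 6, with its parentheses balanced according to the line-by-line explanation given there. Under the assumptions of this sketch, that program returns its own tree.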

4 Sample Self-Reproducing Program

To verify that the language could indeed be used to encode a self-reproducing program, we developed one such. Omitting many UNDEF, QUOTE and JOIN components for clarity of explanation, it is:

(IF X                                        (1)
  (FUN (LINE 3))                             (2)
  (JOIN (JOIN (JOIN (QUOTE                   (3.1)
    (IF X UNDEF UNDEF) (FUN UNDEF)) X) X)    (3.2)
)

where LINE 3 is replaced by the text of line 3. That is:

(IF X
  (FUN ((JOIN (JOIN (JOIN (QUOTE (IF X UNDEF UNDEF) (FUN UNDEF)) X) X))
  (JOIN (JOIN (JOIN (QUOTE (IF X UNDEF UNDEF) (FUN UNDEF)) X) X)
)

This self-replicates by the following logic: the first time the program is executed, X is null, so the first IF causes FUN to be executed, with X having the value of LINE 3. During that run the code returned is then (IF X (FUN LINE3) LINE3), i.e. the tree matching that of the original program.

The full program is given below:

(IF X
    (FUN (QUOTE (JOIN (JOIN (JOIN (QUOTE (IF UNDEF UNDEF UNDEF))
                                  (QUOTE X))
                            (JOIN (QUOTE (FUN (QUOTE UNDEF))) X))
                      X)))
    (JOIN (JOIN (JOIN (QUOTE (IF UNDEF UNDEF UNDEF))
                      (QUOTE X))
                (JOIN (QUOTE (FUN (QUOTE UNDEF))) X))
          X))

This program suggests another aspect of the evolution of self-reproduction. Typically, programs developed by genetic programming suffer from `bloat'. That is, even a successful program, if the evolutionary process is continued, will usually become larger. This often occurs by the program introducing `introns' into itself. Introns are sections of code that do not alter the result; e.g. x might change into x+x-x. In the grammar used for this paper, (join undef undef) is equivalent to undef. The study of the causes of introns and their prevention is an important continuing research question. However, note that in the above program, line 3 is replicated. Thus, if an intron were introduced into the code, it too would have to be replicated. We expect that this would be relatively unlikely to occur by chance. Thus we expect that, for this problem, there will be strong evolutionary pressure against bloat.

5 Experimental Design

A number of runs were performed using various fitness functions. All functions were designed to measure the similarity between a program tree and the program tree it produces when run. Each successive fitness function was carefully designed to avoid problems occurring in the use of the preceding fitness function. Each of the fitness functions was executed for 10 runs with the following parameters:

PARAMETER                            VALUE
Max Generations                      100
Population Size (Generation 1)       1000
Population Size (Generation > 1)     500
Max depth (initial pop)              12
Max depth (subsequent)               12
Tournament size                      3
Crossover Probability                0.9
Mutation Probability                 0.1

The first two fitness measures simply examine the differences in size between the two trees, with the second adding a large penalty term for very small trees. The remainder measure the difference between the trees more directly, giving a fixed penalty for every node in each tree that does not occur in the corresponding position in the other. The third uses the difference alone. This turned out to be a bad decision, since it exerts excessive parsimony pressure and leads to early stagnation. The fourth then divides the fitness by the sum of the sizes of the trees (i.e. the fitness measure measures the proportionate difference rather than the absolute difference between the trees). The fifth fitness function adds a penalty for not incorporating the 'FUN' function symbol, while the sixth adds, in addition, a penalty term for omitting 'IF'; both of these were added to avoid causes of early stagnation. The specific fitness functions used were:

1. the difference in size between two trees
2. the difference in size between two trees, with a penalty term for very small trees
3. the difference between two trees
4. the difference between two trees divided by the sum of the sizes of the trees
5. the proportionate difference between two trees, unless the trees did not execute a 'fun' term, in which case the fitness value was set to an arbitrarily large number
6. the proportionate difference between two trees, unless the trees did not execute both a 'fun' term and an 'if' term, in which case the fitness value was set to an arbitrarily large number
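The shape of these fitness measures can be sketched as follows. This is an illustrative outline only. Trees are encoded as nested tuples, and the penalty constants and the size threshold are hypothetical values chosen here. The tree-difference measure itself is passed in as a number, since its exact definition is not reproduced. The gating on 'FUN' and 'IF' checks whether the symbols occur in the program, which is a simplification of checking that they were executed.

```python
LARGE_PENALTY = 1e9         # stand-in for the "arbitrarily large number" (assumed value)
MIN_SIZE = 5                # hypothetical threshold for a "very small" tree
SMALL_TREE_PENALTY = 100.0  # hypothetical penalty for very small trees


def size(tree) -> int:
    """Number of nodes in a tree encoded as nested tuples (operator first)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def symbols(tree) -> set:
    """Set of all operator and terminal symbols occurring in the tree."""
    if isinstance(tree, tuple):
        found = {tree[0]}
        for child in tree[1:]:
            found |= symbols(child)
        return found
    return {tree}


def fitness_size(program, output) -> float:
    """Fitness 1: difference in size between the program and its output."""
    return float(abs(size(program) - size(output)))


def fitness_size_min(program, output) -> float:
    """Fitness 2: size difference plus a penalty for very small programs."""
    penalty = SMALL_TREE_PENALTY if size(program) < MIN_SIZE else 0.0
    return fitness_size(program, output) + penalty


def fitness_proportionate(difference, program, output) -> float:
    """Fitness 4: a tree difference divided by the sum of the tree sizes."""
    return difference / (size(program) + size(output))


def fitness_required(difference, program, output, required=("FUN", "IF")) -> float:
    """Fitness 5 and 6 (simplified): large penalty unless required symbols occur."""
    if not set(required) <= symbols(program):
        return LARGE_PENALTY
    return fitness_proportionate(difference, program, output)


def quote_chain(depth):
    """Build (QUOTE (QUOTE ... X)) with `depth` nested quotes."""
    tree = "X"
    for _ in range(depth):
        tree = ("QUOTE", tree)
    return tree


if __name__ == "__main__":
    # A quote chain outputs the same chain with one quote fewer.
    for depth in (1, 5, 11):
        program, output = quote_chain(depth), quote_chain(depth - 1)
        print(depth, fitness_proportionate(1, program, output))
```

With an absolute difference of 1, the proportionate measure keeps improving as a quote chain grows, which gives an intuition for why dividing by tree size can reward sheer size rather than self-reproduction.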

6 Experimental Results

The first fitness function reliably found correct, but uninteresting, solutions to the problem: the programs 'x', 'undef' and 'null', which are all technically self-reproducing, but not in any interesting way. The second fitness function attempted to overcome this by requiring that every program in the population have a minimum size. It tended to produce quite large trees, which were similar in size, but not in content, to their own output.

The third fitness function returned results that were all of a similar form. Each of the programs started with a quote, and then listed a number of terms; a typical result for this fitness function was

(quote (if x x (quote (join x x))))

which would return

(if x x (quote (join x x)))

Thus all these programs had a fitness of 1. The fourth fitness function found an even more limited local minimum; all runs returned functions of the form

(quote (quote (quote (quote ... (quote x) ... ))))

That is, a sequence of quotes to the maximum depth possible, where the last node quoted x. This fitness function favoured individuals having an absolute difference of 1, divided by the size of the individual. Thus, to minimise their relative difference, these programs expanded to the maximum possible size.

The fifth fitness function found programs in which the 'fun' term was almost always instantiated as 'fun undef', i.e. a synonym for 'undef'.

The sixth fitness function found a self-reproducing program in one of its 10 runs. That program was

(join (quote                                               (1)
  (join (quote undef)                                      (2)
    (fun (if (if x x x) (quote x) (if undef x undef)))))   (3)
  (fun (if (if x x x) (quote x) (if undef x undef))))      (4)

This program, like that developed by the authors, contains a repetition of information (lines 3 and 4). The program returns lines 2 and 3 (note how they are quoted by line 1), where the first undef is replaced (via the join) with what is returned by line 4. There are two introns on line 4, namely (if x x x), which could be replaced by x, and (if undef x undef), which could be replaced by undef. Bearing that in mind, line 4 returns the result of (fun x) if x is null (i.e. the first time), or the result of (fun undef) if x is not null.

Consider the result of (fun x). This will return lines 2 and 3 quoted, with the first undef replaced by what is returned by line 4. Line 4 evaluates as (fun undef), which returns undef. Thus the first undef in lines 2 and 3 is replaced with undef, and so (fun x) returns lines 2 and 3. The initial call to the program therefore returns lines 2 and 3 (as detailed previously) joined to the result of (fun x), which we have determined is another copy of lines 2 and 3. Overall, the tree returned by the program is lines 2 and 3, where the first undef is replaced by lines 2 and 3. That is, a copy of itself.

It is also interesting to note that the self-reproducing program was first found during generation 18 of its run, and did not change for the remainder of that run. This is significant, in that genetically-evolved programs typically become larger and more complex over subsequent generations.

7 Discussion

The first 3 results illustrate the ability of genetic programs to find significant local minima of the search space and to then become `stuck' in that location. The only way we have been able to develop effective results was to force the programs to contain the two key operators with which they could develop self-reproduction. This approach is on one hand unsatisfying, in that it guides the program in a more rigid manner than we would like. However, perhaps it does illustrate a factor related to the chemical process of self-reproduction, in that many molecules are reactive, such that when in contact with certain other molecules they will combine into larger ones. It seems reasonable to assume that natural complex molecules, including those capable of self-reproduction, would be produced from reactive sub-molecules. In this way, we could equate the fun and if nodes to those reactive molecules.

We had also hypothesised that a self-reproducing program would be less prone to bloat than standard programs. The results have shown a very small number of introns in the initial solution and, more significantly, that the program has not bloated during subsequent generations.

8 Conclusions

These results have demonstrated that it is possible to evolve a self-reproducing algorithm. They also indicate that, for this to occur, some of its components must have a strong tendency to be used as part of that program, perhaps mimicking the necessity of reactive molecules as the building blocks of chemical self-reproduction. Furthermore, these results show that self-reproduction does seem to exert a parsimony pressure.

Conceptually, our work differs from the process by which natural self-reproduction is believed to have arisen in three main ways:

• through the careful design of the functional language used, the density of self-reproduction in the hypothesis space is much higher than in the corresponding natural system; we believe this is an appropriate trade-off for the much greater search capacity employed by the natural system
• the fitness function has been specifically tailored to derive self-reproduction, and is not readily interpreted in terms of natural evolutionary pressures
• the search is being carried out by an already-perfected evolutionary system, whereas the widely accepted hypothesis on the origin of life posits the perfection of the evolutionary system as part of the emergence of self-reproduction

We plan to address the latter two points in our future work, aiming to find more natural fitness functions, and to model the process of emergence of self-reproduction more accurately. Future work will attempt to discover if self-reproduction can be developed without the strong force for certain types of nodes to be included.

This work also introduces further possibilities. For example, how easily can self-reproducing programs be evolved to perform some function? Furthermore, can we develop genetic programs that include, in later generations, those programs that were returned by the previous generations? Hopefully those fields of study might help to shed more light on the processes whereby biological self-reproduction arose.

Acknowledgements

The ideas in this paper have benefited from discussions with Hussein Abbass. The work could not have been carried out in a reasonable time without the flexibility provided by Brian Ross' DCTG-GP package. Thank you both.

Bibliography

Hoyle, F. and Wickramasinghe, C. (1978) Lifecloud. Harper and Row.

Koza, J. R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection, Bradford / MIT Press, 1992.

Miller, S., 1953, 'A Production of Amino Acids Under Possible Primitive Earth Conditions': Science, v. 117, p. 528-529.

von Neumann, J.: Theory and Organization of Complicated Automata, in A. W. Burks, ed., `Theory of Self-Reproducing Automata [by] John von Neumann', University of Illinois Press, Urbana, pp. 29-87 (Part One), 1949. Based on transcripts of lectures delivered at the University of Illinois, in December 1949. Edited for publication by A. W. Burks.

Ross, B. J.: 'Logic-based Genetic Programming with Definite Clause Translation Grammars', Technical Report #CS-99-02, Dept of Computer Science, Brock University, St Catharines, Ontario, 1999; summary in Banzhaf et al (eds) Proceedings of the Genetic and Evolutionary Computation Conference, p. 1236, Morgan Kaufmann, 1999.

Voet, D. & Voet, J.: Biochemistry, 2nd Edn, New York, Wiley & Sons, 1995.