Partitioned Selection: A Selection Operator for in-vitro Evolution

R I McKay (1), N Mori (1,2) and D L Essam (1)
1: School of IT & EE, Australian Defence Force Academy, University of New South Wales, ACT 2600, Australia
2: School of Computer and Systems Sciences, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan
E-Mails: {b.mckay, d.essam}@adfa.edu.au, [email protected]

ABSTRACT
One potential application for biomolecular computation methods is the evolution of novel proteins [1]. However, a number of key problems remain, one being how to tie the evaluation of the phenotype (protein) to the fitness of the genotype (DNA). Artificial liposomes may provide a solution, by isolating DNA and derived proteins in cell-like partitions, so that the DNA may be associated with the fitness of its derived protein. This paper examines the behaviour of the corresponding weak selection mechanism, partitioned selection, within an evolutionary algorithm.

1. INTRODUCTION
In recent years, there has been tremendous interest in Biomolecular Computation [2]. A number of authors have proposed DNA-based evolutionary algorithms [3,4], leading to the potential use of such algorithms for generating novel molecules. However, DNA molecules are of limited usefulness in themselves: in nature, they are used for information storage, coding for more directly useful RNA and, even more importantly, for proteins. Hence we can expect that an important payoff from using DNA-based evolutionary techniques will arise from their coding of useful RNA or protein molecules [1].

We recall that DNA molecules generate RNA via the transcription process, in which a transcriptase enzyme transcribes the DNA coding into a corresponding RNA coding; the RNA may then be further translated into a corresponding protein sequence. In an artificial evolutionary system, evaluation would be carried out on the phenotype (ie RNA or protein), but it would then be necessary to transfer this fitness evaluation back to the corresponding DNA molecules. For RNA, this may be relatively straightforward, because RNA viruses have provided us with reverse transcriptase enzymes, which can transcribe RNA back into the corresponding DNA.

Even with RNA, it may not be all plain sailing: reverse transcription is highly unreliable compared with forward transcription. While some level of stochastic mutation is acceptable, even desirable, in an evolutionary algorithm, it is unclear whether it would be possible to devise a DNA coding scheme sufficiently robust to the variation induced by reverse transcription, or whether the rate of mutation would be too high to permit an evolutionary algorithm to progress. When we come to the more useful protein molecules, we face an even greater obstacle: there is no analogue of reverse transcriptase for protein translation. It is unlikely that we will be able to generate enzymes able to reverse the translation from RNA to protein. Hence we need to find other ways of linking the fitness evaluations of protein and DNA.

Artificial liposome research [5] provides one possible solution. If we can first segregate the DNA into small compartments (cells), in which we can carry out transcription, translation and evaluation separately for small numbers of DNA molecules, then we can select the cells based on their fitness, and thus link the phenotype stochastically with the genotype. We are thus led to examine the properties of the Partitioned Selection (PS) operator in this paper.

We emphasise that our aim is not to present a better selection operator for evolutionary algorithms. Rather, we will investigate whether PS can be at all effective (noting that the massive parallelism provided by biomolecular computation may alleviate any computational inefficiency), and, if so, investigate its properties, both theoretically and experimentally. In fact, the stochastic properties of the partitioning operator described here may be of interest in themselves, giving rise to a stochastic form of Relaxation Selection [6], and we intend to investigate this further in future work. For that purpose, we would particularly concentrate on larger partitions, with evaluation conducted on stochastic samples from the partitions. However, this stochastic sampling is not likely to be a feasible method of evaluation for the molecular systems under consideration, so we do not discuss it further here.

The rest of this paper is organised as follows. Section 2 defines the problem in more algorithmic terms and presents some simple formal analysis. Section 3 describes the experiments undertaken, and Section 4 presents the results. Section 5 discusses the results and draws conclusions about the viability of partitioned selection for use in evolutionary biomolecular algorithms.

2. PARTITIONED SELECTION
Partitioned selection is perhaps more accurately described as a family of selection operators rather than a single operator. Many reasonable selection operators can be applied at the partition level, giving rise to a variety of partitioned selection operators. However, many would be difficult to implement in a DNA computation system; tournament selection, for example, would be difficult because of its inherently sequential nature. Moreover, partitioned selection clearly weakens the selection pressure of the original, underlying selection operator. If the underlying selector exerts only weak selective pressure, we are unlikely to obtain an effective selection mechanism. Hence in this work we have taken a highly selective mechanism, truncation selection, as the underlying selection operator.

Truncation selection selects from the population in two stages. The first stage relies on a selection ratio k: the fittest k proportion of the population are selected, entirely deterministically, to form a truncated population. In the second stage, the JGAP system builds a new population by selecting from this truncated population for crossover, reproduction or mutation; when an individual is required, it is uniformly selected (with replacement) from the truncated population.

How can we simulate the process of forming cells? How many DNA molecules should each partition contain? The distribution of actual physical sizes of the partitions will be determined by the physical process of forming them; since we cannot determine that process at this stage, we have assumed the best case, namely that the partitions are of uniform physical size. Controlling the mean number of genotypic individuals in each partition is relatively easy, since it is determined by the physical partition volume and the dilution of the aqueous solution. If we assume that the partitions are of uniform physical size, then, under reasonable assumptions about the physical process which allocates individuals to partitions, the number of individuals in each partition is likely to be binomially distributed. We assume a binomial distribution in this paper. It is clear that larger partition sizes will lead to weaker selection pressure, and for very large partition sizes the selection pressure will be so weak that evolution will effectively halt.
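To make the allocation model concrete, the short Python sketch below (an illustration of our own, not part of the experimental code) computes the binomial occupancy distribution of a single partition when n individuals are placed uniformly at random into a fixed number of partitions; the zero-occupancy term gives the expected fraction of empty cells, which is relevant to the evaluation-cost argument that follows.

```python
from math import comb

def occupancy_distribution(n, num_partitions, max_count=6):
    """Probability that a given partition receives exactly j of the n
    individuals when each individual is dropped into a uniformly chosen
    partition: Binomial(n, 1/num_partitions)."""
    p = 1.0 / num_partitions
    return [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(max_count + 1)]

# Example: population of 500 with mean partition size S = 1.5,
# i.e. roughly 333 partitions.  probs[0] is the expected fraction of
# empty partitions (about 0.22 here), which would still incur
# evaluation cost in a wet-lab implementation.
probs = occupancy_distribution(n=500, num_partitions=333)
print([round(q, 3) for q in probs])
```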

On the other hand, evaluation is likely to be one of the most expensive stages in DNA-computation systems for molecular evolution, with the evaluation cost being proportional to the total volume of liquid processed rather than to the number of molecules evaluated. Hence there is a significant cost to evaluating empty cells: under reasonable assumptions, the overall cost is proportional to the total number of cells. The choice of partition size therefore reflects a trade-off between weaker selection pressure and decreasing cost as the average number of genotypes per partition increases. It is our aim, in this paper, to investigate this trade-off. We do so using optimisation problems as surrogates. De Jong's minimisation test functions [7] form a useful set of examples, and of these we choose four: quadratic ('sphere'), step, Rosenbrock and Shekel's foxholes. The experimental setup is described in more detail in the next section.

3. EXPERIMENTS
Each experiment is replicated through 100 runs, each with a different random number generator seed. Each run consists of 100 generations, with a population size of 500. All experiments are based on the JGAP Genetic Programming system [8], and hence reflect some of its restrictions. In most experiments, the crossover and reproduction operators are each applied at 45.5%, with 9% mutation. The mutation operator is bitwise mutation with 20% of bits mutated; the crossover operator is one-point crossover. For the no-mutation experiments, the operator ratios are 50% reproduction and 50% crossover. The gene representation is 20 bits per chromosome, using direct representation; the 20-bit chromosome is mapped to the appropriate range for the problem.

Since the focus of the experiments is the selection operator, we describe it in more detail. We use tournament selection as the control, primarily because (as with partitioned selection) the selection pressure is tunable. Partitioned selection works as follows. The algorithm is driven by two parameters: the selection ratio k and the expected partition size S. Given a population size n, N = n/S partitions are created. Individuals are then allocated to the partitions with uniform probability, giving a binomial distribution of partition sizes. The fitness of a partition is defined to be the fitness of the fittest individual in the partition, and the partitions are ranked in fitness order. Then kn individuals (the fittest k proportion of the population) are deterministically selected in order from the partitions, ie from the fittest partition first until it is exhausted, then the second, and so on, until the quota of kn has been filled (individuals from the final partition used are selected at random, so there is no guarantee of selecting the fittest in that partition). Finally, as new individuals are needed for construction of the next generation, they are randomly selected (with replacement) from the kn individuals chosen as above.
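To make the procedure concrete, here is a minimal Python sketch of one round of partitioned selection as just described. It is an illustration written for this description, not the JGAP-based code used in the experiments; the function and parameter names are ours, and fitness is assumed to be minimised, as in De Jong's test problems.

```python
import random

def partitioned_selection(population, fitness, k, mean_partition_size):
    """One round of partitioned selection (illustrative sketch).

    population          -- list of individuals
    fitness             -- callable returning an individual's fitness
                           (lower is better: De Jong minimisation)
    k                   -- selection ratio, e.g. 0.2 keeps the fittest 20%
    mean_partition_size -- expected individuals per partition (S)
    Returns the pool from which parents are drawn uniformly with replacement.
    """
    n = len(population)
    num_partitions = max(1, round(n / mean_partition_size))

    # Allocate each individual to a uniformly chosen partition, so that
    # partition occupancies are binomially distributed.
    partitions = [[] for _ in range(num_partitions)]
    for individual in population:
        partitions[random.randrange(num_partitions)].append(individual)

    # A partition's fitness is that of its fittest (lowest-valued) member;
    # empty partitions are discarded here (though they would still incur
    # evaluation cost in a wet-lab system).
    occupied = [p for p in partitions if p]
    occupied.sort(key=lambda p: min(fitness(ind) for ind in p))

    # Take whole partitions, fittest first, until the quota of k*n
    # individuals is filled; the last partition needed is sampled at
    # random, so its fittest member may be lost.
    quota = int(k * n)
    selected = []
    for p in occupied:
        if len(selected) + len(p) <= quota:
            selected.extend(p)
        else:
            selected.extend(random.sample(p, quota - len(selected)))
            break
    return selected

# Parents for the next generation are then drawn with replacement:
#   parent = random.choice(selected_pool)
```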

As previously mentioned, the experiments use four of De Jong's [7] test functions, namely the quadratic function, the step function, Rosenbrock's function and Shekel's foxholes. We choose these, following De Jong, as representing a selection of the classes of optimisation difficulty which may be encountered by our system.

The control experiments use tournament selection, with tournaments of sizes 2, 3, 4, 5 and 6. The primary set of experiments investigates the effect of partition size, using partitioned selection with partition sizes of 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0 and 100.0 (giving 667, 500, 400, 333, 286, 250, 200, 167 and 5 partitions respectively). The second set of experiments investigates the effect of selection pressure, and the hypothesis that increasing partition size has effects similar to decreasing selection pressure; using a partition size of 1.5, these experiments alter the selection ratio from the previous value of 0.2 (ie 20% of the population) to 0.1 and to 0.5. The final set of experiments investigates the hypothesis that the stochasticity induced by partitioned selection has effects similar to those generated by mutation; we repeat the experiments for partition sizes of 0.75, 1.5 and 3.0, but with the mutation operator turned off. The evolutionary parameters are summarised in Table 1.

Table 1: Evolutionary Parameters

Runs per Treatment            100
Generations per Run           100
Population Size               500
Reproduction Rate             45.5% (50%)
Crossover Rate                45.5% (50%)
Mutation Rate                 9% (0%)
Bit Mutation Rate             20%
Partition Size                0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 100.0
Tournament Size               2, 3, 4, 5, 6
Bits per Chromosome           20
Function Range (Quadratic)    -5.12 … 5.12
Function Range (Rosenbrock)   -5.12 … 5.12
Function Range (Shekel)       -65.356 … 65.356
Function Range (Step)         -5.12 … 5.12

(Values in parentheses apply to the no-mutation experiments.)
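For reference, the sketch below shows one plausible decoding of a 20-bit chromosome into a real value in the problem's range, together with De Jong's quadratic ('sphere') function. The linear binary-to-real mapping and the use of one 20-bit gene per problem variable are our assumptions for illustration; the paper does not spell out the exact JGAP representation used.

```python
def decode(bits, lo, hi):
    """Map a 20-bit chromosome (a '0'/'1' string) linearly onto [lo, hi].
    Assumed direct binary-to-real decoding, for illustration only."""
    value = int(bits, 2)
    return lo + (hi - lo) * value / (2**len(bits) - 1)

def sphere(xs):
    """De Jong's quadratic ('sphere') function: the sum of squared
    coordinates, minimised at the origin."""
    return sum(x * x for x in xs)

# Hypothetical individual: one 20-bit gene per dimension, decoded into
# the range -5.12 .. 5.12 used for the quadratic, step and Rosenbrock
# problems in Table 1.
genes = ["00000000000000000000", "11111111111111111111"]
xs = [decode(g, -5.12, 5.12) for g in genes]
print(xs, sphere(xs))   # [-5.12, 5.12] 52.4288
```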

4. RESULTS
Figures 1 and 2 show the mean best fitness by generation, for a series of runs, over a range of tournament and partition sizes. With the exception of the largest partitions (size 100), there is considerable similarity between the tournament and partition results (inverted, because selection pressure increases as tournament size increases but decreases as partition size increases). As expected, the convergence of the partitioned selection runs slows as partition size increases, but the slow-down is remarkably gradual: even size 3.0 partitions perform highly effectively.

As anticipated, very large partition sizes (100.0) give such weak selection pressure that evolution effectively ceases.

Figure 3 shows the same runs in more detail, omitting partition size 100.0 and the early generations so as to give a zoomed fitness scale. These results confirm the evenness, but also the surprising slowness, of the deterioration in performance as partition size increases. Table 2 gives numeric confirmation: with a partition size of 1.5, partitioned selection performs worse than tournament selection of size 3, but the loss of performance is relatively small.

Figures 4 and 5, comparing the mean average fitness by generation, present a remarkable contrast to the above. While the best fitness of tournament and of partitioned selection are relatively similar, their mean average fitness behaves in completely different ways. Tournament selection average fitness closely parallels its best fitness, whereas partitioned selection behaves quite differently: despite attaining almost equivalent best fitness to tournament selection, the partitioned selection runs have much worse mean fitness, suggesting that they have maintained far greater diversity than the tournament selection runs. Interestingly, with the exception of the step function, all functions show a mean fitness inflection at about the time the improvement in best fitness stops; this inflection is particularly marked for Shekel's function. Figure 6 shows the mean, over all runs, of the standard deviation of fitness within a generational population (for partition size 1.5). It shows a similar inflection, with a minimum close to the point at which the mean best fitness attains its minimum. We do not yet have a good explanation for this behaviour.

Table 2: Final Generation Fitness

             Tournament size 3           Partition size 1.5
             Mean          Best          Mean          Best
Quadratic    0.123665708   2.55E-09      14.53985371   3.36E-05
Rosenbrock   228.3258382   1.57329717    32413.37352   4.79896719
Shekel       0.001996001   0.002000001   0.004438732   0.002000001
Step         -24.64246     -25           -14.17052     -24.98
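As an aside, the diversity proxy plotted in Figure 6 is simply the spread of fitness values within one generation's population. A minimal sketch of how such a measure might be computed is given below; it is illustrative only, not the original analysis code.

```python
from statistics import pstdev

def fitness_spread(population, fitness):
    """Standard deviation of fitness within one generational population,
    used as a simple phenotypic-diversity proxy.  As in Figure 6, this
    value would then be averaged over all runs at each generation."""
    values = [fitness(individual) for individual in population]
    return pstdev(values)
```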

Figures 7 and 8 show the behaviour of partitioned selection as selection pressure changes; in general, the results are as expected, with lower selection pressure leading to delayed convergence.

Figures 9 and 10 repeat the experiments of varying partition size, but with mutation turned off. Perhaps the most important point is that the algorithm still behaves quite well, attaining results which, though not quite as good as in the other experiments, are still quite respectable. Without mutation, eventual convergence of the algorithm is inevitable (since there is no mechanism to introduce new diversity), but partitioned selection is able to delay convergence substantially, for long enough to obtain good results.

5. DISCUSSION AND CONCLUSIONS
The results demonstrate that partitioned selection with small partition sizes is an effective selection mechanism for evolutionary algorithms, giving rise to reasonable behaviour on quite difficult problems. The decay in performance with increasing partition size is sufficiently slow to give system designers a reasonably predictable trade-off between partition size and performance.

In fact, the results suggest that partitioned selection is, if anything, a better selection operator than we anticipated. In comparison with tournament selection, in these runs, it is able to give good exploitation (ie finding the minimum fairly accurately) while demonstrating a remarkable ability to maintain diversity (exploration). These characteristics are particularly suitable for applications in DNA-based evolutionary systems, with populations many orders of magnitude larger than those used in standard computer-based systems. The performance of partitioned selection in these respects is sufficiently good that it warrants investigation in its own right as a selection operator: in the no-mutation experiments, it demonstrated an impressive ability to maintain diversity even when the source of renewed diversity was removed.

The interaction between partition size and selection pressure is smooth and readily understandable, hence we anticipate that tuning the algorithm for real applications will be relatively simple. Overall, the results confirm the feasibility of partitioned selection as a mechanism for DNA evolution; in fact, they suggest that it may have interesting properties for standard evolutionary computation systems as well.

Future work will further explore some unexpected behaviours of the algorithm, particularly the apparent increase in population diversity after the best fitness has converged. Our next approach will be to measure genotypic diversity in an attempt to understand why this phenomenon occurs. We also plan to further investigate the performance of the algorithm on general evolutionary computation problems, and to characterise its overall effect on diversity.

ACKNOWLEDGEMENTS
The ideas in this work were generated in the course of discussion with Prof. Byoung Tak Zhang of Seoul National University; they also benefited from discussions with Mr Xuan Nguyen and Mr Yin Shan of UNSW @ ADFA. We would like to thank them for the insights they have contributed. The experiments were conducted using a modified version of the JGAP open-source GA software (http://jgap.sourceforge.net/), and we would like to express our thanks to the developers of that software.

REFERENCES
[1] K. Sakamoto, M. Yamamura and H. Someya: Toward Wet Implementation of Genetic Algorithm for Protein Engineering, Proceedings of the Tenth International Meeting on DNA Computing, Milan, Italy, 2004, to appear.
[2] G. Paun, G. Rozenberg and A. Salomaa: DNA Computing: New Computing Paradigms (Texts in Theoretical Computer Science), Springer Verlag, 1998.
[3] D. H. Wood, E. Antipov, B. Lemieux, W. Cedeno and J. Chen: A DNA Implementation of the Max 1s Problem, in W. Banzhaf, A. Eiben, M. Garzon, V. Honavar, M. Jakiela and R. E. Smith (eds.), GECCO-99: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1835-1842, Morgan Kaufmann, San Francisco, 1999.
[4] B.-T. Zhang and H.-Y. Jang: A Bayesian Algorithm for In Vitro Molecular Evolution of Pattern Classifiers, Preliminary Proceedings of the Tenth International Meeting on DNA Computing, pp. 294-303, 2004.
[5] A. Moscho, O. Orwar, D. T. Chiu, B. P. Modi and R. N. Zare: Rapid Preparation of Giant Unilamellar Vesicles, Proc. Nat. Acad. Sci. USA, 93, pp. 11443-11447, 1996.
[6] M. Ohkura, N. Mori and K. Matsumoto: Speed up of the Thermodynamical Selection Rule by Means of the Relaxation Selection Method, 43rd Annual Conference of the Institute of Systems, Control and Information Engineers (SCI'99), pp. 107-108, 1999.
[7] K. De Jong: An Analysis of the Behaviour of a Class of Genetic Adaptive Systems, PhD thesis, University of Michigan, 1975.
[8] JGAP open-source GA software package (http://jgap.sourceforge.net/).

Figure 1: Control Experiment (Tournament Selection) Best Fitness for Varying Tournament Sizes

Figure 2: Partitioned Selection Best Fitness for Varying Partition Sizes

Figure 3: Partitioned Selection Best Fitness (omitting the first 3 generations and partition size 100.0, for greater detail)

Figure 4: Control Experiment (Tournament Selection) Mean Fitness for Varying Tournament Sizes

Figure 5: Partitioned Selection Mean Fitness for Varying Partition Sizes

Figure 6: Partitioned Selection Fitness Standard Deviation for Partition Size 1.5

Figure 7: Partitioned Selection Best Fitness for Varying Selection Pressures

Figure 8: Partitioned Selection Mean Fitness for Varying Selection Pressures

Figure 9: Partitioned Selection Best Fitness for Varying Partition Sizes, No Mutation

Figure 10: Partitioned Selection Mean Fitness for Varying Partition Sizes, No Mutation
