Application of the Parallel Fast Messy Genetic Algorithm to the Protein Folding Problem

Laurence D. Merkle, George H. Gates, Jr., Gary B. Lamont
Department of Electrical and Computer Engineering
Air Force Institute of Technology Graduate School of Engineering
Wright-Patterson AFB, OH 45433
{lmerkle, ggates, lamont}@afit.af.mil

Ruth Pachter

Wright Laboratory
Wright-Patterson AFB, OH 45433
[email protected]

1 Introduction

Polypeptides, including naturally occurring proteins, are polymers of amino acids. The capability to predict a polypeptide's tertiary structure (the 3-dimensional arrangement of atoms in the molecule, also known as the conformation) given its amino acid sequence is important to numerous scientific, medical, and engineering applications. For example, in order to understand the mechanism by which an enzyme functions, researchers require knowledge of its tertiary (and perhaps quaternary) structure [3]. Also, without this capability, ab initio design of new polypeptides with specific biochemical, mechanical, and/or optical properties (the inverse protein folding problem) is infeasible. Applications of synthetic polypeptides include biological and chemical catalysts; biosensors; pharmaceuticals; hormones and biological regulatory agents; optical, chemical, and mechanical energy interconversion mechanisms; energy and/or information storage devices on the angstrom size scale [3]; and non-linear optical materials [21]. The effort to develop a general technique for such structure prediction is commonly referred to as the protein folding problem, because the protein may be envisioned as a "chain" of amino acids which "folds" on itself in a particular way. A common theme in many approaches to the protein folding problem is the minimization of an energy function in conformational space [12]. We discuss a parallel genetic algorithm development effort which focuses on the global minimization of a semi-empirical energy model.

In particular, a parallel "fast messy" genetic algorithm is designed and implemented on the Intel Paragon massively parallel processor. We present execution times and conformational energies obtained for the pentapeptide [Met]-enkephalin.

2 Energy Minimization

One common method of protein structure prediction, referred to as energy minimization, searches a protein's conformational space for an energy minimum. This is a challenging optimization problem for several reasons. One reason is that the conformational space is of high dimensionality. Another is that determination of an individual conformation's energy is itself computationally intensive. Finally, the energy function contains a very large number of local minima [1]. In particular, a polypeptide molecule possesses 3N_A - 6 degrees of freedom, where N_A is the number of atoms contained in the molecule. Even relatively small polypeptides contain thousands of atoms, and polypeptides containing hundreds of thousands of atoms are not uncommon. Some simplification of the problem is possible by performing the minimization in a space of reduced dimensionality. For example, by observing that in polypeptides many dihedral angles, and virtually all bond lengths and bond angles, are relatively inflexible, we can substantially constrain

the conformational space. The dimensionality of this reduced search space is N_D, the number of variable dihedral angles considered. Note that N_D is a fraction of N_A. The predictive capability of any energy minimization method depends upon the accuracy of the underlying energy model. An exact model, accounting for all quantum-mechanical effects, is prohibitively computationally intensive, with algorithmic complexity as high as O(N_A^5) [15]. Again, some simplification is made possible by approximating the molecular energy with two-body semi-empirical models, which typically result in algorithms of only O(N_A^2) complexity. The energy model used here considers contributions due to non-bonded van der Waals interactions (represented by the Lennard-Jones potential), Coulomb's law, and bonded interactions. It does not consider solvent interactions. The energy of a conformation is given by

E = \sum_{(i,j) \in B} K_r (r_{ij} - r_{eq})^2
  + \sum_{(i,j,k) \in A} K_\theta (\theta_{ijk} - \theta_{eq})^2
  + \sum_{(i,j,k,l) \in D} K_\phi \left[ 1 + \cos(\phi_{ijkl} - \gamma_{ijkl}) \right]
  + \sum_{(i,j) \in N} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{\varepsilon r_{ij}} \right]    (1)

where

- B is the set of bonded atom pairs,
- A is the set of atom triples defining bond angles,
- D is the set of atom 4-tuples defining dihedral angles,
- N is the set of non-bonded atom pairs,
- r_ij is the distance between atoms i and j,
- θ_ijk is the angle formed by atoms i, j, and k,
- φ_ijkl is the dihedral angle formed by atoms i, j, k, and l,
- q_i is the partial atomic charge of atom i,
- the K_r's, r_eq's, K_θ's, θ_eq's, K_φ's, γ_ijkl's, A_ij's, B_ij's, and ε are empirical constants.

The empirical parameters are derived from the parameter files of the molecular modeling software CHARMm version 22.0, last updated 92/10/05 [19], as described by Brinkman et al. [2]. Both ab initio and semi-empirical models result in energy surfaces which contain many local minima, and are thus fundamentally challenging problems for existing functional optimization techniques. Numerous optimization algorithms have been applied to the problem [20, 12], including Monte Carlo simulation, simulated annealing, and other global optimization techniques, often in conjunction with efficient calculus-based local minimization techniques such as steepest descent, conjugate-gradient, or Newton-Raphson.
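To make the form of Equation 1 concrete, the following C sketch evaluates its non-bonded (Lennard-Jones and Coulomb) terms; the bonded sums over B, A, and D follow the same pattern. The data layout and names here are illustrative assumptions, not the authors' actual toolkit code.

```c
#include <math.h>

/* Illustrative layout for the non-bonded pair list of Equation 1; the
 * paper does not describe the toolkit's actual data structures. */
typedef struct {
    int i, j;       /* atom indices */
    double A, B;    /* Lennard-Jones coefficients A_ij, B_ij */
} NonBondedPair;

/* Sum of the Lennard-Jones and Coulomb terms of Equation 1 over the
 * non-bonded pair set N.  xyz holds Cartesian coordinates, q the
 * partial atomic charges, and eps the dielectric constant. */
double nonbonded_energy(const double xyz[][3], const double q[], double eps,
                        const NonBondedPair *pairs, int npairs)
{
    double E = 0.0;
    for (int n = 0; n < npairs; n++) {
        int i = pairs[n].i, j = pairs[n].j;
        double dx = xyz[i][0] - xyz[j][0];
        double dy = xyz[i][1] - xyz[j][1];
        double dz = xyz[i][2] - xyz[j][2];
        double r  = sqrt(dx*dx + dy*dy + dz*dz);
        double r6 = r*r*r*r*r*r;
        E += pairs[n].A / (r6*r6)     /* repulsive r^-12 term  */
           - pairs[n].B / r6          /* attractive r^-6 term  */
           + q[i]*q[j] / (eps * r);   /* Coulomb term          */
    }
    return E;
}
```

Since the pair list is fixed by the molecular topology, this loop is O(|N|), which dominates the O(N_A^2) cost noted above.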

3 Messy Genetic Algorithms

One class of optimization algorithms which has been applied to the protein folding problem is genetic algorithms [14, 2]. Genetic algorithms (GAs) are a class of stochastic, population-based optimization algorithms, and are described in detail elsewhere (cf. Holland [11], Goldberg [5], or Michalewicz [18]). One general class of functions for which the canonical (or "simple") GA tends to converge toward suboptimal solutions is that of deceptive functions [22]. The messy GA [6, 7, 8], which is designed specifically to overcome deception, consists of three phases:

1. Initialization. Partially enumerative initialization (PEI) generates a population consisting of all possible partial solutions defined over k loci.¹

2. Primordial. Tournament selection is applied repeatedly to reduce the population size and eliminate less fit partial solutions, thus forming a population of highly fit partial solutions.

3. Juxtapositional. This phase is similar to the simple GA in its use of recombinative operators. The specific operator used is cut-and-splice, which is a one-point crossover operating on variable-length strings. Splice and bitwise cut probabilities are chosen to promote rapid increase of the string length from k to ℓ [6].

In both the primordial and juxtapositional phases, a locally optimal solution, called the competitive template, is used to "fill in the gaps" in partially specified solutions to allow their evaluation.

¹ In a GA, each solution is encoded as a sequence of genes, where a gene's position in the sequence is referred to as its locus and its value is referred to as its allele. A partial solution defined over k loci is sometimes referred to as a building block.

Also, in order to prevent the cross-competition between building blocks caused by non-uniform scaling, competition is restricted to those individuals which are defined at some threshold number of common loci. In particular, the threshold for a pair of partial solutions is the expected number of common defining loci for partial solutions of their lengths. For an application in which each string contains ℓ genes, and each gene has C possible alleles, the initial population contains

N = C^k \binom{\ell}{k}    (2)

solutions. This is significantly larger than typical simple GA population sizes. For an application using a binary representation with ℓ = 240 and k = 5, the messy GA initial population contains 2.04 × 10^11 individuals, whereas typical simple GA population sizes are in the tens to thousands. This is the cost of improved GA performance on deceptive functions. The large size of the messy GA initial population and the substantial execution time required to reduce the population are serious drawbacks for the algorithm. In order to remedy these problems, Goldberg et al. propose three modifications to the original algorithm: use of probabilistically complete initialization (PCI) in place of PEI, use of building block filtering, and more conservative thresholding in tournament selection [9]. PCI generates a population of random individuals in which each building block has an expected number of copies sufficient to overcome sampling noise. Each individual in the population is defined at ℓ′ = ℓ − k loci, which are selected randomly without replacement (it is assumed that k ≪ ℓ). The population size is

N = \frac{\ell! \, (\ell - 2k)!}{((\ell - k)!)^2} \, 2 z_\alpha^2 \beta^2 (m - 1) 2^k    (3)

where ℓ, ℓ′, and k have been defined previously, m is the number of building blocks in a fully specified solution, α is a parameter specifying the probability of selection error between two competing building blocks, P[Z ≤ z_α] = 1 − α where Z is a standard normal random variable, and β² is a parameter specifying the maximum inverse signal-to-noise ratio per subfunction to be detected [9].
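For a sense of scale, the following sketch evaluates Equations 2 and 3 numerically, using log-gamma arithmetic to avoid overflow at ℓ = 240. The values assumed here for m, z_α, and β² are illustrative only; the paper does not report the sizing parameters used.

```c
#include <math.h>
#include <stdio.h>

/* ln C(n, k) via log-gamma, to avoid factorial overflow at n = 240 */
static double log_choose(double n, double k) {
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

int main(void) {
    double ell = 240.0, k = 5.0, C = 2.0;  /* string length, block order, alleles */

    /* Equation 2: PEI population size N = C^k * choose(ell, k) */
    double N_pei = exp(k * log(C) + log_choose(ell, k));

    /* Equation 3: PCI population size.  m, z_alpha, and beta^2 are
     * assumed values for illustration, not taken from the paper. */
    double m = 48.0, z_alpha = 3.0, beta2 = 4.0;
    double log_ratio = lgamma(ell + 1.0) + lgamma(ell - 2.0*k + 1.0)
                     - 2.0 * lgamma(ell - k + 1.0);
    double N_pci = exp(log_ratio) * 2.0 * z_alpha * z_alpha
                 * beta2 * (m - 1.0) * pow(2.0, k);

    printf("PEI population size: %.3e\n", N_pei);  /* ~2.04e11, as in the text */
    printf("PCI population size: %.3e\n", N_pci);
    return 0;
}
```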

The fast messy GA enriches the initial population via alternating tournament selection and building block filtering (BBF). Tournament selection increases the proportion of individuals containing highly fit building blocks. BBF then randomly deletes some number of genes from every individual, the number being chosen so that BBF is expected to disrupt many but not all of the highly fit building blocks (a sketch of this operator appears at the end of this section). Those individuals still containing highly fit building blocks receive additional copies in subsequent iterations of tournament selection. The net effect is to produce a population of partial strings of length k with a high expected proportion of highly fit building blocks. As in the original messy GA, competition is restricted to those individuals which contain some number of common defining loci. Unlike the original messy GA, the threshold for each step in the primordial phase is specified as an input parameter. Current practice is to use an empirically determined filtering and threshold schedule, although work is in progress to develop a more complete theory to allow a priori schedule designs [13].

One attractive feature of genetic algorithms is that they are highly parallelizable, in the sense that it is possible to obtain near-linear or better speedups on a wide variety of parallel architectures, typically with relatively little development effort. Many parallel genetic algorithm implementations use the "island" model, wherein the population is partitioned amongst the processors, and each processor executes an independent genetic algorithm [10]. The interprocessor communication for this model is easily tailored to a specific architecture, since no communication is required for fitness evaluation. Numerous variations exist in which either the selection operation executing on a particular processor is affected by other processors' subpopulations, or processors communicate some portion of their subpopulations to other processors. These approaches extend with some modifications to the messy GA [17].
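As promised above, here is a minimal sketch of the BBF operator, assuming a moveable-gene (locus, allele) representation; the names and layout are illustrative, not the toolkit's actual code.

```c
#include <stdlib.h>

/* One individual: a variable-length list of (locus, allele) pairs, as
 * in the messy GA's moveable-gene representation. */
typedef struct {
    int *locus;             /* defining loci */
    unsigned char *allele;  /* corresponding bit values */
    int length;             /* number of genes currently defined */
} Individual;

/* Building block filtering: randomly delete genes until the individual
 * is cut down to new_length, per the current filtering-schedule entry.
 * Because deletion is uniform, a k-gene building block survives a
 * filtering event only with some probability -- exactly the intended
 * "disrupt many but not all" behavior. */
void building_block_filter(Individual *ind, int new_length) {
    while (ind->length > new_length) {
        int victim = rand() % ind->length;                  /* gene to delete   */
        ind->locus[victim]  = ind->locus[ind->length - 1];  /* swap last gene in */
        ind->allele[victim] = ind->allele[ind->length - 1];
        ind->length--;
    }
}
```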

4 Parallel Algorithm Design

A parallel fast messy GA (PFMGA) is designed using the island model and implemented on the Intel Paragon. One "controller" processor inputs the GA parameters and sends them to the remaining processors. Each processor, including the controller, randomly and independently generates a competitive template. Each then independently performs PCI to generate an initial population of N/P individuals, where N is given by Equation 3 and P is the number of processors. Each processor then applies the tournament selection and building block filtering operators to its subpopulation. The shuffle size is specified to be equal to the subpopulation size.

Prior to the juxtapositional phase, each non-controller processor communicates to the controller its entire subpopulation of partial solutions, as well as the best solution observed in its primordial phase, overlaid with that processor's competitive template. The non-controller processors then terminate. The controller processor combines the subpopulations into a single population of N partial solutions, and executes the juxtapositional phase as in the serial version. It then receives the overall best solution reported by any processor for the primordial phase, and reports the overall best solution obtained in either the primordial or the juxtapositional phase, along with its fitness.

Given a fixed population size and string length, we are interested in the total execution time as a function of the number of processors P. Negligible time is spent inputting GA parameters and outputting results. During PCI each processor performs O(P^{-1}) function evaluations. Given a fixed filtering schedule with O(log ℓ) filtering events, the number of function evaluations performed by each processor in the primordial phase is fixed and is O(P^{-1}). The primordial phase execution time also depends significantly on the number of compatibility tests performed, which is a function of the subpopulation size N/P, the shuffle size N/P, and the probability of compatibility for individuals randomly selected from the same subpopulation [16]. The probability of compatibility may depend on P, but not in any obvious way. Thus, the number of compatibility tests is expected to be O(P^{-2}).

During the juxtapositional phase, which has O(log ℓ) generations, the controller processor performs O(1) function evaluations. It also performs a number of compatibility tests which is a function of the total population size N, the shuffle size N, and the probability of compatibility. Again, the latter may depend on P, but not in any obvious way. Thus, the number of compatibility tests is expected to be independent of P, i.e., O(1). Every individual in a non-controller processor's subpopulation is communicated. There are N − N/P = N(1 − 1/P) such individuals. Thus, communication time is expected to increase asymptotically with increasing P. In summary, the times spent in function evaluations, compatibility tests, and communications are expected to be O(1 + P^{-1}), O(1 + P^{-2}), and O(1 − P^{-1}), respectively.

The PFMGA is implemented on the Intel Paragon parallel supercomputer in C Release 4.1.1, and runs under the Paragon OSF/1 Release 1.0.4 Server 1.1 R1.1 and Paragon OSF/1 Release 1.0.4 Server 1.2 WW20_Beta operating systems. It is part of AFIT's GA Toolkit, which includes several sequential and parallel GAs [4, 16].
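The pre-juxtapositional recombination step described above amounts to an all-to-one gather of fixed-size encoded partial solutions. The following skeleton recasts it in MPI purely for illustration; the original implementation used the Paragon's native message-passing calls, and the buffer sizes here are placeholders.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the gather preceding the juxtapositional phase: each of P
 * processors holds N/P encoded partial solutions after its primordial
 * phase; the controller (rank 0) collects all N of them and proceeds
 * alone.  (The transmission of each node's best overlaid solution is
 * omitted here for brevity.) */
int main(int argc, char **argv) {
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    const int sub_n = 256, sol_bytes = 64;      /* illustrative sizes only */
    char *subpop = malloc(sub_n * sol_bytes);   /* this node's N/P share   */
    /* ... primordial phase fills subpop ... */

    char *population = NULL;
    if (rank == 0)                               /* controller's buffer for N */
        population = malloc((size_t)P * sub_n * sol_bytes);

    MPI_Gather(subpop, sub_n * sol_bytes, MPI_CHAR,
               population, sub_n * sol_bytes, MPI_CHAR,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* controller runs the serial juxtapositional phase on all N strings */
        free(population);
    }
    free(subpop);
    MPI_Finalize();  /* non-controllers terminate, as in the design above */
    return 0;
}
```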

5 Application of the PFMGA to Energy Minimization

Each individual in the PFMGA population, once overlaid on the competitive template, is a binary string representation of a subset of the internal coordinates required to fully specify a conformation. The N_C represented coordinates are referred to as the "independently variable" coordinates. Which of the coordinates are to be independently variable is specified in a Z-matrix [2], which is input during initialization. Each independently variable coordinate V is represented by bits (b_{V_0+1}, b_{V_0+2}, ..., b_{V_0+k̂}) in the string such that

V = V_{min} + (V_{max} - V_{min}) \sum_{i=1}^{\hat{k}} b_{V_0 + i} \, 2^{-i}    (4)

where k̂ = ℓ/N_C and V ∈ [V_min, V_max]. The values of the coordinates which are not independently variable are also specified in the Z-matrix, while the empirical constants of Equation 1 are specified in another input file which is read during initialization [2]. Finally, the interatomic distances for a conformation are determined by the internal coordinates [2]. Thus, the fitness of an individual is determined by overlaying it on the competitive template, "decoding" each of the independently variable coordinates, calculating the interatomic distances, and then calculating the energy associated with the conformation via Equation 1.
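A minimal sketch of the decoding step of Equation 4, assuming the overlaid string is stored one bit per byte; the names and layout are illustrative, not the toolkit's actual decoder.

```c
/* Decode one independently variable coordinate per Equation 4: a
 * k^-bit fixed-point fraction scaled into [Vmin, Vmax].  `bits` is the
 * overlaid binary string; V0 is the coordinate's zero-based offset. */
double decode_coordinate(const unsigned char *bits, int V0, int khat,
                         double Vmin, double Vmax)
{
    double frac = 0.0, weight = 0.5;            /* 2^-1, 2^-2, ... */
    for (int i = 1; i <= khat; i++) {
        if (bits[V0 + i - 1]) frac += weight;   /* b_{V0+i} * 2^-i */
        weight *= 0.5;
    }
    return Vmin + (Vmax - Vmin) * frac;
}
```

For the experiments described next, ℓ = 240 and N_C = 24, so k̂ = 10 and each dihedral angle is decoded from 10 bits.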

6 Experimental Design and Results

Here we describe experiments designed to assess the scalability of the PFMGA design, in particular for the application to the energy minimization of [Met]-enkephalin. The independently variable internal coordinates are chosen to be the 24 φ, ψ, ω, and χ angles, which are also the variable coordinates used in several other studies [20, 2]. The string length is 240, so that each internal coordinate is represented by 10 bits. Experiments are performed using 1, 2, 4, 8, 16, 32, 64, and 128 processors. The same set of 10 randomly generated seeds is used for 10 independent runs on each of the 8 processor counts. The number of runs is sufficient to determine at the 95% confidence level

statistically significant differences between execution times obtained for most processor counts. The filtering and thresholding schedule is a scaled version of that used by Goldberg et al. for a 50-bit application [9, 13]. A total of 24 filtering episodes are performed, as shown in Table 1. The shuffle number is equal to the subpopulation size, the cut probability is 0.02, and the splice probability is 1.0. The primordial phase has 132 generations and the juxtapositional phase has 12. The overflow factor is 1.6, and the total population size is 256. No outer loop is performed. The competitive template is a randomly generated conformation.

Table 1: BBF and Threshold Schedule

  Generation   String length   Threshold
       0            216           194
       7            185           143
      11            157           107
      15            135            84
      19            115            64
      23             98            47
      29             84            38
      35             72            31
      41             61            25
      47             53            21
      53             45            17
      59             39            15
      65             33            12
      71             29            10
      77             24             8
      82             21             7
      87             18             6
      92             15             5
      97             13             4
     102             11             4
     107              9             3
     112              8             3
     117              7             2
     122              6             2
     127              5             3

The sample and mean execution times obtained for each processor count, measured as elapsed time on the controller processor, are shown in Figure 1. The PFMGA obtains sub-linear speedup through 32 processors, at which point execution time begins to increase. The sublinear nature of the speedup is due primarily to the sequential nature of the juxtapositional phase, and secondarily to the increased communication required for recombination of the population between the primordial and juxtapositional phases. Also of interest is the increased variance of the execution time for higher processor counts. We attribute this primarily to the increased variance in the number of I/O nodes available to the PFMGA.

[Figure 1: Execution times for PFMGA minimization of the [Met]-enkephalin energy function vs. processor count (elapsed time in seconds; sample data and sample means).]

The sample and mean conformational energies obtained for each processor count are shown in Figure 2. The best conformational energy of any individual in the randomly generated initial population invariably has energy on the order of 10^10 kcal. The energies obtained by the PFMGA are of the same order of magnitude as the lowest known for the model. Also, there is no statistically significant difference observed between the energies obtained for any processor counts. That is, the solution quality obtained is independent of the number of processors.

[Figure 2: Energy of [Met]-enkephalin obtained via PFMGA minimization vs. processor count (kcal; sample data and sample means).]

7 Conclusions and Future Directions

The parallel fast messy genetic algorithm design presented here exhibits speedup up to 32 processors for the [Met]-enkephalin energy minimization application. The conformational energies are not as low as those obtained in studies using more refined energy models [20], but are near the lowest known for our model. One planned refinement of the model relates to dependently variable internal coordinates. In the current model, the values of such coordinates are specified during initialization and fixed throughout the execution. A substantially more accurate model will result from calculating these coordinates from the coordinates which are independently variable.

The scalability of the design is limited by the fact that the juxtapositional phase is performed serially. Parallelization of the juxtapositional phase is clearly necessary in order to extend the scalability of the application to more processors. Also, migration has been shown to have significant effects on both solution quality and execution time in the parallel simple GA. Subsequent PFMGA experiments will investigate migration. Finally, the design produces homogeneous initial subpopulations. It has been shown that the execution times of tournament-selection-based parallel GAs on distributed memory architectures are significantly affected by the distribution of individuals in the initial subpopulations, and in particular by the probability of compatibility between solutions in those populations [17]. Thus, subsequent PFMGA experiments will explore various strategies for producing non-homogeneous initial subpopulations.

References

[1] Bilbro, G. and W. Snyder. "Optimization of Functions with Many Minima," IEEE Transactions on Systems, Man, and Cybernetics, 21(4):840–849 (Jul/Aug 1991).

[2] Brinkman, Donald J., et al. "Parallel Genetic Algorithms and Their Application to the Protein Folding Problem." Intel Supercomputer Users Group Conference Proceedings. 1993.

[3] Chan, H. and K. Dill. "The Protein Folding Problem," Physics Today, 24–32 (February 1993).

[4] Dymek, Capt Andrew. An Examination of Hypercube Implementations of Genetic Algorithms. MS thesis, AFIT/GCE/ENG/92-M, Air Force Institute of Technology School of Engineering, Wright-Patterson AFB OH, March 1992 (ADA248092).

[5] Goldberg, David E. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading MA: Addison-Wesley Publishing Company, 1989.

[6] Goldberg, David E. and others. "Messy Genetic Algorithms: Motivation, Analysis, and First Results," Complex Systems, 3:493–530 (1989).

[7] Goldberg, David E. and others. "Messy Genetic Algorithms Revisited," Complex Systems, 4:415–444 (1990).

[8] Goldberg, David E. and others. "Don't Worry, Be Messy." Proceedings of the Fourth International Conference on Genetic Algorithms. 24–30. San Mateo CA: Morgan Kaufmann Publishers, Inc., 1991.

[9] Goldberg, David E. and others. "Rapid, Accurate Optimization of Difficult Problems Using Fast Messy Genetic Algorithms." Proceedings of the Fifth International Conference on Genetic Algorithms, edited by Stephanie Forrest. 56–64. San Mateo CA: Morgan Kaufmann Publishers, Inc., 1993.

[10] Gordon, V. Scott and Darrell Whitley. "Serial and Parallel Genetic Algorithms as Function Optimizers." Proceedings of the Fifth International Conference on Genetic Algorithms, edited by Stephanie Forrest. 177–183. San Mateo CA: Morgan Kaufmann Publishers, Inc., 1993.

[11] Holland, John H. Adaptation in Natural and Artificial Systems. Cambridge, MA: MIT Press, 1992.

[12] Hunter, Lawrence, et al., editors. Proceedings, First International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press, 1993.

[13] Kargupta, Hillol. Personal communication regarding tournament selection parameters in the fast messy GA, December 1993.

[14] LeGrand, S. M. and K. M. Merz. "The Application of the Genetic Algorithm to the Minimization of Potential Energy Functions," Journal of Global Optimization, 3:49–66 (1991).

[15] Lengauer, T. Algorithmic Research Problems in Molecular Bioinformatics. Technical Report 748, Arbeitspapiere der GMD, May 1993.

[16] Merkle, Laurence D. Generalization and Parallelization of Messy Genetic Algorithms and Communication in Parallel Genetic Algorithms. MS thesis, Air Force Institute of Technology, WPAFB OH 45433, December 1992.

[17] Merkle, Laurence D. and Gary B. Lamont. "Comparison of Parallel Messy Genetic Algorithm Data Distribution Strategies." Proceedings of the Fifth International Conference on Genetic Algorithms, edited by Stephanie Forrest. 191–198. San Mateo CA: Morgan Kaufmann Publishers, Inc., 1993.

[18] Michalewicz, Zbigniew. Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer-Verlag, 1992.

[19] Molecular Simulations, Incorporated. CHARMm version 22.0 Parameter File, 1992.

[20] Nayeem, Akbar, et al. "A Comparative Study of the Simulated-Annealing and Monte Carlo-with-Minimization Approaches to the Minimum-Energy Structures of Polypeptides: [Met]-Enkephalin," Journal of Computational Chemistry, 12(5):594–605 (1991).

[21] Pachter, R., T. M. Cooper, R. L. Crane and W. W. Adams. "Smart Structures and Materials." 1993 SPIE Proceedings. 1993.

[22] Whitley, Darrell. "Fundamental Principles of Deception in Genetic Search." Foundations of Genetic Algorithms, edited by G. Rawlins. San Mateo, California: Morgan Kaufmann, 1991.
