

Computer search algorithms in protein modification and design
John R Desjarlais* and Neil D Clarke†

The computer-aided design of protein sequences requires efficient search algorithms to handle the enormous combinatorial complexity involved. A variety of different algorithms have now been applied with some success. The choice of algorithm can influence the representation of the problem in several important ways: the discreteness of the configuration, the types of energy terms that can be used and the ability to find the global minimum energy configuration. The use of dead end elimination to design the complete sequence for a small protein motif and the use of genetic and mean-field algorithms to design hydrophobic cores for proteins represent the major themes of the past year.

Addresses
*Department of Chemistry, Pennsylvania State University, University Park, Pennsylvania 16802, USA; e-mail: [email protected]
†Department of Biophysics and Biophysical Chemistry, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA; e-mail: [email protected]

Current Opinion in Structural Biology 1998, 8:471–475
http://biomednet.com/elecref/0959440X00800471
© Current Biology Publications ISSN 0959-440X

Abbreviations
DEE dead end elimination
GA genetic algorithm
MC Monte Carlo

Introduction The design of protein sequences, whether intended to adopt a particular fold or to modify a function, involves evaluating an extraordinarily large number of sequences for their ability to ‘fit’ a given structure. Search algorithms describe how a computer program samples from this enormous set of allowed solutions. All search algorithms necessarily make compromises between computational speed and thoroughness. Furthermore, there are important dependencies between the choice of search algorithm, the way in which the search space is represented and the energy or scoring function used. The purpose of this brief review is both to outline the computer search algorithms that have been used to design protein sequences and to raise some of the issues involved in understanding how search algorithms, energy functions and structural representations are inter-related. Finally, we discuss some of the experimental criteria used to analyze designed proteins, since conclusions regarding the effectiveness of the search algorithms depend, in part, on the interpretation of such experiments.

Overview: sampling versus quasi-exhaustive searching Search algorithms can be divided into two categories. The first category encompasses algorithms that sample solutions semi-randomly and then move from one possible solution to another in a manner that depends on both the nature of the energy landscape and the algorithm-specific rules for movement. Algorithms of this type that have been applied to proteins include Monte Carlo (MC) techniques [1,2] and genetic algorithms (GAs) [3–6]. An advantage of these algorithms is that they can be applied to search problems that have an infinite number of possible solutions. In particular, sidechain and backbone conformations can be allowed to vary continuously ([2]; JR Desjarlais, TM Handel, unpublished data). On the other hand, there is no guarantee that these algorithms will explore solutions near the global energy minimum. In contrast, algorithms that fall into the second category, pruning algorithms, are intended to be functionally equivalent to an exhaustive search. Since truly exhaustive searches are possible only for very small search spaces, pruning algorithms first simplify the search space by allowing only certain discrete conformations. They then apply rejection criteria in order to eliminate the vast majority of combinatorial possibilities without ever evaluating them explicitly. The robustness of these methods depends both on how finely the conformational space is represented and on the criteria used for rejection. Application of the dead end elimination (DEE) theorem [7] is the most important pruning idea currently used in the design of protein sequences [8••,9]. Other pruning methods [10,11] have been successfully used to design ligand-binding sites [12–17].

Sampling algorithms In the case of protein-sequence design, sampling algorithms can be used to vary sidechain identity, sidechain orientation and backbone structure. The simplest type of sampling procedure is the MC method. The general strategy of MC algorithms is to iteratively propose a modification to a model and then decide whether or not the proposed modification should be accepted. The most common way of deciding whether to accept a proposed modification is to use the Metropolis criterion [18]. According to this method, a modification is always accepted if it lowers the energy of the model. If a modification increases the energy of the model, acceptance or rejection is based on the outcome of what is essentially a weighted coin toss. The relative probabilities of the old (unmodified) and new (modified) models, which are used to weight the coin, are calculated according to Boltzmann’s relationship between probability and energy differences at a given temperature for the system. The relationship between acceptance probability and temperature is useful for simulated annealing. In this variation of the standard sampling algorithm, the system is slowly cooled throughout the run in order to gradually decrease the probability that an uphill energy modification will be accepted.
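The Metropolis criterion and the cooling schedule described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the temperature is expressed in energy units (that is, as kT), the geometric cooling schedule is only one of many possible choices, and `toy_energy` and `toy_propose` are hypothetical stand-ins for a real energy function and move set.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always; accept uphill moves with
    Boltzmann probability exp(-delta_e / temperature)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def simulated_annealing(state, energy, propose, t_start=2.0, t_end=0.05,
                        steps=5000, seed=0):
    """Monte Carlo search with a slowly decreasing temperature."""
    rng = random.Random(seed)
    current_e = energy(state)
    best, best_e = list(state), current_e
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = propose(state, rng)
        candidate_e = energy(candidate)
        if metropolis_accept(candidate_e - current_e, t, rng):
            state, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = list(state), current_e
    return best, best_e


def toy_energy(state):
    """Hypothetical energy: rotamer index 2 is optimal at every position."""
    return sum((rotamer - 2) ** 2 for rotamer in state)


def toy_propose(state, rng):
    """Hypothetical move: reassign one randomly chosen position."""
    candidate = list(state)
    candidate[rng.randrange(len(candidate))] = rng.randrange(5)
    return candidate
```

Calling `simulated_annealing([0, 4, 0, 4, 0, 4], toy_energy, toy_propose)` returns the lowest-energy state visited and its energy. As the text notes, nothing guarantees that this state is the global minimum.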


Engineering and design

GAs are similar in some respects to MC methods. The major distinctions are that a population of models is propagated (evolved) throughout the course of the run and genetic operators, such as recombination, are used to create new models from existing parents. The efficacy of the GA method stems from the implicit parallelism contained within protein design problems; different segments of the structure are optimized in parallel and selective recombination between models will sometimes bring two of the optimized segments together into the same model. Both the MC and GA methods are relatively straightforward to encode into search algorithms, although neither is guaranteed to converge to a global minimum. Both methods require a thorough optimization of the parameters that control the convergence properties of the algorithm, with respect to the system being studied.
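A population-based search with recombination, as outlined above, might look like the following sketch. It is not the implementation used in any of the cited studies: truncation selection, one-point crossover, the mutation rate and the integer encoding of each position are all assumptions chosen for brevity.

```python
import random


def genetic_search(energy, n_positions, n_choices, pop_size=30,
                   generations=60, mutation_rate=0.05, seed=0):
    """Evolve a population of models; each model is a list holding one
    integer (e.g. a residue/rotamer index) per position.
    Requires n_positions >= 2 so that a crossover point exists."""
    rng = random.Random(seed)
    population = [[rng.randrange(n_choices) for _ in range(n_positions)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged.
        population.sort(key=energy)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # One-point recombination can bring two separately optimized
            # segments together in a single child.
            cut = rng.randrange(1, n_positions)
            child = mother[:cut] + father[cut:]
            # Point mutation keeps diversity in the population.
            child = [rng.randrange(n_choices)
                     if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = parents + children
    return min(population, key=energy)
```

The population size, number of generations and mutation rate are exactly the kind of convergence parameters that, as noted above, need to be tuned for the system being studied.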

Dead end elimination DEE is arguably the most powerful method for discrete conformational searches because of its ability to make enormous reductions in combinatorial complexity using a robust process of elimination. In simple terms, the DEE theorem allows individual sidechain identities/rotamers to be strictly designated as being incompatible with the global energy minimum. In its original form [7], the DEE theorem uses the criterion that if the lowest energy structure that can be found using a given sidechain rotamer is higher in energy than the highest energy structure that can be found with a different rotamer, the first rotamer can be eliminated. A significantly larger reduction of possible rotamers is attainable with Goldstein’s variation [19] of the original method. This uses the criterion that if the energy of a possible structure containing one rotamer is always lowered by changing to a second rotamer, the first rotamer can be eliminated. With both methods, extending the concept to include rotamer pairs and higher order combinations results in further improvements in efficiency, although the application of the algorithm to higher order combinations obviously poses combinatorial problems of its own. For some problems, the application of DEE may result in a unique solution [8••,9,19,20]. Several modifications that improve the efficiency of the DEE process have been described [20–22]. Like all pruning-type algorithms, the implementation of DEE requires the use of discrete representations of the backbone and sidechains. In addition, it is restricted to energy terms that can be written as the sum of individual and pairwise energy terms. In some cases, these limitations might be overly restrictive for the problem at hand, necessitating use of other sampling methods, such as MC or GA.
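The two single-rotamer elimination criteria described above can be written compactly. Using notation assumed here, with E(i_r) the self energy of rotamer r at position i and E(i_r, j_s) a pair energy, the original criterion eliminates i_r if, for some alternative i_t, E(i_r) + Σ_j min_s E(i_r, j_s) > E(i_t) + Σ_j max_s E(i_t, j_s). Goldstein's criterion eliminates i_r if E(i_r) − E(i_t) + Σ_j min_s [E(i_r, j_s) − E(i_t, j_s)] > 0. The sketch below applies either criterion repeatedly until nothing more can be removed. The data layout and the randomly generated `toy_problem` are hypothetical, and pair-level elimination is not included.

```python
import itertools
import random


def total_energy(conf, self_e, pair_e):
    """Energy of a full assignment as a sum of self and pair terms."""
    n = len(conf)
    energy = sum(self_e[i][conf[i]] for i in range(n))
    energy += sum(pair_e[i][j][conf[i]][conf[j]]
                  for i in range(n) for j in range(i + 1, n))
    return energy


def dee_prune(self_e, pair_e, goldstein=True):
    """Return, per position, the set of rotamers not yet eliminated.
    self_e[i][r] is a self energy; pair_e[i][j][r][s] a pair energy."""
    n = len(self_e)
    alive = [set(range(len(rotamers))) for rotamers in self_e]

    def dominated(i, r, t):
        others = [j for j in range(n) if j != i]
        if goldstein:
            gap = self_e[i][r] - self_e[i][t] + sum(
                min(pair_e[i][j][r][s] - pair_e[i][j][t][s]
                    for s in alive[j]) for j in others)
        else:
            best_r = self_e[i][r] + sum(
                min(pair_e[i][j][r][s] for s in alive[j]) for j in others)
            worst_t = self_e[i][t] + sum(
                max(pair_e[i][j][t][s] for s in alive[j]) for j in others)
            gap = best_r - worst_t
        return gap > 0

    changed = True
    while changed:  # repeat until a full sweep eliminates nothing
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                if any(dominated(i, r, t) for t in alive[i] if t != r):
                    alive[i].discard(r)
                    changed = True
    return alive


def toy_problem(n_pos=4, n_rot=3, seed=1):
    """Hypothetical integer energies, symmetric in the pair terms."""
    rng = random.Random(seed)
    self_e = [[rng.randint(-5, 5) for _ in range(n_rot)]
              for _ in range(n_pos)]
    pair_e = [[None] * n_pos for _ in range(n_pos)]
    for i, j in itertools.combinations(range(n_pos), 2):
        block = [[rng.randint(-3, 3) for _ in range(n_rot)]
                 for _ in range(n_rot)]
        pair_e[i][j] = block
        pair_e[j][i] = [[block[r][s] for r in range(n_rot)]
                        for s in range(n_rot)]
    return self_e, pair_e
```

For a problem this small, the surviving sets can be checked against a brute-force enumeration with `itertools.product`. The lowest-energy assignment always survives, because a rotamer is removed only when some alternative is strictly better in every context.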

Sidechain rotamers and the use of discrete backbone conformations Although the representation of conformational space as a set of discrete states is required only for pruning algorithms, this simplification is also very commonly used for sampling methods such as MC and GA. The level at which sidechain and backbone conformations are made discrete can be expected to have a dramatic effect on the ability of the algorithm to predict the foldability of alternative sequences. The most obvious problem with the discretization of conformational space is that the number of acceptable packing solutions found will be much smaller than it should be, because steric clashes that could be relieved in continuous space cannot be relieved on a discrete grid of conformations. Recent studies on the effect of backbone and/or sidechain flexibility have confirmed that rigidly defined rotamers can be very misleading when applied to the prediction of allowed sequences ([23]; JR Desjarlais, TM Handel, unpublished data). These studies also imply that parameterization of the weights applied to various energy function terms strongly depends on the level of discreteness used. As an example of how energy term parameterization is closely coupled to the choice of search strategy, an often used compromise for dealing with the steric problems inherent in the use of rotamers is to soften the repulsive van der Waals term or to reduce the size of the atomic radii. These adjustments can themselves be parameterized according to experimental results indicating whether certain substitutions are functional or nonfunctional [24,25]. Whether this solution is robust in the sense that it gives predictions as accurate as those that could be achieved with finer sampling and a more accurate potential is unclear.
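The effect of reducing atomic radii can be seen with a simple 12-6 Lennard-Jones term. This is only a sketch: the functional form, the well depth and the `radius_scale` parameter are illustrative assumptions, not the potentials or parameter values used in the cited work.

```python
def lennard_jones(distance, radius_i, radius_j, well_depth=0.1,
                  radius_scale=1.0):
    """12-6 Lennard-Jones energy with its minimum (-well_depth) at
    radius_scale * (radius_i + radius_j). A radius_scale below 1.0
    shrinks both atoms and so softens close-contact repulsion."""
    r_min = radius_scale * (radius_i + radius_j)
    ratio = r_min / distance
    return well_depth * (ratio ** 12 - 2.0 * ratio ** 6)
```

With two atoms of radius 1.9 placed 3.2 apart (hypothetical values, in the same length units), the unscaled term is repulsive, whereas `radius_scale=0.9` makes the same contact mildly favorable. A rotamer that would be rejected for a small clash is therefore retained.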

Global versus additive energies (scoring terms) Algorithms that eliminate individual amino acids/rotamers or pairs of amino acids/rotamers are very powerful approximations, but they disregard the context of the global structure. Whether this kind of pruning is appropriate depends on how well the global energy can be represented as a simple sum of single and pairwise energy terms. Examples of terms that might reasonably be considered to be simple in this way include Lennard–Jones energies, torsional energies, secondary-structure propensities and simple coulombic representations of electrostatic effects. The most important property that is not additive in this simple way is the solvent-accessible surface area. This is important because accessible surface areas are commonly used to estimate the solvent contribution to the free energy of a model sequence or structure [26]. Recognition of this problem [8••] has led to the development of empirical terms that compensate for the nonadditivity of accessible surface area [27•]. Other global energy terms that might be used in design algorithms include compositional biases and geometric constraints towards preferred structures. One important advantage of sampling methods such as GA and MC is that they can use global terms of this type.
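The additivity requirement can be stated explicitly. In notation assumed here (it does not appear in the original text), i_r denotes rotamer r at position i, and an energy function is compatible with rotamer-level pruning when it decomposes as follows:

```latex
E_{\text{total}} \;=\; E_{\text{template}}
  \;+\; \sum_{i} E(i_r)
  \;+\; \sum_{i} \sum_{j>i} E(i_r, j_s)
```

A global term, such as the solvent-accessible surface area of the whole model, depends on all sidechains simultaneously and in general cannot be written exactly in this form. Pairwise approximations of such terms therefore have to be fitted empirically.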

Pruning strategies in the design of ligand-binding sites In addition to the redesign of protein folds, considerable progress has been made in designing simple ligand-binding sites. Metal ions offer important experimental advantages for the development of these methods.


These advantages include well-defined coordination geometries, tight intrinsic binding and, in some cases, spectroscopic properties that allow the design to be evaluated without requiring a complete structure determination. Two computer programs have been written that help design metal sites in proteins of known structure. Interestingly, the two programs use very different strategies for dealing with the combinatorial complexity of the problem. DEZYMER [10] uses ‘depth first’ pruning whereas METAL-SEARCH [11] uses ‘on-the-fly binning’. METAL-SEARCH is much less versatile than DEZYMER (it was written to look only for tetrahedral Cys–His sites) but this lack of versatility is not due to the choice of pruning algorithm. The different strategies are illustrated schematically in Figure 1.


Both DEZYMER and METAL-SEARCH assume fixed backbones, both use rotamers in the initial stages of the search and both use simple geometric criteria for evaluating potential sites. Despite these simplifications, both programs have been used successfully in the design of metal-binding sites [12–17,28]. DEZYMER has also been used to design sites of more complex ligands [29].
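The 'on-the-fly binning' idea described in the Figure 1 caption can be sketched in a few lines. This is an illustration of the general technique under stated assumptions, not the METAL-SEARCH code: the function name, the tuple layout of `candidates`, the box size and the threshold of three residues (taken from the schematic) are all hypothetical, and idealized metal positions are assumed to have been computed already.

```python
from collections import defaultdict


def bin_metal_positions(candidates, box_size=2.0, min_residues=3):
    """Group precomputed idealized metal positions by grid box.

    candidates: iterable of (residue, amino_acid, rotamer, (x, y, z)).
    Returns the boxes that hold positions from at least min_residues
    distinct residues, mapped to the entries that landed in them."""
    boxes = defaultdict(list)
    for residue, amino_acid, rotamer, (x, y, z) in candidates:
        # Note which box the metal falls into and record how it got there.
        key = (int(x // box_size), int(y // box_size), int(z // box_size))
        boxes[key].append((residue, amino_acid, rotamer))
    sites = {}
    for key, entries in boxes.items():
        if len({entry[0] for entry in entries}) >= min_residues:
            sites[key] = entries
    return sites
```

Each candidate is examined once, so the cost grows linearly with the number of residue, sidechain and rotamer combinations rather than with the number of triples. A single fixed grid will miss positions that lie close together on opposite sides of a box boundary; how that case should be handled is left open in this sketch. Boxes that pass the count would still need the geometric checks described in the caption.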

Figure 1 [Schematic with panels (a) and (b), each showing residues 1 to 5; image not reproduced.]

Pruning strategies used in the design of metal-binding sites. For the purposes of this schematic, potential binding sites involve three substitutions. (a) ‘Depth first’ pruning. A particular amino acid substitution, sidechain rotamer and, if appropriate, sidechain to metal orientation are first picked as an ‘anchor’ around which the search for additional coordinating residues is conducted. The coordinates of the bound-metal ion are then calculated (small black circle at the end of the sidechain on residue 1). Residues that are deemed too far away from the metal ion are immediately rejected (residues 2 and 3, gray circles). For each remaining residue, all possible rotamers are ‘grown’ in turn, one atom at a time. As each atom is added, the growing sidechain is evaluated to see whether it is compatible with binding to the anchor ligand. If the growing sidechain is deemed incompatible with binding, then growth along that branch is stopped (gray branches); otherwise, growth and evaluation are continued (black branches). In DEZYMER, the criteria for assessing the compatibility with binding include having the anchor position lie within a precalculated ‘ligand expectation sphere’ for the growing sidechain [10]. In this example, sidechains from residues 1, 4, and 5 might meet the geometric criteria and could be further refined and evaluated. The search must then be repeated using a different initial position for the metal ion, as determined by different combinations of anchor residue, sidechain rotamer and sidechain metal geometry. (b) Pruning by ‘on-the-fly binning’. METAL-SEARCH precalculates idealized metal positions at every residue, for each kind of sidechain ligand and for every rotamer [11]. The efficiency of the algorithm comes in grouping, ‘on-the-fly’, those substitutions that have idealized metal positions near to one another. This is done by placing a grid over the protein structure prior to the calculation of the metal coordinates. As the metal positions are calculated (small black circles), the algorithm simply notes which box the metal ion falls into. Information relevant to how the metal got into that box (residue number, amino acid type and rotamer) is then added to the list that is associated with that particular box. Once all the metal positions have been calculated, the algorithm checks to see which boxes contain information about three or more residues. In this case, the heavy-lined boxes indicate possible sites involving residues 1, 4 and 5, and residues 2, 3 and 4. Geometric criteria that assess the quality of the site(s) are then applied, and sites that meet the criteria can be further refined and evaluated. Reproduced with permission from [11].

Criteria for evaluating design algorithms The past couple of years have seen remarkable successes in the manual design of simple helical proteins, including one design that was constrained to be 50% identical in sequence to a predominantly β-sheet protein [30]! Success in protein design, however, only raises the standards of what should be considered a success. If it is important that designed proteins be like natural proteins, then it is important to decide what criteria are most indicative of a natural native structure. Thermodynamic stability is probably not a good criterion because increased stability can be obtained simply by increasing structural degeneracy (configurational entropy) in the native-like state. Indeed, α4, the first four-helix bundle to be designed, is extremely stable even though it lacks the well-ordered core associated with natural proteins [31,32]. A high enthalpy change for unfolding is arguably a more useful criterion. Perhaps the most useful criteria are amide proton exchange rates, as these provide insight into the dynamic structure of the protein. In the case of ligand-binding site design, the correct binding geometry and high binding affinity are probably the two most important criteria.

Other search algorithms The search methods discussed above have evolved from earlier application in the closely related field of structure prediction by comparative modeling. It is likely that emerging methods, such as mean-field approaches [33–35], will begin to find application in protein design as well [36,37•]. Parametrized minimization has also been used quite successfully for the prediction [38] and design (P Harbury, J Plecs, B Tidor, T Alber, P Kim, personal communication) of coiled-coil proteins. The extension of this method to more typical proteins, which do not have the high degree of symmetry inherent in coiled coils, will presumably require extensive modification in order to accommodate the increased combinatorics involved.

Conclusions As the search algorithm determines the types of limitations and assumptions involved in the search process, one must carefully consider the trade-offs involved in choosing among the options. Currently, DEE is the superior method for guaranteeing convergence to the global minimum energy conformation. It is, however, important to distinguish between the global energy minimum of the restricted search space (determined by the discreteness of allowed conformations) and that of the true conformational and sequence space of the protein. The impressive success of DEE in designing sequences for discrete backbone structures suggests that, given sufficiently fine sampling, the two minima are closely related but not identical [9,39]. We expect that this will change dramatically as design attempts are extended to include de novo designed backbone structures. The convergence properties of the GA and MC approaches have so far proven to be sufficient for hydrophobic core design and evaluation [2,5,40•], but they may eventually suffer as the size of the search space is increased. Some advantages of these methods, as discussed above, will remain. The number of designed sequences of the past few years that have ‘worked’ at some level is truly remarkable. The next few years will see many more designs. How much of the success of protein design is due to the search algorithms themselves? How much is due to the plasticity of protein structures? How much is due to the retention of wild-type sequence features? How do we decide what computational and experimental criteria are the most appropriate for evaluating protein designs? The answers to these, and other questions, will come only when a great many more calculations, and experiments, are carried out.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Lee C, Levitt M: Accurate prediction of the stability and activity effects of site-directed mutagenesis on a protein core. Nature 1991, 352:448-451.
2. Hellinga HW, Richards FM: Optimal sequence selection in proteins of known structure by simulated evolution. Proc Natl Acad Sci USA 1994, 91:5803-5807.
3. Holland JH: Adaptation in Natural and Artificial Systems. Cambridge, MA: The MIT Press; 1992.
4. Tuffery P, Etchebest C, Hazout S, Lavery R: A new approach to the rapid determination of protein sidechain conformations. J Biomol Struct Dyn 1991, 8:1267-1289.
5. Desjarlais JR, Handel TM: De novo design of the hydrophobic cores of proteins. Protein Sci 1995, 4:2006-2018.
6. Pedersen JT, Moult J: Genetic algorithms for protein structure prediction. Curr Opin Struct Biol 1996, 6:227-231.
7. Desmet J, De Maeyer M, Hazes B, Lasters I: The dead-end elimination theorem and its use in protein side-chain positioning. Nature 1992, 356:539-542.

8. •• Dahiyat BI, Mayo SL: Protein design automation. Protein Sci 1996, 5:895-903.
This paper demonstrates the potential of dead end elimination in designing complete protein sequences that are consistent with a fixed backbone template. The structure of the designed protein is very similar to the zinc-finger template that was used for the sequence optimization and it folds with a stability that is considered to be substantial given the small size of the domain. There are, however, interesting differences between the backbone and sidechain structures determined for the real protein and the designed structure.
9. Dahiyat BI, Mayo SL: De novo protein design: fully automated sequence selection. Science 1997, 278:82-87.
10. Hellinga HW, Richards FM: Construction of new ligand binding sites in proteins of known structure. I. Computer-aided modeling of sites with pre-defined geometry. J Mol Biol 1991, 222:763-785.
11. Clarke ND, Yuan SM: Metal search: a computer program that helps design tetrahedral metal-binding sites. Proteins 1995, 23:256-263.
12. Klemba M, Regan L: Characterization of metal binding by a designed protein: single ligand substitutions at a tetrahedral Cys2His2 site. Biochemistry 1995, 34:10094-10100.
13. Klemba M, Gardner KH, Marino S, Clarke ND, Regan L: Novel metal-binding proteins by design. Nat Struct Biol 1995, 2:368-373.
14. Hellinga HW, Caradonna JP, Richards FM: Construction of new ligand binding sites in proteins of known structure. II. Grafting of a buried transition metal binding site into Escherichia coli thioredoxin. J Mol Biol 1991, 222:787-803.
15. Regan L, Clarke ND: A tetrahedral zinc(II)-binding site introduced into a designed protein. Biochemistry 1990, 29:10878-10883.
16. Hellinga HW: Metalloprotein design. Curr Opin Biotechnol 1996, 7:437-441.
17. Regan L: Protein design: novel metal-binding sites. Trends Biochem Sci 1995, 20:280-285.
18. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH: Equation of state calculations by fast computing machines. J Chem Phys 1953, 21:1087-1092.
19. Goldstein RF: Efficient rotamer elimination applied to protein sidechains and related spin glasses. Biophys J 1994, 66:1335-1340.
20. Lasters I, De Maeyer M, Desmet J: Enhanced dead-end elimination in the search for the global minimum energy conformation of a collection of protein sidechains. Protein Eng 1995, 8:815-822.
21. Keller DA, Shibata M, Marcus E, Ornstein RL, Rein R: Finding the global minimum: a fuzzy end elimination implementation. Protein Eng 1995, 8:893-904.
22. De Maeyer M, Desmet J, Lasters I: All in one: a highly detailed rotamer library improves both accuracy and speed in the modelling of sidechains by dead-end elimination. Fold Des 1997, 2:53-66.
23. Lee C: Testing homology modeling on mutant proteins: predicting structural and thermodynamic effects in the Ala98→Val mutants of T4 lysozyme. Fold Des 1996, 1:1-12.
24. Hurley JH, Baase WA, Matthews BW: Design and structural analysis of alternative hydrophobic core packing arrangements in bacteriophage T4 lysozyme. J Mol Biol 1992, 224:1143-1159.
25. Dahiyat BI, Mayo SL: Probing the role of packing specificity in protein design. Proc Natl Acad Sci USA 1997, 94:10172-10177.
26. Eisenberg D, McLachlan AD: Solvation energy in protein folding and binding. Nature 1986, 319:199-203.
27. • Street AG, Mayo SL: Pairwise calculation of protein solvent-accessible surface area. Fold Des 1998, 3:253-258.
The authors present the parametrization of scaling factors for pairwise calculations of solvent-accessible surface areas, a requirement for using dead end elimination. Excellent correlations between the resulting approximations and the true surface areas are demonstrated.
28. Pinto AL, Hellinga HW, Caradonna JP: Construction of a catalytically active iron superoxide dismutase by rational protein design. Proc Natl Acad Sci USA 1997, 94:5562-5567.
29. Coldren CD, Hellinga HW, Caradonna JP: The rational design and construction of a cuboidal iron-sulfur protein. Proc Natl Acad Sci USA 1997, 94:6635-6640.
30. Dalal S, Balasubramanian S, Regan L: Protein alchemy: changing beta-sheet into alpha-helix. Nat Struct Biol 1997, 4:548-552.
31. Regan L, DeGrado WF: Characterization of a helical protein designed from first principles. Science 1988, 241:976-978.
32. Handel TM, Williams SA, DeGrado WF: Metal ion-dependent modulation of the dynamics of a designed protein. Science 1993, 261:879-885.
33. Koehl P, Delarue M: Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J Mol Biol 1994, 239:249-275.
34. Lee C: Predicting protein mutant energetics by self-consistent ensemble optimization. J Mol Biol 1994, 236:918-939.
35. Koehl P, Delarue M: Mean-field minimization methods for biological macromolecules. Curr Opin Struct Biol 1996, 6:222-226.
36. Kono H, Doi J: Energy minimization method using automata network for sequence and sidechain conformation prediction from given backbone geometry. Proteins 1994, 19:244-255.
37. • Kono H, Nishiyama M, Tanokura M, Doi J: Design of hydrophobic core of E. coli malate dehydrogenase based on the sidechain packing. Pac Symp Biocomput 1997: 210-221.
This work demonstrates the use of a novel automata network method, similar to mean-field approaches [35], for hydrophobic core design.
38. Harbury PB, Tidor B, Kim PS: Repacking protein cores with backbone freedom: structure prediction for coiled coils. Proc Natl Acad Sci USA 1995, 92:8408-8412.
39. Su A, Mayo SL: Coupling backbone flexibility and amino acid sequence selection in protein design. Protein Sci 1997, 6:1701-1707.
40. • Lazar GA, Desjarlais JR, Handel TM: De novo design of the hydrophobic core of ubiquitin. Protein Sci 1997, 6:1167-1178.
The paper represents the most recent application of a genetic algorithm for hydrophobic core design. Several multiply substituted variants of ubiquitin were designed and experimentally characterized. The use of different types of sidechain rotamer libraries, atomic-potential functions and levels of conformational discreteness are explored.
