Biased Fragment Distribution in MC Simulation of Protein Folding
ERIC MARTINEAU,* PIERRE-JEAN L’HEUREUX, JOHN R. GUNN
Université de Montréal, Département de Chimie, Montréal, Québec, Canada
Centre de Recherche en Calcul Appliqué, Montréal, Québec, Canada
Protein Engineering Network of Centers of Excellence, Edmonton, Alberta, Canada
Received 16 April 2004; Accepted 4 June 2004
DOI 10.1002/jcc.20109
Published online in Wiley InterScience (www.interscience.wiley.com).
Abstract: Monte Carlo (MC) methods play an important role in simulations of protein folding. These methods rely on a random sampling of moves on a potential energy surface. To improve the efficiency of the sampling, we propose a new selection of trial moves based on an empirical distribution of three-residue (triplet) conformations. This selection is compared to random combinations of the preferred conformations of the three amino acids, and it is shown that the new trial moves lead to finding structures closer to the native conformation. © 2004 Wiley Periodicals, Inc.
J Comput Chem 25: 1895–1903, 2004
Key words: fragment distribution; MC simulation; protein folding
Introduction

Monte Carlo simulations have proven to be very useful in protein folding prediction over the past decade.1,2 The many degrees of freedom in a protein lead to extremely complex energy landscapes composed of multiple minima and maxima.3 An exhaustive sampling of this potential energy surface is intractable, and Monte Carlo simulations are one of the tools used to sample it.4 The straightforward Monte Carlo algorithm is not well suited for finding a unique native conformation. To address this problem, noncanonical sampling algorithms5,6 and methods of smoothing out the potential5 of the ensemble of structures7 have been proposed. By decreasing an artificial temperature function with the simulated annealing procedure, the Monte Carlo algorithm can overcome the local minima problem and possibly find the global minimum.8 In protein folding, a change in the Cartesian coordinates of any atom is a trial move on the corresponding energy landscape of the protein. In the combined Monte Carlo-simulated annealing algorithm, this trial move is accepted or rejected, depending on the effect it has on the overall energy of the system.

Even though protein folding can be studied theoretically, several methods rely on experimental data. Insight into the secondary structure can be provided by NMR9 and circular dichroism10 experiments. The 3D structure of proteins can be elucidated by NMR,11 electron microscopy,12 and X-ray crystallography.13 The known 3D structures of proteins are then compiled in databases such as the Protein Data Bank (PDB).14 However,
experimental methods are not always sufficient to elucidate particular protein 3D structures. With the improvement of computers and mass storage, theoretical methods based on the information acquired over the years by experimental techniques entered the race to predict folding patterns.15 Knowledge-based approaches extract statistical information and integrate it into theoretical frameworks. A good knowledge-based method therefore represents a fair compromise between ab initio and empirical methods. To improve a protein folding simulation, one can take advantage of PDB structural features to refine the model. In this way, the knowledge-based approach can exploit the existing information without fully understanding the physics behind the phenomenon of protein folding.16

Stochastic simulation methods do not take into account the trajectory of the folding. Therefore, to find the global minimum of the potential energy surface, it is necessary to sample it in a quasi-exhaustive manner. Allowing the widest range of random structures in a Monte Carlo search gives a more diverse subset of molecules. This diverse subset of molecules will, in turn, correspond to very different regions of the potential energy surface. This larger sampling may output better 3D structures, but practical use forbids the exhaustive sampling of the potential energy surface. However, premature termination of random moves on the potential energy surface may fail to generate satisfactory results. One way of biasing these random moves is to change the geometrical distribution of possible structures. In this article, we propose changing the distribution of three-residue (triplet) fragments. This new distribution comes from the crystallographic triplet geometries of a set of nonredundant proteins. This natural distribution will provide a new weighting of conformations that will more efficiently guide the Monte Carlo simulation towards more realistic structures. The new distribution of trial moves will be compared to the results of random combinations of amino acid conformations, and its performance in producing native-like structures will be evaluated.

Correspondence to: E. Martineau; e-mail: [email protected]
*Current address: Neurochem Inc., 275 Armand-Frappier Blvd., Laval, Québec, H7V 4A7, Canada
Contract/grant sponsors: NSERC, FCAR, PENCE, Université de Montréal, and CERCA

Martineau, L’Heureux, and Gunn • Vol. 25, No. 15 • Journal of Computational Chemistry
Simulation Algorithm

The simulation is based on a hierarchical model of protein folding previously introduced.17 The fundamental assumption in this approach is that the free energy of folding is dominated by the hydrophobic packing of the secondary-structure elements. As this is primarily an ab initio methodology, there are no constraints on the overall shape of the molecule, and the primary goal is to reproduce the thermodynamic stability of the compact topology. The details of the loop conformations are not considered important at this level of resolution, and are required only to provide sufficient flexibility to allow effective hydrophobic packing. The potential function is an empirical potential of mean force derived from a statistical analysis of known structures. Solvent effects are included only implicitly in the relative propensities of each amino acid to be buried or exposed in the structure. The potential is therefore predominantly long range, with excluded volume treated with a hard-sphere term. This potential can be evaluated roughly two to three orders of magnitude more rapidly than an all-atom molecular-mechanics force field. The model18,19 therefore neglects short-range interactions such as hydrogen bonds. Hydrogen bonding is, of course, essential to stabilize the secondary structure; however, the objective is to simulate the packing of secondary-structure elements relative to one another and not to model their internal structure, which is taken to be constant. Helices and strands are thus treated simply as rigid bodies, thereby eliminating any requirement that they be stable with respect to the simulation energy. This model has been shown to be effective at producing low-resolution structures of α-helical proteins. The packing hypothesis breaks down, however, in modeling β-sheet formation, where the strand–strand interactions consist of short-range directional forces that are nonlocal in sequence. All-β folds therefore tend to be poorly described without additional terms to model the sheet structure. This representation of protein folding is the basis of the hierarchical methodology that has been developed using simulated annealing and a genetic algorithm.20 Secondary structure elements (helices and strands) are treated as rigid bodies, assuming a fixed secondary structure assignment as a priori input, and the Monte Carlo trial moves are explicitly based on the end-to-end conformations of the loops that determine the relative positions of the secondary-structure elements.
Figure 1. Representative structure of the Myoglobin (1mbo) generated by the program using the default dihedral maps and triplet distributions.
Each amino acid is represented by the backbone atoms, while the side chain is represented by a sphere centered on the β-carbon position. The primary and the secondary structures of the target protein are required as initial input. For the primary structure, there are finite lists of φ and ψ angles for the 20 different natural amino acids. These are the only degrees of freedom taken into consideration; all the others are assumed to take standard values. Each amino acid possesses its own choice of φ and ψ angles represented by a Ramachandran map.21 For residues assigned secondary structures, only those in the α-helix and the β-sheet (parallel and antiparallel) are considered. For each of these structures, only one pair of φ and ψ angles22 is considered, to keep them fixed during the simulation. Figure 1 illustrates a simulated protein where the internal structure of the helices is maintained rigid while the loops remain flexible. The hierarchical methodology consists of building larger and larger segments of the structure using a set of possible conformations of the smaller segments. In the present algorithm, the hierarchy follows this order: residues, triplets, loops, and finally, proteins. Figure 2 illustrates the hierarchical methodology. The first step in the hierarchical procedure is to construct triplets of amino acids. Each triplet is accepted or rejected, depending on its corresponding region in the geometrical space. More precisely, each triplet is assigned to a specific bin depending on its end-to-end internal coordinates. The objective is to cover as many bins as possible. At the level of the overall potential of the protein, the replacement of a triplet by another from the same bin is a small move in the conformational space.
Figure 2. Hierarchical methodology to generate 3D structure of proteins.

The second step in the procedure is to bias the population of loops. A loop is a series of triplets between two secondary structures. To change a loop, the replacement of a triplet is necessary. The trial loops formed must be different from their initial geometry without losing similarity. Here, again, the loops are assigned to specific bins depending on their internal coordinates. However, in contrast to the case of the triplets, the population of the bins represents a subset of more probable loop conformations. The last hierarchical step is to generate the complete protein by selecting loops present in the list and inserting them between each pair of consecutive secondary structures. The acceptance or rejection of the new protein is based purely on an energy score derived from a statistical potential function.23 If the generated molecule is not rejected, it will evolve in an ensemble managed by a Monte Carlo algorithm. In our case, the Monte Carlo algorithm is based on the Metropolis criterion.24,25 It goes as follows: a final state $a_f$ is chosen from an initial state $a_i$ by modifying specific degrees of freedom. The transition probability can be written as follows:

$$P(a_i \rightarrow a_f) = \exp\left(-\beta (E_f - E_i)\right) \quad \text{if } E_f > E_i \tag{1}$$

$$P(a_i \rightarrow a_f) = 1 \quad \text{if } E_f \le E_i \tag{2}$$

where $E_i$ and $E_f$ are the energies of the initial and the final states, respectively, and $\beta = (k_B T)^{-1}$. The probability of a protein to be in the state $i$ corresponds to a Boltzmann distribution. This algorithm leads towards a stationary state where the population of a state $a_i$ is proportional to $\exp(-\beta E_i)$. It is postulated that repeating this procedure over a sufficient period of time will ultimately represent the distribution in equilibrium both statistically and thermodynamically.25 The Monte Carlo algorithm is combined with simulated annealing. This means that the system will be subjected to a varying virtual temperature. The gradual decrease of this temperature will reduce the access to a certain number of previous conformations. The temperature function in our case is exponential and can be expressed in the following equivalent ways:

$$T_k = \left(T_i^{\,N-k} \cdot T_f^{\,k}\right)^{1/N} = T_i\, e^{-Ak} = e^{-A}\, T_{k-1} \tag{3}$$

where $A = (1/N) \cdot \log(T_i/T_f)$; $N$ is the total number of iterations; $k$ is the current iteration; and $T$ is the virtual temperature. This choice of annealing schedule allows the simulation to spend more time at lower temperatures, to better remain in equilibrium where barrier crossings are less frequent.

Finally, the ensemble of molecules is managed by a genetic algorithm. From an ensemble of 64 proteins called “parents,” there are mutations (random changes in loops of the protein), hybridizations (coupling different members of the ensemble), and selections of the next generation of molecules. There are 512 “children” formed from this “parent” ensemble. From this ensemble, chosen to be larger than the “parent” ensemble, a canonical subset of 64 new “parents” is retained. The subset of new parents is chosen randomly from a Boltzmann distribution. In other words, each possible way of choosing 64 out of 512 structures has a probability proportional to $\exp(-\beta E)$, where $E$ is the total energy for the 64 structures in the set. The new “parents” retain some similarity to their predecessors, and it is possible to form “children” that will have lower energy. Optimization of the parameters for the genetic algorithm has been discussed in a previous article.20 In a single Monte Carlo cycle, the algorithm will construct molecules from loops and, later on, select new parents from the canonical subset of children in the genetic algorithm. The assessment of prediction quality is based on the coordinate superimposition RMS deviation26 with respect to the target protein structure available in the PDB.
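The Metropolis test of eqs. (1) and (2) and the exponential schedule of eq. (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Fortran program: the function names, the toy quadratic energy, and the convention that $k_B$ is absorbed into the virtual temperature are all hypothetical choices made for the example.

```python
import math
import random


def temperature(k, n_iter, t_initial, t_final):
    """Virtual temperature at iteration k: T_k = T_i * exp(-A * k), as in eq. (3)."""
    a = math.log(t_initial / t_final) / n_iter
    return t_initial * math.exp(-a * k)


def metropolis_accept(e_initial, e_final, temp, rng=random):
    """Metropolis test of eqs. (1)-(2), with beta = 1 / temp (k_B absorbed: assumption)."""
    if e_final <= e_initial:
        return True  # downhill or neutral moves are always accepted
    return rng.random() < math.exp(-(e_final - e_initial) / temp)


def anneal(energy, propose, state, n_iter, t_initial, t_final, seed=0):
    """Run n_iter annealed Metropolis steps; energy and propose are user-supplied."""
    rng = random.Random(seed)
    e = energy(state)
    for k in range(1, n_iter + 1):
        temp = temperature(k, n_iter, t_initial, t_final)
        trial = propose(state, rng)
        e_trial = energy(trial)
        if metropolis_accept(e, e_trial, temp, rng):
            state, e = trial, e_trial
    return state, e


# Toy usage: a one-dimensional quadratic "landscape" (hypothetical, not the paper's potential).
final_state, final_energy = anneal(
    energy=lambda x: x * x,
    propose=lambda x, rng: x + rng.uniform(-1.0, 1.0),
    state=5.0, n_iter=2000, t_initial=10.0, t_final=0.01,
)
```

Because the schedule is geometric, each iteration multiplies the temperature by the same factor $e^{-A}$, so the absolute temperature steps become smaller as the run proceeds.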
Triplet Characterization

As mentioned previously, the actual triplet construction is based on random combinations of coordinates selected from individual Ramachandran maps. New information can be added so that the triplet list will be more representative of the information found in the natural structures. More precisely, we extract geometries of triplets located in loops from an ensemble of known protein 3D structures in the PDB. This knowledge-based distribution of triplet geometries is then used inside the simulation to orient the prediction. The triplets are cataloged in a sequence-dependent fashion. During the simulation, two homology scores and cutoff criteria are used to create a dynamic list of triplets relevant for a particular move. Further along, we will show the impact of this knowledge-based distribution by comparing statistics of internal coordinates for both distributions. Here is the protocol to construct the knowledge-based distribution. First, a structurally nonredundant set of proteins is required. Because the selection of triplets inside the simulation is not probabilistic, a redundant set would give undesired weights to the distribution. Our nonredundant set of proteins comes from the methodology of Dunbrack.27 This list contains 2376 chains with a primary structure acceptance threshold of 90% or less, a resolution of 3 Å, and an R factor (from BLAST28,29) of 1. From this list, all
Figure 3. Distribution of triplets with respect to their sequence.
Figure 4. Distribution of the different combinations of triplet neighbors.
the triplets contained in the loops are extracted and characterized depending on their sequence, their immediate neighbors, and their internal coordinates. In total, 184,661 triplets are extracted from the list. Of the 20³ = 8000 possible combinations of amino acids, 546 are not present. This is reasonable, because in nature some combinations are less probable than others. Of these missing combinations, 22.4% contain a tryptophan, 17.2% contain a cysteine, and 16.1% contain a methionine. Figure 3 illustrates the population of triplet sequences. The x-axis is written as an index (m) corresponding to the triplet combination. Table 4 shows each value (1 to 20) and its corresponding amino acid.
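One way to realize the triplet index m is sketched below. The text does not state how the three amino acid codes are ordered within m, so the row-major ordering used here (first residue varying slowest) is an assumption, as are the function names; only the codes 1 to 20 come from Table 4.

```python
# Three-letter names in the order of the codes 1..20 of Table 4.
AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]


def triplet_index(i, j, k):
    """Map three amino acid codes (1..20) to an index m in 1..8000 (ordering assumed)."""
    for code in (i, j, k):
        if not 1 <= code <= 20:
            raise ValueError("amino acid codes must lie between 1 and 20")
    return 400 * (i - 1) + 20 * (j - 1) + k


def triplet_codes(m):
    """Inverse of triplet_index: recover the three codes from m."""
    i, rest = divmod(m - 1, 400)
    j, k = divmod(rest, 20)
    return i + 1, j + 1, k + 1


n_possible = len(AMINO_ACIDS) ** 3   # 8000 possible sequences
n_observed = n_possible - 546        # sequences present in the extracted set
```

With this convention Ala-Ala-Ala maps to m = 1 and Val-Val-Val to m = 8000, and 7454 of the 8000 sequences are populated.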
The index m enumerates the possible amino acid combinations, as written symbolically in eq. (4):

$$\sum_{m=1}^{8000} m = \sum_{i=1}^{20} \sum_{j=1}^{20} \sum_{k=1}^{20} (aa)_i (aa)_j (aa)_k \tag{4}$$

The triplet contexts (neighbors) are defined as the secondary structures (loop, α-helix, or β-sheet) of the two residues preceding the triplet and the two residues following the triplet. There are five classes of secondary structure context for each end: helix-helix, helix-loop, strand-strand, strand-loop, and loop-loop on one end; and helix-helix, loop-helix, strand-strand, loop-strand, and loop-loop on the other end. Table 5 shows the 25 different classes of neighbors. Figure 4 shows the population of each type of environment.

To ensure we properly cover the conformational space of the triplets with the natural distribution, each triplet is assigned to a bin depending on its internal coordinates. In our case, there are 1024 bins where each division is 90° (for the dihedral angles) and 0.5 (for the cosines of polar angles). With this resolution, there is an average of 180 triplets (184,661/1024) per bin.

Now that we have a new distribution of internal coordinates extracted from the natural data, it is possible to verify how much information there really is in the natural distribution. The random distribution will serve as a reference. In other words, it is a comparison between “natural geometry triplets” and “random geometry triplets.” The mathematical form of the coordinates is shown in eqs. (5) to (9), where $\{\hat{x}_i, \hat{y}_i, \hat{z}_i\}$ is the local coordinate frame of residue $i$ with $\hat{x}_i$ along the N–Cα bond and the amide H in the $\hat{x}_i$–$\hat{y}_i$ plane. Figure 5 illustrates the relationships between the internal coordinates. Interestingly, the natural weight of the new distribution is different from the random distribution. Figures 6 to 10 illustrate this difference.

$$q_1 = (\hat{x}_1 \cdot \hat{R}) \tag{5}$$

$$q_2 = (\hat{x}_2 \cdot \hat{R}) \tag{6}$$

$$|q_3| = \cos^{-1}\left(\frac{\hat{x}_1 \cdot \hat{x}_2 - (\hat{x}_1 \cdot \hat{R}) \cdot (\hat{x}_2 \cdot \hat{R})}{\sqrt{(1 - q_1^2) \cdot (1 - q_2^2)}}\right) \tag{7}$$

$$|q_4| = \cos^{-1}\left(\frac{\hat{y}_1 \cdot \hat{R}}{\sqrt{1 - q_1^2}}\right) \tag{8}$$

$$|q_5| = \cos^{-1}\left(\frac{\hat{y}_2 \cdot \hat{R}}{\sqrt{1 - q_2^2}}\right) \tag{9}$$
Figure 5. Illustration of the internal coordinates (q1–q5) for a triplet of residues. Each residue is noted by i, i + 1, i + 2, and i + 3. The Cartesian coordinates of the starting residue and the terminal residue are also shown.
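The internal coordinates of eqs. (5) to (9) and the 1024-bin assignment can be sketched as follows. Several points are assumptions rather than statements from the text: $\hat{R}$ is taken to be a unit vector along the end-to-end direction of the triplet, the sign attached to each dihedral (the equations only give magnitudes) is chosen by a triple-product convention, and the bin ordering and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def signed_angle(cosine, sign_source):
    """Arccosine in degrees, signed by sign_source (the sign convention is an assumption)."""
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
    return angle if sign_source >= 0.0 else -angle


def internal_coordinates(x1, y1, x2, y2, r):
    """q1..q5 of eqs. (5)-(9) from unit vectors; r is the assumed end-to-end direction."""
    q1, q2 = dot(x1, r), dot(x2, r)
    s1, s2 = math.sqrt(1.0 - q1 * q1), math.sqrt(1.0 - q2 * q2)
    if s1 == 0.0 or s2 == 0.0:
        raise ValueError("x1 or x2 is parallel to r; the dihedral angles are undefined")
    q3 = signed_angle((dot(x1, x2) - q1 * q2) / (s1 * s2), dot(r, cross(x1, x2)))
    q4 = signed_angle(dot(y1, r) / s1, dot(cross(x1, y1), r))
    q5 = signed_angle(dot(y2, r) / s2, dot(cross(x2, y2), r))
    return q1, q2, q3, q4, q5


def bin_index(q):
    """Map (q1..q5) to one of 4**5 = 1024 bins: 0.5-wide cosine bins, 90-degree angle bins."""
    q1, q2, q3, q4, q5 = q
    digits = [min(int((c + 1.0) / 0.5), 3) for c in (q1, q2)]
    digits += [min(int((a + 180.0) / 90.0), 3) for a in (q3, q4, q5)]
    index = 0
    for d in digits:
        index = index * 4 + d
    return index
```

Each coordinate contributes four divisions (2/0.5 for the cosines, 360°/90° for the dihedrals), which gives the 4⁵ = 1024 bins quoted in the text.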
Figure 6. Randomly generated distribution of the internal coordinate q1 (dashed line) vs. the natural (smooth line) over the angle range.
Figure 8. Randomly generated distribution of the internal coordinate q3 (dashed line) vs. the natural (smooth line) over the angle range.
Before starting the analysis, the conformational space of internal coordinates needs to be explained. Internal coordinates q1 and q2 are polar angles (as in ordinary spherical coordinates). These coordinates range from π to 0. The coordinates q1 and q2 are defined as inner products, each of which is the cosine of an angle. For the graph, we use the arccosine of this value to obtain an angle value. Because the arccosine function returns values between 0 and π, the angle values cannot be negative. For the q1 coordinates in Figure 6, the two graphs look similar in the same regions. The only two noticeable differences are located between −180° and −155° and between 5° and 0°. In these two regions, we found a lower population of q1 in the natural distribution than in the randomly generated distribution. For the coordinate q2, we find at first sight a rather bizarre form (Fig. 7). This comes from the definition of the coordinate q2, which does not allow any value over approximately 90°. The distributions of q2 resemble one another with the exception that the peaks
Figure 7. Randomly generated distribution of the internal coordinate q2 (dashed line) vs. the natural (smooth line) over the angle range.
in the population are farther apart. Nevertheless, the region from 60° to 0° contains more information in the case of the natural distribution. At this point, we notice that the previous algorithm was not giving enough weight to this region. The major difference between the coordinates q1 and q2 is the inclusion of the nitrogen preceding the first α-carbon in the geometry of the triplet. This nitrogen is not part of the backbone between the two α-carbons. Therefore, this nitrogen could position itself approximately anywhere. The nitrogen of the last α-carbon is included in the definition. The distribution of the coordinate q2 shows that it is impossible to position this nitrogen on the far side of the α-carbon. In contrast to the coordinates q1 and q2, the coordinates q3, q4, and q5 are equatorial angles, as in the usual spherical coordinates. For the coordinates q3 (Fig. 8), q4 (Fig. 9), and q5 (Fig. 10), it is clear that the way we were defining our coordinates previously was not correct. The knowledge-based distribution adds valuable geometries for future simulations.
Figure 9. Randomly generated distribution of the internal coordinate q4 (dashed line) vs. the natural (smooth line) over the angle range.
Figure 10. Randomly generated distribution of the internal coordinate q5 (dashed line) vs. the natural (smooth line) over the angle range.

Bias and Selection

To incorporate more natural triplets into specific loops, some pruning must occur. Data collected about both the sequence and neighboring secondary structure of natural triplets are used to create homology-based filters. Our scoring scheme allows a sufficient quantity of triplets to pass the filter while increasing the quality of the information passing through. The score has to be tight enough to improve the quality of the list of triplets, but loose enough for the Monte Carlo simulation to run smoothly. The sequence homology filter uses a score based on the Dayhoff matrices,30 also called PAM (Point Accepted Mutation). These matrices are a representative model of the mutation of amino acids during evolution. The scores in these matrices allow the comparison of the 20 amino acids among themselves. The higher the score between two amino acids, the more they resemble one another chemically, and the more likely they are to be interchanged during evolution. In our case, the matrix PAM256 seemed to fulfill qualitatively our requirements in terms of sequence similarity. The reasons for choosing the PAM256 matrix are that the larger numbers are closer to the equilibrated transition matrix (which represents several generations) and that, for comparing structure, the history is not important. This sequence homology score for a triplet is given in eq. (10).

$$\mathrm{Score}_1 = P_{11} + P_{22} + P_{33} \tag{10}$$

$P_{11}$ is the probability of mutation (given by the Dayhoff matrix) between the first amino acid of the target triplet and the first amino acid of the listed triplet. $P_{22}$ and $P_{33}$ are the probabilities of mutation for the amino acids in positions 2 and 3, respectively, in the triplet.

We built the second filter based on structural homology. RMS deviations served as a similarity measure. Weighting all the atoms of the main chain equally (in our case α-carbon, β-carbon, nitrogen, hydrogen, carbonyl, and oxygen) allows us to model the rotation of the ends better than with the α-carbons alone. With this measure it was possible to compare five types of neighbors on each side of the triplet. The comparison of all the possibilities forms a symmetric 25 × 25 matrix. The size of this matrix justifies the number of residues considered in the neighbors. With three residues on each side, the matrix increased considerably, and it would treat too many subclasses. One residue on each side of the triplet did not give enough information on the neighbors. To construct the matrix, we calculated the RMS deviation on each triplet pair in the list and sorted them into the appropriate category. Because our list of triplets contained 184,661 entries with their neighbors, a little more than 30 billion pairs of structures were evaluated. This represents a good statistical sampling. The matrix can be seen in Figure 11. Here, a high score in our matrix means very little resemblance between two neighbor classes. The structural homology score is shown in eq. (11).

$$\mathrm{Score}_2 = R_{11} + R_{22} \tag{11}$$

$R_{11}$ is the RMS value for the comparison between the preceding neighbor of the template triplet and the preceding neighbor of the listed triplet. $R_{22}$ is the RMS value for the comparison between the following neighbor of the template triplet and the following neighbor of the listed triplet. By looking at the matrix, an RMS value of 1.77 Å corresponds to the least similar pairs of classes, which are 17–12 (triplets surrounded by 75% of β-sheets) and 17–14 (triplets surrounded by α-helix on one side and β-sheet on the other side). At the other extreme, the most similar classes are those compared with themselves, for which the RMS measures range from 1.19 to 1.24 Å. This matrix is a good starting point to differentiate between desirable and undesirable neighbor substitutions. In summary, the major change to the algorithm is the inclusion of a new natural triplet geometry distribution instead of the randomly generated distribution. Two scores were calculated for each triplet included in the loops. These scores will be used to create an
Figure 11. Similarity matrix based on the RMS for the different classes of neighbors, represented in a grey scale format.
Table 1. Energies and RMS for Three Protein Predictions Without the Natural Triplets.

Protein   Best energy   Mean energy   Best RMS (Å)   Mean RMS (Å)
3chy      -118.82       -73.57        9.78           13.52
1mbo      -25.66        8.45          8.59           12.37
1aba      -92.99        -69.06        8.33           11.35
Simulation and Results
Figure 12. Routine in pseudo-code to accept or reject natural triplet geometry into the new optimized list of natural triplet geometries.
optimized list of natural triplet geometries for a given protein. Two filter cutoff values need to be entered as initial parameters before the simulation begins. The first filter cutoff is the acceptance criterion for the sequence homology score. If the score is higher than the cutoff, the triplet is close enough in sequence to the target triplet. The second filter cutoff is the acceptance criterion for the structural homology score. If the score is lower than this cutoff, the triplet is close enough in terms of structural neighbors to the target triplet. If both conditions are satisfied, the natural triplet geometry is accepted and included in the new optimized list of triplets. Figure 12 illustrates the subroutine (in pseudo-code) for acceptance of triplets in the new optimized list. Figure 13 shows the modified hierarchical method that replaced the one presented in Figure 2.
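The two-cutoff acceptance test described above can be sketched as follows. This is an illustration only, not the routine of Figure 12: the dictionary layout, the function names, the substitution scores, and the neighbor RMS values are all toy placeholders invented for the example (they are not PAM256 entries or values from Figure 11), and looking up $R_{11}$ and $R_{22}$ per side in a single table is an assumption.

```python
def sequence_score(target_seq, candidate_seq, subst):
    """Score1 of eq. (10): sum of substitution scores, position by position."""
    return sum(subst[a][b] for a, b in zip(target_seq, candidate_seq))


def structure_score(target, candidate, neighbor_rms):
    """Score2 of eq. (11): R11 (preceding neighbors) + R22 (following neighbors)."""
    r11 = neighbor_rms[frozenset((target["before"], candidate["before"]))]
    r22 = neighbor_rms[frozenset((target["after"], candidate["after"]))]
    return r11 + r22


def filter_triplets(target, library, subst, neighbor_rms, seq_cutoff, struct_cutoff):
    """Keep candidates with Score1 above seq_cutoff and Score2 below struct_cutoff."""
    kept = []
    for cand in library:
        if (sequence_score(target["seq"], cand["seq"], subst) > seq_cutoff
                and structure_score(target, cand, neighbor_rms) < struct_cutoff):
            kept.append(cand)
    return kept


# Toy data: hypothetical scores for three residue types and two neighbor types.
SUBST = {"A": {"A": 2, "G": 1, "W": -6},
         "G": {"A": 1, "G": 5, "W": -7},
         "W": {"A": -6, "G": -7, "W": 17}}
NEIGHBOR_RMS = {frozenset(("loop-loop",)): 1.2,
                frozenset(("helix-loop",)): 1.2,
                frozenset(("loop-loop", "helix-loop")): 1.6}

target = {"seq": "AGA", "before": "loop-loop", "after": "loop-loop"}
library = [
    {"seq": "AGA", "before": "loop-loop", "after": "loop-loop"},   # passes both filters
    {"seq": "GGA", "before": "helix-loop", "after": "loop-loop"},  # fails the structural filter
    {"seq": "WGW", "before": "loop-loop", "after": "loop-loop"},   # fails the sequence filter
]
kept = filter_triplets(target, library, SUBST, NEIGHBOR_RMS,
                       seq_cutoff=5, struct_cutoff=2.5)
```

Setting `struct_cutoff` to a very large value reduces this to a sequence-only filter, which corresponds to optimizing the sequence homology cutoff alone.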
This method was tested on three different proteins: Myoglobin (1mbo),31 in which we consider 146 residues and 8 α-helices; CheY (3chy),32 in which we consider 121 residues, 5 α-helices, and 5 β-sheets; and finally, Glutaredoxin (1aba),33 in which we consider 86 residues, 3 α-helices, and 4 β-sheets. With these three proteins, we covered different arrangements of secondary structure and different sizes. There was no protein containing only β-sheets because the statistical potential we use does not properly treat short-range interactions.18 To optimize the sequence homology filter, several simulations were performed with different cutoff combinations as initial parameters. The cutoff combination giving the best improvement in energy and/or RMS was taken as optimal for a given protein. Trip (our simulation program) is written in Fortran 77 and runs on a single processor at a time. Simulations were run on SGI Origin 10000 and SGI Origin 12000 processors. Depending on the protein length and the initial parameters, 50-iteration simulations took between 35 and 75 h of CPU time. Table 1 shows information for reference simulations (with the original distribution of triplet geometries) of the three chosen proteins. Each simulation was run three times to ensure reproducibility in the results. The mean energy and the mean RMS come from all three simulations for the same protein. The RMS of a given generated protein is calculated with respect to the target protein (taken from the structure available in the Protein Data Bank). Note that the energy carries no units because it has no direct physical meaning. Table 2 shows the simulation results using the natural distribution of triplet geometries with the sequence homology filter optimized. There is a significant difference in the energy and the RMS values. Finally, Table 3 shows the simulation results using the natural distribution of triplet geometries with both cutoffs optimized.
Table 2. Energies and RMS for Three Protein Predictions Using the Natural Triplets Filtered with the Optimized Cutoff of Sequence Homology.

Protein   Best energy   Mean energy   Best RMS (Å)   Mean RMS (Å)
3chy      -140.54       -101.66       7.47           10.71
1mbo      -34.98        0.84          6.41           11.78
1aba      -114.76       -80.90        6.30           8.64

Figure 13. Modified hierarchical procedure to include the natural geometry of triplets. The sequence and structure selection algorithm is the routine presented in Figure 12.
Table 3. Energies and RMS for Three Protein Predictions Using the Natural Triplets Filtered with Optimized Cutoff for Both Sequence and Structural Homology.

Protein   Best energy   Mean energy   Best RMS (Å)   Mean RMS (Å)
3chy      -140.54       -101.66       6.27           10.71
1mbo      -34.98        0.84          6.41           10.80
1aba      -114.76       -80.90        6.30           8.64
Clearly, the addition of an optimized sequence homology filter improved the results. The energies decreased by approximately 20 units in certain cases, and the mean RMS decreased by approximately 24% in the case of 1aba. However, the addition of a structural filter helped little: only the best RMS of 3chy and the mean RMS of 1mbo decreased, by approximately 1 Å, and the energies were not affected. From these results it can be seen that the natural distribution helps the Monte Carlo algorithm to lead the simulation towards more realistic 3D protein structures. The selection of triplet geometries based on their sequence to form an optimized list of triplet geometries represents a major improvement for our algorithm. Nonetheless, structural homology filters may improve the prediction for some proteins, such as 3chy and 1mbo.
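The figures quoted in this paragraph can be recomputed directly from Tables 1 to 3. The short check below is only a worked arithmetic example; the dictionary layout and the helper name are assumptions made for illustration.

```python
# (best energy, mean energy, best RMS, mean RMS) per protein, copied from Tables 1-3.
TABLE1 = {"3chy": (-118.82, -73.57, 9.78, 13.52),
          "1mbo": (-25.66, 8.45, 8.59, 12.37),
          "1aba": (-92.99, -69.06, 8.33, 11.35)}
TABLE2 = {"3chy": (-140.54, -101.66, 7.47, 10.71),
          "1mbo": (-34.98, 0.84, 6.41, 11.78),
          "1aba": (-114.76, -80.90, 6.30, 8.64)}
TABLE3 = {"3chy": (-140.54, -101.66, 6.27, 10.71),
          "1mbo": (-34.98, 0.84, 6.41, 10.80),
          "1aba": (-114.76, -80.90, 6.30, 8.64)}


def percent_decrease(before, after):
    """Relative decrease from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


# Sequence filter (Table 1 -> Table 2).
energy_gain_3chy = TABLE2["3chy"][0] - TABLE1["3chy"][0]                    # about -21.7 units
mean_rms_gain_1aba = percent_decrease(TABLE1["1aba"][3], TABLE2["1aba"][3])  # about 23.9 %

# Structural filter added (Table 2 -> Table 3).
best_rms_gain_3chy = TABLE2["3chy"][2] - TABLE3["3chy"][2]                  # about 1.20 Å
mean_rms_gain_1mbo = TABLE2["1mbo"][3] - TABLE3["1mbo"][3]                  # about 0.98 Å
energies_unchanged = all(TABLE2[p][:2] == TABLE3[p][:2] for p in TABLE2)
```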
Table 5. Different Triplet Environments.

Code   Neighbors preceding the triplet   Neighbors following the triplet
1      loop-loop                         loop-loop
2      helix-loop                        loop-loop
3      sheet-loop                        loop-loop
4      helix-helix                       loop-loop
5      sheet-sheet                       loop-loop
6      loop-loop                         loop-helix
7      helix-loop                        loop-helix
8      sheet-loop                        loop-helix
9      helix-helix                       loop-helix
10     sheet-sheet                       loop-helix
11     loop-loop                         loop-sheet
12     helix-loop                        loop-sheet
13     sheet-loop                        loop-sheet
14     helix-helix                       loop-sheet
15     sheet-sheet                       loop-sheet
16     loop-loop                         helix-helix
17     helix-loop                        helix-helix
18     sheet-loop                        helix-helix
19     helix-helix                       helix-helix
20     sheet-sheet                       helix-helix
21     loop-loop                         sheet-sheet
22     helix-loop                        sheet-sheet
23     sheet-loop                        sheet-sheet
24     helix-helix                       sheet-sheet
25     sheet-sheet                       sheet-sheet
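The layout of Table 5 implies a simple arithmetic rule for the class code: the preceding-neighbor type cycles fastest and the following-neighbor type changes every five codes. The sketch below encodes that rule; the function and list names are hypothetical.

```python
# Neighbor types in the order in which they cycle through Table 5.
PRECEDING = ["loop-loop", "helix-loop", "sheet-loop", "helix-helix", "sheet-sheet"]
FOLLOWING = ["loop-loop", "loop-helix", "loop-sheet", "helix-helix", "sheet-sheet"]


def neighbor_code(preceding, following):
    """Return the Table 5 code (1..25) for a pair of neighbor types."""
    return 1 + PRECEDING.index(preceding) + 5 * FOLLOWING.index(following)


def neighbor_types(code):
    """Inverse lookup: return (preceding, following) for a Table 5 code."""
    if not 1 <= code <= 25:
        raise ValueError("neighbor class codes lie between 1 and 25")
    following_idx, preceding_idx = divmod(code - 1, 5)
    return PRECEDING[preceding_idx], FOLLOWING[following_idx]
```

For example, code 17 decodes to a helix-loop context before the triplet and a helix-helix context after it.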
Table 4. Amino Acids and Their Corresponding Codes.

Code   Corresponding amino acid
1      Alanine
2      Arginine
3      Asparagine
4      Aspartic Acid
5      Cysteine
6      Glutamine
7      Glutamic Acid
8      Glycine
9      Histidine
10     Isoleucine
11     Leucine
12     Lysine
13     Methionine
14     Phenylalanine
15     Proline
16     Serine
17     Threonine
18     Tryptophan
19     Tyrosine
20     Valine

Conclusion

The objective of this article was to evaluate the effect of using a natural distribution of triplet geometries instead of a randomly generated distribution. To classify the new distribution, the sequence, the neighbors, and the internal coordinates were considered. To enhance the quality of the distribution used in the simulation, a sequence homology filter (PAM256) and a structural homology filter were applied. The sequence homology filter improves both the energy and RMS of output structures. The addition of a structural filter had a less significant effect, improving the RMS in isolated cases only. Overall, the new natural distribution of triplet geometries contributed to the enhancement of the quality of the structures generated.
Acknowledgments

The authors would like to thank the Université de Montréal and CERCA for material support. The authors would also like to thank Benoit Cromp for useful discussions and Dr. Ronny Priefer for some revisions in this manuscript.
References

1. Kolinski, A.; Skolnick, J. Proteins 1994, 18, 338.
2. Kolinski, A.; Skolnick, J. Proteins 1994, 18, 353.
3. Onuchic, J. N.; Luthey-Schulten, Z.; Wolynes, P. G. Annu Rev Phys Chem 1997, 48, 545.
4. Li, Z.; Scheraga, H. A. Proc Natl Acad Sci USA 1987, 84, 6611.
5. Scheraga, H. A. Biophys Chem 1996, 59, 329.
6. Hansmann, U. H. R.; Okamoto, Y. Phys A 1994, 212, 415.
7. Amara, P.; Staub, J. F. J Phys Chem 1995, 99, 14840.
8. Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P. Science 1983, 220, 671.
9. Gomathi, L.; Subramanian, S. Curr Sci 1996, 71, 553.
10. Johnson, W. C. Annu Rev Biophys Biophys Chem 1988, 17, 145.
11. Wüthrich, K. Science 1989, 243, 45.
12. Amos, L. A.; Henderson, R.; Unwin, P. N. Prog Biophys Mol Biol 1982, 39, 183.
13. McPherson, A. The Preparation and Analysis of Protein Crystals; Wiley: New York, 1982.
14. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res 2000, 28, 235.
15. Koehl, P.; Levitt, M. Curr Opin Struct Biol 1999, 19, 155.
16. L’Heureux, P.-J.; Cromp, B.; Martineau, E.; Gunn, J. R. Adv Chem Phys 2002, 120, 193.
17. Gunn, J. R.; Monge, A.; Friesner, R. A.; Marshall, C. H. J Phys Chem 1994, 98, 702.
18. Monge, A.; Lathrop, J. P.; Gunn, J. R.; Shenkin, P. S.; Friesner, R. A. J Mol Biol 1995, 247, 995.
19. Friesner, R. A.; Gunn, J. R. Annu Rev Biophys Biomol Struct 1996, 25, 315.
20. Gunn, J. R. J Chem Phys 1997, 106, 4270.
21. Ramachandran, G. N.; Sasisekharan, V. Adv Protein Chem 1968, 23, 284.
22. Cohen, F. E.; Richmond, T. J.; Richards, F. M. J Mol Biol 1980, 137, 9.
23. Casari, G.; Sippl, M. J. J Mol Biol 1992, 224, 725.
24. Metropolis, N. J Chem Phys 1953, 96, 768.
25. Yeomans, B. Statistical Mechanics of Phase Transitions; Oxford Press: New York, 1992; Chapt 7.
26. Cohen, F. E.; Sternberg, M. J. J Mol Biol 1980, 138, 321.
27. Sauder, J. M.; Arthur, J. W.; Dunbrack, R. L. Proteins Struct Funct Genet 2000, 40, 6.
28. Karlin, S.; Altschul, S. F. Proc Natl Acad Sci USA 1990, 87, 2264.
29. Karlin, S.; Altschul, S. F. Proc Natl Acad Sci USA 1993, 90, 5873.
30. Dayhoff, M. O. Atlas of Protein Sequence and Structure; National Biomedical Research Foundation: Maryland, 1978.
31. Phillips, S. E. J Mol Biol 1980, 142, 531.
32. Volz, K.; Matsumura, P. J Biol Chem 1991, 266, 15511.
33. Eklund, H.; Ingelman, M.; Soderberg, B. O.; Uhlin, T.; Nordlund, P.; Nikkola, M.; Sonnerstam, U.; Joelson, T.; Petratos, K. J Mol Biol 1992, 228, 596.