Authorized licensed use limited to: Old Dominion University. Downloaded on July 29,2010 at 03:53:18 UTC from IEEE Xplore. Restrictions apply.
Integrating Multiple Scoring Functions to Improve Protein Loop Structure Conformation Space Sampling

Yaohang Li, Ionel Rata, and Eric Jakobsson

Abstract: In this article, we present a new protein structure modeling approach based on multi-scoring functions sampling. The rationale is to integrate multiple carefully selected physics- or knowledge-based scoring functions so as to tolerate the insensitivity and inaccuracy of any individual scoring function and thereby improve protein structure modeling accuracy. We apply the multi-scoring functions sampling approach to protein loop backbone structure modeling. Our computational results show that sampling the scoring function space of a physics-based soft-sphere potential function and a knowledge-based scoring function based on pairwise atom distances improves the resolution of the predicted decoy populations in a set of 12-residue benchmark loop targets.

I. INTRODUCTION

Accurately modeling protein or protein complex structures from protein sequences (ab initio protein modeling) is considered one of the most significant grand challenges, with broad economic and scientific impact. The theoretical foundation of ab initio protein modeling is Anfinsen’s thermodynamic hypothesis [1], which states that the native conformation is a unique, stable, and kinetically accessible minimum of the protein energy. Based on this hypothesis, current ab initio protein modeling efforts focus on globally optimizing a scoring function describing the protein energy to obtain models close to the native structures. In practice, due to the difficulty of accurately calculating the energy of large protein or protein complex molecules, many scoring functions have been developed to approximate the protein energy. For example, physics-based scoring functions are designed to estimate various physical interactions; knowledge-based scoring functions are derived from the statistics of the experimentally determined conformations in the Protein Data Bank (PDB); and regression-based scoring functions are developed to combine various physics- or knowledge-based

Yaohang Li is with the Department of Computer Science, North Carolina A&T State University, Greensboro, 27411 USA (phone: 336-3347245; fax: 336-334-7244; e-mail: [email protected]). Ionel Rata is with the Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL 61801, USA (email: [email protected]). Eric Jakobsson is with the Department of Molecular and Integrative Physiology and National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign Urbana-Champaign, IL 61801, USA (e-mail: [email protected]).

scoring terms using the regression method. Moreover, simplification techniques that reduce the protein representation are often used to “soften” or “simplify” the scoring function landscape to facilitate the global optimization process. These approximations or simplifications in scoring functions usually cause a certain level of insensitivity or inaccuracy, which may lead to incorrect protein models.

In this article, we put forward a new approach for protein modeling by sampling multiple scoring functions with potentially different resolutions. The multi-scoring functions sampling approach can tolerate the insensitivity of an individual scoring function and thus increases the chance of discovering native-like, good conformations. We apply the multi-scoring functions sampling approach to protein loop backbone structure modeling by developing a population-based evolutionary algorithm that explores loop conformations using a scoring function based on pairwise atom distances and a soft-sphere potential function. Our computational results show an RMSD (Root Mean Square Deviation) improvement in the decoys produced by the multi-scoring functions sampling algorithm, which agrees with our theoretical analysis.

The rest of the article is organized as follows. Section II analyzes the protein modeling scoring functions and their potential inaccuracy. Section III illustrates the insensitivity problem in current scoring functions. Sections IV and V provide the theoretical analysis of the multi-scoring functions sampling approach in protein modeling. Section VI describes the implementation of the population-based evolutionary multi-scoring functions sampling algorithm for protein loop structure modeling, and Section VII presents the computational results. Finally, Section VIII summarizes our conclusions and future research directions.

II. PROTEIN STRUCTURE MODELING SCORING FUNCTIONS AND THEIR POTENTIAL INACCURACY AND INSENSITIVITY

A. Physics-based Scoring Functions

Ideally, the protein energy would be evaluated with quantum mechanics, in which case the energy function could report the true energy. In computational practice, quantum mechanics is intractable because of the size of protein molecules. As a compromise, artificial scoring functions (force fields) are developed based on classical physics to

978-1-4244-6766-2/10/$26.00 © IEEE

approximate the true energy of molecular systems, such as CHARMM [2], AMBER [3], OPLS [4], and GROMOS [5]. Moreover, physics-based scoring functions usually yield a rugged scoring function (energy) landscape, which is difficult to search. Simplification techniques such as reducing the protein representation are often used to design a less rugged scoring function and facilitate the optimization process. These simplifications may also decrease the resolution of the physics-based scoring functions.

B. Knowledge-based Scoring Functions

Knowledge-based approaches evaluate the increasing number of experimentally determined conformations by statistical means to extract rules on preferred configurations and combinations. These rules are converted into “pseudo-potential” scoring functions for protein modeling. Compared with physics-based scoring functions, knowledge-based potentials tend to be “softer” and tolerate structural imperfection, allowing better handling of the uncertainties and deficiencies of computer-generated models [6]. The main problems of knowledge-based scoring functions stem from their theoretical assumptions. Knowledge-based scoring functions derive their rules from experimental data, typically by applying the inverse Boltzmann law, which rests on two important assumptions: 1) the set of known conformations is representative of proteins or protein complexes in general; and 2) the observed frequencies are independent of each other. The first assumption is questionable for most knowledge-based scoring functions because the known conformations are an extremely small fraction compared to the unknown ones [7, 8]. As for the second, Thomas et al. [9] and Kocher et al. [10] argued that inter-residue interactions are not independent. Consequently, both issues may contribute to inaccuracy or insensitivity in knowledge-based scoring functions.

C. Regression-based (Weighted-Sum) Scoring Functions

To take advantage of both knowledge- and physics-based scores, a popular approach is to build a linearly combined scoring function by adding up various empirical or physical terms with a particular set of weights. The regression-based scoring function is also called the weighted-sum scoring function. The weights are usually assigned by a regression method [11, 12], i.e., by fitting predicted and experimentally determined models to a given set of training complexes. A major drawback of the regression method is its dependence on the size, composition, and generality of the training set used to derive the weights. Moreover, it is questionable whether the weights are constants or whether the combinations are linear in general. An even more serious consequence of adding up scoring functions is the possible enlargement of the overall insensitivity regions and the generation of more local minima.

III. SENSITIVITY OF SCORING FUNCTIONS

Currently, many existing physics-, knowledge-, or regression-based scoring functions for protein modeling have a certain level of accuracy and can distinguish a far-deviated, erroneous conformation from a native-like, good conformation. When coupled with an effective optimization algorithm, a scoring function can usually drive the sampling process to conformations relatively close to the native conformation. However, because the existing scoring functions are approximations of the true energy function, their sensitivity decreases when the score is low enough. That is, in low-score regions, a conformation with a relatively higher score may in fact be a more reasonable structure than one with a lower score. In practice, the native conformation usually does not exhibit the lowest score when it is placed among the decoys generated by computer simulation [13]. Figure 1 gives a conceptual illustration of the scoring function insensitivity problem.

Figure 1: A Conceptual Illustration of Scoring Function Sensitivity (vertical axis: Score; labeled items: Erroneous Conformations, Native-like Conformation, Conformation with the lowest score, Native Conformation, Insensitive Region)

Figure 2 illustrates an example of scoring function insensitivity for the protein loop decoys of 1kuh(90:101) provided in Jacobson’s loop decoys library [14], which were generated by comparative modeling. It shows the score-RMSD plots of the 1,081 decoys using the Rosetta [15] full-atom scoring function, the dDFIRE (dipolar Distance-scaled, Finite-Ideal gas Reference) [16] scoring function, and the DOPE (Discrete Optimized Protein Energy) [17] scoring function. The scores are obtained by evaluating the loop decoys together with the rest of the protein. In each of these scoring functions, a small portion of the decoys shows lower scores than the native conformation. Moreover, neither the native conformation nor the native-like decoys (within 0.5 Å RMSD of the native conformation) yield the lowest scores in these scoring functions. This phenomenon can also be found in many other targets listed in Jacobson’s loop decoys library.

Figure 3 shows the interpolated surface of the 1,081 decoys of 1kuh(90:101), represented by points within a function space composed of the Rosetta, dDFIRE, and DOPE scoring functions. The color of the surface represents the RMSD of the corresponding decoy. Although an individual scoring function may exhibit a certain insensitivity, decoys that are very close to the native conformation can be found in the common low score area of the scoring functions, as indicated in Figure 3. This strongly suggests that sampling the common low score areas of multiple scoring functions may improve the chance of reaching conformations of good quality and thereby improve the resolution of the predicted models.

Figure 2: Scoring functions analysis for the 1,081 loop decoys of residues 90-101 in 1kuh, with decoys and the native conformation marked. (a) Plot of Rosetta score vs. RMSD (Å) (b) Plot of dDFIRE score vs. RMSD (Å) (c) Plot of DOPE score vs. RMSD (Å)

Figure 3: Common low score area reveals decoys of good quality in 1kuh(90:101) (axes: Rosetta score, dDFIRE score, DOPE score; surface color: RMSD in Å)
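The idea of a common low score area can be illustrated with a minimal sketch. It assumes that "low" means ranking within a chosen lowest fraction of the decoys for every scoring function; that criterion, the function name `common_low_score`, and the toy score triples are illustrative assumptions and are not taken from the 1kuh data.

```python
def common_low_score(scores, fraction=0.25):
    """Return indices of decoys ranked in the lowest `fraction` of EVERY scoring function.

    `scores[i][j]` is the score of decoy i under scoring function j (lower is better).
    The rank-fraction criterion is an assumption made for this sketch.
    """
    n = len(scores)
    n_funcs = len(scores[0])
    keep = max(1, int(n * fraction))
    common = set(range(n))
    for j in range(n_funcs):
        ranked = sorted(range(n), key=lambda i: scores[i][j])
        common &= set(ranked[:keep])  # intersect the low-score sets
    return sorted(common)


# Hypothetical toy scores for three scoring functions; not values from the paper.
toy_scores = [
    (-400.0, -287.0, -13400.0),  # decoy 0: low in all three
    (-405.0, -277.0, -12900.0),  # decoy 1: lowest in the first only
    (-398.0, -286.0, -13350.0),  # decoy 2: low in all three
    (-350.0, -288.0, -13000.0),  # decoy 3: lowest in the second only
    (-345.0, -279.0, -13450.0),  # decoy 4: lowest in the third only
    (-360.0, -280.0, -13100.0),  # decoy 5: mediocre everywhere
]
selected = common_low_score(toy_scores, fraction=0.5)
print(selected)
```

In this toy set, the decoy with the single lowest score in each function is excluded, while the two decoys that score well under all three functions are retained.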

IV. SAMPLING MULTIPLE SCORING FUNCTIONS

Due to the scoring function insensitivity problem, an optimization process that looks for the absolute global minimum of a scoring function may be misled to undesired conformations and ignore good ones with low but not the lowest scores. To handle this problem, we advocate a new protein structure modeling approach based on sampling multiple physics- or knowledge-based scoring functions instead of globally optimizing an individual scoring function. This is based on our assumption that when multiple good scoring functions are present, the correct, native-like conformations should satisfy most of them by yielding low scores, while an incorrect conformation usually yields low scores in some but not all scoring functions. This assumption is reasonable because most current “good” scoring functions for protein structure modeling do show a certain level of accuracy. Efficiently sampling multiple scoring functions can increase the chance of reaching native-like, good conformation candidates, which will eventually lead to improved protein modeling accuracy.

In multi-scoring functions sampling, the scoring functions s1(x), s2(x), …, sn(x) involved form a scoring function space, S(x) = [s1(x), s2(x), …, sn(x)], where x ∈ C is a feasible protein conformation. A conformation x* ∈ C is Pareto optimal if and only if there is no x ∈ C such that si(x) ≤ si(x*) for all i ∈ {1, 2, ..., n}, with at least one strict inequality. The set of x* forms an optimal surface called the Pareto optimal front [18], which includes conformations at the global minimum of each individual scoring function as well as non-dominated conformations in different combinations of scores. Unlike global optimization, which looks for a conformation $x_{\min}^{s_i} = \arg\min_{x \in C} s_i(x)$ in a single scoring function si(x), the goal of multi-scoring functions sampling in protein modeling is to explore the diverse conformations close to the solution set x* at the Pareto optimal front, which are optimal in various combinations of the involved scoring functions. As a result, compared to globally optimizing a single scoring function, efficiently sampling multiple scoring functions to explore the conformations at or near the Pareto optimal front leads to broader exploration of the protein conformation space, covering not only the conformations that best satisfy an individual scoring function but also those that satisfy most scoring functions by yielding low scores. This tolerates insensitivity in an individual scoring function and thus increases the chance of discovering native-like, good conformations.

The protein energy function has a well-known rugged, “funnel-like” landscape with a large number of deep local minima hierarchically disposed [19]. In protein modeling using global optimization, an optimization process may be trapped in a deep local minimum and be unable to escape in practical time. This phenomenon is referred to as the “waiting-time dilemma” [20]. Compared to global optimization, multi-scoring functions sampling has a faster barrier-crossing capability. This is because, when multiple scoring functions are involved, a deep local minimum in one scoring function may not be a local minimum in another. When a population-based operation such as genetic-algorithm-style crossover or replica exchange [21] is applied, a conformation in a deep local minimum of one scoring function may be switched to another scoring function where it is no longer at a local minimum. This increases the chance of jumping out of deep local minima in a scoring function. Figure 4 depicts a scenario in which a replica exchange between two scoring functions helps escape deep local minima. Taking advantage of multiple available scoring functions to help the sampling process escape the deep local minima of one scoring function is also discussed in [27].
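The Pareto optimality condition defined in this section can be made concrete with a small sketch. It treats each conformation as a tuple of scores (lower is better) and uses a brute-force pairwise check; the function names and the toy score vectors are illustrative assumptions, and a practical implementation would need something more efficient for large populations.

```python
def dominates(a, b):
    """True if score vector a is no worse than b in every score and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(score_vectors):
    """Indices of the score vectors that are not dominated by any other vector."""
    return [
        i
        for i, a in enumerate(score_vectors)
        if not any(dominates(b, a) for j, b in enumerate(score_vectors) if j != i)
    ]


# Hypothetical (s1, s2) scores of five conformations.
toy = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
front = pareto_front(toy)
print(front)
```

The resulting front contains the minimizer of s1 (index 0), the minimizer of s2 (index 2), and a compromise conformation (index 1) that is lowest in neither score, which is the kind of candidate single-function global optimization would overlook.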

Figure 4: Replica exchange helps simulation process to escape deep local minima

Compared to optimizing a weighted-sum scoring function that combines multiple scoring terms, the multi-scoring functions sampling does not require estimation of weights. More importantly, multi-scoring functions sampling has potentially broader exploration of the protein loop conformation space with low scores than weighted-sum scoring function optimization. This is due to the fact that the weighted sum approach cannot find certain Pareto optimal conformations in the case of a concave Pareto optimal front [18]. The minor drawback of multi-scoring functions sampling is its computational cost, which is higher than optimizing a single scoring function due to the requirement of evaluating multiple scoring functions.
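The limitation of the weighted-sum approach on a concave Pareto optimal front can be checked numerically with a toy example. The three points below are hypothetical, mutually non-dominated (s1, s2) pairs in which the middle point lies on a concave part of the front; the sweep over normalized weights (w1 + w2 = 1) is an illustrative choice, not the authors' procedure.

```python
def weighted_sum_minimizer(points, w1, w2):
    """Index of the point that minimizes w1*s1 + w2*s2."""
    return min(range(len(points)), key=lambda i: w1 * points[i][0] + w2 * points[i][1])


# Hypothetical Pareto-optimal points; (0.6, 0.6) lies above the segment
# joining the other two, i.e. in a concave region of the front.
front_points = [(0.0, 1.0), (0.6, 0.6), (1.0, 0.0)]

found = set()
for k in range(101):
    w1 = k / 100.0  # sweep w1 over [0, 1], with w2 = 1 - w1
    found.add(weighted_sum_minimizer(front_points, w1, 1.0 - w1))

print(sorted(found))
```

For every weight pair, the weighted sum of the middle point is 0.6, while the smaller of the two end-point sums is at most 0.5, so the middle point is never selected even though it is Pareto optimal.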


V. MULTI-SCORING FUNCTIONS SAMPLING VS. SAMPLING A WEIGHTED-SUM SCORING FUNCTION

Optimizing a weighted-sum scoring (energy) function is a popular approach in protein structure modeling to combine multiple scores. The weighted-sum scoring function is built by linearly combining various knowledge- or physics-based scoring terms with a particular set of weights, which are usually derived by regression methods. Our analysis shows that, theoretically, multi-scoring functions sampling has major advantages over weighted-sum scoring function optimization.

Figure 5: Scenario of a concave Pareto optimal front where a weighted-sum approach will fail to find some Pareto-optimal solutions

First of all, multi-scoring functions sampling has potentially broader exploration of the protein loop conformation space with low scores than weighted-sum scoring function optimization. This is because the weighted-sum approach cannot find certain Pareto optimal conformations in the case of a concave Pareto optimal front [18]. Figure 5 shows a conceptual scenario of a concave search space of two hypothetical scoring functions S1 and S2. When a set of weights is selected, a contour line is formed, and the minimum solution of the weighted-sum function corresponds to a solution on the Pareto optimal front, which is a tangent point of the contour line and the solution space. However, there exists no contour line that can produce a tangent point with the feasible solution space in the region BC of the Pareto optimal front. This is because before a tangent point is reached in BC, the contour line becomes tangent at another point in AB or CD, which yields a lower weighted-sum function value. In other words, in weighted-sum function optimization, solutions in AB or CD attract the optimization process away from solutions in BC. The concave Pareto optimal front may still exist even if a nonlinear function is used to combine the various terms. In contrast, an efficient multi-scoring functions sampling approach can produce solutions at a concave Pareto optimal front, which leads to broader exploration of potentially good conformations in the protein conformation space.

Secondly, multi-scoring functions sampling can avoid problems due to overlapping scoring functions. Another problem of a weighted-sum scoring function is its potential over-counting of the same interaction scoring terms. This arises because terms used in the weighted-sum scoring function, particularly the statistical ones, may have a common component. Without loss of generality, let us consider two scoring functions S1 = s1 + s and S2 = s2 + s, where s is a common component and s1 and s2 are complementary components of S1 and S2, respectively. A weighted-sum approach using S1 and S2 will produce a scoring function S = w1S1 + w2S2 = w1s1 + w2s2 + (w1 + w2)s, where the common component s is over-counted. In contrast, multi-scoring functions sampling addresses the over-counting issue naturally. Sampling multiple scoring functions S1 and S2 leads to minimization of the common component s and sampling of the Pareto optimal front of s1 and s2, where s is not over-counted.

Finally, compared to weighted-sum scoring function optimization, multi-scoring functions sampling requires no weight determination and makes no assumption of linearly combined weights.

VI. POPULATION-BASED EVOLUTIONARY MULTI-SCORING FUNCTIONS SAMPLING ALGORITHM FOR PROTEIN LOOP STRUCTURE MODELING

To demonstrate the effectiveness of multi-scoring functions sampling, we develop a population-based evolutionary algorithm to sample the loop backbone conformation space. A statistical pairwise atom distance-based scoring function [22] (knowledge-based) and a soft-sphere potential function [23] (physics-based) are employed to form the scoring function space.
Initially, a population of N conformations, C1, …, CN, is randomly generated. Each loop structure conformation Ck with n residues is represented by a vector (θ1, …, θ2n), which corresponds to the backbone dihedral angles (φ1, ψ1, …, φn, ψn). For simplicity, the dihedral angles ωk are kept constant at their average value of 180°, and the bond angles and bond lengths also remain constant. The greedy-heuristic Cyclic Coordinate Descent (CCD) algorithm [24] is applied to each conformation to adjust the dihedral angles so that the loop closure condition [26] is satisfied. Scores for the multiple scoring functions si(.) are evaluated for each conformation. Then, the relationship of each conformation pair Cp and Cq is evaluated according to the dominance relation, which is generally defined as in [18] and below:

“A conformation Cp is said to (strongly) dominate another conformation Cq (Cp ≻ Cq) if both conditions i) and ii) are satisfied: i) si(Cp) ≤ si(Cq) holds for every scoring function si(.); ii) there is at least one scoring function sj(.) where sj(Cp) < sj(Cq) is satisfied.”

Let D(Ck) denote the number of conformations in the population that Ck dominates. It is easy to prove that D(Cp) > D(Cq) if Cp ≻ Cq. Then, the N conformations in the population can be ranked in decreasing order of D(Ck): a conformation dominating more conformations has a higher rank. The probability that a conformation Ck is selected for reproduction, P(Ck), is

$$P(C_k) = \frac{D(C_k)}{\sum_{i=1}^{N} D(C_i)}.$$
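The dominance count D(Ck) and the selection probability P(Ck) can be sketched as follows. Conformations are represented only by their score tuples; the function names and toy scores are illustrative assumptions. The uniform fallback for a population in which no conformation dominates any other is also an assumption, since the text does not specify that case.

```python
def dominates(a, b):
    """True if a is no worse than b in every score and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_counts(scores):
    """D(C_k): how many other conformations each conformation dominates."""
    n = len(scores)
    return [
        sum(1 for q in range(n) if q != p and dominates(scores[p], scores[q]))
        for p in range(n)
    ]


def selection_probabilities(counts):
    """P(C_k) = D(C_k) / sum_i D(C_i)."""
    total = sum(counts)
    if total == 0:
        # Assumption: the text does not cover this case; fall back to uniform selection.
        return [1.0 / len(counts)] * len(counts)
    return [d / total for d in counts]


# Hypothetical (s1, s2) scores of four conformations.
toy_scores = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (4.0, 4.0)]
counts = dominance_counts(toy_scores)
probs = selection_probabilities(counts)
print(counts, probs)
```

Note that a conformation with D(Ck) = 0 is never selected for reproduction under this formula, which is one reason an implementation may want to treat that case explicitly.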

We implement two methods to generate new conformations: 1) mutation: permuting randomly selected dihedral angles of an old conformation; and 2) crossover: swapping part of the corresponding dihedral angle vectors between two old conformations. Again, the CCD algorithm is applied to every newly generated conformation, adjusting its dihedral angles to guarantee that the loop closure condition is satisfied. The top-ranked M conformations and the newly generated N - M conformations form the new population and replace the old one. The above procedure is repeated until convergence is observed in the population. Convergence is estimated by evaluating the likelihood of conformations within a population. The likelihood of two conformations Cp and Cq, L(Cp, Cq), is measured by the RMSD value of the corresponding Cα atoms. Convergence in a population is reached if the M top-ranked conformations yield similar structures, i.e.,

$$\max_{1 \le p \le M,\; 1 \le q \le M,\; p \ne q} L(C_p, C_q) < \delta,$$

where δ is a threshold constant. The conformation with the highest rank is then output as a decoy. The pseudocode of the population-based diversified sampling algorithm is given below. The program is executed repeatedly with different initial conformations to produce a set of decoys.
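The convergence test can be sketched as a pairwise check over the top-ranked conformations. This sketch assumes each conformation is given as a list of Cα coordinates in a common frame and computes the RMSD without superposition; whether the original implementation superimposes structures first is not stated, so that choice, the function names, and the toy coordinates are assumptions.

```python
import math
from itertools import combinations


def rmsd(a, b):
    """Coordinate RMSD between two equal-length lists of (x, y, z) atoms, no superposition."""
    if len(a) != len(b) or not a:
        raise ValueError("conformations must be non-empty and of equal length")
    squared = sum((p[k] - q[k]) ** 2 for p, q in zip(a, b) for k in range(3))
    return math.sqrt(squared / len(a))


def has_converged(top_ranked, delta):
    """True if every pair among the top-ranked conformations has RMSD below delta.

    Requiring all pairs to be below delta is equivalent to max L(Cp, Cq) < delta.
    """
    return all(rmsd(a, b) < delta for a, b in combinations(top_ranked, 2))


# Hypothetical two-atom "conformations" used only to exercise the functions.
c1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
c2 = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
c3 = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]

print(has_converged([c1, c2], delta=0.5))
print(has_converged([c1, c2, c3], delta=0.5))
```

With a single top-ranked conformation (M = 1) there are no pairs to compare, so this sketch reports convergence immediately; a caller may want to require M ≥ 2.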

Initialize N conformations, C1, …, CN, randomly
For each conformation Ck
    Adjust dihedral angles of Ck to satisfy loop closure using CCD
Repeat {
    For each conformation Ck {
        For each scoring function si(.)
            Evaluate si(Ck)
        D(Ck) = 0
    }
    For each pair of conformations Cp and Cq {
        Evaluate the dominance relationship between Cp and Cq
        If Cp ≻ Cq, D(Cp)++
        If Cp ≺ Cq, D(Cq)++
    }
    Sort C1, …, CN so that D(C1) ≥ D(C2) ≥ … ≥ D(CN)
    Generate N - M new conformations, CM+1’, …, CN’
    For each conformation Ck’ in CM+1’, …, CN’
        Adjust dihedral angles of Ck’ to satisfy loop closure using CCD
    Keep C1, …, CM in the population and replace CM+1, …, CN with CM+1’, …, CN’
} Until convergence
Output the conformation with the highest rank as a decoy

VII. COMPUTATIONAL RESULTS IN PROTEIN LOOP STRUCTURE MODELING

Table 1 shows the best and average RMSD of the 1,000 decoys generated by multi-scoring functions sampling for each of the 17 12-residue targets listed in Jacobson’s loop decoys library [14], in comparison with those generated by optimizing the distance-based scoring function, the soft-sphere potential, and the weighted-sum scoring function. The weights of the weighted-sum scoring function are determined by a regression method based on the native loop structures in the benchmark library. Compared to optimizing a single scoring function or the weighted-sum scoring function, the multi-scoring functions sampling strategy yields, on average, a 0.16 Å to 0.41 Å shift toward the native in the RMSD distribution of its decoys. In the targets 1akz(181:192), 1ixh(160:171), 1dad(204:215), and 5nul(54:65), the RMSD shifts toward the native in multi-scoring functions sampling are greater than or close to 0.5 Å. There is no significantly adverse RMSD shift in these targets in multi-scoring functions sampling. More importantly, sampling multiple scoring functions enhances the chance of reaching decoys of good quality. Table 1 also shows the improved average RMSD of the best decoys generated by multi-scoring functions sampling.


Table 1: The best and average backbone RMSD (Å) of the 1,000 decoys generated by multi-scoring functions sampling, by optimizing each individual scoring function, and by optimizing the weighted-sum scoring function in 12-residue loops

                            Multi-Scoring     Soft-Sphere       Distance-based    Weighted-Sum
                            Functions         Potential         Scoring           Scoring
                            Sampling          Function          Function          Function
PDB    Start Res. End Res.  Best    Avg.      Best    Avg.      Best    Avg.      Best    Avg.
1akz   181        192       1.19    2.62      2.07    3.78      1.91    3.61      1.86    3.70
1ixh   160        171       1.51    3.93      2.80    4.76      2.79    4.54      2.37    4.59
1cex   40         51        1.78    3.77      2.28    4.10      2.27    3.84      2.28    3.99
5pti   36         47        1.61    3.25      1.66    3.86      1.84    3.54      1.69    3.65
1rge   57         68        1.00    2.61      1.71    3.52      1.40    2.90      1.29    2.97
1arb   74         85        1.37    2.66      1.46    2.92      1.50    2.62      1.32    2.65
7rsa   13         24        1.50    3.02      2.15    3.77      2.10    3.48      1.37    3.49
1xyz   813        824       1.51    3.67      1.60    3.52      1.69    3.34      1.38    3.42
1cyo   32         43        1.34    3.52      1.63    3.35      1.44    3.26      1.72    3.45
153l   98         109       1.79    3.32      2.07    3.82      2.25    3.49      2.05    3.55
1bkf   9          20        1.10    2.85      1.59    3.32      1.44    3.02      1.50    3.15
1dad   204        215       1.42    2.76      1.98    3.68      1.72    3.29      1.42    3.40
1dim   213        224       1.10    3.46      1.73    3.61      1.62    3.59      1.25    3.68
1kuh   90         101       1.21    2.55      1.64    2.95      1.19    2.30      1.29    2.60
2ayh   21         32        1.67    3.26      1.64    3.11      1.54    3.02      1.51    3.20
351c   15         26        1.91    4.30      1.37    3.72      1.61    3.96      1.57    4.22
5nul   54         65        2.14    4.38      2.39    5.13      3.22    4.87      2.81    5.14
Average                     1.48    3.29      1.87    3.70      1.85    3.45      1.69    3.58
Standard Deviation          0.31    0.59      0.38    0.57      0.53    0.63      0.45    0.65
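The reported 0.16 Å to 0.41 Å average shift can be reproduced from the "Avg." columns of Table 1. The sketch below copies those per-target averages and compares the column means; the assignment of the three comparison columns to the soft-sphere, distance-based, and weighted-sum methods follows the order of the column headings and should be read as an assumption, as should the variable names.

```python
# Per-target average RMSD values (in angstroms) copied from Table 1, in table row order.
multi_avg = [2.62, 3.93, 3.77, 3.25, 2.61, 2.66, 3.02, 3.67, 3.52,
             3.32, 2.85, 2.76, 3.46, 2.55, 3.26, 4.30, 4.38]
soft_sphere_avg = [3.78, 4.76, 4.10, 3.86, 3.52, 2.92, 3.77, 3.52, 3.35,
                   3.82, 3.32, 3.68, 3.61, 2.95, 3.11, 3.72, 5.13]
distance_avg = [3.61, 4.54, 3.84, 3.54, 2.90, 2.62, 3.48, 3.34, 3.26,
                3.49, 3.02, 3.29, 3.59, 2.30, 3.02, 3.96, 4.87]
weighted_sum_avg = [3.70, 4.59, 3.99, 3.65, 2.97, 2.65, 3.49, 3.42, 3.45,
                    3.55, 3.15, 3.40, 3.68, 2.60, 3.20, 4.22, 5.14]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Positive shift: the multi-scoring decoys are, on average, closer to the native.
shifts = {
    "soft_sphere": mean(soft_sphere_avg) - mean(multi_avg),
    "distance_based": mean(distance_avg) - mean(multi_avg),
    "weighted_sum": mean(weighted_sum_avg) - mean(multi_avg),
}
for name, shift in shifts.items():
    print(f"{name}: {shift:.2f}")
```

The smallest and largest of the three shifts correspond to the two ends of the range quoted in the text.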

Figure 6: RMSD distributions of the 1,000 decoys generated by multi-scoring functions sampling and by optimizing the distance-based scoring function, the soft-sphere potential, and the weighted-sum scoring function for the protein loop targets 1akz(181:192), 1rge(57:68), and 7rsa(13:24) (histograms of the number of decoys vs. RMSD in 0.5 Å bins from 0 Å to >5.0 Å)

Figure 6 shows the RMSD distributions of the 1,000 decoys generated by the various sampling/optimization methods for the loop targets 1akz(181:192), 1rge(57:68), and 7rsa(13:24). These plots show that when optimizing the soft-sphere potential function, the distance-based scoring function, or the weighted-sum scoring function, the “good”, native-like conformations (RMSD < 2.0 Å) are occasionally reached, but the optimization also produces a large population of “bad” decoys (RMSD > 4.0 Å). This is because the “bad” decoys yield low scores in a single scoring function due to the scoring function insensitivity problem. In contrast, with the multi-scoring functions sampling strategy, the number of “bad” decoys is greatly reduced or even eliminated, while the population of native-like, “good” decoys is significantly larger. Our computational results indicate that the multi-scoring functions sampling strategy can tolerate insensitivity in an individual scoring function, because the decoys it produces are expected to “satisfy” multiple scoring functions by yielding low scores. The increased “good” decoy population in multi-scoring functions sampling will enhance the chance of eventually generating high-resolution models in future clustering and structure refinement operations [25].
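The “good” and “bad” decoy populations discussed above can be tallied with a few lines of code. The thresholds (RMSD < 2.0 Å and RMSD > 4.0 Å) come from the text; the function name and the toy RMSD values are illustrative assumptions and are not decoy data from the benchmark.

```python
def decoy_quality_counts(rmsds, good_below=2.0, bad_above=4.0):
    """Count decoys that are "good" (RMSD < good_below) or "bad" (RMSD > bad_above)."""
    good = sum(1 for r in rmsds if r < good_below)
    bad = sum(1 for r in rmsds if r > bad_above)
    return {"good": good, "bad": bad, "other": len(rmsds) - good - bad}


# Hypothetical RMSD values in angstroms.
toy_rmsds = [0.8, 1.9, 2.0, 3.1, 4.0, 4.2, 5.5]
quality = decoy_quality_counts(toy_rmsds)
print(quality)
```

Decoys exactly at a threshold (2.0 Å or 4.0 Å here) fall into the "other" group because both inequalities in the text are strict.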

[4] [5] [6] [7] [8] [9] [10] [11]

VIII. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS In this paper, we advocate a new protein conformation sampling approach based on multi-scoring functions sampling to address the scoring function insensitivity problem. By carefully selecting multiple physics- or knowledge-based scoring functions, the multi-scoring functions sampling approach intends to broadly explore the potentially “good” conformations in the protein conformation space and tolerate insensitivity in a single scoring function. We apply the multi-scoring functions sampling approach to modeling the backbone structure of long protein loops. Our population-based evolutionary algorithm using a distance-based scoring function and a softsphere potential function has shown increased “good” loop decoy production compared to optimizing an individual or weighted-sum scoring function. Our future research directions include improving the efficiency and diversity of the multi-scoring functions sampling algorithms to sample conformations near/at the Pareto optimal front and applying the multi-scoring function strategy to more challenging protein modeling problems, such as ab initio protein folding, protein-ligand docking, and protein-protein interactions. ACKNOWLEDGEMENT The work is partially supported by NSF under grants of CCF-0829382 and CCF-0845702 and NCSA Summer Faculty Fellowship to Y. Li.

REFERENCES

[1] C. B. Anfinsen, “Principles that Govern the Folding of Protein Chains,” Science, 181:223-230, 1973.
[2] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, M. Karplus, “CHARMM: a program for macromolecular energy, minimization and dynamics calculations,” J. Comput. Chem. 4:187-217, 1983.
[3] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, “A second generation force-field for the simulation of proteins, nucleic acids, and organic molecules,” J. Am. Chem. Soc. 117:5179-97, 1995.
[4] W. Damm, A. Frontera, J. Tirado-Rives, W. L. Jorgensen, “OPLS All-Atom Force Field for Carbohydrates,” Journal of Computational Chemistry, 18(16):1955-1970, 1997.
[5] W. F. van Gunsteren, H. J. C. Berendsen, “Groningen Molecular Simulation (GROMOS) Library Manual,” Groningen, The Netherlands: BIOMOS, 1987.
[6] H. Gohlke, G. Klebe, “Statistical Potentials and Scoring Functions Applied to Protein-Ligand Binding,” Current Opinion in Structural Biology, 11(2):231-235, 2001.
[7] A. Godzik, “Knowledge-based potentials for protein folding: what can we learn from known protein sequences?” Structure, 4:363-366, 1996.
[8] B. A. Naim, “Statistical potentials extracted from protein structures: are these meaningful potentials?” J. Chem. Phys. 107:3698-3706, 1997.
[9] P. D. Thomas, K. A. Dill, “Statistical potentials extracted from protein structures: how accurate are they?” J. Mol. Biol. 257:457-469, 1996.
[10] J. A. Kocher, M. J. Rooman, S. J. Wodak, “Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches,” J. Mol. Biol. 235:1598-1613, 1994.
[11] H. J. Bohm, “The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure,” J. Comput. Aided Mol. Des. 8(3):243-256, 1994.
[12] A. N. Jain, “Scoring noncovalent protein-ligand interactions: continuous differentiable function tuned to compute binding affinities,” J. Comput. Aided Mol. Des. 10(5):427-440, 1996.
[13] Y. Li, A. J. Bordner, Y. Tian, X. Tao, A. Gorin, “Extensive Exploration of the Conformational Space Improves Rosetta Results for Short Protein Domains,” Proceedings of International Conference on Computational Systems Bioinformatics, Stanford, 2008.
[14] M. P. Jacobson, D. L. Pincus, C. S. Rapp, T. J. F. Day, B. Honig, D. E. Shaw, R. A. Friesner, “A hierarchical approach to all-atom protein loop prediction,” Proteins: Structure, Function, and Bioinformatics, 55(2):351-367, 2004.
[15] C. A. Rohl, C. E. Strauss, K. M. Misura, D. Baker, “Protein Structure Prediction using Rosetta,” Methods in Enzymology, 383:66-93, 2004.
[16] Y. Yang, Y. Zhou, “Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely-related all-atom statistical energy functions,” Protein Science, 17:1212-1219, 2008.
[17] M. Shen, A. Sali, “Statistical potential for assessment and prediction of protein structures,” Protein Science, 15:2507-2524, 2006.
[18] K. Deb, “Multi-objective optimization using evolutionary algorithms,” John Wiley & Sons, 2001.
[19] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, P. G. Wolynes, “Funnels, Pathways, and the Energy Landscape of Protein Folding: A Synthesis,” Proteins, 21:167-195, 1995.
[20] W. H. Wong, F. Liang, “Dynamic Weighting in Monte Carlo and Optimization,” Proceedings of the National Academy of Sciences of the USA, 94:14220-14224, 1997.
[21] A. Mitsutake, Y. Sugita, Y. Okamoto, “Generalized-ensemble algorithms for molecular simulation of biopolymers,” Biopolymers, 60(2):96-123, 2001.
[22] A. Rojnuckarin, S. Subramaniam, “Knowledge-based interaction potentials for proteins,” Proteins: Structure, Function, and Genetics, 36:54-67, 1999.
[23] H. Zhang, L. Lai, Y. Han, Y. Tang, “A Fast and Efficient Program for Modeling Protein Loops,” Biopolymers, 41:61-72, 1997.
[24] A. A. Canutescu, R. L. Dunbrack, “Cyclic Coordinate Descent: A robotics algorithm for protein loop closure,” Protein Science, 12:963-972, 2003.
[25] K. Zhu, D. L. Pincus, S. Zhao, R. A. Friesner, “Long loop prediction using the protein local optimization program,” Proteins, 65:438-452, 2006.
[26] E. A. Coutsias, C. Seok, M. P. Jacobson, K. A. Dill, “A Kinematic View of Loop Closure,” J. Comput. Chem. 25:510-528, 2004.
[27] J. Handl, S. C. Lovell, J. Knowles, “Investigations into the Effect of Multiobjectivization in Protein Structure Prediction,” Lecture Notes in Computer Science, 5199:702-711, 2008.
