Approximation Algorithms For Protein Folding Prediction

Giancarlo Mauri ([email protected]), Antonio Piccolboni ([email protected]), Giulio Pavesi ([email protected])
Department of Computer Science, University of Milan, Via Comelico 39, Milan, Italy

Abstract

We present a new polynomial-time algorithm for the protein folding problem in the two-dimensional HP model introduced by Dill [1], which has recently been proved to be NP-hard [2]. The model abstracts one of the dominant forces of protein folding: the hydrophobic-hydrophilic interaction. Thus, proteins are modeled as binary strings on the alphabet {H, P}, representing chains of hydrophobic and hydrophilic monomers. The problem is to find, given a string s, the two-dimensional structure that minimizes a given energy function. Our idea is the following: describe a class of feasible structures by means of an ambiguous context-free grammar (i.e. there is a bijection between the set of parse trees and a subset of the possible structures); give a score to every production of the grammar, so that the total score of every parse tree (the sum of the scores of the productions of the tree) is an upper bound on the energy of the corresponding structure; apply a parsing algorithm to find the parse tree with the highest score, corresponding to the configuration with minimal energy among those generated by the grammar. So far, we have designed a grammar that guarantees a performance ratio (i.e., the ratio between the energy of the solution found by the algorithm and the optimal one) of 1/4, equalling the two best polynomial-time performance-guaranteed algorithms for this problem [3]. However, experimental results on a large set of random instances have shown an average performance ratio for our algorithm of 0.67, versus 0.55 and 0.48 for the other two.


1 Introduction

Proteins are polymer chains of amino acid residues of twenty different kinds. Under specific environmental conditions (i.e. inside living organisms), they fold to form a unique geometric pattern, known as the native state, that determines their macroscopic properties, behavior and function. The current hypothesis is that the native state is determined uniquely by the position of the different residues in the chain. Usually, possible conformations are analyzed in terms of their free energy. According to the Thermodynamical Hypothesis, the native structure of a protein is the one corresponding to a global minimum of its free energy. Thus, the protein folding prediction problem can be recast as an energy minimization problem. The form of the energy function changes according to the model adopted. Since it would be impossible to reproduce all the chemical and physical interactions occurring in the folding process, it is necessary to make some abstractions, hiding some aspects and emphasizing the effect of others. Computer simulations of such abstract models can then be compared to experimental observations of real proteins, in order to determine whether the criteria used to build the model play an important role in the formation of the native structure.

2 The HP Model

One of the most successful and best-studied abstract models is the two-dimensional hydrophobic-hydrophilic model, or HP model, proposed by Dill¹. Basically, the amino acid residues can be divided into two classes: the hydrophobic, i.e. non-polar, and the hydrophilic, i.e. polar. Experiments have shown that during the folding process the hydrophobic residues tend to interact with each other, forming the core of the final structure, shielded from the environment by the hydrophilic ones. Therefore, the protein instance can be reduced to a binary sequence of H's (meaning hydrophobic) and P's (meaning polar, or hydrophilic). Furthermore, the conformational space is discretized into a square lattice. Thus, since two residues cannot occupy the same position in space, the possible conformations for the protein in this model are self-avoiding walks (SAWs) on a two-dimensional grid. Each amino acid occupies one node of the grid, with its neighbours in the chain on adjacent

¹ "for chain lengths for which exhaustive enumeration is possible (up to about 30 monomers), two-dimensional models more accurately represent the physically important surface-interior ratios of proteins than do three-dimensional models" [4]


nodes. It is essential to distinguish between pairs of amino acids that are connected neighbours on the grid (i.e., neighbours also in the sequence), and pairs that are topological neighbours (adjacent on the lattice but not in the sequence). From now on, we will refer to a pair of residues as in contact (or generating a contact) if they are topological neighbours. The free energy function for this model is based on the number of hydrophobic residues that are in contact on the lattice. Every H–H contact on the lattice contributes a free energy of e (e < 0); every other contact has a free energy of 0. Following the Thermodynamical Hypothesis, the native conformation is the one that minimizes the free energy, that is, maximizes the number of contacts between H's. To increase the number of contacts between themselves, the H's have to group inside the final structure. A recent result establishes the computational complexity of the problem: the protein folding problem in the two-dimensional HP model, i.e., finding the global minimum of the energy function as defined above, is NP-hard [2].
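As a concrete illustration of this energy function (a sketch, not code from the paper; Python is used for all examples below), the following routine counts the H–H topological-neighbour contacts of a fold given as lattice coordinates, with each such contact contributing −1:

```python
def hp_energy(s, walk):
    """Free energy of a fold in the 2D HP model: -1 for every pair of
    H residues that are adjacent on the lattice but not consecutive in
    the chain. walk: list of (x, y) grid nodes, one per residue."""
    pos = {p: i for i, p in enumerate(walk)}
    assert len(pos) == len(walk), "walk must be self-avoiding"
    energy = 0
    for i, (x, y) in enumerate(walk):
        # Look only right and up, so each lattice edge is counted once.
        for q in ((x + 1, y), (x, y + 1)):
            j = pos.get(q)
            if j is not None and abs(i - j) > 1 and s[i] == s[j] == 'H':
                energy -= 1
    return energy

# HPPH folded into a 2x2 square: the two H's become topological neighbours.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```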

3 Context-free Grammars For Protein Folding Prediction

3.1 The Algorithm

According to the model, a protein instance can be represented by a string s = s0…sn, where si ∈ {H, P}. Our algorithm is based on the following steps:

1. Define an ambiguous grammar that generates all the possible protein instances (i.e. strings of H's and P's of arbitrary length).

2. Define a relation between the derivations of the grammar and a subset of all the possible SAWs, where every production of a derivation recursively corresponds to a spatial position of the terminal symbols generated by the production itself.

3. Assign to every production of the grammar an appropriate score, representing (a lower bound to) the number of contacts between H's generated by the spatial position of the symbols associated with the production in the SAW corresponding to the parse tree.

4. Apply a parsing algorithm to find the tree with the highest score (computed as the sum of the scores of the productions of the tree), that is, the tree corresponding to the SAW with minimal energy in the subset generated by the grammar.

3.2 The Grammar

Let us now introduce the first grammar we defined for our algorithm. G = (T, N, R, P), where:

- T = {H, P, U} is the set of terminal symbols.
- N = {S, L, R} is the set of non-terminal symbols.
- R is the source symbol, i.e. the root of every parse tree.
- P is the set of productions, composed of the following production schemes:

(1) S → T1 S T2
(2) S → T1 L T2 S T3 L T4
(3) S → T1 L T2 S T3 T4
(4) S → T1 T2 S T3 L T4
(5) S → T1 T2
(6) S → T1 L T2 T3 L T4
(7) S → T1 T2 T3 L T4
(8) S → T1 L T2 T3 T4
(9) L → T1 L T2
(10) L → T1 T2
(11) S → S UU
(12) S → UU
(13) R → S S

with Ti ∈ {H, P}. The layout of the terminal symbols associated with each production is shown in Figure 1. The proof that every parse tree corresponds to a self-avoiding walk is straightforward. The score of every production is increased by one every time two H's generated by the production are in contact. For example, the production S → HLHS HLH has score 4, S → HLHS HLP score 2, and so on. Contacts between H's that are generated by different productions cannot be added to the score of the parse tree. Therefore, the score of a parse tree is a lower bound on the actual number of contacts in the corresponding structure.
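The scoring rule can be sketched in code. The contact pairs for scheme (2) below are reconstructed from the paper's own worked examples (S → HLHS HLH scores 4, S → HLHS HLP scores 2); the entry for scheme (1) is our assumption, and in a full implementation the remaining schemes would be filled in from Figure 1:

```python
# Pairs of terminal positions (indices into T1..T4) that the layout of
# each production scheme places in contact. Scheme (2) is reconstructed
# from the paper's examples; scheme (1) is an illustrative assumption.
CONTACT_PAIRS = {
    1: [(1, 2)],                          # S -> T1 S T2
    2: [(1, 2), (1, 4), (2, 3), (3, 4)],  # S -> T1 L T2 S T3 L T4
}

def production_score(scheme, terminals):
    """Score of an instantiated production: +1 for each declared pair
    whose two terminals are both hydrophobic (H)."""
    return sum(1 for i, j in CONTACT_PAIRS[scheme]
               if terminals[i - 1] == terminals[j - 1] == 'H')

print(production_score(2, "HHHH"))  # S -> H L H S H L H: 4
print(production_score(2, "HHHP"))  # S -> H L H S H L P: 2
```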

[Figure 1: Production schemes and the corresponding layout of the terminal symbols; each scheme scores +1 for every pair of terminals Ti = Tj = H that its layout places in contact.]


Figure 2: Structure generated by the algorithm for the sequence HPHPPHPPHPPHPHHPHPPHPPHPPHPH and corresponding parse tree. Contacts between H's are shown by dots.

It could be objected that this grammar generates only sequences of even length. To solve this problem, and to avoid adding further productions to the grammar, in case of a sequence s of odd length the string actually parsed is s′ = sP. In fact, it can be proved that the structure found by the algorithm for s′, once the final P is removed, is also the best structure among those that can be generated by the algorithm for the original sequence s. The algorithm builds structures in which the sequence is folded onto itself twice (see Figure 2). The parse tree is split into two subtrees, whose roots are the two S symbols generated by the source symbol R. The symbols generated by each subtree form a structure shaped like a "U", giving an overall disposition similar to a "C" (facing backwards). If the length of the string is even, the first and last symbols are always in contact. The S non-terminals form the "backbone" of the structure, while the symbols generated by L non-terminals form lateral structures. The introduction of the dummy terminal symbol U allows the grammar to generate a larger set

of structures. The string actually parsed (after the possible addition of a P) is su = sUU. If the second subtree contains only the production S → UU, the first subtree generates the whole sequence, which is again folded once to form a structure shaped like a "U". Without this extension, for example, on the sequence HPPH (whose energy minimum is −1), the algorithm would generate a structure with energy zero.
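The two preprocessing steps described above (pad odd-length inputs with a trailing P, then append the dummy terminals UU) can be summarized as:

```python
def preprocess(s):
    """String actually parsed by the algorithm (Section 3.2): pad a
    sequence of odd length with a trailing P, then append the two
    dummy terminal symbols UU."""
    if len(s) % 2 == 1:
        s += 'P'
    return s + 'UU'

print(preprocess("HPH"))   # HPHPUU
print(preprocess("HPPH"))  # HPPHUU
```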

3.3 The Parsing Algorithm

The parsing algorithm is based on the Earley parser for context-free grammars [5], and it is very similar to the version that computes the Viterbi parse of a string generated by a stochastic grammar proposed by Stolcke [6]. It preserves its worst-case time (O(n³)) and space (O(n²)) complexity.
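The paper's parser extends Earley's algorithm; as a simplified, self-contained stand-in for the same idea (find the maximum-score parse by replacing the sum over derivations with a maximum over production scores, as in a Viterbi parse), here is a score-maximizing CYK parser for grammars in Chomsky normal form. The grammar encoding and the toy grammar in the example are our own, not the paper's:

```python
import math
from collections import defaultdict

def best_parse_score(s, binary, unary, start):
    """Maximum total score of a parse of s under a CNF grammar, where
    each production used contributes its own score.
    binary: {A: [(B, C, score), ...]}; unary: {A: [(terminal, score), ...]}."""
    n = len(s)
    best = defaultdict(lambda: -math.inf)  # best[(i, j, A)]: A =>* s[i:j]
    for i, ch in enumerate(s):
        for A, prods in unary.items():
            for term, sc in prods:
                if term == ch and sc > best[(i, i + 1, A)]:
                    best[(i, i + 1, A)] = sc
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for A, prods in binary.items():
                for B, C, sc in prods:
                    for k in range(i + 1, j):
                        cand = best[(i, k, B)] + best[(k, j, C)] + sc
                        if cand > best[(i, j, A)]:
                            best[(i, j, A)] = cand
    return best[(0, n, start)]

# Toy grammar: S -> X X (score 0), X -> 'H' (score 1) | 'P' (score 0).
print(best_parse_score("HH", {'S': [('X', 'X', 0)]},
                       {'X': [('H', 1), ('P', 0)]}, 'S'))  # 2
```

The algorithm in the paper keeps the O(n³) time bound of this dynamic program while handling the full (non-CNF) grammar of Section 3.2 directly.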

4 Performance Analysis

Given a string s = s0…sn, where si ∈ {H, P}, two residues si and sj can be in contact on the grid only if |j − i| is odd. Furthermore, every residue can be the topological neighbour of at most two other residues, except when it is located at one of the endpoints of the sequence; in this case, it can be in contact with three other residues. Now, let h_e be the number of H's in even positions in a given sequence s, h_o the number of H's in odd positions, and h = min(h_e, h_o). We also define OPT(s) as the free energy of the optimal conformation for a given sequence s. Finally, let us give every H–H contact a free energy of −1. The above considerations lead to the following:

Theorem 1

OPT(s) ≥ −2h − 2    (1)

It can be observed that the lower bound −2h − 2 can be reached only by sequences of odd length with two H's at the endpoints. In order to have a performance-guaranteed approximation algorithm, for every possible instance of the problem the ratio between the energy of the structure generated by the algorithm and the energy of the optimal structure must be bounded by a constant. That is, for every possible sequence s of arbitrary length we must have:

R(s) = A(s) / OPT(s) ≥ R    (2)

where A(s) is the free energy of the structure generated by the algorithm when given the sequence s as input. We will call R the absolute performance ratio of the algorithm. Given N ∈ Z, let S_N = {s | OPT(s) ≤ N}, and let R_N = inf{R(s) | s ∈ S_N}. The asymptotic performance ratio R∞ is given by:

R∞ = sup{k | R_N ≥ k, N ∈ Z} = sup_{N ∈ Z} inf_{s ∈ S_N} R(s)    (3)

If R∞ = k, the algorithm approaches a factor k of the optimum when applied to instances with lower energy for the native state. It is important to point out that the energy of a structure depends on the number of H's in the sequence, and not on the length of the sequence itself. Clearly, the higher R and R∞ are, the better the solutions found by the algorithm. If we had R(s) = 1 for every instance, we would have an algorithm that always finds the optimal structure. Now, let R_G and R∞_G be the absolute and the asymptotic performance ratios of our algorithm. Some easily proved lower bounds for R_G and R∞_G are given below:

Lemma 1 Given a sequence s, there always exists a structure for s, corresponding to a parse tree, that yields ⌈(h+1)/2⌉ contacts.

Theorem 2

R_G ≥ −⌈(h+1)/2⌉ / (−2h − 2) ≥ 1/4    (4)

Theorem 3

R∞_G ≥ 1/4    (5)

The lower bounds for the performance ratios of our algorithm equal the performance ratios of the two best known algorithms [3].
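Under our reading of Theorem 1 and Lemma 1 (the formulas below are reconstructed from garbled notation and should be treated as a sketch), the guaranteed ratio can be checked numerically:

```python
import math

def performance_bounds(s):
    """Quantities from Section 4 (our reconstruction): h = min(#H at
    even index, #H at odd index), the Theorem 1 energy lower bound
    OPT(s) >= -2h - 2, and the ratio implied by the ceil((h+1)/2)
    contacts of Lemma 1, which is always >= 1/4."""
    h_e = sum(1 for i in range(0, len(s), 2) if s[i] == 'H')
    h_o = sum(1 for i in range(1, len(s), 2) if s[i] == 'H')
    h = min(h_e, h_o)
    opt_bound = -2 * h - 2                 # Theorem 1
    guaranteed = math.ceil((h + 1) / 2)    # Lemma 1 contacts
    return h, opt_bound, -guaranteed / opt_bound

print(performance_bounds("HPPH"))  # (1, -4, 0.25)
```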

5 Experimental Evaluation

We have tested our algorithm on random sequences, and compared the results with those of the best performance-guaranteed algorithms (see Figure 3). For each value of P_H, we have completed 1000 runs of our algorithm and of the other two algorithms with guaranteed performance ratios of 1/4 (called B and C as in the original paper), on instances of length 63 with two H's at the endpoints (in order to have a lower energy bound), where P_H = Pr[si = H] for all i ∈ [0, n].
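The test instances can be generated as follows (a sketch of our reading of the setup: inner residues drawn independently with probability P_H of being H, endpoints forced to H):

```python
import random

def random_instance(n=63, p_h=0.5, seed=None):
    """Random HP sequence as in the experiments: length n, each inner
    residue is H with probability p_h, and both endpoints are forced
    to H so that the -2h - 2 energy bound can be attained."""
    rng = random.Random(seed)
    inner = ''.join('H' if rng.random() < p_h else 'P' for _ in range(n - 2))
    return 'H' + inner + 'H'

s = random_instance(63, 0.33, seed=1)
print(len(s), s[0], s[-1])  # 63 H H
```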

Algorithm                                     B       C       CFG
Time complexity                               O(n)    O(n²)   O(n³)
Guaranteed absolute performance ratio         1/4     1/4     1/4
Guaranteed asymptotic performance ratio       1/4     1/4     1/4
Average performance ratio (P_H = .15)         0.52    0.60    0.79
Average performance ratio (P_H = .33)         0.48    0.57    0.72
Average performance ratio (P_H = .5)          0.48    0.55    0.68
Average performance ratio (P_H = .66)         0.48    0.53    0.63
Average performance ratio (P_H = .85)         0.48    0.50    0.55
Average performance ratio (overall)           0.48    0.55    0.67
Worst-case performance ratio found            0.25    0.33    0.375

Figure 3: Guaranteed and experimental performance ratios of algorithms B and C [3], and of our algorithm (CFG).

The performance ratio of our algorithm seems to decrease as the average number of H's in the sequence is increased. The same trend, even if with lower ratios, is shown by algorithm C, while algorithm B has a constant ratio. In the tests, the worst-case ratio of 1/4 has been reached only by algorithm B². The worst case found for our algorithm is 3/8: this, as discussed in the following section, leaves open the issue of its performance ratios.

² On sequences like PP(HPP)^{4k+1}, k ≥ 3 (whose optimal energy is −4k), algorithm C produces structures with energy −k − 3. So, the performance ratio approaches 1/4 as k is increased [3]. It should be noted that our algorithm always finds the optimal solution on this set of instances.

6 Conclusions

We have proved that our algorithm has the same performance guarantee as the best known algorithms, but experimental results suggest that it is even better in an average-case sense. Moreover, whereas the 1/4 bound is tight for the best known algorithms, the tightness of the same performance bound for our algorithm is still an open problem. In fact, Theorems 2 and 3, based on Lemma 1, simply guarantee that for each instance s there always exists, among the structures that can be generated by the algorithm, one whose energy gives performance ratios of 1/4, but not that this structure is the one actually generated, that is, the one with lowest energy. This fact, together with the encouraging experimental results, leads us to conjecture that a tight bound on the performance of our algorithm (or of an improvement of it, based on larger grammars) could in fact be the experimental one, that is, 3/8.

References

[1] K. A. Dill, Dominant Forces in Protein Folding. Biochemistry, 24:1501, 1985.

[2] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Piccolboni, M. Yannakakis, On the Complexity of Protein Folding. In Proc. of the Second Annual Conference on Computational Biology (RECOMB '98), 1998.

[3] W. E. Hart, S. C. Istrail, Fast Protein Folding in the Hydrophobic-Hydrophilic Model Within Three-Eighths of Optimal. Journal of Computational Biology, Spring 1996.

[4] H. S. Chan, K. A. Dill, The Protein Folding Problem. Physics Today, pp. 24–32, 1993.

[5] J. Earley, An Efficient Context-Free Parsing Algorithm. Communications of the ACM, 13(2):94–102, 1970.

[6] A. Stolcke, An Efficient Probabilistic Context-Free Parsing Algorithm That Computes Prefix Probabilities. Computational Linguistics, 21(2):165–201, 1995.
