Complexity of Pseudoknot Prediction in Simple Models

Rune B. Lyngsø
Dept. of Statistics, Oxford University, Oxford, OX1 3TG, United Kingdom; [email protected]

Abstract. Efficient exact algorithms for finding optimal secondary structures of RNA sequences have been known for a quarter of a century. However, these algorithms are restricted to structures without overlapping base pairs, or pseudoknots. The ability to include pseudoknots has gained increased attention over the last five years, but three recent publications indicate that this might leave the problem intractable. In this paper we further investigate the complexity of the pseudoknot prediction problem in two simple models based on base pair stacking. We confirm the intractability of pseudoknot prediction by proving it hard for binary strings in one model, and for strings over an unbounded alphabet in the other model. Conversely, we are also able to present a polynomial time algorithm for pseudoknot prediction for strings over a fixed size alphabet in the second model and a polynomial time approximation scheme for pseudoknot prediction for strings over a fixed size alphabet in the first model.



1 Introduction

Proteins usually get all the attention when talk turns to molecular biological processes, with ribonucleic acids, or RNA, relegated to a simple messenger role. It is, however, well known that functional, or non-coding, RNA is a key component in several vital processes, perhaps most notably by making up most of the ribosome, the molecule translating messenger RNA to proteins. Moreover, new non-coding RNAs and functionally important parts of messenger RNAs are constantly being discovered. The pervasiveness of functional RNA in core biological processes has even led to the theory of an RNA world [1], a time near the origin of life when biology was based on RNA or RNA-like molecules, and DNA and proteins had not yet been added to the apparatus of life.

The major driving force of structure formation for RNA molecules is Watson–Crick and wobble G,U base pair formation, and in particular stacking of neighbouring base pairs. If $i \cdot j$ denotes a base pair between the $i$'th and the $j$'th base of an RNA sequence, two base pairs $i \cdot j$ and $i' \cdot j'$ are stacking if $i' = i + 1$ and $j' = j - 1$; a maximal contiguous sequence of consecutively stacking base pairs, $i \cdot j, (i+1) \cdot (j-1), \ldots, (i+k) \cdot (j-k)$, is called a helix of length $k + 1$. The set of base pairs in the three dimensional structure of an RNA molecule is denoted the secondary structure of that RNA molecule. More generally, secondary structure is used to refer to any (legal) set of base pairs for an RNA sequence.

Algorithms for finding optimum secondary structures for an RNA sequence in thermodynamic models taking base pair stacking and loop (i.e. regions of unpaired bases) destabilising effects into account have been known for almost twenty five years [2]. A major deficiency of these algorithms, however, is that they do not consider structures containing pseudoknots. Though it is not known whether pseudoknots

Fig. 1. Secondary structure of the Escherichia coli operon mRNA from position 16 to position 127, cf. [12, Figure 1]. The backbone of the RNA molecule is drawn as straight lines while base pairings are shown with zigzagged lines. Some of the base pairs, together with the parts of the backbone connecting the involved bases, form a non-planar substructure.

are essential per se, there are numerous examples where evolution has led to a non-coding RNA gene with a pseudoknot substructure essential for its functioning [3, 4]. At its simplest a pseudoknot is just two overlapping base pairs. Two base pairs $i \cdot j$ and $i' \cdot j'$ are overlapping if $i < i' < j < j'$. More generally pseudoknots are used to refer to pairs of substructures, e.g. helices, that contain overlapping base pairs.

If the stability of a secondary structure is modelled by independent contributions from the base pairs of the structure, we can find the most stable structure, including arbitrary pseudoknots, by maximum weighted matching [5]. However, evidence exists in abundance that considering base pairs in isolation is an oversimplification. Hence, some attempts have been made to expand the set of structures considered in [2] to allow structures containing some pseudoknots while still allowing similar thermodynamic energy rules and efficient exact algorithms for finding the optimum structure [6–9]. Conversely, several recent publications indicate that extending the set of structures considered to allow arbitrary pseudoknots leaves the problem of finding the optimum structure NP hard [7, 10, 11].

One can criticise the NP hardness results of these three papers for assuming unrealistic models of RNA secondary structure formation, though. In [10] the scoring function is not fixed but assumed part of the input, i.e. the scores of structural elements vary with the sequence. In [7, 11] the set of legal structures is restricted to be planar. A structure is planar if the graph consisting of the bases as nodes and the backbone and base pair connections as edges is planar. The requirement of planarity is not based on observed real world restrictions, as non-planar structures are known, cf. Fig. 1.

The contribution of this paper is to investigate the computational complexity of finding optimum general secondary structures, i.e. structures that may contain non-planar pseudoknots, with structures scored by two of the simplest possible functions taking stacking into account. One function, introduced in [11], scores a secondary structure by the number of base pair stackings it contains. The rationale for this is that base pair stackings are by and large the only structural elements with a stabilising contribution to secondary structures in the canonical parametrisation, cf. [13], of the energy model assumed by [2]. For this scoring function we provide a simple proof that it is NP hard to find the optimum structure of an RNA sequence, and strengthen this to also hold for binary strings. We further present a polynomial time approximation scheme (PTAS) for finding structures with a score close to optimum. The other scoring function considered counts the number of stacking base pairs. For this function we are only able to establish the NP hardness of finding the optimum structure when allowed an unbounded alphabet. We complement this result with an algorithm that for strings over any alphabet of fixed size finds the optimum structure in polynomial time. The practical relevance of this algorithm is diminished by polynomial time being $O(|s|^{81})$ for RNA sequences $s$.

In Sect. 2 we give a formal specification of the models and scoring functions investigated in this paper. In Sect. 3 we provide proofs that finding an optimum secondary structure with pseudoknots is hard. In Sect. 4 we present a polynomial time algorithm for finding the optimum structure according to one scoring function, and a PTAS for finding a structure approximating the optimum score according to the other scoring function. Finally, in Sect. 5 we briefly discuss some open questions.



2 Folding Model

We will assume a model for secondary structures where only some types of base pairs are allowed, each base forms at most one base pair, and the two bases in a base pair are separated by at least three bases in the string. This last requirement is inconsequential to the proofs in the next section, as the reductions also work with the requirement removed. However, it is a consequence of steric constraints for real RNA molecules, and is thus included. This model is a straightforward generalisation of the model assumed in [2].

Definition 1 (General folding model). For a string $s \in \Sigma^*$ over an alphabet $\Sigma$ with an associated set $B$ of legal base pairs, a legal secondary structure is a set $S$ of base pairs such that if $i \cdot j \in S$ then
– $1 \le i < j \le |s|$ and $j - i > 3$
– if $i' \cdot j' \in S$ and $i' \cdot j' \ne i \cdot j$ then $\{i, j\} \cap \{i', j'\} = \emptyset$
– $\{s[i], s[j]\} \in B$

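One instance of the above model would be the canonical RNA folding model usually assumed for finding thermodynamically optimal RNA structures. In this model only canonical base pairs, i.e. Watson–Crick and G,U wobble base pairs, are allowed.

Definition 2 (Canonical RNA folding model). For an RNA sequence $s \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{U}\}^*$, a legal secondary structure is a set $S$ of base pairs such that if $i \cdot j \in S$ then
– $1 \le i < j \le |s|$ and $j - i > 3$
– if $i' \cdot j' \in S$ and $i' \cdot j' \ne i \cdot j$ then $\{i, j\} \cap \{i', j'\} = \emptyset$
– $\{s[i], s[j]\} \in \{\{\mathrm{C}, \mathrm{G}\}, \{\mathrm{A}, \mathrm{U}\}, \{\mathrm{G}, \mathrm{U}\}\}$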

Evidently not all secondary structures that are legal by our folding model will be physically realisable due to steric constraints. We will briefly return to this in Sect. 5.
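The three conditions of the canonical folding model can be checked mechanically. The following Python sketch is one possible reading of Definition 2, not code from the paper: the function name `is_legal`, the representation of a structure as a set of 1-based `(i, j)` tuples, and the parameter names are assumptions made for illustration.

```python
# Legal base pairs of the canonical model, stored as unordered pairs.
CANONICAL_PAIRS = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def is_legal(seq, structure, legal_pairs=CANONICAL_PAIRS, min_sep=3):
    """Return True if `structure`, a set of 1-based (i, j) tuples, is legal for `seq`.

    Hypothetical helper: checks position bounds, minimum separation,
    that no base is used twice, and that each pair is of a legal type.
    """
    used = set()
    for i, j in structure:
        if not (1 <= i < j <= len(seq)):
            return False
        if j - i <= min_sep:  # fewer than min_sep bases between i and j
            return False
        if i in used or j in used:  # each base forms at most one base pair
            return False
        if frozenset((seq[i - 1], seq[j - 1])) not in legal_pairs:
            return False
        used.update((i, j))
    return True


print(is_legal("GGGAAAACCC", {(1, 10), (2, 9), (3, 8)}))  # True
print(is_legal("GAAC", {(1, 4)}))  # False: only two bases in between
```

Note that the check places no restriction on overlapping base pairs, so structures with arbitrary pseudoknots pass as long as the three conditions hold.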



 

Fig. 2. A helix of five stacking base pairs contains four base pair stackings.

Table 1. Illustration of the differences between the three scoring functions.

Helix length                     1    2    3    k ≥ 2
Number of base pairs             1    2    3    k
Number of base pair stackings    0    1    2    k − 1
Number of stacking base pairs    0    2    3    k

The number of base pairs in a secondary structure $S$ is just the size of $S$. As previously mentioned, finding a structure with a maximum number of legal base pairs is just an instance of maximum matching, which can be solved efficiently [5]. In this paper we focus on two slight generalisations of looking at each base pair in isolation. We consider scoring functions where the score of a base pair depends on the presence of a neighbouring, or stacking, base pair in $S$, either by scoring a structure by the number of base pair stackings it contains or by the number of stacking base pairs it contains.





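Definition 3 (Number of base pair stackings). For a legal secondary structure $S$, the number of base pair stackings is defined as

$$\mathrm{BPS}(S) = \left|\{\, i \cdot j \in S \mid (i+1) \cdot (j-1) \in S \,\}\right|$$

Definition 4 (Number of stacking base pairs). For a legal secondary structure $S$, the number of stacking base pairs is defined as

$$\mathrm{SBP}(S) = \left|\{\, i \cdot j \in S \mid (i+1) \cdot (j-1) \in S \ \vee\ (i-1) \cdot (j+1) \in S \,\}\right|$$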

The difference between these scoring functions for a helix of stacking base pairs is illustrated in Fig. 2 and in Table 1. The score of an entire structure is just the sum of scores of the helices it contains.
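The two scoring functions can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code: a structure is taken to be a set of `(i, j)` tuples, and the names `bps`, `sbp` and `helix` are hypothetical.

```python
def bps(structure):
    """Number of base pair stackings: pairs (i, j) whose inner neighbour (i+1, j-1) is present."""
    return sum((i + 1, j - 1) in structure for i, j in structure)


def sbp(structure):
    """Number of stacking base pairs: pairs with an inner or an outer stacking neighbour."""
    return sum(
        (i + 1, j - 1) in structure or (i - 1, j + 1) in structure
        for i, j in structure
    )


def helix(i, j, length):
    """Helper building a helix of `length` stacked pairs starting at (i, j)."""
    return {(i + d, j - d) for d in range(length)}


one_helix = helix(1, 20, 5)
two_helices = helix(1, 30, 2) | helix(10, 25, 3)

print(bps(one_helix), sbp(one_helix))      # 4 5
print(bps(two_helices), sbp(two_helices))  # 3 5
```

Splitting five base pairs over two helices costs one base pair stacking but leaves the number of stacking base pairs unchanged, which is the distinction the rest of the paper builds on.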

3 Complexity Results

In this section we investigate some complexity issues for pseudoknot prediction by establishing the NP hardness of finding legal secondary structures that are optimum using the BPS (number of base pair stackings) and SBP (number of stacking base pairs) scoring functions. We start with a simple proof that finding a structure with a maximum number of base pair stackings in the canonical RNA folding model, cf. Def. 2, is NP hard. We strengthen this result to also hold for strings over a binary alphabet. Finally we prove that finding a structure with a maximum number of stacking base pairs is NP hard if we are allowed to use an unbounded alphabet.

3.1 Number of Base Pair Stackings

Apart from illustrating the difference between the BPS and SBP scoring functions, Fig. 2 also illustrates that under the BPS scoring function the contribution of a helix is always one less than the length of the helix, i.e. the number of base pairs in the helix. Hence, for a fixed number of base pairs, each helix these base pairs are distributed over reduces the score of the structure by one.

Assume that we have an RNA sequence $s$ for which all legal base pairs have to contain a particular type of base, say a C. Further assume that the C's in $s$ are grouped in substrings of lengths $k_1, k_2, \ldots, k_m$, and that the bases at either end of these substrings cannot form a legal base pair with any base in $s$. If a structure $S$ for $s$ has the C's in each of the substrings form base pairs that neatly stack in one contiguous helix, then the score of the structure is exactly $\sum_{i=1}^{m} (k_i - 1)$. If for any of the substrings the C's are split among two or more helices, or some are left unpaired, the score will be less than $\sum_{i=1}^{m} (k_i - 1)$. So to rephrase, the optimum score depends on whether we can 'pack' the base pairs of each substring to form contiguous helices, or whether we have to distribute the base pairs over two or more helices, or leave part of a substring unpaired, for one or more of the substrings.
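As a small numeric illustration of the packing argument, the following Python sketch scores a structure from the lengths of its helices under the number of base pair stackings. The function name and the example lengths are hypothetical and chosen only for illustration.

```python
def stackings_from_helices(helix_lengths):
    """Score under the number of base pair stackings: a helix of length k contributes k - 1."""
    return sum(k - 1 for k in helix_lengths)


# Three substrings of C's with lengths 4, 3 and 5, each packed into one helix.
packed = stackings_from_helices([4, 3, 5])

# The same twelve base pairs, but the substring of length 5 is split over two helices.
split = stackings_from_helices([4, 3, 2, 3])

print(packed, split)  # 9 8
```

Every additional helix lowers the score by exactly one, so a target score equal to the packed value can only be met if no substring is split.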












Theorem 1. Given an RNA sequence s and a target K, it is NP hard to determine whether there is a structure S that is legal under the canonical RNA folding model and with BPS(S) ≥ K.

... are added to s. The three parts are joined with three $ bases separating each part. The target K is set to twice the sum of the number of unique variables occurring in φ and the number of clauses in φ, i.e. the number of non-$ bases in the variable and clause parts of s. The alphabet is the set of bases used in s.

If φ has a satisfying truth assignment we can form pairs of stacking base pairs between bases in the variable part and bases in the literal part corresponding to literals that become false by the truth assignment, while still being able to find two bases corresponding to a literal occurrence for each clause that has not been paired with bases in the variable part, i.e. we can find a structure for s with K stacking legal base pairs. Conversely, a structure with K stacking legal base pairs for s will have all non-$ bases in the variable and clause parts forming base pairs. A truth assignment obtained by requiring a literal to be false iff bases representing it in the literal part form base pairs with bases in the variable part will clearly satisfy φ, as for each clause we can find a literal whose negation is false. The construction is illustrated in Fig. 5.



 












4 Algorithmic Results

It is somewhat unsatisfying that Theorem 3 assumes an unbounded alphabet. For one thing, the result does not establish that it is NP hard to find the optimum RNA secondary structure with arbitrary pseudoknots when structures are scored by the number of stacking base pairs they contain. But as we shall see in this section, such a result would be quite surprising. For strings over any fixed alphabet, the problem of finding the optimum secondary structure using the SBP (number of stacking base pairs) scoring function turns out to be in P.

To see this, consider the helix of five stacking base pairs in Fig. 2. This contributes 5 to the overall score under the SBP scoring function. Breaking it into two helices of lengths two and three, the total contribution is still 5. Any helix of stacking base pairs, i.e. any helix of length at least two, can be broken into helices of lengths two or three. Hence, finding an optimum structure when only helices up to length three are considered will result in an optimum structure under the SBP scoring function.

So for a string $s$ we could partition it into singletons, dinucleotides, and trinucleotides in all possible ways, and for each partition find a maximum weighted matching where matchings of complementary dinucleotides have weight 2 and matchings of complementary trinucleotides have weight 3. However, there is an exponential number of different partitions. But the important part of a partition, in terms of score, is not the partition itself, but the number of each of the dinucleotides and trinucleotides it contains. Hence, for any prefix $s[1..k]$ of $s$ and count $c$ of yet unpaired occurrences of each of the dinucleotides and trinucleotides in $s[k+1..|s|]$ we can find the optimum number $T[k, c]$ of stacking base pairs that can be formed in $s[1..k]$ by the following recursion, with $T[0, c] = 0$ and terms referring to negative prefix lengths omitted.

$$
T[k, c] = \max
\begin{cases}
T[k-1, c] \\
T[k-2, c^{s[k-1..k]+}] \\
T[k-3, c^{s[k-2..k]+}] \\
\max \{\, T[k-2, c^{w-}] + 2 \mid c(w) > 0,\ w \text{ complementary to } s[k-1..k] \,\} \\
\max \{\, T[k-3, c^{w-}] + 3 \mid c(w) > 0,\ w \text{ complementary to } s[k-2..k] \,\}
\end{cases}
\qquad (1)
$$



The notation $c^{w+}$ ($c^{w-}$) denotes a count identical to $c$, except that the count of the string $w$ is increased (reduced) by one. The rationale of the recursion is that we can either leave the trailing singleton, dinucleotide, or trinucleotide of $s[1..k]$ unpaired for now, and update the count $c$ accordingly. Or we can pair the trailing dinucleotide (trinucleotide) with a complementary dinucleotide (trinucleotide).

The recursion of (1) can be used as the basis of a dynamic programming algorithm, where the optimum score equals $T[|s|, c_0]$, where $c_0$ is the count that equals zero for every dinucleotide and trinucleotide. Optimum structures can be determined by traceback. The count of any dinucleotide or trinucleotide and the number of different prefixes of $s$ are both $O(|s|)$. The number of different dinucleotides and trinucleotides is $|\Sigma|^2 + |\Sigma|^3$, so the number of different entries of $T$ we need to maintain and compute is $O(|s|^{|\Sigma|^2 + |\Sigma|^3 + 1})$. Any one entry can be computed in time depending only on $|\Sigma|$, so for a fixed alphabet we can find $T[|s|, c_0]$ in time $O(|s|^{|\Sigma|^2 + |\Sigma|^3 + 1})$. The space complexity can be reduced to $O(|s|^{|\Sigma|^2 + |\Sigma|^3})$ by applying the method described in [15]. For a four letter alphabet like the RNA alphabet this means a time complexity of $O(|s|^{81})$ and a space complexity of $O(|s|^{80})$.

The observant reader will have noticed that (1) does not guarantee that all base pairs formed are between bases that are separated by at least three other bases in the string. This can be amended by adding a careful, constant time bookkeeping of the status of the last few bases in the prefix. The recursion can readily be modified to allow individual scores for each type of base pair.
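A minimal Python sketch of a dynamic program in the spirit of recursion (1) is given below. It is an illustration under several assumptions rather than the paper's implementation: the count is represented as a sorted tuple of `(word, number)` pairs, the legal base pairs are those of the canonical model, the function names are hypothetical, and, like (1) itself, the sketch ignores the minimum separation requirement. Being a plain memoised recursion, it is only practical for short strings.

```python
from functools import lru_cache

# Assumed set of legal base pairs (canonical model), as unordered pairs.
LEGAL = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def complementary(u, w):
    """True if words u and w can form a helix, i.e. u[t] pairs with w[-1 - t]."""
    return len(u) == len(w) and all(
        frozenset((a, b)) in LEGAL for a, b in zip(u, reversed(w))
    )


def adjust(count, word, delta):
    """Return a copy of `count` with the number for `word` changed by `delta`."""
    numbers = dict(count)
    numbers[word] = numbers.get(word, 0) + delta
    return tuple(sorted((w, n) for w, n in numbers.items() if n > 0))


def max_stacking_base_pairs(s):
    """Optimum number of stacking base pairs for s, ignoring minimum separation."""

    @lru_cache(maxsize=None)
    def T(k, count):
        if k == 0:
            return 0
        best = T(k - 1, count)  # trailing base stays unpaired
        for length in (2, 3):
            if k < length:
                continue
            u = s[k - length:k]
            # Leave the trailing word unpaired for now and remember it.
            best = max(best, T(k - length, adjust(count, u, +1)))
            # Or pair it with a remembered complementary word.
            for w, _ in count:
                if complementary(u, w):
                    best = max(best, length + T(k - length, adjust(count, w, -1)))
        return best

    return T(len(s), ())


print(max_stacking_base_pairs("GGAACCUU"))  # 4
```

In the example the GG/CC helix and the AA/UU helix overlap, so the optimum of 4 is only reachable when pseudoknots are allowed.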

For the BPS (number of base pair stackings) scoring function, we cannot apply the above technique to find the score of an optimum structure. Indeed, the fact that breaking one helix into two smaller helices reduces the score by one was the foundation of the reduction in Sect. 3.1. But considering helices up to length $k$ would only break a helix of length $l$ into $\lceil l/k \rceil$ helices, i.e. the contribution to the overall score counted for that particular helix would only be decreased by $\lceil l/k \rceil - 1$, or at most a fraction $1/k$ of its actual contribution. So by amending (1) to consider substrings up to length $k$ and using the BPS scoring function, we can find a structure with a score that is at least $\frac{k-1}{k}$ of the optimum score.

There are $\sum_{i=2}^{k} |\Sigma|^i < |\Sigma|^{k+1}$ different substrings over alphabet $\Sigma$ of lengths between 2 and $k$. Hence, with $k = \lceil 1/\varepsilon \rceil$, we can approximate the optimum score within a factor of $1 - \varepsilon$ under the BPS scoring function in time and space $|s|^{O(|\Sigma|^{\lceil 1/\varepsilon \rceil + 1})}$, i.e. in polynomial time for any fixed $\varepsilon$ and fixed alphabet. This establishes the existence of a PTAS for pseudoknot prediction under the BPS scoring function. It is unlikely that a Fully PTAS, i.e. an approximation scheme where the time complexity depends only polynomially on $1/\varepsilon$, exists, as a sufficiently precise approximation, e.g. with $\varepsilon < 1/|s|$, would equal the optimum score due to the integral nature of scores.
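The bound behind the approximation guarantee can be checked numerically. The Python sketch below assumes a helix of length `l` is cut greedily into pieces of length at most `k`; the function name is hypothetical and the ranges tested are arbitrary.

```python
def capped_score(l, k):
    """Base pair stackings kept when a helix of length l is cut into ceil(l / k) helices."""
    pieces = -(-l // k)  # integer ceiling of l / k
    return l - pieces


# A helix of length 5 contributes 4 stackings; capped at length 3 it keeps 3.
print(capped_score(5, 3))  # 3

# The capped score is at least (k - 1) / k of the full score l - 1.
for k in range(2, 10):
    for l in range(2, 500):
        assert k * capped_score(l, k) >= (k - 1) * (l - 1)
```

The inequality is written with integers only, so the check is not affected by floating point rounding.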




   

    

5 Discussion

In this paper we have proven that it is NP hard to find an optimum RNA secondary structure when we allow any set of base pairings, as long as all base pairs are canonical, no base is included in more than one base pair, and the bases of any base pair obey the minimum separation requirement. A lot of structures that are legal under these assumptions will not be realisable by three dimensional structures due to steric constraints. Defining a model that allows those structures, and only those structures, that can be realised by a three dimensional structure without in essence turning the secondary structure prediction problem in the model into a tertiary structure prediction problem seems a daunting task. However, by increasing the number of A's separating the item representations and the bin representations in the proof of Theorem 1 it should be possible to add enough freedom of movement of the substrings of C's and G's to meet constraints based on reasonable steric considerations. This trick cannot be applied to the string constructed in the proof of Theorem 2, however, as we do not have a separator symbol that is guaranteed not to form base pairs.

Though we did manage to develop a polynomial time algorithm for finding the optimum structure of an RNA sequence under the SBP (number of stacking base pairs) scoring function, the time complexity of $O(|s|^{81})$ (and space complexity of $O(|s|^{80})$) does render it rather useless in practice. From Theorem 3 we would expect an exponential dependence on the alphabet size. But this still allows for the possibility of an algorithm with a much smaller exponent for finding optimum structures under the SBP scoring function.

One open problem that remains is whether we can strengthen Theorem 3 to hold for strictly complementary alphabets. In the proof, some of the bases added to the alphabet can form legal base pairs with more than one other type of base, similar to the presence of the G,U wobble base pair in the set of legal base pairs for RNA sequences. It is still unanswered whether the pseudoknot prediction problem remains NP hard under the SBP scoring function if each base is included in only one legal base pair, similar to the set of Watson–Crick base pairs. One hint that this just might affect the complexity is that a strictly complementary alphabet allows us to roughly halve the exponent in the time complexity of an algorithm based on the recursion in (1). This follows as we can group dinucleotides and trinucleotides into complementary pairs for which we only need to consider cases where at most one of them has a count larger than zero.



Acknowledgements. This work was supported by EPSRC grant HAMJW, and MRC grant HAMKA. The author would like to thank the anonymous referees for useful comments, and in particular one referee for supplying errata.

References

1. Joyce, G.F.: The antiquity of RNA-based evolution. Nature 418 (2002) 214–221
2. Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 9 (1981) 133–148
3. Felden, B., Massire, C., Westhof, E., Atkins, J.F., Gesteland, R.F.: Phylogenetic analysis of tmRNA genes within a bacterial subgroup reveals a specific structural signature. Nucleic Acids Research 29 (2001) 1602–1607
4. Tanaka, Y., Hori, T., Tagaya, M., Sakamoto, T., Kurihara, Y., Katahira, M., Uesugi, S.: Imino proton NMR analysis of HDV ribozymes: nested double pseudoknot structure and Mg2+ ion-binding site close to the catalytic core in solution. Nucleic Acids Research 30 (2002) 766–774
5. Tabaska, J.E., Cary, R.B., Gabow, H.N., Stormo, G.D.: An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics 14 (1998) 691–699
6. Rivas, E., Eddy, S.: A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology 285 (1999) 2053–2068
7. Akutsu, T.: Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Applied Mathematics 104 (2000) 45–62
8. Uemura, Y., Hasegawa, A., Kobayashi, S., Yokomori, T.: Tree adjoining grammars for RNA structure prediction. Theoretical Computer Science 210 (1999) 277–303
9. Reeder, J., Giegerich, R.: From RNA folding to thermodynamic matching, including pseudoknots. Technical Report 03, Technische Fakultät, Universität Bielefeld (2003)
10. Lyngsø, R.B., Pedersen, C.N.S.: RNA pseudoknot prediction in energy based models. Journal of Computational Biology 7 (2000) 409–428
11. Ieong, S., Kao, M.Y., Lam, T.W., Sung, W.K., Yiu, S.M.: Predicting RNA secondary structures with arbitrary pseudoknots by maximizing the number of stacking pairs. In: Proceedings of the 2nd Symposium on Bioinformatics and Bioengineering. (2001) 183–190
12. Gluick, T.C., Draper, D.E.: Thermodynamics of folding a pseudoknotted mRNA fragment. Journal of Molecular Biology 241 (1994) 246–262
13. Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H.: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology 288 (1999) 911–940
14. Papadimitriou, C.H.: Computational Complexity. Addison-Wesley Publishing Company, Inc. (1994)
15. Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequence. Communications of the ACM 18 (1975) 341–343
