Predicting RNA Secondary Structures Including Pseudoknots
by Andrey Kravchenko
Computing Laboratory, Oxford University, UK, 2009

Abstract

RNA secondary structures play a vital role in modern genetics, and a lot of time and effort has been put into their study. It is important to be able to predict them with high accuracy, since methods involving manual analysis are expensive, time-consuming and error-prone. Predictions can also be used to guide experiments, reducing time and money requirements. Several algorithms have been developed for this task. Most of them assume that the desired secondary structure contains no pseudoknots. However, pseudoknots, though relatively rare, play an important role in the secondary structure as a whole. This report describes in detail the full thermodynamic model used to predict secondary structures without pseudoknots, together with the associated algorithm. It then extends the model to include a restricted class of pseudoknots and presents an efficient algorithm for predicting structures within this class. The algorithm runs in $O(n^4)$ time and $O(n^2)$ space, which makes it competitive with other known algorithms that take pseudoknots into account. A detailed discussion of the implementation and an appendix containing the program code are also included.


Contents

1 Introduction
1.1 An informal overview
1.2 Biological motivation
1.3 Some definitions
2 The Nearest Neighbour Thermodynamic Model
2.1 Nussinov's algorithm
2.2 NNTM preliminaries
2.3 simFold recursions
3 Extending NNTM with pseudoknot prediction
3.1 Classification of pseudoknots
3.2 Modified Akutsu's algorithm
3.2.1 Original algorithm
3.2.2 Our extension
3.3 Incorporating pseudoknots into NNTM
3.4 The Vienna notation
4 Implementation details
4.1 The software architecture
4.1.1 The model
4.1.2 Parsing the thermodynamic parameters and real RNA secondary structures
4.1.3 Model-View-Controller pattern
4.2 Choosing the programming language
4.3 The data input
4.4 The Graphical User Interface
4.5 Debugging
5 Final results
5.1 Accuracy of prediction on pseudoknots in PseudoBase
5.2 Running-time CPU performance
5.3 Further areas of research
5.4 Conclusions
6 Acknowledgments
A Some Extracts From the Source Code
A.1 Model.java
A.2 Parser.java
A.3 TestBio.java
A.4 TestRunningTime.java
A.5 MainScreen.java
A.6 Controller.java

1 Introduction

1.1 An informal overview

In this project we are concerned with developing an efficient algorithm for predicting RNA secondary structure that takes pseudoknots into account. The input is the RNA's primary structure, which is a sequence of bases (A, C, G or U). The secondary structure also includes bonds formed between bases (base pairs). The most common ones are (A, U), (C, G) and (G, U). Many algorithms give more or less accurate predictions of secondary structures. However, only a few take pseudoknots into account (we will describe what a pseudoknot is later in this report), even though pseudoknots occur in many molecules and can be involved in important genetic functions. In the first section we set forth the biological motivation underlying the problem and define the most important problem-related terms. We then describe our new approach for predicting pseudoknots, the thermodynamic model for predicting RNA secondary structures without pseudoknots, and our method of extending this model with pseudoknot prediction. Finally, we give the implementation details, the accuracy that we achieved in real biological systems, and the CPU running time of our implementation. The Java code for the most important classes of our program is attached in the appendices.

1.2 Biological motivation

Ribonucleic acid (RNA) is a molecule that plays an important role in genetic processes. By the Central Dogma of Molecular Biology, RNA is transcribed from DNA and its sequence of triplets is used to specify the order in which specific amino acids are joined during protein synthesis. RNA is not only a messenger, but also performs various genetic functions. Two important examples of functional RNAs are Ribosomal RNA (rRNA) and Transfer RNA (tRNA). RNA consists of nucleotides, each of which contains a base, among other components. There are four types of nucleotides in RNA, distinguished by their bases: Adenine (A), Cytosine (C), Guanine (G) and Uracil (U).

Figure 1: The Central Dogma of Molecular Biology (source: http://en.wikipedia.org/wiki/File:Centraldogma_nodetails.GIF)


One of the most important aspects of RNA molecules is the strong hydrogen bond interaction formed between some of its bases. Unlike DNA, which has two strands with bases from one strand forming pairs with bases from the other, RNA is self-folding. The most common RNA base pairs are the Watson-Crick base pairs (Adenine-Uracil and Cytosine-Guanine) and the Wobble base pair (Uracil-Guanine). Although other pairings are also possible, their frequency of occurrence is negligible.

Figure 2: Most observable RNA base pairs (see footnote 2)

We distinguish three levels of organisation in RNA: primary, secondary and tertiary structure. Primary structure is simply the sequence of nucleotides connected by covalent bonds. Secondary structure also shows us the base pairs occurring in the molecule. Tertiary structure gives us the 3-D shape of the RNA molecule by representing each nucleotide in its atomic 3-D structure. This project concerns the prediction of the RNA secondary structure from its primary structure. The secondary structure can be obtained from the primary one experimentally, but this process is error-prone and requires a lot of manual work. Hence it is important to be able to predict RNA secondary structures with significant accuracy without resorting to labour-intensive laboratory methods.

Figure 3: Examples of RNA secondary and tertiary structures (see footnote 3)

In Figure 3 we can see an example of RNA secondary structure without pseudoknots, in which the RNA folds into a tree-like structure. An example of a pseudoknotted RNA secondary structure is presented in Figure 4. Though not occurring that often, pseudoknots are not uncommon enough to be completely neglected, and sometimes play an important genetic role. This project focuses on predicting RNA secondary structures that include pseudoknots.

Footnote 2: RNA Secondary Structure Predictions (Lecture by Rune Lyngso, November 20, 2008)
Footnote 3: RNA Secondary Structure Predictions (Lecture by Rune Lyngso, November 20, 2008)


Figure 4: An example of pseudoknotted RNA secondary structure 4

1.3 Some definitions

Before we proceed to the actual description of our model, we need to provide some mathematically precise definitions of the terms we will be using. Note that we will define the notion of pseudoknots in Section 3.

• RNA primary structure - a sequence of bases $x_1, x_2, \ldots, x_n$, where $\forall i \; x_i \in \{A, U, C, G\}$.
• Canonical base pair - a base pair $(x_i, x_j) \in \{(A, U), (U, A), (C, G), (G, C), (G, U), (U, G)\}$.
• Minimal pairing distance $R$ - the minimum number of bases between bases $x_i$ and $x_j$ such that they can form a base pair, i.e., $(x_i, x_j)$ can form a base pair $\iff$ $(x_i, x_j)$ is a canonical base pair $\wedge \; i + R < j$. It follows from the physical properties of nucleotides that $R = 3$. That is the value assigned ...
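The definitions above can be illustrated with a short Java sketch. It is not taken from the report's source code: the class name `BasePairing` and its method names are hypothetical, and the pairing condition is read as $i + R < j$ (at least $R$ bases between the two positions), which is an assumption about the partly illegible formula in the text.

```java
import java.util.Set;

/** Illustrative sketch of the Section 1.3 definitions; all names here are hypothetical. */
final class BasePairing {
    /** Minimal pairing distance R; the text states R = 3. */
    static final int R = 3;

    /** Ordered canonical pairs: the Watson-Crick pairs plus the G-U wobble pair. */
    private static final Set<String> CANONICAL = Set.of("AU", "UA", "CG", "GC", "GU", "UG");

    /** True if every character of seq is one of A, C, G, U. */
    static boolean isPrimaryStructure(String seq) {
        return seq.chars().allMatch(c -> "ACGU".indexOf(c) >= 0);
    }

    /** True if (a, b) is one of the six canonical ordered pairs. */
    static boolean isCanonical(char a, char b) {
        return CANONICAL.contains("" + a + b);
    }

    /**
     * Positions are 1-based, as in the text. Assumed condition:
     * (x_i, x_j) can pair iff the pair is canonical and i + R < j.
     */
    static boolean canPair(String seq, int i, int j) {
        return i >= 1 && j <= seq.length() && i + R < j
                && isCanonical(seq.charAt(i - 1), seq.charAt(j - 1));
    }
}
```

For the sequence GGGAAAUCC, positions 3 and 7 (G and U) may pair under this reading because exactly three bases lie between them, whereas positions 4 and 7 (A and U) are canonical but too close.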

2.3 simFold recursions

As a first step towards incorporating pseudoknot prediction into the NNTM, we had to implement the basic version of NNTM, which does not allow pseudoknots to occur. As a guide for our implementation, we used the recursions for Mirela Andronescu's simFold, which are given in [6]. We also slightly modified them.

In our implementation we keep four arrays dependent on the number of bases $n$ in the molecule. They are:

• $W[j] \mid 1 \le j \le n$ - keeps the MFE for segment $[1 \ldots j]$. Thus, $W(n)$ keeps the MFE of the whole molecule.
• $V[i][j] \mid 1 \le i < j \le n$ - keeps the MFE for segment $[i \ldots j]$ assuming that $(i, j)$ form a base pair.
• $WM[i][j] \mid 1 \le i < j \le n$ - is used for evaluating multibranched loops.
• $Type[i][j] \in \{NONE, STACKING, HAIRPIN, INTERNAL, MULTIBRANCHED\} \mid 1 \le i < j \le n$ - is used to remember the optimal type of the loop enclosed by base pair $(i, j)$. We will need this array for backtracking.
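One possible way to lay out these four tables in Java is sketched below. This is an assumption-laden illustration rather than the layout used in Model.java: the class `FoldingTables`, the integer energy representation, the sentinel `INF` and the 1-based indexing with an unused row and column 0 are all choices made here for readability.

```java
import java.util.Arrays;

/** Illustrative container for the four dynamic-programming tables; names are hypothetical. */
final class FoldingTables {
    /** Loop types listed in the text, stored per base pair for backtracking. */
    enum LoopType { NONE, STACKING, HAIRPIN, INTERNAL, MULTIBRANCHED }

    /** Sentinel for "no structure found yet"; leaves headroom so sums do not overflow. */
    static final int INF = Integer.MAX_VALUE / 4;

    final int n;
    final int[] w;            // w[j]: MFE of segment [1..j]; w[0] = 0 for the empty prefix
    final int[][] v;          // v[i][j]: MFE of [i..j] given that (i, j) form a base pair
    final int[][] wm;         // wm[i][j]: helper table for multibranched loops
    final LoopType[][] type;  // type[i][j]: optimal loop type enclosed by (i, j)

    FoldingTables(int n) {
        this.n = n;
        // Index 0 is unused in the 2-D tables so that indices match the 1-based text.
        w = new int[n + 1];
        v = new int[n + 1][n + 1];
        wm = new int[n + 1][n + 1];
        type = new LoopType[n + 1][n + 1];
        for (int i = 0; i <= n; i++) {
            Arrays.fill(v[i], INF);
            Arrays.fill(wm[i], INF);
            Arrays.fill(type[i], LoopType.NONE);
        }
    }
}
```

The three two-dimensional tables account for the quadratic space requirement mentioned in the abstract; only the upper triangle ($i < j$) is ever read, so a triangular layout would roughly halve the memory at the cost of less direct indexing.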


Before we give the actual recursions we need to define a function called $NonGCPenalty$. Once again, we follow [6] in our definition:

$$NonGCPenalty(a, b) = \begin{cases} 0, & \text{if } (a, b) \text{ is } (C, G) \text{ or } (G, C) \\ NonGCTerminal, & \text{otherwise} \end{cases}$$
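In Java this function is a one-line case distinction. The sketch below uses a hypothetical class `Penalties`, and the numeric value of `NON_GC_TERMINAL` is a placeholder chosen only so that the example compiles; the actual value of $NonGCTerminal$ comes from the thermodynamic parameter files.

```java
/** Illustrative version of NonGCPenalty; class name and constant value are assumptions. */
final class Penalties {
    /** Placeholder only: the real NonGCTerminal is read from the thermodynamic parameters. */
    static final int NON_GC_TERMINAL = 50;

    /** Returns 0 for a (C, G) or (G, C) pair and NonGCTerminal for any other pair. */
    static int nonGCPenalty(char a, char b) {
        boolean isGC = (a == 'C' && b == 'G') || (a == 'G' && b == 'C');
        return isGC ? 0 : NON_GC_TERMINAL;
    }
}
```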

A slightly corrected version of the recursions taken from [6] is given below.

$W(j)$ keeps the MFE for segment $[1, \ldots, j]$. We search for an $i$ such that $(x_i, x_j)$ form a base pair and the sum of the MFE of the loop enclosed by $(i, j)$ and $W(i-1)$ is minimal. We also have to add a penalty if base pair $(x_i, x_j)$ is not $(C, G)$ or $(G, C)$.

$$W(j) = \min_{1 \le i < \ldots} \left\{ \begin{array}{l} W(j-1), \\ V(i, j) + NonGCPenalty(x_i, x_j) + W(i-1), \\ V(i+1, j) + NonGCPenalty(x_{i+1}, x_j) + \Delta G.Dangle5'(x_j, x_{i+1}, x_i) + W(i-1), \\ \ldots \end{array} \right.$$

... of an internal loop closed by base pairs $(i, j)$ and $(i', j')$, where $i' > i$ and $j' < j$, and a loop closed by base pair $(i', j')$.

$VBI(i, j) = \min(i \ldots$
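The recursion for $W$ can be sketched in Java as follows. This is a deliberately reduced illustration, not the report's implementation: it covers only the first two cases of the recursion (base $x_j$ unpaired, and $x_j$ paired with some $x_i$), omits the dangling-end cases, assumes $W(0) = 0$, assumes the range of $i$ is limited by $i + R < j$, and takes a precomputed $V$ table as input. The class name `WRecursion` and the value of `NON_GC_TERMINAL` are hypothetical.

```java
/** Reduced sketch of the W recursion (dangle cases omitted); names are hypothetical. */
final class WRecursion {
    static final int INF = Integer.MAX_VALUE / 4; // marks "(i, j) cannot pair" in v
    static final int R = 3;                        // minimal pairing distance from Section 1.3
    static final int NON_GC_TERMINAL = 50;         // placeholder, not a real parameter value

    static int nonGCPenalty(char a, char b) {
        boolean isGC = (a == 'C' && b == 'G') || (a == 'G' && b == 'C');
        return isGC ? 0 : NON_GC_TERMINAL;
    }

    /**
     * Fills w[0..n] from a precomputed table v (1-based; v[i][j] = INF if (i, j) cannot pair).
     * Assumes w[0] = 0 and restricts i to i + R < j.
     */
    static int[] fillW(String seq, int[][] v) {
        int n = seq.length();
        int[] w = new int[n + 1]; // w[0] = 0: the empty prefix has zero energy
        for (int j = 1; j <= n; j++) {
            int best = w[j - 1]; // case 1: x_j is left unpaired
            for (int i = 1; i + R < j; i++) { // case 2: x_j pairs with x_i
                if (v[i][j] >= INF) {
                    continue; // (i, j) cannot close a loop
                }
                int energy = v[i][j]
                        + nonGCPenalty(seq.charAt(i - 1), seq.charAt(j - 1))
                        + w[i - 1];
                best = Math.min(best, energy);
            }
            w[j] = best;
        }
        return w;
    }
}
```

Because each $W(j)$ scans all candidate $i$, this step costs $O(n^2)$ time once $V$ is available; the entry `w[n]` then holds the MFE of the whole molecule under the reduced model.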
