Structural Alignment of Two RNA Sequences with Lagrangian Relaxation
Markus Bauer
Gunnar W. Klau
Institute of Computer Graphics and Algorithms, Vienna University of Technology, Austria markus|
[email protected]
Abstract. RNA is generally a single-stranded molecule whose bases form hydrogen bonds within the same molecule, leading to structure formation. When comparing different homologous RNA molecules it is usually not sufficient to consider only the primary sequence; it is important to consider both the sequence and the structure of the molecules. Traditional alignment algorithms can only account for the sequence of bases, but not for the base pairings. Considering the structure leads to significant computational problems because of the dependencies introduced by the base pairings and the presence of pseudoknots. In this paper we address the problem of optimally aligning two given RNA sequences either with or without known structure (allowing for pseudoknots). We phrase the problem as an integer linear program and then solve it using Lagrangian relaxation. In our computational experiments we could align large problem instances, namely 18S and 23S ribosomal RNA with up to 1500 bases, within minutes while preserving pseudoknots.
1 Introduction
Unlike DNA, an RNA molecule is generally single-stranded and folds in space due to the formation of hydrogen bonds between its bases. While similarity between two nucleic acid chains is usually determined by sequence alignment algorithms, these can only account for the primary structure and thus ignore structural aspects. The problem of producing RNA alignments that are structurally correct has, however, emerged as one of the central obstacles for the computational study of functional RNAs. To date, the available tools for computing structural alignments are either based on heuristic approaches and thus produce suboptimal alignments, or cannot attack instances of reasonable input size. In this paper we deal with the comparison of two RNA sequences together with their structure. Indeed, we do not necessarily require actual knowledge of either structure, but will infer a common structure based on the computation of preserved hydrogen bonds. In the presence
of pseudoknots, the problem becomes NP-hard even when only two sequences have to be aligned (Evans gives an NP-hardness proof for a special case of this problem in [7]). The computational problem of considering sequence and structure of an RNA molecule simultaneously was first addressed by Sankoff [15], who proposed a dynamic programming algorithm that aligns a set of RNA sequences while at the same time predicting their common fold. Algorithms similar in spirit were proposed later on for the problem of comparing one RNA sequence to one or more sequences of known structure. Corpet and Michot [5] simultaneously align a sequence with a number of other, already aligned, sequences using both primary and secondary structure. Their dynamic programming algorithm requires $O(n^5)$ running time and $O(n^4)$ space, where $n$ is the length of the sequences, and thus can handle only short sequences. They propose an anchor-point heuristic that uses fixed alignment regions to divide large alignment problems into small subproblems to which the dynamic programming algorithm can then be applied. Bafna et al. [2] improved the dynamic programming algorithm to a running time of $O(n^4)$, which still does not make it applicable to real-life problems. Waterman [16] searches for common motifs among several sequences. Eddy and Durbin [6] describe probabilistic models for measuring the secondary structure and primary sequence consensus of RNA sequence families. They present algorithms for analyzing and comparing RNA sequences as well as database search techniques. Since the basic operation in their approach is an expensive dynamic programming algorithm, their algorithms cannot analyze sequences longer than 150-200 nucleotides. Gorodkin et al. [8] and Mathews and Turner [12] published simplified versions of Sankoff's original algorithm. Hofacker et al.
[10] give a different approach to structural alignments: instead of folding and aligning sequences simultaneously, they present a dynamic programming approach to align the corresponding base pair probability matrices, computed by McCaskill's partition function algorithm [13], and thereby take the structural information into account. Reinert et al. [11] presented a Branch-and-Cut algorithm for aligning an RNA molecule of known sequence and known structure to an RNA molecule of known sequence but unknown structure. The algorithm computes an optimal alignment that maximizes sequence and structure consensus simultaneously. The method can handle pseudoknots and can already solve problems with up to 1400 bases. However, for problems of that size their implementation starts to require prohibitive time.
We start with an integer linear programming (ILP) formulation similar to the one given by Reinert et al., but instead of using a Branch-and-Cut approach, we resort to Lagrangian relaxation, which was already successfully applied by Lancia et al. to the contact map problem [4], a problem similar to the RNA alignment problem. The structural alignment of RNA molecules is central to the various RNA similarity and structure prediction problems defined in the papers cited above. If the two molecules are functionally related and have a similar structure, the RNA structural alignment allows one, e.g., to draw conclusions about the structure of the unknown molecule. The techniques we will put forward can be modified such that incremental or simultaneous computations of multiple structural sequence alignments are possible. Since the traditional approaches cannot solve medium-sized or large instances of the structural RNA alignment problem without using ad hoc heuristics, biologists still have to carry out large structural alignments by hand. Furthermore, most algorithms are not able to integrate tertiary structure interactions like pseudoknots, or need prohibitive resources. We tested a first version of our proposed algorithm with 23S rRNA sequences from [17] and 18S rRNA sequences from Drosophila melanogaster (1995 bases) and human (1870 bases). We could, for example, optimally align the 23S rRNA sequences of Pyrodictium occultum (1497 bases) and Sulfolobus shibatae (1495 bases). The result was a structural alignment that is a very good approximation of the "hand-made" optimal structural alignment and is, in particular, much better than the purely sequence-based optimal alignment. In the 18S rRNA dataset it is well known that the first 1200 bases contain pseudoknots. We can show that our approach correctly preserves these pseudoknots. To our knowledge, no other algorithm is capable of doing this.
The paper is structured as follows: Section 2 gives basic definitions and a mathematical formulation of the problem. In Sect. 3 we study how to relax the ILP in order to make it efficiently solvable, whereas Sect. 4 shows how we solve the original problem by means of Lagrangian relaxation. The results of our computational experiments are given in Sect. 6. Finally, we discuss our results in Sect. 7.
2 Basic Definitions and Mathematical Formulation
Before presenting a graph-theoretic model of aligning two RNA sequences and a corresponding integer linear programming (ILP) formulation we start with some basic definitions:
Definition 1. Let $S$ be a sequence $s_1, \ldots, s_n$ of length $n$ over the alphabet $\Sigma = \{A, G, U, C, -\}$. A paired base $(i, j)$ is called an interaction if $s_i \neq -$ and $s_j \neq -$ and if $(i, j)$ forms a Watson-Crick pair. The set $P$ of interactions is called the annotation of sequence $S$. Two interactions are said to be in conflict if they share one base. A pair $(S, P)$ is called an annotated sequence.

Note that a structure where no pair of interactions is in conflict with each other forms a valid secondary structure of an RNA sequence. We are given two annotated sequences $(S_1, P_1)$ and $(S_2, P_2)$. In graph-theoretic terms the input can be modeled as a graph $G = (V, A \cup I)$, where $V$ denotes the set of vertices of the graph, in this case the letters of the two sequences, $A$ the set of edges between vertices of the two input sequences (the alignment edges), and $I$ the set of interaction edges between vertices of the same sequence. The left side of Fig. 1 shows such an input graph. Dashed lines are interaction edges, solid lines are alignment edges.
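As a small illustration of Definition 1, the following Python sketch checks whether a set of index pairs consists of interactions and whether it is free of conflicts. The function names, the 0-based positions, and the restriction to the pairs A-U and G-C are assumptions made for this example only.

```python
# Illustrative sketch of Definition 1; names and 0-based indices are assumptions.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def is_interaction(seq, i, j):
    """Return True if positions i and j hold non-gap bases forming a Watson-Crick pair."""
    return seq[i] != "-" and seq[j] != "-" and (seq[i], seq[j]) in WATSON_CRICK


def in_conflict(p, q):
    """Two distinct interactions are in conflict if they share a base."""
    return p != q and bool(set(p) & set(q))


def is_valid_secondary_structure(seq, annotation):
    """Check that every pair is an interaction and that no two pairs are in conflict."""
    pairs = list(annotation)
    if not all(is_interaction(seq, i, j) for i, j in pairs):
        return False
    return not any(
        in_conflict(pairs[a], pairs[b])
        for a in range(len(pairs))
        for b in range(a + 1, len(pairs))
    )
```

For the hypothetical sequence `"GGAUCC"`, the annotation `[(0, 5), (1, 4), (2, 3)]` passes the check, whereas `[(0, 5), (0, 4)]` fails because both interactions use base 0.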
Fig. 1. Input graph for structural alignments and realized interaction matches

Two alignment edges $(a_1, b_1)$ and $(a_2, b_2)$ are said to be in conflict if neither $a_1 < a_2$ and $b_1 < b_2$ nor $a_1 > a_2$ and $b_1 > b_2$ holds. Visually stated, alignment edges that are in conflict cross or touch each other. A subset $A' \subseteq A$ is called an alignment if no two of its alignment edges are in conflict. Graph-theoretically, an alignment is a non-crossing matching. Two interaction edges $i = (i_1, i_2) \in P_1$ and $j = (j_1, j_2) \in P_2$ are said to be realized by an alignment $A'$ if and only if the alignment edges $(i_1, j_1)$ and $(i_2, j_2)$ are realized by $A'$. The pair $(i, j)$ is called an interaction match. Note that $(i, j)$ is an ordered tuple, that is, $(i, j)$ is distinct from $(j, i)$. The right side of Fig. 1 shows four interaction matches that are realized by the alignment (indeed it shows a preserved pseudoknot). Each alignment edge and interaction match is assigned a positive weight representing the benefit of realizing this edge or match. In
the case of interaction edges we could choose the score for realizing the interaction match $(i, j)$, e.g., as the number of hydrogen bonds between the bases or the base pair probability computed by means of McCaskill's algorithm [13]. Traditional sequence alignments aim at maximizing the score of edges realized by an alignment. A structural alignment, however, takes the structural information (the information contained within the interaction edges) into account as well. Informally stated, the problem of structurally aligning two annotated sequences $(S_1, P_1)$ and $(S_2, P_2)$ calls for an optimal solution of $\max \sum_{a \in A} w_a + \sum_{i \in P_1, j \in P_2} w_{ij}$, that is, the score achieved by the weight of the realized alignment edges and interaction matches. Three properties, however, have to be satisfied in order to form a valid RNA secondary structure:

1. Every vertex is incident to at most one interaction edge.
2. The endnodes of each interaction match have to be realized by alignment edges.
3. No alignment edges are in conflict.

Then, an ILP formulation follows directly:

\begin{align}
\max\quad & \sum_{m \in A} \sum_{l \in A} w_{lm} y_{lm} + \sum_{m \in A} w_m x_m && \tag{1}\\
\text{s.t.}\quad & \sum_{l \in I} x_l \le 1 && \forall I \in \mathcal{I} \tag{2}\\
& y_{lm} = y_{ml} && \forall l, m \in A,\ l < m \tag{3}\\
& \sum_{l \in A} y_{lm} \le x_m && \forall m \in A \tag{4}\\
& x, y \ge 0 \ \text{integer} && \tag{5}
\end{align}
The variable $x_m$ equals one if alignment edge $m$ is part of the alignment, whereas $y_{lm} = 1$ holds if the alignment edges $l$ and $m$ realize the interaction match $(l, m)$. The set $\mathcal{I}$ contains all subsets of alignment edges such that all pairs of elements of a specific subset are crossing each other. One can easily verify that all properties for a structural alignment are satisfied: constraints (3) and (4) guarantee that interaction matches are realized by alignment edges and that every node is incident to at most one interaction edge, whereas (2) guarantees the alignment edges to be non-crossing. The order $l < m$ within the equality constraints (3) denotes an arbitrary order defined on the elements of $A$ (otherwise identical constraints would show up twice, that is, both $y_{ml} = y_{lm}$ and $y_{lm} = y_{ml}$ would be part of the ILP). Due to the NP-hardness of the problem, we cannot hope to solve the ILP above directly. Therefore, we drop some constraints and show how the relaxed ILP can be solved efficiently.
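To make the role of the individual constraints concrete, the following Python sketch tests a candidate 0/1 solution against (2)-(4) and evaluates the objective (1). It is only an illustration: the data layout (alignment edges as position pairs, $x$ and $y$ as sets) and all function names are assumptions, and "crossing" in the definition of $\mathcal{I}$ is taken to include touching, following the conflict definition for alignment edges.

```python
# Illustrative checker for the ILP (1)-(5); data layout and names are assumptions.

def edges_conflict(e, f):
    """Alignment edges (a1, b1) and (a2, b2) conflict if they cross or touch."""
    (a1, b1), (a2, b2) = e, f
    return not ((a1 < a2 and b1 < b2) or (a1 > a2 and b1 > b2))


def is_feasible(x, y):
    """x: set of chosen alignment edges; y: set of ordered pairs (l, m) with y_lm = 1."""
    chosen = sorted(x)
    # (2): for 0/1 values, the clique inequalities hold iff no two chosen edges conflict.
    for i, e in enumerate(chosen):
        if any(edges_conflict(e, f) for f in chosen[i + 1:]):
            return False
    # (3): y_lm = y_ml.
    if any((m, l) not in y for (l, m) in y):
        return False
    # (4): sum_l y_lm <= x_m for every edge m.
    incoming = {}
    for (_, m) in y:
        incoming[m] = incoming.get(m, 0) + 1
    return all(m in x and count <= 1 for m, count in incoming.items())


def objective(x, y, w_edge, w_match):
    """Value of (1): sum of w_lm * y_lm plus sum of w_m * x_m."""
    return sum(w_match[lm] for lm in y) + sum(w_edge[m] for m in x)
```

Dropping the symmetry test for (3) in `is_feasible` yields exactly the feasibility check for the relaxed problem considered next.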
3 Relaxation and Efficient Solution
We call the ILP (1)-(5) without the constraints (3) the relaxed problem and show how to solve it efficiently. Later we show how to incorporate the dropped constraints again in order to solve our original problem. Our algorithm proceeds in two stages. First, we compute for each $m \in A$ the maximal profit that the realization of $m$ can possibly yield. Then, we use the maximal profit of each edge to compute a conventional alignment.

Lemma 1. The relaxed problem can be solved in time $O(|A|^2)$.

Proof. Suppose the variable $x_m = 0$; then, due to (4), all $y_{lm} = 0$ as well. For $x_m = 1$, however, the optimal choice for all $m \in A$ is given by

\begin{align*}
\max\quad & \sum_{l \in A} w_{lm} y_{lm} + w_m \\
\text{s.t.}\quad & \sum_{l \in I} x_l \le 1 && \forall I \in \mathcal{I} \\
& \sum_{l \in A} y_{lm} \le 1 \\
& x, y \ge 0 \ \text{integer}
\end{align*}
To put it differently: for each $m \in A$ we compute the maximal profit that the alignment edge can possibly realize. The maximum profit consists of its own weight $w_m$ plus the best interaction match that can be realized if $m$ is part of the solution. Let $p_m$ be the maximum profit of alignment edge $m$ and let $\hat{y}_{lm}$ be the realized interaction match. In the second step, we compute the optimal overall profit by solving

\begin{align*}
\max\quad & \sum_{m \in A} p_m x_m \\
\text{s.t.}\quad & \sum_{l \in I} x_l \le 1 && \forall I \in \mathcal{I} \\
& x \ge 0 \ \text{integer}
\end{align*}
Let $\bar{x}$ be the solution to the alignment problem above. We claim that the solution of the relaxed problem is given by $\bar{y}_{lm} = \hat{y}_{lm} \bar{x}_m$ for all $l, m \in A$.
Assume that $(\bar{y}_{lm}, \bar{x})$ is not an optimal solution to the relaxed problem, that is, it does not realize the maximal profit. Then, according to the objective function, there exist $y_{vw} = 0$ and/or $x_w = 0$ whose realization would yield a higher score. There are two possibilities:

(a) $y_{vw} = 0$ and $x_w = 1$ (with $y_{xw} = 1$ being the interaction match chosen to be realized by alignment edge $w$): This implies that $w_{vw} > w_{xw}$. In this case, however, $y_{vw} = 1$ would hold and not $y_{xw} = 1$, since the interaction match with the highest score is chosen in the first step of the algorithm. Therefore, there cannot be a $y_{vw} = 0$ whose realization would yield a higher score.

(b) $x_w = 0$: This implies that $p_c > p_w$ holds for an element $c$ of a set $I$ containing $w$ (remember that pairs of elements of the same set $I$ cross each other), because otherwise $w$ would have been realized. Again, there cannot be an $x_w = 0$ that would yield a higher score if set to $x_w = 1$.

For analyzing the running time of the entire algorithm, two things have to be taken into account: first, choosing the interaction match that realizes the maximal profit; second, computing an alignment, given the single profits $p$. If the set of all interaction matches that could possibly be realized by alignment edge $m$ is computed in a preprocessing phase, selecting the best possible interaction match can be accomplished in constant time by means of priority queues (regard the weight of the interaction matches as the priority; the element with the highest priority can then be looked up in constant time). Computing the alignment dominates the overall running time and can be done in $O(|A|^2)$ with the Needleman-Wunsch algorithm. □
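The two stages used in the proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: alignment edges are pairs of 0-based positions, `w_edge` and `w_match` are hypothetical dictionaries holding $w_m$ and $w_{lm}$, the best match is found by a plain scan instead of priority queues, and the second stage is a Needleman-Wunsch-style recursion without gap costs.

```python
# Illustrative sketch of the two-stage procedure from Lemma 1.
# Data layout, names, and 0-based positions are assumptions.

def solve_relaxed(n1, n2, w_edge, w_match):
    """Solve the relaxed problem for sequences of lengths n1 and n2.

    w_edge:  dict mapping alignment edge (i, j) -> weight w_m
    w_match: dict mapping ordered pair (l, m) of alignment edges -> weight w_lm
    Returns (value, x_bar, y_bar).
    """
    # Stage 1: profit p_m = w_m plus the best single interaction match for m, if any helps.
    best = {}  # m -> (weight, l) of the best match (l, m)
    for (l, m), w in w_match.items():
        if m in w_edge and w > best.get(m, (0.0, None))[0]:
            best[m] = (w, l)
    profit = {m: w + best.get(m, (0.0, None))[0] for m, w in w_edge.items()}

    # Stage 2: maximum-profit non-crossing matching by dynamic programming.
    D = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            D[i][j] = max(D[i - 1][j], D[i][j - 1])
            if (i - 1, j - 1) in profit:
                D[i][j] = max(D[i][j], D[i - 1][j - 1] + profit[(i - 1, j - 1)])

    # Traceback to recover the chosen alignment edges x_bar.
    x_bar, i, j = set(), n1, n2
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j]:
            i -= 1
        elif D[i][j] == D[i][j - 1]:
            j -= 1
        else:
            x_bar.add((i - 1, j - 1))
            i, j = i - 1, j - 1

    # y_bar_lm = y_hat_lm * x_bar_m: keep the best match only for chosen edges.
    y_bar = {(best[m][1], m) for m in x_bar if m in best}
    return D[n1][n2], x_bar, y_bar
```

Note that the returned `y_bar` need not be symmetric: if only `w_match[(l, m)]` is positive, the pair `(l, m)` is selected without `(m, l)`, which is precisely the kind of violation of (3) that the relaxation permits.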
4 Lagrangian Relaxation
We solve the original problem by moving the dropped constraints (3) into the objective function and penalizing their violation. We assign a Lagrangian multiplier $\lambda_{lm}$ to each of these constraints. The task is then to find Lagrangian multipliers that provide the best bound to the original problem. The Lagrangian problem is given by:

\begin{align*}
\max\quad & \sum_{m \in A} \sum_{l \in A} w_{lm} y_{lm} + \sum_{m \in A} w_m x_m + \sum_{l \in A} \sum_{m \in A,\, l < m} \lambda_{lm} (y_{lm} - y_{ml}) \\
\text{s.t.}\quad & \sum_{l \in I} x_l \le 1
\end{align*}