A Polyhedral Approach to RNA Sequence Structure Alignment

Hans-Peter Lenhof*, Knut Reinert*†, Martin Vingron‡

Abstract

Ribonucleic acid (RNA) is a polymer composed of four bases denoted A, C, G, and U. It is generally a single-stranded molecule where the bases form hydrogen bonds within the same molecule, leading to structure formation. In comparing different homologous RNA molecules it is important to consider both the base sequence and the structure of the molecules. Traditional alignment algorithms can only account for the sequence of bases, but not for the base pairings. Considering the structure leads to significant computational problems because of the dependencies introduced by the base pairings. In this paper we address the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure. We phrase the problem as an integer linear program and then solve it using methods from polyhedral combinatorics. In our computational experiments we could solve large problem instances (23S ribosomal RNA with more than 1400 bases), a size intractable for former algorithms.

* MPI für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany
† corresponding author
‡ Deutsches Krebsforschungszentrum, Abt. Theoretische Bioinformatik, INF 280, D-69120 Heidelberg, Germany

1 Introduction

An RNA molecule, unlike DNA, is a generally single-stranded nucleic acid molecule that folds in space due to the formation of hydrogen bonds between its bases. While similarity between two nucleic acid chains is usually determined by sequence alignment algorithms, these can only account for the primary structure and thus ignore structural aspects. In this paper we deal with the comparison of two RNA sequences when for one of them base pairings are known. The aim then is to align the sequence of unknown structure in such a way that the known structure carries over to the unknown sequence and as much sequence similarity is maintained as possible.

With the prediction of tRNA structure from a set of similar sequences, Levitt (Levitt 1969) had strikingly demonstrated that sets of similar sequences can yield convincing evidence for how an RNA molecule folds. The computational problem of considering sequence and structure of an RNA molecule simultaneously was first addressed by Sankoff (Sankoff 1985), who proposed a dynamic programming algorithm that aligns a set of RNA sequences while at the same time predicting their common fold. Algorithms similar in spirit were proposed later on for the problem of comparing one RNA sequence to one or more of known structure. Corpet and Michot (Corpet and Michot 1994) simultaneously align a sequence with a number of other sequences using both primary and secondary structure. Their dynamic programming algorithm requires $O(n^5)$ running time and $O(n^4)$ space ($n$ is the length of the sequences) and thus can handle only short sequences. Corpet and Michot propose an anchor-point heuristic that uses fixed alignment regions to divide large alignment problems into small subproblems to which the dynamic programming algorithm can then be applied. Bafna et al. (Bafna, Muthukrishnan, and Ravi 1995) improved the dynamic programming algorithm to a running time of $O(n^4)$, which still does not make it applicable to real-life problems. Waterman (Waterman 1989) searches for common motifs among several sequences. Eddy and Durbin (Eddy and Durbin 1994) describe probabilistic models for measuring the secondary structure and primary sequence consensus of RNA sequence families. They present algorithms for analyzing and comparing RNA sequences as well as database search techniques. Since the basic operation in their approach is an expensive dynamic programming algorithm, their algorithms cannot analyze sequences longer than 150-200 nucleotides. Recently, Gorodkin et al. (Gorodkin, Heyer, and Stormo 1997) published a simplified version of Sankoff's original algorithm.

In this paper we present a branch-and-cut algorithm for aligning an RNA molecule $S_1$ of known sequence and known structure to an RNA molecule $S_2$ of known sequence but unknown structure. The algorithm computes an (optimal) alignment that maximizes sequence and structure consensus simultaneously. To be more precise, the score that is optimized is a weighted sum of a sequence alignment score and a base pairing score which measures the quantity and quality of the base pairs of $S_1$ and $S_2$ that are "matched" by the sequence alignment. We call this problem the RNA Sequence Structure Alignment (RSA) problem. Note that we do not restrict base pairs to the ones that can be realized in a secondary structure but allow for the inclusion of pseudo knots. The RSA problem is the central problem of the different RNA similarity or structure prediction problems defined in the papers cited above. If the two molecules are functionally related and have a similar structure, the RSA alignment allows one, for example, to draw conclusions about the structure of molecule $S_2$.
The RSA techniques we will put forward can easily be modified in such a way that incremental or simultaneous computations of multiple structural sequence alignments are possible. Since the older approaches cannot solve medium-sized or large instances of the RSA problem without using ad hoc heuristics, biologists still have to carry out structural alignments of sequences of such size by hand. Furthermore, the known algorithms are not able to integrate tertiary structure interactions like pseudo knots. Bafna et al. (Bafna, Muthukrishnan, and Ravi 1995) claim that "their algorithm can be extended to structures that allow crossing edges", but they do not mention how this can be done and whether the complexity of the algorithm changes.

We tested whether methods from polyhedral combinatorics can be used to solve large instances of the RSA problem. Such methods have recently been successfully applied to problems like physical mapping (Christof, Jünger, Kececioglu, Mutzel, and Reinelt 1997) or multiple sequence alignment (Reinert, Lenhof, Mutzel, Mehlhorn, and Kececioglu 1997). We rephrased the RSA problem as an integer linear program and studied the facets of the convex hull of all integral solutions. Based on these studies we implemented a first version of a branch-and-cut algorithm coded in C++. As an example we tested 23S ribosomal RNA sequences. Each of these sequences is more than 1400 bases long. The "hand-made" structural alignments for these sequences were taken from the Antwerpen rRNA database (de Rijk, de Peer, Chapelle, and de Wachter 1994). Our algorithm was able to solve these large instances of the problem to optimality in reasonable time. The produced structural alignment is a very good approximation of the "hand-made" optimal structural alignment and is, in particular, much better than the purely sequence-based optimal alignment.
Although not documented in these examples, the branch-and-cut algorithm can also handle pseudo knots without any change in the algorithm.

A graph-theoretic characterization and the integer linear programming formulation of the RSA problem are given in Section 2. In Section 3 we study the structure of the "RSA polytope" and present some classes of facet-defining inequalities. We sketch the branch-and-cut algorithm in Section 4. The results of our computational experiments are given in Section 5. Finally we discuss our results in Section 6.

2 A graph-theoretic characterization of the RSA problem

Let two RNA sequences $S_1$ and $S_2$ be given. Assume further that for $S_1$ we are given a list of base pairs, e.g. as output by a secondary structure prediction program. The only condition on those base pairs is that a base can be involved in at most one base pair. We thus allow tertiary interactions or pseudo knots. For long sequences we are not going to consider every possible alignment and instead assume that we are given a list of feasible residue pairings that could be realized within an alignment. The results section will give further details on how to generate this list.

The input for the RSA problem can be modeled as a graph $G = (V, E \cup B)$. The letters of the sequences $S_1, S_2$ are viewed as vertices $V$ in $G$. The order of the letters in the sequence $S_1$ ($S_2$) induces an order on the subset of vertices of $G$ that represent the sequence $S_1$ ($S_2$). Every edge $e$ of $E$ connects a vertex of $S_1$ with a vertex of $S_2$. By start($e$) we denote the index of the letter of $S_1$ where the edge $e$ starts and by end($e$) we denote the index of the letter of $S_2$ where the edge $e$ ends. Every edge $e \in E$ has a nonnegative weight $w_e$ representing the gain of aligning the endpoints of the edge. We call the edges in $E$ alignment edges.

Two alignment edges $e$ and $f$ ($e \neq f$) are in conflict if (start($e$) $\le$ start($f$) and end($e$) $\ge$ end($f$)) or (start($f$) $\le$ start($e$) and end($f$) $\ge$ end($e$)). Note that an edge is not in conflict with itself. We call a subset $E'$ of $E$ a trace of $G$ if no two alignment edges in $E'$ are in conflict with each other. We say that a trace $E'$ realizes an edge $e$ if $e \in E'$. The standard representation of a trace places the letters connected by a realized edge into the same column of the "alignment array". Edges that are in conflict with each other cannot be "drawn" simultaneously without changing the order of bases in a sequence (see Figure 1).
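The conflict test and the trace property can be sketched in a few lines of Python. This is only an illustration and not the authors' C++ code: the names `Edge`, `in_conflict` and `is_trace` are hypothetical, and the sketch assumes that the conflict condition compares start indices with "at most" and end indices with "at least", so that two distinct edges are in conflict exactly when they cross or share an endpoint.

```python
from itertools import combinations
from typing import Iterable, Tuple

# An alignment edge as (start, end): index in S1, index in S2 (assumed representation).
Edge = Tuple[int, int]


def in_conflict(e: Edge, f: Edge) -> bool:
    """Return True if two distinct alignment edges cross or share an endpoint."""
    if e == f:
        return False  # an edge is not in conflict with itself
    (se, ee), (sf, ef) = e, f
    return (se <= sf and ee >= ef) or (sf <= se and ef >= ee)


def is_trace(edges: Iterable[Edge]) -> bool:
    """A set of alignment edges is a trace if no two of them are in conflict."""
    return not any(in_conflict(e, f) for e, f in combinations(list(edges), 2))
```

For instance, the edges `(1, 2)` and `(2, 1)` cross and are therefore in conflict, while `(1, 1)` and `(2, 2)` can be drawn together.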
The edges in $B$ represent possible base pairings between residues of the second sequence as derived from the given base pairs for sequence $S_1$. We call the edges in $B$ base pair edges (see also Figure 2). There is a base pair edge $b_{ij} \in B$ if the two alignment edges $e_i$ and $e_j$ ($i < j$) have the following properties:

• The two alignment edges $e_i$ and $e_j$ are not in conflict.
• There is a base pair given between the bases start($e_i$) and start($e_j$) in molecule $S_1$.
• The bases end($e_i$) and end($e_j$) of $S_2$ are complementary (and thus able to form a base pair).

Note that $b_{ij}$ is unique and that the definition of base pair edges can also incorporate other constraints, for example a minimum distance between start($e_i$) and start($e_j$) (resp. end($e_i$) and end($e_j$)). We call $e_i$ and $e_j$ the generating alignment edges of the base pair edge $b_{ij}$. Each base pair edge $b_{ij}$ has a positive weight $w_{ij}$ that could, for example, measure the "energy" or the number of hydrogen bonds of the base pair. We say that a trace realizes a base pair edge $b_{ij}$ if the trace realizes its two generating alignment edges $e_i$ and $e_j$. Two base pair edges are in conflict if there is a conflict between their generating alignment edges. A base pair edge is in conflict with an alignment edge if the alignment edge is in conflict with one of the generating alignment edges of the base pair.

A subset $A = E' \cup B'$ of $E \cup B$ with $E' \subseteq E$ and $B' \subseteq B$ is called an RSA alignment of $G$ if

• $E'$ is a trace of $G$ and
• the base pair edges in $B'$ are realized by the alignment edges in $E'$.

Figure 1 goes here

Note that $B'$ can be any subset of all the base pair edges realized by $E'$, e.g. $B' = \emptyset$. The weight $w(A)$ of the RSA alignment $A$ is the sum of the weights of the alignment edges in $E'$ and the weights of the base pair edges in $B'$. The RSA problem is then defined as follows: Given an RSA graph $G = (V, E \cup B)$, compute the RSA alignment $A$ of $G$ with maximal weight.

It is now easy to formulate the RSA problem as an integer linear program (ILP):

$$
\begin{aligned}
\text{maximize} \quad & \sum_{e_i \in E} w_i x_i + \sum_{b_{ij} \in B} w_{ij} x_{ij} \\
\text{subject to} \quad & x_k + x_l \le 1 \quad \forall\, e_k, e_l \in E \text{ s.t. } e_k \text{ is in conflict with } e_l & (1) \\
& x_{ij} \le x_i, \quad x_{ij} \le x_j \quad \forall \text{ base pair variables } x_{ij} & (2) \\
& x_i, x_{ij} \in \{0, 1\} & (3)
\end{aligned}
$$

The variable $x_i$ represents the alignment edge $e_i$ and the variable $x_{ij}$ the base pair edge $b_{ij}$. The first class of constraints ensures that there are no conflicts between alignment edges. The second class of constraints guarantees that a base pair edge is only realized if its generating alignment edges are realized.

An alternative ILP formulation uses the alignment edges merely to define the base pair variables. The realization of a base pair edge then implicitly realizes its left and right defining edge. Although this formulation has fewer variables in the ILP, we will use the first formulation because it is better suited for describing combined sequence and structure similarity and because the alignment edges allow a more efficient branch-and-cut algorithm, which will be explained further in Section 4.

Every RSA alignment $A = E' \cup B'$ can be represented by a $|E \cup B|$-dimensional $\{0, 1\}$-vector $\chi^A$, the so-called incidence vector. The incidence vector describes which edges $f$ are realized ($\chi^A_f = 1$) by the alignment $A$. We define the RSA polytope of $G$ as the convex hull of all incidence vectors of RSA alignments, i.e.,

$$P(G) := \mathrm{conv}\{\chi^A \in \{0, 1\}^{|E \cup B|} \mid A \subseteq E \cup B \text{ is an RSA alignment of } G\}.$$
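To make the objective concrete, the following Python sketch builds the base pair edges for a toy instance and finds a maximum-weight RSA alignment by brute-force enumeration of all traces. It is an exhaustive search for illustration only, not the branch-and-cut method of this paper, and it is only usable for a handful of edges. All names (`base_pair_edges`, `best_rsa_alignment`), the 0-based indexing, the toy sequences and the weights are assumptions made for the example. Because base pair weights are positive, the sketch always adds every base pair edge realized by the chosen trace.

```python
from itertools import combinations


def in_conflict(e, f):
    """Distinct alignment edges (start, end) conflict if they cross or share an endpoint."""
    if e == f:
        return False
    return (e[0] <= f[0] and e[1] >= f[1]) or (f[0] <= e[0] and f[1] >= e[1])


def base_pair_edges(E, pairs_s1, s2, complementary):
    """Index pairs (i, j), i < j, whose alignment edges generate a base pair edge."""
    B = []
    for i, j in combinations(range(len(E)), 2):
        ei, ej = E[i], E[j]
        if in_conflict(ei, ej):
            continue  # generating edges must not be in conflict
        if frozenset((ei[0], ej[0])) not in pairs_s1:
            continue  # the start bases must be paired in S1
        if (s2[ei[1]], s2[ej[1]]) in complementary:
            B.append((i, j))  # the end bases can pair in S2
    return B


def best_rsa_alignment(E, w, w_bp):
    """Enumerate all traces; return (weight, chosen edge indices, realized base pair edges)."""
    best = (0, (), ())
    for r in range(1, len(E) + 1):
        for chosen in combinations(range(len(E)), r):
            if any(in_conflict(E[i], E[j]) for i, j in combinations(chosen, 2)):
                continue  # not a trace
            sel = set(chosen)
            realized = tuple(b for b in w_bp if b[0] in sel and b[1] in sel)
            score = sum(w[i] for i in chosen) + sum(w_bp[b] for b in realized)
            if score > best[0]:
                best = (score, chosen, realized)
    return best


# Hypothetical toy instance: bases 0 and 2 of S1 are paired, S2 = "GAC".
E = [(0, 0), (1, 1), (2, 2), (2, 1)]
w = [1, 1, 1, 3]
watson_crick = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
B = base_pair_edges(E, {frozenset((0, 2))}, "GAC", watson_crick)
w_bp = {b: 4 for b in B}

print(best_rsa_alignment(E, w, {}))    # sequence only: (4, (0, 3), ())
print(best_rsa_alignment(E, w, w_bp))  # with structure: (7, (0, 1, 2), ((0, 2),))
```

In this toy instance the purely sequence-based optimum uses the heavy edge `(2, 1)`, whereas the combined score prefers the trace that realizes the base pair edge, which mirrors the motivation given in the introduction.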

Figure 2 goes here

3 The structure of the RSA polytope

In this section we investigate the structure of the RSA polytope. It is important to identify classes of facet-defining inequalities, because the facets of the RSA polytope are the best separating planes in the cutting phase of a branch-and-cut algorithm (see Section 4). Unfortunately the pair $I_A(G) = (E \cup B, \{A \mid A \subseteq E \cup B \text{ is an RSA alignment of } G\})$ does not form an independence system on $E \cup B$, because the base pair edges are dependent on the alignment edges. This deprives us of an elegant way of proving results about the RSA polytope (see Nemhauser and Wolsey 1988). The following well known theorem by Nemhauser and Trotter will be used in the forthcoming proofs.

Theorem 1 (Nemhauser and Trotter 1973) Let $P \subseteq \mathbb{Q}^d$ be a full-dimensional polyhedron. If $F$ is a (nonempty) face of $P$ then the following assertions are equivalent.

1. $F$ is a facet of $P$.
2. $\dim(F) = \dim(P) - 1$, where $\dim(P)$ is the maximum number of affinely independent points in $P$ minus one.
3. There exists a valid inequality $c^T x \le c_0$ with respect to $P$ with the following three properties:
   (a) $F = \{x \in P \mid c^T x = c_0\}$.
   (b) There exists a vector $\hat{x} \in P$ such that $c^T \hat{x} < c_0$.
   (c) If $a^T x \le a_0$ is a valid inequality for $P$ such that $F \subseteq \{x \in P \mid a^T x = a_0\}$ then there exists a number $\lambda \in \mathbb{Q}$ such that $a^T = \lambda \cdot c^T$ and $a_0 = \lambda \cdot c_0$.

Assertions 2 and 3 provide the two basic methods to prove that a given inequality $c^T x \le c_0$ is facet-defining for a polyhedron $P$. The first method, called the direct method, consists of exhibiting a set of $d = \dim(P)$ vectors $x_1, \dots, x_d$ satisfying $c^T x_i = c_0$ and showing that these vectors are affinely independent. The indirect method is the following: We assume that $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$ for some facet-defining inequality $a^T x \le a_0$ and prove that there exists a $\lambda > 0$ such that $a^T = \lambda \cdot c^T$ and $a_0 = \lambda \cdot c_0$. We now state some basic results about the RSA polytope and then define three non-trivial classes of valid inequalities and show when they are facet-defining.

Lemma 1 Let $G = (V, E \cup B)$ be an RSA graph with $n$ alignment and $m$ base pair edges. Then $P(G)$ is full-dimensional and for all $e_i \in E$ the inequality $x_i \le 1$ is facet-defining iff there is no $e_j \in E$ which has a conflict with $e_i$.

Proof: The first part of the lemma is proven by exhibiting $n + m + 1$ affinely independent incidence vectors of RSA alignments. We can easily do that by constructing $n$ RSA alignments consisting of one alignment edge and $m$ RSA alignments consisting of one base pair edge together with its generating alignment edges. Together with the zero vector this yields $n + m + 1$ affinely independent incidence vectors.

To prove the second part, we assume that there is no $e_j \in E$ which is in conflict with $e_i$. We define $n - 1$ sets $\{e_j, e_i\}$, $\forall e_j \in E \setminus \{e_i\}$, and $m$ sets $\{e_i, b, e_l, e_r\}$, $\forall b \in B$, where $e_l, e_r$ are the generating alignment edges of $b$. Together with the set $\{e_i\}$ this yields $n + m$ RSA alignments $A_k$, $k = 1, \dots, n + m$, whose incidence vectors $\chi^{A_k}$ are affinely independent and satisfy $\chi^{A_k}_i = 1$. On the other hand, if there is an $e_j \in E$ which is in conflict with $e_i$, then for every incidence vector $\chi$ of an RSA alignment, $\chi_i = 1$ would imply $\chi_j = 0$. Therefore $\dim\{x \in P(G) \mid x_i = 1\} < n + m - 1$ and hence $x_i \le 1$ is not a facet-defining inequality.

Lemma 2 Let $G = (V, E \cup B)$ be an RSA graph with $n$ alignment and $m$ base pair edges.

1. The inequality $x_i \ge 0$ is facet-defining iff $e_i$ is not a generating alignment edge of a base pair edge.
2. For each base pair edge $b_{ij}$ the inequality $x_{ij} \ge 0$ is facet-defining.

Proof: (1) Let $e_i \in E$ be an alignment edge which is not a generating alignment edge of a base pair edge. Then the $n - 1$ RSA alignments consisting of one alignment edge other than $e_i$ and the $m$ RSA alignments consisting of one base pair edge and its generating alignment edges form a collection of $n + m - 1$ RSA alignments $A_k$, $k = 1, \dots, n + m - 1$, whose incidence vectors $\chi^{A_k}$ are affinely independent. Together with the zero vector this yields $n + m$ affinely independent incidence vectors with $\chi^{A_k}_i = 0$. Therefore $x_i \ge 0$ is a facet-defining inequality. On the other hand, if $e_i \in E$ is a generating alignment edge of a base pair edge $b_{ij}$, then for every incidence vector $\chi$, $\chi_i = 0$ would imply $\chi_{ij} = 0$. Therefore $\dim\{x \in P(G) \mid x_i = 0\} < n + m - 1$ and hence $x_i \ge 0$ is not a facet-defining inequality.

(2) Let $b_{ij} \in B$ be a base pair edge. The $n$ RSA alignments consisting of one alignment edge and the $m - 1$ RSA alignments consisting of one base pair edge other than $b_{ij}$ and its generating edges form $n + m - 1$ RSA alignments $A_k$, $k = 1, \dots, n + m - 1$, whose incidence vectors $\chi^{A_k}$ are affinely independent. Together with the zero vector this yields $n + m$ affinely independent incidence vectors satisfying $\chi^{A_k}_{ij} = 0$. Therefore $x_{ij} \ge 0$ is a facet-defining inequality.

We will now see that we can tighten both classes of inequalities that are in the ILP formulation. An alignment edge $e_i$ may be the generating alignment edge of a whole set $B_i \subseteq B$ of base pair edges. Since all pairs of base pair edges in $B_i$ have a conflict (the generating alignment edges different from $e_i$ all start in the same base of sequence $S_1$), an RSA alignment can realize only one base pair edge in $B_i$. Therefore we can tighten these inequalities in the ILP formulation as follows:

$$\sum_{b_{ij} \in B_i} x_{ij} \le x_i.$$

We call this class of valid inequalities base pair inequalities and show that they are facet-defining for the RSA polytope.

Theorem 2 Let $G = (V, E \cup B)$ be an RSA graph. Let $e_i$ be an alignment edge and let $B_i$ be the set of base pair edges that have $e_i$ as a generating alignment edge. Then the base pair inequality $\sum_{b_{ij} \in B_i} x_{ij} - x_i \le 0$ is facet-defining for $P(G)$.

Proof: Denote the base pair inequality by $c^T x \le c_0$. Obviously condition 3 (b) of Theorem 1 holds, because the incidence vector $\chi^i$ of the set $\{e_i\}$ satisfies $c^T \chi^i < c_0$. Therefore it is sufficient to show that every valid inequality $a^T x \le a_0$ with $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$ is, up to a multiplicative factor, equal to $c^T x \le c_0$.

Assume that $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$. Since $c_0 = 0$, it follows that $a_0 = 0$. The incidence vectors $\chi^{\{e\}}$ of the $|E| - 1$ sets $\{e\}$, $\forall e \in E \setminus \{e_i\}$, fulfill $c^T \chi^{\{e\}} = a^T \chi^{\{e\}} = 0$. Hence $a_e = 0$, $\forall e \in E \setminus \{e_i\}$. Similarly the $|B| - |B_i|$ incidence vectors $\chi^{A_b}$ of the sets $A_b = \{b, e_l, e_r\}$, $\forall b \in B \setminus B_i$, fulfill $c^T \chi^{A_b} = a^T \chi^{A_b} = 0$, because $e_i$ is not a generating alignment edge of $b$. Hence $a_b = 0$, $\forall b \in B \setminus B_i$. The sets $A_{b_{ij}} = \{e_i, b_{ij}, e_j\}$, $\forall b_{ij} \in B_i$, form RSA alignments whose incidence vectors $\chi^{A_{b_{ij}}}$ satisfy $c^T \chi^{A_{b_{ij}}} = 0$ and therefore $a^T \chi^{A_{b_{ij}}} = 0$. If one subtracts $a^T \chi^{A_{b_{ij}}} = 0$ from $a^T \chi^{A_{b_{ij'}}} = 0$, $\forall b_{ij'} \in B_i \setminus \{b_{ij}\}$, this yields $a_{b_{ij}} = \dots = a_{b_{ij'}}$, because we just showed that all other coefficients except $a_{e_i}$ are zero. Since $a_{b_{ij}} = -a_{e_i}$ we can choose $\lambda = \frac{1}{a_{b_{ij}}}$, which yields $\lambda \cdot a^T = c^T$.

In the ILP formulation the inequalities $x_l + x_k \le 1$, $\forall l, k$ with $e_l$ in conflict with $e_k$, ensure that only one of the conflicting edges can be realized. We can tighten these inequalities by augmenting them from pairs of edges to maximal sets of edges in which each pair of edges is in conflict.

Definition 1 Let $G = (V, E \cup B)$ be an RSA graph, $B' \subseteq B$ be a set of base pair edges and let $E(B')$ be the set of generating alignment edges of $B'$. Let $E' \subseteq E \setminus E(B')$ be a set of alignment edges in $G$. If each pair of edges of the set $C = E' \cup B'$ is in conflict then $C$ is called an extended clique.


Figure 3 goes here

It is clear that only one edge from an extended clique can be realized by an alignment. Therefore the extended clique inequality $x(C) = \sum_{e_i \in E'} x_i + \sum_{b_{ij} \in B'} x_{ij} \le 1$ is valid. We will prove that an extended clique inequality is facet-defining unless it is redundant. An extended clique is redundant if one can replace a base pair edge by one of its generating edges such that the resulting set of edges is still an extended clique. If an extended clique is not redundant, we call it contributing. For example, in Figure 3 (a) the set $\{1, 3, 4\}$ forms an extended clique. However, replacing the base pair edge 4 by the alignment edge 2 (one of its generating alignment edges) also yields an extended clique $\{1, 2, 3\}$. In Figure 3 (b) and (c) there is no way of replacing one of the base pair edges by a generating alignment edge such that the resulting set is still an extended clique. In both cases the set $\{1, 2, 3\}$ is a contributing extended clique.

Theorem 3 Let $G = (V, E \cup B)$ be an RSA graph. Let $C = E' \cup B'$ be a maximal contributing extended clique in $G$. Then the inequality $x(C) \le 1$ is facet-defining for $P(G)$.

Proof: Denote the extended clique inequality by $c^T x \le c_0$. Condition 3 (b) of Theorem 1 holds, because the zero incidence vector $\chi$ satisfies $c^T \chi < c_0$. Therefore it is sufficient to show that every valid inequality $a^T x \le a_0$ with $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$ is, up to a multiplicative factor, equal to $c^T x \le c_0$.

Assume that $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$. All coefficients $a_e$ of alignment edges $e \in E'$ are equal to $a_0$, because the set $\{e\}$ is an RSA alignment. Let $e$ be any alignment edge not in $C$. Then there must be an edge in $C$ such that this edge and $e$ are not in conflict, because otherwise $C$ would not be maximal or it would be redundant. There are two cases:

1. There is an alignment edge $e' \in E'$ such that $e'$ and $e$ are not in conflict. In that case the two sets $A = \{e'\}$ and $A' = \{e, e'\}$ form RSA alignments which satisfy $c^T \chi = c_0$ and therefore $a^T \chi = a_0$ for $\chi = \chi^A$ and $\chi = \chi^{A'}$. Subtracting $a_{e'} = a_0$ from $a_e + a_{e'} = a_0$ yields $a_e = 0$.
2. There is no alignment edge $e'$ with the above mentioned property. Then a base pair edge $b' \in B'$ must exist that is not in conflict with $e$ and whose generating alignment edges $e'_l$ and $e'_r$ are different from $e$. Otherwise $C$ would not be a maximal contributing extended clique. In this case the two sets $A = \{b', e'_l, e'_r\}$ and $A' = \{b', e'_l, e'_r, e\}$ form RSA alignments which satisfy $c^T \chi = c_0$ and therefore $a^T \chi = a_0$ for $\chi = \chi^A$ and $\chi = \chi^{A'}$. Subtracting $a_{b'} + a_{e'_l} + a_{e'_r} = a_0$ from $a_{b'} + a_{e'_l} + a_{e'_r} + a_e = a_0$ yields $a_e = 0$.

Since the coefficients of all alignment edges in $E \setminus E'$ are zero, the coefficients of base pair edges $b \in B'$ are equal to $a_0$. Let $b$ be a base pair edge not in $C$. The base pair edge $b$ together with its generating alignment edges forms an RSA alignment. If one of its generating alignment edges is in $C$ then the other generating alignment edge is in $E \setminus E'$, because generating edges cannot be in conflict. Since one of the coefficients of the generating edges is $a_0$ and the other is 0, the coefficient $a_b$ has to be 0. If both generating edges of $b$ are in $E \setminus E'$ then their coefficients are 0 and there are again two cases:

1. There is an alignment edge $e' \in E'$ such that $e'$ and $b$ are not in conflict. In that case the two sets $A = \{e', e_l, e_r\}$ and $A' = \{b, e_l, e_r, e'\}$ form RSA alignments which satisfy $c^T \chi = c_0$ and therefore $a^T \chi = a_0$ for $\chi = \chi^A$ and $\chi = \chi^{A'}$. Subtracting $a_{e'} + a_{e_l} + a_{e_r} = a_0$ from $a_{e'} + a_b + a_{e_l} + a_{e_r} = a_0$ yields $a_b = 0$.
2. There is no alignment edge $e'$ with the above mentioned property. Then a base pair edge $b' \in B'$ must exist that is not in conflict with $b$ and whose generating alignment edges are different from those of $b$. In this case the two sets $A = \{b', e'_l, e'_r, e_l, e_r\}$ and $A' = \{b', e'_l, e'_r, b, e_l, e_r\}$ form RSA alignments which satisfy $c^T \chi = c_0$ and therefore $a^T \chi = a_0$ for $\chi = \chi^A$ and $\chi = \chi^{A'}$. Subtracting $a_{b'} + a_{e'_l} + a_{e'_r} + a_{e_l} + a_{e_r} = a_0$ from $a_{b'} + a_{e'_l} + a_{e'_r} + a_{e_l} + a_{e_r} + a_b = a_0$ yields $a_b = 0$.

This completes the proof that the coefficients of edges not in $C$ are 0. The remaining coefficients are equal to $a_0$. Choosing $\lambda = \frac{a_0}{c_0}$ we have $a_0 = \lambda \cdot c_0$ and $a^T = \lambda \cdot c^T$. Therefore $c^T x \le c_0$ is a facet-defining inequality for $P(G)$.

The above mentioned classes of facet-defining inequalities do not form a complete description of the RSA polytope. It is indeed an open question to find such a complete description. We were able to identify another class of inequalities, which is not always facet-defining. In the following we characterize this class and prove that it is facet-defining for certain RSA graphs.

Definition 2 Let $G = (V, E \cup B)$ be an RSA graph where the set $B$ consists only of base pair edges that are pairwise in conflict. Let $L \subseteq E$ and $R \subseteq E$ be the set of left and right generating edges of $B$ respectively. Further let the alignment edges of $E$ other than those in $L \cup R$ be partitioned into $2i$ sets $C_1, \dots, C_{2i}$, $i > 1$, with the following properties:

1. The edges in $C_j$, $1 \le j \le 2i$, form an extended clique.
2. $\forall e \in C_1$ holds: $e$ is in conflict exactly with each $l \in L$ and each $f \in C_1 \cup C_2$ with $e \neq f$.
3. $\forall e \in C_{2i}$ holds: $e$ is in conflict exactly with each $r \in R$ and each $f \in C_{2i} \cup C_{2i-1}$ with $e \neq f$.
4. $\forall e \in C_j$, $j = 2, \dots, 2i - 1$ holds: $e$ is in conflict exactly with each $f \in C_{j-1} \cup C_j \cup C_{j+1}$, $e \neq f$.

Then the set $C := C_1 \cup C_2 \cup \dots \cup C_{2i} \cup B$ is called an odd cycle of length $i$.

If one of the base pair edges in $B$ is realized, its left and right generating edge must be realized as well. Since the left generating edge is in conflict with every edge in $C_1$ and the right generating edge is in conflict with every edge in $C_{2i}$, it follows that in every RSA alignment at most one edge out of every other extended clique can be realized, i.e., either one of $C_2, C_4, \dots, C_{2i-2}$ or one of $C_3, C_5, \dots, C_{2i-1}$. In both cases at most $i - 1$ alignment edges are realized, which yields a total of $i$ realized edges including the base pair edge. On the other hand, if no base pair edge is realized, then one edge out of each of $C_1, C_3, \dots, C_{2i-1}$ or each of $C_2, C_4, \dots, C_{2i}$ can be realized, which again yields $i$ realized edges. Therefore for any odd cycle $C$ the odd cycle inequality

$$x(C) = \sum_{b_{ij} \in B} x_{ij} + \sum_{e_j \in C_1 \cup \dots \cup C_{2i}} x_j \le i$$

is valid. In Figure 4 an odd cycle of length 3 is shown, where $|B| = 2$ and $|C_i| = 1$. More specifically $B = \{b_{1,10}, b_{2,9}\}$, $C_1 = \{e_3\}$, $C_2 = \{e_4\}$, ..., $C_5 = \{e_7\}$ and $C_6 = \{e_8\}$.

The odd cycle inequality is indeed facet-defining if the graph $G$ is an odd cycle. We prove this in the following theorem, using the notation from Definition 2.

Theorem 4 Let $G = (V, E \cup B)$ be an RSA graph that forms an odd cycle $C$ as in Definition 2. Then the odd cycle inequality $x(C) = \sum_{b_{ij} \in B} x_{ij} + \sum_{e_j \in C_1 \cup \dots \cup C_{2i}} x_j \le i$ is facet-defining for $P(G)$.

Proof: Denote the odd cycle inequality by $c^T x \le c_0$. Clearly condition 3 (b) of Theorem 1 holds: if we realize any single edge contained in one of the extended cliques of an odd cycle, we have a valid RSA alignment whose incidence vector $\chi$ fulfills $c^T \chi < c_0$. Hence it is sufficient to show that every valid inequality $a^T x \le a_0$ with $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$ is, up to a multiplicative factor, equal to $c^T x \le c_0$.

Figure 4 goes here

Assume that $\{x \mid c^T x = c_0\} \subseteq \{x \mid a^T x = a_0\}$. Throughout the proof let $e_i$ be any edge from $C_i$ and $b$ be any edge from $B$. Furthermore denote by $e_l$ and $e_r$ the left resp. right generating edge of $b$. Then the sets $M := \{e_1, e_3, \dots, e_{2i-1}\}$ and $N := \{e_2, e_4, \dots, e_{2i}\}$ form RSA alignments satisfying $c^T \chi = c_0$ and hence $a^T \chi = a_0$. For any $l \in L$ the set $N \cup \{l\}$ also forms an RSA alignment satisfying $c^T \chi = c_0$ and therefore $a^T \chi = a_0$. Hence subtraction of the two equalities yields $a_l = 0$ for all $l \in L$. A similar argument holds for all right generating edges $r \in R$.

The sets $M' := \{b, e_l, e_r, e_3, e_5, \dots, e_{2i-1}\}$ and $N' := \{b, e_l, e_r, e_2, e_4, \dots, e_{2i-2}\}$ also form RSA alignments that satisfy $c^T \chi = c_0$ and hence $a^T \chi = a_0$. Subtracting $a^T \chi^M = a_0$ from $a^T \chi^{M'} = a_0$ yields $a_{e_l} + a_{e_r} + a_b - a_{e_1} = 0$. Since $a_{e_l} = a_{e_r} = 0$ it follows that $a_b = a_{e_1}$. Analogously $a_b = a_{e_{2i}}$.

Define for $k = 1, \dots, i$ the sets $M_k := \{b, e_l, e_r, e_2, e_4, \dots, e_{2k-2}, e_{2k+1}, e_{2k+3}, \dots, e_{2i-1}\}$, i.e., $M_k$ contains one base pair edge together with its left and right generating edge, the alignment edges with even indices from 2 to $2k - 2$ and the alignment edges with odd indices from $2k + 1$ to $2i - 1$. Every $M_k$ forms an RSA alignment and its incidence vector $\chi^k$ satisfies $c^T \chi^k = c_0$ and hence $a^T \chi^k = a_0$. Subtracting $a^T \chi^{k+1} = a_0$ from $a^T \chi^k = a_0$ yields $a_{e_{2k+1}} = a_{e_{2k}}$ for $k = 1, \dots, i - 1$.

Next we define for $k = 0, \dots, i$ the sets $N_k := \{e_1, e_3, \dots, e_{2k-1}, e_{2k+2}, e_{2k+4}, \dots, e_{2i}\}$, i.e., $N_k$ contains the alignment edges with odd indices from 1 to $2k - 1$ and the alignment edges with even indices from $2k + 2$ to $2i$. Every $N_k$ forms an RSA alignment and its incidence vector $\chi^k$ satisfies $c^T \chi^k = c_0$ and hence $a^T \chi^k = a_0$. Subtracting $a^T \chi^{k+1} = a_0$ from $a^T \chi^k = a_0$ yields $a_{e_{2k+2}} = a_{e_{2k+1}}$ for $k = 0, \dots, i - 1$.

Putting the above arguments together we have $a_b = a_{e_1} = a_{e_2} = \dots = a_{e_{2i}}$, and since the $i$ edges of $M$ satisfy $a^T \chi^M = a_0$, each of these coefficients equals $a_0 / i$. Since the argument holds for any edges $e_i \in C_i$ and $b \in B$, all coefficients in the odd cycle $C$ are equal. Choosing $\lambda = \frac{a_0}{c_0}$ we have $a_0 = \lambda \cdot c_0$ and $a^T = \lambda \cdot c^T$. Therefore $c^T x \le c_0$ is a facet-defining inequality for $P(G)$.

4 A branch-and-cut algorithm A review of branch-and-cut algorithms We relax the given integer linear program by dropping the integer condition and solve the resulting linear program. If the solution x of the linear program is integral we have the optimal solution. Otherwise we search for a valid inequality fx  f0 that \cuts o " the solution x, i.e, fy  f0 for all y 2 P (G) and f x > f0 ; the set fx j fx = f0 g is called a cutting plane. The search for a cutting plane is called the separation problem. Any cutting plane found is added to the linear program and the linear program is resolved. The generation of cutting planes is repeated until either an optimal solution is found or the search for a cutting plane fails. In the second case a branch step follows: We generate two subproblems by setting one fractional variable xe to 0 in the rst subproblem and to 1 in the second subproblem and solve these subproblems recursively. This gives rise to an enumeration tree of subproblems. A branch-and-cut algorithm for the RSA problem In order to specialize the generic branch-and-cut algorithm we need to describe separation algorithms for our classes of inequalities. Since each base pair inequality is only a stronger version of exactly one inequality in the ILP formulation, we replace each such inequality in the ILP formulation by the corresponding base pair inequality. All maximal extended cliques C that do not contain a base pair edge are contributing and therefore the extended clique inequality of C is facet-de ning for the RSA polytope. Reinert et al. showed in (Reinert, Lenhof, Mutzel, Mehlhorn, and Kececioglu 1997) that those inequalities can be 9

separated in polynomial time by computing a longest path in a directed acyclic graph. The other maximal extended cliques C contain at least one base pair edge and are by Theorem 3 facet-de ning if and only if they are contributing. Unfortunately we have not found an ecient way of separating this class of inequalities. By checking the ILP inequality xi + xj  1, 8i; j 2 E with i is in con ict with j , we can determine the feasibility of an (integer) solution of the LP. If we nd two con icting edges i; j 2 E with xi + xj > 1 we adopted two ways of handling with this situation: 1. If the RSA graph is not too dense, we enumerate all maximal extended cliques containing i (or j ). With the use of bit vectors and an adaption of an algorithm by Tsukiyama et al. (Tsukiyama, Ide, Ariyoshi, and Shirawaka 1977) this can be done in reasonable time for up to 20000-30000 cliques. This cuts o the current infeasible solution and considerably shrinks the enumeration tree in the branching phase. We used this method also for the second ILP formulation described Section 2 which limits this formulation at the moment to small problem instances. 2. If the RSA graph is too dense and therefore the number of cliques in the above mentioned enumeration explodes, we refrain from enumerating the extended cliques (which we do in most cases). If we cannot nd a violated extended clique inequality and our solution is still fractional we apply a heuristic for nding violated odd cycle inequalities. The heuristic chooses a fractional base pair variable xij and tries to nd an even number of \connecting" alignment edges between the left and right generating edge of maximal value. If it nds such a set C Pof 2i alignment edges with the P property xij + ej 2C xj > i it adds the odd cycle inequality xij + ej 2C xj  i to the LP. If this also fails we branch. In the branching phase we choose the fractional base pair variable which is closest to 0:5 and has the highest objective function coecient. 
After the branching we iterate the process on the two subproblems.
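Two of the steps described above, the pairwise feasibility check and the choice of the branching variable, are simple enough to sketch in a few lines of Python. This is only an illustration and not the C++ implementation used for the experiments. The conflict rule (two alignment edges conflict when they share an endpoint or cross), the reading of the branching rule as "closest to 0.5, ties broken by the larger objective coefficient", and all function and variable names are assumptions made for this sketch.

```python
from itertools import combinations


def in_conflict(e, f):
    """Assumed conflict rule: alignment edges (i, j) and (k, l) conflict
    unless one lies strictly to the left of the other in both sequences."""
    (i, j), (k, l) = e, f
    return not ((i < k and j < l) or (k < i and l < j))


def violated_pairs(x, eps=1e-9):
    """Return the conflicting edge pairs whose LP values violate x_i + x_j <= 1."""
    return [(e, f) for e, f in combinations(sorted(x), 2)
            if in_conflict(e, f) and x[e] + x[f] > 1 + eps]


def choose_branching_variable(x, weight, eps=1e-9):
    """Pick the fractional variable closest to 0.5; among equally close
    candidates prefer the larger objective coefficient (assumed tie-break)."""
    fractional = [v for v, val in x.items() if eps < val < 1 - eps]
    if not fractional:
        return None  # the solution is integral, nothing to branch on
    return min(fractional, key=lambda v: (abs(x[v] - 0.5), -weight[v]))


# Hypothetical LP values for three alignment edges (position in the first
# sequence, position in the second sequence).
x_align = {(1, 1): 0.6, (2, 1): 0.7, (3, 3): 1.0}
print(violated_pairs(x_align))

# Hypothetical LP values and objective coefficients for base pair variables.
x_bp = {"b1": 0.5, "b2": 0.5, "b3": 0.9, "b4": 1.0}
w_bp = {"b1": 2.0, "b2": 5.0, "b3": 9.0, "b4": 1.0}
print(choose_branching_variable(x_bp, w_bp))
```

In the small example, the edges (1, 1) and (2, 1) share an endpoint in the second sequence and their values sum to 1.3, so this pair is reported; among the base pair variables, b1 and b2 are equally close to 0.5 and b2 is chosen for its larger coefficient.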

5 Computational results

In this section we report on the results generated with our program. The implementation is coded in C++ using LEDA, the library of efficient data types and algorithms (Mehlhorn and Näher 1995), and the branch-and-cut framework ABACUS (Jünger and Thienel 1997).

The generation of the RSA graph

One input for our algorithm is a set of reasonable alignment edges. In principle, we use for this purpose alignment edges realized by some suboptimal alignment, i.e., an alignment with a score close to optimal. In contrast to (Lenhof, Reinert, and Vingron 1998) we do not take all the edges realized by any suboptimal alignment scoring better than a fixed threshold s below the optimal. Rather we employ a windowing technique to make the alignment graph denser in certain regions and thinner in others. The reason for applying the windowing technique described below is that conventional suboptimal alignments have frequently shown insufficient deviation from the optimal alignment to cover the alignment edges necessary to build the structurally correct alignment. On the other hand, upon inclusion of a sufficient number of suboptimal alignments the number of edges to consider became too large. As a remedy we designed a windowing technique that adjusts the suboptimality cutoff according to the local quality of an alignment. Where the alignment appears to be very good, no suboptimal alternatives are considered. In alignment regions showing little sequence conservation, more suboptimal alternatives are taken into account.

We proceed as follows: For a given conventional optimal alignment we compute, for each position i in the first sequence, say, an index q(i). Let a(i) be the position of character i in the alignment and l be the length of the first sequence. Then, for a given window size w, we sum the substitution matrix values of the aligned characters from alignment position $\max\{0, a(i) - w\}$ to alignment position $\min\{a(l), a(i) + w\}$ and divide the sum by the length of the window. The index q(i) is a measure for the local quality of the optimal alignment at sequence position i. See Figure 5 for an example with window width 4.
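The computation of q(i) can be sketched as follows. The sketch is an illustration under stated assumptions, not the original implementation: alignment columns are counted from 0, a truncated window at the borders is divided by its actual number of columns, and the function name and parameters are hypothetical. The two rows are those of Figure 5, scored with match 2 and mismatch -1 as in the figure and with -1 for an indel as stated for the experiments. Which position the figure marks as i cannot be recovered from the text, so the tenth character of the first sequence is picked here because its window yields the quotient 3/9 given in the figure.

```python
def local_quality(row1, row2, w, match, mismatch, indel):
    """Return q(i) for every character i of row1, the first aligned sequence.

    row1 and row2 are the two rows of a pairwise alignment, with '-' as gap.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")

    def column_score(a, b):
        if a == "-" or b == "-":
            return indel
        return match if a == b else mismatch

    scores = [column_score(a, b) for a, b in zip(row1, row2)]
    # a(i): alignment column of the i-th character of the first sequence.
    columns = [col for col, ch in enumerate(row1) if ch != "-"]
    last = columns[-1]  # a(l), the column of the last character

    q = []
    for a_i in columns:
        lo = max(0, a_i - w)
        hi = min(last, a_i + w)
        window = scores[lo:hi + 1]
        q.append(sum(window) / len(window))
    return q


row1 = "ACCGU---CGUCGUCG-GC---GUU"
row2 = "AGCGUCGCCG---CCGUGCAAAGU-"
q = local_quality(row1, row2, w=4, match=2, mismatch=-1, indel=-1)
print(q[9])  # window of nine columns around the tenth character of row1
```

Dividing by the actual window length near the ends of the alignment is one reading of "the length of the window"; dividing by 2w + 1 throughout would be another.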

Now we compute a coefficient c(i) as follows: First we normalize q(i) to a value p(i) between 0 and 1 and then define $c(i) = (1 - p(i))^2$. The coefficient c(i) is near 0 in regions where the alignment is reliable and near 1 in regions where it is not reliable. Figure 6 shows a plot of c(i) for the optimal conventional alignment between Desulfurococcus mobilis (1495 nucleotides) and Halobacterium halobium (1472 nucleotides). One can see three prominent peaks where indeed our experiments show that the conventional alignment is wrong. Finally, we collect all edges at position i that are realized by some alignment that is at most $c(i) \cdot s$ worse than the optimal conventional alignment, where s is the (maximal) suboptimality. The alignment edges induce base pair edges for each of the pairs A-U, C-G and G-U as described in Section 2, and together with the base pair edges they form the input RSA graph.

Assessing the quality of the results

The given secondary structure should direct the optimal alignment towards the detection of conserved structural patterns. We used two ways of assessing the quality of a structural alignment.

• The first is the number of realized base pairs. We compare it to the number of standard base pairs (A-U, C-G, G-U) in the correct structure of the second sequence. The more base pairs we realize, the better the alignment.

• The second is the comparison of the sequence alignment from the database (which we assume to be the "correct" alignment) with the computed alignment. We use a dot plot representation to overlay the correct alignment with either the optimal conventional alignment or our structural alignment. A mark at position (i, j) in the plot indicates that the i-th character of the first sequence is aligned with the j-th character of the second sequence. The more (mis)matches of an alignment coincide with the (mis)matches from the correct alignment, the better the alignment.
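Returning to the construction of the RSA graph, the coefficient c(i) and the position-dependent suboptimality cutoff c(i) · s can be sketched as below. The text does not say how q(i) is normalized to p(i); min-max scaling is assumed here purely for illustration, and the function names as well as the handling of a constant q are likewise assumptions.

```python
def suboptimality_coefficients(q):
    """Map local quality values q(i) to c(i) = (1 - p(i))^2.

    p(i) is obtained by min-max scaling of q(i) to [0, 1]; this particular
    normalization is an assumption of the sketch.
    """
    lo, hi = min(q), max(q)
    if hi == lo:
        # Constant quality: treat every position as reliable (assumed choice).
        return [0.0] * len(q)
    p = [(value - lo) / (hi - lo) for value in q]
    return [(1.0 - value) ** 2 for value in p]


def keep_edge(score_through_edge, optimal_score, c_i, s):
    """Keep an alignment edge at position i if the best alignment realizing
    it is at most c(i) * s worse than the optimal conventional alignment."""
    return optimal_score - score_through_edge <= c_i * s


c = suboptimality_coefficients([1.0, 0.0, 0.5])
print(c)                                   # reliable, unreliable, in between
print(keep_edge(95.0, 100.0, c[1], 10.0))  # unreliable region: cutoff is 10
print(keep_edge(95.0, 100.0, c[2], 10.0))  # intermediate region: cutoff is 2.5
```

Squaring 1 - p(i) keeps the cutoff small unless the local quality is clearly poor, which matches the stated aim of considering no suboptimal alternatives where the alignment appears very good.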
Results

We initially tested the algorithm on small problem instances like tRNA alignments and alignments of 5S RNA sequences. For the cases studied the algorithm reproduced the correct alignments. Here we want to present more challenging examples of 23S ribosomal RNA sequences from the Antwerpen rRNA database (de Rijk, de Peer, Chapelle, and de Wachter 1994). The base pairs for the first sequence were taken from the common secondary structure given in the database. We present alignments for three sequences taken from the database:

1. Desulfurococcus mobilis with 1496 nucleotides. The common structure in the database realizes 469 base pairs of which 464 are standard base pairs.


2. Halobacterium halobium with 1474 nucleotides. The common structure in the database realizes 459 base pairs of which 452 are standard base pairs.

3. Methanobacterium formicicum with 1477 nucleotides. The common structure in the database realizes 455 base pairs of which 447 are standard base pairs.

From these three sequences we build three test sets, where the structure is always given for the first sequence. The first alignment we compute is between Desulfurococcus mobilis and Halobacterium halobium, the second between Halobacterium halobium and Methanobacterium formicicum, and the third between Methanobacterium formicicum and Desulfurococcus mobilis. Figure 7 contains the results of different runs of our algorithm on these examples.

We computed the conventional alignment with affine gap costs. The gap initiation penalty was 6 and the gap prolongation penalty 3. The substitution matrix we use assigns a score of 4 to a match and 1 to a mismatch. In the computation of c(i) we additionally assign the score of -1 to an indel.

The first column shows the number of the test set. The second column gives the degree of suboptimality. If this number is too high, there are too many alignment and base pair edges in the RSA graph and the problem becomes hard to solve. If this number is too small, we are in danger of losing too much information. The third column contains the run time in minutes and seconds on an UltraSparc 2/200. Note that the numbers include the time for computing the conventional alignment, which in our case is around 30 s. The fourth column gives the total number of edges (or variables) and the last column shows the number of base pairs that are realized by the optimal RSA alignment.

As expected, the number of realized base pairs increases with a higher suboptimality parameter, except for test set 3. Here the conventional alignment seems to be reasonable and is not improved by our method.
In the appendix we present a region of an alignment between Desulfurococcus mobilis and Halobacterium halobium computed in two minutes with suboptimality 10. In Figure 8 you see two dot plots. The correct alignment taken from the database is denoted with crosses. Using an 'x' as mark, we plotted in Figure 8 a) the (mis)matches from the structural alignment obtained by our algorithm and in Figure 8 b) the (mis)matches from all optimal conventional alignments. Our alignment coincides to a much larger degree with the handmade alignment than the optimal conventional one.

Figure 9 depicts the alignment itself in this region. We show the optimal handmade, the optimal conventional and the optimal structural alignment. The numbers in the first and fourth row indicate structural elements (helices), where x and x' are complementary strands of a helix numbered x. Capital letters show bases that form a base pair with a base in the complementary strand. Small letters are either not part of a helix or form a bulge within a helix. We inserted square brackets into the alignment to indicate the beginning and end of a helix. They are not part of the alignment. The first two rows in all three alignments show part of the Desulfurococcus mobilis sequence with three helices 9, 9'; 10, 10' and 11, 11'. It is easy to check that the capital letters read from the beginning of x form base pairs with the capital letters read from the end of x'. The last two rows in all three alignments always show a part of Halobacterium halobium with the same three helices 9, 9'; 10, 10' and 11, 11'.

The optimal handmade alignment shows that helix 9 has the same length in both sequences, helix 10 is a bit shorter and helix 11 is considerably shorter in Halobacterium halobium. The optimal conventional alignment identifies helix 9; however, it completely fails to recognize helices 10 and 11. A closer look at helix 10 in the optimal handmade alignment reveals that there is indeed very poor sequence similarity at this position. Inspecting the optimal structural alignment we can observe that helix 10 is almost completely identified.
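The dot plot comparison amounts to counting how many aligned position pairs (i, j) of a computed alignment also occur in the reference alignment. A minimal sketch of that count is given below; the function names and the toy rows are hypothetical, positions are counted from 1, and columns containing a gap are ignored.

```python
def aligned_pairs(row1, row2):
    """Return the set of pairs (i, j) such that the i-th character of the
    first sequence is aligned with the j-th character of the second one."""
    pairs = set()
    i = j = 0
    for a, b in zip(row1, row2):
        if a != "-":
            i += 1
        if b != "-":
            j += 1
        if a != "-" and b != "-":
            pairs.add((i, j))
    return pairs


def agreement(reference, candidate):
    """Return (number of shared pairs, number of pairs in the reference)."""
    ref = aligned_pairs(*reference)
    cand = aligned_pairs(*candidate)
    return len(ref & cand), len(ref)


# Toy example: the reference aligns G with G, the candidate does not.
reference = ("ACG-U", "A-GCU")
candidate = ("ACGU", "AGCU")
print(agreement(reference, candidate))
```

In the toy example the reference contains the pairs (1, 1), (3, 2) and (4, 4), of which the candidate reproduces (1, 1) and (4, 4).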


6 Conclusion

In this paper we gave a formulation of a sequence structure alignment problem using polyhedral combinatorics. We described three classes of facet-defining inequalities for the RSA polytope and gave a branch-and-cut algorithm. Our computational results show that we are able to solve to optimality problem instances whose size has not been tractable for a dynamic programming approach. Moreover, we demonstrated that the formulation of structural alignment models the biological truth better than conventional sequence alignment with affine gap costs. Another advantage of our formulation is that it can easily be modified to apply to other variants of RNA comparison. It is conceivable to use it for the comparison of RNA structures, to generalize it to several sequences, or to apply it to the insertion of a new sequence into a given alignment with a consensus structure. Furthermore, unlike for dynamic programming formulations, pseudoknots can be accounted for in our framework. This flexibility appears to us to be one of the major advantages of the integer linear programming approach.


References

Bafna, V., Muthukrishnan, S., and Ravi, R. 1995. Computing similarity between RNA strings. In: Z. Galil and E. Ukkonen, eds., Combinatorial Pattern Matching, 1-16, Springer, Espoo, Finland.

Christof, T., Jünger, M., Kececioglu, J., Mutzel, P., and Reinelt, G. 1997. A branch-and-cut approach to physical mapping with end-probes. In: Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB97), 84-93.

Corpet, F., and Michot, B. 1994. RNAlign program: alignment of RNA sequences using both primary and secondary structures. CABIOS 10(4), 389-399.

de Rijk, P., de Peer, Y. V., Chapelle, S., and de Wachter, R. 1994. Database on the structure of small ribosomal subunit RNA. Nucleic Acids Research 22, 3495-3501.

Eddy, S. R., and Durbin, R. 1994. RNA sequence analysis using covariance models. Nucleic Acids Research 22(11), 2079-2088.

Gorodkin, J., Heyer, L. J., and Stormo, G. D. 1997. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research 25, 3724-3732.

Jünger, M., and Thienel, S. 1997. The design of the branch and cut system ABACUS. Technical Report No. 97.260, Institut für Informatik, Universität zu Köln.

Lenhof, H.-P., Reinert, K., and Vingron, M. 1998. A polyhedral approach to RNA sequence structure alignment. In: Proceedings of the 2nd Annual International Conference on Computational Molecular Biology (RECOMB-98), 153-162, ACM Press, New York, USA.

Levitt, M. 1969. Detailed molecular model for transfer ribonucleic acid. Nature 224, 759-763.

Mehlhorn, K., and Näher, S. 1995. LEDA, a platform for combinatorial and geometric computing. Communications of the ACM 38(1), 96-102.

Nemhauser, G., and Trotter, L. 1973. Properties of vertex packing and independence system polyhedra. Mathematical Programming 6, 48-61.

Nemhauser, G. L., and Wolsey, L. A. 1988. Integer and combinatorial optimization. Wiley, Chichester.

Reinert, K., Lenhof, H.-P., Mutzel, P., Mehlhorn, K., and Kececioglu, J. 1997. A branch-and-cut algorithm for multiple sequence alignment. In: Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB97), 241-249.

Sankoff, D. 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics 45(5), 810-825.

Tsukiyama, S., Ide, M., Ariyoshi, H., and Shirakawa, I. 1977. A new algorithm for generating all the maximal independent sets. SIAM Journal on Computing 6, 505-517.

Waterman, M. S. 1989. Consensus methods for folding single-stranded nucleic acids. In: Mathematical Methods for DNA Sequences, 185-224, CRC Press, Boca Raton, FL.


List of Figures

1 (a) Edges {1, 2} and {3, 4} are in conflict (b) {2} and {4} are realized by the alignment
2 An RSA graph with the base pairs of the first sequence
3 One redundant (a) and two contributing (b, c) extended cliques
4 An RSA graph forming an odd cycle of length 3
5 Calculation of quotient at position i with window of width 4
6 Plot of c(i) for an optimal conventional alignment
7 Results of the algorithm with different levels of suboptimality
8 Dot plots of optimal alignments
9 Alignments corresponding to the dot plot in Figure 8

Figure 1: (a) Edges {1, 2} and {3, 4} are in conflict (b) {2} and {4} are realized by the alignment

[Graph with labeled elements: left generating edge, base pair edge, right generating edge]

Figure 2: An RSA graph with the base pairs of the first sequence

Figure 3: One redundant (a) and two contributing (b, c) extended cliques

[Graph with alignment edges $e_1, \dots, e_{10}$ and base pair edges $b_{1,10}$ and $b_{2,9}$]

$x_{1,10} + x_{2,9} + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 \le 3$

Figure 4: An RSA graph forming an odd cycle of length 3

[Window of width w on either side of position i]

ACCGU---CGUCGUCG-GC---GUU
AGCGUCGCCG---CCGUGCAAAGU-

Match = 2, Mismatch = -1, Quotient = 3/9

Figure 5: Calculation of quotient at position i with window of width 4

[Plot: c(i), ranging from 0 to 1, over positions 0 to 1600]

Figure 6: Plot of c(i) for an optimal conventional alignment

Test set  Suboptimality  Time (min:s)  # Variables  # Realized base pairs
1         0              00:32         2051         436
1         5              00:45         2399         440
1         10             02:02         3091         446
1         13             20:26         3659         446
2         0              00:28         1925         436
2         10             00:34         2078         438
2         20             01:01         2857         441
2         30             02:11         4039         445
2         40             16:23         5643         445
3         0              00:30         2002         445
3         10             00:59         2855         445
3         20             02:04         3664         445
3         30             04:24         5087         445

Figure 7: Results of the algorithm with different levels of suboptimality

a) Plot of optimal alignment against computed structural alignment
b) Plot of optimal alignment against computed conventional alignment

Figure 8: Dot plots of optimal alignments

Correct alignment

-----9-------------------9'------------------------10---------
[GGGGgauaaCACCGG]gaaa[CUGGUGcuaaucCCCC]aua[GG GGAGGAGgCC]uggaag
[CGGGaauacUCUCGG]gaaa[CUGAGGcuaaucCCCG]aua ac[GCUUUGCUCC]uggaa
-----9-------------------9'------------------------10---------

-------10'-----------11-----------------------------11'-
[GGUUCCUCCC C]-gaaag[GGU GUGGCaggGGU]uaac[GCUGCUACA CC]g
[GGGGCAAAGC]c ggaaa-[CGC]---------- uccg ---------[GC G]
-------10'-----------11-----------------------------11'--

Optimal conventional alignment

-----9-------------------9'------------------------10---------
[GGGGgauaaCACCGG]gaaa[CUGGUGcuaaucCCCC]aua[GGGGAG GAgGCC]uggaag
[CGGGaauacUCUCGG]gaaa[CUGAGGcuaaucCCCG]aua ----ac[g----c u----
-----9-------------------9'------------------------10---------

-------10'---------------11----------------------11'------
[GGUUCCUCC CC]gaa ag[GGUGUGGC aggGGU]uaa c[GCUGC UACA CC]g
--uugcucc]ug gaa[gg ggcaaagc]cgg--- aaa[c ---gc]uccg[gc g]
----------------------10'-----------------11-----------11'-

Optimal structural alignment

------9-------------------9'--------------------10-------------
[GGGGgauaaCACCGG]gaaa[CUGGUGcuaaucCCCC]aua[GG GGAGGAgGCC]uggaa g
[CGGGaauacUCUCGG]gaaa[CUGAGGcuaaucCCCG]aua ac[GCUUUGCUCC]uggaa[G
------9-------------------9'--------------------10-------------

-------10'------------11----------------------11'--------
[GGUUCCUCCCC]gaaag[GG UGUGGCaggGGU]uaa c[GC UGCUACA CC]g
GGGCAAA---- ----- gc]cg--g-a--a-- --a[c gc]U-C--CG[gc g]
-------10'-------------------------------11----------11'--

Figure 9: Alignments corresponding to the dotplot in Figure 8
