Fast Sorting by Reversal
Piotr Berman$^{1,3}$ and Sridhar Hannenhalli$^{2,4}$
$^1$ Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802
[email protected]
$^2$ Department of Mathematics, University of Southern California, Los Angeles, CA 90089-1113
[email protected]
Abstract
Analysis of genomes evolving by inversions leads to the combinatorial problem of sorting by reversals, which has recently been studied in detail. Following a series of recent works, Hannenhalli and Pevzner developed the first polynomial algorithm for the problem of sorting signed permutations by reversals and proposed an $O(n^4)$ implementation of the algorithm. In this paper we exploit a few combinatorial properties of the cycle graph of a permutation and propose an $O(n^2 \alpha(n))$ implementation of the algorithm, where $\alpha$ is the inverse Ackermann function. Besides making this algorithm practical, our technique improves implementations of the other rearrangement distance problems.
$^3$ This work is supported by NSF grant CCR9114545.
$^4$ This work is supported by NSF Young Investigator Award, NIH grant 1R01 HG00987 and DOE grant DE-FG02-94ER61919.
[Figure 1 shows the series of gene orders B. oleracea (cabbage) = (1 -5 4 -3 2), (1 -5 4 -3 -2), (1 -5 -4 -3 -2), (1 2 3 4 5) = B. campestris (turnip).]
Figure 1: "Transformation" of cabbage into turnip
1 Introduction

In the late 1980s Jeffrey Palmer and colleagues discovered a remarkable and novel pattern of evolutionary change in plant organelles. They compared the mitochondrial genomes of Brassica oleracea (cabbage) and Brassica campestris (turnip), which are very closely related (many genes are 99% - 99.9% identical). To their surprise, while the genes themselves are almost identical, there are dramatic differences in their order (Fig. 1). This discovery and many other studies in the last decade convincingly proved that genome rearrangement is a common mode of molecular evolution in mitochondrial, chloroplast, viral and bacterial DNA (see Bafna and Pevzner, 1995a).

Every study of genome rearrangements involves solving a combinatorial "puzzle" to find a shortest series of reversals transforming one genome into another. (Three such reversals "transforming" cabbage into turnip are shown in Fig. 1.) In cases of genomes consisting of a small number of "conserved blocks", Palmer and coauthors were able to find the most parsimonious scenarios for rearrangements. However, for genomes consisting of more than 10 blocks, exhaustive search over all potential solutions is far beyond the possibilities of "pen-and-pencil" methods. As a result, Palmer and Herbon, 1988 and Makaroff and Palmer, 1988 overlooked the most parsimonious scenarios of rearrangements in more complicated cases like turnip vs. black mustard or turnip vs. radish (see Bafna and Pevzner, 1995a for correct solutions).

Analysis of genome rearrangements provides a multitude of challenges for computer scientists; see Pevzner and Waterman, 1995 for a review of open combinatorial problems motivated by genome rearrangements. A computational approach based on comparison of gene orders, as opposed to the traditional comparison of DNA sequences, was pioneered by Sankoff (see Sankoff et al., 1990, 1992 and Sankoff, 1992).
Kececioglu and Sankoff, 1993 suggested the first performance guarantee algorithm to analyze genome rearrangements and conjectured that sorting by reversals is NP-hard. The problem was further studied by Bafna and Pevzner, 1993, who introduced the notion of the cycle graph of a permutation and revealed important links between the maximum cycle decomposition of this graph and reversal distance. Recently Hannenhalli and Pevzner, 1995a (referred to as HP95 in the rest of the paper) found a polynomial algorithm for sorting signed permutations by reversals, a problem which was also believed to be NP-hard. See also Kececioglu and Sankoff, 1994, Kececioglu and Gusfield, 1994, Kececioglu and Ravi, 1995 and Bafna and Pevzner, 1995b, HP95, Hannenhalli, 1995, Hannenhalli and Pevzner, 1995b and Hannenhalli and Pevzner, 1995c for recent progress on the computational aspects of genome rearrangements, as well as Gates and Papadimitriou, 1979, Even and Goldreich, 1981, Jerrum, 1985, Aigner and West, 1987, Cohen and Blum, 1993, Heydari and Sudborough, 1993 for studies of related combinatorial problems.

In the problem we consider, the order of genes in two organisms is represented by permutations $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ and $\sigma = (\sigma_1 \sigma_2 \ldots \sigma_n)$. A reversal $\rho(i, j)$ of an interval $[i, j]$ is the permutation

$$\rho(i,j) = \begin{pmatrix} 1 & 2 & \ldots & i-1 & i & i+1 & \ldots & j-1 & j & j+1 & \ldots & n \\ 1 & 2 & \ldots & i-1 & j & j-1 & \ldots & i+1 & i & j+1 & \ldots & n \end{pmatrix}$$

Clearly $\pi \cdot \rho(i, j)$ has the effect of reversing the order of genes $\pi_i \pi_{i+1} \ldots \pi_j$. Given permutations $\pi$ and $\sigma$, the reversal distance problem is to find a shortest series of reversals $\rho_1, \rho_2, \ldots, \rho_t$ such that $\pi \cdot \rho_1 \cdot \rho_2 \cdots \rho_t = \sigma$ (Fig. 2a). We call $t$ the reversal distance between $\pi$ and $\sigma$. Note that the reversal distance between $\pi$ and $\sigma$ equals the reversal distance between $\sigma^{-1}\pi$ and the identity permutation $(1 2 \ldots n)$. Sorting by reversals is the problem of finding the reversal distance, $d(\pi)$, between $\pi$ and the identity permutation.

A restricted (but biologically more relevant) version of the "sorting by reversals" problem is "sorting signed permutations by reversals", which deals with permutations that associate a sign with each element (representing the direction of the corresponding gene). Recently HP95 developed a duality theorem expressing the reversal distance for a signed permutation in terms of easily computable parameters of the permutation and also proposed a polynomial algorithm with an $O(n^4)$ implementation. We state the relevant results from HP95 in section 2.
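The effect of a reversal can be made concrete with a short sketch. The function names `reverse_unsigned` and `reverse_signed` are illustrative choices, not taken from HP95, and positions are 1-based as in the definition of a reversal of the interval $[i, j]$. The three reversals listed in `steps` are read off from the gene orders of Fig. 1.

```python
def reverse_unsigned(perm, i, j):
    """Apply the reversal of interval [i, j] (1-based, inclusive) to an unsigned permutation."""
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


def reverse_signed(perm, i, j):
    """Signed variant: the segment is reversed and every sign inside it is flipped."""
    return perm[:i - 1] + [-x for x in reversed(perm[i - 1:j])] + perm[j:]


# Gene order of cabbage from Fig. 1 and the three reversals leading to turnip.
cabbage = [1, -5, 4, -3, 2]
steps = [(5, 5), (3, 3), (2, 5)]

genome = cabbage
for i, j in steps:
    genome = reverse_signed(genome, i, j)
    print(genome)
```

In the signed case a reversal of a single element, such as (5, 5) above, only flips a sign, which is why it counts as a genuine step.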
Since our improvement to the earlier implementation exploits certain combinatorial properties of permutations which cannot be discussed here in detail, readers are strongly recommended to refer to HP95. The algorithm proposed in HP95 finds the optimal sequence of reversals iteratively. In any such iteration, finding a safe reversal in an oriented component is the bottleneck (step 5 in the algorithm Reversal Sort stated in section 2); it in turn involves computing the connected components of the interleaving graph of a permutation a number of times. In section 3 we present an $O(n\alpha(n))$ algorithm for computing the connected components of the interleaving graph of a permutation. Using that algorithm and a few properties of the interleaving graph, in section 4 we present an $O(n\alpha(n))$ algorithm for finding a safe reversal in an oriented component. This leads to an $O(n^2\alpha(n))$ implementation of the algorithm.
[Figure 2. Panel (a): the permutations (3 5 8 6 4 7 9 2 1 10 11), (3 5 4 6 8 7 9 2 1 10 11), (3 4 5 6 8 7 9 2 1 10 11), (3 4 5 6 7 8 9 2 1 10 11), (9 8 7 6 5 4 3 2 1 10 11), (1 2 3 4 5 6 7 8 9 10 11). Panel (b): black and gray edges on the vertices 0 3 5 8 6 4 7 9 2 1 10 11 12. Panel (c): the signed permutation (+3 -5 +8 -6 +4 -7 +9 +2 +1 +10 -11) and its image (0 5 6 10 9 15 16 12 11 7 8 14 13 17 18 3 4 1 2 19 20 22 21 23), with oriented and non-oriented cycles marked. Panel (d): the interleaving graph. Cycles are labeled A through F.]
Figure 2: (a) Optimal sorting of a permutation (3 5 8 6 4 7 9 2 1 10 11) by 5 reversals and (b) cycle graph of this permutation; (c) Transformation of a signed permutation into an unsigned permutation $\pi$ and the cycle graph $G(\pi)$; (d) Interleaving graph $H_\pi$ with two oriented and one unoriented component.
2 Cycles, hurdles and fortresses

Let $\pi = (\pi_1 \ldots \pi_n)$ be a permutation of the elements $\{1, \ldots, n\}$. Denote $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = (\pi_1 \ldots \pi_n)$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$. We call a pair of consecutive elements $\pi_i$ and $\pi_{i+1}$, $0 \le i \le n$, of $\pi$ a breakpoint if $\pi_i \not\sim \pi_{i+1}$ and an adjacency if $\pi_i \sim \pi_{i+1}$. The cycle graph of $\pi$ is an edge-colored graph $G(\pi)$ with $n + 2$ vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\}$. We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge for $0 \le i \le n$, and vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$ (see Fig. 2b). The number of black edges in $G(\pi)$ is denoted by $b(\pi)$ and is trivially equal to $n + 1$. A cycle in an edge-colored graph $G$ is alternating if the colors of every two consecutive edges of this cycle are distinct. In the following, by cycles we mean alternating cycles. The length of a cycle $C$ is the number of edges in it. Cycles of length 2 correspond to adjacencies in $\pi$. We call a cycle of length $k$ a $k$-cycle.

Let $\tilde{\pi}$ be a signed permutation of $\{1, \ldots, n\}$, i.e. a permutation with a "+" or "-" sign associated with each element (Fig. 2c). In the signed case, every reversal of a fragment $[i, j]$ changes both the order and the signs of the elements within that fragment. We are interested in the minimum number of reversals $d(\tilde{\pi})$ required to transform a signed permutation $\tilde{\pi}$ into the signed identity permutation $(+1 +2 \ldots +n)$.

Define a transformation from a signed permutation $\tilde{\pi}$ of order $n$ to an (unsigned) permutation $\pi$ of $\{1, \ldots, 2n\}$ as follows. To model the signs of elements in $\tilde{\pi}$, replace the positive elements $+x$ by $2x - 1, 2x$ and the negative elements $-x$ by $2x, 2x - 1$ (Fig. 2c). We call the unsigned permutation $\pi$ the image of the signed permutation $\tilde{\pi}$. In the cycle graph $G(\pi)$, elements $2x - 1$ and $2x$ are joined by both black and gray edges for $1 \le x \le n$. We define the cycle graph $G(\tilde{\pi})$ of a signed permutation $\tilde{\pi}$ as the cycle graph $G(\pi)$ with these $2n$ edges excluded. Observe that in $G(\tilde{\pi})$ every vertex has degree 2 (Fig. 2c) and therefore the cycle graph of a signed permutation is a collection of disjoint cycles.
Denote the number of such cycles as $c(\tilde{\pi})$. We observe that the signed identity permutation of order $n$ maps to the (unsigned) identity permutation of order $2n$, and the effect of a reversal on $\tilde{\pi}$ can be mimicked by a reversal on $\pi$, thus implying $d(\tilde{\pi}) \ge d(\pi)$. In the following, by sorting the image $\pi = \pi_1 \pi_2 \ldots \pi_{2n}$ of a signed permutation $\tilde{\pi} = \tilde{\pi}_1 \tilde{\pi}_2 \ldots \tilde{\pi}_n$, we mean sorting of $\pi$ by reversals $\rho(2i + 1, 2j)$ which "cut" only after even positions in $\pi$. In the rest of this section, $\pi$ is an image of a signed permutation.

We say that a reversal $\rho(i, j)$ acts on the black edges $(\pi_{i-1}, \pi_i)$ and $(\pi_j, \pi_{j+1})$ in $G(\pi)$. We call $\rho(i, j)$ a reversal on a cycle if the black edges $(\pi_{i-1}, \pi_i)$ and $(\pi_j, \pi_{j+1})$ belong to the same cycle in $G(\pi)$. Every reversal increases $c(\pi)$ by at most 1, i.e., $c(\pi\rho) - c(\pi) \le 1$ (Bafna and Pevzner, 1993). A gray edge $g$ is oriented if for a reversal $\rho$ acting on the two black edges incident to $g$, $c(\pi\rho) - c(\pi) = 1$, and unoriented otherwise. A cycle in $G(\pi)$ is oriented if it has an oriented gray edge and unoriented otherwise.

Gray edges $(\pi_i, \pi_j)$ and $(\pi_k, \pi_t)$ in $G(\pi)$ are interleaving if the intervals $[i, j]$ and $[k, t]$ overlap but neither of them contains the other. Cycles $C_1$ and $C_2$ are interleaving if there exist interleaving gray edges $g_1 \in C_1$ and $g_2 \in C_2$. See Fig. 2c for examples.
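The image of a signed permutation and the number of cycles of its cycle graph can be computed directly from the definitions above. The following is a minimal sketch; the names `image` and `count_cycles` are illustrative. It assumes the image is extended by 0 and $2n + 1$, so that black edges join positions $2i$ and $2i + 1$ and gray edges join the values $2k$ and $2k + 1$.

```python
def image(signed):
    """Image of a signed permutation, extended by 0 on the left and 2n + 1 on the right."""
    out = [0]
    for x in signed:
        a = abs(x)
        out += [2 * a - 1, 2 * a] if x > 0 else [2 * a, 2 * a - 1]
    out.append(2 * len(signed) + 1)
    return out


def count_cycles(signed):
    """Number of alternating cycles in the cycle graph of a signed permutation."""
    img = image(signed)
    pos = {value: p for p, value in enumerate(img)}
    seen = set()
    cycles = 0
    for start in range(0, len(img), 2):  # one endpoint of every black edge
        if start in seen:
            continue
        cycles += 1
        p = start
        while p not in seen:
            q = p ^ 1                # black edge: positions 2i and 2i + 1
            seen.update((p, q))
            p = pos[img[q] ^ 1]      # gray edge: values 2k and 2k + 1
    return cycles


print(image([1, -5, 4, -3, 2]))
print(count_cycles([1, -5, 4, -3, 2]))
```

Because every vertex has exactly one black and one gray edge, following the two colors alternately always returns to the starting position, so each pass of the inner loop traces exactly one cycle.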
Let $\mathcal{C}_\pi$ be the set of cycles in the cycle graph of a permutation $\pi$. Define an interleaving graph $H_\pi(\mathcal{C}_\pi, \mathcal{I}_\pi)$ of $\pi$ with the edge set $\mathcal{I}_\pi = \{(C_1, C_2) : C_1$ and $C_2$ are interleaving cycles in $\pi\}$ (Fig. 2d). The vertex set of $H_\pi$ is partitioned into oriented and unoriented vertices (cycles in $\mathcal{C}_\pi$). A connected component of $H_\pi$ is oriented if it has at least one oriented vertex and unoriented otherwise. In the following we use the terms edge of $\pi$, cycle in $\pi$ and component of $\pi$ instead of the (more accurate) terms edge of $G(\pi)$, cycle in $G(\pi)$ and component of $H_\pi$. A connected component $U$ corresponds to the set of integers $\bar{U} = \{i : \pi_i \in C \in U\}$ representing the set of positions of the permutation belonging to cycles of $U$.

For a set of integers $U$ define $U_{min} = \min_{u \in U} u$ and $U_{max} = \max_{u \in U} u$. Let $\prec$ be a partial order on a set $P$. An element $x \in P$ is called a minimal element in $\prec$ if there is no element $y \in P$ with $y \prec x$. An element $x \in P$ is the greatest in $\prec$ if $y \prec x$ for every $y \in P$ and $|P| > 1$. Let $\mathcal{U}$ be a collection of sets of integers. Define a partial order on $\mathcal{U}$ by the rule $U \prec W$ iff $[U_{min}, U_{max}] \subset [W_{min}, W_{max}]$ for $U, W \in \mathcal{U}$. We say that a set $U \in \mathcal{U}$ separates sets $U'$ and $U''$ if there exists $u \in U$ such that $U'_{max} < u < U''_{min}$.

A hurdle for the set $\mathcal{U}$ is defined as an unoriented component $U$ in $\mathcal{U}$ which is either a minimal hurdle or the greatest hurdle, where a minimal hurdle is a minimal element in $\prec$ and the greatest hurdle satisfies the following two conditions: (i) $U$ is the greatest element in $\prec$ and (ii) $U$ does not separate any two sets in $\mathcal{U}$. A hurdle $K \in \mathcal{U}$ protects a non-hurdle $U \in \mathcal{U}$ if deleting $K$ from $\mathcal{U}$ transforms $U$ from a non-hurdle into a hurdle (i.e. $U$ is a hurdle in $\mathcal{U} \setminus K$). A hurdle in $\mathcal{U}$ is a superhurdle if it protects a non-hurdle $U \in \mathcal{U}$.

Define a collection of sets of integers $\mathcal{U}_\pi = \{\bar{U} : U$ is an unoriented component of permutation $\pi\}$ and let $h(\pi)$ be the overall number of hurdles for the collection $\mathcal{U}_\pi$. Permutation $\pi$ is called a fortress if it has an odd number of hurdles and all these hurdles are superhurdles.
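Define

$$f(\pi) = \begin{cases} 1, & \text{if } \pi \text{ is a fortress} \\ 0, & \text{otherwise} \end{cases}$$

For a signed permutation $\tilde{\pi}$ with the image $\pi$ we define $b(\tilde{\pi}) = b(\pi)$, $c(\tilde{\pi}) = c(\pi)$, $h(\tilde{\pi}) = h(\pi)$ and $f(\tilde{\pi}) = f(\pi)$.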
Theorem 1 (HP95) For a signed permutation $\tilde{\pi}$ of order $n$, $d(\tilde{\pi}) = b(\tilde{\pi}) - c(\tilde{\pi}) + h(\tilde{\pi}) + f(\tilde{\pi})$.
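As an illustrative check of the formula (a worked instance added here, not taken from HP95), consider the cabbage gene order of Fig. 1. Its extended image has $b = 6$ black edges, and tracing the alternating cycles by hand gives $c = 3$ (one cycle each of length 2, 6 and 4). Since every reversal increases the number of cycles by at most 1 and the identity has $c = b$, at least $b - c = 3$ reversals are needed. Fig. 1 exhibits three, so $h + f$ must be 0 for this permutation:

```latex
\begin{align*}
\tilde{\pi} &= (+1\ {-5}\ {+4}\ {-3}\ {+2}), \qquad n = 5,\\
\pi &= (0\ \ 1\ 2\ \ 10\ 9\ \ 7\ 8\ \ 6\ 5\ \ 3\ 4\ \ 11) \quad \text{(extended image)},\\
b(\tilde{\pi}) &= n + 1 = 6, \qquad c(\tilde{\pi}) = 3,\\
d(\tilde{\pi}) &= b(\tilde{\pi}) - c(\tilde{\pi}) + h(\tilde{\pi}) + f(\tilde{\pi}) = 6 - 3 + 0 + 0 = 3.
\end{align*}
```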
Notice that Theorem 1 immediately leads to a naive polynomial algorithm based on exhaustive search, but a more efficient algorithm was developed in HP95, for which we propose a faster implementation in this paper. In the following we introduce certain notions essential to state the polynomial algorithm for sorting signed permutations by reversals developed in HP95.

Previous studies revealed that the complicated interleaving structure of long cycles (cycles of length greater than 4) in the cycle graphs poses serious difficulties in analyzing sorting by reversals (Bafna and Pevzner, 1993) and by transpositions (Bafna and Pevzner, 1995b). To get around this problem we introduce equivalent transformations of permutations. A permutation $\pi$ is simple if all cycles of $G(\pi)$ are of length at most 4.

Let $b = (v_b, w_b)$ be a black edge and $g = (w_g, v_g)$ be a gray edge belonging to a cycle $C = \ldots, v_b, w_b, \ldots, w_g, v_g, \ldots$ in the cycle graph $G(\pi)$ of a permutation $\pi$. A $(g, b)$-split of $G(\pi)$ is a new graph $\hat{G}(\pi)$ obtained from $G(\pi)$ by removing edges $g$ and $b$, adding two new vertices $v$ and $w$, adding two new black edges $(v_b, v)$ and $(w, w_b)$, and adding two new gray edges $(w_g, w)$ and $(v, v_g)$. Fig. 3 shows a $(g, b)$-split transforming a cycle $C$ in $G(\pi)$ into cycles $C_1$ and $C_2$ in $\hat{G}(\pi)$. If $G(\pi)$ is the cycle graph of a signed permutation $\pi$ then every $(g, b)$-split of $G(\pi)$ corresponds to the cycle graph of a signed generalized permutation $\hat{\pi}$ such that $\hat{G}(\pi) = G(\hat{\pi})$. Below we define generalized permutations and describe the padding procedure to find a generalized permutation $\hat{\pi}$ corresponding to a $(g, b)$-split of $G$.

[Figure 3 shows a cycle $C$ containing the black edge $b = (v_b, w_b)$ and the gray edge $g = (w_g, v_g)$, and the cycles $C_1$ and $C_2$ obtained by the $(g, b)$-split with the new vertices $v$ and $w$.]
Figure 3: Example of a $(g, b)$-split.

A generalized permutation $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ is a permutation of arbitrary distinct reals (versus permutations of the integers $\{1, 2, \ldots, n\}$ we considered before). In this section by permutations we mean generalized permutations and by generalized identity permutation we mean a generalized permutation $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ with $\pi_i < \pi_{i+1}$ for $1 \le i \le n - 1$. Extend a permutation $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ by adding $\pi_0 = \min_{1 \le i \le n} \pi_i - 1$ and $\pi_{n+1} = \max_{1 \le i \le n} \pi_i + 1$. Elements $\pi_j$ and $\pi_k$ of $\pi$ are consecutive if there is no element $\pi_l$ such that $\pi_j < \pi_l < \pi_k$ for $1 \le l \le n$. Elements $\pi_i$ and $\pi_{i+1}$ of $\pi$ are adjacent for $0 \le i \le n$. The cycle graph of a (generalized) permutation $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ is defined as the graph on vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\}$ with black edges between adjacent elements and gray edges between consecutive elements.
Obviously the definition of the cycle graph for generalized permutations is consistent with the notion of the cycle graph described earlier. Let $b = (\pi_{i+1}, \pi_i)$ be a black edge and $g = (\pi_j, \pi_k)$ be a gray edge belonging to a cycle $C = \ldots, \pi_{i+1}, \pi_i, \ldots, \pi_j, \pi_k, \ldots$ in the cycle graph $G(\pi)$. Define $\Delta = \pi_k - \pi_j$ and let $v = \pi_j + \frac{\Delta}{3}$, $w = \pi_k - \frac{\Delta}{3}$. A $(g, b)$-padding of $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ is a permutation on $n + 2$ elements obtained from $\pi$ by inserting $v$ and $w$ after the $i$-th element of $\pi$ ($0 \le i \le n$):

$$\hat{\pi} = (\pi_1 \pi_2 \ldots \pi_i \, v \, w \, \pi_{i+1} \ldots \pi_n)$$
Note that $v$ and $w$ are both consecutive and adjacent in $\hat{\pi}$, thus implying that if $\pi$ is (the image of) a signed permutation then $\hat{\pi}$ is also (the image of) a signed permutation. The following lemma establishes the correspondence between $(g, b)$-paddings and $(g, b)$-splits.$^2$
Lemma 1 (HP95) $\hat{G}(\pi) = G(\hat{\pi})$.
A $(g, b)$-padding $\phi$ transforming $\pi$ into $\hat{\pi}$ (i.e. $\hat{\pi} = \pi \cdot \phi$) is safe if it acts on non-incident edges of a long cycle and $h(\pi) = h(\hat{\pi})$. Clearly, every safe padding breaks a long cycle into two smaller cycles without affecting the reversal distance of the permutation.
Theorem 2 (HP95)
If $C$ is a long cycle in $G(\pi)$, then there exists a safe $(g, b)$-padding acting on $C$.
A permutation $\pi$ is equivalent to a permutation $\sigma$ ($\pi \sim \sigma$) if there exists a series of permutations $\pi = \pi(0), \pi(1), \ldots, \pi(k) = \sigma$ such that $\pi(i + 1) = \pi(i) \cdot \phi(i)$ for a safe $(g, b)$-padding $\phi(i)$ acting on $\pi(i)$ ($0 \le i \le k - 1$).
Theorem 3 (HP95)
For every permutation there exists an equivalent simple permutation.
Let $\hat{\pi}$ be a $(g, b)$-padding of $\pi$ and $\rho$ be a reversal acting on two black edges of $\hat{\pi}$. Then $\rho$ can be mimicked on $\pi$ by ignoring the padded elements. We need a generalization of this observation. A sequence of permutations $\pi = \pi(0), \pi(1), \ldots, \pi(k) = \sigma$ is called a generalized sorting of $\pi$ if $\sigma$ is the (generalized) identity permutation and $\pi(i + 1)$ is obtained from $\pi(i)$ either by a reversal or by a padding. Note that reversals and paddings in a generalized sorting of $\pi$ may interleave.
Lemma 2 (HP95)
Every generalized sorting of $\pi$ mimics a (genuine) sorting of $\pi$ with the same number of reversals.
A reversal $\rho$ on $\pi$ is safe if $(b(\pi) - c(\pi) + h(\pi)) - (b(\pi\rho) - c(\pi\rho) + h(\pi\rho)) = 1$. Let $K$ be an oriented component of $H_\pi$ and let $R(K)$ be the set of reversals acting on oriented cycles from $K$. Assume that a reversal $\rho \in R(K)$ "breaks" $K$ into a number of connected components (this phenomenon is made clearer in section 4). Define $Index(\rho)$ as the union of the unoriented components of $H_{\pi\rho}$ originally contained in $K$, and $index(\rho)$ as $|Index(\rho)|$. If $index(\rho) > 0$ then $\rho$ may be unsafe, since some of the new components that form $Index(\rho)$ may create new hurdles (recall that hurdles are special instances of unoriented components) in $\pi\rho$, thus increasing $h(\pi\rho)$ as compared to $h(\pi)$. However, if $index(\rho) = 0$, then $\rho$ is guaranteed to be safe. In Section 4 we show that the set $R(K)$ always contains such a safe reversal, thus providing a new proof of the following theorem:

$^2$ Of course, a $(g, b)$-padding of a permutation $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ on $\{1, 2, \ldots, n\}$ can be modeled as a permutation $\hat{\pi} = (\hat{\pi}_1 \hat{\pi}_2 \ldots \hat{\pi}_i \, v \, w \, \hat{\pi}_{i+1} \ldots \hat{\pi}_n)$ on $\{1, 2, \ldots, n + 2\}$ where $v = \pi_j + 1$, $w = \pi_k + 1$, $\hat{\pi}_i = \pi_i + 2$ if $\pi_i > \min\{\pi_j, \pi_k\}$ and $\hat{\pi}_i = \pi_i$ otherwise. The generalized permutations were introduced to make the following "mimicking" procedure more intuitive.
Theorem 4 (HP95)
For every oriented component $K$ in $H_\pi$ there exists a (safe) reversal $\rho \in R(K)$ such that all components of $H_{\pi\rho}$ contained in $K$ are oriented (i.e. $index(\rho) = 0$).
Our new proof will allow us to find this safe reversal more efficiently. A reversal cuts a hurdle if it acts on a cycle of the hurdle. A reversal merges two hurdles if it acts on black edges belonging to both hurdles.
Lemma 3 (HP95)
A reversal acting on a cycle of a simple hurdle is safe.
Lemma 4 (HP95)
If $h(\pi) > 3$ then there exists a safe reversal merging two hurdles in $\pi$.
Lemma 5 (HP95)
If $h(\pi) = 2$ then there exists a safe reversal merging two hurdles in $\pi$. If $h(\pi) = 1$ then there exists a safe reversal cutting the only hurdle in $\pi$.
Lemmas 2, 3, 4 and 5 and Theorems 1, 2 and 4 motivate the algorithm Reversal Sort (Figure 4), which optimally sorts signed permutations.
Theorem 5 (HP95)
Reversal Sort($\pi$) optimally sorts a permutation $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ in $O(n^4)$ time.
Proof Sketch: Theorem 1 implies that Reversal Sort provides a generalized sorting of $\pi$ by a series of reversals and paddings containing $d(\pi)$ reversals. Lemma 2 implies that this generalized sorting mimics an optimal (genuine) sorting of $\pi$ by $d(\pi)$ reversals. We sketch an $O(n^4)$ implementation of Reversal Sort($\pi$) (the description of data structures is omitted). Define the complexity of a permutation $\pi$ as $\sum_{C \in \mathcal{C}_\pi} (l(C) - 2)$, where $\mathcal{C}_\pi$ is the set of cycles in $G(\pi)$ and $l(C)$ is the number of black (or equivalently, gray) edges in $C$. Clearly, the complexity of a simple permutation is 0. Note that every iteration of the while loop in Reversal Sort reduces the amount $complexity(\pi) + 3d(\pi)$ by at least 1 ($complexity(\pi)$ increases by 2 when two hurdles are merged but $d(\pi)$ decreases by 1 in that step), thus implying that the number of iterations of Reversal Sort is bounded by $4n$.
Algorithm Reversal Sort($\pi$) (HP95)
1. while $\pi$ is not sorted
2.     if $\pi$ has a long cycle
3.         select a safe $(g, b)$-padding $\rho$ of $\pi$ (theorem 2)
4.     else if $\pi$ has an oriented component
5.         select a safe reversal $\rho$ in this component (theorem 4)
6.     else if $\pi$ has an even number of hurdles
7.         select a safe reversal $\rho$ merging two hurdles in $\pi$ (lemmas 4 and 5)
8.     else if $\pi$ has at least one simple hurdle
9.         select a safe reversal $\rho$ cutting this hurdle in $\pi$ (lemmas 3 and 5)
10.    else if $\pi$ is a fortress with more than 3 superhurdles
11.        select a safe reversal $\rho$ merging two (super)hurdles in $\pi$ (lemma 4)
12.    else /* $\pi$ is a 3-fortress */
13.        select an (un)safe reversal $\rho$ merging two arbitrary (super)hurdles in $\pi$
14.    $\pi \leftarrow \pi \cdot \rho$
15. endwhile
16. mimic (genuine) sorting of $\pi$ using the computed generalized sorting of $\pi$ (lemma 2)
Figure 4: Polynomial algorithm for sorting signed permutations by reversals

Steps 2 and 3 can be implemented in linear time (which will become self-evident after we describe the fast algorithm for computing connected components in section 3). Computing the conditions in the rest of the steps requires information about the connected components of the interleaving graph of the permutation. Moreover, step 5 also computes connected components a number of times ($O(n)$ times in the worst case) in search of a safe reversal. Computing connected components can be implemented in $O(n^2)$ time, and hence step 5, which is the bottleneck of any iteration, can be implemented in $O(n^3)$ time. Conditions in steps 6, 8 and 10 can be computed in linear time (evident in HP95). This gives us an $O(n^4)$ time implementation of Reversal Sort. $\Box$
3 Finding connected components in $O(n\alpha(n))$ time

In algorithm "Reversal Sort", steps 4 through 13 implicitly require the computation of the connected components of the interleaving graph $H_\pi$. Particularly in step 4, when one of the components is oriented, we need to evaluate $index(\rho)$ for a series of candidate reversals $\rho$, which in turn mainly involves computing the connected components of the interleaving graph $H_{\pi\rho}$. Using a general method, finding connected components requires scanning all the edges of the given graph and hence cannot be accomplished in $o(m)$ time, where $m$ is the number of edges. But the interleaving graph is a simple case, and a linear scan of the permutation with disjoint-set operations at every step suffices to find the connected components. The following discussion covers only simple permutations, since this suffices to support our claims.

A node of $H_\pi$ is a 4-cycle and can be represented by one of its gray edges (say the one originating at the leftmost position). We will view a gray edge $e$ as a record with fields $B$ and $E$ (the beginning and the ending), so $e = (e.B, e.E)$. For the algorithm, we mark which of the $2n + 2$ positions corresponding to the (image of) a simple signed permutation are the beginnings and endings of gray edges representing 4-cycles. The algorithm performs a linear scan on the positions of the permutation.

An edge of $H_\pi$ is a pair $(e, f)$ such that $e.B < f.B < e.E < f.E$. After scanning the positions from 0 through $j$ we can detect such an edge unless both $e.E$ and $f.E$ are larger than $j$. We construct the graph $H_j$ by retaining only those nodes and edges of $H_\pi$ that can be detected in the scan from 0 to $j$ (thus $H_0$ has at most one node and no edge). In the $i$-th iteration (starting with the 0-th iteration), if a new node is detected we create a new component corresponding to it, and if an edge is detected we compute the connected components of $H_i$ by performing unions on the components of $H_{i-1}$. At the end of the scan the sets of the Find-Union structure should form the connected components of $H_{2n+1} = H_\pi$.

A node $e$ of $H_j$ is active iff $e.B \le j < e.E$. A component $C$ of $H_j$ is active if it contains an active node; in this case we define $handle(C)$ as an active node $e$ of $C$ with the maximum ending $e.E$. Note that if $e$ is not active and $e.B \le j$, then it has the same adjacent nodes in $H_j$ as in $H_\pi$. Consequently, if $C$ is an inactive component of $H_j$, it remains unchanged in the following iterations and hence is also a component of $H_\pi$. A Union/Find structure is used to maintain the connected components, with an additional field to store the handle of the component (if it is active). When two components are merged, we look at their handles; the one with the larger end position becomes the handle of the merged component. We store the beginning positions of the handles of the active components in a stack.

Figure 5 shows the algorithm to compute the connected components of an interleaving graph based on the ideas discussed above. Step 7 involves one disjoint-set operation and step 9 involves 2 disjoint-set operations. Every union operation reduces the number of components by one, hence the overall number of times step 9 is executed is $O(n)$. With union by rank and path compression, $O(n)$ disjoint-set operations can be executed in $O(n\alpha(n))$ time. It is easy to see that the rest of the steps involve only a constant amount of work for each of the $2n + 2$ positions, which gives a running time of $O(n\alpha(n))$ for Connected Components.

Let $C^i_e$ be the connected component of $H_i$ containing $e$, and let $\bar{e}^i$ be the handle of $C^i_e$. We will omit the superscripts since they will be obvious from the context.
Lemma 6 Assume that $f, g, h$ are three active nodes in $H_i$ such that $f.B < g.B < h.B$. If $C_f = C_h = C$ then $C_g = C$.
Algorithm Connected Components((2n + 2))
1. Stack $\leftarrow \emptyset$
2. For $i = 0$ to $2n + 1$
3.     If $i$ is the beginning of an edge $(i, j)$
4.         Create a new component with set $\{i, j\}$ and handle $(i, j)$
5.         push($i$)
6.     Else if $i$ is the end of an edge
7.         $C \leftarrow$ Find($i$), $s \leftarrow handle(C).B$
8.         While top $\ge s$
9.             $C \leftarrow$ Union(Find(pop), $C$)
10.        $e \leftarrow handle(C)$
11.        If $e.E > i$ /* if C is active */
12.            push($e.B$)
Figure 5: Algorithm for computing connected components of an interleaving graph of a permutation
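The scan of Figure 5 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the input is assumed to be a list of gray edges given as `(B, E)` pairs with pairwise distinct non-negative integer endpoints, the name `interleaving_components` is hypothetical, and the union-find uses union by rank with path halving (a standard variant of path compression).

```python
def interleaving_components(intervals):
    """Group intervals (B, E) into connected components of their interleaving graph.

    Two intervals e, f interleave when e.B < f.B < e.E < f.E.
    Returns a sorted list of sorted index lists.
    """
    if not intervals:
        return []
    n = len(intervals)
    parent = list(range(n))
    rank = [0] * n
    handle = list(range(n))  # handle[root]: member of the set with the largest end

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return rx
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx
        if rank[rx] == rank[ry]:
            rank[rx] += 1
        if intervals[handle[ry]][1] > intervals[handle[rx]][1]:
            handle[rx] = handle[ry]
        return rx

    begins = {b: k for k, (b, _) in enumerate(intervals)}
    ends = {e: k for k, (_, e) in enumerate(intervals)}
    stack = []  # beginnings of handles of active components, increasing
    for pos in range(max(ends) + 1):
        if pos in begins:                      # steps 3-5
            stack.append(pos)
        elif pos in ends:                      # steps 6-12
            c = find(ends[pos])
            s = intervals[handle[c]][0]
            while stack and stack[-1] >= s:
                c = union(begins[stack.pop()], c)
            h = handle[c]
            if intervals[h][1] > pos:          # the merged component is still active
                stack.append(intervals[h][0])

    groups = {}
    for k in range(n):
        groups.setdefault(find(k), []).append(k)
    return sorted(sorted(g) for g in groups.values())


print(interleaving_components([(0, 6), (1, 3), (2, 4), (5, 7)]))
```

In the example, intervals 1 and 2 interleave each other but are nested inside interval 0, which interleaves only interval 3, so the scan reports two components.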
Proof Assume the contrary. Then we can partition $C$ into two disjoint subsets: $C_1 = \{a \in C \mid a.B < g.B$ and either $a.E < g.B$ or $a.E \ge i\}$, and $C_2 = \{a \in C \mid a.B > g.B\}$, since any other node would be adjacent to $g$. Note that $f \in C_1$ and $h \in C_2$, so both parts are non-empty. Moreover, no node in $C_1$ is adjacent to a node in $C_2$ in $H_i$, so $C = C_1 \cup C_2$ is not connected, a contradiction. $\Box$

Theorem 6 After executing Connected Components, the sets of the Union/Find structure are the connected components of $H_\pi$.

Proof It suffices to show that after the $k$-th iteration (the iteration with $i = k$) of the algorithm the sets of Union/Find are the connected components of $H_k$, while the stack consists of the beginnings of the handles of the active components, in increasing order. We will prove it by induction. The basis step is trivial. For the induction step, consider the $k$-th iteration of step 2. If $k = e.B$ for some gray edge $e$ then the graphs $H_{k-1}$ and $H_k$ do not differ, except that $C_e$ becomes a one-element active component; hence the only action required to maintain the inductive claim is pushing $k$ on the stack, as it is the beginning of the handle of $C_e$, i.e. $k = \bar{e}.B$. Quite obviously, at this time it is the largest beginning of a handle.

Now we consider the case when $k = e.E$ for some gray edge $e$. The only edges of $H_k$ that are not in $H_{k-1}$ have the form $(e, f)$, where $e.B < f.B < e.E = k < f.E$. Let $A = \{f : e.B < f.B < e.E < f.E$ and $C_f \ne C_e\}$. Clearly, to compute the components of $H_k$ and maintain the stack, it suffices to remove $C_f$ for each $f \in A$ and $C_e$ and replace these components with their union. Moreover we need to ensure that the resulting component is uniquely represented on the stack by its handle iff it remains active.

To show that Connected Components does exactly that in steps 8-12, we first claim that $f \in A$ iff $C_f$ is active and $\bar{e}.B < \bar{f}.B$. Assume that $f \in A$. Clearly, $f$ is active in $H_{k-1}$ and $\bar{e}.B \ne \bar{f}.B$ since $C_f \ne C_e$. If $\bar{e}.B > \bar{f}.B$, then for all placements of $e.B$, $f.B$, $\bar{e}.B$ and $\bar{f}.B$ obeying the ordering constraints, either the triplet $(e.B, f.B, \bar{e}.B)$ or the triplet $(\bar{f}.B, \bar{e}.B, f.B)$ contradicts Lemma 6. This implies that $\bar{e}.B < \bar{f}.B$. Now, assume that $f$ is active and $\bar{e}.B < \bar{f}.B$. Clearly $C_f \ne C_e$ and by Lemma 6, $e.B < f.B$, implying that $f \in A$.

By our inductive assumption concerning the stack, $f \in A$ iff the beginning of the handle of $C_f$ is popped from the stack in step 9, and consequently iff $C_f$ is incorporated into $C$. Notice that the handle of $C_e$ is also popped from the stack due to the inequality condition of step 8. The condition in step 11 checks whether $C$ is active, in which case the beginning of its handle is pushed onto the stack. Moreover, the same claim ensures that the stack is changed properly by the pops in step 9 and the push in step 12. $\Box$
4 Finding safe reversal in $O(n\alpha(n))$ time

The problem can be better illustrated by the following puzzle. We are given a connected graph $G$ whose vertices are colored either green or red, with at least one vertex colored green. Let $N(u)$ denote the set of vertices adjacent to $u$ in $G$. The goal is to delete all the vertices under the following restrictions: only a single green vertex $u$ can be deleted at a time, and the resulting graph $G'$ is obtained from $G$ by (i) switching the color of every vertex $v \in N(u)$ (changing green to red and vice versa), (ii) switching the adjacency of every pair of vertices $(v, w)$ where $v, w \in N(u)$ (making $v$ and $w$ adjacent to each other in $G'$ if they are not adjacent in $G$ and vice versa), and (iii) deleting $u$.

A little inspection of the rules reveals that deleting a certain green vertex may create a graph $G'$ which is no longer connected. If one of the connected components $C$ of $G'$ has only red vertices then we cannot possibly delete any of the vertices of $C$ directly, and deleting vertices from other components does not affect $C$. So our goal is to choose a green vertex for deletion such that every component of the resulting graph has at least one green vertex. Proceeding in this manner we can delete all the vertices of $G$ recursively.

Searching for a safe reversal in an oriented component is exactly the aforementioned problem for a subclass of graphs for which it is always possible to delete all the vertices. This subclass is the class of interleaving graphs of simple signed permutations, where an oriented cycle corresponds to a green vertex and an unoriented cycle corresponds to a red vertex.

At first glance, finding a safe reversal in an oriented component $K$ seems to require an exhaustive search among all the oriented cycles of $K$, but looking closely at the combinatorial properties of the interleaving graph we can avoid that. The proof of Theorem 4 shows that if for a particular reversal $\rho$ acting on an oriented cycle of $K$, $index(\rho) > 0$ (i.e. $\rho$ creates some unoriented components), then there exists an alternative reversal $\sigma$ acting on another oriented cycle of $K$ such that $index(\sigma) < index(\rho)$. In the worst case we might end up trying $O(n)$ candidate reversals before we find a safe reversal with zero index. One of the ideas behind the proposed improvement is to find a reversal $\sigma$ such that $index(\sigma) \le \frac{1}{2} index(\rho)$, thus bounding the number of trials by $O(\log n)$. Another idea behind the proposed improvement is that we can reduce the problem size for finding $\sigma$ and all the subsequent candidates. This is achieved by showing that we can ignore part of the interleaving graph (or equivalently, of the permutation) without sacrificing any information in evaluating the index of the alternate reversal. $O(\log n)$ trials, with the problem size halving at every consecutive trial, lead to an almost linear time algorithm.

In the following we assume (w.l.o.g.) that $H_\pi$ has a single component and that it is oriented. Moreover, we use $O$ to denote the set of oriented cycles in $H_\pi$. Obviously, if $|O| = 1$, the problem is trivial as the only cycle in $O$ defines a safe reversal. Therefore we will assume that $|O| > 1$.
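The deletion rule of the green/red puzzle can be written down directly. The sketch below works on an arbitrary graph given as a dictionary of neighbor sets; the names `delete_green`, `components` and `is_safe_deletion` are illustrative, and the check simply recomputes all components, so it is the naive test that this section sets out to avoid.

```python
from itertools import combinations


def delete_green(adj, green, u):
    """Delete green vertex u: toggle colors and mutual adjacencies of its neighbors."""
    assert u in green
    nbrs = adj[u]
    new_adj = {v: set(ns) - {u} for v, ns in adj.items() if v != u}
    for v, w in combinations(sorted(nbrs), 2):  # rule (ii)
        if w in new_adj[v]:
            new_adj[v].discard(w)
            new_adj[w].discard(v)
        else:
            new_adj[v].add(w)
            new_adj[w].add(v)
    new_green = (green - {u}) ^ nbrs            # rules (i) and (iii)
    return new_adj, new_green


def components(adj):
    """Connected components of a graph given as {vertex: set of neighbors}."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, todo = set(), [s]
        while todo:
            x = todo.pop()
            if x in comp:
                continue
            comp.add(x)
            todo.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def is_safe_deletion(adj, green, u):
    """True if every component left after deleting u contains a green vertex."""
    new_adj, new_green = delete_green(adj, green, u)
    return all(comp & new_green for comp in components(new_adj))


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(is_safe_deletion(path, {"b"}, "b"))
print(is_safe_deletion(triangle, {"a", "b", "c"}, "a"))
```

In the path, deleting the green middle vertex turns both ends green and joins them, so the deletion is safe. In the all-green triangle, deleting any vertex leaves two isolated red vertices; this is a configuration of the general puzzle, not necessarily one that arises from a permutation.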
Lemma 7 Given two reversals $\rho$ and $\sigma$ acting on oriented cycles $u$ and $v$ in $O$ respectively, if $Index(\rho) \cap O$ is disjoint with $Index(\sigma) \cap O$ then $Index(\rho)$ is disjoint with $Index(\sigma)$.
Proof Assume the contrary and let $w \in Index(\rho) \cap Index(\sigma)$ when $Index(\rho) \cap O$ is disjoint with $Index(\sigma) \cap O$. Let $U$ be the unoriented component containing $w$ in $H_\pi \setminus O$ (the subgraph of $H_\pi$ induced by unoriented vertices). Since $H_\pi$ is connected, there exists an oriented cycle $s$ adjacent to some cycle of $U$. Assume w.l.o.g. that $s \notin Index(\rho) \cap O$ (i.e. $s$ ends up in an oriented component in $H_{\pi\rho}$). Note that there is a path $w = w_0, w_1, \ldots, w_k = s$ in $H_\pi$ where $s$ is the only oriented cycle in this path. Since $w \in Index(\rho)$ (i.e. $w$ ends up in an unoriented component in $H_{\pi\rho}$), this path is broken in $H_{\pi\rho}$. For this path to get broken there must be $w_i$, $i < k$, such that both $w_i$ and $w_{i+1}$ are adjacent to $u$ (or $w_{i+1} = u$) in $H_\pi$, since $w_i$ is not adjacent to $w_{i+1}$ in $H_{\pi\rho}$ (HP95). Consider the smallest such $i$. Clearly $w_i$ is unoriented in $H_\pi$ and thus becomes oriented in $H_{\pi\rho}$, and hence $w$ belongs to an oriented component in $H_{\pi\rho}$, a contradiction. $\Box$
Lemma 8 There exists a pair of reversals $\rho$ and $\sigma$ acting on oriented cycles u and v in O respectively, such that $\mathrm{Index}(\rho)$ is disjoint from $\mathrm{Index}(\sigma)$.

Proof By Lemma 7 it suffices to exhibit $\rho$ and $\sigma$ such that $\mathrm{Index}(\rho) \cap \mathrm{Index}(\sigma) \cap O$ is empty. We consider two cases.

Case 1: O is not a clique in H. We choose $u, v \in O$ such that u and v are not adjacent in H. Let $w \in \mathrm{Index}(\rho) \cap O$ (i.e. $w \in O$ belongs to an unoriented component in $H_\rho$). As w is unoriented in $H_\rho$, it must be adjacent to u in H. Because u and v are not adjacent in H, in $H_\sigma$ cycle w is still adjacent to u and u remains oriented, hence $w \notin \mathrm{Index}(\sigma)$.

Case 2: O is a clique in H. Let P be the set of unoriented cycles adjacent in H to some cycle in O. For $t \in P$ let $O_t$ be the set of nodes in O adjacent to t. We fix $s \in P$ such that the set $O_s$ has the minimum size.

Case 2.1: For some $t \in P$, $O_s \not\subseteq O_t$. Then we choose $u \in O_s - O_t$ and $v \in O_t - O_s$. If $w \in \mathrm{Index}(\rho) \cap O$ then w is adjacent to s and not adjacent to t in H (otherwise w would either become adjacent to s in $H_\rho$, or it would still be adjacent to t, which is still adjacent to v, which becomes adjacent to s; in either case it would become connected with s, while s becomes oriented). Since w is adjacent to s but not to t, we can conclude (by a symmetric reasoning) that $w \notin \mathrm{Index}(\sigma) \cap O$.

Case 2.2: For every $t \in P$, $O_s \subseteq O_t$. Suppose that $O_s$ contains two distinct cycles u and v. Then u and v have the same set of neighbors in H, and so in $H_\rho$ cycle v becomes unoriented and isolated, a contradiction, since this is an impossible configuration of a cycle graph (HP95). Thus we can define u as the sole member of $O_s$. We can show that $\mathrm{Index}(\rho) \cap O$ is empty. Indeed, if $v \in O - \{u\}$, then in H cycles v and s are both adjacent to u and not adjacent to each other, while s is unoriented. Therefore in $H_\rho$ cycle v is adjacent to the (now) oriented cycle s. $\Box$

We will briefly describe how to efficiently find a pair of vertices u and v satisfying Lemma 8. Assume that $O = \{u_1, \ldots, u_m\}$ where $u_1.B < \cdots < u_m.B$. If $u_i.E > u_{i+1}.E$ for some i, then $u_i$ and $u_{i+1}$ are not adjacent and so satisfy Case 1. Similarly, if $u_1.E < u_m.B$, then $u_1$ and $u_m$ are not adjacent. In the remaining case $u_1.B < \cdots < u_m.B < u_1.E < \cdots < u_m.E$ and O is a clique, and so we are in Case 2. Now observe that the ends of oriented cycles split $\{0, \ldots, 2n+1\}$ into $2m+1$ intervals and that we can compute the array NOI such that NOI[a] gives the number of the interval that contains a. Observe that $O_t$ is determined by the pair (NOI[t.B], NOI[t.E]).
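The search just described can be sketched in Python as follows. The sketch assumes that each cycle is given as a pair (B, E) of distinct positions with B < E, and that two cycles are adjacent exactly when their pairs interleave, which is how the text above treats nested and disjoint pairs. The function names and the 0-based indexing are hypothetical. For readability `oriented_neighbours` materialises the set $O_t$, but it is the symmetric difference of two index ranges, so its size and membership tests follow from the pair (NOI[t.B], NOI[t.E]) alone.

```python
def find_nonadjacent_pair(oriented):
    """Return indices (i, j) of two non-interleaving cycles, or None if the
    cycles form a clique.  `oriented` is a list of (B, E) pairs."""
    order = sorted(range(len(oriented)), key=lambda i: oriented[i][0])
    # A cycle nested inside its predecessor in B-order is not adjacent to it.
    for a, b in zip(order, order[1:]):
        if oriented[a][1] > oriented[b][1]:
            return a, b
    first, last = order[0], order[-1]
    # The first cycle ends before the last one begins: they are disjoint.
    if oriented[first][1] < oriented[last][0]:
        return first, last
    return None  # B_1 < ... < B_m < E_1 < ... < E_m, so O is a clique


def build_noi(oriented, size):
    """NOI[a] = number of oriented-cycle endpoints at positions <= a, i.e.
    the index (0..2m) of the interval containing a non-endpoint position a."""
    is_end = [0] * size
    for b, e in oriented:
        is_end[b] = 1
        is_end[e] = 1
    noi, count = [], 0
    for a in range(size):
        count += is_end[a]
        noi.append(count)
    return noi


def oriented_neighbours(noi, t, m):
    """Clique case only: indices (in B-order, 0-based) of the oriented cycles
    that interleave with the unoriented cycle t = (B, E)."""
    p, q = noi[t[0]], noi[t[1]]
    # Endpoints strictly inside t have ranks p..q-1; ranks 0..m-1 are the
    # B-ends and ranks m..2m-1 are the E-ends of cycles 0..m-1.
    b_inside = set(range(min(p, m), min(q, m)))
    e_inside = set(range(max(p, m) - m, max(q, m) - m))
    # A cycle interleaves with t iff exactly one of its ends lies inside t.
    return b_inside ^ e_inside


clique = [(1, 6), (3, 8), (4, 10)]      # already sorted by B
noi = build_noi(clique, 12)              # positions 0..11
print(find_nonadjacent_pair(clique))
print(oriented_neighbours(noi, (5, 9), len(clique)))
```

For the three cycles in the example no non-adjacent pair exists, so Case 2 applies, and the unoriented cycle (5, 9) interleaves with the first two oriented cycles only.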
In particular, after these pairs are computed, it takes constant time to compute the size of $O_t$, to check two such sets for the subset relation, and to find an element of such a set or an element of the difference of two such sets. Details are left to the reader. Since two disjoint sets of cycles cannot both contain more than half of the cycles of H, Lemma 8 implies the following theorem.
Theorem 7 There exists a reversal $\rho$ on an oriented cycle in O such that at most half of all the cycles in H end up in an unoriented component of $H_\rho$, i.e. $\mathrm{index}(\rho) \le \frac{|H|}{2}$.

The following theorem suggests a way to reduce the number of cycles that must be considered when evaluating the index of an alternative reversal.
Theorem 8 Let $\rho$ be a reversal on an oriented cycle u in H. Let $K(\rho)$ be an unoriented component created in $H_\rho$. Let $\sigma$ be a reversal on an oriented cycle v belonging to $K(\rho)$ (v is oriented in H and unoriented in $H_\rho$). A cycle w belongs to an unoriented component in $H_\sigma$ iff w belongs to an unoriented component in the subgraph of $H_\sigma$ induced by $K(\rho)$ (we will refer to this subgraph as $H^K_\sigma$).

Algorithm Find Safe Reversal(C)
1. Find a reversal $\rho$ on an oriented cycle of C such that $\mathrm{index}(\rho) \le \frac{|C|}{2}$. (Theorem 7)
2. If $\rho$ is safe, return $\rho$ as a safe reversal.
3. Else
4.     /* Let K be an unoriented component of $H_\rho$. */
       Find Safe Reversal(K) (Theorem 8)

Figure 6: An $O(n\,\alpha(n))$ algorithm for finding a safe reversal in an oriented component
Proof If w belongs to an unoriented component in $H_\sigma$ then w belongs to an unoriented component in $H^K_\sigma$ (the proof of this statement is embedded in the proof of Theorem 4). To prove the converse, assume the contrary and let w be a cycle belonging to an unoriented component W in $H^K_\sigma$ but belonging to an oriented component $\widehat{W}$ of $H_\sigma$ itself. Let $W' = \widehat{W} \setminus W$. Let $s \in W$ and $t \in W'$ be such that edge $(s,t) \in H_\sigma$. We will consider two cases.

Case 1: $(s,t) \in H$. Since $(s,t) \in H$ and $(s,t) \notin H_\rho$ (s and t belong to different components of $H_\rho$), $(s,u), (t,u) \in H$. Moreover, s is oriented in H since it is unoriented in $H_\rho$. This implies that $(s,v) \in H$, since s is unoriented in $H_\sigma$. Moreover, since the edge (s,t) is both in H and in $H_\sigma$, either $(s,v) \notin H$ or $(t,v) \notin H$. Since it is already shown that $(s,v) \in H$, it follows that $(t,v) \notin H$. Since $(t,v) \notin H$ and $(t,u), (v,u) \in H$, edge $(v,t) \in H_\rho$, a contradiction to the fact that v and t belong to different components in $H_\rho$.

Case 2: $(s,t) \notin H$. $(s,t) \in H_\sigma$ and $(s,t) \notin H$ imply that $(s,v), (t,v) \in H$. $(t,v) \in H$ and $(t,v) \notin H_\rho$ imply that $(t,u) \in H$. $(s,t) \notin H$ and $(s,t) \notin H_\rho$ imply that either $(s,u) \notin H$ or $(t,u) \notin H$, which implies that $(s,u) \notin H$. Since $(s,u) \notin H$ and s is unoriented in $H_\rho$, s is unoriented in H. And since $(s,v) \in H$, s is oriented in $H_\sigma$, a contradiction. $\Box$

Theorems 7 and 8 immediately lead to the recursive algorithm Find Safe Reversal for finding a safe reversal in an oriented component (Figure 6). There are at most $\log(|O|)$ recursive calls to Find Safe Reversal, and in every successive call the input size is reduced by a factor of 2. Step 1 involves trying at most 2 candidates (Theorem 7), each of which involves computing the connected components, which can be done in $O(m\,\alpha(m))$ time where m is the current input size. Hence the overall running time of Find Safe Reversal is $O(n\,\alpha(n))$ where n is the size of the oriented component. This leads to an $O(n^2 \alpha(n))$ implementation of Reversal Sort.
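The running time claimed for Find Safe Reversal can be checked by unrolling the recursion. In the derivation below, $T(m)$ denotes the time spent on an input with m cycles and c is a constant covering the two candidate evaluations of Step 1. Both symbols are introduced here for illustration, and the derivation assumes that $\alpha$ is nondecreasing.

```latex
\begin{align*}
T(m) &\le T\!\left(\tfrac{m}{2}\right) + c\, m\, \alpha(m), \\
T(n) &\le \sum_{i \ge 0} c\, \frac{n}{2^{i}}\, \alpha\!\left(\frac{n}{2^{i}}\right)
      \le c\, \alpha(n) \sum_{i \ge 0} \frac{n}{2^{i}}
      = 2c\, n\, \alpha(n) = O(n\, \alpha(n)).
\end{align*}
```

The geometric sum is the reason the logarithmic number of recursive calls does not add a $\log n$ factor. The $O(n^2 \alpha(n))$ bound for Reversal Sort then follows if, as the stated bound suggests, O(n) reversals are performed and each is found in $O(n\,\alpha(n))$ time.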
References
[1] M. Aigner and D. B. West. Sorting by insertion of leading element. Journal of Combinatorial Theory, 45:306-309, 1987.
[2] V. Bafna and P. Pevzner. Genome rearrangements and sorting by reversals. In 34th Annual IEEE Symposium on Foundations of Computer Science, pages 148-157, 1993. (to appear in SIAM J. Computing).
[3] V. Bafna and P. Pevzner. Sorting by reversals: Genome rearrangements in plant organelles and evolutionary history of X chromosome. Mol. Biol. and Evol., 12:239-246, 1995a.
[4] V. Bafna and P. Pevzner. Sorting by transpositions. In Proc. 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 614-623, 1995b.
[5] D. Cohen and M. Blum. Improved bounds for sorting pancakes under a conjecture. 1993 (manuscript).
[6] S. Even and O. Goldreich. The minimum-length generator sequence problem is NP-hard. Journal of Algorithms, 2:311-313, 1981.
[7] W. H. Gates and C. H. Papadimitriou. Bounds for sorting by prefix reversals. Discrete Mathematics, 27:47-57, 1979.
[8] S. Hannenhalli. Polynomial algorithm for computing translocation distance between genomes. In Combinatorial Pattern Matching, Proc. 6th Annual Symposium (CPM'95), Lecture Notes in Computer Science, pages 162-176. Springer-Verlag, Berlin, 1995.
[9] S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In Proc. 27th Annual ACM Symposium on the Theory of Computing, pages 178-189, 1995a.
[10] S. Hannenhalli and P. Pevzner. Transforming men into mice (polynomial algorithm for genomic distance problem). In 36th Annual IEEE Symposium on Foundations of Computer Science, pages 581-592, 1995c.
[11] S. Hannenhalli and P. Pevzner. To cut ... or not to cut (applications of comparative physical maps in molecular evolution). In Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 304-313, 1996.
[12] M. Heydari and I. H. Sudborough. On sorting by prefix reversals and the diameter of pancake networks. 1993 (manuscript).
[13] M. Jerrum. The complexity of finding minimum-length generator sequences. Theoretical Computer Science, 36:265-289, 1985.
[14] J. Kececioglu and D. Gusfield. Reconstructing a history of recombinations from a set of sequences. In 5th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 471-480, 1994.
[15] J. Kececioglu and R. Ravi. Of mice and men: Evolutionary distances between genomes under translocation. In Proc. 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 604-613, 1995.
[16] J. Kececioglu and D. Sankoff. Exact and approximation algorithms for the inversion distance between two permutations. In Combinatorial Pattern Matching, Proc. 4th Annual Symposium (CPM'93), volume 684 of Lecture Notes in Computer Science, pages 87-105. Springer-Verlag, Berlin, 1993. (Extended version has appeared in Algorithmica, 13:180-210, 1995.)
[17] J. Kececioglu and D. Sankoff. Efficient bounds for oriented chromosome inversion distance. In Combinatorial Pattern Matching, Proc. 5th Annual Symposium (CPM'94), volume 807 of Lecture Notes in Computer Science, pages 307-325. Springer-Verlag, Berlin, 1994.
[18] C. A. Makaroff and J. D. Palmer. Mitochondrial DNA rearrangements and transcriptional alterations in the male sterile cytoplasm of Ogura radish. Molecular and Cellular Biology, 8:1474-1480, 1988.
[19] J. D. Palmer and L. A. Herbon. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. Journal of Molecular Evolution, 27:87-97, 1988.
[20] P. A. Pevzner and M. S. Waterman. Open combinatorial problems in computational molecular biology. In 3rd Israel Symposium on Theory of Computing and Systems, pages 158-163. IEEE Computer Society Press, 1995.
[21] D. Sankoff. Edit distance for genome comparison based on non-local operations. In Combinatorial Pattern Matching, Proc. 3rd Annual Symposium (CPM'92), volume 644 of Lecture Notes in Computer Science, pages 121-135. Springer-Verlag, Berlin, 1992.
[22] D. Sankoff, R. Cedergren, and Y. Abel. Genomic divergence through gene rearrangement. In Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, chapter 26, pages 428-438. Academic Press, 1990.
[23] D. Sankoff, G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren. Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA, 89:6575-6579, 1992.