General Techniques for Comparing Unrooted Evolutionary Trees
Ming-Yang Kao, Tak-Wah Lam, Teresa M. Przytycka, Wing-Kin Sung, Hing-Fung Ting
Abstract
An agreement subtree of two evolutionary trees is an evolutionary tree which is also a topological subtree of the two given trees. A maximum agreement subtree is one with the largest possible number of leaves. Different theories about the evolutionary relationship of the same species often result in different evolutionary trees. A fundamental problem in computational biology is to determine how much two theories have in common. To a certain extent, this problem can be solved by computing a maximum agreement subtree of two given evolutionary trees [8]. Algorithms for computing a maximum agreement subtree of two unrooted evolutionary trees, as well as of two rooted evolutionary trees, have been studied intensively in the past few years. The unrooted case is more difficult than the rooted case. There is indeed a linear-time reduction from the rooted case to the unrooted one, but the reverse is not known. Finden and Gordon [8] gave a heuristic for finding an agreement subtree of two unrooted trees in O(n^5) time, where n is the size of the trees. Kubicka, Kubicki and McMorris [16] studied binary unrooted trees and gave an algorithm that can compute a maximum agreement subtree in O(n^((1/2+ε) log n)) time. Steel and Warnow [19] gave the first polynomial-time algorithm for general unrooted trees, i.e., those with no restriction on the degrees. Their algorithm runs in O(n^4.5 log n) time. Farach and Thorup [6] reduced the time complexity to O(n^(2+o(1))) for unrooted trees and O(n^2) for rooted trees. More recently, they gave an algorithm for rooted trees [7] that runs in O(n^1.5 log n) time, and showed that this time bound cannot be improved unless there is a breakthrough in the longstanding open problem of maximum bipartite matching. For the unrooted case, the time complexity was improved by Lam, Sung and Ting [17] to O(n^(1.75+o(1))).
Algorithms that work well for rooted trees with degrees bounded by a constant have also been developed recently. The algorithm of Farach, Przytycka and Thorup [3, 5] runs in O(n log^3 n) time, and Kao [12] has also obtained an O(n log^2 n)-time algorithm. The case where the inputs are further restricted to binary rooted trees was considered by Cole and Hariharan [3, 4].
This paper presents two sets of techniques for comparing unrooted evolutionary trees, namely, label compression and four-way dynamic programming. The technique of four-way dynamic programming transforms existing algorithms for computing rooted maximum agreement subtrees into new ones for unrooted trees. Let n be the size of the two input trees. This technique leads to an O(n log n)-time algorithm for unrooted trees whose degrees are bounded by a constant, matching the best known complexity for the rooted binary case. The technique of label compression is not based on dynamic programming. With this technique, we obtain an O(n^1.5 log n)-time algorithm for unrooted trees with arbitrary degrees, also matching the best algorithm for the rooted unbounded-degree case.
1 Introduction

An evolutionary tree is a tree whose leaves are labeled with distinct symbols representing species. Evolutionary trees are useful for modeling the evolutionary relationship of species. Many mathematical biologists and computer scientists have been investigating how to construct and compare such trees.

Affiliations: Ming-Yang Kao, Department of Computer Science, Duke University, Durham, NC 27708, [email protected]; research supported in part by NSF Grant CCR-9101385. Tak-Wah Lam, Wing-Kin Sung and Hing-Fung Ting, Department of Computer Science, University of Hong Kong, Hong Kong, {twlam, hfting, wksung}@cs.hku.hk; research supported in part by Hong Kong RGC Grants 338/065/0027 and 338/065/0028. Teresa M. Przytycka, Department of Biophysics, School of Medicine, Johns Hopkins University, Baltimore, MD 21205, [email protected], [email protected]; research supported in part by Grant GM29458 and by a Sloan Foundation and Department of Energy Postdoctoral Fellowship for Computational Molecular Biology; most of this work was performed while visiting the Department of Computer Science, University of Maryland, College Park, MD 20742.
leaves of R labeled over L' and the least common ancestors of such leaves in R, and whose edges preserve the ancestor-descendant relationship of R. Consider two rooted evolutionary trees R0 and R1 labeled over L. Let R0' be a subtree of R0 induced by some subset of L. We similarly define R1' for R1. If there exists an isomorphism between R0' and R1' mapping each leaf in R0' to one in R1' with the same label, then R0' and R1' are each called agreement subtrees of R0 and R1. If this isomorphism maps a node u ∈ R0' to a node v ∈ R1', we say that u is mapped to v in R0' and R1'. A maximum agreement subtree of R0 and R1 is one containing the largest possible number of labels. Let MAST(R0, R1) denote the number of labels in such a tree. A maximum agreement subtree of two unrooted evolutionary trees T0 and T1 is one with the largest number of labels among the maximum agreement subtrees of T0^u and T1^v over all nodes u ∈ T0 and v ∈ T1. Let
Cole and Hariharan gave an O(n log n)-time algorithm for this case. For unrooted trees, however, no similar improvement has been known. This paper presents two sets of techniques for comparing unrooted evolutionary trees: label compression [15] and four-way dynamic programming [18]. Using the technique of four-way dynamic programming, we show that to solve the unrooted subtree problem, it suffices to run a dynamic-programming-based algorithm for rooted trees only four times, each time processing pairs of nodes in the input trees in a different order. As a result, we transform dynamic-programming-based algorithms for computing rooted maximum agreement subtrees into new ones for unrooted trees. Let n be the size of the two input trees. For trees whose degrees are bounded by d, our transformation leads to an O(n d^2 √d log n)-time algorithm. When d is a constant, this running time is O(n log n), matching the best known complexity for the rooted binary case. Moreover, this technique can be used to compute unrooted agreements for more than two input trees, provided that the number of input trees is bounded by a constant. The technique of label compression is not based on dynamic programming. It enables us to process overlapping subtrees recursively while keeping the total tree size very close to the original input size. The technical backbone of label compression is a technique that computes maximum-weight matchings over many subgraphs of a bipartite graph as efficiently as it takes to compute a single matching [14]. With label compression, we obtain an O(n^1.5 log n)-time algorithm for unrooted trees with arbitrary degrees, also matching the best algorithm for the rooted unbounded-degree case [7]. In general, if the degrees are bounded by d, the running time of our algorithm is the minimum of O(d^2 n log d log^4 n), O(√d n log^5 n) and O(n^1.5 log n). Our algorithm also allows the input trees to be mixed trees, i.e., trees which may contain directed and undirected edges at the same time [10, 11].
Such trees can handle a broader range of information than rooted trees and unrooted trees.
UMAST(T0, T1) = max_{u ∈ T0, v ∈ T1} MAST(T0^u, T1^v).   (1)
Remark: the nodes u (or v) can be restricted to internal nodes. We can also generalize the above definition to handle a pair of a rooted and an unrooted tree (R0, T1); that is, we let UMAST(R0, T1) = max_{v ∈ T1} MAST(R0, T1^v). To simplify the discussion, the algorithms in this paper only compute UMAST(T0, T1). They can be augmented to report a corresponding subtree.
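The induced-subtree operation underlying these definitions can be sketched in code. The following is a minimal illustration of ours, not the paper's implementation; it assumes rooted trees represented as nested tuples, ('leaf', label) or ('node', [children]), a representation chosen here purely for exposition. A node survives exactly when it is a kept leaf or the least common ancestor of two kept leaves, which amounts to suppressing nodes left with a single restricted child:

```python
def restrict(tree, keep):
    """Return the subtree of `tree` induced by the label subset `keep`.

    `tree` is ('leaf', label) or ('node', [children]).  The retained nodes
    are the kept leaves and the LCAs of pairs of kept leaves; nodes left
    with a single surviving child are suppressed.  Returns None when no
    label of `keep` occurs in `tree`.
    """
    if tree[0] == 'leaf':
        return tree if tree[1] in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree[1]) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:          # not an LCA of two kept leaves: suppress
        return kids[0]
    return ('node', kids)

# example tree: ((a, b), (c, (d, e)))
R = ('node', [('node', [('leaf', 'a'), ('leaf', 'b')]),
              ('node', [('leaf', 'c'),
                        ('node', [('leaf', 'd'), ('leaf', 'e')])])])
```

For instance, restricting R to {a, d, e} yields (a, (d, e)): the node above {d, e} survives as an LCA, while the internal nodes that were left above only b or only c are suppressed.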
3 An O(n log n)-time algorithm for bounded-degree trees

3.1 The reduction technique

Let v be an internal node of an unrooted tree T of degree three (by analogy to its rooted counterpart, we refer to such a tree as a binary tree). Let s(v), t(v), and b(v) denote the three nodes adjacent to v; we also use t(v), b(v), and s(v) to denote, respectively, the subtrees rooted at t(v), b(v), and s(v) when T is rooted at the node v. If T is a rooted binary tree, we let t(v) be the parent of v, b(v) the root of the child subtree of v with the larger number of nodes, and s(v) the root of the other child subtree (ties broken arbitrarily). Let X = {t, b, s} and let Π be the set of permutations of (t, b, s). Let x0, x1 be internal nodes of T0 and T1, respectively, and let π be a permutation from Π. Then MAST(T0^{x0}, T1^{x1}) =
2 Basics

The following notations are used throughout this paper. For a rooted tree R and a node v, let R^v denote the subtree of R rooted at v. For an unrooted tree T and a node u, let T^u denote the rooted tree constructed by assigning u in T as the root; for another node v in T, T^{uv} is the rooted subtree of T^u whose root is at v. We let L(T^u) denote the set of leaf labels of T^u. Let R be a rooted evolutionary tree with leaves labeled over a set L. A label subset L' ⊆ L induces a subtree of R, denoted by R‖L', whose nodes are the
max_{π ∈ Π} [ MAST(s(x0), π(s)(x1)) + MAST(b(x0), π(b)(x1)) + MAST(t(x0), π(t)(x1)) ].   (2)
Figure 1: The up-down partial order.

First, we show how to compute, in O(n^2) time, the values MAST(T0^{x0}, T1^{x1}) for all pairs (x0, x1). To do so, we compute all the MAST values needed in expression (2) using a dynamic programming algorithm; by (1), this implies an O(n^2)-time algorithm for UMAST. Naturally, this algorithm is closely related to the O(n^2)-time algorithm of Steel and Warnow; the main difference is the order in which the computation is performed. To compute the needed MAST values, we process all pairs of nodes (x0, x1) with xi ∈ Ti four times, each time using a different partial order. For the purpose of the O(n log n)-time algorithm, these partial orders will be defined on paths of nodes rather than on individual nodes; we treat a node as a single-node path. Given a path P in a tree T, the vertex of P that is closest to the root is called the top-most vertex of P and is denoted topmost(P). Similarly, the vertex of P that is furthest from the root is called the bottom-most vertex of P and is denoted bottommost(P). Given a partition of a tree T into paths, we define two partial orders, up and down, on the set of paths in the partition. Namely, let P, P' be two paths of a partition of T. Then P precedes P' in the up order if level(topmost(P)) > level(topmost(P')), where level(v) is the distance of the vertex v from the root. Similarly, P precedes P' in the down order if level(topmost(P)) < level(topmost(P')). Thus the up order corresponds to a bottom-up ordering of the paths in the tree, while the down order corresponds to a top-down ordering. Using the partial orders up and down, we define the following partial orders on pairs of nodes: up-up, down-up, up-down, and down-down. The up-up order is the lexicographic order using the up order on both elements of the pair; the up-down order is the lexicographic order using the up order on the first element of the pair and the down order on the second; the remaining orders are defined analogously.
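The four pair orders can be illustrated concretely. The sketch below is ours, not from the paper: it assumes a path partition given simply as a map from path names to the levels of their topmost vertices (hypothetical data), and sorts path pairs lexicographically:

```python
# hypothetical path partition: path name -> level of its topmost vertex
topmost_level = {'P1': 3, 'P2': 1, 'P3': 2}

def key(order, path):
    # up = bottom-up: paths whose topmost vertex is deeper come first;
    # down = top-down: paths whose topmost vertex is shallower come first
    lvl = topmost_level[path]
    return -lvl if order == 'up' else lvl

def sort_pairs(order0, order1):
    """Lexicographic order on pairs of paths: order0 on the first
    component, order1 on the second (up-up, up-down, down-up, down-down)."""
    paths = sorted(topmost_level)
    pairs = [(p, q) for p in paths for q in paths]
    return sorted(pairs, key=lambda pq: (key(order0, pq[0]), key(order1, pq[1])))
```

In the up-up order the first pair processed is (P1, P1), since P1 has the deepest topmost vertex, while in the up-down order the first pair is (P1, P2).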
Consider the O(n^2)-time dynamic programming algorithm for computing the MAST value of two trees rooted at R0 and R1, respectively, by processing all pairs of vertices (x0, x1) with xi ∈ Ti in bottom-up order [19]. We refer to this algorithm as the Steel-Warnow algorithm. Let Ri be the root of Ti; abusing notation, we write xi for the rooted subtree Ti^{Ri xi}. The algorithm uses the following dynamic programming formula:
MAST(x0, x1) = max of:
  binvalue(L(x0) = L(x1))                    if x0 and x1 are leaves,
  MAST(x0, b(x1))                            if x1 is not a leaf,
  MAST(x0, s(x1))                            if x1 is not a leaf,
  MAST(b(x0), x1)                            if x0 is not a leaf,
  MAST(s(x0), x1)                            if x0 is not a leaf,
  MAST(b(x0), b(x1)) + MAST(s(x0), s(x1))    if neither x0 nor x1 is a leaf,
  MAST(s(x0), b(x1)) + MAST(b(x0), s(x1))    if neither x0 nor x1 is a leaf,   (3)
where binvalue(·) is 1 if its argument is true and 0 otherwise.
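Formula (3) translates directly into a memoized recursion. The sketch below is illustrative only (the Steel-Warnow algorithm fills a table bottom-up instead of recursing top-down); it assumes rooted binary trees encoded as ('leaf', label) or ('node', left, right), where the two children play the roles of b(x) and s(x):

```python
from functools import lru_cache

def mast(t0, t1):
    """MAST of two rooted binary trees via formula (3)."""
    @lru_cache(maxsize=None)
    def m(x0, x1):
        leaf0, leaf1 = x0[0] == 'leaf', x1[0] == 'leaf'
        if leaf0 and leaf1:
            return 1 if x0[1] == x1[1] else 0   # binvalue(L(x0) = L(x1))
        best = 0
        if not leaf1:                           # descend in x1 only
            _, b1, s1 = x1
            best = max(best, m(x0, b1), m(x0, s1))
        if not leaf0:                           # descend in x0 only
            _, b0, s0 = x0
            best = max(best, m(b0, x1), m(s0, x1))
        if not leaf0 and not leaf1:             # match children both ways
            best = max(best, m(b0, b1) + m(s0, s1),
                             m(s0, b1) + m(b0, s1))
        return best
    return m(t0, t1)
```

Each of the O(n^2) subtree pairs is evaluated once and each evaluation does constant work, which is where the O(n^2) bound comes from.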
Let T0, T1 be two unrooted trees. Root them at some leaves R0, R1. Observe that if we run the above algorithm on T0, T1 rooted in this way, a large portion of the information needed in (2) will be computed. Namely, as the algorithm processes the pairs of nodes in the up-up order, for each pair of nodes (x0, x1) it computes MAST(b(x0), b(x1)), MAST(s(x0), b(x1)), MAST(b(x0), s(x1)), and MAST(s(x0), s(x1)). To obtain the remaining information needed to compute UMAST using (1) and (2), we perform three more passes over all pairs of vertices, considering these pairs in the three remaining partial orders. In each phase, the basic operation on each pair of internal nodes (x0, x1) is that of computing a MAST value for this pair of nodes. What makes the computation different in each phase is that the MAST value is computed under the assumption that the trees are rooted at different nodes. For the up-up direction, we assume that the roots are the nodes R0, R1 respectively, i.e., we compute MAST(T0^{R0 x0}, T1^{R1 x1}). In the up-down direction, we compute MAST(T0^{R0 x0}, T1^{s(x1) x1}) and MAST(T0^{R0 x0}, T1^{b(x1) x1}); symmetrically for the down-up direction. Finally, in the down-down computation we compute MAST(T0^{b(x0) x0}, T1^{b(x1) x1}), MAST(T0^{b(x0) x0}, T1^{s(x1) x1}), MAST(T0^{s(x0) x0}, T1^{b(x1) x1}), and MAST(T0^{s(x0) x0}, T1^{s(x1) x1}). (The last is the reason for the d^2 factor in the complexity of the natural generalization of this algorithm to degree-d trees.) Assume that we process a pair of nodes (v0, v1) in one of these partial orders (o0, o1), where oi ∈ {up, down}. Let (x0, x1) be a pair of nodes such that xi is adjacent to vi and vi precedes xi in the order oi (i.e., the direction of processing the tree after processing the pair (x0, x1) can be towards the pair (v0, v1)); see Figure 1 for the case where (o0, o1) is the up-down order. Let vi be as above and let yi, zi be the remaining two nodes adjacent to xi (if any).
In our algorithm we ensure that in each of the four phases the following values are known prior to processing a pair (x0, x1): MAST(T0^{v0 y0}, T1^{v1 y1}), MAST(T0^{v0 z0}, T1^{v1 z1}), MAST(T0^{v0 y0}, T1^{v1 z1}), MAST(T0^{v0 z0}, T1^{v1 y1}), MAST(T0^{v0 x0}, T1^{v1 y1}), MAST(T0^{v0 y0}, T1^{v1 x1}), MAST(T0^{v0 z0}, T1^{v1 x1}), MAST(T0^{v0 x0}, T1^{v1 z1}).
belonging to the pair (P0, P1) as the Path Matching problem [3]. The main idea of the path matching algorithms can be described as follows. The interesting and supporting pairs are processed in the up-up order. There is a data structure that stores the partial results of the computation, including (among others) the value MAST(T0^{R0 x0}, T1^{R1 x1}) for each interesting or supporting pair (x0, x1). This data structure is in fact a binary search tree S whose leaves are the nodes of P1. The vertices in P0 are considered in turn in bottom-up order. For each vertex ui ∈ P0, the interesting pairs (ui, vj) with vj ∈ P1 are considered in the up-up order. Then S is modified to include MAST(T0^{R0 x0}, T1^{R1 x1}) (as well as some other information relevant to the computation). In [5], S is a balanced binary search tree and this computation takes O(log n) time per interesting pair and supporting pair. In [3, 4], the search tree is a weight-balanced search tree, where the time of processing an edge varies and an amortization argument is used to derive the total time complexity. The O(n log n)-time MAST algorithm processes all pairs of interesting paths in the up-up order. For each pair of paths it calls the path matching algorithm. By an inductive argument, after performing the Path Matching algorithm for all pairs of paths preceding a given pair of centroid paths (P0, P1) in the up-up order, the data needed as the input to the Path Matching problem for the pair (P0, P1) has been computed.
Note that this is exactly the property used in the MAST algorithm of Steel and Warnow. We argue that if the up-down (or down-up) phase follows the up-up phase, then the condition is satisfied as well. Similarly, if the down-down phase is the last phase, then the above condition is guaranteed.
3.2 Centroid decomposition and path matching

The sub-quadratic-time algorithms [3, 4, 5, 7, 12] for the binary MAST problem are based on the observation that the dynamic programming instances corresponding to the MAST problem are sparse. In practice this means that, to compute the MAST of two rooted trees, not all pairs (x0, x1) of internal nodes need to be considered. This observation is also true for the UMAST problem. Let R0, R1 be arbitrary leaves. Root Ti at Ri and define the s, t and b pointers with respect to this root. Thus, for each internal vertex v in Ti (i = 0, 1), let the central child, denoted b(v), be the child with the maximum number of descending leaves, and let the side child, denoted s(v), be the other child, with ties broken arbitrarily. Correspondingly, call the arc (x, b(x)) a central arc, and (x, s(x)) a side arc. By a centroid path we mean a maximal path using central arcs. Note that each vertex belongs to exactly one centroid path. Let v ≠ Ri be a node of a centroid path; then t(v) is the parent of v in Ti. A pair of centroid paths (P0, P1) is an interesting pair of paths if there exists a pair of labels a, b such that lca_{T0}(a, b) ∈ P0 and lca_{T1}(a, b) ∈ P1. A pair of internal nodes (x0, x1) with xi ∈ Ti is interesting if it belongs to an interesting pair of paths and L(s(x0)) ∩ L(s(x1)) ≠ ∅. A pair (x0, x1) where one of the xi, say x1, is a leaf is interesting if the label of x1 is in L(x0). Let topmost(v) be the topmost vertex of the centroid path to which v belongs. By the definition of an interesting pair we have:
UMAST(T0, T1) = max { MAST(T0^{r0}, T1^{r1}) : ri ∈ Ti, (r0, r1) is interesting }.   (4)
3.3 The O(n log n)-time UMAST algorithm

In the MAST algorithm outlined above, while processing all interesting pairs of centroid paths in the up-up order using a Path Matching algorithm, we also process all interesting pairs of nodes in the up-up order. To transform a MAST algorithm that uses a path-matching approach into a UMAST algorithm, we use the same technique as in the O(n^2)-time UMAST algorithm presented in the previous subsection. Namely, we process all interesting and supporting pairs in the four partial orders. As in the case of the O(n log n) MAST algorithm, we organize the computation along pairs of interesting paths and apply a variant of the path matching algorithm for each such pair.
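The centroid path decomposition itself can be computed with one post-order pass for leaf counts followed by a greedy descent along central arcs. This sketch is ours (the paper gives no code) and assumes a rooted tree supplied as a children dictionary:

```python
def centroid_paths(children, root):
    """Partition a rooted tree into centroid paths.

    `children` maps each node to its list of children (leaves may be
    absent or map to []).  Each path repeatedly follows the central
    child, i.e. the child with the most descendant leaves; every side
    child starts a new centroid path, so each vertex ends up on exactly
    one path.
    """
    # post-order pass: number of descendant leaves of every node
    count, stack = {}, [(root, False)]
    while stack:
        v, done = stack.pop()
        kids = children.get(v, [])
        if done:
            count[v] = 1 if not kids else sum(count[c] for c in kids)
        else:
            stack.append((v, True))
            stack.extend((c, False) for c in kids)
    paths, starts = [], [root]
    while starts:
        v = starts.pop()
        path = [v]
        while children.get(v):
            kids = sorted(children[v], key=lambda c: -count[c])
            starts.extend(kids[1:])      # side children: new paths later
            v = kids[0]                  # central child: extend this path
            path.append(v)
        paths.append(path)
    return paths
```

On a small example this produces one long path from the root plus a singleton path per side child, and every vertex appears on exactly one path.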
The following fact is implicit in the O(n log n)-time algorithm [3] and is proven explicitly in the survey part of [18].

Lemma 1 ([18]) The number of interesting pairs is at most n log n.

If x0 from T0 forms an interesting pair with some node x1 on a centroid path P1 in the other tree, then the pairs (x0, topmost(P1)) are called supporting pairs. Given a pair of paths (P0, P1), where Pi is a centroid path in Ti, we refer to the problem of computing MAST values for all interesting and supporting pairs of nodes
Theorem 2 If T0, T1 are two binary trees, then UMAST(T0, T1) can be computed in O(n log n) time.

Proof: (sketch) Extend each centroid path by adding
the parent of its topmost vertex, if it exists, and call it topmost'. If the topmost' vertex is not the root, we call the two subtrees rooted at nodes adjacent to topmost' and not included in the corresponding centroid path the pseudoside trees. The pseudoside tree that contains the root of the tree is
called the large pseudoside tree and the other the small pseudoside tree. (See Figure 2.)

Figure 2: The pseudoside subtrees.

As in the case of the O(n^2) algorithm, we process all pairs of paths (this time we use centroid paths) in the up-up, up-down, down-up and down-down orders. In each phase we process all interesting and supporting pairs in the same order as the partial order in which we process the pairs of centroid paths in that phase. The algorithm can be outlined as follows. Let o1, ..., o4 be respectively the four partial orders.

Algorithm 2 SPARSE_UMAST(T0, T1)
part 1: Using the O(n log n)-time algorithm for the MAST problem, identify the interesting pairs of paths and the interesting pairs of nodes.
part 2: for i = 1 to 4 do: consider all interesting pairs of paths in order oi; for each such pair of paths, execute the path matching algorithm, considering the interesting and supporting pairs in the order oi.
part 3: For every interesting pair (x0, x1), compute MAST(T0^{x0}, T1^{x1}); then UMAST(T0, T1) = max_{x0 ∈ T0, x1 ∈ T1} MAST(T0^{x0}, T1^{x1}).

When we process pairs of paths in a direction other than up-up, we need to treat pseudoside trees in the same way as side trees. This introduces some complications to these phases, which can be addressed as outlined below. Consider an interesting pair of paths (P0, P1) and let vi, vi+1 be two consecutive interesting nodes on path P1; thus each of these nodes belongs to an interesting pair. Between the nodes vi and vi+1 we may have, in tree T1, other nodes that do not form an interesting pair with any node on P0. The corresponding side trees can, however, intersect the pseudoside trees adjacent to P0. Since among the side trees adjacent to nodes on the path between vi and vi+1 only one can be used to match a pseudoside tree in an agreement matching, it suffices to take into account only the pseudoside-side pair which has the highest MAST value. Thus for every pair vi, vi+1 of consecutive interesting nodes on a given path, we add to the supporting pairs the pair (pseudoside, v̂i+1) corresponding to a maximum MAST value between the pseudoside tree and a side tree adjacent to a node on the path between vi and vi+1. We also add the pair (pseudoside, v̂0) representing the interval of nodes preceding v1. We further extend the set of supporting pairs by adding (x0, bottommost(P1)) and (bottommost(P0), x1) for each interesting pair (x0, x1) in (P0, P1). Finally, for each interesting pair (ui, vj) we add the pairs (smallpseudoside, ui) and (largepseudoside, ui). The weights of all added edges are computed in previous iterations.

Consider computing the weight of (pseudoside, v̂i+1). Let (P0, P1) be a pair of interesting paths. Let v1, ..., vk be the nodes of P1 that are in an interesting pair with a node on P0 (see Figure 3). Assume that P0 is not the topmost centroid path (the topmost path does not have adjacent pseudoside trees). Let (P0', P1') be the interesting pair of paths that directly dominates (P0, P1). Let u be the node of P0' such that P0 ∈ s(u). Then (u, vi) is interesting for i = 1, ..., k. Furthermore, (smallside, v̂i) is simply the maximum over (x, y), where (x, y) is an interesting pair in (P0', P1') such that x precedes u in the up order and y is between vi and vi+1; thus it can be computed during the path matching algorithm for the pair (P0', P1'). The remaining details are left to the full version of the paper.

4 An O(n^1.5 log n)-time algorithm for unbounded-degree trees

We present an algorithm for computing a maximum agreement subtree of two unrooted evolutionary trees. It takes O(n^1.5 log n) time for trees with unbounded degrees, matching the best algorithm for the rooted case. The time complexity reduces to O(n log^4 n) if the input trees have degrees bounded by a constant [15]. Unlike previous algorithms, our algorithm is not based on dynamic programming. Instead it adopts a recursive strategy exploiting a technique called label compression. Let U1 and U2 be unrooted evolutionary trees with the same leaf count. Let N be the size of U1 (or U2). The next theorem is our main result.
Figure 3: Computing the weight of pseudoside-side pairs
Theorem 3 We can compute UMAST(U1, U2) in O(N^1.5 log N) time.

to nodes in U2, and x is on the path in U1 between v1 and v2. This case can be handled in a way similar to Case 1. Let Ũ2 be the tree constructed by adding a dummy node w in the middle of every edge (u, v) in U2. Then UMAST(U1, U2) = MAST(U1^x, Ũ2^w) for some dummy node w in Ũ2. Thus UMAST(U1, U2) equals UMAST(U1^x, Ũ2) and can be computed in O(N^1.5 log N)
For ease of explanation, the proof of this theorem is presented in a top-down manner, starting with an outline of our subtree algorithm here. Our algorithm uses graph separators. A separator of a tree is an internal node whose removal divides the tree into connected components, each containing at most half of the tree's nodes. Every tree that contains at least three nodes has a separator, and such an internal node can be found in linear time. If U1 or U2 has at most two nodes, UMAST(U1, U2) as defined in (1) can easily be computed in O(N) time. Otherwise, both trees have at least three nodes each, and we can find a separator x of U1. We then consider three cases.
Case 1: In some maximum agreement subtree of U1 and U2, the node x is mapped to a node in U2. In this case, UMAST(U1, U2) = UMAST(U1^x, U2). To compute UMAST(U1^x, U2), we might simply evaluate MAST(U1^x, U2^y) for different y in U2. This approach involves solving the MAST problem for Θ(N) different pairs of rooted trees and introduces much redundant computation. For example, consider a rooted subtree R of U2. For all y ∈ U2 − R, R is a common subtree of U2^y. Hence R is examined repeatedly in the computation of MAST(U1^x, U2^y) for these y. To speed up the computation, we devise the technique of label compression in §5 to elicit sufficient information between U1^x and R so that we can compute MAST(U1^x, U2^y) for all y ∈ U2 − R without examining R. In fact, we can use the next lemma to compute Case 1 in O(N^1.5 log N) time.
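Finding such a separator in linear time can be sketched as follows; this is a standard centroid search, ours rather than the paper's, assuming the unrooted tree is given as an adjacency map:

```python
def find_separator(adj):
    """Return a node whose removal leaves components of size <= n/2.

    `adj` maps each node of an unrooted tree to its list of neighbours.
    One DFS computes subtree sizes; a node v is a separator iff every
    child subtree and the remaining part above v have at most n/2 nodes.
    """
    n = len(adj)
    root = next(iter(adj))
    parent, order = {root: None}, [root]
    for v in order:                      # traversal (appends while iterating)
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)
    size = {v: 1 for v in adj}
    for v in reversed(order):            # children are settled before parents
        if parent[v] is not None:
            size[parent[v]] += size[v]
    for v in order:
        below = [size[w] for w in adj[v] if w != parent[v]]
        rest = n - 1 - sum(below)        # component containing v's parent
        if all(s <= n / 2 for s in below) and rest <= n / 2:
            return v
```

On a path a-b-c-d-e this returns the middle node c, since removing c leaves two components of size two.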
time as in Case 1.
Case 3: None of the other two cases. Let U_{1,1}, U_{1,2}, ..., U_{1,b} be the evolutionary trees formed by the connected components of U1 − {x}. Let J1, J2, ..., Jb be the sets of labels in these components, respectively. Since neither the first nor the second case is satisfied, a maximum agreement subtree of U1 and U2 is labeled over some Ji. Therefore UMAST(U1, U2) = max{UMAST(U_{1,i}, U2‖Ji) : i ∈ [1, b]}, and we compute each UMAST(U_{1,i}, U2‖Ji) recursively.
Figure 4 summarizes the algorithm for computing UMAST(U1, U2). Here we analyze the time complexity T(N) of this algorithm based on Lemma 4. Cases 1 and 2 each take O(N^1.5 log N) time. Let Ni = |U_{1,i}|. Then Case 3 takes Σ_{i∈[1,b]} T(Ni) time. By recursion,
T(N) = O(N^1.5 log N) + Σ_{i∈[1,b]} T(Ni).
Since x is a separator of U1, we have Ni ≤ N/2. Then, since Σ_{i∈[1,b]} Ni ≤ N, we have T(N) = O(N^1.5 log N). The time bound in Theorem 3 follows. To complete the proof of Theorem 3, we devote §5 through §7 to proving Lemma 4.
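The solution of this recurrence follows from a standard recursion-tree argument; the derivation below is our sketch, with c denoting the constant hidden in the O-term:

```latex
T(N) \le c\,N^{1.5}\log N + \sum_i T(N_i),
\qquad \sum_i N_i \le N,\quad N_i \le N/2.
% At depth k of the recursion tree every piece has size at most N/2^k
% while the pieces at that depth still sum to at most N, so the work
% done at depth k is bounded by
\sum_j c\,N_{k,j}^{1.5}\log N
 \;\le\; c\log N\,\Bigl(\max_j N_{k,j}\Bigr)^{1/2}\sum_j N_{k,j}
 \;\le\; c\,N^{1.5}\log N\cdot 2^{-k/2}.
% Summing the geometric series over the depths k >= 0:
T(N) \;\le\; c\,N^{1.5}\log N \sum_{k\ge 0} 2^{-k/2}
      \;=\; O\!\left(N^{1.5}\log N\right).
```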
5 Label compressions

To compute maximum agreement subtrees, our algorithm recursively processes overlapping subtrees of the input trees. The technique of label compression compresses overlapping parts of such subtrees to reduce their total size. We define label compressions with respect to one rooted subtree and with respect to two label-disjoint rooted subtrees. We do not use label compression with respect to three or more trees.
Lemma 4 Assume that U1 and U2 have at least three nodes each. Given an internal node x ∈ U1, we can compute UMAST(U1^x, U2) in O(N^1.5 log N) time.
Proof: See §6.
Case 2: In some maximum agreement subtree of U1 and U2, two certain nodes v1 and v2 of U1 are mapped
UMAST(U1, U2) /* U1, U2 are unrooted trees */
  find a separator x of U1;
  construct Ũ2 by adding a dummy node w at the middle of each edge (u, v) in U2;
  val = UMAST(U1^x, U2);
  val' = UMAST(U1^x, Ũ2);
  let U_{1,1}, U_{1,2}, ..., U_{1,b} be the evolutionary trees formed by the connected components of U1 − {x};
  let Ji be the set of labels of U_{1,i} for all i ∈ [1..b];
  for all i ∈ [1..b]: vali = UMAST(U_{1,i}, U2‖Ji);
  return max{val, val', max_{i∈[1..b]} vali}
Figure 4: Algorithm for UMAST(U1, U2).

The nodes z1, z2 and p̂1 are called compressed nodes, and the leaves in T ⊗ R which are not compressed are called atomic leaves. We further store in T ⊗ R some auxiliary information about the relationship between T and R. For an internal node v in T ⊗ R, let μ(v) = MAST(T^v, R). For a compressed leaf v in W, if it is compressed from a set of subtrees T^{v1}, T^{v2}, ..., T^{vm}, let μ(v) = max{MAST(T^{v1}, R), MAST(T^{v2}, R), ..., MAST(T^{vm}, R)}. Given T ⊗ R, for any rooted evolutionary tree T' that contains R as a subtree, we can compute UMAST(T, T') without examining R. We first replace the subtree R of T' with a shrunk leaf and then compute UMAST(T, T') from T ⊗ R and T' ⊖ R. To further our discussion, we need to generalize the definition of a maximum agreement subtree to a pair of trees that contain compressed and shrunk leaves, respectively. Let W = T ⊗ R. Let W' be a rooted tree containing up to one shrunk leaf representing R. We can define a maximum agreement subtree of W and W', as well as MAST(W, W'), in a way similar to that of ordinary evolutionary trees. Any atomic leaf must still be mapped to a leaf with the same label. However, the shrunk leaf
of W' can be mapped to any leaf or internal node v of W as long as μ(v) > 0. Let T and T' be two rooted evolutionary trees. Let R be a subtree of T'. Let W = T ⊗ R and W' = T' ⊖ R. The following lemma is the cornerstone of label compression.
Let T be a rooted tree and let R be a rooted subtree of T. Let T ⊖ R denote the rooted tree obtained by replacing R by a leaf γ; we say that γ is a shrunk leaf representing R. The other leaves are atomic leaves. Similarly, for two label-disjoint rooted subtrees R1 and R2 of T, let T ⊖ (R1, R2) denote the rooted tree obtained by replacing R1 and R2 by shrunk leaves γ1 and γ2, respectively. We extend these notions to an unrooted tree U and define U ⊖ R and U ⊖ (R1, R2) similarly. Let x be a node in T and y a descendant of x. Let P be the path of T from x to y. A node lies between x and y if it is in P but differs from x and y. A subtree of T is attached to x if it is some T^z where z is a child of x. A subtree of T hangs between x and y if it is attached to some node lying between x and y, but its root is not in P. Suppose that T is labeled over L. Consider any K ⊆ L, which we would like to compress. For each node y and its parent x in T‖(L − K), A(T, K, y) denotes the set of subtrees of T that are attached to y and whose leaves are all labeled over K; H(T, K, y) denotes the set of subtrees of T that hang between x and y. (From the definition of T‖(L − K), these subtrees are all labeled over K.) Let T and R be rooted evolutionary trees labeled over L and K, respectively. The compression of T with respect to R, denoted by T ⊗ R, is a tree constructed by affixing extra nodes to T‖(L − K) with the following steps. Consider a node y and its parent x in T‖(L − K).
Lemma 5 MAST(T, T') = MAST(W, W').

Let n = max{|W|, |W'|} and N = max{|T|, |T'|}.
(See Figure 5 for an example.) If A(T, K, y) is non-empty, compress all the trees in A(T, K, y) into a single node z1 and attach it to y. If H(T, K, y) is non-empty, compress the parents p1, p2, ..., pm of the roots of the trees in H(T, K, y) into a single node p̂1 and insert it between x and y; then compress all the trees in H(T, K, y) into a single node z2 and attach it to p̂1.
We can compute UMAST(W, W') by modifying existing algorithms for ordinary evolutionary trees [7, 12, 5].
Lemma 6 Suppose that all the auxiliary information of W has been given. Then MAST(W, W') can be found in O(n^1.5 log N) time. Afterwards, we can retrieve MAST(W^v, W') for any node v ∈ W in O(1) time.
Figure 5: An example of label compression.

max{UMAST(Wi, Xi) : i ∈ [1, b]}. Moreover, our algorithm enforces the following properties.
We can generalize label compression to two label-disjoint subtrees R1 and R2 of T'. Let W denote the compression of T with respect to R1 and R2, and let W' denote T' ⊖ (R1, R2). Again, we store in W some auxiliary information about the relationship between T and R1, R2. Then we can determine the value of UMAST(T, T') from W and W'. This time the auxiliary information in W, as well as the definition of UMAST(W, W'), is slightly more complicated. Nevertheless, the time required to compute MAST(W, W') remains the same as in Lemma 6.
If X contains at most one shrunk leaf, every sub-
problem generated has size at most half that of X .
If X has two shrunk leaves, at most one subproblem (Wi ; Xi ) has size greater than half that of X , but Xi contains only one shrunk leaf. o
o
o
Thus, in the next recursion level, every subproblem spawned by (Wi ; Xi ) has size at most half that of X . To summarize, whenever the recursion gets down by two levels, the size of a subproblem reduces by half. The subproblems UMAST(W1 ; X1 ); UMAST(W2 ; X2 ), : : :, UMAST(Wb ; Xb ) are formally de ned as follows. The degree of any node in X is at most d, the maximum degree of U2 . Assume that the separator y has b neighbors in X , namely, v1 ; v2 ; : : : ; vb . Note that b d. For each i 2 [1; b], let Ci be the connected component in X ?fyg that contains vi . The number of atomic leaves in Ci is at most half of jX j. Intuitively, we would like to shrink the subtree X v y into a leaf, producing a smaller unrooted tree Xi . We rst consider the simple case where X has at most one shrunk leaf. Then no Ci contains more than one shrunk leaf. If Ci contains no shrunk leaf, then Xi contains only one shrunk leaf representing the subtree X v y . Note that X v y corresponds to the subtree U2v y in U2 and Xi = U2 U2v y . Let Wi = U1x U2v y . If Ci contains one shrunk leaf 1 , representing some subtree U2v y , then Xi contains 1 , as well as a new shrunk leaf representing the subtrees X v y . The two subtrees are label-disjoint. Again, X v y corresponds to the subtree U2v y in U2 . Assume that 1 corresponds to a subtree U2v y in U2 . Then Xi = U2 (U2v y ; U2v y ). Let Wi = U1x (U2v y ; U2v y ). It remains to consider the case where X itself already has two shrunk leaves 1 and 2 . If y lies on the path between 1 and 2 , then no Ci contains more than one shrunk leaf and we de ne the smaller problem instances o
o
6 Proof of Lemma 4 At a high level, we rst apply label compression to the input instance (U1x ; U2 ). We then reduce the problem to a number of smaller subproblems (W; X ), each of which is similar to (U1x ; U2 ) and is solved recursively. For each (W; X ) generated, X is a subtree of U2 with at most two shrunk leaves, and W is a label compression of U1x with respect to some rooted subtrees of U2 which are represented by the shrunk leaves of X . Also, W and X contain the same number of atomic leaves. Our recursive algorithm for computing UMAST(W; X ) initially sets W = U1x and X = U2 . In general, if W or X has at most two nodes, then UMAST(W; X ) can easily be computed in linear time. Otherwise, both W and X each have at least three nodes. Our algorithm rst nds a separator y of X and computes UMAST(W; X ) for the following two cases. The output is the larger of the two cases. Figure 6 outlines our algorithm. Case 1: UMAST(W; X ) = MAST(W; X y ). We can orient X to root at y and evaluate MAST(W; X y ). This takes O(n1:5 log N ) time. Case 2: UMAST(W; X ) = MAST(W; X z ) for some z 6= y. We compute maxfMAST(W; X z ) j z 6= yg by solving a set of subproblems fUMAST(W1 ; X1 ), UMAST(W2 ; X2 ), : : :, UMAST(Wb ; Xb )g such that their total size is n and maxfMAST(W; X z ) j z 6= yg =
X_i and W_i from X and W is straightforward. However, computing the auxiliary information in all the W_i efficiently
/* W is a rooted tree with compressed leaves; X is an unrooted tree with shrunk leaves */
UMAST(W, X)
    let y be the separator of X;
    val = MAST(W, X^y);
    if (X contains at most one shrunk leaf) or (y lies on the path between the two shrunk leaves) then
        new_subproblem(W, X, y);
        for each (W_i, X_i): val_i = UMAST(W_i, X_i);
    else
        let y′ be the node on the path between the two shrunk leaves that is closest to y;
        val = MAST(W, X^{y′});
        new_subproblem(W, X, y′);
        for each (W_i, X_i): val_i = UMAST(W_i, X_i);
    return max{val, max_{i=1..b} val_i}
requires some intricate techniques; details are shown in §7. The total time required to prepare all the subproblems will be proven to be O(n^{1.5} log N).

Let n = max{|W|, |X|} and N = max{|U_1|, |U_2|}.

Lemma 7 UMAST(W, X) can be computed in O(n^{1.5} log N) time.

Proof: Let T(n) be the time to compute UMAST(W, X). The computation is divided into two cases. Case 1 takes O(n^{1.5} log N) time. For Case 2, a set of subproblems {UMAST(W_i, X_i) | i ∈ [1, b]} is generated; the time to prepare all these subproblems is also O(n^{1.5} log N). All these subproblems, except possibly one, have size less than n/2 + 1. For the exceptional subproblem, say UMAST(W_l, X_l), its computation is again divided into two cases. One case takes O(n^{1.5} log N) time. For the other case, another set of subproblems is generated in O(n^{1.5} log N) time; this time, every such subproblem has size less than n/2 + 1. Let S be the set of all the subproblems generated in both steps. The total size of the subproblems in S is at most n. For some constant c,
new_subproblem(W, X, y)
    let v_1, v_2, ..., v_b be the neighbors of y in X;
    for all i ∈ [1..b]
        let X_i be the unrooted tree formed by shrinking the subtree X^{v_i→y} into a shrunk leaf;
        let W_i be the rooted tree formed by compressing W with respect to X^{v_i→y};
    compute and store the auxiliary information in W_i for all i ∈ [1..b]
T(n) ≤ 4c·n^{1.5} log N + Σ_{UMAST(W′, X′) ∈ S} T(|X′|).
It follows that T(n) = O(n^{1.5} log N). By letting W = U_1^x and X = U_2, we have proved Lemma 4. What remains is to show how to compute the auxiliary information stored in all the W_i from (W, X) in a total of O(n^{1.5} log N) time. Note that X contains at most two shrunk leaves. In the next section, we consider the simplest case, where X has no shrunk leaf; this case already reflects the difficulty of the computation. In the full paper, we will show that the other cases can be handled in a slightly different way.
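As a sanity check on the claim that the recurrence solves to O(n^{1.5} log N), the following toy (our pessimistic simplification, not the paper's exact recursion) charges the full 4c·n^{1.5} log N at every level and splits each instance into two of size n/2. The ratio to n^{1.5} log N then stays bounded by the geometric sum 4/(1 − 1/√2) ≈ 13.7, since 2·(n/2)^{1.5} = n^{1.5}/√2.

```python
# Numeric sanity check (assumption-laden toy, not the paper's algorithm):
# evaluate T(n) = 4c*n^1.5*log N + 2*T(n/2) and compare against n^1.5*log N.
import math

N = 2 ** 20          # stand-in for the tree-size bound N; value is arbitrary

def work(n, c=1.0):
    """Pessimistic cost: full top-level term at every level, two n/2 children."""
    if n <= 2:
        return 1.0
    return 4 * c * n ** 1.5 * math.log(N) + 2 * work(n // 2, c)

for n in (2 ** 10, 2 ** 14, 2 ** 18):
    print(n, work(n) / (n ** 1.5 * math.log(N)))   # ratio stays bounded (~13.7)
```

The bounded ratio is exactly the statement T(n) = O(n^{1.5} log N).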
Figure 6: Algorithm for UMAST(W, X).

(W_i, X_i) as above. Otherwise, there is a C_i containing both ℓ_1 and ℓ_2, and the X_i defined as above would contain three shrunk leaves, violating our requirement. In this case, we abandon y and replace it with the node y′ on the path between ℓ_1 and ℓ_2 that is closest to y. Now, to compute UMAST(W, X), we consider two cases depending on whether or not the root of W is mapped to y′. Again, we first compute MAST(W, X^{y′}). Then we define the connected components C_i and the smaller problem instances (W_i, X_i) with respect to y′. Every X_i has at most two shrunk leaves, but y′ may not be a separator, so we cannot guarantee that the size of every subproblem is reduced by half. Fortunately, there can be only one connected component C_i containing more than half of the atomic leaves of X; indeed, this C_i is the component containing y. In this case, neither ℓ_1 nor ℓ_2 is inside C_i, and the X_i defined contains only one shrunk leaf. Thus, the subproblems which UMAST(W_i, X_i) spawns in the next recursion level each have size at most half that of (W, X). With respect to y or y′, computing the topology of all
7 Auxiliary information for X

Recall that |U_1| = |U_2| = N and |W| = |X| = n. The case where X contains no shrunk leaf occurs when the algorithm starts, i.e., W = U_1^x and X = U_2. The subproblems UMAST(W_1, X_1), UMAST(W_2, X_2), ..., UMAST(W_b, X_b) spawned from (W, X) are defined by an internal node y in X, which is adjacent to the nodes v_1, v_2, ..., v_b. Let R_i and R̄_i denote the rooted subtrees X^{v_i→y} and X^{y→v_i}, respectively. Note that the rooted tree X^y is composed of the subtrees R̄_1, R̄_2, ..., R̄_b. Also, W_i = W ⊗ R_i and X_i = X ⊖ R_i. The number of atomic leaves in all the R̄_i is at most n. Furthermore, each R_i is X^y with R̄_i removed. This section discusses how to compute the auxiliary information required by each W_i in O(n^{1.5} log N) total time.
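The size bookkeeping behind the R̄_i and the R_i can be made concrete with a toy nested-tuple representation (our hypothetical encoding, not the paper's data structure): the R̄_i partition the n atomic leaves of X^y, while the overlapping complements R_i have total size (b − 1)·n.

```python
# Toy illustration (hypothetical representation): the children of X^y are the
# subtrees R̄_1, ..., R̄_b, which partition the n atomic leaves; each complement
# R_i = X^y minus R̄_i overlaps the others, so the R_i total (b - 1) * n leaves.

def leaves(t):
    return [t] if isinstance(t, str) else [x for c in t for x in leaves(c)]

Xy = (("a", "b"), "c", ("d", ("e", "f")))        # children R̄_1, R̄_2, R̄_3
Rbar = list(Xy)
R = [tuple(Rbar[:i] + Rbar[i + 1:]) for i in range(len(Rbar))]  # R_i = X^y minus R̄_i

n = len(leaves(Xy))
print(n, sum(len(leaves(r)) for r in Rbar), sum(len(leaves(r)) for r in R))  # 6 6 12
```

This quadratic total size of the R_i is precisely why §7.2 below cannot afford to process each R_i independently.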
7.1 Auxiliary information in the compressed leaves of W

Consider any compressed leaf v in W_i. Let S denote the set of subtrees from which v is compressed. Then the auxiliary information to be stored in v is

γ(v) = max{UMAST(W^z, R_i) | W^z ∈ S}.  (5)

Observe that for any W^z ∈ S, W^z contains no labels outside R_i. Thus UMAST(W^z, R_i) = UMAST(W^z, X^y), and we can rewrite (5) as

γ(v) = max{UMAST(W^z, X^y) | W^z ∈ S}.

We can compute UMAST(W, X^y) in O(n^{1.5} log N) time, and afterwards we can retrieve the value of UMAST(W^z, X^y) for any node z ∈ W in O(1) time. Next, we show how to compute max{UMAST(W^z, X^y) | W^z ∈ S} efficiently. Assume that for any node u ∈ W, the subtrees attached to u are numbered consecutively, starting from 1. We first consider a preprocessing step for the efficient retrieval of values of the following types:

- for some node u ∈ W and some interval [a, b], max{UMAST(W^z, X^y) | W^z is a subtree attached to u whose number falls in [a, b]};

- for some path P of W, max{UMAST(W^z, X^y) | W^z is a subtree attached to some node in P}.

Lemma 8 Assume that we can retrieve UMAST(W^z, X^y) for each z ∈ W in O(1) time. Then we can preprocess W and X and construct additional data structures in O(n log n) time so that any value of the above types can be retrieved in O(1) time.

Proof: By adapting preprocessing techniques for answering on-line product queries [2].

We are now ready to illustrate how to compute the auxiliary information in the compressed leaves of each W_i. Recall that for any compressed leaf v ∈ W_i, γ(v) = max{UMAST(W^z, X^y) | W^z ∈ S}, where S is the set of subtrees in W compressed into v. Note that S is either a subset of the subtrees attached to v's parent in W or the set of subtrees attached to the nodes on a particular path in W. In the former case, S can be partitioned into at most d_v + 1 intervals, where d_v is the degree of v in W_i; by Lemma 8, γ(v) can then be found in O(d_v + 1) time. Similarly, in the latter case, γ(v) can be found in O(1) time. Thus, the compressed leaves in W_i are processed in O(|W_i|) total time. Summing over all the W_i, the time complexity is O(n). Therefore, the overall time for the preprocessing and for finding the auxiliary information in the leaves of all the W_i is O(n^{1.5} log N).

7.2 Auxiliary information in the internal nodes of W

Consider any internal node v in W_i. We want to compute the auxiliary information γ(v) = UMAST(W^v, R_i). We cannot afford to compute each UMAST(W, R_i) separately: each individual R_i may have Θ(n) atomic leaves, and the total size of all the R_i can be Θ(n²). If the input trees have unbounded degrees, even computing one particular UMAST(W, R_i) already takes O(n^{1.5} log N) time. Fortunately, the R_i are very similar: each R_i is X^y with R̄_i removed. Exploiting this similarity and making use of an efficient algorithm for the all-cavity maximum matching problem [14], O(n^{1.5} log N) time suffices to compute all the UMAST(W, R_i). Furthermore, after such preprocessing, we can retrieve UMAST(W^v, R_i) for any internal node v in W and any i ∈ [1, b] in O(log² |W|) = O(log² n) time. Therefore, it takes O(|W_i| log² n) time to compute γ(v) for all internal nodes v of one particular W_i, and O(n log² n) time for all the W_i. The details are as follows.

First of all, we need a simple preprocessing step. Note that if we remove y from X^y, the tree decomposes into the subtrees R̄_1, R̄_2, ..., R̄_b. Thus, the total number of atomic leaves in all the R̄_i is exactly n. The following lemma extends Lemma 6.

Lemma 9 We can compute MAST(W, R̄_i) for all i ∈ [1, b] in O(n^{1.5} log N) total time. Afterwards, for any node v in W and any i ∈ [1, b], MAST(W^v, R̄_i) can be retrieved in O(log n) time.

To take advantage of the similarity between the trees R_i, we partition the value of UMAST(W^v, R_i) to reveal some structural relationship. For any v ∈ W and i ∈ [1, b], let RMAST(W^v, R_i) denote the maximum size among all the agreement subtrees of W^v and R_i in which v (i.e., the root of W^v) is mapped to the root of R_i.

Lemma 10 UMAST(W^v, R_i) = max{ max{UMAST(W^v, R̄_j) | j ∈ [1, b], j ≠ i}, max{RMAST(W^z, R_i) | z ∈ W^v} }.

Proof: If, in some maximum agreement subtree of W^v and R_i, the root of R_i is mapped to some node z in W^v, then UMAST(W^v, R_i) = MAST(W^z, R_i) = RMAST(W^z, R_i). On the other hand, UMAST(W^v, R_i) = UMAST(W^v, R̄_j) for some j ≠ i if, in some maximum agreement subtree of W^v and R_i, the root of R_i is not mapped to any node z in W^v. The lemma follows.

By Lemma 10, we decompose the computation of UMAST(W^v, R_i) into two parts; we perform some preprocessing on W and X and build a different data structure for each part. From Lemma 9, after an O(n^{1.5} log N)-time preprocessing, we can retrieve UMAST(W^v, R̄_i) for any v in W and any i ∈ [1, b] in O(log n) time. Then, based on the work in [7], we can build in O(n log² n) time a data structure for the values UMAST(W^v, R̄_i) with v in W and i in [1, b]; afterwards, we can determine max{UMAST(W^v, R̄_j) | j ∈ [1, b], j ≠ i} for any v ∈ W and i ∈ [1, b] in O(log² n) time.

The computation of max{RMAST(W^z, R_i) | z is any node in W^v} is more difficult. For one thing, for a fixed v and i, there may already be n values involved in the expression; we cannot afford to evaluate them separately. Again, we take advantage of the similarities between the trees R_i. It is known [7] that RMAST(W^z, R_i) can be derived from a maximum weight matching of some bipartite graph. Let Ch(z) denote the set of children of z in W, and define G_{z,i}, a weighted bipartite graph on Ch(z) ∪ {R̄_1, ..., R̄_{i−1}, R̄_{i+1}, ..., R̄_b}, in which a child w of z is connected to R̄_j if and only if UMAST(W^w, R̄_j) > 0; this edge is given a weight of UMAST(W^w, R̄_j), which is at most N. Let mwm(G_{z,i}) denote the weight of a maximum weight matching of G_{z,i}.

Lemma 11 (see [7]) If the root of R_i is mapped to z in some maximum agreement subtree of W^z and R_i, then the maximum weight matching of G_{z,i} consists of at least two edges, and mwm(G_{z,i}) equals RMAST(W^z, R_i).

Note that if the maximum weight matching of G_{z,i} consists of one edge, it corresponds to an agreement subtree of W^z and R_i in which the root of R_i is not mapped to any node in W^z. Thus, it is possible that mwm(G_{z,i}) > RMAST(W^z, R_i). Nevertheless, in this case we are no longer interested in the exact value of RMAST(W^z, R_i), since in the maximum agreement subtree of W^z and R_i the root of R_i is not mapped to any node in W^z. As a matter of fact, Lemma 10 can be rewritten with RMAST(W^z, R_i) replaced by mwm(G_{z,i}). Furthermore, because G_{z,1}, G_{z,2}, ..., G_{z,b} are very similar, the values of their maximum weight matchings cannot all be distinct.

Lemma 12 Among mwm(G_{z,1}), mwm(G_{z,2}), ..., mwm(G_{z,b}), there are at most d_z + 1 distinct values, where d_z denotes the degree of z in W.

Below, we show that it takes O(n^{1.5} log N) time to compute, for all z in W, the values mwm(G_{z,1}), mwm(G_{z,2}), ..., mwm(G_{z,b}). The results are stored in an array A(z) of dimension b for each node z, i.e., A(z)[i] = mwm(G_{z,i}). Note that if we represented each A(z) as an ordinary array, then filling these arrays entry by entry for all z ∈ W would cost Ω(bn) time. Nevertheless, as pointed out in Lemma 12, most of the mwm(G_{z,i}) share the same value. Thus, we store these values in sparse arrays. Like an ordinary array, any entry in a sparse array A can be read and modified in O(1) time. In addition, we require that all the entries of A can be initialized to a fixed value in O(1) time, and that all the distinct values stored in A can be retrieved in O(m) time, where m denotes the number of distinct values in A. For an implementation of sparse arrays, see Exercise 2.12, page 71 of [1].

Before showing how to build these sparse arrays, we first illustrate how they support the computation of UMAST(W^v, R_i). To find this value, we need to compute

max{mwm(G_{z,i}) | z is any node in W^v}, which equals max{A(z)[i] | z is any node in W^v}.  (6)

Note that (6) is a generalization of tree product queries [2]; an optimal preprocessing for answering such queries is given in [13]. In O(n log² n) time, we can build a data structure on top of all the sparse arrays A(z) such that, for any v ∈ W and i ∈ [1, b], the value of max{A(z)[i] | z is any node in W^v} can be retrieved in O(log n) time. In conclusion, based on all the preprocessing discussed in this subsection, the value of UMAST(W^v, R_i) for any v ∈ W and any i ∈ [1, b] can be computed in O(log² n) time.

Now we show how to construct a sparse array A(z), or equivalently, how to compute the values {mwm(G_{z,i}) | i ∈ [1, b]} efficiently. We cannot afford to examine every G_{z,i} and compute mwm(G_{z,i}) from scratch. Instead, we build only one weighted bipartite graph G* on Ch(z) ∪ {R̄_1, ..., R̄_b}, as follows. For a node z in W, the max-child z′ of z is a child of z such that the subtree rooted at z′ contains the maximum number of atomic leaves among all subtrees attached to z. Let δ(z) denote the total number of atomic leaves that are in W^z but not in W^{z′}. The edges of G* are specified as follows. For any non-max-child u of z, G* contains an edge between u and R̄_i if and only if UMAST(W^u, R̄_i) > 0; this edge is given a weight of UMAST(W^u, R̄_i). Regarding the max-child z′ of z, we put into G* only a limited number of edges between z′ and {R̄_1, ..., R̄_b}: for each R̄_i already connected to some non-max-child of z, G* has an edge between z′ and R̄_i if UMAST(W^{z′}, R̄_i) > 0; among all the other R̄_i, we pick R̄_{i′} and R̄_{i″} such that UMAST(W^{z′}, R̄_{i′}) and UMAST(W^{z′}, R̄_{i″}) are the first and second largest. Note that R̄_{i′} and R̄_{i″} can be found in O((δ(z) + 1) log² n) time. Altogether, G* contains O(|Ch(z)| + b) nodes and O(δ(z) + 1) edges. Let G* − R̄_i denote the bipartite subgraph constructed by removing the node R̄_i from G*.
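A minimal sketch of the sparse-array idea referenced above (the back-pointer validation trick behind Exercise 2.12 of [1]). In Python the backing lists are of course allocated eagerly, so this only demonstrates the validation logic, not true O(1) initialization; the class name and interface are our own.

```python
# Sketch (not from [1] verbatim): a constant-time-initializable array.  Entry i
# is trusted only if a round trip through the `touched` stack validates it, so
# no initialization pass is needed, and the distinct stored values can be
# listed in O(m) time by scanning `touched`.

class SparseArray:
    def __init__(self, size, default=0):
        self.default = default
        self.val = [None] * size      # conceptually uninitialized memory
        self.when = [0] * size        # index into `touched` claimed by slot i
        self.touched = []             # stack of slots written so far

    def _valid(self, i):
        j = self.when[i]
        return j < len(self.touched) and self.touched[j] == i

    def __getitem__(self, i):
        return self.val[i] if self._valid(i) else self.default

    def __setitem__(self, i, x):
        if not self._valid(i):
            self.when[i] = len(self.touched)
            self.touched.append(i)
        self.val[i] = x

    def values(self):                 # distinct values, O(m) entries scanned
        return {self.val[i] for i in self.touched} | {self.default}

A = SparseArray(10 ** 6, default=7)
A[3] = 42
print(A[3], A[999999])               # 42 7
```

Storing each A(z) this way is what lets the algorithm pay only for the O(d_z + 1) distinct values promised by Lemma 12 rather than for all b entries.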
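To make the mwm(G_{z,i}) quantities and the all-cavity values concrete, here is a brute-force toy with made-up edge weights; it is nothing like the efficient algorithms of [9] and [14], but it shows what a maximum weight matching and the deletion of one right-hand node mean.

```python
# Brute-force toy (assumed weights, not the paper's algorithms): maximum
# weight matching of a small bipartite graph, plus the "all-cavity" values
# mwm(G - r) obtained by deleting one right-hand node at a time.
from itertools import permutations

def mwm(left, right, weight):
    """Weight of a maximum weight matching; `weight` maps (l, r) pairs to >= 0."""
    k = min(len(left), len(right))
    best = 0
    for rs in permutations(right, k):
        for ls in permutations(left, k):
            best = max(best, sum(weight.get((l, r), 0) for l, r in zip(ls, rs)))
    return best

left, right = ["u1", "u2"], ["R1", "R2", "R3"]
weight = {("u1", "R1"): 5, ("u1", "R3"): 2, ("u2", "R2"): 4, ("u2", "R3"): 1}

print(mwm(left, right, weight))                                  # 9 (u1-R1, u2-R2)
cavity = {r: mwm(left, [x for x in right if x != r], weight) for r in right}
print(cavity)                                                    # {'R1': 6, 'R2': 6, 'R3': 9}
```

The point of the all-cavity machinery is that all b such deleted-node optima can be obtained in roughly the cost of one matching computation, which is what Lemma 13 below exploits.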
Lemma 13 mwm(G* − R̄_i) = mwm(G_{z,i}) for all i ∈ [1, b]. Furthermore, the graph G* can be built in O((δ(z) + 1) log² n) time.

Proof: The fact that mwm(G* − R̄_i) = mwm(G_{z,i}) for all i ∈ [1, b] follows from the construction of G*. Recall that G* contains O(d) nodes and O(δ(z) + 1) edges, and the weight of every edge in G* is at most N. Using Gabow and Tarjan's algorithm [9], we can compute mwm(G*) in O(√(δ(z) + 1) · (δ(z) + 1) log(dN)) time. It is shown in [14] that the same amount of time suffices to compute mwm(G* − R̄_i) for all i ∈ [1, b] and store the results in a sparse array A(z). Therefore, the time used to compute A(z) for all z ∈ W is on the order of

Σ_{z ∈ W} (δ(z) + 1)^{1.5} log N.  (7)

Using the centroid decomposition technique of Farach, Przytycka and Thorup [5], we can show that (7) is O(n^{1.5} log N).

Acknowledgments

The authors thank Martin Farach and Mikkel Thorup for helpful discussions.

References

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.

[2] N. Alon and B. Schieber. Optimal preprocessing for answering on-line product queries. Manuscript.

[3] R. Cole, M. Farach, R. Hariharan, T. M. Przytycka, and M. Thorup. An O(n log n) time algorithm for the maximum agreement subtree problem for binary trees. Submitted for journal publication; journal version of [4, 5], 1996.

[4] R. Cole and R. Hariharan. An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 323-332, 1996.

[5] M. Farach, T. M. Przytycka, and M. Thorup. Computing the agreement of trees with bounded degrees. In P. Spirakis, editor, Lecture Notes in Computer Science 979: Proceedings of the Third Annual European Symposium on Algorithms, pages 381-393. Springer-Verlag, New York, NY, 1995.

[6] M. Farach and M. Thorup. Fast comparison of evolutionary trees. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 481-488, 1994.

[7] M. Farach and M. Thorup. Optimal evolutionary tree comparison by sparse dynamic programming (extended abstract). In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, pages 770-779, 1994.

[8] C. R. Finden and A. D. Gordon. Obtaining common pruned trees. Journal of Classification, 2:255-276, 1985.

[9] H. N. Gabow and R. E. Tarjan. Faster scaling algorithms for network problems. SIAM Journal on Computing, 18:1013-1036, 1989.

[10] D. Gusfield. Optimal mixed graph augmentation. SIAM Journal on Computing, 16:599-612, 1987.

[11] M. Y. Kao. Data security equals graph connectivity. SIAM Journal on Discrete Mathematics, 9:87-100, 1996.

[12] M. Y. Kao. Tree contractions and evolutionary trees. SIAM Journal on Computing, 1996. To appear.

[13] M. Y. Kao, T. W. Lam, W. K. Sung, and H. F. Ting. Efficient data structures for range maximum problems. To be submitted.

[14] M. Y. Kao, T. W. Lam, W. K. Sung, and H. F. Ting. Efficient algorithms for all-cavity maximum matching problems. To be submitted.

[15] M. Y. Kao, T. W. Lam, W. K. Sung, and H. F. Ting. Label compressions and unrooted evolutionary trees. Submitted for journal publication, 1997.

[16] E. Kubicka, G. Kubicki, and F. R. McMorris. An algorithm to find agreement subtrees. Journal of Classification, 12(1):91-99, 1995.

[17] T. W. Lam, W. K. Sung, and H. F. Ting. Computing the unrooted maximum agreement subtree in subquadratic time. In Proceedings of the 5th Scandinavian Workshop on Algorithm Theory, pages 124-135, 1996.

[18] T. M. Przytycka. Sparse dynamic programming for maximum agreement subtree problem. Submitted for journal publication, 1997.

[19] M. Steel and T. Warnow. Kaikoura tree theorems: Computing the maximum agreement subtree. Information Processing Letters, 48:77-82, 1993.