Efficient Algorithms for Descendent Subtrees Comparison of Phylogenetic Trees with Applications to Co-evolutionary Classifications in Bacterial Genome
Yaw-Ling Lin* and Tsan-Sheng Hsu**

Abstract. A phylogenetic tree is a rooted tree with unbounded degree such that each leaf node is uniquely labelled from 1 to $n$. The descendent subtree of a phylogenetic tree $T$ is the subtree composed of all edges and nodes of $T$ descending from a vertex. Given a set of phylogenetic trees, we present linear time algorithms for finding all leaf-agree descendent subtrees as well as all isomorphic descendent subtrees. The normalized cluster distance, $d(A, B)$, of two sets is defined by $d(A, B) = \Delta(A, B)/(|A| + |B|)$, where $\Delta(A, B)$ denotes the size of the symmetric set difference of the two sets. We show that computing all pairs of normalized cluster distances between descendent subtrees of two phylogenetic trees can be done in $O(n^2)$ time. Since the total size of the output is $\Theta(n^2)$, the algorithm is computationally optimal. A nearest subtree of a subset of leaves is a descendent subtree that has the smallest normalized cluster distance to these leaves. Here we show that finding nearest subtrees for a collection of pairwise disjoint subsets of leaves can be done in $O(n)$ time. Several applications of these algorithms in bioinformatics are considered. Among them, we discuss the functional analysis and classification of two-component systems (2CS) in bacterial genomes.

Keywords: algorithm, phylogenetic trees, subtrees comparison, co-evolutionary classification.
1
Introduction
Trees are widely used to represent evolutionary, historical, or hierarchical relationships in various fields of classification. Biologists use the information contained in the DNA sequences of a collection of organisms, or taxa, to infer the evolutionary relationships among those taxa. Phylogenetic trees typically represent the evolutionary history of a collection of extant species or the line of descent of some genes, and may also be used to classify individuals (or populations) of the same species. Numerous phylogenetic inference methods, e.g. maximum parsimony, maximum likelihood, distance matrix fitting, subtrees consistency, and quartet based methods, have been proposed over the years [15, 1, 14, 26, 17, 27, 4]. Furthermore, it is rather common to compare the same set of species with respect to different biological sequences or different genes, hence obtaining various trees. The resulting trees may agree in some parts and differ in others. In general, one is interested in finding the largest set of items on which the trees agree. This motivates the need to compare different trees in order to achieve consensus or extract partial agreements.

For measuring the similarity or difference between trees, several distance measures have been proposed [9], e.g. the symmetric difference metric [23], the nearest-neighbor interchange metric [29], the subtree transfer distance [2], the Robinson and Foulds (RF) metric [24], and the quartet metric [11, 6]. There have been many suggestions in the literature for how to infer the consensus tree from a profile of trees [7, 12, 3, 10]. Among them, the maximum agreement subtree (MAST) problem has been investigated extensively. Also known as the maximum homeomorphic agreement subtree [5], the problem is: given a set of rooted trees whose leaves are drawn from the same
* Department of Computer Science and Information Management, Providence University, 200 Chung Chi Road, Shalu, Taichung, Taiwan 433. e-mail: [email protected]. This work is supported in part by the National Science Council, Taiwan, R.O.C., grant NSC-91-2213-E-126-004.
** Institute of Information Science, Academia Sinica, Nankang 115, Taipei, Taiwan. e-mail: [email protected].
set of items of size $n$, find the largest subset of these items so that the portions of the trees restricted to the subset are isomorphic. Amir and Keselman [3] show that the problem is NP-hard even for three trees of unbounded degree. The problem is also hard to approximate: Hein et al. [19] show that the MAST for three trees with unbounded degree cannot be approximated within ratio $2^{\log^{\delta} n}$ in polynomial time for any $\delta < 1$, unless $\mathrm{NP} \subset \mathrm{DTIME}[2^{\mathrm{polylog}\, n}]$. On the positive side, polynomial time algorithms are obtainable for three or more bounded degree trees [3, 12], even though the time complexity is exponential in the degree bound. Efficient algorithms for the MAST problem on two trees have been widely investigated in the literature. Farach and Thorup [13] give an $O(n^{1.5} \log n)$ time algorithm for two trees of arbitrary degree. Cole et al. [8] show that the MAST of two binary trees can be found in $O(n \log n)$ time, while the MAST of two degree-$d$ trees can be found in $O(\min\{n \sqrt{d} \log^2 n,\; n d \log n \log d\})$ time.

With the rapid expansion in genomic data, the age of large-scale biomolecular sequence analysis has arrived. An important line of research in post-genome analysis is to analyze the evolution and the co-evolutionary clustering of genes in genomic sequences. Facing the known algorithmic difficulties of the MAST problem discussed above, here we turn our attention to the problem of comparing the descendent subtrees among a set of different phylogenetic trees according to their Robinson-Foulds distance metric. The Robinson-Foulds distance metric [24] is a commonly used similarity measure, which is closely related to the notion of symmetric set difference [16]. Formally, let $A, B \subset S$ be two clusters of a gene set $S$. The symmetric set difference, $\Delta(A, B)$, is defined by

$$\Delta(A, B) = |(A \cup B) \setminus (A \cap B)| = |A \setminus B| + |B \setminus A|.$$
Further, the normalized cluster distance, an RF-like metric, $d(A, B)$, is defined as

$$d(A, B) = \frac{\Delta(A, B)}{|A| + |B|}.$$

The normalized cluster distance between two sets is a rough measurement of the degree of difference between them. Note that $0 \le d(A, B) \le 1$; $d(A, B) = 0$ if $A = B$, and $d(A, B) = 1$ if $A \cap B = \emptyset$. In other words, a smaller value of $d(A, B)$ implies a greater similarity of $A$ and $B$.

Consider a rooted (unbounded degree) phylogenetic tree $T$ with $n$ leaves such that each leaf is labelled with a distinct number from 1 to $n$. Here we assume that each internal node of $T$ has at least two children; thus the total size of $T$ is bounded by $O(n)$. Let $v$ be an (internal or leaf) vertex of $T$. The descendent subtree of $T$ descending from $v$, denoted by $T[v]$, is the subtree of $T$ composed of all edges and nodes of $T$ descending from $v$. Furthermore, we use $L_T(v)$ to denote the set of all descendent leaves of $v$ in $T$. That is, $L_T(v) = \{x \mid x \text{ is a leaf of } T[v]\}$. For clarity of presentation, we sometimes omit the subscript $T$ and use $L(v)$ as a shorthand for $L_T(v)$ when the context is clear. Note that $L(v) = \{v\}$ if $v$ itself is a leaf.

Let $T_i, T_j$ be two rooted trees whose leaves are drawn from the same set $\{1, \ldots, n\}$. Two descendent subtrees of $T_i, T_j$, namely $T_i[v_i]$ and $T_j[v_j]$, are called leaf-agree if $L_{T_i}(v_i) = L_{T_j}(v_j)$. Two leaf-agree isomorphic (or just isomorphic) subtrees, denoted by $T_i[v_i] \cong T_j[v_j]$, are defined recursively as follows: either $v_i \in T_i$ and $v_j \in T_j$ are two leaf nodes with the same label; or the children of $v_i$ (say $\{v_{i1}, \ldots, v_{im}\}$) and the children of $v_j$ (say $\{v_{j1}, \ldots, v_{jm}\}$) can be put into one-to-one correspondence such that $T_i[v_{i1}] \cong T_j[v_{j1}], \ldots,$ and $T_i[v_{im}] \cong T_j[v_{jm}]$. In this paper, we consider the following problems:

Definition 1 (all-pair-descendent-subtree).
Given two rooted trees $\{T_1, T_2\}$ whose leaves are drawn from the same set $S$, compute all pairs of normalized cluster distances between the descendent leaf sets of the two trees. That is, output the set $\{d(L_{T_1}(u), L_{T_2}(v)) \mid u \in T_1, v \in T_2\}$.

Definition 2 (near-subtree). Given a rooted tree $T$ with leaf nodes $S$ and a subset $V \subset S$, compute the nearest descendent subtree of $T$ with respect to $V$. That is, find a vertex $v^* \in T$ such that $d(V, L_T(v^*)) = \min\{d(V, L_T(v)) \mid v \in T\}$.
Definition 3 (leaf-agree-subtree). Given a set of rooted trees $\{T_1, T_2, \ldots, T_k\}$ whose leaves are drawn from the same set $S$, find all leaf-agree subtrees of these trees. That is, output the set $\{(v_1, v_2, \ldots, v_k) \mid v_1 \in T_1, v_2 \in T_2, \ldots, v_k \in T_k,\; L_{T_1}(v_1) = L_{T_2}(v_2) = \cdots = L_{T_k}(v_k)\}$.

Definition 4 (isomorphic-subtree). Given a set of rooted trees $\{T_1, T_2, \ldots, T_k\}$ whose leaves are drawn from the same set $S$, find all isomorphic subtrees of these trees. That is, output the set $\{(v_1, v_2, \ldots, v_k) \mid v_1 \in T_1, v_2 \in T_2, \ldots, v_k \in T_k,\; T_1[v_1] \cong T_2[v_2] \cong \cdots \cong T_k[v_k]\}$.

The rest of the paper is organized as follows. In Section 2, we show that computing all pairs of normalized cluster distances between the descendent subtrees of two trees can be done in $O(n^2)$ time. Since the total size of the output is $\Theta(n^2)$, the algorithm is computationally optimal. In Section 3, we show that finding nearest subtrees for a collection of pairwise disjoint subsets of leaves can be done in $O(n)$ time. In Section 4, we show that finding all leaf-agree subtrees as well as all isomorphic subtrees of $\{T_1, T_2, \ldots, T_k\}$ can be done in $O(kn)$ time, which is optimal since it is linear in the size of the input trees. Section 5 discusses the biological applications in more depth.
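As a concrete illustration of the symmetric set difference $\Delta(A, B)$ and the normalized cluster distance $d(A, B)$ defined earlier in this section, the following minimal Python sketch evaluates both on small sets. The function names `sym_diff_size` and `cluster_distance` are illustrative choices, not names used in the paper, and the case of two empty sets (where the formula is undefined) is rejected here by assumption.

```python
def sym_diff_size(a, b):
    """Delta(A, B): number of elements in exactly one of the two sets."""
    return len(a ^ b)


def cluster_distance(a, b):
    """Normalized cluster distance d(A, B) = Delta(A, B) / (|A| + |B|)."""
    if not a and not b:
        # Assumption of this sketch: the distance is left undefined here.
        raise ValueError("d(A, B) is undefined when both sets are empty")
    return sym_diff_size(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    A = {1, 2, 3, 4}
    B = {3, 4, 5}
    print(sym_diff_size(A, B))            # 3, namely the elements 1, 2 and 5
    print(cluster_distance(A, B))         # 3 / 7
    print(cluster_distance(A, A))         # 0.0 for identical sets
    print(cluster_distance({1}, {2}))     # 1.0 for disjoint sets
```

Identical sets give distance 0 and disjoint sets give distance 1, matching the bounds stated with the definition.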
2
All Pairs Subtrees Comparison
Here we discuss the problem of all pairs subtree comparison. A naive algorithm would compute the normalized distances of all pairs of subtrees of the two given trees using $O(n^2)$ comparisons, each taking $O(n)$ time; thus a total of $O(n^3)$ operations is needed. The goal here is to compute all paired distances within $O(n^2)$ time. Note that the total size of the output is $\Theta(n^2)$, so an $O(n^2)$ time algorithm is computationally optimal.

The idea is to find a recurrence formula such that the normalized cluster distance of a parent node can be computed from those of its children in time proportional to the number of its children. Let $u$ be a node of a phylogenetic tree $T_1$, and $v$ be a node of another phylogenetic tree $T_2$ such that $v$ is the parent of $\{v_1, \ldots, v_m\}$. The target is to compute $\Delta(u, v)$ from $\{\Delta(u, v_1), \ldots, \Delta(u, v_m)\}$ in $O(m)$ time. Here, for ease of description, we use $\Delta(u, v)$ as a shorthand for $\Delta(L(u), L(v))$.

Lemma 1 (constant time $\Delta(A, B)$ calculation). Let $A, B_1, B_2$ be three sets with $B_1 \cap B_2 = \emptyset$. It follows that $\Delta(A, B_1 \cup B_2) = \Delta(A, B_1) + \Delta(A, B_2) - |A|$.

Proof. It is easily verified that $\Delta(X, Y) = |X| + |Y| - 2|X \cap Y|$ for any two sets $X, Y$. Since $B_1 \cap B_2 = \emptyset$, we have

\begin{align*}
\Delta(A, B_1 \cup B_2) &= |A| + |B_1 \cup B_2| - 2|A \cap (B_1 \cup B_2)| \\
&= |A| + |B_1| + |B_2| - 2|A \cap (B_1 \cup B_2)| \\
&= |A| + |B_1| + |B_2| - 2|A \cap B_1| - 2|A \cap B_2| \\
&= (|A| + |B_1| - 2|A \cap B_1|) + (|A| + |B_2| - 2|A \cap B_2|) - |A| \\
&= \Delta(A, B_1) + \Delta(A, B_2) - |A|.
\end{align*}
□
The following result is easily deduced from Lemma 1 by induction.

Corollary 1. Let $A, B_1, \ldots, B_k$ be $k + 1$ sets such that $B_i \cap B_j = \emptyset$ for all $i \ne j$. It follows that

$$\Delta(A, B_1 \cup \cdots \cup B_k) = (1 - k)|A| + \sum_{i=1}^{k} \Delta(A, B_i).$$
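Corollary 1 can be checked numerically on a small instance. The Python sketch below compares the direct computation of $\Delta(A, B_1 \cup \cdots \cup B_k)$ with the right-hand side of the corollary; the helper names `delta` and `delta_of_union` and the sample sets are illustrative assumptions, not part of the paper.

```python
def delta(a, b):
    """Delta(A, B): size of the symmetric set difference."""
    return len(a ^ b)


def delta_of_union(a, parts):
    """Delta(A, B_1 u ... u B_k) computed via Corollary 1.

    `parts` must be pairwise disjoint; the identity does not hold otherwise.
    """
    k = len(parts)
    return (1 - k) * len(a) + sum(delta(a, b) for b in parts)


if __name__ == "__main__":
    A = {1, 2, 3, 6}
    parts = [{1, 2}, {3, 4}, {5}]           # pairwise disjoint
    union = set().union(*parts)             # {1, 2, 3, 4, 5}
    print(delta(A, union))                  # direct: elements 4, 5, 6 -> 3
    print(delta_of_union(A, parts))         # (1 - 3) * 4 + (2 + 4 + 5) = 3
```

Only the per-part values $\Delta(A, B_i)$ and the size $|A|$ are needed on the right-hand side, which is what makes the bottom-up computation in this section possible.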
Note that Lemma 1 implies that $\Delta(u, v)$ can be calculated from $\Delta(u, v_1)$ and $\Delta(u, v_2)$ in constant time when $|L(u)|$ is precomputed. Given a pair of phylogenetic trees $T_1$ and $T_2$, we store at each node $u \in T_1$ and $v \in T_2$ the associated numbers of descendent leaves $|L(u)|$ and $|L(v)|$. Further, for each node $u \in T_1$ we store an array of the values $\Delta(u, \cdot)$, so that whenever we need the value $d(u, v)$, it can be computed as $d(u, v) = \Delta(u, v)/(|L(u)| + |L(v)|)$.

Theorem 1. Computing all paired subtree distances can be done in $O(n^2)$ time.
All-Pair(T1, T2)
Input: Two phylogenetic trees T1 and T2 with leaves {1, 2, ..., n}.
Output: All pairs ∆(u, v)'s for all u ∈ T1, v ∈ T2.
1  Compute |L(u)|'s, |L(v)|'s, level(u)'s, level(v)'s for all u ∈ T1, v ∈ T2.
2  for each u ∈ T1 in increasing order of level(u) do    ▷ bottom up.
3      for each v ∈ T2 in increasing order of level(v) do
4          if both u, v are leaf nodes then    ▷ initial condition.
5              ∆(u, v) ← 0 if u = v; otherwise, ∆(u, v) ← 2
6          else if u is a leaf then let v be the parent of {v_1, ..., v_j};
7              ∆(u, v) ← 1 − j + Σ_{i=1..j} ∆(u, v_i)    ▷ Note that (1 − j) = (1 − j)|L(u)|.
8          else let u be the parent of {u_1, ..., u_k};
9              ∆(u, v) ← (1 − k)|L(v)| + Σ_{i=1..k} ∆(u_i, v)
Fig. 1. Computing all pairs ∆(T1, T2)'s in O(n^2) time.
Proof. We propose an $O(n^2)$ time algorithm, All-Pair($T_1$, $T_2$), shown in Figure 1. The algorithm builds up all the values $\Delta(\cdot, \cdot)$ in a bottom-up manner. Its correctness follows from Corollary 1 and the correctness of the computation ordering. To ensure the correct computation ordering, we introduce the following notation. Recall that the $v$-descendant subtree, denoted by $T[v]$, is the subtree induced by all descendants of $v$ in $T$; here we assume that $v$ is a descendant of itself. The level of a node $v$ in $T$, denoted by level($v$), is the height of $T[v]$. Thus, whenever we traverse the nodes of a phylogenetic tree $T$ in increasing order of level, the descendants of a node $v$ have already been visited before $v$.

Step 1 of All-Pair can be computed in $O(n)$ time by a bottom-up computation; a standard post-order traversal of the tree suffices. Let $v$ be the parent of $\{v_1, \ldots, v_m\}$ in a tree $T$. It follows that $|L(v)| = |L(v_1)| + \cdots + |L(v_m)|$ with the initial condition that $|L(v)| = 1$ when $v$ is a leaf. Further, level($v$) $= \max\{$level($v_1$), $\ldots$, level($v_m$)$\} + 1$ with the initial condition that level($v$) $= 1$ when $v$ is a leaf. These computations can be done in time proportional to the number of edges plus the number of vertices in $T$, i.e., $O(n)$ time.

For the main loops, fix a node $u \in T_1$ with $k$ children ($k = 0$ if $u$ is a leaf). If $u$ is a leaf, the inner loop (Steps 3 to 9) spends time proportional to the number of children of each $v$, which sums to the number of edges plus the number of vertices of $T_2$, i.e., $O(n)$. Otherwise, each of the $O(n)$ nodes $v$ costs $O(k)$ time, for $O(kn)$ in total. Since the outer loop (Step 2) has $O(n)$ iterations and the numbers of children over all $u \in T_1$ sum to $O(n)$, it follows that All-Pair($T_1$, $T_2$) finishes in $O(n^2)$ time. □
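The bottom-up recurrence of All-Pair can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: each tree is given as a hypothetical dictionary `children` mapping an internal node to its list of children, leaves are the shared integer labels, internal node names are arbitrary, and a reversed depth-first order stands in for the ordering by level (both visit children before parents).

```python
def bottom_up_order(children, root):
    """Nodes ordered so that every node appears after all its descendants."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    return order[::-1]


def leaf_counts(children, order):
    """|L(v)| for every node, computed children-first."""
    size = {}
    for v in order:
        kids = children.get(v, [])
        size[v] = sum(size[c] for c in kids) if kids else 1
    return size


def all_pair(children1, root1, children2, root2):
    """Return (delta, size1, size2) with delta[u, v] = Delta(L(u), L(v))."""
    order1 = bottom_up_order(children1, root1)
    order2 = bottom_up_order(children2, root2)
    size1 = leaf_counts(children1, order1)
    size2 = leaf_counts(children2, order2)
    delta = {}
    for u in order1:
        kids_u = children1.get(u, [])
        for v in order2:
            kids_v = children2.get(v, [])
            if not kids_u and not kids_v:      # two leaves
                delta[u, v] = 0 if u == v else 2
            elif not kids_u:                   # u is a leaf, so |L(u)| = 1
                delta[u, v] = (1 - len(kids_v)) + sum(delta[u, c] for c in kids_v)
            else:                              # Corollary 1 applied to u's children
                delta[u, v] = (1 - len(kids_u)) * size2[v] + sum(
                    delta[c, v] for c in kids_u)
    return delta, size1, size2


if __name__ == "__main__":
    t1 = {"r1": ["a", "b"], "a": [1, 2], "b": [3, 4]}   # ((1,2),(3,4))
    t2 = {"r2": ["c", 4], "c": [1, 2, 3]}               # ((1,2,3),4)
    delta, size1, size2 = all_pair(t1, "r1", t2, "r2")
    print(delta["a", "c"])                               # 1: {1,2} vs {1,2,3}
    print(delta["a", "c"] / (size1["a"] + size2["c"]))   # d = 1 / 5 = 0.2
```

The dictionary `delta` holds one entry per pair of nodes, so the $\Theta(n^2)$ output size is visible directly in its length.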
3
Nearest Subtrees
The lowest common ancestor (LCA) of two nodes $u$ and $v$ in a tree is the node furthest from the root that lies on both paths from the root to $u$ and to $v$. Harel and Tarjan [18] have shown that any $n$-node tree can be preprocessed in $O(n)$ time such that subsequent LCA queries can be answered in constant time.
3.1
Confluent Subtrees
Let $T$ be a phylogenetic tree with leaf nodes $L$. Given $S \subset L$, let the set $\Lambda(S) = \{\mathrm{Lca}(x, y) \mid x \ne y \in S\}$ denote the collection of all (proper) lowest common ancestors defined over $S$.

Definition 5 (confluent subtree). Let $T$ be a phylogenetic tree with leaf nodes $L$. Given $S \subset L$, the confluent subtree of $S$ in $T$ is a phylogenetic tree, denoted by $T{\uparrow}S$, with leaf nodes $S$ and internal nodes $\Lambda(S)$. Further, $u \in \Lambda(S)$ is the parent of $v$ in $T{\uparrow}S$ if and only if, among all nodes of $\Lambda(S)$, $u$ is the lowest proper ancestor of $v$ in $T$.

What we call a confluent subtree is called an induced subtree in the literature [8]. Let $T$ be a phylogenetic tree with leaf nodes $L$. A post-order (pre-order, or in-order) traversal of the nodes of $L$ within $T$ defines a tree ordering of the nodes of $L$. The following result is a generalization of Section 8 of [8].

Lemma 2. Let $T$ be an $n$-node phylogenetic tree with leaf nodes $L$. The following operation can be done efficiently after an $O(n)$ time preprocessing. Given a query set $A \subset L$, the confluent subtree $T{\uparrow}A$ can be constructed in $O(|A|)$ time if $A$ is given in sorted tree ordering; otherwise, $T{\uparrow}A$ can be constructed in $O(|A| \log \log |A|)$ time.
Confluent(T, A)
Input: A phylogenetic tree T with leaves L = {1, 2, ..., n}, A ⊂ L.
Output: The confluent subtree of A in T, T↑A.
Preprocessing: Compute the tree ordering of L on T, and perform the Lca constant time queries preprocessing [18].
Notations: p[·, T'], left[·, T'], right[·, T']: parent, left, right children links.
1  Let A = ⟨v_1, v_2, ..., v_k⟩ be nodes of A in the tree ordering.
2  Create a dummy node λ and let level[λ, T] ← +∞; Push(S, λ)
3  for i ← 1 to k − 1 do    ▷ visit each v_i's.
4      x ← Lca(v_i, v_{i+1}); y ← v_i
5      while level[x, T] > level[top[S], T] do y ← Pop(S)
6      Push(S, x); p[v_{i+1}, T'] ← x; p[y, T'] ← x; p[x, T'] ← top[S]
7      left[x, T'] ← y; right[x, T'] ← v_{i+1}; right[top[S], T'] ← x
8  root[T'] ← right[λ, T']; return T' as T↑A
Fig. 2. Computing the confluent subtree.
Proof. We propose an $O(|A|)$ time algorithm, Confluent($T$, $A$), shown in Figure 2. The algorithm requires an $O(n)$ time preprocessing phase for building the tree ordering of $L$ on $T$ and for the constant-time Lca query preprocessing [18]. Further, the input $A \subset L$ is assumed to be listed according to the tree ordering of $L$; otherwise, we can use the data structure of van Emde Boas [28] to sort these integers of bounded range in $O(|A| \log \log |A|)$ time. The correctness proof and the time analysis are omitted, since most of the details are similar to the arguments presented in [8]. □
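The stack-based construction behind Lemma 2 can be sketched in Python as follows. This is an illustration under explicit assumptions, not the procedure of Figure 2 verbatim: it uses the common multiway-tree variant that keeps the rightmost path of the partial confluent tree on a stack, it answers Lca queries by naively walking parent pointers instead of the constant-time structure of [18], and it redoes the preprocessing on every call. The tree encoding (a `children` dictionary with integer leaf labels) and all function names are hypothetical.

```python
def index_tree(children, root):
    """Parent, depth, and left-to-right rank of each leaf (the tree ordering)."""
    parent, depth, leaf_rank = {root: None}, {root: 0}, {}
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if not kids:
            leaf_rank[v] = len(leaf_rank)
        for c in reversed(kids):          # reversed so leaves pop left to right
            parent[c], depth[c] = v, depth[v] + 1
            stack.append(c)
    return parent, depth, leaf_rank


def confluent(children, root, leaves):
    """Return (root, parent map) of the confluent subtree induced by `leaves`."""
    if not leaves:
        raise ValueError("the query set must be non-empty")
    parent, depth, leaf_rank = index_tree(children, root)

    def lca(u, v):                        # naive stand-in for O(1) Lca queries
        while u != v:
            if depth[u] < depth[v]:
                v = parent[v]
            else:
                u = parent[u]
        return u

    ordered = sorted(leaves, key=leaf_rank.__getitem__)
    up = {}                               # parent links of the confluent subtree
    stack = [ordered[0]]                  # rightmost path built so far
    for v in ordered[1:]:
        x = lca(stack[-1], v)
        while len(stack) >= 2 and depth[stack[-2]] >= depth[x]:
            up[stack[-1]] = stack[-2]     # finished node hangs below its predecessor
            stack.pop()
        if stack[-1] != x:                # x is a new internal node of the result
            up[stack[-1]] = x
            stack[-1] = x
        stack.append(v)
    while len(stack) >= 2:                # close the remaining rightmost path
        up[stack[-1]] = stack[-2]
        stack.pop()
    return stack[0], up


if __name__ == "__main__":
    tree = {"r": ["a", "b"], "a": [1, 2, 3], "b": [4, "c"], "c": [5, 6]}
    print(confluent(tree, "r", [5, 1, 6, 3]))
```

In the example, node `b` is skipped because it is not the lowest common ancestor of any two query leaves, so `c` becomes a direct child of `r` in the result.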
3.2
Nearest Subtrees
Here we discuss the problem of finding the nearest descendent subtree within a phylogenetic tree, given a subset of leaf nodes.

Definition 6 (nearest subtree). Let $T$ be a phylogenetic tree with leaf nodes $L$. Given $A \subset L$, a node (internal or leaf) $v \in T$ (inducing $T[v]$) is the nearest subtree of $A$ in $T$ if $d(L(v), A) \le d(L(x), A)$ for all $x \in T$.

By utilizing Lemma 2, we can efficiently solve the nearest subtrees problem.

Theorem 2. Let $T$ be an $n$-leaf phylogenetic tree with leaf nodes $L$. Given a collection of pairwise disjoint subsets of $L$, $S = \{A_1, A_2, \ldots, A_j\}$, the nearest subtrees of all the $A_i$'s in $T$ can be found in $O(n)$ total time.
Nearest(T, S)
Input: A phylogenetic tree T with leaves L, S = {A_1, A_2, ..., A_j}, A_i ⊂ L.
Output: The nearest subtrees of A_i's in T.
1  Compute the tree ordering of L on T and the Lca queries preprocessing.
2  Sort nodes of A_i's in the tree ordering.
3  Compute the subtree sum s[v, T]'s for each v ∈ T; Q ← ∅
4  for i ← 1 to |S| do    ▷ visit each A_i's.
5      T_i ← Confluent(T, A_i)
6      Compute the subtree sum s[v, T_i]'s for each v ∈ T_i.
       ▷ Find the node v in T_i with the minimum d(A_i, L(v)).
7      v_i ← arg min{d(A_i, L(v)) = 1 − 2s[v, T_i]/(s[v, T] + |A_i|) | v ∈ T_i}
8      Add v_i into output queue Q
9  return Q, the nearest subtrees of A_i's in T
Fig. 3. Computing the nearest subtrees.
Proof. We propose an $O(n)$ time algorithm, Nearest($T$, $S$), shown in Figure 3. For each $A_i$, the algorithm computes the values $d(A_i, L(v))$ and finds a node with the smallest value. The correctness of the algorithm follows from the fact that an internal node $v \notin \Lambda(A_i)$ of $T$ cannot have the smallest cluster distance to $A_i$. The reason is that, if $v \notin \Lambda(A_i)$ has descendants in $A_i$, then exactly one child subtree of $T[v]$ contains leaves of $A_i$ (otherwise $v$ would be the lowest common ancestor of two elements of $A_i$), and all other child subtrees are disjoint from $A_i$. That child has the same intersection with $A_i$ but strictly fewer leaves, and hence a smaller distance. Thus, it suffices to consider only the nodes of Confluent($T$, $A_i$), i.e., $\Lambda(A_i)$ together with the leaves of $A_i$. The time complexity analysis appears in the full paper and is omitted here due to the page limit. □

With slight modifications of the algorithm Nearest($T$, $S$), we can rephrase the result of Theorem 2 and obtain the following:

Theorem 3. Let $T$ be an $n$-node phylogenetic tree with leaf nodes $L$. The following operation can be done efficiently after an $O(n)$ time preprocessing. Given a query set $A \subset L$, the nearest subtree of $A$ can be found in $O(|A|)$ time if $A$ is given in sorted tree ordering; otherwise, an extra $O(|A| \log \log |A|)$ time is needed for sorting $A$.
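The distance formula used in Figure 3, $d(A, L(v)) = 1 - 2|A \cap L(v)|/(|L(v)| + |A|)$, can be illustrated with a simpler Python sketch. Note the assumption: instead of restricting the search to the confluent subtree, the code below scans every node of $T$ bottom-up, which costs $O(n)$ per query rather than $O(|A|)$; it is meant to show the quantity being minimized, not to reproduce the stated running time. The `children` dictionary encoding and the name `nearest_subtree` are hypothetical.

```python
def nearest_subtree(children, root, query):
    """Return (v, d) where v minimizes d(query, L(v)) over all nodes of the tree."""
    query = set(query)
    order, stack = [], [root]
    while stack:                              # parents before descendants
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    size, inter = {}, {}                      # |L(v)| and |L(v) & query|
    best, best_d = None, None
    for v in reversed(order):                 # children before parents
        kids = children.get(v, [])
        if kids:
            size[v] = sum(size[c] for c in kids)
            inter[v] = sum(inter[c] for c in kids)
        else:
            size[v], inter[v] = 1, int(v in query)
        d = 1 - 2 * inter[v] / (size[v] + len(query))
        if best_d is None or d < best_d:      # ties keep the first node found
            best, best_d = v, d
    return best, best_d


if __name__ == "__main__":
    tree = {"r": ["a", "b"], "a": [1, 2, 3], "b": [4, "c"], "c": [5, 6]}
    print(nearest_subtree(tree, "r", {4, 5, 6}))   # ('b', 0.0): L(b) equals the query
    print(nearest_subtree(tree, "r", {1, 2})[0])   # 'a', at distance about 0.2
```

For the query $\{1, 2\}$ the leaf $1$ alone has distance $1/3$ and the root has distance $1/2$, so the internal node `a` with leaf set $\{1, 2, 3\}$ is the nearest subtree.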
4
Leaf-agree Subtrees and Isomorphic Subtrees
The definitions of leaf-agree subtrees and leaf-agree isomorphic subtrees can be found in Section 1. Interestingly, using the idea of Lca sub-merging, we can solve the leaf-agree descendent subtrees problem and the isomorphic subtrees problem in linear time. First we present our result for finding the leaf-agree descendent subtrees of two phylogenetic trees.

Lemma 3 (leaf-agree-subtree/two trees). Given two $n$-leaf phylogenetic trees $T_1$ and $T_2$, finding all pairs of subtrees of $T_1$ and subtrees of $T_2$ with the same set of leaves can be done in $O(n)$ time.

Proof. Given a phylogenetic tree $T$ with leaf nodes $L$ and $S \subset L$, we use the notation $\mathrm{Lca}_T(S)$ to denote the lowest common ancestor defined over $S$; i.e., $\mathrm{Lca}_T(S) = \arg\max_v \{\mathrm{level}(v) \mid v \in \Lambda(S)\}$. In other words, if $v^* = \mathrm{Lca}_T(S) \in \Lambda(S)$, we must have $\mathrm{Lca}_T(v^*, x) = v^*$ for any vertex $x \in \Lambda(S)$. Given $T_1, T_2$, we define a function $\psi[T_1, T_2](\cdot)$, or just $\psi(\cdot)$ for short, that maps each vertex $x \in T_1$ to the corresponding vertex $y \in T_2$ such that $y = \psi(x) = \mathrm{Lca}_{T_2}(L_{T_1}(x))$.

By using Lca queries, it is easy to calculate $\psi(x)$ for every node $x \in T_1$ in $O(n)$ total time. This can be done by a computation ordering that lists the nodes of $T_1$ from the leaf nodes upward toward the root. For a leaf $x$, $\psi(x)$ is the leaf of $T_2$ with the same label. Otherwise, let $x$ be the parent of $\{x_1, \ldots, x_m\}$. We have

$$\psi(x) = \mathrm{Lca}_{T_2}(\{\psi(x_1), \ldots, \psi(x_m)\}),$$

which takes $m - 1$ constant-time pairwise Lca queries.

Note that $A \subseteq L_T(\mathrm{Lca}_T(A))$ in general; thus we have $A = L_T(\mathrm{Lca}_T(A))$ whenever $|A| = |L_T(\mathrm{Lca}_T(A))|$. It follows that $L_{T_1}(x) \subseteq L_{T_2}(\psi(x))$, and $L_{T_1}(x) = L_{T_2}(\psi(x))$ whenever $|L_{T_1}(x)| = |L_{T_2}(\psi(x))|$. We conclude that a node $x \in T_1$ agrees with the node $\psi(x) \in T_2$ if and only if

$$|L_{T_1}(x)| = |L_{T_2}(\psi(x))|.$$

As in Step 1 of All-Pair and in Nearest (Theorem 2), the subtree sums for the nodes of $T_1$ and $T_2$ can be calculated in $O(n)$ time. It follows that the all-agreement problem can be solved in $O(n)$ time by combining the constant-time LCA queries with the subtree size information.
□

We can easily extend the result of Lemma 3 to obtain a linear time algorithm for finding all leaf-agree descendent subtrees of more than two phylogenetic trees.

Theorem 4 (leaf-agree-subtree/k trees). Given a set of $n$-leaf phylogenetic trees $\{T_1, \ldots, T_k\}$, finding all $k$-tuples $(v_1, \ldots, v_k)$ such that $L_{T_1}(v_1) = L_{T_2}(v_2) = \cdots = L_{T_k}(v_k)$ can be done in $O(kn)$ time.

Proof. By extending the function $\psi[T_1, T_2](\cdot)$ defined in the proof of Lemma 3, we define the function $\eta(x, i)$ that maps each vertex $x \in T_1$ to its corresponding vertex in $T_i$, recursively by $\eta(x, 1) = x$ and $\eta(x, i + 1) = \psi[T_i, T_{i+1}](\eta(x, i))$. By the discussion in the proof of Lemma 3, we can precompute the tables of all $\psi[T_i, T_{i+1}](\cdot)$ in $O(kn)$ total time. Furthermore, the tables of all $\eta(x, i)$ can then be obtained from the $\psi[T_i, T_{i+1}](\cdot)$ tables, also in $O(kn)$ time. By arguments similar to those used in the proof of Lemma 3, we conclude that

$$L_{T_1}(x) = L_{T_2}(\eta(x, 2)) = \cdots = L_{T_k}(\eta(x, k))$$
if and only if

$$|L_{T_1}(x)| = |L_{T_2}(\eta(x, 2))| = \cdots = |L_{T_k}(\eta(x, k))|.$$

Since all of these conditions can again be checked in $O(kn)$ time, it follows that the leaf-agree subtrees problem can be correctly solved in $O(kn)$ time. □

Here we show that finding all isomorphic descendent subtrees of phylogenetic trees can be done in linear time as well.

Theorem 5 (isomorphic-subtree). Given a set of $n$-leaf phylogenetic trees $\{T_1, \ldots, T_k\}$, finding all $k$-tuples $(v_1, \ldots, v_k)$ such that $T_1[v_1] \cong T_2[v_2] \cong \cdots \cong T_k[v_k]$ can be done in $O(kn)$ time.

Proof. We use the function $\eta(x, i)$, defined in the proof of Theorem 4, as the basis for testing isomorphism. Let $x \in T_1$, $y \in T_2$; it is clear that $T_1[x] \cong T_2[y]$ implies $L_{T_1}(x) = L_{T_2}(y)$. To report all possible $k$-tuples $(v_1, \ldots, v_k)$ such that $T_1[v_1] \cong T_2[v_2] \cong \cdots \cong T_k[v_k]$, it suffices to consider only the leaf-agree subtrees, that is, tuples of the form $(\eta(x, 1), \eta(x, 2), \ldots, \eta(x, k))$ such that $L_{T_1}(x) = L_{T_2}(\eta(x, 2)) = \cdots = L_{T_k}(\eta(x, k))$. Note that these tuples are obtainable in $O(kn)$ time by Theorem 4.

We say a vertex $x \in T_1$ is a good start if and only if $T_1[x] \cong T_2[\eta(x, 2)] \cong \cdots \cong T_k[\eta(x, k)]$. Let $x$ be the parent of $\{x_1, \ldots, x_m\}$. Note that if $x$ is a good start then each of $\{x_1, \ldots, x_m\}$ is a good start, although the converse is not true. To establish a necessary and sufficient condition, we associate each vertex $v \in T$ with a unique label $\lambda(v) = \min L_T(v)$. Furthermore, for each vertex $x$ that is the parent of a set of vertices $\{x_1, \ldots, x_m\}$ in a tree $T$, we use the ordered list $\langle x_1, \ldots, x_m \rangle$ to arrange these $m$ subtrees in increasing $\lambda(\cdot)$ order; i.e., $\lambda(x_1) < \cdots < \lambda(x_m)$. It is not hard to see that the calculation of the $\lambda(\cdot)$ values and the arrangement of the subtree orderings can be done for all phylogenetic trees in linear, $O(kn)$, time. In the following, we use $\lambda^+(x) = \langle \lambda(x_1), \ldots, \lambda(x_m) \rangle$ to denote the label sequence for the children of $x$. It is now clear that

$$T_1[x] \cong T_2[\eta(x, 2)] \cong \cdots \cong T_k[\eta(x, k)],$$

i.e., $x$ is a good start, if and only if each child in $\{x_1, \ldots, x_m\}$ is a good start and

$$\lambda^+(x) = \lambda^+(\eta(x, 2)) = \cdots = \lambda^+(\eta(x, k)).$$

Note that the total work for checking the labels $\lambda^+(\cdot)$ amounts to visiting the descendent edges of the trees. Since no edge is visited twice, the total work cannot exceed $O(kn)$, the upper bound on the total number of edges in these trees. It follows that the problem of finding all isomorphic descendent subtrees can be correctly solved in $O(kn)$ time. □
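For two trees, the tests of Lemma 3 and Theorem 5 can be sketched together in Python. The sketch below is illustrative and rests on several assumptions: trees are given as hypothetical `children` dictionaries with shared integer leaf labels, every internal node has at least two children, Lca queries in $T_2$ are answered by naively walking parent pointers rather than in constant time, and child labels are sorted on the fly instead of being pre-arranged in linear time. The names `annotate` and `compare` are not from the paper.

```python
from functools import reduce


def annotate(children, root):
    """Per node: parent, depth, number of leaves, and minimum leaf label."""
    order, stack = [], [root]
    while stack:                               # parents before descendants
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    parent, depth, size, low = {root: None}, {root: 0}, {}, {}
    for v in order:
        for c in children.get(v, []):
            parent[c], depth[c] = v, depth[v] + 1
    for v in reversed(order):                  # children before parents
        kids = children.get(v, [])
        size[v] = sum(size[c] for c in kids) if kids else 1
        low[v] = min(low[c] for c in kids) if kids else v
    return order, parent, depth, size, low


def compare(ch1, root1, ch2, root2):
    """Return (agree, iso): maps x -> psi(x) for leaf-agree / isomorphic pairs."""
    order1, _, _, size1, low1 = annotate(ch1, root1)
    _, parent2, depth2, size2, low2 = annotate(ch2, root2)

    def lca2(u, v):                            # naive Lca in the second tree
        while u != v:
            if depth2[u] < depth2[v]:
                v = parent2[v]
            else:
                u = parent2[u]
        return u

    psi, agree, iso = {}, {}, {}
    for x in reversed(order1):                 # leaves first, root last
        kids = ch1.get(x, [])
        psi[x] = reduce(lca2, (psi[c] for c in kids)) if kids else x
        y = psi[x]
        if size1[x] != size2[y]:               # leaf sets differ
            continue
        agree[x] = y
        labels_x = sorted(low1[c] for c in kids)
        labels_y = sorted(low2[c] for c in ch2.get(y, []))
        if all(c in iso for c in kids) and labels_x == labels_y:
            iso[x] = y                         # x is a "good start"
    return agree, iso


if __name__ == "__main__":
    t1 = {"r1": ["a", "b"], "a": [1, 2], "b": [3, 4, 5]}              # ((1,2),(3,4,5))
    t2 = {"r2": ["c", "d"], "c": [1, 2], "d": ["e", 5], "e": [3, 4]}  # ((1,2),((3,4),5))
    agree, iso = compare(t1, "r1", t2, "r2")
    print(agree["b"], "b" in iso)              # d False: same leaves, different shape
```

In the example, `b` and `d` are leaf-agree because both cover $\{3, 4, 5\}$, but they are not isomorphic since their child label sequences $\langle 3, 4, 5 \rangle$ and $\langle 3, 5 \rangle$ differ; consequently the roots are leaf-agree but not isomorphic either.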
5
Applications to Bacterial 2CS Sequence Analysis
Rapid adaptation to environmental challenge is essential for bacterial survival. To orchestrate their adaptive responses to changes in their surroundings, bacteria mainly use so-called two-component regulatory systems (2CS) [20]. These systems are usually composed of a sensor kinase, which is able to detect one or several environmental stimuli, and a response regulator, which is phosphorylated by the sensor kinase and which, in turn, activates the expression of genes necessary for the appropriate physiological response. Sensor kinases (or histidine kinases) usually possess two domains: an input domain, which monitors environmental stimuli, and a transmitter domain, which auto-phosphorylates following stimulus detection. A classical response regulator contains an amino-terminally located conserved receiver domain that is phosphorylated by the sensor kinase at a strictly conserved aspartate residue, leading to activation of the carboxy-terminal effector or output domain [22, 25].

Identifying the function of these 2CS would greatly facilitate not only our understanding of the basic physiology and regulatory networks of bacteria, but also the design of ways to prevent them from causing disease in humans. It is therefore interesting to know whether the gene encoding the regulatory protein and the gene encoding the sensor kinase in a 2CS were derived by duplication from an existing 2CS (co-evolution), or whether they evolved independently and were assembled later by a recombination event. Furthermore, one way of extracting useful clustering information from the regulator tree and the sensor tree, which might later lead to functional classifications of these 2CS, is to incorporate the evolutionary information from both trees.

To address these questions, we collect the regulatory protein encoding genes and sensor-encoding genes of the 63 2CS within P. aeruginosa [25], and construct the evolutionary distances as well as the sensor and regulator gene trees as our model of co-evolutionary clusterings. Based on these data, we have developed a web-based system, consisting mostly of PHP programs, for visualizing and integrating the sensor and regulator gene trees; the web system is freely accessible at http://alzoo.cs.pu.edu.tw/two component.htm. Currently the system exhibits several distinct relationships of the 2CS within P. aeruginosa, and part of our results can also be found in [21]. In this case, the integration of the two trees has provided biologists with bioinformatic evidence that may help reveal the 2CS co-evolutionary process.
6
Concluding Remarks
Due to the known algorithmic difficulties of the MAST problem for trees of unbounded degree or for more than two trees [12, 3, 19, 5], in this paper we restrict our attention to the problem of comparing the descendent subtrees within a set of unbounded degree phylogenetic trees. We present algorithmic results concerning the descendent subtrees comparison problems, with bioinformatic applications in functional analysis and classification of bacterial genomes. We show in Theorem 1 that computing all pairs of normalized cluster distances between the descendent subtrees of two trees can be done optimally in $O(n^2)$ time. By using the concept of confluent subtree, we show in Theorem 2 that finding nearest subtrees for a collection of pairwise disjoint subsets of leaves can be done in $O(n)$ time. Furthermore, we present linear time algorithms for finding all leaf-agree descendent subtrees as well as all isomorphic descendent subtrees in Theorem 4 and Theorem 5.

Finally, we thank Ming-Tat Ko of Academia Sinica for constructive discussions and helpful ideas. Most of the results presented in this paper were obtained during the first author's visit to the Institute of Information Science, Academia Sinica, in the summer of 2002.
References
1. A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405-421, 1981.
2. B. L. Allen and M. Steel. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics, 5:1-13, 2001.
3. A. Amir and D. Keselman. Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM Journal on Computing, 26(6):1656-1669, 1997.
4. V. Berry and O. Gascuel. Inferring evolutionary trees with strong combinatorial evidence. Theoretical Computer Science, 240(2):271-298, 2000.
5. P. Bonizzoni, G. Della Vedova, and G. Mauri. Approximating the maximum isomorphic agreement subtree is hard. In R. Giancarlo and D. Sankoff, editors, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1848, pages 119-128, Montréal, Canada, 2000.
6. G.S. Brodal, R. Fagerberg, and C.N. Pedersen. Computing the quartet distance between evolutionary trees in time $O(n \log^2 n)$. ISAAC, Lecture Notes in Computer Science, 2223:731-742, 2001.
7. D. Bryant. Building Trees, Hunting for Trees, and Comparing Trees. PhD thesis, University of Canterbury, Christchurch, New Zealand, 1997.
8. R. Cole, M. Farach, R. Hariharan, T. Przytycka, and M. Thorup. An $O(n \log n)$ algorithm for the maximum agreement subtree problem for binary trees. SIAM Journal on Computing, 30(5):1385-1404, 2002.
9. DasGupta, He, Jiang, Li, Tromp, and Zhang. On distances between phylogenetic trees. In Proceedings of the 8th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 427-436, 1997.
10. W.H.E. Day. Optimal algorithms for comparing trees with labelled leaves. Journal of Classification, 2:7-28, 1985.
11. G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193-200, 1985.
12. M. Farach, T.M. Przytycka, and M. Thorup. On the agreement of many trees. Information Processing Letters, 55(6):297-301, 1995.
13. M. Farach and M. Thorup. Sparse dynamic programming for evolutionary-tree comparison. SIAM Journal on Computing, 26(1):210-230, 1997.
14. J. Felsenstein. Numerical methods for inferring evolutionary trees. Quarterly Review of Biology, 57(4):379-404, 1982.
15. W. M. Fitch. Toward defining the course of evolution: Minimal change for a specific tree topology. Systematic Zoology, 20:406-441, 1971.
16. D. Gilbert, D. Westhead, N. Nagano, and J. Thornton. Motif-based searching in TOPS protein topology databases, 1999.
17. D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19-28, 1991.
18. D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-355, 1984.
19. J. Hein, T. Jiang, L. Wang, and K. Zhang. On the complexity of comparing evolutionary trees. Discrete Applied Mathematics, 71:153-169, 1996.
20. J.A. Hoch and T.J. Silhavy. Two-Component Signal Transduction. ASM Press, 1995.
21. Y.L. Lin. Two component systems sequence characteristics identification in bacterial genome. In Sixth Proceedings World Multiconference on Systemics, Cybernetics and Informatics (SCI'2002), pages 445-449, Orlando, Florida, 2002.
22. J.S. Parkinson and E.C. Kofoid. Communication modules in bacterial signalling proteins. Annu. Rev. Genet., 26:71-112, 1992.
23. D. F. Robinson and L. R. Foulds. Comparison of weighted labelled trees. In Combinatorial mathematics, VI (Proc. Sixth Austral. Conf., Univ. New England, Armidale), Lecture Notes in Mathematics 748, pages 119-126. Springer-Verlag, Berlin, 1979.
24. D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Math. Biosci., 53(1-2):131-147, 1981.
25. A. Rodrigue, Y. Quentin, A. Lazdunski, V. Méjean, and M. Foglino. Two-component systems in Pseudomonas aeruginosa: why so many? Trends Microbiol., 8:498-504, 2000.
26. N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406-425, 1987.
27. K. Strimmer and A. von Haeseler. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Molecular Biology and Evolution, 13(7):964-969, 1996.
28. P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6:80-82, 1977.
29. M. S. Waterman and T. F. Smith. On the similarity of dendrograms. Journal of Theoretical Biology, 73:789-800, 1978.