Theoretical Computer Science 359 (2006) 378 – 399 www.elsevier.com/locate/tcs
DLS-trees: A model of evolutionary scenarios夡 Paweł Górecki∗ , Jerzy Tiuryn Warsaw University, Institute of Informatics, Banacha 2, 02-097 Warsaw, Poland Received 10 June 2005; accepted 5 May 2006 Communicated by M. Crochemore
Abstract We present a model of evolution of gene trees in the context of species evolution. Its concept is similar to reconciliation models. We assume that the gene evolution is modelled by duplications and losses. Evolution of species is modelled by speciation events. We define an evolutionary scenario (called a DLS-tree) which can represent an evolution of genes in species. We are interested in all scenarios for a given species tree and a given gene tree—not only parsimonious ones. We propose a rewrite system for transforming the scenarios. We prove that the system is confluent, sound and strongly normalizing. We show that a scenario in normal form (i.e., non-reducible) is unique and minimal in the sense of the cost computed as the total number of gene duplications and losses (mutation cost). We present a classification of the scenarios and analyze their hierarchy. Finally, we prove that the reconciled tree can be easily transformed into DLS-tree in normal form. This solves some open problems for reconciled trees. © 2006 Elsevier B.V. All rights reserved. Keywords: Rewrite system; Molecular evolution; Phylogenetic tree; Duplication-loss model; Reconciled tree
1. Introduction Reconstruction of species relationships from a set of gene family trees is a difficult task. Hardness of this problem is caused by dissimilarities which usually occur between gene trees. Those inconsistencies are due to gene losses, gene duplications, gene convergence, horizontal gene transfers or errors in sequencing. They lead to two important problems: reconstruction of the species tree from a family of possibly different gene trees and reconciling a given gene tree with a given species tree. These problems have been studied by Goodman [6] and then in the nineties [4,9,12–15]. The concepts of mapping and reconciling trees were introduced. They inspired research on duplication-loss models (we call them DL-models) and their extensions, for instance, models with a horizontal gene transfer [3,7,10]. All DL-models are believed to be biologically meaningful [13]. Most approaches to reconstruction of evolution history are parsimonious, that is, it is assumed that the solution with the minimal cost is the most likely one. There are several possible cost functions: size of the reconstructed tree, or the number of specified evolutionary events, e.g. gene duplications, or the total number of gene duplications and gene losses. The latter measure, called mutation cost, was particularly popular among researchers [11,15]. One of the crucial 夡A
preliminary version of this paper was presented in [8].
∗ Corresponding author. Tel.: +48 225544401; fax: +48 225544200.
E-mail addresses:
[email protected] (P. Górecki),
[email protected] (J. Tiuryn). 0304-3975/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.tcs.2006.05.019
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
379
terms in the DL-models is that of reconciled tree which represents the common evolutionary history of genes and species. In [2], authors present several definitions of a reconciled tree which were used recently in the literature. They prove that the definitions are equivalent and that the tree is minimal with respect to the size. Paper [2] still left open questions: (see [16]) is the reconciled tree minimal with respect to the mutation cost or is it minimal with respect to the total number of gene duplications (duplication cost). Also the question of uniqueness of such a tree was left open. In the present paper, we answer all these questions. We build a formal framework of evolutionary scenarios which represents a common history of genes and species under the assumption that only gene duplications, losses and speciations may occur. These scenarios are called here DLS-trees. 1 We claim that a DLS-tree can be used to represent all possible evolutionary scenarios under the above assumptions. Given a DLS-tree T , we show how to retrieve from T a gene tree gene(T ), as well a species tree spec(T ). A DLS-tree is similar to the concept of a reconciliation (see [1]) for a given species tree and a gene tree. Although, in the definition of a DLS-tree we do not use any particular gene or species tree. We introduce a system of rules for transforming DLS-trees. This is a certain kind of a term rewrite system. It has pleasing mathematical properties: soundness, 2 confluence and strong normalization. We prove that, a DLS-tree in normal form has minimal size, minimal mutation cost, and minimal duplication cost. It follows from our theory that for every DLS-tree T in normal form, if DT is the set of all DLS-trees which have the normal form T , then T is the unique tree in DT among all trees in DT having the same mutation cost. We show an example that the uniqueness property fails when mutation cost is replaced by duplication cost. We show a one-to-one correspondence between the reconciled trees and the DLS-trees in normal form. Thus, the theory build in this paper is immediately applicable to reconciled trees. We obtain a formula for computing the total number of duplications and losses in a reconciled tree, as a function of a gene tree and a species tree. A formal analysis of these formulas in the context of reconciled trees can be found in [5,17]. First we define basic terms and DLS-trees. Then we show how to extract a gene and a species tree from a DLS-tree. In Sections 5 and 6, we present the system of rules and prove soundness, completeness and confluence. In Section 6.7, we present an example of a hierarchy of DLS-trees together with all their reductions (Fig. 13). In Section 7, we present formulas for computing the tree in normal form (for a given species tree and a given gene tree) and the number of duplications and losses. Finally, we show a one-to-one correspondence between the reconciled trees and the DLS-trees in normal form. 2. Gene and species trees Let I be a set of species. A gene tree is a rooted binary directed tree whose leaves are labelled by the elements from I. The labelling need not be one to one. A species tree is a gene tree 3 whose leaves are uniquely labelled. Let T be a gene tree. For a node v of T , by T (v) we denote the subtree of T rooted in v. For each node v of a gene tree T , we define a multiset mTv = {x1i1 , x2i2 , . . . , xkik }, where ij > 0 is the number of leaves labelled xj in T (v). Similarly, for v, we define a cluster as a set mTv = {x1 , x2 , . . . , xm }. Let MT denote the multiset {mTv |v ∈ V }. In order to make the notation more readable, mTv will be denoted by x1 x2 . . . xm . Note that if T is a species tree, then mTv = mTv . We denote by root (T ) the root of T , and by L(T ) the set of all labels (i.e., species) in T . For example, see Fig. 1. We use the standard nested-parenthesis notation for trees: • The empty tree will be denoted by ∅. • An one-element tree, whose node is labelled by a is denoted by a. • If Tp and Tq are two non-empty trees with roots p and q, respectively, then (Tp , Tq ) is a tree whose root has two children: p and q. The trees Tp and Tq are rooted in (Tp , Tq ) at the nodes p and q, respectively. Lemma 1. If T and S are species trees and MT = MS , then T = S. Proof. It follows easily from the definition of the species tree.
1 DLS stands for Duplication, Loss and Speciation.
2 If T reduces to T then gene(T ) = gene(T ) and spec(T ) = spec(T ). 3 A species tree is a special case of a gene tree in the model.
380
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
Fig. 1. Gene tree G = ((a, a), b).
Fig. 2. Counterexample for gene trees.
Lemma 1 fails for gene trees. Fig. 2 presents the counterexample. Let T1 = (a, (a, (a, a))), T2 = ((a, a), (a, a)), T = ((T1 , T2 ), T2 ) and S = ((T2 , T2 ), T1 ). We can easily check that MT = MS = {{a 1 }12 , {a 2 }5 , {a 3 }1 , {a 4 }3 , {a 8 }1 , {a 12 }1 } but the gene tree are different. A multiset M is said to determine a species tree if MS = M, for some species tree S. By T (M) we denote the tree determined by M. Theorem 2. A multiset M determines a species tree if and only if (M1) if M is non-empty, then M ∈ M, (M2) for all a ∈ M, {a} ∈ M, (M3) for all A ∈ M such that A is not a singleton: {X | X ∈ M and XA} contains exactly two maximal (in the sense of inclusion) disjoint sets. Proof. (=>) Let us assume that T is a species tree determined by M. (M1) is satisfied by the root of T . (M2) obviously refers to the leaves of T . The last condition is satisfied by the internal nodes of T , that is, if a and b are children of an internal node, then a and b determine the two maximal disjoint sets. ( #A (T ), #X (T ) = #X (T )
for X = A.
Apply CLOST−1 . After this step all speciation nodes have exactly one lost child. Note that the final tree is semi-normal. (A3) Let us assume that v is a speciation node in T labelled by A and its child is a duplication node labelled by B. Note that the second child is lost (after (A2)). Thus, we can apply TMOVE−1 . Let T = TMOVE−1 (T , v). We have #A (T ) > #A (T ), #B (T ) < #B (T ), #X (T ) = #X (T )
for X = A, B.
390
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
Applying TMOVE−1 preserves the property that speciation nodes have exactly one lost child and does not introduce redexes of type I. After finite number of steps, we get #X = 0 for all X = (T ). It means that in the final tree all duplication nodes are labelled by (T ). We have shown that T can be transformed into a fat tree. From the proof, we conclude that the procedure presented below will produce a fat tree from any DLS-tree: • eliminate iteratively all redexes DUP and SPEC, • eliminate all redexes of TMOVE−1 , • eliminate all redexes of CLOST−1 . From this construction, we have the following property: Corollary 13. Every complete DLS-tree is equivalent to a complete fat tree. Observe that we can increase the cost of a fat tree by applying SPEC in direction →−1 ; in this way, we increase each B by at most |B| − 1, or by applying DUP in direction →−1 ; this can be done an unbounded number of times, increasing the number of the duplication nodes and introducing spurious loss nodes. Note that applying transformations (A2) and (A3) (see the proof of Proposition 12) we get a tree with larger size. Thus, we conclude that a fat tree is the heaviest (in the sense of size) among all equivalent semi-normal trees. Recall that a complete DLS-tree is a tree without lost species. Proposition 14. For complete DLS-trees T1 and T2 , if T1 ∼ T2 , then there exists a unique complete fat tree equivalent to T1 and T2 . Proof. By Corollary 13, there exist complete fat trees F1 and F2 equivalent to T1 and T2 , respectively. Let spec(T1 ) = S. Notice that LabelsT1 = LabelsF1 = MS = LabelsT2 = LabelsF2 . By Propositions 7 and 10, we obtain uniqueness. 6.3. Completeness We can also prove completeness of the system. Proposition 15 (Completeness). Let T1 and T2 be complete DLS-trees such that gene(T1 ) = gene(T2 ) and spec(T1 ) = spec(T2 ). Then T1 ∼ T2 . Proof. By Corollary 13, there exist complete fat trees F1 and F2 equivalent to T1 and T2 , respectively. By Proposition 7, for i = 1, 2, we have gene(Ti ) = gene(Fi ) and spec(Ti ) = spec(Fi ). Moreover, LabelsT2 = LabelsF2 = MS = LabelsT2 = LabelsF2 . By Proposition 10, we get F1 = F2 . Hence T1 ∼ T2 . 6.4. Confluency and normalization =
We use T →T if T can be obtained from T by at most one reduction. The following proposition states that the system is weakly confluent. Proposition 16 (Weak confluence). Let T be a DLS-tree. Then, for each T1 and T2 such that T → T1 and T → T2 , = = there exists T3 such that T1 →T3 and T2 →T3 . Proof. For i = 1, 2, let Ti = Ri (T , vi ). Without loss of generality we may assume that both reductions are different. • R1 = DUP: We consider a pattern P = (R, (R) ) rooted in v1 . Then, v2 is ◦ either outside of the subtree rooted by v1 in T ; in this case the pattern P is present in R2 (T , v2 ), ◦ or in the subtree R. We see that the applications are independent, i.e., R2 (R1 (T , v1 ), v2 ) = R1 (R2 (T , v2 ), v1 ).
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
391
Fig. 11. TMOVE and CLOST case.
Fig. 12. SPEC and TMOVE, SPEC and CLOST cases.
• R1 = TMOVE and R2 = CLOST case: The dependence may occur only if the rules can be applied to the same node. In such a case, at least one of the principal trees for R2 in v2 is a one-element tree with a loss node. Without loss of generality we may assume that B is the principal tree with a loss node. The reductions are presented in Fig. 11. • R1 = SPEC and R2 = TMOVE: The dependence may occur only if at least one of the principal trees for R2 in v2 is a one-element tree with a loss node. Without loss of generality we may assume that B is the principal tree with a loss node. The reductions are presented in the left part of Fig. 12. • R1 = SPEC and R2 = CLOST case: This case is similar to the previous one (see the right diagram in Fig. 12). • R1 = R2 : If the applied rules are equal, then their redexes are different. Simple analysis leads to the conclusion that they have to be independent. The following theorem states that our system is confluent. Recall that a DLS-tree in normal form is non-reducible. Theorem 17 (Confluence). Take a DLS-tree T . There exists a unique DLS-tree T ∗ (in normal form) such that every sequence of reductions in direction →, which starts in T and terminates in normal form, yields T ∗ . Proof. The termination follows from the fact that every application of rules reduces the cost. Let us assume that Tk0 and T0n are in normal form such that T00 → T10 → T20 → · · · → Tk0 and T00 → T01 → 2 T0 → · · · → T0n , where T00 = T . It follows from Proposition 16, that the diagram of reductions presented in (7) is well defined. The proof is straightforward (by induction).
392
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
=
=
=
=
=
T00 → T10 → T20 → · · · → Tk0
↓
↓
↓
↓ =
=
=
↓ ↓ ↓ ↓ = = = 1 1 1 T0 → T1 → T2 → · · · → Tk1 =
.. .
.. .
.. .
↓
↓ =
..
.. .
.
↓ =
(7)
↓
=
↓
=
↓
=
↓
=
=
=
T02 → T12 → T22 → · · · → Tk2
↓ =
=
T0n → T1n → T2n → · · · → Tkn By induction, we show that, for i = 1, 2, . . . , n, Tki is in normal form and equal Tk0 . Analogously, for i = 1, 2, . . . , k, Tjn is in normal form and equal T0n . Finally, Tk0 = T0n . Theorem 18. For DLS-trees T1 and T2 , we have T1 ∼ T2 if and only if T1∗ = T2∗ . −1,1
−1,1
Proof. (⇒). Let −−→ denote the following relation on DLS: T −−→ T if and only if T → T or T → T . Let (T1 , T2 ) denote the minimal number of reductions → and →−1 required to transform T1 into T2 . We show by induction that, for each d ∈ {0, 1, . . . }, if T1 ∼ T2 and (T1 , T2 ) = d then T1∗ = T2∗ . Let d = 0. We have T1 = T2 . This case is clear, by Theorem 17. Let d > 0. Assume that, for all T1 and T2 such that (T1 , T2 ) < d, T1∗ = T2∗ . −1,1
−1,1
−1,1
−1,1
−1,1
Consider T1 and T2 such that (T1 , T2 ) = d. Thus, we have T1 −−→ S 1 −−→ S 2 −−→ · · · −−→ S d −−→ T2 . By the induction hypothesis, we have −1,1 −1,1 −1,1 −1,1 T1 −−1,1 −→ S1 −−→ S2 −−→ . . . −−→ Sd −−→ T2
S1∗
S2∗
Sd∗
T2∗
=
=
...
=
=
(8)
where, for each i = 1, 2, . . . , d, Si∗ is the normal form of Si . We have two cases: • If T1 → S1 , then T1 S1∗ . By the uniqueness of the normal form of T1 we obtain T1∗ = S1∗ = T2∗ . • If T1 →−1 S1 , then S1 T1∗ . By the uniqueness of the normal form of S1 , we obtain S1∗ = T1∗ = T2∗ . (⇐). We have T1 T1∗ = T2∗ T2 . Thus, T1 ∼ T2 . Corollary 19. For a DLS-tree T , T ∗ is the unique tree with minimal cost in the set of all trees which are equivalent to T . 6.5. Computing a fat tree Algorithm 1. DLS-tree to a fat tree. 1. Input: DLS-tree T 2. Output: The unique fat tree equivalent to T 3. eliminate all redexes of DUP, SPEC and HGT 4. eliminate all redexes of TMOVE−1 and CLOST−1 Algorithm 1 presents an efficient procedure for transforming a given DLS-tree into the (unique) equivalent fat tree.
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
393
Fig. 13. Example hierarchy of semi-normal trees with all possible reductions.
Theorem 20. The time and space complexity of Algorithm 1 is O(|T |2 ). Proof. Assume that T has n nodes. Then T has k = (n + 1)/2 leaves and at most k gene nodes. Also, the longest path in T has at most k nodes. Hence, the path in the longest chain tree in the final fat tree has at most k nodes. Having, this we conclude that we have at most 2 ∗ k ∗ k + n nodes in the fat tree. Each reversed reduction of type II increases the size of the tree, therefore, the time and space complexity of Algorithm 1 complexity is O(|T |2 ). 6.6. Computing a tree in normal form Similarly to Algorithm 1, we can define an algorithm for computing the normal form of a given DLS-tree (by removing −1 in line 4). It is clear that the time complexity of this transformation is O(|T |) and the space complexity is O(1) (in each step we decrease the size of T ).
394
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
Fig. 14. D (semi-normal) and D ∗ (in normal form) have the same minimal duplication cost (v is a redex).
6.7. Hierarchy of semi-normal trees Semi-normal trees are important representants of each class of equivalent DLS-trees. We consider a hierarchy of equivalent semi-normal trees and summarize its properties. By the proof of Proposition 12 and further discussion, we can transform each semi-normal tree into the unique fat tree in two steps. In the first step, we apply all possible CLOST rules in the reverse direction. Then, we apply TMOVE rules in the reverse direction. Analogously, we can transform each semi-normal tree into the unique tree in a normal form. First, we apply TMOVE rules, then CLOST in the direction →. An example of a hierarchy of semi-normal trees, with all possible reductions, is presented in Fig. 13. In our example, we have all possible 15 semi-normal DLS-trees. T f is the fat tree. T ∗ is the tree in normal form. The labels of the internal nodes are not shown. They can be easily reconstructed from the labels of the leaves. Dotted and solid arrows denote CLOST and TMOVE reductions, respectively. The nodes marked by a number i (and ∗) are the redexes of the rules which produce Ti (and T ∗ , respectively). For instance, a redex of T5 → T9 is a node in T5 marked by 9. It should be noted that, if trees are in the same row in Fig. 13, then they have the same number of speciations. It follows easily from the fact that the reductions of type II reduce the number of speciations by one. 6.8. Duplication cost and uniqueness We can also prove that there exists more than one DLS-tree with the same minimal duplication cost. It is clear, if an application of SPEC is possible. For example, there exists a DLS-tree T (not semi-normal) which has the same duplication cost as T ∗ from Fig. 13 and such that T ∗ is obtained from T by one application of SPEC (to obtain T replace de in T ∗ by (d , e ) ). Even if we consider only semi-normal trees, the uniqueness of duplication cost is not satisfied. See Fig. 14 for details. 7. From gene and species trees to DLS-trees In this section, for a given species tree S and a gene tree G, we present the construction of a DLS-tree (G, S) in normal form subject to the condition ∅ = L(G) ⊆ L(S) (this condition is required for the trees in this section). We show how to compute the number of evolutionary events in such a tree. Also, we show a natural transformation from reconciled trees into normal form DLS-trees. 7.1. Normal from trees Let denote a path existence relation in S, i.e., a b if and only if there exists a path from a to b in S. Let denote a child relation, i.e., a b if and only if b is a child of a. Reversed arrows are used to denote the reversed relations. For a species tree S and a gene tree G such that ∅ = L(G) ⊆ L(S), for each g ∈ G, by M(g) we denote the node s ∈ S such that S S mS {mw | mG s = g ⊆ mw }. The obtained function M : G → S is called in the literature [6,15] a least common ancestor mapping or just lca-mapping (see Fig. 19).
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
395
The definition of (G, S) is by structural induction on the size of G and S. Let s = root (S) and g = root (G). If S and G are leaves, then = a, where a is the label of g. Otherwise, let p and q be the children of g, then ⎧ ⎨ ((G(p), S), (G(q), S)) (G, S) = ((G(p), S(a), (G(q), S(b))) ⎩ ((G, S(a)), (mS b ) )
if M(g) = s = M(q) (R1), if M(p)a s = M(g) bM(q) (R2), if M(g)a s b = a (R3).
(9)
Lemma 21. For a species tree S and a gene tree G such that ∅ = L(G) ⊆ L(S), (G, S) is a DLS-tree. Proof. The proof follows easily from the definition of the DLS-tree.
Lemma 22. Under the assumptions of Lemma 21 (I) gene((G, S)) = G, (II) labels of (G, S) are clusters in S, (III) if L(G) = L(S), then spec((G, S)) = S. Proof. (I) It follows by induction on the size of G and S. If G = S = g, then gene((g, g)) = g. Let d > 1. Assume that, for all G and S such that |G| + |S| < d and ∅ = L(G) ⊆ L(S), gene((G, S)) = G. We proceed with G and S such that |G| + |S| = d. From the definition of gene and (9), we have (R1) gene(((G(p), S), (G(q), S)) ) = (gene((G(p), S)), gene((G(q), S))) = (G(p), G(q)) = G, (R2) gene(((G(p), S(a)), (G(q), S(b))) ) = (gene((G(p), S(a))), gene((G(q), S(b)))) = (G(p), G(q)) = G, (R3) gene(((G, S(a)), (mS b ) ) ) = gene((G, S(a))) = G. (II) It follows immediately from the fact that ((G, S)) = L(S). (III) Note that lost(G ,S ) = L(S) \ L(G). In this case (G, S) is a complete DLS-tree. By (II) and Lemma 3, we conclude that spec((G, S)) = S. This completes the proof. One of the most important properties of is stated below: Lemma 23. Let G be a gene tree and S be a species tree such that ∅ = L(G) ⊆ L(S). Then (G, S) is in normal form. Proof. We show that there are no redexes of the DLS rules in R = (G, S). (SPEC) Observe that (G , S ) can never be a lost leaf. Thus, there is no DLS pattern (A , B )∗ in (G, S). We conclude that SPEC cannot be applicable. (DUP) Similarly, we obtain that (A , T ) is not present in (G, S). (TMOVE) Let us assume that ((S1 , C ) , (S2 , C ) ) (i.e., the premise of this rule) is present in (G, S). Without loss of generality, we may assume that the root of (G, S) is the redex of the rule. The pattern can be obtained only from the first case, that is, from (R1) of (9). So we have M(root (G)) = root (S) = M(q),
(10)
where q is a child of the root in G. By (R1), for i = 1 or 2, we see that (G(q), S) = (Si , C ) . The pattern could be obtained only from the case (R3). However, this requires M(root (G(q))) = M(q) = root (S) which contradicts (10). (CLOST) The proof is similar to (TMOVE) case. Now, we conclude that if T is a complete DLS-tree, then T ∗ = (gene(T ), spec(T )), where T ∗ is the normal form of T .
396
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
Fig. 15. Example of G and S with loss0 = 2.
7.2. Counting evolutionary events Having formula (9), we can compute the number of evolutionary events in a tree in normal form. Lemma 24. Let G be a gene tree and S be a species tree such that ∅ = L(G) ⊆ L(S). Then the number of duplications in (G, S) equals dup(G, S) = |{g | M(g) = M(p) where p is a child of g in G}|. Proof. Follows immediately from (R1) in (9).
Lemma 25. Let G be a gene tree and S be a species tree such that ∅ = L(G) ⊆ L(S). For each node g, we define a non-negative integer lossg as follows: we set lossg = 0 if g is leaf in G, if g is an internal node in G then let p and q denote the two children of g. We define d(M(g), M(p)) + 1 if M(p) = M(g) = M(q), lossg = d(M(g), M(p)) + d(M(g), M(q)) otherwise, S S where d(s, s ) = |{t | mS s ⊂ mt ⊂ ms }|. S S Let loss0 = |{s | mM(root (G )) ⊂ ms }|. Then the number of gene losses in (G, S) is given by loss0 + lossg . g∈G
(11)
Proof. The first element of the final sum (11) is needed, if M(root (G)) = root (S). It should be clear from (R3) case of (9) that we obtain loss0 gene losses in the first applications of until S = S(M(root (G))) is reached as the second argument of : loss0 , (G, S )) . . . ) . (G, S) = (A1 , . . . (A Fig. 15 illustrates this property. Note that if M(root (G)) = root (S), then loss0 = 0. Let g ∈ G. It follows from (9) that S(M(g)) is the smallest subtree T of S such that (G, S) = (. . . (G(g), T ) . . . ). For a node g in G, let g be the root of Tg = (G(g), S(M(g))) in (G, S). Let us assume that g is internal node in G. Let p and q be the two children of g. We consider paths gp1 . . . pk p and gq1 . . . qm q in (G, S). We show that lossg = k + m. Without loss of generality we assume that g is the root of G and M(g) = root (S). Case A: M(p) = M(g) = M(q). (R1) is applied. So (G, S) = ((G(p), S), (G(q), S)) , and hence Tp = (G(p), S). Similarly, Tq = (G(q), S). We obtain: k = m = 0. See Fig. 16 for details. Case B: M(p) = M(g) = M(q). Again, (R1) is applied. We have Tq = (G(q), S), and hence m = 0. For the second child, we apply a sequence of k (R3) cases. In this way, we reach Tp in (G, S). Fig. 16 presents the details. Note that mS M(g) = g = p1 .
(12)
It should be clear that p2 , p3 , . . ., pk is a sequence of all intermediate labels which occur between M(g) and M(p) in S. So d(M(g), M(p)) = k − 1, and hence the number of gene losses equals d(M(g), M(p)) + 1.
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
397
Fig. 16. Cases A and B.
Fig. 17. Case C.
Case C: M(p) = M(g) = M(q). First, (R2) is applied. So the labels of the children of g in Tg (i.e., p1 and q1 ) do not equal mS M(g) . Thus, this situation is different, that is, (12) does not hold. We apply a sequence of k (R3) cases
to obtain Tp . In this case, p1 , p2 , . . ., mS pk is a sequence of all intermediate labels which occur between M(g) and M(p) in S. So d(M(g), M(p)) = k and, similarly, d(M(g), M(q)) = m. Fig. 17 presents the details. 7.3. Embeddings and normal form trees
It should be clear that we can embed a DLS-tree D into a species tree S if the labels of T are clusters in S. For a gene tree G and a species tree S, consider the set P(G, S) of all DLS-trees in normal form such that if T ∈ P(G, S), then gene(T ) = G and Labels(T ) ⊆ MS . The following lemma states one of the crucial properties of P(G, S). Lemma 26. Let G be a gene tree and S be a species tree such that ∅ = L(G) ⊆ L(S). Then P(G, S) = {T (s) | T = (G, S), s ∈ V T , L(G) ⊆ Ts ⊆ L(S)}.
(13)
Proof. (⊇) Clear. (⊆) Assume that P(G, S) contains T1∗ and T2∗ such that (T1∗ ) = (T2∗ ) = L(S). There exist fat trees F1 and F2 equivalent to T1∗ and T2∗ , respectively. By Proposition 10, F1 = F2 . Thus, T1∗ and T2∗ are equivalent and by uniqueness of the normal form T1∗ = T2∗ . We conclude that there is exactly one DLS-tree T ∗ in normal form in P(G, S) satisfying L(S) = (T ∗ ). It is defined by (G, S). Assume that P(G, S) contains T such that (T ) ⊂ L(S). It is obvious that L(G) ⊆ (T ). Let T = (. . . ((T , A1 ) , A2 ) . . . , Ak ) such that (T ) = L(S) (note that T is reconstructed uniquely). This yields T = (G, S). In general, it should be clear that P(G, S) contains loss0 + 1 elements. See Fig. 18 for illustration. 7.4. Reconciled trees Now, we present a definition of the reconciled tree taken from [2]. We know that this definition is equivalent to the definition given by [15]. Let s = root (S) and g = root (G). The reconciled tree R(G, S) of G with respect to S is the
398
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
Fig. 18. Elements of P(G, S) and corresponding embeddings.
Fig. 19. Example of a mapping M, a reconciled tree R(G, S) and a DLS-tree (G, S).
Fig. 20. The evolution of species and genes (cont. example from Fig. 19).
tree G, if G and S are leaves. Otherwise, let p and q be the children of g, then ⎧ if M(g) = s = M(q) (RT 1), ⎨ (R(G(p), S), R(G(q), S)) R(G, S) = (R(G(p), S(a)), R(G(q), S(b))) if M(p)a s = M(g) bM(q) (RT 2), ⎩ (R(G, S(a)), S(b)) if M(g)a s b = a (RT 3).
(14)
Theorem 27. Let G be a gene tree and S be a species tree such that ∅ = L(G) ⊆ L(S). Let be a transformation which takes a DLS-tree and returns a tree with leaves labelled by species: ⎧ if T = a and a ∈ I, ⎨a (T ) = S(v) (15) if T = A and v in S such that mS v = A, ⎩ (T1 , T2 )∗ if T = (T1 , T2 )∗ , where ∗ ∈ { , }. Then ((G, S)) = R(G, S). Proof. It follows immediately from the definition of the reconciled tree and the definition of .
Thus, is a natural transformation between reconciled trees and DLS-trees in normal form. We claim that the formula for computing the mutation cost is the same for the reconciled tree [11] and the tree in normal form.
P. Górecki, J. Tiuryn / Theoretical Computer Science 359 (2006) 378 – 399
399
Fig. 19 presents an example of a lca-mapping M, for all internal nodes of G. Also, it presents a reconciled tree R(G, S) and a DLS-tree (G, S). Note that (G, S) equals the tree T ∗ shown in Fig. 13. It is easy to notice that ((G, S)) = R(G, S). For a more readable presentation, all the lost gene lineages are shown with dotted lines. The solid lines in R(G, S) and (G, S) represent embedded gene trees. Fig. 20 is a continuation of the example presented in Fig. 19. Tree E is the evolutionary interpretation of our DLStree. T presents an extraction of the gene lineages from E. Note that the tree T is equal topologically to the DLS-tree (G, S). Also the embedding is shown. 8. Final remarks All algorithms presented here, the system of rules, mappings, reductions, reconciled trees and others for DL-models are included in this software. It will be soon available. Acknowledgments We thank Bartosz Wilczynski and Szymon Nowakowski for comments. This research was supported by KBN Grant 4 T11F 020 25. References [1] L. Arvestad, A.-C. Berglund, J. Lagergren, B. Sennblad, Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution, in: RECOMB 2004, 2004. [2] P. Bonizzoni, G. Vedova, R. Dondi, Reconciling gene trees to a species tree, Algorithms and Complexity, Proc. Fifth Italian Conf. (CIAC 2003) Vol. 2653, 2003, pp. 120–131. [3] M.A. Charleston, Jungles: a new solution to the host/parasite phylogeny reconciliation problem, Math. Biosci. 149 (1998) 191–223. [4] O. Eulenstein, B. Mirkin, M. Vingron, Duplication-based measures of difference between gene and species trees, J. Comput. Biol. 5 (1) (1998) 135–148. [5] O. Eulenstein, M. Vingron, On the equivalence of two tree mapping measures, DAMATH: Discrete Appl. Math. Combin. Oper. Res. Comput. Sci. 88 (1995). [6] M. Goodman, J. Czelusniak, G.W. Moore, A.E. Romero-Harrera, G. Matsuda, Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences, Syst. Zool. 28 (1979) 132–163. [7] P. Górecki, Reconciliation problems for duplication, loss and horizontal gene transfer, in: RECOMB 2004, San Diego, 2004, pp. 316–325. [8] P. Górecki, J. Tiuryn, On the structure of reconciliations, Lecture Notes in Computer Science, Vol. 3388, Springer, Berlin, 2005, pp. 42–54. [9] R. Guigo, I. Muchnik, T. Smith, Reconstruction of ancient molecular phylogeny, Molecular Phys. Evol. 6 (1996) 189–213. [10] M. Hallett, J. Lagergren, New algorithms for the duplication-loss model, in: Proc. RECOMB 2000, ACM Press, Tokyo, 2000, pp. 138–146. [11] B. Ma, M. Li, L. Zhang, On reconstructing species trees from gene trees in term of duplications and losses, in: RECOMB 1998, 1998, pp. 182–191. [12] B. Ma, M. Li, L. Zhang, From gene trees to species trees, SIAM J. Comput. 30 (2000) 792–852. [13] B. Mirkin, I. Muchnik, T.F. Smith, A biologically consistent model for comparing molecular phylogenies, J. Comput. Biol. 2 (4) (1995) 493–507. [14] R.D.M. Page, Component analysis: a valiant failure? Cladistics 6 (1990) 119–136. [15] R.D.M. Page, Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas, Syst. Biol. 43 (1994) 58–77. [16] R.D.M. Page, M.A. Charleston, Reconciled trees and incongruent gene and species trees, Mathematical Hierarchies and Biology, DIMACS Series in Mathematics and Theoretical Computers Science 37 (1998) 57–70. [17] L. Zhang, On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies, J. Comput. Biol. 4 (2) (1997) 177–188.