On the Multiple Gene Duplication Problem 1 Introduction ... - CiteSeerX

On the Multiple Gene Duplication Problem Michael Fellows Department of Computer Science University of Victoria Victoria, B.C. Canada V8W 3P6 [email protected]

Michael Hallett, Ulrike Stege Computational Biochemistry Research Group ETH Zurich CH-8092 Zurich, Switzerland fhallett,[email protected]

Abstract

A fundamental problem in computational biology is the determination of the correct species tree for a set of taxa given a set of (possibly contradictory) gene trees. In recent literature,

the Duplication and Loss model has received considerable attention. Here one measures the similarity/dissimilarity between a set of gene trees by counting the number of paralogous gene duplications and subsequent gene losses which need to be postulated in order to explain (in an evolutionarily meaningful way) how the gene trees could have arisen with respect to the species tree. Here we count the number of multiple gene duplication events (duplication events in the genome of the organism involving one or more genes) without regard to gene losses. The Multiple Gene Duplication problem asks to nd the species tree which requires the fewest number of multiple gene duplication events to be postulated in order to explain a set of gene trees 1 2 is k. We also examine the related problem which assumes the species tree known and asks to nd the explanation for 1 2 requiring the fewest multiple gene k duplications. Via a reduction to and from a combinatorial model we call the Ball and Trap Game, we show that the general form of this problem is NP-hard and various parameterized versions are hard for the complexity class [1]. These results immediately imply that the Multiple Gene Duplication problem is similarily hard. We also prove that several parameterized variants are xed parameter tractable. Namely, when the number of gene trees is 2, the species tree is given, and several restrictions are placed on the number and type of duplication events, the problem becomes linear-time solvable. S

G ; G ; : : :; G

S

G ; G ; : : :; G

W

1 Introduction to the Model A fundamental problem arising in computational biology is the determination of the (correct) evolutionary topology for a set of taxa given a set of (possibly contradictory) gene trees. A gene tree is a complete rooted binary tree formed over a family of homologous genes for a set of taxa. For various reasons, two or more gene trees for the same set of taxa may not always agree (see [BE90, F88] amongst others). The question then arises of how to reconstruct the correct species tree for these taxa from the given gene trees. Several models have appeared in the literature including possibly the most famous MAST [FM67, NA86, P94, PC97]. One such cost model which has received considerable attention of late is the Gene Duplication and Loss model introduced in [GCMRM79] and discussed in [GMS96, MLZ98, Z97]. The basic idea here is to measure the similarity/dissimilarity between a set of gene trees by counting the number of postulated paralogous gene duplications and subsequent gene losses required to explain (in an evolutionarily meaningful way) how the gene trees could have arisen with respect to the species tree. We use angle brackets to denote multisets. All trees in this paper are rooted and leaf labeled. Let T = (V; E; L) be such a rooted tree where V is the vertex set, E is the edge set, and L V is the leaf label set. For a vertex u 2 V ? L, let Tu be the subtree of T rooted by u. Let 1

root(T ) denote the root of T and, for any vertex v 2 V , let parentT (v ) be the parent of v in T , and for binary trees, let leftT (v ) be the left kid of v and rightT (v ) be the right kid of v . Where L is a set, L = f1; 2; : : :; ng, we call these leaf labeled trees either a species or a gene tree. Let G = (VG; EG ; L) be a gene tree and S = (VS ; ES ; L ), L L , be a species tree. We use a function locG;S : VG ! VS to associate each vertex in G with a vertex in the species tree S . Furthermore, we use a function eventG;S : VG ! fdup; specg to indicate whether the event in the gene tree G corresponds to a duplication or speciation event. In [MLZ98], a function tdup is given which returns the minimum number of duplication events necessary to rectify a gene tree with a species tree (here the algorithm from [GMS96] is modi ed to ignore losses). Our function M below maps a gene tree G into a species tree S by de ning functions locG;S and eventG;S . It is the case that tdup = jfuju 2 VG ? L; eventG;S (u) = dupgj. M (G; S ): for each u 2 VG ? L, loc(u) = lca S (u) spec if loc(u ) 6= loc(u); for all u where u is a kid of u in G. event(u) = dup otherwise 0

0

0

0

0

The following problem lies at the heart of Duplication model: Optimal Species Tree (Duplication Model): Input: Set of gene trees G1 ; : : :; Gk . P Question: Does there exist a species tree S with minimum duplication cost ki=1 tdup (Gi; S )? In [MLZ98], it is shown that the problem of nding the species tree S which minimizes the number of gene duplications is NP-complete (here the gene trees may contain leaf labels which appear more than once). When each gene tree may contain a leaf label at most once (a gene tree is formed over exactly one gene per taxa), the problem remains NP-complete and the parameterized (by k) version is hard for W [1] [FHKS98]. A similar question to Optimal Species Tree arises if we ask for the species tree which implies the minimum number of multiple gene duplications for a given set of gene trees. A duplication event in the genome of an organism involves a stretch of DNA where one or more genes may reside. Previous models for rectifying gene trees with species trees considered a duplication event to eect one gene at a time. However, there is evidence that genomes of, for example, eukaryotic organisms, have been entirely duplicated one or more times or individual chromosomes have been duplicated multiple times. In either case, sets of genes were duplicated in one event creating a set of paralogous genes. Such paralogous duplications make nding the correct species topology especially dicult ([F88] amongst others). Consider a vertex u in a species tree S . Each gene tree Gi has some number of vertices (possibly zero) with locG;S equal to u and eventG;S equal to dup. Let Dup = fd1; d2; : : :; dcg denote this set. We can partition the Dup into sets with the property that each set has at most one element from each Gi and so that these sets are maximal. One such set is termed a multiple gene duplication and it counts exactly one to the overall number of multiple gene duplications required to rectify the gene trees with respect to the species tree. The multiple gene duplication score for the vertex u is the total number of such partitions. By \moving" gene duplication events in Gi towards the root of S according to a set of rules, we can decrease the total number of multiple gene duplications required. Figure 1 gives a concrete example.

Observation 1 The duplication mapping function M given above (modi ed from [GMS96, MLZ98]) provides an upper bound for the number of multiple gene duplications for a set of gene trees G1; G2; : : :; Gk and a species tree S .

Let G = (VG ; EG; L) be a gene tree and S = (VS ; ES ; L ) a species tree, L L . Consider a vertex u 2 VG such that eventG;S (u) = dup and u 6= root(G). Let v = parentG (u). The rules for moving duplication events towards the root of a species tree are as follows (we refer the reader to [GMS96] 0

2

0

(a) G1

G2

A

C

D

E

B

F

A

S

C

(b)

B

F

E

D

A

B

C

D

E

F

vertex of S 1 0 0 1

11 00 00 11

vertex of G 1

gene losses

vertex of G 2

losses

1 0 1 0 00 11 0 1 0 01 1 0 1

duplications 1 0 0 01 1 0 1 1 0 0 01 1 0 1

1 0 0 01 1 0 1 1 0 0 1

1 0 0 1

A

1 0 0 1

B

1 0 0 1

C

1 0 0 1

D

1 0 0 1

E

(c)

F 00 11 00 11 11 00 00 11 1 0 0 0 01 1 01 1 0 1

1 0 0 01 1 0 1 1 0 0 01 1 0 1 1 0 0 01 1 0 1 1 0 0 01 1 0 1 1 0 0 1

1 0 0 1

A

1 0 0 1

B

1 0 0 1

C

1 0 0 1

D

1 0 0 1

E

F

Figure 1: (a) Two gene trees (G1; G2) and a proposed species tree S . (b) The species tree S has G1 and G2 embedded inside of it according to the standard Duplication and Loss mapping function M . Note that G1 causes one duplication (vertex ABCDE ) whilst G2 causes 3 duplications (vertices ABC , ABCDEF , and again ABCDEF ). The score according to the Multiple Gene Duplication model is 4. (c) After moving the gene duplication of G1 located at vertex ABCDE to the root of the species tree S , two additional gene duplications for G1 need to be postulated. Nevertheless, the score according to the Multiple Gene Duplication model is now 3. Note that it is not bene cial to move the gene duplication from G2 located at ABC towards the root. The two gene duplications located initially at the root from G2 cannot move upwards. 3

for more information concerning why these are the only two legal moves): Move 1: eventG;S (v ) = dup. We may move the duplication associated with u from locG;S (u) to locG;S (v ) without creating any new duplications. Now locG;S (u) = locG;S (v ). Move 2: eventG;S (v ) = spec. When moving the duplication associated with u from locG;S (u) to locG;S (v ), we must change eventG;S (v ) to be dup. Now locG;S (u) = locG;S (v ). De nition 1 Given a gene tree G, a species tree S , and the functions locG;S and eventG;S mapping G into S , we say that S receives G, if the con guration given by locG;S and eventG;S can be reached by a series of moves starting from the initial con guration obtained by applying M (G; S ). The Multiple Gene Duplication problem can now be stated as follows: Multiple Gene Duplication I

Input: Set of gene trees G1 ; : : :; Gk , integer c. Question: Do there exist a species tree S and functions locGi ;S , eventGi ;S , for 1 i k, such that S receives G1; : : :; Gk with at most c multiple gene duplications?

An easier version of the Multiple Gene Duplication Problem has the species tree S known. We state it as follows: Multiple Gene Duplication Problem II

Input: Set of gene trees G1 ; : : :; Gk , a species tree S , integer c. Question: Do there exist functions locGi ;S , eventGi ;S , for 1 i k, such that S receives G1; : : :Gk with at most c multiple gene duplications?

Via a reduction to and from a combinatorial problem called the Ball and Trap Game, we show W [1]-hardness and NP-completeness for Multiple Gene Duplication II. We also show the problem is xed parameter tractable when various restrictions are placed on the number of gene trees and the number/structure of the gene duplications. The remainder of the paper is organized as follows. Section 3 introduces the Ball and Trap Game and the reduction from Multiple Gene Duplication II, section 4 gives the proofs of intractability/tractability, and section 5 supplies the reduction from the Ball and Trap Game to Multiple Gene Duplication II. The appendix includes a brief introduction to parameterized complexity.

2 The Ball and Trap Game The Ball and Trap Game is played on a rooted labeled tree T = (V; E; L) decorated with a set of traps D and a set of balls B . Every ball and trap has a color associated with it; this is given by the functions cB : B ! [1 : k] and cD : D ! [1 : k] respectively. The balls and traps are initially associated with internal vertices of T via the attaching functions lB : B ! V ? L and lD : D ! V ? L. Each ball b 2 B of color cB (b) is labeled with a (possible empty) subset Rb D of traps. For each ball b every trap d 2 Rb is of the color cD (d) = cB (b). A ball with a given set of traps may occur many times in the tree (i.e. for b; b 2 B Rb = Rb and cB (b) = cB (b ) but lB (b) 6= lB (b ) is possible). Also, a vertex in the tree can be decorated with many dierent balls and traps. A game consists of some number of moves, after which the score is calculated. A rules of the game are as follows: 1. Balls and traps are initially placed at internal vertices of T according to lB and lD . 2. Balls may not move down the tree. They may either stay in the same place or move upwards following the topology of T . In each turn, a ball b on a vertex v can be moved to the parent(v ). 0

0

4

0

0

3. We say that a trap d 2 D is dangerous for a ball b 2 B if d 2 Rb . A ball b sets o a trap if the ball is placed at the vertex of a trap dangerous for b. 4. When a trap d is set o by a ball b, it is removed from the game and replaced by two new balls bnew , bnew such that (a) cB (bnew ) = cB (bnew ) = cB (b), (b) lB (bnew ) = lB (bnew ) = lB (b), (c) Rbnew = Rbnew = RB ? d and Rb = Rb ? d, and (d) Rb = Rb ? d, for all b 2 B . The goal of the game is to minimize the score of tree T which is de ned by 0

0

0

0

0

0

0

smax (v) =

X

v V (T )

maxfs(1; v ); :::; s(k; v )g

2

where s(c; v ) denotes the number of balls of color c at vertex v in T . Ball and Trap Game (Optimization)

Input: A rooted labeled tree T , a set of balls B , a set of traps D, two coloring functions cB : B ! [1 : k] and cD : D ! [1 : k], two initial location functions lB : B ! VT ? L, lD : D ! VT ? L, and for each ball b 2 B a set Rb D where for each d 2 Rb cD (d) = cB (b). A Round: Each round of the game consists of the player moving any number of same colored balls up the tree or deciding not to move any balls (halting move). Output: The location function l (B ) generated according to the above rules which minimizes v V (T )smax (v ). 0

8 2

The input is measured as follows: n denotes the size of T , k denotes the number of colors, r denotes the number of traps, and there are at most m balls on any vertex of T in the initial con guration. The above de ned game leads to the following decision variant of the problem: Ball and Trap (BT) - Decision

Input: A rooted tree T decorated with traps D and balls B in the manner described above, and a positive integer t. Question: Can the Ball and Trap Game be played on T to achieve a score of at most t?

Theorem 1 The Multiple Gene Duplication II (GDII) problem reduces to the Ball and Trap (Decision) problem. Proof (sketch): We construct an instance I 2 BT from an instance of I 2 GDII . Let T equal 0

the species tree S , t = c, and the number of colors k of I be the number of gene trees k from I . Each color corresponds to one of the input gene trees. Apply M (Gi; S ) and consider the functions locGi ;S and eventGi ;S , for 1 i k. We create a ball b with cB (b) = i for every vertex u 2 VG such that eventGi ;S (u) = dup. Let lB (u) = locGi ;S (u). If locGi ;S (u) 6= root(S ), then let 0

0

Dup = fdjd 2 VGi ; d is an ancestor of u; eventGi ;S (d) = specg: For each d 2 Dup we create a trap d and let lD (d ) = locGi;S (d) in T and cD (d ) = i. Place d in Rb . The proof that I 2 BT\yes if and only if I 2 GDII\yes is straightforward. One need only verify that the legal moves for a ball in the Ball and Trap Game correspond to the legal 0

0

0

0

00

00

5

(b)

(a) 1

trap

1

1

A

B

C

D

E

A

F

B

C

D

E

F

Figure 2: Two instances of the Ball and Trap game. Notice in (b) that the ball located at ABC contains trap 1. If it is moved upwards to the root, decorated by trap 1, an additional ball will be added to the game ((c)). moves for a duplication event for the Multiple Gene Duplication problem. The only slightly tricky situation arises when there is a series u1; u2; : : :; uq 2 VGi such that eventGi ;S (up ) = dup, locGi ;S = x and up is a director descendant of up+1 in Gi , for all 1 p < q . In the Multiple Gene Duplication problem, one must rst move duplication event p upwards before moving duplication events 1; : : :; p ? 1. However, when the ball corresponding to duplication event up is moved upwards to the same level as the ball corresponding to duplication event up , p < p , the traps dangerous for these two balls are equivalent. Figure 2 shows two Ball and Trap instances. Figure 2(a) is the Ball and Trap version for the instance shown in Figure 1(b). Figure 2(b) corresponds to the situation in Figure 1(c). 0

0

3 Easy and Hard Parameterizations of the Ball and Trap Game We consider the following parameterizations of Ball and Trap. Ball and Trap:

Input: A rooted tree T decorated with traps D and balls B in the manner described above and integer t. Parameters: k = 2, for each b 2 B let jRbj 2, and the number of traps r. (Version I) Parameters: k; r; m; t. (Version II) Parameters: k; r. (Version III) Question: Can the Ball and Trap Game be played on T to achieve a score of at most t?

3.1 Easy Parameterizations

We can use nite-state dynamic programming on trees (equivalently, nite-state recognition of trees vertex labeled from a nite set of labels) in order to prove the following xed-parameter tractability result.

Theorem 2 For every xed set of parameter values (k; r; m; t), the problem Ball and Trap II can be solved in time linear in the size of the tree.

Proof: Using the methods of [AF93], we can represent an input tree T as a labeled binary tree (even though T may not be binary), where the labels (which we will refer to as colors) indicate both the structure of T and the adornments of the vertices of T with balls and traps. The colored binary tree that represents an input tree T is called a parse tree for T . In this representation of T we can assume that the total number of balls of any given color on T is bounded by t, since 6

otherwise T would be a \No" instance. We argue that there is a nite state tree automaton that recognizes precisely those labeled binary trees that represent \yes" instances of the problem. Our argument is based on the \method of test sets" of [AF93, DFS98]. Since there are at most k2r types of balls, and since each vertex may have m balls, 2r (k2r )m colors suce. The input trees are rooted, and we may assume that the parse tree for a given input tree T is rooted compatibly (i.e., at the same vertex). We use the following parsing operators: (1) the unary operator c , for each color c, that has the eect of adding a single edge from the root to a new root colored c, and (2) the binary operator (de ned only for trees having the same color root) that takes as arguments two trees T1 and T2 and returns the tree T = T1 T2 obtained by identifying the root vertices of T1 and T2. We describe a set T of tests, each of which is a predicate p(T ) about a decorated tree T . We say that T passes the test if p(T ) = true. Decorated trees T and T are equivalent, denoted T T , if they: (1) have roots of the same color (i.e., that are decorated with traps and balls in the same way), and (2) pass the same set of tests in T , that is, if fp 2 T : p(T ) = trueg = fp 2 T : p(T ) = trueg A test in T is speci ed by the following items: (1) A positive integer t t. (2) A length k vector S = (s1 ; :::; sk) of non-negative integers, where each si is at most t. (3) The statement: \It is possible to play the Ball and Trap Game on T in such a way that at the end of play: The total score on the internal vertices of T is at most t . The set of balls on the root of T consists of s1 balls of color 1, s2 balls of color 2; : : :; and sk balls of color k? To conclude the theorem it is enough to establish the following three claims. Claim 1. The equivalence relation has a nite number of equivalence classes. Claim 2. If T1 T2 and T1 is a Yes-instance for the problem, then T2 is a Yes-instance also. Claim 3. The equivalence relation is a congruence with respect to the two parsing operators c and for trees. Claim 1 follows trivially from the fact that jT j (t + 1)k+1. Claim 2 is easily argued. If T1 T2 and T1 is a Yes-instance, then x attention on a winning game g played on T1. Let t be the total score on the internal vertices of T1 at the end of the game g , and let S be the vector indicating the balls of various colors that are left on the root at the end of g . Since T1 T2 the tree T2 has the same set of traps on the root as T1 and since the pair (t ; S ) speci es a test that is also passed by T2, it follows that T2 is a Yes-instance. To prove Claim 3 we essentially have two cases to consider. In the rst case, suppose T1 T2, and consider an operator c for some color c used in the parsing. We argue that c (T1) c (T2). It is easy to see that the set of tests passed by c (T1) is completely determined by the set of tests passed by T1 and the color c. This is the same as for T2 by the de nition of . In the second case, suppose T1 T1 . It is sucient to argue that for any tree T2 having the same root color c as T1 and T1, T1 T2 T1 T2 . Suppose T1 T2 passes the test speci ed by the pair (t ; S ), and let g be a game that witnesses this fact. Suppose that at the end of the game g, the total score on the internal vertices of T1 is t1, the total score on the internal vertices of T2 is t2, and suppose that S1 describes the subset of S consisting of those balls that are moved to the root from internal vertices of T1 during the play of g , and let S^1 be the union of S1 and the balls assigned to the root by the color c. Then (t1 ; S^1) describes a test passed by T1 that is therefore also passed by T1, which implies that a game with the same outcome as g can be played on T1 T2, and so T1 T2 T1 T2. 0

0

0

0

0

0

0

0

0

0

0

0

0

0

Theorem 3

Ball and Trap III can be solved in time nc where c = O((k2r )m ).

7

Proof (sketch): We use leaf-to-root dynamic programming. At each vertex u of the tree T we

calculate a table of pairs (c; S ) where S is a set of balls to be passed upwards from u, and c is the minimum total score that can be achieved in the subtree Tu rooted at u assuming that the balls of S are passed upwards. It is easy to calculate this information for a vertex u from the information for the children of u. The best score that can be achieved for T is the value c at the root for S = ;.

W-hardness of the Ball and Trap Game

3.2

We prove the W [1]-hardness of Ball and Trap I for parameter r by means of a polynomial-time parameterized reduction from Clique (Clique is proven complete for W [1] in [DF95b, DF98]). As an intermediate step we prove that the following parameterized problem is hard for W [1].

k; r-Small Union Input: A family F of subsets of f1; :::; ng, and positive integers r and k. Parameter: (r; k) Question: Is there a subfamily F F with jF j r such that the union of the sets in F has cardinality at most k? 0

0

0

k-Clique

Input: A graph G = (V; E ), an integer k. Parameter: k. Question: Does there exist a set V , V V , such that for all u; v 2 V , (u; v ) 2 E ? 0

0

0

Lemma 1 Small Union is NP-complete and hard for W [1]. Proof ? (sketch): Let (G; k) be an instance of k-Clique. We transform (G; k) into the triple (F ; k ; k) where F is the family of 2-element sets corresponding to the edges of G, with the vertices of G identi ed with f1; :::; ng. The correctness of the reduction is easily checked. 2

Theorem 4 Ball and Trap is NP-complete, and Ball and Trap I is hard for W [1]. Proof: Ball and Trap is well-de ned for non-binary trees, and so we describe how Small Union

can be reduced to Ball and Trap I. Let (F ; r; k) be an instance of Small Union. We can assume, by the reduction described above, that F consists of 2-element sets. In order to describe the reduction we must describe a tree T with decorations, and the target value t for the game on T . The tree T is just a star of degree n = jFj (with the root being the central vertex). Each leaf of T is associated with an element of F . The two colors are red and blue. There are n red traps 1 ; :::; n. Each leaf is decorated with a single red ball labeled with the set of traps fu; v g for the associated (\edge") set fu; vg. The root is decorated with k + r blue balls, each labeled with the empty set of traps. The root is also decorated with all the n red traps. We set t = (n ? r) + (k + r) = n + k. The basic idea for the correctness argument can be described as follows. Initially, the score is n + k + r. The only possible move is to move a red ball from a leaf to the root. If r balls can be moved up to the root from the leaves, with the r balls chosen so that the union of their trap label sets has cardinality k, then the result is a total of k + r red balls at the root (where there are k + r blue balls, so the cost of the root in the nal score remains k + r). Thus the score at the end is t. Conversely, if a score of t is achieved by a game g , then necessarily at least r red balls must be moved up from the leaves. Let g denote the truncated game consisting of the rst r moves. There are two possibilities: 0

8

1. g also achieves a score of at most t, and 2. the score for the game g is greater than t. In case 1, exactly r red balls are moved to the root and consequently the score for the root vertex is at most k + r, which implies that the union of the trap label sets for the balls moved up has cardinality at most k. This implies that (F ; r; k) is a \Yes" instance for the Small Union problem. In case 2, there are more than k + r red balls at the root after the moves of g . Since the number of red balls now exceeds the number of blue balls at the root, each further move of g is of no advantage in decreasing the total score, contradicting that g is a game that achieves a score of at most t. 0

0

0

Theorem 5

Ball and Trap I remains W [1]-hard restricted to binary trees, the maximum number of traps per vertex is one per color, and balls are placed on neither leaves nor parents of leaves.

Proof (sketch): We make the following modi cations to the construction in the proof of Theorem 4. We can replace the star T by a binary tree of n = jLj leaves which is isomorphic to a

caterpillar tree where the one kid of the root is a leaf. With the exception of the root, the internal vertices receive neither balls nor traps. Since moving red balls up these internal nodes (root excepted) does not change the score, the same argument as in the proof of Theorem 4 applies. Similar to the idea for the binary case, we \stretch" the root to prevent a vertex from being decorated with more than one red or one blue trap; that is, we replace the root of T with a caterpillar tree (de ned as above) on n ? 1 internal nodes, where the left kid of the sibling leaf pair is replaced by T . The n ? 1 \too many" traps are placed on the n ? 1 new internal vertices. Finally, every leaf of the binary tree is given two leaf children. We introduce the following version of Ball and Trap as it is used to establish the hardness of the Multiple Gene Duplication problem. Ball and Trap IV (BTIV): Input: A rooted binary tree T decorated with traps D and balls B in the typical manner, and a positive integer t. Parameters: k = 2 and the number of traps r. Conditions: (1) jRb j 2, for all b 2 B . (2) For all v 2 VT and each color c, Tv has at most jTv j ? 2 c-colored balls. (3) Rb = Rb if lB (b) = lB (b ) and cB (b) = cB (b ). (4) Rb Rb if lB (b) is an ancestor of lB (b ) in T . (5) No useless traps are allowed (a trap d is useless if no ball b in the subtree where the trap is located has d 2 Rb). (6) If b; b 2 B where the vertex lB (b) is a descendant of the vertex lB (b ), then all traps d 2 Rb ? Rb are placed at vertices on the path from b to b (inclusive). Question: Can the Ball and Trap Game be played on T to achieve a score of at most t? Because none of the conditions in speci ed in Ball and Trap IV are violated in the reduction, we receive the following corollary. 0

0

0

0

0

0

0

0

0

Corollary 2 Ball and Trap IV is W [1]-hard. Proof: Straightforward and therefore omitted.

4 Hardness of the Multiple Gene Duplication Problem Because of Theorem 1 and Corollary 2 the following theorem proves W [1]-hardness and NPcompleteness of the Multiple Gene Duplication II problem.

Theorem 6 The Ball and Trap IV problem reduces to the Multiple Gene Duplication II

problem.

9

Proof (sketch): We construct an instance I 2 GDII from an instance I 2 BTIV . Our reduction 0

builds a species tree S = (VS ; ES ; L) and gene trees G1 and G2 . I is restricted to 2 colors; we associate color 1 with G1 and color 2 with G2. W.l.o.g. we restrict our attention to balls and traps of one color c. Let S = T . We describe the construction of gene tree Gc in two parts. Part I: Building the contradictory topology = (V ; E ; L ) such that top Gc . Part II: Embedding the leaves L ? L in Gc as in S . Part I: Case 1: v 2 L. Set free(v ) = fv g. Case 2: v 2 V ? L. Here again we dier two cases. Case 2.a: v is decorated by a ball or a trap. Case 2.b: v is not decorated. Case 2.a.i: We consider only nodes not decorated by a trap. We remove a leaf from free(left(v )) and a leaf from free(right(v )), and let these leaves be the kids of a new vertex w. For each ball b at v we remove an element e from free(left(v)) or free(right(v)), where e is either a tree (if there is any in free(left(v )) [ free(right(v ))) or a leaf (otherwise). let w be the parent of w and e. we set w equal to w . Let free = free(left(v )) [ free(right(v )) [ fwg. Case 2.a.ii: Let v be decorated by a trap d and let e1 be the tree removed from free(left(v )) with Tleft(v) has a ball b, d 2 Rb , if this tree exists. Otherwise let e1 be some leaf removed from free(left(v )). Let e2 be the analogous element removed from free(right(v )). Note that at least one element of e1 and e2 is a tree. Let e1 and e2 be the kids of a new vertex w. Now we treat the balls placed at v as done in (i) and set free = free(left(v )) [ free(right(v )) [ fwg. Case 2.b: free = free(left(v )) [ free(right(v )). Finally let be a tree removed from free(root(T )). It is easy to verify that now free(root(T )) = L ? L . Part II: We complete Gc from by embedding the remaining leaves free(root(T )) in accordance to the topology of T . We build the maximal subtrees Tv = (VTv ; ETv ; LTv ) of T over the elements of free. For each such tree Tv we compute the sibling w of v in S and specify p, the lca of the leaves of LTv in . Then we subdivide edge (p; parent (p)) in (p; p ) and (p ; parent (p)) and add Tv as the sibling of p in . The proof that I 2 GDII\yes if and only if I 2 BTIV\yes is straightforward and omitted. 0

0

0

0

0

00

00

5 Conclusions Via a combinatorial abstraction called the Ball and Trap Game, we have examined the Multiple Gene Duplication problem from both the classical and parameterized complexity frameworks and provided several FPT algorithms for parameterized versions of the problem. It would be interesting to nd useful approximation algorithms (or even meaningful heuristics) for parameterized and restricted versions of this problem. Particularly, we would like a nice, meaningful way to guess a (possibly quite large) set of candidate topologies for the species tree S in Multiple Gene Duplication I. One idea is to use the parameterized complexity framework since FPT algorithms are closely related to the design of useful heuristics [DFS98]. Thus far, our models have used unweighted gene and species trees, which means that we have ignored potentially useful distance data between genes from the taxa. Future models should include this information. Additionally, the model so far ignores information about the position of genes along the chromosome (it makes no sense to postulate a multiple gene duplication for a set of genes when they lie on dierent chromosomes in the genome of the organism unless there is evidence that all genes located on this chromosome where duplicated). It may be possible to encode this type (and other types) of restrictions into the Ball and Trap Game. Lastly, other interesting and biologically meaningful parameterizations may lead to good FPT algorithms. 10

References [AF93] K. R. Abrahamson and M. R. Fellows. Finite automata, bounded treewidth and wellquasiordering. In Graph Structure Theory, Contemporary Mathematics vol. 147, pp. 539{564. American Mathematical Society, 1993. [BFR98] R. Balasubramanian, M. Fellows, V. Raman. \An improved xed parameter algorithm for vertex cover," Information Processing Letters, vol. 65 number 3, 1998. [BE90] S. Benner and A. Ellington. Evolution and Structural Theory. The frontier between chemistry and biochemistry. Bioorganic Chemistry Frontiers 1 (1990), 1{70. [BDFHW95] H. Bodlaender, R. Downey, M. Fellows, M. Hallett and H. T. Wareham, \Parameterized Complexity Analysis in Computational Biology," Computer Applications in the Biosciences 11 (1995), 49{57. [CCDF96] L. Cai, J. Chen, R. Downey, M. Fellows. \The Parameterized Complexity of Short Computation and Factorization," Arch. for Math. Logic, 36(1997), 321{337. [DF95a] R. G. Downey and M. R. Fellows. \Fixed Parameter Tractability and Completeness I: Basic Theory," SIAM Journal of Computing 24 (1995), 873-921. [DF95b] R. G. Downey and M. R. Fellows. \Fixed Parameter Tractability and Completeness II: Completeness for W [1]," Theoretical Computer Science A 141 (1995), 109-131. [DF95c] R. G. Downey and M. R. Fellows. \Parameterized Computational Feasibility," in: Feasible Mathematics II, P. Clote and J. Remmel (eds.) Birkhauser, Boston (1995) 219-244. [DF98] R. G. Downey and M. R. Fellows. Parameterized Complexity, Springer-Verlag, 1998. [DFS98] R. G. Downey, M. R. Fellows, and U. Stege. \Parameterized Complexity: A Framework for Systematically Confronting Parameterized Complexity," in The Future of Discrete Mathematics: Proceedings of the First DIMATIA Symposium, Stirin Castle, Czech Republic, June, 1997, F. Roberts, J. Kratochvil and J. Nesetril (eds.), AMS-DIMACS Proceedings Series, 1998, to appear. [F88] J. Felsenstein. Phylogenies from Molecular Sequences: Inference and Reliability. Annu. Rev. Genet.(1988), 22, 521{65. [FHKS98] M. Fellows, M. Hallett, C. Korostensky, and U. Stege. \Analogs & Duals of the MAST Problem for Sequences & Trees," ESA 98, to appear. [FM67] W. Fitch and E. Margoliash. \Construction of Phylogenetic Tree," Science 155 (1967), 279{284. [GCMRM79] M. Goodman, J. Czelusniak, G. W. Moore, A. E. Romero-Herrera and G. Matsuda. \Fitting the Gene Lineage into its Species Lineage: A parsimony strategy illustrated by cladograms constructed from globin sequences," Syst. Zool.(1979), 28, 132{163. [GMS96] R. Guigo, I. Muchnik, and T. F. Smith. \Reconstruction of Ancient Molecular Phylogeny," Molecular Phylogenetics and Evolution (1996),6:2, 189{213. [MLZ98] B. Ma, M. Li, and L. Zhang. \On Reconstructing Species Trees from Gene Trees in Term of Duplications and Losses," Recomb 98. 11

[NA86] J. E. Neigel and J. C. Avise. \Phylogenetic Relationship of mitochondrial DNA under various demographic models of speciation," Evolutionary Process and Theory(1986), Academic Press New York, 515{534. [P94] R. D. M. Page. \Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas," Syst. Biol. 43 (1994), 58{77. [PC97] R. D. M. Page and M. Charleston. \From Gene to organismal phylogeny: reconciled trees and the gne tree/species tree problem," Molecular Phylogenetics and Evolution 7 (1997), 231{ 240. [Z97] L. Zhang. \On a Mirkin-Muchnik-Smith Conjecture for Comparing Molecular Phylogenies," Journal of Computational Biology (1997) 4:2, 177{187.

Appendix: The Parameterized Complexity Framework Part of our analysis of the complexity of the three problems that we look at is in the framework of parameterized complexity introduced in [DF95a, DF95b, DF95c, DF98] that has become a routine tool to some extent, but not to the point where we feel it can be assumed to be familiar, and so we oer this brief summary. The parametric framework has previously been applied to sequence problems in [BDFHW95, FHKS98]. The goal in parametric complexity analysis is to identify small but still useful ranges of a parameter k for a hard problem, and see if for instances of size n the problem can be solved in time f (k)n for some constant independent of the parameter. (Typically, when this is possible, is small, just as with polynomial time complexity.) This good behaviour is termed xed parameter tractability. A canonical example that exhibits the issue is the dierence between the best known algorithms for Vertex Cover and Dominating Set, taking in both cases the parameter to be the number of vertices in the set. The best known algorithm for Vertex Cover, after several rounds of improvement in the parameter function, runs in time O((53=40)kk2 + nk) [BFR98]. Note that the exponential dependence on the parameter k in the last expression is additive, so that Vertex Cover is well-solved for input of any size so long as k is no more than around 60. Both problems are NP-complete, of course, and so the P-time versus NP-complete framework is unable to detect the signi cant qualitative dierence between these two problems displayed in Table 1.

n = 50

n = 100

n = 150

k=2 625 2,500 5,625 k = 3 15,625 125,000 421,875 k = 5 390,625 6,250,000 31,640,625 k = 10 1:9 1012 9:8 1014 3:7 1016 k = 20 1:8 1026 9:5 1031 2:1 1035 Table 1: The ratio n2kk+1n for various values of n and k. For parameterized complexity, the analog of NP-hardness is hardness for W [1]. The analogy is very strong, since the k-Step Halting Problem for Nondeterministic Turing Machines is complete for W [1] [CCDF96], a result which is essentially a miniaturization of Cook's Theorem. Dominating Set is hard for W [1] and is therefore unlikely to be xed parameter tractable. Indeed, one senses a \wall of brute force" (try all k-subsets) inherent complexity in this problem, much as one does for the k-Step Halting Problem, where a signi cant improvement on the obvious O(nk ) algorithm seems fundamentally unlikely, and much as one senses a wall of brute force (\try all 2n truth assignments") in the complexity of NP-complete problems such as satis ability. 12

On the Multiple Gene Duplication Problem 1 Introduction ... - CiteSeerX

On the Multiple Gene Duplication Problem 1 Introduction ... - CiteSeerX

Suggest Documents

ON THE ALIGNMENT PROBLEM 1. Introduction - CiteSeerX

ON THE ALIGNMENT PROBLEM 1. Introduction - CiteSeerX

On the Multiple Label Placement Problem Abstract 1 ... - CiteSeerX

The Quickest Transshipment Problem 1 Introduction - CiteSeerX

On the Bipartite Travelling Salesman Problem 1 Introduction - CiteSeerX

Some historical notes on the Stefan problem. 1 Introduction. - CiteSeerX

Heuristics for the Gene-duplication Problem - Computational Biology ...

An ILP solution for the gene duplication problem - BMC Bioinformatics ...

The Roles of Gene Duplication, Gene Conversion and ... - CiteSeerX

Carboxylesterase 1 Gene Duplication and mRNA

1 The Problem - CiteSeerX

Problem 1: Multiple Choice Questions

Applying Fusion/UML to the Invoice Problem 1 Introduction - CiteSeerX

The Role of Gene Duplication and Unconstrained ... - CiteSeerX

CHAPTER 1: INTRODUCTION Problem Statement

1. Introduction: 2. Problem Statement

1. Introduction 2. problem setting

Chapter 1 Introduction and the research problem

1. Introduction An important problem concerning the

THE FLOODLIGHT PROBLEM 1. Introduction - Semantic Scholar

THE FLOODLIGHT PROBLEM 1. Introduction - Semantic Scholar

The Vertex k-cut Problem 1. Introduction

Gene Duplication, Gene Conversion and the Evolution of the Y ...

MODELLING OF A COLLAGE PROBLEM 1. Introduction The problem ...