
Obtaining Highly Accurate Topology Estimates of Evolutionary Trees From Very Short Sequences

Daniel H. Huson

Applied and Computational Mathematics Princeton University, Princeton NJ 08544-1000 e-mail: [email protected]

Scott Nettles

Department of Computer Science University of Arizona, Tucson AZ 85721 e-mail: [email protected]

Tandy J. Warnow

Department of Computer Science University of Arizona, Tucson AZ 85721 e-mail: [email protected]

Abstract

The evolutionary history of a set of species is represented by a phylogenetic tree, in other words, by a rooted, leaf-labelled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstruction of large and divergent trees has long been considered one of the major challenges in systematic biology. None of the polynomial time methods developed by the theoretical computer science community has been shown to outperform the popular Neighbor-Joining method used by systematic biologists, with respect to topology estimation. (However, preliminary experiments indicate that two new variants of Neighbor-Joining, Bio-NJ and Weighbor, do exhibit improved performance.) In this paper, we present a simple polynomial time method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods. We analyze the performance of DCM-boosted distance methods under the general Markov model of evolution, and prove that, by using the DCM-boosted Buneman method, for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. Our experimental study (based upon simulating sequence evolution on model trees, generating about 1000 datasets) confirms these substantial reductions in error rates and extremely fast convergence rates. In particular, we report that DCM-boosted Neighbor-Joining has only 8% of the error of Neighbor-Joining under conditions that are adverse to Neighbor-Joining, and that on some trees it achieves acceptable error rates (less than 5% error in the topology estimation) from sequences of a few hundred nucleotides, while Neighbor-Joining needs more than 10K nucleotides to achieve the same level of accuracy.

1 Introduction

The evolution of biomolecular sequences can be modeled as a Markov process operating on a rooted binary tree: a biomolecular sequence at the root of the tree "evolves down" the tree, with each edge of the tree introducing point mutations, thereby generating sequences at the leaves of the tree, each of the same length as the root sequence. The phylogenetic tree reconstruction problem is to take the sequences that occur at the leaves of the tree, and infer, as accurately as possible, the tree that generated the sequences. The tree reconstruction problem has two objectives: first, to recover the branching process, as represented by the rooted leaf-labelled topology of the evolutionary tree, and second, to estimate the parameters of the evolutionary process (the mutation probabilities on the edges of the tree, the rates of change across the positions (called "sites") of the sequences, etc.). While inferring the mutational process is of interest to both biologists and statisticians, it turns out that a highly accurate estimate of the topology of the true tree makes this problem fairly straightforward, and consequently, biologists have mostly focused upon the question of inferring the tree topology. However, locating the root of the evolutionary tree turns out to be either hard or impossible, for both statistical and biological reasons. Therefore, the primary objective in phylogenetics is the inference of the unrooted leaf-labelled topology (simply called the "topology" of the evolutionary tree) underlying the evolutionary process. Methods for inferring or estimating the evolutionary history of biomolecular sequences are evaluated according to the accuracy of this topology estimation. Indeed, this is one of the most important problems for computational biology as a whole, because of the centrality of evolutionary studies to biology [11].
(Evolutionary trees are often the basis of multiple sequence alignment algorithms [18, 19, 20], protein structure prediction routines [34], and other problems in biology.) Experimentally investigating the performance of phylogenetic methods by simulating sequence evolution on different model trees in order to determine how the sequence length affects the accuracy of the topology prediction is central to systematic biology studies (e.g. [21, 35, 28, 40, 37, 24, 23, 22, 43]). The focus on sequence length in these studies is primarily practical: biomolecular sequences are not particularly long (those used for phylogenetic tree reconstruction purposes are typically bounded by 2000 nucleotides, often by much smaller numbers), and sequence lengths of 5000 nucleotides are generally considered to be unusually long [30]. Furthermore, experimental and anecdotal evidence suggests that methods are very sensitive to sequence lengths, and that different methods can have extreme differences in their accuracy on realistic sequence lengths. In an earlier paper [26], we showed that a critical factor affecting the accuracy of different methods is the maximum evolutionary distance in the tree, which we call the divergence of the tree. Theoretical studies [5, 12, 13, 14] have bounded the convergence rates of various polynomial time distance methods (Neighbor-Joining [36], a simple and very popular clustering technique used by systematic biologists, and more sophisticated methods developed by the theoretical computer science community, such as the "Single Pivot" algorithm that 3-approximates the $L_\infty$ tree [1, 15, 2]), and shown that these bounds grow exponentially in the divergence. This exponential dependency upon the divergence translates to superpolynomial convergence rates for almost all trees [14], suggesting that accuracy may be poor if these methods are used to infer tree topologies under conditions of high divergence. However, these are upper bounds, and may be pessimistic. Unfortunately, our earlier studies [26, 33] have indicated that none of the currently available methods is both fast enough to handle dataset sizes above (say) 300 taxa in reasonable time periods (bounded, say, by a few months of CPU time), and also sufficiently accurate with respect to topology prediction when the dataset contains taxa that have high evolutionary distances between them.
The most accurate methods are based upon Maximum Likelihood or Maximum Parsimony, two NP-hard optimization problems, and these are extraordinarily computationally intensive; for example, Maximum Parsimony analyses of one dataset of 500 taxa have already taken several years of CPU time without converging to the global optimum [32], and Maximum Likelihood is even more computationally intensive. The faster techniques are polynomial time, hence fast enough to use with large datasets, but even the best of these show very significant degradation in accuracy (with respect to the topology estimation) when analyzing datasets containing high evolutionary distances. In this paper, we present a new method which we hope will greatly advance our ability to meet this challenge: the Disk-Covering Method (DCM). This is a polynomial time algorithm which can be used in conjunction with any base phylogenetic method in order to boost its performance. For some variants of DCM we can prove strong theoretical guarantees, showing that under the general Markov model [41], the convergence rates of certain variants of DCM are extremely fast: polynomial sequence lengths suffice for all trees, and polylogarithmic sequence lengths suffice for almost all trees, to recover the true tree topology with $1 - o(1)$ probability. Our experimental studies confirm substantial reductions in error rates at all sequence lengths for DCM-boosted polynomial time methods. We will also provide evidence that DCM-boosting provides significant reductions in running time for methods for solving hard optimization problems, such as Maximum Parsimony and Maximum Likelihood, thus making it possible that the accurate reconstruction of trees on hundreds or thousands of taxa could become feasible in realistic time bounds. DCM-boosting can also be used in conjunction with network reconstruction techniques, which are appropriate when the data exhibit recombination, or where there is horizontal transfer of genetic material, and first experiments indicate that DCM-boosted network reconstruction methods have significantly greater accuracy. Finally, although we have not yet experimentally tested this idea, we believe that DCM-boosting can also be used to advantage for multiple sequence alignment, one of the hardest problems in computational biology.

2 Convergence Rates of Distance Methods

Suppose T is the true evolutionary tree, and that biomolecular sequences evolve down this tree under a Markov process. We will assume in this paper that the model of evolution is the general Markov model with iid site evolution, so that all sites evolve under identical and independent Poisson processes, and that the underlying tree is binary. If the assumption of iid site evolution (or that the sites evolve under different rates but from a known distribution) is relaxed, then positive results become less likely. For example, it is known that under more general models of evolution, even Maximum Likelihood (under the correct model!) can be statistically inconsistent [42]. The theorems we will develop in this paper are stated in terms of iid site evolution, but apply more generally to all models for which statistical consistency of distance-based methods has been established. We begin with some definitions.

Definition 1 A matrix D is called additive if there exists a tree T with a positive edge weighting w such that $D_{ij} = \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path in T between leaves i and j.

Additive matrices have the nice property that they correspond uniquely to positively edge-weighted trees [8], and that given D, the tree T and the weighting w can be recovered in polynomial time [45]. Let $E(T)$ denote the set of edges of T. We represent the evolutionary process by a set $\{X_e : e \in E(T)\}$ of Poisson processes, where $X_e$ is the Poisson random variable for the number of mutations of a random site on the edge e. Let $X_{ij} = \sum_{e \in P_{ij}} X_e$. Then $X_{ij}$ is a Poisson random variable.
Let $\lambda_{ij} = E[X_{ij}]$. Then $\lambda$ is an additive matrix. We will call $\lambda_{ij}$ the expected evolutionary distance, or true distance, between i and j. Clearly, $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e$, where $\lambda_e = E[X_e]$.

Definition 2 We define the divergence of the tree T to be $\lambda_{\max} = \max_{ij} \{\lambda_{ij}\}$.

The divergence can be unboundedly large, even if the number of leaves is held constant, since sites can change many times on an edge, even though only one change can possibly be observed for any site, between any pair of leaves. Because the matrix $\lambda$ is additive, given $\lambda$ the tree T can be constructed in polynomial time using a number of different distance-based methods. Furthermore, a distance correction transformation exists for the general Markov model of site evolution, so that arbitrarily good approximations to the matrix $\lambda$ can be obtained if the sequence length is unboundedly large [44, 16]. This suggests the following distance-based approach to tree reconstruction: first, an approximation d to the matrix $\lambda$ is computed; then, d is mapped (using some distance method M such as Neighbor-Joining) to a (nearby, hopefully) additive matrix $M(d) = D$. If D and $\lambda$ define the same unrooted leaf-labelled tree, then the method is said to be accurate, even if they assign different weights to the edges of the tree.
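To make Definition 1 concrete, the following is a minimal Python sketch that builds the path-distance matrix of a small edge-weighted tree and then tests additivity. The test uses the four-point condition, a standard characterization of additive matrices that is not spelled out in the text above; the tree, its weights, and the function names `path_distances` and `is_additive` are illustrative assumptions.

```python
from itertools import combinations

def path_distances(edges, leaves):
    """Return dist[i][j] = sum of edge weights on the tree path between leaves i and j."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        seen = {src: 0.0}
        stack = [src]
        while stack:  # depth-first walk; in a tree each node is reached by one path
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + w
                    stack.append(nxt)
        dist[src] = {leaf: seen[leaf] for leaf in leaves}
    return dist

def is_additive(dist, leaves, tol=1e-9):
    """Four-point condition (assumed characterization): in every quartet,
    the two largest of the three pairwise distance sums must be equal."""
    for a, b, c, d in combinations(leaves, 4):
        sums = sorted([dist[a][b] + dist[c][d],
                       dist[a][c] + dist[b][d],
                       dist[a][d] + dist[b][c]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Hypothetical tree: leaves a, b hang off u; leaves c, d hang off v; internal edge (u, v).
edges = [("a", "u", 1.0), ("b", "u", 2.0), ("u", "v", 0.5),
         ("c", "v", 1.5), ("d", "v", 1.0)]
leaves = ["a", "b", "c", "d"]
D = path_distances(edges, leaves)
```

A matrix produced this way passes the check by construction; an estimated matrix d generally will not, which is why a distance method M is needed to map it to a nearby additive matrix.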

While complete accuracy is the objective, partial accuracy is the rule, especially when trees are large and/or highly divergent (that is, contain evolutionarily distant pairs of taxa). For this reason, quantifications of degrees of accuracy have been proposed. The most favored such quantification is as follows.

Definition 3 Let T be the true tree, and let $T'$ be the inferred tree, both leaf-labelled by a set S of taxa. Let $C(T)$ be the set of bipartitions on the leaf set induced by deletions of internal edges in T, and let $C(T')$ be equivalently defined. Then $T = T'$ if and only if $C(T) = C(T')$. Any bipartition in $C(T) - C(T')$ is called a false negative and any bipartition in $C(T') - C(T)$ is said to be a false positive.

Thus, a false negative is an edge in the true tree that is missing from the inferred tree, while a false positive is an edge in the inferred tree that is not in the true tree. We will say that an edge $e \in E(T)$ is recovered in $T'$ (or by the method M which produces $T'$) if the bipartition in $C(T)$ associated to e appears in $C(T')$. The assumption of Markov models of evolution allows the performance of phylogenetic tree reconstruction methods to be assessed, either through analytical means, or by experiments. There are several criteria that have been considered of fundamental importance to statisticians working in evolutionary tree reconstruction [16]: statistical consistency (with respect to a given model of evolution), which is present if the probability of recovering the leaf-labelled tree converges to 1 as the sequence length increases; convergence rate (with respect to a given model of evolution), which is the rate at which the probability (of recovering the leaf-labelled tree) goes to 1 as the sequence length increases; and accuracy (with respect to a given model of evolution), which is a measure of the expected number of topological errors at a given sequence length.
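The error measure of Definition 3 can be sketched in a few lines of Python. This is an illustration under assumed conventions: trees are given as unweighted edge lists, internal node names such as "x", "y", "z" are arbitrary, and the helper names `bipartitions` and `topology_errors` are hypothetical.

```python
def bipartitions(edges, leaves):
    """Return the bipartitions of the leaf set induced by deleting each internal edge."""
    leaves = frozenset(leaves)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    result = set()
    for u, v in edges:
        if u in leaves or v in leaves:
            continue  # edges incident to a leaf induce only trivial bipartitions
        seen = {u}
        stack = [u]
        while stack:  # collect everything on u's side of the deleted edge (u, v)
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen and not (node == u and nxt == v):
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(seen & leaves)
        result.add(frozenset([side, leaves - side]))
    return result

def topology_errors(true_edges, inferred_edges, leaves):
    """Return (false negatives, false positives) as sets of bipartitions."""
    c_true = bipartitions(true_edges, leaves)
    c_inferred = bipartitions(inferred_edges, leaves)
    return c_true - c_inferred, c_inferred - c_true

leaves = [1, 2, 3, 4, 5]
# True tree ((1,2),3,(4,5)) and an inferred tree ((1,3),2,(4,5)); both are made up.
true_tree = [(1, "x"), (2, "x"), ("x", "y"), (3, "y"), ("y", "z"), (4, "z"), (5, "z")]
inferred = [(1, "x"), (3, "x"), ("x", "y"), (2, "y"), ("y", "z"), (4, "z"), (5, "z")]
false_neg, false_pos = topology_errors(true_tree, inferred, leaves)
```

In this example the two trees share the bipartition separating {4, 5} from the rest, so there is exactly one false negative and one false positive.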
The dependency upon the model of evolution is important, as some methods are statistically consistent under some models but not under others. Despite the intuition that appropriate distance methods should be statistically consistent under iid site evolution, no such methods were proven statistically consistent until quite recently:

Theorem 4 Let T be a tree under the general Markov model, let $\lambda$ be the additive matrix of expected interleaf distances, and let d be an approximation to $\lambda$. Let $\Delta = L_\infty(d, \lambda)$. Let x be the minimum number of expected mutations of a random site on any internal edge in the tree T.
(Atteson [3]) The Neighbor-Joining Method is guaranteed to be accurate if $\Delta < x/2$.
(Erdős et al. [13]) The Buneman Tree Method is guaranteed to be accurate if $\Delta < x/2$; furthermore, every edge $e \in E(T)$ such that $E[X_e] > 2\Delta$ is recovered by the Buneman Tree Method.
The sequence lengths that suffice for accuracy for these methods grow exponentially in $O(\lambda_{\max})$.

These results prove statistical consistency, and also provide upper bounds on the false negative rate for both the Single Pivot and Buneman Tree methods. In [26], a similar analysis was made of the sequence length that suffices for approximately correct topology estimations, which turns out to grow exponentially in the divergence of the tree. An analysis of the expected divergence of random trees given in [14] then shows that for almost all trees, the sequence length that suffices for bounded inaccuracy is superpolynomial in the tree parameters. (The polynomial convergence rate of the Single Pivot method proven by Farach and Kannan in [15] assumes an explicit bound upon the divergence of the data, and hence does not contradict this result.) However, these are upper bounds, and hence potentially loose; experimental studies, unfortunately, confirm the generally prohibitive state of affairs [33, 26]. The many important biological datasets which contain high divergence (for example, the Tree of Life [38]) will therefore not be amenable to such analyses. The proof of Theorem 4 is in [13], but here is an intuition about why it should be true. Recall that $\lambda_{ij} = E[X_{ij}]$. As $X_{ij}$ is a Poisson random variable, its expectation is the same as its variance. Consequently, $\lambda_{ij} = \mathrm{Var}[X_{ij}]$, and when the variance is high, errors in estimating $\lambda_{ij}$ are also high.

Definition 5 Let $\Delta_{ij} = |d_{ij} - \lambda_{ij}|$, and let $\Delta_q = \max\{\Delta_{ij} \mid d_{ij} \le q\}$.

If q is large, then $\Delta_q$ will converge to 0 more slowly than if q is small, as the sequence length increases. In fact, in [12, 13, 14], the following was shown:

Lemma 6 For all q, the sequence length that suffices for $\Delta_q < x/2$ is $O(\log n \, e^{O(q)})$.

For this reason, $\Delta = \Delta_{\max}$ converges to 0 much more slowly when $\lambda_{\max}$ is large than when $\lambda_{\max}$ is small.

3 Divide-and-Conquer

The previous section established that $\Delta_q \to 0$ quickly if q is small, and that some methods are guaranteed to yield accurate estimations of the true tree if $\Delta < x/2$. This suggests a three-step approach to tree reconstruction:
1. Divide the dataset S into (possibly overlapping) subsets $A_1, A_2, \dots, A_p$, where each subset has low divergence.
2. Construct a tree $T_i$ for each $A_i$, $i = 1, 2, \dots, p$.
3. Combine the trees $T_1, T_2, \dots, T_p$ into one supertree on all of S.
This general algorithmic technique should provide improved performance, but there are several obstacles to overcome: Even if the trees $T_i$ are all accurate (i.e.
when $T_i = T|A_i$, where $T|A_i$ is the subtree of T that is induced by the taxon set $A_i$), the reconstruction of T from $T_1, T_2, \dots, T_p$ may be computationally difficult, because the "Subtree Compatibility Problem" [41] is NP-Complete. And then, even if the instance of Subtree Compatibility can be solved in polynomial time, the subtrees may not uniquely define the supertree, in the sense that many different trees may be consistent with the set of subtrees. For this reason, the set of subsets must be defined with care, with respect both to computational consequences and to uniqueness of the supertree compatible with the subtrees. Finally, if any of the $T_i$ have errors, then we will not be able to use algorithms for Subtree Compatibility, and instead will need to consider NP-hard optimization problems. Despite these obstacles, Erdős et al. developed exactly this kind of approach to tree reconstruction in the Short Quartet Methods [13]. These are two polynomial time distance-based methods that are guaranteed to recover the true tree topology whenever the accuracy in the estimated distances between "nearby" taxa in the tree is good enough. A probabilistic analysis of how long the sequences need to be for these estimations to be sufficiently accurate with high probability shows that the Short Quartet Methods will, with high probability, obtain the true tree from sequences that grow at most polynomially in the model tree parameters, and only polylogarithmically for almost all trees. By comparison, the upper bounds on sequence length requirements for other distance-based methods were superpolynomial in the model tree parameters. The major theoretical limitation of the Short Quartet Method is that it either returns the true tree, or it fails with high probability to recover anything at all. To go beyond the performance of the Short Quartet Method so that we recover the true tree from short sequences, if possible, and otherwise recover a boundedly inaccurate tree, we have developed an algorithm for combining subtrees which can handle errors in the subtrees, and at the same time not lose too much of the topological information in the subtrees. We describe this algorithm in the following section. The technique we use for combining trees on subsets of taxa into one supertree is very simple, but we obtain performance guarantees for this simple technique by using facts about triangulated graphs. We therefore begin with the following definition:

Definition 7 A graph is triangulated if no subset of nodes induces a cycle of size four or more [9, 17].

Many NP-hard problems become solvable in polynomial time when restricted to triangulated graphs, for example MAX CLIQUE:

Lemma 8 [17] Every triangulated graph G has a simplicial elimination ordering $v_1, v_2, \dots, v_n$; this is an ordering of the nodes so that the set $X_i = \{v_j \mid j > i \text{ and } (v_i, v_j) \in E\}$ forms a clique (i.e. for all $\{v_k, v_l\} \subseteq X_i$, $(v_k, v_l) \in E$).
The maximal cliques (cliques which cannot be enlarged by the addition of any further vertices) in G are of the form $\{v_i\} \cup X_i$; hence there are at most n maximal cliques and these can be found in $O(n^2)$ time.

Given a triangulated graph, a simplicial elimination ordering for the graph can be found in $O(n^2)$ time, and from it the maximal cliques can also be found in that time.

3.1 Supertree Construction Algorithm (SCA)

We now describe the Supertree Construction Algorithm. The input is a triangulated graph G, a simplicial elimination ordering $v_1, v_2, \dots, v_n$ for G, and a collection of trees $\{T_C \mid C \text{ a maximal clique in } G\}$. Assuming that G is connected, the output will be a tree T such that every bipartition of T projects onto every $T_C$, i.e. $C(T|C) \subseteq C(T_C)$. (In Section 4, G will be a triangulation of the "threshold graph" obtained by joining any two given taxa by an edge if their distance is not greater than a given threshold q, and the collection $\{T_C\}$ will consist of trees obtained by applying some chosen tree reconstruction method to each maximal clique C of G.)

Stage I (Preprocessing) Using the given simplicial elimination ordering $v_1, v_2, \dots, v_n$ for G, compute $C_i = \{v_i\} \cup X_i$, where $X_i = \Gamma(v_i) \cap \{v_{i+1}, v_{i+2}, \dots, v_n\}$ is the set of neighbors of $v_i$ which follow it in the simplicial elimination ordering. For each $C_i$, choose a maximal clique C containing $C_i$ and let $t_i := T_C|C_i$ denote the subtree of $T_C$ induced by $C_i$.
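Lemma 8 and the sets $C_i = \{v_i\} \cup X_i$ of Stage I can be illustrated with a short Python sketch. It finds a simplicial elimination ordering by repeatedly deleting a simplicial vertex, which is a deliberately naive procedure and not the $O(n^2)$ algorithm referred to above; the example graph and both function names are assumptions made for illustration.

```python
from itertools import combinations

def simplicial_elimination_ordering(adj):
    """Repeatedly remove a vertex whose remaining neighbors form a clique.
    Raises ValueError if no such vertex exists, i.e. the graph is not triangulated."""
    remaining = {v: set(nb) for v, nb in adj.items()}
    order = []
    while remaining:
        for v in sorted(remaining):
            if all(b in remaining[a] for a, b in combinations(remaining[v], 2)):
                break  # v is simplicial in the remaining graph
        else:
            raise ValueError("graph is not triangulated")
        order.append(v)
        for u in remaining[v]:
            remaining[u].discard(v)
        del remaining[v]
    return order

def maximal_cliques(adj, order):
    """Form C_i = {v_i} plus the later neighbors X_i, then keep the maximal sets."""
    position = {v: i for i, v in enumerate(order)}
    candidates = [frozenset({v} | {u for u in adj[v] if position[u] > position[v]})
                  for v in order]
    return [c for c in candidates if not any(c < other for other in candidates)]

# Hypothetical triangulated graph: triangles {1,2,3} and {2,3,4}, plus the edge 4-5.
graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
order = simplicial_elimination_ordering(graph)
cliques = maximal_cliques(graph, order)
```

For this graph the ordering is 1, 2, 3, 4, 5 and the maximal cliques are {1, 2, 3}, {2, 3, 4} and {4, 5}; a chordless 4-cycle makes the first function raise an error.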

Figure 1: Merging two trees together, by first transforming them (through edge contractions) so that they induce the same subtrees on their shared leaves.

Stage II (Constructing the tree) For $i = n-4, n-5, \dots, 1$, compute the tree $T_i$ formed by merging $t_i$ and $T_{i+1}$, using the Strict Consensus Subtree Merger method (described below), where the starting tree is $T_{n-4} = t_{n-4}$. The output tree T is $T_1$.

3.2 Strict Consensus Subtree Merger

Given two phylogenetic trees on distinct but overlapping sets of taxa, the Strict Consensus Subtree Merger method contracts a minimum set of edges in each tree in order to make them identical on the subtrees they induce on the common intersection X. The strict consensus [10] of the induced subtrees is defined to be the maximally resolved tree that is a common contraction of the two subtrees. We will call this subtree on X the backbone. Merging the two trees together is then achieved by attaching the pieces of each tree appropriately to the different edges of the backbone. It is worth noting that the Strict Consensus Subtree Merger of two trees, while it always exists, may not be unique. In other words, it may be the case that some piece of each tree attaches onto the same edge of the backbone. We call this a collision. For example, in Fig. 1, the common intersection of the two leaf-sets is $X = \{1, 2, 3, 4\}$, and the strict consensus of the two subtrees induced by X is the 4-star. This is the backbone; it has four edges, and there is a collision on the edge of the backbone incident to leaf 4, but no collision on any other edge. Collisions are problematic, as the Strict Consensus Subtree Merger will potentially introduce false edges or lose true edges when they occur.

3.3 Theoretical Guarantees of SCA

We now show that SCA is guaranteed to recover the true tree when the input subtrees are correct and the triangulated graph is "big enough". To prove these results, we begin with some definitions.

Definition 9 Let $(T, w)$ be a leaf-labelled binary tree edge-weighted by $w : E(T) \to \mathbb{R}^+$, and let $\lambda$ be the additive distance matrix associated to T. Let e be an edge in T that is not incident to a leaf of T.
Around e, there are four subtrees, A, B, C, and D. Let a, b, c, and d be four leaves, one in each of the four subtrees A, B, C, and D, respectively, closest to e (where the distance between leaves i and j is measured as $\sum_{e \in P_{ij}} w(e)$). The quartet a, b, c, d is called a short quartet around e, and the collection of all short quartets around internal edges of T is denoted by $Q_{short}(T)$. We define d-width(T) $= \max\{d(a, b) \mid \{a, b\} \subseteq Q, Q \in Q_{short}(T)\}$.
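Definition 9 can be explored with the following Python sketch, which computes one short quartet per internal edge and the resulting width. Several choices here are assumptions rather than part of the definition: the distance from a leaf to the edge e is taken as the path weight to the nearer endpoint of e, ties are broken by leaf label, d(a, b) is evaluated on the true path weights, and the tree and function names are hypothetical.

```python
from itertools import combinations

def build_adj(edges):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def distances_from(adj, src, blocked=None):
    """Path weights from src to all nodes reachable without entering `blocked`."""
    dist = {src: 0.0}
    stack = [src]
    while stack:
        node = stack.pop()
        for nxt, w in adj[node].items():
            if nxt not in dist and nxt != blocked:
                dist[nxt] = dist[node] + w
                stack.append(nxt)
    return dist

def short_quartets(edges):
    """Map each internal edge (u, v) of a binary tree to one short quartet around it."""
    adj = build_adj(edges)
    leaves = {v for v in adj if len(adj[v]) == 1}
    quartets = {}
    for u, v, _ in edges:
        if u in leaves or v in leaves:
            continue
        quartet = []
        for end, other in ((u, v), (v, u)):
            for child in adj[end]:
                if child == other:
                    continue
                # explore only the subtree hanging off `end` through `child`
                sub = distances_from(adj, child, blocked=end)
                closest = min((x for x in sub if x in leaves),
                              key=lambda x: (sub[x], str(x)))  # ties broken by label
                quartet.append(closest)
        quartets[(u, v)] = tuple(quartet)
    return quartets

def d_width(edges):
    """Largest pairwise path distance between two leaves of a common short quartet."""
    adj = build_adj(edges)
    width = 0.0
    for quartet in short_quartets(edges).values():
        for a, b in combinations(quartet, 2):
            width = max(width, distances_from(adj, a)[b])
    return width

# Hypothetical caterpillar tree on leaves a..e with one long pendant edge to e.
tree = [("a", "u", 1.0), ("b", "u", 1.0), ("u", "v", 1.0), ("c", "v", 1.0),
        ("v", "w", 1.0), ("d", "w", 1.0), ("e", "w", 3.0)]
quartets = short_quartets(tree)
width = d_width(tree)
```

In this tree the long pendant edge to leaf e is avoided by the quartet around (u, v) but is unavoidable around (v, w), so it determines the width.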

Erdős et al. showed:

Theorem 10 [14] Let $(T, w)$ be a leaf-labelled edge-weighted tree, and let $\mathcal{T}$ be a set of leaf-labelled trees containing the true tree for each short quartet in T (though $\mathcal{T}$ can contain additional trees as well). Then T can be uniquely recovered from $\mathcal{T}$, and in polynomial time, by the "Short Quartet Consistency Algorithm".

In our Supertree Construction Algorithm (Section 3.1), we also work from a collection of subtrees (although not just from quartets) and we merge them into a supertree on the entire set of taxa. Unlike the Short Quartet Consistency Algorithm, however, we allow for incompatibilities in the quartets, so that we are able to obtain a tree from datasets where the Short Quartet Consistency Algorithm would fail to recover anything. The performance guarantees we can prove for the Supertree Construction Algorithm are strong, but depend upon making sure that the graph is "big enough", in the sense that we want to make sure we include every short quartet. We now define this formally.

Definition 11 Let T be a leaf-labelled tree and let $G_{sq}$ be the graph defined by $(i, j) \in E(G_{sq})$ if i and j are in some short quartet together.

Theorem 12 Let T be a fixed leaf-labelled tree, let G be a triangulated graph such that $G_{sq} \subseteq G$, and let $\mathcal{T} = \{T|C \mid C \text{ is a maximal clique of } G\}$. (Thus, every subtree in $\mathcal{T}$ is correct.) Let $T^*$ be the tree obtained by applying SCA to $(G, \mathcal{T})$. Then $T^* = T$.

Proof: We will show that every bipartition $\pi$ of the true tree projects onto every $T_i$ by induction on $n - i$, and hence $T^* = T$. First, note that $\pi$ projects onto each $t_i$, since $t_i$ is formed by restricting some $t \in \mathcal{T}$ to the leaf set $\{v_i\} \cup X_i$. The base case is showing that $\pi$ projects onto $T_{n-4}$, and this is trivial since $T_{n-4} = t_{n-4}$. Now assume that $\pi$ projects onto $T_i$. Let X be the intersection of the two leaf sets; thus, $X = X_{i-1}$. Hence, $\pi$ projects onto $t_{i-1}|X$ and onto $T_i|X$ (these are the restrictions of $t_{i-1}$ and $T_i$ to the leaf set X).
Let $\pi' = \pi|X$ be the restriction of $\pi$ to the set X; hence $\pi' \in C(t_{i-1}|X) \cap C(T_i|X) = C(t)$, where t is the strict consensus of $t_{i-1}|X$ and $T_i|X$ [10]. Now consider the edge $e'$ in t corresponding to $\pi'$. Suppose there is a collision on this edge $e'$; thus, $v_{i-1}$ and some subtree $t'$ on leaf set $Y \subseteq \{v_i, v_{i+1}, \dots, v_n\} - X$ both attach onto $e'$. (Note that in this case, these are true attachments in the sense that $v_{i-1}$ and the subset of leaves of $t'$ also attach to $e'$ in the true tree, because the bipartition $\pi$ is in T.) Let P be the path in T corresponding to the edge $e'$ and let its endpoints be a and b. Consider the subtree $T_0$ of T obtained by deleting all the nodes in T which are separated from a by the deletion of b, or vice-versa, and let $A_{a,b}$ be the leaves of $T_0$. In other words, $T_0$ consists of the path P and all subtrees that attach to interior nodes of P. We then show:
1. $v_{i-1} \in A_{a,b}$ and all leaves in $t'$ are also in $A_{a,b}$.
2. $G_{sq}$ restricted to $A_{a,b}$ is path connected.
3. $X \cap A_{a,b} = \emptyset$.

Proofs of (1) and (3) follow from the fact that $\pi$ is a bipartition in the true tree and that $e'$ induces the same bipartition in each of the subtrees $T_i$ and $t_{i-1}$. The proof of (2) uses this and also uses properties of short quartets that were established in [13, 14]. Now, let $P'$ be a path lying in $G_{sq} \cap A_{a,b}$ from $v_{i-1}$ to some node $y \in Y \subseteq \{v_i, v_{i+1}, \dots, v_n\} - X$. Let $y'$ be the first node from Y on $P'$. The path from $v_{i-1}$ to $y'$ lies entirely in $v_1, v_2, \dots, v_{i-1}$, so that $(v_{i-1}, y') \in E(G)$ (this follows from facts about simplicial elimination orderings, see [17]). Consequently $y' \in \Gamma(v_{i-1})$, contradicting the assumption that $X \cap Y = \emptyset$. (Note that this contradiction was obtained by assuming that a collision occurred on $e'$ and that $e'$ was the projection of a bipartition from the true tree; it follows that no collision occurs on true edges when $G_{sq} \subseteq G$. However, collisions may occur on false edges.) Therefore, $T_{i-1}$ also contains a projection of $\pi$, extending the induction step. Thus, $T_1 = SCA(\mathcal{T})$ contains $\pi$ (since $T_1$ is on the full leaf set).

We conclude with an analysis of the running time:

Theorem 13 Let G be a triangulated graph with n vertices, let $\mathcal{T}$ be the associated set of trees on each maximal clique, and assume that G and $\mathcal{T}$ are given as input. Then SCA(G) takes $O(n^2)$ time.

Proof omitted due to space constraints.

4 The Disk Covering Method (DCM)

We now describe the two-phase structure of the Disk-Covering Method. This is a very general technique that has several flexible components which can be modified to suit the preferences and specific goals of the biologist. We discuss a number of specific variants of DCM in Section 5.

4.1 Phase I

In the first phase of DCM, we iterate through the different distances q in the matrix $\{d_{ij}\}$ and compute a tree $T_q$ for each such q (or, in practice, for a specific subset of the values): Let S be the input set of sequences, and let $q \in \{d_{ij}\}$ be the selected threshold.
We construct the threshold graph $d^q$ associated with the parameter q as follows: $d^q = (S, E_q)$, where $(i, j) \in E_q$ if $d_{ij} \le q$. For simplicity, we will assume that $d^q$ is connected. (If $d^q$ has more than one component, then we can treat each component separately and the final output will be a forest.) We triangulate $d^q$, minimizing the weight of the largest edge added, thus obtaining a triangulated graph, which we again denote by $d^q$, and compute a simplicial elimination ordering $\sigma = (v_1, v_2, \dots, v_n)$ for $d^q$. We compute a tree $T_C$ for each maximal clique C of $d^q$ using any given tree reconstruction method. We apply the Supertree Construction Algorithm to $d^q$, $\sigma$ and $\{T_C\}$, and thus compute a tree $T_q$ for S. Obtaining an optimal triangulation of a graph is in general NP-hard [6], but the graphs we will consider are triangulated or close to triangulated, as Lemma 14 below suggests.
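The first step of Phase I, building the threshold graph for each candidate q and checking whether it already happens to be triangulated, can be sketched in Python as follows. The triangulation heuristic itself is not shown; the distance matrices are made-up examples, and `threshold_graph` and `is_triangulated` are hypothetical helper names.

```python
def threshold_graph(d, q):
    """Graph on taxa 0..n-1 with an edge (i, j) whenever d[i][j] <= q."""
    n = len(d)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= q:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def is_triangulated(adj):
    """Naive test: a graph is triangulated iff simplicial vertices can be removed one by one."""
    remaining = {v: set(nb) for v, nb in adj.items()}
    while remaining:
        simplicial = next(
            (v for v in remaining
             if all(b in remaining[a]
                    for a in remaining[v] for b in remaining[v] if a != b)),
            None)
        if simplicial is None:
            return False
        for u in remaining[simplicial]:
            remaining[u].discard(simplicial)
        del remaining[simplicial]
    return True

# Additive example: path distances of a 4-leaf tree (taxa 0..3), values are illustrative.
additive = [[0, 3, 3, 2.5],
            [3, 0, 4, 3.5],
            [3, 4, 0, 2.5],
            [2.5, 3.5, 2.5, 0]]
thresholds = sorted({additive[i][j] for i in range(4) for j in range(i + 1, 4)})
results = {q: is_triangulated(threshold_graph(additive, q)) for q in thresholds}

# Non-additive example: four points on a square; threshold 1 yields a chordless 4-cycle.
square = [[0, 1, 2, 1],
          [1, 0, 1, 2],
          [2, 1, 0, 1],
          [1, 2, 1, 0]]
square_ok = is_triangulated(threshold_graph(square, 1))
```

Every threshold graph of the additive example passes the test, while the square example fails at q = 1 and would need at least one added edge.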

Lemma 14 If d is an additive matrix, then the threshold graph $d^q$ is triangulated.

We omit the proof due to space constraints, but note that it depends upon the characterization of the class of triangulated graphs as the class of intersection graphs of subtrees of a tree [9, 17]. Although this result does not provide any guarantee that $d^q$ will be triangulated unless d is additive, the proof indicates that $d^q$ may in fact be triangulated if d is close to additive. Because corrected distance transformations [44] exist for the general Markov model [41], d converges to $\lambda$ in probability as the sequence length increases; consequently, for long enough sequences d is close to additive. Indeed, restricting the question to the submatrix of d such that $d_{ij} \le q$ increases the probability that $d^q$ will be triangulated, or at least close to triangulated. This is upheld experimentally, as our experiments show that $d^q$ is often triangulated or close to triangulated, so that a greedy heuristic for triangulating $d^q$ rarely adds more than a few edges.

4.2 Phase II

We take the trees $T_q$ we have computed in Phase I and infer a consensus of these trees.

5 Specific Variants of DCM

In the following we describe details of how we implement the specific phases of DCM. We will use the prefix DCM- to indicate the DCM-boosted version of a tree reconstruction method. We will also describe the specific implementation we have selected for DCM-MP, where MP stands for Maximum Parsimony. We can only provide theoretical guarantees for DCM-Buneman, but the empirical performance results we obtain for DCM-NJ (Neighbor-Joining) and DCM-MP are very strong, and clearly superior to DCM-Buneman.

5.1 DCM-Buneman

Given a set S of taxa and a distance matrix d, the Buneman index of a quartet bipartitioning $\{\{a_1, a_2\}, \{b_1, b_2\}\}$ is defined as $\beta_{a_1,a_2|b_1,b_2} = \frac{1}{2}\left(\min\{d(a_1, b_1) + d(a_2, b_2),\; d(a_1, b_2) + d(a_2, b_1)\} - (d(a_1, a_2) + d(b_1, b_2))\right)$.
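As a worked illustration of the Buneman index, the Python sketch below evaluates the formula on a small additive matrix and picks the quartet split with positive index. The matrix values (path distances of a 4-leaf tree whose internal edge has weight 0.5) and the function names are assumptions made for this example.

```python
def buneman_index(d, a1, a2, b1, b2):
    """Buneman index of the quartet split {a1, a2} | {b1, b2} under distances d."""
    cross = min(d[a1][b1] + d[a2][b2], d[a1][b2] + d[a2][b1])
    return 0.5 * (cross - (d[a1][a2] + d[b1][b2]))

def quartet_topology(d, quartet):
    """Return the split with positive Buneman index, or None if the quartet is unresolved.
    At most one of the three splits can have a strictly positive index."""
    w, x, y, z = quartet
    for split in (((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))):
        (a1, a2), (b1, b2) = split
        if buneman_index(d, a1, a2, b1, b2) > 0:
            return split
    return None

# Illustrative additive matrix for the tree ((a,b),(c,d)) with internal edge weight 0.5.
d = {
    "a": {"a": 0.0, "b": 3.0, "c": 3.0, "d": 2.5},
    "b": {"a": 3.0, "b": 0.0, "c": 4.0, "d": 3.5},
    "c": {"a": 3.0, "b": 4.0, "c": 0.0, "d": 2.5},
    "d": {"a": 2.5, "b": 3.5, "c": 2.5, "d": 0.0},
}
index_ab_cd = buneman_index(d, "a", "b", "c", "d")
index_ac_bd = buneman_index(d, "a", "c", "b", "d")
topology = quartet_topology(d, ("a", "b", "c", "d"))
```

On exact additive distances the index of the true split equals the weight of the internal edge separating the two pairs (0.5 here), and the index of either competing split is negative.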
The Buneman Tree is defined by the set of all bipartitions $\{A, B\}$ of S with

$$\beta_{A|B} := \min\{\beta_{a_1 a_2 | b_1 b_2} : a_1, a_2 \in A,\; b_1, b_2 \in B\} > 0$$

[4]. In other words, the Buneman Tree method uses the Buneman index to select a tree topology on every quartet of taxa and computes the maximally resolved tree that induces precisely those selected topologies on the set of quartets [8, 4, 5]. The Buneman Tree method, like Neighbor-Joining, is guaranteed to be accurate when $\Delta < x/2$, where x is the minimum edge weight in the tree and $\Delta$ is the maximum error in the estimates of the interleaf distances (see Theorem 4), and thus is statistically consistent in the Jukes-Cantor model of sequence evolution. In practice, however, the Buneman Tree method performs much worse than Neighbor-Joining, most likely because Neighbor-Joining examines all the sequences at once, rather than just four at a time, and hence is able to exploit correlations between sequences better than the Buneman Tree method.

The implementation of DCM-Buneman we have developed exploits the following observation:

Observation 15 Let $S \subseteq S'$ be two sets of taxa, let T be the Buneman Tree on S, and let $T'$ be the Buneman Tree on $S'$. Let $T'|_S$ be the restriction of $T'$ to the leaves in S. Then $T'|_S$ is a contraction of the tree T.

This follows directly from the description of the Buneman Tree method. Now suppose the sequences are long enough that there is a q for which every edge (i, j) in the triangulation of the threshold graph G for this threshold q has estimation error $\Delta_{ij} < x/2$, and $q \ge$ d-width(T). Then "SCA-Buneman(G)" (i.e., SCA applied to G using the Buneman method to compute subtrees) returns the true tree, T. Consequently, the implementation of Phase II of DCM-Buneman has to be designed so as to discover this q. By the observation above, for all $q' \ge q$, SCA-Buneman($G_{q'}$) would either be identical to SCA-Buneman($G_q$) or would be a contraction of SCA-Buneman($G_q$). This suggests the following implementation of Phase II: select the tree $T_q$ which is maximally resolved; if more than one such $T_q$ exists, then select the one associated with the largest q.
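The Buneman index defined in Section 5.1 can be illustrated with a short sketch. The function names, the list-of-lists matrix layout, and the choice to let $a_1 = a_2$ or $b_1 = b_2$ coincide when minimizing over a bipartition are assumptions made for illustration; this is not the implementation used in the experiments.

```python
from itertools import combinations_with_replacement


def buneman_index(d, a1, a2, b1, b2):
    """Buneman index of the quartet split {a1, a2} | {b1, b2}."""
    cross = min(d[a1][b1] + d[a2][b2], d[a1][b2] + d[a2][b1])
    within = d[a1][a2] + d[b1][b2]
    return 0.5 * (cross - within)


def quartet_split(d, quartet):
    """Return the split of the quartet with positive index, or None.

    At most one of the three possible splits can have a positive index,
    because a positive index requires the within-pair sum to be strictly
    smaller than both cross sums.
    """
    w, x, y, z = quartet
    candidates = (((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y)))
    for (a1, a2), (b1, b2) in candidates:
        if buneman_index(d, a1, a2, b1, b2) > 0:
            return (a1, a2), (b1, b2)
    return None


def bipartition_index(d, A, B):
    """Minimum quartet index over a1, a2 in A and b1, b2 in B.

    Assumption: repeated elements are allowed, so that bipartitions
    with a one-element side are covered.
    """
    return min(
        buneman_index(d, a1, a2, b1, b2)
        for a1, a2 in combinations_with_replacement(sorted(A), 2)
        for b1, b2 in combinations_with_replacement(sorted(B), 2)
    )


# Additive matrix of the tree ((0,1),(2,3)): pendant edges of weight 1,
# internal edge of weight 2.
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
```

On this additive matrix the index of the split {0, 1} | {2, 3} equals the weight of the internal edge, while the two competing splits have negative index, so only the true bipartition would be kept in the Buneman Tree.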

5.2 Performance Guarantees for DCM-Buneman

We now consider the conditions that guarantee that DCM-Buneman will recover the true tree.

Theorem 16 Let T be a model tree under the general Markov model. Let $\epsilon > 0$ be given. Then the sequence length that suffices for accuracy with probability at least $1 - \epsilon$ of DCM-Buneman grows as

$$C \log n \; e^{O(\lambda\text{-width}(T))},$$

where n is the number of taxa in T, x is the smallest expected number of mutations on any edge in the tree for a random site, and the constant C depends only upon x and $\epsilon$. Therefore, polynomial length sequences always suffice for accuracy with high probability, and for almost all trees under the uniform and Yule-Harding distributions, polylogarithmic length sequences suffice.

Proof omitted due to space constraints.

5.3 DCM-NJ

Our experimental studies indicate that for almost every q, the tree $T_q$ has a very low false positive rate, typically close to 0. Consequently, almost all $T_q$ are either contractions of the true tree or close to being contractions of the true tree. (This is not really surprising, since we designed the merger using the strict consensus technique, which collapses edges that are not supported by every subtree.) This suggests the following implementation of Phase II: take all the trees $T_q$, and compute the asymmetric median tree of these trees. The asymmetric median tree is a consensus method which takes a set of leaf-labelled trees $\mathcal{T} = \{T_1, T_2, \ldots, T_p\}$ and computes a tree T such that $C(T) \subseteq \bigcup_i C(T_i)$, where $C(T)$ denotes the set of bipartitions induced by the edges of T, and such that if each $c \in C(T)$ is weighted by the number $w(c)$ of trees $T_i$ which contain c, then $w(T) = \sum_{c \in C(T)} w(c)$ is maximum. The idea behind this consensus method is that it recovers as many of the true edges as possible, under conditions where the trees in $\mathcal{T}$ have very low false positive rates.

5.4 DCM-MP and DCM-ML

DCM-boosting Maximum Parsimony (MP) and Maximum Likelihood (ML) presents a different challenge from DCM-boosting distance methods. Here, first experiments seem to indicate that as q increases, the trees $T_q$ become increasingly accurate estimations of the true tree, and at the same time, the size of the subsets increases. For example, in Figure 2, we show the result of a DCM-MP analysis of one dataset of 35 taxa. For each threshold q large enough to create a connected threshold graph, we computed the tree $T_q$ and recorded its false positive and false negative rates, and the size of the largest subproblem analyzed using maximum parsimony in constructing $T_q$. We see that as q increases, the error rates decline, but the maximum dataset size analyzed by parsimony increases. Thus, for optimal accuracy, the best approach would be to take the largest threshold that can be handled, while for optimal running time, the best approach would be to take the first threshold that makes a connected graph. Note that the reduction in running time that is afforded by even a small reduction in dataset size could be quite substantial, as these two methods (MP and ML) are extremely computationally intensive. Furthermore, DCM can easily be parallelized, with each subproblem handled by a different processor.

Figure 2: Here we plot the number of false positives (FP), false negatives (FN) and maximum problem size (MaxSize) depending on the threshold used, for DCM-MP applied to a set of DNA sequences of length 2000 generated on a 35 taxon model Jukes-Cantor tree with moderate rates of evolution. [Axes: Threshold Value (horizontal, 0.8 to 2.2), Number of Taxa (vertical, 0 to 35); series: MaxSize, FP, FN.]

How much accuracy is lost in combining trees on subproblems? Our experiments have looked at the version of DCM-MP that uses the first threshold value that yields a connected threshold graph; this is the "fastest" but least accurate version of DCM-MP. We observed somewhat increased false negative rates but decreased false positive rates. Note that the result of using thresholds that are perhaps smaller than optimal is a tree which is close to being a contraction of the true tree (or may in fact be one), and that is missing just a few edges of resolution. This simplifies the problem greatly. Rather than solving Maximum Parsimony from scratch, this new tree can be refined to obtain the best parsimony score possible, using either brute-force techniques or more sophisticated algorithms (see [7]).

The degree to which this approach will reduce the running time will depend upon the degree to which we reduce the dataset size. How much reduction in dataset size do we realistically obtain for biological datasets? Our experiments are based upon biologically derived model trees, and these demonstrate significant reductions in dataset size. We have also examined a number of datasets of biomolecular sequences directly, to see the size reduction we could obtain. For example, we examined a dataset consisting of 254 H3 hemagglutinin sequences from type A influenza, each of length 987. In this case, using the smallest threshold value that gives rise to a connected threshold graph, DCM produced 104 subproblems, of maximal size 102 and median size 47. We also looked at a green plant dataset [39] consisting of 232 taxa, obtaining a breakdown into 70 subproblems of maximal size 153 and median size 140. This indicates that the reduction in dataset size can be significant, although some biological datasets (for example, those adhering strongly to a molecular clock) will only give a small reduction.

6 DCM-boosting for Other Problems

It is possible that DCM-boosting can also be used to improve solutions to the extremely difficult general tree alignment problem [20, 18, 19], where not only is the tree unknown, but the alignment of the biomolecular sequences is not given. The basic approach here is the same as before; if no multiple sequence alignment is given, use just pairwise alignments instead to indicate the division into subproblems. Once the division into subproblems has been produced, compute good multiple sequence alignments on the subproblems (this is easier than when analyzing the entire set of sequences, because the sequences are more closely related, and there are fewer of them), along with the trees for the subproblems. Then merge the trees, as before. Once the supertree is computed, redo the multiple sequence alignment, using the given tree.

In some biological settings, such as in the presence of horizontal transfer, a tree is not a satisfactory model of the evolutionary history, and then "network" models are sometimes considered. We have used DCM in conjunction with one such method, the split decomposition [4, 25], and first experimental results indicate a significant improvement in performance, greatly reducing the false negative rate.

7 Experimental Study

Experimentally, we have compared the performance of Neighbor-Joining, the Buneman method and the Single Pivot method with the performance of their DCM-boosted versions.

7.1 Model Trees

The model trees that we considered have topologies and rates of evolution based upon reconstructions of biological datasets (two subtrees of the 500 rbcL dataset [32], and the African Eve dataset [29]). The rates of evolution on these model trees were then scaled both up and down, in order to explore the effects of divergence on the performance of different methods. We used this larger set of model trees to generate about 1000 different sets of DNA sequences using the ecat simulator [31] and the Jukes-Cantor model of evolution [27]. We then computed Jukes-Cantor distance matrices for each dataset, and used these matrices as input to six different distance-based methods: Neighbor-Joining (NJ), the Single Pivot method, the Buneman Tree method, and their DCM-boosted versions.
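As a minimal sketch of the distance computation mentioned in Section 7.1, the standard Jukes-Cantor correction maps the observed fraction p of differing sites between two aligned sequences to $-\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$. The function names are hypothetical, and returning infinity for saturated pairs ($p \ge 3/4$) is an assumption made here for illustration; the text does not say how such pairs were handled.

```python
import math


def jc_distance(s, t):
    """Jukes-Cantor corrected distance between two aligned DNA sequences."""
    if len(s) != len(t) or len(s) == 0:
        raise ValueError("sequences must be aligned and non-empty")
    # Observed fraction of sites at which the two sequences differ.
    p = sum(a != b for a, b in zip(s, t)) / len(s)
    if p == 0:
        return 0.0
    if p >= 0.75:
        # Assumption: saturated pairs are reported as infinitely distant.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_matrix(seqs):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(seqs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = jc_distance(seqs[i], seqs[j])
    return m
```

A matrix produced this way could serve as the input d to any of the distance methods compared in this section; with short sequences the estimates are noisy, which is the regime the DCM-boosted methods are meant to address.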
7.2 Performance Evaluation Criteria

We explored performance with respect to accuracy of the topology recovered by each method, by comparing the reconstructed tree to the model tree. Recall that this accuracy is quantified by examining false negatives (FN) and false positives (FP); see Definition 3. Here, we examine their error rates, which we define as follows. We only use binary trees, so that each n-leaf tree has $n - 3$ internal edges. The FN rate is the ratio $\mathrm{FN}/(n-3)$, expressed as a percentage, and similarly the FP rate is the ratio $\mathrm{FP}/(n-3)$, expressed as a percentage. The Robinson-Foulds (RF) error is the average of the FN and FP rates, i.e., $\mathrm{RF} = \tfrac{1}{2}(\mathrm{FN} + \mathrm{FP})$.

Recall that sequence lengths beyond 5000 nucleotides are considered unusually long for tree reconstruction, and that in general convergence to the true tree or acceptable error rates within 3000 nucleotides is thus the critical test of performance. Also, for systematic biology purposes, error rates below 5% can probably be tolerated, though of course this will depend upon the tree. Hence, we examined these experiments with the following specific questions in mind:

- At what sequence length do we get below 5% RF error rate?
- At what sequence length (if any) do we recover the true tree reliably?
- How well do the different methods do when restricted to typical length sequences (between 200 and 1200 nucleotides)?

Since we are interested in how DCM-boosting affects performance, we will specifically address how DCM-boosted methods differ from their base methods with respect to these three questions.

7.3 Summary of Experimental Results

We observed the following basic trends:

- DCM-boosting the three distance methods considered reduced the incidence of errors at all sequence lengths, except when the sequences are long enough to obtain accurate estimations of the true tree.
- Reductions in error rates were particularly high when limited to typical length sequences (between 200 and 1200 nucleotides).
- The average reduction in error rates over all datasets in which RF error rates for the base method were in the 5%-95% range was high: DCM-boosted methods had on average only 30% of the RF error rates of their corresponding base methods.
- The average reduction in error rates over the subset of datasets for which a base method had moderate to high RF error rates (i.e., in the range 30%-65%) was especially large; for example, on hard datasets, DCM-NJ had only 6% of the RF error of NJ.
- The reduction in the sequence length needed to obtain acceptable levels (less than 5%) of RF error was also substantial, especially under conditions of high divergence.
- The base methods always had the following ranking with respect to their RF error rates: Neighbor-Joining outperformed the Single Pivot method, and the Single Pivot method outperformed Buneman, and the more divergent the dataset, the greater the distinction in performance between every pair. Furthermore, their DCM-boosted versions followed the same relative ordering. Thus, DCM-NJ was by far the best performing of all these methods.
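The error rates defined in Section 7.2 can be sketched in a few lines. The representation is an assumption made for illustration: each tree is given as a collection of its nontrivial bipartitions, each bipartition as one of its two sides, and the function names are hypothetical.

```python
def canonical(side, leaves):
    """Represent a bipartition by the side that omits the smallest leaf."""
    side = frozenset(side)
    leaves = frozenset(leaves)
    reference = min(leaves)
    return leaves - side if reference in side else side


def error_rates(model_splits, estimated_splits, leaves):
    """Return (FN rate, FP rate, RF error) as percentages.

    The denominator n - 3 is the number of internal edges of a binary
    tree on n leaves, as in Section 7.2.
    """
    leaves = frozenset(leaves)
    n = len(leaves)
    if n < 4:
        raise ValueError("need at least 4 leaves")
    model = {canonical(s, leaves) for s in model_splits}
    estimate = {canonical(s, leaves) for s in estimated_splits}
    internal_edges = n - 3
    # False negatives: model edges missing from the estimate.
    fn_rate = 100.0 * len(model - estimate) / internal_edges
    # False positives: estimated edges absent from the model.
    fp_rate = 100.0 * len(estimate - model) / internal_edges
    return fn_rate, fp_rate, 0.5 * (fn_rate + fp_rate)


leaves = [1, 2, 3, 4, 5]
model = [{1, 2}, {4, 5}]          # model tree ((1,2),3,(4,5))
wrong_edge = [{1, 2}, {3, 4}]     # one true edge, one false edge
contraction = [{3, 4, 5}]         # same split as {1, 2}, other edge collapsed
```

For the contraction, the FP rate is zero while the FN rate is not, which is the pattern reported for the trees $T_q$: few false positives, with the remaining error due to unresolved edges.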

7.4 Results on the 135 taxon tree

We now focus on the results of a set of experiments on one tree, which is a 135 taxon model tree inferred for the African Eve dataset [29], with high rates of evolution (maximum edge substitution probability was set to 0.48). This model tree is a good example of how DCM-boosting affects performance when the tree is a difficult one to obtain, due to the combination of large numbers of taxa and high divergence. This study is based on approximately 100 different simulated datasets. In Figures 3, 4 and 5 we plot the false negative (FN) error rates of each base method and its DCM-boosted version, for different sequence lengths in the range 100-12800. Each point plotted indicates the average over 5 independent experiments. Here are some of the basic observations about the performance of these six methods on this tree:

- DCM-NJ significantly outperformed all other methods.
- Even the worst DCM-boosted method (DCM-Buneman) outperformed the best base method (Neighbor-Joining).
- Typical reductions in RF error rates for typical length sequences (200-1200 nucleotides) were very high; for example, DCM-NJ had only 4% of the RF error rate of NJ for such datasets.
- At every sequence length, all DCM-boosted versions have substantially reduced errors over all base methods on this tree.

Thus, the theoretical advantages we have established show up in the experimental performance study. We now discuss the comparisons between each method and its DCM-boosted version.

7.5 Buneman vs. DCM-Buneman

The Buneman method returns trees with no false positives and close to 100% false negatives; see Fig. 3. In other words, it essentially returns a star for the datasets generated under these conditions. DCM-boosting reduces FN and RF error rates at all sequence lengths, although it very slightly elevates the FP error rates from 0% to about 3% (not shown here). Moreover, DCM-Buneman achieves RF error rates below 5% at 7000 nucleotides, whereas Buneman still has 98% error rates even at 12,800 nucleotides.

Figure 3: Buneman vs DCM-Buneman, false negatives. [Axes: Sequence Length (horizontal), False Negative Rate (vertical, 0 to 100); series: Buneman, DCM-B.]

7.6 Single Pivot vs. DCM-Single Pivot

The false negative rates for the Single Pivot method and DCM-boosted Single Pivot are reported in Fig. 4. The false positive rate for the Single Pivot method is approximately equal to the false negative rate. The false positive rate of the DCM-boosted version starts at about 10% for the shortest sequence length, quickly drops to slightly above zero, and reaches zero at around sequence length 5000 (not shown here). Thus, DCM-boosting substantially reduces all error rates (FN, FP, and RF) at all sequence lengths we examined. Also, note that DCM-Single Pivot attains acceptably low (below 5%) RF error rates at 1000 nucleotides (at which point Single Pivot has an RF error rate of 80%), and that Single Pivot does not achieve RF error rates below 30% at any sequence length we examined. DCM-Single Pivot obtains the true tree at sequence lengths beyond 5000, demonstrating that exact accuracy in topology estimation can be recovered, even under high divergence.

Figure 4: Single Pivot vs DCM-Single Pivot, false negatives. [Axes: Sequence Length (horizontal), False Negative Rate (vertical, 0 to 100); series: Single Pivot, DCM-S. P.]

7.7 NJ vs. DCM-NJ

The false negative rates for Neighbor-Joining (NJ) and DCM-NJ are summarized in Fig. 5. For Neighbor-Joining, FN = FP. For DCM-Neighbor-Joining, the false positive rate starts at about 10% and then reaches 0% before sequence length 1000 (not shown here). DCM-NJ obtains an acceptable error rate (below 5%) at just 250 nucleotides, at which point NJ has more than 60% RF errors. Also, NJ does not achieve acceptably low RF error rates until 8000 nucleotides. Finally, DCM-NJ recovers the true tree at all sequence lengths beyond 900, while NJ does not recover the true tree even from sequences of length 12,800.

Figure 5: Neighbor-Joining (NJ) vs DCM-NJ, false negatives. [Axes: Sequence Length (horizontal), False Negative Rate (vertical, 0 to 100); series: NJ, DCM-NJ.]

References

[1] R. Agarwala, V. Bafna, M. Farach, B. Narayanan, M. Paterson, and M. Thorup. On the approximability of numerical taxonomy: fitting distances by tree metrics. Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 365-372, 1996.
[2] R. Ambainis, R. Desper, M. Farach, and S. Kannan. Nearly tight bounds on the learnability of evolution. Proceedings of the 38th Annual IEEE Symposium on the Foundations of Computer Science, pages 524-533, 1997.
[3] K. Atteson. The performance of neighbor-joining algorithms of phylogeny reconstruction. In T. Jiang and D. Lee, editors, Lecture Notes in Computer Science, 1276, pages 101-110. Springer-Verlag, 1997. COCOON '97.
[4] H.-J. Bandelt and A. Dress. A canonical decomposition theory for metrics on a finite set. Advances in Mathematics, 92:47-105, 1992.
[5] V. Berry and O. Gascuel. Inferring evolutionary trees with strong combinatorial evidence. In T. Jiang and D. Lee, editors, Lecture Notes in Computer Science, 1276, pages 111-123. Springer-Verlag, 1997. COCOON '97.
[6] H. Bodlaender, M. Fellows, and T. Warnow. Two strikes against perfect phylogeny. In Lecture Notes in Computer Science, 623, pages 273-283. Springer-Verlag, 1992. Proceedings, International Colloquium on Automata, Languages and Programming.
[7] M. Bonet, M. Steel, T. Warnow, and S. Yooseph. Better methods for solving parsimony and compatibility. Proceedings, RECOMB 1998.
[8] P. Buneman. The recovery of trees from measures of dissimilarity. In Mathematics in the Archaeological and Historical Sciences, pages 387-395. Edinburgh University Press, 1971.
[9] P. Buneman. A characterization of rigid circuit graphs. Discrete Mathematics, 9:205-212, 1974.
[10] W. Day. Optimal algorithms for comparing trees with labelled leaves. Journal of Classification, 2:7-28, 1995.
[11] T. Dobzhansky. Nothing in biology makes sense except in the light of evolution. American Biology Teacher, pages 125-129, March 1993.
[12] P. L. Erdős, M. A. Steel, L. A. Székely, and T. Warnow. Constructing big trees from short sequences. In G. Goos, J. Hartmanis, and J. van Leeuwen, editors, Lecture Notes in Computer Science, volume 1256. 1997. ICALP '97.
[13] P. L. Erdős, M. A. Steel, L. A. Székely, and T. Warnow. A few logs suffice to build (almost) all trees I. DIMACS Technical Report 97-71, submitted to: Random Structures and Algorithms, 1997.

[14] P. L. Erdős, M. A. Steel, L. A. Székely, and T. Warnow. A few logs suffice to build (almost) all trees II. DIMACS Technical Report 97-72, submitted to Theor. Comp. Sci., 1997.
[15] M. Farach and S. Kannan. Efficient algorithms for inverting evolution. Proc. of the 28th Ann. ACM Symposium on the Theory of Computing, 1996.
[16] J. Felsenstein. Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet., 22:521-565, 1988.
[17] M. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press Inc, 1980.
[18] D. Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. Computer Science Division, UC Davis, Technical Report CSE 914, 1991.
[19] D. Gusfield and L. Wang. New uses for uniform lifted alignments. Computer Science Division, UC Davis, Technical Report CSE 96-4, 1996.
[20] J. Hein. A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol. Biol. Evol., 6:649-668, 1989.
[21] D. Hillis. Inferring complex phylogenies. Nature, 383:130-131, 12 September 1996.
[22] D. Hillis, J. Huelsenbeck, and C. Cunningham. Application and accuracy of molecular phylogenies. Science, 264:671-677, 1994.
[23] J. Huelsenbeck. Performance of phylogenetic methods in simulation. Syst. Biol., 44:17-48, 1995.
[24] J. Huelsenbeck and D. Hillis. Success of phylogenetic methods in the four-taxon case. Syst. Biol., 42:247-264, 1993.
[25] D. Huson. SplitsTree: A program for analyzing and visualizing evolutionary data. Bioinformatics, 14(10):68-73, 1998.
[26] D. Huson, S. Nettles, K. Rice, and T. Warnow. Hybrid tree reconstruction methods, 1998. Workshop on Algorithm Engineering, Saarbrücken.
[27] T. Jukes and C. Cantor. Evolution of protein molecules. In H. Munro, editor, Mammalian Protein Metabolism, pages 21-132. Academic Press, 1969.
[28] M. Kuhner and J. Felsenstein. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol., 11:459-468, 1994.
[29] D. R. Maddison, M. Ruvolo, and D. L. Swofford. Geographic origins of human mitochondrial DNA: phylogenetic evidence from control region sequences. Systematic Zoology, 41:111-124, 1992.
[30] A. Purvis and D. Quicke. Trends in Ecology and Evolution, 12(2):49-50, 1997.
[31] K. Rice. ECAT, an evolution simulator, 1997. http://www.cis.upenn.edu/krice.
[32] K. Rice, M. Donoghue, and R. Olmstead. Analyzing large datasets: rbcL 500 revisited. Systematic Biology, 1997.
[33] K. Rice and T. Warnow. Parsimony is hard to beat! In T. Jiang and D. Lee, editors, Lecture Notes in Computer Science, 1276, pages 124-133. Springer-Verlag, 1997. COCOON '97.
[34] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232:584-599, 1993.
[35] N. Saitou and T. Imanishi. Relative efficiencies of the Fitch-Margoliash, maximum parsimony, maximum likelihood, minimum evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol., 6:514-525, 1989.
[36] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406-425, 1987.
[37] M. Schöniger and A. von Haeseler. Performance of maximum likelihood, neighbor-joining, and maximum parsimony methods when sequence sites are not independent. Syst. Biol., 44(4):533-547, 1995.
[38] M. L. Sogin, G. Hinkle, and D. D. Leipe. Universal tree of life. Nature, 362:795, 1993.
[39] D. Soltis, P. S. Soltis, D. L. Nickrent, L. A. Johnson, W. J. Hahn, S. B. Hoot, J. A. Sweere, R. K. Kuzoff, K. A. Kron, M. W. Chase, S. M. Swensen, E. A. Zimmer, S.-M. Chaw, L. J. Gillespie, W. J. Kress, and K. J. Sytsma. Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Annals of the Missouri Botanical Garden, 84:1-49, 1997.
[40] J. Sourdis and M. Nei. Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. Mol. Biol. Evol., 5(3):293-311, 1996.
[41] M. Steel. Recovering a tree from the leaf colourations it generates under a Markov model. Applied Math Letters, 7(2):19-24, 1994.
[42] M. A. Steel, L. A. Székely, and M. D. Hendy. Reconstructing trees when sequence sites evolve at variable rate. J. Computational Biology, 1(2):153-163, 1994.
[43] K. Strimmer and A. von Haeseler. Accuracy of neighbor-joining for n-taxon trees. Syst. Biol., 45(4):516-523, 1996.
[44] T. Warnow. Some combinatorial problems in phylogenetics. To appear in the proceedings of the International Colloquium on Combinatorics and Graph Theory, Balatonlelle, Hungary, July 15-20, eds. A. Gyárfás, L. Lovász, L. A. Székely, in a forthcoming volume of Bolyai Society Mathematical Studies, 1996.
[45] M. Waterman, T. Smith, and W. Beyer. Additive evolutionary trees. Journal Theoretical Biol., 64:199-213, 1977.