Proceedings of "Algorithms and Experiments" (ALEX98), Trento, Italy, Feb 9-11, 1998

R. Battiti and A. A. Bertossi (Eds.) pp. 62-75

The Disk-Covering Method for Tree Reconstruction

Daniel Huson

Program in Applied and Computational Mathematics Princeton University, Princeton NJ 08544-1000 e-mail: [email protected]

Scott Nettles

Department of Computer and Information Science University of Pennsylvania, Philadelphia PA 19104 e-mail: [email protected]

Laxmi Parida

Courant Institute of Mathematical Sciences New York University, New York City NY 10012 e-mail: [email protected]

Tandy Warnow

Department of Computer and Information Science University of Pennsylvania, Philadelphia PA 19104 e-mail: [email protected]

and Shibu Yooseph

DIMACS, Rutgers University, Piscataway, NJ 08854 e-mail: [email protected]

Abstract

Evolutionary tree reconstruction is a very important step in many biological research problems, and yet is extremely difficult for a variety of computational, statistical, and scientific reasons. In particular, the reconstruction of very large trees containing significant amounts of divergence is especially challenging. We present in this paper a new tree reconstruction method, which we call the Disk-Covering Method (DCM), which can be used to recover accurate estimates of the evolutionary tree for otherwise intractable datasets. DCM obtains a decomposition of the input dataset into small overlapping sets of closely related taxa, reconstructs trees on these subsets (using a "base" phylogenetic method of choice), and then combines the subtrees into one tree on the entire set of taxa. Because the subproblems analyzed by DCM are smaller, computationally expensive methods such as maximum likelihood estimation can be used without incurring too much cost. At the same time, because the taxa within each subset are closely related, even very simple methods (such as neighbor-joining) are much more likely to be highly accurate. The result is that DCM-boosted methods are typically faster and more accurate than "naive" use of the same method. In this paper we describe the basic ideas and techniques in DCM, and demonstrate its advantages experimentally by simulating sequence evolution on a variety of trees.


1 Introduction

In biology, the accurate inference of evolutionary history has tremendous theoretical importance, letting us answer questions such as: "Are humans more closely related to chimps or to gorillas?" and "What is the origin of life?" Somewhat more pragmatically, understanding evolutionary history can help us, for example, design better insecticides and antibiotics. Many critical biological problems require tree reconstruction from large datasets, but in many cases these datasets are not only large, but also highly divergent. In other words, these datasets contain very distantly related sequences, or, more generally, pairs of highly dissimilar sequences, such as may occur when genes mutate quickly due to changes in environment or other evolutionary pressures. For example, reconstructing the origin (or, as is more likely, the multiple origins) of photoreceptor molecules requires high-divergence datasets, as shown by Rice [35], as does reconstructing the tree of cholinesterases, neuroenzymes that are common targets of pesticides [12]. Understanding the origins of viruses that have been transferred to humans from other animals (for example, HIV) also involves hundreds of divergent taxa. Plant phylogeny (and specifically, the evolution of flowering plants) is one of the most hotly contested phylogenetic problems in biology today, and reconstruction of the evolution of plants involved analyzing DNA sequences for more than 500 plants [11, 36]. Inference of the geographic origin of humans requires at least 300 mitochondrial DNA sequences [31], while the inference of the "tree of all life" requires sequences for at least 1200 taxa [40]. Since the leaf-labeled topology of the evolutionary tree indicates the order of speciation events (or, in the case of gene trees, the order of gene duplication events), the central technical problem in understanding evolutionary history is reconstructing the topology of the phylogenetic tree.
Inferring the leaf-labeled topology of the evolutionary tree is, unfortunately, enormously difficult. Computational difficulties arise because essentially all optimization problems involving tree reconstruction are NP-hard [1, 5, 14, 20, 23, 41]. (Some polynomial-time approximations do exist, for example, a 3-approximation algorithm for the L1-nearest tree [1] and the standard trick of 2-approximating the maximum parsimony tree by constructing a minimum spanning tree, but they do not seem to be sufficiently accurate with respect to topology estimation.) Indeed, most of the techniques commonly used by biologists, for example, maximum parsimony (a sequence-based method aimed at minimizing the total amount of evolution), maximum likelihood estimation, and weighted least-squares, are not likely to be solvable in polynomial time. Exact, exponential-time solution of these problems is only possible for a small number (typically fewer than twenty) of taxa. For larger datasets, these problems are typically "solved" by using heuristic search techniques, which really only find local optima. The accuracy that these heuristic approaches achieve on large trees is generally not known, although our preliminary experiments on maximum parsimony suggest that they may sometimes work well. However, even if the heuristics are accurate, they can require extremely long computations. For example, one parsimony analysis by Rice and his colleagues [36] of a large (500-taxon) seed plant dataset took almost a year to compute. Distance-based methods, such as neighbor-joining, which have computationally feasible, polynomial-time solutions, would seem to offer a solution. Unfortunately, these methods run into a statistical difficulty, especially for divergent datasets.
Recent studies have established that while most distance-based methods work well for small trees, the sequence lengths that they require for highly accurate topology estimation on significantly divergent datasets can be beyond what is ever likely to be available [2, 19, 37, 38]. Heuristics for the NP-hard maximum parsimony problem do not seem to suffer the same degradation on large divergent trees [37], but they have other problems. Statistical analysis shows that, even with infinite-length sequences, there are some datasets for which parsimony will fail to recover the correct tree. In other words, maximum parsimony is inconsistent on some trees, even with iid site evolution [21] (by comparison, almost all distance-based methods are consistent on all trees with iid site evolution [19]). Unfortunately, the inconsistency of maximum parsimony is not just a theoretical phenomenon; it occurs for real trees of biological interest as well. Research into the conditions under which maximum parsimony will be consistent has so far been inconclusive [21, 27, 30, 37, 44]. Nevertheless, maximum parsimony receives a great deal of support from systematic


biologists. In general, standard phylogenetic methods seem to fall into two categories: fast methods (typically distance-based), which suffer in performance when there is significant divergence in the dataset, and exponential-time methods (typically sequence-based), which handle high divergence better. However, any given dataset has its own sequence length, so that even if there is not too much divergence in the dataset, there may be too much divergence given the sequence length in that dataset; consequently, some low-divergence datasets will require exponential-time algorithms if accurate (or even approximately accurate) tree reconstruction is desired. Generally, there is a tradeoff between computational complexity and sequence length requirement. Unfortunately, despite increasing sequencing efforts, most phylogenetic datasets do not contain sequences longer than about 2000 nucleotides, and often a significant proportion of the characters are constant across the dataset. Indeed, even in the long term, it is unclear whether we will have sequences of more than about 10K nucleotides for many large-scale tree reconstruction efforts. This is because phylogenetic reconstruction methods require homologous sequences (i.e. sequences that have a common ancestor), and it is more difficult to get long homologous sequences between distantly related taxa than between closely related taxa [25]. Consequently, what is needed are fast algorithms (at least, algorithms which are fast enough to be attractive to biologists) which have short sequence length requirements for accurate reconstruction of highly divergent trees. One attempt towards such a goal is the short quartet method [16]. This method can obtain completely accurate topology estimates from very short sequences, as was confirmed experimentally in [15]. Unfortunately, the short quartet method sometimes fails to return any tree at all, a serious limitation if the objective is to analyze a real dataset!
This paper focuses on a new method, or, more precisely, a new class of methods, the disk-covering methods, or DCM. DCM is based on key insights from the short quartet method, but always returns a tree. It works by a divide-and-conquer strategy, producing subproblems that are both smaller and less divergent than the original problem. Preexisting techniques are then used to solve the base subproblems, but since these problems are easier, the existing techniques are likely to do a better and/or faster job, and it may be feasible to use expensive methods (such as maximum likelihood estimation) that could not be used on the larger problem. In essence, the disk-covering method acts as a "booster" for existing methods.

2 Basic idea of the Disk-Covering Method

Both biologists and computer scientists have considered the following divide-and-conquer strategy for reconstructing trees on large divergent datasets: decompose the full dataset into several smaller datasets, reconstruct trees on these smaller datasets, and finally assemble the subtrees produced into one tree. Provided that decomposing the dataset and assembling the subtrees is not too expensive, this strategy has an obvious computational advantage. However, if such strategies do nothing to reduce the divergence present in the subproblems, they can actually increase the sequence length requirements, as our results in [19] showed. Previous attempts along these lines [25, 32] have in general suffered either from difficulty in finding how to partition into subproblems, from the inability to control the divergence in the subproblems, or from difficulties in determining how to arrange the obtained subtrees in a supertree.

2.1 An Insight from Short Quartet Methods

The development of the short quartet method, or, more precisely, a whole set of short quartet methods (SQM) [15, 16, 17, 18, 19], provided a key insight into how to solve the problem of decomposing datasets into smaller, less divergent subproblems. Here we rephrase it in terms that will allow us to describe DCM and why it works:


Basic observation of the short quartet papers: Let T be an edge-weighted tree, and let S be the set of sequences labeling the leaves. Let D be the matrix of inter-leaf distances defined by the tree T and its edge-weighting. For any set A ⊆ S, we can define the "width" of A to be max{D_ij : i, j ∈ A}.

Then there is a bound w such that the topology of T is uniquely recoverable (and in polynomial time) from the topology of each of its subsets of width at most w. Furthermore, in general w is quite small relative to the maximum inter-leaf distance in T.
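As a concrete illustration (a hypothetical sketch with our own names, not code from the paper), the width of a subset follows directly from the definition above:

```python
import itertools

# Illustrative sketch: the "width" of a taxon subset A under a
# distance matrix D is max{D_ij : i, j in A}.
def subset_width(D, A):
    return max(D[i][j] for i, j in itertools.combinations(A, 2))

# Toy symmetric distance matrix on 4 taxa (made-up values).
D = [[0, 3, 7, 8],
     [3, 0, 6, 7],
     [7, 6, 0, 5],
     [8, 7, 5, 0]]

print(subset_width(D, [0, 1, 2]))  # 7
```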

What this means, simply, is that the topology of any tree is uniquely defined by the topologies of its small, "close" subtrees. This observation has great importance, because accurate topology estimates are much easier to obtain for closely related subtrees than for distantly related subtrees; in other words, accurate topology estimates can be obtained from much shorter sequences on closely related subtrees than on distantly related subtrees. In fact, these papers also established that, by comparison, neighbor-joining and other distance-based methods needed sequence lengths that were superpolynomial in the tree parameters in order to obtain correct topologies, while the polynomial-time SQM could recover accurate trees using only polylogarithmic-length sequences for almost all trees. Unfortunately, these methods share an important flaw that makes them of more theoretical than practical interest. The problem is that each of the methods explicitly assumes that each subtree is accurately reconstructed. Thus, when the sequences are too short to ensure complete accuracy of each subtree reconstruction, SQM will fail to return anything.

2.2 The Disk-Covering Method (DCM), in Brief

The promise and problems of SQM suggest the following

Algorithmic goal: A method that obtains the true tree when SQM will obtain it, but which will succeed in obtaining a close approximation to the tree when SQM would fail to obtain any tree.

We have developed a general class of techniques that accomplishes this objective. We call this class of techniques the disk-covering methods, or DCM. DCM uses the key insight from SQM that each tree is reconstructible from its small-width subtrees. Let us consider more fully how SQM works. Using techniques described in [16, 18, 19], SQM can compute a bound on the width w such that if we know all subtrees on subsets of width at most w, we can reconstruct the tree. SQM reconstructs the small subtrees using a quartet-based technique (which can fail); then, if all subtrees are compatible, the tree consistent with all inferred subtrees is returned. In the case of inconsistent subtrees, no tree is returned. DCM differs from SQM in the following ways. First, as discussed below, determining the bound w is more complex. Second, rather than using a quartet-based method, given all the width-w subsets, a tree is reconstructed for each subset using any desired technique, such as parsimony. Then, like SQM, if all the trees are consistent with each other, the unique tree consistent with all the inferred subtrees is returned. However, if there are inconsistencies, the trees are nevertheless combined into a supertree, using techniques that minimize the loss of evolutionary information while resolving conflicts.

3 Details of the Disk-Covering Method

Recall that both SQM and DCM are based upon recovering the tree from the subtrees of bounded "width". Estimating this bound w is really the only technically challenging aspect of DCM. Given w, the dataset is decomposed into a number of subsets, each of width bounded by w. Subtrees on each of the subsets are then constructed using whatever base method has been chosen. Guided by the prior decomposition, these subtrees are then merged into one tree for the entire dataset. Some variants of DCM explicitly avoid estimating the bound, and instead compute a consensus of all the trees computed, one for each possible threshold. We explore these issues in turn here.


3.1 Computing a Tree for a Given Width

For most iid models of evolution, and some non-iid models as well, it is possible to define distances between sequences generated on trees under these models in such a way that the distances converge in probability to an additive [8] distance matrix defining the model tree. Given this additive distance matrix, recovering the edge-weighted model tree can be done in polynomial time (there are many such algorithms, the first of which is due to Waterman [47]). Indeed, given long enough finite-length sequences, under such models even extremely simple techniques can recover the topology of the model tree with high probability [19]. Furthermore, it can be shown that essentially all distance-based methods are provably consistent (i.e. given long enough sequences, they obtain the true tree with high probability). In DCM, we take advantage of the fact that distances based upon finite-length sequences approach additive distance matrices, in order to obtain an efficient decomposition into subproblems. Let S be the input set of sequences, and let w be the selected similarity threshold. We construct the threshold graph d_w associated to the parameter w as follows: d_w = (S, E_w), where (i, j) ∈ E_w if d_ij ≤ w. We will use this threshold graph to determine a set of subsets to construct trees for, and then combine these trees into one tree or forest. If the threshold graph is connected, we will compute a tree T_w on S, and otherwise we will compute a forest F_w on S. Before we compute subsets, we embed the threshold graph d_w into a triangulated graph. The purpose of this embedding is that triangulated graphs have very nice properties that will allow us to define subproblems whose trees can then be combined easily into one tree. A triangulated graph is a graph in which no set of nodes induces a chordless cycle of size four or more [24]; consequently, all cycles of size four or more contain chords. We can show:
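The threshold-graph construction can be sketched as follows (a hypothetical illustration; function and variable names are ours, not the paper's implementation):

```python
# Build the threshold graph d_w = (S, E_w): an edge (i, j) whenever
# d_ij <= w. Returned as an adjacency-set dict.
def threshold_graph(D, w):
    n = len(D)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] <= w:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Simple DFS connectivity check; DCM computes a tree T_w when d_w is
# connected, and a forest F_w otherwise.
def is_connected(adj):
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

D = [[0, 2, 5, 9],
     [2, 0, 4, 8],
     [5, 4, 0, 3],
     [9, 8, 3, 0]]
print(is_connected(threshold_graph(D, 4)))  # True: edges 0-1, 1-2, 2-3
print(is_connected(threshold_graph(D, 2)))  # False: only edge 0-1 exists
```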

Lemma 3.1 If d is an additive matrix, then the threshold graph d_w is triangulated.

(We omit the proof, but note that it depends upon the characterization of triangulated graphs as intersection graphs of subtrees of a tree [9].) The threshold graph d_w is likely to be close to triangulated, since it is close to D_w, where D is the additive matrix corresponding to the true tree (in fact, for small w, even if D and d are very far apart, the graphs d_w and D_w may be identical). Our experiments have indicated that d_w is often actually triangulated, and when it is not, a few additional edges suffice to triangulate it. Obtaining a minimal triangulation of a graph is in general NP-hard [48], but for the graphs we will consider, even simple heuristics work effectively and efficiently. Furthermore, it was recently shown that exact solutions can be obtained efficiently when the number of additional edges needed is small [10, 29].

Lemma 3.2 (from [24]). Every triangulated graph G has a perfect elimination scheme v_1, v_2, ..., v_n; this is an ordering of the nodes so that each set X_i = {v_j : j > i and (v_i, v_j) ∈ E} forms a clique (i.e. for all {v_k, v_l} ⊆ X_i, (v_k, v_l) ∈ E). The maximal cliques (cliques which cannot be enlarged by the addition of any further vertices) in G are of the form {v_i} ∪ X_i; hence there are at most n maximal cliques, and these can be found in O(n^2) time. Furthermore, the minimal separators of a triangulated graph are cliques. Given a triangulated graph, a perfect elimination scheme for the graph can be found in O(n^2) time, and from it the maximal cliques can also be found in that time.
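A minimal sketch of the clique extraction in Lemma 3.2, assuming a perfect elimination ordering is already available (names and the quadratic maximality filter are our own simplifications):

```python
# Given a triangulated graph (adjacency sets) and a perfect elimination
# ordering, form the candidate cliques {v_i} ∪ X_i from Lemma 3.2 and
# keep those not strictly contained in another candidate.
def maximal_cliques_from_peo(adj, order):
    pos = {v: k for k, v in enumerate(order)}
    candidates = []
    for v in order:
        X = {u for u in adj[v] if pos[u] > pos[v]}  # later neighbors
        candidates.append(frozenset({v} | X))
    return [c for c in candidates
            if not any(c < other for other in candidates)]

# A triangulated graph: triangle {0, 1, 2} plus pendant vertex 3 on 2.
# [3, 0, 1, 2] is a perfect elimination ordering for it.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(maximal_cliques_from_peo(adj, [3, 0, 1, 2]))
```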

The subproblems we will analyze correspond precisely to the maximal cliques of the triangulated supergraph of d_w. By the results from the short quartet papers [15, 16, 18, 19], small values of w will generally suffice to define the tree T. Consequently, the subproblems that are analyzed will by definition have low divergence, and are likely to contain few taxa; therefore, trees on these subproblems are more likely to be correctly recovered, even using simple algorithms.

Computing T_w


Figure 1: Merging two trees together, by first transforming them (through edge contractions) so that they induce the same subtrees on their shared leaves.

Step 1: For a given threshold w, compute the threshold graph d_w. Apply the algorithm to each component of d_w. Embed d_w in a triangulated graph d*_w (introducing a minimum number of edges, if possible), and compute a perfect elimination scheme for d*_w. Also compute the sets X_i defined above. Let G = d*_w. The entire sequence of O(n^2) threshold graphs can be constructed in O(n^2 log n) time, since each can be constructed from the preceding threshold graph rather than from scratch.

Step 2: IF G is a clique, then apply the phylogenetic method M of choice (henceforth called the base method) to the species S_G represented by the nodes in G, and return M(S_G) (note that for small enough subproblems, even expensive methods such as maximum likelihood estimation may be used), ELSE DO:

- At least one of the X_i, say X, is a separator. Compute the components C_1, C_2, ..., C_p of G - X.
- Recurse on C_i ∪ X for each i = 1, 2, ..., p, obtaining trees t_i, i = 1, 2, ..., p.
- Combine the trees t_1, t_2, ..., t_p into one tree. We accomplish this in two steps. We first modify (if necessary) each t_i so that they induce the same homeomorphic tree on X, creating a new set of trees t'_i, i = 1, 2, ..., p (this modification step can be based upon various techniques described below); this is called the subtree merger method. We then combine the trees so that the resultant supertree T_w, when restricted to X ∪ C_i, induces the same structure as t'_i restricted to X ∪ C_i. This tree always exists, and can be constructed in O(n^2) time.
- Return T_w.
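The recursive decomposition of Step 2 can be sketched as follows. This toy version only returns the leaf sets handed to the base method, and it finds a clique separator by brute force rather than via the precomputed X_i, so it is exponential and usable only on tiny examples; all names are our own:

```python
from itertools import combinations

def is_clique(adj, X):
    return all(v in adj[u] for u, v in combinations(X, 2))

def components(adj, removed):
    # Connected components of the graph after deleting the vertex set `removed`.
    rest = set(adj) - set(removed)
    comps = []
    while rest:
        comp, stack = set(), [rest.pop()]
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in rest:
                    rest.discard(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def induced(adj, S):
    return {v: adj[v] & S for v in S}

def dcm_subproblems(adj):
    # If G is a clique, hand it to the base method; otherwise split on a
    # clique separator X and recurse on each C_i ∪ X (Step 2 of DCM).
    V = set(adj)
    for r in range(1, len(V)):
        for X in map(set, combinations(sorted(V), r)):
            if is_clique(adj, X) and len(components(adj, X)) > 1:
                subs = []
                for C in components(adj, X):
                    subs += dcm_subproblems(induced(adj, C | X))
                return subs
    return [V]  # G is a clique: one base-method call

# Path 0-1-2-3: separators {1} and {2} yield subproblems {0,1}, {1,2}, {2,3}.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(dcm_subproblems(adj))
```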

Handling subtree merger. There are many possible solutions to the subtree merger problem, but in our initial explorations we have elected to be quite conservative, favoring increases in the false negative rate in order to keep the false positive rate (as defined in Section 4.1) as low as possible. For this reason, we have selected strict consensus as our subtree merger method. This method contracts a minimum set of edges in each tree in order to make the trees identical on the subtrees they induce on X. Computing this minimum set of edges is straightforward and can be accomplished in O(np) time (where n is the overall number of taxa, and p is the number of trees). Note that if minimizing the false negative rate is most desirable, then the asymmetric median tree [34] is the preferred subtree merger method. However, there are many different consensus methods that can be considered. For an entry to the rich literature on consensus methods, see [4, 13, 42]. Figure 1 shows an example of this method. The two trees share leaves in X = {1, 2, 3, 4} but do not agree on X. The strict consensus of the two subtrees induced by X is computed, and each tree


is transformed (by edge contractions) so as to induce that structure on X. The two trees can then be merged together.
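On the shared leaf set X, the strict consensus keeps exactly the bipartitions common to both induced subtrees. A hypothetical sketch in terms of split sets (each split given by one of its sides; names are ours):

```python
# Strict consensus on a common leaf set: normalize each split to the
# side not containing a fixed reference leaf, then intersect, so only
# bipartitions present in both trees survive.
def strict_consensus_splits(splits_a, splits_b, leaves):
    leaves = frozenset(leaves)
    ref = min(leaves)
    def canon(side):
        side = frozenset(side)
        return side if ref not in side else leaves - side
    return {canon(s) for s in splits_a} & {canon(s) for s in splits_b}

# Two trees on X = {1, 2, 3, 4, 5}: they agree on the split {1,2}|{3,4,5}
# but disagree elsewhere, so only that split is kept.
X = {1, 2, 3, 4, 5}
a = [{1, 2}, {1, 2, 3}]
b = [{1, 2}, {3, 4}]
print(strict_consensus_splits(a, b, X))  # {frozenset({3, 4, 5})}
```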

3.2 Estimating the Threshold w

Here we present one simple way of estimating w, but in general the question of how best to do this is quite open, and we expect to focus some effort on refining our techniques. From SQM, we know that if w is large enough so that all "short quartets" have width at most w, then T_w will be equal to T (the true tree) if all subproblems are correctly reconstructed. It follows that if T' is any tree leaf-labeled by S and w' is the maximum width of any short quartet on T' (with respect to the input distance matrix d), then a correct inference of all subsets of width at most w' will suffice to reconstruct the true tree. Our technique uses any fast method at hand to reconstruct a tree (we have used neighbor-joining), estimates the maximum width of any short quartet on that tree, and uses that width as the bound for w. Alternatively, one could examine several different methods, and take the minimum such w obtained. Closely related to the question of estimating w is the question of how sensitive the results are to which w is selected. Clearly, this will depend at least in part upon the base method, but we will show that the selection of w is quite robust to errors, and that in general there is a wide range of w's that provide equivalently good, or at least improved, topology predictions as compared to naive use of the base method. First, let us consider what range of w is interesting. Applying the base method to the entire dataset is equivalent to selecting w* = max{d_ij : 1 ≤ i < j ≤ n} as the threshold; thus w* is a strict upper bound. Also, there is no point in selecting w < w_1 = min{w : d_w is connected}, since for such w a forest, rather than a tree, is constructed. Thus the range of interest is [w_1, w*). Now the question of robustness to error is really how closely T_w for w ∈ [w_1, w*) compares to the true tree, and how that error compares to that for T_{w*}, which is the tree found just using the base method.
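The lower endpoint w_1 = min{w : d_w is connected} equals the largest edge weight on a minimum spanning tree of the pairwise distances, so it can be found with a Kruskal-style scan. A sketch under our own naming, not the paper's code:

```python
# Union-find over taxa; add pairs in order of increasing d_ij and report
# the distance that first makes the threshold graph connected.
def smallest_connecting_threshold(D):
    n = len(D)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    comps = n
    for w, i, j in sorted((D[i][j], i, j)
                          for i in range(n) for j in range(i + 1, n)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            comps -= 1
            if comps == 1:
                return w

D = [[0, 2, 5, 9],
     [2, 0, 4, 8],
     [5, 4, 0, 3],
     [9, 8, 3, 0]]
print(smallest_connecting_threshold(D))  # 4
```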
We have made a preliminary exploration of this question experimentally, and find different trends for neighbor-joining and maximum parsimony, as described in Section 4.2.

3.3 Another Approach

As the experiments described below indicate, while neighbor-joining and maximum parsimony differed in performance over the possible threshold selections, there was one feature they had mostly in common: low false positive rates throughout most of the range. This suggests yet another approach to tree construction: construct all trees T_w, one for each threshold w, and then infer a tree from the set of trees by combining their topologies. The particular consensus that is motivated by the low false positive rates is the asymmetric median tree (developed by Phillips and Warnow in [34]). An asymmetric median tree is formed by taking as many of the bipartitions present in the trees in the set as possible. Although this is generally an NP-hard problem, it can be solved very efficiently when the number of bipartitions which must be eliminated is small, using our techniques from [7, 34] (this problem is related to the tree compatibility problem, studied in [26, 45]). Since our experimental results showed that for most settings of w the trees T_w had low false positive rates, these fast algorithms can be applied to obtain asymmetric median trees efficiently. It is also possible to use straightforward greedy approaches to "approximate" the asymmetric median tree, and the experimental results we presented in Figure 2 were based upon such an implementation. The experiments discussed in Section 4.1 using asymmetric median trees with DCM (which we call AMT-DCM) show that it may have higher false positive rates than methods based upon estimating the appropriate threshold, but it often has much lower false negative rates. In fact, with some significant frequency, AMT-DCM-boosted maximum parsimony outperforms naive maximum parsimony.
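One way such a greedy approximation could look (a hypothetical sketch, not the implementation used in the paper): pool the bipartitions of all trees T_w and add them to the output, most frequent first, whenever they remain pairwise compatible.

```python
from collections import Counter

# Two splits A|B and C|D on the same leaf set are compatible iff at
# least one of the four side-intersections is empty.
def compatible(a, b, leaves):
    A, B = a, leaves - a
    C, D = b, leaves - b
    return not (A & C) or not (A & D) or not (B & C) or not (B & D)

# Greedy sketch of an asymmetric-median-tree approximation: keep as many
# of the pooled bipartitions as possible, most frequent first.
def greedy_amt(split_sets, leaves):
    leaves = frozenset(leaves)
    counts = Counter(frozenset(s) for splits in split_sets for s in splits)
    chosen = []
    for s, _ in counts.most_common():
        if all(compatible(s, t, leaves) for t in chosen):
            chosen.append(s)
    return chosen

leaves = {1, 2, 3, 4, 5}
trees = [[{1, 2}, {4, 5}], [{1, 2}, {1, 2, 3}], [{1, 2}, {3, 4}]]
print(greedy_amt(trees, leaves))
```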


Figure 2: Sequence length vs. false positive and false negative rates, for DCM-boosted neighbor-joining (disk-NJ) and for naive neighbor-joining (NJ), on a 93-taxon tree with maximum mutation probability 0.64. Each point represents the average rates from 2-6 independently generated data sets.

4 Some Experimental Results

DCM has a number of appealing theoretical properties, but of equal importance is how well it works in practice. To find out, we have designed and implemented several variants of DCM in C++ based on [33]. Our experiments were designed to study how well DCM-boosted methods work compared to the un-boosted base methods, when applied to datasets generated by simulating evolution on a number of different model trees.

4.1 Experimental Setup

We used the Jukes-Cantor [28] model of evolution to generate DNA sequences. In this model, every site evolves independently and identically on a given model tree, the probability of change between nucleotides is identical, and we make the Markov assumption that the probability of changes below a node is not influenced by the events that happen above that node. The sequence at the root is drawn from the uniform distribution. Every edge e in the tree is assigned a mutation probability, denoted by p(e), which determines the probability that a site will change its state between the endpoints of the edge. In order to ensure the possibility of recovering the tree from the sequences, we only used model trees for which 0 < p(e) < 3/4 for all edges e in the tree. Given a source of randomness, a model tree T, root r, and mutation probabilities {p(e) : e ∈ E(T)}, we can then generate sequences at the leaves of T that are of the same length as the sequence at the root. These sequences are then given as input to a base phylogenetic method (for example, neighbor-joining, maximum parsimony heuristics, or the Buneman Tree), and to the DCM-boosted version of the method. The accuracy of the resultant output is evaluated by examining the bipartitions induced by the edges in the trees. Every edge e in a leaf-labeled tree T defines a bipartition π_e on the leaves (induced by the deletion of e), and hence the tree T is uniquely encoded by the set C(T) = {π_e : e ∈ E(T)}. If T is the model tree and T' is the tree obtained by the method, then the error in the topology can be calculated as follows:

- False positives: C(T') - C(T); these are "edges" (bipartitions) in T' missing from T, and
- False negatives: C(T) - C(T'); these are "edges" (bipartitions) in T missing from T'.

The Robinson-Foulds distance is (1/2)|C(T) Δ C(T')|, i.e. the average of the false positive and false negative rates [39].
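These error measures reduce to set operations on the bipartition encodings C(T). A small sketch (our own naming, counting bipartitions rather than normalized rates):

```python
# False positives: bipartitions in the estimated tree but not the true
# tree; false negatives: the reverse. The Robinson-Foulds distance is
# half the size of the symmetric difference of the two encodings.
def topology_error(C_true, C_est):
    fp = C_est - C_true
    fn = C_true - C_est
    rf = len(C_true ^ C_est) / 2
    return len(fp), len(fn), rf

# Internal (non-trivial) bipartitions of two 5-leaf trees, each split
# written as a frozenset giving one side.
C_true = {frozenset({1, 2}), frozenset({4, 5})}
C_est = {frozenset({1, 2}), frozenset({3, 4})}
print(topology_error(C_true, C_est))  # (1, 1, 1.0)
```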


Figure 3: Sequence length vs. false positive and false negative rates, for DCM-boosted Buneman tree (disk-BT) and for naive Buneman tree (BT), on a 35-taxon tree with maximum mutation probability 0.64. Each point represents the average rates from 10 independently generated data sets.

Figure 2 shows the result of one such experiment, based on AMT-DCM and using neighbor-joining as the base method. The model tree used to generate the data is a 93-taxon subtree of the 500-taxon tree generated by Rice [36] using the rbcL dataset, with high mutation probabilities assigned to the edges. Earlier studies of maximum parsimony and neighbor-joining on this tree [37] showed that neighbor-joining had false negative and false positive rates of about 40%, even at 12,800 nucleotides, while maximum parsimony converged to a tree which was not the model tree, and had false positive and false negative rates of about 8% for sequences of length 1000-12,800. By comparison, DCM neighbor-joining quickly achieves very low false positive and false negative rates, by 3,000 nucleotides. These results are consistent with studies discussed in [37, 38] that demonstrated that the accuracy of distance-based methods degrades dramatically as the divergence increases; thus, by reducing to problems of smaller divergence, the accuracy is increased. In fact, when we combined DCM with other distance-based methods, we found even greater improvements than we found with neighbor-joining. Figure 3 shows the result of one such experiment, based on the simple version of DCM in which we select the most resolved T_w as the output tree, and using the Buneman Tree [8] as the base method.

4.2 Error Tolerance for Threshold Selection

Using neighbor-joining, we find that unless the tree is very accurately inferred by naive neighbor-joining (which tends not to be the case with highly divergent datasets), any selection of w in the range [w_1, w*) obtains an improved reconstruction of the tree, when we consider the Robinson-Foulds distance. In particular, the trees found using DCM generally have significantly lower false positive rates. The false negative rates of the trees T_w are almost all lower than the false negatives of the naive tree, but there is an initial segment of the range in which the false negatives can be (but are not always) higher. Both false positive and false negative rates drop to a low point in the interior of the range, and then increase as the threshold approaches the maximum value w*. However, even at the high end of the range the trees obtained are better than the naive neighbor-joining tree. Optimal performance, then, will depend upon selecting w in the right portion of the range, but essentially all w's improve over naive use of neighbor-joining. Using maximum parsimony, our experiments indicate a different situation. Here, the performance improves as w increases, both with respect to false positives and false negatives. Interestingly, the false positives quickly decrease, often to 0, while the false negatives do not decrease quite as quickly. There is, however, a large range (typically, the last half of the range) in which comparable

[Plot: Topological Error (%) vs. Threshold Value, with curves "Total", "FP", and "FN".]

performance to naive use of parsimony is obtained, i.e., where the difference in errors amounts to losing at most an edge or two of the true tree.

Figure 4: Error rates, in percentages, of DCM parsimony for different threshold selections, on a 35-taxon tree with moderate rates of mutation (maximum p(e) = 0.1).

These results do not mean, however, that DCM-boosted maximum parsimony is not useful. Instead, the likely advantage is in faster computation times. Using DCM creates a linear number of subproblems, but the cost of maximum parsimony is exponential in the problem size. Thus small reductions in the size of the subproblems compared to the original problem result in large decreases in the computation time. One of the goals of our future research is to study the question of parsimony runtime improvements using DCM in some detail.

Figure 4 presents the results of using DCM-boosted parsimony on a dataset of 35 sequences of length 800 (this is also an rbcL subset), generated on a model tree with moderate mutation rates (maximum p(e) = 0.1). We reconstructed trees using every threshold for which the triangulated threshold graph is connected but not a clique. Naive parsimony recovers the model tree from these sequences, and so does DCM parsimony for any selection of w ≥ 0.28. Furthermore, very low errors (no false positives, and only one false negative) are obtained for thresholds w ≥ 0.22. Thus, for a great majority of the range (essentially the last 3/4 of the range), the performance is comparable to applying parsimony to the entire set of taxa. Note that if we had applied DCM for these smaller threshold settings, we would have reconstructed Tw from smaller subsets. How much smaller are these subsets? Each subproblem analyzed for w = 0.22 is of size at most 15, and each subproblem analyzed for w = 0.28 is of size at most 21. Unfortunately, we do not yet have measurements of how much this affects the actual running time, but we expect it to be significant.
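The criterion "connected but not a clique" for admissible thresholds can be sketched directly from a distance matrix. The sketch below is a simplified illustration under our own naming: for each candidate w, two taxa are joined whenever their distance is at most w, and w is kept if the resulting graph is connected but not yet complete. (We test the raw threshold graph here; DCM proper works with its triangulation.)

```python
# Minimal sketch (our names, not the DCM implementation): enumerate the
# thresholds w for which the threshold graph is connected but not a clique.
from itertools import combinations

def threshold_graph(dist, w):
    """Adjacency sets: taxa i and j are joined iff dist[i][j] <= w."""
    n = len(dist)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if dist[i][j] <= w:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def is_connected(adj):
    """Depth-first search from taxon 0."""
    seen, stack = {0}, [0]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

def usable_thresholds(dist):
    """Candidate w values are the distinct pairwise distances; keep those
    whose threshold graph is connected but not yet a clique."""
    n = len(dist)
    candidates = sorted({dist[i][j] for i, j in combinations(range(n), 2)})
    good = []
    for w in candidates:
        adj = threshold_graph(dist, w)
        complete = all(len(adj[v]) == n - 1 for v in adj)
        if is_connected(adj) and not complete:
            good.append(w)
    return good
```

On a toy three-taxon matrix with pairwise distances 1, 2, and 4, only w = 2 qualifies: w = 1 leaves one taxon isolated, and w = 4 makes the graph complete.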
Also, because the subproblems are independent, DCM provides a nice framework for using parallelism to speed up these computations. Finally, first experiments with DCM-boosted split decomposition indicate that significant improvements can be obtained over naive split decomposition, a method for producing splits graphs (essentially networks that can describe a number of conflicting trees) to represent phylogenetic relationships [3].
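The parallel opportunity noted above can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: `solve_subset` is a placeholder for any base method applied to one disk of taxa, and a real CPU-bound base method (e.g. a parsimony search) would use a process pool rather than threads.

```python
# Hypothetical sketch: DCM's subproblems are independent, so the base
# method can be applied to each taxon subset concurrently.
from concurrent.futures import ThreadPoolExecutor

def solve_subset(taxa_subset):
    # Placeholder for an expensive base-method call that would return
    # a subtree on these taxa; here it just sorts the labels.
    return sorted(taxa_subset)

def solve_all_subsets(subsets, workers=4):
    # Each subset is analyzed independently, so map them across workers;
    # results come back in the same order as the input subsets.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(solve_subset, subsets))
```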

Conclusions and Acknowledgments

We have presented a new method for tree reconstruction which can be used in conjunction with other phylogenetic tree reconstruction methods, generally recovering a much more accurate estimation of the true tree from shorter sequences than has typically been possible. This work is quite preliminary, as it is clear that we will obtain much better results after fine-tuning the techniques to work with different phylogenetic methods. The reader interested in learning more about phylogenetic tree reconstruction methods should see [22, 43, 46] for a deeper introduction to the issues involved in phylogenetic analysis.


We should mention another problem that arises for large sets of significantly divergent sequences, namely that it can be difficult to compute a good alignment of the whole dataset, and tree reconstruction methods can be very sensitive to alignment errors. On the other hand, it is easier to obtain good alignments of subsets of closely related data. Hence, a promising approach is to use a poor alignment of the whole dataset just to guide the divide-and-conquer step of DCM, and then to compute higher-quality alignments of each subproblem before passing it to the base method.

We thank Olivier Gascuel and Hans Bodlaender for suggestions which led us to develop this method, and we thank the NSF, Paul Angello, and the David and Lucile Packard Foundation for support which made this research possible. We also thank the Program in Applied and Computational Mathematics at Princeton University, which hosted Daniel Huson, and the Computer Science Department at Princeton University, which hosted Tandy Warnow during her sabbatical, during which this work was done.

References

[1] R. Agarwala, V. Bafna, M. Farach, B. Narayanan, M. Paterson, and M. Thorup, On the approximability of numerical taxonomy: fitting distances by tree metrics, Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, (1996) 365-372.

[2] K. Atteson, The performance of neighbor-joining algorithms of phylogeny reconstruction, Computing and Combinatorics, Third Annual International Conference, COCOON '97, Shanghai, China, August 1997, Proceedings. Lecture Notes in Computer Science, 1276, Tao Jiang and D.T. Lee, (Eds.). Springer-Verlag, Berlin, (1997) 101-110.

[3] H.-J. Bandelt and A.W.M. Dress, A canonical decomposition theory for metrics on a finite set, Advances in Mathematics, 92 (1992) 47-105.

[4] J. Barthelemy and F. McMorris, The median procedure for n-trees, Journal of Classification, 3 (1986) 329-334.

[5] H. Bodlaender, M. Fellows, and T. Warnow, Two strikes against perfect phylogeny, Proceedings, International Colloquium on Automata, Languages and Programming. Springer-Verlag Lecture Notes in Computer Science, 623 (1992) 273-283.

[6] M. Bonet, C.A. Phillips, T. Warnow, and S. Yooseph, Constructing evolutionary trees in the presence of polymorphic characters, to appear, SIAM J. Computing. (A preliminary version appeared in the ACM Symposium on the Theory of Computing, 1996.)

[7] M. Bonet, M. A. Steel, T. Warnow, and S. Yooseph, Faster algorithms for solving parsimony and compatibility, to appear, RECOMB 1998.

[8] P. Buneman, The recovery of trees from measures of dissimilarity, in Mathematics in the Archaeological and Historical Sciences, F. R. Hodson, D. G. Kendall, P. Tautu, eds., Edinburgh University Press, Edinburgh, (1971) 387-395.

[9] P. Buneman, A characterization of rigid circuit graphs, Discrete Mathematics, 9 (1974) 205-212.

[10] L. Cai, Fixed-parameter tractability of graph modification problems for hereditary properties, Information Processing Letters, 58 (1996) 171-176.

[11] M. W. Chase, D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, R. A. Price, H. G. Hills, Y.-L. Qiu, K. A. Kron, J. H. Rettig, E. Conti, J. D. Palmer, J. R. Manhart, K. J. Sytsma, H. J. Michaels, W. J. Kress, K. G. Karol, W. D. Clark, M. Hedrén, B. S. Gaut, R. K. Jansen, K.-J. Kim, C. F. Wimpee, J. F. Smith, G. R. Furnier, S. H. Strauss, Q.-Y. Xiang, G. M. Plunkett, P. M. Soltis, S. M. Swensen, S. E. Williams, P. A. Gadek, C. J. Quinn, L. E. Eguiarte, E. Golenberg, G. H. Learn Jr, S. W. Graham, S. C. H. Barrett, S. Dayanandan, and V. A. Albert, Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL, Annals of the Missouri Botanical Garden, 80 (1993) 528-580.

[12] X. Cousin, T. Hotelier, K. Giles, P. Lievin, J.-P. Toutant and A. Chatonnet, The alpha/beta fold family of proteins database and the cholinesterase gene server ESTHER, Nucleic Acids Res., 25 (1997) 143-146.

[13] W.H.E. Day, Optimal algorithms for comparing trees with labelled leaves, Journal of Classification, 2 (1985) 7-28.

[14] W.H.E. Day and D.S. Johnson, The computational complexity of inferring rooted phylogenies by parsimony, Mathematical Biosciences, 81 (1986) 33-42.

[15] P. L. Erdős, K. Rice, M. A. Steel, L. Szekely, and T. Warnow, The Short Quartet Method, to appear, Mathematical Modeling and Scientific Computing, Principia Scientia (1998).

[16] P. L. Erdős, M. A. Steel, L. A. Szekely, and T. Warnow, Constructing big trees from short sequences, ICALP'97, 24th International Colloquium on Automata, Languages, and Programming (Silver Jubilee of EATCS), Bologna, Italy, July 7-11, 1997, eds. G. Goos, J. Hartmanis, J. van Leeuwen, Lecture Notes in Computer Science 1256, 1997.

[17] P. L. Erdős, M. A. Steel, L. A. Szekely, and T. Warnow, Local quartet splits of a binary tree infer all quartet splits via one dyadic inference rule, Computers and Artificial Intelligence, 16(2) (1997) 217-227.

[18] P. L. Erdős, M. A. Steel, L. A. Szekely, and T. Warnow, A few logs suffice to build (almost) all trees I, submitted to Random Structures and Algorithms; DIMACS Technical Report 97-71, 33 pp., http://www.dimacs.rutgers.edu/TechnicalReports/1997.html

[19] P. L. Erdős, M. A. Steel, L. A. Szekely, and T. Warnow, A few logs suffice to build (almost) all trees II, submitted to Theor. Comp. Sci.; DIMACS Technical Report 97-72, 46 pp., http://www.dimacs.rutgers.edu/TechnicalReports/1997.html

[20] M. Farach, S. Kannan, and T. Warnow, A robust model for finding optimal evolutionary trees, Algorithmica, special issue on Computational Biology, 13(1) (1995) 155-179. (A preliminary version of this paper appeared at STOC 1993.)

[21] J. Felsenstein, Cases in which parsimony or compatibility methods will be positively misleading, Syst. Zool., 27 (1978) 401-410.

[22] J. Felsenstein, Phylogenies from molecular sequences: inference and reliability, Annu. Rev. Genet., 22 (1988) 521-565.

[23] L. R. Foulds and R. L. Graham, The Steiner problem in phylogeny is NP-complete, Adv. Appl. Math., 3 (1982) 43-49.

[24] M.C. Golumbic, Algorithmic Graph Theory and Perfect Graphs, Academic Press Inc, 1980.

[25] Green Plant Phylogeny Research Coordination Group, Summary report of Workshop #1: Current Status of the Phylogeny of the Charophyte Green Algae and the Embryophytes. University and Jepson Herbaria, University of California, Berkeley, June 24-28, 1995. 7 January, 1996.

[26] D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks, 21 (1991) 19-28.

[27] D. Hillis, Inferring complex phylogenies, Nature, 383 (12 September 1996) 130-131.

[28] T.H. Jukes and C.R. Cantor, Evolution of protein molecules, in: H.N. Munro, ed., Mammalian Protein Metabolism, Academic Press, New York, (1969) 21-132.

[29] H. Kaplan, R. Shamir, and R.E. Tarjan, Tractability of parameterized completion problems on chordal and interval graphs: minimum fill-in and physical mapping, in Proceedings of the 35th Symposium on Foundations of Computer Science, IEEE Computer Science Press, Los Alamitos, California, (1994) 780-791. To appear, SIAM J. Computing.

[30] J. Kim, General inconsistency conditions for maximum parsimony: effects of branch length and increasing number of taxa, Syst. Biol., 45(3) (1996) 363-374.

[31] D. R. Maddison, M. Ruvolo, and D. L. Swofford, Geographic origins of human mitochondrial DNA: phylogenetic evidence from control region sequences, Systematic Zoology, 41 (1992) 111-124.

[32] B. Mishler, Cladistic analysis of molecular and morphological data, Am. J. Phys. Anthropol., 94 (1994) 143-156.

[33] S. Näher and K. Mehlhorn, LEDA, a platform for combinatorial and geometric computing, Communications of the ACM, 38(1) (1995) 96-102.

[34] C.A. Phillips and T. Warnow, The asymmetric median tree: a new model for building consensus trees, Discrete Applied Mathematics, Special Issue on Computational Molecular Biology, 71 (1996) 311-335.

[35] K. Rice, The origin, evolution, and classification of G-protein-coupled receptors, Ph.D. dissertation, Harvard University, (1994).

[36] K. Rice, M. Donoghue, and R. Olmstead, Analyzing large datasets: rbcL 500 revisited, Systematic Biology, (1997).

[37] K. Rice and T. Warnow, Parsimony is Hard to Beat!, Computing and Combinatorics, Third Annual International Conference, COCOON '97, Shanghai, China, August 1997, Proceedings. Lecture Notes in Computer Science, 1276, Tao Jiang and D.T. Lee, (Eds.). Springer-Verlag, Berlin, (1997) 124-133.

[38] K. Rice, M. A. Steel, T. Warnow, and S. Yooseph, Hybrid tree reconstruction methods, submitted for publication.

[39] D. F. Robinson and L. R. Foulds, Comparison of weighted labelled trees, Lecture Notes in Mathematics Vol. 748, Springer-Verlag, Berlin, (1979) 119-126.

[40] M. L. Sogin, G. Hinkle, and D. D. Leipe, Universal tree of life, Nature, 362 (1993) 795.

[41] M. A. Steel, The complexity of reconstructing trees from qualitative characters and subtrees, J. Classification, 9 (1992) 91-116.

[42] M.A. Steel and T. Warnow, Kaikoura tree theorems: computing the maximum agreement subtree, Information Processing Letters, 48 (1993) 72-82.

[43] D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis, Chapter 11: Phylogenetic inference, in: Molecular Systematics, D. M. Hillis, C. Moritz, B. K. Mable, eds., 2nd edition, Sinauer Associates, Inc., Sunderland, (1996) 407-514.

[44] C. Tuffley and M. A. Steel, Links between maximum likelihood and maximum parsimony under a simple model of site substitution, Bulletin of Mathematical Biology, 59(3) (1997) 581-607.

[45] T. Warnow, Tree compatibility and inferring evolutionary history, Journal of Algorithms, 16 (1994) 388-407. (A preliminary version of this paper appeared at SODA 1993.)

[46] T. Warnow, Some combinatorial problems in phylogenetics, invited to appear in the proceedings of the International Colloquium on Combinatorics and Graph Theory, Balatonlelle, Hungary, July 15-20, 1996, eds. A. Gyarfas, L. Lovasz, L.A. Szekely, in a forthcoming volume of Bolyai Society Mathematical Studies.

[47] M.S. Waterman, T.F. Smith, and W.A. Beyer, Additive evolutionary trees, Journal of Theoretical Biology, 64 (1977) 199-213.

[48] M. Yannakakis, Computing the minimum fill-in is NP-complete, SIAM J. Alg. Disc. Meth., 2 (1981).
