Maximum Likelihood of Phylogenetic Networks Guohua Jin1 , Luay Nakhleh1 , Sagi Snir2 , and Tamir Tuller3
?
1 2
Dept. of Computer Science Rice University Houston, TX, USA Dept. of Mathematics University of California Berkeley, CA, USA 3 School of Computer Science. Tel-Aviv University, Israel.
Abstract. Horizontal gene transfer (HGT) is believed to be ubiquitous among bacteria, and plays a major role in their genome diversification as well as their ability to develop resistance to antibiotics. In light of its evolutionary significance and implications in human health, developing accurate and efficient methods for detecting and reconstructing HGT is imperative. In this paper we provide a first likelihood framework for phylogeny-based HGT detection and reconstruction. Beside the formulation of various likelihood criteria, we offer novel hardness results and heuristics for efficient and accurate reconstruction of HGT under these criteria. We implemented our heuristics and used them to analyze biological as well as synthetic data. Our methods exhibited very good performance on the biological data, and preliminary results on the synthetic data show a similar trend. Further, preliminary results on synthetic data show good promise for detecting chimeric (partial) HGT, which has been mostly ignored by existing computational methods. Implementation of the criteria as well as heuristics is available from the authors upon request.
1
Introduction
Unlike eukaryotes, which evolve largely through vertical lineal descent, bacteria acquire genetic material through the transfer of DNA segments across species boundaries—a process known as horizontal gene transfer (HGT). This process plays a major role in bacterial genome diversification [10, 11, 26, 25, 5, 22], and is a significant mechanism by which bacteria develop resistance to antibiotics [12]. In the presence of HGT, the evolutionary history of a set of organisms is modeled by a phylogenetic network, which is a directed acyclic graph obtained by positing a set of edges between pairs of the branches of an organismal tree to model the horizontal transfer of genetic material [29]. Therefore, to reconstruct an accurate Tree (or Network) of Life and to unravel bacterial genomic complexities, developing accurate criteria and efficient methods for reconstructing and assessing the quality of phylogenetic networks is imperative. A large body of work has been introduced in recent years to address phylogenetic network reconstruction and evaluation. In general, three categories of non-treelike models have been addressed, all of which have been introduced under the umbrella concept of phylogenetic networks. However, major differences exist among the three categories. Splits networks are graphical models that capture incompatibilities in the data due to various factors, not necessarily HGT or hybrid speciation; examples of methods that address these networks are given in [38, 6, 21]. The second category is that of recombination networks, which are used to model the evolution of haplotypes and genes at the population level; see [2, 19] for examples. Phylogenetic networks are the extension of phylogenetic trees to enable the modeling of reticulation events, such as HGT and hybrid ?
Corresponding author, email
[email protected]
2
Jin et al.
speciation (these are also called reticulate networks in [21]); examples of methods that address these networks are given in [20, 29, 33, 28, 30]. See [27, 28] for detailed surveys of the various phylogenetic network models and methodologies. One of the most accurate and commonly used criteria for reconstructing phylogenetic trees is maximum likelihood (ML) [14]. Roughly speaking, this criterion considers a phylogenetic tree from a probabilistic perspective as a generative model, and seeks the model (i.e., tree) that maximizes the likelihood of observing a given set of sequences at the leaves of the tree. Likelihood in the general network setting has been investigated in the past by various works. However, no HGT-specific likelihood framework has ever been suggested. von Haeseler and Churchill [39] provided a framework for evaluating likelihoods on networks and subsequently [38] provided an approach to assess this likelihood. These works consider a network as an arbitrary set of splits and therefore fall into the first category. They are characterized by the combined analysis approach, which entails combining all gene datasets first (by sequence concatenation), and then analyze the combined data set. A serious drawback of the latter approach is that when individual genes are governed by different evolutionary mechanisms and models (a scenario that is very common in reticulate evolution), combining multiple data sets is problematic [32]. Likelihood on networks has also been considered in the setting of recombination networks (see e.g. [18, 8]). These methods, similarly to ours, are indeed tailored to identify breakpoints along the given sequences, however, their underlying model is different from ours as they model a different process. In this work, we extend the ML criterion to handle specifically HGT-oriented (i.e. phylogenetic) networks, and propose a set of criteria and efficient heuristics for computing them. Our extension is based on the fundamental observation that, barring recombination, the evolutionary history of a gene is modeled by a tree, such that a phylogenetic network can be modeled by its constituent trees [31]. We propose a set of ML criteria for phylogenetic networks; these criteria differ in how the tree information is used, which variant of the ML criterion is used, and finally what input is provided. Further, we investigate the computational complexity of some of these criteria and devise a set of efficient heuristics for reconstructing and evaluating phylogenetic networks based on them. In particular, we prove that scoring the likelihood of a phylogenetic network is NP-hard in general, and provide an empirically efficient exact algorithm for the problem, relying on the notion of bi-connected components. Further, we devise an efficient branch-and-bound heuristic for the problem of adding a number of HGT edges to a tree to obtain an optimal phylogenetic network. Finally, we make very preliminary progress on the consistency of the ML criterion for phylogenetic networks, and show that, given long enough sequences, ML outperforms the parsimony criterion for networks [30]. We have implemented our criteria and heuristics and studied their performance on biological as well as synthetic data. For the biological data, we analyzed the Rubisco gene in eubacteria and plastids, which was previously analyzed by Delwiche and Palmer, who postulated a set of HGT events for it [9]. For the synthetic data, we simulated multiple data sets with various HGT events and applied our techniques to the data. In the case of the biological data, our criteria and heuristics performed very well with respect to the identification of the correct number of HGT events as well as their placements on the organismal trees. In preliminary results that we have obtained on
ML of Phylogenetic Networks
3
the synthetic data, our ML criteria have shown similar trends in detecting the correct number and placement of the HGT edges, and exhibited very good performance in identifying HGT breakpoints within genes, contributing towards addressing chimeric HGT [3].
2
Maximum Likelihood of Phylogenetic Networks
2.1 Preliminaries and Definitions Let T = (V, E) be a tree, where V and E are the tree nodes and tree edges, respectively, and let L(T ) denote its leaf set and I(T ) its internal nodes. Further, let χ be a set of taxa (species). Then, T is a phylogenetic tree over χ if there is a bijection between χ and L(T ). Henceforth, we will identify the taxa set with the leaves they are mapped to, and let [n] = {1, . . . , n} denote the set of leaf-labels. A tree T is said to be rooted if the set of edges E is directed and there is a single distinguished internal vertex r with in-degree 0. A phylogenetic network N = N (T ) = (V 0 , E 0 ) over the taxa set χ is derived from T = (V, E) by adding a set H of edges to T , where each edge h ∈ H is added as follows: (1) split an edge e ∈ E by adding new node, ve ; (2) split an edge e0 ∈ E by adding new node, ve0 ; (3) finally, add a directed reticulation edge from ve to ve0 . The edges in H must satisfy that the phylogenetic network should be acyclic and satisfy additional temporal constraints [29]. Fig. 1(a) shows a phylogenetic network obtained by adding the edge (X, Y ) to the underlying organismal tree. The tree in Fig. 1(b)
Y
X
A
B
(a)
C
D
A
B
C
D
A
B
(b)
C
D
(c)
Fig. 1. (a) A phylogenetic network with a single HGT even from X to Y . (b) The underlying organismal (species) tree. (c) The tree of a horizontally transferred gene.
models the evolution of all genetic material that is vertically inherited from the ancestral organism, whereas the tree in Fig. 1(c) models the evolution of horizontally transferred genetic material. We denote by T(N ) the set of all trees contained inside network N . Each such tree is obtained by the following two steps: (1) for each node of in-degree 2, remove one of the incoming edges, and then (2) for every node x of in-degree and out-degree 1, whose parent is u and child is v, remove node x and its two adjacent edges, and add a new edge from u to v. For example, the set T (N ) of the network N in Fig. 1(a) contains only the two trees which are shown in Fig. 1(b) and 1(c). 2.2 Likelihood of Phylogenetic Networks Under the ML criterion, a phylogenetic tree is viewed as a probabilistic model from which input sequences S are assumed to be sampled. For given input sequences S the ith site, Si , is the set of values at the ith position for every sequence in S 4 . In this 4
can be viewed as the ith column when the sequences are aligned
4
Jin et al.
work we assume that sites evolve Independently and Identically Distributed (iid). Since the parameters of the phylogenetic tree, M , are unknown, they are usually estimated from the observed sequences by maximizing the likelihood function P (S|M ) [14]. In general, the overall likelihood of the aligned sequences S given the model M is obtained by the product of the likelihood of every site i given M as follows: L(S|M ) =
k Y
L(Si |M ),
(1)
i=1
where k is the sequence length. Therefore, we will consider the likelihood of a single site henceforth.The most likely model is the one maximizing Equation 1. When calculating the likelihood of a tree per a given site, two variants are considered: average likelihood [37] and ancestral likelihood [36] (the former is the more popular of the two). The ML criterion assumes a model of evolution. We consider here the Jukes-Cantor model of sequence evolution [24]. However, all the results here can be generalized to any other model of sequence evolution. Given a set of aligned sequences S ∈ Σ n×k , a tree T with I(T ) internal nodes (|I(T )| ≤ n − 2), and the edge transition probabilities p, Lav (Si |T, p), the average likelihood of obtaining site Si under T is defined as X Y Lav (Si |T, p) = m(pe , Si , a) , (2) a∈Σ |I(T )| e∈E(T )
where a ranges over all combinations of assigning labels to the I(T ) internal nodes of T . Each term m(pe , Si , a) is either pe /(|Σ|−1) or (1−pe ), depending on whether in the i-th site of S and a, the two endpoints of e are assigned different character states (and then m(pe , Si , a) = pe /(|Σ| − 1)) or the same character state (and then m(pe , Si , a) = 1−pe ). In the other variant of likelihood for phylogenetic trees, the ancestral likelihood, we replace the summation in Equation 2 with maximization as follows: Y Lanc (Si |T, p) = max m(pe , Si , a). (3) a∈Σ |I(T )|
e∈E(T )
That is, we seek the unique labeling to internal nodes to maximize the expression. The ML solution (or solutions) for a specific tree T is the point (or points) in the edge space p = [pe ]e∈E(T ) that maximizes the expression L(S|T, p) of Equations 2 and 3, where M = (T, p). We will refer to both criteria when evaluating the likelihood of a network. The natural way for generalizing this setting to networks is as follows. The topology of a phylogenetic network is defined as above, however in this case since tree edges have transition probabilities, when adding a reticulation edge between the edges e, e0 ∈ E we should mention where along the edges e, and e0 we add the two new vertices. The transition probability of the reticulation edge is always 0, meaning there are no substitutions along it (the reason being that HGT is instantaneous at the scale of evolution). However each reticulation edge r = (e, e0 ) has reticulation probability br associated with it. This probability denotes the probability of a DNA segment being transferred along that edge. Fig. 2 describes a simple phylogenetic network. Let re(T ) denote the set of reticulation edges used to obtain tree T in the network N , and let H(N ) denote the set of all reticulation edges in N . Let Y Y P (T ) = br (1 − br ), r∈re(T )
r∈H(N )\re(T )
ML of Phylogenetic Networks
5
i
P(
i,f)
e,i)
P(
e
f P(f,Y)
P(X,B)
P(Y,C)
P( A,
Y
A
B
C
)
X
f,D
P(
e)
P(e,X)
D
Fig. 2. Simple example of phylogenetic network under the likelihood setting.
where b denotes the reticulation edge probabilities. The likelihood of a network is obtained as a function of the likelihoods of the trees contained in it. Here again we consider two variants. In the first, the likelihood is the sum of the likelihood of all the trees of the network, where for each tree T we also need to multiply the resultant likelihood by P (T ). Again, we can choose between the two tree likelihood functions described above (Equations 2 and 3). Thus we get the following equation: Lall (Si |N, p, b) =
X
P (T ) · L(Si |T, p)
(4)
T ∈T(N )
In the other variant we want to reconstruct the sequence of reticulation events. Thus we want to find for each site one tree such that the likelihood of the leaf labels is maximized; we get the following equation: Lbest (Si |N, p, b) = max (P (T ) · L(Si |T, p)) T ∈T(N )
(5)
We stress that a tree likelihood can be computed only once the network is given as it is a component in the network. In order to complete the definition of the maximum likelihood of phylogenetic networks, we add the last criterion which is the type of the input provided. Therefore, we can define multiple ML criteria, depending on three issues: 1. Likelihood criterion for the trees: Is ancestral likelihood or average likelihood used to assess each tree likelihood in the network? 2. Tree criterion: At each site, is the best tree likelihood or the sum of the likelihoods over all trees taken? 3. Input data: Which of the network’s parameters (topology and/or probabilities) are given? We consider three possibilities: (a) The tiny problem: The network topology, transition probabilities and reticulation probabilities are given. (b) The small problem: The initial tree and reticulation edges (i.e. network topology) are given but not the transition or reticulation probabilities. (c) The big problem: An initial tree is given and we want to add reticulation edges. The above gives rise to 12 variants of the network maximum likelihood problem. We will refer to each of these variants by a triplet X–Y –Z, where X ∈ {ancestral, average}, Y ∈ {all, best}, and Z ∈ {tiny, small, big}.
6
3
Jin et al.
Properties of the ML Criterion
In this section we provide consistency and hardness results to some of the problems we formulated above. For simplicity we assume here the Neyman 2-state model of sequence evolution [34]; however all the results here can be easily generalized to more complex models. For lack of space all the proofs are deferred to the appendix. 3.1 Optimality and Identifiability A reconstruction method is said to be consistent if the probability that it reconstructs the true tree tends to 1 as the sequence length tends to infinity. A model (tree or network) generating some data D is said to be identifiable by D if D has enough information to identify the model uniquely. Consistency and identifiability in phylogenetics are very topical issues, sparked by Felsenstein’s seminal observation that MP is not consistent. Here we address two related aspects of these issues for the best–tiny (ancestral and average) versions of the problem. We define a reconstruction method for the best–tiny problem to be deterministic if for the given network N , and a specific character λ0 , the method will always return the same tree T0 ∈ T (N ). It is easy to see that both ML (average and ancestral) and MP are deterministic under the the best–tiny problem. It is also easy to see that when the length of the sequences generated by the network tends to infinity, all network’s probabilities are greater than zero, and the network is not a tree, the probability of a deterministic method to reconstruct correctly all sites tends to a number strictly less than 1. Therefore, the notion of consistency does not apply to the best–tiny problem. Nevertheless, we can aim at reconstructing correctly the maximum number of sites and we say that a method µ is optimal if when the number of sites (sequence lengths) tends to infinity, there is no other deterministic method µ0 that reconstructs more sites than µ. In the appendix we prove that ML is optimal under average–best–tiny problem while MP is not. 3.2 ML on Networks is NP-hard Finding a phylogenetic networks when only the sequences are given is naturally hard as the big average and ancestral problems are hard for trees [1, 7]. However in our case we assume an initial phylogenetic tree and seek to improve the likelihood by adding reticulation edges. We conjecture that these variants of the problems are hard. For a first step we prove the hardness of the following versions of the problem (the proofs are in the appendix): Tiny best tree ancestral, tiny best tree average, tiny all tree ancestral, tiny all tree average, and small best ancestral tree. The results are established by a reduction from the 3SAT problem and the max-2-sat problem [16].
4
Algorithms for the Tiny Versions
The algorithms we describe in this section handle the tiny versions of the ML problem. As we showed, a polynomial time algorithm is unlikely to be found. Therefore we here devise algorithms that are not polynomial but may have good empirical performance. 4.1 The Naive Algorithm As is well-known, the tiny versions of ML problem on trees can be solved by the simple dynamic programming algorithm of Felsenstein [14] for average ML and by the algorithm of Pupko et al. [36] for the ancestral ML. Therefore the most naive solution is to
ML of Phylogenetic Networks
7
search over all the network’s trees separately and either sum their likelihood or select the optimal. The complexity of such an approach is O(n)2r for each site where r is the number of reticulation edges in the network and n is the size of the tree. As can be easily seen, this approach is fairly naive and can result in a large unnecessary computations. 4.2
The Component-Wise Naive Algorithm
We now describe a more efficient algorithm that takes into consideration independent components in the network. For a network N and a node v ∈ V (N ), let Nv denote the graph induced by the nodes reachable from v. Let u, v be two nodes in N . We say that u and v are unrelated if u ∈ / Nv and v ∈ / Nu . In order to compute likelihood in a bottomup fashion, we need the following property to hold: for every two internal nodes u, v, either Nu ⊆ Nv or Nu ∩ Nv = ∅. Indeed, this is the underlying principle in Felsenstein pruning algorithm to compute likelihood on a tree [14]. Definition 1. A biconnected component (bi-component) [17] is a subgraph induced by a maximal set of vertices W , such that in the underlying graph, there are two vertex disjoint paths between any two vertices in W (we assume all the edges are undirected). A node in a bi-component B is a leaf in B if it has no children in (the directed graph of) B. Otherwise it is internal in B. Also, a node is a root in B if it has no ancestor in B. It is easy to see that every bi-component has at least one leaf and exactly one root. Also, every two bi-components are internal-vertex-disjoint. Observation 1 Let B be a bi-component with a root r. Then for every internal node u ∈ V (B) \ {r} there is an unrelated internal node v ∈ V (B) s.t. Nu ∩ Nv 6= ∅ (and by definition Nu 6⊆ Nv ). In particular, every leaf in a bi-component has at least two parents. The above observation yields that in a bi-component, we cannot run the simple dynamic programming algorithm. On the other hand, applying the naive algorithm is costly and unnecessary as it can be avoided between different bi-components. Definition 2. Let N be a network. Then B(N ), the bi-component graph of N is a graph (V (B), E(B)) where V is the set of bi-components in N and two bi-components B1 , B2 are connected by a (directed) edge in B(N ) if (1) there is an edge in N between v1 ∈ B1 and v2 ∈ B2 , and (2) there is a vertex v ∈ B1 , B2 , and v is a leaf in B1 (and necessarily internal in B2 ). Observation 2 Let N be a phylogenetic network. Then B(N ) is a tree. Based on the above observation, a better solution is to decompose the network into bi-components, run the naive exhaustive algorithm inside every bi-component and the tree algorithm in the bi-component. Let r(B) be the number of reticulation edges in a bi-component B, B∗ the largest bi-component in N and r∗ the maximum of r(B) over all bi-components of N . Then, the above improvement reduces the complexity of the algorithm to O(|B(N )|2r∗ |B ∗ |). This improvement can be quite important as the bi-components can be sparse in a network with few reticulation edges.
8
5
Jin et al.
Heuristics for the Small and the big Versions
In contrast to the tiny problem, in the small versions the network topology is given without the probabilities. In this case we have used the standard hill climbing technique to assess the optimal parameters the given topology. For accelerating the running time for the big problems, we use a branch and bound (B&B) heuristic [35, 15]. In general, the B&B technique is based on the decreasing monotonicity of the objective function with respect to partial inputs. In other words, the value of the objective function for a certain input is not greater than any valid part of that input. The B&B principal asserts that if the value of the objective function for some partial input is smaller than some known value on the total input, it is worthless to explore all inputs extending this partial input. B&B is normally applied to NP-hard problem and can lead to very efficient running times. In our case, the input is a phylogenetic network and a set of characters and the objective function is the likelihood of the network WRT the character set. Let N = (V, E) denote a phylogenetic network. A phylogenetic network N 0 = 0 (V , E 0 ) is defined as edge separated subnetwork of N if V 0 ⊆ V , ∀(v1 , v2 ) ∈ E 0 ⇒ (v1 , v2 ) ∈ E, and there is a cut of a single edge e between (V 0 , V \ V 0 ). The following observations establish an upper bound for the likelihood of a phylogenetic network. Observation 3 If a network N 0 is a valid phylogenetic network and is an edge separated subnetwork of phylogenetic network N , then P (S|N 0 ) > P (S|N ). 5 For a set S of sequences of length k, we denote by Snm (n ≥ 0, m ≤ k) the set of these sequences restricted only to the positions n ≤ i ≤ m Observation 4 For any phylogenetic network, N and set S of sequences, we have P (Snm |N ) ≥ P (S|N ). Based on these observations, we propose the following heuristic: (1) Optimize the likelihood of sub-networks; if a candidate network contains an edge separated sub-network with low likelihood score, it is not the network with the ML score; (2) Optimize the likelihood of a network on part of the sites; if a candidate network has low likelihood score for these sites, it is not the network with the ML score.
6
Empirical Performance
We implemented our ML criteria and algorithms and tested them on both biological as well as synthetic data. For the biological data, we considered a 15-taxon dataset of plastids, cyanobacteria, and proteobacteria, which is a subset of the dataset considered by Delwiche and Palmer [9] and for which multiple HGT events were conjectured by the authors. Due to space limitations and given that our results on synthetic data are very preliminary, we omit these results; nonetheless, complete experimental results on the synthetic data will be available for the final version of the paper. Details about the biological dataset are in Section B in the appendix. 5
In the full version of this paper we prove a stronger claim. We show that a tree with τ reticulation edges, has a subtree such that when adding to this subtree τ reticulation edges it has higher likelihood than the original tree
ML of Phylogenetic Networks
6.1
9
Data Analysis
Since the organismal tree and sequence alignment are given for both the biological and synthetic datasets, our analysis amounted to solving the big version of ML on the datasets. In other words, for each dataset (biological and synthetic), we sought edges whose addition to the organismal trees gave optimal solutions to the big version of ML. Our goal was to verify whether solving this problem would identify the HGT events that Delwiche and Palmer conjectured for the biological data [9]. We also compared our results to the results of [4]. We assessed the quality of our method with respect to the number of HGT events it identified, and the actual placement of the HGT events themselves on the organismal trees. To estimate the mutation and reticulation probabilities, we used the heuristic described in Section 5, where we obtained internal sequences using ancestral ML, and then used normalized Hamming distances as probabilities. 6.2
Results and Discussion
We implemented a heuristic for solving the big ML problem with ancestral and average likelihood, considering all trees inside a network as well as the best tree. Due to space limitations, we show only the results of the version that uses ancestral likelihood and sums over all trees. Results are shown in Fig. 3. We investigated the performance of ML with respect to two main questions: (1) Does the ML criterion infer the correct number Rhodobacter capsulatus Rhodobacter sphaeroides I
4
x 10
H2 H1 H3
Nitrobacter
1.2
Thiobacillus denitrificans I β Proteobacteria
H4
Alcaligenes H16 plasmid Hydrogenovibrio II
Chromatium L
γ Proteobacteria
H6
H5
Thiobacillus ferrooxidans ATCC 19859 Prochlorococcus Synechococcus Gonyaulax Cyanophora
1.15 1.1 1.05
Cyanobacteria
1 Dinoflagellate plastid Glaucophyte Plastid
Cyanidium
Red and Brown Plastid
Euglena
Green Plastid
(a)
−log Likelihood
Thiobacillus denitrificans II
H0: 12212.2 +H1: 11538.8 +H2: 10885.5 +H3: 10341.0 +H4: 9931.3 +H5: 9519.71 +H6: 9432.43
1.25
α Proteobacteria
0.95 0
1
2 3 4 Number of HGT edges
5
6
(b)
Fig. 3. (a) The species tree of the 15 organisms, as reported by Delwiche and Palmer, and the 6 HGT edges inferred by our heuristic for the big ML problem. (b) The improvement in the likelihood score as a function of the number of HGT edges added to the species tree. The improvement achieved by adding the sixth HGT edge is smaller compared to that achieved by adding the first five events, which indicates that five HGT events suffice to model the evolution of the rbcL gene for these 15 organisms. Likelihood criterion = ancestral, and tree criterion = all.
of HGT events? (2) Does the ML criterion infer the correct location of the HGT events? For the 15-taxon biological data, Delwiche and Palmer hypothesized the occurrence of several HGT events of the rubisco genes as opposed to ancient gene duplication and loss. Our findings, based on the ML criterion, agree with the grouping of the organisms based on he forms I and II rubisco gene, as hypothesized by Delwiche and Palmer. Detailed discussion of the results, shown in 3(a) is presented in Section B in the appendix. From the definition of the ML criterion for networks it follows that as more edges are added to the network, the likelihood score either remains the same or improves, but
10
Jin et al.
never becomes worse. Therefore, a significant question to investigate was whether the improvement becomes relatively negligible at some point so that the addition of edges could be stopped (stopping rule). To answer this question, we plotted the improvement in the likelihood score as a function of the number of HGT edges added; the results are shown in Fig. 3(b). The figure shows that while adding the first five edges achieves drastic improvements in the likelihood score, adding the sixth edge results in a much lower improvement, which is indicated by the slow decrease in the likelihood score. appendix.
7
Conclusions and Future Research
Phylogenetic networks model evolutionary histories of sets of organisms in the presence of non-treelike evolutionary events such as HGT and hybrid speciation. In this paper, we introduced a first maximum likelihood framework for reconstructing and evaluating phylogenetic networks. This framework gave rise to an array of computational problems. In this paper, we addressed some of these problems, namely proving the NP-hardness of the “tiny” variants, established mathematical properties of several variants, and devised efficient heuristics and algorithms. We implemented our criteria and methods and analyzed a biological dataset of eubacteria and plastids [9], and a large set of simulated data. We have obtained very good results on the biological dataset, and preliminary results on the synthetic data have shown very good promise as well. Many issues related to statistical analysis have not been addressed such as the existence of multiple maxima or the relaxation of the i.i.d. assumption. However, as this is the first paper to define this ML approach, we believe the most important points, such as rigorous definitions, hardness of computations, preliminary heuristics and biological relevance, have been addressed. Naturally, there are still many open questions. Given the large size of the prokaryotic branch of the Tree of Life, and the estimated high rates of HGT in this branch, developing more computationally efficient algorithms and heuristics is imperative. Since horizontal transfer involves large segments of DNA, it is clear that the neighboring sites are not independent. In the future we intend to model this phenomena by using HMMs. In this work we assume each reticulation edge has a reticulation probability associated with it. We assume that the probabilities of different edges are independent. We intend to explore models with distributions of reticulation probabilities where different reticulation edges are dependent. The empirical analysis done here was aimed to demonstrate the viability of the ML criterion. In the future, we intend to use our method for analyzing new and larger biological datasets.
Acknowledgments The authors would like to thank Derek Ruths for providing them with the simulated data. We also like to thank Satish Rao for pointing out the issue of lateral edge independence. This work was supported in part by the Rice Terascale Cluster funded by NSF under grant EIA-0216467, Intel, and HP. Sagi Snir was supported in part by NSF grant CCR-0105533.
References 1. L. Addario-Berry, B. Chor, M. Hallett, J. Lagergren, A. Panconesi, and T. Wareham. Ancestral maximum likelihood of evolutionary trees is hard. Jour. of Bioinformatics and Comp. Biology, 2(2):257–271, 2004.
ML of Phylogenetic Networks
11
2. V. Bafna and V. Bansal. Improved recombination lower bounds for haplotype data. In Proceedings of the Ninth Annual International Conference on Computational Molecular Biology, pages 569–584, 2005. 3. U. Bergthorsson, K.L. Adams, B. Thomason, and J.D. Palmer. Widespread transfer of mitochondrial genes in flowering plants. Nature, 424:197–201, 2003. 4. Alix Boc and Vladimir Makarenkov. New efficient algorithm for detection of horizontal gene transfer events. In WABI, pages 190–201, 2003. 5. J.R. Brown. Ancient horizontal gene transfer. Nat. Rev. Genet., 4:121–132, 2003. 6. D. Bryant and V. Moulton. NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. In R. Guigo and D. Gusfield, editors, Proc. 2nd Workshop Algorithms in Bioinformatics (WABI’02), volume 2452 of Lecture Notes in Computer Science, pages 375–391. Springer Verlag, 2002. 7. B. Chor and T. Tuller. Maximum likelihood of evolutionary trees is hard. RECOMB, pages 296–310, 2005. 8. Husmeier D. and McGuire G. Detecting recombination with mcmc. Bioinformatics, 18:345– 353, 2002. 9. C. F. Delwiche and J. D. Palmer. Rampant horizontal transfer and duplicaion of rubisco genes in eubacteria and plastids. Mol. Biol. Evol, 13(6), 1996. 10. W.F. Doolittle, Y. Boucher, C.L. Nesbo, C.J. Douady, J.O. Andersson, and A.J. Roger. How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Phil. Trans. R. Soc. Lond. B. Biol. Sci., 358:39–57, 2003. 11. J.A. Eisen. Assessing evolutionary relationships among microbes from whole-genome analysis. Curr. Opin. Microbiol., 3:475–480, 2000. 12. I.T. Paulsen et al. Role of mobile DNA in the evolution of Vacomycin-resistant Enterococcus faecalis. Science, 299(5615):2071–2074, 2003. 13. J. Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool., 22:240–249, 1978. 14. J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981. 15. J. Felsenstein. Phylip (phylogeny inference package) version 3.5c. distributed by the author. department of genetics, university of washington, seattle. 1993. 16. M. R. Garey and D. S. Johnson. Computer and Intractability. Bell Telephone Laboratories, incorporated, 1979. 17. M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980. 18. N.C. Grassley and E.C. Holmes. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol Biol Evol, 14:239–247, 1997. 19. D. Gusfield and V. Bansal. A fundamental decomposition theory for phylogenetic networks and incompatible characters. In Proceedings of the Ninth Annual International Conference on Computational Molecular Biology, pages 217–232, 2005. 20. M. Hallett, J. Lagergren, and A. Tofigh. Simultaneous identification of duplications and lateral transfers. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, pages 347–356, 2004. 21. D. H. Huson and D. Bryant. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol, 23(2):254–267, 2006. 22. R. Jain, M.C. Rivera, J.E. Moore, and J.A. Lake. Horizontal gene transfer accelerates genome innovation and evolution. Molecular Biology and Evolution, 20(10):1598–1602, 2003. 23. G. Jin, L. Nakhleh, S. Snir, and T. Tuller. Parsimony of phylogenetic networks: Hardness results and efficient algorithms and heuristics. submitted, 2006. 24. T. H. Jukes and C. R. Cantor. Evolution of protein molecules. in mammalian-protein metabolism. H. N. Munro(Ed), 121-132 New York: Academic Press, 1969.
12
Jin et al.
25. C.G. Kurland. Something for everyone—horizontal gene transfer in evolution. Embo Reports, 1(2):92–95, 2000. 26. J.A. Lake, R. Jain, and M.C. Rivera. Mix and match in the tree of life. Science, 283:2027– 2028, 1999. 27. C.R. Linder, B.M.E. Moret, L. Nakhleh, and T. Warnow. Network (reticulate) evolution: biology, models, and algorithms. In The Ninth Pacific Symposium on Biocomputing (PSB), 2004. A tutorial. 28. V. Makarenkov, D. Kevorkov, and P. Legendre. Phylogenetic network reconstruction approaches. Applied Mycology and Biotechnology (Genes, Genomics and Bioinformatics), 6, 2005. To appear. 29. B.M.E. Moret, L. Nakhleh, T. Warnow, C.R. Linder, A. Tholse, A. Padolina, J. Sun, and R. Timme. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):13–23, 2004. 30. L. Nakhleh, G. Jin, F. Zhao, and J. Mellor-Crummey. Reconstructing phylogenetic networks using maximum parsimony. Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference (CSB2005), 393:440–442, Jun 2005. 31. L. Nakhleh, D. Ruths, and H. Innan. Gene trees, species trees, and species networks. In R. Guerra and D. Allison, editors, Meta-analysis and Combining Information in Genetics. Chapman & Hall, CRC Press, 2005. In press. 32. L. Nakhleh, J. Sun, T. Warnow, C.R. Linder, B.M.E. Moret, and A. Tholse. Towards the development of computational tools for evaluating phylogenetic network reconstruction methods. In Proceedings of the Eighth Pacific Symposium on Biocomputing (PSB 03), 2003. 33. L. Nakhleh, T. Warnow, and C.R. Linder. Reconstructing reticulate evolution in species: theory and practice. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, pages 337–346, 2004. 34. J. Neyman. Molecular studies of evolution: A source of novel statistical problems. In S. Gupta and Y. Jackel, editors, Statistical Decision Theory and Related Topics, pages 1–27., 1971. Academic Press, New York. 35. D. Penny and M. D. Hendy. Turbotree: a fast algorithm for minimal trees. Comput Appl Biosci, 3(3):183–187, 1987. 36. T. Pupko, I. Pe’er, R. Shamir, and D. Graur. A fast algorithm for joint reconstruction of ancestral amino-acid sequences. Mol. Biol. Evol, 17(6):890–896, 2000. 37. M. Steel and D. Penny. Parsimony, likelihood and the role of models in molecular phylogenetics. Mol. Biol. Evol., 17:839–850, 2000. 38. K. Strimmer and V. Moulton. Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol., 17(6):875–881, 2000. 39. A. von Haeseler and G. A. Churchill. Network models for sequence evolution. J. Mol. Evol., 37:77–85, 1993.
A A.1
Properties of the ML Criterion Optimality and Identifiability
Proposition 1. ML is optimal under average–best–tiny problem while MP is not. Proof. Let λ0 be some character (state assignment) on the leaves of the network N . Let T0 ∈ T (N ) be the tree maximizing the probability of obtaining λ0 . That is Lav (λ0 |T0 , p) = maxT ∈T (N ) Lav (λ0 |T, p). By definition of the average–best–tiny problem, ML will always choose T0 for a site Si exhibiting λ0 . By definition, λ0 will be mostly generated
ML of Phylogenetic Networks
13
by T0 . Since the above arguments are true for every state assignment λ0 , we obtain the first part of the proposition. It remains to show that MP fails to maximize the correct number of reconstructed sites for even a single character. Consider a quartet tree ((1, 2), (3, 4)) with the pendant edges leading to taxa 1 and 4 having probability 12 − ε and the pendant edges leading to 2 and 3 and the internal edge having probability ε. We now add a reticulation edge from the pendant edge leading to 4 to the pendant edge leading to 1, whose two endpoints coincide with the endpoints of the tree edges (i.e., between the taxa 4 and 1. See Fig. 4). We denote the underlying tree by T and the tree obtained by the reticulation event by T 0 . The proof proceeds along lines similar to the classical result of Felsenstein [13]. The probability of obtaining a character on T is a sum of four expressions depending on the internal assignment at the internal nodes. In particular, for the character ((0, 1), (1, 0)) we have at every such expression either the factor ( 21 − ε) or ( 12 + ε) for the pendant edge leading to 1. This factor is absent in the ML expressions for T 0 . Therefore, we can bound the ratio between the likelihood of obtaining ((0, 1), (1, 0)) on both trees by a function of ε alone. Now, for every such ε we can find a reticulation event probability ε2 such that the overall probability of obtaining ((0, 1), (1, 0)) on T is bigger than on T 0 . Specifically, let P0 and P00 be the probabilities of obtaining λ0 given the trees T and T 0 , respectively. Then we observe P0 =
)| 2|I(T X
i=0
1 + (−1)i ε · ti 2
(6)
, where ti is the product of the appropriate probabilities on the tree edges (other than |I(T )| the pendant edge to 1), which stems from the internal assignment a ∈ {0, 1}2 . Therefore, )| 2|I(T X
1 + (−1)i ε · ti P0 = 2 i=0 1 − ε P00 . = 2
>
1 −ε 2
)| 2|I(T X
ε · ti
i=0
(7)
Now, the probability of obtaining λ0 on T and T 0 given the network N is (1 − ε2 )P0 and ε2 P00 respectively, where ε2 is the reticulation edge probability. We require (1 − ε2 )P0 > ε2 P00 ; however, by Equation 7 it suffices to require 1 (1 − ε2 ) − ε P00 ≥ ε2 P00 . (8) 2 A simple calculation leads to the conclusion that for every ε2 ≤
1 − 2ε 3 − 2ε
the requirement holds. We end by commenting that MP will always choose the reticulation edge and return the tree ((1, 4), (3, 2)) (regardless of the probability ε2 ).
14
Jin et al.
2
3
1
4
Fig. 4. Although most ((0, 1), (1, 0)) characters will be produced on the underlying tree, MP will choose the tree containing the reticulation edge.
Eventually, it can be shown that for any other location of the reticulation edge along these two pendant edges, an even larger ε2 suffices. In order to show that MP fails to satisfy network identifiability, we use the same network as in Fig. 4. Observation 5 Let N be the network in Fig. 4 and let N 0 be the network obtained from N by swapping taxa 2 and 4. Then for every character λ0 , the parsimony score on N and N 0 is the same. The above observation yields that MP cannot distinguish between N and N 0 even when the number of sites is very large. A.2
ML on Networks is NP-hard
Problem 1. 3-Satisfiability (3SAT) [16] Input: A formula F over a set U of variables, collection C of clauses over U such that each clause c ∈ C has |c| = 3. Question: Is there a truth assignment for U that simultaneously satisfies all clauses in C? Given such a formula F we construct a network N (F ) as follows. In the sequel, we use x is connected to y to indicate that x is a child of y. 1. A root R with a 1-leaf child. 2. For every variable we crate a node connected to a diamond loop as depicted in Fig. 5. to R
0
0
0
0
1
0
x 1
1
Fig. 5. For every variable we create a diamond gadget.
ML of Phylogenetic Networks
15
3. For every clause ci = (i∨j∨k) we generate a 1-leaf, called clause leaf, and connect it by a tree edge to variable node i and by a reticulation edge to an intermediate node c0i . Intermediate node c0i is connected by a tree edge to variable node j and by a reticulation edge to an intermediate node c00i . Finally, intermediate node c00i is connected by a tree edge to variable node k (see Fig. 6). The length of a (tree) edge emanating from a variable node is 1 if the literal is negated and 0 otherwise. x
y
z 0 0
1
0
0 1
Fig. 6. A clause gadget for the clause (¯ x ∨ y ∨ z). (complementing 1-leaves removed).
4. Finally, we create a complementing 1-leaf child connected by a 0 long edge to every internal node with out-degree 1. A complete network representing the formula (¯ x ∨ y ∨ z) ∧ (x ∨ y¯ ∨ w) ¯ (without the complementing 1-leaves) is shown in Fig. 7.
R 0
0
0
0
0
diamond gudget x
diamond gudget y 1 0 1
0
diamond gudget z
diamond gudget w
0
1
0
0
0 0
1
1
1
Fig. 7. The reduction from 3-SAT to the tiny best ancestral likelihood. Each variable has a node to which all literals of that variable are connected. In the figure we see the sub network representing the formula (¯ x ∨ y ∨ z) ∧ (x ∨ y¯ ∨ w). ¯
Observation 6 The network N (F ) is a valid phylogenetic network.
16
Jin et al.
Proof. We observe the following: – There is a single internal node with in-degree 0. – Every internal node has in-degree > 1. – Every node with in-degree > 1 has exactly one entering tree edge and the rest are 0-length reticulation edges. – The temporal constraint property holds. Let S denote the leaf assignment under the reduction. Then we get the following claim: Claim. F has a satisfying assignment if and only if there is a tree T ∈ T(N ) and internal assignment a ∈ {0, 1}r s.t. L(S|T, a, p) > 0. Proof. We begin with some auxiliary observations. Observation 7 For any tree T in the network and internal assignment a, L(S|T, a, p) > 0 if and only if the root R and all non variable nodes (nodes which are not assigned to variables) must have internal assignment 1. We first show that if F has a satisfying assignment, then there is a tree T ∈ T(N ) and internal assignment a ∈ {0, 1}r s.t. L(S|T, p) > 0. For every variable we assign its value from the satisfying assignment. For every clause ci , we choose the path from the 1-leaf representing ci to the literal satisfying it. Note that if that literal is negated, then the edge emanating from it has substitution probability 1. Now, every variable is connected to the root by either the 0 − 0 path in the diamond if the variable is assigned 1 or the 1 − 0 path otherwise. Finally, if we set the probability of every reticulation edge to be positive, we get the desired result. To prove the other direction, assume there exists a tree T ∈ T(N ) and internal assignment a ∈ {0, 1}r s.t. L(S|T, p) > 0. Then by Observation 7, all non-variable internal nodes are forced to the value 1. Observation 8 For every variable v in T , all emanating edges from v in T have the same probability which is either 0 if v has assignment 1 or 1 otherwise. Since every leaf must be connected to the root we get that every clause is satisfied. We comment that in the final tree, few internal nodes may remain with out-degree 0 and hence disappear in the resulting tree. In addition, other internal nodes can remain with out-degree 1 and contracted with the resulting obvious probability on the new edge. Theorem 9. The tiny best tree ancestral sequences of phylogenetic networks is NPhard. Proof. It is easy to see that the problem is in NP, since given a tree T it is easy to check if T ∈ T(N ) and subsequently calculate its likelihood. Additionally, the reduction of Claim A.2 is performed in time polynomial in the size of F . Corollary 1. The tiny best tree average likelihood of phylogenetic networks is NP-hard.
ML of Phylogenetic Networks
17
Proof. Using the same reduction as in Claim A.2, and Observation 7, we see that for a tree to obtain likelihood greater than zero, all internal nodes are forced to 1. Therefore, the only freedom is in the variable nodes and by Observation 8 each such assignment defines a truth assignment to the variables in F . By using similar arguments to Corollary 1 we also obtain the two following hardness results: Corollary 2. The tiny all trees ancestral likelihood and the tiny all trees average likelihood of phylogenetic networks are NP-hard. All the above results set the complexity of the tiny versions of the network likelihood problems. The next reduction deals with the “small” problem where the network is given but we seek to find the tree and the edge probabilities that maximize the likelihood over the whole trees of the network. For the reduction we use the same construct we used in [23]. We also restrict the edge probabilities to some interval [0, p] for p < 1. We prove the hardness of the small best tree ancestral ML problem by a reduction from the Maximum 2-Satisfiability (max-2-sat) problem [16], which is formally defined as follows. Problem 2. Maximum 2-Satisfiability (max-2-sat) Input: Set U of variables, collection C of clauses over U such that each clause c ∈ C has |c| = 2, and a positive integer K ≤ |C|. Question: Is there a truth assignment for U that simultaneously satisfies at least K of the clauses in C? We start with a lemma which will be used in our main proof. Let a “True-True” denote a clause that has no negated literals, “True-False” denote a clause that has exactly one negated literal, and “False-False” denote a clause in which both literals are negated. For the “True-True” clause we generate the subnetwork shown in Fig. 8 on the right, For the “True-False” clause we generate the subnetwork shown in Fig. 8 on the left and for the “False-False” we generate the same network as for “True-True” but flip the value of the leaves. Lemma 1. [23] (1) The minimal number of substitutions is 3 for a “True-True” network is obtained by labelling x = 1, y = 1, or both. Otherwise,it is 4. (2) The minimal number of substitutions is 3 for a “True-False” network is obtained by labelling x = 0, y = 1, or both. Otherwise, it is 4. (3) The minimal number of substitutions is 3 for a “False-False” network is obtained by labelling x = 0, y = 0, or both. Otherwise, it is 4. Given a formula F as input to max-2-sat, we create a node for every variable and connect every variables to two ancestral nodes and these two to the root. In addition, for every clause ci we generate the appropriate subnetwork as in Fig. 8 and connect it to the variables involved. Fig. 9 shows a complete network generated for a specific clause. Since this version of problem deals with the ancestral version where every internal node is assigned a value of either 0 or 1, we get the following observation:
18
Jin et al.
Fig. 8. Part of the reduction from max-2-sat to small best tree ancestral likelihood.
Fig. 9. The complete network generated for the clauses (x ∨ w), (w ∨ y¯), (x ∨ z¯).
Observation 10 At any optimal tree, the probability of every edge will be set to either zero or p. Observation 11 Unless all variables have the same value (0 or 1) there is exactly one substitution on the subnetwork above the variable nodes. Therefore, WLOG, we will assume the optimal assignment has both values. Claim. Given a set of clauses C over variables X, as input to max-2-Sat, X has an assignment which k clauses are satisfied, if and only if the network constructed has a tree with likelihood p4|C|−k+1 . Proof. ⇒ By Lemma 1 there are 4|C| − k substitutions at the networks below the variable nodes and by Observation 11 exactly one more above the variable nodes. Now, by
ML of Phylogenetic Networks
19
Observation 10 the edges on which a substitution occurs get probability p and the others 0. This yields the desired result. The proof of the other direction proceeds similarly. Theorem 12. The small best ancestral tree is NP-hard.
B B.1
Empirical Performance: Further Analysis Data: further details
The 15-taxon rbcL dataset of [9] consists of two Form I sequences from the α, β, and γ-proteobacteria groups, two from cyanobacteria, one from green plastids, one from red plastids, one cyanophora, and four Form II rubisco sequences. For this dataset, we obtained the species (organismal) tree which was reported in [9, 4]. The species tree is based 16S rRNA and other evidence. We analyzed the rubisco gene rbcL of these 15 organisms. The gene dataset consists of 15 aligned amino acid sequences, each of length 532 (the alignment is available from http://www.life.umd.edu/labs/delwiche/alignments/rbcLgb7-95.distrib.txt). B.2
Analysis of the results in Fig. 3
(1) The two most significant HGT edges are H1 and H2 in Fig. 3(a), and they group the form II rubisco (Thiobacillus denitrificans II and Hydrogenovibrio II) of the β and γ proteobacteria, respectively, together with the form II rubisco (Rhodobacter capsulatus) of the α proteobacteria. This result agrees with the grouping of these three organisms, based on the form II rubisco, into one clade, as shown in Fig. 2 of [9]. (2) The third most significant HGT edge is H3 in Fig. 3(a), and it indicates an HGT of the red type form I rubisco from the α proteobacteria (Rhodobacter sphaeroides I) to the β proteobacteria (Alcaligenes H16 plasmid). This result agrees with the grouping of all red type form I rubisco genes in one clade in Fig. 2 of [9]. (3) The fourth most significant HGT edge is H4 in Fig. 3(a), which completes the grouping of the form II rubisco genes, along with H1 and H2, by indicating an HGT of the form II rubisco from Rhodobacter capsulatus to Gonyaulax (an α proteobacteria and plastid, respectively). This result is also supported by the single clade of all form II rubisco in Fig. 2 of [9]. (4) The fifth most significant HGT edge is H5 in Fig. 3(a), which indicates an HGT of the the form I rubisco from the the red and brown plastids (Cyanidium) to the α proteobacteria (Rhodobacter sphaeroides I). This HGT event is in agreement with the grouping of the red type form I rubisco in Fig. 2 of [9]. (5) The last HGT edge is H6 in Fig. 3(a) which groups the red and brown plastids with the α, β, and γ proteobacteria. Such grouping is supported by the red type form I rubisco group in Fig. 2 of [9]. To summarize, the HGT edges computed by our heuristic agree with the grouping of the organisms based on the forms I and II rubisco, as hypothesized by Delwiche and Palmer. We also compared our results to the results reported by Boc and Makarenkov [4], they used a distance method for discovering HGT and analyzed a dataset similar to ours. Three of our edges (H1 , H3 , H4 ) appeared in [4]. Two other edges (H2 , H5 ) appeared also in [4], but in the opposite direction. One edge, H6 , appeared in our results but didn’t
20
Jin et al.
appear in the results of [4]. In general, our results are similar to the result reported by Boc and Makarenkov, but simpler. Our results included 6 HGTs, while the solution of Boc and Makarenkov included 8 HGTs.