Syst. Biol. 54(4):548–561, 2005 Copyright © Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150590950371
Simultaneous Statistical Multiple Alignment and Phylogeny Reconstruction
ROLAND FLEISSNER,1,3 DIRK METZLER,2 AND ARNDT VON HAESELER3,4
1 Initiative for Bioinformatics and Evolutionary Studies (IBEST), c/o Department of Mathematics, P.O. Box 441103, University of Idaho, Moscow, Idaho 83844-1103, USA; E-mail: [email protected]
2 FB Biologie und Informatik, Goethe-Universität, Robert-Mayer-Str. 11-15, D-60054 Frankfurt am Main, Germany
3 Institut für Bioinformatik, Heinrich-Heine-Universität, Universitätsstr. 1, D-40225 Düsseldorf, Germany
4 John-von-Neumann-Institute for Computing, Forschungsgruppe Bioinformatik, D-52425 Jülich, Germany
When studying a set of related sequences, we are interested in several questions: What does the underlying tree look like? Which positions in the sequences are homologous? What are the characteristics of the evolutionary process that led to these sequences? How reliable are our deductions? We normally try to tackle these questions by first constructing an optimal multiple alignment on which we then base all our inference of the tree topology and the parameters of the substitution model we prefer. Due to highly elaborate alignment programs (Thompson et al., 1994) and tree reconstruction methods (cf. Swofford et al., 1996; Felsenstein, 2004), this way of proceeding is fast and probably works well in most cases. Nevertheless, it has several drawbacks: Firstly, as we did not attempt to model the insertion and deletion dynamics explicitly, we can hardly make any statements about the nature of this process. Secondly, in order to find an optimal alignment we have to specify mismatch and gap penalties, thus making implicit assumptions about the plausibility of substitutions, insertions, and deletions (Fleißner et al., 2000). Thirdly, multiple alignment methods either align the sequences according to a prespecified phylogenetic tree or ignore their evolutionary history (cf. Gotoh, 1999). Either way, our estimate of the phylogeny will be affected. Finally, our knowledge of the statistics of alignment scores and of the behavior of the heuristics that are used to come close to the optimum is at the moment very limited. Thus the problem of phylogenetic analysis is twofold: Our estimates of phylogenies and model parameters are influenced by the multiple alignments on which we base our studies, yet we do not know how to judge alignment errors and thus have no clue about the reliability of our estimates. Statistical alignment procedures stand on the sound base of explicit models for the insertion and deletion process. 
The most popular of these models, TKF1, was introduced in Thorne et al. (1991). For given parameters, this model can be used to calculate the joint probability of a pair of sequences or to find their most probable
alignment (Thorne et al., 1991). Furthermore, it allows maximum likelihood estimation of the model parameters (Thorne and Churchill, 1995) and the assessment of their reliability (Thorne and Churchill, 1995; Metzler et al., 2001). Recently, the TKF1 model for two sequences has been generalized, leading to an algorithm that gives the probability of n sequences that have evolved on a star-shaped tree (Steel and Hein, 2001) or on a binary tree (Hein, 2001). There is now also a method that samples multiple alignments according to their posterior probability under this extension of the TKF1 model conditioned on a given phylogeny and parameter values (Holmes and Bruno, 2001). The TKF1 model, however, is not very realistic in that it only allows for insertions and deletions of single nucleotides (or amino acids). This deficiency was remedied in Thorne et al. (1992) by the TKF2 model. Here, we first give a brief overview of the TKF2 model and provide a possible extension to the case of n sequences that are related by a tree. Then, we utilize this model to simultaneously construct a multiple alignment, reconstruct the sequences’ phylogeny, and estimate the mutation parameters.
MODEL
If we want to treat sequence alignment in terms of the evolutionary dynamics that act on DNA or protein sequences, we have to describe the substitution process as well as the process of insertions and deletions. We first focus on the insertion and deletion process, which is the aforementioned TKF2 model. This model decomposes a sequence into disjoint subsequences called fragments. The lengths of these fragments follow a geometric distribution with expectation $(1 - \rho)^{-1}$, where ρ ∈ [0, 1). Between any two fragments and at the ends of the sequence a new fragment whose length is drawn from the same distribution may be inserted with rate λ. At rate µ each fragment is deleted entirely. Thus, the fragmentation structure remains unchanged: the fragments
Downloaded from http://sysbio.oxfordjournals.org/ by guest on December 31, 2015
Abstract.—Although the reconstruction of phylogenetic trees and the computation of multiple sequence alignments are highly interdependent, these two areas of research lead quite separate lives, the former often making use of stochastic modeling, whereas the latter normally does not. Despite the fact that reasonable insertion and deletion models for sequence pairs were already introduced more than 10 years ago, they have only recently been applied to multiple alignment and only in their simplest version. In this paper we present and discuss a strategy based on simulated annealing, which makes use of these models to infer a phylogenetic tree for a set of DNA or protein sequences together with the sequences’ indel history, i.e., their multiple alignment augmented with information about the positioning of insertion and deletion events in the tree. Our method is also the first application of the TKF2 model in the context of multiple sequence alignment. We validate the method via simulations and illustrate it using a data set of primate mtDNA. [Multiple sequence alignment; statistical alignment; TKF model; tree reconstruction.]
2005
FLEISSNER ET AL.—STATISTICAL ALIGNMENT AND TREE RECONSTRUCTION
are always deleted as a whole and are never split by an insertion (see also Fig. 1). Although this may be considered unrealistic, the results in Metzler (2003) indicate that, at least in the case of a sequence pair, the estimation of the model parameters is quite robust against violation of this assumption. Assuming that the insertion rate λ is smaller than the deletion rate µ, we get an equilibrium distribution for the number of a sequence’s possible insertion sites, i.e., the number of fragments plus one, which is the geometric distribution with expectation µ/(µ − λ), and thus we also have an equilibrium distribution for the sequence length. It is further assumed that the length of the ancestral sequence was taken from the equilibrium distribution of the process. The TKF2 insertion-deletion dynamics can be married to any of the reversible, continuous-time, and stationary Markov processes that are normally used to model the substitution process (cf. Tavaré, 1986). Such a substitution model is defined by a rate matrix Q, which is a 4 × 4 matrix for nucleotide sequences and a 20 × 20 matrix for proteins. Instead of referring to the monomers that constitute the sequences as nucleotides or amino acids, we will henceforth simply call them characters. The off-diagonal elements $q_{ij} > 0$ of Q specify the instantaneous rates at which character i is replaced by character j. Reversibility of the substitution process implies that these off-diagonal entries can be written as $q_{ij} = \pi_j b_{ij}$, where $\pi_j$ is the equilibrium frequency of character j and the $b_{ij} = b_{ji}$ are functions of the model’s parameters (cf. Strimmer and von Haeseler, 2003). The diagonal entries $q_{ii}$, on the other hand, are given by $-\sum_{j \neq i} q_{ij}$, thus making the row sums equal to zero. This substitution process is responsible for assigning characters to the positions of the fragments: Whenever a fragment is inserted or if it has already been present at the beginning of the insertion-deletion process, its characters are drawn from the stationary distribution of the substitution process and during the fragment’s life span its characters evolve according to Q. As the evolution model used in this paper consists of two processes acting side by side, we have to adjust their relative speeds. Throughout this paper, we therefore scale time such that the expected rate of substitutions is $-\sum_i \pi_i q_{ii} = 0.01$. Let us now look at two sequences, s1 and s2, that have evolved from an unknown common ancestor and let t be the sum of the evolutionary times that lie between the ancestral sequence and its two children. Due to reversibility of the substitution and the insertion-deletion process (cf. Metzler et al., 2005), we can regard sequence s1 as the ancestor and sequence s2 as its descendant. In order to compute the probability that these two sequences have arisen during evolution, we would have to sum over all possible realizations of the above model, which produce sequence s1 and which then let it evolve to sequence s2 during time t. Thorne et al. (1991, 1992) suggest a way of looking at the indel process that is essential for the computational tractability of this problem: Instead of seeing insertions as occurring between two neighboring fragments, each inserted fragment is considered to be the offspring of its left neighbor. If the insertion happens at the very left end of the sequence, it is declared to be the child of an imaginary immortal position at the beginning of the sequence.
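The construction of the reversible rate matrix described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_rate_matrix`, the list-of-lists representation, and the example values of π and b are all hypothetical. The sketch sets $q_{ij} = \pi_j b_{ij}$ off the diagonal, fills the diagonal so that every row sums to zero, and rescales time so that $-\sum_i \pi_i q_{ii} = 0.01$.

```python
def build_rate_matrix(pi, b, expected_rate=0.01):
    """Return a reversible rate matrix Q with q_ij = pi_j * b_ij for i != j.

    pi: equilibrium frequencies (should sum to 1).
    b:  symmetric matrix of exchangeabilities (b[i][j] == b[j][i]).
    The result is scaled so that -sum_i pi_i * q_ii == expected_rate.
    """
    n = len(pi)
    # Off-diagonal entries; the diagonal is filled in afterwards.
    q = [[pi[j] * b[i][j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # makes each row sum to zero
    # Expected number of substitutions per unit time before rescaling.
    current_rate = -sum(pi[i] * q[i][i] for i in range(n))
    scale = expected_rate / current_rate
    return [[entry * scale for entry in row] for row in q]


# Hypothetical example values: unequal frequencies, all b_ij equal.
example_pi = [0.1, 0.2, 0.3, 0.4]
example_b = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
example_q = build_rate_matrix(example_pi, example_b)
```

Because the $b_{ij}$ are symmetric, the resulting matrix satisfies detailed balance, $\pi_i q_{ij} = \pi_j q_{ji}$, which is the reversibility property the text relies on.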
Each fragment of the descendant sequence s2 can then be traced back to a fragment in the ancestral sequence s1 or to the imaginary position to the left of s1, as it is either the same fragment or it was inserted to the right of an ancestral fragment or to the right of a fragment that was inserted to the right of an ancestral fragment and so forth. Thus, the insertion-deletion process produces a genealogy of fragments for every fragment of the ancestral sequence and for the imaginary position at the left end of the sequences (cf. Metzler, 2003). In order to properly represent this kinship structure in the pairwise alignment of s1 and s2, each fragment of the descendant sequence is written as close as possible to its ancestor. This alignment notation rule together with the TKF2 model’s fixed fragmentation structure and the geometric distribution of the fragment lengths makes it possible to describe the evolution of s1 and s2 in the form of a pair Hidden Markov Model or pair-HMM (Durbin et al., 1998) with the states “homologous sites”, “gap in sequence s1”, “gap in sequence s2”, “start of the alignment”, and “end of the alignment.” Using pair-HMM techniques we can compute the joint probability of s1 and s2 by summing over all paths through the HMM that might have produced them, and computing the probability of one realization of the insertion-deletion process simply becomes a matter of parsing the corresponding pairwise alignment from left to right. Thorne et al. (1992), Metzler (2003), and Holmes
FIGURE 1. a, The possible fate of a sequence’s positions in the TKF2 model. The model allows for deletions of one or more bases at the same time (symbolized with ticked boxes) and for the insertion of one or more bases (here AAG). Additionally, each of the positions may be hit by a substitution (ticked circles). b, A situation that is not allowed in the TKF2 model. First the bases CGG are inserted and then the last two bases of this insertion are deleted together with their right neighbor.
SYSTEMATIC BIOLOGY
and Bruno (2001) describe in detail why this equivalence of the TKF2 insertion-deletion dynamics and the pair-HMM representation holds. For the purpose of this paper it is only important that the TKF2 pair-HMM is governed by the following probabilities and their complements: ρ the probability to stay in the same fragment, λ/µ the probability that the ancestral sequence is prolonged by another fragment, α(t) the probability that a given fragment has not been deleted during time t, β(t) the probability with which the sequences’ left end or an ancestral fragment that already has some progeny
VOL. 54
in the descendant sequence has another child, and γ (t) the probability that a deleted fragment has at least one child in the descendant sequence. See Holmes and Bruno (2001) on how to compute the last three values from λ, µ, and t. Extending the HMM idea to so-called multiple or evolutionary HMMs, which describe the evolution of several sequences along a bifurcating tree, is straightforward (see Holmes and Bruno, 2001, for the case ρ = 0). The multiple HMM given in Figure 2, for example, produces an alignment of four sequences that are related by the
FIGURE 2. The fixed-fragmentation TKF2 evolutionary HMM that produces four sequences, w, x, y, and z, that are related by the tree shown in Figure 3, sequence w being considered to be the ancestral sequence. States in this HMM emit positions only in those sequences whose name appears in their labelling. For simplicity, we write αi , βi , and γi instead of α(ti ), β(ti ), and γ (ti ) (see text).
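The quantities α(t), β(t), and γ(t) that label the HMM transitions can be computed from λ, µ, and t. The text defers to Holmes and Bruno (2001) for the formulas, so the sketch below uses the closed forms commonly quoted for the TKF links process and should be treated as an assumption to be checked against that reference. The function name `tkf_probabilities` is hypothetical.

```python
import math


def tkf_probabilities(lam, mu, t):
    """Return (alpha, beta, gamma) for insertion rate lam, deletion rate mu, time t.

    Assumed closed forms (to be verified against Holmes and Bruno, 2001):
      alpha = exp(-mu * t)
      beta  = lam * (1 - exp((lam - mu) * t)) / (mu - lam * exp((lam - mu) * t))
      gamma = 1 - mu * beta / (lam * (1 - alpha))
    """
    if not (0.0 < lam < mu):
        raise ValueError("the model requires 0 < lam < mu")
    if t <= 0.0:
        raise ValueError("t must be positive")
    alpha = math.exp(-mu * t)            # fragment survives time t
    decay = math.exp((lam - mu) * t)
    beta = lam * (1.0 - decay) / (mu - lam * decay)   # one more child
    gamma = 1.0 - mu * beta / (lam * (1.0 - alpha))   # deleted, yet has a child
    return alpha, beta, gamma
```

For very small values of (µ − λ)t the difference `1.0 - decay` loses precision; a production version might use `math.expm1` instead.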
TKF2 model is not only a step towards more biological realism, but it also makes it possible to compute a tree’s likelihood for a certain indel history through pruning of the tree (Felsenstein, 1981) without tedious bookkeeping of the fragmentation structure. We therefore regard the variable-fragmentation model as the adequate description of the insertion-deletion process on a tree. Yet, the variable-fragmentation HMM for the tree in Figure 3 would be much more intricate than Figure 2. Therefore, some parts of our method still make use of the fixed-fragmentation TKF2 model, as will be described in the next section.
tree shown in Figure 3. Note that only three of the emitted sequences correspond to the leaves of the tree, whereas the fourth one (x) lies at an internal node. Thus, a path through a multiple HMM leads to a hypothesis about the evolution of the sequences in a data set that is more detailed than their multiple alignment. It also specifies where in the tree positions got inserted or deleted. Therefore, we call such a data structure the indel history of the sequences. To understand the architecture of this HMM, consider first the situation that a fragment in x has children in y as well as in z; i.e., there have been insertions on neighboring branches of a tree. Because the TKF2 model does not provide a rule in which order we should write down these independent insertions, one has to introduce an arbitrary order on the subtrees and record insertions on the “left” subtrees before the ones on the “right” subtrees in a depth-first tree traversal (Hein, 2001). In our case, insertions that happen on the branch leading to y come before insertions occurring on the branch that connects x and z. This makes it, however, necessary to double the number of HMM states that correspond to inserted positions in y. Another characteristic of this HMM is the fact that a deletion on the branch between w and x cannot be followed by insertions on the branches leading to y and z, as these insertions would not have a parental fragment. Note also that there is no state that emits characters in w, y, and z, but not in x, because this would mean aligning positions that are not homologous. Another problem we have to deal with is the fragmentation of the sequences. The HMM in Figure 2 keeps the fragmentation structure fixed over the whole tree and one might call this the fixed-fragmentation TKF2 model. Alternatively, one can fix the fragmentation only for individual branches and allow it to change at the internal nodes (Metzler et al., 2005). This variable-fragmentation
FIGURE 3. Star phylogeny that shows the relationship between the four sequences emitted by the HMM in Figure 2.
ALGORITHM
Given an evolutionary model that includes the insertion and deletion processes, the obvious next step would be the inference of the maximum likelihood tree given the unaligned sequences. Although summing over all conceivable alignments is possible (Lunter et al., 2003b), the necessary computing time grows exponentially with the number of sequences. Even if we fix the sequences’ alignments, the summation over all possible indel histories (Lunter et al., 2003a) seems to be too slow to be used for a problem that implies searching tree space, optimizing parameters, and sampling multiple alignments. Therefore, we pursue a slightly different goal and look for the combination of model parameters, phylogeny, and indel history that has the highest posterior probability given the sequences. Our approach rests upon the techniques introduced by Holmes and Bruno (2001) in the context of sampling indel histories under the TKF1 model for a given phylogeny. Here, we extend their methods to the TKF2 model and add a procedure that changes the tree. Our method intertwines two simulated-annealing problems (Kirkpatrick et al., 1983), one which maximizes the posterior probability of the parameters given the sequences, their indel history, and their phylogeny and another one to maximize the posterior probability of the indel history and the phylogeny given the sequences and the parameters. Each run of the program starts with an unrooted and bifurcating guide tree and arbitrary parameter values for the substitution and insertion-deletion processes according to which sequences are aligned progressively. Then, two steps alternate until one no longer observes significant changes in the posterior probabilities: In step 1 we propose new parameters, in step 2 we propose a new tree and indel history. The probability to accept bad proposals is decreased monotonically from iteration to iteration according to the annealing schedule from Salter and Pearl (2001).
Both the tree optimization and the parameter optimization have an annealing scheme of their own whose parameters are determined in a short burn-in phase at the beginning of each run. We added two additional features to this simulated-annealing approach. Firstly, as we do not want the search to start in a very unlikely area of tree space, the initial tree is not picked at random but instead it is determined by neighbor joining (Saitou and Nei, 1987).
Secondly, in order to decrease the chance of the search getting stuck at a local optimum, we conduct nearest-neighbor interchanges whenever it has converged. All these points will be addressed in more detail in the following paragraphs.
FIGURE 4. The five-taxon tree used as an example to describe our algorithm. The labels si refer both to the nodes of the tree as well as to the sequences at these nodes.
HMM produces three sequences (the ancestral sequence s5 and its two descendants s1 and s2), we will henceforth call it the triplet HMM. Let
\[
E = \left\{ \text{start},\ \binom{c}{cc},\ \binom{c}{c-},\ \binom{c}{-c},\ \binom{c}{{-}{-}},\ \binom{-}{c-}_{1},\ \binom{-}{c-}_{2},\ \binom{-}{-c},\ \text{end} \right\}
\]
denote the set of states of the triplet HMM, where a ‘c’ in the top row of a state’s name stands for an emitted character in sequence s5 and a ‘c’ in the first or in the second position of the bottom row symbolizes an emitted character in sequence s1 or s2, respectively. To get an impression of how the states of this HMM are connected, look at the lower half of Figure 2. Note also that the triplet HMM has two states that emit only a character in sequence s1: one may follow the states $\binom{c}{cc}$ and $\binom{c}{-c}$, and another one can succeed $\binom{c}{c-}$ and $\binom{c}{{-}{-}}$. Emitting the sequence s5 is now done by parsing the pairwise alignment of s1 and s2 from left to right. Denote the ith emission of the triplet HMM as $e_i$ and let $p_j$ be the next position in the alignment of s1 and s2. Because we want the indel history of s1, s2, and s5 to be consistent with the alignment of s1 and s2, the triplet HMM is not allowed to jump to every state in E. For example, if $p_j$ corresponds to an insertion on the way from s1 to s2, then the triplet HMM can choose between the states $\binom{c}{{-}{-}}$, $\binom{c}{-c}$, or $\binom{-}{-c}$. Let $E_{e_i p_j} \subset E$ be the set of all HMM states that may follow $e_i$ and are in agreement with $p_j$, and let $P(x \to A) = \sum_{y \in A} P(x \to y)$ be the triplet HMM’s transition probability from state x to any of the states y in a certain set $A \subset E$. Now, with probability $P(x \to \{\binom{c}{{-}{-}}\})$ we choose $\binom{c}{{-}{-}}$ to be the next state $e_{i+1}$, and with probability
\[
\left(1 - P\left(x \to \left\{\tbinom{c}{{-}{-}}\right\}\right)\right) \frac{P(x \to \{e_{i+1}\})}{P\left(x \to E_{e_i p_j} \setminus \left\{\tbinom{c}{{-}{-}}\right\}\right)}
\]
we pick an $e_{i+1} \in E_{e_i p_j} \setminus \{\binom{c}{{-}{-}}\}$ and increment j by one. Every emitted position in s5 then gets assigned a character that is either drawn from the posterior distribution of characters given the homologous positions in s1 and s2 or from the equilibrium distribution of the substitution process in case it is only present in s5. The above procedure is repeated with the sequences s3 and s4, resulting in an indel history for these sequences and an inferred ancestral sequence at the node s6.
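The two-stage rule for choosing the next triplet-HMM state can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the string encoding of states (top row, slash, bottom row), the function names, and the transition probabilities in the usage note are hypothetical and not taken from the paper. The sketch assumes that the ancestor-only state is among the permitted successors.

```python
import random

# Hypothetical encoding: "top/bottom", e.g. "c/--" emits a character in s5 only.
ANCESTOR_ONLY = "c/--"


def next_state_distribution(transitions, allowed):
    """Distribution over the next state given the current state's transitions.

    transitions: dict mapping each state y to P(x -> y) for the current state x.
    allowed:     set of states that may follow and agree with alignment column p_j.
    """
    p_skip = transitions.get(ANCESTOR_ONLY, 0.0) if ANCESTOR_ONLY in allowed else 0.0
    others = [s for s in allowed if s != ANCESTOR_ONLY]
    norm = sum(transitions[s] for s in others)
    if norm <= 0.0:
        raise ValueError("no admissible successor consumes the alignment column")
    # Remaining mass is spread over the other admissible states in
    # proportion to their unconstrained transition probabilities.
    dist = {s: (1.0 - p_skip) * transitions[s] / norm for s in others}
    if p_skip > 0.0:
        dist[ANCESTOR_ONLY] = p_skip
    return dist


def sample_next_state(transitions, allowed, rng=random):
    """Sample the next state; the flag says whether to advance j by one."""
    dist = next_state_distribution(transitions, allowed)
    states = sorted(dist)
    state = rng.choices(states, weights=[dist[s] for s in states])[0]
    return state, state != ANCESTOR_ONLY
```

With hypothetical transition probabilities `{"c/cc": 0.4, "c/-c": 0.2, "-/-c": 0.3, "c/--": 0.1}` and the permitted set `{"c/--", "c/-c", "-/-c"}`, the ancestor-only state keeps its probability 0.1 and the remaining 0.9 is split between the other two permitted states in the ratio 2:3.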
This in turn is aligned to sequence s5 , giving us the possibility to emit a sequence at node s7 , which then finally is aligned to sequence s8 . Thus, we get an indel history for the entire data set that consists of pairwise alignments along the branches of the tree. Although we assign characters to the positions of the inferred sequences at internal nodes of the tree, it should be noted that this is only done in order to be able to align the sequences progressively. The motivation for assigning sequences to internal nodes instead of working with profiles (Gribskov et al., 1987) is to speed up the program. It should become clear in the next paragraph that we are summing over all possible characters whenever we are computing the joint probability of the sequences and their indel history.
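Drawing a character for an inferred internal position from the posterior distribution given its two homologous descendants can be sketched as follows. The sketch assumes the Jukes-Cantor model with the time scaling of 0.01 expected substitutions per unit time, purely for illustration; the method itself works with any reversible substitution model, and the function names are hypothetical.

```python
import math
import random

ALPHABET = ["A", "C", "G", "T"]


def jc_transition(a, b, t, rate=0.01):
    """Jukes-Cantor probability of observing b after time t when starting in a."""
    decay = math.exp(-4.0 * rate * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def ancestor_posterior(x1, t1, x2, t2):
    """Posterior over the ancestral character given descendants x1 and x2.

    Proportional to pi_a * P(a -> x1; t1) * P(a -> x2; t2), with pi_a = 1/4.
    """
    weights = {a: 0.25 * jc_transition(a, x1, t1) * jc_transition(a, x2, t2)
               for a in ALPHABET}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_ancestor(x1, t1, x2, t2, rng=random):
    """Draw one ancestral character from the posterior distribution."""
    post = ancestor_posterior(x1, t1, x2, t2)
    return rng.choices(ALPHABET, weights=[post[a] for a in ALPHABET])[0]
```

A position that is present only in the internal sequence would instead be drawn from the equilibrium distribution, which under Jukes-Cantor is uniform over the four nucleotides.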
Building the Initial Tree and Indel History
As described above, we want to start our search in an area of tree space that is not too unlikely. This is achieved by computing a neighbor-joining tree for the unaligned sequences. The necessary distance matrix is obtained by a coarse variant of the EM algorithm in Thorne and Churchill (1995): Alignment paths for each pair of sequences are sampled as in Metzler et al. (2001), and the transitions from each HMM state to the next one and the numbers of matches and mismatches found are counted. Then, given these observations, the joint likelihood of the substitution rate and of the speed of the indel process as compared to the substitution process is optimized. These two steps are repeated until the rate estimates do not change substantially. Thus, we get for each sequence pair an estimate of the substitution rate and a ratio of the speeds of the indel and the substitution processes. The median of the latter is used to rescale the rates of the indel process and the whole estimation procedure is repeated, but now only to estimate the substitution rates (Thorne and Kishino, 1992). Using these rate estimates, we then build the initial neighbor-joining tree, which serves as a guide tree to construct the initial indel history by way of progressive alignment (cf. Feng and Doolittle, 1987). To see how progressive sequence alignment works in this context, consider the tree shown in Figure 4. Due to the reversibility of the model, we can pick leaf s8 as the root and then proceed from the other tips upwards. Applying the Viterbi algorithm (cf. Durbin et al., 1998), we first find the most probable alignment of the sequences s1 and s2 given time $t_1 + t_2$. Then, we have a fixed-fragmentation HMM emit a sequence at the internal node s5 as well as an indel history comprising the sequences s1, s2, and s5 that is consistent with the pairwise alignment of s1 and s2. Because a run through this
Likelihood Computation
If we wanted to compute the likelihood $p_{\theta,t}(s)$ of the parameters θ and the tree t for some sequences s, we would have to sum over all possible indel histories h of these sequences, in other words
\[
p_{\theta,t}(s) = \sum_{h} p_{\theta,t}(s, h). \tag{1}
\]
\[
p_{\theta,t}(s, h) = p_{\lambda,\mu,\rho,t}(h)\, p_{\xi_1,\ldots,\xi_k,t}(s \mid h). \tag{2}
\]
The second factor, $p_{\xi_1,\ldots,\xi_k,t}(s \mid h)$, can be computed with the well-known algorithm introduced by Felsenstein (1981) to calculate a phylogeny’s likelihood given a set of aligned sequences, thereby summing over all possible internal sequences. The first factor, $p_{\lambda,\mu,\rho,t}(h)$, on the other hand, can be obtained easily as well, after realizing that under the variable-fragmentation model the indel history h decomposes into pairwise indel histories along the branches of the tree. So, taking again Figure 4 as an example and writing $l(s_i)$ for the length of sequence $s_i$, $t_i$ for the length of the ith branch of t and $h_i$ for the indel history along this branch, we get
\[
p_{\lambda,\mu,\rho,t}(h) = \frac{\prod_{i=1}^{7} p_{\lambda,\mu,\rho,t_i}(h_i)}{\prod_{j=5}^{7} p_{\lambda,\mu,\rho}(l(s_j))^{2}}, \tag{3}
\]
where $p_{\lambda,\mu,\rho,t_i}(h_i)$ is the probability of the pairwise indel history $h_i$, and where $p_{\lambda,\mu,\rho}(l)$ is the model’s equilibrium distribution for the sequence length. In the next two sections we will see that being able to compute $p_{\theta,t}(s, h)$ is essential for the optimization of $p(\theta, t, h \mid s)$, the joint posterior distribution of parameters, tree, and indel history given the sequences.
Optimizing the Parameters
Our task is to optimize the posterior density $p(\theta, t, h \mid s)$. Just like in the last section, θ denotes the model parameters, t stands for the phylogenetic tree, h is the indel history, and s represents the sequences. As has already been mentioned, we are optimizing $p(\theta, t, h \mid s)$ by switching back and forth between two simulated annealing procedures, one which optimizes $p(\theta \mid s, t, h)$ and a second one to optimize $p(t, h \mid s, \theta)$. So, let the current parameter values in step i of the search be $\theta_i = (\xi_{1i}, \ldots, \xi_{ki}, \lambda_i, \rho_i^{-1} - 1)$, with $\xi_{1i}$ to $\xi_{ki}$ being the various parameters of the substitution model,
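\[
q(\theta_i, \tilde\theta) = \frac{1}{\lambda_i} e^{-\tilde\lambda/\lambda_i} \cdot \frac{1}{\rho_i^{-1} - 1} e^{-(\tilde\rho^{-1} - 1)/(\rho_i^{-1} - 1)} \cdot \prod_{\kappa=1}^{k} \frac{1}{\xi_{\kappa i}} e^{-\tilde\xi_\kappa/\xi_{\kappa i}}, \tag{4}
\]
i.e., each component of $\tilde\theta$ is drawn from the exponential distribution that has the corresponding component of $\theta_i$ as its mean. This proposal is accepted as the search’s next state $\theta_{i+1}$ according to the probability
\[
a(\theta_i, \tilde\theta) = \min\left\{1, \left(\frac{p(\tilde\theta \mid t, s, h) \cdot q(\tilde\theta, \theta_i)}{p(\theta_i \mid t, s, h) \cdot q(\theta_i, \tilde\theta)}\right)^{c(i)}\right\}, \tag{5}
\]
where c(i) increases monotonically and hence the probability to accept bad proposals decreases monotonically with every iteration. If $\tilde\theta$ is rejected, set $\theta_{i+1} = \theta_i$. Assuming that $p_{\tilde\theta}(t) = p_{\theta_i}(t)$, we can compute $a(\theta_i, \tilde\theta)$ as
\[
\min\left\{1, \left(\frac{p_{\tilde\theta,t}(s, h) \cdot \pi(\tilde\theta) \cdot q(\tilde\theta, \theta_i)}{p_{\theta_i,t}(s, h) \cdot \pi(\theta_i) \cdot q(\theta_i, \tilde\theta)}\right)^{c(i)}\right\}, \tag{6}
\]
where we chose $\pi(\theta) = e^{-\lambda}\, e^{1-\rho^{-1}} \prod_{\kappa=1}^{k} e^{-\xi_\kappa}$ as the prior density for θ.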
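The parameter update of Equations 4 to 6 can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: θ is represented as a flat list of positive numbers $(\xi_1, \ldots, \xi_k, \lambda, \rho^{-1} - 1)$, the callable `log_joint` is a hypothetical stand-in for $\log p_{\theta,t}(s, h)$, and all function names are invented here. Working on the log scale avoids numerical underflow.

```python
import math
import random


def propose(current, rng=random):
    """Draw each component from an exponential with the current value as mean."""
    return [rng.expovariate(1.0 / mean) for mean in current]


def log_proposal_density(current, proposed):
    """log q(current, proposed) for independent exponentials (Equation 4)."""
    return sum(-math.log(mean) - value / mean
               for mean, value in zip(current, proposed))


def log_prior(theta):
    """log pi(theta): unit-rate exponential prior on every component."""
    return -sum(theta)


def acceptance_probability(current, proposed, log_joint, c):
    """Annealed acceptance probability min(1, ratio ** c) as in Equation 6."""
    log_ratio = (log_joint(proposed) + log_prior(proposed)
                 + log_proposal_density(proposed, current)
                 - log_joint(current) - log_prior(current)
                 - log_proposal_density(current, proposed))
    return math.exp(min(0.0, c * log_ratio))


def annealing_step(current, log_joint, c, rng=random):
    """One update: propose, then accept or keep the current parameters."""
    proposed = propose(current, rng)
    if rng.random() < acceptance_probability(current, proposed, log_joint, c):
        return proposed
    return current
```

Because the exponent c(i) grows with the iteration number, a proposal with a ratio below one is accepted less and less often, whereas a proposal with a ratio of at least one is always accepted.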
Optimizing Tree and Indel History
Although optimizing $p(t, h \mid s, \theta)$ follows the same principles as optimizing $p(\theta \mid s, t, h)$, it is slightly more complicated. Let $t_i$ and $h_i$ be the tree and the indel history in step i of the search. Again, we proceed by proposing a new tree $\tilde t$ and a new indel history $\tilde h$ and by accepting these proposals with probability
\[
\min\left\{1, \left(\frac{p_{\theta,\tilde t}(s, \tilde h)}{p_{\theta,t_i}(s, h_i)}\right)^{d(i)}\right\}, \tag{7}
\]
where d(i) increases in every iteration. Note that by using this acceptance probability we implicitly make two assumptions: Firstly, that all trees up to a certain total length have the same prior probability and, secondly, that proposing the pair $(\tilde t, \tilde h)$ when the current state is $(t_i, h_i)$ is about as probable as proposing the reverse transition from $(\tilde t, \tilde h)$ to $(t_i, h_i)$. Generating the proposal $(\tilde t, \tilde h)$ is done in the following way, which, as far as changing the tree is concerned, is similar to the proposal mechanism outlined in Mau and Newton (1997): First, we create a random order of the branches of $t_i$ and let $t_j$ denote the length of the jth
As has already been mentioned, doing this summation would be extremely slow. For this reason, we settle for computing the joint probability $p_{\theta,t}(s, h)$ of the sequences and a certain indel history given the parameters and the tree. Because the indel history h does not depend on the parameters of the substitution process ($\xi_1$ to $\xi_k$) and because the probability of the sequences given their indel history does not depend on the parameters of the insertion-deletion process (λ, µ, and ρ), we can write:
$\lambda_i$ being the insertion rate, and $\rho_i$ being the parameter of the distribution of fragment lengths. Note that omitting the deletion rate µ in the list of parameters means that we do not attempt to optimize λ and µ independently of each other. Instead we use a fixed ratio of λ and µ, which yields the average sequence length as the expected length. Based on $\theta_i$ we draw new parameter values $\tilde\theta = (\tilde\xi_1, \ldots, \tilde\xi_k, \tilde\lambda, \tilde\rho^{-1} - 1)$ from the proposal distribution
Nearest-Neighbor Interchanges As always with simulated annealing the procedure might run for a very long time without any changes in the estimates. Yet, we will never be sure if there will not be any jumps some time in the future. To explore tree space a bit faster we do nearest-neighbor interchanges (Swofford et al., 1996) in these phases of stagnation: Every internal branch is visited in a random order and pairwise distances for its four adjacent nodes are estimated (to remove the least influence of the old topology, new sequences have been emitted at these nodes just paying regard to the subtree which starts from there). The distance estimation is done by the same EM algorithm that was used to obtain the initial distance matrix for the data set, yet now without the rescaling of the insertion rate. Then, we reconstruct a neighbor-joining tree for these four nodes and emit new sequences at the new internal nodes. After having done the nearest-neighbor interchanges, the program returns to the simulated annealing schedule until it encounters another phase of stagnation. The program quits after having passed through a prespecified number of nearest-neighbor interchange steps, saving the most probable of the investigated combinations of parameters, phylogeny, and indel history. For the examples presented in the next section, this
FIGURE 5. The four-taxon tree used in simulation 1. The branch labels a and b correspond to 5.0 and 30.0 units of time, respectively. These branch lengths amount to 0.05 and 0.3 substitutions per position and to a deletion rate of 0.01 and 0.06, respectively.
number was arbitrarily set to two times the number of sequences.
APPLICATION
Simulation 1
In a first simulation, we produced 100 data sets according to the TKF1 model (i.e., ρ = 0) along the phylogeny shown in Figure 5. The deletion rate µ was 0.002 and the expected length of the sequences was set to 1000. As substitution process we used the Jukes-Cantor model (Jukes and Cantor, 1969). Given the four unaligned sequences at the leaves of the tree, we tried to reconstruct their phylogeny using the Felsenstein substitution model (Felsenstein, 1981) and an initial insertion rate of 0.1 per time unit. One hundred iterations at the beginning of the search were the burn-in phase. If the absolute value of the difference of the likelihoods in two successive iterations was smaller than 1 for more than 1000 iterations, we conducted nearest-neighbor interchanges, after which we continued with the simulated annealing.
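The insertion rate implied by these simulation settings is not stated explicitly, but it can be derived from the equilibrium properties given in the Model section. The number of insertion sites (fragments plus one) has expectation µ/(µ − λ), so the expected number of fragments is λ/(µ − λ); multiplying by the expected fragment length $(1 - \rho)^{-1}$ gives the expected sequence length, assuming fragment lengths are independent of the fragment count. The Python sketch below inverts this relation; the function names are hypothetical.

```python
def expected_sequence_length(lam, mu, rho=0.0):
    """Equilibrium expected length: lam / (mu - lam) fragments of mean 1 / (1 - rho)."""
    if not (0.0 < lam < mu) or not (0.0 <= rho < 1.0):
        raise ValueError("requires 0 < lam < mu and 0 <= rho < 1")
    return lam / (mu - lam) / (1.0 - rho)


def insertion_rate_for_length(mu, expected_length, rho=0.0):
    """Insertion rate that yields the requested expected sequence length."""
    expected_fragments = expected_length * (1.0 - rho)
    return mu * expected_fragments / (expected_fragments + 1.0)


# Settings of simulation 1: mu = 0.002, expected length 1000, rho = 0 (TKF1).
simulation_lambda = insertion_rate_for_length(0.002, 1000.0)
```

For the settings above this gives λ = 0.002 · 1000/1001, i.e., an insertion rate only slightly below the deletion rate, which is what keeps the equilibrium length large.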
FIGURE 6. Evolutionary history used in simulation 2. The simulation was conducted with two different combinations of branch lengths (see text).
branch in that random order. Given $t_j$ we now draw a length for the corresponding branch of $\tilde{t}$, thereby distinguishing two cases. If the branch is a leaf, its length is drawn from the uniform distribution $U(t_j - \delta, t_j + \delta)$ if the positive parameter $\delta$ is smaller than $t_j$, and from the mixture

$$\frac{\delta - t_j}{\delta}\, U(0, \delta - t_j) + \frac{t_j}{\delta}\, U(\delta - t_j, \delta + t_j)$$

if $\delta$ is bigger than $t_j$. Thus, it is guaranteed that a leaf never gets a negative branch length. In order not to propose only trees that will later be rejected, the parameter $\delta$ is adjusted in the course of the program so as to keep the chain moving. On the other hand, if the branch $j$ is an internal branch, its length is always drawn from $U(t_j - \delta, t_j + \delta)$ and may therefore become negative. In this case, the branch is removed and one of the two alternative topologies is chosen randomly. As length of the new internal branch we simply take the absolute value of the drawn number. Because the indel history $h_i$ is only meaningful in combination with a tree that has the same topology as $t_i$, changing the tree’s shape and creating a new indel history go hand in hand. Thus, every time we change an internal branch we have to update the indel history by producing sequences at the two newly created internal nodes that are consistent with the indel history for the four adjacent nodes. This is done analogously to the way internal sequences are emitted during the progressive alignment procedure described above. Again for the sake of simplicity, we employ a fixed-fragmentation HMM to perform this task, but this one emits the sequences at the six nodes of a quartet tree. To finally obtain $\tilde{h}$, the proposal for the indel history, we optimize the pairwise indel histories for every branch of $\tilde{t}$ and emit a new sequence for each of $\tilde{t}$’s internal nodes using the fixed-fragmentation HMM from Figure 2.
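The two branch-length proposals can be sketched as follows. This is an illustration under stated assumptions rather than the program's actual code: the function names are hypothetical, the case in which δ equals the current length (not covered by the text) is treated like the case δ smaller than the current length, and the choice between the two alternative topologies is left to the caller.

```python
import random
from typing import Tuple


def propose_leaf_length(t_j: float, delta: float, rng: random.Random) -> float:
    """Draw a new, non-negative length for a leaf branch of current length t_j."""
    if delta <= t_j:
        # The whole window lies on the non-negative axis.
        return rng.uniform(t_j - delta, t_j + delta)
    # delta > t_j: two-component mixture with weights (delta - t_j)/delta and t_j/delta.
    if rng.random() < (delta - t_j) / delta:
        return rng.uniform(0.0, delta - t_j)
    return rng.uniform(delta - t_j, delta + t_j)


def propose_internal_length(
    t_j: float, delta: float, rng: random.Random
) -> Tuple[float, bool]:
    """Draw a new length for an internal branch.

    Returns the absolute value of the draw and a flag that is True when the
    draw was negative, i.e. when the branch is removed and one of the two
    alternative topologies has to be chosen.
    """
    draw = rng.uniform(t_j - delta, t_j + delta)
    return abs(draw), draw < 0.0
```

For δ larger than the current length, the mixture has the same distribution as the absolute value of a draw from the uniform window around the current length, so the leaf proposal can be read as a reflection at zero.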
2005
FLEISSNER ET AL.—STATISTICAL ALIGNMENT AND TREE RECONSTRUCTION
Our method was able to reconstruct the correct topology for 95 of the 100 data sets. When we used the same simulation setup together with the more traditional approach of first aligning the sequences with CLUSTALW (Thompson et al., 1994) and alternatively with PRRN (Gotoh, 1996) and then reconstructing neighbor-joining trees based on the resulting alignments, the results were quite different: No matter which gap penalties we chose, we never got more than 62 out of 100 topologies right (Fleißner, 2004).

Simulation 2

In the second simulation we compared our method with the more traditional approach of first aligning the sequences and then reconstructing a tree based on this alignment for data sets that have evolved along the tree shown in Figure 6. We used two combinations of branch lengths. In the first one, a was 1.0 unit of time and b was equal to 7.0 units of time, whereas in the second one a and b were set to 2.0 and 19.0 units, respectively. For each of these two combinations we simulated 100 data sets of nucleotide sequences with an expected length of 500 nucleotides under Kimura’s two-parameter model (Kimura, 1980), with the transition-transversion parameter being 4.0. The insertion rate of the variable-fragmentation TKF2 process was 0.002 and the expected fragment length was 10 (ρ = 0.9). To analyze these data sets we used three different procedures: The first procedure aligned the sequences with
FIGURE 7. The results of simulation 2: The partition distances between the reconstructed trees and the true phylogeny. The columns show the results for two different sets of branch lengths (see Fig. 6). First row: our method. Second row: CLUSTALW+IQPNNI. Third row: PRRN+IQPNNI.
SYSTEMATIC BIOLOGY
VOL. 54
FIGURE 8. The results of simulation 2: Histograms showing the proportion of alignment positions of simulated data sets that were also present in the reconstructed alignments. Beware the varying scaling of the y-axes. The columns show the results for two different sets of branch lengths (see Fig. 6). First row: our method. Second row: CLUSTALW+IQPNNI. Third row: PRRN+IQPNNI.
CLUSTALW using the program’s default settings and, after removal of the positions with gaps, we reconstructed a phylogeny with the program IQPNNI (Vinh and von Haeseler, 2004) together with the Tamura-Nei substitution model (Tamura and Nei, 1993) (for equal base frequencies and with a pyrimidine-purine transition parameter of 1, this model is equivalent to Kimura’s two-parameter model). The second method employed the alignment program PRRN together with IQPNNI, again using the alignment program’s default values, removing the gapped positions, and modeling the substitution process according to the Tamura-Nei model. Finally, we used our method together with the Tamura-Nei model. The initial value for the transition-transversion parameter was 2.0 and the pyrimidine-purine transition parameter initially was 1.0. The initial insertion rate was 0.1 and the initial expected fragment length was 2 (ρ = 0.5). One thousand iterations at the beginning of the search were the burn-in phase. Nearest-neighbor interchanges were conducted whenever the absolute value of the difference of the likelihoods in two successive iterations was smaller than 1 for more than 800 iterations. The results of this simulation are summarized in Figures 7 to 12. Figure 7 shows the dissimilarity of the tree from Figure 6 and the reconstructed trees in terms of their partition distance (Robinson and Foulds, 1981). In order to interpret these results, one should keep in mind that the expected partition distance of two randomly chosen
bifurcating eight-taxon topologies is 9.4 and that less than 1% of those random topology pairs have a partition distance smaller than 6 (Penny et al., 1982). As expected, reconstructing the phylogeny becomes more difficult if the branches get longer no matter which of the methods is used. Especially the usage of PRRN seems to bias the inference in the case of the longer tree while CLUSTALW performs remarkably well. The fact that our method is not as accurate as CLUSTALW+IQPNNI is probably caused by our stochastic search strategy as compared to the more deterministic approach of CLUSTALW. Figure 8 plots the fraction of positions of the true alignment that were also present in the inferred alignments.
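To make the partition distance concrete, the following sketch counts the bipartitions that occur in exactly one of two unrooted trees. The representation of trees as nested tuples of leaf names and all function names are assumptions made for this illustration; they are not part of the programs used in the study.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions induced by the internal branches of a tree."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # fixes one canonical side for every split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            # Store the side that does not contain the reference leaf.
            found.add(other if reference in clade else clade)
        return clade

    visit(tree)
    return found


def partition_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must have the same leaves")
    return len(splits(tree_a) ^ splits(tree_b))


# Two eight-taxon topologies that differ by swapping leaves A and C.
balanced = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
swapped = ((("C", "B"), ("A", "D")), (("E", "F"), ("G", "H")))
```

A bifurcating eight-taxon topology has five internal branches, so under this definition the distance ranges from 0 to 10, which is the scale on which the expected value of 9.4 quoted above is measured.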
Both our method and CLUSTALW are reasonably successful in aligning the sequences correctly. PRRN, on the other hand, has difficulties in finding the correct alignment. As far as the estimation of the parameters of the substitution model is concerned (Figs. 9 and 10), our method and CLUSTALW+IQPNNI perform well, whereas PRRN+IQPNNI exhibits a considerable bias in the estimation of the transition-transversion parameter τ, which may be due to the fact that PRRN does not distinguish between transitions and transversions when it penalizes mismatches.
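The alignment accuracy plotted in Figure 8 can be illustrated by a column-based comparison. The text does not spell out how alignment positions were matched, so the sketch below is one plausible reading: a column is identified by the indices of the residues it contains, and the function names are hypothetical.

```python
def column_signatures(alignment, gap="-"):
    """Describe each column by the residue index it holds in every sequence.

    'alignment' is a list of equally long strings; a gap is recorded as None.
    Columns that contain only gaps are ignored.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    counters = [0] * len(alignment)  # next residue index per sequence
    signatures = set()
    for column in zip(*alignment):
        signature = []
        for row, symbol in enumerate(column):
            if symbol == gap:
                signature.append(None)
            else:
                signature.append(counters[row])
                counters[row] += 1
        if any(index is not None for index in signature):
            signatures.add(tuple(signature))
    return signatures


def recovered_fraction(true_alignment, inferred_alignment):
    """Fraction of true alignment columns that reappear in the inferred alignment."""
    true_columns = column_signatures(true_alignment)
    if not true_columns:
        raise ValueError("the true alignment has no columns")
    inferred_columns = column_signatures(inferred_alignment)
    return len(true_columns & inferred_columns) / len(true_columns)
```

For example, comparing the true alignment `["AC-GT", "ACAGT"]` with the inferred alignment `["ACG-T", "ACAGT"]` recovers three of the five true columns. Both alignments must list the same sequences in the same order.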
FIGURE 9. The results of simulation 2: The estimates of the transition-transversion parameter τ . The vertical lines mark the value that was used to generate the data. The columns show the results for two different sets of branch lengths (see Fig. 6). First row: our method. Second row: CLUSTALW+IQPNNI. Third row: PRRN+IQPNNI.
FIGURE 10. The results of simulation 2: The estimates of the pyrimidine-purine transition parameter κ. The vertical lines mark the value that was used to generate the data. The columns show the results for two different sets of branch lengths (see Fig. 6). First row: our method. Second row: CLUSTALW+IQPNNI. Third row: PRRN+IQPNNI.
Unlike the other two methods, our method also produced estimates of the insertion rate λ (Fig. 11) and of ρ, the parameter of the fragment length distribution (Fig. 12). Estimating ρ, in particular, does not seem to be too problematic.

An Example: HVR1 from Primates

As an application to real data, we used the hypervariable region I sequences of the mitochondrial control region (HVR1, cf. Anderson et al., 1981) from 11 primate species. Details on the accession numbers are given in the legend of Figure 13. We modeled the substitution process with the Tamura-Nei model, starting with the transition-transversion parameter being set to 5.0 and the pyrimidine-purine transition parameter initially being 1.0. The initial expected fragment length was 2 (ρ = 0.5). One hundred iterations at the beginning of the search were the burn-in phase. If the absolute value of the difference of the likelihoods in two successive iterations was smaller than 5 for more than 1100 iterations, we conducted nearest-neighbor interchanges. In order to see if our program manages to find the optimal combination of phylogeny, alignment, and parameters we repeated this analysis 100 times.
Figure 13 shows the consensus tree of the trees found in these 100 independent runs of our program. The edge labels show the percentage of trees which contain the respective branch. These values should not so much be seen as support values for the individual branches, but rather as an illustration of how reliably the search found the optimum. It should be noted that the depicted tree conforms well with textbook views on primate phylogeny. Most internal branches were present in almost all of the 100 trees. Only the relations of the three gibbon sequences (H. lar, H. hoolock, and H. syndactylus) seem hard to resolve. However, this problem was also observed in other studies (Roos and Geissmann, 2001). The parameter estimates that we got in the 100 runs are shown in Figure 14. Thus, assuming the TKF2 model to be a sound description of the indel process acting on primate HVR1 sequences, we come to the conclusion that during the evolution of these sequences, there was about 1 insertion per 28 substitutions and the average indel length was about 1.7 nucleotides.
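The parameter pairs quoted for the simulations (ρ = 0 for TKF1, ρ = 0.5 for an expected fragment length of 2, ρ = 0.9 for an expected fragment length of 10) are all consistent with a geometric fragment-length distribution whose mean is 1/(1 − ρ). The sketch below assumes this relation, and further assumes that the average indel length reported above can be read as this expected fragment length; the function names are hypothetical.

```python
def mean_fragment_length(rho: float) -> float:
    """Expected fragment length for a geometric distribution with parameter rho."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return 1.0 / (1.0 - rho)


def rho_from_mean_length(mean_length: float) -> float:
    """Inverse relation: the rho that yields a given expected fragment length."""
    if mean_length < 1.0:
        raise ValueError("a fragment has at least one position")
    return 1.0 - 1.0 / mean_length
```

Under these assumptions, an average length of about 1.7 nucleotides corresponds to ρ of roughly 0.41, since 1 − 1/1.7 ≈ 0.41.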
DISCUSSION AND OUTLOOK

The traditional way to do phylogenetic analysis by first aligning the sequences and then reconstructing the tree based on the alignment raises serious questions about the objectivity and reliability of the results. In this paper we have shown that it is feasible to simultaneously infer multiple alignments, phylogenetic trees with branch lengths, and model parameters in a probabilistic framework. Our algorithm is not limited to DNA, but can as well handle amino acid sequences. It can also be easily adapted to more complex and more realistic models. One could, for example, allow for rate heterogeneity either by having site-specific substitution rates (cf. Thorne and Churchill, 1995) or in the form of a covarion model (Penny et al., 2001). Instead of using fragments of geometric length, one could also draw the fragment lengths from a mixture of geometric distributions (Fleißner, 2004) and thus achieve a better modeling of the various causes of insertions and deletions.
FIGURE 12. The results of simulation 2: Our method’s estimates of ρ the parameter of the fragment-length distribution. The vertical lines mark the value that was used to generate the data. The columns show the results for two different sets of branch lengths (see Fig. 6).
FIGURE 11. The results of simulation 2: Our method’s estimates of the insertion rate λ. The vertical lines mark the value that was used to generate the data. The columns show the results for two different sets of branch lengths (see Fig. 6).
A more pressing task than increasing the complexity of the models is, however, decreasing the run time of our program in order to explore the parameter and tree spaces more efficiently. Due to time limitations, we had to relax
the convergence criteria of our simulated-annealing approach and we had to set an arbitrary upper bound to the number of nearest-neighbor interchanges. Yet, stricter convergence conditions and a better way to stop the search for better trees (e.g., Vinh and von Haeseler, 2004) should improve the method. One step towards optimizing the run time would be adapting some of the numerous tree search heuristics to this problem. As the same techniques that we used in this paper to progressively align the sequences can also be used to combine trees and alignments for subsets of the data, implementing methods such as quartet puzzling (Strimmer and von Haeseler, 1996) is straightforward. We hope that further optimization might even make it possible to develop methods to judge the reliability of multiple alignments and phylogeny reconstruction by Bayesian MCMC sampling, at least if the number of sequences is not too high. For practical reasons, the presented method emits sequences at the internal nodes. Although this approach yields a simple way to get an initial alignment for a data set, and although one can easily propose new alignments by changing the pairwise alignments of these internal sequences, the inference of ancestral states normally is problematic (Mossel, 2003) and most of the time not very interesting to the researcher. Using the formalism described in Lunter et al. (2003a, 2003b) and Metzler et al. (2005), one can avoid the assignment of sequences to the internal nodes. This, however, makes the computations significantly more complex, especially when indels
FIGURE 14. The parameter estimates for the primate data set obtained in 100 independent runs. a, The transition-transversion parameter τ . b, The pyrimidine-purine transition parameter κ. c, The insertion rate λ. d, ρ, the parameter of the fragment-length distribution. The vertical lines mark the values with the highest joint posterior probability of the parameters, the indel history, and the phylogeny.
FIGURE 13. Consensus tree of the hominoid HVR1 phylogenies with the highest likelihood found in 100 independent runs. The accession numbers and included positions of the sequences are as follows: NC 002082 1-366 (H. lar), AF311725 1-378 (H. hoolock), AF311722 1-396 (H. syndactylus), Hvr1 ID 1 2-379 (H. sapiens neanderthalensis), Hvr1 ID 1244 24-400 (H. sapiens sapiens), Hvr1 ID 264 24-398 (P. troglodytes), Hvr1 ID 261 24-399 (P. paniscus), Hvr1 ID 389 1-374 (P. pygmaeus pygmaeus 1), Hvr1 ID 390 1-374 (P. pygmaeus pygmaeus 2), Hvr1 ID 388 1-372 (P. pygmaeus abelii), NC 001992 1-378 (P. hamadryas). Accession numbers that begin with ‘Hvr1 ID ’ are the identifiers in the hvrbase (Handt et al., 1998). Regions with high similarity were determined with the help of the program dotter (Sonnhammer and Durbin, 1996).
of more than one position are allowed. As it would reduce the dimension of the search space tremendously, it still seems worthwhile to look for algorithmic improvements in that framework.

ACKNOWLEDGMENTS

We thank Anton Wakolbinger, Bojan Basrak, Gunter Weiss, and Zaid Abdo for helpful comments and fruitful discussion. We also would like to thank Roderic Page, Paul Lewis, Jeff Thorne, and an anonymous reviewer for their advice. Funding for our research was provided by the Deutsche Forschungsgemeinschaft. Some of the computations were done on the IBEST program’s Beowulf, funded by the grants NSF EPSCOR EPS-0080935, NIH INBRE P20RR16454, and NIH COBRE P20RR16448. Our method is implemented in the program alifritz, which is freely available at http://www.bi.uniduesseldorf.de/software/alifritz.
REFERENCES

Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. de Bruijn, A. R. Coulson, J. Drouin, I. C. Eperon, D. P. Nierlich, B. A. Roe, F. Sanger, P. H. Schreier, A. J. H. Smith, R. Staden, and I. G. Young. 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457–465.
Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Feng, D.-F., and R. F. Doolittle. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351–360.
Fleißner, R. 2004. Sequence alignment and phylogenetic inference. Logos Verlag, Berlin.
Fleißner, R., D. Metzler, and A. von Haeseler. 2000. Can one estimate distances from pairwise sequence alignments? Pages 89–95 in Proceedings of the German Conference on Bioinformatics 2000 (E. Bornberg-Bauer, U. Rost, J. Stoye, and M. Vingron, eds.). Logos Verlag, Berlin.
Gotoh, O. 1996. Significant improvement in accuracy of multiple protein alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264:823–838.
Gotoh, O. 1999. Multiple sequence alignments: Algorithms and applications. Adv. Biophys. 36:159–206.
Gribskov, M., A. McLachlan, and D. Eisenberg. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84:4355–4358.
Handt, O., S. Meyer, and A. von Haeseler. 1998. Compilation of human mtDNA control region sequences. Nucleic Acids Res. 26:126–130.
Hein, J. 2001. A generalisation of the Thorne-Kishino-Felsenstein model of statistical alignment to k sequences related by a binary tree. Pages 179–190 in Pacific Symposium on Biocomputing. World Scientific Publishing, Singapore.
Holmes, I., and W. J. Bruno. 2001. Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17:803–820.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. N. Munro, ed.). Academic Press, New York.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
Kirkpatrick, S., C. D. Gelatt, Jr., and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220:671–680.
Lunter, G., I. Miklós, J. L. Jensen, A. Drummond, and J. Hein. 2003a. Bayesian phylogenetic inference under a statistical indel model. Pages 228–244 in Lecture notes on computer science, Proceedings of WABI’03, volume 2812. Springer, Berlin.
Lunter, G., I. Miklós, Y. S. Song, and J. Hein. 2003b. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol. 10:869–889.
Mau, B., and M. A. Newton. 1997. Phylogenetic inference for binary data on dendograms using Markov chain Monte Carlo. J. Computat. Graph. Stat. 6:122–131.
Metzler, D. 2003. Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19:490–499.
Metzler, D., R. Fleißner, A. Wakolbinger, and A. von Haeseler. 2001. Assessing variability by joint sampling of alignments and mutation rates. J. Mol. Evol. 53:660–669.
Metzler, D., R. Fleißner, A. Wakolbinger, and A. von Haeseler. 2005. Stochastic insertion-deletion processes and statistical sequence alignment. Pages 247–267 in Interacting stochastic systems (J.-D. Deuschel and A. Greven, eds.). Springer, Berlin.
Mossel, E. 2003. On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 10:669–676.
Penny, D., L. R. Foulds, and M. D. Hendy. 1982. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297:197–200.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: The covarion model of molecular evolution. J. Mol. Evol. 53:711–723.
Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.
Roos, C., and T. Geissmann. 2001. Molecular phylogeny of the major hylobatid divisions. Mol. Phyl. Evol. 19:486–494.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Salter, L. A., and D. K. Pearl. 2001. Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst. Biol. 50:7–17.
Sonnhammer, E. L., and R. Durbin. 1996. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167:1–10.
Steel, M., and J. Hein. 2001. Applying the Thorne-Kishino-Felsenstein model to sequence evolution on a star shaped tree. Appl. Math. Lett. 14:679–684.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.
Strimmer, K., and A. von Haeseler. 2003. Nucleotide substitution models. Pages 72–87 in The phylogenetic handbook (M. Salemi and A. Vandamme, eds.). Cambridge University Press, Cambridge, UK.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512–526.
Tavaré, S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Pages 57–86 in Some mathematical questions in biology: DNA sequence analysis (M. S. Waterman, ed.). The American Mathematical Society, Providence, Rhode Island.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Thorne, J. L., and G. A. Churchill. 1995. Estimation and reliability of molecular sequence alignments. Biometrics 51:100–113.
Thorne, J. L., and H. Kishino. 1992. Freeing phylogenies from artifacts of alignment. Mol. Biol. Evol. 9:1148–1162.
Thorne, J., H. Kishino, and J. Felsenstein. 1991. An evolutionary model for maximum likelihood-alignment of DNA sequences. J. Mol. Evol. 33:114–124.
Thorne, J. L., H. Kishino, and J. Felsenstein. 1992. Inching toward reality: An improved likelihood model of sequence evolution. J. Mol. Evol. 34:3–16.
Vinh, L. S., and A. von Haeseler. 2004. IQPNNI: Moving fast through tree space and stopping in time. Mol. Biol. Evol. 21:1565–1571.

First submitted 27 May 2004; reviews returned 20 August 2004; final acceptance 4 February 2005
Associate Editor: Paul Lewis