Syst. Biol. 48(2):317–328, 1999
Effect of Nonindependent Substitution on Phylogenetic Accuracy J OHN P. HUELSENBECK1 AND RASMUS NIELSEN2 Department of Biology, University of Rochester, Rochester, New York 14627, USA; E-mail:
[email protected] r.edu Department of Integrative Biology, University of California, Berkeley, California 94720-3140 , USA 1
2
Abstract.— All current phylogenetic methods assume that DNA substitutions are independent among sites. However, ample empirical evidence suggests that the process of substitution is not independent but is, in fact, temporally and spatially correlated. The robustness of several commonly used phylogenetic methods to the assumption of independent substitution is examined. A compound Poisson process is used to model DNA substitution. This model assumes that substitution events are Poisson–distributed in time and that the number of substitutions associated with each event is geometrically distributed. The asymptotic properties of phylogenetic methods do not appear to change under a compound Poisson process of DNA substitution. Moreover, the rank order of the performance of different methods does not change. However, all phylogenetic methods become less efcient when substitution follows a compound Poisson process. [Compound Poisson process; nonindependent substitution; phylogenetic accuracy.]
All current phylogenetic methods assume independence among sites. By independence, we mean that the act of observing a substitution at one site does not change the probability of observing substitutions at other sites in the sequence. For parsimonybased methods, independence among sites is implicitly assumed when the number of changes at each site is summed to determine the overall length for a particular topology. Similarly, for likelihood methods the assumption of independent substitution at each site allows the joint probability of observing the data to be written as the product of the marginal probabilities of observing each site pattern. Although independence of the substitution process at each site is a central assumption of phylogenetic methods, this assumption is believed rarely to hold for real data. Correlated character evolution has been extensively discussed in the context of morphological or behavioral characters. A typical goal of morphological studies is to determine whether suites of characters are evolving in a correlated manner (Ridley, 1983; Harvey and Pagel, 1991). For example, one may want to examine whether a change in one morphological character is correlated with behavioral or morphological changes in another (e.g., Pagel and Harvey, 1989; Maddison, 1990).
Correlated character evolution has also been examined for molecular data. However, the problem of identifying character correlation for DNA sequences is somewhat different from that in morphological data. First, for morphological studies, the question of character correlation is posed for two (or very few) characters. For molecular data, on the other hand, the number of characters (sites) involved ordinarily will be much larger. Also, instead of strong correlation among just a few sites, the expectation with molecular data often is that the evolution at many sites is only weakly correlated. Second, in the absence of a thorough understanding of gene function, only in rare cases can one specify those sites that are expected to evolve together. Several authors have investigated correlated evolution in the stem regions of ribosomal genes, which are an exception to the generalizations made above (Wheeler and Honeycutt, 1988; Dixon and Hillis, 1993; Sch o¨ niger and von Haeseler, 1994; Muse, 1995; Rzhetsky, 1995): The correlated sites can be specied (paired stem sites) and they are few in number. Sch o¨ niger and von Haeseler (1995) examined the accuracy of three phylogenetic methods (maximum likelihood, minimum evolution, and maximum parsimony) under a model that allows correlated substitutions in the stem
317
318
SYSTEMATIC BIOLOGY
regions of ribosomal genes for the fourtaxon case. The substitution model developed by Sch o¨ niger and von Haeseler (1994; SH94 model) considers nucleotide pairs. They used the probabilities of all 164 = 65,536 possible doublet patterns as the observed data for four taxa; hence, they were able to examine the consistency of the methods (the performance of the phylogenetic methods under the assumption of an innite number of data). To examine the effect of nonindependent substitution on phylogenetic estimation, Scho¨ niger and von Haeseler (1995) compared the accuracy of methods in which the actual process of substitution followed the SH94 model with the accuracy of methods in which the process generating the sequence data followed the Hasegawa et al. (1985; denoted HKY85) model of DNA substitution. Scho¨ niger and von Haeseler found that, in general, the performance of maximum likelihood was not as adversely affected as the performance of neighbor joining. This result is consistent with the results of Tillier and Collins (1995), who also investigated the performance of maximum likelihood and neighbor joining for a model of sequence evolution for the stem regions of RNA. However, the SH94 model and the HKY85 model differ in other aspects than just the assumption of independence. For example, at stationarity the expected proportion of changes between any two nucleotides differs for the two models for the choice of parameters applied by Scho¨ niger and von Haeseler. It is therefore not clear exactly which effect causes the difference in the large–sample properties of the phylogenetic estimators under the two models. Models of DNA substitution similar to those appropriate for the stem regions of RNA sequences have been developed for codons (Goldman and Yang, 1994; Muse and Gaut, 1994). Instead of considering the states of the Poisson process to be the 16 nucleotide doublets, these models consider the states to be the 64 different codons (61 if stop codons are excluded). However, the performance of phylogenetic methods has not been examined for these models. Moreover, the stem-based and codon-based models address the issue of spatial correlation; sub-
VOL. 48
stitutions at different sites are correlated to one another. To our knowledge, no one has examined models in which the substitutions are temporally correlated. Models of temporal correlation are important because of the observation that substitutions are overdispersed (Gillespie, 1991). Processes that can cause temporal correlation of nucleotide substitutions include selective sweeps and gene conversion. In this article, we consider the robustness of phylogenetic methods to violation of the assumption of nonindependent substitution. We do this by developing a general model of substitution that allows substitutions to be correlated in time— the compound Poisson process model. The case of independent substitution in time is simply a special case of the compound Poisson model. We examine the performance of several widely used phylogenetic methods by utilizing a general model of correlated evolution. We conclude that correlation reduces the efciency of all phylogenetic methods but does not in itself change the large–sample properties or relative efciencies of the common phylogenetic estimators. MARKOVIAN AND POISSON PROCESSES Currently, all explicitly model-based methods of phylogenetic inference assume a Markovian model of DNA substitution. This implies that the probability of observing a substitution depends only on the current state of the sequence. If a substitution has recently occurred in a site, this is assumed not to change the probability of observing another substitution in the same site or any other site. Under a Markovian model of substitution, the probability of observing a substitution in a single site may vary in time because different nucleotides may mutate at different rates; when the process is averaged over many sites, however, the substitutional process will become approximately Poisson. Under a Poisson process model, events (substitutions) occur independently in time at a constant rate (l ). Assuming that two substitutions cannot occur simultaneously [the probability of observing two substitutions
1999
HUELSENBECK AND NIELSEN—NONINDEPENDENT SUBSTITUTION
319
in an interval ( d t) is O( d t)], the number of substitutions per site in a specied interval (T) in a sequence follows a Poisson distribution with parameter l T. Furthermore, the expected mean and variance (E[N(T)] and Var[N(T)], respectively) of the number of substitutions [N(T)] for the Poisson process are equal ( l T for each); therefore, the index of dispersion, dened as I=
Var[N(T)] , E[(N(T)]
should equal 1. For nite sequences, small deviations from a Poisson process may be observed because different nucleotides mutate with different probabilities. However, under biologically reasonable assumptions, these small deviations from a Poisson process do not signicantly increase I to > 1 (Nielsen, 1997). Furthermore, by allowing mutations between identical nucleotides, any Markovian model of nucleotide substitutions can be modeled as a Poisson process. This method of transforming a continuous–time Markov chain into a Poisson process is called uniformization (e.g., see Ross, 1996:282). We will therefore equate the Markov models of DNA substitution to models that assume the process of substitution is Poisson. Figure 1a shows an example of one possible realization of a Poisson process model of substitution. The x-axis represents the position along the sequence (space) and the y-axis represents time. Substitutions (represented as points) occur randomly in time and along the sequence. Note that no substitutions occur at exactly the same time. However, two substitutions (marked with asterisks) occur at the same position in the sequence, thus obscuring from the observer the total number of substitutions that have occurred. TEMPORAL CORRELATION BY USING A COMPOUND POISSON PROCESS When the act of observing one substitution changes the probability of observing another substitution, the substitution process is no longer Markovian and we say that
FIGURE 1. Two graphs illustratin g Poisson and compound Poisson processes of substitution. (a) Under a Poisson process, substitutions occur independently in time and along the sequence (space). (b) Under a compound Poisson process, substitutions are correlated in time. In both examples, a total of seven substitutions occur. However, two of the substitutions occur at the same site ( * ) and the total number of substitutions is obscured.
the substitutions are correlated. Commonly used models of DNA substitution, including those that include rate variation among sites (Yang, 1993), spatial correlation (Yang, 1995; Sch o¨ niger and von Haeseler, 1994), and complex substitution matrices are all Markovian models. The number of substitutions under these models approximately follows a Poisson process when averaged over sites. Current phylogenetic methods can also account for variable rates in time simply by allowing the length of different branches to vary freely. In fact, when rates, l (t), vary over time, the number of substitutions in any interval of time (T) is still approximately PoisRT son distributed with a mean of 0 l (s)ds. In contrast, when substitutions are temporally correlated, substitutions will not be Poisson distributed.
320
VOL. 48
SYSTEMATIC BIOLOGY
There are several reasons why the substitution process may be temporally correlated. For one, mutations may affect more than one nucleotide. Processes such as gene conversion are known to affect many nucleotides at once. Also, selective sweeps may occur (e.g., see Kaplan et al., 1989). If a mutation goes to xation in the population by positive selection, many linked neutral mutations may be driven to xation at the same time. Naturally, substitutions will be correlated in time in such a case. Empirical evidence for temporal correlation in nature is ample. Gillespie (1986, 1987, 1991) and Ohta (1995), for example, show that the substitution process is overdispersed. That is, the variance to mean ratio in the number of substitutions is greater than expected from a Poisson process. This implies that substitutions are clustered in time. We have used a compound Poisson process to examine the effect of nonindependence among sites. In our implementation of the compound Poisson process, substitutional events occur according to a Poisson process with rate l per site and the number of substitutions in each substitutional event is geometrically distributed. In other words, the number of substitution episodes follows a Poisson process, but more than one substitution may occur in each episode. However, no more than one substitution is allowed in each site in an episode. This compound Poisson process was chosen to model temporal correlation because it is identical to a model in which mutations may stretch over more than one nucleotide. Moreover, the compound Poisson process provides a close approximation to a nonmechanistic description of evolution by selective sweeps and may also provide a description of the evolution of DNA sequences under other selective models (Gillespie, 1991). We assume that each substitution increases the probability of observing a substitution in another site in the same innitesimal interval of time by a constant factor, r (0 # r # 1). In this model, the number of substitutions in each episode follows a geometric distribution with parameter 1– r . The expected number of substitutions for a compound Poisson process is simply the prod-
uct of the expected number of substitutional episodes and the mean number of substitutions per episode, or E[N(T)] =
l T
1–r
.
The variance of the number of substitutions is l T(1 + r ) Var[N(T)] = . (1 – r )2 Note that the index of dispersion for our implementation of the compound Poisson process is I = (1 + r ) / (1 – r ). A Poisson process model is simply a special case of our model with r = 0. One of the main effects of temporal correlation is to increase the variance of the number of substitutions per site over a branch of length T. For a Poisson process, the variance of the number of substitutions per site is l T. For a compound Poisson process with r = 1/ 2, the variance in the number of substitutions is three times greater (when the two models are standardized such that the expected number of substitutions per site over the branch are equal). Figure 1b illustrates one potential realization of a compound Poisson process. Episodes of substitution (times at which substitutions occur, denoted by the gray horizontal lines) are distributed according to a Poisson process. Note, however, that each substitutional episode has one or more substitutions associated with it. Hence, the compound Poisson process models rapid bursts of temporally correlated substitutions. METHODS Simulation of Data We assumed an unrooted four-taxon tree as the model tree in this study (Fig. 2). We constrained the internal branch and two opposing peripheral branches to be equal in length (v3 ) and the remaining two peripheral branches to be equal in length (v2 ). Branch lengths are measured in terms of expected number of substitutions per site. By varying v3 and v2 , we were able to explore a range of possible trees with differing relative branch lengths. Some of the conditions
1999
HUELSENBECK AND NIELSEN—NONINDEPENDENT SUBSTITUTION
321
FIGURE 2. The four–taxon tree used as the model tree for simulations. Two opposing peripheral branches were constrained to be the same length (v2 ), as were the remaining three branches, but at a different length (v3 ). The internal nodes are denoted x and y.
we explored are known to provide methods with difcult phylogenetic problems (e.g., when v2 > > v3 ; Felsenstein, 1978; Jin and Nei, 1990; Nei, 1991; Huelsenbeck and Hillis, 1993; Charleston et al., 1994; Tateno et al., 1994; Huelsenbeck, 1995). All of the analyses we performed were conned to a parameter space for which v2 varied from 0 to 5.0 and v3 varied from 0 to 1.0—that is, the same parameter space Scho¨ niger and von Haeseler (1995) explored (Fig. 3). We simulated data under Poisson and compound Poisson models of DNA substitution. For both models, we simulated data for cases in which (1) all substitution are equally likely and (2) transitions occur at a rate 10 times that of transversions [k = a / b = 10 in Kimura’s (1980) notation]. For our implementation of a compound Poisson model, substitutions in an episode occur randomly along the sequence. For all simulations, the equilibrium frequencies of all four bases were assumed to be equal. Equations for calculating the probabilities of base substitution under the Poisson models can be found in Swofford et al. (1996). Figure 4 illustrates the procedure used to simulate sequences. For illustrative purposes, we will assume a model similar to the Jukes–Cantor (1969) model of DNA substitution, in which the four bases have equal frequencies and the rates of substitution among nucleotide states are equal. The parameter of the geometric distribution for
FIGURE 3. The parameter space examined in this study. The branches v2 long were varied from 0 to 5.0 expected substitutions per site, whereas the v3 branches were varied from 0 to 1.0 expected substitutions per site.
the compound Poisson process is assumed to be r = 0.5. The ancestral sequence for all of the tip sequences was initialized by drawing nucleotides with equal probability. Starting from the root of the tree, we sequentially visited nodes from the root towards the tips (i.e., performed a preorder traversal of the tree) and determined the sequence at that node. For each node that was visited, the ancestral sequence had been determined in the previous step; we needed only to determine the sequence at that node from the ancestral sequence. Figure 4a shows the situation faced for each node visited. The ancestral sequence was known, and is shown in Figure 4 as the sequence A, G, C, C, T,. . . , G. The length of the branch (in terms of expected number of substitutions per site), denoted v, was also known. The sequence of nucleotides at the node, however, was unknown and is represented as a series of blank boxes. We rst determined the number of episodes of substitution that occurred between the ancestral sequence and the current (unknown) sequence. The number of episodes of substitution is Poisson distributed with parameter v(1 – r ). In Figure 4, the number of episodes of substitution is
322
VOL. 48
SYSTEMATIC BIOLOGY
FIGURE 4. Illustration of the simulation procedure. At each node of the tree, the ancestral sequence is known, but the sequence at the node is unknown (a). The horizontal gray lines represent episodes of substitution. For each episode of substitution, there may be more than one substitution along the sequence (b).
3, and these episodes are denoted by gray lines. Conditioned on the times at which the ancestral and descendant sequences occurred (tA and t D , respectively), the episodes of substitution will be uniformly distributed on the interval tA , tD . For each episode of substitution, we determined the number and location of the substitutions along the sequence. The number of substitutions for each episode was determined by drawing from a geometric distribution with parameter 1 – r . The substitutions were then randomly placed on the sequence. In Figure 4b, the number of substitutions is 3, 1, and 3 for the rst, second, and third episodes of substitution, respectively. Note that one nucleotide position has two substitutions (a change from a T to A and then a change from A back to T). The total number of substitutions along the sequence was 7, but the number of differences between the ancestral and descendant sequences was 6. Methods Examined We examined the performance of several widely used phylogenetic methods, including parsimony [using Fitch (1971) character optimization], minimum evolution (Kidd and Sgaramella-Zonta, 1971), UPGMA (Sokal and Michener, 1958), and max-
imum likelihood (Felsenstein, 1981). For the distance methods, we used proportion different (p), Jukes–Cantor (Jukes and Cantor, 1969; dJC ), and Kimura (Kimura, 1980; dK ) distances. We examined the accuracy of maximum likelihood implemented with the Jukes–Cantor model of DNA substitution. The methods we examine in this study are applied to the majority of analyses published in the pages of Molecular Biology and Evolution and Systematic Biology for the past decade (Huelsenbeck, 1995). Analysis Several criteria can be applied to investigate the accuracy of a phylogenetic method, including consistency (how well the method estimates the correct tree, given an innite number of sites), efciency (how well the method estimates phylogeny for a nite number of sites), and robustness (how well the method estimates phylogeny when the assumptions of the method are violated). We measure accuracy as the proportion of the time that a phylogenetic method correctly estimates the true (simulated) fourtaxon phylogeny. For this study, we are mainly interested in the robustness of phylogenetic methods to violations of the assumption of indepen-
1999
HUELSENBECK AND NIELSEN—NONINDEPENDENT SUBSTITUTION
dent substitution among sites. However, unlike models of independent substitutions, in which it is simple to calculate the probabilities of all site patterns (and, hence, to examine the consistency of the phylogenetic methods), the compound Poisson models do not lend themselves to easy analysis. The main reason for this is that the stochastic process describing the change of the nucleotide sequence in time is non-Markovian. Hence, we examined the consistency of phylogenetic methods by comparing the expected frequencies of different site patterns under a Poisson process with the observed frequencies under a compound Poisson process. If the frequencies of site patterns under a compound Poisson process do not signicantly differ from the site–pattern frequencies expected under a Poisson process, then the large sample performance of phylogenetic methods should also not differ. A site pattern is an assignment of nucleotides to species at a particular site; one possible site pattern is fA, A, A, Ag, a pattern in which all four species have the nucleotide “A” at the site. For four taxa, a total of 44 = 256 site patterns are possible. However, for methods that assume equal base frequencies and consider only transition and transversion changes, many site patterns can be combined and only 36 site patterns need be considered (Table 1). In Table 1, the nucleotide assigned to taxon n1 is “1”. If the other taxa differ by a transition, they are assigned “2”; if they differ by a transversion, they are assigned a “3” or “4”. The expected frequency of the site patterns can be calculated exactly under a Markovian model. Under the Markovian model, the probability of a particular site pattern is XX x
Pr [n1 , n2 , n3 , n4 ] = p
n1
pn x (v2 ) pxn (v3 ) 1
2
y
pxy (v3 ) pyn (v2 ) pyn (v3 ) 3
4
where pij (v) is the probability of a change from nucleotide i to j over a branch of length v, p n is the equilibrium frequency of the nucleotide at species n1 , and the summations are over all possible assignments of nucleotides to the internal nodes (x and y). 1
323
To determine whether the asymptotic properties of methods change when substitutions are temporally correlated, we compared the expected frequencies of the site patterns under a Poisson process to the observed frequencies under the compound Poisson process. The observed frequencies for the compound Poisson process were determined by simulating 106 sites (1000 matrices of 1000 sites each). We examined the match between the Poisson and compound Poisson model of substitution for a range of branch length combinations from the parameter space of Figure 3. Because 110 points were examined from the parameter space for each set of conditions, a Bonferroni–type correction to the signicance level was applied (a = 0.05/ 110 = 0.000454). A x 2 test with 35 df was used to test the null hypothesis that the observed frequencies under a compound Poisson process are the same as under a Poisson process model of DNA substitution. An example of the results from a single point from the parameter space of Figure 3 is shown in Table 1. The efciency of phylogenetic methods under Poisson and compound Poisson models was examined by using simulation. For the efciency analyses, we examined two points from the branch-length parameter space for different numbers of simulated sites (v2 = 0.05, v3 = 0.01; v2 = 0.1, v3 = 0.14). RESULTS Consistency The asymptotic properties of phylogenetic methods do not appear to change under a compound Poisson process of DNA substitution. The null hypothesis that the frequencies of the different site patterns is the same for the Poisson and compound Poisson models could not be rejected by the x 2 test (Fig. 5). Because the frequencies of the different site patterns for the compound Poisson process are the same as for the Poisson process, the performance of phylogenetic methods should also be the same when the number of sites increases to innity. This result applies, minimally, to the two situations explored in this study: (1) rates of substitution being equal among sites and the
324
VOL. 48
SYSTEMATIC BIOLOGY
TABLE 1. The expected and observed site patterns under, respectively, the Poisson and compound Poisson process models of DNA substitution. All branches were 0.2 substitutions per site long. Expected frequencies were calculated under the Jukes–Cantor model of DNA substitution with equal rates among sites. The observed pattern frequencies for the compound Poisson process with r = 0.5 were based on 10 6 sites. The null hypothesis that the pattern frequencies are the same is not rejected (x 2 = 41.28; P = 0.22).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Pattern
Expected frequency
Observed frequency
1111 1222 1211 1121 1112 1333 1311 1131 1113 1122 1212 1221 1133 1313 1331 1233 1322 1344 1134 1123 1132 1323 1232 1343 1314 1213 1312 1332 1223 1334 1341 1231 1321 1234 1324 1342
0.3817 0.0294 0.0294 0.0294 0.0294 0.0589 0.0589 0.0589 0.0589 0.0314 0.0042 0.0042 0.0063 0.0084 0.0084 0.0124 0.0124 0.0124 0.0124 0.0124 0.0124 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0053 0.0023 0.0023 0.0023
0.3817 0.0296 0.0293 0.0293 0.0292 0.0591 0.0587 0.0586 0.0596 0.0316 0.0041 0.0040 0.0627 0.0084 0.0083 0.0124 0.0125 0.0124 0.0124 0.0123 0.0125 0.0054 0.0055 0.0054 0.0054 0.0053 0.0053 0.0054 0.0053 0.0054 0.0053 0.0052 0.0053 0.0023 0.0022 0.0022
rates of different nucleotide substitutions being equal; and (2) rates of substitution being equal among sites and transitions occurring at a rate 10 times that of transversions. Efciency and Robustness Although the asymptotic properties of phylogenetic methods do not appear to change under a compound Poisson process, the efciency of phylogenetic methods does change considerably. Figure 6 shows the re-
x
2
0.0001 1.0322 0.0692 2.0343 1.3548 1.3092 0.2381 0.9644 9.4692 1.1172 0.1410 8.7016 0.6243 0.0090 0.9353 0.0444 0.1015 0.0002 0.0652 0.6439 0.4709 0.4065 2.4164 2.1679 0.4605 0.0039 0.3369 0.3558 1.0935 0.4793 0.0024 2.9457 0.1512 0.0223 0.2567 0.2359
sults of simulations in which the number of sites are varied. The accuracy of phylogenetic methods decreases as the parameter r increases. The rank order of the performance of phylogenetic methods, however, does not change under a compound Poisson model of substitution. Figure 7 shows the performance of four phylogenetic methods for two sets of simulated conditions under a Poisson process model (Fig. 7a, d) and under a compound Poisson process model
1999
HUELSENBECK AND NIELSEN—NONINDEPENDENT SUBSTITUTION
325
FIGURE 6. The accuracy of methods decreases when temporal correlation exists. The performance of both maximum likelihood (square points) and parsimony (circular points) decreases as r increases (r = 0.0 for solid points and r = 0.9 for hollow points). A Jukes– Cantor model of substitution was assumed. The branch lengths were v2 = 0.5 and v3 = 0.1.
tive performance of phylogenetic methods does not appear to change with temporal correlation. FIGURE 5. The results of x 2 tests of the t of the compound Poisson model with r = 0.5 to a Poisson model of DNA substitution. A total of 110 points from the parameter space of Figure 3 were examined for two different models of substitution [Poisson and compound Poisson process models with no transition/transversion rate ratio (i.e., k = 1) (a) and with a transition/transversion rate ratio of k = 10(b)]. A x 2 test was performed to examine whether the observed frequencies of the 36–site patterns under a compound Poisson process are significantly different from the expected frequencies under a Poisson process. Shown are the distributions of the x 2 test statistic for the 110 comparisons. When corrected for multiple comparisons, the individual x 2 values are not signicant ( x 2 must exceed 69.55 to be signicant at the 5% level).
(Figs. 7b, c, e, f). Note that the relative performances of the methods are the same when substitutions are temporally correlated. The relative performance of the methods is consistently the same with temporal correlation, regardless of whether the process generating the sequences has biased substitutions. Figure 8 shows the accuracy of phylogenetic methods as r increases when the transitions occur 10 times as rapidly as transversions. As with the simulation results shown in Figure 7, the rela-
D ISCUSSION Several models of DNA substitution allow correlated rates among different sites. For example, Yang (1995) and Felsenstein and Churchill (1996) developed models of correlated rates in which the rate at one site depends on the rate at an adjacent site. Models of DNA substitution that allow correlated events have received less attention in phylogenetics. Under the SH94 model mentioned earlier, doublet substitutions follow a Markov model, but the resulting nucleotide substitution process cannot be described as a Markov model except in special cases. Nucleotide substitutions under this model will therefore be weakly correlated. This model is most appropriate for the stem regions of ribosomal genes that are formed through Watson–Crick basepairing interactions between rRNA bases. The results discussed above have simple intuitive explanations. When the substitutions are temporally correlated, this corresponds to reducing the number of independent data. In effect, correlation corresponds to a reduction in the effective number of sites. However, correlation has no strong ef-
326
SYSTEMATIC BIOLOGY
VOL. 48
FIGURE 7. The rank order of phylogenetic methods does not appear to change with temporal correlation. The performance under a Poisson model of substitution (r = 0.0) (a, d); the performance under a compound Poisson process model with r = 0.5 (b, e) and r = 0.9 (c, f). A Jukes–Cantor substitution matrix was assumed for all simulations and the branch lengths were v2 = 0.5 and v3 = 0.1 (a, b, c) or v2 = 0.1 and v3 = 0.1 (d, e, f). Parsimony (l), UPGMA (s), and minimum evolution (u) with Jukes–Cantor distances, and maximum likelihood (n) assuming a Jukes–Cantor model of substitution.
fect on the large–sample properties of the methods for our implementation of the compound Poisson process. The intuitive explanation for this is that correlation in itself does not change the expected probability of observing a particular site pattern. In our implementation, the marginal process in each
site is still approximately Poisson (ignoring edge effects). However, had we modeled the compound Poisson process differently, other results may have been obtained. Apparently, methods of phylogenetic estimation are about equally sensitive to violation of the assumption of inde-
1999
HUELSENBECK AND NIELSEN—NONINDEPENDENT SUBSTITUTION
327
to violation of the assumption of independent substitution. In fact, the zone of consistency does not appear to change under our implementation of the compound Poisson process of substitution in comparison with the behavior of methods when substitution follows a Poisson process. This (asymptotic) behavior is qualitatively different from the behavior of phylogenetic methods when other assumptions of the substitution process are violated; rather, the zone of inconsistency typically becomes larger when other assumptions, such as equal rates among sites, are violated (e.g., see Huelsenbeck, 1995). Although the results of this study suggest that the main effect of temporal correlation is to reduce the effective number of sites, other processes for introducing nonindependence may effect phylogenetic methods in a different manner. Exploration of the robustness of phylogenetic methods to other models of nonindependence should clarify whether the results of this study are general. Simulation should prove particularly useful for such studies. ACKNOWLEDGMENTS This paper beneted from discussions with Michael Cummings and Mark Kirkpatrick. This research was supported by NIH grant GM40282 awarded to Monty Slatkin and a Danish Research Council Fellowship awarded to R. N. FIGURE 8. The performance of phylogenetic methods as r increases when transitions occur at a rate 10 times that of transversions. The performance under a Poisson model of substitution [Kimura’s (1980) model)] (a). The performance under a compound Poisson process with r = 0.5 (b) and r = 0.9 (c). Parsimony (l), UPGMA with Jukes–Cantor (s) and Kimura (n ) distances, minimum evolution with Jukes–Cantor (u) and Kimura (e ) distances, and maximum likelihood (n) assuming a Jukes–Cantor model of substitution.
pendence, because the rank order of the performance of methods does not change under a compound Poisson process of substitution. The assumption of independent substitution is central to phylogenetic analysis. Like other assumptions made in a phylogenetic analysis, this assumption is known to be violated for real data. Encouragingly, phylogenetic methods appear to be robust
REFERENCES CHARLESTON , M. A., M. D. HENDY, AND D. PENNY . 1994. The effects of sequence length, tree topology, and number of taxa on the performance of phylogenetic methods. J. Comput. Biol. 1:133–151. DIXON , M. T., AND D. M. HILLIS. 1993. Ribosomal RNA secondary structure: Compensatory mutations and implication s for phylogenetic analysis. Mol. Biol. Evol. 10:256–267. FELSENSTEIN , J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410. FELSENSTEIN , J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376. FELSENSTEIN , J., AND G. A. CHURCHILL. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93–104. FITCH , W. M. 1971. Toward dening the course of evolution: Minimum change for a specic tree topology. Syst. Zool. 20:406–416. GILLESPIE , J. H. 1986. Variability of evolutionary rates of DNA. Genetics 113:1077–1091.
328
SYSTEMATIC BIOLOGY
GILLESPIE , J. H. 1987. Molecular evolution and the neutral theory. Oxford Surv. Evol. Biol. 4:10–37. GILLESPIE , J. H. 1991. “The causes of molecular evolution.” Oxford Univ. Press, Oxford, England. GOLDMAN , N., AND Z. YANG . 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736. HARVEY, P. H., AND M. D. PAGEL . 1991. “The comparative method in evolutionary biology”. Oxford Univ. Press, Oxford, England. HASEGAWA , M., H. KISHINO , AND T. YANO . 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174. HUELSENBECK , J. P. 1995. The performance of phylogenetic methods in simulation. Syst. Biol. 44:17–48. HUELSENBECK , J. P., AND D. M. HILLIS. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264. J IN, L., AND M. NEI. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7:82–102. J UKES, T. H., AND C. R. CANTOR. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. Munro, ed.). Academic Press, New York. KAPLAN , N. L., R. R. HUDSON , AND C. H. LANGLEY . 1989. The “hitch-hiking” effect revisited. Genetics 123:887–899. KIDD , K. K., AND L. A. SGARAMELLA -ZONTA . 1971. Phylogenetic analysis: Concepts and methods. Am. J. Hum. Genet. 23:235–252. KIMURA, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120. MADDISON , W. P. 1990. A method for testing the correlated evolution of two binary characters: Are gains or losses concentrated on certain branches of a phylogenetic tree? Syst. Zool. 40:304–314. MUSE , S. V. 1995. Evolutionary analyses when nucleotides do not evolve independently. Pages 115–124 in Current topics on molecular evolution (M. Nei and N. Takahata, eds.). Institute on Molecular Evolution and Genetics, University Park, Pennsylvania. MUSE , S. V., AND B. S. GAUT . 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715– 724. NEI, M. 1991. Relative efciencies of different treemaking methods for molecular data. Pages 90–128 in Phylogenetic analysis of DNA sequences (M. Miyamoto and J. Cracraft, eds.).Oxford Univ. Press, Oxford, England.
VOL. 48
NIELSEN , R. 1997. The robustness of the estimator of the index of dispersion for DNA sequences. Mol. Phyl. Evol. 7:346–351. OHTA, T. 1995. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol. 40:56–63. PAGEL , M. D., AND P. H. HARVEY. 1989. Comparative methods for examining adaptation depend on evolutionary methods. Folia Primatol. 53:203–220. RIDLEY, M. 1983. The exploration of organic diversity: The comparative method and adaptations for mating. Oxford Univ. Press, Oxford, England. ROSS , S. M. 1996. Stochastic processes, 2nd edition. John Wiley & Sons, Inc., New York. RZHETSKY , A. 1995. Estimating substitution rates in ribosomal RNA genes. Genetics 141:771–783. ¨ SCHONIGER , M., AND A. VON HAESELER . 1994. A stochastic model and the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3:240–247. ¨ SCHONIGER , M., AND A. VON HAESELER . 1995. Performance of the maximum likelihood, neighbor joining, and maximum parsimony methods when sequence sites are not independent. Syst. Biol. 44:533–547. SOKAL , R. R., AND C. D. MICHENER. 1958. A statistical method for evaluatin g systematic relationships. Univ. Kans. Sci. Bull. 28:1409–1438. SWOFFORD, D. L., G. J. OLSEN , P. J. WADDELL , AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular Systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Inc., Sunderland, Massachusetts. TATENO , Y., N. TAKEZAKI, AND M. NEI. 1994. Relative efciencies of the maximum-likelihood, neighbor joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11:261– 277. TILLIER, E. R. M., AND R. A. COLLINS . 1995. Neighbor joining and maximum likelihood with RNA sequences: Addressing the interdependence of sites. Mol. Biol. Evol. 12:7–15. WHEELER , W. C., AND R. L. H ONEYCUTT . 1988. Paired sequence difference in ribosomal RNAs: Evolutionary and phylogenetic implication . Mol. Biol. Evol. 5:90–96. YANG , Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401. YANG , Z. 1995. A space–time process model for the evolution of DNA sequences. Genetics 139:993–1005. Received 18 November 1997; accepted 31 August 1998 Associate Editor: D. Cannatella