Accuracies of the Simple Methods for Estimating the Bootstrap Probability of a Maximum-Likelihood Tree

Masami Hasegawa* and Hirohisa Kishino†

*The Institute of Statistical Mathematics and †Ocean Research Institute, University of Tokyo
Kishino et al. proposed two simple methods, a resampling estimated log-likelihood method and a multivariate normal distribution method, for estimating bootstrap probabilities of alternative trees from a maximum-likelihood analysis without performing maximum-likelihood estimation for each resampled data set. To examine the extent to which these two methods provide good approximations, bootstrap probabilities estimated by these methods from real data were compared with those estimated by repeated bootstrap resampling and maximum-likelihood estimation. Both simple methods turned out to be good approximations to the computationally intensive bootstrap method. This finding should motivate people in this field to use the maximum-likelihood method when inferring molecular evolutionary trees.
Key words: approximate bootstrap probability, maximum likelihood, molecular evolutionary tree.

Address for correspondence and reprints: Masami Hasegawa, The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan.

Mol. Biol. Evol. 11(1):142-145. 1994. © 1994 by The University of Chicago. All rights reserved. 0737-4038/94/1101-0014$02.00

Introduction

A maximum-likelihood (ML) method for inferring molecular evolutionary trees from DNA sequence data was originally developed by Felsenstein (1981). Later, Kishino et al. (1990) and Adachi and Hasegawa (1992) developed an ML method for analyzing protein sequence data. These methods do not assume rate constancy among lineages and hence are robust against departures from constancy (Hasegawa et al. 1991; Hasegawa and Fujiwara 1993).

In a series of papers, we have developed a method to evaluate the confidence limits of the ML tree. Hasegawa and Kishino (1989) noted that, since the likelihood depends on a particular realization of a random variable, it is important to estimate the variance of the log of the likelihood ratio between alternative trees. They estimated the variance by using bootstrap resampling (Felsenstein 1985). Kishino and Hasegawa (1989) developed a simple method for estimating the variance by expressing it explicitly. Rather than this sort of pairwise comparison between alternative trees, a bootstrap probability (i.e., the frequency with which a particular tree is the highest-likelihood tree among the alternatives during bootstrap resampling) might be more appealing to molecular evolutionists.

However, repeating the ML estimation of the tree for each bootstrap sample (the standard bootstrap method) generally imposes too much computational burden. Kishino et al. (1990) developed two simple methods for estimating an approximate bootstrap probability: one is called a "multivariate normal distribution" (MND) method, and the other is called a "resampling estimated log-likelihood" (RELL) method. These methods can estimate the bootstrap probability of a tree without repeating bootstrap resampling and ML estimation.

The purpose of the present paper is to examine the extent to which the MND and RELL methods are good approximations for estimating bootstrap probabilities. The bootstrap probabilities estimated by these methods from mitochondrial DNA (mtDNA) data will be compared with those estimated by repeating bootstrap resampling and ML estimation.

Bootstrap Probability of an ML Tree
DNA and protein are sequences of four types of bases and of 20 types of amino acids, respectively. At each site, one type of base or amino acid may change to another during evolution, and the substitutions occur independently between lineages after they separate in an evolutionary tree. These substitutions can best be regarded as stochastic (Kimura 1983). Members of a set of DNA or protein sequences descended from a common ancestral sequence are called "homologous" to each other, and a set of homologous sequences in contemporary organisms is the result of the stochastic process along a particular evolutionary tree. The ML method infers the evolutionary tree from these sequences.

The data can be represented as follows:

$$
\begin{array}{ll}
\text{species } 1: & X_{11}, X_{12}, X_{13}, \ldots, X_{1n} \\
\text{species } 2: & X_{21}, X_{22}, X_{23}, \ldots, X_{2n} \\
\quad \vdots & \\
\text{species } k: & X_{k1}, X_{k2}, X_{k3}, \ldots, X_{kn},
\end{array}
$$

where $X_{ij}$ is one of T, C, A, or G for DNA sequences and one of the 20 amino acids for protein sequences. We write the whole data $(X_{ij})$ as $X$ and the value of the $h$th site, $(X_{1h}, \ldots, X_{kh})^{T}$ (a superscript "T" denotes a transposed vector), as $X_h$. Under the assumption of independent identical distributions (i.i.d.) among sites, the likelihood is given by

$$
L(\theta \mid X) = \prod_{h=1}^{n} f(X_h \mid \theta), \tag{1}
$$

where $f(x \mid \theta) = f(x_1, x_2, \ldots, x_k \mid \theta)$ is the probability that species 1 has the base or amino acid $x_1$, species 2 has $x_2$, ..., and species $k$ has $x_k$ at a homologous site. The vector $\theta$ denotes the unknown parameters, such as the branching dates and the substitution rates, along the respective branches of the tree. The likelihood of each tree is computed by choosing $\theta$ so as to maximize it. Then, the fit of the tree to the data is compared with those of the alternative trees by the criterion of likelihood. The tree with the highest likelihood is chosen as the most likely candidate for the true tree.

The log-likelihoods of $m$ alternative trees are represented by

$$
l_{(i)}(\theta_{(i)} \mid X) = \sum_{h=1}^{n} \log f_{(i)}(X_h \mid \theta_{(i)}), \quad i = 1, \ldots, m, \tag{2}
$$

where the terms on the right-hand side are i.i.d. When $\theta$ is replaced by the ML estimate, $\hat{\theta}$, the terms on the right-hand side of

$$
l_{(i)}(\hat{\theta}_{(i)} \mid X) = \sum_{h=1}^{n} \log f_{(i)}(X_h \mid \hat{\theta}_{(i)}), \quad i = 1, \ldots, m, \tag{3}
$$

are no longer i.i.d. However, when $n$ is large, the distribution of eq. (3) coincides asymptotically with that of eq. (2). Therefore, the estimated log likelihoods asymptotically follow an MND whose mean and variance-covariance can be estimated by

$$
l_{(i)}(\hat{\theta}_{(i)} \mid X), \quad i = 1, \ldots, m,
$$

and

$$
\frac{n}{n-1} \sum_{h=1}^{n}
\left[ \log f_{(i)}(X_h \mid \hat{\theta}_{(i)}) - \frac{1}{n} \sum_{h'=1}^{n} \log f_{(i)}(X_{h'} \mid \hat{\theta}_{(i)}) \right]
\left[ \log f_{(j)}(X_h \mid \hat{\theta}_{(j)}) - \frac{1}{n} \sum_{h'=1}^{n} \log f_{(j)}(X_{h'} \mid \hat{\theta}_{(j)}) \right]
$$

(Kishino and Hasegawa 1989), respectively. The MND method estimates the bootstrap probability that a particular tree is selected as the best tree by comparing the components of random vectors drawn from the MND presented above (Kishino et al. 1990).

The bootstrap probabilities can also be estimated by the RELL method, which resamples the estimated log likelihoods of sites as follows:

$$
l^{*}_{(i)} = \sum_{h=1}^{n} \log f_{(i)}(X^{*}_{h} \mid \hat{\theta}_{(i)}), \tag{4}
$$

where $X^{*}_{h}$ refers to a site resampled by bootstrap (Schork and Schork 1989; Kishino et al. 1990). Either method can give the bootstrap probabilities of candidate trees without performing ML estimation for each resampled data set.

Comparison of the RELL and MND Methods with the Standard Bootstrap Method
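As background for this comparison, the two approximations can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used in the study: the function names `rell_bootstrap` and `mnd_bootstrap`, the array name `site_loglik`, and the use of NumPy are all hypothetical choices, and the demonstration data are synthetic numbers invented for the example. The input is assumed to be an m-by-n array whose entry [i, h] is the log-likelihood of site h under tree i at that tree's ML estimates, i.e., the summands of eq. (3).

```python
import numpy as np


def rell_bootstrap(site_loglik, n_rep=5000, seed=None):
    """RELL sketch: resample per-site log-likelihoods instead of refitting.

    site_loglik has shape (m, n); entry [i, h] is the log-likelihood of
    site h under tree i, evaluated at that tree's ML parameter estimates.
    """
    rng = np.random.default_rng(seed)
    site_loglik = np.asarray(site_loglik, dtype=float)
    m, n = site_loglik.shape
    wins = np.zeros(m)
    for _ in range(n_rep):
        idx = rng.integers(0, n, size=n)           # draw n sites with replacement
        totals = site_loglik[:, idx].sum(axis=1)   # resampled log-likelihoods, eq. (4)
        wins[np.argmax(totals)] += 1               # ties go to the lowest tree index
    return wins / n_rep


def mnd_bootstrap(site_loglik, n_rep=5000, seed=None):
    """MND sketch: draw log-likelihood vectors from a multivariate normal."""
    rng = np.random.default_rng(seed)
    site_loglik = np.asarray(site_loglik, dtype=float)
    m, n = site_loglik.shape
    mean = site_loglik.sum(axis=1)                 # estimated log-likelihoods, eq. (3)
    cov = n * np.cov(site_loglik, ddof=1)          # n/(n-1) * sum of centered cross-products
    draws = rng.multivariate_normal(mean, cov, size=n_rep)
    best = np.argmax(draws, axis=1)                # highest-likelihood tree in each draw
    return np.bincount(best, minlength=m) / n_rep


# Synthetic demonstration data (invented for illustration, not from the paper):
# three trees, 232 sites, strongly correlated per-site log-likelihoods.
demo_rng = np.random.default_rng(0)
shared = demo_rng.normal(-2.8, 1.0, size=232)
demo = np.vstack([shared + demo_rng.normal(0.0, 0.1, size=232) for _ in range(3)])
print("RELL:", rell_bootstrap(demo, seed=1))
print("MND: ", mnd_bootstrap(demo, seed=2))
```

In this sketch the cost of the RELL loop grows with the number of sites n, whereas the MND draws depend only on the number of trees m once the mean and covariance have been computed. Both functions return one estimated probability per tree.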
To examine the extent to which the RELL and MND methods provide good approximations, bootstrap probabilities estimated by these methods were compared with those estimated by the standard bootstrap method. The data used in this study are the third-codon positions of the 3' end of the NADH-dehydrogenase subunit 4 gene (ND4) and of the 5' end of the ND5 gene from Hominoidea (human, chimpanzee, gorilla, and orangutan) (Brown et al. 1982). The data consist of 232 nucleotides and have been analyzed elsewhere (Hasegawa and Kishino 1989; Kishino and Hasegawa 1989). Three tree topologies are possible among human, chimpanzee, and gorilla when orangutan is taken as an outgroup to these three species (Hasegawa and Kishino 1989; Kishino and Hasegawa 1989): tree 1, (((human, chimp), gorilla), orangutan); tree 2, (((human, gorilla), chimp), orangutan); and tree 3, (((chimp, gorilla), human), orangutan). The log likelihood for each tree topology was computed by DNAML of Felsenstein's (1988) PHYLIP program package (version 3.2). The transition/transversion parameter (= 26) was chosen so as to maximize the likelihood. Bootstrap resampling and ML estimation were repeated 5,000 times, and, for each resampling of sites, equation (4) was also applied to estimate bootstrap probabilities of the three alternative trees by the RELL method. Bootstrap probabilities were also estimated by the MND method.

Estimated log-likelihoods of trees 1, 2, and 3 ($l_{(1)}$, $l_{(2)}$, and $l_{(3)}$) are -661.52, -663.16, and -665.22, respectively, and tree 1 is the highest-likelihood tree. Figure 1 shows distributions of $l_{(2)} - l_{(1)}$ and $l_{(3)} - l_{(1)}$ estimated by the bootstrap and RELL methods. It is apparent that the simple RELL method gives nearly the same result as the cumbersome bootstrap method. Table 1 presents bootstrap probabilities estimated by the bootstrap method as well as by the RELL and MND methods. The RELL and MND methods give almost identical estimates of bootstrap probabilities, and these estimates approximately coincide with the result obtained by repeated bootstrap resampling and ML estimation. This result suggests that the RELL and MND methods are useful as good approximations for estimating the bootstrap probabilities.

Table 1
Bootstrap Probabilities of the Three Alternative Trees of Hominoidea, Estimated by the RELL, MND, and Bootstrap Methods

                       BOOTSTRAP                     TOTAL
RELL         Tree 1    Tree 2    Tree 3         RELL      MND
Tree 1       0.6382    0.0034    0.0028       0.6444   0.6458
Tree 2       0.0250    0.3008    0.0076       0.3334   0.3342
Tree 3       0.0072    0.0100    0.0050       0.0222   0.0200
Total        0.6704    0.3142    0.0154

NOTE. No. of bootstrap replications = 5,000. The (i, j) element of the central 3 x 3 region of the table represents the relative frequency of samples that gave tree i by the RELL method and tree j by the standard bootstrap method. The marginal totals represent bootstrap probabilities estimated by the RELL and standard bootstrap methods, respectively. Tree 1 = (((human, chimp), gorilla), orangutan); tree 2 = (((human, gorilla), chimp), orangutan); and tree 3 = (((chimp, gorilla), human), orangutan).

FIG. 1. Histograms of $l_{(2)} - l_{(1)}$ and $l_{(3)} - l_{(1)}$ estimated by the bootstrap and RELL methods (no. of bootstrap replications = 5,000). Solid lines and dotted lines indicate the standard bootstrap and RELL methods, respectively. Horizontal axis: log-likelihood difference.

Discussion

The present study demonstrates that the simple RELL and MND methods give bootstrap probabilities that approximate well the result obtained by the standard bootstrap method. Once an ML analysis has been done, the CPU time for the RELL method is proportional to the length of the sequences, while that of the MND method is independent of it. Hence, for long sequences, the MND method is more rapid than the RELL method. However, the CPU time for the RELL method, as well as for the MND method, is negligible compared with that required for estimating the log-likelihood for each site. Because the RELL method is simple to apply, it has been implemented in the PROTML program for ML analysis of protein sequences (Adachi and Hasegawa 1992). The RELL method would be useful not only in molecular phylogenetic analyses, but also in various fields of applied statistics in which bootstrap resampling and ML estimation must be repeated. This method is related to the parsimony test of Templeton (1983).

The ML method has a disadvantage in that it imposes a large computational burden compared with other widely used methods of molecular phylogenetics. However, this disadvantage might be compensated to some extent by the introduction of the simple methods, discussed in the present paper, for estimating the bootstrap probability of an ML tree.

Acknowledgments

We are grateful to M. Fujiwara and J. Adachi for help with the computer. We also thank J. Felsenstein for information and J. Reeves and the two reviewers for helpful comments on an earlier version of the manuscript. This work was carried out under the Institute of Statistical Mathematics Cooperative Research Program (91-ISM CRP-69 and 92-ISM CRP-81) and was supported by grants from the Ministry of Education, Science, and Culture of Japan.
LITERATURE CITED

ADACHI, J., and M. HASEGAWA. 1992. Computer science monographs, no. 27: MOLPHY: programs for molecular phylogenetics. I. PROTML: maximum likelihood inference of protein phylogeny. Institute of Statistical Mathematics, Tokyo.
BROWN, W. M., E. M. PRAGER, A. WANG, and A. C. WILSON. 1982. Mitochondrial DNA sequences of primates: tempo and mode of evolution. J. Mol. Evol. 18:225-239.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
FELSENSTEIN, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.
FELSENSTEIN, J. 1988. PHYLIP, version 3.1. University of Washington, Seattle.
HASEGAWA, M., and M. FUJIWARA. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:1-5.
HASEGAWA, M., and H. KISHINO. 1989. Confidence limits on the maximum likelihood estimate of the hominoid tree from mitochondrial-DNA sequences. Evolution 43:672-677.
HASEGAWA, M., H. KISHINO, and N. SAITOU. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443-445.
KIMURA, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge.
KISHINO, H., and M. HASEGAWA. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170-179.
KISHINO, H., T. MIYATA, and M. HASEGAWA. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 30:151-160.
SCHORK, N., and M. A. SCHORK. 1989. Testing separate families of segregation hypotheses: bootstrap methods. Am. J. Hum. Genet. 45:803-813.
TEMPLETON, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221-244.
TAKASHI GOJOBORI, reviewing editor

Received October 6, 1992

Accepted August 12, 1993