TAXON 58 (3) • August 2009: 955–958
Torres-Carvajal • Non-parametric bootstrapping
Non-parametric bootstrapping of partitioned datasets
Omar Torres-Carvajal
Escuela de Biología, Pontificia Universidad Católica del Ecuador, Avenida 12 de Octubre y Roca, Apartado 17-01-2184, Quito, Ecuador; Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, MRC 162, Washington, D.C. 20560, U.S.A.
Non-parametric bootstrapping is one of the most commonly used methods for branch support assessment. Unlike Bayesian posterior probability values, which are influenced by a priori data partitioning, non-parametric bootstrapping is usually applied to unpartitioned (combined) datasets. The resulting bootstrap support values are misleading in that they do not measure how well clades are supported by all the partitions, unless all partitions are equal in size (i.e., number of characters). Since most empirical studies include data partitions that are heterogeneous in size, our current bootstrapping approach for partitioned datasets (i.e., bootstrapping the combined dataset) is not adequate. Here I propose a simple modification to non-parametric bootstrapping that takes a priori data partitioning into account by obtaining bootstrap replicates for each partition separately and combining them in such a way that the size (i.e., number of characters) of each partition is taken into account. With this “corrected” bootstrap support value, characters from smaller partitions will have greater influence on final bootstrap values, and those in larger partitions relatively less influence, than they would for unpartitioned data.
KEYWORDS: non-parametric bootstrapping, partitioned datasets
INTRODUCTION
Non-parametric bootstrapping, one of the most commonly used methods for branch support assessment, is a statistical resampling technique introduced by Efron (1979) and first applied to phylogenetics by Felsenstein (1985). In phylogenetics this technique involves (1) generating pseudoreplicate data matrices equal in size to an original matrix by resampling (with replacement) columns of characters from the original matrix; (2) obtaining an optimal tree topology for each pseudoreplicate under the same optimality criterion applied to the original matrix; and (3) summarizing the pseudoreplicate topologies to obtain a distribution of bipartition frequencies (Felsenstein, 1985, 2004). The proportion of times that each bipartition is inferred from the pseudoreplicate matrices is called the bootstrap support value and is commonly used as a measure of branch support, although the meaning of this value (e.g., repeatability, accuracy, Type I error rate, adequacy) has been subject to extensive debate (Alfaro & al., 2003; Grant & Kluge, 2008).
Inference of phylogenetic trees using heterogeneous datasets (e.g., multiple genes, morphological versus molecular characters) has become common practice among systematists. A well-recognized problem relevant to phylogenetic inference is that the different “parts” of these heterogeneous datasets might evolve under different stochastic evolutionary models; therefore, assuming a single model for a heterogeneous dataset might be misleading (DeBry, 1999; Caterino & al., 2001; Castoe & al., 2004; Pagel & Meade, 2004; Brandley & al., 2005). Based on some biologically justified criterion, it is possible to partition a heterogeneous dataset a priori (e.g., by gene, codon position, tRNA secondary structure), estimate the best model of evolution for each partition, and analyze each partition separately or all partitions together. Thus, data partitioning attempts to correct phylogenetic inference by allowing “model heterogeneity” across partitions.
Currently the most widely used method to accommodate a priori data partitioning is Bayesian phylogenetic inference under mixed models, which allows heterogeneity across data partitions in overall evolutionary rate, substitution model parameters, topology, and branch lengths (Ronquist & Huelsenbeck, 2003; Nylander & al., 2004). Like any other method of Bayesian phylogenetic inference, the mixed-model approach estimates posterior probabilities of bipartitions, which can be considered measures of branch support because they represent the probability that each bipartition is true. More importantly, Bayesian posterior probability values are influenced by a priori data partitioning. In contrast, although non-parametric bootstrapping is commonly used as an additional measure of branch support for heterogeneous datasets, this method is usually applied to the unpartitioned (combined) dataset, and a priori data partitioning is ignored. The resulting bootstrap support values are misleading in that they do not measure how well clades are supported by all the partitions, unless all partitions are equal in size (i.e., number of characters). Since most empirical studies include data partitions that are heterogeneous in size, our current bootstrapping approach for partitioned datasets (i.e., bootstrapping the combined dataset) is not adequate. In this paper I propose a simple modification to non-parametric bootstrapping that takes a priori data partitioning into account.
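Step (1) of the standard procedure described at the start of this section, resampling character columns with replacement, can be sketched as follows. This is a minimal illustration only: the function name `bootstrap_matrix`, the dictionary layout of the matrix, and the toy character states are assumptions made for the example, and tree inference (step 2) is left to a dedicated program.

```python
import random

def bootstrap_matrix(matrix, rng):
    """Return one pseudoreplicate of `matrix` (hypothetical helper).

    `matrix` maps taxon name -> string of character states; all rows must
    have the same length. Columns are drawn with replacement, so the
    pseudoreplicate has the same dimensions as the original matrix.
    """
    n_chars = len(next(iter(matrix.values())))
    # Draw n_chars column indices with replacement.
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    # Apply the same column sample to every taxon so columns stay intact.
    return {taxon: "".join(states[c] for c in cols)
            for taxon, states in matrix.items()}

# Toy 10-character matrix (invented states, for illustration only).
toy = {
    "O": "0000000000",
    "A": "1100011000",
    "B": "1100010100",
    "C": "0011001010",
    "D": "0011100001",
}
rng = random.Random(1)  # fixed seed so the sketch is repeatable
pseudoreplicate = bootstrap_matrix(toy, rng)
```

Sampling whole columns, rather than individual cells, keeps each character's distribution of states across taxa unchanged; only the number of times each character appears varies among pseudoreplicates.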
THE CORRECTED BOOTSTRAP VALUE
In order to determine the contribution of each data partition to the final bootstrap support values, we have to obtain bootstrap replicates for each partition separately and combine them in such a way that the size (i.e., number of characters) of each partition is taken into account. This can be achieved by following two simple steps regardless of the optimality criterion being used (e.g., parsimony or maximum likelihood):
(1) Calculate the number of bootstrap replicates (r) to be obtained for each partition by using the equation:
$$ r = \frac{n}{N} \times R $$
where n is the number of characters in the partition, N is the sum of characters of all partitions, and R is the total number of bootstrap replicates. For example, suppose we want to assess branch support with 10,000 bootstrap replicates for a dataset composed of two partitions, A and B, containing 300 and 700 characters, respectively. Following the formula above, we would obtain 3,000 bootstrap replicates ([300 / 1,000] × 10,000) for partition A and 7,000 replicates ([700 / 1,000] × 10,000) for partition B. Thus, the number of bootstrap replicates for each partition is proportional to the number of characters contained in that partition relative to the total number of characters in the dataset.
(2) Infer the optimal tree topology for each bootstrap replicate and calculate the frequency of each bipartition across all topologies; this is traditionally achieved by obtaining a 50% majority rule consensus tree (Margush & McMorris, 1981). The resulting frequency values can be called corrected bootstrap values because they reflect the contribution of each partition, as opposed to bootstrap values obtained from unpartitioned datasets.
The corrected bootstrap support value summarizes how well a particular clade is supported by all partitions after correcting for differences in size among them. For example, if a clade is strongly supported by all partitions (i.e., it has a high bootstrap value for each partition), it will receive a high corrected bootstrap value. This means that the level of congruence among partitions for that particular clade is high. If the same clade is strongly supported by one partition, but weakly or not at all supported by others, the corrected bootstrap value will be substantially lower. This means that the level of congruence among partitions for that particular clade is lower than in the previous case. With the corrected bootstrap support value, characters from smaller partitions will have greater influence on the final bootstrap values, and those in larger partitions relatively less influence, than they would for the unpartitioned data.
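The two steps of this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's implementation: the function names, the representation of each replicate topology as a set of bipartition labels, and the stand-in replicate lists are all hypothetical, and real replicate topologies would come from a tree-search program. Rounding r to a whole number is also an assumption; for partition sizes that do not divide evenly, the rounded values may not sum exactly to R.

```python
from collections import Counter

def replicates_per_partition(sizes, total_reps):
    """Step 1: r = (n / N) * R for each partition, rounded to an integer."""
    n_total = sum(sizes.values())  # N, the total number of characters
    return {name: round(n * total_reps / n_total)
            for name, n in sizes.items()}

def corrected_support(topologies_by_partition):
    """Step 2: pool replicate topologies from all partitions and return
    the percent frequency of each bipartition across the pooled set.

    Each topology is represented as a set of bipartition labels.
    """
    counts = Counter()
    total = 0
    for topologies in topologies_by_partition.values():
        for bipartitions in topologies:
            counts.update(bipartitions)  # each bipartition counted once per tree
            total += 1
    return {b: 100 * c / total for b, c in counts.items()}

# Worked example from the text: partitions of 300 and 700 characters.
reps = replicates_per_partition({"A": 300, "B": 700}, 10_000)

# Stand-in topologies (hypothetical): every replicate of partition A
# recovers bipartitions b1 and b2, every replicate of B recovers b2 and b4.
topologies = {
    "A": [{"b1", "b2"}] * reps["A"],
    "B": [{"b2", "b4"}] * reps["B"],
}
support = corrected_support(topologies)
```

With these stand-in inputs, `reps` is 3,000 and 7,000 replicates for A and B, matching the worked example, and a bipartition recovered by every replicate of both partitions receives a pooled frequency of 100%.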
A THEORETICAL EXAMPLE
To illustrate the significance of corrected bootstrap values I constructed several 1,000-character data matrices containing two partitions, each strongly supporting one of two tree topologies, and compared the “uncorrected” non-parametric bootstrapping with the corrected bootstrapping proposed herein (Fig. 1). Matrices were obtained by replicating two different 10-character datasets, each constructed manually to support a different tree. Both trees were rooted with outgroup taxon O and represent two different hypotheses of relationships among five ingroup taxa (A–E). Nine 1,000-character matrices (a–i; Fig. 1) were constructed to contain both partitions supporting these trees in the following proportions: 90% / 10%, 80% / 20%, 70% / 30%, 60% / 40% and 50% / 50%, with each unequal split assigned in both directions (i.e., from 10% / 90% in matrix a to 90% / 10% in matrix i). All matrices were bootstrapped (1,000 replicates, parsimony, branch-and-bound) under both the uncorrected and corrected bootstrap strategies using PAUP* 4.0b10 (Swofford, 2003). The frequencies of the five possible bipartitions show clearly that the uncorrected bootstrap analysis might fail to sample partitions that represent up to 40% of the total number of characters in the dataset. In contrast, the corrected bootstrap analysis takes into account the smallest partition, even when it represents only 10% of the total number of characters (Fig. 1). This example is extreme in that all branches on each tree are strongly supported by the corresponding data matrix, which highlights the importance of correcting the traditional method of non-parametric bootstrapping for partitioned datasets: even when 40% of a dataset strongly supports an alternative tree topology, this 40% might be ignored when bootstrapping unless we perform the proposed correction.
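One way to see why the corrected analysis always registers the smaller partition is to write the pooled frequency explicitly. The notation below is introduced here for illustration and is not part of the original method description: BS_corr(b) is the corrected value for bipartition b, and f_i(b) is the proportion of partition i's replicates that contain b. Substituting r_i = (n_i / N) × R from the equation for r, and ignoring rounding of r_i, gives:

```latex
\mathrm{BS}_{\mathrm{corr}}(b)
  = \frac{\sum_i r_i \, f_i(b)}{R}
  = \sum_i \frac{n_i}{N} \, f_i(b)
```

Under the idealized assumption that each partition recovers its own tree in every replicate (f = 1 for its own bipartitions, f = 0 for those unique to the other tree), a bipartition unique to tree 1 in a matrix where partition 1 holds 10% of the characters would be expected to receive a corrected value of about 0.1 × 1 + 0.9 × 0 = 10%, so the smaller partition's signal remains visible in proportion to its size.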
[Figure 1 not reproduced: two tree topologies (tree 1, tree 2) for outgroup O and ingroup taxa A–E with bipartitions 1–5 marked, and one bar chart per bipartition plotting bootstrap value against matrix (a–i).]
Fig. 1. Uncorrected (grey) and corrected (black) bootstrap support values for five bipartitions corresponding to two different tree topologies. Bipartitions 1, 3 and 4, 5 correspond to trees 1 and 2, respectively; bipartition 2 corresponds to both trees. Bootstrap values were obtained upon analysis of nine data matrices, each containing two partitions supporting, respectively, trees 1 and 2 in the following proportions: (a) 10% / 90%, (b) 20% / 80%, (c) 30% / 70%, (d) 40% / 60%, (e) 50% / 50%, (f) 60% / 40%, (g) 70% / 30%, (h) 80% / 20%, (i) 90% / 10%.
CORRECTED BOOTSTRAP VALUES AND TREE TOPOLOGY CONGRUENCE
The idea of incorporating partitioning schemes in non-parametric bootstrap approaches is not new. For example, Struck & al. (2006) proposed a method called partition addition bootstrap alteration (PABA), which evaluates congruence for any given node in a tree by determining how bootstrap scores change when different data partitions are added. The corrected bootstrap proposed here can also be used to assess tree topology congruence, which is commonly measured with consensus trees, consensus indices, or tree comparison metrics (see Swofford, 1991, for a review). In this case the method is the same as described above except that instead of calculating the number of bootstrap replicates for data partitions we calculate the number of replicates for all datasets being compared. The corrected bootstrap values assess the congruence among different datasets (under a particular optimality criterion) by reflecting the size-constrained contribution of each dataset to a final frequency distribution of bipartitions.
ACKNOWLEDGEMENTS
I am grateful to M. Alfaro, K. de Queiroz and L. Torres for helpful comments on earlier stages of this manuscript.
LITERATURE CITED
Alfaro, M.E., Zoller, S. & Lutzoni, F. 2003. Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov Chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Molec. Biol. Evol. 20: 255–266.
Brandley, M.C., Schmitz, A. & Reeder, T.W. 2005. Partitioned Bayesian analyses, partition choice, and the phylogenetic relationships of scincid lizards. Syst. Biol. 54: 373–390.
Castoe, T.A., Doan, T.M. & Parkinson, C.L. 2004. Data partitions and complex models in Bayesian analysis: the phylogeny of gymnophthalmid lizards. Syst. Biol. 53: 448–469.
Caterino, M.S., Reed, R.D., Kuo, M.M. & Sperling, F.A.H. 2001. A partitioned likelihood analysis of swallowtail butterfly phylogeny (Lepidoptera: Papilionidae). Syst. Biol. 50: 106–127.
DeBry, R.W. 1999. Maximum likelihood analysis of gene-based and structure-based process partitions, using mammalian mitochondrial genomes. Syst. Biol. 48: 286–299.
Efron, B. 1979. Bootstrap methods: another look at the jackknife. Ann. Statist. 7: 1–26.
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783–791.
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer, Sunderland.
Grant, T. & Kluge, A.G. 2008. Clade support measures and their adequacy. Cladistics 24: 1051–1064.
Margush, T. & McMorris, F.R. 1981. Consensus n-trees. Bull. Math. Biol. 43: 239–244.
Nylander, J.A.A., Ronquist, F., Huelsenbeck, J.P. & Nieves-Aldrey, J.L. 2004. Bayesian phylogenetic analysis of combined data. Syst. Biol. 53: 47–67.
Pagel, M. & Meade, A. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53: 571–581.
Ronquist, F. & Huelsenbeck, J.P. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572–1574.
Struck, T.H., Purschke, G. & Halanych, K.M. 2006. Phylogeny of Eunicida (Annelida) and exploring data congruence using a partition addition bootstrap alteration (PABA) approach. Syst. Biol. 55: 1–20.
Swofford, D.L. 1991. When are phylogeny estimates from molecular and morphological data incongruent? Pp. 295–333 in: Miyamoto, M.M. & Cracraft, J. (eds.), Phylogenetic Analysis of DNA Sequences. Oxford University Press, New York.
Swofford, D.L. 2003. PAUP*. Phylogenetic Analysis Using Parsimony* (and Other Methods), version 4.0. Sinauer, Sunderland.