Points of View

Syst. Biol. 45(3):375-380, 1996

Combining Data with Different Distributions of Among-Site Rate Variation

JACK SULLIVAN¹

Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269-3043, USA

¹ Present address: Laboratory of Molecular Systematics, MSC, MRC-534, Smithsonian Institution, Washington, D.C. 20560, USA. E-mail: [email protected].

Downloaded from http://sysbio.oxfordjournals.org/ by guest on November 30, 2016

As the number of molecular studies continues to grow, so does the problem of how to analyze multiple data sets. The importance of this problem is indicated by the flurry of recent papers (reviewed by de Queiroz et al., 1995) addressing philosophical and practical considerations facing systematists fortunate enough to have several pertinent data sets on hand. Perhaps the most rigorous practical treatment was that of Bull et al. (1993), who argued against combining data when there is demonstrable heterogeneity among the different data sets in the processes governing character evolution. These authors showed that combining simulated data sets generated from the same topology but with different rates of evolution leads to less accurate estimation of phylogeny than does analysis of the more slowly evolving data set alone. Although the validity of these results is not in question for the models tested, the models used in these simulations may not be very realistic. Specifically, Bull et al. used two single-rate models to generate data sets, one with a uniform high rate and one with a uniform low rate of evolution. Chippindale and Wiens (1994) demonstrated that down-weighting rapidly evolving characters resulted in increased accuracy in combined analyses of the Bull et al. simulated data sets. Most real data sets, however, contain at least some degree of among-site rate variation, with evolutionary rates more or less continuously distributed across sites (e.g., Uzzell and Corbin, 1971; Kocher and Wilson, 1991; Yang et al., 1994; Sullivan et al., 1995), and classification of data from one gene as uniformly fast and data from another as uniformly slow is neither feasible nor appropriate. In such cases, combining data may be appropriate (Chippindale and Wiens, 1994).

I present an example using two data sets in which phylogenetic analyses of each separately result in strongly supported yet conflicting topologies and in which the distributions of evolutionary rates across sites are significantly different. Further, several tests of between-data-set heterogeneity suggest that the incongruence is greater than can be attributed to sampling error alone. Phylogenetic analysis of the combined data set with equal weights (the most homogeneous reconstruction model possible), however, yields a better estimate of phylogeny than does analysis of either of the data sets separately.

The data sets used here are those of Sullivan et al. (1995) and include 12S ribosomal RNA data (780 bp) and a portion of the cytochrome b (cyt b) gene (321 bp) for 10 taxa of deer mice (Peromyscus) and grasshopper mice (Onychomys). The relationships among these taxa are well understood based on congruence of several data sets, including allozyme data (Avise et al., 1974, 1979) and cladistic analyses of chromosomal banding data (Stangl and Baker, 1984; Smith, 1990). Phylogenetic analyses of the cyt b data recover the well-corroborated relationships among these mice (Fig. 1a), whereas analyses of the 12S data result in strong support for a conflicting resolution (Fig. 1b). Because these genes are physically linked on the nonrecombining mitochondrial genome, they must share a common history, thus excluding the possibilities that separate analyses of each gene are providing correct estimates of different gene trees (Miyamoto et al., 1994) and that different histories are the cause of incongruence (de Queiroz et al., 1995).

FIGURE 1. Phylogenetic analyses of Peromyscus and Onychomys. Bootstrap values (500 replicates) are indicated above branches; decay indices are below branches. Symbols on the trees mark the P. leucopus group, the P. maniculatus group, and the P. eremicus group. (a) Cytochrome b data. (b) 12S ribosomal RNA data. (c) Combined data using a homogeneous reconstruction model. The relationships indicated in trees a and c are congruent with the well-corroborated relationships among these taxa.

The gamma distribution shape parameter can be used to quantify among-site rate variation. This parameter is inversely related to the coefficient of variation, such that low values indicate strong rate heterogeneity. Analyses of among-site rate variation for these data sets (using GAMMA; Sullivan et al., 1995) reveal significantly different distributions of evolutionary rates. Specifically, the 12S data fit the gamma-distributed rates model, with a shape parameter of 0.281, whereas the cyt b data have significantly less among-site rate variation (shape parameter = 0.649). The extreme rate heterogeneity in the 12S data appears to have made the 12S data susceptible to the misleading effects of nonrandom noise (Sullivan et al., 1995). Because differences between patterns of variation, such as those described above, result from differences in substitution processes, the significantly different estimates of the shape parameter can be taken to indicate that different processes govern sequence evolution in the two genes.
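The inverse relationship between the shape parameter and the coefficient of variation can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from the GAMMA program: it relies on the standard property that a gamma distribution with shape α has a coefficient of variation of 1/√α regardless of its scale, and the function name `gamma_cv` is a hypothetical helper.

```python
import math


def gamma_cv(shape: float) -> float:
    """Coefficient of variation (sd / mean) of a gamma distribution.

    For any scale, a gamma distribution with the given shape has
    CV = 1 / sqrt(shape), so smaller shapes mean more rate variation.
    """
    if shape <= 0:
        raise ValueError("shape must be positive")
    return 1.0 / math.sqrt(shape)


# Shape parameter estimates reported in the text for the two genes.
shapes = {"12S rRNA": 0.281, "cyt b": 0.649}
cv = {gene: gamma_cv(alpha) for gene, alpha in shapes.items()}

for gene, value in cv.items():
    print(f"{gene}: shape = {shapes[gene]:.3f}, CV of site rates = {value:.2f}")
```

Under this relationship the 12S estimate corresponds to a coefficient of variation of about 1.89 and the cyt b estimate to about 1.24, consistent with the statement that the 12S data show more among-site rate heterogeneity.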
To test the significance of the differences between the processes, the G option in Baseml (Yang, 1994) can be used to create a mixed reconstruction model in which sites from each of the two genes are analyzed using model parameters estimated from their respective genes. If the two genes have evolved following different processes, allowing for intergenic heterogeneity in models used for maximum-likelihood analysis will result in an improvement in the likelihood score. Because the mixed model has one more parameter than the homogeneous model, the improvement in the likelihood score can be assessed directly using the likelihood-ratio test, and the test statistic will follow a chi-square distribution with one degree of freedom (Yang et al., 1995). In this case, allowing for intergenic heterogeneity results in a significantly better likelihood score (Table 1), providing evidence that these genes represent a valid process partition. In addition, application of the Farris et al. (1995) incongruence test, as implemented in PAUP* (version 4.0; Swofford, 1996), suggests that there is significant incongruence between data sets (P = 0.016). Significant heterogeneity between these genes is therefore indicated by three different tests: the likelihood-ratio test, comparison of bootstrap values for conflicting nodes (de Queiroz, 1993), and the test of Farris et al. (1995). Thus, following the logic of Bull et al. (1993), each data set appears to represent a valid process partition and the data should not be combined.

However, when these data sets are combined, parsimony analysis with equal weights (the most homogeneous reconstruction model possible) recovers the well-corroborated relationships (Fig. 1c), the same topology as produced by the cyt b data alone. Further, nodal support for the well-corroborated relationships among the maniculatus group taxa is higher in the combined analysis than in the cyt b analysis, even though the 12S data alone (Fig. 1b) suggest a paraphyletic maniculatus group. More significantly, the support for the well-corroborated placement of P. eremicus in the cyt b data set is not compromised by the inclusion of the 12S data; in fact, there is a slight increase in the decay index (Fig. 1).

Apparently, there is phylogenetic signal in the 12S data regarding the sister group relationship between the leucopus and maniculatus species groups. Presumably, this signal is present at more slowly evolving sites but is overridden by some systematic bias at rapidly evolving sites, which results in strong support for a node uniting P. eremicus with the leucopus group in the analysis of the 12S data alone (Fig. 1b). When among-site rate variation is examined in the context of the well-corroborated relationships, the sites that unambiguously (yet apparently incorrectly) unite eremicus with the leucopus group are identified as high-rate (homoplastic) sites (Sullivan et al., 1995). The signal at more slowly evolving sites in the 12S data, which is not recognized in the separate analysis, contributes to the increased support for the well-corroborated resolution of these taxa in the combined analysis, relative to the cyt b data alone (Fig. 1). Thus, phylogenetic signal can in fact be additive when data from genes with different evolutionary processes are analyzed under a homogeneous reconstruction model (parsimony with equal weights), such that signal that is present but obscured by nonrandom noise in one data set can be revealed when data sets are combined (Barrett et al., 1991). The inclusion of the hidden signal can lead to more robust estimates of phylogeny when data sets are analyzed in combination rather than separately, even in the face of significant incongruence among data sets.

Although the results of the Bull et al. (1993) simulations may apply in some cases, there will often be overlap in the distributions of evolutionary rates in two or more molecular data sets, and characterization of genes as "slow" versus "fast" is often an oversimplification. Provided the multiple data sets share a common history, as must be the

TABLE 1. Significance of data heterogeneity in the maximum-likelihood analysis of the combined 12S ribosomal RNA and cytochrome b data set for deer and grasshopper mice. $l_0$ is the log-likelihood score calculated under a mixed reconstruction model, and $l_1$ is the score calculated under a homogeneous model.

Model            ln L
Homogeneous      -2916.71
Heterogeneous    -2892.37

$\chi^2 = 2(l_0 - l_1) = 48.68$; $P < 0.001$
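The likelihood-ratio statistic reported in Table 1 can be checked from the two log-likelihood scores. The following is a minimal sketch, not the Baseml procedure itself: the function name is hypothetical, and it assumes, as the text states, a chi-square reference distribution with one degree of freedom, for which the upper-tail probability equals erfc(√(x/2)).

```python
import math


def likelihood_ratio_test_1df(lnl_homogeneous: float, lnl_mixed: float) -> tuple:
    """Likelihood-ratio test for nested models differing by one parameter.

    Returns (statistic, p_value), where the p-value is the upper tail of a
    chi-square distribution with 1 degree of freedom: erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (lnl_mixed - lnl_homogeneous)
    if statistic < 0:
        raise ValueError("the more general model should not fit worse")
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Log-likelihood scores from Table 1 (homogeneous vs. mixed model).
stat, p_value = likelihood_ratio_test_1df(-2916.71, -2892.37)
print(f"chi-square = {stat:.2f}, P = {p_value:.1e}")
```

The statistic evaluates to 48.68, matching Table 1, and the tail probability falls far below the 0.001 threshold reported there.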

FIGURE 2. Three possible cases of evolutionary rate heterogeneity among sites in two genes (panels for gene 1 and gene 2; x-axis: expected number of substitutions per site). (a) There is no heterogeneity between genes; all sites in each gene are evolving at the same rate. (b) Each gene has some degree of among-site rate variation, in this case gamma distributions with significantly different shape parameters (0.28 and 0.67). There is considerable overlap in these distributions, especially in the region that is expected to contain the strongest phylogenetic signal (enclosed by box), and a combined data approach may improve phylogenetic estimation. (c) Each gene has a very different uniform rate. Simultaneous analysis may be compromised by inclusion of the high-rate gene (gene 2).
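The contrast between the uniform-rate cases (Fig. 2a, 2c) and the gamma case (Fig. 2b) can be illustrated by how widely per-site substitution counts are spread. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it uses the standard results that counts are Poisson (variance equal to the mean) under a single rate and negative binomial (variance = mean + mean²/shape) when rates are gamma distributed. The mean of one substitution per site and the function name are hypothetical choices; the shape values are those given in the Figure 2 caption.

```python
from typing import Optional


def substitution_count_variance(mean: float, shape: Optional[float] = None) -> float:
    """Variance of the number of substitutions per site.

    shape=None models a single uniform rate (Poisson counts, variance = mean).
    Otherwise rates are gamma distributed with the given shape, which gives
    negative binomial counts with variance = mean + mean**2 / shape.
    """
    if mean < 0:
        raise ValueError("mean must be non-negative")
    if shape is None:
        return mean
    if shape <= 0:
        raise ValueError("shape must be positive")
    return mean + mean ** 2 / shape


mean_subs = 1.0  # hypothetical expected number of substitutions per site
cases = {"uniform rate": None, "gamma, shape 0.67": 0.67, "gamma, shape 0.28": 0.28}

# Index of dispersion (variance / mean); 1.0 corresponds to the Poisson case.
dispersion = {
    label: substitution_count_variance(mean_subs, shape) / mean_subs
    for label, shape in cases.items()
}

for label, value in dispersion.items():
    print(f"{label}: variance/mean = {value:.2f}")
```

With these assumed inputs the index of dispersion rises from 1.00 for a uniform rate to about 2.49 and 4.57 for the two gamma cases, so both gamma-distributed genes spread sites over a broad, overlapping range of rates rather than a single value.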

case for mitochondrial DNA genes, the potential will exist for phylogenetic signal present in each data set to be additive.

One can envision a continuum in which at one end sites from two genes are evolving at the same uniform rate (i.e., both follow a Poisson distribution with the same rate parameter; Fig. 2a) and at the other end sites from the two data sets have different uniform evolutionary rates (i.e., both follow Poisson distributions but with different rate parameters; Fig. 2c). This end is where the data sets simulated by Bull et al. (1993) lie. Between these extremes are cases in which two or more molecular data sets have different continuous distributions of evolutionary rates (i.e., two different gamma distribution shape parameters; Fig. 2b); most real data sets will lie in this area. The issue then becomes how much heterogeneity between data sets is necessary to confound simultaneous analysis under a homogeneous reconstruction model.

The improvement in phylogenetic estimation accomplished by simultaneous analysis under the most homogeneous reconstruction model possible is surprising in light of the evidence of between-data-set heterogeneity. Swofford (1991) illustrated that congruence between data sets is a complex issue; the present case also demonstrates this point. All tests for incongruence examined here argue against combining data, yet simultaneous analysis under a severely homogeneous reconstruction model improves phylogenetic estimation, as judged by support for the well-corroborated relationships (Miyamoto and Cracraft, 1991). I agree with Huelsenbeck et al. (1994) and Bull et al. (1993) that, in theory, the best estimates of phylogeny will come from realistic reconstruction models that incorporate data heterogeneity. However, what remains to be determined is whether a given level of demonstrated heterogeneity (or incongruence among data sets) is sufficiently strong that simultaneous analyses will be compromised; it is unclear where in the continuum described above data partitioning will be necessary. This example should not be taken as evidence that the best phylogenetic estimates will always be derived from a total evidence approach; rather, our current tests of incongruence do not adequately address the issue of whether or not one should combine data. Examination of cases such as this, where there is evidence for incongruence between linked mitochondrial genes, can help address this question. When among-site rate variation is present, combining data may improve phylogenetic estimates because there likely will be overlap in the distributions of evolutionary rates (Fig. 2). Estimating among-site rate variation and conducting both separate and combined phylogenetic analyses will lead to better understanding of the data at hand.

ACKNOWLEDGMENTS

I thank Chris Simon, Dave Wagner, Kent Holsinger, John Huelsenbeck, David Cannatella, Bill Hahn, Dave Swofford, and an anonymous reviewer for providing comments. These ideas were formulated during an NSF Graduate Research Traineeship in the evolution, ecology, and conservation of biodiversity (BIR9256616), and the work was supported by a grant from the University of Connecticut Research Foundation and NSF grant BSR 91-96213.

REFERENCES

AVISE, J. C., M. H. SMITH, AND R. K. SELANDER. 1979. Biochemical polymorphism and systematics in the genus Peromyscus. VII. Geographic variation in members of the truei and maniculatus species groups. J. Mammal. 60:177-192.
AVISE, J. C., M. H. SMITH, R. K. SELANDER, T. E. LAWLOR, AND P. R. RAMSEY. 1974. Biochemical polymorphism and systematics in the genus Peromyscus. V. Insular and mainland species of the subgenus Haplomylomys. Syst. Zool. 23:226-238.
BARRETT, M., M. J. DONOGHUE, AND E. SOBER. 1991. Against consensus. Syst. Zool. 40:486-493.
BULL, J. J., J. P. HUELSENBECK, C. W. CUNNINGHAM, D. L. SWOFFORD, AND P. J. WADDELL. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384-397.
CHIPPINDALE, P. T., AND J. J. WIENS. 1994. Weighting, partitioning and combining characters in phylogenetic analysis. Syst. Biol. 43:278-287.
DE QUEIROZ, A. 1993. For consensus (sometimes). Syst. Biol. 42:368-372.
DE QUEIROZ, A., M. J. DONOGHUE, AND J. KIM. 1995. Separate versus combined analysis of phylogenetic evidence. Annu. Rev. Ecol. Syst. 26:657-681.
FARRIS, J. S., M. KALLERSJO, A. G. KLUGE, AND C. BULT. 1995. Testing significance of incongruence. Cladistics 10:315-319.
HUELSENBECK, J. P., D. L. SWOFFORD, C. W. CUNNINGHAM, J. J. BULL, AND P. J. WADDELL. 1994. Is character weighting a panacea for the problem of data heterogeneity in phylogenetic analysis? Syst. Biol. 43:288-291.
KOCHER, T. D., AND A. C. WILSON. 1991. Sequence evolution of mitochondrial DNA in humans and chimpanzees: Control region and protein coding region. Pages 391-413 in Evolution of life: Fossils, molecules, and culture (S. Osawa and T. Honjo, eds.). Springer, Tokyo.
MIYAMOTO, M. M., M. W. ALLARD, R. M. ADKINS, L. L. JANACEK, AND R. L. HONEYCUTT. 1994. A congruence test of reliability using linked mitochondrial DNA sequences. Syst. Biol. 43:236-249.
MIYAMOTO, M. M., AND J. CRACRAFT. 1991. Phylogenetic inference, DNA sequence analysis, and the future of molecular systematics. Pages 3-17 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
SMITH, S. A. 1990. Cytosystematic evidence against monophyly of the Peromyscus boylii species group (Rodentia: Cricetidae). J. Mammal. 71:654-667.
STANGL, F. B., AND R. J. BAKER. 1984. Evolutionary relationships in Peromyscus: Congruence among chromosomal, genic, and classical data sets. J. Mammal. 65:643-654.
SULLIVAN, J., K. E. HOLSINGER, AND C. SIMON. 1995. Among-site rate variation and phylogenetic analysis of 12S rRNA in sigmodontine rodents. Mol. Biol. Evol. 12:988-1001.
SWOFFORD, D. L. 1991. When are phylogeny estimates from molecular and morphological data incongruent? Pages 295-333 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
SWOFFORD, D. L. 1996. PAUP*: Phylogenetic analysis using parsimony, version 4.0. Sinauer, Sunderland, Massachusetts.
UZZELL, T., AND K. W. CORBIN. 1971. Fitting discrete distributions to evolutionary events. Science 172:1089-1096.
YANG, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306-314.
YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324.
YANG, Z., I. J. LAUDER, AND H. J. LIN. 1995. Molecular evolution of the hepatitis B virus genome. J. Mol. Evol. 41:587-596.

Received 12 September 1995; accepted 5 March 1996
Associate Editor: David Cannatella

The Probabilistic Basis of Jaccard's Index of Similarity

RAIMUNDO REAL AND JUAN M. VARGAS

Department of Animal Biology, Faculty of Science, University of Malaga, Malaga 29071, Spain; E-mail: [email protected] (R.R.)

Interspecific association analysis from presence/absence data is an unresolved topic in ecology and biogeography (e.g., Connor and Simberloff, 1979, 1983, 1984, 1986; Simberloff and Connor, 1981; Gilpin and Diamond, 1982, 1984; Ryti and Gilpin, 1987; Jackson et al., 1992). Several techniques have been proposed to test association between species. Connor and Simberloff (1979) put forward a null model based on the Monte Carlo randomization procedure. Gilpin and Diamond (1982) took a different approach, based on a loglinear model with binary data. Jackson et al. (1992) proposed a hybrid model combining the two previous methods. However, all of these null models use an observed data matrix to generate a null distribution, and so observed and null distributions lack statistical independence (Grant and Abbott, 1980). In a fourth approach, now called the coefficient model (see Jackson et al., 1992), the observed distribution of a similarity index is tested against a distribution of expected values for that index.
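As a concrete instance of a similarity index computed from presence/absence data, Jaccard's index (named in the title of this note) is conventionally the number of sites shared by two species divided by the number of sites occupied by at least one of them. The sketch below uses that conventional definition; the function name, species, and site labels are hypothetical and serve only as an illustration.

```python
def jaccard_index(sites_a: set, sites_b: set) -> float:
    """Jaccard's index for two species given the sets of sites they occupy.

    J = C / (A + B - C), where A and B are the numbers of sites occupied by
    each species and C is the number of sites they share.
    """
    shared = len(sites_a & sites_b)
    occupied = len(sites_a | sites_b)  # equals A + B - C
    if occupied == 0:
        raise ValueError("at least one species must occupy a site")
    return shared / occupied


# Hypothetical presence data: the sites at which each species was recorded.
species_1 = {"site1", "site2", "site3", "site5"}
species_2 = {"site2", "site3", "site4"}

j = jaccard_index(species_1, species_2)
print(f"Jaccard's index = {j:.2f}")
```

Here the two species share two of the five sites occupied by either, giving an index of 0.40; deciding whether such a value is higher or lower than expected at random requires the kind of expected distribution discussed in the text.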

In this context, similarity indices are frequently used to study the coexistence of species or the similarity of sampling sites. A matrix of similarity coefficients, between either species or locations, may be analyzed in two ways: by ordination, i.e., by attempting to arrange the locations or species within a theoretically continuous sequence, or by classification, the aim of which is to place the locations or species in discontinuous groups (McCoy et al., 1986), which may overlap in nonhierarchical classification approaches. The main aim of this type of analysis is to discover distribution patterns common to different species and groups of areas with similar biota (Birks, 1987). However, Simberloff and Connor (1979) stated that most indices of similarity are not associated with probability values because their underlying distributions are unknown, thus preventing high and low levels of association between species from being recognized objectively with regard to what may be expected at random. Only the distributions of the simple matching coefficient (Goodall, 1967),


Syst. Biol. 45(3):380-385, 1996