Copyright © 2001 by the Genetics Society of America

A Novel Method for Estimating Substitution Rate Variation Among Sites in a Large Dataset of Homologous DNA Sequences

Graziano Pesole* and Cecilia Saccone†

*Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, 20133 Milano, Italy and †Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70126 Bari, Italy

Manuscript received July 18, 2000
Accepted for publication October 10, 2000

ABSTRACT

We present here a novel method to estimate the site-specific relative variability in large sets of homologous sequences. It is based on the simple idea that the more closely related are the compared sequences, the higher the probability of observing nucleotide changes at rapidly evolving sites. A simulation study has been carried out to support the reliability of the method, which has been applied also to analyzing the site variability of all available human sequences corresponding to the two hypervariable regions of the mitochondrial D-loop.

Corresponding author: Graziano Pesole, Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, via Celoria 26, 20133 Milano, Italy. E-mail: [email protected]

Genetics 157: 859–865 (February 2001)

The specific variability of different nucleotide sites during evolution is a general property of DNA sequences. In coding sequences, it is well known that the average evolutionary dynamics of the three codon positions are rather different, with the third codon positions evolving much faster than the first and the second codon positions, the latter evolving slower than the former. This rate pattern is expected as most of the changes in the third codon positions are synonymous, i.e., do not change the coded amino acid, whereas all changes in the second codon positions are nonsynonymous and generally introduce drastic changes in the physicochemical properties of the coded amino acid. Furthermore, some rate heterogeneity can be observed within each of the three codon positions depending on the specific functional constraints acting on each site. Thus, first and second positions in codons for amino acids critical for the biochemical activity of the protein are usually totally invariant or strongly conserved. Because of the remarkable rate heterogeneity between synonymous (third codon positions) and nonsynonymous (first and second codon positions) sites, it is advisable to analyze the two groups of sites separately.

In noncoding sequences, usually endowed with regulatory activity, a higher level of rate heterogeneity is observed with respect to coding sequences, with some sites extremely conserved and others highly variable and prone to accept insertion/deletion events.

Reliable estimates of genetic distances between homologous sequences and their phylogenetic relationships can be obtained under the condition that the mathematical model used in the analysis adequately fits the sequence data (Pagel 1999). Indeed, it has been observed that models imposing incorrect a priori conditions on the structure of the rate matrix (e.g., all transitions and transversions equally probable) or that do not take into account the base composition of the sequences at sites under examination may produce erroneous results (Pesole et al. 1995). Furthermore, a realistic description of rate heterogeneity has to be included in the model to obtain correct estimates of genetic divergence and reliable phylogenies (Yang 1996). Indeed, the estimate of the nucleotide substitution rate and the phylogenetic reconstructions may be strongly affected if rate heterogeneity is not taken into account in the mathematical model used in the analysis. Evidence shows that variation in substitution rate among sites can be suitably described by discrete or continuous gamma distribution models (Yang 1996; Yang and Kumar 1996; Gu and Zhang 1997) that satisfactorily fit both mtDNA and nuclear DNA evolution.

Different approaches have been used so far to estimate the substitution rate at each site. A first approach counts the substitution events along a phylogenetic tree estimated by using maximum parsimony (Hasegawa et al. 1993; Wakeley 1993). Methods based on maximum likelihood (ML) have been proposed by Excoffier and Yang (1999) and Meyer et al. (1999). In the method proposed by Excoffier and Yang (1999), after ML estimation of the substitution rate parameters using a previously determined tree, the rate of substitution at individual sites is determined by an empirical Bayes approach (Yang and Wang 1995). A similar method proposed by Meyer et al. (1999) assigns each site to the predefined rate category that yields the highest likelihood value.

A major drawback of all of the above approaches is their dependence on a phylogenetic tree. In particular, in the case of a maximum-parsimony (MP) estimated tree, the longer the branches in the tree, the more underestimated the substitution rate. ML-based approaches add to the predetermination of the tree the a priori determination of the substitution rate parameters. These approaches are computationally very intensive and thus can be conveniently applied only to relatively small datasets; therefore, several randomly selected subsets of the whole dataset are often used in the analysis (Meyer et al. 1999). Furthermore, they require a priori definition of a discrete number of equivalent rate categories.

We propose here a new method that provides the relative rate for each individual site from all pairwise genetic distances calculated between the sequences under examination, without any assumption about either the structure of the tree or the rate categories. It is based on the simple idea that the closer the two compared sequences are, the higher is the probability of observing nucleotide changes at rapidly evolving sites. A similar argument is also the basis of the tree-independent method proposed by Van de Peer and De Wachter (1993), which uses a different approach based on the partitioning of sequence pairs in discrete Jukes-Cantor distance intervals. This method is implemented in TREECON (Van de Peer and De Wachter 1993), a commercial software package for evolutionary analysis. The method we propose here is much simpler and considerably faster than all others, allowing the analysis of a very large sequence dataset with very limited computational work. The reliability of the method has been verified in a simulation study and an application is shown by analyzing all the available sequences of the two hypervariable segments of the control region of the human mtDNA, which represent by far the largest available dataset for intraspecific genetic variability.

More than 100 pathogenic mtDNA base substitution mutations have been identified in a variety of degenerative diseases (Kogelnik et al. 1998), although for most of them the association with the pathological status is not clearly assessed. Knowledge of the site-specific rate is therefore crucial not only to define more realistic models of sequence evolution for phylogenetic inferences; it could also be of great practical relevance for the assessment of the pathological implications of specific nucleotide “mutations.” Indeed, it is unlikely that a nucleotide substitution is associated with a specific mitochondrial disease if it shows an appreciable degree of relative variability.

MATERIALS AND METHODS

The substitution rate at a given sequence position can be calculated ideally by observing that particular position for a suitable amount of time and then measuring the number of changes per unit of time. Of course, this is not possible when dealing with extant sequences that are the product of unknown past evolutionary processes. However, if we assume that a given position has a specific rate that is independent of the particular lineage considered, one could ideally equally calculate its rate by carrying out a suitable number of pairwise comparisons of sequences whose divergence is just one unit of time. Of course the unit of time we define here, dt, has to be sufficiently small to minimize the possibility of multiple substitutions occurring within its time span. In other words, if position i evolves twice as fast as position j, in independent pairwise comparisons of homologous sequences whose divergence time is dt we should expect to observe twice as many nucleotide changes in position i as in position j. When extending this principle to real sequences we can expect that the more variable a given position, the more changes are observed in closely related sequences.

Let us consider a nucleotide multialignment of N sequences and L sites. Following the above reasoning the relative variability of the ith site can be simply given by

$$\nu_i = \sum_{j=1}^{N(N-1)} \frac{\delta_{ij}}{K_j}, \tag{1}$$
where $\delta_{ij} = 1$ or $\delta_{ij} = 0$ depending on whether or not a nucleotide substitution is observed at site i in the jth pairwise comparison, $K_j$ being the overall genetic distance of that comparison. To calculate genetic distances as correctly as possible, we used the stationary Markov model of Saccone et al. (1990), also called “general time reversible” (GTR), which has proved to be rather accurate in different conditions (Zharkikh 1994). However, if the compared sequences are very closely related, the number of observed differences is very close to the actual number of substitutions and thus the estimate of the genetic distance is almost independent of the chosen method. According to Equation 1 the relative contribution of an observed pairwise change in a given position to its measure of variability will be inversely proportional to the corresponding pairwise genetic distance. The absolute value of $\nu_i$ being dependent on the particular dataset considered, its relative value can be defined as

$$\gamma_i = \frac{\nu_i}{\nu_{\max}},$$
where $\gamma_i$ ranges from 0 for invariant sites to 1 when $\nu_i = \nu_{\max}$. The number of compared sequences is critical as it can be reasonably expected that the higher the number of sequences considered, the more reliable the estimate of the relative variability of the individual positions in a multialignment. The SiteVar software that implements the above-described method, written in C language and running under the Unix operating system, is available from the authors upon request.

We tested the reliability of the above-described method through a simulation procedure. To carry out the simulation we used the program Seq-Gen (Rambaut and Grassly 1997) that, given as input a tree describing the phylogenetic relationships between N taxa and assuming a model of sequence evolution, generates through a Monte Carlo simulation a dataset of N sequences. To make the simulation as close as possible to a realistic evolutionary pattern, we used trees we denoted as T50 and T500. These trees describe the phylogenetic relationships of two datasets of 50 and 500 sequences of the Hvr1 mitochondrial D-loop, spanning from position 16,050 to position 16,350 (numeration according to Anderson et al. 1981), extracted from MitBase (Attimonelli et al. 2000). In both datasets, one-quarter of the sequences were from individuals of African origin, the remaining sequences being randomly selected from individuals from all other continental areas. The trees were calculated by applying the neighbor-joining method to pairwise distances calculated using the GTR model (Saccone et al. 1990).

In the Seq-Gen sequence simulations we assumed the GTR model for sequence evolution and used as input trees T50 and T500 to generate datasets of 50 and 500 sequences 1000 nucleotides (nt) long, denoted S50 and S500, respectively. A specific model implemented in the Seq-Gen software, assuming three classes of relative variability, was used to mimic site-specific rate heterogeneity. In this way the simulated sequences contained three classes of sites, each accounting for one-third of the total sequence length, which we denoted as p1, p2, and p3, and evolving at three different relative rates. Two different rate ratios were used in the simulation, namely p1:p2:p3 = 1:2:3 or 1:5:10. Four different average rates of substitution were used in the simulations along T50 and T500 trees, namely 5, 10, 25, and 50%/myr (denoted as R1, R2, R5, and R10), which included the average rate of nucleotide substitution of the D-loop Hvr1 corresponding to ~10%/myr (Saccone et al. 2000).

The BASEML program, implemented in the PAML package (Yang 1997), was also used to estimate site-specific rates of the S50 dataset assuming the T50 tree. The HKY85 + discrete gamma model with eight rate categories was used in BASEML runs.

To test the method on real data we determined site-by-site variability of the two hypervariable regions of the D-loop (Hvr1 and Hvr2). For the Hvr1 region, spanning from position 16,024 to 16,382, we used a dataset of 1308 sequences, whereas for the Hvr2 region, spanning from position 57 to 371, we used a smaller dataset of 458 sequences. In both cases we used all the available sequences in the HVRbase collection (Handt et al. 1998), including different sequence lineages of individuals from all over the world.
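The calculation in Equations 1 and 2 can be illustrated with a minimal Python sketch. It is not the SiteVar program: the function names are hypothetical, the Jukes-Cantor distance is swapped in for the GTR distance used in the paper, the alignment is assumed to contain no gaps or ambiguity codes, and identical pairs (distance 0) are skipped as an assumed convention because they contribute no observed change. The loop visits each unordered pair once; counting ordered pairs instead would double every ν value and leave the relative values γ unchanged.

```python
from itertools import combinations
from math import log


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences (stand-in for GTR)."""
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    p = diffs / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def site_variability(alignment):
    """Return (nu, gamma) per site for a list of equal-length sequences."""
    n_sites = len(alignment[0])
    nu = [0.0] * n_sites
    for seq_a, seq_b in combinations(alignment, 2):
        k = jc_distance(seq_a, seq_b)
        if k == 0.0:
            continue  # assumed convention: identical pairs add nothing
        for i in range(n_sites):
            if seq_a[i] != seq_b[i]:  # delta_ij = 1
                nu[i] += 1.0 / k      # weight the change by 1 / K_j
    nu_max = max(nu)
    gamma = [v / nu_max if nu_max > 0 else 0.0 for v in nu]
    return nu, gamma


# Toy alignment (hypothetical data): site 7 changes in closely related pairs
# more often than site 0, so it receives the larger variability.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTACGC", "TCGTACGT"]
nu, gamma = site_variability(toy_alignment)
```

In the toy alignment the last site differs in five of the six pairs and the first site in three, so the last site takes γ = 1 and the first an intermediate value, while the six sites in between stay at 0.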

RESULTS

Figure 1 shows the results obtained by our model applied to the S50 and S500 sequence datasets, whose evolution was simulated to make sites fit in three different classes of variability and to which four different average rates of nucleotide substitution were applied, denoted as R1, R2, R5, and R10 (see MATERIALS AND METHODS for further details on the simulation). It is remarkable that the model determines quite accurately the relative variability of sites as set up in the simulation, both in the case of a small rate difference (relative variability of sites 1:2:3) and in the case of a large rate difference (relative variability of sites 1:5:10), for both the S50 and S500 datasets. As expected, a higher level of accuracy was obtained with the S500 dataset for all rates, with a slight underestimate at the fastest average rates (i.e., R5 and R10), mainly in the S50 dataset. All in all, the simulation demonstrates the remarkable accuracy of the proposed methodology in the determination of the relative nucleotide substitution rate variation among sites. In contrast, the analysis carried out with the BASEML software (on the S50 dataset only, because of the higher computational needs of this approach) showed, quite surprisingly, that site-specific rates were estimated correctly only in qualitative terms, with p1 < p2 < p3, since the observed rate ratios were remarkably lower than expected (Figure 2).

Figure 1. Relative variability of the sites of simulated S50 and S500 sequence datasets calculated by using the SiteVar software. The simulation, carried out as described in MATERIALS AND METHODS by using the Seq-Gen software (Rambaut and Grassly 1997) along the phylogenetic trees T50 and T500 and imposing four different average substitution rates (i.e., R1, R2, R5, and R10) under the stationary Markov model (Lanave et al. 1984; Saccone et al. 1990; also defined as general time reversible; Yang 1994a), generated sequences 1000 nt long with sites belonging to three classes of rates. Two different rate ratios were used: p1:p2:p3 as 1:2:3 and 1:5:10.

Figure 2. Relative variability of the sites of the simulated S50 sequence dataset as calculated by using the BASEML software (Yang 1997) along the T50 phylogenetic tree. The same four average rates and two different rate ratios as in the SiteVar application in Figure 1 were used.

We then determined the relative variability of sites of the Hvr1 and Hvr2 regions of the human mitochondrial D-loop by analyzing all available sequences from the HVRbase collection (Handt et al. 1998). In total 1308 sequences of the Hvr1 region and 458 sequences of the Hvr2 region were considered, including only those sequenced without ambiguities from position 16,024 to 16,382 in Hvr1 and from position 57 to 371 in Hvr2 (numbering according to Anderson et al. 1981). The plot of the site-specific relative variability is shown in Figure 3 for both Hvr1 (Figure 3a) and Hvr2 (Figure 3b). The gamma parameter estimated from the data in Figure 3 proved to be 0.09 for Hvr1 and 0.05 for Hvr2, denoting for both regions an extreme rate heterogeneity, which was higher in Hvr2. Higher values, but with a similar pattern (i.e., Hvr1, 0.26; Hvr2, 0.13), were determined by Meyer et al. (1999). The presence of an unknown fraction of invariant sites could bias both estimates.

The main noncoding region of vertebrate mtDNA, called the D-loop region, contains the regulatory elements for the replication and the expression of the genome. On the basis of both degree of conservation and base content, the D-loop was divided into three domains, a highly conserved central domain flanked by the two hypervariable regions denoted as extended termination associated sequence (ETAS) and conserved sequence box (CSB) domains (Sbisà et al. 1997). The Hvr1 region, contained in the ETAS domain, corresponds to the 3′ end of the D-loop where the newly synthesized H-strand stops. Of the 359 Hvr1 sites considered, 125 (35%) proved to be invariant, the relative variability of the remaining sites being between 0.001 and 1 (position 16,223). The most variable tract, corresponding to a 136-nt region (positions 16,165–16,300), is present only in some primates and not in other mammalian D-loops and has been defined as “insertion sequence” (IS; Saccone et al. 1991), with an average variability of 0.055 with respect to a global average of 0.036. The Hvr2 region, in the CSB domain, corresponding to the 5′ end of the D-loop, contains the main regulatory elements of the mitochondrial genome: the two promoters (HSP and LSP) and the origin of replication of the H-strand. Of the 315 Hvr2 sites considered, 210 (67%) proved to be invariant, the relative variability of the remaining sites being between 0.003 and 1 (position 146) as against a global average variability of 0.026.

Table 1 shows the occurrence of the different substitution patterns observed in the aligned Hvr1 and Hvr2 sequences and their average relative variability. In both regions the CT and AG transition sites greatly outnumber transversion sites, with CT more frequent than AG patterns and the relative variability of the former higher than that of the latter, particularly in Hvr1. Table 2 lists all hypervariable sites (relative variability > 0.2) found in both Hvr1 and Hvr2. Most of the sites that were hypervariable in Hvr1 were also hypervariable in other studies carried out on this region (Hasegawa et al. 1993; Wakeley 1993; Excoffier and Yang 1999; Meyer et al. 1999) and most of them showed a CT substitution pattern in the multialignment. Only two sites, both in Hvr2, showed the AG pattern, the remaining sites showing three or four different nucleotides.

DISCUSSION

The knowledge of the evolutionary dynamics of a specific nucleotide sequence may greatly contribute to the elucidation of its structure-function relationships. In particular, the level of constraints acting on each individual site is generally correlated to its functional activity. The same information may be also particularly useful in evolutionary studies both for the measurement of nucleotide substitution rate and for the reconstruction of the phylogeny (Lauder et al. 1993; Yang 1994b; Felsenstein and Churchill 1996). Indeed, evolutionary models generally used for evolutionary analyses may provide misleading results if the specific assumptions made by the model (e.g., equal probability of transitions and transversions, rate homogeneity among analyzed sites, etc.) are not fulfilled by real sequence data. The better we know the specific evolutionary dynamics of the set of homologous sequences under examination, the higher is the chance of obtaining reliable inferences through the use of more suitable evolutionary models.

In recent years the study of human genetic variability, in particular of the main noncoding regulatory region of mtDNA (D-loop) for its remarkable level of intraspecific diversity, has produced important breakthroughs in the elucidation of the origin of modern man. The study of mtDNA genetic variability in humans could also contribute to assessing the functional significance of mtDNA mutations, possibly associated with the wide variety of mitochondrial degenerative diseases, aging, and cancer (Wallace 1999). Recently, Michikawa et al. (1999) have found that some point mutations in the human mtDNA control region are not inherited but are associated with aging processes. Of the seven mutation events falling in the Hvr2 segment considered in our analysis, sites 146, 152, and 195 proved to be by far the most variable in the whole region (see Figure 3 and Table 2) and an A deletion was observed at position 249 in some individuals. On the contrary, the mutations T285C, A368G, and the T insertion after site 383 were not observed at all. These findings suggest a different functional relevance for the different mutations with the former, which were

previously described as polymorphisms, representing mutation hotspots. Our data suggest that the nucleotide changes T146C, T152C, and T195C are so fast that they can occur several times during the life span of an individual.

The strong rate heterogeneity of the D-loop hypervariable regions explains the 20-fold difference in the estimate of the mutation rate for the mtDNA D-loop obtained by phylogenetic or family studies (Parsons et al. 1997; Jazin et al. 1998), the latter being much more affected by hotspots than the former. Our data clearly show that the estimate of the average nucleotide substitution rate is not very informative about the evolutionary dynamics of the D-loop region.

Despite the fact that ML models using gamma-distributed rates are considered the most reliable in assessing rate variation among sites, if the assumed gamma model does not comply with the actual evolutionary process, inaccurate results will be obtained. This is particularly true when discrete rate-class models are used, as in our simulation study, where the observed discrepancy between the expected site-specific rates and those inferred by the BASEML software is due to the mismatch between the simulation model and the assumed gamma distribution. Indeed, we do not know a priori the distribution of rate variation among sites for the real datasets under analysis and the continuous-rate models are computationally unfeasible for large datasets.

The simple method we propose, which does not require the knowledge or assumption of a phylogenetic tree or of a substitution model and has proved to be rather reliable in a simulation study, has the great advantage of allowing the analysis of very large sequence datasets. Indeed, this feature is particularly important given the expected rapid growth of sequence data from intraspecific studies aimed both at population genetic analyses and at the study and diagnosis of genetic diseases. For both topics, particularly the latter, a site-by-site elucidation of the evolutionary dynamics can provide very useful information.

This work has been supported by Ministero dell’Università e della Ricerca Scientifica e Tecnologica (MURST), Italy (PRIN project “Bioinformatics and Genomics”), and by programma Biotecnologie legge 95/95 (MURST 5%).

Figure 3. Site-specific variability calculated by using the SiteVar software on the Hvr1 (A) and Hvr2 (B) regions of the human D-loop-containing region (positions 16,024 to 16,382 and 57 to 371, respectively; numbering according to Anderson et al. 1981). The IS region (Saccone et al. 1991) and CSB1-3 are shaded in gray. Arrows point to sites where substitutions associated with aging were found by Michikawa et al. (1999).

TABLE 1
Occurrence of the different nucleotide states at aligned sites of the Hvr1 and Hvr2 D-loop regions with the relevant average variability

Nucleotide states    Hvr1     Hvr1 average            Hvr2     Hvr2 average
in aligned sites     count    relative variability    count    relative variability
AC                   8        0.0080                  4        0.0688
ACG                  9        0.1110                  3        0.1247
ACGT                 15       0.0460                  1        0.2120
ACT                  19       0.0770                  6        0.4195
AG                   55       0.0200                  29       0.0674
AGT                  6        0.0210                  2        0.0235
AT                   8        0.0030                  4        0.0078
CG                   4        0.0040                  3        0.0460
CGT                  12       0.0960                  1        0.0220
CT                   86       0.0830                  38       0.0713
GT                   4        0.0030                  2        0.0050

TABLE 2
Hypervariable sites (relative variability > 0.2, numbering according to Anderson et al. 1981) in the Hvr1 and Hvr2 D-loop regions

Region    Position    Relative variability    Nucleotide states at aligned sites
Hvr1      16,223      1.000                   CT
Hvr1      16,311      0.633                   CT
Hvr1      16,362      0.588                   CGT
Hvr1      16,129      0.496                   ACG
Hvr1      16,278      0.462                   CT
Hvr1      16,294      0.382                   CT
Hvr1      16,126      0.358                   CT
Hvr1      16,172      0.290                   CT
Hvr1      16,298      0.283                   CT
Hvr1      16,093      0.279                   CT
Hvr1      16,319      0.254                   ACG
Hvr1      16,261      0.242                   CT
Hvr1      16,270      0.241                   ACT
Hvr1      16,217      0.225                   CT
Hvr2      146         1.0000                  ACT
Hvr2      152         0.9990                  CT
Hvr2      195         0.8430                  ACT
Hvr2      73          0.5300                  AG
Hvr2      150         0.5100                  ACT
Hvr2      182         0.3010                  CT
Hvr2      263         0.2400                  AG
Hvr2      185         0.2120                  ACGT
Hvr2      189         0.2080                  ACG

The observed nucleotide states at aligned sites are also reported.
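The summaries reported in Table 1 and Table 2 can be derived from an alignment and its per-site relative variabilities along the following lines. This is a minimal sketch under stated assumptions, not the authors' code: the function name, the 0-based site indices, and the toy data are hypothetical, and the alignment is assumed to contain only A, C, G, and T.

```python
from collections import defaultdict


def summarize_sites(alignment, gamma, threshold=0.2):
    """Group variable sites by observed nucleotide states (as in Table 1)
    and list sites whose relative variability exceeds threshold (as in Table 2)."""
    by_pattern = defaultdict(list)
    hypervariable = []
    for i, g in enumerate(gamma):
        # Observed nucleotide states at site i, e.g. "CT" or "ACG".
        states = "".join(sorted({seq[i] for seq in alignment}))
        if len(states) > 1:  # invariant sites are left out of the pattern counts
            by_pattern[states].append(g)
        if g > threshold:
            hypervariable.append((i, g, states))
    # Pattern -> (number of sites, average relative variability).
    pattern_table = {p: (len(v), sum(v) / len(v)) for p, v in by_pattern.items()}
    # Hypervariable sites sorted from most to least variable.
    hyper_table = sorted(hypervariable, key=lambda row: row[1], reverse=True)
    return pattern_table, hyper_table


# Hypothetical toy input: three sequences and made-up relative variabilities.
toy_alignment = ["ACGTA", "ACGCA", "TCGCG"]
toy_gamma = [0.05, 0.0, 0.0, 1.0, 0.3]
pattern_table, hyper_table = summarize_sites(toy_alignment, toy_gamma)
```

The threshold of 0.2 follows the definition of hypervariable sites given in the caption of Table 2; any other cutoff can be passed through the `threshold` argument.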

LITERATURE CITED

Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. de Bruijn, A. R. Coulson et al., 1981 Sequence and organization of the human mitochondrial genome. Nature 290: 457–465.
Attimonelli, M., N. Altamura, R. Benne, A. Brennicke, J. M. Cooper et al., 2000 MitBASE: a comprehensive and integrated mitochondrial DNA database. The present status. Nucleic Acids Res. 28: 148–152.
Excoffier, L., and Z. Yang, 1999 Substitution rate variation among sites in mitochondrial hypervariable region I of humans and chimpanzees. Mol. Biol. Evol. 16: 1357–1368.
Felsenstein, J., and G. A. Churchill, 1996 A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93–104.
Gu, X., and J. Zhang, 1997 A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14: 1106–1113.
Handt, O., S. Meyer and A. von Haeseler, 1998 Compilation of human mtDNA control region sequences. Nucleic Acids Res. 26: 126–129.
Hasegawa, M., A. Di Rienzo, T. D. Kocher and A. C. Wilson, 1993 Toward a more accurate time scale for the human mitochondrial DNA tree. J. Mol. Evol. 37: 347–354.
Jazin, E., H. Soodyall, P. Jalonen, E. Lindholm, M. Stoneking et al., 1998 Mitochondrial mutation rate revisited: hot spots and polymorphism. Nat. Genet. 18: 109–110.
Kogelnik, A. M., M. T. Lott, M. D. Brown, S. B. Navathe and D. C. Wallace, 1998 MITOMAP: a human mitochondrial genome database--1998 update. Nucleic Acids Res. 26: 112–115.
Lanave, C., G. Preparata, C. Saccone and G. Serio, 1984 A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86–93.
Lauder, I. J., H. J. Lin, J. Y. Lau, T. S. Siu and C. L. Lai, 1993 The variability of the hepatitis B virus genome: statistical analysis and biological implications. Mol. Biol. Evol. 10: 457–470.
Meyer, S., G. Weiss and A. von Haeseler, 1999 Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 152: 1103–1110.
Michikawa, Y., F. Mazzucchelli, N. Bresolin, G. Scarlato and G. Attardi, 1999 Aging-dependent large accumulation of point mutations in the human mtDNA control region for replication. Science 286: 774–779.
Pagel, M., 1999 Inferring the historical patterns of biological evolution. Nature 401: 877–884.
Parsons, T. J., D. S. Muniec, K. Sullivan, N. Woodyatt, R. Alliston-Greiner et al., 1997 A high observed substitution rate in the human mitochondrial DNA control region. Nat. Genet. 15: 363–368.
Pesole, G., G. Dellisanti, G. Preparata and C. Saccone, 1995 The importance of base composition in the correct assessment of genetic distances. J. Mol. Evol. 41: 1124–1127.
Rambaut, A., and N. C. Grassly, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13: 235–238.
Saccone, C., C. Lanave, G. Pesole and G. Preparata, 1990 Influence of base composition on quantitative estimates of gene evolution. Methods Enzymol. 183: 570–583.
Saccone, C., G. Pesole and E. Sbisà, 1991 The main regulatory region of mammalian mitochondrial DNA: structure-function model and evolutionary pattern. J. Mol. Evol. 33: 83–91.
Saccone, C., M. Attimonelli, C. Lanave, G. Pesole and E. Sbisà, 2000 Mitochondrial DNA and human diversity: a detailed analysis of the D-loop region, pp. 69–78 in The Origin of Humankind, edited by M. Aloisi et al. IOS Press, Venice.
Sbisà, E., F. Tanzariello, A. Reyes, G. Pesole and C. Saccone, 1997 Mammalian mitochondrial D-loop region structural analysis: identification of new conserved sequences and their functional and evolutionary implications. Gene 205: 125–140.
Van de Peer, Y., and R. De Wachter, 1993 TREECON: a software package for the construction and drawing of evolutionary trees. Comput. Appl. Biosci. 9: 177–182.
Wakeley, J., 1993 Substitution rate variation among sites in hypervariable region I of human mitochondrial DNA. J. Mol. Evol. 37: 613–623.
Wallace, D. C., 1999 Mitochondrial diseases in man and mouse. Science 283: 1482–1488.
Yang, Z., 1994a Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39: 105–111.
Yang, Z., 1994b Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39: 306–314.
Yang, Z., 1996 Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 9: 367–372.
Yang, Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.
Yang, Z., and S. Kumar, 1996 Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites. Mol. Biol. Evol. 13: 650–659.
Yang, Z., and T. Wang, 1995 Mixed model analysis of DNA sequence evolution. Biometrics 51: 552–561.
Zharkikh, A., 1994 Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39: 315–329.

Communicating editor: F. Tajima