Geographic Variation in Human Mitochondrial DNA Control Region Sequence: The Population History of Turkey and its Relationship to the European Populations

David Comas, Francesc Calafell,1 Eva Mateu,* Anna Pérez-Lezaun, and Jaume Bertranpetit

Laboratori d’Antropologia, Facultat de Biologia, Universitat de Barcelona, Catalonia, Spain; and *Institut de Salut Pública de Catalunya

The hypervariable segment I of the control region of the mtDNA (positions 16024-16383) was amplified from hair roots by PCR and sequenced in 45 unrelated individuals from Anatolia (Asian Turkey). Forty different sequences were found, defined by 56 variable positions, of which only one involves a transversion. The neighbor-joining tree of Kimura’s distance matrix for all sequences shows four main clusters. Cluster D was found to be the most statistically robust of the four, and all the sequences in it shared a mutation that is present only in European and West Asian populations. The variability in cluster D could have originated between 37,000 and 107,000 years ago. No branch is unexpectedly long, denoting the absence of sequences that diverged much before the others. The pairwise difference distribution is bell-shaped, in accordance with a population expansion occurring roughly 35,000 to 100,000 years ago. When compared to other Caucasoid populations through the pairwise difference distribution, there is a pattern from the Middle East (older expansion) to the various European populations, with Turkey in an intermediate position; when Turkish sequences are compared through a neighbor-joining tree on a genetic distance matrix of populations, this position is again evidenced. Although there is a very low level of genetic divergence among Caucasoid populations as shown by mtDNA control region sequences, a geographic pattern of genetic variation emerges, denoting a stepping-stone position of Turkey between the Middle East and Europe, which is in
agreement with the hypothesis of a replacement of Neanderthals by modern humans, which could be related to the Upper Paleolithic cultural expansion.

1 Present address: Department of Genetics, Yale University School of Medicine.

Key words: mtDNA, control region sequence, Turkey, Europe, neighbor-joining tree, clusters, pairwise difference distribution.

Address for correspondence and reprints: Jaume Bertranpetit, Facultat de Biologia (Antropologia), Diagonal 645, 08028 Barcelona, Spain. E-mail: [email protected].

Mol. Biol. Evol. 13(8):1067-1077. 1996 © 1996 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Introduction

The current analysis of genetic variation is helping to unravel numerous evolutionary questions; when we focus on humans, a single species, the time scale is small and the evolutionary framework becomes a historical framework. In this sense, human population genetics is helping to reconstruct population history, be it in a very wide scope (the origins of the human species) or at a narrow geographic scale. The study by Cavalli-Sforza, Menozzi, and Piazza (1994) is an excellent synthesis of this field with the use of classical genetic markers. Genetic interpretation of the past can, thereafter, be compared to other disciplines traditionally devoted to the human past, like archaeology or linguistics (Renfrew 1992), in the attempt to reconstruct the unique history of human populations, with their genes, languages, and cultures.

Mitochondrial DNA (mtDNA) has been widely used in human population genetics due to its particular properties: maternal inheritance, absence of recombination, and high mutation rates. Among the possible analyses of mtDNA, those concerning the sequence of the control region are highly informative (Vigilant et al. 1989; Di Rienzo and Wilson 1991; Ward et al. 1991, 1993; Santos, Ward, and Barrantes 1994; Bertranpetit et al. 1995). Worldwide sequence analysis of mtDNA control region supports the hypothesis of a recent African
origin of all modern humans (Vigilant et al. 1989), although there is still much debate on the time and place of the ancestral populations (Templeton 1992). In Europe and surrounding areas (mainly western Asia), the main issue in the genesis of modern populations is the possible replacement of the Neanderthals by modern humans, as postulated through morphology (Stringer 1989), and the possible relationship of this replacement to the cultural innovation that occurred at the passage from the Middle Paleolithic cultures (associated with Neanderthals) to the Upper Paleolithic (associated with modern humans). Both the morphological analysis and the archaeological record tend to place the early presence of modern populations in the Middle East and postulate an expansion from there into Europe (Stringer 1989; Mellars 1993, among others). It is in this context that it is crucial to test the possible role of Asian Turkey (Anatolia) as a genetic bridge in the expansion from western Asia toward Europe of the populations that may have been the ancestors of present-day Europeans.

Other more recent expansions and invasions are known to have affected Turkey. The European Neolithic originated in western Asia, in the area known as the Fertile Crescent, and Anatolia was the bridge from which the new subsistence pattern spread to Europe (Ammerman and Cavalli-Sforza 1984, pp. 24-33). Much more recently, in the 11th century A.D., Turkic nomadic people occupied the grassland in the interior of Asia Minor, imposing a language of the Turkic group by an elite dominance process (Renfrew 1987, pp. 131-133), of little genetic consequence for the whole population.

The present study analyzes the hypervariable region I of the control region of mtDNA, the most variable region in the highly variable mtDNA, in a sample of individuals from Asian Turkey in an attempt to understand the role of Turkey in the history and the making of European populations. mtDNA analysis can be used to evaluate several scenarios for the genesis of the Turkish population, which may have broader implications in the genesis of European populations.

1068 Comas et al.

[Figure 1: map of Anatolia; labels: MEDITERRANEAN SEA, SYRIA, IRAQ]
FIG. 1.-Map of Anatolia. Dots indicate maternal birthplaces of individuals included in the present sample.

Materials and Methods
Population Sampling

A 360-nt sequence in region I of the mtDNA D-loop from 45 unrelated individuals was analyzed. The sample comprised Turkish-speaking individuals from rural villages scattered throughout Anatolia (fig. 1). Sampling in large cities such as Istanbul or Ankara was avoided to ensure the autochthony of the sample. In each case, the mother’s birthplace was recorded as the origin of the individual.

Sample Collection and DNA Extraction

Hairs with their roots were plucked with sterile gloves and stored in a vial with 95% ethanol. One root of each sample was placed in a 1.5-ml sterile microfuge tube containing 0.5 ml of extraction buffer (10 mM Tris, pH 8.0; 10 mM EDTA, pH 8.0; 100 mM NaCl; 2% SDS; 39 mM DTT; and 20 µg/ml proteinase K), and was then incubated at 37°C and shaken at 180-200 rpm for at least 3 h. After a phenol-chloroform extraction, the DNA was concentrated in Centricon- tubes and stored at -20°C.

mtDNA Amplification

Amplification was performed using 5-20 µl of the sample in a 50-µl reaction volume; the temperature profile for 30 cycles of amplification was 94°C for 1 min, 58°C for 1 min, and 72°C for 1 min. The primers used in this reaction, L15929 (5’-CACCAGTCTTGTAAACCGGA-3’), which was specifically designed, and H16498 (5’-CCTGAAGTAGGAACCAGATG-3’) (Ward et al. 1991), amplified a segment of 608 base pairs (bp) containing the 360-bp hypervariable region, which was later sequenced.

mtDNA Sequencing

Of the 45 samples, 23 were sequenced with an automatic sequencer, while the rest were sequenced manually; the choice of the method depended only on sequencer availability. Some fragments of several individuals were sequenced with both methods, and the same results were found in all cases.

In automatic sequencing, H16498 was a 5’ biotinylated primer that allowed us to separate the strands of the amplified DNA with magnetic beads attached to streptavidin (Dynabeads M-280 Streptavidin). The sequencing reaction was performed separately on the two strands resulting from the magnetic bead reaction with the Autoread Fluorescent Sequencing Kit (Pharmacia); the fluorescent sequencing primers used were L15997 (5’-CACCATTAGCACCCAAAGCT-3’) (Ward et al. 1991) and H16401 (5’-TGATTTCACGGAGGATGGTG-3’) (Vigilant et al. 1989), both with the fluorescein molecule attached to the 5’ end. The products of the sequencing reaction were run in an A.L.F. Automated Sequencer (Pharmacia), and the sequences were aligned with the ESEE computer program (Cabot 1988).

For manual sequencing, the product of the amplification was purified with GeneClean (BIO 101). Seven microliters of the purified amplified product was used for sequencing with Sequenase Version 2.0 (USB) following the supplier’s recommendations, except that the annealing step was performed by boiling the annealing reaction mixture for 3 min in the presence of Nonidet P-40, followed by a short time in a dry-ice ethanol bath. Both strands were sequenced using primers L15997 and H16412 (5’-GTGCGGGATATTGATTTCAC-3’). Primer H16412 was specifically designed. Reaction products were separated by electrophoresis, dried, fixed, and subjected to autoradiography.

Computer Analysis

The final information about each individual was a string of 360 characters, belonging to hypervariable region I, for base positions 16024-16383. For most calculations, standard packages, especially PHYLIP 3.5c (Felsenstein 1989), were used; some programs were written specifically. Sequence trees were built using the DNADIST program in the PHYLIP package.
The distance used is based on Kimura’s two-parameter model with the transition-to-transversion ratio set to 15:1 according to Tamura and Nei (1993). Other values of this parameter produced nearly identical trees. A neighbor-joining tree (Saitou and Nei 1987) was constructed from the distance matrix. Alternatively, the Tamura and Nei (1993) distance, which assumes gamma-distributed mutation rates for nucleotide positions, was computed by means of the MEGA package (Kumar, Tamura, and Nei 1994). Parsimony trees were generated using DNAPARS, from PHYLIP 3.5c, in repeated runs with the “Jumble” option (random sequence input order). One thousand putatively most-parsimonious trees were produced this way, and their strict consensus was derived using CONSENSE from the PHYLIP package. Pairwise difference distributions (mismatch distributions) were computed. From them, the $\tau$ and $\theta$ parameters from the Harpending et al. (1993) two-parameter model were derived. Standard errors were computed
Human mtDNA Sequence Variation in Turkey
from 1,000 bootstrap iterations; resampled sequences were produced by sampling sites with replacement.

Data from four other populations were used for comparison: Sardinians (Di Rienzo and Wilson 1991), Basques (Bertranpetit et al. 1995), British (Piercy et al. 1993), and Middle Easterners (Di Rienzo and Wilson 1991). The first three are European and all four, as well as the Turks studied here, are Caucasoid. The relation between them was studied through an intermatch-mismatch distance, $D = d_{ij} - (d_{ii} + d_{jj})/2$, where $d_{ij}$ is the raw mean nucleotide pairwise difference between populations $i$ and $j$, and $d_{ii}$ and $d_{jj}$ are, respectively, the raw mean nucleotide pairwise differences within populations $i$ and $j$. This expression was defined by Rao (1982), although he termed it the Jensen difference (see also Nei 1987, p. 276). This particular genetic distance is related to pairwise difference distributions, which have been studied and modeled intensely (Rogers and Harpending 1992; Harpending et al. 1993); besides, it is highly correlated to other genetic distances based on DNA sequences, such as Nei and Miller’s (Nei and Miller 1990; Francalacci et al. 1996). Standard errors for the genetic distance were estimated by bootstrap (Efron 1982). Neighbor-joining trees were built from the distance matrix, and bootstrap was also used to estimate the robustness of the clusters found (Felsenstein 1985).

As there is no agreement on the estimation of the mutation rate, a wide range of estimates has been considered, from $1.44 \times 10^{-6}$ (Vigilant et al. 1991) to $4.14 \times 10^{-6}$ (Ward et al. 1991). A generation time of 20 years was assumed.

Results

Sequence Diversity

The complete sequence of a 360-bp segment of the control region (positions 16024-16383; Anderson et al. 1981) was determined for 45 individuals. Sequence comparison identified 40 different sequences defined by 56 variable positions, shown in figure 2.
Nine of the variable sites (16071, 16086, 16167, 16214,16215,16216, 16243, 16269 and 16288) had not been described as such in the population set used for comparison, composed of 256 individuals and 183 different sequences of Caucasoid origin (Sardinians, Basques, British, and Middle Easterners). Some of the 40 sequences had also been found in other Caucasoid populations: five in Sardinians, five in Basques, eight in British, and three in Middle Easterners. Two individuals presented the reference sequence (Anderson et al. 1981), which is the most frequent sequence among Caucasoids. No sequence was shared with available non-Caucasoid sequences (Africans, Asians, and New Guineans from Vigilant et al. [1991]; Amerindians from Ward et al. [1991] and Torroni et al. [1993a, 1993b]). The variable positions among the Turkish sequences were mostly transitions: 55 out of 56, with only one A-C transversion. Among transitions, there is a high proportion of C-T substitutions (43 positions, found 112 times across all sequences) compared to A-G (12 positions, found 19 times). Taking into account base
composition, the excess of pyrimidine changes is highly significant ($\chi^2$ = 10.06, df = 1, P = 0.001). Nonetheless, the observed proportion is not significantly different from the value observed in a large worldwide sequence set (Wakeley 1993), in which transitions between pyrimidines were also nearly three times as abundant as transitions between purines ($\chi^2$ = 2.20, df = 1, P = 0.138). There appears to be a population-independent bias. A few sites show high levels of polymorphism, like position 16223, with a T in 13 sequences (32.5%) and a C in 27 (67.5%); or position 16126, with a C in 9 sequences (22.5%) and a T in 31 (77.5%).

Neighbor-Joining Tree of Sequences

In order to examine sequence variation patterns, genetic distances between sequences in the sample were computed using Kimura’s two-parameter model. The corresponding neighbor-joining tree (Saitou and Nei 1987) is shown in figure 3. The Tamura and Nei (1993) distance produced essentially the same topology. Despite the complexity of the tree, four clusters (A to D) become apparent. TUK43 and TUK58, the only sequences presenting a transversion, share all other variable positions except 16223 and 16261; thus, they form an isolated branch in the tree. TUK22 and TUK59, which bear the reference sequence, are found near the center of the tree because (1) it is the sequence from which the minimum number of substitutions is needed to produce all other sequences and (2) for every site, the nucleotide in the reference sequence is the most frequent. They are at the base of cluster A. Cluster B stems from TUK7, which differs from the reference by a T-to-C transition at site 16311. This change was also found in two other sequences (TUK5 and TUK76) that were placed in other clusters because they shared other mutations with the sequences in these other clusters. This fact, as explained below, undermines the robustness of the tree as measured by bootstrap.
Nonetheless, the clinal variation of the frequency of the C substitution at 16311 strengthens its phylogenetic value. Thus, the presence of a C at 16311 shows a gradient from Africa to Europe and Asia, while it was not found in Amerindians. Cluster C springs from TUK63, which differs from the reference sequence by a T at 16223, also found in two sequences of other clusters. The proportion of T’s at position 16223 displays a clear gradient, with very high frequencies among Africans, from more than 90% in most populations to fixation in the Hadza (Vigilant et al. 1991); in Asia, including India (Mountain et al. 1995), values are in the mid range, while they are much lower in Caucasoids: 8% in British, 7% in Sardinians, and 4% in Basques. Results in Turks are intermediate (29%). Finally, all sequences in cluster D present a C at 16126; no other sequence showed this change. A putative ancestral sequence presenting only this mutation was not found in our sample, but was recovered in an individual from Tuscany (Central Italy; Francalacci et al. 1996). This mutation also shows a marked geographical frequency pattern: a C at 16126 was found in individuals from every Caucasoid sample analyzed, with a maximum frequency of 47.6% in Middle Easterners, while it is absent in East Asians, Amerindians, New Guineans, and Africans (save for 5 out of 14 Yorubas; Vigilant et al. 1991). Within Caucasoid populations, the frequency of this mutation declines from east to west, with values from 26.7% in this Turkish sample to 6.7% in Basques. The transition to T at 16069, which defines two clear-cut subgroups, is also Caucasoid-specific and was found with a frequency of 19.1% in Middle Easterners and of 15.6% in Turks, and in lower frequencies (4% to 12%) in Basques, Sardinians, and British. This intermediate position of Turkey is found throughout our analyses.

[Figure 2: table of polymorphic sites for the Turkish sequences, grouped in clusters A to D and compared with the reference sequence AND. The per-sequence rows are not recoverable from this copy. The 56 variable positions heading the columns are: 16042, 16067, 16069, 16071, 16086, 16093, 16126, 16129, 16145, 16148, 16162, 16163, 16167, 16168, 16172, 16183, 16186, 16189, 16193, 16214, 16215, 16216, 16223, 16224, 16231, 16234, 16243, 16245, 16249, 16256, 16261, 16266, 16269, 16270, 16278, 16284, 16287, 16288, 16291, 16292, 16293, 16294, 16296, 16298, 16304, 16311, 16319, 16324, 16325, 16327, 16343, 16355, 16356, 16357, 16362, 16368.]
FIG. 2.-Polymorphic sites of the sequences of the hypervariable segment I of the control region found in 45 Turkish individuals. Sequences and base positions are given in comparison to the reference sequence AND (Anderson et al. 1981). Only nucleotides differing from the reference are shown; dots indicate identity with AND. Sequences are grouped according to the cluster they belong to (see fig. 3). The 45 Turkish sequences have been submitted to GenBank under the accession numbers U59009-U59053.
The relative ages of the different clusters (table 1) can be ascertained through the average number of nucleotide differences between every sequence and the cluster stem (i.e., the putative ancestor of the sequences in a cluster). Ages were found to be similar among clusters, in a bracket between around 30,000 and 120,000 years ago, depending on the mutation rate estimates used (see table 1) and a generation time of 20 years. In order to test for differences in internal diversity (and, thus, in age), every pair of clusters was compared with the Takezaki two-cluster test (Takezaki, Rzhetsky, and Nei 1995). All six pairwise tests were nonsignificant (table 1). Therefore, none of the four clusters is significantly older or younger.
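As a rough illustration of how such ages relate to the quantities in table 1, the sketch below assumes that a cluster's age is its mean per-site divergence from the stem divided by the mutation rate, converted to years with the 20-year generation time. The function and parameter names are illustrative and are not taken from the original analysis, and this simple formula matches the tabulated values only approximately.

```python
def divergence_time_years(mean_substitutions, seq_length,
                          mu_per_site_per_generation, generation_years=20):
    """Approximate cluster age in years (assumed formula, not the authors' code)."""
    if seq_length <= 0 or mu_per_site_per_generation <= 0:
        raise ValueError("seq_length and mutation rate must be positive")
    # Mean substitutions per site between the sequences and the cluster stem.
    divergence_per_site = mean_substitutions / seq_length
    # Generations needed to accumulate that divergence at the given rate.
    generations = divergence_per_site / mu_per_site_per_generation
    return generations * generation_years


# Illustrative call with the cluster C values from table 1:
# 3.00 substitutions over 360 sites, highest mutation rate estimate.
age_c = divergence_time_years(3.00, 360, 4.14e-6)
```

Because the mutation rate enters as a simple divisor, the roughly threefold spread between the rate estimates translates directly into the roughly threefold spread of ages quoted in the text.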
FIG. 3.-Neighbor-joining tree of sequences. Distances were based on a Kimura two-parameter model (transition : transversion ratio set to 15:1). Boxes: main clusters. Asterisks denote sequences found in more than one (two or three) individual. Arrows point to each cluster stem, which was not found for cluster D.
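For orientation, the Kimura two-parameter distance underlying figure 3 can be sketched in its standard closed form, $d = -\frac{1}{2}\ln[(1 - 2P - Q)\sqrt{1 - 2Q}]$, where $P$ and $Q$ are the observed proportions of transitions and transversions. This is an illustrative sketch only: PHYLIP's DNADIST, used in the paper with a fixed 15:1 ratio, estimates the distance differently, and the function name below is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Closed-form Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument not positive).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

With only one transversion among the 56 variable positions, the distances in this data set are driven almost entirely by the transition term.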
The presence of single outlier sequences, which could have accrued mutations since much earlier times, was also tested. If substitutions have accumulated randomly since a common origin in time, their number is expected to follow a Poisson distribution (Hudson 1990). The Kolmogorov-Smirnov goodness-of-fit test, which is especially sensitive to single outliers, failed to show significant departure from the Poisson distribution in the overall tree (Z = 0.354, P = 0.722). None of the sequences can be considered significantly apart from the rest; the implications of this finding will be discussed below.

Several lines of evidence suggest that the differentiation in the four clusters predates the colonization of Europe. First, all four cluster stems were found in at least one European population (Basque, Sardinian, Tuscan, or British) and often also in the Middle East. Second, the distribution across clusters of the sequences that these populations share with the Turks appears random ($\chi^2$ = 5.58, df = 3, P = 0.134). And, finally, in another study (unpublished data), we have been able to show that, in a tree comprising 483 individuals from nine European and West Asian populations, clusters A to D were readily identifiable.

Table 1
Cluster Parameters and Divergence Times Computed from Different Mutation Rate Estimates

                                                              CLUSTER
                                                  A        B        C        D     TOTAL^a
Number of sequences                              11        7       11        9       38
Relative number of sequences (%)               27.5     17.5     27.5     22.5     95.0
Mean number of substitutions^b                 2.01     2.17     3.00     2.78     3.16
Standard deviation                             1.41     0.75     1.89     1.09     0.97
Sequence divergence (%)                        0.56     0.60     0.83     0.77     0.88
Divergence time (years) (µ = 4.14 × 10⁻⁶)^c  26,850   29,130   40,270   37,310   42,450
Divergence time (years) (µ = 2.16 × 10⁻⁶)^d  51,380   55,750   77,070   71,420   81,260
Divergence time (years) (µ = 1.44 × 10⁻⁶)^d  77,300   83,860  115,940  107,440  122,240

NOTE.-Divergence times were compared pairwise through the Takezaki two-cluster test (Takezaki, Rzhetsky, and Nei 1995). A and B, Z = 0.302, P = 0.764; A and C, Z = 0.680, P = 0.496; A and D, Z = 0.316, P = 0.756; B and C, Z = 0.416, P = 0.674; B and D, Z = 0.409, P = 0.682; C and D, Z = 0.008, P = 0.999.
a The total does not include sequences TUK58 and TUK43.
b Numbers of substitutions were computed from the putative sequence stem (see fig. 3), even in the case of cluster D, whose stem was not actually found in our sample.
c Mutation rates (µ) are given per nucleotide and generation. A generation time of 20 years is assumed. Source: Ward et al. (1991).
d Source: Vigilant et al. (1991).

FIG. 4.-Maximum-parsimony tree linking sequences in our sample. Letters (A to D) indicate clusters mostly correlated with those defined in figure 2. Inset: strict consensus between 1,000 most-parsimonious trees; it shows all sequences springing from a common node, save for cluster D, which remains intact.

The robustness of the neighbor-joining sequence tree was tested through 1,000 bootstrap iterations. As expected, the exact topology is difficult to achieve in the resampling since some sequences incorporate substitutions that define different clusters and thus could belong to any of them. That is the reason why cluster C was found in only 45 out of 1,000 bootstrapped trees, A in 33, and B in 7. Nonetheless, cluster D was found in 256 out of 1,000 bootstrapped trees, denoting a much more robust structure. Despite the ambiguous position of some sequences and, thus, the low bootstrap values, the neighbor-joining tree of sequences gives, as has been seen, interesting information on the phylogenetic structure of the genetic variation.

Maximum-Parsimony Analysis

A maximum-parsimony tree (fig. 4) shows results very close to those achieved by the neighbor-joining tree on the Kimura distance matrix. Clusters C and D are recovered, while A and B merge in a single cluster. Tree length is 76 steps. A strict consensus tree (inset of fig. 4) was obtained from 1,000 putatively most-parsimonious trees, and shows most sequences springing from the reference sequence, while cluster D is preserved. Cluster D appears to be the most robust, both by bootstrap analysis on the neighbor-joining tree and in the consensus of 1,000 most-parsimonious trees.
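Bootstrap resampling of alignment sites recurs throughout the analyses: in the robustness of the sequence tree, in the error bars of the pairwise difference distribution, and in the standard errors of the between-population distance $D = d_{ij} - (d_{ii} + d_{jj})/2$ defined in Materials and Methods. The following is a minimal sketch of that last use, assuming equal-length aligned sequences held as Python strings; all function names and the toy data in the test are hypothetical, and it is not the authors' program.

```python
import random
from itertools import combinations, product
from statistics import mean, stdev


def count_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


def mean_within(seqs):
    """Mean pairwise difference within one population (needs >= 2 sequences)."""
    return mean(count_differences(a, b) for a, b in combinations(seqs, 2))


def mean_between(seqs_i, seqs_j):
    """Mean pairwise difference between two populations."""
    return mean(count_differences(a, b) for a, b in product(seqs_i, seqs_j))


def intermatch_distance(seqs_i, seqs_j):
    """D = d_ij - (d_ii + d_jj) / 2."""
    within = (mean_within(seqs_i) + mean_within(seqs_j)) / 2
    return mean_between(seqs_i, seqs_j) - within


def bootstrap_standard_error(seqs_i, seqs_j, iterations=1000, seed=0):
    """Standard error of D from resampling sites with replacement."""
    rng = random.Random(seed)
    n_sites = len(seqs_i[0])
    estimates = []
    for _ in range(iterations):
        # Draw the same resampled columns for every sequence.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled_i = ["".join(s[c] for c in columns) for s in seqs_i]
        resampled_j = ["".join(s[c] for c in columns) for s in seqs_j]
        estimates.append(intermatch_distance(resampled_i, resampled_j))
    return stdev(estimates)
```

Resampling columns rather than individuals keeps each sequence intact as a haplotype while varying which sites contribute, which is what "sampling sites with replacement" describes.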
[Figure 5: panels a and b; x-axis: pairwise differences; curves in panel b: observed, Rogers & Harpending, Poisson]
FIG. 5.-a, Nucleotide pairwise difference distribution. Error bars derived from 1,000 bootstrap iterations. b, Observed pairwise difference distribution and its fit to Poisson and Rogers and Harpending (1992) models.
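The distribution plotted in figure 5 is the frequency of each number of nucleotide differences over all pairs of sequences, and an expansion date follows from the relation $\tau = 2\mu l t$ of the Rogers and Harpending (1992) model. The sketch below is illustrative only: the function names are hypothetical, sequences are assumed to be equal-length aligned strings, and the date conversion is not meant to reproduce table 2 exactly.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Relative frequency of each pairwise difference count."""
    counts = Counter(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}


def expansion_time_years(tau, mu_per_site_per_generation, seq_length,
                         generation_years=20):
    """Solve tau = 2 * mu * l * t for t (generations) and convert to years."""
    generations = tau / (2 * mu_per_site_per_generation * seq_length)
    return generations * generation_years


# Illustrative call: tau = 5.38 over 360 sites at the highest rate estimate.
years_high_rate = expansion_time_years(5.38, 4.14e-6, 360)
```

Since $t$ is inversely proportional to $\mu$, comparing the positions of the peaks across populations is less sensitive to the choice of mutation rate than the absolute dates are.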
Pairwise Difference Distributions The analysis of the distribution of the number of nucleotide differences between all pairs of sequences from a given population is of special interest for inferring the history of the population. Ancient episodes of sudden expansion generate bell-shaped pairwise difference distributions, with a peak at T (time after expansion in mutational units) differences, which increase with time, whereas populations that remain stationary have irregular, multimodal distributions (Rogers and Harpending 1992). The pairwise difference distribution for
the Turkish population (fig. 5~2) is clearly bell-shaped, with a peak at five differences, and it fits the Rogers and Harpending model (fig. 5b). This empirical distribution is robust, as shown by the small errors of the different values as estimated by 1,000 bootstrap iterations (fig. 5~). The distribution observed shows a variance (5.10) close to the mean (5.38), and also fits a Poisson distribution (fig. 5b), as predicted by Slatkin and Hudson (1991). From the observed distribution of pairwise differences, it is possible to estimate the parameters of the
1074 Comas et al.
Table 2 Expansion Times (in Years) Computed for Different Populations from 7 in the Rogers and Harpending (1992) Model for Nucleotide Pairwise Difference Distributions MUTATION RATE 4.14 x
Turks’ . . . . . . . . . . . . . . Middle Easterners“ . . Sardiniansd . . . . . . . . British” ......... Basques’ . . . . . . . . . .
W6” 2.16 X 10-6b 1.44 x 10-6b 65,586 98,516 34,219 90,278 135,605 47,101 78,234 27,174 52,083 20,042 14,647
57,700 42,619
38,413 28,074
= Ward et al. (1991). b Vigilant et al. (1991). c Present study. d Di Rienzo and Wilson (1991). c Piercy et al. (1993). f Bertranpetit et al. (1995).
Table 3 Genetic Distance Between Populations Based on D-Loop Segment I Sequences (Below Diagonal), Distance Standard Error Computed from 1,000 Bootstrap Iterations (Above Diagonal), and Mean Nucleotide Pairwise Differences (Diagonal, Bold)
Turks ......... Basques ........ British . . . . . . . . Sardinians ..... Middle Easterners . .
TUK
BAS
BRI
SAR
MEA
5.378 0.194 0.079 0.110 0.125
0.069 3.236 0.090 0.094 0.438
0.041 0.020 4.353 0.053 0.242
0.045 0.021 0.012 4.223 0.246
0.05 1 0.017 0.087 0.097 7.078
p TUK, Turks; BAS, Basques; BRI, British; SAR, Sardinians; MEA, Middle Easterners.
theoretical model proposed by Rogers and Harpending (1992) or its simplified version (Harpending et al. 1993). In particular, 7 = 2pZt, where p, is the mean mutation rate per nucleotide, 1 is the sequence length, and t is the number of generations elapsed after the expansion episode; the value obtained (from the two-paramter version; Harpending et al. 1993) is 7 = 5.38 2 0.14 (standard error computed from 1,000 bootstrap iterations). Several mutation rate estimates have been used to calculate t (table 2), which, for a generation time of 20 years, yields dates from roughly 35,000 years for the highest mutation rate to close to 100,000 for the lowest. It should be noted that the error in the estimation of 7 is much smaller than the error of the mutation rate and does not substantially alter the estimates of expansion times. These are crude estimates, but they may help to rule out some hypotheses, as will be seen in the Discussion. The comparison of the distribution of pairwise differences with those from other populations in Europe and the Middle East (fig. 6) may be more meaningful and independent of the mutation rate. The European populations show peaks at the left side of the graph (that
is, for low values of pairwise differences), while the peak for Turkey appears in an intermediate position, between populations in Europe and the Middle East. This pattern can also be seen, even more clearly, from the mean pairwise differences in each population (diagonal values on table 3), with the highest values for the Middle East. The relative position of the distributions indicates, according to the model, a temporal gradient in the expansion time of the populations: the Middle East population expansion would have been the most ancient, the European would have been the most recent, and the Anatolian expansion would have happened some time in between. Population Trees Genetic differences were computed between the Turkish population and those used for comparison through the mismatch-intermatch distance described in Materials and Methods. The standard error of this genetic distance was estimated by 1,000 bootstrap iterations (table 3). A neighbor-joining tree was built from the distance matrix, and its robustness was assessed through 1,000 bootstrap replicates; a consensus tree was
FIG. 6.-Nucleotide pairwise difference distributions for five European and west Asian populations (Turks, British, Sardinians, Basques, and Middle East). Horizontal axis: number of differences; vertical axis: frequency.
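As a rough illustration of the quantities behind figure 6, the sketch below counts nucleotide differences between every pair of aligned sequences, turns the counts into a relative-frequency distribution, and inverts the relation τ = 2μlt to convert τ into years. It is not the authors' software. The toy sequences are invented, the function names are hypothetical, and the mutation rate of 2.0e-6 per site per generation is a placeholder rather than one of the rates in table 2. The sequence length of 360 follows from the positions sequenced (16024-16383), and τ = 5.38 and the 20-year generation time are the values given in the text.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(seqs):
    """Number of differing sites for every pair of aligned sequences."""
    diffs = []
    for a, b in combinations(seqs, 2):
        if len(a) != len(b):
            raise ValueError("sequences must be aligned to the same length")
        diffs.append(sum(x != y for x, y in zip(a, b)))
    return diffs


def mismatch_distribution(seqs):
    """Relative frequency of each pairwise-difference count."""
    diffs = pairwise_differences(seqs)
    counts = Counter(diffs)
    return {k: counts[k] / len(diffs) for k in sorted(counts)}


def expansion_time_years(tau, mu_per_site, seq_length, generation_years=20):
    """Invert tau = 2 * mu * l * t, then convert generations to years."""
    generations = tau / (2 * mu_per_site * seq_length)
    return generations * generation_years


# Toy aligned sequences, invented for illustration only.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACGA"]
diffs = pairwise_differences(toy)
print(mismatch_distribution(toy))
print(sum(diffs) / len(diffs))  # mean pairwise difference

# tau and generation time from the text; mutation rate is a placeholder.
print(round(expansion_time_years(5.38, 2.0e-6, 360)))
```

Because the date is inversely proportional to the assumed mutation rate, the uncertainty in that rate, rather than the standard error of τ, dominates the resulting time bracket.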
FIG. 7.-Neighbor-joining tree built from sequence distances between populations. Figures indicate the frequency with which each cluster was found in 1,000 bootstrap resamples. The figure 60.3%, for example, is the proportion in which the node linking Sardinians and Basques is found.
built, which, along with the percentage of bootstraps supporting each node, is shown in figure 7. The tree shows a clear gradient from the Middle East to the Basques. Bootstrap supports are all above 50%; the relative position of Sardinians and British is the least robust part of the tree. The Turkish population presents its shortest genetic distance with the British, but at the same time Turkey is the population with the shortest genetic distance to the Middle East. Once again, Turkey's intermediate genetic position between the Middle East and Europe is shown.

Genetic Differences and Geographic Distance

In order to test the possible geographic stratification of the genetic variation within the population, the geographic distance between each pair of individuals was computed and compared with the nucleotide pairwise differences through Pearson's correlation coefficient. The statistical significance of the result (r = 0.150) was checked through the Mantel test (Mantel 1967); after 1,000 iterations, it could be inferred that the correlation between genetic and geographic distances is not significantly different from zero, although P has a value very close to the significance level (P = 0.052).

The sequence tree can also be used to examine the geographic structure of the genetic variation. When the geographic centroids for the provenance of individuals in the four different clusters (A to D) were calculated, it was not possible to discern any particular pattern, and all four centroids lay in the center of Anatolia, close to each other. Confidence ellipses, drawn with axis lengths corresponding to one standard deviation in latitude and longitude around the centroid, overlapped widely, indicating that there is no geographic differentiation related to the tree of sequences. No internal structure is evident in these results.

Discussion
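The Mantel test reported in the Results for genetic versus geographic distances can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes two symmetric distance matrices over the same individuals, uses Pearson's r on the upper triangles, and estimates a one-tailed P value by jointly permuting the rows and columns of one matrix. The function names, the one-tailed choice, the (hits + 1) / (permutations + 1) estimator, and the toy data are all assumptions made for the example.

```python
import math
import random


def upper_triangle(matrix, order):
    """Flatten the upper triangle, visiting individuals in the given order."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(genetic, geographic, permutations=1000, seed=0):
    """Return (observed r, one-tailed permutation P value)."""
    n = len(genetic)
    identity = list(range(n))
    geo = upper_triangle(geographic, identity)
    observed = pearson(upper_triangle(genetic, identity), geo)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel individuals in the genetic matrix only
        if pearson(upper_triangle(genetic, order), geo) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: five individuals on a line; genetic distance is made exactly
# proportional to geographic distance, so r should equal 1.
positions = [0, 1, 3, 6, 10]
geographic = [[abs(a - b) for b in positions] for a in positions]
genetic = [[2 * d for d in row] for row in geographic]
r, p = mantel_test(genetic, geographic)
print(r, p)
```

Permuting whole individuals, rather than shuffling the pairwise distances independently, preserves the dependence among distances that share an individual, which is why an ordinary significance test for r is not appropriate here.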
Our analyses consistently show that Anatolian mtDNA sequences present features that are intermediate between those found in Europe and in the Middle East. This is especially evident (1) in the cline of the frequency of the substitutions defining cluster D; (2) in the average and distribution of nucleotide pairwise differences (fig. 6) and, hence, in an intermediate genetic diversity; and (3) in the position of Turkey in the population tree (fig. 7). Several scenarios could have produced an intermediate genetic position. However, they make different, specific predictions about the outcome of the analyses performed. This eventually will allow us to reject, solely on the basis of genetic evidence, all but one of the population history scenarios proposed:

1. The Middle East was colonized from Europe through Turkey. This scenario would predict a loss of genetic diversity toward the East. The present results show exactly the opposite pattern, as demonstrated, for example, by mean nucleotide pairwise differences (fig. 6 and table 3).

2. Isolation by distance acting on populations in demographic equilibrium. As Rogers and Harpending (1992) showed, this would lead to irregular, multimodal pairwise difference distributions, again in contradiction with our findings (figs. 5 and 6).

3. The present Turkish population is the result of recent admixture between European and Middle Eastern populations. Although this scenario would result in pairwise difference distributions and population distances compatible with those actually found, a recent admixture would not account either for the large number of lineages found exclusively in Turks (31 out of 40), even when compared to a large database of European and Middle Eastern sequences, or for the high mean nucleotide pairwise differences between Turks and both European and Middle Eastern populations. Moreover, the sequences shared by Turks and Middle Easterners (TUK7, TUK33, and TUK38) are also found in Europe, as far west as Britain and the Basque Country. Therefore, the pattern of sequence sharing agrees with a common, ancient origin for European and west Asian populations rather than with Turks being basically the product of recent admixture.
4. Turkey was in the pathway of the colonization of Europe from the Middle East through population expansions. This would explain the patterns of sequence sharing; the clinal frequency of the Caucasoid substitutions found at the base of cluster D; the mean pairwise nucleotide differences and their distribution; and population distances and trees.

Expansion dates can be estimated from the whole tree from 42,000 to 122,000 years ago, which would be an upper estimate for the age of the European and west Asian variation. Cluster D, the most robust in the analysis, is entirely Caucasoid, and its divergence time gives an estimate of the minimum divergence time for the Turkish and European populations, from 37,000 to 107,000 years ago. Similar age brackets are found through the pairwise difference distribution. These estimates agree with the archaeological dates for the spread of anatomically modern humans in Europe. Although the debate on the origin and spread of the Upper Paleolithic in Europe and its relationship to the expansion of modern humans remains open, there is widespread
agreement that nonmodern people (Neanderthals in Europe and west Asia) differed profoundly in their behavior from their modern successors (Klein 1994). In Europe, the Upper Paleolithic artifactual evidence for modern behavior seems to have appeared abruptly about 40,000 years ago, and to the extent that human remains occur in Upper Paleolithic archaeological sites, they come almost exclusively from typically modern humans (Klein 1994). Moreover, the Middle to Upper Paleolithic transition could have produced a population replacement (Stringer 1989; Mellars 1993). The mtDNA patterns seen in this Turkish sample are compatible with an expansion at 40,000 years ago, and do not show traces of the persistence of older populations (e.g., Neanderthals).

The differential effect that two separate expansions, the Upper Paleolithic and the Neolithic, may have produced in the genetic makeup of Europe is not altogether clear, as both originated in the Middle East. The present analysis with mtDNA control region sequences points to an important role of very ancient events. Cavalli-Sforza (Ammerman and Cavalli-Sforza 1984, pp. 105-107; Cavalli-Sforza, Menozzi, and Piazza 1994, pp. 296-299) nonetheless interpreted the variation found in classical genetic markers as due mainly to the Neolithic expansion. It is intrinsically difficult to separate the genetic effects of those two diachronic waves, which had very similar geographic origins and expansion paths. However, it is possible that the reduced population size during the Upper Paleolithic allowed drift to act deeply on gene frequencies but had little effect on sequence diversity, as it is likely that the European population did not suffer any narrow bottleneck.
In this case, the effect of the expansion of farming (that is, a sharp increase in mobility and population size) on gene frequencies could have been deep, transforming a random variation pattern into a cline, but would have had few consequences on mtDNA sequence diversity, which would reflect more ancient events.

Acknowledgments

This research was supported by Dirección General de Investigación Científica y Técnica (Spain) grant PB92-0722 to J.B., who also received part of the Human Capital and Mobility network grants ERCHRXCT920032 and ERBCHRXCT92-0090. D.C. was awarded a pre-doctoral grant (FI/93-1.151) by the Comissionat per a Universitats i Recerca, Catalan Autonomous Government. Thanks to Metin Özbek, professor of Physical Anthropology at the Hacettepe University of Ankara, for collecting the samples. Some of the sequences were obtained in the laboratory of Svante Pääbo (grant from the DFG) during a visit from D.C. This manuscript was greatly improved through the suggestions of the associate editor and two anonymous reviewers.

LITERATURE CITED

AMMERMAN, A. J., and L. L. CAVALLI-SFORZA. 1984. The Neolithic transition and the genetics of populations in Europe. Princeton University Press, Princeton, N.J.
ANDERSON, S., A. T. BANKIER, B. G. BARRELL et al. (14 co-authors). 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457-465.
BERTRANPETIT, J., J. SALA, F. CALAFELL, P. A. UNDERHILL, P. MORAL, and D. COMAS. 1995. Human mitochondrial DNA variation and the origin of the Basques. Ann. Hum. Genet. 59:63-81.
CABOT, E. L. 1988. ESEE, the eyeball sequence editor. Version 1.06. Burnaby, B.C., Canada.
CAVALLI-SFORZA, L. L., P. MENOZZI, and A. PIAZZA. 1994. History and geography of human genes. Princeton University Press, Princeton, N.J.
DI RIENZO, A., and A. C. WILSON. 1991. Branching pattern in the evolutionary tree for human mitochondrial DNA. Proc. Natl. Acad. Sci. USA 88:1597-1601.
EFRON, B. 1982. The jackknife, the bootstrap and other resampling plans. Society for Industrial and Applied Mathematics, Philadelphia, Pa.
FELSENSTEIN, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.
FELSENSTEIN, J. 1989. PHYLIP-phylogeny inference package (version 3.2). Cladistics 5:164-166.
FRANCALACCI, P., J. BERTRANPETIT, F. CALAFELL, and P. UNDERHILL. 1996. Sequence diversity of the control region of mitochondrial DNA in Tuscany and its implications for the peopling of Europe. Am. J. Phys. Anthropol. (in press).
HARPENDING, H. C., S. T. SHERRY, A. R. ROGERS, and M. STONEKING. 1993. The genetic structure of ancient human populations. Curr. Anthropol. 34:483-496.
HUDSON, R. R. 1990. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1-44.
KLEIN, R. G. 1994. The problem of modern human origins. Pp. 3-17 in M. H. NITECKI and D. V. NITECKI, eds. Origins of anatomically modern humans. Plenum Press, New York and London.
KUMAR, S., K. TAMURA, and M. NEI. 1994. MEGA: molecular evolutionary genetics analysis software for microcomputers. CABIOS 10:189-191.
MANTEL, N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Res. 27:209-220.
MELLARS, P. 1993. Archaeology and modern human origins in Europe. Proc. Br. Acad. 82:1-35.
MOUNTAIN, J. L., J. M. HEBERT, S. BHATTACHARYYA, P. A. UNDERHILL, C. OTTOLENGHI, M. GADGIL, and L. L. CAVALLI-SFORZA. 1995. Demographic history of India and mtDNA-sequence diversity. Am. J. Hum. Genet. 56:979-992.
NEI, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
NEI, M., and J. C. MILLER. 1990. A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics 125:873-879.
PIERCY, R., K. M. SULLIVAN, N. BENSON, and P. GILL. 1993. The application of mitochondrial DNA typing to the study of white Caucasian genetic identification. Int. J. Legal Med. 106:85-90.
RAO, C. R. 1982. Diversity and dissimilarity coefficients: a unified approach. Theor. Popul. Biol. 21:24-43.
RENFREW, C. 1987. Archaeology and language. The puzzle of Indo-European origins. Jonathan Cape, London.
RENFREW, C. 1992. Archaeology, genetics and linguistic diversity. Man 27:445-478.
ROGERS, A. R., and H. HARPENDING. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9:552-569.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
SANTOS, M., R. H. WARD, and R. BARRANTES. 1994. mtDNA variation in the Chibcha Amerindian Huetar from Costa Rica. Hum. Biol. 66:963-977.
SLATKIN, M., and R. R. HUDSON. 1991. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562.
STRINGER, C. B. 1989. The origin of early modern humans: a comparison of the European and non-European evidence. Pp. 232-244 in P. MELLARS and C. STRINGER, eds. The human revolution: behavioural and biological perspectives on the origins of modern humans. Princeton University Press, Princeton, N.J.
TAKEZAKI, N., A. RZHETSKY, and M. NEI. 1995. Phylogenetic test of the molecular clock and linearized trees. Mol. Biol. Evol. 12:823-833.
TAMURA, K., and M. NEI. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-523.
TEMPLETON, A. R. 1992. Human origins and analysis of mitochondrial DNA sequences. Science 255:737.
TORRONI, A., T. G. SCHURR, M. F. CABELL, M. D. BROWN, J. V. NEEL, M. LARSEN, D. G. SMITH, C. M. VULLO, and D. C. WALLACE. 1993a. Asian affinities and continental radiation of the four founding Native American mtDNAs. Am. J. Hum. Genet. 53:563-590.
TORRONI, A., R. I. SUKERNIK, T. G. SCHURR, Y. B. STARIKOVSKAYA, M. F. CABELL, M. H. CRAWFORD, A. G. COMUZZIE, and D. C. WALLACE. 1993b. mtDNA variation of aboriginal Siberians reveals distinct genetic affinities with Native Americans. Am. J. Hum. Genet. 53:591-608.
VIGILANT, L., R. PENNINGTON, H. HARPENDING, T. D. KOCHER, and A. C. WILSON. 1989. Mitochondrial DNA sequences in single hairs from a southern African population. Proc. Natl. Acad. Sci. USA 86:9350-9354.
VIGILANT, L., M. STONEKING, H. HARPENDING, K. HAWKES, and A. C. WILSON. 1991. African populations and the evolution of mitochondrial DNA. Science 253:1503-1507.
WAKELEY, J. 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37:613-623.
WARD, R. H., B. L. FRAZIER, K. DEW-JAGER, and S. PÄÄBO. 1991. Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl. Acad. Sci. USA 88:8720-8724.
WARD, R. H., A. REDD, D. VALENCIA, V. FRAZIER, and S. PÄÄBO. 1993. Genetic and linguistic differentiation in the Americas. Proc. Natl. Acad. Sci. USA 90:10663-10667.

SIMON EASTEAL, reviewing editor
Accepted June 13, 1996