Hum Genet (2012) 131:757–771 DOI 10.1007/s00439-011-1125-3

ORIGINAL INVESTIGATION

Polymorphic NumtS trace human population relationships

Martin Lang · Marco Sazzini · Francesco Maria Calabrese · Domenico Simone · Alessio Boattini · Giovanni Romeo · Donata Luiselli · Marcella Attimonelli · Giuseppe Gasparre

Received: 31 October 2011 / Accepted: 30 November 2011 / Published online: 8 December 2011 © Springer-Verlag 2011

Abstract The human genome is constantly subjected to evolutionary forces which shape its architecture. Insertions of mitochondrial DNA sequences into nuclear genome (NumtS) have been described in several eukaryotic species, including Homo sapiens and other primates. The ongoing process of the generation of NumtS has made them valuable markers in primate phylogenetic studies, as well as potentially informative loci for reconstructing the genetic history of modern humans. Here, we report the identification of 53 human-specific NumtS by inspection of the UCSC genome browser, showing that they may be direct insertions of mitochondrial DNA into the human nuclear DNA after the human-chimpanzee split. In silico analyses allowed us to identify 14 NumtS which are polymorphic in terms of their presence/absence within the human genome in individuals of different ancestry. The allele frequencies of these polymorphic NumtS were calculated for 1000 Genomes Project sequence data from 13 populations worldwide, and principal components analysis and hierarchical clustering methods allowed the detection of strong signals of geographical structure related to the genetic diversity of these loci. All identified polymorphic human-specific NumtS together with a tandemly duplicated NumtS have also been validated by PCR amplification on a panel of 60 samples belonging to five native populations worldwide, confirming the expected NumtS variability. On the basis of these findings, we have succeeded in depicting the landscape of variation of a series of NumtS in several ethnic groups, making an advance in their identification as useful markers in the study on human population genetics.

M. Lang and M. Sazzini are co-first authors. M. Attimonelli and G. Gasparre contributed equally to this work.

Electronic supplementary material The online version of this article (doi:10.1007/s00439-011-1125-3) contains supplementary material, which is available to authorized users.

M. Lang · G. Romeo · G. Gasparre (✉)
Dipartimento di Scienze Ginecologiche, Ostetriche e Pediatriche, U.O. Genetica Medica, Pad.11, Pol.S.Orsola-Malpighi, Università di Bologna, Via Massarenti 9, 40138 Bologna, Italy

M. Sazzini · A. Boattini · D. Luiselli
Dipartimento di Biologia Evoluzionistica Sperimentale, Laboratorio di Antropologia Molecolare, Università di Bologna, 40138 Bologna, Italy

F. M. Calabrese · D. Simone · M. Attimonelli
Dipartimento di Biochimica e Biologia Molecolare “E. Quagliariello”, Università di Bari, 70126 Bari, Italy

Introduction

Insertions of mitochondrial DNA (mtDNA) sequences into the nuclear genome (NumtS) have been described in several eukaryotic species (Hazkani-Covo et al. 2010), including Homo sapiens and other primates (Tourmen et al. 2002; Mishmar et al. 2004; Hazkani-Covo and Graur 2007; Lascaro et al. 2008). The whole mitochondrial genome can in fact be found in the shape of about 600 different fragments, apparently distributed at random over the human nuclear genome (Simone et al. 2011) and most probably inserted by a non-homologous end-joining mechanism (NHEJ) during repair of DNA double strand breaks (DSBs) (Hazkani-Covo and Covo 2008). Although most NumtS derive from independent insertion events of mtDNA into the nuclear genome, others result from tandem duplications or are due to larger segmental duplications (Tourmen et al. 2002; Bensasson et al. 2003; Mishmar et al. 2004; Hazkani-Covo et al. 2010). In any case, once inserted into the nuclear genome, these fragments become subject to the nuclear evolutionary rate, which is far lower than that of mtDNA (Pakendorf and Stoneking 2005). Therefore, the most ancient NumtS have been described as true “molecular fossils”, whereas the most recent ones may be regarded as “snapshots” of mtDNAs coexisting at the time of insertion (Hazkani-Covo et al. 2010). This property of NumtS sequences to resemble ancestral mtDNA has been successfully used for phylogenetic classification in very different taxonomic groups (Schmitz et al. 2005; Ovchinnikov and Kholina 2010; Baldo et al. 2011), offering the opportunity to study the correlation between mitochondrial and nuclear genome evolutions.

The insertion of mtDNA fragments into the nuclear genome is an ongoing and more or less continuous process in yeast (Ricchetti et al. 1999) and primates (Ricchetti et al. 2004), with a NumtS insertion rate in the latter of approximately 5.6 NumtS per million years (Gherman et al. 2007; Hazkani-Covo 2009). As most NumtS are thought to derive from single insertion events, excluding the possibility of similarity due to convergent evolution, NumtS may thus be considered as loci free of molecular homoplasy (Zischler 2000; Hazkani-Covo 2009), a characteristic which makes them very good candidates as reliable phylogenetic markers. Comparison of the presence/absence of NumtS between human, chimpanzee and other primates’ genomes has indeed allowed the reconstruction of an accurate primate phylogeny (Hazkani-Covo and Graur 2007; Jensen-Seaman et al. 2009). Lastly, the insertion of novel NumtS into the human genome is also thought to be an ongoing process, as suggested by disease-causing insertions of mitochondrial fragments into protein-coding genes (Willett-Brozick et al. 2001; Turner et al. 2003; Goldin et al. 2004; Chen et al. 2005). NumtS may thus represent potential informative markers in reconstructing the genetic history of modern humans, even at intraspecific level.
In the last few decades, human population and evolutionary genetics have involved increasingly deep surveys of H. sapiens genetic variation, thanks to the possibility of using an ever-increasing spectrum of markers which can highlight genetic differences among populations (Lopez Herraez et al. 2009; Kersbergen et al. 2009). All these markers were typically characterized by high degrees of polymorphism, codominant inheritance and even distribution throughout the genome, as well as selective neutrality and relatively simple analysis. In addition to uniparental markers (mtDNA and non-recombining Y-chromosome), molecular markers on autosomes, including microsatellites, mobile repeat elements and, especially, single nucleotide polymorphisms (SNPs), are frequently used in the field of population genetics (Rosenberg et al. 2002; Jakobsson et al. 2008; Li et al. 2008; Novembre et al. 2008). Biallelic insertion/deletion polymorphisms (indels) have also aided the exploration of human population structure (Bastos-Rodrigues et al. 2006). Although this sudden increase in knowledge has represented a turning point in the study of the evolutionary forces which shaped H. sapiens diversity (Coop et al. 2009; Itsara et al. 2009; Auton et al. 2009; Hao et al. 2010), the identification of new genetic markers with the characteristics mentioned above will enable even better exploration of the complex history of human populations. Accordingly, data recently produced by the 1000 Genomes Project have provided a very detailed description of human genetic variation (The 1000 Genomes Project Consortium 2010), and are also expected to provide an invaluable resource for NumtS studies at population level. Although a few works have already suggested the use of NumtS in the field of human evolutionary genetics (Thomas et al. 1996; Ricchetti et al. 2004), extensive studies concerning the usefulness and properties of polymorphic human-specific NumtS as population genetic markers, to the best of our knowledge, have never been carried out. As NumtS display highly polymorphic features, a population genetic approach, capable of investigating their variability in human groups of different ancestry, would allow studies of the micro-evolutionary processes which occurred during the differentiation of H. sapiens populations from a completely new perspective.

We present here in-depth analyses of human-specific NumtS, with the aim of proposing them as molecular markers able to trace human population differentiation patterns. Although phylogenetic studies have generally examined NumtS sequences (Schmitz et al. 2005; Ovchinnikov and Kholina 2010), in the present work we analysed their property as indels, thanks to the recent implementation of NumtS tracks on the University of California Santa Cruz (UCSC) human genome browser build hg18 (http://genome.ucsc.edu; Kent et al. 2002; Simone et al. 2011), a powerful tool which allowed us to identify a very large number of human-specific NumtS. By applying bioinformatics analyses to the 1000 Genomes Project sequence data, and providing in vitro validation of the observed polymorphic loci, we succeeded in depicting the landscape of variation of a series of NumtS in several ethnic groups worldwide, making an advance in their identification as useful markers in the study of human population genetics.

Materials and methods

Public database inspections

The intersection between all NumtS of the RHNumtS.2 compilation (Simone et al. 2011) and segmental duplications annotated in the UCSC human genome browser (hg18) (http://genome.ucsc.edu/; Kent et al. 2002) was performed by Galaxy software (http://main.g2.bx.psu.edu/; Goecks et al. 2010; Blankenberg et al. 2010) and a spreadsheet comparison of genomic regions covered by NumtS and segmental duplications. Human-specific NumtS were identified by visual inspection of the UCSC human genome browser (hg18) by comparing NumtS tracks, available as “NumtS Sequences” in the “Variation and Repeats” section, with “Chimp Chain/Net”, “Orangutan Chain/Net”, “Rhesus Chain/Net” and “Marmoset Chain/Net”. Reference sequences for in silico analyses were downloaded from the UCSC genome browser and alignments of sequenced NumtS to the human reference genome (hg18) were performed with its BLAT tool (Kent 2002). Polymorphic human-specific NumtS were selected by Galaxy software by intersecting genomic intervals of human-specific NumtS with indels, annotated in the Database of Single Nucleotide Polymorphisms, build 130 (dbSNP, Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine; http://www.ncbi.nlm.nih.gov/SNP/) and retrieved through the UCSC Table Browser tool (Karolchik et al. 2004). Lists of pilot indels released by the 1000 Genomes Project were downloaded from the following websites: ftp://ftp.1000genomes.ebi.ac.uk/////vol1/ftp/pilot_data/release/2010_07/low_coverage/indels/ and http://www.well.ox.ac.uk/~gerton/1000G/LC/pilot1-indelcalls-17sept09.tgz. Similarity searches between reported indels of at least 15 bp in length and the reference human mtDNA (rCRS, NC_012920) were carried out by NCBI/Blastn software (http://blast.ncbi.nlm.nih.gov/Blast.cgi; Altschul et al. 1997).
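The interval intersection step described above can be illustrated with a minimal Python sketch. It is not the Galaxy implementation: the function name `intersect` and the half-open coordinate convention are assumptions made for illustration, and any indel coordinates passed to it would come from the user's own data.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end); coordinates are treated as half-open (assumption)
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the intervals of `a` that overlap at least one interval of `b`."""
    # Group the second set by chromosome so only same-chromosome pairs are compared
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in b:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, start, end in a:
        candidates = by_chrom.get(chrom, [])
        # Two intervals overlap when each one starts before the other ends
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            hits.append((chrom, start, end))
    return hits
```

A simple quadratic scan is enough for a few hundred NumtS; for genome-wide indel sets a sorted sweep or an interval tree would be the more usual choice.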


Validation of human-specific NumtS

In order to validate the absence of the predicted human-specific NumtS in chimpanzee, whole genomic DNA from a Pan troglodytes sample was PCR amplified with primers designed for H. sapiens (Simone et al. 2011). An exception was made for three NumtS (HSA_NumtS_513, HSA_NumtS_038, HSA_NumtS_426), for which no amplification was obtained with the primers mentioned above, and for which specific primers were designed for the chimpanzee genome on the syntenic region, flanking the NumtS insertion site. Sequencing was carried out as previously described (Lascaro et al. 2008) and the obtained sequences were submitted to the EMBL databank, according to the protocol described in Simone et al. (2011). Accession numbers of submitted sequences are as follows: HE613801 to HE613849 (H. sapiens) and HE614082 to HE614112 (P. troglodytes).

In silico validation of human-specific NumtS on rhesus macaque and orangutan DNA was carried out by downloading reference sequences of the syntenic regions flanking the predicted NumtS insertion sites, and by performing pairwise alignment to the human reference sequence with Sequencher software, version 4.7 (Gene Codes Corporation, Ann Arbor, MI USA; http://www.genecodes.com). Primer sequences and PCR conditions are reported in Supplementary Table S1.

Analysis of 1000 Genomes Project data

Sequence alignment data generated by the 1000 Genomes Project (http://www.1000genomes.org/) were analysed to define which human-specific NumtS are polymorphic within modern H. sapiens populations and to assay the frequencies of alleles carrying the NumtS in different human groups worldwide. Populations were selected on the basis of the availability of paired-end reads data produced by Illumina sequencing technology. Low coverage sequence data for 940 individuals belonging to 13 different populations, sufficiently representative of the human ethnic groups inhabiting the various continents, were thus retrieved for analysis (Table 1). Aligned sequence reads covering 1.2 kb up- and downstream of the NumtS insertion sites were downloaded for each individual from the EBI ftp site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data) and indexed by SAMtools software (Li et al. 2009). Data were retrieved as Binary Alignment/Map (BAM) files.

Table 1 Surveyed populations of 1000 Genomes Project

| Population | Code | N |
|---|---|---|
| CEPH individuals | CEU | 81 |
| HapMap African ancestry individuals from SW US | ASW | 50 |
| Han Chinese in Beijing | CHB | 81 |
| Han Chinese South | CHS | 92 |
| HapMap Finnish individuals from Finland | FIN | 75 |
| Colombian individuals^a | CLM | 50 |
| Japanese individuals^a | JPT | 78 |
| Luhya individuals | LWK | 83 |
| HapMap Mexican individuals from LA California | MXL | 54 |
| Puerto Ricans in Puerto Rico | PUR | 52 |
| Tuscan individuals, Italy | TSI | 98 |
| Yoruba individuals^a | YRI | 76 |
| British individuals from England and Scotland^a | GBR | 70 |
| Total | 13 | 940 |

Abbreviations for each population described in text (code) and number of individuals (N) are listed
^a Populations on which all human-specific NumtS were in silico analysed

Aligned sequence reads were thus analysed by Dindel software (Albers et al. 2010), applying default settings. For each population, BAM files of single individuals were explored simultaneously with the option “doPooled” on all candidate indels produced by the program. Indels corresponding to NumtS (deletions for NumtS present in the reference genome and insertions for NumtS absent in the reference genome) were added to the list of candidates, in order to force the program to test reads for evidence of NumtS insertion/deletion. Merging all reads for a given region and population into a single BAM file gave equivalent results upon analysis with Dindel (data not shown). The 1000 Genomes Project version of the reference human genome sequence (release hg19, human_g1k_v37.fa) was used to re-align the reads. All human-specific NumtS for which the exact insertion site could be reconstructed were analysed in four populations from different continents (CLM, JPT, GBR, YRI) (Table 1). Ten human-specific NumtS showed evidence of deletion polymorphisms with respect to the reference sequence and were thus further examined in all the remaining populations, together with the four NumtS showing insertion polymorphisms with respect to the reference sequence, as subsequently described. Allele frequencies were successfully calculated for nine human-specific NumtS on the whole dataset of 940 individuals belonging to 13 populations worldwide (Table 2). Only data were used which had a quality score indicating a confidence value of at least 99%, or for which visual inspection of sequence reads, carried out by means of the Integrative Genomic Viewer (IGV 1.5.61; http://www.broadinstitute.org/igv; Robinson et al. 2011), did not show discrepancies with respect to Dindel software output. A Reynolds’ genetic distance matrix among such populations (Reynolds et al. 1983) was computed with the adegenet library of the R 2.12.2 package (http://www.r-project.org/) and used for graphical representation by means of a neighbor-joining tree (Saitou and Nei 1987).
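To make the distance computation concrete, the sketch below evaluates one commonly cited form of the Reynolds coancestry distance for biallelic presence/absence loci. It is an illustration under assumptions, not the adegenet code: the function name `reynolds_distance` is hypothetical, and the formula used is D = sqrt( Σ_k Σ_j (x_kj − y_kj)² / (2 Σ_k (1 − Σ_j x_kj y_kj)) ), with k indexing loci and j indexing the two alleles.

```python
import math


def reynolds_distance(p_a, p_b):
    """Reynolds-type distance between two populations.

    p_a, p_b: per-locus frequencies of the NumtS-carrying allele.
    The second allele at each locus has frequency 1 - p.
    """
    numerator = 0.0
    denominator = 0.0
    for pa, pb in zip(p_a, p_b):
        # Squared frequency differences summed over both alleles of the locus
        numerator += (pa - pb) ** 2 + ((1 - pa) - (1 - pb)) ** 2
        # One minus the probability of drawing the same allele from each population
        denominator += 1 - (pa * pb + (1 - pa) * (1 - pb))
    # Note: the denominator is zero if both populations are fixed for the
    # same allele at every locus; that case is not handled in this sketch.
    return math.sqrt(numerator / (2 * denominator))
```

Applying the function to every pair of populations yields a symmetric matrix that can be passed to a neighbor-joining routine.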
Tree and bootstrap distance computations were performed with the R ape library. Principal components analysis (PCA) (Pearson 1901) on the allele frequency data of populations was carried out with the R ade4 library, in order to explore the potential structure intrinsic to the observed NumtS variability. Computed principal components (PCs) were subsequently used for hierarchical clustering analysis, in which significance estimates were calculated for each of the identified clusters via multiscale bootstrap resampling by means of the R pvclust library and the adoption of a >95 bootstrap value threshold (Suzuki and Shimodaira 2006).

Population samples of validation panel

PCR validation of human-specific NumtS which were expected to be polymorphic in H. sapiens populations according to in silico analyses (Table 2) was performed on a panel of 60 unrelated healthy individuals belonging to native ethnic groups worldwide. In brief, data from 12 Ethiopians, 13 Iranians, 11 Kazakhs and Uyghurs, 12 Peruvian Yanesha, and 12 Northern Italians were analysed by clustering them into five distinct geographically based groups: Sub-Saharan Africa (SSA), Middle East (MDE), Central Asia (CAS), Europe (EUR) and South America (SAM) (Supplementary Table S2). All subjects were third-generation natives from the selected geographical areas and were chosen according to their mtDNA profile, obtained from previous data (personal data), in order to collect a validation panel which could be sufficiently representative of at least a fraction of the ancestral genetic variability characterizing the different continents. Written informed consent was collected from each subject and the study was designed according to the ethical principles for medical research involving human subjects stated by the World Medical Association Declaration of Helsinki.

PCR validation of polymorphic human-specific NumtS

PCR amplification was carried out on 40 ng of genomic DNA extracted from either buccal swabs or peripheral blood samples, with 500 nM primers designed to be external to the NumtS and with KAPA2G Fast HotStart ReadyMix (Resnova, Genzano di Roma, Italy) in a final volume of 10 µl. The length of the resulting amplicons was determined by gel electrophoresis, allowing inference of the presence/absence of the NumtS in question. When the NumtS allele state could not be unambiguously inferred, PCR products were sequenced as previously described (Lascaro et al. 2008). For each region, both alleles, with and without NumtS, were sequenced at least once, in order to confirm the NumtS insertion site in H. sapiens. Primer sequences and PCR conditions are reported in Supplementary Table S1.
Table 2 List of human-specific NumtS with corresponding mtDNA coordinates (mtStart-mtEnd), length of NumtS in base-pairs (bp), orientation on nuclear chromosomes with respect to rCRS orientation (orient) and position on nuclear DNA (chrom pos) or insertion site (ins site) for NumtS not present in human reference sequence build hg18 (in italics)

| NumtS code | mtStart | mtEnd | Length (bp) | Orient | Chrom pos/ins site (hg18) |
|---|---|---|---|---|---|
| HSA_NumtS_009^b | 8933 | 9006 | 73 | + | chr1:37849935-37850008 |
| HSA_NumtS_015 | 12476 | 12518 | 42 | − | chr1:103965301-103965343 |
| HSA_NumtS_030 | 6869 | 6980 | 111 | + | chr1:145799428-145799539 |
| HSA_NumtS_038 | 9562 | 9600 | 38 | + | chr1:213739762-213739800 |
| HSA_NumtS_050^b | 1763 | 1818 | 55 | − | chr2:33846042-33846097 |
| HSA_NumtS_052 | 6741 | 7012 | 271 | − | chr2:49310271-49310542 |
| HSA_NumtS_058 | 7858 | 8109 | 251 | − | chr2:81747112-81747363 |
| HSA_NumtS_062 | 8314 | 8789 | 475 | + | chr2:87905524-87905999 |
| HSA_NumtS_092 | 611 | 742 | 131 | − | chr2:149355765-149355896 |
| HSA_NumtS_102 | 10787 | 10897 | 110 | − | chr2:203191851-203191963 |
| HSA_NumtS_115 | 13048 | 14223 | 1,175 | − | chr2:238093986-238095159 |
| HSA_NumtS_116-129^a | 13048 | 13103 | 55 | − | chr2:238095165-238096013 |
| HSA_NumtS_133^b | 10984 | 11022 | 38 | + | chr3:25483999-25484037 |
| HSA_NumtS_143 | 1396 | 2718 | 1,322 | − | chr3:97818722-97820044 |
| HSA_NumtS_171 | 9338 | 9679 | 341 | − | chr4:12251016-12251357 |
| HSA_NumtS_179 | 14980 | 15072 | 92 | − | chr4:47469046-47469138 |
| HSA_NumtS_182 | 962 | 1092 | 130 | + | chr4:55889084-55889214 |
| HSA_NumtS_188 | 2226 | 2456 | 230 | − | chr4:79148708-79148939 |
| HSA_NumtS_201 | 12250 | 12417 | 167 | − | chr4:163561976-163562143 |
| HSA_NumtS_214^a | 10801 | 10841 | 40 | + | chr5:73107473-73107513 |
| HSA_NumtS_215 | 341 | 2697 | 2,356 | − | chr5:79981597-79983943 |
| HSA_NumtS_219 | 12662 | 16124 | 3,462 | + | chr5:93928917-93932379 |
| HSA_NumtS_228 | 10270 | 15488 | 5,218 | − | chr5:134286898-134292116 |
| HSA_NumtS_232 | 12146 | 12188 | 42 | − | chr5:165890002-165890044 |
| HSA_NumtS_276 | 12960 | 13065 | 105 | + | chr7:67839451-67839556 |
| HSA_NumtS_289 | 1615 | 1710 | 95 | + | chr7:145325359-145325454 |
| HSA_NumtS_304 | 13936 | 16024 | 2,088 | − | chr8:36254670-36256756 |
| HSA_NumtS_319 | 14860 | 14943 | 83 | − | chr8:100577274-100577357 |
| HSA_NumtS_398 | 6642 | 6804 | 162 | + | chr11:72899354-72899516 |
| HSA_NumtS_410 | 14659 | 14730 | 71 | − | chr11:122379524-122379595 |
| HSA_NumtS_426^b | 3790 | 3878 | 88 | − | chr12:40043704-40043792 |
| HSA_NumtS_432-453 | 4417 | 4477 | 60 | − | chr12:125633808-125634895 |
| HSA_NumtS_460^b | 9537 | 9592 | 55 | + | chr13:40240503-40240558 |
| HSA_NumtS_464 | 5104 | 5229 | 125 | − | chr13:55443769-55443894 |
| HSA_NumtS_472 | 982 | 1237 | 255 | − | chr13:108874473-108874728 |
| HSA_NumtS_474 | 5583 | 6606 | 1,023 | − | chr14:32023055-32024075 |
| HSA_NumtS_512 | 10142 | 10209 | 67 | + | chr17:39430610-39430677 |
| HSA_NumtS_513 | 6818 | 7470 | 652 | + | chr17:48538093-48538745 |
| HSA_NumtS_518 | 6902 | 6948 | 46 | − | chr17:76205971-76206017 |
| HSA_NumtS_519 | 14381 | 14503 | 122 | − | chr18:2832230-2832352 |
| HSA_NumtS_522^b | 7975 | 8166 | 191 | − | chr18:43633615-43633806 |
| HSA_NumtS_543^b | 2180 | 2221 | 41 | + | chr20:9097571-9097612 |
| HSA_NumtS_544 | 3499 | 3541 | 42 | + | chr20:13095959-13096001 |
| HSA_NumtS_546 | 12961 | 13030 | 69 | + | chr20:55072517-55072586 |
| HSA_NumtS_560 | 6180 | 6226 | 46 | + | chr22:34611665-34611711 |
| HSA_NumtS_574 | 1272 | 1342 | 70 | − | chrY:4272822-4272892 |
| HSA_NumtS_585 | 15565 | 15710 | 145 | + | chrY:19493376-19493521 |
| HSA_NumtS_586^a | 12612 | 12640 | 28 | + | chr3:68801367-68803395 |
| HSA_NumtS_587^a | 59 | 16089 | 530 | − | chr11:49883569 |
| HSA_NumtS_588^a | 1574 | 1613 | 39 | + | chr1:54863435 |
| HSA_NumtS_589^b | 16423 | 16461 | 38 | − | chr11:4612920 |
| HSA_NumtS_590^a | 6857 | 6894 | 37 | − | chr11:33876328 |
| HSA_NumtS_591^b | 5780 | 5820 | 40 | − | chr14:25486225 |

Polymorphic NumtS in bold type
^a NumtS analysed by PCR in validation panel
^b NumtS analysed by PCR in validation panel and on 1000 Genomes Project data

Basic descriptive statistics on validation panel

All statistical analyses were performed on the five distinct geographically based groups previously defined (SSA, MDE, CAS, EUR and SAM). Calculations of allele frequencies, as well as goodness of fit to the Hardy–Weinberg expectation (HWE) with relative correction for multiple testing, were performed for each of the 14 validated human-specific NumtS with the R adegenet library, potentially excluding loci with >20% missing data. Estimates of basic descriptive statistics for each continental group, such as the number of polymorphic loci (S) and average gene diversity over all loci, were computed with the Arlequin package, version 3.5.1.2 (Excoffier and Lischer 2010).

Analysis of validation panel at individual level

Homozygosity/heterozygosity information concerning the presence of validated NumtS in the 60 examined samples was used to perform PCA (Pearson 1901) at individual level by means of the R ade4 library, in order to explore the genetic differentiation of single subjects from the five continental groups according to observed NumtS variability. An exploratory analysis was first carried out to identify and remove potential outlier individuals (four individuals), and PCA was then repeated on the resulting reduced dataset.

Analysis of population structure within validation panel

The apportionment of genetic variance among the five geographically based groups was studied by performing a series of locus-by-locus analyses of molecular variance (AMOVA) (Excoffier et al. 1992), each testing a different number of population clusters (1 ≤ k ≤ 4), in order to identify those which were maximally differentiated. Among-groups pairwise FST genetic distances calculated according to the method of Cockerham and Weir (1984) were also computed with the Arlequin package.
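As an illustration of the Hardy–Weinberg goodness-of-fit check mentioned for the validation panel, the following sketch runs a one-degree-of-freedom chi-square test on genotype counts at a single presence/absence locus. It is a generic textbook test, not necessarily the procedure implemented in adegenet; the function name `hwe_chi_square` is hypothetical and no correction for multiple testing is applied.

```python
import math


def hwe_chi_square(n_pp, n_pa, n_aa):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic locus.

    n_pp: homozygotes carrying the NumtS on both chromosomes
    n_pa: heterozygotes
    n_aa: homozygotes lacking the NumtS
    Returns (chi-square statistic, p-value for 1 degree of freedom).
    """
    n = n_pp + n_pa + n_aa
    p = (2 * n_pp + n_pa) / (2 * n)  # frequency of the NumtS-carrying allele
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_pp, n_pa, n_aa)
    # Skip classes with zero expectation (monomorphic locus)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With groups of only 11 to 13 individuals, as in the validation panel, an exact test is often preferred over the chi-square approximation; the sketch is meant only to show the logic of the comparison.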

Results

NumtS on segmental duplications

NumtS may arise either by de novo insertion of mtDNA sequences into the nuclear genome or by duplication of larger DNA segments spanning previous insertions of NumtS. Although this latter case has already been described (Tourmen et al. 2002; Bensasson et al. 2003; Mishmar et al. 2004), pertinent reports are limited to a relatively low number of loci. By exploiting the recent implementation of human NumtS tracks on the hg18 build of the UCSC genome browser, we succeeded in intersecting all 585 NumtS of the reference human NumtS compilation (RHNumtS.2) (Simone et al. 2011) with annotated segmental duplications, and found a total of 105 NumtS which were associated with such genomic elements. Segmental duplications are defined as duplications of non-repeat-masked sequences longer than 1,000 bases and with more than 90% sequence similarity. Because of this definition, 30 NumtS found to intersect with segmental duplications turned out to be direct insertions of mtDNA: subsequent comparison of their flanking regions showed that the annotated segmental duplications did not extend beyond the NumtS (Supplementary Table S3). Conversely, we identified 75 NumtS involved in duplication events, and comparison of both their sequences and their flanking regions identified 12 clusters of duplicated NumtS. The structure of some of these clusters suggested that more than one duplication event had generated “non-original” NumtS and that some of them derived from multiple original insertions. In more detail, 19 out of 75 NumtS must have arisen from direct insertion of mtDNA into the nuclear genome, whereas the remaining 56 (9.1% of all annotated NumtS) must have originated from segmental duplications involving NumtS flanking sequences (Supplementary Table S3).

Compilation of human-specific NumtS

Inspection of the NumtS tracks within the UCSC Genome Browser allowed their comparison with chain and net tracks of other primate species, leading to the identification of human-specific NumtS (Supplementary Figure S1). Following this procedure, we were able to identify 53 NumtS which were predicted to be absent in the following primate species: chimpanzee (Pan troglodytes), orangutan (Pongo pygmaeus abelii), rhesus macaque (Macaca mulatta), and marmoset (Callithrix jacchus), thus increasing the number of previously listed human-specific NumtS by approximately 32% (Jensen-Seaman et al. 2009; Hazkani-Covo 2009) (Table 2). None of these human-specific NumtS appeared to originate from segmental duplications. Predictions made by UCSC genome browser inspection were subsequently validated by PCR amplification and sequencing on chimpanzee DNA for 41 out of the 53 syntenic regions of NumtS insertion sites, the reconstruction of which was enabled by the sequence data. In order to exclude the possibility of ancestral NumtS deletion in the chimpanzee genome, we performed in silico analysis on the reference sequences of rhesus macaque and orangutan, being able to reject the possibility that ancestral NumtS were lost in our phylogenetically closest neighbors, confirming that they are indeed human-specific.

Selection of polymorphic human-specific NumtS

The validation process of NumtS within the RHNumtS.2 compilation (Simone et al. 2011) confirmed the hypothesis that the insertion of NumtS into the nuclear genome is an ongoing process in the human species (Hazkani-Covo et al. 2010; Ricchetti et al. 2004) and that their presence/absence polymorphism is therefore to be expected, as already observed (Thomas et al. 1996; Ricchetti et al. 2004). By joining genomic intervals of RHNumtS with indels, annotated as “simple nucleotide polymorphisms” on the dbSNP build 130 (Sherry et al. 2001), we identified nine loci for which alleles without the NumtS were annotated, suggesting that they were polymorphic for presence/absence within the human species. In addition, a Blast search of dbSNP build 130 indels against the human mitochondrial reference sequence did not reveal other polymorphic NumtS. All human-specific NumtS for which the exact insertion site could be reconstructed were analysed on 1000 Genomes Project sequence data (http://www.1000genomes.org) in four populations from different continents (CLM, JPT, GBR, YRI) by means of Dindel software (Albers et al. 2010) (Table 1).
Ten human-specific NumtS showed evidence of deletion polymorphisms with respect to the reference sequence in these populations. In order to identify possible NumtS which represent insertions with respect to the reference sequence (i.e. not detectable by means of similarity searches on the reference genome), we analysed lists of pilot indels released by the 1000 Genomes Project. Blasting all reported indels of at least 15 bp in length against the reference sequence of human mtDNA (rCRS, NC_012920) allowed us to identify four additional candidate NumtS with insertion polymorphisms (HSA_NumtS_588, HSA_NumtS_589, HSA_NumtS_590, HSA_NumtS_591). Lastly, in order to compare our data with published allele frequencies, we also included in our analyses a previously reported insertion polymorphism (HSA_NumtS_587; Thomas et al. 1996).


Taken together, 14 out of the 53 human-speciWc NumtS (26%) were considered to be polymorphic for their presence/absence in various human populations, thus almost doubling the number of known human polymorphic NumtS reported for our species so far (Hazkani-Covo et al. 2010). Characteristics of human-speciWc NumtS Comparing the length of human-speciWc and non-humanspeciWc NumtS listed in the RHNumtS.2 compilation, and excluding repeat elements from the count, we found a statistically signiWcant diVerence in their mean lengths (73 and 203 bp, respectively; Wilcoxon-Mann–Whitney p value = 5.612e¡09). As expected, the mean similarity between NumtS and rCRS was found to be higher for human-speciWc than for non-human-speciWc NumtS (93.57 and 78.37%, respectively); considering only polymorphic NumtS, the mean similarity further increases to 97.94%. At variance with the reported preferential integration of NumtS within genes (Ricchetti et al. 2004), we did not Wnd a statistically signiWcant bias of NumtS insertions within coding sequences with respect to other genomic regions. As about 35% of human genomes are covered by reference protein-coding genes (Hou and Lin 2009), we could not observe preferential insertion of NumtS into protein-coding genes either by taking into account all the NumtS of the RHNumtS.2 compilation (212/585, 36.2%) or by considering only human-speciWc NumtS (24/53, 45%). The divergence of data concerning such insertions from literature reports may be due to the fact that not only NCBI Reference Sequence genes, but also predicted and hypothetical genes, were included in Ricchetti’s analysis (Ricchetti et al. 2004). Population structure in the 1000 Genomes Project data For nine of the selected polymorphic human-speciWc NumtS (Table 2), it was possible to calculate allele frequencies (Table 3) from sequence data produced by the 1000 Genomes Project and concerning 940 individuals belonging to 13 worldwide human populations (Table 1). 
The widest range of allele frequencies among populations was observed for HSA_NumtS_589, for which the frequency of the NumtS-carrying allele was generally high in all populations, with the exception of the African samples. By contrast, HSA_NumtS_543 showed the lowest frequency variability, with all populations showing relatively high frequencies for the allele containing this NumtS (Table 3). Pairwise genetic distances among the 13 surveyed populations (Supplementary Table S4) were computed from these allele frequencies and subsequently used to construct a neighbor-joining tree representing population relationships


Table 3 Allele frequencies of NumtS in 1000 Genomes Project data

Pop  N_009  N_050  N_133  N_426  N_460  N_522  N_543  N_589  N_591
CEU  0.367  0.653  0.737  0.593  0.993  0.900  0.963  0.780  0.393
GBR  0.491  0.587  0.722  0.567  0.993  0.830  0.942  0.830  0.432
FIN  0.432  0.537  0.674  0.431  1.000  0.810  0.943  0.840  0.457
TSI  0.549  0.599  0.714  0.538  1.000  0.870  0.950  0.799  0.405
YRI  0.669  0.616  0.774  0.703  0.985  0.890  0.931  0.372  0.180
ASW  0.535  0.586  0.754  0.610  0.984  0.780  0.878  0.587  0.231
LWK  0.608  0.609  0.774  0.674  0.941  0.540  0.857  0.442  0.278
CLM  0.385  0.692  0.649  0.440  1.000  0.685  0.892  0.769  0.434
MXL  0.475  0.667  0.701  0.387  1.000  0.555  0.982  0.778  0.510
PUR  0.531  0.622  0.708  0.443  1.000  0.790  0.948  0.802  0.377
JPT  0.389  0.723  0.564  0.460  0.916  0.500  0.958  0.840  0.472
CHB  0.298  0.691  0.566  0.427  0.966  0.620  0.917  0.905  0.410
CHS  0.260  0.645  0.498  0.552  0.910  0.550  0.901  0.820  0.306

N_xxx: NumtS code (HSA_NumtS_xxx). See Table 1 for population codes (Pop)

Fig. 1 Population genetics analyses on 1000 Genomes Project data. a Neighbor-joining tree representing population relationships among surveyed populations. b First two PCs of PCA on allele frequencies, with the percentage of variation explained by each PC. c Hierarchical clustering analysis on PCs computed for the same populations, with bootstrap support values. See Table 1 for population codes

(Fig. 1a). The expected tree topology was observed, reflecting the well-known clustering pattern of human groups previously obtained with several other molecular markers (Rosenberg et al. 2002; Falush et al. 2003; Ramachandran et al. 2005; Bastos-Rodrigues et al. 2006; Handley et al. 2007; Jakobsson et al. 2008), especially for the African and European populations, with which the Puerto Rican sample (PUR) clustered, as well as, to a lesser extent, for the East Asian and Latin American groups. Consistent with this evidence, PCA based on population allele frequencies identified a first PC, explaining 52% of total variance, which clearly distinguished African populations from those of the other continents. The second PC accounted for 30% of total variance and perfectly differentiated European and Puerto Rican samples from the others.


In addition, it highlighted an intermediate position of most Latin American groups between East Asians and Europeans and a considerable differentiation of Luhya and Han Chinese from Beijing with respect to other populations belonging to the same continents (Fig. 1b). Computed PCs were also used to perform hierarchical clustering analysis to depict population relationships in more detail. This approach demonstrated that the geographical structure of NumtS genetic variation can be appreciably detected at continental level, as four main clusters were distinguished (Fig. 1c). One cluster consisted of the three African populations, the second of the East Asian groups, the third of the European and Puerto Rican groups, and the fourth of the Colombian and Mexican samples.
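How population relationships emerge from the frequencies in Table 3 can be illustrated with a small sketch. It assumes plain Euclidean distance between population frequency vectors, which is a stand-in chosen for simplicity and not the genetic distance behind Supplementary Table S4; the function names are hypothetical.

```python
import math

# Allele frequencies from Table 3 (columns: N_009, N_050, N_133, N_426,
# N_460, N_522, N_543, N_589, N_591)
FREQ = {
    "CEU": [0.367, 0.653, 0.737, 0.593, 0.993, 0.900, 0.963, 0.780, 0.393],
    "GBR": [0.491, 0.587, 0.722, 0.567, 0.993, 0.830, 0.942, 0.830, 0.432],
    "FIN": [0.432, 0.537, 0.674, 0.431, 1.000, 0.810, 0.943, 0.840, 0.457],
    "TSI": [0.549, 0.599, 0.714, 0.538, 1.000, 0.870, 0.950, 0.799, 0.405],
    "YRI": [0.669, 0.616, 0.774, 0.703, 0.985, 0.890, 0.931, 0.372, 0.180],
    "ASW": [0.535, 0.586, 0.754, 0.610, 0.984, 0.780, 0.878, 0.587, 0.231],
    "LWK": [0.608, 0.609, 0.774, 0.674, 0.941, 0.540, 0.857, 0.442, 0.278],
    "CLM": [0.385, 0.692, 0.649, 0.440, 1.000, 0.685, 0.892, 0.769, 0.434],
    "MXL": [0.475, 0.667, 0.701, 0.387, 1.000, 0.555, 0.982, 0.778, 0.510],
    "PUR": [0.531, 0.622, 0.708, 0.443, 1.000, 0.790, 0.948, 0.802, 0.377],
    "JPT": [0.389, 0.723, 0.564, 0.460, 0.916, 0.500, 0.958, 0.840, 0.472],
    "CHB": [0.298, 0.691, 0.566, 0.427, 0.966, 0.620, 0.917, 0.905, 0.410],
    "CHS": [0.260, 0.645, 0.498, 0.552, 0.910, 0.550, 0.901, 0.820, 0.306],
}


def distance(pop_a, pop_b):
    """Euclidean distance between two populations' frequency vectors
    (an illustrative assumption, not the distance used in the paper)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(FREQ[pop_a], FREQ[pop_b])))


def nearest(pop):
    """Population with the smallest distance to `pop`."""
    return min((other for other in FREQ if other != pop),
               key=lambda other: distance(pop, other))


print(nearest("PUR"))
```

Under this simple measure the closest population to PUR is a European one (TSI), in line with the clustering of the Puerto Rican sample with Europeans described above.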


Fig. 2 Population genetics analyses on validation panel. a First two PCs of PCA on 56 individuals of the validation panel after exclusion of four outliers. Samples are classified according to affiliation to main geographical areas (SSA, MDE, CAS, EUR, SAM). Fractions of variance explained by each PC are shown on the X- and Y-axes. b Gel electrophoresis of PCR-amplified fragments relative to HSA_NumtS_588 in nine individuals belonging to two different populations (MDE and SAM). Homozygous samples with and without NumtS and heterozygous samples are shown. Note differences in allele frequencies between the two populations

Polymorphic variation and genetic diversity in the validation panel

PCR validation of the whole set of identified polymorphic human-specific NumtS (Table 2; Fig. 2b) was performed on a panel of 60 unrelated healthy individuals belonging to five native ethnic groups worldwide, which are representative of at least a fraction of the ancestral genetic variability characterizing the following geographical areas: Sub-Saharan Africa (SSA), Middle East (MDE), Central Asia (CAS), Europe (EUR) and South America (SAM). Summary statistics of genetic diversity for the above groups are listed in Table 4 and allele frequencies for each of the surveyed NumtS are reported in Table 5. Samples from MDE and SSA showed the highest heterozygosity per locus (0.349 ± 0.195 and 0.342 ± 0.191, respectively) and the SAM sample the lowest (0.201 ± 0.120), while the Eurasian populations (EUR, CAS) lay in an intermediate position, in accordance with

Table 4 Summary statistics for 14 PCR-validated human-specific NumtS

Population  N   S   Average gene diversity over loci ± SD
SSA         24  12  0.342 ± 0.191
MDE         26  13  0.349 ± 0.195
CAS         22  12  0.260 ± 0.151
EUR         24  13  0.321 ± 0.183
SAM         24  9   0.201 ± 0.120

Table lists number of chromosomes (N) and polymorphic sites (S), as well as average gene diversity over loci and relative standard deviations

known worldwide patterns of genetic diversity reported in the literature (Li et al. 2008; Tishkoff et al. 2009) and explained by the several bottlenecks undergone by H. sapiens populations during colonization of the Eurasian and American continents after the Out of Africa migration (Stoneking 2008).
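For a presence/absence locus, the per-locus quantity behind such diversity estimates can be sketched as follows. This is a minimal illustration assuming the standard unbiased estimator of gene diversity for a biallelic locus, H = n/(n - 1) · (1 - p² - q²), where n is the number of sampled chromosomes, p the frequency of the NumtS-carrying allele and q = 1 - p. The function names are hypothetical, and the averages in Table 4 were produced by the authors' software, which may treat missing data differently, so the sketch is not expected to reproduce them exactly.

```python
def gene_diversity(p, n):
    """Unbiased gene diversity (expected heterozygosity) at a biallelic locus.

    p: frequency of the NumtS-carrying allele
    n: number of sampled chromosomes (must be at least 2)
    """
    if n < 2:
        raise ValueError("at least two chromosomes are required")
    q = 1.0 - p
    return n / (n - 1) * (1.0 - p ** 2 - q ** 2)


def mean_gene_diversity(freqs, n):
    """Average of the per-locus gene diversity over a list of loci."""
    return sum(gene_diversity(p, n) for p in freqs) / len(freqs)


# A locus at intermediate frequency is maximally diverse; a fixed one is not.
print(round(gene_diversity(0.5, 24), 4))
print(gene_diversity(1.0, 24))
```

A monomorphic locus (p = 0 or p = 1) contributes zero, which is why samples with many fixed or absent NumtS show a lower average.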


Table 5 Allele frequencies of 14 PCR-validated human-specific NumtS in validation panel

Pop  N_009  N_050  N_133  N_214  N_426  N_460  N_522  N_543  N_586  N_587  N_588  N_589  N_590  N_591
SSA  0.333  0.545  0.500  0.333  0.583  1.000  0.208  0.708  0.500  0.542  0.250  0.792  0.000  0.208
MDE  0.269  0.545  0.423  0.500  0.346  0.962  0.192  0.808  0.346  0.538  0.500  0.846  0.000  0.654
CAS  0.136  0.455  0.409  0.318  0.227  0.864  0.000  0.955  0.050  0.773  0.273  1.000  0.050  0.500
EUR  0.083  0.778  0.583  0.500  0.083  1.000  0.583  0.792  0.500  0.583  0.458  0.917  0.042  0.417
SAM  0.083  0.583  0.458  0.375  0.000  0.750  0.000  0.958  0.000  0.792  0.125  1.000  0.000  0.833

N_xxx: NumtS code (HSA_NumtS_xxx)

Partially in contrast with this portrait of diversity, the largest proportion of polymorphic NumtS was observed not in the Sub-Saharan African sample (80%), but in the Middle Eastern and European ones (87%). However, the spectra of expected heterozygosity (Supplementary Figure S2) clearly confirmed that the Sub-Saharan African and Middle Eastern samples showed the largest proportion of loci with remarkably high heterozygosity values, which progressively decreased as the distance from the African continent increased (Supplementary Table S5). With respect to NumtS shared among continental groups, all NumtS were polymorphic in at least two groups: 93% of loci had a minor allele frequency >0 in at least three groups, 67% in at least four groups and 57% in all continents. In more detail, HSA_NumtS_426, HSA_NumtS_460 and HSA_NumtS_586 were monomorphic in only one group (SAM), HSA_NumtS_589 and HSA_NumtS_522 were monomorphic in two (CAS and SAM) and HSA_NumtS_590 in three continental groups (SSA, MDE and SAM), where it was completely absent. The last two loci showed the lowest values of observed heterozygosity averaged over all continental samples (HSA_NumtS_522: 0.109 ± 0.050 and HSA_NumtS_590: 0.092 ± 0.012) and HSA_NumtS_113 the highest (0.557 ± 0.217). All polymorphic NumtS satisfied Hardy–Weinberg expectations (HWE) in all samples after correction for multiple tests (Supplementary Table S5).

One out of two sites presenting tandem duplications of NumtS (HSA_NumtS_116-129; Hazkani-Covo and Graur 2007) could also be PCR amplified in our validation panel, and inference of the number of NumtS repeats by gel electrophoresis confirmed that the locus was polymorphic. Alleles with seven, 13, 14 and 15 repeats were detected. Interestingly, the chimpanzee genome tested during the validation process of human-specific NumtS revealed two different alleles at this locus, with two and three repeats, respectively.
Hence, duplication of this NumtS may have already started before the human-chimpanzee split, leading, however, to much higher amplification in humans.

Comparison of our allele frequency data with published data for HSA_NumtS_587 (Thomas et al. 1996)


revealed good matches for European and South American samples, whereas our African and Central Asian samples showed slightly higher frequencies of NumtS insertion. No comparable data for samples from the Middle East were reported by Thomas et al. (1996). Ricchetti’s group reported presence/absence data for seven human-specific NumtS in 21 individuals of different geographic origin (Ricchetti et al. 2004). Unfortunately, such data were too few to allow a meaningful comparison with our allele frequencies, but the trend toward high or low prevalence of NumtS within individuals of a given population was consistent with the data presented here.

Population structure in the validation panel

Genetic distances among the examined continental groups were measured with the fixation index (FST) as an estimate of their allele frequency differentiation. The resulting genetic distance matrix is shown in Supplementary Figure S3. The largest and most significant FST values were observed for pairwise population comparisons involving the South American sample (FST = 0.222, p < 0.001 when compared with SSA, and FST = 0.188, p < 0.001 when compared with EUR); the smallest, non-significant distance was found between the Central Asian and South American samples (FST = 0.023, p = 0.09). To further investigate whether the signals of geographical structure observed for NumtS variability in the 1000 Genomes Project populations were confirmed in our validation dataset, we performed a PCA at individual level (Supplementary Figure S4), retaining the first two PCs displayed in Fig. 2a. The first PC accounted for 19% of total variance and mainly differentiated South American and Central Asian individuals from the great majority of samples belonging to the other continental groups. The second PC, accounting for 14% of total variance, tended to separate Sub-Saharan Africans from most of the European, South American and Central Asian individuals.
Only subjects from the Middle East seemed to be equally scattered along the two components, showing much higher among-individuals variance in comparison with the other continental groups.
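The idea behind FST as a measure of allele frequency differentiation, used at the start of this section, can be sketched for a single biallelic locus and two equally weighted populations. The sketch uses the textbook ratio (HT - HS)/HT, an assumption made purely for illustration; the values reported above were estimated over all loci with dedicated population genetics software and tested for significance, which this sketch does not attempt. Function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst_two_pops(p1, p2):
    """Single-locus FST between two equally weighted populations.

    HS is the mean within-population heterozygosity and HT the
    heterozygosity of the pooled population (mean allele frequency).
    """
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    h_t = heterozygosity((p1 + p2) / 2.0)
    if h_t == 0.0:
        return 0.0  # locus fixed for the same allele in both populations
    return (h_t - h_s) / h_t


# Identical frequencies give no differentiation; opposite fixation gives 1.
print(fst_two_pops(0.3, 0.3), fst_two_pops(0.0, 1.0))
```

The larger the frequency difference between two samples at a locus, the closer this ratio gets to 1.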


Table 6 Analysis of molecular variance (AMOVA)

                 Percentages of genetic variance                            Fixation indices
Classifications  Among pop/groups  Among pop within groups  Within pop     FCT    FSC       FST
K=1              9.48              -                        90.52          -      -         0.095***
K=2              5.47              6.96                     87.56          0.055  0.074***  0.124***
K=3              5.12              5.82                     89.06          0.051  0.061**   0.109***
K=4              5.94              4.10                     89.96          0.059  0.043     0.100***

Apportionment of genetic variance among five geographically based groups investigated by calculating among-groups component of variance (FCT), differentiation among populations within groups (FSC) and differentiation among worldwide populations (FST). The following population groups (K) were considered: 1 = all samples in a single group; 2 = South America, all remaining samples; 3 = South America, Central Asia, all remaining samples; 4 = South America, Central Asia, Europe, all remaining samples
*** p < 0.001; ** p < 0.01

Statistical support for the observed geographical structure of NumtS genetic diversity in our validation panel was also provided by analysis of molecular variance (AMOVA) results (Table 6), in which the majority of variance was accounted for by within-population differences (90.52%) and differentiation among worldwide populations was relatively low (FST = 0.095, p < 0.001). In addition, when population samples were progressively split according to geography, the among-groups component of the variance (FCT values) turned out to be very low and was never significant. However, across the sequential increase of population clusters in the AMOVA analyses, the greatest among-groups percentage of variance (5.94%) was observed when South America, Central Asia and Europe were considered as single entities, although with a low and non-significant among-groups fixation index (FCT = 0.059, p = 0.112). In this grouping, Sub-Saharan Africans and individuals from the Middle East were grouped together and a low and non-significant level of differentiation among populations within the same group (FSC = 0.043, p < 0.104) was obtained.
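The link between the variance percentages and the fixation indices in Table 6 can be made explicit with a short sketch. It assumes the standard hierarchical AMOVA definitions (Excoffier et al. 1992): FCT is the among-groups share of total variance, FSC the among-populations-within-groups share of the variance found within groups, and FST the share of total variance not found within populations. The function name is hypothetical, and the inputs are the rounded percentages from Table 6, so results agree with the table only to within rounding.

```python
def fixation_indices(among_groups, among_pops_within_groups, within_pops):
    """Fixation indices from the three AMOVA variance components.

    The components may be given as variances or as percentages of the
    total; only their ratios matter.
    """
    total = among_groups + among_pops_within_groups + within_pops
    f_ct = among_groups / total
    f_sc = among_pops_within_groups / (among_pops_within_groups + within_pops)
    f_st = (among_groups + among_pops_within_groups) / total
    return f_ct, f_sc, f_st


# Percentages of variance reported in Table 6 for K=4
f_ct, f_sc, f_st = fixation_indices(5.94, 4.10, 89.96)
print(round(f_ct, 3), round(f_sc, 3), round(f_st, 3))
```

The three indices are tied together by (1 - FST) = (1 - FSC)(1 - FCT), which offers a quick consistency check on any row of the table.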

Discussion

The recent implementation of human NumtS tracks on the hg18 build of the UCSC genome browser is very useful in evaluating the various properties of NumtS in the human genome. We report here the results of applying the NumtS track to, among others, the detection of NumtS involved in duplication events, the recognition of NumtS lying within protein-coding genes and, most importantly, the identification of human-specific NumtS by comparison with other primate genomes. For some NumtS, no syntenic region is annotated in P. troglodytes or other primate species. Hazkani-Covo (2009) proposed a classification of NumtS loci based on comparisons of various genomes in four categories: presence,

absence, other and unknown. According to this classification, 43 of our human-specific NumtS belong to the category “absence” in the chimpanzee genome, and ten fall into the categories “other” and “unknown”, thus explaining the absence of PCR products for these NumtS regions on chimpanzee DNA. Since deletions of NumtS after the human-chimpanzee split have been reported to be very unlikely (Bensasson et al. 2003), we did not expect deletions of NumtS in H. sapiens with respect to P. troglodytes, and this was indeed confirmed by in silico analyses of NumtS insertion sites on outgroup primate genomes. According to our data, human-specific NumtS tend to be short and are very similar to rCRS. The experimental design adopted for NumtS discovery, with rCRS as target sequence, may partly explain why short ancestral NumtS are not recognized and listed in the RHNumtS.2 compilation, but another underlying biological reason cannot be excluded. Although some NumtS involved in segmental duplications also appear to be derived from multiple original insertions, they are expected to be ancestral and hence do not always align with modern human mtDNA in a single fragment after generation of the NumtS compilation, or simply span the D-loop region, thus being separated into two fragments. Taken together, our data strongly support the hypothesis that NumtS are randomly inserted into the human genome and highlight their similar properties to repeat elements. Evaluation of the procedure applied for the identification of polymorphic human-specific NumtS has to take into account that the human reference sequence used (build hg18) is a “consensus sequence” of six individuals from different geographic areas and thus does not include variants with low frequency.
As already observed for SNPs, many known genetic variants are indeed shared between most populations, whereas novel low-frequency variants, found by analyses of the pilot 1000 Genomes Project data, are typically specific to a few geographically distinct groups


(The 1000 Genomes Project Consortium 2010). Therefore, we do not expect to find, within the reference human genome sequence, further human-specific NumtS of recent insertion that are continent specific, restricted to a few geographic areas or present at low allele frequency in populations. As a result, the compilation presented here aims to be a comprehensive list of reference human-specific NumtS. Insertion of the majority of analysed human-specific NumtS very probably occurred in ancestral African groups before modern humans migrated out of Africa and thus before the several bottlenecks undergone by H. sapiens populations during colonization of the Eurasian and American continents, as already reported for various human NumtS (Bensasson et al. 2003; Ricchetti et al. 2004; Ovchinnikov and Kholina 2010). Some human NumtS have been shown to be highly similar to the Neanderthal mtDNA, and this has been explained by their integration into the nuclear genome of a common ancestor of Neanderthals and modern humans (Ovchinnikov and Kholina 2010). By contrast, to the best of our knowledge, NumtS within the Neanderthal and Denisova nuclear genomes have never been examined. Our attempt to inspect public Neanderthal and Denisova sequence reads mapped to the human genome at insertion sites of human-specific NumtS showed that read depth and quality values are too low to allow a thorough investigation of NumtS presence/absence (http://genome.ucsc.edu; Briggs et al. 2009a, b; Green et al. 2010; Reich et al. 2010). However, for a few NumtS, preliminary exploration of these data did allow us to deduce their presence/absence based on reads spanning both flanking regions and NumtS. All non-polymorphic human-specific NumtS for which data could be examined (16 NumtS for Neanderthal, 25 for Denisova) were indeed present in the Neanderthal and Denisova genomes, whereas polymorphic NumtS were only partially present.
In fact, two out of six NumtS for which information could be retrieved were present in the Neanderthal genome and three out of seven NumtS with available data were present in the Denisova genome, whereas the remaining NumtS that could be examined were absent from these genomes. Moreover, for four NumtS data could be found for both species, and these turned out to match each other. These data further corroborate the hypothesis that NumtS generation is an ongoing process during human evolution. If NumtS are not subjected to selective pressure, observed differences in the prevalence of polymorphic NumtS within human populations may indicate that they were inserted at different time points during human evolution. The origin of a given NumtS may sometimes be inferred from its mtDNA haplogroup of origin, although this is often impossible, as most polymorphic NumtS are short (29–539 bp; Bensasson et al. 2003). Nevertheless, a


preliminary attempt was made to assign NumtS to haplogroups, showing that, out of five NumtS for which a haplogroup could be assigned, three (HSA_NumtS_133, HSA_NumtS_214, HSA_NumtS_460) had variant sites that supported the hypothesis of their insertion within the African continent. Of the remaining two, HSA_NumtS_587 could be assigned to the Western Eurasian mtDNA haplogroup R, while HSA_NumtS_590, absent from the reference human genome, showed a base change which defined the Australian N13 haplogroup. Conversely, for HSA_NumtS_522, the sensitivity of the analysis was greatly reduced since sites defining several haplogroups were detected (Supplementary Table S6; Rubino et al. 2011). These preliminary results, together with the allele frequency data of HSA_NumtS_050, HSA_NumtS_587 and HSA_NumtS_590 within the different population groups (Table 5), support the hypothesis that these NumtS do represent examples of mtDNA sequences inserted into the nuclear genome after “Out of Africa”. For HSA_NumtS_589, no information could be obtained on the haplogroup of origin, but its allele frequencies suggest that the insertion event took place in a non-African population. Altogether, these data emphasize the usefulness of NumtS properties in investigating the relationship between nuclear and mitochondrial genomes. According to estimates of the NumtS insertion rate calculated in previous works (Hazkani-Covo 2009), new polymorphic population-specific NumtS are expected to be found. Accordingly, by inspecting pilot indel releases of the 1000 Genomes Project, we were able to include four previously unreported insertion polymorphisms with high sequence similarity to rCRS in our analyses. To further increase this number, high coverage sequence data for individuals from different continents are needed. In any case, NumtS identified from shotgun-sequenced genomes will always need to be empirically verified, as carried out in the present study (Venkatesh et al. 2006).
Unfortunately, we did not have the opportunity to characterize the examined human-specific NumtS on a validation panel already typed for several other genetic markers, so that our data did not allow a direct comparison between population structures observed with different types of loci. Nevertheless, we show that a relatively small number of polymorphic loci of mtDNA insertions into the nuclear genome are sufficient to provide a clear-cut picture of the genetic landscape of worldwide human ethnic groups, in contrast to the higher number of autosomal microsatellites, indels, CNVs, and especially SNPs, needed to appreciate a comparable, or at any rate not much higher, level of population differentiation (Rosenberg et al. 2002; Bastos-Rodrigues et al. 2006; Jakobsson et al. 2008; Li et al. 2008; Novembre et al. 2008; Coop et al. 2009; Kersbergen et al. 2009; Lopez Herraez et al. 2009). Accordingly, the reasonable strength of


NumtS in detecting human population structure may be of particular interest for laboratories that cannot produce dense genome-wide data for the characterization of specific population samples. Moreover, with respect to other polymorphic indels (e.g., repeat elements, short tandem repeats), NumtS also have the property of being free of molecular homoplasy and are easily identified, as their sequence is homologous to mtDNA. Exploration of population relationships according to the allele frequency variability of only nine NumtS, for which highly reliable information was retrieved from the 1000 Genomes Project data, indeed drew a picture which closely matched the well-known pattern of genetic affinities/differentiations among human groups already described with more extensive panels of uniparental and autosomal loci (Rosenberg et al. 2002; Falush et al. 2003; Ramachandran et al. 2005; Bastos-Rodrigues et al. 2006; Jakobsson et al. 2008). The only exception was represented by the Puerto Rican sample, which exhibited a very small genetic distance with respect to the bulk of European populations, probably as a consequence of a high percentage of admixed individuals with a component of European ancestry. By contrast, the other Latin American populations occupied an intermediate position between East Asians and Europeans, as expected on the basis of both Native Americans’ origin from East and North Asia and recent gene flow from Europe (Bryc et al. 2010). To experimentally validate the whole set of human-specific NumtS which were seen to be polymorphic in various H. sapiens populations according to in silico analyses, a panel of unrelated individuals belonging to five native ethnic groups worldwide, and characterized by potentially reduced internal variation, was PCR amplified for the loci in question.
Whereas the low coverage sequence data of the 1000 Genomes Project did not allow us to infer “genotype” data for single subjects, so that only NumtS allele frequencies in populations could be computed, the use of a validation panel enabled us to perform population genetic analyses on the basis of presence/absence data from single individuals. This led to the acquisition of additional results, such as the goodness of fit to HWE for the examined NumtS and a detailed portrait of intra-population genetic diversity for each of the investigated geographically based samples. Accordingly, analyses on the validation panel allowed us to confirm the heterogeneous composition of all candidate human-specific polymorphic NumtS, as well as variation comparable with expected genetic variability patterns (Tishkoff and Verrelli 2003; Li et al. 2008; Tishkoff et al. 2009). Both PCA at individual level and AMOVA highlight substantial within-groups variation with respect to among-groups variation for African, European and, especially, Middle Eastern samples. The analysed locus with tandem duplications of


NumtS also confirmed the polymorphic character of these genomic elements. In conclusion, we succeeded in depicting the landscape of variation of a series of NumtS in several human ethnic groups, confirming their usefulness as markers in the field of human population genetics. Exploration of human-specific polymorphic NumtS variability by means of whole genome sequence data thus provides a completely new tool for tracing human population relationships, potentially enabling deeper investigation of the micro-evolutionary processes which occurred during the differentiation of H. sapiens populations, by exploiting the invaluable peculiarity of NumtS to link mitochondrial and nuclear genomes.

Acknowledgments This study was partly supported by the Italian Ministry of University and Research (MIUR) grant FIRB ‘Futuro in Ricerca’ J31J10000040001 to G.G. and contributions from Prof. Herawati Sudoyo of the Eijkman Institute of Molecular Biology, Jakarta (Indonesia) to G.G. and from the “Fondo di Ateneo” (University of Bari) to M.A.

References Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R (2010) Dindel: accurate indel calls from short-read data. Genome Res 21(6):961–973 Altschul SF, Madden TL, SchaVer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402 Auton A, Bryc K, Boyko AR, Lohmueller KE, Novembre J, Reynolds A, Indap A, Wright MH, Degenhardt JD, Gutenkunst RN, King KS, Nelson MR, Bustamante CD (2009) Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res 19(5):795–803 Baldo L, de Queiroz A, Hedin M, Hayashi CY, Gatesy J (2011) Nuclear-mitochondrial sequences as witnesses of past interbreeding and population diversity in the jumping bristletail Mesomachilis. Mol Biol Evol 28(1):195–210 Bastos-Rodrigues L, Pimenta JR, Pena SD (2006) The genetic structure of human populations studied through short insertion-deletion polymorphisms. Ann Hum Genet 70(Pt 5):658–665 Bensasson D, Feldman MW, Petrov DA (2003) Rates of DNA duplication and mitochondrial DNA insertion in the human genome. J Mol Evol 57(3):343–354 Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19 10 11-21 Briggs AW, Good JM, Green RE, Krause J, Maricic T, Stenzel U, Lalueza-Fox C, Rudan P, Brajkovic D, Kucan Z, Gusic I, Schmitz R, Doronichev VB, Golovanova LV, de la Rasilla M et al (2009a) Targeted retrieval and analysis of Wve Neandertal mtDNA genomes. Science 325(5938):318–321 Briggs AW, Stenzel U, Meyer M, Krause J, Kircher M, Paabo S (2009b) Removal of deaminated cytosines and detection of in vivo methylation in ancient DNA. 
Nucleic Acids Res 38(6):e87 Bryc K, Velez C, Karafet T, Moreno-Estrada A, Reynolds A, Auton A, Hammer M, Bustamante CD, Ostrer H (2010) Colloquium paper: genome-wide patterns of population structure and admixture

123

770 among Hispanic/Latino populations. Proc Natl Acad Sci USA 107(2):8954–8961 Chen JM, Chuzhanova N, Stenson PD, Ferec C, Cooper DN (2005) Meta-analysis of gross insertions causing human genetic disease: novel mutational mechanisms and the role of replication slippage. Hum Mutat 25(2):207–221 Cockerham CC, Weir BS (1984) Covariances of relatives stemming from a population undergoing mixed self and random mating. Biometrics 40(1):157–164 Coop G, Pickrell JK, Novembre J, Kudaravalli S, Li J, Absher D, Myers RM, Cavalli-Sforza LL, Feldman MW, Pritchard JK (2009) The role of geography in human adaptation. PLoS Genet 5(6):e1000500 ExcoYer L, Lischer HE (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10(3):564–567 ExcoYer L, Smouse PE, Quattro JM (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131:479–491 Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164(4):1567–1587 Gherman A, Chen PE, Teslovich TM, Stankiewicz P, Withers M, Kashuk CS, Chakravarti A, Lupski JR, Cutler DJ, Katsanis N (2007) Population bottlenecks as a potential major shaping force of human genome architecture. PLoS Genet 3(7):e119 Goecks J, Nekrutenko A, Taylor J (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86 Goldin E, Stahl S, Cooney AM, Kaneski CR, Gupta S, Brady RO, Ellis JR, SchiVmann R (2004) Transfer of a mitochondrial DNA fragment to MCOLN1 causes an inherited case of mucolipidosis IV. 
Hum Mutat 24(6):460–465
Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH, Hansen NF, Durand EY, Malaspinas AS, Jensen JD, Marques-Bonet T et al (2010) A draft sequence of the Neandertal genome. Science 328(5979):710–722
Handley LJ, Manica A, Goudet J, Balloux F (2007) Going the distance: human population genetics in a clinal world. Trends Genet 23(9):432–439
Hao K, Chudin E, Greenawalt D, Schadt EE (2010) Magnitude of stratification in human populations and impacts on genome wide association studies. PLoS One 5(1):e8695
Hazkani-Covo E (2009) Mitochondrial insertions into primate nuclear genomes suggest the use of numts as a tool for phylogeny. Mol Biol Evol 26(10):2175–2179
Hazkani-Covo E, Covo S (2008) Numt-mediated double-strand break repair mitigates deletions during primate genome evolution. PLoS Genet 4(10):e1000237
Hazkani-Covo E, Graur D (2007) A comparative analysis of numt evolution in human and chimpanzee. Mol Biol Evol 24(1):13–18
Hazkani-Covo E, Zeller RM, Martin W (2010) Molecular poltergeists: mitochondrial DNA copies (numts) in sequenced nuclear genomes. PLoS Genet 6(2):e1000834
Hou Y, Lin S (2009) Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS One 4:e6978
Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, Krauss RM, Myers RM, Ridker PM, Chasman DI, Mefford H, Ying P, Nickerson DA, Eichler EE (2009) Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet 84(2):148–161
Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J et al (2008) Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451(7181):998–1003
Jensen-Seaman MI, Wildschutte JH, Soto-Calderon ID, Anthony NM (2009) A comparative approach shows differences in patterns of numt insertion during hominoid evolution. J Mol Evol 68(6):688–699
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ (2004) The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32(Database issue):D493–D496
Kent WJ (2002) BLAT—the BLAST-like alignment tool. Genome Res 12(4):656–664
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006
Kersbergen P, van Duijn K, Kloosterman AD, den Dunnen JT, Kayser M, de Knijff P (2009) Developing a set of ancestry-sensitive DNA markers reflecting continental origins of humans. BMC Genet 10:69
Lascaro D, Castellana S, Gasparre G, Romeo G, Saccone C, Attimonelli M (2008) The RHNumtS compilation: features and bioinformatics approaches to locate and quantify Human NumtS. BMC Genomics 9:267
Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319(5866):1100–1104
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079
Lopez Herraez D, Bauchet M, Tang K, Theunert C, Pugach I, Li J, Nandineni MR, Gross A, Scholz M, Stoneking M (2009) Genetic variation and recent positive selection in worldwide human populations: evidence from nearly 1 million SNPs. PLoS One 4(11):e7888
Mishmar D, Ruiz-Pesini E, Brandon M, Wallace DC (2004) Mitochondrial DNA-like sequences in the nucleus (NUMTs): insights into our African origins and the mechanism of foreign DNA integration. Hum Mutat 23(2):125–133
Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD (2008) Genes mirror geography within Europe. Nature 456(7218):98–101
Ovchinnikov IV, Kholina OI (2010) Genome digging: insight into the mitochondrial genome of Homo. PLoS One 5(12):e14278
Pakendorf B, Stoneking M (2005) Mitochondrial DNA and human evolution. Annu Rev Genomics Hum Genet 6:165–183
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572
Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL (2005) Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci USA 102:15942–15947
Reich D, Green RE, Kircher M, Krause J, Patterson N, Durand EY, Viola B, Briggs AW, Stenzel U, Johnson PL, Maricic T, Good JM, Marques-Bonet T, Alkan C, Fu Q et al (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468(7327):1053–1060
Reynolds J, Weir BS, Cockerham CC (1983) Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105(3):767–779
Ricchetti M, Fairhead C, Dujon B (1999) Mitochondrial DNA repairs double-strand breaks in yeast chromosomes. Nature 402(6757):96–100

Ricchetti M, Tekaia F, Dujon B (2004) Continued colonization of the human genome by mitochondrial DNA. PLoS Biol 2(9):E273
Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP (2011) Integrative genomics viewer. Nat Biotechnol 29:24–26
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic structure of human populations. Science 298(5602):2381–2385
Rubino F, Piredda R, Calabrese FM, Simone D, Lang M, Calabrese C, Petruzzella V, Tommaseo-Ponzetta M, Gasparre G, Attimonelli M (2011) HmtDB, a genomic resource for mitochondrion-based human variability studies. Nucleic Acids Res. doi:10.1093/nar/gkr1086
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
Schmitz J, Piskurek O, Zischler H (2005) Forty million years of independent evolution: a mitochondrial gene and its corresponding nuclear pseudogene in primates. J Mol Evol 61(1):1–11
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
Simone D, Calabrese FM, Lang M, Gasparre G, Attimonelli M (2011) The reference human nuclear mitochondrial sequences compilation validated and implemented on the UCSC genome browser. BMC Genomics 12(1):517
Stoneking M (2008) Human origins. The molecular perspective. EMBO Rep 9(Suppl 1):S46–S50
Suzuki R, Shimodaira H (2006) Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22:1540–1542

The 1000 Genomes Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319):1061–1073
Thomas R, Zischler H, Paabo S, Stoneking M (1996) Novel mitochondrial DNA insertion polymorphism and its usefulness for human population studies. Hum Biol 68(6):847–854
Tishkoff SA, Verrelli BC (2003) Patterns of human genetic diversity: implications for human evolutionary history and disease. Annu Rev Genomics Hum Genet 4:293–340
Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, Froment A, Hirbo JB, Awomoyi AA, Bodo JM, Doumbo O, Ibrahim M, Juma AT, Kotze MJ, Lema G, Moore JH et al (2009) The genetic structure and history of Africans and African Americans. Science 324(5930):1035–1044
Tourmen Y, Baris O, Dessen P, Jacques C, Malthiery Y, Reynier P (2002) Structure and chromosomal distribution of human mitochondrial pseudogenes. Genomics 80(1):71–77
Turner C, Killoran C, Thomas NS, Rosenberg M, Chuzhanova NA, Johnston J, Kemel Y, Cooper DN, Biesecker LG (2003) Human genetic disease caused by de novo mitochondrial-nuclear DNA transfer. Hum Genet 112(3):303–309
Venkatesh B, Dandona N, Brenner S (2006) Fugu genome does not contain mitochondrial pseudogenes. Genomics 87(2):307–310
Willett-Brozick JE, Savul SA, Richey LE, Baysal BE (2001) Germ line insertion of mtDNA at the breakpoint junction of a reciprocal constitutional translocation. Hum Genet 109(2):216–223
Zischler H (2000) Nuclear integrations of mitochondrial DNA in primates: inference of associated mutational events. Electrophoresis 21(3):531–536
