© 2005 Nature Publishing Group http://www.nature.com/naturegenetics
LETTERS
Complex haplotypes, copy number polymorphisms and coding variation in two recently divergent mouse strains
David J Adams1,4, Emmanouil T Dermitzakis1,2,4, Tony Cox1, James Smith1, Rob Davies1, Ruby Banerjee1, James Bonfield1, James C Mullikin3, Yeun Jun Chung1, Jane Rogers1 & Allan Bradley1
Inbred mouse strains provide the foundation for mouse genetics. By selecting for phenotypic features of interest, inbreeding drives genomic evolution and eliminates individual variation, while fixing certain sets of alleles that are responsible for the trait characteristics of the strain. Mouse strains 129Sv (129S5) and C57BL/6J, two of the most widely used inbred lines, diverged from common ancestors within the last century1–5, yet very little is known about the genomic differences between them. By comparative genomic hybridization and sequence analysis of 129S5 short insert libraries, we identified substantial structural variation, a complex fine-scale haplotype pattern with a continuous distribution of diversity blocks, and extensive nucleotide variation, including nonsynonymous coding SNPs and stop codons. Collectively, these genomic changes denote the level and direction of allele fixation that has occurred during inbreeding and provide a basis for defining what makes these mouse strains unique.
Most laboratory strains of mice originate from a limited number of founders from Mus musculus subspecies musculus and domesticus1–5. The inbred strain C57BL/6J was developed by C.C. Little from stock provided by A. Lathrop2–5. C57BL/6J mice are characterized by longevity; a relatively low incidence of tumors; susceptibility to diet-induced obesity, type 2 diabetes and atherosclerosis; and a preference for alcohol and morphine4. The 129Sv substrain, from which strain 129S5 originates, was developed as a tumor-prone strain by L. Stevens6 and is now widely used because it is the origin of the most reliable embryonic stem cell lines7.
Embryonic stem cell technology uses both mouse strains; mutations generated in 129Sv-derived embryonic stem cells are usually analyzed in hybrid (129Sv × C57BL/6J) F1 and F2 mice8. Phenotypic variation and variable penetrance of traits in these F2 mice reflect genetic variation fixed in these strains during their establishment. To study the nature of this genetic variation, we compared the genomes of strains 129S5 and C57BL/6J by comparative genomic hybridization (CGH) and end-sequence profiling of short insert libraries.
Gene duplication may allow an organism to evolve by modifying the duplicated gene or its regulatory elements9. Another theory suggests that subfunctionalization of two duplicates with reciprocal decay of their regulatory regions or protein domains ultimately confers a new gene function10–12. We carried out CGH analysis using 129S5 DNA competitively hybridized with C57BL/6J DNA to RPCI-23 (C57BL/6J) BAC clones spotted at 1-Mb resolution13 (Fig. 1 and Supplementary Table 1 online). Most of the genome was silent, suggesting that large-scale copy-number changes occurred infrequently, or have been selected against, during the divergence of strains C57BL/6J and 129S5. We did, however, detect several large-scale genomic changes, for example, a large duplicated region in strain 129S5 corresponding to 12.3–14.2 Mb on chromosome 14 and a region of copy-number gain in strain C57BL/6J at 23 Mb on chromosome 7 (Fig. 1). These data are consistent with a recent report of segmental polymorphisms between mouse strains14. The duplication on chromosome 14 seems to contain a few large genes, including those encoding topoisomerase IIβ and the developmental regulator RARβ, extending over 500 kb, identifying a few gene duplicates that may be undergoing divergent evolution. We applied the Rosetta error model15 to these CGH data and obtained a genome-wide profile of copy-number gains and losses (Supplementary Table 1 online). In total, 112 genomic regions (corresponding to 130 of 2,803 BAC clones; 4.6%) showed statistically significant variation in copy number.
To obtain a finer-scale picture of genome diversity, we double-end-sequenced more than 150,000 129S5 short insert genomic clones and mapped the resulting reads to the C57BL/6J genome (NCBIm30), using SSAHA-2 (refs. 16–18). We mapped end reads independently and then used read-pair information to localize additional reads. We mapped 257,013 reads (187,920 paired reads and 69,093 single reads) to unequivocal genomic locations.
We estimated the total unique (nonredundant) C57BL/6J genome coverage to be 4.7%. We could not map 50,504 end reads; 22,067 of these fell into assembly gaps and could be mapped to NCBIm33. Manual analysis of a representative sample of the remaining unmapped reads showed that most were of lower quality (77.5%) or were repeats, such as long terminal repeats (7.5%). In addition, 10% of the reads seemed to
1The Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK. 2Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland. 3National Human Genome Research Institute, Bethesda, Maryland 20892-8004, USA. 4These authors contributed equally to this work. Correspondence should be addressed to A.B.
Published online 24 April 2005; doi:10.1038/ng1551
NATURE GENETICS, VOLUME 37, NUMBER 5, MAY 2005
[Figure 1 plot: log2 ratio (129S5 versus C57BL/6J, scale –1.0 to 1.0) against chromosomal position (Mb) for chromosome 7 and chromosome 14, with cytogenetic bands shown alongside; inset shows fluorescence in situ hybridization of two BAC clones.]
Figure 1 CGH analysis of the C57BL/6J and 129S5 mouse genomes. 129S5 genomic DNA was competitively hybridized with C57BL/6J DNA to RPCI-23, C57BL/6J, BAC clones spotted at 1-Mb resolution. Shown is the CGH profile of chromosomes 7 and 14 (whole genome is shown in Supplementary Table 1 online), which showed the most significant variation. BACs that showed a statistically significant copy-number change, following analysis by the Rosetta error model, are shown in red for copy-number gain or green for copy-number loss. Data were derived from two independent experiments using double spotted arrays, including a dye swap experiment. Inset, fluorescence in situ hybridization analysis of BACs RP23-30K15 (duplicated in 129S5) and RP23-81J14 (control), illustrating a copy-number gain on chromosome 14, consistent with results of CGH analysis.
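As an illustration of how dye-swap replicates like those mentioned in the caption are often combined, the sketch below averages a forward and a dye-swapped log2 ratio for a single BAC. The channel names, the function name and the example intensities are hypothetical. This is not the Rosetta error model and is not necessarily the procedure used to produce Figure 1; it only shows why a dye swap helps cancel a channel-specific bias.

```python
import math


def dye_swap_log2(fwd_a, fwd_b, swap_a, swap_b):
    """Combine a forward and a dye-swapped hybridization for one BAC spot.

    Assumed (hypothetical) layout:
      forward experiment: 129S5 DNA in channel A, C57BL/6J DNA in channel B
      swapped experiment: C57BL/6J DNA in channel A, 129S5 DNA in channel B
    Returns the mean log2(129S5 / C57BL/6J) ratio.
    """
    m_forward = math.log2(fwd_a / fwd_b)   # already 129S5 over C57BL/6J
    m_swapped = math.log2(swap_a / swap_b)  # C57BL/6J over 129S5, so negate
    return (m_forward - m_swapped) / 2.0


# Hypothetical spot: true 129S5/C57BL/6J ratio of 1.5, and channel A reads
# 20% too bright in both experiments.
bias = 1.2
combined = dye_swap_log2(1.5 * bias, 1.0, 1.0 * bias, 1.5)
print(combined, math.log2(1.5))
```

Because the bias multiplies channel A in both experiments, it enters the two log ratios with the same sign and drops out when the swapped ratio is subtracted.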
be very complex and were not detected by RepeatMasker. Many of these reads were also of lower quality (Phred score < 20). The remaining good quality reads (Phred score > 30; length > 200 bp) may represent true strain-specific insertions unique to strain 129S5 (1–3% of the genome).
Accurately mapped reads are a rich source of genomic information at the nucleotide level. We used a Phred algorithm to call SNPs using only the best read-genome match. From this analysis, we identified 201,407 nonredundant SNPs, consistent with independent analysis using SSAHA-SNP17,18. Fortuitously, a large number of end-sequence reads overlapped. Of the 201,407 SNPs identified, 48,966 were confirmed by a SNP called in at least one independent overlapping read. In 96% of cases, overlapping reads identified the same SNP. We also assessed the SNP call error rate experimentally. We amplified and resequenced 300 predicted SNPs and found that >91% were consistent with the computational SNP call. Our SNP collection overlaps with 3,586 SNPs in dbSNP121 (Supplementary Table 2 online), consistent with the genome coverage of this study. These SNP calls have an overall reliability of 96%, but this accuracy will vary across the genome depending on the quality of the C57BL/6J assembly and the density of SNP calls in a given region. In high-diversity regions of the genome, there will be fewer base-call errors relative to real SNPs than in low-diversity regions. In addition to SNPs, our end-sequence profiling identified 63,191 insertion-deletion polymorphisms. The location of these polymorphisms was significantly correlated with SNPs (P < 10⁻⁵).
The density and genome-wide distribution of these SNPs provide an opportunity to reassess the haplotype block pattern of the mouse genome19–21. By focusing on the haplotypes of two strains, we obtained very high-resolution block structure and boundaries. We
used a model assuming a Poisson distribution of nucleotide differences between the two genomes to detect and define transitions in the diversity levels (number of SNPs per 100 bp) of genomic segments. We made no a priori assumptions about the diversity of blocks; we simply attempted to detect genomic positions where there was a statistically
[Figure 2 plot: matrix of pairwise diversity differences, with reads 1–109 along both axes and a diversity color scale from 0% to 2%.]
Figure 2 Refined haplotype structure of the mouse genome. Plot of diversity using the program GOLD showing the absolute difference between all possible pairs of diversity values for a segment of chromosome 11 (109 reads). Absolute difference of diversity values is presented using a color scale with a gradient from red, for the highest diversity differences, to blue for the lowest diversity differences. The high-diversity blocks (circled) are represented in the horizontal dimension as small blocks of blue squares followed by large blocks of yellow and red squares.
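The quantity plotted in Figure 2 can be sketched in a few lines: compute a diversity value per read (SNPs per 100 bp) and then the absolute difference for every pair of reads. This is a minimal illustration with made-up read data and hypothetical function names; it does not reproduce the program GOLD or its color mapping.

```python
def read_diversity(snp_count, read_length_bp):
    """Diversity of one read as SNPs per 100 bp (a percentage)."""
    return 100.0 * snp_count / read_length_bp


def pairwise_abs_difference(diversities):
    """Square matrix whose entry [i][j] is |diversity_i - diversity_j|."""
    n = len(diversities)
    return [
        [abs(diversities[i] - diversities[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical reads as (SNP count, read length in bp), in chromosomal order.
reads = [(0, 500), (1, 500), (10, 500)]
diversities = [read_diversity(s, length) for s, length in reads]
matrix = pairwise_abs_difference(diversities)
for row in matrix:
    print(["%.1f" % value for value in row])
```

In such a matrix, a run of neighboring reads with similar diversity appears as a square of small values on the diagonal, and a block boundary appears as a step to larger values.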
[Figure 3 plots, panels a–d. Recoverable axis labels: Frequency; Read diversity (%); Average block diversity (whole genome) (%); Block size (Mb); Whole-genome average diversity (%); Average diversity (%); Chromosome (X, 1–19).]
Figure 3 Analysis of diversity between strains 129S5 and C57BL/6J. (a) Distribution of read diversity in all chromosomes. This bimodal distribution peaks at very low diversity (close to zero) and at ~0.17%. (b) Distribution of average block diversity in all chromosomes. (c) Scatter plot of average block diversity within blocks. High-diversity blocks are short, whereas low-diversity blocks vary in size. (d) Average diversity of reads for each of the 20 mouse chromosomes. Error bars indicate 95% confidence intervals of the diversity values. Highly significant heterogeneity exists among chromosomes (P < 10⁻⁵).
significant transition in diversity. We then looked at the distribution of diversity values of the discrete diversity blocks to correlate their size distribution and compared these values with those reported previously. Our analysis identified genomic blocks of common ancestry that have a much higher diversity than those previously defined20 and also identified low-diversity regions within high-diversity blocks (Figs. 2 and 3). Overall, we conclude that the high-diversity regions (≥0.4%; as previously defined20) of the genome averaged 227 kb in size, smaller than previously reported20. These data extend previous analyses of small regions of the genome22,23 and indicate that the haplotype pattern of the mouse genome is much richer than first thought.
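The block-detection scan summarized here, and described step by step under Statistical analysis in the Methods, can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: a central 95% Poisson interval, a background rate pooled over all reads accumulated in the current block, and the function names themselves. The read data are made up.

```python
import math


def poisson_interval(lam, alpha=0.05):
    """Central (1 - alpha) interval [lo, hi] for a Poisson(lam) count.

    Built by summing the probability mass directly, which is adequate for
    the small expected counts of a single read.
    """
    lo = hi = None
    k, pmf, cdf = 0, math.exp(-lam), 0.0
    while hi is None:
        cdf += pmf
        if lo is None and cdf >= alpha / 2:
            lo = k
        if cdf >= 1 - alpha / 2:
            hi = k
        k += 1
        pmf *= lam / k
    return lo, hi


def direction(count, lam, alpha=0.05):
    """+1 if count lies above the interval, -1 if below, 0 if inside."""
    lo, hi = poisson_interval(lam, alpha)
    if count > hi:
        return 1
    if count < lo:
        return -1
    return 0


def find_transitions(reads, alpha=0.05):
    """reads: list of (snp_count, length_bp) ordered from the centromere.

    Returns the indices of reads that start a new diversity block.
    """
    boundaries = []
    start, end = 0, 2  # the current block holds reads[start:end]
    while end + 2 <= len(reads):
        snps = sum(s for s, _ in reads[start:end])
        length = sum(n for _, n in reads[start:end])
        rate = snps / length  # SNPs per bp in the current block
        d1 = direction(reads[end][0], rate * reads[end][1], alpha)
        d2 = direction(reads[end + 1][0], rate * reads[end + 1][1], alpha)
        if d1 != 0 and d1 == d2:
            # Both test reads deviate in the same direction: call a
            # transition and restart from the first two reads after it.
            boundaries.append(end)
            start, end = end, end + 2
        else:
            end += 1  # absorb one more read into the block and retest
    return boundaries


# Hypothetical chromosome: low, then high, then low diversity (500-bp reads).
reads = [(0, 500)] * 6 + [(10, 500)] * 4 + [(0, 500)] * 4
print(find_transitions(reads))
```

Requiring two consecutive reads to deviate in the same direction makes the scan less sensitive to a single read with a sequencing error or an isolated SNP.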
There were clusters of high diversity within clusters of significantly lower diversity, and the distribution of the average block diversity ranged from 0 to 3–4% (Fig. 3a,b). Blocks with higher diversity (>0.4%) were shorter, whereas blocks of lower diversity varied in length (Fig. 3c). We did not encounter very large blocks of low diversity; the largest block of low diversity was ~6 Mb, smaller than those previously reported20,21. This refined resolution is the result of using a much larger collection of SNPs than previously used20,21. Wade et al.20 reported very large blocks of high diversity (>1 Mb), but our analysis resolved their ‘high-diversity blocks’ (>0.4%) into smaller, but higher-diversity, blocks (>1%) within regions of very low diversity. Recent studies, which sampled select
Table 1 Nucleotide changes provide a means for rapid genomic evolution
Gene symbol (protein ID)a                  Chromosome   SNP baseb    Length of proteinc   Transition   Suspected function (if known)
ENSMUST00000053256 (ENSMUSO00000060700)    1            9564818      365                  R273X        Ribosome biogenesis
ENSMUST00000050810 (ENSMUSO00000061372)    1            67144376     393                  W393X
ENSMUST00000044967 (ENSMUSO00000042906)    1            187460899    136                  R20X
ENSMUST00000058140 (ENSMUSO00000053219)    12           68612581     99                   Q76X
ENSMUST00000059402 (ENSMUSO00000053159)    15           24559559     138                  Q123X
Actin polymerization
Listed transcripts were found to contain premature termination (stop) codons. These stop codons were confirmed by resequencing. aBased on Ensembl NCBIm30 computationally predicted gene set. bLocation on the NCBIm30 assembly. cIn amino acids.
regions of the genome, identified a similar complex pattern of diversity22. Our study samples the entire genome and is therefore less likely to reflect regional biases. Even so, the level of genomic variation that we report is probably an underestimate because our analysis does not compare two finished genomes and is limited to two strains.
Notably, we observed a significant difference in the average level of diversity between chromosomes (P < 10⁻⁵; Fig. 3d). This can be explained in part by the inheritance of chromosomes from common parental strains, M. musculus musculus and M. musculus domesticus in strains 129S5 and C57BL/6J. But individual chromosomes have also probably diverged at different rates owing to differential selective pressures dictated by their gene content. We also compared the SNP-containing regions of the genome (excluding regions with zero diversity) that showed copy-number variation between strains C57BL/6J and 129S5 (Fig. 1 and Supplementary Table 1 online). The average diversity (± s.d.) in these regions was 0.55574% (± 0.4428) in strain C57BL/6J versus 0.4398% (± 0.5038%) in strain 129S5 for regions of the genome with normal copy number (P = 0.024). This result may reflect a read-mapping bias in these segments of the genome, but this is unlikely to affect our analysis substantially because these regions constitute a relatively small component of the genome.
Coding variation probably has an important role in phenotypic differences between strains. Coding variants may have been acquired during divergence of mouse strains from common founders or acquired by mutation24. By translating the end-read sequences that align with exons of the Ensembl NCBIm30 computationally predicted gene set, we predicted 2,959 coding SNPs (Supplementary Table 3 online); 693 of these are ‘double hit’ SNPs, 1,559 are nonsynonymous and 5 cause premature termination codons in strain 129S5 (Table 1 and Supplementary Table 3 online).
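To put these counts in genome-wide terms, they can be scaled by the fraction of sequence sampled. The sketch below uses the roughly 4.7% coverage estimated from the mapped reads and assumes, for illustration only, that coding variants are spread evenly across exons and that exons were sampled at the same rate as the rest of the genome. The function name is hypothetical.

```python
def extrapolate(observed_count, fraction_sampled):
    """Scale a count seen in a sampled fraction up to the whole."""
    return observed_count / fraction_sampled


coverage = 0.047  # about 4.7% of the C57BL/6J sequence covered by 129S5 reads

print(round(extrapolate(2959, coverage)))  # coding SNPs -> 62957
print(round(extrapolate(5, coverage)))     # premature stop codons -> 106
```

Both figures are rough lower-bound style estimates; uneven read coverage or clustering of variants would shift them.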
Because our 129S5 sequence covers ~4.7% of exonic nucleotides in strain C57BL/6J, there may be >100 premature termination codons and >62,000 coding changes in strain 129S5 compared with strain C57BL/6J. The genes with premature termination codons do not seem to map to known duplicated regions of the genome or to the regions identified as having copy-number changes in our CGH analysis (Fig. 1), suggesting that these mutations are not compensated for by a nonmutated homolog. For most of the genes containing SNPs (Table 1), a nonmutated full-length human or rat ortholog is available in which the overall gene structure is conserved, suggesting that the process of selection resulting in the fixation of these premature termination events was unique to strain 129S5.
Mutations that alter splicing are also likely to be deleterious to gene function. In particular, mutations in the invariant intronic GT-AG splice sites have been associated with disease conditions25–27. By genomic DNA sequencing, we confirmed in strain 129S5 mutation of the invariant GT-AG splice sites in the genes ENSMUSG00000051470, ENSMUSG00000035941 and ENSMUSG00000047829.
Our data indicate that although strains C57BL/6J and 129S5 were established from common ancestors, inbreeding has fixed substantial diversity in their genomes, characterized in part by copy-number variation, SNPs, coding changes and splice-site alterations. The variation we describe underscores the divergence that has shaped the phenotypic evolution of these strains and has implications for the analysis of phenotypes on the hybrid C57BL/6J × 129S5 background.
METHODS
Libraries, end-sequencing and mapping. We constructed the 5′ HPRT (MHPN) and 3′ HPRT (MHPP) libraries as described previously16,28 using DNA from male 129Sv (S5; also known as 129S5/SvEvBrd) mice. We confirmed
the genetic purity of the source DNA in these libraries by genotyping them with a panel of 103 polymorphic 129Sv markers spaced every 20 Mb throughout the genome (Charles River). We converted these libraries from λ phage to plasmids by infecting BNN132 Cre-expressing bacterial cells (provided by S. Elledge; Baylor College of Medicine). We then transformed complex DNA libraries into DH10B for sequencing and arraying. The average insert length (± s.d.) was 6,776 bp (± 2,227) and 7,479 bp (± 2,310) for the 5′ HPRT and 3′ HPRT libraries, respectively. We mapped clones against the NCBIm30 C57BL/6J genome using SSAHA-2. Mapped clones are shown on the Ensembl genome browser (DAS source MICER). All reads are available from the Sanger Trace repository. Quality clipped reads were also submitted to EMBL.
CGH and SNP discovery. We carried out CGH as described13 using RPCI-23 BACs spotted at 1-Mb resolution. We carried out fluorescence in situ hybridization as described previously13. We used vector and quality trimmed reads, derived from clones either sequenced from both ends or for which an end-sequenced read mapped to a single definitive location, for SNP detection. We used a minimum Phred base call of 24. We identified SNPs by comparison of end-sequence reads to the C57BL/6J whole-genome shotgun assembly NCBIm30. All SNPs were submitted to dbSNP.
Statistical analysis. We analyzed CGH data using the Rosetta error model15. We defined diversity blocks by detecting statistically significant transitions of diversity. We made no assumptions about the SNP composition of high- and low-diversity regions. For each chromosome, we used the first two reads from the centromere to calculate their average diversity. We calculated an expected Poisson count of SNPs for the next two reads (third and fourth), given their length and the average diversity of the previous two reads (first and second).
We defined a transition as a case in which both the third and fourth reads fell outside the Poisson confidence intervals and in the same direction. If neither or only one of the reads was outside the confidence intervals or if they had different directions, no transition was called. We then included the third read and repeated the analysis on the fourth and fifth reads. We continued this process until we encountered a statistically significant transition. When a transition was called, we started again with the first two of the new set of reads after the transition point and continued the analysis along the chromosome.
URLs. The Ensembl Mouse Genome is available at http://www.ensembl.org/Mus_musculus/. The Ensembl/Sanger Trace repository is available at http://trace.ensembl.org/. dbSNP is available at http://www.ncbi.nlm.nih.gov/SNP/. The program GOLD is available at http://www.well.ox.ac.uk/asthma/GOLD/.
Accession numbers. End reads: BX957368–BX999998, CR000001–CR075066, CR075068–CR076137, CR076139–CR249243 and CR249245–CR278160 (EMBL). ArrayExpress: E-MEXP-298.
Note: Supplementary information is available on the Nature Genetics website.
ACKNOWLEDGMENTS
We thank B. Plumb and his team for sequencing; P. Biggs and members of the Sanger informatics team for their assistance; and W. Wang, L. van der Weyden, J. Jonkers and A. Velds for discussions. D.J.A. was supported by a CJ Martin Fellowship from the Australian National Health and Medical Research Council. This work was supported by the Wellcome Trust.
COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.
Received 19 October 2004; accepted 18 March 2005
Published online at http://www.nature.com/naturegenetics/
1. Simpson, E.M. et al. Genetic variation among 129 substrains and its importance for targeted mutagenesis in mice. Nat. Genet. 16, 19–27 (1997).
2. Altman, P.L.K. & Katz, D.D. Part I, Mice and Rats.
in Inbred and Genetically Defined Strains of Laboratory Animals 1–418 (Federation of the American Societies for Experimental Biology, Bethesda, Maryland, 1979).
3. Beck, J.A. et al. Genealogies of mouse inbred strains. Nat. Genet. 24, 23–25 (2000).
4. Silver, L.M. Mouse Genetics (Oxford Univ. Press, New York, 1995).
5. Bonhomme, F. et al. The polyphyletic origin of laboratory inbred mice and their rate of evolution. J. Linn. Soc 30, 51–88 (1987).
6. Stevens, L.C. A new inbred subline of mice (129-terSv) with a high incidence of spontaneous congenital testicular teratomas. J. Natl. Cancer Inst. 50, 235–242 (1973).
7. Auerbach, W. et al. Establishment and chimera analysis of 129/SvEv- and C57BL/6-derived mouse embryonic stem cell lines. Biotechniques 29, 1024–1028, 1030, 1032 (2000).
8. van der Weyden, L., Adams, D.J. & Bradley, A. Tools for targeted manipulation of the mouse genome. Physiol. Genomics 11, 133–164 (2002).
9. Levine, M. & Tjian, R. Transcription regulation and animal diversity. Nature 424, 147–151 (2003).
10. Force, A. et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545 (1999).
11. Dermitzakis, E.T. & Clark, A.G. Differential selection after duplication in mammalian developmental genes. Mol Biol Evol 18, 557–562 (2001).
12. Eichler, E.E. & Sankoff, D. Structural dynamics of eukaryotic chromosome evolution. Science 301, 793–797 (2003).
13. Chung, Y.J. et al. A whole-genome mouse BAC microarray with 1-Mb resolution for analysis of DNA copy number changes by array comparative genomic hybridization. Genome Res. 14, 188–196 (2004).
14. Li, J. et al. Genomic segmental polymorphisms in inbred mouse strains. Nat. Genet. 36, 952–954 (2004).
15. Hughes, T.R. et al. Functional discovery via a compendium of expression profiles. Cell 102, 109–126 (2000).
16. Adams, D.J. et al. MICER—Mutagenic insertion and chromosome engineering resource. Nat. Genet. 36, 867–871 (2004).
17. Birney, E. et al. An overview of Ensembl. Genome Res. 14, 925–928 (2004).
18. Ning, Z., Cox, A.J. & Mullikin, J.C. SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725–1729 (2001).
19. Lindblad-Toh, K. et al. Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse. Nat. Genet. 24, 381–386 (2000).
20. Wade, C.M. et al. The mosaic structure of variation in the laboratory mouse genome. Nature 420, 574–578 (2002).
21. Wiltshire, T. et al. Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse. Proc. Natl. Acad. Sci. USA 100, 3380–3385 (2003).
22. Yalcin, B. et al. Unexpected complexity in the haplotypes of commonly used inbred strains of laboratory mice. Proc. Natl. Acad. Sci. USA 101, 9734–9739 (2004).
23. Frazer, K.A. et al. Segmental phylogenetic relationships of inbred mouse strains revealed by fine-scale analysis of sequence variation across 4.6 Mb of mouse genome. Genome Res. 14, 1493–1500 (2004).
24. Bernardi, G. The compositional evolution of vertebrate genomes. Gene 259, 31–43 (2000).
25. Aebi, M., Hornig, H., Padgett, R.A., Reiser, J. & Weissmann, C. Sequence requirements for splicing of higher eukaryotic nuclear pre-mRNA. Cell 47, 555–565 (1986).
26. Atweh, G.F., Anagnou, N.P., Shearin, J., Forget, B.G. & Kaufman, R.E. Beta-thalassemia resulting from a single nucleotide substitution in an acceptor splice site. Nucleic Acids Res. 13, 777–790 (1985).
27. Zeniou, M., Gattoni, R., Hanauer, A. & Stevenin, J. Delineation of the mechanisms of aberrant splicing caused by two unusual intronic mutations in the RSK2 gene involved in Coffin-Lowry syndrome. Nucleic Acids Res. 32, 1214–1223 (2004).
28. Zheng, B., Mills, A.A. & Bradley, A. A system for rapid generation of coat color-tagged knockouts and defined chromosomal rearrangements in mice. Nucleic Acids Res. 27, 2354–2360 (1999).