Scientific Correspondence
Development and Characterization of Genome-Wide Single Nucleotide Polymorphism Markers in the Green Alga Chlamydomonas reinhardtii¹
Valentina S. Vysotskaia*, Damian E. Curtis, Alexander V. Voinov, Pushpa Kathir², Carolyn D. Silflow, and Paul A. Lefebvre
Exelixis, Inc., 170 Harbor Way, P.O. Box 511, South San Francisco, California 94083-0511 (V.S.V., D.E.C., A.V.V.); and Departments of Genetics, Cell Biology and Development and Plant Biology, University of Minnesota, 1445 Gortner Avenue, St. Paul, Minnesota 55108 (P.K., C.D.S., P.A.L.)
Chlamydomonas reinhardtii is a unicellular green alga, with a genome estimated to be around 100 Mbp (Harris, 1989). Similar to yeast, C. reinhardtii has well-understood haploid genetics, but unlike yeast it has both a chloroplast and flagella. These unique properties make it a powerful model system for studying fundamental cellular and molecular biology questions concerning photosynthesis, flagellar motility, and basal body function (for review, see Lefebvre and Silflow, 1999; Grossman, 2000). The photosynthetic mechanisms for CO2 fixation in C. reinhardtii are very similar to those used in vascular plants. However, unlike higher plants, C. reinhardtii can be grown heterotrophically on acetate as a sole source of carbon. As a result, mutants with defective photosynthesis are readily isolated and characterized. When growing heterotrophically, dark-grown cells exhibit normal photosynthetic capability and chloroplast development. In addition, C. reinhardtii can be used as a model system to identify the molecular targets of herbicides and metabolic inhibitors (Harris, 1989).
Currently, genomic and genetic resources for C.
reinhardtii include around 65,000 expressed sequence tags (ESTs; http://www.kazusa.or.jp/en/plant/chlamy/EST/; http://www.biology.duke.edu/chlamy_genome/EST.html), a bacterial artificial chromosome library (Lefebvre et al., unpublished data), and an RFLP map aligned with the genetic map (Silflow, 1998; Kathir et al., unpublished data). The complete genomic sequence is not available, and construction of a physical map linked to the genetic map is in progress (Lefebvre and Silflow, 1999).
The RFLP linkage map was constructed from a mapping population generated by crossing the standard laboratory strain of C. reinhardtii (Smith 137C, isolated in Massachusetts in 1945) and the Minnesota isolate S1D2 (Gross et al., 1988). This map includes 250 markers, identifying each of the 17 linkage groups (Silflow, 1998). However, genotyping with RFLP markers is labor intensive and time consuming. To develop a high-throughput, inexpensive system for genetic mapping, we converted RFLP markers to single nucleotide polymorphism (SNP) markers. SNP markers can be assayed rapidly and easily using a wide variety of techniques, including a template-directed dye-terminator incorporation assay with fluorescence polarization detection (Chen et al., 1999), pyrosequencing (Alderborn et al., 2000), oligonucleotide-specific ligation (Tobe et al., 1996), molecular beacons (Marras et al., 1999), dynamic allele-specific hybridization (Prince et al., 2001), the TaqMan system (Livak, 1999), mass spectrometry (Stoerker et al., 2000), and oligonucleotide arrays (Hirschhorn et al., 2000; Pastinen et al., 2000).
Here, we report the development of a collection of 186 SNP markers in C. reinhardtii. Sequence information and further details concerning these markers are available on the web site of Duke University/the Chlamydomonas Genetics Center (http://www.biology.duke.edu/chlamy/). We also characterized DNA polymorphisms between the mapping strains 137C and S1D2 and evaluated C. reinhardtii EST data as a source for additional SNP markers.
¹ This work was supported by the National Institutes of Health (grant no. GM34437 to P.A.L.) and the National Science Foundation (grant no. NSF/MCB–9975765 to P.A.L. and C.D.S.).
² Present address: 10704 Dundas Oak Court, Burke, VA 22015.
* Corresponding author; e-mail [email protected]; fax 650-837-8204.
www.plantphysiol.org/cgi/doi/10.1104/pp.010485.
ASSESSMENT OF POLYMORPHISMS BETWEEN C. REINHARDTII STRAINS
The 137C and S1D2 strains used to generate the RFLP map are both commonly used by the C. reinhardtii research community. We evaluated the level of sequence polymorphism between these strains by comparing short stretches of DNA sequence. We used 137C genomic and cDNA sequence information from GenBank to design 34 sequence-tagged sites (STSs), ranging between 300 and 500 bp and representing exons, introns, and 3′ untranslated regions (UTR). Using genomic DNA from both strains as templates, STSs were amplified, sequenced using the PCR primers, and analyzed by BLAST (Altschul et al., 1990) or the phred/phrap/consed package (Ewing et al., 1998; Gordon et al., 1998). For each STS entry, Table I shows the gene name, accession number, gene region, and STS size, as well as the number of polymorphic sites, single-base changes (SNPs), larger substitutions (affecting more than 1 base), and small (<6 bp) and large (>6 bp) deletions/insertions. To ensure high accuracy, sequence variations were confirmed by visual inspection of the traces from both strains.
Plant Physiology, October 2001, Vol. 127, pp. 386–389, www.plantphysiol.org © 2001 American Society of Plant Biologists
The results shown in Table I indicate that the S1D2 strain of C. reinhardtii is highly polymorphic relative to the 137C strain. In total, we detected 248 polymorphic loci out of 11,651 total bases examined, representing an average of one sequence variation per 47 bp. More differences were observed in non-coding regions (introns and 3′UTR) than in coding regions. Because most STSs were designed from 3′UTR and only a few from exons, a statistical analysis of the polymorphism distribution between coding and non-coding regions could not be performed. The majority of polymorphic loci were single-nucleotide substitutions. The ratio between transitions and transversions was roughly 2:1. Only two (6%) of 34 analyzed STSs showed large deletions/insertions (43 and 140 bp, respectively). SNPs were not randomly distributed across the STS loci. The number of SNPs per STS ranged from none at three loci to 33 at the gene encoding gamete lytic enzyme, which is a zinc metalloprotease mediating digestion of the cell walls during mating (Kinoshita et al., 1992). Such local variation in polymorphism rate may arise because some loci are inherently more mutable than others. This phenomenon has been described in Caenorhabditis elegans (Koch et al., 2000), Mus musculus (Lindblad-Toh et al., 2000), and Drosophila melanogaster (Hoskins et al., 2001). To develop SNP markers from all available RFLP probes, we concentrated only on single nucleotide substitutions and preferentially designed STSs from the 3′UTR to avoid
possible cross-amplification from closely related genes.
Table I. Characterization of sequence variations between the 137C and S1D2 strains of C. reinhardtii
Gene Name   Accession No.  cds/Intron/3′UTR  Analyzed Sequence (bp)  No. of Polymorph.ᵃ Sites   No. of SNPs  Substitutions >1 bp  Del/Insᵇ <6 bp    Del/Ins >6 bp
PC1         U36752         cds               380                     3                          3            0                    0                 0
RBCS2       X04472         cds&intron        200                     11                         9            0                    3                 0
AC208       L07282         cds&intron        350                     14                         12           1                    1                 0
YPTC6       U13169         cds&intron        260                     3                          3            0                    0                 0
PF14        X14549         cds               270                     0                          0            0                    0                 0
HSP70A      M76725         cds               350                     2                          2            0                    0                 0
RPL41       AF130727       3′UTR             440                     2                          1            0                    0                 1 (43 bp)
PF16        U40057         3′UTR             370                     0                          0            0                    0                 0
Caltractin  X57973         3′UTR             330                     4                          4            0                    0                 0
Cblp        X53574         cds&intron        270                     9                          9            0                    0                 0
CRY1        U06937         3′UTR             246                     2                          1            0                    1                 0
Ida4        Z48059         3′UTR             360                     4                          4            0                    0                 0
TUA1        M11447         3′UTR             480                     15                         14           1                    0                 0
TUA2        M11448         cds&intron        360                     5                          4            0                    1                 0
HsP70B      X96502         3′UTR             425                     1                          0            0                    0                 1 (9 bp)
TUB1        M10064         3′UTR             435                     1                          1            0                    0                 0
GLE         D90503         cds&3′UTR         400                     35                         33           1                    2                 0
RBCS2       X04472         3′UTR             360                     9                          7            0                    1                 1 (7 bp)
RSP4        M87526         3′UTR             320                     9                          8            0                    1                 0
PF14        X14549         3′UTR             325                     13                         12           1                    0                 0
IDA1/PF9    U61364         3′UTR             320                     5                          3            9                    2                 0
LC3         U43610         3′UTR             340                     13                         12           1                    0                 0
Tctex1      AF039437       3′UTR             380                     16                         13           1                    2                 0
LC4         U34345         3′UTR             350                     4                          3            1                    0                 0
GBP1        U10442         3′UTR             380                     1                          1            0                    0                 0
KLP1        X78589         3′UTR             300                     3                          2            0                    0                 1 (140 bp)
ALAD        U19876         3′UTR             350                     2                          2            0                    0                 0
FLA14       U19490         3′UTR             320                     6                          6            0                    0                 0
CRANT       X65194         3′UTR             330                     2                          2            0                    0                 0
LAO1        U78797         3′UTR             340                     8                          7            0                    1                 0
AC206       U70999         3′UTR             320                     12                         12           0                    0                 0
ARS         X52304         3′UTR             350                     8                          7            1                    0                 0
LC5         X43609         3′UTR             300                     7                          6            0                    1                 0
S926        X62135         3′UTR             340                     19                         10           0                    6                 1 (16 bp)
Total                                        11,651                  248                        213          17                   22                5
ᵃ Polymorph., Polymorphism.
ᵇ Del, Deletions; Ins, insertions.
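The summary figures quoted for Table I can be re-derived from its totals row. The following Python sketch is illustrative only: the constant and function names are assumptions introduced here, and the numbers are copied from Table I rather than recomputed from sequence data.

```python
# Totals row of Table I (137C vs. S1D2); all names here are illustrative.
TOTAL_BASES = 11_651
POLYMORPHIC_SITES = 248
SNPS = 213


def bp_per_variant(bases: int, variants: int) -> float:
    """Average number of bases examined per sequence variation."""
    if variants == 0:
        raise ValueError("no variants observed")
    return bases / variants


overall_spacing = bp_per_variant(TOTAL_BASES, POLYMORPHIC_SITES)
snp_fraction = SNPS / POLYMORPHIC_SITES

# GLE, the most polymorphic STS in Table I: 33 SNPs in 400 bp.
gle_spacing = bp_per_variant(400, 33)

print(f"overall: one variation per {overall_spacing:.0f} bp")
print(f"SNPs as a share of polymorphic sites: {snp_fraction:.0%}")
print(f"GLE: one SNP per {gle_spacing:.0f} bp")
```

With the Table I totals, the first value rounds to the one variation per 47 bp reported in the text, and the GLE figure shows how far a single locus can depart from that average.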
DEVELOPING SNP MARKERS FROM RFLP PROBES
The molecular map of the C. reinhardtii genome includes 250 RFLP markers. Specific genes and random genomic and cDNA clones have been used as RFLP probes to develop these markers. At the time this project was initiated, sequence information for 68 of the 250 RFLP probes was available from GenBank. These sequences represent genomic clones and cDNAs of known genes from the C. reinhardtii strain 137C. To identify SNPs from these 68 RFLP markers, we designed short (on average 250 bp) STSs using available sequence information. We also sequenced 100 RFLP probes representing the 137C random genomic DNA or cDNA clones and then designed STSs based on the obtained sequence information. Approximately 60% of the primer pairs designed from the 137C sequences yielded robust specific products in the S1D2 strain. Presumably, PCR success was lowered by the high level of polymorphism between the C. reinhardtii strains, and we often had to redesign primers to obtain an amplified product from the S1D2 strain. The extensive sequence polymorphism makes a locus-specific amplification approach to SNP discovery difficult: this approach requires synthesis of oligonucleotide primers for each locus based on sequence information from the laboratory strain, and many primers fail in a strain with such a high level of sequence variation. Alternative approaches such as reduced representation shotgun sequencing (Altshuler et al., 2000) and EST analysis (Picoult-Newberg et al., 1999) do not require previous sequence knowledge or PCR. However, these alternative approaches require the developed SNPs to be genetically mapped, whereas the RFLP probes have been mapped previously.
We started with the phred/phrap/consed package (Ewing et al., 1998; Gordon et al., 1998) for identification of sequence variations, but we found many cases in which the 137C and S1D2 reads corresponding to the same locus could not be assembled by phrap because of significant nucleotide sequence differences between the strains. To solve this problem, we switched from phrap to ClustalW (Thompson et al., 1994). A multiple alignment composed of the 137C and S1D2 reads was scanned by a custom script for the presence of sequence variations. A fuzzy set theory approach (Zadeh, 1975) was used to discern whether each variation represented an SNP or a sequencing error. Potential SNPs were confirmed by visual inspection of the traces from both strains. A total of 156 RFLP markers were converted to SNP markers, generating an average marker density of one SNP marker per 500 kb.
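The scanning step described above can be pictured with a small Python sketch. This is not the authors' custom script, which is not published here, and it does not implement the fuzzy set classifier. As a stated simplification, it reports only isolated single-base substitutions whose neighbouring columns match exactly, a crude stand-in for quality-based filtering. The names `scan_alignment`, `Variation`, and the `flank` parameter are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


@dataclass(frozen=True)
class Variation:
    column: int    # 0-based column in the pairwise alignment
    ref_base: str  # base in the 137C read
    alt_base: str  # base in the S1D2 read
    kind: str      # "transition" or "transversion"


def classify(ref_base: str, alt_base: str) -> str:
    """Purine<->purine or pyrimidine<->pyrimidine changes are transitions."""
    pair = {ref_base, alt_base}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def scan_alignment(ref_aln: str, alt_aln: str, flank: int = 2) -> list[Variation]:
    """Report isolated single-base substitutions between two aligned reads.

    A column is reported only if both bases are A/C/G/T and the `flank`
    columns on each side match exactly (assumed filter, not the paper's).
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_aln, alt_aln = ref_aln.upper(), alt_aln.upper()
    hits = []
    for i, (r, a) in enumerate(zip(ref_aln, alt_aln)):
        if r == a or r not in BASES or a not in BASES:
            continue  # identical column, gap, or ambiguous base
        lo, hi = max(0, i - flank), min(len(ref_aln), i + flank + 1)
        neighbours_match = all(
            ref_aln[j] == alt_aln[j] and ref_aln[j] in BASES
            for j in range(lo, hi)
            if j != i
        )
        if neighbours_match:
            hits.append(Variation(i, r, a, classify(r, a)))
    return hits


if __name__ == "__main__":
    for hit in scan_alignment("ACGTACGTAC-GTTAGC", "ACGTGCGTACAGTTCGC"):
        print(hit)
```

Requiring clean flanking columns deliberately discards multi-base substitutions and mismatches next to gaps, in line with the decision to concentrate on single nucleotide substitutions; candidates found this way would still need confirmation against the sequence traces.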
SNP IDENTIFICATION AND VALIDATION FROM PUBLIC EST DATA
To increase the density of the SNP markers, publicly available EST data were scanned for sequence variations. At the time this project was initiated, the C. reinhardtii EST database consisted of 21,971 and 1,550 reads for 137C and S1D2 strains, respectively. EST reads were clustered into 539 contigs, which contained at least one read for both 137C and S1D2 strains. Of the 539 contigs, 170 contained more than one S1D2 read. For SNP identification we focused on contigs containing at least one read from both strains, 137C and S1D2. We identified approximately 200 contigs with potential SNPs. Because traces were not available, we could not distinguish true SNPs from false positives caused by sequencing errors. To assess the accuracy of SNP discovery without read quality information, we randomly selected 48 SNP-containing contigs for experimental confirmation. Regions surrounding the putative SNPs were PCR-amplified from 137C and S1D2 genomic DNA and resequenced. Amplified PCR products from 35 of 48 (73%) primer pairs were evaluated by direct sequencing. Because the primers were designed from the EST sequences, introns or sequencing errors may have prevented the remaining primer pairs from yielding a product. Among the 35 successful PCR products, 30 (86%) contained SNPs at the predicted positions. In many cases more than one SNP per PCR product was detected. The 30 SNP-containing products represent an overall yield of 62% (30 of 48). These results indicate that ESTs currently available in GenBank could provide more than 125 additional SNP markers, which can be mapped genetically. Information on the 30 EST markers is available at http://www.biology.duke.edu/chlamy/. We conclude that the growing EST database for C. reinhardtii will be very useful for identifying new SNPs.
CONCLUDING REMARKS
Genome-wide SNP markers are now being developed in model organisms such as M. musculus (Lindblad-Toh et al., 2000), C. elegans (Koch et al., 2000), Arabidopsis (Cho et al., 1999; http://www.Arabidopsis.org/Cereon/index.html), and D. melanogaster (Hoskins et al., 2001). SNPs are undoubtedly an important tool for modern genetic analyses in any organism and significantly increase the efficiency of map-based cloning of genes of interest. In this study, 156 genome-wide SNP markers have been developed in C. reinhardtii by analyzing RFLP markers with known map position. This approach automatically provides map positions for identified SNPs. This collection will be of immediate value to the C. reinhardtii research community and is an important first step toward the production of a larger map. It would be valuable to increase the density of SNPs by 2- to 3-fold to obtain dense coverage throughout the genome and to cover existing gaps on the map. To develop additional SNP markers, we evaluated publicly available EST data as a potential source for SNP discovery. Based on our results, the current set of SNP markers could be nearly doubled with minimal effort. Increasing the number of S1D2 ESTs would also identify additional SNPs. The C. reinhardtii community has shown strong enthusiasm for sequencing the entire genome; thus, mapping of discovered SNPs will be no problem in the near future.
ACKNOWLEDGMENTS
We would like to thank the Exelixis sequencing group and Plant Genetics group for their support. We are also grateful to Drs. John Davies, Andreas Gnirke, Karin Schmitt, and Nancy Federspiel for helpful comments on this manuscript.
Received May 31, 2001; accepted June 15, 2001.
LITERATURE CITED
Alderborn A, Kristofferson A, Hammerling U (2000) Genome Res 10: 1249–1258
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J Mol Biol 215: 403–410
Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES (2000) Nature 407: 513–516
Chen X, Levine L, Kwok P-Y (1999) Genome Res 9: 492–498
Cho RJ, Mindrinos M, Richards DR, Sapolsky RJ, Anderson M, Drenkard E, Dewdney J, Reuber TL, Stammers M, Federspiel N et al. (1999) Nat Genet 23: 203–207
Ewing B, Hillier L, Wendl MC, Green P (1998) Genome Res 8: 175–185
Gordon D, Abajian C, Green P (1998) Genome Res 8: 195–202
Gross CH, Ranum LP, Lefebvre PA (1988) Curr Genet 13: 503–508
Grossman A (2000) Curr Opin Plant Biol 3: 132–137
Harris E (1989) The Chlamydomonas Sourcebook. Academic Press, New York
Hirschhorn JN, Sklar P, Lindblad-Toh K, Lim YM, Ruiz-Gutierrez M, Bolk S, Langhorst B, Schaffner S, Winchester E, Lander ES (2000) Proc Natl Acad Sci USA 97: 12164–12169
Hoskins RA, Phan AC, Naeemuddin M, Mapa FA, Ruddy DA, Ryan JJ, Young LM, Wells T, Kopczynski C, Ellis MC (2001) Genome Res 11: 1100–1113
Kinoshita T, Fukuzawa H, Shimada T, Saito T, Matsuda Y (1992) Proc Natl Acad Sci USA 89: 4693–4697
Koch R, van Luenen HG, van Der Horst M, Thijssen KL, Plasterk RH (2000) Genome Res 10: 1690–1696
Lefebvre P, Silflow C (1999) Genetics 151: 9–14
Lindblad-Toh K, Winchester E, Daly MJ, Wang DG, Hirschhorn JN, Laviolette JP, Ardlie K, Reich DE, Robinson E, Sklar P et al. (2000) Nat Genet 24: 381–386
Livak KJ (1999) Genet Anal 14: 143–149
Marras SA, Kramer FR, Tyagi S (1999) Genet Anal 14: 151–156
Pastinen T, Raitio M, Lindroos K, Tainola P, Peltonen L, Syvanen AC (2000) Genome Res 10: 1031–1042
Picoult-Newberg L, Ideker TE, Pohl MG, Taylor SL, Donaldson MA, Nickerson DA, Boyce-Jacino M (1999) Genome Res 9: 167–174
Prince JA, Feuk L, Howell WM, Jobs M, Emahazion T, Blennow K, Brookes AJ (2001) Genome Res 11: 152–162
Silflow CD (1998) Organization of the nuclear genome. In J-D Rochaix, M Goldschmidt-Clermont, S Merchant, eds, The Molecular Biology of Chloroplasts and Mitochondria in Chlamydomonas. Kluwer Academic Publishers, Dordrecht, The Netherlands, pp 25–40
Stoerker J, Mayo JD, Tetzlaff CN, Sarracino DA, Schwope I, Richert C (2000) Nat Biotechnol 18: 1213–1216
Thompson JD, Higgins DG, Gibson TJ (1994) Nucleic Acids Res 22: 4673–4680
Tobe VO, Taylor SL, Nickerson DA (1996) Nucleic Acids Res 24: 3728–3732
Zadeh L (1975) Info Sci 8: 199–249