Schlüter & al. • Low-copy nuclear sequence markers
54 (3) • August 2005: 766–770
METHODS AND TECHNIQUES
Making the first step: practical considerations for the isolation of low-copy nuclear sequence markers
Philipp M. Schlüter1, Tod F. Stuessy1 & Hannes F. Paulus2
1 Department of Systematic and Evolutionary Botany, Institute of Botany, University of Vienna, Rennweg 14, A-1030 Vienna, Austria. [email protected] (author for correspondence), tod.stuessy@univie.ac.at
2 Department of Evolutionary Biology, Institute of Zoology, University of Vienna, Althanstraße 14, A-1090 Vienna, Austria. [email protected]
In many plant groups, the use of low-copy nuclear sequence markers for phylogenetics and population genetics has been hindered by their limited availability. Although it may be possible to PCR amplify low-copy markers using primers designed for use with other plant groups, this does not always yield the desired results. Here, we suggest several alternative approaches to begin the isolation and characterisation of novel low-copy markers when there is little or no sequence information available. These alternatives are: (1) the design of new primers from information in the sequence databases; (2) isolation of homologous DNA using a gene probe from another organism; (3) characterisation of sequence markers from DNA fingerprints; and (4) obtaining novel sequences via cDNA cloning.
KEYWORDS: low-copy DNA sequences, nuclear markers, molecular phylogenetics.
INTRODUCTION
Low-copy nuclear DNA sequence markers are increasingly being recognised for their potential in low-level phylogenetics or in population-level studies, where they can be used to generate allelic genealogies in the framework of coalescent theory (Schaal & Olsen, 2000; Schaal & Leverich, 2001; Mort & Crawford, 2004; Small & al., 2004). Although many aspects of this topic have recently been covered in excellent reviews (Hare, 2001; Sang, 2002; Zhang & Hewitt, 2003; Mort & Crawford, 2004; Small & al., 2004), one critical, practical question for many researchers has not been sufficiently addressed. Since the availability of low-copy markers is limited in many groups, the question of interest is: How can such markers be established in a study system where none has been characterised so far, especially if screening available primers from other groups of taxa does not yield the desired results? Our purpose here is to give a few suggestions as to how one may obtain low-copy nuclear sequence markers.
When talking about low-copy nuclear markers, one is typically referring to protein-coding genes with an exon-intron structure. While introns are contained in the majority of protein-coding genes (see Deutsch & Long, 1999), it has recently been recognised that (1) genes with a low GC content are more likely to contain introns, and (2) these introns are longer on average (Carels & Bernardi, 2000). Intron length is generally not well conserved (Deutsch & Long, 1999; Betts & al., 2001), and it remains to be seen whether it is correlated with genome size (Deutsch & Long, 1999; Wendel & al., 2002). Intron position often is conserved, even across large evolutionary distances (e.g., Jager & al., 2003; Sánchez & al., 2003), although this cannot be generalised (see Rokas & al., 1999; Betts & al., 2001; Wada & al., 2002). While it is not a prerequisite for a phylogenetic marker to have any known molecular function, functional genes with exon-intron structure do have certain advantages: exons are typically conserved enough to provide robust primer binding sites for polymerase chain reaction (PCR) across a range of taxa, while intron sequence is desired for its higher variability, which allows evolutionary history to be reconstructed in closely related organisms (Strand & al., 1997; Hare, 2001; Small & al., 2004). Since it may be difficult to verify the identity of a gene from intron DNA sequence, it is advisable to include exon sequence that can be identified by BLAST searches (Strand & al., 1997; Small & al., 2004).
Characterising a useful phylogenetic marker is bound to take some time, but how much time it will take is hard to predict. The time investment will depend upon the study group; a group that contains model organisms will be easier to work with than a group where there is (almost) no molecular data available. Some groups are difficult to work with because they contain substances that are co-isolated with DNA and interfere with enzymatic reactions; obtaining high quality DNA for these
may be essential (Small & al., 2004). For some groups, living material is unavailable, and the researcher depends upon silica-dried material, which can vary greatly in quality. It is also possible that intrinsic factors such as genome size and composition may impede development of markers because of additional complexity.
SIMPLE SCREENING
The simplest and most obvious choice for establishing a low-copy marker is to test existing primers that have worked in related study groups. Intuitively, one would guess that the probability of success rises with increasing relatedness. Although PCR may not work to begin with, it can frequently be improved by adjusting reaction conditions (see e.g., Small & al., 2004). Factors to consider include the DNA polymerase, primers (annealing temperatures and concentrations, especially if primers are degenerate), buffers (pH, KCl and MgCl2/MgSO4/(NH4)2SO4 concentrations), and the reaction volume (see e.g., Henegariu & al., 1997). Gradient cyclers are efficient for screening different annealing temperatures. Near-optimal PCR buffers (and PCR enhancers) can nowadays be tested by use of commercial PCR optimisation kits. Furthermore, “hot-start” PCR (e.g., Kellogg & al., 1994) and “touch-down” PCR (Don & al., 1991) may be useful tools for increasing PCR specificity and efficiency. Hot-start PCR prevents a premature start of the reaction by adding or activating an essential PCR component only after the initial denaturation step (e.g., the polymerase may be kept inactive by an antibody until then), whereas touch-down PCR works by decreasing the annealing temperature during the first few PCR cycles. It is noteworthy that every primer/template combination will have to be optimised independently.
Since one can essentially play endlessly with PCR programmes and reaction conditions, it is important to realise that after a certain time spent on screening and optimising PCR primers, other methods may become more time-efficient. Also, synthesis of new unlabelled primers is now relatively cheap and quick. It may therefore be preferable to design several new PCR primers rather than optimising every aspect of an inefficient PCR reaction set-up. This, however, requires that there be some sequence information available.
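The touch-down principle described above can be illustrated with a short Python sketch that lists the annealing temperature for each cycle. The function name and all numerical values (start at 65 °C, decrease by 1 °C per cycle to 55 °C, 30 cycles) are hypothetical and chosen only for illustration; suitable temperatures will have to be determined empirically for each primer/template combination.

```python
def touchdown_schedule(start_temp, final_temp, step, total_cycles):
    """Return the annealing temperature (deg C) used in each PCR cycle.

    The temperature is lowered by `step` after every cycle until
    `final_temp` is reached, and is then held for the remaining cycles.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if start_temp < final_temp:
        raise ValueError("start_temp must not be below final_temp")

    temps = []
    current = start_temp
    for _ in range(total_cycles):
        temps.append(round(current, 1))
        # Never drop below the final annealing temperature.
        current = max(final_temp, current - step)
    return temps


# Hypothetical programme: 65 -> 55 deg C in 1 deg C steps, 30 cycles in total.
schedule = touchdown_schedule(start_temp=65.0, final_temp=55.0,
                              step=1.0, total_cycles=30)
print(schedule)
```

Starting above the expected optimum favours specific priming in the early cycles; the later cycles at the lower temperature then amplify mainly the product formed under stringent conditions.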
ALTERNATIVE APPROACHES
All alternatives to screening primers from other study groups aim at obtaining new sequence information that will allow the design and testing of new PCR primers. For primer design guidelines, the reader is referred elsewhere (e.g., Strand & al., 1997; Rose & al., 1998, and abundant internet sources), and we limit ourselves to saying that predicted annealing temperatures may not match optimal temperatures as determined empirically. In many cases, as a first step for obtaining new sequences, it may be sufficient to isolate a partial sequence of a target marker and extend this sequence by inverse PCR (Ochman & al., 1988; Triglia & al., 1988) or, more efficiently, with techniques analogous to classical chromosome walking (e.g., so-called PCR walking) (Siebert & al., 1995; Padegimas & Reichert, 1998). There are now a number of alternative approaches, depending on the study group and the information available: (1) design of new primers from database sequences; (2) “fishing” with a gene probe; (3) isolation from fingerprints; and (4) construction of a cDNA library.
(1) Designing primers from accessions of the nucleotide database. — Some very useful ideas for this are given in Strand & al. (1997). The question here is which genes to use and from which plant groups. If sequences of nuclear genes or mRNA transcripts from closely related organisms are known, these may provide a good starting point, assuming that sequence divergence will be low enough for primers to bind in your organism of choice. Alternatively, homologous sequences from very divergent organisms may be aligned and consensus primers designed that may work in a wide variety of groups. Any information available, e.g., on molecular function, is useful to identify the gene of choice. For instance, key metabolic enzymes (such as glyceraldehyde 3-phosphate dehydrogenase) will be highly conserved but may also be present as different isoenzymes that are differently expressed in different tissues. Once a gene of interest has been identified, target regions can be selected.
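One possible way of selecting candidate regions for consensus primers from an alignment is sketched below in Python. For every gap-free window, the sketch counts how many distinct oligonucleotides would be needed to match all aligned sequences (the product of the number of different bases observed in each column). The function names and the toy alignment are hypothetical, and the window of 5 nucleotides is used only to keep the example short; real primers are considerably longer.

```python
from math import prod


def window_degeneracy(alignment, width):
    """Return (start, degeneracy) for every gap-free window of `width` columns.

    Degeneracy is the number of distinct oligonucleotides needed so that
    every aligned sequence is matched exactly within the window.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    # Set of bases observed in each alignment column.
    columns = [set(col) for col in zip(*alignment)]

    results = []
    for start in range(length - width + 1):
        window = columns[start:start + width]
        if any("-" in col for col in window):
            continue  # skip windows that contain alignment gaps
        results.append((start, prod(len(col) for col in window)))
    return results


def best_windows(alignment, width, max_degeneracy):
    """Keep only windows whose degeneracy does not exceed `max_degeneracy`."""
    return [(start, deg) for start, deg in window_degeneracy(alignment, width)
            if deg <= max_degeneracy]


# Hypothetical toy alignment of three homologous coding sequences.
alignment = [
    "ATGGCTAAGGTTGACATC",
    "ATGGCAAAGGTTGATATC",
    "ATGGCGAAGGTCGACATC",
]

candidates = best_windows(alignment, width=5, max_degeneracy=1)
print(candidates)
```

Windows with a degeneracy of 1 are identical in all sequences; allowing a higher `max_degeneracy` admits regions that would require degenerate primers.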
Since intron positions are often conserved across large evolutionary distances, their expected position within the coding sequence may be inferred from comparison with a relatively distantly related genomic sequence. However, as intron length can vary dramatically, the length of the expected PCR product is not always easy to predict. Moreover, long introns may result in PCR problems or even failure. If intron positions are unknown, it is advisable to design several primers in a short interval of sequence to allow for the possibility that some primers may not work because they are located at an intron-exon boundary. As noted above, it is not necessary as a first step to amplify intron-containing sequence, since it is often possible to extend the known sequence. More highly conserved regions of genes are generally better suited for primer design, especially if the starting sequences are distantly related. Besides considering site conservation in the alignment, it may be useful to consider the biochemistry of the protein encoded by your gene of interest. Structural and functional amino acid
residues in the protein, such as α-helices, β-sheets and catalytic residues in enzymes, respectively, are much more likely to be conserved than residues on the surface of the protein that face the solvent and do not interact with other macromolecules. Information on this may be obtained by inspecting protein structures deposited in the protein database (PDB, http://www.rcsb.org/pdb/) if available, by subjecting the protein’s amino acid sequence to secondary structure prediction programs or by examining literature on the molecular biology of the protein of interest. Primers will intuitively be made to the minimally degenerate regions in the nucleotide sequence alignment. Importantly, by introducing degenerate positions in a primer many different primers are created, effectively diluting the correctly matching primer, which may have to be compensated by adjusting the PCR primer concentrations. Some degeneracies can be circumvented by use of inosine as a degenerate base. An excellent approach to primer design based on amino acid sequence conservation is implemented in the program CODEHOP (Rose & al., 1998; Morant & al., 2002). Recently, Xu & al. (2004) undertook a promising attempt to obtain versatile candidate primers by comparing the rice and Arabidopsis genomes and filtering for highly conserved oligonucleotide stretches in regions of low genomic complexity. The obvious difficulty in using diverse database entries for primer design is that sequences usually do not meet all the desired criteria. (2) “Fishing” with a gene probe. — If a reasonably closely related model organism is available, one may attempt to isolate sequences of a target gene by means of sequence homology. 
The principle of this approach can be described as performing a solution hybridisation step of a gene probe from a model organism with genomic DNA from the organism of interest (similar to Southern blotting), isolating the hybrid molecule by magnetic bead capture and then directly cloning or selectively amplifying genomic regions that include homologies to the probe. For instance, one could PCR amplify exonic sequence from Arabidopsis with hapten (e.g., biotin or digoxigenin)-labelled primers or dNTPs and gel-purify the resulting labelled PCR product. DNA from your organism of choice is then digested with different restriction endonucleases (6-cutters), and adapters with a primer binding site are ligated to these fragments. After heat denaturation, hybridisation of the probe with adapter-ligated genomic DNA can be carried out in solution, and DNA heteroduplexes of labelled probe and genomic DNA fragment can be pulled out of the mixture by an affinity purification step using the hapten label (e.g., streptavidin-coated magnetic beads for biotin labels). The isolated genomic fragment can then be amplified using primers that are complementary to the adapter sequences, and the resulting fragments can be
cloned and sequenced. Similar enrichment methods have been successfully used for the isolation of microsatellites (e.g., Fischer & Bachmann, 1998), detection of bacterial DNA in soil samples (Jacobsen, 1995), or comprehensive cDNA library screening and enrichment (e.g., Lavery & al., 1997; Shepard & Rae, 1997; Hamaguchi & al., 1998). Although this technique has not been generally employed to isolate genes out of plant genomic DNA, Cheng & Armstrong (2002) demonstrate that this can be accomplished with relative ease. The efficiency of this approach for the isolation of low-copy genes will, among other things, depend upon the complexity of the genome of the organism of choice and the length of the homologous regions between probe and genomic DNA. Isolation of a single copy gene may therefore be difficult if the target genome is large. (3) Isolation of markers using fingerprint techniques. — It is possible to isolate SCAR (sequence characterised amplified regions) markers, which might be low-copy genes, from bands generated in fingerprinting techniques (see McLenachan & al., 2000; Bailey & al., 2004). Bands that are monomorphic across a study group may be excised from a fingerprinting gel (e.g., RAPD or AFLP), cloned and sequenced such that specific primers may then be designed. This approach has the advantage that no prior sequence information from the organism of interest is required. However, since large portions of the genome are non-coding, the molecular function of these markers will in many cases be unknown, and it is impossible to predict what level of variation will be encountered at such loci. Obviously, the randomness of fingerprinting patterns may mean that the probability for any band to be derived from a high-copy number sequence motif is high. Thus, many candidate bands may have to be evaluated. (4) Cloning of cDNA fragments. 
— Another method for isolating coding sequences is by generation of a cDNA library from mRNA (for protocols see Sambrook & Russell, 2001). It relies on the assumption that the majority of eukaryotic protein-coding genes will contain introns. This method is an option if fresh living plant material is at hand, although commercially available RNA preservation reagents may also be an option provided they work satisfactorily in the study group. The availability of commercial kits for mRNA isolation makes it possible to obtain mRNA relatively easily. RNA polymerase II-transcribed protein-coding genes produce mature mRNA that has a poly-A tail. Messenger RNA is isolated by virtue of this poly-A tail using complementary poly-T oligonucleotides. Using reverse transcriptase (an RNA-dependent DNA polymerase), a complementary DNA (cDNA) strand can be produced that is more stable than the original RNA strand. A second cDNA strand can be synthesised using an enzyme cocktail and primers (e.g.,
E. coli DNA polymerase I, DNA ligase and RNase H). Double stranded cDNA can be blunted by use of T4 DNA polymerase and used for adapter-ligation or digestion with restriction enzymes. Subsequently, cDNA can be used for cloning via blunt ends or restriction enzyme sites (internal or in the adapters), or PCR-amplified via primer binding sites in the adapter and then cloned. Similarly, cDNA-AFLPs (Bachem & al., 1996, 1998) are suitable for cloning of anonymous cDNA. Clones can then be screened for different inserts, the inserts sequenced and primers designed from these sequences. It should be noted, however, that a few highly expressed transcripts may account for a large proportion of cloned cDNAs. The advantage of this approach is that potentially a large set of protein-coding sequences can be isolated relatively quickly and without prior sequence information. The disadvantage is that it is more expensive, technically demanding, and it is limited by the availability of material from which intact mRNA can be isolated. However, while the generation of a cDNA library may seem daunting because of its apparent technical difficulty, the availability of commercial kits greatly simplifies the procedure and increases the probability of success.
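The effect of a few highly expressed transcripts on the yield of different inserts can be estimated with a simple calculation, sketched here in Python. It rests on the simplifying assumption (ours, for illustration only) that each clone is drawn independently and in proportion to transcript abundance; the expected number of distinct transcripts among n clones is then the sum over transcripts of 1 − (1 − p)^n, where p is the relative abundance of a transcript. The function name and the abundance values are hypothetical and not taken from any real library.

```python
def expected_distinct(abundances, n_clones):
    """Expected number of different transcripts among `n_clones` clones.

    Assumes each clone is sampled independently with probability
    proportional to transcript abundance (a simplifying assumption).
    """
    total = sum(abundances)
    return sum(1 - (1 - a / total) ** n_clones for a in abundances)


# Hypothetical libraries with 100 different transcripts each.
even_library = [1] * 100                 # all transcripts equally abundant
skewed_library = [100] * 5 + [1] * 95    # five transcripts dominate

for n in (10, 50, 200):
    print(n,
          round(expected_distinct(even_library, n), 1),
          round(expected_distinct(skewed_library, n), 1))
```

Under these assumptions, a skewed library yields fewer different inserts for the same number of clones screened, so more clones have to be examined to recover rare transcripts.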
CONCLUSIONS
Clearly, the choice of approach to characterise low-copy nuclear genes will be determined by the available material and information, the available equipment, budget and laboratory skill (Table 1). It is possible, of course, to combine the above approaches. They represent only the first step towards the characterisation of novel markers by providing sequence information from which to start primer design and analysis of candidate markers by PCR (see, e.g., Small & al., 2004). Obviously, questions such as the copy number and the variability that are actually encountered at a given locus will have to be examined in a second step before a marker can be successfully used in a comparative study (e.g., Sang, 2002; Small & al., 2004).
ACKNOWLEDGEMENTS
The authors gratefully acknowledge support by the Austrian Science Foundation (FWF) on project P16727-B03.
LITERATURE CITED Bachem, C. W. B., Oomen, R. J. F. J. & Visser, R. G. F. 1998. Transcript imaging with cDNA-AFLP: a step-by-step protocol. Pl. Molec. Biol. Rep. 16: 157–173. Bachem, C. W. B., van der Hoeven, R. S., de Bruijn, S. M., Vreugdenhil, D., Zabeau, M. & Visser, R. G. F. 1996. Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. Pl. J. 9: 745–753. Bailey, C. D., Hughes, C. E. & Harris, S. A. 2004. Using RAPDs to identify DNA sequence loci for species level phylogeny reconstruction: an example from Leucaena (Fabaceae). Syst. Bot. 29: 4–14. Betts, M. J., Guigó, R., Agarwal, P. & Russell, R. B. 2001. Exon structure conservation despite low sequence similarity: a relic of dramatic events in evolution? EMBO J. 20: 5354–5360. Carels, N. & Bernardi, G. 2000. Two classes of genes in plants. Genetics 154: 1819–1825. Cheng, D. W. & Armstrong, K. C. 2002. Direct capture and cloning of receptor kinase and peroxidase genes from genomic DNA. Genome 45: 977–983. Deutsch, M. & Long, M. 1999. Intron-exon structures of eukaryotic model organisms. Nucl. Acids Res. 27: 3219–3228. Don, R. H., Cox, P. T., Wainwright, B. J., Baker, K. & Mattick, J. S. 1991. “Touchdown” PCR to circumvent spurious priming during gene amplification. Nucl. Acids Res. 19: 4008. Fischer, D. & Bachmann, K. 1998. Microsatellite enrichment in organisms with large genomes (Allium cepa L.). Biotechniques 24: 796–802. Hamaguchi, M., O’Connor, E. A., Chen, T., Parnell, L., McCombie, R. W. & Wigler, M. H. 1998. Rapid isolation of cDNA by hybridization. Proc. Natl. Acad. Sci. U.S.A. 95: 3764–3769. Hare, M. P. 2001. Prospects for nuclear gene phylogeography. Trends Ecol. Evol. 16: 700–706. Henegariu, O., Heerema, N. A., Dlouhy, S. R., Vance, G. H. & Vogt, P. H. 1997. Multiplex PCR: critical parameters and step-by-step protocol. Biotechniques 23: 504–511.
Table 1. Comparison of resources needed for methodological approaches for characterising nuclear genes (per single experiment).
Approach                  Primers & PCR               Gene “fishing”   SCARs                          cDNA cloning
Time                      +                           +(+)             ++                             ++(+)
Cost per experiment       +                           ++               +(+)                           +++
Laboratory skill          +                           ++               ++                             ++(+)
Probability of success    ?*                          ?*               +                              ++
Comments                  usually many trials needed                   bias towards high copy number  many potential markers; requires mRNA
*Dependent upon amount and quality of sequence information available.
Jacobsen, C. S. 1995. Microscale detection of specific bacterial DNA in soil with a magnetic capture-hybridization and PCR amplification assay. Appl. Env. Microbiol. 61: 3347–3352. Jager, M., Hassanin, A., Manuel, M., le Guyader, H. & Deutsch, J. 2003. MADS-box genes in Ginkgo biloba and the evolution of the AGAMOUS family. Molec. Biol. Evol. 20: 842–854. Kellogg, D. E., Rybalkin, I., Chen, S., Mukhanedova, N., Vlasik, T., Siebert, P. D. & Chenchik, A. 1994. TaqStart antibody: “hot start” PCR facilitated by a neutralizing monoclonal antibody directed against Taq DNA polymerase. Biotechniques 16: 1134–1137. Lavery, D. J., Lopez-Molina, L., Fleury-Olela, F. & Schibler, U. 1997. Selective amplification via biotin- and restriction-mediated enrichment (SABRE), a novel selective amplification procedure of differentially expressed mRNAs. Proc. Natl. Acad. Sci. U.S.A. 94: 6831–6836. McLenachan, P. A., Stöckler, K., Winkworth, R. C., McBreen, K., Zauner, S. & Lockhart, P. J. 2000. Markers derived from amplified fragment length polymorphism gels for plant ecology and evolution studies. Molec. Ecol. 9: 1899–1903. Morant, M., Hehn, A. & Werck-Reichhart, D. 2002. Conservation and diversity of gene families explored using the CODEHOP strategy in higher plants. BMC Plant Biol. 2: 7. Mort, M. E. & Crawford, D. J. 2004. The continuing search: low-copy nuclear sequences for lower-level plant molecular phylogenetic studies. Taxon 53: 257–261. Ochman, H., Gerber, A. S. & Hartl, D. L. 1988. Genetic applications of an inverse polymerase chain reaction. Genetics 120: 621–623. Padegimas, L. S. & Reichert, N. A. 1998. Adapter ligation-based polymerase chain reaction-mediated walking. Anal. Biochem. 260: 149–153. Rokas, A., Kathirithamby, J. & Holland, P. W. H. 1999. Intron insertion as a phylogenetic character: the engrailed homeobox of Strepsiptera does not indicate affinity with Diptera. Insect Molec. Biol. 8: 527–530. Rose, T. M., Schultz, E. R., Henikoff, J. G., Pietrokovski, S., McCallum, C. M. & Henikoff, S. 1998. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucl. Acids Res. 26: 1628–1635. Sambrook, J. & Russell, D. W. 2001. Molecular Cloning: a Laboratory Manual, ed. 3. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. Sánchez, D., Ganfornina, M. D., Gutiérrez, G. & Marín, A. 2003. Exon-intron structure and evolution of the lipocalin gene family. Molec. Biol. Evol. 20: 775–783. Sang, T. 2002. Utility of low-copy nuclear gene sequences in plant phylogenetics. Crit. Rev. Biochem. Molec. Biol. 37: 121–147. Schaal, B. A. & Leverich, W. J. 2001. Plant population biology and systematics. Taxon 50: 679–695. Schaal, B. A. & Olsen, K. M. 2000. Gene genealogies and population variation in plants. Proc. Natl. Acad. Sci. U.S.A. 97: 7024–7029. Shepard, A. R. & Rae, J. L. 1997. Magnetic bead capture of cDNAs from double-stranded plasmid cDNA libraries. Nucl. Acids Res. 25: 3183–3185.
Siebert, P. D., Chenchik, A., Kellogg, D. E., Lukyanov, K. A. & Lukyanov, S. A. 1995. An improved PCR method for walking in uncloned genomic DNA. Nucl. Acids Res. 23: 1087–1088. Small, R. L., Cronn, R. C. & Wendel, J. F. 2004. Use of nuclear genes for phylogeny reconstruction in plants. Austr. Syst. Bot. 17: 145–170. Strand, A. E., Leebens-Mack, J. & Milligan, B. G. 1997. Nuclear DNA-based markers for plant evolutionary biology. Molec. Ecol. 6: 113–118. Triglia, T., Peterson, M. G. P. & Kemp, D. J. 1988. A procedure for in vitro amplification of DNA segments that lie outside the boundaries of known sequences. Nucl. Acids Res. 16: 8186. Wada, H., Kobayashi, M., Sato, R., Satoh, N., Miyasaka, H. & Shirayama, Y. 2002. Dynamic insertion-deletion of introns in deuterostome EF-1α genes. J. Molec. Evol. 54: 118–128. Wendel, J. F., Cronn, R. C., Álvarez, I., Liu, B., Small, R. L. & Senchina, D. S. 2002. Intron size and genome size in plants. Molec. Biol. Evol. 19: 2346–2352. Xu, W., Briggs, W. J., Padolina, J., Timme, R. E., Liu, W., Linder, C. R. & Miranker, D. P. 2004. Using MoBIoS’ scalable genome joins to find conserved primer pair candidates between two genomes. Bioinformatics 20 Suppl. 1: i355–i362. Zhang, D.-X. & Hewitt, G. M. 2003. Nuclear DNA analyses in genetic studies of populations: practice, problems and prospects. Molec. Ecol. 12: 563–584.