Fazekas & al. • Sequence quality of mononucleotide repeats
TAXON 59 (3) • June 2010: 694–697
Stopping the stutter: Improvements in sequence quality from regions with mononucleotide repeats can increase the usefulness of non-coding regions for DNA barcoding

Aron J. Fazekas,1 Royce Steeves,1 Steven G. Newmaster1 & Peter M. Hollingsworth2

1 Department of Integrative Biology, University of Guelph, Guelph, Ontario, N1G 2W1 Canada
2 Royal Botanic Garden Edinburgh, Edinburgh EH3 5LR, U.K.

Author for correspondence: Aron Fazekas, [email protected]

Keywords chloroplast microsatellite; cpSSR; DNA barcoding; mononucleotide repeat; psbA-trnH; trnH-psbA
DNA barcoding is now well established in animals based on sequences from a 648 base pair (bp) portion of the mitochondrial coding gene cytochrome oxidase 1 (CO1) (Hebert & al., 2003). In contrast, it has proved more difficult to identify a suitable DNA barcoding locus (or loci) for land plants (reviewed by Hollingsworth, 2008). The generally low substitution rate of plant mitochondrial DNA (e.g., Fazekas & al., 2008) has led to investigations into the relative performance and information content of different loci from the plastid genome as alternative DNA barcodes for plants (e.g., Chase & al., 2007; Kress & Erickson, 2007; Fazekas & al., 2008; Lahaye & al., 2008; CBOL Plant Working Group, 2009; Ford & al., 2009; Hollingsworth & al., 2009). At the third International Barcode of Life Conference in Mexico City (November 2009), the Consortium for the Barcode of Life (CBOL) announced that the standard core DNA barcode for land plants would consist of a two-locus combination comprising portions of the protein-coding plastid genes rbcL and matK, to be supplemented with additional loci as required. In making this recommendation, two potential problems were recognized. Firstly, successful amplification and sequencing of the matK barcoding region can be difficult in some taxa with existing primers, and further primer and protocol development is required for this locus. Secondly, the rbcL + matK barcode will not lead to 100% species discrimination in many plant groups, and additional loci beyond this core barcode will be needed to increase levels of species discrimination in these cases. In recognition of these problems, an 18-month review period on the performance of the rbcL + matK barcode has been established (completion due in mid-2011).
During this review period, the executive committee of CBOL recommended that the plant barcoding community continue to collect data from other strongly performing candidate barcoding loci such as the non-coding plastid intergenic spacer trnH-psbA and the internal transcribed spacers (ITS) of nuclear ribosomal DNA. One concern that has been raised regarding the use of non-coding plastid regions such as trnH-psbA as DNA barcodes is the presence of microsatellite repeats which can make it difficult to obtain bi-directional sequences in some samples (Fazekas & al., 2008; CBOL Plant Working Group, 2009; Devey & al., 2009). Verification of the sequenced strand through bidirectional sequencing is desirable for the maintenance of data
quality standards, as recommended by the CBOL Database Working Group for sequences destined to receive annotation with the reserved keyword ‘Barcode’ by the INSDC (DDBJ, EMBL & GenBank). However, PCR amplicons derived from regions containing microsatellites can produce poor quality sequence chromatograms. Slippage of the polymerase at microsatellite regions during PCR results in a ‘stutter’ effect observed in the chromatogram, with decreasing sequence quality associated with increasing repeat length. The reduction in sequence quality from regions with mononucleotide runs with less than ten repeats is generally moderate and ambiguous base calls by the software are usually few, requiring little editing by hand. However, as the repeat number increases, the number of ambiguous bases increases disproportionately. This results in a longer amount of time required for editing (with a corresponding reduction in confidence of the true sequence) to the point where sequence data cannot be used at all past the repeat. The net effect is often a contig with overlapping data only at the repeat, and a shortened read length due to missing data at the ends (Fig. 1). In more extreme cases, where multiple mononucleotide repeats occur within a given amplicon, disruption of forward and reverse sequencing reads at different mononucleotide runs can lead to only partial sequences of the region with no overlap, thus preventing the construction of a sequence contig. The problem caused by mononucleotide repeats in non-coding regions was noted by the CBOL Plant Working Group (2009) when evaluating the attributes of different barcoding loci for plants, and this contributed towards the selection of an entirely coding core plant barcode. Fixing the ‘mononucleotide repeat problem’ was considered less tractable than the challenge of improving primer universality for matK. 
Recently, however, an evaluation of PCR methods and DNA polymerases, focusing on the reduction of the stutter effects resulting from mononucleotide repeats, has demonstrated that the use of particular polymerases can reduce the effect of slipped strand mispairing in PCR (Fazekas & al., 2010). This recent study tested the performance of different PCR profiles, reaction chemistries and polymerases on sequences from the trnH-psbA spacer from 25 plant samples. These samples were specifically selected because they contain mononucleotide repeats which had previously resulted in ambiguous base calls and low-quality sequence traces with conventional PCR and
Fig. 1. Forward (top) and reverse (bottom) sequencing trace files illustrating the disruption to sequencing reads by a mononucleotide repeat.
sequencing approaches. A key finding was that two of the five tested DNA polymerases (Phusion: Finnzymes, Espoo, Finland; Herculase II fusion: Agilent, Santa Clara, California, U.S.A.) were regularly able to produce improved quality sequence reads through mononucleotide repeats up to 13 bp long. These enzymes also led to an overall improvement in the percentage of bases above a minimum sequence quality threshold of QV > 20 from samples containing repeats up to 14–15 bp, albeit with some samples still showing early termination of sequencing reads when repeats of this length were reached. In the two samples that possessed repeats greater than 15 bp, the sequence quality was not improved. These results are promising and provide optimism that further advances in polymerase technology will reduce the effect of mononucleotide repeats on sequence quality. However, they also indicate that there is at present an upper limit to the utility of these polymerases in addressing this problem. Given this upper limit to the number of repeats it is possible to sequence through, an obvious question is: To what extent will this improvement in sequence quality impact the overall ability to barcode regions that are prone to mononucleotide repeats? In order to assess the potential impact, we examined the frequency distribution of mononucleotide repeat size classes in the trnH-psbA spacer from four previously published barcoding datasets (Kress & Erickson, 2007; Fazekas & al., 2008; Gonzalez & al., 2009; Kress & al., 2009). Specifically, we compared the proportion of successfully sequenced samples in these datasets that have repeats in the 10–13 bp range with the proportion that have repeats of 14 bp or greater (Table 1). This interval of 10–13 repeats is the range for which Fazekas & al. (2010) reported the greatest improvement in sequence quality.
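The size-class comparison described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the regular expression and the toy sequences are assumptions made purely for illustration, and only the size classes (10–13, 14–15, >15 bp) come from the text. The sketch finds the longest single-base run in each sequence and reports the percentage of sequences falling into each class.

```python
import re
from collections import Counter

# Matches a run of one repeated nucleotide. Ambiguity codes such as N are
# not matched, so they interrupt a run (an assumption of this sketch).
_RUN = re.compile(r"A+|C+|G+|T+")

SIZE_CLASSES = ("<10", "10-13", "14-15", ">15")


def max_mononucleotide_run(seq: str) -> int:
    """Return the length of the longest single-base run (0 if none)."""
    runs = _RUN.findall(seq.upper())
    return max((len(run) for run in runs), default=0)


def size_class(max_run: int) -> str:
    """Assign a maximum repeat length to one of the size classes."""
    if max_run < 10:
        return "<10"
    if max_run <= 13:
        return "10-13"
    if max_run <= 15:
        return "14-15"
    return ">15"


def bin_percentages(seqs: list[str]) -> dict[str, float]:
    """Percentage of sequences whose longest run falls in each size class."""
    if not seqs:
        raise ValueError("at least one sequence is required")
    counts = Counter(size_class(max_mononucleotide_run(s)) for s in seqs)
    return {name: 100.0 * counts[name] / len(seqs) for name in SIZE_CLASSES}


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from any of the four datasets.
    toy = ["ACG" + "T" * 12 + "ACG", "ACGTACGT", "G" + "A" * 14 + "C", "C" * 20]
    print(bin_percentages(toy))
```

In practice the input would be the trnH-psbA sequences of a dataset, one string per sample; reading them from GenBank or FASTA files is left out of the sketch.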
Our analysis of these datasets shows that between 22% and 54% of sequences in the different datasets possess mononucleotide repeat runs in the size range that can cause problems for sequencing reads (i.e., ≥10 bp). Of these, most have a maximum repeat length in the range of 10–13 bp (between 21% and 41% of samples in the different datasets; 36% over all samples). One to eight percent of samples (5% overall) have repeats with a maximum length of 14 or 15 bp and 0% to 6% (4% overall) have repeat lengths >15 bp. Based on these datasets it appears that
the use of alternative enzymes such as Phusion and Herculase II fusion may reduce the effect of stutter in a large proportion of cases. A rough extrapolation from the results of Fazekas & al. (2010) indicates that the residual ‘more difficult to improve’ problem of repeats >13 bp would be restricted to between 1% and 13% of samples (8% overall), with only 0% to 6% of samples falling into the category of microsatellite lengths >15 bp, which showed no improvement at all with the use of these alternative enzymes (4% overall). Two caveats apply to these extrapolations from the results of Fazekas & al. (2010). Firstly, the quality of a given sequence will depend not only on the length of a given mononucleotide repeat, but also on the base composition of the surrounding sequence, the absolute number of mononucleotide arrays per amplicon and potentially the position of the repeat(s) in relation to the sequencing primers. These elements may act as confounding variables in estimating the improvement in performance when using different enzymes. Secondly, the frequency of long mononucleotide repeats may be underestimated in the datasets described in Table 1; if the presence of such repeats results in complete sequencing failure, these samples would by definition be excluded from our analyses. However, given the generally low percentages of overall failure for trnH-psbA in these datasets, the extent of this latter problem is likely to be small. Overall, the results of Fazekas & al. (2010) show the potential for improvement in sequence read quality from regions containing mononucleotide repeats. The additional cost of using next-generation polymerases is relatively modest and, based on 2010 prices at the Canadian Centre for DNA Barcoding, equates to approximately an additional $25 US per 96-well plate.
This additional cost is minimal in comparison with the potential for wasted reagents and sequencing costs, as well as time spent on extensive sequence editing. It should be stressed that the approach is not a ‘silver bullet’, and there are of course other facets to the ‘coding versus non-coding loci’ debate that go beyond the presence of mononucleotide repeats (e.g., the automation of sequence assembly and quality control checks are facilitated by coding loci that possess conserved length and the ability to translate sequences to check
Table 1. Frequency distribution of the maximum mononucleotide repeat length among sequences of the trnH-psbA intergenic spacer in four different datasets, and the distribution of all sequences combined.

                          Kress & al.   Kress & Erickson   Gonzalez & al.   Fazekas & al.   Combined
                          (2009)        (2007)             (2009)           (2008)          data
Sample size               296a          96                 413–415b         251
PCR/sequencing success    94%           96%                89%              99%
No. sequences obtainedc   280           92                 369              249             990

% of sequences with a given maximum repeat length
Repeat length (bp)
10                        17.5%         10.9%              20.1%            11.6%           16.4%
11                        12.9%         3.3%               10.3%            6.8%            9.5%
12                        6.4%          3.3%               7.9%             5.2%            6.4%
13                        3.2%          3.3%               2.2%             6.0%            3.5%
14                        1.8%          1.1%               4.9%             2.8%            3.1%
15                        2.5%          0.0%               2.7%             0.8%            1.9%
16                        1.8%          0.0%               2.7%             0.4%            1.6%
17                        0.7%          0.0%               0.8%             0.0%            0.5%
18                        0.4%          0.0%               1.4%             0.0%            0.6%
19                        0.4%          0.0%               0.3%             0.0%            0.2%
20                        0.4%          0.0%               0.0%             0.0%            0.1%
>20                       0.4%          0.0%               0.3%             0.0%            0.2%

% of sequences per ‘maximum repeat size’ bin
≥10                       48.4%         21.9%              53.6%            33.6%           44.0%
10–13                     40.0%         20.8%              40.5%            29.6%           35.8%
14–15                     4.3%          1.1%               7.6%             3.6%            5.1%
>15                       4.1%          0.0%               5.5%             0.4%            3.2%
a Kress & al. (2009) investigated 1035 samples; however, our analyses are restricted to the 280 sequences of trnH-psbA that are available on GenBank.
b Gonzalez & al. (2009) do not report the sample size for trnH-psbA, but they do report sequencing success. The sample size range is inferred from the reported sequencing success and the number of sequences deposited in GenBank.
c The frequency distribution of SSRs for each dataset is based on the following GenBank accessions: Kress & al. (2009): GQ982133–GQ982412; Kress & Erickson (2007): DQ006131, DQ006132, DQ006157, DQ006176, DQ006177, DQ006179, DQ006207, EF590667–EF590751; Gonzalez & al. (2009): FJ038839–FJ038860, FJ038863–FJ038887, FJ038889–FJ038908, FJ038910–FJ038917, FJ038919–FJ038930, FJ038932, FJ038933, FJ038935–FJ038941, FJ038943–FJ038969, FJ038971–FJ038974, FJ038976–FJ038984, FJ038986, FJ038988–FJ039019, FJ039021–FJ039028, FJ039030–FJ039040, FJ039042–FJ039079, FJ039081–FJ039085, FJ039087–FJ039094, GQ428645–GQ428774; Fazekas & al. (2008): EU750427–EU750675.
for base calling errors and pseudogenes; Ford & al., 2009). However, the results of Fazekas & al. (2010) do show that the magnitude of the problems caused by mononucleotide repeats can be reduced in non-coding barcoding loci such as trnH-psbA, and this can be expected to lead to a higher proportion of bi-directional sequencing reads and greater confidence in the scoring of trace files.
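As a cross-check on the size-class percentages quoted from Table 1, the bins can be re-derived by summing the per-length rows. The short Python sketch below does this for the ‘Combined data’ column only. The dictionary and function names are illustrative assumptions. The 10 bp value is taken to be 16.4%, the reading under which the per-length rows sum to the 44.0% total for repeats of 10 bp or more.

```python
# Percent of sequences per maximum repeat length (bp), transcribed from the
# "Combined data" column of Table 1.
combined = {
    "10": 16.4, "11": 9.5, "12": 6.4, "13": 3.5,
    "14": 3.1, "15": 1.9,
    "16": 1.6, "17": 0.5, "18": 0.6, "19": 0.2, "20": 0.1, ">20": 0.2,
}


def bin_total(lengths):
    """Sum the percentages for the given repeat lengths, to one decimal place."""
    return round(sum(combined[key] for key in lengths), 1)


bins = {
    "10-13": bin_total(["10", "11", "12", "13"]),
    "14-15": bin_total(["14", "15"]),
    ">15": bin_total(["16", "17", "18", "19", "20", ">20"]),
}
# All repeats of 10 bp or more, and the residual class above 13 bp.
bins[">=10"] = bin_total(list(combined))
bins[">13"] = round(bins["14-15"] + bins[">15"], 1)

if __name__ == "__main__":
    for name, value in bins.items():
        print(f"{name}: {value}%")
```

The re-derived 14–15 bp bin (5.0%) differs from the tabulated 5.1% by one decimal unit, which is consistent with the per-length rows having been rounded independently. The sum for repeats >13 bp (8.2%) matches the 8% quoted in the text.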
Literature cited

CBOL Plant Working Group. 2009. A DNA barcode for land plants. Proc. Natl. Acad. Sci. U.S.A. 106: 12794–12797.
Chase, M.W., Cowan, R.S., Hollingsworth, P.M., Van den Berg, C., Madrinan, S., Petersen, G., Seberg, O., Jorgensen, T., Cameron, K.M., Carine, M., Pedersen, N., Hedderson, T.A.J., Conrad, F., Salazar, G.A., Richardson, J.E., Hollingsworth, M.L., Barraclough, T.G., Kelly, L. & Wilkinson, M. 2007. A proposal for a standardised protocol to barcode all land plants. Taxon 56: 295–299.
Devey, D.S., Chase, M.W. & Clarkson, J.J. 2009. A stuttering start to plant DNA barcoding: Microsatellites present a previously overlooked problem in non-coding plastid regions. Taxon 58: 7–15.
Fazekas, A.J., Burgess, K.S., Kesanakurti, P.R., Graham, S.W., Newmaster, S.G., Husband, B.C., Percy, D.M., Hajibabaei, M. & Barrett, S.C.H. 2008. Multiple multilocus DNA barcodes from the plastid genome discriminate plant species equally well. PLoS ONE 3: e2802.
Fazekas, A.J., Steeves, R. & Newmaster, S.G. 2010. Improving sequencing quality from PCR products containing long mononucleotide repeats. Biotechniques 48: 277–285.
Ford, C.S., Ayres, K.L., Haider, N., Toomey, N., Van Alpen Stohl, J., Kelly, L., Wilstöm, N., Hollingsworth, P.M., Duff, R.J., Hoot, S.B., Cowan, R.S., Chase, M.W. & Wilkinson, M.J. 2009. Selection of candidate coding DNA barcoding regions for use on land plants. Bot. J. Linn. Soc. 159: 1–11.
Gonzalez, M.A., Baraloto, C., Engel, J., Mori, S.A., Pétronelli, P., Riéra, B., Roger, A., Thébaud, C. & Chave, J. 2009. Identification of Amazonian trees with DNA barcodes. PLoS ONE 4: e7483.
Hebert, P.D.N., Cywinska, A., Ball, S.R. & de Waard, J.R. 2003. Biological identifications through DNA barcodes. Proc. Roy. Soc. London, Ser. B, Biol. Sci. 270: 313–321.
Hollingsworth, M.L., Clark, A.A., Forrest, L.L., Richardson, J., Pennington, R.T., Long, D.G., Cowan, R., Chase, M.W., Gaudeul, M. & Hollingsworth, P.M. 2009. Selecting barcoding loci for plants: Evaluation of seven candidate loci with species-level sampling in three divergent groups of land plants. Molec. Ecol. Res. 9: 439–457.
Hollingsworth, P.M. 2008. DNA barcoding plants in biodiversity hotspots: Progress and outstanding questions. Heredity 101: 1–2.
Kress, W.J. & Erickson, D.L. 2007. A two-locus global DNA barcode for land plants: The coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS ONE 2: e508.
Kress, W.J., Erickson, D.L., Jones, F.A., Swenson, N.G., Perez, R., Sanjur, O. & Bermingham, E. 2009. Plant DNA barcodes and a community phylogeny of a tropical forest dynamics plot in Panama. Proc. Natl. Acad. Sci. U.S.A. 106: 18621–18626.
Lahaye, R., van der Bank, M., Bogarin, D., Warner, J., Pupulin, F., Gigot, G., Maurin, O., Duthoit, S., Barraclough, T.G. & Savolainen, V. 2008. DNA barcoding the floras of biodiversity hotspots. Proc. Natl. Acad. Sci. U.S.A. 105: 2923–2928.