LETTER
Degenerate Primer IDs and the Birthday Problem
E1330 | PNAS | May 22, 2012 | vol. 109 | no. 21

In PNAS, Jabara et al. (1) proposed an important approach to correct for recombination, allelic skewing, misincorporations, and sequencing errors in deep sequencing technologies. Their approach relies on each viral RNA template receiving a unique, randomly generated “Primer ID.” They were thus able to quantify template resampling by counting the repeated detection of each unique ID. Furthermore, they were able to exploit template resampling to correct for PCR errors, sequencing errors, and allelic skewing by only retaining consensus sequences where at least three reads shared a Primer ID.

To attach a unique Primer ID to each template of a pool of ∼10,000 viral copies, Jabara et al. (1) used primers that included an 8-bp string of degenerate bases, which results in 4⁸ (65,536) different possibilities. However, selecting 10,000 unique Primer IDs from a pool of 4⁸ possibilities is extremely unlikely, with a probability of 2.67 × 10⁻³⁵⁰. Synthesizing 10,000 primers with an 8-bp degenerate region will, on average, result in ∼726 Primer ID “collisions,” where more than one template is tagged with the same random Primer ID. Only ∼8,585 of the templates would be tagged with a unique ID. This phenomenon is perhaps best illustrated by the “Birthday Problem,” which asks how many people must be present in order for it to be more likely than not that at least 2 will share the same birthday. The answer, quite surprisingly, is only 23.

The occurrence of Primer ID collisions could influence estimates of false diversity that would otherwise be attributed to PCR or sequencing error. For Jabara et al. (1), only ∼2,000 of the initial viral copies were reflected in their final consensus sequences, which ameliorated the problem. Assuming no nucleotide frequency biases in the degenerate tags, on average, ∼30 of 2,000 consensus sequences would not have been derived from a single viral template. Although consensus sequences derived from all sequences with the same ID should be robust to Primer ID collisions, there will be an avoidable loss of depth. Although it was fortunate that only ∼2,000 of the 10,000 viral RNA copies that were input were reflected in the final sequences, we caution that as depth increases, so does the impact of insufficiently long Primer IDs. We anticipate that this important method will be widely adopted, and therefore encourage careful consideration when selecting the length of the Primer ID.

ACKNOWLEDGMENTS. D.J.S. was supported by a grant from the National Research Foundation of South Africa. B.M. was supported by Europeaid Grant Sante/2007/174-790 from the European Commission.
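The collision figures quoted in this letter can be checked numerically. The Python sketch below is not part of the original letter. It assumes that each template draws its Primer ID independently and uniformly from the 4^8 = 65,536 possible tags (no nucleotide bias), and it counts "collisions" as the number of templates minus the expected number of distinct IDs drawn, an interpretation that is consistent with the ∼726 figure above. All function names are illustrative.

```python
import math


def log10_prob_all_unique(n_templates, n_ids):
    """log10 of the probability that n_templates uniform random IDs are all distinct."""
    if n_templates > n_ids:
        return float("-inf")  # pigeonhole: at least one ID must repeat
    # P = prod_{i=0}^{n-1} (1 - i/N); sum the logs to avoid floating-point underflow.
    log_p = sum(math.log1p(-i / n_ids) for i in range(n_templates))
    return log_p / math.log(10)


def expected_distinct_ids(n_templates, n_ids):
    """Expected number of different IDs drawn: N * (1 - (1 - 1/N)^n)."""
    return -n_ids * math.expm1(n_templates * math.log1p(-1.0 / n_ids))


def expected_collisions(n_templates, n_ids):
    """Assumed definition: templates minus expected distinct IDs."""
    return n_templates - expected_distinct_ids(n_templates, n_ids)


def expected_uniquely_tagged(n_templates, n_ids):
    """Expected number of templates whose ID is shared with no other template."""
    return n_templates * (1.0 - 1.0 / n_ids) ** (n_templates - 1)


def smallest_group_with_likely_collision(n_ids, threshold=0.5):
    """Classic Birthday Problem: smallest group size with P(shared ID) > threshold."""
    p_unique = 1.0
    k = 0
    while 1.0 - p_unique <= threshold:
        p_unique *= 1.0 - k / n_ids  # probability that k + 1 draws are all distinct
        k += 1
    return k


if __name__ == "__main__":
    n_ids = 4 ** 8  # 8 degenerate bases
    print("log10 P(all unique):", log10_prob_all_unique(10_000, n_ids))
    print("expected collisions:", expected_collisions(10_000, n_ids))
    print("uniquely tagged:", expected_uniquely_tagged(10_000, n_ids))
    print("collisions among 2,000:", expected_collisions(2_000, n_ids))
    print("birthday group size:", smallest_group_with_likely_collision(365))
    # Effect of lengthening the degenerate region for 10,000 templates.
    for length in range(8, 13):
        print(length, "bp:", expected_collisions(10_000, 4 ** length))
```

Run as a script, the final loop gives a rough sense of how quickly the expected number of collisions falls as the degenerate region is lengthened, which may help when choosing a Primer ID length for a given input template number.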
Daniel J. Shewardᵃ,¹, Ben Murrellᵇ,ᶜ, and Carolyn Williamsonᵃ

ᵃDivision of Virology, Institute of Infectious Diseases and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, Western Cape 7925, South Africa; ᵇBiomedical Informatics Research Division, eHealth Research and Innovation Platform, Medical Research Council, Cape Town, Western Cape 7505, South Africa; and ᶜComputer Science Division, Department of Mathematical Sciences, University of Stellenbosch, Stellenbosch 7600, South Africa

1. Jabara CB, Jones CD, Roach J, Anderson JA, Swanstrom R (2011) Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID. Proc Natl Acad Sci USA 108:20166–20171.
Author contributions: D.J.S. designed research; D.J.S. and B.M. performed research; B.M. contributed new reagents/analytic tools; D.J.S. and B.M. analyzed data; and D.J.S., B.M., and C.W. wrote the paper.

The authors declare no conflict of interest.

¹To whom correspondence should be addressed. E-mail: [email protected].
www.pnas.org/cgi/doi/10.1073/pnas.1203613109