Benchmarks
Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology
Olivier Harismendy and Kelly A. Frazer
Scripps Genomic Medicine, Scripps Translational Science Institute, Scripps Research Institute, La Jolla, CA, USA
BioTechniques 46:229-231 (March 2009) doi 10.2144/000113082
Keywords: next-generation sequencing; coverage; long-range PCR
One approach for high-throughput population-based sequencing of targeted intervals in the human genome is to amplify the regions using long-range PCR (LR-PCR) followed by sequencing with next-generation sequencing (NGS) technologies. Utilizing this method, we have observed that the 50 bp located at the amplicon ends account for more than 50% of the sequenced bases and that the sequence coverage depth of base pairs within an amplicon is highly variable. Here we propose an explanation for the overrepresentation of the amplicon ends and show that the use of 5′-blocked primers for the LR-PCR reaction reduces their overrepresentation. Furthermore, we demonstrate that using a 600-bp library insert size rather than the standard 200-bp insert size results in more uniform sequence coverage depth. The capability to increase sequence coverage uniformity greatly improves the effective throughput of NGS platforms.
The use of next-generation sequencing (NGS) platforms for population-based sequencing of targeted genomic intervals will enable the examination of genetic variants across the allele frequency spectrum for association with diseases (1,2). NGS technologies achieve their best base-calling accuracy and variant-finding sensitivity when sequence coverage is high and uniform. Using long-range PCR (LR-PCR) to amplify targeted genomic intervals followed by sequencing on the Illumina Genome Analyzer (GA) (San Diego, CA, USA), we and others (3,4) have noted an overrepresentation of the amplicon ends, and in particular, that the 50 bp located at the amplicon ends can account for more than 50% of the sequenced bases. We also noted that the sequence coverage depth of base pairs within the amplicon (and thus present in equimolar amounts in the starting sample material) is highly variable. This per-base sequencing coverage variability is known to be an important issue in next-generation sequencing and is observed regardless of the organism or the type of input material (5–7). These artifacts not only waste sequencing yield but also decrease the expected average coverage depth across the targeted interval and thereby impact data quality.
Prior to sequencing on the Illumina GA, the LR-PCR amplicons were fragmented to ∼200 bp, ligated to linkers, and amplified through linker-mediated PCR. We reasoned that the overrepresentation of the ends was a result of the sample preparation method, as nucleotides located at amplicon ends are present at the extremities of the 200-bp fragments more frequently than a random internal nucleotide. To avoid overrepresentation of the amplicon ends, we tested the use of 5′-blocked primers in the LR-PCR to prevent their ligation to the linkers. Six genomic intervals (size range 3129–10,989 bp) were amplified from DNA sample NA17460 obtained from the Coriell Institute for Medical Research
(Camden, NJ, USA). We performed the LR-PCR reactions using 30 ng of genomic DNA, 0.5 μM forward LR-PCR primers, and 0.5 μM reverse LR-PCR primers in a total reaction volume of 12 μL, as described (8). The primers were ordered from IDT Technologies (Coralville, IA, USA) without modification (unblocked) or with a 5′ modification of either Amino Modifier C6 (NH2-blocked) or C3 spacer (C3-blocked). Following LR-PCR, the 6 amplicons generated with one type of primer were quantified and combined in equimolar amounts prior to fragmentation and purification. The Illumina GA libraries were prepared according to the manufacturer’s instructions except for the following steps. Fragmentation was performed enzymatically by incubating 1 μg of the equimolar pooled amplicons for 25 min at 37°C with 0.05 U of DNase I, resulting in digestion to the 170- to 250-bp fragment size range. Modified adaptors were used in order to add a 4-nucleotide barcode at the 5′ end of the library fragments as described by Craig et al. (9) with the following modifications: a 4-bp barcode with two constant bases and two variable bases (CNNT) was used; and both oligonucleotides were mixed at 100 μM in TE pH 8.0, heated for 5 min at 95°C, and annealed by slowly cooling to 4°C over 12 h. Each of the three libraries [corresponding to the different LR-PCR primer types (unblocked, NH2-blocked or C3-blocked) used to generate the amplicons] received a different index. Following adaptor ligation, we selected a fragment size of ∼200 bp by gel extraction and enriched for adaptor-ligated fragments using manufacturer primer sequences and the following PCR conditions: 18 cycles of 30 s at 98°C, 20 s at 65°C, 15 s at 72°C, 15 s at 72°C; and 5 min at 72°C.
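The CNNT barcode design described above allows 16 distinct indexes, and reads carrying the barcode at their 5′ end can be assigned to a library and trimmed in a single pass. The following Python sketch illustrates the idea under stated assumptions: reads are plain strings, only exact barcode matches are accepted, and the barcode-to-library assignment and example reads are hypothetical (the barcodes actually used for each library are not given here). It is not the authors' software.

```python
from itertools import product


def cnnt_barcodes():
    """Return all 16 four-base barcodes matching the CNNT pattern."""
    return ["C" + first + second + "T" for first, second in product("ACGT", repeat=2)]


def demultiplex(reads, barcode_to_library, barcode_len=4):
    """Split reads by their 5' barcode and strip the barcode from each read.

    Reads whose first `barcode_len` bases match no known barcode are
    returned untrimmed in a separate list.
    """
    bins = {library: [] for library in barcode_to_library.values()}
    unassigned = []
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        library = barcode_to_library.get(tag)
        if library is None:
            unassigned.append(read)
        else:
            bins[library].append(insert)
    return bins, unassigned


# Hypothetical assignment of barcodes to the three primer types.
barcode_to_library = {
    "CAAT": "unblocked",
    "CCCT": "NH2-blocked",
    "CGGT": "C3-blocked",
}

# Made-up reads for illustration only.
reads = [
    "CAATACGTACGT",
    "CCCTGGGTTTAA",
    "CGGTTTTTCCCC",
    "CAATGGGGAAAA",
    "TTTTACGTACGT",  # no valid barcode
]

bins, unassigned = demultiplex(reads, barcode_to_library)
```

Exact matching is the simplest choice; a real pipeline would have to decide how to treat barcodes containing sequencing errors.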
We pooled the three indexed libraries generated from the different primer types and sequenced the library pools on two different lanes of the flow cell following manufacturer’s instructions for cluster generation and sequencing-by-synthesis of single ends for 40 cycles. We used the Illumina Genome Analyzer Pipeline Version 0.2 software with default signal quality filters (chastity of base signals >0.6 within the first 12 bases of the read) to identify reads that passed filters (PF; read files are available upon request). The PF reads were split according to their corresponding indexes and the 4-bp index was then removed. The remainder of each read was aligned to the reference sequence (6 LR-PCR amplicon sequences from NCBI36) using the MAQ (mapping and assembling with qualities) algorithm (10). Poor-quality bases (low MAQ quality score) were removed from the final coverage depth calculation.
[Figure 2 image: (A) bar chart of coverage variability for the 200-bp and 600-bp libraries, P < 0.01; (B) coverage depth tracks, scale 0 to 1000, for the 200-bp and 600-bp libraries generated with C3-blocked primers.]
Figure 2. Increased library size improves sequencing coverage uniformity. (A) Coverage depth variability of the 200-bp and 600-bp libraries as measured by the average coefficient of variation across the six amplicons (50-bp ends excluded). The error bars represent the standard deviation over the 6 amplicons and the 3 types of primers. (B) Coverage variability across a ∼5-kb genomic interval sequenced from a LR-PCR amplicon using C3 5′-blocked primers and a 200-bp or 600-bp sequencing library.
The coverage of the amplicon ends using blocked primers was reduced 6.8-fold on average (range 2.8–9.5) when compared with unblocked primers (Figure 1, A and B), with both types of blocking groups working equally well. We note that the coverage of the amplicon ends is still twice the expected level and that sequence coverage depth of base pairs within the
amplicon shows great variability. We hypothesized that this bias is introduced during the PCR amplification of the library and that using a larger library size would reduce coverage variability and residual overrepresentation of the ends. To test this, we fragmented the amplicon pools to ∼600 bp, generated libraries, and sequenced them on the Illumina GA as described for the 200-bp fragments. The coverage of the ends of the 600-bp library
generated with unblocked primers was reduced 3.6-fold when compared with the 200-bp library (Figure 1, A and B), thus supporting the hypothesis that the overrepresentation of the ends is due to the fragmentation rather than to the sample preparation by LR-PCR. The combined effect of an increased library size and the 5′-blocked primers lowers the coverage of the amplicon ends to near the expected level. In addition to the
reduction in sequence coverage of the amplicon ends, the 600-bp library had a 28% reduction in coverage variability across the amplicon compared with the 200-bp library (t-test, P < 0.01, Figure 2, A and B), therefore improving overall coverage uniformity. Our results demonstrate that the efficiency of DNA re-sequencing using LR-PCR to amplify targeted intervals can be greatly improved by utilizing 5′-blocked primers and a 600-bp library size. Implementing these simple steps will maximize coverage yield and limit coverage variability when designing targeted re-sequencing experiments using next-generation short-read technologies.
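The coverage variability measure used here, the coefficient of variation of per-base coverage depth with the 50-bp amplicon ends excluded, averaged over amplicons, can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' analysis code: per-base depth is taken to be a simple list of numbers per amplicon, the population standard deviation is used (the choice between population and sample standard deviation is not stated in the text), and the function names and toy depth profiles are hypothetical.

```python
from statistics import mean, pstdev


def coverage_cv(depth, end_trim=50):
    """Coefficient of variation of per-base depth, excluding both amplicon ends.

    `depth` is a list of per-base coverage values for one amplicon;
    `end_trim` bases are dropped from each end before computing the CV.
    """
    inner = depth[end_trim:len(depth) - end_trim]
    if len(inner) == 0:
        raise ValueError("amplicon is too short for the requested end trim")
    average = mean(inner)
    if average == 0:
        raise ValueError("no coverage inside the amplicon")
    # Assumption: population standard deviation (not specified in the text).
    return pstdev(inner) / average


def mean_cv(depth_by_amplicon, end_trim=50):
    """Average the per-amplicon CV over all amplicons in a library."""
    return mean(coverage_cv(d, end_trim) for d in depth_by_amplicon.values())


# Toy depth profiles (made-up numbers, not data from the study).
uniform = [100] * 200
uneven = [5000] * 50 + [10, 20, 30, 40] * 25 + [5000] * 50

cv_uniform = coverage_cv(uniform)
cv_uneven = coverage_cv(uneven)
library_cv = mean_cv({"amplicon_1": uniform, "amplicon_2": uneven})
```

Trimming the ends before computing the statistic keeps the overrepresented amplicon ends from dominating the variability estimate, so the value reflects uniformity within the amplicon only.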
Acknowledgements
We would like to thank Karrie Trevarthen for technical assistance, and Kari Ohlsen and Ryan Lister for helpful discussions. We are grateful to John Havens (IDT Technologies) for providing C3-blocked primers. This work is supported by a National Institutes of Health Clinical and Translational Science Award (NIH CTSA; grant no. NIH 1U54RR02520401). This paper is subject to the NIH Public Access Policy. The authors declare no competing interests.
References
1. Bentley, D.R. 2006. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 16:545-552.
2. Mardis, E.R. 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24:133-141.
3. Parla, J.S. and W.R. McCombie. 2008. High-throughput genetic resequencing of target genomic regions. Poster presented at the 21st Annual Cold Spring Harbor Laboratory Biology of Genomes Meeting, Cold Spring Harbor, NY.
4. Yeager, M., N. Xiao, R.B. Hayes, P. Bouffard, B. Desany, L. Burdett, N. Orr, C. Matthews, et al. 2008. Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum. Genet. 124:161-170.
5. Ossowski, S., K. Schneeberger, R.M. Clark, C. Lanz, N. Warthmann, and D. Weigel. 2008. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 18:2024-2033.
6. Cronn, R., A. Liston, M. Parks, D.S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 36:e122.
7. Hillier, L.W., G.T. Marth, A.R. Quinlan, D. Dooling, G. Fewell, D. Barnett, P. Fox, J.I. Glasscock, et al. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods 5:183-188.
8. Frazer, K.A., E. Eskin, H.M. Kang, M.A. Bogue, D.A. Hinds, E.J. Beilharz, R.V. Gupta, J. Montgomery, et al. 2007. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature 448:1050-1053.
9. Craig, D.W., J.V. Pearson, S. Szelinger, A. Sekar, M. Redman, J.J. Corneveaux, T.L. Pawlowski, T. Laub, et al. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5:887-893.
10. Li, H., J. Ruan, and R. Durbin. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.
Received 22 September 2008; accepted 10 December 2008.
Address correspondence to Olivier Harismendy, Scripps Genomic Medicine, 3344 N. Torrey Pines Court, La Jolla, CA, USA, 92037. email:
[email protected]
Vol. 46 | No. 3 | 2009