protocol

Linear amplification for deep sequencing

Wieteke A M Hoeijmakers1,2, Richárd Bártfai1,2, Kees-Jan Françoijs1 & Hendrik G Stunnenberg1

1Department of Molecular Biology, Faculty of Science, Nijmegen Center for Molecular Life Sciences, Radboud University, Nijmegen, The Netherlands. 2These authors contributed equally to this work. Correspondence should be addressed to H.G.S. ([email protected]).

Published online 23 June 2011; doi:10.1038/nprot.2011.345

© 2011 Nature America, Inc. All rights reserved.

Linear amplification for deep sequencing (LADS) is an amplification method that produces representative libraries for Illumina next-generation sequencing within 2 d. The method relies on attaching two different sequencing adapters to blunt-end repaired and A-tailed DNA fragments, wherein one of the adapters is extended with the sequence for the T7 RNA polymerase promoter. Ligated and size-selected DNA fragments are transcribed in vitro with high RNA yields. Subsequent cDNA synthesis is initiated from a primer complementary to the first adapter, ensuring that the library will only contain full-length fragments with two distinct adapters. Contrary to the severely biased representation of AT- or GC-rich fragments in standard PCR-amplified libraries, the sequence coverage in T7-amplified libraries is indistinguishable from that of nonamplified libraries. Moreover, in contrast to amplification-free methods, LADS can generate sequencing libraries from a few nanograms of DNA, which is essential for all applications in which the starting material is limited.

INTRODUCTION
Rationale
In recent years, next-generation sequencing (NGS) technology has revolutionized the analysis of complex RNA/DNA samples and has become an essential tool for nearly all fields of research, including genome (re)sequencing, clinical studies and population biology1. It enables parallel sequencing of millions of small DNA fragments at low per-base cost in a short time. Besides de novo sequencing, NGS provides accurate information on the composition of complex (c)DNA samples, making it the method of choice for most, if not all, genomic applications (e.g., transcriptome analysis (RNA-Seq2), profiling of methylated DNA3,4 or DNA-associated proteins (ChIP-Seq5–7)); new applications for NGS appear frequently (e.g., single-cell transcriptome analysis8, directional RNA-Seq9–11, GRO-seq12 and ribosome footprinting13). Furthermore, NGS applications are under intense scrutiny to produce even more and better quality data14–16.

Currently, the Illumina Genome Analyzer is the most widely used among the available NGS platforms1. Libraries are prepared by ligation of specific bifurcated adapters to the ends of DNA fragments, which facilitate their capture on the surface of a flow cell. Subsequently, each fragment is locally amplified to generate a cluster of identical sequences that are subjected to reversible terminator sequencing (for more information, see ref. 17 or http://www.illumina.com/technology/sequencing_technology.ilmn). Typically, a single lane contains 25–30 million DNA fragments (clusters) that can be sequenced for up to 150 bases from one or both ends of the fragment. As in most applications the amount of starting material is limited, a PCR amplification step (18 cycles) is an inherent part of the manufacturer’s protocol (Supplementary Fig. 1a).
This leads to a paradoxical problem, as PCR amplification is known to introduce strong bias in sample composition, and fragments with high AT or GC content then become underrepresented or are completely lost during library preparation9,14,18–21 (Fig. 1). These PCR-mediated changes introduced during sample preparation severely compromise the quality of the data, complicating data quantitation and representation (coverage), which can only be partially compensated for by deeper sequencing. This bias is especially detrimental to sequencing of ‘extreme’ genomes, such as that of a human malaria parasite (Plasmodium falciparum; average AT content, ~80% (ref. 22)), slime mold (Dictyostelium discoideum; average AT content, ~78% (ref. 23)) and herpes B virus (average GC content, ~75% (ref. 24)), or profiling of highly GC-rich CpG islands of cancer genomes. Therefore, development of a method resulting in linear amplification of NGS libraries is indispensable for sequencing of ‘extreme’ genomes and for accurate quantification in all applications of NGS.

The linear amplification for deep sequencing method (LADS18), described here in detail, uses the T7 linear amplification system, which has previously been successfully used for linear amplification of DNA for array hybridization25–27. T7 amplification requires addition of the T7 promoter to DNA fragments, enabling T7 RNA polymerase-mediated transcription in an in vitro reaction. T7 RNA polymerase transcribes the same fragment multiple times, yielding many RNA transcripts and thereby amplifying the template in a linear rather than an exponential manner. The resulting RNA fragments are subsequently converted to cDNA, in which the representation of different DNA fragments is highly similar to that of the original sample25. In LADS (Fig. 2), the T7 promoter is incorporated through ‘adapter (B)’ ligation, which requires synthesis of two separate adapters (A and B, Table 1) corresponding to the two strands of the Illumina adapter. These adapters are ligated to blunt-end-repaired and ‘A’-tailed DNA, resulting in fragments with either two identical (A-A and B-B) or two distinct (A-B and B-A) adapters (Fig. 2). As only fragments containing two different adapters can be sequenced on the Illumina platform, selection of fragments with both adapters A and B is key to LADS.
This is achieved in two steps: (i) owing to the presence of the T7 promoter on adapter B, only fragments bearing adapter B (A-B, B-A and B-B) are transcribed into RNA during in vitro transcription (A-A–type fragments are not transcribed to RNA and lost during subsequent RNA purification); and (ii) subsequent cDNA synthesis is initiated from a primer complementary to adapter A (P5, Fig. 2 and Table 1), resulting in the exclusive conversion of fragments with A-B/B-A adapters (B-B-type fragments are not converted into cDNA and removed by RNA degradation; Fig. 2). The resulting library can be directly used for Illumina deep sequencing. Proof of principle for this method was provided by sequencing of genomic DNA from the extremely AT-rich genome of the

[Figure 1, panels a and b. y axis: log2 ratio over expected coverage.]

Figure 1 | Comparison of different library preparation methods for Illumina sequencing. (a,b) Distribution of sequence reads in relation to GC content in shotgun sequencing data obtained from P. falciparum (a) and M. tuberculosis (b) genomic DNA prepared by different amplification methods. Note that overrepresentation of fragments with 40–60% GC content in PCR-amplified P. falciparum library is a relative overrepresentation due to the severe underrepresentation of AT-rich sequences comprising ~40% of the genome (gray area highlights deviation from expected coverage in PCR-amplified libraries). Normalized read density is plotted as log2 ratio of expected read density per 150-bp window for different GC content. Panel a has been modified from reference 18. For details about preparation and analysis of M. tuberculosis samples, see Supplementary Figure 2.
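The normalization described in the caption (read density per 150-bp window, expressed as a log2 ratio over the expected density and grouped by GC content) can be sketched in a few lines of Python. This is only an illustration and not the authors' analysis pipeline: the function names, the use of non-overlapping windows, the 10% GC bins and the definition of the expected density as the mean read count over all windows are assumptions made for the example.

```python
import math

def gc_percent(seq):
    """Return the G+C percentage of a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def window_gc(genome, window=150):
    """GC percentage of consecutive, non-overlapping windows (assumed layout)."""
    return [gc_percent(genome[i:i + window])
            for i in range(0, len(genome) - window + 1, window)]

def log2_ratio(observed, expected):
    """log2(observed / expected); None when either value is not positive."""
    if observed <= 0 or expected <= 0:
        return None
    return math.log2(observed / expected)

def coverage_by_gc(gc_values, read_counts, bin_size=10):
    """log2 ratio of mean read count per GC bin over the expected count.

    The expected count is assumed to be the mean over all windows.
    """
    expected = sum(read_counts) / len(read_counts)
    totals = {}
    for gc, count in zip(gc_values, read_counts):
        # Windows with 100% GC are placed in the top bin.
        key = min(int(gc // bin_size) * bin_size, 100 - bin_size)
        n, s = totals.get(key, (0, 0))
        totals[key] = (n + 1, s + count)
    return {k: log2_ratio(s / n, expected) for k, (n, s) in sorted(totals.items())}

# Toy input: one AT-only window and one GC-only window with made-up read counts.
toy_genome = "AT" * 75 + "GC" * 75
toy_counts = [10, 30]
profile = coverage_by_gc(window_gc(toy_genome), toy_counts)
```

In this toy input the AT-only window receives half the expected number of reads, so its bin sits one log2 unit below zero, which is the kind of deviation the gray area in the figure highlights for PCR-amplified libraries.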

[Figure 1 panel titles: (a) Plasmodium falciparum; (b) Mycobacterium tuberculosis. x axis: % (G+C).]

malaria parasite, P. falciparum18. Shotgun sequencing following LADS yielded a much more uniform coverage of the malaria parasite genome compared with PCR-amplified samples (see ref. 18). Such uniform coverage makes sequencing more cost efficient and allows complete genome sequencing from limited DNA samples (e.g., noncultivatable field isolates or forensic samples). LADS has also enabled the accurate genome-wide localization (ChIP-seq) of histone marks/variants that associate with highly AT-rich parts of the malaria parasite genome (see ref. 18). Similarly, LADS should improve quantitation when profiling CpG-dense promoters for DNA methylation (MeDip and MethylCap) or transcription factor binding (ChIP-seq). Finally, LADS has enabled highly quantitative analyses of the P. falciparum transcriptome during intraerythrocytic development (see ref. 18) and improved coverage of the AT-rich untranslated regions (W.A.M.H., R.B. and H.G.S., unpublished data). With slight modification of adapter A (see Supplementary Table 1 for a small set of ‘barcoded’ adapters as an example) LADS can be applied for parallel sequencing (i.e., multiplexing) of multiple samples in a single lane (R.B., unpublished data).

[Figure 1 legend: amplification free; LADS; PCR amplification.]

malaria parasite P. falciparum18 and the very GC-rich genome of Mycobacterium tuberculosis (average GC content, ~65%28; Fig. 1b and Supplementary Fig. 2). This analysis showed that, contrary to PCR-amplified libraries in which sequences with >80% AT content were severely reduced or even completely absent, the T7-amplified library enabled sequencing of up to 100% AT centromeres of P. falciparum and preserved the original sample composition (see ref. 18 and Fig. 1a). On the other side of the spectrum, M. tuberculosis DNA fragments with >80% GC content were reduced on PCR amplification, but maintained in the LADS library (Fig. 1b and Supplementary Fig. 2). Therefore, LADS results in robust and quantitative data and, thus, superior analysis for a plethora of NGS applications.
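One simple way to see why exponential amplification distorts sample composition more than linear amplification is a toy calculation. The sketch below is not taken from the protocol: the per-cycle efficiencies and transcript numbers are hypothetical values chosen for illustration, and the only figure borrowed from the text is the 18 PCR cycles of the standard library preparation.

```python
def pcr_yield(copies, efficiency, cycles):
    """Exponential amplification: each cycle multiplies copies by (1 + efficiency)."""
    return copies * (1.0 + efficiency) ** cycles

def linear_yield(copies, transcripts_per_template):
    """Linear amplification: every template is copied a fixed number of times."""
    return copies * transcripts_per_template

CYCLES = 18            # PCR cycles of the standard protocol
EFF_BALANCED = 1.00    # assumed per-cycle efficiency of a balanced fragment
EFF_EXTREME = 0.90     # assumed per-cycle efficiency of an AT- or GC-rich fragment
RATE_BALANCED = 1000   # assumed transcripts per template, balanced fragment
RATE_EXTREME = 900     # assumed transcripts per template, 10% lower yield

# Representation of the 'extreme' fragment relative to the balanced one.
pcr_ratio = pcr_yield(1, EFF_EXTREME, CYCLES) / pcr_yield(1, EFF_BALANCED, CYCLES)
linear_ratio = linear_yield(1, RATE_EXTREME) / linear_yield(1, RATE_BALANCED)

print(f"relative representation after PCR:    {pcr_ratio:.2f}")
print(f"relative representation after linear: {linear_ratio:.2f}")
```

Under these assumptions a 10% disadvantage per cycle compounds to a loss of roughly 60% after 18 cycles, whereas the same 10% disadvantage in a linear reaction remains a 10% loss.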

Comparison with other bias-prevention strategies
LADS was developed to circumvent severe sequence bias resulting from PCR amplification of Illumina sequencing libraries (Supplementary Fig. 1a). Previously, Kozarewa et al.14 reported an amplification-free method (Supplementary Fig. 1b) that efficiently tackles this problem, but only if several hundred nanograms of DNA are available for library preparation; it cannot be used when only a limited amount of starting material is available. LADS therefore provides a solution for the many applications in which the material is limited, such as sequencing of clinical isolates, ChIP-seq, RNA-seq and ribosome footprinting. Alternatively, true single-molecule sequencing technology,

Applications of LADS
In principle, LADS can be applied to all types of NGS applications in which the starting material for sequencing is double-stranded DNA (dsDNA), and it is particularly beneficial when the starting material is limited and amplification is inevitable. However, directionality is not maintained in this protocol, complicating its application to directional transcriptome analysis (see ‘Limitations of LADS’ below). So far we have successfully applied LADS to three different applications (shotgun genome (re)sequencing, ChIP-seq and transcriptome analysis) on the highly AT-rich genome of the human

[Figure 2 diagram stages: End repair; ‘A’ tailing; Adapter ligation (adapter A carries P5 and S1; adapter B carries T7, P7 and S2); In vitro transcription (amplification); cDNA synthesis (P5 primer, then P7 primer); LADS library ready for sequencing.]

Figure 2 | Workflow of LADS. Double-stranded DNA fragments are first blunt end–repaired, phosphorylated and A tailed (Steps 1–8). Subsequently, T-tailed adapters, containing sequences used during cluster formation (P5 and P7) and sequencing (S1–2) on the Illumina platform or for in vitro transcription (T7 promoter in adapter B), are ligated to each fragment (Steps 9–11). After removal of excess adapters and size selection (Step 12), adapter B–containing fragments (A-B, B-A or B-B) are subjected to in vitro transcription, resulting in multiple RNAs per template DNA (Steps 13–17). First- and second-strand cDNA synthesis (Steps 18–28) are initiated by adapter-specific primers (P5 or P7, respectively), ensuring that only fragments with two distinct adapters will be present in the library.


Table 1 | Oligos required for LADS.

Name               Sequence (5′–3′)
Adapter-A-For(a)   AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC(b)T
Adapter-A-Rev(a)   (c)GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
Adapter-B-For(a)   GAATTTAATACGACTCACTATAGGGAcaagcagaagacggcatacgaGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC(b)T
Adapter-B-Rev(a)   (c)GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGTCCCTATAGTGAGTCGTATTAAATTC
P5(d)              AATGATACGGCGACCACCGA
P7                 caagcagaagacggcatacga

T7 promoter sequence is underlined, P5 primer sequence is italic, P7 primer sequence is lowercase, sequencing primer 1 is bold, sequencing primer 2 is double underlined. (a) Adapter oligos should be minimally PAGE purified, but ideally prepared by double purification (e.g., double PAGE, HPLC-PAGE). (b) Phosphorothioate linkage that can stabilize the overhanging thymidine (optional). (c) 5′ phosphate group. (d) Should be synthesized and kept RNase-free.
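Before ordering oligos, the Table 1 sequences can be checked for internal consistency. The Python sketch below is an illustrative check, not part of the protocol: it tests that each reverse oligo is the reverse complement of its forward partner minus the 3′ T overhang, that P5 and P7 occur in adapters A and B, respectively, and that the two forward oligos together account for the 145 bp that the adapters add to a fragment. The footnote markers are removed from the sequences, and the T7 core promoter string is the commonly cited consensus, used here as an assumption because the underlining of the original table is not available in plain text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper- or lowercase DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

# Sequences copied from Table 1 (footnote markers removed).
ADAPTER_A_FOR = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
ADAPTER_A_REV = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
ADAPTER_B_FOR = ("GAATTTAATACGACTCACTATAGGGAcaagcagaagacggcatacga"
                 "GATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT")
ADAPTER_B_REV = ("GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG"
                 "TCCCTATAGTGAGTCGTATTAAATTC")
P5 = "AATGATACGGCGACCACCGA"
P7 = "caagcagaagacggcatacga"

# Assumption: consensus T7 core promoter, not read from the table's underlining.
T7_CORE = "TAATACGACTCACTATAGGG"

def pairs_with_t_overhang(forward, reverse):
    """True if 'reverse' anneals to 'forward' leaving a single 3' T overhang."""
    forward = forward.upper()
    return forward.endswith("T") and revcomp(forward[:-1]) == reverse.upper()

checks = {
    "adapter A strands pair": pairs_with_t_overhang(ADAPTER_A_FOR, ADAPTER_A_REV),
    "adapter B strands pair": pairs_with_t_overhang(ADAPTER_B_FOR, ADAPTER_B_REV),
    "P5 at 5' end of adapter A": ADAPTER_A_FOR.upper().startswith(P5.upper()),
    "P7 inside adapter B": P7.upper() in ADAPTER_B_FOR.upper(),
    "T7 core inside adapter B": T7_CORE in ADAPTER_B_FOR.upper(),
    "adapters add 145 bp": len(ADAPTER_A_FOR) + len(ADAPTER_B_FOR) == 145,
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'FAILED'}")
```

A failed check would most likely point to a transcription error in an order sheet rather than to a problem with the adapter design itself.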

such as the Helicos Genetic Analysis System (Helicos BioSciences Corporation19) or Pacific Biosciences SMRT technology (http://www.pacificbiosciences.com/)29, could allow sequencing of small amounts of DNA/RNA without amplification. However, the Helicos system is no longer available to the general marketplace, whereas SMRT technology has only just become available to the research community. Therefore, unbiased library amplification is still indispensable for most applications of NGS.

T7 in vitro transcription is not the only linear amplification technology available. Another well-known alternative is rolling circle amplification (RCA)30,31, in which a highly processive DNA polymerase generates hundreds of concatenated copies from a circularized template. Similar to T7 in vitro transcription, RCA has very high replication fidelity and results in linear amplification of the input material32. Therefore, during development of LADS we also considered this method, but we finally decided to use T7 amplification, as it was easier to adapt to the Illumina sequencing platform. RCA, however, could be considered for selective enrichment of the template33 followed by amplification-free library preparation or LADS.

Limitations of LADS
Similar to the standard library preparation protocol, strand information is not maintained in LADS (i.e., the sequence read obtained can correspond to the sense/Crick or antisense/Watson strand of the original DNA/RNA fragment with equal probability). Accordingly, if directionality needs to be maintained, it either has to be introduced before adapter ligation or requires modification of the LADS protocol. For example, directional RNA-seq is directly compatible with LADS if a ‘barcode’ is incorporated on the reverse strand during first-strand cDNA synthesis while generating the double-stranded cDNA to start LADS.
A known limitation of the T7 amplification system is caused by the premature termination of T7 RNA polymerase on certain low-complexity sequences34. Accordingly, we observed a reduction in signal in the LADS library compared with amplification-free control for a set of ~100 regions of the P. falciparum genome. As these sequences were found to comprise short telomeric repeats or long homopolymeric polyA or polyT stretches, we assume that the underrepresentation of these sequences resulted from such premature termination events. Given that ~50% of these regions are located in difficult-to-map/assemble repetitive regions and they comprise a minuscule fraction of the genome (~0.07%), we believe that this limitation is negligible compared with the advantages brought about by LADS. Moreover, if needed, this premature termination can be circumvented by lowering the temperature during in vitro transcription35, but at the cost of a lower yield.

Experimental design
LADS involves many steps (summarized in Fig. 3), but rather few at which efficient processing of the sample can be monitored. Therefore, we recommend the use of a control sample to set up the entire protocol first. This can be useful to pinpoint problems (see TROUBLESHOOTING) and can prevent wasting precious experimental material. Control samples should contain a DNA fragment of a distinct size (~200 bp) that has been prepared by restriction enzyme digestion. Any representative PCR fragment of distinct size can be used as a control and restriction sites can be incorporated through the primers. The restriction enzyme needs to be a non–blunt cutting enzyme that does not leave a 3′ A-overhang, so that both end repair (Steps 3–5 in PROCEDURE) and A tailing (Steps 6–8 in PROCEDURE) are required for efficient adapter ligation. By using 40 ng of control fragment, the efficiency of adapter ligation can be easily monitored on the gel (fragments with only one A-adapter, fragments with only one B-adapter and fragments with an adapter on each side will have distinct size shifts on the gel and will reveal the efficiency of adapter ligation).
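The size shifts expected for the control fragment can be estimated from the lengths of the forward adapter oligos in Table 1 (58 nt for adapter A and 87 nt for adapter B, together the 145 bp that the adapters add to a fragment). The following Python sketch is a rough guide only: the function name is hypothetical, and the calculation ignores the few base pairs gained or lost during end repair and A tailing of a restriction fragment.

```python
ADAPTER_A_BP = 58   # length of Adapter-A-For in Table 1
ADAPTER_B_BP = 87   # length of Adapter-B-For in Table 1

def ligation_products(fragment_bp):
    """Approximate sizes (bp) of the species expected after adapter ligation."""
    return {
        "no adapter": fragment_bp,
        "one A adapter": fragment_bp + ADAPTER_A_BP,
        "one B adapter": fragment_bp + ADAPTER_B_BP,
        "A-A": fragment_bp + 2 * ADAPTER_A_BP,
        "A-B / B-A": fragment_bp + ADAPTER_A_BP + ADAPTER_B_BP,
        "B-B": fragment_bp + 2 * ADAPTER_B_BP,
    }

# Approximate band sizes for the ~200-bp control fragment.
control = ligation_products(200)
for species, size in control.items():
    print(f"{species:>14}: ~{size} bp")
```

Because adapters A and B differ in length by about 29 bp, the singly and doubly ligated species should resolve as separate bands on a high-resolution agarose gel, which is what makes the control fragment informative.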
The Illumina sequencing platforms give maximum performance with fragments of a distinct size (e.g., ~300 bp, including adapter sequences). Accordingly, a size-selection step is included in the LADS protocol (see Step 12 in PROCEDURE), and it is advisable that the majority of the fragments in the starting material are around 150–200 bp long (the adapters add 145 bp to the fragment). This can be achieved by numerous means depending on the type of starting material: (i) for genomic DNA or for cross-linked chromatin we use sonication18, but other mechanical or enzymatic methods can also be used21; (ii) native chromatin can be digested to mononucleosomal DNA fragments using MNase18,36; and (iii) for RNA-seq37, hydrolysis of the RNA38 followed by random primed cDNA synthesis has worked best for us18. It is essential that all DNA fragments in the starting material are in a double-stranded state. Therefore, care must be taken to avoid denaturation of DNA fragments during the preparation of starting material (e.g., heat inactivation of enzymes after cDNA synthesis) or the library itself (e.g., gel extraction15). For example, for de-cross-linking of formaldehyde cross-linked chromatin samples from P. falciparum, we recommend overnight incubation at 45 °C in the presence of 500 mM NaCl, as this better preserves the representation of highly AT-rich sequences (shown by quantitative PCR (qPCR) in Supplementary Fig. 3).

QC of the starting material. Given that preparation and sequencing of LADS libraries is rather expensive, it is important to perform quality controls (QCs); these should be done, at a minimum, at the beginning and at the end of the procedure. Samples must be free from any contaminants (e.g., salt, phenol or enzyme). Therefore,

Preparation of starting material (1 h; Steps 1 and 2): Purification and quantitation of starting material.
End repair (1.5–2 h; Steps 3–5): This step ensures that DNA ends are blunt and phosphorylated.
‘A’ tailing (1.5–2 h; Steps 6–8): This step adds an adenine to the 3′-end of each DNA fragment and enables ligation to the adapters (containing a ‘T’ overhang) later on.
Adapter ligation (1.5–2 h; Steps 9–11): This step covalently joins adapters (either one A-type and one B-type, or two A- or B-type) to the ends of each DNA fragment.
Size selection (1–3 or >2 h; Step 12): This step removes the excess of adapter oligos and selects a narrow size range of fragments for optimal sequencing results.
In vitro transcription (12–16 h; Steps 13–17): In this step, all fragments containing at least one B-type adapter will be transcribed into multiple RNAs, resulting in amplification.
cDNA synthesis (3–4 h; Steps 18–28): This step converts the amplified RNA to cDNA. The use of an adapter-A-specific primer in the first-strand synthesis reaction ensures that only fragments with two different adapters get converted.

Figure 3 | Flowchart of experimental procedure. The flowchart summarizes the main stages of LADS, explains the purpose of each and indicates the time required and the corresponding steps of the protocol.
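The two selection steps in the flowchart (transcription requires adapter B, first-strand cDNA synthesis requires adapter A) can be written out as a small truth table. The Python sketch below is only an illustration of that logic; the function names and the wording of the outcomes are chosen for the example and do not come from the protocol.

```python
from itertools import product

def is_transcribed(left, right):
    """In vitro transcription needs the T7 promoter, which only adapter B carries."""
    return "B" in (left, right)

def is_converted_to_cdna(left, right):
    """First-strand cDNA synthesis is primed from adapter A (P5 primer)."""
    return is_transcribed(left, right) and "A" in (left, right)

def library_fate(left, right):
    """Outcome for a fragment carrying the given adapters at its two ends."""
    if not is_transcribed(left, right):
        return "lost: not transcribed"
    if not is_converted_to_cdna(left, right):
        return "lost: RNA not converted to cDNA"
    return "kept: present in the final library"

# Enumerate the four possible ligation products.
fates = {f"{l}-{r}": library_fate(l, r) for l, r in product("AB", repeat=2)}
for name, fate in fates.items():
    print(name, "->", fate)
```

Only the A-B and B-A products pass both filters, which is why no separate purification step is needed to remove fragments with two identical adapters.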

in most cases, we recommend column purification of the starting material (see Steps 1 and 2 in the PROCEDURE). Accurate quantification of the starting material is also crucial, as for optimal

MATERIALS
REAGENTS
• T4 DNA polymerase (New England Biolabs, cat. no. M0203L or M0203S)
• DNA Polymerase I, large (Klenow) fragment (New England Biolabs, cat. no. M0210L or M0210S)
• Klenow fragment (3′→5′ exo–; New England Biolabs, cat. no. M0212L or M0212S) including NEBuffer 2
• T4 polynucleotide kinase (New England Biolabs, cat. no. M0201L or M0201S)
• T4 DNA ligase buffer (Promega, cat. no. C1263)
• LigaFast rapid DNA ligation system (Promega, cat. no. M8221 or M8225) containing T4 DNA ligase and 2× rapid ligation buffer
  CRITICAL Use of DNA ligase from another source may cause concatenation of the adapters when removal of adapter T-overhangs occurs because of contaminating exonuclease activity. This potential exonuclease activity is only problematic if the adapters do not contain the phosphorothioate bond for stabilization of the overhanging T.
• Certified Low-Range Ultra Agarose (Bio-Rad, cat. no. 161-3107)
• GelStar nucleic acid gel stain (Lonza, cat. no. 50535)
  ! CAUTION It is potentially mutagenic and carcinogenic. Handle with care and wear double gloves and eye protection.
• Tris/acetic acid/EDTA (TAE, 50×; Bio-Rad, cat. no. 161-0743)
• Orange G (Sigma-Aldrich, cat. no. O3756-25G or O3756-100G)
• Ultrapure glycerol (Invitrogen, cat. no. 15514-011 or 15514-029)

sample preparation each sample should contain a minimum of 3 ng (although less might work), but not more than 40 ng, of dsDNA in a maximum volume of 40 µl to start Step 3 of PROCEDURE (for samples with a broad size range, starting from   40 ng starting material may result in inefficient reactions. The majority of the DNA fragments should be in the range of 150–300 bp (e.g., 100–600 bp).
 PAUSE POINT DNA can be stored at −20 °C for a few months.

End repair ● TIMING 1.5–2 h
3| Set up a 50 µl reaction for each DNA sample as follows. For samples with a large fragment size range (sonicated DNA/chromatin, ds cDNA) starting with  
