Chapter 12
Next-Generation Whole Genome Sequencing of Dengue Virus
Pauline Poh Kim Aw, Paola Florez de Sessions, Andreas Wilm, Long Truong Hoang, Niranjan Nagarajan, October M. Sessions, and Martin Lloyd Hibberd
Abstract
RNA viruses are notorious for their ability to adapt quickly to selective pressure from the host immune system and/or antivirals. This adaptability is likely due to the error-prone nature of their RNA-dependent RNA polymerase [1, 2]. Dengue virus, a member of the Flaviviridae family of positive-strand RNA viruses, shares these error-prone characteristics [3]. Using high-throughput, massively parallel sequencing methodologies, or next-generation sequencing (NGS), we can now accurately quantify these viral populations and track how they change over the course of a single infection. The aim of this chapter is twofold: to describe the methodologies required for sample preparation prior to sequencing and to describe the bioinformatics analyses required for the resulting data.
Key words Dengue, Intra-host genetic diversity, Next-generation sequencing, Dengue serotypes
Abbreviations
TBE      Tris–Borate–EDTA
EtBr     Ethidium bromide
PCR      Polymerase chain reaction
RT-PCR   Reverse transcriptase polymerase chain reaction
cDNA     Complementary DNA
DNA      Deoxyribonucleic acid
RNA      Ribonucleic acid
dNTP     Deoxyribonucleotide triphosphate
kb       Kilobase
bp       Base pairs
Radhakrishnan Padmanabhan and Subhash G. Vasudevan (eds.), Dengue: Methods and Protocols, Methods in Molecular Biology, vol. 1138, DOI 10.1007/978-1-4939-0348-1_12, © Springer Science+Business Media, LLC 2014
1
Introduction
Sequencing technology has evolved over the last 6 years, moving from Sanger (capillary) sequencing to NGS. Dengue NGS publications reporting on various aspects of the virus and its hosts started to emerge in 2010. Skalsky et al. [4] used NGS to identify conserved and novel microRNAs in Culex and Aedes mosquitoes. The investigators found that female Culex mosquitoes infected with West Nile virus, a flavivirus closely related to dengue virus, showed significant changes in the expression levels of two microRNAs upon infection. Both of these mosquito genera, Culex and Aedes, are important flavivirus vectors in tropical and subtropical areas worldwide.
The host transcriptome response to dengue infection has also been analyzed by NGS in both the mosquito and human hosts. David et al. [5] explored the effects of pollutants and insecticides on the Aedes mosquito transcriptome. In this study, they attempted to raise awareness of how mosquito vector control using pollutants can affect the ecosystem beyond the target organism. In 2013, Sessions et al. explored the human transcriptome response to infection with a wild-type strain of dengue virus and an attenuated derivative vaccine candidate [6]. The authors postulated that the large extent of previously uncharacterized transcriptional regulation might constitute a novel human innate immune response, which was more successfully evaded by the wild-type strain than by the attenuated strain.
In addition to vector control and host immunology, NGS has also been used as a tool for clinical identification of dengue virus. Yozwiak et al. characterized a gamut of viruses in acute serum from over 100 Nicaraguan patients enrolled in a prospective dengue study using NGS and Virochip microarray technologies [7]. Together, these studies illustrate the wide variety of questions that can be addressed using NGS for both clinical and academic purposes.
When using the Sanger sequencing platform, consensus sequences can be inferred quite readily from the output chromatograms. However, non-consensus variants are obscured by the nucleotides of higher frequencies and cannot be easily quantified. In order to more accurately quantify the viral population and determine how the heterogeneity and subpopulations therein may be contributing to pathogenesis and transmission, an in-depth analysis of the viral population is essential. NGS technologies permit rapid and cost-effective acquisition of millions of short DNA sequences, which can then be used to reconstruct a full-length genome. Due to the relatively small size of viral genomes, it is possible to combine them into a single sequencing reaction and thus significantly reduce the cost of sequencing. These millions of reads, if distributed with relative uniformity across the genome, can be used to calculate single nucleotide variations (SNV) in a viral
Fig. 1 Flow of viral NGS from viral RNA extraction to sequencing analysis
population with digital precision. The frequencies of these variants are detectable long before the SNV is fixed in the population [8]. Indeed, with sufficient coverage, variants present at far less than 1 % frequency can be reliably detected [9]. Depending on the viral host environment, these variant positions within the viral genome can be beneficial to replicative fitness, deleterious, or neutral [10]. The type of analysis of virus population structure presented here is broadly applicable to any virus [11–17]. Although we focus on the protocols and analysis pipelines used to create and interpret sequencing data on the Illumina platform, this kind of analysis is in theory platform-agnostic and can be performed using other sequencing platforms as well [18, 19].
An overview of preparing samples for sequencing on an Illumina machine is presented schematically in Fig. 1 and is described below. A single RT-primer is designed to bind specifically to the 3′-end of dengue virus genomes. Reverse transcriptase is used to make complementary DNA (cDNA). Next, the PCR primer binding locations are designed by identifying regions of high conservation among strains. Briefly, representative dengue virus sequences for each serotype were downloaded from NCBI (www.ncbi.nlm.nih.gov/nuccore/). It is important to include highly related strains with the same geographical and genetic background as the virus intended for NGS, since the areas of conservation can vary significantly between viruses from different regions and genotypes [20]. Next, the MEGA software was used to align the sequences, and conserved regions of each serotype were identified.
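The idea of scanning an alignment for conserved stretches suitable for primer placement can be illustrated with a short Python sketch. This is not what MEGA does internally; the function name `conserved_windows`, the toy alignment, and the rule that a column counts as conserved only when every sequence carries the same non-gap base are all assumptions made for illustration.

```python
from typing import List, Tuple


def conserved_windows(aligned: List[str], width: int) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) windows of `width` columns
    in which every aligned sequence carries the same non-gap base."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")

    # A column is conserved when all sequences share one non-gap base.
    conserved = [
        len({seq[i].upper() for seq in aligned}) == 1 and aligned[0][i] != "-"
        for i in range(length)
    ]

    windows = []
    run = 0  # length of the current run of conserved columns
    for i, ok in enumerate(conserved):
        run = run + 1 if ok else 0
        if run >= width:
            windows.append((i - width + 1, i + 1))
    return windows


# Toy alignment (hypothetical data): column 4 differs between sequences.
toy_alignment = ["ACGTACGTAA", "ACGTTCGTAA", "ACGTACGTAA"]
print(conserved_windows(toy_alignment, 4))
```

In practice a primer-length window (about 20 columns) would be used, and a small number of mismatches can be tolerated by placing degenerate bases in the primer, as in several primers of Table 1.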
Fig. 2 Diagram of primer positions for DENV1, DENV2, DENV3, and DENV4
Primers targeting these conserved regions were then generated so that each amplicon produced is approximately 2 kb in length. These primer pairs are then used with a high fidelity DNA polymerase to amplify the viral genome in 5–7 fragments (Fig. 2 and Table 1). These fragments, once generated, are then separated and extracted from a 1 % agarose gel and their concentrations measured using the Agilent Bioanalyzer. Equal concentrations of the DNA products are pooled and are fragmented to ~300 bp. After fragmentation, the samples are purified and optionally checked for quality using the Agilent Bioanalyzer. The ends of the DNA samples, which were damaged during the fragmentation step, are repaired with Klenow polymerase to form uniform blunt ends. To allow for efficient ligation of the sequencing adapters in the next step, the blunt ends of the DNA are then adenylated with polynucleotide kinase (PNK). Ligation of the sequencing adapters adds common primer binding sites to the end of each fragmented piece of DNA and is the step that ultimately allows for unbiased amplification and sequencing. If desired, one can check again for quality of the ligated product using an Agilent Bioanalyzer. After ligation, ~300 bp fragments are excised from 2 % agarose gel. Samples are subjected to 14 PCR cycles of enrichment PCR to incorporate indices and are size selected on a 2 % agarose gel one final time to ensure a tight distribution of library size. Following a final quality check, these samples are then ready to be sequenced. Although the bioinformatics analysis of deep sequencing can initially seem overwhelming, it can in actuality be accomplished with a cursory understanding of Linux and a handful of useful programs. While it is true that there are programs to analyze deep sequencing data on all three major computing platforms (Windows,
Table 1 Primer sequences for DENV1, DENV2, DENV3, and DENV4

DENV1       Primer sequence (5′–3′)
D1f1F       GTTAGTCTACGTGGACCGAC
D1f1R       CATCGTGATAGGAGCAGGTG
D1f3F4      TCACAAGAAGGAGCAATGCACA
D1f2R2      AAGAAGAACTTCTCTGGATGTTA
D1f5F       ACCAATGTTTGCTGTAGGGC
D1f5R       TATTCCCCGTCTATTGCTGC
D1f7F       CAGAGCAACGCAGTTATCCA
D1f7R       CAATTTAGCGGTTCCTCTCG
D1f9F       TCACAGATCCTCTTGATGCG
D1f9R       CATGGCACCACTATTTCCCT
D1f10F      ATGGCTCACAGGAAACCAAC
D1f10R      TGCCTGGAATGATGCTGTAG

DENV2       Primer sequence (5′–3′)
D2f1F       AGTWGTTAGTCTACGTGGAC
D2f1R       TGGGCTGTCTTTTTCTGTGA
D2f2F       GCAGAAACACAACATGGAACA
D2f2R       AACGCGTCAGTCAGTTCAAG
D2f3F       GAAAGCTGACCTCCAAGGAA
D2f3R       CTGAAATGTCTGTCGTGACCA
D2f4F3      AGGCAGCTGGGATTTTCATGA
D2f4R3      TTTCCCTTCTGGTGTGACCA
D2f5F3      ACTCAAGTATTGATGATGAGGA
D2f5R3      TGTGTCCAATCGTTCCATCCT
D2f12F      GCAGGATGGGACACAAGAAT
D2f12R      AGAACCTGTTGATTCAACAG

DENV3       Primer sequence (5′–3′)
D3f1F       AGTTGTTAGTCTACGTGGAC
D3f1R       TGTCCGTGGTGAGCATTCTA
D3f2F2      TATGGAACCCTTGGGCTAGAA
D3f2R2      TAGTTGAAGTCCAGCTCCAATT
D3f3F       AAGAGCATGGAATGTGTGGGA
D3f3R       TCTTCCAGTTCTGCTTTCTGTGT
D3f4F3      GATGATGAGACTGAGAAYATC
D3f4R       TGGTTCAAAGAGAGCTGGTAT
D3f5F       AGGAGAGGGAGAGTTGGCA
D3f5R       AGCAAGCCCAGCTCCTGCTA
D3f6F       ACCAATAACAACACTCTGGGA
D3f6R       ACGCGAGAACCARTGGTCTT
D3f7F       ACAGAGGAGAACCAATGGGA
D3f7R       AGAACCTGTTGATTCAACAGCA

DENV4       Primer sequence (5′–3′)
D4-1F       GATGAGGGAAGATGGGGAGTTGTTAGTCTGTGTGGACCGAC
D4-2579R    GGGCATTYAATATTGCAGACGCTA
D4-2065F    CATAGTGATAGGTGTTGGAG
D4-4246R    GCAAGCCTCCTGCCACCATTG
D4-4226F    CAATGGTGGCAGGAGGCTTGC
D4-6463R    CTATGTTGTCAAGGGCCAGC
D4-6444F    GCTGGCCCTTGACAACATAG
D4-8531R    CCATTCACCATGGAGGATGC
D4-8512F    GCATCCTCCATGGTGAATGG
D4-10626R   CCATCTCGCGGCGCTCTGTGCC
Mac OS, and Linux), these programs are generally offered at a premium price and tend to lag behind open source software development. For this reason, most bioinformatics analyses are done on the Linux platform. The learning curve for the Linux operating system is approximately the same as switching from Windows to Mac OS and vice versa. There are many different “flavors” of the Linux operating system such as Ubuntu, Fedora, SUSE, RedHat, etc. For the purposes of this tutorial we will be using Ubuntu as it is freely available and the interface will seem more familiar to those used to working with Windows and/or Mac OS. Installation of Ubuntu can be done in parallel to an existing operating system and will cause no interference with the normal operation of the existing installation.
Once a Linux environment is successfully installed on your system, a handful of programs are needed for the analysis. Several approaches exist that are tailored towards the analysis of low-frequency variants and haplotype estimation for small viral and bacterial genomes, such as V-Phaser [21], VICUNA [22], SNVer [23], Breseq [24], and LoFreq [9]. Most use an existing mapping of the sequencing reads as a starting point. To create such a mapping, a short-read mapper and, usually, a format conversion tool are needed. Here we chose the Burrows-Wheeler Aligner (BWA) [25] for short-read mapping and Samtools as a Swiss Army knife for format conversion. Although many other mapping programs could be used as a drop-in replacement for the efficient mapping of next-generation sequencing reads to a reference genome, we chose BWA since it is widely used and able to handle both single and paired-end reads generated on multiple platforms such as Illumina, 454, and Ion Torrent. In this chapter we focus on reads from the most widely used platform, Illumina. The file format generated by an Illumina sequencing run is FASTQ, which includes both the raw sequence information and the quality measurement for each base in the read. BWA uses this information during the mapping process. Although BWA works well on high-performance, multi-node computer clusters, alignments can also be accomplished on typical laptop configurations. Naturally, more computing power will result in shorter alignment times.
It is critical to emphasize here the importance of the best possible reference genome. Reads should ideally be mapped to a reference as close as possible to the actual sample to achieve the highest possible coverage and mapping quality. If a reference is not available, one can in theory be created using a de novo assembler such as SOAPdenovo2 [26].
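To make the FASTQ structure concrete, the following Python sketch reads records and decodes the per-base quality string. It is an illustration only: the function name `read_fastq` is hypothetical, each record is assumed to occupy exactly four lines, and qualities are assumed to use the Phred+33 encoding (older Illumina pipelines used a different offset).

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, Phred scores) from a four-line FASTQ stream.

    Assumes Phred+33 quality encoding and unwrapped records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        # Each quality character encodes one Phred score as ASCII code minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical single-read example; 'I' decodes to Phred 40.
example = "@read1\nACGT\n+\nIIII\n"
for name, seq, quals in read_fastq(io.StringIO(example)):
    print(name, seq, quals)
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(−Q/10), so Q = 40 means roughly one error in 10,000 calls.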
One should note, however, that de novo assembly of viral sequences is often very difficult due to the extremely high and variable coverage and the presence of haplotypes, which tend to violate assumptions made by assemblers: high coverage is usually interpreted as repeats and the presence of haplotypes will be interpreted as sequencing errors. Consequently, most de novo assemblers will produce only short contigs, instead of the desired full-length genome. One viable alternative is based on iterative mapping: it starts by mapping the reads to an initial user-provided reference, then constructs a consensus sequence from that mapping, remaps the reads against that consensus and repeats this process until no more additional reads map [27]. The output of a BWA alignment of deep sequencing reads to a reference genome is a Sequence Alignment Map (SAM) [28] file. This type of file is a non-compressed, human readable file that contains all of the information from the FASTQ file, the location that each individual read maps to on the reference and the associated quality of the mapping (if supported by the mapping program) for each sequence read. In order to save space and improve computation efficiency when dealing with these large files, SAM files are typically compressed into a binary file
Fig. 3 Overview of the bioinformatics analysis of deep sequencing data
format that contains all of the same information at a fraction of the original size. This format is called binary alignment map (BAM) and will serve as the input for all subsequent steps. Conversion from SAM to BAM can be performed using a freely available software suite called Samtools. Once a BAM file is created, one can readily start to predict low-frequency variants from it. Optionally, one might want to recalibrate base-call quality scores first with a program such as GATK [29], which usually results in fewer spurious low-frequency variant calls in the downstream analysis.
For our downstream analysis, we utilize the LoFreq program [9]. LoFreq is a fast and sensitive variant caller designed to robustly call low-frequency variants by exploiting base-call and read-mapping quality values. Variant calls made by LoFreq down to frequencies as low as 0.5 % have been laboratory validated. LoFreq outputs a file containing only variants that have significant p-values after multiple testing correction. These should be filtered to remove positions with strand bias, which is known to be associated with mapping issues and false-positive SNV calls [30]. Ideally, primer regions are ignored in the final analysis, as they are known to contain artifactual low-frequency SNVs. A schematic of this workflow is shown in Fig. 3.
The analysis steps described above allow for great flexibility: at each step, alternative programs can be used as replacements, or quality checks can be performed before proceeding. Alternatively, a pipeline can be used that automates the execution of all steps. One such pipeline is called Vipr (available at https://github.com/CSB5/vipr). It performs all of the steps described: it computes a consensus sequence in an iterative fashion starting from a given reference sequence, maps reads against the consensus, masks primer regions, performs base-call quality recalibration, and finally calls low-frequency SNVs by means of LoFreq.
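Two of the quantities mentioned above, the frequency of a variant at a position and a multiple-testing correction, can be sketched in a few lines of Python. This is not LoFreq's implementation: the function names and the example p-values are hypothetical, and the Holm-Bonferroni step-down procedure is shown only as one common correction of this kind.

```python
from typing import List


def variant_frequency(alt_count: int, depth: int) -> float:
    """Fraction of reads at a position that carry the variant base."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_count / depth


def holm_bonferroni(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it remains significant after the
    Holm-Bonferroni step-down correction at level `alpha`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if pvalues[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # every larger p-value is also non-significant
    return significant


# 5 variant reads out of 1000 is a 0.5 % variant.
print(variant_frequency(5, 1000))
# Hypothetical p-values for four positions.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

When such a correction is applied to strand-bias p-values, the positions flagged as significant are the ones to remove, since significant strand bias suggests an artifact rather than a true SNV.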
Further analysis steps from this point onwards will depend largely on the experimental design of each investigator and study.
2
Materials
2.1 Dengue RNA Extraction Using QIAamp Viral RNA Mini Kit (Qiagen)
1. Ethanol (100 %). 2. 1.5 ml microcentrifuge tubes. 3. Sterile, RNase-free pipet tips. 4. Microcentrifuge. Reagents supplied in the kit 5. QIAamp Mini Spin Columns. 6. Collection Tubes (2 ml). 7. Buffer AVL. 8. Buffer AW1. 9. Buffer AW2. 10. Buffer AVE. 11. Carrier RNA.
2.2 cDNA Preparation Using Maxima H Minus First Strand cDNA Synthesis Kit (Thermo Scientific)
1. Template RNA. 2. Primers. 3. 10 mM dNTP Mix. 4. Nuclease-free water. 5. 5× RT Buffer. 6. Maxima H Minus Enzyme Mix. 7. 0.2 ml nuclease-free tubes. 8. Microcentrifuge. 9. Thermocycler.
2.3 PCR Amplification Using PfuUltra II Fusion HS DNA Polymerase (Agilent)
1. Distilled water (dH2O). 2. 10× PfuUltra II reaction buffer. 3. dNTP mix (25 mM each dNTP). 4. DNA template. 5. Forward primer (10 μM). 6. Reverse primer (10 μM). 7. PfuUltra II fusion HS DNA polymerase. 8. 0.2 ml nuclease-free tubes. 9. Microcentrifuge. 10. Thermocycler.
2.4 Running Gel
1. Agarose gel. 2. Loading dye. 3. 1× TBE. 4. 1× TAE. 5. Ethidium bromide (EtBr). 6. Ladder. 7. Items required for running gel: Power pack, casting tray, weighing machine, microwave, comb. 8. X-tracta Gel Extraction Tool. 9. UV transilluminator.
2.5 Isolate PCR Product from Gel Using QIAGEN Gel Extraction kit
1. Absolute Ethanol. 2. Buffer QG. 3. Buffer PE (diluted with 100 % Ethanol). 4. Buffer EB. 5. Microcentrifuge. 6. 1.5 or 2 ml microcentrifuge tubes. 7. QIAquick column (2 ml). 8. Isopropanol (100 %). 9. Heating block or water bath set at 37 °C. 10. NanoDrop to quantify individual amplicons for amplicon balancing and mixing.
2.6 Fragmentation
1. Covaris S2 sonicator (Covaris, Woburn, MA, USA). 2. microTUBE AFA Fiber Pre-Slit Snap-Cap 6 × 16 mm (Covaris).
2.7 DNA Purification Using QIAquick PCR Purification Kit
1. QIAquick Spin Columns. 2. Buffer PB. 3. Buffer PE (diluted with 100 % Ethanol). 4. Buffer EB. 5. Collection Tubes (2 ml).
2.8 DNA Quality Check (Agilent Bioanalyzer)
1. Chip priming station (reorder number 5065-4401). 2. IKA vortex mixer. 3. Microcentrifuge. 4. Bioanalyzer. 5. DNA1000 chip.
2.9 Library Preparation Using KAPA Library Preparation Kit (KAPA Biosystems)
1. End Repair Enzyme Mix. 2. 10× End Repair Buffer with dNTPs. 3. A-Tailing Enzyme. 4. 10× A-Tailing Buffer.
5. DNA Ligase. 6. 5× Ligation Buffer. 7. PfuUltra II Fusion HS DNA Polymerase. 8. Microcentrifuge. 9. Thermocycler. 10. 0.2 ml tubes.
2.10 Linux Tutorial
Many excellent tutorials are freely available for familiarization with the Linux operating system. To get started, we recommend: http://www.linux.org/forums/beginner-tutorials.53/
2.11 Ubuntu Linux
The Ubuntu operating system and detailed instructions for its installation and use can be found here: http://www.ubuntu.com
2.12 SOAPdenovo2
The SOAPdenovo2 program and detailed instructions for its installation and use can be found here: http://soap.genomics.org.cn/soapdenovo.html
2.13 BWA
The BWA program and detailed instructions for its installation and use can be found here: http://bio-bwa.sourceforge.net
2.14 Samtools
The Samtools program and detailed instructions for its installation and use can be downloaded here: http://samtools.sourceforge.net
2.15 LoFreq
The LoFreq program and detailed instructions for its installation and use can be downloaded here: http://sourceforge.net/projects/lofreq/
3
Methods
3.1 Dengue RNA Extraction with the QIAamp Viral RNA Mini Kit (See Note 1)
1. Add 310 μl Buffer AVE to the tube containing 310 μg lyophilized carrier RNA to obtain a solution of 1 μg/μl. 2. Follow steps 1–10 of the QIAamp-Viral-RNA-Minihandbook—June-2012-EN(1).pdf, pages 22–24. 3. Elute samples in 1 × 40 μl of Buffer AVE.
3.2 cDNA Preparation with the Maxima H Minus First Strand cDNA Synthesis Kit (See Note 2)
Prepare components 1 and 2 as listed below:

Component 1 (for one reaction)
Nuclease-free water             8 μl
10 mM dNTP mix                  1 μl
Primer antisense (see Note 3)   1 μl
RNA                             5 μl
Total                           15 μl

1. Incubate component 1 at 65 °C for 5 min, then place on ice for at least 1 min.

Component 2 (for one reaction)
5× RT buffer                    4 μl
Maxima H Minus Enzyme Mix       1 μl
Total                           5 μl

2. Add component 2 (5 μl) to component 1, mix gently, and centrifuge to bring liquid droplets down to the bottom of the tube.
3. Incubate the mix in the thermocycler at 50 °C for 30 min followed by 85 °C for 5 min. Cool the mix to 4 °C before starting the next step (see Note 4).
3.3 PCR Amplification (See Notes 4 and 5)
Prepare the following components in a 0.2 ml tube as listed below:

Nuclease-free water                    39 μl
10 mM dNTP mix                         1 μl
10× PfuUltra II reaction buffer        5 μl
PfuUltra II fusion HS DNA polymerase   1 μl
cDNA                                   2 μl
Primer sense (see Note 6)              1 μl
Primer antisense (see Note 6)          1 μl
Total                                  50 μl
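When several amplicons are amplified in parallel, the components shared by every reaction are usually combined into a master mix. The Python sketch below scales the per-reaction volumes listed above; the function name `master_mix`, the choice to leave cDNA and primers out of the shared mix, and the 10 % overage for pipetting loss are assumptions for illustration and are not part of the protocol.

```python
from typing import Dict

# Per-reaction volumes (in microliters) taken from the recipe above.
PCR_RECIPE_UL: Dict[str, float] = {
    "Nuclease-free water": 39.0,
    "10 mM dNTP mix": 1.0,
    "10x PfuUltra II reaction buffer": 5.0,
    "PfuUltra II fusion HS DNA polymerase": 1.0,
    "cDNA": 2.0,
    "Primer sense": 1.0,
    "Primer antisense": 1.0,
}

# Assumption: template and primers differ per reaction and are added separately.
PER_REACTION_ONLY = {"cDNA", "Primer sense", "Primer antisense"}


def master_mix(n_reactions: int, overage: float = 0.10) -> Dict[str, float]:
    """Scale the shared components for n reactions plus a fractional overage."""
    factor = n_reactions * (1.0 + overage)
    return {
        name: round(volume * factor, 2)
        for name, volume in PCR_RECIPE_UL.items()
        if name not in PER_REACTION_ONLY
    }


# The full recipe should add up to the stated 50 ul reaction volume.
print(sum(PCR_RECIPE_UL.values()))
# Shared components for six amplicons, with the assumed 10 % overage.
print(master_mix(6))
```

Each tube would then receive 46 μl of master mix plus the 2 μl of cDNA and 1 μl of each primer specific to that amplicon.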
Briefly vortex the reagents and spin down the reaction mix prior to PCR amplification according to these conditions:

Activation        2 min at 92 °C
40 cycles         10 s at 92 °C
                  20 s at 55 °C
                  1.5 min at 68 °C (30 s per kb)
Final extension   5 min at 68 °C
Hold              4 °C

3.4 Running Gel
To select the desired PCR product size, run the entire PCR product on a 1 % agarose gel.
1. Cast a 1 % agarose gel in 1× TBE buffer. Add 1 μl of EtBr for every 50 ml of 1× TBE buffer used.
2. Add 20 μl of water to 10 μl of 1 kb ladder and 6 μl of 6× loading dye.
3. Add 8 μl of 6× loading dye to the PCR product.
4. Load the samples and ladder into the gel.
5. Run the gel according to the time and voltage listed below:
– 90 V for 40 min for the smallest gel (70 ml)
– 100 V for 40 min for a medium gel (250 ml)
– 120 V for 40 min for a large gel (400 ml)
3.5 DNA Product Isolation from Gel Using QIAquick Gel Extraction Kit (Qiagen) (See Note 7)
Excise the correct DNA fragment from the agarose gel using the X-tracta gel extraction tool. Place the gel slice into a colorless tube and weigh the gel slice (see Note 8). Add 3 volumes of Buffer QG to 1 volume of gel (100 mg of gel is approximately 100 μl). Incubate at 37 °C for 15 min or until the gel slice has completely dissolved. To help dissolve the gel, vortex the tube every 2–3 min during the incubation. Follow steps 5–12 of the EN-QIAquick-Spin Handbook.pdf, pages 25–26. To elute DNA, add 32 μl of Buffer EB (10 mM Tris–Cl, pH 8.5) to the center of the QIAquick membrane and centrifuge the column for 1 min.
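The Buffer QG volume follows directly from the tared gel weight. A minimal Python sketch of that arithmetic is given here, using the approximation stated above that 100 mg of gel corresponds to 100 μl; the function name and the example weights are hypothetical.

```python
def buffer_qg_volume_ul(tube_with_gel_mg: float, empty_tube_mg: float,
                        volumes: float = 3.0) -> float:
    """Buffer QG volume in microliters for a gel slice weighed by difference.

    Uses the approximation 1 mg of gel = 1 ul, i.e. 100 mg = 100 ul."""
    gel_mg = tube_with_gel_mg - empty_tube_mg
    if gel_mg <= 0:
        raise ValueError("tube with gel must weigh more than the empty tube")
    return volumes * gel_mg


# Hypothetical weights: a 1000 mg empty tube and 1150 mg with the gel slice.
print(buffer_qg_volume_ul(1150.0, 1000.0))  # 150 mg of gel needs 450 ul
```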
3.6 Measure the Concentration of the PCR Amplicons Using an Agilent Bioanalyzer DNA1000 Chip (See Note 9)
Prepare the gel–dye mix according to the manufacturer’s instructions and load onto the chip (see Note 10). Load the marker, followed by the ladder and finally the samples into the designated wells. Place the chip into the machine and run the pre-programmed DNA1000 chip analysis (see Note 11).
3.7 Balance the PCR Amplicons
Using the precise concentration measurement from the Agilent Bioanalyzer, balance the total amount (e.g., in nanograms) of each PCR amplicon to the fragment with the lowest concentration and aliquot into a 1.5 ml tube. To ensure that the subsequent library creation steps are successful, we recommend that once the amplicons are mixed together, the total quantity should be at least 400 ng (see Note 4).
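The balancing calculation can be sketched in Python as follows. The sketch assumes that the same volume of each purified amplicon is available and that "balancing" means pooling an equal mass of each amplicon, limited by the least abundant one; the function name, fragment labels, and concentrations are hypothetical.

```python
from typing import Dict, Tuple


def balance_amplicons(conc_ng_per_ul: Dict[str, float], available_ul: float,
                      min_total_ng: float = 400.0
                      ) -> Tuple[Dict[str, float], float, bool]:
    """Return (volume of each amplicon to pool in ul, pooled total in ng,
    whether the total meets the recommended minimum)."""
    # Mass available for each amplicon; the smallest one limits the pool.
    masses = {name: conc * available_ul for name, conc in conc_ng_per_ul.items()}
    target_ng = min(masses.values())
    volumes = {name: target_ng / conc for name, conc in conc_ng_per_ul.items()}
    total_ng = target_ng * len(conc_ng_per_ul)
    return volumes, total_ng, total_ng >= min_total_ng


# Hypothetical Bioanalyzer concentrations (ng/ul) for five amplicons.
concentrations = {"f1": 10.0, "f2": 20.0, "f3": 5.0, "f4": 8.0, "f5": 16.0}
volumes, total, enough = balance_amplicons(concentrations, available_ul=30.0)
print(volumes)
print(total, enough)
```

If the pooled total falls below 400 ng, the limiting amplicon can be re-amplified rather than proceeding with too little material.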
3.8 Fragmentation Using the Covaris Machine (See Note 12)
1. Place the pooled PCR amplicons into a Covaris microTUBE. Add EB buffer to obtain a total volume of 130–140 μl.
2. Shear the pooled PCR amplicons using the following settings:

Duty cycle         10 %
Intensity          5.0
Cycles per burst   200
Duration           110 s
Mode               Frequency sweeping
Power              Covaris S2-23W
Temperature        4–5 °C

An Agilent Bioanalyzer DNA1000 chip can be run for an optional quality check following fragmentation.
3.9 Perform End Repair on Your Sheared Products (See Notes 13 and 14)
1. Prepare the reaction mix:

DNA sample               30 μl
Water                    55 μl
10× End Repair Buffer    10 μl
End Repair Enzyme Mix    5 μl
Total                    100 μl

2. Incubate in a thermal cycler for 30 min at 20 °C.
3.10 DNA Purification Using QIAquick PCR Purification Kit
1. Follow steps 1–8 of the QIAquick-Spin Handbook.pdf (Qiagen), pages 19–20.
2. To elute DNA, add 30 μl Buffer EB (10 mM Tris–Cl, pH 8.5) to the center of the QIAquick membrane and centrifuge the column for 1 min.
3.11 “A” Tailing (See Notes 13 and 15)
1. Prepare the reaction mix:

DNA sample               30 μl
Water                    12 μl
10× A-Tailing Buffer     5 μl
A-Tailing Enzyme         3 μl
Total                    50 μl

2. Incubate for 30 min at 30 °C.
3. Purify the samples using the QIAquick PCR Purification Kit (see Subheading 3.10).
3.12 Ligate Adapters to DNA Fragments
1. Prepare the reaction mix:

DNA sample               30 μl
5× Ligation Buffer       10 μl
DNA Ligase               5 μl
DNA adaptor (15 μM)      5 μl
Total                    50 μl

2. Incubate in a thermal cycler for 30 min at 20 °C.
3. An Agilent Bioanalyzer DNA1000 chip can be run (see Subheading 3.6) for an optional quality check following adaptor ligation.
3.13 Size Selection of the Desired 300 bp Product from Gel Using QIAquick Gel Extraction Kit (Qiagen) (See Notes 7 and 16)
Prepare a 2 % agarose gel in 1× TAE buffer with EtBr.
1. Load a 50 or 100 bp ladder.
2. Load 32 μl of sample together with 4 μl of 6× loading dye.
3. Run the gel at 120 V for 60 min.
4. View and capture the gel on a UV transilluminator.
5. Excise DNA fragments in the region of 300 bp (see Note 17).
6. Do a gel extraction using the QIAquick Gel Extraction Kit (see Subheading 3.5).
7. An Agilent Bioanalyzer DNA1000 chip can be run (see Subheading 3.6) for an optional quality check following gel clean up.
3.14 PCR Amplification of the Library (See Notes 4, 13 and 18)
1. Prepare the reaction mix:

Nuclease-free water               Balance to 50 μl
10× PfuUltra II reaction buffer   5 μl
PfuUltra II HS polymerase         1 μl
dNTPs                             1 μl
PCR Primer InPE1.0                1 μl
PCR Primer InPE2.0                1 μl
PCR Primer Index (see Note 19)    1 μl
Sample                            Variable (see Note 18)
Total                             50 μl

2. Amplify using the following PCR protocol:

Activation        2 min at 92 °C
14 cycles         10 s at 92 °C
                  20 s at 55 °C
                  30 s at 68 °C
Final extension   5 min at 68 °C
Hold              4 °C

3. Purify the PCR product as described in Subheading 3.10, but use MinElute PCR Purification Kit columns and elute in 22 μl of EB buffer.
4. Run an Agilent Bioanalyzer DNA1000 chip to check the concentration and fragment size (see Subheading 3.6).
5. Dilute samples to 10 nM.
6. Consolidate equal volumes of diluted samples into a single 1.5 ml tube. The final volume will depend on the sequencing facility requirements (see Note 20).
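Diluting to 10 nM requires converting the mass concentration of each library into molarity. The Python sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; that constant, the function names, and the example values are assumptions rather than part of the protocol, and a Bioanalyzer readout that already reports molarity can be used directly instead.

```python
from typing import Tuple


def molarity_nM(conc_ng_per_ul: float, mean_length_bp: float,
                g_per_mol_per_bp: float = 660.0) -> float:
    """Convert a dsDNA concentration in ng/ul to nM, assuming an average
    mass of 660 g/mol per base pair."""
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_length_bp)


def dilution_volumes(stock_nM: float, target_nM: float,
                     final_ul: float) -> Tuple[float, float]:
    """Return (stock volume, diluent volume) in ul, from C1 * V1 = C2 * V2."""
    if stock_nM < target_nM:
        raise ValueError("stock is already more dilute than the target")
    stock_ul = target_nM * final_ul / stock_nM
    return stock_ul, final_ul - stock_ul


# Hypothetical library: 6.6 ng/ul with a mean fragment size of 400 bp.
stock = molarity_nM(6.6, 400)
print(round(stock, 2))
# Volumes needed to prepare 20 ul at 10 nM from a 25 nM stock.
print(dilution_volumes(25.0, 10.0, 20.0))
```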
Fig. 4 An example of a command entered into a terminal window
3.15 Submit Your Samples to a Sequencing Facility and Get Your Reads Back in FASTQ Format (See Note 21)
3.16 Index Reference with BWA
Copy the following command into a terminal (see Notes 22 and 23):

    bwa index /home/October/Desktop/Sample1-Reference.fa

An example of what a command should look like in a terminal window is shown in Fig. 4.
3.17 Align Reads to Reference with BWA
Copy the following command into a terminal (see Notes 22 and 23):

    bwa aln -R 2 /home/October/Desktop/Sample1-Reference.fa /home/October/Desktop/Sample1_read1.fastq > /home/October/Desktop/Sample1_read1_sa.sai && bwa aln -R 2 /home/October/Desktop/Sample1-Reference.fa /home/October/Desktop/Sample1_read2.fastq > /home/October/Desktop/Sample1_read2_sa.sai

3.18 Map Reads to Reference with BWA
Copy the following command into a terminal (see Notes 22 and 23):

    bwa sampe -r '@RG\tID:120113_SN513_0262_AC0C6RACXXSample1\tLB:OMS\tPL:ILLUMINA\tCN:Your-SequencingFacility\tPU:1\tSM:Sample1' /home/October/Desktop/Sample1-Reference.fa /home/October/Desktop/Sample1_read1_sa.sai /home/October/Desktop/Sample1_read2_sa.sai /home/October/Desktop/Sample1_read1.fastq /home/October/Desktop/Sample1_read2.fastq > /home/October/Desktop/Sample1.sam

3.19 Convert SAM File to a Compressed BAM File
Copy the following command into a terminal (see Notes 22 and 23):

    samtools view -bS /home/October/Desktop/Sample1.sam > /home/October/Desktop/Sample1.bam

3.20 Sort Your BAM File
Copy the following command into a terminal (see Notes 22 and 23):

    samtools sort /home/October/Desktop/Sample1.bam /home/October/Desktop/Sample1-SORTED

3.21 Index Your Sorted BAM File
Copy the following command into a terminal (see Notes 22 and 23):

    samtools index /home/October/Desktop/Sample1-SORTED.bam

3.22 Create a Pileup File with Samtools mpileup
Copy the following command into a terminal (see Notes 22 and 23):

    samtools mpileup -B -d500000 -f /home/October/Desktop/Sample1-Reference.fa /home/October/Desktop/Sample1-SORTED.bam > /home/October/Desktop/Sample1_pileup

3.23 Call SNVs, Including Low Frequency SNVs with LoFreq
Copy the following command into a terminal (see Notes 22 and 23):

    python /home/October/bin/LoFreq-0.2/scripts/lofreq_snpcaller.py -Q 3 -i /home/October/Desktop/Sample1_pileup -o /home/October/Desktop/Sample1.snp

3.24 Filter SNPs with Strand Bias Out of Your LoFreq Output
Copy the following command into a terminal (see Notes 22 and 23):

    lofreq_snpcaller.py -i /home/October/Desktop/Sample1.snp -o /home/October/Desktop/Sample1_filtered.snp --strandbiasholmbonf

4
Notes 1. The QIAamp-Viral-RNA-Mini Kit (Qiagen) is a non-phenol– chloroform extraction method for viral RNA. The columnbased technology uses a silica membrane to bind viral RNA from patient plasma, patient serum, or cell culture supernatant. Below is the RNA Mini Kit protocol with minor modifications. Described below is the method we chose to implement but any other method that gives you a clean RNA extraction product would also be suitable. 2. The Maxima H Minus First Strand cDNA Synthesis Kit from Thermo Scientific is used to generate cDNA from the viral RNA. We have achieved reliable and reproducible results with this kit, but it is by no means the only available kit for cDNA preparation. This is the case for all kits used in subsequent sections of the methods seconds. If you are generating more than ten fragments to amplify your viral genome, you will need to make multiple cDNA preparations at this step.
192
Pauline Poh Kim Aw et al.
3. The primer for making cDNA is the reverse primer at the 3′ end of the viral genome.
4. The protocol can be paused here if necessary and the sample stored overnight at −80 °C.
5. PfuUltra II Fusion HS DNA Polymerase from Agilent serves to amplify viral amplicons. In the case of whole genome sequencing, it is ideal for the PCR amplicons to be approximately equal in size, as this makes balancing the molarity of your amplicons more straightforward and makes fragmentation more homogeneous.
6. Ideally, the fewest amplicons required to cover the entire genome should be used so as to reduce the amount of primer masking that occurs in the downstream analysis. See Appendix for the primers used to amplify DENV1-4.
7. This step essentially follows the QIAquick Spin Handbook (Qiagen), pages 25–26.
8. Weigh the empty tube first and then weigh it again with the gel slice in the tube. Subtract the weight of the empty tube from the weight of the tube with the gel slice to get the weight of the gel slice alone.
9. This step is performed to check the size, concentration, and quality of the DNA material collected after the gel extraction in step 3.5 prior to pooling the amplicons.
10. It is important to ensure that you have allowed sufficient time for your reagents to equilibrate to room temperature. Failure to do so can be a source of error.
11. After filling all the wells, ensure that the run is started within 5 min for best results. It is important to minimize vibrations around the machine while the analysis is running, as this can be a cause of errors.
12. This step shears your gel-extracted amplicons to a size range of approximately 200 base pairs. Covaris implements Adaptive Focused Acoustics in a temperature-controlled environment to mechanically process samples to a specific size range of interest by using bursts of ultrasonic acoustic energy. The settings below have been optimized for 1,500–2,500 base pair amplicons. Other amplicon sizes would need further optimization.
13. Library preparation of your cleaned, fragmented cDNA is necessary to ensure proper binding of your library onto the flowcell of the sequencer. Kapa Biosystems makes the kit we will describe in subsequent steps, but as the field evolves there will inevitably be more choices for library preparation.
14. This step converts overhangs generated during the fragmentation step into blunt ends using the end repair mix.
15. This step adds an “A” to the 3′ end of the blunt DNA fragments, which prevents self-ligation of fragments and allows for adapter ligation in the following step.
16. This step ensures that the fragments entering the PCR step are relatively homogeneous in size. 300 bp libraries are typically preferred when doing 2 × 76 bp Illumina sequencing. If your application requires longer reads, then the size of the excised product will need to be increased accordingly to ensure that your reads do not overlap on the final product.
17. Depending on what your starting concentration was, it is possible that you may not see a band at this position. Don’t panic. Cut as closely as you can to where the 300 bp product should be and proceed to the amplification step.
18. Primers binding to the ligated adapters are used along with Kapa high-fidelity polymerase to amplify adapter-ligated products, which are at this point present in very low concentrations. The optimal amount of adapter-ligated product to be used as starting material for the enrichment PCR is 40 ng. If your sample is less concentrated, you can add up to the entire volume of the purified adapter-ligated product, 32 μl, to begin the enrichment PCR. It is critical that the number of cycles used to enrich the library be kept to an absolute minimum to avoid introducing PCR-derived errors.
19. Each sample will be assigned a unique barcode, consisting of six random nucleotides.
20. For example, if there are 12 samples (12 indexes), take 2 μl of each sample.
21. The amount of time it takes to sequence your sample will depend on the platform you are using. An Illumina MiSeq can generate data in as little as 4 h, while the Illumina HiSeq can take as long as 2 weeks.
22. For simplicity, all files are assumed to be located in the “home” directory of the user “October” on the “Desktop.” This location is not a requirement and will change depending on the name of the user and where the actual files are located. For a better understanding of paths in a Linux environment, please see the link provided in Subheading 2.10.
23. It is important to type this command all on one line without pressing “return” until the entire command has been entered.
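Notes 5 and 9 refer to balancing the molarity of the amplicons before pooling. The arithmetic behind that balancing can be sketched as follows. This is only an illustration, assuming the commonly used approximation of 660 g/mol per base pair of double-stranded DNA; the amplicon names, concentrations, lengths, and the 50 fmol target are hypothetical placeholders, not values from this protocol.

```python
# Approximate average mass of one base pair of dsDNA in g/mol (a standard
# approximation, assumed here; it is not stated in the protocol).
AVG_BP_MASS = 660.0


def nanomolar(conc_ng_per_ul, length_bp):
    """Convert a dsDNA concentration in ng/ul to nM for a fragment of length_bp."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def equimolar_volumes(amplicons, fmol_each):
    """Return the volume (ul) of each amplicon that contains fmol_each femtomoles.

    amplicons maps a name to (concentration in ng/ul, length in bp).
    Because 1 nM equals 1 fmol/ul, volume = fmol / nM.
    """
    return {
        name: fmol_each / nanomolar(conc, length)
        for name, (conc, length) in amplicons.items()
    }


if __name__ == "__main__":
    # Hypothetical gel-extracted amplicons: (ng/ul, bp).
    example = {
        "amplicon_1": (20.0, 2000),
        "amplicon_2": (10.0, 2000),
        "amplicon_3": (15.0, 2500),
    }
    for name, volume in equimolar_volumes(example, fmol_each=50.0).items():
        print(f"{name}: {volume:.2f} ul")
```

Amplicons of similar length need volumes that depend only on their measured concentrations, which is one reason equal-sized amplicons simplify pooling.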
References
1. Coffey LL, Beeharry Y, Borderia AV, Blanc H, Vignuzzi M (2011) Arbovirus high fidelity variant loses fitness in mosquitoes and mice. Proc Natl Acad Sci U S A 108(38):16038–16043. doi:10.1073/pnas.1111650108
2. Jenkins GM, Rambaut A, Pybus OG, Holmes EC (2002) Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis. J Mol Evol 54(2):156–165. doi:10.1007/s00239-001-0064-3
3. Holmes EC (2003) Molecular clocks and the puzzle of RNA virus origins. J Virol 77(7):3893–3897
4. Skalsky RL, Vanlandingham DL, Scholle F, Higgs S, Cullen BR (2010) Identification of microRNAs expressed in two mosquito vectors, Aedes albopictus and Culex quinquefasciatus. BMC Genomics 11:119. doi:10.1186/1471-2164-11-119
5. David JP, Coissac E, Melodelima C, Poupardin R, Riaz MA, Chandor-Proust A, Reynaud S (2010) Transcriptome response to pollutants and insecticides in the dengue vector Aedes aegypti using next-generation sequencing technology. BMC Genomics 11:216. doi:10.1186/1471-2164-11-216
6. Sessions OM, Tan Y, Goh KC, Liu Y, Tan P, Rozen S, Ooi EE (2013) Host cell transcriptome profile during wild-type and attenuated dengue virus infection. PLoS Negl Trop Dis 7(3):e2107. doi:10.1371/journal.pntd.0002107
7. Yozwiak NL, Skewes-Cox P, Stenglein MD, Balmaseda A, Harris E, DeRisi JL (2012) Virus identification in unknown tropical febrile illness cases using deep sequencing. PLoS Negl Trop Dis 6(2):e1485. doi:10.1371/journal.pntd.0001485
8. Borderia AV, Stapleford KA, Vignuzzi M (2011) RNA virus population diversity: implications for inter-species transmission. Curr Opin Virol 1(6):643–648. doi:10.1016/j.coviro.2011.09.012
9. Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH, Wong CH, Khor CC, Petric R, Hibberd ML, Nagarajan N (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res 40(22):11189–11201. doi:10.1093/nar/gks918
10. Domingo E, Holland JJ (1997) RNA virus mutations and fitness for survival. Annu Rev Microbiol 51:151–178. doi:10.1146/annurev.micro.51.1.151
11. Cordey S, Junier T, Gerlach D, Gobbini F, Farinelli L, Zdobnov EM, Winther B, Tapparel C, Kaiser L (2010) Rhinovirus genome evolution during experimental human infection. PLoS One 5(5):e10588. doi:10.1371/journal.pone.0010588
12. Eckerle LD, Becker MM, Halpin RA, Li K, Venter E, Lu X, Scherbakova S, Graham RL, Baric RS, Stockwell TB, Spiro DJ, Denison MR (2010) Infidelity of SARS-CoV Nsp14-exonuclease mutant virus replication is revealed by complete genome sequencing. PLoS Pathog 6(5):e1000896. doi:10.1371/journal.ppat.1000896
13. Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, Berlin AM, Malboeuf CM, Ryan EM, Gnerre S, Zody MC, Erlich RL, Green LM, Berical A, Wang Y, Casali M, Streeck H, Bloom AK, Dudek T, Tully D, Newman R, Axten KL, Gladden AD, Battis L, Kemper M, Zeng Q, Shea TP, Gujja S, Zedlack C, Gasser O, Brander C, Hess C, Gunthard HF, Brumme ZL, Brumme CJ, Bazner S, Rychert J, Tinsley JP, Mayer KH, Rosenberg E, Pereyra F, Levin JZ, Young SK, Jessen H, Altfeld M, Birren BW, Walker BD, Allen TM (2012) Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog 8(3):e1002529. doi:10.1371/journal.ppat.1002529
14. Nasu A, Marusawa H, Ueda Y, Nishijima N, Takahashi K, Osaki Y, Yamashita Y, Inokuma T, Tamada T, Fujiwara T, Sato F, Shimizu K, Chiba T (2011) Genetic heterogeneity of hepatitis C virus in association with antiviral therapy determined by ultra-deep sequencing. PLoS One 6(9):e24907. doi:10.1371/journal.pone.0024907
15. Neverov A, Chumakov K (2010) Massively parallel sequencing for monitoring genetic consistency and quality control of live viral vaccines. Proc Natl Acad Sci U S A 107(46):20063–20068. doi:10.1073/pnas.1012537107
16. Parameswaran P, Charlebois P, Tellez Y, Nunez A, Ryan EM, Malboeuf CM, Levin JZ, Lennon NJ, Balmaseda A, Harris E, Henn MR (2012) Genome-wide patterns of intrahuman dengue virus diversity reveal associations with viral phylogenetic clade and interhost diversity. J Virol 86(16):8546–8558. doi:10.1128/JVI.00736-12
17. Wright CF, Morelli MJ, Thebaud G, Knowles NJ, Herzyk P, Paton DJ, Haydon DT, King DP (2011) Beyond the consensus: dissecting within-host viral population diversity of foot-and-mouth disease virus by using next-generation genome sequencing. J Virol 85(5):2266–2275. doi:10.1128/JVI.01396-10
18. Chin-inmanu K, Suttitheptumrong A, Sangsrakru D, Tangphatsornruang S, Tragoonrung S, Malasit P, Tungpradabkul S, Suriyaphol P (2012) Feasibility of using 454 pyrosequencing for studying quasispecies of the whole dengue viral genome. BMC Genomics 13(Suppl 7):S7. doi:10.1186/1471-2164-13-S7-S7
19. Makhluf H, Buck MD, King K, Perry ST, Henn MR, Shresta S (2013) Tracking the evolution of dengue virus strains D2S10 and D2S20 by 454 pyrosequencing. PLoS One 8(1):e54220. doi:10.1371/journal.pone.0054220
20. Hoang LT, Lynn DJ, Henn M, Birren BW, Lennon NJ, Le PT, Duong KT, Nguyen TT, Mai LN, Farrar JJ, Hibberd ML, Simmons CP (2010) The early whole-blood transcriptional signature of dengue virus and features associated with progression to dengue shock syndrome in Vietnamese children and young adults. J Virol 84(24):12982–12994. doi:10.1128/JVI.01224-10
21. Macalalad AR, Zody MC, Charlebois P, Lennon NJ, Newman RM, Malboeuf CM, Ryan EM, Boutwell CL, Power KA, Brackney DE, Pesko KN, Levin JZ, Ebel GD, Allen TM, Birren BW, Henn MR (2012) Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Comput Biol 8(3):e1002417. doi:10.1371/journal.pcbi.1002417
22. Yang X, Charlebois P, Gnerre S, Coole MG, Lennon NJ, Levin JZ, Qu J, Ryan EM, Zody MC, Henn MR (2012) De novo assembly of highly diverse viral populations. BMC Genomics 13:475. doi:10.1186/1471-2164-13-475
23. Wei Z, Wang W, Hu P, Lyon GJ, Hakonarson H (2011) SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Res 39(19):e132. doi:10.1093/nar/gkr599
24. Barrick JE, Yu DS, Yoon SH, Jeong H, Oh TK, Schneider D, Lenski RE, Kim JF (2009) Genome evolution and adaptation in a long-term experiment with Escherichia coli. Nature 461(7268):1243–1247. doi:10.1038/nature08480
25. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. doi:10.1093/bioinformatics/btp324
26. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu SM, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam TW, Wang J (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1(1):18. doi:10.1186/2047-217X-1-18
27. Otto TD, Sanders M, Berriman M, Newbold C (2010) Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 26(14):1704–1707. doi:10.1093/bioinformatics/btq269
28. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078–2079. doi:10.1093/bioinformatics/btp352
29. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297–1303. doi:10.1101/gr.107524.110
30. Guo Y, Li J, Li CI, Long J, Samuels DC, Shyr Y (2012) The effect of strand bias in Illumina short-read sequencing data. BMC Genomics 13:666. doi:10.1186/1471-2164-13-666