
Next-generation sequencing and its applications in molecular diagnostics
Expert Rev. Mol. Diagn. 11(3), 333–343 (2011)

Zhenqiang Su1, Baitang Ning2, Hong Fang1, Huixiao Hong2, Roger Perkins2, Weida Tong2 and Leming Shi†2
1 Z-Tech, an ICF International Company at US FDA’s National Center for Toxicological Research, 3900 NCTR Road, Jefferson, AR 72079, USA
2 National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
† Author for correspondence: [email protected]

www.expert-reviews.com

DNA sequencing is a powerful approach for decoding a number of human diseases, including cancers. The advent of next-generation sequencing (NGS) technologies has reduced sequencing cost by orders of magnitude and significantly increased throughput, making whole-genome sequencing a feasible way of obtaining global genomic information about patients on whom clinical actions may be taken. However, the benefits offered by NGS technologies come with a number of challenges that must be adequately addressed before the technologies can be transformed from research tools into routine clinical practice. This article provides an overview of four commonly used NGS technologies from Roche Applied Science/454 Life Sciences, Illumina, Life Technologies and Helicos Biosciences. The challenges in the analysis of NGS data and the potential applications of NGS in clinical diagnosis are also discussed.

Keywords: ChIP-Seq • massively parallel sequencing • molecular diagnostics • next-generation sequencing • RNA-Seq • single-nucleotide variant • transcriptome

DOI: 10.1586/ERM.11.3; ISSN 1473-7159; © Leming Shi

Over the past few years, there have been remarkable advances in DNA sequencing technologies with the emergence and rapid evolution of next-generation sequencing (NGS), also known as massively parallel sequencing [1]. Examples are platforms developed by companies such as Roche Applied Science (454 Genome Sequencer FLX [GS FLX] System; CT, USA), Illumina (Genome Analyzer [GA] II; CA, USA), Life Technologies (Sequencing by Oligonucleotide Ligation and Detection [SOLiD™]; CA, USA) and Helicos BioSciences (HeliScope™ Single Molecule Sequencer; MA, USA). By sequencing DNA in a massively parallel fashion, NGS technologies have dramatically reduced both the cost per base and the time required to decode an entire human genome, making DNA sequencing a cost-effective option for many experimental approaches and allowing investigators to carry out experiments that previously were not technically feasible or affordable (e.g., sequencing thousands of cancer genomes). Although differing in sequencing chemistries and technical details, all commercialized NGS platforms utilize a similar technical strategy, the miniaturization of individual sequencing chemical reactions [1], to overcome the limited scalability of traditional Sanger sequencing [2], which has been extensively used in somatic and germline genetic studies over the past 30 years and currently remains the gold standard for decoding DNA sequences.

The miniaturization of individual sequencing reactions, coupled with other technical breakthroughs, including overcoming the bottlenecks of library preparation and template preparation [3], allows millions of individual sequencing reactions to occur in parallel. Clonal clusters of an original DNA fragment are sequenced in each miniaturized chemical reaction, and millions of them are spatially arranged so that individual reactions are isolated from one another and can be distinctly detected by digital imaging or other approaches. The results are prodigious volumes of short-read sequence data, unprecedented detail and single-nucleotide resolution of sequence complexity, with consequential challenges in storing, managing, analyzing and interpreting such a wealth of data.

In a relatively short time span since 2005, NGS technologies have fundamentally changed high-throughput genomic research and have opened up many new research areas and novel applications [4]. As reflected in the exponential growth in the number of NGS-related research articles indexed in Medline (Figure 1), NGS technologies have demonstrated their enormous potential for researchers working in medicine, biology and life sciences. Along with the development of robust informatics tools for nucleotide variant detection [5], the ongoing evolution of NGS technologies will continue to reduce cost, simplify the workflow for sample preparation and improve technical robustness [6,7], paving the way for translating NGS technologies into clinical diagnostics and personalized medicine.

In this article, we first provide an overview of the principles of four commercialized NGS technologies. We then discuss the general challenges in the analysis of NGS short reads and, finally, the possible impacts and applications of NGS technologies in clinical diagnostics.

[Figure 1: bar chart of the total number of articles (y-axis) by year, 2004–2010 (x-axis); bar labels 7, 21, 66, 242, 739 and 1468.]

Figure 1. The number of publications related to next-generation sequencing and indexed in PubMed has been increasing exponentially. The numbers reflect the PubMed search results obtained with the following query on 11 October, 2010: (‘next generation sequencing’ OR ‘next-generation sequencing’ OR ‘next generation DNA sequencing’ OR ‘next-generation DNA sequencing’ OR ‘RNA-Seq’ OR ‘Chip-Seq’ OR ‘mRNA-Seq’ OR ‘PeakSeq’ OR ‘454 sequencing’ OR ‘direct RNA sequencing’ OR ‘massively parallel sequencing’ OR ‘ultrafast DNA sequencing’ OR ‘deep sequencing’ OR ‘454 Life Sciences’) AND (2004[Publication Date]:2010[Publication Date]).

Next-generation sequencing technologies

Roche/454 pyrosequencing

The Roche/454 GS FLX System is based on emulsion PCR [8] and pyrophosphate detection [9] techniques. To construct a library of DNA templates, sheared DNA fragments are ligated to specific oligonucleotide adapters, so that each DNA fragment binds to its own fragment-carrying bead, and the templates are amplified by a highly efficient method known as emulsion PCR. The beads are captured in separate emulsion droplets that function as amplification reactors, producing the approximately 10 million clonal copies of the DNA template that are needed for sufficient light signal intensity. Upon completion of the emulsion PCR amplification, the emulsion is disrupted and the beads containing clonally amplified template DNA are enriched. The beads are then separated by limiting dilution and deposited into individual picotiter-plate wells. The picotiter plates serve as sequencing reactors, allowing individual enzymatic sequencing reactions to occur without interference from adjacent wells. Visible light emitted from the subsequent pyrosequencing reactions is detected by a charge-coupled device (CCD) that is bonded to a fiber-optic bundle.

During each cycle of a pyrosequencing reaction, a single species of unlabeled nucleotide is supplied to all beads on the chip, so that the complementary strand of DNA is synthesized sequentially. With the incorporation of each base into the growing chain, an inorganic pyrophosphate group is released and converted to ATP. The ATP is then used by luciferase to convert luciferin to oxyluciferin, producing a light pulse. Detecting the light emission, together with the known identity of the nucleotide supplied in each step, allows the incorporated base to be determined. Through a series of such pyrosequencing reaction cycles, the sequences of the DNA templates carried by individual beads are determined.

In a given pyrosequencing reaction cycle, multiple consecutive incorporations may occur owing to the lack of a terminating moiety. Thus, the length of homopolymers (i.e., repeats of the same base, such as AAAA) in sequence reads must be inferred from the light signal intensity, with a higher intensity corresponding to more repeats of the same base. The error rate of calling consecutive repeats increases when the homopolymer is longer than three to four bases. Consequently, the major error type for the Roche/454 system is insertions and deletions (indels), rather than substitutions [10]. Compared with other NGS platforms, the strength of the Roche/454 system is its longer sequence reads.
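The way homopolymer length is inferred from light intensity can be sketched in a few lines of Python. This is only an illustrative toy, not the vendor's base-calling software: the function name `call_flowgram`, the cyclic flow order `TACG`, and the normalization of a single incorporation to an intensity of about 1.0 are all assumptions made for the example.

```python
def call_flowgram(signals, flow_order="TACG"):
    """Convert per-flow light intensities into a base sequence.

    signals: one non-negative intensity per nucleotide flow, assumed to be
             normalized so a single incorporation gives roughly 1.0.
    flow_order: assumed cyclic order in which nucleotides are supplied.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporations:
        # ~0 means no incorporation, ~3 means a run of three identical bases.
        run_length = int(signal + 0.5)
        bases.append(nucleotide * run_length)
    return "".join(bases)


if __name__ == "__main__":
    flows = [1.02, 0.05, 2.10, 0.90, 0.10, 3.40, 0.00, 1.10]
    print(call_flowgram(flows))  # TCCGAAAG
```

In this toy model, a measured intensity of 3.4 is called as AAA while 3.6 would be called as AAAA, which is one way to see why indels in longer homopolymers, rather than substitutions, dominate the error profile.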
The Roche/454 GS FLX, with its newest chemistry, the GS FLX Titanium series reagents, can generate more than 1 million individual sequence reads with read lengths over 400 bases during a 10-h timespan [201]. Although its per-base cost is much higher than that of other NGS platforms (e.g., Life Technologies/SOLiD and Illumina/GA IIx), the Roche/454 system is best suited for certain applications, such as de novo sequencing of new genomes, for which long read length is critical for de novo genome assembly.

Illumina sequencing technology

The Illumina GA system is the first short-read sequencing platform and currently dominates the NGS market [1]; it uses an array technique to achieve cloning-free DNA amplification. Reversible terminator chemistry is its defining characteristic and enables massively parallel sequencing of millions of DNA fragments at a low cost. DNA samples are randomly sheared into fragments that are then end-repaired to generate 5´-phosphorylated blunt ends. The Klenow fragment of DNA polymerase is then used to attach a single ‘A’ base to the 3´-end of the DNA fragments, which prepares them for ligation to oligonucleotide adapters. After ligation to adapters at both ends, the DNA fragments are denatured and the single-stranded DNA fragments are attached to reaction chambers located on an optically transparent solid surface called a flow cell. The attached DNA fragments are extended and amplified by bridge PCR amplification in order to obtain sufficient light signal intensity for reliable detection.


The bridge PCR amplification can create an ultra-high-density sequencing array on the flow cell, containing hundreds of millions of clusters, with each cluster containing approximately 1000 copies of the same DNA template. These templates are finally sequenced through a sequencing-by-synthesis technique that uses reversible terminators with removable fluorescent dyes. For sequencing and DNA synthesis, reaction mixtures comprising primers, DNA polymerase and four reversible terminator nucleotides, each labeled with a different fluorescent dye, are supplied to the flow cell. In each sequencing cycle, a specific terminator is incorporated according to sequence complementarity in each template DNA strand in a clonal cluster. After incorporation, the identity (base calling) and the position of the incorporated terminator on the flow cell are determined from the fluorescent dye emission, and the signal is recorded using a CCD camera. In the following cycle, the reversible terminator is unblocked and the fluorescent dye label is removed from the base so that a new nucleotide can be incorporated and a new base can be detected using the same strategy. This repetitive sequencing-by-synthesis process takes approximately 2.5 days to generate 50 million reads per flow cell, with a read length of 36 bases. The overall sequencing output of the Illumina GA system is more than 1 billion bases (1 Gb) per analytical run. The throughput is dramatically increased with new models, such as the GA IIx and HiSeq 2000 [202].

In a given cycle of sequencing, any modified nucleotide could be incorporated with decreased or increased efficiency, resulting in under- or over-incorporation, a heterogeneous mixture of synthesis lengths and a concomitant degradation of signal purity and precision. Moreover, ‘dark’ bases (without a fluorophore) can also result in leading or lagging dephasing. In addition, chemical cleavage of the terminating moieties and fluorescent dye labels is sometimes incomplete. Therefore, Illumina’s sequencing strategy generates much shorter reads and its most common error type is substitutions [10]. The base-call error rate increases with read length owing to ‘dephasing noise’ [11]. In addition, an over-representation of GC-rich regions and an under-representation of AT-rich regions have been observed [11].
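To give a feel for why dephasing limits read length, the following Python sketch tracks the fraction of strands in a cluster that have never slipped out of phase. It is a deliberately simplified model and not a description of Illumina's base caller: the function name `in_phase_fraction`, the assumption that each strand lags or leads independently in every cycle, and the example probabilities are all hypothetical.

```python
def in_phase_fraction(cycle, p_lag=0.005, p_lead=0.002):
    """Fraction of strands in a cluster that have never slipped out of phase.

    cycle: number of sequencing cycles completed so far.
    p_lag: assumed per-cycle probability that a strand fails to extend.
    p_lead: assumed per-cycle probability that a strand extends twice.
    Each strand is assumed to slip independently of the others.
    """
    if cycle < 0:
        raise ValueError("cycle must be non-negative")
    # A strand stays in phase only if it neither lags nor leads in any cycle.
    return (1.0 - p_lag - p_lead) ** cycle


if __name__ == "__main__":
    for cycle in (1, 36, 100):
        print(cycle, round(in_phase_fraction(cycle), 3))
```

Under these assumptions the in-phase signal decays geometrically with cycle number, which is consistent with the observation that the base-call error rate rises toward the end of a read.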

Life Technologies/SOLiD

The SOLiD system relies on the techniques described by Shendure et al. [12] and McKernan et al. [101]. Library construction for the SOLiD system is similar to that of the Roche/454 technology: DNA is stochastically sheared into fragments that are subsequently ligated to oligonucleotide adapters, attached to beads and clonally amplified by emulsion PCR. After denaturing the templates, template-carrying beads are enriched, and the templates on the selected beads are 3´-modified for covalent attachment to the slide. The 3´-modified beads are then deposited onto a derivatized-glass flow cell surface to generate a dense, disordered array. Sequencing reactions are started by hybridizing a primer oligonucleotide complementary to the adapter at the adapter–template junction. Unlike the Roche/454 sequencing approach, sequencing-by-synthesis in the SOLiD system is based on ligation chemistry.

Briefly, a mixture of partially degenerate oligonucleotide octamers is competitively hybridized to the DNA fragments as probes, and a universal primer is oriented to provide a 5´-phosphate group for ligation. The specificity of the probe ligated to a primer is determined by the fourth and fifth bases of the probe, which are complementary to the template, and the identities (base calling) of these two bases are indicated by one of four fluorescent labels at the end of the octamer. After ligation, the ligated octamer is cleaved after the fifth base and the fluorescent label is removed, so that the next hybridization and ligation cycle can proceed. In this way, the fourth and fifth bases in the template are determined in the first cycle, the ninth and tenth bases in the second cycle, and so on. The ligation sequencing can then be repeated in the same way with another primer offset by one base in the adapter, so that bases three and four, eight and nine, and so on, in the template are determined. Over five such rounds, each base is interrogated twice with two different fluorescent labels, resulting in a significantly reduced base-call error rate.

By using ligation-based sequencing-by-synthesis, the SOLiD system mitigates homopolymeric sequencing errors. The built-in two-base encoding system can also correct most read errors where the two-base transition can be identified. The dominant error type is substitutions. The raw error rate is high, ranging from approximately 2% at the 5´-end to approximately 8% at the 3´-end [13]. However, according to one study [14], an accuracy of 99.99% can be achieved by the Roche 454, Illumina GA and SOLiD platforms under saturated coverage.

In addition, Life Technologies has recently acquired Ion Torrent, which has developed a non-light-based sequencing technology. Ion Torrent sequencing is based on a natural biochemical process in which a hydrogen ion is released when a nucleotide is incorporated into a DNA strand. By monitoring the pH of the solution, the incorporated bases can be determined. Because no proprietary chemistries, fluorescence, chemiluminescence or optics are required, Ion Torrent sequencing is a simpler, faster, more cost-effective and more scalable system than other commercialized platforms.
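The two-base encoding used by the SOLiD system can be illustrated with a short Python sketch. The text does not spell out the dinucleotide-to-color table, so the mapping below (colors 0 to 3 obtained as the XOR of 2-bit base codes with A=0, C=1, G=2, T=3) is the commonly described scheme and should be read as an assumption; the function names `encode_colors` and `decode_colors` are hypothetical.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3


def encode_colors(seq):
    """Encode a base sequence as one color (0-3) per overlapping base pair."""
    codes = [BASES.index(base) for base in seq]
    # Each color depends on two adjacent bases, so every interior base
    # contributes to two consecutive colors.
    return [left ^ right for left, right in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Recover the base sequence from a known first base and the colors."""
    current = BASES.index(first_base)
    decoded = [first_base]
    for color in colors:
        current ^= color  # XOR is its own inverse, so this undoes encoding
        decoded.append(BASES[current])
    return "".join(decoded)


if __name__ == "__main__":
    reference = encode_colors("ATGGC")
    variant = encode_colors("ATCGC")  # single-base change G -> C
    print(reference, variant)
    print(decode_colors("A", reference))
```

In this scheme a genuine single-nucleotide variant changes two adjacent colors, whereas an isolated measurement error changes only one, which is the basis on which most read errors can be recognized and corrected.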

Helicos HeliScope genetic analysis system

The HeliScope Genetic Analysis System is the first commercialized single-molecule DNA sequencer. It is based on true single molecule sequencing technology stemming from the work of Braslavsky et al. [15] and relies on the cyclic interrogation of a dense array of sequencing features. By directly sequencing single molecules of DNA or RNA without the clonal amplification required by other NGS systems, Helicos’ true single molecule sequencing technology significantly increases the speed and decreases the cost of sequencing. In the HeliScope system, a DNA library is constructed by random fragmentation of a DNA sample and 3´-end polyadenylation of the DNA fragments with adenosine terminal transferase. Denatured poly-A fragments are captured on a flow-cell surface by hybridization to surface-tethered poly-T oligomers to yield


Table fragment (Helicos Biosciences row; column headers not recovered): Helicos Biosciences; HeliScope™ Single Molecule Sequencer; True single molecule sequencing by synthesis; None; 1400; 3–12; 21–35; 8; 600–800; 25–55; Non-bias single-molecule sequencing, high error rate.
SOLiD: Sequencing by Oligonucleotide Ligation and Detection.
