Review
Technological advances in DNA sequence enrichment and sequencing for germline genetic diagnosis
Expert Rev. Mol. Diagn. 12(2), 159–173 (2012)

Chee-Seng Ku*1, Mengchu Wu1, David N Cooper2, Nasheen Naidoo3, Yudi Pawitan4, Brendan Pang5, Barry Iacopetta6 and Richie Soong1,5
1 Cancer Science Institute of Singapore, National University of Singapore, Singapore
2 Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
3 Saw Swee Hock School of Public Health, National University of Singapore, Singapore
4 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
5 Department of Pathology, National University Health System, Singapore
6 School of Surgery, The University of Western Australia, Australia
*Author for correspondence: Tel.: +65 81388095 Fax: +65 68739664
www.expert-reviews.com
The potential applications of next-generation sequencing technologies in diagnostic laboratories have become increasingly evident despite the various technical challenges that still need to be overcome to potentiate their widespread adoption in a clinical setting. Whole-genome sequencing is now both technically feasible and ‘cost effective’ using next-generation sequencing techniques. However, this approach is still considered to be ‘expensive’ for a diagnostic test. Although the goal of the US$1000 genome is fast approaching, neither the analytical hurdles nor the ethical issues involved are trivial. In addition, the cost of data analysis and storage has been much higher than initially expected. As a result, it is widely perceived that targeted sequencing and whole-exome sequencing are more likely to be adopted as diagnostic tools in the foreseeable future. However, the information-generating power of whole-exome sequencing has also sparked considerable debate in relation to its deployment in genetic diagnostics, particularly with reference to the revelation of incidental findings. In this review, we focus on the targeted sequencing approach and its potential as a genetic diagnostic tool.

KEYWORDS: diagnostic • exome • Mendelian disorder • next-generation sequencing • sequence enrichment
doi: 10.1586/ERM.11.95

Next-generation sequencing (NGS) technologies have matured as a mutation discovery tool since their advent in 2005 [1,2]. However, the prospect of using high-throughput sequencing technology in a medical diagnostic setting only recently became a reality, when a patient with congenital chloride-losing diarrhea was diagnosed through the identification of a homozygous missense variant in a known causative gene for the disease, SLC26A3, by means of whole-exome sequencing (WES) [3]. This patient was initially suspected of having Bartter’s syndrome (a group of renal diseases characterized by hypokalemia and metabolic alkalosis), although this diagnosis was based only on superficial phenotyping. This notwithstanding, this pioneering study highlighted the power of WES compared with a targeted sequencing approach, as the molecular diagnosis of congenital chloride-losing diarrhea would not have been made had the sequencing analysis been confined to known causal genes for Bartter’s syndrome. This conclusion has been further underscored by the somewhat less equivocal case of an adult female
with a well-defined clinical diagnosis of Leber’s congenital amaurosis, who did not harbor mutations in any of the known gene loci responsible for this genetically heterogeneous phenotype [4]. Instead, she was found to be homozygous for a well-known Zellweger syndrome mutation in the PEX1 gene, which would not have been revealed without WES. WES has also demonstrated its clinical utility in the case of a patient affected by two different disorders. Specifically, two different mutations in the unlinked genes SLC45A2 and G6PC3 were identified in a single patient with an indeterminate clinical phenotype. These lesions are sufficient to account for the two different phenotypes manifested by this patient – namely, oculocutaneous albinism type 4 and congenital neutropenia. Had a targeted sequencing approach been used in this case to screen all known oculocutaneous albinism genes for pathological mutations, then only the homozygous mutation in SLC45A2 would have been identified and the congenital neutropenia would have remained unexplained. The converse would have been true
© 2012 Expert Reviews Ltd
ISSN 1473-7159
had the targeted sequencing approach been applied to all genes implicated in congenital neutropenia [5].

The aforementioned delay in applying WES in a diagnostic context may be attributed in part to the initial technical difficulties inherent in isolating and enriching the collection of all exons (the exome) in the human genome, which lay beyond the technical capacity of traditional PCR amplification methods. This technical obstacle was removed with the development of commercial whole-exome enrichment kits by Agilent, Illumina and NimbleGen [6]. During the exome-enrichment steps, the genomic regions of interest (i.e., all exons) are captured, while the unwanted DNA sequences (i.e., noncoding regions) are removed prior to sequencing, leading to a significant reduction in the proportion of the genome needing to be sequenced. As a result of the enrichment, a higher sequencing depth (or per-base coverage) of the targeted regions can be achieved. This is particularly important in the context of clinical diagnostic applications, either to allow the accurate detection of inherited disease mutations or to confirm that an apparent negative result is not incorrect. Therefore, these enrichment methods coupled with the existing NGS technologies have made WES more technically feasible as well as more cost effective [1,2,7–11].

In a parallel development, a targeted sequencing approach coupled with NGS, albeit with limited discovery value (i.e., only capable of discovering new causal mutations in known or targeted genes), has also been widely examined as a potential genetic diagnostic and screening tool. The number of genes targeted by this approach ranges from several up to many tens [12,13]. This targeted sequencing approach was initially performed using traditional PCR-Sanger sequencing methods, which were both laborious and expensive and had low scalability in terms of the number of amplicons and samples.
However, attempts have been made to develop in-house multiplexed PCR reactions to enrich for the entire coding regions of seven genes implicated in peripheral neuropathies, with sequencing being performed on the Roche 454 Genome Sequencer (GS) FLX in 40 individuals [14]. Further validation by Sanger sequencing confirmed the detection of all variants, indicating the high specificity of the methodology. In addition to traditional multiplexed PCR reactions developed in-house [14,15], the parallel developments of custom-made sequence hybridization enrichment kits marketed by Agilent and NimbleGen [12,16,17], and the high-throughput multiplexed PCR enrichment methods developed by RainDance Technologies™ and Fluidigm [13,18,19], have made this targeted sequencing approach increasingly relevant to, and accessible within, a clinical setting. The application of sequencing to clinical medicine includes the detection of known disease-causing mutations for Mendelian disorders, such as autosomal recessive ataxia [16], primary ciliary dyskinesia [12] and congenital disorders of glycosylation [13] for diagnosis; and the genetic screening of causal genes such as BRCA1/BRCA2 in families with familial breast cancer [20] and the adenomatous polyposis coli (APC) and mismatch-repair genes in familial colorectal cancer [21] for risk stratification in carriers of high-risk mutations. The custom-made sequence capture and enrichment kits based on hybrid selection allow researchers
to target specific regions of up to several megabases of the human genome in a single experiment (rather than the whole exome), while the commercial microdroplet PCR methods potentiate the amplification of thousands of amplicons simultaneously in a single reaction with a total genomic size of hundreds of kilobases. Although several enrichment methods are available, the high-throughput production of sequencing data (up to several hundred gigabases) by NGS technologies such as Illumina® HiSeq 2000 and Life Technologies SOLiD® 4/5500 has rendered them less suitable for use by clinical diagnostic laboratories, which require a lower throughput and a much more rapid sequencing turnaround time than is required in a research context [1,2]. The capacity of the high-throughput NGS technologies clearly greatly exceeds the requirement to sequence multiple disease genes in the context of genetic diagnostics and screening for a single patient or small number of samples. By contrast, Sanger sequencing does not meet the demand of this sequencing requirement in a cost-effective manner; indeed, this has presented a significant bottleneck to adapting the sequencing approach for routine use in clinical laboratories. In a similar vein, developments have also been made in sequencing technologies with a view to facilitating their adoption in a clinical setting. The advent of several ‘medium-throughput’ (i.e., a sequencing throughput lying somewhere between the high-throughput NGS technologies and the low-throughput Sanger sequencing) sequencing machines, commonly known as ‘bench-top’ machines, has effectively closed the gap between the extremes in the spectrum of sequencing data production. Roche took the lead through its introduction of the first medium-throughput sequencing instrument, the Roche 454 GS Junior Sequencing System, in 2010. The performance of the 454 GS Junior in a diagnostic context has recently been evaluated [15,22].
In parallel, Life Technologies have launched their Ion Torrent™ Personal Genome Machine (PGM) Sequencer [23], while the Illumina MiSeq™ Personal Sequencing System was also brought to market at the end of 2011 [101]. Therefore, through harnessing the power of recent technological developments in sequence enrichment and sequencing methodology, targeted sequencing has now become more technically feasible and cost effective. The bench-top sequencing machines contribute to the technical flexibility (i.e., sample logistics because a smaller batch of samples can be processed) and cost–effectiveness (i.e., by avoiding oversequencing in targeted sequencing studies or diagnostic tests of a small number of genes). Apart from the conventional PCR-Sanger sequencing approach, three different approaches or designs coupled with NGS are currently available: targeted sequencing of specific disease genes [12,13,15,16,18,20,22]; WES [3,4,24–30]; and whole-genome sequencing (WGS) [31–33]. The targeted-sequencing approach focuses on the disease-relevant loci rather than attempting to discover novel disease loci. Conversely, the information-generating power of WES and WGS is tailored to disease gene/mutation discovery. Therefore, the application of WES and WGS has led to success in the discovery of new causal mutations, as well as genes underlying previously unresolved disorders, such as Kabuki and Miller syndromes [34–36]. However, these approaches have sparked
considerable debate in relation to their deployment in clinical genetic diagnostics and screening [37]. The targeted sequencing approach is increasingly being deployed in a diagnostic setting. For example, this approach has been applied to the genetic diagnosis of primary ciliary dyskinesia, congenital disorders of glycosylation and autosomal recessive ataxia [12,13,16]. Thus, in this article, we review the technological advances that have led to a more efficient and cost-effective targeted sequencing approach, and discuss the strengths and limitations of different sequence-enrichment methods and medium-throughput sequencing instruments. We also highlight recent developments in applying targeted sequencing in the context of genetic diagnostics and screening, and elaborate on the pros and cons of this approach compared with WES and WGS. Finally, we discuss which of these sequencing approaches would be more suitable for a clinical setting in the near future.

Targeted sequence capture & enrichment
Conventional PCR-Sanger sequencing might still be the most appropriate and cost-effective diagnostic tool for inherited diseases in which practically all clinical cases can be accounted for by a single known causal gene (i.e., without locus heterogeneity) or several hotspot causal mutations in certain exons (i.e., without extensive allelic heterogeneity). This is clearly illustrated by familial adenomatous polyposis, where approximately 90% of cases can be accounted for by mutations in the APC gene [21] . By contrast, this approach is neither efficient nor cost effective for inherited disorders manifesting moderate-to-high locus heterogeneity and where up to tens of causal genes may be implicated, such as Charcot–Marie–Tooth disease (CMT) [24] and retinitis pigmentosa [38] . The current molecular diagnostic strategy involves essentially a ‘one-by-one’ approach. Quite apart from not being cost effective, this strategy is generally inefficient in clinical practice, especially when the accurate and timely molecular diagnosis of a particular condition can result in dramatic improvements in patient care. By way of example, a drastic clinical intervention was undertaken in the case of an allogeneic hematopoietic progenitor cell transplant, which was performed in order to prevent the development of life-threatening hemophagocytic lymphohistiocytosis, in accordance with the recommended treatment for X-linked inhibitor of apoptosis deficiency. The diagnosis was based on the genetic finding, together with the medical history and functional data [26] . In a similar vein, an accurate and rapid genetic diagnosis has also been critical in determining the correct patient treatment in neonatal diabetes [27] and congenital disorders of glycosylation [13] . 
For example, if biochemical analysis were to indicate a diagnosis of a congenital disorder of glycosylation, then it would be important to identify the precise gene defect responsible, as several reasonably effective therapies for several subtypes of these disorders are now available [13]. Therefore, the development of methods to overcome the one-by-one approach is needed in clinical practice. There is an urgent need for an efficient and cost-effective sequence-enrichment protocol, such as high-throughput multiplexed PCR, to enrich multiple genes simultaneously.
Multiplexed PCR-based enrichment
Multiple enrichment methods are available [9,10]. However, in this article, we focus on the methods most commonly employed in diagnostic studies. The development of high-throughput PCR enrichment methods, such as those developed by RainDance Technologies and Fluidigm, has overcome the bottleneck of traditional PCR in targeting specific genes [19,39]. For example, the Fluidigm 48.48 Access Array™ Integrated Fluidic Circuit enables one to amplify 48 genomic regions (using 48 primer–probe sets) for 48 different samples simultaneously. As such, the preparation of a single sequencing-ready library comprising 48 samples that have been amplified and barcoded prior to pooling can be achieved in a single experiment. However, one limitation of this platform is the inflexibility of the sample size, because only one array design is available, on which 48 samples have to be run [102]. This inflexibility creates a logistical challenge for clinical applications when only a single sample requires genetic diagnosis or only a few family members require genetic screening. On the other hand, the RainDance Technologies Sequence Enrichment Assay is a microdroplet PCR-based technology capable of amplifying up to thousands of genomic loci per sample [103]. Although both platforms utilize a PCR-based amplification and enrichment strategy, the design of the RainDance Sequence Enrichment Assay is more flexible, as it allows different sample sizes to be processed at a time. This is particularly important for very rare disorders for which the sample number is likely to be small. Furthermore, use of the RainDance platform is advantageous in situations where hundreds of exons or genes need to be targeted and sequenced. The PCR-amplified products enriched by these platforms are then collected and prepared for library construction and sequencing.
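Pooling barcoded samples into a single library only works if each read can be reassigned to its sample after sequencing. The following is a minimal sketch of that demultiplexing idea. The barcode sequences, their position at the 5' end of each read, the exact-match rule and the function name `demultiplex` are all illustrative assumptions and do not describe the Fluidigm or RainDance protocols.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign pooled reads to samples by an exact-match barcode prefix.

    Assumes (hypothetically) that all barcodes have the same length and sit
    at the 5' end of each read; real kits define their own read layout.
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose prefix matches no known barcode are kept separately.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical barcodes for two of the samples pooled from one array.
barcodes = {"ACGT": "sample_01", "TGCA": "sample_02"}
pooled = ["ACGTGGATTC", "TGCACCTAGG", "ACGTTTTAAA", "NNNNGGGCCC"]
assignments = demultiplex(pooled, barcodes)
```

In practice, barcode sets are usually designed to tolerate one or more sequencing errors, so a production tool would likely allow mismatches rather than require an exact match.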
Both the Fluidigm and RainDance platforms have been applied to comprehensive mutation detection in 24 causal genes underlying congenital disorders of glycosylation. In that study, a set of primers was designed for all 215 exons (or 387 PCR amplicons) from these 24 genes [13]. However, only approximately 48% of the filtered sequence reads for both PCR-enrichment methods were ‘on-target’ – that is, mapped to the targeted regions. This also highlights the problem of sequencing redundant ‘off-target’ reads; this technical limitation should be addressed to improve cost–effectiveness because any redundant sequencing adds to the overall cost of the analysis. By comparison, approximately 76% of the unique reads mapped to the targeted regions in a study that sequenced seven genes known to cause autosomal recessive ataxia using custom NimbleGen sequence capture arrays, followed by Roche 454 GS FLX Titanium shotgun sequencing [16]. However, it should be noted that these two on-target percentages are not directly comparable. In general, PCR-based strategies would be expected to provide higher on-target rates than hybridization-based methods; this notwithstanding, the ratio of on-target to off-target reads is also influenced by other technical and analytical factors, as well as by the genomic sequences of the target genes. Hence, a direct comparison of PCR-based strategies
versus hybridization-based methods in terms of their on-target efficiencies should be performed in a single study.

Target probe hybridization enrichment
In addition to the PCR-based methods, desired genomic regions can also be enriched by ‘target probe’ hybridization, either by on-solid (microarray-based) or in-solution capture [40–42], for example, using enrichment kits available commercially from Agilent and NimbleGen. Instead of PCR primers, oligonucleotides are synthesized to target the regions of interest. Figure 1 shows a schematic illustration of library preparation using array-based and in-solution hybrid capture. The workflow involves multiple steps and is similar for the capture of both exome and custom or targeted regions. However, the difference between the two contexts lies with the different sets of probes required for the exome and for specific genes. The difference between on-solid and in-solution capture methods is that the oligonucleotide probes are tethered on microarrays or are suspended in solution (oligonucleotide probes attached to beads), respectively. The capture of the adapter-ligated DNA fragments is based on their complementarity with the oligonucleotide probe sequences. The size of the human exome is approximately 30 Mb; however, exome-enrichment kits have been expanded beyond the exome so as to include other regulatory elements [43]. For example, the Agilent SureSelect Human All Exon 50 Mb Kit covers coding exons annotated by the GENCODE project [104], as well as all exons annotated in the Consensus CDS (CCDS) and RefSeq databases. In addition, it also encompasses small noncoding RNAs from miRBase (v.13) and Rfam [105]. Similarly, the Illumina TruSeq™ Exome Enrichment Kit was designed to target a region of 62 Mb [106], which is more than double the size of the human exome. This exome enrichment kit contains >340,000 probes (95-mer), each constructed against the human NCBI37/hg19 reference genome. Thus, this probe set was designed to enrich >200,000 exons, spanning a total of 20,794 genes.
In addition to comprehensive coverage of the major exon databases, such as CCDS coding exons (31.3 Mb, hg19) and RefSeq (refGene) coding exons (33.2 Mb, hg19), this enrichment kit also provides broad coverage of noncoding DNA in exon-flanking regions (promoters and untranslated regions). Furthermore, 77.6% of the predicted microRNA targets (9.0 Mb, hg19) were also captured. In addition to the comparison of PCR-based enrichment methods for targeted sequencing, a proper empirical study to evaluate the performance of different exome-enrichment kits should also be conducted, although our review focuses on targeted sequencing rather than WES. This is because the different designs (in-solution and on-solid captures) of exome-enrichment kits from different companies often make the selection of which product to use difficult, even after consideration of their technical performance and cost. Therefore, recent studies that have evaluated the technical aspects of different exome-enrichment kits are worthy of discussion [6,44,45]. None of these methods has been shown to outperform the others, with each method having its strengths and limitations.
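Coverage figures of this kind, meaning the share of an annotation set that falls inside a kit's capture design, come down to an interval intersection. The sketch below illustrates the arithmetic on toy coordinates for a single chromosome. The function names and the intervals are hypothetical, and a real comparison would be run per chromosome on the vendors' published target files.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(annotation, targets):
    """Fraction of annotated bases that lie inside any capture target."""
    annotation = merge_intervals(annotation)
    targets = merge_intervals(targets)
    total = sum(end - start for start, end in annotation)
    covered = 0
    for a_start, a_end in annotation:
        for t_start, t_end in targets:
            overlap = min(a_end, t_end) - max(a_start, t_start)
            if overlap > 0:
                covered += overlap
    return covered / total if total else 0.0


# Toy coordinates (hypothetical): three annotated exons, two kit targets.
exons = [(100, 200), (300, 400), (500, 600)]
kit_targets = [(90, 210), (350, 450)]
fraction = covered_fraction(exons, kit_targets)  # 150 of 300 bases covered
```

Merging both interval sets first prevents bases from being counted twice when exons or probes overlap. The nested loop is quadratic, which is fine for a sketch but would be replaced by a sweep over sorted intervals for genome-scale files.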
The performance strengths and weaknesses of two solution exome-capture products (Agilent and NimbleGen) have recently been evaluated [44]. While both NimbleGen and Agilent have released updated versions of their solution exome-capture kits, which are based on the latest assembly of the human genome reference, hg19 (GRCh37), and target both RefSeq (67.0 Mb) and CCDS (31.1 Mb) annotations, some regions have still not been captured, which may create a need to order custom capture designs. For example, the NimbleGen version 2 exome-enrichment kit targets 9.8 Mb more genomic space (36.0 Mb in total) than version 1, and it was therefore predicted that version 2 would provide 99.2% coverage of CCDS (10% more than version 1). However, only 49.6% of RefSeq would be covered by version 2 [44]. Other parameters have also been assessed. Sulonen et al. compared solution-based exome-capture methods for NGS, and found that a larger percentage of the high-quality reads from the NimbleGen captures than from the Agilent captures aligned to the capture target regions [45]. Although sequences of the libraries prepared using the Agilent kits had fewer duplicated reads and their alignment to the human reference genome was equal to that of the NimbleGen kits, the latter had more high-quality reads and deeply covered base pairs (bp) in the regions actually targeted for sequence capture. While these studies compared exome-enrichment kits from NimbleGen and Agilent, others performed a comparison of three major commercial exome-sequencing platforms, which included Illumina [6]. The study found that the NimbleGen platform, the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to detect small variants with a high degree of sensitivity. Agilent and Illumina are able to detect a greater total number of variants with additional sequencing.
Illumina, however, captures untranslated regions, which are not targeted by the NimbleGen and Agilent platforms.

Comparison of the technical aspects
Every enrichment method has advantages and limitations. Therefore, several factors must be taken into account when adopting a particular method. The cost of the enrichment method is an important factor to consider when developing a diagnostic tool; however, this must be balanced by other technical aspects. The total genomic size to be enriched and the sample size also influence the selection of the enrichment method. For example, the custom-made enrichment kits from Agilent and NimbleGen are available in several formats, and are capable of capturing genomic sizes up to several Mb. On the other hand, enrichment by PCR has its limitations when a large genomic size or number of amplicons is encountered, but may be the right choice if only a few amplicons are of interest – for example, the Fluidigm 48.48 Access Array. However, this method has a much higher sample throughput and might therefore be more suitable for centralized laboratories with larger sample volumes. By contrast, approximately 24 samples can be multiplexed in a hybridization enrichment experiment, according to the manufacturer’s protocol [43]. On the other hand, the RainDance microdroplet-based PCR method demonstrated that enrichment with up to 3976 amplicons
[Figure 1 flowchart: Genomic DNA → Fragmentation, end-repair and adapter ligation → In-solution capturing or array-based capturing → Removal of unbound fragments → Multiplexing → Enriched library → Next-generation sequencing]
Figure 1. Schematic illustration of library preparation using array-based and in-solution hybrid selection or capture. An in vitro random/shotgun library is generated from genomic DNA through fragmentation. The fragment ends are repaired and ligated with common adapters flanking each fragment. The library (a collection of adapter-ligated fragments) is hybridized to oligonucleotide probes tethered onto a high-density exome-capture microarray or custom-made microarray for targeted genomic regions. The difference between exome capture and custom or targeted genomic-region capture lies with the probes tethered on the microarray. The difference between on-solid and in-solution capture methods is that the oligonucleotide probes are tethered on microarray or are suspended in solution, respectively. The capture of the adapter-ligated fragments is based on their complementarity with the sequences of oligonucleotide probes. After hybridization, unbound fragments are removed by washing, followed by elution of specifically hybridized fragments. The enriched fragment pool is amplified by PCR. Subsequently, the success of the enrichment is checked by quantitative PCR. Finally, the end product is a sequencing library enriched for target regions, which is then sequenced by high-throughput sequencing. Reprinted with permission from [77] .
is technically feasible [19] . RainDance has expanded the capabilities of its sequence-enrichment solution to enable the sequencing of up to 20,000 PCR targets. Indeed, a recent study also used the RainDance Technologies expanded content library to enrich the human X-chromosome exome (2.5 Mb) from 26 samples followed by massively parallel sequencing. The multiplex primer library covered 98.05% of the human X chromosomal exome in a single tube with 11,845 different PCR amplicons [46] . The demonstration that the RainDance methodology, coupled with NGS, can efficiently enrich and enable the routine sequencing of the entire exome of the human X chromosome, has important clinical implications for the diagnosis of X-linked disorders. This has been demonstrated by a large-scale resequencing screen of X-chromosome exons in mental retardation [47] . By contrast, others have chosen the Agilent in-solution enrichment method followed by massively parallel sequencing of exons on the X chromosome in talipes equinovarus, atrial septal defect, Pierre Robin sequence and persistent left superior vena cava syndrome [48] . Furthermore, the extent of incomplete capture of the targeted genomic regions and sequencing of the off-target reads must also be considered. The former may result in the need for additional tests, such as using conventional PCR-Sanger sequencing for the missing regions. The latter will reduce the cost–effectiveness of the analysis. For example, in a targeted sequencing study of 47 genes for cardiomyopathies, 3% of the regions of interest could not be covered, although 91% were covered with at least ten reads, which allows reliable variant detection on the SOLiD platform [49] . Similarly, only a small proportion of regions were covered ‘inadequately’ as reported by others, where only seven coding exons were covered less than tenfold (i.e., 2.6% of all coding exons) after the capture and sequencing of 279 exons in seven genes underlying ataxia [16] . 
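Statistics such as the proportion of target bases covered by at least ten reads, or left without any coverage, are simple summaries of per-base depth. A minimal sketch follows, assuming the per-base depths over the targeted regions have already been extracted into a list. The function name `coverage_summary` and the toy depths are hypothetical, and the tenfold default simply mirrors the threshold quoted above.

```python
def coverage_summary(depths, min_depth=10):
    """Summarise per-base sequencing depth across the targeted regions."""
    n = len(depths)
    if n == 0:
        raise ValueError("no target bases supplied")
    return {
        "mean_depth": sum(depths) / n,
        # Share of target bases deep enough for variant calling.
        "fraction_at_min_depth": sum(d >= min_depth for d in depths) / n,
        # Share of target bases with no reads at all.
        "fraction_uncovered": sum(d == 0 for d in depths) / n,
    }


# Toy per-base depths for ten target positions (hypothetical values).
depths = [0, 0, 4, 9, 10, 12, 25, 30, 41, 45]
summary = coverage_summary(depths)
```

Reporting the fraction of bases at or above the threshold alongside the mean is a deliberate choice: a high mean depth can conceal exons that are poorly covered or missed entirely, and those are the regions that would need follow-up by PCR-Sanger sequencing.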
In addition, uneven coverage across the genes was also reported, where individual disease genes showed an approximately twofold variation in coverage of the coding sequence, ranging from 17-fold to 45-fold [16]. Uneven sequence enrichment is also potentially problematic; for example, GC-rich sequences can be difficult to capture, which would lead to uneven sequencing coverage across the targeted regions. As a result, a higher overall sequencing depth is needed to ensure that those ‘poorly covered regions’ achieve the minimum coverage for accurate variant detection. In the worst-case scenario, these GC-rich regions would not be captured at all. This is clearly demonstrated by the findings of Hoischen et al., where two exons without any coverage contained a very high GC content (76.1 and 63.6%, respectively) compared with the average GC content of 37.6% for the 50 best covered exons in the ataxia genes [16]. However, it is noteworthy that ‘GC bias’ in sequencing is attributed to a combination of sequence capture, PCR and sequencing bias by different platforms, rather than the capture alone. Finally, uneven coverage also adversely affects our ability to detect large deleted and/or duplicated regions by quantitative evaluation of depth.

Medium-throughput sequencing instruments
Targeted sequencing requires a combination of sequence enrichment and NGS. Although multiple enrichment methods are
available, the current NGS technologies, especially the Illumina HiSeq and Life Technologies SOLiD sequencing platforms, generate hundreds of Gb of sequencing data, leading to ‘over-sequencing’ (i.e., sequencing to a redundant depth) for targeted sequencing of a few genes after sample barcoding. This is true even when only part of the flowcell (e.g., a single lane out of the eight lanes per flowcell) is used. Although the Roche 454 GS FLX has a much lower throughput per instrument run (i.e., 500 Mb) compared with the other two NGS technologies, it is still in excess of requirements when a single patient sample with a total targeted size of tens of kb or several hundred kb (a reasonable estimate for the collection of all the exons for multiple selected genes) is required for genetic diagnosis. Therefore, these high-throughput NGS technologies are not the platforms of choice for the clinical diagnostic laboratory. As demonstrated, excess coverage was achieved using targeted PCR-based enrichment methods (RainDance and Fluidigm) and SOLiD™ sequencing of 24 known disease genes with an average coverage of >400× per base over the entire gene set for the congenital disorders of glycosylation [13]. This ‘over-sequencing’, which would not increase the accuracy of variant detection, adds unnecessarily to the cost of the analysis. On the other hand, the three medium-throughput sequencing instruments have throughputs ranging from 10 Mb to >1 Gb. For example, three different Ion Torrent sequencing chips are available with minimum throughputs of >10 Mb, >100 Mb and >1 Gb per chip, respectively, and with a read length ranging from 100 to 200 bp. Similarly, Illumina MiSeq is expected to generate sequencing data volumes ranging from >120 Mb to >1 Gb, depending upon the read length (which ranges from 35 to 150 bp) and whether it is a single-end or paired-end library.
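The mismatch between instrument throughput and diagnostic need can be made concrete with a back-of-the-envelope depth estimate: usable bases divided by the total target size across samples. The sketch below is illustrative only. The 300 kb target size is an assumed example of "several hundred kb", the on-target fraction of 0.48 borrows the figure reported for the PCR-based enrichment study [13], and the function name `expected_mean_depth` is hypothetical.

```python
def expected_mean_depth(throughput_bases, target_bases,
                        n_samples=1, on_target_fraction=1.0):
    """Rough mean per-base depth: on-target bases / (target size x samples)."""
    usable_bases = throughput_bases * on_target_fraction
    return usable_bases / (target_bases * n_samples)


MB = 1_000_000
target_size = 300_000  # assumed gene-panel size ("several hundred kb")

# Minimum throughputs per run or chip as quoted in the text.
runs = {
    "Ion Torrent chip (>10 Mb)": 10 * MB,
    "Ion Torrent chip (>100 Mb)": 100 * MB,
    "Ion Torrent chip (>1 Gb)": 1000 * MB,
    "Roche 454 GS FLX (500 Mb)": 500 * MB,
}
depth_by_run = {
    name: expected_mean_depth(bases, target_size, on_target_fraction=0.48)
    for name, bases in runs.items()
}
```

Under these assumptions the smallest chip would already give roughly 16-fold mean depth for a single sample, whereas the larger runs reach several hundred-fold or more, which is only worthwhile if many barcoded samples share the run. The estimate ignores read filtering, duplicates and uneven coverage, all of which lower the depth actually achieved.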
By contrast, the 454 GS Junior has a much lower throughput (>35 Mb) per instrument run but has a longer read length of 400 bp on average, compared with the other two comparable platforms. These varying throughputs provide multiple options to clinical diagnostic laboratories based on their sequencing turnaround time or sample volumes, for example, an individual hospital-based laboratory versus a centralized, state or national level diagnostic laboratory. Table 1 summarizes the technological aspects of the three medium-throughput sequencing platforms. These medium-throughput sequencing instruments are more accessible to clinical diagnostic laboratories. For example, Walsh et al. designed oligonucleotides to cover coding regions, noncoding intronic sequences and 10-kb genomic sequences flanking each of the 21 genes responsible for an inherited risk of breast and ovarian cancers. The total DNA targeted was approximately 1 Mb after repetitive DNA elements were masked [20]. Thus, this amount of sequencing can be easily performed by the 454 GS Junior with a minimum throughput of 35 Mb per run for a single patient sample; hence, an average coverage depth of 20–30-fold is expected (after filtering of sequence reads). This sequencing depth is deemed sufficient for the accurate calling of germline variants. Alternatively, for multiple samples, other platforms can be considered, such as the Illumina MiSeq paired-end sequencing (2 × 100 bp), which produces >680 Mb. The
Table 1. A summary and comparison of the technological features of bench-top next-generation sequencing platforms.

| Feature | Life Technologies Ion Torrent™ PGM | Illumina MiSeq™ System | Roche 454 GS Junior |
|---|---|---|---|
| Technical & throughput aspects | | | |
| Detection of nucleotide incorporation | Release of hydrogen ion and pH changes | Emission of fluorescent light | Emission of chemiluminescent light (pyrosequencing) |
| Sequencing of homopolymer regions | More accurate | More accurate | Less accurate, prone to indel errors in homopolymeric regions with >6 identical nucleotides |
| Throughput and sequencing time | Three different Ion Sequencing Chips with different throughputs, i.e., Chip 314: >10 Mb; Chip 316: >100 Mb; Chip 318: >1 Gb | The throughputs depend upon whether single-end or paired-end sequencing libraries are used, with different sequencing times. Single-end sequencing (1 × 35 bp): >120 Mb; paired-end sequencing (2 × 100 bp): >680 Mb; paired-end sequencing (2 × 150 bp): >1 Gb (27 h) | >35 Mb per run |
| Read length | 100–200 bp. In 2012, the read length will be increased to >400 bp | 35–150 bp | 400 bp. In 2012, the read length will be increased to >800 bp |
| Number of sequence reads per chip (Ion Torrent)/flowcell (MiSeq)/picotiter plate (454 GS Junior) | Chip314 (>1 million wells); Chip316 (>6 million wells); Chip318 (>11 million wells). The number of reads is approximately 30–40% of the available wells for each chip | >3.4 million single-end reads or >6.8 million paired-end reads | >100,000 sequence reads per run (shotgun sequencing); >70,000 sequence reads per run (amplicon sequencing) |
| Indexing or barcoding of samples | Yes | Yes | Yes |
| Single-end sequencing | Yes | Yes | Yes |
| Paired-end sequencing | Yes | Yes | Yes (mate-pair sequencing) |
| Base accuracy | >99.5% raw accuracy | >75 to >95% bases higher than Q30, depending upon whether it is single-end or paired-end sequencing and the sequence read length (Q30 accuracy = 99.9%) | Q20 read length of 400 bases (i.e., 99% accuracy at 400 bases) |
| Sample preparations | | | |
| Sequencing library preparation | Automated by Ion One Touch System (i.e., loading, clonal amplification and sample recovery) | Automated by cluster-generation device | No automation |
| Level of technical difficulty in sequencing library preparation (low, moderate or high) | Moderate, because DNA loading during the emulsion PCR, size of the emulsion PCR product and uniformity of DNA loading onto the Ion Torrent Chip have to be optimized and tightly controlled | Low, because cluster generation on solid surface is more robust | Moderate, because DNA loading during the emulsion PCR, size of the emulsion PCR product and uniformity of DNA loading onto the picotiter plate have to be optimized and tightly controlled |

Note: The throughput and quality statistics of the bench-top sequencing platforms summarized in this table (and discussed in the review) are, for the most part, based upon information provided by the vendors or manufacturers. As the information is being updated frequently, readers are encouraged to refer to the vendors’ websites for the latest information.
bp: Base pair; GS: Genome Sequencer; indel: Insertion/deletion; PGM: Personal Genome Machine.
Table 1. A summary and comparison of technological features of bench-top next-generation sequencing platforms (cont.).

| Feature | Life Technologies Ion Torrent™ PGM | Illumina MiSeq™ System | Roche 454 GS Junior |
|---|---|---|---|
| Applications | | | |
| Diagnostic application (e.g., sequencing of known cancer genes or causal genes for Mendelian disorders) | Yes | Yes | Yes |
| Targeted sequencing application | Yes | Yes | Yes |
| Exome sequencing | No | No | No |
| Whole-genome sequencing of small genomes, such as a bacterial genome | Yes | Yes | Yes |
| ChIP-Seq application | No | No | No |
| miRNA or mRNA sequencing | No | No | No |
availability of different throughputs from Ion Torrent PGM and Illumina MiSeq (>10 Mb to >1 Gb) coupled with sample barcoding offers flexibility for different diagnostic or screening tests that vary with respect to both genomic size and sample volumes. However, the major application of these bench-top sequencing machines is in the targeted sequencing of candidate genes, as for example in a diagnostic test. Indeed, none of them are appropriate for other applications such as ChIP-Seq ( 20 million reads) and RNA-seq (100–200 million reads) in the human genome, owing to the paucity of sequence reads. They are also unsuitable for WES of the human genome, which requires several Gb of sequencing data. Thus, these applications are best performed on the high-throughput NGS platforms. The 454 GS Junior and Illumina MiSeq are based on the well-established sequencing chemistries of pyrosequencing and reversible terminator sequencing, respectively. These sequencing technologies have been well described in previous reviews [1,50] and are therefore only briefly summarized in this article.
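The relationship between run throughput, target size and sample barcoding discussed above can be sketched as a simple calculation. The following is a minimal illustration only: the function names and the `usable_fraction` parameter (the share of bases retained after read filtering) are assumptions introduced here, and the throughput figures are the vendor-quoted minimums cited in the text.

```python
def expected_coverage(throughput_bp, target_bp, n_samples=1, usable_fraction=1.0):
    """Mean fold coverage per sample when one run is shared evenly
    between n_samples barcoded samples (idealized, uniform capture)."""
    if target_bp <= 0 or n_samples <= 0:
        raise ValueError("target_bp and n_samples must be positive")
    return throughput_bp * usable_fraction / (target_bp * n_samples)


def max_samples(throughput_bp, target_bp, min_coverage, usable_fraction=1.0):
    """Largest number of barcoded samples per run that still reaches
    min_coverage-fold mean coverage of the target."""
    return int(throughput_bp * usable_fraction // (target_bp * min_coverage))


# 454 GS Junior (35 Mb per run), approximately 1 Mb target, one patient sample.
raw = expected_coverage(35_000_000, 1_000_000)
# usable_fraction=0.7 is a hypothetical filtering loss, not a value from the text.
filtered = expected_coverage(35_000_000, 1_000_000, usable_fraction=0.7)

# MiSeq 2 x 100 bp (680 Mb per run), same target, 30-fold minimum coverage.
samples = max_samples(680_000_000, 1_000_000, 30)
print(raw, round(filtered, 1), samples)
```

Under these assumptions a single GS Junior run gives 35-fold raw coverage of a 1-Mb target, which falls into the 20–30-fold range once a plausible fraction of reads is filtered out, whereas a MiSeq paired-end run could in principle accommodate roughly 20 barcoded samples at 30-fold coverage. Real capture experiments are uneven, so these figures are upper bounds on what would be achieved in practice.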
454 GS Junior

The addition of deoxynucleotide triphosphates and reagents for repeated cycles of sequencing (cycle sequencing) is controlled during pyrosequencing. Each type of nucleotide flows through the picotiter plate one at a time, sequentially per cycle of sequencing, followed by a different nucleotide in the next cycle and so on. The incorporation of the complementary nucleotides into the DNA templates results in the release of inorganic pyrophosphate, thereby triggering a series of downstream chemiluminescent reactions. The intensity of the chemiluminescent light emitted from each well corresponding to a single DNA template is recorded by the detection system and is proportional to the number of nucleotides incorporated into the DNA template. Several nucleotides can be incorporated in a single flow when there are consecutive identical nucleotides in the sequence; for this reason, pyrosequencing is more susceptible to insertion/deletion (indel) errors in homopolymer sequences longer than six bases. This is in contrast to reversible terminator chemistry sequencing where only one reversible terminator nucleotide is incorporated into the DNA templates per cycle of sequencing [51,52]. Using these pyrosequencing reactions, the 454 GS Junior is capable of generating >100,000 sequence reads per run by shotgun sequencing or >70,000 sequence reads per run by amplicon sequencing. Owing to its long read length (>400 bp) and the throughput of 35 Mb per run, the 454 GS Junior is more suitable for targeted sequencing than for WES studies.

Illumina MiSeq
All four types of reversible-terminator nucleotides (3´-blocked deoxynucleoside triphosphates; dNTPs) and sequencing reagents are added onto the flowcell in each cycle of reversible-terminator chemistry sequencing, and each of the four dNTPs is fluorescently labeled with a different color. A single flowcell (for Illumina GA and HiSeq) has several hundred million clusters, with each cluster containing clonally amplified copies from a single DNA template. The reversible-terminator nucleotides allow for the synthesis of DNA templates in the following cycle of sequencing after removal of the blocking group. This approach therefore allows the incorporation of one complementary dNTP at a time into the DNA template, followed by washing steps to remove the excess sequencing reagents. The fluorescent signals are then imaged across the whole flowcell. After imaging, the fluorescent labels are removed, together with the 3´ blocking group of the dNTPs [53]. Using this series of reversible-terminator sequencing reactions, the MiSeq System is expected to generate >3.4 million single-end reads or >6.8 million paired-end reads. The sequencing throughputs vary in terms of sequencing library and sequence read lengths, with three different throughputs expected per run – that is, >120 Mb (single-end sequencing with 35 bp read length), >680 Mb (paired-end sequencing with 100 bp read length) and >1 Gb (paired-end sequencing with 150 bp read length).

Ion Torrent PGM
By contrast, Ion Torrent sequencing represents a new technology. It is considered to be the world’s first ‘post-light’ sequencing technology because it does not rely on light emission to detect nucleotide incorporation during cyclic sequencing [23]. Other sequencing technologies are reliant on either fluorescent emission (Illumina GA/HiSeq/MiSeq and ABI SOLiD) or chemiluminescent light emission (i.e., pyrosequencing chemistry used by the Roche 454 sequencing platforms). The Ion Torrent sequencing platform comprises a PGM sequencer and semiconductor sequencing chip, which is a high-density array of wells (or micromachined wells) to perform the sequencing process or nucleotide incorporation in a massively parallel manner. Each well holds a different DNA template and beneath the wells is an ion-sensitive layer overlaying a proprietary ion sensor for each well. The Ion Torrent PGM sequencer sequentially provides the chip with one type of nucleotide after another. When a nucleotide is incorporated into a DNA template by a DNA polymerase, a hydrogen ion is released, which causes a voltage change. If the nucleotide supplied does not match the template, no voltage change is recorded and no base is called. Conversely, if there are two consecutive identical bases on the DNA template, the voltage change is doubled and the chip records two identical bases. The number of sequence reads generated by Ion Torrent PGM depends on the number of wells per chip and the proportion of wells loaded with beads attached to DNA fragments. The number of wells is >1 million (chip314), >6 million (chip316) and >11 million (chip318), and the loading protocols usually fill approximately 30–40% of the available wells. Therefore, for example, the number of reads produced by chip314 is approximately 0.3–0.4 million.
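The flow-based principle shared by Ion Torrent sequencing and pyrosequencing, in which the signal in each nucleotide flow scales with the number of identical bases incorporated, can be illustrated with an idealized simulation. This is a sketch only: the flow order `"TACG"` is an arbitrary assumption, the signal is noise-free, and for simplicity the input string is the sequence of the strand being synthesized rather than the template strand.

```python
def flow_signals(read_sequence, flow_order="TACG", n_flows=16):
    """Idealized flow signals: for each nucleotide flow, the number of
    consecutive bases incorporated (0 when the flowed base does not match)."""
    signals = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # A homopolymer stretch is incorporated entirely within one flow,
        # so the signal is proportional to the length of the stretch.
        while pos < len(read_sequence) and read_sequence[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def call_bases(signals):
    """Convert flow signals back to bases by rounding each signal
    to the nearest whole number of incorporations."""
    return "".join(base * round(level) for base, level in signals)


signals = flow_signals("TTAGC", n_flows=8)
print(signals)
print(call_bases(signals))
```

In this toy example the first flow returns a doubled signal for the two consecutive T bases. With real, noisy signals the distinction between, for example, six and seven incorporations becomes progressively harder to resolve, which is the origin of the homopolymer-associated indel errors described above.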
As such, the sequencing throughput per chip is dependent on these factors and the minimum throughput is estimated to be from >10 Mb (chip314) to >100 Mb (chip316) and >1 Gb (chip318). The Ion Torrent platform achieves the fastest sequencing time per chip or per instrument run (i.e., less than 2 h) compared with the other two competing platforms. The shortest sequencing time is achieved because nucleotide incorporation is directly detected through pH and voltage changes; as such, each nucleotide incorporation is recorded in seconds. However, this sequencing time excludes the time needed for library preparation [23]. The time to diagnosis is an important factor to consider, particularly when the accurate and timely molecular diagnosis of a disorder can result in dramatic improvements in patient care, as discussed earlier.

Comparison of the technical aspects
In summary, several features of Ion Torrent sequencing chemistry resemble pyrosequencing – namely, the amplification of adapter-ligated DNA fragments is based on emulsion PCR (not discussed here); only one type of unlabeled nucleotide is present per cycle of sequencing; and the number of nucleotides incorporated is proportional to the voltage change (Ion Torrent sequencing) or the intensity of chemiluminescent light (pyrosequencing). Ion Torrent sequencing and the 454 GS Junior are based on emulsion PCR; as such, they have a greater level of technical difficulty in sequencing library preparation than MiSeq, which relies on cluster generation. This is because DNA loading during the emulsion PCR, the size of the emulsion PCR product and the uniformity of DNA loading onto the chip or plate have to be optimized and tightly controlled to obtain the maximal sequencing output. For example, the ratio of DNA to beads has to be optimized to maximize the generation of monoclonal particles/beads – that is, clonal amplification from a single DNA template attached to a single bead. By contrast, polyclonal particles, which result from the clonal amplification of more than one DNA template attached to the same single bead during emulsion PCR, are wasteful. Furthermore, the sequencing throughput for Ion Torrent PGM and 454 GS Junior is also dependent on the DNA loading (after emulsion PCR) into the wells on the chip or plate. Higher throughput would be expected if the DNA-loading step is optimized to increase the proportion of the wells occupied by beads. Compared with other platforms, the strength of the Roche 454 system lies in its longer sequence reads.

Detection of point mutations, small indels & large deletions & duplications
Various studies have demonstrated the accuracy of detecting point mutations in samples with known mutations in the targeted genes responsible for the diseases [12,13,20,22]. This enrichment of targeted genes has been performed using both PCR-based methods and custom-made oligonucleotide on-array and in-solution enrichment methods, and all three NGS platforms (Illumina, Life Technologies and Roche 454) have been evaluated. For example, RainDance and Fluidigm PCR-isolation platforms have been used to enrich 24 genes known to underlie congenital disorders of glycosylation, and these genes were sequenced by the SOLiD platform. This proof-of-principle study demonstrated that the disease-causing mutations could be identified for all 12 positive controls comprising point mutations and small indels [13]. However, a less encouraging result was obtained for primary ciliary dyskinesia because not all the known mutations were successfully identified [12]. In comparison to the earlier study [13], a different enrichment and sequencing method was adopted for the primary ciliary dyskinesia analysis. This study utilized a custom-designed array to capture 2089 exons from 79 genes associated with primary ciliary
dyskinesia or ciliary function. The method was tested in four individuals harboring a number of previously identified primary ciliary dyskinesia mutations; it successfully identified three out of three substitution mutations and one out of three small-indel mutations (i.e., a 21-bp deletion). One small deletion mutation (4 bp) was observed after bioinformatic adjustment. However, the method failed to detect one single-nucleotide insertion and a whole-exon deletion also went undetected [12]. The failure to detect the single-nucleotide insertion might be attributed to the inherent difficulty in detecting small indels, a frequent type of sequencing error by the pyrosequencing platform that was employed in this study [50,54]. By contrast, all point mutations, small indel mutations (ranging from 1 to 19 bp), and large genomic duplications and deletions (ranging from 160 to 101,013 bp) were detected in 20 women diagnosed with breast or ovarian cancer harboring a known mutation in one of the genes responsible for inherited predisposition to these diseases [20]. This study designed custom oligonucleotides in solution to target the complete genomic sequences of 21 genes responsible for the inherited risk of breast and ovarian cancers, with the sequencing being run on an Illumina Genome Analyzer IIx. The large deletions and duplications were detected by comparison of the number of sequence reads at each bp for each sample with all the other samples in the experiment. A deviation from diploidy was defined as a site where the test sample yielded 140% of the average number of reads of the other samples in the experiment. This read-depth strategy successfully identified the large deletions and duplications present in the samples, which were in complete concordance with the multiplex ligation-dependent probe amplification assay.
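The read-depth comparison described above can be sketched as follows. This is a simplified illustration rather than the pipeline used in the cited study: the function names are hypothetical, read counts are assumed to have been normalized for the total number of reads per sample beforehand, the `gain` threshold of 1.4 mirrors the 140% figure quoted in the text, and the `loss` threshold of 0.6 is a symmetric value assumed here for the purpose of the example.

```python
import math


def depth_ratios(depths, test_sample):
    """Ratio of the test sample's read count to the mean read count of all
    other samples, site by site. `depths` maps sample name -> list of
    (already normalized) read counts over the same ordered sites."""
    others = [name for name in depths if name != test_sample]
    if not others:
        raise ValueError("at least one comparison sample is required")
    ratios = []
    for i, count in enumerate(depths[test_sample]):
        mean_other = sum(depths[name][i] for name in others) / len(others)
        ratios.append(count / mean_other if mean_other > 0 else math.nan)
    return ratios


def flag_sites(ratios, gain=1.4, loss=0.6):
    """Label each site as a putative duplication, deletion or normal.
    The loss threshold is an assumption, not a value from the text."""
    calls = []
    for r in ratios:
        if math.isnan(r):
            calls.append("no_data")
        elif r >= gain:
            calls.append("dup")
        elif r <= loss:
            calls.append("del")
        else:
            calls.append("normal")
    return calls


depths = {
    "patient": [100, 50, 150, 100],
    "control_1": [100, 100, 100, 100],
    "control_2": [100, 100, 100, 100],
}
print(flag_sites(depth_ratios(depths, "patient")))
```

A heterozygous deletion is expected to halve the read depth and a heterozygous duplication to raise it by half, so the toy patient above is flagged at the second and third sites. In practice, calls would be made over runs of consecutive sites rather than single positions, in order to suppress noise arising from uneven capture.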
Similarly, Lim et al., who assessed the application of NGS as a genetic diagnostic tool in 25 patients with Duchenne muscular dystrophy or Becker muscular dystrophy, accurately predicted the deleted or duplicated exons in the nine patients with known mutations by using the other 16 patients (without the deletions and duplications) as the standard [55]. The development of more powerful analytical and statistical tools will further enhance the ability to detect copy-number variants from short sequence reads or NGS data [56–59]. In addition to the detection of point mutations (or single nucleotide variants) and small indels, the ability of a diagnostic tool to detect larger deleted and duplicated regions is also important because causal genes for some disorders harbor copy-number variants [20,55]. Development of a tool to accurately and cost-effectively detect various genetic aberrations is important for its adoption in a clinical setting; it follows that the nonavailability of such a comprehensive tool might create the need to order several diagnostic tests for patients. A notable example to illustrate this limitation is the genetic screening of inherited mutations in BRCA1/BRCA2 for familial breast and ovarian cancers. Genetic testing for BRCA1/BRCA2 mutations has been integrated into clinical practice for women with family histories of these cancers, for both newly diagnosed cases and clinically asymptomatic individuals, because of the high penetrance of the inherited mutations in causing the cancers [20]. However, a separate diagnostic test has been offered in order to detect large
exonic deletions and duplications that are undetectable by PCR–Sanger sequencing (BRACAnalysis® Technical Specifications [107]). The aforementioned proof-of-principle studies, although employing a targeted sequencing approach, used conventional high-throughput NGS platforms. It was not until recently that the performance of medium-throughput NGS instruments was tested in this context. The 454 GS Junior was used to sequence amplicons for the COL4A3, COL4A4 and COL4A5 genes, amplified using a strategy based on the locus-specific amplification of genomic DNA [22]. Alport syndrome was used as the disease model and the approach was tested on three patients; two patients had a confident diagnosis of the disease, whereas the third patient had an uncertain diagnosis. The success of the application was demonstrated by identifying the previously undetected second mutation for the two patients with a confident diagnosis; the diagnosis of Alport syndrome in the third patient had to be reconsidered, as the sequencing did not identify any pathogenic mutation, only benign polymorphisms.

The accuracy of mutation detection
These proof-of-principle studies have provided practical guidelines for accurate variant detection. Various technical aspects that can affect the accuracy of mutation detection have been investigated. For example, in the Alport syndrome study that sequenced three genes with 454 GS Junior, Sanger sequencing (as the gold-standard method) revealed that variants detected in 80%). Therefore, the percentage threshold may be a useful parameter to filter probable false-positive results. In addition, the study also found that a small fraction of highly unbalanced data – that is, one variant detected in a significant percentage (26–96%) of one sequence strand, but in a very low percentage (0–5%) of the other strand – should be considered to be technical artifacts [22]. This information will be useful in future studies in order to identify the sources of false positives and to optimize the data-analysis pipeline to achieve the required clinical-grade accuracy and specificity. Detecting heterozygous changes is more challenging than detecting homozygous changes. The proportion of reads calling the mutant allele of a heterozygous mutation is in the range of 56–77%, compared with homozygous changes where all the reads should confirm the mutation [16]. An adequate sequencing depth is critical for the accurate calling of heterozygous mutations. A clear correlation between sequence coverage depth and genotyping accuracy (using the SNP genotyping array) was shown, where a minimum coverage of tenfold resulted in 0.94% conflicts for heterozygous calls and 0.12% for homozygous calls. These conflict rates improved to 0.55 and 0.10%, respectively, for a minimum 15-fold coverage. Thus, 15-fold sequence coverage (comparable to a PHRED quality score of 20) is recommended for the reliable detection of heterozygous mutations [16].
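The filtering criteria discussed above (a minimum variant-read percentage, strand balance and a minimum coverage depth) lend themselves to a small illustrative filter. This is a sketch under stated assumptions, not the pipeline of the cited studies: the function names are hypothetical, the strand-imbalance bounds (at least 26% on one strand and at most 5% on the other) and the 15-fold depth are taken from the figures quoted in the text, and the default `min_fraction` of 0.2 is a placeholder chosen purely for illustration.

```python
def variant_fraction(alt_reads, total_reads):
    """Fraction of reads supporting the variant (0.0 when there is no coverage)."""
    return alt_reads / total_reads if total_reads else 0.0


def is_strand_unbalanced(fwd_alt, fwd_total, rev_alt, rev_total,
                         high=0.26, low=0.05):
    """True when the variant is well supported on one strand but nearly
    absent on the other, the pattern treated as a technical artifact.
    A strand without any reads counts as 'nearly absent' in this sketch."""
    fwd = variant_fraction(fwd_alt, fwd_total)
    rev = variant_fraction(rev_alt, rev_total)
    return (fwd >= high and rev <= low) or (rev >= high and fwd <= low)


def passes_filters(fwd_alt, fwd_total, rev_alt, rev_total,
                   min_depth=15, min_fraction=0.2):
    """Keep a candidate variant only if depth, overall variant fraction and
    strand balance are all acceptable. min_fraction is a placeholder value."""
    depth = fwd_total + rev_total
    if depth < min_depth:
        return False
    if variant_fraction(fwd_alt + rev_alt, depth) < min_fraction:
        return False
    return not is_strand_unbalanced(fwd_alt, fwd_total, rev_alt, rev_total)


# Balanced heterozygous call: 25 of 50 reads carry the variant on both strands combined.
print(passes_filters(fwd_alt=12, fwd_total=25, rev_alt=13, rev_total=25))
# Strand artifact: 60% of forward reads but only 2% of reverse reads.
print(passes_filters(fwd_alt=30, fwd_total=50, rev_alt=1, rev_total=50))
```

The thresholds are exposed as parameters because appropriate values depend on the platform, the enrichment method and the validation data available to each laboratory.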
Although technical improvements will be necessary to increase sensitivity and specificity and to enhance the robustness of the data analysis, taken together, these initial proof-of-concept studies have demonstrated the potential of the targeted sequencing approach as applied to genetic diagnostic and screening tests for germline mutations. The challenges of minimizing the false-positive and false-negative rates are greater for somatic mutations because of the frequent contamination of tumor tissues with noncancerous cells, and the genetic heterogeneity between the tumor cells (i.e., multiple subclones with different mutational profiles), which both dilute the signal required to detect somatic mutations [60,61]. These challenges are encountered in the sequencing of primary tumor tissues, irrespective of the sequencing approaches [62–64]. For example, for a given germline heterozygous variant, one might expect to observe 50% of the reads mapping to the locus carrying the mutant allele and the other half of the reads containing the wild-type allele. However, in practice this is not the case, owing to uneven capture and sequencing. As such, the proportion of sequence reads was in the range of 56–77% for heterozygous variants [16]. This is further complicated if somatic mutations are to be detected in tumor tissues with varying degrees of purity. In a hypothetical example, if the ‘tumor tissue’ comprises tumor cells and nontumor cells in a 50/50 proportion, and assuming that all the tumor cells contained the heterozygous variant, then it might be expected that only 25% of the reads mapping to the locus would carry the mutant allele. Therefore, the targeted sequencing approach is particularly useful in this situation as the average coverage per base can be many times higher than WES and WGS.

Current perspectives & conclusions
There are several important questions to pose when deciding which sequencing-based approach is most appropriate for diagnostic testing. First, one must know whether the results generated from a clinical diagnostic test can be used for further research investigations – that is, given that the ethical issues and concerns have been addressed adequately. Should this be the case, WES and WGS have advantages over targeted sequencing in identifying new causal mutations and genes. This is applicable to disorders that have not been completely accounted for by the known causal genes. For example, approximately 70% of CMT cases are still not accounted for by any known CMT gene [24]. Therefore, newly identified patients with a clinical diagnosis of CMT may not harbor causal mutations in any of the known CMT genes. Thus, if WES were applied as a diagnostic tool, it would have the potential to identify new putative causal mutations and genes that can be validated by further research studies in additional cases or functional investigations to prove their causality. WES and WGS are powerful information-generating tools, and hence give rise to several common ethical concerns. For example, should the potentially irrelevant results – that is, detection of mutations in other known causal genes predisposing to other disorders and detection of variants with unknown clinical significance – be disclosed to patients? If WES or WGS were applied as a diagnostic tool, should the cost of the test be subsidized partially by research funding if the results are intended to be used for research purposes, irrespective of whether the diagnostic test generates any interesting results that would need to be further pursued by research studies?
Second is the logistical challenge. WES and WGS represent a ‘universal’ diagnostic tool for all genetic disorders accounted for by the coding region mutations. Such approaches obviate the logistical challenge to a diagnostic laboratory of performing a large number of diagnostic tests specific to each disorder. Third is cost–effectiveness. Targeted sequencing and WES are more cost effective than WGS, and this will remain the case even when the ‘US$1000 genome’ becomes a reality. The cost of WGS is not limited to the sequencing, but includes the costs incurred by data storage and analysis. Furthermore, to date, all the novel discoveries made by the application of WGS could also have been achieved by WES. However, it is still uncertain to what extent novel discoveries could only be made by WGS, particularly in the context of the identification of modifiers interacting with causal mutations, a common source of clinical or phenotypic heterogeneity. Since approximately 85% of the known causal mutations fall within the coding regions of their cognate genes, it is also reasonable to postulate that the use of WGS will only result in the discovery of the remaining 15% of causal mutations within the noncoding regions. However, in the context of a clinical diagnostic setting, targeted sequencing and WES are more likely to be adopted in the near future with further technical improvements. By contrast, WGS still presents substantial challenges [65,66] , ranging from technical and analytical issues, to ethical concerns and affordability, because the ‘real’ cost of sequencing is higher than might be expected owing to the costs incurred by data analysis [37,67–69] . 
Finally, although NGS information has been used to make a diagnosis that was used in clinical management, it should be emphasized that no official clinical diagnostic test based on NGS has been approved so far by regulatory bodies such as the US FDA in heavily regulated clinical settings, for example, as required by Clinical Laboratory Improvement Amendments.

Expert commentary
Although sequence enrichment coupled with NGS in the form of targeted sequencing and WES have shown their potential as diagnostic tools, several technical challenges remain and require further improvement and optimization as discussed above. We believe that in order for these targeted sequencing and WES approaches to be adopted in a clinical diagnostic setting, the results should be validated and improved until the sensitivity and specificity meet clinical standards. Although improvements have been made, it is currently still unclear whether targeted sequencing and WES will be able to achieve these clinical standards. Furthermore, the incomplete capture of some exons (so that variants cannot be detected) and uneven sequencing depth (leading to inaccurate variant detection in regions of poor coverage) could potentially lead to a false-negative result. It is therefore critical for diagnostic applications to generate a report that details the quality of the sequencing run – that is, what was not captured and what was sequenced unreliably owing to the low sequencing depth. Clearly, while WES and WGS represent powerful information-generation tools, several ethical issues should be considered and addressed prior to their adoption in a diagnostic context. By contrast, the targeted sequencing of all known disease-causing genes
minimizes these complications, as it has limited discovery value. Critical ethical issues include, but are not limited to, revealing findings considered ‘incidental’ or ‘unrelated’ to the original purpose of the diagnostic test. Should these findings be disclosed to the patients? The other issue relates to whether clinicians and medical geneticists have a responsibility to sift through the list of detected variants to identify known pathogenic mutations that could predispose the patients to other diseases. These issues are still controversial and are currently without any consensus view. Therefore, quite apart from the cost and the technical hurdles to overcome, the challenge now lies in finding a mutually agreeable future strategy to interpret the results from WES and WGS as a diagnostic test. In our opinion, these incidental or unrelated findings should be disclosed to patients after proper validation and appropriate consultation with the attending clinician, genetic counselor or medical geneticist so that proper management can be undertaken. Regarding the question of whether one should sift through the list of detected variants to identify other known pathogenic mutations, we believe that a comprehensive and well-curated disease mutation database to catalog all the variants that have been identified and validated for all the complex and Mendelian diseases will help to alleviate the problem. The list of detected variants can be easily compared with the database by bioinformatic means. However, it is critical that the disease mutation database be built to a ‘clinical grade’. Through the building of a clinical-grade disease mutation database, other known pathogenic mutations can be readily identified from WES and WGS data. However, there is still the question of whether these results should be communicated to the patients after proper consultation and whether the patient should have the right to demand access to this information on his/her own DNA.
These ethical issues are not trivial and should be subjected to critical debate, in order to arrive at a consensus to ensure that these sequencing methods are ready to be used as a tool in clinical diagnosis.
Five-year view
It is arguably inevitable that WGS will eventually replace targeted sequencing of specific genes and WES for research and diagnostic purposes, in the context of both polygenic (complex) and monogenic (Mendelian) diseases. It is important to recognize that disease-associated variants are not likely to be confined to coding exons [70,71]. Similarly, variants in noncoding regions can act as modifiers for Mendelian disorders; these variants have the potential to modify the clinical phenotype in the context of a particular gene defect and are responsible for diverse phenotypic manifestations [72]. WGS offers the most comprehensive method of studying the human genome compared with the other two sequencing approaches. Although both WES and WGS are considered to be powerful, WGS provides the additional advantage of allowing the study of potentially disease-relevant variants in the noncoding regions of the human genome or in coding regions that had not initially been annotated as such [73,74]. Furthermore, WGS is also a powerful tool to investigate structural variants such as copy-number variants, translocations, inversions and fusion events. However, in the clinical diagnostic setting, the question remains as to when it will become cost effective and affordable for the patient. Although the goal of the US$1000 genome is fast approaching [75], the challenges in analysis and the ethical issues involved are not trivial [67,76]. In addition, the cost of analysis and data storage is much higher than originally expected. This ‘hidden’ or ‘additional’ cost of sequencing has been highlighted in the papers “The $1000 genome, the $100,000 analysis?” and “The real cost of sequencing: higher than you think!” [68,69]. Our viewpoint is that targeted sequencing and WES are more likely to be adopted as diagnostic tools in the near future; we cannot afford to wait until such a time that WGS becomes technically and analytically feasible and cost effective.
Finally, we believe that it is only a matter of time before NGS-based genetic tests (either targeted sequencing, WES or WGS) are incorporated into the standard fare offered by clinical genetic diagnostic laboratories.
Key issues
• Next-generation sequencing (NGS) technologies have matured as a mutation-discovery tool since their advent in 2005; however, the prospect of using high-throughput sequencing technology in a medical diagnostic setting has only recently become a reality.
• The delay in applying whole-exome sequencing in a diagnostic context may be attributed in part to the initial technical difficulties inherent in isolating and enriching the collection of all exons (the exome) in the human genome, which lay beyond the technical capacity of traditional PCR amplification methods.
• In a parallel development, a targeted sequencing approach coupled with NGS, albeit with limited discovery value, has also been widely examined as a potential genetic diagnostic and screening tool.
• Although several enrichment methods are available, the high-throughput production of sequencing data (up to several hundred gigabases) by NGS technologies has rendered them less suitable for use by clinical diagnostic laboratories.
• In a similar vein, developments have also been made in sequencing technologies with a view to facilitating their adoption in a clinical setting. The advent of several ‘medium-throughput’ sequencing machines, commonly known as ‘bench-top’ sequencers, has effectively closed the gap between the extremes in the spectrum of sequencing data production.
• Although sequence enrichment coupled with NGS, in the form of targeted sequencing and whole-exome sequencing, has shown its potential as a diagnostic tool, several technical challenges remain and require further improvement and optimization.
• Whole-exome sequencing and whole-genome sequencing are powerful information-generation tools; as a result, several ethical issues should be considered and addressed prior to their adoption in a diagnostic context. Critical ethical issues include, but are not limited to, the revelation of findings considered ‘incidental’ or ‘unrelated’ to the original purpose of the diagnostic test.
• The ethical issues surrounding NGS are not trivial and should be subjected to critical debate in order to arrive at a consensus, ensuring that these sequencing methods are ready to be used as a tool in clinical diagnosis.
Acknowledgements
C-S Ku, M Wu, DN Cooper and R Soong contributed to the conceptualization of this article. C-S Ku, M Wu and DN Cooper contributed to the writing of the article and the preparation of the table. N Naidoo, Y Pawitan, B Pang and B Iacopetta were involved in the discussion and critical reading. C-S Ku, M Wu and R Soong approved the final version and had final responsibility for this article.
Financial & competing interests disclosure
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties. No writing assistance was utilized in the production of this manuscript.
References
Papers of special note have been highlighted as: • of interest; •• of considerable interest
1 Metzker ML. Sequencing technologies – the next generation. Nat. Rev. Genet. 11(1), 31–46 (2010).
2 Mardis ER. A decade’s perspective on DNA sequencing technology. Nature 470(7333), 198–203 (2011).
• A comprehensive review and perspective of next-generation sequencing technologies.
3 Choi M, Scholl UI, Ji W et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc. Natl Acad. Sci. USA 106(45), 19096–19101 (2009).
•• The first proof-of-principle study demonstrating the feasibility of using whole-exome sequencing in a diagnostic context.
4 Majewski J, Wang Z, Lopez I et al. A new ocular phenotype associated with an unexpected but known systemic disorder and mutation: novel use of genomic diagnostics and exome sequencing. J. Med. Genet. 48(9), 593–596 (2011).
5 Cullinane AR, Vilboux T, O’Brien K et al. Homozygosity mapping and whole-exome sequencing to detect SLC45A2 and G6PC3 mutations in a single patient with oculocutaneous albinism and neutropenia. J. Invest. Dermatol. 131(10), 2017–2025 (2011).
6 Clark MJ, Chen R, Lam HY et al. Performance comparison of exome DNA sequencing technologies. Nat. Biotechnol. 29, 908–914 (2011).
•• A comprehensive comparison of three major commercial exome sequencing platforms from Agilent, Illumina and NimbleGen.
7 Ng SB, Turner EH, Robertson PD et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461(7261), 272–276 (2009).
8 Ng SB, Buckingham KJ, Lee C et al. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42(1), 30–35 (2010).
• One of the first studies demonstrating the feasibility of whole-exome sequencing to identify new causal mutations and genes for Mendelian disorders with previously unknown genetic etiology.
9 Turner EH, Ng SB, Nickerson DA, Shendure J. Methods for genomic partitioning. Annu. Rev. Genomics Hum. Genet. 10, 263–284 (2009).
10 Mamanova L, Coffey AJ, Scott CE et al. Target-enrichment strategies for next-generation sequencing. Nat. Methods 7(2), 111–118 (2010).
11 Kuhlenbaumer G, Hullmann J, Appenzeller S. Novel genomic techniques open new avenues in the analysis of monogenic disorders. Hum. Mutat. 32(2), 144–151 (2011).
12 Berg JS, Evans JP, Leigh MW et al. Next generation massively parallel sequencing of targeted exomes to identify genetic mutations in primary ciliary dyskinesia: implications for application to clinical testing. Genet. Med. 13(3), 218–229 (2011).
13 Jones MA, Bhide S, Chin E et al. Targeted polymerase chain reaction-based enrichment and next generation sequencing for diagnostic testing of congenital disorders of glycosylation. Genet. Med. 13(11), 921–932 (2011).
14 Goossens D, Moens LN, Nelis E et al. Simultaneous mutation and copy number variation (CNV) detection by multiplex PCR-based GS-FLX sequencing. Hum. Mutat. 30(3), 472–476 (2009).
15 Jiang Q, Turner T, Sosa MX, Rakha A, Arnold S, Chakravarti A. Rapid and efficient human mutation detection using a bench-top next-generation DNA sequencer. Hum. Mutat. 33(1), 281–289 (2012).
16 Hoischen A, Gilissen C, Arts P et al. Massively parallel sequencing of ataxia genes after array-based enrichment. Hum. Mutat. 31(4), 494–499 (2010).
17 Shearer AE, DeLuca AP, Hildebrand MS et al. Comprehensive genetic testing for hereditary hearing loss using massively parallel sequencing. Proc. Natl Acad. Sci. USA 107(49), 21104–21109 (2010).
18 Hopp K, Heyer CM, Hommerding CJ et al. B9D1 is revealed as a novel Meckel syndrome (MKS) gene by targeted exon-enriched next-generation sequencing and deletion analysis. Hum. Mol. Genet. 20(13), 2524–2534 (2011).
19 Tewhey R, Warner JB, Nakano M et al. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat. Biotechnol. 27(11), 1025–1031 (2009).
20 Walsh T, Lee MK, Casadei S et al. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc. Natl Acad. Sci. USA 107(28), 12629–12633 (2010).
•• Demonstrated the feasibility of custom enrichment coupled with next-generation sequencing to detect point mutations, small insertion-deletions and large genomic deletions and duplications accurately.
21 Jasperson KW, Tuohy TM, Neklason DW, Burt RW. Hereditary and familial colon cancer. Gastroenterology 138(6), 2044–2058 (2010).
22 Artuso R, Fallerini C, Dosa L et al. Advances in Alport syndrome diagnosis using next-generation sequencing. Eur. J. Hum. Genet. 20(1), 50–57 (2012).
23 Rothberg JM, Hinz W, Rearick TM et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475(7356), 348–352 (2011).
• First report of post-light or non-optical sequencing technology.
24 Montenegro G, Powell E, Huang J et al. Exome sequencing allows for rapid gene identification in a Charcot–Marie–Tooth family. Ann. Neurol. 69(3), 464–470 (2011).
• Demonstrated the advantages of whole-exome sequencing compared with the targeted sequencing approach in the genetic diagnosis of disorders with high locus heterogeneity, such as Charcot–Marie–Tooth disease.
25 Watkins D, Schwartzentruber JA, Ganesh J et al. Novel inborn error of folate metabolism: identification by exome capture and sequencing of mutations in the MTHFD1 gene in a single proband. J. Med. Genet. 48(9), 590–592 (2011).
26 Worthey EA, Mayer AN, Syverson GD et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet. Med. 13(3), 255–262 (2011).
27 Bonnefond A, Durand E, Sand O et al. Molecular diagnosis of neonatal diabetes mellitus using next-generation sequencing of the whole exome. PLoS One 5(10), e13630 (2010).
28 Schrader KA, Heravi-Moussavi A, Waters PJ et al. Using next-generation sequencing for the diagnosis of rare disorders: a family with retinitis pigmentosa and skeletal abnormalities. J. Pathol. 225(1), 12–18 (2011).
29 Al-Romaih KI, Genovese G, Al-Mojalli H et al. Genetic diagnosis in consanguineous families with kidney disease by homozygosity mapping coupled with whole-exome sequencing. Am. J. Kidney Dis. 58(2), 186–195 (2011).
30 Simpson DA, Clark GR, Alexander S, Silvestri G, Willoughby CE. Molecular diagnosis for heterogeneous genetic diseases with targeted high-throughput DNA sequencing applied to retinitis pigmentosa. J. Med. Genet. 48(3), 145–151 (2011).
31 Lupski JR, Reid JG, Gonzaga-Jauregui C et al. Whole-genome sequencing in a patient with Charcot–Marie–Tooth neuropathy. N. Engl. J. Med. 362(13), 1181–1191 (2010).
32 Rios J, Stein E, Shendure J, Hobbs HH, Cohen JC. Identification by whole-genome resequencing of gene defect responsible for severe hypercholesterolemia. Hum. Mol. Genet. 19(22), 4313–4318 (2010).
33 Sobreira NL, Cirulli ET, Avramopoulos D et al. Whole-genome sequencing of a single proband together with linkage analysis identifies a Mendelian disease gene. PLoS Genet. 6(6), e1000991 (2010).
34 Ng SB, Nickerson DA, Bamshad MJ, Shendure J. Massively parallel sequencing and rare disease. Hum. Mol. Genet. 19(R2), R119–R124 (2010).
35 Ku CS, Naidoo N, Pawitan Y. Revisiting Mendelian disorders through exome sequencing. Hum. Genet. 129(4), 351–370 (2011).
36 Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Hum. Mol. Genet. 19(R2), R145–R151 (2010).
37 Bick D, Dimmock D. Whole exome and whole genome sequencing. Curr. Opin. Pediatr. 23(6), 594–600 (2011).
38 Zuchner S, Dallman J, Wen R et al. Whole-exome sequencing links a variant in DHDDS to retinitis pigmentosa. Am. J. Hum. Genet. 88(2), 201–206 (2011).
39 Huentelman MJ. Targeted next-generation sequencing: microdroplet PCR approach for variant detection in research and clinical samples. Expert Rev. Mol. Diagn. 11(4), 347–349 (2011).
40 Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. Microarray-based genomic selection for high-throughput resequencing. Nat. Methods 4(11), 907–909 (2007).
41 Albert TJ, Molla MN, Muzny DM et al. Direct selection of human genomic loci by microarray hybridization. Nat. Methods 4(11), 903–905 (2007).
42 Gnirke A, Melnikov A, Maguire J et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol. 27(2), 182–189 (2009).
43 Mertes F, Elsharawy A, Sauer S et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct. Genomics 10(6), 374–386 (2011).
44 Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR. A comparative analysis of exome capture. Genome Biol. 12(9), R97 (2011).
45 Sulonen AM, Ellonen P, Almusa H et al. Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 12(9), R94 (2011).
46 Mondal K, Shetty AC, Patel V, Cutler DJ, Zwick ME. Targeted sequencing of the human X chromosome exome. Genomics 98(4), 260–265 (2011).
47 Tarpey PS, Smith R, Pleasance E et al. A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation. Nat. Genet. 41(5), 535–543 (2009).
48 Johnston JJ, Teer JK, Cherukuri PF et al. Massively parallel sequencing of exons on the X chromosome identifies RBM10 as the gene that causes a syndromic form of cleft palate. Am. J. Hum. Genet. 86(5), 743–748 (2010).
49 Meder B, Haas J, Keller A et al. Targeted next-generation sequencing for the molecular genetic diagnostics of cardiomyopathies. Circ. Cardiovasc. Genet. 4(2), 110–122 (2011).
50 Shendure J, Ji H. Next-generation DNA sequencing. Nat. Biotechnol. 26(10), 1135–1145 (2008).
51 Droege M, Hill B. The genome sequencer FLX system – longer reads, more applications, straight forward bioinformatics and more complete data sets. J. Biotechnol. 136(1–2), 3–10 (2008).
52 Rothberg JM, Leamon JH. The development and impact of 454 sequencing. Nat. Biotechnol. 26(10), 1117–1124 (2008).
53 Bentley DR, Balasubramanian S, Swerdlow HP et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218), 53–59 (2008).
54 Li Y, Wang J. Faster human genome sequencing. Nat. Biotechnol. 27(9), 820–821 (2009).
55 Lim BC, Lee S, Shin JY et al. Genetic diagnosis of Duchenne and Becker muscular dystrophy using next-generation sequencing technology: comprehensive mutational search in a single platform. J. Med. Genet. 48(11), 731–736 (2011).
56 Sathirapongsasuti JF, Lee H, Horst BA et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 27(19), 2648–2654 (2011).
57 Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics 10, 80 (2009).
58 Medvedev P, Stanciu M, Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat. Methods 6(Suppl. 11), S13–S20 (2009).
59 Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M. Detecting copy number variation with mated short reads. Genome Res. 20(11), 1613–1622 (2010).
60 Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet. 11(10), 685–696 (2010).
61 Robison K. Application of second-generation sequencing to cancer genomics. Brief Bioinform. 11(5), 524–534 (2010).
62 Ding L, Getz G, Wheeler DA et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455(7216), 1069–1075 (2008).
63 Dalgliesh GL, Furge K, Greenman C et al. Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes. Nature 463(7279), 360–363 (2010).
64 Lee W, Jiang Z, Liu J et al. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature 465(7297), 473–477 (2010).
65 Berg JS, Khoury MJ, Evans JP. Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time. Genet. Med. 13(6), 499–504 (2011).
•• A comprehensive review and discussion of the issues and challenges in applying next-generation sequencing in clinical practice and public health.
66 Kingsmore SF, Saunders CJ. Deep sequencing of patient genomes for disease diagnosis: when will it become routine? Sci. Transl. Med. 3(87), 87ps23 (2011).
67 Sharp RR. Downsizing genomic medicine: approaching the ethical complexity of whole-genome sequencing by starting small. Genet. Med. 13(3), 191–194 (2011).
68 Mardis ER. The $1,000 genome, the $100,000 analysis? Genome Med. 2(11), 84 (2010).
69 Sboner A, Mu XJ, Greenbaum D, Auerbach RK, Gerstein MB. The real cost of sequencing: higher than you think! Genome Biol. 12(8), 125 (2011).
70 Manolio TA, Collins FS, Cox NJ et al. Finding the missing heritability of complex diseases. Nature 461(7265), 747–753 (2009).
71 Cooper DN, Chen JM, Ball EV et al. Genes, mutations, and human inherited disease at the dawn of the age of personalized genomics. Hum. Mutat. 31(6), 631–655 (2010).
72 Genin E, Feingold J, Clerget-Darpoux F. Identifying modifier genes of monogenic disease: strategies and difficulties. Hum. Genet. 124(4), 357–368 (2008).
73 Bainbridge MN, Wang M, Wu Y et al. Targeted enrichment beyond the consensus coding DNA sequence exome reveals exons with higher variant densities. Genome Biol. 12(7), R68 (2011).
74 Coffey AJ, Kokocinski F, Calafato MS et al. The GENCODE exome: sequencing the complete human exome. Eur. J. Hum. Genet. 19(7), 827–831 (2011).
75 Drmanac R, Sparks AB, Callow MJ et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327(5961), 78–81 (2010).
76 Koboldt DC, Ding L, Mardis ER, Wilson RK. Challenges of sequencing human genomes. Brief Bioinform. 11(5), 484–498 (2010).
77 Haas J, Katus HA, Meder B. Next-generation sequencing entering the clinical arena. Mol. Cell Probes 25, 206–211 (2011).
Websites
101 MiSeq Personal Sequencer. www.illumina.com/systems/miseq.ilmn
102 Fluidigm Corporation. www.fluidigm.com/home.html
103 RainDance Technologies. www.raindancetech.com
104 Wellcome Trust Sanger Institute. www.sanger.ac.uk/gencode/
105 Agilent SureSelect Human All Exon 50Mb Kit. www.chem.agilent.com/Library/datasheets/Public/5990-6319en_lo.pdf
106 TruSeq Exome Enrichment Kit. www.illumina.com/products/truseq_exome_enrichment_kit.ilmn
107 Myriad for Professionals. www.myriadtests.com/provider/doc/BRACAnalysis-Technical-Specifications.pdf