Methods in Molecular Biology 1038

Noam Shomron Editor

Deep Sequencing Data Analysis

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651


Deep Sequencing Data Analysis

Edited by

Noam Shomron Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel

Editor Noam Shomron Faculty of Medicine Tel Aviv University, Tel Aviv, Israel

ISSN 1064-3745  ISSN 1940-6029 (electronic)
ISBN 978-1-62703-513-2  ISBN 978-1-62703-514-9 (eBook)
DOI 10.1007/978-1-62703-514-9
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013942649
© Springer Science+Business Media New York 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Humana Press is a brand of Springer
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Sean O'Faolain, an Irish writer (1900–1991), wrote that "there is only one admirable form of the imagination: the imagination that is so intense that it creates a new reality, that it makes things happen." In the biomedical field we are fortunate to be living in a new technological era in which our imagination is ignited by the new DNA sequencing machines. The quantum leap is so remarkable that a new genetic reality is created. It is now up to us scientists to make it come about and change our world of DNA sequence information.

The new genetic revolution is fuelled by Deep Sequencing (or Next Generation Sequencing) apparatuses which, in essence, read billions of nucleotides per reaction. Effectively, when carefully planned, any experimental question that can be translated into reading nucleic acids can be addressed. In our book titled "Deep Sequencing Data Analysis," in the "Methods in Molecular Biology" series, numerous leading authors contribute to the multifaceted analysis of deep sequencing data. We start with an introduction to high-throughput sequencing experimental design and then present a solution for compressing the masses of data generated. Given that most current technologies are based on short reads, accuracy, coverage, and assembly are covered in subsequent chapters. For the identification of variants in a given genome we look at interpretation of short reads in the context of the "Exome" information (the coding-gene portion of the genome), and then apply it to the identification of disease-causing mutations. The complexity of the genetic sequence, its expression, and its interpretation are dealt with as part of the analysis of tandem repeats, editing of microRNAs, and alternative mRNA splicing. The closing chapters present novel methods, Chromatin Immunoprecipitation (ChIP-seq) and Reverse Transcriptase Termination Site (RTTS) sequencing, their statistics, and their implications.

The new outputs generated by deep sequencing bring up major computational challenges. The myriad sequence reads must be collated to build a coherent interpretation. We present an overview of data analysis that aids in making sense of this information. We discuss several topics in depth, presenting the "tip of the iceberg" in terms of available methods. Nevertheless, we have collected key data analysis procedures that should be of great use to beginner and savvy bioinformaticians alike when approaching deep sequencing data analysis.

Tel Aviv, Israel

Noam Shomron


Contents

Preface
Contributors

1 An Introduction to High-Throughput Sequencing Experiments: Design and Bioinformatics Analysis
  Rachelly Normand and Itai Yanai
2 Compressing Resequencing Data with GReEn
  Armando J. Pinho, Diogo Pratas, and Sara P. Garcia
3 On the Accuracy of Short Read Mapping
  Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, and Anders Krogh
4 Statistical Modeling of Coverage in High-Throughput Data
  David Golan and Saharon Rosset
5 Assembly Algorithms for Deep Sequencing Data: Basics and Pitfalls
  Nitzan Kol and Noam Shomron
6 Short Read Mapping for Exome Sequencing
  Xueya Zhou, Suying Bao, Binbin Wang, Xuegong Zhang, and You-Qiang Song
7 Profiling Short Tandem Repeats from Short Reads
  Melissa Gymrek and Yaniv Erlich
8 Exome Sequencing Analysis: A Guide to Disease Variant Detection
  Ofer Isakov, Marie Perrone, and Noam Shomron
9 Identifying RNA Editing Sites in miRNAs by Deep Sequencing
  Shahar Alon and Eli Eisenberg
10 Identifying Differential Alternative Splicing Events from RNA Sequencing Data Using RNASeq-MATS
  Juw Won Park, Collin Tokheim, Shihao Shen, and Yi Xing
11 Optimizing Detection of Transcription Factor-Binding Sites in ChIP-seq Experiments
  Aleksi Kallio and Laura L. Elo
12 Statistical Analysis of ChIP-seq Data with MOSAiCS
  Guannan Sun, Dongjun Chung, Kun Liang, and Sündüz Keleş
13 Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing
  Lukasz J. Kielpinski, Mette Boyd, Albin Sandelin, and Jeppe Vinther
Index


Contributors

SHAHAR ALON  George S. Wise Faculty of Life Sciences, Department of Neurobiology, Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
SUYING BAO  Department of Biochemistry, The University of Hong Kong, Hong Kong SAR, China
METTE BOYD  Department of Biology, University of Copenhagen, Copenhagen, Denmark; Biotech Research and Innovation Centre, University of Copenhagen, Copenhagen, Denmark
DONGJUN CHUNG  Department of Statistics, University of Wisconsin, Madison, WI, USA
ELI EISENBERG  Raymond and Beverly Sackler School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel; Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
LAURA L. ELO  Department of Mathematics and Statistics, University of Turku, Turku, Finland; Turku Centre for Biotechnology, Turku, Finland
YANIV ERLICH  Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA, USA
JES FRELLSEN  Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
SARA P. GARCIA  IEETA/DETI, University of Aveiro, Aveiro, Portugal
DAVID GOLAN  School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel
MELISSA GYMREK  Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, MA, USA; Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA; Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
OFER ISAKOV  Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
ALEKSI KALLIO  CSC—IT Center for Science Ltd, Espoo, Finland
SÜNDÜZ KELEŞ  Department of Statistics, University of Wisconsin, Madison, WI, USA; Department of Biostatistics and Biomedical Informatics, University of Wisconsin, Madison, WI, USA
LUKASZ J. KIELPINSKI  Department of Biology, University of Copenhagen, Copenhagen, Denmark
NITZAN KOL  Functional Genomics Laboratory, Tel Aviv University, Tel Aviv, Israel
ANDERS KROGH  Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
KUN LIANG  Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
PETER MENZEL  Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
RACHELLY NORMAND  Life Sciences and Engineering, Technion Genome Center, Technion—Israel Institute of Technology, Haifa, Israel


JUW WON PARK  Department of Microbiology, Immunology, and Molecular Genetics, UCLA, Los Angeles, CA, USA
MARIE PERRONE  Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
ARMANDO J. PINHO  IEETA/DETI, University of Aveiro, Aveiro, Portugal
MIREYA PLASS  Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
DIOGO PRATAS  IEETA/DETI, University of Aveiro, Aveiro, Portugal
SIMON H. RASMUSSEN  Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
SAHARON ROSSET  School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel
ALBIN SANDELIN  Department of Biology, University of Copenhagen, Copenhagen, Denmark; Biotech Research and Innovation Centre, University of Copenhagen, Copenhagen, Denmark
SHIHAO SHEN  Department of Microbiology, Immunology, and Molecular Genetics, UCLA, Los Angeles, CA, USA
NOAM SHOMRON  Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
YOU-QIANG SONG  Department of Biochemistry, The University of Hong Kong, Hong Kong SAR, China; HKU-Zhejiang Research Institute, The University of Hong Kong, Hong Kong SAR, China
GUANNAN SUN  Department of Statistics, University of Wisconsin, Madison, WI, USA
COLLIN TOKHEIM  Department of Microbiology, Immunology, and Molecular Genetics, UCLA, Los Angeles, CA, USA
JEPPE VINTHER  Department of Biology, University of Copenhagen, Copenhagen, Denmark
BINBIN WANG  National Research Institute for Family Planning, Beijing, China
YI XING  Department of Microbiology, Immunology, and Molecular Genetics, UCLA, Los Angeles, CA, USA
ITAI YANAI  Department of Biology, Technion—Israel Institute of Technology, Haifa, Israel
XUEGONG ZHANG  Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, Beijing, China; Department of Automation, Tsinghua University, Beijing, China
XUEYA ZHOU  Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, Beijing, China; Department of Automation, Tsinghua University, Beijing, China

Chapter 1
An Introduction to High-Throughput Sequencing Experiments: Design and Bioinformatics Analysis
Rachelly Normand and Itai Yanai

Abstract

The dramatic fall in the cost of DNA sequencing has revolutionized the experiments within reach in the life sciences. Here we provide an introduction to the domains of analysis possible using high-throughput sequencing, distinguishing between "counting" and "reading" applications. We discuss the steps in designing a high-throughput sequencing experiment, introduce the most widely used applications, and describe basic sequencing concepts. We review the various software programs available for many of the bioinformatics analyses required to make sense of the sequencing data. We hope that this introduction will be accessible to biologists with no previous background in bioinformatics, yet with a keen interest in applying the power of high-throughput sequencing in their research.

Key words RNA-Seq, ChIP-Seq, Resequencing, De novo genome assembly, Initial bioinformatics analysis

1 Introduction

High-throughput sequencing is the process of identifying the sequence of millions of short DNA fragments in parallel. In this chapter, we discuss applications and analyses of high-throughput sequencing done on the Illumina platform. The main advantage of this technology is that it allows a very high throughput; currently up to 1.6 billion DNA fragments can be sequenced in parallel in a single run, to produce a total of 320 Gbp (HiSeq 2000, version three kits). One challenge with this technology, however, is that the sequenced fragments are relatively short—currently up to 250 bp (MiSeq instrument) or 150 bp (HiSeq 2500 instrument)—though double this can be produced using the paired-end option (see below).

We operate a service unit in a university setting providing high-throughput sequencing (henceforth, HTS) sample preparation, sequencing, and initial bioinformatics analysis. Based upon our experiences over the past 2 years we provide the following notes. We do not aim to provide a complete picture of all of the innumerable


resources available for any one of the described applications. Rather, our goal is to provide a basic overview of the opportunities and challenges that HTS represents. The field is clearly changing rapidly and so the details are to be taken with caution as they will surely need revision as new algorithms and technology emerge.

While many applications are supported by HTS, the actual input to the instrument is the same: libraries comprising billions of DNA strands of roughly the same length (typically 300 bp) with particular sequences (linkers) on either end. "Sample preparation" is the process by which an initial sample arrives at this highly ordered state. When genomic DNA is the starting material, it is fragmented and then size-selected to obtain a tight size distribution. If the starting material is RNA, it is oftentimes polyA-selected to limit the sequencing to mRNA. The RNA is reverse transcribed to DNA and then also size-selected. Irrespective of the application, linker DNA molecules of particular sequences are ligated to the ends of the strands. These consist of two elements: adaptors and indices. The adaptors hybridize the DNA fragments to the flowcell on which they are sequenced. The indices are 6–7 bp sequences tagging different samples within the same library that will be sequenced together. Importantly, there is a PCR amplification step in many of the sample preparation protocols, which has implications for the structure of the data: identical sequences may be a result of the amplification or reflect recurrence in the original DNA sample.

2 Materials

2.1 Basic Concepts in High-Throughput Sequencing

Figure 1 indicates the anatomy of an insert. The following are additional basic definitions important for HTS:
1. Insert—The DNA fragment that is used for sequencing.
2. Read—The part of the insert that is sequenced.
3. Single Read (SR)—A sequencing procedure by which the insert is sequenced from one end only.
4. Paired End (PE)—A sequencing procedure by which the insert is sequenced from both ends.
5. Flowcell—A small glass chip on which the DNA fragments are attached and sequenced. The flowcell is covered by probes that allow hybridization of the adaptors that were ligated to the DNA fragments.

Fig. 1 Schematic of a paired-end read


6. Lane—The flowcell consists of eight physically separated channels called lanes. The sequencing is done in parallel on all lanes.
7. Multiplexing/Demultiplexing—Sequencing a few samples on the same lane is called multiplexing. The separation of reads that were sequenced on one lane to different samples is called demultiplexing and is done by a script that recognizes the index of each read and compares it to the known indices of each sample (see the sketch after this list).
8. Pipeline—A series of computational processes.
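To make the demultiplexing step concrete, the following Python sketch assigns reads to samples by comparing each read's index sequence against a table of known sample indices, tolerating one mismatch. The sample names, index sequences, and mismatch allowance are illustrative assumptions, not part of any particular Illumina tool.

```python
# Minimal demultiplexing sketch (illustrative; not the Illumina/CASAVA implementation).
# Assumes the index sequence of each read is already available, e.g. parsed from the
# read identifier or from a separate index-read FASTQ file.

SAMPLE_INDICES = {            # hypothetical sample sheet: index -> sample name
    "ACAGTG": "sample_1",
    "GCCAAT": "sample_2",
}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read_index, max_mismatches=1):
    """Return the sample whose index is closest to read_index, or None if no index fits."""
    best_sample, best_distance = None, max_mismatches + 1
    for index, sample in SAMPLE_INDICES.items():
        distance = hamming(read_index, index)
        if distance < best_distance:
            best_sample, best_distance = sample, distance
    return best_sample

print(assign_sample("ACAGTG"))   # exact match        -> sample_1
print(assign_sample("ACAGTC"))   # one mismatch       -> sample_1
print(assign_sample("TTTTTT"))   # no index close by  -> None
```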

3 Methods

3.1 High-Throughput Sequencing Applications

HTS applications can be divided into two main categories: "reading" and "counting." In reading applications the focus of the experiment is the sequence itself, for example, for finding genomic variants or assembling the sequence of an unknown genome. Counting applications are based on the ability to count reads and compare these counts, for example, to assess gene expression levels. Table 1 shows some of the main applications enabled by HTS. These represent but a sampling of the main HTS applications. It should be noted that one can invoke HTS in practically any experiment that produces DNA fragments. What should be considered and planned before the sequencing, however, is the method by which the analysis of the sequenced fragments will be done to extract the meaning from the experiment. As an example of a unique HTS experiment, chromatin interactions can be identified by PE sequencing [1]. This procedure includes capturing interacting loci in the genome by immunoprecipitating cross-linked fragments of DNA and proteins from fixed cells. There are many others, published at a rate of about one per day.

3.2 Sequence Coverage

In reading applications, coverage corresponds to the number of reads that cover each base in the genome on average. Coverage can be calculated as

Average coverage = (read length × number of reads) / genome size

Note that only the number of mapped reads should be included in the above calculation. In general, 30× coverage is considered a minimum for identifying genomic variants, while de novo assembly usually requires a much higher coverage. Furthermore, the needed coverage depends on the experiment design. For example, if resequencing is done on a population and the sample includes pooling of heterogeneous genomes, the coverage must be higher for the robust detection of rare variants.
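The coverage formula is easy to turn into a quick planning check. The following Python sketch evaluates it and inverts it to estimate how many mapped reads a target depth requires; the read length, read number, and genome size used below are placeholder values, not recommendations.

```python
# Expected average coverage, following the formula above.
def average_coverage(read_length, mapped_reads, genome_size):
    return read_length * mapped_reads / genome_size

# Example: 100 bp reads, 1.5 billion mapped reads, a human-sized genome (~3.2 Gbp).
print(round(average_coverage(100, 1_500_000_000, 3_200_000_000), 1))   # ~46.9

# Inverting the formula gives the number of mapped reads needed for a target depth.
def reads_needed(target_coverage, read_length, genome_size):
    return int(target_coverage * genome_size / read_length)

print(reads_needed(30, 100, 3_200_000_000))   # 960,000,000 reads for 30x
```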

Table 1 HTS applications

Reading

Resequencing
  Goal: Find variants in a given sample relative to a reference genome.
  Experiment details: Extract DNA from the relevant cells, conduct sample preparation consisting of DNA fragmentation and sequencing.
  Basic analysis summary: Mapping of the sequenced fragments to the reference genome and identifying variants relative to the reference genome by summarizing the differences of the fragments from the genomic loci to which they map.

Target-enriched sequencing
  Goal: Target enrichment sequencing is a specific form of resequencing that is focused only on certain genomic loci. This is useful for organisms with large genomes where enrichment increases the coverage on the loci of interest, thereby reducing costs.
  Experiment details: After the DNA is extracted from the cells and undergoes sample preparation, an enrichment process is done to capture the relevant loci. Target enrichment can be done on specific regions of the genome using "tailored" target-enrichment probes, or by using available kits such as exome-enrichment kits.
  Basic analysis summary: Same as in resequencing.

De novo assembly
  Goal: Identify a genomic sequence without any additional reference.
  Experiment details: Same as in resequencing.
  Basic analysis summary: The assembly process relies on overlaps of DNA fragments. These overlaps are merged into consensus sequences called contigs and scaffolds.

Counting

ChIP-Seq/RIP-Seq
  Goal: Find the binding locations of RNA- or DNA-binding proteins.
  Experiment details: First, the ChIP/RIP experiment is done: proteins are bound to the DNA/RNA and are cross-linked to it. The DNA/RNA is then fragmented. The proteins are pulled down by an immunoprecipitation process and then the cross-linking is reversed. The DNA/RNA fragments that are enriched in the protein-binding sites are then sequenced.
  Basic analysis summary: The sequenced fragments are mapped to the genome. The enriched locations in the genome are found by detecting "peaks" of mapped fragments along the genome. These peaks should be significantly higher than the mapped fragments in the surrounding loci, and significantly higher compared to a control sample—usually the input DNA of the ChIP experiment or another sample of immunoprecipitation done by a nonspecific antibody.

Reading/counting

RNA-Seq
  Goal: Detecting and comparing gene expression levels.
  Experiment details: Total RNA is extracted from the cells. In a sample preparation process the mRNA is pulled down and fragmented. The mRNA fragments are then reverse transcribed to cDNA. The cDNA fragments are sequenced.
  Basic analysis summary: The cDNA fragments are mapped to the reference genome. The fragments that map to each gene are counted and normalized to allow comparisons between different genes and different samples. Un-annotated genes and transcripts can be found in an RNA-Seq experiment by detecting bundles of fragments that are mapped to the genome in an un-annotated region.

microRNA-Seq
  Goal: Detect and count microRNAs.
  Experiment details: Total RNA is extracted from the cells, and the microRNA is isolated by recognizing the natural structure common to most known microRNA molecules. The microRNA fragments are then reverse transcribed and sequenced.
  Basic analysis summary: The sequenced fragments are mapped to the genome. The microRNAs can then be detected and counted.


Fig. 2 Saturation report. The different series are sets of genes that differ in their final expression values using the complete dataset (in this case, 32 million reads). Highly expressed genes are saturated with even 10 % of the reads, whereas lowly expressed genes require a higher amount of reads, while very lowly expressed genes remain unsaturated even with the complete dataset

Contaminations may not pose a great difficulty for "reading" applications with a known reference genome, since they will not map to the reference genome. However, contaminations "steal" coverage from the sample and should be taken into account when estimating the expected coverage. If it is not possible to assess what percentage of contamination the sample will contain, a pilot experiment may again prove useful: sequence just one or two samples at low coverage, and then assess the percentage of contaminants by mapping. In de novo assembly, contaminations may be a lot more difficult to detect, and thus attempts to eliminate contamination should be made when extracting the DNA, before sequencing and analysis.

In counting applications, such as RNA-Seq, the notion of coverage is not straightforward since the number of reads along the genome is not expected to be uniform. For example, most RNA-Seq reads will correspond to highly expressed transcripts, whereas lowly expressed transcripts will be less represented. This raises the question of how many reads are required for a particular application. In general, this is a trial-and-error process, and consequently we have found it useful to begin with a pilot experiment of a few samples to provide an estimate of the transcriptomic complexity. An analysis that can help assess whether enough reads have been sequenced is a "saturation report" (Fig. 2, [2]). In this "jack-knifing" method, the expression levels are determined using all of the reads. The expression levels are then compared to those recalculated using only a fraction of the reads. Examining the expression levels at each cut of the data informs at which point

An Introduction to High-Throughput Sequencing Experiments. . .

7

the expression level remains unchanged despite additional data. As expected, additional data is most helpful in resolving the expression levels of the lowly expressed genes. After deciding how many reads are required per sample, the samples are divided into lanes according to the number of sequenced reads per lane, which is a fixed amount.
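The jack-knifing idea behind the saturation report can be sketched by subsampling the mapped reads and recomputing expression at several fractions of the data. The sketch below operates on a simple list of per-read gene assignments and is only meant to convey the idea, not to replace a dedicated tool; the gene names and read numbers are invented.

```python
import random
from collections import Counter

def saturation_curve(read_gene_assignments, fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Recompute per-gene read counts on random subsets of the reads.

    Counts are scaled back to full depth so the fractions are directly comparable;
    a gene whose scaled count stops changing as the fraction grows is saturated.
    """
    curve = {}
    for fraction in fractions:
        n = int(len(read_gene_assignments) * fraction)
        subset = random.sample(read_gene_assignments, n)
        counts = Counter(subset)
        curve[fraction] = {gene: count / fraction for gene, count in counts.items()}
    return curve

# Toy data: one highly expressed gene and one lowly expressed gene.
reads = ["geneA"] * 9000 + ["geneB"] * 30
random.shuffle(reads)
for fraction, counts in saturation_curve(reads).items():
    print(fraction, {gene: round(c) for gene, c in sorted(counts.items())})
```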

3.3 Sequencing Recipe: Single-Read vs. Paired-End, Insert Size, and Read Length

The sequencing recipe is influenced by several factors:

3.3.1 The Repetitive Nature of the Genome

Human and mouse genomes have ~20 % repetitive sequences [3]. Consequently, to uniquely score a read mapping to a repetitive region it must be longer than the repetitive region or border the neighboring non-repetitive sequence. Longer reads or PE reads allow "rescue" of the nonunique end and also mapping to nonunique regions in the genome (Fig. 3).

3.3.2 Differentially Spliced Variants

When assessing gene expression levels in RNA-Seq, it is potentially informative to discover the differential expression levels of different transcripts of the same gene. Reads that map to an exon shared by more than one transcript pose a difficulty in assessing the transcript of origin. PE reads may solve this problem if one end of the sequenced fragment maps to an exon that is unique to one of the transcripts. Figure 4 shows an example in which one cannot determine with certainty from which transcript the SR originated. Sequencing it as PE resolves this problem.


Fig. 3 The red end would not have been uniquely mapped if sequenced as a single read as opposed to a paired-end read


Fig. 4 The single read maps to the gene but cannot distinguish among the transcripts. Paired-end reads provide a better chance at identifying splice variants



Fig. 5 (a) The sequenced strain contains a deletion in comparison with the reference genome. Consequently, paired-end reads mapped to the reference genome will have a bigger distance between them than the expected insert size. (b) An example of a genomic deletion in the IGV browser [4]

3.3.3 Genetic Distance of the Sequenced Sample from the Reference Genome

If the sequenced samples are genetically distant from the reference genome, longer reads may be required to determine the origin of each read in the genome. The mappings of each read will contain more mismatches, thus making it difficult to unambiguously determine its correct location, thereby increasing the probability that more than one location may be possible. Thus, the longer the read, the more likely a unique mapping becomes.

3.3.4 Finding Structural Variations

Structural variations in the genome, such as long deletions, insertions, inversions, and translocations, can be found using PE information. For example, if a large deletion is present in the sequenced strain, the insert lengths will be longer than expected (Fig. 5).

3.3.5 De Novo Assembly

Assembling a new genome from short sequenced reads consists of overcoming many challenges, such as sequencing errors, low-complexity regions, and repetitive regions, among others [5, 6]. De novo assembly remains a notoriously difficult problem and often the genome of a metazoan remains in thousands of contigs. Obviously, longer PE reads lead to better assemblies. It has also been shown that using a few sequencing libraries with different insert lengths may improve the assembly process [5].



3.4 Number of Samples for Sequencing

3.4.1 Resequencing


If the reference genome to which the sequenced reads are mapped is genetically distant, sequencing the actual strain in its baseline state (before the mutagenesis, without the phenotypic change, etc.) will be beneficial for interpreting the data. This will help in distinguishing the variations that are due to evolutionary distance from those that cause the actual phenotypic trait under study.

3.4.2 RNA-Seq

It is highly recommended to sequence a few biological replicates to control for biological noise. Technical replicates will also inevitably show variation [7]. Some gene expression software programs, such as Cufflinks [8], can use the data from different replicates and merge it into one value with a higher statistical significance.

3.4.3 ChIP-Seq

A ChIP-Seq experiment should include the IP DNA and one more sample that will serve as a control. The control sample may be the input DNA, before the IP process, or an IP done on the same DNA with a nonspecific antibody, such as IgG [9, 10]. Sequencing a control sample enables detection of enriched regions that are also significantly enriched compared to the control sample, and not only enriched compared to the area surrounding them in the IP sample. This may reduce false-positive peaks detected solely because of areas in the genome that have a higher coverage due to better DNA fragmentation compared to the surrounding area.

3.5 Analysis Pipelines

Figure 6 shows the bioinformatics pipelines involved in four main applications: resequencing, de novo assembly, RNA-Seq, and ChIPSeq. Several processes are common to all or multiple applications.


Fig. 6 Bioinformatics pipelines of the four main applications



3.5.1 Raw Data Handling

Available software for this step: Illumina’s CASAVA software. The Illumina run produces “base-calling” files (*.bcl) which only become useful bioinformatically when converted to the general fastq format (see below). During this file conversion, the demultiplexing process is also carried out, which is the separation of reads from different samples that were sequenced on the same lane.

3.5.2 Quality Control and Read Manipulation

Available software for this step: CASAVA and FastQC (Babraham Bioinformatics). After a sequencing run is completed and before starting the analysis, the run's quality should be checked for the following parameters, which may be telling of the quality of the sample and run.
1. Pass filter (PF) reads—The number and percentage of PF reads in each lane and for each sample should match the number of expected sequenced reads. If it is dramatically lower, this might indicate a low-quality run, and may reduce the expected coverage.
2. Control reads—Apart from the DNA libraries, control DNA from the viral PhiX genome is spiked-in at 1 % concentration with the sample onto each lane of the flowcell. Reads are automatically mapped by the Illumina software to the PhiX genome. The percentage of reads from each lane mapping to this genome and the amount of mismatches in the mapping are used as control values for the lane's quality. A good run typically has ~1 % sequencing errors, as detected by the mismatches to the PhiX genome.
3. Quality scores of the reads—As will also be explained in the next section ("Diving into the technical details"), each base of each sequenced read is associated with a quality score providing the confidence in the particular base. In general, the quality scores drop toward the end of the sequenced read. These confidences should be assessed to check for the overall quality of the run. The quality scores may be automatically produced by the sequencing platform, and may also be created by programs like FastQC that provide other statistics on the sequenced reads, such as overrepresented sequences, per-base GC content, and more (Fig. 7).
Based upon these parameters, we found it advantageous in particular instances to further manipulate the sequences: for example, trimming sequences to remove low-quality ends, filtering reads by quality, and removing adaptors.
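The read manipulations just mentioned are usually performed with dedicated tools, but the underlying operations are simple. The sketch below shows 3'-end quality trimming and mean-quality filtering for a single read, assuming the Phred+33 quality encoding used by recent Illumina pipelines; the thresholds are arbitrary examples.

```python
# Minimal FASTQ quality trimming/filtering sketch (assumes Phred+33 quality encoding).

def phred_scores(quality_string, offset=33):
    return [ord(char) - offset for char in quality_string]

def trim_low_quality_tail(seq, qual, min_q=20):
    """Trim bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

def passes_mean_quality(qual, min_mean=25):
    scores = phred_scores(qual)
    return bool(scores) and sum(scores) / len(scores) >= min_mean

seq  = "ACGTACGTACGT"
qual = "IIIIIIIIII##"          # two very low-quality bases at the 3' end
trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual)
print(trimmed_seq, passes_mean_quality(trimmed_qual))   # ACGTACGTAC True
```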

3.5.3 Assembling Contigs and Scaffolds for De Novo Assembly

Available software for this step: SOAPdenovo [11], ABySS [12], Velvet [13], and ALL-PATHS [14]. De novo assembly is the most challenging application and continues to be the subject of intense algorithmic research. The process generally consists of three basic steps (Fig. 8):


Fig. 7 Examples of statistics created by FastQC. (a) Quality score statistics per base. (b) Sequence contents per base. (c) Abundant Kmers across the reads

1. Contig-ing—The first step in the assembly consists of detecting overlaps between single reads. Bundles of overlapping reads are merged into a consensus sequence, called a contig. Repetitive or low-complexity regions in the assembled genomic sequence often prevent the construction of one long sequence at this initial step. This step typically results in >10,000 contigs, depending of course on the size of the genome and the number and length of sequenced reads.
2. Scaffolding—For de novo sequencing of complex genomes, it is crucial for the sequenced reads to be of paired-end inserts. If so, the many contigs can then be merged into longer segments called scaffolds by taking into account the paired-end information of the reads. Since the paired-end inserts contain an unknown sequence between the two reads, the scaffold may contain unknown sequence (represented as N's) of a size that can be determined by the average insert length.
3. Gap closing—After creating the scaffolds, the sequence of any remaining gaps within the scaffolds may be resolved by mapping the original paired-end reads to the scaffolds and searching for a read that informs the gap regions. This function may be an integrated process of some assemblers or a separate function may need to be run, as in SOAPdenovo.

Fig. 8 Three basic steps of de novo assembly: (a) Aligning reads to find overlaps. (b) Connecting contigs into scaffolds by using PE information. (c) Closing intra-scaffold gaps

It should be noted that de novo assembly projects may include a reference genome of a close strain, or sequences that are known to be included in the assembly, which may help with the assembly process. In this section we discuss the basic de novo assembly process that does not rely on additional reference sequences. In the assembly process the identification of sequencing errors is more difficult than it is when mapping reads to a reference genome. Detection of sequencing errors in the process of finding overlaps and merging them into a consensus sequence is possible if there is enough coverage. This is one of the reasons that a higher coverage is required for de novo assembly compared to applications that use a known reference genome.
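To make the contig-ing idea tangible, here is a toy greedy overlap-and-merge sketch in Python. Real assemblers rely on de Bruijn or string graphs and must handle sequencing errors and repeats, so this is purely didactic; the reads and the minimum overlap length are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (at least min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap into a consensus."""
    reads = list(reads)
    while len(reads) > 1:
        best_overlap, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    o = overlap(a, b, min_len)
                    if o > best_overlap:
                        best_overlap, best_i, best_j = o, i, j
        if best_overlap == 0:            # no remaining overlaps: keep separate contigs
            break
        merged = reads[best_i] + reads[best_j][best_overlap:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

print(greedy_assemble(["ACGTTAGC", "TAGCCGGA", "CGGATTTT"]))   # ['ACGTTAGCCGGATTTT']
```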

3.5.4 Mapping

Available software for this step: BWA [15], Bowtie [16], and TopHat [8]. The process of mapping is done in any application that includes a known reference genome. Each read is mapped to the reference genome separately under the conditions of the


mapping software, as defined by the input parameters. PE reads are each mapped separately and only then is the distance between their mappings measured. The main parameters given to a mapping software deal with the measure of difference between the read and the reference genome. As in many other bioinformatics methods, deciding on the measure of similarity between reads and the reference genome raises the dilemma between sensitivity and specificity: allowing too much difference may result in false-positive mappings, while allowing too little difference may lead to missing true positives. From our experience the best way to decide on the parameters is to try a few values and see how they affect the results. There are two main methods to control the measure of dissimilarity between reads and the reference genome:
1. Number of differences per read—Apply the mapping software with a value that defines the maximum number of allowed differences (mismatches, insertions, and deletions) between the read and the reference genome.
2. Seed mapping—In this method the software looks for a sequence of a certain length inside the read that does not contain differences, or contains a small number of differences, compared to the reference genome. The rest of the alignment is elongated without limiting the number of differences. The parameters given to the software control the seed length, the number of differences allowed in it, and sometimes also the intervals in the reads in which it is searched.
In general, seed mapping is a more permissive approach and is suitable for sequenced strains that are distant from the reference genomes they are mapped to. The first method is stricter and is suitable for strains that are known to be close to the reference genome and when trying to avoid false positives. It should be noted that when using the first method and allowing many differences per read, the results become similar to those obtained with the second method. The sensitivity and specificity can also be tuned by the parameters of each method.
It is important to remember that the way the mapping step is done affects the rest of the analysis. Allowing a low number of mismatches may cause regions in the reference genome that contain many variations compared to the sequenced strain to have little to no coverage. Regions in the genome with little to no coverage may be caused by a few reasons. First, the region is not present in the sequenced strain—the zero coverage implies a deletion compared to the reference genome. Second, the region does exist in the sequenced strain but is not represented in the sequenced library because of a bias caused in the sample preparation process (for example, because of some regions in the genome that are not


sonicated as well as others). Finally, the low coverage may also be caused by allowing too few differences per read for a region in the genome that contains many variations in the sequenced strain compared to the reference genome. Trying to map the reads again with a higher percentage of differences may cause these low-coverage regions to "fill up." After the mapping is done one can choose to use only a partial set of the mappings (a minimal filtering sketch follows this list):
1. Use only uniquely mapped reads: It is very common for initial analyses to use only reads that map to one unique location in the genome. Under the mapping conditions, defined by the parameters, reads may be mapped to more than one location in the genome. In this case, one cannot surely determine where the read has originated from. There are a few approaches to deal with such reads—map them randomly to one of the possible locations, map them to all locations, apply an even amount of coverage to every possible location, etc. Each of these approaches may cause a bias in the results, and can be ignored in the initial analysis by using only the uniquely mapped reads.
2. Use mappings with a minimum mapping score: One can choose to use only mappings of higher quality in order to disregard low-quality mappings that may introduce false positives.
3. Filter mappings with certain insert sizes: PE reads are first mapped separately and only then is the distance between them measured. Long insert sizes or reads that map to different chromosomes may imply structural variations such as large deletions, inversions, and translocations (Fig. 5). One can choose to use only mappings with irregular insert size to find such structural variations or use only mappings with normal insert size for initial variant analysis. BreakDancer is an example of a program that uses PE information to find structural variations [17].
4. Removal of PCR duplicates: PCR amplification is part of the sample preparation, and may introduce bias. PCR duplicates may be identified as reads that map to the exact same location, and in PE reads have the same insert size.
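These filtering choices map directly onto fields of the SAM format: the FLAG (column 2) marks unmapped reads (0x4) and PCR/optical duplicates (0x400), MAPQ (column 5) gives the mapping quality, and TLEN (column 9) the observed insert size of a read pair. The sketch below filters a SAM text file on these fields; the file names and thresholds are arbitrary examples, and real workflows would typically use samtools or a dedicated library for the same purpose.

```python
# Filter SAM records by flag bits, mapping quality, and insert size.
# Standard SAM columns: 1 QNAME, 2 FLAG, 3 RNAME, 4 POS, 5 MAPQ, ..., 9 TLEN.

UNMAPPED  = 0x4
DUPLICATE = 0x400

def keep_record(sam_line, min_mapq=30, max_insert=1000):
    if sam_line.startswith("@"):              # header lines are kept as-is
        return True
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    insert_size = abs(int(fields[8]))
    if flag & UNMAPPED or flag & DUPLICATE:
        return False
    if mapq < min_mapq:                       # low MAPQ often reflects multi-mapping reads
        return False
    if insert_size != 0 and insert_size > max_insert:   # unusually long insert (possible SV)
        return False
    return True

with open("sample.sam") as sam, open("filtered.sam", "w") as out:
    for line in sam:
        if keep_record(line):
            out.write(line)
```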

3.5.5 Variant Calling and Filtering

Available software for this step: SAMtools [18], GATK [19], and MAQ [20]. Based on the mapping done in the previous step, variants can be called by finding the consensus sequence from the mapped reads. The first step in this process is to create a "pileup" file of the mapped reads. This file summarizes for each base in the reference genome the coverage of the reads that are mapped at the loci and the called bases of these reads. Depending on the software that creates the pileup file, more information can be obtained from


it, such as genotype calling, mapping qualities, and p-values. The information in the pileup file can be used to detect and filter variants. The two basic parameters that help detect variants are the following:
(a) Coverage at the locus—The detected variants should rely on sufficient coverage. A minimum number of reads should be set as a threshold for initial filtering.
(b) Frequency of the allele that was sequenced—The variant should have sufficient frequency out of the total reads covering the locus. If one read out of 15 reads covering a locus shows a base different from the reference genome, it may not imply a variant but rather a sequencing error. To find heterozygous variants the frequency should be ~0.5, and for homozygous variants the frequency should be ~1; if pooling was done, then the frequency should match the expected percentage in the pooling. When filtering by allele frequency, taking a margin of safety is recommended, especially if the coverage is low. For example, for heterozygous variants filter by a frequency of 0.4 or 0.3.
The above are two basic parameters for variant filtering, but other parameters can be used as well, for example, the mapping and base qualities at the variant locations.
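A minimal version of these two filters can be written over a per-site summary of the kind a pileup provides. The record layout, thresholds, and example sites below are purely illustrative.

```python
# Classify candidate variant sites by depth and allele frequency.
# Each candidate is assumed to be summarized as (chrom, pos, ref, alt, depth, alt_count).

def classify_variant(depth, alt_count, min_depth=10, het_min=0.3, hom_min=0.8):
    """Return 'hom', 'het', or None for a candidate site."""
    if depth < min_depth:
        return None                      # not enough coverage to call anything
    frequency = alt_count / depth
    if frequency >= hom_min:
        return "hom"
    if frequency >= het_min:             # margin of safety below the ideal 0.5
        return "het"
    return None                          # more likely a sequencing error than a variant

candidates = [
    ("chr1", 10177, "A", "C", 35, 18),   # ~0.51 -> heterozygous
    ("chr1", 20763, "G", "T", 40, 39),   # ~0.98 -> homozygous
    ("chr2",  5012, "T", "A", 15,  1),   # ~0.07 -> discarded
    ("chr3",   880, "C", "G",  4,  4),   # depth too low -> discarded
]
for chrom, pos, ref, alt, depth, alt_count in candidates:
    print(chrom, pos, ref, ">", alt, classify_variant(depth, alt_count))
```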

3.5.6 Assembling Transcripts

In strains that do not have full or sufficient gene annotations, novel annotations can be found by HTS. The idea is to sequence mRNA, map the reads to the reference genome, and infer transcripts from the detected bundles of reads at certain loci. Based on these annotations a gene expression analysis can then be done. In principle, one can assemble the whole genome before performing an RNA-Seq experiment, or assemble the transcriptome only, in an application called "de novo RNA" [21] (or a combination of both).

3.5.7 Gene Expression Analysis

Available software for this step: Cufflinks [8] and Myrna [22]. After mapping the reads to the reference genome an assessment of their abundance can be made using the gene annotations. In general, the number of reads that overlap each gene is counted. The raw count must be normalized for further analysis. A common normalization method is called Fragments Per Kilobase Million (FPKM) and is calculated as follows:

FPKM = raw count / (gene length × number of mapped reads in millions)

The normalization takes into account the gene's length, to avoid a bias toward higher expression in longer genes. FPKM also takes into account the total number of mapped reads in each sample, to avoid a bias due to differences in the number of reads between samples.
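Written out directly, the normalization looks as follows; as the name of the measure implies, the gene length is taken in kilobases and the library size in millions of mapped reads. The gene lengths and counts below are placeholders.

```python
# FPKM normalization following the formula above.
def fpkm(raw_count, gene_length_bp, total_mapped_reads):
    gene_length_kb = gene_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return raw_count / (gene_length_kb * mapped_millions)

total_mapped = 25_000_000                  # mapped reads in this sample
genes = {                                  # gene -> (length in bp, raw read count)
    "short_gene": (1_000, 500),
    "long_gene": (10_000, 500),            # same raw count, but a 10x longer gene
}
for name, (length_bp, count) in genes.items():
    print(name, round(fpkm(count, length_bp, total_mapped), 2))
# short_gene 20.0, long_gene 2.0: the longer gene gets a 10x smaller FPKM for the
# same raw count, which is exactly the length bias the normalization removes.
```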



Fig. 9 An example of differential expression in the transcript abundances of a gene. If only the gene’s expression level is calculated the expression does not appear to change over time; yet a separate calculation of each transcript’s expression level shows a different molecular event

A basic approach to gene expression is to count all the reads that map to a gene's annotation, normalize them, and set this value as the expression level of that gene. If a gene has more than one transcript due to alternative splicing, not separating the reads that map to it among the different transcripts can cause a great bias and change the results entirely (Fig. 9). Finding the expression levels of different transcripts of the same gene is challenging, since reads that map to exons that belong to more than one transcript cannot be unambiguously correlated to one transcript [8]. The software Cufflinks [8] attempts to assess transcript expression levels by using the reads that can be unambiguously correlated with certain exons to infer the expression of all the reads (Fig. 10). Cufflinks' algorithm uses maximum likelihood to assess the abundance of each transcript.

3.5.8 Peak Detection

Available software for this step: MACS [23, 24] and SICER [25]. A ChIP-Seq experiment is done to detect enriched regions in the IP sample. These regions, called "peaks" or "islands," should be significantly higher than both their surroundings in the IP sample and the same loci in the control sample. The peaks are found by statistical modeling of the enriched regions compared to the control. There are two important parameters for peak detection: the abundance in the genome and the width of the binding sites. We introduce two programs for peak detection, each addressing binding sites with different abundance and width characteristics. MACS is more suitable for narrow peaks that represent short and specific binding sites, for example, of transcription factors. SICER is more suitable for wide peaks that extend over thousands of base pairs;


Fig. 10 Assessing transcript abundance. Since 10 reads undoubtedly originated from transcript 1, it may be inferred that 90 reads from each shared exon originated from transcript 2 while 10 reads from each shared exon originated from transcript 1

these peaks are typical for histone modification experiments, in which many closely spaced binding events occur across the genome. Their proximity to each other makes the peaks merge into wide enriched regions rather than short and sharp peaks.
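As a very reduced illustration of the statistical idea behind peak calling, the sketch below scans fixed-size windows, models the expected background in each window from a scaled control track with a Poisson distribution, and reports windows whose IP count is improbably high. Real peak callers such as MACS add local background estimation, fragment-shift modeling, and multiple-testing correction, none of which is shown; the counts and cutoff are invented.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))

def call_peaks(ip_counts, control_counts, p_cutoff=1e-3):
    """ip_counts / control_counts hold reads per fixed-size window along a chromosome."""
    scale = sum(ip_counts) / max(sum(control_counts), 1)   # scale control to IP library size
    peaks = []
    for window, (ip, ctrl) in enumerate(zip(ip_counts, control_counts)):
        expected = max(ctrl * scale, 1.0)                  # expected background in the window
        p = poisson_sf(ip, expected)
        if p < p_cutoff:
            peaks.append((window, ip, round(expected, 1), p))
    return peaks

ip_track      = [5, 6, 4, 60, 7, 5, 3, 40, 6, 5]
control_track = [5, 5, 5,  6, 6, 5, 4,  5, 5, 5]
for window, ip, expected, p in call_peaks(ip_track, control_track):
    print(f"window {window}: {ip} IP reads (expected ~{expected}), p = {p:.2e}")
```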

4 Notes

4.1 Tuning Up the Pipelines

The pipelines detailed above are general. It is crucial to examine each project specifically and decide what pipeline is best suited for it. Tuning up the parameters of each step in the pipeline may be vital for accurate results. Tuning up the pipeline and parameters can be done by following the general pipeline presented above and conducting quality control measurements after each step, to allow identifying a phenomenon that might offer some insight or require special action in the analysis.

Quality control measurements should be done after each step in the analysis. After the raw data handling, a quality control step is done as detailed above. After the mapping step, the mapping statistics should be assessed. How many reads were not mapped, uniquely mapped, and multi-mapped? High values in the first and third parameters may indicate a problem. What does the coverage profile look like? What percentage of the genome is covered with sufficient coverage, and what is the average coverage? For exome projects, what is the coverage over exons? It is highly recommended to look at the mappings in a genome viewer. Some phenomena can be detected more easily visually (Fig. 11).

Tuning up the parameters in each step of the analysis allows one to control the balance between sensitivity and specificity. For example, if we allow one mismatch per 50 bp read in the mapping step, it will reduce the rate of incorrect mappings, but we will not be able to


Fig. 11 Four bacteria samples were sequenced and mapped to the same reference genome. The mapping statistics of all of them showed that 96–98 % of the reads were unmapped. Viewing them on a genome viewer reveals a different phenomenon in each sample. (a), (b) Only 2–3 % of the reads are of the expected strain, while the rest are contamination. This can be seen by the high and continuous coverage and lack of variants. (c), (d) These sequenced samples seem to be evolutionarily distant from the reference genome, as can be seen by the low and segmented coverage and many variants

detect a 2-base indel or areas in the genome that have more than one variant per 50 bp; the coverage in these regions will be low or zero because the reads cannot be mapped. Another example comes from gene expression analysis: when comparing gene expression between two samples one can choose to statistically test only genes that have a minimum number of reads mapped to them in at least one sample [8]. Choosing a high threshold may cause interesting genes to be missed, but choosing a low threshold may include genes whose differential expression is not significant—a gene can show a fold change of 5 whether the ratio between the samples is 1 read vs. 5 reads or 1,000 reads vs. 5,000 reads.


4.2 Diving into the Technical Details: File Formats

In this section we overview the formats of some basic files used in HTS data analysis. Though not all useable formats are mentioned here, this section provides a general idea of how the files used in the analysis are constructed, as their structure is similar and the same concepts generally apply. All the files we present in this section and most of the files used for HTS analysis are plain text files and usually tab delimited, which enables easy management by various tools and scripts.

4.2.1 Fastq: Raw Read Format (Fasta + Quality)

A fastq file is constructed out of quadruplet lines (Fig. 12), each quadruplet representing a read and containing the following information:
1. Read identifier—PE reads will have the same identifier. The read's identifier is unique and is constructed in the following way (CASAVA 1.8.2):
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

P_n(c) = (h_n^1 + 1) / (t_n + 3), for c = p_n^1;
P_n(c) = (h_n^2 + 1) / (t_n + 3), for c = p_n^2;
P_n(c) = [(1 − P_n(p_n^1) − P_n(p_n^2)) / (1 − θ(p_n^1) − θ(p_n^2))] · θ(c), for c ≠ p_n^1, p_n^2.   (2)


Reference sequence


>SEQ ID AAAGGATAGGTAACGATATTCCTAG. . .

Target sequence

> SEQ ID AGGATAGGTAacgGTATTccta?. . .

Fig. 2 The copy model: when at position 10 of the target sequence (in green), the copy model was restarted to position 12 of the reference sequence (in green). Apart from the character at position 14 in the target sequence (in red), it has since correctly predicted 12 characters, if the case is ignored, or 5 characters (in upper case), if case is considered. The copy model is currently predicting that the character in position 23 of the target sequence (in blue) should be a G (in blue in the reference sequence)

However, if only p_n^1 or p_n^2 belongs to C, then the probabilities are given by

P_n(c) = (h + 1) / (t_n + 2), for c = p;
P_n(c) = [(1 − P_n(p)) / (1 − θ(p))] · θ(c), for c ≠ p,   (3)

(4)

Usually, the codec starts by constructing a hash table with the occurrences and corresponding positions in the reference sequence of all k-mers of a given size (the default size is k ¼ 11, but it can be changed using a command line option). Using this hash table, it is easier to find in the reference sequence the characters that come right after all occurrences of a given k-mer. Before encoding a new character from the target sequence, the performance of the copy model, if in use, is checked. If tn  hn1  hn2 > mf , where mf is a parameter that indicates the maximum number of prediction failures allowed, the copy model is stopped. The default value for mf is zero, but this may be changed through a command line option. Following this performance check, if the copy model is not in use, an attempt is made to restart the copy model before compressing the character. This is accomplished by looking for the positions

34

Armando J. Pinho et al.

in the reference sequence where the k-mer composed of the k-most-recently-encoded characters occurs. If more than one position is found, the one closest to the encoding position is chosen. If none is found, the current character is encoded using the static model and a new attempt for starting a new copy model is performed after advancing one position in the target sequence.

4

Notes 1. If this flag is present, the encoder prints the alphabets of the reference and target sequences. It also indicates how many times it encountered each character. 2. Changing this parameter might affect the compression performance. During normal operation (i.e., when not in “equal size mode”), the encoder often searches for the positionally closest k-mer in the reference sequence that matches the most recently encoded k-mer of the target sequence. The idea is to start a copy model that will predict the following characters with high certainty. If the target sequence is similar to the reference, then values of k greater than 11 might produce higher compression, because in this case it is easier to find the correct location where the repetition occurs in the reference sequence. On the other hand, if the two sequences differ considerably, then it is usually better to leave this parameter unchanged. Note that using greater k values also implies using more memory. 3. When the target and reference sequences are aligned, i.e., when the most probable character in the target sequence is, on average, the one located at the same position in the reference sequence, then there is no need to build the hash table. In fact, the hash table is used for speeding up the process of finding k-mer matches and is overkilling if the sequences are aligned, because it claims unnecessary memory and time. This special mode is selected if the reference and target sequences have the same size. However, there are situations where, despite the sequences having equal size, they are not aligned and, therefore, this mode would produce poor compression results. This flag turns off this special mode, forcing the use of the hash table. 4. This parameter is also directly related to the encoding performance of the method and may be adjusted according to the degree of similarity between the reference and target sequences. Basically, it controls how many failures we allow the copy model predictor to have before it is declared useless and is replaced by another. For target sequences that are reasonably similar to the reference sequence, it is safe to use a value greater than the default (zero), because, despite some prediction failures due, for example, to SNPs, it is expected that a given copy model is useful for long segments. When the target and reference

Compressing Resequencing Data with GReEn

35

sequence are considerably dissimilar, then it is better to replace the copy model as soon as it starts failing the predictions. 5. This note applies both to the reference and target sequences. It also applies to the reference sequence used by the decoder. The files can be plain text or compressed by “gzip.” FASTA files are handled by ignoring all lines starting with the “ > ” character. All new line characters are also ignored. All other characters are considered as belonging to the sequence to be encoded. 6. The output file will contain only the characters that have been considered by the encoder as part of the sequence. Therefore, FASTA formatting is lost. Also, the output file will be made of a single line, i.e., without line breaks. If this option is not present, the decoder will run normally, but without producing an output file. 7. The probabilities of the symbols may change along the message. In fact, this is what makes arithmetic coding powerful, because the statistical model of the source can be continuously updated as encoding proceeds. The only requirement is that the encoder and decoder must be synchronized, i.e., for a given position in the message, the statistical model has to produce exactly the same probability distribution both at the encoder and at the decoder. 8. In practice, arithmetic coding cannot be implemented using directly the procedure that we have described, because the floating point precision would be exhausted quickly. The first practical arithmetic coding algorithm was introduced by Rissanen in 1976 [19]. Further detailing how arithmetic coding is implemented is out of the scope of this chapter. Interested readers can obtain more details in e.g. [20]. 9. GReEn does not compress the reference sequence. If needed, it can be compressed using a general purposed compression program, such as “gzip,” or a specialized DNA compression tool [11, 17]. 10. Note that, whereas the θ(c) values are fixed for a given target sequence, the Pn(c) values usually change along the coding process. Moreover, the θ(c) values are the only ones that are known in advance to both the encoder and the decoder. Therefore, they are communicated to the decoder before decoding starts. All other statistical information related to the target sequence is collected as encoding (decoding) proceeds, i.e., the process is causal. Hence, for a certain position n in the target sequence, the encoder can use only past information for inferring the statistics. 11. Note that characters of the reference sequence that do not appear in the target sequence do not belong to C.

12. The first two branches of Eq. 2 correspond to Laplace probability estimators of the form

P(E_k) = (N_{E_k} + 1) / (Σ_{k=1}^{K} N_{E_k} + K),   E_k ∈ C,

where the E_k form a set of K collectively exhaustive and mutually exclusive events, and N_{E_k} denotes the number of times that event E_k has occurred in the past. In Eq. 3 we considered three events, namely, E_1 = {p_{n1}}, E_2 = {p_{n2}}, and E_3 = C \ {p_{n1}, p_{n2}}. The third branch of Eq. 3 defines how the probability assigned to E_3, i.e., 1 − P(E_1) − P(E_2), is distributed among the individual characters of E_3. This distribution is proportional to the relative frequencies of the characters, θ(c), after discounting the effect of treating p_{n1} and p_{n2} differently.
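As a rough illustration of the estimator in Note 12, the following Python sketch computes Laplace-smoothed probabilities from event counts. The three-event setup and the counts are illustrative only and are not taken from GReEn.

```python
def laplace_probabilities(counts):
    """Laplace (add-one) estimator: P(E_k) = (N_{E_k} + 1) / (sum_j N_{E_j} + K)."""
    k = len(counts)                   # number of mutually exclusive events
    total = sum(counts.values()) + k  # denominator of the estimator
    return {event: (n + 1) / total for event, n in counts.items()}

# Toy example with three collectively exhaustive, mutually exclusive events,
# loosely mirroring the roles of E_1, E_2, and E_3 in Note 12.
probs = laplace_probabilities({"E1": 57, "E2": 3, "E3": 0})
print(probs)  # every event receives a nonzero probability, even E3 with zero counts
```

Note that, as in the chapter, the add-one smoothing guarantees that events not yet observed still receive a small, nonzero probability, which is essential for arithmetic coding.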


Acknowledgments

This work was partially funded by FEDER through the Operational Program Competitiveness Factors—COMPETE and by National Funds through FCT—Foundation for Science and Technology in the context of the projects FCOMP-01-0124-FEDER-010099 (FCT reference PTDC/EIA-EIA/103099/2008) and FCOMP-01-0124-FEDER-022682 (FCT reference PEst-C/EEI/UI0127/2011). Sara P. Garcia acknowledges funding from the European Social Fund and the Portuguese Ministry of Education and Science.

References

1. Grumbach S, Tahi F (1993) Compression of DNA sequences. In: Proceedings of the data compression conference, DCC-93, Snowbird, pp 340–350
2. Rivals E, Delahaye J-P, Dauchet M, Delgrange O (1996) A guaranteed compression scheme for repetitive DNA sequences. In: Proceedings of the data compression conference, DCC-96, Snowbird, p 453
3. Loewenstern D, Yianilos PN (1997) Significantly lower entropy estimates for natural DNA sequences. In: Proceedings of the data compression conference, DCC-97, Snowbird, March 1997, pp 151–160
4. Matsumoto T, Sadakane K, Imai H (2000) Biological sequence compression algorithms. In: Dunker AK, Konagaya A, Miyano S, Takagi T (eds) Genome informatics 2000: proceedings of the 11th workshop, Tokyo, pp 43–52
5. Chen X, Kwong S, Li M (2001) A compression algorithm for DNA sequences. IEEE Eng Med Biol Mag 20:61–66
6. Chen X, Li M, Ma B, Tromp J (2002) DNACompress: fast and effective DNA sequence compression. Bioinformatics 18(12):1696–1698
7. Manzini G, Rastero M (2004) A simple and fast DNA compressor. Softw Pract Exp 34:1397–1411
8. Korodi G, Tabus I (2005) An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Trans Inform Syst 23(1):3–34
9. Behzadi B, Le Fessant F (2005) DNA compression challenge revisited. In: Combinatorial pattern matching: proceedings of CPM-2005. LNCS, vol 3537. Jeju Island, June 2005. Springer-Verlag, New York, pp 190–200
10. Korodi G, Tabus I (2007) Normalized maximum likelihood model of order-1 for the compression of DNA sequences. In: Proceedings of the data compression conference, DCC-2007, Snowbird, March 2007, pp 33–42
11. Cao MD, Dix TI, Allison L, Mears C (2007) A simple statistical algorithm for biological sequence compression. In: Proceedings of the data compression conference, DCC-2007, Snowbird, March 2007, pp 43–52
12. Giancarlo R, Scaturro D, Utro F (2009) Textual data compression in computational biology: a synopsis. Bioinformatics 25(13):1575–1586
13. Pinho AJ, Neves AJR, Afreixo V, Bastos CAC, Ferreira PJSG (2006) A three-state model for DNA protein-coding regions. IEEE Trans Biomed Eng 53(11):2148–2155
14. Pinho AJ, Neves AJR, Ferreira PJSG (2008) Inverted-repeats-aware finite-context models for DNA coding. In: Proceedings of the 16th European signal processing conference, EUSIPCO-2008, Lausanne, August 2008
15. Pinho AJ, Neves AJR, Bastos CAC, Ferreira PJSG (2009) DNA coding using finite-context models and arithmetic coding. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP-2009, Taipei, April 2009, pp 1693–1696
16. Pinho AJ, Pratas D, Ferreira PJSG (2011) Bacteria DNA sequence compression using a mixture of finite-context models. In: Proceedings of the IEEE workshop on statistical signal processing, Nice, June 2011
17. Pinho AJ, Ferreira PJSG, Neves AJR, Bastos CAC (2011) On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 6(6):e21588
18. Pinho AJ, Pratas D, Garcia SP (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 40(4):e27
19. Rissanen J (1976) Generalized Kraft inequality and arithmetic coding. IBM J Res Dev 20(3):198–203
20. Sayood K (2006) Introduction to data compression, 3rd edn. Morgan Kaufmann, San Francisco

Chapter 3

On the Accuracy of Short Read Mapping

Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, and Anders Krogh

Abstract

The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments the first thing to be done is finding the genomic origin of the reads, i.e., mapping the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. In the end, we discuss the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program.

Key words: Mapping, Short reads, High-throughput sequencing

1 Introduction

1.1 The Read Mapping Problem

Because of the tremendously increased throughput of current sequencing technologies, we are often faced with hundreds of millions of short reads for a single experiment. In most experiments, we want to know the genomic origin of those reads and thus we need to align them to a reference genome. This step is called mapping, because the reads are usually almost identical to their origin in the genome apart from errors introduced by the experimental protocol or the sequencing technology.

Peter Menzel and Jes Frellsen have contributed equally.


The problem of mapping a read to a genome is an instance of the standard pairwise alignment problem. However, old and proven alignment methods, such as BLAST [1], are almost useless for this task—they are just too slow—and therefore much faster "next-generation" alignment programs are required. Many new algorithms and programs (called mappers) have been developed, which gain their speed from an advanced indexing of the reference genome or the reads and from the assumption of high similarity between the reads and the genome. However, while the success of BLAST was due to a careful consideration of the statistical significance of a match, many of these mappers lack sophistication in the statistical interpretation of the results. This leaves researchers in the unsatisfying situation of not knowing how trustworthy the mappings produced by the various programs really are, especially in the case of very short reads. In this chapter we:

- Introduce the current state of short read mapping, exemplified on real data sets.
- Present a statistical analysis on the probabilities of short read mapping with mismatches.
- Discuss how to make best use of the available base quality scores in sequencing data in a probabilistic model for read mapping.

In this chapter, we focus on the human genome, although the general considerations are valid for any genome. Most problems with wrongly mapped reads are expected to occur for short reads, that is, reads shorter than 30–40 nt, which is below the maximum read length of several modern sequencing platforms. However, many types of experiments produce short sequences, such as short RNA sequencing and fragmented ancient DNA, or short tags coming from CAGE and ChIP-Seq. Thus, even with an increase of read length in the sequencing instruments, the task of accurately mapping short sequences remains.

1.2 The Current State of Short Read Mapping

Sequencing machines introduce distinct error patterns and experimental protocols add sequential noise from adapters, barcodes, and other sources. On top of that, a mismatch in a read compared to the reference genome can be due to the type of experiment or biological variation like SNPs, RNA editing, and DNA damage. All these error sources result in a certain divergence between the read sequence and its genomic origin, which has to be accounted for by the mapping program. If a read maps to several locations in the genome, we ideally want to know which of these locations is the true origin. Thus, the mapper uses some criterion for this decision or, if it has multiple indistinguishable mappings, reports a read as multiple mapped. For example, the mappers BWA [2], MAQ [3], SOAP2 [4], and Bowtie2 [5] report possible alternative matches.


Often the mapping of a short read is accepted if it is unique, and nonunique matches are discarded. This view is problematic because it is an insufficient criterion for accuracy and it does not allow for a detailed assessment of the quality of the mapping. In this context, a unique match means a match with fewer mismatches and insertions/deletions (indels) than any other possible match in the genome. More precisely, a score or a distance is calculated for a match by summing integer numbers for matches/ mismatches and indels. Once a score or a distance is defined, a unique match is defined as one which is better than any other match, meaning that it has a higher score or a lower distance. One scheme (a distance) is to count the number of mismatches and add a penalty (e.g., 2) for each indel. Another scheme (a score) is to add a number for a match and subtract penalties for mismatches and indels. These schemes do the same: the fewer mismatches and indels, the better; and the main challenge is setting the relative cost of an indel versus a mismatch. BWA calculates a distance between read and genome by using a default penalty of 3 for a mismatch, and 11 for gap open and 4 for gap extension. Bowtie2 uses a scoring scheme, which depends on quality scores (see below). Therefore, the “degree of uniqueness” can be quantified by a mapping quality score as discussed later. When comparing mappers, their performance is usually assessed using simulated data, where the correct location of the read is known in advance (e.g., [6]). On real data, it is difficult to assess the mapping accuracy, because the correct origin is unknown. We can, however, measure the difference between mapping programs. If they differ in their results, it indicates that some reads are wrongly mapped by at least one of the programs. We decided to compare BWA and Bowtie2 on several publicly available data sets. We selected data sets from various types of experiments in order to investigate the differences in the performance of mappers according to the type of data used. The different data sets had different lengths (ranging from 36 to 70 nt), which also allowed us to compare the performance of the alignment methods on different length ranges. In all cases, the sequencing was done on an Illumina Genome Analyzer platform. Before mapping, we removed barcodes and adapter sequences, if necessary, and trimmed low-quality nucleotides and unknown nucleotides (N) at the ends of the reads. See Note 3 for the accession numbers and the detailed mapping protocol. The different data sets used in our mapping example are of the following types: Ancient DNA sequencing is a type of sequencing where specialized techniques and protocols are used to extract DNA fragments from ancient tissue samples, which have characteristic damage patterns in the DNA [7] and can have large variation in fragment length. Contamination from other species is another important aspect for


the mapping problem [8]. This particular sample [9] is from a hair extract and is quite clean with very low levels of damage and contamination.

Cap Analysis of Gene Expression (CAGE) is a technology developed to identify transcription start sites (TSS) and measure gene expression. In the CAGE protocol, mRNAs are reversely transcribed to produce cDNAs. Those cDNAs that reach the 5′-end of the gene are selected by cap-trapping and small fragments of ~20 nt from the beginning of the gene are produced by using specific restriction enzymes [10].

Small RNA sequencing is used to measure the expression of microRNAs and other types of small RNAs in a sample. RNA is purified and only fragments with a length around 18–30 nt are selected for sequencing. Some of these RNAs can undergo modifications such as addition of nucleotides in the 3′-end [11], which increase the mapping difficulty.

CLIP-Seq was developed to identify the binding sites of RNA-binding proteins in RNAs. Protein–RNA complexes are UV-cross-linked and immunoprecipitated with specific antibodies. The cross-linking process can introduce errors in the reads such as deletions due to the cross-linking protocol [12].

ChIP-Seq is a similar technique to CLIP-Seq and it is used to study the binding of proteins to the DNA. This technique has been applied to identify the binding of proteins that recognize specific sequences in the DNA such as transcription factors or to characterize the location of proteins more loosely associated to the DNA such as Pol-II or histones.

To compare the performance of the two mappers, we calculated the overlap in the classification of reads (as uniquely mapped, multiple mapped, or unmapped). The overlaps for the five data sets are shown in Fig. 1a–e. For comparison, we also show the results obtained on simulated reads (Fig. 1f). Each of the bubble plots represents the agreement between the mappers for a given data set. Each cell shows the percentage of reads with a particular classification by Bowtie2 and BWA (e.g., in Fig. 1a: 5.01 % of reads were classified as uniquely mapped by Bowtie2 and as multiple mapped by BWA). The size of the bubbles represents these percentages. The diagonal (from top left to bottom right) shows the agreement between the two programs, whereas all the other cells show the disagreement. The first thing that we notice is that for different data sets and mappers the amount of reads that can be mapped to the genome varies from 71 to 97 %. This number is reduced to ~50 % in some cases if we only consider uniquely mapped reads. Using simulated data, less than 1 % of the reads are missed and up to 84 % of the reads can be uniquely mapped. When we look at the agreement between the two mappers, we see that it is ranging from 80 % in small RNA-Seq (Fig. 1c) to 98 % in the case of


Fig. 1 Bubble plots showing the overlap of uniquely mapped, multiple mapped, and unmapped reads between Bowtie2 and BWA in real (a)–(e) and simulated (f) data. The numbers in each cell show the percentage of reads with a particular classification given the mapping of Bowtie2 (rows) and BWA (columns). The size of each bubble represents the percentage of reads in the cell

ChIP-Seq (Fig. 1e). Interestingly, the agreement in the reads that are mapped uniquely varies from 47 % in ancient DNA sequencing to 80 % in CAGE (Fig. 1a, b). We also have to consider that if a read is reported as uniquely mapped by the two programs it does not mean that it is mapped to the same location in the genome. In the majority of the data sets the location in the genome of the uniquely mapped reads is the same in more than 99 % of the cases. But this may not always be the case, as for instance in the ancient DNA data set, ~8 % of uniquely mapped reads are mapped to different locations by the two programs.
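For readers who want to reproduce this kind of comparison, the sketch below shows one rough way to classify primary records in a SAM file as unmapped, uniquely mapped, or multiple mapped. It is not the exact procedure used for Fig. 1; in particular, treating a mapping quality of 0 as "multiple mapped" is only a heuristic, and the tags that report alternative hits differ between mappers.

```python
import sys
from collections import Counter

def classify_sam(path):
    """Roughly classify primary SAM records as unmapped / unique / multiple."""
    counts = Counter()
    with open(path) as sam:
        for line in sam:
            if line.startswith("@"):          # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, mapq = int(fields[1]), int(fields[4])
            if flag & 0x100 or flag & 0x800:  # ignore secondary/supplementary records
                continue
            if flag & 0x4:                    # segment unmapped
                counts["unmapped"] += 1
            elif mapq == 0:                   # heuristic: MAPQ 0 often marks ambiguous placements
                counts["multiple"] += 1
            else:
                counts["unique"] += 1
    return counts

if __name__ == "__main__":
    print(classify_sam(sys.argv[1]))
```

Running such a classification on the output of two mappers and cross-tabulating the labels per read gives exactly the kind of overlap table visualized in Fig. 1.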

2 Materials: Statistics of Read Mapping

In our example, we saw that the two mappers disagree on some fraction of the reads, and for some data sets the disagreement is quite substantial. This may of course be due to subtle differences in the algorithm or scoring scheme used by the mappers, but it may also be due to some reported matches simply being wrong.


In this section we discuss in more detail the problem of wrongly mapped reads by giving some simple examples, showing the statistics, and elucidating the problem with simulated reads.

2.1 Unique Matches Are Not Necessarily Correct

We are going to start with a very simple example showing that a unique match is not necessarily the correct match. Imagine that we have an Illumina read originating from an Alu repeat. This is a common situation, since there are around one million Alu repeats in the human genome [13]. It happens that this read has a perfect unique match in the genome, i.e., a single match with no mismatches. In this example, we also assume that there are a relatively large number (n) of matches in the genome for this read that have one mismatch. Now, we want to know the probability that the one perfect match is correct. It is lower than one would probably expect! Let us assume that there is some probability p that there is either a sequencing error or a polymorphism at any given base (ignoring the fact that the quality depends on the position in the read). With a few other simplifying assumptions we can show (see Note 1) that the probability for a perfect match being correct is approximately equal to 1/(1 + np/3). For example, if there is a 2 % error rate (p = 0.02) and there are n = 150 genomic matches with exactly one mismatch, then there is a probability of 50 % that the perfect match is correct. If n = 1,000, it is as low as 13 %.

Whether the above example is realistic or not depends entirely on the repeat structure of the genome in question. The human genome, on which we focus in this chapter, is quite repetitive, but the repeat structure is complex ranging from ancient duplications over transposable elements to simple repeats or low-complexity regions. It is therefore difficult to theoretically quantify the impact of genomic repeats on the mapping problem, and therefore we address it empirically.

What is the probability that a unique match is wrong? To address this question we generated a large number of reads from the human genome by randomly sampling short sequences from all chromosomes and adding errors. They were generated with different lengths and different error rates. In order to simplify the example, the error rates were independent of the position in the read and neither insertions nor deletions were introduced. We then mapped the reads back to the genome using exhaustive mapping with up to three mismatches and without insertions and deletions. This means that no heuristics were used in the mapping and all possible matches in the genome were found. As can be seen from Fig. 2, a considerable fraction of the reads have unique matches to an incorrect position in the genome. The phenomenon is particularly pronounced for short reads and the largest error is observed for reads of length 18, where nearly 10 % of the reads with error rate 0.05 are uniquely mapped to wrong positions. Even for reads of length 50, more than 1 % of the reads


Fig. 2 Fraction of reads from the human genome that map uniquely to an incorrect position in the human genome. We generated three sets of reads from the human genome (hg19), with position-independent error rates of 0.01, 0.02, and 0.05. Every set contains one million reads for each of the lengths: 14, 16, 18, 20, 22, 30, 50, 75, and 100. The reads were mapped to the human genome using exhaustive mapping with up to three mismatches and without insertions and deletions. For every set, the fraction of all reads that have unique matches and are mapped back to an incorrect position in the human genome is shown as a function of the read length. The reads of length 75 and 100 with error rate 0.05 were also mapped using exhaustive mapping with up to six mismatches; these results are shown with grey dashes

with error rate 0.05 are uniquely and wrongly mapped. Although the rates are small for long reads, it is surprising that more than 1 in 1,000 reads of length 100 are uniquely, but wrongly, mapped. This example clearly illustrates that although all reads are originating from the human genome and match a position uniquely, there is a considerable risk that this position is wrong for short reads. The fact that the fraction of wrongly mapped reads drops faster for 5 % error rate than for 2 % is caused by the reads having more mismatches than accepted in the mapping, as can be seen from the dashed line where we allow up to five mismatches.

For very short reads, there can also be multiple random matches even in a non-repetitive genome. The chance of such a random match is another important consideration, also relating to contamination in the sample, etc. This is the subject of the next section.

2.2 Significance of a Match

In all sequencing experiments some of the reads originate from DNA, which is not present in the reference genome. This may be contamination from other sources of DNA, such as primer/adapter sequences and carrier DNA, or contamination from other species. It may also be regions in the sample DNA which are not present in the reference genome because they have not been sequenced or because of differences between individuals or strains. In metagenomics,


for instance, reads may be mapped to all known microbial genomes, but it is likely that a large fraction of the DNA does not originate from any of these. While sample contamination is likely to be most common, one should also keep in mind that a contamination of the reference sequence is also plausible due to misassembly [14]. Therefore, one might expect that some fraction of the reads should not be mapped to the reference genome(s); and if they are mapped, it is a source of error. Here we consider the chance that such foreign sequence maps to a genome.

This problem is not a new one. The E-value returned by the BLAST program is an estimate of the number of random matches expected in a database of the same size as the reference database. Fortunately, we can estimate the probability that a match is random under reasonable assumptions. Let us assume that we are mapping reads of length l, which are randomly composed of the four bases with equal probability 1/4. Then the probability of a match to a specific location with exactly k mismatches is given by the binomial distribution

C(l, k) · (1/4)^{l−k} · (3/4)^{k} = f(k; l, 1/4),   (1)

where C(l, k) denotes the binomial coefficient. The probability of a match with up to m mismatches is then just the sum F(m; l, 1/4) = Σ_{k=0}^{m} f(k; l, 1/4). Assuming independence between positions in the genome, the expected number of matches in a genome of length L is

E(m, l, L) = L · F(m; l, 1/4).   (2)

This is the E-value for short read mapping under these assumptions. Note that what we call the length L is really the number of possible match positions, and since a match can be on both strands, it is twice the length of the genome. One can show that the corresponding probability of a random match (the p-value) is well approximated by (see Note 2)

p(m, l, L) = 1 − e^{−E(m, l, L)}.   (3)

For L = 6 × 10^9, which corresponds to a random "genome" with three billion bp (approximately the same size as the human genome), this probability is shown in Fig. 3 as a function of read length and for a few different values of m. Notice that the read length should be around 28 nt to map with a p-value below 0.01 when allowing up to three mismatches. For shorter reads, such as miRNA-sized reads or CAGE tags, it is probably wise to only accept perfectly matching reads, for which we get a p-value of less than 0.01 at read length of at least 20 nt. In the calculations above we assumed independence of positions in the genome, which will not hold in a real repetitive genome. However, the p-value is a good indication of what to expect when



Fig. 3 The probability of a random match. The full lines show the probability of a random match, given by equation (3), as a function of the read length for L = 6 × 10^9, which corresponds to a genome of three billion bp. Individual lines are shown for 0, 1, 2, and 3 mismatches in red, blue, orange, and green, respectively. The crosses are obtained by generating one million errorless reads from the E. coli genome for each of the read lengths: 14, 16, 18, ..., 30. These reads were mapped to the human genome (hg19) using exhaustive mapping with up to three mismatches and without insertions and deletions. The red crosses show the fraction of reads that have a match with up to 0 mismatches for each read length, while the blue, orange, and green crosses show the fractions for up to 1, 2, and 3 mismatches, respectively

dealing with DNA of homogeneous composition. For genomes with nonuniform base composition, such as high or low GC content, the approximation of the p-value will be worse in the sense that longer reads are required in order to have significant matches. It is also of interest to ask for the probability of obtaining a random unique match. Although this can be calculated for random sequences, here it suffices to observe that if the probability of obtaining any match in a random sequence is small, then almost all matches will be unique. We can illustrate the problem of obtaining random matches by mapping reads from the E. coli genome (GenBank acc. FM180568) to the human genome (hg19). For practical purposes we can consider these two genomes to be unrelated, and accordingly all matches will be random. From the E. coli genome we sampled a large amount of errorless reads with different lengths, and mapped the reads to the human genome, using exhaustive mapping with up to three mismatches. The fraction of these reads with at least one match in the human genome is shown in Fig. 3. There is a remarkable agreement between these points and the theoretical lines in the logarithmic plot. This is partially coincidental, and a close


Fig. 4 Fraction of reads from E. coli that have unique matches in the human genome. The fraction of all reads that map uniquely to the human genome is shown as a function of the read length. Individual curves are shown for matches with up to, respectively, 0, 1, 2, and 3 mismatches. The same reads and procedure as in Fig. 3 were used in this figure

inspection shows deviations that are due to dependencies between positions in the human genome, and perhaps real similarities between the two genomes. The fraction of E. coli reads that have unique matches in the human genome is shown in Fig. 4. Initially, this fraction increases with read length, because for very short reads there are so many random matches that it is unlikely to find just one unique match. The curves peak at read lengths 15–22 depending on the number of mismatches allowed. From the maximum they decrease towards zero and as the fraction of mapped reads becomes smaller, the curves approximately coincide with the curves in Fig. 3, as we discussed above. For instance, we observe that more than 25 % of all the length 22 nt reads match the human genome uniquely with up to three mismatches.
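The quantities in Eqs. 1–3 are straightforward to evaluate directly. The short Python sketch below reproduces the kind of numbers plotted in Fig. 3; the read lengths and mismatch counts are illustrative, and a uniform base composition is assumed, as in the text.

```python
from math import comb, exp

def match_prob(l, m):
    """F(m; l, 1/4): probability of a match with up to m mismatches at one position."""
    return sum(comb(l, k) * (1 / 4) ** (l - k) * (3 / 4) ** k for k in range(m + 1))

def e_value(l, m, L=6e9):
    """Expected number of random matches among L possible positions (Eq. 2)."""
    return L * match_prob(l, m)

def p_value(l, m, L=6e9):
    """Probability of at least one random match (Eq. 3)."""
    return 1 - exp(-e_value(l, m, L))

for l in (15, 20, 25, 28, 30):
    print(l, round(p_value(l, 0), 4), round(p_value(l, 3), 4))
```

For example, the sketch gives a p-value of roughly 0.005 for perfect matches of 20 nt reads and roughly 0.008 for 28 nt reads with up to three mismatches, in line with the thresholds discussed above.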

3 Methods: Probabilistic Read Mapping

We have seen that uniqueness is not a good measure of mapping quality on its own. Ideally, we would like to know the probability that a match is correct, and in this section, we derive such a probability. For this, we can use the per base quality scores.

3.1 Quality Scores

High-throughput sequencing (HTS) data normally include an error probability for each base in a read. Typically, the precision of base calls drops towards the end of the read (Fig. 5b) and the error probabilities allow us to estimate how large the uncertainty for correct base calls becomes. Error probabilities are converted into a range of discrete integers called quality scores or Phred scores,


Fig. 5 (a) Quality scores: Conversion between error probabilities and Phred-type quality scores. The quality score Q is calculated from the error probability e by Q = −10 · log_10(e) and can be represented as an ASCII character, here using an offset of 33. (b) Quality profile: Example of a quality profile for the first 35 nucleotides of the reads from the Illumina CLIP-Seq data set used in the Introduction. The average quality scores decrease towards the 3′-end of the reads

which can be represented as characters from the ASCII code using a certain offset (e.g., 33, which corresponds to “!”) to skip nonprintable characters at the beginning of the ASCII code. Figure 5a shows the relation between error probabilities and quality scores. The reads and their corresponding quality scores are usually stored in the FASTQ format [15]. Note that there are several possible ways to calculate the quality scores, depending on the sequencing technology and instrument version. For example, quality scores in Roche/454 reads indicate the probability that the homopolymer length at a given position is correct [16, 17], while Illumina quality scores denote the probability of an incorrect base call at this position. Also different offsets for the conversion to ASCII characters are used, e.g., by different Illumina instruments, so that programs for downstream analysis usually require an explicit setting of the used offset or try to guess it from the quality score characters. The quality scores contain valuable information about the precision of base calls that can be used in the mapping process. As a simple example, if we have a mismatch between read and genome in the first few bases, we can trust it more to be a real mismatch compared to a mismatch in the low-quality end of a read where the base in the read has higher probability to be a sequencing error. Figure 6 illustrates the possible mapping of a read to two different locations in the genome, where the mapping with two mismatches has a higher probability of being correct than the mapping with only one mismatch. Most current mappers use the quality scores in one way or another. Typically, programs provide an option to truncate a read at a specified


Fig. 6 Most HTS platforms report a quality score for each nucleotide in a read. It denotes the probability e that the called nucleotide is incorrect, here the probability it is correct 1-e is shown as a sequence logo. Assuming a uniform background distribution for the genome, we can calculate the probability of each of the other three possible nucleotides to be e/3. When a read matches several locations in the genome, the mappers typically report the mapping with fewest mismatches as uniquely mapped. In our example, when mapping the read directly to the genome, the mapper would report match 1 as uniquely mapped and disregard match 2. However, when using the probabilities based on the quality scores, instead of considering mismatches we can calculate the probability of each mapping. In this case, we would obtain an approximately 550 times higher probability for match 2 compared to match 1, and we would choose match 2 as the best mapping

quality score cutoff in order to increase the chance of a correct mapping by removing the low-quality end of the read. In Bowtie2, the penalty of a mismatch can be set to depend on the quality score—by default the penalty is 6 for a quality of 40, 5 for values from 30 to 39, 4 from 20 to 29, 3 from 10 to 19, and 2 from 0 to 9.

3.2 PSSMs

From the quality score we know the probability of an error e and thus the probability of the called base is 1 − e. If we assume no nucleotide bias for wrong base calls by the sequencing machine, then the probabilities for having each of the other three bases are e/3. Given the four base probabilities for each position in the read, we can convert the read into a position-specific scoring matrix (PSSM), which is a construct often used in bioinformatics. If this probability


of a base a at position i is called p_i(a), the corresponding PSSM score is s_i(a) = log(p_i(a)/q(a)), where q(a) is the "background" probability, which could be the base frequencies in the genome, but often we just set it uniformly to 1/4. If the base is of very high quality, the score of the correct base is high: almost log(1/(1/4)) = log(4), which is 2 if we are using the logarithm base 2 as is customary. On the other hand, the score for the other bases at a high-quality position will be large and negative; e.g., if the error probability is 1/1,000, it would be log((1/3,000)/(1/4)), which is almost −10. For low-quality positions, the score is small, and if the probabilities of the four bases are close to 1/4, the scores become close to 0 for all of them. To score a sequence in the genome, we just add up these scores—if we want to score the PSSM for a read against the sequence CTAAG…, we would calculate s_1(C) + s_2(T) + s_3(A) + s_4(A) + s_5(G) + ….

In principle one can model sequence error biases and dinucleotide biases and so forth, but this is probably rarely done. One sophistication should be considered, namely, the possibility of having differences between the sample and the reference genome, such as SNPs, i.e., differences that are not due to sequencing errors. In its simplest form, we can assume that the probability of a base difference between the sample and the reference genome is p_0. Then the probability of seeing base b in the genome, given base a in the sample, would be p(b|a) = 1 − p_0 for a = b and p(b|a) = p_0/3 for the rest. Now we can simply replace the above probability p_i(a) with p̃_i(b) = Σ_a p(b|a) · p_i(a) in the calculation of the PSSM. For the human genome, the expected frequency of SNPs is around 1 in 1,000, so one could use p_0 = 0.001. It is also possible to incorporate more sophisticated evolutionary models than the above and, for instance, use different probabilities for transitions and transversions.

3.3 The Probability of a Match

Given the quality scores, we can calculate the probability that a mapping is correct. Let us assume that we have calculated the score of a read for every possible location in the genome using the PSSM approach described above. Then the probability of a match at a given location l is P(l) = 2^{S_l} / Σ_k 2^{S_k}, where S_l is the score for a match at position l and the denominator is a sum over all positions in the genome [3, 18, 19]. If there is contamination in the sample, which is almost always the case, as discussed above, the correct posterior match probability is [19]

P(l) = 2^{S_l} / (Σ_k 2^{S_k} + L(1 − P_M)/P_M).   (4)

Here P_M is the prior probability of a match in the genome and L is the number of possible match positions in the genome (~6 billion for the human genome).
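To make Subheadings 3.2 and 3.3 concrete, here is a minimal sketch that turns a read and its Phred quality string into per-position base probabilities, scores candidate genomic positions with the resulting PSSM, and converts the scores into posterior probabilities according to Eq. 4. The read, the candidate sequences, and the prior P_M = 0.8 are made-up illustrations, and the SNP correction (p_0) is ignored.

```python
from math import log2

def base_probs(read, quals, offset=33):
    """Per-position probabilities p_i(a) from Phred-encoded qualities."""
    probs = []
    for base, ch in zip(read, quals):
        e = 10 ** (-(ord(ch) - offset) / 10)   # error probability of the call
        p = {b: e / 3 for b in "ACGT"}
        p[base] = 1 - e                        # probability of the called base
        probs.append(p)
    return probs

def pssm_score(probs, target, q=0.25):
    """Sum of log2(p_i(a)/q) over the aligned positions."""
    return sum(log2(p[a] / q) for p, a in zip(probs, target))

def posteriors(scores, p_match=0.8, L=6e9):
    """Posterior match probabilities of Eq. 4 over the candidate positions."""
    weights = [2 ** s for s in scores]
    denom = sum(weights) + L * (1 - p_match) / p_match
    return [w / denom for w in weights]

read, quals = "ACGTACGTAC", "IIIIIIIII#"       # last base is low quality
probs = base_probs(read, quals)
candidates = ["ACGTACGTAC", "ACGTACGTAA"]      # two hypothetical genomic windows
scores = [pssm_score(probs, c) for c in candidates]
print(scores, posteriors(scores))
```

With such a short read both posteriors come out tiny, which is exactly the point of Eq. 4: a 10 nt match carries almost no evidence against roughly six billion possible positions.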


Instead of considering whether a match is correct or not, we can now quantify uniqueness. If a long high-quality read matches very well to the genome, and there are no competing high-scoring matches, the above probability P(l) will be very close to one. On the other hand, if the read is short and/or of low quality, so that 2^{S_l} is similar in size to or less than L(1 − P_M)/P_M, the probability will be low. The probability also gets lower if there are competing matches with scores similar to the best match. If there are two matches with identical score, for instance, the probability will be at most 0.5.

The PSSM mapping has been implemented in a version of BWA called BWA-PSSM [19]. The whole sum over all possible sites in the genome would take way too long to calculate, but it is well approximated by a sum over the high-scoring positions. This means that the PSSM search and the calculation of the match probability do not increase the run-time dramatically (depending on parameter settings).

Mappers like BWA and Bowtie2 calculate a mapping quality (MapQ). In BWA, it is derived as an approximation to the above P(l) and log-transformed like the base quality scores [3]. A MapQ value of 37 means a single match with fewer mismatches than the maximum allowed, 25 means a match with exactly the maximum number of mismatches allowed, whereas matches having competing matches (with more mismatches than the best match) are scored from 23 to 0 as the number of competing matches increases. The MapQ calculated in Bowtie2 is based on the difference between the score of the best match and the score of the second best match divided by the maximum possible score and the minimum accepted score. If there are no competing matches, the difference is replaced by the difference between the score of the read and the smallest score allowed. The maximum MapQ is 42 for a fraction above 0.8 and no competing matches and goes towards zero for matches with lower fractions and/or competing matches.

To see how useful these mapping qualities are to avoid undesired mappings, we simulated Illumina reads with lengths from 20 to 30 nt without insertions/deletions from the E. coli genome (GenBank acc. FM180568, 1× coverage) using ART [20] and mapped them to the human genome (hg19) with BWA, Bowtie2, and BWA-PSSM. We used the mappers' default settings and chose a prior match probability P_M = 0.8 in BWA-PSSM. Figure 7 shows the fraction of uniquely mapped reads for the three mappers. We see that the number of uniquely mapped reads is lower for BWA-PSSM, which can make use of the prior match probability. By using a probability cutoff with BWA-PSSM, we can influence the amount of matches from very strict to more permissive. When requiring a posterior probability of 0.99 (dotted green line), only ~1 % of all reads are mapped uniquely. Bowtie2 maps only around half as many reads uniquely as BWA and for both of them, even though random matches are not modelled directly, having a minimum cutoff on the


Fig. 7 Mapping of E. coli reads to the human genome. The figure shows the fraction of all reads that are mapped uniquely for BWA, Bowtie2, and BWAPSSM using their default settings. The full lines denote all uniquely mapped reads, while the dotted lines denote only those uniquely mapped reads that have a MapQ value greater than 20 in BWA and Bowtie2, or equivalently a posterior probability greater than 0.99 in BWA-PSSM. For BWA-PSSM we define uniquely mapped reads as reads with a posterior probability greater than 0.5. See also [19]

score helps to reduce undesired mappings in this case (dotted lines). When using a cutoff, BWA still maps nearly twice as many reads as any of the other mappers for read lengths 20 and 22 nt. In practice, when using BWA, we would recommend using the p- or E-values given in this chapter together with the MapQ score (or avoid short and/or low-quality reads).

Posterior probabilities are also explicitly used in other recent mappers. For example, Stampy [21] uses a Bayesian model to estimate the probability of an incorrect mapping based on errors and variation in the read and repetitiveness of the genome or if the read is contamination. For 454 pyrosequencing, the FLAT mapper [22] also employs a probabilistic framework that considers sequencing errors for short reads. LAST [18] also converts reads into scoring matrices based on the quality values and employs a probabilistic model for the mapping. The FASTQ quality scores may not be accurate and may vary between runs of the sequencing machine. Therefore it is worth considering a recalibration of the quality scores; see, e.g., DePristo et al. [23].

4 Conclusion

The success of many experiments is dependent on high-quality mapping of short reads. From our introductory example with six read data sets from different types of experiments, we saw that


different mapping programs yield quite differing mappings depending on the type of data. Even this simple experiment shows that, although mapping is a routine task nowadays, it is important to think about the choice of the mapper and the mapping parameters, which are crucial for a high-quality mapping. For short reads, the most severe problem in complex genomes, like human and other eukaryotic genomes, is wrongly mapped reads. In our statistical analysis (Fig. 2), we show that when considering unique matches in the human genome, the highest rate of wrongly mapped reads is observed at a read length around 18 nt. When the reads have a high error rate, the problem persists with more than 1 % wrongly mapped reads up to a length around 50 nt. Because of the repetitive nature of genomes, the risk of wrongly mapped reads will always be there. It may actually be more severe than it seems, because repeat regions usually also involve assembly problems. If a region is missing in the reference genome or several regions are mixed in the assembly, the chance of wrong mappings increases. Another problem is the unintended mapping of reads to a genome, which do not originate from that genome, such as contamination. From our theoretical considerations, we can see that this problem is most severe for very short reads. For example, when mapping random sequences with length below 20 nt to the human genome, more than 1 % of them will map perfectly, which cannot be influenced by any mapping parameters. We confirmed this estimate by mapping randomly sampled reads from E. coli to the human genome. For more permissive criteria on the mapping, i.e., allowing for mismatches, this error rate shifts to around a 25–30 nt read length, and for longer reads it should be below 1/1,000, and decreasing exponentially with length. Both of the above problems can in principle be dealt with by calculating a probability that the read is correctly mapped as shown in the last section. Several mappers do this in one way or the other, and assign either a probability or a mapping quality score, which can represent a log-transformed mapping (or error) probability. The calculation of these quantities varies a great deal between mappers and is often based on approximations in order to maintain the speed of mapping. However, we saw that using a cutoff on this posterior probability can limit the number of matches of E. coli reads to the human genome (Fig. 7). Another way to limit the problem of wrongly mapped reads, which is used in many applications, is to require multiple matches to the same region. For instance, one would rarely accept a binding site unless it is covered by many reads in a ChIP-Seq experiment, and one would not call a polymorphic site based on a single read in a re-sequencing experiment. This is an excellent strategy if one is careful with removing PCR artifacts (e.g., reads that map with exactly the same start and end position). However, one should


keep in mind that (1) a few reads might happen to map incorrectly to the wrong repeat, and (2) the higher the quality of mapping, the smaller the number of reads required. In this chapter we have not considered the problem of sensitivity: reads that should have been mapped but are missed by the mapper. This problem may cause higher expenses, because a higher sequencing depth is needed if too few reads are mapped, and does therefore receive more attention. In our opinion it is less of a problem than wrongly mapped reads. Exactly how to map reads depends heavily on the application. We hope that this chapter has sharpened your intuition on the subject and given you some pointers on how to attack the problems.

5 Notes

5.1 Probability of Correct Exact Matches

In the text we consider a simple example, where a read matches exactly and uniquely, but there are n other matches with one mismatch. We want to calculate the probability that an exact match is also the correct match, P(correct | exact). Using Bayes' theorem, it can be written as

P(correct | exact) = P(exact | correct) · P(correct) / P(exact).

The first two terms are easy (see below), but the denominator needs to be rewritten as

P(exact) = P(exact | correct) · P(correct) + P(exact | incorrect) · P(incorrect).

The individual terms are

P(exact | correct) = (1 − p)^l,

where l is the read length and p is the error rate for each base, i.e., the probability that a base is incorrect. Similarly, for precisely one mismatch:

P(exact | incorrect) = (p/3) · (1 − p)^{l−1}.

The prior probability that a match is correct (before even comparing the sequences) is just P(correct) = 1/(n + 1) and P(incorrect) = 1 − P(correct). Now inserting all the terms, we get after rearrangements

P(correct | exact) = 1 / (1 + np/(3(1 − p))) ≈ 1 / (1 + np/3),

where the last approximation holds only if p is small.
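Before turning to the caveat below, the exact expression and the approximation are easy to compare numerically. In the hedged sketch that follows, the error rate p, the numbers of one-mismatch matches n, and the read length l are the illustrative values used in Subheading 2.1.

```python
def p_correct_exact(p, n, l):
    """Exact form of P(correct | exact) from the derivation above."""
    prior = 1 / (n + 1)                              # P(correct) before comparing sequences
    p_exact_correct = (1 - p) ** l                   # P(exact | correct)
    p_exact_incorrect = (p / 3) * (1 - p) ** (l - 1) # P(exact | incorrect)
    denom = p_exact_correct * prior + p_exact_incorrect * (1 - prior)
    return p_exact_correct * prior / denom

for n in (150, 1000):
    exact = p_correct_exact(p=0.02, n=n, l=30)
    approx = 1 / (1 + n * 0.02 / 3)
    print(n, round(exact, 3), round(approx, 3))
```

For p = 0.02 this reproduces the values quoted in the text: about 0.5 for n = 150 and about 0.13 for n = 1,000.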


This calculation ignores the possibility that there are matches in the genome with more than one mismatch. These will further lower the probability that the exact match is the correct one, so the above is an upper bound for this probability.

5.2 Approximation of p-Value for Random Matches

The probability of not having a match anywhere in the genome is (1 − F(m; l, 1/4))^L = (1 − E(m, l, L)/L)^L. Assuming that F(m; l, 1/4) is small, this is very well approximated by e^{−E(m, l, L)}, so the probability of at least one random match is approximately 1 − e^{−E(m, l, L)}.
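The quality of this approximation can be checked directly, as in the sketch below; the read lengths and mismatch counts are illustrative, with L = 6 × 10^9 as in the text.

```python
from math import comb, exp

def F(m, l):
    """F(m; l, 1/4): probability of up to m mismatches at a single position."""
    return sum(comb(l, k) * (1 / 4) ** (l - k) * (3 / 4) ** k for k in range(m + 1))

L = 6e9
for l, m in ((20, 0), (25, 2), (28, 3)):
    exact = 1 - (1 - F(m, l)) ** L    # exact probability of at least one random match
    approx = 1 - exp(-L * F(m, l))    # approximation used in Eq. 3
    print(l, m, round(exact, 6), round(approx, 6))
```

For the read lengths relevant here the two quantities agree to several decimal places, which justifies using Eq. 3 in practice.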

5.3 Read Preprocessing and Mapping for Fig. 1

Illumina reads of length 30 nt (without indels) were simulated with ART [20] from the human reference genome hg19 with 0.01× coverage, which results in 953,147 reads. Experimental data sets were downloaded from the GEO (http://www.ncbi.nlm.nih.gov/geo/) and SRA (http://www.ncbi.nlm.nih.gov/sra, [24]) databases. The accession numbers of the different data sets used are listed below in Table 1. To identify barcodes and adapters we analyzed the data sets with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to find overrepresented sequences in the reads and identify adapters used in Illumina sequencing protocols. We used AdapterRemoval [25] to remove barcodes and adapters and to trim low-quality bases and trailing Ns. The primer sequences and barcodes are listed in Table 1. After preprocessing, all reads less than 20 nt long were discarded. 500,000 reads were randomly sampled from each of the above data sets. The indices for Bowtie2 and BWA were created from the human reference genome hg19 and reads were mapped to the

Table 1 Accession numbers and barcode/adapter sequences

Ancient DNA
  Accession numbers: SRX013912
  Barcode: –
  Adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

CAGE
  Accession numbers: GSM849338
  Barcode: –
  Adapter: AGACAGCAGG (5′-adapter)

Small RNA-Seq
  Accession numbers: GSM893567, GSM893569, GSM893584, GSM893585, GSM893571
  Barcodes: ACTT, GTTA, GGGT, GTTA, TCGC
  Adapter: CTGTAGGCACCATCAAT

CLIP-Seq
  Accession numbers: GSM859978, GSM859979, GSM859980, GSM859981
  Barcode: –
  Adapter: TCGTATGCCGTCTTCTGCTTG

ChIP-Seq
  Accession numbers: GSM727557
  Barcode: –
  Adapter: GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG


genome with sensitive settings for both mappers. The commands used for preprocessing and mapping are as follows:

Read preprocessing:

AdapterRemoval --trimns --trimqualities --5prime [5'-adapter/barcode sequence] --pcr1 [3'-adapter sequence] < file.fastq

Mapping with Bowtie2

Build index:
bowtie2-build ref_genome.fa bowtie_index

Read mapping:
bowtie2 -i L,4,0 -L 18 -M 2000000 -N 1 -x bowtie_index -U infile.fastq -S outfile.sam

Options:
- -M n: the number of distinct alignments to consider for a read is n + 1.
- -i f,b,a: interval length as a function of read length x, L(x) = b + a · f(x). It defines the number of seeds to extract from the read and thereby the number of allowed mismatches across the reads.
- -L n: n is the seed length.
- -N n: number of allowed mismatches in each seed.

Mapping with BWA (0.6.1-r104)

Build index:
bwa index -a bwtsw ref_genome.fa

Read mapping:
bwa aln -n 0.01 -l 1024 -m 2000000 -t 28 ref_genome.fa data.fastq > data.sai
bwa samse -f data.sam ref_genome.fa data.sai data.fastq

Options:
- -n p: ratio of the reads that would not be mapped at an error rate of 2 % (for more details see [26]).
- -l n: seed length. If the length of a read is shorter than n then seeding is disabled for the given read. By setting the value of n to 1,024, seeding is disabled for all reads in the datasets.
- -m n: number of alignments to consider when looking for the optimal alignment.

Acknowledgments

This work was supported by grants from the Danish Strategic Research Council (COAT), the Novo Nordisk Foundation, and European Union (Project #265933 Hotzyme).


References 1. Altschul S, Madden T, Sch€affer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402 2. Li L, McCorkle S, Monchy S, Taghavi S, van der Lelie D (2009) Bioprospecting metagenomes: glycosyl hydrolases for converting biomass. Biotechnol Biofuels 2:10. doi:10.1186/ 1754-6834-2-10 3. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18:1851–1858. doi:10.1101/gr.078212.108 4. Li R, Yu C, Li Y, Lam T, Yiu S, Kristiansen K, Wang J (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966–1967. doi:10.1093/bioinformatics/ btp336 5. Langmead B, Salzberg S (2012) Fast gappedread alignment with bowtie 2. Nat Methods 9:357–359. doi:10.1038/nmeth.1923 6. Ruffalo M, LaFramboise T, Koyut€ urk M (2011) Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 27:2790–2796. doi:10.1093/ bioinformatics/btr477 7. Stiller M, Green R, Ronan M, Simons J, Du L, He W, Egholm M, Rothberg J, Keates S, Keats S, Ovodov N, Antipina E, Baryshnikov G, Kuzmin Y, Vasilevski A, Wuenschell G, Termini J, Hofreiter M, Jaenicke-Despre´s V, P€a€abo S (2006) Patterns of nucleotide misincorporations during enzymatic amplification and direct large-scale sequencing of ancient DNA. Proc Natl Acad Sci U S A 103(13):578–584. doi:10.1073/pnas. 0605327103 8. Kircher M (2012) Analysis of high-throughput ancient DNA sequencing data. Methods Mol Biol 840:197–228. doi:10.1007/978-161779-516-9∖textunderscore23 9. Rasmussen M, Li Y, Lindgreen S, Pedersen J, Albrechtsen A, Moltke I, Metspalu M, Metspalu E, Kivisild T, Gupta R, Bertalan M, Nielsen K, Gilbert M, Wang Y, Raghavan M, Campos P, Kamp H, Wilson A, Gledhill A, Tridico S, Bunce M, Lorenzen E, Binladen J, Guo X, Zhao J, Zhang X, Zhang H, Li Z, Chen M, Orlando L, Kristiansen K, Bak M, Tommerup N, Bendixen C, Pierre T, Grønnow B, Meldgaard M, Andreasen C, Fedorova S, Osipova L, Higham T, Ramsey C, Hansen T, Nielsen F, Crawford M, Brunak S, Sicheritz-Ponte´n T, Villems R, Nielsen R, Krogh A, Wang J, Willerslev E (2010) Ancient human genome

sequence of an extinct Palaeo-Eskimo. Nature 463:757–762. doi:10.1038/nature08835 10. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, Fukuda S, Sasaki D, Podhajska A, Harbers M, Kawai J, Carninci P, Hayashizaki Y (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100 (15):776–781. doi:10.1073/pnas.2136655100 11. Morin R, O’Connor M, Griffith M, Kuchenbauer F, Delaney A, Prabhu A, Zhao Y, McDonald H, Zeng T, Hirst M, Eaves C, Marra M (2008) Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 18:610–621. doi:10.1101/gr.7179508 12. Zhang C, Darnell R (2011) Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat Biotechnol 29:607–614. doi:10.1038/nbt.1873 13. Lander E, Linton L, Birren B, Nusbaum C, Zody M, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov J, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin J, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston R, Wilson R, Hillier L, McPherson J, Marra M, Mardis E, Fulton L, Chinwalla A, Pepin K, Gish W, Chissoe S, Wendl M, Delehaunty K, Miner T, Delehaunty A, Kramer J, Cook L, Fulton R, Johnson D, Minx P, Clifton S, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng J, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs R, Muzny D, Scherer S, Bouck J, Sodergren E, Worley K, Rives C, Gorrell J, Metzker M, Naylor S, Kucherlapati R, Nelson D, Weinstock G, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith D, Doucette-Stamm L, Rubenfield M,

Weinstock K, Lee H, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis R, Federspiel N, Abola A, Proctor M, Myers R, Schmutz J, Dickson M, Grimwood J, Cox D, Olson M, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans G, Athanasiou M, Schultz R, Roe B, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie W, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey J, Bateman A, Batzoglou S, Birney E, Bork P, Brown D, Burge C, Cerutti L, Chen H, Church D, Clamp M, Copley R, Doerks T, Eddy S, Eichler E, Furey T, Galagan J, Gilbert J, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson L, Jones T, Kasif S, Kaspryzk A, Kennedy S, Kent W, Kitts P, Koonin E, Korf I, Kulp D, Lancet D, Lowe T, McLysaght A, Mikkelsen T, Moran J, Mulder N, Pollara V, Ponting C, Schuler G, Schultz J, Slater G, Smit A, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf Y, Wolfe K, Yang S, Yeh R, Collins F, Guyer M, Peterson J, Felsenfeld A, Wetterstrand K, Patrinos A, Morgan M, de Jong P, Catanese J, Osoegawa K, Shizuya H, Choi S, Chen Y, Szustakowki J, International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921. doi:10.1038/35057062 14. Longo M, O’Neill M, O’Neill R (2011) Abundant human DNA contamination identified in non-primate genome databases. PLoS One 6:e16410. doi:10.1371/journal.pone.0016410 15. Cock P, Fields C, Goto N, Heuer M, Rice P (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38:1767–1771. doi:10.1093/nar/gkp1137 16. Margulies M, Egholm M, Altman W, Attiya S, Bader J, Bemben L, Berka J, Braverman M, Chen Y, Chen Z, Dewell S, Du L, Fierro J, Gomes X, Godwin B, He W, Helgesen S, Ho C, Ho C, Irzyk G, Jando S, Alenquer M, Jarvie T, Jirage K, Kim J, Knight J, Lanza J, Leamon J, Lefkowitz S, Lei M, Li J, Lohman K, Lu H, Makhijani V, McDade K, McKenna M, Myers E, Nickerson E, Nobile J, Plant R, Puc B, Ronan M, Roth G, Sarkis G, Simons J,

59

Simpson J, Srinivasan M, Tartaro K, Tomasz A, Vogt K, Volkmer G, Wang S, Wang Y, Weiner M, Yu P, Begley R, Rothberg J (2005) Genome sequencing in microfabricated highdensity picolitre reactors. Nature 437:376–380. doi:10.1038/nature03959 17. Gilles A, Megle´cz E, Pech N, Ferreira S, Malausa T, Martin J (2011) Accuracy and quality assessment of 454 GS-FLX titanium pyrosequencing. BMC Genomics 12:245. doi:10.1186/1471-2164-12-245 18. Hamada M, Wijaya E, Frith M, Asai K (2011) Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection. Bioinformatics 27:3085–3092. doi:10.1093/bioinformatics/ btr537 19. Kerpedjiev P, Lindgreen S, Frellsen J, Krogh A (2013) Adaptable probabilistic mapping of short reads using position specific scoring matrices. Unpublished 20. Huang W, Li L, Myers J, Marth G (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594. doi:10.1093/ bioinformatics/btr708 21. Lunter G, Goodson M (2011) Stampy: a statistical algorithm for sensitive and fast mapping of illumina sequence reads. Genome Res 21:936–939. doi:10.1101/gr.111120.110 22. Vacic V, Jin H, Zhu J, Lonardi S (2008) A probabilistic method for small RNA flowgram matching. Pac Symp Biocomput 75–86 23. DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas M, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498. doi:10.1038/ng.806 24. Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration (2012) The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res 40:D54–D56. doi:10.1093/nar/ gkr854 25. Lindgreen S (2012) AdapterRemoval: easy cleaning of next generation sequencing reads. BMC Res Notes 5:337. doi:10.1186/17560500-5-337 26. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595. doi: 10.1093/bioinformatics/btp698

Chapter 4

Statistical Modeling of Coverage in High-Throughput Data

David Golan and Saharon Rosset

Abstract

In high-throughput sequencing experiments, the number of reads mapping to a genomic region, also known as the "coverage" or "coverage depth," is often used as a proxy for the abundance of the underlying genomic region in the sample. The abundance, in turn, can be used for many purposes including calling SNPs, estimating the allele frequency in a pool of individuals, identifying copy number variations, and identifying differentially expressed shRNAs in shRNA-seq experiments. In this chapter we describe the fundamentals of statistical modeling of coverage depth and discuss the problems of estimation and inference in the relevant experimental scenarios.

Key words: Coverage, Modeling, Next-generation sequencing, High-throughput sequencing, SNP calling, CNV calling, Differential expression, Statistical modeling of coverage

1 Introduction

In many high-throughput sequencing (HTS) experiments, the underlying goal of the researcher is to estimate the abundance of a certain fragment (or multiple fragments) of DNA or RNA in the sample. While the inner workings of a sequencing machine differ from one technology to another, in general the machine samples short DNA or RNA fragments from the sample numerous times and sequences each of these fragments, generating "reads": short DNA sequences (for a detailed review see [1, 2]). Thus, the researcher might hope that the more abundant fragments would be sampled more often, leading to a higher number of reads spanning the corresponding genomic regions. Due to the random nature of the preparation of the sample, the sampling process by the sequencer, and various other artifacts we discuss later on, the proportion of fragments read by the machine is not exactly the same as the proportion of the DNA fragment in the original sample, and can only be thought of as a proxy for the actual abundance. We are therefore required to construct proper statistical


foundations for estimation and inference regarding such proportions from the observed data.

Many applications take advantage of the relationship between abundance in a sample and the number of reads. Some of these applications include:

SNP calling: After sequencing an individual, we are usually interested in calling the genotype of each single-nucleotide polymorphism (SNP). We assume there are two alleles in the population, a major allele denoted B (sometimes referred to as the wild-type allele) and a minor allele denoted b (mutant allele). For every diploid position, we expect roughly half of the reads to contain the allele from the maternal chromosome, while the other half should contain the allele from the paternal chromosome. Therefore, if a vast majority of reads mapping to a specific locus contain the same allele, the individual is probably homozygous at this locus (the genotype is either BB or bb, depending on the content of the reads). If, however, we observe two different alleles, each one in roughly half of the reads, the individual is probably heterozygous (has the genotype Bb).

Analyzing pooled samples: In many scenarios, individuals are pooled together and sequenced to reduce preparation and sequencing costs. For example, it is common practice in case–control studies to pool cases and controls in separate pools, or to pool many individuals together to identify novel rare variants or to estimate the minor allele frequency (MAF) in the population. Then, for each polymorphic region, the proportion of reads containing each variant can be used to estimate the proportion of this variant in the sample or the population. In pooled case–control studies, a variant displaying significantly different proportions in the case and control groups might be significantly associated with the disease.

Differential short RNA expression analyses: In such experiments, the goal is to identify which short RNAs have different expression levels between two groups (e.g., cases and controls, or treatment and controls). More highly expressed shRNAs are more abundant in the cytoplasm and are therefore more likely to be sampled by the sequencer. Comparing the proportion of reads spanning each shRNA in each sample can indicate which shRNAs are differentially expressed. Similar methodologies are used for quantifying mRNA (called RNA-seq), but these are more involved since most mRNAs are much longer than the reads of most HTS technologies [3].

Identifying copy number variations: Copy number variations (CNVs) are duplications or deletions of large segments of the genome. Such regions can be identified by looking at the average coverage of a region. A duplicated region would display, on average, 1.5 times higher coverage depth than the neighboring


regions, assuming only one of the chromosomes contains a duplication. A deletion of a segment on one of the diploid chromosomes would lead to a halved coverage depth. As with RNA-seq, there are methodologies for CNV calling which take into account additional information other than the coverage and provide better results [4]. However, coverage depth can still provide an indication for CNVs.

We start by describing the commonly used statistical modeling of coverage and discuss several statistical complications which usually arise when analyzing coverage. We then show how all of the problems above can be formulated as straightforward estimation and hypothesis tests once the statistical foundations have been laid properly, and demonstrate these ideas by analyzing real, publicly available data.

2 Materials

The analysis described in this chapter uses various software for three distinct purposes:

Aligning reads to a reference genome: There are several aligners which are commonly used to align short sequences to a reference genome. One highly popular and easy-to-use aligner is BowTie2 [5]. However, there are various other aligners, and different sequencing platforms often require different aligners. Make sure you are using the proper aligner and an up-to-date version of the reference genome.

Manipulating aligned files: Output files are usually in .sam or .bam formats. The state-of-the-art software for viewing and manipulating these files is SAMTools [6]. The most relevant manipulation required for depth analysis is obtaining the actual number of reads spanning each locus in the genome.

Statistical analysis: The statistical analysis, especially the fitting of generalized linear models described later, requires some sort of statistical software. While most programming languages can support such analysis to some extent, we recommend using the programming language R [7]. To fit negative binomial models as described in the text, one has to load the MASS library using the command library(MASS).

3 Modeling Coverage

3.1 The Poisson Model

The sequencing process is composed of two major stages, sample preparation and sequencing, both of which vary greatly depending on the technology used and the experiment at hand. The preparation step usually involves polymerase chain reaction (PCR), making sure


there's enough DNA/RNA in the sample, and usually includes an additional step which aims at making sure that DNA/RNA which is not relevant to the experiment is left out [2]. This filtering step is used to make sure that reads are not "wasted" on less interesting regions of the genome. For example, when studying short RNAs, all long RNAs are filtered out of the sample prior to sequencing, because there is no interest in sequencing them. The sample is then sequenced, and the sequencer randomly captures a huge number of fragments and sequences each one. The number of reads and their lengths also vary according to the technology. For example, Illumina HiSeq generates reads of a constant length, while other technologies such as Roche's 454 and Life Technologies' Ion Torrent generate reads of varying lengths [8]. For simplicity we assume a constant read length, but relaxing this assumption does not alter the analysis in any material manner.

The overall coverage of the experiment, denoted β, is the average number of reads covering each base pair and is given by:

$$\beta = \frac{\#\text{reads} \times \text{read length}}{\text{total size of sequenced genome}}.$$

Let i denote a certain base pair of interest in the genome. We denote by y_i the number of reads mapped to i. The most common and simple model of coverage assumes:

$$y_i \sim \mathrm{Pois}(\beta),$$

i.e., the number of reads spanning each base pair follows a Poisson distribution with rate β. In this case E[y_i] = β, so the expected and observed mean coverage depth coincide. When dealing with entire genomic regions, y_i is interpreted as the number of reads mapping to the ith region and the rate is adjusted according to the region's size:

$$y_i \sim \mathrm{Pois}\left(\beta \cdot \frac{\text{region size} + \text{read length} - 1}{\text{read length}}\right).$$

The fact that each locus has the same rate of coverage hides two important assumptions: that the sample preparation affects all regions in the same manner, and that the sequencer samples all fragments with the same probability. While the Poisson model is easy to analyze and deal with, experimental evidence shows that coverage depth is influenced by many factors and is not constant across the entire sample. Some local properties of the DNA/RNA molecule, such as GC content and secondary and tertiary structure, might influence the way it responds to PCR and other steps in the preparation and sequencing phase (see, e.g., [9]). It might therefore be the case that certain segments of DNA systematically respond better or worse to PCR, creating an impression that such segments are more or less


abundant than they actually are. Similarly, the sequencer's ability to generate a high-quality read of a sequence might depend on the sequence itself, resulting in similar artifacts. Any serious model of coverage must therefore include the effects of such factors. Moreover, the effects of such factors on coverage might vary greatly from one sample to another, as the preparation step is never identical for two different samples. We must therefore not only include such effects in our model, but we must also be able to estimate their effect from the data, or otherwise account for the uncertainty resulting from these effects.

3.2 Accounting for Observed Covariates Using Generalized Linear Models

We now assume that the Poisson rate β is not constant, but rather dependent on an observed variable X, such as local GC content. We are interested in estimating the effect of X on coverage depth, testing whether this effect is significant, and accounting for it in downstream analyses. Fortunately, there are established statistical methodologies for dealing with such scenarios, which fall under the vast category of "generalized linear models" (GLMs) [10]. We will not survey GLMs here, but only briefly explain how they can be used in this setup. Statistical software for estimation and inference using GLMs is widely available.

We assume each locus i has a rate βi associated with it. The functional relationship between βi and the relevant covariate Xi is given by:

$$f(\beta_i) = \alpha_0 + \alpha_1 X_i,$$

and:

$$y_i \sim \mathrm{Pois}(\beta_i),$$

where α0, α1 are parameters (generally known as the regression coefficients, with α0 being the intercept and α1 being the slope), and f is a function called the "link" function, which transforms the rate to the appropriate scale. In our case, we expect β ≥ 0, so we require f : [0, ∞) → (−∞, ∞). It is customary to use the log function in the context of Poisson regression, so:

$$\log(\beta_i) = \alpha_0 + \alpha_1 X_i, \quad \text{or equivalently} \quad \beta_i = e^{\alpha_0 + \alpha_1 X_i}.$$

Note that the choice of log as the link function dictates a multiplicative effect of the variable X on the rate. This relationship might not always be the best description of the effect of a change in X on β, hence sometimes transformations of X are used (for example log(X) instead of X), or additional terms such as X² are added as additional covariates to allow for a more complicated relationship


between X and β. The model can be easily expanded to include more covariates by introducing additional coefficients α2, α3, …. Another way to model a different relationship between the covariate and the rate is to use a different link function.

Given a vector of yi's (coverages) and Xi's (covariates), there is an established methodology for estimation and inference of the α coefficients, known as "Poisson regression." This methodology allows us to obtain maximum-likelihood (ML) estimates α̂0 and α̂1 and test hypotheses such as:

$$H_0: \alpha_1 = 0 \qquad H_1: \alpha_1 \neq 0,$$

in other words testing whether the covariate in question has a significant effect on the coverage or not.

3.3 Accounting for Unexplained Overdispersion Using the Negative-Binomial Distribution

So far, we have assumed that yi follows a Poisson distribution. Initially, we assumed a constant mean across the genome, and later we described how known covariates can be accounted for, still assuming that the distribution of depth, after properly conditioning on the known covariates, follows a Poisson distribution. There are two relevant scenarios where the Poisson assumption would be violated.

First, even if we assume that after conditioning on all covariates which affect the coverage depth the coverage follows a Poisson distribution, it is very likely that some of these covariates are unknown or unmeasurable, and therefore cannot be accounted for using Poisson regression. Hence, the unknown variables introduce additional variance which is left unaccounted for, even after controlling for all the known variables, and the resulting distribution would have a higher variance than expected under the Poisson model. An observed variance which is higher than the expected variance according to the Poisson model is called overdispersion.

Second, the Poisson assumption implicitly assumes that the fragments chosen by the sequencer are independent. However, many times this is not the case. Since the DNA/RNA fragments are duplicated by PCR, a fragment which is, by chance, not duplicated in the first few rounds of PCR would end up being present in many fewer copies in the sequenced sample. This fact creates an important difference between the fragments in the sample before and after PCR: while before PCR the assumption of independent sampling holds, after PCR, sampling a certain fragment increases the chance that the same fragment would be sampled again (as there is a higher posterior chance that this fragment is overrepresented due to PCR duplications). Such dependence, again, would lead to increased variance relative to what we would expect given the Poisson assumption. Moreover, while the overdispersion which


is due to unaccounted-for covariates should disappear once we discover and control for these variables, cleaning the effects of dependent sampling is a much harder task. Some analysis tools (such as SAMtools [6]) allow the removal of reads which seem to be PCR duplicates of one another, in order to reduce the dependence and make the Poisson model more accurate. Unfortunately, in many scenarios, identifying such PCR duplicates is impossible, as is the case in pooled sequencing and shRNA sequencing, in which case the overdispersion cannot be removed and must be accounted for appropriately.

Both scenarios lead us to seek an extension of the Poisson model which would allow a higher variance than expected according to the model. Luckily, the problem of overdispersion is often encountered in statistics, especially in the context of generalized linear models, where unobserved variables or hidden dependence create a higher-than-expected variance. A common approach to dealing with overdispersion is to think of the rates (the βi's) not as fixed parameters, but rather as random variables with some prior distribution, where the randomness of the prior distribution can be thought of as the effect of the unobserved variables, or as a way to compensate for the dependence. The distribution of yi still follows a Poisson distribution, but only after we condition on this random variable:

$$\beta_i \sim F(\Theta), \qquad y_i \mid \beta_i \sim \mathrm{Pois}(\beta_i),$$

where F is the "prior" distribution of the rates, which might depend on parameters Θ. Usually, F is assumed to be the Γ (Gamma) distribution. One can show that under these assumptions the marginal distribution of y is negative binomial. The negative binomial distribution is denoted NB(μ, θ), where μ is called the mean parameter and θ is called the overdispersion or shape parameter. For X ~ NB(μ, θ) we have E[X] = μ and

$$\mathrm{Var}(X) = \mu + \frac{\mu^2}{\theta},$$

so when θ → ∞ the variance converges to the mean, and the distribution converges to the Poisson distribution. θ can therefore be interpreted as a measure of the degree of departure from the Poisson assumption. Notice how the negative-binomial distribution generalizes the Poisson distribution: for Poisson variables, the mean determines the variance (since if X ~ Pois(β) then E[X] = Var(X) = β), but for the negative-binomial distribution, the overdispersion parameter allows us to "fine-tune" the variance to match the degree of overdispersion observed in the data. This approach is used in a wide range of applications ranging from genetics [11] to modeling of Internet traffic and transportation (see [12] for a review of many applications).
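To make the Gamma–Poisson construction concrete, the following R sketch (with arbitrary, made-up values for μ and θ rather than estimates from any real experiment) simulates rates from a Gamma prior, draws Poisson counts given those rates, and checks that the resulting variance exceeds the mean by roughly μ²/θ:

  set.seed(1)
  n     <- 100000   # number of simulated loci
  mu    <- 5        # mean coverage (hypothetical value)
  theta <- 10       # overdispersion/shape parameter (hypothetical value)

  # Gamma prior on the rates: mean mu, variance mu^2/theta
  beta.i <- rgamma(n, shape = theta, rate = theta / mu)
  # Poisson counts conditional on the locus-specific rates
  y <- rpois(n, lambda = beta.i)

  mean(y)               # close to mu
  var(y)                # close to mu + mu^2/theta, i.e., larger than the mean
  mu + mu^2 / theta     # the variance implied by NB(mu, theta)

  # drawing directly from the marginal negative binomial behaves the same way
  y2 <- rnbinom(n, mu = mu, size = theta)
  c(mean(y2), var(y2))

Note that in R's parameterization the size argument of rnbinom plays the role of θ.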


3.4 Combining Known Covariates and Overdispersion in Negative-Binomial Regression

After presenting the negative-binomial distribution as a generalization of the Poisson distribution which allows for explicit modeling of overdispersion, one might wish to incorporate the effects of known covariates into the negative-binomial model in the same manner that was done for the Poisson model using generalized linear models. Again, we choose log as our link function, only this time it is the mean term μi at the ith locus that is affected by the covariates:

$$\log \mu_i = \alpha_0 + \alpha_1 X_i, \qquad y_i \sim \mathrm{NB}(\mu_i, \theta).$$

This model is known as negative-binomial regression or overdispersed Poisson regression. As is the case for Poisson regression, there are established methodologies and easy-to-use software for estimation of the coefficients, as well as estimation of θ, and inference regarding these parameters.
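As a minimal illustration (not taken from the chapter's data), the following R sketch simulates counts from this model with invented coefficients and recovers them with glm.nb from the MASS package:

  library(MASS)
  set.seed(2)

  n     <- 5000
  x     <- runif(n)                  # a stand-in covariate such as local GC content
  mu    <- exp(1.2 + 0.8 * x)        # log-linear mean with hypothetical coefficients
  theta <- 8                         # hypothetical overdispersion parameter
  y     <- rnbinom(n, mu = mu, size = theta)

  nb.fit <- glm.nb(y ~ x)
  coef(nb.fit)       # approximately recovers the intercept 1.2 and slope 0.8
  nb.fit$theta       # approximately recovers theta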

4 Estimation and Hypothesis Testing

Once an accurate model of coverage is in hand, we can move on to proper statistical estimation and hypothesis testing regarding the relevant biological questions. In general, most of the biologically relevant analyses discussed in the introduction fall into three categories: calling genotypes or estimating the MAF in a pooled sample are in fact problems of proportion estimation; identifying CNVs is a problem of estimation/hypothesis testing of the rate of the Poisson/NB distribution; lastly, testing for significant differential expression of shRNAs, and testing for different allele frequencies between cases and controls, are problems of proportion comparison. We begin by describing the Poisson case and then briefly discuss the negative-binomial case.

4.1 Estimating Proportion

When pooling n individuals in a single sample, we are actually pooling 2n haplotypes. Let m be the number of haplotypes with the mutant (minor) allele and 2n − m the number of haplotypes with the wild-type (major) allele; then the proportion of minor alleles in the sample is

$$p = \frac{m}{2n}.$$

Note that in the case of n = 1, the problem of determining whether m is 0, 1, or 2 is the problem of individual genotyping. Under the simple Poisson assumption, where we assume that the sequencing machine has the same probability of sampling each allele, it follows that the number of mutant reads (denoted MU)


and the number of wild-type reads (denoted WT) sequenced by the machine both follow a Poisson distribution with parameters pβ and (1 − p)β, respectively. Note that when calling SNPs, the effect of a single base pair change on GC content, secondary structure, etc. is very minor and therefore can easily be neglected. If, however, the variant in question is a deletion/insertion of several bases, it could definitely change the GC content in a significant manner, so one must account for the covariates when computing the expected rates, in which case the rates would be pβ(MU) and (1 − p)β(WT), where β(MU), β(WT) are the expected rates after accounting for the covariates relevant to the mutant or the wild-type sequence, respectively. We neglect such effects in the following analyses for simplicity and tractability.

Since sequencing technologies are not perfect, reads may contain errors, and a sequencer might sequence a wild-type allele but report a mutant allele or vice versa. We denote by e the probability of such an error. e itself depends greatly on the technology and on the allele. For the sake of simplicity, we assume symmetry, i.e., that the probability of the machine mistaking any base for any other base is identical. However, the error probability needn't be symmetric. Moreover, e itself might change drastically from one run to the other, and even between different lanes of the same run. It is therefore important to estimate e from the data. It follows that the total number of reported mutant reads follows a Poisson distribution with mean term:

$$\beta_{MU}(p) = (1 - e)p\beta + e(1 - p)\beta,$$

which is simply the probability of correctly sequencing a mutant allele plus the probability of sequencing a wild-type allele and mistaking it for a mutant allele; we define βWT(p) similarly, which is equivalent to defining βWT(p) = β − βMU(p).

Moving on to estimation, the likelihood function is given by:

$$L(p) = e^{-\beta_{MU}(p)}\,\frac{\beta_{MU}(p)^{MU}}{MU!}\;\cdot\;e^{-\beta_{WT}(p)}\,\frac{\beta_{WT}(p)^{WT}}{WT!},$$

and with some algebra, one can show that

$$L(p) \propto \left(\frac{\beta_{MU}(p)}{\beta}\right)^{MU}\left(1 - \frac{\beta_{MU}(p)}{\beta}\right)^{N - MU},$$

where N = MU + WT is the total number of reads. The fact that this likelihood function resembles a binomial likelihood is no coincidence. It is a special case of a more general result in probability: the conditional distribution of Poisson random variables when conditioning on their sum is a multinomial distribution. In our case there are only two Poisson random variables (MU and WT), so the conditional distribution is binomial. We define q = βMU(p)/β, the expected proportion of mutant reads.


For a random variable X ~ Bin(n, p) the MLE for p is given by p̂ = X/n; applying this to our case yields the naïve maximum-likelihood estimator

$$\widehat{MAF} = \frac{MU}{MU + WT}.$$

A slightly more refined estimator should take into account the discrete number of haplotypes in the sample:

$$\widehat{MAF} = \arg\max_{i = 0, \ldots, 2n} L\!\left(\frac{i}{2n}\right).$$

It is easy to show that, due to concavity considerations, the MLE of the discrete version is either $\frac{1}{2n}\left\lfloor \frac{MU}{MU+WT}\cdot 2n \right\rfloor$ or $\frac{1}{2n}\left\lceil \frac{MU}{MU+WT}\cdot 2n \right\rceil$, so there is no need to actually iterate over all possible values (where ⌊x⌋, ⌈x⌉ denote the largest integer smaller than x and the smallest integer larger than x, respectively).
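A small R sketch of both estimators, using invented read counts and pool size, and ignoring the sequencing error term e for brevity:

  MU <- 37     # reads reporting the minor allele (hypothetical)
  WT <- 163    # reads reporting the major allele (hypothetical)
  n  <- 20     # pooled individuals, i.e., 2n = 40 haplotypes

  # naive maximum-likelihood estimate
  maf.naive <- MU / (MU + WT)

  # discrete estimate: by concavity, only the two grid points of the form i/(2n)
  # surrounding the naive estimate need to be compared
  loglik <- function(p) dbinom(MU, size = MU + WT, prob = p, log = TRUE)
  lower  <- floor(maf.naive * 2 * n) / (2 * n)
  upper  <- ceiling(maf.naive * 2 * n) / (2 * n)
  maf.discrete <- if (loglik(lower) >= loglik(upper)) lower else upper

  c(naive = maf.naive, discrete = maf.discrete)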

4.2 Hypothesis Tests on Proportions

Under the Poisson model, hypothesis testing becomes a simple variation of hypothesis testing on proportions of binomial random variables, both when testing hypotheses on a single proportion and when comparing proportions in the context of differential expression analysis or case–control studies.

For example, suppose we wish to test whether there exists a polymorphism at a certain locus or not. A clear indication of a polymorphism is an individual with one (or more) haplotypes with an allele different than the wild type. More formally, we wish to test the hypothesis of p = 0 against the alternative hypothesis of p > 0. Due to the error term e, we still expect to see reads reporting a mutant allele, even if there are no such alleles in the sample. It is therefore more revealing to formulate the hypothesis test as:

$$H_0: \beta_{MU} = e \qquad H_1: \beta_{MU} > e,$$

and we then use textbook methods for hypothesis testing regarding proportions. If the coverage is high enough, it is reasonable to use the normal approximation (due to the central limit theorem) and reject H0 when

$$\frac{MU}{MU + WT} > e + Z_{1-\alpha}\sqrt{\frac{e(1-e)}{MU + WT}},$$

where Z1−α is the (1 − α)th percentile of the standard normal distribution, and α is the desired significance level. When the coverage is too low to allow the use of the central limit theorem, one needs to compute the exact cutoff point (as we do in the example).

The problem of testing whether there is a significant difference between the MAF in the control and case groups is again similar to classical statistical inference problems. Usually, due to the large number of cases and controls, each group is divided into several


pools, and each pool is sequenced separately. A significantly different frequency of a certain genetic variant between the case and control groups might indicate that this variant is causative or protective of the phenotype in question. Here, we are interested in comparing the proportion of mutant alleles between the case and control samples. More formally, we wish to test:

$$H_0: p_{case} = p_{control} \qquad H_1: p_{case} \neq p_{control},$$

which can be carried out using standard tests for equality of proportions between two samples. Analysis of differential expression in shRNA is conceptually identical: we wish to test whether the abundance of a given shRNA is different between the two samples, so we test whether there is evidence that the proportion of reads containing this shRNA is different between the two samples.

4.3 Detecting Rate Changes

Unlike the previously discussed problems, which boiled down to statistical estimation and inference of proportions, the problem of identifying CNVs is essentially different. Given the natural number of copies, we expect the number of reads to follow a Poisson distribution with rate

$$\beta_{expected} = \beta \cdot \frac{\text{region size} + \text{read length} - 1}{\text{read length}}.$$

We can then formulate the appropriate hypothesis test for detection of CNVs:

$$H_0: \beta = \beta_{expected} \qquad H_1: \beta \neq \beta_{expected},$$

where replacing H1 by H1: β < βexpected or H1: β > βexpected would result in tests for the detection of deletions and insertions, respectively. The thresholds for rejection of the null hypothesis can be computed using the normal approximation, as described earlier, or by using the exact percentiles when the rates are too low for the central limit theorem to apply, even though this is highly unlikely since the reads are aggregated across a larger region, and we therefore expect relatively high coverage.
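A short R sketch of the resulting one-sided test for a candidate duplication; the coverage, region size, and read length below are invented for illustration:

  beta        <- 5      # genome-wide mean per-base coverage (hypothetical)
  region.size <- 2000   # length of the tested region in bp (hypothetical)
  read.len    <- 100    # read length (hypothetical)
  alpha       <- 0.05

  # expected number of reads overlapping the region under H0
  beta.expected <- beta * (region.size + read.len - 1) / read.len

  # reject H0 in favor of a duplication if the observed read count
  # exceeds the (1 - alpha) quantile of Pois(beta.expected)
  qpois(1 - alpha, lambda = beta.expected)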

4.4 Estimation and Hypothesis Testing Using the Negative-Binomial Model

Analysis under the negative-binomial model is usually much more involved. Unlike the Poisson case, the problems are not reduced to a simple, well-studied problem. Therefore, maximum likelihood estimation requires per-case implementation and numerical optimization, as even the simple problem of estimating μ and θ has no closed form solution. Implemented solutions exist for some of the problems discussed, for example testing for differential expression of shRNAs [13]. One exception which is rather simple to analyze is the detection of CNVs, since the distribution is easily defined under the null hypothesis: the number of reads mapping to a region of interest is


distributed NB(β, θ), and so rejection regions can be computed using the appropriate percentiles, or by using the central limit theorem approximation:

$$\frac{\mathrm{NB}(\beta, \theta) - \beta}{\sqrt{\beta + \beta^2/\theta}} \;\approx\; N(0, 1),$$

and using it to determine the rejection thresholds as done in the Poisson case.
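A hedged R sketch of the negative-binomial version of the CNV threshold, with invented values for the expected count and overdispersion parameter:

  beta  <- 105   # expected read count for the region under H0 (hypothetical)
  theta <- 10    # overdispersion parameter (hypothetical)
  alpha <- 0.05

  # exact negative-binomial threshold
  qnbinom(1 - alpha, mu = beta, size = theta)

  # normal approximation with mean beta and variance beta + beta^2/theta
  beta + qnorm(1 - alpha) * sqrt(beta + beta^2 / theta)

  # the Poisson threshold, which ignores the overdispersion, is noticeably lower
  qpois(1 - alpha, lambda = beta)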

5 Example: Exploring the Actual Distribution of Coverage in a 1000 Genomes Sample

We now move on to demonstrating some of the ideas described in the previous section. We use publicly available low-coverage sequencing data of a single individual (NA12045) from the 1000 Genomes project [14], focusing on a single chromosome (22) for illustrative purposes. We computed the depth at each locus using SAMtools' depth option [6] and extracted the depth of every thousandth base pair to keep a manageable-sized data set. After removing regions with many unknowns in the genome, we were left with 35,195 depth counts (see Notes 1 and 2).

We first analyze the data using the naïve Poisson model. As stated earlier, the maximum-likelihood estimator of the Poisson rate is the average in the sample. In our case, β̂ = 5.15. We wish to identify candidate regions which might be duplicated, resulting in higher-than-expected coverage. To do so, we formulate the problem as a hypothesis testing problem, testing:

$$H_0: \beta = 5.15 \qquad H_1: \beta > 5.15.$$

Following the usual hypothesis testing scheme, we reject the null hypothesis H0 when the observed depth is extremely unlikely according to H0. To specify the test, one must first determine the probability of type-1 error, denoted α. It is widely acceptable to set α = 0.05. We therefore reject H0 for every locus with depth > t, where t is the (1 − α)th quantile of Pois(5.15). In our case, t = 10, resulting in 2,205 loci with significantly elevated coverage. Clearly, there aren't 2,205 duplicated regions on chromosome 22 of NA12045. This is a manifestation of the well-known statistical problem of multiple hypothesis testing: since we run the same test for 34,894 loci, the expected number of type-1 errors (false positives) is α × 34,894 ≈ 1,745, resulting in a high number of false positives drowning whatever true positives might exist in the data.
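The naive per-locus threshold can be reproduced in R along the following lines; depth is assumed to be the vector of per-locus counts (a placeholder name of ours, not part of the published analysis), and whether the reported cutoff equals this quantile or the quantile plus one depends on whether the rejection rule is depth > t or depth ≥ t:

  alpha    <- 0.05
  beta.hat <- mean(depth)                 # 5.15 for this data set
  t        <- qpois(1 - alpha, lambda = beta.hat)

  sum(depth > t)           # loci flagged as having elevated coverage
  alpha * length(depth)    # false positives expected by chance alone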


The problem of multiple comparisons has been widely addressed in the literature. We adopt the most basic scheme, known as the Bonferroni correction, by defining:

$$\alpha' = \frac{\alpha}{n},$$

where n is the number of tests, and using α′ instead of α when computing the threshold t. One can show that this scheme guarantees that the probability of a type-1 error across the entire set of tests is limited to α. Using α′ the adjusted threshold is 20, resulting in 35 loci with significantly elevated coverage.

Before rushing off to publish our findings, it is worthwhile to check that the model we used for inference does indeed describe the distribution of the data in a satisfactory manner. An easy way to identify count data that do not follow the Poisson distribution is by comparing the empirical mean and variance: a Poisson distribution would result in an estimated mean and variance which are insignificantly different from one another. The empirical variance of the data is 9.14, and since we have 34,894 observations it is easy to see that the estimated mean and variance are significantly different from one another. We therefore choose to explore the extensions of the Poisson model described in this chapter.

For example, it has been suggested that PCR is less effective for regions which are GC rich or GC poor. To explore such a hypothesis, we plot the depth vs. GC content, computed in windows of 101 bp (Fig. 1), and observe that there is a considerable dependence between GC content and average coverage depth, and that the relationship is indeed unimodal with the mode roughly at a GC content of 0.5, as predicted by our biological understanding of the PCR mechanism (see [9] and Note 3 for more details). Such a dramatic effect of GC content can lead to dire results if left unaccounted for. For example, regions which are GC rich or GC poor might appear as if their copy number is lower. Moreover, since the effect of GC content might differ from one run to the other, failing to account for it might result in false calls of differential expression and even false calls of association in case–control studies.

To account for the effect of GC content, we adopt the Poisson regression model:

$$\log(\beta_i) = \alpha_0 + \alpha_1 GC_i + \alpha_2 GC_i^2,$$

where the inclusion of the local GC content as a second-degree polynomial allows the model to capture the unimodal nature of the relationship between coverage and GC content. We estimate the coefficients using the GLM implementation available as part of the statistical computing language R [7] (see Note 4). Figure 2 displays the fitted expected coverage depth as a function of GC content.
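In R, the Bonferroni-corrected, GC-aware analysis of this paragraph can be sketched as follows, again assuming per-locus vectors depth and gc (our own placeholder names):

  alpha   <- 0.05
  alpha.b <- alpha / length(depth)        # Bonferroni-corrected level

  # Poisson regression on a second-degree polynomial of GC content
  fit <- glm(depth ~ gc + I(gc^2), family = poisson)

  # locus-specific expected rates and locus-specific thresholds
  beta.i <- fitted(fit)
  t.i    <- qpois(1 - alpha.b, lambda = beta.i)
  sum(depth > t.i)         # loci with significantly elevated coverage

  # a quick overdispersion check: under the Poisson model these should agree
  c(mean(depth), var(depth))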


Fig. 1 Scatter plot of depth vs. GC content, displaying a unimodal relationship between GC content and depth. The bold line is the average coverage at each observed value of the GC content


Fig. 2 Observed mean coverage and fitted mean coverage using a second-degree polynomial of GC content as the covariates of a Poisson regression


Fig. 3 Fit using a sixth-degree polynomial

A careful inspection of the fit might give rise to the suspicion that a second-degree polynomial does not fully capture the functional nature of the relationship. Some trial and error leads us to adopt a sixth-degree polynomial (Fig. 3). For further reading about variable selection and manipulation in the context of linear models and generalized linear models, we refer the reader to [10].

We can now repeat the same inference procedure, only this time we account for the GC content at each locus. Therefore, there is not one threshold used globally; rather, for each locus i the predicted rate is βi(GC) and we find ti such that P(Pois(βi) > ti) equals α′, or is as close to it as possible given the discrete nature of the Poisson distribution (Fig. 4). This analysis reduces the number of loci with significantly elevated coverage to 34 loci. It is interesting to note that this set of loci is not a subset of the loci obtained using the previous model: two loci are removed and one locus is added. The reason is that the change in the threshold is not monotone: for regions which are GC rich or GC poor, the threshold is lower than in the naïve Poisson model, whereas for midrange GC levels, the threshold is higher (see Fig. 4).

To study the variance of the distribution, or more precisely to make sure that the Poisson model captures the variance structure as well as the mean, we compare the expected variance according to the fitted Poisson model to the observed variance of the data. As can be


Fig. 4 Thresholds for rejecting the null hypothesis of normal coverage according to the four models: naïve Poisson, Poisson regression, naïve negative-binomial, and negative-binomial regression. Note that for the naïve Poisson and naïve NB models, the threshold does not depend on GC content, and therefore it is constant

seen in Fig. 5, the variance according to the Poisson model is consistently lower than the observed variance, which might lead to an increased number of false positives. We therefore adopt the negative-binomial model, in an attempt to account for the overdispersion. We first fit the naïve negative-binomial model, yielding the estimates μ̂ = 5.15 and θ̂ = 10.13, a threshold of 26 reads, and a total of 12 loci with significantly elevated coverage. Introducing the covariates leads to a significant increase of θ̂ to 13.05, indicating that some of the overdispersion has been explained by accounting for the effect of the GC content (see Note 4). Under this model there are only 11 loci with significantly elevated coverage. Thresholds and variances predicted by these models are displayed in Figs. 4 and 5, respectively, demonstrating a much better fit of the variance.

At this point, were this a true study, we would probably focus on analyzing these 11 loci, instead of the 35 obtained by our initial analysis. Further analysis should include controlling for additional covariates which might affect coverage, careful inspection of the high-coverage loci (for example, checking whether they are clustered together), and inclusion of additional data such as the depth along the rest of the chromosome, the quality of the reads and of the alignment, accounting for multiple alignments, and the use of more elaborate algorithms to identify CNVs.



Fig. 5 Observed and expected variances according to the different models

6 Notes

6.1 Obtaining Depth from BAM Files

To transform a .bam file into a list of coordinates and their corresponding depth, we used SAMTools' depth option. Note that the depth option computes the depth without applying any quality filters. The more popular mpileup option applies various default filters.

6.2 Dealing with 0-Depth Coordinates

SAMTools outputs the depth of every covered genomic coordinate. Therefore, coordinates which are not covered do not appear in the output. To prevent downstream errors, it is recommended to fill in these missing coordinates and specify that the depth at these coordinates is 0.

6.3 Computing GC Content

GC content at each window should be computed using the reference genome. It is important to note that the reference genome might contain “unknown” bases, typically marked by N (rather than A, C, G or T). We recommend ignoring Ns in the process of computing GC content. It is important, however, to keep track of regions where there is a high proportion of Ns. Such regions should be excluded from the analysis.

6.4 Fitting GLMs in R

Poisson regressions are native to R and are fit by running:
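A minimal sketch of the standard calls, assuming the per-locus counts and GC values are stored in vectors named depth and gc (placeholder names, not part of the original text):

  my.poisson.model <- glm(depth ~ gc + I(gc^2), family = poisson)
  summary(my.poisson.model)

  # negative-binomial regression is provided by the MASS library
  library(MASS)
  my.nb.model <- glm.nb(depth ~ gc + I(gc^2))
  summary(my.nb.model)
  my.nb.model$theta    # estimated overdispersion parameter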

    1{print $1,$2+1,$3,"+", $4}' \
      exome/targets.bed >> exome/targets.interval_list

  cat genome/b37.dict > exome/baits.interval_list
  awk 'BEGIN{OFS="\t"}{gsub("chr","",$1); print $1,$2+1,$3,$6,$4}' \
    exome/BED/029368_D_BED_20111101.bed \
    >> exome/baits.interval_list

3.2 Mapping Reads to Reference

1. Generate alignments in suffix array coordinates for each end separately, using six parallel threads to speed up (see Note 9).
   for RUN in SRR070556 SRR070839; do
     mkdir -p mapping/$RUN
     bwa aln -t 6 -q 10 -f mapping/$RUN/${RUN}_1.sai genome/b37 \
       fastq/${RUN}_1.filt.fastq.gz
     bwa aln -t 6 -q 10 -f mapping/$RUN/${RUN}_2.sai genome/b37 \
       fastq/${RUN}_2.filt.fastq.gz
   done

2. Create alignment in SAM format for each run of paired-end reads (see Note 10).
   for RUN in SRR070556 SRR070839; do
     bwa sampe -f mapping/${RUN}/$RUN.sam \
       -r "@RG\tID:$RUN\tSM:NA20322\tLB:NA20322\tPL:ILLUMINA" genome/b37 \
       mapping/${RUN}/${RUN}_1.sai mapping/${RUN}/${RUN}_2.sai \
       fastq/${RUN}_1.filt.fastq.gz fastq/${RUN}_2.filt.fastq.gz
   done

3. Fix mate information in SAM output and convert to binary format (see Note 11).
   for RUN in SRR070556 SRR070839; do
     java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/FixMateInformation.jar \
       INPUT=mapping/$RUN/$RUN.sam OUTPUT=mapping/$RUN/$RUN.fixed.bam \
       TMP_DIR=mapping/$RUN/tmp VALIDATION_STRINGENCY=SILENT
   done

4. Reorder the contigs and sort aligned reads by their mapped coordinates within contigs (see Note 12).
   for RUN in SRR070556 SRR070839; do
     java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/ReorderSam.jar \
       INPUT=mapping/$RUN/$RUN.fixed.bam \
       OUTPUT=mapping/$RUN/$RUN.reordered.bam \
       REFERENCE=genome/b37.fasta \
       TMP_DIR=mapping/$RUN/tmp \
       VALIDATION_STRINGENCY=SILENT
     java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/SortSam.jar \
       INPUT=mapping/$RUN/$RUN.reordered.bam \
       OUTPUT=mapping/$RUN.sorted.bam TMP_DIR=mapping/$RUN/tmp \
       SORT_ORDER=coordinate CREATE_INDEX=true \
       VALIDATION_STRINGENCY=SILENT
   done

This step also creates an index file (.bai) for each sorted BAM file.
5. Merge the two lane-level alignments into the sample level.
   java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/MergeSamFiles.jar \
     INPUT=mapping/SRR070556.sorted.bam INPUT=mapping/SRR070839.sorted.bam \
     OUTPUT=mapping/NA20322.merged.bam CREATE_INDEX=true \
     SORT_ORDER=coordinate USE_THREADING=true \
     TMP_DIR=mapping/tmp VALIDATION_STRINGENCY=SILENT

The NA20322.merged.bam is the sorted and indexed BAM file for the sample.

3.3 Post-processing on Aligned Reads

1. Mark up duplicated fragments detected from aligned reads (see Note 13).
   mkdir sam_qc
   java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/MarkDuplicates.jar \
     INPUT=mapping/NA20322.merged.bam OUTPUT=mapping/NA20322.dupmarked.bam \
     CREATE_INDEX=true METRICS_FILE=sam_qc/NA20322.duplication_metrics \
     TMP_DIR=mapping/tmp VALIDATION_STRINGENCY=SILENT

This step also generates duplication metrics in the sam_qc directory.
2. Realignment around indels using GATK (see Note 14). This is done by first generating targets for realignment, and then performing realignment within target intervals.
   java -XX:ParallelGCThreads=2 -Xmx4g -jar jar/GenomeAnalysisTK.jar \
     -l INFO -et STDOUT -T RealignerTargetCreator \
     -R gatk/b37.fasta -L exome/targets.interval_list \
     -I mapping/NA20322.dupmarked.bam \
     --known gatk/1000G_phase1.indels.b37.vcf \
     --known gatk/Mills_and_1000G_gold_standard.indels.b37.vcf \
     -o mapping/NA20322.realign.interval_list
   java -XX:ParallelGCThreads=2 -Xmx4g -jar jar/GenomeAnalysisTK.jar \
     -l INFO -et STDOUT -T IndelRealigner \
     -R gatk/b37.fasta -I mapping/NA20322.dupmarked.bam \
     -o mapping/NA20322.realigned.bam \
     -targetIntervals mapping/NA20322.realign.interval_list \
     --knownAlleles gatk/1000G_phase1.indels.b37.vcf \
     --knownAlleles gatk/Mills_and_1000G_gold_standard.indels.b37.vcf \
     --consensusDeterminationModel USE_READS

3. Recalibrate base quality scores (see Note 15). This is done by first tabulating the effect estimates of covariates at mismatches outside known SNPs, and then applying corrections over all base calls.
   java -Xmx4g -XX:ParallelGCThreads=2 -jar jar/GenomeAnalysisTK.jar \
     -l INFO -et STDOUT -T CountCovariates -nt 4 \
     -R gatk/b37/b37.fasta -knownSites gatk/b37/dbsnp_135.b37.vcf \
     -I mapping/NA20322.realigned.bam --default_platform Illumina \
     -cov ReadGroupCovariate -cov QualityScoreCovariate \
     -cov CycleCovariate -cov DinucCovariate \
     -recalFile mapping/NA20322.pre_recal.csv
   java -Xmx4g -XX:ParallelGCThreads=2 -jar jar/GenomeAnalysisTK.jar \
     -l INFO -et STDOUT -T TableRecalibration \
     -R gatk/b37/b37.fasta --default_platform Illumina \
     -I mapping/NA20322.realigned.bam -o mapping/NA20322.recal.bam \
     -recalFile mapping/NA20322.pre_recal.csv

The resulting NA20322.recal.bam file is now suitable for variant calling. Before proceeding to downstream analysis, we first need to ensure the data quality.


3.4 Quality Control on Alignment


1. Collect various quality control statistics from the processed alignment file.
   java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/CollectMultipleMetrics.jar \
     INPUT=mapping/NA20322.recal.bam OUTPUT=sam_qc/NA20322 \
     REFERENCE_SEQUENCE=exome/b37.fasta \
     PROGRAM=CollectAlignmentSummaryMetrics \
     PROGRAM=QualityScoreDistribution \
     PROGRAM=MeanQualityByCycle \
     PROGRAM=CollectInsertSizeMetrics \
     VALIDATION_STRINGENCY=SILENT

It calculates the alignment statistics, insert size distribution, quality score distributions, and mean quality score for each machine cycle. The main results for our sample NA20322 are shown in Fig. 2a–c (see Note 16).
2. Evaluate the depth of coverage over target regions.
   java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/CalculateHsMetrics.jar \
     INPUT=mapping/NA20322.recal.bam \
     OUTPUT=sam_qc/NA20322.hybrid_selection_metrics \
     BAIT_INTERVALS=exome/baits.interval_list \
     TARGET_INTERVALS=exome/targets.interval_list \
     PER_TARGET_COVERAGE=sam_qc/NA20322.per_target_coverage \
     REFERENCE_SEQUENCE=exome/b37/b37.fasta \
     VALIDATION_STRINGENCY=SILENT

It calculates various statistics evaluating the exome capture performance and sequencing completeness (shown in Fig. 2a), and also the depth of coverage for each target in a separate file.
3. We also need to test whether the sequenced sample matches its own identity. The 1000 Genomes Project has genotyped all of the phase 1 samples on the Illumina Omni2.5 SNP array. The genotypes in VCF format are part of the GATK bundle. The sample information can be downloaded from the Project FTP (see footnote 1) and copied to the k1g directory. The sample NA20322 belongs to the ASW population, so we first extract SNP genotypes for the ASW population within the autosomal targets of the exome kit.

1 The sample information for phase 1 of the 1000 Genomes Project can be downloaded from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.panel.

   awk '$2=="ASW"{print $1}' k1g/phase1_integrated_calls.20101123.ALL.panel \
     > k1g/ASW.keep
   awk '$1 ~ /^[0-9]+$/' exome/targets.bed > exome/targets.auto.bed
   vcftools --vcf gatk/1000G_omni2.5.b37.vcf --keep k1g/ASW.keep \
     --remove-filtered-all --bed exome/targets.auto.bed \
     --recode --out k1g/ASW.exome

4. Then run verifyBamID to detect sample swapping or contamination.
   verifyBamID --vcf k1g/ASW.exome.recode.vcf \
     --bam mapping/NA20322.recal.bam --out sam_qc/idcheck --verbose \
     --maxDepth 2000 --precise --minQ 20 --maxQ 100 --minMapQ 17

The sample-level results can be found in the file idcheck.selfSM. Both the CHIPMIX and FREEMIX statistics are less than 0.001 (see Note 17). This suggests that our sample matches its reported identity, and there is no evidence of contamination from other samples.

4 Notes

1. We choose BWA for mapping reads in exome sequencing based on the following considerations. First, medical resequencing projects require reads to be mapped in a way that maximizes the sensitivity for variant discovery. They are typically implemented with paired-end sequencing with ~100 bp read length. For this purpose, the gapped alignment supported by BWA is essential [11]. It was also shown to have higher accuracy and sensitivity than other tools in the comparison by Bao et al. [7]. Second, BWA supports the major features of the two widely used platforms for exome sequencing: Illumina and SOLiD. Both platforms typically have a higher error rate at the 5′ end, and the SOLiD system also provides an error-correction strategy. Third, we also need to strike a balance between accuracy and efficiency. For example, stampy and novoalign were both shown to have lower mapping error than BWA, but their throughputs were also an order of magnitude lower [12]. A typical human exome data set contains over 10 gigabases of sequence, which will take ~10 CPU days to align using stampy but less than one day using BWA running four parallel threads. Although speed may be less of an issue if you have abundant computational resources, previous studies also showed that increased accuracy like novoalign did not show


much improved performance in SNP calling as compared with mapping with BWA [5]. So we considered BWA as a reasonable choice. But the effect of mapping accuracy on indel calling is less clear and remains to be investigated in the future. The choice of read mapping software is generally influenced by the factors like the biological question, the sequencing platform, and trade-offs between speed and performance. In comparative genomics, when mapping reads to a diverged species, tools that retain sensitivity given higher rate of mismatches should be considered. In applications like ChIP-seq, reads with typical length of ~50 bp were generated; the subsequent analysis focuses on the peak calling from aligned read depth; then gapped alignment may not be essential, and priorities can be given to run time efficiency. Tools differ in their behavior in handing repetitive sequences. mrsFAST is the only tool to report all possible mapping positions given the number of mismatches, which is critical for detecting structural variation [13]. Different tools also have different support for reads generated by Roche 454 platform, which are of intermediate length (200–400 bp) and have an increased indel error in the presence of homopolymers. All the above considerations need to be synthesized in choosing a read mapping tool. 2. In our case of human genetics study, it is recommended to use the reference sequence adopted by the 1000 Genomes Project. This version was mainly based on the latest human genome reference sequences maintained by the Genome Reference Consortium (GRC). The 1000 Genomes version of the reference contained all 24 assembled chromosomes (1–22, X, Y) and 59 unplaced contigs (collectively referred to as the primary assembly) of the GRCh37 release, and replaced the mitochondrion sequence with the revised Cambridge Reference Sequences (Genbank: NC_012920). We often casually refer to the same version as the hg19 in UCSC’s genome browser, which was also derived from GRCh37. The sequences on the primary assemblies are the same, except that the UCSC has its own naming conventions on chromosome and contigs. But the UCSC used the old version of mitochondrion (Genbank: NC_001807) and also included alternative representations of nine genomic loci like MHC. Because those alternative loci contain long flanking sequences highly homologous to the primary assembly, we do not include them in our reference; otherwise, it will artificially reduce mapping qualities for reads mapped to those positions. Although we used human as example, the general principle for choosing a reference is the same for other species: the reference sequence should be the best known haploid representation of a genome from which the short reads were presumed to be generated. The unlocalized and unplaced contigs should also be included, but not


alternative haplotypes of known loci. We also noted that two PARs on the chromosome X and Y were redundant to each other, so one copy of them will be masked out before indexing the sequences (see step 2 in Subheading 3.1). 3. The 1000 Genomes Project kept updating the raw data regularly on two of FTP mirror sites at EBI (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and NCBI (ftp://ftp-trace.ncbi. nih.gov/1000genomes/ftp/). Information about each data file in the project was recorded and updated in the sequence index files (can be found under sequence_indices directory). From that file we can learn that NA20322 was sampled from a population of African Ancestry in Southwest USA. The exome of NA20322 was sequenced in Washington University Genome Sequencing Center (WUGSC) on two lanes of Illumina Genome Analyzer II. When samples were sequenced on multiple lanes, we suggest to treat the raw sequences separately (see also Note 10). The exomes sequenced in WUGSC were captured using Agilent SureSelect V2 kit.2 4. Baits are the nomenclature of Agilent Technology; they are biotinylated 170mer RNA capture probes. Exome enrichment kits from other manufacturers used DNA molecules with different lengths as the capture probes. Their information can be found on the manufacturers’ Web sites. The downloaded target and bait definition files are in BED format.3 Each row represents a target or a bait interval on the UCSC’s hg19 genome coordinates. It should be noted that the Agilent V2 exome kit only included target regions on chromosomes 1–22, X, and Y; so it is equivalent to the 1000 Genomes version of reference after stripping out the chromosome name prefix “chr.” The first three columns give the chromosome name, and start and end positions. The intervals are zero-based half open. 5. The -a switch tells the program to index the genome using bwtsw algorithm, which is the only option for the size of human genome. For mapping reads generated by SOLID platform, the index should be generated by adding -c option. Under color space mode, three additional index files will be generated: b37.nt.amb, b37.nt.ann, and b37.nt.pca. Please also be noted that the support for color space has been discontinued after BWA v0.6.0. 6. The FastQ format is most widely used to store short reads. Most sequencing platforms can convert their output files to FastQ format. It contains information for one read every four lines. The first line always starts with “@” and specifies the name

2 Information about exome targets of the 1000 Genomes Project can be found in the following document on the Project FTP: technical/reference/exome_pull_down_targets/README.20120518.exome.consensus.
3 UCSC BED format is documented in http://genome.ucsc.edu/FAQ/FAQformat#format1.


of the read; the third line contains either only “+” or followed by the read name starting after “@” in the first line. Paired-end (PE) reads are stored in a pair of files, with reads kept in the same sorted order in both files. The corresponding read names must be identical up to the tailing “/1” or “/2.” In PE sequencing, reads are generated from both ends of a captured DNA fragment with known lengths; such information can be used to constrain the mapping of both ends. Reads are paired based on their names during the mapping process. When read names contain white spaces, only the part before first white space will be used by the mapping tools. The second line contains the actual sequence of the read. They can be either in bases or in color space. The base qualities are encoded as ASCII characters in the last line. The base qualities are Phred scaled (ten times the minus log10) error probabilities with some added offsets. There are three different quality encodings for FastQ files [14]. The traditional Sanger scheme used ASCII characters 33–126 to encode qualities from 0 to 93. The Solexa score has now been considered as legacy and rarely used these days. FastQ files produced by the earlier version of Illumina (Illumina 1.3+) used Phred score 0–62 encoded in ASCII 64–126. There is also a subversion of FastQ (Illumina 1.5+), which is the same as Illumina 1.3+, but do not use ASCII 64,65, and ASCII 66 was reserved to mean “the base should not be considered in downstream analysis”. The latest format (Illumina1.8 or later) has switched back to the Sanger standard. Most mapping tools assume by default that the quality scores are in Sanger encoding scheme. Mapping software like BWA also support Illumina 1.3+ scheme as a command line option. In case a quality score is not supported, FastQ format conversion can be achieved by several tools (e.g., fq_all2std.pl script in MAQ,4 NGSQC Toolkit5). Incorrect specification of base quality encoding will result in erroneous mapping if the tool used base quality score to calculate the mapping quality. It will also affect the downstream analysis (e.g., variant calling) when quality score is incompatible with the presumed encoding. 7. The FastQC’s report shows a summary of each analysis module and indications of whether the results are normal or unusual. Some optional preprocessing steps can be taken to filter out reads with low average base qualities, or to trim out low-quality bases at 50 ends. We skipped these steps in our projects, because filtering step has been done by the Illumina base calling pipelines, and BWA contains option to automatically trim lowquality bases at the ends. And the example FastQ used here 4 5

4 MAQ can be downloaded from http://maq.sourceforge.net/.
5 NGS QC Toolkit: http://www.nipgr.res.in/ngsqctoolkit.html.


has been filtered by the 1000 Genomes Consortium. If the preprocessing steps are taken, care must be taken to preserve the paired-end information in the FastQ files (see also Note 6). 8. The Picard interval list format contains a sequence dictionary in the header followed by one line per interval (see footnote 6). The position intervals are one-based, unlike UCSC's BED format. 9. BWA's default parameter settings work well for aligning exome sequencing reads. We discuss some options here because many of them have counterparts in other tools, and you may need to tweak them in other applications. (a) Mismatches: BWA accepts or rejects an alignment based on the number or fraction of mismatches between the read and the genomic position, controlled by the -n option. Other tools may also use the sum of quality scores at mismatch positions. The default parameters are tuned for genomes with low polymorphism rates. (b) Seeding: A seed refers to the first few tens of base pairs of a read. This part of the read is expected to contain fewer errors. Many tools use seeds in their heuristic search to maximize performance. BWA's -l and -k options control the seeding strategy, but they are disabled by default. (c) Color space: In reads generated by SOLiD systems, every base except the first is encoded as one of four numbers, depending on the value of the previous base. Mapping directly in color space has the advantage of distinguishing sequencing errors from true polymorphisms. BWA's -c option enables this mode, given reads in color space. (d) Base trimming: BWA provides a -q option to trim low-quality bases at the 5′ end; its meaning is explained in Fig. 1. (e) Quality score: BWA assumes that the quality scores are in Sanger encoding; use -I if the base qualities are in the Illumina 1.3+ encoding. 10. BWA separates index searching and read alignment into two steps. When paired-end reads are used, the distribution of fragment lengths is estimated from the data, unless there are too few reads; in the latter case, users can specify the minimum and maximum insert sizes. Most other tools behave similarly. It is important in this step to specify the read group information with the -r option. This line is put in the generated SAM file and specifies the identity of the read group, library, and sample. A read group refers to a set of reads generated by a single lane of a single sequencing run. The reads

6 The Picard interval list format is explained at http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK#Intervals.


Fig. 1 Explaining BWA's read trimming option. Shown in the figure is a schematic view of the quality scores along the positions of a read. Given a threshold q, if there are positions at the 5′ end with quality scores less than q, BWA determines the bases to trim from the leftmost position at which the area shown in orange is no less than the area shown in cyan

within the same read group will have similar error profiles but may differ systematically from reads in other read groups. This information is used in quality score recalibration (see also Note 15). The library tag (LB) is also critical for marking duplicated reads (see also Note 13). If a mapping tool does not accept read group information, it can be added to the header of the output file afterwards using the ReplaceSamHeader program in the Picard package. 11. BWA's alignment output is in SAM format (see footnote 7), which has become the standard for data exchange [15]. Most other mapping tools either directly support SAM output or provide auxiliary scripts to convert their native output to SAM format. To reduce storage, SAM files are usually compressed into a binary format (BAM) and can be sorted and indexed in a way that facilitates range queries. It should be emphasized that SAM files are software dependent. In addition to differing mapping accuracies, tools also have their own idiosyncrasies when writing SAM files. For example, the SAM specification requires that the record for one read contain some information about its mate pair, which is also stored in the record of the mate. Although this seems redundant, it can be helpful in many cases. This redundancy also creates some of the inconsistencies often found in BWA's output. Broken mate information implies nothing about the mapping quality; it just means that the software that aligned these reads did not set this information correctly. The FixMateInformation program from Picard will attempt to fill in these attributes.

7 Details can be found at http://samtools.sourceforge.net/SAM1.pdf.
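As an illustration, a minimal sketch of running FixMateInformation in the per-tool jar layout of Picard 1.x is shown below; the jar directory and the BAM file names are hypothetical and should be adapted to your own setup:

# $PICARD_DIR, input and output names are placeholders for this example
java -jar $PICARD_DIR/FixMateInformation.jar \
    INPUT=mapping/NA20322.merged.bam \
    OUTPUT=mapping/NA20322.fixmate.bam \
    VALIDATION_STRINGENCY=SILENT

The VALIDATION_STRINGENCY=SILENT setting mirrors the relaxed validation discussed in the next paragraph and prevents Picard from aborting on the format quirks described there.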


There are also some discrepancies from the SAM standard that cannot be corrected by the Picard tools. For example, BWA concatenates the reference contigs before indexing; if a read happens to (erroneously) map to a position spanning the edge of one contig onto another, it will have a mapping position and quality but be marked as unmapped. When the file is parsed by Picard, this triggers a "MAPQ should be 0 for unmapped read" error, since Picard is very picky about the file format. As we believe that BWA's SAM output is essentially correct, it does no harm to leave it as is, so we disable the error checking by setting the Picard command line option VALIDATION_STRINGENCY=SILENT. The Picard package also comes with a utility program, ValidateSamFile, that can be used to check whether a SAM file strictly conforms to the standard. Picard decides whether to write SAM or BAM by examining the file extension of the output file, so format conversion is done simply by giving the appropriate extension. 12. Reordering the contigs to the same order as in the sequence dictionary file is required to prepare files that can be read by GATK. Individual BAM files must be sorted and indexed before they can be merged. 13. Duplicated DNA sequencing reads are a technical artifact found in many applications. Duplicates typically arise during PCR, especially when the total number of molecules in the library is small. The detection of duplicated reads relies on the alignment positions and works better for paired-end reads, because reads mapped to the same positions at both ends are much more likely to be duplicates than to share identical ends by chance. Duplicate removal is a necessary step for variant calling, because it removes the non-independence among reads caused by amplification bias. It can also be an optional step for ChIP-seq if the paired-end approach is used. In RNA-seq, however, removing duplicated reads will attenuate the true signals; instead, methods in expression analysis have been developed to account for this and other biases [16]. MarkDuplicates works at the library level, which is determined from the LB field of the read group information. Reads mapped to the same positions but carrying different LB tags are not considered duplicates. 14. Realignment is a recommended post-processing step for variant calling, as indels at the ends of aligned reads often lead to false-positive SNP calls. These indel artifacts arise because mapping tools process one read at a time. The realignment step adjusts the indel positions by using all mapped reads so that the overall number of mismatches is minimized. 15. Quality scores generated by the sequencing machine may not be accurate and are influenced by several known covariates. The recalibration step removes those effects to make the quality scores more


Fig. 2 (a) A selected subset of quality control statistics. (b) The distribution of insert sizes. (c) The mean quality score across machine cycles before and after recalibration. (d) Visualization of alignments using Integrated Genome Viewer (IGV)

reflective of true error rate. Special options should be turned on for SOLiD reads: --solid_recal_mode REMOVE_REF_BIAS --solid_nocall_strategy PURGE_READ.
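To make the recalibration step concrete, a sketch along the lines of the GATK 2.x command line interface is shown below; the tool jar, the known-sites VCF, and the BAM file names are assumptions to be replaced with your own paths, and the exact syntax should be checked against the documentation of your GATK version:

# hypothetical paths; dbsnp.b37.vcf stands in for a known-sites resource
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R b37.fasta \
    -I mapping/NA20322.realigned.bam -knownSites dbsnp.b37.vcf \
    -o mapping/NA20322.recal.grp
java -jar GenomeAnalysisTK.jar -T PrintReads -R b37.fasta \
    -I mapping/NA20322.realigned.bam -BQSR mapping/NA20322.recal.grp \
    -o mapping/NA20322.recal.bam

The first command tabulates the error covariates per read group, and the second writes a new BAM with recalibrated base qualities (named here to match the NA20322.recal.bam file used in Note 16).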

16. Figure 2 shows the expected results for exome sequencing. We consider the sequencing complete if more than 80 % of the targets are covered at 20× or higher; not all targets can be adequately covered. The depth of coverage is mainly influenced by the local GC content, which can be examined with the CollectGcBiasMetrics program in Picard. The proportion of aligned reads should typically be over 90 %; most unaligned reads have poor base qualities. In our example, the aligned proportion is very high because many low-quality reads have already been filtered out. If an unusually low proportion of mapped reads is observed, you should be wary of contamination from other species. To test this possibility, FastQ Screen (see footnote 8) can be used to screen the unmapped reads across

8 FastQ Screen: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/.


several known species and determine the mapping proportions. The unmapped reads can be extracted from the alignment and reformatted into a FastQ file using the following command:

samtools view -f 0x0004 mapping/NA20322.recal.bam | \
    awk '{print "@"$1"\n"$10"\n+\n"$11;}' > unmapped.fastq

17. verifyBamID was designed to detect sample swapping or mixing. The CHIPMIX statistic estimates an unusual proportion of non-reference alleles, given known genotypes at known sites and the corresponding population allele frequencies. In the absence of known SNP genotypes, verifyBamID estimates the proportion of non-reference alleles that is incompatible with the reads coming from a single individual (the FREEMIX statistic).
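A typical invocation is sketched below. The option names reflect our recollection of the verifyBamID command line and, like the file names, are assumptions that should be checked against the tool's own documentation:

# site VCF and output prefix are placeholders for this example
verifyBamID --vcf omni25.b37.sites.vcf --bam mapping/NA20322.recal.bam \
    --out NA20322.verify

The output files then report the FREEMIX (and, when genotypes are supplied, CHIPMIX) estimates described above.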

Acknowledgments Xueya Zhou and Suying Bao have contributed equally to this work. This work was funded by grants from the National Basic Research Program of China (No. 2012CB316504) to X.G.Z. and NSFC grants (No. 81271226 to Y.Q.S. and No. 91010016 to X.G.Z.), and from the Research Grants Council of Hong Kong (HKU775208M/HKU777212) and the Research Fund for the Control of Infectious Diseases of Hong Kong (No.11101032) to Y.Q.S. References 1. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J (2011) Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. doi:10.1038/nrg3031 2. Ozsolak F, Milos PM (2011) RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12(2):87–98. doi:10.1038/ nrg2934 3. Park PJ (2009) ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10(10):669–680. doi:10.1038/ nrg2641 4. Ning Z, Cox AJ, Mullikin JC (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11(10):1725–1729. doi: 10.1101/gr.194201 5. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11(5):473–483. doi:10.1093/bib/bbq015 6. Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly.

Nat Methods 6(11 Suppl):S6–S12. doi:10.1038/nmeth.1376 7. Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ (2011) Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet 56(6):406–414. doi:10.1038/ jhg.2011.43 8. Holtgrewe M, Emde AK, Weese D, Reinert K (2011) A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12:210. doi:10.1186/1471-2105-12-210 9. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111. doi:10.1093/bioinformatics/btp120 10. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27 (11):1571–1572. doi:10.1093/bioinformatics/btr167 11. Krawitz P, Rodelsperger C, Jager M, Jostins L, Bauer S, Robinson PN (2010) Microindel

detection in short-read sequence data. Bioinformatics 26(6):722–729. doi:10.1093/bioinformatics/btq027 12. Lunter G, Goodson M (2011) Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res 21(6):936–939. doi:10.1101/gr.111120.110 13. Hach F, Hormozdiari F, Alkan C, Birol I, Eichler EE, Sahinalp SC (2010) mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Methods 7(8):576–577. doi:10.1038/nmeth0810-576 14. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6):1767–1771. doi:10.1093/nar/gkp1137 15. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078–2079. doi:10.1093/bioinformatics/btp352 16. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol 12(3):R22. doi:10.1186/gb-2011-12-3-r22 17. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. doi:10.1093/bioinformatics/btp324


18. Langmead B, Salzberg SL (2012) Fast gappedread alignment with Bowtie 2. Nat Methods 9(4):357–359. doi:10.1038/nmeth.1923 19. Langmead B, Trapnell C, Pop M, Salzberg S (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25 20. Homer N, Merriman B, Nelson SF (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS One 4(11):e7767. doi:10.1371/journal.pone.0007767 21. Homer N, Merriman B, Nelson S (2009) Local alignment of two-base encoded DNA sequence. BMC Bioinformatics 10(1):175 22. Hormozdiari F, Hach F, Sahinalp SC, Eichler EE, Alkan C (2011) Sensitive and fast mapping of di-base encoded reads. Bioinformatics 27(14):1915–1921. doi:10.1093/bioinformatics/btr303 23. Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M (2009) SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol 5(5):e1000386. doi:10.1371/ journal.pcbi.1000386 24. David M, Dzamba M, Lister D, Ilie L, Brudno M (2011) SHRiMP2: sensitive yet practical short read mapping. Bioinformatics 27(7):1011–1012. doi:10.1093/bioinformatics/btr046 25. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18(11): 1851–1858. doi:10.1101/gr.078212.108

Chapter 7
Profiling Short Tandem Repeats from Short Reads
Melissa Gymrek and Yaniv Erlich

Abstract
Short tandem repeats (STRs), also known as microsatellites, have a wide range of applications, including medical genetics, forensics, and population genetics. High-throughput sequencing has the potential to profile large numbers of STRs, but cumbersome gapped alignment and STR-specific noise patterns hamper this task. We recently developed an algorithm, called lobSTR, to overcome these challenges and to accurately profile STRs from short reads. Here we describe how to use lobSTR to call STR variations from high-throughput sequencing datasets and to diagnose the quality of the calls.

Key words: Short tandem repeats, Microsatellites, Whole-genome sequencing, Variant calling, Personal genomes, lobSTR

1 Introduction

Short tandem repeats (STRs), or microsatellites, are a class of genetic variation consisting of repeated elements of two to six base pairs. Due to their repetitive structure, these loci are prone to polymerase slippage, which results in frequent mutations that make them highly polymorphic in the number of repeat copies [1]. At least 30 genetic diseases, including Huntington's disease [2] and Fragile X syndrome [3], are known to result from STR repeat expansions [4]. The high variability of STRs makes them highly informative markers for applications such as linkage studies [5], forensics [6], and genetic genealogy [7]. STR variations pose a significant challenge to mainstream aligners [8]. First, large allelic expansions and contractions that differ from the reference allele present as a gapped alignment problem. Aligners such as the widely used BWA [9] exhibit a significant trade-off between insertion and deletion tolerance and run time [10]. Second, not all reads aligning to an STR are informative: in order to determine the number of repeat copies present, a read must entirely span an STR with sequence overlap in both the upstream and downstream flanking regions. Third, the PCR amplification



Fig. 1 An overview of the steps for profiling STRs using lobSTR. Step 0: The indexer generates a separate reference index for each STR motif present in the genome. Step 1: lobSTR detects STR-containing reads and determines their STR motif. Step 2: lobSTR aligns the non-repetitive flanking regions to the STR reference. Step 3: lobSTR alignments are checked for quality and visualized. Step 4: Generation of STR allelotype calls. Step 5: Allelotypes are checked for quality and prepared for downstream processing

step involved in sample preparation introduces stutter noise due to polymerase slippage, which can add or delete repeat copies from the true allele. lobSTR is a focused algorithm to profile STRs from short reads while mitigating these problems using a three-step approach [11] (Fig. 1). In the first step, called sensing, lobSTR quickly scans the input sequencing file and filters only informative reads that fully span the STR locus. Then, it parses the read into an STR region and non-STR flanking regions, and extracts the repeat motif from the STR region. In the next step, called alignment, lobSTR aligns only the non-STR flanking regions to the reference in order to avoid the


gapped-alignment problem. To increase the specificity of the alignment, lobSTR only searches in the vicinity of STR loci whose repeat motif matches the motif that was detected in the sensing step. Finally, in the allelotyping step, lobSTR uses statistical learning to distinguish the true allelotype at each STR locus from PCR stutter and other noise sources. In our previous experiments [11], lobSTR showed high accuracy in profiling STR variations from Illumina data that far exceeded the performance of mainstream aligners. We note that lobSTR cannot replace mainstream aligners to genotype SNPs and other variations from high-throughput sequencing data. However, the algorithm is extremely fast and comes with a low computational cost. The lobSTR package consists of three main programs: an index builder, an alignment tool, and an allelotype tool. Additionally, lobSTR provides tools for checking the quality of alignment and allelotyping results, as well as a tool to convert lobSTR output to the widely used VCF format [12]. Below we walk through each of these steps with a focus on choosing the appropriate parameter settings and how to interpret the resulting alignments and allelotypes.

2 Materials

STR profiling using lobSTR can be performed in most UNIX environments with only minimal software and hardware requirements.

2.1 Computing Resources

1. The minimal amount of RAM is 1.3 GB.
2. lobSTR can run on a single processor or multiple processors. lobSTR has been tested with up to 40 processors. Using five processors, lobSTR can process a single low-coverage (4×) genome in several hours. Using 25 processors, lobSTR can process a higher coverage (30×) genome in under 5 hours.

2.2 Software Requirements

The lobSTR software package (compiled binaries, source code, and installation instructions) can be found at http://jura.wi.mit.edu/erlich/lobSTR/index.html. All instructions are based on lobSTR version 2.0.0. Usage is subject to change slightly as lobSTR is continually improved. Updated usage instructions are available on the lobSTR Web site as new versions are released.

2.3 Input Data

1. A lobSTR reference index. This can be downloaded from the lobSTR Web site at http://jura.wi.mit.edu/erlich/lobSTR/downloads.html. Some users may wish to create an index with a custom set of STR loci. Instructions for doing so are given in Subheading 3.1.
2. Raw sequencing reads in FASTQ/FASTA or BAM format. See Note 1 for more information on the various file formats referenced here. See Note 2 for the types of sequencing data lobSTR can process.


3 Methods

All code snippets or UNIX commands are in monospaced font. All user-defined variables are prefixed with a $. All UNIX commands are prefaced with a %. The path where lobSTR is installed is denoted by $PATH_TO_LOBSTR.

3.1 Building a lobSTR Index (Optional)

This section is optional, and is only for users wishing to profile their own custom reference set of STR loci. The lobSTR Web site contains a prepackaged index for both the hg18 and hg19 human reference genomes consisting of STR loci identified by the Tandem Repeat Finder [13] (see Note 3). If you opt to use lobSTR precompiled references, you can skip to Subheading 3.2. 1. To compile a reference set of STR loci, each STR locus must be annotated with the chromosome number, start coordinate, end coordinate, STR motif, the length of the STR motif, and the copy number of the motif in the reference genome. We recommend running Tandem Repeat Finder on your genome of interest in order to create these annotations. 2. Generate a file in BED format with one-line entry for each locus. Save this file at $STR_TABLE. This file can contain any desired information about each STR, as long as it contains the required columns: (a) Column 1: Chromosome number (b) Column 2: Start coordinate of the STR (c) Column 3: End coordinate of the STR (d) Column 4: Period of the STR (e) Column 5: Reference copy number (f) Column 15: STR repeat motif This format is used because of its compatibility with the tandem repeats table available from the UCSC Genome Browser [14]. See Note 4 for how to download this table from UCSC. See Note 5 for tips on converting raw Tandem Repeat Finder output to this format. 3. Make a directory where the index will be stored: % mkdir $index_dir

4. Run lobstr_index.py:
% python $PATH_TO_LOBSTR/scripts/lobstr_index.py --str $STR_TABLE --ref $REFERENCE_GENOME.fa --out $index_dir


This script takes an optional parameter, --extend, which denotes the length of the flanking regions around each locus to include in the reference. By default, 1,000 bp on either side of each STR are included. See Note 6 for advice on setting this parameter.

3.2 Aligning STR-Containing Reads

The steps here focus on determining the appropriate lobSTR parameters and flags, running lobSTR alignment with those parameters, and working with the output files. For a description of lobSTR parameters not discussed here, see Note 7. For more details on how lobSTR performs the sensing and alignment steps, see Notes 8 and 9. 1. Determine which type of raw input sequencing files will be used, and set the appropriate lobSTR flags. See Note 10 for how lobSTR handles each type of input. (a) lobSTR takes a comma-separated list of input files. If using single-end reads or paired-end reads in BAM format, use the -f parameter followed by the list of files. For paired-end reads, use --p1 to specify the list of files containing the first ends, and --p2 to specify the second ends in the same order as the list in --p1. See Note 11 for common issues specifying input files. (b) If the files are gzipped (FASTA or FASTQ only), specify this with the flag --gzip. (c) If the input files are in FASTQ format, specify this with the flag -q. (d) If the input files are in BAM format, specify this with the flag --bam. (e) If the input files are in BAM format and are paired, specify this with the --bam flag as well as the --bampair flag. BAM files in paired-end format must be sorted by read identifier before processing (see Note 12). 2. Set the path to the index. If the lobSTR index (generated using the steps in Subheading 3.1 or downloaded from the lobSTR Web site) is saved in $index_dir, set the index parameter with --index-prefix $index_dir/lobSTR_. 3. Set the output prefix. Specifying --out $prefix will generate lobSTR output files $prefix.aligned.tab and $prefix.aligned.bam.

4. Optional: Set the sensing parameters. This step calculates the sequence entropy in sliding windows across the read to detect repetitive sequences. For detected reads, the STR motif is characterized. In most cases, lobSTR works best with the default sensing parameters. If you wish to use the default parameters, you may skip to the next step. You can control the following parameters:


(a) Set the sliding window size and step size using the --fft-window-size and --fft-window-step parameters. By default, lobSTR uses a window size of 24 bp and a step size of 12 bp. For read lengths shorter than 50 bp or to detect shorter STR regions, a smaller window size may be used (see Note 13). It is not recommended to run lobSTR on reads less than 45 bp long, and in most cases no valid alignments will be returned for shorter reads. (b) Set the entropy score cutoff using the --entropy-threshold parameter. By default, the entropy threshold is 0.45. Strings with an entropy value below this threshold likely contain STRs and are retained for analysis. Lower cutoffs are more sensitive but increase the algorithm run time (see Note 14). (c) Set the minimum and maximum STR motif period to attempt to detect using the --minperiod and --maxperiod parameters. By default, lobSTR detects STRs of periods 2–6 bp. Performance outside these parameters has not been tested and is not guaranteed with the current version of lobSTR. (d) Set the minimum and maximum flanking region to attempt aligning using the --minflank and --maxflank parameters. By default, lobSTR will not attempt to align flanking regions shorter than 10 bp, and will trim flanking regions to a maximum of 25 bp. Flanking regions shorter than 10 bp will in most cases not produce unique alignments. Longer flanking regions can give more specific alignments, but can increase run time and decrease sensitivity (see Note 15). 5. Optional: Set the alignment parameters. In the alignment step, lobSTR aligns the flanking regions of STR-containing reads to all reference loci with the detected STR motif. Similar to the sensing step, in most cases lobSTR achieves good performance at default parameters. If you wish to use the default parameters, you may skip to the next step. The parameters below can be used to change the alignment sensitivity and to filter lower quality alignments. (a) If using a custom index (see Subheading 3.1), set --extend to the value used when running lobstr_index.py. If the default value was used for index building, there is no need to change the parameter here. (b) lobSTR adapts the alignment tool of BWA [9] for aligning the flanking regions. Use the parameters -g to set the number of gap openings allowed in each flanking region, -e to set the number of gap extensions, -m for the allowed edit distance, or -r for the allowed edit distance as a percentage of the read length (see Note 16).


(c) Set the map quality threshold using the --mapq parameter. The map quality is calculated as the sum of base quality scores at positions mismatching the reference (see Note 17). By default, alignments with scores above 100 are discarded. (d) Set the maximum allowed length difference of STR variations from the reference sequence using the --max-diff-ref parameter. By default, variations differing from the reference by more than 50 bp are not reported. This threshold does not apply to reads only partially spanning an STR (see Note 18). (e) Specifying --unit will discard any aligned read differing from the reference sequence by a non-integer number of repeats. While in some cases the true allele does differ by a non-integer copy number, often these alignments are of poor quality and result from PCR stutter or sequencing noise at homopolymer repeats (see Note 19). Note 20 gives additional information about how lobSTR refines alignments after a unique STR locus has been determined. 6. Run lobSTR with the parameters set in steps 1–3. At the UNIX command line, run:

% lobSTR --index-prefix $index_dir/lobSTR_ --out $prefix \
    [sensing parameters] [alignment parameters] [other options]
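For example, for the paired-end gzipped FASTQ input described in step 1 (file names hypothetical), a concrete invocation combining the flags from steps 1–3 with default sensing and alignment parameters might be:

# sample_1.fastq.gz / sample_2.fastq.gz are placeholder file names
% lobSTR --p1 sample_1.fastq.gz --p2 sample_2.fastq.gz -q --gzip \
    --index-prefix $index_dir/lobSTR_ --out sample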

7. lobSTR outputs two files: $prefix.aligned.tab and $prefix.aligned.bam. Note 21 describes the output formats. Each contains identical alignment information for each read. Instructions for evaluating the quality of lobSTR alignments are given in Subheading 3.3.

3.3 Evaluating lobSTR Alignments

1. Run the provided alignment diagnostic script to determine alignment quality:
% python $PATH_TO_LOBSTR/scripts/lobSTR_alignment_checks.py -f $prefix.aligned.tab [--plot]

where $prefix.aligned.tab is the alignment file created in Subheading 3.2, and --plot is an optional parameter to specify that figures with alignment summary statistics should be created. This writes the following to standard output:


(a) The number of aligned reads. (b) The number of reads that were stitched (see Note 10), single ends, and mates of single ends. The number of single ends should be equal to the number of mate pairs reported. The number of reads stitched depends on the insert size of the library. As a benchmark, for a library of 100 bp reads with mean insert size of 250 bp, 1–2 % of alignments are the result of stitched reads. For 100 bp reads with a mean insert size of 150 bp, nearly 20 % of read pairs can be stitched. (c) The percentage of reads that only partially span the STR. This number varies with read length but is often over 50 %. (d) The percentage of reads aligned to the reverse strand. This should always be very close to 50 %. (e) The mean insert size between aligned paired ends (this is irrelevant for lobSTR results run in single-end mode). This depends on the sequencing library and is usually near several hundred base pairs. (f) The percentage of alignments differing by a non-integer number of repeats from the reference. This should be in the 1–5 % range. If this number is too high, it may be an indication that the alignment is of poor quality. This usually occurs when there are high error rates at homopolymer runs in sequencing platforms such as 454 and IonTorrent or when reads are of poor quality. If the --plot option is specified, the plots listed below are generated in the directory of the aligned file: (a) The insert size distribution (for paired samples only) (Fig. 2a). (b) The quality score distribution (for fully spanning reads only) (Fig. 2b). (c) The distribution of the number of base pair difference from the reference sequence. This should be centered near 0 (Fig. 2c). 2. Visualize alignment outputs: This is a useful way to tell at a glance whether lobSTR is reporting high-quality alignments. lobSTR alignments are easily viewed in any tool that can display alignments in BAM format. Before visualizing the BAM output file, it must first be sorted and indexed (make sure that the samtools suite is installed): % samtools

sort $prefix.aligned.bam $prefix.aligned.sorted
% samtools index $prefix.aligned.sorted.bam

where $prefix.aligned.bam is the BAM file created in Subheading 3.2.

Fig. 2 Examples of alignment diagnostic plots for a high-quality Illumina dataset of 100 bp paired-end reads at 30× coverage. (a) Insert size distribution for aligned mate pairs. (b) Read alignment quality score distribution. Most scores should be near 0, indicating few or no mismatches to the reference sequence. (c) Distribution of alleles supported by read alignments given as the difference in base pairs from the reference. The distribution should be centered near 0


To use the samtools tview tool, at the UNIX command line type:
% samtools tview $prefix.aligned.sorted.bam $reference_genome.fa

Press "g" followed by the chromosome number and coordinate (e.g., chr1:12345) to view any genomic position. Use this option to scroll to genomic regions containing STR alignments. More information is provided on the samtools sourceforge page (http://samtools.sourceforge.net/samtools.shtml). Another option to visualize alignments is using the Integrative Genomics Viewer (IGV) [15]. To view a BAM file, open IGV and set the genome build in the upper left corner to that used for lobSTR alignment. Load the BAM file using the File: Load From File option. Enter a coordinate in the toolbar at the top of the screen to scroll to any genomic position. Examples of high-quality and noisy STR alignments visualized using IGV are shown in Fig. 3.

3.4 Calling STR Variants

This step describes generating STR allelotypes from the alignment performed in Subheading 3.2. As above, the steps here focus on setting the appropriate allelotype parameters, running the allelotype tool, and analyzing the output. For more details on how lobSTR performs the allelotyping step, see Note 22. 1. Optional: Train a PCR stutter noise model. The lobSTR download provides a premade Illumina noise model. If using this noise model, skip this step and replace $noisemodel below with $PATH_TO_LOBSTR/models/illumina.noisemodel.txt. Training requires a high-coverage (30× or more) male sample with ample data for the sex chromosomes, since these are used to train the model (see Note 23). To train using the lobSTR alignment generated in Subheading 3.2:

% allelotype --command train --bam $prefix.aligned.bam --sex M \
    --noise_model $noisemodel

2. Generate allelotype calls:
% allelotype --command classify --bam $prefix.aligned.bam --sex [U|M|F] \
    --noise_model $noisemodel --out $prefix

Replace the --sex parameter with the appropriate sex of the sample (M = male, F = female, U = gender is unknown or not applicable). By default, the allelotype tool removes PCR duplicates (see Note 24). To turn off this option, specify --no-rmdup. See Note 25 on how to set quality filters, which restrict the allelotyping to input reads with a certain quality.


Fig. 3 Examples of STR alignment visualization in IGV. (a) An example of a high-quality STR alignment. This trinucleotide locus was called as heterozygous for the reference allele and for a 9 bp deletion. (b) An example of an STR alignment from a low-coverage dataset. (c) An example of a noisy STR alignment. Five different alleles are supported by reads at this locus. Inspecting the reference sequence reveals a complex region consisting of three adjacent STR segments (boxes), making this locus difficult to call accurately

3. The allelotyper generates output in the file $prefix.genotypes.tab. This file contains one entry per STR locus with the allelotype given as the number of base pair length difference from the reference, coverage, number of supporting reads, allelotype confidence score (see Note 26), and information on all reads partially spanning the STR locus (see Note 27). For information on evaluating the quality of allelotyping results and for downstream processing of allelotypes, see Subheading 3.5.

3.5 Evaluating and Processing Allelotype Results

1. Run the provided allelotype diagnostic script to evaluate the quality of the allelotype calls:
% python $PATH_TO_LOBSTR/scripts/lobSTR_allelotype_checks.py -f $prefix.genotypes.tab [--plot]


where $prefix.genotypes.tab is the file generated in Subheading 3.4, and --plot is an optional parameter specifying whether diagnostic plots should be generated. This script writes the following to standard output: (a) The number of calls with coverage ≥ 1. (b) The number of calls with coverage ≥ 5. (c) The mean STR coverage (including only fully spanning reads). This is usually half or less of the genomic coverage. (d) The mean percentage of reads agreeing with the call. This should be above 95 % in most cases. Lower numbers indicate a high amount of noise, usually from PCR stutter noise or poor alignment quality. (e) The number of STRs in each allelotype category separated by period. The four categories are homozygous reference, heterozygous reference/non-reference, homozygous non-reference, and heterozygous non-reference/non-reference. The number in the first category should increase with increasing period: dinucleotides often have around 25 % of sites called as homozygous reference, whereas hexanucleotides are usually around 85 % homozygous reference. If the --plot option is specified, the diagnostic plots listed below are generated in the same directory as the allelotype output file (Fig. 4): (a) The distribution of STR coverage (Fig. 4a). (b) A bar plot of the STR period vs. the fraction in each allelotype category as described above (Fig. 4b). (c) The distribution of the percentage of reads agreeing with the allelotype call (Fig. 4c). (d) The distribution of allelotype confidence scores (see Note 26) (Fig. 4d). (e) The distribution of the number of base pair difference from the reference sequence for each allele called (Fig. 4e). (f) The number of non-unit repeat alleles called, separated by period. For each period, the number of alleles reported for each number of base pairs modulo the period is reported. Most alleles should be 0 bp modulo the period. Most non-unit alleles should be 1 bp modulo the period (Fig. 4f). 2. Convert lobSTR output to VCF format for downstream processing. The VCF format is a popular format for handling genetic variation data. It is especially convenient when processing data from more than one sample. To convert the $prefix.genotypes.tab file to VCF format:


Fig. 4 Examples of allelotype diagnostic plots for a high-quality Illumina dataset of 100 bp paired-end reads at 30× coverage. (a) STR coverage distribution. (b) Distribution of allelotype categories by period. (Red—homozygous reference, blue—heterozygous reference/non-reference, green—homozygous non-reference, yellow—heterozygous non-reference/non-reference). Variation should decrease with increasing period, with dinucleotides being the most polymorphic category. (c) Distribution of the percentage of reads agreeing with each allelotype call. Most calls should have 90–100 % reads agreeing with the called alleles. (d) Distribution of allelotype confidence scores. Scores near 1 indicate many high-confidence calls. (e) Distribution of called allele sizes given the number of base pair difference from the reference. This should be centered near 0. (f) Number of base pair difference from the reference modulo the period size (green—dinucleotides, orange—trinucleotides, red—tetranucleotides, blue—pentanucleotides, purple—hexanucleotides). Most calls should be at 0, indicating a complete repeat unit difference from the reference. Non-unit calls should be most prevalent at 1 bp modulo the unit size

% python $PATH_TO_LOBSTR/scripts/lobstr_to_vcf.py --gen $prefix.genotypes.tab \
    --sample $SAMPLE --out $PREFIX

where $prefix.genotypes.tab is the file generated in Subheading 3.4, $SAMPLE is the name of the sample being processed, and $PREFIX specifies that the file $PREFIX.vcf will be created. For more information on processing results in VCF files, see the VCFtools Web page at http://vcftools.sourceforge.net/perl_module.html.
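For instance, when several samples have been converted, the per-sample VCFs can be combined with the VCFtools Perl utilities; a sketch, assuming bgzip and tabix are installed and using hypothetical file names (vcf-merge requires bgzip-compressed, tabix-indexed input):

% bgzip sample1.vcf
% tabix -p vcf sample1.vcf.gz
% bgzip sample2.vcf
% tabix -p vcf sample2.vcf.gz
% vcf-merge sample1.vcf.gz sample2.vcf.gz > merged.vcf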

4 Notes

4.1 File Formats

Note 1: The FASTA/FASTQ format is the standard format for storing raw sequencing data [16] for all platforms that lobSTR is compatible with. BAM is a binary format that follows the SAM format specifications [17]. The BED format, described in detail at the UCSC site (http://genome.ucsc.edu/FAQ/FAQformat), is a tab-delimited file with columns for the chromosome, start, and end coordinate of each region, with optional additional information fields. Note 2: lobSTR has been tested on sequencing data generated by the Illumina [18], 454 [19], Sanger [20], and IonTorrent [21] platforms. lobSTR theoretically supports any platform that outputs reads in base space that can be represented in any of the supported input formats (FASTA/FASTQ/BAM). lobSTR default parameters have been optimized for Illumina data, and platforms with high homopolymer indel rates (including 454 and IonTorrent [22]) give less accurate allelotypes. The minimum and maximum accepted sequencing read lengths are 45 and 1,024 bp, respectively. In most cases, shorter reads do not allow for accurate STR detection and do not contain enough flanking sequence to align. The detection step is currently only built to deal with sequences of 1,024 bp or less.

4.2 Building a lobSTR Index

Note 3: The lobSTR reference is built using a modified version of the Tandem Repeat Finder table downloaded from the UCSC genome browser [23] consisting of ~240,000 loci. The table contains STRs of at least 25 bp in total from the reference sequence that contain at least three complete copies of the STR motif. Only STRs with a TRF score of 50 or more are included (using scores of match = 2, mismatch = 7, indel = 7), allowing for imperfect STRs. Entries with invalid repeat motifs, such as tetramers with motifs "ATAT," are discarded. Note 4: To download the Tandem Repeat Finder table from UCSC, navigate to the UCSC homepage (http://genome.ucsc.edu/). Click the "Tables" tab at the top of the page and select the desired clade, species, and assembly. Under "group," select "Variation and


Repeats." Select the "Simple Repeats" track. Select "selected fields from primary and related tables" under "output format." After clicking "get output," select all columns except the first column ("bin") and output the table to a file. This file is in the format required for lobSTR indexing. Note 5: In some cases users may want to generate an STR reference from TRF using different settings or a different species than those on UCSC. Raw TRF output requires some manipulation before lobSTR indexing. One way to get from TRF to the required table is to run TRF separately on each chromosome, concatenate the output into one BED file, and then perform any necessary filtering before running the indexer. An example bash script to do this for a human genome is as follows:

for i in {1..22} X Y
do
    trf chr$i.fa 2 7 7 80 10 24 6 -d
    # keep data lines, convert spaces to tabs, and prepend the chromosome name
    cat chr$i.fa.2.7.7.80.10.24.6.dat | awk '($9 > 0)' | \
        sed 's/ /\t/g' | awk "{print \"chr$i\t\" \$0}" > chr$i.trf.hg18.bed
done
cat chr*.bed > trf.hg18.2.7.7.80.10.24.6.bed

This file is in the format required for lobSTR indexing. Note 6: By default, the reference contains 1,000 bp upstream and downstream of each STR locus. In most cases this is enough to align both the read spanning the STR and its mate pair in the case of paired-end reads. For paired-end reads with an insert size larger than several hundred base pairs, --extend can be set to include enough of the surrounding region that the mate pair of an STR read will be able to align to the reference. Note that increasing the region included in the reference causes a proportional increase in memory usage.

4.3 Aligning STR-Containing Reads

Note 7: Besides the parameters described in detail below, the lobSTR program accepts the following parameters:
1. --help: displays the usage screen.
2. --verbose: prints out helpful status messages as the program progresses.
3. --min-read-length/--max-read-length: lobSTR will only attempt to process reads with lengths between these thresholds. These default to 45 and 1,024 bp, respectively. lobSTR does not produce reliable output for reads shorter than 45 bp.


4. --bwaq: lobSTR trims the 3′ ends of reads based on quality scores using the method specified in the BWA manual (http://bio-bwa.sourceforge.net/bwa.shtml). This parameter acts in the same way as the -q parameter for BWA. The default value of this parameter is 10.
5. --oldillumina: Specifies that base pair quality scores are reported in the old Phred format (Illumina 1.3+, Illumina 1.5+) where quality scores are given as Phred + 64 rather than Phred + 33 [16].
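As an illustration, these general options can be combined with the alignment flags from Subheading 3.2; a sketch for a single-end FASTQ input whose quality scores use the old Illumina encoding (the input file name is hypothetical and the --bwaq value simply restates the default):

% lobSTR -f sample.fastq -q --index-prefix $index_dir/lobSTR_ \
    --out sample --verbose --bwaq 10 --oldillumina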

Note 8: The goal of the sensing step is to determine which reads contain STRs and, for those that do, to determine the STR repeat motif. To filter non-STR-containing reads, lobSTR measures the sequence entropy in sliding windows across the read (Notes 13 and 14). Reads with a low entropy region are considered potentially STR-containing. To determine the motif, lobSTR first finds the period of the repeat, and then determines the motif sequence. To find the motif period, lobSTR uses a signal processing approach similar to that employed by Sharma et al. [24] which takes advantage of the fact that different motif periods show distinct spectral signatures. Once the period k is known, lobSTR scans the STR region to find the most prevalent k-mer and returns this value as the STR motif. For full details about the sensing step, see the lobSTR publication [11]. Several alternative approaches to the signal processing method have been used to scan genomic sequences for STRs. For a survey of other programs to detect microsatellite sequences, see refs. 25, 26. Tandem Repeat Finder scans sequences for adjacent regions with high sequence identity and uses a probabilistic model of STR sequences. RepeatMasker (http://www.repeatmasker.org/) looks for low-complexity sequences and scans for certain repeat motifs. TROLL [27] uses the Aho Corasick algorithm, which constructs a finite automaton, to search for tandem repeats of a set of preselected motifs. Note 9: The alignment step takes advantage of the information from the sensing step to increase alignment specificity. Reads are only aligned to regions of the genome containing the STR detected in the sensing step. To avoid a gapped alignment problem, lobSTR only aligns the non-repetitive flanking regions to the reference. If there is a unique alignment of flanking regions to an STR region, a valid alignment has been found. Once the alignment is determined, lobSTR performs a local realignment step to refine the alignment (see Note 20). For full details of the alignment step, see the lobSTR publication [11]. Note 10: For single-end reads, lobSTR performs the sensing, alignment, and allelotyping steps as described above. In the case of paired-end reads, lobSTR can take advantage of mate pair information to improve the alignment confidence and in some cases to span


longer STR regions. lobSTR first attempts to align each read in a pair separately using the same sensing and alignment procedure as for single-end reads. If neither of the pair aligns, the reads are discarded. If one read is aligned to one or more STR loci, and the other read aligns to exactly one of these same regions, that alignment is returned. If both reads are aligned to an STR, with only one STR locus agreeing between the two, that alignment is returned. Finally, once a unique STR locus is determined, lobSTR attempts to stitch together the read pairs into one longer read if the two paired ends show some sequence overlap. In some cases, paired ends can be stitched across an STR region and allow longer STR alleles to be reported than can be spanned by a single read. Stitching requires a minimum of 16 bp overlap between read pairs, 90 % identity between the overlapped regions, and only one unique best stitching. Only imperfect STRs can allow for stitching across the STR region unless both reads in the pair contain some of each flanking region. Note 11: Do not include spaces in the comma-separated list of input files. Filenames must include the entire path relative to the working directory from which lobSTR is run. lobSTR will report error messages for files it cannot find. Note, in the case of multiple input files, lobSTR will not crash when it cannot find a file, but instead will move on to attempt to process the next file. The list of paired-end files must follow the same order in the --p1 and --p2 parameters. No checking is done on read identifiers to verify that paired reads are input correctly. Note 12: lobSTR assumes that BAM files in paired-end format are sorted by read identifier. To sort a BAM file $test.bam using the Samtools [17] package: % samtools sort -n $test.bam $test.sorted

This will output the file $test.sorted.bam, which is then ready for lobSTR processing.
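To verify the sort order before running lobSTR, the BAM header can be inspected with standard samtools commands (a sketch; depending on the samtools version, a file sorted by read identifier should report SO:queryname in its @HD line):

% samtools view -H $test.sorted.bam | grep '^@HD'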

Note 13: By default, lobSTR scans reads using sliding windows of size 24 bp overlapping by 12 bp to determine regions of low sequence entropy. For shorter read lengths near 50 bp, windows of 20 bp overlapping by 10 bp perform better since they allow the read to be partitioned into enough windows to determine flanking regions versus STR regions. A shorter window size may also be desirable to detect STRs with shorter total lengths, whereas a larger window can be more efficient if only looking for long STR regions. A longer window size can better capture STRs with longer motifs (5–6 bp), since more repeat units can fit into a single window. Note 14: Sequence entropy is calculated using a dinucleotide alphabet that gives almost a perfect classifier between randomly chosen genomic 24mers, and 24mers drawn from STR sequences. lobSTR calculates an entropy score of (4-entropy)/4 for each sliding window, and windows above a certain threshold are marked as


repetitive. Theoretically, 0.42 would be the threshold achieving the best separation between random versus STR sequences. However, this range allows highly prevalent centromeric (TGGAA) and telomeric (TTAGGG) repeats, which cannot be uniquely aligned, to pass the threshold. To avoid wasting run time attempting to align these sequences, the entropy threshold is set by default to 0.45, which excludes a small number of true STRs. Lowering the threshold increases sensitivity but also greatly increases run time: less specific detection results in attempting to align a large number of sequences that are not true STRs. Raising the threshold will greatly reduce run time but will exclude many STRs that often have higher entropy, such as those of period four or more. Note 15: In most cases, only around 10 bp (upstream and downstream) flanking regions are needed to align an STR-containing repeat once the motif is known. Long flanking regions often fail to align because they accumulate too many mismatches from the reference. Therefore, lobSTR trims flanking regions by default to 25 bp, and requires at least 10 bp in order to attempt alignment. Increasing the maximum flanking region length will result in more specific alignments, but will increase run time and discard many true alignments. On the other hand, a small maximum flanking region size will produce more false-positive alignments. If using paired-end reads, most incorrect alignments will be discarded if they are not supported by both reads in the pair, and so this is not a major concern. Note 16: By default, lobSTR uses -g 1 -e 1 -r 0.01. Because flanking regions tend to be of highly variable length, setting –r usually performs better than -m. If -r is set, then --m is ignored. Increasing the value of -r or -m will increase the alignment sensitivity, but will also increase the number of alignments discarded due to multiple mappers. Note 17: The mapping quality score returned by lobSTR attempts to capture how likely a read is to be mismapped. Because lobSTR can theoretically detect STRs of unlimited length difference from the reference sequence, gaps are ignored. The score is calculated as the sum of quality scores at bases mismatching the reference. High scores indicate that many high-quality bases do not match the reference, and therefore the alignment is unlikely to be true. By default, reads with a score of 100 or greater are discarded. Note 18: In the default settings, lobSTR discards reads differing by more than 50 bp from the reference length. Variations greater than this tend to be erroneous. In some cases, a read only partially spans an STR. This is denoted by a “1” in the partial column of the output file. In this case, the allele reported by lobSTR should be interpreted as a lower bound on the length of the allele present since the true allele could not be determined. The allele length and CIGAR string reported for partially spanning reads should be


regarded as very rough estimates. The --max-diff-ref threshold is not applied to partially spanning reads. Note 19: In most cases, STR alleles differing from the reference sequence by an incomplete number of repeat units are the result of noise. As mentioned in Note 2, many platforms have high error rates at homopolymer runs. Many common STR motifs (e.g., AAAAT) contain homopolymers and can often exhibit false alleles with +1 or −1 base of the homopolymer (for this example, AAAT or AAAAAT). Especially in cases of low coverage, discarding these reads can increase accuracy. However, there are known cases where this is not true (e.g., the CODIS marker TH01 has several such common alleles, http://www.cstl.nist.gov/strbase/str_TH01.htm), and if an allele is supported by multiple independent reads lobSTR can type non-unit STR alleles with high confidence. Note 20: After a unique alignment to an STR locus is found, lobSTR locally refines the alignment using the Needleman–Wunsch dynamic programming algorithm [28]. This step determines a CIGAR string describing the alignment of the entire read, not just the flanking regions, to the reference sequence. Scores of match: 2, mismatch: −2, gap open: −6, and gap end: −0.1 are used to allow for an affine gap penalty. End gaps in the reference sequence are allowed and are denoted by the "S" symbol (for "soft clipping") in the reported CIGAR string. The final STR allele takes into account only insertions and deletions overlapping the STR region, and does not include indels in the flanking regions in the allele length calculation. Note 21: The first two lines of the $prefix.aligned.tab file give a parameter string in order to keep track of which parameters were used, followed by column headers. For read pairs in which one read aligns to an STR region and the other aligns to the same region, both reads are reported, with the fields for the mate pair filled in as indicated below. The columns output in the $prefix.aligned.tab file, along with the corresponding tag name in the $prefix.aligned.bam file, are as follows:
1. Read identifier
2. Read nucleotides after trimming
3. Read quality scores after trimming (all N's if no quality scores are available)
4. STR region (XG)
5. STR period
6. STR motif (XR)
7. Chromosome
8. STR start coordinate (XS)
9. STR end coordinate (XE)


10. Read start coordinate
11. Length difference from the reference sequence (in number of base pairs) (XD)
12. Number of copies of the STR in the reference sequence (XC)
13. Strand (0 = plus, 1 = minus)
14. CIGAR string
15. Partial spanning read indicator (1 = the read only partially spans the STR, 0 = the read completely spans the STR, "-" for a non-STR-containing mate pair) (XP)
16. Map quality score (calculated as described in Note 17). For paired reads, the score is summed across both reads, and the same score is reported for both. This value is used in the map quality column of the $prefix.aligned.bam file. However, it should not be interpreted as the same type of quality score returned by tools such as BWA.
17. Distance between the 5′ ends of mate pairs (−1 for unpaired or stitched paired-end reads) (XM)
18. Stitching indicator (1 = mate pairs were stitched and reported as a single read, 0 = mate pairs were not stitched, −1 = the read is the non-STR-containing mate pair of an unstitched read pair) (XX)

4.4 Calling STR Variants

Note 22: The allelotyping step aggregates all reads at each locus to determine the most likely allelotype. It performs a grid search over all possible allelotypes within two units of the smallest and largest allele observed. For example, for a dinucleotide repeat, if the smallest and largest observed alleles are −2 and 10 bp from the reference, lobSTR will evaluate all allelotypes with alleles ranging from −6 to 14 bp from the reference. For each possible allelotype, it uses the trained stutter model (see Note 23) to determine the likelihood of seeing the reads present at that locus. The maximum likelihood allelotype is returned. For full details of the allelotyping, see the lobSTR publication [11]. Note 23: Genetic variants on male sex chromosomes are hemizygous. To train the PCR stutter noise model, lobSTR assumes that the modal allele at each male sex chromosome STR is true, and that any read not matching the modal allele results from stutter noise. Note: This assumes that all reads have been mapped correctly. lobSTR then trains a logistic regression to determine the probability of stutter given the period size and also learns the probability of stuttering by each step size (1 copy, 2 copies, etc.). Overall, noise decreases with increasing period size: dinucleotides exhibit the most stutter, whereas hexanucleotides rarely show stutter noise. Most reads resulting from stutter are 1 repeat unit away from the true allele.


Note 24: PCR amplification during sample preparation may lead to the same template molecule being sequenced multiple times. These sequences contain redundant information and should not be treated as independent reads. lobSTR treats reads with the same starting coordinate and length as PCR duplicates. All duplicate reads are collapsed to a single representative read using a majority vote of the allele length across all reads. Average read quality scores are used to break ties when there is no majority allele. PCR duplicate collapsing can be turned off using the --no-rmdup allelotype parameter.
Note 25: By default, the allelotyper processes all reads that completely span STR loci. To only process reads meeting certain quality filters, use the following flags (a filtering sketch based on the $prefix.aligned.tab columns follows this list):
1. --max-diff-ref <INT>: Excludes reads with a difference from the reference of more than INT bp.
2. --unit: Filters reads differing by a non-integer number of repeat units from the reference sequence.
3. --mapq <INT>: Filters reads with a mapq score (see Note 17) above INT.
4. --max-matedist <INT>: Filters reads with a distance between mate pairs greater than INT.
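As a rough illustration of how these filters relate to the $prefix.aligned.tab columns listed in Note 21, the snippet below re-implements them on a parsed file. It is a sketch, not part of lobSTR: the column indices follow the listing above (1-based in the text, 0-based here), the default thresholds are placeholders, and the real allelotyper applies additional logic (for example, handling of partially spanning reads and stitched pairs).

```python
import csv

def passes_filters(row, max_diff_ref=50, unit=True, max_mapq=100,
                   max_matedist=1000):
    """Apply Note 25-style filters to one $prefix.aligned.tab record.
    Column numbers refer to the listing in Note 21 (1-based)."""
    period = int(row[4])      # column 5: STR period
    diff_ref = int(row[10])   # column 11: bp difference from reference
    partial = row[14]         # column 15: partially spanning indicator
    mapq = float(row[15])     # column 16: map quality score (higher = worse)
    matedist = int(row[16])   # column 17: distance between mate 5' ends

    if partial == "1":                       # only completely spanning reads
        return False
    if abs(diff_ref) > max_diff_ref:         # --max-diff-ref
        return False
    if unit and diff_ref % period != 0:      # --unit
        return False
    if mapq > max_mapq:                      # --mapq
        return False
    if matedist > max_matedist:              # --max-matedist
        return False
    return True

def filtered_reads(path):
    """Yield records passing the filters, skipping the two header lines."""
    with open(path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # parameter string
        next(reader)  # column headers
        for row in reader:
            if passes_filters(row):
                yield row
```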

Note 26: The allelotype tool returns the maximum likelihood allelotype, calculated using the trained PCR stutter noise model. A confidence score is given as the posterior probability of the allelotype call, calculated as the ratio of the likelihood of the returned allelotype to the sum of likelihoods for all possible allelotypes. High-confidence calls have scores above 0.9. lobSTR also reports a score for each allele, given as the marginal posterior probability of each allele. This is calculated as the sum of likelihoods for each possible allelotype containing the allele, divided by the sum of likelihoods of all possible allelotypes. In cases where only one allele can be confidently determined, the second allele is reported as "NA" with a marginal likelihood of 1. (A small sketch of these calculations follows Note 27.)
Note 27: The first two lines of the $prefix.genotypes.tab file are the alignment parameter string and the allelotyping parameter string. The columns are as follows:
1. Chromosome
2. STR start coordinate
3. STR end coordinate
4. STR motif
5. STR period
6. Reference copy number
7. Maximum likelihood allelotype. The genome is assumed to be diploid, except for the sex chromosomes of a male. Alleles are reported as the number of base pairs difference from the reference sequence.
8. Coverage (includes only completely spanning reads)
9. Number of reads agreeing with the allelotype call
10. Number of reads disagreeing with the allelotype call
11. List of observed alleles ("/"-separated list of allele:number of supporting reads)
12. Allelotype confidence score (see Note 26)
13. Allele 1 confidence score (see Note 26)
14. Allele 2 confidence score (see Note 26)
15. Number of reads that partially span the locus
16. Lower bound on the allele length determined from the partially spanning reads. Since these reads do not span the entire STR region, this value is extremely approximate, and only indicates that the STR is thought to be no shorter than the number given here.
17. List of partially spanning alleles seen, in the same format as column 11
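The confidence scores in Note 26 (columns 12–14 above) are simple ratios of likelihoods. Below is a minimal sketch of the calculation, assuming a dictionary mapping each candidate allelotype to its likelihood (for example, as produced by a grid search like the one sketched after Note 23); the data structure is an assumption for illustration.

```python
def allelotype_confidence(likelihoods, called):
    """Posterior probability of the called allelotype: its likelihood
    divided by the sum of likelihoods over all candidate allelotypes."""
    total = sum(likelihoods.values())
    return likelihoods[called] / total

def allele_confidence(likelihoods, allele):
    """Marginal posterior of one allele: summed likelihood of every
    allelotype containing the allele, divided by the overall sum."""
    total = sum(likelihoods.values())
    containing = sum(lik for alleles, lik in likelihoods.items()
                     if allele in alleles)
    return containing / total

# Toy example: three candidate diploid allelotypes at one locus
liks = {(-2, 10): 0.70, (-2, -2): 0.20, (10, 10): 0.10}
print(allelotype_confidence(liks, (-2, 10)))  # 0.7
print(allele_confidence(liks, -2))            # 0.9 (= 0.7 + 0.2)
```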

Acknowledgements

Y.E. is an Andria and Paul Heafy Family Fellow. This publication was supported by the National Defense Science and Engineering Graduate Fellowship (M.G.). We thank Dina Esposito for useful comments.

References

1. Mirkin SM (2007) Expandable DNA repeats and human disease. Nature 447:932 2. (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell 72:971 3. Pearson CE, Nichol Edamura K, Cleary JD (2005) Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet 6:729 4. Kozlowski P, Sobczak K, Krzyzosiak WJ (2010) Trinucleotide repeats: triggers for genomic disorders? Genome Med 2:29 5. Broman KW, Murray JC, Sheffield VC, White RL, Weber JL (1998) Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 63:861 6. Butler JM, Buel E, Crivellente F, McCord BR (2004) Forensic DNA typing by capillary electrophoresis using the ABI Prism 310 and 3100

genetic analyzers for STR analysis. Electrophoresis 25:1397 7. Zhivotovsky LA et al (2004) The effective mutation rate at Y chromosome short tandem repeats, with application to human population divergence time. Am J Hum Genet 74:50 8. Treangen TJ, Salzberg SL (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13:36 9. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754 10. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11:473 11. Gymrek M, Golan D, Rosset S, Erlich Y (2012) lobSTR: a short tandem repeat profiler for personal genomes. Genome Res 22(6):1154–1162

12. Danecek P et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156 13. Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27:573 14. Kent WJ et al (2002) The human genome browser at UCSC. Genome Res 12:996 15. Robinson JT et al (2011) Integrative genomics viewer. Nat Biotechnol 29:24 16. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38:1767 17. Li H et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078 18. Bentley DR et al (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53 19. Wheeler DA et al (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452:872 20. Friedmann T (1979) Rapid nucleotide sequencing of DNA. Am J Hum Genet 31:19 21. Rothberg JM et al (2011) An integrated semiconductor device enabling non-optical genome sequencing. Nature 475:348


22. Loman NJ et al (2012) Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 30 (5):434–439 23. Kent WJ et al (2002) The human genome browser at UCSC. Genome Res 12:996 24. Sharma D, Issac B, Raghava GP, Ramaswamy R (2004) Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 20:1405 25. Leclercq S, Rivals E, Jarne P (2007) Detecting microsatellites within genomes: significant variation among algorithms. BMC Bioinformatics 8:125 26. Lim KG, Kwoh CK, Hsu LY, Wirawan A (2013) Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Brief Bioinform 14(1):67–81 27. Castelo AT, Martins W, Gao GR (2002) TROLL–tandem repeat occurrence locator. Bioinformatics 18:634 28. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443

Chapter 8

Exome Sequencing Analysis: A Guide to Disease Variant Detection

Ofer Isakov, Marie Perrone, and Noam Shomron

Abstract

Whole exome sequencing presents a powerful tool to study rare genetic disorders. The most challenging part of using exome sequencing for the purpose of disease-causing variant detection is analyzing, interpreting, and filtering the large number of detected variants. In this chapter we provide a comprehensive description of the various steps required for such an analysis. We address strategies in selecting samples to sequence, and technical considerations involved in exome sequencing. We then discuss how to identify variants, and methods for first annotating detected variants using characteristics such as allele frequency, location in the genome, and predicted severity, and then classifying and prioritizing the detected variants based on those annotations. Finally, we review possible gene annotations that may help to establish a relationship between genes carrying high-priority variants and the phenotype in question, in order to identify the most likely causative mutations.

Key words Exome, Exome sequencing, Variant detection, Exome analysis, Disease variants, Variant annotation, Variant prioritization, Gene annotation

1 Introduction

Whole exome sequencing (WES) presents a powerful tool to study rare genetic disorders [1]. The exome represents the 1–2 % of the human genome that is translated into proteins. By analyzing the exome sequence of a single individual who displays an unusual phenotype or is affected by a rare genetic disease, the mutation that causes that disease or phenotype can be discovered. This method yields significant benefits. Not only can it help to identify the exact cause of a particular condition [2], but it can also identify genes not previously known to be associated with a biological pathway, or inspire new therapeutics for the condition [3]. Prior to the advent of deep-sequencing technologies, the discovery of mutations causing rare diseases was challenging. Traditional methods, such as Sanger sequencing of specific genes, linkage analysis, genome-wide association studies, karyotyping, and homozygosity mapping, have many limitations.

Noam Shomron (ed.), Deep Sequencing Data Analysis, Methods in Molecular Biology, vol. 1038, DOI 10.1007/978-1-62703-514-9_8, # Springer Science+Business Media New York 2013


Fig. 1 Deep sequencing disease variant detection work flow

These methods are slow, are often not sensitive or discerning enough to pinpoint actual causative mutations, or rely on large samples of affected individuals and prior knowledge about the genes involved in the disease. In order to detect rare mutations, a more powerful and sophisticated approach is needed. As the throughput of genetic sequencing has increased through the use of massively parallel sequencing [4], it has become feasible to discover specific rare, novel disease-causing variants with only the exomes of a few affected individuals.

1.1 Whole Exome Sequencing vs. Whole Genome Sequencing (Table 1)

The fundamental limitation to sequencing only the exome is that, of course, the causative mutation will only be found if it is in a known functional region of the genome. Nearly 99 % of the genome is ignored in exome sequencing. Exome sequencing generally includes exons and certain select functional elements, such as microRNAs, but genes that have not yet been discovered will not be sequenced, nor will mitochondrial DNA, introns, or any other


part of the genome that has not yet been found to be functional (also see further explanation below). Annotation of the human genome still has a long way to go, and many elements of the genome are yet to be discovered [5]. Also, structural variants and copy number variations are harder to identify using targeted exome capture [6]. However, choosing to sequence only the exome has many advantages. For starters, WES is both faster than and one-sixth the cost of whole genome sequencing (WGS) [7]. Whereas sequencing more than one genome would more likely be cost-prohibitive, sequencing more than one exome is usually feasible, and, as we will discuss later, can greatly improve the chances of identifying the causative mutation. Furthermore, WES inherently enriches the data for only the most relevant variations. This enrichment occurs for a few reasons. First, mutations in noncoding regions of the genome are less likely to cause severe phenotypes. Studies of genetic diseases demonstrate that they are much more likely to be caused by mutations in the coding regions of the genome. Missense mutations in coding regions are the most common known type of disease-causing mutations [8]. Thus, by looking only at the mutations in the exome, researchers efficiently focus only on the most likely candidate mutations, and they eliminate a significant portion of benign mutations. As the cost of genome sequencing continues to drop, it may soon be practical to sequence the entire genome rather than just the exome and detect more mutations, which will mean that variations outside the known functional regions will not be missed. However, even if it is viable to sequence the whole genome, it may not make sense to. While the effects of variations in protein-coding regions on transcription and translation are fairly predictable, the effects of mutations in noncoding regions may be unclear. In protein-coding regions, mutations can be classified based on the effect they have on the protein-coding sequence. Mutations in the noncoding region, on the other hand, are more difficult to interpret. Further complicating the matter is the fact that sequencing the whole genome will result in far more variations, since variations in nonfunctional regions are not subject to the same negative selection pressure as variations in the coding regions [9]. Thus, WGS carries with it the risk that the true causative mutation will be buried under a pile of uninformative nonfunctional mutations. Having more information is not always better, if the information cannot be interpreted. For now, sequencing only the exome provides the most efficient means to identify specific variations causing genetic disease.

1.2 Successful Exome Analysis

One of the first reported successful uses of exome sequencing to discover a novel causative mutation was in 2009 [2, 10]. The exomes of four individuals with Miller syndrome, a rare


Mendelian disease that causes craniofacial and limb development abnormalities, were sequenced and analyzed to discover a mutation in a single gene, DHODH, which was not previously associated with the disease. In the following 2 years, exome sequencing was successfully used to implicate over 30 novel variants in a wide variety of genetic diseases, ranging from autism to hearing loss [10]. However, exome sequencing is not an infallible method for causative variant detection [11]. Assuming that the causative mutation is in the exome, it can still be missed if analysis of the found variants is not done properly.

1.3 Chapter Outline

The most challenging part of using exome sequencing for the purpose of variant detection is analyzing, interpreting, and filtering the large number of detected variants. A typical exome sequencing run results in tens of thousands of small single-nucleotide variants (SNVs), insertions, and deletions [12]. Most successful studies use a stepwise filtering process to extract the most likely variants from the pool of thousands. The steps in this process may include searching for variations in known variant databases, such as dbSNP, or comparing variations to variations in other relevant exomes. Ideally, the filtering steps would identify one variant, or a small number of variants, that can be verified by Sanger sequencing or some other method. Unfortunately, researchers are often left with too many possible variants, or none at all. In order to establish exome analysis as the standard tool for variant detection in genetic disease, a robust, methodical analysis pipeline is essential. This chapter describes the process of detecting disease-causing variations in exomes. First, we address strategies in selecting samples to sequence, and technical considerations involved in exome sequencing. Next, we discuss how to identify variants, and methods for first annotating detected variants using characteristics such as allele frequency, location in the genome, and predicted severity, and then classifying and prioritizing the detected variants based on those annotations. Finally, we review possible gene annotations that may help to establish a relationship between genes carrying high-priority variants and the phenotype in question, in order to identify the most likely causative mutations.

2 Materials

Throughout this chapter we describe various tools that may be employed in the different analysis steps. However, other established tools and software are either already available or being developed. We therefore urge readers to always keep up to date regarding novel and useful tools relevant for their analysis (see Notes 6 and 7). The tools described throughout the chapter include the following:


Table 1
Available tools for exome variant detection and evaluation

Tool/resource | Function | Reference | Link
OMIM | Clinical association | – | http://omim.org/
NHGRI-GWAS catalog | Clinical association | [47] | http://www.genome.gov/gwastudies/
Genetic Association Database (GAD) | Clinical association | [58] | http://geneticassociationdb.nih.gov/
Phenopedia and Genopedia | Clinical association | [59] | http://www.hugenavigator.net/
Human Phenotype Ontology | Clinical association | [60] | http://www.human-phenotype-ontology.org/
Mouse Genome Database (MGD) | Clinical association | [61] | http://www.informatics.jax.org/
GERP++ | Conservation | [39] | http://mendel.stanford.edu/SidowLab/downloads/gerp/index.html
PhyloP | Conservation | [40] | http://compgen.bscb.cornell.edu/phast/help-pages/
PhastCons | Conservation | [41] | http://compgen.bscb.cornell.edu/phast/
Duplicated Genes Database (DGD) | Duplicate genomic regions | – | http://dgd.genouest.org/
dbDNV | Duplicate genomic regions | [65] | http://goods.ibms.sinica.edu.tw/DNVs/
NHLBI Exome Sequencing Project | Known variant database | – | http://evs.gs.washington.edu/EVS/
dbSNP | Known variant database | [20] | http://www.ncbi.nlm.nih.gov/SNP/
Complete Genomics | Known variant database | [35] | http://www.completegenomics.com/
1000 Genomes Project | Known variant database | [9] | http://www.1000genomes.org/
miRBase | miRNA sequence database | [29] | http://www.mirbase.org/
TargetScan | miRNA target database | [31] | http://www.targetscan.org/
miRNA.org | miRNA target database | [32] | http://www.microrna.org/microrna/home.do
Rfam | ncRNA sequence database | [30] | http://rfam.sanger.ac.uk/
Biocarta | Pathway annotation | – | http://www.biocarta.com
KEGG | Pathway annotation | [55] | http://www.genome.jp/kegg/
REACTOM | Pathway annotation | [56] | http://www.reactome.org/ReactomeGWT/entrypoint.html
BioGRID interaction database | Protein interactions | [62] | http://thebiogrid.org/
STRING protein functional interactions database | Protein interactions | [63] | http://string-db.org/
Human protein–protein interaction prediction database (PIPs) | Protein interactions | [64] | http://www.compbio.dundee.ac.uk/www-pips/
Catalog of Somatic Mutations in Cancer (COSMIC) | Somatic mutations | [67] | http://www.sanger.ac.uk/genetics/CGP/cosmic/
Annovar | Variant annotation | [33] | http://www.openbioinformatics.org/annovar/
Ensembl SNP effect predictor | Variant annotation | [34] | http://ensembl.org/Homo_sapiens/UserData/UploadVariations
SnpEff | Variant annotation | [48] | http://snpeff.sourceforge.net/
VariantClassifier | Variant annotation | [49] | http://www.jcvi.org/cms/research/projects/variantclassifier
SNPnexus | Variant annotation | [50] | http://www.snp-nexus.org/
MutationAssessor | Variant annotation | [51] | http://mutationassessor.org/
UCSC genome browser | Variant annotation and visualization | [27] | http://genome.ucsc.edu/
Genome Analysis Toolkit (GATK) | Variant discovery | [19] | http://www.broadinstitute.org/gatk/
SIFT | Variant severity | [42] | http://sift.jcvi.org/
PolyPhen2 | Variant severity | [43] | http://genetics.bwh.harvard.edu/pph2/
MutationTaster | Variant severity | [44] | http://www.mutationtaster.org/
LRT | Variant severity | [45] | http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752137/


3 Methods

3.1 Sample Selection

Although there are cases in which sequencing of a single individual's whole exome revealed the causative variation [13], such a detection strategy is limited. In order to increase the probability of finding a single novel causative variation, it is recommended to, if possible, sequence a few additional relevant exomes [11]. In this section we describe strategies for choosing which additional individuals' exomes should be sequenced. The optimal strategy will depend on the mode of inheritance of the mutation in question. Information about the rarity or the prevalence of the disease and knowledge of the affected individual of interest's genetic pedigree will help in surmising the mode of inheritance (see Note 1).

3.1.1 Sample Selection for a Recessive Mode of Inheritance

Recessive mutations are easier to identify by filtering for homozygosity, so it is less crucial to gather exomes from other individuals. If the only exome available is that of the affected individual, having a recessive mutation is to the researcher's advantage, as homozygosity can be used to help narrow down the pool of possible mutations. Additionally, if there is suspected consanguinity in the affected family, the variant will most likely come from a region of the genome that is identical by descent. If additional exomes are available, a useful strategy is to compare exomes from affected and unaffected individuals in the same kindred. The strategy here is to search for variations that are homozygous in all of the affected individuals and in none of the unaffected individuals. It is helpful to choose affected individuals who are as distantly related as possible in order to minimize the amount of shared benign variations.

3.1.2 Sample Selection for a Dominant Mode of Inheritance

If the disease is suspected to follow a dominant mode of inheritance and neither parent of the affected individual of interest is affected, then there are two possible scenarios: either there is incomplete penetrance or the mutation has arisen de novo. While incomplete penetrance is more likely, the discovery of a causative mutation will be more difficult. Having a de novo mutation is much rarer; the typical exome contains zero to three de novo mutations. However, the sample selection strategy is simple: sequence the exome of the affected individual and those of his or her healthy parents, and then look for the variant that is present in the child but not in the parents. It is important to note that this method requires both high sensitivity and high specificity so as not to miss mutations or mistake sequencing artifacts for de novo mutations. More often, the dominant genetic mutation will be inherited from one of the parents. Here, there are two possible sample selection strategies, which can be used in tandem.


The first strategy is to sequence exomes from as many unrelated affected individuals as possible and search for a common affected gene. It is useful to choose affected individuals from similar geographical ancestry so as to minimize the amount of benign variance and narrow the pool of candidate genes [2]. The exomes should be compared for variations found in the same gene (though not necessarily the same loci) across all exomes. The second strategy, similar to the one mentioned in the recessive mode of inheritance section, is to sequence exomes from individuals in the same kindred. By comparing unaffected and affected individuals from the same kindred, benign, private mutations can be eliminated. In the case of a dominant mutation, variants should be present in a heterozygous state in the affected population and should not be found at all in the unaffected population. This last strategy of sequencing affected and unaffected family members is useful for both dominant and recessive mutations, so it could be the strategy of choice if the mode of inheritance is unclear.

3.1.3 Combining Strategies

These selection strategies, of course, are not mutually exclusive. Studies often use a combination of the above strategies. Both unrelated and related individuals can be included in the study. Combining selection strategies may be useful if no assumptions can be made about the mode of inheritance. For example, two related individuals' exomes, one affected, one unaffected, could be used to filter out common variants. If this method alone does not narrow down the variant pool enough, an exome from an unrelated, affected individual could be used to further filter the pool of variants. For a comprehensive review of exome sequencing strategies, see ref. 12.
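To illustrate how these selection strategies translate into concrete filters, the sketch below applies simple de novo, recessive, and dominant segregation rules to per-sample genotypes. It assumes each variant carries one genotype string per sample ('hom_ref', 'het', 'hom_alt'); the input representation and sample names are hypothetical, and a real analysis would work on a multi-sample VCF with quality-aware genotype calls.

```python
def de_novo(child, mother, father):
    """Dominant de novo candidate: variant allele present in the child only."""
    return child in ("het", "hom_alt") and mother == father == "hom_ref"

def recessive(affected, unaffected):
    """Recessive candidate: homozygous alternate in every affected
    individual and not homozygous alternate in any unaffected one."""
    return (all(gt == "hom_alt" for gt in affected) and
            all(gt != "hom_alt" for gt in unaffected))

def dominant(affected, unaffected):
    """Dominant candidate: heterozygous in every affected individual
    and absent from every unaffected individual."""
    return (all(gt == "het" for gt in affected) and
            all(gt == "hom_ref" for gt in unaffected))

# Toy example: one variant genotyped in a small kindred
variant = {"child": "het", "mother": "hom_ref", "father": "hom_ref"}
print(de_novo(variant["child"], variant["mother"], variant["father"]))  # True
```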

3.2 Technical Considerations

When sequencing an exome for variant detection, technical aspects of the sequencing process should be taken into consideration. There are many different deep-sequencing platforms to choose from. There are a few commercial exome-capturing kits available (e.g., Agilent, Illumina, and Nimblegen), each with slightly different sample preparation and selection methods. Different kits have different methods of capturing or enriching the desired regions of the genome. There are three main enrichment strategies: hybridization, circularization, or PCR. Circularization and PCR are mostly used for enriching a very small amount of genetic material, less than that of an exome. The optimal and most commonly used method for targeted exome capture is hybridization [14]. In this technique, small DNA or RNA probes matching regions of the exome act as bait for matching exomic sequences in a fragment library. There are two ways to do hybrid capture: on-array or in-solution [15].


In on-array hybridization, the probes are immobilized on an array, and a surfeit of target DNA is used as compared to the number of probes. In-solution hybridization, on the other hand, uses an excess of probes as compared to target, and probes are not immobilized on a surface. In-solution hybridization may be the preferred method if there is a very limited amount of sample DNA or if the target size is small (on the order of 3 Mb). Otherwise, the two methods of hybrid capture provide comparable results [15]. Whichever method of hybridization is chosen, there are inherent errors to consider. Due to faulty hybridization, some desired regions may be missed, and there may be noncoding regions of the genome that are captured through off-target effects. If the fragment library has long fragments, the sequences that are captured might extend into the non-targeted regions of the genome. These are called "near target" sequences. Using longer fragments means that more near target sequences will be captured; using shorter fragments means that more target sequences will be missed. Different kits have different library preparation guidelines and use different size selection methods [16]. Generally, fragment size is not a determining factor when choosing a kit, but it is good to be aware of the potential for error. For more information on the errors inherent in targeted exome capture with hybridization, see ref. 15. One final important consideration for targeted exome capture is the definition of "exome." Different commercial kits have different definitions of "exome." It can include only the known protein-coding regions, or known coding regions and other known functional elements, such as untranslated regions (UTRs), certain microRNAs, or noncoding exons. The size of the region captured ranges from around 34,000 kb to over 50,000 kb. The choice of kit may depend on which parts of the genome a researcher wishes to capture. References 16–18 each provide detailed comparisons of specific kits (see Note 2).

3.3 Variant Calling (Figs. 1 and 2)

This section gives a general overview of the process of variant calling. The basic steps in the pipeline are initial mapping of the reads, improvement of alignments and quality scores, variant identification, and recalibration of the variants' quality scores. An in-depth review of variant calling can be found in [19]. The first step in variant calling is mapping the reads. To ensure that variants are properly identified, it is important to have good-quality data. Having adequate coverage of each base is one way to improve data quality. Depending on the source, the necessary minimum coverage for reliably detecting single-nucleotide variations is anywhere from 8 to 200. In general, a coverage of 20 to 50 at each nucleotide is considered acceptable when identifying variations [14]. The mapping quality and accuracy can then be further improved by marking duplicate reads, local realignment, and base quality recalibration.
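As a toy illustration of why coverage matters for variant identification, the following sketch makes a naive genotype call from per-site allele counts using simple depth and allele-fraction thresholds. This is not how production callers such as GATK work (they use quality-aware statistical models), and the thresholds here are arbitrary assumptions for illustration only.

```python
def naive_site_call(ref_count, alt_count, min_depth=20,
                    het_min=0.2, hom_min=0.8):
    """Very naive per-site call from reference/alternate read counts.
    Returns None when coverage is inadequate for a reliable call."""
    depth = ref_count + alt_count
    if depth < min_depth:
        return None                  # insufficient coverage
    alt_fraction = alt_count / depth
    if alt_fraction >= hom_min:
        return "hom_alt"
    if alt_fraction >= het_min:
        return "het"
    return "hom_ref"

print(naive_site_call(14, 16))   # 'het' at 30x coverage
print(naive_site_call(4, 3))     # None: only 7x coverage
```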


Fig. 2 Variants should be classified into several priority levels and gradually reviewed. Classifications should be designed to include higher priority classes inside lower ones in order to increase the search space without removing possibly relevant variants. In this figure, we chose two classifications (frequency and gene-effect) reducing them into only three subcategories (although more such subcategories can be added). The darker the color shade the higher the priority, with novel nonsense mutations representing the highest priority variants

Having duplicate reads may result in overestimation of coverage and quality of variants. When the sequence library is being created, PCR amplification of the DNA can result in duplicate sequences. This would make it seem like a variant has higher read coverage, and, thus, a higher quality, than it actually should, and can result in false SNVs. Therefore, it is recommended to remove all but one of the reads whose ends match to the same places on the reference genome. Next, local realignment helps to verify that reads are mapped correctly. Initial alignment algorithms often misalign reads that begin or end with insertions or deletions (indels), mistakenly calling false nucleotide changes. Local realignment tools take into consideration other reads that map to the same region and can provide evidence for indels rather than SNVs. Base quality recalibration adjusts the quality scores associated with each base pair, based on a sample of tentatively called variants. To determine the likelihood that the SNVs found in the reads are accurate, a subsample of reads is aligned to the reference genome, and differences between the reads and the reference genome are called and cross-referenced to a database of known SNVs (such as dbSNP [20]). Only about 1–10 % of the SNVs called should be novel (not found in the reference database). If there is a greater


proportion of novel SNVs than this, the accuracy of the entire exome should be called into question. The quality of the bases is then recalibrated based on the number of unknown SNVs and a number of other factors, including the bases around the unknown SNVs. Finally, the actual variants are ready to be called. A common file format for called variants is variant call format (VCF). SNVs are determined by simply identifying statistically significant differences between the mapped reads and the reference genome. Indel calling is based on gapped alignment methods or inferences from mapping of paired-end reads. Structural variants require paired-end sequencing to be discovered. The distance between the mapped locations of the paired reads provides a clue as to whether an inversion, translocation, or copy number variation is present. One final step to ensure good-quality data is to recalibrate the variant quality scores. The likelihood that the called variants are accurate can be estimated by, again, comparing the discovered variants to "true" variants from a database such as dbSNP. Additionally, if exomes from related individuals were sequenced, variant genotypes can be compared between exomes for consistent inheritance patterns.
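One quick sanity check mentioned above is the fraction of called SNVs that are absent from a known-variant database such as dbSNP; for a human exome, only about 1–10 % are typically expected to be novel. A minimal sketch of that check, assuming `called` and `known` are sets of (chromosome, position, ref, alt) tuples loaded elsewhere (the loading code and the variant representation are assumptions):

```python
def novelty_rate(called, known):
    """Fraction of called SNVs not present in the known-variant set."""
    if not called:
        return 0.0
    novel = sum(1 for v in called if v not in known)
    return novel / len(called)

called = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
          ("chr2", 3000, "G", "A"), ("chr2", 4000, "T", "C")}
known = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
         ("chr2", 3000, "G", "A")}

rate = novelty_rate(called, known)
print(f"{rate:.0%} novel")
if rate > 0.10:
    print("Warning: novelty rate above the expected 1-10% range")
```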

3.4 Variant Annotation

The variant calling pipeline, described in the previous section, results in a list of high-quality SNVs, insertions, and deletions. The number of discovered variants is usually determined by the size and conservation of the sequenced region. Larger, less conserved regions will carry more variants, and smaller, functional, conserved targets will carry fewer variants. For example, when performing WGS, millions of variants are usually detected [21–23]. WES, which targets only known functional regions of the genome that are more conserved, will usually result in thousands of detected variants [2, 24, 25]. Although the list of variants resulting from WES is much shorter, deciphering which of the detected variants have functional significance and which are irrelevant to the phenotype in question remains a challenging task. In order to facilitate the task of pinpointing the most relevant variants, a comprehensive annotation of the different variants is necessary. The more information gathered on each variant, the easier it is to classify and prioritize it. Such prioritization is crucial when dealing with a large data set of candidate variants, where the goal is to reduce the list of relevant variants considerably and retain only a handful of suspected mutations, which are later validated and whose association with the phenotype in question is confirmed (see Note 3). The annotation process involves the collection of as much data as possible on each detected variant. Most of the annotation data should be gathered from open-access, publicly available, specialized resources. For each of the detected variants, information should be gathered regarding its:


Genomic region: Annotations regarding the region in which each variant is found in the genome are essential for early-stage prioritization since variants found in known functional regions are more likely to have a phenotypic effect than variants found in regions without any known genetic or epigenetic function. Regional information should include whether the variant is up/downstream of a known gene, whether it is located in a gene’s intron, exon, splice junction, or UTRs. Different gene annotations (e.g., RefSeq [26], UCSC genes [27], Ensembl [28]) result in different regional annotations with RefSeq being the most curated version (with ~40 K records) and Ensembl the most comprehensive (~180 K records). Using a more comprehensive gene list will produce a more sensitive regional annotation, but the annotation will be less specific, as some of the annotations were not validated and thus could represent false annotations. It is also useful to annotate variants that are found inside functional regions such as miRNA and other noncoding RNA sites (found in miRBase [29] and Rfam [30], respectively). Information about known and predicted microRNA-binding sites can also be retrieved from Web servers such as TargetScan [31] or miRNA.org [32]. Finally, additional regions with known functionally relevant regulatory properties should also be considered (retrieved from the UCSC genome browser [27]). Exonic effect: In cases where the detected variant is found inside an exon, the expected variant effect on translation should be added to the annotation. If the variant is a SNV it may be synonymous (change in the codon that does not lead to an amino acid (AA) change), non-synonymous (the codon change results in an amino acid change), or stop loss/gain (the codon changes either from or into a stop codon). If the variant is an insertion or a deletion (indel) it may be non-frameshift (an indel that does not cause the translation reading frame to shift) or frameshift (an indel causing a shift in the reading frame). It is important to note that most genes have more than one possible transcript (isoforms); therefore variants may have different effects on different transcripts of the same gene. A comprehensive annotation predicts the effect of each variant on each of the gene’s possible isoforms. The exonic effect of a variant can be produced by several available tools such as ANNOVAR [33] and the Ensembl SNP effect predictor [34]. Population frequency: Determining a variant’s frequency in the population can facilitate the elucidation of relevance to the phenotype in question. If the phenotype is severe and very rare it is unlikely that the causative allele will have a high frequency in the population. Therefore, each detected variant should be annotated with any available information regarding previous detection. Detected variants in sequencing experiments are recorded in designated public databases such as dbSNP [20], the 1000 genomes project [9], the


NHLBI Exome Sequencing Project (http://evs.gs.washington.edu/EVS/), and the Complete Genomics data [35]. These databases should contain information regarding each variant's allele frequency (the ratio between the number of times the variant allele was observed and the number of times the reference allele was observed). Most variants detected through exome sequencing will have been previously observed and will be found in one of the aforementioned public databases [36, 37]. We emphasize that a variant previously detected and recorded in one of these databases should not be automatically excluded from downstream analysis. A variant does not need to be novel in order for it to be causative, and frequency, rather than mere novelty, should be considered for such prioritization. It is also recommended to incorporate additional allele frequency annotations gathered from previous personal experiments so as to remove common sequencing platform-derived errors and additional variants common in previously sequenced individuals.

Conservation: Variants found inside conserved sites in the genome are more likely to have deleterious effects with phenotypic consequences, since many such regions are conserved as a result of higher functionality and negative selection [38]. Thus, annotating each variant with the conservation levels and inferred constraint levels of its location using tools such as GERP++ [39], PhyloP [40], and PhastCons [41] can facilitate the prediction of its functional and phenotypic effect.

Expected severity: This annotation assigns each variant an expected severity score (a number that indicates how deleterious the variant is expected to be). This score is based primarily on exonic effect and conservation information, and incorporates additional information, such as chemical and physical properties of amino acids and protein structure. Higher severity variants are more likely to alter the transcription or the translation of the gene in which they reside. These alterations may change crucial qualities of the gene such as expression, affinity, and function and are therefore more likely to result in a phenotypic change. Publicly available tools such as SIFT [42], PolyPhen2 [43], Mutation Taster [44], and LRT [45] utilize the aforementioned characteristics in order to predict the deleteriousness level of a given non-synonymous mutation. These tools are used only for non-synonymous mutations because, while synonymous mutations have been previously associated with functional changes that lead to disease [46], non-synonymous variants, which result in amino acid changes in the translated protein, are much more likely to induce deleterious functional alterations. Annotating non-synonymous variants with severity scores may facilitate the process of filtering mutations predicted to be tolerated, benign, or silent and the prioritization of the remaining mutations.


Clinical associations: As mentioned before, most variants detected in exome sequencing data have already been observed and recorded in one of the major variant databases. Some of these variants may also have clinical information assigned to them. A combination of such clinical information might be highly relevant and shed light on the phenotype of interest. For a comprehensive list of the variants found to be significantly associated with various phenotypes one should use the National Human Genome Research Institute catalog of published Genome-Wide Association Studies (NHGRI-GWAS catalog) [47]. It is also recommended to annotate variants with any recorded phenotypic effect from the Online Mendelian Inheritance in Man database (OMIM; http://omim.org/).

Publicly available tools such as ANNOVAR [33], snpEff [48], VariantClassifier [49], SNPnexus [50], MutationAssessor [51], and the Ensembl SNP effect predictor [34] automatically annotate variant lists supplied by the user with some of the aforementioned annotations and are highly recommended for such a task.
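In practice, tools such as ANNOVAR or the Ensembl SNP effect predictor produce these annotations directly; the sketch below only illustrates the underlying idea of collecting several annotation sources into a single record per variant. The source dictionaries, variant key, and field names are hypothetical placeholders, not the output format of any specific tool.

```python
def annotate_variants(variants, region, effect, frequency, conservation,
                      severity, clinical):
    """Merge per-variant annotations from several sources into one record.
    Each source maps a variant key (chrom, pos, ref, alt) to its value;
    missing annotations are recorded as None."""
    annotated = []
    for v in variants:
        annotated.append({
            "variant": v,
            "region": region.get(v),              # e.g., 'exonic', 'intronic'
            "effect": effect.get(v),              # e.g., 'nonsynonymous SNV'
            "allele_frequency": frequency.get(v),  # population AF, if known
            "conservation": conservation.get(v),   # GERP++/PhyloP-style score
            "severity": severity.get(v),           # SIFT/PolyPhen-style score
            "clinical": clinical.get(v),           # OMIM/GWAS phenotype, if any
        })
    return annotated
```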

3.5 Variant Classification

Once all the information has been gathered, and annotations are complete, the researcher still faces the challenging task of uncovering the most likely candidate variants out of the immense variant list. In order to make this task feasible, it is recommended to group variants that share similar annotations into various classes. Each given classification represents a subgroup of variants that share similar features and annotations, thus simplifying the data and making it much easier to analyze (see Note 4). It is important to note that the prioritization process may result in filtering out the actual causative variant. Therefore it is recommended to stratify the variant classification into several priority levels and gradually review them from high priority to low. One should refrain from implementing a simple dichotomous classification which may result in missing the causative variant altogether. Common classifications include the following:

Frequency: Based on population allele frequency (AF) annotations (see previous section) it is possible to classify the different variants into different frequency levels which would enable the researcher to filter out common variants and include only a set of variants with appropriate frequency in the downstream analysis. For example, it is highly unlikely that a very rare disease is caused by a common mutation. Therefore only variants classified as rare will be reviewed when searching for the causative variant in a rare disease. A variant is usually considered very common if it has an AF higher than 5 %, less common for 1 % < AF < 5 %, rare for AF < 1 %, and private for variants detected only in a specific proband [52]. This classification is fairly simple when only one population frequency database is employed. However, it is important to note that different public databases can record different allele frequencies for the same variant.


In such cases, the frequency classification may be set by the researcher according to the experimental setting and the various databases employed. For example, if the exome sequencing was performed on individuals of European ancestry, it is recommended to first review the variant frequency in European populations as it is more informative than the general population frequency [53].

Deleteriousness: Another classification which is highly informative is deleteriousness. In this classification, variants are classified according to their predicted effect in regard to protein translation, viability, and functionality. For this purpose, annotations regarding a variant's type, exonic effect, conservation, and expected severity are taken into account. For example, an insertion that causes a frameshift close to the start of a gene will be considered more deleterious than a synonymous mutation that occurs in a nonconserved region of the gene [54]. In this classification, there are no standard categories, and the researcher may decide how to categorize the different variants. One form of deleteriousness classification can be divided into four groups: the first, which is the most deleterious and likely to affect gene function, includes frameshift insertion–deletion variants, variants that cause a loss or a gain of a stop codon, variants that are found inside a splice junction, and all non-synonymous variants that are considered severe by more than one of the variant effect prediction methods (see previous section). The second class, considered less deleterious, may include all insertions, deletions, and non-synonymous variants. The third classification includes all the variants found in any functionally annotated region of the genome and the fourth includes the entire set of detected variants. This classification method, in which all the variants found in one category are also found in the following, less severe, category, allows the user to efficiently review the most likely variant candidates initially, and then gradually expand the variant search space, if necessary.

Sample information: Last but not least, information about the sample of exomes that were sequenced, such as a genetic pedigree and the suspected mode of inheritance of the causative mutation, should be considered. This information may suggest to the researcher the type of the causative variant (e.g., homozygous or heterozygous). If multiple exomes were sequenced, comparisons of variants between exomes will greatly help the filtering process. For example, if the variant is suspected to be recessive, and the affected individual and his or her unaffected parents are sequenced, variants that are homozygous in the affected individual and heterozygous in each of the parents should be pulled out and classified as a high-priority group.


Another example of sample information is segregation, in which a variant is expected to appear either at a higher rate or exclusively (depending on the predicted penetrance) in sequenced affected individuals and should not appear in any of the unaffected individuals. Variants that do not segregate as expected could be excluded from downstream analysis.

Combining annotation-based classification with experimental-setting classifications results in various categories such as "very rare, highly deleterious, found only in cases" or "common, mildly deleterious, found in both cases and controls." Using these categories the researcher may choose to review only variants falling into the most relevant combination of classifications, hence significantly reducing the number of variants set for downstream analysis.
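A rough sketch of the kind of nested classification described above, using the frequency cut-offs from this section (5 % and 1 %) and a simplified deleteriousness ranking; the category names, the record fields, and the tier logic are illustrative assumptions rather than a standard.

```python
def frequency_class(af):
    """Classify a variant by population allele frequency (AF)."""
    if af is None:
        return "private/novel"
    if af < 0.01:
        return "rare"
    if af < 0.05:
        return "less common"
    return "very common"

def deleteriousness_class(record):
    """Nested deleteriousness tiers: tier 1 is the most likely to affect
    gene function; every tier is also contained in the larger tiers."""
    effect = record.get("effect", "")
    if effect in ("frameshift indel", "stopgain", "stoploss", "splicing") \
            or (effect == "nonsynonymous" and record.get("severe", False)):
        return 1
    if effect in ("nonframeshift indel", "nonsynonymous"):
        return 2
    if record.get("region") == "functional":
        return 3
    return 4

record = {"effect": "nonsynonymous", "severe": True, "region": "functional",
          "allele_frequency": 0.002}
print(frequency_class(record["allele_frequency"]),
      "tier", deleteriousness_class(record))   # rare tier 1
```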

3.6 Gene Annotation

Once a class or a combination of classes of variants is chosen, it is usually necessary to expand the characterization to the affected gene level. In some cases, variants found in the highest priority level are reviewed and determined to be irrelevant or inconsequential. This is not uncommon, as it has been shown that there are many such loss-of-function mutations in healthy individuals [54]. One way to predict the phenotype caused by a variant and filter out such high-priority, nonfunctional variants is to annotate and review the gene in which each variant resides. The more information gathered on each affected gene, the easier it is to assess its functional significance and relevance to the disease in question (see Note 5). Common informative annotations include the following:

Pathway information: Pathway annotation includes all the pathways in which a given gene takes part. When studying a disease with known associated pathways and characterized pathogenesis, it is reasonable to first review variants found in genes that participate in its related pathways. This is also true when a phenotypically similar disease exists, and its pathogenesis and related pathways are known. Pathway information can be gathered from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [55], Biocarta (http://www.biocarta.com), and REACTOM [56] (see ref. 57 for additional pathway resources).

Clinical associations and phenotype data: There are occasions in which several different variants on the same gene all result in the same phenotype. Gathering as much clinical and phenotype information as is known for each of the affected genes may help uncover such cases of genetic heterogeneity. Variants found in a gene in which other variants have already been associated with a certain phenotype are more likely to be associated with the same phenotype. Much like in the case of variant clinical associations, the NHGRI-GWAS catalog and the OMIM database can be utilized to annotate gene clinical associations. Further resources for clinical associations include the Genetic Association Database [58], and Phenopedia and Genopedia [59].


Additional phenotype data can be retrieved from the Human Phenotype Ontology [60] and the Mouse Genome Database (MGD) [61].

Gene–gene and protein–protein interactions: Interactions are highly informative in cases in which there are already known genes associated with the studied trait. In such cases, variants detected inside genes that interact with genes associated with the phenotype in question can be considered better, more likely candidates. Recently it was shown that loss-of-function mutations predicted to significantly affect protein translation and function are less common in genes with many interactions [54]. This suggests that variants found in genes with many interactions should also be considered better candidates. Interaction annotations can be gathered from databases such as the BioGRID interaction database [62], the STRING protein functional interactions database [63], and the human protein–protein interaction prediction database (PIPs) [64].

Duplicate genes and paralogs: Genes that have multiple duplicates or closely related gene family members (paralogs) have been shown to carry significantly more loss-of-function variants than other genes [54]. Since loss-of-function variants will usually be classified in a high-priority class (see previous section), it is important to annotate genes with information regarding their paralogs and duplicates and consider this information when deciding on the most likely and functionally relevant mutations (e.g., loss-of-function variants found in highly redundant genes should be considered less likely candidates). Information regarding gene duplicates and paralogs can be retrieved from dbDNV [65] and the Duplicated Genes Database (DGD; http://dgd.genouest.org/).

Cancer mutations: When the disease in question is a type of cancer, it may be beneficial to gather, for each gene, any available information regarding the amount and type of somatic and germline mutations previously detected in it. A gene commonly mutated in sequenced cancer samples is more likely to play a role in the development of the disease [66]. A comprehensive database containing such information is the Catalogue of Somatic Mutations in Cancer (COSMIC) [67].

Once all the genes carrying high-priority variants have been sufficiently annotated, it is much easier to prioritize them and select the most probable candidates. At this stage, data regarding the phenotype in question should be integrated and compared with the annotations gathered on each gene in order to find matching, functionally relevant characteristics and accordingly validate only the most likely subset of variants found in these genes.
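As a final, simplified illustration of gene-level prioritization, the sketch below scores candidate genes by how many phenotype-related annotation sets they appear in (pathway membership, prior clinical association, interaction with known disease genes). The gene sets, names, and equal weighting are hypothetical; in a real analysis they would be drawn from resources such as KEGG, OMIM, and BioGRID, and weighted according to the study.

```python
def prioritize_genes(candidate_genes, pathway_genes, clinical_genes,
                     interactor_genes):
    """Rank candidate genes by a simple count of supporting annotations."""
    scored = []
    for gene in candidate_genes:
        score = sum([gene in pathway_genes,      # in a disease-related pathway
                     gene in clinical_genes,     # prior clinical association
                     gene in interactor_genes])  # interacts with a known gene
        scored.append((gene, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy example with hypothetical gene sets
candidates = ["GENE_A", "GENE_B", "GENE_C"]
pathway = {"GENE_A", "GENE_C"}
clinical = {"GENE_A"}
interactors = {"GENE_A", "GENE_B"}
print(prioritize_genes(candidates, pathway, clinical, interactors))
# GENE_A scores 3; GENE_B and GENE_C score 1
```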


4 Notes

Deep sequencing, also known as high-throughput (or next generation) sequencing, has revolutionized the field of human genetics. The immense amount of data produced by deep sequencing has made it the technology of choice when setting out to interrogate genetic phenomena, specifically when studying human genetic diseases. Although deep sequencing was very costly when introduced, and hence carefully implemented, the unprecedented drop in per-base sequencing prices has turned deep sequencing into both the most efficient and cost-effective option when studying human genetics. Today, the bottleneck has shifted from producing sequence data to the actual analysis of the data. In this chapter we reviewed the common considerations, steps, and resources one should implement when utilizing deep sequencing for the purpose of uncovering the genetic cause of a chosen phenotype. Throughout the chapter we emphasized key points that should be implemented in order to both increase sensitivity and facilitate the filtration and prioritization of the massive amount of data produced by deep sequencing, including the following:

1. Observed or predicted modes of inheritance must be considered when selecting samples for exome sequencing. The selection should include the combination of individuals with the highest likelihood of pinpointing the causative mutation.
2. It is beneficial to examine the included genomic regions and limitations of available enrichment kits in order to better focus on regions relevant to specific studies.
3. Detected variants should be annotated with as much information as possible. This information can be gathered from various public databases, and automatically incorporated using several tools.
4. The more annotations a variant has, the better it can be classified and the more educated an assessment of its relevance to the phenotype can be.
5. Information gathered on the genes affected by candidate variants may pinpoint the most likely candidate and associate it with the observed phenotype.
6. The list of resources mentioned throughout the chapter includes highly established tools and databases but does not include the entire arsenal of possible available computational resources.
7. We urge researchers to incorporate as many resources as possible in downstream analysis as it is an efficient way to stratify variant data and elucidate relevant disease-causing mutations.



Chapter 9
Identifying RNA Editing Sites in miRNAs by Deep Sequencing
Shahar Alon and Eli Eisenberg

Abstract
Deep sequencing has many possible applications; one of them is the identification and quantification of RNA editing sites. The most common type of RNA editing is adenosine to inosine (A-to-I) editing. A prerequisite for this editing process is a double-stranded RNA (dsRNA) structure. Such dsRNAs are formed as part of the microRNA (miRNA) maturation process, and it is therefore expected that miRNAs are affected by A-to-I editing. Indeed, tens of editing sites were found in miRNAs, some of which change the miRNA binding specificity. Here, we describe a protocol for the identification of RNA editing sites in mature miRNAs using deep sequencing data.

Key words: miRNA, A-to-I editing, Deep sequencing, Bioinformatics, Genomics

1 Introduction

Adenosine to inosine (A-to-I) editing is catalyzed by enzymes of the adenosine deaminase acting on RNA (ADAR) family, which post-transcriptionally change adenosine to inosine, the latter being treated by the cellular machinery much like guanosine [1]. As ADARs bind to dsRNA, they may act on the double-stranded structure formed by the primary transcript of a microRNA (miRNA) gene (pri-miRNA) [2]. A-to-I editing of the pri-miRNA can affect the processing of the pri-miRNA to precursor miRNA (pre-miRNA) or the processing of the pre-miRNA to mature miRNA [2]. The mature miRNA is a single-stranded RNA, ~21 bases long, that can bind to partially complementary targets located in the 3′ untranslated region of specific mRNAs. Its binding blocks translation or triggers degradation of the target mRNAs [3]. Pri-miRNA and pre-miRNA editing events may lead to an edited version of the mature miRNA. Bases 2–8 at the 5′ end of the mature miRNA were found to be critical for target recognition [3]. Therefore, if the editing


occurs in this recognition site, known as the "seed" region, a change in the set of target genes is expected. A striking example is mouse miR-376, in which editing in the recognition site alters the target specificity of the miRNA and profoundly affects cellular processes [4]. Editing outside the recognition site can also be of interest, as it may affect the loading of the mature miRNAs onto the RNA-induced silencing complex (RISC), thereby altering the effectiveness of the mature miRNAs [1]. It should be mentioned that it was recently discovered that mature miRNAs undergo large-scale RNA modifications in the form of adenylation and uridylation at the 3′ end [5]. However, the functional significance of these changes is not fully understood. In addition to A-to-I editing, it is possible that other types of RNA editing modify the mature miRNA sequence. Proper utilization of deep sequencing data has the potential to unravel the full extent of A-to-I editing in miRNAs as well as other types of RNA editing. However, technical problems in the analysis of deep sequencing datasets may result in false detection of miRNA modification events [6]. Indeed, recent deep sequencing studies have raised a debate surrounding the extent of these other types of modifications in the human transcriptome in general and in miRNAs in particular [7–9]. Here we describe a protocol which utilizes deep sequencing data to characterize RNA editing in mature miRNAs while avoiding technical data analysis problems [7].

2 Materials

1. Hardware: A 64-bit computer running a Linux (including other UNIX-based operating systems and Cygwin inside Windows) or Mac OS operating system, with at least 8 GB of RAM (preferably 16 GB).
2. Short read alignment program: In this protocol we use Bowtie [10], which can be downloaded for Linux (see Note 1) or Mac OS from http://bowtie-bio.sourceforge.net/index.shtml. The Bowtie Web site provides installation instructions.
3. Genome indexes: The alignment program Bowtie uses genome indexes to speed up computation time. Pre-built genome indexes for many organisms can be downloaded via the Bowtie Web site (mentioned above). The mouse genome indexes (UCSC mm9) are needed for the example below; therefore, they should be downloaded and placed in the "indexes" folder inside the Bowtie folder.
4. Software: The Perl programming language should be preinstalled. Perl for Linux or Mac OS can be downloaded from ActiveState (http://www.activestate.com/activeperl). The ActiveState Web site provides installation instructions.


5. Mathematical package: The Perl package Math-CDF is required for the Perl scripts. This package can be downloaded from CPAN (http://www.cpan.org/). The CPAN Web site provides installation instructions.
6. Scripts: Three Perl scripts are required for this protocol (see the short R sketch after this list for one way to download them):
http://www.tau.ac.il/~elieis/miR_editing/Process_reads.pl
http://www.tau.ac.il/~elieis/miR_editing/Analyze_mutation.pl
http://www.tau.ac.il/~elieis/miR_editing/Binomial_analysis.pl
7. miRNA definition files: For the example below, download these three text files, which contain data about mouse miRNAs: the mature/star sequences, the alignment of the pre-miRNA sequences against the genome, and the secondary structure of the pre-miRNAs:
http://www.tau.ac.il/~elieis/miR_editing/Mature_file.txt
http://www.tau.ac.il/~elieis/miR_editing/Mouse_pre_miRNA_aligned_against_the_genome.txt
http://www.tau.ac.il/~elieis/miR_editing/RNAfold_file_mouse.txt
One may place these files in the same working folder as the Perl scripts for the purpose of the example below.
8. Sequencing data: This protocol refers to the output of Illumina sequencing machines. As the following analysis relies on the quality scores of the sequencing reads, the Fastq-formatted output of the sequencing run is required. To run the example given below, download the dataset SRR346417 from the NCBI SRA (http://www.ncbi.nlm.nih.gov/sra/) (see Note 2). Place the downloaded file in the same folder as the Perl scripts.
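If preferred, the scripts and definition files listed in items 6 and 7 can be fetched from within R instead of a browser. The sketch below is only a convenience wrapper around the URLs given above using the standard download.file() function; it assumes the listed URLs are still reachable and writes the files into the current working folder.

# Fetch the three Perl scripts and the three miRNA definition files (URLs from items 6 and 7)
urls <- c(
  "http://www.tau.ac.il/~elieis/miR_editing/Process_reads.pl",
  "http://www.tau.ac.il/~elieis/miR_editing/Analyze_mutation.pl",
  "http://www.tau.ac.il/~elieis/miR_editing/Binomial_analysis.pl",
  "http://www.tau.ac.il/~elieis/miR_editing/Mature_file.txt",
  "http://www.tau.ac.il/~elieis/miR_editing/Mouse_pre_miRNA_aligned_against_the_genome.txt",
  "http://www.tau.ac.il/~elieis/miR_editing/RNAfold_file_mouse.txt"
)
# Save each file under its own name in the current working folder
for (u in urls) download.file(u, destfile = basename(u), mode = "wb")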

3 Methods

3.1 Filtering Low-Quality Reads and Trimming Sequence Adapters

The Fastq file of sequencing reads can contain either raw, unfiltered reads or reads already filtered with Illumina software tools. In the latter case, proceed directly to Subheading 3.2; otherwise, continue with this step. Raw sequencing reads likely contain parts of the adapter sequence; these sequences must be identified and removed. Moreover, low-quality reads (as defined by the read quality scores) are unlikely to be informative and should therefore be removed. There are several published tools for raw Fastq filtering [11, 12], or one can use our in-house filtering script "Process_reads.pl." The script allows the user to define "low-quality reads" by choosing (a) a quality score cutoff and (b) the maximum number of positions allowed to have a quality score below the chosen cutoff. For example, one can use a cutoff of 20 for the Phred quality score (which ranges from 0 to 40) and set the maximum number of such positions to 3 [7]. The script also trims the adapters (see Note 3), and the resulting trimmed


read can be filtered out if it is too long or too short (as defined by the user). For example, as the expected length of mature miRNAs is ~21 bases, one can remove reads longer than 28 bases or shorter than 15 bases. As an example, we will use the publicly available dataset of mature miRNAs from the mouse cerebellum (accession number SRR346417).
1. Use the head command to view the Fastq file (see Note 4):
$ head -100 sra_data.fastq
It is clear that these are raw, 36-base-long reads.
2. Run the Perl script "Process_reads.pl" (see Note 5):
$ perl Process_reads.pl sra_data.fastq sra_data_filtered.fastq
The output on the screen should indicate that from ~25 million reads, ~19 million reads passed the filtering procedure.
RUN TIME: ~10 min (using an Intel Core i7 processor).
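To make the filtering rule concrete, the short R sketch below applies the same criterion to a single read: count the positions whose Phred score falls below the cutoff of 20, reject the read if more than 3 such positions exist, and reject reads shorter than 15 or longer than 28 bases after trimming. This only illustrates the rule implemented by Process_reads.pl and does not replace the script; the read and quality string are invented, and Phred+33 quality encoding is assumed.

# Illustrative filtering rule (assumed Phred+33 encoding; example read is invented)
read  <- "TGAGGTAGTAGGTTGTATAGTT"
qual  <- "IIIIIIIIIIII5IIIII##II"       # ASCII-encoded per-base qualities
phred <- utf8ToInt(qual) - 33           # convert to numeric Phred scores
cutoff     <- 20                        # quality-score cutoff
max.low    <- 3                         # maximum positions allowed below the cutoff
len.ok     <- nchar(read) >= 15 && nchar(read) <= 28
quality.ok <- sum(phred < cutoff) <= max.low
keep <- len.ok && quality.ok            # TRUE: the read passes the filter
keep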

3.2 Aligning the Reads Against the Genome

The filtered and trimmed reads should be aligned against the genome of interest, and not against a miRNA sequence database [7]. We require a unique best alignment, that is, a read cannot be aligned to other locations in the genome with the same number of mismatches (Fig. 1). Lastly, we recommend using only alignments with up to one mismatch (in our datasets, allowing up to two mismatches did not aid in detecting more editing sites; instead, it added unreliable alignments). Taken together, these steps solve, by and large, the cross-mapping problem that significantly hinders identification of true editing sites in mature miRNAs [6, 7]. The last two bases (the 3′ end) of the mature miRNA undergo extensive adenylation and uridylation [5]. It is thus recommended that these bases not be considered in the alignment. Naturally, doing so prevents detection of editing at these positions. However,

Fig. 1 A schematic representation of the alignment procedure. The sequencing reads should not be aligned to more than one location in the genome with the same number of mismatches


not taking this measure while still demanding a low number of mismatches will severely reduce the number of alignments obtained. The following command line runs the alignment program Bowtie in a manner compatible with the above considerations (see Notes 6–8):
$ bowtie -p 8 -n 1 -e 50 -a -m 1 --best --strata --trim3 2 The_bowtie_folder/indexes/mm9 sra_data_filtered.fastq > sra_data_filtered.output
The output on the screen should indicate that from ~19 million reads, ~8 million reads were uniquely aligned against the mouse genome.
RUN TIME: ~1 min (using an Intel Core i7 processor with eight threads).

3.3 Mapping the Mismatches Against the Pre-miRNA Sequences

The purpose of this step is to move from reads aligned against the genome (the end-point of the previous step) to counts of each of the four possible nucleotides at each position along the pre-miRNA sequence, for all pre-miRNAs. Performing this transformation allows us to focus the analysis only on bona fide miRNAs and to use, in the following step, binomial statistics to detect significant modifications within them. We use pre-built files for the transformation: the alignment of the pre-miRNA sequences against the genome and the mature/star sequences of the miRNAs. Taken together, these files give the location of the pre-miRNA and the mature/star miRNA within the genome (see Notes 9 and 10). The pre-built files and the Bowtie output file from the previous step are the input for the script "Analyze_mutation.pl." A key user-defined input for this script is the minimum quality score required at the location of a mismatch in order for it to be counted. Naturally, the higher this number, the lower the probability of a sequencing error; however, a very stringent quality-score filter will yield only a small number of counted modifications. We suggest using 30 as the minimum quality score allowed (see also below and [7]). Running this script on large sequencing datasets (large Bowtie output files) requires extensive internal memory resources. A way around this hurdle is to divide the sequencing dataset into several smaller files; afterwards, the output files can be merged (see Subheading 3.4). We suggest dividing the Bowtie output file if the number of reads is higher than 2.5 million when using 8 GB of RAM.
1. As in our example the Bowtie output file has ~8 million reads, dividing the file is required (see Note 11):
$ split -l 2500000 sra_data_filtered.output part_
This command will divide the Bowtie output file into four files: part_aa, part_ab, part_ac, and part_ad.
2. The script "Analyze_mutation.pl" can now be run:
$ perl Analyze_mutation.pl part_aa main_output_a.txt


RUN TIME: ~20 min (using an Intel Core i7 processor with 8 GB of RAM).
3. Run the other parts of the Bowtie output file:
$ perl Analyze_mutation.pl part_ab main_output_b.txt
$ perl Analyze_mutation.pl part_ac main_output_c.txt
$ perl Analyze_mutation.pl part_ad main_output_d.txt
RUN TIME: ~60 min (using an Intel Core i7 processor with 8 GB of RAM).
The main output of this script is a text file containing the counts of each of the four possible nucleotides at each position along all the pre-miRNA sequences. This output file is later used for the binomial test (see Note 12). The binomial test requires the number of successes (the number of mismatches of a given type at a given position), the number of trials (the total number of counts at the given position), and the sequencing error probability. The sequencing error probability can be estimated using the Phred score. For example, if only mismatches with a Phred score of at least 30 are allowed, the expected base-call error rate should be at most 0.1% at each position. The aligned read data can be used to examine the actual sequencing error rate in the retained reads (see Note 13).
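As a quick illustration of the error-rate estimate mentioned above, the relation between a Phred score Q and the base-call error probability is 10^(-Q/10). The R lines below (with an invented coverage value) show that a minimum Phred score of 30 corresponds to an expected error rate of 0.1%, i.e., about 10 erroneous calls among 10,000 bases covering a single position.

# Phred score to expected base-call error probability
q <- 30
p.err <- 10^(-q / 10)        # 0.001, i.e., 0.1%
coverage <- 10000            # illustrative read depth at one position
expected.errors <- coverage * p.err
c(p.err = p.err, expected.errors = expected.errors)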

3.4 Using Binomial Statistics to Remove Sequencing Errors

In this step, binomial statistics are applied to the output file of the previous step to separate sequencing errors from statistically significant modifications (Fig. 2). Importantly, the binomial test does not require any arbitrary expression-level filter: it is well suited even for lowly expressed miRNAs with a small number of sequencing reads, and the P-values computed reflect the absolute number of reads detected, small or large as the case may be. This analysis is performed separately for every position (except the last two positions of the miRNA, due to the extensive adenylation and uridylation) in every mature/star miRNA. As multiple tests are performed, the resulting P-value for each position must be corrected accordingly. The script performing this analysis, "Binomial_analysis.pl," gives the user the option to use Bonferroni or Benjamini–Hochberg corrections.
1. Run the script "Binomial_analysis.pl" using the files from the previous step (see Notes 14 and 15):
$ perl Binomial_analysis.pl main_output_a.txt main_output_b.txt main_output_c.txt main_output_d.txt > binomial_output.txt
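The following R sketch shows the kind of test performed at a single position; it is not the Binomial_analysis.pl script itself, and the counts are invented for illustration. Given the number of reads showing a particular mismatch (successes), the total coverage at the position (trials), and the per-base error probability (0.001 for a Phred cutoff of 30), a one-sided binomial P-value asks whether the mismatch count is larger than expected from sequencing errors alone; p.adjust() then applies a Benjamini-Hochberg correction across positions.

# One-sided binomial test for a candidate editing site (illustrative counts)
mismatches <- 25      # e.g., A-to-G mismatch count at this position
coverage   <- 1000    # total base calls at this position
p.err      <- 0.001   # expected error rate for a Phred-30 cutoff
p.site <- pbinom(mismatches - 1, size = coverage, prob = p.err, lower.tail = FALSE)

# Benjamini-Hochberg correction over a vector of per-position P-values
p.values   <- c(p.site, 0.04, 0.2, 1e-8)   # invented examples
p.adjusted <- p.adjust(p.values, method = "BH")
p.adjusted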

-sd1 <float>[,<float>]*
The standard deviations for the r1. Replicates are separated by comma. The default is 70 for each replicate.

-sd2 <float>[,<float>]*
The standard deviations for the r2. Replicates are separated by comma. The default is 70 for each replicate.

-c
The cutoff splicing difference, used in the null hypothesis test for differential splicing. The default is 0.05, i.e., a 5% difference. Valid: 0 ≤ cutoff < 1.

-analysis analysisType
Type of analysis to perform. analysisType is either "P" or "U": "P" is for paired replicate analysis and "U" is for unpaired replicate analysis. The default is "U".

-expressionChange
Filters out AS events whose corresponding gene expression levels differ by more than the given cutoff fold change between the two samples. Valid: fold change > 1.0. The default is 10.0.

The anchor length used for TopHat and the distribution of insert length for paired-end data can optionally be provided. The cutoff of exon inclusion-level difference tested for differential AS can also be defined by the user. For each AS event, RNASeq-MATS computes the P-value and FDR of differential AS. It reports AS events for all five major types of AS patterns (Fig. 1) in the output folder specified by the -o option.

3.4 Output

All output files are in the user-defined output folder.
1. MATS_output: A folder that contains the MATS output of AS events. Each output file is sorted by P-values in ascending order.
2. MATS_output/AS_Event.MATS.JunctionCountOnly.txt: A file that evaluates differential AS using only reads that span splice junctions. The IJC_SAMPLE_N column is for the inclusion junction counts of SAMPLE_N and the SJC_SAMPLE_N column is for the skipping junction counts of SAMPLE_N.
3. MATS_output/AS_Event.MATS.ReadsOnTargetAndJunctionCounts.txt: A file that evaluates differential AS with both reads that span splice junctions and reads that map to target exons. IC_SAMPLE_N is for the inclusion counts of SAMPLE_N and JC_SAMPLE_N is for the skipping counts of SAMPLE_N.


4. summary.txt: A file that contains a summary of statistically significant differential AS events for each type of AS pattern.
5. ASEvents: A folder that contains all possible AS events derived from the GTF gene/transcript annotation and the RNA-Seq data.
6. SAMPLE_1/REP_N: A folder that contains the mapping results of SAMPLE_1, Replicate_N.
7. SAMPLE_2/REP_N: A folder that contains the mapping results of SAMPLE_2, Replicate_N.
8. commands.txt: A list of key commands executed.
9. log.RNASeq-MATS: A log file for running RNASeq-MATS.

3.5 Downstream Analysis

4

RNASeq-MATS reports the exon inclusion levels, P-value, and FDR for each AS event using either splice junction counts (JC) only or splice junction counts plus counts of reads mapped to target exons (JC + EC). The validation of candidate differential AS events can be done using a variety of techniques, most commonly RT-PCR using primers targeting flanking exon regions. Figure 2 shows the RT-PCR validation of a candidate skipped exon event reported by RNASeq-MATS.

Notes 1. RNASeq-MATS currently supports Linux and MacOS. 2. RNASeq-MATS works with both read sequence files (FASTQ) and mapped read files (BAM). All read sequences in FASTQ files or BAM files must have the same length. 3. RNASeq-MATS is freely available from http://rnaseq-mats. sourceforge.net/ from where users can also subscribe to update notifications. 4. To trim the 30 end of reads with poor sequencing quality or to ensure that all reads have the same length, use the trimFastq.py script found in the bin directory. The following example command trims 231ESRP.250K.r1.fastq to 32 bp long by removing sequence from the 30 end of the reads and then saving it to trimmed.fastq. Usage python

trimFastq.py

input.fastq

trimmed.fastq

desiredLength

Example python

trimFastq.py

231ESRP.250K.r1.fastq

trimmed.fastq 32

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can be used to check read quality.


Fig. 2 Example differential skipped exon (SE) event validated by RT-PCR in a human breast cancer cell line (MDA-MB-231) with ectopic expression of the epithelial-specific splicing factor ESRP1 or an empty vector (EV) control. (a) A skipped exon event of the MLPH (Melanophilin) gene reported by RNASeq-MATS is shown with its flanking exons. Histograms represent exon read density and arcs represent splice junctions, with the number of reads mapped to each junction indicated by the thickness of the arc. Both splice junction counts and exon read counts indicate that the target exon (striped) is repressed by ESRP1. (b) The RT-PCR result for the exon inclusion form (top band) and the exon skipping form (bottom band) validates that the target exon is repressed by ESRP1. (c) The exon inclusion level (ψ) of the target exon in each sample and the inclusion-level difference (ψ_ESRP1 − ψ_EV) are measured by RNA-Seq and RT-PCR. RNASeq-MATS has two sets of results for JC (using splice junction counts only) and JC + EC (using splice junction counts plus reads on target exons)

5. Bowtie indexes for human (hg19) and mouse (mm9) are provided. For instructions on making a Bowtie index for other species, refer to the Bowtie manual (http://bowtie-bio.sourceforge.net/manual.shtml#the-bowtie-build-indexer).
6. The following is a list of gene/transcript annotation GTF files provided in the gtf directory:
Human, Homo sapiens (Ensembl or UCSC Known Genes)
Mouse, Mus musculus (Ensembl or UCSC Known Genes)
Drosophila, Drosophila melanogaster (FlyBase)
C. elegans, Caenorhabditis elegans (RefSeq)
Zebrafish, Danio rerio (RefSeq)
Alternatively, you can download your own gene/transcript annotation in the GTF format. Note that the first column

Alternative Splicing Analysis of RNA-Seq Data

179

(chromosome/contig name) in the GTF file must match the sequence names in your Bowtie index. Use bowtie-inspect (found in the bowtie directory) to display sequence names for the Bowtie index. bowtie-inspect --names your_bowtie_index

Only use a GTF in which the chromosome/contig name (first column) matches the output of the above command. For organisms with poorly annotated transcriptomes, the GTF output of Cufflinks [11] or other de novo transcript identification tools could be used as an input for RNASeq-MATS.
7. The testData folder contains all necessary test data (FASTQ files and BAM files) to test whether RNASeq-MATS is installed properly.
8. RNASeq-MATS is designed to run on a server; however, it can also be used on a local/personal computer with Linux or MacOS. If you have trouble installing RNASeq-MATS on a server, you may want to contact your system administrator.

References
1. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C et al (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456(7221):470–476 2. Cooper TA, Wan L, Dreyfuss G (2009) RNA and disease. Cell 136(4):777–793 3. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63 4. Griffith M, Griffith OL, Mwenifumbo J, Goya R, Morrissy AS, Morin RD et al (2010) Alternative expression analysis by RNA sequencing. Nat Methods 7(10):843–847 5. Katz Y, Wang ET, Airoldi EM, Burge CB (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods 7(12):1009–1015 6. Brooks AN, Yang L, Duff MO, Hansen KD, Park JW, Dudoit S et al (2011) Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res 21(2):193–202

7. Shen S, Park JW, Huang J, Dittmar KA, Lu ZX, Zhou Q et al (2012) MATS: a bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Res 40(8):e61 8. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079 9. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25 10. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111 11. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515

Chapter 11
Optimizing Detection of Transcription Factor-Binding Sites in ChIP-seq Experiments
Aleksi Kallio and Laura L. Elo

Abstract
Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) offers a powerful means to study transcription factor binding on a genome-wide scale. While a number of advanced software packages have already become available for identifying ChIP-seq binding sites, it has become evident that the choice of the package together with its adjustable parameters can considerably affect the biological conclusions made from the data. Therefore, to aid these choices, we have recently introduced a reproducibility-optimization procedure, which computationally adjusts the parameters of popular peak detection algorithms for each ChIP-seq dataset separately. Here, we provide a detailed description of the procedure together with practical guidelines on how to apply its implementation, the peakROTS R-package, in a given ChIP-seq experiment.

Key words: Next-generation sequencing, ChIP-seq, Transcription factor-binding site, Peak detection, Reproducibility

1 Introduction

Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) is a powerful technique to map sites of transcription factor binding and chromatin modification, with good coverage and resolution [1]. Therefore, it has increasingly become the tool of choice for many genome-scale studies of transcriptional regulation, e.g., [2–4]. In ChIP-seq, ChIP-enriched DNA fragments are sequenced for a short stretch (often ~30 bp) from their ends. The resulting short sequence reads (tags) are then aligned back to a reference genome, typically retaining only those reads that map to a unique genomic location [1, 5]. Binding sites are finally identified on the basis of accumulation of reads at particular genomic loci using various peak detection algorithms; see [6, 7] for a good introduction to different aspects of peak calling. To aid the biological interpretation of the predicted binding sites, further downstream analysis often involves


steps like motif discovery and annotation of the peak locations to known genomic features, as well as visualization of the detected regions [1]. Together with the increasing popularity of ChIP-seq technology, a number of advanced software packages for identifying ChIP-seq binding sites have become available. These include, for instance, PeakSeq [5], MACS [8], FindPeaks [9], QuEST [10], and CSAR [11], among others. A common challenge of using these packages is, however, that they typically involve various user-adjustable parameters. While the default parameters seem a natural choice, they may not be appropriate for the particular data under analysis; we and others have observed that with the fixed default parameters, the choice of the best package depends strongly on the data [12–15]. To provide the user with an informed means to adjust the parameters of a given peak detection software, we have recently introduced a computational procedure which optimizes the binding site detections for a particular dataset by maximizing their reproducibility under bootstrap sampling [16]. We have previously shown the benefits of a similar reproducibility-optimization test statistic (ROTS) for gene expression microarrays and quantitative mass spectrometry-based technologies [17, 18]. In ChIP-seq experiments, the procedure improved the accuracy of binding site detections while avoiding arbitrary reiterative peak calling under various parameter settings [16]. Moreover, it could indicate whether the quality of the data or the software parameters were sufficient for reliable binding site detection or not. Here, we provide a detailed description of the generic reproducibility-optimization procedure and its application to ChIP-seq studies, together with practical guidelines on how to apply the peakROTS R-package to identify transcription factor-binding sites in a given ChIP-seq experiment.

2 Materials

The peakROTS R-package implementing the reproducibility-optimization procedure is freely available at http://www.nic.funet.fi/pub/sci/molbio/peakROTS. For peak detection, we use here the Model-based Analysis of ChIP-Seq (MACS) software [8], available at http://liulab.dfci.harvard.edu/MACS/. We selected MACS for the examples, as it is popular in many ChIP-seq studies; for recent ones see, e.g., [15, 19–23]. MACS is an advanced peak detection package, which uses tag shifting and windowing to scan chromosome regions and a dynamic Poisson distribution to model the background signal. For the analysis work, we use a Unix environment, because MACS and most other peak detection packages are designed to run there. To use the ROTS procedure for peak


detection, we load the peakROTS package into the R environment (http://www.r-project.org). To visualize the results we use the Chipster genome browser software [24].

3 Methods

3.1 ROTS Procedure for ChIP-seq

The reproducibility-optimization ROTS procedure selects the best parameter combination for a given peak detection algorithm by maximizing the reproducibility of the detections across pairs of bootstrapped datasets (see Fig. 1 for a schematic illustration). Each bootstrap dataset is generated by randomly sampling, with replacement, reads from the original data, preserving the ChIP and control sample labels (see Note 1). Reproducibility is defined in terms of the average overlap of the top-ranked peaks across the bootstrap resamples, compared to the null reproducibility in randomized data. Notably, no fixed number of top detections needs to be selected; the top list size is optimized together with the other parameters of the peak detection algorithm. More specifically, the optimal parameter combination α* is determined by maximizing the reproducibility Z-score

Z_{k,α} = (R_{k,α} − R^0_{k,α}) / s_{k,α}

over a dense lattice of different parameter combinations α and top list sizes k. Here, R_{k,α} is the average reproducibility (overlap) of the k top-ranked peaks over pairs of bootstrapped datasets, s_{k,α} is the corresponding standard deviation of the reproducibility, and R^0_{k,α} is the null reproducibility in randomized datasets. To estimate the null reproducibility, the current implementation of the procedure assumes that the data contain both a ChIP and a control sample, as is typically preferred in ChIP-seq studies [7, 10, 12]. The null reproducibility is estimated by swapping the ChIP and control samples. The ROTS output is finally the peak list obtained from the original data using the parameter combination selected by the ROTS procedure. When determining the peak list overlaps, it is necessary to take into account the ambiguity that a peak in one dataset may overlap multiple peaks in another dataset. Similarly as in [25], we first merge the two peak lists under comparison into a union set of n detected regions. The reproducibility is then defined as R = m/n, where m is the number of regions in the union list that overlap regions in both of the original lists. It equals one if all the regions overlap and zero if none of the regions overlaps. Two regions are considered overlapping if they have at least one shared base pair.
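The R sketch below illustrates, with made-up single-chromosome coordinates, how the overlap-based reproducibility R = m/n defined above can be computed: the two peak lists are merged into a union set of regions, and the fraction of union regions that overlap peaks from both lists is reported. It is a simplified illustration of the definition, not the peakROTS implementation.

# Simplified reproducibility R = m/n for two peak lists on one chromosome (invented coordinates)
peaks1 <- data.frame(start = c(100, 500, 900),  end = c(200, 650, 950))
peaks2 <- data.frame(start = c(150, 700, 1200), end = c(250, 800, 1300))

# Merge the two lists into a union set of non-overlapping regions
all <- rbind(peaks1, peaks2)
all <- all[order(all$start), ]
union.start <- all$start[1]; union.end <- all$end[1]
u.start <- c(); u.end <- c()
for (i in seq_len(nrow(all))[-1]) {
  if (all$start[i] <= union.end) {      # overlaps the current union region: extend it
    union.end <- max(union.end, all$end[i])
  } else {                              # start a new union region
    u.start <- c(u.start, union.start); u.end <- c(u.end, union.end)
    union.start <- all$start[i]; union.end <- all$end[i]
  }
}
u.start <- c(u.start, union.start); u.end <- c(u.end, union.end)

# A union region counts towards m if it shares at least one base with a peak in BOTH lists
hits <- function(s, e, p) any(p$start <= e & p$end >= s)
m <- sum(mapply(function(s, e) hits(s, e, peaks1) && hits(s, e, peaks2), u.start, u.end))
n <- length(u.start)
m / n   # reproducibility of the two lists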


Fig. 1 A schematic illustration of the ROTS procedure. (1) First, B pairs of bootstrap datasets are generated from the original data by randomly sampling with replacement an equal number of reads as in the original data. (2) For each parameter combination, peaks are detected from each bootstrap data and the corresponding randomized data and average peak list reproducibility are calculated at various top list sizes. (3) The parameter combination that maximizes the reproducibility Z-score over increasing top list sizes is selected (see Subheading 3.1 for details). (4) The ROTS output is a peak list obtained from the original data using the parameters selected with the ROTS procedure


Besides determining the optimal parameters, the optimized reproducibility Z-score can be used to investigate whether the quality of the data or the software parameters are sufficient for reliable binding site detection; low reproducibility values indicate that the detections may be unreliable (see Note 2).

3.2 ChIP-seq Data Analysis Using Default MACS

We first demonstrate how ChIP-seq data can be analyzed using the MACS package [8] with default settings. The example dataset is a restricted version of the human NRSF dataset [26]. In the Unix command line, first create a directory for the example analysis session (pdexample, for "peak detection example") and load the example dataset from the peakROTS Web site. After unzipping the example data package, you can see two files, corresponding to treatment and control samples in ELAND format.
mkdir pdexample
cd pdexample
wget ftp://ftp.funet.fi/index/Science/molbio/peakROTS/download/peakROTS_tutorial_data_v1.zip
unzip peakROTS_tutorial_data_v1.zip

Next we install the MACS package. Detailed instructions are available at the MACS Web site (http://liulab.dfci.harvard.edu/MACS/). MACS can be installed to the user's home directory so that administrator privileges are not required (for instructions, we refer to the INSTALL file inside the MACS package). After installation, invoking macs14 at the command line should output the MACS quick help. To run peak detection with default parameters, invoke the following command line (see Note 3):
macs14 -t treatmentdata_example.dat -c controldata_example.dat -n pdexample

When MACS is run, it produces a set of output files, depending on the parameters it was given. Most importantly, the file pdexample_peaks.xls contains the called peaks with their statistics, including the false discovery rate (FDR) associated with each peak. MACS calculates FDR values by swapping sample and control; the so-called negative peaks produced in this process are also reported by the software (file pdexample_negative_peaks.xls). MACS has a wide collection of parameters that can be used to tune the peak detection process. As an example, we run peak detection again with the dynamic background lambda calculation disabled (parameter nolambda):
macs14 -t treatmentdata_example.dat -c controldata_example.dat --nolambda -n nolambda

We notice that the change in one parameter has a drastic effect on results with the given dataset: instead of 441 peaks detected with the default settings, the result now contains 2,693 peaks (see Note 4).
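If you prefer to inspect the called peaks programmatically rather than in a spreadsheet, the short R sketch below reads pdexample_peaks.xls. Despite its .xls extension, the file produced by the MACS version used here is a tab-delimited text table whose comment lines start with "#"; the exact columns can differ between MACS versions, so check the header of your own file before relying on specific column names.

# Read the MACS peak table (tab-delimited text despite the .xls extension)
peaks <- read.delim("pdexample_peaks.xls", comment.char = "#",
                    stringsAsFactors = FALSE)
nrow(peaks)        # number of called peaks (441 with the default settings above)
colnames(peaks)    # inspect the available statistics (e.g., fold enrichment, FDR)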


3.3 ChIP-seq Data Analysis Using peakROTS

We have observed that different parameter settings applied to the same data can produce large differences in peak detection results. To gain better control and understanding over the detection process, we next use MACS together with the peakROTS package [16] that implements the reproducibility-optimization ROTS procedure for peak detection (see Note 5). peakROTS is an R package that is freely available as open source. The implementation is platform independent, requiring only an R environment (http://www.r-project.org). As the search through the parameter space to pick an optimal parameter combination can be computationally demanding, the package allows using a large computational cluster to perform the analysis work efficiently. To work with the peakROTS package, we download it from the peakROTS Web site (http://www.nic.funet.fi/pub/sci/molbio/peakROTS); instructions on how to install the package are available at the Web site. Analysis with peakROTS has an initialization and an execution step. Unlike MACS and other conventional analysis tools, peakROTS does not force the user to run the whole calculation in one go, so an initialization step is needed to create the bookkeeping that allows the analysis process to be completed in multiple separate sessions. This way peakROTS analysis can be performed reliably and efficiently. In our example, we use peakROTS to examine the nolambda parameter space (i.e., parameter values true and false). MACS models the tag distribution by a Poisson distribution, which can be parameterized by a local or a global lambda parameter. When nolambda is set to false (the MACS default value), a local lambda is estimated from the control samples, allowing the model to account for location-specific artifacts, such as local chromatin structure, DNA amplification, sequencing bias, and genome copy number variation [8]. When nolambda is set to true, a genome-wide background lambda is used instead of a local lambda. For the example analysis, we keep the other parameters at MACS default values (see Note 6). The peakROTS package is used in an interactive R session. We load the peakROTS library and initialize the analysis session. The initialization step will create a subdirectory called work-pdexample and fill it with the required runtime data. When calling the initialise function, the parameter r.command must be set to match the R executable under which you are working:
library(peakROTS)
initialise(
  path.work = "work-pdexample",
  data.path = "pdexample",
  treatment.file = "treatmentdata_example.dat",
  control.file = "controldata_example.dat",
  shiftsize = list(100),
  tsize = list(25),


  bw = list(300),
  nolambda = list(TRUE, FALSE),
  pvalue = 10^-5,
  bootstrap.count = 10,
  r.command = "R"
)

Next, in the execution step, we start processing the session that was initialized. We do not use a computational cluster for this small example session, but we do use 10 parallel processes, thus making use of the multiple processing cores that most modern desktop computers have. This is selected with the parameters do.run and jobs.running.max when calling the function run. For running peakROTS in a computational cluster, see Note 7:
run(
  path.work = "work-pdexample",
  do.run = do.run.local,
  jobs.running.max = 10
)

The function call starts the so-called master node that begins to process the analysis session. It initiates smaller analysis jobs, each of which processes one part of the whole ROTS computation pipeline. The work-pdexample directory contains the files pending-jobs.txt, running-jobs.txt, and finished-jobs.txt. Each job has a single line in one of these files, and all jobs move through the files as their states progress from pending to running and from running to finished. After all the jobs have finished, the master node stops and the call to the run function returns (see Note 8). The reproducibility-optimization results can be found in the subdirectory work-pdexample/results:
load("work-pdexample/result/result.Rdata")
print(repro.summary)
print(cat("ROTS selected the parameter combination: ", result))

Peak detection results on the original data can be found in the subdirectory work-pdexample/peaks, following the naming scheme Original_<parameters>, where <parameters> is the same parameter string that is also reported in the reproducibility results. To locate the peak lists determined as best by peakROTS, we check the optimized parameter combination and pick the corresponding Original_<parameters> files. In this example case, we discovered that the parameter value nolambda = false resulted in better reproducibility: it was given a Z-score of 19.7, while nolambda = true was given a Z-score of 17.3. Based on this, we would use nolambda = false for peak calling with peakROTS and MACS, if the other parameters were kept at their default values. After the optimized peak detection results have been produced, the analysis pipeline typically continues with downstream analysis


Fig. 2 Visualization of detected peaks using the Chipster genome browser software. Example peaks from human STAT1 data are shown. In this case, the ROTS-optimized parameters produced narrower peak widths than the MACS default parameters

steps such as visualizing or annotating the result data. The peakROTS result files are in standard formats, so that common software packages can be used for further analysis. As an example, we investigate the detected peak regions in a genomic context by loading them into the Chipster genome browser software [24]. Figure 2 shows an example region identified from the human STAT1 data [5], in which MACS with ROTS-optimized parameters produced narrower peak widths than MACS with default parameters. In general, several differences between the ROTS-optimized and default parameter settings can be observed [16].

4 Notes

1. Bootstrap resamples. In gene expression and proteomic studies, bootstrap resamples can usually be generated by randomly sampling the individual replicates with replacement [17, 18]. In ChIP-seq studies, however, the number of replicates is typically very low. Therefore, the current implementation of


peakROTS makes bootstrap resamples of the reads within a single dataset by sampling, with replacement, an equal number of reads as in the original data.
2. ROTS as a benchmarking tool. Besides providing guidance on how to adjust the parameters of the peak detection software, the ROTS procedure can also indicate whether the data quality or the software parameters are sufficient for reliable binding site detection. Hence, the peakROTS procedure can be used by developers of new and improved peak detection algorithms as a benchmarking tool. In general, peakROTS allows assessing the quality of each step of a ChIP-seq data analysis pipeline. Such unbiased assessment is essential when optimizing a pipeline to avoid biases towards a specific data type or peak detection algorithm.
3. Input file format. The example data files are in ELAND format. MACS, and also the peakROTS package, can automatically detect the input file format. Therefore the increasingly popular Binary Alignment/Map (BAM) format can be used without any changes to the given commands.
4. MACS version. The exact number of peaks identified by MACS depends on the MACS version; here we used version 1.4.2, the latest available at the time of writing.
5. Generalizability of peakROTS to other peak calling software. MACS was used here as an example peak finding package, but peakROTS is not limited to it. It also supports the more recent PeakSeq software and can be extended with other peak finding software when needed. Currently, the only limitation is that the experimental setup must include both a treatment and a matching control sample.
6. Parameter space. It must be noted that the analysis session explained here is a small example that can be quickly walked through on a low-powered desktop computer. When peakROTS is used on real analysis cases, the number of parameter combinations is typically higher. With MACS, we recommend exploring at least various values of the nolambda, shiftsize, and bw parameters (a sketch of such a broader search is given after these notes). These parameters control central aspects of MACS: nolambda controls how the lambda parameter of the Poisson distribution is estimated; shiftsize determines by how many base pairs ChIP-seq tags are shifted towards the 3′ direction, with the aim of representing the precise protein–DNA interaction site; and the bandwidth parameter (bw) defines the window size that MACS uses for peak detection [8]. Also, the number of bootstrap samples is low in the example session. For typical usage, we recommend using at least 100 bootstrap samples.


7. Running peakROTS in a computational cluster. The peakROTS implementation is modularized so that it can be efficiently distributed across multiple computing cores, allowing large computational resources to be utilized effectively. To achieve this, pass different values for the do.run parameter of the run function call. When the peakROTS package starts to run jobs, it calls the given do.run function for each individual job. The peakROTS package contains ready-made implementations for local and batch processing jobs, and new processing environments can be taken into use by simply providing a one- or two-line do.run function; see the peakROTS library documentation for details. The infrastructure needed for the distributed computing is included in the R library. The current implementation supports both local process-based distribution (single node, multiple cores) and LSF batch processing (multiple nodes).
8. Error tolerance. The peakROTS sessions can involve heavy computation. To perform heavy computation reliably, the system was designed to be error tolerant. The state of the distributed work is kept on disk, meaning that the master node can crash or be shut down without losing any of the intermediate results. State is stored in easily readable text files so that, when the master node is not running, the files can be edited manually, allowing, for instance, failure resolution. In particular, it is possible to stop the master node while the work is progressing by typing Ctrl + C. Jobs will continue to run until they are finished, but new jobs will not be started. It is possible to continue the work at a later time by loading the state of the analysis session from the work-pdexample directory, after which the run method can be called exactly the same way as before and the processing will continue from where it left off:
load.settings(path.work = "work-pdexample")
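As referenced in Note 6, the call below sketches what a broader parameter search could look like, reusing only the initialise() arguments already shown in Subheading 3.3. The data file names are placeholders, the particular shiftsize and bw values are examples rather than recommendations, and a search of this size is best distributed over a computational cluster as described in Note 7.

# A broader (illustrative) parameter search: 3 x 3 x 2 = 18 parameter combinations
initialise(
  path.work       = "work-fullsearch",
  data.path       = "mydata",               # placeholder folder holding the read files
  treatment.file  = "treatment.dat",        # placeholder file names
  control.file    = "control.dat",
  shiftsize       = list(50, 100, 150),     # example values to explore
  tsize           = list(25),
  bw              = list(200, 300, 400),    # example values to explore
  nolambda        = list(TRUE, FALSE),
  pvalue          = 10^-5,
  bootstrap.count = 100,                    # at least 100 recommended (Note 6)
  r.command       = "R"
)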

Acknowledgments
The work was supported by the Academy of Finland (grants 127575, 218591).

References
1. Park PJ (2009) ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10:669–680 2. Elo LL, Järvenpää H, Tuomela S et al (2010) Genome-wide profiling of interleukin-4 and

STAT6 transcription factor regulation of human Th2 cell programming. Immunity 32:852–862 3. Schmidt D, Wilson MD, Ballester B et al (2010) Variation in transcription factor binding among humans. Science 328:1036–1040

4. Northrup DL, Zhao K (2011) Application of ChIP-Seq and related techniques to the study of immune function. Immunity 34:830–842 5. Rozowsky J, Euskirchen G, Auerbach RK et al (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 27:66–75 6. Farnham PJ (2009) Insights from genomic profiling of transcription factors. Nat Rev Genet 10:605–616 7. Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat Methods 6:S22–S32 8. Zhang Y, Liu T, Meyer CA et al (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9:R137 9. Fejes AP, Robertson G, Bilenky M et al (2008) FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 24:1729–1730 10. Valouev A, Johnson DS, Sundquist A et al (2008) Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods 5:829–834 11. Muiño JM, Kaufmann K, van Ham RC et al (2011) ChIP-seq analysis in R (CSAR): an R package for the statistical detection of protein-bound genomic regions. Plant Methods 7:11 12. Laajala TD, Raghav S, Tuomela S et al (2009) A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments. BMC Genomics 10:618 13. Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One 5:e11471 14. Szalkowski AM, Schmid CD (2011) Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts. Brief Bioinform 2:626–633 15. Micsinai M, Parisi F, Strino F et al (2012) Picking ChIP-seq peak detectors for analyzing chromatin modification experiments. Nucleic Acids Res 40:e70 16. Elo LL, Kallio A, Laajala TD et al (2012) Optimized detection of transcription factor-binding

191

sites in ChIP-seq experiments. Nucleic Acids Res 40:e1 17. Elo LL, File´n S, Lahesmaa R et al (2008) Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans Comput Biol Bioinform 5:423–431 18. Elo LL, Hiissa J, Tuimala J et al (2009) Optimized detection of differential expression in global profiling experiments: case studies in clinical transcriptomic and quantitative proteomic datasets. Brief Bioinform 10:547–555 19. Cotney J, Leng J, Oh S et al (2012) Chromatin state signatures associated with tissue-specific gene expression and enhancer activity in the embryonic limb. Genome Res 22:1069–1080 20. Holmes KA, Hurtado A, Brown GD et al (2012) Transducin-like enhancer protein 1 mediates estrogen receptor binding and transcriptional activity in breast cancer cells. Proc Natl Acad Sci USA 109:2748–2753 21. Home P, Saha B, Ray S et al (2012) Altered subcellular localization of transcription factor TEAD4 regulates first mammalian cell lineage commitment. Proc Natl Acad Sci USA 109:7362–7367 22. Jishage M, Malik S, Wagner U et al (2012) Transcriptional regulation by Pol II(G) involving mediator and competitive interactions of Gdown1 and TFIIF with Pol II. Mol Cell 45:51–63 23. Meier K, Mathieu EL, Finkernagel F et al (2012) LINT, a novel dL(3)mbt-containing complex, represses malignant brain tumour signature genes. PLoS Genet 8:e1002676 24. Kallio MA, Tuimala JT, Hupponen T et al (2011) Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics 12:507 25. Euskirchen GM, Rozowsky JS, Wei CL et al (2007) Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Res 17:898–909 26. Johnson DS, Mortazavi A, Myers RM et al (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 316: 1497–1502

Chapter 12

Statistical Analysis of ChIP-seq Data with MOSAiCS

Guannan Sun, Dongjun Chung, Kun Liang, and Sündüz Keleş

Abstract

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is invaluable for identifying genome-wide binding of transcription factors and mapping of epigenomic profiles. We present a statistical protocol for analyzing ChIP-seq data. We describe guidelines for data preprocessing and quality control and provide detailed examples of identifying ChIP-enriched regions using the Bioconductor package "mosaics."

Key words Chromatin immunoprecipitation, Next-generation sequencing, Peak finding, Mixture regression model, R, Bioconductor

1 Introduction

The advent of high-throughput next-generation sequencing (NGS) technologies has revolutionized genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. Among the NGS applications, ChIP-seq (chromatin immunoprecipitation followed by NGS) is perhaps the most successful to date. ChIP-seq technology is advancing the studies of genome-wide binding of transcription factors and mapping of epigenomic profiles. Both of these play a crucial role in the programming of cell-specific gene expression; therefore their genome-wide mapping can significantly advance our ability to understand and diagnose human diseases. In a typical ChIP-seq protocol, proteins are cross-linked to DNA in vivo. After extraction and shearing of chromatin, specific protein–DNA complexes are immunoprecipitated with a suitable antibody. Then, the resulting sample is subjected to massively parallel sequencing. The majority of ChIP-seq experiments are currently performed on the Illumina platform [1, 2], which generates millions of short (25–100 bps) directional DNA sequence reads (tags). These reads represent ends of immunoprecipitated DNA fragments with lengths varying between 150 and 300 bps. Although there are many methods for analyzing ChIP-seq data (reviewed in ref. [3]), the computational and statistical challenges associated with these data are evolving as the sequencing platforms improve and become capable of generating hundreds of millions of reads in a single experimental run. In particular, earlier ChIP-seq analysis methods developed for data with low genomic coverage may not be suitable for the analysis of deeply sequenced data without further modification and tuning [4].

In this chapter, we review an adaptive and robust statistical approach, named MOSAiCS [5], for the analysis of ChIP-seq data. MOSAiCS stands for "Model-Based One- and Two-sample Analysis and Inference for ChIP-seq data" and is available in Bioconductor as an R package (www.bioconductor.org). DNA sequence reads in a ChIP-seq sample can be considered as a mixture of reads that originate from actual binding regions, i.e., enrichment events, and reads that are part of the background noise of the sequencing process. The primary goal of a basic ChIP-seq analysis is to identify which regions of the genome show enrichment, i.e., have high read density, in the ChIP sample compared to an appropriate control sample. MOSAiCS implements a flexible parametric mixture regression model to infer, for every small nonoverlapping genomic interval, whether its observed reads are part of the background or the enrichment signal. It accommodates both one-sample (only ChIP sample) and two-sample (ChIP and control samples) analyses. It does not require the sequencing depths, i.e., numbers of reads, of the ChIP and input samples to be balanced and is suitable for the analysis of both low- and high-coverage samples. Furthermore, it is one of the few methods that can handle both single-end reads, where only one end of each immunoprecipitated DNA fragment is sequenced, and paired-end reads, where both ends of the DNA fragments are sequenced.

2 Materials

Throughout this chapter, we use "mosaics," which is available in Bioconductor and can be installed as follows:

R> source("http://bioconductor.org/biocLite.R")
R> biocLite("mosaics")

Then, the package can be loaded into the R environment with the "library" function.

R> library(mosaics)

2.1 Data Description: Raw Read Files

We will use data from two ChIP-seq experiments (Table 1). The first experiment (courtesy of the Svaren Lab at UW, Madison) interrogates Sox10 occupancy in rat spinal cord cells and includes matching input control samples from the same cells. It has two biological replicates, with two lanes of ChIP sequencing for the second replicate.

Table 1
Summary of the ChIP-seq experiments used in the illustrations

ChIP-Seq experiment  Sample  Replicate    No. of reads         No. of aligning reads [a]  % of reduction by
                                          (sequencing depth)   (no capping)               capping at 3 [b] (%)
Rat Sox10            ChIP    1            37,609,463           27,247,876                 24
                     ChIP    2, Lane 1    17,547,664           12,486,976                 40
                     ChIP    2, Lane 2    16,003,622           11,276,531                 36
                     Input   1            35,608,252           27,423,241                 1
                     Input   2            111,820,759          88,476,912                 35
Yeast TFx            ChIP    1            6,393,286            4,577,761                  92
                     ChIP    2            7,855,488            5,297,258                  81
                     ChIP    3            10,161,135           7,482,177                  55
                     Input   1            12,601,191           10,819,302                 10
                     Input   2            11,177,091           9,631,741                  7
                     Input   3            12,357,910           10,643,263                 10

Reported percentage reductions are computed by dividing the sequencing depth with capping at 3 by the sequencing depth without any capping
[a] Denotes reads that uniquely align to the reference genome, i.e., reads mapping to multiple locations are discarded
[b] Capping at 3 is evaluated by allowing a maximum of three reads with exactly the same 5′ mapping genomic location

The second ChIP-seq experiment (courtesy of the Gasch Lab at UW, Madison) measures binding of a sequence-specific transcription factor (referred to as TFx hereafter) in the yeast S. cerevisiae in the presence of ethanol. This latter experiment was performed in triplicate with a matching input control sample for every ChIP sample. TFx is an example of a deeply sequenced dataset: the coverage provided by 24.4 M mappable reads (across the 3 ChIP samples) in yeast corresponds to the coverage provided by 5.3 billion mappable reads in the rat genome.

2.2 Data Description: Aligned Read Files

ChIP-seq analysis with "mosaics" starts with aligned read files. "mosaics" can accommodate a large class of alignment files including Eland result ("eland_result"), Eland extended ("eland_extended"), Eland export ("eland_export"), default Bowtie ("bowtie"), SAM ("sam"), BED ("bed"), and CSEM BED ("csem"). All of these, except CSEM BED, are standard alignment formats used by multiple ChIP-seq software packages. The CSEM BED format is a BED-format alignment that considers reads that map to multiple locations (multi-reads) [6] (see Notes in Section 4 for further details). When starting out with raw read files, we suggest aligning them with Bowtie, which is a memory-efficient short-read aligner [7] and is available at http://bowtie-bio.sourceforge.net/index.shtml. In our yeast TFx application, we aligned the raw reads by running Bowtie with the following parameters:

bowtie -q -5 14 -3 40 -v 2 -m 1 -p 8 /bowtie-0.12.7/indexes/s_cerevisiae /scratch/TFx-EtOH-IP-1.fastq /scratch/TFx_EtOH_IP_1_2mis_bowtie_uni.txt

Option "-5 14 -3 40" trims each read by 14 bases from the 5′ end and 40 bases from the 3′ end. This is determined by investigating the "fastqc" files from the Illumina sequencing run. "-v 2" allows up to two mismatches in the alignment of each read and "-m 1" keeps reads with at most one alignment (uni-reads). We refer to the Bowtie manual for more options [7].

3 Methods

3.1 Reading Aligned Read Files into R Environment

The first step of the analysis with "mosaics" is transforming aligned read files into bin-level data to reflect local read densities. This is accomplished by partitioning the reference genome into nonoverlapping intervals of 100–300 bps. The actual interval size is determined by the average fragment size, i.e., library size, in the experimental ChIP protocol. Then, each mapped read is extended to the average fragment size towards its 3′ end. "mosaics" counts the number of reads overlapping each bin. The number of reads in each bin is treated as the observed data and modeled with a mixture regression model. This data preprocessing procedure is implemented in the "constructBins()" function of "mosaics." For the Sox10 ChIP and input samples, bin files can be constructed as follows:

R> constructBins(infile = "/scratch/sam/sox10_CNS_chip.sam", fileFormat = "sam", outfileLoc = "/scratch/bin_cap0/", byChr = FALSE, excludeChr = "chrM", fragLen = 200, binSize = 200, capping = 0, PET = FALSE)
R> constructBins(infile = "/scratch/sam/sox10_CNS_input.sam", fileFormat = "sam", outfileLoc = "/scratch/bin_cap0/", byChr = FALSE, excludeChr = "chrM", fragLen = 200, binSize = 200, capping = 0)

Execution of this command converts the sam-formatted alignment files ("sox10_CNS_chip.sam" and "sox10_CNS_input.sam") in the local /scratch/sam/ directory into bin files "sox10_CNS_chip.sam_fragL200_bin200.txt" and "sox10_CNS_input.sam_fragL200_bin200.txt" in the local directory /scratch/bin_cap0/, following the naming convention "[infileName]_fragL[fragLen]_bin[binSize].txt" and using an average fragment length of 200 bps and a bin size of 200 bps. Here are a few lines from the bin file "sox10_CNS_chip.sam_fragL200_bin200.txt":

...
chr7  6400  25
chr7  6600  49
chr7  6800  27
chr7  7000   0
chr7  7200   0
chr7  7400   0
chr7  7600   4
chr7  7800   4
chr7  8000   9
...

This output indicates that there are 25 reads spanning the bin that starts at position 6400 and ends at 6599 on chromosome 7. The parameter "byChr = FALSE" ensures that bin data across all chromosomes are written to a single file, whereas "byChr = TRUE" generates chromosome-level bin files. The parameter "capping" controls the maximum number of reads with the same 5′ genomic location at each nucleotide position and is useful for eliminating potential polymerase chain reaction (PCR) amplification artifacts, as discussed in the next section. Capping is not applied when the "capping" parameter is set to 0 or less.

3.2 Exploratory Diagnostics and Assessing Data Quality

Bin-level data can be read into the R environment with the "readBins()" function of "mosaics" and are useful for evaluating the overall data quality with exploratory data plots.

3.2.1 Library Complexity

A common issue in ChIP-seq data is the existence of redundant reads, i.e., reads with the same 5′ genomic location. A large number of redundant reads can be an indication of low library complexity. Redundant reads can arise for biological reasons when the starting material is in small amounts and thus requires extensive PCR amplification, or when the samples are deeply sequenced. They can also arise with adequate amounts of starting material simply as PCR amplification artifacts in the ChIP experimental protocol. Therefore, one needs to distinguish between these cases. In the former case, many redundant reads would be expected to be part of the signal, whereas they could lead to spurious signal in the latter situation. The standard practice for dealing with redundant reads so far has been to cap such read counts at a small number; e.g., if more than five reads have the same 5′ genomic location, PeakSeq [8] caps the total number of reads at this nucleotide at 5. This strategy works reasonably well for samples with low sequencing depths; however, it might wipe out the enrichment signal for high-depth samples. For example, a few million reads are sufficient to give excellent coverage for small genomes such as those of the bacterium E. coli and the yeast S. cerevisiae; hence capping at small values is highly likely to significantly attenuate and even diminish the ChIP enrichment signals in E. coli and S. cerevisiae samples. Table 1 lists the sequencing depths of the ChIP and input samples with and without capping. For the yeast TFx sample, capping reduces the depths of the ChIP samples by 55–92 %, whereas the effect on input is only 1–10 %. Input samples are expected to cover a large fraction of the genome; therefore, they should overall generate much fewer redundant reads unless they are sequenced very deeply. Bin-level data obtained with (capping = k) and without (capping = 0) capping are useful for understanding the extent of the low-complexity library issue and for deciding whether or not to apply any capping. For the Sox10 samples, we generated additional bin-level data by capping at 3. This reduced the sequencing depths of the samples by about 24–40 % (Table 1).

R> constructBins(infile = "/scratch/sam/sox10_CNS_chip.sam", fileFormat = "sam", outfileLoc = "/scratch/bin_cap3/", byChr = FALSE, excludeChr = "chrM", fragLen = 200, binSize = 200, capping = 3, PET = FALSE)
R> constructBins(infile = "/scratch/sam/sox10_CNS_input.sam", fileFormat = "sam", outfileLoc = "/scratch/bin_cap3/", byChr = FALSE, excludeChr = "chrM", fragLen = 200, binSize = 200, capping = 3, PET = FALSE)
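The approximate extent of the reduction reported in Table 1 can be checked directly from the two bin-level datasets. The sketch below is illustrative only and is not part of the original protocol; the object names for the uncapped and capped bin-level data (loaded with "readBins()") are assumptions:

R> depth_nocap = sum(exampleBinData_nocap@tagCount)  # total bin-level ChIP counts, no capping (assumed object name)
R> depth_cap3 = sum(exampleBinData_cap3@tagCount)    # total bin-level ChIP counts, capped at 3 (assumed object name)
R> 100 * (1 - depth_cap3 / depth_nocap)              # approximate percent reduction by capping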

Then, we generated a scatter plot of the two bin-level data (loaded into the object "exampleBinData_cap") with the "hexbin()" plotting function (Fig. 1). The function "readBins()" from "mosaics" is used to load bin-level files into R (details are in Subheading 3.3). The same type of plot, drawn for a pair of ChIP replicates (loaded into "bin_rep"; the complete commands are given in the Appendix), is useful for deciding whether the replicates can be pooled:

R> library(hexbin)
R> a = hexbin(bin_rep[[3]]@input, bin_rep[[3]]@tagCount, xbins = 100)
R> plot(a, trans = log, inv = exp, main = "", colramp = rainbow, xlab = "ChIP (replicate 3)", ylab = "ChIP (replicate 2)", lcex = 0.9)

3.2.3 Overall ChIP Enrichment

A hexbin plot between the ChIP and the input samples, which can be generated by the following commands, is useful for assessing whether the ChIP sample shows enrichment compared to the input sample.

## Details about R object exampleBinData are in Section 3.3.2. ##
R> a = hexbin(exampleBinData@input, exampleBinData@tagCount, xbins = 100)
R> hbp = plot(a, trans = log, inv = exp, main = "", colramp = rainbow, xlab = "Input", ylab = "ChIP", lcex = 0.9)


Fig. 3 ChIP enrichment over input. Hexbin plots of the bin-level read counts of ChIP versus input samples for (a) rat Sox10; (b) yeast TFx ChIP-Seq samples

R> hexVP.abline(hbp$plot.vp, a = 0, b = sum(exampleBinData@tagCount)/sum(exampleBinData@input), lwd = .2)

This is one of the key plots of ChIP-seq data analysis. Figure 3a, b displays bin-level ChIP versus input read counts for the Sox10 and TFx samples. The long vertical arm on the ChIP axis indicates enrichment of the ChIP sample compared to the input sample. The solid line is the sequencing depth ratio of the ChIP sample over the input sample and is displayed as an estimate of the normalization factor between the two samples. We expect the majority of the bins, i.e., 70–95 % based on 26 histone modification ChIP-seq experiments in cell lines K562 and GM12878 from the ENCODE project (http://genome.ucsc.edu/ENCODE/), to be part of the background component of the ChIP sample and their ChIP versus input read counts to align well with the line representing the normalization factor. The ratio of the sequencing depths of the ChIP and input samples is an estimate of the normalization factor between the two samples. This estimate is often inaccurate and highly biased because it normalizes the input sample to the whole ChIP sample rather than to the actual background component of the ChIP sample. We refer to [10] for a discussion of the normalization of ChIP-seq data and alternative estimators of the normalization factor.
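In R, this simple estimate is just the ratio of the total bin-level ChIP and input counts, mirroring the slope of the line drawn above (a one-line sketch using the objects already defined; it is not an additional command from the original protocol):

R> normFactor = sum(exampleBinData@tagCount) / sum(exampleBinData@input)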

3.3 Model Fitting and Peak Calling

Once the ChIP and input samples pass the overall data quality check, we are ready to use "mosaics" for peak calling. In the presence of a control sample (input or other types of control), a two-sample analysis, which directly compares the ChIP sample with the control sample, is preferred. In the absence of a control sample, "mosaics" fits a one-sample model to the ChIP data using genomic features, mappability and GC content, as covariates in its mixture regression model. For the two-sample analysis, "mosaics" provides two options. When the sequencing depth of the input sample is high, two-sample analysis without mappability and GC content, i.e., the "input-only model," often works well. When the sequencing depth is low, two-sample analysis that incorporates mappability and GC content in addition to input usually improves the model fit. We will next illustrate both cases of the two-sample analysis.

Fig. 4 Histogram of ChIP and input read counts for (a) rat Sox10; (b) yeast TFx ChIP-seq samples

3.3.1 Model Fitting: Input-Only Two-Sample Model

For the "input-only" model analysis of the TFx data, bin-level data can be imported into the R environment with the commands

R> exampleBinData = readBins(type = c("chip", "input"), fileName = c("/bin_cap0/TFx_EtOH_IP_2mis_bowtie_uni.txt_fragL200_bin200.txt", "/bin_cap0/TFx_EtOH_input_2mis_bowtie_uni.txt_fragL200_bin200.txt"))
R> plot(exampleBinData)

Function "mosaicsFit()" can then be used to fit the "input-only" model.

R> exampleFit = mosaicsFit(exampleBinData, analysisType = "IO", bgEst = "automatic", truncProb = 0.08)
R> exampleFit
Summary: MOSAiCS model fitting (class: MosaicsFit)
--------------------------------------------------
analysis type: two-sample analysis (Input only)
parameters used: k = 3, d = 0.25
BIC of one-signal-component model = 790031.8
BIC of two-signal-component model = 789043.4
--------------------------------------------------

The GOF plot (Fig. 5a) can be generated by

R> plot(exampleFit)

GOF plots display both the densities of the actual bin-level ChIP and input experimental data and the densities of simulated data generated using parameters from the MOSAiCS model fit. In Fig. 5a (also 5b–d), the black curve, denoted by "Actual data (ChIP)," represents the actual ChIP data and the grey curve, denoted by "Actual data (Control)," the actual input data. The green curve ("Sim:N") denotes the estimated background, whereas the red ("Sim:N+S1") and blue ("Sim:N+S1+S2") curves are densities of simulated data generated from fits based on the one- and two-signal-component MOSAiCS models. If the MOSAiCS model fits the data well, the red and/or blue curves follow the actual ChIP-seq density (black curve) closely. In Fig. 5a, since both the one- and two-signal-component models fit the actual ChIP sample well, BIC can be used to choose the better of these models for peak calling. In this particular case, the two-signal-component model with the smaller BIC value of 789,043.4 provides a better fit. This example also illustrates the use of one of the tuning parameters of the MOSAiCS model, namely, "truncProb." This tuning parameter is relevant for the "input-only" model and controls the robustness of the background parameter estimation to high input read counts. Figure 5b displays the GOF plot with the default value 0.999 of "truncProb." As apparent from the comparison of Fig. 5a, b, setting "truncProb = 0.08" significantly improves the model fit. We provide further discussion on the tuning parameters of the MOSAiCS model in Subheading 3.3.4.

3.3.2 Model Fitting: Two-Sample Model with Mappability and GC Content

We start out with an "input-only" two-sample model for the Sox10 data.

R> exampleBinData = readBins(type = c("chip", "input"), fileName = c("/scratch/bin_cap0/sox10_CNS_chip.sam_fragL200_bin200.txt", "/scratch/bin_cap0/sox10_CNS_input.sam_fragL200_bin200.txt"))
R> exampleFit = mosaicsFit(exampleBinData, analysisType = "IO", bgEst = "automatic")
R> plot(exampleFit)

Figure 5c displays the GOF plot for this analysis. Neither the blue (two-signal-component) nor the red (one-signal-component) curve provides a good fit to the ChIP data. Therefore, we consider fitting a two-sample model with mappability and GC content. Mappability ("M"), GC content ("GC"), and sequence ambiguity scores ("N") can be obtained using the reference genome assembly FASTA files. Sequence ambiguity scores are for excluding regions of the reference genome with ambiguous sequence content denoted by "N." The FASTA files are processed to generate binary and bin-level files representing the "M," "GC," and "N" scores.


For hg18 and mm9, preprocessed files with an average fragment length and bin size of 200 bps are available from http://www.stat.wisc.edu/~keles/Software/mosaics/index.html. Step-by-step instructions and the relevant scripts for generating mappability, GC content, and ambiguity scores are also available from this Web site. Files for other organisms or parameter settings can also be obtained by inquiring with the Google MOSAiCS User Group at https://groups.google.com/forum/?fromgroups#!forum/mosaics_user_group. For the "two-sample model with mappability and GC content" analysis, bin-level data can be imported to the R environment and the fit can be obtained with the following commands:

R> exampleBinData = readBins(type = c("chip", "input", "M", "GC", "N"), fileName = c(...))
R> exampleFit = mosaicsFit(exampleBinData, analysisType = "TS", bgEst = "automatic")
R> exampleFit
Summary: MOSAiCS model fitting (class: MosaicsFit)
--------------------------------------------------
analysis type: two-sample analysis (with mappability & GC content)
parameters used: k = 3, meanThres = 1, s = 2, d = 0.25
BIC of one-signal-component model = 55608364
BIC of two-signal-component model = 54792387
--------------------------------------------------

Figure 5d displays the GOF plot for the two-sample analysis with mappability and GC content in addition to input. The GOF improves significantly compared to the input-only model. Both the GOF plot and the BIC values, which equal 55,608,364 and 54,792,387 for the one- and two-signal-component models, respectively, support that the two-signal-component model provides the best fit to the ChIP sample. Therefore, we will use the two-signal-component model to call peaks with the function "mosaicsPeak()."

3.3.3 Peak Calling

Once the MOSAiCS model fit is obtained, peaks can be identified by controlling the false discovery rate (FDR) [11] at a desired level. In addition to FDR control, MOSAiCS allows filtering of peaks with low ChIP read counts and merging of nearby peaks as illustrated below. The parameters “maxgap,” “minsize,” and “thres” are for refining the peak set.

R> thres = quantile(exampleFit@tagCount, probs = 0.95)
R> examplePeak = mosaicsPeak(exampleFit, signalModel = "2S", FDR = 0.05, maxgap = 200, minsize = 199, thres = thres)
R> examplePeak
Summary: MOSAiCS peak calling (class: MosaicsPeak)
--------------------------------------------------
final model: two-sample analysis (with M & GC) with two signal components
setting: FDR = 0.05, maxgap = 200, minsize = 199, thres = 50
# of peaks = 16839
median peak width = 200
empirical FDR = 0.05
--------------------------------------------------


# Display data for the first 5 peaks
R> print(examplePeak)[1:5,]
  chrID peakStart peakStop peakSize         aveP         minP aveChipCount
1  chr1    115800   115999      200 9.086462e-04 9.086462e-04           73
2  chr1   2274600  2274999      400 3.690171e-15 2.987072e-16         1073
3  chr1   2296200  2296399      200 5.193861e-02 5.193861e-02          161
4  chr1   2374800  2374999      200 8.183753e-02 8.183753e-02           80
5  chr1   2428400  2428599      200 9.724042e-02 9.724042e-02           52
  maxChipCount aveInputCount aveInputCountScaled aveLog2Ratio  map   GC
1           73             7            3.192915     4.141500 0.82 0.33
2         1102            59           26.911712     5.268542 1.00 0.47
3          161            37           16.876836     3.179830 1.00 0.43
4           80            18            8.210353     3.136594 0.85 0.39
5           52            12            5.473569     3.033359 1.00 0.43

As displayed in this output, if the analysis uses mappability and GC for calling peaks, these quantities are also reported for each peak. Finally, the peaks can be exported by the "export()" function in multiple formats ("txt," "bed," and "gff") with the command

R> export(examplePeak, type = "bed", filename = "TSpeakList.BED")
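For instance, the same peak list can also be written in the other supported formats; the output file names below are illustrative:

R> export(examplePeak, type = "txt", filename = "TSpeakList.txt")
R> export(examplePeak, type = "gff", filename = "TSpeakList.gff")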

3.3.4 Tuning Parameters for Improving the MOSAiCS Fit

From a statistical standpoint, the FDR control for peak finding depends on how well the underlying model of the peak caller fits the data. Because MOSAiCS provides GOF plots, it enables assessing how adequate the model fit is. Overall, unsatisfactory model fits can be improved via the tuning parameters of the "mosaicsFit()" function. Our empirical studies suggest that as sequencing depths get larger, the genomic features, mappability and GC content, have less of an impact on the overall model fit. In particular, we suggest tuning the input-only model in cases of unsatisfactory fits before switching to a fit with mappability and GC content. The following are some general tuning suggestions: for the "input-only" two-sample analysis, lowering the value of the "truncProb" parameter (as in the yeast TFx example); for "two-sample analysis with mappability and GC content," a larger "s" parameter and a smaller "meanThres" parameter; for "one-sample analysis with mappability and GC content," varying the "meanThres" parameter. Although MOSAiCS has two additional tunable parameters, "k" and "d," we have accumulated ample empirical evidence through analyzing many datasets that the default values of "k = 3" and "d = 0.25" work well for these parameters.
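As a concrete illustration of these suggestions (a sketch only: truncProb = 0.08 is the value used for the yeast TFx example above, whereas the values of "s" and "meanThres" below are illustrative assumptions, not recommendations from the text):

R> fitIO = mosaicsFit(exampleBinData, analysisType = "IO", bgEst = "automatic", truncProb = 0.08)  # input-only model: lower "truncProb"
R> fitTS = mosaicsFit(exampleBinData, analysisType = "TS", bgEst = "automatic", s = 6, meanThres = 0)  # with mappability & GC: larger "s", smaller "meanThres"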

3.4 Generating Wig Files to View on the UCSC Genome Browser

It is often of interest to visualize the raw sequencing data and the identified peaks with respect to reference genome annotation. “mosaics” peaks and the bin-level read counts can be easily visualized in the UCSC Genome Browser (http://genome.ucsc.edu/).


The following commands convert read alignment files to wiggle track format (WIG), which can then be directly uploaded to the UCSC Genome Browser along with the BED file of the peaks. Since the UCSC Genome Browser has a size limit for custom tracks, chromosome-wise WIG files, generated with the option "byChr = TRUE", are appropriate for large genomes. Alternatively, the WIG files can be converted to the bigWig track format with the binary utilities, e.g., wigToBigWig, provided by the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/admin/exe/).

R> export(examplePeak, type = "bed", filename = "TSpeakList.bed")
R> generateWig(infile = "/scratch/sam/sox10_CNS_chip.sam", PET = FALSE, fileFormat = "sam", outfileLoc = "/scratch/bin_cap0/", byChr = FALSE, useChrfile = FALSE, chrfile = NULL, fragLen = 200, span = 200, capping = 0)

4 Notes

We described the versatile MOSAiCS protocol for ChIP-seq data analysis by utilizing reads that map to unique locations (uni-reads) in the reference genome. The sequencing depths of short-read (25–100 bps) ChIP or input samples can be increased by up to 25 % by utilizing reads that map to multiple locations (multi-reads) [6]. We showed in [6] that even for transcription factors that are not particularly known to interact with repetitive DNA, the number of high-quality peaks can be improved by 10–35 % by utilizing multi-reads. Therefore, incorporation of multi-reads is another important topic in ChIP-seq analysis. Our multi-read allocator CSEM, which is available through the Galaxy Tool Shed (http://toolshed.g2.bx.psu.edu/) and generates the CSEM BED format, is specifically designed for allocating multi-reads in ChIP-seq data analysis [6]. Currently, the majority of ChIP-seq datasets are in the form of single-end reads, where only the 5′ ends of DNA fragments are sequenced. MOSAiCS can easily handle paired-end ChIP-seq data with the "PET = TRUE" option in the "constructBins()" function. When multiple ChIP-seq datasets, such as those of a single transcription factor across different experimental conditions, i.e., time points or treatments, are available, differential enrichment/occupancy can be quantified with the DBChIP method [12], which is also available through Bioconductor. In the case of multiple histone modifications, an integrative analysis approach might be useful for boosting the power of enrichment detection and automatically classifying regions of the reference genome into multiple enrichment patterns. An extension of MOSAiCS, named jMOSAiCS [13], enables simultaneous analysis of multiple ChIP-seq datasets. Once regions of the reference genome are classified into specific enrichment patterns, differential enrichment or occupancy can be further quantified with DBChIP [12].
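As noted above, paired-end data only require switching on the PET option when constructing the bin-level data; a minimal sketch in which the file name, output location, and bin size are illustrative assumptions:

R> constructBins(infile = "/scratch/sam/chip_PET.sam", fileFormat = "sam", outfileLoc = "/scratch/bin_PET/", byChr = FALSE, PET = TRUE, binSize = 200)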

Acknowledgments

This work is supported by National Institutes of Health Grants (HG0067161, HG003747) to S.K. We thank Audrey Gasch and Jeff Lewis (yeast TFx), John Svaren and Rajini Srinivasan (Sox10 in rat), and Qiang Chang and Emily Cunningham (human ChIP-seq) for the datasets and useful discussions regarding the analysis.

Appendix: R Script for the Analysis of Yeast TFx ChIP-seq Datasets

library(mosaics)
library(hexbin)

# construct bin-level files for each replicate ChIP sample #
constructBins(infile = "TFx_EtOH_IP_1_2mis_bowtie_uni.txt", outfileLoc = "/bin_cap0", fileFormat = "bowtie", byChr = FALSE, fragLen = 200, binSize = 200, capping = 0, PET = FALSE)
constructBins(infile = "TFx_EtOH_IP_2_2mis_bowtie_uni.txt", outfileLoc = "/bin_cap0", fileFormat = "bowtie", byChr = FALSE, fragLen = 200, binSize = 200, capping = 0, PET = FALSE)
constructBins(infile = "TFx_EtOH_IP_3_2mis_bowtie_uni.txt", outfileLoc = "/bin_cap0", fileFormat = "bowtie", byChr = FALSE, fragLen = 200, binSize = 200, capping = 0, PET = FALSE)

# construct bin-level files for pooled ChIP (uncapped and capped by 3) and pooled input samples #
constructBins(infile = "TFx_EtOH_input_2mis_bowtie_uni.txt", outfileLoc = "/bin_cap0", fileFormat = "bowtie", byChr = FALSE, fragLen = 200, binSize = 200, capping = 0, PET = FALSE)
constructBins(infile = "TFx_EtOH_input_2mis_bowtie_uni.txt", outfileLoc = "/bin_cap3", fileFormat = "bowtie", byChr = FALSE, fragLen = 200, binSize = 200, capping = 3, PET = FALSE)
constructBins(infile = "TFx_EtOH_chip_2mis_bowtie_uni.txt", outfileLoc = "/bin_cap0", fileFormat = "bowtie", byChr = FALSE, fragLen = 200, binSize = 200, capping = 0, PET = FALSE)

# generate hexbin plots between replicates to decide on pooling #
bin_rep = vector("list", 3)
bin_rep[[1]] = readBins(type = c("chip", "input"), fileName = c("/bin_cap0/TFx_EtOH_IP_1_2mis_bowtie_uni.txt_fragL200_bin200.txt", "/bin_cap0/TFx_EtOH_IP_2_2mis_bowtie_uni.txt_fragL200_bin200.txt"))
bin_rep[[2]] = readBins(type = c("chip", "input"), fileName = c("/bin_cap0/TFx_EtOH_IP_1_2mis_bowtie_uni.txt_fragL200_bin200.txt", "/bin_cap0/TFx_EtOH_IP_3_2mis_bowtie_uni.txt_fragL200_bin200.txt"))
bin_rep[[3]] = readBins(type = c("chip", "input"), fileName = c("/bin_cap0/TFx_EtOH_IP_2_2mis_bowtie_uni.txt_fragL200_bin200.txt", "/bin_cap0/TFx_EtOH_IP_3_2mis_bowtie_uni.txt_fragL200_bin200.txt"))
xlabel = c("rep1", "rep1", "rep2"); ylabel = c("rep2", "rep3", "rep3")
for (i in 1:3) {
  a = hexbin(bin_rep[[i]]@tagCount, bin_rep[[i]]@input, xbins = 100)
  plot(a, trans = log, inv = exp, xlab = xlabel[i], ylab = ylabel[i], colramp = rainbow)
}

# generate hexbin plots of uncapped and capped bin-level read count data #
bin_cap = readBins(type = c("chip", "input"), fileName = c("/bin_cap0/TFx_EtOH_IP_2mis_bowtie_uni.txt_fragL200_bin200.txt", "/bin_cap3/TFx_EtOH_IP_2mis_bowtie_uni.txt_fragL200_bin200.txt"))
a = hexbin(bin_cap@input, bin_cap@tagCount, xbins = 100)
plot(a, trans = log, inv = exp, xlab = "Capped counts", ylab = "Uncapped counts", colramp = rainbow)

# load bin-level files for MOSAiCS analysis #
bin = readBins(type = c("chip", "input"), fileName = c("/bin_cap0/TFx_EtOH_IP_2mis_bowtie_uni.txt_fragL200_bin200.txt", "/bin_cap0/TFx_EtOH_input_2mis_bowtie_uni.txt_fragL200_bin200.txt"))

# generate hexbin plot to check ChIP enrichment over input #
a = hexbin(bin@input, bin@tagCount, xbins = 100)
plot(a, trans = log, inv = exp, xlab = "Input", ylab = "ChIP", colramp = rainbow)

# histogram of bin-level files #
plot(bin)

# use input-only model to fit data #
fit = mosaicsFit(bin, analysisType = "IO", bgEst = "automatic", truncProb = 0.08)

# goodness-of-fit plot of the model #
plot(fit)

# call peaks #
thres = quantile(fit@tagCount, probs = 0.95)
peak = mosaicsPeak(fit, signalModel = "2S", FDR = 0.05, maxgap = 200, minsize = 50, thres = thres)

# export peaks #
export(peak, type = "txt", filename = "TFx_TSpeakList.txt")
export(peak, type = "bed", filename = "TFx_TSpeakList.bed")

# generate wig files #
generateWig(infile = "TFx_EtOH_IP_2mis_bowtie_uni.txt", PET = FALSE, fileFormat = "bowtie", outfileLoc = "/bin_cap0/", byChr = FALSE, useChrfile = FALSE, chrfile = NULL, fragLen = 200, span = 200, capping = 0)

References

1. Shen Y, Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, Wagner U, Dixon J, Lee L, Lobanenkov VV, Ren B (2012) A map of cis-regulatory sequences in the mouse genome. Nature 488:116–120
2. Fujiwara T, O'Geen H, Keles S, Blahnik K, Linnemann AK, Kang Y, Choi K, Farnham PJ, Bresnick EH (2009) Discovering hematopoietic mechanisms through genome-wide analysis of GATA factor chromatin occupancy. Mol Cell 36(4):667–681
3. Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-Seq peak detection. PLoS One 5:e11471
4. Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang T, Kim T-K, He HH, Zieba J, Ruan Y, Bickel PJ, Myers RM, Wold BJ, White KP, Lieb JD, Liu XS (2012) Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 9(6):609–614
5. Kuan PF, Chung D, Pan G, Thomson JA, Stewart R, Keles S (2011) A statistical framework for the analysis of ChIP-Seq data. J Am Stat Assoc 106(495):891–903
6. Chung D, Kuan P-F, Li B, SanalKumar R, Liang K, Bresnick E, Dewey C, Keles S (2011) Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput Biol 7(7):e1002111
7. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
8. Rozowsky J, Euskirchen G, Auerbach R, Zhang Z, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein M (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 27:66–75
9. Benjamini Y, Speed TS (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 40(10):e72
10. Liang K, Keles S (2012) Normalization of ChIP-seq data with control. BMC Bioinformatics 13:199
11. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 57(1):289–300
12. Liang K, Keles S (2012) Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics 28(1):121–122
13. Zeng X, Sanalkumar R, Bresnick EH, Li H, Chang Q, Keles S (2012) jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Submitted. Technical report available at http://www.stat.wisc.edu/~keles/Papers/jmosaics.pdf. R package available at http://www.stat.wisc.edu/~keles/Software/

Chapter 13

Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing

Lukasz J. Kielpinski, Mette Boyd, Albin Sandelin, and Jeppe Vinther

Abstract

Detection of reverse transcriptase termination sites is important in many different applications, such as structural probing of RNAs, rapid amplification of cDNA 5′ ends (5′ RACE), cap analysis of gene expression, and detection of RNA modifications and protein–RNA cross-links. The throughput of these methods can be increased by applying massive parallel sequencing technologies. Here, we describe a versatile method for detection of reverse transcriptase termination sites based on ligation of an adapter to the 3′ end of cDNA with bacteriophage TS2126 RNA ligase (CircLigase™). In the following PCR amplification, Illumina adapters and index sequences are introduced, thereby allowing amplicons to be pooled and sequenced on the standard Illumina platform for genomic DNA sequencing. Moreover, we demonstrate how to map sequencing reads and perform analysis of the sequencing data with freely available tools that do not require formal bioinformatics training. As an example, we apply the method to detection of transcription start sites in mouse liver cells.

Key words Reverse transcription, Termination, Sequencing, TS2126 RNA ligase, CAGE, Galaxy

1 Introduction

Detection of reverse transcriptase termination sites (RTTS) is a general strategy that can be used to detect different features of RNA, such as their ends [1], modifications [2], structure [3], and binding of proteins [4]. Historically, RTTS have been monitored by fragment analysis, using radioactive or fluorescent labelling of the primer used for the reverse transcription and detection with denaturing gel or capillary electrophoresis, respectively. Alternatively, RTTS can be detected by ligating an adapter to the 3′ end of the terminated cDNA, followed by cloning and sequencing. While fragment analysis has been used very successfully to investigate many different RNA features, the decreasing cost of sequencing makes it increasingly advantageous to use sequencing for detection of RTTS. It is therefore likely that existing RTTS-based methods will be adapted for sequencing and that new methods will be developed.


The key step in the detection of RTTS by sequencing is to attach sequencing adapter sequences to the ends of the cDNA. Typically the 5′ adapter sequence is included as an overhang in the gene-specific or random primer used for the first-strand reaction. The next step is the ligation of an adapter to the 3′ end of the terminated cDNA, and several methods for doing this have been developed. In single-strand linker ligation, a double-stranded adapter with a 3′ overhang is ligated to the free 3′ end of the RTTS cDNA using T4 DNA ligase [5]. Alternatively, a single-stranded adapter can be used for ligation with the thermostable TS2126 RNA ligase (CircLigase) [6]. The efficiency of both of these enzymes is somewhat biased by the sequence at the very 3′ end of the cDNA that has to be ligated (results not shown), but these biases are reproducible and are therefore not an issue if an appropriate control is used for normalization. Another issue is the ability of reverse transcriptase to add 1–3 untemplated nucleotides to the 3′ end of cDNAs. This occurs more efficiently at capped 5′ ends compared to 5′ ends ending in OH (typical for degraded RNA) [7] and has to be taken into account when sequences are mapped to the RNA being investigated. The added nucleotides allow the reverse transcriptase to perform template switching, which can be exploited to add an adaptor sequence to the 3′ end of cDNAs [8].

Some RTTS methods have successfully been adapted to massive parallel sequencing. Cap analysis of gene expression (CAGE) has been successfully used to identify transcription start sites (TSS) [9]. Originally the CAGE method was based on concatenation of CAGE tags and Sanger sequencing [10], but it has recently been adapted to massive parallel sequencing [1]. Another example is SHAPE-based probing of RNA structure, which has been widely and successfully used for investigating the structure of single RNAs using capillary electrophoresis [11]. Nevertheless, recent results demonstrating that populations of RNA molecules can be SHAPE probed in parallel using sequencing fuel hope that the throughput of structure probing can be increased [12]. These successful implementations of sequencing for RTTS detection suggest that RTTS methods generally can be adapted to the new sequencing technologies.

Here, we describe a general method for detecting RTTS based on the Illumina paired-end genomic DNA adapters, sequencing primer, and indexing reads. Samples can therefore be multiplexed with other samples containing the standard Illumina adaptors and used for both single- and paired-end sequencing. The method can easily be adapted to detect RTTS produced by any experimental protocol. In addition, we demonstrate in detail how to go from the raw sequencing reads to counts of RTTS mapped to the RNA being investigated and how to compare with the existing annotation and visualize the results in the UCSC genome browser. An overview of the entire protocol is shown in Fig. 1.


Fig. 1 Schematic outline of the analysis. The starting material consists of RNA molecules containing a feature of interest that can cause reverse transcriptase termination. The RNA is reverse transcribed with a primer containing a 5′ adapter overhang. After cDNA purification, a second adapter is ligated to the 3′ ends of the obtained cDNA. Molecules containing both adapters serve as templates for a PCR, which adds all necessary elements for Illumina sequencing. After library sequencing, the resulting sequencing reads are mapped to sequences of interest (this could be the full genome or selected RNA sequences) and the locations of the reads' 5′ ends (corresponding to the feature of interest) are counted. The resulting RTTS count file can be used for further analysis, such as visualization in the UCSC genome browser, producing RTTS plots for specific RNA molecules, and comparing with the existing annotation

2 Materials

2.1 RNA Sample

1. Material to be analyzed: the RNA should be treated in a way that reverse transcription will terminate at the sites of interest. These could be RNA strand breaks, RNA modifications, RNA 5′ ends, or protein–RNA cross-links, among others.

2.2 Oligonucleotides

1. Oligonucleotide sequences are listed in Table 1. RT_random_primer and LIGATION_ADAPTER were HPLC purified, and the remaining oligonucleotides were PAGE purified.

2.3 Reverse Transcription and Purifications

1. PrimeScript™ Reverse Transcriptase including PrimeScript™ 5× buffer (Takara).
2. 10 mM dNTPs.
3. Sorbitol–trehalose mix (1.67 M sorbitol, 0.33 M trehalose).
4. Agencourt® AMPure® XP–PCR Purification (Beckman Coulter).
5. Agencourt® RNAClean® XP (Beckman Coulter).
6. 70 % EtOH.
7. 5 mM Na-citrate pH 6.
8. 10 mM Tris–HCl pH 8.3.
9. RNase H (New England Biolabs).

2.4 Linker Ligation

1. CircLigase (Epicentre).
2. 1 mM ATP (Epicentre).
3. CircLigase buffer (Epicentre).
4. 50 mM MnCl2 (Epicentre).
5. 50 % PEG 6000 (filter sterilized).
6. 5 M glycine betaine (filter sterilized).

2.5 PCR

1. Phusion® High-Fidelity DNA Polymerase (NEB).
2. 5× HF Phusion buffer (NEB).
3. 10 mM dNTPs.
4. H2O (PCR grade).

2.6 Quality Control

1. Agarose electrophoresis.
2. 1× TBE buffer.
3. Agarose.
4. 6× DNA loading buffer (Fermentas).
5. DNA size standard with a 150 bp band (e.g., Ultra Low Range DNA ladder, Fermentas).
6. Stain G (Serva).
7. Agilent DNA 1000 Kit (Agilent Technologies).

Table 1
Oligonucleotides used in this study

Name                           Primer sequence
RT_random_primer               AGACGTGTGCTCTTCCGATCTNNNNNNNNS
LIGATION_ADAPTER               5′ phosphate-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′ 3NHC3
PCR_forward                    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT
PCR_REVERSE_INDEX.1_ATCACG     CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.2_CGATGT     CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.3_TTAGGC     CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.4_TGACCA     CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.5_ACAGTG     CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.6_GCCAAT     CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.7_CAGATC     CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.8_ACTTGA     CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.9_GATCAG     CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.10_TAGCTT    CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.11_GGCTAC    CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.12_CTTGTA    CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.13_AGTCAA    CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.14_AGTTCC    CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.15_ATGTCA    CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.16_CCGTCC    CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

All sequences are 5′–3′. The oligonucleotide sequences of the Illumina genomic DNA adapters are copyrighted by Illumina, Inc. 2006. All rights reserved. LIGATION_ADAPTER is a linker with a clonable 5′ end and an amino-blocked 3′ end. Index sequences are shown in bold.

2.7 Equipment

1. Tubes.
   (a) RT, purifications, ligation: 0.5 ml PCR tube (BRAND, 781310).
   (b) PCR: 0.2 ml 8-strip tubes (Alpha Laboratories, LW2500).
2. Thermocyclers.
   (a) RT, ligation: MJ Research PTC-200 for 0.5 ml tubes.
   (b) PCR: BIO-RAD S1000.
3. Magnetic stand.
4. NanoDrop 1000.
5. Agilent 2100 Bioanalyzer.

3 Methods

3.1 Reverse Transcription (Modified from ref. 1)

1. Mix 1 μg of RNA starting material (could be in vitro-transcribed RNA or purified RNA) with 100 pmol of RT_random_primer in a 7.5 μl volume (the optimal amount of primer can vary with the specific application). Heat denature for 5 min at 65 °C, and put on ice (see Note 1).
2. Prepare master mix. For one reaction take 7.5 μl 5× PrimeScript Buffer, 1.87 μl 10 mM dNTP, 7.5 μl sorbitol–trehalose mix (which is half of the concentration used in [1]), 9.38 μl H2O, and 3.75 μl PrimeScript enzyme. Add 30 μl master mix to the RNA–primer, and mix by pipetting (see Note 2).
3. Incubate as follows: 25 °C, 10 min (skip this incubation if a gene-specific primer is used); 42 °C, 30 min; 50 °C, 10 min; 56 °C, 10 min; 60 °C, 10 min; and hold at 4 °C. The result of the reverse transcription is a cDNA carrying a 5′ adapter and terminating at the feature of interest (Fig. 2a).
4. Inactivate the reverse transcriptase enzyme by incubating the sample for 15 min at 70 °C, then place on ice, and add 1 μl RNase H enzyme (New England Biolabs, 5,000 U/ml). Incubate for 20 min at 37 °C to degrade the RNA (see Note 3).

3.2 cDNA Purification (Modified from ref. 1)

1. Add 67.5 μl RNAClean XP beads (room temperature, well mixed) to the reactions, and pipette mix. Incubate at room temperature for 30 min, vortexing every 10 min.
2. Put on a magnetic stand for 5 min, and aspirate the cleared solution.
3. Wash 2× with 70 % ethanol (the volume used depends on the tubes used; for 500 μl tubes use 400 μl ethanol).
4. Add 40 μl 5 mM Na-citrate (pH 6) preheated to 37 °C, and mix extensively by pipetting. Incubate for 10 min at 37 °C.


Fig. 2 Outline of library generation. (a) The first steps in library generation are reverse transcription and ligation of an adapter to the 3′ end of the cDNA, which corresponds to the location of the feature of interest. (b) In the subsequent PCR, Illumina adapter sequences are added to produce a double-stranded DNA library that is ready for sequencing on the Illumina genomic DNA platform

5. Place on the magnetic stand, and transfer the eluate to a new tube (see Note 4).

3.3 cDNA Ligation

1. Prepare master mix. For one reaction take 1 μl of CircLigase buffer (Epicentre), 0.5 μl of 1 mM ATP, 0.5 μl 50 mM MnCl2, 2 μl of 50 % PEG 6000, 2 μl of 5 M betaine, 0.5 μl 100 μM LIGATION_ADAPTER, and 0.5 μl CircLigase enzyme. Mix well.
2. Split the master mix into 7 μl aliquots, and add 3 μl of cDNA.
3. Incubate as follows: 60 °C, 2 h; 68 °C, 1 h; 80 °C, 10 min; and hold at 4 °C.
4. Add 10 μl H2O to increase the volume.
5. Purify as in Subheading 3.2, but using Ampure XP beads (20 μl ligation reaction + 36 μl Ampure beads). Elute in 16 μl H2O. The result of the cDNA ligation step is a single-stranded cDNA containing adapters at both the 5′ and 3′ ends, which can be used for the subsequent PCR reaction (Fig. 2b).

3.4 PCR Amplification of Library

1. Prepare master mix. For one reaction take 3 μl PCR_forward 10 μM primer, 10 μl Phusion 5× HF buffer, 1 μl 10 mM dNTPs, 27.5 μl H2O, and 1 μl Phusion DNA polymerase. Mix well.
2. Split the master mix into 42.5 μl aliquots, and add 2.5 μl of indexing primer (PCR_REVERSE_INDEX.##_NNNNNN) (see Note 5) and 5 μl purified linker-ligated cDNA. Start the PCR program as follows: 98 °C, 3 min; (98 °C, 80 s; 64 °C, 15 s; 72 °C, 30 s) × 4; (98 °C, 80 s; 72 °C, 45 s) × 15; 72 °C, 5 min; and hold at 4 °C (see Note 6).
3. Agarose electrophoresis (see Note 7). Prepare a 2 % agarose gel with a DNA stain (e.g., Stain G). Apply 5 μl of the samples (add loading dye) and the size standard, and run at 4 V/cm until the bromophenol blue from the loading dye has travelled approximately 2.5 cm. Visualize under UV light. You should see smears of products longer than 200 bp. The presence of amplified PCR product shorter than 150 bp is typically caused by low amounts of starting material, combined with small amounts of leftover reverse transcription primer in the ligation reaction, and is the result of amplification of directly ligated RT primer–LIGATION_ADAPTER molecules, which can be Illumina sequenced but are uninformative (see Fig. 3a). To get rid of the short PCR product, try to redo the library with more starting material or alternatively perform agarose gel purification to remove the short PCR product. If no amplified library (smear) is detected at this step, perform a small-scale PCR with different numbers of cycles and analyze the PCR reactions by agarose electrophoresis. Then repeat the PCR with the lowest number of cycles that allows detection of the library on the gel. The optimal number of cycles depends on the amount of starting material.


Fig. 3 Expected result from PCR amplification. (a) PCR products are first checked by agarose electrophoresis. A successfully prepared library should form a smear of molecules longer than 150 bp (lane 2). The presence of a band shorter than 150 bp (lane 1) indicates problems with library preparation (see step 3 of Subheading 3.4). (b) The library is purified and checked for size distribution on an Agilent Bioanalyzer DNA 1000 chip. A successfully prepared library should have dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp

3.5 Purification and Quantification of Library (See Note 7)

1. Ampure XP purification: as in Subheading 3.2, but use Ampure XP beads and add 72 μl beads to 40 μl PCR reaction. Elute in 20 μl preheated 10 mM Tris–HCl pH 8.3.
2. Measure the concentration on the NanoDrop (as dsDNA) and run a Bioanalyzer DNA 1000 chip. Perform smear analysis (side panel -> Global -> Advanced -> Smear analysis -> regions) with the range 140–600 bp and use the molarity as a guideline for your sequencing order. The library should contain dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp (Fig. 3b) (see Note 8).
3. Samples can now be sequenced using standard Illumina genomic DNA sequencing and can be multiplexed with other samples made with the same adapters (genomic DNA) as long as they utilize different indexes (see Note 5).

3.6 Data Analysis Using Linux Command Line

1. Data analysis of massive parallel sequencing experiments can be a challenge for scientists without formal training in bioinformatics. Below we demonstrate in detail how to go from the sequencing output (FASTQ file) to an RTTS count file without assuming prior knowledge of bioinformatics, using tools available in GALAXY [13–16], including the Bowtie mapper for sequencing reads [17] and the FASTX toolkit [18]. However, using a Unix or an OS X machine with a command-line interface is recommended for large projects. For those users, the analysis implemented in Subheadings 3.7–3.9 can be carried out using Bowtie and an awk script available at the URL http://people.binf.ku.dk/~lukasz/SAM2counts.awk.

1. Log in to Galaxy (http://usegalaxy.org/) and create a new Galaxy history. Upload the relevant FASTQ files to Galaxy with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTQ file on your computer and click execute (see Note 9). 2. Check the integrity of the FASTQ files with the “FASTQ Groomer” tool found in the “NGS: QC and manipulation” tool category. For newer FASTQ files (Illumina 1.8 and later) the quality is encoded in Sanger format. Choose the Galaxy history item containing the FASTQ file, set “Input FASTQ quality scores type:” to Sanger, and click execute (see Note 10). 3. Compute FASTQ quality statistics with the “Compute quality statistics” tool found in the “NGS: QC and manipulation” tool category. Choose the groomed FASTQ file and click execute. 4. Plot the distributions of quality scores for the different sequencing cycles using the “Draw quality score boxplot” tool found in the “NGS: QC and manipulation” tool category. Choose the Galaxy history item containing the output of the “Compute quality statistics” tool and click execute. Look at the resulting boxplot by clicking on the eye icon next to the “Draw quality score boxplot” history item (see Fig. 4a). For most experiments, where the median quality is not very low (falling below 25), it is unnecessary to filter the reads on quality. If quality is very low it may be an advantage to filter the reads for low quality using the “Filter by quality” tool found in the “NGS: QC and manipulation” tool category. Set the “Quality cut-off value” option to 20 and the “Percent of bases in sequence that must have quality equal to/higher than cut-off value” option to 90 and click execute. 5. Plot nucleotide distributions of the different sequencing cycles using the “Draw nucleotides distribution chart” tool found in the “NGS: QC and manipulation” tool category. Look at the resulting plot by clicking on the eye icon next to the “Draw nucleotides distribution chart” history item (see Fig. 4b).

Reverse Transcriptase Termination Site (RTTS) Mapping

223

Fig. 4 Expected quality plots of sequencing reads. (a) Example of quality boxplot produced by Galaxy. The plot shows the median read quality in the different sequencing cycles. (b) Example of nucleotide distribution plot produced by Galaxy. The plot shows the percentage of the nucleotides in the different sequencing cycles. Deviation from uniform distribution in the first cycle reflects a combination of bias for specific nucleotides in terminal transferase activity of Reverse Transcriptase, bias in the TS2126 RNA ligase reaction and in some cases biased seqences of the genomic locations being mapped by the RTTS

The nucleotide distributions are typically similar across the sequencing cycles, but if this is not the case, the library may not have sufficient complexity or be contaminated with adapter–adapter ligation products.

224

Lukasz J. Kielpinski et al.

3.8 Mapping Reads with Bowtie

1. Depending on the nature of your experiment you can map your reads either to the entire genome relevant for the experiment or to one or more RNA sequences. The genomes of the most commonly investigated species are pre-installed in Galaxy, whereas mapping to one or more specific RNAs requires that the sequence(s) is uploaded to Galaxy as a FASTA file. If necessary upload a FASTA file with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTA file on your computer and click execute. 2. To map the reads, use the “Map with Bowtie for Illumina” tool found in the “NGS: Mapping” tool category. If mapping to a genome that is pre-indexed in Galaxy, choose “Use a built-in index” and the relevant genome. Otherwise choose “Use one from history” and select history item containing the uploaded FASTA file. Next, select the history item containing the groomed (and filtered) FASTQ file under the “FASTQ file” option and choose “Full parameter list” in the “Bowtie settings to use” drop-down menu. Then change “Maximum number of mismatches permitted in the seed (-n)” to 3 and “Maximum permitted total of quality values at mismatched read positions (-e)” to 300 and choose “Use best” in the “Whether or not to make Bowtie guarantee that reported singleton alignments are ‘best’ in terms of stratum and in terms of the quality values at the mismatched positions (–best)” drop-down menu. Finally map the reads by clicking Execute. The mapping may take a while depending on the size of the FASTQ file and the sequence to be mapped against (see Note 11).

3.9 Preparing an RTTS Count File from SAM File

1. It is necessary to trim mapped reads that contain untemplated nucleotides added by reverse transcriptase (see Note 12). This trimming requires many Galaxy operations and we have therefore created a Galaxy workflow to perform this operation and subsequently count and sum RTTS. In this procedure reads are trimmed if they contain mismatches in the first three positions (Fig. 5; the trimming rule is also sketched in code after this step list). To download the workflow go to https://main.g2.bx.psu.edu/workflow/list_published and search for RTTS Mapper. Click on the workflow and import it into your own Galaxy account by clicking on "Import workflow" in the upper right corner. Alternatively, if a local instance of Galaxy is used, the RTTS Mapper workflow can be imported into Galaxy by clicking on "Workflow" on the top Galaxy bar and then on the "Upload and import workflow" button in the upper right corner. At the URL https://main.g2.bx.psu.edu/workflow/import_workflow, the workflow can be imported by providing the URL http://people.binf.ku.dk/~lukasz/Galaxy_RTTS_mapper.ga as the "Galaxy workflow URL" and clicking import.

Fig. 5 Schematic representation of the trimming performed by the RTTS Mapper. After mapping the reads to the genome, the three 5' terminal nucleotides of each mapped read (corresponding to the 3' end of the cDNA molecule) are evaluated for mismatches to the reference sequence and trimmed if necessary. The four possible scenarios are the following: full match (a), mismatch at the terminal position (b), one position (c), or two positions (d) before the terminal position, in which cases we trim 0, 1, 2, or 3 positions, respectively (returned positions are indicated by the triangles). Red boxes: mismatched positions; white boxes: matched positions

Also import a control file to your history by pasting the URL http://people.binf.ku.dk/~lukasz/RTTS_control.interval into the URL/Text window of the "Upload File from your computer" tool found in the "Get Data" tool category.
2. To prepare RTTS count files, use the RTTS Mapper workflow imported above. Click on "Workflow" on the top Galaxy bar and then on the "RTTS mapper" workflow and choose "Run." Select the history item containing the SAM file from the Bowtie mapping for the "Select dataset to convert" option and the RTTS_control.interval file for the "Select control file" option and click "Run workflow" at the bottom of the page.
3. The resulting RTTS count files (for counts on the plus and minus strand, respectively) can be used for further analysis in R, Excel, or another data analysis program. The exact analysis will depend on the nature of the experiment performed. Below we provide tools for some common types of analysis using the freely available tool R, which can easily be installed on any computer platform [19] (see Note 13).
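Purely as an illustration of the trimming rule shown in Fig. 5 and applied by the workflow above, a minimal R sketch follows. The function name and inputs are hypothetical: the published workflow operates on the mapped SAM records through Galaxy operations, not on plain strings, and for reads with more than one mismatch in the first three positions the sketch assumes that trimming extends to the innermost mismatch.

# Sketch of the Fig. 5 trimming rule (not the actual workflow): compare the three
# 5'-terminal bases of a mapped read with the reference and return how many
# bases to trim. Index 1 is the 5' terminal base of the mapped read.
rtts_trim_length <- function(read_5prime_trinucleotide, ref_trinucleotide) {
  read_bases <- strsplit(read_5prime_trinucleotide, "")[[1]]
  ref_bases  <- strsplit(ref_trinucleotide, "")[[1]]
  mismatches <- which(read_bases != ref_bases)
  if (length(mismatches) == 0) return(0)   # full match: trim nothing (Fig. 5a)
  max(mismatches)                          # trim 1, 2, or 3 bases (Fig. 5b-d)
}

rtts_trim_length("GAC", "AAC")   # 1: mismatch at the terminal position
rtts_trim_length("GGC", "GAC")   # 2: mismatch one position before the terminal position
rtts_trim_length("GAG", "GAC")   # 3: mismatch two positions before the terminal position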

3.10 Preparing Wig File and Visualizing in the UCSC Genome Browser

If the RTTS experiments have been mapped to a genome assembly, it will often be advantageous to visualize the results on the UCSC Genome Browser and compare with the many kinds of data available as tracks. To do this it is necessary to convert the RTTS file to the UCSC wig format.

1. Download the RTTS count files to your local computer from the Galaxy server by clicking on the floppy disc icon for the relevant history items.
2. The RTTS count file can be converted to wig format by copy/pasting a small program (script) into R. Download the provided script from http://people.binf.ku.dk/~lukasz/wig_generator.R. Open the file in a text editor and modify it by changing the assignment of the variables "input_filename_plus" and "input_filename_min" to the names of the files produced by the Galaxy workflow (the wig format itself is illustrated in the sketch after this list).
3. Start up R and change the working directory to the one containing the RTTS count files by writing setwd("path of file directory") in the console window and pressing enter, or by using the "Change dir" command found in the File menu. Then paste the edited script into the R console and hit enter. This will produce two new files named OUTPUTp.wig and OUTPUTm.wig in the same folder.
4. Go to the UCSC Genome Browser (http://www.genome.ucsc.edu/cgi-bin/hgGateway) and choose the species and assembly that were used for the mapping of the RTTS experiment in the drop-down menus. Then click "manage custom tracks," browse the local drive for the wig files, and submit them one by one (after adding the first one, press "add custom tracks"). Finally, press "go to the genome browser" to view the RTTS counts as a histogram at each genomic position (Fig. 6a).
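The provided wig_generator.R script performs the actual conversion. Purely to illustrate what the resulting wig track looks like, a minimal sketch is given below. It assumes a tab-delimited count file with three columns (chromosome, 1-based position, count), which may differ from the exact layout of the Galaxy output; the input and output file names are placeholders that merely mimic the names used elsewhere in this protocol.

# Minimal illustration of writing RTTS counts as a UCSC variableStep wiggle track.
# Assumed input columns: chromosome, 1-based position, count (placeholder layout).
counts <- read.table("counts_plus.txt", sep = "\t", header = FALSE,
                     col.names = c("chrom", "position", "count"),
                     stringsAsFactors = FALSE)

con <- file("OUTPUTp.wig", "w")
writeLines("track type=wiggle_0 name=\"RTTS plus strand\"", con)
for (chrom in unique(counts$chrom)) {
  writeLines(paste0("variableStep chrom=", chrom), con)       # one block per chromosome
  chrom_counts <- counts[counts$chrom == chrom, ]
  writeLines(paste(chrom_counts$position, chrom_counts$count), con)
}
close(con)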

3.11 Making Plots for Single RNAs

If the RTTS data were mapped to single RNAs (using a provided FASTA file) rather than the full genome, it will often be relevant to visualize the RTTS counts across each of the different RNAs.
1. Create a new folder and download into it the FASTA file that was used for the mapping and the two RTTS count files from the Galaxy history, by clicking on the floppy disc icon for the relevant history items. Rename the RTTS count files to counts_plus.txt and counts_minus.txt.
2. Then download the R script http://people.binf.ku.dk/~lukasz/few_genes_histogram.r and open it in a text editor. Start R and set the working directory (as described in Subheading 3.10, step 3) to the folder containing the FASTA file and the RTTS count files. To generate RNA-specific RTTS plots for each RNA in the FASTA file that has at least one mapped read, copy/paste the script into the R console window and hit enter (Fig. 6b; a minimal plotting sketch follows this list).
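The few_genes_histogram.r script handles all RNAs in the FASTA file automatically. As a self-contained illustration of the kind of plot it produces, the sketch below draws the 5'-end counts along a single RNA. The column layout of the count file and the sequence name ("Hmgcs2", the mRNA shown in Fig. 6b) are assumptions made for this example only.

# Minimal illustration of an RTTS plot for a single RNA (not the provided script).
# Assumed input columns: sequence name, position, count (placeholder layout).
counts <- read.table("counts_plus.txt", sep = "\t", header = FALSE,
                     col.names = c("seqname", "position", "count"),
                     stringsAsFactors = FALSE)

rna <- counts[counts$seqname == "Hmgcs2", ]   # one of the RNAs present in the FASTA file
profile <- integer(max(rna$position))         # one slot per position along the RNA
profile[rna$position] <- rna$count

plot(seq_along(profile), profile, type = "h",
     xlab = "Position in RNA", ylab = "RTTS count (5' ends)")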

3.12 Comparing to Annotation Data

In some cases, it will be relevant to compare RTTS data to some kind of annotation to identify global trends. This can be done by summarizing the read counts around a set of locations. We have prepared an R script utilizing Bioconductor [20] for generating such a plot from the RTTS count wigs and an additional file containing either user-supplied genomic locations or RefSeq TSSs.

Fig. 6 Example of output produced with the described protocol. Mouse liver RNA was analyzed with the described protocol, including an optional CAGE selection to enrich for RTTS corresponding to transcription start sites. (a) Output of Subheading 3.10. The sequencing data were mapped to the genome, RTTS counted, converted to a wig file, and uploaded to the UCSC Genome Browser. The height of the bar at each genomic location corresponds to the number of read 5' ends mapping to this location. The minus strand is shown with negative values using a different scale. (b) Output of Subheading 3.11. Reads were mapped to a single sequence (Hmgcs2 mRNA) and the count of 5' ends at each location was plotted. Reads mapping to the positive strand are shown above 0, while those mapping to the negative strand are shown below zero. (c) Output of Subheading 3.12. The upper plot shows the sum of reads at each distance from annotated TSSs. The high peak at position 1 results from many reads mapped to known TSSs, while the high peak at position 13 results from an alternative TSS of the highly expressed albumin transcript

1. Prepare a file with a set of locations that will serve as reference points for counting read locations. The format of the file is three tab-delimited columns. The columns must have headers, named "seqnames" (chromosome name), "position," and "strand." An example is given in the script, and a minimal sketch of writing such a file follows this list. Positions must be 1-based (see Note 14).
2. Download the script from http://people.binf.ku.dk/~lukasz/plot_around_locations_from_wig.r and open it in a text editor. In the text editor, edit the input file names to match the two wig files prepared as described in Subheading 3.10 and the position file prepared above. Also edit the genome assembly name and the size of the window surrounding the given positions that is used for summarizing read counts.
3. Start R, set the proper working directory, and copy/paste the script into the R console. This will produce a barplot of the RTTS counts relative to the positions given as reference (Fig. 6c).
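As an example of the required format, the short R sketch below writes a small location file with the three mandatory tab-delimited columns and 1-based positions. The coordinates shown are invented placeholders and only illustrate the layout.

# Write a minimal location file for Subheading 3.12, step 1. Columns and headers
# are as required by the plotting script; the coordinates are placeholders.
locations <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  position = c(3214482, 4807823, 4481797),   # 1-based positions
  strand   = c("+", "-", "+"),
  stringsAsFactors = FALSE
)
write.table(locations, file = "my_locations.txt", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)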

4 Notes

1. The amount of starting material can be reduced if necessary. On the other hand, for samples that are to be used for CAGE selection a minimum of 5 μg of RNA is needed. The amount of reverse transcription primer should be scaled with the amount of RNA. The quality of the RNA starting material is very important, as degraded RNA will produce background in any type of experiment based on detection of RTTS. Moreover, random priming typically produces more background than gene-specific priming. In CAGE experiments the non-full-length cDNAs are removed in a selection step, thereby effectively reducing the background, but in other applications a negative control sample is required and can be used to normalize for reverse transcriptase pretermination.
2. The priming sequence used in RT_random_primer (..NNNNNNNNS-3') can be modified according to specific needs. In many cases, such as RNA structure probing, a gene-specific primer with the 5' overhang sequence can be used (5'-AGACGTGTGCTCTTCCGATCT-"gene-specific sequence").
3. If CAGE selection is to be performed, this step should be skipped.
4. Optional selection of full-length cDNA for CAGE analysis of TSS can be performed according to Subheadings 3.3–3.7 [without concentration] as described in [1] and results in a total volume of 34 μl of cap-selected RNA.
5. Be careful with low-level pooling of indexes, since proper sequencing requires that at each cycle there is at least one green laser read nucleotide (G or T) and one red laser read nucleotide (A or C). See more at http://www.epibio.com/pdftechlit/312pl1211.pdf.
6. Using a long denaturation time in the PCR reaction helps alleviate GC bias and fosters reproducibility between different thermal cyclers [21].
7. To simplify the procedure and reduce the risk of contaminating the laboratory space with the generated libraries, one can, instead of running an agarose gel, analyze and quantify the PCR products on a Bioanalyzer DNA 1000 chip without prior purification. This allows pooling the crude reactions in the right proportions (it is advisable to add EDTA to the reactions before pooling to avoid index switching) and performing only a single Ampure XP purification.
8. In cases where the prepared libraries have the same size distribution, it is possible to pool them based on NanoDrop-measured concentrations.
9. The output from sequencing is one or more FASTQ files containing the sequence reads and the corresponding quality scores. If several indexes were used for different experimental conditions, the FASTQ files from each index should be analyzed individually. If using the main Galaxy server and dealing with large datasets (>2 GB), it is an advantage to use the Galaxy FTP upload. A tutorial can be found here: http://screencast.g2.bx.psu.edu/quickie_17_ftp_upload/flow.html. The analysis described below can be carried out on a local instance of Galaxy or on the main Galaxy server (http://usegalaxy.org/). When using the main server, be sure to log in so that your analysis is saved. Alternatively, the analysis can be performed on a Unix/OSX machine in-house (see Subheading 3.6).
10. If the dataset consists of several FASTQ files, they can be merged into one file at this point with the "Concatenate datasets" tool found in the "Text Manipulation" tool category to facilitate the further analysis of the full dataset.
11. Other sequencing read mappers can be used instead of the Bowtie mapper. However, it is important not to use too stringent a cutoff for mapping, because a considerable fraction of reads contain untemplated sequence added by reverse transcriptase at the 5' end. The stringency of the mapping conditions should be considered individually for each experiment, taking into account the quality of the sequencing reads and the complexity of the sequences being mapped against. When mapping against short sequences, the coverage towards the 3' end can be improved by trimming the sequencing reads from their 3' end.

12. Reverse transcriptase will in some cases add extra untemplated nucleotides after terminating at the 5' end of the RNA. This is especially pronounced when the 5' end of the RNA is capped, which is the case for mRNAs. For the conditions described here, we find that the PrimeScript RT enzyme adds untemplated nucleotides in 81 % of cases for RTTS located closer than 50 nt to an annotated TSS (most of these presumably being capped), while the same is the case for 12 % of the RTTS located elsewhere. It is therefore necessary to trim reads that have one or more mismatches in the first three mapped positions, which is implemented in the published workflow. In cases where an untemplated nucleotide matches the genomic sequence, trimming is not possible.
13. R can be freely downloaded for any platform at http://cran.r-project.org/. The scripts are written for version 2.15.
14. At this step the user must ensure that the numbers provided as locations of interest are in the 1-based coordinate system. This system is used, e.g., in the UCSC Genome Browser display window. Be aware that tables downloaded from the UCSC Table Browser are provided in a 0-based system. To use TSS information from such a table in the provided script, one must add 1 to the starting positions (see the sketch following these notes). Read more on coordinate systems at http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms.
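As a concrete example of the conversion described in Note 14, the sketch below shifts 0-based start coordinates from a UCSC Table Browser export to the 1-based positions expected by the plotting script. The input file name and column names are assumptions based on a typical refGene-style export.

# Convert 0-based transcript start coordinates (UCSC Table Browser export) to
# 1-based positions (Note 14). File and column names are assumed placeholders.
tss_table <- read.table("refGene_tss.txt", sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)

locations <- data.frame(
  seqnames = tss_table$chrom,
  position = tss_table$txStart + 1,   # shift 0-based starts to 1-based positions
  strand   = tss_table$strand,
  stringsAsFactors = FALSE
)
# Note: for minus-strand transcripts the genomic TSS lies at the transcript end
# coordinate rather than at txStart; adjust separately if that matters here.
write.table(locations, file = "tss_locations.txt", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)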

Acknowledgments

The research was funded by the Danish Council for Strategic Research, the Lundbeck Foundation, and the Novo Nordisk Foundation. Morten Lindow and Susanna Obad, Santaris Pharma, provided mouse liver samples, and RIKEN/Piero Carninci provided the updated CAGE protocol as well as advice ahead of publication.

References

1. Takahashi H, Kato S, Murata M et al (2012) CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. In: Deplancke B, Gheldof N (eds) Gene regulatory networks, vol 786. Humana, Totowa, NJ, pp 181–200
2. Motorin Y, Muller S, Behm-Ansmant I et al (2007) Identification of modified residues in RNAs by reverse transcription-based methods. Methods Enzymol 425:21–53. doi:10.1016/s0076-6879(07)25002-5
3. Mortimer SA, Weeks KM (2009) Time-resolved RNA SHAPE chemistry: quantitative RNA structure analysis in one-second snapshots and at single-nucleotide resolution. Nat Protoc 4(10):1413–1421. doi:10.1038/nprot.2009.126
4. König J, Zarnack K, Rot G et al (2010) iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17(7):909–915. doi:10.1038/nsmb.1838
5. Shibata Y, Carninci P, Watahiki A et al (2001) Cloning full-length, cap-trapper-selected cDNAs by using the single-strand linker ligation method. Biotechniques 30(6):1250–1254
6. Li TW, Weeks KM (2006) Structure-independent and quantitative ligation of single-stranded DNA. Anal Biochem 349(2):242–246. doi:10.1016/j.ab.2005.11.002
7. Hirzmann J, Luo D, Hahnen J et al (1993) Determination of messenger RNA 5'-ends by reverse transcription of the cap structure. Nucleic Acids Res 21(15):3597–3598
8. Zhu YY, Machleder EM, Chenchik A et al (2001) Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques 30(4):892–897
9. Carninci P, Kasukawa T, Katayama S et al (2005) The transcriptional landscape of the mammalian genome. Science 309(5740):1559–1563. doi:10.1126/science.1112014
10. Shiraki T, Kondo S, Katayama S et al (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26):15776–15781. doi:10.1073/pnas.2136655100
11. Weeks KM, Mauger DM (2011) Exploring RNA structural codes with SHAPE chemistry. Acc Chem Res 44(12):1280–1291. doi:10.1021/ar200051h
12. Lucks JB, Mortimer SA, Trapnell C et al (2011) Multiplexed RNA structure characterization with selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc Natl Acad Sci U S A 108(27):11063–11068. doi:10.1073/pnas.1106501108
13. Giardine B, Riemer C, Hardison RC et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10):1451–1455. doi:10.1101/gr.4086505
14. Goecks J, Nekrutenko A, Taylor J et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86. doi:10.1186/gb-2010-11-8-r86
15. Blankenberg D, Gordon A, Von Kuster G et al (2010) Manipulation of FASTQ data with Galaxy. Bioinformatics 26(14):1783–1785. doi:10.1093/bioinformatics/btq281
16. Blankenberg D, Von Kuster G, Coraor N et al (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19.10.11–21
17. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. doi:10.1186/gb-2009-10-3-r25
18. Hannon Lab, Gordon A (2010) FASTX-Toolkit: FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/
19. R Foundation for Statistical Computing (2012) R: a language and environment for statistical computing, version 2.15.1. R Foundation for Statistical Computing, Vienna, Austria
20. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
21. Aird D, Ross MG, Chen WS et al (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12(2):R18. doi:10.1186/gb-2011-12-2-r18

INDEX A Adapter .......................40, 41, 45, 56, 57, 161–162, 166, 213–221, 223 Adenosine to inosine (A-to-I) editing ......................... 159 Alignment tools...............................................87, 93, 115, 118, 175 Allele frequency .....................................15, 140, 149, 150 Alternative splicing.........................................16, 171–179 Arithmetic coding ........................................27, 30–33, 35 Assembly ..............................3, 4, 6, 8–12, 54, 81–89, 93, 97, 103, 126, 204, 225, 226, 228 Assembly–combining short patches ............................... 81

analysis ............................................................ 137–154 sequencing ...................................................... 93–110, 137–154

F Factor binding site ............................................... 181–190 Fragment ....................1–5, 7, 15, 40–42, 61, 64, 66, 83, 99, 105, 106, 144, 145, 181, 193, 194, 196, 205, 206, 208, 213

G

Bioconductor package .................................................. 194 Burrows–Wheeler aligner (BWA) ..................... 12, 40–43, 53, 56, 57, 94, 95, 102–106, 113, 118, 128, 130

Galaxy ................................ 165, 166, 208, 222–226, 229 Gene annotation .............................. 15, 140, 148, 152–153 regulation................................................................. 171 Genetic disorders .......................................................... 137 Genome resequencing encoding (GReEn) .............27–36

C

H

Cap analysis of gene expression (CAGE) .............. 40, 42, 43, 46, 56, 214, 227, 228 Causative mutations................... 138–140, 143, 151, 154 cDNA......................................................... 5, 42, 213–230 ChIP-seq.......................4, 9, 16, 40, 42, 43, 54, 56, 103, 108, 181–190, 193–211 Chromatin immunoprecipitation ........................ 181, 193 Copy number variations (CNV) ............... 62–63, 68, 71, 76, 78, 139, 147, 186 Coverage depth .................................................. 62–66, 73

High-throughput sequencing (HTS) ................ 1–24, 48, 50, 61, 62, 78, 115, 154

B

D

I Illumina.........................1, 10, 41, 44, 49, 52, 56, 64, 93, 96, 98, 100–102, 104–106, 115, 121, 122, 125, 126, 128, 144, 161, 166, 167, 171, 193, 196, 214, 215, 217, 219–222, 224 Initial bioinformatics analysis ........................................... 1

L lobSTR................................................. 114–124, 126–133

Data compression......................................................27–36 De Bruijn graph ................................................. 84, 86–88 De novo assembly ................................... 3, 4, 6, 8–12, 82 Differential expression .................... 7, 16, 18, 62, 68, 70, 71, 73, 78 Disease variants .................................................... 137–154 Double-stranded RNA (dsRNA) ................................. 159

E Epigenomic profiles ...................................................... 193 Exome.........................4, 17, 96–99, 101, 102, 104, 106, 139–141, 143–145, 147, 151

M Mapping algorithms.........................................40, 43, 146 MATS. See Multivariate analysis of transcript splicing (MATS) Microarray ..................................................................... 182 MicroRNA (miRNA) ................5, 42, 46, 138, 141, 145, 148, 159–169 Microsatellites....................................................... 113, 128 Mixture regression model........................... 194, 196, 202 Multivariate analysis of transcript splicing (MATS) ..................................................... 171–179




N Next-generation sequencing (NGS) .......................81–83, 86–88, 93, 105, 154, 171, 193, 222, 224

O Open source software ..................................................... 84 Overlap-layout-consensus (OLC) .................................. 85

P Parallel sequencing technologies........138, 193, 213–230 PCR amplification .................. 2, 14, 113, 133, 146, 197, 220–221 Peak detection ............... 16–17, 181–183, 185–187, 189 Peak finding................................................. 182, 189, 207 Phenotype........................... 71, 137, 139–141, 147, 148, 150, 152–154 Predicted severity .......................................................... 140 Probabilistic models .........................................40, 53, 128 Protein-RNA crosslinks ......................................... 42, 216

RNA editing sites............................................... 40, 159–169 modifications .................................................. 160, 216 RNA sequencing (RNA-Seq) .................. 5–7, 15, 40, 42, 56, 62, 63, 67, 94, 108, 171–179, 215, 224

S Sequencing machine ..................... 40, 50, 53, 61, 68, 69, 81, 93, 108, 161 Short tandem repeats (STRs) .............................. 113–134 shRNA ......................................................... 62, 67, 68, 71 shRNA-seq ...................................................................... 67 SNP calling .....................................................62, 103, 108 Statistical modeling of coverage ...............................61–79

T Transcription factors ..................... 16, 42, 181–190, 193, 195, 208 Transcriptome .......................................15, 160, 172, 179 TS2l26 RNA ligase .............................................. 214, 223

Q

V

Quality scores ......................... 10, 11, 19, 20, 22, 40, 41, 48–51, 53, 54, 97, 100, 101, 105–108, 119–121, 128, 130–133, 145–147, 161, 163, 167, 222, 229

Variant annotation .......................................142, 147–150 Variant calling..........................14–15, 23, 100, 105, 108, 145–147 Variant detection ............................................15, 137–154 Variant prioritization................................... 140, 147, 148

R Rapid amplification of cDNA 50 ends (50 RACE)....... 213 Repetitive region .................................................... 7, 8, 82 Resequencing .........................3, 4, 9, 27–36, 54, 93, 102 Reverse transcriptase termination sites ............... 213–230

W Whole genome sequencing .........................138–139, 147