Detection of Reverse Transcriptase Termination Sites Using cDNA ...

Chapter 13

Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing

Lukasz J. Kielpinski, Mette Boyd, Albin Sandelin, and Jeppe Vinther

Abstract

Detection of reverse transcriptase termination sites is important in many different applications, such as structural probing of RNAs, rapid amplification of cDNA 5′ ends (5′ RACE), cap analysis of gene expression, and detection of RNA modifications and protein–RNA cross-links. The throughput of these methods can be increased by applying massive parallel sequencing technologies. Here, we describe a versatile method for detection of reverse transcriptase termination sites based on ligation of an adapter to the 3′ end of cDNA with bacteriophage TS2126 RNA ligase (CircLigase™). In the following PCR amplification, Illumina adapters and index sequences are introduced, thereby allowing amplicons to be pooled and sequenced on the standard Illumina platform for genomic DNA sequencing. Moreover, we demonstrate how to map sequencing reads and perform analysis of the sequencing data with freely available tools that do not require formal bioinformatics training. As an example, we apply the method to detection of transcription start sites in mouse liver cells.

Key words Reverse transcription, Termination, Sequencing, TS2126 RNA ligase, CAGE, Galaxy

1 Introduction

Detection of reverse transcriptase termination sites (RTTS) is a general strategy that can be used to detect different features of RNA, such as their ends [1], modifications [2], structure [3], and binding of proteins [4]. Historically, RTTS have been monitored by fragment analysis using radioactive or fluorescent labelling of the primer used for the reverse transcription and detection with denaturing gel or capillary electrophoresis, respectively. Alternatively, RTTS can be detected by ligating an adapter to the 3′ end of the terminated cDNA, cloning, and sequencing. While fragment analysis has been very successfully used to investigate many different RNA features, the decreasing cost of sequencing makes it increasingly advantageous to use sequencing for detection of RTTS. It is therefore likely that existing RTTS-based methods will be adapted for sequencing and that new methods will be developed.

Noam Shomron (ed.), Deep Sequencing Data Analysis, Methods in Molecular Biology, vol. 1038, DOI 10.1007/978-1-62703-514-9_13, © Springer Science+Business Media New York 2013


Lukasz J. Kielpinski et al.

The key step in the detection of RTTS by sequencing is to attach sequencing adapter sequences to the ends of the cDNA. Typically, the 5′ adapter sequence is included as an overhang in the gene-specific or random primer used for the first-strand reaction. The next step is the ligation of an adapter to the 3′ end of the terminated cDNA, and several methods for doing this have been developed. In single-strand linker ligation a double-stranded adapter with a 3′ overhang is ligated to the free 3′ end of the RTTS cDNA using T4 DNA ligase [5]. Alternatively, a single-stranded adapter can be used for ligation with the thermostable TS2126 RNA ligase (CircLigase) [6]. The efficiency of both of these enzymes is somewhat biased by the sequence at the very 3′ end of the cDNA that has to be ligated (results not shown), but these biases are reproducible and are therefore not an issue if an appropriate control is used for normalization. Another issue is the ability of reverse transcriptase to add 1–3 untemplated nucleotides to the 3′ end of cDNAs. This occurs more efficiently at capped 5′ ends compared to 5′ ends ending in OH (typical for degraded RNA) [7] and has to be taken into account when sequences are mapped to the RNA being investigated. The added nucleotides allow the reverse transcriptase to perform template switching, which can be exploited to add an adapter sequence to the 3′ end of cDNAs [8].

Some RTTS methods have successfully been adapted to massive parallel sequencing. Cap analysis of gene expression (CAGE) has been successfully used to identify transcription start sites (TSS) [9]. Originally the CAGE method was based on concatenation of CAGE tags and Sanger sequencing [10], but it has recently been adapted to massive parallel sequencing [1]. Another example is SHAPE-based probing of RNA structure, which has been widely and successfully used for investigating the structure of single RNAs using capillary electrophoresis [11]. A recent result demonstrating that populations of RNA molecules can be SHAPE probed in parallel using sequencing fuels hope that the throughput of structure probing can be increased [12]. These successful implementations of sequencing for RTTS detection suggest that RTTS methods generally can be adapted to the new sequencing technologies.

Here, we describe a general method for detecting RTTS based on the Illumina paired-end genomic DNA adapters, sequencing primer, and indexing reads. Samples can therefore be multiplexed with other samples containing the standard Illumina adapters and used for both single- and paired-end sequencing. The method can easily be adapted to detect RTTS produced by any experimental protocol. In addition, we demonstrate in detail how to go from the raw sequencing reads to counts of RTTS mapped to the RNA being investigated and how to compare with the existing annotation and visualize the results in the UCSC genome browser. An overview of the entire protocol is shown in Fig. 1.

Reverse Transcriptase Termination Site (RTTS) Mapping


Fig. 1 Schematic outline of the analysis. The starting material consists of RNA molecules containing a feature of interest, which can cause reverse transcriptase termination. The RNA is reverse transcribed with a primer containing a 5′ adapter overhang. After cDNA purification, a second adapter is ligated to the 3′ ends of the obtained cDNA. Molecules containing both adapters serve as templates for a PCR, which adds all necessary elements for Illumina sequencing. After library sequencing, the resulting sequencing reads are mapped to sequences of interest (this could be the full genome or selected RNA sequences) and the locations of the reads’ 5′ ends (corresponding to the feature of interest) are counted. The resulting RTTS count file can be used for further analysis, such as visualization in the UCSC genome browser, producing RTTS plots for specific RNA molecules, and comparing with the existing annotation

2 Materials

2.1 RNA Sample

1. Material to be analyzed: The RNA should be treated in such a way that reverse transcription will terminate at the sites of interest. These could be RNA strand breaks, RNA modifications, RNA 5′ ends, or protein–RNA cross-links, among others.

2.2 Oligonucleotides

1. Oligonucleotide sequences are listed in Table 1. RT_random_primer and LIGATION_ADAPTER were HPLC purified, and the remaining oligonucleotides were PAGE purified.

2.3 Reverse Transcription and Purifications

1. PrimeScript™ Reverse Transcriptase including PrimeScript™ 5× buffer (Takara).
2. 10 mM dNTPs.
3. Sorbitol–trehalose mix (1.67 M sorbitol, 0.33 M trehalose).
4. Agencourt® AMPure® XP–PCR Purification (Beckman Coulter).
5. Agencourt® RNAClean® XP (Beckman Coulter).
6. 70 % EtOH.
7. 5 mM Na-citrate pH 6.
8. 10 mM Tris–HCl pH 8.3.
9. RNase H (New England Biolabs).

2.4 Linker Ligation

1. CircLigase (Epicentre).
2. 1 mM ATP (Epicentre).
3. CircLigase buffer (Epicentre).
4. 50 mM MnCl2 (Epicentre).
5. 50 % PEG 6000 (filter sterilized).
6. 5 M glycine betaine (filter sterilized).

2.5 PCR

1. Phusion® High-Fidelity DNA Polymerase (NEB).
2. 5× HF Phusion buffer (NEB).
3. 10 mM dNTPs.
4. H2O (PCR grade).

2.6 Quality Control

1. Agarose electrophoresis.
2. 1× TBE buffer.
3. Agarose.
4. 6× DNA loading buffer (Fermentas).
5. DNA size standard with 150 bp band (e.g., Ultra Low Range DNA ladder, Fermentas).

Table 1 Oligonucleotides used in this study

Name                          Primer sequence
RT_random_primer              AGACGTGTGCTCTTCCGATCTNNNNNNNNS
LIGATION_ADAPTER              5′ phosphate-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′ 3NHC3
PCR_forward                   AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT
PCR_REVERSE_INDEX.1_ATCACG    CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.2_CGATGT    CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.3_TTAGGC    CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.4_TGACCA    CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.5_ACAGTG    CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.6_GCCAAT    CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.7_CAGATC    CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.8_ACTTGA    CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.9_GATCAG    CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.10_TAGCTT   CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.11_GGCTAC   CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.12_CTTGTA   CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.13_AGTCAA   CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.14_AGTTCC   CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.15_ATGTCA   CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.16_CCGTCC   CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

All sequences are 5′–3′. The oligonucleotide sequences of the Illumina genomic DNA adapters are copyrighted by Illumina, Inc. 2006. All rights reserved. LIGATION_ADAPTER is a linker with a clonable 5′ end and an amino-blocked 3′ end. The index sequence given in each primer name is present in the primer as its reverse complement.


6. Stain G (Serva).
7. Agilent DNA 1000 Kit (Agilent Technologies).

2.7 Equipment

1. Tubes.
   (a) RT, purifications, ligation: 0.5 ml PCR tube (BRAND, 781310).
   (b) PCR: 0.2 ml 8-strip tubes (Alpha Laboratories, LW2500).
2. Thermocyclers.
   (a) RT, ligation: MJ Research, PTC-200 for 0.5 ml tubes.
   (b) PCR: BIORAD S1000.
3. Magnetic stand.
4. NanoDrop 1000.
5. Agilent 2100 Bioanalyzer.

3 Methods

3.1 Reverse Transcription (Modified from ref. 1)

1. Mix 1 μg of RNA starting material (could be in vitro-transcribed RNA or purified RNA) with 100 pmol of RT_random_primer in a 7.5 μl volume (the optimal amount of primer can vary with the specific application). Heat denature for 5 min at 65 °C, and put on ice (see Note 1).
2. Prepare master mix. For one reaction take 7.5 μl 5× PrimeScript Buffer, 1.87 μl 10 mM dNTP, 7.5 μl sorbitol–trehalose mix (which is half of the concentration used in [1]), 9.38 μl H2O, and 3.75 μl PrimeScript enzyme. Add 30 μl master mix to the RNA–primer mix, and mix by pipetting (see Note 2).
3. Incubate as follows: 25 °C, 10 min (skip this incubation if a gene-specific primer is used); 42 °C, 30 min; 50 °C, 10 min; 56 °C, 10 min; 60 °C, 10 min; and hold at 4 °C. The result of the reverse transcription is a cDNA carrying a 5′ adapter and terminating at the feature of interest (Fig. 2a).
4. Inactivate the reverse transcriptase by incubating the sample for 15 min at 70 °C, then place on ice, and add 1 μl RNase H enzyme (New England Biolabs, 5,000 U/ml). Incubate for 20 min at 37 °C to degrade the RNA (see Note 3).
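When several samples are processed in parallel, the per-reaction volumes of step 2 have to be multiplied up. The following is a minimal sketch of that arithmetic in Python; the volumes are those given in step 2, whereas the function name and the 10 % pipetting excess are illustrative assumptions and not part of the protocol.

```python
# Per-reaction volumes (in microliters) copied from step 2 of Subheading 3.1.
RT_MASTER_MIX_UL = {
    "5x PrimeScript buffer": 7.5,
    "10 mM dNTP": 1.87,
    "sorbitol-trehalose mix": 7.5,
    "H2O": 9.38,
    "PrimeScript enzyme": 3.75,
}


def scale_master_mix(per_reaction_ul, n_reactions, excess=0.10):
    """Scale per-reaction volumes to n_reactions plus a pipetting excess.

    The default 10 % excess is an assumption made for this sketch;
    the protocol itself does not specify one.
    """
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    factor = n_reactions * (1.0 + excess)
    return {name: round(vol * factor, 2) for name, vol in per_reaction_ul.items()}


if __name__ == "__main__":
    # Example: master mix for eight reactions.
    for name, volume in scale_master_mix(RT_MASTER_MIX_UL, n_reactions=8).items():
        print(f"{name}: {volume} ul")
```

The per-reaction volumes add up to the 30 μl that step 2 adds to each RNA–primer mix, which is a quick consistency check before scaling.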

3.2 cDNA Purification (Modified from ref. 1)

1. Add 67.5 μl RNAClean XP beads (room temperature, well mixed) to the reactions, and mix by pipetting. Incubate at room temperature for 30 min, vortexing every 10 min.
2. Put on magnetic stand for 5 min, and aspirate the cleared solution.
3. Wash twice with 70 % ethanol (the volume depends on the tubes used; for 500 μl tubes use 400 μl ethanol).
4. Add 40 μl 5 mM Na-citrate (pH 6) preheated to 37 °C, and mix extensively by pipetting. Incubate for 10 min at 37 °C.


Fig. 2 Outline of library generation. (a) The first steps in library generation are reverse transcription and ligation of an adapter to the 3′ end of the cDNA, which corresponds to the location of the feature of interest. (b) In the subsequent PCR, Illumina adapter sequences are added to produce a double-stranded DNA library that is ready for sequencing on the Illumina genomic DNA platform

5. Place on magnetic stand, and transfer the eluate to a new tube (see Note 4).

3.3 cDNA Ligation

1. Prepare master mix. For one reaction take 1 μl of CircLigase buffer (Epicentre), 0.5 μl of 1 mM ATP, 0.5 μl 50 mM MnCl2, 2 μl of 50 % PEG 6000, 2 μl of 5 M betaine, 0.5 μl 100 μM LIGATION_ADAPTER, and 0.5 μl CircLigase enzyme. Mix well.
2. Split master mix into 7 μl aliquots, and add 3 μl of cDNA.
3. Incubate as follows: 60 °C, 2 h; 68 °C, 1 h; 80 °C, 10 min; and hold at 4 °C.
4. Add 10 μl H2O to increase the volume.
5. Purify as in Subheading 3.2, but using Ampure XP beads (20 μl ligation reaction + 36 μl Ampure beads). Elute in 16 μl H2O. The result of the cDNA ligation step is a single-stranded cDNA containing adapters at both the 5′ and 3′ ends, which can be used for the subsequent PCR reaction (Fig. 2b).

3.4 PCR Amplification of Library

1. Prepare master mix. For one reaction take 3 μl PCR_forward 10 μM primer, 10 μl 5× Phusion HF buffer, 1 μl 10 mM dNTPs, 27.5 μl H2O, and 1 μl Phusion DNA polymerase. Mix well.
2. Split master mix into 42.5 μl aliquots, and add 2.5 μl of indexing primer (PCR_REVERSE_INDEX.##_NNNNNN) (see Note 5) and 5 μl purified linker-ligated cDNA. Start the PCR program as follows: 98 °C, 3 min; (98 °C, 80 s; 64 °C, 15 s; 72 °C, 30 s) × 4; (98 °C, 80 s; 72 °C, 45 s) × 15; 72 °C, 5 min; and hold at 4 °C (see Note 6).
3. Agarose electrophoresis (see Note 7). Prepare a 2 % agarose gel with a DNA stain (e.g., Stain G). Apply 5 μl of each sample (add loading dye) and the size standard, and run at 4 V/cm until the bromophenol blue from the loading dye has travelled approximately 2.5 cm. Visualize under UV light. You should see smears of products longer than 200 bp. Amplified PCR product shorter than 150 bp is typically caused by low amounts of starting material combined with small amounts of leftover reverse transcription primer in the ligation reaction. It results from amplification of directly ligated RT primer–LIGATION_ADAPTER molecules, which can be sequenced on the Illumina platform but are uninformative (see Fig. 3a). To get rid of the short PCR product, redo the library with more starting material or, alternatively, perform agarose gel purification to remove the short PCR product. If no amplified library (smear) is detected at this step, perform small-scale PCRs with different numbers of cycles and analyze the reactions by agarose electrophoresis. Then repeat the PCR with the lowest number of cycles that allows detection of the library on the gel. The optimal number of cycles depends on the amount of starting material.


Fig. 3 Expected result from PCR amplification. (a) PCR products are first checked by agarose electrophoresis. A successfully prepared library should form a smear of molecules longer than 150 bp (lane 2). Presence of a band shorter than 150 bp (lane 1) indicates problems with library preparation (see step 3 of Subheading 3.4). (b) The library is purified and checked for size distribution on an Agilent Bioanalyzer DNA 1000 chip. A successfully prepared library should have dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp

3.5 Purification and Quantification of Library (See Note 7)

1. Ampure XP purification: as in Subheading 3.2, but use Ampure XP beads and add 72 μl beads to 40 μl PCR reaction. Elute in 20 μl preheated 10 mM Tris–HCl pH 8.3.
2. Measure the concentration on the NanoDrop (as dsDNA) and run a Bioanalyzer DNA 1000 chip. Perform smear analysis (side panel -> Global -> Advanced -> Smear analysis -> regions) with range 140–600 bp and use the molarity as a guideline for your sequencing order. The library should contain dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp (Fig. 3b) (see Note 8).
3. Samples can now be sequenced using standard Illumina genomic DNA sequencing and can be multiplexed with other samples made with the same adapters (genomic DNA) as long as they utilize different indexes (see Note 5).
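The Bioanalyzer smear analysis reports molarity directly, but it can be useful to cross-check it against the NanoDrop concentration when planning the pooling. The sketch below shows the standard conversion, assuming an average mass of about 660 g/mol per base pair of dsDNA; the function names, the example values, and the target amount per library are hypothetical and only serve as illustration.

```python
def dsdna_molarity_nM(conc_ng_per_ul, mean_length_bp, mass_per_bp=660.0):
    """Convert a dsDNA mass concentration (ng/ul) to molarity (nM).

    Uses the common approximation of ~660 g/mol per base pair.
    """
    if mean_length_bp <= 0:
        raise ValueError("mean_length_bp must be positive")
    return conc_ng_per_ul * 1e6 / (mass_per_bp * mean_length_bp)


def pooling_volumes_ul(molarities_nM, fmol_per_library):
    """Volume (ul) of each library that contains fmol_per_library.

    Relies on the identity 1 nM = 1 fmol/ul.
    """
    return {name: fmol_per_library / m for name, m in molarities_nM.items()}


if __name__ == "__main__":
    # Hypothetical example: two libraries with different concentrations.
    libraries = {
        "index_1": dsdna_molarity_nM(10.0, 300),
        "index_2": dsdna_molarity_nM(25.0, 350),
    }
    for name, vol in pooling_volumes_ul(libraries, fmol_per_library=100).items():
        print(f"{name}: {libraries[name]:.1f} nM, pool {vol:.2f} ul")
```

Because the conversion depends on the mean fragment length, pooling by NanoDrop concentration alone is only equivalent when the libraries have the same size distribution, as stated in Note 8.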

3.6 Data Analysis Using Linux Command Line

1. Data analysis of massive parallel sequencing experiments can be a challenge for scientists without formal training in bioinformatics. Below we demonstrate in detail how to go from the sequencing output (FASTQ file) to an RTTS count file without assuming prior knowledge of bioinformatics, using tools available in Galaxy [13–16], including the Bowtie mapper for sequencing reads [17] and the FASTX toolkit [18]. However, using a Unix or an OSX machine with a command line interface is recommended for large projects. For those users, the analysis implemented in Subheadings 3.7–3.9 can be carried out using Bowtie and an awk script available at this URL: http://people.binf.ku.dk/~lukasz/SAM2counts.awk.

3.7 Quality Check of Sequencing Reads

1. Log in to Galaxy (http://usegalaxy.org/) and create a new Galaxy history. Upload the relevant FASTQ files to Galaxy with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTQ file on your computer and click execute (see Note 9).
2. Check the integrity of the FASTQ files with the “FASTQ Groomer” tool found in the “NGS: QC and manipulation” tool category. For newer FASTQ files (Illumina 1.8 and later) the quality is encoded in Sanger format. Choose the Galaxy history item containing the FASTQ file, set “Input FASTQ quality scores type:” to Sanger, and click execute (see Note 10).
3. Compute FASTQ quality statistics with the “Compute quality statistics” tool found in the “NGS: QC and manipulation” tool category. Choose the groomed FASTQ file and click execute.
4. Plot the distributions of quality scores for the different sequencing cycles using the “Draw quality score boxplot” tool found in the “NGS: QC and manipulation” tool category. Choose the Galaxy history item containing the output of the “Compute quality statistics” tool and click execute. Look at the resulting boxplot by clicking on the eye icon next to the “Draw quality score boxplot” history item (see Fig. 4a). For most experiments, where the median quality is not very low (falling below 25), it is unnecessary to filter the reads on quality. If the quality is very low, it may be an advantage to filter out low-quality reads using the “Filter by quality” tool found in the “NGS: QC and manipulation” tool category. Set the “Quality cut-off value” option to 20 and the “Percent of bases in sequence that must have quality equal to/higher than cut-off value” option to 90 and click execute.
5. Plot nucleotide distributions of the different sequencing cycles using the “Draw nucleotides distribution chart” tool found in the “NGS: QC and manipulation” tool category. Look at the resulting plot by clicking on the eye icon next to the “Draw nucleotides distribution chart” history item (see Fig. 4b).


Fig. 4 Expected quality plots of sequencing reads. (a) Example of quality boxplot produced by Galaxy. The plot shows the median read quality in the different sequencing cycles. (b) Example of nucleotide distribution plot produced by Galaxy. The plot shows the percentage of the nucleotides in the different sequencing cycles. Deviation from uniform distribution in the first cycle reflects a combination of bias for specific nucleotides in the terminal transferase activity of reverse transcriptase, bias in the TS2126 RNA ligase reaction, and in some cases biased sequences of the genomic locations being mapped by the RTTS

The nucleotide distributions are typically similar across the sequencing cycles, but if this is not the case, the library may not have sufficient complexity or be contaminated with adapter–adapter ligation products.
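The quality filter described in step 4 (cut-off 20, at least 90 % of bases at or above the cut-off) can be illustrated outside Galaxy with a short Python sketch. It is not the code of the Galaxy or FASTX tool; it only mimics the rule as described above and assumes Sanger-encoded qualities (ASCII offset 33) and unwrapped four-line FASTQ records. All function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a Sanger-encoded FASTQ quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_quality_filter(quality_string, cutoff=20, min_percent=90.0):
    """True if at least min_percent of the bases have quality >= cutoff."""
    scores = phred_scores(quality_string)
    if not scores:
        return False
    good = sum(1 for q in scores if q >= cutoff)
    return 100.0 * good / len(scores) >= min_percent


def filter_fastq(lines, cutoff=20, min_percent=90.0):
    """Yield (header, sequence, separator, quality) tuples that pass.

    Assumes each record occupies exactly four lines.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if passes_quality_filter(record[3], cutoff, min_percent):
                yield tuple(record)
            record = []


if __name__ == "__main__":
    example = ["@read1", "ACGTACGTAC", "+", "IIIIIIIII#"]
    for rec in filter_fastq(example):
        print(rec[0], "passes")
```

In this encoding the character "I" corresponds to a Phred score of 40 and "#" to a score of 2, so a ten-base read with a single "#" sits exactly at the 90 % threshold and is kept.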


3.8 Mapping Reads with Bowtie

1. Depending on the nature of your experiment you can map your reads either to the entire genome relevant for the experiment or to one or more RNA sequences. The genomes of the most commonly investigated species are pre-installed in Galaxy, whereas mapping to one or more specific RNAs requires that the sequence(s) is uploaded to Galaxy as a FASTA file. If necessary, upload a FASTA file with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTA file on your computer and click execute.
2. To map the reads, use the “Map with Bowtie for Illumina” tool found in the “NGS: Mapping” tool category. If mapping to a genome that is pre-indexed in Galaxy, choose “Use a built-in index” and the relevant genome. Otherwise choose “Use one from history” and select the history item containing the uploaded FASTA file. Next, select the history item containing the groomed (and filtered) FASTQ file under the “FASTQ file” option and choose “Full parameter list” in the “Bowtie settings to use” drop-down menu. Then change “Maximum number of mismatches permitted in the seed (-n)” to 3 and “Maximum permitted total of quality values at mismatched read positions (-e)” to 300 and choose “Use best” in the “Whether or not to make Bowtie guarantee that reported singleton alignments are ‘best’ in terms of stratum and in terms of the quality values at the mismatched positions (--best)” drop-down menu. Finally, map the reads by clicking Execute. The mapping may take a while depending on the size of the FASTQ file and the sequence to be mapped against (see Note 11).
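For readers following the command-line route mentioned in Subheading 3.6, the Galaxy settings of step 2 correspond to the Bowtie options -n, -e, and --best, with -S requesting SAM output. The sketch below only assembles such a command line; the index prefix and file names are hypothetical, and depending on the Bowtie version the index may have to be passed with -x instead of as the first positional argument, so the installed version's help text should be consulted before running it.

```python
import shlex


def bowtie_command(index_prefix, fastq_path, sam_path):
    """Build a Bowtie (version 1) command mirroring the Galaxy settings.

    -n 3    : up to 3 mismatches in the seed
    -e 300  : maximum summed quality at mismatched positions
    --best  : report the best alignment in terms of stratum and quality
    -S      : write alignments in SAM format
    """
    return [
        "bowtie",
        "-n", "3",
        "-e", "300",
        "--best",
        "-S",
        index_prefix,
        fastq_path,
        sam_path,
    ]


if __name__ == "__main__":
    # Hypothetical index prefix and file names, for illustration only.
    cmd = bowtie_command("my_index", "reads.fastq", "mapped.sam")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.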

3.9 Preparing an RTTS Count File from SAM File

1. It is necessary to trim mapped reads that contain untemplated nucleotides added by reverse transcriptase (see Note 12). This trimming requires many Galaxy operations and we have therefore created a Galaxy workflow to perform this operation and subsequently count and sum RTTS. In this procedure reads are trimmed if they contain mismatches in the first three positions (Fig. 5). To download the workflow go to https://main.g2.bx.psu.edu/workflow/list_published and search for RTTS Mapper. Click on the workflow and import it into your own Galaxy account by clicking on “Import workflow” in the upper right corner. Alternatively, if a local instance of Galaxy is used, the RTTS Mapper workflow can be imported into Galaxy by clicking on “Workflow” on the top Galaxy bar and then on the “Upload and import workflow” button in the upper right corner. At this URL, https://main.g2.bx.psu.edu/workflow/import_workflow, the workflow can be imported by providing the URL http://people.binf.ku.dk/~lukasz/Galaxy_RTTS_mapper.ga as the “Galaxy workflow URL” and clicking import.


Fig. 5 Schematic representation of the trimming performed by the RTTS mapper. After mapping the reads to the genome, the three 5′ terminal nucleotides of mapped reads (corresponding to 3′ ends of cDNA molecules) are evaluated for mismatches to the reference sequence and trimmed if necessary. The four possible scenarios are the following: full match (a), mismatch at the terminal position (b), position one (c), or position two (d) before the terminal position, in which cases we trim 0, 1, 2, or 3 positions, respectively (returned positions are indicated by the triangles). Red boxes: mismatched positions; white boxes: matched positions

Also import a control file to your history by pasting this URL, http://people.binf.ku.dk/~lukasz/RTTS_control.interval, in the URL/Text window in the “Upload File from your computer” tool found in the “Get Data” tool category.
2. To prepare RTTS count files, use the RTTS Mapper workflow imported above. Click on “Workflow” on the top Galaxy bar and then on the “RTTS mapper” workflow and choose “Run.” Select the history item containing the SAM file from the Bowtie mapping for the “Select dataset to convert” option and the RTTS_control.interval file for the “Select control file” option and click “Run workflow” at the bottom of the page.
3. The resulting RTTS count files (for counts on the plus and minus strand, respectively) can be used for further analysis in R, Excel, or another data analysis program. The exact analysis will depend on the nature of the experiment performed. Below we provide tools for some common types of analysis using the freely available tool R, which can easily be installed on any computer platform [19] (see Note 13).

3.10 Preparing Wig File and Visualizing in the UCSC Genome Browser

If the RTTS experiments have been mapped to a genome assembly, it will often be advantageous to visualize the results on the UCSC Genome Browser and compare with the many kinds of data available as tracks. To do this it is necessary to convert the RTTS file to the UCSC wig format.
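To make the conversion concrete, the sketch below writes counts in the UCSC variableStep wig layout using Python. It is not the provided R script, and the input layout (tuples of chromosome, 1-based position, and count) is an assumption made for illustration; the actual column layout of the RTTS count files should be checked before adapting it. Minus-strand counts can be written as negative values, as in Fig. 6a.

```python
from collections import defaultdict


def counts_to_wig(rows, track_name, negate=False):
    """Format (chrom, position, count) rows as variableStep wig text.

    Positions are taken to be 1-based, as in the wig format.
    Counts at repeated positions are summed. With negate=True the
    values are written as negative numbers (e.g., for the minus strand).
    """
    by_chrom = defaultdict(lambda: defaultdict(int))
    for chrom, pos, count in rows:
        by_chrom[chrom][int(pos)] += int(count)

    lines = [f'track type=wiggle_0 name="{track_name}"']
    for chrom in sorted(by_chrom):
        lines.append(f"variableStep chrom={chrom}")
        for pos, count in sorted(by_chrom[chrom].items()):
            value = -count if negate else count
            lines.append(f"{pos} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    plus_rows = [("chr1", 100, 5), ("chr1", 50, 2), ("chr2", 7, 1)]
    print(counts_to_wig(plus_rows, track_name="RTTS plus"), end="")
```

Sorting by chromosome and position keeps each variableStep block in ascending order, which is what the genome browser expects for this format.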


1. Download the RTTS count files to your local computer from the Galaxy server by clicking on the floppy disc icon for the relevant history items.
2. The RTTS count file can be converted to wig format by copy/pasting a small program (script) into R. Download the provided script from http://people.binf.ku.dk/~lukasz/wig_generator.R. Open the file in a text editor and modify it by changing the assignment of the variables “input_filename_plus” and “input_filename_min” to the names of the files produced by the Galaxy workflow.
3. Start R, and change the working directory to the one containing the RTTS count files by writing “setwd (‘path of file directory’)” in the console window and pressing enter, or by using the “Change dir” command found in the File menu. Then paste the edited script into the R console and hit enter. This will produce two new files named OUTPUTp.wig and OUTPUTm.wig in the same folder.
4. Go to the UCSC genome browser (http://www.genome.ucsc.edu/cgi-bin/hgGateway) and choose the species and assembly that were used for the mapping of the RTTS experiment in the drop-down menu. Then click “manage custom tracks,” browse the local drive for the wig files, and submit them one by one (after adding the first one press “add custom tracks”). Finally, press “go to the genome browser” to view the RTTS counts visualized as a histogram at each genomic position (Fig. 6a).

3.11 Making Plots for Single RNAs

If the RTTS data was mapped to single RNAs (using a provided FASTA file) rather than the full genome, it will often be relevant to visualize the RTTS counts across each of the different RNAs.

1. Create a new folder and download to it the FASTA file that was used for mapping and the two RTTS count files from the Galaxy history by clicking on the floppy disc icon for the relevant history items. Change the file names of the RTTS count files to counts_plus.txt and counts_minus.txt.
2. Then download this R script, http://people.binf.ku.dk/~lukasz/few_genes_histogram.r, and open it in a text editor. Start R and set the working directory (as described in step 3 of Subheading 3.10) to the folder containing the FASTA file and the RTTS count files. To generate RNA-specific RTTS plots for each RNA present in the FASTA file that has at least one read mapped, copy/paste the script to the R console window and hit enter (Fig. 6b).

3.12 Comparing to Annotation Data

In some cases, it will be relevant to compare RTTS data to some kind of annotation to identify global trends. This can be done by summarizing the read counts around a set of locations. We have prepared an R script utilizing Bioconductor [20] for generating such a plot from the RTTS count wigs and an additional file containing either user-supplied genomic locations or RefSeq TSS.

Fig. 6 Example of output produced with the described protocol. Mouse liver RNA was analyzed with the described protocol, including an optional CAGE selection to enrich for RTTS corresponding to transcription start sites. (a) Output of Subheading 3.10. The sequencing data was mapped to the genome, RTTS were counted, converted to a wig file, and uploaded to the UCSC genome browser. The height of the bar at each genomic location corresponds to the number of read 5′ ends mapping to this location. The minus strand is shown with negative values using a different scale. (b) Output of Subheading 3.11. Reads were mapped to a single sequence (Hmgcs2 mRNA) and the count of 5′ ends at each location was plotted. Reads mapping to the positive strand are shown above zero, while those mapping to the negative strand are shown below zero. (c) Output of Subheading 3.12. Upper plot shows the sum of reads at each distance from annotated TSS. The high peak at position 1 results from many reads mapped to known TSS, while the high peak at position 13 results from an alternative TSS for the highly expressed albumin transcript


1. Prepare a file with a set of locations that will serve as reference points for counting read locations. The format of the file is three tab-delimited columns. Columns must have headers (named “seqnames” for the name of the chromosome, “position,” and “strand”). An example is given in the script. Positions must be 1-based (see Note 14).
2. Download the script from this URL, http://people.binf.ku.dk/~lukasz/plot_around_locations_from_wig.r, and open it in a text editor. In the text editor, edit the input file names to match the two wig files prepared as described in Subheading 3.10 and the position file prepared above. Also edit the genome assembly name and the size of the window surrounding the given positions that is used for summarizing read counts.
3. Start R, set the proper working directory, and copy/paste the script to the R console. This will produce a barplot of the RTTS counts relative to the positions given as reference (Fig. 6c).
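The summarization behind this kind of plot can be sketched in a few lines of Python. This is not the provided R script; the data structures (a dictionary of counts keyed by chromosome, strand, and 1-based position, and a list of reference locations) and the function name are assumptions chosen for illustration. Offsets are counted in the direction of transcription, so that for minus-strand references positive offsets correspond to decreasing genomic coordinates.

```python
from collections import Counter


def sum_counts_around(counts, references, window=50):
    """Sum RTTS counts at each offset relative to a set of reference positions.

    counts     : dict mapping (chrom, strand, position) -> count, 1-based
    references : iterable of (chrom, position, strand), 1-based
    window     : number of positions considered on each side
    Returns a Counter mapping offset -> summed count.
    """
    profile = Counter()
    for chrom, ref_pos, strand in references:
        for offset in range(-window, window + 1):
            # Walk downstream in the direction of transcription.
            pos = ref_pos + offset if strand == "+" else ref_pos - offset
            profile[offset] += counts.get((chrom, strand, pos), 0)
    return profile


if __name__ == "__main__":
    # Hypothetical counts and reference positions, for illustration only.
    counts = {("chr1", "+", 100): 3, ("chr1", "+", 102): 1,
              ("chr2", "-", 200): 4, ("chr2", "-", 198): 2}
    refs = [("chr1", 100, "+"), ("chr2", 200, "-")]
    profile = sum_counts_around(counts, refs, window=2)
    for offset in sorted(profile):
        print(offset, profile[offset])
```

Only counts on the same strand as the reference position are summed here, which is a design choice of this sketch rather than something stated in the protocol.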

4 Notes

1. The amount of starting material can be reduced if necessary. On the other hand, for samples that are to be used for CAGE selection a minimum of 5 μg of RNA is needed. The amount of reverse transcription primer should be scaled with the amount of RNA. The quality of the RNA starting material is very important as degraded RNA will produce background in any type of experiment based on detection of RTTS. Moreover, random priming typically produces more background than gene-specific priming. In CAGE experiments the non-full-length cDNAs are removed in a selection step, thereby effectively reducing the background, but in other applications a negative control sample is required and can be used to normalize for reverse transcriptase pretermination.
2. The priming sequence used in RT_random_primer (..NNNNNNNNS-3′) can be modified according to specific needs. In many cases, such as RNA structure probing, a gene-specific primer with the 5′ overhang sequence can be used (5′-AGACGTGTGCTCTTCCGATCT-“gene-specific sequence”).
3. If CAGE selection is to be performed this step should be skipped.
4. Optional selection of full-length cDNA for CAGE analysis of TSS can be performed according to Subheadings 3.3–3.7 [without concentration] as described in [1] and results in a total volume of 34 μl cap-selected RNA.
5. Be careful with low-level pooling of indexes since proper sequencing requires that at each cycle there is at least one green laser read nucleotide (G or T) and one red laser read nucleotide (A or C). See more at http://www.epibio.com/pdftechlit/312pl1211.pdf.
6. Using a long denaturation time in the PCR reaction helps alleviate GC bias and fosters reproducibility between different thermal cyclers [21].
7. To simplify the procedure and reduce the risk of contaminating laboratory space with generated libraries, one can, instead of running an agarose gel, analyze and quantify the PCR products on a Bioanalyzer DNA 1000 chip without prior purification. This allows pooling the crude reactions in the right proportions (it is advisable to add EDTA to the reactions before pooling to avoid index switching) and performing only a single Ampure XP purification.
8. When the prepared libraries have the same size distribution, it is possible to pool them based on the NanoDrop-measured concentration.
9. The output from sequencing is one or more FASTQ files containing the sequence reads and the corresponding quality scores. If several indexes were used for different experimental conditions, FASTQ files from each index should be analyzed individually. If using the main Galaxy server and dealing with large datasets (>2 GB), it is an advantage to use the Galaxy ftp upload. A tutorial can be found here: http://screencast.g2.bx.psu.edu/quickie_17_ftp_upload/flow.html. The analysis described below can be carried out on a local instance of Galaxy or on the main Galaxy server (http://usegalaxy.org/). When using the main server be sure to create a login so that your analysis is saved. Alternatively the analysis can be performed on a Unix/OSX machine in-house (see Subheading 3.6).
10. If the dataset consists of several FASTQ files they can be merged into one file with the “Concatenate datasets” tool found in the “Text Manipulation” tool category at this point to facilitate the further analysis of the full dataset.
11. Other sequencing read mappers can be used instead of the Bowtie mapper. However, it is important not to use a too stringent cutoff for mapping, because a considerable fraction of reads contain untemplated sequence added by reverse transcriptase at the 5′ end. The stringency of mapping conditions should be considered individually for each experiment while taking the quality of the sequencing reads and the complexity of the sequences that are being mapped against into account. When mapping against short sequences the coverage towards the 3′ end can be improved by trimming sequencing reads from the 3′ end.


12. Reverse transcriptase will in some cases add extra untemplated nucleotides after terminating at the 5′ end of the RNA. This is especially pronounced when the 5′ end of the RNA is capped, which is the case for mRNAs. For the conditions described here, we find that the PrimeScript RT enzyme adds untemplated nucleotides in 81% of cases for RTTS located closer than 50 nts to an annotated TSS (most of these presumably being capped), while the same is the case for 12% of the RTTS located elsewhere. It is therefore necessary to trim the reads that have one or more mismatches in the first three mapped positions, which is implemented in the published workflow. When an untemplated nucleotide happens to match the genomic sequence, trimming is not possible.
13. R can be freely downloaded for any platform at http://cran.r-project.org/. Scripts are written for version 2.15.
14. At this step the user must ensure that the numbers provided as locations of interest are in the 1-based coordinate system. This system is used, e.g., in the UCSC genome browser display window. Be aware that tables downloaded from the UCSC table browser are provided in the 0-based system. To use TSS information from such a table in the provided script, one must add 1 to the starting positions. Read more on coordinate systems at http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms.
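The trimming rule in Note 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the published workflow. It assumes an ungapped alignment in which the read and the matching reference segment have equal length and are both written in read orientation. It also interprets the rule as "remove everything up to and including the last mismatch found within the first three mapped positions". The function names trim_untemplated and adjust_five_prime are hypothetical.

```python
def trim_untemplated(read, ref, window=3):
    """Trim putative untemplated 5' nucleotides from a mapped read.

    read, ref: equal-length strings (assumed ungapped alignment,
               both given in read orientation).
    window:    number of leading positions inspected (three in Note 12).
    Returns a tuple (trimmed_read, n_trimmed).
    """
    if len(read) != len(ref):
        raise ValueError("read and reference segment must have equal length")
    n_trimmed = 0
    for i in range(min(window, len(read))):
        if read[i].upper() != ref[i].upper():
            # Keep the position just past the last mismatch seen in the window.
            n_trimmed = i + 1
    return read[n_trimmed:], n_trimmed


def adjust_five_prime(position, n_trimmed, strand):
    """Shift the genomic coordinate of the read's 5' end after trimming.

    On the plus strand the 5' end moves to a higher coordinate,
    on the minus strand to a lower one.
    """
    if strand == "+":
        return position + n_trimmed
    if strand == "-":
        return position - n_trimmed
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # One untemplated G in front of a read that otherwise matches the reference.
    print(trim_untemplated("GACGTAC", "AACGTAC"))
    # A mismatch outside the three-position window is left alone.
    print(trim_untemplated("AACGTAG", "AACGTAC"))
```

As the note points out, an untemplated nucleotide that happens to match the genome cannot be recognized this way, so the sketch leaves such reads untouched.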
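The coordinate conversion in Note 14 amounts to a small adjustment, sketched here in Python. The sketch assumes a UCSC gene table with txStart, txEnd, and strand columns, in which txStart is 0-based and txEnd is 1-based (the half-open convention of UCSC tables). The function name tss_one_based and the example rows are hypothetical. For plus-strand genes it adds 1 to the start, as the note describes. For minus-strand genes the TSS lies at txEnd, which under this convention is already a 1-based position and is returned unchanged.

```python
def tss_one_based(tx_start, tx_end, strand):
    """Return the 1-based TSS position for one row of a UCSC gene table.

    tx_start: 0-based start of the transcript (as stored in the table).
    tx_end:   end of the transcript (1-based under the half-open convention).
    strand:   '+' or '-'.
    """
    if strand == "+":
        # Plus strand: the TSS is the first base, so convert 0-based to 1-based.
        return tx_start + 1
    if strand == "-":
        # Minus strand: the TSS is the last base of the interval.
        return tx_end
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # Made-up rows for illustration only: (name, strand, txStart, txEnd).
    rows = [("geneA", "+", 999, 2000), ("geneB", "-", 4999, 6000)]
    for name, strand, tx_start, tx_end in rows:
        print(name, tss_one_based(tx_start, tx_end, strand))
```

Keeping the conversion in one function makes it easier to check that every location handed to the analysis script uses the same coordinate system.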

Acknowledgments
The research was funded by the Danish Council for Strategic Research, the Lundbeck Foundation and the Novo Nordisk Foundation. Morten Lindow and Susanna Obad, Santaris Pharma, provided mouse liver samples and RIKEN/Piero Carninci provided the updated CAGE protocol as well as advice ahead of publication.

References
1. Takahashi H, Kato S, Murata M et al (2012) CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. In: Deplancke B, Gheldof N (eds) Gene regulatory networks, vol 786. Humana, Totowa, NJ, pp 181–200
2. Motorin Y, Muller S, Behm-Ansmant I et al (2007) Identification of modified residues in RNAs by reverse transcription-based methods. Methods Enzymol 425:21–53. doi:10.1016/s0076-6879(07)25002-5
3. Mortimer SA, Weeks KM (2009) Time-resolved RNA SHAPE chemistry: quantitative RNA structure analysis in one-second snapshots and at single-nucleotide resolution. Nat Protoc 4(10):1413–1421. doi:10.1038/nprot.2009.126
4. König J, Zarnack K, Rot G et al (2010) iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17(7):909–915. doi:10.1038/nsmb.1838
5. Shibata Y, Carninci P, Watahiki A et al (2001) Cloning full-length, cap-trapper-selected cDNAs by using the single-strand linker ligation method. Biotechniques 30(6):1250–1254
6. Li TW, Weeks KM (2006) Structure-independent and quantitative ligation of single-stranded DNA. Anal Biochem 349(2):242–246. doi:10.1016/j.ab.2005.11.002
7. Hirzmann J, Luo D, Hahnen J et al (1993) Determination of messenger RNA 5′-ends by reverse transcription of the cap structure. Nucleic Acids Res 21(15):3597–3598
8. Zhu YY, Machleder EM, Chenchik A et al (2001) Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques 30(4):892–897
9. Carninci P, Kasukawa T, Katayama S et al (2005) The transcriptional landscape of the mammalian genome. Science 309(5740):1559–1563. doi:10.1126/science.1112014
10. Shiraki T, Kondo S, Katayama S et al (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26):15776–15781. doi:10.1073/pnas.2136655100
11. Weeks KM, Mauger DM (2011) Exploring RNA structural codes with SHAPE chemistry. Acc Chem Res 44(12):1280–1291. doi:10.1021/ar200051h
12. Lucks JB, Mortimer SA, Trapnell C et al (2011) Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc Natl Acad Sci U S A 108(27):11063–11068. doi:10.1073/pnas.1106501108
13. Giardine B, Riemer C, Hardison RC et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10):1451–1455. doi:10.1101/gr.4086505
14. Goecks J, Nekrutenko A, Taylor J et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86. doi:10.1186/gb-2010-11-8-r86
15. Blankenberg D, Gordon A, Von Kuster G et al (2010) Manipulation of FASTQ data with Galaxy. Bioinformatics 26(14):1783–1785. doi:10.1093/bioinformatics/btq281
16. Blankenberg D, Von Kuster G, Coraor N et al (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19.10.1–21
17. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. doi:10.1186/gb-2009-10-3-r25
18. Hannon-Lab, Gordon A (2010) FASTX-toolkit: FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/
19. R Foundation for Statistical Computing (2012) R: a language and environment for statistical computing, version 2.15.1. R Foundation for Statistical Computing, Vienna, Austria
20. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
21. Aird D, Ross MG, Chen WS et al (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12(2):R18. doi:10.1186/gb-2011-12-2-r18
