Bioinformatics Core

An Introduction to Next Generation Sequencing – Chemistry, Data Files and Applications
Genomics Core and Bioinformatics Core at Purdue University

Phillip San Miguel, Ph.D., Genomics Facility Director, [email protected], (765)-496-6328
Jyothi Thimmapuram, Ph.D., Bioinformatics Core Director, [email protected], (765)-496-6252

Learning Objectives of the seminar
• An overview of chemistry and library preps for two NGS platforms: Illumina and GS FLX (454)
• Description of GS FLX (454) and Illumina data files
• Brief discussion of various NGS applications

Agenda
Chemistry and Library Prep Protocols   ~45 min   Phillip San Miguel
Q&A                                    ~10 min
Break                                  ~5 min
Data Files and Applications            ~45 min   Jyothi Thimmapuram
Q&A                                    ~10 min

Some common notations
• Two types of sequencing
  – DNA (genome)
  – RNA (transcriptome)
• Two types of processing in the first step
  – Alignment/mapping to a reference genome or transcriptome
  – de novo assembly
• Reads = Sequences
• GS FLX = 454 = Roche
• Illumina = Solexa

                 Roche/454 GS FLX Titanium   Illumina HiScanSQ
Read length      250-450 bp                  50, 100 bp
Reads per run    1-1.5 million               350-700 million
Total yield      0.25-0.5 Gb                 100-150 Gb
Files            sff, fna, qual              FASTQ

GS FLX Data files

GS FLX files: one sff, fna and qual file per region
sff – Standard Flowgram Format; a binary file, not a text file, so do not open it in a text editor

sff file layout: a Common Header for the file, followed by a Read Header and Read Data for each read

sff file – common header
Common Header:
  Magic Number:   0x2E736666
  Version:        0001
  Index Offset:   182488856
  Index Length:   872484
  # of Reads:     43558
  Header Length:  1440
  Key Length:     4
  # of Flows:     1400
  Flowgram Code:  1
  Flow Chars:     TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTCGTACGTACG……
  Key Sequence:   TCAG
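The magic number in the common header is the ASCII text ".sff" stored as a 32-bit big-endian integer. A minimal Python sketch of a file-type check based on it follows; the helper name `looks_like_sff` is hypothetical and is not one of the Roche tools.

```python
import struct

SFF_MAGIC = 0x2E736666  # value shown in the common header


def looks_like_sff(first_bytes: bytes) -> bool:
    """Return True if the first four bytes match the SFF magic number."""
    if len(first_bytes) < 4:
        return False
    (magic,) = struct.unpack(">I", first_bytes[:4])  # big-endian unsigned 32-bit
    return magic == SFF_MAGIC


# Spelling the magic number out as bytes gives the text ".sff"
magic_as_text = struct.pack(">I", SFF_MAGIC).decode("ascii")
print(magic_as_text)

# Hypothetical first eight bytes of a file: magic number, then version 0001
print(looks_like_sff(b".sff\x00\x00\x00\x01"))
# A FASTA-style text file would start with ">" instead
print(looks_like_sff(b">GQTXD0C"))
```

In practice the bytes would come from something like `open(path, "rb").read(4)`, where `path` is whatever sff file is being checked.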

sff file – read header
>GQTXD0C01AHEZZ
  Run Prefix:       R_2010_11_08_16_28_38_
  Region #:         1
  XY Location:      0081_0925
  Run Name:         R_2010_11_08_16_28_38_FLX02080324_adminrig_491
  Analysis Name:    D_2010_11_09_16_24_31_test3_fullProcessing
  Full Path:        /data/R_2010_11_08_16_28_38_FLX02080324_adminrig_491/D_2010_11_09_16_24_31_test3_fullProcessing/
  Read Header Len:  32
  Name Length:      14
  # of Bases:       474
  Clip Qual Left:   15
  Clip Qual Right:  422
  Clip Adap Left:   0
  Clip Adap Right:  0

sff file – read data
Bases:
gactactacgtctctGCACTTCAGTGCGAACGACATGTAGAATAGAGTTTGGCCGCACA
TTTTGCATAGCCAAGGAGATGTCATCCTCAAGTTCTTTTCCAGAAACAAACAATTTGATC
TTTTCAGATGGTCTGAATTCTAATTTCTTGGAAGCTTTTGATTTGACAGAGAGTGGAGT
TTCCTCTGTG……….TACAAGCAGGTCCTTGCAACTCAGCAGCAAGTTTATTTCATGGG
GGTACCACTTAGCTCTTGGTCTATGGAAGTTTGCAATTTCCTTGTCGCTCAACACTGCC
TTCATGGACTGCAGCTTCTGAGCGGGAACAGAATGTACCACCTTTATGCCCATTGAAA
CGAGACGGAgtggtcggcgtctcccaaggcacacaggggataggnnnnnnnnnnnnnnnnn
Quality Scores:
40 40 40 40 40 40 40 40 40 40 40 ………………………………………… 0 0 0 0 0 0

SFF Tools from Roche
• sfffile
  – Construct one SFF file from multiple regions or runs
  – Filter reads (using accession no., trimming points or lengths)
• sffinfo
  – Extract read information from an SFF file into text form: FASTA and quality score files

• sff_extract – http://bioinf.comav.upv.es/sff_extract/index.html

fna and qual files
Each record starts with a header line: ">" followed by the Sequence ID and the Definition.

fna file:
>GZXAOSR15JHOOX rank=0000090 x=3774.0 y=3503.5 length=382
TTTATTTTCAATGCAATGTACATTATTTAAAGGAAACACCCGACGAACGATATATTATAG
TACCGAACTGCCGAGCGTTGGTGCACGAGCGTGGAAAATCGTTTCACCGCCATCGAC
AAATATAACAAATCGGCA…………………………………...GTTCACAGAAGGTTTAGTCG
TTTTTTGATAGTGAAATTATTATAAATTGTGTTTTGCAGGTGAACGCGATAATTAATGAA
TATATTTATTTTCTTAAAGTTTCTTACGCATTACATATTATAAATTATTTGCTAACTGAAA
ATGCGATGAAATTCAAACCAC

qual file:
>GZXAOSR15JHOOX rank=0000090 x=3774.0 y=3503.5 length=382
39 39 36 38 16 16 16 16 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 40 40 40 39 39 38 39 39 20 20 20 39 22 18 20 20 29 35 40 40 39 39 39 39 40 40 40 40 40 40 40 40 40 40 40 39 34 34 34 38 38 38 38 40 40 40 40 39 40 40 40 40 40 40 39 39 39 40 40 40 40 40 38 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 40 40 39
……………………………………
38 37 33 33 33 33 27 25 21 19 12 12 12 12 21 13 13 14 14 14 23 22 30 28 36 38 38 40 38 40 39 39 31 31 31 38 39 40 39 39 40 37 37 32 31 31 32 32 30 31 31 31 38 36 26 26 26 37 37 39 32 33 33 33 37 39 40 39 40 40 40 31 31 31 39 39 39 39 38 38 31 31 21 31
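Because a qual file is plain text, its scores can be read with a few lines of code. The sketch below is illustrative only: `parse_qual` is a hypothetical helper, and the example holds just the first 20 scores of the record shown on the slide, not all 382.

```python
def parse_qual(text):
    """Parse qual-format text into {sequence_id: [integer scores]}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The Sequence ID is the first word after ">"
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Score lines may wrap, so keep appending to the current record
            records[current_id].extend(int(tok) for tok in line.split())
    return records


# Truncated example: only the first 20 scores from the slide
example = """>GZXAOSR15JHOOX rank=0000090 x=3774.0 y=3503.5 length=382
39 39 36 38 16 16 16 16 40 40
40 40 40 40 40 40 40 40 40 40
"""

scores = parse_qual(example)["GZXAOSR15JHOOX"]
print(len(scores), min(scores), max(scores))
```

The same Sequence ID appears in the fna file, so the scores can be matched to their bases by ID.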

Illumina files: one fastq file per sample
Each record has four lines: Sequence ID (starting with @), Sequence, a "+" separator line, and Quality Scores.

@HWI-ST330_0106:4:1:2643:2862#CTTGTA/1
CTTGACAAAGGGTGCAAGGCAGTTAGTGGTGCAAGATGCATTGCTGATGATGGGTTCATCAGGGCTGTAATCATA
+
ggggggggggggdgggeggfgggggfaffdgdgdgeggggggggggdggdggggceeedeegggg_eYdc_dcac
@HWI-ST330_0106:4:1:2613:2891#CTTGTA/1
CGTGTCTTAAGGAGGCACCAAACAATATAAAGCTACAGATGGCGTCCTTGGTTTTTAATTTTAAGTTGGGGGACT
+
ggggggggggggegggdgggggegggffdgggggggggeggggggggggggegfggeegggfgcgggefgge^eg
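A FASTQ file in this layout can be read four lines at a time. The sketch below assumes exactly four lines per record, as in the example records. `read_fastq` and `split_read_id` are hypothetical helper names, and the ID layout (instrument and run, lane, tile, x, y, index, read number) is assumed from the older Illumina naming convention rather than stated on the slide.

```python
def read_fastq(lines):
    """Yield (seq_id, sequence, quality) tuples from an iterable of FASTQ lines.

    Assumes the simple four-lines-per-record layout.
    """
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:          # end of input
            return
        header = header.strip()
        if not header:              # skip blank lines between records
            continue
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def split_read_id(seq_id):
    """Split an ID such as 'HWI-ST330_0106:4:1:2643:2862#CTTGTA/1' into parts."""
    location, _, tail = seq_id.partition("#")
    index, _, read_number = tail.partition("/")
    instrument, lane, tile, x, y = location.split(":")
    return {"instrument": instrument, "lane": int(lane), "tile": int(tile),
            "x": int(x), "y": int(y), "index": index, "read": read_number}


example = """@HWI-ST330_0106:4:1:2643:2862#CTTGTA/1
CTTGACAAAGGGTGCAAGGCAGTTAGTGGTGCAAGATGCATTGCTGATGATGGGTTCATCAGGGCTGTAATCATA
+
ggggggggggggdgggeggfgggggfaffdgdgdgeggggggggggdggdggggceeedeegggg_eYdc_dcac
""".splitlines()

records = list(read_fastq(example))
for seq_id, seq, qual in records:
    parts = split_read_id(seq_id)
    print(parts["lane"], parts["index"], len(seq), len(qual))
```

For a real file, an open file handle can be passed in place of `example`, since the function only needs an iterable of lines.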

Single Reads (figure labels: SP, A1, A2)
  s_1_sequence.txt

Paired-end Reads (figure labels: SP1, SP2, A1, A2)
  s_1_1_sequence.txt
  s_1_2_sequence.txt

Illumina FASTQ quality scores

Q = -10 log10(e)
e = estimated probability of the base call being wrong

• Q40: 1 error in 10,000 base calls
• Q30: 1 error in 1,000
• Q20: 1 error in 100

Read Length    Percent of Bases Higher than Q30*
2 × 50 bp      > 85%
2 × 100 bp     > 80%

PHRED quality score of 2 (ASCII ‘B’)
– Marks unreliable quality scores at the ends of some reads, not specific errors
– Do not use these bases for downstream analysis
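The relation between Q and e, and the decoding of a quality string, can be sketched in a few lines of Python. The ASCII offset of 64 is an assumption inferred from the statement that a score of 2 is written as 'B' (ASCII 66); other pipelines use an offset of 33, so check which one applies to your data. All function names here are hypothetical.

```python
import math


def phred_to_error(q):
    """Error probability e for a Phred score Q, from Q = -10 log10(e)."""
    return 10 ** (-q / 10)


def error_to_phred(e):
    """Phred score Q for an error probability e."""
    return -10 * math.log10(e)


def decode_quality(qual, offset=64):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    offset=64 is assumed here because 'B' (ASCII 66) represents Q2.
    """
    return [ord(ch) - offset for ch in qual]


for q in (20, 30, 40):
    print(q, phred_to_error(q))

print(decode_quality("Bgh"))
```

With this offset, 'g' decodes to 39 and 'h' to 40, and a trailing run of 'B' characters decodes to a run of 2s.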

Sequence quality check
• FASTX-Toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
• FastQC – http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Figure panels: Sequence composition; Quality score

Good run vs. bad run (figure)
x-axis: Position in read; y-axis: Quality scores

FastQC: before trimming vs. after trimming (figure)
x-axis: Position in read; y-axis: Quality scores

NGS Applications
• GS FLX
  – Genomic and transcriptomic assemblies
  – Re-sequencing – SNP/InDel, polymorphisms
  – Metagenomics/Metatranscriptomics (sp. identification within a population)
  – Amplicon sequencing (accurate haplotype identification)
  – Exon capture
• Illumina
  – Genomic and transcriptomic assemblies
  – Re-sequencing – SNP/InDel, CNV
  – RNA-Seq: DGE
  – ChIP-Seq
  – miRNA analysis
  – Metagenomics/Metatranscriptomics


List of software packages for NGS analysis

de novo genome assembly
GS FLX reads: gsAssembler (Newbler)
• Use sff files when possible
• Can recognize linker/circularization adaptor in PE reads
Illumina reads:
• ABySS – Simpson et al., Genome Res 2009, 19:1117
• SOAP (Short Oligonucleotide Analysis Package) – denovo, aligner, snp, splice – Li et al., Bioinformatics 2009, 25:1966
• Velvet – Zerbino and Birney, Genome Res 2008, 18:821

Recommendations for sequencing of large genomes for de novo assembly
More coverage with Illumina and lower coverage with GS FLX (454)
• Illumina: at least 60X coverage, using several paired-end libraries with insert sizes ranging from 500 bp to 10 kb
• GS FLX (454): shotgun (15-20X) and mate-pair sequencing (3-5X with 3 kb; 2-3X with 8 and 20 kb)
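Coverage targets like these translate into read counts through the usual estimate: coverage = number of reads × read length / genome size. A minimal sketch follows; the 1 Gb genome size is a hypothetical value chosen only for illustration, and the function names are not from any package.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


genome_size = 1_000_000_000  # hypothetical 1 Gb genome (assumption)

# Reads of 100 bp needed for 60X coverage
illumina_reads = reads_needed(60, 100, genome_size)
print(illumina_reads)

# Coverage obtained from 1.5 million reads of 450 bp on the same genome
print(coverage(1_500_000, 450, genome_size))
```

This is an average only; real coverage varies along the genome, which is one reason to plan for more than the minimum.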


Strategies for RNA-Seq

Haas and Zody, Nature Biotechnology, 2010, 28:421

Applications using RNA-Seq data
• Differential gene expression
• Structural annotation of a genome
• Alternative splicing
• Fusion transcripts
• de novo transcriptome assembly
• SNPs/Indels

• 454 or Illumina

de novo transcriptome assembly
• Assemble and then use contigs as reference transcriptome
  – Only option when genome and/or transcriptome not available
  – Results depend on the accuracy of the assembly
  – If sequencing depth is low, some genes are not assembled
  – More work involved for assembly & annotation

Illumina - de novo transcriptome assembly - software
• Trinity
  – Grabherr, Haas et al., Nature Biotechnology, 2011, 29:644-652
• ABySS and Trans-ABySS
  – Birol et al., Bioinformatics, 2009, 25:2872
  – Robertson et al., Nature Methods, 2010, 7:909
• Velvet and Oases
  – Zerbino and Birney, Genome Res., 2008, 18:821

Mapping of RNA-Seq reads
• Align to genome
  – Can detect novel exons or un-annotated genes
  – Aligners should be able to map reads across splice sites
  – Reads from non-genic regions – influence expression values, SNP detection etc.
• Align to transcriptome
  – Information about splice junctions is not required
  – PE distance and junction reads - isoforms

Some popular aligners
• BWA – slow for long reads and reads with higher error rate; suboptimal alignment pairs; allows gapped alignment
• TopHat – uses Bowtie; maps reads to genome, builds a database of possible splice junctions, and maps the reads against these junctions to confirm
• Novoalign – most accurate, slow
• gsMapper (Newbler – Roche) for GS FLX reads
• Others: SpliceMap, MapSplice, SOAP, MAQ, CLC Bio

Number of reads/Coverage depends on:
• Number of genes in the species
• Number of genes expressed under the treatment/tissue
• Rare transcripts


Number of reads/Coverage

Trapnell et al., Nature Biotechnology, 2010, 28:311


SNP and sequence depth

Bentley et al., Nature 2008, 456:53

Squares – All SNPs Triangles – Heterozygous SNPs Circles – Homozygous SNPs


ChIP-Seq

Park (2009), Nature Reviews Genetics 10 : 669-680


ChIP-Seq : workflow

Park (2009), Nature Reviews Genetics 10 : 669-680

ChIP-Seq : software
• QuEST (Quantitative Enrichment of Sequence Tags)
  – Valouev et al., 2008. Nature Methods. 5(9): 829-834.
• MACS (Model-based Analysis of ChIP-Seq)
  – Zhang et al., 2008. Genome Biol. 9(9):R137

SAM – Sequence Alignment/Map
http://samtools.sourceforge.net/
• Unified format for storing read alignments to a reference genome
• Header
    @HD VN:1.3 SO:coordinate
    @SQ SN:ref LN:45
• Alignment section
  – tab-separated table with at least 11 columns
  – each line describes one alignment

SAM format: Alignment section

@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

Field    Description
QNAME    Query template NAME
FLAG     Alignment flags
RNAME    Reference sequence NAME (e.g., chromosome name)
POS      1-based leftmost mapping POSition in reference
MAPQ     MAPping Quality (as Phred score)
CIGAR    Alignment description (gaps etc.) in CIGAR format
RNEXT    Ref. name of the mate/next segment (for paired end)
PNEXT    Position of the mate/next segment (for paired end)
TLEN     Observed Template (insert) LENgth (for paired end)
SEQ      Segment SEQuence (of the read)
QUAL     Quality string of the read – Phred-scaled
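Since the alignment section is a tab-separated table, one line can be split into the 11 mandatory fields directly. The sketch below uses the first example alignment, with tabs restored between the fields; `SamRecord` and `parse_sam_line` are hypothetical names, and optional fields after column 11 are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the 11 mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10])


# First example alignment, written with tab separators
line = "\t".join(["r001", "163", "ref", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
rec = parse_sam_line(line)
print(rec.qname, rec.flag, rec.pos, rec.cigar, rec.tlen)
```

Header lines start with "@" and should be skipped before calling the parser.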

FLAG field
Numeric   Binary     Description
1         00000001   template has multiple fragments in sequencing
2         00000010   each fragment properly mapped according to aligner
4         00000100   fragment is unmapped
8         00001000   mate is unmapped
16        00010000   sequence is reverse complemented
32        00100000   sequence of mate is reversed
64        01000000   is first fragment in template
128       10000000   is second fragment in template

r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

83 = 64 + 16 + 2 + 1 = 01010011

• template has multiple fragments
• each fragment is properly aligned
• fragment is not unmapped
• mate is not unmapped
• sequence is reverse complemented
• sequence of mate is not reversed
• this is the first fragment in the template
• this is not the second fragment in the template
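The same decomposition can be done with bitwise tests. This sketch checks each of the eight FLAG bits in turn, using the descriptions from the FLAG field table; the names `FLAG_BITS`, `set_bits` and `describe_flag` are hypothetical.

```python
FLAG_BITS = [
    (1, "template has multiple fragments in sequencing"),
    (2, "each fragment properly mapped according to aligner"),
    (4, "fragment is unmapped"),
    (8, "mate is unmapped"),
    (16, "sequence is reverse complemented"),
    (32, "sequence of mate is reversed"),
    (64, "is first fragment in template"),
    (128, "is second fragment in template"),
]


def set_bits(flag):
    """Return the FLAG bit values that are set, lowest first."""
    return [bit for bit, _ in FLAG_BITS if flag & bit]


def describe_flag(flag):
    """Return the descriptions of all bits set in a FLAG value."""
    return [text for bit, text in FLAG_BITS if flag & bit]


for flag in (83, 163):
    print(flag, format(flag, "08b"), set_bits(flag))
    for text in describe_flag(flag):
        print("   ", text)
```

For 163, the FLAG of the other example alignment, the set bits are 1, 2, 32 and 128: the mate of the read decoded above.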

CIGAR string
M   alignment match (can be sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequence is present in SEQ)
H   hard clipping (clipped sequence is not present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch

r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string gives alignment details.
Example: “8M2I4M1D3M” means
• the first 8 bases of the read map normally (not necessarily perfectly)
• 2 bases are inserted, i.e., present in the read but missing in the reference
• after another 4 mapped bases, 1 base is deleted, i.e., present in the reference but skipped in the query
• the last 3 bases map normally
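A CIGAR string can be split into (length, operation) pairs and summed to check it against the read. In this sketch, M, I, S, = and X count toward the read length and M, D, N, = and X count toward the span on the reference; `parse_cigar`, `query_length` and `reference_length` are hypothetical names.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]
    # Rebuilding the string catches characters the pattern did not match
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unrecognised CIGAR string: {cigar!r}")
    return ops


def query_length(ops):
    """Number of read bases described by the CIGAR operations."""
    return sum(n for n, op in ops if op in CONSUMES_QUERY)


def reference_length(ops):
    """Number of reference positions spanned by the alignment."""
    return sum(n for n, op in ops if op in CONSUMES_REFERENCE)


ops = parse_cigar("8M2I4M1D3M")
seq = "TTAGATAAAGGATACTG"
print(ops)
print(query_length(ops), len(seq))
print(reference_length(ops))
```

The read length from the CIGAR (8 + 2 + 4 + 3 = 17) matches the length of SEQ, while the alignment covers 8 + 4 + 1 + 3 = 16 reference positions starting at POS.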

GFF – General Feature Format
The feature file should be tab delimited.

SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"

1. seqname – The name of the sequence on which the feature exists, e.g., “chr1”, “myChrom”, “contig123”.
2. source – The source of this feature. Normally used to indicate the program making the prediction, or whether it comes from public database annotation, is experimentally verified, etc.
3. feature – The feature type name, e.g., “exon”, “gene”, “SNP”.
4. start – The one-based starting position of the feature on seqname.
5. end – The one-based ending position of the feature on seqname.
6. score – A score assigned to the GFF feature.
7. strand – Defines the strand. Use ‘+’, ‘-’ or ‘.’.
8. frame – The frame of the coding sequence. Use ‘0’, ‘1’, ‘2’, or ‘.’.
9. attribute
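The nine columns map naturally onto a small record type. The sketch below parses one tab-delimited GFF line and treats "." as a missing score or frame. The names `GffFeature`, `parse_gff_line` and `feature_length` are hypothetical, and the attribute column is kept as raw text because its layout varies between GFF versions.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int               # one-based, inclusive
    end: int                 # one-based, inclusive
    score: Optional[float]   # None when the column is "."
    strand: str
    frame: Optional[int]     # None when the column is "."
    attribute: str           # raw text; empty if the column is absent


def parse_gff_line(line):
    """Parse one tab-delimited GFF line into a GffFeature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-delimited columns")
    return GffFeature(
        fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
        None if fields[5] == "." else float(fields[5]),
        fields[6],
        None if fields[7] == "." else int(fields[7]),
        fields[8] if len(fields) > 8 else "",
    )


def feature_length(feature):
    """Length in bases; both ends are one-based and inclusive."""
    return feature.end - feature.start + 1


exon = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0")
print(exon.feature, exon.start, exon.end, feature_length(exon))
```

Because start and end are both one-based and inclusive, the exon from 103 to 172 is 70 bases long, not 69.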

BED – Browser Extensible Data
Used to display an annotation track
http://genome.ucsc.edu/FAQ/FAQformat

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

• 3 required fields – chrom, chromStart, chromEnd
• 9 optional fields
• The number of fields must be consistent for any single data set
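A BED line can be read as three required fields plus however many optional fields are present. In the sketch below, the optional field names follow the UCSC format description, chromStart is treated as 0-based with chromEnd exclusive, and block starts are taken as relative to chromStart. `parse_bed_line` and `block_intervals` are hypothetical helper names.

```python
OPTIONAL_FIELDS = ["name", "score", "strand", "thickStart", "thickEnd",
                   "itemRgb", "blockCount", "blockSizes", "blockStarts"]


def parse_bed_line(line):
    """Parse one BED line into a dict; optional fields stay as strings."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED needs chrom, chromStart and chromEnd")
    record = {"chrom": fields[0],
              "chromStart": int(fields[1]),
              "chromEnd": int(fields[2])}
    for name, value in zip(OPTIONAL_FIELDS, fields[3:]):
        record[name] = value
    return record


def block_intervals(record):
    """Absolute (start, end) of each block; starts are relative to chromStart."""
    sizes = [int(s) for s in record["blockSizes"].split(",") if s]
    starts = [int(s) for s in record["blockStarts"].split(",") if s]
    base = record["chromStart"]
    return [(base + st, base + st + sz) for st, sz in zip(starts, sizes)]


clone_a = parse_bed_line(
    "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(clone_a["name"], clone_a["chromEnd"] - clone_a["chromStart"])
print(block_intervals(clone_a))
```

Note that the trailing comma in "567,488," is allowed, so empty items are dropped when splitting.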

Summary
• GS FLX – sff, fna, qual
• Illumina – FASTQ
• Quality control
• SAM – for alignment
• GTF/GFF – features of a sequence
• BED – for annotation track
• Several applications, software packages
• Plan data analysis before you start sequencing

Future Workshops
Topic                                              Date
Introduction to NGS Data Files and Applications    Feb 2012
de novo assembly                                   March 2012
Mapping/Alignment                                  April 2012
RNA-Seq Analysis                                   May 2012
MeDIP and ChIP-Seq                                 June 2012
SNP (Re-sequencing & RAD-Seq)                      July 2012


Please send an email to [email protected] to add your name to the Bioinformatics Core mailing list

Thank you!
