An Introduction to Next Generation Sequencing – Chemistry, Data Files and Applications
Genomics Core and Bioinformatics Core at Purdue University

Phillip San Miguel, Ph.D., Genomics Facility Director
[email protected] (765)-496-6328
Jyothi Thimmapuram, Ph.D., Bioinformatics Core Director
[email protected] (765)-496-6252
Bioinformatics Core
Learning Objectives of the seminar
• An overview of chemistry and library preps for two NGS platforms: Illumina and GS FLX (454)
• Description of GS FLX (454) and Illumina data files
• Brief discussion of various NGS applications
Agenda
Chemistry and Library Prep Protocols    ~45 min    Phillip San Miguel
Q&A                                     ~10 min
Break                                   ~5 min
Data files and Applications             ~45 min    Jyothi Thimmapuram
Q&A                                     ~10 min
Some common notations
• Two types of sequencing
  – DNA (genome)
  – RNA (transcriptome)
• Two types of processing in the first step
  – Alignment/Mapping to Ref. genome or transcriptome
  – de novo assembly
• Reads = Sequences
• GS FLX = 454 = Roche
• Illumina = Solexa
Platform         Roche/454 GS FLX Titanium    Illumina HiScanSQ
Read length      250-450 bp                   50, 100 bp
Reads per run    1-1.5 million                350-700 million
Total yield      0.25-0.5 Gb                  100-150 Gb
Files            sff, fna, qual               FASTQ
GS FLX Data files
GS FLX files: one sff, fna and qual file per region
sff – Standard Flowgram Format; not a text file, do not open in text editors
sff file layout: Common Header, then a Read Header and Read Data for each read
sff file – common header
Common Header:
  Magic Number: 0x2E736666
  Version: 0001
  Index Offset: 182488856
  Index Length: 872484
  # of Reads: 43558
  Header Length: 1440
  Key Length: 4
  # of Flows: 1400
  Flowgram Code: 1
  Flow Chars: TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTCGTACGTACG……
  Key Sequence: TCAG
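The fixed-size start of the common header can be decoded in a few lines of Python. This is a minimal sketch, not a full SFF reader: the byte layout in `SFF_FIXED_HEADER` (big-endian, with the field order and widths shown) is an assumption taken from the published SFF format description rather than from this slide, and the example bytes are synthetic, packed from the values shown above. The magic number 0x2E736666 is the ASCII text ".sff".

```python
import struct

# Assumed layout (not stated on the slide): big-endian magic (4 bytes),
# version (4 bytes), index offset (uint64), index length (uint32),
# number of reads (uint32), header length (uint16), key length (uint16),
# flows per read (uint16), flowgram format code (uint8).
SFF_FIXED_HEADER = struct.Struct(">4s4sQIIHHHB")


def parse_sff_fixed_header(data: bytes) -> dict:
    """Decode the fixed-size part of an SFF common header."""
    (magic, version, index_offset, index_length, n_reads,
     header_length, key_length, n_flows, flowgram_code) = SFF_FIXED_HEADER.unpack_from(data)
    if magic != b".sff":
        raise ValueError("magic number mismatch: not an SFF file")
    return {
        "magic": hex(int.from_bytes(magic, "big")),
        "version": "".join(str(b) for b in version),
        "index_offset": index_offset,
        "index_length": index_length,
        "n_reads": n_reads,
        "header_length": header_length,
        "key_length": key_length,
        "n_flows": n_flows,
        "flowgram_code": flowgram_code,
    }


# Synthetic bytes built from the slide's values (not read from a real file).
example_bytes = SFF_FIXED_HEADER.pack(
    b".sff", b"\x00\x00\x00\x01", 182488856, 872484, 43558, 1440, 4, 1400, 1
)
header = parse_sff_fixed_header(example_bytes)
print(header)
```

For real files, the Roche tools listed in this seminar (sffinfo) or an established library are the safer route; the sketch only shows why the file must be treated as binary.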
sff file – read header
>GQTXD0C01AHEZZ
  Run Prefix: R_2010_11_08_16_28_38_
  Region #: 1
  XY Location: 0081_0925
  Run Name: R_2010_11_08_16_28_38_FLX02080324_adminrig_491
  Analysis Name: D_2010_11_09_16_24_31_test3_fullProcessing
  Full Path: /data/R_2010_11_08_16_28_38_FLX02080324_adminrig_491/D_2010_11_09_16_24_31_test3_fullProcessing/
  Read Header Len: 32
  Name Length: 14
  # of Bases: 474
  Clip Qual Left: 15
  Clip Qual Right: 422
  Clip Adap Left: 0
  Clip Adap Right: 0
sff file – read data
Bases:
gactactacgtctctGCACTTCAGTGCGAACGACATGTAGAATAGAGTTTGGCCGCACA
TTTTGCATAGCCAAGGAGATGTCATCCTCAAGTTCTTTTCCAGAAACAAACAATTTGATC
TTTTCAGATGGTCTGAATTCTAATTTCTTGGAAGCTTTTGATTTGACAGAGAGTGGAGT
TTCCTCTGTG……….TACAAGCAGGTCCTTGCAACTCAGCAGCAAGTTTATTTCATGGG
GGTACCACTTAGCTCTTGGTCTATGGAAGTTTGCAATTTCCTTGTCGCTCAACACTGCC
TTCATGGACTGCAGCTTCTGAGCGGGAACAGAATGTACCACCTTTATGCCCATTGAAA
CGAGACGGAgtggtcggcgtctcccaaggcacacaggggataggnnnnnnnnnnnnnnnnn
Quality Scores:
40 40 40 40 40 40 40 40 40 40 40 ………………………………………… 20 26 26 20 22 17 19 19 17 16 16 16 0 0 0 0 0 0
SFF Tools from Roche
• sfffile
  – Construct one SFF file from multiple regions or runs
  – Filter reads (using accession no., trimming points or lengths)
• sffinfo
  – Extract read information from SFF file into a text form (FASTA and quality score files)
sff_extract: http://bioinf.comav.upv.es/sff_extract/index.html
fna and qual files
Header line: ">" + Sequence ID, followed by the Definition

fna file:
>GZXAOSR15JHOOX rank=0000090 x=3774.0 y=3503.5 length=382
TTTATTTTCAATGCAATGTACATTATTTAAAGGAAACACCCGACGAACGATATATTATAG
TACCGAACTGCCGAGCGTTGGTGCACGAGCGTGGAAAATCGTTTCACCGCCATCGAC
AAATATAACAAATCGGCA…………………………………...GTTCACAGAAGGTTTAGTCG
TTTTTTGATAGTGAAATTATTATAAATTGTGTTTTGCAGGTGAACGCGATAATTAATGAA
TATATTTATTTTCTTAAAGTTTCTTACGCATTACATATTATAAATTATTTGCTAACTGAAA
ATGCGATGAAATTCAAACCAC

qual file:
>GZXAOSR15JHOOX rank=0000090 x=3774.0 y=3503.5 length=382
39 39 36 38 16 16 16 16 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 40 40 40 39 39 38 39 39 20 20 20 39 22 18 20 20 29 35 40 40 39 39 39 39 40 40 40 40 40 40 40 40 40 40 40 39 34 34 34 38 38 38 38 40 40 40 40 39 40 40 40 40 40 40 39 39 39 40 40 40 40 40 38 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 40 40 39 ……………..……………………. ………………………. 38 37 33 33 33 33 27 25 21 19 12 12 12 12 21 13 13 14 14 14 23 22 30 28 36 38 38 40 38 40 39 39 31 31 31 38 39 40 39 39 40 37 37 32 31 31 32 32 30 31 31 31 38 36 26 26 26 37 37 39 32 33 33 33 37 39 40 39 40 40 40 31 31 31 39 39 39 39 38 38 31 31 21 31
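Because the fna and qual files share the same header lines, the two can be joined per read. The following is a rough sketch in Python under stated assumptions: both files use the FASTA-style layout shown above, the first word after ">" is the read ID, and the toy record (`toy_read`) is made up for illustration, since the slide's example is elided.

```python
def read_records(lines):
    """Yield (read_id, payload_lines) from FASTA-style lines."""
    read_id, payload = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, payload
            # The first word after '>' is taken as the read ID.
            read_id, payload = line[1:].split()[0], []
        else:
            payload.append(line)
    if read_id is not None:
        yield read_id, payload


def pair_fna_qual(fna_lines, qual_lines):
    """Return {read_id: (sequence, [quality scores])}, checking lengths agree."""
    seqs = {rid: "".join(p) for rid, p in read_records(fna_lines)}
    quals = {rid: [int(q) for q in " ".join(p).split()]
             for rid, p in read_records(qual_lines)}
    paired = {}
    for rid, seq in seqs.items():
        scores = quals[rid]
        if len(scores) != len(seq):
            raise ValueError(f"{rid}: {len(seq)} bases but {len(scores)} scores")
        paired[rid] = (seq, scores)
    return paired


# Hypothetical toy record, wrapped over two lines in the qual file.
toy_fna = [">toy_read length=8", "TTTATTTT"]
toy_qual = [">toy_read length=8", "39 39 36 38", "16 16 16 16"]
paired = pair_fna_qual(toy_fna, toy_qual)
print(paired["toy_read"])
```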
Illumina files: one fastq file per sample
Each record has four lines: Sequence ID, Sequence, "+", Quality Scores

@HWI-ST330_0106:4:1:2643:2862#CTTGTA/1
CTTGACAAAGGGTGCAAGGCAGTTAGTGGTGCAAGATGCATTGCTGATGATGGGTTCATCAGGGCTGTAATCATA
+
ggggggggggggdgggeggfgggggfaffdgdgdgeggggggggggdggdggggceeedeegggg_eYdc_dcac
@HWI-ST330_0106:4:1:2613:2891#CTTGTA/1
CGTGTCTTAAGGAGGCACCAAACAATATAAAGCTACAGATGGCGTCCTTGGTTTTTAATTTTAAGTTGGGGGACT
+
ggggggggggggegggdgggggegggffdgggggggggeggggggggggggegfggeegggfgcgggefgge^eg
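The four-line record structure is simple enough to read with a short Python function. This is a sketch that assumes unwrapped FASTQ (exactly four lines per record, as in the example); the record used here (`toy_read`) is a made-up short read, not data from the run shown.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield records from unwrapped FASTQ text (four lines per record)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(rows), 4):
        header, seq, separator, qual = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"{header}: sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


# Hypothetical toy record in the same layout as the slide's example.
toy_fastq = ["@toy_read#CTTGTA/1", "ACGTACGT", "+", "ggggfdcB"]
records = list(parse_fastq(toy_fastq))
print(records[0])
```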
Single Reads (diagram labels: SP, A1, A2)
  Output file: s_1_sequence.txt
Paired-end Reads (diagram labels: SP1, SP2, A1, A2)
  Output files: s_1_1_sequence.txt, s_1_2_sequence.txt
Illumina FASTQ quality scores
Q = -10 log10(e)
e = estimated probability of the base call being wrong
• Q40: 1 error in 10,000 base calls
• Q30: 1 error in 1,000
• Q20: 1 error in 100

Read Length    Percent of Bases Higher than Q30*
2 × 50 bp      > 85%
2 × 100 bp     > 80%

PHRED quality score of 2 (ASCII ‘B’)
– marks unreliable quality scores at the ends of some reads
– does not indicate a specific error rate
– do not use these bases for downstream analysis
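The relationship between Q and e, and the special "B" score, can be checked with a short Python sketch. The offset of 64 is an assumption inferred from the slide's statement that ASCII "B" (code 66) means a score of 2; trimming the trailing run of "B" characters is one possible way to follow the "do not use" advice, not a prescribed procedure.

```python
import math

# Assumption: Phred+64 encoding, inferred from ASCII 'B' (66) meaning Q2.
PHRED_OFFSET = 64


def error_probability(q: float) -> float:
    """Invert Q = -10 log10(e) to get e."""
    return 10 ** (-q / 10)


def phred_from_error(e: float) -> float:
    """Q = -10 log10(e)."""
    return -10 * math.log10(e)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> list:
    """Convert a FASTQ quality string to a list of integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_b_tail(sequence: str, quality: str):
    """Drop the trailing run of 'B' (Q2) positions from a read."""
    keep = len(quality.rstrip("B"))
    return sequence[:keep], quality[:keep]


for q in (20, 30, 40):
    print(q, error_probability(q))

print(decode_quality("gfdcB"))
print(trim_b_tail("ACGTAC", "ggggBB"))
```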
Sequence quality check
• FASTX-Toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
• FastQC – http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
[Figure panels: Sequence composition; Quality score]
[Figure: Good run vs. Bad run; x-axis: Position in read, y-axis: Quality scores]
[Figure: FastQC, Before trimming vs. After trimming; x-axis: Position in read, y-axis: Quality scores]
NGS Applications
• GS FLX
  – Genomic and transcriptomic assemblies
  – Re-sequencing: SNP/InDel, polymorphisms
  – Metagenomics/Metatranscriptomics (sp. identification within a population)
  – Amplicon sequencing (accurate haplotype identification)
  – Exon capture
• Illumina
  – Genomic and transcriptomic assemblies
  – Re-sequencing: SNP/InDel, CNV
  – RNA-Seq: DGE
  – ChIP-Seq
  – miRNA analysis
  – Metagenomics/Metatranscriptomics
List of software packages for NGS analysis
de novo genome assembly
GS FLX reads: gsAssembler (Newbler)
  – Use sff files when possible
  – Can recognize linker/circularization adaptor in PE reads
Illumina reads:
• ABySS – Simpson et al., Genome Res 2009, 19:1117
• SOAP (Short Oligonucleotide Analysis Package) – denovo, aligner, snp, splice – Li et al., Bioinformatics 2009, 25:1966
• Velvet – Zerbino and Birney, Genome Res 2008, 18:821
Recommendations for sequencing of large genomes for de novo assembly
More coverage with Illumina and lower coverage with GS FLX (454):
• Illumina: at least 60X coverage with several libraries using paired-end sequencing, ranging from 500 bp to 10 kb
• GS FLX (454): shotgun (15-20X) and mate-pair sequencing (3-5X: 3 kb; 2-3X: 8 and 20 kb)
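To turn a coverage target into a read count, the usual back-of-the-envelope relation is coverage = number of reads × read length / genome size. The sketch below applies it in Python; the 1 Gb genome size and the 400 bp GS FLX read length are hypothetical values chosen for illustration, not figures from the recommendations above.

```python
import math


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 1_000_000_000  # hypothetical 1 Gb genome

illumina_reads = reads_needed(60, 100, GENOME_SIZE)  # 60X with 100 bp reads
gsflx_reads = reads_needed(15, 400, GENOME_SIZE)     # 15X with 400 bp reads (assumed length)

print(illumina_reads, gsflx_reads)
print(coverage(illumina_reads, 100, GENOME_SIZE))
```

Comparing such counts with the reads-per-run figures for each platform gives a first estimate of how many runs a project needs; real yields after quality filtering will be lower.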
Strategies for RNA-Seq
Haas and Zody, Nature Biotechnology, 2010, 28:421
Applications using RNA-Seq data
• Differential gene expression
• Structural annotation of a genome
• Alternative splicing
• Fusion transcripts
• de novo transcriptome assembly
• SNPs/Indels
• 454 or Illumina
de novo transcriptome assembly
• Assemble and then use contigs as reference transcriptome
  – Only option when genome and/or transcriptome not available
  – Results depend on the accuracy of the assembly
  – If sequencing depth is low, some genes are not assembled
  – More work involved for assembly & annotation
Illumina - de novo transcriptome assembly - software
• Trinity
  – Grabherr and Haas et al., Nature Biotechnology, 2011, 29:644-652
• ABySS and Trans-ABySS
  – Birol et al., Bioinformatics, 2009, 25:2872
  – Robertson et al., Nature Methods, 2010, 7:909
• Velvet and Oases
  – Zerbino and Birney, Genome Res., 2008, 18:821
Mapping of RNA-Seq reads
• Align to genome
  – Can detect novel exons or un-annotated genes
  – Aligners should be able to map reads across splice sites
  – Reads from non-genic regions influence expression values, SNP detection etc.
• Align to transcriptome
  – Information about splice junctions is not required
  – PE distance and junction reads: isoforms
Some popular aligners
• BWA – slow for long reads and reads with higher error rate; suboptimal alignment pairs; allows gapped alignment
• TopHat – uses Bowtie; maps reads to genome, builds a database of possible splice junctions, and maps the reads against these junctions to confirm
• Novoalign – most accurate, slow
• gsMapper (Newbler – Roche) for GS FLX reads
• Others: SpliceMap, MapSplice, SOAP, MAQ, CLC Bio
Number of reads/Coverage
• Number of genes in the species
• Number of genes expressed under the treatment/tissue
• Rare transcripts
Number of reads/Coverage
Trapnell et al., Nature Biotechnology, 2010, 28:311
SNP and sequence depth
Bentley et al., Nature 2008, 456:53
[Figure legend: Squares – All SNPs; Triangles – Heterozygous SNPs; Circles – Homozygous SNPs]
ChIP-Seq
Park (2009), Nature Reviews Genetics 10 : 669-680
ChIP-Seq : workflow
Park (2009), Nature Reviews Genetics 10 : 669-680
ChIP-Seq : software
– QuEST (Quantitative Enrichment of Sequence Tags)
  • Valouev et al., 2008. Nature Methods. 5(9): 829-834.
– MACS (Model-based Analysis of ChIP-Seq)
  • Zhang et al., 2008. Genome Biol. 9(9):R137
SAM – Sequence Alignment/Map
http://samtools.sourceforge.net/
• Unified format for storing read alignments to a reference genome
• Header
  @HD VN:1.3 SO:coordinate
  @SQ SN:ref LN:45
• Alignment section
  – tab-separated table with at least 11 columns
  – each line describes one alignment
SAM format: Alignment section
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

• QNAME: Query template NAME
• FLAG: alignment flags
• RNAME: Reference sequence NAME (e.g. chromosome name)
• POS: 1-based leftmost mapping POSition in reference
• MAPQ: MAPping Quality (as Phred score)
• CIGAR: Alignment description (gaps etc.) in CIGAR format
• RNEXT: Ref. name of the mate/next segment (for paired end)
• PNEXT: Position of the mate/next segment (for paired end)
• TLEN: observed Template (insert) LENgth (for paired end)
• SEQ: segment SEQuence (of the read)
• QUAL: quality string of the read, Phred-scaled
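The eleven mandatory columns map directly onto a small record type. Here is a minimal Python sketch of reading SAM text; the class and function names are illustrative, and the example lines are the slide's records re-typed with explicit tab separators.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple  # any optional columns after the first 11


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into its mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("an alignment line needs at least 11 tab-separated columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def read_sam(lines):
    """Separate header lines (starting with '@') from alignment records."""
    header, records = [], []
    for line in lines:
        if line.startswith("@"):
            header.append(line.rstrip("\n"))
        elif line.strip():
            records.append(parse_sam_line(line))
    return header, records


sam_example = [
    "@HD\tVN:1.3\tSO:coordinate",
    "@SQ\tSN:ref\tLN:45",
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGCCAT\t*",
]
sam_header, sam_records = read_sam(sam_example)
for rec in sam_records:
    print(rec.qname, rec.flag, rec.pos, rec.cigar, rec.tlen)
```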
FLAG field
Numeric  Binary    Description
1        00000001  template has multiple fragments in sequencing
2        00000010  each fragment properly mapped according to aligner
4        00000100  fragment is unmapped
8        00001000  mate is unmapped
16       00010000  sequence is reverse complemented
32       00100000  sequence of mate is reversed
64       01000000  is first fragment in template
128      10000000  is second fragment in template
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

83 = 64 + 16 + 2 + 1 = 01010011
• template has multiple fragments
• each fragment is properly aligned
• fragment is not unmapped
• mate is not unmapped
• sequence is reverse complemented
• sequence of mate is not reversed
• this is the first fragment in the template
• this is not the second fragment in the template
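The same decomposition can be done programmatically by testing each bit of the FLAG value. A small Python sketch follows; it covers only the eight bits listed in the FLAG field table of this seminar, and the names `FLAG_BITS` and `decode_flag` are illustrative.

```python
# Bit values and descriptions as listed in the FLAG field table.
FLAG_BITS = [
    (1, "template has multiple fragments in sequencing"),
    (2, "each fragment properly mapped according to aligner"),
    (4, "fragment is unmapped"),
    (8, "mate is unmapped"),
    (16, "sequence is reverse complemented"),
    (32, "sequence of mate is reversed"),
    (64, "is first fragment in template"),
    (128, "is second fragment in template"),
]


def decode_flag(flag: int) -> list:
    """Return the descriptions of all bits that are set in the flag."""
    return [description for bit, description in FLAG_BITS if flag & bit]


def set_bits(flag: int) -> list:
    """Return the numeric values of the bits that are set."""
    return [bit for bit, _ in FLAG_BITS if flag & bit]


for flag in (83, 163):
    print(flag, format(flag, "08b"), set_bits(flag))
    for description in decode_flag(flag):
        print("   ", description)
```

For the two example reads, 83 sets bits 1, 2, 16 and 64, while its mate's flag 163 sets bits 1, 2, 32 and 128, which is consistent with one read being reversed and the other reporting a reversed mate.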
CIGAR string
M  alignment match (can be sequence match or mismatch)
I  insertion to the reference
D  deletion to the reference
N  skipped region from the reference
S  soft clipping (clipped sequence is present in SEQ)
H  hard clipping (clipped sequence is not present in SEQ)
P  padding (silent deletion from padded reference)
=  sequence match
X  sequence mismatch
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

CIGAR (Compact Idiosyncratic Gapped Alignment Report) string gives alignment details.
Example: “8M2I4M1D3M” means
• the first 8 bases of the read map normally (not necessarily perfectly)
• 2 bases are inserted, i.e., missing in the reference
• after another 4 mapped bases, 1 base is deleted (skipped in the query)
• the last 3 bases match normally
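A CIGAR string can be split into (length, operation) pairs to work out how many read bases and how many reference bases an alignment covers. The following Python sketch makes the usual assumption that M, I, S, = and X consume read bases while M, D, N, = and X consume reference bases; the function names are illustrative.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")

QUERY_OPS = set("MIS=X")      # operations that consume bases of the read
REFERENCE_OPS = set("MDN=X")  # operations that consume bases of the reference


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(ops) -> int:
    return sum(n for n, op in ops if op in QUERY_OPS)


def reference_span(ops) -> int:
    return sum(n for n, op in ops if op in REFERENCE_OPS)


ops = parse_cigar("8M2I4M1D3M")
pos = 7  # 1-based leftmost position of the first example read
end = pos + reference_span(ops) - 1

print(ops)
print(query_length(ops), reference_span(ops), end)
```

For the first example read this gives 17 read bases, which matches the 17 bases of TTAGATAAAGGATACTG, and 16 reference bases, so the alignment covers reference positions 7 to 22.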
GFF – General Feature Format
Feature file should be tab delimited

SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"

1. seqname – sequence name on which the feature exists, e.g. “chr1”, “myChrom”, “contig123”.
2. source – the source of this feature. Normally used to indicate the program making the prediction, or whether it comes from public database annotation, is experimentally verified, etc.
3. feature – the feature type name, for example “exon”, “gene”, “SNP”.
4. start – the one-based starting position of the feature on seqname.
5. end – the one-based ending position of the feature on seqname.
6. score – a score assigned to the GFF feature.
7. strand – defines the strand. Use '+', '-' or ‘.’.
8. frame – the frame of the coding sequence. Use ‘0’, ‘1’, ‘2’, or ‘.’.
9. attribute
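The nine columns can be read into a small record type. Below is a minimal Python sketch, assuming tab-delimited input as stated above and treating "." in the score column as "no score"; the example lines are two of the slide's records re-typed with explicit tabs, and the attribute column is kept as an unparsed string.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int   # one-based, inclusive
    end: int     # one-based, inclusive
    score: Optional[float]
    strand: str
    frame: str
    attribute: str

    @property
    def length(self) -> int:
        # Both ends are inclusive, so add 1.
        return self.end - self.start + 1


def parse_gff_line(line: str) -> GffFeature:
    """Split one tab-delimited GFF line into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attribute = "\t".join(fields[8:])
    return GffFeature(seqname, source, feature, int(start), int(end),
                      None if score == "." else float(score),
                      strand, frame, attribute)


exon = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0")
hit = parse_gff_line(
    'seq1\tBLASTX\tsimilarity\t101\t235\t87.1\t+\t0\t'
    'Target "HBA_HUMAN" 11 55 ; E_value 0.0003'
)
print(exon.feature, exon.start, exon.end, exon.length)
print(hit.score, hit.attribute)
```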
BED – Browser Extensible Data
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

• To display annotation track
• 3 required fields – chrom, chromStart, chromEnd
• 9 optional fields
• No. of fields must be consistent for any single data set
http://genome.ucsc.edu/FAQ/FAQformat
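A BED line can be read into a dictionary keyed by field name. This Python sketch rests on two assumptions taken from the UCSC format description linked above rather than from the slide: the names of the nine optional fields, and the convention that chromStart is zero-based while chromEnd is exclusive (so a feature's length is chromEnd minus chromStart, unlike GFF).

```python
# Optional field names as given in the UCSC BED description (assumed here).
OPTIONAL_FIELDS = ["name", "score", "strand", "thickStart", "thickEnd",
                   "itemRgb", "blockCount", "blockSizes", "blockStarts"]


def parse_bed_line(line: str) -> dict:
    """Parse one BED data line (not a 'track' line) into a dict."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED needs chrom, chromStart and chromEnd")
    record = {"chrom": fields[0],
              "chromStart": int(fields[1]),
              "chromEnd": int(fields[2])}
    record.update(zip(OPTIONAL_FIELDS, fields[3:]))
    return record


def block_intervals(record: dict) -> list:
    """Absolute (start, end) of each block; blockStarts are relative to chromStart."""
    sizes = [int(x) for x in record["blockSizes"].split(",") if x]
    starts = [int(x) for x in record["blockStarts"].split(",") if x]
    origin = record["chromStart"]
    return [(origin + s, origin + s + n) for s, n in zip(starts, sizes)]


def to_one_based(record: dict) -> tuple:
    """Convert the zero-based, end-exclusive interval to one-based inclusive."""
    return record["chromStart"] + 1, record["chromEnd"]


clone_a = parse_bed_line("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(clone_a["name"], clone_a["strand"], block_intervals(clone_a))
print(to_one_based(clone_a))
```

For cloneA the two blocks come out as (1000, 1567) and (4512, 5000), so the second block ends exactly at chromEnd.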
Summary
• GS FLX – sff, fna, qual
• Illumina – FASTQ
• Quality control
• SAM – for alignment
• GTF/GFF – features of a sequence
• BED – for annotation track
• Several applications, software packages
• Plan data analysis before you start sequencing
Future Workshops
Topic                                              Date
Introduction to NGS Data Files and Applications    Feb 2012
de novo assembly                                   March 2012
Mapping/Alignment                                  April 2012
RNA-Seq Analysis                                   May 2012
MeDIP and ChIP-Seq                                 June 2012
SNP (Re-sequencing & RAD-Seq)                      July 2012
Please send an email to
[email protected] to add your name to the Bioinformatics Core mailing list
Thank you!