Introduction to Next Generation Sequencing. Rafael A. Irizarry http://rafalab.org
Many slides courtesy of: Héctor Corrada Bravo and Ben Langmead ...
Statistical Methods for Next Generation Sequencing http://www.biostat.jhsph.edu/~khansen/enar2012.html
Zhijin Wu Brown University Kasper Hansen, Rafael A Irizarry Johns Hopkins University
Outline
• Introduction to NGS
• SNP calling and genotyping
• RNA-sequencing
• Hands-on exercise
Remember this?
D. melanogaster, Science, 2000
H. sapiens, Nature, 2000 and Science, 2000
M. musculus, Nature, 2002
Back then: millions of clones (thousand bps) in 9 months for billions of dollars
Today: billions of short reads (35-100 bps) in a week for thousands of dollars
Claim: Assemble a genome in weeks for less than $100,000
Start with DNA (millions of copies)
Break it
Put in sequencer
Sequence first 35-400 bps: call them “reads”
GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
Platforms
The Genome Analyzer generates several billion bases of high-quality sequence per run at less than 1% of the cost of capillary-based methods. An expansive scale of research unimaginable with other technology platforms is now possible.
Illumina/Solexa
• Eight lanes
• ~160M short reads (~50-70 bp) per lane

INTRODUCTION
Illumina sequencing technology leverages clonal array formation and proprietary reversible terminator technology for rapid and accurate large-scale sequencing. The innovative and flexible sequencing system enables a broad array of applications in genomics, transcriptomics, and epigenomics.

FIGURE 1: ILLUMINA GENOME ANALYZER FLOW CELL
Several samples can be loaded onto the eight-lane flow cell for simultaneous analysis on the Illumina Genome Analyzer.

CLUSTER GENERATION
Sequencing templates are immobilized on a proprietary flow cell surface (Figure 1) designed to present the DNA in a manner that facilitates access to enzymes while ensuring high stability of surface-bound template and low non-specific binding of fluorescently labeled nucleotides. Solid-phase amplification creates up to [...] identical copies of each single template molecule in close proximity (diameter of one micron or less). Because this process does not involve photolithography, mechanical spotting, or positioning of beads into wells, densities of up to ten million single-molecule clusters per square ...

... nucleotides, specially designed with a reversible termination property, allow each cycle of the sequencing reaction to occur simultaneously for all clusters in the presence of all four nucleotides (A, C, T, G). In each cycle, the polymerase is able to select the correct base to incorporate, with the natural competition between all four alternatives leading to higher accuracy than methods where only one nucleotide is present in the reaction mix at a time. Sequences where a particular base is repeated (e.g., homopolymers) are addressed like any other sequence and resolved with high accuracy.

ANALYSIS PIPELINE
The Illumina sequencing approach is built around a massive quantity of sequence reads in parallel. Deep sampling of more than ten-fold even coverage is used to generate a consensus and ensure high confidence in determination of genetic differences. Deep sampling allows the use of weighted majority voting and statistical analysis, similar to conventional ...

DATA COLLECTION, PROCESSING, AND ANALYSIS
The Genome Analyzer data collection software enables users to align sequences to a reference in resequencing applications. Developed in collaboration with leading researchers, this software suite includes the full range of data collection, processing, ...
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
Each read: name, sequence, quality scores (x 100s of millions of reads)
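Each read comes with a name, a sequence, and per-base quality scores. A minimal sketch of reading such records follows, assuming they are stored in the common four-line FASTQ layout with Phred+33 quality encoding. The slide does not name a file format, so both are assumptions, and `parse_fastq`, `error_probability`, and the sample record are hypothetical.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().strip()
        # Phred+33: quality Q is the character code minus 33 (assumed encoding)
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]

def error_probability(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Hypothetical single-record input
example = io.StringIO("@read1\nGTTGAGGC\n+\nIIIIII5#\n")
records = list(parse_fastq(example))
name, seq, quals = records[0]
print(name, seq, quals)
```

With hundreds of millions of reads, a generator like this keeps only one record in memory at a time.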
Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
Not just Assembly
• Resequencing
• SNP discovery and genotyping
• Variant discovery and quantification
• TF binding sites: ChIP-Seq
• Gene expression: RNA-Seq
• Measuring methylation
1000 Genomes Project Genotyping
Human Epigenome Project Methylation
What to do with all these sequences?
GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
Most apps: Start by matching to reference
GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
Reference: CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC
Variant detection
Steps: Align → Aggregate → Statistics
Align: each read is matched to the reference, allowing mismatches and gaps, for example:
GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT
Aggregate: the aligned reads are stacked over the reference into a “pileup” or “coverage plot”; in the example, “depth of coverage” = 14
Statistics: Call: HET A, G; p-value: 0.0023
[Figure: pileup of aligned reads over the reference]
Reference: TTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATT
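The aggregate and statistics steps of variant detection can be sketched as follows. This is a minimal illustration, not the caller behind the slide's p-value: it assumes ungapped alignments given as (start, sequence) pairs and tests the non-reference count against an assumed 1% sequencing error rate. The function names, toy reads, and error rate are all hypothetical.

```python
from collections import Counter
from math import comb

def pileup_column(aligned_reads, pos):
    """Count the bases that aligned reads place at reference position pos.

    aligned_reads is a list of (start, sequence) pairs, start being the
    0-based reference coordinate of the first base (ungapped alignments only).
    """
    counts = Counter()
    for start, seq in aligned_reads:
        offset = pos - start
        if 0 <= offset < len(seq) and seq[offset] != "N":
            counts[seq[offset]] += 1
    return counts

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy example: position 3 of the reference is T.
reference = "ACGTACGT"
reads = [(0, "ACGTAC"), (1, "CGAACG"), (2, "GAACGT"),
         (0, "ACGAAC"), (3, "TACGT"), (2, "GTACG")]
counts = pileup_column(reads, 3)
depth = sum(counts.values())
alt = depth - counts[reference[3]]
# Null hypothesis: every non-reference base is a sequencing error (assumed rate 1%).
p_value = binomial_tail(alt, depth, 0.01)
print(dict(counts), depth, p_value)
```

A roughly even split between two bases at adequate depth, as in this toy column, is the pattern a heterozygous call rests on.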
RNA-seq differential expression
Steps, for Sample A and for Sample B: Align → Aggregate; then Statistics comparing the two samples
[Figure: reads from each sample aligned to Gene 1; in the example, 18 reads from Sample A and 3 reads from Sample B align to the gene]
Gene 1 differentially expressed?: YES
p-value: 0.0012
Gene 1: ATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTAC
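Counting the aligned reads shown for Gene 1 gives 18 for Sample A and 3 for Sample B. One simple way to turn such counts into a p-value is an exact binomial test, sketched below under the assumption that both samples were sequenced to the same depth, so that each read is equally likely to come from either sample under the null. This is an illustration only; it is not claimed to be the test behind the slide's p-value, and `binomial_two_sided` is a hypothetical helper.

```python
from math import comb

def binomial_two_sided(k, n, p=0.5):
    """Two-sided exact binomial p-value: sum of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

count_a, count_b = 18, 3   # reads aligned to Gene 1 in samples A and B
p_value = binomial_two_sided(count_a, count_a + count_b)
print(f"p-value = {p_value:.4g}")
```

In practice the two libraries rarely have equal depth, so the null proportion `p` would be set from the total read counts of each sample.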
ChIP-seq
Steps: Align → Aggregate → Statistics
[Figure: reads aligned to the reference pile up around one region, marked “Binding occurs here”; p-value: 0.0023]
Reference: TTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATT
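The aggregation step for ChIP-seq amounts to computing coverage along the reference and looking for where reads pile up. A minimal sketch, assuming fixed-length ungapped reads given by hypothetical 0-based start positions; a real peak caller would also compare against a background model to obtain a p-value.

```python
def coverage(read_starts, read_length, ref_length):
    """Per-position depth from 0-based start coordinates of fixed-length reads."""
    depth = [0] * ref_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, ref_length)):
            depth[pos] += 1
    return depth

def peak(depth):
    """Return (position, depth) of the first position with maximal coverage."""
    best = max(range(len(depth)), key=depth.__getitem__)
    return best, depth[best]

starts = [2, 10, 11, 11, 12, 12, 13, 30]   # hypothetical alignment starts
depth = coverage(starts, read_length=5, ref_length=40)
print(peak(depth))
```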
Matching Revisited
GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
Reference: CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC
Matching 10,000,000 32-bp reads
• BLAST takes more than 6 months
• BLAT takes 2 months
• MAQ takes a day and a half
• Bowtie takes 17 minutes
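Much of the speed difference between these tools comes from how the reference is indexed before any read is looked up. Bowtie uses a Burrows-Wheeler (FM) index; the sketch below substitutes a much simpler hash table of k-mers purely to illustrate the index-then-verify idea. The function names and the choices `k=8` and `max_mismatches=2` are hypothetical; the read and reference are taken from the slides.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def match(read, reference, index, k, max_mismatches=2):
    """Return start positions where read aligns without gaps within max_mismatches."""
    hits = set()
    # Seed with each non-overlapping k-mer of the read, then verify the full read.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
                hits.add(start)
    return sorted(hits)

reference = "CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC"
index = build_index(reference, k=8)
read = "GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT"
print(match(read, reference, index, k=8))
```

The index is built once and reused for every read, so the per-read cost is a handful of dictionary lookups plus verification rather than a scan of the whole reference.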
Matching
GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
Reference: CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC
Mapping
Take a read:
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
And a reference sequence: >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA 
TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA GGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAG
How do we determine the read’s point of origin with respect to the reference?
Answer: sequence similarity

Hypothesis 1:
Read       CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
           ||||||  ||||   ||||  |||||||||     |||| |||||
Reference  CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA

Hypothesis 2:
Read       CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
           |||||||||||| ||||||||||||||||||||| ||||| |
Reference  CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC
Which hypothesis is better? Say hypothesis 2 is correct. Why are there still mismatches and gaps?
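One way to compare the two hypotheses is to count matches, mismatches, and gap columns in each alignment and combine them into a score. The sketch below does this for the two alignments as printed on the slide; the scoring weights (+1 match, -1 mismatch, -2 gap) are illustrative assumptions, and `alignment_counts` and `score` are hypothetical helpers.

```python
def alignment_counts(read_row, ref_row):
    """Count (matches, mismatches, gap columns) of two equal-length gapped rows."""
    if len(read_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(read_row, ref_row):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1   # an uncalled base 'N' counts as a mismatch here
    return matches, mismatches, gaps

def score(counts, match=1, mismatch=-1, gap=-2):
    """Linear alignment score; the weights are illustrative assumptions."""
    m, x, g = counts
    return m * match + x * mismatch + g * gap

h1 = alignment_counts("CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC",
                      "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA")
h2 = alignment_counts("CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC",
                      "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC")
print("Hypothesis 1:", h1, score(h1))
print("Hypothesis 2:", h2, score(h2))
```

Hypothesis 2 needs far fewer edits. The differences that remain even under the better hypothesis may reflect sequencing errors, uncalled bases (N), or true variation between the sample and the reference.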
More on variants and base-calling
SNPs GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC
SNPs TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC
All Reads
[Figure: nucleotide composition (proportions of A and T, y-axis 0.0 to 1.0) by sequencing cycle (x-axis 0 to 30)]
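A nucleotide composition plot of this kind can be computed directly from the reads by tabulating, for each sequencing cycle (read position), the proportion of each base. A minimal sketch, using four of the example reads from the earlier slides; `composition_by_cycle` is a hypothetical helper.

```python
from collections import Counter

def composition_by_cycle(reads):
    """Return, for each cycle (read position), the proportion of each nucleotide."""
    n_cycles = max(len(r) for r in reads)
    profile = []
    for cycle in range(n_cycles):
        counts = Counter(r[cycle] for r in reads if cycle < len(r))
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in "ACGT"})
    return profile

reads = [
    "GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT",
    "GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT",
    "ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT",
    "TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC",
]
profile = composition_by_cycle(reads)
print(profile[0])
```

If reads sample the genome at random, the proportions should be roughly constant across cycles, so a trend with cycle points to a technical bias.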
1000 Genomes Data
[Figure: number of SNP calls (y-axis) by sequencing cycle (x-axis, 1 to 35) for sample NA19238, with panels for SNPs in dbSNP and novel SNPs. Panel titles, all data: UCSC Loci, n=48864; Non−UCSC Loci, n=15005; Hapmap Loci, n=20069; Non−Hapmap Loci, n=43800; 1kgenomes Loci, n=38306; Non−1kgenomes Loci, n=25563. Panel titles, filtered (snpq>=20, nreads=20, ...): UCSC Loci, n=42691; Non−UCSC Loci, n=4986; Hapmap Loci, n=19067.]
[Figure: estimated error probabilities (y-axis, 0.0 to 0.4) by sequencing cycle (x-axis, 0 to 60) for each substitution type: A>C, T>C, A>G, T>A, C>T, C>A, G>A, T>G, G>C, G>T, A>T, C>G]
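Counts of calls and substitution types by sequencing cycle can be tallied from aligned reads as sketched below. This is a simplified illustration that assumes ungapped (read, reference) pairs and labels each substitution as reference>read; the slides do not state the direction of the labels, so that convention, the helper names, and the toy data are assumptions. Mismatches counted this way mix sequencing errors with true variants.

```python
from collections import Counter, defaultdict

def substitutions_by_cycle(alignments):
    """Tally substitution types per cycle from ungapped (read, reference) pairs.

    Labels are written reference>read, e.g. 'T>A'; that direction is an assumption.
    """
    tallies = defaultdict(Counter)
    for read, ref in alignments:
        for cycle, (r, g) in enumerate(zip(read, ref), start=1):
            if r != g and r != "N":
                tallies[cycle][f"{g}>{r}"] += 1
    return tallies

def error_rate(tallies, cycle, n_reads):
    """Fraction of reads with a substitution at the given cycle."""
    return sum(tallies[cycle].values()) / n_reads

# Hypothetical toy alignments: (read, matching reference segment)
pairs = [("ACGTT", "ACGTA"), ("ACGAT", "ACGTA"),
         ("ACGTA", "ACGTA"), ("ACCTC", "ACGTA")]
tallies = substitutions_by_cycle(pairs)
print(dict(tallies[5]), error_rate(tallies, 5, len(pairs)))
```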
Remember This?
[Figure: nucleotide composition (proportions of A and T, y-axis 0.0 to 1.0) by sequencing cycle (x-axis 0 to 30)]
Bias Explained
[Figure: scatter plot of A−T (y-axis, 0 to 10000) against 0.5(A+T) (x-axis, 0 to 8000), with points marked by cycle < 20 versus cycle ≥ 20]
Base Calling
1) Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008)
2) Erlich et al. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods (2008)
3) Kao et al. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res (2009)
4) Corrada Bravo and Irizarry. Model-Based Quality Assessment and Base-Calling for Second-Generation Sequencing Data. Biometrics (2009)
5) Cokus et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature (2009)
Intensity Model
[Figure: intensity (y-axis ticks 12.5 to 14.0) by cycle]
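Intensity Model
log intensity, read i, cycle j, channel c:
$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$
indicators of nucleotide identity, read i, position j:
$$\Delta_{ijc} = \begin{cases} 1 & \text{if } c \text{ is the nucleotide in read } i \text{ position } j \\ 0 & \text{otherwise} \end{cases}$$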
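Intensity Model
log intensity, read i, cycle j, channel c:
$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$
read-specific linear models: $x_j^T \alpha_i$ and $x_j^T \beta_i$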
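Intensity Model
log intensity, read i, cycle j, channel c:
$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$
measurement error:
$$\epsilon^{\alpha}_{ijc} \sim N(0, \sigma^2_{\alpha i}), \qquad \epsilon^{\beta}_{ijc} \sim N(0, \sigma^2_{\beta i})$$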
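To make the intensity model concrete, the sketch below simulates log intensities for one read under the two-component model: the channel of the true nucleotide draws from the signal component (mean, read-specific linear term, and noise), and the other three channels draw from the background component. Several simplifications are assumptions made here for illustration: the cycle covariate is taken to be x_j = (1, j), the means are held constant across cycles, and all numeric values are hypothetical. The naive caller at the end is not the model-based method; it only picks the brightest channel.

```python
import random

CHANNELS = "ACGT"

def simulate_read(bases, mu_alpha, mu_beta, alpha, beta, sd_alpha, sd_beta, rng):
    """Simulate log intensities u[j][c] for one read under the two-component model.

    x_j = (1, j) is an assumed cycle covariate; alpha and beta are the
    read-specific coefficient pairs for the signal and background components.
    """
    u = []
    for j, base in enumerate(bases):
        x = (1.0, float(j))
        row = []
        for c in CHANNELS:
            if c == base:   # Delta_ijc = 1: channel of the true nucleotide
                mean = mu_alpha[c] + x[0] * alpha[0] + x[1] * alpha[1]
                row.append(mean + rng.gauss(0.0, sd_alpha))
            else:           # Delta_ijc = 0: background channel
                mean = mu_beta[c] + x[0] * beta[0] + x[1] * beta[1]
                row.append(mean + rng.gauss(0.0, sd_beta))
        u.append(row)
    return u

def naive_call(u):
    """Call the channel with the largest log intensity at each cycle."""
    return "".join(CHANNELS[max(range(4), key=row.__getitem__)] for row in u)

rng = random.Random(0)
truth = "GTTGAGGCTT"
mu_a = {c: 13.5 for c in CHANNELS}   # hypothetical signal means (cycle-constant)
mu_b = {c: 12.0 for c in CHANNELS}   # hypothetical background means
u = simulate_read(truth, mu_a, mu_b, alpha=(0.0, -0.02), beta=(0.0, 0.01),
                  sd_alpha=0.05, sd_beta=0.05, rng=rng)
print(naive_call(u))
```

The negative signal slope and positive background slope mimic signal decay over cycles; as the two components approach each other, calling the brightest channel becomes less reliable, which motivates a probabilistic treatment.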
Read & Cycle Effects
[Figure: h(intensity) (y-axis, 11 to 14) by cycle (x-axis, 0 to 30), shown in multiple panels]
Base Identity Probability Profiles
[Figure: probability (y-axis, 0 to 1) by position (x-axis, 1 to 35)]
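A probability profile gives, for each position, a probability for each of the four nucleotides. A minimal sketch of turning such a profile into base calls with a quality score follows; reporting quality on the Phred scale, Q = -10 log10(1 - p), the cap at 60, and the numeric profile are assumptions for illustration, and `call_base` is a hypothetical helper.

```python
import math

def call_base(probabilities):
    """Return (base, Phred quality) from a dict of per-nucleotide probabilities."""
    base = max(probabilities, key=probabilities.get)
    p_error = 1.0 - probabilities[base]
    # Cap the quality so a probability of exactly 1 does not give infinity.
    quality = 60 if p_error <= 1e-6 else round(-10 * math.log10(p_error))
    return base, quality

# Hypothetical profile for three positions
profile = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.05, "C": 0.05, "G": 0.10, "T": 0.80},
    {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},   # ambiguous position
]
print([call_base(p) for p in profile])
```

Keeping the full profile, rather than only the called base, lets downstream steps such as SNP calling weigh ambiguous positions appropriately.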
Before And After
[Figure: nucleotide composition (proportions of A and T, y-axis 0.0 to 1.0) by sequencing cycle (x-axis 0 to 30) in two panels: Solexa (Default) and Srfim (Statistical Approach)]
The End