Statistical Methods for Next Generation Sequencing

Statistical Methods for Next Generation Sequencing http://www.biostat.jhsph.edu/~khansen/enar2012.html

Zhijin Wu, Brown University
Kasper Hansen and Rafael A. Irizarry, Johns Hopkins University

Outline

• Introduction to NGS
• SNP calling and genotyping
• RNA-sequencing
• Hands-on exercise


Introduction to Next Generation Sequencing
Rafael A. Irizarry
http://rafalab.org

Many slides courtesy of: Héctor Corrada Bravo and Ben Langmead

Remember this?

D. melanogaster, Science, 2000

H. sapiens, Nature, 2000 and Science, 2000

M. musculus, Nature, 2002

Back then: millions of clones (thousand bps) in 9 months for billions of dollars
Today: billions of short reads (35-100 bps) in a week for thousands of dollars
Claim: Assemble a genome in weeks for less than $100,000


Start with DNA (millions of copies)


Break it


Put in sequencer


Sequence first 35-400 bps: call them “reads”

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC


Platforms

The Genome Analyzer generates several billion bases of high-quality sequence per run at less than 1% of the cost of capillary-based methods. An expansive scale of research unimaginable with other technology platforms is now possible.

Illumina/Solexa

• Eight lanes
• ~160M short reads (~50-70 bp) per lane

FIGURE 1: ILLUMINA GENOME ANALYZER FLOW CELL
Several samples can be loaded onto the eight-lane flow cell for simultaneous analysis on the Illumina Genome Analyzer.

INTRODUCTION

Illumina sequencing technology leverages clonal array formation and proprietary reversible terminator technology for rapid and accurate large-scale sequencing. The innovative and flexible sequencing system enables a broad array of applications in genomics, transcriptomics, and epigenomics.

… nucleotides specially designed with a reversible termination property allow each cycle of the sequencing reaction to occur simultaneously for all clusters in the presence of all four nucleotides (A, C, T, G). In each cycle, the polymerase is able to select the correct base to incorporate, with the natural competition between all four alternatives leading to higher accuracy than methods where only one nucleotide is present in the reaction mix at a time. Sequences where a particular base is repeated (e.g., homopolymers) are addressed like any other sequence and resolved with high accuracy.

CLUSTER GENERATION

Sequencing templates are immobilized on a proprietary flow cell surface (Figure 1) designed to present the DNA in a manner that facilitates access to enzymes while ensuring high stability of surface-bound template and low non-specific binding of fluorescently labeled nucleotides. Solid-phase amplification creates up to … identical copies of each single template molecule in close proximity (diameter of one micron or less). Because this process does not involve photolithography, mechanical spotting, or positioning of beads into wells, densities of up to ten million single-molecule clusters per square …

ANALYSIS PIPELINE

The Illumina sequencing approach is built around a massive quantity of sequence reads in parallel. Deep sampling of more than ten-fold even coverage is used to generate a consensus and ensure high confidence in determination of genetic differences. Deep sampling allows the use of weighted majority voting and statistical analysis, similar to conventional …

DATA COLLECTION, PROCESSING, AND ANALYSIS

The Genome Analyzer data collection software enables users to align sequences to a reference in resequencing applications. Developed in collaboration with leading researchers, this software suite includes the full range of data collection, processing, …

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

[Figure: each read is stored as a name, a sequence, and quality scores; x 100s of millions of reads.]

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009
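The name / sequence / quality-score layout of each read can be made concrete with a small parser. This is a minimal sketch that assumes the reads are stored as four-line FASTQ records with Phred+33 quality encoding; the slides do not name the file format, and some older Illumina pipelines used a different offset. The function names and the example record are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) from four-line FASTQ records.

    Assumes Phred+33 encoding: quality = ord(character) - 33.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for k in range(0, len(rows) - 3, 4):
        header, seq, plus, qual = rows[k:k + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]


def error_probability(q):
    """Convert a Phred quality score to the implied base-call error probability."""
    return 10 ** (-q / 10)


# Hypothetical record: the sequence is the first read shown on the slides,
# the quality string is made up for illustration.
example = [
    "@read_1",
    "GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT",
    "+",
    "I" * 30 + "5" * 5,
]

if __name__ == "__main__":
    for name, seq, quals in parse_fastq(example):
        print(name, len(seq), quals[0], quals[-1])
```

With hundreds of millions of records, a real pipeline would stream the file rather than hold all lines in memory; the list here only keeps the sketch short.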

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Not just Assembly

• Resequencing
• SNP discovery and genotyping
• Variant discovery and quantification
• TF binding sites: ChIP-Seq
• Gene expression: RNA-Seq
• Measuring methylation


1000 Genomes Project Genotyping


Human Epigenome Project Methylation


What to do with all these sequences?

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC


Most apps: Start by matching to reference

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC

CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Variant detection

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate: “Pileup” or “Coverage plot”; “Depth of coverage” = 14

GTCGCAGTATCTGTCT
GTCGCAGTATCTGTNN
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTT
TATATCGCAGTATCTG
NATATCGCAGTATNTG
CCCTATATCGCAGTAT
ACACCCTATGTCGCA
ACACCCTATCTCGCA
ACACCCTATGTCGCA
GA-CACCCTATGTCGC
CCGGA-CACCCTATAT
CCGGA-CACCCTATAT
GCCGGA-CACCCTATG

Reference:
TTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATT

Statistics:
Call: HET A, G
p-value: 0.0023
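The aggregate-then-call step can be sketched for a single pileup column. This is a simplified illustration, not the method behind the call shown on the slide: it compares binomial likelihoods of the allele counts under three genotypes with an assumed per-base error rate of 1%, and it does not produce the p-value quoted above. The function name, the error rate, and the example column are assumptions.

```python
from collections import Counter
from math import comb


def call_genotype(column, ref_base, error=0.01):
    """Call a genotype from one pileup column (a string of observed bases).

    Only the reference allele and the most frequent other allele are used;
    N and gap characters are ignored. Returns (genotype, alt_base, likelihoods).
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    counts = Counter(bases)
    n_ref = counts.get(ref_base, 0)
    alt_counts = {b: k for b, k in counts.items() if b != ref_base}
    alt_base = max(alt_counts, key=alt_counts.get) if alt_counts else None
    n_alt = alt_counts[alt_base] if alt_base else 0
    n = n_ref + n_alt
    # Probability that a single read shows the alternate allele under each genotype.
    p_alt = {"HOM_REF": error, "HET": 0.5, "HOM_ALT": 1.0 - error}
    likelihoods = {
        g: comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        for g, p in p_alt.items()
    }
    best = max(likelihoods, key=likelihoods.get)
    return best, alt_base, likelihoods


if __name__ == "__main__":
    # Illustrative column: 8 reads show G (the reference base), 6 show A, 1 shows C.
    genotype, alt, _ = call_genotype("GGGGGGGGAAAAAAC", "G")
    print(genotype, alt)
```

Production callers also weight each base by its quality score and mapping quality, which this sketch leaves out.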

RNA-seq differential expression

Sample A

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate:

GTCGCAGTATCTGTCT
GTCGCAGTATCTGTCT
GTCGCAGTATCTGTCT
GTCGCAGTATCTGTCT
GTCGCAGTATCTGTCT
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTG
TATATCGCAGTATCTG
TATATCGCAGTATCTG
CCCTATATCGCAGTAT
AGCACCCTATGTCGCA
AGCACCCTATATCGCA
AGCACCCTATGTCGCA
GAGCACCCTATGTCGC
CCGGAGCACCCTATAT
CCGGAGCACCCTATAT
GCCGGAGCACCCTATG

Sample B

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate:

TGTCGCAGTATCTGTC
AGCACCCTATGTCGCA
GCCGGAGCACCCTATG

Gene 1:
ATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTAC

Statistics:
Gene 1 differentially expressed?: YES
p-value: 0.0012
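One simple way to turn the two aggregated read counts into a test is an exact binomial test conditional on the total count. This is a hedged sketch: the slides do not say which test produced the p-value shown, so the number computed here is not expected to match it. The counts (18 reads for Gene 1 in Sample A, 3 in Sample B) are taken from the pileups on the slide; the equal sequencing depths and the function names are assumptions.

```python
from math import comb


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than k.
    """
    pmf = [comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1.0 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def differential_expression(count_a, count_b, depth_a, depth_b):
    """Test whether a gene's reads split between two samples as the depths predict.

    Under the null, each of the count_a + count_b reads comes from sample A
    with probability depth_a / (depth_a + depth_b).
    """
    n = count_a + count_b
    p = depth_a / (depth_a + depth_b)
    return binomial_two_sided(count_a, n, p)


if __name__ == "__main__":
    # Assumes both samples were sequenced to the same depth.
    print(differential_expression(18, 3, depth_a=1.0, depth_b=1.0))
```

With biological replicates, count data usually show more variability than this model allows, so replicate-aware models are preferred in practice.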

ChIP-seq

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate:

GTCGCAGTATCTGTCT
GTCGCAGTATCTGTCT
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTG
TATATCGCAGTATCTG
TATATCGCAGTATCTG
CCCTATATCGCAGTAT
CCCTATATCGCAGTAT
CCCTATATCGCAGTAT
CCCTATATCGCAGTAT
CCCTATATCGCAGTAT
CCCTATATCGCAGTAT
CCCTATATCGCAGTAT
AGCACCCTATGTCGCA
AGCACCCTATATCGCA
AGCACCCTATGTCGCA
GAGCACCCTATGTCGC
CCGGAGCACCCTATAT
CCGGAGCACCCTATAT
GATAGCATTGCGAGAC
GCCGGAGCACCCTATG
GATTCCTGCCTCATCC
TATGCACGCGATAGCA

Reference:
TTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATT

Statistics:
Binding occurs here
p-value: 0.0023

Matching Revisited

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC

CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Matching 10,000,000 32-bp reads

• BLAST takes more than 6 months
• BLAT takes 2 months
• MAQ takes a day and a half
• Bowtie takes 17 minutes
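One ingredient of fast short-read matching is to index the reference once, so that each read is looked up rather than compared against every position. Bowtie builds a Burrows-Wheeler (FM) index; the sketch below swaps in a much simpler hash table of k-mers, so it illustrates index-based lookup with seed-and-verify, not Bowtie's algorithm. The function names, k = 8, and the mismatch limit are illustrative assumptions; the reference and reads are the ones shown on the slides.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) hits for an ungapped match of the read.

    Non-overlapping k-mer seeds from the read propose candidate start positions;
    each candidate is then verified by counting mismatches over the whole read.
    """
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


REFERENCE = "CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC"
K = 8
INDEX = build_index(REFERENCE, K)

if __name__ == "__main__":
    print(map_read("CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC", REFERENCE, INDEX, K))
    print(map_read("GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT", REFERENCE, INDEX, K))
```

A 35-base read yields four non-overlapping 8-mer seeds, so any ungapped placement with at most two mismatches leaves at least one seed intact and is found; gapped alignments are outside the scope of this sketch.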


Matching

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC

CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Mapping

Take a read:
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC

And a reference sequence: >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA 
TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA GGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAG

How do we determine the read’s point of origin with respect to the reference?
Answer: sequence similarity

Hypothesis 1:
Read       CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
           ||||||  ||||   ||||  |||||||||     |||| |||||
Reference  CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA

Hypothesis 2:
Read       CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
           |||||||||||| ||||||||||||||||||||| ||||| |
Reference  CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC

Which hypothesis is better?
Say hypothesis 2 is correct. Why are there still mismatches and gaps?
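One way to compare the two hypotheses is to count matches, mismatches, and gap positions in each alignment and combine them into a score. This is a minimal sketch: the scoring values (+1 match, -1 mismatch, -2 per gap position) are illustrative assumptions, an N in the read is simply counted as a mismatch, and the function names are hypothetical. The aligned rows are copied from the two hypotheses on the slide.

```python
def alignment_stats(read_row, ref_row):
    """Count (matches, mismatches, gap positions) in a pairwise alignment.

    Both rows must have the same length; '-' marks a gap in either row.
    """
    if len(read_row) != len(ref_row):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(read_row, ref_row):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def alignment_score(stats, match=1, mismatch=-1, gap=-2):
    """Linear score: every gap position costs the same (no affine gap penalty)."""
    matches, mismatches, gaps = stats
    return match * matches + mismatch * mismatches + gap * gaps


HYPOTHESIS_1 = (
    "CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC",
    "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA",
)
HYPOTHESIS_2 = (
    "CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC",
    "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC",
)

if __name__ == "__main__":
    for name, rows in (("Hypothesis 1", HYPOTHESIS_1), ("Hypothesis 2", HYPOTHESIS_2)):
        stats = alignment_stats(*rows)
        print(name, stats, alignment_score(stats))
```

Under these assumed penalties, hypothesis 1 has 32 matches, 8 mismatches, and 7 gap positions (score 10), while hypothesis 2 has 39 matches, 2 mismatches, and 1 gap position (score 35), so hypothesis 2 is preferred.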

More on variants and base-calling

SNPs

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC

CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

SNPs

 TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA
   TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG
     GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT
      TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA
       GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC
          CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT
                 GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT
                   GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT
                   GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT
                      GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG
                          CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
                           TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
                           TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
                           TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC
                             GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT
                              CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC
                               GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG
                                  TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT
                                   ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
                                   ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT
CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

All Reads

[Figure: nucleotide composition (proportions of A and T, 0.0 to 1.0) by sequencing cycle (0 to 30).]
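A nucleotide-composition-by-cycle summary like the one plotted here can be computed directly from the reads: for each cycle (read position), take the fraction of reads showing each base. This is a minimal sketch; the function name and the toy reads are hypothetical, and N calls are left out of the denominator as an assumption.

```python
from collections import Counter


def composition_by_cycle(reads):
    """Return one {base: proportion} dict per sequencing cycle.

    Cycle j is position j in each read. Only cycles present in every read are
    summarised, and bases other than A, C, G, T (such as N) are not counted.
    """
    n_cycles = min(len(r) for r in reads)
    profile = []
    for j in range(n_cycles):
        counts = Counter(r[j] for r in reads)
        total = sum(counts[b] for b in "ACGT")
        profile.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return profile


if __name__ == "__main__":
    toy_reads = ["ACGT", "AAGT", "TCGA", "ACGN"]
    for cycle, props in enumerate(composition_by_cycle(toy_reads), start=1):
        print(cycle, props)
```

If the sequenced genome has stable base composition, these proportions should be roughly flat across cycles, so a drift with cycle points to a technical effect rather than biology.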

1000 Genomes Data

[Figure: number of calls by sequencing cycle (1 to 35) for sample NA19238, shown for all data and for filtered calls (Filtered: snpq>=20, nreads=20, nreads…). Column labels: SNPs in dbSNP; Novel SNPs. Panel titles: Non−UCSC Loci, n=15005; UCSC Loci, n=48864; Hapmap Loci, n=20069; Non−Hapmap Loci, n=43800; 1kgenomes Loci, n=38306; Non−1kgenomes Loci, n=25563; Non−UCSC Loci, n=4986; UCSC Loci, n=42691; Hapmap Loci, n=19067.]

[Figure: estimated error probabilities (0.0 to 0.4) by sequencing cycle (0 to 60), by substitution type: A>C, T>C, A>G, T>A, C>T, C>A, G>A, T>G, G>C, G>T, A>T, C>G.]

Remember This?

[Figure: nucleotide composition (proportions of A and T, 0.0 to 1.0) by sequencing cycle (0 to 30).]

Bias Explained

[Figure: scatter plot of A−T (0 to 10000) against 0.5(A+T) (0 to 8000), with separate point symbols for cycle < 20 and cycle ≥ 20.]

Base Calling

1) Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008)
2) Erlich et al. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods (2008)
3) Kao et al. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res (2009)
4) Corrada Bravo and Irizarry. Model-Based Quality Assessment and Base-Calling for Second-Generation Sequencing Data. Biometrics (2009)
5) Cokus et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature (2009)

Intensity Model

[Figure: intensity (axis ticks from 12.5 to 14.0) by cycle.]

Intensity Model

log intensity, read i, cycle j, channel c:

$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$

indicators of nucleotide identity, read i, pos. j:

$$\Delta_{ijc} = \begin{cases} 1 & \text{if } c \text{ is the nucleotide in read } i \text{ position } j \\ 0 & \text{otherwise} \end{cases}$$

read-specific linear models: $x_j^T \alpha_i$ and $x_j^T \beta_i$

measurement error:

$$\epsilon^{\alpha}_{ijc} \sim N(0, \sigma^2_{\alpha i}), \qquad \epsilon^{\beta}_{ijc} \sim N(0, \sigma^2_{\beta i})$$
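The intensity model can be explored by simulating from it. This is a minimal sketch under stated assumptions, none of which come from the slides: the cycle covariate is x_j = (1, j), the means mu are constant across channels and cycles, and bases are called by taking the brightest channel in each cycle. That naive call is a stand-in for illustration, not the Srfim method. All function names and numeric values are hypothetical.

```python
import random

CHANNELS = "ACGT"


def simulate_read(bases, alpha, beta, mu_alpha=13.5, mu_beta=12.0,
                  sigma_alpha=0.1, sigma_beta=0.1, rng=None):
    """Simulate log intensities u[j][c] for one read.

    alpha and beta are (intercept, slope) pairs for this read, applied to the
    assumed covariate x_j = (1, j). The channel matching the true base follows
    the 'alpha' component of the model; the other channels follow 'beta'.
    """
    rng = rng or random.Random(0)
    u = []
    for j, base in enumerate(bases):
        x_j = (1.0, float(j))
        row = {}
        for c in CHANNELS:
            delta = 1 if c == base else 0  # indicator Delta_ijc
            signal = (mu_alpha + x_j[0] * alpha[0] + x_j[1] * alpha[1]
                      + rng.gauss(0.0, sigma_alpha))
            background = (mu_beta + x_j[0] * beta[0] + x_j[1] * beta[1]
                          + rng.gauss(0.0, sigma_beta))
            row[c] = delta * signal + (1 - delta) * background
        u.append(row)
    return u


def call_bases(u):
    """Naive base call: the channel with the highest log intensity in each cycle."""
    return "".join(max(row, key=row.get) for row in u)


if __name__ == "__main__":
    truth = "GTTGAGGCTT"
    u = simulate_read(truth, alpha=(0.0, -0.02), beta=(0.0, 0.01))
    print(truth, call_bases(u))
```

With a negative slope in alpha and a positive slope in beta, signal and background intensities move toward each other in later cycles, which is one way such a model can express falling call quality along a read.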

Read & Cycle Effects

[Figure: h(intensity) (about 11 to 14) by cycle (0 to 30), in several panels.]

Base Identity Probability Profiles

[Figure: probability (0 to 1) by position (1 to 35).]

Before And After

[Figure: nucleotide composition (proportions of A and T, 0.0 to 1.0) by sequencing cycle (0 to 30), in two panels: Solexa (Default) and Srfim (Statistical Approach).]

The End