HMM for sequence alignment: profile HMM
Pair HMM: an HMM for pairwise sequence alignment that incorporates affine gap scores.
"Hidden" states:
• Match (M)
• Insertion in x (X)
• Insertion in y (Y)
Observation symbols:
• Match (M): {(a, b) | a, b ∈ Σ}
• Insertion in x (X): {(a, −) | a ∈ Σ}
• Insertion in y (Y): {(−, a) | a ∈ Σ}
Pair HMMs
[State diagram: Begin, M, X, Y, End, with transition labels δ (opening an insertion), ε (extending it), 1−ε (returning to M), 1−2δ, 1−τ−2δ and τ (to End).]
Alignment: a path (a hidden state sequence)
x: A T - G T T A T
y: A T C G T - A C
   M M Y M M X M M
Multiple sequence alignment (Globin family)
Profile model (PSSM)
• A natural probabilistic model for a conserved region would be to specify independent probabilities e_i(a) of observing nucleotide (amino acid) a in position i
• The probability of a new sequence x according to this model is
  P(x | M) = ∏_{i=1}^{L} e_i(x_i)
Profile / PSSM
• DNA/protein segments of the same length L
• Often represented as a positional frequency matrix
LTMTRGDIGNYLGLTVETISRLLGRFQKSGML LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL VRMSREEIGNYLGLTLETVSRLFSRFGREGLI LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL LPMSRQDIADYLGLTIETVSRTFTKLERHGAI
Searching profiles: inference
• Given a sequence S = x_1 x_2 … x_L of length L, compute the likelihood ratio of S being generated by this profile versus by the background model:
  R(S | P) = ∏_{i=1}^{L} e_i(x_i) / q_{x_i}
• Searching for motifs within a longer sequence: sliding-window approach (a small sketch follows)
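Below is a minimal Python sketch of this sliding-window log-odds scan; the toy profile and background values are hypothetical, and no particular motif-search library is assumed.

```python
import math

def pssm_log_odds(window, emission, background):
    """Log-odds score of one window against a profile of length L,
    i.e. sum_i log(e_i(x_i) / q_{x_i})."""
    return sum(math.log(emission[i][a] / background[a])
               for i, a in enumerate(window))

def scan_sequence(seq, emission, background):
    """Slide the profile along seq and report the score at every offset."""
    L = len(emission)
    return [(start, pssm_log_odds(seq[start:start + L], emission, background))
            for start in range(len(seq) - L + 1)]

# Toy 3-column DNA profile (hypothetical numbers) and uniform background
emission = [{'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
            {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
            {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1}]
background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(scan_sequence("TTAGCAA", emission, background))
```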
Match states for profile HMMs
• Match states
  – Emission probabilities e_{M_i}(a)
[Architecture with match states only: beg → M1 → M2 → … → Mj → … → ML → end]
Components of profile HMMs
• Insert states
  – Emission probabilities e_{I_i}(a); can be the background distribution q_a
  – Transition probabilities: M_i to I_i, I_i to itself, I_i to M_{i+1}
  – Log-score for a gap of length k (not including the log-score from emission):
    log a_{M_j I_j} + log a_{I_j M_{j+1}} + (k − 1) log a_{I_j I_j}
[Architecture with insert states: beg, M1 … ML, I0 … IL, end]
Components of profile HMMs
• Delete states
  – No emission probabilities
  – Cost of a deletion: transitions M_i to D_{i+1}, D_i to D_{i+1}, D_i to M_{i+1}
  – Each D_i to D_{i+1} transition might be different for different i
[Architecture with delete states: beg, M1 … ML, D1 … DL, end]
Full structure of profile HMMs
[Diagram: beg, match states M1 … ML, insert states I0 … IL, delete states D1 … DL, end]
This is the structure implemented in HMMER, which differs slightly from the structure described in the textbook: no transition is allowed from Dj to Ij or from Ij to Dj+1. As a result, the recursive equations of the Viterbi algorithm also differ slightly from those in the book.
Deriving HMMs from multiple alignments
• Key idea behind profile HMMs
  – A model representing the consensus for the alignment of sequences from the same family
  – Not the sequence of any particular member

  HBA_HUMAN   ...VGA--HAGEY...
  HBB_HUMAN   ...V----NVDEV...
  MYG_PHYCA   ...VEA--DVAGH...
  GLB3_CHITP  ...VKG------D...
  GLB5_PETMA  ...VYS--TYETS...
  LGB2_LUPLU  ...FNA--NIPKH...
  GLB1_GLYDI  ...IAGADNGAGV...
                 ***  *****
Deriving HMMs from multiple alignments
• Basic profile HMM parameterization
  – Aim: assign higher probability to sequences from the family
• Parameters
  – The transition and emission probabilities: trivial to estimate if many independent aligned sequences are given,
    a_kl = A_kl / Σ_{l'} A_{kl'},   e_k(a) = E_k(a) / Σ_{a'} E_k(a')
    where A_kl and E_k(a) are the observed transition and emission counts (a small sketch follows)
  – The length of the model: heuristics or a systematic approach (e.g., using the MAP algorithm)
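A minimal sketch of turning observed counts into probabilities; the optional pseudocount argument is an addition of this sketch (the formulas above use raw counts).

```python
def normalize_counts(counts, pseudocount=0.0):
    """e_k(a) = E_k(a) / sum_a' E_k(a')  (likewise a_kl = A_kl / sum_l' A_kl').
    An optional pseudocount guards against zero probabilities for unseen symbols."""
    total = sum(counts[key] + pseudocount for key in counts)
    return {key: (counts[key] + pseudocount) / total for key in counts}

# Emission counts for one match state, e.g. a column where A was observed 4 times:
emission_counts = {'A': 4, 'C': 0, 'G': 0, 'T': 0}
print(normalize_counts(emission_counts))        # maximum-likelihood estimate
print(normalize_counts(emission_counts, 1.0))   # with Laplace pseudocounts
```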
Deriving HMMs from multiple alignments — an example
(a) Multiple alignment of five DNA sequences (bat, rat, cat, gnat, goat). Columns marked x are match columns; columns marked '.' are treated as inserts. The three match columns contain A A A A, G G G G G and C C G C C; the residues appearing in the insert columns are A A, G A A and A A.
(b) Profile-HMM architecture: beg, match states M1–M3, insert states I0–I3, delete states D1–D3, end.
(c) Observed emission/transition counts:

      Emission from M     Emission from I     Transition counts
       A  C  G  T          A  C  G  T         M-M M-D M-I I-M I-I D-M D-D
  0    -  -  -  -          0  0  0  0          4   1   0   0   0   -   -
  1    4  0  0  0          0  0  0  0          4   0   0   0   0   1   0
  2    0  0  5  0          6  0  1  0          2   0   3   2   4   0   0
  3    0  4  1  0          0  0  0  0          4   0   0   0   0   0   0
A simple rule that works well: columns in which more than half of the characters are gaps should be modeled by inserts (a small sketch follows).
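A minimal sketch of that rule; the toy alignment is hypothetical.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns to model as match states: columns in which
    at most half of the characters are gaps."""
    n_seq = len(alignment)
    n_col = len(alignment[0])
    return [j for j in range(n_col)
            if sum(seq[j] == '-' for seq in alignment) / n_seq <= max_gap_fraction]

alignment = ["AG--C",
             "AGA-C",
             "A---C",
             "AG--C"]
print(match_columns(alignment))   # [0, 1, 4]
```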
Sequence conservation: the entropy of the emission probability distributions.
High entropy indicates non-conserved positions (note: the accompanying plot shows negative entropy).
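A small sketch of the per-column entropy computation (toy distributions shown):

```python
import math

def column_entropy(emission_probs):
    """Shannon entropy (bits) of one match state's emission distribution.
    Low entropy = strongly conserved position; high entropy = non-conserved."""
    return -sum(p * math.log2(p) for p in emission_probs.values() if p > 0)

conserved     = {'A': 0.97, 'C': 0.01, 'G': 0.01, 'T': 0.01}
non_conserved = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(column_entropy(conserved), column_entropy(non_conserved))  # ~0.24 vs 2.0 bits
```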
Matching a sequence to a profile HMM (global alignment)
• Viterbi algorithm: profile HMM θ (with L match states) and a query sequence x_1 x_2 … x_N

Initialization:
  V^M_0(0) = 1;  V^M_j(0) = 0 for j > 0;  V^M_0(i) = 0 for i > 0;  V^I_j(0) = 0;  V^D_0(i) = 0

Recurrence, for i = 1, …, N and j = 1, …, L:
  V^M_j(i) = e_{M_j}(x_i) · max{ V^M_{j−1}(i−1) · a_{M_{j−1} M_j},  V^I_{j−1}(i−1) · a_{I_{j−1} M_j},  V^D_{j−1}(i−1) · a_{D_{j−1} M_j} }
  V^I_j(i) = e_{I_j}(x_i) · max{ V^M_j(i−1) · a_{M_j I_j},  V^I_j(i−1) · a_{I_j I_j} }
  V^D_j(i) = max{ V^M_{j−1}(i) · a_{M_{j−1} D_j},  V^D_{j−1}(i) · a_{D_{j−1} D_j} }

Termination:
  V = max[ V^M_L(N), V^I_L(N), V^D_L(N) ]

Note: this is slightly different from the textbook; there is no transition from D_j to I_j or from I_j to D_{j+1}.
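A minimal Python sketch of this fill step (probability space, not log space). The state names ('M0' for begin, 'M1'…'ML', 'I0'…'IL', 'D1'…'DL'), the dictionary-based parameterization and the function name are assumptions of this sketch, not any particular library's API; pointer tables are stored for the traceback described next.

```python
def profile_viterbi(x, L, eM, eI, a):
    """Viterbi fill for a profile HMM with the structure above (no D->I or I->D).
    x: query string; L: number of match states; eM[j][c] (j = 1..L) and
    eI[j][c] (j = 0..L): emission probabilities; a[(s, t)]: transition
    probability between named states.  Returns (score, best end state, V, ptr)."""
    N = len(x)
    V = {s: [[0.0] * (N + 1) for _ in range(L + 1)] for s in "MID"}
    ptr = {s: [[None] * (N + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 1.0                                   # V^M_0(0) = 1
    t = lambda s, u: a.get((s, u), 0.0)                  # absent transitions have prob 0
    for i in range(N + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:                          # match state M_j emits x_i
                cand = {"M": V["M"][j-1][i-1] * t(f"M{j-1}", f"M{j}"),
                        "I": V["I"][j-1][i-1] * t(f"I{j-1}", f"M{j}"),
                        "D": V["D"][j-1][i-1] * t(f"D{j-1}", f"M{j}")}
                ptr["M"][j][i] = max(cand, key=cand.get)
                V["M"][j][i] = eM[j][x[i-1]] * cand[ptr["M"][j][i]]
            if i > 0:                                    # insert state I_j emits x_i
                cand = {"M": V["M"][j][i-1] * t(f"M{j}", f"I{j}"),
                        "I": V["I"][j][i-1] * t(f"I{j}", f"I{j}")}
                ptr["I"][j][i] = max(cand, key=cand.get)
                V["I"][j][i] = eI[j][x[i-1]] * cand[ptr["I"][j][i]]
            if j > 0:                                    # delete state D_j emits nothing
                cand = {"M": V["M"][j-1][i] * t(f"M{j-1}", f"D{j}"),
                        "D": V["D"][j-1][i] * t(f"D{j-1}", f"D{j}")}
                ptr["D"][j][i] = max(cand, key=cand.get)
                V["D"][j][i] = cand[ptr["D"][j][i]]
    best = max("MID", key=lambda s: V[s][L][N])          # termination
    return V[best][L][N], best, V, ptr
```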
Viterbi algorithm: traceback pointers

  ptr^M_j(i) = M  if V^M_j(i) = e_{M_j}(x_i) · V^M_{j−1}(i−1) · a_{M_{j−1} M_j}
               I  if V^M_j(i) = e_{M_j}(x_i) · V^I_{j−1}(i−1) · a_{I_{j−1} M_j}
               D  if V^M_j(i) = e_{M_j}(x_i) · V^D_{j−1}(i−1) · a_{D_{j−1} M_j}

  ptr^I_j(i) = M  if V^I_j(i) = e_{I_j}(x_i) · V^M_j(i−1) · a_{M_j I_j}
               I  if V^I_j(i) = e_{I_j}(x_i) · V^I_j(i−1) · a_{I_j I_j}

  ptr^D_j(i) = M  if V^D_j(i) = V^M_{j−1}(i) · a_{M_{j−1} D_j}
               D  if V^D_j(i) = V^D_{j−1}(i) · a_{D_{j−1} D_j}
  (delete states emit nothing, so no emission factor appears here)

  ptr_ter = M if V = V^M_L(N);  I if V = V^I_L(N);  D if V = V^D_L(N)

  PrintStateSeq(ptrM, ptrI, ptrD, ptr_ter, N, L)
PrintStateSeq(ptrM, ptrI, ptrD, state, i, j)
  if i = 0 or j = 0: return
  if state = "M":
    if ptrM_j(i) = "M": PrintStateSeq(ptrM, ptrI, ptrD, "M", i-1, j-1)
    else if ptrM_j(i) = "I": PrintStateSeq(ptrM, ptrI, ptrD, "I", i-1, j-1)
    else if ptrM_j(i) = "D": PrintStateSeq(ptrM, ptrI, ptrD, "D", i-1, j-1)
    print M_j
  else if state = "I":
    if ptrI_j(i) = "M": PrintStateSeq(ptrM, ptrI, ptrD, "M", i-1, j)
    else if ptrI_j(i) = "I": PrintStateSeq(ptrM, ptrI, ptrD, "I", i-1, j)
    print I_j
  else (state = "D"):
    if ptrD_j(i) = "M": PrintStateSeq(ptrM, ptrI, ptrD, "M", i, j-1)
    else if ptrD_j(i) = "D": PrintStateSeq(ptrM, ptrI, ptrD, "D", i, j-1)
    print D_j
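A Python counterpart of PrintStateSeq, assuming the pointer tables returned by the fill sketch above; like the pseudocode, it stops at the i = 0 or j = 0 boundary.

```python
def print_state_seq(ptr, state, i, j, out=None):
    """Follow the stored Viterbi pointers back from (state, i, j) and return the
    state labels in left-to-right order (Python version of PrintStateSeq above)."""
    out = [] if out is None else out
    if i == 0 or j == 0:
        return out
    prev = ptr[state][j][i]
    if state == "M":
        print_state_seq(ptr, prev, i - 1, j - 1, out)
    elif state == "I":
        print_state_seq(ptr, prev, i - 1, j, out)
    else:  # "D"
        print_state_seq(ptr, prev, i, j - 1, out)
    out.append(f"{state}{j}")
    return out

# Usage with the fill sketch above (model parameters are hypothetical):
# score, last_state, V, ptr = profile_viterbi(x, L, eM, eI, a)
# print(print_state_seq(ptr, last_state, len(x), L))
```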
Matching a sequence to a profile HMM (global alignment)
• Forward algorithm: profile HMM θ (with L match states) and a query sequence x_1 x_2 … x_N

Initialization:
  F^M_0(0) = 1;  F^M_j(0) = 0 for j > 0;  F^M_0(i) = 0 for i > 0;  F^I_j(0) = 0;  F^D_0(i) = 0

Recurrence, for i = 1, …, N and j = 1, …, L:
  F^M_j(i) = e_{M_j}(x_i) · [ F^M_{j−1}(i−1) · a_{M_{j−1} M_j} + F^I_{j−1}(i−1) · a_{I_{j−1} M_j} + F^D_{j−1}(i−1) · a_{D_{j−1} M_j} ]
  F^I_j(i) = e_{I_j}(x_i) · [ F^M_j(i−1) · a_{M_j I_j} + F^I_j(i−1) · a_{I_j I_j} ]
  F^D_j(i) = F^M_{j−1}(i) · a_{M_{j−1} D_j} + F^D_{j−1}(i) · a_{D_{j−1} D_j}

Termination:
  F = F^M_L(N) + F^I_L(N) + F^D_L(N)

Note: this is slightly different from the textbook; there is no transition from D_j to I_j or from I_j to D_{j+1}.
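A Python sketch of the forward fill, mirroring the Viterbi sketch above (same assumed state naming and parameter layout); sums replace the max.

```python
def profile_forward(x, L, eM, eI, a):
    """Forward algorithm for the same profile HMM structure (no D->I or I->D).
    Same conventions as the Viterbi sketch: eM[1..L], eI[0..L], a[(s, t)].
    Returns the total probability P(x | model)."""
    N = len(x)
    F = {s: [[0.0] * (N + 1) for _ in range(L + 1)] for s in "MID"}
    F["M"][0][0] = 1.0
    t = lambda s, u: a.get((s, u), 0.0)
    for i in range(N + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:                      # M_j emits x_i
                F["M"][j][i] = eM[j][x[i-1]] * (
                    F["M"][j-1][i-1] * t(f"M{j-1}", f"M{j}") +
                    F["I"][j-1][i-1] * t(f"I{j-1}", f"M{j}") +
                    F["D"][j-1][i-1] * t(f"D{j-1}", f"M{j}"))
            if i > 0:                                # I_j emits x_i
                F["I"][j][i] = eI[j][x[i-1]] * (
                    F["M"][j][i-1] * t(f"M{j}", f"I{j}") +
                    F["I"][j][i-1] * t(f"I{j}", f"I{j}"))
            if j > 0:                                # D_j emits nothing
                F["D"][j][i] = (F["M"][j-1][i] * t(f"M{j-1}", f"D{j}") +
                                F["D"][j-1][i] * t(f"D{j-1}", f"D{j}"))
    return F["M"][L][N] + F["I"][L][N] + F["D"][L][N]    # termination
```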
Example: a profile HMM with L = 3 match states; query: AGG (N = 3)

[Architecture: beg, M1–M3, I0–I3, D1–D3, end]

Parameters:
       Emission from M       Emission from I       Transition probabilities
        A   C   G   T         A   C   G   T        M-M  M-D  M-I  I-M  I-I  D-M  D-D
  0     -   -   -   -        .2  .3  .3  .2         .8   .2   0   .5   .5    -    -
  1     1   0   0   0        .2  .3  .3  .2         1    0    0   .5   .5    1    0
  2    .5   0  .5   0        .9   0  .1   0         .4   0   .6   .3   .7   .5   .5
  3     0  .8  .2   0        .2  .3  .3  .2         1    0    0   .5   .5    1    0
Example: Viterbi algorithm — initialization
Query: X = AGG (N = 3); emission and transition probabilities as in the table above.

The dynamic-programming table has rows i = 0…3 (query positions) and, for each model position j = 0…3, columns V^M, V^I, V^D.
Initialization: V^M_0(0) = 1; the remaining initialized entries in row i = 0 and column j = 0 are 0.
Example: Viterbi algorithm — first delete entry
  V^D_1(0) = max( V^M_0(0) · a_{M_0 D_1}, V^D_0(0) · a_{D_0 D_1} ) = max(1 × 0.2, 0) = 0.2
This fills the (i = 0, j = 1, V^D) cell of the table.
Example: Viterbi algorithm — first match entry
  V^M_1(1) = e_{M_1}(x_1) · max( V^M_0(0) · a_{M_0 M_1}, V^I_0(0) · a_{I_0 M_1}, V^D_0(0) · a_{D_0 M_1} )
           = 1 × max(1 × 0.8, 0 × 0.5, 0 × 1) = 0.8
This fills the (i = 1, j = 1, V^M) cell of the table.
Example: Viterbi algorithm — last match entry
  V^M_3(3) = ?
  V^M_3(3) = e_{M_3}(x_3) · max( V^M_2(2) · a_{M_2 M_3}, V^I_2(2) · a_{I_2 M_3}, V^D_2(2) · a_{D_2 M_3} )
           = 0.2 × max(0.4 × 0.4, 0.006 × 0.3, 0 × 0.5) = 0.032
Example: Viterbi algorithm — completed table and traceback

         j = 0           j = 1           j = 2               j = 3
  i    VM  VI  VD      VM  VI  VD      VM    VI    VD      VM     VI     VD
  0     1   0   0       0   0  .2       0     0     0       0      0      0
  1     0   0   0      .8   0   0      .1     0     0       0      0      0
  2     0   0   0       0   0   0      .4   .006    0     .008     0      0
  3     0   0   0       0   0   0       0   .024    0     .032   .0014    0

Traceback: V = max( V^M_3(3), V^I_3(3), V^D_3(3) ) = 0.032, reached through M3 ← M2 ← M1 ← begin, i.e. A, G, G are aligned to the match states M1, M2, M3.
Searching with profile HMMs
• Main usage of profile HMMs
  – Detecting potential member sequences of a family
  – Core algorithm: matching a sequence to a profile HMM (Viterbi algorithm or forward algorithm)
• Compare the resulting probability with a random (background) model R to obtain a log-odds score, where
  P(x | R) = ∏_i q_{x_i}
  and q_{x_i} is the background frequency of observing x_i.
Matching a sequence to a profile HMM (global alignment) — Viterbi algorithm in log-odds form

Initialization:
  V^M_0(0) = 0;  V^M_j(0) = −∞ for j > 0;  V^M_0(i) = −∞ for i > 0;  V^I_j(0) = −∞;  V^D_0(i) = −∞

Recurrence, for i = 1, …, N and j = 1, …, L:
  V^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V^M_{j−1}(i−1) + log a_{M_{j−1} M_j},  V^I_{j−1}(i−1) + log a_{I_{j−1} M_j},  V^D_{j−1}(i−1) + log a_{D_{j−1} M_j} }
  V^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V^M_j(i−1) + log a_{M_j I_j},  V^I_j(i−1) + log a_{I_j I_j} }
  V^D_j(i) = max{ V^M_{j−1}(i) + log a_{M_{j−1} D_j},  V^D_{j−1}(i) + log a_{D_{j−1} D_j} }

Termination:
  V = max[ V^M_L(N), V^I_L(N), V^D_L(N) ]

Note: this is slightly different from the textbook; no transition from D_j to I_j or from I_j to D_{j+1}.
Matching a sequence to a profile HMM (global alignment) — Forward algorithm in log-odds form

Recurrence, for i = 1, …, N and j = 1, …, L:
  F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + log[ a_{M_{j−1} M_j} exp(F^M_{j−1}(i−1)) + a_{I_{j−1} M_j} exp(F^I_{j−1}(i−1)) + a_{D_{j−1} M_j} exp(F^D_{j−1}(i−1)) ]
  F^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + log[ a_{M_j I_j} exp(F^M_j(i−1)) + a_{I_j I_j} exp(F^I_j(i−1)) ]
  F^D_j(i) = log[ a_{M_{j−1} D_j} exp(F^M_{j−1}(i)) + a_{D_{j−1} D_j} exp(F^D_{j−1}(i)) ]

Initialization:
  F^M_0(0) = 0;  F^M_j(0) = −∞ for j > 0;  F^M_0(i) = −∞ for i > 0;  F^I_j(0) = −∞;  F^D_0(i) = −∞

Termination:
  F = log[ exp(F^M_L(N)) + exp(F^I_L(N)) + exp(F^D_L(N)) ]
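When implementing these log-space recursions, the log[Σ a·exp(F)] terms are usually computed with a numerically stable log-sum-exp; a small sketch:

```python
import math

def log_sum_exp(log_terms):
    """Numerically stable log(sum_i exp(t_i)); avoids underflow when the
    log-space forward terms are very negative."""
    finite = [t for t in log_terms if t != float("-inf")]
    if not finite:
        return float("-inf")
    m = max(finite)
    return m + math.log(sum(math.exp(t - m) for t in finite))

# e.g. combining three incoming log-probability terms:
print(log_sum_exp([-1050.0, -1052.0, float("-inf")]))   # ~ -1049.87
```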
Significance of HMM alignment
• The log-odds score V of the optimal local Viterbi alignment between a random sequence and a profile HMM follows a Gumbel (type I EVD) distribution:
  P(V ≥ t) = 1 − exp[ −e^{−λ(t − µ)} ]
• With ~200 simulated Viterbi scores, the location parameter µ can be estimated accurately.
• λ ≈ log z, where z is the base of the log-odds score (e.g., z = 2 when the sequence length approaches infinity).
• The length effect can be corrected by λ ≈ log 2 + 1.44 / (hN)
  – where N is the length and h is the average relative entropy per match state in the pHMM;
  – for typical Pfam models, N ≈ 140 and h ≈ 1.8, so λ ≈ log 2 + 0.0057, a small correction.
Eddy, PLoS Comput. Biol., 4:1, 2008
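A small sketch of evaluating the Gumbel tail probability; the numbers in the example call are hypothetical.

```python
import math

def gumbel_pvalue(score, mu, lam):
    """P(V >= t) = 1 - exp(-exp(-lambda * (t - mu))) for a Gumbel-distributed
    optimal local alignment score."""
    return 1.0 - math.exp(-math.exp(-lam * (score - mu)))

# Hypothetical calibration: mu fitted from simulated scores, lambda ~ log 2
print(gumbel_pvalue(score=25.0, mu=10.0, lam=math.log(2)))
```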
Variants for non-global alignments
• Local alignment (Smith–Waterman type)
  – Emission probabilities in the flanking states use background values q_x
  – Looping probability close to 1, e.g. (1 − η) for some small η
[Diagram: the profile HMM core (D_j, I_j, M_j) between beg and end, with flanking random-model states Q on both sides, each with self-loop probability 1 − η.]
Variants for non-global alignments
• Overlap (also called glocal or fit) alignment
  – The loop probability of the first and last insert states is much higher than that of the other insert states
  – Used when the match is expected to be either present as a whole or absent (e.g., a protein domain within a protein)
  – A transition to the first delete state allows a missing first residue
[Diagram: the profile HMM core (D_j, I_j, M_j) between beg and end, with flanking Q states.]
Variants for non-global alignments
• Repeat alignments
  – Transition from the right flanking state back to the random model
  – Can find multiple matching segments in the query string
[Diagram: the profile HMM core (D_j, I_j, M_j) between beg and end, with a flanking Q state and a transition from the right flank back, allowing repeated matches.]
Optimal model construction: different ways of marking columns
(a) Multiple alignment of five DNA sequences (bat, rat, cat, gnat, goat). Columns marked x are match columns; columns marked '.' are treated as inserts. Here the three match columns contain A A A A, G G G and C C C C; the residues appearing in the insert columns are A A, G A A and A A.
(b) Profile-HMM architecture: beg, match states M1–M3, insert states I0–I3, delete states D1–D3, end (textbook structure, including I→D and D→I transitions).
(c) Observed emission/transition counts:

      Emission from M     Emission from I     Transition counts
       A  C  G  T          A  C  G  T         M-M M-D M-I I-M I-D I-I D-M D-D D-I
  0    -  -  -  -          0  0  0  0          4   1   0   0   0   0   -   -   -
  1    4  0  0  0          0  0  0  0          3   1   0   0   0   0   0   1   0
  2    0  0  3  0          6  0  1  0          2   0   1   2   1   4   0   0   2
  3    0  4  0  0          0  0  0  0          4   0   0   0   0   0   1   0   0
Optimal model construction
• MAP (match–insert assignment)
  – Recursive calculation of a number S_j
    • S_j: log probability of the optimal model for the alignment up to and including column j, assuming column j is marked
    • S_j is calculated from S_i and the summed log probability between columns i and j
    • T_ij: summed log probability of all the state transitions between marked columns i and j:
      T_ij = Σ_{x,y ∈ {M,D,I}} c_xy log a_xy
  – The counts c_xy are obtained from the partial state paths implied by marking columns i and j.
Optimal model construction
• Algorithm: MAP model construction
  – Initialization: S_0 = 0, M_{L+1} = 0
  – Recurrence: for j = 1, …, L+1:
      S_j = max_{0 ≤ i < j} ( S_i + T_ij + M_j + I_{i+1, j−1} + λ )
      σ_j = argmax_{0 ≤ i < j} ( S_i + T_ij + M_j + I_{i+1, j−1} + λ )
  – Traceback: from j = σ_{L+1}, while j > 0:
      • Mark column j as a match column
      • j ← σ_j
(A DP skeleton of this recurrence follows.)
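A DP skeleton of this recurrence, under stated assumptions: T, M, I and lam are taken as precomputed log-probability terms indexed so that T[i][j], M[j] and I[i + 1][j - 1] are valid for 0 ≤ i < j ≤ L + 1 (with I covering the empty run when i + 1 > j − 1); all names here are hypothetical.

```python
def map_model_construction(L, T, M, I, lam):
    """MAP match-column selection: S_j = max_{0<=i<j}(S_i + T_ij + M_j + I_{i+1,j-1} + lam),
    with sigma_j recording the maximizing i; column L+1 is the virtual end column."""
    S = [float("-inf")] * (L + 2)
    sigma = [0] * (L + 2)
    S[0] = 0.0
    for j in range(1, L + 2):
        best_i = max(range(j),
                     key=lambda i: S[i] + T[i][j] + M[j] + I[i + 1][j - 1] + lam)
        sigma[j] = best_i
        S[j] = S[best_i] + T[best_i][j] + M[j] + I[best_i + 1][j - 1] + lam
    marked, j = [], sigma[L + 1]
    while j > 0:                       # traceback: mark match columns following sigma
        marked.append(j)
        j = sigma[j]
    return sorted(marked)
```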
Weighting training sequences
• Are the input sequences random, independent samples?
• The assumption that all examples are independent samples may be incorrect
• Solution
  – Weight sequences based on similarity: highly similar pairs of training sequences receive lower weights
Multiple sequence alignment (MSA) by training profile HMMs
• Sequence profiles can be represented as probabilistic models such as profile HMMs
  – ML methods for building (training) a profile HMM are based on a multiple sequence alignment
  – Profile HMMs can also be trained from initially unaligned sequences using a Baum–Welch-like EM algorithm
    • Simultaneously aligning multiple sequences and building the profile HMM from the multiple alignment
Multiple alignment with a known profile HMM • A step backward: to derive a multiple alignment from a known profile HMM model – e.g., to align many sequences from the same family based on the HMM model built from the (seed) multiple alignment of a small representative set of sequences in the family.
• It just requires calculating a Viterbi alignment for each individual sequence
  – Match each sequence to the profile HMM with the Viterbi algorithm
  – Residues aligned to the same match state in the profile HMM should be placed in the same column (a small sketch follows)
  – Given a preliminary alignment, the HMM can be used to align additional sequences
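A minimal sketch of the column-building step just described: residues assigned to the same match state are stacked into one column; insert residues are written in lower case and padded with '.', i.e. left unaligned. The state-label format and the function name are assumptions of this sketch.

```python
def msa_from_paths(seqs, paths, L):
    """Build an MSA from per-sequence Viterbi state paths such as ["M1", "I1", "M2"]."""
    per_seq = []
    for seq, path in zip(seqs, paths):
        match = {j: "-" for j in range(1, L + 1)}     # one column per match state
        ins = {j: "" for j in range(L + 1)}           # insert run following column j
        pos = 0
        for state in path:
            kind, j = state[0], int(state[1:])
            if kind == "M":
                match[j] = seq[pos]; pos += 1
            elif kind == "I":
                ins[j] += seq[pos].lower(); pos += 1
            # delete states ("D") consume no residue and leave the '-' gap
        per_seq.append((match, ins))
    width = {j: max(len(ins[j]) for _, ins in per_seq) for j in range(L + 1)}
    rows = []
    for match, ins in per_seq:
        row = ins[0].ljust(width[0], ".")
        for j in range(1, L + 1):
            row += match[j] + ins[j].ljust(width[j], ".")
        rows.append(row)
    return rows

print(msa_from_paths(["AGC", "AC"], [["M1", "I1", "M2"], ["M1", "M2"]], L=2))
# ['AgC', 'A.C']   (toy 2-match-state model)
```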
Multiple alignment with a known profile HMM
• Comparison with other MSA programs
  – A profile HMM does not align the insert-state residues, whereas other MSA algorithms align the whole sequences.
Training profile HMMs from unaligned sequences
• Simultaneously aligning multiple sequences and building the profile HMM from the multiple alignment
  – Initialization: choose the length of the profile HMM and initialize the parameters of the model
  – Training: estimate the model using the Baum–Welch algorithm, iterating until the model (and the MSA) converges
  – MSA: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section
(A high-level sketch of this loop follows.)
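A high-level sketch of that loop; every helper is passed in as a function and is hypothetical (not a specific library's API).

```python
def train_profile_hmm(seqs, init_model, baum_welch_update, log_likelihood,
                      viterbi_align, build_msa, max_iter=100, tol=1e-6):
    """Train a profile HMM from unaligned sequences, then build the MSA.
    init_model(length) -> randomized initial model; baum_welch_update re-estimates
    parameters; log_likelihood scores the data; viterbi_align returns a state path;
    build_msa stacks residues that share a match state into columns."""
    avg_len = sum(len(s) for s in seqs) // len(seqs)
    model = init_model(avg_len)                      # common rule: average sequence length
    prev_ll = float("-inf")
    for _ in range(max_iter):                        # EM (Baum-Welch) iterations
        model = baum_welch_update(model, seqs)
        ll = log_likelihood(model, seqs)
        if ll - prev_ll < tol:                       # stop when the model converges
            break
        prev_ll = ll
    paths = [viterbi_align(model, s) for s in seqs]  # align each sequence to final model
    return model, build_msa(seqs, paths)
```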
Profile HMM training from unaligned sequences
• Initial model
  – The only decision that must be made in choosing an initial structure for Baum–Welch estimation is the length of the model, M
  – A commonly used rule is to set M to the average length of the training sequences
  – Some randomness in the initial parameters is needed to avoid local maxima
Multiple alignment by profile HMM training
• Avoiding local maxima
  – The Baum–Welch algorithm is only guaranteed to find a LOCAL maximum
    • Models are usually quite long, and there are many opportunities to get stuck in a wrong solution
– Solution • Start many times from different initial models. • Use some form of stochastic search algorithm, e.g. simulated annealing.
Multiple alignment by profile HMM training — model surgery
• We can modify the model after (or during) training by manually checking the alignment it produces
  – Some of the match states are redundant
  – Some insert states absorb too many sequences
• Model surgery
  – If a match state is used by fewer than half of the training sequences, delete its module (match–insert–delete states)
  – If more than half of the training sequences use a certain insert state, expand it into n new modules, where n is the average length of the insertions
  – Ad hoc, but works well
HMMER3
Figure 4. The HMMER3 acceleration pipeline. Eddy SR (2011) Accelerated Profile HMM Searches. PLoS Comput Biol 7(10): e1002195. doi:10.1371/journal.pcbi.1002195 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002195
Accelerated Profile HMM Searches
Figure 1. The MSV profile. Eddy SR (2011) Accelerated Profile HMM Searches. PLoS Comput Biol 7(10): e1002195. doi:10.1371/journal.pcbi.1002195 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002195