HMM for sequence alignment: profile HMM
Pair HMM: an HMM for pairwise sequence alignment that incorporates affine gap scores.
"Hidden" states:
• Match (M)
• Insertion in x (X)
• Insertion in y (Y)
Observation symbols:
• Match (M): {(a, b) | a, b ∈ Σ}
• Insertion in x (X): {(a, −) | a ∈ Σ}
• Insertion in y (Y): {(−, a) | a ∈ Σ}
Pair HMMs
[State diagram: Begin, M, X, Y, End, with transition labels δ (opening an insertion), ε (extending it), 1−ε (returning to M), 1−2δ, 1−τ−2δ and τ (to End).]
Alignment: a path (a hidden state sequence)
x: A T - G T T A T
y: A T C G T - A C
   M M Y M M X M M
Multiple sequence alignment (Globin family)
Profile model (PSSM)
• A natural probabilistic model for a conserved region would be to specify independent probabilities e_i(a) of observing nucleotide (amino acid) a in position i
• The probability of a new sequence x according to this model is
  P(x | M) = ∏_{i=1}^{L} e_i(x_i)
Profile / PSSM
• DNA/protein segments of the same length L
• Often represented as a positional frequency matrix
LTMTRGDIGNYLGLTVETISRLLGRFQKSGML LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL VRMSREEIGNYLGLTLETVSRLFSRFGREGLI LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL LPMSRQDIADYLGLTIETVSRTFTKLERHGAI
Searching profiles: inference
• Given a sequence S = x_1 x_2 … x_L of length L, compute the likelihood ratio of S being generated by this profile versus by the background model:
  R(S | P) = ∏_{i=1}^{L} e_i(x_i) / q_{x_i}
• Searching for motifs within a longer sequence: sliding-window approach (a small sketch follows)
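Below is a minimal Python sketch of this sliding-window log-odds scan; the toy profile and background values are hypothetical, and no particular motif-search library is assumed.

```python
import math

def pssm_log_odds(window, emission, background):
    """Log-odds score of one window against a profile of length L,
    i.e. sum_i log(e_i(x_i) / q_{x_i})."""
    return sum(math.log(emission[i][a] / background[a])
               for i, a in enumerate(window))

def scan_sequence(seq, emission, background):
    """Slide the profile along seq and report the score at every offset."""
    L = len(emission)
    return [(start, pssm_log_odds(seq[start:start + L], emission, background))
            for start in range(len(seq) - L + 1)]

# Toy 3-column DNA profile (hypothetical numbers) and uniform background
emission = [{'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
            {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
            {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1}]
background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(scan_sequence("TTAGCAA", emission, background))
```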
Match states for profile HMMs
• Match states
  – Emission probabilities e_{M_i}(a)
[Architecture with match states only: beg → M1 → M2 → … → Mj → … → ML → end]
Components of profile HMMs
• Insert states
  – Emission probabilities e_{I_i}(a); can be the background distribution q_a
  – Transition probabilities: M_i to I_i, I_i to itself, I_i to M_{i+1}
  – Log-score for a gap of length k (not including the log-score from emission):
    log a_{M_j I_j} + log a_{I_j M_{j+1}} + (k − 1) log a_{I_j I_j}
[Architecture with insert states: beg, M1 … ML, I0 … IL, end]
Components of profile HMMs
• Delete states
  – No emission probabilities
  – Cost of a deletion: transitions M_i to D_{i+1}, D_i to D_{i+1}, D_i to M_{i+1}
  – Each D_i to D_{i+1} transition might be different for different i
[Architecture with delete states: beg, M1 … ML, D1 … DL, end]
Full structure of profile HMMs
[Diagram: beg, match states M1 … ML, insert states I0 … IL, delete states D1 … DL, end]
This is the structure implemented in HMMER, which differs slightly from the structure described in the textbook: no transition is allowed from Dj to Ij or from Ij to Dj+1. As a result, the recursive equations of the Viterbi algorithm also differ slightly from those in the book.
Deriving HMMs from multiple alignments
• Key idea behind profile HMMs
  – A model representing the consensus for the alignment of sequences from the same family
  – Not the sequence of any particular member

  HBA_HUMAN   ...VGA--HAGEY...
  HBB_HUMAN   ...V----NVDEV...
  MYG_PHYCA   ...VEA--DVAGH...
  GLB3_CHITP  ...VKG------D...
  GLB5_PETMA  ...VYS--TYETS...
  LGB2_LUPLU  ...FNA--NIPKH...
  GLB1_GLYDI  ...IAGADNGAGV...
                 ***  *****
Deriving HMMs from multiple alignments
• Basic profile HMM parameterization
  – Aim: assign higher probability to sequences from the family
• Parameters
  – The transition and emission probabilities: trivial to estimate if many independent aligned sequences are given,
    a_kl = A_kl / Σ_{l'} A_{kl'},   e_k(a) = E_k(a) / Σ_{a'} E_k(a')
    where A_kl and E_k(a) are the observed transition and emission counts (a small sketch follows)
  – The length of the model: heuristics or a systematic approach (e.g., using the MAP algorithm)
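A minimal sketch of turning observed counts into probabilities; the optional pseudocount argument is an addition of this sketch (the formulas above use raw counts).

```python
def normalize_counts(counts, pseudocount=0.0):
    """e_k(a) = E_k(a) / sum_a' E_k(a')  (likewise a_kl = A_kl / sum_l' A_kl').
    An optional pseudocount guards against zero probabilities for unseen symbols."""
    total = sum(counts[key] + pseudocount for key in counts)
    return {key: (counts[key] + pseudocount) / total for key in counts}

# Emission counts for one match state, e.g. a column where A was observed 4 times:
emission_counts = {'A': 4, 'C': 0, 'G': 0, 'T': 0}
print(normalize_counts(emission_counts))        # maximum-likelihood estimate
print(normalize_counts(emission_counts, 1.0))   # with Laplace pseudocounts
```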
Deriving HMMs from multiple alignments — an example
(a) Multiple alignment of five DNA sequences (bat, rat, cat, gnat, goat). Columns marked x are match columns; columns marked '.' are treated as inserts. The three match columns contain A A A A, G G G G G and C C G C C; the residues appearing in the insert columns are A A, G A A and A A.
(b) Profile-HMM architecture: beg, match states M1–M3, insert states I0–I3, delete states D1–D3, end.
(c) Observed emission/transition counts:

      Emission from M     Emission from I     Transition counts
       A  C  G  T          A  C  G  T         M-M M-D M-I I-M I-I D-M D-D
  0    -  -  -  -          0  0  0  0          4   1   0   0   0   -   -
  1    4  0  0  0          0  0  0  0          4   0   0   0   0   1   0
  2    0  0  5  0          6  0  1  0          2   0   3   2   4   0   0
  3    0  4  1  0          0  0  0  0          4   0   0   0   0   0   0
A simple rule that works well: columns in which more than half of the characters are gaps should be modeled by inserts (a small sketch follows).
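A minimal sketch of that rule; the toy alignment is hypothetical.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns to model as match states: columns in which
    at most half of the characters are gaps."""
    n_seq = len(alignment)
    n_col = len(alignment[0])
    return [j for j in range(n_col)
            if sum(seq[j] == '-' for seq in alignment) / n_seq <= max_gap_fraction]

alignment = ["AG--C",
             "AGA-C",
             "A---C",
             "AG--C"]
print(match_columns(alignment))   # [0, 1, 4]
```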
Sequence conservation: the entropy of the emission probability distributions.
High entropy indicates non-conserved positions (note: the accompanying plot shows negative entropy).
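A small sketch of the per-column entropy computation (toy distributions shown):

```python
import math

def column_entropy(emission_probs):
    """Shannon entropy (bits) of one match state's emission distribution.
    Low entropy = strongly conserved position; high entropy = non-conserved."""
    return -sum(p * math.log2(p) for p in emission_probs.values() if p > 0)

conserved     = {'A': 0.97, 'C': 0.01, 'G': 0.01, 'T': 0.01}
non_conserved = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(column_entropy(conserved), column_entropy(non_conserved))  # ~0.24 vs 2.0 bits
```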
Matching a sequence to a profile HMM (global alignment)
• Viterbi algorithm: profile HMM θ (with L match states) and a query sequence x_1 x_2 … x_N

Initialization:
  V^M_0(0) = 1;  V^M_j(0) = 0 for j > 0;  V^M_0(i) = 0 for i > 0;  V^I_j(0) = 0;  V^D_0(i) = 0

Recurrence, for i = 1, …, N and j = 1, …, L:
  V^M_j(i) = e_{M_j}(x_i) · max{ V^M_{j−1}(i−1) · a_{M_{j−1} M_j},  V^I_{j−1}(i−1) · a_{I_{j−1} M_j},  V^D_{j−1}(i−1) · a_{D_{j−1} M_j} }
  V^I_j(i) = e_{I_j}(x_i) · max{ V^M_j(i−1) · a_{M_j I_j},  V^I_j(i−1) · a_{I_j I_j} }
  V^D_j(i) = max{ V^M_{j−1}(i) · a_{M_{j−1} D_j},  V^D_{j−1}(i) · a_{D_{j−1} D_j} }

Termination:
  V = max[ V^M_L(N), V^I_L(N), V^D_L(N) ]

Note: this is slightly different from the textbook; there is no transition from D_j to I_j or from I_j to D_{j+1}.
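A minimal Python sketch of this fill step (probability space, not log space). The state names ('M0' for begin, 'M1'…'ML', 'I0'…'IL', 'D1'…'DL'), the dictionary-based parameterization and the function name are assumptions of this sketch, not any particular library's API; pointer tables are stored for the traceback described next.

```python
def profile_viterbi(x, L, eM, eI, a):
    """Viterbi fill for a profile HMM with the structure above (no D->I or I->D).
    x: query string; L: number of match states; eM[j][c] (j = 1..L) and
    eI[j][c] (j = 0..L): emission probabilities; a[(s, t)]: transition
    probability between named states.  Returns (score, best end state, V, ptr)."""
    N = len(x)
    V = {s: [[0.0] * (N + 1) for _ in range(L + 1)] for s in "MID"}
    ptr = {s: [[None] * (N + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 1.0                                   # V^M_0(0) = 1
    t = lambda s, u: a.get((s, u), 0.0)                  # absent transitions have prob 0
    for i in range(N + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:                          # match state M_j emits x_i
                cand = {"M": V["M"][j-1][i-1] * t(f"M{j-1}", f"M{j}"),
                        "I": V["I"][j-1][i-1] * t(f"I{j-1}", f"M{j}"),
                        "D": V["D"][j-1][i-1] * t(f"D{j-1}", f"M{j}")}
                ptr["M"][j][i] = max(cand, key=cand.get)
                V["M"][j][i] = eM[j][x[i-1]] * cand[ptr["M"][j][i]]
            if i > 0:                                    # insert state I_j emits x_i
                cand = {"M": V["M"][j][i-1] * t(f"M{j}", f"I{j}"),
                        "I": V["I"][j][i-1] * t(f"I{j}", f"I{j}")}
                ptr["I"][j][i] = max(cand, key=cand.get)
                V["I"][j][i] = eI[j][x[i-1]] * cand[ptr["I"][j][i]]
            if j > 0:                                    # delete state D_j emits nothing
                cand = {"M": V["M"][j-1][i] * t(f"M{j-1}", f"D{j}"),
                        "D": V["D"][j-1][i] * t(f"D{j-1}", f"D{j}")}
                ptr["D"][j][i] = max(cand, key=cand.get)
                V["D"][j][i] = cand[ptr["D"][j][i]]
    best = max("MID", key=lambda s: V[s][L][N])          # termination
    return V[best][L][N], best, V, ptr
```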
Viterbi algorithm: traceback pointers

  ptr^M_j(i) = M  if V^M_j(i) = e_{M_j}(x_i) · V^M_{j−1}(i−1) · a_{M_{j−1} M_j}
               I  if V^M_j(i) = e_{M_j}(x_i) · V^I_{j−1}(i−1) · a_{I_{j−1} M_j}
               D  if V^M_j(i) = e_{M_j}(x_i) · V^D_{j−1}(i−1) · a_{D_{j−1} M_j}

  ptr^I_j(i) = M  if V^I_j(i) = e_{I_j}(x_i) · V^M_j(i−1) · a_{M_j I_j}
               I  if V^I_j(i) = e_{I_j}(x_i) · V^I_j(i−1) · a_{I_j I_j}

  ptr^D_j(i) = M  if V^D_j(i) = V^M_{j−1}(i) · a_{M_{j−1} D_j}
               D  if V^D_j(i) = V^D_{j−1}(i) · a_{D_{j−1} D_j}
  (delete states emit nothing, so no emission factor appears here)

  ptr_ter = M if V = V^M_L(N);  I if V = V^I_L(N);  D if V = V^D_L(N)

  PrintStateSeq(ptrM, ptrI, ptrD, ptr_ter, N, L)
PrintStateSeq(ptrM, ptrI, ptrD, state, i, j)
  if i = 0 or j = 0: return
  if state = "M":
    if ptrM_j(i) = "M": PrintStateSeq(ptrM, ptrI, ptrD, "M", i-1, j-1)
    else if ptrM_j(i) = "I": PrintStateSeq(ptrM, ptrI, ptrD, "I", i-1, j-1)
    else if ptrM_j(i) = "D": PrintStateSeq(ptrM, ptrI, ptrD, "D", i-1, j-1)
    print M_j
  else if state = "I":
    if ptrI_j(i) = "M": PrintStateSeq(ptrM, ptrI, ptrD, "M", i-1, j)
    else if ptrI_j(i) = "I": PrintStateSeq(ptrM, ptrI, ptrD, "I", i-1, j)
    print I_j
  else (state = "D"):
    if ptrD_j(i) = "M": PrintStateSeq(ptrM, ptrI, ptrD, "M", i, j-1)
    else if ptrD_j(i) = "D": PrintStateSeq(ptrM, ptrI, ptrD, "D", i, j-1)
    print D_j
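A Python counterpart of PrintStateSeq, assuming the pointer tables returned by the fill sketch above; like the pseudocode, it stops at the i = 0 or j = 0 boundary.

```python
def print_state_seq(ptr, state, i, j, out=None):
    """Follow the stored Viterbi pointers back from (state, i, j) and return the
    state labels in left-to-right order (Python version of PrintStateSeq above)."""
    out = [] if out is None else out
    if i == 0 or j == 0:
        return out
    prev = ptr[state][j][i]
    if state == "M":
        print_state_seq(ptr, prev, i - 1, j - 1, out)
    elif state == "I":
        print_state_seq(ptr, prev, i - 1, j, out)
    else:  # "D"
        print_state_seq(ptr, prev, i, j - 1, out)
    out.append(f"{state}{j}")
    return out

# Usage with the fill sketch above (model parameters are hypothetical):
# score, last_state, V, ptr = profile_viterbi(x, L, eM, eI, a)
# print(print_state_seq(ptr, last_state, len(x), L))
```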
Matching a sequence to a profile HMM (global alignment)
• Forward algorithm: profile HMM θ (with L match states) and a query sequence x_1 x_2 … x_N

Initialization:
  F^M_0(0) = 1;  F^M_j(0) = 0 for j > 0;  F^M_0(i) = 0 for i > 0;  F^I_j(0) = 0;  F^D_0(i) = 0

Recurrence, for i = 1, …, N and j = 1, …, L:
  F^M_j(i) = e_{M_j}(x_i) · [ F^M_{j−1}(i−1) · a_{M_{j−1} M_j} + F^I_{j−1}(i−1) · a_{I_{j−1} M_j} + F^D_{j−1}(i−1) · a_{D_{j−1} M_j} ]
  F^I_j(i) = e_{I_j}(x_i) · [ F^M_j(i−1) · a_{M_j I_j} + F^I_j(i−1) · a_{I_j I_j} ]
  F^D_j(i) = F^M_{j−1}(i) · a_{M_{j−1} D_j} + F^D_{j−1}(i) · a_{D_{j−1} D_j}

Termination:
  F = F^M_L(N) + F^I_L(N) + F^D_L(N)

Note: this is slightly different from the textbook; there is no transition from D_j to I_j or from I_j to D_{j+1}.
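A Python sketch of the forward fill, mirroring the Viterbi sketch above (same assumed state naming and parameter layout); sums replace the max.

```python
def profile_forward(x, L, eM, eI, a):
    """Forward algorithm for the same profile HMM structure (no D->I or I->D).
    Same conventions as the Viterbi sketch: eM[1..L], eI[0..L], a[(s, t)].
    Returns the total probability P(x | model)."""
    N = len(x)
    F = {s: [[0.0] * (N + 1) for _ in range(L + 1)] for s in "MID"}
    F["M"][0][0] = 1.0
    t = lambda s, u: a.get((s, u), 0.0)
    for i in range(N + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:                      # M_j emits x_i
                F["M"][j][i] = eM[j][x[i-1]] * (
                    F["M"][j-1][i-1] * t(f"M{j-1}", f"M{j}") +
                    F["I"][j-1][i-1] * t(f"I{j-1}", f"M{j}") +
                    F["D"][j-1][i-1] * t(f"D{j-1}", f"M{j}"))
            if i > 0:                                # I_j emits x_i
                F["I"][j][i] = eI[j][x[i-1]] * (
                    F["M"][j][i-1] * t(f"M{j}", f"I{j}") +
                    F["I"][j][i-1] * t(f"I{j}", f"I{j}"))
            if j > 0:                                # D_j emits nothing
                F["D"][j][i] = (F["M"][j-1][i] * t(f"M{j-1}", f"D{j}") +
                                F["D"][j-1][i] * t(f"D{j-1}", f"D{j}"))
    return F["M"][L][N] + F["I"][L][N] + F["D"][L][N]    # termination
```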
Example: a profile HMM with L = 3 match states; query: AGG (N = 3)

[Architecture: beg, M1–M3, I0–I3, D1–D3, end]

Parameters:
       Emission from M       Emission from I       Transition probabilities
        A   C   G   T         A   C   G   T        M-M  M-D  M-I  I-M  I-I  D-M  D-D
  0     -   -   -   -        .2  .3  .3  .2         .8   .2   0   .5   .5    -    -
  1     1   0   0   0        .2  .3  .3  .2         1    0    0   .5   .5    1    0
  2    .5   0  .5   0        .9   0  .1   0         .4   0   .6   .3   .7   .5   .5
  3     0  .8  .2   0        .2  .3  .3  .2         1    0    0   .5   .5    1    0
Example: Viterbi algorithm — initialization
Query: X = AGG (N = 3); emission and transition probabilities as in the table above.

The dynamic-programming table has rows i = 0…3 (query positions) and, for each model position j = 0…3, columns V^M, V^I, V^D.
Initialization: V^M_0(0) = 1; the remaining initialized entries in row i = 0 and column j = 0 are 0.
Example: Viterbi algorithm — first delete entry
  V^D_1(0) = max( V^M_0(0) · a_{M_0 D_1}, V^D_0(0) · a_{D_0 D_1} ) = max(1 × 0.2, 0) = 0.2
This fills the (i = 0, j = 1, V^D) cell of the table.
Example: Viterbi algorithm — first match entry
  V^M_1(1) = e_{M_1}(x_1) · max( V^M_0(0) · a_{M_0 M_1}, V^I_0(0) · a_{I_0 M_1}, V^D_0(0) · a_{D_0 M_1} )
           = 1 × max(1 × 0.8, 0 × 0.5, 0 × 1) = 0.8
This fills the (i = 1, j = 1, V^M) cell of the table.
Example: Viterbi algorithm — last match entry
  V^M_3(3) = ?
  V^M_3(3) = e_{M_3}(x_3) · max( V^M_2(2) · a_{M_2 M_3}, V^I_2(2) · a_{I_2 M_3}, V^D_2(2) · a_{D_2 M_3} )
           = 0.2 × max(0.4 × 0.4, 0.006 × 0.3, 0 × 0.5) = 0.032
Example: Viterbi algorithm — completed table and traceback

         j = 0           j = 1           j = 2               j = 3
  i    VM  VI  VD      VM  VI  VD      VM    VI    VD      VM     VI     VD
  0     1   0   0       0   0  .2       0     0     0       0      0      0
  1     0   0   0      .8   0   0      .1     0     0       0      0      0
  2     0   0   0       0   0   0      .4   .006    0     .008     0      0
  3     0   0   0       0   0   0       0   .024    0     .032   .0014    0

Traceback: V = max( V^M_3(3), V^I_3(3), V^D_3(3) ) = 0.032, reached through M3 ← M2 ← M1 ← begin, i.e. A, G, G are aligned to the match states M1, M2, M3.
Searching with profile HMMs
• Main usage of profile HMMs
  – Detecting potential member sequences of a family
  – Core algorithm: matching a sequence to a profile HMM (Viterbi algorithm or forward algorithm)
• Compare the resulting probability with a random (background) model R to obtain a log-odds score, where
  P(x | R) = ∏_i q_{x_i}
  and q_{x_i} is the background frequency of observing x_i.
Matching a sequence to a profile HMM (global alignment) — Viterbi algorithm in log-odds form

Initialization:
  V^M_0(0) = 0;  V^M_j(0) = −∞ for j > 0;  V^M_0(i) = −∞ for i > 0;  V^I_j(0) = −∞;  V^D_0(i) = −∞

Recurrence, for i = 1, …, N and j = 1, …, L:
  V^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V^M_{j−1}(i−1) + log a_{M_{j−1} M_j},  V^I_{j−1}(i−1) + log a_{I_{j−1} M_j},  V^D_{j−1}(i−1) + log a_{D_{j−1} M_j} }
  V^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V^M_j(i−1) + log a_{M_j I_j},  V^I_j(i−1) + log a_{I_j I_j} }
  V^D_j(i) = max{ V^M_{j−1}(i) + log a_{M_{j−1} D_j},  V^D_{j−1}(i) + log a_{D_{j−1} D_j} }

Termination:
  V = max[ V^M_L(N), V^I_L(N), V^D_L(N) ]

Note: this is slightly different from the textbook; no transition from D_j to I_j or from I_j to D_{j+1}.
Matching a sequence to a profile HMM (global alignment) — Forward algorithm in log-odds form

Recurrence, for i = 1, …, N and j = 1, …, L:
  F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + log[ a_{M_{j−1} M_j} exp(F^M_{j−1}(i−1)) + a_{I_{j−1} M_j} exp(F^I_{j−1}(i−1)) + a_{D_{j−1} M_j} exp(F^D_{j−1}(i−1)) ]
  F^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + log[ a_{M_j I_j} exp(F^M_j(i−1)) + a_{I_j I_j} exp(F^I_j(i−1)) ]
  F^D_j(i) = log[ a_{M_{j−1} D_j} exp(F^M_{j−1}(i)) + a_{D_{j−1} D_j} exp(F^D_{j−1}(i)) ]

Initialization:
  F^M_0(0) = 0;  F^M_j(0) = −∞ for j > 0;  F^M_0(i) = −∞ for i > 0;  F^I_j(0) = −∞;  F^D_0(i) = −∞

Termination:
  F = log[ exp(F^M_L(N)) + exp(F^I_L(N)) + exp(F^D_L(N)) ]
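When implementing these log-space recursions, the log[Σ a·exp(F)] terms are usually computed with a numerically stable log-sum-exp; a small sketch:

```python
import math

def log_sum_exp(log_terms):
    """Numerically stable log(sum_i exp(t_i)); avoids underflow when the
    log-space forward terms are very negative."""
    finite = [t for t in log_terms if t != float("-inf")]
    if not finite:
        return float("-inf")
    m = max(finite)
    return m + math.log(sum(math.exp(t - m) for t in finite))

# e.g. combining three incoming log-probability terms:
print(log_sum_exp([-1050.0, -1052.0, float("-inf")]))   # ~ -1049.87
```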
Significance of HMM alignment
• The log-odds score V of the optimal local Viterbi alignment between a random sequence and a profile HMM follows a Gumbel (type I EVD) distribution:
  P(V ≥ t) = 1 − exp[ −e^{−λ(t − µ)} ]
• With ~200 simulated Viterbi scores, the location parameter µ can be estimated accurately.
• λ ≈ log z, where z is the base of the log-odds score (e.g., z = 2 when the sequence length approaches infinity).
• The length effect can be corrected by λ ≈ log 2 + 1.44 / (hN)
  – where N is the length and h is the average relative entropy per match state in the pHMM;
  – for typical Pfam models, N ≈ 140 and h ≈ 1.8, so λ ≈ log 2 + 0.0057, a small correction.
Eddy, PLoS Comput. Biol., 4:1, 2008
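A small sketch of evaluating the Gumbel tail probability; the numbers in the example call are hypothetical.

```python
import math

def gumbel_pvalue(score, mu, lam):
    """P(V >= t) = 1 - exp(-exp(-lambda * (t - mu))) for a Gumbel-distributed
    optimal local alignment score."""
    return 1.0 - math.exp(-math.exp(-lam * (score - mu)))

# Hypothetical calibration: mu fitted from simulated scores, lambda ~ log 2
print(gumbel_pvalue(score=25.0, mu=10.0, lam=math.log(2)))
```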
Variants for non-global alignments
• Local alignment (Smith–Waterman type)
  – Emission probabilities in the flanking states use background values q_x
  – Looping probability close to 1, e.g. (1 − η) for some small η
[Diagram: the profile HMM core (D_j, I_j, M_j) between beg and end, with flanking random-model states Q on both sides, each with self-loop probability 1 − η.]
Variants for non-global alignments
• Overlap (also called glocal or fit) alignment
  – The loop probability of the first and last insert states is much higher than that of the other insert states
  – Used when the match is expected to be either present as a whole or absent (e.g., a protein domain within a protein)
  – A transition to the first delete state allows a missing first residue
[Diagram: the profile HMM core (D_j, I_j, M_j) between beg and end, with flanking Q states.]
Variants for non-global alignments
• Repeat alignments
  – Transition from the right flanking state back to the random model
  – Can find multiple matching segments in the query string
[Diagram: the profile HMM core (D_j, I_j, M_j) between beg and end, with a flanking Q state and a transition from the right flank back, allowing repeated matches.]
Optimal model construction: different ways of marking columns
(a) Multiple alignment of five DNA sequences (bat, rat, cat, gnat, goat). Columns marked x are match columns; columns marked '.' are treated as inserts. Here the three match columns contain A A A A, G G G and C C C C; the residues appearing in the insert columns are A A, G A A and A A.
(b) Profile-HMM architecture: beg, match states M1–M3, insert states I0–I3, delete states D1–D3, end (textbook structure, including I→D and D→I transitions).
(c) Observed emission/transition counts:

      Emission from M     Emission from I     Transition counts
       A  C  G  T          A  C  G  T         M-M M-D M-I I-M I-D I-I D-M D-D D-I
  0    -  -  -  -          0  0  0  0          4   1   0   0   0   0   -   -   -
  1    4  0  0  0          0  0  0  0          3   1   0   0   0   0   0   1   0
  2    0  0  3  0          6  0  1  0          2   0   1   2   1   4   0   0   2
  3    0  4  0  0          0  0  0  0          4   0   0   0   0   0   1   0   0
Optimal model construction
• MAP (match–insert assignment)
  – Recursive calculation of a number S_j
    • S_j: log probability of the optimal model for the alignment up to and including column j, assuming column j is marked
    • S_j is calculated from S_i and the summed log probability between columns i and j
    • T_ij: summed log probability of all the state transitions between marked columns i and j:
      T_ij = Σ_{x,y ∈ {M,D,I}} c_xy log a_xy
  – The counts c_xy are obtained from the partial state paths implied by marking columns i and j.
Optimal model construction
• Algorithm: MAP model construction
  – Initialization: S_0 = 0, M_{L+1} = 0
  – Recurrence: for j = 1, …, L+1:
      S_j = max_{0 ≤ i < j} ( S_i + T_ij + M_j + I_{i+1, j−1} + λ )
      σ_j = argmax_{0 ≤ i < j} ( S_i + T_ij + M_j + I_{i+1, j−1} + λ )
  – Traceback: from j = σ_{L+1}, while j > 0:
      • Mark column j as a match column
      • j ← σ_j
(A DP skeleton of this recurrence follows.)
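A DP skeleton of this recurrence, under stated assumptions: T, M, I and lam are taken as precomputed log-probability terms indexed so that T[i][j], M[j] and I[i + 1][j - 1] are valid for 0 ≤ i < j ≤ L + 1 (with I covering the empty run when i + 1 > j − 1); all names here are hypothetical.

```python
def map_model_construction(L, T, M, I, lam):
    """MAP match-column selection: S_j = max_{0<=i<j}(S_i + T_ij + M_j + I_{i+1,j-1} + lam),
    with sigma_j recording the maximizing i; column L+1 is the virtual end column."""
    S = [float("-inf")] * (L + 2)
    sigma = [0] * (L + 2)
    S[0] = 0.0
    for j in range(1, L + 2):
        best_i = max(range(j),
                     key=lambda i: S[i] + T[i][j] + M[j] + I[i + 1][j - 1] + lam)
        sigma[j] = best_i
        S[j] = S[best_i] + T[best_i][j] + M[j] + I[best_i + 1][j - 1] + lam
    marked, j = [], sigma[L + 1]
    while j > 0:                       # traceback: mark match columns following sigma
        marked.append(j)
        j = sigma[j]
    return sorted(marked)
```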
Weighting training sequences
• Are the input sequences random, independent samples?
• The assumption that all examples are independent samples may be incorrect
• Solution
  – Weight sequences based on similarity: highly similar pairs of training sequences receive lower weights
Multiple sequence alignment (MSA) by training profile HMMs
• Sequence profiles can be represented as probabilistic models such as profile HMMs
  – ML methods for building (training) a profile HMM are based on a multiple sequence alignment
  – Profile HMMs can also be trained from initially unaligned sequences using a Baum–Welch-like EM algorithm
    • Simultaneously aligning multiple sequences and building the profile HMM from the multiple alignment
Multiple alignment with a known profile HMM • A step backward: to derive a multiple alignment from a known profile HMM model – e.g., to align many sequences from the same family based on the HMM model built from the (seed) multiple alignment of a small representative set of sequences in the family.
• It just requires calculating a Viterbi alignment for each individual sequence
  – Match each sequence to the profile HMM with the Viterbi algorithm
  – Residues aligned to the same match state in the profile HMM should be placed in the same column (a small sketch follows)
  – Given a preliminary alignment, the HMM can be used to align additional sequences
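A minimal sketch of the column-building step just described: residues assigned to the same match state are stacked into one column; insert residues are written in lower case and padded with '.', i.e. left unaligned. The state-label format and the function name are assumptions of this sketch.

```python
def msa_from_paths(seqs, paths, L):
    """Build an MSA from per-sequence Viterbi state paths such as ["M1", "I1", "M2"]."""
    per_seq = []
    for seq, path in zip(seqs, paths):
        match = {j: "-" for j in range(1, L + 1)}     # one column per match state
        ins = {j: "" for j in range(L + 1)}           # insert run following column j
        pos = 0
        for state in path:
            kind, j = state[0], int(state[1:])
            if kind == "M":
                match[j] = seq[pos]; pos += 1
            elif kind == "I":
                ins[j] += seq[pos].lower(); pos += 1
            # delete states ("D") consume no residue and leave the '-' gap
        per_seq.append((match, ins))
    width = {j: max(len(ins[j]) for _, ins in per_seq) for j in range(L + 1)}
    rows = []
    for match, ins in per_seq:
        row = ins[0].ljust(width[0], ".")
        for j in range(1, L + 1):
            row += match[j] + ins[j].ljust(width[j], ".")
        rows.append(row)
    return rows

print(msa_from_paths(["AGC", "AC"], [["M1", "I1", "M2"], ["M1", "M2"]], L=2))
# ['AgC', 'A.C']   (toy 2-match-state model)
```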
Multiple alignment with a known profile HMM
• Comparison with other MSA programs
  – A profile HMM does not align the insert-state residues, whereas other MSA algorithms align the whole sequences.
Training profile HMMs from unaligned sequences
• Simultaneously aligning multiple sequences and building the profile HMM from the multiple alignment
  – Initialization: choose the length of the profile HMM and initialize the parameters of the model
  – Training: estimate the model using the Baum–Welch algorithm, iterating until the model (and the MSA) converges
  – MSA: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section
(A high-level sketch of this loop follows.)
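A high-level sketch of that loop; every helper is passed in as a function and is hypothetical (not a specific library's API).

```python
def train_profile_hmm(seqs, init_model, baum_welch_update, log_likelihood,
                      viterbi_align, build_msa, max_iter=100, tol=1e-6):
    """Train a profile HMM from unaligned sequences, then build the MSA.
    init_model(length) -> randomized initial model; baum_welch_update re-estimates
    parameters; log_likelihood scores the data; viterbi_align returns a state path;
    build_msa stacks residues that share a match state into columns."""
    avg_len = sum(len(s) for s in seqs) // len(seqs)
    model = init_model(avg_len)                      # common rule: average sequence length
    prev_ll = float("-inf")
    for _ in range(max_iter):                        # EM (Baum-Welch) iterations
        model = baum_welch_update(model, seqs)
        ll = log_likelihood(model, seqs)
        if ll - prev_ll < tol:                       # stop when the model converges
            break
        prev_ll = ll
    paths = [viterbi_align(model, s) for s in seqs]  # align each sequence to final model
    return model, build_msa(seqs, paths)
```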
Profile HMM training from unaligned sequences
• Initial model
  – The only decision that must be made in choosing an initial structure for Baum–Welch estimation is the length of the model, M
  – A commonly used rule is to set M to the average length of the training sequences
  – Some randomness in the initial parameters is needed to avoid local maxima
Multiple alignment by profile HMM training
• Avoiding local maxima
  – The Baum–Welch algorithm is only guaranteed to find a LOCAL maximum
    • Models are usually quite long, and there are many opportunities to get stuck in a wrong solution
– Solution • Start many times from different initial models. • Use some form of stochastic search algorithm, e.g. simulated annealing.
Multiple alignment by profile HMM training — model surgery
• We can modify the model after (or during) training by manually checking the alignment it produces
  – Some of the match states are redundant
  – Some insert states absorb too many sequences
• Model surgery
  – If a match state is used by fewer than half of the training sequences, delete its module (match–insert–delete states)
  – If more than half of the training sequences use a certain insert state, expand it into n new modules, where n is the average length of the insertions
  – Ad hoc, but works well
HMMER3
Figure 4. The HMMER3 acceleration pipeline. Eddy SR (2011) Accelerated Profile HMM Searches. PLoS Comput Biol 7(10): e1002195. doi:10.1371/journal.pcbi.1002195 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002195
Accelerated Profile HMM Searches
Figure 1. The MSV profile. Eddy SR (2011) Accelerated Profile HMM Searches. PLoS Comput Biol 7(10): e1002195. doi:10.1371/journal.pcbi.1002195 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002195