Apr 2, 2009 - use a shift table that gives for each position of the pattern the next position to test in the text. JS Varré (http://bioinfo.lifl.fr). MP & KMP for PWMs.
Self-Overlapping Occurrences and Knuth-Morris-Pratt Algorithm for Weighted Matching
Aude Liefooghe, Hélène Touzet and Jean-Stéphane Varré
LIFL - UMR CNRS 8022 - Université Lille 1, France
INRIA Lille Nord-Europe, France
Sequoia group
http://bioinfo.lifl.fr
April 2nd 2009

Position Weight Matrices (PWMs)
TTGCGGTC TTGCGGTT TTGCGGTC TTGTGGTT CTGCGGTT TTGTGGTC TTGTGGTC CTGTGGTT CTGCGGTT TTGCGGTA ATGCGGTT CTGCGGTT ATGTGGTA TTGTGGAC AAGTGGTT TCTTGGTT CAGTGGGT
1. count the occurrences
2. compute the frequencies with correction
3. compute the log-odds ratio $M(i, x) = \log_2 \frac{F(i, x)}{p(x)}$, with $p(x)$ the background probability of letter $x$
A   0.60  -0.32  -0.69  -2.89  -2.83  -2.89  -2.89  -1.28  -0.69
C  -0.69   0.15  -1.28  -2.89   0.66  -2.89  -2.89  -2.89  -0.05
G  -1.28  -2.89  -2.89   1.28  -2.83   1.34   1.34  -1.28  -2.89
T   0.32   0.72   1.15  -1.28   0.66  -2.89  -2.89   1.22   0.91
[Figure: sequence logo of the matrix (WebLogo 3.0), y-axis in bits.]
Positive score: over-represented letter in the pattern.
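
The three construction steps (counts, corrected frequencies, log-odds) can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not say which correction is used, so the additive `pseudocount` is an assumption, and the names `build_pwm` and `ALPHABET` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, background, pseudocount=1.0):
    """Return one {letter: log-odds score} dict per column of the aligned sites.

    The additive pseudocount is an assumed form of the "correction" step.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for i in range(length):
        # Step 1: count the occurrences of each letter in column i.
        counts = {x: 0 for x in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        column = {}
        for x in ALPHABET:
            # Step 2: corrected frequency F(i, x).
            freq = (counts[x] + pseudocount) / total
            # Step 3: log-odds ratio M(i, x) = log2(F(i, x) / p(x)).
            column[x] = math.log2(freq / background[x])
        matrix.append(column)
    return matrix
```

With a uniform background, a letter gets a positive score exactly when its corrected frequency exceeds 0.25, which is the "over-represented letter" reading given on the slide.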

Weighted Pattern Matching
given:
- a weighted pattern: a matrix $M$ of size $m \times |\Sigma|$, where $M(p, x)$ is the score at position $p$ for the letter $x$ in $\Sigma$
- a score threshold $\alpha$
find the occurrences of the weighted pattern in a text $T$
- occurrence: a position $p$ in $T$ such that the score of $T_p \ldots T_{p+m-1}$ is greater than or equal to the threshold $\alpha$
- score: $\mathrm{Score}(M, u) = \sum_{p=0}^{m-1} M(p, u_p)$
Threshold is chosen given a P-value. $\mathrm{Pv}(s)$ = proportion of words of score greater than $s$.
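
As a small illustration of the definitions above, the score of a word and the list of occurrences in a text can be computed directly. The representation of the matrix as a list of per-column dictionaries and the function names are assumptions made for this sketch.

```python
def score(M, u):
    """Score(M, u): sum of M(p, u_p) over the columns of M."""
    return sum(column[x] for column, x in zip(M, u))

def naive_search(M, text, alpha):
    """Return every position p whose window of length m scores at least alpha."""
    m = len(M)
    return [p for p in range(len(text) - m + 1)
            if score(M, text[p:p + m]) >= alpha]

# Toy 3-column matrix with made-up integer scores (not taken from the slides).
TOY = [
    {"A": 2, "C": 0, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]
```

For example, `naive_search(TOY, "TACGACTA", 4)` reports positions 1 (`ACG`, score 6) and 4 (`ACT`, score 4).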

Brute-force approach
- the $\Theta(nm)$ naive algorithm: score the whole window, then test at the end: $\mathrm{Score}(M, T_p \ldots T_{p+4}) \geq \alpha$ ?
- the $O(nm)$ lookahead strategy [Wu et al., 2000]: stop the computation as soon as the partial score falls below a threshold for a given column. At each column $i$, test: $\mathrm{Score}(M[0..i], T_p \ldots T_{p+i}) \geq \mathrm{GLB}(M, i, \alpha)$ ? with $\mathrm{GLB}(M, i, \alpha) = \alpha - \mathrm{MaxSc}(M[i+1..m-1])$ (Greatest Lower Bound)
[Figure: a window $T_p \ldots T_{p+4}$ = A C A G G of the text aligned with the columns of a matrix with rows A, C, G, T.]
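
The lookahead test can be sketched as below, reading the greatest lower bound with a column index $i$: $\mathrm{GLB}(M, i, \alpha) = \alpha - \mathrm{MaxSc}(M[i+1..m-1])$. This is a minimal sketch under that reading; the matrix layout (one dictionary per column), the function names and the toy scores are assumptions, not part of the slides.

```python
def glb_table(M, alpha):
    """glb[i] = alpha - MaxSc(M[i+1..m-1]) for every column i."""
    m = len(M)
    best_suffix = [0] * (m + 1)          # best_suffix[i] = MaxSc(M[i..m-1])
    for i in range(m - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] + max(M[i].values())
    return [alpha - best_suffix[i + 1] for i in range(m)]

def lookahead_search(M, text, alpha):
    """Scan the text, abandoning a window as soon as it cannot reach alpha."""
    m = len(M)
    glb = glb_table(M, alpha)
    hits = []
    for p in range(len(text) - m + 1):
        partial = 0
        for i in range(m):
            partial += M[i][text[p + i]]
            if partial < glb[i]:         # even the best completion stays below alpha
                break
        else:
            hits.append(p)               # all columns passed; glb[m-1] == alpha
    return hits

# Toy 3-column matrix with made-up integer scores (not taken from the slides).
TOY = [
    {"A": 2, "C": 0, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]
```

The worst case is still $O(nm)$, since every window may survive until its last column, but poor windows are usually dropped after a few letters.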

How to improve pattern matching with PWMs?
pre-processing the text:
- suffix array [Beckstette et al., 2006]
- filtration algorithms [Pizzi et al., 2007]: look efficiently for potential occurrences and refine the results
pre-processing the matrix:
- expansion of the Shift-Add algorithm [Salmela and Tarhio, 2007]
- pre-compute scores for all possible words [Liefooghe et al., 2006]
  - split the matrix in q sub-matrices
  - compute the score for all words for each sub-matrix
  - complexity : O( mq n)
- use a shift table that gives, for each position of the pattern, the next position to test in the text
This requires computing the self-overlapping occurrences of the matrix
- use an Aho-Corasick automaton [Pizzi et al., 2007]
  - build an Aho-Corasick automaton for the set of words of score ≥ threshold
  - use the Aho-Corasick algorithm to obtain the shift
  - very efficient ... but the automaton becomes very large for long matrices

Self-overlapping occurrences of a word
key definition: border of a word
a border of $u$ of length $\ell$ is a prefix of length $\ell$ which is also a suffix of $u$.
T A T C T A T
borders: TAT, T
Morris-Pratt shifting rule: shift = failure position - length of the longest border
[Figure: the pattern T A T C T A T aligned under the text, a failure marked "?", and the pattern shifted by +4.]
mpNext[0] = −1, mpNext[i] = $\ell$, with $\ell$ the length of the longest border of the prefix of length i
shifts: [1, 1, 2, 2, 4, 4, 4, 4]
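
The shift table on this slide can be reproduced with the classical Morris-Pratt preprocessing. The sketch below is the textbook construction for plain words; the function names are chosen for this example.

```python
def mp_next(u):
    """mpNext[i] = length of the longest border of u[0..i-1], with mpNext[0] = -1."""
    table = [-1] * (len(u) + 1)
    k = -1
    for i, letter in enumerate(u):
        # Fall back through shorter borders until one can be extended by `letter`.
        while k >= 0 and u[k] != letter:
            k = table[k]
        k += 1
        table[i + 1] = k
    return table

def mp_shifts(u):
    """Shift for a failure at position i: i minus the longest border length."""
    return [i - border for i, border in enumerate(mp_next(u))]

print(mp_shifts("TATCTAT"))  # [1, 1, 2, 2, 4, 4, 4, 4]
```

The last entry corresponds to a complete match: the longest border TAT has length 3, so the pattern moves by 7 − 3 = 4.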

Self-overlapping occurrences of a PWM
a border for $M$ and $\alpha$ is a word $u$ of $\Sigma^{\ell}$, with $\ell < m$, such that there exist $v, w \in \Sigma^{m-\ell}$ satisfying: $\mathrm{Score}(M, uv) \geq \alpha$, and $\mathrm{Score}(M, wu) \geq \alpha$.
[Figure: the word $wu$ and the word $uv$ aligned on two overlapping copies of the matrix (rows A, C, G, T), sharing $u$.]
Lemma
$\mathrm{Score}(M[m-\ell..m-1], u) \geq \alpha - \mathrm{MaxSc}(M[0..m-\ell-1])$
$\mathrm{Score}(M[0..\ell-1], u) \geq \alpha - \mathrm{MaxSc}(M[\ell..m-1]) = \mathrm{GLB}(M, \ell-1, \alpha)$
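
The two inequalities of the lemma give a direct test for whether a given word is a border. The sketch below applies them to a 0/1 matrix that encodes exact matching of TATCTAT, which is an illustrative assumption (with $\alpha = 7$ it should recover the word borders TAT and T); the helper names are hypothetical.

```python
ALPHABET = "ACGT"

def score(cols, u):
    return sum(col[x] for col, x in zip(cols, u))

def max_score(cols):
    """MaxSc: best score reachable on the given columns."""
    return sum(max(col.values()) for col in cols)

def is_border(M, u, alpha):
    """Check both lemma conditions for a word u with 0 < len(u) < m."""
    m, l = len(M), len(u)
    if not 0 < l < m:
        return False
    # u read as a suffix of an occurrence w.u
    as_suffix = score(M[m - l:], u) >= alpha - max_score(M[:m - l])
    # u read as a prefix of an occurrence u.v
    as_prefix = score(M[:l], u) >= alpha - max_score(M[l:])
    return as_suffix and as_prefix

def exact_matrix(word):
    """Hypothetical helper: score 1 for the letter of `word`, 0 otherwise."""
    return [{x: int(x == ch) for x in ALPHABET} for ch in word]
```

With `M = exact_matrix("TATCTAT")`, the words `"TAT"` and `"T"` pass the test for `alpha = 7` while `"TA"` does not. Lowering the threshold to 6 (one mismatch tolerated) makes `"TT"` a border as well, which shows how the set of borders depends on $\alpha$.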

Expansion of the Morris-Pratt shifting rule
define $c(M, p, \alpha)$ as true iff there exists a border of length $m - p$ for $M$ and $\alpha$
mpNext[i] = $i - p$, where $p$ is the lowest value such that $c(M[0..i-1], p, \mathrm{GLB}(M, i-1, \alpha))$ is true
[Figure: the word $wu$ read in the text, then the matrix (rows A, C, G, T) shifted so that its first columns face $u$; the scores of $u$ on these columns are marked "? ? ?".]
- after the shift, we don't know the score of $u$ on the first columns of the matrix ($M[0..1]$ in the figure), so it has to be computed again
- the complexity remains $O(mn)$

How to compute the border of a PWM?
let $p$, $i$ be two positions of $M$ ($0 \leq p \leq i \leq m-1$),
$b(M, p, i, \beta, \delta) = \exists u \in \Sigma^{i-p+1} : \mathrm{Score}(M[p..i], u) = \beta \wedge \mathrm{Score}(M[0..i-p], u) = \delta$
Lemma
$c(M, p, \alpha)$ is true if, and only if, for each value $i$ ranging from $p$ to $m-1$, there exist $\beta$ and $\delta$, such that
1. $b(M, p, i, \beta, \delta) = 1$, and
2. $\beta \geq \mathrm{GLB}(M, i, \alpha) - \mathrm{MaxSc}(M[0..p-1])$, and
3. $\delta \geq \mathrm{GLB}(M, i-p, \alpha)$.
[Figure: the matrix $M$ with positions $p$ and $i$ marked on one copy and position $i-p$ on the shifted copy.]

b can be computed by DP
Lemma
$b(M, p, i, 0, 0) = 1$, whenever $i < p$
$b(M, p, i, \beta, \delta) = 0$, whenever $i < p$, $\beta \neq 0$ or $\delta \neq 0$
$b(M, p, i, \beta, \delta) = \bigvee_{x \in \Sigma} b(M, p, i-1, \beta - M(i, x), \delta - M(i-p, x))$ otherwise
- we need only $b(M, p, i, \beta, \delta)$ for $\beta$ and $\delta$ satisfying the conditions of the previous lemma
- the computation can be done by increasing values of $i$, $\beta$ and $\delta$
- unfortunately the computation for a given $p$ cannot be reused for $p+1$
Computing $b$ can be time and memory consuming.
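
One way to read the recurrence is as a forward dynamic program over the set of reachable pairs $(\beta, \delta)$. The sketch below does this and then derives $c$ and the mpNext table from it. It is a simplified illustration rather than the authors' implementation: it assumes integer scores (so that equal sums compare exactly), keeps all reachable pairs instead of only the useful ones, and uses a 0/1 exact-match matrix as test data.

```python
ALPHABET = "ACGT"

def max_score(cols):
    return sum(max(col.values()) for col in cols)

def glb(M, i, alpha):
    """GLB(M, i, alpha) = alpha - MaxSc(M[i+1..m-1])."""
    return alpha - max_score(M[i + 1:])

def c(M, p, alpha):
    """True iff, for every i in p..m-1, some reachable (beta, delta) meets both bounds."""
    prefix_max = max_score(M[:p])
    pairs = {(0, 0)}                       # b(M, p, i, ., .) for i < p
    for i in range(p, len(M)):
        # Extend every word by one letter x: column i for beta, column i-p for delta.
        pairs = {(b + M[i][x], d + M[i - p][x])
                 for (b, d) in pairs for x in ALPHABET}
        beta_min = glb(M, i, alpha) - prefix_max
        delta_min = glb(M, i - p, alpha)
        if not any(b >= beta_min and d >= delta_min for (b, d) in pairs):
            return False
    return True

def mp_next(M, alpha):
    """mpNext[i] = i - p for the lowest p with c(M[0..i-1], p, GLB(M, i-1, alpha))."""
    table = [-1]
    for i in range(1, len(M) + 1):
        sub, threshold = M[:i], glb(M, i - 1, alpha)
        # p = i always succeeds (empty border), so next() cannot fail.
        p = next(q for q in range(1, i + 1) if c(sub, q, threshold))
        table.append(i - p)
    return table

def exact_matrix(word):
    """Hypothetical helper: score 1 for the letter of `word`, 0 otherwise."""
    return [{x: int(x == ch) for x in ALPHABET} for ch in word]
```

On `exact_matrix("TATCTAT")` with `alpha = 7`, `mp_next` returns the same table as the word-based Morris-Pratt preprocessing, `[-1, 0, 0, 1, 0, 1, 2, 3]`. The set of pairs is rebuilt from scratch for each `p`, which mirrors the remark that the computation for a given $p$ cannot be reused for $p+1$.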

Computing the probability to observe two overlapping occurrences
define the indicator variable $Y_k$ by $Y_k = 1$ if $M$ occurs in position $k$ in the text $T$
we have $Y_k = 1$ and $Y_{k+p} = 1$, for some $1 \leq p \leq m-1$, if there exist $\beta$ and $\delta$ such that
1. $\mathrm{Score}(M[0..p-1], T_k \ldots T_{k+p-1}) \geq \alpha - \beta$,
2. $\mathrm{Score}(M[p..m-1], T_{k+p} \ldots T_{k+m-1}) = \beta$ and $\mathrm{Score}(M[0..m-p-1], T_{k+p} \ldots T_{k+m-1}) = \delta$,
3. $\mathrm{Score}(M[m-p..m-1], T_{k+m} \ldots T_{k+p+m-1}) \geq \alpha - \delta$.
we introduce $B(M, p, i, \beta, \delta)$:
$B(M, p, i, \beta, \delta) = \mathbb{P}\left(u \in \Sigma^{i-p+1} ;\ \mathrm{Score}(M[p..i], u) = \beta \wedge \mathrm{Score}(M[0..i-p], u) = \delta\right)$
we have:
$\mathbb{P}(Y_k = 1, Y_{k+p} = 1) = \sum_{\beta, \delta} \mathrm{Pv}(M[0..p-1], \alpha - \beta) \times B(M, p, m-1, \beta, \delta) \times \mathrm{Pv}(M[m-p..m-1], \alpha - \delta)$
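
The sum above can be evaluated with the same kind of dynamic program, carrying probabilities instead of booleans. The sketch below makes several assumptions that are not stated on the slide: letters are independent and uniform over {A, C, G, T}, scores are integers, and $\mathrm{Pv}(M', s)$ is taken as the probability that a random word scores at least $s$ on $M'$ (matching the $\geq$ in conditions 1 and 3). All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

ALPHABET = "ACGT"

def score_distribution(cols):
    """Map each reachable score on `cols` to its probability (uniform i.i.d. letters)."""
    dist = {0: Fraction(1)}
    for col in cols:
        nxt = defaultdict(Fraction)
        for s, pr in dist.items():
            for x in ALPHABET:
                nxt[s + col[x]] += pr / len(ALPHABET)
        dist = nxt
    return dist

def pv(cols, s):
    """Assumed reading of Pv: probability of a score >= s on these columns."""
    return sum(pr for sc, pr in score_distribution(cols).items() if sc >= s)

def joint_distribution(M, p):
    """B(M, p, m-1, beta, delta) for every reachable pair (beta, delta)."""
    dist = {(0, 0): Fraction(1)}
    for i in range(p, len(M)):
        nxt = defaultdict(Fraction)
        for (b, d), pr in dist.items():
            for x in ALPHABET:
                nxt[(b + M[i][x], d + M[i - p][x])] += pr / len(ALPHABET)
        dist = nxt
    return dist

def overlap_probability(M, p, alpha):
    """P(Y_k = 1, Y_{k+p} = 1) as the sum over (beta, delta) given above."""
    m = len(M)
    return sum(pv(M[:p], alpha - b) * pr * pv(M[m - p:], alpha - d)
               for (b, d), pr in joint_distribution(M, p).items())

def exact_matrix(word):
    """Hypothetical helper: score 1 for the letter of `word`, 0 otherwise."""
    return [{x: int(x == ch) for x in ALPHABET} for ch in word]
```

As a sanity check on a case that can be done by hand: for the exact-match matrix of TAT with `alpha = 3` and `p = 2`, the only overlapping pair of occurrences is the text TATAT, so the result is $4^{-5} = 1/1024$. For `p = 1` no overlap is possible and the result is 0.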

The Knuth-Morris-Pratt algorithm
Knuth-Morris-Pratt shifting rule: shift = failure position - length of the longest border not followed by the failure character (tagged border)
- provides longer shifts
[Figure: the pattern T A T C T A T aligned under the text, a failure marked "?", and the pattern shifted by +6.]
for PWMs: $u$ is a tagged border for $M$, $i$ and $\alpha$ if
- $u$ is a border of $M[0..i]$, and
- there exists a letter $x$ of $\Sigma$ such that
  1. $\mathrm{Score}(M[0..\ell], ux) \geq \mathrm{GLB}(M, \ell, \alpha)$, and
  2. $\mathrm{Score}(M[i-\ell..i], ux) < \mathrm{GLB}(M, i, \alpha) - \mathrm{MaxSc}(M[0..i-\ell-1])$,
where $\ell$ is the length of $u$.
computation with a predicate similar to $c$

Another border definition
the x-tagged border: the shift depends on the letter in the text on which the failure occurs
- the shifting table has $|\Sigma|$ rows
$u$ is an x-tagged border for $M$, $i$, and $\alpha$, if
- $u$ is a border of $M[0..i]$ and $\mathrm{GLB}(M, i, \alpha)$ such that
  1. $\mathrm{Score}(M[0..\ell], ux) \geq \mathrm{GLB}(M, \ell, \alpha)$, and
  2. $\mathrm{Score}(M[i-\ell..i], ux) < \mathrm{GLB}(M, i, \alpha) - \mathrm{MaxSc}(M[0..i-\ell-1])$,
where $\ell$ is the length of $u$.
shifts using the tagged border vs. shifts using the x-tagged border (one column per position):
tagged border     1  1  1  1  1  1  1  1  3  4  7  7  7  7  7  7
x-tagged, x = A   1  1  1  1  1  1  2  3  4  5  7  7  8  7  8  8
x-tagged, x = C   1  1  1  1  1  1  1  1  3  4  7  8  9  7  9 11
x-tagged, x = G   1  1  1  1  1  1  2  3  4  5  7  7  7  8 10 10
x-tagged, x = T   1  1  1  1  1  1  3  4  4  7  8  8  7  9 11 11

Experimental results
data:
- on the JASPAR database (123 matrices of transcription factor binding sites)
- on sequences of length 50Mb
4 algorithms tested:
- naive algorithm
- lookahead strategy
- KMP: Knuth-Morris-Pratt
- KMP-AB: Knuth-Morris-Pratt with the x-tagged border

Running times
[Figure: running time (in seconds) of Naive, Lookahead, KMP and KMP-AB at P-value $10^{-5}$; x-axis: matrices (grouped by length).]

Improvements
improving the computation of the next tables
- for $\alpha' < \alpha$, the shifts for $\alpha$ are necessarily greater than or equal to the shifts for $\alpha'$
- following increasing thresholds, compute only for positions where the shifts are greater than 1
P-value      Shifts
$10^{-3}$    1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
$10^{-5}$    1 1 1 1 1 1 1 1 3 4 7 7 7 7 7 7 7
$10^{-7}$    1 1 1 1 1 1 3 4 4 4 8 8 8 8 8 11 11
[Figure: sequence logo of the matrix (WebLogo 3.0), y-axis in bits.]
improving the speed-up: filtration method
- while shift = 1, the shifting rule does not provide any speedup
- idea: pre-compute the score for the slice of the matrix which contains shifts of 1

Running times with improvements
p-value                                             $p = 10^{-7}$   $p = 10^{-5}$   $p = 10^{-3}$
nb. of matrices                                     45              86              122
Naive                                               3.80            3.22            2.81
Lookahead                                           1.41            1.86            2.62
without improvement, searching phase: KMP           0.95            1.13            1.35
without improvement, searching phase: KMP-AB        0.71            0.86            1.34
without improvement, preprocessing phase: KMP       0.019           0.023           0.043
without improvement, preprocessing phase: KMP-AB    0.956           0.513           0.418
with improvements, searching and preprocessing: KMP      0.48       0.63            0.79
with improvements, searching and preprocessing: KMP-AB   0.54       0.65            0.82
Average running time in seconds - Speed-up 3×

Running times with improvements
[Figure: running time (in seconds) of KMP, KMP-AB, KMP with filtration and KMP-AB with filtration at P-value $10^{-5}$; x-axis: matrices (grouped by length).]

Summary
- a generalization of the Morris-Pratt algorithms for scoring matrices
- an algorithm to compute overlapping occurrences
- a moderate speed-up
- but storing the shift table requires little memory

Beckstette, M., Homann, R., Giegerich, R., and Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, (7).
Liefooghe, A., Touzet, H., and Varré, J.-S. (2006). Large scale matching for position weight matrices. LNCS, 4009:401–412.
Pizzi, C., Rastas, P., and Ukkonen, E. (2007). Fast search algorithms for position specific scoring matrices. LNCS, 4414:239–250.
Salmela, L. and Tarhio, J. (2007). Algorithms for weighted matching. LNCS, 4726:276–286.
Wu, T. D., Nevill-Manning, C. G., and Brutlag, D. L. (2000). Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics, 16(3):233–44.

MP - KMP - KMP-AB vs. LSA
[Figure: two panels, P-value = $10^{-3}$ and P-value = $10^{-5}$; number of matrices (y-axis) for which x% of letters are not read compared to the lookahead strategy.]