Self-Overlapping Occurrences and Knuth-Morris-Pratt Algorithm for Weighted Matching

Aude Liefooghe, Hélène Touzet and Jean-Stéphane Varré
LIFL - UMR CNRS 8022 - Université Lille 1, France
INRIA Lille Nord-Europe, France
Sequoia group
http://bioinfo.lifl.fr

April 2nd 2009

Position Weight Matrices (PWMs)

ATTGCGGTC ATTGCGGTT TTTGCGGTC ATTGTGGTT ACTGCGGTT TTTGTGGTC CTTGTGGTC ACTGTGGTT CCTGCGGTT ATTGCGGTA TATGCGGTT GCTGCGGTT AATGTGGTA TTTGTGGAC AAAGTGGTT TTCTTGGTT TCAGTGGGT

1. count the occurrences
2. compute the frequencies with correction
3. compute the log-odds ratio M(i, x) = log2( F(i, x) / p(x) ), with p(x) the background probability of letter x

A   0.60  -0.32  -0.69  -2.89  -2.83  -2.89  -2.89  -1.28  -0.69
C  -0.69   0.15  -1.28  -2.89   0.66  -2.89  -2.89  -2.89  -0.05
G  -1.28  -2.89  -2.89   1.28  -2.83   1.34   1.34  -1.28  -2.89
T   0.32   0.72   1.15  -1.28   0.66  -2.89  -2.89   1.22   0.91

[Figure: sequence logo of the matrix (WebLogo 3.0), y-axis in bits]

Positive score: over-represented letter in the pattern.

JS Varré (http://bioinfo.lifl.fr)

MP & KMP for PWMs

April 2nd 2009

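The three steps above (count, correct, take the log-odds ratio) can be sketched as follows. This is only an illustration: the slides do not say which correction or background is used, so the pseudocount of 1 and the uniform background are assumptions, and `build_pwm` is a hypothetical helper name. The values it produces are therefore not expected to match the matrix shown on the slide.

```python
import math
from collections import Counter


def build_pwm(seqs, alphabet="ACGT", pseudocount=1.0, background=None):
    """Return pwm[i][x] = log2(F(i, x) / p(x)) for aligned sites of equal length.

    Assumptions (not from the slides): additive pseudocount correction and,
    by default, a uniform background distribution p(x).
    """
    if background is None:
        background = {x: 1.0 / len(alphabet) for x in alphabet}
    m = len(seqs[0])
    total = len(seqs) + pseudocount * len(alphabet)
    pwm = []
    for i in range(m):
        counts = Counter(s[i] for s in seqs)               # step 1: count
        column = {}
        for x in alphabet:
            freq = (counts[x] + pseudocount) / total       # step 2: corrected frequency
            column[x] = math.log2(freq / background[x])    # step 3: log-odds ratio
        pwm.append(column)
    return pwm


# A few aligned sites taken from the list above (truncated to 8 letters).
sites = ["TTGCGGTC", "TTGCGGTT", "TTGTGGTT", "CTGCGGTT"]
pwm = build_pwm(sites)
```

With this toy input, column 2 contains only G, so `pwm[2]["G"]` is positive (over-represented letter) and the other letters of that column get a negative score.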

Weighted Pattern Matching

given:
- a weighted pattern: a matrix M of size m × |Σ|, where M(p, x) is the score at position p for the letter x in Σ
- a score threshold α

find the occurrences of the weighted pattern in a text T:
- occurrence: a position p in T such that the score of T_p..T_{p+m−1} is at least the threshold α
- score: Score(M, u) = Σ_{p=0}^{m−1} M(p, u_p)

The threshold is chosen given a P-value: Pv(s) = proportion of words of score greater than s.
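As a minimal sketch of these definitions, the score of a word and the P-value of a threshold can be computed by exhaustive enumeration. The matrix `M` below is a made-up toy example with integer scores, `score` and `p_value` are hypothetical names, and the comparison uses `>=` (an assumption, chosen for consistency with the tests Score ≥ α used on the following slides). Enumerating all |Σ|^m words is only practical for very short matrices.

```python
from itertools import product


def score(M, u):
    """Score(M, u): sum over positions p of M(p, u_p)."""
    return sum(M[p][x] for p, x in enumerate(u))


def p_value(M, s, alphabet="ACGT"):
    """Proportion of words of length m whose score is >= s (all words equally likely)."""
    m = len(M)
    hits = sum(1 for w in product(alphabet, repeat=m) if score(M, w) >= s)
    return hits / len(alphabet) ** m


# Toy matrix (illustrative values only): one dict of scores per position.
M = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": -1, "G": 2, "T": -1},
]
```

For this toy matrix only the word ACG reaches a score of 5 or more, so `p_value(M, 5)` is 1/64.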

Brute-force approach

- the Θ(nm) naive algorithm: compute the full score of every window and test it at the end: Score(M, T_p … T_{p+4}) ≥ α ?
- the O(nm) lookahead strategy [Wu et al., 2000]: stop the computation as soon as the partial score falls below a threshold for a given column; at each position i of the window, test Score(M[0..i], T_p … T_{p+i}) ≥ GLB(M, i, α) ?
  with GLB(M, i, α) = α − MaxSc(M[i+1..m−1]) (Greatest Lower Bound)

[Figure: a matrix with rows A, C, G, T aligned with the text window T_p … T_{p+4} = A C A G G]
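The two scanning strategies can be compared with a short sketch. It assumes the matrix is stored as a list of per-position dictionaries (a representation chosen here for illustration) and uses GLB(M, i, α) = α − MaxSc(M[i+1..m−1]); the function names and the toy matrix are hypothetical.

```python
def glb_table(M, alpha):
    """glb[i] = alpha - MaxSc(M[i+1..m-1]) for every column i."""
    m = len(M)
    best_suffix = [0] * (m + 1)          # best_suffix[i] = MaxSc(M[i..m-1])
    for i in range(m - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] + max(M[i].values())
    return [alpha - best_suffix[i + 1] for i in range(m)]


def naive_scan(M, T, alpha):
    """Compute the full score of every window, then compare with alpha."""
    m = len(M)
    return [p for p in range(len(T) - m + 1)
            if sum(M[i][T[p + i]] for i in range(m)) >= alpha]


def lookahead_scan(M, T, alpha):
    """Abandon a window as soon as its partial score drops below the GLB."""
    m, glb, hits = len(M), glb_table(M, alpha), []
    for p in range(len(T) - m + 1):
        s = 0
        for i in range(m):
            s += M[i][T[p + i]]
            if s < glb[i]:               # even the best completion cannot reach alpha
                break
        else:                            # no break: the last test was s >= alpha
            hits.append(p)
    return hits


# Toy matrix and text (illustrative values only).
M = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": -1, "G": 2, "T": -1},
]
T = "TACGACGT"
```

Both scans report the same positions; the lookahead version simply reads fewer letters on windows that fail early, which does not change the O(nm) worst case.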

How to improve pattern matching with PWMs?

pre-processing the text:
- suffix array [Beckstette et al., 2006],
- filtration algorithms [Pizzi et al., 2007]: look efficiently for potential occurrences and refine the results

pre-processing the matrix:
- expansion of the Shift-Add algorithm [Salmela and Tarhio, 2007]
- pre-compute scores for all possible words [Liefooghe et al., 2006]
  - split the matrix into q sub-matrices
  - compute the score of all words for each sub-matrix
  - complexity: O((m/q) n)
- use a shift table that gives, for each position of the pattern, the next position to test in the text

The shift table requires computing the self-overlapping occurrences of the matrix:
- use an Aho-Corasick automaton [Pizzi et al., 2007]
  - build an Aho-Corasick automaton for the set of words of score ≥ threshold
  - use the Aho-Corasick algorithm to obtain the shift
  - very efficient ... but the automaton becomes very large for long matrices

Self-overlapping occurrences of a word

key definition: border of a word
a border of u of length ℓ is a prefix of length ℓ which is also a suffix of u.

T A T C T A T
borders: TAT, T

Morris-Pratt shifting rule: shift = failure position − length of the longest border

[Figure: the pattern T A T C T A T aligned with the text; after a failure (marked ?), the pattern is shifted by +4]

mpNext[0] = −1
mpNext[i] = ℓ, with ℓ the length of the longest border of the prefix of length i
shifts: [1, 1, 2, 2, 4, 4, 4, 4]
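The table and the shifts listed above can be reproduced with the textbook Morris-Pratt preprocessing. The sketch below is the classical algorithm for plain words, not the PWM variant; the names `mp_next` and `borders` are chosen here for illustration.

```python
def mp_next(x):
    """mp[i] = length of the longest border of x[:i], with mp[0] = -1."""
    m = len(x)
    mp = [-1] * (m + 1)
    j = -1
    for i in range(m):
        # Fall back along shorter and shorter borders until x[i] extends one.
        while j >= 0 and x[i] != x[j]:
            j = mp[j]
        j += 1
        mp[i + 1] = j
    return mp


def borders(x):
    """All non-empty borders of x, longest first."""
    mp = mp_next(x)
    out, length = [], mp[len(x)]
    while length > 0:
        out.append(x[:length])
        length = mp[length]
    return out


pattern = "TATCTAT"
table = mp_next(pattern)
# shift = failure position - length of the longest border
shifts = [i - table[i] for i in range(len(pattern) + 1)]
```

For TATCTAT this gives the borders TAT and T and the shift list [1, 1, 2, 2, 4, 4, 4, 4].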

Self-overlapping occurrences of a PWM

a border for M and α is a word u of Σ^ℓ, with ℓ < m, such that there exist v, w ∈ Σ^{m−ℓ} satisfying:
Score(M, uv) ≥ α, and Score(M, wu) ≥ α.

[Figure: the matrix (rows A, C, G, T) aligned with the word wu, and a shifted copy aligned with uv]

Lemma
Score(M[m−ℓ..m−1], u) ≥ α − MaxSc(M[0..m−ℓ−1])
Score(M[0..ℓ−1], u) ≥ α − MaxSc(M[ℓ..m−1]) = GLB(M, ℓ−1, α)
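To make the definition and the lemma concrete, the sketch below checks whether a word is a border in two ways: directly from the definition (enumerating all v and w) and through the two inequalities of the lemma. The toy matrix and all function names are hypothetical; the enumeration is exponential and only meant for tiny examples.

```python
from itertools import product


def score(M, u):
    """Score of word u against the (sub-)matrix M, position by position."""
    return sum(M[p][x] for p, x in enumerate(u))


def max_score(M):
    """MaxSc(M): best achievable score on the (sub-)matrix M."""
    return sum(max(col.values()) for col in M)


def is_border_by_definition(M, u, alpha, alphabet="ACGT"):
    """u is a border if some uv and some wu both score at least alpha."""
    m, l = len(M), len(u)
    rest = ["".join(w) for w in product(alphabet, repeat=m - l)]
    as_prefix = any(score(M, u + v) >= alpha for v in rest)
    as_suffix = any(score(M, w + u) >= alpha for w in rest)
    return as_prefix and as_suffix


def is_border_by_lemma(M, u, alpha):
    """Same test using the best possible completion instead of enumeration."""
    m, l = len(M), len(u)
    as_suffix = score(M[m - l:], u) >= alpha - max_score(M[:m - l])
    as_prefix = score(M[:l], u) >= alpha - max_score(M[l:])
    return as_prefix and as_suffix


# Toy matrix (illustrative values only).
M = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": -1, "G": 2, "T": -1},
]
```

With α = 4 the words reaching the threshold are ACG, ACA, AGG and TCG, so "A" is a border of length 1 while "C" is not; both tests agree on every candidate word.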

Expansion of the Morris-Pratt shifting rule

define c(M, p, α) as true iff there exists a border of length m − p for M and α

mpNext[i] = i − p, where p is the lowest value such that c(M[0..i−1], p, GLB(M, i−1, α)) is true

[Figure: the matrix (rows A, C, G, T) shifted along the text under the words w and u; the columns not yet scored are marked ?]

- after the shift, we don't know the score of u on M[0..1]
- the complexity remains O(mn)

How to compute the border of a PWM?

let p, i be two positions of M (0 ≤ p ≤ i ≤ m − 1),
b(M, p, i, β, δ) = ∃u ∈ Σ^{i−p+1} : Score(M[p..i], u) = β ∧ Score(M[0..i−p], u) = δ

Lemma
c(M, p, α) is true if, and only if, for each value i ranging from p to m − 1, there exist β and δ such that
1. b(M, p, i, β, δ) = 1, and
2. β ≥ GLB(M, i, α) − MaxSc(M[0..p−1]), and
3. δ ≥ GLB(M, i−p, α).

[Figure: positions p, i − p and i on the matrix M]

b can be computed by DP

Lemma
b(M, p, i, 0, 0) = 1, whenever i < p
b(M, p, i, β, δ) = 0, whenever i < p and (β ≠ 0 or δ ≠ 0)
b(M, p, i, β, δ) = ∨_{x∈Σ} b(M, p, i−1, β − M(i, x), δ − M(i−p, x)), otherwise

- we need only b(M, p, i, β, δ) for β and δ satisfying the conditions of the previous lemma
- the computation can be done by increasing values of i, β and δ
- unfortunately, the computation for a given p cannot be reused for p + 1

Computing b can be time and memory consuming.
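The recurrence can be sketched in a forward form: instead of filling a table indexed by (β, δ), keep the set of pairs for which b is true and extend it one column at a time. This is an illustration under two assumptions: scores are integers (for example after rounding), so that pairs can be compared exactly, and no pruning by the conditions of the previous lemma is applied. The names `overlap_pairs` and `overlap_pairs_bruteforce` and the toy matrix are hypothetical.

```python
from itertools import product


def overlap_pairs(M, p, i, alphabet="ACGT"):
    """Set of (beta, delta) such that b(M, p, i, beta, delta) = 1."""
    pairs = {(0, 0)}                       # base case (i < p): only (0, 0) is reachable
    for j in range(p, i + 1):
        # Appending letter x adds M(j, x) to beta and M(j - p, x) to delta.
        pairs = {(beta + M[j][x], delta + M[j - p][x])
                 for (beta, delta) in pairs for x in alphabet}
    return pairs


def overlap_pairs_bruteforce(M, p, i, alphabet="ACGT"):
    """Same set, obtained by enumerating every word u of length i - p + 1."""
    out = set()
    for u in product(alphabet, repeat=i - p + 1):
        beta = sum(M[p + k][x] for k, x in enumerate(u))    # Score(M[p..i], u)
        delta = sum(M[k][x] for k, x in enumerate(u))       # Score(M[0..i-p], u)
        out.add((beta, delta))
    return out


# Toy matrix with integer scores (illustrative values only).
M = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": -1, "G": 2, "T": -1},
]
```

The set restarts from {(0, 0)} for every p, which reflects the remark above that the computation for a given p cannot be reused for p + 1.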

Computing the probability to observe two overlapping occurrences

define the indicator variable Y_k by Y_k = 1 if M occurs at position k in the text T

we have Y_k = 1 and Y_{k+p} = 1, for some 1 ≤ p ≤ m − 1, if there exist β and δ such that
1. Score(M[0..p−1], T_k..T_{k+p−1}) ≥ α − β,
2. Score(M[p..m−1], T_{k+p}..T_{k+m−1}) = β and Score(M[0..m−p−1], T_{k+p}..T_{k+m−1}) = δ,
3. Score(M[m−p..m−1], T_{k+m}..T_{k+p+m−1}) ≥ α − δ.

we introduce B(M, p, i, β, δ):
B(M, p, i, β, δ) = P(u ∈ Σ^{i−p+1} ; Score(M[p..i], u) = β ∧ Score(M[0..i−p], u) = δ)

we have:
P(Y_k = 1, Y_{k+p} = 1) = Σ_{β,δ} Pv(M[0..p−1], α − β) × B(M, p, m−1, β, δ) × Pv(M[m−p..m−1], α − δ)

The Knuth-Morris-Pratt algorithm

Knuth-Morris-Pratt shifting rule: shift = failure position − length of the longest border not followed by the failure character (tagged border)
- provides longer shifts

[Figure: the pattern T A T C T A T aligned with the text; after a failure (marked ?), the pattern is shifted by +6]

for PWMs: u is a tagged border for M, i and α if
- u is a border of M[0..i], and
- there exists a letter x of Σ such that
  1. Score(M[0..ℓ], ux) ≥ GLB(M, ℓ, α), and
  2. Score(M[i−ℓ..i], ux) < GLB(M, i, α) − MaxSc(M[0..i−ℓ−1]),
  where ℓ is the length of u.

computation with a predicate similar to c
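For plain words, the tagged-border table is the textbook Knuth-Morris-Pratt preprocessing, sketched below for the pattern TATCTAT. Two caveats: this is the classical word algorithm, not the PWM predicate, and it uses the common convention that a position without any tagged border gets the value −1. That convention is an assumption here, so individual shifts need not coincide with the value drawn in the figure. The name `kmp_next` is chosen for illustration.

```python
def kmp_next(x):
    """nxt[i] = length of the longest border of x[:i] not followed by x[i], or -1."""
    m = len(x)
    nxt = [-1] * (m + 1)
    i, j = 0, -1
    while i < m:
        while j >= 0 and x[i] != x[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            # The border of length j is followed by the same letter as the
            # failure position, so it would fail again: reuse its own entry.
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


pattern = "TATCTAT"
table = kmp_next(pattern)
# shift = failure position - length of the longest tagged border
shifts = [i - table[i] for i in range(len(pattern) + 1)]
```

Under this convention the shifts are never shorter than the Morris-Pratt shifts [1, 1, 2, 2, 4, 4, 4, 4] for the same pattern.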

Another border definition

the x-tagged border: the shift depends on the letter in the text on which the failure occurs
- shifting table with |Σ| rows

u is an x-tagged border for M, i and α if u is a border for M[0..i] and GLB(M, i, α) such that
1. Score(M[0..ℓ], ux) ≥ GLB(M, ℓ, α), and
2. Score(M[i−ℓ..i], ux) < GLB(M, i, α) − MaxSc(M[0..i−ℓ−1]),
where ℓ is the length of u.

shifts using the tagged border vs. shifts using the x-tagged border

tagged border:
    1 1 1 1 1 1 1 1 3 4 7 7 7 7 7 7

x-tagged border (first twelve positions):
A:  1 1 1 1 1 1 2 3 4 5 7 7
C:  1 1 1 1 1 1 1 1 3 4 7 8
G:  1 1 1 1 1 1 2 3 4 5 7 7
T:  1 1 1 1 1 1 3 4 4 7 8 8
remaining entries (row and column layout not recoverable): 8 9 7 7 7 7 8 9 8 9 10 11 8 11 10 11

Experimental results

data: the JASPAR database (123 matrices of transcription factor binding sites), on sequences of length 50 Mb

4 algorithms tested:
- naive algorithm
- lookahead strategy
- KMP: Knuth-Morris-Pratt
- KMP-AB: Knuth-Morris-Pratt with the x-tagged border

Running times, P-value 10^−5

[Figure: running time in seconds (0 to 7) of the Naive, Lookahead, KMP and KMP-AB algorithms; x-axis: matrices grouped by length (9 to 30)]

Improvements

improving the computation of the next tables
- for α′ < α, the shifts for α are necessarily greater than or equal to the shifts for α′
- following increasing thresholds, compute only for positions where the shifts are greater than 1

P-value   Shifts
10^−3     1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
10^−5     1 1 1 1 1 1 1 1 3 4 7 7 7 7 7 7 7
10^−7     1 1 1 1 1 1 3 4 4 4 8 8 8 8 8 11 11

[Figure: sequence logo of the matrix (WebLogo 3.0), y-axis in bits]

improving the speed-up: filtration method
- while shift = 1, the shifting rule does not provide any speed-up
- idea: pre-compute the score for the slice of the matrix which contains shifts of 1

Running times with improvements

p-value                            p = 10^−7   p = 10^−5   p = 10^−3
nb. of matrices                        45          86         122

Searching phase
  Naive                               3.80        3.22        2.81
  Lookahead                           1.41        1.86        2.62
  KMP (without improvement)           0.95        1.13        1.35
  KMP-AB (without improvement)        0.71        0.86        1.34
Preprocessing phase
  KMP (without improvement)           0.019       0.023       0.043
  KMP-AB (without improvement)        0.956       0.513       0.418
Searching and preprocessing
  KMP (with improvements)             0.48        0.63        0.79
  KMP-AB (with improvements)          0.54        0.65        0.82

Average running time in seconds - Speed-up 3×

Running times with improvements, P-value 10^−5

[Figure: running time in seconds (0 to 5) of KMP, KMP-AB, KMP with filtration and KMP-AB with filtration; x-axis: matrices grouped by length (9 to 30)]

Summary

- a generalization of the Morris-Pratt algorithms for scoring matrices
- an algorithm to compute overlapping occurrences
- a moderate speed-up, but storing the shift table requires little memory

Beckstette, M., Homann, R., Giegerich, R., and Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7.

Liefooghe, A., Touzet, H., and Varré, J.-S. (2006). Large scale matching for position weight matrices. LNCS, 4009:401–412.

Pizzi, C., Rastas, P., and Ukkonen, E. (2007). Fast search algorithms for position specific scoring matrices. LNCS, 4414:239–250.

Salmela, L. and Tarhio, J. (2007). Algorithms for weighted matching. LNCS, 4726:276–286.

Wu, T. D., Nevill-Manning, C. G., and Brutlag, D. L. (2000). Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics, 16(3):233–44.

MP - KMP - KMP-AB vs. LSA

number of matrices (y-axis) for which x% of letters are not read, compared to the lookahead strategy

[Figure: two panels, P-value = 10^−3 (122 matrices) and P-value = 10^−5 (86 matrices); y-axis: nb. of matrices]