String Algorithms

Maxime Crochemore
King's College London & Université Paris-Est

M.C.

PhD Univ. Warsaw

Algorithms and combinatorics on words

⋆ Links between combinatorial properties of words and algorithms on words:
  – some algorithms are based on combinatorial properties
  – combinatorics is sometimes used to evaluate the efficiency of algorithms

⋆ Examples:
  – Text searching
  – Fragment assembly and shortest common superstring
  – Text indexing and suffix arrays
  – Text compression and permutations

⋆ Other examples in Algorithms on Strings [Crochemore, Hancart, Lecroq, 2007], [Crochemore, Rytter, 1994]
  http://www.dcs.kcl.ac.uk/staff/mac/

⋆ Combinatorial aspects in Applied Combinatorics on Words [Lothaire, 2004], [Lothaire, 2002]
  http://igm.univ-mlv.fr/~berstel/Lothaire/index.html

Periods and borders of words

⋆ Non-empty string u, integer p, 0 < p ≤ |u|

⋆ p is a period of u if any of these equivalent conditions is satisfied:
  – u[i] = u[i + p], for 1 ≤ i ≤ |u| − p
  – u is a prefix of some y^k, k > 0, |y| = p
  – u = yw = wz, for some strings y, z, w with |y| = |z| = p
  String w is called a border of u

⋆ period(u) = smallest period of u (can be |u|)
  border(u) = longest proper border of u (can be empty)

⋆ Periods and borders of abaabaa
    period   border
    3        abaa
    6        a
    7        empty string
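The correspondence between periods and borders can be checked with a short sketch. It assumes the classical border table (the function names `border_table` and `borders_and_periods` are illustrative, not taken from the slides) and lists each border of u together with the period |u| − |border|.

```python
def border_table(u):
    """border[i] = length of the longest proper border of u[:i]; border[0] = -1."""
    border = [-1] * (len(u) + 1)
    k = -1
    for i, c in enumerate(u):
        # Follow shorter and shorter borders until one can be extended by c.
        while k >= 0 and u[k] != c:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def borders_and_periods(u):
    """(period, border) pairs of u, from the longest border to the empty one."""
    table = border_table(u)
    pairs = []
    k = table[len(u)]
    while k >= 0:
        pairs.append((len(u) - k, u[:k]))
        k = table[k]
    return pairs


print(borders_and_periods("abaabaa"))
```

On abaabaa this enumerates the three period/border pairs of the example above.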

Periodicity Lemma

Lemma 1 (Periodicity Lemma) If p and q are periods of a word x and satisfy p + q − GCD(p, q) ≤ |x|, then GCD(p, q) is a period of x. [Fine, Wilf, 1965], see Algebraic Combinatorics on Words [Lothaire, 2002]

[Figure: example words with two periods, beginning a b a a b a b a a b a b a a b ··· and a b a a b a b a a b a a b a a b a b a a b a a b a b ···]

Used in the analysis of the KMP algorithm and of many other pattern matching algorithms.

Lemma 2 (Weak Periodicity Lemma) If p and q are periods of a word x and satisfy p + q ≤ |x|, then GCD(p, q) is a period of x.
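As a sanity check, the Periodicity Lemma can be verified by brute force on all short binary words. The sketch below is only an experiment (the helper names are illustrative); the word abaababaaba used for the tightness check is an assumed example of length p + q − GCD(p, q) − 1 with p = 5 and q = 8, not necessarily the one drawn on the slide.

```python
from itertools import product
from math import gcd


def periods(x):
    """All periods p of x, 0 < p <= |x| (0-based version of x[i] = x[i + p])."""
    n = len(x)
    return [p for p in range(1, n + 1)
            if all(x[i] == x[i + p] for i in range(n - p))]


def lemma_holds(x):
    """Check the Periodicity Lemma on every pair of periods of x."""
    ps = periods(x)
    return all(gcd(p, q) in ps
               for p in ps for q in ps
               if p + q - gcd(p, q) <= len(x))


# Exhaustive check on binary words of length 1 to 10.
all_ok = all(lemma_holds("".join(w))
             for n in range(1, 11)
             for w in product("ab", repeat=n))
print(all_ok)

# One letter short of the bound, the conclusion can fail.
print(periods("abaababaaba"))
```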

Proof of the weaker statement

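⋆ p and q periods of x with p + q ≤ |x| and p > q

⋆ p − q is a period of x: for a position i with i + p − q inside x,
  – either i + p is a position of x, and x[i] = x[i + p] = x[i + p − q],
  – or i − q is a position of x (because p + q ≤ |x|), and x[i] = x[i − q] = x[i − q + p]

[Figure: word x with letters a, b, c placed at distances p, q and p − q, for the two cases]

⋆ The rest like Euclid's induction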

On-line String Matching

[Figure: pattern x aligned with text y; a prefix u of x matches the text, then letter a of x mismatches letter c of y; x is shifted so that an occurrence of u in x stays aligned with the matched part of y (compatible shift)]

⋆ Shift compatible with the match: shift = period(u) [Morris, Pratt, 1969]

    text y        · · a b a a b a c · · · · · · ·
    pattern x         a b a a b a a
    shifted x               a b a a b a a

⋆ Same, and not incompatible with c [Knuth, Morris, Pratt, 1977]

⋆ Best shift = period(uc) [Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993]
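The Morris-Pratt shift can be sketched as follows. This is a minimal illustration, assuming a non-empty pattern and the usual border table; the names `border_table` and `mp_search` are hypothetical. When the matched prefix u has length k, replacing k by border[k] is the same as shifting the pattern by period(u) = k − border[k].

```python
def border_table(x):
    """border[i] = length of the longest proper border of x[:i]; border[0] = -1."""
    border = [-1] * (len(x) + 1)
    k = -1
    for i, c in enumerate(x):
        while k >= 0 and x[k] != c:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def mp_search(x, y):
    """Starting positions of the occurrences of non-empty x in y."""
    m = len(x)
    border = border_table(x)
    positions = []
    k = 0  # length of the prefix u of x currently matched
    for j, c in enumerate(y):
        # Mismatch (or full match already reported): shift by period(u).
        while k >= 0 and (k == m or x[k] != c):
            k = border[k]
        k += 1
        if k == m:
            positions.append(j - m + 1)
    return positions


print(mp_search("abaabaa", "ababaabacabaabaabaa"))
```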

Delay

⋆ Delay: maximum number of comparisons on a text letter

⋆ MP algorithm: delay ≤ |x|

⋆ KMP algorithm: delay ≤ log_Φ(|x| + 1)
  Proof by the Periodicity Lemma
  The bound is tight

[Figure: text y = · · a b a a b a c · with the successive alignments of pattern x tried on letter c]

⋆ Simon-Hancart algorithm: use of the string-matching automaton
  delay ≤ min(1 + log2 |x|, card A)

[Figure: text y and the successive alignments of pattern x for the Simon-Hancart algorithm]

Searching with an automaton

⋆ Uses the string-matching automaton SMA(x): smallest deterministic automaton accepting A∗x

⋆ Example x = abaa

[Figure: SMA(abaa), states 0 to 4, forward arcs spelling a b a a, other arcs labelled a or b going back to earlier states]

⋆ Search for abaa in:

    text        b a b b a a b a a b a a b b a · · ·
    state     0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·
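The automaton search can be reproduced with a small sketch. It assumes the standard on-line construction by unwinding arcs, stores the full transition table (not only the significant arcs), and uses illustrative names `build_sma` and `run`.

```python
def build_sma(x, alphabet):
    """Transition table of SMA(x): delta[state][letter] -> state."""
    delta = [{a: 0 for a in alphabet}]  # initial state loops on every letter
    terminal = 0
    for c in x:
        r = delta[terminal][c]          # target of the arc to unwind
        delta[terminal][c] = len(delta)  # new forward arc
        delta.append(dict(delta[r]))    # the new state copies the arcs of r
        terminal = len(delta) - 1
    return delta


def run(delta, text):
    """Sequence of states visited while reading text, starting from state 0."""
    state, states = 0, [0]
    for c in text:
        state = delta[state][c]
        states.append(state)
    return states


sma = build_sma("abaa", "ab")
print(run(sma, "babbaabaabaabba"))
```

An occurrence of x ends wherever the state equals |x| (state 4 here).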

Construction of SMA(x)

⋆ Unwinding arcs

⋆ From SMA(abaa) . . .

[Figure: SMA(abaa), states 0 to 4]

. . . to SMA(abaab)

[Figure: SMA(abaab), states 0 to 5, obtained by unwinding the arc labelled b that leaves state 4]

Complexity

⋆ Time and space optimization: implementation of significant arcs only
  – Forward arcs: spell the pattern
  – Backward arcs: arcs going backwards without reaching the initial state

Lemma 3 SMA(x) contains at most |x| backward arcs.

⋆ Consequences:
  – implementation of SMA(x) in O(|x|) space
  – construction in O(|x|) time, independently of the alphabet size

⋆ There is a strategy on the choice of arcs for which: delay ≤ min(1 + log2 |x|, card A) [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1998]

Significant arcs

⋆ Complete SMA(ananas)

[Figure: complete automaton with states 0 to 6, forward arcs spelling a n a n a s, and all other arcs on a, n, s]

⋆ Forward arcs: spell the pattern

⋆ Backward arcs: arcs going backwards without reaching the initial state

[Figure: SMA(ananas) restricted to its forward and backward arcs]

Lemma 4 SMA(x) contains at most |x| backward arcs.

⋆ Consequence: the implementation of SMA(x) can be done in O(|x|) time and space, independently of the alphabet size

Backward arcs in SMA

⋆ States of SMA(x) are identified with prefixes of x
  A backward arc is of the form (u, τ, vτ) (u, v prefixes of x, τ symbol) with
  – vτ longest suffix of uτ that is a prefix of x, and u ≠ v
  Note: uτ is not a prefix of x
  Let p(u, τ) = |u| − |v|; it is a period of u because v is a border of u

[Figure: x with prefix u followed by σ; v occurs as a prefix of x followed by τ and as a suffix of u; the two occurrences of v are at distance p(u, τ)]

⋆ Backward arcs to periods: p is injective
  Each period p, 1 ≤ p ≤ |x|, corresponds to at most one backward arc, thus there are at most |x| such arcs

⋆ A worst case: SMA(ab^{m−1}) has m backward arcs (a ≠ b)
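The bound on backward arcs can be observed experimentally. The sketch below builds the complete transition table (an assumption made for simplicity; a space-efficient implementation would store significant arcs only) and counts the arcs that are neither forward arcs nor arcs to the initial state. The function names are illustrative.

```python
from itertools import product


def build_sma(x, alphabet):
    """Complete transition table of SMA(x), built by unwinding arcs."""
    delta = [{a: 0 for a in alphabet}]
    terminal = 0
    for c in x:
        r = delta[terminal][c]
        delta[terminal][c] = len(delta)
        delta.append(dict(delta[r]))
        terminal = len(delta) - 1
    return delta


def backward_arcs(x):
    """Arcs (source, letter, target) with target != 0 that are not forward arcs."""
    delta = build_sma(x, sorted(set(x)))
    return [(s, a, t)
            for s, arcs in enumerate(delta)
            for a, t in arcs.items()
            if t != 0 and t != s + 1]


print(len(backward_arcs("ananas")), len(backward_arcs("abbb")))

# Lemma check on all binary words of length 1 to 8.
bound_ok = all(len(backward_arcs("".join(w))) <= n
               for n in range(1, 9)
               for w in product("ab", repeat=n))
print(bound_ok)
```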

Backward arcs (followed)

⋆ Proof that p is injective
  Two backward arcs (u, τ, vτ), (u′, τ′, v′τ′)
  Assume p(u, τ) = p(u′, τ′) = p; we prove u = u′ and τ = τ′.

[Figure: x with the occurrences of v and v′ at distance p, followed respectively by τ, σ and τ′, σ′]

⋆ If v = v′ then u = u′ and also τ = τ′

⋆ If v′ is a proper prefix of v then v′τ′ and v′σ′ are prefixes of v, thus τ′ = σ′, a contradiction

Repetition

⋆ Repetition
  (w, w) is a repetition of (u, v) if w ≠ ε and both:
  – w suffix of u or u suffix of w, and
  – w prefix of v or v prefix of w
  |w| is a local period of (u, v)

⋆ Local Period
  localperiod(u, v) = minimum local period of (u, v)

[Figure: repetitions at several cut positions of the word a b a b b a b b a b a b b a b a]

Maximal Local Period

⋆ Word of period 5, with the local period at the cut before each letter

    a b a b b a
    1 2 2 5 1 3

⋆ Note: localperiod(u, v) ≤ period(uv)

⋆ (u, v) is a critical factorization of uv if localperiod(u, v) = period(uv)
  localperiod(u, v) is then maximal among all local periods

⋆ Computation of all local periods in linear time [Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]
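Local periods can be computed directly from the definition of a repetition. The sketch below is a naive cubic-time illustration (the linear-time method cited above is not reproduced); `local_period(x, cut)` is a hypothetical helper where `cut` is |u| for the factorization x = uv.

```python
def period(x):
    """Smallest period of a non-empty word x."""
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def local_period(x, cut):
    """Smallest r > 0 such that x[j] == x[j + r] whenever j < cut <= j + r
    and both positions lie inside x (u = x[:cut], v = x[cut:])."""
    n = len(x)
    for r in range(1, n + 1):
        if all(x[j] == x[j + r]
               for j in range(max(0, cut - r), min(cut, n - r))):
            return r


def local_periods(x):
    return [local_period(x, cut) for cut in range(len(x))]


print(period("ababba"), local_periods("ababba"))
```

On ababba the cut with local period 5 is the only critical one.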

Critical Factorization

Theorem 1 (Critical Factorization Theorem) Any non-empty word x can be factorized into u · v with both:
• |u| < period(x), and
• localperiod(u, v) = period(x).
[Cesari, Vincent, Duval, 1983]

[Figure: critical factorization marked on the word a b a b b a b b a b a b b a b a]

Leads to a time-space optimal string-matching algorithm: the two-way algorithm

Two-Way String Matching

[Figure: pattern x = uv aligned with text y; a mismatch on letter c after scanning a prefix w of v gives a shift of length |wc|; after a complete scan of v, the scan of u ends with a shift of length period(x)]

⋆ Time-space optimality
  – Search time: linear time (≤ 2n comparisons) with constant extra space
  – Preprocessing time: likewise, based on the next theorem
  [Crochemore, Perrin, 1992]

⋆ Other solutions [Galil, Seiferas, 1983], [Crochemore, 1992], [Crochemore, Rytter, 1994], [Rytter, 2002]

Two-way string matching (followed)

Searching for pattern x = x[0 . . m − 1] in text y = y[0 . . n − 1]

⋆ (u, v) critical factorization of x: uv = x and |u| < localperiod(u, v) = period(x)

⋆ Searching
  – search for an occurrence of v by left-to-right scan
    if mismatch: shift length = scan length, else
  – search for a preceding occurrence of u by right-to-left scan
    shift length = period of x

Example

⋆ Critical factorization of the pattern a b a b b a b a: u = a b a, v = b b a b a, period = 5

⋆ Searching

    text y     a a a b b b b b a a b b a b a b b a b a b a . .
    pattern x  a b a b b a b a

Example (1)

    text y     a a a b b b b b a a b b a b a b b a b a b a . .
    pattern x  a b a b b a b a

    left-to-right scan length = 3
    shift length = 3

Example (2)

    text y     a a a b b b b b a a b b a b a b b a b a b a . .
    pattern x        a b a b b a b a

    left-to-right scan length = 4
    shift length = 4

Example (3)

    text y     a a a b b b b b a a b b a b a b b a b a b a . .
    pattern x                a b a b b a b a

    left-to-right scan, then right-to-left scan
    shift length = period = 5

Example (4)

    text y     a a a b b b b b a a b b a b a b b a b a b a . .
    pattern x                          a b a b b a b a

    left-to-right and right-to-left scans: occurrence found
    next shift length = period = 5

Two-way string matching

Searching for pattern x = x[0 . . m − 1] in text y = y[0 . . n − 1]

⋆ (u, v) critical factorization of x: uv = x and |u| < localperiod(u, v) = period(x)

⋆ Searching
  – search for an occurrence of v by left-to-right scan
    if mismatch: shift length = scan length, else
  – search for a preceding occurrence of u by right-to-left scan
    shift length = period of x

⋆ Preprocessing
  – compute factorization (u, v)
  – compute the (smallest) period of x

⋆ Algorithmic complexity
  – total linear time; searching in less than 2n comparisons
  – only constant additional memory space required
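The searching scheme above can be sketched as follows. This is a simplified illustration under stated assumptions: the period and the critical factorization are found by brute force (not by the linear-time preprocessing), and the memorization that yields the 2n comparison bound is omitted, so only the shift rules are shown. All function names are illustrative.

```python
def period(x):
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def local_period(x, cut):
    n = len(x)
    for r in range(1, n + 1):
        if all(x[j] == x[j + r]
               for j in range(max(0, cut - r), min(cut, n - r))):
            return r


def two_way_search(x, y):
    """Positions of non-empty x in y, using the two shift rules of the slides."""
    m, n = len(x), len(y)
    p = period(x)
    # Brute-force critical factorization with |u| < period(x).
    cut = next(c for c in range(p) if local_period(x, c) == p)
    occurrences, pos = [], 0
    while pos + m <= n:
        k = cut
        while k < m and x[k] == y[pos + k]:     # left-to-right scan of v
            k += 1
        if k < m:
            pos += k - cut + 1                  # shift length = scan length
            continue
        k = cut - 1
        while k >= 0 and x[k] == y[pos + k]:    # right-to-left scan of u
            k -= 1
        if k < 0:
            occurrences.append(pos)
        pos += p                                # shift length = period of x
    return occurrences


print(two_way_search("ababbaba", "aaabbbbbaabbababbababa"))
```

On the pattern and text of the example slides, the window moves by 3, 4 and 5 positions before the occurrence is reported.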

Orderings

⋆ Orderings
  ≤ lexicographic ordering based on the ordering ≤ of the alphabet
  ≼ lexicographic ordering based on the inverse ordering ≤^{−1} of the alphabet

Theorem 2 Let x be a non-empty word.
Let x = uv with v = suffix of x that is maximal for ≤.
Let x = u′v′ with v′ = suffix of x that is maximal for ≼.
If |v| ≤ |v′| then (u, v) is a critical factorization of x, otherwise (u′, v′) is.
Moreover, |u| < period(x) and |u′| < period(x).
[Crochemore, Perrin, 1992]

[Figure: the two maximal suffixes marked on the words a b a a b a a and a b a b a a b b a b a b a]
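Theorem 2 can be tried out with a naive sketch: both maximal suffixes are found by direct comparison of all suffixes (quadratic, for illustration only), and the cut of the shorter one is checked against the brute-force definitions of period and local period. The helper names are illustrative.

```python
from itertools import product


def period(x):
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def local_period(x, cut):
    n = len(x)
    for r in range(1, n + 1):
        if all(x[j] == x[j + r]
               for j in range(max(0, cut - r), min(cut, n - r))):
            return r


def critical_cut(x):
    """|u| for the critical factorization (u, v) given by Theorem 2."""
    n = len(x)
    # Start of the maximal suffix for the ordering of the alphabet.
    start = max(range(n), key=lambda i: x[i:])
    # Same for the inverse ordering of the alphabet (a prefix stays smaller).
    start_inv = max(range(n), key=lambda i: [-ord(c) for c in x[i:]])
    # The shorter maximal suffix starts at the larger position.
    return max(start, start_inv)


theorem_ok = all(
    local_period(x, critical_cut(x)) == period(x) and critical_cut(x) < period(x)
    for n in range(1, 7)
    for x in ("".join(w) for w in product("abc", repeat=n)))
print(critical_cut("ababbaba"), theorem_ok)
```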

Proof (1)

Four cases; w is the shortest repetition at (u, v)

⋆ w suffix of u and v prefix of w
  v < w < wv, impossible

Proof (2)

⋆ w suffix of u and w prefix of v, v = wz
  z < v implies v = wz < wv, impossible

Proof (3)

⋆ u suffix of w and v prefix of w
  (u, v) is a critical factorization

Proof (4)

⋆ u suffix of w and w prefix of v, v = wz
  z < v; yz ≺ yv implies z ≺ v; z is a prefix of v, then a border of v
  |w| is a period of v and of x.

[Figure: the four relative positions of w with respect to u and v in x]

Computing Maximal Suffixes

⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w

    x = u v,  v = w w w′

    a b a c b c b a c b c b a c b c
    u     w         w         w′

Computing Maximal Suffixes (followed)

⋆ Match: the periodicity continues

    a b a c b c b a c b c b a c b c b
    u     w         w         new w′

⋆ Smaller letter: new w, border-free

    a b a c b c b a c b c b a c b c a
    u     new w

⋆ Greater letter: new u and recomputation on the rest

    a b a c b c b a c b c b a c b c c
    new u, then recomputation
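The three cases above (match, smaller letter, greater letter) correspond to the branches of the following sketch. It is written in the style of the maximal-suffix routine of the two-way algorithm, with variable names chosen here for illustration: `ms + 1` is the start of the current maximal suffix, `p` its period, and `j + k` the position of the next letter to read.

```python
def maximal_suffix(x, inverse=False):
    """(start, period) of the maximal suffix of x, in linear time.
    With inverse=True the inverse ordering of the alphabet is used."""
    n = len(x)
    ms, j, k, p = -1, 0, 1, 1
    while j + k < n:
        a, b = x[j + k], x[ms + k]
        if inverse:
            a, b = b, a
        if a < b:
            # Smaller letter: the suffix read so far becomes one border-free period.
            j += k
            k = 1
            p = j - ms
        elif a == b:
            # Match: the periodicity continues.
            if k != p:
                k += 1
            else:
                j += p
                k = 1
        else:
            # Greater letter: restart with a new candidate suffix.
            ms = j
            j = ms + 1
            k = p = 1
    return ms + 1, p


print(maximal_suffix("abacbcbacbcbacbc"))
```

On the word of the slide, the maximal suffix starts after u = aba and has period |w| = 5.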

Perfect factorization

Theorem 3 Any non-empty word x can be factorized into u · v with both:
• |u| < 2 period(v), and
• v starts with at most one cube of a primitive word.
[Galil, Seiferas, 1983], [Crochemore, Rytter, 1994], [Mignosi, Restivo, Salemi, 1995], see [Lothaire, 2002]

⋆ Example: word with period = 10

    a a a b a a b a a b a a a b a a b a a b a a a b a a ···

⋆ Leads to a time-space optimal string-matching algorithm

Square prefixes

[Figure: square w w with the squares v v and u u as prefixes]

Lemma 5 (Three prefix squares) If u² is a prefix of v², v² is a prefix of w², and u is primitive, then |u| + |v| ≤ |w|. [Crochemore, Rytter, 1995]

    |w| = 10   a a b a a b a a a b a a b a a b a a a b
    |v| = 7    a a b a a b a a a b a a b a
    |u| = 3    a a b a a b

Squares in a word

Lemma 6 ([Fraenkel, Simpson, 1998]) No more than 2n distinct squares of primitive root occur in a word of length n.

[Figure: three squares u², v², w² with the same rightmost starting position in y: impossible]

Direct proofs [Hickerson, 2004], [Ilie, 2005]
Best bound: 2n − Θ(log n) [Ilie, 2005]
Computation in linear time [Gusfield, Stoye, 1998]

Lemma 7 ([Crochemore, 1981], [Gusfield, Stoye, 1999]) Maximal number of occurrences of squares (with primitive root): c n log n. Maximum reached by Fibonacci strings.

Theorem 4 ([Kolpakov, Kucherov, 1998]) Linear number of occurrences of runs (maximal powers) in a word.

Sequencing and fragment assembly

⋆ "Reading" a biological molecular sequence

⋆ Straight sequencing for short fragments: ≤ 500 bases

⋆ Splicing long sequences, "shotgun sequencing" ⟹ reconstruction problem

⋆ Difficulties:
  – loss of orientation (double-stranded DNA)
  – overlaps with errors between fragments

⋆ Formalization: F set of fragments, ε ∈ [0, 1[
  compute s, a short sequence for which
  ∀f ∈ F, ∃a fragment of s with d(a, f) ≤ ε|a| or d(a, f̃) ≤ ε|a|

Shortest Common Superstring

⋆ Problem: let F = {f1, f2, . . . , fn} be a factor code
  Compute a shortest word s in which each fi occurs

⋆ SCS(F) = |s|

[Figure: a superstring s over the letters A and T, with the occurrences of the fragments f1 to f6 aligned below it]

Lemma 8 Computing SCS(F) is NP-hard.

Greedy Approximation

⋆ while card F > 1 do
      assemble two fragments having the longest overlap

Theorem 5 The sequence s produced by the greedy algorithm satisfies |s| ≤ 4 × SCS(F).

⋆ Conjecture: |s| ≤ 2 × SCS(F)

⋆ Variations based on the computation of a Hamiltonian cycle of minimal weight
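The greedy loop can be sketched in a few lines. This is an illustration under the assumption that F is a factor code (no fragment occurs inside another) and is non-empty; overlaps are computed naively, ties are broken arbitrarily, and the fragment set in the usage line is only an example.

```python
def overlap(f, g):
    """Length of the longest suffix of f that is a prefix of g (naive)."""
    for k in range(min(len(f), len(g)), 0, -1):
        if f.endswith(g[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge two fragments having the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        k, i, j = max((overlap(f, g), i, j)
                      for i, f in enumerate(frags)
                      for j, g in enumerate(frags)
                      if i != j)
        merged = frags[i] + frags[j][k:]
        frags = [f for t, f in enumerate(frags) if t not in (i, j)] + [merged]
    return frags[0]


F = ["ATAT", "TATT", "TTAT", "TATA"]
s = greedy_superstring(F)
print(s, len(s))
```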

Graph of Overlaps

⋆ F = {f1, f2, . . . , fn}, factor code

⋆ Overlaps: ov(fi, fj) = max{|v| | fi = uv and fj = vw}

⋆ Nodes = {f1, f2, . . . , fn}
  Arcs = {(fi, ov(fi, fj), fj) | 1 ≤ i, j ≤ n, i ≠ j}

⋆ ov(f, g) computed in time O(min(|f|, |g|)) [Morris and Pratt, 1970]

    s  = A T A T T A T A T      overlap with next
    f1 = A T A T                3
    f2 =   T A T T              2
    f3 =       T T A T          3
    f4 =         T A T A        3
    f1 =           A T A T

[Figure: graph with the cycle f1 → f2 → f3 → f4 → f1, arcs weighted 3, 2, 3, 3]
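One way to obtain the overlap within the stated time bound is to run a Morris-Pratt scan of a prefix of g over a suffix of f, both truncated to min(|f|, |g|) letters. The sketch below follows that idea; it is an assumed implementation, not necessarily the one intended on the slide, and the function names are illustrative.

```python
def border_table(x):
    border = [-1] * (len(x) + 1)
    k = -1
    for i, c in enumerate(x):
        while k >= 0 and x[k] != c:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def overlap(f, g):
    """ov(f, g): length of the longest suffix of f that is a prefix of g."""
    m = min(len(f), len(g))
    g, f = g[:m], f[len(f) - m:]   # only m letters of each word matter
    border = border_table(g)
    k = 0                           # longest prefix of g ending at the current letter of f
    for c in f:
        while k >= 0 and (k == m or g[k] != c):
            k = border[k]
        k += 1
    return k


F = ["ATAT", "TATT", "TTAT", "TATA"]
print([overlap(F[i], F[(i + 1) % 4]) for i in range(4)])
```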

Graph of Prefixes

⋆ F = {f1, f2, . . . , fn}, factor code

⋆ Prefixes: pr(fi, fj) = |fi| − ov(fi, fj)

⋆ Nodes = {f1, f2, . . . , fn}
  Arcs = {(fi, pr(fi, fj), fj) | 1 ≤ i, j ≤ n, i ≠ j}

    s  = A T A T T A T A T      prefix length
    f1 = A T A T                1
    f2 =   T A T T              2
    f3 =       T T A T          1
    f4 =         T A T A        1
    f1 =           A T A T
                                total = 5

[Figure: graph with the cycle f1 → f2 → f3 → f4 → f1, arcs weighted 1, 2, 1, 1]

Technique

⋆ A cycle (f_{i_1}, f_{i_2}, . . . , f_{i_k}, f_{i_1}) produces a superstring s of length:

    |s| = \sum_{j=1}^{k} |f_{i_j}| − \sum_{j=1}^{k−1} ov(f_{i_j}, f_{i_{j+1}})
        = \sum_{j=1}^{k−1} pr(f_{i_j}, f_{i_{j+1}}) + |f_{i_k}|

⋆ Computation of a Hamiltonian cycle of maximal weight in the graph of overlaps

⋆ . . . or of minimal weight in the graph of prefixes

⋆ NP-hard problems

3 × OPT Approximation

⋆ Based on cycle covers of the graph

⋆ Polynomial running time

⋆ Schema:
  (i) compute a cycle cover of minimal weight for the graph of prefixes
  (ii) open cycles
  (iii) s = assembly with overlaps of words obtained in (ii)

⋆ Approximation: |s| ≤ 3 × SCS(F)

⋆ Proof based on the Overlap Lemma [Blum, Jiang, Li, Tromp, Yannakakis, 1994]

Periodicities

⋆ Word x, integer p, 0 < p ≤ |x|; p is a period of x if x[i] = x[i + p]

⋆ period(x) = smallest period of x

Lemma 9 (Overlap Lemma) Let u, v be words, p = period(u), q = period(v). If u[1 . . p] and v[1 . . q] are not conjugate, ov(u, v) < p + q − GCD(p, q).

⋆ Example: period(baabaabaa) = 3, period(aabaaabaaaba) = 4
  ov(baabaabaa, aabaaabaaaba) = |aabaa| = 5 < 3 + 4 − 1

⋆ Straight corollary of the Periodicity Lemma

Better Approximation

⋆ Based on a more efficient cycle opening

⋆ Schema:
  (i) compute a cycle cover of minimal weight for the graph of prefixes
  (ii) open cycles at positions deduced from the Rotation Lemma
  (iii) s = assembly with overlaps of words obtained in (ii)

⋆ Approximation: |s| ≤ (2 + 2/3) × SCS(F)

⋆ Proof based on the Rotation Lemma [Breslauer, Jiang, Jiang, 1996]

Rotation Lemma

Lemma 10 (Rotation Lemma) Let x and p = period(x) be such that |x| ≥ 3p. There is a factorization u · v of x with both:
• |u| < p, and
• each prefix w of v with q = period(w) < p satisfies |w| ≤ (2/3)(p + q).
[Breslauer, Jiang, Jiang, 1996]

[Figure: x = y y y with |y| = p; after the cut, a prefix w of v with period q and length ≤ (2/3)(p + q)]

⋆ The bound is tight: at each position < 2n + 3 of x = (a^n b a^{n+1} b)^3 starts a factor of length ≥ 2n + 2 having a period ≥ n + 1

Choice of the rotation

⋆ y³ prefix of x with |y| = period(x) = p

⋆ ≤ ordering on the alphabet, ≤^{−1} inverse ordering

⋆ r conjugate of y that is a Lyndon word according to ≤

⋆ t conjugate of y that is a Lyndon word according to ≤^{−1}

⋆ i first position of r on x, j first position of t on x

⋆ If i < j and j − i ≤ p/2, we choose u such that y = uu′, r = u′u

[Figure: x = y y y with the first occurrence of r at position i and the first occurrence of t at position j]

⋆ If i < j and j − i > p/2, we choose u such that y = uu′, t = u′u

Proof

⋆ r Lyndon word and j − i ≤ p/2

[Figure: x = y y y, occurrence of r at position i and of t at position j, prefix w of the rotated word with period q]

  – |w| < p since q < p and r is border-free
  – If q ≥ p/2, then |w| < (2/3)(p + q)
  – Else, q < p/2, and |w| < j − i + q (Critical Factorization at j) and j − i + q ≤ p/2 + q < (2/3)(p + q), QED

⋆ If j − i > p/2, replace r by t and t by the next occurrence of r, which brings back to the first case

Text Indexing

⋆ Set of factors (subwords) of a static text

⋆ Basic operations
  – existence of patterns in the text
  – number of occurrences of patterns
  – list of positions of occurrences

⋆ Other applications
  – finding repetitions in texts
  – finding regularities in texts
  – approximate matchings
  – two-dimensional pattern matching
  – ...

Implementation of indexes

[Figure: a pattern aligned at the beginning of a suffix of the text]

Implementation with efficient data structures

⋆ Suffix Trees
  digital trees, PATRICIA trees (compact trees)

⋆ Suffix Automata or DAWG's
  minimal automata, compact automata

Implementation with efficient algorithm

⋆ Suffix Arrays
  binary search in the ordered list of suffixes

Suffixes

Text y ∈ A∗ of length n

⋆ Suff(y) = set of suffixes of y

⋆ Suff(ababbb)

    i      0 1 2 3 4 5
    y[i]   a b a b b b

    position   suffix
    0          a b a b b b
    1          b a b b b
    2          a b b b
    3          b b b
    4          b b
    5          b
    6          ε (empty string)

Suffix Structures

⋆ Suffix trie of ababbb

[Figure: suffix trie of ababbb]

⋆ Its suffix tree

[Figure: suffix tree of ababbb, with arcs labelled by factors such as ab, b, abbb, bb]

⋆ Its suffix automaton

[Figure: suffix automaton of ababbb]

⋆ Its compact suffix automaton

[Figure: compact suffix automaton of ababbb]

Linear size data structures and O(n log card A) construction time, except for suffix tries

Suffix Array

⋆ Structure composed of:
  – the lexicographically sorted list of non-empty suffixes, and
  – the maximal lengths of common prefixes between suffixes consecutive in the list

⋆ Suffix array of ababbb

    j      0 1 2 3 4 5
    y[j]   a b a b b b

    k   SUF[k]   LCP[k]   suffix
    0   0        0        a b a b b b
    1   2        2        a b b b
    2   5        0        b
    3   1        1        b a b b b
    4   4        1        b b
    5   3        2        b b b

⋆ Benefit: binary search combined with additional combinatorial features leads to O(m + log n) searching time (instead of a mere O(m × log n) time)
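The table above can be reproduced with a naive sketch. The construction below simply sorts the suffixes (it is not a linear-time algorithm), and the search is the plain O(m × log n) binary search, without the LCP refinement that gives O(m + log n). Function names are illustrative.

```python
def suffix_array(y):
    """SUF and LCP tables of y, by direct sorting of the suffixes."""
    n = len(y)
    suf = sorted(range(n), key=lambda i: y[i:])
    lcp = [0] * n
    for k in range(1, n):
        a, b = y[suf[k - 1]:], y[suf[k]:]
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcp[k] = length
    return suf, lcp


def occurrences(x, y, suf):
    """Positions of x in y by two binary searches in the sorted suffixes."""
    m = len(x)
    lo, hi = 0, len(suf)
    while lo < hi:                       # first suffix whose prefix is >= x
        mid = (lo + hi) // 2
        if y[suf[mid]:suf[mid] + m] < x:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suf)
    while lo < hi:                       # first suffix whose prefix is > x
        mid = (lo + hi) // 2
        if y[suf[mid]:suf[mid] + m] <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suf[start:lo])


suf, lcp = suffix_array("ababbb")
print(suf, lcp, occurrences("ab", "ababbb", suf))
```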

Searching—Case one

⋆ Hypotheses
  Ld < x < Lf and ld ≤ |lcp(Li, Lf)| < lf
  (Ld, Li, Lf: suffixes at the left, middle and right of the search interval; ld = |lcp(x, Ld)|, lf = |lcp(x, Lf)|)

⋆ Example

[Figure: sorted suffixes from Ld to Lf with Li in the middle, and x = aabbbaa]

⋆ Conclusion
  Li < x < Lf and |lcp(x, Li)| = |lcp(Li, Lf)|

Searching—Case two

⋆ Hypotheses
  Ld < x < Lf and ld ≤ lf < |lcp(Li, Lf)|

⋆ Example

[Figure: sorted suffixes from Ld to Lf with Li in the middle, and x = aabacb]

⋆ Conclusion
  Ld < x < Li and |lcp(x, Li)| = |lcp(x, Lf)|

Searching—Case three

⋆ Hypotheses
  Ld < x < Lf and ld ≤ lf = |lcp(Li, Lf)|

⋆ Example

[Figure: sorted suffixes from Ld to Lf with Li in the middle, and x = aabbab]

⋆ Conclusion
  compare x and Li from position lf

Sorting Suffixes on a Bounded Integer Alphabet

⋆ Schema
  [1] bucket sort positions i according to first3(y[i . . n − 1]), for i = 3q or i = 3q + 1
      t[i]: rank of i in the sorted list
  [2] recursively sort the suffixes of the 2/3-shorter word
      t[0]t[3] · · · t[3q] · · · t[1]t[4] · · · t[3q + 1] · · ·
      s[i]: rank of suffix i in the sorted list (i = 3q or i = 3q + 1)
  [3] sort suffixes y[j . . n − 1] for j of the form 3q + 2
      (i.e., bucket sort pairs (y[j], s[j + 1]))
  [4] merge lists obtained at steps 2 and 3
      Note: comparing suffixes i (first list) and j (second list) amounts to comparing:
      (y[i], s[i + 1]) and (y[j], s[j + 1]) if i = 3q
      (y[i]y[i + 1], s[i + 2]) and (y[j]y[j + 1], s[j + 2]) if i = 3q + 1

⋆ Running time: T(n) = T(2n/3) + O(n), then T(n) = O(n) [Kärkkäinen, Sanders, 2003]
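The linear bound follows from unfolding the recurrence into a geometric series. The constant c below is an assumed name for the constant hidden in the O(n) term.

```latex
\begin{align*}
T(n) &\le T\!\left(\tfrac{2}{3}n\right) + c\,n \\
     &\le c\,n \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} + O(1) \\
     &= \frac{c\,n}{1 - \tfrac{2}{3}} + O(1) = 3\,c\,n + O(1) = O(n).
\end{align*}
```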

Example

    i      0  1  2  3  4  5  6  7  8  9  10
    y[i]   a  a  b  a  a  b  a  a  b  b  a

Step 1: ranks t of first3(y[i . . n − 1]), for i = 3q or i = 3q + 1

    Rank t   first3
    0        a
    1        a a b
    2        a b a
    3        a b b
    4        b a

Step 2: Suff(11142230)

    Rank s   i    suffix
    0        10   0
    1        0    1 1 1 4 2 2 3 0
    2        3    1 1 4 2 2 3 0
    3        6    1 4 2 2 3 0
    4        1    2 2 3 0
    5        4    2 3 0
    6        7    3 0
    7        9    4 2 2 3 0

Step 3: positions j = 3q + 2

    Rank   j   (y[j], s[j + 1])
    0      2   (b, 2)
    1      5   (b, 3)
    2      8   (b, 7)

Step 4: result of the merge

    i        0   1   2   3   4   5   6   7   8   9   10
    y[i]     a   a   b   a   a   b   a   a   b   b   a
    r[i]     1   4   8   2   5   9   3   6   10  7   0
    SUF[i]   10  0   3   6   1   4   7   9   2   5   8

Computing LCP’s of suffixes

⋆ LCP's of suffixes
  LCP[i] = |lcp(y[SUF[i − 1] . . n − 1], y[SUF[i] . . n − 1])|, for 0 < i < n, with LCP[0] = LCP[n] = 0

    i        0   1   2   3   4   5   6   7   8   9   10  11
    y[i]     a   a   b   a   a   b   a   a   b   b   a
    SUF[i]   10  0   3   6   1   4   7   9   2   5   8
    LCP[i]   0   1   6   3   1   5   2   0   2   4   1   0

    j   RANK[j]   suffix              j   RANK[j]   suffix
    0   1         aabaabaabba         1   4         abaabaabba
    3   2         aabaabba            4   5         abaabba

Lemma 11 Let j ∈ {1, 2, . . . , n − 1} with RANK[j] > 0. Then LCP[RANK[j − 1]] − 1 ≤ LCP[RANK[j]].

LCP algorithm

⋆ Rank RANK: RANK[j] = rank of suffix at position j (RANK = SUF^{−1})

⋆ Permutation SUF: SUF[k] = position of suffix of rank k

    Lcp(y, n, SUF, RANK)
        ℓ ← 0
        for j ← 0 to n − 1 do
            ℓ ← max{0, ℓ − 1}
            if RANK[j] > 0 then
                i ← SUF[RANK[j] − 1]
                while y[i + ℓ] = y[j + ℓ] do
                    ℓ ← ℓ + 1
                LCP[RANK[j]] ← ℓ
        LCP[0] ← 0
        LCP[n] ← 0
        return LCP

⋆ Running time: O(n)
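A direct Python transcription of the procedure is sketched below. Two adaptations are assumptions of this sketch: RANK is recomputed from SUF inside the function, and explicit bound checks replace the implicit end-of-text sentinel of the pseudocode.

```python
def lcp_table(y, suf):
    """LCP table (length n + 1) of y from its suffix array suf, in O(n) time."""
    n = len(y)
    rank = [0] * n
    for k, j in enumerate(suf):
        rank[j] = k                      # RANK is the inverse of SUF
    lcp = [0] * (n + 1)
    length = 0
    for j in range(n):
        length = max(0, length - 1)      # Lemma 11: at most one letter is lost
        if rank[j] > 0:
            i = suf[rank[j] - 1]         # suffix preceding suffix j in sorted order
            while (i + length < n and j + length < n
                   and y[i + length] == y[j + length]):
                length += 1
            lcp[rank[j]] = length
    return lcp


y = "aabaabaabba"
suf = sorted(range(len(y)), key=lambda i: y[i:])
print(suf, lcp_table(y, suf))
```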

Burrows-Wheeler Transform

⋆ Text w = a1a2 · · · an, a primitive word
  w1, w2, . . . , wn sequence of conjugates of w in increasing order
  bi = last letter of wi, then T(w) = b1b2 · · · bn

⋆ T(abracadabra) = rdarcaaaabb

    1    a a b r a c a d a b r
    2    a b r a a b r a c a d
    3    a b r a c a d a b r a
    4    a c a d a b r a a b r
    5    a d a b r a a b r a c
    6    b r a a b r a c a d a
    7    b r a c a d a b r a a
    8    c a d a b r a a b r a
    9    d a b r a a b r a c a
    10   r a a b r a c a d a b
    11   r a c a d a b r a a b

⋆ T(w) depends only on the conjugacy class of w
  we may suppose that w is a Lyndon word, i.e. that w = w1
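The definition translates into a short sketch that sorts the conjugates explicitly. This is a quadratic-space illustration (the linear-time computations via suffix sorting are not shown), and `bwt` is an illustrative name.

```python
def bwt(w):
    """T(w): last letters of the conjugates of w taken in increasing order."""
    conjugates = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(c[-1] for c in conjugates)


print(bwt("abracadabra"), bwt("aabracadabr"))
```

Both calls print the same word, since T(w) depends only on the conjugacy class of w.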

Text Compression

⋆ Basis of bzip text compression
  – initial text aabracadabr
  – BW Transform: T(aabracadabr) = rdarcaaaabb
  – Either run-length encoding (using ASCII code): rdarca4b2
  – ... or move-to-front encoding, roughly like encoding of: 17, 4, 2, 2, 4, 1, 0, 0, 0, 4, 0
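Both post-processing steps can be sketched briefly. The run-length format (a letter followed by its run length when the run is longer than one) is inferred from the example rdarca4b2; the move-to-front table is initialised with an assumed alphabet order, so the exact codes depend on that choice and on the variant used. Function names are illustrative.

```python
from itertools import groupby


def run_length_encode(s):
    """Write each run as its letter, followed by the run length if it exceeds 1."""
    out = []
    for letter, group in groupby(s):
        length = len(list(group))
        out.append(letter + (str(length) if length > 1 else ""))
    return "".join(out)


def move_to_front(s, alphabet):
    """Replace each letter by its index in a table, then move it to the front."""
    table = list(alphabet)
    codes = []
    for c in s:
        i = table.index(c)
        codes.append(i)
        table.insert(0, table.pop(i))
    return codes


def move_to_front_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        c = table.pop(i)
        out.append(c)
        table.insert(0, c)
    return "".join(out)


print(run_length_encode("rdarcaaaabb"))
codes = move_to_front("rdarcaaaabb", "abcdr")
print(codes)
```

Runs of equal letters in the transform become runs of zeros after move-to-front, which is what makes the subsequent entropy coding effective.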

Related works

⋆ Reversible transformation of a word used for text compression, used in bzip [Burrows, Wheeler, 1994]

⋆ Analysis of compression by [Manzini, 2001]

⋆ Related to combinatorics on words, Sturmian words [Mantaci, Restivo, Sciortino, 2003]

⋆ Particular case of a bijection due to Gessel and Reutenauer [Crochemore, Désarménien, Perrin, 2005]

⋆ Linear-time computation (adapting suffix sorting) [Kärkkäinen, Sanders, 2003], [Kim, Sim, Park, Park, 2003], [Ko, Aluru, 2003], [Nong, Zhang, Chan, 2009]

Permutation σ

⋆ Permutation σ: σ(i) = rank of i-th circular shift

⋆ w = aabracadabr

    i    σ(i)   i-th circular shift
    1    1      a a b r a c a d a b r
    2    3      a b r a c a d a b r a
    3    7      b r a c a d a b r a a
    4    11     r a c a d a b r a a b
    5    4      a c a d a b r a a b r
    6    8      c a d a b r a a b r a
    7    5      a d a b r a a b r a c
    8    9      d a b r a a b r a c a
    9    2      a b r a a b r a c a d
    10   6      b r a a b r a c a d a
    11   10     r a a b r a c a d a b

    j    σ^{−1}(j)   circular shift of rank j
    1    1           a a b r a c a d a b r
    2    9           a b r a a b r a c a d
    3    2           a b r a c a d a b r a
    4    5           a c a d a b r a a b r
    5    7           a d a b r a a b r a c
    6    10          b r a a b r a c a d a
    7    3           b r a c a d a b r a a
    8    6           c a d a b r a a b r a
    9    8           d a b r a a b r a c a
    10   11          r a a b r a c a d a b
    11   4           r a c a d a b r a a b

Permutation π

⋆ Permutation P(w) = π: π(i) = σ(σ^{−1}(i) + 1)

⋆ w = aabracadabr

    i      1  2  3  4   5  6  7  8  9  10  11
    σ(i)   1  3  7  11  4  8  5  9  2  6   10

  π as a cycle

    π = (1 3 7 11 4 8 5 9 2 6 10)

  π as an array

    i      1  2  3  4  5  6   7   8  9  10  11
    π(i)   3  6  7  8  9  10  11  5  2  1   4

Inverse Transformation

⋆ w = a1a2 · · · an
  y = b1b2 · · · bn with bi = last letter of wi
  z = c1c2 · · · cn with ci = first letter of wi

⋆ a_i = c_{σ(i)}      b_i = a_{σ^{−1}(i)−1}
  a_i = c_{π^{i−1}(1)}      c_i = b_{π(i)}

⋆ Property: i < j and ci = cj ⇒ π(i) < π(j)
  Occurrences of a letter a in w appear in the same relative order in both y and z

⋆ Linear-time computation of π and w from y

⋆ w = a1 a2 b3 r4 a5 c6 a7 d8 a9 b10 r11
  y = r11 d8 a1 r4 c6 a9 a2 a5 a7 b10 b3

    i    ci         bi
    1    a1   · · ·   r11
    2    a9   · · ·   d8
    3    a2   · · ·   a1
    4    a5   · · ·   r4
    5    a7   · · ·   c6
    6    b10  · · ·   a9
    7    b3   · · ·   a2
    8    c6   · · ·   a5
    9    d8   · · ·   a7
    10   r11  · · ·   b10
    11   r4   · · ·   b3
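The inversion can be sketched with a stable sort. Sorting the positions of y by letter gives π (0-based here), thanks to the order-preservation property above; following the cycle of π from the first position spells the Lyndon conjugate w1. The stable sort costs O(n log n); a counting sort would give the linear time mentioned on the slide. The function name is illustrative.

```python
def inverse_bwt(y):
    """Recover w1, the smallest conjugate of w, from y = T(w) (w primitive)."""
    n = len(y)
    # pi[i] = position in y of the i-th letter of z = sorted(y); the sort is stable,
    # so equal letters keep their relative order, as the property requires.
    pi = sorted(range(n), key=lambda i: y[i])
    out, i = [], 0
    for _ in range(n):
        out.append(y[pi[i]])   # c_i = b_{pi(i)}
        i = pi[i]              # next letter of w: a_{k+1} = c_{pi^k(1)}
    return "".join(out)


print(inverse_bwt("rdarcaaaabb"))
```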

Descents of permutations

⋆ Descent of permutation P(w) = π: i such that π(i) > π(i + 1)

⋆ for π = P(aabracadabr): des(π) = {7, 8, 9}

    i      1  2  3  4  5  6   7   8  9  10  11
    π(i)   3  6  7  8  9  10  11  5  2  1   4

⋆ Parikh vector of w: v = (n1, n2, . . . , nk) where k = #Alphabet, with n1 + n2 + · · · + nk = n = |w|

⋆ ρ(v) = {n1, n1 + n2, . . . , n1 + · · · + nk−1}

⋆ for π = P(aabracadabr): v = (5, 2, 1, 1, 2) and ρ(v) = {5, 7, 8, 9}

⋆ From property: des(π) ⊆ ρ(v)

Theorem 6 Let v = (n1, n2, . . . , nk) be a positive vector, n1 + n2 + · · · + nk = n. The map P : w ↦ π is one to one from the set of conjugacy classes of primitive words of length n on k letters with Parikh vector v onto the set of cyclic permutations on {1, 2, . . . , n} such that des(π) ⊆ ρ(v).
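The inclusion des(π) ⊆ ρ(v) can be checked on the running example with the sketch below. Values are 1-based as on the slides; the function names are illustrative and the word is assumed primitive, so that its conjugates are pairwise distinct.

```python
from collections import Counter


def permutation_pi(w):
    """P(w) as a 1-based array: pi(i) = sigma(sigma^{-1}(i) + 1)."""
    n = len(w)
    order = sorted(range(n), key=lambda i: w[i:] + w[:i])  # sigma^{-1}, 0-based
    sigma = [0] * n
    for rank, start in enumerate(order):
        sigma[start] = rank
    return [sigma[(order[r] + 1) % n] + 1 for r in range(n)]


def descents(pi):
    return {i for i in range(1, len(pi)) if pi[i - 1] > pi[i]}


def rho(w):
    """Partial sums n1, n1 + n2, ... of the Parikh vector, last one excluded."""
    counts = Counter(w)
    sums, total = set(), 0
    for letter in sorted(counts)[:-1]:
        total += counts[letter]
        sums.add(total)
    return sums


pi = permutation_pi("aabracadabr")
print(pi, descents(pi), rho("aabracadabr"))
```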

From permutation to words

⋆ Words w having a given cyclic permutation π: P(w) = π
  No Parikh vector given

⋆ π = P(aabracadabr)

    i      1  2  3  4  5  6   7   8  9  10  11
    π(i)   3  6  7  8  9  10  11  5  2  1   4

  π as a cycle = (1 3 7 11 4 8 5 9 2 6 10)

  on 5 letters
    z = a a a a a b b c d r r
    w = a a b r a c a d a b r

  on 4 letters
    z = a a a a a a a c d r r
    w = a a a r a c a d a a r

  on 7 letters
    z = a a b b c c c d e f g
    w = a b c g b d c e a c f

  number of words (up to alphabetical renaming): 2^6 · 2^1 = 128

Length 4

    σ         π         descents   w
    1 2 3 4   2 3 4 1   1          aaab, aabc, abbc, abcd
    1 2 4 3   2 4 1 3   1          aabb, abcc, aacb, abdc
    1 3 2 4   3 4 2 1   2          abac, acbd
    1 3 4 2   3 1 4 2   2          abcb, acdb
    1 4 2 3   4 3 1 2   2          acbc, adbc
    1 4 3 2   4 1 2 3   1          abbb, accb, acbb, adcb

Main references

References
[1] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 392 pages.
[2] Maxime Crochemore and Wojciech Rytter. Jewels of Stringology. World Scientific Publishing, Hong-Kong, 2002. 310 pages.
[3] See also http://monge.univ-mlv.fr/~mac/REC/text-algorithms.pdf or http://www.mimuw.edu.pl/~rytter/BOOKS/text-algorithms.pdf
