[Fine, Wilf, 1965] see Algebraic Combinatorics on Words [Lothaire, 2002] ...... 310
pages. [3] See also http://monge.univ-mlv.fr/ mac/REC/text-algorithms.pdf.
String Algorithms
Maxime Crochemore King’s College London
Universit´ e Paris-Est
&
M.C.
PhD Univ. Warsaw
1/63
Algorithms and combinatorics on words
⋆
⋆
M.C.
Links between combinatorial properties of words and algorithms on words: –
some algorithms are based on combinatorial properties
–
combinatorics is sometimes used to evaluate efficiency of algorithms
Examples: –
Text searching
–
Fragment assembly and shortest common superstring
–
Text indexing and suffix arrays
–
Text compression and permutations
⋆
Other examples in Algorithms on Strings [C., Hancart, Lecroq, 2007], [C., Rytter, 1994] http://www.dcs.kcl.ac.uk/staff/mac/
⋆
Combinatorial aspects in Applied Combinatorics on Words [Lothaire, 2004], [Lothaire, 2002] http://igm.univ-mlv.fr/∼berstel/Lothaire/index.html PhD Univ. Warsaw
2/63
Periods and borders of words
⋆
Non-empty string u, integer p, 0 < p ≤ |u|
⋆
p is a period of u if any of these equivalent conditions is satisfied: –
u[i] = u[i + p], for 1 ≤ i ≤ |u| − p
–
u is a prefix of some y k , k > 0, |y| = p
–
u = yw = wz, for some strings y, z, w with |y| = |z| = p String w is called a border of u p -
borderlandborder borderlandborder p
M.C.
⋆
period(u) = smallest period of u (can be |u|) border(u) = longest proper border of u (can be empty)
⋆
Periods and borders of abaabaa 3 abaa 6 a 7 empty string PhD Univ. Warsaw
3/63
Periodicity Lemma
Lemma 1 (Periodicity Lemma) If p and q are periods of a word x and satisfy p+q−GCD(p, q) ≤ |x| then GCD(p, q) is a period of x. [Fine, Wilf, 1965] see Algebraic Combinatorics on Words [Lothaire, 2002] a b a a b a b a a b a b a a b ··· a b a a b a b a a b a a b a a b a b a a b a a b a b ··· Used in the analysis of KMP algorithm and of many other pattern matching algorithms. Lemma 2 (Weak Periodicity Lemma) If p and q are periods of a word x and satisfy p + q ≤ |x| then GCD(p, q) is a period of x. M.C.
PhD Univ. Warsaw
4/63
Proof of the weaker statement
⋆
p and q periods of x with p + q ≤ |x| and p > q
⋆
p − q period of x
⋆
M.C.
p
a
b
-
p−q the rest like Euclid’s induction
PhD Univ. Warsaw
-
c q
5/63
Proof of the weaker statement
⋆
p and q periods of x with p + q ≤ |x| and p > q
⋆
p − q period of x
p
a
b
-
p−q
-
c q
p c
⋆
M.C.
-
a
q
b p−q
-
the rest like Euclid’s induction
PhD Univ. Warsaw
5/63
On-line String Matching
u u
c a u
a
-
compatible shift
M.C.
PhD Univ. Warsaw
6/63
On-line String Matching (1)
u u
c a u
a
-
compatible shift ⋆
M.C.
compatible with match: shift = period(u) [Morris, Pratt, 1969]
text y pattern x
· · a b a a b a c · · · · · · · a b a a b a a a b a a b a a
PhD Univ. Warsaw
6/63
On-line String Matching (2)
u u
c a u
a
-
compatible shift ⋆
compatible with match: shift = period(u) [Morris, Pratt, 1969]
⋆
idem+not incompatible with c [Knuth, Morris, Pratt, 1977]
M.C.
text y pattern x
· · a b a a b a c · · · · · · · a b a a b a a a b a a b a a
text y pattern x
· · a b a a b a c · · · · · · · a b a a b a a a b a a b a a
PhD Univ. Warsaw
6/63
On-line String Matching (3)
u u
c a u
a
-
compatible shift ⋆
compatible with match: shift = period(u) [Morris, Pratt, 1969]
text y pattern x
· · a b a a b a c · · · · · · · a b a a b a a a b a a b a a
⋆
idem+not incompatible with c [Knuth, Morris, Pratt, 1977]
text y pattern x
· · a b a a b a c · · · · · · · a b a a b a a a b a a b a a
⋆
best shift = period(uc) [Simon, 1989], [Hancart, 1993] text y [Breslauer, Colussi, Toniolo, pattern x 1993]
M.C.
· · a b a a b a c · · · · · · · a b a a b a a a b a a b a a
PhD Univ. Warsaw
6/63
Delay
⋆
Delay: maximum number of comparisons on a text letter
⋆
MP algorithm delay ≤ |x|
⋆
KMP algorithm delay ≤ logΦ (|x| + 1) text y pattern x
· · a b a a b a c · a b a a b a b a b a a b a b a a b Proof by the Periodicity Lemma The bound is tight ⋆
M.C.
· · · · · · a a a b a a a a b a a
Simon-Hancart algorithm use of string-matching automaton delay ≤ min(1 + log2 |x|, cardA) PhD Univ. Warsaw
7/63
Searching with an automaton
⋆
Uses the string-matching automaton SMA(x): smallest deterministic automaton accepting A∗ x
⋆
Example x = abaa
a b 0
a a
1
b
2
b a
b ⋆
3
a
4
b
Search for abaa in: b a b b a a b a a b a a b b a ··· state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·
M.C.
PhD Univ. Warsaw
8/63
Construction of SMA(x)
⋆
Unwinding arcs
⋆
From SMA(abaa) . . .
b 0
a
a a
1
b
2
b a
b ⋆
0
1
4
a
a a
a
b
. . . to SMA(abaab)
b
3
b
2
b a
3
a
4
b
5
a
b
b M.C.
PhD Univ. Warsaw
9/63
Complexity
⋆
Time and space optimization: implementation of significant arcs only –
Forward arcs: spell the pattern
–
Backward arcs: arcs going backwards without reaching the initial state
Lemma 3 SMA(x) contains at most |x| backward arcs. ⋆
⋆
M.C.
Consequences: –
implementation of SMA(x) in O(|x|) space
–
construction in O(|x|) time, independently of the alphabet size
There is a strategy on the choice of arcs for which: delay ≤ min(1 + log2 |x|, cardA) [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1998]
PhD Univ. Warsaw
10/63
Significant arcs
⋆
Complete SMA(ananas) ' n, s a
a $ a a a n a n 4 a 5 s 6 - 0 - 1 2 - 3 s n n, s s & % n, s & % n, & % s
⋆
Forward arcs: spell the pattern
⋆
Backward arcs: arcs going backwards without reaching the initial state a $ ' a a a a n a n 4 a 5 s 6 - 0 - 1 2 - 3 n
Lemma 4 SMA(x) contains at most |x| backward arcs. ⋆
M.C.
Consequence: the implementation of SMA(x) can be done in O(|x|) time and space, independently of the alphabet size PhD Univ. Warsaw
11/63
Backward arcs in SMA
⋆
States of SMA(x) are identified with prefixes of x A backward arc is of the form (u, τ, vτ ) (u, v prefixes of x, τ symbol) with –
vτ longest suffix of uτ that is a prefix of x, and u 6= v
Note: uτ is not a prefix of x Let p(u, τ ) = |u| − |v| ; it is a period of u because v is a border of u x
τ
v
τ
τ
σ
-
v
p(u, τ )
M.C.
⋆
Backward arcs to periods: p is injective Each period p, 1 ≤ p ≤ |x|, corresponds to at most one backward arc, thus there are at most |x| such arcs
⋆
A worst case: SMA(abm−1) has m backward arcs (a 6= b)
PhD Univ. Warsaw
12/63
Backward arcs (followed)
⋆
Proof that p is injective Two backward arcs (u, τ, vτ ), (u′, τ ′ , v ′τ ′ ) Assume p(u, τ ) = p(u′, τ ′) = p ; we prove u = u′ and τ = τ ′.
x
τ
v
τ
v′
τ′
σ
v v′
-
p
x
τ′
τ
v
M.C.
τ
If v = v ′ then u = u′ and also τ = τ ′
⋆
⋆
v′
τ
τ ′
σ′
τ τ′ p
σ
v -
v′
If v ′ a proper prefix of v then v ′τ ′ and v ′σ ′ are prefixes of v thus τ ′ = σ ′ a contradiction PhD Univ. Warsaw
13/63
Repetition
⋆
Repetition (w, w) repetition of (u, v) if w 6= ε and both: –
w suffix of u or u suffix of w, and
–
w prefix of v or v prefix of w
|w| is a local period of (u, v) ⋆
Local Period localperiod(u, v) = minimum local period of (u, v)
a b a b b a b b
a b a b b a a a a b a b b a b a b a
M.C.
PhD Univ. Warsaw
a b a b b a b b a b a b b a b a
14/63
Maximal Local Period
⋆
Word of period 5 a 1
b 2
a 2
b 5
b 1
a 3
⋆
Note: localperiod(u, v) ≤ period(uv)
⋆
(u, v) is a critical factorization of uv if
1
localperiod(u, v) = period(uv) localperiod(u, v) is maximal among all local periods ⋆
M.C.
Computation of all local periods in linear time [Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]
PhD Univ. Warsaw
15/63
Critical Factorization
Theorem 1 (Critical Factorization Theorem) Any non-empty word x can be factorized into u · v with both: • |u| < p and • localperiod(u, v) = period(x). [Cesari, Vincent, Duval, 1983] a b a b b a b b a b a b b a b a Leads to time-space optimal string-matching algorithm: two-way algorithm
M.C.
PhD Univ. Warsaw
16/63
Two-Way String Matching
text y pattern x
⋆
u
shifts
w c w a u
|wc|
c w a w
v v v
-
period(x)
-
Time-space optimality –
Search Time: linear time (≤ 2n comparisons) with constant extra space
–
Preprocessing Time: idem, based on next theorem
[Crochemore, Perrin, 1992] ⋆
M.C.
Other solutions [Galil, Seiferas, 1983], [Crochemore, 1992] [Crochemore, Rytter, 1994], [Rytter, 2002]
PhD Univ. Warsaw
17/63
Two-way string matching (followed)
Searching for pattern x = x[0 . . m − 1] in text y = y[0 . . n − 1]
M.C.
⋆
(u, v) critical factorization of x: uv = x and |u| < localperiod(u, v) = period(x)
⋆
Searching –
search for an occurrence of v by left-to-right scan if mismatch: shift length = scan length, else
–
search for a preceding occurrence of u by right-to-left scan shift length = period of x
PhD Univ. Warsaw
18/63
Example
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a
M.C.
PhD Univ. Warsaw
19/63
Example (1)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a left-to-right scan length = 3
M.C.
PhD Univ. Warsaw
19/63
Example (1)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a shift length = 3
M.C.
PhD Univ. Warsaw
19/63
Example (2)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a
M.C.
PhD Univ. Warsaw
20/63
Example (2)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a left-to-right scan length = 4
M.C.
PhD Univ. Warsaw
20/63
Example (2)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a shift length = 4
M.C.
PhD Univ. Warsaw
20/63
Example (3)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a
M.C.
PhD Univ. Warsaw
21/63
Example (3)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a left-to-right scan
M.C.
PhD Univ. Warsaw
21/63
Example (3)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a right-to-left scan
M.C.
PhD Univ. Warsaw
21/63
Example (3)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a a b a b b a b a shift length = period = 5
M.C.
PhD Univ. Warsaw
21/63
Example (4)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a a b a b b a b a
M.C.
PhD Univ. Warsaw
22/63
Example (4)
⋆ ⋆
Critical factorization a b a b b a b a u v Searching
period = 5
window a a a b b b b b a a b b a b a b b a b a b a . . a b a b b a b a a b a b b a b a a b a b b a b a a b a b b a b a left-to-right and right-to-left scans occurrence found next shift length = period = 5
M.C.
PhD Univ. Warsaw
22/63
Two-way string matching
Searching for pattern x = x[0 . . m − 1] in text y = y[0 . . n − 1] ⋆
(u, v) critical factorization of x: uv = x and |u| < localperiod(u, v) = period(x)
⋆
Searching
⋆
⋆
M.C.
–
search for an occurrence of v by left-to-right scan if mismatch: shift length = scan length, else
–
search for a preceding occurrence of u by right-to-left scan shift length = period of x
Preprocessing –
compute factorization (u, v)
–
compute the (smallest) period of x
Algorithmic complexity –
total linear time; searching in less than 2n comparisons
–
only constant additional memory space required
PhD Univ. Warsaw
23/63
Orderings
⋆
Orderings ≤ lexicographic ordering based on the ordering ≤ of the alphabet lexicographic ordering based on the inverse ordering ≤−1 of the alphabet
Theorem 2 x is a non-empty word Let x = uv with v = suffix of x that is maximal for ≤ Let x = u′v ′ with v ′ = suffix of x that is maximal for If |v| ≤ |v ′ | then (u, v) is a critical factorization of x otherwise (u′ , v ′) is. Moreover, |u| < period(x) and |u′| < period(x). [Crochemore, Perrin, 1992]
M.C.
a b a a b a a
a b a b a a b b a b a b a
a b a a b a a
a b a b a a b b a b a b a
PhD Univ. Warsaw
24/63
Proof (1)
Four cases — w shortest repetition at (u, v) x ⋆ w suffix of u and v prefix of w
v w
v < w < wv, impossible
M.C.
u
PhD Univ. Warsaw
w
25/63
Proof (2)
Four cases — w shortest repetition at (u, v) x ⋆ w suffix of u and v prefix of w
u w
v < w < wv, impossible ⋆
M.C.
v
w suffix of u and w prefix of v z < v implies v = wz < wv impossible
x
PhD Univ. Warsaw
w
u
z w
w
25/63
Proof (3)
Four cases — w shortest repetition at (u, v) x ⋆ w suffix of u and v prefix of w
u w
v < w < wv, impossible ⋆
⋆
M.C.
v
w suffix of u and w prefix of v z < v implies v = wz < wv impossible
u
x
u suffix of w and v prefix of w (u, v) is a critical factorization
PhD Univ. Warsaw
w
z w
x
u w
w v w
25/63
Proof (4)
Four cases — w shortest repetition at (u, v) x ⋆ w suffix of u and v prefix of w
u w
v < w < wv, impossible ⋆
⋆
⋆
M.C.
v
w suffix of u and w prefix of v z < v implies v = wz < wv impossible
u
x
u suffix of w and v prefix of w (u, v) is a critical factorization u suffix of w and w prefix of v z < v; yz ≺ yv implies z ≺ v; z prefix of v then border of v w period of v and of x. PhD Univ. Warsaw
w
z w
x
w
u w
v w
v′ y
x w
z y
25/63
Computing Maximal Suffixes
⋆
M.C.
v maximal suffix of x; |w| its period; w′ proper prefix of w u v x w w w′ a b a c b c b a c b c b a c b c u w w w′
PhD Univ. Warsaw
26/63
Computing Maximal Suffixes (followed)
⋆
v maximal suffix of x; |w| its period; w′ proper prefix of w u v x w w w′ a b a c b c b a c b c b a c b c u w w w′
⋆
Match: the periodicity continues ?
?
a b a c b c b a c b c b a c b c b u w w new w′
M.C.
PhD Univ. Warsaw
26/63
Computing Maximal Suffixes (followed)
⋆
v maximal suffix of x; |w| its period; w′ proper prefix of w u v x w w w′ a b a c b c b a c b c b a c b c u w w w′
⋆
Match: the periodicity continues ?
?
a b a c b c b a c b c b a c b c b u w w new w′ ⋆
Smaller letter: new w, border-free ?
?
a b a c b c b a c b c b a c b c a u new w
M.C.
PhD Univ. Warsaw
26/63
Computing Maximal Suffixes (followed)
⋆
v maximal suffix of x; |w| its period; w′ proper prefix of w u v x w w w′ a b a c b c b a c b c b a c b c u w w w′
⋆
Match: the periodicity continues ?
?
a b a c b c b a c b c b a c b c b u w w new w′ ⋆
Smaller letter: new w, border-free ?
?
a b a c b c b a c b c b a c b c a u new w ⋆
Greater letter: new u and recomputation on the rest ?
?
a b a c b c b a c b c b a c b c c recomputation new u
M.C.
PhD Univ. Warsaw
26/63
Perfect factorization
Theorem 3 Any non-empty word x can be factorized into u · v with both: • |u| < 2 period(v) and • v starts with at most one cube of a primitive word. [Galil, Seiferas, 1983], [Crochemore, Rytter, 1994] [Mignosi, Restivo, Salemi, 1995] see [Lothaire, 2002] ⋆
Example word with period = 10 a a a b a a b a a b a a a b a a b a a b a a a b a a ···
⋆
M.C.
Leads to a time-space optimal string-matching algorithm
PhD Univ. Warsaw
27/63
Square prefixes
w
w
v u
v u
Lemma 5 (Three prefix squares) If u2 prefix of v 2, v 2 prefix of w2, and u primitive then |u| + |v| ≤ |w|. [Crochemore, Rytter, 1995] 10 7 3
M.C.
a a b a a b a a a b a a b a a b a a a b a a b a a b a a a b a a b a a a b a a b
PhD Univ. Warsaw
28/63
Squares in a word
Lemma 6 ([Fraenkel, Simpson, 1998]) No more that 2n squares of primitive root occur in a word of length n. y w v u
w v
u
?
rightmost positions in y? impossible! Direct proofs [Hickerson, 2004], [Ilie, 2005] Best bound: 2n − Θ(log n) [Ilie, 2005] Computation in linear time [Gusfield, Stoye, 1998] Lemma 7 ([Crochemore, 1981], [Gusfield, Stoye, 1999]) Maximal number of occurrences of squares (with primitive root) : cn log n. Maximum reached by Fibonacci strings. Theorem 4 ([Kolpakov, Kucherov, 1998]) Linear number of occurrences of runs (maximal powers) in a word. M.C.
PhD Univ. Warsaw
29/63
Sequencing and fragment assembly
⋆
“Reading” a biological molecular sequence
⋆
Straight sequencing for short fragments: ≤ 500 bases
⋆
Splicing long sequences, “shotgun sequencing” =⇒ reconstruction problem
⋆
Difficulties :
⋆
–
loss of orientation (double-stranded DNA)
–
overlaps with errors between fragments
Formalization : F set of fragments, ε ∈ [0, 1[ compute s, a short sequence for which ∀f ∈ F ∃a fragment of s d(a, f ) ≤ ε|a| or d(a, f˜) ≤ ε|a|
M.C.
PhD Univ. Warsaw
30/63
Shortest Common Superstring
⋆
Problem: let F = {f1, f2, . . . , fn } a factor code Compute a shortest word s in which each fi occurs
⋆
SCS(F) = |s| s= T A A f1 = A f2 = f3 = f4 = f5 = T A A f6 = A A
T A T T A T A T A T T A T T T T A T T A T A T T A
Lemma 8 Computing SCS(F) is NP-hard.
M.C.
PhD Univ. Warsaw
31/63
Greedy Approximation
⋆
while card F > 1 do assemble two fragments having the longest overlap
Theorem 5 The sequence s produced by the greedy algorithm satisfies |s| ≤ 4 × SCS(F).
M.C.
⋆
Conjecture : |s| ≤ 2 × SCS(F)
⋆
Variations based on the computation of hamiltonian cycle of minimal weight
PhD Univ. Warsaw
32/63
Graph of Overlaps
⋆
F = {f1, f2, . . . , fn }, factor code
⋆
Overlaps ov(fi, fj ) = max{|v| | fi = uv and fj = vw}
⋆
Nodes = {f1, f2, . . . , fn} Arcs = {(fi, ov(fi , fj ), fj ) | 1 ≤ i, j ≤ n, i 6= j}
⋆
ov(f, g) computed in time O(min(|f |, |g|)) [Morris et Pratt, 1970]
s= A T A T T A f1 = A T A T f2 = T A T T f3 = T T A f4 = T A f1 = A ↔ ↔ ↔ 3 2 3
M.C.
T A T
T T A T A T ↔ 3
PhD Univ. Warsaw
f1
3 f4 @ I
@ @ @ @ R @
3
@ @ @ @
3
f2
2
f3
33/63
Graph of Prefixes
⋆
F = {f1, f2, . . . , fn }, factor code
⋆
Prefixes pr(fi , fj ) = |fi| − ov(fi , fj )
⋆
Nodes = {f1, f2, . . . , fn} Arcs = {(fi, pr(fi , fj ), fj ) | 1 ≤ i, j ≤ n, i 6= j}
s= A T A T f1 = A T A T f2 = T A T f3 = T f4 = f1 = ↔ ←→ ↔ 1 2 1
M.C.
T A T A T T T A T T A T A A T A T ↔ 1 = 5
PhD Univ. Warsaw
f1
1 f4 @ I
@ @ @ @ R @
1
@ @ @ @
1
f2
2
f3
34/63
Technique
⋆
A cycle (fi1 , fi2 , . . . , fik , fi1 ) produces a superstring s of length: |s| = kj=1 |fij | − k−1 j=1 ov(fij , fij+1 ) P = k−1 j=1 pr(fij , fij+1 ) + |fik | P
M.C.
P
⋆
Computation of an hamiltonian cycle of maximal weight in the graph of overlaps
⋆
. . . or of minimal weight in the graph of prefixes
⋆
NP-hard problems
PhD Univ. Warsaw
35/63
3 × OP T Approximation
⋆
Based on cycle covers of the graph
⋆
Polynomial running time
⋆
Schema: (i) compute a cycle cover of minimal weight for the graph of prefixes (ii) open cycles (iii) s = assembly with overlaps of words obtained in (ii)
M.C.
⋆
Approximation : |s| ≤ 3 × SCS(F)
⋆
Proof based on the Overlap Lemma [Blum, Jiang, Li, Tromp, Yanakakis, 1994]
PhD Univ. Warsaw
36/63
Periodicities
⋆
Word x, integer p, 0 < p ≤ |x| ; p period of x if x[i] = x[i + p]
⋆
period(x) = smallest period of x
Lemma 9 (Overlap Lemma) Let u, v, p = period(u), q = period(v). If u[1 . . p] et v[1 . . q] are not conjugate, ov(u, v) < p + q − GCD(p, q).
M.C.
⋆
Exemple : period(baabaabaa) = 3, period(aabaaabaaaba) = 4 ov(baabaabaa, aabaaabaaaba) = |aabaa| = 5 < 3 + 4 − 1
⋆
Straight corollary of the Periodicity Lemma
PhD Univ. Warsaw
37/63
Better Approximation
⋆
Based on a more efficient cycle opening
⋆
Schema : (i) compute a cycle cover of minimal weight for the graph of prefixes (ii) open cycles at positions deduced from the Rotation Lemma (iii) s = assembly with overlaps of words obtained in (ii)
M.C.
⋆
Approximation : |s| ≤ (2 + 32 ) × SCS(F)
⋆
Proof based on the Rotation Lemma [Jiang, Jiang, Breslauer, 1996]
PhD Univ. Warsaw
38/63
Rotation Lemma
Lemma 10 (Rotation Lemma) Let x and p = period(x) be such that |x| ≥ 3p. There is a factorization u · v of x with both: • |u| < p, and • each w prefix of v with q = period(w) < p satisfies |w| ≤ 32 (p + q). [Breslauer, Jiang, Jiang, 1996] p y x q z z w
≤ ⋆
M.C.
2 3 (p
y
y
-
+ q)
The bound is tight: at each position < 2n + 3 of x = (an ban+1b)3 starts a factor of length ≥ 2n + 2 having a period ≥ n + 1
PhD Univ. Warsaw
39/63
Choice of the rotation
⋆
y 3 prefix of x with |y| = period(x) = p
⋆
≤ ordering on the alphabet, inverse ordering
⋆
r conjugate of y that is a Lyndon word according to ≤
⋆
t conjugate of y that is a Lyndon word according to
⋆
i first position of r on x, j first position of t on x
⋆
If i < j and j − i ≤ p, we choose u such that y = uu′ , r = u′u p y
i
-
y j
y
t r
⋆
M.C.
If i < j and j − i > p, we choose u such that y = uu′, t = u′ u
PhD Univ. Warsaw
40/63
Proof
⋆
r Lyndon word and j − i ≤ p y x j i t
p 2
y
y
r w
⋆
M.C.
s q
-
s
–
|w| < p since q < p and r is border-free
–
If q ≥ p2 , then |w| < 23 (p + q)
–
Else, q < p2 , and |w| < j − i + q (Critical Factorization at j) and j − i + q < 2p + q < 23 (p + q), QED
If j − i > p2 , replace r by t and t by the next occurrence of r, which brings back to the first case
PhD Univ. Warsaw
41/63
Text Indexing
⋆
Set of factors (subwords) of a static text
⋆
Basic operations
⋆
M.C.
–
existence of patterns in the text
–
number of occurrences of patterns
–
list of positions of occurrences
Other applications –
finding repetitions in texts
–
finding regularities in texts
–
approximate matchings
–
two-dimensional pattern matching
–
...
PhD Univ. Warsaw
42/63
Implementation of indexes
suffix of text pattern
Implementation with efficient data structures ⋆
Suffix Trees digital trees, PATRICIA trees (compact trees)
⋆
Suffix Automata or DAWG’s minimal automata, compact automata
Implementation with efficient algorithm ⋆
M.C.
Suffix Arrays binary search in the ordered list of suffixes
PhD Univ. Warsaw
43/63
Suffixes
Text y ∈ A∗ of length n ⋆
Suff (y) = set of suffixes of y
⋆
Suff (ababbb) i y[i]
0 1 2 3 4 5 a b a b b b a b a b b a b a b b
M.C.
b b b b b
b b b b b b ε
positions 0 1 2 3 4 5 6 (empty string)
PhD Univ. Warsaw
44/63
Suffix Structures
⋆
Suffix trie of ababbb
1
b
b
4
b
5
6
Its suffix tree
0
b
12
b
8
b
9
b
10
11
1
b
14
b
15
3
b
7
1
b
1
b
6
3
⋆
Its compact suffix automaton
2
a
3
b
bb 4
b
5
a b
2
4
b a
2
5 5
Its suffix automaton
0
4
abbb
b
4
⋆
bb
6
7 5
0
2
0 a
1
ab
13
b
6
b
abbb 3
2
a 0
b
3
a
⋆
2′
b
5′
b
6
0
ab
abbb
2
b
1
abbb b b
2′
b
3
Linear size data structures and O(n log card A) construction time, except for suffix tries
M.C.
PhD Univ. Warsaw
45/63
Suffix Array
⋆
Structure composed of: lexicographically sorted list of non-empty suffixes, and maximal lengths of common prefixes between suffixes consecutive in the list
⋆
Suffix array of ababbb
j y[j]
⋆
M.C.
0 1 2 3 4 5 a b a b b b
k SUF[k] LCP[k] 0 0 0 a b a b b b 1 2 2 a b b b 2 5 0 b 3 1 1 b a b b b 4 4 1 b b 5 3 2 b b b
Benefit: use of binary search and of additional combinatorial features lead to a O(m + log n) searching time (instead of a mere O(m × log n) time)
PhD Univ. Warsaw
46/63
Searching—Case one
⋆
Hypotheses Ld < x < Lf and ld ≤ |lcp(Li, Lf )| < lf
⋆
Example Ld a a Li a a Lf a
⋆
M.C.
a a a a a
a a b b b
c c b b b
a b a a b
x aabbbaa a ba bb ab
x aabbbaa
Conclusion Li < x < Lf and |lcp(x, Li)| = |lcp(Li, Lf )|
PhD Univ. Warsaw
47/63
Searching—Case two
⋆
Hypotheses Ld < x < Lf and ld ≤ lf < |lcp(Li, Lf )|
⋆
Example Ld a a Li a a Lf a
⋆
M.C.
a a a a a
a a b b b
c c b b b
a b a a b
x aabacb a ba bb ab
x aabacb
Conclusion Ld < x < Li and |lcp(x, Li)| = |lcp(x, Lf )|
PhD Univ. Warsaw
48/63
Searching—Case three
⋆
Hypotheses Ld < x < Lf and ld ≤ lf = |lcp(Li, Lf )|
⋆
Example Ld a a Li a a Lf a
⋆
M.C.
a a a a a
a a b b b
c c b b b
a b a a b
x aabbab a ba bb ab
x aabbab
Conclusion compare x and Li from position lf
PhD Univ. Warsaw
49/63
Sorting Suffixes on a Bounded Integer Alphabet
⋆
Schema [1] bucket sort positions i according to first3(y[i . . n − 1]), for i = 3q or i = 3q + 1 t[i]: rank of i in the sorted list [2] recursively sort the suffixes of the 2/3-shorter word t[0]t[3] · · · t[3q] · · · t[1]t[4] · · · t[3q + 1] · · · s[i]: rank of suffix i in the sorted list (i = 3q or i = 3q + 1) [3] sort suffixes y[j . . n − 1] for j of the form 3q + 2 (i.e., bucket sort pairs (y[j], s[j + 1])) [4] merge lists obtained at steps 2 and 3 Note: comparing suffixes i (first list) and j (second list) remains to compare: (x[i], s[i + 1]) and (x[j], s[j + 1]) if i = 3q (x[i]x[i + 1], s[i + 2]) and (x[j]x[j + 1], s[j + 2]) if i = 3q + 1
⋆
M.C.
Running time: T (n) = T (2n/3) + O(n) then T (n) = O(n) [K¨ arkk¨ ainen, Sanders, 2003] PhD Univ. Warsaw
50/63
Example
i y[i]
0 1 2 3 4 5 6 7 8 9 10 a a b a a b a a b b a Rank s 0 1 2 3 4 5 6 7
Rank t 0 a 1 a a b 2 a b a 3 a b b 4 b a
i y[i] r[i] SUF[i] M.C.
0 a 1 10
1 a 4 0
2 b 8 3
3 a 2 6
4 a 5 1
5 b 9 4
i 10 0 3 6 1 4 7 9 6 a 3 7
Suff (11142230) 0 1 1 1 4 2 2 3 0 1 1 4 2 2 3 0 1 4 2 2 3 0 2 2 3 0 2 3 0 3 0 4 2 2 3 0 7 a 6 9
8 b 10 2
9 b 7 5
Rank 0 1 2
j 2 5 8
(y[j], s[j + 1]) (b, 2) (b, 3) (b, 7)
10 a 0 8
PhD Univ. Warsaw
51/63
Computing LCP’s of suffixes
⋆
LCP’s of suffixes LCP[i] = |lcp(y[SUF[i − 1] . . n − 1], y[SUF[i] . . n − 1])|, for 0 ≤ i ≤ n i y[i] SUF[i] LCP[i]
0 a 10 0
1 a 0 1
2 b 3 6
3 a 6 3
4 a 1 1
5 b 4 5
6 a 7 2
7 a 9 0
8 b 2 2
9 b 5 4
10 11 a 8 1 0
j RANK[j] j RANK[j] 0 1 aabaabaabba 1 4 abaabaabba 3 2 aabaabba 4 5 abaabba Lemma 11 Let j ∈ (1, 2, . . . , n − 1) with RANK[j] > 0. Then LCP[RANK[j − 1]] − 1 ≤ LCP[RANK[j]].
M.C.
PhD Univ. Warsaw
52/63
LCP algorithm
⋆
Rank RANK: RANK[j] = rank of suffix at position j (RANK = SUF−1)
⋆
Permutation SUF: SUF[k] = position of suffix of rank k Lcp(y, n, SUF, RANK)
⋆ M.C.
1 ℓ←0 2 for j ← 0 to n − 1 do 3 ℓ ← max{0, ℓ − 1} 4 if RANK[j] > 0 then 5 i ← SUF[RANK[j] − 1] 6 while y[i + ℓ] = y[j + ℓ] do 7 ℓ←ℓ+1 8 LCP[RANK[j]] ← ℓ 9 LCP[0] ← 0 10 LCP[n] ← 0 11 return LCP Running time: O(n) PhD Univ. Warsaw
53/63
Burrows-Wheeler Transform
⋆
Text w = a1a2 · · · an a primitive word w1, w2, . . . , wn sequence of conjugates of w in increasing order bi = last letter of wi , then T (w) = b1b2 · · · bn
⋆
T (abracadabra) = rdarcaaaabb
⋆
M.C.
1 a a b 2 a b r 3 a b r 4 a c a 5 a d a 6 b r a 7 b r a 8 c a d 9 d a b 10 r a a 11 r a c T (w) depends only on the conjugacy class of w we may suppose that w is a Lyndon word, i.e. that w PhD Univ. Warsaw
r a a d b a c a r b a
a a c a r b a b a r d
c b a b a r d r a a a
a r d r a a a a b c b
d a a a b c b a r a r
a c b a r a r b a d a
b a r b a d a r c a a
r d a r c a a a a b b
= w1 54/63
Text Compression
⋆
M.C.
Basis of bzip text compression –
intial text aabracadabr
–
BW Transform: T (aabracadabr) = rdarcaaaabb
–
Either run-length encoding (using ASCII code): rdarca4b2
–
... or move-to-front encoding, roughly like encoding of: 17, 4, 2, 2, 4, 1, 0, 0, 0, 4, 0
PhD Univ. Warsaw
55/63
Related works
M.C.
⋆
Reversible transformation of a word used for text compression, used in bzip [Burrows, Wheeler, 1994]
⋆
Analysis of compression by [Manzini, 2001]
⋆
Related to combinatorics on words, Sturmian words [Mantaci, Restivo, Sciortino, 2003]
⋆
Particular case of a bijection due to Gessel and Reutenauer [Crochemore, D´ esarm´ enien, Perrin, 2005]
⋆
Linear-time computation (adapting suffix sorting) [K¨ arkk¨ ainen, Sanders, 2003][Kim, Sim, Park, Park, 2003] [Ko, Aluru, 2003][Nong, Zhang, Chan, 2009]
PhD Univ. Warsaw
56/63
Permutation σ
⋆
Permutation σ: σ(i) = rank of i-th circular shift
⋆
w = aabracadabr
i σ(i) 1 1 a a b r a c a d a b r 2 3 a b r a c a d a b r a 3 7 b r a c a d a b r a a 4 11 r a c a d a b r a a b 5 4 a c a d a b r a a b r 6 8 c a d a b r a a b r a 7 5 a d a b r a a b r a c 8 9 d a b r a a b r a c a 9 2 a b r a a b r a c a d 10 6 b r a a b r a c a d a 11 10 r a a b r a c a d a b
M.C.
σ −1 (j) 1 9 2 5 7 10 3 6 8 11 4
PhD Univ. Warsaw
j 1 2 3 4 5 6 7 8 9 10 11
a a a a a b b c d r r
a b b c d r r a a a a
b r r a a a a d b a c
r a a d b a c a r b a
a a c a r b a b a r d
c b a b a r d r a a a
a r d r a a a a b c b
d a a a b c b a r a r
a c b a r a r b a d a
b a r b a d a r c a a
r d a r c a a a a b b
57/63
Permutation π
⋆
Permutation P (w) = π: π(i) = σ(σ −1 (i) + 1)
⋆
w = aabracadabr
σ=
1 2 3 4 5 6 7 8 9 10 11 1 3 7 11 4 8 5 9 2 6 10
π as a cycle
π = 1 3 7 11 4 8 5 9 2 6 10
π as an array
π=
M.C.
1 2 3 4 5 6 7 8 9 10 11 3 6 7 8 9 10 11 5 2 1 4
PhD Univ. Warsaw
58/63
Inverse Transformation
⋆
w = a1 a2 · · · an y = b1b2 · · · bn with bi = last letter of wi z = c1c2 · · · cn with ci = first letter of wi
⋆
ai = cσ(i) bi = aσ−1(i)−1 ai = cπi−1(1) ci = bπ(i)
⋆
Property: i < j and ci = cj ⇒ π(i) < π(j) Occurrences of a in w appear in the same relative order in both y and z
⋆
M.C.
Linear-time computation of π and w from y
PhD Univ. Warsaw
⋆
w = a1a2b3r4a5c6a7d8a9b10r11 y = r11d8a1r4c6a9a2a5a7b10b3 i ci bπ(i) 1 a1 · · · r11 2 a9 · · · d8 3 a2 · · · a1 4 a5 · · · r4 5 a7 · · · c6 6 b10 · · · a9 7 b3 · · · a2 8 c6 · · · a5 9 d8 · · · a7 10 r11 · · · b10 11 r4 · · · b3
59/63
Descents of permutations
⋆
Descent of permutation P (w) = π: i such that π(i) > π(i + 1)
⋆
for π = P (aabracadabr): des(π) = {7, 8, 9}
π=
1 2 3 4 5 6 7 8 9 10 11 3 6 7 8 9 10 11 5 2 1 4
⋆
Parikh vector of w: (n1, n2, . . . , nk ) where k = #Alphabet with n1 + n2 + · · · + nk = n = |w|
⋆
ρ(v) = {n1, n1 + n2, . . . , n1 + · · · + nk−1}
⋆
for π = P (aabracadabr): v = (5, 2, 1, 1, 2) and ρ(v) = (5, 7, 8, 9)
⋆
From property: des(π) ⊆ ρ(v)
Theorem 6 Let v = (n1, n2, . . . , nk ) be a positive vector, n1 +n2 +· · ·+nk = n. The map P : w 7→ π is one to one from the set of conjugacy classes of primitive words of length n on k letters with Parikh vector v onto the set of cyclic permutations on {1, 2, . . . , n} such that des(π) ⊆ ρ(v).
M.C.
PhD Univ. Warsaw
60/63
From permutation to words
⋆
Words w having a given cyclic permutation π: P (w) = π No Parikh vector given
⋆
π = P (aabracadabr) 2 3 4 5 6 7 8 9 10 11 1 π= 3 6 7 8 9 10 11 5 2 1 4 π as a cycle = 1 3 7 11 4 8 5 9 2 6 10 on 5 letters
z= a a a a a b b c d r r w= a a b r a c a d a b r
on 4 letters
z= a a a a a a a c d r r w= a a a r a c a d a a r
on 7 letters
z= a a b b c c c d e f g w= a b c g b d c e a c f
number of words (up to alphabetical renaming): 26.21 = 128 M.C.
PhD Univ. Warsaw
61/63
Length 4
σ 1 2 3 4
1 2 4 3
1 3 2 4
M.C.
π descents 2 3 4 1 1 a a a a
w a a b b a b b c
b c c d
2 4 1 3
a b a b
b c b c
3 4 2 1
1
2
a a a a
b c c d
σ 1 3 4 2
π descents w 3 1 4 2 2 a c b c a d b c
1 4 2 3
4 3 1 2
2
a b c b a c d b
1 4 3 2
4 1 2 3
1
a a a a
b c c d
b c b c
b b b b
a b a c a c b d
PhD Univ. Warsaw
62/63
Main references
References [1] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 392 pages. [2] Maxime Crochemore and Wojciech Rytter. Jewels of Stringology. World Scientific Publishing, Hong-Kong, 2002. 310 pages. [3] See also http://monge.univ-mlv.fr/ mac/REC/text-algorithms.pdf or http://www.mimuw.edu.pl/ rytter/BOOKS/text-algorithms.pdf
M.C.
PhD Univ. Warsaw
63/63