Document not found! Please try again

Random Access to Grammar Compressed Strings

6 downloads 50 Views 287KB Size Report
Random Access to Grammar Compressed Strings. Philip Bille, Gad ... 2. 3. 4. 5. 6. 7. 8. N. ≤n. AGTAGTAG. N = 8. X7 → X6X3. X6 → X5X5. X5 → X3X4. X4 → T.
Random Access to Grammar Compressed Strings Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann

Random Access to Compressed Strings

text DNA XML

Random Access to Compressed Strings

text DNA XML

• What is the ith character? • What is the substring at [i,j]? • Does pattern P appear in text? (perhaps with k errors?)

Random Access to Grammar Compressed Strings AGTAGTAG X7 → X6X3 X6 → X5X5 X5 → X3X4 X4 → T X3 → X1X2 X2 → G X1 → A

X7

N=8

X6

X3 X5

X5 ≤n n=7

X3

1

G 2

X4

X1 X2

X1 X2 A

X3

X4

X1 X2

T 3

A

G

4

5

T 6

A 7

G 8

N

• Grammar based compression captures many popular compression schemes with no or little blowup in space [Charikar et al. 2002, Rytter 2003]. • Lempel-Ziv family, Sequitur, Run-Length Encoding, Re-Pair, ...

Tradeoffs and Results X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T 3

A

G

4

5

T 6

A 7

G 8

N

• What is the ith character?

• What is the substring at [i,j]?

O(N) space

O(n) space

O(n) space

O(1) query

O(n) query

O(log N) query

O(n) space O(log N + j - i) query

Application: Black-Box Compressed String Matching X7

6

X6

3

X5

2 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T 3

A 4

G 5

T 6

A 7

G 8

• Does “AGGA” appear in the text (perhaps with k errors)?

Application: Black-Box Compressed String Matching X7

6

A 1

G 2

2

X6

X3





T 3

A 4

G 5

T 6

A 7

G 8

m+k m+k

• Does “AGGA” appear in the text (perhaps with k errors)? • Total time O(n (log N + m + Blackbox(m))).

Extension: Compressed Trees a b c

d c

a

d

c

b

b d

a

b

c

d

b

b c

c

d d

c b

d

• Linear space in compressed tree. • Fast navigation operations (select, access, parent, depth, height, subtree_size, first_child, next_sibling, level_ancestor, nca).

a

Heavy Path Decomposition X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T 3

A

G

4

5

N

T 6

A 7

G 8

Heavy Path Decomposition X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T 3

A

G

4

5

N

T 6

A 7

G 8

Heavy Path Decomposition X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T 3

A

G

4

5

N

T 6

A 7

G 8

Random Access Query X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T i

A

G

4

5

T 6

A 7

G 8

N

• The path from root to i goes through O(log N) heavy paths • Query: Binary search all heavy paths on the way O(log n) ⋅ O(log N)

Random Access Query X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

X4

X1 X2

X1 X2 A

3 2

1

2

T i

A

G

4

5

T 6

A 7

G 8

N

• The path from root to i goes through O(log N) heavy paths • Query: Binary search all heavy paths on the way O(log n) ⋅ O(log N)

Interval Biased Search Tree X7

6

X6

x=3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

1

X1 X2

T i

A

G

4

5

?

X4

X1 X2

X1 X2 A

3 2

1

2

0 T 6

A 7

G

x

N

8

N

• The path from root to i goes through O(log N) heavy paths • Query: Binary search all heavy paths on the way O(log n) ⋅ O(log N) O(log N/x) Telescopes to O(log N) • Space: Each IBSTs uses linear space => total O(n2) space for all heavy paths.

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

X1 X2 X5

4

5

X6 6

A 7

A

T

X2

X1

X4

X3

X4

T

G 1

X1 X2

X1 X2 A

3 2

1

2

G 8

X7

Xi Xk

Xj Xl

Xm

N

• Search for i on heavy path = lowest ancestor of distance i.

Xp Xt

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

X1 X2 X5

4

5

X6 6

A 7

A

T

X2

X1

X4

X3

X4

T

G 1

X1 X2

X1 X2 A

3 2

1

2

G 8

X7

Xi Xk

Xl Xm

N

• Search for i on heavy path = lowest ancestor of distance i. • A heavy path decomposition of heavy path representation. • In-path: O(log N/x) time, total O(n) space.

Xj Xp Xt

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

X1 X2

X4

log n X5

4

5

6

A 7

A

T

X2

X1

X4

X3

X6 T

G 1

X1 X2

X1 X2 A

3 2

1

2

G 8

Xi Xk

X7

Xl Xm

N

• Search for i on heavy path = lowest ancestor of distance i. • A heavy path decomposition of heavy path representation. • In-path: O(log N/x) time, total O(n) space. • Between-paths: O(log N/x) time, total O(n log n) space.

Xj Xp Xt

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

4

5

G

A

T

X2

X1

X4

1

X1 X2

X4

X3

n/logn leaves

log n

X5

X1 X2

X1 X2 A

3 2

1

2

Xi

X6 Xk logn Xl Xp logn logn leaves leaves X7leaves Xm Xt 2

T 6

A 7

G 8

Xj

N

• Search for i on heavy path = lowest ancestor of distance i. • A heavy path decomposition of heavy path decomposition. • In-path: O(log N/x) time, total O(n) space. • Between-paths: O(log N/x) time, total O(n log n) space.

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

4

5

G

A

T

X2

X1

X4

1

X1 X2

X4

X3

n/logn leaves

log n

X5

X1 X2

X1 X2 A

3 2

1

2

Xi

X6 Xk logn Xl Xp logn logn nodes nodes X7nodes Xm Xt 2

T 6

A 7

G 8

Xj

N

• Search for i on heavy path = lowest ancestor of distance i. • A heavy path decomposition of heavy path decomposition. • In-path: O(log N/x) time, total O(n) space. • Between-paths: O(log N/x) time, total O(n log n) space.

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

4

5

G

A

T

X2

X1

X4

1

X1 X2

X4

X3

n/logn leaves

log n

X5

X1 X2

X1 X2 A

3 2

1

2

Xi

X6 Xk logn Xl Xp logn logn nodes nodes X7nodes Xm Xt 2

T 6

A 7

G 8

Xj

N

• Search for i on heavy path = lowest ancestor of distance i. • A heavy path decomposition of heavy path decomposition. • In-path: O(log N/x) time, total O(n) space. • Between-paths: O(log N/x) time, total O(n log n) space. O(nα(n))

O(n) Representation of Heavy Paths X7

6

X6

3

X5

2

n 1

X3

1

1

X4

1

G 2

1

X5 1

X3

1

X3

T 3

A

G

4

5

G

A

T

X2

X1

X4

1

X1 X2

X4

X3

n/logn leaves

log n

X5

X1 X2

X1 X2 A

3 2

1

2

Xi

X6 Xk logn Xl Xp logn logn nodes nodes X7nodes Xm Xt 2

T 6

A 7

Xj

G 8

N

• Search for i on heavy path = lowest ancestor of distance i. • A heavy path decomposition of heavy path decomposition. • In-path: O(log N/x) time, total O(n) space. • Between-paths: O(log N/x) time, total O(n log n) space. O(nα(n)) O(n) with bittricks

Summary • Random access and substring decompression. • O(n) space and O(log N + length of substring) time. • Black compressed (approximate) string matching. • Random access in compressed trees.

Suggest Documents