Succinct Data Structures: Theory and Practice

7 downloads 39540 Views 1MB Size Report
Mar 16, 2012 ... Succinct Data Structures: Theory and Practice. 1/15 .... Problems: in theory develop succinct data structures in practice constants in O(1)-time ...
Motivation and Context

Succinct Data Structures: Theory and Practice

March 16, 2012

— Succinct Data Structures: Theory and Practice

1/15

Motivation and Context

Contents

1 Motivation and Context Memory Hierarchy Succinct Data Structures Basics

— Succinct Data Structures: Theory and Practice

2/15

Motivation and Context

1 Motivation and Context Memory Hierarchy Succinct Data Structures Basics

— Succinct Data Structures: Theory and Practice

3/15

Motivation and Context

Succinct data structures Data structure D representation of an object + X Example: Rank-bit-vector bit vector b of length n (0,1,0,1,1,0,1,1)

+

operations on X

access b[i]P in O(1) time i−1 rank(i) = j=0 b[j] in O(n) time

in n bits space

— Succinct Data Structures: Theory and Practice

4/15

Motivation and Context

Succinct data structures Data structure D representation of an object + X

operations on X

Example: Rank-bit-vector bit vector b of length n access b[i]P in O(1) time (0,1,0,1,1,0,1,1) + i−1 rank(i) = j=0 b[j] in O(n) time (0,0,1,1,2,3,3,4) in n bits space

— Succinct Data Structures: Theory and Practice

4/15

Motivation and Context

Succinct data structures Data structure D representation of an object + X

operations on X

Example: Rank-bit-vector bit vector b of length n access b[i]P in O(1) time (0,1,0,1,1,0,1,1) + i−1 rank(i) = j=0 b[j] in O(1) time (0,0,1,1,2,3,3,4) in n + n log n bits space Succinct data structure D Space of D is close the information theoretic lower bound to represent X , while operations can still be performed efficient.

— Succinct Data Structures: Theory and Practice

4/15

Motivation and Context

Succinct data structures Data structure D representation of an object + X

operations on X

Example: Rank-bit-vector bit vector b of length n access b[i]P in O(1) time (0,1,0,1,1,0,1,1) + i−1 rank(i) = j=0 b[j] in O(1) time (0,0,1,1,2,3,3,4) in n + n log n bits space Succinct data structure D Space of D is close the information theoretic lower bound to represent X , while operations can still be performed efficient.

— Succinct Data Structures: Theory and Practice

4/15

Motivation and Context

Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !?

— Succinct Data Structures: Theory and Practice

5/15

Motivation and Context

Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !? ≈ 1 CPU cycle

CPU

≈ 100 B

≈ 5 CPU cycles

L1-Cache

≈ 10 KB

≈ 10-20

L2-Cache

≈ 512 KB

≈ 20-100

L3-Cache

≈ 1-8 MB

≈ 100-500

DRAM

≈ 106

Disk

≈ 4 GB ≈ x · 100 GB

Less memory ⇒ less costs !?

— Succinct Data Structures: Theory and Practice

5/15

Motivation and Context

Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !? Less memory ⇒ less costs !? main memory price per hour Instance name Micro 613.0 MB 0.02 US$ 68.4 GB 2.00 US$ High-Memory Quadruple Extra Large Pricing of Amazons Elastic Cloud Computing (EC2) service in July 2011.

Problems: in theory develop succinct data structures

in practice constants in O(1)-time terms are high o(n)-space term is not negligible complex data structures are hard to implement

— Succinct Data Structures: Theory and Practice

5/15

Motivation and Context

Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !? Less memory ⇒ less costs !? Problems: in theory develop succinct data structures

in practice constants in O(1)-time terms are high o(n)-space term is not negligible complex data structures are hard to implement

— Succinct Data Structures: Theory and Practice

5/15

Motivation and Context

Memory Hierarchy Moore’s Law

“The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years”

Image source: Wikipedia — Succinct Data Structures: Theory and Practice

6/15

Motivation and Context

Memory Hierarchy

Reality

Processor speed increasing

Disks have not evolved at the same pace Image source: www.intel.com

— Succinct Data Structures: Theory and Practice

7/15

Motivation and Context

Memory Hierarchy

Image source: Wikipedia

— Succinct Data Structures: Theory and Practice

8/15

Motivation and Context

Memory Hierarchy

Some Numbers A few CPU registers, less than 1 nanosecond. A few KBs of L1 cache, about 10 nanoseconds. A few MBs of L2 cache, about 30 nanoseconds. A few GBs of RAM, about 60 nanoseconds. A few TBs of disk, about 10 milliseconds.

— Succinct Data Structures: Theory and Practice

9/15

Motivation and Context

Succinct Data Structures

There are Data Structures: Modify to use less space than the original one. That is not compression? No: It must provide efficient algorithms for simulating its operations. Why use it, if the memory is so cheap? Improve performance given the memory hierarchy, especially if we operate on RAM something that would need the disk.

— Succinct Data Structures: Theory and Practice

10/15

Motivation and Context

Things that will be cover We will review progress in various succinct compact data structures. These will give us theoretical and practical tools to take advantage of the memory hierarchy in the design of algorithms and data structures. We will see compact structures for: Manipulate bit sequences Manipulate symbols sequences Navigating trees Pattern matching and text search Navigating graphs

We will also see hashing applications, sets, partial sums, geometry, permutations, and more.

— Succinct Data Structures: Theory and Practice

11/15

Motivation and Context

Zero-order Entropy

Average number of bits needed to represent a symbol of a text T if each symbol receives always the same code. H0 (T ) = −

X nc c∈Σ

n

log

nc n

where nc is the number of occurrences of the symbol c in T and n is the length of T .

— Succinct Data Structures: Theory and Practice

12/15

Motivation and Context

K-order Entropy

If we can encode each symbol depending on the context in which it appears, we can achieve better compression ratios. Hk (T ) =

X |T s | H0 (T s ) n k

s∈Σ

where T s is the sequence of symbols preceded by the context s in T.

— Succinct Data Structures: Theory and Practice

13/15

Motivation and Context

Entropies of the Pizza&Chili 200MB test cases

k 0 1 2 3 4 5 6 7 8 9 10

dblp.xml Hk CT /n 5.257 0.0000 3.479 0.0000 2.170 0.0000 1.434 0.0007 1.045 0.0043 0.817 0.0130 0.705 0.0265 0.634 0.0427 0.574 0.0598 0.537 0.0773 0.508 0.0955

dna Hk CT /n 1.974 0.0000 1.930 0.0000 1.920 0.0000 1.916 0.0000 1.910 0.0000 1.901 0.0000 1.884 0.0001 1.862 0.0001 1.834 0.0004 1.802 0.0013 1.760 0.0051

english Hk CT /n 4.525 0.0000 3.620 0.0000 2.948 0.0001 2.422 0.0005 2.063 0.0028 1.839 0.0103 1.672 0.0265 1.510 0.0553 1.336 0.0991 1.151 0.1580 0.963 0.2292

— Succinct Data Structures: Theory and Practice

proteins Hk CT /n 4.201 0.0000 4.178 0.0000 4.156 0.0000 4.066 0.0001 3.826 0.0011 3.162 0.0173 1.502 0.1742 0.340 0.4506 0.109 0.5383 0.074 0.5588 0.061 0.5699

rand Hk 7.000 7.000 6.993 5.979 0.666 0.006 0.000 0.000 0.000 0.000 0.000

k128 CT /n 0.0000 0.0000 0.0001 0.0100 0.6939 0.9969 1.0000 1.0000 1.0000 1.0000 1.0000

sources Hk CT /n 5.465 0.0000 4.077 0.0000 3.102 0.0000 2.337 0.0012 1.852 0.0082 1.518 0.0250 1.259 0.0509 1.045 0.0850 0.867 0.1255 0.721 0.1701 0.602 0.2163

14/15

Motivation and Context

Calculation of H1 (T) in linear time

$

+0 10

6

9

12

4

1

— Succinct Data Structures: Theory and Practice

2

13

5

ndumulmum$ m$

3

14

umulmum$ m$nd

ndumulmum$ m$

11

+5H 4]) +5H0 ([1, ([1, 4])

m$ lmu mu ulmu ndum $

+0

lm u

7

3,1]) 1]) +6H00([2, 3, u+6H

lmum$ ndumu $ ulmum ndum m$ m u lm u ndumulmum$ m$

15 +0

um$ ulm u lm dum$

+0

8

0 15/15