Succinct Data Structures: Theory and Practice

Motivation and Context

March 16, 2012


Contents

1 Motivation and Context
    Memory Hierarchy
    Succinct Data Structures
    Basics




Succinct data structures

Data structure D = representation of an object X + operations on X.

Example: rank bit vector
- Object: a bit vector b of length n, e.g. b = (0,1,0,1,1,0,1,1), stored in n bits of space.
- access b[i] in O(1) time.
- $rank(i) = \sum_{j=0}^{i-1} b[j]$, e.g. (0,0,1,1,2,3,3,4) for i = 0, ..., 7:
  - in O(n) time when computed by scanning b, using only the n bits of b;
  - in O(1) time when all answers are stored explicitly, using n + n log n bits of space.

Succinct data structure D: the space of D is close to the information-theoretic lower bound to represent X, while operations can still be performed efficiently.
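The two extremes on this slide, and one point in between, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the block size of 4 are hypothetical, Python lists stand in for packed bit arrays, and the sampled variant is a simplified trade-off, not the constant-time o(n)-space construction from the literature.

```python
from itertools import accumulate

b = [0, 1, 0, 1, 1, 0, 1, 1]  # example bit vector from the slide


def rank_scan(bits, i):
    """rank(i) = number of 1-bits in bits[0..i-1]; linear scan, no extra space."""
    return sum(bits[:i])


def build_rank_table(bits):
    """Store rank(i) for every i in 0..n-1: O(1) lookups, about n log n extra bits."""
    return [0] + list(accumulate(bits))[:-1]


def build_block_ranks(bits, block=4):
    """Store rank only at block boundaries (block size is an arbitrary choice)."""
    prefix = [0] + list(accumulate(bits))  # prefix[i] == rank(i)
    return prefix[::block]


def rank_blocks(bits, samples, i, block=4):
    """Look up the sampled rank, then scan at most block - 1 bits."""
    start = (i // block) * block
    return samples[i // block] + sum(bits[start:i])


table = build_rank_table(b)
samples = build_block_ranks(b)
print(table)                        # rank(0), ..., rank(7)
print(rank_blocks(b, samples, 7))   # same answer as rank_scan(b, 7)
```

Storing fewer samples shrinks the extra space but lengthens the scan per query, which is the tension the definition of a succinct data structure is meant to resolve.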






Succinct data structures

Can succinct data structures replace classic uncompressed data structures in practice?

Less memory ⇒ fewer CPU cycles!?

Level       Size            Access cost (CPU cycles)
CPU         ≈ 100 B         ≈ 1
L1 cache    ≈ 10 KB         ≈ 5
L2 cache    ≈ 512 KB        ≈ 10-20
L3 cache    ≈ 1-8 MB        ≈ 20-100
DRAM        ≈ 4 GB          ≈ 100-500
Disk        ≈ x · 100 GB    ≈ 10^6

Less memory ⇒ lower costs!?




Pricing of Amazon's Elastic Compute Cloud (EC2) service in July 2011:

Instance name                        Main memory    Price per hour
Micro                                613.0 MB       0.02 US$
High-Memory Quadruple Extra Large    68.4 GB        2.00 US$

In theory: develop succinct data structures.

Problems in practice:
- constants in O(1)-time terms are high
- the o(n)-space term is not negligible
- complex data structures are hard to implement




Memory Hierarchy

Moore’s Law

“The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years”

Image source: Wikipedia


Memory Hierarchy

Reality

Processor speed keeps increasing.

Disks have not evolved at the same pace.

Image source: www.intel.com




Memory Hierarchy

Image source: Wikipedia




Memory Hierarchy

Some numbers
- A few CPU registers: less than 1 nanosecond.
- A few KBs of L1 cache: about 10 nanoseconds.
- A few MBs of L2 cache: about 30 nanoseconds.
- A few GBs of RAM: about 60 nanoseconds.
- A few TBs of disk: about 10 milliseconds.
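A quick back-of-the-envelope calculation shows how far apart these levels are. The sketch below takes the figures above at face value; rounding "less than 1 nanosecond" up to 1 ns is an assumption, and the dictionary and function names are made up for illustration.

```python
# Approximate access times from the list above, in nanoseconds.
latency_ns = {
    "registers": 1,          # "less than 1 ns", rounded up to 1 (assumption)
    "L1 cache": 10,
    "L2 cache": 30,
    "RAM": 60,
    "disk": 10_000_000,      # 10 milliseconds
}


def slowdown(slow, fast):
    """How many accesses to `fast` fit into one access to `slow`."""
    return latency_ns[slow] / latency_ns[fast]


print(f"RAM vs L1 cache: {slowdown('RAM', 'L1 cache'):.0f}x")
print(f"disk vs RAM:     {slowdown('disk', 'RAM'):,.0f}x")
```

With these numbers one disk access costs as much as roughly 170,000 RAM accesses, so a structure that is several times slower per operation but fits in RAM can still come out ahead.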




Succinct Data Structures

They are data structures: modified to use less space than the original ones.

Is that not just compression? No: a succinct data structure must also provide efficient algorithms for simulating the operations of the original.

Why use them if memory is so cheap? To improve performance given the memory hierarchy, especially when we can operate in RAM on something that would otherwise need the disk.




Topics that will be covered

We will review progress on various succinct and compact data structures. These will give us theoretical and practical tools to take advantage of the memory hierarchy in the design of algorithms and data structures.

We will see compact structures for:
- manipulating bit sequences
- manipulating symbol sequences
- navigating trees
- pattern matching and text search
- navigating graphs

We will also see applications to hashing, sets, partial sums, geometry, permutations, and more.




Zero-order Entropy

Average number of bits needed to represent a symbol of a text T if each symbol always receives the same code:

$$H_0(T) = -\sum_{c \in \Sigma} \frac{n_c}{n} \log \frac{n_c}{n}$$

where $n_c$ is the number of occurrences of the symbol c in T and n is the length of T.
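As a small illustration, the definition translates almost directly into Python. This is a minimal sketch: the function name `h0` is hypothetical, logarithms are taken base 2 so that the result is in bits, and the sum uses log(n / n_c) instead of -log(n_c / n), which is the same quantity.

```python
from collections import Counter
from math import log2


def h0(text):
    """Zero-order empirical entropy of `text`, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    # Counter(text) maps each symbol c to its number of occurrences n_c.
    return sum((nc / n) * log2(n / nc) for nc in Counter(text).values())


print(h0("abab"))      # two equally frequent symbols
print(h0("aaaa"))      # a single symbol carries no information
print(h0("abcdabcd"))  # four equally frequent symbols
```

A text over σ equally frequent symbols reaches the maximum of log σ bits per symbol, while a text consisting of one repeated symbol has entropy 0.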




k-th Order Entropy

If we can encode each symbol depending on the context in which it appears, we can achieve better compression ratios:

$$H_k(T) = \sum_{s \in \Sigma^k} \frac{|T^s|}{n} H_0(T^s)$$

where $T^s$ is the sequence of symbols preceded by the context s in T.
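The formula can be evaluated by grouping every symbol under the k symbols that precede it. The sketch below is one possible reading of the definition, with hypothetical names `hk` and `h0_counts`; it assumes that the first k symbols of T, which have no full context, are simply left out of every T^s.

```python
from collections import Counter, defaultdict
from math import log2


def h0_counts(counts):
    """Zero-order entropy (bits per symbol) of a sequence with these symbol counts."""
    counts = list(counts)
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts)


def hk(text, k):
    """k-th order empirical entropy of `text`, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    followers = defaultdict(Counter)  # context s -> symbol counts of T^s
    for i in range(n - k):
        followers[text[i:i + k]][text[i + k]] += 1
    # Sum of |T^s| * H0(T^s) over all contexts s that occur, divided by n.
    weighted = sum(sum(cnt.values()) * h0_counts(cnt.values())
                   for cnt in followers.values())
    return weighted / n


print(hk("abab", 0))  # k = 0 reduces to the zero-order entropy
print(hk("abab", 1))  # each symbol is fully determined by its predecessor
print(hk("aabb", 1))
```

For k = 0 the only context is the empty string, so the function returns H0(T); larger k can only keep the value the same or lower it, at the price of more distinct contexts to keep track of.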




Entropies of the Pizza&Chili 200MB test cases

      dblp.xml          dna               english           proteins          rand_k128         sources
 k    Hk      CT/n      Hk      CT/n      Hk      CT/n      Hk      CT/n      Hk      CT/n      Hk      CT/n
 0    5.257   0.0000    1.974   0.0000    4.525   0.0000    4.201   0.0000    7.000   0.0000    5.465   0.0000
 1    3.479   0.0000    1.930   0.0000    3.620   0.0000    4.178   0.0000    7.000   0.0000    4.077   0.0000
 2    2.170   0.0000    1.920   0.0000    2.948   0.0001    4.156   0.0000    6.993   0.0001    3.102   0.0000
 3    1.434   0.0007    1.916   0.0000    2.422   0.0005    4.066   0.0001    5.979   0.0100    2.337   0.0012
 4    1.045   0.0043    1.910   0.0000    2.063   0.0028    3.826   0.0011    0.666   0.6939    1.852   0.0082
 5    0.817   0.0130    1.901   0.0000    1.839   0.0103    3.162   0.0173    0.006   0.9969    1.518   0.0250
 6    0.705   0.0265    1.884   0.0001    1.672   0.0265    1.502   0.1742    0.000   1.0000    1.259   0.0509
 7    0.634   0.0427    1.862   0.0001    1.510   0.0553    0.340   0.4506    0.000   1.0000    1.045   0.0850
 8    0.574   0.0598    1.834   0.0004    1.336   0.0991    0.109   0.5383    0.000   1.0000    0.867   0.1255
 9    0.537   0.0773    1.802   0.0013    1.151   0.1580    0.074   0.5588    0.000   1.0000    0.721   0.1701
10    0.508   0.0955    1.760   0.0051    0.963   0.2292    0.061   0.5699    0.000   1.0000    0.602   0.2163



Calculation of H1(T) in linear time

[Figure: suffix tree of the example text T = umulmundumulmum$ (n = 16), with leaves labeled by the suffix positions 0 to 15. Each node at string depth 1 is annotated with its contribution to n · H1(T): +5 H0([1,4]) for the context m, +6 H0([2,3,1]) for the context u, and +0 for every other context.]
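The per-context contributions can be checked with a single pass over the text. The sketch below swaps the suffix tree traversal of the slide for plain hash tables, so it illustrates the arithmetic rather than the method shown in the figure. It assumes the example text is umulmundumulmum$, which is what the edge labels of the figure suggest, and the helper name `h0_counts` is hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0_counts(counts):
    """Zero-order entropy (bits per symbol) of a sequence with these symbol counts."""
    counts = list(counts)
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts)


T = "umulmundumulmum$"  # assumed example text, n = 16

# One pass: for each context symbol, count the symbols that follow it.
followers = defaultdict(Counter)
for context, symbol in zip(T, T[1:]):
    followers[context][symbol] += 1

# Contribution |T^s| * H0(T^s) of each context s to n * H1(T).
contributions = {s: sum(cnt.values()) * h0_counts(cnt.values())
                 for s, cnt in followers.items()}
h1 = sum(contributions.values()) / len(T)

for s, value in sorted(contributions.items()):
    print(s, sorted(followers[s].values()), round(value, 3))
print("H1(T) =", round(h1, 3))
```

Only the contexts m (follower counts 1 and 4) and u (follower counts 2, 3 and 1) contribute; every other context is always followed by the same symbol and adds 0.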
