Mar 16, 2012 ... Succinct Data Structures: Theory and Practice. 1/15 .... Problems: in theory
develop succinct data structures in practice constants in O(1)-time ...
Motivation and Context
Succinct Data Structures: Theory and Practice
March 16, 2012
— Succinct Data Structures: Theory and Practice
1/15
Motivation and Context
Contents
1 Motivation and Context Memory Hierarchy Succinct Data Structures Basics
— Succinct Data Structures: Theory and Practice
2/15
Motivation and Context
1 Motivation and Context Memory Hierarchy Succinct Data Structures Basics
— Succinct Data Structures: Theory and Practice
3/15
Motivation and Context
Succinct data structures Data structure D representation of an object + X Example: Rank-bit-vector bit vector b of length n (0,1,0,1,1,0,1,1)
+
operations on X
access b[i]P in O(1) time i−1 rank(i) = j=0 b[j] in O(n) time
in n bits space
— Succinct Data Structures: Theory and Practice
4/15
Motivation and Context
Succinct data structures Data structure D representation of an object + X
operations on X
Example: Rank-bit-vector bit vector b of length n access b[i]P in O(1) time (0,1,0,1,1,0,1,1) + i−1 rank(i) = j=0 b[j] in O(n) time (0,0,1,1,2,3,3,4) in n bits space
— Succinct Data Structures: Theory and Practice
4/15
Motivation and Context
Succinct data structures Data structure D representation of an object + X
operations on X
Example: Rank-bit-vector bit vector b of length n access b[i]P in O(1) time (0,1,0,1,1,0,1,1) + i−1 rank(i) = j=0 b[j] in O(1) time (0,0,1,1,2,3,3,4) in n + n log n bits space Succinct data structure D Space of D is close the information theoretic lower bound to represent X , while operations can still be performed efficient.
— Succinct Data Structures: Theory and Practice
4/15
Motivation and Context
Succinct data structures Data structure D representation of an object + X
operations on X
Example: Rank-bit-vector bit vector b of length n access b[i]P in O(1) time (0,1,0,1,1,0,1,1) + i−1 rank(i) = j=0 b[j] in O(1) time (0,0,1,1,2,3,3,4) in n + n log n bits space Succinct data structure D Space of D is close the information theoretic lower bound to represent X , while operations can still be performed efficient.
— Succinct Data Structures: Theory and Practice
4/15
Motivation and Context
Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !?
— Succinct Data Structures: Theory and Practice
5/15
Motivation and Context
Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !? ≈ 1 CPU cycle
CPU
≈ 100 B
≈ 5 CPU cycles
L1-Cache
≈ 10 KB
≈ 10-20
L2-Cache
≈ 512 KB
≈ 20-100
L3-Cache
≈ 1-8 MB
≈ 100-500
DRAM
≈ 106
Disk
≈ 4 GB ≈ x · 100 GB
Less memory ⇒ less costs !?
— Succinct Data Structures: Theory and Practice
5/15
Motivation and Context
Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !? Less memory ⇒ less costs !? main memory price per hour Instance name Micro 613.0 MB 0.02 US$ 68.4 GB 2.00 US$ High-Memory Quadruple Extra Large Pricing of Amazons Elastic Cloud Computing (EC2) service in July 2011.
Problems: in theory develop succinct data structures
in practice constants in O(1)-time terms are high o(n)-space term is not negligible complex data structures are hard to implement
— Succinct Data Structures: Theory and Practice
5/15
Motivation and Context
Succinct data structures Can succinct data structures replace classic uncompressed data structures in practice? Less memory ⇒ fewer CPU cycles !? Less memory ⇒ less costs !? Problems: in theory develop succinct data structures
in practice constants in O(1)-time terms are high o(n)-space term is not negligible complex data structures are hard to implement
— Succinct Data Structures: Theory and Practice
5/15
Motivation and Context
Memory Hierarchy Moore’s Law
“The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years”
Image source: Wikipedia — Succinct Data Structures: Theory and Practice
6/15
Motivation and Context
Memory Hierarchy
Reality
Processor speed increasing
Disks have not evolved at the same pace Image source: www.intel.com
— Succinct Data Structures: Theory and Practice
7/15
Motivation and Context
Memory Hierarchy
Image source: Wikipedia
— Succinct Data Structures: Theory and Practice
8/15
Motivation and Context
Memory Hierarchy
Some Numbers A few CPU registers, less than 1 nanosecond. A few KBs of L1 cache, about 10 nanoseconds. A few MBs of L2 cache, about 30 nanoseconds. A few GBs of RAM, about 60 nanoseconds. A few TBs of disk, about 10 milliseconds.
— Succinct Data Structures: Theory and Practice
9/15
Motivation and Context
Succinct Data Structures
There are Data Structures: Modify to use less space than the original one. That is not compression? No: It must provide efficient algorithms for simulating its operations. Why use it, if the memory is so cheap? Improve performance given the memory hierarchy, especially if we operate on RAM something that would need the disk.
— Succinct Data Structures: Theory and Practice
10/15
Motivation and Context
Things that will be cover We will review progress in various succinct compact data structures. These will give us theoretical and practical tools to take advantage of the memory hierarchy in the design of algorithms and data structures. We will see compact structures for: Manipulate bit sequences Manipulate symbols sequences Navigating trees Pattern matching and text search Navigating graphs
We will also see hashing applications, sets, partial sums, geometry, permutations, and more.
— Succinct Data Structures: Theory and Practice
11/15
Motivation and Context
Zero-order Entropy
Average number of bits needed to represent a symbol of a text T if each symbol receives always the same code. H0 (T ) = −
X nc c∈Σ
n
log
nc n
where nc is the number of occurrences of the symbol c in T and n is the length of T .
— Succinct Data Structures: Theory and Practice
12/15
Motivation and Context
K-order Entropy
If we can encode each symbol depending on the context in which it appears, we can achieve better compression ratios. Hk (T ) =
X |T s | H0 (T s ) n k
s∈Σ
where T s is the sequence of symbols preceded by the context s in T.
— Succinct Data Structures: Theory and Practice
13/15
Motivation and Context
Entropies of the Pizza&Chili 200MB test cases
k 0 1 2 3 4 5 6 7 8 9 10
dblp.xml Hk CT /n 5.257 0.0000 3.479 0.0000 2.170 0.0000 1.434 0.0007 1.045 0.0043 0.817 0.0130 0.705 0.0265 0.634 0.0427 0.574 0.0598 0.537 0.0773 0.508 0.0955
dna Hk CT /n 1.974 0.0000 1.930 0.0000 1.920 0.0000 1.916 0.0000 1.910 0.0000 1.901 0.0000 1.884 0.0001 1.862 0.0001 1.834 0.0004 1.802 0.0013 1.760 0.0051
english Hk CT /n 4.525 0.0000 3.620 0.0000 2.948 0.0001 2.422 0.0005 2.063 0.0028 1.839 0.0103 1.672 0.0265 1.510 0.0553 1.336 0.0991 1.151 0.1580 0.963 0.2292
— Succinct Data Structures: Theory and Practice
proteins Hk CT /n 4.201 0.0000 4.178 0.0000 4.156 0.0000 4.066 0.0001 3.826 0.0011 3.162 0.0173 1.502 0.1742 0.340 0.4506 0.109 0.5383 0.074 0.5588 0.061 0.5699
rand Hk 7.000 7.000 6.993 5.979 0.666 0.006 0.000 0.000 0.000 0.000 0.000
k128 CT /n 0.0000 0.0000 0.0001 0.0100 0.6939 0.9969 1.0000 1.0000 1.0000 1.0000 1.0000
sources Hk CT /n 5.465 0.0000 4.077 0.0000 3.102 0.0000 2.337 0.0012 1.852 0.0082 1.518 0.0250 1.259 0.0509 1.045 0.0850 0.867 0.1255 0.721 0.1701 0.602 0.2163
14/15
Motivation and Context
Calculation of H1 (T) in linear time
$
+0 10
6
9
12
4
1
— Succinct Data Structures: Theory and Practice
2
13
5
ndumulmum$ m$
3
14
umulmum$ m$nd
ndumulmum$ m$
11
+5H 4]) +5H0 ([1, ([1, 4])
m$ lmu mu ulmu ndum $
+0
lm u
7
3,1]) 1]) +6H00([2, 3, u+6H
lmum$ ndumu $ ulmum ndum m$ m u lm u ndumulmum$ m$
15 +0
um$ ulm u lm dum$
+0
8
0 15/15