Compressed Data Structures with Applications on Compact Integer Representations
M. Oğuzhan Külekci
[email protected]
TÜBİTAK–BİLGEM – UEKAE National Research Institute of Electronics & Cryptology, Turkey
Summer School on Selected Topics in Massive Data Management July 1-4, 2013
M. O. Külekci (UEKAE)
Compressed Data Structures - Applications
Summer School-MDM
1 / 114
Outline
1 Introduction
2 Rank/Select Dictionaries
3 Wavelet Trees
4 WT and R/S Applications on Coding
    Non-Prefix-Free Codes with Wavelet Trees
    Revisiting Variable-Length Codes with WT and R/S Dictionaries
    Constant-time Prefix Summation
5 Introduction to Text Indexing
6 Basic Structures in Text Indexing
    Subword Trie
    Suffix Tree
    Suffix Array
    Sparsification & Sampling on Suffix Arrays
7 Burrows-Wheeler Transform and the FM-index
    Reconstructing text from BWT
    Backwards Searching Paradigm
How to Sail in a Data Tsunami?
Do our best to ...
    perform minimal I/O
    fit data in memory, as close as possible to the CPU
Compression: Picking up the buoy
Keep data compressed in storage ... not to save money, but to increase data transfer speed!
Is "fetch compressed data, then decode" faster than "fetch uncompressed data"? Yes!
Find ways to work directly on compressed data: Compressed Data Structures
Compressed Data Structures
The Aim
Represent the data structure in space as small as possible, without a loss in its functionality.
    G. Jacobson, Succinct Static Data Structures, PhD thesis, Carnegie Mellon University, 1989.
    D. Clark, Compact Pat Trees, PhD thesis, University of Waterloo, Canada, 1996.
Compressed arrays, lists, trees, ...
Very active area in the last decade, especially in data management and information retrieval.
See the keynote speech delivered by Jeff Vitter at CIKM'12, "Compressed Data Structures with Relevance".
We will be using rank/select dictionaries and wavelet trees throughout this talk...
Rank/Select Dictionaries (RSDic)
       b1 b2 b3 b4 b5 b6 b7 b8 ... bn
B  =   0  1  1  0  0  1  0  1  ...
rank1(6) = 3: the number of 1s (0s) occurring up to position i
select1(4) = 8: the position of the ith 1 (0) bit
Both can be achieved in O(1) time by using space
    $n + o(n)$ bits, by Jacobson'89 (thesis) and Clark & Munro (SODA'96)
    $nH_0(B) + o(n)$ bits, by Raman, Raman, Rao (SODA'02)
$H_0(B)$ is the 0th-order entropy of bitmap B.
Many alternatives are still appearing ... see the SPIRE'11 and SPIRE'12 tutorials on "Space-efficient Data Structures" for details.
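To make the two query definitions concrete, here is a minimal linear-time sketch in Python. It is only a reference implementation of the semantics, not one of the O(1)-time structures cited above; the 8-bit prefix `01100101` is an assumed example chosen so that rank1(6) = 3 and select1(4) = 8, and the function names are illustrative.

```python
def rank1(bits, i):
    """Number of 1s among positions 1..i (1-based, inclusive)."""
    return sum(bits[:i])


def select1(bits, k):
    """1-based position of the k-th 1 bit, or None if there are fewer than k ones."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if b == 1 and seen == k:
            return pos
    return None


# Assumed example bitmap (only the first 8 bits of some longer B).
B = [0, 1, 1, 0, 0, 1, 0, 1]
print(rank1(B, 6), select1(B, 4))
```

The rank of 0s follows as `i - rank1(bits, i)`, which is why structures usually store counts for one bit value only.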
Computing Rank in O(1)–time
Step 1: Split the n bits into superblocks of $\log^2 n$ bits each.

    | log² n bits | log² n bits |  ...  | log² n bits |      (n bits in total)
      S[1]          S[2]           ...    S[n / log² n]

S[i]: number of 1 bits observed from the beginning of the n-bit array to the beginning of the ith superblock.
There are $\frac{n}{\log^2 n}$ superblocks, and each S[i] requires $\log n$ bits.
Total space required for the S array is $\frac{n}{\log^2 n} \cdot \log n = \frac{n}{\log n}$ bits.
Computing Rank in O(1)–time
Step 2: Split the n bits into blocks of $\frac{\log n}{2}$ bits each.

    | (log n)/2 bits | (log n)/2 bits |  ...  | (log n)/2 bits |      (n bits in total)
      B[1]             B[2]              ...    B[2n / log n]

B[i]: number of 1 bits observed from the beginning of the corresponding superblock to the beginning of the ith block.
There are $\frac{2n}{\log n}$ blocks, and each B[i] requires $\log(\log n \cdot \log n) = 2 \log\log n$ bits.
Total space required for the B array is $\frac{2n}{\log n} \cdot 2\log\log n = \frac{4n \log\log n}{\log n} = O\!\left(\frac{n \log\log n}{\log n}\right)$ bits.
Computing Rank in O(1)–time
Step 3: Build a table for all possible bit patterns of length $\frac{\log n}{2}$, such that ...

    c \ r   000  001  010  011  100  101  110  111
    1        0    0    0    0    1    1    1    1
    2        0    0    1    1    1    1    2    2
    3        0    1    1    2    1    2    2    3

The table has $2^{\frac{\log n}{2}}$ rows and $\frac{\log n}{2}$ columns, where each cell(r,c) contains the number of 1s observed in the binary representation of row id r up to the cth bit. (The example above is drawn transposed, with the patterns r across the top.)
Total space required for the table is $2^{\frac{\log n}{2}} \cdot \frac{\log n}{2} \cdot \log\frac{\log n}{2} \approx \frac{\sqrt{n}\,\log n\,\log\log n}{2}$ bits.
Computing Rank in O(1)–time
Overall picture

    Structure      Space (bits)
    Raw bitmap     $n$
    Superblocks    $\frac{n}{\log n}$
    Blocks         $O\!\left(\frac{n \log\log n}{\log n}\right)$
    Table          $\frac{\sqrt{n}\,\log n\,\log\log n}{2}$

rank(i) = S[j] + B[k] + T[r][c], with the properly computed j, k, r, c corresponding to i.
Space: $n + O\!\left(\frac{n \log\log n}{\log n}\right) = n + o(n)$ bits.
Time: O(1)
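The three-step construction can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the layout of any particular library: the class name `RankBitmap` and its parameters are hypothetical, block and superblock lengths are passed in explicitly instead of being derived from log n, and slicing a block out of a Python list stands in for reading one machine word.

```python
class RankBitmap:
    """Two-level rank directory (S, B) plus a lookup table T over block patterns."""

    def __init__(self, bits, block_len=3, blocks_per_super=3):
        self.bits = list(bits)
        self.b = block_len
        self.sb = block_len * blocks_per_super
        # T[pattern][c] = number of 1s among the first c bits of the pattern.
        self.T = {}
        for p in range(1 << block_len):
            pat = tuple((p >> (block_len - 1 - k)) & 1 for k in range(block_len))
            self.T[pat] = [sum(pat[:c]) for c in range(block_len + 1)]
        # S[j]: 1s before superblock j; B[k]: 1s from superblock start to block k.
        self.S, self.B = [], []
        total = within = 0
        for start in range(0, len(self.bits), block_len):
            if start % self.sb == 0:
                self.S.append(total)
                within = 0
            self.B.append(within)
            ones = sum(self.bits[start:start + block_len])
            total += ones
            within += ones

    def rank1(self, i):
        """Number of 1s among positions 1..i (1-based)."""
        if i == 0:
            return 0
        j = (i - 1) // self.sb          # superblock index (0-based)
        k = (i - 1) // self.b           # block index (0-based)
        c = (i - 1) % self.b + 1        # bits to count inside the block
        block = self.bits[k * self.b:(k + 1) * self.b]
        pat = tuple(block + [0] * (self.b - len(block)))  # pad a short last block
        return self.S[j] + self.B[k] + self.T[pat][c]


bits = [int(ch) for ch in "010011000110101010100111101110"]
rb = RankBitmap(bits)
print(rb.rank1(27))
```

With 9-bit superblocks and 3-bit blocks, `rank1(27)` decomposes as S = 8, B = 4, T[101][3] = 2, for a total of 14. Note that the code indexes S and B from 0, whereas the slides index from 1.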
Rank Example
Superblock length: 9 bits. Block length: 3 bits.

    Bitmap   010 011 000 | 110 101 010 | 100 111 101 | 110 ...
    S        0             3             8             ...
    B        0   1   3     0   2   4     0   1   4     0   ...

    T        000  001  010  011  100  101  110  111
    1         0    0    0    0    1    1    1    1
    2         0    0    1    1    1    1    2    2
    3         0    1    1    2    1    2    2    3

rank(27) = S[3] + B[9] + T[101][3] = 8 + 4 + 2 = 14.
Rank in compressed space
We can even improve the $n + o(n)$ space to $nH_0 + o(n)$ by keeping the raw bitmap in zero-order compressed form.
This needs some further auxiliary data structures, which will not change the o(n) complexity.

    Block    010  011  000  110  101  010  100  111  101  110  ...
    ParID     1    2    0    2    2    1    1    3    2    2   ...
    PermID    1    0    0    2    1    1    2    0    1    2   ...

    ParID   pattern → PermID
    0       000 → 0
    1       001 → 0, 010 → 1, 100 → 2
    2       011 → 0, 101 → 1, 110 → 2
    3       111 → 0
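Zero-order Compressed Bitmap
The zero-order entropy of an n-bit bitmap with m 1 bits is $\left\lceil \log \binom{n}{m} \right\rceil$.
Split the n-bit bitmap into blocks of $\frac{\log n}{2}$ bits, and represent each block via a ⟨ParID, PermID⟩ tuple.
The ParID takes space $\frac{2n}{\log n} \cdot \log\left(\frac{\log n}{2}\right) = O\!\left(\frac{n \log\log n}{\log n}\right)$ bits.
The PermID takes space
$$\sum_{i=1}^{\frac{2n}{\log n}} \left\lceil \log \binom{\frac{\log n}{2}}{\mathrm{ParID}_i} \right\rceil \;\le\; \log \prod_{i=1}^{\frac{2n}{\log n}} \binom{\frac{\log n}{2}}{\mathrm{ParID}_i} + O\!\left(\frac{n}{\log n}\right) \;\le\; \log \binom{n}{m} + O\!\left(\frac{n}{\log n}\right)$$
bits.
ParID + PermID takes $n \cdot H_0 + O\!\left(\frac{n \log\log n}{\log n}\right) = n \cdot H_0 + o(n)$ bits.
Auxiliary data structures for rank queries do not further increase the o(n) complexity.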
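As a rough illustration of the ⟨ParID, PermID⟩ representation, the following Python sketch encodes fixed-length blocks as a class (number of 1s) plus an offset within that class. It assumes offsets are assigned in increasing order of the block pattern and that the bitmap length is a multiple of the block length; the function names are hypothetical, and the lookup by `list.index` is a stand-in for the table-driven decoding a real implementation would use.

```python
from math import comb


def patterns_by_class(b):
    """Group all b-bit patterns by their number of 1s, in increasing pattern order."""
    classes = {c: [] for c in range(b + 1)}
    for p in range(1 << b):
        pat = format(p, f"0{b}b")
        classes[pat.count("1")].append(pat)
    return classes


def encode(bitmap, b):
    """Turn a bit string into a list of (ParID, PermID) pairs, one per b-bit block."""
    classes = patterns_by_class(b)
    pairs = []
    for s in range(0, len(bitmap), b):
        block = bitmap[s:s + b]
        par = block.count("1")
        pairs.append((par, classes[par].index(block)))
    return pairs


def decode(pairs, b):
    """Inverse of encode."""
    classes = patterns_by_class(b)
    return "".join(classes[par][perm] for par, perm in pairs)


def perm_id_bits(pairs, b):
    """Total bits for the PermIDs: ceil(log2 C(b, ParID)) per block."""
    return sum((comb(b, par) - 1).bit_length() for par, _ in pairs)


bitmap = "010011000110101010100111101110"
pairs = encode(bitmap, 3)
print(pairs, perm_id_bits(pairs, 3))
```

Blocks that are all 0s or all 1s need zero PermID bits, which is where the saving on skewed bitmaps comes from.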
Rank/Select over Large Alphabets
We have seen the R/S structure on binary arrays. What if we want to support the same queries on larger alphabets?
e.g., how many times does the letter e occur in the first 7 symbols of wavelettree?
Wavelet trees were introduced by Grossi, Gupta, Vitter, SODA'03.
We can answer in $O(\log \sigma)$ time using $n \log \sigma + o(n \log \sigma)$ bits of space (which can be further improved to $nH_0 + o(n \log \sigma)$).
Wavelet Trees [Grossi, Gupta, Vitter SODA'03]

    wavelettree    0 ← {a,e,l}, 1 ← {r,t,v,w}
    10100011100
        aeleee     0 ← {a}, 1 ← {e,l}
        011111
            eleee  0 ← {e}, 1 ← {l}
            01000
        wvttr      0 ← {r,t}, 1 ← {v,w}
        11000
            ttr    0 ← {r}, 1 ← {t}
            110
            wv     0 ← {v}, 1 ← {w}
            10
Wavelet Trees [Grossi, Gupta, Vitter SODA'03]
rank(e,7) = ?
The prefix examined at each node is shown to the left of the "|" mark.

    wavelet|tree      0 ← {a,e,l}, 1 ← {r,t,v,w}
    1010001|1100
        aele|ee       0 ← {a}, 1 ← {e,l}      leaf a(0)
        0111|11
            ele|ee    0 ← {e}, 1 ← {l}        leaves e(0), l(1)
            010|00
        wvttr         0 ← {r,t}, 1 ← {v,w}
        11000
            ttr       0 ← {r}, 1 ← {t}        leaves r(0), t(1)
            110
            wv        0 ← {v}, 1 ← {w}        leaves v(0), w(1)
            10
Wavelet Trees [Grossi, Gupta, Vitter SODA'03]
access(7) = ? requires 2 rank queries and 3 bitmap accesses.
The bit inspected at each node is shown in brackets.

    wavelettree       0 ← {a,e,l}, 1 ← {r,t,v,w}
    101000[1]1100
        aeleee        0 ← {a}, 1 ← {e,l}      leaf a(0)
        011111
            eleee     0 ← {e}, 1 ← {l}        leaves e(0), l(1)
            01000
        wvttr         0 ← {r,t}, 1 ← {v,w}
        11[0]00
            ttr       0 ← {r}, 1 ← {t}        leaves r(0), t(1)
            [1]10
            wv        0 ← {v}, 1 ← {w}        leaves v(0), w(1)
            10
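The rank and access walks on this tree can be sketched in Python as below. This is a simplified pointer-based version under assumptions: the alphabet is split at its sorted midpoint (which happens to reproduce the {a,e,l} / {r,t,v,w} partition for wavelettree), and a plain prefix-count array stands in for an O(1) rank dictionary on each node's bitmap. The class name is illustrative.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits, self.prefix = [], [0]
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            lo = set(self.alphabet[:mid])
            self.bits = [0 if ch in lo else 1 for ch in seq]
            # prefix[i] = number of 1s in bits[:i]; stand-in for a rank dictionary.
            for b in self.bits:
                self.prefix.append(self.prefix[-1] + b)
            self.left = WaveletTree([c for c in seq if c in lo], self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in lo], self.alphabet[mid:])

    def rank(self, ch, i):
        """Occurrences of ch among the first i symbols."""
        if ch not in self.alphabet:
            return 0
        node = self
        while len(node.alphabet) > 1 and i > 0:
            mid = len(node.alphabet) // 2
            ones = node.prefix[i]
            if ch in node.alphabet[:mid]:
                i, node = i - ones, node.left    # follow the 0 branch
            else:
                i, node = ones, node.right       # follow the 1 branch
        return i

    def access(self, i):
        """Symbol at 1-based position i."""
        node = self
        while len(node.alphabet) > 1:
            bit = node.bits[i - 1]
            ones = node.prefix[i]
            i = ones if bit else i - ones
            node = node.right if bit else node.left
        return node.alphabet[0]


wt = WaveletTree("wavelettree")
print(wt.rank("e", 7), wt.access(7))
```

For access(7) the walk reads one bit at each of three levels and performs a rank at the two upper levels, which matches the "2 rank queries and 3 bitmap accesses" count.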
Improving access in WT
Wavelet Trees for All, Navarro, CPM'12 (survey)
WT with Doubly-logarithmic Access
$O(\log\log \sigma)$ access is possible by splitting the sequence according to the log values of the symbols instead of the plain alphabet.
We will work on integer sequences, without loss of generality.
Let $X = x_1 x_2 \ldots x_n$, $x_i \in \{0, 1, 2, \ldots, U - 1\}$, and let L be the set of unique values in $\{\lfloor \log x_1 \rfloor, \lfloor \log x_2 \rfloor, \ldots, \lfloor \log x_n \rfloor\}$.
Doubly-Logarithmic Access WT

    symbol     a     e     l      r      t       v       w
    i          0(0)  1(1)  2(10)  3(11)  4(100)  5(101)  6(110)
    ⌊log i⌋    0     0     1      1      2       2       2
    L = {0, 1, 2}

    wavelettree    0 ← {a,e,l,r}, 1 ← {t,v,w}
    10100011000
        aeleree    0 ← {a,e}, 1 ← {l,r}
        0010100
            leaf:  0 1 1 1 1       (a e e e e)
            leaf:  10 11           (l r)
        leaf:      110 101 100 100 (w v t t)
Doubly-Logarithmic Access WT
access(7) = ? requires 1 rank query and 2 bitmap accesses.
The bit inspected at the root is shown in brackets.

    symbol     a     e     l      r      t       v       w
    i          0(0)  1(1)  2(10)  3(11)  4(100)  5(101)  6(110)
    ⌊log i⌋    0     0     1      1      2       2       2
    L = {0, 1, 2}

    wavelettree    0 ← {a,e,l,r}, 1 ← {t,v,w}
    101000[1]1000
        aeleree    0 ← {a,e}, 1 ← {l,r}
        0010100
            leaf:  0 1 1 1 1       (a e e e e)
            leaf:  10 11           (l r)
        leaf:      110 101 100 100 (w v t t)
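A small Python sketch of this log-grouped variant follows. It is an illustration under assumptions rather than the author's implementation: values 0 and 1 are both placed in group 0 as in the table, the group set is split so that the lower half takes the extra element (which reproduces the {0,1} / {2} split), leaves keep the integer values in a list to stand in for fixed-width codes, and the names `group` and `LogWT` are hypothetical.

```python
def group(x):
    """floor(log2 x), with 0 and 1 both mapped to group 0."""
    return max(x.bit_length() - 1, 0)


class LogWT:
    def __init__(self, values, groups=None):
        self.groups = sorted({group(v) for v in values}) if groups is None else groups
        self.leaf = None
        if len(self.groups) == 1:
            # All codes in a leaf have the same length, so indexing is O(1).
            self.leaf = list(values)
            return
        mid = (len(self.groups) + 1) // 2
        lo = set(self.groups[:mid])
        self.bits = [0 if group(v) in lo else 1 for v in values]
        self.prefix = [0]  # prefix[i] = number of 1s in bits[:i]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        self.left = LogWT([v for v in values if group(v) in lo], self.groups[:mid])
        self.right = LogWT([v for v in values if group(v) not in lo], self.groups[mid:])

    def access(self, i):
        """Value at 1-based position i."""
        node = self
        while node.leaf is None:
            bit = node.bits[i - 1]
            ones = node.prefix[i]
            i = ones if bit else i - ones
            node = node.right if bit else node.left
        return node.leaf[i - 1]


symbols = "aelrtvw"  # symbol -> integer code 0..6, as in the table
wt = LogWT([symbols.index(ch) for ch in "wavelettree"])
print(symbols[wt.access(7)])
```

Because the tree is built over the at most ⌊log U⌋ + 1 distinct log values rather than over the alphabet, its depth is doubly logarithmic in U.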
Doubly-Logarithmic Access WT
Contribution
We improved access to run in $O(\log\log \sigma)$ time.
A WT supporting doubly-logarithmic rank/select is still open.
Opportunity
Leaf nodes are not dedicated to a unique symbol, but include all symbols having the same code length!
Codewords are not required to be prefix-free!
Unique Decodability and Prefix-free Codes
Prefix-free codes
    No codeword is a prefix of another, e.g., Huffman codes.
    Code lengths must satisfy the Kraft-McMillan inequality.
    Hence, uniquely decodable.
    No direct access to the ith codeword.
[Figure: a sample Huffman tree of a 26-symbol alphabet. Code lengths vary from 3 to 9 bits.]
Non-Prefix-free codes
    No restriction on codewords except being unique.
    Code lengths are minimal.
    NOT uniquely decodable.
    No direct access to the ith codeword.
For a 26-symbol alphabet, simply assign codewords of length 1 to 5 bits as {0, 1, 00, 01, . . ., 11001} according to frequencies.
The Problem
Uniquely decodable and directly accessible non-prefix-free codes with reduced overhead.
Alternative Solutions
Notation
$T = t_1 t_2 \ldots t_n$, where $t_i \in \Sigma = \{\epsilon_1, \epsilon_2, \ldots, \epsilon_\sigma\}$.
The coding scheme $C : \Sigma \to A$.
$A = \{\alpha_1, \alpha_2, \ldots, \alpha_\sigma\}$ is the non-prefix-free variable-length codeword set such that $C(\epsilon_i) = \alpha_i$.
Encoded sequence $C(T) = C(t_1)C(t_2) \ldots C(t_n) = c_1 c_2 \ldots c_n$.
$|C(T)| = |c_1| + |c_2| + \ldots + |c_n|$.
Naive solution
Store the address of each codeword → $n \log |C(T)|$ bits overhead.
Supports both unique decodability and O(1) access.
Improved Naive Solutions
Dense Sampling [Ferragina & Venturini'07]
Dense sampling by Ferragina & Venturini'07 achieves O(1)-time access.
Keep addresses in a double layer: first split the code sequence into blocks, and then represent inner addresses relatively.
$O\big(n \cdot (\log\log |C(T)| + \log\log \max(|\alpha_1|, |\alpha_2|, \ldots, |\alpha_\sigma|))\big)$ bits overhead.
Benefit from compact integer representations
The beginning (or ending) bit positions of the codewords form the set $P = \{p_1, p_2, \ldots, p_n\}$.
Store P in a compact way, e.g., via DACs (Brisaboa et al.'13).
Overhead is $\Upsilon + \frac{\Upsilon}{b} + o\!\left(\frac{\Upsilon}{b}\right)$, where $\Upsilon = \sum_{\forall i} \lfloor \log p_i \rfloor = \sum_{i=1}^{n-1} \left\lfloor \log \sum_{j=1}^{i} |c_j| \right\rfloor$, and b is a parameter.
Supports O(1) access.
Improved Naive Solutions
Gap encoding of codeword boundaries
Instead of codeword addresses, store codeword lengths.
A prefix sum is required to access a random codeword; use improved-AC coding (Elmasry et al.'12).
Overhead: $n \cdot \log\left(\frac{|C(T)|}{n}\right) + O(n)$ bits
Access time: $O(\log\log(n + |C(T)|))$
How about using WT with doubly-logarithmic access?
Uniquely Decodable & Directly Accessible Non-Prefix-Free Codes with Wavelet Trees [Külekci, ISIT'13]
Σ = {a, b, c, d, e, f, g, h}, A = {0, 1, 00, 01, 10, 11, 000, 001}, C : Σ → A

    T      a  e   b  f   d   c   b  b  d   h    b  b  g    f   a  a  a
    C(T)   0  10  1  11  01  00  1  1  01  001  1  1  000  11  0  0  0

    L1 = {1,2,3}   B1 = 01011100110011000
        L2 = {1}       0(a) 1(b) 1(b) 1(b) 1(b) 1(b) 0(a) 0(a) 0(a)
        L3 = {2,3}     B3 = 00000110
            L4 = {2}       10(e) 11(f) 01(d) 00(c) 01(d) 11(f)
            L5 = {3}       001(h) 000(g)
Space complexity
Theorem
Any non-prefix-free coding scheme can be uniquely decoded by creating a wavelet tree over the encoded sequence. Such a wavelet tree occupies at most $n \cdot \lceil \log q \rceil$ bits, where q is the number of distinct codeword lengths observed in the encoded sequence.
Proof.
The depth of the wavelet tree is $\lceil \log q \rceil$.
At each level there are n bits, hence $n \cdot \lceil \log q \rceil$ bits in total.
The topology of the WT can be extracted once |T| and the set L are known, which requires negligible space.
Direct Access
Σ = {a, b, c, d, e, f, g, h}, A = {0, 1, 00, 01, 10, 11, 000, 001}, C : Σ → A
access(14) requires 2 rank queries and 3 bitmap accesses.

    T      a  e   b  f   d   c   b  b  d   h    b  b  g    f   a  a  a
    C(T)   0  10  1  11  01  00  1  1  01  001  1  1  000  11  0  0  0

    L1 = {1,2,3}   B1 = 01011100110011000      B1[14] = 1, rank1(B1, 14) = 8
        L2 = {1}       0(a) 1(b) 1(b) 1(b) 1(b) 1(b) 0(a) 0(a) 0(a)
        L3 = {2,3}     B3 = 00000110           B3[8] = 0, rank0(B3, 8) = 6
            L4 = {2}       10(e) 11(f) 01(d) 00(c) 01(d) 11(f)      6th codeword: 11 (f)
            L5 = {3}       001(h) 000(g)
Direct Access
Lemma
By reserving an additional $o(n \cdot \lceil \log q \rceil)$ bits in the proposed WT construction, any codeword in a sequence encoded with a variable-length coding scheme (including non-prefix-free codes) can be reached directly in at most $\lceil \log q \rceil$ steps.
Proof.
The depth of the WT is at most $\lceil \log q \rceil$.
At each visited node, a rank query is run. O(1)-time rank over n bits requires o(n) bits of overhead. Thus, the total overhead space is $n \cdot \lceil \log q \rceil + o(n \cdot \lceil \log q \rceil)$ bits.
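The construction and the direct-access walk can be sketched in Python on the eight-symbol example code. This is a toy under stated assumptions, not the ISIT'13 implementation: the set of codeword lengths is split at its midpoint (giving {1} versus {2,3}), leaves store codewords as strings, prefix-count arrays stand in for o(n)-bit rank dictionaries, and the names `CODE` and `LengthWT` are illustrative.

```python
CODE = {"a": "0", "b": "1", "c": "00", "d": "01",
        "e": "10", "f": "11", "g": "000", "h": "001"}
DECODE = {cw: sym for sym, cw in CODE.items()}


class LengthWT:
    """Wavelet tree over codeword lengths; each leaf holds equal-length codewords."""

    def __init__(self, codewords, lengths=None):
        self.lengths = sorted({len(c) for c in codewords}) if lengths is None else lengths
        self.leaf = None
        if len(self.lengths) == 1:
            self.leaf = list(codewords)
            return
        mid = len(self.lengths) // 2
        lo = set(self.lengths[:mid])
        self.bits = [0 if len(c) in lo else 1 for c in codewords]
        self.prefix = [0]  # prefix[i] = number of 1s in bits[:i]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        self.left = LengthWT([c for c in codewords if len(c) in lo], self.lengths[:mid])
        self.right = LengthWT([c for c in codewords if len(c) not in lo], self.lengths[mid:])

    def access(self, i):
        """Codeword at 1-based position i."""
        node = self
        while node.leaf is None:
            bit = node.bits[i - 1]
            ones = node.prefix[i]
            i = ones if bit else i - ones
            node = node.right if bit else node.left
        return node.leaf[i - 1]

    def overhead_bits(self):
        """Total length of the internal-node bitmaps."""
        if self.leaf is not None:
            return 0
        return len(self.bits) + self.left.overhead_bits() + self.right.overhead_bits()


T = "aebfdcbbdhbbgfaaa"
wt = LengthWT([CODE[ch] for ch in T])
print(DECODE[wt.access(14)], wt.overhead_bits())
```

On this input the bitmaps take 17 + 8 = 25 bits, below the bound n · ⌈log q⌉ = 17 · 2 = 34, because the tree is not complete when q is not a power of two.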
Theoretical Complexity Comparison

    Method                                                  Overhead space complexity in bits
    naive                                                   $n \cdot \log |C(T)|$
    Ferragina & Venturini (2007)                            $O\big(n \cdot (\log\log |C(T)| + \log\log \max(L))\big)$
    Brisaboa et al. (2012)                                  $\Upsilon + \frac{\Upsilon}{b} + o\!\left(\frac{\Upsilon}{b}\right)$
    Elmasry et al. (2012) (also by Delpratt et al. (2007))  $n \cdot \log\left(\frac{|C(T)|}{n}\right) + O(n)$
    this study                                              $n \cdot \log q + o(n \cdot \log q)$

Table: Additional space requirements of possible alternatives to uniquely decode a non-prefix-free coding scheme. $|C(T)|$ is the length of the code stream in bits, n is the total number of symbols encoded, L is the set of distinct codeword lengths, q represents the size of L, and $\Upsilon = \sum_{\forall i} \lfloor \log p_i \rfloor = \sum_{i=1}^{n-1} \left\lfloor \log \sum_{j=1}^{i} |c_j| \right\rfloor$.
Experimental Comparison

                                  Space usage (bits/symbol)    Random access (µsec./symbol)
    file      σ     |C(T)|        iAC     DAC     WT           iAC      DAC      WT
    english   105   2503981       7.42    24.50   5.04         0.060    0.016    0.064
    dblp      89    3005147       8.17    25.01   5.63         0.060    0.012    0.064
    dna       6     1471215       5.61    22.47   2.52         0.056    0.008    0.028
    protein   22    2523116       7.46    24.52   4.54         0.060    0.012    0.048
    sources   98    3113442       8.33    25.11   5.83         0.060    0.016    0.072

Table: Comparison of alternative solutions providing unique decodability with direct access capability.
The dual problem: Compact Integer Representation
$X = \langle x_0, x_1, x_2, \ldots, x_{n-1} \rangle$, where $x_i \in \{0, 1, 2, \ldots, u - 1\}$.
All $x_i > 0$ can be shown by $\lfloor \log x_i \rfloor$ bits (the leading 1 bit is left implicit).
The minimal length of the binary representation of X is $\Upsilon = \sum_{\forall i : x_i > 1} \lfloor \log x_i \rfloor + \sum_{\forall i : (x_i = 0 \,|\, x_i = 1)} 1$.
X = {127, 32, 3, 56, 201, 27, 20, 13, 9, 150, 68, 487, 5, 6, 319, 3, 1, 0}
(1)111111 (1)00000 (1)1 (1)11000 . . .
Problem 1: Not possible to decode unambiguously!
Problem 2: Random access to $x_i$ is not possible!
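The minimal representation and its total length Υ can be checked with a few lines of Python. This is a sketch that follows the definition above (values 0 and 1 cost one bit each, every other value drops its leading 1 bit); the helper name `payload` is hypothetical.

```python
def payload(x):
    """Binary digits of x without the leading 1; 0 and 1 keep their single bit."""
    digits = format(x, "b")
    return digits if x < 2 else digits[1:]


X = [127, 32, 3, 56, 201, 27, 20, 13, 9, 150, 68, 487, 5, 6, 319, 3, 1, 0]
stream = "".join(payload(x) for x in X)
upsilon = len(stream)
print(stream[:17], upsilon)

# Problem 1 in miniature: different values share the same payload,
# so the concatenated stream cannot be split back without extra information.
print(payload(3) == payload(1))
```

For this X the stream has 74 bits, and both 3 and 1 map to the single bit `1`, which illustrates why boundary (length) information must be stored on the side.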
Variable-length Codes for Compact Integer Representation
Universal codes
    Use as few bits as possible to encode an (unbounded) integer.
    No use of a priori probabilities (not Huffman coding).
    Self-delimiting: codeword boundaries can be retrieved from the encoded sequence.
Aim
Among many alternatives, we will visit the Elias and Golomb (Rice) schemes
    to improve compression
    to provide direct access
via wavelet trees and R/S dictionaries.
Elias–γ and Elias–δ Coding [Elias'74, J. of ACM]
Elias–γ
Encode the minimal length $\lfloor \log x \rfloor$ in unary, then the actual value in binary.
    Elias–γ(201) = 00000001 (unary) 1001001 (binary), 15 bits
Requires $2 \cdot \lfloor \log x \rfloor + 1$ bits per integer $x > 0$.
Elias–δ
Encode the unary part ($\lfloor \log x \rfloor + 1$) of Elias–γ(x) again by Elias–γ.
    Elias–δ(201) = 0001000 (Elias–γ of the unary part) 1001001 (binary), 14 bits
Requires $2 \cdot \lfloor \log(\lfloor \log x \rfloor + 1) \rfloor + 1 + \lfloor \log x \rfloor$ bits per integer $x > 0$.
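Both codes are short enough to write out in Python. The sketch below produces bit strings rather than packed bits, and the function names are illustrative; it follows the standard definitions, in which the terminating 1 of the unary part doubles as the leading 1 of the binary value.

```python
def elias_gamma(x):
    """floor(log2 x) zeros, then x in binary (x >= 1)."""
    if x < 1:
        raise ValueError("Elias codes are defined for x >= 1")
    digits = format(x, "b")
    return "0" * (len(digits) - 1) + digits


def elias_delta(x):
    """Elias-gamma of (floor(log2 x) + 1), then x in binary without its leading 1."""
    if x < 1:
        raise ValueError("Elias codes are defined for x >= 1")
    digits = format(x, "b")
    return elias_gamma(len(digits)) + digits[1:]


def decode_gamma(bits, pos=0):
    """Decode one gamma codeword starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


print(elias_gamma(201), len(elias_gamma(201)))
print(elias_delta(201), len(elias_delta(201)))
```

Since the codewords are self-delimiting, a concatenated stream can be decoded sequentially with `decode_gamma`, but reaching the ith value still requires scanning the i − 1 codewords before it.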
Incorporating WT into Elias
Assume $L = \{\ell_1, \ell_2, \ldots, \ell_q\}$ is the set of unique unary values observed on X.
Build a wavelet tree over X by splitting the values according to L.
Leaves encode the minimal binary representations of the corresponding values.
Each leaf is dedicated to a unique code length. Hence, accessing a value in a leaf node takes O(1) time.
Reaching a leaf node takes at most $\lceil \log q \rceil$ steps.
EliasW(avelet) Coding
access($x_{10}$): (1)0010110, 3 rank queries + 4 bitmap accesses
EliasW Coding Complexity Results
Lemma
The EliasW coding of $X = \langle x_0, x_1, \ldots, x_{n-1} \rangle$, where $0 \le x_i < u$ and $L = \{\ell_1, \ell_2, \ldots, \ell_q\}$ is the set of distinct $\lfloor \log x_i \rfloor$ values observed in X, occupies at most $n \cdot \lceil \log q \rceil + \Upsilon$ bits of space, plus a few bits to encode the value n and the set L for correct decoding.
Remark
Assuming $L = \{0, 1, 2, \ldots, \lfloor \log(u - 1) \rfloor\}$, the space requirement is at most $n \cdot \lceil \log \lceil \log u \rceil \rceil + \Upsilon$ bits.
Lemma
In EliasW coding, any element of sequence X can be accessed in $O(\lceil \log q \rceil)$ time by reserving an additional $o(n \cdot \lceil \log q \rceil)$ bits.
EliasW against Elias–γ & Elias–δ
Space usage comparison of EliasW against Elias–γ and Elias–δ. Notice that 1 < ε < 2.
Elias–γ Elias–δ
EliasW
EliasW with direct access
(n · ⌈log⌈log u⌉⌉ + Υ)
(n · ⌈log⌈log u⌉⌉ + o(n · ⌈log⌈log u⌉⌉) + Υ)
uses less space when
uses less space when
log u