
Compressed Data Structures with Applications on Compact Integer Representations

M. Oğuzhan Külekci

TÜBİTAK–BİLGEM – UEKAE National Research Institute of Electronics & Cryptology, Turkey

Summer School on Selected Topics in Massive Data Management July 1-4, 2013

M. O. Külekci (UEKAE)

Compressed Data Structures - Applications

Summer School-MDM


Outline

1 Introduction
2 Rank/Select Dictionaries
3 Wavelet Trees
4 WT and R/S Applications on Coding
    Non-Prefix-Free Codes with Wavelet Trees
    Revisiting Variable-Length Codes with WT and R/S Dictionaries
    Constant-time Prefix Summation
5 Introduction to Text Indexing
6 Basic Structures in Text Indexing
    Subword Trie
    Suffix Tree
    Suffix Array
    Sparsification & Sampling on Suffix Arrays
7 Burrows-Wheeler Transform and the FM-index
    Reconstructing text from BWT
    Backwards Searching Paradigm

How to Sail in Data Tsunami?

Do our best to
    perform minimal I/O
    fit data in memory, as close as possible to the CPU

Compression: Picking up the buoy

Keep data compressed in storage, not to save money, but to increase data transfer speed!

Is “fetch compressed data, then decode” faster than “fetch uncompressed data”? Yes!

Find ways to work directly on compressed data: Compressed Data Structures

Compressed Data Structures

The Aim
    Represent the data structure in as little space as possible, without loss of functionality.
    G. Jacobson, Succinct Static Data Structures, PhD thesis, Carnegie Mellon University, 1989.
    D. Clark, Compact Pat Trees, PhD thesis, University of Waterloo, Canada, 1996.

Compressed arrays, lists, trees, ... A very active area in the last decade, especially in data management and information retrieval. See the keynote speech delivered by Jeff Vitter at CIKM’12, “Compressed Data Structures with Relevance”.

We will be using rank/select dictionaries and wavelet trees throughout this talk.




Rank/Select Dictionaries (RSDic)

B = b1 b2 b3 ... bn

[Figure: a sample bitmap B on which rank1(6) = 3 and select1(4) = 8 are marked.]

rank1(i) (rank0(i)): the number of 1s (0s) occurring up to position i.
select1(i) (select0(i)): the position of the ith 1 (0) bit.

Both can be achieved in O(1) time by using space
    n + o(n) bits: Jacobson’89 (thesis), Clark & Munro (SODA’96)
    nH0(B) + o(n) bits: Raman, Raman, Rao (SODA’02)

H0(B) is the 0th-order entropy of bitmap B. Many alternatives are still appearing; see the SPIRE’11 and SPIRE’12 tutorials on “Space-efficient Data Structures” for details.
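To pin down the semantics of the two queries (1-indexed positions, counts inclusive of position i), here is a minimal linear-time sketch in Python. It is only a reference definition, not the O(1) structure cited above. The bitmap `B` is a hypothetical example chosen so that it agrees with the two values quoted on the slide; the original figure's exact bit order is not recoverable.

```python
def rank1(bits, i):
    """Number of 1 bits among positions 1..i (inclusive, 1-indexed)."""
    return sum(bits[:i])


def select1(bits, i):
    """Position (1-indexed) of the i-th 1 bit, or None if there is none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            seen += 1
            if seen == i:
                return pos
    return None


# Hypothetical bitmap (an assumption, not the slide's figure).
B = [0, 1, 1, 0, 0, 1, 0, 1]
print(rank1(B, 6), select1(B, 4))
```

The two queries are inverses in the sense that rank1(select1(i)) = i whenever the ith 1 exists.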

Computing Rank in O(1)–time

Step 1: Split the n bits into superblocks of $\log^2 n$ bits each.

[Figure: the n-bit array divided into superblocks of $\log^2 n$ bits, with counters $S[1], S[2], \ldots, S[\frac{n}{\log^2 n}]$.]

S[i]: number of 1 bits observed from the beginning of the n-bit array to the beginning of the ith superblock.

There are $\frac{n}{\log^2 n}$ superblocks, and each S[i] requires $\log n$ bits.

The total space required for the S array is $\frac{n}{\log^2 n} \cdot \log n = \frac{n}{\log n}$ bits.

Computing Rank in O(1)–time

Step 2: Split the n bits into blocks of $\frac{\log n}{2}$ bits each.

[Figure: the n-bit array divided into blocks of $\frac{\log n}{2}$ bits, with counters $B[1], B[2], \ldots, B[\frac{2n}{\log n}]$.]

B[i]: number of 1 bits observed from the beginning of the corresponding superblock to the beginning of the ith block.

There are $\frac{2n}{\log n}$ blocks, and each B[i] requires $\log(\log n \cdot \log n) = 2 \log\log n$ bits.

The total space required for the B array is $\frac{2n}{\log n} \cdot 2 \log\log n = \frac{4n \log\log n}{\log n}$ bits.

Computing Rank in O(1)–time

Step 3: Build a table T for all possible bit patterns of length $\frac{\log n}{2}$.

Example for 3-bit blocks (patterns r across, bit positions c down):

c \ r   000  001  010  011  100  101  110  111
1        0    0    0    0    1    1    1    1
2        0    0    1    1    1    1    2    2
3        0    1    1    2    1    2    2    3

The table has $2^{\frac{\log n}{2}}$ rows and $\frac{\log n}{2}$ columns, where each cell(r,c) contains the number of 1s observed in the binary representation of row id r up to its cth bit.

The total space required for the table is $2^{\frac{\log n}{2}} \cdot \frac{\log n}{2} \cdot \log\frac{\log n}{2} \approx \frac{\sqrt{n} \, \log n \, \log\log n}{2}$ bits.

Computing Rank in O(1)–time: Overall picture

[Figure: the n-bit array divided into superblocks of $\log^2 n$ bits, each divided further into blocks of $\frac{\log n}{2}$ bits.]

Structure      Space (bits)
Raw bitmap     $n$
Superblocks    $\frac{n}{\log n}$
Blocks         $\frac{4n \log\log n}{\log n}$
Table          $\frac{\sqrt{n} \, \log n \, \log\log n}{2}$

rank(i) = S[j] + B[k] + T[r][c], with the properly computed j, k, r, c corresponding to i.

Space: $n + O\left(\frac{n \log\log n}{\log n}\right) = n + o(n)$ bits.
Time: O(1)
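The three lookups can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the function names `build_rank` and `rank` are hypothetical, block sizes are passed as parameters instead of being derived from log n, arrays are 0-based (so the slide's S[3] and B[9] are `S[2]` and `Bk[8]` here), and the block pattern r is rebuilt from a Python list rather than read as a machine word.

```python
def build_rank(bits, block_len, blocks_per_super):
    """Build the superblock array S, block array Bk and lookup table T."""
    super_len = block_len * blocks_per_super
    S, Bk = [], []
    total = in_super = 0
    for start in range(0, len(bits), block_len):
        if start % super_len == 0:
            S.append(total)        # 1s before this superblock
            in_super = 0
        Bk.append(in_super)        # 1s from superblock start to this block
        ones = sum(bits[start:start + block_len])
        total += ones
        in_super += ones
    # T[r][c]: number of 1s among the first c bits of block pattern r.
    T = []
    for r in range(2 ** block_len):
        pattern = [(r >> (block_len - 1 - k)) & 1 for k in range(block_len)]
        T.append([sum(pattern[:c]) for c in range(block_len + 1)])
    return S, Bk, T


def rank(bits, i, block_len, blocks_per_super, S, Bk, T):
    """Number of 1s in positions 1..i, answered as S[j] + B[k] + T[r][c]."""
    if i == 0:
        return 0
    k = (i - 1) // block_len               # 0-based block holding position i
    j = k // blocks_per_super              # 0-based superblock of that block
    c = i - k * block_len                  # how many bits of block k count
    chunk = bits[k * block_len:(k + 1) * block_len]
    chunk = chunk + [0] * (block_len - len(chunk))   # pad a short last block
    r = int("".join(map(str, chunk)), 2)   # block pattern as a row id
    return S[j] + Bk[k] + T[r][c]


# Sample bitmap: 3-bit blocks, 9-bit superblocks.
bits = [int(ch) for ch in "010011000110101010100111101110"]
S, Bk, T = build_rank(bits, 3, 3)
print(rank(bits, 27, 3, 3, S, Bk, T))
```

With 3-bit blocks and 9-bit superblocks, the call for position 27 reads S[2] = 8, Bk[8] = 4 and T[0b101][3] = 2.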

Rank Example

Superblock length: 9 bits. Block length: 3 bits.

Bitmap   010 011 000 | 110 101 010 | 100 111 101 | 110 ...
S        0           | 3           | 8           | ...
B        0   1   3   | 0   2   4   | 0   1   4   | 0   ...

T (patterns r across, bit positions c down):

c \ r   000  001  010  011  100  101  110  111
1        0    0    0    0    1    1    1    1
2        0    0    1    1    1    1    2    2
3        0    1    1    2    1    2    2    3

rank(27) = S[3] + B[9] + T[101][3] = 8 + 4 + 2 = 14.

Rank in compressed space

We can even improve the n + o(n) space to nH0 + o(n) by keeping the raw bitmap in zero-order compressed form. This needs some further auxiliary data structures, which do not change the o(n) complexity.

Block    010  011  000  110  101  010  100  111  101  110  ...
ParID     1    2    0    2    2    1    1    3    2    2   ...
PermID    1    0    0    2    1    1    2    0    1   ...

ParID   Block patterns (PermID)
0       000 (0)
1       001 (0), 010 (1), 100 (2)
2       011 (0), 101 (1), 110 (2)
3       111 (0)

Zero-order Compressed Bitmap

The zero-order entropy of an n-bit bitmap with m 1 bits is $\lceil \log \binom{n}{m} \rceil$.

Split the n-bit bitmap into blocks of $\frac{\log n}{2}$ bits, and represent each block via a ⟨ParID, PermID⟩ tuple.

The ParID takes space $\frac{2n}{\log n} \cdot \log\left(\frac{\log n}{2}\right) = O\left(\frac{n \log\log n}{\log n}\right)$ bits.

The PermID takes space

$\sum_{i=1}^{\frac{2n}{\log n}} \left\lceil \log \binom{\frac{\log n}{2}}{ParID_i} \right\rceil \le \log \prod_{i=1}^{\frac{2n}{\log n}} \binom{\frac{\log n}{2}}{ParID_i} + O\left(\frac{n}{\log n}\right) \le \log \binom{n}{m} + O\left(\frac{n}{\log n}\right)$

ParID + PermID takes $n \cdot H_0 + O\left(\frac{n \log\log n}{\log n}\right) = n \cdot H_0 + o(n)$ bits.

Auxiliary data structures for rank queries do not further increase the o(n) complexity.
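A small Python sketch of the ⟨ParID, PermID⟩ mapping may help. It assumes that ParID is the number of 1s in the block and that PermID is the lexicographic index of the block among all patterns with that many 1s, which matches the 3-bit table shown earlier. The function names `encode_block` and `decode_block` are hypothetical, and a practical implementation would typically use precomputed tables rather than computing binomials on the fly.

```python
from math import comb


def encode_block(block):
    """Map a bit string to (ParID, PermID)."""
    k = block.count("1")
    par, perm = k, 0
    for i, bit in enumerate(block):
        if bit == "1":
            remaining = len(block) - i - 1
            # Skip every pattern that has a 0 here and k ones afterwards.
            perm += comb(remaining, k)
            k -= 1
    return par, perm


def decode_block(par, perm, length):
    """Inverse of encode_block for blocks of the given length."""
    k, out = par, []
    for i in range(length):
        remaining = length - i - 1
        zero_first = comb(remaining, k)   # patterns with a 0 at this position
        if k > 0 and perm >= zero_first:
            out.append("1")
            perm -= zero_first
            k -= 1
        else:
            out.append("0")
    return "".join(out)


blocks = ["010", "011", "000", "110", "101", "010", "100", "111", "101", "110"]
pairs = [encode_block(b) for b in blocks]
print(pairs)
```

A block with ParID p needs only ⌈log C(b, p)⌉ bits for its PermID, where b is the block length, which is where the entropy bound above comes from.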


Rank/Select over Large Alphabets

We have seen the R/S structure on binary arrays. What if we want to support the same operations on larger alphabets? E.g., how many times does the letter e occur in the first 7 symbols of wavelettree?

Wavelet trees were introduced by Grossi, Gupta, Vitter (SODA’03). We can answer in O(log σ) time using n log σ + o(n log σ) bits of space (which can be further improved to nH0 + o(n log σ)).

Wavelet Trees [Grossi, Gupta, Vitter SODA’03]

Root         0 ← {a,e,l}   1 ← {r,t,v,w}    wavelettree   10100011100
  Left       0 ← {a}       1 ← {e,l}        aeleee        011111
    Leaf a (0)
    Right    0 ← {e}       1 ← {l}          eleee         01000
      Leaves e (0), l (1)
  Right      0 ← {r,t}     1 ← {v,w}        wvttr         11000
    Left     0 ← {r}       1 ← {t}          ttr           110
      Leaves r (0), t (1)
    Right    0 ← {v}       1 ← {w}          wv            10
      Leaves v (0), w (1)

rank(e,7) = ?
    Root: e maps to 0, and rank0(7) = 4 on 10100011100.
    Node {a,e,l}: e maps to 1, and rank1(4) = 3 on 011111.
    Node {e,l}: e maps to 0, and rank0(3) = 2 on 01000.
    Hence rank(e,7) = 2.

access(7) = ? requires 2 rank queries and 3 bitmap accesses.
    Root: bit 7 is 1, and rank1(7) = 3.
    Node {r,t,v,w}: bit 3 is 0, and rank0(3) = 1.
    Node {r,t}: bit 1 is 1, hence access(7) = t.
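The two traversals can be sketched in Python as below. This is a minimal pointer-based illustration, not the structure from the cited paper. It assumes the alphabet is sorted and halved with the smaller half on the 0 side (which reproduces the partition shown for wavelettree), and it counts bits naively where a real implementation would use an O(1) rank dictionary per bitmap. The class name `WaveletTree` and its methods are hypothetical, and `rank` assumes the queried symbol occurs in the alphabet.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            self.zero_side = set(self.alphabet[:mid])
            self.bits = [0 if s in self.zero_side else 1 for s in seq]
            self.left = WaveletTree(
                [s for s in seq if s in self.zero_side], self.alphabet[:mid])
            self.right = WaveletTree(
                [s for s in seq if s not in self.zero_side], self.alphabet[mid:])

    def _rank_bit(self, b, i):
        # Naive O(i) count; stands in for an O(1) binary rank structure.
        return sum(1 for x in self.bits[:i] if x == b)

    def rank(self, symbol, i):
        """Occurrences of symbol among the first i positions."""
        node = self
        while len(node.alphabet) > 1:
            b = 0 if symbol in node.zero_side else 1
            i = node._rank_bit(b, i)
            node = node.left if b == 0 else node.right
        return i

    def access(self, i):
        """Symbol at position i (1-indexed)."""
        node = self
        while len(node.alphabet) > 1:
            b = node.bits[i - 1]
            i = node._rank_bit(b, i)
            node = node.left if b == 0 else node.right
        return node.alphabet[0]


wt = WaveletTree("wavelettree")
print(wt.rank("e", 7), wt.access(7))
```

Each step descends one level, so both queries cost one binary rank per level of the tree.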

Improving access in WT

Wavelet Trees for All, Navarro, CPM’12 (survey)

WT with Doubly-logarithmic Access
    O(log log σ) access is possible by splitting the sequence according to the log values of the symbols instead of the pure alphabet.
    We will work on integer sequences, without loss of generality.
    Let X = x1 x2 ... xn, xi ∈ {0, 1, 2, ..., U − 1}, and L = set of unique values in {⌊log x1⌋, ⌊log x2⌋, ..., ⌊log xn⌋}.

Doubly-Logarithmic Access WT

symbol     a     e     l      r      t       v       w
i          0(0)  1(1)  2(10)  3(11)  4(100)  5(101)  6(110)
⌊log i⌋    0     0     1      1      2       2       2

L = {0, 1, 2}

Root       0 ← {a,e,l,r}   1 ← {t,v,w}    wavelettree   10100011000
  Left     0 ← {a,e}       1 ← {l,r}      aeleree       0010100
    Leaf (1-bit codes)   0 1 1 1 1          a e e e e
    Leaf (2-bit codes)   10 11              l r
  Leaf (3-bit codes)     110 101 100 100    w v t t

access(7) = ? requires 1 rank query and 2 bitmap accesses.


Contribution
    We improved access to O(log log σ) time.
    A WT supporting doubly-logarithmic rank/select is still open.

Opportunity
    Leaf nodes are not dedicated to a unique symbol, but include all symbols having the same code length!
    Codewords are not required to be prefix-free!


Unique Decodability and Prefix-free Codes

Prefix-free codes
    No codeword is a prefix of another, e.g., Huffman codes.
    Code lengths must satisfy the Kraft-McMillan inequality.
    Hence, uniquely decodable.
    No direct access to the ith codeword.

[Figure: a sample Huffman tree of a 26-symbol alphabet. Code lengths vary between 3 and 9 bits.]

Non-Prefix-free codes

    No restriction on codewords except being unique.
    Code lengths are minimal.
    NOT uniquely decodable.
    No direct access to the ith codeword.

For a 26-symbol alphabet, simply assign codewords of length 1 to 5 bits as {0, 1, 00, 01, ..., 11001} according to frequencies.

The Problem
    Uniquely decodable and directly accessible non-prefix-free codes with reduced overhead.

Alternative Solutions

Notation
    T = t1 t2 ... tn, where ti ∈ Σ = {ε1, ε2, ..., εσ}.
    The coding scheme C : Σ → A.
    A = {α1, α2, ..., ασ} is the non-prefix-free variable-length codeword set such that C(εi) = αi.
    Encoded sequence C(T) = C(t1)C(t2)...C(tn) = c1 c2 ... cn.
    |C(T)| = |c1| + |c2| + ... + |cn|.

Naive solution
    Store the address of each codeword → n log |C(T)| bits overhead.
    Supports both unique decodability and O(1) access.

Improved Naive Solutions

Dense Sampling [Ferragina & Venturini’07]
    Achieves O(1)-time access.
    Keep addresses in a double layer: first split the code sequence into blocks, and then represent inner addresses relatively.
    O(n · (log log |C(T)| + log log max(|α1|, |α2|, ..., |ασ|))) bits overhead.

Benefit from compact integer representations
    The beginning (or ending) bit positions of the codewords form the set P = {p1, p2, ..., pn}.
    Store P in a compact way, e.g., via DACs (Brisaboa et al.’13).
    Overhead is $\Upsilon + \frac{\Upsilon}{b} + o\left(\frac{\Upsilon}{b}\right)$, where $\Upsilon = \sum_{\forall i} \lfloor \log p_i \rfloor = \sum_{i=1}^{n-1} \left\lfloor \log \sum_{j=1}^{j \le i} |c_j| \right\rfloor$, and b is a parameter.
    Supports O(1) access.

Improved Naive Solutions

Gap encoding of codeword boundaries
    Instead of codeword addresses, store codeword lengths.
    A prefix sum is required to access a random codeword; use improved-AC coding (Elmasry et al.’12).
    Overhead: $n \cdot \log\left(\frac{|C(T)|}{n}\right) + O(n)$ bits
    Access time: O(log log(n + |C(T)|))

How about using a WT with doubly-logarithmic access?

Uniquely Decodable & Directly Accessible Non-Prefix-Free Codes with Wavelet Trees [Külekci, ISIT’13]

Σ = {a, b, c, d, e, f, g, h}, A = {0, 1, 00, 01, 10, 11, 000, 001}, C : Σ → A

T      a  e   b  f   d   c   b  b  d   h    b  b  g    f   a  a  a
C(T)   0  10  1  11  01  00  1  1  01  001  1  1  000  11  0  0  0

Root     L1 = {1,2,3}   B1 = 01011100110011000
  Leaf   L2 = {1}       0 1 1 1 1 1 0 0 0       a b b b b b a a a
  Node   L3 = {2,3}     B3 = 00000110
    Leaf L4 = {2}       10 11 01 00 01 11       e f d c d f
    Leaf L5 = {3}       001 000                 h g

Space complexity

Theorem
    Any non-prefix-free coding scheme can be uniquely decoded by creating a wavelet tree over the encoded sequence. Such a wavelet tree occupies at most n · ⌈log q⌉ bits, where q is the number of distinct codeword lengths observed in the encoded sequence.

Proof.
    The depth of the wavelet tree is ⌈log q⌉.
    At each level there are at most n bits, hence at most n · ⌈log q⌉ bits in total.
    The topology of the WT can be extracted once |T| and the set L are known, which requires negligible space.

Direct Access

[Figure: the wavelet tree over Σ = {a, b, c, d, e, f, g, h}, A = {0, 1, 00, 01, 10, 11, 000, 001}, with the path followed by access(14) highlighted.]

access(14) requires 2 rank queries and 3 bitmap accesses.
    Root, B1 = 01011100110011000: bit 14 is 1, and rank1(14) = 8.
    Node L3 = {2,3}, B3 = 00000110: bit 8 is 0, and rank0(8) = 6.
    Leaf L4 = {2}, 10 11 01 00 01 11: the 6th codeword is 11, i.e., f.

Direct Access

Lemma
    By reserving an additional o(n · ⌈log q⌉) bits in the proposed WT construction, any codeword in a sequence encoded with a variable-length coding scheme (including non-prefix-free codes) can be reached directly in at most ⌈log q⌉ steps.

Proof.
    The depth of the WT is at most ⌈log q⌉.
    At each visited node, a rank query is run.
    O(1)-time rank over n bits requires o(n) bits overhead.
    Thus, the total overhead space is n · ⌈log q⌉ + o(n · ⌈log q⌉).
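The construction and the access path can be sketched in Python as follows. This is an illustrative sketch rather than the ISIT’13 implementation: the class name `LengthWT` is hypothetical, the sorted set of lengths is assumed to be halved with the shorter lengths on the 0 side (which reproduces B1 and B3 of the example), and rank is counted naively instead of through an o(n)-bit rank dictionary.

```python
class LengthWT:
    """Wavelet tree over codeword lengths; each leaf stores fixed-width codewords."""

    def __init__(self, codewords, lengths=None):
        self.lengths = (sorted({len(c) for c in codewords})
                        if lengths is None else lengths)
        if len(self.lengths) == 1:
            self.payload = "".join(codewords)      # all codewords share one width
        else:
            mid = len(self.lengths) // 2
            short = set(self.lengths[:mid])
            self.bits = [0 if len(c) in short else 1 for c in codewords]
            self.left = LengthWT(
                [c for c in codewords if len(c) in short], self.lengths[:mid])
            self.right = LengthWT(
                [c for c in codewords if len(c) not in short], self.lengths[mid:])

    def access(self, i):
        """Return the i-th codeword (1-indexed) of the encoded sequence."""
        node = self
        while len(node.lengths) > 1:
            b = node.bits[i - 1]
            i = sum(1 for x in node.bits[:i] if x == b)   # naive rank_b(i)
            node = node.right if b else node.left
        w = node.lengths[0]
        return node.payload[(i - 1) * w:i * w]


code = dict(zip("abcdefgh", ["0", "1", "00", "01", "10", "11", "000", "001"]))
T = "aebfdcbbdhbbgfaaa"
codewords = [code[t] for t in T]
wt = LengthWT(codewords)
print(wt.access(14))
```

Because the codewords are unique, the returned codeword maps back to a single symbol through the inverse of `code`, so the whole text can be decoded position by position.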

Theoretical Complexity Comparison

Method                                                    Overhead space complexity in bits
naive                                                     $n \cdot \log |C(T)|$
Ferragina & Venturini (2007)                              $O(n \cdot (\log\log |C(T)| + \log\log \max(L)))$
Brisaboa et al. (2012)                                    $\Upsilon + \frac{\Upsilon}{b} + o\left(\frac{\Upsilon}{b}\right)$
Elmasry et al. (2012) (also by Delpratt et al. (2007))    $n \cdot \log\left(\frac{|C(T)|}{n}\right) + O(n)$
this study                                                $n \cdot \log q + o(n \cdot \log q)$

Table: Additional space requirements of possible alternatives to uniquely decode a non-prefix-free coding scheme. |C(T)| is the length of the code stream in bits, n is the total number of symbols encoded, L is the set of distinct codeword lengths, q represents the size of L, and $\Upsilon = \sum_{\forall i} \lfloor \log p_i \rfloor = \sum_{i=1}^{n-1} \left\lfloor \log \sum_{j=1}^{j \le i} |c_j| \right\rfloor$.

Experimental Comparison

                             Space usage (bits/symbol)    Random access (µsec/symbol)
file      σ     |C(T)|       iAC     DAC     WT           iAC     DAC     WT
english   105   2503981      7.42    24.50   5.04         0.060   0.016   0.064
dblp      89    3005147      8.17    25.01   5.63         0.060   0.012   0.064
dna       6     1471215      5.61    22.47   2.52         0.056   0.008   0.028
protein   22    2523116      7.46    24.52   4.54         0.060   0.012   0.048
sources   98    3113442      8.33    25.11   5.83         0.060   0.016   0.072

Table: Comparison of alternative solutions providing unique decodability with direct access capability.


The dual problem: Compact Integer Representation

X = ⟨x0, x1, x2, ..., xn−1⟩, where xi ∈ {0, 1, 2, ..., u − 1}.

Dropping the leading 1 bit, any xi > 1 can be represented in ⌊log xi⌋ bits; 0 and 1 take one bit each.

The minimal length of the binary representation of X is $\Upsilon = \sum_{\forall i: x_i > 1} \lfloor \log x_i \rfloor + \sum_{\forall i: x_i \in \{0,1\}} 1$.

X = {127, 32, 3, 56, 201, 27, 20, 13, 9, 150, 68, 487, 5, 6, 319, 3, 1, 0}
(1)111111 (1)00000 (1)1 (1)11000 ...

Problem 1: Not possible to decode unambiguously!
Problem 2: Random access to xi is not possible!
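As a quick check of the definition of Υ on the sample sequence, here is a small Python sketch. The helper name `minimal_bits` is hypothetical; it drops the implicit leading 1 for values above 1 and keeps a single bit for 0 and 1, as assumed in the formula above.

```python
def minimal_bits(x):
    """Binary form of x without its leading 1 bit; 0 and 1 keep one bit."""
    return bin(x)[3:] if x > 1 else str(x)


X = [127, 32, 3, 56, 201, 27, 20, 13, 9, 150, 68, 487, 5, 6, 319, 3, 1, 0]
pieces = [minimal_bits(x) for x in X]
upsilon = sum(len(p) for p in pieces)   # minimal total length in bits
stream = "".join(pieces)                # concatenation without any delimiters
print(pieces[:4], upsilon)
```

The values 3 and 1 both map to the single bit 1, which is a concrete instance of Problem 1: the concatenated stream alone cannot be split back into codewords.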

Variable-length Codes for Compact Integer Representation

Universal codes
    Use as few bits as possible to encode an (unbounded) integer.
    No use of a priori probabilities (not Huffman coding).
    Self-delimiting: codeword boundaries can be retrieved from the encoded sequence.

Aim
    Among many alternatives, we will visit the Elias and Golomb (Rice) schemes
        to improve compression
        to provide direct access
    via wavelet trees and R/S dictionaries.

Elias–γ and Elias–δ Coding [Elias’74, J. of ACM]

Elias–γ
    Encode the minimal length ⌊log x⌋ in unary, then the actual value in binary.
    Elias–γ(201) = 00000001 (unary) 1001001 (binary), 15 bits.
    Requires 2 · ⌊log x⌋ + 1 bits per integer x > 0.

Elias–δ
    Encode the unary part (⌊log x⌋ + 1) of Elias–γ(x) again by Elias–γ.
    Elias–δ(201) = 0001000 (Elias–γ of the unary part) 1001001 (binary), 14 bits.
    Requires 2 · ⌊log(⌊log x⌋ + 1)⌋ + 1 + ⌊log x⌋ bits per integer x > 0.
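Both encoders fit in a few lines of Python. The sketch below returns bit strings for readability (a real encoder would pack bits), and the function names are hypothetical. It follows the usual formulation in which the unary run of ⌊log x⌋ zeros is followed by the full binary form of x, which yields the same bit strings as the unary/binary split shown above.

```python
def elias_gamma(x):
    """Elias-gamma code of x > 0 as a bit string."""
    assert x > 0
    binary = bin(x)[2:]                       # floor(log x) + 1 bits
    return "0" * (len(binary) - 1) + binary   # zeros, then the value itself


def elias_delta(x):
    """Elias-delta code of x > 0 as a bit string."""
    assert x > 0
    binary = bin(x)[2:]
    # Gamma-code the bit length, then append the value without its leading 1.
    return elias_gamma(len(binary)) + binary[1:]


print(elias_gamma(201), elias_delta(201))
```

For 201 the δ code saves one bit over the γ code; the saving grows with the magnitude of x since the length field shrinks from ⌊log x⌋ + 1 unary bits to a γ-coded number.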

Incorporating WT into Elias

    Assume L = {ℓ1, ℓ2, ..., ℓq} is the set of unique unary values observed in X.
    Build a wavelet tree over X by splitting the values according to L.
    Leaves encode the minimal binary representations of the corresponding values.
    Each leaf is dedicated to a unique code length. Hence, accessing a value in a leaf node takes O(1) time.
    Reaching a leaf node takes at most ⌈log q⌉ steps.

EliasW(avelet) Coding

[Figure: the EliasW wavelet tree built over the sample sequence X.]

access(x10): (1)0010110, requires 3 rank queries + 4 bitmap accesses.

EliasW Coding Complexity Results

Lemma
    The EliasW coding of X = ⟨x0, x1, ..., xn−1⟩, where 0 ≤ xi < u and L = {ℓ1, ℓ2, ..., ℓq} is the set of distinct ⌊log xi⌋ values observed in X, occupies at most n · ⌈log q⌉ + Υ bits of space, plus a few bits to encode the value n and the set L for correct decoding.

Remark
    Assuming L = {0, 1, 2, ..., ⌊log(u − 1)⌋}, the space requirement is at most n · ⌈log⌈log u⌉⌉ + Υ bits.

Lemma
    In EliasW coding, any element of the sequence X can be accessed in O(⌈log q⌉) time by reserving an additional o(n · ⌈log q⌉) bits.

EliasW against Elias–γ & Elias–δ

Space usage comparison of EliasW against Elias–γ and Elias–δ. Notice that 1 < ε < 2.

                              Space (bits)                                   vs. Elias–γ                  vs. Elias–δ
EliasW                        n · ⌈log⌈log u⌉⌉ + Υ                           uses less space when ...     uses less space when ...
EliasW with direct access     n · ⌈log⌈log u⌉⌉ + o(n · ⌈log⌈log u⌉⌉) + Υ     ...                          ...
