Lossless Data Compression: Modern Scope and Applications
Dr. Kai Sandfort
[email protected]
latest edit: June 29, 2017
“There are 10 types of people in the world: those who understand binary code and those who don’t.”
Scope of the talk • big picture of topic • current methods & research efforts • modern applications Not covered • method details • patent situation • lossy compression
Introduction Techniques Entropy Coding Dictionary Coding Other Techniques
Methods Lempel-Ziv Derivatives The Famous Classic Recent Methods Options for Improvement
Scope & Applications What will come? A prediction. Conclusion
Introduction
A Bit of History 80s to mid-90s • storage space and bandwidth are limited and expensive • little data transfer, exchange by floppy disks • compress to save storage space (HDD, floppy disk) mid-90s to mid-00s • plenty of storage space is available at an ever lower price (HDD, CD-ROM, DVD) • rise of the World Wide Web • rise of multimedia (image/sound/video) • ever increasing data transfer • compress to handle new types and volumes of data
A Bit of History - cont. mid-00s and later • storage space: you name it (HDD with PMR, DVD, USB flash, Blu-ray Disc, SSD, . . . ) • burst of data-producing cameras, microphones, sensors • massive and rapidly increasing data transfer • compress to lower transmission costs • compress to handle increasing resolution, fidelity, dynamic range • compression for cold archiving and new applications
Basics • Non-random data contains redundant information; truly random data contains none. • Compression is about identifying and exploiting patterns or structure. • No algorithm can compress even 1 % of all data of a given length, even by 1 byte. • The smaller the amount of data, the harder it is to compress.
General Scheme
1. Modeling: collect data/statistics and build a model from past data.
2. “Past-to-Present” Mapping:
(a) Predict: compute a probability distribution for the present data, based on the model.
(b) Match & Reuse: match the present data with the past data in the model and reference the latter.
3. Coding: compute and emit codes for the actual and reference data, respectively.
The more powerful the Modeling and the Past-to-Present Mapping, the better the compression.
Techniques
Entropy Coding • set up probability model for following data (mostly next symbol) • compute variable-length codes, shorter ones for likely data • low speed, high compression strength • recommended for poorly structured data
The simplest theory that explains the past is the best predictor of the future. – Occam’s Razor
Entropy Coding
Huffman coding (1952) computes optimal-length prefix-free codes for symbols, according to their probabilities → integer # bits / symbol
Arithmetic coding (1979) encodes a string of symbols as a single rational number in [0, 1], via nested intervals determined by the probability distribution → fractional # bits / symbol
Asymmetric Numeral Systems (ANS) coding (2014) encodes a string of symbols as a single natural number; combines the strength of Arithmetic coding with the speed of Huffman coding → fractional # bits / symbol
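To make the nested-interval idea behind Arithmetic coding concrete, here is a minimal sketch using exact rational arithmetic via Python's fractions module. The probability table is an illustrative assumption, not taken from the slides:

```python
from fractions import Fraction

def arith_encode(text, probs):
    """Narrow [low, high) by each symbol's probability slice."""
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        cum = Fraction(0)
        for sym, p in probs:
            if sym == ch:
                high = low + (cum + p) * width
                low = low + cum * width
                break
            cum += p
    return low, high  # any number in [low, high) encodes the string

def arith_decode(x, probs, n):
    """Recover n symbols by locating x in the probability slices."""
    out = []
    for _ in range(n):
        cum = Fraction(0)
        for sym, p in probs:
            if cum <= x < cum + p:
                out.append(sym)
                x = (x - cum) / p  # rescale and continue
                break
            cum += p
    return "".join(out)

# Hypothetical symbol probabilities for a tiny alphabet
probs = [("s", Fraction(3, 10)), ("o", Fraction(2, 10)),
         ("l", Fraction(2, 10)), ("e", Fraction(3, 10))]
lo, hi = arith_encode("loss", probs)
print(arith_decode(lo, probs, 4))  # loss
```

A real coder emits just enough bits to pin down one number in the final interval, which is how fractional bits per symbol arise.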
Example: Huffman coding Example string:
lossless compression
Alphabet: ASCII with 7 bit/symbol; # characters: 20

symbol        's'  'o'  'e'  'l'  ' '   'c'   'i'   'm'   'n'   'p'    'r'
count          6    3    2    2    1     1     1     1     1     1      1
Huffman code   10   110  001  000  1110  0111  0110  0101  0100  11111  11110
size of ASCII bit seq.: 20 · 7 bit = 140 bit
size of Huffman bit seq.: 6 · 2 + (3 + 2 + 2) · 3 + (1 + 1 + 1 + 1 + 1) · 4 + (1 + 1) · 5 bit = 63 bit + size of (symbol, code) pair table. The latter is a fixed cost, which amortizes for longer strings!
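The 63-bit figure can be checked with a short sketch. Huffman codes are not unique, but the optimal total length is, so it suffices to track code lengths while merging the two lightest nodes on a min-heap:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Build a Huffman tree; return ({symbol: code length}, frequencies)."""
    freq = Counter(text)
    # Heap entries: (count, tiebreaker, {symbol: depth so far})
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2], freq

lengths, freq = huffman_code_lengths("lossless compression")
total_bits = sum(freq[s] * l for s, l in lengths.items())
print(total_bits)  # 63, matching the slide
```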
Dictionary Coding • maintain dictionary of strings for either a buffer (“sliding window”) or the entire data • replace later occurrences by reference position and length • high speed, moderate compression strength • suitable for rule- or grammar-based data, in particular text • famous: Lempel-Ziv methods LZ77 and LZ78, many derivatives
Example: LZ77
Choice: sliding window size: 15 char.s (4 bit), max. match length: 15 char.s (4 bit)
Example string: lossless compression

address     0    0    0    1    0    4    0    9    0    0    9    8
length      0    0    0    1    0    2    0    1    0    0    3    1
delimiter  'l'  'o'  's'  'l'  'e'  ' '  'c'  'm'  'p'  'r'  'i'  'n'

size of LZ77 bit seq.: 12 · (4 + 4) + 12 · 7 bit = 180 bit
BUT: > 140 bit (ASCII bit seq.)
However, LZ77 combined with Huffman coding for the (address, length) pairs: 120 bit + fixed costs
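A toy implementation of this (address, length, delimiter) scheme, with window and match length capped at 15, might look as follows. This is an illustrative sketch, not an optimized coder:

```python
def lz77_compress(data, window=15, max_len=15):
    """Emit (offset, length, next_char) triples; offset 0 means no match."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            l = 0
            # Overlapping matches are allowed; keep one literal as delimiter
            while (l < max_len and i + l < len(data) - 1
                   and data[j + l] == data[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = i - j, l
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    s = []
    for off, length, ch in tokens:
        for _ in range(length):
            s.append(s[-off])  # copy one char at a time: overlap-safe
        s.append(ch)
    return "".join(s)

tokens = lz77_compress("lossless compression")
assert lz77_decompress(tokens) == "lossless compression"
print(len(tokens))  # 12 triples, i.e. 12 · (4 + 4) + 12 · 7 = 180 bit
```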
Other Techniques
• Run-length encoding (RLE, 1967 or earlier): replaces repeated items by (count, item) pairs; useful only for repetitive data
• Prediction by partial matching (PPM family, mid-1980s onwards): maintains probability models based on statistics for contexts of various lengths; very high compression strength
• Dynamic Markov compression (DMC, 1987): maintains a Markov chain model as a directed graph to predict one bit at a time; usually combined with Arithmetic coding
• Burrows-Wheeler transform (BWT, 1994): transforms data into an efficiently compressible permutation plus a tiny datum for unique reversal; used e.g. in bzip2
• PAQ family (2002 onwards): research archivers combining various techniques; super-slow, but the “bosses in town” w.r.t. compression strength
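The simplest of these, run-length encoding, fits in a few lines; a minimal sketch:

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical items into a (count, item) pair."""
    return [(len(list(group)), item) for item, group in groupby(data)]

def rle_decode(pairs):
    return "".join(item * count for count, item in pairs)

print(rle_encode("aaaabbc"))  # [(4, 'a'), (2, 'b'), (1, 'c')]
assert rle_decode(rle_encode("aaaabbc")) == "aaaabbc"
```

Note that for non-repetitive input the (count, item) pairs are larger than the input itself, which is why RLE is useful only for repetitive data.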
Methods
Lempel-Ziv Derivatives
[Figure: “A Hierarchy of Lossless Compression Algorithms”, light modification of the diagram by ETHW]
Uses in Compression Formats • LZSS (1982) in RAR and in LHA (.lzh, .lha), in each as one component • DEFLATE (1993) in ZIP • LZX (1995) improved version in Cabinet (.cab) by Microsoft • LZMA (1998) in 7-Zip (.7z), as one option Further Uses by Microsoft • LZX: Compiled HTML Help (.chm), Windows Imaging (.wim) • LZNT1, an LZSS variant: NTFS file system • an LZ77 variant: Windows Update Services
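Several of these formats are directly accessible from standard libraries. As one example, Python's built-in lzma module wraps the LZMA algorithm used by 7-Zip (here in its .xz container format); a minimal round trip:

```python
import lzma

data = b"lossless compression " * 1000

# preset trades speed for strength (0 = fastest, 9 = strongest)
packed = lzma.compress(data, preset=9)
assert lzma.decompress(packed) == data
print(len(data), "->", len(packed))  # highly repetitive input shrinks drastically
```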
The Famous Classic
DEFLATE (1993, in PKZIP 2.0) • the algorithm behind ZIP • used in numerous formats: PNG, PDF, ODF, WOFF, etc. • provides good compression on a wide variety of data • relies on a combination of LZSS and Huffman coding
zlib (1995) • extremely widespread standard code library for DEFLATE • uses few system resources • allows trading speed for compression strength
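That speed-for-strength trade-off is exposed through zlib's compression level; a minimal sketch using Python's built-in bindings:

```python
import zlib

data = b"lossless compression " * 1000

fast = zlib.compress(data, 1)    # level 1: fastest, weakest
strong = zlib.compress(data, 9)  # level 9: slowest, strongest
assert zlib.decompress(fast) == data
assert zlib.decompress(strong) == data
print(len(data), len(fast), len(strong))
```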
Recent Methods
LZ4 (2011) • simple LZ77-type algorithm with a byte-oriented encoding • no entropy encoding • very fast compression, extremely fast decompression • weaker compression than DEFLATE
Recent Methods
“Swiss Series” by Google: methods developed at the engineering center in Zurich: Gipfeli (2012), Zopfli (2013), Brotli (2015)
Brotli (2015) • based on a modern LZ77 variant, Huffman coding, and 2nd-order context modeling • also uses a static 120 kB dictionary of common HTML/JS strings • about as fast as DEFLATE, but 20–25 % stronger compression • specified as an HTTP compression encoding (IETF RFC 7932)
Recent Methods
Zstandard (2015) • authored by Yann Collet (nowadays working at Facebook) • based on LZ77, Huffman, and ANS coding (Huff0/FSE codecs) • designed for speed and parallel execution • fine-grained and wide-ranging control of strength vs. speed • supports creation of dictionaries as a data-specific boost
Some Performance Data
Dataset: Canterbury corpus, 2.7 MB in total System: Intel Xeon E5-1650 v2 @3.5 GHz, Linux 3.13.0 Test: compiled with GCC 4.8.4 with -O2, run single-threaded Source: Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms, Sep. 2015
Some Performance Data
Dataset: Silesia corpus, 202.1 MB in total System: Intel Core i7-6700K @4.0 GHz, Linux 4.8.0-1-amd64 Test: run using the in-memory benchmark lzbench compiled with GCC 6.3.0 Source: Zstandard, accessed May 20, 2017, 6:30 p.m.
Recent Methods
Marlin (2017) • developed at KIT and the Universitat Autònoma de Barcelona • based on blockwise switching between optimized dictionaries • dictionary generation is the “entropy coding” here; excellent strength and decompression speed
Presented at the Data Compression Conference (DCC), April 2017.

Algorithm          Compr. ratio   Decompr. speed [MB/s]
JPEG-LS            2.12           31
ANS by FSE codec   2.07           524
Marlin             1.94           2494

Dataset: Rawzor benchmark dataset for lossless image compression System: Intel Core i5-6600K @3.5 GHz, Linux 4.4.0 Test: compiled with GCC 5.4.0 with -O3
Options for Improvement … of Strength • operate on the bit level • monitor the benefit of compression, disable it where appropriate • use “context mixing”: a combination of models for better prediction • use an additional static or prefilled dictionary • use a big sliding window for backward references
Options for Improvement … of Speed • choose data structures very carefully • fit data packets to cache lines, working memory to L1/L2 cache size • comply with byte alignment in I/O, use the biggest POD type as buffer • avoid branching, mimic conditional operations by arithmetic/logical ones • precompute any static data • only estimate statistics, e.g. by sampling filters • use cheap hashing for identifying dictionary matches
Scope & Applications
Modern Scope • Reduce network data traffic. • Lower transmission costs, especially for mobile web. • Extend battery lifetime of mobile devices. • Accelerate data loading for web pages, app data, streamed games, 3D geometry data, AR/VR (soon). • Reduce costs of operation of cloud infrastructure.
These are critical business factors!
Recent Applications • File Systems OpenZFS (LZ4) • Web all major web browsers, Apache HTTP Server (all Brotli) • Databases/Data Warehouses MySQL, Apache HBase (both LZ4), Amazon Redshift (Zstandard) • Big Data Services Apache Hadoop (LZ4, Zstandard), Apache Spark (LZ4), Presto (Zstandard) • Processing Pipelines Facebook, “The Guardian” publication pipeline (both Zstandard), Dropbox static assets (Brotli)
Special Applications . . . of the Match & Reuse / Predict method parts: • computational molecular biology, e.g. structural fold analysis of protein sequences • smartphone typing assistance • document analysis and text mining • language & speech modeling • music classification
What will come? A prediction.
Big Potential
• context mixing methods: combinations of models and of method aspects
• LZMA methods: allow an attractive balance of speed vs. strength
Future • more natural & transparent use of compression • new domain-specific codecs and different evaluation methods • synergy with machine learning (ML) techniques: - better compression by utilizing learned “model knowledge” - more efficient training of neural networks by feeding only “essential data”
Conclusion
Lossless Data Compression • is essential as data bloat and I/O bandwidths remain bottlenecks • receives new attention and research from big IT players • will take an exciting role in the era of rich data, ML, and IoT
Overview article
The Engineering and Technology History Wiki (ETHW): History of Lossless Data Compression Algorithms
http://ethw.org/History_of_Lossless_Data_Compression_Algorithms
Video lectures
The Science and Application of Data Compression Algorithms
https://www.youtube.com/watch?v=ZEQRz7BmGtA
“Compressor Head” series by Google
https://www.youtube.com/playlist?list=PLOU2XLYxmsIJGErt5rrCqaSGTMyyqNt2H
Thank you!