Lossless Data Compression: Modern Scope and Applications
Dr. Kai Sandfort
[email protected]
latest edit: June 29, 2017
“There are 10 types of people in the world: those who understand binary code and those who don’t.”
Scope of the talk • big picture of topic • current methods & research efforts • modern applications Not covered • method details • patent situation • lossy compression
Introduction Techniques Entropy Coding Dictionary Coding Other Techniques
Methods Lempel-Ziv Derivatives The Famous Classic Recent Methods Options for Improvement
Scope & Applications What will come? A prediction. Conclusion
Introduction
A Bit of History 80s to mid-90s • storage space and bandwidth are limited and expensive • little data transfer, exchange by floppy disks • compress to save storage space (HDD, floppy disk) mid-90s to mid-00s • plenty of storage space is available at an ever lower price (HDD, CD-ROM, DVD) • rise of the World Wide Web • rise of multimedia (image/sound/video) • ever increasing data transfer • compress to handle new types and volumes of data
A Bit of History - cont. mid-00s and later • storage space: you name it (HDD with PMR, DVD, USB flash, Blu-ray Disc, SSD, . . . ) • burst of data-producing cameras, microphones, sensors • massive and rapidly increasing data transfer • compress to lower transmission costs • compress to handle increasing resolution, fidelity, dynamic range • compression for cold archiving and new applications
Basics • Non-random data contains redundant information; truly random data contains none. • Compression is about identifying and exploiting patterns or structure. • No algorithm can compress even 1 % of all data of a given length, even by 1 byte. • The smaller the amount of data, the harder it is to compress.
General Scheme
1. Modeling: collect data/statistics and build a model from past data.
2. “Past-to-Present” Mapping:
(a) Predict: compute a probability distribution for the present data, based on the model.
(b) Match & Reuse: match the present data with the past data in the model and reference the latter.
3. Coding: compute and emit codes for the actual and reference data, respectively.
The more powerful the Modeling and the Past-to-Present Mapping, the better the compression.
Techniques
Entropy Coding • set up probability model for following data (mostly next symbol) • compute variable-length codes, shorter ones for likely data • low speed, high compression strength • recommended for poorly structured data
The simplest theory that explains the past is the best predictor of the future. – Occam’s Razor
Entropy Coding
Huffman coding (1952) computes optimal-length prefix-free codes for symbols, according to their probabilities → integer # bits / symbol
Arithmetic coding (1979) encodes a string of symbols as a single rational number in [0, 1], via nested intervals determined by the probability distribution → fractional # bits / symbol
Asymmetric Numeral Systems (ANS) coding (2014) encodes a string of symbols as a single natural number; combines the strength of Arithmetic coding with the speed of Huffman coding → fractional # bits / symbol
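To make the nested-interval idea behind Arithmetic coding concrete, here is a minimal sketch using exact rational arithmetic via Python's fractions module. The probability table is an illustrative assumption, not taken from the slides:

```python
from fractions import Fraction

def arith_encode(text, probs):
    """Narrow [low, high) by each symbol's probability slice."""
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        cum = Fraction(0)
        for sym, p in probs:
            if sym == ch:
                high = low + (cum + p) * width
                low = low + cum * width
                break
            cum += p
    return low, high  # any number in [low, high) encodes the string

def arith_decode(x, probs, n):
    """Recover n symbols by locating x in the probability slices."""
    out = []
    for _ in range(n):
        cum = Fraction(0)
        for sym, p in probs:
            if cum <= x < cum + p:
                out.append(sym)
                x = (x - cum) / p  # rescale and continue
                break
            cum += p
    return "".join(out)

# Hypothetical symbol probabilities for a tiny alphabet
probs = [("s", Fraction(3, 10)), ("o", Fraction(2, 10)),
         ("l", Fraction(2, 10)), ("e", Fraction(3, 10))]
lo, hi = arith_encode("loss", probs)
print(arith_decode(lo, probs, 4))  # loss
```

A real coder emits just enough bits to pin down one number in the final interval, which is how fractional bits per symbol arise.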
Example: Huffman coding Example string:
lossless compression
Alphabet: ASCII with 7 bit/symbol; # characters: 20

symbol        's'  'o'  'e'  'l'  ' '   'c'   'i'   'm'   'n'   'p'    'r'
count          6    3    2    2    1     1     1     1     1     1      1
Huffman code   10   110  001  000  1110  0111  0110  0101  0100  11111  11110
size of ASCII bit seq.: 20 · 7 bit = 140 bit
size of Huffman bit seq.: 6 · 2 + (3 + 2 + 2) · 3 + (1 + 1 + 1 + 1 + 1) · 4 + (1 + 1) · 5 bit = 63 bit + size of (symbol, code) pair table. The latter is a fixed cost, which amortizes for longer strings!
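The 63-bit figure can be checked with a short sketch. Huffman codes are not unique, but the optimal total length is, so it suffices to track code lengths while merging the two lightest nodes on a min-heap:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Build a Huffman tree; return ({symbol: code length}, frequencies)."""
    freq = Counter(text)
    # Heap entries: (count, tiebreaker, {symbol: depth so far})
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2], freq

lengths, freq = huffman_code_lengths("lossless compression")
total_bits = sum(freq[s] * l for s, l in lengths.items())
print(total_bits)  # 63, matching the slide
```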
Dictionary Coding • maintain dictionary of strings for either a buffer (“sliding window”) or the entire data • replace later occurrences by reference position and length • high speed, moderate compression strength • suitable for rule- or grammar-based data, in particular text • famous: Lempel-Ziv methods LZ77 and LZ78, many derivatives
Example: LZ77
Choice: sliding window size: 15 char.s (4 bit), max. match length: 15 char.s (4 bit)
Example string: lossless compression

address     0    0    0    1    0    4    0    9    0    0    9    8
length      0    0    0    1    0    2    0    1    0    0    3    1
delimiter  'l'  'o'  's'  'l'  'e'  ' '  'c'  'm'  'p'  'r'  'i'  'n'

size of LZ77 bit seq.: 12 · (4 + 4) + 12 · 7 bit = 180 bit
BUT: > 140 bit (ASCII bit seq.)
However, LZ77 combined with Huffman coding for the (address, length) pairs: 120 bit + fixed costs
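A toy implementation of this (address, length, delimiter) scheme, with window and match length capped at 15, might look as follows. This is an illustrative sketch, not an optimized coder:

```python
def lz77_compress(data, window=15, max_len=15):
    """Emit (offset, length, next_char) triples; offset 0 means no match."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            l = 0
            # Overlapping matches are allowed; keep one literal as delimiter
            while (l < max_len and i + l < len(data) - 1
                   and data[j + l] == data[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = i - j, l
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    s = []
    for off, length, ch in tokens:
        for _ in range(length):
            s.append(s[-off])  # copy one char at a time: overlap-safe
        s.append(ch)
    return "".join(s)

tokens = lz77_compress("lossless compression")
assert lz77_decompress(tokens) == "lossless compression"
print(len(tokens))  # 12 triples, i.e. 12 · (4 + 4) + 12 · 7 = 180 bit
```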
Other Techniques
• Run-length encoding (RLE, 1967 or earlier): replaces repeated items by (count, item) pairs; useful only for repetitive data
• Prediction by partial matching (PPM family, mid-1980s onwards): maintains probability models based on statistics for contexts of various lengths; very high compression strength
• Dynamic Markov compression (DMC, 1987): maintains a Markov chain model as a directed graph to predict one bit at a time; usually combined with Arithmetic coding
• Burrows-Wheeler transform (BWT, 1994): transforms data into an efficiently compressible permutation plus a tiny datum for unique reversal; used e.g. in bzip2
• PAQ family (2002 onwards): research archivers combining various techniques; super-slow, but the “bosses in town” w.r.t. compression strength
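The simplest of these, run-length encoding, fits in a few lines; a minimal sketch:

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical items into a (count, item) pair."""
    return [(len(list(group)), item) for item, group in groupby(data)]

def rle_decode(pairs):
    return "".join(item * count for count, item in pairs)

print(rle_encode("aaaabbc"))  # [(4, 'a'), (2, 'b'), (1, 'c')]
assert rle_decode(rle_encode("aaaabbc")) == "aaaabbc"
```

Note that for non-repetitive input the (count, item) pairs are larger than the input itself, which is why RLE is useful only for repetitive data.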
Methods
Lempel-Ziv Derivatives
[Figure: “A Hierarchy of Lossless Compression Algorithms”, light modification of the diagram by ETHW]
Uses in Compression Formats • LZSS (1982) in RAR and in LHA (.lzh, .lha), in each as one component • DEFLATE (1993) in ZIP • LZX (1995) improved version in Cabinet (.cab) by Microsoft • LZMA (1998) in 7-Zip (.7z), as one option Further Uses by Microsoft • LZX: Compiled HTML Help (.chm), Windows Imaging (.wim) • LZNT1, an LZSS variant: NTFS file system • an LZ77 variant: Windows Update Services
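Several of these formats are directly accessible from standard libraries. As one example, Python's built-in lzma module wraps the LZMA algorithm used by 7-Zip (here in its .xz container format); a minimal round trip:

```python
import lzma

data = b"lossless compression " * 1000

# preset trades speed for strength (0 = fastest, 9 = strongest)
packed = lzma.compress(data, preset=9)
assert lzma.decompress(packed) == data
print(len(data), "->", len(packed))  # highly repetitive input shrinks drastically
```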
The Famous Classic
DEFLATE (1993, in PKZIP 2.0) • the algorithm behind ZIP • used in numerous formats: PNG, PDF, ODF, WOFF, etc. • provides good compression on a wide variety of data • relies on a combination of LZSS and Huffman coding
zlib (1995) • extremely widespread standard code library for DEFLATE • uses few system resources • allows trading speed for compression strength
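That speed-for-strength trade-off is exposed through zlib's compression level; a minimal sketch using Python's built-in bindings:

```python
import zlib

data = b"lossless compression " * 1000

fast = zlib.compress(data, 1)    # level 1: fastest, weakest
strong = zlib.compress(data, 9)  # level 9: slowest, strongest
assert zlib.decompress(fast) == data
assert zlib.decompress(strong) == data
print(len(data), len(fast), len(strong))
```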
Recent Methods
LZ4 (2011) • simple LZ77-type algorithm with a byte-oriented encoding • no entropy encoding • very fast compression, extremely fast decompression • weaker compression than DEFLATE
Recent Methods
“Swiss Series” by Google: methods developed at the engineering center in Zurich: Gipfeli (2012), Zopfli (2013), Brotli (2015)
Brotli (2015) • based on a modern LZ77 variant, Huffman coding, and 2nd-order context modeling • also uses a static 120 kB dictionary of common HTML/JS strings • about as fast as DEFLATE, but 20–25 % stronger compression • specified as an HTTP compression encoding (IETF RFC 7932)
Recent Methods
Zstandard (2015) • authored by Yann Collet (nowadays working at Facebook) • based on LZ77, Huffman, and ANS coding (Huff0/FSE codecs) • designed for speed and parallel execution • fine-grained and wide-ranging control of strength vs. speed • supports creation of dictionaries as a data-specific boost
Some Performance Data
Dataset: Canterbury corpus, 2.7 MB in total System: Intel Xeon E5-1650 v2 @3.5 GHz, Linux 3.13.0 Test: compiled with GCC 4.8.4 with -O2, run single-threaded Source: Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms, Sep. 2015
Some Performance Data
Dataset: Silesia corpus, 202.1 MB in total System: Intel Core i7-6700K @4.0 GHz, Linux 4.8.0-1-amd64 Test: run using the in-memory benchmark lzbench compiled with GCC 6.3.0 Source: Zstandard, accessed May 20, 2017, 6:30 p.m.
Recent Methods
Marlin (2017) • developed at KIT and the Universitat Autònoma de Barcelona • based on blockwise switching between optimized dictionaries • dictionary generation is the “entropy coding” here; excellent strength and decompression speed
Presented at the Data Compression Conference (DCC), April 2017.

Algorithm          Compr. ratio   Decompr. speed [MB/s]
JPEG-LS            2.12           31
ANS by FSE codec   2.07           524
Marlin             1.94           2494

Dataset: Rawzor benchmark dataset for lossless image compression System: Intel Core i5-6600K @3.5 GHz, Linux 4.4.0 Test: compiled with GCC 5.4.0 with -O3
Options for Improvement … of Strength • operate on the bit level • monitor the benefit of compression, disable it where appropriate • use “context mixing”: a combination of models for better prediction • use an additional static or prefilled dictionary • use a big sliding window for backward references
Options for Improvement … of Speed • choose data structures very carefully • fit data packets to cache lines, working memory to L1/L2 cache size • comply with byte alignment in I/O, use the biggest POD type as buffer • avoid branching, mimic conditional operations by arithmetic/logical ones • precompute any static data • only estimate statistics, e.g. by sampling filters • use cheap hashing for identifying dictionary matches
Scope & Applications
Modern Scope • Reduce network data traffic. • Lower transmission costs, especially for mobile web. • Extend battery lifetime of mobile devices. • Accelerate data loading for web pages, app data, streamed games, 3D geometry data, AR/VR (soon). • Reduce costs of operation of cloud infrastructure.
These are critical business factors!
Recent Applications • File Systems OpenZFS (LZ4) • Web all major web browsers, Apache HTTP Server (all Brotli) • Databases/Data Warehouses MySQL, Apache HBase (both LZ4), Amazon Redshift (Zstandard) • Big Data Services Apache Hadoop (LZ4, Zstandard), Apache Spark (LZ4), Presto (Zstandard) • Processing Pipelines Facebook, “The Guardian” publication pipeline (both Zstandard), Dropbox static assets (Brotli)
Special Applications . . . of the Match & Reuse / Predict method parts: • computational molecular biology, e.g. structural fold analysis of protein sequences • smartphone typing assistance • document analysis and text mining • language & speech modeling • music classification
What will come? A prediction.
Big Potential
• context mixing methods: combinations of models and of method aspects
• LZMA methods: allow an attractive balance of speed vs. strength
Future • more natural & transparent use of compression • new domain-specific codecs and different evaluation methods • synergy with machine learning (ML) techniques: - better compression by utilizing learned “model knowledge” - more efficient training of neural networks by feeding only “essential data”
Conclusion
Lossless Data Compression • is essential as data bloat and I/O bandwidths remain bottlenecks • receives new attention and research from big IT players • will take an exciting role in the era of rich data, ML, and IoT
Overview article
The Engineering and Technology History Wiki (ETHW): History of Lossless Data Compression Algorithms
http://ethw.org/History_of_Lossless_Data_Compression_Algorithms
Video lectures
The Science and Application of Data Compression Algorithms
https://www.youtube.com/watch?v=ZEQRz7BmGtA
“Compressor Head” series by Google
https://www.youtube.com/playlist?list=PLOU2XLYxmsIJGErt5rrCqaSGTMyyqNt2H
Thank you!