TokuMX Internals The “What”, “Why”, and “How” of Fractal Tree Indexing for MongoDB Zardosht Kasheff @zkasheff @tokutek
What is TokuMX?
• TokuMX = MongoDB with improved storage (Fractal Tree Indexes!)
• Drop-in replacement for MongoDB v2.2 applications
  o Including replication and sharding
  o Same data model
  o Same query language
  o Drivers just work
  o 2.4 compatibility soon
• Open source: https://github.com/Tokutek/mongo/
TokuMX Benefits
Top 5 benefits to TokuMX are…
TokuMX Benefit #1 Improved write performance on large data
TokuMX Benefit #2 Compression! (up to 25x)
On one benchmark, TokuMX achieved 11.6:1 compression
TokuMX Benefit #3 No Fragmentation.
TokuMX Benefit #4 Scale up
• No global read/write lock
• Document level locking
• Sysbench benchmark on data > RAM
TokuMX Benefit #4 Scale up
• No global read/write lock
• Document level locking
• Sysbench benchmark on data < RAM
TokuMX Benefit #5 Transactions: MVCC + multi-statement on single servers
TokuMX Top 5 Benefits Recap
• Improved write performance on large data
• Compression! (up to 25x)
• No fragmentation (deprecated compact!)
• Scale up
• Transactions (MVCC + multi-statement)
Bottom line: TokuMX makes MongoDB applications stable and fast for large databases.
TokuMX: How? Built a storage core from the ground up, around Fractal Tree indexes, a data structure designed with large data in mind.
• Some benefits come thanks to Fractal Tree indexes
• Some benefits come thanks to good old fashioned engineering
Benefits:
• Improved write performance on large data (thanks to Fractal Trees)
• Compression! (up to 25x) (thanks to Fractal Trees)
• No fragmentation, deprecated compact! (thanks to Fractal Trees)
• Scale up (good old fashioned engineering)
• Transactions, MVCC + multi-statement (good old fashioned engineering)
Agenda • Focus on how TokuMX brings the benefits that Fractal Trees are responsible for. (We won’t focus on scale up and transactions). • Compare side-by-side the B-Tree (what many databases use) and the Fractal Tree. Understand the differences. • Use differences to show, one by one, how TokuMX’s Fractal Trees enable: – Fast writes on big data – Compression – No fragmentation
But first, a spoiler…
Spoiler!! • MySQL customer I/O utilization graph:
Without Fractal Trees
With Fractal Trees
It’s all about I/O!!
Fractal Trees v. B-Trees Contrast and Compare
Fractal Trees v. B-Trees What is a B-Tree?
• Traditional data structure used in databases for over 40 years. • Used in NEARLY ALL databases, such as MongoDB, MySQL, BerkeleyDB, etc…
Fractal Trees v. B-Trees What is a B-Tree?
Internal Nodes
Leaf Nodes
Simple and elegant data structure:
• Internal nodes store as many pivots and pointers as fit.
• Leaf nodes store data.
Fractal Trees v. B-Trees What is a Fractal Tree? Pointers and pivots
Buffer
Leaf node
Another simple and elegant data structure: • Internal nodes store pivots, pointers, and buffers. • Leaf nodes store data.
Fractal Trees v. B-Trees What is a Fractal Tree? Pointers and pivots
Buffer
Leaf node
Buffers are important:
• They batch up writes
• We'll dig into what this means soon
Fractal Trees v. B-Trees
On disk, not in memory
Characteristics of B-Trees and Fractal Trees for large data: • Very high percentage of leaf nodes do not fit in memory • Therefore, accessing a random leaf node likely requires I/O
Understanding TokuMX’s Fractal Tree Benefit #1: Write performance
Write performance. How…
100mm inserts into a collection with 3 secondary indexes
With Less I/O!
100mm inserts into a collection with 3 secondary indexes
Fractal Tree v. B-Tree for write I/O Fractal Trees have significantly better write performance than B-Trees when data > RAM
– B-Trees become I/O bound (disks do < 500 I/Os per second)
– Fractal Trees are not I/O bound
This is why B-Tree insertion performance “falls off a cliff”.
(Benchmark charts: both MongoDB and MySQL show the cliff.)
Conventional Wisdom This also leads to the following conventional wisdom:
• Keep indexes in memory.
• Keep the "working set" in memory.
• Have a "right-most insertion pattern" on indexes.
All of these tips are designed to work around the fact that B-Trees become I/O bound when writing to large databases.
Now let’s understand why…
How a B-Tree does writes Random Writes require I/O
The B-Tree algorithm for doing a write:
• Find the appropriate leaf node where the write belongs
• Bring the leaf node into memory (EXPENSIVE!)
• Modify the leaf node
For large data, nearly all B-Tree leaf nodes are not in memory, so the algorithm requires practically one I/O per write.
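A minimal Python sketch of this cost model (toy code, not TokuMX internals; the leaf count and cache size are made-up numbers). With only a small fraction of leaves cached, nearly every random write pays an I/O:

```python
# Toy model of B-Tree writes: if the target leaf is not cached in
# memory, the write must first read that leaf from disk (one I/O).
import random

class ToyBTree:
    def __init__(self, num_leaves, cached_leaves):
        random.seed(0)
        self.num_leaves = num_leaves
        # Only a small fraction of leaves fit in memory.
        self.cache = set(random.sample(range(num_leaves), cached_leaves))
        self.io_count = 0

    def write(self, key):
        leaf = hash(key) % self.num_leaves   # find the appropriate leaf
        if leaf not in self.cache:           # leaf not in memory:
            self.io_count += 1               # bring it in -- expensive!
        # ...modify the leaf in memory...

tree = ToyBTree(num_leaves=1_000_000, cached_leaves=10_000)  # 1% cached
for k in range(100_000):
    tree.write(k)
print(tree.io_count)  # roughly 99% of the 100,000 writes needed an I/O
```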
How a Fractal Tree does Writes Let’s zoom in here for the next slide
• Writes are batched in buffers as messages
• When a buffer is full, its messages spill into the buffers of the child nodes (which also spill if they get full)
• Through spilling, messages eventually make it to the leaf nodes
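A toy Python sketch of the spilling scheme above (illustrative only, not TokuMX code; the fanout, depth, and buffer size are invented). The point it shows: each flush moves a whole buffer of messages in one I/O, instead of one I/O per message:

```python
# Toy fractal-tree write path: internal nodes buffer messages and
# spill a full buffer down to their children in one batch.
class Node:
    def __init__(self, depth, fanout, buffer_size, stats):
        self.depth, self.fanout, self.buffer_size = depth, fanout, buffer_size
        self.buffer, self.children, self.stats = [], {}, stats

    def insert(self, msg):
        if self.depth == 0:              # leaf node: apply the message
            return
        self.buffer.append(msg)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        self.stats["flush_ios"] += 1                  # one I/O per flush...
        self.stats["msgs_moved"] += len(self.buffer)  # ...moves many messages
        for msg in self.buffer:
            # Route each message to its child (children spill too when full).
            c = (msg // self.fanout ** (self.depth - 1)) % self.fanout
            child = self.children.setdefault(
                c, Node(self.depth - 1, self.fanout, self.buffer_size, self.stats))
            child.insert(msg)
        self.buffer = []

stats = {"flush_ios": 0, "msgs_moved": 0}
root = Node(depth=2, fanout=4, buffer_size=64, stats=stats)
for k in range(10_000):
    root.insert(k)
print(stats["msgs_moved"] / stats["flush_ios"])  # messages moved per flush I/O
```

With a 64-message buffer, every flush I/O moves 64 messages; a B-Tree in the same situation would have paid one I/O per message.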
How a Fractal Tree does Writes When does a Fractal Tree do I/O for Writes? – When flushing a buffer’s worth of writes.
Here we see the BIG difference in I/O performance for Fractal Trees v. B-Trees:
B-Trees do an I/O to write one measly document. Fractal Trees do an I/O to write a buffer’s worth of documents. This is why I/O is drastically reduced!
Fractal Tree Wisdom This also leads to the following wisdom for Fractal Trees: • Indexes don’t need to fit in memory. • “Working set” does not need to be in memory. • Indexes don’t need to worry about their “insertion pattern”.
These capabilities reduce complexity of database design, and enable rich indexes and queries that B-trees cannot support.
Understanding TokuMX’s Fractal Tree Benefit #2: Compression
Compression: What?
• BitTorrent Peer Snapshot Data (~31 million documents), 3 indexes
• http://cs.brown.edu/~pavlo/torrent/
TokuMX achieved 11.6:1 compression
Compression: How? TokuMX's compression approach is simple!
1. Take large chunks of data
2. Compress them with standard compression algorithms (zlib, lzma, or quicklz)
3. There is no step 3!
The effectiveness of these compression algorithms depends on how much data you give them. TokuMX gives them lots of data, so TokuMX compresses well.
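A quick way to see why chunk size matters, using Python's zlib on synthetic data (the documents and sizes here are made up; the slides' point is only that bigger inputs compress better, because the compressor can exploit redundancy across the whole chunk):

```python
# Compress the same data as one large chunk vs. many small chunks.
import random
import zlib

random.seed(0)
# Synthetic "documents": repetitive enough to be compressible.
docs = [(b"user:%06d,status:active," % random.randrange(1000)) * 4
        for _ in range(4096)]
data = b"".join(docs)

one_big_chunk = len(zlib.compress(data))
many_small_chunks = sum(len(zlib.compress(d)) for d in docs)
print(one_big_chunk, "<", many_small_chunks)  # the big chunk compresses better
```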
The secret is…
Compression: The Secret TokuMX node sizes (4MB) are much larger than B-Tree node sizes
• B-Tree nodes: small (8KB or 16KB)
• Fractal Tree nodes: large (4MB)
Larger node size leads to better compression. So the question is: why do Fractal Trees have such large node sizes?
Fractal Trees: Why Large Nodes? Again, it's all about the I/O. For writes:
– B-Trees: reading a large node to write one measly row is painful
– Fractal Trees: reading a large node to write a proportionally large buffer is not painful. In fact, it's better: reading larger nodes means you pay more in disk bandwidth cost than in disk seek cost.
Conclusion: Fractal Trees should use large nodes for writes, for better performance AND compression.
Fractal Trees: Large Nodes + Reads What about reading a single document? The problem: • For point query, we are reading one measly document • Just as B-Trees don’t want to do a large I/O to write one measly document, Fractal Trees should not read 4MB to read one measly document.
Fractal Trees: Large Nodes + Reads What about reading a single document? The solution:
• Partition the 4MB leaf node into 64KB "basement nodes" (the 64KB value is configurable)
• The 64KB chunks are individually compressed, concatenated, and written to disk to represent a leaf node
• When flushing data for writes, read the full 4MB node
• When reading "one measly document", read only the appropriate 64KB chunk of data
64KB chunks are a nice sweet spot between good compression and point query performance.
Fractal Trees: Compression Summary:
• Use large nodes: 4MB
• Partition leaf nodes into 64KB contiguous chunks
• Compress the 64KB chunks individually with standard compression algorithms (zlib, lzma, or quicklz), getting good compression
• Concatenate the compressed chunks to make a large compressed leaf node
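A toy sketch of the basement-node layout summarized above (not the real TokuMX on-disk format; the document contents and the lack of a chunk-offset table are simplifications). A point query decompresses only the 64KB chunk it needs, never the whole 4MB leaf:

```python
# Split a 4MB leaf into 64KB basement nodes, compress each individually,
# and concatenate the compressed chunks to form the on-disk leaf.
import zlib

BASEMENT = 64 * 1024
# A synthetic 4MB leaf of numbered "documents".
leaf = b"".join(b"doc-%08d;" % i for i in range(330_000))[:4 * 1024 * 1024]

chunks = [zlib.compress(leaf[i:i + BASEMENT])
          for i in range(0, len(leaf), BASEMENT)]
on_disk = b"".join(chunks)  # the compressed leaf as written to disk

def read_at(offset):
    """Point query: decompress only the one basement node holding offset."""
    idx = offset // BASEMENT
    return zlib.decompress(chunks[idx])[offset % BASEMENT]

assert read_at(123_456) == leaf[123_456]   # same byte, one chunk decompressed
print(len(on_disk), "<", len(leaf))        # and the leaf still compresses well
```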
Understanding TokuMX’s Fractal Tree Benefit #3: No Fragmentation
What is Fragmentation? Fragmentation happens when nodes on disk get rearranged in random order, with wasted space accumulating between nodes. Why MongoDB Users care about fragmentation: • Wasted space between blocks makes keeping working set in memory more difficult, leads to disk bloat • Blocks of data rearranged in random order leads to performance degradation
Workarounds for Fragmentation MongoDB workarounds: – Pad inserted documents with some additional space to account for future updates – Occasionally bring the database down and run compact. This correctly rearranges blocks and removes wasted space – Aggressively preallocate files to reserve space
TokuMX workarounds: none needed!
Why TokuMX does not Fragment Why TokuMX Users don’t care about fragmentation: • On wasted space between blocks: – Compression greatly mitigates impact of wasted space on disk usage – Write performance allows working set to exceed memory
• On blocks of data being rearranged in random order: – Short answer: large leaf nodes practically eliminate the I/O impact of rearranged data blocks (once again, it’s all about the I/O) – Long answer: let’s do some analysis…
Impact of Rearranged Blocks First, let's assume the following costs of disk access:
• Disk seek time: 10ms (100 I/Os per second)
• Disk bandwidth: 100MB/s
The numbers are meant to be nice estimates to make the math simple. The question to ask ourselves that shows the impact of fragmentation:
At what rate (in bytes/second) can I read an entire B-Tree?
Impact of Rearranged Blocks At what rate (in bytes/second) can I read an entire B-Tree? Non-fragmented B-Tree:
– All data is sequentially arranged, therefore sequentially accessed
– Effective rate: 100 MB/s (at most). Great performance!
Fragmented B-Tree:
– Suppose the node size is 8KB; accessing a leaf node requires an I/O
– The cost of reading a block of data is seek time + bandwidth time
– Seek time: 10ms, bandwidth time: 100us (dominated by the seek)
– Effective rate: 8KB/10ms = 800 KB/s. Poor performance!
This is the poor performance one sees with fragmentation, and why users want to compact
Impact of Rearranged Blocks At what rate (in bytes/second) can I read an entire Fractal Tree? Non-fragmented Fractal Tree:
– Effective rate: 100 MB/s (at most). Great performance!
"Fragmented" Fractal Tree:
– Suppose the node size is 1MB compressed, 4MB uncompressed
– The cost of reading a block of data is seek time + bandwidth time
– Seek time: 10ms, bandwidth time: 10ms
– Effective rate: 1MB/20ms = 50 MB/s. Great performance!
Large Fractal Tree nodes mitigate I/O seek cost of a fragmented collection!
Summary on Fragmentation • Don't worry about fragmentation.
TokuMX Resources
• tokutek.com/products/downloads
• [email protected]
• [email protected]
• For evaluations or enterprise support: [email protected]
• [email protected], @zkasheff on twitter