Lecture 5 – Distributed Database and BigTable
922EU3870 – Cloud Computing and Mobile Platforms, Autumn 2009 (2009/10/12)
Ping Yeh (葉平), Google, Inc.
http://labs.google.com/papers/bigtable.html
Numbers real world engineers should know
L1 cache reference                                      0.5 ns
Branch mispredict                                         5 ns
L2 cache reference                                        7 ns
Mutex lock/unlock                                       100 ns
Main memory reference                                   100 ns
Compress 1 KB with Zippy                             10,000 ns
Send 2 KB through 1 Gbps network                     20,000 ns
Read 1 MB sequentially from memory                  250,000 ns
Round trip within the same data center              500,000 ns
Disk seek                                        10,000,000 ns
Read 1 MB sequentially from network              10,000,000 ns
Read 1 MB sequentially from disk                 30,000,000 ns
Round trip between California and Netherlands   150,000,000 ns
2
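For back-of-envelope estimates with these numbers: reading 1 MB sequentially from disk costs 30,000,000 ns ≈ 30 ms versus 250,000 ns ≈ 0.25 ms from memory, roughly 120x slower, and a single 10 ms disk seek costs about as much as reading 10,000,000 ns / (30 ns per byte) ≈ 330 KB sequentially. This is why the storage structures later in this lecture (B+ trees with large nodes, SSTables) favor a few large sequential reads over many small random ones.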
The Joys of Real Hardware
Typical first year for a new cluster: ~0.5 overheating (power down most machines in …), …

Insertion in a B+ tree
Insert(key, data, node):
  …
  if Size(node) > d:
    new_node = Split(node)                        # node -> new_node + node
    Insert(new_node.lastkey(), new_node, parent)
May produce a new root node (not shown)

(figure: example B+ tree nodes with keys 1, 4, …, 29, 30, …)
15
Deletion in a B+ tree
DeleteInTree(key, bplus_tree):
  node = Find(key, bplus_tree)        # find the node to delete from
  if not node:
    return False
  Delete(key, node)                   # delete
  return True

Delete(key, node):
  RemoveData(key, node)
  if Size(node) < d/2:
    RedistributeOrMerge(node, parent)

(figure: example B+ tree with internal keys 29, 57, …, 850, 879; leaves hold keys 1, 4, …, 29 / 30, 33, …, 57 / 822, 823, …, 850 with data records d1, d4, d5, …, d822, d823, d825, …)
16
Features of a B+ Tree
• Good fit for sorted data stored in block storage devices
  – Fast search: O(log_d N) with large d (see the search sketch below)
  – Fast range scan with links from one leaf node to the next: O(log_d N + k), where k = number of elements returned
  – Insertion may cause splitting of nodes
  – Deletion may cause merging of nodes
• Many optimizations exist (with pros and cons)
  – data structure of a node (array, binary tree, linked list, etc.)
  – compression of keys in a node
  – lazy deletion
  – RAM-resident nodes
  – etc.
17
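To make the O(log_d N) search and the leaf-linked range scan concrete, here is a minimal Python sketch; the Node layout and helper names are illustrative assumptions, not something defined in the lecture.

import bisect

class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys            # sorted keys (up to d per node)
        self.children = children    # internal node: child pointers
        self.records = records      # leaf node: data records aligned with keys
        self.next_leaf = next_leaf  # leaf node: link to the next leaf

    def is_leaf(self):
        return self.children is None

def find_leaf(root, key):
    """Descend from the root: O(log_d N) node visits."""
    node = root
    while not node.is_leaf():
        i = bisect.bisect_left(node.keys, key)   # first separator key >= key
        node = node.children[i]
    return node

def range_scan(root, lo, hi):
    """Yield (key, record) pairs with lo <= key <= hi: O(log_d N + k)."""
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, r in zip(leaf.keys, leaf.records):
            if k > hi:
                return
            if k >= lo:
                yield k, r
        leaf = leaf.next_leaf                    # follow the leaf links

leaf1 = Node(keys=["a", "c"], records=["da", "dc"])
leaf2 = Node(keys=["e", "g"], records=["de", "dg"])
leaf1.next_leaf = leaf2
root = Node(keys=["c"], children=[leaf1, leaf2])
print(list(range_scan(root, "b", "f")))   # [('c', 'dc'), ('e', 'de')]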
Compressing Data in a B+ Tree
• How to use less space in nodes?
• Compressing all keys together
  – most space efficient
  – but reading 10 bytes requires uncompressing the whole node
• Split the keys into blocks and compress each block (see the sketch below)
  – less space efficient
  – faster for small reads
18
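A minimal sketch of the trade-off using zlib in Python; the node layout, BLOCK_SIZE, and helper names are illustrative assumptions, not the lecture's design. Whole-node compression gives the best ratio but pays a full decompress per read; per-block compression decompresses only the block that can hold the key.

import bisect
import zlib

BLOCK_SIZE = 16  # keys per compressed block (illustrative)

def build_node(keys):
    """Compress a node's sorted keys in fixed-size blocks; keep the first key of each block as an index."""
    blocks, index = [], []
    for i in range(0, len(keys), BLOCK_SIZE):
        block = keys[i:i + BLOCK_SIZE]
        index.append(block[0])
        blocks.append(zlib.compress("\n".join(block).encode()))
    return index, blocks

def lookup(index, blocks, key):
    """Decompress only the one block that could contain the key."""
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return False
    block = zlib.decompress(blocks[i]).decode().split("\n")
    return key in block

keys = sorted(f"row{i:05d}" for i in range(1000))
index, blocks = build_node(keys)
print(lookup(index, blocks, "row00123"))   # True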
BigTable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber; OSDI 2006
http://labs.google.com/papers/bigtable.html
Motivation • Lots of (semi-)structured data at Google – Web: contents, crawl metadata, links/anchors/pagerank, … – Per-user data: user preference settings, recent queries, search results, …
– Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
• Scale is large – billions of URLs, many versions/page (~20K/version) – Hundreds of millions of users, thousands of q/sec – 100TB+ of satellite image data
• Need data both for offline data processing and online serving
20
Why not use a commercial DB? • Scale is too large for most commercial databases • Even if it weren’t, cost would be very high – Building internally means system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly
– Much harder to do when running on top of a database layer
• Also fun and challenging to build large-scale systems :)
21
Goals
• Wide applicability
  – used by many Google products and projects
  – often want to examine data changes over time, e.g., the contents of a web page over multiple crawls
  – both throughput-oriented batch-processing jobs and latency-sensitive serving of data to end users
• Scalability – Handful to thousands of servers, hundreds of TB to PB
• High performance – Very high read/write rates (millions of ops per second) – Efficient scans over all or interesting subsets of data
• High availability – Want access to most current data at any time 22
BigTable • Distributed multi-level map – With an interesting data model
• Fault-tolerant, persistent • Scalable – Thousands of servers – Terabytes of in-memory data – Petabyte of disk-based data – Millions of reads/writes per second, efficient scans
• Self-managing – Servers can be added/removed dynamically – Servers adjust to load imbalance 23
Status • Design/initial implementation started beginning of 2004 • Production use or active development for many projects: – Google Analytics – Personal Search History – Crawling/indexing pipeline – Google Maps/Google Earth – Blogger – …
• ~100 BigTable cells; the largest cell manages ~200 TB of data spread over several thousand machines (circa 2007)
24
Building Blocks of BigTable
• Distributed file system (GFS): stores persistent state
• Scheduler (not published): schedules jobs onto machines
  – BigTable server jobs run alongside all other kinds of jobs
• Lock service (Chubby): distributed lock manager – Also can reliably hold small files with high availability – Master election, location bootstrapping
• Data processing (MapReduce): – Simplified large-scale data processing – Often used to read/write BigTable data (not a building block of BigTable, but uses BigTable heavily)
25
Google File System (GFS)
• Master manages metadata
• Data transfers happen directly between clients and chunkservers
• Files broken into chunks (typically 64 MB)
• Chunks triplicated across three machines for safety
• See the SOSP '03 paper at http://labs.google.com/papers/gfs.html

(figure: one GFS master and several chunkservers, with clients talking to both)
26
Chubby
• Distributed lock service with a file system for small files
• Usually 5 replicas running the Paxos algorithm
  – maintains consistency
  – fault-tolerant
  – master election
  – event notification mechanism
• Also used for name resolution in the cluster
27
Key Jobs in a BigTable Cluster
• Master
  – schedules tablet assignments
  – quota management
  – health checks of tablet servers
  – garbage collection management
• Tablet servers
  – serve data for reads and writes (one tablet is assigned to exactly one tablet server)
  – compaction
  – replication
  – etc.
• Monitor
28
Typical Cluster

(figure: machines 1..N each run Linux with a GFS chunkserver, a scheduler slave, a BigTable server, and user apps; one machine also runs the BigTable master; the cluster scheduling master, lock service, and GFS master run alongside)
29
BigTable Overview • Data Model • Implementation Structure – Tablets, compactions, locality groups, …
• API • Details – Shared logs, compression, replication, …
• Current/Future Work
30
Basic Data Model
• Semi-structured: a multi-dimensional sparse map
  (row, column, timestamp) → cell contents

(figure: row "com.cnn.www" with columns "contents:" and "inlinks:", holding cell values "…" at timestamps t3, t11, t17)

• Good match for most of Google's applications
31
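Purely as an illustration of the data model (not the real API or storage format), a toy in-memory version of the (row, column, timestamp) → value map in Python:

from collections import defaultdict
import time

class ToyBigtable:
    """Toy (row, column, timestamp) -> value map; illustrates the data model only."""
    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def set(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(reverse=True)          # keep the newest version first

    def get(self, row, column):
        """Return the most recent version, or None."""
        cells = self.rows[row].get(column)
        return cells[0][1] if cells else None

t = ToyBigtable()
t.set("com.cnn.www", "contents:", "<html>…</html>")
t.set("com.cnn.www", "inlinks:cnn.com/sports", "CNN Sports")
print(t.get("com.cnn.www", "contents:"))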
Rows
• Everything is a string
• Every row has a single key
  – an arbitrary string (how about numerical keys?)
  – access to data in a row is atomic
  – row creation is implicit upon storing data
• Rows are ordered lexicographically by key
  – rows that are close together lexicographically usually sit on one or a small number of machines
  – question: key distribution? hot rows?
• No such thing as an empty row (see the Columns slide)
32
Columns • Arbitrary number of columns – organized into column families, then locality groups – data in the same locality group are stored together (more later)
• Don't predefine columns (compare: schema) – “multi-map”, not “table”. column names are arbitrary strings. – sparse: a row contains only the columns that have data
33
Column Family
• Must be created before any column in the family can be written
  – Has a type: string, protocol buffer, …
  – Basic unit of access control and usage accounting: different applications need access to different column families
    • be careful with sensitive data
• A column key is named as family:qualifier
  – family: printable; qualifier: any string
  – usually not a lot of column families in a BigTable cluster (hundreds)
    • one “anchor:” column family for all anchors of incoming links
  – but unlimited columns within each column family
    • columns: “anchor:cnn.com”, “anchor:news.yahoo.com”, “anchor:someone.blogger.com”, …
34
BigTable operations
• Reading
  – selection by a combination of row, column, and timestamp ranges
• Writing
  – write to individual cell versions (row, column, timestamp)
  – delete at different granularities, up to a whole row
  – applied atomically within a row
35
Read API • Scanner: read arbitrary cells in a bigtable – Each row read is atomic – Can restrict returned rows to a particular range – Can ask for just data from 1 row (Lookup), all rows, etc. – Can ask for all columns, just certain column families, specific columns, timestamp ranges (ScanStream)
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
36
Write API • Metadata operations – Create/delete tables, column families, change metadata
• Row mutation – Apply: single row only, atomic, sequence of sets and deletes – APIs exist for bulk updates: updates are grouped and sent with one RPC call.
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
37
Tablets • Large tables broken into tablets at row boundaries – Tablet holds contiguous range of rows • Clients can often choose row keys to achieve locality
– Aim for ~100MB to 200MB of data per tablet
• Serving machine responsible for ~100 tablets – Fast recovery: • 100 machines each pick up 1 tablet from failed machine
– Fine-grained load balancing: • Migrate tablets away from overloaded machine • Master makes load-balancing decisions
38
Tablets • Dynamic fragmentation of rows – Unit of load balancing – Distributed over tablet servers – Tablets split and merge • automatically based on size and load • or manually
– Clients can choose row keys to achieve locality
39
Tablets & Splitting “language:”
“contents:”
EN
“…”
“aaa.com” “cnn.com” “cnn.com/sports.html”
Tablets … “website.com”
… “yahoo.com/kids.html” …
“yahoo.com/kids.html\0”
…
“zuppa.com/menu.html”
40
40
Locality Groups
• Dynamic fragmentation of column families
  – segregates data within a tablet
  – different locality groups → different SSTable files on GFS
  – scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
• Provides control over storage layout
  – memory mapping of locality groups
  – choice of compression algorithms
  – client-controlled block size
41
Locality Groups

(figure: row “www.cnn.com” with the “contents:” column family in one locality group and the “language:” / “pagerank:” columns (EN, 0.65) in another)
42
Timestamps
• Used to store different versions of data in a cell
  – New writes default to the current time, but timestamps for writes can also be set explicitly by clients
• Lookup options:
  – “Return most recent K values”
  – “Return all values in timestamp range (or all values)”
• Column families can be marked with attributes (see the sketch below):
  – “Only retain most recent K values in a cell”
  – “Keep values until they are older than K seconds”
43
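A small illustrative sketch of how the two retention attributes act on one cell's versions (plain Python; the function and parameter names are made up, not the real implementation):

import time

def gc_cell(versions, keep_last_k=None, max_age_seconds=None, now=None):
    """versions: list of (timestamp, value) in any order. Returns the versions to retain."""
    now = now if now is not None else time.time()
    kept = sorted(versions, reverse=True)            # newest first
    if keep_last_k is not None:
        kept = kept[:keep_last_k]                    # "only retain most recent K values"
    if max_age_seconds is not None:
        kept = [(ts, v) for ts, v in kept if now - ts <= max_age_seconds]   # "keep until older than K seconds"
    return kept

versions = [(100.0, "v1"), (200.0, "v2"), (300.0, "v3")]
print(gc_cell(versions, keep_last_k=2, now=350.0))        # [(300.0, 'v3'), (200.0, 'v2')]
print(gc_cell(versions, max_age_seconds=100, now=350.0))  # [(300.0, 'v3')]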
Where is my Tablet?
• Tablets move around from one tablet server to another (why?)
• Question: given a row, how does a client find the right tablet server?
  – A tablet server location is an ip:port pair
  – Need to find the tablet whose row range covers the target row
  – One approach: ask the BigTable master
    • a central server would almost certainly become a bottleneck in a large system
  – Instead: store tablet location info in special tablets, similar to a B+ tree
44
Metadata Tablets
• Approach: a 3-level B+-tree-like scheme for tablets
  – 1st level: a Chubby file points to MD0 (the root METADATA tablet)
  – 2nd level: MD0 points to the appropriate METADATA tablet
  – 3rd level: METADATA tablets point to data tablets
• METADATA tablets can be split when necessary
• MD0 never splits, so the number of levels is fixed

(figure: MD0 at the root, fanning out to METADATA tablets and then to data tablets)
45
Finding Tablet Location
• Clients cache tablet locations
• If the cache is empty, locating a tablet takes three network round-trips; if the cache is stale, up to six (see the sketch below)
• Tablet locations are stored in memory, so no GFS accesses are required
46
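A toy sketch of the client-side lookup under these rules; the data structures, names, and per-row cache are illustrative simplifications (the real client caches row ranges, and a stale entry costs extra round trips when the contacted server no longer holds the tablet):

import bisect

# Toy 3-level location hierarchy (illustrative data, not the real layout):
# Chubby holds the root tablet; METADATA tablets map the last row of each
# user tablet to the tablet server that serves it.
CHUBBY_ROOT = {"last_rows": ["k", "z"], "metadata_tablets": ["meta0", "meta1"]}
METADATA = {
    "meta0": {"last_rows": ["com.cnn.www", "k"], "servers": ["ts1:9000", "ts2:9000"]},
    "meta1": {"last_rows": ["org.wikipedia", "z"], "servers": ["ts3:9000", "ts4:9000"]},
}

cache = {}   # row -> tablet server address, the client-side location cache

def find_tablet_server(row):
    """Warm cache: no lookups. Cold cache: Chubby -> root -> METADATA -> server (3 steps)."""
    if row in cache:
        return cache[row]
    i = bisect.bisect_left(CHUBBY_ROOT["last_rows"], row)    # step 1: root tablet via Chubby
    meta = METADATA[CHUBBY_ROOT["metadata_tablets"][i]]      # step 2: the right METADATA tablet
    j = bisect.bisect_left(meta["last_rows"], row)           # step 3: the user tablet's server
    server = meta["servers"][j]
    cache[row] = server
    return server

print(find_tablet_server("com.cnn.www"))   # cold: walks all three levels -> ts1:9000
print(find_tablet_server("com.cnn.www"))   # warm: served from the cache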
Tablet Storage
• Commit log on GFS
  – redo log
  – recent writes are buffered in the tablet server's memory (the memtable)
• A set of locality groups
  – one locality group = a set of SSTable files on GFS
  – key = (row, column, timestamp), value = cell content
47
SSTable
• SSTable: string-to-string table
  – persistent, ordered, immutable map from keys to values
    • keys and values are arbitrary byte strings
  – contains a sequence of blocks (typical size = 64 KB), with a block index at the end of the SSTable that is loaded at open time (see the sketch below)
  – one disk seek per block read
  – operations: lookup(key), iterate(key_range)
  – an SSTable can be mapped into memory
48
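A minimal sketch of SSTable-style lookup in Python: a small index of first keys is kept in memory, and each lookup touches exactly one block. The class and its in-memory list of blocks stand in for the on-disk format, which is not described in detail here.

import bisect

class SSTableReader:
    """Toy SSTable reader: sorted block index in memory, one block read per lookup."""

    def __init__(self, blocks):
        # blocks: list of sorted lists of (key, value) pairs, as written at build time
        self.blocks = blocks
        self.index = [b[0][0] for b in blocks]   # first key of each block, loaded "at open time"

    def lookup(self, key):
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        block = self.blocks[i]                   # in the real system: one disk seek + block read
        j = bisect.bisect_left(block, (key,))
        if j < len(block) and block[j][0] == key:
            return block[j][1]
        return None

    def iterate(self, lo, hi):
        """Yield (key, value) for lo <= key <= hi, scanning blocks in order."""
        i = max(bisect.bisect_right(self.index, lo) - 1, 0)
        for block in self.blocks[i:]:
            for k, v in block:
                if k > hi:
                    return
                if k >= lo:
                    yield k, v

t = SSTableReader([[("a", "1"), ("b", "2")], [("c", "3"), ("d", "4")]])
print(t.lookup("c"))              # "3"
print(list(t.iterate("b", "c")))  # [('b', '2'), ('c', '3')]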
Tablet Serving

(figure: a tablet — writes go to an append-only commit log on GFS and into an in-memory memtable (random access); minor compactions flush the memtable to SSTables on GFS; reads merge the memtable with the SSTables)

SSTable: immutable on-disk ordered map from string to string; string keys are (row, column, timestamp) triples
49
Compactions • Tablet state represented as set of immutable compacted SSTable files, plus tail of log (buffered in memory)
• Minor compaction: – When in-memory state fills up, pick tablet with most data and write contents to SSTables stored in GFS
• Major compaction: – Periodically compact all SSTables for tablet into new base SSTable on GFS
• Storage reclaimed from deletions at this point (garbage collection)
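A toy sketch of the two kinds of compaction (the class and names are illustrative, not the real implementation): a minor compaction freezes the memtable into a new immutable sorted run, and a major compaction merges all runs into one base run, dropping deletion markers so the space from deletes is reclaimed.

import heapq

DELETED = object()   # stand-in for a deletion marker

class ToyTablet:
    def __init__(self):
        self.memtable = {}     # key -> value (or DELETED)
        self.sstables = []     # immutable sorted runs [(key, value), ...], newest first

    def write(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        self.memtable[key] = DELETED          # deletes are just another write

    def minor_compaction(self):
        """Flush the memtable to a new immutable sorted run."""
        if self.memtable:
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def major_compaction(self):
        """Merge all runs into one base run; space from deletions is reclaimed here."""
        merged, seen = [], set()
        runs = heapq.merge(*self.sstables, key=lambda kv: kv[0])   # stable merge, newest runs first
        for key, value in runs:
            if key in seen:
                continue                       # older version of a key we already emitted
            seen.add(key)
            if value is not DELETED:
                merged.append((key, value))
        self.sstables = [merged]

# t = ToyTablet()
# t.write("row1", "v1"); t.minor_compaction()
# t.delete("row1");      t.minor_compaction()
# t.major_compaction()   # "row1" and its deletion marker are both gone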
50
System Structure

(figure: a Bigtable cell — clients use the Bigtable client library; Open() and metadata ops go to the Bigtable master, which performs metadata ops and load balancing; reads and writes go directly to the Bigtable tablet servers, which serve the data; the cluster scheduling system handles failover and monitoring; GFS holds tablet data and logs; the lock service holds metadata and handles master election)
51
File Cleaning • BigTable generates a lot of files – dominated by SSTables
• SSTables are immutable: they can be created, read, or deleted, but not overwritten.
• Obsolete SSTables are deleted in a mark-and-sweep garbage collection
– run by the BigTable master
52
Chubby Interactions • Master election: single Chubby lock • Tablet server membership – a tablet server creates and acquires an exclusive lock on a
uniquely-named file in the servers directory of Chubby when it starts, and stops serving when the lock is lost.
– master monitors the directory to find tablet servers
• Chubby stores access control list • Metadata – Schema information (column family metadata) – Tablet advertisement and metadata – Replication metadata
53
Shared Logs • Designed for 1M tablets, 1000s of tablet servers – 1M logs being simultaneously written performs badly
• Solution: shared logs – Write log file per tablet server instead of per tablet • Updates for many tablets co-mingled in same file
– Start new log chunks every so often (64 MB)
• Problem: during recovery, server needs to read log data to apply mutations for a tablet
– Lots of wasted I/O if lots of machines need to read data for many tablets from same log chunk
54
Shared Log Recovery
• Servers inform the master of the log chunks they need to read
• Master aggregates and orchestrates sorting of the needed chunks
  – assigns log chunks to different tablet servers for sorting
  – servers sort chunks by tablet and write the sorted data to local disk
• Other tablet servers ask the master which servers have the sorted chunks they need
• Tablet servers issue direct RPCs to peer tablet servers to read the sorted data for their tablets
55
BigTable Compression • Keys: – Sorted strings of (Row, Column, Timestamp): prefix compression
• Values: – Group together values by “type” (e.g. column family name) – BMDiff across all values in one family • BMDiff output for values 1..N is dictionary for value N+1
• Zippy as final pass over whole block – Catches more localized repetitions – Also catches cross-column-family repetition, compresses keys
56
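Since keys in a block are sorted, adjacent keys share long prefixes; a minimal illustration of prefix (front) compression in Python — an illustrative encoding, not the actual on-disk format:

def prefix_compress(sorted_keys):
    """Encode each key as (shared_prefix_len, suffix) relative to the previous key."""
    out, prev = [], ""
    for key in sorted_keys:
        n = 0
        while n < min(len(prev), len(key)) and prev[n] == key[n]:
            n += 1
        out.append((n, key[n:]))
        prev = key
    return out

def prefix_decompress(encoded):
    keys, prev = [], ""
    for n, suffix in encoded:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys

keys = ["com.cnn.www/index.html:anchor:a.com",
        "com.cnn.www/index.html:anchor:b.com",
        "com.cnn.www/sports.html:anchor:a.com"]
assert prefix_decompress(prefix_compress(keys)) == keys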
Compression • Many opportunities for compression – Similar values in the same row/column at different timestamps
– Similar values in different columns – Similar values across adjacent rows
• Within each SSTable for a locality group, encode compressed blocks
– Keep blocks small for random access (~64KB compressed data)
– Exploit fact that many values very similar – Needs to be low CPU cost for encoding/decoding
• Two building blocks: BMDiff, Zippy 57
BMDiff
• Bentley & McIlroy, DCC '99: “Data Compression Using Long Common Strings”
  – Input: dictionary + source
  – Output: sequence of
    • COPY: length bytes from offset
    • LITERAL: literal bytes
• Store a hash at every 32-byte aligned boundary in the dictionary and in the source processed so far
• For every new source byte
  – Compute an incremental hash of the last 32 bytes, look it up in the hash table
  – On a hit, expand the match forwards and backwards, emit a COPY
• Encode: ~100 MB/s, Decode: ~1000 MB/s
58
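A simplified, illustrative sketch of the Bentley-McIlroy idea in Python: hash 32-byte aligned blocks of the text seen so far, and at each position try to extend a match against an earlier occurrence, emitting COPY and LITERAL ops. The real encoder uses an incremental (rolling) hash, also expands matches backwards, and can match against a separate dictionary; none of that is modeled here.

B = 32   # hashes are stored only at B-byte aligned boundaries

def bm_encode(data: bytes):
    """Return a list of ('LITERAL', bytes) and ('COPY', offset, length) ops covering data."""
    table = {}        # hash of a B-byte block -> offset where that block starts
    ops, lit_start, i = [], 0, 0
    while i + B <= len(data):
        block = data[i:i + B]
        j = table.get(hash(block))
        if j is not None and data[j:j + B] == block:       # verify: hashes can collide
            length = B
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1                                 # expand the match forwards
            if lit_start < i:
                ops.append(('LITERAL', data[lit_start:i]))
            ops.append(('COPY', j, length))
            i += length
            lit_start = i
        else:
            if i % B == 0:
                table[hash(block)] = i                      # remember only aligned blocks
            i += 1
    if lit_start < len(data):
        ops.append(('LITERAL', data[lit_start:]))
    return ops

def bm_decode(ops):
    out = bytearray()
    for op in ops:
        if op[0] == 'LITERAL':
            out += op[1]
        else:
            _, offset, length = op
            for k in range(length):                         # byte-by-byte: copies may overlap
                out.append(out[offset + k])
    return bytes(out)

data = b"header " + b"A" * 100 + b" body " + b"A" * 100 + b" footer"
assert bm_decode(bm_encode(data)) == data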
Zippy • LZW-like: Store hash of last four bytes in 16K entry table • For every input byte: – Compute hash of last four bytes – Lookup in table – Emit COPY or LITERAL
• Differences from BMDiff: – Much smaller compression window (local repetitions) – Hash table is not associative – Careful encoding of COPY/LITERAL tags and lengths
• Sloppy but fast:

  Algorithm   % remaining   Encoding   Decoding
  Gzip        13.4%         21 MB/s    118 MB/s
  LZO         20.5%         135 MB/s   410 MB/s
  Zippy       22.2%         172 MB/s   409 MB/s
59
Compression Effectiveness • Experiment: store contents for 2.1B page crawl in BigTable instance
– Key: URL rearranged as “com.cnn.www/index.html:http” – Groups pages from same site together • Good for compression • Good for clients: efficient to scan over all pages on a web site
• One compression strategy: gzip each page: ~28% bytes remaining
• BigTable: BMDiff + Zippy:

  Type                Count (B)   Space     Compressed   % remaining
  Web page contents   2.1         45.1 TB   4.2 TB       9.2%
  Links               1.8         11.2 TB   1.6 TB       13.9%
  Anchors             126.3       22.8 TB   2.9 TB       12.7%
60
Bloom Filters
• A read may need to consult many SSTables
• Idea: use a membership test to avoid disk reads for non-existing data
  – membership test: does (row, column) exist in the tablet?
• Algorithm: Bloom filter
  – no false negatives; false positives just cause an extra read to find out
  – update the bit vector when new data is inserted (what about deletes?)

(figure: an m-bit vector; each element a1, a2, …, aN in the set sets the bits chosen by k independent hash functions h1, h2, …, hk; a query for b checks the same bit positions)
61
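A minimal Bloom filter sketch in Python; the bit-vector size, number of hash functions, and hashing scheme are illustrative choices, not BigTable's:

import hashlib

class BloomFilter:
    """k hash functions over an m-bit vector; no false negatives, some false positives."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str):
        """False -> definitely absent (skip the SSTable); True -> maybe present (must read)."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www:anchor:cnn.com/sports")
assert bf.might_contain("com.cnn.www:anchor:cnn.com/sports")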
Replication • Often want updates replicated to many BigTable cells in different datacenters
– Low-latency access from anywhere in world – Disaster tolerance
• Optimistic replication scheme – Writes in any of the on-line replicas eventually propagated to other replica clusters
• 99.9% of writes replicated immediately (speed of light)
– Currently a thin layer above BigTable client library • Working to move support inside BigTable system
• Replication deployed on My Search History
62
Performance
63
Application: Personalized Search • Personalized search (http://www.google.com/psearch) – an opt-in service – records queries and clicks of a user in Google (web search, image search, news, etc)
– user can edit the search history – search history affects search results
• Implementation in BigTable – one user per row, row name = user ID – one column family per action – analyzed with MapReduce to produce user profile – other products add column families later, quota system
64
Sample Usages
65
In Development/Future Plans
• More expressive data manipulation/access
  – Allow sending small scripts to perform read/modify/write transactions so that they execute on the server (a kind of “stored procedure”)
• Multi-row (i.e. distributed) transaction support • General performance work for very large cells • BigTable as a service – Interesting issues of resource fairness, performance isolation, prioritization, etc. across different clients
– App Engine's DataStore
66
Conclusions • Data model applicable to broad range of clients – Actively deployed in many of Google’s services
• System provides high performance storage system on a large scale – Self-managing – Thousands of servers – Millions of ops/second – Multiple GB/s reading/writing
67