Lecture 5 – Distributed Database and BigTable

922EU3870 – Cloud Computing and Mobile Platforms, Autumn 2009 (2009/10/12)
Ping Yeh (葉平), Google, Inc.
http://labs.google.com/papers/bigtable.html

Numbers real-world engineers should know

    L1 cache reference                                        0.5 ns
    Branch mispredict                                           5 ns
    L2 cache reference                                          7 ns
    Mutex lock/unlock                                         100 ns
    Main memory reference                                     100 ns
    Compress 1 KB with Zippy                               10,000 ns
    Send 2 KB through 1 Gbps network                       20,000 ns
    Read 1 MB sequentially from memory                    250,000 ns
    Round trip within the same data center                500,000 ns
    Disk seek                                          10,000,000 ns
    Read 1 MB sequentially from network                10,000,000 ns
    Read 1 MB sequentially from disk                   30,000,000 ns
    Round trip between California and Netherlands     150,000,000 ns
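These constants are handy for back-of-envelope estimates. A minimal sketch of such arithmetic (Python; the helper function and the 1 GB example are purely illustrative, only the nanosecond constants come from the table above):

    # Back-of-envelope latency arithmetic using the numbers above (all in ns).
    NS = {
        "read_1mb_memory": 250_000,
        "read_1mb_network": 10_000_000,
        "read_1mb_disk": 30_000_000,
        "disk_seek": 10_000_000,
    }

    def time_to_read(total_mb, per_mb_ns, seeks=0):
        """Rough time to read total_mb sequentially, plus optional disk seeks."""
        total_ns = total_mb * per_mb_ns + seeks * NS["disk_seek"]
        return total_ns / 1e9  # seconds

    gb = 1024  # MB
    print("1 GB from memory : %.2f s" % time_to_read(gb, NS["read_1mb_memory"]))
    print("1 GB from network: %.2f s" % time_to_read(gb, NS["read_1mb_network"]))
    print("1 GB from disk   : %.2f s" % time_to_read(gb, NS["read_1mb_disk"], seeks=1))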

The Joys of Real Hardware

Typical first year for a new cluster: ~0.5 overheating events (power down most machines in <5 mins, ~1-2 days to recover), …

Insertion in a B+ tree

    Insert(key, data, node):
      AddData(key, data, node)
      if Size(node) > d:
        new_node = Split(node)                        # node -> new_node + node
        Insert(new_node.lastkey(), new_node, parent)

May produce a new root node (not shown).


Deletion in a B+ tree

    DeleteInTree(key, bplus_tree):
      node = Find(key, bplus_tree)     # find the node to delete from
      if not node:
        return False
      Delete(key, node)                # delete
      return True

    Delete(key, node):
      RemoveData(key, node)
      if Size(node) < d/2:
        RedistributeOrMerge(node, parent)


Features of a B+ Tree

• Good fit for sorted data stored in block storage devices
  – Fast search: O(log_d N) with large fanout d
  – Fast range scan with links from one leaf node to the next: O(log_d N + k), where k = number of elements returned
  – Insertion may cause splitting of nodes
  – Deletion may cause merging of nodes
• Many optimizations exist (with pros and cons)
  – data structure of a node (array, binary tree, linked list, etc.)
  – compression of keys in a node
  – lazy deletion
  – RAM-resident trees
  – etc.

Compressing Data in a B+ Tree

• How to use less space in nodes?
• Compress all keys in a node together
  – most space-efficient
  – but reading even 10 bytes requires uncompressing the whole node
• Split the keys into blocks and compress each block separately
  – less space-efficient
  – but faster for small reads
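A rough sketch of this tradeoff (Python, with zlib standing in for the actual compressor; the node contents and the 4 KB block size are made up for illustration): compressing the whole node gives the better ratio, while per-block compression lets a reader decompress only the block holding the key it wants.

    import zlib

    # Toy "node": 1,000 sorted keys serialized as newline-separated strings.
    keys = [f"user:{i:08d}" for i in range(1000)]
    node = "\n".join(keys).encode()

    # Option 1: compress all keys together (best ratio, whole-node decompress on read).
    whole = zlib.compress(node)

    # Option 2: split into fixed-size blocks and compress each block independently
    # (slightly worse ratio, but a point read touches only one block).
    BLOCK = 4096
    blocks = [zlib.compress(node[i:i + BLOCK]) for i in range(0, len(node), BLOCK)]

    print("uncompressed:", len(node))
    print("whole-node  :", len(whole))
    print("per-block   :", sum(len(b) for b in blocks), f"({len(blocks)} blocks)")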

BigTable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. OSDI 2006.
http://labs.google.com/papers/bigtable.html

Motivation

• Lots of (semi-)structured data at Google
  – Web: contents, crawl metadata, links/anchors/PageRank, …
  – Per-user data: user preference settings, recent queries, search results, …
  – Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
• Scale is large
  – billions of URLs, many versions per page (~20K/version)
  – hundreds of millions of users, thousands of queries/sec
  – 100+ TB of satellite image data
• Need the data both for offline data processing and online serving

Why not use a commercial DB?

• Scale is too large for most commercial databases
• Even if it weren't, cost would be very high
  – Building internally means the system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly
  – Much harder to do when running on top of a database layer
• Also fun and challenging to build large-scale systems :)

Goals

• Wide applicability
  – used by many Google products and projects
  – often want to examine data changes over time, e.g., the contents of a web page over multiple crawls
  – supports both throughput-oriented batch-processing jobs and latency-sensitive serving of data to end users
• Scalability
  – a handful to thousands of servers, hundreds of TB to PB
• High performance
  – very high read/write rates (millions of ops per second)
  – efficient scans over all or interesting subsets of the data
• High availability
  – want access to the most current data at any time

BigTable

• Distributed multi-level map
  – with an interesting data model
• Fault-tolerant, persistent
• Scalable
  – thousands of servers
  – terabytes of in-memory data
  – petabytes of disk-based data
  – millions of reads/writes per second, efficient scans
• Self-managing
  – servers can be added/removed dynamically
  – servers adjust to load imbalance

Status

• Design/initial implementation started at the beginning of 2004
• In production use or active development for many projects:
  – Google Analytics
  – Personal Search History
  – crawling/indexing pipeline
  – Google Maps/Google Earth
  – Blogger
  – …
• ~100 BigTable cells; the largest cell manages ~200 TB of data spread over several thousand machines (circa 2007)

Building Blocks of BigTable

• Distributed file system (GFS): stores persistent state
• Scheduler (not published): schedules jobs onto machines
  – BigTable jobs run alongside all other kinds of jobs
• Lock service (Chubby): distributed lock manager
  – can also reliably hold small files with high availability
  – used for master election and location bootstrapping
• Data processing (MapReduce)
  – simplified large-scale data processing
  – often used to read/write BigTable data (not a building block of BigTable, but uses BigTable heavily)

Google File System (GFS)

• Master manages metadata
• Data transfers happen directly between clients and chunkservers
• Files are broken into chunks (typically 64 MB)
• Chunks are triplicated across three machines for safety
• See the SOSP '03 paper at http://labs.google.com/papers/gfs.html


Chubby

• Distributed lock service with a file system for small files
• Usually 5 servers running the Paxos algorithm
  – maintains consistency
  – fault-tolerant
  – master election
  – event notification mechanism
• Also used for name resolution in the cluster

Key Jobs in a BigTable Cluster

• Master
  – schedules tablet assignments
  – quota management
  – health checks of tablet servers
  – garbage-collection management
• Tablet servers
  – serve data for reads and writes (one tablet is assigned to exactly one tablet server)
  – compaction
  – replication
  – etc.
• Monitoring

Typical Cluster

[Diagram] A typical cluster runs a cluster scheduling master, a lock service, and a GFS master alongside machines 1…N. Each machine runs a GFS chunkserver, a scheduler slave, and a BigTable tablet server (one machine also runs the BigTable master), plus user applications, all on top of Linux.

BigTable Overview

• Data model
• Implementation structure
  – tablets, compactions, locality groups, …
• API
• Details
  – shared logs, compression, replication, …
• Current/future work

Basic Data Model

• Semi-structured: a multi-dimensional sparse map
  (row, column, timestamp) → cell contents

[Diagram] Example: row "com.cnn.www" has columns "contents:" and "inlinks:", with cell versions at timestamps t3, t11, t17.

• A good match for most of Google's applications
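A minimal sketch of this data model (plain Python, purely illustrative; BigTable's actual API appears later in these slides): a table is a sparse map keyed by (row, column, timestamp), with reads returning the most recent version by default.

    from collections import defaultdict

    class ToyTable:
        """Toy in-memory model of the (row, column, timestamp) -> value map."""
        def __init__(self):
            # row -> column -> {timestamp: value}; only cells that exist are stored (sparse).
            self.rows = defaultdict(lambda: defaultdict(dict))

        def put(self, row, column, value, timestamp):
            self.rows[row][column][timestamp] = value

        def get(self, row, column, timestamp=None):
            versions = self.rows[row][column]
            if not versions:
                return None
            ts = timestamp if timestamp is not None else max(versions)
            return versions.get(ts)

    t = ToyTable()
    t.put("com.cnn.www", "contents:", "<html>…v1…</html>", timestamp=3)
    t.put("com.cnn.www", "contents:", "<html>…v2…</html>", timestamp=11)
    print(t.get("com.cnn.www", "contents:"))      # most recent version (t=11)
    print(t.get("com.cnn.www", "contents:", 3))   # explicit older version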

Rows

• Everything is a string
• Every row has a single key
  – an arbitrary string (what about numerical keys?)
  – access to data in a row is atomic
  – row creation is implicit upon storing data
• Rows are ordered lexicographically by key
  – rows close together lexicographically are usually on one or a small number of machines
  – question: key distribution? hot rows?
• No such thing as an empty row (see the Columns slide)

Columns

• Arbitrary number of columns
  – organized into column families, then locality groups
  – data in the same locality group are stored together (more later)
• Columns are not predefined (contrast with a schema)
  – a "multi-map", not a "table"; column names are arbitrary strings
  – sparse: a row contains only the columns that have data

Column Family

• Must be created before any column in the family can be written
  – has a type: string, protocol buffer, …
  – basic unit of access control and usage accounting: different applications need access to different column families
    • be careful with sensitive data
• A column key is named as family:qualifier
  – family: printable string; qualifier: any string
  – usually not many column families in a BigTable cluster (hundreds)
    • e.g., one "anchor:" column family for all anchors of incoming links
  – but unlimited columns within each column family
    • columns: "anchor:cnn.com", "anchor:news.yahoo.com", "anchor:someone.blogger.com", …

BigTable operations

• Reading
  – selection by a combination of row, column, and timestamp ranges
• Writing
  – write to individual cell versions (row, column, timestamp)
  – delete at different granularities, up to a whole row
  – writes are applied atomically within a row

Read API

• Scanner: read arbitrary cells in a bigtable
  – Each row read is atomic
  – Can restrict returned rows to a particular range
  – Can ask for just data from 1 row (Lookup), all rows, etc.
  – Can ask for all columns, just certain column families, specific columns, timestamp ranges (ScanStream)

    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }

Write API

• Metadata operations
  – Create/delete tables and column families, change metadata
• Row mutation
  – Apply: single row only, atomic, a sequence of sets and deletes
  – APIs exist for bulk updates: updates are grouped and sent with one RPC call

    Table *T = OpenOrDie("/bigtable/web/webtable");
    // Write a new anchor and delete an old anchor.
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);

Tablets

• Large tables are broken into tablets at row boundaries
  – a tablet holds a contiguous range of rows
    • clients can often choose row keys to achieve locality
  – aim for ~100 MB to 200 MB of data per tablet
• Each serving machine is responsible for ~100 tablets
  – fast recovery: 100 machines each pick up 1 tablet from a failed machine
  – fine-grained load balancing:
    • migrate tablets away from an overloaded machine
    • the master makes load-balancing decisions

Tablets

• Dynamic fragmentation of rows
  – unit of load balancing
  – distributed over tablet servers
  – tablets split and merge
    • automatically based on size and load
    • or manually
  – clients can choose row keys to achieve locality
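A minimal sketch of the size-based split decision (Python; the ~200 MB threshold comes from the sizing guideline above, while the data layout and midpoint choice are illustrative): when a tablet grows past the threshold, it is split at a row boundary near the middle so each half keeps a contiguous, sorted row range.

    # Toy tablet: a sorted list of (row_key, approx_size_bytes) pairs.
    TARGET_TABLET_BYTES = 200 * 1024 * 1024  # ~200 MB

    def maybe_split(tablet):
        """Split the tablet at a row boundary near the size midpoint if it is too big."""
        total = sum(size for _, size in tablet)
        if total <= TARGET_TABLET_BYTES:
            return [tablet]                    # no split needed
        running, split_at = 0, len(tablet) - 1
        for i, (_, size) in enumerate(tablet):
            running += size
            if running >= total // 2:
                split_at = i + 1               # keep row ranges contiguous
                break
        return [tablet[:split_at], tablet[split_at:]]

    rows = [(f"com.example/{i:06d}", 4 * 1024 * 1024) for i in range(80)]  # ~320 MB
    left, right = maybe_split(rows)
    print(left[-1][0], "->", right[0][0])      # the row boundary between the two halves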

Tablets & Splitting

[Diagram] A table with columns "language:" (e.g., EN) and "contents:" and rows keyed by URL ("aaa.com", "cnn.com", "cnn.com/sports.html", …, "website.com", …, "yahoo.com/kids.html", "yahoo.com/kids.html\0", …, "zuppa.com/menu.html") is partitioned into tablets at row boundaries, e.g., one tablet ending at "yahoo.com/kids.html" and the next starting at "yahoo.com/kids.html\0".

Locality Groups

• Dynamic fragmentation of column families
  – segregates data within a tablet
  – different locality groups → different SSTable files on GFS
  – scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
• Provides control over storage layout
  – memory mapping of locality groups
  – choice of compression algorithm
  – client-controlled block size

Locality Groups (example)

[Diagram] For row "www.cnn.com", the "contents:" column family sits in one locality group, while "language:" (EN) and "pagerank:" (0.65) sit together in another, so each group is stored in its own SSTables.

Timestamps

Used to store different versions of data in a cell.

• New writes default to the current time, but timestamps for writes can also be set explicitly by clients

Lookup options:
• "Return most recent K values"
• "Return all values in timestamp range (or all values)"

Column families can be marked with attributes:
• "Only retain most recent K values in a cell"
• "Keep values until they are older than K seconds"
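A minimal sketch of how such per-family retention attributes could be applied to a cell's version map (plain Python; the parameter names and in-memory layout are illustrative, not BigTable's actual implementation):

    import time

    def apply_retention(versions, max_versions=None, max_age_s=None, now=None):
        """versions: {timestamp_seconds: value}. Returns the versions to keep."""
        now = now if now is not None else time.time()
        kept = sorted(versions.items(), reverse=True)      # newest first
        if max_age_s is not None:
            kept = [(ts, v) for ts, v in kept if now - ts <= max_age_s]
        if max_versions is not None:
            kept = kept[:max_versions]                     # only the most recent K
        return dict(kept)

    cell = {100: "v1", 200: "v2", 300: "v3", 400: "v4"}
    print(apply_retention(cell, max_versions=2, now=500))  # keep the newest 2
    print(apply_retention(cell, max_age_s=250, now=500))   # drop versions older than 250 s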

Where is my Tablet?

• Tablets move around from one tablet server to another (why?)
• Question: given a row, how does a client find the right tablet server?
  – a tablet server location is ip:port
  – need to find the tablet whose row range covers the target row
  – one approach: ask the BigTable master
    • a central server would almost certainly be a bottleneck in a large system
  – instead: store tablet location info in special tablets, similar to a B+ tree

Metadata Tablets

• Approach: a 3-level, B+-tree-like scheme for locating tablets
  – 1st level: Chubby, points to MD0 (the root METADATA tablet)
  – 2nd level: MD0 data points to the appropriate METADATA tablet
  – 3rd level: METADATA tablets point to data tablets
• METADATA tablets can be split when necessary
• MD0 never splits, so the number of levels is fixed

Finding Tablet Location

• The client caches tablet locations
• If the location is not cached, the client makes up to three network round trips when its cache is empty, and up to six round trips when the cache is stale
• Tablet locations are served from memory, so no GFS accesses are required
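A minimal sketch of the client-side lookup (plain Python; the read_* helpers stand in for the Chubby and METADATA reads and are purely illustrative): the client first tries its cache and only walks the Chubby → MD0 → METADATA chain on a miss, which is where the three round trips come from.

    def find_tablet(row_key, cache, read_root_location, read_location):
        """Return (server_location, row_range) of the tablet covering row_key.

        cache: {(lo, hi): location} of previously seen tablet locations.
        read_root_location():        round trip 1 - read MD0's location from Chubby.
        read_location(tablet, key):  round trips 2-3 - look key up in a METADATA tablet,
                                     returning (location, (lo, hi)) one level down.
        """
        for (lo, hi), location in cache.items():
            if lo <= row_key < hi:
                return location, (lo, hi)                  # cache hit: no round trips

        md0 = read_root_location()                         # 1: Chubby
        meta, _ = read_location(md0, row_key)              # 2: root METADATA tablet (MD0)
        location, row_range = read_location(meta, row_key) # 3: METADATA tablet
        cache[row_range] = location                        # remember for next time
        return location, row_range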

Tablet Storage

• Commit log on GFS
  – a redo log, buffered in the tablet server's memory
• A set of locality groups
  – one locality group = a set of SSTable files on GFS
  – key = <row, column, timestamp>, value = cell contents

SSTable

• SSTable: string-to-string table
  – a persistent, ordered, immutable map from keys to values
    • keys and values are arbitrary byte strings
  – contains a sequence of blocks (typical size = 64 KB), with a block index at the end of the SSTable that is loaded at open time
  – one disk seek per block read
  – operations: lookup(key), iterate(key_range)
  – an SSTable can be mapped into memory
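A minimal in-memory sketch of the SSTable idea (Python; the block size and encoding are illustrative, not the real on-disk format): keys are kept sorted, grouped into fixed-size blocks, and a small index of each block's last key lets lookup() touch only one block.

    import bisect

    class ToySSTable:
        """Immutable sorted map split into blocks with a block index (in memory)."""
        BLOCK_SIZE = 4  # entries per block; the real thing uses ~64 KB byte blocks

        def __init__(self, items):
            entries = sorted(items)                 # [(key, value), ...] sorted by key
            self.blocks = [entries[i:i + self.BLOCK_SIZE]
                           for i in range(0, len(entries), self.BLOCK_SIZE)]
            self.index = [block[-1][0] for block in self.blocks]  # last key per block

        def lookup(self, key):
            b = bisect.bisect_left(self.index, key) # which block could hold the key?
            if b == len(self.blocks):
                return None
            for k, v in self.blocks[b]:             # one "block read"
                if k == key:
                    return v
            return None

        def iterate(self, lo, hi):
            for block in self.blocks:
                for k, v in block:
                    if lo <= k < hi:
                        yield k, v

    sst = ToySSTable({f"row{i:03d}": f"val{i}" for i in range(10)}.items())
    print(sst.lookup("row007"))
    print(list(sst.iterate("row002", "row005")))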

Tablet Serving

[Diagram] A write is appended to the append-only commit log on GFS and applied to an in-memory memtable (random access); reads merge the memtable with the tablet's SSTables on GFS; a minor compaction flushes the memtable to a new SSTable on GFS.

SSTable: immutable on-disk ordered map from string to string; keys are <row, column, timestamp> triples.

Compactions

• Tablet state is represented as a set of immutable, compacted SSTable files, plus the tail of the log (buffered in memory)
• Minor compaction:
  – when in-memory state fills up, pick the tablet with the most data and write its contents to SSTables stored in GFS
• Major compaction:
  – periodically compact all SSTables for a tablet into a new base SSTable on GFS
  – storage from deletions is reclaimed at this point (garbage collection)
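A minimal sketch of the read path and the two compactions (plain Python dicts standing in for the memtable and SSTables; deletion markers and the real merge format are heavily simplified): reads consult the memtable first and then SSTables from newest to oldest, a minor compaction freezes the memtable into a new SSTable, and a major compaction merges everything into one SSTable and drops deletion markers.

    DELETED = object()   # toy deletion marker ("tombstone")

    class ToyTablet:
        def __init__(self):
            self.memtable = {}     # recent writes
            self.sstables = []     # newest last; each is an immutable dict

        def write(self, key, value):
            self.memtable[key] = value

        def delete(self, key):
            self.memtable[key] = DELETED

        def read(self, key):
            for layer in [self.memtable] + self.sstables[::-1]:   # newest first
                if key in layer:
                    v = layer[key]
                    return None if v is DELETED else v
            return None

        def minor_compaction(self):
            # Freeze the memtable into a new immutable SSTable.
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

        def major_compaction(self):
            # Merge all SSTables (oldest to newest) into one base SSTable,
            # dropping tombstones: this is where deleted data is reclaimed.
            merged = {}
            for sst in self.sstables:
                merged.update(sst)
            self.sstables = [{k: v for k, v in merged.items() if v is not DELETED}]

    t = ToyTablet()
    t.write("a", 1); t.write("b", 2); t.minor_compaction()
    t.delete("a"); t.minor_compaction()
    print(t.read("a"), t.read("b"))   # None 2
    t.major_compaction()
    print(t.sstables)                 # [{'b': 2}]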

System Structure

[Diagram] A Bigtable cell: clients use the Bigtable client library; metadata ops go to the Bigtable master (which performs metadata ops and load balancing), while reads/writes go directly to Bigtable tablet servers (which serve data); Open() consults the lock service. Underneath: the cluster scheduling system handles failover and monitoring, GFS holds tablet data and logs, and the lock service holds metadata and handles master election.

File Cleaning

• BigTable generates a lot of files
  – dominated by SSTables
• SSTables are immutable: they can be created, read, or deleted, but not overwritten
• Obsolete SSTables are deleted in a mark-and-sweep garbage collection
  – run by the BigTable master

Chubby Interactions

• Master election: a single Chubby lock
• Tablet server membership
  – a tablet server creates and acquires an exclusive lock on a uniquely-named file in the servers directory of Chubby when it starts, and stops serving when the lock is lost
  – the master monitors the directory to find tablet servers
• Chubby stores the access control lists
• Metadata
  – schema information (column family metadata)
  – tablet advertisement and metadata
  – replication metadata

Shared Logs

• Designed for 1M tablets, 1000s of tablet servers
  – 1M logs being written simultaneously performs badly
• Solution: shared logs
  – write one log file per tablet server instead of per tablet
    • updates for many tablets are co-mingled in the same file
  – start a new log chunk every so often (64 MB)
• Problem: during recovery, a server needs to read log data to apply mutations for a tablet
  – lots of wasted I/O if many machines need to read data for many tablets from the same log chunk

Shared Log Recovery

• Servers inform the master of the log chunks they need to read
• The master aggregates and orchestrates the sorting of the needed chunks
  – assigns log chunks to different tablet servers for sorting
  – servers sort chunks by tablet and write the sorted data to local disk
• Other tablet servers ask the master which servers have the sorted chunks they need
• Tablet servers issue direct RPCs to peer tablet servers to read the sorted data for their tablets
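A minimal sketch of the sort step (plain Python; the log-entry format and grouping are illustrative): entries from one shared log chunk, which interleave many tablets, are grouped by tablet and ordered by sequence number so each recovering tablet can be replayed from a contiguous run of its own mutations.

    from collections import defaultdict

    def sort_log_chunk(entries):
        """entries: [(tablet_id, seq_no, mutation), ...] as appended to a shared log.
        Returns {tablet_id: [mutations in seq_no order]} for recovery."""
        by_tablet = defaultdict(list)
        for tablet_id, seq_no, mutation in entries:
            by_tablet[tablet_id].append((seq_no, mutation))
        return {t: [m for _, m in sorted(muts)] for t, muts in by_tablet.items()}

    chunk = [("t2", 5, "set b=2"), ("t1", 3, "set a=1"),
             ("t2", 4, "del c"), ("t1", 7, "set a=9")]
    print(sort_log_chunk(chunk))
    # {'t2': ['del c', 'set b=2'], 't1': ['set a=1', 'set a=9']}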

BigTable Compression

• Keys:
  – sorted strings of (row, column, timestamp): prefix compression
• Values:
  – group values together by "type" (e.g., column family name)
  – BMDiff across all values in one family
    • the BMDiff output for values 1..N is the dictionary for value N+1
• Zippy as a final pass over the whole block
  – catches more localized repetitions
  – also catches cross-column-family repetition, and compresses keys
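Prefix compression of sorted keys is easy to illustrate. A minimal sketch (Python, illustrative encoding only): each key is stored as the length of the prefix it shares with the previous key plus the differing suffix.

    def prefix_compress(sorted_keys):
        """Encode sorted keys as (shared_prefix_len, suffix) pairs."""
        out, prev = [], ""
        for key in sorted_keys:
            n = 0
            while n < min(len(prev), len(key)) and prev[n] == key[n]:
                n += 1
            out.append((n, key[n:]))
            prev = key
        return out

    def prefix_decompress(pairs):
        keys, prev = [], ""
        for n, suffix in pairs:
            key = prev[:n] + suffix
            keys.append(key)
            prev = key
        return keys

    keys = ["com.cnn.www/index.html", "com.cnn.www/sports.html", "com.cnn.www/tech.html"]
    encoded = prefix_compress(keys)
    print(encoded)   # [(0, 'com.cnn.www/index.html'), (12, 'sports.html'), (12, 'tech.html')]
    assert prefix_decompress(encoded) == keys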

Compression

• Many opportunities for compression
  – similar values in the same row/column at different timestamps
  – similar values in different columns
  – similar values across adjacent rows
• Within each SSTable for a locality group, encode compressed blocks
  – keep blocks small for random access (~64 KB of compressed data)
  – exploit the fact that many values are very similar
  – must have low CPU cost for encoding/decoding
• Two building blocks: BMDiff, Zippy

BMDiff

• Bentley & McIlroy, DCC '99: "Data Compression Using Long Common Strings"
  – input: dictionary + source
  – output: a sequence of
    • COPY: <length> bytes from <offset>
    • LITERAL: <literal text>
• Store a hash at every 32-byte aligned boundary in the dictionary and in the source processed so far
• For every new source byte
  – compute an incremental hash of the last 32 bytes, look it up in the hash table
  – on a hit, expand the match forwards and backwards, emit COPY
• Encode: ~100 MB/s, decode: ~1000 MB/s

Zippy

• LZW-like: store a hash of the last four bytes in a 16K-entry table
• For every input byte:
  – compute the hash of the last four bytes
  – look it up in the table
  – emit COPY or LITERAL
• Differences from BMDiff:
  – much smaller compression window (local repetitions)
  – hash table is not associative
  – careful encoding of COPY/LITERAL tags and lengths
• Sloppy but fast:

    Algorithm    % remaining    Encoding     Decoding
    Gzip         13.4%          21 MB/s      118 MB/s
    LZO          20.5%          135 MB/s     410 MB/s
    Zippy        22.2%          172 MB/s     409 MB/s
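The ratio-versus-speed tradeoff is easy to reproduce with the standard zlib module (a rough sketch: zlib level 9 stands in for a Gzip-like setting and level 1 for a fast, Zippy-like setting; this is not Zippy itself, and the absolute numbers depend on the data and machine):

    import time, zlib

    data = b"row:com.cnn.www contents:<html>... repetitive page body ...</html>\n" * 20000

    for level in (9, 1):                       # 9 ~ best ratio, 1 ~ fastest
        t0 = time.perf_counter()
        out = zlib.compress(data, level)
        dt = time.perf_counter() - t0
        ratio = 100.0 * len(out) / len(data)
        speed = len(data) / dt / 1e6
        print(f"level {level}: {ratio:5.1f}% remaining, {speed:7.1f} MB/s encode")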

Compression Effectiveness

• Experiment: store the contents of a 2.1B-page crawl in a BigTable instance
  – key: URL rearranged as "com.cnn.www/index.html:http"
  – groups pages from the same site together
    • good for compression
    • good for clients: efficient to scan over all pages of a web site
• One compression strategy: gzip each page → ~28% of the bytes remaining
• BigTable with BMDiff + Zippy:

    Type                Count (B)    Space (TB)    Compressed    % remaining
    Web page contents   2.1          45.1 TB       4.2 TB        9.2%
    Links               1.8          11.2 TB       1.6 TB        13.9%
    Anchors             126.3        22.8 TB       2.9 TB        12.7%

Bloom Filters

• A read may need to read many SSTables
• Idea: use a membership test to avoid disk reads for non-existing data
  – membership test: does (row, column) exist in the tablet?
• Algorithm: Bloom filter
  – no false negatives; false positives require a read to find out
  – update the bit vector when new data is inserted (what about deletes?)

[Diagram] Each element of the set {a1, a2, …, aN} is hashed by k independent hash functions h1, h2, …, hk, each setting one bit in an m-bit vector; a query for b checks the same k positions.
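A minimal Bloom filter sketch (plain Python; hashing via the standard hashlib module, with the bit-vector size and hash count chosen arbitrarily for illustration):

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1024, k_hashes=4):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, key):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            # False => definitely absent (no false negatives); True => go read the SSTable.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    bf = BloomFilter()
    bf.add(("com.cnn.www", "contents:"))
    print(bf.might_contain(("com.cnn.www", "contents:")))   # True
    print(bf.might_contain(("com.cnn.www", "anchor:x")))    # almost certainly False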

Replication

• Often want updates replicated to many BigTable cells in different datacenters
  – low-latency access from anywhere in the world
  – disaster tolerance
• Optimistic replication scheme
  – writes to any of the on-line replicas are eventually propagated to the other replica clusters
    • 99.9% of writes are replicated immediately (at the speed of light)
  – currently a thin layer above the BigTable client library
    • working to move support inside the BigTable system
• Replication is deployed for My Search History

Performance


Application: Personalized Search

• Personalized Search (http://www.google.com/psearch)
  – an opt-in service
  – records a user's queries and clicks across Google (web search, image search, news, etc.)
  – the user can edit the search history
  – the search history affects search results
• Implementation in BigTable
  – one user per row, row name = user ID
  – one column family per action type
  – analyzed with MapReduce to produce a user profile
  – other products later added their own column families, governed by a quota system

Sample Usages


In Development/Future Plans

• More expressive data manipulation/access
  – allow sending small scripts to perform read/modify/write transactions so that they execute on the server (a kind of "stored procedure")
• Multi-row (i.e., distributed) transaction support
• General performance work for very large cells
• BigTable as a service
  – interesting issues of resource fairness, performance isolation, prioritization, etc., across different clients
  – App Engine's DataStore

Conclusions

• The data model is applicable to a broad range of clients
  – actively deployed in many of Google's services
• The system provides high-performance storage on a large scale
  – self-managing
  – thousands of servers
  – millions of ops/second
  – multiple GB/s of reading/writing
