Column-Oriented Database Systems - Computer Science

Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)

Column-Oriented Database Systems

VLDB 2009 Tutorial

Part 1: Stavros Harizopoulos (HP Labs) Part 2: Daniel Abadi (Yale) Part 3: Peter Boncz (CWI)



What is a column-store?

row-store: Date, Store, Product, Customer, Price stored together, record by record
+ easy to add/modify a record
- might read in unnecessary data

column-store: Date | Store | Product | Customer | Price, each stored separately
+ only need to read in relevant data
- tuple writes require multiple accesses

=> suitable for read-mostly, read-intensive, large data repositories
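The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration, not any system's storage format: the sample values are borrowed from the example table used later in this tutorial, and the function names are hypothetical.

```python
# Hypothetical in-memory model of the two layouts (not a real storage format).
COLUMNS = ("Date", "Store", "Product", "Customer", "Price")

# Row-store: each record keeps all five attributes together.
rows = [
    ("01/01", "BOS", "Table", "Mesa", 20),
    ("01/01", "NYC", "Chair", "Lutz", 13),
    ("01/01", "BOS", "Bed", "Mudd", 79),
]


def to_columns(records):
    """Transpose row tuples into one list per attribute (column-store)."""
    return {name: [rec[i] for rec in records] for i, name in enumerate(COLUMNS)}


columns = to_columns(rows)


def total_price_row_store(records):
    # Every full record is touched even though only Price is needed.
    return sum(rec[4] for rec in records)


def total_price_column_store(cols):
    # Only the Price column is touched.
    return sum(cols["Price"])


def append_record(cols, record):
    # One logical insert turns into one write per column.
    for name, value in zip(COLUMNS, record):
        cols[name].append(value)
```

Both totals agree, but the column version reads one list out of five, while `append_record` has to touch all five lists for a single new record.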



Are these two fundamentally different?
• The only fundamental difference is the storage layout
• However: we need to look at the big picture

[Figure: timeline from the ‘70s through the ‘80s, ‘90s, and ‘00s to today. Row-stores evolve into row-stores++ while different storage layouts are proposed, driven by new applications and a new bottleneck in hardware; column-stores appear today. Will the two converge?]

• How did we get here, and where we are heading (Part 1)
• What are the column-specific optimizations? (Part 2)
• How do we improve CPU efficiency when operating on Cs (Part 3)


Outline
• Part 1: Basic concepts (Stavros)
  • Introduction to key features
  • From DSM to column-stores and performance tradeoffs
  • Column-store architecture overview
  • Will rows and columns ever converge?
• Part 2: Column-oriented execution (Daniel)
• Part 3: MonetDB/X100 and CPU efficiency (Peter)


Telco Data Warehousing example
• Typical DW installation: star schema with a fact table (usage) and dimension tables (account, source, toll)
• Real-world example: “One Size Fits All? - Part 2: Benchmarking Results” Stonebraker et al. CIDR 2007

QUERY 2
  SELECT account.account_number,
         sum(usage.toll_airtime),
         sum(usage.toll_price)
  FROM usage, toll, source, account
  WHERE usage.toll_id = toll.toll_id
    AND usage.source_id = source.source_id
    AND usage.account_id = account.account_id
    AND toll.type_ind IN ('AE', 'AA')
    AND usage.toll_price > 0
    AND source.type != 'CIBER'
    AND toll.rating_method = 'IS'
    AND usage.invoice_date = 20051013
  GROUP BY account.account_number

           Column-store   Row-store
Query 1    2.06           300
Query 2    2.20           300
Query 3    0.09           300
Query 4    5.24           300
Query 5    2.88           300

Why? Three main factors (next slides)


Telco example explained (1/3): read efficiency

row store: read pages containing entire rows
  one row = 212 columns!

column store: read only columns needed
  in this example: 7 columns

is this typical? (it depends)

caveats:
• “select * ” not any faster
• clever disk prefetching
• clever tuple reconstruction

What about vertical partitioning? (it does not work with ad-hoc queries)


Telco example explained (2/3): compression efficiency
• Columns compress better than rows
  • Typical row-store compression ratio 1 : 3
  • Column-store 1 : 10
• Why?
  • Rows contain values from different domains => more entropy, difficult to dense-pack
  • Columns exhibit significantly less entropy
  • Examples: Male, Female, Female, Female, Male; 1998, 1998, 1999, 1999, 1999, 2000
• Caveat: CPU cost (use lightweight compression)


Telco example explained (3/3): sorting & indexing efficiency
• Compression and dense-packing free up space
  • Use multiple overlapping column collections
  • Sorted columns compress better
  • Range queries are faster
  • Use sparse clustered indexes

What about heavily-indexed row-stores? (works well for single column access, cross-column joins become increasingly expensive)


Additional opportunities for column-stores
• Block-tuple / vectorized processing
  • Easier to build block-tuple operators
    • Amortizes function-call cost, improves CPU cache performance
  • Easier to apply vectorized primitives (Part 3)
    • Software-based: bitwise operations
    • Hardware-based: SIMD
• Opportunities with compressed columns (more in Part 2)
  • Avoid decompression: operate directly on compressed
  • Delay decompression (and tuple reconstruction)
    • Also known as: late materialization
• Exploit columnar storage in other DBMS components
  • Physical design (both static and dynamic). See: Database Cracking, from CWI


Effect on C-Store performance

“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.

[Figure: average time (sec) for SSBM queries on C-Store. Bars labeled: original C-store; column-oriented join algorithm; enable compression & operate on compressed; enable late materialization.]


Summary of column-store key features
• Storage layout: columnar storage, header/ID elimination, compression, multiple sort orders
• Execution engine: column operators, avoid decompression, late materialization, vectorized operations
• Design tools, optimizer
(On the slide, each feature is tagged with the part that covers it: Part 1, Part 2, or Part 3.)




From DSM to Column-stores
• 70s - 1985:
  • TOD: Time Oriented Database. Wiederhold et al. "A Modular, Self-Describing Clinical Databank System," Computers and Biomedical Research, 1975
  • More 1970s: Transposed files, Lorie, Batory, Svensson.
  • “An overview of cantor: a new system for data analysis” Karasalo, Svensson, SSDBM 1983
• 1985: DSM paper. “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985.
• 1990s: Commercialization through SybaseIQ
• Late 90s – 2000s: Focus on main-memory performance
  • DSM “on steroids” [1997 – now]: CWI: MonetDB
  • Hybrid DSM/NSM [2001 – 2004]: Wisconsin: PAX, Fractured Mirrors; Michigan: Data Morphing; CMU: Clotho
• 2005 – : Re-birth of read-optimized DSM as “column-store”
  • MIT: C-Store
  • CWI: MonetDB/X100
  • 10+ startups


The original DSM paper
“A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985.
• Proposed as an alternative to NSM
• 2 indexes: clustered on ID, non-clustered on value
• Speeds up queries projecting few columns
• Requires more storage

ID   value
1    0100
2    0962
3    1000
4    ..


Memory wall and PAX
• 90s: Cache-conscious research
  • from: “Cache Conscious Algorithms for Relational Query Processing.” Shatdal, Kant, Naughton. VLDB 1994.
  • to: “Database Architecture Optimized for the New Bottleneck: Memory Access.” Boncz, Manegold, Kersten. VLDB 1999.
  • and: “DBMSs on a modern processor: Where does time go?” Ailamaki, DeWitt, Hill, Wood. VLDB 1999.
• PAX: Partition Attributes Across
  • Retains NSM I/O pattern
  • Optimizes cache-to-RAM communication
  • “Weaving Relations for Cache Performance.” Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001.


More hybrid NSM/DSM schemes
• Dynamic PAX: Data Morphing
  “Data morphing: an adaptive, cache-conscious storage technique.” Hankins, Patel, VLDB 2003.
• Clotho: custom layout using scatter-gather I/O
  “Clotho: Decoupling Memory Page Layout from Storage Organization.” Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004.
• Fractured mirrors
  • Smart mirroring with both NSM/DSM copies
  “A Case For Fractured Mirrors.” Ramamurthy, DeWitt, Su, VLDB 2002.


MonetDB (more in Part 3)
• Late 1990s, CWI: Boncz, Manegold, and Kersten
• Motivation:
  • Main-memory
  • Improve computational efficiency by avoiding expression interpreter
  • DSM with virtual IDs natural choice
  • Developed new query execution algebra
• Initial contributions:
  • Pointed out memory-wall in DBMSs
  • Cache-conscious projections and joins
  • …


2005: the (re)birth of column-stores
• New hardware and application realities
  • Faster CPUs, larger memories, disk bandwidth limit
  • Multi-terabyte Data Warehouses
• New approach: combine several techniques
  • Read-optimized, fast multi-column access, disk/CPU efficiency, light-weight compression
• C-store paper:
  • First comprehensive design description of a column-store
• MonetDB/X100
  • “proper” disk-based column store
• Explosion of new products


Performance tradeoffs: columns vs. rows
DSM traditionally was not favored by technology trends. How has this changed?
• Optimized DSM in “Fractured Mirrors,” 2002
• “Apples-to-apples” comparison
  “Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
• Follow-up study
  “Read-Optimized Databases, In Depth” Holloway, DeWitt, VLDB’08
• Main-memory DSM vs. NSM
  “DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08
• Flash-disks: a come-back for PAX?
  “Fast Scans and Joins Using Flash Drives” Shah, Harizopoulos, Wiener, Graefe. DaMoN’08
  “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09


Fractured mirrors: a closer look
“A Case For Fractured Mirrors” Ramamurthy, DeWitt, Su, VLDB 2002.
• Store DSM relations inside a B-tree
  • Leaf nodes contain values
  • Eliminate IDs, amortize header overhead
  • Custom implementation on Shore

[Figure: a regular DSM column stored as (Tuple Header, TID, Column Data) records for TIDs 1 to 5 with values a1 to a5, next to a sparse B-tree on ID whose leaf nodes hold the values a1 to a5.]

Similar: “Efficient columnar storage in B-trees” Graefe. Sigmod Record 03/2007.
Storage density comparable to column stores.


Fractured mirrors: performance

[Figure (from PAX paper): time vs. columns projected (1 to 5) for row and regular DSM, with optimized DSM added; "column?" marks the open question of where a column-store would fall.]

• Chunk-based tuple merging
  • Read in segments of M pages
  • Merge segments in memory
  • Becomes CPU-bound after 5 pages


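Column-scanner implementation

“Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06

SELECT name, age WHERE age > 40

[Figure: row scanner vs. column scanner. The row scanner reads whole tuples (1 Joe 45, 2 Sue 37, …) through Direct I/O, applies the predicate(s), and outputs (Joe, 45). The column scanner prefetches ~100ms worth of data per column, applies predicate #1 to the age column (45, 37, …) to produce positions (#POS), then reads the name column (Joe, Sue, …) at those positions and outputs (Joe, 45).]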
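The column scanner described above can be approximated with position lists. The sketch below is an illustration under simplifying assumptions (in-memory lists, one predicate, hypothetical function names); it is not the implementation from the cited paper.

```python
def scan_with_positions(column, predicate):
    """First scan node: return positions whose value satisfies the predicate."""
    return [pos for pos, value in enumerate(column) if predicate(value)]


def fetch_at_positions(column, positions):
    """Later scan node: read only the values at the qualifying positions."""
    return [column[pos] for pos in positions]


def column_scan(name_col, age_col, min_age):
    # Apply the predicate to the age column first, then probe the name column.
    positions = scan_with_positions(age_col, lambda age: age > min_age)
    names = fetch_at_positions(name_col, positions)
    ages = fetch_at_positions(age_col, positions)
    return list(zip(names, ages))


def row_scan(rows, min_age):
    # The row scanner sees every attribute of every tuple.
    return [(name, age) for _tid, name, age in rows if age > min_age]
```

With the slide's two tuples, both scanners return the same answer; the column version never reads the name of a tuple that fails the predicate.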


Scan performance
• Large prefetch hides disk seeks in columns
• Column-CPU efficiency with lower selectivity
• Row-CPU suffers from memory stalls
• Memory stalls disappear in narrow tuples
• Compression: similar to narrow (not shown, details in the paper)


Even more results
“Read-Optimized Databases, In Depth” Holloway, DeWitt, VLDB’08
• Same engine as before
• Additional findings

[Figure: time (s), 0 to 35, vs. columns returned (1 to 10); series C-25%, C-10%, R-50%. Annotations: "narrow & compressed tuple: CPU-bound!" and "wide attributes: same as before".]

• Non-selective queries, narrow tuples, favor well-compressed rows
• Materialized views are a win
• Scan times determine early materialized joins
Column-joins are covered in part 2!


Speedup of columns over rows
“Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06

[Figure: speedup of columns over rows as a function of cycles per disk byte (cpdb: 9, 18, 36, 72, 144) and tuple width (8 to 36); regions marked _, =, +, ++, +++.]

• Rows favored by narrow tuples and low cpdb
  • Disk-bound workloads have higher cpdb


Varying prefetch size: no competing disk traffic

[Figure: time (sec), 0 to 40, vs. selected bytes per tuple (4 to 32); series Column 2, Column 8, Column 16, Column 48 (x 128KB), and Row (any prefetch size).]

• No prefetching hurts columns in single scans


Varying prefetch size: with competing disk traffic

[Figure: two panels of time (sec), 0 to 40, vs. selected bytes per tuple (4 to 28); one panel compares Column, 48 with Row, 48, the other Column, 8 with Row, 8.]

• No prefetching hurts columns in single scans
• Under competing traffic, columns outperform rows for any prefetch size


CPU Performance
“DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08
• Benefit in on-the-fly conversion between NSM and DSM
• DSM: sequential access (block fits in L2), random in L1
• NSM: random access, SIMD for grouped Aggregation


New storage technology: Flash SSDs
• Performance characteristics
  • very fast random reads, slow random writes
  • fast sequential reads and writes
• Price per bit (capacity follows)
  • cheaper than RAM, order of magnitude more expensive than Disk
• Flash Translation Layer introduces unpredictability
  • avoid random writes!
• Form factors not ideal yet
  • SSD (→ small reads still suffer from SATA overhead/OS limitations)
  • PCI card (→ high price, limited expandability)
• Boost Sequential I/O in a simple package
  • Flash RAID: very tight bandwidth/cm3 packing (4GB/sec inside the box)
• Column Store Updates
  • useful for delta structures and logs
• Random I/O on flash fixes unclustered index access
  • still suboptimal if I/O block size > record size
  • therefore column stores profit much less than horizontal stores
• Random I/O useful to exploit secondary, tertiary table orderings
  • the larger the data, the deeper clustering one can exploit


Even faster column scans on flash SSDs
• New-generation SSDs
  • Very fast random reads, slower random writes
  • Fast sequential RW, comparable to HDD arrays
  • 30K Read IOps, 3K Write IOps
  • 250MB/s Read BW, 200MB/s Write
• No expensive seeks across columns
• FlashScan and FlashJoin: PAX on SSDs, inside Postgres
  “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09
  • mini-pages with no qualified attributes are not accessed


Column-scan performance over time

[Figure: column-scan performance for regular DSM (2001), optimized DSM (2002), column-store (2006), and SSD Postgres/PAX (2009). Annotations: "from 7x slower", "..to 2x slower", "..to 1.2x slower", "..to same", "and 3x faster!".]




Architecture of a column-store

storage layout
• read-optimized: dense-packed, compressed
• organize in extents, batch updates
• multiple sort orders
• sparse indexes

engine
• block-tuple operators
• new access methods
• optimized relational operators

system-level
• system-wide column support
• loading / updates
• scaling through multiple nodes
• transactions / redundancy


C-Store
“C-Store: A Column-Oriented DBMS.” Stonebraker et al. VLDB 2005.
• Compress columns
• No alignment
• Big disk blocks
• Only materialized views (perhaps many)
• Focus on Sorting not indexing
• Data ordered on anything, not just time
• Automatic physical DBMS design
• Optimize for grid computing
• Innovative redundancy
• Xacts – but no need for Mohan
• Column optimizer and executor


C-Store: only materialized views (MVs)
• Projection (MV) is some number of columns from a fact table
• Plus columns in a dimension table – with a 1-n join between Fact and Dimension table
• Stored in order of a storage key(s)
• Several may be stored!
• With a permutation, if necessary, to map between them
• Table (as the user specified it and sees it) is not stored!
• No secondary indexes (they are a one column sorted MV plus a permutation, if you really want one)

User view:
  EMP (name, age, salary, dept)
  Dept (dname, floor)

Possible set of MVs:
  MV-1 (name, dept, floor) in floor order
  MV-2 (salary, age) in age order
  MV-3 (dname, salary, name) in salary order


Continuous Load and Query (Vertica)
Hybrid Storage Architecture

> Write Optimized Store (WOS), fed by Trickle Load
§ Memory based
§ Unsorted / Uncompressed
§ Segmented
§ Low latency / Small quick inserts

> TUPLE MOVER: Asynchronous Data Transfer

> Read Optimized Store (ROS)
• On disk
• Sorted / Compressed
• Segmented
• Large data loaded direct

[Figure: columns A, B, C shown in both stores; projection (A B C | A).]


Loading Data (Vertica)
> INSERT, UPDATE, DELETE
> Bulk and Trickle Loads
§ COPY
§ COPY DIRECT
> User loads data into logical Tables
> Vertica loads atomically into storage

[Figure: Write-Optimized Store (WOS), in-memory → Automatic Tuple Mover → Read-Optimized Store (ROS), on-disk.]


Applications for column-stores
• Data Warehousing
  • High end (clustering)
  • Mid end/Mass Market
  • Personal Analytics
• Data Mining
  • E.g. Proximity
• Google BigTable
• RDF
  • Semantic web data management
• Information retrieval
  • Terabyte TREC
• Scientific datasets
  • SciDB initiative
  • SLOAN Digital Sky Survey on MonetDB


List of column-store systems
• Cantor (history)
• Sybase IQ
• SenSage (former Addamark Technologies)
• Kdb
• 1010data
• MonetDB
• C-Store/Vertica
• X100/VectorWise
• KickFire
• SAP Business Accelerator
• Infobright
• ParAccel
• Exasol




Simulate a Column-Store inside a Row-Store

Date    Store   Product   Customer   Price
01/01   BOS     Table     Mesa       $20
01/01   NYC     Chair     Lutz       $13
01/01   BOS     Bed       Mudd       $79

Option A: Vertical Partitioning (one (TID, Value) table per column)
  Date:      (1, 01/01)  (2, 01/01)  (3, 01/01)
  Store:     (1, BOS)    (2, NYC)    (3, BOS)
  Product:   (1, Table)  (2, Chair)  (3, Bed)
  Customer:  (1, Mesa)   (2, Lutz)   (3, Mudd)
  Price:     (1, $20)    (2, $13)    (3, $79)

Option B: Index Every Column
  Date Index, Store Index, …


Simulate a Column-Store inside a Row-Store (continued)
“Teaching an Old Elephant New Tricks.” Bruno, CIDR 2009.

Date    Store   Product   Customer   Price
01/01   BOS     Table     Mesa       $20
01/01   NYC     Chair     Lutz       $13
01/01   BOS     Bed       Mudd       $79

Option A: Vertical Partitioning
  Can explicitly run-length encode date:
  Date:      (Value, StartPos, Length) = (01/01, 1, 3)
  Store:     (1, BOS)    (2, NYC)    (3, BOS)
  Product:   (1, Table)  (2, Chair)  (3, Bed)
  Customer:  (1, Mesa)   (2, Lutz)   (3, Mudd)
  Price:     (1, $20)    (2, $13)    (3, $79)

Option B: Index Every Column
  Date Index, Store Index, …


Experiments
• Star Schema Benchmark (SSBM)
  “Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance”. O’Neil et al. ICDE 2008.
• Implemented by professional DBA
• Original row-store plus 2 column-store simulations on same row-store product

“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.

Average time (seconds):
  Normal Row-Store                    25.7
  Vertically Partitioned Row-Store    79.9
  Row-Store With All Indexes          221.2


What’s Going On? Vertical Partitions
“Column-Stores vs. Row-Stores: How Different Are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
• Vertical partitions in row-stores:
  • Work well when workload is known
  • ..and queries access disjoint sets of columns
  • See automated physical design
• Do not work well as full-columns
  • TupleID overhead significant
  • Excessive joins

[Figure: each vertical partition row holds Tuple Header | TID | Column Data, for TIDs 1, 2, 3.]

Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
Complete fact table takes up ~4 GB (compressed)
Vertically partitioned tables take up 0.7-1.1 GB (compressed)


What’s Going On? All Indexes Case
• Tuple construction
  • Common type of query:
      SELECT store_name, SUM(revenue)
      FROM Facts, Stores
      WHERE Facts.store_id = Stores.store_id
        AND Stores.country = 'Canada'
      GROUP BY store_name
  • Result of lower part of query plan is a set of TIDs that passed all predicates
  • Need to extract SELECT attributes at these TIDs
    • BUT: index maps value to TID
    • You really want to map TID to value (i.e., a vertical partition)
• Tuple construction is SLOW


So….
• All indexes approach is a poor way to simulate a column-store
• Problems with vertical partitioning are NOT fundamental
  • Store tuple header in a separate partition
  • Allow virtual TIDs
  • Combine clustered indexes, vertical partitioning
• So can row-stores simulate column-stores?
  • Might be possible, BUT:
    • Need better support for vertical partitioning at the storage layer
    • Need support for column-specific optimizations at the executor level
    • Full integration: buffer pool, transaction manager, ..
• When will this happen?
  • Most promising features = soon (see Part 2, Part 3 for most promising features)
  • ..unless new technology / new objectives change the game (SSDs, Massively Parallel Platforms, Energy-efficiency)


End of Part 1

Basic concepts (Stavros)


Part 2 Outline
• Compression
• Tuple Materialization
• Joins


Column-Oriented Database Systems: Compression
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi, Madden, and Ferreira, SIGMOD ’06
“Query optimization in compressed database systems” Chen, Gehrke, Korn, SIGMOD’01


Compression
• Trades I/O for CPU
• Increased column-store opportunities:
  • Higher data value locality in column stores
  • Techniques such as run length encoding far more useful
  • Can use extra space to store multiple copies of data in different sort orders


Run-length Encoding

Uncompressed:
Quarter   Product ID   Price
Q1        1            5
Q1        1            7
Q1        1            2
Q1        1            9
Q1        1            6
Q1        2            8
Q1        2            5
…         …            …
Q2        1            3
Q2        1            8
Q2        1            1
Q2        2            4
…         …            …

Compressed:
Quarter (value, start_pos, run_length): (Q1, 1, 300) (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)
Product ID (value, start_pos, run_length): (1, 1, 5) (2, 6, 2) … (1, 301, 3) (2, 304, 1) …
Price (left as is): 5 7 2 9 6 8 5 … 3 8 1 4 …
Bit-vector Encoding
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi et al., SIGMOD ’06
• For each unique value, v, in column c, create bit-vector b
  • b[i] = 1 if c[i] = v
• Good for columns with few unique values
• Each bit-vector can be further compressed if sparse

Product ID:  1 1 1 1 1 2 2 …  1 1 2 3 …
ID: 1        1 1 1 1 1 0 0 …  1 1 0 0 …
ID: 2        0 0 0 0 0 1 1 …  0 0 1 0 …
ID: 3        0 0 0 0 0 0 0 …  0 0 0 1 …
…            0 0 0 0 0 0 0 …  0 0 0 0 …
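The rule b[i] = 1 if c[i] = v translates directly into code. The sketch below uses plain Python lists of 0/1 for readability; a real system would pack the bits, and the function names are hypothetical.

```python
def bitvector_encode(column):
    """Return {value: bit-vector} where bits[i] == 1 iff column[i] == value."""
    vectors = {}
    for i, value in enumerate(column):
        if value not in vectors:
            vectors[value] = [0] * len(column)
        vectors[value][i] = 1
    return vectors


def bitvector_decode(vectors):
    """Rebuild the column: exactly one vector has a 1 at each position."""
    if not vectors:
        return []
    length = len(next(iter(vectors.values())))
    column = [None] * length
    for value, bits in vectors.items():
        for i, bit in enumerate(bits):
            if bit:
                column[i] = value
    return column
```

Each distinct value costs one full-length vector, which is why the slide recommends this scheme only for columns with few unique values.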



Dictionary Encoding
• For each unique value create dictionary entry
• Dictionary can be per-block or per-column
• Column-stores have the advantage that dictionary entries may encode multiple values at once

Quarter: Q1 Q2 Q4 Q1 Q3 Q1 Q1 Q1 Q2 Q4 Q3 Q3 …

One value per entry:
  Quarter: 0 1 3 0 2 0 0 0 1 3 2 2
  + Dictionary: 0: Q1, 1: Q2, 2: Q3, 3: Q4

OR several values per entry:
  Quarter: 24 128 122
  + Dictionary: 24: Q1, Q2, Q4, Q1 … 122: Q2, Q4, Q3, Q3 … 128: Q3, Q1, Q1, Q1
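Both variants on this slide can be sketched as follows. The single-value encoder assigns codes in sorted value order, which happens to reproduce the slide's 0 to 3 codes; the grouped encoder assigns codes in first-seen order, so its codes differ from the slide's 24, 128, 122 (those look arbitrary). All names here are hypothetical.

```python
def dictionary_encode(column):
    """Map each distinct value to a small integer code (sorted value order)."""
    dictionary = {value: code for code, value in enumerate(sorted(set(column)))}
    return [dictionary[v] for v in column], dictionary


def dictionary_decode(codes, dictionary):
    """Invert the dictionary and expand the codes."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[c] for c in codes]


def dictionary_encode_groups(column, group_size):
    """Encode runs of group_size consecutive values with a single code each."""
    groups = [tuple(column[i:i + group_size])
              for i in range(0, len(column), group_size)]
    dictionary = {}
    codes = []
    for group in groups:
        # setdefault assigns the next free code the first time a group is seen.
        codes.append(dictionary.setdefault(group, len(dictionary)))
    return codes, dictionary
```

Grouping four quarters per entry shrinks the twelve-value example to three codes, at the price of a larger dictionary entry per code.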


Frame Of Reference Encoding
“Compressing Relations and Indexes” Goldstein, Ramakrishnan, Shaft, ICDE’98
• Encodes values as b bit offset from chosen frame of reference
• Special escape code (e.g. all bits set to 1) indicates a difference larger than can be stored in b bits
  • After escape code, original (uncompressed) value is written

Price:                          45 54 48 55 51 53 40 50 49 62 52 50 …
Frame: 50, 4 bits per value:    -5 4 -2 5 1 3 ∞ 40 0 -1 ∞ 62 2 0 …

Exceptions: ∞ 40 and ∞ 62 (see part 3 for a better way to deal with exceptions)
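A minimal sketch of the scheme follows. The slide does not say how signed offsets are laid out in the b bits, so this sketch assumes a biased code (offset + bias) with the all-ones pattern reserved as the escape; it also keeps codes in a Python list rather than packing bits. Names are hypothetical.

```python
def for_encode(values, frame, bits):
    """Frame-of-reference encode; out-of-range values follow an escape code."""
    escape = (1 << bits) - 1      # all bits set to 1
    bias = escape // 2            # assumption: store offset + bias as the code
    stream = []
    for v in values:
        code = v - frame + bias
        if 0 <= code < escape:
            stream.append(code)
        else:
            stream.extend([escape, v])   # escape, then the original value
    return stream


def for_decode(stream, frame, bits):
    """Invert for_encode."""
    escape = (1 << bits) - 1
    bias = escape // 2
    values, i = [], 0
    while i < len(stream):
        if stream[i] == escape:
            values.append(stream[i + 1])  # raw value stored after the escape
            i += 2
        else:
            values.append(stream[i] - bias + frame)
            i += 1
    return values
```

With frame 50 and 4 bits, the slide's Price column encodes with exactly two escapes, for 40 and 62, matching the two exceptions shown.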


Differential Encoding
“Improved Word-Aligned Binary Compression for Text Indexing” Anh, Moffat, TKDE’06
• Encodes values as b bit offset from previous value
• Special escape code (just like frame of reference encoding) indicates a difference larger than can be stored in b bits
  • After escape code, original (uncompressed) value is written
• Performs well on columns containing increasing/decreasing sequences
  • inverted lists
  • timestamps
  • object IDs
  • sorted / clustered columns

Time:                 5:00 5:02 5:03 5:03 5:04 5:06 5:07 5:08 5:10 5:15 5:16 5:16 …
2 bits per value:     5:00 2 1 0 1 2 1 1 2 ∞ 5:15 1 0

Exception: ∞ 5:15 (see part 3 for a better way to deal with exceptions)
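The same escape idea applied to deltas can be sketched as below. Assumptions: times are represented as minutes since midnight, only non-negative deltas are stored (as in the slide's increasing sequence), and codes are kept in a list instead of being bit-packed. Names are hypothetical.

```python
def delta_encode(values, bits):
    """Store the first value raw, then b-bit deltas with an escape code."""
    if not values:
        return []
    escape = (1 << bits) - 1          # all bits set to 1
    stream = [values[0]]
    for prev, cur in zip(values, values[1:]):
        diff = cur - prev
        if 0 <= diff < escape:
            stream.append(diff)
        else:
            stream.extend([escape, cur])  # escape, then the original value
    return stream


def delta_decode(stream, bits):
    """Invert delta_encode."""
    if not stream:
        return []
    escape = (1 << bits) - 1
    values, i = [stream[0]], 1
    while i < len(stream):
        if stream[i] == escape:
            values.append(stream[i + 1])
            i += 2
        else:
            values.append(values[-1] + stream[i])
            i += 1
    return values
```

With 2 bits per value, deltas 0 to 2 fit and the jump from 5:10 to 5:15 triggers the single exception shown on the slide.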


What Compression Scheme To Use?

Does column appear in the sort key?
• yes: Is the average run-length > 2?
  • yes: RLE
  • no: Differential Encoding
• no: Are number of unique values < ~50000?
  • yes: Does this column appear frequently in selection predicates?
    • yes: Bit-vector Compression
    • no: Dictionary Compression
  • no: Is the data numerical and exhibit good locality?
    • yes: Frame of Reference Encoding
    • no: Leave Data Uncompressed OR Heavyweight Compression
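One way to read the flowchart is as a small decision function. The branch layout below is a reconstruction from the flattened slide text, so treat it as an interpretation; the function and parameter names are hypothetical, while the 50000 and run-length thresholds come from the slide.

```python
def choose_compression(in_sort_key, avg_run_length, num_unique,
                       numeric_with_locality, frequent_in_predicates):
    """Pick a scheme following one reading of the slide's decision tree."""
    if in_sort_key:
        # Sorted columns: long runs favor RLE, otherwise encode deltas.
        return "RLE" if avg_run_length > 2 else "Differential Encoding"
    if num_unique < 50000:
        # Few distinct values: bit-vectors help predicates, else dictionary.
        if frequent_in_predicates:
            return "Bit-vector Compression"
        return "Dictionary Compression"
    if numeric_with_locality:
        return "Frame of Reference Encoding"
    return "Leave Data Uncompressed OR Heavyweight Compression"
```

For example, a sorted Quarter column with runs of hundreds of rows maps to RLE, while an unsorted low-cardinality column that is rarely filtered maps to dictionary compression.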


Heavy-Weight Compression Schemes
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
• Modern disk arrays can achieve > 1GB/s
• 1/3 CPU for decompression → 3GB/s needed
→ Lightweight compression schemes are better
→ Even better: operate directly on compressed data



Operating Directly on Compressed Data
• I/O - CPU tradeoff is no longer a tradeoff
• Reduces memory–CPU bandwidth requirements
• Opens up possibility of operating on multiple records at once



Operating Directly on Compressed Data

SELECT ProductID, Count(*) FROM table WHERE (Quarter = Q2) GROUP BY ProductID

Quarter (RLE): (Q1, 1, 300) (Q2, 301, 6) (Q3, 307, 500) (Q4, 807, 600)
Index Lookup + Offset jump: positions 301-306

Product ID (bit-vectors, shown from position 301):
  1:  1 0 0 1 1 0 0 1 0 0 0 …
  2:  0 0 1 0 0 0 1 0 0 1 0 …
  3:  0 1 0 0 0 1 0 0 1 0 1 …

Result (ProductID, COUNT(*)): (1, 3) (2, 1) (3, 2)
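The query above can be answered without decoding a single ProductID. The sketch below is a scaled-down illustration: the Q2 run is placed at positions 1 to 6 (instead of 301 to 306) so that the slide's bit-vectors line up, and it assumes each Quarter value occupies a single run. Function and variable names are hypothetical.

```python
def positions_for_value(rle_runs, value):
    """Return (start, end), 1-based inclusive, of the run holding value."""
    for v, start, length in rle_runs:
        if v == value:
            return start, start + length - 1
    return None


def count_by_bitvector(bitvectors, start, end):
    """Count set bits per value within [start, end]; no values are decoded."""
    return {value: sum(bits[start - 1:end]) for value, bits in bitvectors.items()}


# Scaled-down data (assumption): Q2 occupies positions 1-6.
quarter_runs = [("Q2", 1, 6), ("Q3", 7, 5)]
product_bitvectors = {
    1: [1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    2: [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    3: [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1],
}

start, end = positions_for_value(quarter_runs, "Q2")
counts = count_by_bitvector(product_bitvectors, start, end)
```

`counts` comes out as `{1: 3, 2: 1, 3: 2}`, the same (ProductID, COUNT(*)) pairs as on the slide: the selection is a lookup in the run list, and the aggregation is a bit count over a slice.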



Operating Directly on Compressed Data

SELECT ProductID, Count(*) FROM table WHERE (Quarter = Q2) GROUP BY ProductID

[Figure: operator stack of Aggregation Operator, Selection Operator, and Compression-Aware Scan Operator; data flows between operators through the Block API. Example blocks: (Q1, 1, 300) (Q2, 301, 6) (Q3, 307, 500) (Q4, 807, 600).]

Block API
• isOneValue(), isValueSorted(), isPosContiguous(), isSparse()
• getNext(), decompressIntoArray(), getValueAtPosition(pos)
• getMin(), getMax(), getSize()


Column-Oriented Database Systems: Tuple Materialization and Column-Oriented Join Algorithms
“Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.
“Self-organizing tuple reconstruction in column-stores”, Idreos, Manegold, Kersten, SIGMOD’09
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
“Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, and Graefe. SIGMOD 2009.
“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04


When should columns be projected?
• Where should column projection operators be placed in a query plan?
  • Row-store:
    • Column projection involves removing unneeded columns from tuples
    • Generally done as early as possible
  • Column-store:
    • Operation is almost completely opposite from a row-store
    • Column projection involves reading needed columns from storage and extracting values for a listed set of tuples
      § This process is called “materialization”
    • Early materialization: project columns at beginning of query plan
      § Straightforward since there is a one-to-one mapping across columns
    • Late materialization: wait as long as possible for projecting columns
      § More complicated since selection and join operators on one column obfuscate the mapping to other columns from the same table
    • Most column-stores construct tuples at column projection time
      § Many database interfaces expect output in regular tuples (rows)
      § Rest of discussion will focus on this case


When should tuples be constructed?

QUERY: SELECT custID, SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) GROUP BY custID

Stored columns (prodID is run-length encoded as (4,1,4)):
prodID   storeID   custID   price
4        2         2        7
4        1         3        13
4        3         3        42
4        1         3        80

Solution 1: Create rows first (EM): Construct the tuples, then Select + Aggregate. But:
• Need to construct ALL tuples
• Need to decompress data
• Poor memory bandwidth utilization


Solution 2: Operate on columns

QUERY: SELECT custID, SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) GROUP BY custID

Stored columns:
prodID   storeID   custID   price
4        2         2        7
4        1         3        13
4        3         3        42
4        1         3        80

Step 1: each predicate runs on its own column (Data Source prodID, Data Source storeID) and emits a bit-vector:
  prodID = 4:   1 1 1 1
  storeID = 1:  0 1 0 1
Step 2: AND the bit-vectors:  0 1 0 1
Step 3: Data Source custID and Data Source price extract only the qualifying positions:
  custID: 3 3
  price:  13 80
Step 4: AGG produces (custID, SUM(price)) = (3, 93)
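The four steps of this column-at-a-time plan can be replayed in a few lines. This is a sketch over plain Python lists with hypothetical function names; it mirrors the slide's data but not any particular engine's operators.

```python
# Column data from the slide.
prod_id = [4, 4, 4, 4]
store_id = [2, 1, 3, 1]
cust_id = [2, 3, 3, 3]
price = [7, 13, 42, 80]


def predicate_bits(column, value):
    """Data source with a predicate: one bit per position."""
    return [1 if v == value else 0 for v in column]


def and_bits(a, b):
    """AND operator over two position bit-vectors."""
    return [x & y for x, y in zip(a, b)]


def extract(column, bits):
    """Data source driven by a bit-vector: keep only qualifying positions."""
    return [v for v, bit in zip(column, bits) if bit]


def sum_by_group(keys, values):
    """AGG operator: SUM(values) GROUP BY keys."""
    totals = {}
    for k, v in zip(keys, values):
        totals[k] = totals.get(k, 0) + v
    return totals


mask = and_bits(predicate_bits(prod_id, 4), predicate_bits(store_id, 1))
result = sum_by_group(extract(cust_id, mask), extract(price, mask))
```

Here `mask` is `[0, 1, 0, 1]` and `result` is `{3: 93}`. Only two custID values and two price values are ever read, and no full tuple is built.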


For plans without joins, late materialization is a win
“Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.

QUERY: SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1
• Ran on 2 compressed columns from TPC-H scale 10 data

[Figure: time (seconds), 0 to 10, for Early Materialization vs. Late Materialization at low, medium, and high selectivity.]


Even on uncompressed data, late materialization is still a win
“Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.

QUERY: SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1
• Materializing late still works best

[Figure: time (seconds), 0 to 14, for Early Materialization vs. Late Materialization at low, medium, and high selectivity.]


What about for plans with joins?

Select R1.B, R1.C, R2.E, R2.H, R3.F From R1, R2, R3 Where R1.A = R2.D AND R2.G = R3.K

[Figure: two query plans over Scan R1 (A, B, C), Scan R2 (D, E, G, H), and Scan R3 (F, K, L).
Early materialization: Join 1 (A=D) takes A, B, C and D, E, G, H and passes on B, C, E, G, H; Join 2 (G=K) combines that with F, K from R3; Project outputs B, C, E, H, F.
Late materialization: Join A=D runs on columns A and D only; Join G=K runs on columns G and K only; the Project operators fetch B, C, then E, H, then F afterwards to output B, C, E, H, F.]


Early Materialization Example

QUERY: SELECT C.lastName, SUM(F.price) FROM facts AS F, customers AS C WHERE F.custID = C.custID GROUP BY C.lastName

Facts (prodID stored run-length encoded as (4,1,4)):
prodID   storeID   quantity   custID   price
4        12        6          2        7
4        1         1          3        13
4        11        2          3        42
4        1         1          3        80

Customers:
custID   lastName
1        Green
2        White
3        Brown

Construct: (custID, price) tuples (2, 7) (3, 13) (3, 42) (3, 80)
Join with Customers: (7, White) (13, Brown) (42, Brown) (80, Brown)


Late Materialization Example

QUERY: SELECT C.lastName, SUM(F.price) FROM facts AS F, customers AS C WHERE F.custID = C.custID GROUP BY C.lastName

Facts
prodID   storeID  quantity  custID  price
(4,1,4)  12       6         2       7
         1        1         3       13
         11       2         1       42
         1        1         3       80

Customers
custID  lastName
1       Green
2       White
3       Brown

Join on custID: Facts positions 1 2 3 4 match Customers positions 2 3 1 3

Late materialized join causes out of order probing of projected columns from the inner relation
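The late materialized join above can be sketched in C. This is a minimal illustration under simplifying assumptions, not the tutorial's implementation: columns are plain in-memory int arrays, the probe is a linear search, and the function names `lm_join_positions` and `sum_price_by_customer` are hypothetical. Step 1 touches only the two join columns and emits positions; step 2 uses those positions to fetch the projected values.

```c
#include <stddef.h>

/* Step 1: join using only the key columns. For each fact row i, store the
 * 0-based position of the matching customer row in inner_pos[i], or -1 if
 * there is no match. Assumption: customer keys are unique and the customer
 * column is small enough for a linear probe (a real system would hash). */
size_t lm_join_positions(const int *fact_cust, size_t n_fact,
                         const int *cust_id, size_t n_cust,
                         int *inner_pos)
{
    size_t matches = 0;
    for (size_t i = 0; i < n_fact; i++) {
        inner_pos[i] = -1;
        for (size_t j = 0; j < n_cust; j++) {
            if (cust_id[j] == fact_cust[i]) {
                inner_pos[i] = (int)j;
                matches++;
                break;
            }
        }
    }
    return matches;
}

/* Step 2: materialize late. Prices are read in position order, but the
 * customer side is accessed in whatever order inner_pos dictates. Sums are
 * kept per customer position, which stands in for GROUP BY lastName when
 * last names are unique per customer. */
void sum_price_by_customer(const int *price, const int *inner_pos,
                           size_t n_fact, long *sum_per_cust, size_t n_cust)
{
    for (size_t j = 0; j < n_cust; j++)
        sum_per_cust[j] = 0;
    for (size_t i = 0; i < n_fact; i++)
        if (inner_pos[i] >= 0)
            sum_per_cust[inner_pos[i]] += price[i];
}
```

With the slide's data (custID 2, 3, 1, 3), the inner positions come out as 1, 2, 0, 2 (0-based), which is the out-of-order access pattern the slide warns about.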


Late Materialized Join Performance

• Naïve LM join about 2X slower than EM join on typical queries (due to random I/O)
  - This number is very dependent on
    - Amount of memory available
    - Number of projected attributes
    - Join cardinality
• But we can do better
  - Invisible Join
  - Jive/Flash Join
  - Radix cluster/decluster join


Invisible Join

“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.

• Designed for typical joins when data is modeled using a star schema
  - One (“fact”) table is joined with multiple dimension tables

Typical query:
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = 'ASIA'
  and s_region = 'ASIA'
  and d_year >= 1992
  and d_year [...]

[Figure: query pipeline. SCAN holds the tuples (101, alice, 22) and (102, ivan, 37); SELECT pulls them one at a time through next() and tests the age against 30 (22 > 30 ?).]


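A Look at the Query Pipeline

Operators: PROJECT, SELECT, SCAN

Iterator interface
- open()
- next(): tuple
- close()

[Figure: each operator pulls one tuple at a time from its child through next(). SELECT tests 22 > 30 ? and passes on (102, ivan, 37); PROJECT computes 37 - 30 = 7 and 7 * 50 = 350.]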


A Look at the Query Pipeline

Primitives
- Provide computational functionality
- All arithmetic allowed in expressions, e.g. Multiplication: 7 * 50
- mult(int,int) → int

[Figure: the same PROJECT, SELECT, SCAN pipeline over the tuples (101, alice, 22) and (102, ivan, 37); PROJECT outputs (102, ivan, 350).]
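The tuple-at-a-time pipeline on these slides can be sketched in C as follows. This is an illustration under assumptions, not code from any of the systems discussed: the `Operator` struct, its fields, and the `*_next` functions are hypothetical names, open() and close() are left out, and the name column is dropped to keep the tuple small. The point is that every tuple costs one late-bound next() call per operator.

```c
#include <stddef.h>

/* One tuple of the example table people(id, age), plus the computed bonus. */
typedef struct { int id; int age; int bonus; } Tuple;

/* Volcano-style operator: next() is a late-bound call through a pointer and
 * returns 1 with *out filled, or 0 at end of input. */
typedef struct Operator {
    int (*next)(struct Operator *self, Tuple *out);
    struct Operator *child;   /* input operator, NULL for SCAN */
    const Tuple *rows;        /* SCAN only: the stored table */
    size_t n_rows;            /* SCAN only: table size */
    size_t cursor;            /* SCAN only: next row to return */
    int threshold;            /* SELECT only: keep age > threshold */
} Operator;

int scan_next(Operator *self, Tuple *out)
{
    if (self->cursor >= self->n_rows)
        return 0;
    *out = self->rows[self->cursor++];
    return 1;
}

int select_next(Operator *self, Tuple *out)
{
    /* Data-dependent branch evaluated once per tuple. */
    while (self->child->next(self->child, out))
        if (out->age > self->threshold)
            return 1;
    return 0;
}

int project_next(Operator *self, Tuple *out)
{
    if (!self->child->next(self->child, out))
        return 0;
    out->bonus = (out->age - 30) * 50;   /* the slide's (age-30)*50 */
    return 1;
}
```

A plan is built by chaining three `Operator` values (SCAN under SELECT under PROJECT) and calling next() on the top one until it returns 0.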


Database Architecture causes Hazards

• Data hazards
  - Dependencies between instructions
  - L1 data cache misses
• Control Hazards
  - Branch mispredictions
  - Computed branches (late binding)
  - L1 instruction cache misses

CPU features at stake: SIMD, Out-of-order Execution

Causes in the operator pipeline (PROJECT, SELECT, SCAN with the iterator interface open(), next(): tuple, close()):
- work on one tuple at a time
- Large Tree/Hash Structures
- Code footprint of all operators in query plan exceeds L1 cache
- Data-dependent conditions
- Next() late binding method calls
- Tree, List, Hash traversal
- Complex NSM record navigation

“DBMSs On A Modern Processor: Where Does Time Go?” Ailamaki, DeWitt, Hill, Wood, VLDB’99


DBMS Computational Efficiency

TPC-H 1GB, query 1
• selects 98% of fact table, computes net prices and aggregates all
• Results:
  - C program: ?
  - MySQL: 26.2s
  - DBMS “X”: 28.1s

“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05


DBMS Computational Efficiency

TPC-H 1GB, query 1
• selects 98% of fact table, computes net prices and aggregates all
• Results:
  - C program: 0.2s
  - MySQL: 26.2s
  - DBMS “X”: 28.1s

“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05


Column-Oriented Database Systems MonetDB

VLDB 2009 Tutorial


MonetDB: a column-store

• “save disk I/O when scan-intensive queries need a few columns”
• “avoid an expression interpreter to improve computational efficiency”


RISC Database Algebra

SELECT id, name, (age-30)*50 as bonus
FROM people
WHERE age > 30


RISC Database Algebra

Simple, hardcoded semantics in operators

void batcalc_minus_int(int* res, int* col, int val, int n)
{
  for(int i=0; i<n; i++)
    res[i] = col[i] - val;
}

CPU happy? Give it “nice” code!
- few dependencies (control, data)
- CPU gets out-of-order execution
- compiler can e.g. generate SIMD

One loop for an entire column
- no per-tuple interpretation
- arrays: no record navigation
- better instruction cache locality

MATERIALIZED intermediate results
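To make "one loop for an entire column" and "MATERIALIZED intermediate results" concrete, here is a minimal C sketch of the bonus query evaluated column-at-a-time. It is an illustration under assumptions: apart from `batcalc_minus_int`, which appears on the slide (here with `const` and `size_t` added), the names `select_gt_int`, `fetch_int`, `batcalc_mul_int`, and `bonus_query` are hypothetical and not MonetDB's API.

```c
#include <stddef.h>

/* Select: materialize the positions i where col[i] > val; returns the count. */
size_t select_gt_int(size_t *pos, const int *col, int val, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] > val)
            pos[k++] = i;
    return k;
}

/* Fetch: materialize col[pos[i]] for every selected position. */
void fetch_int(int *res, const int *col, const size_t *pos, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] = col[pos[i]];
}

/* The slide's primitive: one loop over an entire column. */
void batcalc_minus_int(int *res, const int *col, int val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] = col[i] - val;
}

void batcalc_mul_int(int *res, const int *col, int val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] = col[i] * val;
}

/* Evaluate "(age-30)*50 WHERE age > 30" as a chain of operators. Each step
 * writes a full intermediate array (pos, tmp1, tmp2) before the next step
 * starts. The caller supplies scratch arrays of length n. */
size_t bonus_query(int *bonus, const int *age, size_t n,
                   size_t *pos, int *tmp1, int *tmp2)
{
    size_t k = select_gt_int(pos, age, 30, n);
    fetch_int(tmp1, age, pos, k);
    batcalc_minus_int(tmp2, tmp1, 30, k);
    batcalc_mul_int(bonus, tmp2, 50, k);
    return k;
}
```

The scratch arrays are as long as the column itself, which is the cost the later slides refer to: the loops are tight, but every intermediate result is written out in full.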


MonetDB: a column-store

• “save disk I/O when scan-intensive queries need a few columns”
• “avoid an expression interpreter to improve computational efficiency”

How?
• RISC query algebra: hard-coded semantics
  - Decompose complex expressions in multiple operations
• Operators only handle simple arrays
  - No code that handles slotted buffered record layout
• Relational algebra becomes array manipulation language
  - Often SIMD for free
• Plus: use of cache-conscious algorithms for Sort/Aggr/Join


MonetDB: a Faustian pact

• You want efficiency
  - Simple hard-coded operators
• I take scalability
  - Result materialization

- C program: 0.2s
- MonetDB: 3.7s
- MySQL: 26.2s
- DBMS “X”: 28.1s


Column-Oriented Database Systems as a research platform

VLDB 2009 Tutorial


SIGMOD 1985
MonetDB BAT Algebra
MonetDB supports SQL, XML, ODMG, ..RDF
RDF support on C-STORE / SW-Store

• “MIL Primitives for Querying a Fragmented World”, Boncz, Kersten, VLDBJ’98
• “Flattening an Object Algebra to Provide Performance” Boncz, Wilschut, Kersten, ICDE’98
• “MonetDB/XQuery: a fast XQuery processor powered by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD’06
• “SW-Store: a vertically partitioned DBMS for Semantic Web data management” Abadi, Marcus, Madden, Hollenbach, VLDBJ’09


The MonetDB Software Stack

[Figure: the MonetDB software stack, from query language frontends down to the kernel]
- Extensible query language frontend framework: XQuery, SQL 03, RDF, Arrays, OGIS, SOAP, .. SGL?
- Extensible Dynamic Runtime QOPT Framework: Optimizers
- Extensible Architecture-Conscious Execution platform: MonetDB 4, MonetDB 5, MonetDB kernel


MonetDB as a research platform

[Figure: stack with frontend, optimizer, and backend layers]

• Cache-Conscious Joins
  - Cost Models, Radix-cluster, Radix-decluster
  - “Database Architecture Optimized for the New Bottleneck: Memory Access” VLDB’99
  - “Generic Database Cost Models for Hierarchical Memory Systems”, VLDB’02 (all Manegold, Boncz, Kersten)
  - “Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04
• MonetDB/XQuery:
  - structural joins exploiting positional column access
  - “MonetDB/XQuery: a fast XQuery processor powered by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD’06
• Cracking:
  - on-the-fly automatic indexing without workload knowledge
  - “Database Cracking”, CIDR’07
  - “Updating a cracked database”, SIGMOD’07
  - “Self-organizing tuple reconstruction in column-stores”, SIGMOD’09 (all Idreos, Manegold, Kersten)
• Recycling:
  - using materialized intermediates
  - “An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
• Run-time Query Optimization:
  - correlation-aware run-time optimization without cost model
  - “ROX: run-time optimization of XQueries”, Abdelkader, Boncz, Manegold, vanKeulen, SIGMOD’09


Column-Oriented Database Systems “MonetDB/X100” vectorized query processing

VLDB 2009 Tutorial


MonetDB spin-off: MonetDB/X100 Materialization vs Pipelining

VLDB 2009 Tutorial


117


“MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR’05

[Figure: the PROJECT, SELECT, SCAN pipeline. Each next() call on SCAN now returns a vector of tuples: ids (101, 102, 104, 105), names (alice, ivan, peggy, victor), ages (22, 37, 45, 25).]



[Figure: SELECT evaluates > 30 ? on the whole age vector (22, 37, 45, 25) in one call, producing (FALSE, TRUE, TRUE, FALSE).]



“Vectorized In Cache Processing”

vector = array of ~100, processed in a tight loop, CPU cache resident

Observations:
• next() called much less often → more time spent in primitives, less in overhead
• primitive calls process an array of values in a loop

CPU Efficiency depends on “nice” code
- out-of-order execution
- few dependencies (control, data)
- compiler support

Compilers like simple loops over arrays
- loop-pipelining
- automatic SIMD

Primitives in the example pipeline:
> 30 ?   for(i=0; i<n; i++) res[i] = (col[i] > x)
- 30     for(i=0; i<n; i++) res[i] = (col[i] - x)

[Figure: SCAN delivers the vector (101, 102, 104, 105) with ages (22, 37, 45, 25); SELECT evaluates > 30 ? to (FALSE, TRUE, TRUE, FALSE); PROJECT computes - 30 = (7, 15) and * 50 = (350, 750), giving (102, ivan, 350) and (104, peggy, 750).]
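The vector-at-a-time idea can be sketched in C as below. This is a minimal illustration under assumptions, not X100 code: `VECTOR_SIZE`, `vec_select_gt`, `vec_minus`, `vec_mul`, and `vectorized_bonus` are hypothetical names, and the driver loop stands in for the operators' next() calls. Each call handles one vector, so the call overhead is paid once per roughly 100 values while the intermediates stay small.

```c
#include <stddef.h>

#define VECTOR_SIZE 100  /* assumption: "~100" values per vector */

/* Primitives: tight loops over small arrays, no per-value function calls. */
size_t vec_select_gt(int *out, const int *col, int x, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] > x)
            out[k++] = col[i];      /* copies the qualifying values */
    return k;
}

void vec_minus(int *res, const int *col, int x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] = col[i] - x;
}

void vec_mul(int *res, const int *col, int x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] = col[i] * x;
}

/* Driver: one iteration per vector. The intermediates (sel, diff) hold at
 * most VECTOR_SIZE values, so they can stay in the CPU cache, unlike
 * full-column intermediates. Returns the number of results written. */
size_t vectorized_bonus(int *bonus, const int *age, size_t n)
{
    int sel[VECTOR_SIZE], diff[VECTOR_SIZE];
    size_t out = 0;
    for (size_t start = 0; start < n; start += VECTOR_SIZE) {
        size_t len = n - start < VECTOR_SIZE ? n - start : VECTOR_SIZE;
        size_t k = vec_select_gt(sel, age + start, 30, len);
        vec_minus(diff, sel, 30, k);
        vec_mul(bonus + out, diff, 50, k);
        out += k;
    }
    return out;
}
```

Note that `vec_select_gt` still copies qualifying values into a new vector; avoiding that copy is a separate refinement.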

[Figure: SELECT produces a selection vector with the positions (1, 2) of the qualifying tuples in the vector (101, 102, 104, 105) with ages (22, 37, 45, 25).]



Tricks being played:
- Late materialization
- Materialization avoidance

map_mul_flt_val_flt_col(float *res, int *sel, float val, float *col, int n)
{ for(int i=0; i [...]

selection vectors used to reduce vector copying
contain selected positions

[Figure: SELECT passes the selection vector (1, 2) over the vector (101, 102, 104, 105) with ages (22, 37, 45, 25); the - 30 and * 50 primitives produce (350, 750).]
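A selection-vector primitive can be sketched in C as follows. The loop body of `map_mul_flt_val_flt_col` is not recoverable from this text, so the body below is an assumption: when `sel` is non-NULL, `n` counts the selected positions and each result is written at its original position, so `res` stays aligned with `col`. The helper `select_gt_int_col_int_val` is a hypothetical name.

```c
#include <stddef.h>

/* Select primitive: writes the positions of qualifying values into sel and
 * returns how many there are. The data vector itself is not copied. */
size_t select_gt_int_col_int_val(int *sel, const int *col, int val, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] > val)
            sel[k++] = (int)i;
    return k;
}

/* Map primitive with the slide's parameter list. Assumed convention: with a
 * selection vector, only the n selected positions are computed; without one
 * (sel == NULL), all n values of the vector are computed. */
void map_mul_flt_val_flt_col(float *res, const int *sel, float val,
                             const float *col, size_t n)
{
    if (sel) {
        for (size_t j = 0; j < n; j++) {
            int i = sel[j];
            res[i] = val * col[i];
        }
    } else {
        for (size_t i = 0; i < n; i++)
            res[i] = val * col[i];
    }
}
```

With ages (22, 37, 45, 25) the select yields the positions (1, 2); the map then multiplies only those two entries and leaves the others untouched.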





MonetDB/X100

• Both efficiency
  - Vectorized primitives
• and scalability..
  - Pipelined query evaluation

- C program: 0.2s
- MonetDB/X100: 0.6s
- MonetDB: 3.7s
- MySQL: 26.2s
- DBMS “X”: 28.1s


Memory Hierarchy

[Figure: X100 query engine on top of the CPU cache; below it RAM, managed by ColumnBM (buffer manager); below that (raid) Disk(s)]


Memory Hierarchy

Vectors are only the in-cache representation. RAM & disk representation might actually be different (Vectorwise uses both PAX & DSM).

[Figure: X100 query engine, CPU cache, RAM]


Optimal Vector size?

All vectors together should fit the CPU cache. Optimizer should tune this, given the query characteristics.

[Figure: X100 query engine, CPU cache, RAM]
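The constraint "all vectors together should fit the CPU cache" can be turned into a rough upper bound on the vector length. The C sketch below is only arithmetic under assumptions: `max_vector_length` is a hypothetical helper, the cache budget is taken as given, and real tuning would also account for other cache users and for the per-call overhead that favors longer vectors.

```c
#include <stddef.h>

/* Largest vector length (in values) such that all vectors that are live at
 * the same time fit in cache_bytes. bytes_per_value[i] is the value width of
 * the i-th live vector. Returns 0 if there are no vectors or a width is 0. */
size_t max_vector_length(size_t cache_bytes, const size_t *bytes_per_value,
                         size_t n_vectors)
{
    size_t bytes_per_row = 0;
    for (size_t i = 0; i < n_vectors; i++) {
        if (bytes_per_value[i] == 0)
            return 0;
        bytes_per_row += bytes_per_value[i];
    }
    return bytes_per_row ? cache_bytes / bytes_per_row : 0;
}
```

For example, three 4-byte vectors and one 8-byte vector need 20 bytes per position, so a 32 KB budget allows at most 1638 values per vector.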



Varying the Vector size

- Less and less iterator.next() and primitive function calls (“interpretation overhead”)
- Vectors start to exceed the CPU cache, causing additional memory traffic
- The benefit of selection vectors



MonetDB/MIL materializes columns

CPU cache

RAM




133



Benefits of Vectorized Processing

• 100x fewer Function Calls
  - iterator.next(), primitives
• No Instruction Cache Misses
  - High locality in the primitives
• Fewer Data Cache Misses
  - Cache-conscious data placement
• No Tuple Navigation
  - Primitives are record-oblivious, only see arrays
• Vectorization allows algorithmic optimization
  - Move activities out of the loop (“strength reduction”)
• Compiler-friendly function bodies
  - Loop-pipelining, automatic SIMD

“Buffering Database Operations for Enhanced Instruction Cache Performance” Zhou, Ross, SIGMOD’04
“Block oriented processing of relational database operations in modern computer architectures” Padmanabhan, Malkemus, Agarwal, ICDE’01


Vectorizing Relational Operators

• Project
• Select
  - Exploit selectivities, test buffer overflow
• Aggregation
  - Ordered, Hashed
• Sort
  - Radix-sort nicely vectorizes
• Join
  - Merge-join + Hash-join

“Balancing Vectorized Query Execution with Bandwidth Optimized Storage” Zukowski, CWI 2008


Column-Oriented Database Systems Efficient Column Store Compression

VLDB 2009 Tutorial


Key Ingredients

• Compress relations on a per-column basis
  - Columns compress well
• Decompress small vectors of tuples from a column into the CPU cache
  - Minimize main-memory overhead
• Use light-weight, CPU-efficient algorithms
  - Exploit processing power of modern CPUs


Key Ingredients

• Compress relations on a per-column basis
  - Columns compress well
• Decompress small vectors of tuples from a column into the CPU cache
  - Minimize main-memory overhead


Vectorized Decompression

Idea: decompress a vector only

compression:
- between CPU and RAM
- instead of disk and RAM (classic)

[Figure: X100 query engine, CPU cache, RAM]



RAM-Cache Decompression

• Decompress vectors on-demand into the cache
• RAM-Cache boundary only crossed once
• More (compressed) data cached in RAM
• Less bandwidth use


Multi-Core Bandwidth & Compression

[Figure: Performance Degradation with Concurrent Queries. x-axis: number of concurrent queries (1 to 8); y-axis: avg query speed (clock normalized), 0 to 1.2. Series: Intel® Xeon E5410 (Harpertown) - Q6b Normal; Intel® Xeon E5410 (Harpertown) - Q6b Compressed; Intel® Xeon X5560 (Nehalem) - Q6b Normal; Intel® Xeon X5560 (Nehalem) - Q6b Compressed.]



CPU Efficient Decompression

• Decoding loop over cache-resident vectors of code words
• Avoid control dependencies within decoding loop
  - no if-then-else constructs in loop body
• Avoid data dependencies between loop iterations
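A decoding loop that meets both requirements can be sketched in C with frame-of-reference decoding, where each value is a base plus a small code. This is an illustration under assumptions: `for_decode` is a hypothetical name, and the code words are taken to be already unpacked to one byte each, whereas a real format would bit-pack them.

```c
#include <stddef.h>
#include <stdint.h>

/* Frame-of-reference decoding: value = base + code. The loop body has no
 * if-then-else, and iteration i never reads a result of iteration i-1, so
 * the iterations are independent of each other. */
void for_decode(int32_t *out, const uint8_t *code, int32_t base, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = base + (int32_t)code[i];
}
```

Because the iterations are independent, a compiler is free to pipeline or vectorize the loop.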



Disk Block Layout

• Forward growing section of arbitrary size code words (code word size fixed per block)
• Backwards growing exception list



Naïve Decompression Algorithm

• Use reserved value from code word domain (MAXCODE) to mark exception positions
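The naïve scheme can be sketched in C as below. This is an illustration under assumptions: 8-bit code words with 255 reserved as `MAXCODE`, frame-of-reference coding for the regular values, and a plain forward array `exc` for the exceptions rather than the backwards growing list of the block layout. The name `naive_decode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXCODE 255u  /* assumption: 8-bit code words, top value reserved */

/* Decodes n code words into out. Regular codes decode as base + code; a code
 * equal to MAXCODE means "take the next value from the exception list".
 * Returns the number of exceptions consumed. */
size_t naive_decode(int32_t *out, const uint8_t *code, int32_t base,
                    const int32_t *exc, size_t n)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i] == MAXCODE)          /* data-dependent branch per value */
            out[i] = exc[j++];
        else
            out[i] = base + (int32_t)code[i];
    }
    return j;
}
```

The if-then-else sits inside the decoding loop, and the exception counter `j` carries a dependency from one iteration to the next, which are the two things the CPU-efficient design tries to avoid.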



Deterioration With Exception%

• 1.2GB/s deteriorates to 0.4GB/s
• Perf Counters: CPU mispredicts if-then-else



Patching

• Maintain a patch-list through code word section that links exception positions
• After decoding, patch up the exception positions with the correct values



Patched Decompression
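Patched decompression can be sketched in C as two passes. This is a simplified illustration under assumptions, not the published algorithm in full: code words are unpacked bytes, the code word at an exception position stores the distance to the next exception position, exceptions sit in a plain forward array, and gaps larger than the code word range are not handled. The name `patched_decode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Patched frame-of-reference decoding in two passes.
 * At an exception position, code[] does not hold a value; it holds the
 * distance to the next exception position, forming a linked patch list.
 * first_exc is the position of the first exception (ignored if n_exc is 0). */
void patched_decode(int32_t *out, const uint8_t *code, int32_t base,
                    const int32_t *exc, size_t n_exc, size_t first_exc,
                    size_t n)
{
    /* Pass 1: decode every position without branching; the values written
     * at exception positions are meaningless and get overwritten below. */
    for (size_t i = 0; i < n; i++)
        out[i] = base + (int32_t)code[i];

    /* Pass 2: walk the patch list and overwrite the exception positions. */
    size_t cur = first_exc;
    for (size_t j = 0; j < n_exc; j++) {
        size_t next = cur + code[cur];
        out[cur] = exc[j];
        cur = next;
    }
}
```

The first loop has no branch in its body, and the second loop runs once per exception rather than once per value, so its cost stays small when exceptions are rare.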



Decompression Bandwidth

Patching can be applied to:
• Frame of Reference (PFOR)
• Delta Compression (PFOR-DELTA)
• Dictionary Compression (PDICT)

Makes these methods much more applicable to noisy data → without additional cost

• Patching makes two passes, but is faster!


Column-Oriented Database Systems Conclusion

VLDB 2009 Tutorial


Summary (1/2)

• Columns and Row-Stores: different?
  - No fundamental differences
  - Can current row-stores simulate column-stores now?
    - not efficiently: row-stores need change
• On disk layout vs execution layout
  - actually independent issues, on-the-fly conversion pays off
  - column favors sequential access, row random
• Mixed Layout schemes
  - Fractured mirrors
  - PAX, Clotho
  - Data morphing


Summary (2/2)

• Crucial Columnar Techniques
  - Storage
    - Lean headers, sparse indices, fast positional access
  - Compression
    - Operating on compressed data
    - Lightweight, vectorized decompression
  - Late vs Early materialization
    - Non-join: LM always wins
    - Naïve/Invisible/Jive/Flash/Radix Join (LM often wins)
  - Execution
    - Vectorized in-cache execution
    - Exploiting SIMD


Future Work

• looking at write/load tradeoffs in column-stores
  - read-only vs batch loads vs trickle updates vs OLTP


Updates (1/3)

• Column-stores are update-in-place averse
  - In-place: I/O for each column
  - + re-compression
  - + multiple sorted replicas
  - + sparse tree indices

Update-in-place is infeasible!


Updates (2/3)

• Column-stores use differential mechanisms instead
  - Differential lists/files or more advanced (e.g. PDTs)
  - Updates buffered in RAM, merged on each query
  - Checkpointing merges differences in bulk sequentially
    - I/O trends favor this anyway
      § trade RAM for converting random into sequential I/O
      § this trade is also needed in Flash (do not write randomly!)
    - How high a load can it sustain?
      § Depends on available RAM for buffering (how long until full)
      § Checkpoint must be done within that time
      § The longer it can run, the less it interferes with queries
      § Using Flash for buffering differences buys a lot of time
      § Hundreds of GBs of differences per server
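The differential approach can be sketched in C for the simplest case. This is an illustration under assumptions and much simpler than structures such as PDTs: only value changes at existing positions are modelled (no inserts or deletes), the buffered differences are kept sorted by position with at most one entry per position, and `Diff`, `merged_sum`, and `checkpoint` are hypothetical names.

```c
#include <stddef.h>

/* One buffered modification: "the value at position pos is now val". */
typedef struct { size_t pos; int val; } Diff;

/* Query-time merge: sum the column as if the differences had been applied,
 * without writing to the stable column. Both inputs are read sequentially. */
long merged_sum(const int *stable, size_t n, const Diff *diffs, size_t n_diffs)
{
    long sum = 0;
    size_t d = 0;
    for (size_t i = 0; i < n; i++) {
        if (d < n_diffs && diffs[d].pos == i)
            sum += diffs[d++].val;
        else
            sum += stable[i];
    }
    return sum;
}

/* Checkpoint: apply all buffered differences in one pass in position order,
 * after which the buffer can be emptied. Returns the new number of buffered
 * differences, which is 0. */
size_t checkpoint(int *stable, size_t n, const Diff *diffs, size_t n_diffs)
{
    for (size_t d = 0; d < n_diffs; d++)
        if (diffs[d].pos < n)
            stable[diffs[d].pos] = diffs[d].val;
    return 0;
}
```

Queries see the updated values immediately through `merged_sum`, while the stable column is rewritten only at checkpoint time.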


Updates (3/3)

• Differential transactions favored by hardware trends
• Snapshot semantics accepted by the user community
  - can always convert to serialized: “Serializable Isolation For Snapshot Databases” Alomari, Cahill, Fekete, Roehm, SIGMOD’08

→ Row stores could also use differential transactions and be efficient!
  → Implies a departure from ARIES
  → Implies a full rewrite

My conclusion: a system that combines rows and columns needs differentially implemented transactions. Starting from a pure column-store, this is a limited add-on. Starting from a pure row-store, this implies a full rewrite.


Future Work

• looking at write/load tradeoffs in column-stores
  - read-only vs batch loads vs trickle updates vs OLTP
• database design for column-stores
• column-store specific optimizers
  - compression/materialization/join tricks → cost models?
• hybrid column-row systems
  - can row-stores learn new column tricks?
    - Study of the minimal number of changes one needs to make to a row store to get the majority of the benefits of a column-store
    - Alternative: add features to column-stores that make them more like row stores


Conclusion

• Columnar techniques provide clear benefits for:
  - Data warehousing, BI
  - Information retrieval, graphs, e-science
• A number of crucial techniques make them effective
  - Without these, existing row systems do not benefit
• Row-Stores and column-stores could be combined
  - Row-stores may adopt some column-store techniques
  - Column-stores add row-store (or PAX) functionality
• Many open issues to do research on!
