System Co-Design and Data Management for Flash Devices

Philippe Bonnet, IT University of Copenhagen, Denmark
Luc Bouganim, INRIA and University of Versailles, France
Ioannis Koltsidas, IBM Research, Zurich, Switzerland
Stratis D. Viglas, School of Informatics, University of Edinburgh, United Kingdom

ABSTRACT

Flash devices are emerging as a replacement for disks. How does this evolution impact the design of data management systems? While flash devices have been available for years, this question is still open. In this tutorial, we share two views on the development of data management systems for flash devices. The first view holds that flash devices introduce so much complexity that it is necessary to reconsider the strictly layered approach between storage system, operating system and data management system. The second view holds that data management systems should recognize the complexity of flash devices and leverage the characteristics of different classes of devices for different usage patterns. Throughout the tutorial, we will cover the data management stack: from the fundamentals of flash technology, through storage for database systems and the manipulation of flash-resident data, to query processing.

1. SYSTEM CO-DESIGN

Since the advent of Unix, the stability of disk characteristics and interfaces has guaranteed the timelessness of major database system design decisions: pages are the unit of IO; random accesses are avoided. Today, the quest for energy-proportional systems and the growing performance gap between processors and magnetic disks are pushing flash devices as replacements for disks. Indeed, flash devices rely on tens of flash chips wired in parallel that together can deliver hundreds of thousands of accesses per second with low energy consumption. Flash devices embed complex software, called a Flash Translation Layer (FTL), to hide flash chip constraints (erase-before-write, limited number of erase-write cycles, sequential page writes within a flash block). An FTL provides address translation and wear leveling, and strives to hide the impact of updates and random writes based on observed update frequencies, access patterns and temporal locality.

This trend towards flash devices has created a mismatch between the simple disk model that underlies the design of today's database systems and the complex flash devices of today's computers. This mismatch results in sub-optimal IO performance, which is costly both in terms of throughput and energy consumption. In fact, a tension exists between the design goals of flash devices and DBMSs. Flash device designers aim at hiding the constraints of flash chips to compete with hard disk providers. They also compete with each other, tweaking their FTLs to improve overall performance and masking their design decisions to protect their advantage. Database designers, on the other hand, have full control over the IOs they issue. What they need is a clear and stable distinction between efficient and inefficient IO patterns to produce a stable (re)design of core database techniques. They might even be able to trade increased complexity for improved performance and stable behavior across devices. The goal of the first part of this tutorial is to offer database researchers and practitioners an insight into flash chip management, as well as a survey of the constraints and opportunities it creates for database system or algorithm designers. We will stress the need for a tighter form of collaboration between database system, operating system and FTL to reconcile the complexity of flash chip management with the performance goals of a database system.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 12. Copyright 2011 VLDB Endowment 2150-8097/11/08... $10.00.

2. DATA MANAGEMENT

In the near future, commodity and enterprise-level hardware is expected to incorporate both flash Solid State Drives (SSDs) and magnetic disks as storage media. In light of this, fundamental principles of data management need to be revisited, as all existing database systems and algorithms have been designed for disks consisting of rotating platters. However, the term SSD covers multiple classes of devices. The only major characteristic common to all these devices is their excellent random read performance. The remaining characteristics vary by more than two orders of magnitude across different devices. Some SSDs are more than an order of magnitude slower than disks at random writes, while other SSDs dominate disks in both random read and write throughput and latency. The most important question to be answered is what the best use of each class of device in a DBMS is. Equally important is how this question can be answered automatically, by the DBMS itself, without administrator intervention. The answer also depends on the amount of main memory available and the number, size, rotational speed and RAID configuration of the underlying disks. Possible answers are (a) using the SSD as persistent storage, either in combination with disks or by itself, (b) using the SSD as a read cache for the HDDs, as a write cache, or as a combined read-write cache, (c) using the SSD as a transactional log, (d) using the disk as a log-structured write cache for the SSD, (e) using the SSD as a temporary buffer for specific query evaluation algorithms (e.g., sorting), and, of course, (f) any combination of the above. The aim of the second part of the tutorial is to present the challenges that arise when flash technology is introduced in a database system context, the recent results in this fresh research area, and an outlook on existing problems and things to come.

3. REFERENCES

[1] D. Agrawal et al. Lazy-adaptive tree: An optimized index structure for flash devices. Proc. VLDB Endow., 2(1):361-372, 2009.
[2] N. Agrawal et al. Design tradeoffs for SSD performance. In USENIX Annual Technical Conference, pages 57-70, 2008.
[3] P. A. Bernstein et al. Hyder - a transactional record manager for shared flash. In CIDR, 2011.
[4] M. Bjørling et al. Performing sound flash device measurements: some lessons from uFLIP. In SIGMOD, pages 1219-1222, 2010.
[5] M. Bjørling et al. Understanding the energy consumption of flash devices with uFLIP. IEEE Data Eng. Bull., pages 1-7, 2010.
[6] P. Bonnet and L. Bouganim. Flash device support for database management. In CIDR, 2011.
[7] L. Bouganim et al. uFLIP: Understanding flash IO patterns. In CIDR, 2009.
[8] M. Canim et al. An object placement advisor for DB2 using solid state storage. Proc. VLDB Endow., 2(2):1318-1329, 2009.
[9] M. Canim et al. SSD bufferpool extensions for database systems. Proc. VLDB Endow., 3(2):1435-1446, 2010.
[10] B. Debnath et al. FlashStore: High throughput persistent key-value store. Proc. VLDB Endow., 3(2):1414-1425, 2010.
[11] E. Gal and S. Toledo. Algorithms and data structures for flash memories. ACM Comput. Surv., 37(2):138-163, 2005.
[12] J. Gray. Tape is dead, disk is tape, flash is disk. http://research.microsoft.com/en-us/um/people/gray/talks/Flash_is_Good.ppt, 2006.
[13] A. Gupta et al. DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings. In ASPLOS, 2009.
[14] A. Holloway. Adapting database storage for new hardware. PhD thesis, University of Wisconsin-Madison, 2009.
[15] X.-Y. Hu et al. Write amplification analysis in flash-based solid state drives. In SYSTOR, pages 10-18, 2009.
[16] H. Kim and S. Ahn. BPLRU: a buffer management scheme for improving random writes in flash storage. In FAST, 2008.
[17] J. Kim et al. A space-efficient flash translation layer for CompactFlash systems. IEEE Trans. on Consumer Electronics, pages 366-375, 2002.
[18] Y. Kim et al. FlashSim: A simulator for NAND flash-based solid-state drives. In SIMUL, pages 125-131, 2009.
[19] I. Koltsidas and S. D. Viglas. Flashing up the storage layer. Proc. VLDB Endow., 1(1):514-525, 2008.
[20] I. Koltsidas and S. D. Viglas. Designing a flash-aware two-level cache. In ADBIS, 2011.
[21] S.-W. Lee and B. Moon. Design of flash-based DBMS: An in-page logging approach. In SIGMOD, pages 55-66, 2007.
[22] S. Lee et al. LAST: locality-aware sector translation for NAND flash memory-based storage systems. SIGOPS Oper. Syst. Rev., 42(6):36-42, 2008.
[23] S.-W. Lee et al. A log buffer-based flash translation layer using fully-associative sector translation. ACM Trans. on Embedded Comput. Syst., 2007.
[24] S.-W. Lee et al. A case for flash memory SSD in enterprise database applications. In SIGMOD, pages 1075-1086, 2008.
[25] A. Leventhal. Flash storage memory. Commun. ACM, 51(7):47-51, 2008.
[26] D. Ma et al. LazyFTL: A page-level flash translation layer optimized for NAND flash memory. In SIGMOD, pages 1-12, 2011.
[27] D. Narayanan et al. Migrating server storage to SSDs: analysis of tradeoffs. In EuroSys, pages 145-158, 2009.
[28] S. Nath and P. B. Gibbons. Online maintenance of very large random samples on flash storage. Proc. VLDB Endow., 1(1):67-90, 2008.
[29] S. Nath and A. Kansal. FlashDB: Dynamic self-tuning database for NAND flash. In IPSN, pages 410-419, 2007.
[30] X. Ouyang et al. Beyond block I/O: Rethinking traditional storage primitives. In IEEE HPCA, pages 301-311, 2011.
[31] A. Rajimwale et al. Block management in solid-state devices. In USENIX Annual Technical Conference, 2009.
[32] F. Shu and N. Obr. Data set management commands proposal for ATA8-ACS2. http://www.t13.org/, 2007.
[33] G. Soundararajan et al. Extending SSD lifetimes with disk-based write caches. In FAST, 2010.
[34] R. Stoica et al. Evaluating and repairing write performance on flash devices. In DAMON, pages 9-14, 2009.
[35] M. Stonebraker. Operating system support for database management. Commun. ACM, 24(7):412-418, 1981.
[36] D. Tsirogiannis et al. Query processing techniques for solid state drives. In SIGMOD, pages 59-72, 2009.
[37] X. Wu and A. N. Reddy. Exploiting concurrency to improve latency and throughput in a hybrid storage system. In MASCOTS, pages 14-23, 2010.

System Co-Design and Data Management for Flash Devices (Tutorial Slides, VLDB 2011)

Philippe Bonnet, ITU, Denmark

Luc Bouganim, INRIA, France

Ioannis Koltsidas IBM Research, Switzerland

Stratis D. Viglas University of Edinburgh, United Kingdom

Bonnet, Bouganim, Koltsidas, Viglas, VLDB 2011

Why Bother?

Common objections:
- "Flash devices (SSDs)? Just a SATA drive. I can readily plug flash devices into my server. What is the big deal?"
- "IOs don't matter: the CPU is the critical resource."
- "Disk is disk": ~650 million HDD units shipped in 2010.
- "PCM is coming": 100x faster, 10 million write cycles [Papandreou et al., IMW 2011].

Some Trends...

                2000            change   2010
HDD capacity    200 GB          x10      2 TB
HDD GB/$        0.05            x600     30
HDD IOPS        200             x1       200
SSD capacity    14 GB (2001)    x20      256 GB
SSD GB/$        3x10^-4         x1000    0.5
SSD IOPS        10^3 (SCSI)     x1000    10^6+ (PCIe), 5x10^3+ (SATA)
PCM capacity    2x10^5 cells, 4 bits/cell
PCM IOPS        10^6+ (1 chip)

... and a Fact [Tsirogiannis et al. 2010]

Flash-based SSDs do nothing badly! They offer high throughput at low energy consumption.

SSD-based Systems

With more than 1,000 stores, Danish Supermarket group is one of Denmark's largest retailers. To help keep up with customer needs, the company manages more than 10 terabytes of business intelligence data.

Examples: database appliances (Netezza TwinFin, Oracle Exadata); scaled-up SSD-based blades (Super Micro 6026); scaled-down designs (Amdahl blade [Szalay et al., 2009]).

IOs matter. Systems are being designed and commercialized for efficient data management on flash devices.

Block Device

SSDs and HDDs provide the same memory abstraction: a block device interface. (Figure courtesy of Kaashoek and Saltzer.)

Strong Modularity

SSDs and HDDs provide the same memory abstraction to the application: a block device interface. So there should be no impact on the application (e.g., the DBMS)?

Design Assumptions

Actually, DBMS design is very much based on disk characteristics: (1) locality in the logical space is preserved in the physical space; (2) sequential access is faster than random access.

- Random accesses are avoided.
- Sequential accesses are favored: extent-based allocation, clustering.
- Page-based IO quantization; identical representation in memory and on disk.
- Write-ahead logging; physiological logging.

[Figure: disk anatomy - platter, spindle, tracks, read/write head, actuator, disk arm, controller, disk interface.]

How do flash devices impact DBMS design?

(Bottom-up) We need to understand flash devices a bit better. If they exhibit stable properties, we can derive design principles for data management; if they do not, how do we tackle the increased complexity?

(Top-down) We make assumptions about the behaviour of flash devices, and we design adapted DBMS components. We then need to make sure that (at least some) flash devices actually fit our assumptions.


Tutorial Outline
1. Introduction (Philippe)
2. Flash device characteristics (Luc)
3. Data management for flash devices (Stratis)
4. Two outlooks (Stratis & Philippe)


A short motivating story (1)

- Alice, Bob, Charlie and Dave want to measure the performance of a given data-intensive algorithm on flash devices.
- They use different strategies, but all start from the same IO traces of that algorithm. They own one MTRON SSD and two identical INTEL X25-M SSDs (same model, same firmware; one never used, one used).

IO traces: RW(2000, 2.0, 8000), SR(2000, 16.0), RW(500, 2.0, 8000), RW(500, 2.0, 8000), RR(100, 4.0, 8000), ...

A short motivating story (2): Alice & Bob

- Alice believes in datasheets. She builds a simple SSD simulator configured with basic SSD performance numbers.
- She takes the SSD performance numbers from the datasheet and runs the simulator on the traces.

Mtron datasheet numbers used to configure the simulator, per IO pattern and IO size:

IO size        1     2     4     8
SR             70    81    104   150
RR             87    98    122   167
SW             51    64    85    129
RW             9023  8723  8686  8682

- Bob does not believe in datasheets. He runs simple tests on both SSDs to obtain the basic performance numbers. He then runs Alice's simulator on the traces with his numbers.
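Alice's approach can be sketched as a tiny trace-driven cost model. This is a hypothetical illustration, not the simulator used in the tutorial: the latency table reuses the Mtron datasheet numbers above, which we assume (for illustration only) to be per-IO latencies in microseconds, and each trace entry is read as (pattern, IO count, IO size in KB).

```python
# Hypothetical sketch of Alice's strategy: estimate the cost of an IO
# trace by looking up a per-pattern, per-size latency from datasheet
# numbers. Units (microseconds) are an assumption for illustration.

LATENCY_US = {
    # pattern -> {IO size in KB: assumed latency in microseconds}
    "SR": {1: 70, 2: 81, 4: 104, 8: 150},
    "RR": {1: 87, 2: 98, 4: 122, 8: 167},
    "SW": {1: 51, 2: 64, 4: 85, 8: 129},
    "RW": {1: 9023, 2: 8723, 4: 8686, 8: 8682},
}

def simulate(trace):
    """Estimated total time (ms) of a trace of (pattern, count, size_kb).

    IO sizes not listed in the table fall back to the nearest listed size.
    """
    total_us = 0.0
    for pattern, io_count, io_size_kb in trace:
        table = LATENCY_US[pattern]
        size = min(table, key=lambda s: abs(s - io_size_kb))
        total_us += io_count * table[size]
    return total_us / 1000.0

# A trace in the spirit of RW(2000, 2.0, 8000), SR(2000, 16.0), ...
trace = [("RW", 2000, 2.0), ("SR", 2000, 16.0), ("RW", 500, 2.0)]
print(f"estimated time: {simulate(trace):.1f} ms")  # -> estimated time: 22107.5 ms
```

The point of the story is precisely that such a model, however configured, may bear little resemblance to what the device actually does.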


A short motivating story (3): Charlie & Dave

- Charlie does not believe in Bob! He is more cautious: he runs long tests on the same SSDs and obtains his own basic performance numbers. Then he proceeds as Bob did.
- Dave does not like simulation and runs the traces directly on the SSDs.

What is your take on the resulting measures?

A short motivating story (4): Results

[Figure: measured and simulated response times for Alice (datasheet), Bob (simple tests), Charlie (long tests) and Dave (direct runs), on the MTRON SSD and on the used and never-used INTEL X25-M SSDs; results differ widely across methods and devices.]

- Mtron and Intel devices behave differently.
- Identical Intel devices behave differently.
- Confidence in performance measurements is very low!
- Modeling flash devices seems difficult.
- What about designing algorithms for flash devices, e.g., in database systems, operating systems, applications?

Outline of the first part of this tutorial

Goal: understand the impact of flash memory on software (DBMS) design, and vice versa.

- We study flash chips, explaining their constraints and trends.
- We then consider flash devices as black boxes and try to understand their performance behavior (uFLIP). Goal: find a simple model that can serve as a basis for DBMS design.
- We hit a wall with the black box approach, so we open the box, i.e., the FTL, and look at FTL techniques.
- Finally, we propose an alternative to complex FTLs, better adapted to DBMS design.

The Good: NAND flash chip performance

- A single flash chip offers great performance, e.g., 40 MB/s read, 10 MB/s program.
- Random access is as fast as sequential access.
- Low energy consumption.

The Bad: the severe constraints of NAND flash chips

- C1: Program granularity: programming must be performed at flash page granularity (2 KB-16 KB).
- C2: A block must be erased before any of its pages can be updated (block size: 256 KB-1 MB).
- C3: Pages must be programmed sequentially within a block.
- C4: Limited lifetime (from 10^4 up to 10^5 erase operations).

[Figure: a flash block, e.g., 256 pages; program granularity is a page, erase granularity is a block, and pages must be programmed sequentially within the block.]
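The four constraints can be made concrete with a toy model of a single flash block. This is our own illustrative sketch, not a chip specification: the class name, the 64-page block and the 10,000-cycle budget are assumptions.

```python
# Minimal model of one NAND flash block enforcing C1-C3 and tracking C4.
# Sizes and limits are illustrative assumptions, not chip parameters.

class FlashBlock:
    PAGES_PER_BLOCK = 64

    def __init__(self, max_erase_cycles=10_000):
        self.next_page = 0                  # C3: sequential program pointer
        self.erase_count = 0
        self.max_erase_cycles = max_erase_cycles

    def program(self, page_no):
        # C1: programming happens at page granularity (page_no is a page).
        # C2: a page can only be programmed after an erase, i.e., it must
        # lie at the program pointer; C3: pages are programmed in order.
        if page_no != self.next_page:
            raise ValueError(
                f"C2/C3 violated: expected page {self.next_page}, got {page_no}")
        self.next_page += 1

    def erase(self):
        # C4: blocks wear out after a limited number of erase cycles.
        if self.erase_count >= self.max_erase_cycles:
            raise RuntimeError("C4: block worn out")
        self.erase_count += 1
        self.next_page = 0

block = FlashBlock()
block.program(0)
block.program(1)
try:
    block.program(1)          # in-place update: violates C2/C3
except ValueError as e:
    print("rejected:", e)
block.erase()                 # erase-before-write
block.program(0)              # now the page can be rewritten
```

Everything the FTL does, described in the following slides, exists to hide exactly these four rules from the host.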

Flash Chips

A bit of electronics to understand flash chip constraints and trends.

Flash Cells

- A flash cell resembles a semiconductor transistor, but with 2 gates instead of 1: the floating gate is insulated all around by an oxide layer.
- Electrons placed on the floating gate are trapped there.
- The floating gate will not discharge for many years.

[Figure: a flash cell is a floating gate transistor - control gate, floating gate, oxide layer, N+ source/drain, P substrate.]

Flash Cells: NOR vs NAND

- NOR: quick read (byte), slow program (byte), slow erase; supports execute-in-place (XIP), used for code.
- NAND: slower read (page), quicker program (page), quicker erase (block); used for files and data.

NAND flash cells: mode of operation

- Programming: apply a high voltage (e.g., 20 V) to the control gate; electrons get trapped in the floating gate.
- Erasing: apply a high voltage to the substrate; electrons are removed from the floating gate.
- Reading: the charge changes the threshold voltage of the cell.
  - Single-level cells (SLC) store one bit per cell: charged = 0, not charged = 1.
  - Multi-level cells (MLC) store 2 bits per cell (4 charge levels).
- After a number of program/erase cycles, electrons get trapped in the oxide layer: the cell wears out and reaches its end of life.

[Figure: voltages applied during programming, erasing, and cell wear-out.]

NAND architecture & timings

- Built from independent blocks (about 4 million cells each in this example).
- Block: smallest erasable unit. Page: smallest programmable unit.

Geometry and timings of one NAND flash chip (Micron MLC MT29F128G08CJABB):
  Cell type:      MLC
  Page size:      4 KB (34560 bits/page = 4 KB data + 224 B spare)
  Block size:     1 MB (256 pages/block)
  Chip size:      16 GB
  Read page:      150 µs
  Program page:   1000 µs
  Erase block:    3000 µs

Program Disturb

- Some cells that are not being programmed receive elevated voltage stress (those near the cells being programmed).
- Stressed cells can appear weakly programmed.

Reducing program disturb:
- Use an error correction code (ECC) to recover from errors.
- Program pages sequentially within a block. [Cooke, FMS 2007]

Impact on flash chip IOs

- Flash cell technology: limited lifetime for entire blocks (when a cell wears out, the entire block is marked as failed).
- NAND layout and structure: the block is the smallest erase granularity.
- Program disturb:
  - The page is the smallest program granularity.
  - Pages must be programmed sequentially within a block.
  - Use of ECC is mandatory; the ECC unit is the smallest read unit (generally a page or a fraction of one).

Flash chips: trends

- Density increases (price decreases):
  - NAND process migration is faster than Moore's Law (today 20 nm).
  - More bits/cell: SLC (1), MLC (2), TLC (3).
- Flash chip layout and structure grow larger and more parallel:
  - Larger blocks (32 to 256 pages).
  - Larger pages: 512 B (old SLC) to 16 KB (future TLC).
  - Dual-plane flash: parallelism within the flash chip.
- Lifetime decreases: 100,000 (SLC), 10,000 (MLC), 5,000 (TLC) erase cycles.
- ECC size increases.
- Basic performance decreases, compensated by parallelism.

[Abraham (FMS 2011), StorageSearch.com]

Outline recap: we now consider flash devices as black boxes and try to understand their performance behavior (uFLIP).

The Good: the hardware

- A single flash chip offers great performance, e.g., 40 MB/s read, 10 MB/s program; random access is as fast as sequential access; low energy consumption.
- A flash device contains many (e.g., 32 or 64) flash chips and provides inter-chip parallelism.
- Flash devices may include some (power-failure-resistant) SRAM.

The Bad: the severe constraints of flash chips

- C1: Program granularity: programming must be performed at flash page granularity.
- C2: A block must be erased before any of its pages can be updated.
- C3: Pages must be programmed sequentially within a block.
- C4: Limited lifetime (from 10^4 up to 10^6 erase operations).

And The FTL: the software, the Flash Translation Layer

The FTL emulates a classical block device and handles the flash constraints: the host issues unconstrained sector reads and writes, and the FTL, through its mapping, garbage collection and wear-leveling components, turns them into page reads, page programs and block erases on the flash chips that respect C1 (program granularity), C2 (erase before program), C3 (sequential program within a block) and C4 (limited lifetime).

Flash devices are black boxes!

- Flash devices are not flash chips:
  - They do not behave as the flash chips they contain.
  - There is no access to the flash chip API, only to the device API.
  - Their architecture and software are complex, proprietary and undocumented.
- So flash devices are black boxes, and DBMS design cannot be based on flash chip behavior. We need to understand the behavior of flash devices themselves!

[Figure: the DBMS sees only unconstrained sector reads and writes; inside the SSD, the FTL (mapping, garbage collection, wear leveling) hides the constrained flash chip operations C1-C4.]

Understanding flash device behavior

- Define an experimental benchmark which can exhibit the behavior of flash devices.
- Define a broad benchmark:
  - No safe assumption can be made about the device behavior (black box), e.g., that random writes are expensive.
  - No safe assumption about the benchmark usage either!
- Design a sound benchmarking methodology: IO cost is highly variable and depends on the whole device history!

Methodology (1): Device state

[Figure: random writes on a Samsung SSD, out of the box vs. after filling the device - the response times differ completely.]

- Enforce a well-defined device state by performing random write IOs of random size over the whole device.
- The alternative, sequential IOs, leads to a less stable state and is thus more difficult to enforce.

Methodology (2): Startup and running phases

When do we reach a steady state? How long should each test run?

[Figure: startup and running phases for the Mtron SSD (RW); running phase for the Kingston DTI flash drive (SW).]

Run experiments to define:
- IOIgnore: the number of IOs ignored when computing statistics (startup phase).
- IOCount: the number of measurements needed for those statistics to converge.
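The IOIgnore/IOCount idea can be sketched in a few lines: discard the startup-phase measurements, then compute statistics over a fixed-size running-phase window. The function name and the synthetic latency trace below are illustrative, not part of the uFLIP specification.

```python
# Sketch of the startup/running-phase methodology: ignore the first
# IOIgnore latency measurements, then compute statistics over the next
# IOCount measurements. Names and the synthetic trace are illustrative.

from statistics import mean

def running_phase_stats(latencies_us, io_ignore, io_count):
    """Mean and max latency over the running phase of a measurement."""
    window = latencies_us[io_ignore:io_ignore + io_count]
    if len(window) < io_count:
        raise ValueError("not enough measurements for the running phase")
    return mean(window), max(window)

# Synthetic example: 100 slow startup IOs followed by stable behaviour.
measurements = [5000.0] * 100 + [200.0] * 400
avg, worst = running_phase_stats(measurements, io_ignore=100, io_count=400)
print(f"running phase: mean={avg:.0f} us, max={worst:.0f} us")
```

Choosing IOIgnore and IOCount per device and per pattern is itself an experimental question, which is the slide's point.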

Methodology (3): Interferences

[Figure: response-time trace of sequential reads, then random writes, then sequential reads again, separated by a pause - earlier IOs interfere with the cost of later ones.]

Interferences: introduce a pause between experiments.

Results (1): Samsung, Memoright, Mtron

[Figures: locality for the Samsung, Memoright and Mtron SSDs; granularity for the Memoright SSD.]

- For SR, SW and RR: linear behavior, almost no latency, good throughput with large IO sizes.
- For RW: roughly 5 ms for a 16 KB-128 KB IO.
- When limited to a focused area, RW performs very well.

Results (2): Intel X25-E

[Figures: response time (µs) vs. IO size (KB).]

- SR, SW and RW have similar performance; RR is more costly!
- RW (16 KB) performance varies from 100 µs to 100 ms (a factor of 1000)!

Results (3): Fusion IO

- Capacity vs. performance tradeoff (80 GB formatted down to 22 GB!).
- Sensitivity to device state.

[Figure: response times (µs) for SR, RR, SW and RW at 4 KB IO size, low-level formatted vs. fully written.]

Conclusion: Flash device behavior

Finally, what is the behavior of flash devices? Common wisdom says:
- Updates in place are inefficient?
- Random writes are slower than sequential ones?
- Better not to fill the whole device if we want good performance?

In reality:
- Behavior varies across devices and firmware updates.
- Behavior depends heavily on the device state!

Is this a problem?

Conclusion: Flash device behavior (2)

- Flash devices are difficult (impossible?) to model!
- It is hard to build DBMS designs on such moving ground!

Bill Nesheim, Mythbusting Flash Performance (FMS 2011):
- Substantial performance variability; some cases can be even worse than disk.
- Performance outliers can have a significant adverse impact.
- What's needed: predictable scaling and performance over time; less asymmetry between reads/writes and random/sequential; predictable response time.

Outline recap: we hit a wall with the black box approach, so we now open the box, i.e., the FTL, and look at FTL techniques.

Opening the black box!

FTL: Basic components

[Figure, as above: the FTL turns unconstrained sector reads and writes into page reads, page programs and block erases through its mapping, garbage collection and wear-leveling components, handling constraints C1-C4 of the flash chips.]

FTL: Page-level mapping

- Basic page-level mapping: the translation table is stored in SRAM.
  - Problem: the table is too large! (1 GB of table for 1 TB of flash with 4 KB pages.)
- Demand-based FTL: DFTL (Gupta et al. 2009).
  - The translation table is stored in flash and cached in SRAM.

[Figure: SRAM holds the Global Translation Directory and the Cached Mapping Table; flash holds translation blocks and data blocks.]
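Two points from this slide can be written down as code: the size of a full page-level table, and the DFTL idea of keeping only recently used mapping entries in SRAM. This is a toy sketch under our own assumptions (4-byte entries, an LRU eviction policy, a dict standing in for the flash-resident table), not the DFTL implementation.

```python
# Back-of-the-envelope table sizing plus a toy DFTL-style cached
# mapping table. Entry size, capacity and LRU policy are assumptions.

from collections import OrderedDict

def mapping_table_bytes(capacity_bytes, page_bytes=4096, entry_bytes=4):
    """Size of a full page-level translation table."""
    return (capacity_bytes // page_bytes) * entry_bytes

# 1 TB of flash with 4 KB pages needs a ~1 GiB page-level table.
print(mapping_table_bytes(2**40) / 2**30, "GiB")

class CachedMappingTable:
    """LRU cache over a flash-resident translation table (DFTL idea)."""
    def __init__(self, capacity, full_table):
        self.capacity = capacity
        self.full_table = full_table   # stands in for translation blocks on flash
        self.cache = OrderedDict()
        self.misses = 0

    def lookup(self, logical_page):
        if logical_page in self.cache:
            self.cache.move_to_end(logical_page)
        else:
            self.misses += 1           # would trigger a read of a translation page
            self.cache[logical_page] = self.full_table[logical_page]
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
        return self.cache[logical_page]

table = {lpn: lpn + 1000 for lpn in range(8)}   # toy logical -> physical map
cmt = CachedMappingTable(capacity=2, full_table=table)
for lpn in [0, 1, 0, 2, 0]:
    cmt.lookup(lpn)
print("misses:", cmt.misses)   # -> misses: 3
```

Each miss in DFTL costs an extra flash read (and, on eviction of a dirty entry, a write), which is exactly the overhead the Global Translation Directory is designed to bound.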

FTL: Block-level / hybrid mapping

- Pure block-level mapping:
  - The translation table is kept at block level; the page offset remains the same within the physical block.
  - Does not comply with C3!
- Hybrid mapping:
  - Updates are done out-of-place in log blocks.
  - Data blocks use block mapping; log blocks use page mapping.
  - Proposals differ in the way log blocks are managed: 1 log block per data block (BAST, Kim et al. 2002); n log blocks shared by all data blocks (FAST, Lee et al. 2007); exploiting locality (LAST, Lee et al. 2008).
  - Cleaning when log blocks are exhausted is the major cost.
  - Block mapping for the data blocks does not comply with C3 either!
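The hybrid scheme can be illustrated with a toy BAST-style pair of blocks: reads check the page-mapped log block first, and updates append to it sequentially (respecting C3) until it fills up. Class and method names, and the 4-page block, are our own illustrative assumptions.

```python
# Toy BAST-style hybrid mapping: one data block (block-mapped) paired
# with one log block (page-mapped). Sizes and names are illustrative.

PAGES_PER_BLOCK = 4

class BastBlock:
    def __init__(self):
        self.log_map = {}        # logical page offset -> position in log block
        self.log_used = 0

    def read(self, offset):
        # The freshest copy is in the log block if the page was updated.
        if offset in self.log_map:
            return ("log", self.log_map[offset])
        return ("data", offset)  # block mapping: same offset in the data block

    def update(self, offset):
        if self.log_used == PAGES_PER_BLOCK:
            return False         # log block exhausted -> a merge is needed
        self.log_map[offset] = self.log_used   # C3: append sequentially
        self.log_used += 1
        return True

blk = BastBlock()
blk.update(2)                    # page 2 rewritten out of place
print(blk.read(2))               # served from the log block
print(blk.read(0))               # untouched page, block-mapped
```

When `update` returns `False`, the FTL must clean the pair, which leads directly to the merge cases on the next slide.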

FTL: Garbage collection

- With page mapping: valid pages from victim blocks are copied to a new block, and the victim blocks are erased.
- With hybrid mapping, there are three cases with BAST:
  - Switch: the log block contains all the pages of the data block, in order; erase the old data block and let the log block become the new data block.
  - Partial merge: the log block was written sequentially from page 0; copy the remaining valid pages from the data block into it, then erase the data block.
  - Full merge: the updated pages are scattered; copy every valid page from the data and log blocks into a new block, then erase both.
- Cleaning is more complex with FAST, because pages of the same data block can be spread over different log blocks.
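The relative costs of the three cleaning cases can be captured in a rough accounting model, counting page copies and block erases. This is our own illustrative bookkeeping, not a measured cost model; the 4-page block is an assumption.

```python
# Rough cost model for the three BAST cleaning cases, counted in page
# copies and block erases. Illustrative accounting only.

PAGES_PER_BLOCK = 4

def merge_cost(kind, valid_in_data=0):
    """(page_copies, erases) to clean one data/log block pair."""
    if kind == "switch":        # log block holds all pages, in order
        return (0, 1)           # erase the old data block; log becomes data
    if kind == "partial":       # log written sequentially from offset 0
        return (valid_in_data, 1)    # copy the tail from the data block, one erase
    if kind == "full":          # updates scattered: rebuild into a new block
        return (PAGES_PER_BLOCK, 2)  # copy every page, erase data + log
    raise ValueError(kind)

for kind in ("switch", "partial", "full"):
    print(kind, merge_cost(kind, valid_in_data=2))
```

The ordering switch < partial merge < full merge is why FTLs try hard to detect sequential and semi-random write patterns, which produce cheap switch merges.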

FTL: Wear leveling

- Goal: ensure that all blocks of the flash have about the same erase count (i.e., number of program/erase cycles).
- Basic algorithm: hot-cold swapping (Jung et al. 2007): swap the blocks with the minimum and maximum erase counts.
- Difficulties:
  1. When to trigger the wear-leveling algorithm?
  2. How to manage erase counts, and how to select the min- and max-erase-count blocks, given the limited CPU and memory resources of the flash controller?
  3. What wear-leveling strategy to use?
  4. Interactions between garbage collection and wear leveling.
- The same difficulties arise with garbage collection!
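Hot-cold swapping reduces to a simple rule: when the gap between the most- and least-erased blocks exceeds a threshold, swap their contents so cold data lands on the worn block. The threshold value and the trigger condition below are illustrative choices, not the algorithm of Jung et al.

```python
# Sketch of hot-cold swapping for wear leveling. The threshold and the
# trigger condition are illustrative assumptions.

def maybe_swap(erase_counts, threshold=10):
    """Return (min_block, max_block) to swap, or None if balanced."""
    lo = min(range(len(erase_counts)), key=erase_counts.__getitem__)
    hi = max(range(len(erase_counts)), key=erase_counts.__getitem__)
    if erase_counts[hi] - erase_counts[lo] > threshold:
        # Data on block `lo` is presumably cold (rarely rewritten):
        # move it to the worn block `hi`, and vice versa.
        return (lo, hi)
    return None

print(maybe_swap([5, 120, 8, 7]))    # imbalanced: swap blocks 0 and 1
print(maybe_swap([5, 9, 8, 7]))      # balanced: no swap
```

Even this tiny rule exposes difficulty (2) above: it needs the full erase-count array and two scans over it, which a resource-limited controller cannot afford on every IO.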

FTL: Trends

- Mapping: hybrid mapping; detection of sequential or semi-random writes; exploiting temporal/spatial locality; caching; compression/deduplication; adaptivity; TRIM management; security/encryption.
- Garbage collection: in the background or on demand.
- Wear leveling: considering hot/cold data; dynamic vs. static wear leveling.

FTL designers vs. DBMS designers

- Flash device designers' goals:
  - Hide the flash device constraints (usability).
  - Improve performance for the most common workloads.
  - Make the device auto-adaptive.
  - Mask design decisions to protect their advantage (black box approach).
- DBMS designers' goals:
  - Have a model of IO performance (and behavior): predictable, with a clear distinction between efficient and inefficient IO patterns, to design the storage model and the query processing/optimization strategies.
  - Reach the best performance, even at the price of higher complexity (having full control over actual IOs).

These goals are conflicting!

Outline of the first part of this tutorial

Goal: understand the impact of flash memory on software (DBMS) design and vice versa.

•  We study flash chips, explaining their constraints and trends.
•  We then consider flash devices as black boxes and try to understand their performance behavior (uFLIP).
•  We hit a wall with the black-box approach, so we open the box, i.e., the FTL, and look at FTL techniques.
•  Finally, we propose an alternative to complex FTLs, better adapted to DBMS design.

Minimal FTL: take the FTL out of the equation!

The FTL provides only wear leveling, using block mapping to address C4 (limited lifetime); the DBMS issues only constrained IO patterns.

•  Pros
   –  Maximal performance for sequential reads (SR), random reads (RR), sequential writes (SW) and semi-random writes
   –  Maximal control for the DBMS
•  Cons
   –  All complexity is handled by the DBMS
   –  All IOs must follow C1–C3: the whole DBMS must be rewritten, and the flash device is dedicated

[Diagram: DBMS → constrained patterns only (C1, C2, C3) → minimal flash device (block mapping, wear leveling for C4) → flash chips; constraints: (C1) write granularity, (C2) erase before program, (C3) sequential programming within a block, (C4) limited lifetime]

Semi-random writes (uFLIP [CIDR 09])

•  Inter-block: random
•  Intra-block: sequential
•  Example with 3 blocks of 10 pages — IO addresses over time:
   0, 10, 11, 1, 20, 21, 22, 2, 23, 24, 12, 3, 13, 14, 4, 25, 26, 15, 5, 16, 27, 6, 7, 17, 18, 19, 28, 8, 29, 9
   [Figure: IO address vs. time; writes jump randomly between blocks but progress sequentially within each block]
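The pattern above is easy to generate and to check mechanically. A small sketch (the generator is illustrative, not part of uFLIP):

```python
import random

# Generate a semi-random write sequence: the next block is picked at
# random among the unfinished ones, but pages inside a block are always
# written at the block's current cursor, i.e., strictly sequentially.
def semi_random_writes(num_blocks, pages_per_block, seed=0):
    rng = random.Random(seed)
    cursor = [0] * num_blocks            # next page to write, per block
    order = []
    while any(c < pages_per_block for c in cursor):
        b = rng.choice([i for i, c in enumerate(cursor)
                        if c < pages_per_block])
        order.append(b * pages_per_block + cursor[b])  # absolute address
        cursor[b] += 1
    return order

addrs = semi_random_writes(3, 10)
# Sanity check: within each block, addresses come out in increasing order.
for b in range(3):
    in_block = [a for a in addrs if a // 10 == b]
    assert in_block == sorted(in_block)
```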

Bimodal FTL: a simple idea…

•  Bimodal flash devices:
   –  Provide a "tunnel" for those IOs that respect constraints C1–C3, ensuring maximal performance
   –  Manage the other, unconstrained IOs in best-effort mode
   –  Minimize interference between these two modes of operation
•  Pros
   –  Flexible
   –  Maximal performance and control for the DBMS for constrained IOs
•  Cons
   –  No behavior guarantees for unconstrained IOs

[Diagram: DBMS issues both unconstrained and constrained (C1, C2, C3) patterns to a bimodal flash device; unconstrained IOs go through page mapping and garbage collection, constrained IOs through block mapping and wear leveling (C4); constraints: (C1) program granularity, (C2) erase before program, (C3) sequential programming within a block, (C4) limited lifetime]

Bimodal FTL: easy to implement

•  Constrained IOs lead to optimal blocks:
   –  Flag = Optimal: pages 0–5 written in order, CurPos = 6
   –  Flag = Non-Optimal: e.g., pages 0, 1, 1', 1'', 0', 2 written out of order or repeatedly, CurPos = 6
•  Optimal blocks can be trivially
   –  mapped using a small map table in safe cache (16 MB for a 1 TB device)
   –  detected using a flag and a cursor (CurPos) in safe cache
•  No interference!
•  No change to the block device interface:
   –  Only two constants need to be exposed: block size and page size

Bimodal FTL: better than Minimal + FTL

•  A non-optimal block can become optimal again (thanks to garbage collection):
   –  Free (CurPos = 0) → Optimal: write at @ = CurPos, CurPos++
   –  Optimal → Optimal: write at @ = CurPos, CurPos++
   –  Optimal → Non-optimal: write at @ ≠ CurPos
   –  Optimal or Non-optimal → Free: TRIM
   –  Non-optimal → Optimal: the garbage collector copies the valid pages (e.g., 0', 1'', 2) sequentially into a fresh block (Flag = Optimal, CurPos = 3)
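The bookkeeping the slides describe — one flag plus one write cursor per block — fits in a few lines. A minimal sketch (class and method names are assumptions for illustration):

```python
# Per-block state of a bimodal FTL: a block stays "optimal" as long as it
# is programmed strictly sequentially; any out-of-order write demotes it
# to the best-effort, page-mapped path; TRIM frees it again.
class FlashBlock:
    def __init__(self, pages_per_block):
        self.pages_per_block = pages_per_block
        self.cur_pos = 0
        self.optimal = True          # a free block (CurPos = 0) is optimal

    def write(self, offset):
        if self.optimal and offset == self.cur_pos:
            self.cur_pos += 1        # sequential program: stays optimal
        else:
            self.optimal = False     # out-of-order or repeated write

    def trim(self):
        self.cur_pos, self.optimal = 0, True   # TRIM frees the block

b = FlashBlock(6)
for off in (0, 1, 2):
    b.write(off)
assert b.optimal and b.cur_pos == 3
b.write(1)                           # rewrite of page 1 -> non-optimal
assert not b.optimal
b.trim()
assert b.optimal and b.cur_pos == 0
```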

Impact on DBMS Design

Using bimodal flash devices, we have a solid basis for designing an efficient DBMS on flash:

•  Which IOs should be constrained? (i.e., which parts of the DBMS should be redesigned?)
•  How to enforce these constraints? Revisit the literature:
   –  Solutions based on flash chip behavior enforce the C1–C3 constraints
   –  Solutions based on existing classes of devices might not

Example: Hash Join on HDD

[Figure: one-pass partitioning vs. multi-pass partitioning (2 passes)]

Tradeoff: IOSize vs. memory consumption
•  IOSize should be as large as possible (e.g., 256 KB – 1 MB) to minimize the IO cost of writing and reading partitions.
•  IOSize should be as small as possible to minimize memory consumption: one-pass partitioning needs 2 × IOSize × NbPartitions of RAM.
•  Insufficient memory → multi-pass partitioning → performance degrades!

Hash join on SSD and on bimodal SSD

•  With non-bimodal SSDs:
   –  No behavior guarantees, but choosing IOSize = block size (256 KB – 1 MB) should bring good performance
•  With bimodal SSDs:
   –  Maximal performance is guaranteed (constrained patterns)
   –  Use semi-random writes
   –  IOSize can be reduced down to the page size (4–16 KB) with no penalty → memory savings, performance improvement, predictability!
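The memory impact of shrinking IOSize follows directly from the formula on the previous slide. A back-of-the-envelope sketch (the partition count is an illustrative assumption):

```python
# One-pass partitioning needs 2 x IOSize x NbPartitions of RAM, so
# shrinking IOSize from the SSD block size down to the page size shrinks
# the memory footprint in proportion.
def one_pass_memory(io_size, nb_partitions):
    return 2 * io_size * nb_partitions

parts = 128                                       # illustrative
block_sized = one_pass_memory(256 * 1024, parts)  # 256 KB IOs
page_sized  = one_pass_memory(8 * 1024, parts)    # 8 KB (page-sized) IOs
print(block_sized // (1024 * 1024), "MB vs", page_sized // (1024 * 1024), "MB")
# prints "64 MB vs 2 MB"
```

With the same memory budget, the page-sized configuration can thus hash-join inputs 32× larger before being forced into a second partitioning pass.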

Summary

•  Flash chips
   –  Performance & energy consumption
   –  Wired in parallel in flash devices
•  Hardware constraints:
   (C1) program granularity, (C2) erase before program, (C3) sequential programming within a block, (C4) limited lifetime
•  FTL: a complex piece of software
   –  Constantly evolving, no common behavior
   –  Hard to model
   –  Hard to build a consistent DBMS design on top of it!

Conclusion: DBMS design?

•  Complex FTLs hide the hardware constraints → unpredictable performance → no stable DBMS design.
•  Simple, bimodal FTLs expose the hardware constraints → predictable and optimal performance → a stable DBMS design.
•  Adding bimodality does not hinder competition between flash device manufacturers; they can
   –  bring down the cost of constrained IO patterns (e.g., using parallelism)
   –  bring down the cost of unconstrained IO patterns
   without jeopardizing DBMS design.

Tutorial Outline

1. Introduction (Philippe)
2. Flash devices characteristics (Luc)
3. Data management for flash devices (Stratis)
4. Two outlooks (Stratis & Philippe)

Flash device characteristics (Part 2)

•  Flash: a disruptive technology — orders of magnitude better performance than HDD; flash fits very well between RAM and disk arrays for enterprise storage.
•  Energy efficiency.
•  Reads are faster than writes; erase-before-write limitation.
•  Limited endurance / wear leveling.
•  Access latency independent of the access pattern; 30 to 50 times more efficient in IOPS/$ per GB than HDD (whole medium, 50% random data).

[Figure: latency (ms, 0–2.5) vs. IOPS (0–45,000) for several devices]
•  Mixed workload — read latency: 4 KB I/O operations uniformly distributed over the whole medium; 50% random data; queue depth = 32.
•  Mixed workload — write latency: same setup; latency degrades when writes are involved.

Parallelism

•  The logical block address space is mapped to flash planes, dies and channels; still, some operations are more efficient on hardware.
•  ECC, encryption, etc.; wear leveling still needs to be done by the device firmware.
•  The internal device geometry is critical to achieving maximum parallelism.

[Figure: SSD internals — channel controllers with ECC, each driving several flash chips]

Storage: page placement and buffer management

•  [K & V]: track the workload of a page and place it appropriately
   –  Logical operations (i.e., references only)
   –  Physical operations (actually touching the disk)
   –  Hybrid model (logical operations manifested as physical ones)
   –  Read-intensive pages on flash, write-intensive pages on HDD
   [Figure: hot/cold data classifier in front of the cache]
•  LRU compensation
   –  Sequentially written blocks are moved to the end of the LRU queue: they are the least likely to be written again in the future
   –  If a block has not been fully read, the FTL reads the missing sectors and replaces the data block in one sequential write
   [Figure: block-level LRU queue before and after compensation]

Cost-based replacement

•  The choice of victim depends on the probability of reference (as usual), but the eviction cost is not uniform:
   –  Clean pages bear no write cost; dirty pages result in a write
   –  I/O asymmetry: writes are more expensive than reads
   –  It does not hurt much if we misestimate the heat of a page, so long as we save (expensive) writes
•  Key idea: combine LRU-based replacement with cost-based algorithms
•  Clean-first window:
   –  Working region: business as usual
   –  Clean-first region: candidates for eviction; the number of candidates is called the window size W
   –  Always evict from the clean-first region, clean pages before dirty ones, to save write cost
•  Improvement: split the queue into regions
   –  Time region: typical LRU
   –  Cost region: four LRU queues, one per cost class (clean flash, clean magnetic, dirty flash, dirty magnetic); always evict from the cost region
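The clean-first idea can be sketched in a few lines. A minimal version (function name and cache representation are assumptions; real policies also update recency on hits):

```python
from collections import OrderedDict

# Clean-first eviction: within a window of the W least-recently-used
# pages, evict a clean page if one exists (it costs no flash write);
# fall back to the overall LRU page otherwise.
def pick_victim(lru, window):
    """lru: OrderedDict mapping page -> is_dirty, least recent first."""
    candidates = list(lru.items())[:window]   # the clean-first region
    for page, dirty in candidates:
        if not dirty:
            return page                       # free eviction, no write-back
    return candidates[0][0]                   # LRU dirty page, pay the write

cache = OrderedDict([("p1", True), ("p2", True), ("p3", False), ("p4", True)])
print(pick_victim(cache, 3))  # p3: a clean page inside the window
print(pick_victim(cache, 2))  # p1: no clean page among the first two
```

The window size W is the knob: W = 1 degenerates to plain LRU, while a large W trades recency for write savings.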

oA03)1#D'G0%#2#$$3+")$D'y3%2$32'p'G)"#8#P)D'*)0/$' !!

A%)8'$03*#4/'8#2#4/*'"#5/*' 4*3+,'#2(' >*)0/'' $/g+/2-#""5'

)2.#")(#0/'

AA*)0/'9"31P' $/g+/2-#""5' ?2.#")(#0/'3"('./*$)32$'

X#5'0%/',*)1/'37'#'7/>'/O0*#' */#($'9+0'$#./'0%/'13$0'37' *#2(38'>*)0/$' Bonnet, Bouganim, Koltsidas, Viglas, VLDB 2011

Caching in flash memory

•  Problem setup: the SSD as a caching tier between the buffer pool and disks [citation garbled, EuroSys]; incorporate the write cost; log operations; aggregate changes and push them predictively to the disk cache.
•  Schemes dictate how data migrates across the tiers:
   –  Inclusive: data in memory is also on flash
   –  Exclusive: no page is both in memory and on flash
   –  Lazy: an in-memory page may or may not be on flash, depending on external criteria
•  A cost model predicts how a combination of workload and scheme will behave on a configuration.
•  No magic combination: different schemes for different workloads and different hardware.
•  Can we design efficient secondary-storage indexes — potentially for more than one metric?

Methodologies

•  Avoid expensive operations when updating the index; self-tuning indexing, catering for flash-resident data; combine writes.
•  Starting point: study typical write access patterns.
   –  Fact: random writes hurt performance
   –  But careful analysis of a typical workload shows that writes are rarely completely random; rather, they are semi-random: randomly dispatched across blocks, sequentially written within a block — similar to the locality principles of memory access
   –  Take advantage of this at the structure design level and when issuing writes; bulk writes amortize the write cost
   [Figure: writes seemingly randomly dispatched in time across Block 1 … Block n, but actually written sequentially within each block]

•  [Nath & Kansal, IPSN 2007]: FlashDB — writes are potentially expensive if random
   –  Two modes for B+-tree nodes: Disk mode, and Log mode (maintain log entries for the node and reconstruct it on demand)
   –  Writes are buffered first, then batched and applied by the B+-tree FTL
   –  A translation layer presents a uniform interface for both modes; the system switches between modes by monitoring use (a 2-state task system, with migration costs from disk to log and from log to disk)
   –  A similar logging approach appears in [Wu, Kuo & Chang, ACM Trans. on Embedded Systems, 6(3), 2007]
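The mode switch is a cost comparison per node. A minimal sketch of the self-tuning decision (cost parameters and function names are illustrative assumptions, not FlashDB's actual model):

```python
# A node stays in Log mode while updates dominate (appends are cheap) and
# migrates to Disk mode when reads dominate, since reads in Log mode must
# replay the node's log entries to reconstruct it.
def choose_mode(reads, writes, migrate_cost, log_read_penalty,
                disk_write_penalty):
    log_cost = reads * log_read_penalty                    # reconstruct on read
    disk_cost = writes * disk_write_penalty + migrate_cost  # in-place writes
    return "log" if log_cost < disk_cost else "disk"

# Write-heavy node: staying in Log mode wins.
print(choose_mode(reads=2, writes=50, migrate_cost=10,
                  log_read_penalty=3, disk_write_penalty=2))   # log
# Read-heavy node: paying the migration once is cheaper.
print(choose_mode(reads=80, writes=3, migrate_cost=10,
                  log_read_penalty=3, disk_write_penalty=2))   # disk
```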

•  Hashing on flash
   –  [Li, He, Luo & Yi]: reduce random writes by introducing imbalance
   –  [Wu, Chang & Kuo, GIS 2003]; [K & V]: bucket splitting over a hash directory
      [Figure: directory with per-bucket split (S) and count (C) entries]
   –  Expansion is triggered when the number of splits exceeds some threshold; infrequently used buckets are flushed to the SSD
   –  Write and read buffers and metadata in memory; a recycled append log, organized as a cyclic list of pages, destaged to HDD

Methodologies and outlook

•  Flash-aware algorithms, either by design or through adaptation; offload parts of the computation to flash memory; economies of scale.
•  Old stories, new toys: impact of selectivity on predicate evaluation [Myers, MSc Thesis, MIT, 2007]
   –  Overall, as the selectivity factor increases, performance degrades (needle-in-a-haystack queries)
   –  Mitigate writes through buffering; introduce a hierarchical structure to account for disk-level paging; buffer layer in memory
   [Figure: filter layer on the SSD between a worker and its request queue; persistent storage, possibly in combination with HDD]
