Slides - Parallel Data Storage Workshop

DataMods: Programmable File System Services

Noah Watkins*, Carlos Maltzahn, Scott Brandt (UC Santa Cruz; *Inktank)
Adam Manzanares (California State University, Chico)

1

Talk Agenda
1. Middleware and modern I/O stacks
2. Services in middleware and parallel file systems
3. Avoid duplicating work with DataMods
4. Case study: checkpoint/restart

2

Why DataMods?
• Applications struggle to scale on POSIX I/O
• Parallel file systems rarely provide other interfaces
  – POSIX I/O designed to prevent lock-in
• Open-source parallel file systems are now available
  – Ability to avoid lock-in
• Can we generalize PFS services to provide new behavior to new users?

3

Application Middleware
• Complex data models and interfaces
• Difficult to work directly with a simple byte stream
• Middleware maps the complex onto the simple

4

Middleware Complexity Bloat
• Hadoop and “Big Data” data models
  – Ordered key/value pairs stored in a file
  – Dictionary for random, key-oriented access
  – Common table abstractions

5

Middleware Complexity Bloat
• Scientific data
  – Multi-dimensional arrays
  – Imaging
  – Genomics

6

Middleware Complexity Bloat
• I/O middleware
  – Low-level data models and I/O optimization
  – Transformative I/O avoids POSIX limitations

7

Middleware Scalability Challenges
• Scalable storage system
• Exposes one data model
• Must find ‘magic’ alignment

8

Data Model Modules
• Plug in new “file” interfaces and behavior
• Native support, atop existing scalable services
[Figure: new behavior provided through pluggable customization (a new programmer role) layered on generalized storage services]

9

What does middleware do?

Metadata Management

Data Placement

Intelligent Access

Asynchronous Services

10

Middleware: Metadata Management
• Byte stream layout
• Data type information
• Data model attributes
• Example: Mesh Data Model
  – How is the mesh represented?
  – What does it represent?
[Figure: a file with a header describing the mesh layout]

11

Middleware: Data Placement
• Serialization
• Placement index
• Physical alignment
  – Including the metadata
• Example: Mesh Data Model
  – Vertex lists
  – Mesh elements
  – Metadata
[Figure: file layout with a header followed by interleaved data and metadata segments]

12

Middleware: Intelligent Access
• Data model-specific interfaces
• Rich access methods
  – Views, subsetting, filtering (see the sketch below)
• Write-time optimizations
• Locality and data movement
[Figure: an array-based application issues read(array-slice) through the HDF5 library, which maps the request onto the file's header, data, and metadata segments]

13
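To make the mapping concrete, here is a minimal sketch, not HDF5's implementation, of how array-oriented middleware might translate a 2D array-slice read into byte extents on the underlying byte stream. The element size, header size, and function names are assumptions for illustration.

    # Sketch: translate a row-major 2D array slice into byte extents within a
    # file. Illustrative only; real HDF5 layouts (chunking, filters) are far
    # more involved.

    ELEM_SIZE = 8          # assumed bytes per element (e.g., float64)
    HEADER_BYTES = 4096    # assumed fixed-size header before the raw data

    def slice_to_extents(shape, row_start, row_end, col_start, col_end):
        """Return (file_offset, length) extents covering the requested slice."""
        nrows, ncols = shape
        assert 0 <= row_start <= row_end <= nrows
        extents = []
        for row in range(row_start, row_end):
            elem_index = row * ncols + col_start      # first element in this row
            offset = HEADER_BYTES + elem_index * ELEM_SIZE
            length = (col_end - col_start) * ELEM_SIZE
            extents.append((offset, length))
        return extents

    # Read rows 10-11 and columns 100-199 of a 1000x1000 array.
    print(slice_to_extents((1000, 1000), 10, 12, 100, 200))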

Middleware: Asynchronous Services
• Workflows
  – Regridding
• Compression
• Indexing
• Layout optimization
• Performed online
[Figure: a workflow driver and the HDF5 library operating on the file's header, data, and metadata segments]

14

Middleware Challenges
• Inflexible byte stream abstraction
• Scalability rules are simple
  – But middleware is complex
• Applying ‘magic number’ alignment
  – Unnatural and difficult to propagate
• Loss of detail at lower levels
  – Difficult for in-transit / co-located compute

15

Storage System Services
• Scalable metadata
  – Clustered service
  – Scalability invariants
• Distributed object store
  – Local compute resources
  – Define new behavior
• File operations
  – POSIX
• Fault tolerance
  – Scrubbing and replication

16

DataMods Abstraction
• File Manifold (metadata and data placement)
• Typed and Active Storage
• Asynchronous Services

17

DataMods Architecture
• Generalized file system services
• Exposed through a programming model

18

File Manifold
• Metadata management and data placement
  – Flexible, custom layouts
• Extensible interfaces
• Object namespace managed by the manifold
• Placement rules evaluated by the system (see the sketch below)
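As a rough illustration of a placement rule the system could evaluate, here is a minimal sketch of a striped file manifold. The stripe size, object-naming scheme, and function signature are assumptions, not the DataMods API.

    # Sketch: a simple striped file manifold. The manifold owns an object
    # namespace and maps logical byte extents onto named objects; the storage
    # system evaluates this rule instead of middleware keeping its own map.

    STRIPE_SIZE = 4 * 1024 * 1024  # 4 MiB stripes (illustrative choice)

    def place(file_id, offset, length):
        """Map a logical byte range to (object_name, object_offset, length)."""
        placements = []
        while length > 0:
            stripe = offset // STRIPE_SIZE
            obj_off = offset % STRIPE_SIZE
            chunk = min(length, STRIPE_SIZE - obj_off)
            placements.append((f"{file_id}.stripe.{stripe}", obj_off, chunk))
            offset += chunk
            length -= chunk
        return placements

    # A 10 MiB write starting at offset 6 MiB spans stripe objects 1-3.
    print(place("ckpt42", 6 * 1024 * 1024, 10 * 1024 * 1024))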

19

Typed and Active Storage
• Active storage adoption has been slow
  – Code injection is scary
  – Security and QoS concerns
• Reading, writing, and checksums are not free
• Exposing scalable services is tractable (see the sketch below)
  – Well-defined data models support optimization
  – Programming model supports data model creation
  – Indexing and filtering

20
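One way to picture a typed object: rather than injecting arbitrary code, the object exposes a small, well-defined interface for its data model, so filtering and indexing run next to the data. The class below is a hedged sketch under that assumption, not an actual DataMods interface.

    # Sketch: a typed object storing key/value records, with server-side
    # prefix filtering so clients ship a predicate rather than reading
    # everything back.

    class KeyValueObject:
        def __init__(self):
            self.records = {}                  # key -> value

        def put(self, key, value):
            self.records[key] = value

        def scan(self, key_prefix=""):
            # Filtering happens where the data lives.
            return {k: v for k, v in self.records.items()
                    if k.startswith(key_prefix)}

    obj = KeyValueObject()
    obj.put("host01.temp", 71)
    obj.put("host02.temp", 68)
    print(obj.scan("host01"))                  # -> {'host01.temp': 71}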

Asynchronous Services
• Re-use of active / typed storage components
• Temporal relationship to the file manifold
  – Incremental processing
  – After the file is closed
  – Object update trigger
• Scheduling (see the sketch below)
  – Exploit idle time
  – Integrate with the larger ecosystem
  – Preempted or aborted

21
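A minimal sketch of the object-update-trigger and idle-time-scheduling ideas; the hook names and the queue-based scheduler are assumptions for illustration only.

    # Sketch: deferred, preemptible post-processing driven by object updates.

    import queue

    work = queue.Queue()

    def on_object_update(obj_name):
        # Invoked when an object is written or its file is closed; enqueue
        # work for later instead of processing inline.
        work.put(obj_name)

    def run_when_idle(budget):
        # Drain up to `budget` items; a real scheduler would preempt or abort
        # this when foreground I/O arrives.
        done = 0
        while done < budget and not work.empty():
            print(f"optimizing index for {work.get()}")
            done += 1

    on_object_update("ckpt42.log.0")
    on_object_update("ckpt42.log.1")
    run_when_idle(budget=1)   # partial progress; the rest waits for more idle time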

Case Study: PLFS Checkpoint/Restart
• Long-running simulations need fault tolerance
  – Checkpoint simulation state
• Simulations run on expensive machines
  – Very expensive machines. Really, very expensive.
• Decrease the cost (time) of checkpoint/restart
• Translation: increase bulk I/O bandwidth

22

Overview of PLFS
• Middleware layer
  – Transforms the I/O pattern
• I/O pattern: N-1
  – Most common
• I/O pattern: N-N
  – File-system friendly
• Convert N-1 into N-N (see the sketch below)
• Applications see the same logical file

23
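A minimal sketch of the N-1 to N-N transformation: each process appends to its own log and records where each logical extent landed. The structures and names below are illustrative, not PLFS internals.

    # Sketch: shared-file (N-1) writes become per-process logs (N-N). Each
    # write is appended to the writer's log; an index entry remembers the
    # mapping from logical file offset to (log, physical offset).

    logs = {}      # rank -> list of appended data chunks
    index = []     # (logical_offset, length, rank, physical_offset)

    def plfs_write(rank, logical_offset, data):
        log = logs.setdefault(rank, [])
        physical_offset = sum(len(chunk) for chunk in log)
        log.append(data)
        index.append((logical_offset, len(data), rank, physical_offset))

    def plfs_read(logical_offset, length):
        # Assumes the request matches one entry exactly; real reads stitch
        # together many entries, which is why the index gets so large.
        for off, ln, rank, phys in index:
            if off == logical_offset and ln == length:
                return b"".join(logs[rank])[phys:phys + ln]
        raise IOError("unmapped range")

    plfs_write(rank=0, logical_offset=0, data=b"A" * 1024)
    plfs_write(rank=1, logical_offset=1024, data=b"B" * 1024)
    print(plfs_read(1024, 1024)[:4])          # -> b'BBBB'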

Simplified PLFS I/O Behavior
[Figure: Clients 1, 2, and 3 write through the Parallel Log-structured File System; each client produces its own log-structured data file with an index]

24

PLFS Scalability Challenges
• Index maintenance and volume
• Optimization above the file system
  – Compression and reorganization
[Figure: timeline of application, PLFS, and file system activity; the optimization process runs above the file system and delays the return to compute]

25

Moving Overhead to Storage System
• Checkpoints are not read immediately (if at all)
  – Index maintenance and optimization in storage
• Return to compute sooner
[Figure: the same timeline, with the optimization process running inside the file system while the application returns to compute]

26

DataMods Module for PLFS
• File Manifold
  – Logical file view
  – Per-process log-structured files
  – Index
• Hierarchical solution
  – Top-level manifold routes to logs
  – Inner manifold implements the log-structured file
  – Automatic namespace management (metadata)

27

PLFS Outer File Manifold
• Logical top-half file is not materialized
• Routes to per-process log files

28-29

PLFS Inner File Manifold
• Logical top-half file is not materialized
• Routes to per-process log files
• Append striping within the object namespace
• Index-enabled objects record logical-to-physical mappings
• Interface to index maintenance routines

30-32

Active and Typed Objects
• Append-only object
• Automatic indexing
• Managed layout
• Built on existing services
• Logical view at the lowest level
• Index maintenance interface

33

Offline Index Optimization
• Extreme index fragmentation (per-object)
• Exploit opportunities for optimization
  – Storage system idle time
  – Re-use of analysis I/O
  – Piggy-backed on scrubbing / healing
• Index compression (a sketch of the merging step follows)
  – Merging contiguous entries
  – Pattern discovery and replacement
  – Consolidation

34
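A minimal sketch of the first compression step, merging index entries that are contiguous both logically and physically. The three-field entry format is an assumption for illustration.

    # Sketch: merge adjacent index entries describing contiguous data.
    # An entry is (logical_offset, physical_offset, length); two entries merge
    # when both offsets continue exactly where the previous entry ended.

    def merge_contiguous(entries):
        merged = []
        for log_off, phys_off, length in sorted(entries):
            if merged:
                p_log, p_phys, p_len = merged[-1]
                if p_log + p_len == log_off and p_phys + p_len == phys_off:
                    merged[-1] = (p_log, p_phys, p_len + length)
                    continue
            merged.append((log_off, phys_off, length))
        return merged

    # Four contiguous 1 MiB writes collapse to a single index entry.
    entries = [(i * 2**20, i * 2**20, 2**20) for i in range(4)]
    print(len(merge_contiguous(entries)))   # -> 1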

Offline Index Optimization
• Three-stage pipeline
  – Incremental compression and consolidation
• Incremental compression
  1. Merging physically contiguous entries (as in PLFS)
     • Not subject to buffer size limits
• Applied the technique to 92 PLFS indexes published by LANL

35

Merging Reduces PLFS Index Size
[Chart: number of index entries (log scale, 1 to 10,000,000) for each PLFS map file, raw trace baseline vs. merge compression; large strided workloads see little benefit, while contiguous writes merge down to a handful of entries]

36

Index Compression: Pattern
• Compactly represent extents using patterns
• Example pattern template
  – offset + stride * i,  low < i < high
• Fit data to this pattern to reduce index size
• Linear algorithm; parallel across logs (see the sketch below)
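A hedged sketch of fitting entries to the pattern template above: one linear pass detects fixed-stride runs of equal-sized extents and replaces each run with a single pattern record. The exact encoding used by PLFS/DataMods may differ.

    # Sketch: replace fixed-stride runs of equal-sized index entries with one
    # pattern record: ("pattern", start_offset, stride, length, count).

    def pattern_compress(entries):
        """entries: list of (logical_offset, length), sorted by offset."""
        out, i = [], 0
        while i < len(entries):
            start, length = entries[i]
            j, stride = i + 1, None
            # Extend the run while offsets advance by a constant stride.
            while j < len(entries) and entries[j][1] == length:
                step = entries[j][0] - entries[j - 1][0]
                if stride is None:
                    stride = step
                elif step != stride:
                    break
                j += 1
            if stride is not None and j - i >= 3:
                out.append(("pattern", start, stride, length, j - i))
                i = j
            else:
                out.append(("extent", start, length))
                i += 1
        return out

    # 1000 strided 47 KB writes compress to a single pattern record.
    strided = [(k * 64 * 1024, 47 * 1024) for k in range(1000)]
    print(pattern_compress(strided))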

37

Pattern Compression Improves Over Merging
[Chart: number of index entries (log scale) for each PLFS map file, raw trace baseline vs. merge compression vs. pattern compression; strided patterns are identified and collapse to very few entries]

38

Index Consolidation
• Combines all logs together (in PLFS)
• Increases index read efficiency (see the sketch below)
[Figure: per-log indexes packed into a single index]
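A sketch of the consolidation stage: per-log indexes are merged into one globally sorted, packed index, so a reader resolves lookups with a single fetch instead of opening one index per log. Entry formats and names are assumptions.

    # Sketch: pack per-log indexes into one sorted index keyed by logical
    # offset, then answer lookups with a binary search.

    import bisect

    def consolidate(per_log_indexes):
        """per_log_indexes: {log_name: [(logical_offset, length, physical_offset)]}"""
        packed = []
        for log_name, entries in per_log_indexes.items():
            for log_off, length, phys_off in entries:
                packed.append((log_off, length, log_name, phys_off))
        packed.sort()                       # one global, sorted index
        return packed

    def lookup(packed, logical_offset):
        # Find the entry covering the offset.
        i = bisect.bisect_right(packed, (logical_offset, float("inf"))) - 1
        off, length, log_name, phys = packed[i]
        assert off <= logical_offset < off + length
        return log_name, phys + (logical_offset - off)

    packed = consolidate({
        "log.0": [(0, 1024, 0)],
        "log.1": [(1024, 1024, 0)],
    })
    print(lookup(packed, 1500))             # -> ('log.1', 476)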

39

Wrapping Up
• Implementing new data model plugins
  – Hadoop and visualization
  – Refining the API with more use cases
  – Constructing a specification language
• Thank you to supporters
  – DOE funding (DE-SC0005428); Gary Grider, John Bent, James Nunez
• Questions? [email protected]
• Poster session

40

Extra Slides

41

Index Reduction Improvements
[Chart: reduction factor over baseline for each PLFS map file (PatternIO, LANL applications, Chombo, FLASH, strided, and non-strided traces), comparing reduction from merging against reduction from pattern compression]

Global reduction

HDF5 indexing, data reorganization

42