DataMods: Programmable File System Services

Noah Watkins*, Carlos Maltzahn, Scott Brandt (UC Santa Cruz; *Inktank)
Adam Manzanares (California State University, Chico)
Talk Agenda
1. Middleware and modern I/O stacks
2. Services in middleware and parallel file systems
3. Avoiding duplicated work with DataMods
4. Case study: checkpoint/restart
Why DataMods?
• Applications struggle to scale on POSIX I/O
• Parallel file systems rarely provide other interfaces
  – POSIX I/O was designed to prevent lock-in
• Open-source parallel file systems are now available
  – Lock-in can now be avoided without POSIX
• Can we generalize parallel file system services to provide new behavior to new users?
Application Middleware
• Complex data models and interfaces
• Difficult to work directly with a simple byte stream
• Middleware maps the complex onto the simple
Middleware Complexity Bloat
• Hadoop and "Big Data" data models
  – Ordered key/value pairs stored in a file
  – Dictionary for random key-oriented access
  – Common table abstractions
Middleware Complexity Bloat
• Scientific data
  – Multi-dimensional arrays
  – Imaging
  – Genomics
Middleware Complexity Bloat
• I/O middleware
  – Low-level data models and I/O optimization
  – Transformative I/O avoids POSIX limitations
Middleware Scalability Challenges
• The scalable storage system exposes only one data model
• Middleware must find the 'magic' alignment between its data model and that byte stream
Data Model Modules
• Plug in new "file" interfaces and behavior
• Native support, built atop existing scalable services
(Figure: new behavior layered on generalized storage services, with pluggable customization as a new programmer role)
What does middleware do?
• Metadata Management
• Data Placement
• Intelligent Access
• Asynchronous Services
Middleware: Metadata Management
• Byte stream layout
• Data type information
• Data model attributes
• Example: mesh data model (see the sketch below)
  – How is the mesh represented?
  – What does it represent?
(Figure: data model metadata stored in a header alongside the file's data)
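As a rough illustration, the mesh metadata a middleware layer keeps in a file header might look like the hypothetical structure below; the field names are assumptions, not part of any middleware API.

    from dataclasses import dataclass

    # Hypothetical header describing how a mesh is laid out in a byte stream.
    # Field names are illustrative assumptions.
    @dataclass
    class MeshHeader:
        element_type: str    # how the mesh is represented, e.g. "triangle"
        num_vertices: int
        num_elements: int
        vertex_offset: int   # byte offset of the vertex list in the file
        element_offset: int  # byte offset of the element (connectivity) list
        units: str = "meters"  # what the mesh represents (domain attributes)

    header = MeshHeader("triangle", 1024, 2000, 4096, 4096 + 1024 * 12)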
Middleware: Data Placement
• Serialization
• Placement index
• Physical alignment
  – Including the metadata
• Example: mesh data model
  – Vertex lists
  – Mesh elements
  – Metadata
(Figure: header, data, and metadata regions laid out within the file)
Middleware: Intelligent Access
• Data model specific interfaces
• Rich access methods (see the example below)
  – Views, subsetting, filtering
• Write-time optimizations
• Locality and data movement
(Figure: an array-based application issues read(array-slice) to the HDF5 library, which maps it onto the file's header, data, and metadata regions)
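For concreteness, an array-slice read through HDF5 might look like the following minimal sketch using the h5py Python bindings; the file and dataset names are made up.

    import h5py

    # Read a sub-array ("slice") of a 2-D dataset; the library translates this
    # array-oriented request into byte-range reads against the underlying file.
    # Assumes a file named checkpoint.h5 with a /fields/temperature dataset.
    with h5py.File("checkpoint.h5", "r") as f:
        temperature = f["/fields/temperature"]   # 2-D dataset
        tile = temperature[100:200, 0:50]        # read only the requested slice
        print(tile.shape)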
Middleware: Asynchronous Services
• Workflows
  – Regridding
• Compression
• Indexing
• Layout optimization
• Performed online
(Figure: a workflow driver drives the HDF5 library to rewrite the file's data and metadata)
Middleware Challenges
• Inflexible byte stream abstraction
• Scalability rules are simple
  – But middleware is complex
• Applying the 'magic number'
  – Unnatural and difficult to propagate
• Loss of detail at lower levels
  – Difficult for in-transit / co-located compute
Storage System Services
• Scalable metadata
  – Clustered service
  – Scalability invariants
• Distributed object store
  – Local compute resources
  – Define new behavior
• File operations
  – POSIX
• Fault tolerance
  – Scrubbing and replication
DataMods Abstraction
• File Manifold (metadata and data placement)
• Typed and Active Storage
• Asynchronous Services
DataMods Architecture
• Generalized file system services
• Exposed through a programming model
File Manifold
• Metadata management and data placement
  – Flexible, custom layouts
• Extensible interfaces
• Object namespace managed by the manifold
• Placement rules evaluated by the system (see the sketch below)
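A minimal sketch of what a manifold's placement rule could look like, assuming a simple striped layout; the function name and naming scheme are illustrative, not the DataMods API.

    # Hypothetical placement rule: map a logical byte range of a "file" onto
    # named objects in an object store. Striping policy is an assumption.
    STRIPE_SIZE = 4 * 1024 * 1024  # 4 MiB per object

    def place(file_id: str, logical_offset: int, length: int):
        """Yield (object_name, object_offset, length) tuples covering the range."""
        remaining, offset = length, logical_offset
        while remaining > 0:
            stripe = offset // STRIPE_SIZE
            obj_off = offset % STRIPE_SIZE
            chunk = min(remaining, STRIPE_SIZE - obj_off)
            yield (f"{file_id}.{stripe}", obj_off, chunk)
            offset += chunk
            remaining -= chunk

    # Example: a 6 MiB write starting at offset 2 MiB spans two objects.
    print(list(place("fileA", 2 * 1024 * 1024, 6 * 1024 * 1024)))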
Typed and Active Storage
• Active storage adoption has been slow
  – Code injection is scary
  – Security and QoS concerns
• Reading, writing, and checksums are not free
• Exposing scalable services is tractable
  – Well-defined data models support optimization
  – The programming model supports data model creation
  – Indexing and filtering (see the sketch below)
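To make "typed and active" concrete, the sketch below shows a hypothetical object type that can filter its own records on read, so the computation runs next to the data rather than in middleware; the interface names are assumptions, not the DataMods interface.

    # Hypothetical typed object: the storage system knows the record format,
    # so it can evaluate a filter locally and return only matching records.
    class KeyValueObject:
        def __init__(self):
            self.records = []            # list of (key, value) pairs

        def append(self, key, value):
            self.records.append((key, value))

        def filtered_read(self, predicate):
            """Run the filter where the data lives; ship back only matches."""
            return [kv for kv in self.records if predicate(kv[0])]

    obj = KeyValueObject()
    obj.append("temp/001", 42.0)
    obj.append("pressure/001", 7.3)
    print(obj.filtered_read(lambda k: k.startswith("temp/")))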
Asynchronous Services
• Re-use of active / typed storage components
• Temporal relationship to the file manifold
  – Incremental processing
  – After the file is closed
  – Object update trigger (see the sketch below)
• Scheduling
  – Exploit idle time
  – Integrate with the larger ecosystem
  – May be preempted or aborted
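One way to picture the trigger idea: register a callback that the storage system runs after an object is updated or a file is closed. The registry below is purely illustrative; the event names and functions are assumptions, not DataMods' actual mechanism.

    # Hypothetical asynchronous-service registry: callbacks keyed by event type.
    # The storage system would run these during idle time; they may be
    # preempted or aborted.
    callbacks = {"object_updated": [], "file_closed": []}

    def register(event, fn):
        callbacks[event].append(fn)

    def fire(event, *args):
        for fn in callbacks[event]:
            fn(*args)

    register("file_closed", lambda fname: print(f"compact index for {fname}"))
    fire("file_closed", "checkpoint.plfs")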
Case Study: PLFS Checkpoint/Restart
• Long-running simulations need fault tolerance
  – Checkpoint simulation state
• Simulations run on expensive machines
  – Very expensive machines. Really, very expensive.
• Decrease the cost (time) of checkpoint/restart
  – Translation: increase bulk I/O bandwidth
Overview of PLFS
• Middleware layer
  – Transforms the I/O pattern
• I/O pattern N-1 (all processes write one shared file)
  – Most common
• I/O pattern N-N (one file per process)
  – File system friendly
• PLFS converts N-1 into N-N; applications still see the same logical file (see the sketch below)
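A minimal sketch of the N-1 to N-N idea, assuming each process appends its writes to a private log and records where each logical extent landed; the structure and names are illustrative, not PLFS source code.

    # Illustrative PLFS-style write path: a write to the shared logical file is
    # appended to this process's own log, and an index entry remembers the
    # mapping logical offset -> (physical offset in the log, length).
    class LogWriter:
        def __init__(self, rank):
            self.log_name = f"data.{rank}"   # per-process log file
            self.log_len = 0                 # current append position
            self.index = []                  # (logical_off, phys_off, length)

        def write(self, logical_off, data):
            phys_off = self.log_len
            # append 'data' to self.log_name at phys_off (I/O omitted in this sketch)
            self.log_len += len(data)
            self.index.append((logical_off, phys_off, len(data)))

    w = LogWriter(rank=3)
    w.write(1048576, b"x" * 4096)   # process 3 writes 4 KiB at logical offset 1 MiB
    print(w.index)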
Simplified PLFS I/O Behavior
(Figure: clients 1-3 write through the Parallel Log-structured File System; each client's writes land in its own log-structured file with an accompanying index)
PLFS Scalability Challenges
• Index maintenance and volume
• Optimization above the file system
  – Compression and reorganization
(Figure: timeline of the application, PLFS, and the file system; the optimization process runs above the file system while compute waits)
Moving Overhead to Storage System
• Checkpoints are not read immediately (if at all)
  – Index maintenance and optimization can move into the storage system
(Figure: the same timeline, with the optimization process running inside the file system so the application returns to compute sooner)
DataMods Module for PLFS
• File Manifold
  – Logical file view
  – Per-process log-structured files
  – Index
• Hierarchical solution (see the sketch below)
  – Top-level manifold routes to logs
  – Inner manifold implements the log-structured file
  – Automatic namespace management (metadata)
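The hierarchy might be pictured as in the sketch below: an outer manifold that only routes by writer, and an inner manifold that implements the log. Entirely illustrative; all class and method names are assumptions rather than the actual module code.

    # Illustrative two-level manifold for PLFS. The outer manifold is never
    # materialized as a real file; it routes each process's I/O to an inner,
    # log-structured manifold that stripes appends across objects.
    class InnerLogManifold:
        def __init__(self, rank):
            self.rank = rank
            self.objects = []   # append-striped objects in the object namespace
            self.index = []     # (logical_off, object_no, object_off, length)

        def append(self, logical_off, data):
            obj_no, obj_off = len(self.objects), 0   # simplistic: one object per append
            self.objects.append(data)
            self.index.append((logical_off, obj_no, obj_off, len(data)))

    class OuterManifold:
        """Routes writes from each process to its own inner log manifold."""
        def __init__(self):
            self.logs = {}

        def writer_for(self, rank):
            return self.logs.setdefault(rank, InnerLogManifold(rank))

    outer = OuterManifold()
    outer.writer_for(0).append(0, b"a" * 512)
    outer.writer_for(1).append(512, b"b" * 512)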
PLFS Outer File Manifold
• Logical top-half file is not materialized
• Routes to the per-process log file

PLFS Inner File Manifold
• Append striping within the object namespace
• Index-enabled objects record the logical-to-physical mapping
• Interface to index maintenance routines
Active and Typed Objects
• Append-only object
• Automatic indexing
• Managed layout
• Built on existing services
• Logical view at the lowest level
• Index maintenance interface
Offline Index Optimization
• Extreme index fragmentation (per object)
• Exploit opportunities for optimization
  – Storage system idle time
  – Re-use of analysis I/O
  – Piggy-backed on scrubbing / healing
• Index compression
  – Merging contiguous entries
  – Pattern discovery and replacement
  – Consolidation
Offline Index Optimization
• Three-stage pipeline: incremental compression and consolidation
• Incremental compression
  1. Merging physically contiguous entries (in PLFS; see the sketch below)
     • Not subject to buffer size limits
• Applied the technique to 92 PLFS indexes published by LANL
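A minimal sketch of the merging step, assuming each index entry is (logical offset, physical offset, length) within one log: adjacent entries that are contiguous in both spaces collapse into one. Not the actual PLFS/DataMods code.

    # Merge index entries that are contiguous both logically and physically.
    # Entry format (logical_off, phys_off, length) is an illustrative assumption.
    def merge_contiguous(entries):
        merged = []
        for e in sorted(entries, key=lambda x: x[0]):
            if merged:
                lo, po, ln = merged[-1]
                if lo + ln == e[0] and po + ln == e[1]:   # contiguous in both spaces
                    merged[-1] = (lo, po, ln + e[2])
                    continue
            merged.append(e)
        return merged

    # Three back-to-back 1 MiB writes become a single 3 MiB entry.
    idx = [(0, 0, 1 << 20), (1 << 20, 1 << 20, 1 << 20), (2 << 20, 2 << 20, 1 << 20)]
    print(merge_contiguous(idx))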
Merging Reduces PLFS Index Size
(Figure: number of index entries per PLFS map file, log scale from 1 to 10,000,000, comparing the raw trace baseline against merge compression; annotations mark "Large, Strided" and "Contiguous Writes" workloads)
Index Compression: Pattern
• Compactly represent extents using patterns
• Example pattern template
  – offset + stride * i, for low < i < high
• Fit index data to this pattern to reduce index size (see the sketch below)
• Linear algorithm; parallel across logs
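A rough sketch of fitting the slide's pattern template: a run of entries whose offsets differ by a constant stride is replaced with a single (start, stride, count, length) record. The entry format and function name are assumptions; the actual algorithm is more general.

    # Replace runs of fixed-stride index entries with one pattern record.
    # Entry format (logical_off, length); pattern (start, stride, count, length).
    def pattern_compress(entries):
        out, i = [], 0
        while i < len(entries):
            start, length = entries[i]
            count = 1
            if i + 1 < len(entries):
                stride = entries[i + 1][0] - start
                while (i + count < len(entries)
                       and entries[i + count][0] == start + count * stride
                       and entries[i + count][1] == length):
                    count += 1
            if count > 2:                      # worth replacing with a pattern
                out.append(("pattern", start, stride, count, length))
            else:
                out.extend(entries[i:i + count])
            i += count
        return out

    # Eight 4 KiB writes, strided every 64 KiB, compress to a single record.
    writes = [(k * 65536, 4096) for k in range(8)]
    print(pattern_compress(writes))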
Pattern Compression Improves Over Merging
(Figure: number of index entries per PLFS map file, log scale, for the raw trace baseline, merge compression, and pattern compression; the strided pattern is identified and compressed further)
Index Consolidation
• Combines all per-log indexes together (in PLFS; see the sketch below)
• Increases index read efficiency
(Figure: per-log indexes packed into one consolidated "Index Pack")
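Consolidation might look like the sketch below: gather every log's index, sort by logical offset, and write one global index that a reader can binary-search. Purely illustrative; the entry format and helper names are assumptions.

    import bisect

    # Pack per-log indexes into one global index sorted by logical offset.
    # Entry format (logical_off, log_name, phys_off, length) is an assumption.
    def consolidate(per_log_indexes):
        packed = []
        for log_name, entries in per_log_indexes.items():
            for logical_off, phys_off, length in entries:
                packed.append((logical_off, log_name, phys_off, length))
        packed.sort()
        return packed

    def lookup(packed, logical_off):
        """Find the entry covering logical_off with a single binary search."""
        i = bisect.bisect_right(packed, (logical_off, chr(0x10FFFF))) - 1
        if i >= 0 and packed[i][0] <= logical_off < packed[i][0] + packed[i][3]:
            return packed[i]
        return None

    idx = consolidate({"data.0": [(0, 0, 4096)], "data.1": [(4096, 0, 4096)]})
    print(lookup(idx, 5000))   # entry from data.1 covering offset 5000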
Wrapping Up
• Implementing new data model plugins
  – Hadoop and visualization
  – Refining the API with more use cases
  – Constructing a specification language
• Thank you to supporters
  – DOE funding (DE-SC0005428)
  – Gary Grider, John Bent, James Nunez
• Questions?
  – [email protected]
  – Poster session
Extra Slides
Index Reduction Improvements
(Figure: reduction factor over baseline, from 0 to 1, for merging and for pattern compression across PLFS traces including PatternIO, LANL application, Chombo, FLASH, and strided/non-strided benchmarks; annotations note "Global reduction" and "HDF5 Indexing, Data reorganization")