TSUBAME2.0, 2.5 towards 3.0 for Convergence of Extreme Computing and Big Data
Satoshi Matsuoka
Professor, Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
HP-CAST / SC2014 Presentation, New Orleans, USA, November 14, 2014
TSUBAME2.0 (Nov. 1, 2010): "The Greenest Production Supercomputer in the World"
• GPU-centric (>4,000 GPUs), high performance & low power
• Small footprint (~200 m2 / ~2,000 sq. ft), low TCO
• High-bandwidth memory, optical network, SSD storage…
TSUBAME 2.0 New Development (build-up from processor to full system)
• Processors: 40 nm GPUs, 32 nm CPUs
• Node: >400 GB/s memory BW, 80 Gbps network BW, ~1 kW max
• Chassis: >1.6 TB/s memory BW
• Rack: >12 TB/s memory BW, 35 kW max
• Full system: >600 TB/s memory BW, 220 Tbps network bisection BW, 1.4 MW max
TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade (Fall 2013)
HP SL390G7 thin node (developed for TSUBAME2.0, productized as the HP ProLiant SL390s, modified for TSUBAME2.5):
• GPU: NVIDIA Kepler K20X ×3 (6 GB memory per GPU), replacing NVIDIA Fermi M2050 (1039/515 GFlops SFP/DFP) with Kepler K20X (3950/1310 GFlops SFP/DFP)
• CPU: Intel Westmere-EP 2.93 GHz ×2
• Multiple I/O chips, 72 PCIe lanes (16 ×4 + 4 ×2) serving 3 GPUs + 2 InfiniBand QDR HCAs
• Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB ×2 or 120 GB ×2
• Network: InfiniBand QDR ×2 (80 Gbps)
• Per-node peak after upgrade: 4.08 TFlops, ~800 GB/s memory BW, ~1 kW max
Phase-field simulation of dendritic solidification [Shimokawabe, Aoki et al.], winner of the 2011 ACM Gordon Bell Prize (Special Achievements in Scalability and Time-to-Solution)
• Peta-scale phase-field simulations can simulate multiple dendritic growth during solidification, required for the evaluation of new materials: developing lightweight strengthening materials by controlling microstructure, towards a low-carbon society
• Weak scaling on TSUBAME (single precision), mesh size per 1 GPU + 4 CPU cores: 4096 × 162 × 130
  – TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), mesh 4,096 × 6,480 × 13,000
  – TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), mesh 4,096 × 5,022 × 16,640
Application performance, TSUBAME2.0 vs. TSUBAME2.5:

Application (configuration, unit)                          | TSUBAME2.0 | TSUBAME2.5 | Boost Ratio
Top500/Linpack, 4131 GPUs (PFlops)                         | 1.192      | 2.843      | 2.39
Green500/Linpack, 4131 GPUs (GFlops/W)                     | 0.958      | 3.068      | 3.20
Semi-Definite Programming nonlinear optimization, 4080 GPUs (PFlops) | 1.019 | 1.713 | 1.68
Gordon Bell dendrite stencil, 3968 GPUs (PFlops)           | 2.000      | 3.444      | 1.72
LBM LES whole-city airflow, 3968 GPUs (PFlops)             | 0.592      | 1.142      | 1.93
Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day)                | 3.44       | 11.39      | 3.31
GHOSTM genome homology search, 1 GPU (sec)                 | 19361      | 10785      | 1.80
MEGADOC protein docking, 1 node / 3 GPUs (vs. 1 CPU core)  | 37.11      | 83.49      | 2.25
TSUBAME2.0 ⇒ 2.5 Power Improvement
• 18% power reduction including cooling (2012/12 vs. 2013/12)
• Green500: #6 in the world on the 2013/11 list, along with TSUBAME-KFC at #1; #9 on the 2014/6 list
Comparing K Computer to TSUBAME 2.5

Perf ≒ / Cost:             1 Billion | ~6 million | ~700,000
Memory Technology:         GDDR5 + DDR3 | DDR3 | DDR3
Network Technology:        Luxtera Silicon Photonics | Standard Optics | Copper
Non-Volatile Memory / SSD: SSD Flash in all nodes, ~250 TBytes | None | None
Power Management:          Node/System Active Power Cap | Rack-level measurement only | Rack-level measurement only
Virtualization:            KVM (G & V queues, resource segregation) | None | None
TSUBAME3.0: Leadership "Template" Machine
• Under design; deployment 2016H2~H3
• High computational power: ~20 Petaflops, ~5 Petabyte/s memory BW
• Ultra high density: ~0.6 Petaflops DFP/rack (×10 TSUBAME2.0)
• Ultra power efficient: 10 Gigaflops/W (×10 TSUBAME2.0; c.f. TSUBAME-KFC)
  – Latest power control, efficient liquid cooling, energy recovery
• Ultra high-bandwidth network: over 1 Petabit/s bisection, new topology?
  – Bigger capacity than the entire global Internet (several 100 Tbps)
• Deep memory hierarchy and ultra high-bandwidth I/O with NVM
  – Petabytes of NVM, several Terabytes/s BW, several hundred million IOPS
  – Next-generation "scientific big data" support
• Advanced power-aware resource management, high-resiliency SW/HW co-design, VM & container-based dynamic deployment…
Focused Research Towards TSUBAME3.0 and Beyond, Towards Exa
• Software and algorithms for the new memory hierarchy – pushing the envelope of low power vs. capacity; Communication and Synchronization Reducing Algorithms (CSRA)
• Post-petascale networks – topology, routing algorithms, placement algorithms… (SC14 paper, Tue 14:00-14:30, "Fail-in-Place Network…")
• Green computing: power-aware APIs, fine-grained resource scheduling
• Scientific "Extreme" Big Data – GPU Hadoop acceleration, large graphs, search/sort, deep learning
• Fault tolerance – group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Post-petascale programming – OpenACC extensions and other many-core programming substrates
• Performance analysis and modeling – for CSRA algorithms, for big data, for deep memory hierarchy, for fault tolerance, …
TSUBAME-KFC: Towards TSUBAME3.0 and Beyond
• Oil-immersion cooling
• #1 on the Green500 at SC13 and ISC14, … (paper @ ICPADS14)
Extreme Big Data Examples: Rates and Volumes are Extremely Immense
• Social networks – large graph processing
  – Facebook: ~1 billion users, average 130 friends, 30 billion pieces of content shared per month
  – Twitter: 500 million active users, 340 million tweets per day
• Internet
  – 300 million new websites per year
  – 48 hours of video uploaded to YouTube per minute
  – 30,000 YouTube videos played per second
• Social simulation
  – Target area: the planet (OpenStreetMap), 7 billion people
  – Input data: road network for the planet, 300 GB (XML); trip data for 7 billion people, 10 KB per trip × 7 billion = 70 TB
  – Simulated output for one iteration: ~700 TB, in real time
• Weather – large-scale data assimilation over real-time streaming data
  – Ultra high-BW data streams (e.g., social sensors, physical data): phased-array radar, 1 GB / 30 sec / 2 radars; Himawari satellite, 500 MB / 2.5 min
  – Workflow, repeated every 30 seconds: quality control and data processing of both observation streams → ① 30-sec ensemble forecast simulations (2 PFLOP) → ② ensemble data assimilation (2 PFLOP), exchanging ensemble forecasts (200 GB) and ensemble analyses (200 GB) → analysis data (2 GB) → ③ 30-min forecast simulation (1.2 PFLOP) → 30-min forecast (2 GB)
• Genomics – advanced sequence matching
  – Impact of new-generation sequencers: sequencing data (bp) per dollar grows ×4,000 per 5 years, c.f. HPC at ×33 per 5 years [Lincoln Stein, Genome Biology, vol. 11(5), 2010]

Overall: Peta- to Zettabytes of data; NOT simply mining terabytes of silo data; highly unstructured and irregular; complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.
Graph500: a "Big Data" Benchmark
• BFS (a BSP problem) on a Kronecker graph with parameters A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a generator sketch follows below)
• November 15, 2010, "Graph 500 Takes Aim at a New Kind of HPC", Richard Murphy (Sandia NL ⇒ Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list."
• Reality: Top500 supercomputers dominate; no cloud IDCs on the list at all
• The 8th Graph500 list (published at the International Supercomputing Conference, June 22, 2014):
  – #1: RIKEN Advanced Institute for Computational Science (AICS) K computer, 17977.1 GTEPS at Scale 40 [Koji Ueno, Tokyo Institute of Technology / RIKEN AICS]
  – #12: Tokyo Tech GSIC TSUBAME 2.5, 1280.43 GTEPS at Scale 36
  (Congratulations from the Graph500 Executive Committee.)
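The Kronecker parameters above (A=0.57, B=0.19, C=0.19, D=0.05) are what drive Graph500's edge generation. The following is a minimal, illustrative R-MAT-style sketch of that process, not the reference generator (which also permutes vertex labels and scrambles edges); all names and the chosen scale are placeholders.

```cpp
// Minimal R-MAT / Kronecker-style edge generator sketch using the
// Graph500 quadrant probabilities A=0.57, B=0.19, C=0.19, D=0.05.
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

std::pair<uint64_t, uint64_t> rmat_edge(int scale, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    uint64_t src = 0, dst = 0;
    for (int level = 0; level < scale; ++level) {
        double r = uni(rng);
        // Pick one of the four quadrants of the adjacency matrix (A, B, C, D).
        int quad = (r < 0.57) ? 0 : (r < 0.76) ? 1 : (r < 0.95) ? 2 : 3;
        src = (src << 1) | (quad >> 1);   // upper/lower half
        dst = (dst << 1) | (quad & 1);    // left/right half
    }
    return {src, dst};
}

int main() {
    const int scale = 16;                       // 2^16 vertices ("Scale 16", small example)
    const uint64_t num_edges = 16ull << scale;  // edgefactor 16, as in Graph500
    std::mt19937_64 rng(12345);
    std::vector<std::pair<uint64_t, uint64_t>> edges;
    edges.reserve(num_edges);
    for (uint64_t i = 0; i < num_edges; ++i)
        edges.push_back(rmat_edge(scale, rng));
    std::printf("generated %zu edges over %llu vertices\n",
                edges.size(), 1ull << scale);
    return 0;
}
```

The skewed quadrant probabilities are what give the generated graph its scale-free, highly irregular structure, which is exactly why Graph500 stresses bisection bandwidth and fine-grained communication rather than floating-point throughput.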
Supercomputer vs. cloud network: Tokyo Tech TSUBAME 2.0 (#4 on the Top500 in 2010) vs. a major northern Japanese cloud datacenter (2013)
• TSUBAME2.0: ~1,500 compute & storage nodes on a full-bisection, multi-rail optical network; injection 80 Gbps/node, bisection 220 Tbps; built with advanced silicon photonics (40G on a single CMOS die, 1490 nm DFB lasers, up to 100 km fiber)
• Cloud datacenter: 8 zones of ~700 nodes each (5,600 nodes total); Juniper MX480 core routers, EX8208 aggregation switches, and EX4200 zone switches (2 per zone, Virtual Chassis) with 10GbE / LACP uplinks; injection ~1 Gbps/node, bisection 160 Gbps
• TSUBAME2.0's bisection bandwidth is roughly ×1000 the cloud datacenter's, and approximately equals the entire global Internet's average data bandwidth of ~200 Tbps (source: Cisco)
[Poster thumbnail: Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (Tokyo Tech), "Multi-GPU Read Alignment"]
JST-CREST "Extreme Big Data" (EBD) Project (2013-2018)
• Problem domain: future non-silo extreme big data scientific apps, co-designed with the system stack
  – Large-scale metagenomics
  – Massive sensors and data assimilation in weather prediction
  – Ultra-large-scale graphs and social infrastructures
• EBD system software, incl. the EBD object system (EBD Bag, EBD KVS, Graph Store, Cartesian Plane)
• Convergent architecture (Phases 1~4): large-capacity NVM, high-bisection network
  – Target node: NVM/Flash plus DRAM on a TSV interposer/PCB, 4~6 HBM channels at 2 Tbps each, ~1.5 TB/s combined DRAM & NVM BW, low-power CPUs alongside a high-powered main CPU; ~30 PB/s system I/O BW possible, ~1 Yottabyte/year
• Goal: Exascale Big Data HPC. Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds?
  – Cloud IDCs: very low BW & efficiency, but highly available and resilient
  – Supercomputers: compute- & batch-oriented, more fragile
  – Issues regarding architectural, algorithmic, and system software evolution? Use of GPUs?
Towards Extreme-Scale Big Data Machines: Convergence
• Computation – increase in parallelism, heterogeneity, density, BW
  – Multi-core and many-core processors, heterogeneous processors
• Deep hierarchical memory/storage architecture
  – NVM (non-volatile memory) / SCM (storage-class memory): FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.
  – Next-gen HDDs (SMR), tapes (LTFS), cloud
• Problems: algorithm scalability, heterogeneity, power, fault tolerance, network, productivity, locality, storage hierarchy, I/O
EBD "Convergent" System Architecture ("100,000 times fold")
• Problem domain (application groups)
  – Akiyama Group: large-scale genomic correlation / metagenomics
  – Miyoshi Group: data assimilation with large-scale sensors and exascale atmospherics
  – Suzumura Group: large-scale graphs and social infrastructure apps
• Programming layer (Matsuoka Group): SQL for EBD, message passing (MPI, X10) for EBD, graph framework, PGAS/global arrays for EBD, MapReduce for EBD, workflow/scripting languages for EBD, EBD abstract data models (distributed array, key-value, sparse data model, tree, etc.)
• Basic algorithms layer: EBD algorithm kernels (search/sort, matching, graph traversals, etc.) over EBD Bag, KVS, Cartesian Plane, and Graph Store
• System SW layer: (Tatebe Group) EBD file system, EBD data object, EBD burst I/O buffer, EBD KVS on NVM (FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.); (Koibuchi Group) EBD network topology and routing
• Big Data & SC HW layer: HPC storage, web object storage, interconnect (InfiniBand, 100GbE), wide-area network (SINET5), TSUBAME 3.0, TSUBAME-GoldenBox, cloud datacenters, intercloud/grid (HPCI)
The Graph500 – June 2014: K Computer #1
Tokyo Tech [EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], RIKEN AICS

List          | Rank | GTEPS    | Implementation
November 2013 | 4    | 5524.12  | Top-down only
June 2014     | 1    | 17977.05 | Efficient hybrid

Weak-scaling breakdown of elapsed time (ms) into computation and communication: 1236 MTEPS/node at 64 nodes (Scale 30) drops to 274 MTEPS/node at 65,536 nodes (Scale 40); problem size is weak-scaled. At full scale, 73% of total execution time is spent waiting in communication.
EBD Programming Framework: Out-of-Core GPU MapReduce for Large-Scale Graph Processing [Cluster 2014]
• Emergence of large-scale graphs – SNS, road networks, smart grids, etc., with millions to trillions of vertices/edges → need for fast graph processing on supercomputers
• Problem: GPU memory capacity limits scalable large-scale graph processing
• Proposal: out-of-core GPU memory management in MapReduce
  – Stream-based GPU MapReduce: the input is split into chunks, each chunk is copied host-to-device, processed by Map/Shuffle/Sort/Scan/Reduce operations on the GPU, and copied back device-to-host, overlapping memcpy (H2D, D2H) with GPU computation (a sketch of this chunked pipeline follows after this slide)
  – Out-of-core GPU sorting
• Experimental results: performance improvement over CPUs – Map: 1.41×, Reduce: 1.49×, Sort: 4.95× speedup, by overlapping communication effectively
• Weak scaling on TSUBAME2.5 (performance in MEdges/sec vs. number of compute nodes up to ~1,500, comparing 1 CPU and 1 GPU per node at Scale 23, and 2 CPUs, 2 GPUs, and 3 GPUs per node at Scale 24): 2.10× speedup for 3 GPUs vs. 2 CPUs per node
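To make the stream-based idea concrete, here is a minimal CUDA sketch of chunked, double-buffered processing in which host-to-device copies, a map-style kernel, and device-to-host copies overlap across two streams. It illustrates the general technique only, not the paper's implementation; names such as map_kernel and CHUNK are placeholders.

```cuda
// Minimal sketch: out-of-core, stream-based chunk processing on a GPU.
// Two CUDA streams double-buffer H2D copy, kernel, and D2H copy so that
// data transfer overlaps with computation (map_kernel is a stand-in).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

__global__ void map_kernel(const int* in, int* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;      // stand-in for the real map operation
}

int main() {
    const size_t total = 1 << 26;       // dataset larger than we want resident on the GPU
    const size_t CHUNK = 1 << 22;       // chunk that fits comfortably in device memory
    std::vector<int> h_in(total, 1), h_out(total, 0);

    int *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_in[b],  CHUNK * sizeof(int));
        cudaMalloc(&d_out[b], CHUNK * sizeof(int));
        cudaStreamCreate(&stream[b]);
    }

    for (size_t off = 0, c = 0; off < total; off += CHUNK, ++c) {
        int b = c % 2;                  // alternate buffers/streams for overlap
        size_t n = std::min(CHUNK, total - off);
        cudaMemcpyAsync(d_in[b], h_in.data() + off, n * sizeof(int),
                        cudaMemcpyHostToDevice, stream[b]);
        map_kernel<<<(unsigned)((n + 255) / 256), 256, 0, stream[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out.data() + off, d_out[b], n * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();
    std::printf("first output element: %d\n", h_out[0]);

    for (int b = 0; b < 2; ++b) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(stream[b]);
    }
    return 0;
}
```

In practice the host buffers would be allocated with cudaMallocHost (pinned memory) so that the asynchronous copies truly overlap with the kernels, and the per-chunk work would be the full Map/Shuffle/Sort/Reduce pipeline rather than a single kernel.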
EBD Algorithm Kernels: GPU-HykSort [IEEE BigData 2014]
• Motivation: the effectiveness of sorting on large-scale GPU-based heterogeneous systems remains unclear
  – Appropriate selection of the phases to offload to the GPU is required
  – GPU memory overflow must be handled
• Approach: offload the local sort, the most time-consuming phase, to GPU accelerators
• Implementation: each process separates its unsorted array into chunks; for each chunk (Iter 1 … 4), the unsorted chunk is transferred to GPU memory, sorted on the GPU, and transferred back to DRAM, overlapping data transfer with sorting; the sorted chunks are then merged into a sorted array and splitters are selected as in HykSort (a chunked GPU local-sort sketch follows after this slide)
• Weak-scaling performance (keys/second in billions, 2 processes per node, up to ~1,024 nodes / ~2,048 GPUs), comparing HykSort with 1 thread, 6 threads, and GPU + 6 threads: the chart annotates speedups of 3.6× and 1.4×, with the GPU configuration reaching ~0.25 TB/s aggregate sort throughput
• Performance prediction: GPU-HykSort achieves a further 2.2× performance improvement with a 50 GB/s CPU-GPU interconnect
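As an illustration of the chunked local sort (not the GPU-HykSort code itself), the sketch below sorts chunks on the GPU with Thrust and merges them on the host; the chunk size and the simple repeated two-way merge are placeholder choices.

```cuda
// Minimal sketch: out-of-core local sort. Chunks that fit in GPU memory are
// sorted on the device with thrust::sort, copied back, then merged on the host.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

void chunked_gpu_sort(std::vector<long long>& keys, size_t chunk) {
    std::vector<long long> merged;
    for (size_t off = 0; off < keys.size(); off += chunk) {
        size_t n = std::min(chunk, keys.size() - off);
        thrust::device_vector<long long> d(keys.begin() + off, keys.begin() + off + n);
        thrust::sort(d.begin(), d.end());                   // sort the chunk on the GPU
        thrust::copy(d.begin(), d.end(), keys.begin() + off);
        // Merge the newly sorted chunk into the running sorted prefix on the host.
        std::vector<long long> tmp(merged.size() + n);
        std::merge(merged.begin(), merged.end(),
                   keys.begin() + off, keys.begin() + off + n, tmp.begin());
        merged.swap(tmp);
    }
    keys.swap(merged);
}

int main() {
    std::mt19937_64 rng(7);
    std::vector<long long> keys(1 << 24);
    for (auto& k : keys) k = static_cast<long long>(rng());
    chunked_gpu_sort(keys, 1 << 22);                        // 4M-key chunks as an example
    std::printf("sorted: %s\n",
                std::is_sorted(keys.begin(), keys.end()) ? "yes" : "no");
    return 0;
}
```

A real implementation would overlap the chunk transfers with sorting (as in the stream sketch above) and use a k-way merge rather than repeated two-way merges, before handing the locally sorted data to HykSort's splitter selection and exchange phases.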
EBD Algorithm Kernels
Efficient Parallel Sorting Algorithm for Variable-Length Keys
• Comparison-based sorts are inefficient for long and variable-length keys such as strings
• Better approach: examine individual characters (e.g., grouping "apple", "apricot", "banana", "kiwi" by their leading characters), based on the MSD radix sort algorithm (a small sketch follows below)
• Hybrid parallelization scheme combining data-parallel and task-parallel stages
• Achieves 70 M keys/second sorting throughput on 100-byte strings

References:
– Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. "Efficient String Sorting on Multi- and Many-Core Architectures." Proceedings of the IEEE 3rd International Congress on Big Data, Anchorage, USA, August 2014.
– Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. "MSD Radix String Sort on GPU: Longer Keys, Shorter Alphabets." Proceedings of the 142nd Joint Workshop on High Performance Computing (HOKKE-21).
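A minimal sketch of the MSD radix idea for strings: keys are bucketed by the character at the current depth and each bucket is recursed on, so only the distinguishing prefix of each key is ever examined. This is a sequential illustration of the principle; the papers' contribution is the hybrid data-/task-parallel multi-core and GPU version.

```cpp
// Minimal sequential MSD radix sort for strings: bucket by the character at
// position `depth`, then recurse on each bucket. Keys shorter than `depth`
// go to a dedicated "exhausted" bucket and stay in front.
#include <array>
#include <cstdio>
#include <string>
#include <vector>

void msd_radix_sort(std::vector<std::string>& keys, size_t depth = 0) {
    if (keys.size() <= 1) return;
    std::array<std::vector<std::string>, 257> buckets;      // bucket 0 = exhausted keys
    for (auto& k : keys) {
        int b = (depth < k.size()) ? 1 + static_cast<unsigned char>(k[depth]) : 0;
        buckets[b].push_back(std::move(k));
    }
    keys.clear();
    for (int b = 0; b < 257; ++b) {
        if (b != 0) msd_radix_sort(buckets[b], depth + 1);   // recurse on the next character
        for (auto& k : buckets[b]) keys.push_back(std::move(k));
    }
}

int main() {
    std::vector<std::string> keys = {"banana", "apple", "kiwi", "apricot", "app"};
    msd_radix_sort(keys);
    for (const auto& k : keys) std::printf("%s\n", k.c_str());
    return 0;
}
```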
EBD Algorithm Kernels: Large-Scale Graph Processing Using NVM [IEEE BigData 2014]
1. Hybrid BFS (Beamer '11): switches between two traversal approaches, top-down and bottom-up, based on the number of frontier vertices n_frontier relative to the number of all vertices n_all, with tuning parameters α and β (a switching sketch follows below)
2. Proposal: DRAM holds the highly accessed graph data, NVM holds the full-size graph; the highly accessed graph data is loaded into DRAM before BFS starts
3. Experiment: CPU Intel Xeon E5-2690 ×2 with 256 GB DRAM; NVM: EBD-I/O, 2 TB ×2 (8× mSATA SSDs behind a RAID-0 card)
  – Results (median GTEPS, Giga Traversed Edges Per Second, for SCALE 23…31, where #vertices = 2^SCALE): with DRAM + EBD-I/O, a 4× larger graph than the DRAM-only limit is processed with only 6.9% performance degradation (roughly 4.1 GTEPS DRAM-only vs. 3.8 GTEPS with DRAM + EBD-I/O)
  – Ranked 3rd in the Big Data category of the Green Graph500 (June 2014): Tokyo Institute of Technology's GraphCREST-Custom #1 achieved 35.21 MTEPS/W at Scale 31 on the third Green Graph 500 list, published at the International Supercomputing Conference, June 23, 2014, with congratulations from the Green Graph 500 Chair
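The direction switch in hybrid BFS is driven by simple frontier-size heuristics. Below is a minimal, self-contained sketch of that switching logic using the α/β parameters named above; the adjacency-list graph, the two traversal steps, and the default parameter values are illustrative placeholders (Beamer's original test compares edge counts of the frontier against the unexplored region).

```cpp
// Minimal sketch of hybrid (direction-optimizing) BFS with a frontier-size
// switching heuristic. Real implementations use CSR, bitmaps, and
// NUMA/NVM-aware data placement.
#include <cstdint>
#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int64_t>>;   // adjacency list

static std::vector<int64_t> top_down_step(const Graph& g,
                                          const std::vector<int64_t>& frontier,
                                          std::vector<int64_t>& parent) {
    std::vector<int64_t> next;
    for (int64_t u : frontier)
        for (int64_t v : g[u])
            if (parent[v] < 0) { parent[v] = u; next.push_back(v); }
    return next;
}

static std::vector<int64_t> bottom_up_step(const Graph& g,
                                           const std::vector<char>& in_frontier,
                                           std::vector<int64_t>& parent) {
    std::vector<int64_t> next;
    for (int64_t v = 0; v < (int64_t)g.size(); ++v) {
        if (parent[v] >= 0) continue;               // already visited
        for (int64_t u : g[v])
            if (in_frontier[u]) { parent[v] = u; next.push_back(v); break; }
    }
    return next;
}

void hybrid_bfs(const Graph& g, int64_t root, std::vector<int64_t>& parent,
                double alpha = 14.0, double beta = 24.0) {  // illustrative defaults
    const double n_all = (double)g.size();
    parent.assign(g.size(), -1);
    parent[root] = root;
    std::vector<int64_t> frontier{root};
    bool bottom_up = false;
    while (!frontier.empty()) {
        double n_frontier = (double)frontier.size();
        // Go bottom-up when the frontier is a large fraction of all vertices,
        // return to top-down once it shrinks again.
        if (!bottom_up && n_frontier > n_all / alpha)      bottom_up = true;
        else if (bottom_up && n_frontier < n_all / beta)   bottom_up = false;

        if (bottom_up) {
            std::vector<char> in_frontier(g.size(), 0);
            for (int64_t u : frontier) in_frontier[u] = 1;
            frontier = bottom_up_step(g, in_frontier, parent);
        } else {
            frontier = top_down_step(g, frontier, parent);
        }
    }
}

int main() {
    Graph g = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};     // small example graph
    std::vector<int64_t> parent;
    hybrid_bfs(g, 0, parent);
    for (size_t v = 0; v < parent.size(); ++v)
        std::printf("parent[%zu] = %lld\n", v, (long long)parent[v]);
    return 0;
}
```

The bottom-up step is the one that benefits from keeping the "highly accessed" adjacency data in DRAM, since it repeatedly scans unvisited vertices' neighbor lists; the rarely touched remainder of the graph can stay on NVM, which is what the EBD-I/O result above exploits.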
Software Technology that Deals with Deeper Memory Hierarchy in the Post-Petascale Era (JST-CREST project, 2012-2018, PI Toshio Endo)
• Communication/BW-reducing algorithms
• + System software for memory hierarchy management
• + HPC architecture with hybrid memory devices: HMC/HBM, Flash (O(GB/s)), next-gen NVM
• Target: realizing extremely fast & big simulations of {O(100 PF/s) or O(10 PB/s)} & O(10 PB) around 2018
Supporting Larger Domains than GPU Device Memory for Stencil Simulations
• Desired problem sizes are >> the memory on a TSUBAME2.5 node's GPU card
  – GPU cores with a 1.5 MB L2$; 6 GB GPU memory at 250 GB/s; CPU cores with 54 GB host memory reachable only at 8 GB/s over PCIe
• Caution: simply "swapping out" to the larger host memory is disastrously slow; the PCIe traffic is too large!
• The keys are "communication avoiding & locality improvement" algorithms
Temporal Blocking (TB) for Communication Avoiding
• Performs multiple (s-step) updates on a small block before proceeding to the next block
  – Originally proposed to improve cache locality [Kowarschik 04][Datta 08]
• Introduces a "larger halo": because of data dependencies with neighboring blocks, each block is read with an enlarged halo and some redundant computation is performed across the s steps (Step 1 … Step 4 along simulated time)
  – The redundancy can be removed when blocks are computed sequentially [Demmel 12]
• Multi-level TB reduces both PCIe traffic and device memory traffic (a single-block sketch follows below)
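A minimal 1D illustration of the idea: for each block, s time steps are applied to a working copy that includes an s-cell halo, so the block only has to be read from (and written back to) the slower memory once per s steps. The 3-point smoothing update, array names, and parameters are placeholders; real codes apply this to 3D stencils across the GPU/host memory boundary.

```cpp
// Minimal 1D temporal-blocking sketch: update each block s steps at a time
// using an enlarged (s-cell) halo, trading a little redundant computation
// for far fewer passes over the slow (e.g., host) memory.
#include <algorithm>
#include <cstdio>
#include <vector>

void stencil_temporal_blocking(std::vector<double>& a, int steps, int block, int s) {
    const int n = (int)a.size();
    for (int t = 0; t < steps; t += s) {
        std::vector<double> next(a);                        // output of this s-step sweep
        for (int b0 = 1; b0 < n - 1; b0 += block) {
            int b1 = std::min(b0 + block, n - 1);
            // Working copy of the block plus an s-cell halo on each side.
            int lo = std::max(0, b0 - s), hi = std::min(n, b1 + s);
            std::vector<double> work(a.begin() + lo, a.begin() + hi);
            std::vector<double> tmp(work.size());
            for (int k = 0; k < s && t + k < steps; ++k) {  // s updates in fast memory
                for (int i = 1; i + 1 < (int)work.size(); ++i)
                    tmp[i] = 0.25 * work[i - 1] + 0.5 * work[i] + 0.25 * work[i + 1];
                tmp.front() = work.front(); tmp.back() = work.back();
                work.swap(tmp);                             // halo validity shrinks each step
            }
            // After s steps only the block interior is valid; copy it back out.
            for (int i = b0; i < b1; ++i) next[i] = work[i - lo];
        }
        a.swap(next);
    }
}

int main() {
    std::vector<double> a(64, 0.0);
    a[32] = 1.0;                                            // a spike to diffuse
    stencil_temporal_blocking(a, /*steps=*/8, /*block=*/16, /*s=*/4);
    std::printf("a[32] after smoothing: %f\n", a[32]);
    return 0;
}
```

The redundant work is confined to the s-cell halos; multi-level TB applies the same trick twice, once against PCIe (GPU vs. host memory) and once against device memory (GPU caches vs. DRAM).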
Single-GPU Performance: 3D 7-point stencil on a K20X GPU (6 GB GPU memory)
• Versions compared (speed in GFlops vs. the size of each dimension, up to ~2000): Naïve, Basic-TB, and Optimized-TB
• The naïve version is limited to problem sizes that fit in device memory (~5.3 GB); with optimized TB, a 10× larger domain (~52 GB, backed by host memory) is successfully used with little overhead – a step towards extremely fast & big simulations
Problem: Programming Cost
• Communication-reducing algorithms efficiently support larger domains, but programming cost is the issue – complex loop structure, complex border handling
• Reduce programming cost with system software that supports the memory hierarchy: HHRT (Hybrid Hierarchical Runtime) and the Physis DSL (by Maruyama, RIKEN)
Memory Hierarchy Management with Runtime Libraries
• HHRT (Hybrid Hierarchical Runtime) targets GPU supercomputers and MPI+CUDA user applications
  – HHRT provides MPI- and CUDA-compatible APIs
  – # of MPI processes > # of GPUs: several processes share each GPU, each with its own data in GPU and host memory
• HHRT supports memory swapping between GPU and host memory at the granularity of processes
  – Similar to NVIDIA UVM, but works well with communication-reducing algorithms
HHRT + Communication-Reducing Results: beyond-GPU-memory execution with moderate programming cost
• 3D 7-point stencil on a single K20X GPU (speed in GFlops vs. problem size in GB, up to ~30 GB): comparing NoTB, hand-written TB, and HHRT-TB; HHRT-TB sustains efficient execution well beyond the 6 GB device memory
• Weak scalability on TSUBAME2.5 (speed in TFlops vs. number of GPUs, up to ~200): "Small" = 3.4 GB per GPU, "Large" = 16 GB per GPU (>6 GB device memory); 14 TFlops achieved with a 3 TB problem
Where Do We Go From Here?
• TSUBAME-KFC (green) and TSUBAME-EBD (extreme big data) → TSUBAME3.0 (2016) → TSUBAME4.0 (2021~): post-CMOS Moore?
• TSUBAME4 (2021~2022): "K-in-a-Box" (Golden Box), a BD/EC convergent architecture – 1/500 the size, 1/150 the power, 1/500 the cost of the K computer, with ×5 the DRAM + NVM memory
  – 10 Petaflops, 10 Petabytes of hierarchical memory (K: 1.5 PB), 10K nodes, 50 GB/s interconnect (200-300 Tbps bisection BW); conceptually similar to HP "The Machine"
• Datacenter in a box: large datacenters will become "Jurassic"
TSUBAME4 (2020-): DRAM + NVM + CPU with 3D/2.5D die stacking – the ultimate convergence of BD and EC
• NVM/Flash dies stacked with DRAM on a TSV interposer mounted on the PCB
• 4~6 HBM channels at 2 Tbps each; 2 TB/s combined DRAM & NVM bandwidth
• Low-power CPUs co-packaged with an optical switch & launch pad; direct chip-to-chip interconnect with DWDM optics
GoldenBox "Proto1" (NVIDIA K1-based) at the Tokyo Tech SC14 booth #1857 (also in the Wednesday morning plenary talk)
• 36 NVIDIA Tegra K1 nodes, ~11 TFlops SFP, ~700 GB/s memory BW, 100-700 Watts
• Integrated mSATA SSDs, ~7 GB/s I/O
• Ultra dense, oil-immersion cooling; same SW stack as TSUBAME2
• 2022 target: ×10 Flops, ×10 memory bandwidth, silicon photonics, ×10 NVM, ×10 node density, with new device and packaging technologies
EBD Interconnect: Network Performance Visualization [EuroMPI/Asia 2014 Poster]
Our goals:
① Portably expose MPI's internal performance (application layer / MPI abstractions)
② Non-intrusively profile low-level metrics (MPI layer / process view)
③ Flexible hardware-centric performance analysis (hardware layer)

Overhead of our profiler (named ibprof):
• MPI_Alltoall communication-latency overhead across message sizes from 0 to 32,768 bytes: 2.06% on average (a generic PMPI interception sketch follows below)
• NAS Parallel Benchmarks FT: runtime overhead of less than 0.02% (12.1919 s → 12.1935 s)

Also shown: a network visualization of TSUBAME 2.5 running the Graph500 benchmark on 512 nodes.
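The "non-intrusive" profiling in ① and ② is typically done by interposing on MPI calls through the standard PMPI profiling interface (ibprof additionally reads InfiniBand port counters, which is not shown here). A minimal sketch of such an interposer for MPI_Alltoall, with hypothetical counter names, looks like this; it assumes an MPI-3 const-correct prototype.

```cpp
// Minimal PMPI interposition sketch: wrap MPI_Alltoall to accumulate the time
// spent in the call, without modifying the application (link this wrapper
// ahead of the MPI library). Hardware-counter sampling is omitted.
#include <mpi.h>
#include <cstdio>

static double g_alltoall_seconds = 0.0;     // accumulated time in MPI_Alltoall
static long long g_alltoall_calls = 0;

extern "C" int MPI_Alltoall(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                            void* recvbuf, int recvcount, MPI_Datatype recvtype,
                            MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Alltoall(sendbuf, sendcount, sendtype,
                           recvbuf, recvcount, recvtype, comm);  // real implementation
    g_alltoall_seconds += MPI_Wtime() - t0;
    ++g_alltoall_calls;
    return rc;
}

extern "C" int MPI_Finalize(void) {
    int rank = 0;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        std::printf("MPI_Alltoall: %lld calls, %.6f s total\n",
                    g_alltoall_calls, g_alltoall_seconds);
    return PMPI_Finalize();
}
```

Because only the thin wrapper is added around each intercepted call, the per-call cost stays in the microsecond range, which is consistent with the small overheads reported above.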
EBD NVM System Software: Design of a Nonblocking B+Tree for NVM-KV [Tatebe Group, Jabri]
• NVM-BPTree is a key-value store (KVS) running natively over non-volatile memory (NVM), such as flash, that supports range queries (a range-query interface sketch follows below)
• Takes advantage of new NVM capabilities: atomic writes, a huge sparse address space, and direct access to the NVM device natively as a KVS
• Enables range-query support for KVSs running natively on NVM such as the Fusion-io ioDrive
• Provides optional persistence for the B+Tree structure, as well as snapshots
• Architecture: an in-memory B+Tree over an OpenNVM-like key-value store interface on NVM (a Fusion-io flash device)
• Evaluation: Fusion-io SDK 0.4 with a 160 GB SLC ioDrive; key size fixed at 40 bytes; value sizes ranging from 1 up to 1,024 sectors (512 B each); NVM-BPTree does not impact performance compared with the original KVS
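To illustrate what "range queries on a KVS" means at the interface level, here is a minimal sketch of an ordered index layered over a flat put/get store, with std::map/std::set standing in for the B+Tree and an in-memory hash map standing in for the NVM-backed KVS; all class and method names are illustrative, not the NVM-BPTree or OpenNVM API.

```cpp
// Minimal sketch: an ordered index (stand-in for a B+Tree) layered on a flat
// key-value store, adding range-query support. Not the NVM-BPTree API.
#include <cstdio>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class RangeKVS {
public:
    void put(const std::string& key, const std::string& value) {
        store_[key] = value;     // flat KVS write (NVM-backed in the real system)
        index_.insert(key);      // ordered index update (the B+Tree's job)
    }
    bool get(const std::string& key, std::string& value) const {
        auto it = store_.find(key);
        if (it == store_.end()) return false;
        value = it->second;
        return true;
    }
    // Return all key/value pairs with lo <= key < hi, in key order.
    std::vector<std::pair<std::string, std::string>>
    range(const std::string& lo, const std::string& hi) const {
        std::vector<std::pair<std::string, std::string>> out;
        for (auto it = index_.lower_bound(lo); it != index_.end() && *it < hi; ++it)
            out.emplace_back(*it, store_.at(*it));
        return out;
    }
private:
    std::unordered_map<std::string, std::string> store_;  // stand-in for the NVM KVS
    std::set<std::string> index_;                          // stand-in for the B+Tree index
};

int main() {
    RangeKVS kvs;
    kvs.put("apple", "1"); kvs.put("apricot", "2"); kvs.put("banana", "3");
    for (const auto& kv : kvs.range("ap", "az"))            // keys in ["ap", "az")
        std::printf("%s -> %s\n", kv.first.c_str(), kv.second.c_str());
    return 0;
}
```

The point of doing this natively on NVM, rather than on top of a file system, is that the B+Tree nodes can live directly in the device's sparse address space and be updated with its atomic-write primitives.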
EBD NVM System Software: EBD I/O and C/R Modeling for Extreme Scale [CCGrid2014 Best Paper] (LLNL-PRES-654744)

Extreme-scale I/O with burst buffers (IBIO):
• Compute nodes run IBIO clients; each burst-buffer node runs an IBIO server backed by EBD-I/O storage (8× mSATA SSDs behind an Adaptec RAID card)
• An application's writes are split into chunks by the IBIO client, sent to the IBIO server, staged in chunk buffers, and written out to files by dedicated writer threads (one per open file descriptor)
• Measured read/write throughput (GB/s, vs. number of processes up to 16) compares Peak, Local, NFS, and IBIO performance; the IBIO write measurements use four IBIO clients and one IBIO server

Extreme-scale C/R (checkpoint/restart) modeling:
• Storage model: a hierarchy H_N = {m_1, m_2, …, m_N} of storage levels
• Level-i checkpoint and restart costs C_i and R_i are derived from the per-node C/R data size divided by the write performance w_i, with separate cases for i = 0 and i > 0
• Checkpoint overhead O_i = C_i + E_i (synchronous); checkpoint latency L_i = C_i + E_i (asynchronous)
• A Markov model over checkpoint intervals of length t + c_k and recoveries r_k uses p_0(T), the probability of no failure during T, and p_i(T), the probability of a level-i failure during T, together with the corresponding expected times t_0(T) and t_i(T), where l_i is the level-i checkpoint time, c_c the level-c checkpoint time, and r_c the level-c recovery time, to determine multi-level checkpoint intervals
Cloud-Based I/O Burst Buffer Architecture (in collaboration talks with Amazon EC2)
• Main idea: use several compute nodes in a public cloud as I/O nodes that form an I/O burst buffer between compute systems (reached over LAN/WAN) and storage
  – I/O data is buffered in the main memory of the I/O nodes; every I/O node maintains an on-memory buffer queue (a buffer-queue sketch follows below)
  – The burst buffer is dynamic: the number of I/O nodes can be decided at run time
  – Takes advantage of the high-throughput LAN inside the public cloud
• Single-client performance (throughput in MB/s vs. number of nodes, 1-8; experiment and simulation, with and without I/O nodes): improvements of ~8× and ~1.7× for reads and writes over running without I/O nodes
• Multi-client performance (throughput in MB/s vs. number of I/O nodes = number of clients, 1-4): up to ~7× improvement
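A minimal sketch of the "on-memory buffer queue" each I/O node maintains: incoming write chunks are absorbed into a bounded in-memory queue and a background thread drains them to backing storage, so clients see memory-speed writes during bursts. Class and member names are illustrative only, not the actual system's API.

```cpp
// Minimal sketch of an I/O node's on-memory buffer queue: incoming write
// chunks are absorbed into RAM and flushed to backing storage asynchronously.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class BufferQueue {
public:
    explicit BufferQueue(const std::string& backing_file)
        : out_(backing_file, std::ios::binary), flusher_(&BufferQueue::flush_loop, this) {}
    ~BufferQueue() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        flusher_.join();                             // drain everything before exit
    }
    // Called for each incoming chunk: returns as soon as the chunk is in RAM.
    void write(std::vector<char> chunk) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }
private:
    void flush_loop() {
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::vector<char> chunk = std::move(q_.front());
                q_.pop();
                lk.unlock();                         // write to storage outside the lock
                out_.write(chunk.data(), (std::streamsize)chunk.size());
                lk.lock();
            }
        }
        out_.flush();
    }
    std::ofstream out_;
    std::queue<std::vector<char>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread flusher_;
};

int main() {
    BufferQueue buf("burst_buffer.bin");
    for (int i = 0; i < 100; ++i)
        buf.write(std::vector<char>(1 << 20, 'x'));  // 100 x 1 MiB burst absorbed in RAM
    return 0;                                        // destructor drains the queue
}
```

In the architecture above the same pattern sits behind a network front end on each cloud I/O node, and the queue depth (RAM per I/O node) together with the number of I/O nodes determines how large a burst can be absorbed before clients see storage-speed writes again.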