TSUBAME2.0, 2.5 towards 3.0 for Convergence of Extreme Computing and Big Data
Satoshi Matsuoka
Professor, Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
HP-CAST / SC2014 Presentation, New Orleans, USA, November 14, 2014
TSUBAME2.0 (Nov. 1, 2010): "The Greenest Production Supercomputer in the World"
• GPU-centric (>4,000 GPUs), high performance & low power
• Small footprint (~200 m2 / ~2,000 sq. ft), low TCO
• High-bandwidth memory, optical network, SSD storage…
TSUBAME 2.0 New Development (build-up from processor to full system)
• Processors: 40 nm GPUs, 32 nm CPUs
• Node: >400 GB/s memory BW, 80 Gbps network BW, ~1 kW max
• Chassis: >1.6 TB/s memory BW
• Rack: >12 TB/s memory BW, 35 kW max
• Full system: >600 TB/s memory BW, 220 Tbps network bisection BW, 1.4 MW max
TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade (Fall 2013)
HP SL390G7 thin node (developed for TSUBAME2.0, productized as the HP ProLiant SL390s, modified for TSUBAME2.5):
• GPU: NVIDIA Kepler K20X ×3 (6 GB memory per GPU), replacing NVIDIA Fermi M2050 (1039/515 GFlops SFP/DFP) with Kepler K20X (3950/1310 GFlops SFP/DFP)
• CPU: Intel Westmere-EP 2.93 GHz ×2
• Multiple I/O chips, 72 PCIe lanes (16 ×4 + 4 ×2) serving 3 GPUs + 2 InfiniBand QDR HCAs
• Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB ×2 or 120 GB ×2
• Network: InfiniBand QDR ×2 (80 Gbps)
• Per-node peak after upgrade: 4.08 TFlops, ~800 GB/s memory BW, ~1 kW max
Phase-field simulation of dendritic solidification [Shimokawabe, Aoki et al.], winner of the 2011 ACM Gordon Bell Prize (Special Achievements in Scalability and Time-to-Solution)
• Peta-scale phase-field simulations can simulate multiple dendritic growth during solidification, required for the evaluation of new materials: developing lightweight strengthening materials by controlling microstructure, towards a low-carbon society
• Weak scaling on TSUBAME (single precision), mesh size per 1 GPU + 4 CPU cores: 4096 × 162 × 130
  – TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), mesh 4,096 × 6,480 × 13,000
  – TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), mesh 4,096 × 5,022 × 16,640
Application performance, TSUBAME2.0 vs. TSUBAME2.5:

Application (configuration, unit)                          | TSUBAME2.0 | TSUBAME2.5 | Boost Ratio
Top500/Linpack, 4131 GPUs (PFlops)                         | 1.192      | 2.843      | 2.39
Green500/Linpack, 4131 GPUs (GFlops/W)                     | 0.958      | 3.068      | 3.20
Semi-Definite Programming nonlinear optimization, 4080 GPUs (PFlops) | 1.019 | 1.713 | 1.68
Gordon Bell dendrite stencil, 3968 GPUs (PFlops)           | 2.000      | 3.444      | 1.72
LBM LES whole-city airflow, 3968 GPUs (PFlops)             | 0.592      | 1.142      | 1.93
Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day)                | 3.44       | 11.39      | 3.31
GHOSTM genome homology search, 1 GPU (sec)                 | 19361      | 10785      | 1.80
MEGADOC protein docking, 1 node / 3 GPUs (vs. 1 CPU core)  | 37.11      | 83.49      | 2.25
TSUBAME2.0 ⇒ 2.5 Power Improvement
• 18% power reduction including cooling (2012/12 vs. 2013/12)
• Green500: #6 in the world on the 2013/11 list, along with TSUBAME-KFC at #1; #9 on the 2014/6 list
Comparing K Computer to TSUBAME 2.5

Perf ≒ / Cost:             1 Billion | ~6 million | ~700,000
Memory Technology:         GDDR5 + DDR3 | DDR3 | DDR3
Network Technology:        Luxtera Silicon Photonics | Standard Optics | Copper
Non-Volatile Memory / SSD: SSD Flash in all nodes, ~250 TBytes | None | None
Power Management:          Node/System Active Power Cap | Rack-level measurement only | Rack-level measurement only
Virtualization:            KVM (G & V queues, resource segregation) | None | None
TSUBAME3.0: Leadership "Template" Machine
• Under design; deployment 2016H2~H3
• High computational power: ~20 Petaflops, ~5 Petabyte/s memory BW
• Ultra high density: ~0.6 Petaflops DFP/rack (×10 TSUBAME2.0)
• Ultra power efficient: 10 Gigaflops/W (×10 TSUBAME2.0; c.f. TSUBAME-KFC)
  – Latest power control, efficient liquid cooling, energy recovery
• Ultra high-bandwidth network: over 1 Petabit/s bisection, new topology?
  – Bigger capacity than the entire global Internet (several 100 Tbps)
• Deep memory hierarchy and ultra high-bandwidth I/O with NVM
  – Petabytes of NVM, several Terabytes/s BW, several hundred million IOPS
  – Next-generation "scientific big data" support
• Advanced power-aware resource management, high-resiliency SW/HW co-design, VM & container-based dynamic deployment…
Focused Research Towards TSUBAME3.0 and Beyond, Towards Exa
• Software and algorithms for the new memory hierarchy – pushing the envelope of low power vs. capacity; Communication and Synchronization Reducing Algorithms (CSRA)
• Post-petascale networks – topology, routing algorithms, placement algorithms… (SC14 paper, Tue 14:00-14:30, "Fail-in-Place Network…")
• Green computing: power-aware APIs, fine-grained resource scheduling
• Scientific "Extreme" Big Data – GPU Hadoop acceleration, large graphs, search/sort, deep learning
• Fault tolerance – group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Post-petascale programming – OpenACC extensions and other many-core programming substrates
• Performance analysis and modeling – for CSRA algorithms, for big data, for deep memory hierarchy, for fault tolerance, …
TSUBAME-KFC: Towards TSUBAME3.0 and Beyond
• Oil-immersion cooling
• #1 on the Green500 at SC13 and ISC14, … (paper @ ICPADS14)
Extreme Big Data Examples: Rates and Volumes are Extremely Immense
• Social networks – large graph processing
  – Facebook: ~1 billion users, average 130 friends, 30 billion pieces of content shared per month
  – Twitter: 500 million active users, 340 million tweets per day
• Internet
  – 300 million new websites per year
  – 48 hours of video uploaded to YouTube per minute
  – 30,000 YouTube videos played per second
• Social simulation
  – Target area: the planet (OpenStreetMap), 7 billion people
  – Input data: road network for the planet, 300 GB (XML); trip data for 7 billion people, 10 KB per trip × 7 billion = 70 TB
  – Simulated output for one iteration: ~700 TB, in real time
• Weather – large-scale data assimilation over real-time streaming data
  – Ultra high-BW data streams (e.g., social sensors, physical data): phased-array radar, 1 GB / 30 sec / 2 radars; Himawari satellite, 500 MB / 2.5 min
  – Workflow, repeated every 30 seconds: quality control and data processing of both observation streams → ① 30-sec ensemble forecast simulations (2 PFLOP) → ② ensemble data assimilation (2 PFLOP), exchanging ensemble forecasts (200 GB) and ensemble analyses (200 GB) → analysis data (2 GB) → ③ 30-min forecast simulation (1.2 PFLOP) → 30-min forecast (2 GB)
• Genomics – advanced sequence matching
  – Impact of new-generation sequencers: sequencing data (bp) per dollar grows ×4,000 per 5 years, c.f. HPC at ×33 per 5 years [Lincoln Stein, Genome Biology, vol. 11(5), 2010]

Overall: Peta- to Zettabytes of data; NOT simply mining terabytes of silo data; highly unstructured and irregular; complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.
Graph500: a "Big Data" Benchmark
• BFS (a BSP problem) on a Kronecker graph with parameters A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a generator sketch follows below)
• November 15, 2010, "Graph 500 Takes Aim at a New Kind of HPC", Richard Murphy (Sandia NL ⇒ Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list."
• Reality: Top500 supercomputers dominate; no cloud IDCs on the list at all
• The 8th Graph500 list (published at the International Supercomputing Conference, June 22, 2014):
  – #1: RIKEN Advanced Institute for Computational Science (AICS) K computer, 17977.1 GTEPS at Scale 40 [Koji Ueno, Tokyo Institute of Technology / RIKEN AICS]
  – #12: Tokyo Tech GSIC TSUBAME 2.5, 1280.43 GTEPS at Scale 36
  (Congratulations from the Graph500 Executive Committee.)
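The Kronecker parameters above (A=0.57, B=0.19, C=0.19, D=0.05) are what drive Graph500's edge generation. The following is a minimal, illustrative R-MAT-style sketch of that process, not the reference generator (which also permutes vertex labels and scrambles edges); all names and the chosen scale are placeholders.

```cpp
// Minimal R-MAT / Kronecker-style edge generator sketch using the
// Graph500 quadrant probabilities A=0.57, B=0.19, C=0.19, D=0.05.
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

std::pair<uint64_t, uint64_t> rmat_edge(int scale, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    uint64_t src = 0, dst = 0;
    for (int level = 0; level < scale; ++level) {
        double r = uni(rng);
        // Pick one of the four quadrants of the adjacency matrix (A, B, C, D).
        int quad = (r < 0.57) ? 0 : (r < 0.76) ? 1 : (r < 0.95) ? 2 : 3;
        src = (src << 1) | (quad >> 1);   // upper/lower half
        dst = (dst << 1) | (quad & 1);    // left/right half
    }
    return {src, dst};
}

int main() {
    const int scale = 16;                       // 2^16 vertices ("Scale 16", small example)
    const uint64_t num_edges = 16ull << scale;  // edgefactor 16, as in Graph500
    std::mt19937_64 rng(12345);
    std::vector<std::pair<uint64_t, uint64_t>> edges;
    edges.reserve(num_edges);
    for (uint64_t i = 0; i < num_edges; ++i)
        edges.push_back(rmat_edge(scale, rng));
    std::printf("generated %zu edges over %llu vertices\n",
                edges.size(), 1ull << scale);
    return 0;
}
```

The skewed quadrant probabilities are what give the generated graph its scale-free, highly irregular structure, which is exactly why Graph500 stresses bisection bandwidth and fine-grained communication rather than floating-point throughput.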
Supercomputer vs. cloud network: Tokyo Tech TSUBAME 2.0 (#4 on the Top500 in 2010) vs. a major northern Japanese cloud datacenter (2013)
• TSUBAME2.0: ~1,500 compute & storage nodes on a full-bisection, multi-rail optical network; injection 80 Gbps/node, bisection 220 Tbps; built with advanced silicon photonics (40G on a single CMOS die, 1490 nm DFB lasers, up to 100 km fiber)
• Cloud datacenter: 8 zones of ~700 nodes each (5,600 nodes total); Juniper MX480 core routers, EX8208 aggregation switches, and EX4200 zone switches (2 per zone, Virtual Chassis) with 10GbE / LACP uplinks; injection ~1 Gbps/node, bisection 160 Gbps
• TSUBAME2.0's bisection bandwidth is roughly ×1000 the cloud datacenter's, and approximately equals the entire global Internet's average data bandwidth of ~200 Tbps (source: Cisco)
[Poster thumbnail: Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (Tokyo Tech), "Multi-GPU Read Alignment"]
JST-CREST "Extreme Big Data" (EBD) Project (2013-2018)
• Problem domain: future non-silo extreme big data scientific apps, co-designed with the system stack
  – Large-scale metagenomics
  – Massive sensors and data assimilation in weather prediction
  – Ultra-large-scale graphs and social infrastructures
• EBD system software, incl. the EBD object system (EBD Bag, EBD KVS, Graph Store, Cartesian Plane)
• Convergent architecture (Phases 1~4): large-capacity NVM, high-bisection network
  – Target node: NVM/Flash plus DRAM on a TSV interposer/PCB, 4~6 HBM channels at 2 Tbps each, ~1.5 TB/s combined DRAM & NVM BW, low-power CPUs alongside a high-powered main CPU; ~30 PB/s system I/O BW possible, ~1 Yottabyte/year
• Goal: Exascale Big Data HPC. Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds?
  – Cloud IDCs: very low BW & efficiency, but highly available and resilient
  – Supercomputers: compute- & batch-oriented, more fragile
  – Issues regarding architectural, algorithmic, and system software evolution? Use of GPUs?
Towards Extreme-Scale Big Data Machines: Convergence
• Computation – increase in parallelism, heterogeneity, density, BW
  – Multi-core and many-core processors, heterogeneous processors
• Deep hierarchical memory/storage architecture
  – NVM (non-volatile memory) / SCM (storage-class memory): FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.
  – Next-gen HDDs (SMR), tapes (LTFS), cloud
• Problems: algorithm scalability, heterogeneity, power, fault tolerance, network, productivity, locality, storage hierarchy, I/O
EBD "Convergent" System Architecture ("100,000 times fold")
• Problem domain (application groups)
  – Akiyama Group: large-scale genomic correlation / metagenomics
  – Miyoshi Group: data assimilation with large-scale sensors and exascale atmospherics
  – Suzumura Group: large-scale graphs and social infrastructure apps
• Programming layer (Matsuoka Group): SQL for EBD, message passing (MPI, X10) for EBD, graph framework, PGAS/global arrays for EBD, MapReduce for EBD, workflow/scripting languages for EBD, EBD abstract data models (distributed array, key-value, sparse data model, tree, etc.)
• Basic algorithms layer: EBD algorithm kernels (search/sort, matching, graph traversals, etc.) over EBD Bag, KVS, Cartesian Plane, and Graph Store
• System SW layer: (Tatebe Group) EBD file system, EBD data object, EBD burst I/O buffer, EBD KVS on NVM (FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.); (Koibuchi Group) EBD network topology and routing
• Big Data & SC HW layer: HPC storage, web object storage, interconnect (InfiniBand, 100GbE), wide-area network (SINET5), TSUBAME 3.0, TSUBAME-GoldenBox, cloud datacenters, intercloud/grid (HPCI)
The Graph500 – June 2014: K Computer #1
Tokyo Tech [EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], RIKEN AICS

List          | Rank | GTEPS    | Implementation
November 2013 | 4    | 5524.12  | Top-down only
June 2014     | 1    | 17977.05 | Efficient hybrid

Weak-scaling breakdown of elapsed time (ms) into computation and communication: 1236 MTEPS/node at 64 nodes (Scale 30) drops to 274 MTEPS/node at 65,536 nodes (Scale 40); problem size is weak-scaled. At full scale, 73% of total execution time is spent waiting in communication.
EBD Programming Framework: Out-of-Core GPU MapReduce for Large-Scale Graph Processing [Cluster 2014]
• Emergence of large-scale graphs – SNS, road networks, smart grids, etc., with millions to trillions of vertices/edges → need for fast graph processing on supercomputers
• Problem: GPU memory capacity limits scalable large-scale graph processing
• Proposal: out-of-core GPU memory management in MapReduce
  – Stream-based GPU MapReduce: the input is split into chunks, each chunk is copied host-to-device, processed by Map/Shuffle/Sort/Scan/Reduce operations on the GPU, and copied back device-to-host, overlapping memcpy (H2D, D2H) with GPU computation (a sketch of this chunked pipeline follows after this slide)
  – Out-of-core GPU sorting
• Experimental results: performance improvement over CPUs – Map: 1.41×, Reduce: 1.49×, Sort: 4.95× speedup, by overlapping communication effectively
• Weak scaling on TSUBAME2.5 (performance in MEdges/sec vs. number of compute nodes up to ~1,500, comparing 1 CPU and 1 GPU per node at Scale 23, and 2 CPUs, 2 GPUs, and 3 GPUs per node at Scale 24): 2.10× speedup for 3 GPUs vs. 2 CPUs per node
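To make the stream-based idea concrete, here is a minimal CUDA sketch of chunked, double-buffered processing in which host-to-device copies, a map-style kernel, and device-to-host copies overlap across two streams. It illustrates the general technique only, not the paper's implementation; names such as map_kernel and CHUNK are placeholders.

```cuda
// Minimal sketch: out-of-core, stream-based chunk processing on a GPU.
// Two CUDA streams double-buffer H2D copy, kernel, and D2H copy so that
// data transfer overlaps with computation (map_kernel is a stand-in).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

__global__ void map_kernel(const int* in, int* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;      // stand-in for the real map operation
}

int main() {
    const size_t total = 1 << 26;       // dataset larger than we want resident on the GPU
    const size_t CHUNK = 1 << 22;       // chunk that fits comfortably in device memory
    std::vector<int> h_in(total, 1), h_out(total, 0);

    int *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_in[b],  CHUNK * sizeof(int));
        cudaMalloc(&d_out[b], CHUNK * sizeof(int));
        cudaStreamCreate(&stream[b]);
    }

    for (size_t off = 0, c = 0; off < total; off += CHUNK, ++c) {
        int b = c % 2;                  // alternate buffers/streams for overlap
        size_t n = std::min(CHUNK, total - off);
        cudaMemcpyAsync(d_in[b], h_in.data() + off, n * sizeof(int),
                        cudaMemcpyHostToDevice, stream[b]);
        map_kernel<<<(unsigned)((n + 255) / 256), 256, 0, stream[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out.data() + off, d_out[b], n * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();
    std::printf("first output element: %d\n", h_out[0]);

    for (int b = 0; b < 2; ++b) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(stream[b]);
    }
    return 0;
}
```

In practice the host buffers would be allocated with cudaMallocHost (pinned memory) so that the asynchronous copies truly overlap with the kernels, and the per-chunk work would be the full Map/Shuffle/Sort/Reduce pipeline rather than a single kernel.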
EBD Algorithm Kernels: GPU-HykSort [IEEE BigData 2014]
• Motivation: the effectiveness of sorting on large-scale GPU-based heterogeneous systems remains unclear
  – Appropriate selection of the phases to offload to the GPU is required
  – GPU memory overflow must be handled
• Approach: offload the local sort, the most time-consuming phase, to GPU accelerators
• Implementation: each process separates its unsorted array into chunks; for each chunk (Iter 1 … 4), the unsorted chunk is transferred to GPU memory, sorted on the GPU, and transferred back to DRAM, overlapping data transfer with sorting; the sorted chunks are then merged into a sorted array and splitters are selected as in HykSort (a chunked GPU local-sort sketch follows after this slide)
• Weak-scaling performance (keys/second in billions, 2 processes per node, up to ~1,024 nodes / ~2,048 GPUs), comparing HykSort with 1 thread, 6 threads, and GPU + 6 threads: the chart annotates speedups of 3.6× and 1.4×, with the GPU configuration reaching ~0.25 TB/s aggregate sort throughput
• Performance prediction: GPU-HykSort achieves a further 2.2× performance improvement with a 50 GB/s CPU-GPU interconnect
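As an illustration of the chunked local sort (not the GPU-HykSort code itself), the sketch below sorts chunks on the GPU with Thrust and merges them on the host; the chunk size and the simple repeated two-way merge are placeholder choices.

```cuda
// Minimal sketch: out-of-core local sort. Chunks that fit in GPU memory are
// sorted on the device with thrust::sort, copied back, then merged on the host.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

void chunked_gpu_sort(std::vector<long long>& keys, size_t chunk) {
    std::vector<long long> merged;
    for (size_t off = 0; off < keys.size(); off += chunk) {
        size_t n = std::min(chunk, keys.size() - off);
        thrust::device_vector<long long> d(keys.begin() + off, keys.begin() + off + n);
        thrust::sort(d.begin(), d.end());                   // sort the chunk on the GPU
        thrust::copy(d.begin(), d.end(), keys.begin() + off);
        // Merge the newly sorted chunk into the running sorted prefix on the host.
        std::vector<long long> tmp(merged.size() + n);
        std::merge(merged.begin(), merged.end(),
                   keys.begin() + off, keys.begin() + off + n, tmp.begin());
        merged.swap(tmp);
    }
    keys.swap(merged);
}

int main() {
    std::mt19937_64 rng(7);
    std::vector<long long> keys(1 << 24);
    for (auto& k : keys) k = static_cast<long long>(rng());
    chunked_gpu_sort(keys, 1 << 22);                        // 4M-key chunks as an example
    std::printf("sorted: %s\n",
                std::is_sorted(keys.begin(), keys.end()) ? "yes" : "no");
    return 0;
}
```

A real implementation would overlap the chunk transfers with sorting (as in the stream sketch above) and use a k-way merge rather than repeated two-way merges, before handing the locally sorted data to HykSort's splitter selection and exchange phases.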
EBD Algorithm Kernels
Efficient Parallel Sorting Algorithm for Variable-Length Keys
• Comparison-based sorts are inefficient for long and variable-length keys such as strings
• Better approach: examine individual characters (e.g., grouping "apple", "apricot", "banana", "kiwi" by their leading characters), based on the MSD radix sort algorithm (a small sketch follows below)
• Hybrid parallelization scheme combining data-parallel and task-parallel stages
• Achieves 70 M keys/second sorting throughput on 100-byte strings

References:
– Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. "Efficient String Sorting on Multi- and Many-Core Architectures." Proceedings of the IEEE 3rd International Congress on Big Data, Anchorage, USA, August 2014.
– Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. "MSD Radix String Sort on GPU: Longer Keys, Shorter Alphabets." Proceedings of the 142nd Joint Workshop on High Performance Computing (HOKKE-21).
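A minimal sketch of the MSD radix idea for strings: keys are bucketed by the character at the current depth and each bucket is recursed on, so only the distinguishing prefix of each key is ever examined. This is a sequential illustration of the principle; the papers' contribution is the hybrid data-/task-parallel multi-core and GPU version.

```cpp
// Minimal sequential MSD radix sort for strings: bucket by the character at
// position `depth`, then recurse on each bucket. Keys shorter than `depth`
// go to a dedicated "exhausted" bucket and stay in front.
#include <array>
#include <cstdio>
#include <string>
#include <vector>

void msd_radix_sort(std::vector<std::string>& keys, size_t depth = 0) {
    if (keys.size() <= 1) return;
    std::array<std::vector<std::string>, 257> buckets;      // bucket 0 = exhausted keys
    for (auto& k : keys) {
        int b = (depth < k.size()) ? 1 + static_cast<unsigned char>(k[depth]) : 0;
        buckets[b].push_back(std::move(k));
    }
    keys.clear();
    for (int b = 0; b < 257; ++b) {
        if (b != 0) msd_radix_sort(buckets[b], depth + 1);   // recurse on the next character
        for (auto& k : buckets[b]) keys.push_back(std::move(k));
    }
}

int main() {
    std::vector<std::string> keys = {"banana", "apple", "kiwi", "apricot", "app"};
    msd_radix_sort(keys);
    for (const auto& k : keys) std::printf("%s\n", k.c_str());
    return 0;
}
```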
EBD Algorithm Kernels: Large-Scale Graph Processing Using NVM [IEEE BigData 2014]
1. Hybrid BFS (Beamer '11): switches between two traversal approaches, top-down and bottom-up, based on the number of frontier vertices n_frontier relative to the number of all vertices n_all, with tuning parameters α and β (a switching sketch follows below)
2. Proposal: DRAM holds the highly accessed graph data, NVM holds the full-size graph; the highly accessed graph data is loaded into DRAM before BFS starts
3. Experiment: CPU Intel Xeon E5-2690 ×2 with 256 GB DRAM; NVM: EBD-I/O, 2 TB ×2 (8× mSATA SSDs behind a RAID-0 card)
  – Results (median GTEPS, Giga Traversed Edges Per Second, for SCALE 23…31, where #vertices = 2^SCALE): with DRAM + EBD-I/O, a 4× larger graph than the DRAM-only limit is processed with only 6.9% performance degradation (roughly 4.1 GTEPS DRAM-only vs. 3.8 GTEPS with DRAM + EBD-I/O)
  – Ranked 3rd in the Big Data category of the Green Graph500 (June 2014): Tokyo Institute of Technology's GraphCREST-Custom #1 achieved 35.21 MTEPS/W at Scale 31 on the third Green Graph 500 list, published at the International Supercomputing Conference, June 23, 2014, with congratulations from the Green Graph 500 Chair
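The direction switch in hybrid BFS is driven by simple frontier-size heuristics. Below is a minimal, self-contained sketch of that switching logic using the α/β parameters named above; the adjacency-list graph, the two traversal steps, and the default parameter values are illustrative placeholders (Beamer's original test compares edge counts of the frontier against the unexplored region).

```cpp
// Minimal sketch of hybrid (direction-optimizing) BFS with a frontier-size
// switching heuristic. Real implementations use CSR, bitmaps, and
// NUMA/NVM-aware data placement.
#include <cstdint>
#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int64_t>>;   // adjacency list

static std::vector<int64_t> top_down_step(const Graph& g,
                                          const std::vector<int64_t>& frontier,
                                          std::vector<int64_t>& parent) {
    std::vector<int64_t> next;
    for (int64_t u : frontier)
        for (int64_t v : g[u])
            if (parent[v] < 0) { parent[v] = u; next.push_back(v); }
    return next;
}

static std::vector<int64_t> bottom_up_step(const Graph& g,
                                           const std::vector<char>& in_frontier,
                                           std::vector<int64_t>& parent) {
    std::vector<int64_t> next;
    for (int64_t v = 0; v < (int64_t)g.size(); ++v) {
        if (parent[v] >= 0) continue;               // already visited
        for (int64_t u : g[v])
            if (in_frontier[u]) { parent[v] = u; next.push_back(v); break; }
    }
    return next;
}

void hybrid_bfs(const Graph& g, int64_t root, std::vector<int64_t>& parent,
                double alpha = 14.0, double beta = 24.0) {  // illustrative defaults
    const double n_all = (double)g.size();
    parent.assign(g.size(), -1);
    parent[root] = root;
    std::vector<int64_t> frontier{root};
    bool bottom_up = false;
    while (!frontier.empty()) {
        double n_frontier = (double)frontier.size();
        // Go bottom-up when the frontier is a large fraction of all vertices,
        // return to top-down once it shrinks again.
        if (!bottom_up && n_frontier > n_all / alpha)      bottom_up = true;
        else if (bottom_up && n_frontier < n_all / beta)   bottom_up = false;

        if (bottom_up) {
            std::vector<char> in_frontier(g.size(), 0);
            for (int64_t u : frontier) in_frontier[u] = 1;
            frontier = bottom_up_step(g, in_frontier, parent);
        } else {
            frontier = top_down_step(g, frontier, parent);
        }
    }
}

int main() {
    Graph g = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};     // small example graph
    std::vector<int64_t> parent;
    hybrid_bfs(g, 0, parent);
    for (size_t v = 0; v < parent.size(); ++v)
        std::printf("parent[%zu] = %lld\n", v, (long long)parent[v]);
    return 0;
}
```

The bottom-up step is the one that benefits from keeping the "highly accessed" adjacency data in DRAM, since it repeatedly scans unvisited vertices' neighbor lists; the rarely touched remainder of the graph can stay on NVM, which is what the EBD-I/O result above exploits.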
Software Technology that Deals with Deeper Memory Hierarchy in the Post-Petascale Era (JST-CREST project, 2012-2018, PI Toshio Endo)
• Communication/BW-reducing algorithms
• + System software for memory hierarchy management
• + HPC architecture with hybrid memory devices: HMC/HBM, Flash (O(GB/s)), next-gen NVM
• Target: realizing extremely fast & big simulations of {O(100 PF/s) or O(10 PB/s)} & O(10 PB) around 2018
Supporting Larger Domains than GPU Device Memory for Stencil Simulations
• Desired problem sizes are >> the memory on a TSUBAME2.5 node's GPU card
  – GPU cores with a 1.5 MB L2$; 6 GB GPU memory at 250 GB/s; CPU cores with 54 GB host memory reachable only at 8 GB/s over PCIe
• Caution: simply "swapping out" to the larger host memory is disastrously slow; the PCIe traffic is too large!
• The keys are "communication avoiding & locality improvement" algorithms
Temporal Blocking (TB) for Communication Avoiding
• Performs multiple (s-step) updates on a small block before proceeding to the next block
  – Originally proposed to improve cache locality [Kowarschik 04][Datta 08]
• Introduces a "larger halo": because of data dependencies with neighboring blocks, each block is read with an enlarged halo and some redundant computation is performed across the s steps (Step 1 … Step 4 along simulated time)
  – The redundancy can be removed when blocks are computed sequentially [Demmel 12]
• Multi-level TB reduces both PCIe traffic and device memory traffic (a single-block sketch follows below)
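A minimal 1D illustration of the idea: for each block, s time steps are applied to a working copy that includes an s-cell halo, so the block only has to be read from (and written back to) the slower memory once per s steps. The 3-point smoothing update, array names, and parameters are placeholders; real codes apply this to 3D stencils across the GPU/host memory boundary.

```cpp
// Minimal 1D temporal-blocking sketch: update each block s steps at a time
// using an enlarged (s-cell) halo, trading a little redundant computation
// for far fewer passes over the slow (e.g., host) memory.
#include <algorithm>
#include <cstdio>
#include <vector>

void stencil_temporal_blocking(std::vector<double>& a, int steps, int block, int s) {
    const int n = (int)a.size();
    for (int t = 0; t < steps; t += s) {
        std::vector<double> next(a);                        // output of this s-step sweep
        for (int b0 = 1; b0 < n - 1; b0 += block) {
            int b1 = std::min(b0 + block, n - 1);
            // Working copy of the block plus an s-cell halo on each side.
            int lo = std::max(0, b0 - s), hi = std::min(n, b1 + s);
            std::vector<double> work(a.begin() + lo, a.begin() + hi);
            std::vector<double> tmp(work.size());
            for (int k = 0; k < s && t + k < steps; ++k) {  // s updates in fast memory
                for (int i = 1; i + 1 < (int)work.size(); ++i)
                    tmp[i] = 0.25 * work[i - 1] + 0.5 * work[i] + 0.25 * work[i + 1];
                tmp.front() = work.front(); tmp.back() = work.back();
                work.swap(tmp);                             // halo validity shrinks each step
            }
            // After s steps only the block interior is valid; copy it back out.
            for (int i = b0; i < b1; ++i) next[i] = work[i - lo];
        }
        a.swap(next);
    }
}

int main() {
    std::vector<double> a(64, 0.0);
    a[32] = 1.0;                                            // a spike to diffuse
    stencil_temporal_blocking(a, /*steps=*/8, /*block=*/16, /*s=*/4);
    std::printf("a[32] after smoothing: %f\n", a[32]);
    return 0;
}
```

The redundant work is confined to the s-cell halos; multi-level TB applies the same trick twice, once against PCIe (GPU vs. host memory) and once against device memory (GPU caches vs. DRAM).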
Single-GPU Performance: 3D 7-point stencil on a K20X GPU (6 GB GPU memory)
• Versions compared (speed in GFlops vs. the size of each dimension, up to ~2000): Naïve, Basic-TB, and Optimized-TB
• The naïve version is limited to problem sizes that fit in device memory (~5.3 GB); with optimized TB, a 10× larger domain (~52 GB, backed by host memory) is successfully used with little overhead – a step towards extremely fast & big simulations
Problem: Programming Cost
• Communication-reducing algorithms efficiently support larger domains, but programming cost is the issue – complex loop structure, complex border handling
• Reduce programming cost with system software that supports the memory hierarchy: HHRT (Hybrid Hierarchical Runtime) and the Physis DSL (by Maruyama, RIKEN)
Memory Hierarchy Management with Runtime Libraries
• HHRT (Hybrid Hierarchical Runtime) targets GPU supercomputers and MPI+CUDA user applications
  – HHRT provides MPI- and CUDA-compatible APIs
  – # of MPI processes > # of GPUs: several processes share each GPU, each with its own data in GPU and host memory
• HHRT supports memory swapping between GPU and host memory at the granularity of processes
  – Similar to NVIDIA UVM, but works well with communication-reducing algorithms
HHRT + Communication-Reducing Results: beyond-GPU-memory execution with moderate programming cost
• 3D 7-point stencil on a single K20X GPU (speed in GFlops vs. problem size in GB, up to ~30 GB): comparing NoTB, hand-written TB, and HHRT-TB; HHRT-TB sustains efficient execution well beyond the 6 GB device memory
• Weak scalability on TSUBAME2.5 (speed in TFlops vs. number of GPUs, up to ~200): "Small" = 3.4 GB per GPU, "Large" = 16 GB per GPU (>6 GB device memory); 14 TFlops achieved with a 3 TB problem
Where Do We Go From Here?
• TSUBAME-KFC (green) and TSUBAME-EBD (extreme big data) → TSUBAME3.0 (2016) → TSUBAME4.0 (2021~): post-CMOS Moore?
• TSUBAME4 (2021~2022): "K-in-a-Box" (Golden Box), a BD/EC convergent architecture – 1/500 the size, 1/150 the power, 1/500 the cost of the K computer, with ×5 the DRAM + NVM memory
  – 10 Petaflops, 10 Petabytes of hierarchical memory (K: 1.5 PB), 10K nodes, 50 GB/s interconnect (200-300 Tbps bisection BW); conceptually similar to HP "The Machine"
• Datacenter in a box: large datacenters will become "Jurassic"
TSUBAME4 (2020-): DRAM + NVM + CPU with 3D/2.5D die stacking – the ultimate convergence of BD and EC
• NVM/Flash dies stacked with DRAM on a TSV interposer mounted on the PCB
• 4~6 HBM channels at 2 Tbps each; 2 TB/s combined DRAM & NVM bandwidth
• Low-power CPUs co-packaged with an optical switch & launch pad; direct chip-to-chip interconnect with DWDM optics
GoldenBox "Proto1" (NVIDIA K1-based) at the Tokyo Tech SC14 booth #1857 (also in the Wednesday morning plenary talk)
• 36 NVIDIA Tegra K1 nodes, ~11 TFlops SFP, ~700 GB/s memory BW, 100-700 Watts
• Integrated mSATA SSDs, ~7 GB/s I/O
• Ultra dense, oil-immersion cooling; same SW stack as TSUBAME2
• 2022 target: ×10 Flops, ×10 memory bandwidth, silicon photonics, ×10 NVM, ×10 node density, with new device and packaging technologies
EBD Interconnect: Network Performance Visualization [EuroMPI/Asia 2014 Poster]
Our goals:
① Portably expose MPI's internal performance (application layer / MPI abstractions)
② Non-intrusively profile low-level metrics (MPI layer / process view)
③ Flexible hardware-centric performance analysis (hardware layer)

Overhead of our profiler (named ibprof):
• MPI_Alltoall communication-latency overhead across message sizes from 0 to 32,768 bytes: 2.06% on average (a generic PMPI interception sketch follows below)
• NAS Parallel Benchmarks FT: runtime overhead of less than 0.02% (12.1919 s → 12.1935 s)

Also shown: a network visualization of TSUBAME 2.5 running the Graph500 benchmark on 512 nodes.
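The "non-intrusive" profiling in ① and ② is typically done by interposing on MPI calls through the standard PMPI profiling interface (ibprof additionally reads InfiniBand port counters, which is not shown here). A minimal sketch of such an interposer for MPI_Alltoall, with hypothetical counter names, looks like this; it assumes an MPI-3 const-correct prototype.

```cpp
// Minimal PMPI interposition sketch: wrap MPI_Alltoall to accumulate the time
// spent in the call, without modifying the application (link this wrapper
// ahead of the MPI library). Hardware-counter sampling is omitted.
#include <mpi.h>
#include <cstdio>

static double g_alltoall_seconds = 0.0;     // accumulated time in MPI_Alltoall
static long long g_alltoall_calls = 0;

extern "C" int MPI_Alltoall(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                            void* recvbuf, int recvcount, MPI_Datatype recvtype,
                            MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Alltoall(sendbuf, sendcount, sendtype,
                           recvbuf, recvcount, recvtype, comm);  // real implementation
    g_alltoall_seconds += MPI_Wtime() - t0;
    ++g_alltoall_calls;
    return rc;
}

extern "C" int MPI_Finalize(void) {
    int rank = 0;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        std::printf("MPI_Alltoall: %lld calls, %.6f s total\n",
                    g_alltoall_calls, g_alltoall_seconds);
    return PMPI_Finalize();
}
```

Because only the thin wrapper is added around each intercepted call, the per-call cost stays in the microsecond range, which is consistent with the small overheads reported above.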
EBD NVM System Software: Design of a Nonblocking B+Tree for NVM-KV [Tatebe Group, Jabri]
• NVM-BPTree is a key-value store (KVS) running natively over non-volatile memory (NVM), such as flash, that supports range queries (a range-query interface sketch follows below)
• Takes advantage of new NVM capabilities: atomic writes, a huge sparse address space, and direct access to the NVM device natively as a KVS
• Enables range-query support for KVSs running natively on NVM such as the Fusion-io ioDrive
• Provides optional persistence for the B+Tree structure, as well as snapshots
• Architecture: an in-memory B+Tree over an OpenNVM-like key-value store interface on NVM (a Fusion-io flash device)
• Evaluation: Fusion-io SDK 0.4 with a 160 GB SLC ioDrive; key size fixed at 40 bytes; value sizes ranging from 1 up to 1,024 sectors (512 B each); NVM-BPTree does not impact performance compared with the original KVS
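To illustrate what "range queries on a KVS" means at the interface level, here is a minimal sketch of an ordered index layered over a flat put/get store, with std::map/std::set standing in for the B+Tree and an in-memory hash map standing in for the NVM-backed KVS; all class and method names are illustrative, not the NVM-BPTree or OpenNVM API.

```cpp
// Minimal sketch: an ordered index (stand-in for a B+Tree) layered on a flat
// key-value store, adding range-query support. Not the NVM-BPTree API.
#include <cstdio>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class RangeKVS {
public:
    void put(const std::string& key, const std::string& value) {
        store_[key] = value;     // flat KVS write (NVM-backed in the real system)
        index_.insert(key);      // ordered index update (the B+Tree's job)
    }
    bool get(const std::string& key, std::string& value) const {
        auto it = store_.find(key);
        if (it == store_.end()) return false;
        value = it->second;
        return true;
    }
    // Return all key/value pairs with lo <= key < hi, in key order.
    std::vector<std::pair<std::string, std::string>>
    range(const std::string& lo, const std::string& hi) const {
        std::vector<std::pair<std::string, std::string>> out;
        for (auto it = index_.lower_bound(lo); it != index_.end() && *it < hi; ++it)
            out.emplace_back(*it, store_.at(*it));
        return out;
    }
private:
    std::unordered_map<std::string, std::string> store_;  // stand-in for the NVM KVS
    std::set<std::string> index_;                          // stand-in for the B+Tree index
};

int main() {
    RangeKVS kvs;
    kvs.put("apple", "1"); kvs.put("apricot", "2"); kvs.put("banana", "3");
    for (const auto& kv : kvs.range("ap", "az"))            // keys in ["ap", "az")
        std::printf("%s -> %s\n", kv.first.c_str(), kv.second.c_str());
    return 0;
}
```

The point of doing this natively on NVM, rather than on top of a file system, is that the B+Tree nodes can live directly in the device's sparse address space and be updated with its atomic-write primitives.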
EBD NVM System Software: EBD I/O and C/R Modeling for Extreme Scale [CCGrid2014 Best Paper] (LLNL-PRES-654744)

Extreme-scale I/O with burst buffers (IBIO):
• Compute nodes run IBIO clients; each burst-buffer node runs an IBIO server backed by EBD-I/O storage (8× mSATA SSDs behind an Adaptec RAID card)
• An application's writes are split into chunks by the IBIO client, sent to the IBIO server, staged in chunk buffers, and written out to files by dedicated writer threads (one per open file descriptor)
• Measured read/write throughput (GB/s, vs. number of processes up to 16) compares Peak, Local, NFS, and IBIO performance; the IBIO write measurements use four IBIO clients and one IBIO server

Extreme-scale C/R (checkpoint/restart) modeling:
• Storage model: a hierarchy H_N = {m_1, m_2, …, m_N} of storage levels
• Level-i checkpoint and restart costs C_i and R_i are derived from the per-node C/R data size divided by the write performance w_i, with separate cases for i = 0 and i > 0
• Checkpoint overhead O_i = C_i + E_i (synchronous); checkpoint latency L_i = C_i + E_i (asynchronous)
• A Markov model over checkpoint intervals of length t + c_k and recoveries r_k uses p_0(T), the probability of no failure during T, and p_i(T), the probability of a level-i failure during T, together with the corresponding expected times t_0(T) and t_i(T), where l_i is the level-i checkpoint time, c_c the level-c checkpoint time, and r_c the level-c recovery time, to determine multi-level checkpoint intervals
Cloud-Based I/O Burst Buffer Architecture (in collaboration talks with Amazon EC2)
• Main idea: use several compute nodes in a public cloud as I/O nodes that form an I/O burst buffer between compute systems (reached over LAN/WAN) and storage
  – I/O data is buffered in the main memory of the I/O nodes; every I/O node maintains an on-memory buffer queue (a buffer-queue sketch follows below)
  – The burst buffer is dynamic: the number of I/O nodes can be decided at run time
  – Takes advantage of the high-throughput LAN inside the public cloud
• Single-client performance (throughput in MB/s vs. number of nodes, 1-8; experiment and simulation, with and without I/O nodes): improvements of ~8× and ~1.7× for reads and writes over running without I/O nodes
• Multi-client performance (throughput in MB/s vs. number of I/O nodes = number of clients, 1-4): up to ~7× improvement
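A minimal sketch of the "on-memory buffer queue" each I/O node maintains: incoming write chunks are absorbed into a bounded in-memory queue and a background thread drains them to backing storage, so clients see memory-speed writes during bursts. Class and member names are illustrative only, not the actual system's API.

```cpp
// Minimal sketch of an I/O node's on-memory buffer queue: incoming write
// chunks are absorbed into RAM and flushed to backing storage asynchronously.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class BufferQueue {
public:
    explicit BufferQueue(const std::string& backing_file)
        : out_(backing_file, std::ios::binary), flusher_(&BufferQueue::flush_loop, this) {}
    ~BufferQueue() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        flusher_.join();                             // drain everything before exit
    }
    // Called for each incoming chunk: returns as soon as the chunk is in RAM.
    void write(std::vector<char> chunk) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }
private:
    void flush_loop() {
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::vector<char> chunk = std::move(q_.front());
                q_.pop();
                lk.unlock();                         // write to storage outside the lock
                out_.write(chunk.data(), (std::streamsize)chunk.size());
                lk.lock();
            }
        }
        out_.flush();
    }
    std::ofstream out_;
    std::queue<std::vector<char>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread flusher_;
};

int main() {
    BufferQueue buf("burst_buffer.bin");
    for (int i = 0; i < 100; ++i)
        buf.write(std::vector<char>(1 << 20, 'x'));  // 100 x 1 MiB burst absorbed in RAM
    return 0;                                        // destructor drains the queue
}
```

In the architecture above the same pattern sits behind a network front end on each cloud I/O node, and the queue depth (RAM per I/O node) together with the number of I/O nodes determines how large a burst can be absorbed before clients see storage-speed writes again.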