Need for Speed: Accelerated Deep Learning on Power
Michael Gschwind, Chief Engineer, Machine Learning and Deep Learning, TJ Watson Research Center, IBM Corp.
The Cognitive Revolution: New Paradigm, New Chip, New Servers, Accelerated AI
• New chip: “POWER8 with NVLink”
  – POWER8 + coherent CAPI + novel NVLink for high-bandwidth coherent CPU/GPU acceleration
• New Power Linux servers
  – S821LC: high-density 2-socket 1U
  – S822LC for Big Data
  – S822LC for High Performance Computing
• Accelerator X
Accelerated Deep Learning on Power
• Software blocks – DL framework enablement
  – DL frameworks are independent of hardware and can be deployed across IBM servers
  – Frameworks can be updated independently of hardware releases
  – IBM is the only server provider publishing an optimized DL stack for its systems
• Hardware blocks – hardware platform
  – Three generations of diverse and compatible hardware configurations: POWER8 + K80, “TurboTrainer”, “Minsky”
Three Generations of Accelerated Power Systems for Deep Learning
• 1st Gen: POWER8 + 1-2 K80 (2014)
• 2nd Gen: “TurboTrainer” – POWER8 + 4-16 M40, peer-to-peer PCIe (internal development)
• 3rd Gen: “Minsky” – POWER8 + 4 P100, CPU/GPU NVLink + peer-to-peer NVLink (September 2016)
Power GPU Acceleration with NVLink
• CUDA 8 programming environment under little-endian OpenPOWER Linux
• POWER8 with NVLink
• Tesla P100 GPU
Training Time to 50% Accuracy, VGGnet with Caffe, ImageNet 2012: 2nd Gen to 3rd Gen System (constant GPU count)
[Chart: CPU + 4x M40 vs. Minsky + 4x P100 – Minsky trains 1.93x faster]
Training Time to 50% Accuracy, VGGnet with Caffe, ImageNet 2012: 2nd Gen to 3rd Gen System
[Chart: Minsky + 4x P100 trains 1.93x faster than CPU + 4x M40 and 1.2x faster than CPU + 8x M40]
Latency-Optimized GPU-Accelerated Training for Vision: AlexNet
• Full ImageNet 2012 dataset
  – 1.2M images
  – 289 GB of data
• Power Systems S822LC for HPC
  – POWER8 with NVLink
  – 4 NVIDIA P100 GPUs
  – Ubuntu 16.04, CUDA 8
  – Caffe
[Chart: “Minsky 4x P100 – Caffe AlexNet Training, 2012 ImageNet / 1K BS, 20K SS”; accuracy (%, 0-60) vs. runtime (0:00-1:00)]
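A run like the one charted above would typically be launched with Caffe's standard command-line tool; the solver file name below is hypothetical, and `-gpu all` spreads training across all four P100s:

```shell
# Hypothetical solver file; the multi-GPU flag is standard Caffe CLI.
caffe train --solver=alexnet_solver.prototxt --gpu=all \
    2>&1 | tee alexnet_train.log
```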
Latency-Optimized GPU-Accelerated Training for Vision: VGGnet
• Full ImageNet 2012 dataset
  – 1.2M images
  – 289 GB of data
• Power Systems S822LC for HPC
  – POWER8 with NVLink
  – 4 NVIDIA P100 GPUs
  – Ubuntu 16.04, CUDA 8
  – Caffe
[Chart: “Minsky 4x P100 – Caffe VGGnet Training, 2012 ImageNet”; accuracy (%, 0-60) vs. runtime (0:00-5:00 hours)]
Throughput-Optimized GPU-Accelerated Training for Natural Language Processing (NLP)
• Natural language processing
  – Small, non-rectangular networks
• Power Systems S822LC for HPC
  – POWER8 with NVLink
  – 4 NVIDIA P100 GPUs
  – Ubuntu 16.04, CUDA 8
  – Torch
• Smaller networks
[Chart: execution time of 200 epochs (mins) and throughput speedup factor (X) vs. number of NLC instances (1-10, using NVIDIA MPS); series: training time (5 cores), throughput speedup (5 cores), throughput speedup (10 CPUs)]
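The multi-instance setup behind this chart relies on NVIDIA MPS, which lets several processes share one GPU. A minimal sketch of such a run, with a hypothetical Torch training script name:

```shell
# Start the MPS control daemon so concurrent processes can share the GPU.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Launch several NLC trainer instances concurrently (script name is
# hypothetical; the slide sweeps 1-10 instances).
for i in $(seq 1 8); do
    th train_nlc.lua &
done
wait

# Shut the MPS daemon down when the runs complete.
echo quit | nvidia-cuda-mps-control
```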
Portability and Optimization in Heterogeneous Systems
[Stack diagram, top to bottom:]
• Applications
• Cognitive middleware
• ML & DL framework layer
• Library layer
• Hardware enablement: CPU enablement | GPU enablement | FPGA interface & configuration | Accelerator X enablement
• Accelerator X
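The layering above can be sketched in a few lines. This is an illustrative pattern, not IBM's actual middleware: a library layer dispatches one operation to whichever hardware "enablement" backend is requested, so applications stay portable across hardware.

```python
# Illustrative sketch (hypothetical classes, not IBM's middleware API):
# a library layer dispatching one op to pluggable hardware backends.

class CPUBackend:
    name = "cpu"
    def matmul(self, a, b):
        # Naive CPU reference implementation of matrix multiply.
        return [[sum(x * y for x, y in zip(row, col))
                 for col in zip(*b)] for row in a]

class GPUBackend(CPUBackend):
    name = "gpu"
    # A real GPU enablement layer would call cuBLAS/cuDNN here;
    # inheriting the CPU path keeps the sketch self-contained.

class LibraryLayer:
    """Applications call ops here without knowing the hardware underneath."""
    def __init__(self, backends):
        self.backends = backends
    def matmul(self, a, b, prefer="gpu"):
        backend = self.backends.get(prefer, self.backends["cpu"])
        return backend.matmul(a, b)

lib = LibraryLayer({"cpu": CPUBackend(), "gpu": GPUBackend()})
print(lib.matmul([[1, 2]], [[3], [4]], prefer="gpu"))  # [[11]]
```

Swapping in an FPGA or "Accelerator X" backend only means registering one more class; the application code above the library layer does not change.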
Portability and Optimization in Heterogeneous Systems (continued)
[Stack diagram, top to bottom:]
• Applications
• Cognitive middleware
• ML & DL framework layer: Caffe, Torch, Theano, TensorFlow, CNTK, DL4J, …
• Library layer: transparent cognitive acceleration
• Hardware enablement: CPU enablement | GPU enablement | FPGA interface & configuration | Accelerator X enablement
• Accelerator X
Prebuilt vs. Out-of-the-Box Source Enablement

Prebuilt:
• Enterprise customers; time to productivity; “try out”
• Available from launchpad: ibm.biz/power-mldl
  – Ubuntu
  – Download & install instructions
• Avoid user frustration through curation
  – Packages are notoriously hard to build
  – External dependencies are not always obvious
  – Some versions are not, or only partially, functional

Source, out of the box on Power:
• Open source community; ongoing maintenance; customers who want to tune / modify / enhance
• Power tuning committed back to the master repos
  – Staged in github.com.ibmsoe
  – Build recipes on developerWorks
• Customer benefits
  – No lock-in and no loss of control
  – Build and maintain your own
  – Enhance from a known good base
  – Other OS versions (RHEL, SLES, CentOS, Red Flag, …)
Deep Learning Options on OpenPOWER Expand with New Distribution: ibm.biz/power-mldl
Power MLDL Distro
• Most popular packages
  – Theano
  – Caffe
  – Torch
  – DIGITS
• Ecosystem optimization
  – cuDNN
  – MASS
  – OpenBLAS
  – Scripting: Lua, Python, …
• Single-command or selective install
  – apt-get install power-mldl
  – apt-get install <package>
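On an Ubuntu system with the ibm.biz/power-mldl repository configured, the install commands from the slide look like this (the `<package>` placeholder stands for any individual package name listed on the launchpad page):

```shell
sudo apt-get update
sudo apt-get install power-mldl   # metapackage: the full distro in one command
sudo apt-get install <package>    # or install a single package selectively
```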
Deep Learning Build Recipes on IBM DeveloperWorks
Building and Optimizing the Power Ecosystem for Deep Learning
• Application domain ecosystem
  – Engage with the community to drive open source improvements
• Scripting languages
• Math libraries
  – Basic Linear Algebra Subprograms (BLAS)
  – Mathematics functions
Power Machine and Deep Learning Distro

DL Framework  Status
Caffe         Ported, upstreaming. Build recipe & binary release.
Torch         Ported, upstreaming. Build recipe & binary release.
Theano        Ported, upstreaming. Build recipe & binary release.
DIGITS        Ported, upstreaming. Build recipe & binary release.
TensorFlow    Ported, upstreaming. Build recipe & binary release.
DL4J          Ported, upstreaming. Build recipe.
Chainer       Ported, upstreaming. Build recipe.
MXNet         Ported, upstreaming. Build recipe.
Accelerate AI Workloads in a Connected World
• Enable compute-intensive cognitive workloads
• Exploit best-of-breed accelerators
• Provide abstraction of hardware and software function