Need for Speed: Accelerated Deep Learning on Power

Michael Gschwind, Chief Engineer, Machine Learning and Deep Learning, TJ Watson Research Center, IBM Corp.

The Cognitive Revolution: New Paradigm, New Chip, New Servers, Accelerated AI

New chip: "POWER8 with NVLink"
• POWER8 + coherent CAPI + novel NVLink for high-bandwidth coherent CPU/GPU acceleration

New Power Linux servers:
• S821LC: high-density 2-socket 1U
• S822LC for Big Data (Accelerator X)
• S822LC for High Performance Computing

Accelerated Deep Learning on Power

Software blocks (DL framework enablement): DL frameworks are independent of hardware and can be deployed across IBM servers, and they can be updated independently of hardware releases. IBM is the only server provider publishing an optimized DL stack for its systems.

Hardware blocks (hardware platform): three generations of diverse and compatible hardware configurations:
• POWER8 + K80
• "TurboTrainer"
• "Minsky"

Three Generations of Accelerated Power Systems for Deep Learning

• 1st Gen: POWER8 + 1-2 K80 (2014)
• 2nd Gen: "TurboTrainer": POWER8 + 4-16 M40, p2p PCIe (internal development)
• 3rd Gen: "Minsky": POWER8 + 4 P100, CPU/GPU NVLink + p2p NVLink (September 2016)

Power GPU acceleration with NVLink
• CUDA 8 programming environment under LE OpenPOWER Linux
• POWER8 with NVLink
• Tesla P100 GPU

Training Time to 50% Accuracy, VGGnet with Caffe, ImageNet 2012: 2nd Gen to 3rd Gen System

[Bar chart: training time for CPU + 4x M40, CPU + 8x M40, and Minsky + 4x P100. At constant GPU count, Minsky + 4x P100 is 1.93x faster than CPU + 4x M40, and still 1.2x faster than CPU + 8x M40.]
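The generational speedups above are ratios of time to reach 50% accuracy. A minimal sketch of that arithmetic, where the absolute hours are illustrative placeholders (not from the slides), chosen only so the ratios match the published 1.93x and 1.2x figures:

```python
def speedup(baseline_hours, accelerated_hours):
    """Speedup of the accelerated system over the baseline
    (training time to a fixed accuracy target)."""
    return baseline_hours / accelerated_hours

t_minsky_4xP100 = 5.0    # hypothetical time on Minsky + 4x P100
t_cpu_4xM40 = 9.65       # hypothetical time on CPU + 4x M40
t_cpu_8xM40 = 6.0        # hypothetical time on CPU + 8x M40

print(round(speedup(t_cpu_4xM40, t_minsky_4xP100), 2))  # 1.93
print(round(speedup(t_cpu_8xM40, t_minsky_4xP100), 2))  # 1.2
```

Note that doubling the M40 count narrows the gap far less than 2x, which is the point of the constant-GPU-count comparison.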

Latency-Optimized GPU-Accelerated Training for Vision: AlexNet

• Power Systems S822LC for HPC: POWER8 with NVLink, 4 NVIDIA P100 GPUs, Ubuntu 16.04, CUDA 8, Caffe
• Full ImageNet 2012 dataset: 1.2M images, 289 GB of data

[Chart: Minsky 4x P100, Caffe AlexNet training on 2012 ImageNet (1K BS, 20K SS); accuracy (%) vs. runtime from 0:00 to 1:00, reaching 50% accuracy within the hour.]

Latency-Optimized GPU-Accelerated Training for Vision: VGGnet

• Power Systems S822LC for HPC: POWER8 with NVLink, 4 NVIDIA P100 GPUs, Ubuntu 16.04, CUDA 8, Caffe
• Full ImageNet 2012 dataset: 1.2M images, 289 GB of data

[Chart: Minsky 4x P100, Caffe VGGnet training on 2012 ImageNet; accuracy (%) vs. runtime in hours from 0:00 to 5:00, reaching 50% accuracy within roughly five hours.]

Throughput-Optimized GPU-Accelerated Training for Natural Language Processing (NLP)

• Natural language processing: small, non-rectangular networks
• Power Systems S822LC for HPC: POWER8 with NVLink, 4 NVIDIA P100 GPUs, Ubuntu 16.04, CUDA 8, Torch
• Smaller networks

[Chart: execution time of 200 epochs (minutes) and throughput speedup factor (x) as a function of the number of NLC instances (1 to 10) run concurrently using NVIDIA MPS; throughput speedup is plotted for 5 cores and for 10 CPUs, reaching roughly 3.5x.]
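The throughput-optimized setup runs several small training instances against the same GPU, with NVIDIA MPS multiplexing their kernels. A hedged sketch of how such a launch could be planned: the MPS daemon is assumed to be started out of band (`nvidia-cuda-mps-control -d`), and the training command `th train.lua` is a hypothetical Torch entry point, not the actual workload from the slide.

```python
import os

def mps_launch_plan(n_instances, gpu_id=0, cmd=("th", "train.lua")):
    """Build (command, environment) pairs for n training instances
    that all target the same GPU; MPS multiplexes their kernels."""
    plan = []
    for _ in range(n_instances):
        env = dict(os.environ,
                   # every instance sees the same physical GPU
                   CUDA_VISIBLE_DEVICES=str(gpu_id),
                   # clients locate the MPS daemon through this directory
                   CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps")
        plan.append((list(cmd), env))
    return plan

plan = mps_launch_plan(4)
print(len(plan))  # 4
```

Each pair could then be handed to `subprocess.Popen(cmd, env=env)`; the chart above shows aggregate throughput continuing to rise as more such instances share the GPUs.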

Portability and Optimization in Heterogeneous Systems

Applications run on a common cognitive middleware stack:
• ML & DL framework layer
• Library layer
• Enablement layers: CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement
• Accelerator X hardware

Portability and Optimization in Heterogeneous Systems

Applications run on a common cognitive middleware stack:
• ML & DL framework layer: Caffe, Torch, Theano, CNTK, DL4J, TensorFlow, …
• Library layer
• Enablement layers providing transparent cognitive acceleration: CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement
• Accelerator X hardware
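The layering idea can be sketched in a few lines, using hypothetical names: the framework calls one library-layer entry point, and an enablement registry routes the call to the best backend that is actually present, so the application never changes when hardware does. Real enablement layers would call tuned libraries (cuBLAS, cuDNN, OpenBLAS); this is only an illustration of the dispatch shape.

```python
class Backend:
    """One enablement layer: a named backend with an availability
    flag and a priority used for selection."""
    def __init__(self, name, available, priority):
        self.name, self.available, self.priority = name, available, priority

    def gemm(self, a, b):
        # naive matrix multiply standing in for a tuned library call
        n = len(b[0])
        return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                 for j in range(n)] for i in range(len(a))]

REGISTRY = [
    Backend("gpu", available=False, priority=2),  # e.g. cuBLAS/cuDNN path
    Backend("cpu", available=True, priority=1),   # e.g. OpenBLAS path
]

def dispatch():
    """Pick the highest-priority backend that is actually present."""
    return max((b for b in REGISTRY if b.available), key=lambda b: b.priority)

backend = dispatch()
print(backend.name)                        # cpu (GPU marked unavailable here)
print(backend.gemm([[1, 2]], [[3], [4]]))  # [[11]]
```

Flipping the GPU backend to `available=True` changes where the work runs without touching the caller, which is the portability claim of the middleware stack.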

Prebuilt vs. Out-of-the-Box Source Enablement

Prebuilt:
• For enterprise customers, time to productivity, and "try it out"
• Available from Launchpad (ibm.biz/power-mldl): Ubuntu packages with download & install instructions
• Avoids user frustration through curation: packages are notoriously hard to build, external dependencies are not always obvious, and some versions are only partially functional or not functional at all

Source, out of the box on Power:
• For the open source community, ongoing maintenance, and customers who want to tune, modify, or enhance
• Power tuning is committed back to the master repo: staged in github.com/ibmsoe, with build recipes on developerWorks
• Customer benefits: no lock-in or loss of control, build and maintain your own, enhance from a known good base, other OS versions (RHEL, SLES, CentOS, Red Flag, …)

Deep Learning Options on OpenPOWER: Expand with New Distribution (ibm.biz/power-mldl)

Power MLDL Distro

• Most popular packages: Theano, Caffe, Torch, DIGITS
• Ecosystem optimization: cuDNN, MASS, OpenBLAS; scripting: Lua, Python, …
• Single-command or selective install:
  – apt-get install power-mldl
  – apt-get install <package>

Deep Learning Build Recipes on IBM DeveloperWorks


Building and Optimizing the Power Ecosystem for Deep Learning

• Application domain → ecosystem: engage with the community to drive open source improvements
• Scripting languages
• Math libraries: Basic Linear Algebra Subprograms (BLAS), mathematics functions
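As an illustration of what the math-library layer provides, here is the contract of one BLAS level-1 routine, daxpy (y ← a·x + y). On Power this would come from a tuned library such as OpenBLAS or MASS; this pure-Python version only shows the semantics, not the optimized implementation.

```python
def daxpy(a, x, y):
    """BLAS level-1 daxpy: elementwise a*x + y for vectors x and y
    and scalar a (here returning a new list rather than updating y)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Deep learning frameworks lean on exactly such routines (and their level-3 cousins like gemm) for most of their compute, which is why tuned BLAS is called out as an ecosystem priority.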

Power Machine and Deep Learning Distro

DL Framework   Status
CAFFE          Ported, upstreaming. Build recipe & binary release.
Torch          Ported, upstreaming. Build recipe & binary release.
Theano         Ported, upstreaming. Build recipe & binary release.
DIGITS         Ported, upstreaming. Build recipe & binary release.
TensorFlow     Ported, upstreaming. Build recipe & binary release.
DL4J           Ported, upstreaming. Build recipe.
Chainer        Ported, upstreaming. Build recipe.
MXnet          Ported, upstreaming. Build recipe.

Accelerate AI Workloads in a Connected World
• Enable compute-intensive cognitive workloads
• Exploit best-of-breed accelerators
• Provide abstraction of hardware and software function