Need for Speed: Accelerated Deep Learning on Power

Michael Gschwind, Chief Engineer, Machine Learning and Deep Learning, TJ Watson Research Center, IBM Corp.

The Cognitive Revolution: New Paradigm, New Chip, New Servers, Accelerated AI

New chip: "POWER8 with NVLink"
• POWER8 + coherent CAPI + novel NVLink for high-bandwidth coherent CPU/GPU acceleration

New Power Linux servers:
• S821LC: high-density 2-socket 1U
• S822LC for Big Data (Accelerator X)
• S822LC for High Performance Computing

Accelerated Deep Learning on Power

Software blocks (DL framework enablement): DL frameworks are independent of hardware and can be deployed across IBM servers, and they can be updated independently of hardware releases. IBM is the only server provider publishing an optimized DL stack for its systems.

Hardware blocks (hardware platform): three generations of diverse and compatible hardware configurations:
• POWER8 + K80
• "TurboTrainer"
• "Minsky"

Three Generations of Accelerated Power Systems for Deep Learning

• 1st Gen: POWER8 + 1-2 K80 (2014)
• 2nd Gen: "TurboTrainer": POWER8 + 4-16 M40, p2p PCIe (internal development)
• 3rd Gen: "Minsky": POWER8 + 4 P100, CPU/GPU NVLink + p2p NVLink (September 2016)

Power GPU acceleration with NVLink
• CUDA 8 programming environment under LE OpenPOWER Linux
• POWER8 with NVLink
• Tesla P100 GPU

Training Time to 50% Accuracy, VGGnet with Caffe, ImageNet 2012: 2nd Gen to 3rd Gen System

[Bar chart: training time for CPU + 4x M40, CPU + 8x M40, and Minsky + 4x P100. At constant GPU count, Minsky + 4x P100 is 1.93x faster than CPU + 4x M40, and still 1.2x faster than CPU + 8x M40.]
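The generational speedups above are ratios of time to reach 50% accuracy. A minimal sketch of that arithmetic, where the absolute hours are illustrative placeholders (not from the slides), chosen only so the ratios match the published 1.93x and 1.2x figures:

```python
def speedup(baseline_hours, accelerated_hours):
    """Speedup of the accelerated system over the baseline
    (training time to a fixed accuracy target)."""
    return baseline_hours / accelerated_hours

t_minsky_4xP100 = 5.0    # hypothetical time on Minsky + 4x P100
t_cpu_4xM40 = 9.65       # hypothetical time on CPU + 4x M40
t_cpu_8xM40 = 6.0        # hypothetical time on CPU + 8x M40

print(round(speedup(t_cpu_4xM40, t_minsky_4xP100), 2))  # 1.93
print(round(speedup(t_cpu_8xM40, t_minsky_4xP100), 2))  # 1.2
```

Note that doubling the M40 count narrows the gap far less than 2x, which is the point of the constant-GPU-count comparison.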

Latency-Optimized GPU-Accelerated Training for Vision: AlexNet

• Power Systems S822LC for HPC: POWER8 with NVLink, 4 NVIDIA P100 GPUs, Ubuntu 16.04, CUDA 8, Caffe
• Full ImageNet 2012 dataset: 1.2M images, 289 GB of data

[Chart: Minsky 4x P100, Caffe AlexNet training on 2012 ImageNet (1K BS, 20K SS); accuracy (%) vs. runtime from 0:00 to 1:00, reaching 50% accuracy within the hour.]

Latency-Optimized GPU-Accelerated Training for Vision: VGGnet

• Power Systems S822LC for HPC: POWER8 with NVLink, 4 NVIDIA P100 GPUs, Ubuntu 16.04, CUDA 8, Caffe
• Full ImageNet 2012 dataset: 1.2M images, 289 GB of data

[Chart: Minsky 4x P100, Caffe VGGnet training on 2012 ImageNet; accuracy (%) vs. runtime in hours from 0:00 to 5:00, reaching 50% accuracy within roughly five hours.]

Throughput-Optimized GPU-Accelerated Training for Natural Language Processing (NLP)

• Natural language processing: small, non-rectangular networks
• Power Systems S822LC for HPC: POWER8 with NVLink, 4 NVIDIA P100 GPUs, Ubuntu 16.04, CUDA 8, Torch
• Smaller networks

[Chart: execution time of 200 epochs (minutes) and throughput speedup factor (x) as a function of the number of NLC instances (1 to 10) run concurrently using NVIDIA MPS; throughput speedup is plotted for 5 cores and for 10 CPUs, reaching roughly 3.5x.]
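The throughput-optimized setup runs several small training instances against the same GPU, with NVIDIA MPS multiplexing their kernels. A hedged sketch of how such a launch could be planned: the MPS daemon is assumed to be started out of band (`nvidia-cuda-mps-control -d`), and the training command `th train.lua` is a hypothetical Torch entry point, not the actual workload from the slide.

```python
import os

def mps_launch_plan(n_instances, gpu_id=0, cmd=("th", "train.lua")):
    """Build (command, environment) pairs for n training instances
    that all target the same GPU; MPS multiplexes their kernels."""
    plan = []
    for _ in range(n_instances):
        env = dict(os.environ,
                   # every instance sees the same physical GPU
                   CUDA_VISIBLE_DEVICES=str(gpu_id),
                   # clients locate the MPS daemon through this directory
                   CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps")
        plan.append((list(cmd), env))
    return plan

plan = mps_launch_plan(4)
print(len(plan))  # 4
```

Each pair could then be handed to `subprocess.Popen(cmd, env=env)`; the chart above shows aggregate throughput continuing to rise as more such instances share the GPUs.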

Portability and Optimization in Heterogeneous Systems

Applications run on a common cognitive middleware stack:
• ML & DL framework layer
• Library layer
• Enablement layers: CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement
• Accelerator X hardware

Portability and Optimization in Heterogeneous Systems

Applications run on a common cognitive middleware stack:
• ML & DL framework layer: Caffe, Torch, Theano, CNTK, DL4J, TensorFlow, …
• Library layer
• Enablement layers providing transparent cognitive acceleration: CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement
• Accelerator X hardware
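The layering idea can be sketched in a few lines, using hypothetical names: the framework calls one library-layer entry point, and an enablement registry routes the call to the best backend that is actually present, so the application never changes when hardware does. Real enablement layers would call tuned libraries (cuBLAS, cuDNN, OpenBLAS); this is only an illustration of the dispatch shape.

```python
class Backend:
    """One enablement layer: a named backend with an availability
    flag and a priority used for selection."""
    def __init__(self, name, available, priority):
        self.name, self.available, self.priority = name, available, priority

    def gemm(self, a, b):
        # naive matrix multiply standing in for a tuned library call
        n = len(b[0])
        return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                 for j in range(n)] for i in range(len(a))]

REGISTRY = [
    Backend("gpu", available=False, priority=2),  # e.g. cuBLAS/cuDNN path
    Backend("cpu", available=True, priority=1),   # e.g. OpenBLAS path
]

def dispatch():
    """Pick the highest-priority backend that is actually present."""
    return max((b for b in REGISTRY if b.available), key=lambda b: b.priority)

backend = dispatch()
print(backend.name)                        # cpu (GPU marked unavailable here)
print(backend.gemm([[1, 2]], [[3], [4]]))  # [[11]]
```

Flipping the GPU backend to `available=True` changes where the work runs without touching the caller, which is the portability claim of the middleware stack.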

Prebuilt vs. Out-of-the-Box Source Enablement

Prebuilt:
• For enterprise customers, time to productivity, and "try it out"
• Available from Launchpad (ibm.biz/power-mldl): Ubuntu packages with download & install instructions
• Avoids user frustration through curation: packages are notoriously hard to build, external dependencies are not always obvious, and some versions are only partially functional or not functional at all

Source, out of the box on Power:
• For the open source community, ongoing maintenance, and customers who want to tune, modify, or enhance
• Power tuning is committed back to the master repo: staged in github.com/ibmsoe, with build recipes on developerWorks
• Customer benefits: no lock-in or loss of control, build and maintain your own, enhance from a known good base, other OS versions (RHEL, SLES, CentOS, Red Flag, …)

Deep Learning Options on OpenPOWER: Expand with New Distribution (ibm.biz/power-mldl)

Power MLDL Distro

• Most popular packages: Theano, Caffe, Torch, DIGITS
• Ecosystem optimization: cuDNN, MASS, OpenBLAS; scripting: Lua, Python, …
• Single-command or selective install:
  – apt-get install power-mldl
  – apt-get install <package>

Deep Learning Build Recipes on IBM DeveloperWorks


Building and Optimizing the Power Ecosystem for Deep Learning

• Application domain → ecosystem: engage with the community to drive open source improvements
• Scripting languages
• Math libraries: Basic Linear Algebra Subprograms (BLAS), mathematics functions
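As an illustration of what the math-library layer provides, here is the contract of one BLAS level-1 routine, daxpy (y ← a·x + y). On Power this would come from a tuned library such as OpenBLAS or MASS; this pure-Python version only shows the semantics, not the optimized implementation.

```python
def daxpy(a, x, y):
    """BLAS level-1 daxpy: elementwise a*x + y for vectors x and y
    and scalar a (here returning a new list rather than updating y)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Deep learning frameworks lean on exactly such routines (and their level-3 cousins like gemm) for most of their compute, which is why tuned BLAS is called out as an ecosystem priority.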

Power Machine and Deep Learning Distro

DL Framework   Status
CAFFE          Ported, upstreaming. Build recipe & binary release.
Torch          Ported, upstreaming. Build recipe & binary release.
Theano         Ported, upstreaming. Build recipe & binary release.
DIGITS         Ported, upstreaming. Build recipe & binary release.
TensorFlow     Ported, upstreaming. Build recipe & binary release.
DL4J           Ported, upstreaming. Build recipe.
Chainer        Ported, upstreaming. Build recipe.
MXnet          Ported, upstreaming. Build recipe.

Accelerate AI Workloads in a Connected World
• Enable compute-intensive cognitive workloads
• Exploit best-of-breed accelerators
• Provide abstraction of hardware and software function