ChainerMN: Scalable Distributed Deep Learning with Chainer
Takuya Akiba | Preferred Networks, Inc.

ChainerMN v1.0.0β
Released today! https://git.io/chainermn

Agenda
1. Intro to distributed deep learning
2. Chainer and ChainerMN
3. Benchmark results
4. Future directions

Distributed Deep Learning

Training Iterations

Each training iteration consists of three steps:
Forward: compute the loss E on a mini-batch
Backward: compute the gradient ∂E/∂w of the loss with respect to the parameters
Optimize: update the parameters w using the gradient
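In Chainer this maps almost one-to-one onto code. Below is a minimal sketch of a single iteration, using a toy linear classifier, random data, and an arbitrary learning rate (all of them are assumptions chosen only for illustration):

    import numpy as np
    import chainer
    import chainer.links as L

    # Toy model and random data, only to spell out one training iteration.
    model = L.Classifier(L.Linear(784, 10))        # linear classifier on 784-dim inputs
    optimizer = chainer.optimizers.SGD(lr=0.01)
    optimizer.setup(model)

    x = np.random.rand(32, 784).astype(np.float32)           # mini-batch of 32 samples
    t = np.random.randint(0, 10, size=32).astype(np.int32)   # random labels

    loss = model(x, t)      # Forward: compute the loss E
    model.cleargrads()
    loss.backward()         # Backward: compute the gradient dE/dw
    optimizer.update()      # Optimize: update the parameters w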

Two Kinds of Parallelism
Data parallel: every worker holds a replica of the model and processes a different part of each mini-batch.
Model parallel: the model itself is split across workers.

Synchronous vs. Asynchronous

Synchronous: workers compute gradients on their own mini-batches and combine them before every update, so all replicas step in lockstep.
Asynchronous: each worker pushes its gradient to a parameter server and pulls the latest model, without waiting for the other workers.

Which should we use?

Synchronous training has lower throughput; asynchronous training has higher throughput.

Only looking at throughput?

With asynchronous updates, a gradient is often computed against an already outdated copy of the model and applied late. This gradient staleness can hurt the quality of the resulting model [Figure 3, Chen+'16].
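To make the push-gradient / pull-model data flow concrete, here is a toy, single-process simulation of the asynchronous parameter-server pattern; the class, the dummy quadratic loss, and the learning rate are invented purely for illustration, and a real system runs the server and the workers as separate processes:

    import numpy as np

    class ParameterServer:
        """Holds the model and applies each gradient as soon as it arrives."""
        def __init__(self, dim, lr=0.1):
            self.w = np.zeros(dim)
            self.lr = lr

        def push_grad(self, grad):      # worker -> server, no synchronization
            self.w -= self.lr * grad

        def pull_model(self):           # server -> worker, latest parameters
            return self.w.copy()

    server = ParameterServer(dim=4)
    target = np.arange(4, dtype=float)          # minimize 0.5 * ||w - target||^2
    for step in range(5):
        # Two workers pull the model at roughly the same time ...
        local_copies = [server.pull_model() for _ in range(2)]
        for w_local in local_copies:
            grad = w_local - target             # gradient of the dummy loss
            server.push_grad(grad)              # the second push uses a stale copy
    print(server.w)                             # still drifts toward the target

The second worker's gradient in each step is computed from a copy of the model that the first worker's update has already made obsolete; this is the staleness that grows with the number of asynchronous workers.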

Training Iterations with ChainerMN

Usual training with Chainer: Forward → Backward → Optimize, repeated on one device.

Distributed training with ChainerMN: every worker runs Forward and Backward on its own mini-batch, the gradients are combined with an All-Reduce, and then every worker runs Optimize on the same combined gradient, keeping all replicas in sync.
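A minimal sketch of how this looks with ChainerMN, reusing the toy classifier from before and assuming one MPI process per GPU (the hyperparameters are placeholders, and dataset/iterator setup is omitted):

    import chainer
    import chainer.links as L
    import chainermn

    comm = chainermn.create_communicator()        # MPI-based communicator
    device = comm.intra_rank                      # one GPU per MPI process

    model = L.Classifier(L.Linear(784, 10))       # toy model, as in the earlier sketch
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Wrapping the optimizer makes update() all-reduce the gradients across all
    # workers before applying them, so every model replica stays identical.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.01), comm)
    optimizer.setup(model)
    # From here the usual Chainer training loop (or Trainer) is unchanged:
    # Forward and Backward run locally; the All-Reduce happens inside update().

Such a script would typically be launched with something like mpiexec -n <number of GPUs> python train.py, so that each MPI process drives one GPU.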

Chainer & ChainerMN

A Flexible Deep Learning Framework

Define-and-Run: the model definition is first compiled into a computational graph with its gradient functions; training data is then fed through this fixed graph.
Define-by-Run (Chainer): the computational graph and gradient functions are built on the fly while the forward computation runs on the training data, so the model definition is ordinary Python code.
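A minimal sketch of Define-by-Run in Chainer (the toy computation is invented for illustration): the graph is recorded while the forward pass executes, so ordinary Python control flow determines its shape.

    import numpy as np
    import chainer
    import chainer.functions as F

    x = chainer.Variable(np.array([[1.0, 2.0]], dtype=np.float32))
    h = x
    for _ in range(3):        # the Python loop itself defines a 3-step chain
        h = F.tanh(h)
    loss = F.sum(h * h)       # by now the graph exists, because the code has run
    loss.backward()           # gradients flow back through the recorded graph
    print(x.grad)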

ChainerMN’s Building Blocks
- Communication between nodes: CUDA-Aware MPI
- Communication between GPUs within a node: NVIDIA NCCL

NVIDIA NCCL
- MPI-like interface
- Optimized intra-node GPU-to-GPU communication

    ncclBcast(dev_buf, n, ncclFloat, 0, comm, stream);

Non-CUDA-Aware MPI: the GPU buffer must first be copied to host memory before it can be sent.

    cudaMemcpy(host_buf, dev_buf, n, cudaMemcpyDeviceToHost);
    MPI_Send(host_buf, n, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

CUDA-Aware MPI: device pointers can be passed to MPI_Send directly.

    MPI_Send(dev_buf, n, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

[Figures: data path of Non-CUDA-Aware MPI vs. CUDA-Aware MPI (and CUDA-Aware MPI + GPUDirect). Images are from Parallel Forall | https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/]

Two-Dimensional All-Reduce
① NCCL Reduce-Scatter within each node
② MPI All-Reduce across nodes
③ NCCL All-Gather within each node
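A structural sketch of the three phases, written with plain mpi4py on host arrays; in ChainerMN the intra-node phases run NCCL on GPU buffers, and the 4-processes-per-node figure and buffer sizes below are assumptions for illustration only:

    from mpi4py import MPI
    import numpy as np

    world = MPI.COMM_WORLD
    gpus_per_node = 4                       # assumption: 4 ranks (GPUs) per node
    node_id = world.rank // gpus_per_node
    local_id = world.rank % gpus_per_node

    intra = world.Split(color=node_id, key=local_id)   # ranks within one node
    inter = world.Split(color=local_id, key=node_id)   # same local slot across nodes

    grad = np.random.rand(gpus_per_node * 1024).astype(np.float32)  # dummy gradient

    # (1) Reduce-scatter inside the node: each rank ends up with one chunk,
    #     summed over all GPUs of its node.
    chunk = np.empty(grad.size // gpus_per_node, dtype=np.float32)
    intra.Reduce_scatter_block(grad, chunk, op=MPI.SUM)

    # (2) All-reduce that chunk across nodes.
    inter.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)

    # (3) All-gather inside the node to reassemble the full, globally summed gradient.
    intra.Allgather(chunk, grad)

Splitting the collective this way means the inter-node step only has to all-reduce 1/k of the gradient per process (k = GPUs per node), while the heavier intra-node steps stay on the fast NCCL path.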

Benchmark Results

Setup: 32 nodes connected with InfiniBand FDR 4X, each node with 4 GeForce GTX TITAN X (Maxwell) GPUs, i.e. 128 GPUs in total.

All that glitters is not gold…

"We achieved XX times speedup!!"

- Improving throughput alone is actually too easy:
  - just use overly large batch sizes
  - or synchronize infrequently
- We also need to check the accuracy of the resulting models.

Future Directions

Limits of Data Parallelism
- Problem 1: Fewer iterations
- Problem 2: Sharp minima

Problem 1: Fewer iterations
Within one epoch (the same amount of work), data parallelism enlarges the effective batch size, so with multiple GPUs an epoch contains far fewer parameter updates than with 1 GPU (see the sketch below for illustrative numbers).
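A back-of-the-envelope sketch; the dataset size, per-GPU batch size, and GPU counts are assumptions chosen only to show how quickly the number of updates per epoch shrinks:

    dataset_size = 1_281_167     # assumption: ImageNet-1k training set
    batch_per_gpu = 32           # assumption: per-GPU mini-batch size

    for n_gpus in (1, 8, 32, 128):
        global_batch = batch_per_gpu * n_gpus
        updates_per_epoch = dataset_size // global_batch
        print(f"{n_gpus:4d} GPUs -> global batch {global_batch:5d}, "
              f"{updates_per_epoch:6d} updates per epoch")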

Problem 2: Sharp minima
Large-batch training tends to converge to sharp minima, which generalize worse than the flat minima reached with smaller batches [Keskar+, ICLR'17].


Simple model parallelism is not the solution

In standard training, batch 2 cannot start until batch 1 has finished both forward and backward. Overlapped model parallelism pipelines the forward and backward passes of different batches across the model partitions to keep every worker busy over time, but this brings back staleness: parts of the model are updated with gradients computed against older parameters.

Synthetic Gradients [Jaderberg (DeepMind)+, '16]

Compared with standard training, each module updates immediately using a locally predicted ("synthetic") gradient instead of waiting for the true backward pass, so the forward and backward computations of successive batches are decoupled.

Virtual Forward-Backward Networks [Miyato (PFN) et al., ICLRw'17]

Compared with the original network, VFBN successfully decoupled the training of a 100-layer CNN.

Preferred Networks
A startup that applies deep learning to industrial IoT.

Agenda (recap)
1. Intro to distributed deep learning
2. Chainer and ChainerMN
3. Benchmark results
4. Future directions

Thank you, team!

PFN: T. Akiba, D. Okanohara, Y. Doi, T. Okuta, K. Fukuda, S. Tokui, T. Miyato, G. Watanabe, T. Nishikawa, S. Watanabe
External: T. Mitsuishi (Fixstars), K. Ueno (Fixstars), N. Yoshifuji (Fixstars), T. Sudo (Sakura Internet)

ChainerMN v1.0.0β

Released today! https://git.io/chainermn
