ChainerMN: Scalable Distributed Deep Learning with Chainer
Takuya Akiba | Preferred Networks, Inc.

ChainerMN v1.0.0β
Released today! https://git.io/chainermn

Agenda
1. Intro to distributed deep learning
2. Chainer and ChainerMN
3. Benchmark results
4. Future directions

Distributed Deep Learning

Training Iterations

Each training iteration consists of three steps:
Forward: compute the loss E on a mini-batch
Backward: compute the gradient ∂E/∂w of the loss with respect to the parameters
Optimize: update the parameters w using the gradient
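In Chainer this maps almost one-to-one onto code. Below is a minimal sketch of a single iteration, using a toy linear classifier, random data, and an arbitrary learning rate (all of them are assumptions chosen only for illustration):

    import numpy as np
    import chainer
    import chainer.links as L

    # Toy model and random data, only to spell out one training iteration.
    model = L.Classifier(L.Linear(784, 10))        # linear classifier on 784-dim inputs
    optimizer = chainer.optimizers.SGD(lr=0.01)
    optimizer.setup(model)

    x = np.random.rand(32, 784).astype(np.float32)           # mini-batch of 32 samples
    t = np.random.randint(0, 10, size=32).astype(np.int32)   # random labels

    loss = model(x, t)      # Forward: compute the loss E
    model.cleargrads()
    loss.backward()         # Backward: compute the gradient dE/dw
    optimizer.update()      # Optimize: update the parameters w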

Two Kinds of Parallelism
Data parallel: every worker holds a replica of the model and processes a different part of each mini-batch.
Model parallel: the model itself is split across workers.

Synchronous vs. Asynchronous

Synchronous: workers compute gradients on their own mini-batches and combine them before every update, so all replicas step in lockstep.
Asynchronous: each worker pushes its gradient to a parameter server and pulls the latest model, without waiting for the other workers.

Which should we use?

Synchronous training has lower throughput; asynchronous training has higher throughput.

Only looking at throughput?

With asynchronous updates, a gradient is often computed against an already outdated copy of the model and applied late. This gradient staleness can hurt the quality of the resulting model [Figure 3, Chen+'16].
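To make the push-gradient / pull-model data flow concrete, here is a toy, single-process simulation of the asynchronous parameter-server pattern; the class, the dummy quadratic loss, and the learning rate are invented purely for illustration, and a real system runs the server and the workers as separate processes:

    import numpy as np

    class ParameterServer:
        """Holds the model and applies each gradient as soon as it arrives."""
        def __init__(self, dim, lr=0.1):
            self.w = np.zeros(dim)
            self.lr = lr

        def push_grad(self, grad):      # worker -> server, no synchronization
            self.w -= self.lr * grad

        def pull_model(self):           # server -> worker, latest parameters
            return self.w.copy()

    server = ParameterServer(dim=4)
    target = np.arange(4, dtype=float)          # minimize 0.5 * ||w - target||^2
    for step in range(5):
        # Two workers pull the model at roughly the same time ...
        local_copies = [server.pull_model() for _ in range(2)]
        for w_local in local_copies:
            grad = w_local - target             # gradient of the dummy loss
            server.push_grad(grad)              # the second push uses a stale copy
    print(server.w)                             # still drifts toward the target

The second worker's gradient in each step is computed from a copy of the model that the first worker's update has already made obsolete; this is the staleness that grows with the number of asynchronous workers.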

Training Iterations with ChainerMN

Usual training with Chainer: Forward → Backward → Optimize, repeated on one device.

Distributed training with ChainerMN: every worker runs Forward and Backward on its own mini-batch, the gradients are combined with an All-Reduce, and then every worker runs Optimize on the same combined gradient, keeping all replicas in sync.
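A minimal sketch of how this looks with ChainerMN, reusing the toy classifier from before and assuming one MPI process per GPU (the hyperparameters are placeholders, and dataset/iterator setup is omitted):

    import chainer
    import chainer.links as L
    import chainermn

    comm = chainermn.create_communicator()        # MPI-based communicator
    device = comm.intra_rank                      # one GPU per MPI process

    model = L.Classifier(L.Linear(784, 10))       # toy model, as in the earlier sketch
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Wrapping the optimizer makes update() all-reduce the gradients across all
    # workers before applying them, so every model replica stays identical.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.01), comm)
    optimizer.setup(model)
    # From here the usual Chainer training loop (or Trainer) is unchanged:
    # Forward and Backward run locally; the All-Reduce happens inside update().

Such a script would typically be launched with something like mpiexec -n <number of GPUs> python train.py, so that each MPI process drives one GPU.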

Chainer & ChainerMN

A Flexible Deep Learning Framework

Define-and-Run: the model definition is first compiled into a computational graph with its gradient functions; training data is then fed through this fixed graph.
Define-by-Run (Chainer): the computational graph and gradient functions are built on the fly while the forward computation runs on the training data, so the model definition is ordinary Python code.
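A minimal sketch of Define-by-Run in Chainer (the toy computation is invented for illustration): the graph is recorded while the forward pass executes, so ordinary Python control flow determines its shape.

    import numpy as np
    import chainer
    import chainer.functions as F

    x = chainer.Variable(np.array([[1.0, 2.0]], dtype=np.float32))
    h = x
    for _ in range(3):        # the Python loop itself defines a 3-step chain
        h = F.tanh(h)
    loss = F.sum(h * h)       # by now the graph exists, because the code has run
    loss.backward()           # gradients flow back through the recorded graph
    print(x.grad)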

ChainerMN’s Building Blocks
- Communication between nodes: CUDA-Aware MPI
- Communication between GPUs within a node: NVIDIA NCCL

NVIDIA NCCL
- MPI-like interface
- Optimized intra-node GPU-to-GPU communication

    ncclBcast(dev_buf, n, ncclFloat, 0, comm, stream);

Non-CUDA-Aware MPI: the GPU buffer must first be copied to host memory before it can be sent.

    cudaMemcpy(host_buf, dev_buf, n, cudaMemcpyDeviceToHost);
    MPI_Send(host_buf, n, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

CUDA-Aware MPI: device pointers can be passed to MPI_Send directly.

    MPI_Send(dev_buf, n, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

[Figures: data path of Non-CUDA-Aware MPI vs. CUDA-Aware MPI (and CUDA-Aware MPI + GPUDirect). Images are from Parallel Forall | https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/]

Two-Dimensional All-Reduce
① NCCL Reduce-Scatter within each node
② MPI All-Reduce across nodes
③ NCCL All-Gather within each node
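A structural sketch of the three phases, written with plain mpi4py on host arrays; in ChainerMN the intra-node phases run NCCL on GPU buffers, and the 4-processes-per-node figure and buffer sizes below are assumptions for illustration only:

    from mpi4py import MPI
    import numpy as np

    world = MPI.COMM_WORLD
    gpus_per_node = 4                       # assumption: 4 ranks (GPUs) per node
    node_id = world.rank // gpus_per_node
    local_id = world.rank % gpus_per_node

    intra = world.Split(color=node_id, key=local_id)   # ranks within one node
    inter = world.Split(color=local_id, key=node_id)   # same local slot across nodes

    grad = np.random.rand(gpus_per_node * 1024).astype(np.float32)  # dummy gradient

    # (1) Reduce-scatter inside the node: each rank ends up with one chunk,
    #     summed over all GPUs of its node.
    chunk = np.empty(grad.size // gpus_per_node, dtype=np.float32)
    intra.Reduce_scatter_block(grad, chunk, op=MPI.SUM)

    # (2) All-reduce that chunk across nodes.
    inter.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)

    # (3) All-gather inside the node to reassemble the full, globally summed gradient.
    intra.Allgather(chunk, grad)

Splitting the collective this way means the inter-node step only has to all-reduce 1/k of the gradient per process (k = GPUs per node), while the heavier intra-node steps stay on the fast NCCL path.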

Benchmark Results

Setup: 32 nodes connected with InfiniBand FDR 4X, each node with 4 GeForce GTX TITAN X (Maxwell) GPUs, i.e. 128 GPUs in total.

All that glitters is not gold…

"We achieved XX times speedup!!"

- Improving throughput alone is actually too easy:
  - just use overly large batch sizes
  - or synchronize infrequently
- We also need to check the accuracy of the resulting models.

Future Directions

Limits of Data Parallelism
- Problem 1: Fewer iterations
- Problem 2: Sharp minima

Problem 1: Fewer iterations
Within one epoch (the same amount of work), data parallelism enlarges the effective batch size, so with multiple GPUs an epoch contains far fewer parameter updates than with 1 GPU (see the sketch below for illustrative numbers).
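A back-of-the-envelope sketch; the dataset size, per-GPU batch size, and GPU counts are assumptions chosen only to show how quickly the number of updates per epoch shrinks:

    dataset_size = 1_281_167     # assumption: ImageNet-1k training set
    batch_per_gpu = 32           # assumption: per-GPU mini-batch size

    for n_gpus in (1, 8, 32, 128):
        global_batch = batch_per_gpu * n_gpus
        updates_per_epoch = dataset_size // global_batch
        print(f"{n_gpus:4d} GPUs -> global batch {global_batch:5d}, "
              f"{updates_per_epoch:6d} updates per epoch")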

Problem 2: Sharp minima
Large-batch training tends to converge to sharp minima, which generalize worse than the flat minima reached with smaller batches [Keskar+, ICLR'17].


Simple model parallelism is not the solution

In standard training, batch 2 cannot start until batch 1 has finished both forward and backward. Overlapped model parallelism pipelines the forward and backward passes of different batches across the model partitions to keep every worker busy over time, but this brings back staleness: parts of the model are updated with gradients computed against older parameters.

Synthetic Gradients [Jaderberg (DeepMind)+, '16]

Compared with standard training, each module updates immediately using a locally predicted ("synthetic") gradient instead of waiting for the true backward pass, so the forward and backward computations of successive batches are decoupled.

Virtual Forward-Backward Networks [Miyato (PFN) et al., ICLRw'17]

Compared with the original network, VFBN successfully decoupled the training of a 100-layer CNN.

Preferred Networks
A startup that applies deep learning to industrial IoT.

Agenda (recap)
1. Intro to distributed deep learning
2. Chainer and ChainerMN
3. Benchmark results
4. Future directions

Thank you, team!

PFN: T. Akiba, D. Okanohara, Y. Doi, T. Okuta, K. Fukuda, S. Tokui, T. Miyato, G. Watanabe, T. Nishikawa, S. Watanabe
External: T. Mitsuishi (Fixstars), K. Ueno (Fixstars), N. Yoshifuji (Fixstars), T. Sudo (Sakura Internet)

ChainerMN v1.0.0β

Released today! https://git.io/chainermn
