ChainerMN
Scalable Distributed Deep Learning with Chainer
Takuya Akiba | Preferred Networks, Inc.
ChainerMN v1.0.0β
Released today! https://git.io/chainermn
Agenda
1. Intro to distributed deep learning
2. Chainer and ChainerMN
3. Benchmark results
4. Future directions
Distributed Deep Learning
Training Iterations
Forward → Backward (gradient ∂E/∂w) → Optimize
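For reference, one training iteration in plain single-GPU Chainer looks roughly like this; the model class and minibatch iterator are placeholders introduced here, not code from the slides:

```python
import chainer
import chainer.functions as F
from chainer import optimizers

model = MyModel()                       # placeholder chainer.Chain
optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(model)

for x, t in minibatches:                # placeholder (input, label) iterator
    y = model(x)                        # Forward
    loss = F.softmax_cross_entropy(y, t)
    model.cleargrads()                  # reset accumulated gradients
    loss.backward()                     # Backward: compute ∂E/∂w for every parameter
    optimizer.update()                  # Optimize: apply the gradients
```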
Two Kinds of Parallelism
- Data parallel
- Model parallel
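In data-parallel training every worker holds a full copy of the model, computes the gradient on its own minibatch, and the per-worker gradients are averaged into one update. As a rough sketch (the symbols N, B_i, and ℓ are notation introduced here, not from the slides):

```latex
% Average of per-worker gradients: N workers, worker i sees minibatch B_i,
% per-example loss \ell.
\frac{\partial E}{\partial w}
  \approx \frac{1}{N} \sum_{i=1}^{N}
          \frac{1}{|B_i|} \sum_{(x,t) \in B_i}
          \frac{\partial \ell(x, t; w)}{\partial w}
```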
Synchronous vs. Asynchronous
- Synchronous: lower throughput
- Asynchronous: higher throughput; each worker computes independently, pushes its gradient to a parameter server, and pulls the updated model
Which should we use? Should we only look at throughput?
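A conceptual sketch of the asynchronous worker loop described above; `server`, `pull_model`, and `push_gradient` are hypothetical names for illustration, not a real ChainerMN or MPI API:

```python
def async_worker_loop(server, model, minibatches, compute_gradient):
    """Asynchronous data-parallel worker (conceptual sketch)."""
    for batch in minibatches:
        w = server.pull_model()                    # pull the latest parameters
        grad = compute_gradient(model, w, batch)   # forward + backward on this worker
        server.push_gradient(grad)                 # server applies it when it arrives

# Because the server keeps updating the model while a worker is still
# computing, a pushed gradient may refer to parameters that are already
# several updates old -- the gradient staleness shown in the next figure.
```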
Gradient Staleness [Figure 3, Chen+'16]
Training Iterations
Usual training with Chainer: Forward → Backward → Optimize
Distributed training with ChainerMN: each worker runs Forward → Backward, the gradients are combined with All-Reduce, and then every worker runs Optimize
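In code this maps to a small change in the optimizer setup. The following is a sketch of how ChainerMN is typically wired up (helper names follow the ChainerMN documentation; `MyModel` is a placeholder and exact signatures may differ across versions):

```python
import chainer
import chainermn

# One MPI process per GPU; the communicator wraps MPI (+ NCCL) underneath.
comm = chainermn.create_communicator()

device = comm.intra_rank                 # GPU index within this node
model = MyModel()                        # placeholder chainer.Chain
model.to_gpu(device)

# The multi-node optimizer inserts the All-Reduce of the gradients between
# backward() and the parameter update, so every worker applies the same
# averaged gradient and the replicas stay in sync.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.SGD(lr=0.01), comm)
optimizer.setup(model)

# The per-iteration loop itself is unchanged from single-GPU Chainer:
# forward, loss.backward(), optimizer.update().
```

The processes are then launched with mpiexec, one per GPU.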
Chainer & ChainerMN
A Flexible Deep Learning Framework
A Flexible Deep Learning Framework
- Define-and-Run: the model definition is first compiled into a computational graph with its gradient functions (Define), and the training data is then fed through that fixed graph (Run)
- Define-by-Run: the computational graph and gradient functions are built on the fly while the training data flows through the model definition
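A small illustration of Define-by-Run in Chainer: the graph is recorded while the forward pass executes, so ordinary Python control flow can change its shape from call to call. The layer sizes and the `deep` flag are made up for illustration:

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__(
            l1=L.Linear(None, 100),     # input size inferred on first call
            l2=L.Linear(100, 10),
        )

    def __call__(self, x, deep=False):
        h = F.relu(self.l1(x))
        if deep:                        # plain Python branching reshapes the graph
            h = F.relu(h)
        return self.l2(h)

model = MLP()
x = np.random.rand(8, 784).astype(np.float32)
loss = F.sum(model(x))                  # the computational graph is built here
loss.backward()                         # gradients follow the recorded graph
```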
ChainerMN's Building Blocks
- CUDA-Aware MPI for communication between nodes
- NVIDIA NCCL for GPU-to-GPU communication within a node
NVIDIA NCCL
- MPI-like interface, e.g. ncclBcast(dev_buff, n, ncclFloat, 0, comm, stream);
- Optimized intra-node GPU-to-GPU communication
Non-CUDA-Aware MPI:
// Stage the data through host memory before sending.
cudaMemcpy(host_buf, dev_buf, n, cudaMemcpyDeviceToHost);
MPI_Send(host_buf, n, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

CUDA-Aware MPI:
// Send directly from GPU memory; the MPI library handles the transfer.
MPI_Send(dev_buf, n, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
[Figures from Parallel Forall (https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/): data paths for Non-CUDA-Aware MPI, CUDA-Aware MPI, and CUDA-Aware MPI + GPUDirect]
Two-Dimensional All-Reduce
1. NCCL Reduce-Scatter (within each node)
2. MPI All-Reduce (across nodes)
3. NCCL All-Gather (within each node)
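A conceptual sketch of the three steps, assuming one process per GPU with an intra-node communicator (NCCL) and an inter-node communicator (MPI); the communicator objects and method names are placeholders, not ChainerMN's actual internals:

```python
def two_dimensional_all_reduce(grad, intra_comm, inter_comm):
    """Hierarchical All-Reduce sketch: intra_comm spans the GPUs of one
    node (NCCL), inter_comm spans one process per node (MPI)."""
    # 1. NCCL Reduce-Scatter: every GPU in the node ends up holding the
    #    node-local sum of one distinct chunk of the gradient.
    chunk = intra_comm.reduce_scatter(grad)

    # 2. MPI All-Reduce: sum the corresponding chunks across nodes, so each
    #    chunk now contains the global sum.
    chunk = inter_comm.all_reduce(chunk)

    # 3. NCCL All-Gather: exchange chunks inside the node so that every GPU
    #    again holds the complete, globally reduced gradient.
    return intra_comm.all_gather(chunk)
```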
Benchmark Results
GeForce GTX TITAN X (Maxwell) ×4 per node, ×32 nodes, InfiniBand FDR 4X interconnect
All that glitters is not gold…
"We achieved XX times speedup!!"
- Improving throughput alone is actually too easy:
  - too-large batch sizes
  - infrequent synchronization
- Need to check the accuracy of the resulting models
Future Directions
Limits of Data Parallelism
- Problem 1: Fewer iterations
- Problem 2: Sharp minima
Problem 1: Fewer iterations
One epoch is the same amount of work, but with multiple GPUs the effective batch is larger, so an epoch consists of far fewer (and larger) iterations than with 1 GPU.
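A quick back-of-the-envelope example (the dataset size D, per-GPU batch size b, and GPU count N are symbols chosen here for illustration, not figures from the slides):

```latex
% Iterations per epoch shrink linearly with the number of GPUs:
\text{iterations per epoch} = \frac{D}{N \cdot b}
% e.g. D = 1{,}280{,}000 examples and b = 32 gives 40{,}000 iterations per
% epoch on 1 GPU, but only about 313 on 128 GPUs -- far fewer parameter updates.
```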
Problem 2: Sharp minima [Keskar+, ICLR'17]
Large-batch training tends to converge to sharp minima, which generalize worse than the flat minima found with smaller batches.
Simple model parallelism is not the solution
[Timeline figure: Standard Training vs. Overlapped Model Parallelism — fwd/bwd of Batch 1, Batch 2, … along the time axis; the overlapped schedule introduces staleness]
Synthetic Gradients [Jaderberg (DeepMind)+, '16]
[Timeline figure: Standard Training vs. Synthetic Gradients — fwd/bwd of Batch 1, Batch 2, … along the time axis]
Virtual Forward-Backward Networks [Miyato (PFN) et al., ICLRw'17]
[Figure: Original vs. VFBN]
Successfully decoupled a 100-layer CNN
Preferred Networks
A startup that applies deep learning to industrial IoT
Thank you, team!
PFN: T. Akiba, D. Okanohara, Y. Doi, T. Okuta, K. Fukuda, S. Tokui, T. Miyato, G. Watanabe, T. Nishikawa, S. Watanabe
External: T. Mitsuishi (Fixstars), K. Ueno (Fixstars), N. Yoshifuji (Fixstars), T. Sudo (Sakura Internet)
ChainerMN v1.0.0β
Released today! https://git.io/chainermn