Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Janis Keuper
itwm.fraunhofer.de/ml
Competence Center High Performance Computing
Fraunhofer ITWM, Kaiserslautern, Germany

Outline

I   Overview: distributed parallel training of DNNs
II  Experimental Evaluation
III Limitation I: Communication Bounds
IV  Limitation II: Skinny Matrix Multiplication
V   Limitation III: Data I/O

Training Deep Neural Networks

Underlying optimization problem: minimize the loss function via gradient descent (high dimensional and NON-CONVEX!)

The gradient is computed via the back-propagation algorithm:
1. feed forward and compute the activations
2. compute the error by layer
3. compute the derivatives by layer
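As a concrete illustration of these three steps, here is a minimal NumPy sketch of back propagation for a two-layer network with a sigmoid hidden layer and a squared-error loss (the names, architecture, and loss are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(W1, W2, x, y):
    """Gradients of L = 0.5*||a2 - y||^2 for a 2-layer MLP (illustrative)."""
    # 1. feed forward and compute the activations
    a1 = sigmoid(x @ W1)                          # hidden activations
    a2 = a1 @ W2                                  # linear output

    # 2. compute the error by layer (deltas), starting from the loss
    delta2 = a2 - y                               # error at the output layer
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)    # error pushed back through the sigmoid

    # 3. compute the derivatives by layer (gradients w.r.t. the weights)
    grad_W1 = np.outer(x, delta1)
    grad_W2 = np.outer(a1, delta2)
    return grad_W1, grad_W2
```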

Optimization by Stochastic Gradient Descent (SGD)

1. Initialize the weights W at random
2. Take a small random subset X (= batch) of the training data
3. Run X through the network (forward feed)
4. Compute the loss
5. Compute the gradient
6. Propagate it backwards through the network
7. Update W

Repeat steps 2-7 until convergence (a minimal sketch of this loop follows below).

[Figure: forward and backward passes through Layer 1, Layer 2, ..., Layer n-1, Layer n]
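A minimal sketch of this training loop, with the per-batch work of steps 3-6 wrapped into a hypothetical forward_backward helper (the helper, learning rate, batch size, and iteration count are illustrative assumptions):

```python
import numpy as np

def sgd(forward_backward, data, labels, dim, lr=0.01, batch=256, iters=10000):
    W = 0.01 * np.random.randn(dim)                # 1. initialize W at random
    for _ in range(iters):                         # repeat 2-7 "until convergence"
        idx = np.random.choice(len(data), batch)   # 2. small random subset (batch)
        X, y = data[idx], labels[idx]
        _, grad = forward_backward(W, X, y)        # 3.-6. forward feed, loss, gradient, backward
        W -= lr * grad                             # 7. update: W <- W - eta * dL/dW
    return W
```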

Parallelization: Common Approaches to Parallelize SGD for DL

Parallelization of SGD is very hard: it is an inherently sequential algorithm.
1. Start at some state t (a point in a billion-dimensional space)
2. Introduce state t to data batch d1
3. Compute an update (based on the objective function)
4. Apply the update → state t+1

[Figure: sequential chain of SGD updates: state t → state t+1 → state t+2, each step consuming one data batch (d1, d2, d3)]

How to gain speedup?
● Make faster updates
● Make larger updates

Parallelization: Common Approaches to Parallelize SGD for DL

Internal parallelization
● Dense matrix multiplication (see the GEMM sketch below):
  ● standard BLAS sgemm
  ● MKL, OpenBLAS
  ● cuBLAS
● Task parallelization for special layers
  ● Cuda-CNN for fast convolutions

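Most of the per-layer work in the forward and backward passes reduces to dense matrix products, which is exactly what these BLAS libraries parallelize. A minimal sketch using SciPy's BLAS binding (SciPy and the layer shapes are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.linalg.blas import sgemm

# Forward pass of one fully connected layer as a single-precision GEMM:
# activations = X * W. MKL / OpenBLAS backends run this multi-threaded;
# on the GPU the analogous call goes to cuBLAS.
X = np.random.rand(256, 4096).astype(np.float32)   # batch of 256 inputs
W = np.random.rand(4096, 4096).astype(np.float32)  # layer weight matrix

A = sgemm(alpha=1.0, a=X, b=W)
assert A.shape == (256, 4096)
```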

Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch
1. Master: split the batch and send the parts to the workers

[Figure: master node and three workers, each holding a full copy of the network (Layer 1 ... Layer n) and running the forward pass on its part of the batch]

Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch
1. Master: split the batch and send the parts to the workers
2. Workers compute forward+backward and send their gradients to the master


Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch
1. Master: split the batch and send the parts to the workers
2. Workers compute forward+backward and send their gradients to the master
3. Master: combine the gradients, compute the update of W, and send the new W' to the workers (a sketch of one such step follows below)

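A minimal sketch of one such synchronous step, using MPI collectives in place of the explicit master logic (mpi4py, the forward_backward helper from above, and the hyper-parameters are illustrative assumptions, not from the slides):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, n_workers = comm.Get_rank(), comm.Get_size()

def parallel_sgd_step(W, X, y, forward_backward, lr=0.01):
    # 1. split the batch: each rank takes its slice of the global batch
    #    (stand-in for the master scattering the batch)
    X_local = np.array_split(X, n_workers)[rank]
    y_local = np.array_split(y, n_workers)[rank]

    # 2. every worker computes forward+backward on its slice ...
    _, grad_local = forward_backward(W, X_local, y_local)

    #    ... and the gradients are summed on the master (rank 0)
    grad_sum = np.zeros_like(grad_local)
    comm.Reduce(grad_local, grad_sum, op=MPI.SUM, root=0)

    # 3. the master combines the gradients and computes the update of W ...
    if rank == 0:
        W -= lr * grad_sum / n_workers
    #    ... and sends the new W' back to all workers
    comm.Bcast(W, root=0)
    return W
```

Note that the Reduce and the Bcast each move a full model-sized buffer every iteration, which is exactly the communication volume discussed under Limitation I below.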


Experimental Evaluation: Scaling Distributed Parallel Synchronous SGD Training

Experimental setup:
● HPC cluster with FDR InfiniBand interconnect
● K80 GPUs / Xeon E5 CPUs
● Intel Caffe distribution

[Figure: speedup over number of nodes (1-128, log scale) for AlexNet (B=256, B=1024, and [1]) and GoogLeNet (B=32, B=256, B=1024, and [6]), compared against linear speedup]
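For reading such plots, the usual definitions are speedup S(n) = T(1)/T(n) and parallel efficiency E(n) = S(n)/n; a tiny helper capturing just these standard definitions (not from the slides):

```python
def speedup(t_single, t_parallel):
    """Speedup of an n-node run relative to the single-node baseline."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, n_nodes):
    """Fraction of ideal linear speedup actually achieved on n_nodes."""
    return speedup(t_single, t_parallel) / n_nodes
```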

Limitation I: Distributed SGD is Heavily Communication Bound

● Gradients have the same size as the model
● Model size can be hundreds of MB
● Iteration time (GPU)
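To see the scale of the problem, a rough back-of-the-envelope sketch; the concrete numbers (parameter count of an AlexNet-scale model, FDR InfiniBand bandwidth, worker count) are assumptions, not values from the slides:

```python
# Communication volume of one synchronous data-parallel SGD iteration,
# assuming every worker exchanges a full gradient and a full model with the master.
params = 61_000_000          # ~61M parameters (AlexNet-scale model, assumed)
bytes_per_param = 4          # float32
model_mb = params * bytes_per_param / 1e6
print(f"model / gradient size: {model_mb:.0f} MB")          # roughly 244 MB

workers = 64
traffic_gb = 2 * workers * params * bytes_per_param / 1e9   # in + out of the master
print(f"traffic at the master per iteration: {traffic_gb:.1f} GB")

fdr_gb_per_s = 56 / 8 * 0.9  # FDR InfiniBand ~56 Gbit/s, minus protocol overhead (assumed)
print(f"lower bound on communication time: {traffic_gb / fdr_gb_per_s:.2f} s")
```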
