Distributed Training of Deep Neural Networks:
Theoretical and Practical Limits of Parallel Scalability

Janis Keuper
itwm.fraunhofer.de/ml
Competence Center High Performance Computing
Fraunhofer ITWM, Kaiserslautern, Germany
Outline

I    Overview: distributed parallel training of DNNs
II   Experimental Evaluation
III  Limitation I: Communication Bounds
IV   Limitation II: Skinny Matrix Multiplication
V    Limitation III: Data I/O
Training Deep Neural Networks

Underlying optimization problem: minimize the loss function via gradient descent (high-dimensional and NON-CONVEX!).

The gradients are computed via the back propagation algorithm:
1. Feed forward and compute the activations
2. Compute the error by layer
3. Compute the derivatives by layer
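For reference, the underlying problem in the usual empirical-risk form (notation introduced here for illustration, not taken from the slides):

    \min_{W} \; L(W) \;=\; \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f(x_i; W),\, y_i\bigr)

where f(x; W) denotes the network, \ell the per-sample loss, and the gradient \nabla_W L is assembled layer by layer by back propagation.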
Optimization by Stochastic Gradient Descent (SGD)

1. Initialize the weights W at random
2. Take a small random subset X (= batch) of the training data
3. Run X through the network (forward feed)
4. Compute the loss
5. Compute the gradient
6. Propagate it backwards through the network
7. Update W

Repeat steps 2-7 until convergence.

[Slide figure: forward and backward pass through Layer 1, Layer 2, ..., Layer n-1, Layer n]
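A minimal sketch of this loop in Python; the model is a deliberately tiny linear layer with a squared loss, and all sizes, data, and hyperparameters are invented for illustration (a real DNN has many layers and a non-convex loss):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 64))           # synthetic training data
y_train = X_train @ rng.normal(size=(64, 1))      # synthetic targets

W = rng.normal(scale=0.01, size=(64, 1))          # 1. initialize weights W at random
lr, batch_size = 0.01, 256

for step in range(1_000):                         # repeat steps 2-7 until convergence
    idx = rng.choice(len(X_train), size=batch_size, replace=False)
    X, y = X_train[idx], y_train[idx]             # 2. small random batch X
    out = X @ W                                   # 3. forward feed
    loss = 0.5 * np.mean((out - y) ** 2)          # 4. compute loss
    grad = X.T @ (out - y) / batch_size           # 5./6. gradient via back propagation
    W -= lr * grad                                # 7. update W
    if step % 100 == 0:
        print(f"step {step:4d}  loss {loss:.4f}")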
Parallelization

Common approaches to parallelize SGD for DL

Parallelizing SGD is very hard: it is an inherently sequential algorithm.
1. Start at some state t (a point in a billion-dimensional space)
2. Introduce state t to data batch d1
3. Compute an update (based on the objective function)
4. Apply the update → state t+1

[Slide figure: sequential chain of updates, state t → state t+1 → state t+2, driven by batches d1, d2, d3]

How to gain speedup?
● Make faster updates
● Make larger updates
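The sequential dependence is visible directly in the update rule (standard SGD notation, introduced here for illustration):

    W_{t+1} = W_t - \eta\, \nabla L_{d_1}(W_t), \qquad W_{t+2} = W_{t+1} - \eta\, \nabla L_{d_2}(W_{t+1})

W_{t+2} cannot be computed before W_{t+1} is available, no matter how many machines are idle; parallelism therefore has to go into making each update faster or into making each update cover more data.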
Parallelization

Common approaches to parallelize SGD for DL

Internal parallelization
● Dense matrix multiplication:
  ● standard BLAS SGEMM
  ● MKL, OpenBLAS
  ● cuBLAS
● Task parallelization for special layers:
  ● Cuda-CNN for fast convolutions

[Slide figure: forward and backward pass through Layer 1, Layer 2, ..., Layer n-1, Layer n]
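For the internal path, most of the per-layer cost is a dense matrix product, which maps directly onto a multithreaded GEMM call; a short sketch (layer sizes are invented for illustration):

import numpy as np

# A fully connected layer is a dense matrix product: NumPy dispatches the lines
# below to the linked BLAS (MKL, OpenBLAS, ...), which runs the SGEMM across
# all cores; on the GPU the same role is played by cuBLAS.
batch, n_in, n_out = 256, 4096, 4096
X = np.random.rand(batch, n_in).astype(np.float32)   # activations from the previous layer
W = np.random.rand(n_in, n_out).astype(np.float32)   # layer weights

Y = X @ W              # forward pass: one (256 x 4096) * (4096 x 4096) SGEMM
dY = np.ones_like(Y)   # placeholder gradient coming from the layer above
dW = X.T @ dY          # backward pass w.r.t. the weights: another SGEMM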
Parallelization

Common approaches to parallelize SGD for DL

External: data parallelization over the data batch
Master:
1. Split the batch and send the parts to the workers

[Slide figure: master node plus three worker replicas of the network (Layer 1, Layer 2, ..., Layer n-1, Layer n), each running the forward pass on its part of the batch]
External data parallelization (continued)
1. Split the batch and send the parts to the workers
2. Workers compute the forward + backward passes and send their gradients to the master

[Slide figure: the three worker replicas (Layer 1, ..., Layer n), now running both forward and backward passes]
External data parallelization (continued)
1. Split the batch and send the parts to the workers
2. Workers compute the forward + backward passes and send their gradients to the master
3. Master combines the gradients, computes the update of W, and sends the new W' to the workers

[Slide figure: the three worker replicas running forward and backward passes, with the gradients returned to the master]
Speedup comes from the workers computing the forward and backward passes for their parts of the batch in parallel (see the sketch below).
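To make the three steps concrete, here is a sketch of the master/worker scheme using mpi4py collectives (Scatter / Reduce / Bcast). Everything in it is an assumption for illustration: the model is again a toy linear layer, and sizes, data, hyperparameters, and the script name are invented; the experiments in this talk run the scheme inside Intel Caffe rather than hand-written Python. Launch with e.g.: mpirun -np 4 python data_parallel_sgd.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

n_in, global_batch, lr, steps = 64, 256, 0.01, 100
local_batch = global_batch // world        # assumes global_batch % world == 0

# Master initializes W at random and broadcasts it to all workers.
W = np.empty((n_in, 1))
if rank == 0:
    W = np.random.default_rng(0).normal(scale=0.01, size=(n_in, 1))
comm.Bcast(W, root=0)

master_rng = np.random.default_rng(1234)

for step in range(steps):
    # 1. Master assembles the batch, splits it and scatters one shard per worker.
    X_parts = y_parts = None
    if rank == 0:
        X_full = master_rng.normal(size=(global_batch, n_in))
        y_full = X_full @ np.ones((n_in, 1))                  # synthetic targets
        X_parts = X_full.reshape(world, local_batch, n_in).copy()
        y_parts = y_full.reshape(world, local_batch, 1).copy()
    X = np.empty((local_batch, n_in))
    y = np.empty((local_batch, 1))
    comm.Scatter(X_parts, X, root=0)
    comm.Scatter(y_parts, y, root=0)

    # 2. Every worker runs forward + backward on its shard and returns its gradient.
    out = X @ W
    local_grad = X.T @ (out - y) / local_batch
    grad_sum = np.zeros_like(W)
    comm.Reduce(local_grad, grad_sum, op=MPI.SUM, root=0)

    # 3. Master combines the gradients, updates W and broadcasts the new W'.
    if rank == 0:
        W -= lr * (grad_sum / world)
    comm.Bcast(W, root=0)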
Experimental Evaluation

Scaling Distributed Parallel Synchronous SGD Training

[Slide figure: speedup of synchronous SGD training, scaling up to 128 nodes]
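For reading the scaling results, the standard definitions of speedup and parallel efficiency apply (textbook definitions, not slide content), with T(N) the time to process a fixed workload on N nodes:

    S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}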
Experimental Setup:
● HPC cluster with FDR InfiniBand interconnect
● NVIDIA K80 GPUs / Intel Xeon E5 CPUs
● Intel Caffe distribution