CaffeGPI Single-Sided Communication for Scalable Deep Learning
Janis Keuper | itwm.fraunhofer.de/ml
Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany
Deep Neural Networks In a Nutshell

At an abstract level, DNNs are:
● directed (acyclic) graphs
● nodes are compute entities (= layers)
● edges define the data flow through the graph

Common intuition for inference / training: forward feed of the data through the network

[Figure: chain of layers, Layer 1 → Layer 2 → ... → Layer n-1 → Layer n]
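To make the abstraction concrete, here is a minimal sketch of such a feed-forward chain in Python/NumPy (the layer type and sizes are made-up illustrations, not CaffeGPI code):

```python
import numpy as np

class DenseLayer:
    """One node of the graph: a compute entity with a forward operation."""
    def __init__(self, d_in, d_out, rng):
        self.W = rng.normal(scale=0.1, size=(d_in, d_out))

    def forward(self, x):
        return np.maximum(x @ self.W, 0.0)   # linear map + ReLU

rng = np.random.default_rng(0)
# Edges of the graph: the output of each layer feeds the next one.
layers = [DenseLayer(d_in, d_out, rng)
          for d_in, d_out in [(32, 64), (64, 64), (64, 10)]]

x = rng.normal(size=(8, 32))                 # a small batch of input data
for layer in layers:                         # forward feed through the network
    x = layer.forward(x)
```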
Training Deep Neural Networks
The Underlying Optimization Problem

Minimize the loss function via gradient descent (high dimensional and NON-CONVEX!)

Computed via the back-propagation algorithm:
1. feed forward and compute the activations
2. compute the error by layer
3. compute the derivatives by layer
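Written out, each descent step performs the generic update below (η is the learning rate; this is standard notation, not anything specific to CaffeGPI):

```latex
W_{t+1} = W_t - \eta \, \nabla_W L(W_t)
```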
Optimization by Stochastic Gradient Descent (SGD)
1. Initialize the weights W at random
2. Take a small random subset X (= batch) of the training data
3. Run X through the network (forward feed)
4. Compute the loss
5. Compute the gradient
6. Propagate it backwards through the network
7. Update W

Repeat steps 2-7 until convergence.

[Figure: forward and backward passes through Layer 1, Layer 2, ..., Layer n-1, Layer n]
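A minimal, runnable sketch of this loop (plain NumPy with a toy linear model and squared loss, chosen only so the example is self-contained; it is not the CaffeGPI training code):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))                     # toy training data
y_train = X_train @ rng.normal(size=(20, 1)) + 0.1 * rng.normal(size=(1000, 1))

W = rng.normal(size=(20, 1))                              # 1. initialize W at random
lr, batch_size = 0.05, 64

for step in range(500):                                   # repeat 2-7 until convergence
    idx = rng.choice(len(X_train), size=batch_size, replace=False)
    X, y = X_train[idx], y_train[idx]                     # 2. small random batch
    pred = X @ W                                          # 3. forward feed
    loss = np.mean((pred - y) ** 2)                       # 4. compute loss
    grad = 2 * X.T @ (pred - y) / batch_size              # 5./6. gradient (backward pass)
    W -= lr * grad                                        # 7. update W
```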
Parallelization
Common approaches to parallelize SGD for DL

Parallelizing SGD is very hard: it is an inherently sequential algorithm:
1. Start at some state t (a point in a billion-dimensional space)
2. Introduce data batch d1 to state t
3. Compute an update (based on the objective function)
4. Apply the update → t+1

[Figure: state t → state t+1 → state t+2, driven by successive batches d1, d2, d3]

How to gain speedup?
● Make faster updates
● Make larger updates
Parallelization
Common approaches to parallelize SGD for DL

Internal parallelization = parallel execution of the layer operation
● Mostly dense matrix multiplication:
  ● standard BLAS sgemm
  ● MKL, OpenBLAS
  ● cuBLAS on GPUs
● Task parallelization for special layers
  ● Cuda-CNN for fast convolutions

[Figure: forward and backward passes through Layer 1, Layer 2, ..., Layer n-1, Layer n]
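To illustrate why the dominant layer operation is a dense GEMM, here is a sketch of a fully connected layer's forward pass in NumPy (the shapes are made-up examples; NumPy dispatches the product to whichever BLAS it is linked against, e.g. MKL or OpenBLAS, while cuBLAS plays the same role on GPUs):

```python
import numpy as np

batch_size, d_in, d_out = 256, 4096, 4096                # illustrative sizes
X = np.random.rand(batch_size, d_in).astype(np.float32)  # activations from the previous layer
W = np.random.rand(d_in, d_out).astype(np.float32)       # layer weights
b = np.zeros(d_out, dtype=np.float32)

# The forward pass of the layer is a single dense (s)gemm plus a bias add;
# parallelizing the layer internally means parallelizing this multiplication.
Y = X @ W + b
```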
Parallelization
Common approaches to parallelize SGD for DL

External: data parallelization over the data batch
Master:
1. Split the batch and send the parts to the workers
2. Workers compute the forward and backward passes and send their gradients to the master
3. Master combines the gradients, computes the update of W, and sends the new W' to the workers

→ Speedup

[Figure: master distributing batch parts to several workers, each running forward and backward through Layer 1, Layer 2, ..., Layer n-1, Layer n]
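A minimal single-process sketch of this synchronous master/worker scheme (the "workers" are just a Python loop over batch shards and there is no real communication; the CaffeGPI approach presented in this talk instead uses single-sided GPI communication):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))
y_train = X_train @ rng.normal(size=(20, 1))

W = rng.normal(size=(20, 1))
lr, batch_size, n_workers = 0.05, 64, 4

for step in range(500):
    idx = rng.choice(len(X_train), size=batch_size, replace=False)
    X, y = X_train[idx], y_train[idx]

    # 1. master splits the batch and "sends" the parts to the workers
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)

    # 2. each worker computes forward + backward on its part and returns its gradient
    grads = [2 * Xp.T @ (Xp @ W - yp) / len(Xp)
             for Xp, yp in zip(X_parts, y_parts)]

    # 3. master combines the gradients, updates W, and "broadcasts" the new W'
    W -= lr * np.mean(grads, axis=0)
```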
Outline
I  Overview: distributed parallel training of DNNs
II Limits of Scalability
   Limitation I:   Communication bounds
   Limitation II:  Skinny matrix multiplication (aka the large-batch problem)
   Limitation III: Data I/O

Based on our paper from SC16
Limitation I
Distributed SGD is heavily communication bound
● Gradients have the same size as the model
● Model size can be hundreds of MB
● Iteration time (GPU)
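A back-of-the-envelope sketch of why this binds: the 250 MB model size, the worker count, and the 10 GbE bandwidth below are assumed round numbers for illustration, not measurements from the talk:

```python
# Rough per-iteration communication estimate for the master/worker scheme.
model_size_mb = 250       # assumed model (= gradient) size; "hundreds of MB" per the slide
n_workers = 8             # assumed number of workers
bandwidth_mb_s = 1250     # assumed ~10 GbE link at the master (about 1.25 GB/s)

# Each iteration, every worker sends a gradient to the master and receives
# the updated weights, each transfer roughly the size of the model.
traffic_mb = 2 * n_workers * model_size_mb
transfer_s = traffic_mb / bandwidth_mb_s
print(f"~{traffic_mb} MB through the master's link per iteration, "
      f"~{transfer_s:.1f} s spent on communication alone")
```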