Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms J. Keuper and F.-J. Pfreundt Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern
Outline
Paper based on our SC15 contribution: Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
Preprint: [arxiv.org/abs/1505.04956]
Part I: Review - inside distributed Machine Learning
Part II: Asynchronous Parallel Stochastic Gradient Descent (ASGD)
Part III: ASGD outside the HPC environment
Part I: Large Scale Machine Learning
Definition of Large Scale Machine Learning [L. Bottou]
● Small scale learning problem: we have a small scale learning problem when the active budget constraint is the number of examples n.
● Large scale learning problem: we have a large scale learning problem when the active budget constraint is the computing time T.
Possible conclusion:
Large scale ML != scaling ML
Training Machine Learning Models
Formulation of the training problem:
● Set of observations (training samples) X = {x_1, ..., x_n}
● [Supervised learning] Set of (semantic) labels Y = {y_1, ..., y_n}
● Loss function to determine the quality of the learned model, writing l(x_i, w) or l(x_i, y_i, w) for the loss of a given sample and model state w
● Optimization problem: minimize the loss over all samples (see the formulas below)
● Straightforward gradient descent optimization
● Pitfalls: sparse, high dimensional target space → overfitting problem
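For reference, the training objective and the plain gradient descent update can be written as follows (standard notation, reconstructed; the exact formulas on the original slide did not survive extraction):

\min_w L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i, w),
\qquad
w_{t+1} = w_t - \epsilon_t \, \nabla_w L(w_t)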
Optimization Algorithms for ML
Simple Method: BATCH-Optimization
● Run over ALL samples
● Compute the average gradient of the loss function
● Make an update step in the gradient direction
● Computationally expensive
● Scales very poorly in the number of samples
Optimization Algorithms for ML: Stochastic Gradient Descent
● Online algorithm
● Randomized
● Update after EACH sample
Compared to BATCH optimization:
● Much faster
● Better ML results
● But: intrinsically sequential! (see the sketch below)
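A minimal sketch contrasting the two update schemes (illustrative Python; grad_loss, the learning rate and the epoch counts are placeholders, not the paper's code):

import numpy as np

def batch_gradient_descent(X, y, w, grad_loss, lr=0.01, epochs=10):
    # BATCH: one update per full pass, using the AVERAGE gradient over ALL samples.
    for _ in range(epochs):
        g = np.mean([grad_loss(x_i, y_i, w) for x_i, y_i in zip(X, y)], axis=0)
        w = w - lr * g
    return w

def stochastic_gradient_descent(X, y, w, grad_loss, lr=0.01, epochs=10):
    # SGD: one update after EACH randomly drawn sample; inherently sequential.
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            w = w - lr * grad_loss(X[i], y[i], w)
    return w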
Parallel Optimization Algorithms for ML: Hogwild!
[B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.]
● Parallel SGD for shared memory systems
● Basic idea:
  ● Write asynchronous updates to shared memory (without any safeguards)
  ● NO mutual exclusion / locking whatsoever → data races + race conditions
  ● NO theoretical guarantees on convergence
  ● Only constraint: sparsity (in time and/or space) to reduce the probability of races
● BUT: works very well in practice → fast, stable, ...
● Why does it work? → Robust nature of ML algorithms + cache hierarchy (see the sketch below)
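A minimal shared-memory sketch of the Hogwild scheme (illustrative Python threads writing to one weight vector without locks; the original is native multicore code, and Python's GIL hides most of the races this is meant to illustrate):

import threading
import numpy as np

def hogwild_sgd(X, y, w, grad_loss, lr=0.01, n_threads=4, steps_per_thread=1000):
    # All threads update the SAME weight vector w in place, with no locking at all.
    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_thread):
            i = rng.integers(len(X))
            g = grad_loss(X[i], y[i], w)   # read a possibly stale shared state
            w[:] -= lr * g                 # racy in-place update, no mutex

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w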
Distributed Optimization Algorithms for ML
Map-reduce scheme possible for basically all ML algorithms:
[1] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19:281, 2007.
BUT: Map-reduce only works for BATCH-solvers (see the sketch below)
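Why map-reduce fits BATCH solvers only, in a rough sketch (grad_loss and the data partitions are hypothetical placeholders): the average gradient is a sum over independent partitions, whereas SGD needs a state update after every single sample.

def map_partial_gradient(partition, w, grad_loss):
    # Map: each worker computes the gradient sum over its own data partition.
    X_part, y_part = partition
    g = sum(grad_loss(x_i, y_i, w) for x_i, y_i in zip(X_part, y_part))
    return g, len(X_part)

def reduce_average_gradient(partials):
    # Reduce: aggregate the partial sums into the global average gradient.
    total_g = sum(g for g, _ in partials)
    total_n = sum(n for _, n in partials)
    return total_g / total_n

# One BATCH update is one map-reduce round:
#   w = w - lr * reduce_average_gradient(
#           [map_partial_gradient(p, w, grad_loss) for p in partitions])
# SGD would need a new w after EACH sample, which this pattern cannot express.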
Distributed Optimization Algorithms for ML (cont.)
Parallel SGD on Distributed Systems
[M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.]
● Proof of convergence for distributed SGD with only one final Reduce step
● → Synchronization only at the very end
● Only condition: constant step size (see the sketch below)
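A sequential sketch of the Zinkevich et al. scheme (in the real setting each partition is handled by its own worker; grad_loss and the constant step size are placeholders):

import numpy as np

def parallel_sgd_one_reduce(partitions, w0, grad_loss, lr=0.01, epochs=1):
    # Every worker runs plain SGD on its own partition with a constant step size
    # and NO communication during training; the only synchronization is one
    # final Reduce that averages the resulting models.
    models = []
    for X_part, y_part in partitions:
        w = w0.copy()
        for _ in range(epochs):
            for x_i, y_i in zip(X_part, y_part):
                w = w - lr * grad_loss(x_i, y_i, w)
        models.append(w)
    return np.mean(models, axis=0)   # final Reduce: model averaging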
Distributed Asynchronous Stochastic Gradient Descent
ASGD has become a very hot topic – especially in the light of Deep Learning (DL)
● DL currently mostly on GPUs (Hogwild SGD + derivatives)
● Recent cluster based DL optimizations by Google, Microsoft, ...
  ● Hogwild scheme with parameter servers and message passing updates

Part II: Our ASGD approach (SC15 Paper)
● Asynchronous communication via RDMA
● Partitioned Global Address Space (PGAS) model
● Using the GASPI protocol, implemented by our GPI2.0
● Host-to-host communication – NO central parameter server
● Hogwild-like update scheme with several extensions (see the sketch below):
  ● Mini-BATCH updates
  ● Update states – not gradients
  ● Parzen-window function selecting external updates
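A rough sketch of one worker's loop in this scheme, assuming a hypothetical comm object that stands in for the one-sided GPI2.0/RDMA communication (write_state, read_external_states, converged), and a generic Gaussian-kernel acceptance rule as a stand-in for the Parzen-window selection; the blending rule and threshold are assumptions, not the paper's exact method:

import numpy as np

def asgd_worker(X, y, w, grad_loss, comm, lr=0.01, batch_size=100, sigma=1.0):
    rng = np.random.default_rng()
    while not comm.converged():
        # 1) Local mini-BATCH update (note: states, not gradients, get communicated).
        idx = rng.integers(len(X), size=batch_size)
        g = np.mean([grad_loss(X[i], y[i], w) for i in idx], axis=0)
        w = w - lr * g
        # 2) Asynchronously publish the current state into partners' memory segments.
        comm.write_state(w)
        # 3) Pull states pushed by other workers and keep only those accepted by a
        #    Parzen-window style filter (close to the local state), then blend them in.
        for w_ext in comm.read_external_states():
            if np.exp(-np.linalg.norm(w - w_ext) ** 2 / (2.0 * sigma ** 2)) > 0.5:
                w = 0.5 * (w + w_ext)
    return w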
Distributed parallel ASGD: Our Algorithm
Cluster Setup
● Nodes with several CPUs
● Each thread operates independently in parallel
ASGD: Parzen-Window Updates
ASGD: Evaluation
Experimental setup
● Simple K-Means algorithm (see the sketch below)
  ● Easy to implement → no hidden optimization possible
  ● Widely used
  ● Artificial data for the clustering problem is easy to produce and to control
● HPC cluster setup
  ● FDR InfiniBand interconnect
  ● 16 CPUs / node
  ● BeeGFS parallel file system
  ● GPI2.0 asynchronous RDMA communication
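K-Means fits this framework because its center update can be written as a mini-batch gradient-style step on the cluster centers (illustrative sketch, not the benchmark implementation; the learning rate is an assumption):

import numpy as np

def minibatch_kmeans_step(X_batch, centers, lr=0.05):
    # The centers play the role of the model state w; each center is pulled
    # towards the mean of the batch points currently assigned to it.
    dists = ((X_batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = np.argmin(dists, axis=1)
    for k in range(len(centers)):
        pts = X_batch[assign == k]
        if len(pts) > 0:
            centers[k] += lr * (pts.mean(axis=0) - centers[k])
    return centers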
Results: Strong Scaling
[Plot: K-Means, n=10, k=10, ~1TB training data]
Results: Convergence
[Plot: K-Means, n=10, k=100, ~1TB training data]
Results: Effect of Asynchronous Communication
Part III: Balancing Asynchronous Communication
● Moving to the Cloud and Big Data applications
  ● From HPC to HTC
● No InfiniBand hardware
  ● Bandwidth + latency issues
● New GPI2.0 release supporting GASPI over Ethernet
  ● → www.gpi-site.com
Communication Load Balancing Algorithm
● ASGD communication after each mini-batch update
  ● Mini-batch size b
  ● Communication frequency 1/b

Enforcing Message Sending: Bandwidth Exceeded
Load Balancing Results: fixed frequency
[Plots: 50B messages vs. 5kB messages]
Results: if we were waiting for messages to be sent ...
[Plots: 50B messages vs. 5kB messages; optimum: b=1000]
Communication Load Balancing Algorithm
● Theoretically, a higher frequency should be better
● But bandwidth + latency are much worse over Ethernet
● The optimal frequency for a thread
  ● is NOT fixed → it might change over time
  ● depends on the activity of other threads
  ● depends on the sender and receiver location in the network
    ● NUMA location
    ● Rack
    ● ...
● → Too many parameters for a closed-form approach
Communication Load Balancing Algorithm
● Black box approach → regulate the communication frequency by the gradient of the output buffer queue size
● Keep a small history for each thread
● GPI allows queue size polling
● Only send when the queue is not full (see the sketch below)
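A rough sketch of this black-box rule (the queue-size query stands in for GPI2.0's queue polling; the history length, the growth criterion and the "full" check are assumptions):

from collections import deque

class SendThrottle:
    # Per-thread decision whether to push an update after a mini-batch, based on
    # how the local send-queue fill level has evolved recently.
    def __init__(self, queue_capacity, history_len=8):
        self.capacity = queue_capacity
        self.history = deque(maxlen=history_len)

    def should_send(self, current_queue_size):
        self.history.append(current_queue_size)
        if current_queue_size >= self.capacity:     # queue full: never send
            return False
        if len(self.history) < 2:
            return True
        # Gradient of the queue size over the recent history: if the queue keeps
        # growing, updates are produced faster than the network drains them -> skip.
        growth = (self.history[-1] - self.history[0]) / (len(self.history) - 1)
        return growth <= 0

# Per worker: poll the send-queue size before each potential send and only push
# the current state when should_send(queue_size) returns True.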
Results
[Plot: 5kB messages]
Practical Implications
● Load balancing on the message sender side can mitigate bandwidth problems
  ● But there are limits
● Recall parallel SGD: actually NO communication is needed until the final reduce
  ● But our ASGD experiments show: any additional communication actually helps
  ● → A little bit of communication is better than no communication at all ...
● The effect strongly depends on the actual ML algorithm
  ● The ratio of computation time to communication size is the dominant factor
  ● Asynchronous scheme: hiding communication
  ● → Algorithms with expensive update computations are suitable for ASGD over Ethernet
What's Next?
● Deep Learning (DL)
  ● Tricky combination of very cheap updates and huge parameter spaces
  ● → Handle communication overhead more efficiently
  ● Not likely to work without InfiniBand
  ● → Caffe plugin for CPU based distributed ASGD coming soon ...
● DL is still dominated by GPUs
  ● Not enough memory
  ● So far only a few GPUs in parallel
● NEW: GPI2.0 implements PGAS with cross-host GPU2GPU RDMA
  ● → Current work on GPU based distributed ASGD
Summary
We presented a novel approach towards distributed ASGD for machine learning algorithms, which
● takes the asynchronous approach to the communication level (RDMA PGAS via GPI2.0)
● outperforms existing methods in both scalability and convergence speed
● can be used outside HPC environments (with some limitations)