Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
J. Keuper and F.-J. Pfreundt
Competence Center High Performance Computing
Fraunhofer ITWM, Kaiserslautern

Outline
Paper based on our SC15 contribution:
Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
Preprint: [arxiv.org/abs/1505.04956]

Part I: Review: inside distributed Machine Learning
Part II: Asynchronous Parallel Stochastic Gradient Descent
Part III: ASGD outside HPC environments

Part I: Large Scale Machine Learning

Definition of Large Scale Machine Learning [L. Bottou]
Small scale learning problem: the active budget constraint is the number of examples n.
Large scale learning problem: the active budget constraint is the computing time T.

Possible conclusion:

Large scale ML != scaling ML

Training Machine Learning Models

Formulation of the training problem:
● Set of observations (training samples) $X = \{x_1, \dots, x_n\}$
● [Supervised learning] Set of (semantic) labels $Y = \{y_1, \dots, y_n\}$
● Loss function to determine the quality of the learned model, writing $\ell(x_i, w)$ or $\ell_{x_i}(w)$ for the loss of a given sample $x_i$ and model state $w$
● Optimization problem, minimizing the loss: $\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, w)$
● Straightforward gradient descent optimization
● Pitfalls: sparse, high-dimensional target space → overfitting problem
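Written out in the notation above (reconstructed in standard form; the exact formula was lost in extraction), the plain gradient descent iteration referenced on the following slides is

$$ w_{t+1} = w_t - \epsilon_t \, \nabla_w \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, w_t) $$

with step size $\epsilon_t$; SGD replaces the full average by the gradient of a single randomly drawn sample.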

Optimization Algorithms for ML
Simple method: BATCH optimization
● Run over ALL samples
● Compute the average gradient of the loss function
● Make an update step in the gradient direction
● Computationally expensive
● Scales very poorly in the number of samples

Optimization Algorithms for ML: Stochastic Gradient Descent
● Online algorithm
● Randomized
● Update after EACH sample
● Much faster
● Better ML results
● But: intrinsically sequential! (toy comparison of both update rules below)
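To make the BATCH vs. SGD contrast concrete, a minimal sketch of both update rules on a toy least-squares loss (illustrative only; data, step sizes, and iteration counts are made up and not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                  # toy samples
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

    def grad(w, xi, yi):
        # gradient of the squared loss 0.5 * (xi.w - yi)^2 for one sample
        return (xi @ w - yi) * xi

    # BATCH optimization: average gradient over ALL samples per update step
    w = np.zeros(10)
    for step in range(100):
        g = np.mean([grad(w, xi, yi) for xi, yi in zip(X, y)], axis=0)
        w -= 0.1 * g

    # SGD: update after EACH randomly drawn sample
    w = np.zeros(10)
    for step in range(1000):
        i = rng.integers(len(X))
        w -= 0.01 * grad(w, X[i], y[i])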

Parallel Optimization Algorithms for ML
Hogwild! [B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.]
● Parallel SGD for shared memory systems
● Basic idea:
  ● Write asynchronous updates to shared memory (without any safeguards)
  ● NO mutual exclusion / locking whatsoever → data races + race conditions
  ● NO theoretical guarantees on convergence
  ● Only constraint: sparsity (in time and/or space) to reduce the probability of races
● BUT: works very well in practice → fast, stable, ...
● Why does it work? → Robust nature of ML algorithms + cache hierarchy (toy illustration below)
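A toy illustration of the Hogwild! access pattern: several threads update one shared weight vector without any locking, relying on sparse updates to keep collisions rare. In CPython the GIL serializes the actual execution, so this only mimics the pattern; the sparse regression problem is made up.

    import threading
    import numpy as np

    dim, n_threads, steps = 1000, 4, 10_000
    w = np.zeros(dim)                          # shared model state, no lock around it

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            # sparse sample: touches only a few coordinates, so races are unlikely
            idx = rng.choice(dim, size=5, replace=False)
            xi = rng.normal(size=5)
            g = (xi @ w[idx] - 0.0) * xi       # gradient on the touched coordinates only
            w[idx] -= 0.01 * g                 # unsynchronized write to shared memory

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()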

Distributed Optimization Algorithms for ML
Map-reduce scheme possible for basically all ML algorithms:
[1] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems, 19:281, 2007.
BUT: map-reduce only works for BATCH solvers.
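Why BATCH solvers fit map-reduce so naturally: per-partition gradient sums are the map phase, their summation is the reduce phase, followed by one global update step. A plain in-process sketch on a toy squared loss (no real map-reduce framework involved):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(2)
    X = rng.normal(size=(10_000, 10))
    y = X @ rng.normal(size=10)
    w = np.zeros(10)

    def map_partition(part):
        Xp, yp = part
        # partial gradient sum and sample count for one data partition
        return Xp.T @ (Xp @ w - yp), len(yp)

    def reduce_partials(a, b):
        return a[0] + b[0], a[1] + b[1]

    parts = [(X[i::4], y[i::4]) for i in range(4)]           # 4 "mappers"
    g_sum, n = reduce(reduce_partials, map(map_partition, parts))
    w -= 0.1 * g_sum / n                                     # one BATCH update step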

Distributed Optimization Algorithms for ML (cont.)
Parallel SGD on Distributed Systems [M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.]
● Proof of convergence for distributed SGD with only one final reduce step
● → Synchronization at the very end
● Only condition: constant step size
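That scheme in a few lines, simulated serially on toy data (not the authors' reference implementation): every worker runs plain SGD with a constant step size on its own data shard, and the resulting models are averaged in a single reduce at the very end.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(10_000, 10))
    y = X @ rng.normal(size=10)

    def sgd_on_shard(Xs, ys, steps=5_000, eps=0.01):
        w = np.zeros(Xs.shape[1])
        for _ in range(steps):
            i = rng.integers(len(Xs))
            w -= eps * (Xs[i] @ w - ys[i]) * Xs[i]   # constant step size, no communication
        return w

    shards = [(X[i::8], y[i::8]) for i in range(8)]   # 8 simulated workers
    models = [sgd_on_shard(Xs, ys) for Xs, ys in shards]
    w_final = np.mean(models, axis=0)                 # the one and only reduce step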

Distributed Asynchronous Stochastic Gradient Descent
● ASGD has become a very hot topic, especially in the light of Deep Learning (DL)
● DL currently runs mostly on GPUs (Hogwild SGD + derivatives)
● Recent cluster-based DL optimizations by Google, Microsoft, ...
  ● Hogwild scheme with parameter servers and message-passing updates

Part II: Our ASGD approach (SC15 paper)
● Asynchronous communication via RDMA
● Partitioned Global Address Space (PGAS) model
● Using the GASPI protocol, implemented by our GPI2.0
● Host-to-host communication - NO central parameter server
● Hogwild-like update scheme with several extensions (schematic sketch below):
  ● Mini-BATCH updates
  ● Update states, not gradients
  ● Parzen-window function selecting external updates
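A rough, purely schematic sketch of one worker in the update scheme listed above. The helpers send_state and poll_external_states only stand in for the one-sided GPI2.0 RDMA writes and reads, accept_external is the Parzen-window filter sketched further below, and the simple averaging merge is an illustrative choice, not necessarily the paper's exact rule.

    import numpy as np

    def asgd_worker(batches, w, grad_fn, send_state, poll_external_states,
                    accept_external, eps=0.01):
        """One worker: mini-batch updates on the local STATE plus asynchronous state exchange."""
        for batch in batches:                        # mini-batches of size b
            g = np.mean([grad_fn(w, x) for x in batch], axis=0)
            w = w - eps * g                          # local mini-BATCH update
            send_state(w)                            # push the state (not the gradient) to partner nodes
            for w_ext in poll_external_states():     # non-blocking: whatever states arrived so far
                if accept_external(w, w_ext):        # Parzen-window filter on external updates
                    w = 0.5 * (w + w_ext)            # illustrative merge by averaging
        return w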

Distributed parallel ASGD: Our Algorithm





Cluster Setup
● Nodes with several CPUs
● Each thread operates independently in parallel

ASGD: Parzen-Window Updates
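The figure for this slide did not survive extraction. As a stand-in, one plausible reading of the Parzen-window filter: an external state is merged only if it lies within a kernel window around the local state. The Gaussian kernel and the bandwidth/threshold values below are illustrative assumptions, not the paper's exact function.

    import numpy as np

    def parzen_window_accept(w_local, w_external, bandwidth=1.0, threshold=0.5):
        """Accept an external model state only if it is close enough to the local one."""
        d2 = np.sum((w_local - w_external) ** 2)
        k = np.exp(-d2 / (2.0 * bandwidth ** 2))    # Gaussian Parzen-window kernel
        return k >= threshold                       # filter out far-away (stale/diverging) states

    # states that drifted far apart are rejected:
    print(parzen_window_accept(np.zeros(4), 0.1 * np.ones(4)))   # True
    print(parzen_window_accept(np.zeros(4), 5.0 * np.ones(4)))   # False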

ASGD: Evaluation
Experimental setup
● Simple K-Means algorithm (generic mini-batch update sketched below)
  ● Easy to implement → no hidden optimization possible
  ● Widely used
  ● Artificial data for the cluster problem is easy to produce and to control
● HPC cluster setup
  ● FDR InfiniBand interconnect
  ● 16 CPUs / node
  ● BeeGFS parallel file system
  ● GPI2.0 asynchronous RDMA communication
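For reference, a generic mini-batch K-Means step of the kind such an SGD-style optimizer performs, with per-centroid step sizes in the style of mini-batch K-Means; this is an illustrative kernel, not the exact code benchmarked in the paper.

    import numpy as np

    def minibatch_kmeans_step(centroids, counts, batch):
        """One mini-batch update of the K-Means model state (the centroids)."""
        # assign every sample in the batch to its nearest centroid
        d = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        for x, c in zip(batch, assign):
            counts[c] += 1
            eta = 1.0 / counts[c]                   # per-centroid step size
            centroids[c] = (1.0 - eta) * centroids[c] + eta * x
        return centroids, counts

    # toy usage: k=10 centroids in 10 dimensions, one mini-batch of 100 samples
    rng = np.random.default_rng(4)
    C, cnt = rng.normal(size=(10, 10)), np.zeros(10)
    C, cnt = minibatch_kmeans_step(C, cnt, rng.normal(size=(100, 10)))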

Results: Strong Scaling

K-Means: n=10, k=10, ~1 TB training data

Results: Convergence

K-Means: n=10, k=100, ~1 TB training data

Results: Effect of Asynchronous Communication

Part III: Balancing Asynchronous Communication
● Moving to the Cloud and Big Data applications
  ● From HPC to HTC
● No InfiniBand hardware
  ● Bandwidth + latency issues
● New GPI2.0 release supporting GASPI over Ethernet
  ● → www.gpi-site.com

Communication Load Balancing Algorithm
● ASGD communication after each mini-batch update
  ● Mini-batch size b
  ● Communication frequency 1/b (rough load estimate below)
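A back-of-the-envelope estimate of how the mini-batch size b controls the communication load; the model size and sample rate are made-up placeholders, not measurements from the paper.

    def comm_load_bytes_per_sec(model_bytes, samples_per_sec, b):
        """Bytes per second one worker tries to push when it sends its state every b samples."""
        return (samples_per_sec / b) * model_bytes   # communication frequency 1/b

    # hypothetical worker: 5 kB model state, 100k processed samples per second
    for b in (10, 100, 1000):
        print(b, comm_load_bytes_per_sec(5_000, 100_000, b) / 1e6, "MB/s")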

Enforcing Message Sending: Bandwidth Exceeded

Load Balancing Results: fixed frequency
[Plots: 50 B messages vs. 5 kB messages]

Results: if we were waiting for messages to be sent ...
[Plots: 50 B messages vs. 5 kB messages; optimum: b=1000]

Communication Load Balancing Algorithm
● Theoretically, a higher frequency should be better
● But bandwidth + latency are much worse over Ethernet
● The optimal frequency for a thread
  ● Is NOT fixed! → might change over time
  ● Depends on the activity of other threads
  ● Depends on the sender and receiver location in the network
    ● NUMA location
    ● Rack
    ● ...
● Too many parameters for a closed-form approach

Communication Load Balancing Algorithm (cont.)
● Black-box approach → regularize the frequency via the gradient of the output buffer queue size (sketch below)
● Keep a small history for each thread
● GPI allows queue-size polling
● Only send when the queue is not full
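A minimal sketch of the black-box rule above: keep a short history of the send-queue fill level and send only when the queue is not full and its fill level is not trending upward. The poll_queue_size call in the usage comment stands in for GPI2.0's queue polling and is hypothetical here.

    from collections import deque

    class SendThrottle:
        """Per-thread black-box regulator for the communication frequency."""

        def __init__(self, queue_capacity, history=8):
            self.capacity = queue_capacity
            self.hist = deque(maxlen=history)       # small history of queue fill levels

        def should_send(self, queue_size):
            self.hist.append(queue_size)
            if queue_size >= self.capacity:         # only send when the queue is not full
                return False
            if len(self.hist) >= 2 and self.hist[-1] - self.hist[0] > 0:
                return False                        # queue filling up (positive gradient) -> back off
            return True

    # usage inside the worker loop (poll_queue_size is a placeholder):
    # if throttle.should_send(poll_queue_size()):
    #     send_state(w)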

Results: 5 kB messages

Practical Implications
● Load balancing on the message sender side can soften bandwidth problems
  ● But there are limits
● Recall parallel SGD: actually NO communication is needed until the final reduce
  ● But our ASGD experiments show: any additional communication actually helps
  ● → a little bit of communication is better than no communication at all ...
● The effect strongly depends on the actual ML algorithm
  ● The ratio of computation time to communication size is the dominant factor
  ● Asynchronous scheme: hiding communication
  ● → Algorithms with expensive update computations are suitable for ASGD over Ethernet

What's Next?
● Deep Learning (DL)
  ● Tricky combination of very cheap updates and huge parameter spaces
  ● → Handle communication overhead more efficiently
  ● Not likely to work without InfiniBand
  ● → Caffe plugin for CPU-based distributed ASGD coming soon ...
● DL is still dominated by GPUs
  ● Not enough memory
  ● So far only a few GPUs in parallel
  ● NEW: GPI2.0 implements PGAS with cross-host GPU2GPU RDMA
  ● → Current work on GPU-based distributed ASGD

Summary

We presented a novel approach towards distributed ASGD for machine learning algorithms, which
● Takes the asynchronous approach to the communication level (RDMA PGAS via GPI2.0)
● Outperforms existing methods in both scalability and convergence speed
● Can be used outside HPC environments (with some limitations)