Distributed Deep Learning on Mesos with GPUs and Gang Scheduling
Min Cai, Alex Sergeev, Paul Mikesell, Anne Holler, UBER
Who are we?
Min Cai
Alex Sergeev
Paul Mikesell
Anne Holler
Deep Learning @ UBER
• Use cases:
  – Self-Driving Vehicles
  – Trip Forecasting
  – Fraud Detection
Self-Driving Vehicles
Trip Forecasting
Fraud Detection
• Spam referral codes
• Partner up with the same driver
• Cash out Uber credits
Why Distributed Deep Learning?
• Speed up model training
• Scale out to hundreds of GPUs
• Shard large models that cannot fit on a single machine
How Distributed Deep Learning Works
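The slide's diagram is not reproduced here. Conceptually, data-parallel training runs one model replica per GPU, feeds each replica its own shard of every mini-batch, and averages the gradients across replicas before applying the same update everywhere. A minimal self-contained sketch of that loop (the `Worker` class and linear model are illustrative, not part of any real framework; a real system would do the averaging with a parameter server or ring-allreduce):

```python
import numpy as np

class Worker:
    """One model replica; sees only its shard of each mini-batch."""
    def __init__(self, weights):
        self.weights = weights.copy()

    def gradients(self, x, y):
        # Gradient of a least-squares loss for a linear model (illustrative).
        pred = x @ self.weights
        return 2 * x.T @ (pred - y) / len(y)

def train_step(workers, shards, lr=0.01):
    # Each worker computes gradients on its own shard (in parallel on GPUs).
    grads = [w.gradients(x, y) for w, (x, y) in zip(workers, shards)]
    # "Allreduce": average gradients across workers, then apply the
    # identical update on every replica so weights stay in sync.
    avg = np.mean(grads, axis=0)
    for w in workers:
        w.weights -= lr * avg

# Usage: four "GPUs", each training on a quarter of every batch.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
workers = [Worker(np.zeros(2)) for _ in range(4)]
for _ in range(200):
    x = rng.normal(size=(64, 2))
    y = x @ true_w
    shards = [(x[i::4], y[i::4]) for i in range(4)]
    train_step(workers, shards)
```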
Why Mesos?
• Widely adopted
• GPU Support
• Nested Containers
• Highly Customizable
• Reliable and Scalable
Mesos Support for GPUs
• Mesos Containerizer only
• Docker Containerizer support has not landed upstream yet (opt-in sketch below)
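Mesos only offers GPUs to frameworks that explicitly opt in. A hedged sketch of that opt-in via the v1 scheduler HTTP API: the `SUBSCRIBE` call, the `/api/v1/scheduler` endpoint, and the `GPU_RESOURCES` capability are standard Mesos, but the master address and framework name below are placeholders:

```python
import requests

MASTER = "http://mesos-master.example.com:5050"  # placeholder address

# A framework must advertise the GPU_RESOURCES capability when it
# subscribes; otherwise the master withholds offers containing GPUs.
subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {
        "framework_info": {
            "user": "deeplearning",
            "name": "dl-trainer",  # placeholder framework name
            "capabilities": [{"type": "GPU_RESOURCES"}],
        }
    },
}

resp = requests.post(
    f"{MASTER}/api/v1/scheduler",
    json=subscribe,
    stream=True,  # the master replies with a long-lived event stream
)
```

On the agent side, GPUs are exposed through the Mesos Containerizer with the `cgroups/devices,gpu/nvidia` isolators enabled.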
Mesos Nested Containers
• Separate management code from user Docker images
• Avoid dependency conflicts
What is Missing?
• Elastic GPU Resource Management
• Locality- and Network-aware Placement
• Gang Scheduling
• Task Discovery
• Failure Handling
Peloton Overview
Peloton Architecture
Elastic GPU Resource Management
Resource Pools
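Peloton organizes cluster GPUs into hierarchical resource pools. The slides don't spell out the exact sharing semantics, so the sketch below assumes a common elastic scheme (not quoted from Peloton): every pool is guaranteed its reservation, and idle capacity is lent out in proportion to shares, capped by each pool's limit:

```python
def elastic_entitlement(pools, capacity):
    """Split `capacity` GPUs among pools.

    Each pool dict has: reservation (guaranteed), limit (hard cap),
    share (weight for spare capacity), demand (what it wants right now).
    """
    # Guaranteed slice: the reservation, but never more than current demand.
    ent = {n: min(p["reservation"], p["demand"]) for n, p in pools.items()}
    spare = capacity - sum(ent.values())
    # Spare GPUs go to pools with unmet demand, weighted by share and
    # capped by each pool's limit. (One pass; real schedulers iterate.)
    hungry = {n: p for n, p in pools.items()
              if ent[n] < min(p["limit"], p["demand"])}
    total = sum(p["share"] for p in hungry.values()) or 1
    for n, p in hungry.items():
        extra = int(spare * p["share"] / total)
        ent[n] = min(ent[n] + extra, p["limit"], p["demand"])
    return ent

# Example: 16 GPUs; "research" is idle, so "training" borrows its slack.
pools = {
    "training": {"reservation": 8, "limit": 16, "share": 2, "demand": 14},
    "research": {"reservation": 8, "limit": 16, "share": 1, "demand": 2},
}
print(elastic_entitlement(pools, capacity=16))
# {'training': 14, 'research': 2}
```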
Gang Scheduling
• A subset of Tasks in a Job can be specified for Gang Scheduling
• Gang tasks are a single scheduling unit
  – Admitted, placed, preempted, and killed as a group
• Gang tasks are independent execution units
  – Run in separate containers and may fail independently
• Gang execution is terminated if a gang task fails and cannot be restarted (see the admission sketch below)
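A minimal sketch of the all-or-nothing admission this implies. The host map and task representation here are illustrative stand-ins, not Peloton's actual types:

```python
def admit_gang(gang_tasks, hosts):
    """Place every task in the gang, or none of them.

    gang_tasks: list of GPU counts, one per task.
    hosts: dict host -> free GPUs. Mutated only if the whole gang fits.
    """
    tentative = dict(hosts)  # simulate placement on a copy
    placement = {}
    for i, gpus in enumerate(gang_tasks):
        host = next((h for h, free in tentative.items() if free >= gpus), None)
        if host is None:
            return None  # one task doesn't fit -> reject the whole gang
        tentative[host] -= gpus
        placement[i] = host
    hosts.update(tentative)  # commit: the gang is admitted as a unit
    return placement

# Usage: a 3-task gang needing 4 GPUs each on two 8-GPU hosts.
hosts = {"host-a": 8, "host-b": 8}
print(admit_gang([4, 4, 4], hosts))  # {0: 'host-a', 1: 'host-a', 2: 'host-b'}
print(admit_gang([4, 4, 4], hosts))  # None -- only 4 GPUs left in total
```

Treating the gang as one unit avoids the deadlock where two half-placed training jobs each hold GPUs the other needs.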
Placement Strategies
• Place as many containers as possible on the same host or rack
• Best-fit algorithm to tightly pack GPU containers (sketched below)
• Constraint-based placement for the same generation of GPUs
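A hedged sketch of the best-fit idea: among hosts that can hold the container, pick the one that would have the fewest GPUs left over, so partially used hosts fill up before fresh ones are broken open. Names are illustrative:

```python
def best_fit_host(hosts, gpus_needed):
    """hosts: dict host -> free GPUs. Returns the feasible host whose
    remaining free GPUs after placement would be smallest (tightest fit)."""
    feasible = {h: free - gpus_needed
                for h, free in hosts.items() if free >= gpus_needed}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)

hosts = {"host-a": 8, "host-b": 3, "host-c": 5}
# A 2-GPU container lands on host-b (1 GPU left over), not the empty
# 8-GPU host, keeping host-a whole for a future 8-GPU gang.
print(best_fit_host(hosts, 2))  # host-b
```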
Why TensorFlow?
• Most popular open-source framework for deep learning
• Combines high performance with the ability to tinker with low-level model details
• Has end-to-end support from research to production
Architecture for Distributed TensorFlow
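The classic distributed TensorFlow (pre-2.0) architecture splits processes into parameter servers and workers; every process is given the full cluster layout plus its own role. A sketch with placeholder host names:

```python
import tensorflow as tf  # TensorFlow 1.x API, current at the time of this talk

# Every process receives the same cluster map (placeholder addresses).
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],
})

# Each process starts a server announcing its own role in that map,
# e.g. worker #1:
server = tf.train.Server(cluster, job_name="worker", task_index=1)

# Workers pin variables to the parameter servers and compute to themselves.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:1", cluster=cluster)):
    loss = ...  # build the model here
```

Wiring up these host:port lists by hand is exactly the task-discovery gap noted earlier, which the Mesos/Peloton integration fills.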
Architecture for Distributed TensorFlow on Mesos
Distributed Training Performance

Training with synthetic data, using the standard Horovod training loop:

```python
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
```
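For context, the setup that precedes this loop in the standard Horovod TensorFlow example (per the Horovod README) initializes MPI-style ranks, pins one GPU per process, and wraps the optimizer so gradients are allreduce-averaged; the checkpoint directory is a placeholder:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, typically launched via mpirun

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = ...  # build the model here
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())  # scale LR with workers

# Wrap the optimizer: gradients are averaged across ranks before each update.
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all replicas start identical;
# only rank 0 writes checkpoints.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
checkpoint_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
```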
Giving Back
Horovod is available on GitHub today https://github.com/uber/horovod