High Performance Deep Learning with Apache Spark

Rui Liu and Yuduo Wu, NovuMind Inc.

Background
• We build a high-performance deep learning platform
  – Co-designed hardware and software
• We are working on connecting our deep learning platform with data pipelines

Different Characteristics
• Deep learning is a computation- and communication-intensive process
  – High utilization of GPUs
  – Low-latency synchronization
  – Hardware acceleration, e.g., GPUDirect RDMA, NUMA, InfiniBand
  – A single instance per machine

• Spark data pipelines are optimized for
  – Data locality
  – Minimizing data I/O and shuffling
  – Multiple tasks per machine

Different Hardware
• Customer data centers contain different types of machines
• HPC cluster for deep learning
  – GPUs, InfiniBand, etc.
• Data processing machines
  – No GPUs
  – Ethernet

Goals
• Connect deep learning systems with data pipelines
• Without sacrificing deep learning performance

[Diagram: data pipelines run on the data processing cluster; deep learning services run on the HPC cluster.]

Data Augmentation Off-loading
• Pre-processing for deep learning often needs to be off-loaded from the training services (a sketch follows)
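A minimal sketch of what the off-loaded augmentation could look like as a Spark job, assuming the training set is stored as (label, JPEG bytes) pairs in a SequenceFile; the path, target image size, and augmentations are illustrative, not NovuForce's actual pipeline.

```python
# Hypothetical sketch: off-loading image augmentation to a Spark job on the
# data processing cluster. Paths and column layout are illustrative.
import io
import random

from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("augmentation-offload").getOrCreate()

def augment(record):
    """Decode, randomly flip/crop an image, and re-encode it."""
    label, raw = record
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)               # random horizontal flip
    w, h = img.size
    dx, dy = random.randint(0, w // 8), random.randint(0, h // 8)
    img = img.crop((dx, dy, w - dx, h - dy)).resize((224, 224))  # random crop + resize
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return label, buf.getvalue()

# Each epoch, (label, image_bytes) pairs are augmented on the data processing
# cluster before being handed to the training services.
records = spark.sparkContext.sequenceFile("hdfs:///datasets/imagenet/train")
augmented = records.map(augment)
```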

Data Shuffling Off-loading
• Training needs a data shuffle between training epochs
• The training cluster's network is already saturated by parameter synchronization
• Data shuffling can therefore be off-loaded from the training services (a sketch follows)
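A sketch of the off-loaded per-epoch shuffle under the same assumptions; `num_workers`, the paths, and the hand-off to the training services (here a staged SequenceFile) are illustrative.

```python
# Hypothetical sketch: the per-epoch shuffle runs as a Spark job on the data
# processing cluster, keeping the HPC network free for gradient traffic.
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-offload").getOrCreate()
records = spark.sparkContext.sequenceFile("hdfs:///datasets/imagenet/train")

num_workers = 64  # one output partition per training service instance (assumed)

for epoch in range(90):
    (records
     .map(lambda kv: (random.random(), kv))   # fresh random sort key each epoch
     .sortByKey()                             # global shuffle over Ethernet
     .values()
     .repartition(num_workers)
     .saveAsSequenceFile(f"hdfs:///staging/epoch_{epoch}"))  # staged for the trainers
```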

Our Solution – NovuForce
• A fully connected pipeline leveraging the advantages of both worlds
  – Spark + high performance computing
• Spark for data ingestion, pre-processing, and shuffling
• Deep learning (training/inference) as a service
  – Optimized for high performance computing
• Separate schedulers for the data pipeline and for deep learning
  – Hardware-aware schedulers
• Zero-copy data sharing

Training Flow

[Diagram: cameras/sensors feed Spark data pipelines on the data processing cluster, which in turn feed the training services and model server on the HPC cluster for deep learning (NovuForce). A Spark scheduler drives the data processing pipelines and a deep learning scheduler drives the deep learning services, all on Apache Mesos managed resources; the legend distinguishes data flow from scheduling.]

Interactive Usage
• WebUI for the deep learning services
• Web notebooks via Apache Zeppelin

[Diagram: the WebUI and Zeppelin notebooks sit on top of NovuForce's data processing pipelines and deep learning services.]

Zero-Copy Data Sharing
• The last stage of the Spark job is scheduled onto the HPC cluster
• A circular buffer in shared memory connects the two sides
• Labels and images are exchanged in the Apache Arrow format (a sketch follows)

[Diagram: a data processing task writes (label, image) records in Arrow format into a shared-memory circular buffer, which a training service instance reads directly.]
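A minimal sketch of the Arrow hand-off, assuming POSIX shared memory and the Arrow IPC stream format; the segment name `novu_ring_0` and the two-column schema are illustrative, and a real circular buffer would manage many such slots plus read/write cursors.

```python
# Hypothetical sketch of the zero-copy hand-off: a Spark task serializes a
# (label, image) batch with Apache Arrow into shared memory, and a training
# service instance on the same machine maps it without copying the bytes.
from multiprocessing import shared_memory

import pyarrow as pa

# --- producer: last-stage Spark task on the HPC machine ---
batch = pa.record_batch(
    [pa.array([3, 7], type=pa.int32()),                             # labels
     pa.array([b"<jpeg bytes>", b"<jpeg bytes>"], type=pa.binary())],  # images
    names=["label", "image"])

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()

shm = shared_memory.SharedMemory(name="novu_ring_0", create=True, size=payload.size)
shm.buf[:payload.size] = payload.to_pybytes()

# --- consumer: training service instance on the same machine ---
view = shared_memory.SharedMemory(name="novu_ring_0")
reader = pa.ipc.open_stream(pa.py_buffer(view.buf))  # maps the bytes, no copy
for rb in reader:
    labels, images = rb.column(0), rb.column(1)      # Arrow arrays over shared memory
# (cleanup omitted: close()/unlink() the segment when the ring slot is recycled)
```

Because Arrow's layout is identical in every process, the consumer reads the labels and images in place; only pointers move, not bytes.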

Deep Learning Services

[Architecture diagram: C++, Python, and Java clients, NiFi, and the WebUI reach the frontend and model server through REST APIs on the master machine, which also hosts the NovuForce framework (DSGD runtime, GPU/hardware-aware scheduler), the Mesos master, and a Docker registry. Each worker machine runs a Mesos agent whose executor hosts the DSGD runtime in a Docker container and reads Arrow data; configuration management is done with Ansible modules. A client sketch follows.]
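The deck only states that clients reach the model server through REST APIs; purely as an assumed illustration, a request might look like the following (the endpoint, route, and JSON fields are all hypothetical).

```python
# Hypothetical sketch of a client talking to the model server's REST APIs.
import base64

import requests

MASTER = "http://master.example.com:8080"   # assumed address

with open("xray.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(f"{MASTER}/v1/models/resnet50:predict",   # assumed route
                     json={"instances": [{"image": image_b64}]},
                     timeout=30)
resp.raise_for_status()
print(resp.json())  # e.g. predicted labels and scores
```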

Hardware-aware Scheduler
• Spark data pipelines
  – Scheduled onto the data processing cluster
  – Training-stage tasks are collocated with the deep learning services on the HPC cluster
• Deep learning services
  – Scheduled with NUMA zone binding (a sketch follows)
  – Communication paths are optimized
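A sketch of what NUMA zone binding could look like at launch time, assuming a `numactl` wrapper and an illustrative GPU-to-node mapping; none of this is NovuForce's actual scheduler code.

```python
# Hypothetical sketch of NUMA zone binding: wrap each training service
# instance in numactl so it only uses the CPU cores and memory of the NUMA
# node closest to its assigned GPU.
import os
import subprocess

# Assumed topology: GPUs 0-3 hang off NUMA node 0, GPUs 4-7 off node 1.
GPU_TO_NUMA = {g: 0 if g < 4 else 1 for g in range(8)}

def launch_trainer(gpu_id: int, cmd: list[str]) -> subprocess.Popen:
    node = GPU_TO_NUMA[gpu_id]
    wrapped = ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + cmd
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    return subprocess.Popen(wrapped, env=env)

# e.g. launch_trainer(5, ["python", "train.py"]) pins the process to node 1.
```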

Inference Flow
• Inference inside Spark pipelines
  – e.g., via DeepImagePredictor (see the sketch below)
• Inference as a service

[Diagram: the data processing pipelines call NovuForce's deep learning services and model server for predictions.]
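For the in-pipeline route, the sketch below uses DeepImagePredictor from Databricks' spark-deep-learning (sparkdl) package, which the slide names; the input path is illustrative and a Spark 2.x-era environment is assumed.

```python
# Sketch of in-pipeline inference with sparkdl's DeepImagePredictor.
from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

images_df = ImageSchema.readImages("hdfs:///datasets/samples/")  # assumed path

predictor = DeepImagePredictor(inputCol="image",
                               outputCol="predicted_labels",
                               modelName="InceptionV3",   # bundled pretrained model
                               decodePredictions=True,
                               topK=5)
predictions = predictor.transform(images_df)
predictions.select("image.origin", "predicted_labels").show(truncate=False)
```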

High Performance Deep Learning Training Services


High Performance Trainer - Computation
• Optimized for data throughput (samples/sec)
• Reliable data input pipelines
• Efficient data shuffling and augmentation
• Layer/kernel fusion
• Half-precision (FP16) support (a sketch follows)
  – Tensor computations and communications in FP16
  – Half the memory consumption
  – Leverages the latest hardware, e.g., Volta's Tensor Cores

Source: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
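A toy sketch of the FP16 pattern implied above: compute and communicate in half precision while keeping an FP32 master copy of the weights, with loss scaling to protect small gradients. The model, shapes, and scale value are stand-ins, not NovuForce internals.

```python
# Hypothetical sketch of FP16 training with an FP32 master weight copy and
# static loss scaling ("half the memory, Tensor Core speed" pattern).
import numpy as np

rng = np.random.default_rng(0)
master_w = rng.standard_normal((256, 256)).astype(np.float32)  # FP32 master weights
loss_scale = 1024.0  # keeps small FP16 gradients from flushing to zero

def forward_backward(w16: np.ndarray) -> np.ndarray:
    """Toy forward/backward pass entirely in FP16 (stand-in for the GPU math)."""
    x = rng.standard_normal((32, 256)).astype(np.float16)
    # gradient of (loss_scale * loss), so intermediates stay representable in FP16
    return (x.T @ (x @ w16)) * np.float16(loss_scale / 32)

for step in range(10):
    w16 = master_w.astype(np.float16)                               # cast down for compute/comms
    grad32 = forward_backward(w16).astype(np.float32) / loss_scale  # unscale in FP32
    master_w -= 1e-3 * grad32                                       # update the FP32 master copy
```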


High Performance Trainer - Communication
• Designed for distributed clusters
• Data-parallel distributed synchronized SGD
  – Ring-based All-Reduce algorithm using NCCL* (a sketch follows)
    • Gradients are chunked into buckets to overlap computation and communication
  – Recursive doubling/halving^ All-Reduce algorithm
    • Multi-level, using NCCL and CUDA-aware MPI
  – InfiniBand and GPUDirect RDMA

[Diagram: each of N workers loads its share of the batch (1/N ... N/N), runs the forward/backward computation, then all workers exchange updates in an All-Reduce communication phase before the next batch.]

* https://developer.nvidia.com/nccl
^ Thakur et al., Optimization of Collective Communication Operations in MPICH, IJHPCA, 2005
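To make the bucketing concrete, here is a pure-Python simulation of the ring All-Reduce (reduce-scatter, then all-gather); real systems run this with NCCL on GPU buffers, and this toy only shows the data movement.

```python
# Simulated ring All-Reduce over N workers: every worker ends with the
# element-wise sum of all gradients, moving one bucket per step.
import numpy as np

def ring_all_reduce(grads: list[np.ndarray]) -> list[np.ndarray]:
    n = len(grads)
    chunks = [np.array_split(g.copy(), n) for g in grads]  # one bucket per peer

    # Reduce-scatter: after n-1 steps, worker r owns the fully summed bucket (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:            # each worker passes one bucket right
            chunks[(r + 1) % n][idx] += data  # and the receiver accumulates it

    # All-gather: circulate the summed buckets until every worker has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

workers = [np.ones(10) * (i + 1) for i in range(4)]  # 4 workers' gradients
print(ring_all_reduce(workers)[0])                   # every entry is 1+2+3+4 = 10
```

Each step moves a single bucket per worker, which is exactly why an implementation can overlap the transfer of one bucket with the reduction (or backward computation) of the next.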


High Performance Trainer - Communication
• Designed for distributed clusters
• Hardware-aware scheduler using Apache Mesos
  – Dynamically chooses an All-Reduce algorithm (a sketch follows)
  – CPU/GPU affinities, NUMA binding
• Everything runs inside containers
  – Easy deployment and task scheduling
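The deck does not spell out the selection policy; purely as an assumed illustration, the dynamic choice could be a placement-driven heuristic like the following (the rules are hypothetical, not NovuForce's).

```python
# Hypothetical sketch of "dynamically choose the All-Reduce algorithm":
# pick an algorithm from the job's placement and interconnect.
def choose_all_reduce(num_nodes: int, has_infiniband: bool) -> str:
    # Assumed heuristic: a single node stays on NCCL's ring; across nodes
    # with InfiniBand, use the multi-level NCCL + CUDA-aware MPI
    # recursive doubling/halving path from the previous slide.
    if num_nodes == 1:
        return "nccl-ring"
    if has_infiniband:
        return "multilevel-nccl-mpi-doubling-halving"
    return "nccl-ring"

print(choose_all_reduce(num_nodes=8, has_infiniband=True))
```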

Running in Production - for Medical Images
• Recently deployed in western China
  – 8 nodes, 64 GPUs in total
  – 8 x V100-PCIE-16G GPUs per node
  – 2 x InfiniBand EDR
  – Easy deployment with Ansible

High Performance Trainer - Benchmark

[Benchmark chart: VGG-16/ResNet-50/Inception-V3 throughput on real ImageNet data using up to 64 NVIDIA V100-PCIE-16G GPUs, batch size 128 per worker, step size = 500, averaged over 10 runs. Tested with OpenMPI 3.0.1, NCCL 2.1.15, CUDA 9.1, and cuDNN 7.1.3.]

High Performance Trainer - Convergence
• What matters is the overall time to reach convergence
• Large mini-batches are key when going distributed
• Leverage recent research* on large-batch training (a sketch of the schedule follows)
  – Linear learning-rate scaling
  – Gradual learning-rate warmup
  – Aggressive learning-rate scheduling
• Distributed large-mini-batch (8k) ResNet-50 on ImageNet
  – Top-1 75.8% in 52 minutes using only 64 GPUs
  – Top-1 74.1% in 50 minutes with mixed precision using 32 GPUs

* Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
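A sketch of the cited recipe (Goyal et al. 2017) as a single schedule function; the base values are that paper's ImageNet/ResNet-50 conventions, not necessarily the exact settings behind the numbers above.

```python
# Linear scaling + gradual warmup + step decay for large-batch SGD,
# after "Accurate, Large Minibatch SGD" (values assumed, ResNet-50 style).
def learning_rate(epoch: float, batch_size: int,
                  base_lr: float = 0.1, base_batch: int = 256,
                  warmup_epochs: int = 5) -> float:
    peak = base_lr * batch_size / base_batch        # linear scaling rule
    if epoch < warmup_epochs:                       # gradual warmup from base_lr to peak
        return base_lr + (peak - base_lr) * epoch / warmup_epochs
    # step schedule: divide by 10 at epochs 30, 60, 80 (the usual recipe)
    decay = 10 ** -sum(epoch >= m for m in (30, 60, 80))
    return peak * decay

# e.g. an 8k global batch (64 GPUs x 128) peaks at 0.1 * 8192/256 = 3.2
print(learning_rate(5, 8192))   # 3.2 right after warmup
print(learning_rate(35, 8192))  # 0.32 after the first decay
```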

Summary
• A connected deep learning workflow optimized for both data processing and high performance deep learning
• Data augmentation and shuffling off-loaded to Spark
• Zero-copy data sharing
• Hardware-aware schedulers
• Fast and accurate distributed HPC training services

Thank You!
Rui Liu ([email protected])
Yuduo Wu ([email protected])
NovuMind Inc.