High Performance Deep Learning with Apache Spark

Rui Liu and Yuduo Wu, NovuMind Inc.

Background
• We build a high-performance deep learning platform
  – Co-designed hardware and software
• We are working on connecting our deep learning platform with data pipelines

Different Characteristics
• Deep learning is a computation- and communication-intensive process
  – High utilization of GPUs
  – Low-latency synchronization
  – Hardware acceleration, e.g., GPUDirect RDMA, NUMA, InfiniBand
  – A single instance per machine

• Spark data pipelines are optimized for
  – Data locality
  – Minimizing data I/O and shuffling
  – Multiple tasks per machine

Different Hardware
• Customer data centers contain different types of machines
• HPC cluster for deep learning
  – GPUs, InfiniBand, etc.
• Data processing machines
  – No GPUs
  – Ethernet

Goals
• Connect deep learning systems with data pipelines
• Without sacrificing deep learning performance

[Diagram: data pipelines run on the data processing cluster; deep learning services run on the HPC cluster.]

Data Augmentation Off-loading
• Pre-processing for deep learning often needs to be off-loaded from the training services (a sketch follows)
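A minimal sketch of what the off-loaded augmentation could look like as a Spark job, assuming the training set is stored as (label, JPEG bytes) pairs in a SequenceFile; the path, target image size, and augmentations are illustrative, not NovuForce's actual pipeline.

```python
# Hypothetical sketch: off-loading image augmentation to a Spark job on the
# data processing cluster. Paths and column layout are illustrative.
import io
import random

from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("augmentation-offload").getOrCreate()

def augment(record):
    """Decode, randomly flip/crop an image, and re-encode it."""
    label, raw = record
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)               # random horizontal flip
    w, h = img.size
    dx, dy = random.randint(0, w // 8), random.randint(0, h // 8)
    img = img.crop((dx, dy, w - dx, h - dy)).resize((224, 224))  # random crop + resize
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return label, buf.getvalue()

# Each epoch, (label, image_bytes) pairs are augmented on the data processing
# cluster before being handed to the training services.
records = spark.sparkContext.sequenceFile("hdfs:///datasets/imagenet/train")
augmented = records.map(augment)
```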

Data Shuffling Off-loading
• Training needs a data shuffle between training epochs
• The training cluster's network is already saturated by parameter synchronization
• Data shuffling can therefore be off-loaded from the training services (a sketch follows)
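A sketch of the off-loaded per-epoch shuffle under the same assumptions; `num_workers`, the paths, and the hand-off to the training services (here a staged SequenceFile) are illustrative.

```python
# Hypothetical sketch: the per-epoch shuffle runs as a Spark job on the data
# processing cluster, keeping the HPC network free for gradient traffic.
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-offload").getOrCreate()
records = spark.sparkContext.sequenceFile("hdfs:///datasets/imagenet/train")

num_workers = 64  # one output partition per training service instance (assumed)

for epoch in range(90):
    (records
     .map(lambda kv: (random.random(), kv))   # fresh random sort key each epoch
     .sortByKey()                             # global shuffle over Ethernet
     .values()
     .repartition(num_workers)
     .saveAsSequenceFile(f"hdfs:///staging/epoch_{epoch}"))  # staged for the trainers
```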

Our Solution – NovuForce
• A fully connected pipeline leveraging the advantages of both worlds
  – Spark + high performance computing
• Spark for data ingestion, pre-processing, and shuffling
• Deep learning (training/inference) as a service
  – Optimized for high performance computing
• Separate schedulers for the data pipeline and for deep learning
  – Hardware-aware schedulers
• Zero-copy data sharing

Training Flow

[Diagram: cameras/sensors feed Spark data pipelines on the data processing cluster, which in turn feed the training services and model server on the HPC cluster for deep learning (NovuForce). A Spark scheduler drives the data processing pipelines and a deep learning scheduler drives the deep learning services, all on Apache Mesos managed resources; the legend distinguishes data flow from scheduling.]

Interactive Usage
• WebUI for the deep learning services
• Web notebooks via Apache Zeppelin

[Diagram: the WebUI and Zeppelin notebooks sit on top of NovuForce's data processing pipelines and deep learning services.]

Zero-Copy Data Sharing
• The last stage of the Spark job is scheduled onto the HPC cluster
• A circular buffer in shared memory connects the two sides
• Labels and images are exchanged in the Apache Arrow format (a sketch follows)

[Diagram: a data processing task writes (label, image) records in Arrow format into a shared-memory circular buffer, which a training service instance reads directly.]
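A minimal sketch of the Arrow hand-off, assuming POSIX shared memory and the Arrow IPC stream format; the segment name `novu_ring_0` and the two-column schema are illustrative, and a real circular buffer would manage many such slots plus read/write cursors.

```python
# Hypothetical sketch of the zero-copy hand-off: a Spark task serializes a
# (label, image) batch with Apache Arrow into shared memory, and a training
# service instance on the same machine maps it without copying the bytes.
from multiprocessing import shared_memory

import pyarrow as pa

# --- producer: last-stage Spark task on the HPC machine ---
batch = pa.record_batch(
    [pa.array([3, 7], type=pa.int32()),                             # labels
     pa.array([b"<jpeg bytes>", b"<jpeg bytes>"], type=pa.binary())],  # images
    names=["label", "image"])

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()

shm = shared_memory.SharedMemory(name="novu_ring_0", create=True, size=payload.size)
shm.buf[:payload.size] = payload.to_pybytes()

# --- consumer: training service instance on the same machine ---
view = shared_memory.SharedMemory(name="novu_ring_0")
reader = pa.ipc.open_stream(pa.py_buffer(view.buf))  # maps the bytes, no copy
for rb in reader:
    labels, images = rb.column(0), rb.column(1)      # Arrow arrays over shared memory
# (cleanup omitted: close()/unlink() the segment when the ring slot is recycled)
```

Because Arrow's layout is identical in every process, the consumer reads the labels and images in place; only pointers move, not bytes.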

Deep Learning Services

[Architecture diagram: C++, Python, and Java clients, NiFi, and the WebUI reach the frontend and model server through REST APIs on the master machine, which also hosts the NovuForce framework (DSGD runtime, GPU/hardware-aware scheduler), the Mesos master, and a Docker registry. Each worker machine runs a Mesos agent whose executor hosts the DSGD runtime in a Docker container and reads Arrow data; configuration management is done with Ansible modules. A client sketch follows.]
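The deck only states that clients reach the model server through REST APIs; purely as an assumed illustration, a request might look like the following (the endpoint, route, and JSON fields are all hypothetical).

```python
# Hypothetical sketch of a client talking to the model server's REST APIs.
import base64

import requests

MASTER = "http://master.example.com:8080"   # assumed address

with open("xray.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(f"{MASTER}/v1/models/resnet50:predict",   # assumed route
                     json={"instances": [{"image": image_b64}]},
                     timeout=30)
resp.raise_for_status()
print(resp.json())  # e.g. predicted labels and scores
```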

Hardware-aware Scheduler
• Spark data pipelines
  – Scheduled onto the data processing cluster
  – Training-stage tasks are collocated with the deep learning services on the HPC cluster
• Deep learning services
  – Scheduled with NUMA zone binding (a sketch follows)
  – Communication paths are optimized
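A sketch of what NUMA zone binding could look like at launch time, assuming a `numactl` wrapper and an illustrative GPU-to-node mapping; none of this is NovuForce's actual scheduler code.

```python
# Hypothetical sketch of NUMA zone binding: wrap each training service
# instance in numactl so it only uses the CPU cores and memory of the NUMA
# node closest to its assigned GPU.
import os
import subprocess

# Assumed topology: GPUs 0-3 hang off NUMA node 0, GPUs 4-7 off node 1.
GPU_TO_NUMA = {g: 0 if g < 4 else 1 for g in range(8)}

def launch_trainer(gpu_id: int, cmd: list[str]) -> subprocess.Popen:
    node = GPU_TO_NUMA[gpu_id]
    wrapped = ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + cmd
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    return subprocess.Popen(wrapped, env=env)

# e.g. launch_trainer(5, ["python", "train.py"]) pins the process to node 1.
```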

Inference Flow
• Inference inside Spark pipelines
  – e.g., via DeepImagePredictor (see the sketch below)
• Inference as a service

[Diagram: the data processing pipelines call NovuForce's deep learning services and model server for predictions.]
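For the in-pipeline route, the sketch below uses DeepImagePredictor from Databricks' spark-deep-learning (sparkdl) package, which the slide names; the input path is illustrative and a Spark 2.x-era environment is assumed.

```python
# Sketch of in-pipeline inference with sparkdl's DeepImagePredictor.
from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

images_df = ImageSchema.readImages("hdfs:///datasets/samples/")  # assumed path

predictor = DeepImagePredictor(inputCol="image",
                               outputCol="predicted_labels",
                               modelName="InceptionV3",   # bundled pretrained model
                               decodePredictions=True,
                               topK=5)
predictions = predictor.transform(images_df)
predictions.select("image.origin", "predicted_labels").show(truncate=False)
```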

High Performance Deep Learning Training Services


High Performance Trainer - Computation
• Optimized for data throughput (samples/sec)
• Reliable data input pipelines
• Efficient data shuffling and augmentation
• Layer/kernel fusion
• Half-precision (FP16) support (a sketch follows)
  – Tensor computations and communications in FP16
  – Half the memory consumption
  – Leverages the latest hardware, e.g., Volta's Tensor Cores

Source: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
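A toy sketch of the FP16 pattern implied above: compute and communicate in half precision while keeping an FP32 master copy of the weights, with loss scaling to protect small gradients. The model, shapes, and scale value are stand-ins, not NovuForce internals.

```python
# Hypothetical sketch of FP16 training with an FP32 master weight copy and
# static loss scaling ("half the memory, Tensor Core speed" pattern).
import numpy as np

rng = np.random.default_rng(0)
master_w = rng.standard_normal((256, 256)).astype(np.float32)  # FP32 master weights
loss_scale = 1024.0  # keeps small FP16 gradients from flushing to zero

def forward_backward(w16: np.ndarray) -> np.ndarray:
    """Toy forward/backward pass entirely in FP16 (stand-in for the GPU math)."""
    x = rng.standard_normal((32, 256)).astype(np.float16)
    # gradient of (loss_scale * loss), so intermediates stay representable in FP16
    return (x.T @ (x @ w16)) * np.float16(loss_scale / 32)

for step in range(10):
    w16 = master_w.astype(np.float16)                               # cast down for compute/comms
    grad32 = forward_backward(w16).astype(np.float32) / loss_scale  # unscale in FP32
    master_w -= 1e-3 * grad32                                       # update the FP32 master copy
```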


High Performance Trainer - Communication
• Designed for distributed clusters
• Data-parallel distributed synchronized SGD
  – Ring-based All-Reduce algorithm using NCCL* (a sketch follows)
    • Gradients are chunked into buckets to overlap computation and communication
  – Recursive doubling/halving^ All-Reduce algorithm
    • Multi-level, using NCCL and CUDA-aware MPI
  – InfiniBand and GPUDirect RDMA

[Diagram: each of N workers loads its share of the batch (1/N ... N/N), runs the forward/backward computation, then all workers exchange updates in an All-Reduce communication phase before the next batch.]

* https://developer.nvidia.com/nccl
^ Thakur et al., Optimization of Collective Communication Operations in MPICH, IJHPCA, 2005
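To make the bucketing concrete, here is a pure-Python simulation of the ring All-Reduce (reduce-scatter, then all-gather); real systems run this with NCCL on GPU buffers, and this toy only shows the data movement.

```python
# Simulated ring All-Reduce over N workers: every worker ends with the
# element-wise sum of all gradients, moving one bucket per step.
import numpy as np

def ring_all_reduce(grads: list[np.ndarray]) -> list[np.ndarray]:
    n = len(grads)
    chunks = [np.array_split(g.copy(), n) for g in grads]  # one bucket per peer

    # Reduce-scatter: after n-1 steps, worker r owns the fully summed bucket (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:            # each worker passes one bucket right
            chunks[(r + 1) % n][idx] += data  # and the receiver accumulates it

    # All-gather: circulate the summed buckets until every worker has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

workers = [np.ones(10) * (i + 1) for i in range(4)]  # 4 workers' gradients
print(ring_all_reduce(workers)[0])                   # every entry is 1+2+3+4 = 10
```

Each step moves a single bucket per worker, which is exactly why an implementation can overlap the transfer of one bucket with the reduction (or backward computation) of the next.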


High Performance Trainer - Communication
• Designed for distributed clusters
• Hardware-aware scheduler using Apache Mesos
  – Dynamically chooses an All-Reduce algorithm (a sketch follows)
  – CPU/GPU affinities, NUMA binding
• Everything runs inside containers
  – Easy deployment and task scheduling
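The deck does not spell out the selection policy; purely as an assumed illustration, the dynamic choice could be a placement-driven heuristic like the following (the rules are hypothetical, not NovuForce's).

```python
# Hypothetical sketch of "dynamically choose the All-Reduce algorithm":
# pick an algorithm from the job's placement and interconnect.
def choose_all_reduce(num_nodes: int, has_infiniband: bool) -> str:
    # Assumed heuristic: a single node stays on NCCL's ring; across nodes
    # with InfiniBand, use the multi-level NCCL + CUDA-aware MPI
    # recursive doubling/halving path from the previous slide.
    if num_nodes == 1:
        return "nccl-ring"
    if has_infiniband:
        return "multilevel-nccl-mpi-doubling-halving"
    return "nccl-ring"

print(choose_all_reduce(num_nodes=8, has_infiniband=True))
```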

Running in Production - for Medical Images
• Recently deployed in western China
  – 8 nodes, 64 GPUs in total
  – 8 x V100-PCIE-16G GPUs per node
  – 2 x InfiniBand EDR
  – Easy deployment with Ansible

High Performance Trainer - Benchmark

[Benchmark chart: VGG-16/ResNet-50/Inception-V3 throughput on real ImageNet data using up to 64 NVIDIA V100-PCIE-16G GPUs, batch size 128 per worker, step size = 500, averaged over 10 runs. Tested with OpenMPI 3.0.1, NCCL 2.1.15, CUDA 9.1, and cuDNN 7.1.3.]

High Performance Trainer - Convergence
• What matters is the overall time to reach convergence
• Large mini-batches are key when going distributed
• Leverage recent research* on large-batch training (a sketch of the schedule follows)
  – Linear learning-rate scaling
  – Gradual learning-rate warmup
  – Aggressive learning-rate scheduling
• Distributed large-mini-batch (8k) ResNet-50 on ImageNet
  – Top-1 75.8% in 52 minutes using only 64 GPUs
  – Top-1 74.1% in 50 minutes with mixed precision using 32 GPUs

* Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
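A sketch of the cited recipe (Goyal et al. 2017) as a single schedule function; the base values are that paper's ImageNet/ResNet-50 conventions, not necessarily the exact settings behind the numbers above.

```python
# Linear scaling + gradual warmup + step decay for large-batch SGD,
# after "Accurate, Large Minibatch SGD" (values assumed, ResNet-50 style).
def learning_rate(epoch: float, batch_size: int,
                  base_lr: float = 0.1, base_batch: int = 256,
                  warmup_epochs: int = 5) -> float:
    peak = base_lr * batch_size / base_batch        # linear scaling rule
    if epoch < warmup_epochs:                       # gradual warmup from base_lr to peak
        return base_lr + (peak - base_lr) * epoch / warmup_epochs
    # step schedule: divide by 10 at epochs 30, 60, 80 (the usual recipe)
    decay = 10 ** -sum(epoch >= m for m in (30, 60, 80))
    return peak * decay

# e.g. an 8k global batch (64 GPUs x 128) peaks at 0.1 * 8192/256 = 3.2
print(learning_rate(5, 8192))   # 3.2 right after warmup
print(learning_rate(35, 8192))  # 0.32 after the first decay
```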

Summary
• A connected deep learning workflow optimized for both data processing and high performance deep learning
• Data augmentation and shuffling off-loaded to Spark
• Zero-copy data sharing
• Hardware-aware schedulers
• Fast and accurate distributed HPC training services

Thank You!
Rui Liu ([email protected])
Yuduo Wu ([email protected])
NovuMind Inc.