High Performance Deep Learning with Apache Spark
Rui Liu, Yuduo Wu
NovuMind Inc.
Background
• We build a high performance deep learning platform
  – Co-designed hardware and software
• We are working on connecting our deep learning platform with data pipelines
Different Characteristics
• Deep learning is a computation- and communication-intensive process
  – High utilization of GPUs
  – Low-latency synchronizations
  – Hardware acceleration, e.g., GPU Direct RDMA, NUMA, InfiniBand
  – Single instance per machine
• Spark data pipelines are optimized for
  – Data locality
  – Minimizing data I/O and shuffling
  – Multiple tasks per machine
Different Hardware
• Customer data centers contain different types of machines
• HPC cluster for deep learning
  – GPUs, InfiniBand, etc.
• Data processing machines
  – No GPUs
  – Ethernet
Goals
• Connect deep learning systems with data pipelines
• Without sacrificing deep learning performance
[Diagram: data pipelines run on the data processing cluster and are connected to deep learning services on the HPC cluster.]
Data Augmentation Off-loading
• Pre-processing for deep learning often needs to be off-loaded from the training services
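A minimal PySpark sketch of what this off-loading can look like. The augmentation (a horizontal flip), the HDFS paths, and the output format are illustrative assumptions, not part of the NovuForce pipeline.

```python
# Sketch: run image augmentation in the Spark data pipeline instead of
# inside the training service. Paths and the augmentation are examples.
import io
import numpy as np
from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("augmentation-offload").getOrCreate()

def augment(record):
    path, raw = record                        # (file path, raw image bytes)
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    arr = np.asarray(img)
    flipped = arr[:, ::-1, :]                 # horizontal flip as an example
    return path, flipped.tobytes()

# binaryFiles yields (path, bytes) pairs for each image file
augmented = (spark.sparkContext
             .binaryFiles("hdfs:///datasets/imagenet/train/*.jpg")
             .map(augment))

augmented.saveAsPickleFile("hdfs:///datasets/imagenet/train_augmented")
```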
Data Shuffling Off-loading
• Training needs to shuffle the data between epochs
• The training cluster's network is already overwhelmed by parameter synchronization
• Data shuffling can be off-loaded from the training services
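A sketch of a per-epoch shuffle done by Spark on the data processing cluster, so the training cluster's network stays free for parameter synchronization. The dataset paths and epoch count are illustrative.

```python
# Sketch: reorder the training samples with a fresh random key every
# epoch on the data processing cluster, then hand the shuffled
# partitions over to the training services.
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("epoch-shuffle").getOrCreate()
samples = spark.read.parquet("hdfs:///datasets/train.parquet")

num_epochs = 90
for epoch in range(num_epochs):
    shuffled = samples.orderBy(rand(seed=epoch))
    shuffled.write.mode("overwrite").parquet(
        "hdfs:///datasets/shuffled/epoch={}".format(epoch))
```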
Our Solutions – NovuForce
• A fully connected pipeline leveraging the advantages of the two worlds
  – Spark + high performance computing
• Spark for data ingestion, preprocessing, and shuffling
• Deep learning (training/inference) as a service
  – Optimized for high performance computing
• Different schedulers for the data pipeline and for deep learning
  – Hardware-aware schedulers
• Zero-copy data sharing
Training Flow
[Diagram: cameras/sensors feed Spark data pipelines on the data processing cluster; the Spark scheduler drives the data processing pipelines, while the deep learning scheduler drives the training services and the model server on the HPC cluster (NovuForce); both run on Apache Mesos managed resources.]
Interactive Usages
• WebUI for deep learning services
• Web notebook via Apache Zeppelin
[Diagram: the WebUI sits on top of NovuForce, spanning both the data processing pipelines and the deep learning services.]
Zero Copy Data Sharing
• The last stage of Spark tasks is scheduled onto the HPC cluster
• Circular buffer in shared memory
• Labels and images in the Apache Arrow format (sketched below)
[Diagram: a data processing task writes label/image record batches in Arrow format into the shared buffer, and a training service instance reads them directly (NovuForce).]
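A minimal sketch of the zero-copy handoff: a producer writes labels and images as one Arrow record batch into POSIX shared memory, and a consumer maps the same segment and reads the columns without copying. The segment name, schema, and sizes are illustrative; the real system cycles through a circular buffer of such slots.

```python
# Sketch: Arrow record batch exchanged through shared memory.
import numpy as np
import pyarrow as pa
from multiprocessing import shared_memory

# --- producer side (data processing task) ---
labels = pa.array([3, 7], type=pa.int32())
images = pa.array([np.zeros(224 * 224 * 3, dtype=np.uint8).tobytes()] * 2,
                  type=pa.binary())
batch = pa.RecordBatch.from_arrays([labels, images], names=["label", "image"])

sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
payload = sink.getvalue()                       # Arrow IPC stream as one buffer

shm = shared_memory.SharedMemory(create=True, size=payload.size,
                                 name="novu_batch_slot_0")
shm.buf[:payload.size] = payload.to_pybytes()

# --- consumer side (training service instance) ---
peer = shared_memory.SharedMemory(name="novu_batch_slot_0")
reader = pa.ipc.open_stream(pa.py_buffer(peer.buf))
for rb in reader:
    # Columns are read directly out of the shared segment, no copy
    print(rb.column(0))                         # labels
    print(len(rb.column(1)[0].as_py()))         # size of the first image
```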
Deep Learning Services
[Architecture diagram: C++/Python/Java clients, NiFi, and the WebUI talk to the model server through REST APIs. The master machine runs the frontend, the NovuForce framework (DSGD runtime), the GPU/hardware-aware scheduler, the Mesos master, and a Docker registry. Each worker machine runs a Mesos agent and an executor with DSGD inside a Docker container, reading Arrow data. Configuration management is done with Ansible modules.]
Hardware-aware Scheduler
• Spark data pipelines
  – Scheduled onto the data processing cluster
  – Training-stage tasks are collocated with the deep learning services on the HPC cluster
• Deep learning services
  – Scheduled with NUMA zone binding (see the sketch below)
  – Communication is optimized
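What NUMA zone binding can look like at the OS level is sketched below. The zone-to-core map, GPU assignment, and launch command are illustrative assumptions, not the actual NovuForce/Mesos scheduler.

```python
# Sketch: pin a training service instance to the CPUs of one NUMA zone
# and to specific GPUs before it starts (Linux only). The zone map is
# hard-coded for illustration; a real scheduler would derive it from
# the machine topology (e.g. /sys/devices/system/node).
import os
import subprocess

NUMA_ZONES = {0: set(range(0, 16)), 1: set(range(16, 32))}   # example topology

def launch_trainer(zone, gpu_ids, cmd):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, gpu_ids)))
    # preexec_fn runs in the child right before exec, so the affinity
    # applies to the trainer and everything it spawns.
    return subprocess.Popen(
        cmd, env=env,
        preexec_fn=lambda: os.sched_setaffinity(0, NUMA_ZONES[zone]))

# Example: one trainer bound to NUMA zone 0 with GPUs 0-3
launch_trainer(zone=0, gpu_ids=[0, 1, 2, 3], cmd=["python", "train.py"])
```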
Inference Flow
• Inference inside Spark pipelines
  – e.g., DeepImagePredictor
• Inference as a service (see the sketch below)
[Diagram: data processing pipelines call the model server exposed by the deep learning services (NovuForce).]
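A hedged sketch of the "inference as a service" path driven from a Spark pipeline. The model server's endpoint URL and JSON payload shape are hypothetical; the slides do not specify the REST API.

```python
# Sketch: calling a model server's REST API from a Spark pipeline.
# The endpoint and payload below are hypothetical placeholders.
import base64
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inference-as-a-service").getOrCreate()

MODEL_SERVER = "http://model-server:8080/v1/predict"       # hypothetical endpoint

def predict_partition(records):
    session = requests.Session()                           # one connection per partition
    for path, raw in records:
        payload = {"image": base64.b64encode(raw).decode("ascii")}
        resp = session.post(MODEL_SERVER, json=payload, timeout=30)
        yield path, resp.json()

images = spark.sparkContext.binaryFiles("hdfs:///incoming/images/*.jpg")
predictions = images.mapPartitions(predict_partition)
predictions.saveAsTextFile("hdfs:///predictions/latest")
```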
High Performance Deep Learning Training Services
High Performance Trainer - Computation
• Optimized for data throughput (samples/sec)
• Reliable data input pipelines
• Efficient data shuffle and augmentations
• Layer / kernel fusions
• Half-precision (FP16) support (see the sketch below)
  – Tensor computations/communications
  – Half the memory consumption
  – Leverage the latest hardware, e.g. Volta's Tensor Cores

Source: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
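The usual way to keep training stable at half precision is FP32 master weights plus loss scaling. The NumPy sketch below shows that generic recipe; it is not the NovuForce trainer's code, and the loss-scale value is an arbitrary example.

```python
# Generic mixed-precision SGD step: FP16 gradients, FP32 master weights,
# and a constant loss scale, shown with NumPy for clarity.
import numpy as np

LOSS_SCALE = 1024.0

def sgd_step(master_w, grad_fp16, lr):
    # Unscale the FP16 gradients in FP32, then update the FP32 master copy
    grad = grad_fp16.astype(np.float32) / LOSS_SCALE
    master_w -= lr * grad
    # Return an FP16 copy of the weights for the next forward/backward pass
    return master_w.astype(np.float16)

master_w = np.random.randn(1024).astype(np.float32)    # FP32 master weights
w_fp16 = master_w.astype(np.float16)

# One illustrative step: pretend these FP16 gradients came from backprop
# on a loss that was multiplied by LOSS_SCALE before the backward pass.
grad_fp16 = (np.random.randn(1024) * LOSS_SCALE).astype(np.float16)
w_fp16 = sgd_step(master_w, grad_fp16, lr=0.01)
```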
High Performance Trainer - Communication
• Designed for distributed clusters
• Data-parallel distributed synchronized SGD (see the sketch below)
[Diagram: each of the N workers loads its own batch shard (batch 1/N … N/N), runs the forward/backward computation, the gradients are combined in an All-Reduce communication step, and every worker applies the same update.]
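A minimal sketch of the synchronous data-parallel step in the diagram, with mpi4py's All-Reduce standing in for the GPU-side NCCL collective; the model and gradient computation are placeholders.

```python
# One synchronous data-parallel SGD step: each worker computes gradients
# on its own batch shard, an All-Reduce sums them across workers, and
# every worker applies the identical averaged update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
world_size = comm.Get_size()

def compute_gradients(weights, batch):
    # Placeholder for the forward/backward pass on this worker's shard
    return np.random.randn(weights.size).astype(np.float32)

def train_step(weights, batch, lr=0.01):
    local_grad = compute_gradients(weights, batch)
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)    # sum across workers
    weights -= lr * global_grad / world_size               # same update everywhere

weights = np.zeros(1_000_000, dtype=np.float32)            # replicated model
train_step(weights, batch=None)
```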
High Performance Trainer - Communication
• Data-parallel distributed synchronized SGD
  – Ring-based All-Reduce algorithm using NCCL* (simulated in the sketch below)
    • Chunked into buckets to overlap computation and communication
  – Recursive doubling/halving^ All-Reduce algorithm
    • Multi-level, using NCCL and CUDA-aware MPI
  – InfiniBand and GPU Direct RDMA

* https://developer.nvidia.com/nccl
^ Thakur et al., Optimization of Collective Communication Operations in MPICH, IJHPCA, 2005
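A self-contained, single-process NumPy simulation of the ring All-Reduce (reduce-scatter followed by all-gather over chunks). Real implementations run this over GPUs and InfiniBand with NCCL; the per-chunk structure is what lets communication overlap with computation.

```python
# Simulated ring All-Reduce over N workers.
import numpy as np

def ring_allreduce(tensors):
    n = len(tensors)                                   # number of workers
    chunks = [np.array_split(t.copy(), n) for t in tensors]

    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) to
    # worker i+1, which accumulates it. After n-1 steps, worker i holds
    # the fully summed chunk (i + 1) mod n.
    for step in range(n - 1):
        for i in range(n):
            dst, c = (i + 1) % n, (i - step) % n
            chunks[dst][c] += chunks[i][c]

    # Phase 2: all-gather. Circulate the completed chunks around the ring
    # until every worker has all of them.
    for step in range(n - 1):
        for i in range(n):
            dst, c = (i + 1) % n, (i + 1 - step) % n
            chunks[dst][c] = chunks[i][c]

    return [np.concatenate(c) for c in chunks]

# Every worker should end up with the elementwise sum of all inputs.
workers = [np.full(8, fill_value=i, dtype=np.float32) for i in range(4)]
out = ring_allreduce(workers)
assert all(np.allclose(o, sum(range(4))) for o in out)
```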
High Performance Trainer - Communication
• Hardware-aware scheduler using Apache Mesos
  – Dynamically chooses the All-Reduce algorithm
  – CPU/GPU affinities, NUMA binding
• Everything runs inside containers
  – Easy deployment / task scheduling
Running in Production - for Medical Images
• Recently deployed in western China
  – 8 nodes, 64 GPUs in total
  – 8 x V100-PCIE-16G GPUs per node
  – 2 x InfiniBand EDR
  – Easy deployment with Ansible
High Performance Trainer - Benchmark
VGG-16/ResNet-50/Inception-V3 benchmarks on real ImageNet data using up to 64 NVIDIA V100-PCIE-16G GPUs, batch size 128 per worker with step size = 500, averaged over 10 runs. Tested with OpenMPI 3.0.1, NCCL 2.1.15, CUDA 9.1, and cuDNN 7.1.3.
High Performance Trainer - Convergence
• Overall time to reach convergence is what matters
• Large mini-batches are key when going distributed
• Leverage recent research efforts* for large-batch training (see the sketch below)
  – Learning rate linear scaling
  – Learning rate gradual warmup schema
  – Aggressive learning rate scheduling

* Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
• Distributed large-minibatch (8k) ResNet-50 on ImageNet
  – Top-1 75.8% in 52 minutes using only 64 GPUs
  – Top-1 74.1% in 50 minutes with mixed precision using 32 GPUs
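A sketch of the large-batch learning-rate recipe from Goyal et al. 2017 (linear scaling plus gradual warmup). The base LR, warmup length, and decay points are illustrative, not the exact schedule behind the results above.

```python
# Learning-rate schedule: linear scaling with the global batch size,
# gradual warmup over the first epochs, then aggressive step decays.
BASE_LR = 0.1            # reference LR for a 256-sample minibatch
BASE_BATCH = 256
WARMUP_EPOCHS = 5
DECAY_EPOCHS = {30: 0.1, 60: 0.01, 80: 0.001}   # example step decays

def learning_rate(epoch, global_batch_size):
    # Linear scaling: the peak LR grows with the global batch size
    peak_lr = BASE_LR * global_batch_size / BASE_BATCH
    if epoch < WARMUP_EPOCHS:
        # Gradual warmup: ramp linearly from BASE_LR up to the scaled peak
        return BASE_LR + (peak_lr - BASE_LR) * (epoch + 1) / WARMUP_EPOCHS
    factor = 1.0
    for boundary, f in sorted(DECAY_EPOCHS.items()):
        if epoch >= boundary:
            factor = f
    return peak_lr * factor

# Example: 64 workers x 128 samples = 8192 global batch, as in the slides
for epoch in [0, 4, 5, 29, 30, 80]:
    print(epoch, learning_rate(epoch, global_batch_size=8192))
```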
Summary
• A connected deep learning workflow optimized for both data processing and high performance deep learning
• Data augmentation and shuffling off-loaded to Spark
• Zero-copy data sharing
• Hardware-aware schedulers
• Fast and accurate distributed HPC training services
Thank You!
Rui Liu ([email protected])
Yuduo Wu ([email protected])
NovuMind Inc.