Clipper: A Low-Latency Online Prediction Serving System
Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael Franklin, Joseph Gonzalez, Ion Stoica
NSDI 2017, March 29, 2017
Learning Produces a Trained Model
Big Data → Training → Complex Model
Serving Uses the Model to Answer Queries
Query → Model → Decision (e.g., "CAT")
Learning (Big Data → Training → Model) feeds Serving (Application → Query → Model → Decision → Application)
Prediction-Serving for Interactive Applications
Timescale: ~10s of milliseconds
Prediction-Serving Raises New Challenges
- Support low-latency, high-throughput serving workloads
- Serve a large and growing ecosystem of ML models and frameworks (e.g., Caffe, Vowpal Wabbit, ...)
Support low-latency, high-throughput serving workloads:
- Models are getting more complex: 10s of GFLOPs [1]
- Deployed on the critical path: must maintain SLOs under heavy load
- Increasingly served using specialized hardware for predictions
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.
Example: Google Translate serves 140 billion words a day. Serving that workload on GPUs would take an estimated 82,000 GPUs running 24/7 [1], so Google invented new hardware: the Tensor Processing Unit (TPU).
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
Big Companies Build One-Off Systems
Problems:
- Expensive to build and maintain
- Highly specialized: require both ML and systems expertise
- Tightly couple model and application
- Difficult to change or update the model
- Only support a single ML framework
Serve a large and growing ecosystem of ML models and frameworks:
- Applications span fraud detection, content recommendation, personal assistants, robotic control, machine translation, ...
- Built on many frameworks (e.g., Caffe, Vowpal Wabbit, and more to come)
- Difficult to deploy and brittle to manage
- Varying physical resource requirements
But most companies can’t build new serving systems…
Use Existing Systems: Offline Scoring
Big Data → Training → Model → Batch Analytics: score the inputs X offline and store the predictions Y in a datastore.
At serving time, the application looks up the decision for each query in the datastore, giving low-latency serving.

Problems:
- Requires the full set of queries ahead of time
  - Only works for a small and bounded input domain
- Wasted computation and space
  - Can render and store predictions that are never needed
- Costly to update: requires re-running the batch job
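To make the offline-scoring pattern concrete, here is a minimal sketch: a batch job precomputes a prediction for every possible query and stores it in a datastore, and the application only performs a lookup at serving time. The score() function and the dict-based "datastore" are stand-ins for illustration, not part of any real system.

    # Minimal sketch of offline scoring (illustrative stand-ins only).
    def score(x):
        # Stand-in for running the trained model on input x.
        return "CAT" if sum(x) > 1.0 else "DOG"

    # Batch analytics job: precompute predictions for the full input domain,
    # which therefore must be known up front.
    ALL_QUERIES = [(0.1, 0.4, 0.7), (0.0, 0.1, 0.2)]
    datastore = {x: score(x) for x in ALL_QUERIES}

    # Online serving: a low-latency lookup, but only for precomputed queries.
    def serve(query):
        return datastore.get(query)   # unseen queries cannot be answered

    # serve((0.1, 0.4, 0.7))  -> "CAT"
    # serve((0.9, 0.9, 0.9))  -> None: never precomputed

The failed lookup at the end is exactly the bounded-input-domain limitation listed above.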
How does Clipper address these challenges?
Clipper Solutions
- Simplifies deployment through a layered architecture
- Serves many models across ML frameworks concurrently
- Employs caching, batching, and scale-out for high-performance serving
Clipper Decouples Applications and Models
Applications issue Predict calls and send Feedback through Clipper's RPC/REST interface; Clipper forwards queries over RPC to per-model containers (MCs), e.g., a Caffe model container.
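Because applications are decoupled from the models, a client only needs Clipper's REST interface. Below is a minimal client sketch, assuming a query frontend listening at localhost:1337 and an application registered as "image-classifier"; the endpoint path, port, and JSON schema are illustrative assumptions rather than an exact API specification.

    # Hypothetical client call to Clipper's REST predict endpoint.
    import requests

    CLIPPER_URL = "http://localhost:1337"   # assumed query-frontend address
    APP_NAME = "image-classifier"           # hypothetical application name

    def predict(input_vector, timeout_s=0.05):
        """Send one prediction query; the app never talks to the model directly."""
        resp = requests.post(
            f"{CLIPPER_URL}/{APP_NAME}/predict",
            json={"input": input_vector},   # assumed request schema
            timeout=timeout_s,
        )
        resp.raise_for_status()
        return resp.json()

    # Example usage:
    # decision = predict([0.1, 0.4, 0.7])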
Clipper Architecture
- Model Selection Layer: improves accuracy through bandit methods and ensembles, online learning, and personalization
- Model Abstraction Layer: provides a common interface to models while bounding latency and maximizing throughput
Applications call Predict and Observe on the RPC/REST interface; the abstraction layer communicates with model containers (MCs) over RPC.
Within the layers:
- Model Selection Layer: selection policy
- Model Abstraction Layer: caching, adaptive batching, RPC to model containers (MCs)
Clipper Implementation
- Core system (caching, adaptive batching, model abstraction): 5,000 lines of Rust
- RPC: 100 lines of Python, 250 lines of Rust, 200 lines of C++
Model Abstraction Layer
Provides caching, adaptive batching, and RPC connections to the model containers (MCs) running each framework (e.g., Caffe).
Container-based Model Deployment
Implement the Model API:

    class ModelContainer:
        def __init__(model_data)
        def predict_batch(inputs)

- Implemented in many languages: Python, Java, C/C++
- The model implementation is packaged in a container (the Model Container, MC)
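As a concrete (hypothetical) example, a Python model container implementing this two-method API might wrap a serialized scikit-learn model; the use of joblib and the file path below are assumptions for illustration only, not part of Clipper itself.

    # Minimal sketch of a Python model container implementing the Model API.
    import joblib

    class ModelContainer:
        def __init__(self, model_data):
            # model_data: path to a serialized model shipped inside the container
            self.model = joblib.load(model_data)

        def predict_batch(self, inputs):
            # inputs: a batch of feature vectors; returns one prediction per input
            return self.model.predict(inputs).tolist()

    # Example usage (hypothetical file name):
    # container = ModelContainer("/model/model.pkl")
    # container.predict_batch([[0.1, 0.4, 0.7]])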
Common Interface → Simplifies Deployment:
- Evaluate models using original code & systems
- Models run in separate processes as Docker containers
  - Resource isolation
  - Scale-out
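The Model Abstraction Layer's caching can be pictured with a small sketch: a prediction cache consulted before dispatching a query to a model container. Keying entries on (model, version, input) and using LRU eviction are assumptions made here for illustration; see the paper for Clipper's actual cache design.

    # Minimal sketch of a prediction cache in the Model Abstraction Layer.
    from collections import OrderedDict

    class PredictionCache:
        def __init__(self, capacity=10000):
            self.capacity = capacity
            self.entries = OrderedDict()   # (model, version, input) -> prediction

        def get(self, model, version, inp):
            key = (model, version, inp)
            if key in self.entries:
                self.entries.move_to_end(key)   # mark as recently used
                return self.entries[key]
            return None                          # miss: caller dispatches to the MC

        def put(self, model, version, inp, prediction):
            key = (model, version, inp)
            self.entries[key] = prediction
            self.entries.move_to_end(key)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used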
Problem: ML frameworks are optimized for batch processing, not latency.

Adaptive Batching to Improve Throughput
- Why batching helps:
  - A single page load may generate many queries
  - Exploits hardware acceleration
  - Helps amortize system overhead
- The optimal batch size depends on:
  - hardware configuration
  - model and framework
  - system load

Clipper Solution: adaptively trade off latency and throughput
- Increase the batch size until the latency objective is exceeded (Additive Increase)
- If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
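A minimal sketch of this additive-increase/multiplicative-decrease (AIMD) batch-size control loop is shown below; the SLO value, additive step, and backoff factor are illustrative assumptions, and Clipper's actual controller lives in its Rust core.

    # Sketch of an AIMD controller for the maximum batch size.
    SLO_MS = 20.0          # latency objective (assumed)
    ADDITIVE_STEP = 1      # queries added per adjustment (assumed)
    BACKOFF = 0.9          # multiplicative decrease factor (assumed)

    def update_batch_size(batch_size, observed_latency_ms):
        """Additive increase / multiplicative decrease on the max batch size."""
        if observed_latency_ms > SLO_MS:
            # Latency exceeded the SLO: back off multiplicatively.
            return max(1, int(batch_size * BACKOFF))
        # Under the SLO: probe for more throughput additively.
        return batch_size + ADDITIVE_STEP

    # Example: feed back the measured latency of each dispatched batch.
    # batch_size = 1
    # for latency in [5.0, 8.0, 14.0, 22.0, 18.0]:
    #     batch_size = update_batch_size(batch_size, latency)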
[Figure: TensorFlow conv. net (GPU). Throughput (queries per second) and latency (ms) as a function of batch size, with the optimal batch size marked where throughput gains taper off while latency continues to grow.]
[Figure: Adaptive batching vs. no batching across several models and frameworks. Panels show throughput (QPS), P99 latency (ms) against a 20 ms SLO ("20 ms is fast enough"), and the batch sizes chosen by the adaptive policy. Adaptive batching delivers substantially higher throughput than no batching while keeping P99 latency near the 20 ms objective.]
Overhead of the Decoupled Architecture
Comparison: an application using Clipper (Predict/Feedback over the RPC/REST interface, models in separate containers connected over RPC) versus the same application calling TensorFlow Serving directly through its RPC interface.
[Figure: Model: AlexNet trained on CIFAR-10. Throughput (QPS) and P99 latency (ms) for Clipper vs. TensorFlow Serving; Clipper's decoupled architecture achieves comparable throughput and latency.]
Model Selection Layer
Improves accuracy through bandit methods and ensembles, online learning, and personalization.
- Periodic retraining produces new model versions (Version 1, Version 2, Version 3, ...)
- Makes it easy to experiment with new models and frameworks
Selection Policy: Estimate Confidence
The policy queries several deployed models (different versions and frameworks, e.g., a Caffe model alongside others) and compares their predictions:
- If they agree ("CAT", "CAT", "CAT", "CAT"), it returns "CAT" and marks the prediction CONFIDENT.
- If they disagree ("CAT", "MAN", "CAT", "SPACE"), it still returns its best guess ("CAT") but marks the prediction UNSURE.
Selection Policy: Estimate Confidence (ImageNet)
[Figure: Top-5 error rate (lower is better) on ImageNet, broken down into unsure, ensemble, 4-agree, and 5-agree prediction categories; bar width is the percentage of the query workload. Error rates shown: 0.3182 (unsure), 0.1983, 0.0586, 0.0469 (4-agree), and 0.0327 (5-agree).]
Clipper Selection Policies (Model Selection Layer)
Selection policies supported by Clipper:
- Exploit multiple models to estimate confidence (see the sketch below)
- Use multi-armed bandit algorithms to learn the optimal model selection online
- Online personalization across ML frameworks
*See paper for details
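Below is a minimal sketch of the agreement-based confidence policy described above: query several models and flag the prediction as unsure when they disagree. The majority vote and agreement threshold are illustrative assumptions, not Clipper's exact policy.

    # Sketch of an agreement-based confidence selection policy.
    from collections import Counter

    def select_with_confidence(predictions, agreement_threshold=4):
        """predictions: list of labels returned by the deployed models."""
        votes = Counter(predictions)
        label, count = votes.most_common(1)[0]
        confident = count >= agreement_threshold
        return label, ("CONFIDENT" if confident else "UNSURE")

    # Example usage:
    # select_with_confidence(["CAT"] * 5)                            -> ("CAT", "CONFIDENT")
    # select_with_confidence(["CAT", "MAN", "CAT", "SPACE", "CAT"])  -> ("CAT", "UNSURE")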
Conclusion
- Prediction-serving is an important and challenging area for systems research
  - Support low-latency, high-throughput serving workloads
  - Serve a large and growing ecosystem of ML frameworks
- Clipper is a first step towards addressing these challenges
  - Simplifies deployment through a layered architecture
  - Serves many models across ML frameworks concurrently
  - Employs caching, adaptive batching, and container scale-out to meet interactive serving workload demands
- Going beyond an academic prototype to build a real, open-source system
https://github.com/ucbrise/clipper
[email protected]
GPU Cluster Scaling
[Figure: Aggregate throughput (10K qps) and mean and P99 latency (ms) vs. number of model-container replicas (1 to 4), for 10 Gbps and 1 Gbps cluster networks.]