Clipper: A Low-Latency Online Prediction Serving System

Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael Franklin, Joseph Gonzalez, Ion Stoica

NSDI 2017, March 29, 2017

Learning: Big Data → Training → Complex Model

Learning produces a trained model.

Serving: an Application sends a Query to the trained Model and receives a Decision (e.g., “CAT”).

The full pipeline: Learning (Big Data → Training → Model) produces the model that Serving (Application → Query → Model → Decision) uses to answer queries.

Prediction-serving for interactive applications. Timescale: ~10s of milliseconds.

Prediction-Serving Raises New Challenges

Prediction-Serving Challenges
Ø Support low-latency, high-throughput serving workloads
Ø Large and growing ecosystem of ML models and frameworks (e.g., Caffe, VW)

Support low-latency, high-throughput serving workloads
Ø Models getting more complex: 10s of GFLOPs [1]
Ø Deployed on the critical path: must maintain SLOs under heavy load
Ø Using specialized hardware for predictions

[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.

Example: Google Translate serves 140 billion words a day on 82,000 GPUs running 24/7 [1]. Google even invented new hardware for this workload: the Tensor Processing Unit (TPU).

[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

Big Companies Build One-Off Systems
Problems:
Ø Expensive to build and maintain
Ø Highly specialized and require ML and systems expertise
Ø Tightly-coupled model and application
Ø Difficult to change or update model
Ø Only supports a single ML framework

Prediction-Serving Challenges (recap)
Ø Support low-latency, high-throughput serving workloads
Ø Large and growing ecosystem of ML models and frameworks

Challenge: large and growing ecosystem of ML models and frameworks
Ø Many applications: fraud detection, content recommendation, personal assistants, robotic control, machine translation
Ø Many frameworks (e.g., Caffe, VW), each difficult to deploy and brittle to manage
Ø Varying physical resource requirements across models and frameworks

But most companies can’t build new serving systems…

Use existing systems: Offline Scoring
Ø Big Data → Training → Model, evaluated as part of batch analytics
Ø The model scores inputs offline and the resulting (X, Y) predictions are stored in a datastore
Ø At serving time (low-latency serving), the application looks up the decision for a query in the datastore

Problems with offline scoring (see the sketch after this list):
Ø Requires the full set of queries ahead of time
  Ø Only works for a small and bounded input domain
Ø Wasted computation and space
  Ø Can compute and store predictions that are never needed
Ø Costly to update
  Ø Must re-run the batch job
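To make the offline-scoring pattern concrete, here is a minimal sketch, assuming a dict-like datastore and a hypothetical key() helper; none of these names come from the talk:

    # Minimal sketch of offline scoring (illustrative only).
    # The model, lookup keys, and datastore below are hypothetical.

    def key(x):
        # Hypothetical: derive a stable lookup key from the raw input.
        return str(x)

    def batch_score(model, all_possible_inputs, datastore):
        """Offline batch job: precompute a prediction for every known input."""
        for x in all_possible_inputs:
            datastore[key(x)] = model.predict(x)   # materialize (X, Y) pairs

    def serve(query, datastore):
        """Online path: no model evaluation, just a datastore lookup."""
        y = datastore.get(key(query))
        if y is None:
            # The query was not in the precomputed set: offline scoring
            # has no answer for previously unseen inputs.
            raise KeyError("no precomputed prediction for this query")
        return y

The online path is just a lookup, which is why this approach only works when the input domain is small, bounded, and known ahead of time.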

Prediction-Serving Challenges (recap)
Ø Support low-latency, high-throughput serving workloads
Ø Large and growing ecosystem of ML models and frameworks

How does Clipper address these challenges?

Clipper Solutions
Ø Simplifies deployment through a layered architecture
Ø Serves many models across ML frameworks concurrently
Ø Employs caching, batching, and scale-out for high-performance serving

Clipper Decouples Applications and Models
Ø Applications issue Predict and Feedback calls to Clipper over an RPC/REST interface (a minimal client sketch follows)
Ø Clipper communicates with framework-specific model containers (MCs), e.g. Caffe models, over internal RPC
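As an illustration of the decoupled interface, here is a client-side sketch of the Predict call. The host, port, endpoint path, and JSON shape are assumptions for illustration, not the exact Clipper REST API:

    # Minimal sketch of an application calling a prediction-serving REST API.
    # The URL, application name, and JSON fields are hypothetical.
    import json
    import urllib.request

    def predict(app_name, x, host="localhost", port=1337):
        """Send one query to the serving system and return its decision."""
        req = urllib.request.Request(
            url=f"http://{host}:{port}/{app_name}/predict",
            data=json.dumps({"input": x}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:   # blocking RPC/REST call
            return json.loads(resp.read().decode("utf-8"))

    # Example: decision = predict("image-classifier", [0.1, 0.2, 0.3])

Because the application only talks to this interface, the models behind it can be swapped, updated, or replicated without changing application code.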

Clipper Architecture
Ø Applications issue Predict and Observe calls over the RPC/REST interface
Ø Model Selection Layer: improve accuracy through bandit methods and ensembles, online learning, and personalization
Ø Model Abstraction Layer: provide a common interface to models while bounding latency and maximizing throughput
Ø Model containers (MCs) sit below the abstraction layer, connected over RPC

Clipper Architecture (cont.)
Ø Model Selection Layer: selection policy
Ø Model Abstraction Layer: caching and adaptive batching, with RPC connections to the model containers (MCs); a small prediction-cache sketch follows
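The prediction cache in the abstraction layer can be illustrated with a small memoization sketch keyed on (model, version, input). The class name, key format, and LRU eviction policy are assumptions for illustration, not Clipper's actual implementation:

    # Minimal sketch of a prediction cache in the model abstraction layer.
    # Keys, capacity, and the LRU eviction policy are illustrative.
    from collections import OrderedDict

    class PredictionCache:
        def __init__(self, capacity=10000):
            self._entries = OrderedDict()   # LRU order: oldest entry first
            self._capacity = capacity

        def lookup(self, model_name, model_version, x):
            key = (model_name, model_version, tuple(x))
            if key in self._entries:
                self._entries.move_to_end(key)   # refresh LRU position
                return self._entries[key]
            return None                          # miss: must query the model

        def insert(self, model_name, model_version, x, prediction):
            key = (model_name, model_version, tuple(x))
            self._entries[key] = prediction
            self._entries.move_to_end(key)
            if len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least-recently-used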

Clipper Implementation
Ø Core system: 5000 lines of Rust
Ø RPC: 100 lines of Python, 250 lines of Rust, 200 lines of C++

Model Abstraction Layer: caching, adaptive batching, and RPC connections to the model containers (MCs), including framework-specific containers such as Caffe.

Common interface → simplifies deployment:
Ø Evaluate models using original code & systems

Container-based Model Deployment
Implement the model API:

    class ModelContainer:
        def __init__(model_data)
        def predict_batch(inputs)

Ø Implemented in many languages: Python, Java, C/C++
Ø The model implementation is packaged in a container: the Model Container (MC); a fuller example follows
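To make the container API concrete, here is a hedged sketch of a container wrapping a scikit-learn model. The slide only gives the two-method signature; the pickled model format, numpy inputs, and class name below are assumptions for illustration:

    # Sketch of a model container implementing the two-method model API.
    # The pickle-based model format and numpy inputs are assumptions.
    import pickle
    import numpy as np

    class SklearnModelContainer:
        """Wraps a pre-trained scikit-learn model behind the container API."""

        def __init__(self, model_data):
            # model_data: path to a serialized (pickled) scikit-learn model.
            with open(model_data, "rb") as f:
                self._model = pickle.load(f)

        def predict_batch(self, inputs):
            # inputs: a list of feature vectors for one batch of queries.
            # Returns one prediction string per input, preserving order.
            X = np.asarray(inputs)
            return [str(y) for y in self._model.predict(X)]

    # The serving system would instantiate the container once per replica
    # and call predict_batch(...) with adaptively sized batches of queries.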

Common interface → simplifies deployment:
Ø Evaluate models using original code & systems
Ø Models run in separate processes as Docker containers
  Ø Resource isolation
  Ø Scale-out

Problem: frameworks are optimized for batch processing, not latency.

Adaptive Batching to Improve Throughput
Ø Why batching helps:
  Ø A single page load may generate many queries
  Ø Exploits hardware acceleration
  Ø Helps amortize system overhead
Ø The optimal batch size depends on:
  Ø hardware configuration
  Ø model and framework
  Ø system load

Clipper solution: adaptively trade off latency and throughput (sketched below)
Ø Increase the batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
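A minimal sketch of the additive-increase / multiplicative-decrease (AIMD) batch-size controller described above. The concrete step size and backoff fraction are illustrative assumptions, not the values Clipper uses:

    # Sketch of AIMD batch-size adaptation for one model container.
    # The additive step and multiplicative backoff factor are assumptions.

    class AimdBatchSizer:
        def __init__(self, latency_slo_ms, additive_step=1, backoff=0.75):
            self.latency_slo_ms = latency_slo_ms
            self.additive_step = additive_step   # batch-size increase per round
            self.backoff = backoff               # fraction kept after an SLO miss
            self.batch_size = 1

        def update(self, observed_latency_ms):
            """Adjust the batch size after observing one batch's latency."""
            if observed_latency_ms > self.latency_slo_ms:
                # Multiplicative decrease: cut the batch size by a fraction.
                self.batch_size = max(1, int(self.batch_size * self.backoff))
            else:
                # Additive increase: probe for a larger batch next round.
                self.batch_size += self.additive_step
            return self.batch_size

    # Usage: sizer = AimdBatchSizer(latency_slo_ms=20)
    #        new_size = sizer.update(observed_latency_ms=13.5)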

[Figure: TensorFlow conv. net (GPU). Throughput (queries per second) and latency (ms) vs. batch size, with the optimal batch size marked.]

[Figure: Throughput (QPS), P99 latency (ms), and chosen batch size across several models and frameworks, comparing no batching vs. Clipper's adaptive batching. Adaptive batching yields substantially higher throughput while keeping P99 latency within the latency objective ("20 ms is fast enough").]

Overhead of the decoupled architecture
Ø Compare Clipper (applications → RPC/REST interface → Clipper → RPC → model containers) against TensorFlow Serving (applications → RPC interface → TensorFlow Serving)
Ø Model: AlexNet trained on CIFAR-10
[Figure: Throughput (QPS) and P99 latency (ms) for Clipper vs. TensorFlow Serving on this model.]

Clipper Architecture (recap)
Ø Model Selection Layer: improve accuracy through bandit methods and ensembles, online learning, and personalization
Ø Model Abstraction Layer: provide a common interface to models while bounding latency and maximizing throughput

Model Selection Layer: improve accuracy through bandit methods and ensembles, online learning, and personalization
Ø Manage multiple model versions (Version 1, 2, 3, ...) produced by periodic retraining
Ø Experiment with new models and frameworks

Selection Policy: Estimate confidence
Ø Send each query to several models (possibly across frameworks) and compare their predictions
Ø If the models agree (“CAT”, “CAT”, “CAT”, “CAT”), return the prediction marked CONFIDENT
Ø If the models disagree (“CAT”, “MAN”, “CAT”, “SPACE”), return a prediction (“CAT”) but mark it UNSURE
(A minimal sketch of this agreement-based policy follows.)
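Here is a minimal sketch of the agreement-based confidence policy: query every deployed model, return the majority prediction, and flag the result UNSURE when the models disagree. The container interface reuse and the CONFIDENT/UNSURE labels are illustrative assumptions:

    # Sketch of an agreement-based selection policy over an ensemble of models.
    # Each "model" is any object exposing predict_batch(inputs) (see the
    # container API above); the CONFIDENT/UNSURE flags are illustrative.
    from collections import Counter

    def predict_with_confidence(models, x):
        """Query all models for one input and estimate confidence by agreement."""
        predictions = [m.predict_batch([x])[0] for m in models]
        counts = Counter(predictions)
        best, votes = counts.most_common(1)[0]
        confident = (votes == len(models))   # all models agree
        return best, ("CONFIDENT" if confident else "UNSURE")

    # Example: ("CAT", "CONFIDENT") when every model predicts "CAT";
    #          ("CAT", "UNSURE") when the ensemble disagrees.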

[Figure: ImageNet Top-5 error rate, comparing unsure vs. confident predictions under the ensemble, 4-agree, and 5-agree selection policies (error rates range from 0.0327 to 0.3182); confident predictions have much lower error than unsure ones. Bar width is the percentage of the query workload in each category.]

Clipper Selection Policies (Model Selection Layer)
Selection policies supported by Clipper:
Ø Exploit multiple models to estimate confidence
Ø Use multi-armed bandit algorithms to learn the optimal model selection online (an illustrative sketch follows)
Ø Online personalization across ML frameworks

*See the paper for details
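As an illustration of bandit-based model selection, here is a minimal epsilon-greedy sketch. It is a simpler stand-in for the bandit algorithms referenced above (see the paper for the actual policies), and every name and parameter below is illustrative:

    # Sketch of online model selection as a multi-armed bandit (epsilon-greedy).
    # This is an illustrative stand-in, not the specific algorithm Clipper uses.
    import random

    class EpsilonGreedySelector:
        def __init__(self, model_names, epsilon=0.1):
            self.epsilon = epsilon
            self.counts = {m: 0 for m in model_names}     # selections per model
            self.rewards = {m: 0.0 for m in model_names}  # cumulative reward

        def select(self):
            """Pick a model: mostly the best so far, occasionally explore."""
            if random.random() < self.epsilon:
                return random.choice(list(self.counts))
            return max(self.counts, key=self._mean_reward)

        def observe(self, model_name, reward):
            """Record feedback (e.g., 1.0 if the prediction was correct)."""
            self.counts[model_name] += 1
            self.rewards[model_name] += reward

        def _mean_reward(self, model_name):
            n = self.counts[model_name]
            return self.rewards[model_name] / n if n else float("inf")

    # float("inf") for unseen models forces each model to be tried at least once.

Feedback arrives through the same Observe/Feedback path shown in the architecture, which is what allows the selection to adapt online.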

Conclusion
Ø Prediction-serving is an important and challenging area for systems research
  Ø Support low-latency, high-throughput serving workloads
  Ø Serve a large and growing ecosystem of ML frameworks
Ø Clipper is a first step towards addressing these challenges
  Ø Simplifies deployment through a layered architecture
  Ø Serves many models across ML frameworks concurrently
  Ø Employs caching, adaptive batching, and container scale-out to meet interactive serving workload demands
Ø Going beyond an academic prototype to build a real, open-source system

https://github.com/ucbrise/clipper [email protected]

[Backup figure: GPU cluster scaling. Top: throughput (10K qps), aggregate and mean, over 10Gbps and 1Gbps networks. Bottom: mean and P99 latency (ms) over 10Gbps and 1Gbps networks. X-axis: number of model-container replicas (1 to 4).]