ASCAR: Increasing Performance Through Automated Contention Management

(LUG’16)

Yan Li, Xiaoyuan Lu, Ethan Miller, Darrell Long
Storage Systems Research Center (SSRC), University of California, Santa Cruz (an Intel® Parallel Computing Center)

Challenge: consistent performance at peak times
Congestion harms efficiency and throughput.
[Figure: total throughput with 100, 150, and 200 clients]

Challenge: consistent performance at peak times
Congestion causes fluctuation.
[Figure: client throughput of a random write workload, 5 nodes accessing 5 servers]

Challenge: consistent performance at peak times
Congestion can occur anywhere in the cluster.

The problem we are trying to solve
[Picture credit: https://www.checkfelix.com/reiseblog/10-argumente-urlaub-schottland/]

The problem we are trying to solve
Improve throughput or fairness during congestion, or both at the same time.
End-to-end coverage: handle congestion at the OSC, network, OSS, and OST.
Fully automatic, requiring little human effort: modern systems are very dynamic, and we won't have time to create models.

Rate limiting can improve performance … if done properly
[Figure: total throughput with 100, 150, and 200 clients, and with 200 clients under contention control]

Challenges of distributed I/O rate control: 1. Where is the sweet spot?
[Figure: real vs. ideal throughput as a function of the rate limit. If the rate limit is too low, real throughput stays below what the system could deliver; if it is too high, congestion pulls real throughput below the ideal; the sweet spot lies in between.]
Capability discovery usually involves communication: between clients, or with a central controller.

Challenges of distributed I/O rate control: 2. Scalability
Inter-node communication can grow at O(n²)
Adds overhead to an already congested network
Low responsiveness for highly dynamic workloads

ASCAR: Automatic Storage Contention Alleviation and Reduction
Client-side rule-based I/O rate control:
1. no need for central scheduling or coordination; nimble and highly responsive
2. no need to change server software or hardware
3. no scale-up bottleneck
Use machine learning and heuristics for rule generation and optimization: no prior knowledge of the system or workload is required.

Components of the ASCAR prototype
[Diagram: the ASCAR Rule Manager contains the Rule Generator & Optimizer (SHARP), the Workload Character & Rule Set Database, and the Workload Classifier. Each client node runs the application, the file system client, and the ASCAR traffic controller. The Rule Manager sends traffic rules to the clients and receives workload character markers from them; the clients access the storage servers.]


Rule-based Contention Control
Rules tell the controller how to react to congestion: tweak the congestion window according to request processing latency.
Each client tracks three congestion state statistics: ack_ewma, send_ewma, and pt_ratio.
Each rule maps a congestion state to an action: (Congestion State (CS) statistics) → Action.
An action describes how to change the congestion window (max_rpcs_in_flight) and the rate limit:
new_cong_window = m × cong_window + b, where τ is the rate limit.
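A minimal sketch of this lookup-and-update step in Python. The `Rule` class, the half-open region bounds, and the clamping of the window to at least 1 are illustrative assumptions, not the prototype's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """One traffic rule: a box in congestion-state space plus an action."""
    ack_ewma_lo: float
    ack_ewma_hi: float
    send_ewma_lo: float
    send_ewma_hi: float
    pt_ratio_lo: float
    pt_ratio_hi: float
    m: float    # multiplier for the congestion window
    b: float    # additive term for the congestion window
    tau: float  # rate limit

    def matches(self, ack_ewma, send_ewma, pt_ratio):
        return (self.ack_ewma_lo <= ack_ewma < self.ack_ewma_hi and
                self.send_ewma_lo <= send_ewma < self.send_ewma_hi and
                self.pt_ratio_lo <= pt_ratio < self.pt_ratio_hi)

def apply_rules(rules, cong_window, ack_ewma, send_ewma, pt_ratio):
    """Find the rule covering the current congestion state and update
    the congestion window (max_rpcs_in_flight): w' = m * w + b."""
    for r in rules:
        if r.matches(ack_ewma, send_ewma, pt_ratio):
            new_window = max(1.0, r.m * cong_window + r.b)
            return new_window, r.tau
    return cong_window, None  # no rule matched: leave the window unchanged

INF = float("inf")
# The initial rule set: one rule covering the whole state space.
default = Rule(0, INF, 0, INF, 0, INF, m=1.0, b=1.0, tau=8)
w, tau = apply_rules([default], cong_window=4.0,
                     ack_ewma=12.3, send_ewma=8.1, pt_ratio=1.7)
# w == 5.0 (1.0 * 4.0 + 1.0), rate limit tau == 8
```

In the real controller this lookup runs on every RPC completion, so the rule table is kept small enough to scan linearly.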

Generating a rule set for a certain workload
Extract a short signature workload that covers the important features of the application's I/O workload; a signature workload is usually 20 to 30 seconds long.
Generate candidate rules and benchmark them with the signature workload on the real system: test possible combinations of action variables.
Split the hottest rule in the set to generate more rules: structural improvement vs. parameter tuning.

The SHARP search process, step by step (in each figure, the state space is plotted as pt_ratio vs. ack_ewma, from (0, 0) to (INT_MAX, INT_MAX)):

1. Begin with one rule: the whole state space maps to one action.
2. Try different values of (m, b, τ) with the workload: the Cartesian product of {m ± 0.01, m ± 0.02, m ± 0.04, …} × {b ± 0.01, b ± 0.02, b ± 0.04, …} × {τ ± 1, τ ± 2, τ ± 4, …}.
3. Find the rule that yields the highest performance.
4. Split the state space at the most observed state values.
5. Run the workload and find the rules that were triggered most often (e.g., one action used 32k times while the others were used 2k, 0.3k, and 120 times).
6. Improve the most used rules by sweeping the same Cartesian product of values, and find the action that works best.
7. Split the most used rule's state space at the most observed state values.
8. Run the workload, find the most used rule, and improve it; after finding the best action, split the most used rule again.
9. Repeat this process.
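Under the simplifying assumption that benchmarking an action against the signature workload can be reduced to a scoring function, the tune-then-split loop above can be sketched as follows. `toy_benchmark` and all of its coefficients are made-up stand-ins for replaying the signature workload on the real system:

```python
import itertools

def neighborhood(m, b, tau):
    """Candidate actions: the Cartesian product of small tweaks around
    the current (m, b, tau), plus the current values themselves."""
    ms = [m] + [m + s * d for d in (0.01, 0.02, 0.04) for s in (1, -1)]
    bs = [b] + [b + s * d for d in (0.01, 0.02, 0.04) for s in (1, -1)]
    taus = [tau] + [tau + s * k for k in (1, 2, 4) for s in (1, -1)]
    return itertools.product(ms, bs, taus)

def tune_action(action, benchmark):
    """Sweep the neighborhood and keep the best-performing action."""
    return max(neighborhood(*action), key=benchmark)

def split_rule(rule, hot_state):
    """Split one rule's region at the most observed state value; both
    halves inherit the parent's action as a starting point."""
    lo, hi = rule["region"]
    return [{"region": (lo, hot_state), "action": rule["action"]},
            {"region": (hot_state, hi), "action": rule["action"]}]

# Toy benchmark: pretend peak throughput sits at m=1.04, b=0.98, tau=10.
def toy_benchmark(action):
    m, b, tau = action
    return -((m - 1.04)**2 + (b - 0.98)**2 + 0.001 * (tau - 10)**2)

# One round of SHARP on a 1-D state space: tune, then split the hot rule.
rule = {"region": (0, float("inf")), "action": (1.0, 1.0, 8)}
rule["action"] = tune_action(rule["action"], toy_benchmark)
rules = split_rule(rule, hot_state=12.0)  # hypothetical hot state value
```

The real search alternates these two moves: parameter tuning improves one region's action, while splitting at the most observed state values adds structure where the controller actually spends its time.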

Prototype and Evaluation
An ASCAR prototype for Lustre: patched the Lustre client to add congestion control; no change to the server or other parts.
Hardware: 5 servers, 5 clients. Intel Xeon CPU E3-1230 V2 @ 3.30 GHz, 16 GB RAM, an Intel 330 SSD for the OS, a dedicated 7200 RPM HGST Travelstar Z7K500 hard drive for Lustre, Gigabit Ethernet.

ASCAR is good at increasing throughput and decreasing speed variance

Workload Throughput Improvements
[Scatter plot: speed variance change (from −60% to +30%) vs. throughput improvement (from 0% to 40%) for the workloads BTIO (B4), BTIO (C4), random write, random read, random r/w, sequential read, and sequential write, under "TP only" and "TP and Var" rule sets. Points farther right have higher throughput; points lower down have lower speed variance.]


Our prototype shows that …
ASCAR increases the performance of all tested workloads, by 2% to 36%.
ASCAR works best at boosting write-heavy workloads: 25% to 36%.
For certain workloads, ASCAR can also lower speed variance, by more than 50%.

Limitations
Training is currently offline.
The offline training can take hours.
Re-training is needed when the workload changes.

Handling a new workload without training
Measure the workload's features.
Compare the features with known workloads.
Find the most similar known workload and use its contention control rules.
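The lookup described above can be sketched as a nearest-neighbor search over workload feature vectors. The feature vectors, workload names, rule-set labels, and the Euclidean distance metric below are all illustrative assumptions:

```python
import math

# Hypothetical feature vectors: (read/write ratio, avg. sequential read
# size in MB, avg. sequential write size in MB), one per known workload.
known_workloads = {
    "btio_b4":    {"features": (1.5, 60.0, 40.0),  "rules": "ruleset_btio"},
    "rand_write": {"features": (0.1, 0.5, 0.4),    "rules": "ruleset_randw"},
    "seq_read":   {"features": (9.0, 120.0, 1.0),  "rules": "ruleset_seqr"},
}

def closest_rules(features):
    """Return the rule set of the known workload whose feature vector
    is nearest (Euclidean distance) to the measured one."""
    name = min(known_workloads,
               key=lambda n: math.dist(features,
                                       known_workloads[n]["features"]))
    return known_workloads[name]["rules"]

# A new write-heavy workload with short sequential runs looks most like
# the known random-write workload, so it borrows those rules:
picked = closest_rules((0.2, 0.8, 0.6))
```

In practice the features would need normalization so that large-valued features (sizes in MB) do not dominate small-valued ones (ratios); the next slides motivate which features matter.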

Components of the ASCAR prototype
[The component diagram repeated; see the earlier figure]

Finding the right features
Workloads can put radically different pressure on the underlying system, so they require different congestion control rules.
The features should reflect the workload's pressure on the underlying system.
The combined workload from many clients is a mix of read/write and random/sequential.

A 75% sequential + 25% random workload can be very different from another
[Illustration: one stream issues requests p, p+100, p+101, p+120, p+121, p+122, p+200, p+201 (short sequential runs separated by jumps); another mixes random requests (p+100, p+200, p+351) with a fully sequential run (p, p+1, p+2, p+3, p+4).]

And there are many different 60% read + 40% write workloads out there
[Illustration: different interleavings of reads and writes with the same overall ratio, e.g. Read Read Read Write Write Write Read Read vs. Write Read Write Read …]

The feature set we use
Op type: read/write/metadata.
Ratios between ops (read to write, read/write to metadata, etc.).
For each type of op, we measure the following features:
1. average size of sequential ops
2. average positional gap between sequential ops
3. average temporal gap between sequential ops
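One way such features might be computed from a request trace. The trace format of `(op, offset, size, time)` tuples, the sequentiality threshold, and the exact gap definitions are assumptions for illustration, not the prototype's implementation:

```python
def workload_features(trace, seq_gap=1):
    """Compute per-op-type features from a trace of
    (op, offset, size, time) records: the read/write ratio, the average
    size of sequential runs, and the average positional gap between runs."""
    runs = {"read": [], "write": []}      # sizes of sequential runs
    last = {}                             # per-op (end_offset, end_time)
    counts = {"read": 0, "write": 0}
    gaps_pos = {"read": [], "write": []}  # positional gaps between runs
    for op, off, size, t in trace:
        counts[op] += 1
        if op in last and off - last[op][0] <= seq_gap:
            runs[op][-1] += size          # extends the current run
        else:
            if op in last:                # a new run starts: record the gap
                gaps_pos[op].append(off - last[op][0])
            runs[op].append(size)
        last[op] = (off + size, t)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "rw_ratio": counts["read"] / max(counts["write"], 1),
        "avg_seq_read": avg(runs["read"]),
        "avg_seq_write": avg(runs["write"]),
        "avg_pos_gap_read": avg(gaps_pos["read"]),
    }

trace = [("read", 0, 4, 0.0), ("read", 4, 4, 0.1),    # one sequential run
         ("read", 100, 4, 0.2),                       # gap: a new run
         ("write", 0, 8, 0.3), ("write", 8, 8, 0.4)]  # one write run
feats = workload_features(trace)
```

For the toy trace, the read run sizes are 8 and 4 (average 6), the single write run has size 16, and the read/write count ratio is 3/2.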

Sample: two different 60% read + 40% write workloads
[Illustration: both have a read-to-write ratio of 60/40. In the first, the average size of a sequential read is 60 MB and of a sequential write 40 MB; in the second, 15 MB and 13 MB respectively.]

Calculating similarity of workloads
Goal: determine whether two workloads can use the same contention control rules.
How: start from a known workload and tweak its features, measuring the efficiency of the existing rules on the changed workload.
Result: we used the results of hundreds of benchmarks to generate a decision tree.

Calculating similarity of workloads
[Figure: a decision tree that splits on rw_ratio at 7 and then on read_size at 75, with linear models (e.g., Linear model 1, Linear model 3) at the leaves]

This decision tree can precisely predict the performance of using a specific rule set with a new workload:
Correlation coefficient: 0.8676
Mean absolute error: 0.038
Root mean squared error: 0.0462
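Read as code, such a model tree is just nested splits with a linear model at each leaf. The split points (rw_ratio at 7, read_size at 75) come from the figure; the leaf coefficients and the middle leaf are made up for illustration:

```python
def predict_rule_efficiency(rw_ratio, read_size):
    """Model tree in the shape shown on the slide: split on rw_ratio
    at 7, then on read_size at 75; each leaf is a linear model over
    the features. All coefficients are illustrative placeholders."""
    if rw_ratio <= 7:
        # Linear model 1
        return 0.5 + 0.02 * rw_ratio - 0.001 * read_size
    if read_size <= 75:
        # Linear model 2 (hypothetical middle leaf)
        return 0.6 + 0.003 * read_size
    # Linear model 3
    return 0.8 - 0.0005 * read_size

eff = predict_rule_efficiency(rw_ratio=3, read_size=60)
```

The predicted efficiency score is what lets ASCAR decide whether a known rule set is worth reusing on an unseen workload before any training happens.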

Not a Conclusion: ASCAR …
Does not require knowledge of the system or workloads: fully unsupervised.
Only needs client-side controllers: no need to change hardware, applications, the network, or server software.
Can improve highly changeable workloads like bursty I/O.
Work-conserving: no wasted bandwidth.

Not a Conclusion: ASCAR …
Needs to be evaluated on a larger scale: we are looking for collaborators.
Can be evaluated using a patched client (2.4, 2.7): no need to change the server or hardware.

Not a Conclusion: ASCAR …
The paper and the source code of our prototype are published at http://ascar.io

Future work: online rule optimization
The current ASCAR prototype requires a lengthy offline learning process.
Use cloud computing services for doing evaluation on a larger scale.
Tweak rules online using random-restart hill climbing.
Also evaluate the ASCAR algorithm on other workloads: databases, web services.
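Random-restart hill climbing over the action parameters could look like the sketch below. The objective function is a toy stand-in for measured throughput, and the step size, ranges, and restart count are all made-up parameters:

```python
import random

def hill_climb(objective, start, step=0.05, iters=200):
    """Greedy local search: repeatedly move any coordinate by +/- step
    whenever that strictly improves the objective."""
    x = list(start)
    best = objective(x)
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[i] += delta
                score = objective(cand)
                if score > best:
                    x, best, improved = cand, score, True
        if not improved:   # local optimum reached
            break
    return x, best

def random_restart(objective, dim, restarts=20, seed=42):
    """Restart hill climbing from random points; keep the global best.
    Restarts let the search escape poor local optima."""
    rng = random.Random(seed)
    best_x, best_score = None, float("-inf")
    for _ in range(restarts):
        start = [rng.uniform(0.0, 2.0) for _ in range(dim)]
        x, score = hill_climb(objective, start)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Toy multimodal objective standing in for measured throughput: a global
# peak at (1.0, 0.9) and a worse local peak near p[0] = 1.8.
obj = lambda p: (-min((p[0] - 1.0)**2, (p[0] - 1.8)**2 + 0.5)
                 - (p[1] - 0.9)**2)
x, score = random_restart(obj, dim=2)
```

For online use, each objective evaluation would be replaced by running the candidate rules briefly and measuring throughput, which is why a search that needs few evaluations per step is attractive.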

Acknowledgments
This research was supported in part by the National Science Foundation under awards IIP-1266400, CCF-1219163, and CNS-1018928, the Department of Energy under award DE-FC02-10ER26017/DESC0005417, a Symantec Graduate Fellowship, and industrial members of the Center for Research in Storage Systems. We would like to thank the sponsors of the Storage Systems Research Center (SSRC), including Avago Technologies, the Center for Information Technology Research in the Interest of Society (CITRIS of UC Santa Cruz), the Department of Energy/Office of Science, EMC, Hewlett Packard Laboratories, Intel Corporation, the National Science Foundation, NetApp, SanDisk, Seagate Technology, Symantec, and Toshiba for their generous support.

ASCAR project: http://ascar.io
Contact: Yan Li
