Increasing Performance Through Automated Contention Management
(LUG’16)
Yan Li, Xiaoyuan Lu, Ethan Miller, Darrell Long
Storage Systems Research Center (SSRC), University of California, Santa Cruz (an Intel® Parallel Computing Center)
Challenge: consistent performance at peak times
Congestion harms efficiency and throughput.
[Chart: total throughput under 100, 150, and 200 clients]
Challenge: consistent performance at peak times
Congestion causes fluctuation.
[Chart: client throughput of a random write workload, 5 nodes accessing 5 servers]
Challenge: consistent performance at peak times
Congestion can occur anywhere in the cluster.
The problem we are trying to solve
Picture credit: https://www.checkfelix.com/reiseblog/10-argumente-urlaub-schottland/
The problem we are trying to solve
Improve throughput, fairness, or both at the same time during congestion.
End-to-end coverage: handle congestion at the OSC, network, OSS, and OST.
Fully automatic, requiring little human effort: modern systems are highly dynamic, and there is no time to build models by hand.
Rate limiting can improve performance… if done properly
[Chart: total throughput under 100, 150, and 200 clients, and 200 clients with contention control]
Challenges of distributed I/O rate control: 1. Where is the sweet spot?
[Chart: real vs. ideal throughput as a function of the rate limit. A rate limit that is too low throttles clients below the ideal throughput; one that is too high lets congestion return; the sweet spot lies in between.]
Capability discovery usually involves communication:
• between clients
• with a central controller
Challenges of distributed I/O rate control: 2. Scalability
Inter-node communication can grow at O(n²):
• adds overhead to an already congested network
• low responsiveness for highly dynamic workloads
ASCAR: Automatic Storage Contention Alleviation and Reduction
Client-side, rule-based I/O rate control:
1. no need for central scheduling or coordination; nimble and highly responsive
2. no need to change server software or hardware
3. no scale-up bottleneck
Uses machine learning and heuristics for rule generation and optimization; no prior knowledge of the system or workload is required.
Components of the ASCAR prototype
[Diagram:
• ASCAR Rule Manager: Rule Generator & Optimizer (SHARP), Workload Character & Rule Set Database
• Client node: application, Workload Classifier, file system client with the ASCAR traffic controller
• Storage servers
Traffic rules flow from the Rule Manager to the clients; workload character markers flow from the clients back to the Rule Manager.]
Rule-based Contention Control
Rules tell the controller how to react to congestion: tweak the congestion window according to request processing latency.
Each client tracks three congestion state (CS) statistics: ack_ewma, send_ewma, and pt_ratio.
Each rule maps a congestion state to an action:
(CS statistics) → Action
An action describes how to change the congestion window (max_rpcs_in_flight) and the rate limit τ:
new_cong_window = m × cong_window + b
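As a rough sketch of this mechanism (the data structures and names below are illustrative assumptions, not the prototype's actual Lustre client code), a rule is a region of the congestion-state space plus an action:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    # Congestion-state region this rule covers (half-open intervals)
    ack_ewma_lo: float
    ack_ewma_hi: float
    pt_ratio_lo: float
    pt_ratio_hi: float
    m: float    # multiplier for the congestion window
    b: float    # additive term
    tau: float  # rate limit

def apply_rules(rules, ack_ewma, pt_ratio, cong_window):
    """Find the rule whose region contains the current congestion
    state and apply its action: new_window = m * window + b."""
    for r in rules:
        if (r.ack_ewma_lo <= ack_ewma < r.ack_ewma_hi and
                r.pt_ratio_lo <= pt_ratio < r.pt_ratio_hi):
            return max(1.0, r.m * cong_window + r.b), r.tau
    return cong_window, None  # no rule matched: leave the window unchanged

# Example: a single rule covering the whole state space
rules = [Rule(0, float("inf"), 0, float("inf"), m=0.9, b=1.0, tau=8)]
new_window, rate_limit = apply_rules(rules, ack_ewma=120.0, pt_ratio=1.4,
                                     cong_window=10.0)
# 0.9 * 10 + 1 = 10: the window stays put for this state
```

In ASCAR the matching happens inside the patched client on every state update; the rule table itself is produced offline by SHARP.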
Generating a rule set for a certain workload
Extract a short signature workload that covers the important features of the application's I/O workload; a signature workload is usually 20–30 seconds long.
Generate candidate rules and benchmark them with the signature workload on the real system: test possible combinations of the action variables.
Split the hottest rule in the set to generate more rules: structural improvement vs. parameter tuning.
Begin with one rule: the whole state space (ack_ewma × pt_ratio, each from 0 to INT_MAX) maps to one action.
Try different values of the action (m, b, τ) with the workload: the Cartesian product of
{m ± 0.01, m ± 0.02, m ± 0.04, …} × {b ± 0.01, b ± 0.02, b ± 0.04, …} × {τ ± 1, τ ± 2, τ ± 4, …}
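The candidate sweep above is just a Cartesian product of per-parameter neighborhoods; a minimal sketch (the step count and base values are arbitrary, and the unchanged value is included as one of the candidates):

```python
import itertools

def neighbors(m, b, tau, steps=3):
    """Candidate actions around (m, b, tau): every combination of
    m +/- 0.01 * 2^k, b +/- 0.01 * 2^k, tau +/- 2^k for k = 0..steps-1,
    plus the unchanged value of each parameter."""
    ms = [m] + [m + s * 0.01 * 2 ** k for k in range(steps) for s in (1, -1)]
    bs = [b] + [b + s * 0.01 * 2 ** k for k in range(steps) for s in (1, -1)]
    taus = [tau] + [tau + s * 2 ** k for k in range(steps) for s in (1, -1)]
    return list(itertools.product(ms, bs, taus))

cands = neighbors(1.0, 0.0, 8)
# 7 values per parameter -> 7^3 = 343 candidate actions
```

Even three doubling steps per parameter yield hundreds of candidates per rule, which is why each candidate is benchmarked against a short signature workload rather than the full application.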
Find the rule that yields the highest performance.
Split the state space at the most observed state values, producing four rules (one action per quadrant).
Run the workload and find the rules that were triggered most often (e.g., one action used 32k times vs. others used 2k, 0.3k, and 120 times).
Improve the most used rule by sweeping the same Cartesian product of candidate values:
{m ± 0.01, m ± 0.02, m ± 0.04, …} × {b ± 0.01, b ± 0.02, b ± 0.04, …} × {τ ± 1, τ ± 2, τ ± 4, …}
Find the action that works best.
Split the most used rule's state space at the most observed state values.
Run the workload again, find the most used rule, and improve it.
After finding the best action, split the most used rule.
Repeat this process.
[State space diagram: the rule set keeps growing as the most used regions are split further.]
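Taken together, the steps above form an iterative search that alternates parameter tuning with structural splitting. A schematic sketch of the loop (the benchmark, candidate, and split functions are stand-ins; the real SHARP runs the signature workload on the live cluster):

```python
def sharp(rules, benchmark, candidates, split, rounds=4):
    """Alternate two phases: (1) sweep candidate actions for the
    most-used rule and keep the best performer; (2) split that
    rule's state region at the most observed state values."""
    for _ in range(rounds):
        usage, perf = benchmark(rules)           # run the signature workload
        hot = max(range(len(rules)), key=lambda i: usage[i])
        best, best_perf = rules[hot], perf
        for cand in candidates(rules[hot]):      # phase 1: tune the action
            trial = rules[:hot] + [cand] + rules[hot + 1:]
            _, p = benchmark(trial)
            if p > best_perf:
                best, best_perf = cand, p
        rules[hot] = best
        rules[hot:hot + 1] = split(rules[hot])   # phase 2: structural split
    return rules

# Demo with stub components (a real run benchmarks on the file system):
start = [{"m": 1.0, "b": 0.0, "tau": 8}]
bench = lambda rs: ([1] * len(rs), 0.0)  # (per-rule usage counts, performance)
cands = lambda r: []                     # no candidate actions in this stub
halve = lambda r: [dict(r), dict(r)]     # split one rule into two copies
final = sharp(start, bench, cands, halve, rounds=2)
```

Each round grows the rule set by one region in this stub; in practice the search stops when further splitting no longer improves the benchmarked performance.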
Prototype and Evaluation
An ASCAR prototype for Lustre: patched the Lustre client to add congestion control; no change to servers or other parts.
Hardware: 5 servers, 5 clients. Intel Xeon CPU E3-1230 V2 @ 3.30 GHz, 16 GB RAM, Intel 330 SSD for the OS, a dedicated 7200 RPM HGST Travelstar Z7K500 hard drive for Lustre, Gigabit Ethernet.
ASCAR is good at increasing throughput and decreasing speed variance.
Workload Throughput Improvements
[Scatter plot: x axis = throughput improvement (0% to 40%); y axis = speed variance change (−60% to +30%). Two rule sets per workload: trained for throughput only (TP only) and for throughput and variance (TP and Var). Workloads: BTIO (B4), BTIO (C4), random write, random r/w, random read, sequential read, sequential write. Points farther right have higher throughput; points farther down have lower speed variance.]
Our prototype shows that…
ASCAR increases the performance of all workloads, by 2% to 36%.
ASCAR works best at boosting write-heavy workloads: 25% to 36%.
For certain workloads, ASCAR can lower speed variance at the same time, by more than 50%.
Limitations
Training is currently offline; it can take hours, and the rules need to be re-trained when the workload changes.
Handling a new workload without training
Measure the workload's features, compare them with known workloads, find the most similar known workload, and use its contention control rules.
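A minimal sketch of this matching step (the feature vectors and known-workload table are made-up stand-ins; ASCAR's actual classifier is built from the decision tree described later):

```python
import math

def match_workload(features, known):
    """Pick the known workload whose feature vector is closest
    (Euclidean distance) and return its contention control rules."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    name = min(known, key=lambda k: dist(features, known[k]["features"]))
    return name, known[name]["rules"]

# Hypothetical database entries: (rw_ratio, avg seq read MB, gap factor)
known = {
    "btio_b4": {"features": [0.6, 60.0, 1.2], "rules": "rules_btio_b4"},
    "rand_wr": {"features": [0.0, 0.1, 0.0],  "rules": "rules_rand_wr"},
}
name, matched_rules = match_workload([0.55, 55.0, 1.0], known)
# the new workload lands nearest to btio_b4
```

A plain nearest-neighbor lookup like this only works if the features actually predict which rules transfer between workloads, which is what the following slides address.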
Components of the ASCAR prototype (revisited)
[Same component diagram; here the Workload Classifier on each client matches the measured workload character markers against the Workload Character & Rule Set Database to pick traffic rules.]
Finding the right features
Workloads can put radically different pressure on the underlying system and therefore require different congestion control rules. The features should reflect the workload's pressure on the underlying system. The combined workload from many clients is a mix of read/write and random/sequential.
A 75% sequential + 25% random workload can be very different from another
[Example request streams: short sequential runs at scattered offsets (p, p+100, p+101, p+120, p+121, p+122, p+200, p+201); purely random requests (p+100, p+200, p+351); purely sequential requests (p, p+1, p+2, p+3, p+4).]
And there are many different 60% read + 40% write workloads out there
[Example interleavings with the same read/write ratio: runs of reads followed by runs of writes (R R R W W W) vs. finely interleaved reads and writes (R R W R W R).]
The feature set we use
Op type: read/write/metadata. Ratios between ops (read to write, read/write to metadata, etc.). For each type of op, we measure:
1. average size of sequential ops
2. average positional gap between sequential ops
3. average temporal gap between sequential ops
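A simplified sketch of extracting two of these features from a request trace (the trace format and the sequentiality test are assumptions; a real trace also carries timestamps for the temporal-gap feature):

```python
def extract_features(ops):
    """ops: list of (op_type, offset, size) tuples. Returns the
    read/write ratio and, per op type, the average size of sequential
    runs (requests whose offset continues the previous request)."""
    counts = {"read": 0, "write": 0}
    run_sizes = {"read": [], "write": []}
    last_end = {}   # end offset of the previous request, per op type
    run = {}        # size of the current sequential run, per op type
    for op, off, size in ops:
        counts[op] += 1
        if last_end.get(op) == off:      # continues the previous request
            run[op] = run.get(op, 0) + size
        else:                            # run broken: record it, start anew
            if run.get(op):
                run_sizes[op].append(run[op])
            run[op] = size
        last_end[op] = off + size
    for op, r in run.items():            # flush the final runs
        if r:
            run_sizes[op].append(r)
    rw_ratio = counts["read"] / max(1, counts["write"])
    avg = {op: sum(v) / len(v) if v else 0.0 for op, v in run_sizes.items()}
    return rw_ratio, avg

# Two sequential 4 KB reads (one run of 8 KB) and one 4 KB write
rw, avg = extract_features([("read", 0, 4096), ("read", 4096, 4096),
                            ("write", 102400, 4096)])
```
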
Sample: different 60% read + 40% write workloads
[Workload A: read-to-write ratio 60/40, average sequential read size 60 MB, average sequential write size 40 MB.
Workload B: read-to-write ratio 60/40, average sequential read size 15 MB, average sequential write size 13 MB.]
Calculating similarity of workloads
Goal: determine whether two workloads can use the same contention control rules.
How: start from a known workload and tweak its features, measuring the efficiency of the existing rules on the changed workload.
Result: we used the results of hundreds of benchmarks to generate a decision tree.
Calculating similarity of workloads
[Decision tree excerpt: first split on rw_ratio at 7, then on read_size at 75; leaves hold linear models (e.g., Linear model 1, Linear model 3).]
This decision tree can precisely predict the performance of using a specific rule set with a new workload. Correlation coefficient: 0.8676; mean absolute error: 0.038; root mean squared error: 0.0462.
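For reference, the three quality metrics quoted above have standard definitions; a small sketch (generic formulas, not the paper's evaluation code):

```python
import math

def metrics(pred, actual):
    """Correlation coefficient, mean absolute error, and root mean
    squared error between predicted and measured performance."""
    n = len(pred)
    mae = sum(abs(p - a) for p, a in zip(pred, actual)) / n
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / n)
    mp, ma = sum(pred) / n, sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(pred, actual))
    corr = cov / math.sqrt(sum((p - mp) ** 2 for p in pred) *
                           sum((a - ma) ** 2 for a in actual))
    return corr, mae, rmse

corr, mae, rmse = metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# a perfect predictor gives corr = 1.0 and zero error
```
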
Not a Conclusion: ASCAR…
Does not require knowledge of the system or workloads: fully unsupervised.
Only needs client-side controllers: no need to change hardware, applications, the network, or server software.
Can improve highly changeable workloads, like burst I/O.
Work-conserving: no wasted bandwidth.
Not a Conclusion: ASCAR…
Needs to be evaluated at a larger scale: we are looking for collaborators.
Can be evaluated using a patched client (Lustre 2.4, 2.7): no need to change servers or hardware.
Not a Conclusion: ASCAR…
The paper and source code of our prototype are published at http://ascar.io
Future work: online rule optimization
The current ASCAR prototype requires a lengthy offline learning process. Plans:
• use cloud computing services for evaluation at a larger scale
• tweak rules online using random-restart hill climbing
• evaluate the ASCAR algorithm on other workloads: databases, web services
Acknowledgments
This research was supported in part by the National Science Foundation under awards IIP-1266400, CCF-1219163, and CNS-1018928; the Department of Energy under award DE-FC02-10ER26017/DESC0005417; a Symantec Graduate Fellowship; and industrial members of the Center for Research in Storage Systems. We would like to thank the sponsors of the Storage Systems Research Center (SSRC), including Avago Technologies, the Center for Information Technology Research in the Interest of Society (CITRIS of UC Santa Cruz), the Department of Energy/Office of Science, EMC, Hewlett Packard Laboratories, Intel Corporation, the National Science Foundation, NetApp, SanDisk, Seagate Technology, Symantec, and Toshiba for their generous support.
ASCAR project: http://ascar.io
Contact: Yan Li