End-to-End Performance Management for Large, Distributed Storage. Scott A. Brandt, Carlos Maltzahn, Anna Povzner, Roberto Pineiro, Andrew Shewmaker, and Tim Kaldewey, Computer Science Department, University of California, Santa Cruz; Richard Golding and Ted Wong, IBM Almaden Research Center. HEC FSIO Conference, August 5, 2008
Distributed systems need performance guarantees • Many distributed systems and applications need (or want) I/O performance guarantees • Multimedia, high-performance simulation, transaction processing, virtual machines, service level agreements, real-time data capture, sensor networks, ... • Systems tasks like backup and recovery • Even so-called best-effort applications
• Providing such guarantees is difficult because it involves: • Multiple interacting resources • Dynamic workloads • Interference among workloads • Non-commensurable metrics: CPU utilization, network throughput, cache space, disk bandwidth
End-to-end I/O performance guarantees • Goal: Improve end-to-end performance management in large distributed systems • Manage performance • Isolate traffic • Provide high performance
• Targets: High-performance storage (LLNL), data centers (LANL), satellite communications (IBM), virtual machines (VMware), sensor networks, ... • Approach: 1. Develop a uniform model for managing performance 2. Apply it to each resource 3. Integrate the solutions
Stages in the I/O path
[Figure: the I/O path from applications through client caches and network transport to the storage cache, I/O scheduler, and disk]
1. Disk I/O: I/O selection and head scheduling
2. Server cache: prefetch and writeback based on utilization and QoS; integration between client and server caches
3. Flow control across the network: within one client's session and between clients; connection management between clients
4. Client cache
System architecture
[Figure: a client sends a reservation (1) to the QoS broker, the broker makes utilization reservations (2) at the storage servers, and the admitted client then issues requests (3) and I/O (4) to the storage servers' networks, server caches, and disks]
• Client: task, host, distributed application, VM, file, ...
• Reservations made via broker
• Specify workload: throughput, read/write ratio, burstiness, etc.
• Broker does admission control
• Requirements + workload are translated to utilization
• Utilizations are summed to check feasibility
• Once admitted, I/O streams are guaranteed (subject to workload adherence)
• Disk, caches, and network controllers maintain the guarantees
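As a rough illustration of the broker's admission test, here is a minimal sketch (hypothetical names and workload translation; the slides do not give the actual algorithm): each reservation is translated into a utilization fraction, and a stream is admitted only if the sum of utilizations stays feasible.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    """A client's workload description, as submitted to the broker."""
    throughput_mbps: float   # requested throughput
    device_rate_mbps: float  # rate the device sustains for this workload

    @property
    def utilization(self) -> float:
        # Translate requirements + workload into a utilization fraction.
        return self.throughput_mbps / self.device_rate_mbps

class QoSBroker:
    """Admission control: admit a stream only if total utilization is feasible."""
    def __init__(self, capacity: float = 1.0):
        self.capacity = capacity
        self.admitted = []  # list of Reservation

    def admit(self, r: Reservation) -> bool:
        total = sum(x.utilization for x in self.admitted) + r.utilization
        if total <= self.capacity:
            self.admitted.append(r)  # guaranteed, subject to workload adherence
            return True
        return False

broker = QoSBroker()
print(broker.admit(Reservation(throughput_mbps=20, device_rate_mbps=100)))  # True (20%)
print(broker.admit(Reservation(throughput_mbps=90, device_rate_mbps=100)))  # False (would exceed 100%)
```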
Achieving robust guaranteeable resources • Goal: Unified resource management algorithms capable of providing • Good performance • Arbitrarily hard or soft performance guarantees with • Arbitrary resource allocations • Arbitrary timing / granularity
• Complete isolation between workloads • All resources: CPU, disk, network, server cache, client cache
➡Virtual resources indistinguishable from “real” resources with fractional performance
Isolation is key • CPU • 20% of a 3 GHz CPU should be indistinguishable from a 600 MHz CPU • Running: compiler, editor, audio, video
• Disk • 20% of a disk with 100 MB/second bandwidth should be indistinguishable from a disk with 20 MB/second bandwidth • Serving: 1 stream, n streams, sequential, random
Epistemology of virtualization • Virtual Machines and LUNs provide good HW virtualization • Question: Given perfect HW virtualization, how can a process tell the difference between a virtual resource and a real resource? • Answer: By not getting its share of the resource when it needs it
Observation • Resource management consists of two distinct decisions • Resource allocation: how much of each resource to allocate? • Dispatching: when to provide the allocated resources?
• Most resource managers conflate them • Best-effort, proportional-share, real-time
Separating them is powerful! Hard Real-Time
[Figure: a 2-D taxonomy with resource allocation (constrained vs. unconstrained) on one axis and dispatching (constrained vs. unconstrained) on the other; hard real-time constrains both, rate-based and missed-deadline SRT constrain allocation, and soft real-time, CPU-bound, I/O-bound, and best-effort tasks fill the remaining regions]
• Separately managing resource allocation and dispatching gives direct control over the delivery of resources to tasks
• Enables direct, integrated support of all types of timeliness needs
Supporting different timeliness requirements with RAD
• Hard real-time: period and WCET
• Rate-based: rate and rate bounds
• Soft real-time: period and ACET
• Best-effort: priority
[Figure: the scheduling policy translates each class's parameters into a set of jobs with budgets and deadlines, which the runtime system hands to the scheduling mechanism's dispatcher for processes P_i]
Rate-Based Earliest Deadline (RBED) CPU scheduler • Processes have rate & period • ∑rates ≤ 100% • Periods based on processing characteristics, latency needs, etc.
• Jobs have budget & deadline • budget = rate * period • Deadlines based on period or other characteristics
• Jobs dispatched via Earliest Deadline First (EDF) • Budgets enforced with timers • Guaranteeing all budgets & deadlines guarantees all rates & periods
[Figure: in RBED, the scheduling policy assigns rates and deadlines, the runtime system maintains a set of jobs with budgets and deadlines, and the scheduling mechanism is EDF + timers]
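A minimal simulation sketch of RBED's mechanism (hypothetical code, not the actual scheduler): each job's budget is rate × period, the dispatcher always runs the job with the earliest deadline, and a per-tick budget check stands in for the timers.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    rate: float      # fraction of the CPU, 0 < rate <= 1
    period: float    # deadline granularity (e.g., ms)
    deadline: float = 0.0
    budget: float = 0.0

def rbed(tasks, horizon, tick=1.0):
    """EDF dispatch with budgets = rate * period, enforced each tick."""
    assert sum(t.rate for t in tasks) <= 1.0        # admission: sum of rates <= 100%
    for t in tasks:
        t.deadline, t.budget = t.period, t.rate * t.period
    now = 0.0
    while now < horizon:
        for t in tasks:                             # new job at each period boundary
            if now >= t.deadline:
                t.deadline += t.period
                t.budget = t.rate * t.period        # budget = rate * period
        runnable = [t for t in tasks if t.budget > 0]
        if runnable:
            t = min(runnable, key=lambda x: x.deadline)  # Earliest Deadline First
            t.budget -= min(tick, t.budget)         # timers enforce the budget
            print(f"t={now:6.1f}: {t.name}")
        now += tick                                 # preemption at tick granularity

rbed([Task("video", 0.3, 33.0), Task("sim", 0.5, 100.0)], horizon=200.0)
```

With both tasks admitted (0.3 + 0.5 ≤ 1.0), the new video job released at t = 33 preempts the longer-period simulation job, so every budget and deadline, and hence every rate and period, is met.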
Adapting RAD to disk, network, and buffer cache
• Guaranteed disk request scheduling (Anna Povzner)
• Guaranteeing storage network performance (Andrew Shewmaker, UCSC and LANL)
• Buffer management for I/O guarantees (Roberto Pineiro)
Guaranteed disk request scheduling • Goals • Hard and soft performance guarantees • Isolation between I/O streams • Good I/O performance
• Challenging because disk I/O is:
• Stateful
• Non-deterministic
• Non-preemptable
• Best- and worst-case times vary by 3–4 orders of magnitude
Fahrrad
• Manages disk time instead of disk throughput
• Adapts RAD/RBED to disk I/O
• Reorders aggressively to provide good performance, without violating guarantees
[Figure: I/O streams A, B, C, and a best-effort (BE) stream feed into Fahrrad, which schedules them onto the disk]
A bit more detail
• Reservations are in terms of disk time utilization and period (granularity)
• All I/Os feasible before the earliest deadline in the system are moved to a Disk Scheduling Set (DSS)
• I/Os in the DSS are issued in the most efficient way
• The I/O charging model is critical
• An overhead reservation ensures exact utilization:
• 2 WCRTs (worst-case request times) per period for "context switches"
• 1 WCRT per period to ensure the last I/Os complete
• 1 WCRT for the process with the shortest period, due to non-preemptability
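A rough sketch of the DSS mechanism (hypothetical code with assumed names like WCRT and Stream; the actual Fahrrad charging model is more involved): requests move into the DSS while their stream has budget and they still fit before the earliest deadline in the system, and the DSS is then reordered for efficiency, here simply by block address as a stand-in for real head scheduling.

```python
from dataclasses import dataclass, field

WCRT = 0.015  # assumed worst-case request time in seconds (hypothetical value)

@dataclass
class Stream:
    deadline: float   # end of the stream's current period
    budget: float     # remaining disk time this period
    queue: list = field(default_factory=list)  # pending I/Os as block addresses

def fill_dss(streams, now=0.0):
    """Move every I/O feasible before the earliest deadline in the system
    into the Disk Scheduling Set, then reorder it for efficiency."""
    earliest = min(s.deadline for s in streams)
    dss = []
    for s in streams:
        while s.queue and s.budget >= WCRT and now + WCRT <= earliest:
            dss.append(s.queue.pop(0))  # charge worst-case time up front
            s.budget -= WCRT
            now += WCRT
    return sorted(dss)  # issue in block-address order

a = Stream(deadline=2.0, budget=0.4, queue=[900, 100, 500])
b = Stream(deadline=0.125, budget=0.025, queue=[300, 700])
print(fill_dss([a, b]))  # [100, 300, 500, 900]: reordered across streams
```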
Fahrrad guarantees utilization, isolation, throughput
[Figure: utilization (%) and throughput (I/Os per second) vs. the period of the 4th process]
• 4 SRT processes, sequential I/Os, reservations = 20% each
• 3 with period = 2 seconds
• 1 with period varied from 125 ms to 2 seconds
• Result: perfect isolation between streams
Fahrrad bounds latency
[Figure: fraction of I/Os vs. latency (ms), with latency bounds at each stream's period]
• Period bounds latency
Fahrrad outperforms Linux
[Figure: throughput under stock Linux vs. Linux with Fahrrad]
• Workload
• Media 1: 400 sequential I/Os per second (20%)
• Media 2: 800 sequential I/Os per second (40%)
• Transaction: short bursts of random I/Os at random times (30%)
• Background: random (10%)
• Result: Better isolation AND better throughput
Guaranteeing storage network performance • Goals • Hard and soft performance guarantees • Isolation between I/O streams • Good I/O performance
• Challenging because network I/O is: • Distributed • Non-deterministic (due to collisions or switch queue overflows) • Non-preemptable
• Assumption: closed network
What we want
[Figure: three clients with guaranteed network shares of 30%, 50%, and 20%, communicating with a set of storage servers]
What we have
• Switched fat tree w/full bisection bandwidth • Issue 1: Capacity of shared links • Issue 2: Switch queue contention
Radon—RAD on the Network • RAD-based network reservations ~ disk reservations • Network utilization and period
• Fully decentralized • Two controls: • Window size w = # of network packets per transmission window • Offset = delay before transmitting window
• Two pieces of information: • Dropped packets (TCP) • Forward delay (TCP Santa Cruz)
Flow control and congestion control • Flow control determines how much data is transmitted • Fixed window sizes (TCP) • Adjust window size based on congestion (TCP Santa Cruz)
• Congestion control determines what to do when congestion is detected • Packet loss ➔ backoff then retransmit (TCP) • ∆ Forward delay ➔ ∆ window size (TCP Santa Cruz)
Packet-loss-based congestion control (TCP)
[Figure: the bottleneck switch queue overflows and the client observes packet loss]
• C. Parsa and J.J. Garcia-Luna-Aceves, "Improving TCP Congestion Control Over Internets with Heterogeneous Transmission Media," Proceedings of the 7th IEEE ICNP, 1999.
Performance vs. offered load (with nuttcp)
[Figure: achieved load and packet loss vs. offered load]
• The switch performs well below 800 Mbps, regardless of the number of clients
• Each client is limited to 600 Mbps due to software/NIC issues
Delay-based congestion control (TCP Santa Cruz)
[Figure: the client observes increased delay as the bottleneck queue settles at its operating point]
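To make the delay signal concrete, here is a loose sketch (hypothetical; TCP Santa Cruz's actual estimator differs in detail): the sender compares each measured forward delay against a baseline, converts the excess into an estimated queue depth, and signals congestion when the queue exceeds the operating point, rather than waiting for loss.

```python
def modeled_queue_depth(fwd_delay_s, base_delay_s, pkt_service_s=1e-4):
    """Estimate packets queued at the bottleneck from the extra forward
    delay, as in delay-based congestion detection."""
    return (fwd_delay_s - base_delay_s) / pkt_service_s

def congested(fwd_delay_s, base_delay_s, operating_point=2.0):
    """Signal congestion when the modeled queue exceeds the operating
    point, so the queue settles there instead of overflowing."""
    return modeled_queue_depth(fwd_delay_s, base_delay_s) > operating_point

for d in (0.0011, 0.0015, 0.0009):   # sampled forward delays (seconds)
    print(d, congested(d, base_delay_s=0.001))  # False, True, False
```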
The incast problem
• Incast: many storage servers respond to a single client at once, overflowing the shared switch queue and collapsing TCP throughput
• E. Krevat, V. Vasudevan, A. Phanishayee, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, "On Application-level Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems," Proceedings of the Petascale Data Storage Workshop (PDSW), 2007.
Three approaches
1. Fixed window size without offsets
• Clients send data as it becomes available, up to the per-period budget
• Expected result: heavy packet loss due to incast/queue overflow
2. Fixed window size with per-window offsets
• Clients transmit each window with a random offset in [0, laxity / num_windows]
3. Less Laxity More (see the sketch below)
• Distributed approximation to Least Laxity First
• No congestion ➔ increase window size
• Congestion ➔ move toward w_opt = (1 - %laxity) * w_max
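A minimal sketch of approaches 2 and 3 (hypothetical code; the real Radon clients drive these controls from the TCP/TCP Santa Cruz feedback described above): a per-window random offset spreads transmissions across the available laxity, and the Less Laxity More rule grows the window when uncongested and steps toward w_opt = (1 - %laxity) * w_max under congestion, so streams with little laxity keep larger windows.

```python
import random

def window_offset(laxity_s, num_windows):
    """Approach 2: delay each window by a random offset in
    [0, laxity / num_windows] to avoid synchronized bursts (incast)."""
    return random.uniform(0.0, laxity_s / num_windows)

def llm_window(w, w_max, laxity_frac, congested):
    """Approach 3, Less Laxity More: a distributed approximation to
    Least Laxity First. laxity_frac is remaining slack as a fraction."""
    if not congested:
        return min(w_max, w + 1)                  # no congestion: grow window
    w_opt = (1.0 - laxity_frac) * w_max           # less laxity -> larger target
    return max(1, round(w + 0.5 * (w_opt - w)))   # step toward the target

w = 32
for is_congested, laxity in [(False, 0.8), (True, 0.8), (True, 0.2)]:
    w = llm_window(w, w_max=64, laxity_frac=laxity, congested=is_congested)
    print(w, f"offset={window_offset(laxity_s=0.1, num_windows=10):.4f}s")
```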
Early results: packets sent
[Figure: congestion detection for 4 clients, each with 0.2 utilization; sent packets and arrival of window feedback (sequence numbers) and receive delay (us) vs. time (sec) for clients 1–4]
Early results: send/receive delay
[Figure: send delay, receive delay, and forward delay (us) vs. time (sec) for clients 1–4]
Early results: modeled queue depth
[Figure: forward delay (us) and modeled queue depth (packets) vs. time (sec)]
Early results: lost packets
[Figure: lost packets (sequence numbers) and window size (packets) vs. time (sec)]
Buffer management for I/O guarantees • Goals • Hard and soft performance guarantees • Isolation between I/O streams • Improved I/O performance
• Challenging because: • Buffer is space-shared rather than time-shared • Space limits time guarantees
• Best- and worst-case behaviors are the opposite of the disk's • Buffering affects performance in non-obvious ways
Buffering in storage servers • Staging and de-staging data • Decouples sender and receiver
• Speed matching • Allows slower and faster devices to communicate
• Traffic shaping • Shapes traffic to optimize performance of interfacing devices
• Assumption: reuse primarily occurs at the client
Approach • Partition the buffer cache based on worst-case needs • 1–3 periods' worth
• Use slack to prefetch reads and delay writes (see the sketch below) • Period transformation: the period into the cache may be shorter than the period from the cache to disk • Rate transformation: the rate into the cache may be higher than the disk can sustain
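A back-of-the-envelope sketch of the worst-case partitioning (hypothetical names and values; the slides give only the 1–3 period rule of thumb): each stream gets enough buffer to absorb a few periods' worth of data at its reserved rate, and any slack is then available for prefetching reads and delaying writes.

```python
def partition_buffer(streams, periods_worth=2):
    """Partition the buffer cache by worst-case need: each stream gets
    rate * period * periods_worth bytes (i.e., 1-3 periods' worth)."""
    return {s["name"]: s["rate_bps"] * s["period_s"] * periods_worth
            for s in streams}

streams = [
    {"name": "media",  "rate_bps": 40e6, "period_s": 0.5},  # 40 MB/s, 500 ms
    {"name": "backup", "rate_bps": 10e6, "period_s": 2.0},  # 10 MB/s, 2 s
]
for name, size in partition_buffer(streams).items():
    print(f"{name}: {size / 1e6:.0f} MB")   # media: 40 MB, backup: 40 MB
```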
Conclusion • Distributed I/O performance management requires management of many separate components • An integrated approach is needed • RAD provides the basis for a solution • RAD has been successfully applied to several resources: CPU, disk, network, and buffer cache • We are on our way to an integrated solution • There are many useful applications: Data center performance management, storage virtualization, ...
Publications
1. A. Povzner, T. Kaldewey, S. Brandt, R. Golding, T. Wong, and C. Maltzahn, "Efficient Guaranteed Disk Request Scheduling with Fahrrad," EuroSys, 2008.
2. T. Kaldewey, A. Povzner, T. Wong, R. Golding, S. Brandt, and C. Maltzahn, "Virtualizing Disk Performance," IEEE Real-Time and Embedded Technology and Applications Symposium, 2008. Best Student Paper.
3. J. Wu and S. Brandt, "Providing Quality of Service Support in Object-based File Systems," IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, 2007.
4. S. Brandt, C. Maltzahn, A. Povzner, R. Pineiro, A. Shewmaker, and T. Kaldewey, "An Integrated Model for Performance Management in a Distributed System," Workshop on Operating Systems Platforms for Embedded Real-Time Applications, 2008.
5. D. Bigelow, S. Iyer, T. Kaldewey, R. Pineiro, A. Povzner, S. Brandt, R. Golding, T. Wong, and C. Maltzahn, "End-to-End Performance Management for Scalable Distributed Storage," Petascale Data Storage Workshop at SC07, 2007.
6. A. Povzner, S. Brandt, R. Golding, T. Wong, and C. Maltzahn, "Virtualizing Disk Performance with Fahrrad," Work-in-Progress Session, USENIX Conference on File and Storage Technologies, 2008.
7. T. Kaldewey, A. Shewmaker, C. Maltzahn, T. Wong, and S. Brandt, "RADoN: QoS in Storage Networks," Work-in-Progress Session, USENIX Conference on File and Storage Technologies, 2008.