Adaptive disk scheduling for overload management

Alma Riska, Seagate Research, 1251 Waterfront Place, Pittsburgh, PA 15222, [email protected]
Erik Riedel, Seagate Research, 1251 Waterfront Place, Pittsburgh, PA 15222, [email protected]
Sami Iren, Seagate Research, 1251 Waterfront Place, Pittsburgh, PA 15222, [email protected]

Abstract

Most computer systems today are lightly loaded in normal operation; the real performance problems occur during bursts, when the system becomes overloaded. We evaluate how the choice of scheduling algorithm can help a system maintain stable performance while operating under transient overload. We propose a new disk scheduling algorithm that handles overload efficiently by dynamically adjusting its parameters. The algorithm adapts its operation to the current load conditions and achieves good overall performance, while maintaining minimal variability in request response time. We evaluate the robustness of the algorithm against different disks and against both synthetic traces and realistic traces measured in benchmarked systems.

1. Introduction

Today's computer systems, in particular those supporting Internet applications, are characterized by swift and sharp fluctuations in load [15]. Although designers aim to provide systems with enough resources to sustain the worst-case load, the dynamics of applications and the event-driven nature of the request intensity make the worst-case scenario difficult to predict. The burstiness in request arrivals propagates through all layers of the system, including the network subsystem, memory and caches at multiple levels, and finally the storage subsystem. As an example of burstiness in storage subsystems, previous measurements indicate that load fluctuates severely, reaching as many as 1000 outstanding requests, even when traditional (non-Internet) applications generate the storage subsystem workload [9]. Clearly, the long-term solution for handling persistent overload conditions, i.e., long request queues, is to increase system resources. However, if the system experiences unexpected transient overload, then better management of the available resources and adjustment of system operation can avoid system collapse and allow for graceful degradation of performance.

Given the sudden nature of transient overload conditions, we focus on ways that allow the system to adapt its operation to current load conditions without human intervention. Because our goal is to handle sharp and transient increases in load intensity, we focus on adjusting system operation at fine time scales, i.e., time scales comparable to the average request service time. At the disk level, we achieve this goal by proposing an adaptive disk scheduling algorithm based on disk-level characteristics, by which we mean the disk properties used for efficient scheduling, such as the relative seek and rotational latency of every request. The algorithm adapts its operation on-the-fly based on the disk load.

In addition to building self-adjusting storage subsystems via load-based adaptive disk scheduling algorithms, effort has been put into adapting the disk layout to the disk access pattern to achieve better locality and sequentiality in the stream of requests [10]. In contrast to our approach, that work operates at coarser time scales, i.e., days, rather than the fine time scales, i.e., milliseconds, that we consider. Various model-based approaches have been proposed for adaptive resource management in storage subsystems [1] and in higher layers of the system hierarchy [3].

The remainder of the paper is organized as follows. Section 2 presents an overview of disk scheduling algorithms. We describe the synthetic workload that we use in our simulation-based analysis in Section 3. The evaluation and analysis that leads to our adaptive algorithm is presented in Section 4. We introduce our algorithm and analyze its performance in Section 5. We continue with a performance analysis of our adaptive algorithm under a real workload in Section 6. We summarize our results and conclude in Section 7.
2. Background

Apart from FCFS, there are two major categories of disk scheduling algorithms, namely seek-based and position-based. Seek-based disk algorithms, such as SCAN, LOOK,
Proceedings of the First International Conference on the Quantitative Evaluation of Systems (QEST’04) 0-7695-2185-1/04 $ 20.00 IEEE Authorized licensed use limited to: Seagate Technology. Downloaded on December 3, 2008 at 16:21 from IEEE Xplore. Restrictions apply.
and Shortest Seek Time First (SSTF), schedule the request with the shortest seek time first, where seek time is the time the disk head takes to move from one track to another (on a disk, data is stored in circular tracks, which are logically partitioned into sectors). Position-based algorithms, such as Shortest Positioning Time First (SPTF), schedule the request with the shortest seek plus rotational latency first, where rotational latency is the time a track takes to rotate until the sector with the requested data reaches the head. Numerous papers provide comprehensive analyses of these disk scheduling algorithms [11, 5, 16].

Position-based disk scheduling algorithms achieve the best overall performance [5, 2]. The optimal disk scheduling algorithm generates the schedule that minimizes the time to serve all outstanding requests [2], rather than selecting the request with the shortest positioning time as SPTF does. The optimal algorithm is computationally expensive, however, and in practice disk drives use SPTF to schedule requests; as such, we use SPTF as the base case in our evaluation. Although position- and seek-based disk scheduling algorithms might introduce high variation in request response times [16], their performance benefits outweigh this drawback [8].

A variant of the SPTF algorithm that focuses on reducing variation in request response time is the Batched SPTF (B-SPTF) algorithm [5]. This algorithm partitions requests into batches and applies SPTF only over the requests of the first (oldest) batch; upon completion of the requests in the first batch, the algorithm continues with the requests in the second batch. Another variation of B-SPTF, Leaky B-SPTF, allows new requests to be admitted into the batch in service if a schedule can be found that does not violate the deadline for completely serving the current batch [5]. We propose a very similar algorithm called Window-Based SPTF (WB-SPTF), which applies SPTF over requests that fall within a sliding time window rather than over batches of requests. The arrival time of the oldest request in the queue serves as the starting reference for the time window in WB-SPTF.

The performance of batch-based disk scheduling algorithms, which define the batches either by number of requests (B-SPTF) or by time window (WB-SPTF), depends on a single parameter, namely the size of the batch or the size of the time window. The focus and contribution of this paper is the analysis and evaluation of the effect of the window size on WB-SPTF performance, as well as the new adaptive algorithm, Dynamic WB-SPTF, that we propose based on this analysis. The Dynamic WB-SPTF algorithm adjusts its window size according to the load in the system: it aims to find the right window size for a given system load and applies the SPTF algorithm only over the requests that fall within the current window. The algorithm is self-adjusting because it adapts its parameters to the load conditions. Depending on workload characteristics, Dynamic WB-SPTF performs better than or similar to SPTF, and it performs similar to WB-SPTF with the optimal window size. Our analysis shows that Dynamic WB-SPTF is robust and performs well for different types of workloads and hardware (i.e., disk) characteristics.
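The window-based selection that WB-SPTF performs can be sketched as follows. This is a minimal illustration: the queue representation and the positioning-time estimate are stand-ins of our own, not the drive's actual geometry computation.

```python
def wb_sptf_pick(queue, window_ms, positioning_time):
    """Select the next request under WB-SPTF (sketch).

    queue: list of (arrival_ms, request_id) for outstanding requests.
    positioning_time: callable estimating seek + rotational latency of a
    request from the current head position (a placeholder here; a real
    drive derives it from its geometry and head position).
    """
    if not queue:
        return None
    oldest = min(arrival for arrival, _ in queue)
    # SPTF applies only to requests inside the sliding window anchored
    # at the arrival time of the oldest queued request.
    eligible = [r for arrival, r in queue if arrival <= oldest + window_ms]
    return min(eligible, key=positioning_time)
```

With a very large window every outstanding request is eligible, and the selection degenerates to plain SPTF over the whole queue.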
3. Synthetic Workload Characterization

Our objective is to use request scheduling to improve disk performance under fluctuating request arrival intensities. Bursty arrivals may result in long disk queues (i.e., longer than 100 requests), causing overload periods. An overload condition at the disk depends on two factors, namely the request arrival intensity and the workload characteristics, represented by the amount of locality and sequentiality in the disk access pattern. Our initial analysis is based on trace-driven simulation using a synthetic trace characterized by fluctuations in both arrival intensities and workload characteristics.

The synthetic trace consists of 64,726 requests and spans a time interval of 400 seconds. The arrival process of the trace, depicted in Figure 1(a), has three intervals of high arrival intensity that cause overload at the disk. These intervals are (100, 140), (200, 240), and (325, 365) seconds, which we denote by B, D, and F, respectively. Each interval of overload is followed by an interval of low load, i.e., intervals A, C, E, and G in Figure 1(a), so that the effect of the overload on system performance is not carried over from one interval to the next. Our synthetic trace is a mix of random and sequential disk accesses, as shown in Figure 1(b); there is no locality in the disk access pattern. The sequential accesses are obtained by assuming several simultaneous video streams. With our synthetic trace, we aim to analyze system behavior in overload in three different scenarios: (1) under a random workload, as in interval B; (2) under a mix of random and sequential workload, as in interval D; and (3) under only sequential streams, as in interval F. Table 1 highlights the load and workload characteristics of each interval defined in Figure 1. In our analysis, we use DiskSim 2.0 [4] as the disk-level simulator.
This simulator is appropriately modified to accommodate the WB-SPTF and the Dynamic WB-SPTF scheduling algorithms. The disk that we simulate in all experiments with the synthetic trace described in Figure 1 is a Seagate Barracuda with 7200 rpm and a total of 4,110,000 blocks. In Section 6, we present another set of experiments with data measured on a Seagate Cheetah with 10,000 rpm and a total of 17,783,240 blocks. Note that Barracuda is an ATA disk drive, while Cheetah is a SCSI disk drive. By selecting different hardware to test our algorithm, we aim to evaluate its robustness in different environments.

Figure 1. Characteristics of the synthetic trace used in our analysis; (a) arrival rate (requests per second versus time, over the 400-second trace), (b) disk access pattern (block number versus time). [Plots omitted.]

Interval:   A       B       C       D            E            F           G
Load:       Low     High    Low     High         Low          High        Low
Workload:   Random  Random  Random  Random +     Random +     Sequential  Sequential
                                    Sequential   Sequential

Table 1. Characteristics of the synthetic trace per interval of time.
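The shape of the arrival process in Figure 1(a), a low background rate punctuated by three overload bursts, can be sketched as below. The rate values and the Gaussian approximation to per-second Poisson counts are illustrative assumptions of ours, not the parameters of the actual trace.

```python
import random

def synthetic_arrival_counts(duration_s=400, low_rate=40, high_rate=600,
                             bursts=((100, 140), (200, 240), (325, 365)),
                             seed=1):
    """Per-second arrival counts: a low background rate with three
    high-intensity intervals (B, D, F) that overload the disk."""
    rng = random.Random(seed)
    counts = []
    for t in range(duration_s):
        rate = high_rate if any(lo <= t < hi for lo, hi in bursts) else low_rate
        # Gaussian approximation of a Poisson(rate) sample.
        counts.append(max(0, int(rng.gauss(rate, rate ** 0.5))))
    return counts
```

Feeding such a pattern to a disk simulator reproduces the alternation of overload and recovery intervals that the evaluation below depends on.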
4. Performance of FCFS, SPTF, and WB-SPTF under overload

First, we evaluate the impact that transient overloads, such as those depicted in Figure 1, have on disk performance. Initially, we compare the performance of the FCFS and SPTF scheduling algorithms. Figure 2(a) illustrates the individual response time of each request in the trace as a function of its arrival time. Observe that the performance of FCFS is poor (average response time of 12591 ms) compared to the performance of SPTF (average response time of 1565 ms). In addition, we compute the standard deviation of response time as an important metric of variability in disk performance. The variability introduced by SPTF in request response time, i.e., its standard deviation, is 3293 ms. Under transient overload, the SPTF response time standard deviation, although high (more than twice the average response time), is far less severe than that of FCFS, which is in the same range as the very high FCFS average response time. Note that in Figure 2(a) the FCFS data is incomplete, because the simulation could not finish the entire trace due to resource limitations, i.e., queues longer than the available buffer space.

Disk performance improves further by applying the WB-SPTF scheduling algorithm instead of pure SPTF. Figure 2(b) plots the response time of individual requests as a function of arrival time for the SPTF and WB-SPTF(2000) scheduling algorithms, where "WB-SPTF(2000)" denotes the Window-Based SPTF scheduling algorithm with a window size of 2000 milliseconds. Note that, in cases of overload, response time under WB-SPTF(2000) has a lower standard deviation (only 1657 ms) than under SPTF (3293 ms). The average response times of the two scheduling algorithms are in the same range. For medium or light load, WB-SPTF(2000) and SPTF perform essentially the same (the window size is large enough to include all outstanding requests in the batch over which SPTF applies). The window size of 2000 milliseconds is picked to demonstrate the benefit of applying SPTF over only a portion of the outstanding requests once the system operates in overload; our experiments show that other window sizes generate even better results, not only overall but particularly for individual overload intervals. We evaluate the performance of WB-SPTF for six window sizes, i.e., 500, 1000, 2000, 3000, 4000, and 5000 milliseconds, and present the average response times and the standard deviations of response time in Figures 3 and 4, respectively. Specifically, we present the overall performance of each WB-SPTF and of pure SPTF in Figures 3(a) and 4(a), and the performance for each overload interval B, D, and F in Figures 3(b), (c), and (d) and 4(b), (c), and (d), respectively. The highlighted bar in each graph of Figures 3 and 4 corresponds to the value of
Figure 2. Individual request response time under (a) FCFS and SPTF and (b) SPTF and WB-SPTF(2000) scheduling algorithms.
the window size for which WB-SPTF performs best, while the first bar in each graph corresponds to the SPTF scheduling algorithm. The results of Figures 3 and 4 indicate the following:

- The performance of SPTF and WB-SPTF depends on both the load in the system and the workload mix (i.e., random, sequential, or a mix of the two).
- The performance of WB-SPTF depends on the window size. If the window size is not selected carefully, disk performance is quite poor. For example, Figure 3(b) illustrates how poorly WB-SPTF(500) performs and how performance improves for a window size of 5000 milliseconds.
- The performance of the WB-SPTF algorithm, as of other disk scheduling algorithms, depends on the characteristics of the workload. For interval B, i.e., a fully random workload, only a very large window size, which allows WB-SPTF to behave as pure SPTF, yields performance comparable to SPTF, i.e., the best performance for the interval. The trend in the best-performing window size changes as we move from random toward more sequential workloads, i.e., intervals D and F defined in Figure 1. Performance results for intervals D and F are presented in Figures 3(c) and 3(d), respectively. Observe that the more sequential the workload, the smaller the window sizes of the best performing WB-SPTFs. Additionally, the performance improvement gained by using WB-SPTF instead of pure SPTF increases as the workload becomes more sequential.
- Most importantly, note that for different load and workload conditions, different window sizes must be selected to achieve the best performance from the WB-SPTF algorithm. For example, for intervals B, D, and F the optimal window sizes are 5000, 3000, and 500 milliseconds, respectively.
- Consistently, the standard deviation of response time for the WB-SPTF scheduling algorithm is lower than for SPTF (see all graphs in Figure 4). In particular, the gap increases when moving from a completely random workload toward a more sequential one, since the optimal window size for WB-SPTF decreases as the workload becomes more sequential.

The results presented in this section illustrate how request scheduling assists the disk in maintaining good overall performance under fluctuations in arrival intensity. In cases of overload and non-random workloads, WB-SPTF yields better performance than pure SPTF. However, the performance of the WB-SPTF scheduling algorithm depends on the correct selection of its window size; over time, different window sizes yield different performance results depending on the load and workload characteristics at the disk. In the following section, we discuss how to dynamically change the window size of the WB-SPTF scheduling algorithm to achieve high performance across different system loads and workload characteristics.
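The two metrics used throughout the evaluation, the average response time and its standard deviation per trace interval, amount to the following computation. The interval boundaries are in milliseconds, as in Figure 1; the sample data in the usage check is made up.

```python
from statistics import mean, pstdev

def interval_stats(samples, intervals):
    """Average and standard deviation of response time per interval.

    samples: list of (arrival_ms, response_time_ms) pairs.
    intervals: dict mapping an interval name to (start_ms, end_ms).
    """
    stats = {}
    for name, (start, end) in intervals.items():
        # Attribute each request to the interval in which it arrived.
        rts = [rt for arr, rt in samples if start <= arr < end]
        stats[name] = (mean(rts), pstdev(rts)) if rts else (None, None)
    return stats
```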
5. Dynamic WB-SPTF Algorithm

In this section, we propose a new variation of the WB-SPTF scheduling algorithm that adapts the window size on-the-fly in response to fluctuations in the disk load. The results of Figure 3 indicate that, generally, the higher (lower) the load in the system, the larger (smaller) the optimal window size for the WB-SPTF scheduling algorithm. Nevertheless, it is not trivial to dynamically update the window size as the load in the system changes. In the previous section, we showed that both load and workload characteristics affect the behavior of the scheduling algorithm. In our approach, we modify the window size of WB-SPTF based only on the system load. The workload characteristics are taken into account by the algorithm indirectly, because we measure the system load not by the number of new arrivals but by the number of outstanding requests in the system. For example, in the case of random accesses (see Figure 3(b)), small window sizes for WB-SPTF cause the number of outstanding requests to increase, which in turn triggers a window size increase to improve performance. This trend continues until the window size is large enough to include all requests in the queue and the performance of WB-SPTF
Figure 3. Average response time under SPTF and WB-SPTF with window sizes 500, 1000, 2000, 3000, 4000, and 5000 milliseconds, respectively; (a) overall and for time intervals (b) B, (c) D, and (d) F.
Figure 4. Standard deviation of response time under SPTF and WB-SPTF with window sizes 500, 1000, 2000, 3000, 4000, and 5000 milliseconds, respectively; (a) overall and for time intervals (b) B, (c) D, and (d) F.
nears the performance of SPTF, which, in the case of fully random workloads, is the best (see Figure 3).

The metric that we use to trigger a window size change is the ratio of the number of requests within the window of WB-SPTF to the total number of outstanding requests in the disk queue. We refer to this metric as the Ratio. While a change in the number of outstanding requests in the system indicates that an update in the window size might be necessary, the Ratio determines whether such an action will actually take place and its direction, i.e., increase or decrease. The basic guidelines for the Dynamic WB-SPTF algorithm are as follows:
- Under light and medium system load, any window size is fine as long as it is large enough to include the entire set of outstanding requests at the disk. Note that, in the long run, the dynamic algorithm tends to settle on a small window size under light and medium load.
- Under high system load, the optimal window size increases relative to the optimal window size under light or medium load, while the Ratio decreases.
- Under overload, the optimal window size increases relative to the window size under high load, while the Ratio decreases further.
At first, these guidelines might seem counter-intuitive, because they basically state that as load increases, even though the window size increases, the portion of the requests within the window decreases. This is related to the bursty conditions that we focus on, where the number of outstanding requests is high, the waiting time per request increases, and even large window sizes include only a fraction of the set of outstanding requests.

We define four load levels at the disk: (1) light, (2) medium, (3) high, and (4) overload. Each load level is determined by observing both the queue build-up and the request slowdown in the system. Based on the queue build-up and the respective request slowdown for the systems that we evaluated, we define the disk load as light when there are at most 16 outstanding requests in the system, medium when there are at most 32 outstanding requests, high when there are at most 64 outstanding requests, and, finally, the disk is overloaded when there are more than 64 outstanding requests. If there are more than 512 outstanding requests in the system, the overload is considered severe; beyond this state, the system no longer adapts to further increases in the load, but continues to operate with the window size that it has already reached. For each load level, we define an interval of acceptable Ratio values (recall that the Ratio is the ratio of the number of requests within the window to the total number of outstanding requests). The length of the interval of "acceptable Ratios" increases, and its individual boundary values decrease, as the system load increases.
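The queue-depth thresholds above translate directly into a load classifier. The level names in the code are ours; the thresholds (16, 32, 64, 512 outstanding requests) are the ones given in the text.

```python
def load_level(outstanding):
    """Classify disk load from the number of outstanding requests."""
    if outstanding <= 16:
        return "light"
    if outstanding <= 32:
        return "medium"
    if outstanding <= 64:
        return "high"
    # Beyond 512 outstanding requests the overload is considered severe
    # and Dynamic WB-SPTF stops adapting the window size.
    return "severe" if outstanding > 512 else "overload"
```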
If the current Ratio is not within the acceptable interval for the current load level, then the algorithm updates the window size. The window size is increased if the Ratio is below the lower boundary of the interval, and decreased if the Ratio is above the upper boundary. Hence, the Ratio intervals are the tools that guide the window size updates: how often they occur and the levels the window size reaches for a given system load.

The Dynamic WB-SPTF algorithm checks whether the window size needs to be updated upon completion of each request. By default, if the load in the system is light, then SPTF applies over the entire set of outstanding requests. For all other load levels, the window size increases or decreases according to changes in the number of outstanding requests in the system. The initial window size does not affect Dynamic WB-SPTF performance in the long run.

The algorithm does not introduce any computational overhead; it merely adds a simple check of whether the window size is correct. The computational cost of applying SPTF over the entire, i.e., larger, set of requests is much higher, because the seek plus rotational latency must be computed for each request given the current position of the head. In times of overload, the large number of outstanding requests makes scheduling quite computationally expensive, and with Dynamic WB-SPTF we considerably reduce this cost by applying SPTF only over a fraction of the outstanding requests.

In Figure 5, we present the values of the Dynamic WB-SPTF parameters. These values are not hardware dependent; they worked well with the various hardware that we tested. A more detailed analysis of these choices is the subject of future work. In Figure 5(a), we present the acceptable Ratio intervals for all load levels. Note that for light load, the interval reduces to the single value 1, causing Dynamic WB-SPTF to behave equivalently to SPTF. The performance of Dynamic WB-SPTF is not sensitive to the acceptable Ratio intervals as long as they follow the trend described previously. In Figure 5(b), we present the amount of time added to or subtracted from the window size every time a change is necessary according to the Dynamic WB-SPTF algorithm. For each load level, we present two bars: the left bar indicates the amount added, while the right bar represents the amount subtracted from the window size. The window size changes are small in the case of light load and large in high load and overload, which allows faster adaptation to sharp oscillations in the load. We decrease the window size more slowly than we increase it, to avoid unnecessary oscillations in the window size. In low and medium load conditions, a single window size change allows individual requests to be included in or excluded from the window; in high load and overload, a single window change allows tens of requests to be included or excluded.

Figure 5. Parameters used in the Dynamic WB-SPTF algorithm. (a) Acceptable Ratios for the four load levels. (b) Window size increase (left bar) and decrease (right bar) for each load level.
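Putting the pieces together, one window-size update step of Dynamic WB-SPTF can be sketched as below. The acceptable-Ratio intervals and step sizes are illustrative placeholders read off the trends of Figure 5, not the paper's exact values; as described above, intervals widen and their bounds drop as load grows, and decrements are smaller than increments to damp oscillation.

```python
# Illustrative parameters (not the paper's exact Figure 5 values).
ACCEPTABLE_RATIO = {"light": (1.0, 1.0), "medium": (0.7, 0.9),
                    "high": (0.4, 0.7), "overload": (0.1, 0.5)}
STEP_UP_MS = {"light": 10, "medium": 50, "high": 150, "overload": 250}
STEP_DOWN_MS = {"light": 5, "medium": 25, "high": 75, "overload": 125}

def update_window(window_ms, in_window, outstanding, level):
    """One update, run on each request completion: grow the window when
    too small a fraction of the queue is eligible, shrink it when too
    large a fraction is, and leave it alone otherwise."""
    if outstanding == 0 or level not in ACCEPTABLE_RATIO:
        return window_ms            # nothing queued, or severe overload
    low, high = ACCEPTABLE_RATIO[level]
    ratio = in_window / outstanding
    if ratio < low:
        return window_ms + STEP_UP_MS[level]
    if ratio > high:
        return max(1, window_ms - STEP_DOWN_MS[level])
    return window_ms
```

The scheduler then applies SPTF only to the requests inside the resulting window, so the per-decision cost stays bounded even when the queue holds hundreds of requests.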
5.1. Dynamic WB-SPTF Performance Results

We use the synthetic trace described in Figure 1 to analyze the performance of Dynamic WB-SPTF. We measure the average response time for each time interval defined in Figure 1 and present the results for the overloaded intervals B, D, and F, as well as overall. In addition to the average response time, we focus on the standard deviation of response time, since it describes the amount of variability introduced by the scheduling algorithm. We present our findings in Figures 6 and 7. For comparison, each graph includes the respective results for SPTF (as the baseline), as well as WB-SPTF(500), WB-SPTF(3000), and WB-SPTF(5000), the best performing WB-SPTFs for intervals F, D, and B, respectively.

The results in Figure 6 indicate that Dynamic WB-SPTF manages to adapt its operation to different load conditions, yielding performance similar to that of the best performing scheduling algorithm. Observe that, by focusing on outstanding requests rather than arrival intensities, Dynamic WB-SPTF adapts its operation to different workload conditions as well, adopting a large window for interval B (Figure 6(b)) and smaller windows for intervals D and F, which have more sequential workloads (Figures 6(c) and 6(d)). Observe also that, while Dynamic WB-SPTF performs better than SPTF for non-random workloads (interval F), this is not the case for the fully random workload (interval B), where SPTF performs best.

Furthermore, the results of Figure 7 show that variability, measured as the standard deviation of response time, is among the lowest for the Dynamic WB-SPTF scheduling algorithm; it is consistently less than half the standard deviation of SPTF. We conclude that Dynamic WB-SPTF not only adapts its operation to different load conditions to maintain good overall performance, especially during overload periods, but does so while maintaining low variability in response time.

The experimental results show that Dynamic WB-SPTF performs similarly to the best performing scheduling algorithm for each overload period, but is not itself the best. This outcome is expected, since we provide only a simple heuristic for finding the optimal window size for WB-SPTF. Identifying better ways of adjusting the window size is a subject of future work.

Figure 6. Average response time under SPTF, WB-SPTF(500), WB-SPTF(3000), WB-SPTF(5000), and the Dynamic WB-SPTF scheduling algorithms (a) overall and for time intervals (b) B, (c) D, and (d) F.

Figure 7. Standard deviation of response time under SPTF, WB-SPTF(500), WB-SPTF(3000), WB-SPTF(5000), and the Dynamic WB-SPTF scheduling algorithms (a) overall and for time intervals (b) B, (c) D, and (d) F.
6. Experimental Results with a Realistic Workload

Disk drive behavior depends on both hardware and workload characteristics. Hence, in this section we test Dynamic WB-SPTF with another disk, a Seagate Cheetah at 10,000 rpm, and with a different, realistic trace. The trace, which captures disk activity, is measured in a real system that runs an online bookstore according to the TPC-W specification [7].
6.1. Experimental Set-up

TPC-W [13] specifies how to implement an online bookstore. The users of such a system are emulated browsers (EBs); they browse through the pages of the web site and may finally purchase books from the online store. All requests generated by the EBs are received by a web server, which in our implementation is Apache [12]. The web server forwards the dynamic requests to the application server, which in our implementation is Tomcat 4.0 [12]. According to the TPC-W specification, the dynamic requests are simple queries on the bookstore database, and the application server sends them down to the database server, which in our implementation is MySQL 4.1 [6]. The database stores the entire information of the online bookstore and consists of several tables. The most important one is the ITEMS table,
Proceedings of the First International Conference on the Quantitative Evaluation of Systems (QEST’04) 0-7695-2185-1/04 $ 20.00 IEEE Authorized licensed use limited to: Seagate Technology. Downloaded on December 3, 2008 at 16:21 from IEEE Xplore. Restrictions apply.
Component                         Processor                 Memory           OS
Emulated Browsers                 Pentium 4 / 2 GHz         256 MB           Linux Redhat 9.0
Web Server + Application Server   Pentium III / 1266 MHz    2 GB             Linux Redhat 9.0
Database Server                   Intel Xeon / 1.5 GHz      1 GB / 768 MB    Linux Redhat 9.0
Database                          2.1 GB in size; 1,000,000 records (511 MB) in ITEMS table
Disk                              SEAGATE ST373453LC; SCSI; 73 GB; 15,000 rpm
Table 2. Hardware components of the TPC-W-based on-line bookstore implementation

which stores all available books for purchase. All the hardware components of our experimental setup are shown in Table 2. In our measurements, we trace the entire I/O activity of the bookstore database. For this, we run the MySQL database server in a virtual machine, using VMWare [14], hosted by the Database Server machine of Table 2. The host of the database server has 1 GB of memory, but the virtual machine uses only 768 MB. The physical SCSI disk used by the database server in the virtual machine appears as a process in the host machine. We use the strace Linux utility to trace all I/O activity on that disk. The database used in our experiments is 2.1 GB in size, and its ITEMS table holds 1,000,000 records in 511 MB. This determines the highly localized access pattern shown in Figure 8(b). TPC-W defines three types of traffic: the browsing mix, with 95% browsing and 5% ordering; the shopping mix, with 80% browsing and 20% ordering; and the ordering mix, with 50% browsing and 50% ordering. Browsing the on-line bookstore generates many database queries that read from the database, mainly from the ITEMS table. Ordering from the on-line bookstore generates update queries that deal with individual records in the tables that handle customer data and item availability. Browsing is the most expensive activity, since it generates queries that search large chunks of data, while ordering touches only single database records. We measure the I/O activity while the system is under the browsing traffic mix generated by 80 concurrent EBs; the resulting trace is shown in Figure 8.
6.2. The TPC-W-based Workload
We run our experiment over a period of 20 minutes and collect a trace with 138,497 disk requests. In Figure 8(a), we present the arrival intensity of the trace by plotting the number of requests received every second. Observe that the plot in Figure 8(a) is highly jagged, a characteristic of real traces that is not present in the synthetic trace of Figure 1(a). We define three time intervals, A, B, and C, with different arrival intensities; intervals B and C are the overloaded ones. The disk access pattern is captured in Figure 8(b). The most notable difference from the semi-synthetic trace is the locality observed here. The disk has 73 GB capacity, while the database uses only 2.1 GB. Note the range of the y axis in Figure 8(b): it covers only the area of the disk that is accessed by the database. The database is stored in files representing each table, and the access pattern follows the individual table accesses. The most frequently accessed table, as mentioned previously, is the ITEMS table (from block to block ). Other trace characteristics are randomness and sequentiality, most notably during interval C. Each sequential stream corresponds to queries that search for either the Best Sellers or the New Products within a subject category in the ITEMS table. In Table 3, we highlight the workload characteristics of intervals A, B, and C within the measurement period.
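The per-second arrival counts plotted in Figure 8(a) can be derived from raw trace timestamps by binning; a minimal sketch, assuming millisecond arrival timestamps as in Figure 8 (function name and interface are ours):

```python
from collections import Counter

def arrival_intensity(arrival_ms):
    """Requests per second, binned from millisecond arrival timestamps."""
    counts = Counter(t // 1000 for t in arrival_ms)   # bucket index = whole seconds
    horizon = max(arrival_ms) // 1000 + 1             # number of one-second bins
    return [counts.get(s, 0) for s in range(horizon)]
```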
Interval   Load      Workload
A          Medium    Random, Local, Sequential
B          High      Random, Local
C          High      Random, Local, Sequential
Table 3. Characteristics of the TPC-W trace
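One simple way to quantify the sequentiality reported in Table 3 is the fraction of requests that start exactly where the previous request ended. This particular metric and the tuple format are illustrative assumptions on our part; the paper does not prescribe a specific sequentiality measure.

```python
def sequential_fraction(requests):
    """Fraction of requests that continue the preceding request's block range.

    requests: list of (start_block, num_blocks) tuples in arrival order.
    A request counts as sequential when it starts exactly where the
    previous request ended.
    """
    seq = 0
    for (prev_start, prev_len), (cur_start, _) in zip(requests, requests[1:]):
        if cur_start == prev_start + prev_len:
            seq += 1
    return seq / max(1, len(requests) - 1)  # guard against 0- or 1-request traces
```

A value near 1 indicates streams like the Best Sellers scans of interval C, while a value near 0 indicates random access as in interval B.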
Figure 8. Arrival process for the realistic trace used in our analysis: (a) arrival rate (requests per second over time in milliseconds), (b) disk access pattern (block number over time in milliseconds). Intervals A, B, and C are marked in both plots.

6.3. Results with the TPC-W-based Workload
Using the TPC-W-based trace, we run the same set of simulation experiments as in Section 5. Although the disk used in this set of experiments is a Seagate Cheetah with 15,000 rpm and 73 GB capacity, DiskSim provides parameters only for a Seagate Cheetah with 10,000 rpm and 9.1 GB capacity. We use the latter in our simulations; this causes no conflicts, since the data set is only 2.1 GB. We compare the performance of Dynamic WB-SPTF with that of SPTF and of WB-SPTF under different, i.e., the best performing, window sizes. We present the average request response time and the standard deviation of request response time for the various time intervals in Figures 9 and 10, respectively. Observe that, for the TPC-W-based trace, WB-SPTF outperforms pure SPTF in all three intervals A, B, and C. This indicates that batching requests before scheduling them with SPTF is much more profitable when locality, sequentiality, or both characterize the workload. The performance of Dynamic WB-SPTF is consistently near the best among all algorithms tested, independent of the load and workload characteristics at the disk.
7. Conclusions
We proposed a new scheduling algorithm for disk drives that operate under dynamic load conditions. The algorithm, Dynamic WB-SPTF, adapts its parameters to the load conditions in the system, maintaining good overall and stable performance during transient overloads. The Dynamic WB-SPTF algorithm adapts the value of only one of its parameters, namely the window size. This parameter is updated on every request service completion, without additional computational overhead for the disk. The window size determines how many of the outstanding requests are included in the pool of requests to which SPTF scheduling is applied. Dynamic WB-SPTF increases or decreases its window size by monitoring the ratio of the number of requests within the current window to the total number of outstanding requests. We evaluated the performance of Dynamic WB-SPTF via trace-driven simulations. We chose traces characterized by burstiness in both arrival intensity and workload characteristics, such as locality, sequentiality, and randomness in the disk access pattern. Although it does not directly consider characteristics of the access pattern, Dynamic WB-SPTF adapts its operation to the current workload type and improves the disk's overall performance. The gains of using Dynamic WB-SPTF are higher when the workload consists of local and sequential accesses. Under completely random disk accesses, Dynamic WB-SPTF behaves similarly to traditional SPTF. This is our first attempt to analyze and propose ways to handle the ever-changing dynamic environment under
which the computer system, in general, and the storage subsystem, in particular, operate. Currently, we are further investigating how to adapt disk operation to the conditions of the entire system. We are identifying additional information that might be available at the disk level (e.g., workload characteristics, individual request response times) or made available by higher layers of the system (e.g., the file system or application level) and that can be used for adaptive operation of the storage subsystem (whether single- or multi-disk units) and better overall utilization of system resources.
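The window-adaptation rule summarized above, which monitors the ratio of the window size to the number of outstanding requests on every request completion, can be sketched as follows. The target ratio, step size, and bounds here are illustrative assumptions; the paper states the ratio-monitoring principle but not these constants.

```python
def adapt_window(window, outstanding, target=0.5, step=32,
                 min_window=1, max_window=1024):
    """One adaptation step for Dynamic WB-SPTF, run on a request completion.

    When the window covers too small a fraction of the outstanding queue,
    grow it (moving toward SPTF throughput); when it covers too large a
    fraction, shrink it (moving toward FIFO fairness).
    """
    if outstanding == 0:
        return window                         # nothing queued; leave unchanged
    ratio = min(window, outstanding) / outstanding
    if ratio < target:
        window += step
    elif ratio > target:
        window -= step
    return max(min_window, min(max_window, window))
```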
Acknowledgments
We would like to thank Qi Zhang, who helped us with the process of collecting the trace described in Section 6.
References
[1] E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and A. Veitch. Hippodrome: Running circles around storage administration. In Proceedings of the First USENIX Conference on File and Storage Technologies (FAST'02), 2002.
[2] M. Andrews, M. A. Bender, and L. Zhang. New algorithms for the disk scheduling problem. Algorithmica, 32(2):277–301, 2002.
[3] R. P. Doyle, J. S. Chase, O. M. Asad, W. Jin, and A. M. Vahdat. Model-based resource provisioning in a web service utility. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS'03), Seattle, WA, 2003.
[4] G. R. Ganger, B. L. Worthington, and Y. N. Patt. The DiskSim simulation environment, Version 2.0, Reference manual. Technical report, Electrical and Computer Engineering Department, Carnegie Mellon University, 1999.
[5] D. M. Jacobson and J. Wilkes. Disk scheduling algorithms based on rotational position. Technical Report HPL-CSP-91-7rev1, HP Laboratories, 1991.
[6] MySQL AB. MySQL. http://www.mysql.com.
Figure 9. Average response time under SPTF, WB-SPTF(100), WB-SPTF(500), WB-SPTF(900), and the Dynamic WB-SPTF scheduling algorithms (a) overall and for time intervals (b) A, (c) B, and (d) C.
Figure 10. Standard deviation of response time under SPTF, WB-SPTF(100), WB-SPTF(500), WB-SPTF(900), and the Dynamic WB-SPTF scheduling algorithms (a) overall and for time intervals (b) A, (c) B, and (d) C.
[7] PHARM Project. Java TPC-W Implementation Distribution. http://www.ece.wisc.edu/~pharm/, Department of Electrical and Computer Engineering and Computer Sciences Department, University of Wisconsin-Madison.
[8] A. Riska and E. Riedel. It's not fair - Evaluating efficient disk scheduling. In Proceedings of the Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'03), pages 288–295, Oct. 2003.
[9] C. Ruemmler and J. Wilkes. UNIX disk access patterns. In Proceedings of the Winter 1993 USENIX Technical Conference, pages 405–420, 1993.
[10] B. Salmon, E. Thereska, C. A. N. Soules, and G. R. Ganger. A two-tiered software architecture for automated tuning of disk layouts. In Proceedings of the 1st Workshop on Algorithms and Architectures for Self-Managing Systems, San Diego, CA, 2003.
[11] M. Seltzer, P. Chen, and J. Ousterhout. Disk scheduling revisited. In Proceedings of the Winter 1990 USENIX Technical Conference, pages 313–323, Washington, DC, 1990.
[12] The Apache Software Foundation. Apache Web Server. http://www.apache.org.
[13] Transaction Processing Performance Council. TPC-W. http://www.tpc.org.
[14] VMWare Inc. VMWare Workstation. http://www.vmware.com.
[15] M. Welsh and D. Culler. Adaptive overload control for busy Internet servers. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS'03), Seattle, WA, 2003.
[16] B. L. Worthington, G. R. Ganger, and Y. N. Patt. Scheduling for modern disk drives and non-random workloads. Technical Report CSE-TR-194-94, Computer Science and Engineering Division, University of Michigan, 1994.