Exploiting Replication for Energy-Aware Scheduling in Disk Storage Systems
Jerry Chou, Jinoh Kim, Doron Rotem
Lawrence Berkeley Lab.
[email protected], [email protected], [email protected]
ABSTRACT
This paper deals with the problem of scheduling requests on disks to minimize energy consumption. We first analyze several versions of the energy-aware scheduling problem based on assumptions about the arrival pattern of the requests. We show that the corresponding optimization problems are NP-complete by reductions from the set cover and independent set problems. We then propose both optimal and heuristic scheduling algorithms to maximize the energy saving of a storage system. Our evaluation results using two realistic traces show that our approach reduces energy consumption by up to 55% and achieves fewer disk spin-up/down operations and shorter request response times compared with other approaches.
1. INTRODUCTION
Large-scale data centers need to maintain thousands of rotating disks for fast online access to stored datasets. As disks get cheaper in terms of dollars per gigabyte, the prediction is that the energy costs of operating and cooling these rotating disks will eventually outstrip the cost of the disks and the associated hardware needed to control them. Currently, disk storage systems are estimated to consume about 35 percent of the total power used in data centers [9]. This percentage will only continue to increase as data-intensive applications demand fast and reliable access to online data resources, which in turn requires the deployment of power-hungry, faster (high-RPM) and larger-capacity disks. Several energy saving techniques for disk-based storage systems have been introduced in the literature [15, 25, 13, 6, 20, 21, 19]. Most of these techniques revolve around the idea of spinning down the disks from their usual high-energy mode (idle mode) into a lower-energy mode (standby mode) after they experience a period of inactivity whose length exceeds a certain breakeven time (also called the idleness threshold). The reason for this is that typical disks consume about one tenth of the power in standby mode as compared with their power consumption in idle mode (see Figure 5). It is well known [11] that the optimal deterministic power management policy is to set the idleness threshold to TB = Eup/down / PI seconds, where Eup/down is the amount of energy (in joules) needed to spin the disk up (from standby to idle mode) and down (from idle to standby mode), and PI is the power (in watts) required to keep the disk spinning in idle mode. We call this power management policy 2CPM (2-competitive power management scheme), as it can be proved that its energy consumption does not exceed twice that of the best possible algorithm that has a priori knowledge of all request arrival times.
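As a concrete illustration of this threshold, the following sketch computes TB for a hypothetical disk. The parameter values here are illustrative assumptions, not the measured values used later in our experiments (those appear in Figure 5).

```python
# Hypothetical disk parameters (illustrative assumptions only):
E_up_down = 40.0  # joules needed to spin the disk down and back up
P_idle = 8.0      # watts consumed while spinning in idle mode

# 2CPM idleness threshold: spin down after T_B seconds of inactivity.
T_B = E_up_down / P_idle  # 5.0 seconds

# Spinning down earlier risks paying E_up_down for a short gap; waiting
# exactly T_B bounds total energy at twice the offline optimum.
```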
There are three problems associated with this spin-down technique:
(a) Energy and response time penalty: Disks can only service requests while they are in idle mode. If a request arrives when the disk is in standby mode, there is a response time penalty of several seconds (5-15 seconds) before the request can be serviced. In addition, as mentioned above, the energy cost of Eup/down joules may in some cases exceed the energy saved by transitioning the disk to standby mode.
(b) Expected length of inactivity periods: Under many typical workloads found in scientific and other applications, disks do not experience long enough periods of inactivity (longer than the idleness threshold), limiting the opportunities to save energy by transitioning to standby mode.
(c) Number of spin-up/down operations: For typical disks, an excessive number of such operations may shorten the lifetime of the disk and must also be considered (this is an attribute of S.M.A.R.T., a popular monitoring system for predicting hard disk failures1).
1 http://www.linuxjournal.com/article/6983
The research literature attempts to solve the above problems using several techniques. The common idea is to re-shape the workload so that I/O requests are directed to a small subset of the disks, allowing the other disks to enjoy relatively long periods of inactivity. Several of these techniques are listed below.
• Data placement [19, 20]: These techniques use intelligent data placement where the most popular files are packed onto the fewest possible disks subject to response time constraints. This placement is based on some knowledge of data access patterns. A second approach tries to achieve the same goal by dynamic data migration based on observed workloads; it does not require a priori knowledge of the access patterns but adapts on a larger time scale.
• Write off-loading [17]: These techniques are directed towards minimizing the energy consumed by write requests. Newly written data is diverted to disks that are currently spinning (anywhere in the data center), enabling disks in standby mode to remain in their low-energy mode for longer periods.
• Power-aware caching and pre-fetching [27, 26]: The idea behind such techniques is to always prefer evicting blocks from the cache residing on idle disks rather than
from disks in standby mode. It also seems beneficial to allocate more cache space to disks in standby mode than to disks in idle mode. Another observation is that pre-fetching data from an idle disk is beneficial in order to avoid the need to access it later, when the disk is placed in standby mode.
• Replication for energy saving [15, 25, 21]: Many modern file systems, such as HDFS (Hadoop Distributed File System), use replication to enhance fault tolerance and support data recovery. This replication can be used to save energy by accessing data copies from spinning disks while transitioning disks that contain redundant data to standby mode.
All approaches that attempt to redirect requests away from spun-down disks by dynamic data migration or data replication raise several performance and implementation concerns. Specifically, a storage system could become unreliable or unavailable because of data consistency and synchronization issues. In addition, expensive overhead costs may affect network bandwidth and request response time during data transfer, and additional data transfer protocols and control systems must be deployed to facilitate data movement. More critical to performance, these approaches could interfere with the data placement scheme used by the file system. As a result, a storage system may not be able to manage its data without cooperating with these techniques, introducing additional limitations and complexity.
In this paper, we propose an energy-aware scheduling approach that re-directs requests by utilizing existing data locations, instead of migrating data to new locations. The motivation for our approach is that typical file systems already replicate data at multiple locations in a storage system for fault tolerance and other performance considerations. Thus we can simply redirect requests on-the-fly by scheduling a request to any of its existing locations. As a result, our approach has several advantages over previous solutions. First, it does not interfere with any data placement scheme. Second, it can be deployed without additional storage system modification or performance penalty. Third, it is designed to work with any power management scheme in which a disk is simply spun down after experiencing a fixed idleness threshold; such schemes are used in commercial systems such as Copan400 by SGI2 and AutoMAID by NexSan3. Last but not least, it is complementary to and compatible with many other power-saving techniques. For example, the Set-Cover and All-in power management strategies [16, 14] for the Hadoop system could be combined with our approach to save more power by concentrating requests on fewer active disks.
Our main contributions are:
• We propose a novel energy-aware scheduling approach to reduce the energy consumption of storage systems without interfering with data placement or introducing additional performance and deployment overhead.
• We formulate energy-aware scheduling as an optimization problem and propose both optimal and heuristic algorithms for the offline, batch and online cases.
• We analyze the complexity of the energy-aware scheduling problem and show it is equivalent to the weighted
2 http://www.sgi.com/products/storage/maid/
3 http://www.nexsan.com/products/automaid.php
set cover problem in the batch case and is NP-complete in the offline case, using a reduction from the maximum independent set problem.
• We provide extensive evaluations with realistic traces (Cello [3], Financial1 [23]), a realistic disk model and a power manager. The experiments show a reduction in energy consumption of up to 55%, while minimizing the number of disk spin-up/down operations and request response time.
The rest of the paper is organized as follows. Section 2 gives an overview of the energy-aware scheduling problem. Our scheduling algorithms for the three scheduling models are proposed in Section 3. The experimental setup using OMNeT++ [18] and Disksim [7] is described in Section 4, and the results are presented in Section 5. Finally, some additional related work is discussed in Section 6, and the conclusions are presented in Section 7. More experimental results and proofs are included in the Appendix.
Figure 1: Storage system architecture.
2. ENERGY-AWARE SCHEDULING
2.1 Storage System Architecture
Figure 1 represents a typical architecture for a modern storage system in a data center. Brief descriptions and assumptions for each component are given below.
Request: a request is associated with a disk I/O of a file block (normally 512KB). The I/O time of a request is on the scale of milliseconds, which is very small compared with the time scale of a disk power management operation (e.g., spin-up/down), which is measured in seconds. For that reason, in our scheduling algorithm analysis we assume the request I/O time is negligible. In our experimental evaluations, however, we use the Disksim [7] simulator, which takes the disk I/O time into account. We only consider the scheduling of read requests, as we assume write requests can be assigned to one or more idle disks in the system using techniques such as write off-loading, so that they do not need to be handled by the scheduler.
Disk power manager: decides when to spin down disks by implementing a power management policy. As mentioned in the introduction, in this paper we use the 2CPM scheme.
Scheduler: dispatches requests to disks according to a scheduling algorithm. The three types of scheduling models are described in the next section.
Data placement manager: determines the data replication factor and data locations in a storage system. As mentioned, our approach makes no assumptions about, and does not interfere with, the data placement scheme. We assume that data location information is available to the scheduler.
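To make these roles concrete, the following is a minimal sketch of the component interfaces as we model them. The class and field names here are our own illustrative choices, not part of any existing system.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    data: int       # block identifier (read request; I/O time negligible)
    arrival: float  # time the disk receives the request, in seconds

@dataclass
class PlacementManager:
    # data -> list of disk ids holding a replica; filled by the file system
    locations: dict = field(default_factory=dict)

    def candidates(self, req: Request) -> list:
        """Expose existing replica locations so the scheduler can pick any."""
        return self.locations[req.data]
```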
2.2 Storage System Scheduling Model
We discuss the energy-aware scheduling problem using three models: offline, online and batch. An overview of these models is given below, while the proposed scheduling algorithms for each model are described in Section 3.
Offline assumes a scheduler has a priori knowledge of the arrival times of requests before making its scheduling decisions. It represents an ideal scheduling scenario and is used mainly as a benchmark for theoretical analysis. Equipped with arrival time knowledge, all requests can access disks at their arrival time without suffering disk spin-up delay; this is achieved by spinning disks up in advance, or by keeping disks idle until requests arrive, according to the scheduling assignment. These assumptions are removed from the other models.
Batch queues requests and assigns all of them together to disks periodically, at each scheduling interval. Note that although requests in a batch have different arrival times, they are scheduled to access disks at the same time. Thus, requests can suffer queuing delays in addition to disk spin-up delay, but better scheduling decisions can also be made because more request information is available.
Online represents the basic scheduling model, where a scheduler immediately assigns a request to a disk upon its arrival. Requests can still suffer disk spin-up delay if the requested disks are not spinning when they arrive.
Figure 2: Batch energy-aware scheduling examples. (a) Scheduling A is one possible solution; it reduces energy consumption to 15 by using only 3 disks. (b) Scheduling B is optimal with energy consumption 10.
2.3 Energy-aware Scheduling Examples
We will now illustrate the advantages and challenges of designing an energy-aware schedule using several examples. For simplicity, we use a basic disk power model with no spin-up/down time or energy penalty. The disk consumes exactly 1 unit of energy per second in the active or idle state, and the breakeven time in the examples is 5 seconds. Finally, the initial state of all disks is standby.
2.3.1 Batch scheduling example
The first example, shown in Figure 2, simulates a batch scheduling problem where all requests access disks concurrently because of the batch queuing delay. In the example, there are six requests, r1, ..., r6, and their requested data are b1, ..., b6, respectively. The data placement is given and known to the scheduler as shown in the figure: disk d1 has data b1, b2, b3, b5; disk d2 has data b2, b3; disk d3 has data b4, b6; and disk d4 has data b3, b4, b5, b6. Figure 2(a) shows one possible schedule A, which assigns requests r1, r5 to disk d1, requests r2, r3 to disk d2 and requests r4, r6 to disk d3. As shown by the disk status at the right side of Figure 2(a), disks d1, d2, d3 are spun up and become active at time 0, then they remain idle for a
breakeven time of 5 seconds before spinning down. Compared with the always-on disk power scheme, whose energy consumption is 20 (= 4 * 5), schedule A consumes less energy, 15 (= 3 * 5), by keeping disk d4 in standby. However, as shown in Figure 2(b), another schedule B uses even less energy, 10 (= 2 * 5), by scheduling requests r1, r2, r3, r5 to disk d1 and requests r4, r6 to disk d3. In fact, as proven in Theorem 2, schedule B is also an optimal schedule for this example, because it uses the minimum number of disks to service all requests. As a result, schedule B significantly reduces the energy consumption of the always-on scheme (by 50%) and of schedule A (by 33%).
Figure 3: Offline energy-aware scheduling examples. (a) Scheduling B now has energy consumption 23, which is no longer the minimum. (b) Scheduling C is optimal with energy consumption 19.
2.3.2 Offline scheduling example
We next extend the previous example to an offline scheduling problem by allowing requests to access disks at different times. As shown in Figure 3, requests r1, ..., r6 are now associated with times 0, 1, 3, 5, 12 and 13, respectively. Figure 3(a) re-examines the results of schedule B in the offline model. Schedule B still uses the minimum number of disks to service requests, but the energy consumption of d1 and d3 now becomes 13 and 10. As shown by the disk status at the bottom of Figure 3(a), disk d1 is spun up at time 0 to service r1, then r2 and r3 arrive at times 1 and 3, respectively. Thus, according to the 2CPM scheme, the disk cannot be spun down until time 8, when it reaches the breakeven time (5 s) after servicing r3. At time 12, d1 is spun up again to service r5 and spun down after it reaches the breakeven time at time 17. Similarly, the energy consumption of disk d3 is 10, as it is spun up at times 5 and 13 for r4 and r6, respectively. Therefore, the energy consumption of schedule B is 23. Although schedule B consumes much less energy than the always-on configuration, 72 (= 18 * 4), it is no longer an optimal schedule in this example. Figure 3(b) shows an optimal schedule C, which uses three disks but has the minimum energy consumption. Schedule C assigns requests r1, ..., r3 to disk d1, request r4 to disk d3 and requests r5, r6 to disk d4. As a result, d1 is idle between times 0 and 8, d3 is idle between times 5 and 10, and d4 is idle between times 12 and 18. Accordingly, the energy consumption of schedule C is 19, which is
even less than the energy consumption of schedule B. Comparing the results of schedules B and C, we see why an offline scheduling problem is more complicated than its batch counterpart: the energy cost of a schedule depends not only on the locations of the data for the scheduled requests, but also on their arrival times.
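The energy accounting in these examples is easy to reproduce. Below is a minimal sketch of the simplified power model used above (idle power of 1 unit/s, breakeven time of 5 s, no spin-up/down cost, free standby); it recomputes the energy of schedules B and C from Figure 3.

```python
TB = 5  # breakeven time in seconds; idle power is 1 unit per second

def disk_energy(arrivals, tb=TB):
    """Energy of one disk under 2CPM with zero spin-up/down cost:
    after each request the disk stays idle until the next request or
    for tb seconds, whichever is shorter; standby consumes nothing."""
    times = sorted(arrivals)
    gaps = [min(later - earlier, tb) for earlier, later in zip(times, times[1:])]
    return sum(gaps) + tb  # final idle stretch before spinning down

def schedule_energy(assignment):
    """assignment maps a disk to the arrival times of its requests."""
    return sum(disk_energy(ts) for ts in assignment.values())

# Offline example of Figure 3: r1..r6 arrive at 0, 1, 3, 5, 12, 13.
schedule_B = {"d1": [0, 1, 3, 12], "d3": [5, 13]}
schedule_C = {"d1": [0, 1, 3], "d3": [5], "d4": [12, 13]}
print(schedule_energy(schedule_B))  # 23
print(schedule_energy(schedule_C))  # 19
```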
3. SCHEDULING ALGORITHM
3.1 Offline Scheduling
We formulate the offline scheduling problem as an optimization problem and propose a scheduling algorithm, which we illustrate with an example. The variables used in our formulation are summarized in Table 1 in Appendix B.
3.1.1 Problem formulation
An energy-aware scheduling problem ES is described by an input (R, D, L, P), where R = {r1, ..., rN} is a request stream, D = {d1, ..., dK} is a list of disks, L is the data placement information, and P = {Tup/down, Eup/down, TB, PI} is the configuration of the 2CPM scheme. Each request ri ∈ R is associated with a data item and a disk access time ti (i.e., the time a disk receives the request), and requests in R are sorted by their times in increasing order. The power configuration P consists of a set of parameters: Tup/down and Eup/down are the disk spin-up/down time and energy, while TB and PI are the disk breakeven time and idle power.
Let S_ES be the set of all feasible schedules of a scheduling problem ES. Our goal is to find an optimal schedule S*_ES ∈ S_ES whose energy consumption is the minimum over all feasible schedules, i.e., S*_ES = argmin over all S^x_ES ∈ S_ES of the energy consumption of S^x_ES.
Specifically, we formulate our scheduling problem by analyzing the energy saving of a schedule instead of its energy consumption. Therefore, in the following analysis, we use X(·) as an energy-saving function, such that X(S^x_ES) denotes the energy saving of a schedule S^x_ES for scheduling problem ES, and X(S^x_ES, ri) denotes the energy saving of a request ri ∈ R under the scheduling results of S^x_ES. For simplicity, we assume the initial status of all disks is standby and the standby power is 0, but these assumptions are not necessary for the correctness of our problem formulation.
To compute X(S^x_ES) and X(S^x_ES, ri), we first define the energy consumption of a request as the amount of energy consumed on its scheduled disk from the time the request is serviced until the next request arrives on that disk. Take the scheduling results of schedule C in Figure 3(b) as an example. The energy consumption of r1 is 1, because disk d1 is idle from time 0 (the arrival time of r1) to time 1 (the arrival time of r2), and the disk idle power is 1 unit per second in the power configuration. Similarly, the energy consumption of r3 is 5, because disk d1 is idle after the arrival of r3 at time 3 and is spun down after time 8.
According to the above definition, the maximum energy consumption of a request under the 2CPM scheme equals (Eup + Edown + TB * PI), which is reached when the successor request arrives after the disk has been spun down. Therefore, we define the energy saving of a request ri in schedule S^x_ES as X(S^x_ES, ri) = (the maximum energy consumption of ri) − (the energy consumption of ri under schedule S^x_ES). For example, in Figure 3(b), the energy saving of r1 is 4 (= 5 − 1), because its maximum energy consumption is 5 (Eup/down = 0, PI = 1 and TB = 5), and
its energy consumption in the schedule is only 1.
By the definition of request energy saving, it is straightforward to show that the energy saving of a schedule equals the aggregate energy saving of all requests, because a schedule iteratively assigns requests to disks. Thus, we have

X(S^x_ES) = Σ_{ri ∈ R} X(S^x_ES, ri)    (1)
Furthermore, the amount of energy saving of a request is determined by the time of its successor request on the disk, because the energy consumption of a request is computed between the time of the request and the time of its successor. Therefore, we re-formulate the value of X(S^x_ES, ri) by an energy saving amount X(i, j, k), such that

X(S^x_ES, ri) = X(i, j, k), if S^x_ES schedules ri on disk dk and the successor request of ri on dk is rj; 0, otherwise.    (2)

Intuitively, we can only save energy on a request ri if its successor request rj arrives before disk dk is spun down. In addition, more energy is saved on ri when the arrival times of ri and its successor rj get closer, because the amount of disk idle time between the two requests is reduced. Therefore, as proven by Lemma 1 in Appendix B, the exact value of X(i, j, k)4 can be computed as

X(i, j, k) = Eup + Edown + (TB − (tj − ti)) * PI, if 0 ≤ tj − ti < TB + Tup + Tdown; 0, otherwise    (3)

and
X(i, j, k) ∈ ES iff dk has the data of both ri and rj, and ti < tj.    (4)

Let U be the set of all X(i, j, k), ∀i, j, k, in a given scheduling problem ES. Based on the above equations, the energy saving of any schedule, X(S^x_ES), can be computed by summing the values of a subset S ⊆ U in which each X(i, j, k) ∈ S equals X(S^x_ES, ri) (i.e., the energy saving of ri in the schedule). Therefore, the energy saving of an optimal schedule, X(S*_ES), can be formulated as the following optimization problem, in which the value of the subset S corresponding to an optimal schedule S*_ES is maximized.
Given a scheduling problem ES(R, D, L, P) and a set of energy savings U = {X(i, j, k) | ∀X(i, j, k) ∈ ES},
find a subset S ⊆ U such that X(S*_ES) = Σ_{X(i,j,k) ∈ S} X(i, j, k) is maximized,
subject to, for each pair X(i, j, k) ∈ S and X(i′, j′, k′) ∈ S:
i ≠ i′    (energy-constraint)
k = k′ if {i, j} ∩ {i′, j′} ≠ ∅    (schedule-constraint)
The first constraint is imposed because only one X(i, j, k), over all j and k, can be selected in S to represent the energy saving of request ri. The second constraint is imposed because S*_ES must be a valid schedule: by definition, if X(i, j, k) ∈ S, both requests ri and rj must be scheduled on disk dk in schedule S*_ES. Therefore, if X(i, j, k) and X(i′, j′, k′) share a common request (i.e., {i, j} ∩ {i′, j′} ≠ ∅), their associated disks k and k′ must be the same; otherwise, there is a conflict over the scheduling location of the common request.
4 X(i, j, k) is always ≥ 0 as long as the disk spin-up/down power ≥ the idle power for the given power model.
For convenience, we refer to the two constraints as the energy-constraint and the schedule-constraint, respectively.
Figure 4: Illustration of the MWIS scheduling algorithm computing the optimal schedule C of Figure 3(b). (a) Step 1: nodes are added for all X(i, j, k) ∈ ES according to Eq. 3 and Eq. 4. (b) Step 2: edges are added among X(1, 3, 1), X(2, 3, 1) and X(2, 3, 2) because of the energy-constraint of request r3; an edge is also added between X(1, 2, 1) and X(2, 3, 2) because of the schedule-constraint of request r2, which can only be scheduled on disk d1 or on disk d2. (c) Step 3: an MWIS algorithm is applied to select the independent set S = {X(2, 3, 1), X(1, 2, 1), X(4, 6, 4)}. (d) Step 4: a scheduling assignment is derived from the request dependencies selected in Step 3. Note that r4 does not appear in S because it has no dependency on any other request; thus, r4 can be scheduled on any of its data locations, such as d3. The final scheduling assignment is the same as the solution shown in Figure 3(b).
3.1.2 Algorithm
Based on the above problem formulation, we propose our MWIS offline scheduling algorithm by reducing the problem to a maximum weighted independent set problem. Given a scheduling problem ES(R, D, L, P), we construct a graph G = (V, E) in two steps. Step 1 adds a node to V for each X(i, j, k) ∈ ES with non-zero value according to Eq. 3 and Eq. 4. Step 2 then adds an edge between any pair of nodes X(i, j, k) and X(i′, j′, k′) that does not satisfy the constraints in our problem formulation. Once the graph is constructed, Step 3 applies a maximum weighted independent set algorithm to select the set S. Finally, Step 4 derives an optimal schedule S*_ES from S by scheduling requests ri and rj on disk dk for each X(i, j, k) ∈ S. Since Step 1 ignores any X(i, j, k) with zero value, a request may not be associated with any X(i, j, k) ∈ S when it cannot save energy at any disk location; such a request may be scheduled on any of its data locations. Figure 4 illustrates our algorithm by computing the optimal scheduling results of the previous example in Figure 3(b). The correctness of our algorithm is proven in Theorem 3 in Appendix B. Furthermore, the optimality of the algorithm is equivalent to the optimality of the MWIS algorithm used in Step 3, because the amount of energy saved by our algorithm equals the weight of its corresponding independent set solution. Our experimental results in Section 5 show that even the well-known sub-optimal greedy MWIS algorithm provides significant energy savings.
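The following is a minimal sketch of these four steps under the simplifying assumptions above. The weight function implements Eq. 3, and the greedy selection follows a GMIN-style rule of repeatedly taking the vertex that maximizes weight/(degree + 1); function and variable names are illustrative, not taken from any library.

```python
import itertools

def saving(ti, tj, Eup, Edown, TB, PI, Tup, Tdown):
    """X(i, j, k) of Eq. 3: energy saved on request ri when its
    successor rj arrives (tj - ti) seconds later on the same disk."""
    if 0 <= tj - ti < TB + Tup + Tdown:
        return Eup + Edown + (TB - (tj - ti)) * PI
    return 0.0

def build_conflict_graph(X):
    """X: dict mapping node (i, j, k) -> X(i, j, k) > 0 (Eq. 3 and Eq. 4).
    Connect two nodes iff they violate the energy-constraint (same request
    ri) or the schedule-constraint (shared request, different disks)."""
    adj = {v: set() for v in X}
    for (i, j, k), (i2, j2, k2) in itertools.combinations(X, 2):
        if i == i2 or (k != k2 and {i, j} & {i2, j2}):
            adj[(i, j, k)].add((i2, j2, k2))
            adj[(i2, j2, k2)].add((i, j, k))
    return adj

def greedy_mwis(X, adj):
    """Greedy MWIS: take the vertex maximizing weight / (degree + 1),
    remove it and its neighbors from the graph, and repeat."""
    live, S = set(adj), []
    while live:
        v = max(live, key=lambda u: X[u] / (len(adj[u] & live) + 1))
        S.append(v)          # each (i, j, k): schedule ri and rj on disk dk
        live -= adj[v] | {v}
    return S
```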
3.2 Weighted Set Cover Batch Scheduling
As described in Section 2, under the batch scheduling model requests are queued and scheduled simultaneously at each scheduling interval. Thus, the optimality and problem space of a batch schedule should be considered with respect to the set of requests queued at each scheduling interval. Although our MWIS offline scheduling algorithm can still be used to solve a batch scheduling problem, Theorem 2 in Appendix B proves that a batch scheduling problem is equivalent to a weighted set cover problem. Therefore, we propose a weighted set cover (WSC) batch scheduling algorithm as follows.
For a given batch scheduling problem, we construct its corresponding weighted set cover graph. In the graph, a
node represents a request, and a set represents a disk. The weight of a set is the amount of energy a disk consumes when requests are scheduled on it. Specifically, in Theorem 2, we show that the weight of a disk can be computed according to its current status as follows:

E(dk) = 0, if active or spinning up; Eup/down + TB * PI, if standby or spinning down; (Tnow − Tlast) * PI, if idle    (5)

where Tnow is the current time and Tlast is the time at which dk received its previous request. Once the graph is constructed, a weighted set cover algorithm is applied to select a set of disks with minimum total weight that covers all requests. Accordingly, we achieve the minimum energy cost by scheduling requests on these disks.
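Below is a minimal sketch of this batch scheduler, using the standard greedy set cover heuristic (pick the disk with the lowest weight per newly covered request). The data structures are illustrative assumptions, and every request is assumed to have at least one replica disk.

```python
def greedy_wsc_batch(queued, holds, weight):
    """queued: set of request ids in the current batch.
    holds: disk id -> set of queued requests whose data the disk stores.
    weight: disk id -> E(dk) from Eq. 5.
    Returns a request -> disk assignment covering every queued request."""
    uncovered, assignment = set(queued), {}
    while uncovered:
        # Most cost-effective disk: weight per newly covered request.
        best = min((d for d in holds if holds[d] & uncovered),
                   key=lambda d: weight[d] / len(holds[d] & uncovered))
        for r in holds[best] & uncovered:
            assignment[r] = best
        uncovered -= holds[best]
    return assignment
```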
3.3 Cost Function Online Scheduling
Finally, an online scheduling problem can be considered a special case of the batch scheduling problem in which only one request needs to be scheduled at a time. Thus, we could again apply our WSC scheduling algorithm to the online scheduling problem, so that the scheduler simply assigns each request to the data location with the minimum energy consumption, E(dk). In this subsection, however, we further extend the WSC scheduling algorithm by using a composite cost function, C(dk), to optimize for both the energy cost, E(dk), and a performance cost, P(dk), as follows:

C(dk) = E(dk) * (α / β) + P(dk) * (1 − α)    (6)
In this paper, we use request response time as the performance metric, because spinning down disks to save energy can increase the load on the remaining disks and delay responses. Thus, we define P(dk) as

P(dk) = the current number of requests on disk dk.    (7)
Finally, the parameters α and β adjust the cost. α determines the ratio between the energy and performance costs: the cost function C(dk) considers only the energy cost when α = 1, and only the performance cost when α = 0. β, on the other hand, determines the unit factor between the energy and performance costs: since E(dk) and P(dk) are computed in different units, β is used to scale one against the other.
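Putting Eq. 5-7 together, a minimal sketch of the online scheduler follows. The attribute names (energy_cost, queue_len, replicas) are our own illustrative assumptions.

```python
def cost(disk, alpha=0.2, beta=100.0):
    """Composite cost of Eq. 6. disk.energy_cost is E(dk) from Eq. 5,
    recomputed from the disk's power state; disk.queue_len is P(dk)."""
    return disk.energy_cost * (alpha / beta) + disk.queue_len * (1 - alpha)

def schedule_online(request, replicas):
    """Send the request to the cheapest disk among its existing replicas."""
    return min(replicas[request.data], key=cost)
```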
From Eq. 5 and Eq. 7, we can clearly see the trade-off between energy and response time. For instance, a standby disk with no load has a high energy cost but a low performance cost, while an active disk has a low energy cost but a high performance cost. Also, to save energy, the scheduler actually prefers a disk that is in the process of spinning up over a disk in idle mode, because doing so overlaps more requests and reduces disk idle time, even though the requests may suffer some spin-up delay. More sophisticated cost functions could be defined if more information and techniques were available; for example, a prediction technique could be used to estimate the access probability of a disk and assign a lower cost to a more frequently used disk. In this paper, however, we propose a basic solution as a first step to demonstrate our cost function strategy. A more detailed evaluation of the trade-off between energy and response time is given in Appendix A.2.
4. EXPERIMENTAL SETUP
For our experiments, we built an energy-aware storage system framework using OMNeT++ [18]. In addition, we integrated our framework with the disk simulator Disksim [7], using its interfaces to synchronize event times between the two simulators. The disk model simulated in the experiments is the Seagate Cheetah 15K.5 enterprise disk [4]. For this disk model, however, some power information, such as standby power, is missing from the associated documents; we therefore used power parameters from the Seagate Barracuda specification [1], as shown in Figure 5.
Figure 5: 2CPM configuration for experimental evaluations.
4.1 Workload Trace
Our approach was evaluated with two real disk I/O traces, Cello [3] and Financial1 [23]. Cello was collected from a timesharing system used by a small group of researchers at Hewlett-Packard Labs for simulation, compilation, editing, and mail. Financial1 was collected from an OLTP application running at a financial institution and made available in [23]. Both traces record low-level (block) disk I/O activities over an extended period of time; the recorded information includes disk access time (i.e., the time a disk receives a request), logical block address, block size, etc. In the experiments, we treat each unique combination of disk id and block address as a data item in the storage system. For both traces, we used 70,000 requests accessing over 30,000 data items.
4.2 Data Placement
Our experiments used a storage system with 180 disks. Although our approach makes no assumption about data placement, our evaluation was based on a particular placement with a high degree of data skewness among disks. Specifically, we use the replication factor to indicate the number of replicas of each data item in the storage system. The first data location is referred to as the original data location, and the remaining locations are referred to as replica data locations. In the experiments, we assume the original data locations were randomly selected among disks following a Zipf distribution, such that the probability p of choosing a disk of rank r is p = c/r^z, where c is a constant and z is often close to 1. This skewness of data locality was observed in the Cello trace, and similar skewness has been observed for web servers [2]. Such skewness could also arise in a system using a data placement energy saving technique [20]. On the other hand, we assume that replica data locations
were evenly distributed among disks, because this is a common data placement scheme for ensuring fault tolerance in a storage system. The results for a range of other data placement configurations are shown in Appendix A.1.
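For reference, a minimal sketch of how such a Zipf-skewed original placement can be drawn follows; the uniform replica placement is noted in the second function. Names and structure are illustrative.

```python
import random

def original_location(num_disks, z=1.0):
    """Draw the original data location: the disk of rank r is chosen
    with probability proportional to 1 / r**z (p = c / r**z)."""
    weights = [1.0 / (r ** z) for r in range(1, num_disks + 1)]
    return random.choices(range(num_disks), weights=weights)[0]

def replica_locations(num_disks, original, count):
    """Replica locations are drawn uniformly from the remaining disks
    (assumes count <= num_disks - 1)."""
    others = [d for d in range(num_disks) if d != original]
    return random.sample(others, count)
```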
4.3 Scheduling Algorithms
In the experiments, we compare our proposed energy-aware scheduling algorithms with two scheduling algorithms that do not consider energy (i.e., Static and Random).
Random: uniformly sends each request to one of its data locations.
Static: always sends requests to their original data locations.
Energy-aware Heuristic: the online algorithm proposed in Section 3.3. The cost function parameters are configured as α = 0.2 and β = 100 to balance energy and response time, as observed in Appendix A.2.
Energy-aware WSC: the batch algorithm proposed in Section 3.2 with a 0.1 s scheduling interval. The disk weights are computed with the same cost function as Heuristic, and the reduced set cover problem is solved by the greedy set cover algorithm described in Section 6.
Energy-aware MWIS: the offline algorithm proposed in Section 3.1. The scheduler is configured for an offline model with no disk spin-up delay, and the reduced MWIS problem is solved by the GMIN [22] greedy algorithm.
5. EXPERIMENTAL RESULTS
Due to limited space, this section presents the primary results for the Cello trace; see Appendix A for additional results.
5.1 Energy Consumption
Figure 6 compares the energy consumption of the different scheduling algorithms as the data replication factor increases from 1 to 5. The results in the figure are normalized to the energy consumption of an always-on power management configuration in which disks are never spun down. In the case of replication factor 1 (no data replication), all schedulers produce the same scheduling results, yet we still observe about 12% less energy consumption than the always-on configuration because of the 2CPM scheme. As the data replication factor increases, a scheduler has the option of assigning a request to its original data location or to any of its replica locations. Static still assigns requests only to their original data locations, so its energy consumption remains the same, as shown in Figure 6. Random, on the other hand, uniformly assigns requests to all data locations. Since the replica data locations in our experiments are uniformly selected, Random scatters requests ever more evenly among disks as the replication factor increases; disks then constantly receive requests and cannot be spun down, and the energy consumption of Random approaches the always-on configuration. In contrast, our energy-aware scheduling algorithms (WSC, MWIS, Heuristic) save more energy as the replication factor
increases, because more locations are available to our algorithms for minimizing energy. For example, WSC gradually reduces its normalized energy from 0.88 to 0.73, 0.63, 0.57 and 0.52 as the replication factor increases from 1 to 5. Also as expected, among the three energy-aware scheduling algorithms, MWIS has the lowest energy consumption while Heuristic has the highest, because MWIS has a priori knowledge of the requests whereas Heuristic only sees a single request at a time. Even so, under replication factor 3, Heuristic still significantly reduces the energy consumption relative to the always-on configuration, Random and Static by 32.2%, 22.72% and 13.40%, respectively. Furthermore, WSC and MWIS could achieve even lower energy by using more sophisticated set cover and independent set algorithms.
Figure 6: Energy consumption.
Figure 7: Disk spin-up/down.
Figure 8: Request response time.
Figure 9: Disk time percentage comparison among scheduling algorithms: (a) Random schedule; (b) Static schedule; (c) WSC schedule; (d) MWIS schedule.
5.2 Number of Disk Spin-up/down Operations
Figure 7 compares the number of disk spin-up/down operations for varying replication factors. Since the results of Static remain the same, we normalize the results to Static. In the case of replication factor 1, all scheduling algorithms except MWIS have the same number of disk spin-up/down operations, because no scheduling choice is involved. MWIS has far fewer spin-up/down operations because, under the offline scheduling assumption, a disk does not spin down if its successor request would suffer a disk spin-up delay. As the replication factor increases, the number of disk spin-up/down operations decreases for both our energy-aware scheduling algorithms and Random, but for different reasons. With a higher replication factor, Random distributes requests more evenly across disks, and disks become less likely to be spun down; as a result, we observe increasing energy consumption but a decreasing number of spin-up/down operations for Random in Figures 6 and 7. On the other hand, more available data replicas allow our algorithms to maximize energy saving by assigning requests to idle disks (and fur-
ther preventing standby disks from spinning up). WSC has slightly more disk spin-up/down operations than Heuristic, because WSC may spin up a disk if more requests can be serviced on it.
5.3 Request Response Time
Figure 8 compares the mean request response time. Since the offline scheduling model is an ideal case in which requests do not suffer any disk spin-up delay, the results of MWIS are omitted from this comparison. As shown in the figure, Heuristic and WSC achieve shorter mean request response times than Random and Static, because our energy-aware schedules incur fewer disk spin-up/down operations, as shown in Figure 7; consequently, requests are less likely to suffer disk spin-up delay. The response time of WSC is longer than that of Heuristic because WSC batches requests over a scheduling interval (0.1 s), so requests can suffer additional queueing delay. Nevertheless, under data replication factor 3, WSC still significantly reduces the response time of Static, from 1.11 s to 0.68 s (a 38.7% reduction).
5.4 Per-Disk Statistical Analysis
Finally, we analyze our scheduling algorithms in more detail by presenting statistics for individual disks. Figure 9 plots the percentage of time each disk spends in the four possible disk states: active, idle, standby and spin-up/down. The results were measured with a data replication factor of 3. The active time percentages are hardly visible in the figure because they are less than 1%. As observed in Figure 9(a), disks are idle for the majority of the time under Random, because requests are evenly distributed among disks; consequently, most disks constantly service requests and do not spin down or remain in standby for long periods. In contrast, Static schedules a substantial number of requests on a small number of disks, because the distribution of the original data loca-
tions is highly skewed among disks. Since disks with fewer requests have a higher probability of being spun down and remaining in standby, we observe higher standby time percentages on some disks under Static, as shown in Figure 9(b). Heuristic and WSC utilize non-standby disks more eagerly than Static; that is, they maximize energy saving by servicing requests on active or idle disks. Thus, the results show greater skewness of the request distribution among disks, and more disks have a greater standby time percentage, as shown in Figures 9(c) and 9(d). As disks consume much less energy in standby, this also explains the greater energy saving of our algorithms in Figure 6.
6. ADDITIONAL RELATED WORK
Weighted set cover (WSC) and maximum weighted independent set (MWIS) are two classic NP-complete graph problems. The greedy set cover algorithm iteratively selects the most cost-effective set and removes the covered elements until all elements are covered; it is proven to be an Hn-factor approximation algorithm for the minimum set cover problem, where Hn = 1 + 1/2 + · · · + 1/n and n is the number of nodes in the graph. For the maximum independent set problem, a well-studied greedy algorithm selects a vertex of minimum degree, removes it and its neighbors from the graph, and iterates on the remaining graph until no vertex remains. This algorithm is extended in [22] as GMIN to solve the weighted independent set problem. Unlike the set cover problem, however, the MWIS problem does not admit a constant-factor approximation algorithm for general graphs [10]. Exact algorithms for MWIS have been developed and gradually improved [12], but approximation algorithms with additional constraints and assumptions have been more commonly discussed [5, 8]. Recently, branch-and-price methods [24] have also been proposed for general graphs. In our experiments, we applied the greedy set cover and GMIN algorithms in our scheduling algorithms.
7. CONCLUSION
This paper investigated energy-aware scheduling of disk access requests in a storage system. Our approach utilizes the existing data replication and power management policy of a storage system to maximize disk standby time and minimize energy consumption. Therefore, unlike previous approaches [20, 15, 25], our approach does not interfere with the data placement in a storage system and does not require additional modifications to the storage system. We discussed our approach for three scheduling models: online, offline and batch. We showed that the offline and batch scheduling problems can be solved by a weighted independent set and a weighted set cover algorithm, respectively. A cost function strategy was also proposed to address the trade-off between energy and performance in our online and batch scheduling problems. Compared with scheduling algorithms that do not consider energy, our results show that energy-aware scheduling can significantly reduce energy consumption with fewer disk spin-up/down operations and shorter request response times. Future work involves testing our algorithms on popular file systems such as HDFS and GFS, as well as on commercially available disk systems with built-in power management software such as COPAN-400 and Nexsan.
8. ACKNOWLEDGMENTS
This work is supported by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under Contract No. AC02-05CH11231.
9. REFERENCES
[1] Seagate Barracuda, http://www.seagate.com/support/disc/manuals/sata/100402371a.pdf.
[2] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM, pages 126-134, 1999.
[3] Tools and traces, http://www.hpl.hp.com/research/ssp/software/.
[4] Seagate Cheetah 15K.5, http://www.seagate.com/www/en-us/products/enterprise-hard-drives/cheetah-15k/.
[5] G. H. Chen, M. T. Kuo, and J. P. Sheu. An optimal time algorithm for finding a maximum weight independent set in a tree. BIT, 28(2):353-356, 1988.
[6] D. Colarelli and D. Grunwald. Massive arrays of idle disks for storage archives. In ACM/IEEE Conference on Supercomputing, pages 1-11, 2002.
[7] Disksim, http://www.pdl.cmu.edu/disksim/.
[8] D. Gamarnik, D. Goldberg, and T. Weber. PTAS for maximum weight independent set problem with random weights in bounded degree graphs. In SODA, 2010.
[9] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, and H. Franke. Reducing disk power consumption in servers with DRPM. Computer, 36(12):59-66, 2003.
[10] J. Hastad. Clique is hard to approximate within n^(1-ε). Acta Mathematica, pages 105-142, 1999.
[11] S. Irani, G. Singh, S. K. Shukla, and R. K. Gupta. An overview of the competitive and adversarial approaches to designing dynamic power management strategies. IEEE Trans. VLSI Syst., 13(12):1349-1361, 2005.
[12] T. Jian. An O(2^(0.304n)) algorithm for solving maximum independent set problem. IEEE Trans. Comput., pages 847-851, 1986.
[13] J. Kim and D. Rotem. Using replication for energy conservation in RAID systems. In PDPTA'10, 2010.
[14] W. Lang and J. Patel. Energy management for MapReduce clusters. In VLDB'10, 2010.
[15] W. Lang, J. M. Patel, and J. F. Naughton. On energy management, load balancing and replication. SIGMOD Rec., 38(4):35-42, 2009.
[16] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev., 44(1):61-65, 2010.
[17] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. Trans. Storage, 4(3):1-23, 2008.
[18] OMNeT++, http://www.omnetpp.org/.
[19] E. Otoo, D. Rotem, and S.-C. Tsao. Dynamic data reorganization for energy savings in disk storage systems. In SSDBM '10, pages 322-341, 2010.
[20] E. Pinheiro and R. Bianchini. Energy conservation techniques for disk array-based servers. In ICS '04, pages 68-78, 2004.
[21] E. Pinheiro, R. Bianchini, and C. Dubnicki. Exploiting redundancy to conserve energy in storage systems. In SIGMETRICS '06, pages 15-26, 2006.
[22] S. Sakai, M. Togasaki, and K. Yamazaki. A note on greedy algorithms for the maximum weighted independent set problem. Discrete Appl. Math., 126(2-3):313-322, 2003.
[23] UMass Trace Repository: OLTP Application I/O, http://traces.cs.umass.edu/index.php/storage/storage.
[24] D. Warrier, W. E. Wilhelm, J. S. Warren, and I. V. Hicks. A branch-and-price approach for the maximum weight independent set problem. Networks, 46(4):198-209, 2005.
[25] C. Weddle, M. Oldham, J. Qian, A. Wang, P. Reiher, and G. Kuenning. PARAID: A gear-shifting power-aware RAID. Trans. Storage, 2007.
[26] Q. Zhu, A. Shankar, and Y. Zhou. PB-LRU: A self-tuning power aware storage cache replacement algorithm for conserving disk energy. In ACM/IEEE Conference on Supercomputing, pages 79-88, 2004.
[27] Q. Zhu and Y. Zhou. Power-aware storage cache management. IEEE Trans. Comput., 54(5):587-602, 2005.
APPENDIX
A. ADDITIONAL RESULTS
A.1 Data Placement Analysis
To show that our approach is resilient to data placement, we present results over a range of data placement configurations, varying the data replication factor and the locality of the original data locations. As in the previous evaluation, we set the replication factor from 1 to 5. In addition, we varied the degree of locality of the original data locations by using a set of Zipf-like distributions, varying the exponent z of the Zipf distribution (p = c/r^z) from 0 to 1 in steps of 0.1. With z = 1, the distribution of the original data locations follows a Zipf distribution (rather than a Zipf-like one); these results were discussed in detail in Section 5. With z = 0, both original and replica data locations are selected uniformly among disks. The energy consumption was measured for each data placement configuration and plotted in Figure 10. Due to limited space, we only present Heuristic, as the worst case of our energy-aware scheduling approach, since WSC and MWIS achieve lower energy consumption than Heuristic. Since the impact of the data replication factor was discussed in Section 5.1, here we focus on the impact of data locality. As expected, the results of Random and Static rely heavily on the skewness of data locality, as shown in Figures 10(a) and (b). Neither Random nor Static can reduce energy consumption when data locality is uniformly distributed (z = 0); they only achieve lower energy consumption when data locality is more heavily skewed among disks. In contrast, Figure 10(c) shows that with a data replication factor of 5, our energy-aware solution was still able to reduce energy by over 40% under a uniform data placement. In addition, the impact of data locality diminishes as the data replication factor increases. In sum, our approach does not rely on data placement as previous studies do [21, 19, 20]: not only can it take advantage of the skewness of data locality, but it can also achieve lower energy consumption by exploiting more data replicas.
A.2 Cost Function Configuration
We investigate the cost function configuration of Heuristic using the results with a data replication factor of 3 as a case study. Figure 11 plots the mean request response time and energy consumption for varied values of the cost function parameters α and β (recall that the cost function is defined with these two parameters in Eq. 6). For comparison, results in the figures are normalized to the values at α = 0. As discussed with Eq. 6, the schedule considers only the energy cost when α = 1 and only the response time cost when α = 0. Thus, as α increases, energy consumption decreases (Figure 11(a)) while the mean request response time increases (Figure 11(b)). More importantly, Figure 11(a) shows that energy consumption was reduced by more than 35% when using α = 1, while Figure 11(b) shows that the request response time was reduced by more than a factor of 2 when using α = 0. Therefore, our cost function provides a wide range of trade-offs for performance optimization. On the other hand, a larger β means a smaller share of the energy cost in the overall cost. Therefore, for the same α value, a larger β yields higher energy consumption, as shown
in Figure 11(a), but lower response time, as shown in Figure 11(b). More specifically, β determines the slope of the curves in Figure 11: with a smaller β, the results shift more toward the values at α = 1, whereas with a larger β they shift more toward the values at α = 0. In sum, although we have shown the trade-offs enabled by our cost function strategy, choosing the best α and β values is not trivial, and we leave the investigation of optimal parameter settings, as well as a better design of the cost function itself, as future work. Nonetheless, from Figures 11(a) and (b) we observed that the results for α = 0.2 and β = 100 are close to the best in terms of both energy saving and response time. For this reason, we use this parameter configuration for the Heuristic and WSC schedules throughout the paper.
A.3 Request Response Time Analysis
We further analyze the response time results and show that the response time delay is mainly caused by the disk spin-up delay of a small fraction of requests. Figure 12 plots inverse cumulative distributions of request response time, that is, the percentage of requests with a response time larger than a given value (i.e., Prob[response time > x]). The results of the always-on configuration and MWIS are plotted as baselines, because they do not suffer any disk spin-up delay. As can be seen from the figure, the majority of requests have response times below 100 ms under any schedule. However, fewer than 1% of requests suffered spin-up delays of up to 15 seconds when the 2CPM scheme was used. To show that the mean request response times in Figure 8 were greatly increased by these requests, we present the 90th percentile request response time in Figure 13. As shown in the figure, the 90th percentile response time is 10 ms for the always-on configuration and MWIS. Although the 90th percentile response time of Heuristic is 70 ms under replication factor 1, it drops to 10 ms with greater replication factors. The response time of WSC is the highest because of the queuing delay due to the batch scheduling interval (0.1 s), but it is also reduced, from 140 ms to 110 ms, as the replication factor increases.
A.4 Financial1 Trace Results
Finally, we briefly discuss the experimental results for the Financial1 trace. The energy consumption, number of disk spin-up/down operations and mean response time are shown in Figures 14-16, and the disk time percentage breakdown in Figure 17. As shown in the figures, the results are quite similar to those for the Cello trace. The only difference between the two traces is the request response time: from Figure 8 and Figure 16, the mean request response time for the Cello trace is around 1 second, but it is only around 300 ms for the Financial1 trace. This is because the request inter-arrival times in the Cello trace have much higher burstiness and variation; as a result, some requests suffer much longer disk access delays, which increases the mean request response time for the Cello trace. As shown in Figure 8 and Figure 16, however, our energy-aware scheduling consistently outperforms Static and Random on both traces.
Figure 10: Energy consumption comparison under various data replication factor and data locality configurations: (a) Random schedule; (b) Static schedule; (c) Heuristic schedule. Energy consumption is normalized to the always-on power configuration.
Figure 11: Performance trade-off from the cost function: (a) energy consumption; (b) request response time.
Figure 12: Request response time distribution.
Figure 13: 90th percentile request response time.
Figure 14: Energy consumption comparison of Financial1 trace.
Figure 15: Number of spin-up/down comparison of Financial1 trace.
Figure 16: Mean request response time comparison of Financial1 trace.
Figure 17: Disk time percentage comparison of Financial1 trace: (a) Random schedule; (b) Static schedule; (c) WSC schedule; (d) MWIS schedule.
B. THEOREMS AND PROOFS
Here we give the detailed descriptions of our theorems and proofs. The variables and notation used in this paper are also summarized in Table 1.
Lemma 1. For a given scheduling problem ES(R, D, L, P), assume a schedule S^x_ES assigns request ri to disk dk and request rj is the successor of ri on dk. Then the energy saving of ri is X(S^x_ES, ri) = X(i, j, k), where the value of X(i, j, k) is computed as in Eq. 3.
Proof. First of all, the amount of energy saving of ri depends only on the arrival times of ri and rj, because the energy consumption of ri is defined as the energy consumption of its scheduled disk from the arrival time of ri to the arrival time of rj. Thus, we compute the energy saving X(i, j, k) by analyzing all possible scheduling outcomes for ri until rj arrives. Specifically, according to the arrival times ti and tj of ri and rj, there are three possible outcomes with different disk statuses, as shown in Figure 18. We compute the value of X(i, j, k) in each case with respect to the maximum energy consumption, Eup + Edown + TB * PI, as follows.
(a) Energy-saving case I: No energy saving, when rj arrives after (ti + Tup + Tdown + TB ) because the disk is spun down before rj arrives.
(b) Energy-saving case II: When rj arrives between (ti + Tup + Tdown + TB ) and (ti + TB ), the disk remains idle, and the disk spin-up/down energy is saved. But idle time is extended until rj arrives to prevent disk spin-up delay for rj .
• Case I, tj > (ti + Tup + Tdown + TB ): As shown in Figure 18(a), disk dk has experienced a breakeven time after servicing ri , and the disk is spun down and spun up again before the arrival of rj . Therefore, the energy consumption of ri is Eup +Edown +TB ∗ PI ). Since the energy consumption is equal to EPmax , the energy saving of ri in this case is X(i, j, k) = 0.
(c) Energy-saving case III: When rj arrives before (ti + TB ), the disk spin-up/down energy is saved, and the idle time is reduced.
• Case II, (ti + Tup + Tdown + TB ) ≤ tj > (ti + TB ): In the second case, request rj still arrives after the breakeven time of ri . But, if the disk is spun down at time ti + TB , the disk cannot be spun up and active until the disk spin-up/down operation completes at time ti + Tup + Tdown . According to our assumption for the offline model, a disk must remain idle to prevent requests suffer disk spin-up delay. To satisfy the assumption, in Figure 18(b), the disk must remain idle before rj arrives. As a result, the energy consumption of ri is PI ∗ (tj − ti ). Notice, the disk spinup/down energy is prevented in this case because the disk remain idle. Therefore, in this case, X(i, j, k) = Eup + Edown + (TB − (tj − ti )) ∗ PI .
Theorem 1. Offline energy-aware scheduling problem can be reduced to a weighted independent set problem and solved by our MWIS scheduling algorithm.
• Case III, (ti + TB ) > tj > ti : Figure 18(c) shows the last case when rj arrives before the disk experiences a breakeven time after servicing ri . In this case, the disk simply stays idle between time ti and tj . Therefore, the energy consumption of ri is also PI ∗ (tj − ti ), and X(i, j, k) = Eup + Edown + (TB − (tj − ti )) ∗ PI like the previous case. Summarized from the results above, we prove the value of X(i, j, k) is computed as Eq. 3 8 Eup + Edown + ((tj − ti ) + TB ) ∗ PI , > > < if 0 ≤ ti − tj < TB + Tup + Tdown X(i, j, k) = and ri , rj are scheduled on dk . > > : 0, otherwise
Figure 18: Energy-saving example between two requests on a single disk. (a) Case I: no energy saving; rj arrives after ti + Tup + Tdown + TB, so the disk is spun down before rj arrives. (b) Case II: rj arrives between ti + TB and ti + Tup + Tdown + TB; the disk remains idle and the spin-up/down energy is saved, but the idle time is extended until rj arrives to prevent a spin-up delay for rj. (c) Case III: rj arrives before ti + TB; the spin-up/down energy is saved and the idle time is reduced.
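To make Eq. 3 concrete, here is a minimal sketch in Python that computes X(i, j, k) from the two arrival times and the disk configuration. The names DiskParams and energy_saving are our own illustrative choices, and we derive TB from Eup + Edown following the breakeven-time definition; this is a sketch of the case analysis above, not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class DiskParams:
    p_idle: float   # P_I: idle power (watts)
    e_up: float     # E_up: spin-up energy (joules)
    e_down: float   # E_down: spin-down energy (joules)
    t_up: float     # T_up: spin-up time (seconds)
    t_down: float   # T_down: spin-down time (seconds)

    @property
    def t_break(self) -> float:
        # T_B = E_up/down / P_I: the breakeven time (idleness threshold)
        return (self.e_up + self.e_down) / self.p_idle

def energy_saving(t_i: float, t_j: float, d: DiskParams) -> float:
    """X(i, j, k) from Eq. 3: the energy saved for request r_i when r_i is
    scheduled on a disk and r_j is its successor on the same disk."""
    gap = t_j - t_i
    if 0 <= gap < d.t_break + d.t_up + d.t_down:
        # Cases II and III: the disk stays idle until r_j arrives, avoiding
        # the spin-up/down energy relative to E_max.
        return d.e_up + d.e_down + (d.t_break - gap) * d.p_idle
    # Case I: the disk spins down and back up; consumption equals E_max.
    return 0.0
```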
Theorem 1. The offline energy-aware scheduling problem can be reduced to a maximum weighted independent set problem and solved by our MWIS scheduling algorithm.

Proof. As shown in our MWIS scheduling algorithm, given a scheduling problem ES(R, D, L, P), we can construct a corresponding maximum weighted independent set problem G = (V, E). Conversely, given an independent set solution S^x_G of G, we can derive a solution S^x_ES for the scheduling problem ES.

We first prove that the schedule S^x_ES derived from S^x_G is indeed a valid solution for the given scheduling problem ES. According to our MWIS scheduling algorithm, for each X(i, j, k) ∈ S^x_G, requests ri and rj are scheduled on disk dk in S^x_ES. If ri or rj were scheduled on two different disks in S^x_ES, there would exist another X(i′, j′, k′) ∈ S^x_G such that k′ ≠ k and {i, j} ∩ {i′, j′} ≠ ∅. However, in the graph G there must be an edge between such X(i, j, k) and X(i′, j′, k′), because they do not satisfy the schedule-constraint in our formulation. As a result, no such X(i′, j′, k′) can exist in S^x_G. Furthermore, by definition, X(i, j, k) ∈ ES if and only if ri and rj can be scheduled on dk. Therefore S^x_ES is a valid scheduling assignment for the given scheduling problem ES.

Now we prove that the schedule S^x_ES derived from S^x_G is an optimal solution. Assume S^x_G is an optimal solution of G. If there existed a valid scheduling assignment S^y_ES with more energy saving than S^x_ES for the scheduling problem ES, we could find a corresponding independent set solution S^y_G for schedule S^y_ES as follows. For each request ri ∈ R, if ri is scheduled on disk dk and its successor request on the disk is rj, we select X(i, j, k) in S^y_G. Since S^y_ES is a valid schedule, and exactly one X(i, j, k) is selected in S^y_G for each request,
all X(i, j, k) ∈ S^y_G must satisfy the schedule-constraint and the energy-constraint. Therefore, S^y_G is also a valid independent set solution of G. Since the energy saving of a schedule equals the total weight of its corresponding independent set, if S^y_ES had more energy saving than S^x_ES, the weight of S^y_G would also be larger than the weight of S^x_G. The existence of such an S^y_ES therefore contradicts the optimality of S^x_G, so S^x_ES must be an optimal solution for the given scheduling problem ES.
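To illustrate the construction used in this proof, the following sketch builds the node set and conflict edges of G, reusing the DiskParams and energy_saving helpers sketched above. The conflict test encodes our reading of the schedule-constraint and energy-constraint as described in the proof; the function names build_mwis_graph and conflicts are illustrative, not the paper's.

```python
import itertools

def conflicts(a, b):
    """Schedule-constraint (our reading): two triples conflict if they share
    a request but place it on different disks, or if they duplicate a
    successor or predecessor relation on the same disk."""
    (i1, j1, k1), (i2, j2, k2) = a, b
    if not ({i1, j1} & {i2, j2}):
        return False
    if k1 != k2:
        return True              # the shared request would sit on two disks
    return i1 == i2 or j1 == j2  # duplicate successor/predecessor on one disk

def build_mwis_graph(arrivals, placement, disk):
    """Construct G = (V, E): one node per candidate triple (i, j, k), meaning
    r_j is the successor of r_i on disk d_k, weighted by X(i, j, k).
    arrivals: request arrival times sorted ascending; placement[i]: the set
    of disk ids holding the data of r_i (the placement assignment L)."""
    nodes = {}
    for i, j in itertools.combinations(range(len(arrivals)), 2):
        for k in placement[i] & placement[j]:
            w = energy_saving(arrivals[i], arrivals[j], disk)
            if w > 0:  # energy-constraint (our reading): positive saving only
                nodes[(i, j, k)] = w
    edges = [(a, b) for a, b in itertools.combinations(nodes, 2)
             if conflicts(a, b)]
    return nodes, edges
```

A maximum weighted independent set of this graph then yields the schedule with the largest total energy saving, as the proof argues.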
Theorem 2. For a given batch scheduling problem ES(R, D, L, P), there exists an equivalent weighted set cover problem WSC(U, S), where U is a universe and S is a family of subsets of U.

Proof. First we show that a batch scheduling problem can be reduced to a weighted set cover problem. For a given batch scheduling problem ES, we construct its corresponding weighted set cover problem WSC as follows. For each request ri ∈ R, we create an element i ∈ U. For each disk dk ∈ D, we create a set sk ∈ S. If the requested data of ri is stored on disk dk according to L, then i ∈ sk. Finally, we set the weight of sk to the additional energy consumed on disk dk if we schedule requests on that disk. The weight of sk is thus computed from the current disk status, as in Eq. 5:

• If the disk status is active or spin-up: the additional energy consumption is 0, because the request will neither spin up the disk nor extend its idle time, so no additional spin-up or idle energy is added to the disk.

• If the disk status is standby or spin-down: the additional energy consumption is Eup + Edown + PI ∗ TB, because the disk has to be spun up and then spun down again after experiencing a breakeven idle time.

• If the disk status is idle: the additional energy consumption is the amount of idle time extended by the request, multiplied by the idle power. For example, consider scheduling requests on an idle disk dk at the current time Tnow, where dk received its previous request at time Tlast. If we do not schedule requests on dk, the disk would be spun down at time Tlast + TB. However, if we do schedule requests on dk, the disk cannot be spun down until time Tnow + TB. Therefore, the idle time of dk is extended by Tnow − Tlast, and the corresponding energy consumption is (Tnow − Tlast) ∗ PI.

As a result, a given set cover solution S^x_WSC for WSC can be polynomially translated into a schedule solution S^x_ES for the scheduling problem ES, and vice versa. Since the amount of energy consumption of schedule S^x_ES equals the weight of S^x_WSC, for each batch scheduling problem ES there exists an equivalent weighted set cover problem WSC(U, S).
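The per-disk weights above (Eq. 5) are straightforward to compute. A minimal sketch follows, reusing DiskParams from the earlier block; DiskState and disk_weight are our own illustrative names:

```python
from enum import Enum

class DiskState(Enum):
    ACTIVE = "active"
    SPIN_UP = "spin-up"
    IDLE = "idle"
    STANDBY = "standby"
    SPIN_DOWN = "spin-down"

def disk_weight(state: DiskState, t_now: float, t_last: float,
                d: DiskParams) -> float:
    """Additional energy (joules) incurred on a disk if the current batch of
    requests is scheduled on it, per the case analysis in Theorem 2."""
    if state in (DiskState.ACTIVE, DiskState.SPIN_UP):
        return 0.0  # no extra spin-up or idle energy
    if state in (DiskState.STANDBY, DiskState.SPIN_DOWN):
        # spin up, serve, then spin down after a breakeven idle period
        return d.e_up + d.e_down + d.p_idle * d.t_break
    # idle: scheduling extends the idle period from t_last + T_B to t_now + T_B
    return (t_now - t_last) * d.p_idle
```

With these weights, a batch scheduler can run a standard greedy weighted set cover heuristic, repeatedly picking the disk whose weight per newly covered request is smallest.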
Theorem 3. The offline scheduling problem is NP-complete.

Proof. We prove this by reduction from the maximum independent set problem. Given any graph G(V, E), we generate a request stream R from it as follows. For each edge e(vi, vj), we create a unique request re and two additional dummy requests rei and rej. The requested data of re is placed on disks di and dj, that of rei only on disk di, and that of rej only on disk dj. We then generate the request stream R by placing all re associated with e ∈ E (in any order) with large time intervals between their arrival times (≫ TB), and giving rei and rej the same arrival time as their re. It is then easy to show that an optimal schedule for R can be polynomially translated into a maximum independent set of G.
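A small sketch of this gadget construction; the spacing constant big_gap stands in for the "≫ TB" separation, and all names are illustrative:

```python
def reduction_requests(edges, t_break):
    """Build the request stream R and placement L used in Theorem 3.
    edges: list of (v_i, v_j) vertex pairs; disks are identified with
    vertices. Returns (arrivals, placement): per-request arrival times and
    the set of disks holding each request's data."""
    big_gap = 1000 * t_break  # arrival spacing >> T_B between edge gadgets
    arrivals, placement = [], []
    for n, (vi, vj) in enumerate(edges):
        t = n * big_gap
        arrivals += [t, t, t]                 # r_e, r_ei, r_ej arrive together
        placement += [{vi, vj}, {vi}, {vj}]   # r_e on both disks; dummies pinned
    return arrivals, placement
```

Per the proof, an optimal schedule for this stream translates polynomially into a maximum independent set of G.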