Minimizing Expected Energy Consumption in Real-Time Systems ...

9

Minimizing Expected Energy Consumption in Real-Time Systems through Dynamic Voltage Scaling ´ and RAMI MELHEM RUIBIN XU, DANIEL MOSSE University of Pittsburgh

Many real-time systems, such as battery-operated embedded devices, are energy constrained. A common problem for these systems is how to reduce energy consumption in the system as much as possible while still meeting the deadlines; a commonly used power management mechanism by these systems is dynamic voltage scaling (DVS). Usually, the workloads executed by these systems are variable and, more often than not, unpredictable. Because of the unpredictability of the workloads, one cannot guarantee to minimize the energy consumption in the system. However, if the variability of the workloads can be captured by the probability distribution of the computational requirement of each task in the system, it is possible to achieve the goal of minimizing the expected energy consumption in the system. In this paper, we investigate DVS schemes that aim at minimizing expected energy consumption for frame-based hard real-time systems. Our investigation considers various DVS strategies (i.e., intra-task DVS, inter-task DVS, and hybrid DVS) and both an ideal system model (i.e., assuming unrestricted continuous frequency, well-defined power-frequency relation, and no speed change overhead) and a realistic system model (i.e., the processor provides a set of discrete speeds, no assumption is made on power-frequency relation, and speed change overhead is considered). The highlights of the investigation are two practical DVS schemes: Practical PACE (PPACE) for a single task and Practical Inter-Task DVS (PITDVS2) for general frame-based systems. Evaluation results show that our proposed schemes outperform and achieve significant energy savings over existing schemes. Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management—Scheduling; D.4.7 [Operating Systems]: Organization and Design—Real-time systems General Terms: Algorithms Additional Key Words and Phrases: Real-time, dynamic voltage scaling, power management, processor acceleration to conserve energy ACM Reference Format: Xu, R., Mossé, D., and Melhem, R. 2007. Minimizing expected energy consumption in real-time systems through dynamic voltage scaling. ACM Trans. Comput. Syst. 25, 4, Article 9 (December 2007), 40 pages. DOI = 10.1145/1314299.1314300 http://doi.acm.org/10.1145/1314299.1314300 Support for the Power Awareness Real Time Systems (PARTS) Group has been provided by DARPA (contract #F33615-00C-1736) and NSF (grants ANI-0125704 and ANI-0325353). Parts of this article have appeared in Xu et al. [2004, 2005]. Authors’ address: email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. C 2007 ACM 0738-2071/2007/12-ART9 $5.00 DOI 10.1145/1314299.1314300 http://doi.acm.org/ 10.1145/1314299.1314300 ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:2

•

R. Xu et al.

1. INTRODUCTION Energy conservation is critically important for many real-time systems such as battery-operated embedded devices with restricted energy budgets. Dynamic voltage scaling (DVS), which involves dynamically adjusting the voltage and frequency of the processor, has become a well-known technique in power management for real-time systems of which the dynamic power accounts for a significant portion of the total power. An increasing number of processors implement DVS, starting from the Transmeta’s Crusoe [Transmeta 2005], which is no longer shipping, but set the trend followed by Intel’s XScale [Xscale 2007] and AMD’s Athlon 64 [AMD 2001]. Through DVS, quadratic energy savings can be achieved at the expense of just linear performance loss [Weste and Eshraghian 1993]. Thus, the execution of the tasks can be slowed down in order to save energy as long as the deadline constraints are not violated. The key problem is to design DVS schemes that determine the execution speeds (i.e., frequency and voltage) of the tasks in the system. We are particularly interested in frame-based hard real-time systems, which are special cases of periodic real-time systems. In frame-based real-time systems, all tasks share the same period (also called the frame) and deadlines are equal to the end of the period. In each frame, tasks are executed in a predetermined fixed order. Many real-world real-time systems, especially embedded systems, use the frame-based model due to its simplistic and deterministic nature. On the other hand, periodic real-time systems that use static table driven scheduling approaches [Ramamritham and Stankovic 1994] such as cyclic executive [Baker and Shaw 1989] can be treated as multiple frame-based systems operating in succession because each minor schedule in a cyclic schedule corresponds to a frame. For frame-based systems, the problem of designing a DVS scheme can be reduced to determining the amount of time allotted to a task (and accordingly, deciding the speed(s) to execute it) before it is dispatched to the system. Our focus is on workloads that are variable and unpredictable. In these workloads, tasks usually run for less than their worst-case execution cycles (WCEC1 ), creating the opportunity for dynamic slack reclamation to slow down future tasks. Furthermore, the actual execution cycles of tasks are unknown and cannot be predicted before execution. The variability and unpredictability of the workloads are mainly caused by different inputs to tasks, and possibly by randomization inside tasks. Because of the characteristics of the workloads that we are dealing with, it is impossible for any DVS scheme to guarantee to minimize the energy consumption in the system. However, if the variability of the workloads can be captured by the probability distribution of the computational requirement of each task in the system (e.g., through profiling), it is possible to design DVS schemes that minimize the expected system energy consumption. This means that for large number of frames, such DVS schemes will achieve the most total energy savings even though they do not necessarily result in lower energy consumption than other DVS schemes for any given frame. 1 The

traditional worst-case execution time (WCET) in real-time systems is defined as the running time at the maximum processor speed when a task takes its WCEC.

ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

Minimizing Expected Energy Consumption in Real-Time Systems

•

9:3

Table I. The Road Map of Our Investigation Model/DVS strategy Ideal model

Realistic model

Intra-task DVS PACE [Lorch and Smith 2001] (Section 4.1) PPACE (Section 5.1)

Inter-task DVS OITDVS (Section 4.2) PITDVS PITDVS2 (Section 5.2)

Hybrid DVS GOPDVS [Zhang et al. 2005] (Section 4.3) PGOPDVS PIT-PPACE (Section 5.3)

In this paper, we investigate DVS schemes for frame-based hard real-time systems with the goal of minimizing the expected energy consumption in the system while meeting the deadlines, given probabilistic information of the workloads. The DVS schemes should be: (i) effective: achieve good performance in terms of energy savings; (ii) efficient: the offline computation of the schemes must be tractable (i.e., at most polynomial time) and the online computation of the schemes must be done in no more than linear time in order to minimize the scheduling overhead; (iii) practical: the schemes must take into consideration practical issues such as limited number of discrete speeds and speed change overhead, which may not be ignored in practice. Bearing the above requirements in mind, we carry out the investigation in two dimensions. The first dimension is the DVS strategy, which can be categorized as inter-task, intra-task, and hybrid DVS [Kim et al. 2002]. Inter-task DVS focuses on allotting system time to multiple tasks and schedules speed changes only at each task boundary (i.e., the execution speed for a task is constant for each instance executed), while intra-task DVS focuses on how to schedule speed changes within a single task instance given an allotted amount of time. Hybrid DVS combines intra-task and inter-task DVS. The second dimension is the system model. In an ideal system model, the processor speed can be tuned continuously and unrestrictedly, and the speed change overhead is ignored. Because of the great simplicity of this model, it is possible to derive elegant and optimal DVS schemes through which we gain insight into the problems that we are dealing with. Furthermore, the DVS schemes derived under the ideal system model could serve as the basis for deriving DVS schemes under the second system model, that is, the realistic system model, in which the processor has a limited number of discrete speeds and the speed change overhead is considered. Table I shows the road map of our investigation. Our ultimate goal is to obtain DVS schemes that satisfy the aforementioned three requirements (i)–(iii) under the realistic system model. The main contribution of this article is fourfold. (1) We propose OITDVS, the Optimal Inter-Task DVS scheme under the ideal system model, and provide a unified view at the optimal DVS schemes for all DVS strategies under the ideal system model. (2) We present PPACE (Practical PACE), a new DVS scheme for frame-based systems with a single task that takes into consideration discrete speeds and speed change overhead. PPACE can give performance guarantees and achieve energy savings very close to the optimal solution. ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:4

•

R. Xu et al.

(3) We propose PITDVS2 (Practical Inter-Task DVS, using 2 speeds) for general frame-based systems and show that it outperforms the existing DVS schemes in our experiments. (4) We show that simple patches to optimal DVS schemes obtained under the ideal system model do not necessarily generate DVS schemes that perform well in practice. An example of such patches is rounding the ideal continuous speed to the next higher speed that exists in the processor. The remainder of this article is organized as follows. We first present the related work in Section 2. The task and system models are described in Section 3. Section 4 presents the optimal DVS schemes for the ideal system model and Section 5 studies DVS schemes for the realistic system model. The evaluation results are reported in Section 6. Section 7 presents concluding remarks and future work directions. 2. RELATED WORK In CMOS circuits, power is roughly proportional to the square of the input voltage times the operating frequency [Weste and Eshraghian 1993]. But input voltage and operating frequency are not independent, and there is a minimum voltage required to drive the circuit at the desired frequency. The minimum voltage is approximately proportional to the frequency, leading to the conclusion that the power is proportional to the frequency cubed [Brooks et al. 2000]. Since the time taken to run a program is inversely proportional to the operating frequency, quadratic energy savings can be achieved at the expense of just linear performance loss through DVS [Hong et al. 1998]. Although much work has been done on exploring DVS in real-time environments since the seminal work in Yao et al. [1995], we describe work that takes into consideration actual (not worst-case) execution time of tasks. This is because real-time applications usually exhibit a large variation in actual execution times (e.g., Ernst and Ye [1997] reported that the ratio of the worst-case execution time to the best-case execution time can be as high as 10 in typical applications; our measurements in Rusu et al. [2004] showed that this ratio can be as high as 100). Thus, DVS schemes that exclusively use worst-case execution times lack the advantage of unused computation time. At the implementation level, DVS schemes can be categorized into compiler-assisted schemes and OS-only schemes. Compiler-assisted schemes [AbouGhazaleh et al. 2003; Hsu and Kremer 2003; Shin et al. 2001; Xie et al. 2003; Seo et al. 2005] rely on compilers to analyze programs to perform speed scheduling. In this work, we are interested in OS-only schemes and assume no access to program source code. OS-only schemes can be further subcategorized into prediction-based [Zhu and Mueller 2004; Sinha and Chandrakasan 2001], stochastic [Yuan and Nahrstedt 2003; Lorch and Smith 2001; Gruian 2001a; Zhang et al. 2005], and other. Among prediction-based schemes, a feedback DVS algorithm using PID control [Zhu and Mueller 2004] was proposed to predict the computational requirement of a task based on a designated window of history. Moreover, Markov process was applied to predict future computation based on recent history [Sinha ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.


•

9:5

and Chandrakasan 2001] . Since we focus on workloads that cannot be predicted based on recent history at the OS level (e.g., MPEG decoder and ATR [Mahalanobis et al. 1996]), we do not consider prediction-based schemes. Stochastic schemes assume that the variability of the workloads can be captured by the probability distribution of the computational requirement of each task, as has been shown to be valid for many applications [Yuan and Nahrstedt 2003]. There are also OS-only schemes that only take advantage of the slack reclamation technique and do not use any stochastic information of the system. Next, we will review the related work of OS-only schemes excluding prediction-based schemes. 2.1 Inter-Task DVS Inter-task DVS schemes differ in the way slack is allotted to tasks in the system. Slack includes static slack due to system under-utilization when each task is assumed to run for its WCEC and dynamic slack due to early completion of tasks. The concept of speculative speed reduction was introduced in Mossé et al. [2000] and proposed three DVS schemes (i.e., Greedy, Proportional, and Statistical schemes) with different speed reduction aggressiveness for framebased real-time systems. The Proportional scheme distributes the slack proportionally among all unexecuted tasks, while the Greedy scheme is much more aggressive and gives all the slack to the next ready-to-run task. The Statistical scheme uses the average-case execution cycle (ACEC) of tasks to predict the future slack. Many existing DVS schemes proposed for real-time systems can be classified as Proportional or Greedy. The cycle-conserving scheme and the look-ahead scheme for periodic real-time systems with variable workloads were proposed in Pillai and Shin [2001]. When used in frame-based systems, the cycle-conserving scheme is equivalent to the Proportional scheme, while the look-ahead scheme is equivalent to the Greedy scheme. The scheme for fixedpriority real-time systems proposed in Saewong and Rajkumar [2003], when used in frame-based systems, is equivalent to the Greedy scheme. To be able to navigate the full spectrum of speculative speed reduction, a DVS scheme has been proposed [Aydin et al. 2001], in which system designers can set a parameter to control the degree of speed reduction aggressiveness. In fact, the optimal speed reduction aggressiveness depends on the variability of the workloads, as will be shown in Section 4.2. The Statistical scheme attempts to capture the variability of the workloads by using the ACEC of each task, which does not contain sufficient probabilistic information. Our inter-task DVS schemes (OITDVS in Section 4.2, PITDVS and PITDVS2 in Section 5.2) automatically choose the degree of speed reduction aggressiveness to minimize the expected energy consumption, based on the probability distribution of the tasks’ execution cycles. 2.2 Intra-Task DVS For a single task and a given deadline, it has been shown [Lorch and Smith 2001; Lorch and Smith 2004] that if a task’s computational requirement is only known probabilistically, there is no constant optimal speed for the task ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:6

•

R. Xu et al.

and the expected energy consumption is minimized by gradually increasing the speed as the task progresses. They call this approach PACE (Processor Acceleration to Conserve Energy). Stochastic schemes similar to PACE have also been proposed in Yuan and Nahrstedt [2003] and Gruian [2001a]. They differ in the way of patching the speed schedule obtained under the ideal system model to fit the realistic system model. Among these intra-task DVS schemes, PACE and GRACE [Yuan and Nahrstedt 2003] can be used for soft real-time systems as they have a parameter to control the percentage of deadlines to be met. When this parameter is set to 100%, they are targeted at hard real-time systems. 2.3 Hybrid DVS A hybrid DVS scheme can be obtained by using an inter-task DVS scheme as the basis and plugging in an intra-task DVS scheme [Kim et al. 2002; Yuan and Nahrstedt 2003]. However, such hybrid schemes are inherently suboptimal because they ignore the interaction between inter-task and intra-task DVS. The optimal hybrid DVS scheme under the ideal system model extends PACE to multiple tasks in frame-based real-time systems [Zhang et al. 2005]. However, Zhang et al. [2005] did not provide any solution to patch their scheme to fit the realistic system model. Another interesting result, applicable to both hybrid and inter-task DVS, is that different ordering of tasks in real-time systems results in different energy consumption [Gruian 2001b; Gruian and Kuchcinski 2003], which led to a number of heuristics to obtain a “good” ordering of tasks. This is complementary to our work because we focus on finding speed schedules given the ordering of tasks. 3. TASK AND SYSTEM MODELS We consider a frame-based task model with N periodic tasks in the system. The task set is denoted by = {T1 , T2 , . . . , TN }. All tasks share the same period and deadlines are equal to the end of the period. The length of the period (also known as frame length) is denoted by D. Frames are to be repeated and thus we focus on the first frame which starts at time 0 and ends at time D. All tasks are executed nonpreemptively during each frame in the order T1 , T2 , . . . , TN . Each task Ti (1 ≤ i ≤ N ) is characterized by its worst-case execution cycles (WCEC) Wi and the probability function of its execution cycles Pi (x), which denotes the probability that task Ti executes for x (1 ≤ x ≤ Wi ) cycles. Obvi i P (x) = 1 andPi (Wi ) = 0. The corresponding cumulative ously, we have W i x=1 distribution function is cdfi (x) = xj =1 Pi ( j ) and cdfi (0) = 0. We assume that the probability functions of the tasks do not change with the processor speed. In practice, a histogram is used to approximate the probability function considering that a task usually takes millions of cycles. The number of bins in the histogram for Pi (x) is denoted by ri . We assume that the probability distribution of the execution cycles of each task is stable, that is, does not change with time. The tasks are to be executed on a variable voltage processor with the ability to dynamically adjust its speed (i.e., frequency and voltage). We consider ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.


•

9:7

two system models in this paper. The first model, the ideal system model, assumes that the processor speed can be adjusted continuously from 0 to infinity and there is no speed change overhead. The system power consumption when executing task Ti at frequency f is pi ( f ) = c0 + ci f α , where α (α > 1) reflects the convex power-frequency relationship, c0 denotes the system power consumption when the processor is idle (i.e., f = 0, since the system is not executing any task and the overhead is ignored), and ci reflects the effective switching capacitance of task Ti . We assume that the system is never turned off and therefore the power function can be simplified as pi ( f ) = ci f α in the analysis without affecting the analysis results. Although the ideal system model is simplistic, it captures the essence of a DVS system (i.e., the convex power-frequency relationship). Under this model, it is possible to derive optimal DVS schemes, which gives us great insight into the problem and provides the basis for designing practical DVS schemes. The second model, the realistic system model, considers practical issues. The processor only provides M discrete operating frequencies, f 1 < f 2 < · · · < f M . All frequencies are efficient, which means that using a frequency to execute a task always results in lower energy consumption than using higher frequencies [Miyoshi et al. 2002]. The system power consumption for an idle processor is pidl e (the system is not executing any task, and thus consumes constant power—not necessarily f = 0 as in the ideal system model). The system power consumption when executing task Ti at frequency f j is pi ( f j ). As in the case of the ideal system model, we ignore the idle power in deriving DVS schemes since this portion of power is always consumed in the system. That is, we use p ˆ i ( f j ) = pi ( f j ) − pidle in Section 5 to simplify the presentation. In the realistic system model, when changing the frequency of the processor from f i to f j , the time penalty is and the energy penalty is

PT ( f i , f j ) = ξ1 | f i − f j |

(1)

PE ( f i , f j ) = ξ2 f i2 − f j2 ,

(2)

where ξ1 and ξ2 are constants determined by the voltage regulators. Equations (1) and (2) are taken from Burd and Brodersen [2000] and are considered to be an accurate modeling of speed change overheads. It is common in the literature [Mochocki et al. 2004] to simplify Equations (1) and (2) by considering the worst-case frequency swing and assuming that PT ( f i , f j ) = ξ1 ( f M − f 1 ) 2 − f 12 ). Note that PT ( f , f ) = 0 and PE ( f , f ) = 0 where and PE ( f i , f j ) = ξ2 ( f M f ∈ { f 1 , f 2 , . . . , f M }. 4. OPTIMAL DVS SCHEMES UNDER THE IDEAL SYSTEM MODEL In this section, we study DVS schemes under the ideal system model. Because of the simplicity of that model, provably optimal DVS schemes can be derived ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:8

•

R. Xu et al.

for the three DVS strategies (i.e., intra-task DVS, inter-task DVS, and hybrid DVS). We first present the three optimal DVS schemes and then give a unified view of these schemes at the end of this section. 4.1 The Optimal Intra-Task DVS Scheme The ideal system model assumes the ability to set speed/frequency for each cycle executed by a task, since there is no speed change overhead.2 Thus, the problem of deriving an intra-task DVS scheme is to find the execution speed for each cycle of a task (also called the speed schedule of the task) to minimize the expected energy consumption while ensuring that the task will finish by the deadline D. Obviously, an intra-task DVS scheme is of great importance because it could be used as a building block for general frame-based real-time systems containing multiple tasks: a time Di is allotted to each task, and the intra-task DVS scheme is applied for each task. It can also be used as the DVS scheme for a special case of frame-based systems where there is only a single task in a frame (e.g., for specialized devices like MP3 players). The optimal intra-task DVS scheme, PACE, was derived in Lorch and Smith [2001]. We will briefly review its derivation. Since there is only a single task, we will omit the subscripts of the parameters of the task unless confusion arises. Let the execution speed of the ith cycle of task T be denoted by si . The problem can be formally expressed by the following mathematical program [Lorch and Smith 2001] Minimize Fi csiα−1 (3) 1≤i≤W

Subject to

1 ≤ D, s 1≤i≤W i

(4)

where Fi = 1 − cdf (i − 1) represents the probability of executing the ith cycle of task T . Note that c in (3) reflects the effective switching capacitance of task T . By using Lagrangian technique [Lang 1973] or Jensen’s inequality [Krantz et al. 1999], the optimal solution to (3)–(4) is 1 W α j =1 F j si = (5) 1 Fi α D and the optimal expected energy consumption is 1 α c W j =1 F j ∗ e = . (6) D α−1 Several claims can be made based on (5) and (6): (i) the optimal expected energy consumption of a task is inversely proportional to D α−1 ; (ii) s1 ≤ s2 ≤ . . . ≤ sW [Lorch and Smith 2001] since cdf(·) is a nondecreasing function. Using lower speeds for early cycles makes perfect sense because early cycles have higher probability to be executed than later cycles. (iii) when changing speed 2 In

practice this will be several million cycles as discussed later.



•

9:9

during the execution of a task, the optimal speed to be used is always inversely proportional to the remaining time until the deadline. This claim is not straight1 1 W α α forward and needs a little clarification. Let ηi = Fi / j =1 F j and thus (5) can be rewritten as si = ηi1D . We can view ηi as the fraction of the time D to be allotted to the ith cycle. Claim (iii) is obviously true for executing the 1st cycle and we will show that it is also true for the ith cycle where i > 1. Before executing ith the i−1 cycle, the remaining time until the deadline is D − i−1 η D = D(1 − j j =1 j =1 η j ) and the time allotted to the ith cycle is ηi D. If βi is the fraction of the remaining time dedicated to Ti , that is, βi =

ηi i−1

1−

j =1

ηj

we have si =

βi D 1 −

1 i−1

j =1

ηj

.

Thus the preceding claim (iii) can be made for all cycles of the task. 4.2 The Optimal Inter-Task DVS Scheme The key for inter-task DVS schemes is how to allot the slack in the system (or equivalently, the available system time) to the tasks. In doing so, one needs to take into consideration the variability of the workload and possible dynamic slack reclamation. Next, we will describe the derivation of our optimal scheme under the ideal system model. The optimal DVS scheme is based on an important property that the optimal expected energy consumption of a sequence of tasks is inversely proportional to t α−1 where t is the amount of allotted time. To introduce this property, we start by establishing a preliminary lemma on the energy consumption of a single task. LEMMA 4.1. If a task T cannot change speed during the execution, its optimal expected energy consumption is inversely proportional to t α−1 , where t is the amount of time allotted to execute T . PROOF. Suppose that W is the worst-case number of execution cycles of T , and P (x) is the probability that T executes for x cycles. Obviously, we should use the lowest possible speed, Wt , such that T will finish in time t in the worst case. Therefore, the optimal expected energy consumption of executing T is W

W P (x) p t x=1

x W t

=

W α−1 c

W x=1 t α−1

P (x)x

,

(7)

which is inversely proportional to t α−1 (recall that p(·) is the power function). ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:10

•

R. Xu et al.

Interestingly, the result of Lemma 4.1 still holds for multiple tasks. THEOREM 4.2. If tasks cannot change speed during their execution, the optimal expected energy consumption of executing N tasks T1 , T2 , . . . , TN consecutively is inversely proportional to t α−1 , where t is the amount of time allotted to execute these tasks. PROOF. Suppose that the worst-case number of execution cycles of Ti is Wi , and the probability that Ti executes for x cycles is Pi (x). Let the optimal expected energy consumption of executing tasks Ti , Ti+1 , . . . , TN consecutively with allotted time d be denoted by E(i, d ). We will prove by induction that E(1, d ) is inversely proportional to d α−1 . The induction is on i. The base case for E(N , d ) is obviously true by Lemma 4.1. In the induction step, assume that E(i + 1, d ) is inversely proportional to i+1 d α−1 , that is, E(i + 1, d ) = dCα−1 , where Ci+1 only depends on Wi+1 , . . . , W N and Pi+1 (x), . . . , PN (x). To compute E(i, d ), we first compute a helper func˜ d , β), which denotes the expected energy consumption of executtion, E(i, ing tasks Ti , Ti+1 , . . . , TN with allotted time d when allotting a fraction β (0 < β ≤ 1) of time d to task Ti and allotting the remaining time, (1 − β)d , Wi to tasks Ti+1 , . . . , TN . The running speed for Ti is obviously βd , and the time Wi left for executing tasks Ti+1 , Ti+2 , . . . , TN is d − x/ βd when Ti executes x cycles. Therefore,

Wi W C W i i i+1 ˜ d , β) = E(i, Pi (x) p x + Wi α−1 βd βd (d − x/ βd ) x=1 α−1 W i Wi Ci+1 = Pi (x) xci + xβd α−1 βd (d − ) x=1 Wi

α−1 Wi Ci+1 Wi P (x) xc + xβ i i x=1 β (1− W )α−1 i = d α−1 Wi i Wiα−1 ci x=1 Pi (x)x Pi (x) + Ci+1 W x=1 (1− xβ )α−1 β α−1 Wi = α−1 d φi (β) + ϕi (β) = , d α−1 where φi (β) =

Wiα−1 ci

Wi x=1

β α−1

Pi (x)x

and ϕi (β) = Ci+1

Wi

Pi (x) x=1 (1− xβ )α−1 . W i

Let Ci = min (φi (β) + ϕi (β)) 0 (1 + δ)l 2 .e > (1 + δ)2l 3 .e · · · > (1 + δ)k−1l k .e. Moreover, clearly l 1 .e ≤ e( f M ) 0≤ j ≤i s j F j and l k .e ≥ e( f 1 ) 0≤ j ≤i s j F j . Let λ = e(e(ffM1 )) (i.e., the ratio of the energies when running with the highest and lowest frequencies), then (1 + δ)k−1
tgreedy , then we allot time Wi to Ti ; this is because the Greedy scheme guarantees deadlines in the t greedy

most aggressive form [Mossé et al. 2000]. It is easy to see that the resulting schedule is still valid as long as the tasks can be scheduled using the maximum speed when all the tasks take their WCECs. Patch 3 (Discrete Speeds). The processor speed in the ideal system model can be tuned continuously. But real-world processors only provide a finite set of discrete speeds. Since we will use a constant speed to execute a task, the most ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:22

•

R. Xu et al.

Fig. 6. The online part of PITDVS.

straightforward way to fix this problem is to round the continuous speed up to the closest higher discrete speed. To summarize the above patches, we show the online part of PITDVS in Figure 6. The procedure PITDVS-online(i, d ) is called before task Ti is executed and there is time d remaining in the frame. This procedure can be regarded as running in constant time because the complexity of Lines 23 can be reduced to constant time by using an extra array to store the partial sums of Wi values. Thus, Patch 1 and 2 can be done in constant time. Patch 3 takes O(log M ) where M is the number of available discrete speeds. Because M is usually small in practice, we can treat it as constant time. We also propose a variant of the PITDVS scheme, PITDVS2, that uses up to two speeds to execute a task within the allotted time. This is because any continuous speed can be emulated by using its two adjacent discrete speeds [Ishihara and Yasuura 1998]. Intuitively, using two adjacent discrete speeds to emulate a continuous speed is expected to do better than rounding up the speed. For a continuous speed s, let the closest higher and lower discrete speeds be denoted by s and s, respectively. To emulate s for time t, we let the system operate at s for time t1 and at s for time t − t1 − PT (s, s ). Thus, we need to satisfy st1 + s (t − t1 − PT (s, s ) = st. ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

(19)


•

9:23

Fig. 7. The online part of PITDVS2.

Solving Equation (19) will give the speed schedule for PITDVS2, taking into account the speed change overhead. Figure 7 shows the online part of PITDVS2. Note that, although PITDVS2 changes speed during the execution of a task, we still categorize it as inter-task DVS scheme because its main idea is to emulate the optimal inter-task DVS scheme from the ideal system model. 5.3 The Hybrid DVS Schemes For hybrid DVS schemes under the realistic system model, we can either patch the optimal hybrid DVS scheme obtained under the ideal system model (GOPDVS), or combine the PITDVS and PPACE schemes, as described below. The first new scheme is called PGOPDVS (Practical GOPDVS). The offline part of PGOPDVS assumes the ideal system model. Its first step is to compute the time allocation fraction βi j for each Phase j of task Ti . By treating the phases of a task as supercycles, the first step can be done by the procedure GOPDVS-offline in Figure 2 with slight modification. The second step of the offline part of PGOPDVS is to compute the time allocation fractions βˆ i for the whole task Ti . The latter can be computed from all the allocated times for each phase of task Ti , as follows. When task Ti is ready to execute and there is time d remaining in the frame, the time allocated for the first phase is d 1 = βi1 d ; the time allocated for the second phase is d 2 = (d − βi1 d )βi2 = (1 − βi1 )βi2 d . j −1 We can see that the time allocated to the j th phase is d j = dβi j k=1 (1 − βik ). Thus, the time allocation fraction for the whole task Ti is ri j −1 ri j =1 d j βˆ i = βi j (1 − βik ), = d j =1 k=1 where ri is the number of phases that task Ti has. Once we have derived βˆ i , we know the optimal (under the ideal system model) time to be allocated to task Ti before Ti starts executing. The online part of PGOPDVS performs the patches to fit the realistic system model. The patches are similar to those of PITDVS described in Section 5.2. We first perform the same patches as Patch 1 and 2 of PITDVS. Then we follow the approach of PACE, that is, first round each speed to the closest available discrete speed, and then perform a linear scan and adjustment to make sure Ti will finish in the allotted time in the worst case. Figure 8 shows the online part ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:24

•

R. Xu et al.

Fig. 8. The online part of PGOPDVS.

Fig. 9. The online part of PIT-PPACE.

of PGOPDVS. The procedure PGOPDVS-online(i, d ) is called before task Ti is executed and there is time d remaining in the frame. The second new scheme can be regarded as a variant of PITDVS. It differs from PITDVS only in that PPACE speed schedule is used to execute a task. Specifically, after the allotted time for a task is decided, procedure PPACE( ) in Figure 5 is called to compute the speed schedule because different allotted times corresponds to different speed schedules. Thus we call this scheme PIT-PPACE. Figure 9 shows the online part of PIT-PPACE. Note that we are calling the offline part of the PPACE scheme (which has high time complexity) online in PIT-PPACE, which is by no means practical. This can be remedied by precomputing the PPACE speed schedules for all possible allotted time d for each task and storing them in memory for online lookup. We need to discretize the allotted time d because it is continuous. The degree of discretization, nd , depends the characteristics of the task. In general, larger nd results in better speed schedule at the expense of higher space overhead. The space overhead ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.


•

9:25

is proportional to nd because for each task we need to precompute nd speed schedules and store them for online use. 5.4 Implementation Issues Due to the simplistic and deterministic nature of frame-based systems, implementing the DVS schemes described in this section only requires modest hardware and software support. Next, we will briefly discuss implementation issues. Hardware Support. First, many real-world processors already provide hardware-supported DVS. Changing the voltage and frequency of a DVSenabled processor is as simple as writing a value to a special register. Second, in order to change speed during the execution of a task, the processor needs to have the capability of being programmed to be interrupted after a predefined number of cycles and be trapped to a service routine that performs speed scaling. In current processors, this requirement can be satisfied by using timers. To interrupt a processor running at speed s Hz after x cycles, we can set up a timer for xs seconds. The resolution of the timer determines the precision of the value of x. For PACE, a timer with 1 ms resolution is used [Lorch and Smith 2003] while GRACE uses a timer with 500 μs resolution [Yuan and Nahrstedt 2003]. Software Support. All DVS schemes described in this section are divided into an offline part and an online part. The results computed by the offline part of a DVS scheme are stored in the memory accessible by the online part. The online part is called before dispatching a task and it is responsible for computing the amount of time to be allotted to a task and the speed schedule to be used by the task. For inter-task DVS schemes, we only need to set the speed once for each task in a frame. Thus, the speed scaling can be performed at context switching time. For intra-task and hybrid DVS schemes, we need to change speed through the interrupt mechanism during the execution of a task. The first interrupt is set up by the online part and subsequent ones by the interrupt handler: when the first interrupt happens, the interrupt handler will be entered to change the speed according to the speed schedule and set up the next interrupt. 6. EVALUATION For the ideal system model, we have described DVS schemes that are provably optimal for different DVS strategies. However, this is not the case for the realistic system model. We have presented the PPACE scheme for framebased systems with a single task, and the PITDVS, PITDVS2, PGOPDVS, and PIT-PPACE schemes for general frame-based systems. For these schemes, two questions remain unanswered: (i) since the worst-case performance of PPACE is dependent on the value of , what is the appropriate value that should be used and how well does it perform compared to the existing schemes? (ii) because PITDVS, PITDVS2, PGOPDVS, and PIT-PPACE are approximations of the optimal schemes obtained under the ideal system model and they do not have ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:26

•

R. Xu et al. Table II. XScale Speed Settings and Power Consumptions Speed (MHz) Voltage (V) Power (mW)

150 0.75 80

400 1.0 170

600 1.3 400

800 1.6 900

1000 1.8 1600

Fig. 10. Approximate analytical power function vs. actual power function.

theoretical performance guarantee as PPACE does, how well do they perform in practice? To answer these questions, we conducted extensive simulations for different processor models and workloads. 6.1 Processor Models We used three processor models in our simulation: (1) Synthetic processor. It strictly conforms to the p( f ) = f 3 power-frequency relation and has 10 discrete frequencies ranging from 100MHz to 1GHz with 100MHz step; its idle power is zero and there is no speed change overhead. (2) Intel XScale (Table II) [Xscalepower 2005]. We assume that XScale is executing NOP instructions at the minimum frequency when idle and the idle power is one half the power consumed at the minimum frequency (i.e., 40 mW) [Contreras and Martonosi 2005]. Figure 10(a) shows that the approximate analytical power function obtained by curve fitting is a good approximation of the actual power function. The time and energy penalties for each speed change are reported as 12μs and 1.2μJ, respectively, in Xie et al. [2003]. We assume that these numbers are worst-case speed change overheads, which are equivalent to the overheads incurred when changing from the minimum speed to the maximum speed in our power model described in Section 3. Accordingly, we derived the values of the constants ξ1 and ξ2 in Equation (1)–(2) to be used in our experiments. (3) IBM PowerPC 405LP (Table III) [Rusu et al. 2004]. The approximate analytical power function obtained by curve fitting is shown in Figure 10, which indicates that the approximation is not as good as XScale. The time and energy penalties for each speed change are reported as 1ms and 750μJ, respectively, in Rusu et al. [2004]. We also translated these numbers into our power model as in the case of XScale. ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.


•

9:27

Table III. PowerPC 405LP Speed Settings and Power Consumptions Speed (MHz) Voltage (V) Power (mW)

33 1.0 19

100 1.0 72

266 1.8 600

333 1.9 750

Fig. 11. Probability functions: uniform, unimodal1, unimodal2, unimodal3, bimodal1, bimodal2 (from left to right). The Y-axis is probability and the X-axis is number of execution cycles.

The numbers listed in Tables II and III are the power consumptions of the processor when running certain benchmarks. In practice, different applications have different instruction mix, thus resulting in different dynamic power consumptions. We associate each task with a power scaling factor to simulate this reality. For example, if a task’s power scaling factor is 0.9, it will consume 40 + (400 − 40) × 0.9 = 364 mW when executing at frequency 600MHz for XScale. 6.2 Evaluation of Intra-Task DVS Schemes We used six distributions of number of execution cycles (Figure 11) to generate synthetic tasks: one uniform distribution, three unimodal distributions, and two bimodal distributions. The power scaling factor is 1. The WCEC is 500, 000, 000 and the minimum number of cycles is 5, 000, 000. This is corresponding to the ratio of the WCET to the best-case execution time being 100, as reported in Rusu et al. [2004]. The default number of transition points is r = 100 and they are placed evenly in the range of [1, WCEC]. We will evaluate the effect of parameter r on the performance of the schemes in Section 6.2.2. We define the relative error for any scheme that returns the expected energy consumption E to be E−OPT , where OPT is the optimal solution. We compute OPT OPT using the PPACE scheme without doing the trimming operation at the expense of much longer running time. As usual with FPTAS algorithms, we set = 0.05 for PPACE when comparing it with other schemes. For all experiments, we varied the slack available for power management. The slack is changed by varying the deadline from WCEC to WCEC (increasing the deadline will increase fM f1 the slack, that is, will increase the allotted time for the task, thus resulting in less energy consumption). Under the above setup, we performed 3 experiments. 6.2.1 Experiment 1: Comparing All Schemes. In this experiment, we compare all intra-task DVS schemes including a variant of the PACE scheme called PACE2, which differs from PACE in that it uses two adjacent discrete speeds to emulate any continuous speed [Ishihara and Yasuura 1998] obtained from the solution to the mathematical program (11)–(13). Since all schemes except for ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:28

•

R. Xu et al.

Fig. 12. Comparing intra-task DVS schemes for bimodal1 distribution (the relative errors are relative to optimal solutions).

PPACE are based on the ideal system model, which does not model speed change overheads, we assume that they fix this problem by subtracting the maximum possible time penalties from the allotted time. We only show the results for bimodal1 distribution (Figure 11(e)) because results for all other distributions are similar. As shown in Figure 12 (note the different Y-axes scales), PPACE ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.


•

9:29

Fig. 13. Efficiency of PPACE (bimodal1 distribution).

is very close to optimal and outperforms all other schemes in all cases. On the other hand, the relative errors of GRACE, PACE, and PACE2 depend on the power model, the distribution of number of execution cycles, and the deadline. For the Synthetic processor (Figure 12(a)) for which the relative errors of GRACE, PACE, and PACE2 are only due to rounding, PACE and PACE2 perform reasonably well, but still have relative errors up to 22% and 19% respectively. Since PACE2 uses two adjacent discrete speeds to emulate any continuous speed, it performs very well for processors that have good approximate analytical power functions (for XScale, PACE2’s relative errors are less than 8%). However, PACE2 performs relatively poorly for the PowerPC 405LP, which does not have good approximate analytical power function (the relative error is up to 66%). Still, for XScale and PowerPC 405LP, neither PACE nor GRACE is a clear winner. In summary, the effect of approximations (i.e., using analytical functions to approximate the actual power function, and rounding or using two speeds to emulate any continuous speed) and patching for dealing with the speed change overhead could compound or cancel each other out. Thus the solutions returned by GRACE, PACE, and PACE2 are unpredictable and unstable, while PPACE can provide performance guarantee. Figures 12(b), 12(d), and 12(f) show the results of sensitivity analysis on for PPACE. We can see that the relative errors are generally below 0.1%, 0.18%, 0.33% for = 5%, 10%, 15%, respectively. This means that, in practice, PPACE performs much better than the performance guarantees it offers. This knowledge allows system designers to set the parameter higher than required, in order to speed up the algorithm execution, if the worst-case performance guarantee is not needed. The running time of PPACE depends on the implementation and the platform. However, it is roughly proportional to the number of energy-time labels generated during the execution of PPACE( ). From Figure 13, we can see that the average number of energy-time labels in all vertices increases when decreases, but they are all relatively small (especially for PowerPC 405LP shown in Figure 13(b)). It is also true that when M , the number of discrete speeds, increases, the average number of labels increases; this can be seen by comparing ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:30

•

R. Xu et al.

Fig. 14. Effect of transition points (bimodal1 distribution).

Figure 13(a) and 13(b). In this work, we use Java to implement PPACE and our hardware setting is Pentium 4 3.2 GHz with 1 GB memory. The running time of PPACE(0.05) for all processor models is always within 1 second. 6.2.2 Experiment 2: Effect of Transition Points. Intra-task DVS schemes assume a predefined set of transition points in tasks. However, it is not clear how to choose an optimal sequence of transition points, especially in the presence of speed change overheads. This is still an open problem, and it is beyond the scope of this paper. Intuitively, having more transition points should result in a better speed schedule at the expense of longer time to find the speed schedule. Thus, we perform an experiment to find out how the number of transition points affects the speed schedule. For the tasks in Experiment 1, we set the number of transition points to be 2, 3, . . . , 100 (the positions of the transition points are evenly distributed) and obtained the speed schedule returned by PPACE with = 0.05. Figure 14 shows the energy consumption (normalized to the energy consumption for the number of transition points equal to 100) versus number of transition points for three values of the deadline (for other values of the deadline, the curve is similar; also we do not show results for the Synthetic processor because they are similar). The results agree with our intuition. However, Figure 14 shows the phenomenon of diminishing return in increasing the number of transition points. In practice, this fact will help system designer find a good number of transition points more quickly; from our experiments, it seems that r = 20 or r = 30 is a good number, for these parameters. 6.2.3 Experiment 3: Effect of Speed Change Overhead. We also performed simulations to evaluate the effect of speed change overhead. In practice, processors that can adjust the voltage internally have low overheads (in the range of microseconds), but systems that require changing the voltage externally experience high overheads in the milliseconds range. Using the same task sets used in Experiment 1, we varied the worst-case time penalty of XScale and PowerPC405LP for speed changes from 50μs to 50ms. The energy penalty is changed proportionally to the time penalty. For example, the energy penalty of t XScale for time penalty being tμs is 1.2μJ × 12 . Figure 15 shows the relative errors for PACE and PPACE (we do not show results for GRACE and PACE2 ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

Minimizing Expected Energy Consumption in Real-Time Systems Relative error (%)

9:31

Relative error (%) 0.06 0.05 0.04 0.03 0.02 0.01 0

0.5

1

1.5

2

2.5

Deadline (sec)

3

0.05 0.04 0.03 0.02 0.01 Time Penalty (sec) 3.5 0

0.5

(a) PACE, XScale

Relative error (%)

1

1.5 2 2.5 Deadline (sec)

3

0.05 0.04 0.03 0.02 0.01 3.5 0 Time Penalty (sec)

∋

100 80 60 40 20 0

(b) PPACE ( = 0.05), XScale

Relative error (%)

100 80 60 40 20 0

0.05 0.04 0.03 0.02 0.01 0 0.05 0.04 0.03 0.02 6 8 10 12 0.01 14 16 0 Time Penalty (sec) Deadline (sec)

4

(c) PACE, Power PC 405LP

2

4

0.05 0.04 0.03 0.02 6 8 10 12 0.01 14 16 0 Time Penalty (sec) Deadline (sec)

∋

2

•

(d) PPACE ( = 0.05), Power PC 405LP

Fig. 15. Effect of speed change overhead (bimodal1 distribution).

because similar conclusions can be reached). We can see that the relative errors for PPACE are all below 0.06%. For PACE, which has the smallest error, the relative error increases as the time penalty increases; when the slack is small and time penalty is large, the relative error can be up to 94%. 6.3 Evaluation of the DVS Schemes for General Frame-Based Systems 6.3.1 Synthetic Workloads. A frame-based real-time system is characterized by the number of tasks, the power scaling factor for each task, the WCEC of each task, the probability distribution of the number of execution cycles of each task, and the frame length. We simulated systems consisting of 5 and 10 tasks. We only show the results for the systems with 5 tasks because the results for systems with 10 tasks are similar. The power scaling factor was randomly chosen uniformly from 0.8 to 1.2. As with the simulations in Section 6.2, the WCEC of each task is 500,000,000 and the minimum number of cycles is 5,000,000. The probability function of each task’s actual execution cycles is randomly chosen from the 6 representative distributions shown in Figure 11. The bin width of the histograms denoting the probability functions is 5,000,000 cycles. We experimented with 20 frame lengths chosen evenly from 5 × WCEC to 5 × WCEC . fM f1 For each simulated system, we evaluated 7 DVS schemes: Proportional2, Greedy2, Statistical2,4 PITDVS, PITDVS2, PGOPDVS, and PIT-PPACE. We also evaluated a clairvoyant scheme, which is aware of the actual execution 4 In our experiments, we improve the original Proportional, Greedy, and Statistical schemes [Moss´ e

et al. 2000] by using up to two speeds to execute a task [Ishihara and Yasuura 1998]. ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:32

•

R. Xu et al.

cycles of each task and uses the optimal frequency to execute each task. The clairvoyant scheme is used as the baseline to compare all other schemes. In evaluating a DVS scheme on a system, we computed the average energy consumption per frame as the energy consumption for that scheme on that system. For the same system, we compute the relative error of a scheme whose energy consumption is E as E−OPT , where OPT is the energy consumption of OPT the clairvoyant scheme. For each DVS scheme, we averaged the relative errors for all systems with the same frame length because we consider slack to be the most influential factor for energy consumption. Under the aforementioned setup, we simulated a total of 20 billion frames. The comparisons of the DVS schemes are shown in Figure 16. For most of the simulations, the best scheme is either PITDVS2 or PITPPACE. For the simulations in which the best scheme is neither PITDVS2 nor PIT-PPACE, the minimum energy consumption of these two schemes is just off by less than 0.1% compared to the energy consumption of the best scheme. Thus, we take a closer look at PITDVS2 and PIT-PPACE on the plots in the right column of Figure 16. PITDVS2 performs no worse than PITPPACE in most cases except for a small range of frame lengths for PowerPC 405LP. Considering the high memory overhead or high run-time overhead of PIT-PPACE (as described in Section 5.3), we regard PITDVS2 as the better scheme. From Figure 16, we can see that for most cases the worst scheme is PGOPDVS, which is a surprising result considering that it is based on the best scheme under the ideal system model. This is because the excessive rounding of speeds in the PGOPDVS scheme makes it drift far away from the optimal solution. The PITDVS scheme is not necessarily better than the DVS schemes that do not use probabilistic information of the workload. This is because the rounding-up effect offsets the advantage of using probabilistic information. Considering the difference between PITDVS and PITDVS2, we can see that using two adjacent discrete speeds to emulate a continuous speed [Ishihara and Yasuura 1998] plays an important role in PITDVS2. Among the Proportional, Greedy, and Statistical schemes, none is a clear winner. This is because all of them are just based on heuristics and will only perform well in a subset of the problem space. The two key factors that affect the energy savings of the PITDVS2 scheme over other schemes (by comparing relative errors) are the minimum speed of the processor and the number of speeds available from the processor. In computing speed schedules, the PITDVS2 scheme is based on the solution under the ideal system model in which the frequency of the processor is unrestricted and continuous. Because of the convexity of the power function, high speeds are not usually obtained by the PITDVS2 scheme. But low speeds are desired because the PITDVS2 scheme can navigate the full spectrum of available speeds and can find the best speed that minimizes the expected energy consumption. The importance of the number of speeds available from the processor is obvious given that we need to convert the continuous speeds to discrete speeds. For example, because the minimum speed of the Synthetic processor is less than that of XScale, and the number of speeds of that processor is greater than that ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

Minimizing Expected Energy Consumption in Real-Time Systems 400

300

9:33

PIT-PPACE PITDVS2

40

Relative error (%)

350

Relative error (%)

50

PGOPDVS Greedy Statistical Proportional PITDVS PIT-PPACE PITDVS2

•

250 200 150

30

20

10

100 0

50 0

-10 0

5

10

15

20

25

0

5

10

Frame length (sec)

(a) Synthetic processor, all schemes 120

20

25

PIT-PPACE PITDVS2

15

Relative error (%)

Relative error (%)

80

20

(b) Synthetic processor, two schemes


100

15

Frame length (sec)

60

40

10

5

0

20 -5 0 -10 2

4

6

8

10

12

14

16

2

4

6

Frame length (sec)

(c) XScale, all schemes 90

60 Relative error (%)

15

12

14

16

PIT-PPACE PITDVS2

10 Relative error (%)

70

10

(d) XScale, two schemes


80

8

Frame length (sec)

50 40 30

5

0

20 10

-5

0 -10

-10 10

20

30

40

50

60

70

Frame length (sec)

(e) PowerPC 405LP, all schemes

80

10

20

30

40

50

60

70

80

Frame length (sec)

(f) PowerPC 405LP, two schemes

Fig. 16. Comparison of DVS schemes for general frame-based systems (the relative errors are relative to the clairvoyant scheme).

of XScale, the energy saving for the Synthetic processor is greater than that for XScale. We also performed simulations to evaluate the effect of speed change overhead. We varied the time penalty of XScale and PowerPC 405LP in the same way as in Section 6.2. The results are very similar to those in Figure 16. This ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:34

•

R. Xu et al.

is because the derivations of all schemes are based on the ideal system model and they all deal with the speed change overhead in a similar way. That is, they ignore the energy penalty and subtract the maximum possible time penalty from the available time when a task is being scheduled to execute. As the time penalty increases, the energy consumption of all schemes increases, but their differences in terms of energy consumption remain roughly the same when averaging the results. 6.3.2 Automatic Target Recognition (ATR). We evaluated the DVS schemes on a real-life workload, automatic target recognition5 (ATR), which does pattern matching of targets in images/pictures. In ATR, the regions of interest (ROI) in one image are detected and each ROI is compared with all the templates. The number of target detections in each frame varies from 0 to 8 detections. Image processing time is proportional to the number of detections within an image. An example embedded system that can use ATR is an unmanned autonomous vehicle (UAV) with two cameras installed. Each camera will take 1 to 3 pictures every 100 ms and send them to a back end for target recognition. The back end is required to finish processing each batch of pictures in a timely fashion. Thus, the backend can be modeled as a frame-based system whose frame length is 100 ms. A task in this system is responsible for processing a picture. For any given frame, there could be 2 to 6 tasks, each corresponding to a picture. The number of execution cycles of a task depends on the number of ROIs in the picture that the task is processing. The number of ROIs in a picture cannot be predicted before processing the picture. We assume that the back end is equipped with an Intel XScale processor. We obtained the probability distribution of cycle demand of the task by profiling on a training image set using Simplescalar [Austin et al. 2002]. Figure 17(a) shows the histogram of the execution cycles. We then precomputed the schedule for having 2, 3, 4, 5, and 6 images to be processed in one frame. The five schedules are stored in the back-end. When a period begins, the back-end counts the number of images received and applies the corresponding schedule. Note that fewer number of images corresponds to more slack in the frame. Figure 17(b) shows the energy consumption of each scheme normalized to that of the PITDVS2 scheme when the back-end has 2, 3, 4, 5, and 6 images to process. From the figure we can see that the PITDVS2 scheme can achieve significant energy savings over the previously existing schemes. For example, the PITDVS2 scheme can achieve an average of 11% energy saving over the Proportional scheme. 7. CONCLUSIONS In this article, we study the problem of minimizing the expected energy consumption in frame-based hard real-time systems that execute variable and unpredictable workloads. We have proposed and identified two practical schemes: the PPACE scheme for frame-based systems with a single task, and the PITDVS2 scheme for general frame-based systems. We take into 5 The

original code and data for this application were provided by our industrial research partners.


•

Minimizing Expected Energy Consumption in Real-Time Systems 1.9

4000

1.8

Energy (normalized to PITDVS2)

4500

3500

Count

3000 2500 2000 1500 1000 500 0

9:35

PGOPDVS Greedy Statistical PITDVS Proportional PIT-PPACE

1.7 1.6 1.5 1.4 1.3 1.2 1.1 1

0

2e+06

4e+06

6e+06

8e+06

1e+07

1.2e+07 1.4e+07 1.6e+07

1

2

3

4

5

6

7

Number of images

Cycle

(a) Histogram of execution cycles for ATR

(b) Comparison of DVS schemes

Fig. 17. Experimental results for ATR. Table IV. The Parameters for the 3 Tasks in the Illustrative Example Task

W

P (1), P (2), · · · , P (W )

T1 T2 T3 Tˆ

2 4 2 8

.9, .1 .9, 0, 0, .1 .5, .5 0, 0, .405, .45, .045, .045, 0.05, .005

account processors with limited number of speeds and with overhead for changing speeds. Our evaluation results show that these two DVS schemes achieve significant energy savings over existing schemes. Another important and surprising conclusion from our study is that patching optimal DVS schemes to comply with realistic processors does not necessarily yield DVS schemes that perform well under realistic conditions. In future work, we shall continue to investigate practical hybrid DVS schemes directly derived from the realistic system model, so as to push the expected energy saving to its limit. We shall also study how to minimize the expected energy consumption for general periodic and aperiodic real-time systems. APPENDIX A. AN ILLUSTRATIVE EXAMPLE OF DVS SCHEMES In this appendix, we demonstrate and compare several DVS schemes under the ideal system model through an illustrative example. In this example, there are 3 tasks in a frame-based real-time system with a frame length of 14 time units. The parameters for the 3 tasks are shown in Table IV. The tasks are required to be executed in the order T1 , T2 , T3 . We can also treat the 3 tasks as three sequential sections of a single task Tˆ , and its parameters are computed from those of the 3 tasks. Tˆ is used for a naive extension of PACE shown at the end of this appendix. The system power model is p( f ) = f 3 . We first look at a simple scheme called the Proportional scheme [Mossé et al. 2000], which distributes the system slack proportionally among all unexecuted ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:36

•

R. Xu et al. Table V. The Comparison of the DVS Schemes for the Illustrative Example Scheme

Expected energy consumption per frame

naive PACE Proportional OITDVS GOPDVS

0.7953 0.7733 0.6097 0.5154

tasks. In this example, the Proportional scheme will start executing T1 using speed 2+4+2 = 0.5714. After T1 finishes, the system reclaims the slack created 14 by T1 if it runs for less than its WCEC, and computes the speed of the next task recursively. Suppose that T1 runs for only 1 cycle. Then the time left for 1 4+2 executing T2 and T3 is 14 − 0.5714 = 12.2499, and speed 12.2499 = 0.4898 will be used to execute T2 , and so forth. For the OITDVS scheme, the time allocation fractions of T1 , T2 , T3 are β1 = 0.3938, β2 = 0.7619, and β3 = 1.0, respectively (refer to Figure 1 for how to 2 compute the β values). Thus, OITDVS will use speed 0.3938×14 = 0.3628 to execute T1 . If T1 runs for 1 cycle, then the time left for executing T2 and T3 is 1 4 14 − 0.3628 = 11.2737. Then OITDVS will use speed 0.7619×11.2737 = 0.4657 to execute T2 . For the GOPDVS scheme, the time allocation fractions of T1 , T2 , T3 are β11 = 0.2147, β12 = 0.2207, β21 = 0.2832, β22 = 0.2086, β23 = 0.2636, β24 = 0.3579, β31 = 0.5575, and β32 = 1.0, respectively (refer to Figure 2 for how 1 to compute the β values). Thus, GOPDVS will use speed 0.2147×14 = 0.3327 to st nd execute the 1 cycle of T1 . If T1 has the 2 cycle, the GOPDVS scheme will 1 = 0.4121 to execute. Table V shows the expected use speed 0.2207×(14−14×0.2147) energy consumption per frame for the DVS schemes. Finally, we show through this simple example that a naive extension of PACE (or, naive PACE for short) cannot even obtain energy savings over the DVS schemes that do not use intra-task DVS. Since PACE has only been studied for a single task, naive PACE is applied to the supertask Tˆ in Table IV. From Table V, we can see that naive PACE [Lorch and Smith 2001] will result in expected energy consumption per frame of 0.7953, which is even worse than the Proportional scheme. B. PROOF OF LEMMA 5.3 In this appendix, we provide the proof of Lemma 5.3, which states that for every energy-time label l ∈ LABEL (i, j ) where 1 ≤ j ≤ M , there exists a label l ∈ LABEL(i, j ) such that l .e ≤ l .e ≤ (1 + δ)i l .e and l .t ≥ l .t. PROOF. We prove by induction on i. The base case (i = 0) is trivially true. In the inductive step, we assume that the claim holds for i. Now the goal is to find an energy-time label ζ ∈ LABEL(i + 1, j ) (1 ≤ j ≤ M ) for every energy-time label l i+1 ∈ LABEL (i + 1, j ) such that l i+1 .e ≤ ζ.e ≤ (1 + δ)i+1l i+1 .e ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

•


9:37

and l i+1 .t ≥ ζ.t.

Let l i ∈ LABEL (i, k) (1 ≤ k ≤ M , and k is not necessarily equal to j ) be the energy-time label that generates l i+1 . Then we have wi l i+1 . = l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ), l i .t + PT ( f k , f j ) + fj By the induction hypothesis, there is an energy-time label l i ∈ LABEL(i, k) such that l i .e ≤ l i .e ≤ (1 + δ)i l i .e and l i .t ≥ l i .t. Let l i+1 = (l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ), l i .t + PT ( f k , f j ) + have

wi fj

). Thus, we

l i+1 .e < (1 + δ)i l i .e + (1 + δ)i (PC (i)PE ( f k , f j ) + Fi e( f j )) = (1 + δ)i l i+1 .e.

There are two possibilities regarding l i+1 : (1) l i+1 ∈ LABEL(i + 1, j ). Then we have l i+1 .e = l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ) ≤ l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ) = l i+1 .e

and l i+1 .e = l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ) ≤ (1 + δ)i l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ) ≤ (1 + δ)i (l i .e + PC (i)PE ( f k , f j ) + Fi e( f j )) = (1 + δ)i l i+1 .e

< (1 + δ)i+1l i+1 .e

and l i+1 .t = l i .t + PT ( f k , f j ) +

wi wi ≥ l i .t + PT ( f k , f j ) + = l i+1 .t. fj fj

Therefore we let ζ = l i+1 . (2) l i+1 ∈ LABEL(i + 1, j ) and it was removed as a result of trimming. Thus there is an energy-time label lˆ ∈ LABEL(i + 1) such that l i+1 .e ≤ lˆ.e ≤ (1 + δ)l i+1 .e and l i+1 .t > lˆ.t. Therefore we have l i+1 .e = l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ) ≤ l i .e + PC (i)PE ( f k , f j ) + Fi e( f j ) = l i+1 .e ≤ lˆ.e ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

•

9:38

R. Xu et al.

and lˆ.e ≤ (1 + δ)l i+1 .e < (1 + δ)(1 + δ)i l i+1 .e ≤ (1 + δ)i+1l i+1 .e

and l i+1 .t ≥ l i+1 .t > lˆ.t.

Therefore we let ζ = lˆ. Summarizing these two possibilities will prove the claim. REFERENCES ABOUGHAZALEH, N., MOSSE´ , D., CHILDERS, B., MELHEM, R., AND CRAVEN, M. 2003. Collaborative operating system and compiler power management for real-time applications. In Proceedings of IEEE Real-Time Embedded Technology and Applications Symposium (RTAS). AMD. 2001. Mobile AMD Athlon 4 Processor Model 6 CPGA data sheet. http://www.amd.com. AUSTIN, T., LARSON, E., AND ERNST, D. 2002. Simplescalar: An infrastructure for computer system modeling. IEEE Comput. 35, 2, 59–67. AYDIN, H., MELHEM, R., MOSSE´ , D., AND MEJIA-ALVAREZ, P. 2001. Dynamic and aggressive scheduling techniques for power-aware real-time systems. In Proceedings of IEEE Real-Time Systems Symposium (RTSS). 95–105. BAKER, T. P. AND SHAW, A. 1989. The cyclic executive model and ada. Real-Time Syst. J. 1, 1, 7–26. BROOKS, D., BOSE, P., SCHUSTER, S., JACOBSON, H., KUDVA, P., BUYUKTOSUNOGLU, A., WELLMAN, J., ZYUBAN, V., GUPTA, M., AND COOK, P. 2000. Power-aware microarchitecture: Design and modeling challenges for next generation microprocessors. IEEE Micro 20, 6. BURD, T. AND BRODERSEN, R. 2000. Design issues for dynamic voltage scaling. In Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). CONTRERAS, G. AND MARTONOSI, M. 2005. Power prediction for intel xscale processors using performance monitoring unit eventss. In Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). CORMEN, T., LEISERSON, C., RIVEST, R., AND STEIN, C. 2001. Introduction to Algorithms. MIT Press, Cambridge, MA. ERNST, R. AND YE, W. 1997. Embedded program timing analysis based on path clustering and architecture classification. In Proceedings of International Conference on Computer Aided Design (ICCAD). San Jose, CA. GRUIAN, F. 2001a. Hard real-time scheduling for low-energy using stochastic data and dvs processors. In Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). GRUIAN, F. 2001b. On energy reduction in hard real-time systems containing tasks with stochastic execution times. In Proceedings of IEEE Workshop on Power Management for Real-Time and Embedded Systems. Taipei, Taiwan. GRUIAN, F. AND KUCHCINSKI, K. 2003. Uncertainty-based scheduling: energy efficient ordering for tasks with variable execution time. In Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). Seoul, Korea. HONG, I., QU, G., POTKONJAK, M., AND SRIVASTAVA, M. 1998. Synthesis techniques for low-power hard real-time systems on variable voltage processors. In Proceedings of IEEE Real-Time Systems Symposium (RTSS). Madrid, Spain. HSU, C. AND KREMER, U. 2003. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In Proceedings of ACM Programming Language Design and Implementation (PLDI). ISHIHARA, T. AND YASUURA, H. 1998. Voltage scheduling problem for dynamically variable voltage processors. In Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). 197–202. KIM, W., SHIN, D., YUN, H., KIM, J., AND MIN, S. 2002. Performance comparison of dynamic voltage scaling algorithms for hard real-time systems. In Proceedings of IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.


•

9:39

KRANTZ, S., KRESS, S., AND KRESS, R. 1999. Jensen’s Inequality. Birkhauser. LANG, S. 1973. Calculus of Several Variables. Addison-Wesley, Reading, MA. LORCH, J. 2001. Operating systems techniques for reducing processor energy consumption. Ph.D. thesis, University of California at Berkeley. LORCH, J. AND SMITH, A. 2001. Improving dynamic voltage scaling algorithms with PACE. In Proceedings of ACM SIGMETRICS. LORCH, J. AND SMITH, A. 2003. Operating system modifications for task-based speed and voltage scheduling. In Proceedings of International Conference on Mobile Systems, Applications, and Services (MobiSys). LORCH, J. AND SMITH, A. 2004. PACE: a new approach to dynamic voltage scaling. IEEE Trans. Comput. 53, 7, 856–869. MAHALANOBIS, A., KUMAR, B. V. K. V., AND SIMS, S. 1996. Distance-classifier correlation filters for multiclass target recognition. Appl. Optics 35. MIYOSHI, A., LEFURGY, C., HENSBERGEN, E. V., RAJAMONY, R., AND RAJKUMAR, R. 2002. Critical power slope: Understanding the runtime effects of frequency scaling. In Proceedings of the 16th ACM International Conference on Supercomputing. MOCHOCKI, B., HU, X., AND QUAN, G. 2004. A unified approach to variable voltage scheduling for nonideal dvs processors. IEEE Trans. Comput.-Aid. Design Integrat. Circ. Syst. 23, 9, 1370– 1377. MOSSE´ , D., AYDIN, H., CHILDERS, B., AND MELHEM, R. 2000. Compiler-assisted dynamic power-aware scheduling for real-time applications. In Proceedings of Workshop on Compiler and OS for Low Power (COLP). PILLAI, P. AND SHIN, K. G. 2001. Real-time dynamic voltage scaling for low-power embedded operating systems. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP). 89–102. PREPARATA, F. AND SHAMOS, M. 1993. Computational Geometry An Introduction. Springer. RAMAMRITHAM, K. AND STANKOVIC, J. 1994. Scheduling algorithms and operating systems support for real-time systems. In Proceedings of the IEEE. RUSU, C., XU, R., MELHEM, R., AND MOSSE´ , D. 2004. Energy-efficient policies for request-driven soft real-time systems. In Proceedings of Euromicro Conference on Real-Time Systems (ECRTS). Catania, Italy. SAEWONG, S. AND RAJKUMAR, R. 2003. Practical voltage-scaling for fixed-priority RT-systems. In Proceedings of IEEE Real-Time Embedded Technology and Applications Symposium (RTAS). SEO, J., KIM, T., AND DUTT, N. 2005. Optimal Integration of inter-task and intra-task dynamic voltage scaling techniques for hard real-time applications. In Proceedings of International Conference on Computer Aided Design (ICCAD). SHIN, D., KIM, J., AND LEE, S. 2001. Intra-task voltage scheduling for low-energy hard real-time applications. IEEE Design Test Comput. 18, 23. SINHA, A. AND CHANDRAKASAN, A. 2001. Dynamic voltage scheduling using adaptive filtering of workload traces. In Proceedings of International Conference on VLSI Design (VLSID). TRANSMETA 2005. Crusoe processor specification. http://www.transmeta.com/crusoe/specs.html. WESTE, N. AND ESHRAGHIAN, K. 1993. Principles of CMOS VLSI Design. Addison-Wesley, Reading, MA. XIE, F., MARTONOSI, M., AND MALIK, S. 2003. Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proceedings of Programming Language Design and Implementation (PLDI). XSCALE 2007. Intel XScale Microarchitecture. http://developer.intel.com/design/intelxscale/. XSCALEPOWER 2005. Intel XScale Microarchitecture:benchmarks. http://web.archive.org/web/ 20050326232506/http://developer.intel.com/design/intelxscale/benchmarks.htm. XU, R., MOSSE´ , D., AND MELHEM, R. 2005. Minimizing expected energy in real-time embedded systems. In Proceedings of ACM International Conference on Embedded Software (EMSOFT). Jersey City, New Jersey. XU, R., XI, C., MELHEM, R., AND MOSSE´ , D. 2004. Practical PACE for embedded systems. In Proceedings of ACM International Conference on Embedded Software (EMSOFT). Pisa, Italy. ACM Transactions on Computer Systems, Vol. 25, No. 4, Article 9, Publication date: December 2007.

9:40

•

R. Xu et al.

YAO, F., DEMERS, A., AND SHANKAR, S. 1995. A scheduling model for reduced CPU energy. In Proceedings of IEEE Annual Foundations of Computer Science (FOCS). 374–382. YUAN, W. AND NAHRSTEDT, K. 2003. Energy-efficient soft real-time CPU scheduling for mobile multimedia systems. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP). ZHANG, Y., LU, Z., LACH, J., SKADRON, K., AND STAN, M. 2005. Optimal procrastinating voltage scheduling for hard real-time systems. In Proceedings of Desing Automation Conference (DAC). ZHU, Y. AND MUELLER, F. 2004. Feedback EDF scheduling exploiting dynamic voltage scaling. In Proceedings of IEEE Real-Time Embedded Technology and Applications Symposium (RTAS). Toronto, Canada. Received April 2006; revised March 2007, September 2007; accepted October 2007