Power-Aware Design Synthesis Techniques for Distributed Real-Time Systems

Dong-In Kang, Stephen Crago, Jinwoo Suh
University of Southern California/Information Sciences Institute
4350 N. Fairfax Dr. Suite 770, Arlington, VA 22203
(703) 248-6164
{dkang, scrago, jsuh}@east.isi.edu

ABSTRACT
This paper presents an end-to-end synthesis technique for low-power distributed real-time system design. This technique synthesizes supply voltages of resources to optimize system-level power consumption while satisfying end-to-end hard real-time latency bounds. A system is modeled as a set of distributed task chains (or pipelines). Each task chain is given its own end-to-end constraints. Task chains may share resources. Our approach searches the space of the trade-off between end-to-end latency and supply voltages of resources to minimize system-level power consumption. A power optimization algorithm is proposed for simple distributed real-time systems that do not have any resource sharing between task chains, and its optimality is shown. For more general systems, a heuristic based on the same techniques is proposed.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

General Terms
Performance, Design

Keywords
Real-time, synthesis, distributed system, voltage scaling

1. INTRODUCTION
Designing distributed real-time systems can be a complex art. Design constraints, which are multi-dimensional, include characteristics such as throughput, latency, weight, power consumption, etc. [14]. It is not unusual for these constraints to be tightly correlated. For example, higher throughput may require more weight and power, and smaller latency may require a higher clock frequency and more power. Correlations among the design constraints make the design of distributed real-time systems hard.

Power-efficient and power-conscious design becomes more important for devices ranging from hand-held devices to large radar systems on airplanes where power is either limited or costly. Power consumption needs to be treated as an important system-level variable early in system design. A major source of power savings is voltage scaling, which scales operational voltages of resources and corresponding maximum clock speeds. Voltage scaling affects throughput and latency in a nonlinear fashion. Moreover, due to the nature of end-to-end constraints, a set of end-to-end throughput, latency, and power-consumption constraints can produce an exponential combination of throughput, power, and latency parameters of tasks and resources.

In this paper, we deal with solving the voltage synthesis problem for low-power system design, which is an important subproblem of the overall process of the power-aware hardware-software partitioning process. The voltage synthesis problem is to find the (optimal) voltages of resources in the system that satisfy system-level end-to-end throughput and latency constraints and also to achieve minimum power consumption. The following inputs are assumed to be given: (1) task flow graphs; (2) task-to-resource mapping; (3) required end-to-end throughput and latency constraints; and (4) a real-time scheduler. With those inputs, our technique estimates end-to-end latencies and power consumption and searches for (optimal) voltages of resources analytically.

The rest of this paper is organized as follows. Related work is presented in Section 2. Section 3 presents system, application, and power models, and an overview of the problem. In Section 4, the power optimization techniques and voltage synthesis algorithms are discussed. Conclusions are presented in Section 5.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. LCTES 2001, Snowbird, Utah, USA © ACM 2001 1-58113-425-8/01/06…$5.00


2. RELATED WORK
Our results rely on preemptive, uniprocessor scheduling analysis [4][7][12]. Many of these methods provide an offline schedulability test and the analysis of the worst-case latency of a task. Exploiting resource slack time has been studied for enhancing flexibility, improving performance, and reducing power consumption of a real-time system. The classical uniprocessor scheduling model was extended to make use of the unused slack time for sporadic tasks for performance improvement [5][6]. Low power scheduling techniques exploiting slack time on a single chip were proposed by adjusting the supply voltage dynamically [1][8]. These methods schedule tasks, voltage, and corresponding clock speed to minimize power consumption at run-time. Our technique instead synthesizes, at design time, a supply voltage for each resource that may result in minimum power consumption while meeting end-to-end latency constraints and guaranteeing schedulability of each resource.

Automatic resource calibration from end-to-end design constraints was proposed for real-time system design [9][10]. The period and deadline of each task are derived automatically from end-to-end latency constraints and jitter constraints on a uniprocessor system [9]. The technique in this paper is also an end-to-end design technique. We consider system-level power consumption as an additional end-to-end objective function and propose a synthesis technique that synthesizes supply voltages of resources and deadlines of tasks for minimum system-level power consumption.

Power optimization techniques have been an important topic for CAD research [11][13]. However, most of them are chip-level VLSI design techniques. Multiple voltage scheduling among functional units in a chip was proposed [16][18]. These techniques address the problem of assigning a supply voltage from a finite number of pre-known supply voltages to each operation in a data flow graph so that the resulting schedule minimizes power consumption and satisfies timing constraints. In this paper we relax the assumption of a finite number of pre-known supply voltages of resources. Unlike holistic multiprocessor scheduling techniques, our technique isolates task scheduling on a resource from the other resources and synthesizes scheduling parameters of tasks and supply voltages of resources from end-to-end constraints, which reduces the search space and reduces design time.

Power-conscious partitioning and assignment of tasks between hardware and software were proposed [3][15]. The technique proposed in [3] addresses per resource voltage synthesis. Their model assumes a simple task model without any dependencies or communications among the tasks and also assumes that task constraints are known ahead of time. Their technique is not an end-to-end design technique and cannot be used for distributed systems having dependencies between tasks such as the distributed task chain model used in this paper. Power optimization under throughput constraints for multiprocessor systems was proposed in [17]. They identify a critical path in the application graph and reduce voltages of processors that run tasks in the non-critical paths using genetic algorithm or simulated annealing techniques.

3. MODEL AND PROBLEM OVERVIEW
We model a system as a set of independent, pipelined task chains with each task mapped to a designated resource. Formally, a system possesses the following structure and constraints.

Resources with Independent Power Supplies: There is a set of resources r1, r2, ..., rm, where a given resource ri corresponds to one of the system’s active resources. Each resource has its own power supply, and its voltage setting is independent of the others. In CMOS digital circuits, the latency and energy consumption of a task are given by the following equations [2]. In the equations, v denotes supply voltage, Vref and Vt denote reference voltage and threshold voltage respectively, which are inherent to the implementation technology, K is a technology dependent constant, and Ceff is effective capacitance.

$$\mathrm{Latency} = K \, \frac{v}{(v - V^{t})^2}, \quad V^{t} < v \le V^{ref} \qquad \text{(Eq. 1)}$$

$$\mathrm{Energy} \approx (\#\ \text{of switching}) \times C_{eff} \times v^2 \qquad \text{(Eq. 2)}$$

When the latency lkref and energy consumption ekref of a task τk at the reference voltage Vref of the resource where the task resides are given, the latency lk(v) and energy consumption ek(v) at another supply voltage v can be derived by the following equations.

$$e_k(v) = e_k^{ref} \times \left( \frac{v}{V^{ref}} \right)^2 \qquad \text{(Eq. 3)}$$

$$l_k(v) = l_k^{ref} \times \frac{v \times (V^{ref} - V^{t})^2}{V^{ref} \times (v - V^{t})^2} \qquad \text{(Eq. 4)}$$
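As a rough illustration of Eq. 3 and Eq. 4, the following Python sketch scales a task's reference latency and energy to a lower supply voltage. The function names and the sample task (10 ms and 15 mJ at Vref = 3.3 V, with Vt = 0.6 V) are assumptions made for this example only and do not come from the paper.

```python
def latency_at(v, l_ref, v_ref, v_t):
    """Eq. 4: task latency at supply voltage v, scaled from the reference latency."""
    if not (v_t < v <= v_ref):
        # Eq. 1 is only stated for Vt < v <= Vref.
        raise ValueError("supply voltage must satisfy Vt < v <= Vref")
    return l_ref * v * (v_ref - v_t) ** 2 / (v_ref * (v - v_t) ** 2)


def energy_at(v, e_ref, v_ref):
    """Eq. 3: task energy at supply voltage v, scaled from the reference energy."""
    return e_ref * (v / v_ref) ** 2


# Hypothetical task parameters, chosen only for illustration.
L_REF, E_REF, V_REF, V_T = 10.0, 15.0, 3.3, 0.6

for v in (3.3, 2.5, 2.0, 1.5):
    lat = latency_at(v, L_REF, V_REF, V_T)
    eng = energy_at(v, E_REF, V_REF)
    print(f"v={v:.1f} V  latency={lat:7.2f} ms  energy={eng:5.2f} mJ")
```

Sweeping the voltage downward in this way shows the trade-off the text describes: energy falls quadratically while latency grows faster and faster as v approaches Vt.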

It is clear from the equations above that energy consumption is reduced at the cost of latency increase. The following equation shows how quickly energy consumption changes as latency changes.

$$\frac{de_k(v)}{dl_k(v)} = \frac{de_k(v)/dv}{dl_k(v)/dv} = \frac{-2\, e_k^{ref} \times v \times (v - V^{t})^3}{l_k^{ref} \times V^{ref} \times (V^{ref} - V^{t})^2 \times (v + V^{t})} \qquad \text{(Eq. 5)}$$

The magnitude of dek(v)/dlk(v) increases monotonically in the interval Vt < v ≤ Vref.
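One way to sanity-check Eq. 5 is to compare its closed form with a finite-difference estimate built from Eq. 3 and Eq. 4. The sketch below does this in Python; the helper names, the step size h, and the sample parameters are assumptions chosen for illustration, not values from the paper.

```python
def latency_at(v, l_ref, v_ref, v_t):
    """Eq. 4: latency at supply voltage v."""
    return l_ref * v * (v_ref - v_t) ** 2 / (v_ref * (v - v_t) ** 2)


def energy_at(v, e_ref, v_ref):
    """Eq. 3: energy at supply voltage v."""
    return e_ref * (v / v_ref) ** 2


def slope_eq5(v, e_ref, l_ref, v_ref, v_t):
    """Closed form of de/dl from Eq. 5 (negative for Vt < v <= Vref)."""
    return (-2.0 * e_ref * v * (v - v_t) ** 3) / (
        l_ref * v_ref * (v_ref - v_t) ** 2 * (v + v_t))


def slope_numeric(v, e_ref, l_ref, v_ref, v_t, h=1e-6):
    """Central-difference estimate of (de/dv) / (dl/dv)."""
    de = energy_at(v + h, e_ref, v_ref) - energy_at(v - h, e_ref, v_ref)
    dl = latency_at(v + h, l_ref, v_ref, v_t) - latency_at(v - h, l_ref, v_ref, v_t)
    return de / dl


# Hypothetical task and resource parameters.
params = dict(e_ref=15.0, l_ref=10.0, v_ref=3.3, v_t=0.6)

for v in (1.0, 2.0, 3.0):
    print(v, slope_eq5(v, **params), slope_numeric(v, **params))
```

For these sample values the two columns agree closely, and the magnitude of the slope grows with v. This is the property the later optimality argument relies on: at high voltage, a small latency penalty buys a large energy saving.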

Task Chains: A system has n task chains, denoted Γ1, Γ2, … , Γn. Each chain Γi carries out an end-to-end transformation from its external input Xi to its output Yi in a periodic way with period Ti. All tasks in a chain Γi have the same period Ti.

Real-Time Scheduler: A real-time scheduler at each resource schedules tasks. Any uniprocessor real-time scheduling technique that can estimate the worst-case latency of a task can be used. Known techniques include RM (Rate Monotonic) scheduling, EDF (Earliest Deadline First) scheduling, fixed priority scheduling, etc. [4][7]. In this paper, we assume a fixed priority real-time scheduling technique in the examples for simplicity of presentation.

With fixed priority preemptive scheduling techniques, the worst-case latency of a task τk, lkw, can be derived analytically using the following recursive equation, where li is the execution time of a task τi [4]. In the equation, Bk is the maximum time for which task τk can be blocked by a lower priority task that uses protected data or is in a critical section, hp(k) is the set of tasks of higher priority than task τk, and Ti is the period of task τi.

$$l_k^{w} = B_k + l_k + \sum_{i \in hp(k)} \left\lceil \frac{l_k^{w}}{T_i} \right\rceil \times l_i \qquad \text{(Eq. 6)}$$

A set of tasks on a resource is schedulable when the worst-case end times of all tasks are within their deadlines. A voltage change causes a change of execution time of all tasks in the resource and requires recalculation of worst-case latencies of all tasks in the resource.

Latency Constraints: Γi’s latency constraint Λiconst is an upper bound on the time it takes for a computation to flow through the system. It is a hard real-time constraint. The end-to-end latency of Γi, Λi, is the sum of worst-case latencies of all the tasks in Γi.

$$\Lambda_i = \sum_{k \in \Gamma_i} l_k^{w} \le \Lambda_i^{const} \qquad \text{(Eq. 7)}$$

An Example: Consider the example shown in Figure 1. It consists of four chains, labeled Γ1 to Γ4. Here, shaded boxes are external inputs and outputs, rectangles denote shared resources, and circles denote tasks. The table at the bottom of Figure 1 shows properties of tasks. A task is given a fixed priority and is scheduled by the preemptive fixed priority scheduler. We assume that a smaller priority number represents a higher priority. Worst-case latencies are calculated using (Eq. 6). For simplicity of presentation, blocking times are set to 0. The two tables below the task graph in Figure 1 show design constraints of the chains and the properties of the resources. All tasks in Γ1 run every 40 milliseconds and the sum of their worst-case latencies is 62 milliseconds at the reference voltages shown, which is within Γ1’s latency constraint of 100 milliseconds. A task’s latency and its energy consumption can be given either by measurement or by estimation tools that can be either analytic or simulation-based. Each resource has its own range of reference voltages and threshold voltages that are given by the chip manufacturer.

[Figure 1 task graph: chains Γ1 to Γ4, with external inputs X1 to X4 and outputs Y1 to Y4, mapped onto resources r1 to r6. The drawing is not recoverable from the extracted text.]

                  Γ1     Γ2     Γ3     Γ4
Period T (ms)     40     80    120    120
Λconst (ms)      100    250    300    200
Λ (ms)            62    142    117    137

          r1      r2      r3      r4      r5      r6
Vref     3.3V    2.5V    2.5V    3.3V    1.8V    2.5V
Vt       0.6V    0.5V    0.5V    0.6V    0.5V    0.5V

Task    Latency lref (ms)   Energy eref (mJ)   Priority   Blocking time   Worst-case latency
τ1,1          10                  15               1            0                10
τ2,1          30                  40               1            0                30
τ3,1          22                  25               1            0                22
τ1,2          20                  35               2            0                30
τ2,2          10                  10               2            0                40
τ3,2          40                  42               1            0                40
τ4,2          10                  17               2            0                32
τ1,3          15                  15               3            0                55
τ2,3          10                  15               2            0                50
τ3,3          12                  10               1            0                12
τ1,4          30                  60               1            0                30
τ2,4          15                  15               3            0                65
τ3,4          30                  33               2            0                42
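The recursion in (Eq. 6) is usually solved by fixed-point iteration. The Python sketch below is one minimal way to do that. The task-to-resource mapping of Figure 1 is not recoverable here, so the example assumes that the three tasks with reference latencies 10, 20 and 15 ms, priorities 1, 2 and 3, and periods 40, 80 and 120 ms share one resource. That grouping is an assumption, chosen because it reproduces the tabulated worst-case latencies of 10, 30 and 55 ms. The function name and the iteration cap are likewise illustrative.

```python
import math


def worst_case_latency(k, exec_time, period, priority, blocking=0.0, max_iter=1000):
    """Iterate Eq. 6 to a fixed point for task index k on a single resource.

    A smaller priority number means a higher priority, as in the paper's example.
    """
    higher = [i for i in range(len(exec_time)) if priority[i] < priority[k]]
    w = blocking + exec_time[k]
    for _ in range(max_iter):
        # Interference from every higher-priority task released within w.
        interference = sum(math.ceil(w / period[i]) * exec_time[i] for i in higher)
        w_next = blocking + exec_time[k] + interference
        if w_next == w:
            return w
        w = w_next
    raise RuntimeError("no fixed point found; the resource may be overloaded")


# Assumed co-located tasks (see the lead-in); times in ms at reference voltage.
exec_time = [10, 20, 15]
period = [40, 80, 120]
priority = [1, 2, 3]

latencies = [worst_case_latency(k, exec_time, period, priority) for k in range(3)]
print(latencies)
```

After a voltage change, the execution times would first be rescaled with Eq. 4 and the iteration repeated for every task on the resource, which is the recalculation the text refers to.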

Figure 1. An example topology with design constraints

Problem Overview: Our goal is to synthesize operational voltages of resources to satisfy the end-to-end latency constraints of all chains and to consume minimum energy. We can reduce voltages of resources to reduce energy consumption as long as 1) the end-to-end latency of every chain is within its latency constraint and 2) all resources are schedulable. Voltage reduction causes a latency increase. When a resource is shared by multiple tasks in multiple chains, voltage reduction of the resource affects the latencies of all the tasks in the resource and the end-to-end latencies of the chains using the resource. Due to the dependencies introduced by resource sharing, finding an optimal solution is not a trivial process. A real-time scheduler adds more complexity to the problem because of the nonlinear latency change due to preemption and blocking among the tasks.

4. RESOURCE CALIBRATION FOR MINIMUM POWER CONSUMPTION
In this section, we describe how we find (optimal) supply voltages of resources such that the resulting system satisfies end-to-end latency constraints and achieves (minimum) power consumption. We first present an optimal algorithm for a simple system having one chain. Then we present a similar technique for general systems.

4.1 Simple Case
In this section we assume a distributed system is running one chain and all tasks in the chain have the same period, which is larger than the end-to-end latency constraint. The power consumption of the chain in a resource ri, Ei(v), is defined as the sum of the power consumption of the tasks on the resource. The latency of the chain in a resource ri, Li(v), is defined as the sum of latencies of the tasks on the resource. We define φi(v) of resource ri as follows.

$$\phi_i(v) = \frac{dE_i(v)}{dL_i(v)} = \frac{dE_i(v)/dv}{dL_i(v)/dv} = \frac{-2\, E_i^{ref} \times v \times (v - V_i^{t})^3}{L_i^{ref} \times V_i^{ref} \times (V_i^{ref} - V_i^{t})^2 \times (v + V_i^{t})} \qquad \text{(Eq. 8)}$$

The function φi(v) shows how quickly energy consumption changes as latency changes.

Lemma 1. The function |φi(v)| is monotonically increasing where v ≥ Vit.

(Proof) It is sufficient and necessary to show that the first derivative of |φi(v)| is positive where v ≥ Vit.

$$\frac{d\,|\phi_i(v)|}{dv} = \frac{C \times (v - V_i^{t})^2 \times \{3v^2 + 4vV_i^{t} - (V_i^{t})^2\}}{(v + V_i^{t})^2}, \quad \text{where } v \ge V_i^{t} \text{ and } C = \frac{2E_i^{ref}}{L_i^{ref} \times V_i^{ref} \times (V_i^{ref} - V_i^{t})^2} \qquad \text{(Eq. 9)}$$

All terms except {3v² + 4vVit − (Vit)²} are always non-negative in (Eq. 9). The term {3v² + 4vVit − (Vit)²} is always positive where v ≥ Vit. Therefore |φi(v)| is non-negative and is monotonically increasing where v ≥ Vit. ♣

An example is shown in Figure 2. Characteristics of each task at the reference voltage on the resource where it is mapped are shown. Different resources may have different physical characteristics. They may have different reference voltages, different threshold voltages, and different power consumption per unit latency. The sum of execution times of tasks at reference voltages of the resources should be equal to or less than the end-to-end latency constraint. If this is not true, a feasible design does not exist. For the example system, we let the end-to-end latency constraint be 300 milliseconds, and the period of tasks be 1 second. Our goal is to calibrate resources so that the resulting system satisfies the end-to-end latency constraint with minimum power consumption.

[Figure 2 drawing: a single chain Γ1 from input X1 to output Y1 through tasks τ1 to τ4 on resources r1 to r4.]

                  Γ1
Period T (ms)   1000
Λconst (ms)      300

          r1      r2      r3      r4
Vref     2.5 V   3.3 V   3.3 V   3.3 V
Vt       0.5 V   0.7 V   0.6 V   0.5 V

                      τ1     τ2     τ3     τ4
Latency lref (ms)     20     40     10     50
Energy eref (mJ)      20     60     30    100

Figure 2. A simple example

The following lemma and theorems show that an optimal solution for the distributed system, which has one chain without resource sharing such as the one in Figure 2, can be obtained using φ(v) values of the resources.

Lemma 2. Assume a system has N resources and has one chain whose end-to-end latency constraint is smaller than its period. No resource is shared by two or more tasks physically or temporally. Assume the current voltage of a resource ri is vi and the φi(vi) values of all resources are the same. Let the power consumption of the current configuration be E, and the latency be Λ. When we want to reduce power consumption at the cost of the latency increasing by ∆Λ from Λ, the maximum power reduction from E is obtained at voltages vi’ where the φi(vi’) values of all resources are the same, provided that no resource is over-utilized at the resulting voltages.

(Proof) We first show the lemma holds for the system having two resources, r1 and r2. Assume that maximum power reduction is found at different φ(v) values, namely φ1(v1’) and φ2(v2’), and the corresponding voltages are v1’ and v2’. Without loss of generality, we assume |φ1(v1’)| < |φ2(v2’)|. Let us increase power consumption (increase voltage to v1”) of r1, but decrease power consumption of r2 (decrease voltage down to v2”) while the end-to-end latency remains unchanged and |φ1(v1”)| ≤ |φ2(v2”)|. Let ∆Λ be the largest latency change of both r1 and r2 while |φ1(v1”)| ≤ |φ2(v2”)| holds. And let the resulting voltages of r1 and r2 be vd1 and vd2, respectively. Then, the following hold:

$$\Delta\Lambda = L_1(v_1') - L_1(v_{d1}) = L_2(v_{d2}) - L_2(v_2')$$

$$E_1(v_{d1}) - E_1(v_1') = \int_{L_1(v_1')}^{L_1(v_{d1})} \phi_1(v)\, dL_1 = \int_{L_1(v_1')}^{L_1(v_1') - \Delta\Lambda} \frac{dE_1(v)}{dL_1(v)}\, dL_1$$

$$E_2(v_2') - E_2(v_{d2}) = \int_{L_2(v_{d2})}^{L_2(v_2')} \phi_2(v)\, dL_2 = \int_{L_2(v_2') + \Delta\Lambda}^{L_2(v_2')} \frac{dE_2(v)}{dL_2(v)}\, dL_2$$

Because both |φ1(v)| and |φ2(v)| increase monotonically for v above the threshold voltage of resources r1 and r2 respectively by Lemma 1 and |φ1(vd1)| ≤ |φ2(vd2)|, power reduction at r2 is larger than the power increase at r1. This contradicts the assumption of optimal solution. Moreover, φ1(vd1) = φ2(vd2), because, if it is not true, a larger ∆Λ can be chosen, which contradicts the assumption that ∆Λ is maximized. For general systems having three or more resources having the same φ(v) values at the start, we can always find a single φ(v) value for optimal energy saving of any two resources. And the resulting φ(v) value is always between the two original φ(v) values. Applying this procedure iteratively will result in a single φ(v) value for all resources having minimum power consumption. ♣

Based on the observation described in Lemma 2, we propose an optimal algorithm that calibrates the voltages of resources for maximum power savings. Theorem 1 shows its optimality.

Theorem 1. Algorithm 1 produces an optimal solution that consumes minimum power for systems that have one chain for which the end-to-end latency constraint is smaller than its period.

(Proof) If all resources have the same φ(v) values at their reference voltages and no resources are over-utilized at the resulting voltages, the solution generated by Algorithm 1 is optimal by Lemma 2. If some resources reach their maximum utilization, the algorithm finds the smallest φ(v) value at which no resource is over-utilized. By Lemma 2, it is optimal so far. Then, those resources that reach their maximum utilization are excluded in line (20). The algorithm repeats the same optimal procedure until all resources reach their maximum utilization or the end-to-end latency is equal to the end-to-end latency constraint.

If some resources have different φ(v) values at their reference voltages, the algorithm first picks a set S’ of resources for which φ(v) values are largest at their reference voltages, as shown in lines (10) and (11), and reduces their φ(v) value to the next largest φ(v) value, as shown in line (15). Because |φi(v)| values for all resources increase monotonically at v above their threshold voltages, a larger |φi(v)| value yields a larger power reduction for the same latency penalty. Therefore, to minimize power, the φ(v) values of the resources in the set S’ should be reduced first, until they become equal to the φ(v) value of the resources in S’’. By Lemma 2, all resources in the set S’ must have the same φ(v) value for minimizing power consumption. However, the resulting end-to-end latency may exceed the end-to-end latency constraint or some resources may be overloaded. When this happens, the smallest φ(v) value satisfying both the latency constraints and the resource loading constraints is generated as shown in lines (17) through (19). At each iteration, the algorithm finds minimum power consumption from the previous configuration, and excludes any resources that reach their maximum utilization. It is trivial to see that this algorithm reduces maximum power consumption for each latency increase

Algorithm 1. Voltage assignment algorithm for minimum power consumption of a single chain system.

(1)  For each resource ri
(2)      $E_i(v) = E_i^{ref} \times (v / V_i^{ref})^2$
(3)      $L_i(v) = L_i^{ref} \times \dfrac{v \times (V_i^{ref} - V_i^{t})^2}{V_i^{ref} \times (v - V_i^{t})^2}$
(4)      $\phi_i(v) = \dfrac{-2\, E_i^{ref} \times v \times (v - V_i^{t})^3}{L_i^{ref} \times V_i^{ref} \times (V_i^{ref} - V_i^{t})^2 \times (v + V_i^{t})}$
(5)      vi = Viref
(10) φ = max{ |φi(vi)| of ri ∈ S };
(11) S’ = { ri | ri ∈ S and |φi(vi)| == φ };
(12) φ’ = max{ |φi(vi)| of ri ∈ S – S’ };
(13) If (S’ is equal to S) φ’ = 0;
(14) S’’ = { ri | ri where |φi(vi)| == φ’ };
(15) vi = v such that |φi(v)| == φ’ for each ri ∈ S’;
(16) Calculate Λ;
(17) if (Λ > Λconst OR some resources are not schedulable)
(18)     Find the smallest φ’’, φ’ ≤ φ’’