IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 28, NO. 11, NOVEMBER 2009


A Feedback-Based Approach to DVFS in Data-Flow Applications Andrea Alimonda, Salvatore Carta, Andrea Acquaviva, Alessandro Pisano, Member, IEEE, and Luca Benini

Abstract—Runtime frequency and voltage adaptation has become very attractive for current and next generation embedded multicore platforms because it allows handling the workload variabilities arising in complex and dynamic utilization scenarios. The main challenge of dynamic frequency adaptation is to adjust the processing speed of each element to match the quality-of-service requirements in the presence of workload variations. In this paper, we present a control theoretic approach to dynamic voltage/frequency scaling for data-flow models of computations mapped to multiprocessor systems-on-chip architectures. We discuss, in particular, nonlinear control approaches to deal with general streaming applications containing both pipeline and parallel stages. Theoretical analysis and experiments, carried out by means of a cycle-accurate energy-aware multiprocessor simulation platform, are provided. We have applied the proposed control approach to realistic streaming applications such as Data Encryption Standard and software-based FM radio.

Index Terms—Data flow, dynamic voltage scaling (DVS), energy management, feedback control, streaming.

I. INTRODUCTION

ENERGY management in embedded multiprocessor systems-on-chip (MPSoC) architectures for multimedia streaming computing is becoming a crucial issue [8], [9], [17], [28]. While these architectures represent a promising way to overcome the "power wall" [3], their parallel nature introduces many additional difficulties for the software system designers. Indeed, the software system must be designed to support runtime management techniques that are capable of orchestrating the utilization of the various cores in an efficient way. Among them, clock speed control of processing elements (PEs) strongly impacts both performance and energy consumption. Scaling the core frequency reduces the dynamic power consumption by a cubic factor if the core voltage is scaled accordingly. Remarkably, modern MPSoCs feature hardware support for independent

Manuscript received September 29, 2008; revised April 3, 2009 and July 31, 2009. Current version published October 21, 2009. This paper was recommended by Associate Editor V. Narayanan. A. Alimonda and S. Carta are with the Department of Mathematics and Computer Science, University of Cagliari, 09124 Cagliari, Italy (e-mail: [email protected]; [email protected]). A. Acquaviva is with the Department of Control and Computer Engineering (DAUIN), Politecnico di Torino, 10129 Torino, Italy (e-mail: [email protected]). A. Pisano is with the Department of Electrical and Electronic Engineering (DIEE), University of Cagliari, 09123 Cagliari, Italy (e-mail: [email protected]). L. Benini is with the Department of Electronics, Computer Sciences and Systems (DEIS), University of Bologna, 40136 Bologna, Italy (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCAD.2009.2030439

runtime selection of frequency and voltage in several cores [18], which can be adjusted depending on their performance requirements. However, the problem of speed selection is complex, as multimedia applications are made of multiple communicating tasks mapped on possibly many different cores [31], [33]. In this paper, we propose a feedback control approach to runtime speed adjustment of data-flow applications mapped on MPSoCs [31], [32].

A. Problem Formulation

The increasing complexity of utilization scenarios in which applications must share the processing resources leads to heavy workload variations that cannot easily be handled by means of a static characterization. In this paper, we consider a data-flow application that can be described as a pipeline of stages. In the first and last stages, a producer (P) and a consumer (C) PE are found, interconnected by a certain number of intermediate PEs in a series or parallel configuration. Here, we assume that the mapping of tasks to PEs, as well as their interconnection, is given; the problem under investigation is solely that of dynamically assigning the proper speed to each PE in the presence of unpredictable, unknown-in-advance workload variability. The problem we address can be formulated as follows: Given a collection of tasks implemented by means of parallel and pipelined computational elements, adjust the speed of each computational element in such a way that the overall energy consumption is minimized while fulfilling the specific application performance requirements. The performance of data-flow applications is usually evaluated in terms of the guaranteed throughput, which is determined by the data output rate of the last stage. We assume that the last stage is not subject to workload variations and takes data items from its input queue at a constant rate.
Given these specifications, the problem can be reformulated as that of finding the minimum speed for each core such that the input queue to the final stage never becomes empty. The queue levels are observed in order to identify such an "optimal" frequency.

B. Feedback Loop Strategies

To tackle the runtime speed allocation problem in multimedia MPSoCs, static "open-loop" dynamic voltage/frequency scaling (DVFS) techniques have been introduced which require prior knowledge of the workload characteristics [24], [29], [30], [37]. Basically, such methods select the processor speeds


depending on the expected workload scenario, and, intuitively, their performance may decrease in the presence of unexpected workload variability [14]. One of the most popular DVFS implementations is Vertigo [12], which is based on the observation of local processor utilization. The Vertigo algorithm mainly relies on the estimation of per-task deadlines through the observation of idle periods occurring between task completion and successive task reactivation. Vertigo exploits a feedback-based control strategy that observes deadline events in order to compensate for the estimation errors. The main limitation of this technique, which manifests mainly when it is applied to a multiprocessor architecture, is that it is not able to promptly react to fast workload changes. The reason is that the estimation of the quality-of-service (QoS) requirement is carried out on the basis of the processor utilization, which reflects the QoS requirements only indirectly. A more recent and effective approach monitors the interprocessor queues at runtime instead [4], [19], [34], [35]. In MPSoC architectures, first-in–first-out (FIFO) queues are used as data buffers between adjacent PEs in order to smooth out the effect of quick workload variations, thereby ensuring a more stable throughput rate [36]. Each PE has an input queue, from which it gets the input data, and an output queue, into which it releases the processed data. The input queue, the output queue, or both may be shared with additional processors working in parallel. Multimedia frameworks such as GStreamer [15] and OpenMAX [22] allow implementing multimedia applications as processing blocks communicating through buffers. A number of feedback control approaches have been proposed that consider, as monitored variables, the occupancy levels of the data buffer queues between the PEs and adjust the core frequencies by processing those monitored quantities with suitable control algorithms.
The control objective is to regulate the occupancy of the output queues of all PEs. This must be guaranteed while avoiding too frequent frequency adjustments, which cause an increase in energy dissipation and introduce additional computation delays. Thus, the control problem presents conflicting requirements. Both linear [19] and nonlinear [4], [34] control schemes have been proposed in the literature. In [4], it is shown that, for a specific streaming architecture, i.e., the "pure-pipeline" configuration discussed later, a nonlinear controller can outperform the classical proportional–integral–derivative (PID) linear controller in terms of both energy efficiency and ease of tuning and implementation.

C. Paper Contribution

Although feedback-based runtime speed adaptation policies are promising, methodologies and implementations of these strategies for a general data-flow model are still lacking. Indeed, until now, they have only been applied to simple architectures with a pure series pipelined configuration and with a single output queue at each stage [4]. In this paper, we extend and generalize the feedback control strategies applied to streaming applications by considering a more general class of data-flow applications, where the processing pipeline is

characterized by a composition of series and parallel stages. This framework adheres to a general representation model for streaming applications [32]. In this paper, we consider two alternative allocation strategies for the queues in the presence of parallel stages. Basically, the PEs of a parallel stage can have either of the following: 1) a unique shared input queue or 2) multiple private input queues. This outlines the two main configurations of the "split–join" stages [32], which require different dedicated control policies. In the case of private queues, the control law becomes more complex. Since the various queue configurations depend on the type of application, in this paper, we consider two realistic representative ones, namely, software FM radio [32] and parallel Data Encryption Standard (DES) [10]. Summarizing, the main elements of novelty introduced by this paper are the following: 1) the introduction of a new set of controllers generalizing those presented in [4] and [2] to handle different, more complex queue configurations; 2) the experimental evaluation of various controller alternatives using synthetic and realistic benchmarks; 3) a proof of stability for the control law. To measure the performance and energy consumption, we implemented the considered benchmark applications on a cycle-accurate energy-aware multiprocessor software simulator [26]. From the results obtained through extensive experiments, the feedback control turned out to be very effective in reducing energy consumption compared to approaches based on local processor utilization. The rest of this paper is organized as follows. In Section II, we survey and discuss some relevant literature. In Section III, a classification of mixed parallel/pipelined streaming configurations is given, and real-world application case studies are provided for every configuration.
Section IV summarizes the previously published results concerning the modeling and frequency control of the pure-pipeline streaming configuration [1], [4]. In Section IV-A, system modeling and DVFS design for the mixed parallel/pipelined configurations described in Section III are considered. In Section V, the performance of the proposed nonlinear controller is evaluated by means of synthetic and real-world benchmarks running on a cycle-accurate energy-aware software simulator. Some final conclusions and perspectives for future research are drawn in the Conclusion.

II. RELATED WORK ON FEEDBACK-BASED DVFS

The common objective of control theoretic approaches to DVFS is to achieve an estimation of the slowest processor speed needed to satisfy the deadline constraints in multimedia applications. In the context of single-processor systems, Vertigo [12] determines the deadline of each task by looking at idle periods occurring between task completion and successive task reactivation. Task completions are detected by tracing operating-system-specific system calls indicating that the program has voluntarily released the CPU. In Vertigo, the next deadline is computed from the observation of the last measured deadline

ALIMONDA et al.: FEEDBACK-BASED APPROACH TO DVFS IN DATA-FLOW APPLICATIONS

and an averaged history of past deadline events. Basically, a linear proportional–integral (PI) control law is used to compensate for the estimation errors. The Vertigo performance-setting algorithm has been adopted in the intelligent energy management standard power conservation strategy for ARM cores [7]. Vertigo is effective as it directly estimates the deadline events by means of system call tracing. However, it has two major limitations. First, it lacks reactivity because it performs estimation based on the observation of past deadlines and utilization intervals. Second, processor utilization itself cannot always be stretched up to the deadline because of blocking idle times in between. In the context of single-processor systems, alternative feedback-based approaches are presented in [19] and [21]. In [21], the frequency adjustment is performed by looking at frame delays, which are directly related to the QoS. The control algorithm is of the PI type. Clearly, the DVFS algorithm must be able to access application-level parameters, which makes the approach application dependent. On the other hand, in [19], PI feedback control is used for an MPEG application to adjust the decoding speed (i.e., the frequency of the processor) by looking at the occupancy level of the buffer between the processor and the display device. The objective of the control law is to keep the occupancy level of the queue constant so that the probability of QoS degradation is minimized. For distributed multiprocessor systems, a feedback-based frequency adjustment approach has been presented in [11]. A PID controller is used in a feedback loop, where information about the deadline constraints and the battery status of all PEs is propagated backward to the producer element. The distributed dynamic voltage scaling technique proposed in [11] applies only to a chain of two elements, and its extension to more general cases was not considered.
Furthermore, no theoretical justification for the effectiveness of the control law is given. Feedback-based DVFS schemes using the output queue occupancy as the feedback signal were suggested to adapt the execution speed to a changing throughput demand [34], [35]. In [34] and [35], Wu et al. derive a nonlinear model of the queue occupancy and a feedback scheme consisting of a linear PI controller cascaded with a nonlinear mapping that "linearizes" the dynamics of the queue. The implementation of the nonlinear mapping requires the online identification of two parameters correlating the processor frequency and the corresponding service rate. In [2] and [4], linear and nonlinear control approaches have been suggested, whose performance has been compared against state-of-the-art local dynamic voltage setting policies. In this paper, we provide a generalization of the feedback control strategies in [2] and [4] to data-flow applications, and we also apply it to realistic data-flow case studies. In particular, compared to [4], where only the pure-pipeline case has been considered, we extend the control strategy to a wide range of data-flow applications. In this paper, we introduce two types of queue allocations (i.e., shared and private queues) for split–join stages, and we propose and characterize various control approaches.


Fig. 1. Pipelined multilayer architecture.

III. MULTIMEDIA TASK CONFIGURATIONS

In this section, we give a classification of the configurations for multiprocessor streaming computing that are handled in this paper. We assume that, in the chosen architecture, the speed of each PE can be selected from a discrete set of permitted values. In [4], we considered applications modeled by a pure pipeline cascade of single-processor workers. In this paper, we refer to a more general setup, i.e., a given worker may be representative of a more complex subsystem containing multiple processors working in parallel.

A. Pure-Pipeline Configuration

A significant class of applications can be modeled by referring to a pipelined interconnection of general-purpose PEs, as that schematized in Fig. 1. Modern multimedia application-development frameworks, such as GStreamer [15], use the notion of pipeline to connect elements that perform various functions, such as filtering, mixing, etc. The initial and final elements represent the data source, or "producer," and the data sink, or "consumer." For example, in an audio playback application, the producer is the storage memory, whereas the consumer is the audio codec. Fig. 1 depicts a block scheme representation of this configuration: from left to right, the producer processor P, the m workers W1, . . . , Wm, and the consumer processor C. Between the processor stages, the FIFO buffers Q1, Q2, . . . , Qm+1 provide the data-exchange interface. The producer and worker frequencies are considered user selectable among a discrete set of permitted values, including the zero value corresponding to the shut-off condition for the PE. The consumer frequency is preassigned depending on the application throughput demand.

B. SQPP Configuration

In many applications, some worker stage can be "split," i.e., its load is allocated among several parallel substages.
We consider, as a specific example having sufficient generality, a four-stage pipeline containing two workers W1 and W2 , the second of which is representative of a multiprocessor system containing three processors working in parallel, for example, W21 , W22 , and W23 . If processors W21 , W22 , and W23 perform the same operation, they can have a unique feeding queue Q2 . We call this configuration the “shared-queue parallel-pipeline (SQPP) configuration,” represented in Fig. 2. SQPP Case Study—Parallel DES: The DES algorithm is a well-known encryption algorithm. It can be seen as a paradigm for streaming applications with regular workload. DES encrypts and decrypts data using a 64-b key. It splits the input data into 64-b chunks and outputs a stream of 64-b ciphered blocks. Since each input element (called “frame”) is encrypted independently, the algorithm can easily be parallelized. In the

Fig. 2. SQPP configuration with four layers.

Fig. 3. Parallel DES.

Fig. 5. FM radio.

of the digitized pulse-code modulation radio signal to be processed in order to produce an equalized baseband audio signal. In the first stage, the radio signal passes through a low-pass filter to cut frequencies above the radio bandwidth. Then, it is demodulated by the demodulator (DEMOD) to shift the signal to the baseband and produce the audio signal. The audio signal is then equalized with a number of bandpass filters (BPFs) implemented as a parallel stage. The data must be replicated for each BPF, which requires the use of multiple input queues. In the last stage, the consumer (Σ) collects the data provided by each BPF and computes their weighted sum, reconstructing the final output.

IV. NONLINEAR DVFS CONTROLLERS FOR PARALLEL/PIPELINE CONFIGURATIONS

Fig. 4. PQPP configuration with four layers.

parallelized version of DES [2], [29], three kinds of tasks are defined. An initiator task (producer) dispatches 64-b blocks, together with a 64-b key, to n calculator tasks (referred to as working tasks) that perform the parallel DES encryption. The initiator task and working tasks are allocated to different PEs and use a single queue to exchange data. Finally, a collector task (consumer) reconstructs the output stream by concatenating the ciphered blocks provided by the working tasks. It is allocated onto another PE and communicates with the workers by means of the output queues. This architecture can lead to a three-layered pipeline in which the intermediate stage is split into n = 3 parallel ciphers [Fig. 3(b)].

C. PQPP Configuration

In some applications, the processors working in parallel implement different data processing algorithms; hence, they need private data from the preceding stage. Then, each parallel processor needs a private input queue. The graphical representation of a four-layer pipeline/parallel architecture with three processors in the parallel stage is given in Fig. 4. In this case, processor W1 must implement a proper dispatching of the data.

PQPP Case Study—Software FM Radio: A software FM radio application with a multiband equalizer fits well the private-queue parallel-pipeline (PQPP) architecture previously described. As shown in Fig. 5, the input data represent the samples

In this section, we describe the novel DVFS schemes for the previously outlined mixed parallel/pipeline architectures. First, we derive suitable mathematical models for the SQPP and PQPP configurations that were described in Sections III-B and C. Then, in Section IV-B, we present the general formulation of the nonlinear controller that constitutes the core element of the proposed DVFS schemes. Finally, in Sections IV-C and D, we describe the suggested DVFS schemes for the SQPP and PQPP architectures, respectively.

A. Mathematical Modeling of the Mixed Pipeline/Parallel Configurations

For the SQPP and PQPP configurations, a mathematical model that is suitable for designing the corresponding DVFS schemes is derived as follows. Refer to Figs. 2 and 4, where the workers Wi or Wij represent single processors. Denote as fi or fij the corresponding frequency, and let Qi or Qij be the current occupancy of the associated interprocessor FIFO data buffers. It is sensible to assume that the data throughput of each processor is proportional to the corresponding frequency through some positive time-varying coefficient ki(t) [or kij(t)] called the "throughput gain." The throughput gains of the input and output data can be different. To facilitate the system modeling, define Qi(t) [Qij(t)] as a real-valued (i.e., "fluid") approximation of the corresponding integer buffer occupancy. A simple dynamical model can be derived by considering the input/output data balance for each queue. Data-dependent workload variations are captured by appropriate variations of the corresponding throughput gains. Moreover, the effects due to the finite bandwidth of the communication subsystem (i.e.,


interconnecting bus) can be seen as variations of the output data rate for each stage.

SQPP Configuration: Consider the four-stage SQPP configuration represented in Fig. 2. The following model can be derived, which represents the balance equation between the incoming and outgoing data rates for each queue:

Q̇31(t) = kIw21(t)fw21 − kOc(t)fc
Q̇32(t) = kIw22(t)fw22 − kOc(t)fc
Q̇33(t) = kIw23(t)fw23 − kOc(t)fc
Q̇2(t) = kIw1(t)fw1 − kOw21(t)fw21 − kOw22(t)fw22 − kOw23(t)fw23
Q̇1(t) = kIP(t)fp − kOw1(t)fw1.   (1)
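As a sanity check of the model, the balance equations above can be integrated with a simple forward-Euler step. The following Python sketch is illustrative only: the throughput gains and frequencies are hypothetical constants, whereas in the paper they are time varying and data dependent.

```python
# Sketch: forward-Euler integration of the SQPP fluid-queue balance (1).
# All gains k and frequencies f are hypothetical illustrative values.

def step_sqpp(Q, f, k, dt=0.001):
    """One Euler step of (1). Q: queue occupancies, f: core frequencies,
    k: throughput gains, all given as dicts keyed by the paper's names."""
    dQ = {
        "Q31": k["Iw21"] * f["w21"] - k["Oc"] * f["c"],
        "Q32": k["Iw22"] * f["w22"] - k["Oc"] * f["c"],
        "Q33": k["Iw23"] * f["w23"] - k["Oc"] * f["c"],
        "Q2":  k["Iw1"] * f["w1"]
               - k["Ow21"] * f["w21"] - k["Ow22"] * f["w22"] - k["Ow23"] * f["w23"],
        "Q1":  k["IP"] * f["p"] - k["Ow1"] * f["w1"],
    }
    # Fluid occupancy cannot become negative.
    return {q: max(Q[q] + dQ[q] * dt, 0.0) for q in Q}
```

With matched rates (e.g., unit gains, the three workers running at the consumer frequency, and W1 and P at three times that rate), every derivative vanishes and all queue levels stay constant, which is the steady state the DVFS controllers aim to maintain.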

The frequency fc of the consumer processor is prespecified (possibly time varying) and cannot be adjusted by the user. The throughput gains are strictly positive and, obviously, bounded. In general, their values depend on the currently processed data, but, for some specific applications, it may happen that all throughput gains are constant.

PQPP Configuration: Now, let us consider the four-stage PQPP configuration represented in Fig. 4, which presents seven buffers and five adjustable PEs. Again, the consumer frequency fc cannot be adjusted. The queue dynamics are easily written as follows, considering the input/output data balance for each queue:

Q̇31(t) = kIw21(t)fw21 − kOc(t)fc
Q̇32(t) = kIw22(t)fw22 − kOc(t)fc
Q̇33(t) = kIw23(t)fw23 − kOc(t)fc
Q̇21(t) = kIw1(t)fw1 − kOw21(t)fw21
Q̇22(t) = kIw1(t)fw1 − kOw22(t)fw22
Q̇23(t) = kIw1(t)fw1 − kOw23(t)fw23
Q̇1(t) = kIP(t)fp − kOw1(t)fw1.   (2)

B. Nonlinear DVFS Algorithm

The worker frequencies are user selectable among a finite set of permitted values. The idea behind the feedback-based DVFS control is to determine the proper speed of each processor by controlling the occupancy of the FIFO queues. The twofold goal is that of minimizing both the average processor frequencies and the number of frequency adjustments, which increase energy dissipation and introduce additional computation delays. Let Q∗ be a convenient target value for the queue occupancy. Denote as ei = Qi − Q∗ the "error variable" associated with the buffer Qi. It is the task of the DVFS system to maintain the error variables close to zero. The nonlinear control algorithm described in this section is the key tool of this approach. It has a very simple structure and tuning rules. Since the control variables (the core frequencies) are discretized so that only a finite number of permitted frequencies can be assigned to each PE, it makes sense to parameterize the set of admissible frequencies by an integer subscript


coefficient j ranging from 1 to N, with N being the number of permitted frequencies. By convention, we can assume that the processor speed increases with increasing j, i.e.,

f1 < f2 < · · · < fj < fj+1 < · · · < fN.   (3)

Let us assume, for the sake of simplicity, that the set of permitted frequencies is the same for all the worker processors. The controller adjusts at runtime the frequency of each processor. In other words, the controller sets and adjusts, for each worker processor, the appropriate value of the integer coefficient j. A dedicated controller unit for each processor is implemented. Every time instant that the controller is activated, a decision is made between the following three options: 1) keep the same frequency (i.e., leave j unchanged); 2) increase the frequency (i.e., increase j by one); 3) decrease the frequency (i.e., decrease j by one). The decision is made on the basis of the current (e[k]) and previous (e[k − 1]) values of the "error variable" e of the FIFO buffer following the processor. According to its definition (e = Q − Q∗), a negative value of e means that the buffer occupancy is lower than the desired one, and vice versa. The adopted policy is explained discursively as follows. If the current buffer occupancy is lower than the desired one (i.e., if e[k] < 0), then, in general, the processor frequency should be increased. However, if the buffer is already filling up with respect to the previous measurement instant (i.e., if e[k] > e[k − 1]), then the frequency can also be kept constant since the queue occupancy is correctly tending to the desired value. A dual reasoning can be made when e[k] > 0. Furthermore, we leave the processor frequency unchanged when the corresponding error variable e[k] is, in modulus, sufficiently small (|e[k]| < Δ, for some Δ > 0). The decision of increasing the frequency translates into the index assignment j := min(j + 1, N), whereas the opposite decision of decreasing the frequency leads to j := max(j − 1, 0). Indeed, it follows from (3) that index j cannot exceed N and cannot become negative. Denote as "trigger instants" the time instants at which the controller performs the corresponding decisions.
This results in the following formal description of the basic nonlinear DVFS controller.

Nonlinear DVFS Controller: At every trigger instant, do the following:
− IF [(ei[k] < −Δ) AND (ei[k] ≤ ei[k − 1])] THEN j := min(j + 1, N);
− IF [(ei[k] > Δ) AND (ei[k] ≥ ei[k − 1])] THEN j := max(j − 1, 0).

We present two controller implementations that differ in the way the trigger instants are generated. In the "constant triggering controller" (CTC) version, the trigger instants are equally spaced in time (let Ts be the sample time interval). In the "variable triggering controller" (VTC) version, the controller triggering is generated adaptively depending on the monitored queue occupancy level. Triggering is generated every time that H new packets are put into, or taken from, the corresponding

buffer, with H being a proper integer number. This leads to an adaptive "throughput-dependent" spacing between the trigger instants. The VTC has the interesting property of automatically speeding up the generation of trigger instants in the presence of high data throughput. The VTC implementation is thus of particular interest because it provides high reactivity exactly when it is needed. The aforementioned nonlinear controller is compactly referred to as NL(ei). The rationale underlying the NL controller structure is discussed in Appendix A. The important tuning parameters are Q∗, Δ, and Ts (CTC version) or H (VTC version). The tuning of these parameters is not critical and can be made, by means of the formulas suggested next, as a certain percentage of the maximal queue capacity. The control system "reactiveness" depends on Ts (CTC version) or H (VTC version); the smaller the parameter, the higher the reactiveness. The only requirement for stability is that the highest available frequency fN is appropriately large, depending on the specific application [see (20)]. There are no additional stability conditions to fulfill for the remaining parameters.
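A minimal Python sketch of the NL(ei) decision rule and of the VTC trigger generation may clarify the mechanism. The frequency table, Δ, and H values below are illustrative assumptions (the paper leaves them platform dependent), and the index j is 0-based here rather than 1-based.

```python
# Sketch of the nonlinear DVFS controller NL(e) and the VTC trigger logic.
# FREQS, Delta, and H are hypothetical illustrative values, not taken from
# the paper; the index j is 0-based (0 .. N-1) instead of 1 .. N.

FREQS = [100, 200, 300, 400, 500]  # f_1 < ... < f_N (e.g., MHz), hypothetical
N = len(FREQS)

class NLController:
    def __init__(self, q_star, delta):
        self.q_star = q_star   # queue set point Q*
        self.delta = delta     # dead band Delta
        self.j = 0             # current frequency index
        self.e_prev = 0.0      # e[k-1]

    def trigger(self, q):
        """Called at every trigger instant with the current queue occupancy."""
        e = q - self.q_star
        if e < -self.delta and e <= self.e_prev:
            self.j = min(self.j + 1, N - 1)   # buffer too empty: speed up
        elif e > self.delta and e >= self.e_prev:
            self.j = max(self.j - 1, 0)       # buffer too full: slow down
        self.e_prev = e
        return FREQS[self.j]

class VTCTrigger:
    """Variable triggering: fire once every H enqueue/dequeue events."""
    def __init__(self, h):
        self.h, self.count = h, 0

    def packet_event(self):
        self.count += 1
        if self.count >= self.h:
            self.count = 0
            return True   # time to invoke the controller
        return False
```

In the CTC version, `trigger()` would simply be invoked every Ts seconds; in the VTC version, only when `packet_event()` returns True.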

and the resulting dynamics are derived similarly to the previous SQPP configuration, and the adjustment laws for the core frequencies are cast as follows using the standard NL algorithm:

fp = NL(e1)   fw1 = NL(AUX − Q∗)   fw21 = NL(e31)   fw22 = NL(e32)   fw23 = NL(e33).   (7)

It differs from (6) in the control logic for fw1. Worker W1 feeds three parallel buffers simultaneously; hence, there is no obvious choice for the appropriate error variable. The auxiliary variable AUX, which must take into account the occupancy of all the buffers Q21, Q22, and Q23, can be constructed in three different manners, which are described and discussed in the following.

AVG—Average Value: A first possibility is to consider the average occupancy Qav of the three queues involved, namely, Q21, Q22, and Q23

C. DVFS Controllers for SQPP Configuration

Let us illustrate the controller design for the four-stage SQPP configuration represented in Fig. 2. Let us write down the error variables and the associated error dynamics as follows:

e1 = Q1 − Q∗   e2 = Q2 − Q∗   e31 = Q31 − Q∗   e32 = Q32 − Q∗   e33 = Q33 − Q∗   (4)

ė31(t) = kIw21(t)fw21 − kOc(t)fc
ė32(t) = kIw22(t)fw22 − kOc(t)fc
ė33(t) = kIw23(t)fw23 − kOc(t)fc
ė2(t) = kIw1(t)fw1 − kOw21(t)fw21 − kOw22(t)fw22 − kOw23(t)fw23
ė1(t) = kIP(t)fp − kOw1(t)fw1.   (5)

It is seen that the dynamics (5) of all the error variables are special cases of the dynamics (17) discussed in Appendix A. According to Appendix A, the form of the aforementioned error dynamics justifies the following control policy:

fp = NL(e1)   fw1 = NL(e2)   fw21 = NL(e31)   fw22 = NL(e32)   fw23 = NL(e33).   (6)

Controller tuning entails the choice of the queue set point Q∗ and of the parameters H and Δ for each control block instance. Reasonable tuning values are expressed in terms of a fixed percentage of the queue capacity QM: Q∗ = 50% QM, H = 2–5% QM, Δ = 5–10% QM.

D. DVFS Controllers for PQPP Configuration

Now, we deal with the controller design for the four-stage PQPP configuration represented in Fig. 4. The error variable

AUX = Qav = (Q21 + Q22 + Q23)/3.   (8)

The dynamics of the average queue occupancy error eav = Qav − Q∗ is

ėav(t) = kIw1(t)fw1 − kOw21(t)fw21 − kOw22(t)fw22 − kOw23(t)fw23   (9)

which belongs to the general class (17) studied in Appendix A. Using the suggested control logic (7) and (8), in the steady state, eav will oscillate around zero. During normal system operation, the data flow across the parallel branches takes place almost uniformly; then, the mean value will closely approximate the current occupancy of all the queues.

MAX—Maximal Error: A second possibility is to consider the queue where the deviation error is maximum in modulus

AUX = Q2i   (10)

with Q2i such that

|Q2i − Q∗| ≥ |Q2j − Q∗|   ∀ j ≠ i.   (11)

This logic controls the buffer with the largest magnitude of the occupancy error. MIN—Minimal Occupancy: A third possibility is to consider the minimum between the occupancy levels of the three queues involved AU X = Qmin = min{Q21 , Q22 , Q23 }.

(12)

Using logic (7), (12) controls the queue which is closest to the emptiness condition, which can be considered as a more dangerous situation than the saturation condition because it might cause a violation of the throughput constraint. In all three cases, tuning of Δ and H is analogous to the SQPP configuration.
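As an illustration, the three AUX constructions can be sketched in a few lines of Python. This is a hedged sketch: the function names and the set point value are ours, not the paper's; only the formulas (8), (10)-(11), and (12) come from the text.

```python
# Sketch of the three AUX constructions for the PQPP controller.
# Q_STAR and the function names are illustrative; the formulas follow
# Eqs. (8), (10)-(11), and (12) in the text.

Q_STAR = 50  # queue set point Q*

def aux_avg(queues):
    # AVG: average occupancy of the parallel queues, Eq. (8)
    return sum(queues) / len(queues)

def aux_max(queues, q_star=Q_STAR):
    # MAX: occupancy of the queue with the largest |Q2i - Q*|, Eqs. (10)-(11)
    return max(queues, key=lambda q: abs(q - q_star))

def aux_min(queues):
    # MIN: occupancy of the queue closest to emptiness, Eq. (12)
    return min(queues)
```

Whichever construction is selected, the controller then applies fw1 = NL(AUX − Q∗) as in (7).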

ALIMONDA et al.: FEEDBACK-BASED APPROACH TO DVFS IN DATA-FLOW APPLICATIONS

Fig. 6. MPSoC platform with hardware support for frequency scaling.

V. EXPERIMENTAL EVALUATION

A. Simulation Environment

The benchmark applications used for the experimental comparison and performance evaluation of the DVFS algorithms have been run on a cycle-accurate energy-aware multiprocessor simulation platform (the SystemC-based MPARM platform [26]). Fig. 6 gives a schematic of the simulated architecture. It consists of a configurable number of 32-b ARMv7 processors. Each processor core has its own private memory, and a shared memory is used for interprocessor communication. Synchronization among the cores is provided by hardware semaphores implementing the test-and-set operation. The system interconnect is a shared-bus instance of the STBus interconnect. The virtual platform environment provides power statistics, made available by STMicroelectronics for a 0.13-μm technology, for the ARM cores, caches, memories, and STBus. The feedback controllers run inside the corresponding PEs.

B. Comparative Analysis Overview and Adopted Metrics

We have experimentally evaluated the proposed nonlinear controllers using both synthetic and real applications and considering both the SQPP and PQPP configurations. Furthermore, we tested both the constant-time controller (CTC) and the variable-time controller (VTC). We leave the PID controller out of our comparison because it was extensively studied in [4], which showed that it suffers from difficult parameter settings and an excessive switching rate. The latter phenomenon has, indeed, an important cost in terms of energy consumption. We compared the performance by means of the following metrics: 1) energy consumption; 2) queue content fluctuations; and 3) number of frequency adjustments. To assess the effectiveness of the proposed approach for the implementation of energy-efficient embedded multimedia streaming applications, we compared the energy consumption of the processor cores when using the nonlinear feedback control against the Vertigo [12] dynamic voltage scaling technique, which was discussed in Section I-B.
It should be recalled that CPU speed adjustments are obtained through frequency switchings. A large number of switchings is not desirable because switchings are expensive in both energy and time and may also cause synchronization problems with external


peripherals. The number of frequency adjustments is therefore a critical metric. Reducing the number of frequency adjustments causes fluctuations of the queue contents when the workload changes. However, from an energy viewpoint, queue fluctuations are acceptable as long as their magnitude does not lead to empty- or full-queue conditions. To evaluate the capability of the control law to handle queue fluctuations, we measured the average value of the squared difference between the emptiest queue level and the set point over the whole test. We refer to this metric as the least mean square (LMS) error. It indicates the size of the fluctuations of the queue level and provides a good estimate of the capability of the control policy to maintain the desired performance and QoS level. A well-behaving controller limits the LMS error, because small queue fluctuations indicate a low probability of empty-queue conditions. Indeed, these may cause deadline misses in the last stage of the pipeline (which, in our system, represents the consumer element). The consumer could be, for example, an audio or video codec reading data at a fixed rate.

The experiments are organized as follows. We first show the results of the comparison between the nonlinear feedback policy and the Vertigo local policy for the two case studies of DES and software FM radio. Once the overall effectiveness of our approach has been demonstrated, we concentrate on the characterization of the nonlinear controller using the aforementioned metrics. The overall target is to compare the two versions of the nonlinear controller (CTC and VTC) described in Section IV. To this purpose, we performed a set of experiments for each metric (energy, number of frequency switches, and LMS error) by varying the sample rate of the constant-time controller, with reference to the case study applications described in Section III.
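The LMS-error metric above can be sketched as follows; this is a minimal illustration with invented names, assuming the queue levels are sampled periodically over the test run:

```python
# Sketch of the LMS-error metric: the mean squared deviation between the
# emptiest queue level and the set point over all samples of a test run.
# Function and argument names are illustrative.

def lms_error(queue_traces, q_star):
    """queue_traces: iterable of per-sample tuples, one occupancy per queue."""
    samples = list(queue_traces)
    total = sum((min(levels) - q_star) ** 2 for levels in samples)
    return total / len(samples)
```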
Moreover, we implemented synthetic benchmarks following the same configurations but containing dummy loops used to impose constant and variable workloads. In the first experiment, concerning the overall energy comparison between the nonlinear feedback policy and the local policy, we detail the energy results for each stage (producer and workers) (Fig. 7). Conversely, in all the other experiments, we focus on the core that feeds the parallel stage (the producer), which is where the new feedback control policy has been applied (i.e., the producer implements the SQPP or the PQPP controller). All the other cores implement a control strategy based on a single output queue, which is analogous to the pure-pipeline configuration.

C. System and Controller Parameters and Preliminary Results

In all the tests, the maximal queue occupancy is set as

Qmax = 100. (13)

Following the previously given guidelines, the set point is set as

Q∗ = 50%Qmax = 50. (14)

The Δ thresholding parameter is always set to

Δ = 10%Qmax = 10. (15)


IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 28, NO. 11, NOVEMBER 2009

Fig. 7. Energy comparison between nonlinear controller and Vertigo. (a) Parallel DES. (b) Software FM radio.

Fig. 8. Feedback control strategy applied to synthetic benchmark with (a)–(c) constant and (d)–(f) variable workload. [(a) and (d)] Energy consumed by the producer. [(b) and (e)] LMS error. [(c) and (f)] Number of switchings.

For the Ts and H parameters, different values will be adopted, and the obtained results compared. As previously specified, H will always be selected in the appropriate range

H = 2 ÷ 5%Qmax = 2 ÷ 5. (16)

The energy consumption results obtained using the proposed nonlinear strategy with properly selected tuning parameters and using the Vertigo policy are reported in Fig. 7(a) and (b). The two plots refer to the DES and software FM radio applications described in Section III, and the results are shown for each core. It is apparent that the nonlinear control strategy outperforms the Vertigo DVFS policy. As explained in Section I-B, Vertigo exploits system-call tracing to determine the deadline periods. However, the tasks in this benchmark do not make system calls to release the processor. In the absence of deadline information, Vertigo falls back to a conservative approach: it evaluates the processor utilization over a statically configured time interval and configures the speed for the next interval according to the utilization measured in the previous one.
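Such an interval-based fallback can be sketched as follows. This is a hedged illustration of the general technique, not Vertigo's actual implementation; the frequency table and function names are invented for the example.

```python
# Hedged sketch of an interval-based utilization policy of the kind
# described above: the utilization measured over a fixed window selects
# the speed for the next window. Values and names are illustrative.

FREQS_MHZ = [100, 200, 300, 400]  # hypothetical discrete operating points

def next_frequency(busy_time, window, freqs=FREQS_MHZ):
    """Pick the lowest frequency whose relative speed covers the
    utilization measured over the previous window."""
    util = min(1.0, busy_time / window)  # fraction of the window spent busy
    f_max = freqs[-1]
    for f in freqs:
        if f / f_max >= util:            # enough relative speed for the load
            return f
    return f_max
```

Because such a policy only reacts one window later and has no notion of queue occupancy, it is conservative compared to the queue-level feedback used here.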

D. SQPP Configuration Results

In this set of experiments, we characterize the nonlinear controller in more detail. Concerning the energy results, we kept the Vertigo policy as a reference point. In all cases, the nonlinear control approach provides better energy performance than the Vertigo local DVFS strategy.

1) Synthetic Benchmarks: We implemented a synthetic benchmark containing dummy loops that impose a load that is variable in both time and magnitude on each worker. The variability is obtained by changing the number of loop iterations. We tested two workload scenarios: in the first, we impose a constant workload; in the second, we first let the initial transient die out and then impose a variable workload shaped as a pulse train. For each scenario, we compared both the CTC and VTC implementations. For the CTC version, we show the results as a function of the sample time Ts. Since the VTC version does not depend on the sampling time, its results are constant, and the interpolated curve is parallel to the abscissa axis. As shown in Fig. 8, the VTC shows overall better performance for all the considered metrics. Vertigo energy consumption is, furthermore, higher than the worst-case consumption of the CTC. Analyzing the CTC results in more detail, the energy consumption slightly decreases as the sample time becomes larger because less time and energy are spent in the control routine.

Fig. 9. Feedback control strategy applied to the parallel DES benchmark. (a) Energy consumed by the producer. (b) LMS error. (c) Number of switchings.

2) Parallel DES: In this section, we describe the experiments carried out on the parallel DES benchmark described in Section III. Since we are considering an SQPP architecture, there is a single outgoing queue from the producer stage, from which the three workers take their input data. As done for the synthetic benchmarks, we compare the behavior of the CTC and VTC implementations for different values of their tuning parameters. Fig. 9(a) shows the energy consumed by the producer stage. The VTC is more efficient than the CTC, whose energy consumption increases with the sample time. As far as the queue fluctuations are concerned [Fig. 9(b)], the square error increases with the sample time, as expected. However, this trend changes for sample times larger than 300 μs for this benchmark, because the queue then saturates in most cases and positive errors are not taken into account by our metric. The VTC always provides better queue control, independently of the sample time and the benchmark characteristics, and it does so with a reduced number of frequency switches, as outlined in Fig. 9(c).

Fig. 10. Feedback control strategy applied to synthetic benchmark with (a)–(c) constant and (d)–(f) variable workload. [(a) and (d)] Energy consumed by the producer. [(b) and (e)] LMS error. [(c) and (f)] Number of switchings.

As a final remark, it is apparent from Fig. 8(a) and (d) and Fig. 9 that both the CTC and VTC provide better energy performance than the Vertigo DVFS control strategy.

E. PQPP Configuration Results

In this section, we show the experiments concerning the performance evaluation of the control strategy applied to the PQPP configuration. In this case, together with the comparison between the CTC and VTC implementations, we have also compared three controller versions reflecting the three options (MIN, AVG, and MAX) for the choice of the AUX signal described in Section IV-D. As such, there are six different configurations to be tested, denoted CTC-MIN, CTC-AVG, CTC-MAX, VTC-MIN, VTC-AVG, and VTC-MAX.

1) Synthetic Benchmarks: The set of experiments shown in Fig. 10 has been carried out using synthetic benchmarks built as in the previous case. The occupancy levels of the queues and the time evolution of the producer and worker frequencies are shown in Fig. 11 for the three instances of the VTC version, namely, the VTC-MIN, VTC-AVG, and VTC-MAX implementations, all with H = 3.

Fig. 11. Queue/frequency behavior: frequencies are in the upper part and queue levels in the lower part of each plot. (a) MAX. (b) AVG. (c) MIN.

In Fig. 10(a) and (d), we report the energy consumption comparison among the six controller configurations. The AVG configurations are the most efficient but, as shown in Fig. 10(b) and (e), they have some difficulties in controlling the queue depletion. By looking at Fig. 10(b), we can conclude that the MIN technique is the most reliable in avoiding queue depletion while providing almost the same number of switchings, as shown in Fig. 10(c), and satisfactory energy results [Fig. 10(a)]. The MAX configuration lies between the two previous approaches, both in terms of energy [Fig. 10(a)] and risk of depletion [Fig. 10(b)]. However, it does not seem a good choice because it leads to a large number of frequency switches [Fig. 10(c)].

Overall, the variable-time version provides lower energy consumption and, for each configuration (MIN, AVG, and MAX), fewer frequency switches and higher reliability with respect to queue depletion. Fig. 10(b) highlights that the CTC version is sensitive to the sampling-time parameter, which strongly affects the queue depletion risk. Moreover, a large sample-time interval leads to higher energy consumption, as shown in Fig. 10(a). As such, even though a wider sample interval decreases the number of switchings [Fig. 10(c)], this comes at the price of reduced energy efficiency and worse queue control capability.

The different functioning principles of the three versions of the PQPP VTC, which follow the purposes described in Section IV-D, clearly appear from the time evolutions of the queue occupancy and processor frequency shown in Fig. 11. The VTC-MAX controller shows the best control of the queue occupancy level: excluding the initial transient, the number of elements in the queue is always kept in the range 100 ÷ 140. The VTC-MIN controller avoids depletion below 100 elements but lets the queue occupancy exceed 140 elements. Finally, the VTC-AVG controller is the most "permissive," leaving the number of elements in the queue free to go above 140 and below 100. A symmetric pattern can be observed in the behavior of the processor frequency.

Fig. 12. Feedback control strategy applied to the software FM radio benchmark. (a) Energy consumed by the producer. (b) LMS error. (c) Number of switchings.

2) Software FM Radio: In this set of experiments, we study a realistic software FM radio benchmark. We focus on the demodulator stage because it has multiple outgoing queues. The results are shown in Fig. 12(a). Compared to the results obtained for the synthetic benchmarks, the AVG configuration is no longer the most energy efficient. This is because this application is characterized by an almost balanced throughput in the parallel stages. Since the queue occupancy levels are close to each other, the smoothing effect provided by the AVG configuration is no longer needed; on the other hand, this technique pays the price of a greater computational overhead, leading to larger energy consumption for the execution of the control algorithm itself with respect to the other configurations.

There is a large increase in energy at 300 μs for the constant-time-interval controller version. For sample times larger than 300 μs, the controller is too slow, and it spends a large amount of energy to keep the queue controlled. This is confirmed by Fig. 12(b), where the square error with respect to the set point is plotted. This behavior is application dependent, making it difficult to tune the controller efficiently. This problem does not arise with the VTC version, which makes the VTC much more stable and reliable for this application. In Fig. 12(b) and (c), we report the results for the other two metrics, namely, the queue control capability and the number of frequency switches. In both cases, the variable-time controller is more efficient and reliable. Also for the PQPP configurations, the CTC and VTC control strategies outperform the Vertigo DVFS in terms of energy dissipation, as shown in Fig. 10(a) and (d) and Fig. 12.


TABLE I SUMMARY OF MISO CONTROLLER EVALUATION

The conclusions arising from the comparison results are summarized in Table I.

F. Switching Overhead Impact

In this section, we give a quantitative evaluation of the impact of the switching overhead on the selected policies. We first estimate the switching overhead for current embedded systems and then compare it with the switching rate required by our policies. The reasons why a high number of switchings is not desirable are twofold. First, the actual time required for frequency/voltage setting depends on the frequency (and on the platform). If only the value of the prescaler in front of the core needs to be reprogrammed, the overhead is relatively small, on the order of a few microseconds (see the ARM MPCore [8] as an example). However, this limits the admissible frequencies to a restricted number of values (the submultiples of the main platform clock). A more general clock setting can be obtained by phase-locked loop (PLL) reprogramming, which, in some industrial platforms, may require from hundreds of microseconds up to 1 ms [13]. Second, in certain system conditions, the overhead may be even larger and unpredictable because of the need for synchronization with other on-chip peripherals.

The sample rate of the controller policies is between a hundred microseconds and 1 ms, which is the typical range for a time-slice interval of an embedded system. As such, frequency adjustments cannot be performed at every sampling instant, and a controller that keeps the interval between successive switchings one or two orders of magnitude larger than the switching overhead is desirable.

TABLE II SWITCHING RATE EVALUATION

We quantified the switching rate in Table II in terms of the average number of switchings per second. The results for the different benchmarks and control policies highlight that the higher rate is achieved by the constant-time controllers. However, the minimum interval between two successive switches is larger than 1 s, which is more than one order of magnitude higher than the worst-case switching overhead (1 ms) considering PLL reprogramming. Finally, we quantified the overhead of the control algorithm in terms of instructions (and time for the selected platform). This overhead (around 200 CPU cycles = 200 ns at 1 GHz) is negligible with respect to the switching overhead.

VI. CONCLUSION

In this paper, we have presented several nonlinear control approaches to feedback-based DVFS for mixed pipeline/parallel architectures. This framework captures a large class of multimedia applications. Various implementations of the controller have been compared, and the overall efficiency of the proposed approach has been tested through a cycle-accurate virtual multiprocessor platform on both the software FM radio and parallel DES streaming benchmarks. It follows from the experimental results that the nonlinear controller with variable triggering (VTC) shows the best performance, not only in terms of quantitative energy savings but also in terms of ease of tuning.

APPENDIX A
NONLINEAR CONTROL ALGORITHM: SKETCH OF THE STABILITY PROOF

Consider the following controlled differential equation:

ė = γ(t)u − ϕ(t) (17)

where e ∈ R represents a measurable output error variable, u ∈ R is an adjustable control quantity, and ϕ(t) and γ(t) are uncertain scalar functions. The following assumptions are made.
1) The real-valued control variable u can only assume a discrete set of nonnegative values u1, u2, . . . , uN, with u1 = 0 and uj+1 > uj. The integer index i (1 ≤ i ≤ N) represents the actual value of the control variable u, i.e., u = ui.
2) The function ϕ(t) (the "drift term") is positive and bounded, i.e., there is a constant Φ such that

0 < ϕ(t) ≤ Φ. (18)

3) The function γ(t) (the "control gain") is strictly positive and bounded, i.e., there are two constants γm and γM such that

0 < γm < γ(t) < γM. (19)

It is worth noting that (17)–(19) can represent the error dynamics of a generic FIFO buffer with a single processing unit (with adjustable frequency u) that puts packets into the queue and a set of N processors (N ≥ 1) that take packets from the queue. From this viewpoint, e is the deviation of the (real-valued) queue occupancy from the desired set point (e = Q − Q∗), whereas γ(t)u and ϕ(t) model the packets put into and taken from the queue, respectively.



All the error dynamics developed throughout this paper are special cases of the dynamics (17) with appropriate choices for the actual control variable u and for the functions ϕ(t) and γ(t). When the processor works at full speed (i.e., u = uN), the queue occupancy should increase. Mathematically, we should have that

γm uN > Φ. (20)

When condition (20) is not fulfilled, the control task of regulating the queue occupancy by adjusting u might be unfeasible: if the condition were not met, then, when packets are taken from the queue at the largest throughput ϕ(t) = Φ, the queue would certainly become empty, regardless of any choice for u.

Let e[k] denote the current value of e, and let e[k − 1] be the value of e at the closest time instant in the past at which it was stored in the controller memory. The basic nonlinear algorithm can be written as follows.

Nonlinear Controller:
1) At every trigger instant do:
2) If [(e[k] < −Δ) AND (e[k] ≤ e[k − 1])], then increase the frequency (i := min(i + 1, N)).
3) If [(e[k] > Δ) AND (e[k] ≥ e[k − 1])], then decrease the frequency (i := max(i − 1, 1)).

The deviation variable e[k] (the difference between the current and desired queue occupancy) should be driven near zero for sufficiently large k. It is simple to show that the aforementioned algorithm guarantees bounded oscillations of e[k] around zero, i.e., there are E and T such that

|e(t)| ≤ E ∀t ≥ T. (21)
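For concreteness, the controller can be sketched in Python. This is a hedged sketch: the class wrapper, 0-based frequency indexing, and initial state are ours; only the two threshold rules of steps 2) and 3) come from the algorithm above.

```python
# Sketch of the nonlinear controller of Appendix A. Only the two
# threshold rules come from the text; the wrapper and 0-based indexing
# are illustrative.

class NLController:
    def __init__(self, freqs, delta):
        self.freqs = freqs        # admissible frequencies u1 < u2 < ... < uN
        self.delta = delta        # half-width of the threshold band (Δ)
        self.i = len(freqs) - 1   # start at full speed
        self.e_prev = 0.0

    def step(self, q, q_star):
        """Run once per trigger instant (CTC: periodic; VTC: every H
        packets put into or taken from the queue). Returns the frequency."""
        e = q - q_star
        if e < -self.delta and e <= self.e_prev:
            # queue too empty and depleting: speed up
            self.i = min(self.i + 1, len(self.freqs) - 1)
        elif e > self.delta and e >= self.e_prev:
            # queue too full and filling: slow down
            self.i = max(self.i - 1, 0)
        self.e_prev = e
        return self.freqs[self.i]
```

Inside the band |e| ≤ Δ the frequency is left unchanged, which yields the bounded steady-state oscillations expressed by (21).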

Let us first discuss what the algorithm does at the trigger instants. Essentially, it processes two pieces of information, namely, the actual value of e[k] and the sign of the difference e[k] − e[k − 1]. Concerning the actual value of e[k], it is checked whether it exceeds the positive threshold Δ or lies below the negative threshold −Δ. Depending on these three Boolean conditions, the controller decides whether to increase, decrease, or leave unchanged the processor frequency. Implicitly, if neither of the conditions at steps 2) and 3) is satisfied, the frequency is left unchanged.

Let us analyze the queue behavior under the action of the aforementioned controller. According to step 2), the frequency is increased when e[k] < −Δ and, at the same time, e[k] ≤ e[k − 1]. The first condition indicates that the current value of Q is too small and must be increased. The second condition means that the queue is depleting or that its content is remaining constant. It is then sensible to command an increase in the processor frequency because the current depleting "trend" must be reversed. If the latter condition were not satisfied, i.e., if e[k] > e[k − 1] held, there would clearly be no need to increase the frequency since the queue is already filling up. Because of the feasibility condition (20), after a finite number of iterations, condition e[k] > e[k − 1] will certainly be restored, and the queue occupancy will then be forced to converge toward the desired value. Analogously, according to step 3), the frequency is decreased when e[k] > Δ (i.e., the queue content needs to decrease) and, at the same time, e[k] ≥ e[k − 1] (i.e., the queue is filling or remaining constant).

Whenever e[k] is detected to lie outside the threshold band |e[k]| ≤ Δ, proper control actions are taken to steer the error variable back inside the band. Because the enabling conditions at steps 2) and 3) are only satisfied during short transients, stable controlled fluctuations of e about zero are observed in the steady state. Intuitively, the amplitude of the steady-state oscillations is affected by Δ: the larger Δ, the larger the oscillation amplitude.

Let us discuss the possibilities for generating the trigger instants. As discussed in Section IV-B, we suggest two alternatives: the CTC and the VTC. In the CTC version, the triggering instants are equally spaced in time. In the VTC version, a controller trigger is generated every time H packets are put into, or taken from, the queue. The suggested VTC logic leads to a "throughput-dependent" generation of the triggering instants, which provides high reactiveness in the presence of high data throughput and slow trigger generation otherwise. H can be chosen as a certain percentage of the overall queue capacity Qmax, e.g., 2 ÷ 5%, and should be less than Δ. This adaptive sampling rate has no effect on the closed-loop stability; it only affects the magnitude of the steady-state error, in a similar manner as Δ.

REFERENCES

[1] A. Alimonda, A. Acquaviva, S. Carta, and A. Pisano, "A control theoretic approach to run-time energy optimization of pipelined elaboration in MPSoCs," in Proc. DATE, 2006, pp. 876–877. [2] A. Alimonda, A. Acquaviva, S. Carta, and A. Pisano, "Non-linear feedback control for energy efficient on-chip streaming computation," in Proc. Symp. IES, 2006, pp. 1–8.
[3] K. Asanovic, R. Bodik, B. C. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape of parallel computing research: A view from Berkeley,” Univ. California, Berkeley, Berkeley, CA, Tech. Rep. UCB/EECS-2006-183, Dec. 18, 2006. [4] S. Carta, A. Alimonda, A. Acquaviva, A. Pisano, and L. Benini, “A control theoretic approach to energy-efficient pipelined computation in MPSoCs,” ACM Trans. Embedded Comput. Syst. (TECS), vol. 6, no. 4, p. 27, Sep. 2007. [5] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. M. Al-Hashimi, “Simultaneous communication and processor voltage scaling for dynamic and leakage energy reduction in time-constrained systems,” in Proc. ICCAD, 2004, pp. 362–369. [6] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. M. Al-Hashimi, “Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems,” in Proc. DATE, 2004, pp. 518–523. [7] ARM Intelligent Energy Manager, Dynamic Power Control for Portable Devices, 2005. [Online]. Available: http://infocenter.arm.com/ help/topic/com.arm.doc.dai0172b/ [8] ARM Ltd., ARM11 MPCore. [Online]. Available: http://www.arm.com/ products/CPUs/ARM11MPCoreMultiprocessor.html [9] Cradle Technologies, Cradle’s CT3000 Family. [Online]. Available: http://www.cradle.com/multi_core_dsp.html [10] “Federal information processing standards publication 46-2,” Announcing the Standard for DATA ENCRYPTION STANDARD (DES), 1993. [Online]. Available: http://www.itl.nist.gov/fipspubs/fip46-2.htm [11] T. Durnan and C. Pollabauer, “DDVS: Distributed dynamic voltage scaling,” in Proc. SOSP, 2005, pp. 1–8. [12] K. Flautner and T.N. Mudge, “Vertigo: Automatic performance-setting for Linux,” in Proc. OSDI, 2002, pp. 105–116.

ALIMONDA et al.: FEEDBACK-BASED APPROACH TO DVFS IN DATA-FLOW APPLICATIONS

[13] Freescale, i.MX31 Multimedia Applications Processors, Austin, TX, 2003. [Online]. Available: http://www.freescale.com/imx31 [14] S. V. Gheorghita, T. Basten, and H. Corporaal, “Application scenarios in streaming-oriented embedded system design,” in Proc. Int. SoCC, 2006, pp. 175–178. [15] GStreamer: Open Source Multimedia Framework. [Online]. Available: http://gstreamer.freedesktop.org/ [16] C. Im, H. Kim, and S. Ha, “Dynamic voltage scaling technique for lowpower multimedia applications using buffers,” in Proc. ISLPED, 2001, pp. 34–39. [17] Intel, Intel Multi-Core Processors. [Online]. Available: http://www.intel. com/research/platform/ terascale/ts_microprocessors.htm [18] D. Lackey, P. Zuchowski, T. R. Bedhar, D. W. Stout, S. Gould, and J. Cohn, “Managing power and performance for systems-on-chip designs using voltage islands,” in Proc. ICCAD, 2002, pp. 195–202. [19] Z. Lu, J. Lach, and M. Stan, “Reducing multimedia decode power using feedback control,” in Proc. ICCD, 2003, pp. 489–496. [20] Y. Lu, L. Benini, and G. De Micheli, “Dynamic frequency scaling with buffer insertion for mixed workloads,” IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 21, no. 11, pp. 1284–1305, Nov. 2002. [21] Z. Lu, J. Hein, M. Humphrey, M. Stan, J. Lach, and K. Skadron, “Control theoretic dynamic frequency and voltage scaling for multimedia workloads,” in Proc. Int. Conf. CASES, 2002, pp. 156–163. [22] Khronos Group-Media Authoring and Acceleration, OpenMAX—The Standard for Media Library Portability. [Online]. Available: http://www. khronos.org/openmax/ [23] P. Juang, Q. Wu, L.-S. Peh, M. Martonosi, and D. W. Clark, “Coordinated, distributed, formal energy management of chip multiprocessors,” in Proc. ISLPED, 2005, pp. 127–130. [24] Z. Ma and F. Catthoor, “Scalable performance-energy trade-off exploration of embedded real-time systems on multiprocessor platforms,” in Proc. DATE, 2006, pp. 1073–1078. [25] R. Teodorescu and J. 
Torrellas, "Variation-aware application scheduling and power management for chip multiprocessors," SIGARCH Comput. Archit. News, vol. 36, no. 3, pp. 363–374, 2008. [26] MPARM multiprocessor simulation environment. [Online]. Available: http://www-micrel.deis.unibo.it/sitonew/research/mparm.html [27] N. Pazos, A. Maxiaguine, P. Ienne, and Y. Leblebici, "Parallel modelling paradigm in multimedia applications: Mapping and scheduling onto a multi-processor system-on-chip platform," in Proc. Int. GSP Conf., 2004. [28] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa, "The design and implementation of a first-generation CELL processor," in Proc. IEEE/ACM ISSCC, 2005, pp. 184–186. [29] M. Ruggiero, A. Acquaviva, D. Bertozzi, and L. Benini, "Application-specific power-aware workload allocation for voltage scalable MPSoC platforms," in Proc. ICCD, 2005, pp. 87–93. [30] M. Ruggiero, G. Pari, A. Guerri, D. Bertozzi, M. Milano, L. Benini, and A. Andrei, "A cooperative, accurate solving framework for optimal allocation, scheduling and frequency selection on energy-efficient MPSoCs," in Proc. Int. SoCC, 2006, pp. 1–4. [31] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere, "System design using Kahn process networks: The Compaan/Laura approach," in Proc. DATE, Mar. 2004, pp. 1530–1591. [32] W. Thies, M. I. Gordon, M. Karczmarek, J. Lin, D. Maze, R. M. Rabbah, and S. Amarasinghe, "Language and compilers design for streaming applications," in Proc. IEEE IPDPS, 2004, p. 201. [33] W. M. Johnston, J. R. P. Hanna, and R. J. Millar, "Advances in dataflow programming languages," ACM Comput. Surv., vol. 36, no. 1, pp. 1–34, Mar. 2004. [34] Q. Wu, P. Juang, M. Martonosi, and W. Clark, "Formal online methods for voltage/frequency control in multiple clock domain microprocessors," in Proc.
Int. Conf. ASPLOS, 2004, pp. 248–259. [35] Q. Wu, P. Juang, M. Martonosi, L.-S. Peh, and D. W. Clark, “Formal control techniques for power-performance management,” IEEE Micro, vol. 25, no. 5, pp. 52–62, Sep./Oct. 2005. [36] P. Van der Wolf, P. Lieverse, M. Goel, D. La Hei, and K. Vissers, “An MPEG-2 decoder case study as a driver for a system level design methodology,” in Proc. Int. Symp. Hardw.-Softw. Codesign CODES, 1999, pp. 33–37. [37] D. Zhu, R. Melhem, and B. Childers, “Scheduling with dynamic voltage/speed adjustment using slack reclamation in multiprocessor real-time systems,” IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 7, pp. 686–700, Jul. 2003.


Andrea Alimonda received the M.S. degree (summa cum laude) in electronics engineering and the Ph.D. degree in mathematics from the University of Cagliari, Cagliari, Italy, in 1998 and 2008, respectively. His Ph.D. thesis was on “Dynamic Resource Optimization for Embedded Systems, a Software Approach.” In 1998, he founded a private company and worked on the software design of electronic systems. He also collaborated with the Sardinian district to promote technology transfer to small and medium enterprises. He is currently with the Department of Mathematics and Computer Science, University of Cagliari. His research interests focus on algorithms and infrastructures for dynamic resource management in multiprocessor systems-on-chips and, more recently, on new web technologies for media content retrieval and recommendation systems.

Salvatore Carta received the M.S. degree (summa cum laude) in electronic engineering and the Ph.D. degree in electronics and computer science from the University of Cagliari, Cagliari, Italy, in 1997 and 2003, respectively. Since 2005, he has been an Assistant Professor in computer science with the University of Cagliari. His research interests focus mainly on architectures, software, and tools for embedded and portable computing, with particular emphasis on the following: 1) operating systems, middleware, and software for multiprocessor systems-on-chips; 2) networks-on-chip; and 3) reconfigurable computing. He is the author of more than 20 papers in these fields in the last three years. In the last two years, he has also been working in the fields of recommendation systems and social networks and has authored several papers in these fields.

Andrea Acquaviva received the Ph.D. degree in electrical engineering from the University of Bologna, Bologna, Italy, in 2003. In 2003, he became an Assistant Professor with the Computer Science Department, Università di Urbino, Urbino, Italy. From 2005 to 2007, he was a Visiting Researcher with the Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland. In 2006, he joined the Department of Computer Science, Università di Verona, Verona, Italy. He was involved in several industrial research projects with Hewlett Packard and Freescale Semiconductor (U.K.). Since 2008, he has been with the Department of Computer Engineering and Automation, Politecnico di Torino, Torino, Italy. He is involved in a number of European projects concerning software and architectures for multicore platforms and low-power distributed systems. His research between 2000 and 2008 yielded more than 50 papers in international journals and peer-reviewed international conference proceedings, as well as six book chapters. His research interests focus mainly on parallel computing for distributed embedded systems, such as multicore platforms and sensor networks, and on the simulation and analysis of biological systems using parallel architectures.



Alessandro Pisano (M’07) was born in Sassari, Italy, in 1972. He received the M.S. (Laurea) degree in electronic engineering and the Ph.D. degree in electronics and computer science from the Department of Electrical and Electronic Engineering (DIEE), University of Cagliari, Cagliari, Italy, in 1997 and 2000, respectively. He is currently an Assistant Professor with DIEE, University of Cagliari. He is a Coeditor of the book “Modern Sliding Mode Control Theory—New Perspectives and Applications” (Springer Lecture Notes in Control and Information Sciences, 2008) and has coauthored over 30 publications in international archival journals and over 50 peer-reviewed conference papers. He is the holder of two patents in the fields of underwater robotics and industrial control systems. His current research interests include the theory and application of nonlinear and robust control, with special emphasis on sliding-mode control design and implementation. Dr. Pisano is a Professional Engineer registered in Cagliari, Italy. Since 2008, he has been an Associate Editor of the Conference Editorial Board, IEEE Control Systems Society, and of the Asian Journal of Control.

Luca Benini received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 1997. He is a Full Professor with the Department of Electrical Engineering and Computer Science (DEIS), University of Bologna, Bologna, Italy. He also holds a visiting faculty position with the Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland, and a Consulting Research Professor position with the Belgian Interuniversity MicroElectronics Centre (IMEC). He has published more than 400 papers in peer-reviewed international journals and conferences, four books, and several book chapters. His research interests are in the design of system-on-chip platforms for embedded applications. He is also active in the area of energy-efficient smart sensors and sensor networks, including biosensors and related data-mining challenges. Dr. Benini is a member of the steering board of the ARTEMISIA European Association on Advanced Research and Technology for Embedded Intelligence and Systems. He is an Associate Editor of several international journals, including the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, the ACM Journal on Emerging Technologies in Computing Systems, and the ACM Transactions on Embedded Computing Systems. He has been the General Chair and the Program Chair of the Design Automation and Test in Europe Conference. He has been a member of the technical program committee and organizing committee of several conferences, including the Design Automation Conference, the International Symposium on Low Power Design, and the Symposium on Hardware-Software Codesign.
