System-Level Application-Aware Dynamic Power Management in Adaptive Pipelined MPSoCs for Multimedia

Haris Javaid†, Muhammad Shafique‡, Jörg Henkel‡, Sri Parameswaran†
† School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
‡ Chair for Embedded Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany
{harisj, sridevan}@cse.unsw.edu.au, {muhammad.shafique, henkel}@kit.edu

ABSTRACT
System-level dynamic power management (DPM) schemes in Multiprocessor Systems on Chips (MPSoCs) exploit the idleness of processors to reduce energy consumption by putting idle processors into low-power states. In the presence of multiple low-power states, the challenge is to predict the duration of an idle period with high accuracy so that the most beneficial power state can be selected for the idle processor. In this work, we propose a novel dynamic power management scheme for adaptive pipelined MPSoCs, suitable for multimedia applications. We leverage application knowledge in the form of future workload prediction to forecast the duration of idle periods. The predicted duration is then used to select an appropriate power state for the idle processor. We propose five heuristics as part of the DPM and compare their effectiveness using an MPSoC implementation of the H.264 video encoder supporting HD720p at 30 fps. The results show that one of the application prediction based heuristics (MAMAPBH) predicted the most beneficial power states for idle processors with less than 3% error when compared to an optimal solution. In terms of energy savings, MAMAPBH was always within 1% of the energy savings of the optimal solution. When compared with a naive approach (where only one of the possible power states is used for all the idle processors), MAMAPBH achieved up to 40% more energy savings with only 0.5% degradation in throughput. These results signify the importance of leveraging application knowledge at the system level for dynamic power management schemes.

Table 1: Typical power states of a processor. The values of power consumption, transition energy and wake-up latency are normalized, and are inferred from [4].
Power State | Description  | Power Consumption | Transition Energy | Wake-up Latency
0           | Active       | 1                 | 0                 | 0
1           | CG           | 0.4               | 0.01              | 0.01
2           | Partially PG | 0.1               | 0.4               | 0.6
3           | Fully PG     | 0.01              | 1                 | 1

1. INTRODUCTION
Reducing the power consumption of Multiprocessor Systems on Chips (MPSoCs) in portable devices slows battery depletion and lowers die temperatures, which improves performance and reliability. Several system-level dynamic power management (DPM) schemes have been proposed to exploit the idleness of processors at run-time for energy reduction by putting idle processors into low-power states [1, 2]. These schemes decide "when" and "which" power state should be selected for a processor to reduce the energy consumption of the system. The decision is a challenging one due to the time and energy overheads involved in a transition from one power state to another. Table 1 shows four typical power states available for a processor (where CG and PG stand for clock-gated and power-gated respectively). The values illustrate that Power State 3 (PS3) will result in the most energy saving; however, the amount saved depends on how long the processor remains in PS3, and the saving should amortize the energy overhead of the transition. The challenge is to predict, with high accuracy in the presence of a widely varying workload, the amount of idleness for a processor so that it can be put into the most beneficial power state.
Most dynamic power management schemes are categorized as either predictive or stochastic [1]. Predictive techniques typically exploit the temporal correlation between the past history of the workload and its near future to predict upcoming workloads. Stochastic techniques, on the other hand, model the workload behavior as a controlled Markov process, and then find the optimal power management scheme based on the model. Predictive techniques suffer when the workload varies widely [1], while stochastic approaches suffer from inaccuracies in the workload model and the complexity involved in solving the optimization problem [2]. These issues primarily limit the use of both predictive and stochastic schemes to systems where either the workload is very regular or the workload model is known a priori. Some advanced history based heuristics and stochastic schemes have been shown to predict with high accuracy under varying workloads; however, their computational complexity severely limits their use [2] and may make them unsuitable for fine-grained run-time management (required by multimedia applications to avoid throughput degradation). Hence, Liu et al. [3] recently proposed the use of application knowledge for efficient dynamic power management, because the application by far knows (or may know) the most about its future workload. Their experiments illustrated application-aware power management outperforming OS-level and hardware-level schemes, and it is this fact that has inspired our work. The work in [3] exploited only limited application knowledge, such as the size and type of the frames (algorithmic properties). In this work, we leverage more diverse application knowledge (algorithmic and video data properties) for dynamic power management in pipelined MPSoCs targeting multimedia applications.
Multimedia applications are typically characterized by several kernels which are executed repeatedly on the incoming data stream, favoring their implementation on pipelined MPSoCs [5, 6]. A pipelined MPSoC is a system where processors are connected in a pipeline configuration [5, 6]. It is divided into several stages where each stage contains one or more processors. These processors are connected via FIFOs and execute different sub-tasks of a multimedia application. Hence, the incoming data streams through the stages of a pipelined MPSoC before being written out by the last stage. Recently, an adaptive pipelined MPSoC architecture [7] was proposed to cope with the run-time variations in the workload of multimedia applications. In an adaptive pipelined MPSoC, each stage with run-time varying workload is implemented with Main Processors (MPs) and Auxiliary Processors (APs). An MP uses an AP only when required, and renders it idle when the workload is low. The technique of [7] exploited the idleness of APs to reduce the energy consumption of the pipelined MPSoC by either only clock-gating (CG) or only power-gating (PG) the idle APs. Let us examine the limitations of such an approach through a case study of motion estimation in an H.264 video encoder.
978-1-4577-1400-9/11/$26.00 ©2011 IEEE
1.1 Motivational Example
Motion estimation is one of the most computationally intensive sub-tasks in the H.264 encoder, and is performed on each
Figure 1: Activity of one of the APs in the motion estimation stage of the H.264 video encoder (1 = active, 0 = idle, plotted against macroblocks; the AP is transitioned to PS1, PS2 or PS3 depending on the length of each idle period)
MacroBlock (MB) of the incoming frame, where the Sum of Absolute Differences (SAD) is used to compare the current MB with the reference MBs to find the best possible match. The number of SADs that need to be computed for an MB depends on its texture and the motion contained in it. An MB containing fast moving objects will require more SADs than an MB of slow moving objects. Consider an adaptive pipelined MPSoC where the motion estimation stage is implemented with one MP and 16 APs (designed for HD720p @ 30 fps). These 16 APs are not active at all times, but are used only when the workload is high. Figure 1 shows the activity of one of the APs in the motion estimation stage, where 1 and 0 mean the AP is active and idle respectively. The figure shows that the number of consecutive idle iterations (idle periods) of the AP varies significantly at run-time, where an iteration refers to the processing of one MB. Power-gating will not be beneficial during short idle periods due to its relatively large wake-up overhead, while clock-gating will not be beneficial during long idle periods as it only saves dynamic power. Hence, both CG and PG from [7] fail to exploit the full potential of idle periods because they do not evaluate the suitability of clock- and power-gating depending upon the duration of an idle period. A dynamic power management scheme with multiple power states, on the other hand, provides a fine-grained power reduction knob, as multiple power states [4] trade off wake-up latency and transition energy against the possible energy savings in the system. For example, Figure 1 illustrates that the AP is transitioned to different power states (PS1, PS2 and PS3 from Table 1) depending on the number of idle iterations, instead of always being power-gated or clock-gated, which results in more energy savings. The challenge for such a dynamic power management scheme is to predict the duration of an upcoming idle period so that the most beneficial power state can be selected.
Figure 2: Adaptive Pipelined MPSoC's architecture. Stages S1 - S5 contain MPs and APs (ASIPs with local memories) connected through FIFO buffers; each MP's dynamic power manager keeps a list of idle APs, decides "when" and "how many" APs will be idle, and decides the power states for all the idle APs.
2. RELATED WORK
Pipelined MPSoCs have emerged as a viable platform for high throughput implementation of multimedia applications [6, 9, 10, 11, 12, 13, 7]. The works in [9, 10, 11] focus on balancing the pipelined MPSoC at design-time, while [6, 12, 13] introduced adaptability into pipelined MPSoCs to cope with the run-time varying workloads of multimedia applications. Guo et al. [13] proposed a dynamic voltage scaling approach to reduce the voltage of processors with low workload, while [6, 12] showed the application of Dynamic Voltage and Frequency Scaling (DVFS) to pipelined MPSoCs. All these works monitored the occupancy level of the queues to determine when to increase or decrease the voltage-frequency levels of a processor. Practically, the provision of DVFS circuitry for MPSoCs with more than two processors is very expensive [14]. Furthermore, the large overhead of DVFS control circuitry limits its use to systems requiring only coarse-grained run-time management [15]. The shrinkage of the dynamic range of voltage-frequency operating points due to downward scaling of the supply voltage has also limited the use of DVFS, and has given rise to the use of dynamic power management with multiple power states. Our work uses the architecture proposed in [7], which illustrated that an adaptive pipelined MPSoC with either power-gating or clock-gating provides energy savings over design-time balanced pipelined MPSoCs [9, 10]. In contrast, our work uses multiple power states for the APs in conjunction with an application-aware dynamic power management scheme instead of only power- or clock-gating the idle APs. We leverage video data and algorithmic properties of a multimedia application, in contrast to [3] where only algorithmic properties were used. Furthermore, we use application knowledge in an adaptive multiprocessor system instead of the single processor system used in [3].
To the best of our knowledge, there does not exist any work on application-aware dynamic power management of adaptive pipelined MPSoCs with multiple power states for the processors.
1.2 Idea Overview and Novel Contribution
We propose a novel dynamic power management (DPM) scheme for adaptive pipelined MPSoCs considering multiple power states for the APs. To this end, we first propose an analytical analysis to calculate the minimum number of idle iterations for an AP to benefit from a given power state. Second, we propose five heuristics as part of the DPM which decide at run-time the most beneficial power state for an AP. These heuristics attempt to predict the correct number of idle iterations (the length of the idle period) for an AP using different methods. One of the heuristics uses history information, while the others use workload prediction directly from the application. Hence, the novelty of our DPM arises from the fact that we leverage application knowledge in the form of workload prediction to forecast the number of idle iterations for an AP. Based on the predicted idle iterations, the heuristics decide the power state for each of the idle APs. Finally, we demonstrate the applicability of the proposed DPM scheme through a case study on the H.264 video encoder supporting HD720p at 30 fps (implemented using a commercial design environment from Tensilica [8]) to compare the effectiveness of the five heuristics with CG and PG from [7]. In summary, the contribution of our work is two-fold:
∙ an analytical analysis to compute the minimum number of idle iterations for each power state (a given power state will be energy-wise beneficial if an AP stays in that power state for at least the minimum number of idle iterations of that power state); and,
∙ five different heuristics as part of a dynamic power management scheme for adaptive pipelined MPSoCs (these heuristics use either the history information or the workload prediction information from the application to predict the number of idle iterations for an AP at a given iteration; the predicted number of idle iterations is used to decide the power state of the AP).

3. ADAPTIVE PIPELINED MPSOC
Figure 2 shows a typical adaptive pipelined MPSoC, comprised of various pipeline stages. Each processor is an ASIP with separate instruction and data caches, connected to its local memory. The use of ASIPs can significantly increase the throughput of a pipelined MPSoC [5]. In addition to local memories, shared memories can be used where common data need to be shared among different stages. A typical multimedia application contains several kernels, and hence can be partitioned into sub-tasks. These sub-tasks are then mapped onto the ASIPs, where the special instructions for each ASIP are designed according to the sub-task(s) mapped onto it. Note that the partitioning and mapping of a multimedia application is not the focus of this paper. In a pipelined MPSoC, an iteration of a processor refers to the processing of one MB, and an iteration is considered idle if the processor is inactive during it.
The adaptability in an adaptive pipelined MPSoC is due to the coexistence of Main Processors (MPs) and Auxiliary Processors (APs). An MP is always active while an AP is only active when the workload increases beyond the capacities of its MP. Thus, the stages with significant run-time variation in workloads are implemented with a combination of MPs and APs, where APs are connected to their corresponding MP using FIFOs. Stages with more or less constant workload do not need APs and are implemented with MPs only, for example, stage S3 in Figure 2. Such an architecture provides an efficient implementation platform for advanced multimedia applications which contain stages with both constant and run-time varying workloads.
4. APPLICATION-AWARE DYNAMIC POWER MANAGEMENT
The architecture of the adaptive pipelined MPSoC allows for both a centralized and a distributed dynamic power management scheme. We propose a distributed dynamic power management (DPM) scheme where an MP monitors and controls its own APs, independent of other MPs. Such a distributed approach has the advantage of scalability over a centralized DPM. Furthermore, stages with more or less constant workload do not need system-level power management and hence will not have a DPM implementation, eliminating the overheads of the DPM for such stages. Since the DPM is distributed, each stage can have different power states for its APs, depending on the type of processors used, as well as different power management heuristics. For example, MP2 and MP4-1 can have different DPM heuristics, while MP3 will not have any DPM implemented. Figure 2 zooms in on one of the MPs to illustrate the two components of our DPM. The first component decides "when" and "how many" APs to activate and deactivate, and has been incorporated from [7]. The idle APs are decided at the start of each iteration based on the previous and current iteration's execution times together with the workload prediction from the application. The approach was designed so that the idle APs remain idle for at least one iteration. The second component decides the power state of each idle AP based upon the predicted number of idle iterations, which is obtained either through the history of application execution or through the workload prediction from the application. For example, if the first component reports AP14 and AP15 to be idle, then the second component will decide the power states of AP14 and AP15 to reduce the overall energy consumption of the system.

4.1 Analytical Analysis
The decision on the power state of an idle AP depends on the duration of the idle period, that is, the number of consecutive idle iterations of the AP. In this section, we show an analytical method to calculate the minimum number of idle iterations for each power state, such that if an AP is transitioned to a given power state for at least that many idle iterations, the transition is energy-wise beneficial. For the purpose of analysis, we assume the following:
∙ N power states denoted PS_0, ..., PS_{N−1} with power consumptions P_0, ..., P_{N−1} respectively, where P_i > P_j for i < j. Hence, PS_0 is the active state while PS_{N−1} is the most power saving state.
∙ E_i^ov: energy overhead of switching from PS_0 to PS_i and back to PS_0. We assume that PS_i transitions directly back to PS_0 without passing through intermediate states, similar to the assumption used in [2].
∙ T_i^ov: wake-up latency from PS_i to PS_0. An AP is activated and then the data is sent, overlapping the activation time with the communication time, as the FIFOs between the MP and the APs are always active. In this work, we assume that the communication time is greater than the wake-up latency, which is typically the case with large-scale multimedia applications (see Section 7.1). Hence, T_i^ov is not the deciding factor for the minimum number of idle iterations.
∙ T_c: the time the critical pipeline stage spends in one iteration, computed from the throughput constraint of the multimedia application. The duration of an iteration will be T_c for all the stages, and hence for all the MPs and APs.

The energy savings in a transition from PS_0 to PS_i should amortize the overhead of the transition. The energy consumption of an AP for I iterations in PS_i would be: P_i × T_c × I + E_i^ov. The first term is the energy consumption of PS_i for I iterations, while the second term is the overhead of the transition to PS_i from PS_0 and back to PS_0. To evaluate whether a transition to PS_i (higher power state) or to PS_j (lower power state) would be beneficial, the energy consumption in PS_j including the overhead should be less than the energy consumption in PS_i:

P_j × T_c × I + E_j^ov < P_i × T_c × I + E_i^ov
I > (E_j^ov − E_i^ov) / ((P_i − P_j) × T_c)

where 0 ≤ i < j < N. Hence, if the number of idle iterations is more than (E_j^ov − E_i^ov) / ((P_i − P_j) × T_c), then the transition to PS_j (lower power state) would be beneficial; otherwise the AP should be transitioned to PS_i (higher power state). Thus, we define I_{j,i} = ⌈(E_j^ov − E_i^ov) / ((P_i − P_j) × T_c)⌉, the minimum number of idle iterations for PS_j to be more beneficial than PS_i. Consequently, a power state can be compared with all the higher power states to obtain the values of I_{j,i} ∀ j, i where 0 ≤ i < j < N. The threshold number of idle iterations for a power state PS_j, denoted I_j^min, is then max_{∀i} {I_{j,i}}.
Let us go through the example states shown in Table 1 to illustrate the calculation of I_j^min; the results are shown in Table 2. For each power state, the values of I_{j,i} are computed assuming T_c = 1 sec. The value of I_{1,0} signifies that an AP should only be transitioned to PS1 from PS0 if the AP will be idle for at least 1 iteration. Note that a transition to PS2 from PS0 for one iteration would also be beneficial (I_{2,0} = 1); however, for PS2 to be more beneficial than PS1, the AP should stay in PS2 for at least 2 iterations (I_{2,1} = 2). Thus, if an AP remains in PS_j for at least I_j^min iterations, the total energy savings are guaranteed to exceed those of a transition to any of the higher power states.

Table 2: Minimum number of idle iterations required for the power states described in Table 1
Power State | I_{j,i}                                  | I_j^min
1           | I_{1,0} = 1                              | 1
2           | I_{2,0} = 1, I_{2,1} = 2                 | 2
3           | I_{3,0} = 2, I_{3,1} = 3, I_{3,2} = 7    | 7

The minimum number of idle iterations for each power state is computed offline and then saved in the MP for use at run-time by its DPM scheme. The DPM scheme predicts the number of idle iterations, say I_idle^p, for an AP at the start of an iteration, which is then used to obtain the most beneficial power state from the saved values of I_j^min. For example, for I_idle^p values of 1, 5 and 8, the AP will be transitioned to PS1, PS2 and PS3 respectively.
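The threshold computation above can be sketched in a few lines of Python. The power and energy values below are the normalized numbers from Table 1 (with T_c = 1); the function name is ours, not the paper's, and the code simply reproduces the I_j^min column of Table 2:

```python
import math

def min_idle_iterations(P, E_ov, T_c=1.0):
    """For each power state j > 0, compute
    I_j_min = max over i < j of ceil((E_ov[j] - E_ov[i]) / ((P[i] - P[j]) * T_c))."""
    I_min = {}
    for j in range(1, len(P)):
        I_min[j] = max(
            math.ceil((E_ov[j] - E_ov[i]) / ((P[i] - P[j]) * T_c))
            for i in range(j)
        )
    return I_min

# Normalized values from Table 1: PS0 (active), PS1 (CG),
# PS2 (partially PG), PS3 (fully PG).
P = [1.0, 0.4, 0.1, 0.01]
E_ov = [0.0, 0.01, 0.4, 1.0]
print(min_idle_iterations(P, E_ov))  # {1: 1, 2: 2, 3: 7}, matching Table 2
```

For instance, I_{3,2} = ⌈(1 − 0.4) / (0.1 − 0.01)⌉ = ⌈6.67⌉ = 7, which dominates I_{3,0} and I_{3,1} and therefore becomes I_3^min.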
4.2 Application Based Workload Prediction
In multimedia applications, much information is available from the application, such as the texture, brightness, size and homogeneity of the macroblocks or frames [16]. Typically, a pre-processing stage is employed in multimedia applications to analyze such information [16] (see Figure 6). The pre-processing stage processes the raw input to analyze texture, brightness, etc. at the macroblock or frame level, providing useful information to the video processing system in advance for run-time adaptations [17]. In this paper, we use such information at the system level for application-aware dynamic power management.
4.2.1 An H.264 Video Encoder Example
In this section, we elaborate on one piece of information that the pre-processing stage makes available to the motion estimation sub-task in the H.264 video encoder. Consider that the pre-processing stage categorizes the MBs of a frame as either low or high motion MBs. Low motion MBs typically contain slow moving objects and are homogeneous
Figure 3: Probability density function of the number of SADs for low and high motion MBs
while high motion MBs are textured and contain fast moving objects. Depending on the texture/variance of the current MB and the predicted SADs of the neighboring MBs, the workload (in terms of the number of SADs) for the current MB may be predicted. If MB_v < TH_v and median(SAD_left, SAD_top, SAD_topRight) < TH_SAD, then the MB is low-motion; otherwise it is a high-motion MB. TH_v and TH_SAD are the threshold values for the variance and the number of SADs, and are typically obtained through regression analysis [16]. The variance of an MB, MB_v, equals (1/256) Σ_{i=1}^{256} (P_i − B_avg)², where P_i is the i-th pixel while
B_avg is the average brightness ((Σ_{i=1}^{256} P_i + 128) / 256). The number of SADs for low- and high-motion MBs is obtained through Probability Density Functions (PDFs), shown for two test video sequences ("station" and "tractor") in Figure 3. Based on the category of the current MB (low- or high-motion, computed as explained above), the corresponding distribution is used to obtain the zone of high probability (a probability of 84% for Gaussian distributions) [18]. For example, the two vertical (dotted-blue) lines in the "station" graph mark the range of the number of SADs for low motion MBs, while the range between the two vertical (dotted-red) lines in the "tractor" graph is for high-motion MBs. From the figure, the ranges for low- and high-motion MBs are [0, 150] and [150, 600] SADs respectively. Note that the prediction is fuzzy as it is in the form of a range. For each range of predicted workload, we perform an offline analysis to obtain the number of APs that can handle that much workload. For example, the ranges of [0, 150] and [150, 600] SADs can be converted to [0, 5] and [6, 20] APs respectively if the offline analysis revealed that each AP can handle 30 SADs within one iteration. In summary, the ranges are computed offline and stored in the pre-processing stage in the form of a lookup table to reduce the run-time overhead. At run-time, the pre-processing stage first categorizes the current MB (by analyzing its variance and the number of SADs of neighboring MBs), and then uses its category to obtain the corresponding workload range from the lookup table. Hence, the pre-processing stage can predict the workload of the current MB (the number of APs, in the form of a range) for the motion estimation sub-task of the H.264 video encoder.
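The classification and lookup described above can be sketched as follows. The AP ranges come from the text ([0, 150] SADs → [0, 5] APs, [150, 600] SADs → [6, 20] APs); the concrete threshold values TH_V and TH_SAD are illustrative assumptions, since the paper obtains them through offline regression analysis:

```python
import statistics

# Illustrative thresholds; the paper derives TH_v and TH_SAD through
# offline regression analysis, so these exact numbers are assumptions.
TH_V = 900.0
TH_SAD = 150

# Offline-computed lookup table from the text: each AP handles ~30 SADs
# per iteration, so [0, 150] SADs -> [0, 5] APs and [150, 600] -> [6, 20].
AP_RANGE = {"low": (0, 5), "high": (6, 20)}

def mb_variance(pixels):
    """MB_v = (1/256) * sum((P_i - B_avg)^2) over a 16x16 macroblock,
    with B_avg = (sum(P_i) + 128) / 256 (rounded average brightness)."""
    b_avg = (sum(pixels) + 128) / 256
    return sum((p - b_avg) ** 2 for p in pixels) / 256

def classify_mb(pixels, sad_left, sad_top, sad_top_right):
    """Low-motion iff both the variance and the median neighbour SAD
    fall below their thresholds; otherwise high-motion."""
    neighbour_sad = statistics.median([sad_left, sad_top, sad_top_right])
    if mb_variance(pixels) < TH_V and neighbour_sad < TH_SAD:
        return "low"
    return "high"

def predict_workload_range(pixels, sad_left, sad_top, sad_top_right):
    """Return the predicted workload as a range of APs required."""
    return AP_RANGE[classify_mb(pixels, sad_left, sad_top, sad_top_right)]

flat_mb = [100] * 256  # homogeneous macroblock with low-SAD neighbours
print(predict_workload_range(flat_mb, 40, 60, 50))  # (0, 5)
```

The fuzzy output (a range of APs rather than a single count) is exactly what the application prediction based heuristics of Section 5.2 consume.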
The following terms are introduced to explain the heuristics:
∙ W_a[i]: actual workload at the end of iteration i, equal to the number of active APs
∙ AP_M: the total number of APs for an MP, where the APs are denoted AP0, AP1, ..., AP(M−1)
∙ MB_N: the total number of macroblocks in a frame
∙ idleAPs: the list of APs that will be idle in the current iteration (provided by the first component of the DPM, see Section 4)
∙ I_idle^p: the predicted number of idle iterations for an AP
∙ I_j^min: the minimum number of idle iterations for power state PS_j, as explained in Section 4.1
5.1 History Based Heuristic
The history based heuristic (HBH) monitors the workload of the previous iterations to keep a record of the average number of idle iterations of each AP which is later used to predict the number of idle iterations for an AP. The algorithm is shown in Algorithm 1. The algorithm keeps the total number of idle iterations (totalIdleIterations array) and the total number of idle periods (idlePeriods array) seen till the current iteration (k-th iteration in the Algorithm 1) for all the APs. The history information is populated at the end of the current iteration (lines 10 - 17), while the history information is used at the start to choose the power state for idle APs (lines 4 - 9). The predicted number of idle 𝑝 iterations, 𝐼𝑖𝑑𝑙𝑒 , is the average number of idle iterations so far (line 5), and is used to select the most beneficial power state using 𝐼𝑗𝑚𝑖𝑛 (lines 6 - 9). The history information is populated as follows. If the current number of active APs (𝑊𝑎 [𝑘]) is less than the previous iteration’s active
4.3 Problem Statement
Given the minimum number of idle iterations for each power state, the predicted amount of workload for future iterations (in terms of the number of APs) of an MP and the number of idle APs for the current iteration, the goal is to select the most beneficial power state for the idle APs so as to maximally reduce the energy consumption of the system. The challenge is to accurately predict the number of idle iterations for an AP, and here we leverage the indication of the future workloads from the pre-processing stage.
Algorithm 1: History Based Heuristic
// Initialization
1  for i = 0; i < AP_M; i++ do
2    idlePeriods[i] = 0;
3    totalIdleIterations[i] = 0;
// Called at the start of the k-th iteration to decide the power states
4  for i ∈ idleAPs do
5    I_idle^p = ⌊totalIdleIterations[i] / idlePeriods[i]⌋;
6    for j = 1; j < N; j++ do
7      if I_idle^p < I_j^min then
8        break;
9    Transition i-th idle AP to power state PS_{j−1}
// Called at the end of the k-th iteration to populate the history information
10 if W_a[k] ≤ W_a[k−1] then
11   for i = W_a[k]; i < AP_M; i++ do
12     totalIdleIterations[i]++;
13 else
14   for i = W_a[k−1]; i < W_a[k]; i++ do
15     idlePeriods[i]++;
16   for i = W_a[k]; i < AP_M; i++ do
17     totalIdleIterations[i]++;
5. POWER MANAGEMENT HEURISTICS
This section describes the five heuristics which we propose for dynamic power management of an adaptive pipelined MPSoC. The first heuristic is history based and does not use the workload prediction information from the application. The other four heuristics leverage the workload prediction from the application to decide the most beneficial power state for idle APs. The first component of the DPM, described in Section 4, decides the number of idle APs at the start of each iteration. Thus, all the heuristics described below switch the APs to their corresponding power states at the start of each iteration.
Figure 4: An example illustrating the working of HBH
Iteration                | k−3 | k−2 | k−1 |  k  | k+1 | k+2
W_a[k]                   | 16  | 13  | 14  | 13  | 13  | 16
totalIdleIterations AP15 | 200 | 201 | 202 | 203 | 204 | 204
totalIdleIterations AP14 | 150 | 151 | 152 | 153 | 154 | 154
totalIdleIterations AP13 | 120 | 121 | 121 | 122 | 123 | 123
idlePeriods AP15         | 25  | 25  | 25  | 25  | 25  | 26
idlePeriods AP14         | 28  | 28  | 28  | 28  | 28  | 29
idlePeriods AP13         | 20  | 20  | 21  | 21  | 21  | 22
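HBH's bookkeeping and power-state selection can be sketched in Python as follows. The update rules and the ⌊total/periods⌋ prediction follow Algorithm 1; AP_M and the I_j^min thresholds are the running example's numbers (Table 2), and all function and variable names are ours:

```python
AP_M = 16                    # total number of APs for this MP (example)
I_MIN = {1: 1, 2: 2, 3: 7}   # I_j^min thresholds from Table 2

total_idle = [0] * AP_M      # totalIdleIterations array
idle_periods = [0] * AP_M    # idlePeriods array

def update_history(w_prev, w_curr):
    """End-of-iteration bookkeeping (Algorithm 1, lines 10-17)."""
    if w_curr > w_prev:                    # some APs were just re-activated:
        for i in range(w_prev, w_curr):
            idle_periods[i] += 1           # their idle periods have ended
    for i in range(w_curr, AP_M):          # APs still idle this iteration
        total_idle[i] += 1

def choose_power_state(ap):
    """Start-of-iteration decision (Algorithm 1, lines 4-9): predict
    I_idle^p as the average idle-period length, then pick the deepest
    power state whose threshold it meets (0 means stay active)."""
    predicted = total_idle[ap] // idle_periods[ap]
    beneficial = [j for j in I_MIN if predicted >= I_MIN[j]]
    return max(beneficial) if beneficial else 0

# Seed the history with the Figure 4 values at iteration k-3 ...
for ap, (t, p) in {13: (120, 20), 14: (150, 28), 15: (200, 25)}.items():
    total_idle[ap], idle_periods[ap] = t, p

print(choose_power_state(15))    # 200 // 25 = 8 -> PS3
print(choose_power_state(13))    # 120 // 20 = 6 -> PS2

# ... then replay W_a over iterations k-2 .. k+2:
w_a = [16, 13, 14, 13, 13, 16]
for prev, curr in zip(w_a, w_a[1:]):
    update_history(prev, curr)
print(total_idle[13:], idle_periods[13:])   # [123, 154, 204] [22, 29, 26]
```

Replaying the W_a sequence reproduces the k+2 column of Figure 4, which is a quick way to check the bookkeeping against the paper's worked example.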
Algorithm 2: Populate idleIterations Table (for the sake of simplicity, boundary cases are not reported here)
1  for i = 0; i < MB_N; i++ do
2    if i == 0 then // First iteration
3      pHW = 0;
4      for ii = i+1; ii < MB_N; ii++ do
5        for k = pHW; k < W_p[ii]; k++ do
6          idleIterations[k][i] = ii − i;
7        if W_p[ii] > pHW then
8          pHW = W_p[ii];
9        if W_p[ii] == AP_M then
10         break;
11   else
12     for k = W_p[i]; k < AP_M; k++ do
13       idleIterations[k][i] = idleIterations[k][i−1] − 1;
14     pHW = 0;
15     for ii = i+1; ii < MB_N; ii++ do
16       for k = pHW; k < W_p[ii]; k++ do
17         idleIterations[k][i] = ii − i;
18       if W_p[ii] > pHW then
19         pHW = W_p[ii];
20       if W_p[ii] ≥ W_p[i] then
21         break;

Algorithm 3: Application Prediction Based Heuristic
1 for i ∈ idleAPs do
2   I_idle^p = idleIterations[i][k];
3   for j = 1; j < N; j++ do
4     if I_idle^p < I_j^min then
5       break;
6   Transition i-th idle AP to power state PS_{j−1}
The example in Figure 5 illustrates the computation of the idleIterations table for the last four APs only, where AP_M = 16. The idea is to look into the future iterations to compute the number of idle iterations of an AP, if it is deactivated at the start of the current iteration. For example, at iteration i in Figure 5, if AP15 is deactivated, then it will be idle for the next 5 iterations according to the workload prediction, because it will be activated again in iteration i+5 (when W_p[i+5] = 16). Hence, the predicted number of idle iterations for AP15 at iteration i will be 5. As another example, AP12 will be idle for only 1 iteration as it will be used in iteration i+1 according to the predicted workload. The algorithm to populate the entries of the idleIterations table is shown in Algorithm 2. It populates the entries for the i-th iteration (the i-th column of the table) based on the (i−1)-th iteration's values and future workloads. The initialization is done at the first iteration (line 2), where the first column of the table is populated. Lines 4 - 10 look into the future iterations until the future workload equals the maximum number of APs (lines 9 - 10) to calculate the number of idle iterations for all the APs. The variable pHW in lines 7 - 8 tracks the number of APs for which the number of idle iterations has already been computed. For example, the first run of the for loop in lines 5 - 6 will compute the idle iterations for AP0 - AP12 (since i = 0, W_p[i+1] = 13). The second run of the same for loop will only compute the idle iterations for the rest of the APs, that is, AP13 and onwards. The second part of the algorithm (lines 11 - 21) populates the rest of the columns of the idleIterations table. At this step, the number of idle iterations for some of the APs can be inferred from the previous iteration's values (lines 12 - 13).
For other APs, the algorithm looks into the future workloads until the future workload is the same or higher than the current iteration’s workload (lines 20 - 21) to compute the number of idle iterations (lines 14 - 15). For example, in Figure 5, the values for AP14 and AP15 at i+2 are computed by subtracting one from the values of i+1 iteration; however, the values for AP12 and AP13 are computed from future workloads. Handling of the boundary cases and some optimization steps are skipped for the sake of simplicity in Algorithm 2. Once the idleIterations table is available, the decision for the Application Prediction Based Heuristic (APBH) is simplified. Consider APBH has to decide the power state for AP0 at the start of k-th iteration, then the value of idleIterations[0][k] (which will be the predicted number of idle iterations for AP0) will be used to decide the most beneficial power state for AP0. Algorithmically, it is stated in Algorithm 3. As an example, in Figure 5, if 𝑖𝑑𝑙𝑒𝐴𝑃 𝑠 = {14, 15} at i+2 iteration, then both AP14 (idleIterations[14][i+2] = 3) and AP15 (idleIterations[15][i+2] = 3) will be transitioned to PS2 (see Table 2). Recall from Section 4.2 that the workload prediction from the preprocessing stage is fuzzy and is represented as a range. However, the algorithm to compute the idleIteration table assumes a single value for the workload prediction. Thus, we use four different methods to obtain a single value from the predicted workload’s range, resulting in four application prediction based heuristics. We define Min(R), Max(R), Avg(R) to return the minimum, maximum and average value of a range 𝑅 respectively. Consider 𝑀 ranges are available from the
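The quantity the table holds can also be computed directly (and inefficiently) by scanning forward until an AP is next required. The O(MB_N² · AP_M) sketch below is only a cross-check of what Algorithm 2 builds incrementally, not the paper's algorithm, and the example W_p sequence is our own:

```python
def idle_iterations_table(W_p, AP_M):
    """idleIterations[k][i]: how many iterations AP k stays idle if it is
    deactivated at the start of iteration i, i.e. the distance to the next
    iteration whose predicted workload needs AP k (W_p[ii] > k).
    Boundary case: if AP k is never needed again, count to the frame end."""
    MB_N = len(W_p)
    table = [[0] * MB_N for _ in range(AP_M)]
    for k in range(AP_M):
        for i in range(MB_N):
            dist = MB_N - i                  # default: idle to end of frame
            for ii in range(i + 1, MB_N):
                if W_p[ii] > k:              # AP k is activated again at ii
                    dist = ii - i
                    break
            table[k][i] = dist
    return table

# Predicted workload (number of APs) over six iterations, as an example:
W_p = [16, 13, 14, 13, 13, 16]
t = idle_iterations_table(W_p, 16)
print(t[15][1])   # AP15 next needed at iteration 5 (W_p = 16) -> 4
print(t[13][1])   # AP13 next needed at iteration 2 (W_p = 14) -> 1
```

Given such a table, the APBH decision is a single lookup followed by the same threshold comparison used by HBH.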
APs (Wa[k-1]), then the total number of idle iterations for all the inactive APs (including the ones which were rendered idle in the current iteration) is incremented by 1 (lines 10 - 12). On the other hand, if Wa[k-1] > Wa[k], then some of the APs were activated in the current iteration, and for these APs the number of idle periods is incremented by 1, because the idle period of these APs has just finished (lines 14 - 15). For the rest of the APs, the total number of idle iterations is incremented by 1 (lines 16 - 17). An example illustrating the working of the algorithm is shown in Figure 4, where AP_M = 16 and the calculation is shown for only the last three APs. At iteration k-2, consider idleAPs = {13, 14, 15}. Then, HBH will put AP15 (I_idle^p = ⌊200/25⌋ = 8) to PS3 (since 8 ≥ I_3^min, see Table 2), while AP14 (I_idle^p = ⌊151/28⌋ = 5) and AP13 (I_idle^p = ⌊121/20⌋ = 6) will be transitioned to PS2. It should be noted that HBH keeps the minimum amount of information so that its run-time overhead is low. Furthermore, the average number of idle iterations for each AP is updated at run-time based on history; however, HBH will not be able to predict very accurately due to the quickly changing workload of multimedia applications (see Section 7.2).
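The prediction step of HBH can be sketched as follows. This is a minimal sketch with hypothetical counter names; per the description above, HBH keeps a running total of idle iterations and a count of completed idle periods for each AP, and predicts the next idle duration as the floor of their ratio.

```python
def hbh_predict(total_idle_iterations, num_idle_periods):
    """History Based Heuristic prediction: the floor of the average
    historical idle-period length of an AP (one counter pair per AP)."""
    return total_idle_iterations // num_idle_periods
```

For the Figure 4 example, hbh_predict(200, 25) gives 8 for AP15, hbh_predict(151, 28) gives 5 for AP14, and hbh_predict(121, 20) gives 6 for AP13, matching the values above.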
5.2 Application Prediction Based Heuristics

As explained in Section 4.2, a pre-processing stage is available which can predict the workload of all the MBs of a frame in terms of the number of APs required to process them. In this section, we first show how that prediction can be used to predict the number of idle iterations for APs, and then we show how the predicted number of idle iterations is used by the heuristics to decide the power states of idle APs. The following terms are used in addition to the ones described in Section 5:
∙ Wp[i]: Workload predicted by the pre-processing stage for iteration i, in terms of the number of APs required
∙ idleIterations[k][i]: At the i-th iteration, the number of iterations
for which the k-th AP will remain idle. For example, idleIterations[0][10] = 3 means that AP0 will remain idle for the next 3 iterations starting at iteration 10, that is, until iteration 12. This table is populated using the workload prediction from the application.

Algorithm 3: Application Prediction Based Heuristic (APBH)
// Called at the start of the k-th iteration to decide the power states
for i ∈ idleAPs do
    I_idle^p = idleIterations[i][k];
    for j = 1; j < N; j++ do
        if I_idle^p < I_j^min then break;
    Transition the i-th idle AP to power state PS_(j-1);

Figure 5: An example of populating the idleIterations table (predicted workloads Wp = 16, 13, 14, 13, 13, 16 for iterations i to i+5, and the corresponding idle-iteration counts for AP12 - AP15)
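Under the definition above, idleIterations[k][i] is the smallest d ≥ 1 such that Wp[i+d] > k, i.e., the number of iterations until AP k would be needed again. The table computation and the APBH decision can be sketched as follows; this is a brute-force sketch (the paper's Algorithm 2 computes the same table incrementally), with boundary cases simply capped at the prediction horizon, and function names are our own:

```python
def build_idle_iterations(Wp, num_aps):
    # idle[k][i]: predicted idle iterations for AP k if it is deactivated
    # at the start of iteration i. Brute force over future workloads.
    n = len(Wp)
    idle = [[0] * n for _ in range(num_aps)]
    for i in range(n):
        for k in range(num_aps):
            d = 1
            while i + d < n and Wp[i + d] <= k:  # AP k stays idle while not required
                d += 1
            idle[k][i] = d
    return idle

def select_power_state(idle_pred, i_min):
    # Algorithm 3 decision: pick the deepest state j whose minimum
    # beneficial idle duration i_min[j-1] (for j = 1 .. N-1) is covered
    # by the predicted idle duration; state 0 is the active state.
    state = 0
    for j, threshold in enumerate(i_min, start=1):
        if idle_pred < threshold:
            break
        state = j
    return state
```

With the Figure 5 workloads Wp = [16, 13, 14, 13, 13, 16] and 16 APs, the sketch reproduces the table's values, e.g., 5 idle iterations for AP15 at iteration i and 1 for AP12.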
pre-processing stage, numbered from 1 to 𝑀 where Max(𝑅𝑀 ) ≤ 𝐴𝑃𝑀 . The following text uses the ranges of [0, 5] and [6, 20] in terms of the number of APs for low- (L) and high-motion (H) MBs (from Section 4.2), and 𝐴𝑃𝑀 = 20 for exemplary purposes.
1. MinAPBH: Min(R_i) ∀i is used to map ranges to single values. For example, for a sequence of [L L H L] MBs, the predicted workloads would be [0 0 6 0]. The drawback of computing the idleIterations table with Min(R) is that the maximum value of the predicted workload will be Min(R_M). This means that all the APs from Min(R_M) to AP_M - 1 will always be considered inactive according to the predicted workloads. For example, AP6 - AP19 will always be idle and hence will always be transitioned to PS3 (the most power saving state from Table 2).

2. MaxAPBH: Max(R_i) ∀i is used to map ranges to single values. For the same example of [L L H L] MBs, the predicted workloads would be [5 5 20 5]. Unlike MinAPBH, MaxAPBH introduces an error towards the other end of the spectrum. Since the minimum value of the workloads will be Max(R_1), the first Max(R_1) APs will be considered active during all the iterations according to the workload prediction. For example, AP0 - AP4 will be active at all times and hence will only be transitioned to PS1 (the least power saving state from Table 2).

3. AvgAPBH: Avg(R_i) ∀i is used to map ranges to single values. For example, the predicted workload for the sequence of [L L H L] MBs would be [3 3 13 3]. In AvgAPBH, all the APs from 0 to Avg(R_1) - 1 (AP0 - AP2) will always be transitioned to PS1, while all the APs from Avg(R_M) to AP_M - 1 (AP13 - AP19) will always be switched to PS3.

4. MAMAPBH: Min(R_1), Avg(R_i) and Max(R_M), ∀i with i ≠ 1 and i ≠ M, are used for the ranges. MAMAPBH uses the minimum of the first range, the maximum of the last range and the average of the intermediate ranges to compute the predicted workloads. For example, the sequence of [L L H L] MBs would be translated to [0 0 20 0]. MAMAPBH does not suffer from the drawbacks of MinAPBH, MaxAPBH and AvgAPBH as it uses Min(R_1) and Max(R_M) for the first and last range respectively.
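The four mappings can be sketched as follows. This is an illustrative sketch with hypothetical function and mode names; ranges are assumed to be (lo, hi) pairs, and Avg rounds halves up so that Avg([0, 5]) = 3, matching the example above.

```python
def map_ranges(mbs, ranges, mode):
    # mbs: per-MB range index (0-based); ranges: list of (lo, hi) pairs.
    avg = lambda r: (r[0] + r[1] + 1) // 2   # average, rounding halves up
    last = len(ranges) - 1
    if mode == "min":        # MinAPBH
        f = lambda i: ranges[i][0]
    elif mode == "max":      # MaxAPBH
        f = lambda i: ranges[i][1]
    elif mode == "avg":      # AvgAPBH
        f = lambda i: avg(ranges[i])
    else:                    # "mam": Min of first, Max of last, Avg in between
        f = lambda i: ranges[i][0] if i == 0 else (
            ranges[i][1] if i == last else avg(ranges[i]))
    return [f(i) for i in mbs]
```

For the [L L H L] sequence with L = [0, 5] and H = [6, 20], this yields [0, 0, 6, 0] for "min", [5, 5, 20, 5] for "max", [3, 3, 13, 3] for "avg" and [0, 0, 20, 0] for "mam", as in the examples above.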
Figure 7: Task graph of H.264 Video Encoder

processing systems. The pre-processing system extracts the features of incoming frames to provide useful information to the video processing system for run-time adaptations; an example is the categorization of the MBs according to the motion contained in them, as described in Section 4.2. In our DPM scheme, the pre-processing system is also responsible for the computation of the idleIterations table through Algorithm 2. The video processing system implements the video codec, such as H.264, on an adaptive pipelined MPSoC. A DPM scheme is implemented for each of the MPs with run-time varying workloads. More specifically, the first component of DPM (see Section 4) and either HBH or one of the APBH heuristics is implemented to decide at run-time the power states for idle APs. Statistical, architectural and application information obtained through profiling and offline analysis is used to guide the two systems at run-time. For example, the ranges for workload prediction (number of SADs for the motion estimation sub-task) are obtained through statistical analysis and are then converted to an equivalent number of APs through profiling (the number of SADs an AP can handle in one iteration) for workload prediction at run-time. Other information, such as the minimum number of idle iterations for the power states, is also provided. The pre-processing system is expected to work at frame-level so that the workload prediction for all the MBs of a frame (in the form of the idleIterations table) is available to the video processing system, which works at the macroblock-level. Thus, our DPM is applicable to all advanced macroblock-based video coding applications such as H.264, MPEG-4, AVS, and VC1.
If the pre-processing system is not available, then the DPM can use the application knowledge from the video processing system (for example, the actual number of SADs of the previous MBs) to predict the future workload; however, such a prediction would be less accurate. The DPM scheme presented here can also be extended to architectures other than pipelined MPSoCs. For example, in a master-slave architecture, the master processor will run DPM algorithms to select the power states of the slaves.
7. AN H.264 VIDEO ENCODER CASE STUDY
In this section, we illustrate the applicability of our application-aware dynamic power management scheme by implementing an H.264 video encoder on an adaptive pipelined MPSoC supporting HD720p (high definition) at 30 fps.
The overall flow to implement the proposed application-aware dynamic power management scheme for adaptive pipelined MPSoCs executing multimedia applications is shown in Figure 6. A multimedia application is implemented as a combination of pre-processing and video
7.1 Implementation Details
The adaptive pipelined MPSoC was implemented in a commercial design flow from Tensilica using the Xtensa LX3 [19] family of processors and the RC-2010.1 tool suite. The ASIPs in the pipelined MPSoC were created automatically using the XPRES tool, which can generate special instructions for a processor from the C code that will be executed on it. Hence, different sets of special instructions were generated for all the sub-tasks of the H.264 encoder (explained later). These special instructions contained a combination of fused operations, vector operations, FLIX instructions [20] and specialized operations [21]. The resulting ASIPs were then used to create the pipelined MPSoC in the Xtensa Modeling Protocol (XTMP), a cycle-accurate multiprocessor simulation environment. The XT-XENERGY tool was used to measure the power and energy of the ASIPs in XTMP. Hence, we obtained the throughput and energy of the adaptive pipelined MPSoC from XTMP, where all the ASIPs were running at 1 GHz and XT-XENERGY was configured for a given 45nm technology. Figure 7 shows the sub-tasks of the H.264 video encoder. The task
(CC: Color Conversion, ME: Motion Estimation, IP: Intra Prediction, MC: Motion Compensation, TQ: Transform & Quantize, ITQ: Inverse TQ, LF: Loop Filter, EC: Entropy Coding)
6. DPM IMPLEMENTATION OVERVIEW
All these heuristics have to compute the idleIterations table at run-time, which might introduce unacceptable overhead. Our solution to this problem is to implement the table computation algorithm in the pre-processing stage. The pre-processing stage will write the table into a shared memory, from which APBH will read the values at run-time, keeping its overhead to a minimum. The computation of the idleIterations table in the pre-processing stage will not affect the throughput of the video processing system, as the pre-processing stage is not part of the video processing system (see Figure 6).
Figure 6: An overview of DPM implementation in adaptive pipelined MPSoCs (a frame-level pre-processing system categorizes macroblocks, predicts workloads and populates the idleIterations table via Algorithm 2; a macroblock-level video processing system runs the first DPM component together with HBH (Algorithm 1) or APBH (Algorithm 3); both are guided by statistical, architectural and application information from profiling and offline analysis)
State | Power Consumption | Transition Energy (nJ) | Wake-up Latency (ns) | I_j^min
0     | P_dyn + P_leak    | 0                      | 0                    | -
1     | P_leak            | 1                      | 3                    | 1
2     | ~0                | 250                    | 100                  | 9
Table 3: Multiple power states used in our experiments
graph is executed at the MB level, where each sub-task processes one MB in an iteration (which is typical of real-time implementations of H.264 encoders/decoders [22]). To increase the throughput, the entropy coding processes MBs in parallel to the reconstruction path (ITQ and LF), while the CC and IP/MC sub-tasks send data to the IP/MC and ITQ sub-tasks respectively in advance (bypassing the intermediate sub-tasks). The annotations around the arrows show the amount of data (buffer sizes in the adaptive pipelined MPSoC) in bytes transferred in each iteration. For example, the CC sub-task sends the Y component of a 16×16 MB to the ME sub-task, a transfer of 256 bytes in each iteration. In this task graph, the DPM scheme could be implemented for the ME, IP/MC and EC sub-tasks due to their run-time varying workloads. However, for this case study, we deployed DPM for the ME sub-task only, providing a proof of concept of the proposed idea. Thus, all the sub-tasks in Figure 7 were mapped on MPs in the adaptive pipelined MPSoC, except for the ME stage where a combination of MPs and APs was used with a DPM. In our implementation, we used one MP and 16 APs (AP_M = 16) for the motion estimation stage to support HD720p @ 30 fps. The three power states shown in Table 3 were used for the APs. The values of transition energy and wake-up latency were inferred from [23, 24], while the values of I_j^min were computed according to the equations from Section 4.1 with P_dyn = 28.5 mW, P_leak = 6.50 mW and T_c = 9,100 clock cycles (to support ≥ 30 fps). We executed the pipelined MPSoC with several video sequences to obtain the average values of P_dyn and P_leak of an AP. In our experiments, the latency of sending the data (at least 256 ns, assuming a byte transfer takes at least 1 clock cycle @ 1 GHz) to APs after activating them was larger than the wake-up latency of PS2 (100 ns) and hence did not affect the throughput of the pipelined MPSoC.
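The iteration budget T_c can be sanity-checked from the frame geometry (a back-of-the-envelope check, not taken from the paper): HD720p has 80 × 45 = 3600 macroblocks per frame, so at 30 fps and 1 GHz the per-MB budget is about 9259 cycles; the paper's T_c = 9,100 sits just below this bound, leaving a small amount of headroom.

```python
MB_WIDTH = 1280 // 16                        # 80 macroblocks per row in HD720p
MB_HEIGHT = 720 // 16                        # 45 macroblock rows
MBS_PER_SECOND = MB_WIDTH * MB_HEIGHT * 30   # 108,000 MBs/s at 30 fps
CYCLE_BUDGET = 10**9 // MBS_PER_SECOND       # per-MB cycle budget at 1 GHz
```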
The pre-processing stage for our experiments categorized all the MBs of a frame into low, medium and high motion MBs at run-time. The workload of each category in terms of the number of APs was computed using offline analysis and was saved in a lookup table for use at run-time. In our experiments, the ranges (𝑅1 , 𝑅2 and 𝑅3 ) for predicted workload of low, medium and high motion MBs were [0, 4], [5, 10] and [11, 16] (number of APs) respectively. These ranges were also used by Algorithm 2 to compute the idleIterations table for the application prediction based heuristics.
Figure 8: Power states of AP0 and AP15 for the 'pedestrian' video sequence for (a) HBH; (b) MinAPBH; (c) MaxAPBH; (d) AvgAPBH; and (e) MAMAPBH

irrespective of the duration of the idle period.
∙ Label C illustrates the same problem as that of MaxAPBH, but for MinAPBH. In MinAPBH, AP11 - AP15 (Min(R_3) to AP_M - 1) will be considered inactive at all times. Hence, AP15 is always transitioned to PS2 (the most power saving state) according to MinAPBH.
∙ Label D shows the scenarios where AvgAPBH takes wrong decisions on the power state of an AP. Since AvgAPBH uses Avg(R) for converting the ranges to single values, it will always consider AP0 as active and AP15 as inactive, resulting in their transitions to PS1 and PS2 respectively, irrespective of the idle periods' durations.
∙ Label E illustrates the scenario where the fuzzy workload prediction from the application can be misleading. AP15 should have been transitioned to PS2 as the duration of the idle period is more than 9 iterations; instead, it was transitioned to PS1. Frequent wrong decisions on the appropriate power state might result in increased energy consumption; however, we will show later that the number of wrong decisions from MAMAPBH is very low. This can also be seen from the graphs, where MAMAPBH chose the wrong power state only once, that is, at Label E.
These graphs illustrate that MAMAPBH performs the best in selecting the most beneficial power state for the two APs. Similar results were obtained for the other APs; they are not included here due to lack of space. It should be noted that PG and CG (from [7]) would have always transitioned both AP0 and AP15 to PS2 and PS1 respectively. To compare the accuracy of these heuristics, we created an "Optimal" implementation by using the actual workload of the application after its execution. The power states selected through the optimal scheme are the most beneficial states, as the exact duration of the idle periods is known from the actual workload.
The results are depicted in Table 4. The values report the number of wrong decisions taken by a heuristic as a percentage of the total decisions taken by it. For example, HBH took 15.35% wrong decisions on the selection of the power states for the 'pedestrian' video sequence. The second column shows that the error of HBH is quite high, which corroborates the fact that history based heuristics do not perform well in widely

Video Sequence | HBH   | MinAPBH | MaxAPBH | AvgAPBH | MAMAPBH
pedestrian     | 15.35 | 27      | 7.05    | 20.23   | 1.57
sky            | 52.13 | 20.94   | 14.5    | 21.10   | 1.64
station        | 47.28 | 18.03   | 25.52   | 21.64   | 2.53
sunflower      | 26.41 | 24.51   | 18.09   | 20.75   | 1.43
tractor        | 48.29 | 18.85   | 18.96   | 23.77   | 0.68
7.2 Results & Analysis

We evaluated the adaptive pipelined MPSoC with our DPM scheme for five different HD720p (high definition) video sequences: pedestrian, sky, station, sunflower and tractor. First, we illustrate the capability of each heuristic in choosing the correct power state for the idle APs. Figure 8 shows part of the whole results, where the power state of AP0 and AP15 is plotted for the first 250 iterations for each of the five heuristics. Several notable facts are illustrated in the figure with labels A - E:
∙ Label A illustrates the scenario of incorrect power state transitions by HBH. The duration of the idle periods pointed to by the first two arrows is less than 9 iterations. Hence, AP0 should have been transitioned to PS1; however, the average number of iterations in an idle period according to the current history information was more than 9. Thus, HBH transitioned AP0 to PS2, which is not beneficial. The last arrow points out the converse scenario. Due to the recent short idle periods, the average number of idle iterations (from the history) dropped below 9, resulting in AP0's transition to PS1 instead of PS2 (the correct power state).
∙ Label B illustrates the drawback of MaxAPBH. Recall from Section 5.2 that MaxAPBH considers the first Max(R_1) APs (AP0 - AP3 in our experiments) active during all the iterations. Thus, it is always switching AP0 to PS1 (the least power saving state)
Table 4: Percentage error in the selection of power states by the five heuristics when compared to the optimal case
is a viable option in adaptive pipelined MPSoCs for multimedia applications such as H.264 video encoder.
Figure 9: Relative Energy Savings of the five heuristics compared to PG and CG

varying workloads. Column 6, on the other hand, depicts the error of MAMAPBH, which is always less than 3%. This shows that appropriate leveraging of application information can significantly improve the accuracy of workload prediction, and hence the DPM heuristics. Another interesting fact is that MAMAPBH achieved such accuracy using only fuzzy predictions (ranges of predicted workloads). Availability of better predictions (for example, 10 ranges instead of 3) would have further improved the accuracy of the APBH heuristics. Let us now examine the energy savings of the proposed heuristics. We measured the relative savings of all the schemes (including optimal, PG and CG) to show the improvement in energy savings through our DPM heuristics. We computed the relative saving of a scheme j as (E_j^s - min_∀i {E_i^s}) / min_∀i {E_i^s}, where E_j^s is the energy saving of a scheme over
a design-time balanced pipelined MPSoC¹. In our experiments, either PG or CG had the lowest energy saving, and thus the relative savings depict how much more energy was saved using our heuristics. Figure 9 shows the results, where the energy savings were computed including the overhead of the heuristics. For example, CG (the third bar) saved 36% more energy than PG for the 'pedestrian' video sequence, while PG saved 11% more energy than CG for the 'station' video sequence. It is obvious that MAMAPBH (the last bar) saves the most energy amongst all the heuristics, as it is closest to the optimal (the first bar) for all the video sequences. MAMAPBH was always within 1% of the optimal result. This again shows the significance of leveraging application knowledge for system-level DPM. In terms of overhead, it was found that the heuristics degraded the throughput of the pipelined MPSoC by a maximum of 0.5% compared to PG and CG. Hence, the effectiveness of our DPM can be seen from the fact that MAMAPBH saved up to 40% ('pedestrian' sequence) more energy than the work of [7] with only a 0.5% degradation of the throughput. This shows that application-aware dynamic power management in an adaptive pipelined MPSoC is an efficient technique for low power implementation of multimedia applications.
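The relative-savings metric above can be sketched as follows; the function name and the example numbers are hypothetical, chosen only to reproduce the 36% figure quoted for 'pedestrian'.

```python
def relative_savings(savings):
    # savings: energy saving of each scheme over the design-time balanced
    # MPSoC; the relative saving is measured against the worst scheme,
    # i.e., (E_j - min_i E_i) / min_i E_i.
    base = min(savings.values())
    return {name: (s - base) / base for name, s in savings.items()}
```

For instance, if PG saved 10 units of energy and CG saved 13.6 on some sequence, CG's relative saving would be 0.36, i.e., CG saved 36% more energy than PG.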
9. REFERENCES
[1] L. Benini, A. Bogliolo, and G. De Micheli, "A survey of design techniques for system-level dynamic power management," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 299–316, June 2000.
[2] S. Irani, S. Shukla, and R. Gupta, "Online strategies for dynamic power management in systems with multiple power-saving states," ACM Trans. Embed. Comput. Syst., vol. 2, pp. 325–346, August 2003.
[3] X. Liu, P. J. Shenoy, and M. D. Corner, "Chameleon: Application-level power management," IEEE Trans. Mob. Comput., vol. 7, no. 8, pp. 995–1010, 2008.
[4] K. Agarwal, K. Nowka, H. Deogun, and D. Sylvester, "Power gating with multiple sleep modes," in Proceedings of the 7th International Symposium on Quality Electronic Design, ISQED '06, pp. 633–637, 2006.
[5] S. L. Shee, A. Erdos, and S. Parameswaran, "Heterogeneous multiprocessor implementations for JPEG: a case study," in CODES+ISSS '06: Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, pp. 217–222, ACM, 2006.
[6] S. Carta, A. Alimonda, A. Pisano, A. Acquaviva, and L. Benini, "A control theoretic approach to energy-efficient pipelined computation in MPSoCs," ACM Trans. Embedded Comput. Syst., vol. 6, no. 4, 2007.
[7] H. Javaid, M. Shafique, S. Parameswaran, and J. Henkel, "Low-power adaptive pipelined MPSoCs for multimedia: An H.264 video encoder case study," in Design Automation Conference, DAC '11, 2011.
[8] "Tensilica." Tensilica Inc. (http://www.tensilica.com).
[9] S. L. Shee and S. Parameswaran, "Design methodology for pipelined heterogeneous multiprocessor system," in DAC '07: Proceedings of the 44th Annual Conference on Design Automation, pp. 811–816, ACM, 2007.
[10] H. Javaid, A. Ignjatovic, and S. Parameswaran, "Rapid design space exploration of application specific heterogeneous pipelined multiprocessor systems," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 29, pp. 1777–1789, November 2010.
[11] I. Karkowski and H. Corporaal, "Design of heterogeneous multi-processor embedded systems: applying functional pipelining," in PACT '97: Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, 1997.
[12] A. Alimonda, S. Carta, A. Acquaviva, A. Pisano, and L. Benini, "A feedback-based approach to DVFS in data-flow applications," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 28, no. 11, pp. 1691–1704, 2009.
[13] H. Guo and S. Parameswaran, "Balancing system level pipelines with stage voltage scaling," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, ISVLSI '05, pp. 287–289, IEEE Computer Society, 2005.
[14] W. Kim, M. Gupta, G.-Y. Wei, and D. Brooks, "System level analysis of fast, per-core DVFS using on-chip switching regulators," in HPCA 2008: IEEE 14th International Symposium on High Performance Computer Architecture, pp. 123–134, 2008.
[15] K. K. Rangan, G.-Y. Wei, and D. Brooks, "Thread motion: fine-grained power management for multi-core systems," in International Symposium on Computer Architecture, pp. 302–313, 2009.
[16] M. Shafique, B. Molkenthin, and J. Henkel, "An HVS-based adaptive computational complexity reduction scheme for H.264/AVC video encoder using prognostic early mode exclusion," in Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1713–1718, March 2010.
[17] M. Shafique, L. Bauer, and J. Henkel, "enBudget: A run-time adaptive predictive energy-budgeting scheme for energy-aware motion estimation in H.264/MPEG-4 AVC video encoder," in DATE, pp. 1725–1730, 2010.
[18] B. Zatt, M. Shafique, S. Bampi, and J. Henkel, "An adaptive early skip mode decision scheme for multiview video coding," in Picture Coding Symposium, 2010.
[19] Tensilica, "Xtensa Customizable Processor." http://www.tensilica.com.
[20] Tensilica, "FLIX: Fast relief for performance-hungry embedded applications." http://www.tensilica.com/pdf/FLIX White Paper v2.pdf, 2005.
[21] Tensilica, "XPRES Generated Specialized Operations." http://tensilica.com/pdf/XPRES%201205.pdf, 2005.
[22] T.-C. Chen, C.-J. Lian, and L.-G. Chen, "Hardware architecture design of an H.264/AVC video codec," in Proceedings of the 2006 Asia and South Pacific Design Automation Conference, ASP-DAC '06, IEEE Press, 2006.
[23] J. Leverich, M. Monchiero, V. Talwar, P. Ranganathan, and C. Kozyrakis, "Power management of datacenter workloads using per-core power gating," Computer Architecture Letters, vol. 8, pp. 48–51, Feb. 2009.
[24] T. Tuan, A. Rahman, S. Das, S. Trimberger, and S. Kao, "A 90-nm low-power FPGA for battery-powered applications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 296–300, Feb. 2007.
8. CONCLUSION
In this paper, we proposed a novel dynamic power management (DPM) scheme for adaptive pipelined MPSoCs executing multimedia applications. As part of the DPM, we proposed five heuristics which decide at run-time the appropriate power state for the idle APs of an MP. An analytical model is used to compute the minimum duration of the idle period for a power state to be beneficial, which is then used by the heuristics at run-time. One of the heuristics is guided by the application's execution history, while the other four leverage application information to predict the upcoming workloads. By implementing an advanced multimedia application, the H.264 video encoder supporting HD720p at 30 fps, we illustrated that the MAMAPBH heuristic (an application prediction based heuristic) outperformed all the other heuristics in addition to PG and CG from [7]: MAMAPBH provided up to 40% more energy savings with only a 0.5% degradation of the throughput. The results show that application-aware dynamic power management

¹A design-time balanced pipelined MPSoC is designed for the worst case and does not adapt itself at run-time. Hence, all the processors are active at all times.