Power Consumption Estimation of a C-algorithm: A New Perspective for Software Design

Johann Laurent, Nathalie Julien, Eric Senn, Eric Martin
LESTER, University of South Brittany, France
[email protected]

Abstract

A complete methodology to estimate power consumption at the C-level for off-the-shelf processors is proposed. It relies on the Functional-Level Power Analysis, which results in a power model of the processor; this model describes the consumption variations with respect to algorithmic and configuration parameters. Some parameters can be predicted directly from the C-algorithm with simple assumptions on the compilation. Estimation results are summarized in a consumption map, so that the designer can check the algorithm against the application constraints; maximum and minimum bounds are also provided. Applied to the TI C6x, the estimation method shows a maximum error of 6% against measurements for classical DSP algorithms.
1. Introduction

The software can have a substantial impact on the power dissipation of a system [1]. Two codes can have the same performance but different energy dissipations [2]; to evaluate this efficiently, it is necessary to estimate, along with the performance, the power consumption of an algorithm. A convenient tool for algorithmic power estimation must be simple to use and provide fast and accurate results. There are several benefits to performing the power consumption estimation directly at the C-level. As the feedback to the designer is very fast, he is efficiently guided in his choices. The power consumption of an algorithm can be estimated on different processors without compilation; this allows choosing the best target to meet the constraints before purchasing any specific development tool. Afterwards, the power consumption of different versions of the same algorithm can easily be checked against the application constraints. The clock frequency can also be set for the best trade-off between performance and energy. For off-the-shelf processors, details about the processor architecture are often unavailable. This lack of information prohibits methods based on cycle-level
simulation like in Wattch or SimplePower [3-4]. A classical approach is to evaluate the power consumption with an instruction-level power analysis (ILPA) [5]. This method relies on current measurements for each instruction and for each couple of successive instructions. Its main limitation is the unrealistic number of measurements required for complex architectures [6]. Some approaches have proposed to group instructions [7] or to work on a reduced instruction set [8]; but still, parallelism possibilities are not considered. Finally, recent studies have introduced a functional approach [9-11]. All these methods perform power estimation only at the assembly level, with an accuracy from 2-4% for simple cases to 10% when both parallelism and pipeline stalls are effectively considered. As far as we know, only one, unsuccessful, attempt at algorithmic estimation has been made so far [12]. This paper demonstrates that, unlike the instruction-level methods, our functional approach allows the power estimation of an algorithm directly from the C code, without compilation. A first power estimation method has already been developed and validated at the assembly level, with a maximum error of 3.5% against measurements [13]. This method is divided into two steps: the model definition and the estimation process. The model definition provides a complete power model of the processor, whose inputs are algorithmic and configuration parameters; it is built from a functional analysis of the processor dissipation combined with a reduced set of physical measurements. This model includes phenomena important for the power consumption, like pipeline stalls and cache misses. The estimation process analyzes the code and extracts the required parameters. For the C-level power estimation, we propose to use the same power model of the processor; but, in this case, some algorithmic parameters are predicted from the C code rather than exactly computed from the compiled code. Then, a first estimation is possible with only elementary architectural considerations about the target; this estimation provides the maximum and minimum bounds for the power consumption. By adding simple assumptions on data placement, our method provides estimates with an average error of 4% against physical measurements. When all the
parameters are not directly available at the C-level, a consumption map is supplied to the designer, describing the power variations of the algorithm. The estimation methodology and the model definition are presented in section 2; the Functional Level Power Analysis is explained through a case study, the TMS320C6201 processor. The C-level estimation process is then detailed in section 3, together with the different prediction models defined to evaluate the algorithmic parameters. In section 4, application results for several DSP algorithms are provided: first, the accuracy of the estimation method is validated against physical measurements; then, we show how to use these estimates to guide the algorithm designer. Finally, the conclusion summarizes the limits and possibilities of C-level power estimation, and future work is outlined.
2. Functional Analysis and Power Model

2.1 Estimation framework

The estimation methodology has two linked parts: the model definition and the estimation process (Figure 1).

Fig. 1. The estimation methodology. (Model definition: Processor -> FLPA -> Power Model, fitted on Measurements. Estimation process: C Algorithm -> Prediction models -> Parameters -> C-level Power Estimation.)

The model definition is done only once, before any estimation can begin. It is based on a Functional Level Power Analysis (FLPA) of the processor and provides the power model. This model is a set of consumption rules that describes how the average supply current of the processor core evolves with some algorithmic and configuration parameters. Algorithmic parameters indicate the activity level between the functional blocks of the processor (e.g. parallelism rate or cache miss rate). The estimation process is done every time the consumption of an algorithm has to be evaluated. Power estimation can actually be performed at two different levels. At the assembly level, the algorithmic parameters are computed directly from the compiled code through a simple profiling; these parameters are the inputs of the power model of the processor [13]. At the C level, the algorithmic parameters are not known exactly; they must be predicted from simple assumptions about compilation, defined in the prediction models.

2.2. Case study of the TMS320C6201

The FLPA has been applied to the C6x from Texas Instruments, for which a complete power model has been developed. This processor has been chosen for its complex architecture: a deep pipeline (up to 11 stages), a VLIW instruction set, and parallelism capabilities (up to 8 instructions in parallel). Its internal program memory can be used in cache mode. It also contains an External Memory Interface (EMIF), used to load data and program from the external memory [14]. The FLPA consists in a functional analysis of the architecture from the power point of view. The aim is to determine which parameters are significant for the global power consumption. FLPA results for this processor are summarized in Figure 2.

Fig. 2. FLPA on the C6x. (IMU: FETCH/DP and program memory controller; PU: registers, multiplexers, DC/ALU/MPY; MMU: data memory controller, DMA, EMIF and external memory. The links are labeled with the activity rates α, β, γ, 1-γ, τ, 1-τ, τ-ε and ε.)
The architecture is divided into four blocks: the Instruction Management Unit (IMU), the Processing Unit (PU), the Memory Management Unit (MMU) and the Control Unit (CU). The CU contains every configuration device (control registers for the PLL, DMA, EMIF, etc.). It was not included in the final diagram since its power consumption is relatively negligible in signal processing applications; however, the power dissipation of both the pipeline control and the sequencer is taken into account. Each link in this functional diagram is associated with an algorithmic parameter that expresses the activity rate on the link and its impact on the global power consumption. The parallelism rate α assesses the flow between the FETCH stages and the internal program memory controller inside the IMU. The processing rate β between the IMU and the PU represents the utilization rate of the processing units (ALU, MPY). The activity rate between the IMU and the MMU is expressed by the program cache miss rate γ. The parameter τ corresponds to the external data memory access rate. The parameter ε stands for the activity rate
between the data memory controller and the Direct Memory Access (DMA). The DMA may be used for fast transfers of large data blocks from external to internal memory; obviously, ε = 0 if the DMA is unused. Although it does not appear in this functional diagram, the clock tree is included in our model; since the clock frequency can reach 200 MHz, it makes an important contribution to the global power dissipation.
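As a simple illustration, these rates can be gathered in a small record when building an estimation tool. The C sketch below is ours and purely illustrative; the names and layout are not part of the original model or of the TI tool chain:

    /* Activity rates identified by the FLPA for one code section (illustrative). */
    typedef struct {
        double alpha;   /* parallelism rate (IMU fetch activity)           */
        double beta;    /* processing-unit utilization rate (PU activity)  */
        double gamma;   /* program cache miss rate (IMU <-> MMU activity)  */
        double tau;     /* external data memory access rate                */
        double epsilon; /* DMA activity rate; 0 when the DMA is unused     */
    } flpa_rates_t;

    /* Configuration parameters chosen by the designer. */
    typedef enum { MODE_MAPPED, MODE_BYPASS, MODE_CACHE, MODE_FREEZE } mem_mode_t;
    typedef struct {
        double     f_mhz;    /* clock frequency, up to 200 MHz on the C6201 */
        mem_mode_t mem_mode; /* internal program memory mode                */
    } config_t;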
2.3. Power Model of the processor

Once the functional analysis is achieved, consumption rules have to be precisely determined to obtain the complete power model. These rules are mathematical functions of the algorithmic and configuration parameters. To determine these functions and their coefficients, the average supply current of the processor core, ITOTAL, was measured while each parameter was varied; small programs written in assembly language were used to make the parameters vary. The consumption rules were finally obtained by fitting the measurements. Since the choice of the external memory fully relies on the designer, a specific memory model like in [9,15] would be necessary to estimate its consumption. The algorithmic parameters in Figure 2 are α, β, γ, τ and ε. Since the DMA is not modeled yet, ε was set to 0; future work will include the DMA in the power model. In fact, the remaining parameters α, β, γ and τ are not fully independent: γ and τ directly impact the number of pipeline stalls, and thus modify the average parallelism rate and the average number of processing units used. New parameters α' and β' are therefore defined as the effective values of α and β related to the pipeline stall rate (PSR):

α' = α (1 - PSR)  and  β' = β (1 - PSR)    (1)

The effect of the parameter τ is totally included in the PSR, so there is no need to keep τ in the power model. As a result, the final power model has only 3 algorithmic parameters as inputs: α', β' and γ. The consumption rules obtained for the TMS320C6201 are given in Table 1. The constant values of the coefficients ai, bi, ci, di, ei and fi can be found in [13], together with details on their determination. These rules are composed of linear functions of both algorithmic and configuration parameters. The configuration parameters are the clock frequency (F) and the memory mode (MM). Four different memory modes are available. In the mapped mode (M), all the instructions are in the internal program memory; conversely, in the bypass mode (B), all the instructions are in external memory. Otherwise, a direct-mapped cache is used, in either the cache mode (C) or the freeze mode (F); in the latter mode, no writing into the cache is allowed.
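For illustration, with purely hypothetical values α = 0.6, β = 0.5 and PSR = 20%, equation (1) gives α' = 0.6 × (1 - 0.2) = 0.48 and β' = 0.5 × (1 - 0.2) = 0.40.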
Table 1. Consumption Rules.

MM | Consumption rule
M  | ITOTAL = aβ'F + (amα' + bm)(cmF + dm)
B  | ITOTAL = (aβ' + bb)F + cb
C  | ITOTAL = aβ'F + (acα' + bc)(ccγ + dc)(ecF + fc)
F  | ITOTAL = aβ'F + (afα' + bf)(cfγ + df)(efF + ff)
The dependence between parameters implies that our expressions are more complex than those derived from a linear regression analysis; this could explain the large errors obtained from a simple linear regression model in [12]. The static contribution, known to be a non-negligible part of the power dissipation, appears explicitly in the consumption rules. Furthermore, the clock frequency F is a configuration parameter, so the designer can also control the global energy consumption by tuning this parameter [16]. Finally, the global power consumption P for the application is computed as follows:

P = VDD * ITOTAL    (2)

where VDD is the supply voltage (2.5 V) of the C6x DSP core. The energy consumption is obtained by multiplying P by the execution time.
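To make the use of these rules concrete, the minimal C sketch below evaluates the mapped-mode rule of Table 1 and then applies equations (1) and (2). The coefficient and parameter values are placeholders of ours (the fitted coefficients are published in [13]); only the structure of the computation reflects the model.

    #include <stdio.h>

    /* Mapped-mode rule of Table 1; the coefficients are PLACEHOLDERS here,
     * the fitted values for the TMS320C6201 are given in [13]. */
    static double i_total_mapped(double alpha_eff, double beta_eff, double f_mhz)
    {
        const double a = 0.0, am = 0.0, bm = 0.0, cm = 0.0, dm = 0.0; /* placeholders */
        return a * beta_eff * f_mhz + (am * alpha_eff + bm) * (cm * f_mhz + dm);
    }

    int main(void)
    {
        const double alpha = 0.6, beta = 0.5, psr = 0.2;  /* illustrative parameter values   */
        const double f_mhz = 200.0, vdd = 2.5;            /* nominal frequency, core voltage */
        const double t_exe = 1.0e-3;                      /* execution time in s (profiled)  */

        const double alpha_eff = alpha * (1.0 - psr);     /* equation (1) */
        const double beta_eff  = beta  * (1.0 - psr);

        const double i_total = i_total_mapped(alpha_eff, beta_eff, f_mhz);
        const double p = vdd * i_total;                   /* equation (2), power in W */
        const double e = p * t_exe;                       /* energy in J              */

        /* With the zero placeholders the result is 0; fill in the values from [13]. */
        printf("P = %.3f W, E = %.6f J\n", p, e);
        return 0;
    }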
3. Estimation Method

As stated before, power estimation with FLPA can be done at two levels: from the C algorithm, or after compilation from the assembly code. Initial work tackled estimation from the assembly code; the methodology was then settled, with the power model and the way to define it for a given processor. Afterwards, the estimation methodology was extended to the C-level. This section presents how the algorithmic parameters are predicted from the C code. Among the algorithmic parameters, the pipeline stall rate PSR and the cache miss rate γ cannot be determined easily at the C level. In some particular cases, for instance in the mapped memory mode where γ = 0, these parameters can be evaluated. Otherwise, these two parameters are obtained from the Texas Instruments tools after compilation and profiling, along with the execution time TEXE. Section 4 shows how a consumption map of the algorithm is provided when γ and the PSR are still undetermined at this early step of the design process. The two remaining parameters to determine are α and β. In the C6x, 8 instructions are fetched at the same time; they form a fetch packet (FP). In this fetch packet, operations are gathered into execution packets (EP) depending on the available resources and the parallelism capabilities. The parameters α and β are computed as follows:
α = NFP / NEP ≤ 1 ;  β = NPU / (NPUMAX × NEP) ≤ 1    (3)

In these expressions, NFP and NEP stand for the average number of fetch and execution packets, respectively. NPU is the average number of processing units used per cycle (counting every instruction except the NOP), and NPUMAX is the maximum number of processing units that can be used at the same time in the DSP; for the C6x, NPUMAX = 8. So, to estimate α and β, the three parameters NFP, NEP and NPU have to be predicted from the algorithm. Clearly, this prediction must rely on a model that anticipates the way the code is executed on the target. According to the processor architecture, and with a little knowledge of the compiler, four prediction models were defined. The sequential model (SEQ) is the simplest, since it assumes that all the operations are executed sequentially; this model is only realistic for a non-parallel processor. The maximum model (MAX) corresponds to the case where the compiler fully exploits all the architecture possibilities. In the C6x, 8 operations can be done in parallel, for example 2 loads, 4 additions and 2 multiplications in one clock cycle. This model gives a maximum bound for the application consumption. The minimum model (MIN) is more restrictive than the MAX model, since it assumes that load and store instructions are never executed at the same time; indeed, it was noticed on compiled code that the parallelism capabilities were not always fully exploited for these instructions. This gives a reasonable lower bound for the algorithm's power consumption. Finally, the data model (DATA) expresses more accurately the parallelism of load and store instructions. It works almost like the MAX model; the only difference is that it assumes that loads and stores are executed in parallel only if they involve data from different memory banks. Indeed, there are two banks in the C6x internal data memory, and both can be accessed in one clock cycle; as a result, two loads and/or stores can be performed in parallel as long as the accessed data are not in the same bank. As an illustration, consider the simple example below:

    for (i = 0; i < 512; i++)
        Y = X[i] * (H[i] + H[i+1] + H[i-1]) + Y[i];

In this loop nest, there are 4 loads (LD) and 4 other operations (OP): 1 multiplication and 3 additions. Operations at the beginning or at the end of the loop body are neglected; in particular, the final store of Y, done only once at the end of the loop, is not considered. Here, the 8 operations are always gathered in one single FP, so NFP = 1. Because no NOP operation is involved, NPU = 8, and α and β take the same value.
In the SEQ model, instructions are assumed to be executed sequentially; then NEP = 8, and α = β = 0.125. Results for the other models are summarized in Table 2.

Table 2. Prediction models for the example.

Model | EP1  | EP2        | EP3        | EP4        | α = β
MAX   | 2 LD | 2 LD, 4 OP | -          | -          | 0.5
MIN   | 1 LD | 1 LD       | 1 LD       | 1 LD, 4 OP | 0.25
DATA  | 2 LD | 1 LD       | 1 LD, 4 OP | -          | 0.33
Of course, realistic cases are more elaborate: parameter prediction has to be done for each part of the program (loop, subroutine, ...), for which local values are obtained. The global parameter values, for the complete C source, are computed by averaging all the local values. Such an approach also permits spotting "hot points" in the program.
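As an illustration of equation (3), the sketch below recomputes α and β for the example loop under the four prediction models; the NEP values (8, 4, 3 and 2) are those of Table 2, while the code itself is only a sketch of ours, not part of any existing tool.

    #include <stdio.h>

    #define NPU_MAX 8.0  /* up to 8 processing units usable per cycle on the C6x */

    /* Equation (3): alpha = NFP / NEP, beta = NPU / (NPU_MAX * NEP). */
    static void predict_rates(double nfp, double nep, double npu,
                              double *alpha, double *beta)
    {
        *alpha = nfp / nep;
        *beta  = npu / (NPU_MAX * nep);
    }

    int main(void)
    {
        const double nfp = 1.0, npu = 8.0;  /* example loop: 1 FP, 8 useful operations */
        const char  *model[] = { "SEQ", "MIN", "DATA", "MAX" };
        const double nep[]   = { 8.0, 4.0, 3.0, 2.0 };  /* NEP under each model (Table 2) */

        for (int i = 0; i < 4; i++) {
            double alpha, beta;
            predict_rates(nfp, nep[i], npu, &alpha, &beta);
            /* Here alpha and beta coincide, since NPU = NPU_MAX and NFP = 1. */
            printf("%-4s: alpha = %.3f, beta = %.3f\n", model[i], alpha, beta);
        }
        return 0;
    }

Running this sketch reproduces the values 0.125, 0.25, 0.33 and 0.5 given above for the SEQ, MIN, DATA and MAX models.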
4. Applications

First, the estimation method at the C-level is validated by direct comparison with measurements. A maximum error of 6% is reported, which is quite satisfying at this level. Next, an application of this method to explore the power consumption of an algorithm is proposed.
4.1 Estimation validation

In this section, our prediction models are applied to classical digital signal processing algorithms: a FIR filter, an FFT, an LMS filter, a Discrete Wavelet Transform (DWT) with two different image sizes, and an Enhanced Full Rate (EFR) vocoder for GSM. Results are presented for different memory modes (mapped, cache and bypass) and data placements (external or internal memory). Estimates for different implementations of these algorithms are compared with physical consumption measurements; the purpose here is to validate the C-level estimation method by assessing its accuracy. First, the α and β algorithmic parameters are predicted from the C code as presented above. Second, the code is compiled and a profiling is made with the help of the TI development tools to get the PSR and the execution time TEXE. Third, γ is determined. For these typical digital signal processing applications, the assembly code fits in the internal program memory of the C6x; hence γ = 0 (see [13] for the power model validation at the assembly level with various values of the cache miss rate). Finally, the global power consumption is computed from α, β, γ and the PSR, as well as the energy with TEXE. For every algorithm, the clock frequency F is 200 MHz (the nominal frequency for the TMS320C6201). Table 3 summarizes the results for all the algorithms.
Table 3. Comparison between measurements and power estimation. TEXE, P and Energy are measured; the SEQ, MAX, MIN and DATA columns give the power (W) estimated with each prediction model.

Application   | MM* | INT/EXT** | TEXE      | P (W)  | Energy   | SEQ   | MAX   | MIN   | DATA  | Error
FIR           | M   | INT       | 6.885 µs  | 4.5    | 30.98 µJ | 2.745 | 4.725 | 3.015 | 4.725 | +5%
FFT           | M   | INT       | 1.389 ms  | 2.65   | 3.68 mJ  | 2.36  | 2.97  | 2.57  | 2.58  | -2.5%
LMS           | B   | INT       | 1.847 s   | 4.97   | 9.18 J   | 5.02  | 5.12  | 5.07  | 5.12  | +3%
LMS           | C   | INT       | 165.75 ms | 5.665  | 939 mJ   | 2.55  | 6     | 4.76  | 6     | +6%
DWT 64*64     | M   | INT       | 2.32 ms   | 3.755  | 8.71 mJ  | 2.82  | 4.24  | 3.27  | 3.53  | -6%
DWT 64*64     | M   | EXT       | 9.19 ms   | 2.55   | 23.46 mJ | 2.295 | 2.63  | 2.4   | 2.46  | -3.5%
DWT 512*512   | M   | EXT       | 577.77 ms | 2.55   | 1.473 J  | 2.27  | 2.61  | 2.37  | 2.45  | -4%
EFR VOCODER   | M   | INT       | 39 µs     | 5.0775 | 198 µJ   | 2.54  | 5.636 | 3.86  | 5.13  | +1%
AVERAGE ERROR |     |           |           |        |          | 25%   | 7%    | 13%   | 4%    |

* MM: memory mode, with M: mapped, B: bypass and C: cache
** INT/EXT: data in internal/external memory
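As a quick consistency check on Table 3: for the FIR filter, the measured energy is indeed P × TEXE = 4.5 W × 6.885 µs ≈ 31 µJ, and the DATA estimate deviates from the measured power by (4.725 - 4.5) / 4.5 ≈ +5%, which matches the value in the Error column (the per-row error thus appears to refer to the DATA estimate); averaging the absolute errors of the eight rows for each model yields the last row.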
The relative error between estimation and measurement is given for the four models (MAX, DATA, MIN and SEQ). Of course, the SEQ model gives the worst results, since it does not take into account the architecture possibilities (parallelism, several memory banks, etc.). In fact, this model was developed to explore the estimation possibilities without any knowledge about the architecture of the targeted processor; it appears that such an approach cannot provide enough accuracy to be satisfying. It is remarkable that, for the LMS in bypass mode, every model overestimates the power consumption with close results. This exception can be explained by the fact that, in this marginal memory mode, every instruction is loaded from the external memory and pipeline stalls are therefore dominant; as the SEQ model assumes sequential operations, it is the most accurate in this mode. For all the other algorithms, the MAX and the MIN models always overestimate and underestimate the application power consumption, respectively. Hence, the proposed models need only a restricted knowledge of the processor architecture, but they guarantee bounds on the power consumption of a C algorithm with reasonable errors.
4.2 Algorithm Power Consumption Exploration

At the C-level, both the pipeline stall rate (PSR) and the cache miss rate (γ) are unknown. However, a consumption map is provided to the programmer. This map represents the power variations of the algorithm according to the PSR and/or γ. Moreover, in many applications, the designer can evaluate the realistic domain of variation for these two parameters. It is thus possible to locate, on the consumption map, the most probable power consumption limits. This exploration
can be done for all the models (MAX, MIN, DATA or SEQ). For the mapped or bypass mode, because γ = 0, exploration results are summarized in a curve, whereas for the cache or freeze mode, exploration results draw an area. Furthermore, it can be noticed that most current embedded applications have a program size (after compilation) easily contained in the internal program memory of the C6x (64 Kbytes), which also gives γ = 0. For the EFR vocoder, Figure 3 represents the power consumption exploration through all the estimation models in mapped mode.

Fig. 3. Power consumption exploration for the vocoder. (Power (W) versus PSR (%), from 0 to 90%, for the MAX, DATA, MIN and SEQ models, together with the measured value.)

Of course, the PSR cannot be equal to 100%, since in this case no operation would be executed. Obviously, the average power consumption decreases when the PSR gets higher; at the same time, as the execution time rises, the global energy increases too. Moreover, the minimum and maximum bounds of the estimation become closer, because the PSR dominates the global power consumption by lowering the parallelism rate. The measured value, very close to the DATA model, is also represented. For the cache mode and the DATA model, results are presented in Figure 4. Here, in addition, the cache miss rate γ can vary from 0 to 100%. The minimum power consumption is obtained for γ = 0% and the maximum PSR; indeed, for these values, α' and β' are minimum. The maximum power consumption is obtained when γ = 100% and PSR = 0%; actually, this case is unrealistic, since γ provokes pipeline stalls through external memory accesses. The power consumption exploration points out whether the algorithm respects the application consumption constraints (in terms of energy and/or power). Since at the C level the execution time TEXE is unknown, the energy can be evaluated from the execution time constraint (given by the code programmer). If the algorithm consumption estimation is always under the constraints, then the C code is suitable. Otherwise, the programmer can focus on the most dissipating parts of the algorithm, spotted through the α and β parameters; these parts can be modified, keeping in mind that the memory mapping strongly affects pipeline stalls and cache misses. At last, several versions of the same algorithm can be efficiently compared through their consumption maps.
Fig. 4. Power consumption exploration for the vocoder application in cache mode. (Power (W) as a function of the PSR (%) and of the cache miss rate (%).)
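To give an idea of how such a consumption map can be produced, the sketch below sweeps the PSR and the cache miss rate and evaluates the cache-mode rule of Table 1. The coefficients, the α and β values and the sweep steps are placeholders of ours; only the structure of the computation follows the model.

    #include <stdio.h>

    /* Cache-mode rule of Table 1; coefficients are PLACEHOLDERS (fitted values in [13]). */
    static double i_total_cache(double alpha_eff, double beta_eff,
                                double gamma, double f_mhz)
    {
        const double a = 0.0, ac = 0.0, bc = 0.0, cc = 0.0, dc = 0.0, ec = 0.0, fc = 0.0;
        return a * beta_eff * f_mhz
             + (ac * alpha_eff + bc) * (cc * gamma + dc) * (ec * f_mhz + fc);
    }

    int main(void)
    {
        const double alpha = 0.6, beta = 0.5;   /* predicted at the C level (illustrative) */
        const double f_mhz = 200.0, vdd = 2.5;  /* nominal frequency and core voltage      */

        for (int i = 0; i <= 9; i++) {          /* PSR from 0% to 90% (100% is meaningless) */
            for (int j = 0; j <= 5; j++) {      /* cache miss rate from 0% to 100%          */
                const double psr   = i / 10.0;
                const double gamma = j / 5.0;
                const double a_eff = alpha * (1.0 - psr);  /* equation (1) */
                const double b_eff = beta  * (1.0 - psr);
                const double p = vdd * i_total_cache(a_eff, b_eff, gamma, f_mhz);
                printf("PSR = %3.0f %%  gamma = %3.0f %%  P = %.3f W\n",
                       100.0 * psr, 100.0 * gamma, p);
            }
        }
        return 0;
    }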
5. Conclusion
In this paper, the conditions to perform an accurate power consumption estimation of a C algorithm have been demonstrated: (i) determining the power consumption precisely without any knowledge about the targeted processor is not possible; (ii) by considering only the architecture possibilities, maximum and minimum bounds of the power consumption for a DSP algorithm are obtained; (iii) the most accurate model includes elementary information on both the architecture and the data placement, and provides an estimation with a maximum error of 6% against measurements. As cache misses and pipeline stalls are directly related to the quality of the code, they are difficult to control from the C-level; a consumption map is therefore provided, representing the consumption behavior of the C algorithm relatively to these parameters. It then allows the designer to verify whether the application constraints are respected. Current work includes the development of an automatic tool and the implementation of the FLPA method on other processors. Future work will concern the addition of a generic memory model to take the power consumption of the external memory into account in our estimation; the model should also be completed with a model of the DMA.
References

1. K. Roy, M. C. Johnson, "Software Design for Low Power," in NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, NATO ASI Series, Aug. 1996, chap. 6.3.
2. M. Valluri, L. John, "Is Compiling for Performance == Compiling for Power?," presented at the 5th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT-5), Monterey, Mexico, Jan. 2001.
3. W. Ye, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, "The Design and Use of SimplePower: A Cycle Accurate Energy Estimation Tool," in Proc. Design Automation Conf., June 2000, pp. 340-345.
4. D. Brooks, V. Tiwari, M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," in Proc. ISCA, June 2000, pp. 83-94.
5. V. Tiwari, S. Malik, A. Wolfe, "Power analysis of embedded software: a first step towards software power minimization," IEEE Trans. VLSI Systems, vol. 2, Dec. 1994, pp. 437-445.
6. B. Klass, D. E. Thomas, H. Schmit, D. F. Nagle, "Modeling Inter-Instruction Energy Effects in a Digital Signal Processor," presented at the Power Driven Microarchitecture Workshop at ISCA, Barcelona, Spain, June 1998.
7. M. T.-C. Lee, V. Tiwari, S. Malik, M. Fujita, "Power Analysis and Minimization Techniques for Embedded DSP Software," IEEE Trans. VLSI Systems, vol. 5, no. 1, March 1997, pp. 123-135.
8. C. Brandolese, W. Fornaciari, F. Salice, D. Sciuto, "An Instruction-Level Functionality-Based Energy Estimation Model for 32-bits Microprocessors," in Proc. Design Automation Conf., June 2000, pp. 346-351.
9. S. Steinke, M. Knauer, L. Wehmeyer, P. Marwedel, "An Accurate and Fine Grain Instruction-Level Energy Model Supporting Software Optimizations," in Proc. PATMOS, Sept. 2001, pp. 3.2.1-3.2.10.
10. L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, R. Zafalon, "A Power Modeling and Estimation Framework for VLIW-based Embedded Systems," in Proc. PATMOS, Sept. 2001, pp. 2.3.1-2.3.10.
11. G. Qu, N. Kawabe, K. Usami, M. Potkonjak, "Function-Level Power Estimation Methodology for Microprocessors," in Proc. Design Automation Conf., June 2000, pp. 810-813.
12. C. H. Gebotys, R. J. Gebotys, "An Empirical Comparison of Algorithmic, Instruction, and Architectural Power Prediction Models for High Performance Embedded DSP Processors," in Proc. ACM Int. Symp. on Low Power Electronics Design, Aug. 1998, pp. 121-123.
13. J. Laurent, E. Senn, N. Julien, E. Martin, "High Level Energy Estimation for DSP Systems," in Proc. PATMOS, Sept. 2001, pp. 311-316.
14. TMS320C6x User's Guide, Texas Instruments Inc., 1999.
15. S. L. Coumeri, D. E. Thomas, "Memory Modeling for System Synthesis," IEEE Trans. VLSI Systems, vol. 8, no. 3, June 2000, pp. 327-334.
16. A. Sinha, A. P. Chandrakasan, "JouleTrack - A Web Based Tool for Software Energy Profiling," in Proc. Design Automation Conf., June 2001, pp. 220-225.