A Multi-Model Engine for High-Level Power Estimation Accuracy Optimization

Felipe Klein, Student Member, IEEE, Roberto Leao, Guido Araujo, Member, IEEE, Luiz Santos, Member, IEEE, and Rodolfo Azevedo, Member, IEEE
Abstract—Register transfer level (RTL) power macromodeling is a mature research topic with a variety of equation and table-based approaches. Despite its maturity, macromodeling is not yet widely accepted as a de facto industrial standard for power estimation at the RT level. Each approach has many variants depending upon the parameters chosen to capture power variation. Every macromodeling technique has some intrinsic limitation affecting either its performance or its accuracy. Therefore, alternative macromodeling methods can be envisaged as part of a power modeling toolkit from which multiple models for a given component could be exploited so as to reduce the estimation errors resulting from conventional single-model approaches. This paper describes two different approaches for a new multi-model power estimation engine. The first one selects the macromodeling technique that leads to the least estimation error, for a given system component, depending on the properties of its input-vector stream. A proper selection function is built after component characterization and used during estimation. Though simple, this approach has revealed a substantial improvement in estimation accuracy. The second one builds a power estimate function that captures the correlation between individual macromodel estimates and input-stream properties. Experimental results show that our multi-model engine improves the robustness of power analysis with negligible usage overhead. Accuracy becomes seven times better on average, as compared to conventional single-model estimators, while the overall maximum estimation error is divided by 9.

Index Terms—High-level power estimation, low-power design, macromodeling techniques, power macromodel, register transfer level (RTL), VLSI.
I. INTRODUCTION

THE offer of increasing amounts of ever-shrinking transistors and the demand for battery-powered mobile devices have made low power consumption one of the most important goals in contemporary VLSI design. Power modeling is the keystone for the whole building of power-conscious design tools. Power macromodeling is considered today the state-of-the-art paradigm for power estimation at the register transfer level (RTL) [1], with a variety of equation- and table-based approaches [2]–[17].
Manuscript received June 28, 2008; revised November 16, 2008. First published March 10, 2009; current version published April 22, 2009. This work was supported in part by CAPES (0326054), by FAPESP (2005/02565-9), and by CNPq.
F. Klein, G. Araujo, and R. Azevedo are with the Institute of Computing, State University of Campinas, 13084-971 Campinas, Sao Paulo, Brazil (e-mail: [email protected]; [email protected]; [email protected]).
R. Leao is with Suntech Intelligent Solutions, Florianopolis, SC 88015-400, Brazil (e-mail: [email protected]).
L. Santos is with the Computer Science Department, Federal University of Santa Catarina, 88010-970 Florianopolis, Santa Catarina, Brazil (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2009.2013627
Each approach has many variants depending upon the parameters chosen to capture power variation (e.g., signal probability, transition density, spatial correlation). Despite its maturity, macromodeling is not yet widely accepted as a de facto industrial standard for power estimation at the RTL. This paper claims that one of the main reasons impairing its widespread use is that each macromodeling technique makes assumptions that lead to some sort of intrinsic limitation, thereby affecting its accuracy. In [5], for instance, signal statistics are assumed to be uniformly distributed among all inputs and outputs, although input signal imbalances may result in significant errors [18]; the training set is limited to a single input stream per characterized point in [2], even though the choice of a proper set is crucial in obtaining high-quality models [19].

Most power estimation tools rely on a single macromodeling technique. However, given a system component, there might exist an alternative technique leading to a more accurate estimate. Therefore, the exploitation of multiple macromodeling techniques from a pool of supported macromodels could become a way of optimizing the overall power estimation accuracy. Although related works [20], [21] report the use of multiple macromodels to improve power assessment performance, few approaches [17] have been proposed to address the impact of multiple models on power evaluation accuracy. This motivated us to develop a new multi-model power estimation engine in order to exploit multiple models built with distinct power macromodeling techniques. We propose the following two different approaches in this paper.

1) The macromodeling technique leading to the least estimation error is selected for each system component and for each input-vector stream. A proper selection function is constructed immediately after component characterization and used during estimation. Given an input-vector stream stimulating an RTL component, the selection function chooses the most suitable macromodel for a system component according to the input stream's signal probability and transition density.

2) A power estimate function is built immediately after component characterization. This function captures the correlation between individual macromodel estimates and input-vector properties, and it is built by means of a nonlinear regression technique.

Our engine relies on a preexistent framework called PowerSC [22], which is a SystemC [23] extension obtained by adding power-aware C++ classes.

The remainder of this paper is organized as follows. Section II reviews related work. The PowerSC framework is described in Section III.
Section IV-A selects four distinct macromodels for experimentation and summarizes their key ideas. Sections IV-B and IV-C provide evidence that macromodel-based estimation may incur unacceptably high errors and pinpoint the sources of inaccuracy by means of a few examples. Section V then describes how our multi-model engine works to improve accuracy. Section VI shows experimental results comparing the accuracy obtained with different macromodeling techniques. Our conclusions are drawn in Section VII.

II. RELATED WORK

Most of the techniques used in today's power-aware tools are based on the concept of macromodel [7], [24], which is an abstract model obtained by measuring the power consumption of existing implementations with the help of lower level methods. Essentially, macromodeling techniques can be divided into three categories: equation-based, table-based, and hybrid.

Most equation-based techniques make use of statistical methods for model building (e.g., regression analysis). For instance, least square estimation is employed in [10] to derive the fitting coefficients. One advantage of statistical approaches is that they inherit a solid mathematical foundation and are therefore capable of providing a good level of confidence for the estimates, if adequate training sets are used. Obviously, the more an input stream is similar to the training streams, the higher the confidence level. Many other techniques belonging to this category can be found in the literature [8], [12].

The first table-based technique was proposed in [5]. Basically, it consists of a 3-D table, where the axes represent input/output signal statistics, and a table entry represents a power value. The original technique was improved by including a new parameter in the model, resulting in a 4-D table [15]. Several other works belonging to this category have been proposed in the literature [3], [9].

The technique proposed in [2] lies in the hybrid category. It consists of a 2-D table, which is indexed by transition density and signal probability values. For each entry in the table, an equation is built by means of linear regression.

A recent work [20] describes a multi-model approach for system-on-chip (SoC) power analysis aiming at improving power simulation performance. This approach is based upon multiple models at distinct levels of granularity. The coarser the granularity, the fewer parameters are needed for estimation, thereby reducing the estimation overhead. The finer the granularity, the higher the accuracy, but the higher the overhead. A model is dynamically chosen for a component according to its relative power contribution to the overall system power. In other words, if a component does not contribute much to the system consumption, a coarser model is selected since a higher error can be accommodated. The work in [21] trades off simulation performance and accuracy at runtime as well, with similar goals as in [20].

As opposed to those works, the approach in [17] exploits multiple models at the same level of granularity, though using a single macromodeling technique. This is particularly different from our approaches, since we exploit multiple models built with distinct macromodeling techniques. Our motivation for that is based upon the fact that most macromodeling techniques make assumptions that lead to some sort of limitation.
Fig. 1. PowerSC design flow.
A more comprehensive discussion on that is presented in Section IV-C.

Potlapally et al. [17] observe that different stimulation patterns make circuits exhibit distinct power behavior, which is referred to as power modes. Then, given a circuit and a base power model, stimulation patterns with the same power mode are clustered together and used to create derived power models, one for each cluster. The selection functions, namely, the power-mode identification function and the power-mode identification tree, are constructed by extracting information from the internal structure of the RTL circuit. As opposed to that work, the functions built by the multi-model approaches proposed in this paper do not depend on the internal structure of the circuit. Nevertheless, like the method proposed in [17], our proposed approaches exploit multiple models to reduce the estimation errors.

The preexisting SystemC-based framework used in this work is introduced in Section III.

III. POWERSC FRAMEWORK

Fig. 1 shows two complementary and orthogonal flows. On the right side, the usual purely functional SystemC flow is sketched. On the left side, our power-aware flow is illustrated in more detail. Regardless of the adopted flow, the starting point is a SystemC description. The flow to be followed is determined by configuration files (e.g., Makefiles), which instruct the C++ compiler to generate either a conventional executable specification (by linking the SystemC library only) or an augmented executable specification (by linking both the SystemC and the PowerSC libraries).

The first step in our methodology is to compile the PowerSC executable specification, which is instrumented to gather signal statistics during simulation.
In the next step, simulation is launched and, as a consequence of proper instrumentation, design elements are monitored and power information is dynamically recorded. At simulation completion, the resulting information is summarized in power reports. Reported bottlenecks can be employed to prompt design refinements. Refined designs should undergo as many iterations through the PowerSC flow as required to reach adequate power values. Notice that the user can easily switch from one flow to another as many times as needed to optimize power and satisfy design constraints.

One of the keystones of the PowerSC framework is the power modeling support at different levels of abstraction. This support is given by the PowerSC macromodeling API, which consists of a set of C++ classes for the integration of distinct modeling techniques. In order to add a new power model to the library, two classes must be derived from the corresponding PowerSC base classes. The internal details are hidden from the user, who must only create the code for the following virtual functions, used internally by PowerSC to evaluate power.

• The initialization function: this function initializes the internal structure (defined by the user) with the power information from the characterization phase.
• The power computation function: this function has the details on how to compute the dissipated power based on its parameters. For instance, the attributes of the derived class could be the signal statistics. PowerSC internally calls this function in order to generate the power reports from a specific power model.

A distinct derived class is automatically created for each library component, since the macromodeling techniques usually require a specific behavior for each component (tables initialized with different values, different power equations, and so on). These distinct classes are also useful when the designer wants to use a different macromodel methodology for each component. To show how effective this support is, we implemented four different RTL macromodeling techniques through this API, as described in Section IV.
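To make the API description above more concrete, the following minimal sketch shows how a table-based component model could be plugged in through a pair of derived classes. The identifiers used here (psc_macromodel, psc_macromodel_parms, init_power_map, get_power) are placeholders chosen for illustration, since the actual PowerSC class and method names are not reproduced in this text; only the structure, one initialization hook and one power-evaluation hook, follows the description above.

```cpp
#include <map>
#include <utility>

// Hypothetical stand-ins for the PowerSC base classes described in the text:
// one class carries the parameters collected during simulation, the other
// exposes the two virtual functions used internally to evaluate power.
struct psc_macromodel_parms {
    double pin = 0.0;   // average input signal probability
    double din = 0.0;   // average input transition density
    virtual ~psc_macromodel_parms() = default;
};

class psc_macromodel {
public:
    virtual void   init_power_map() = 0;                                // load characterization data
    virtual double get_power(const psc_macromodel_parms& p) const = 0;  // evaluate the model
    virtual ~psc_macromodel() = default;
};

// A minimal table-based macromodel for one library component.
class adder_macromodel : public psc_macromodel {
public:
    // Fill the internal structure with power values obtained during
    // characterization (hard-coded here; normally read from a library file).
    void init_power_map() override {
        table_[{0.5, 0.2}] = 310.0e-6;   // (Pin, Din) -> power [W]
        table_[{0.5, 0.6}] = 655.0e-6;
    }
    // Return the power of the closest characterized (Pin, Din) grid point.
    double get_power(const psc_macromodel_parms& p) const override {
        double best = 0.0, best_d2 = 1e30;
        for (const auto& [key, pw] : table_) {
            double dx = key.first - p.pin, dy = key.second - p.din;
            double d2 = dx * dx + dy * dy;
            if (d2 < best_d2) { best_d2 = d2; best = pw; }
        }
        return best;
    }
private:
    std::map<std::pair<double, double>, double> table_;
};
```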
IV. SINGLE-MODEL LIMITATIONS

The quality of a macromodel is strongly dependent on the training set. For instance, since the macromodel in [5] assumes that signal statistics are uniformly distributed among inputs and outputs, input signal imbalances may result in significant errors, as pointed out in [18]. Furthermore, the macromodel proposed in [2] restricts the training set to a single input stream per characterized point. However, the choice of a proper set is crucial to obtaining a high-quality model. As shown in [19], while some training sets lead to high-quality models (average errors around 6%), others result in unacceptably poor ones (average errors around 660%). The sensitivity to the training set and some inherent macromodeling assumptions (e.g., uniform probability distribution [5], lack of modeling for output signal properties [8]) lead to intrinsic limitations that degrade estimation accuracy.

This section provides the reader with some preliminary evidence that macromodel-based estimation may incur unacceptably high errors and then pinpoints the sources of inaccuracy by means of a few examples, so as to justify the claim that no macromodeling technique is robust enough to be employed alone. Therefore, this section paves the way to Section V, where we show how our engine can automatically handle multiple macromodels to improve the overall estimation accuracy.

A. Macromodeling Background

We selected four well-known macromodeling techniques as candidates for experimental analysis. From now on, they will be referred to as 4DTab [15], EqTab [2], eHD [8], and Analytical [12]. The rationale behind this selection is threefold: 1) these techniques model power variation in very distinct ways, making different assumptions that are specific to each one of them (i.e., they complement each other); 2) the published results demonstrate that they present, in general, good accuracy; and 3) their implementation complexity makes their integration into an automated design flow quite feasible. This section summarizes some background on the selected techniques, such as key estimation concepts and main characterization steps, as a base for the discussion in Section V.

1) Four-Dimensional Table (4DTab): The 4DTab macromodel relies on the following signal properties: the average input signal probability $P_{in}$, the average input transition density $D_{in}$, the average input spatial correlation $SC_{in}$, and the average output transition density $D_{out}$. The model is represented by a 4-D lookup table such that a power value corresponds to each valid grid point $(P_{in}, D_{in}, SC_{in}, D_{out})$.

For the sake of clarification, the definitions of these properties are informally recalled. Signal probability is the fraction of time a signal is at logic high. Transition density is the number of logic transitions (low-to-high, high-to-low) per unit time. Spatial correlation is the probability of two distinct signals being at logic high simultaneously. $P_{in}$, $D_{in}$, and $D_{out}$ are therefore defined as averages over all input/output bit positions, and $SC_{in}$ is the average over all possible pairs of distinct input bit positions.

Power estimation consists in first running an RTL simulation to collect the signal statistics $P_{in}$, $D_{in}$, $SC_{in}$, and $D_{out}$, and then looking up the corresponding power value in the table. When the collected signal statistics do not directly correspond to a valid grid point, interpolation is used to return the closest value.

Component characterization consists of the following sequence of steps. First, for each valid triple $(P_{in}, D_{in}, SC_{in})$, several input streams are generated. An input stream is a block of vectors, such that the width of the vectors matches the input bit-width of the component. Then every distinct stream is applied to a lower-level model of the component (e.g., a gate-level model), the consequent $D_{out}$ is evaluated, and the resulting power consumption is determined. Next, the average of all power values with the same $(P_{in}, D_{in}, SC_{in}, D_{out})$ is stored at the proper table entry.
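The sketch below is illustrative helper code (not part of PowerSC) showing how the four statistics that index the 4DTab table can be computed from streams of input and output bit vectors, following the definitions just recalled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct StreamStats { double pin, din, scin, dout; };

// in[t][b] (out[t][b]) is bit b of the t-th input (output) vector.
StreamStats compute_stats(const std::vector<std::vector<uint8_t>>& in,
                          const std::vector<std::vector<uint8_t>>& out) {
    const std::size_t T = in.size(), n = in[0].size(), m = out[0].size();
    double pin = 0, din = 0, scin = 0, dout = 0;

    // Pin: fraction of time each input bit is logic high, averaged over bits.
    for (std::size_t b = 0; b < n; ++b) {
        double high = 0;
        for (std::size_t t = 0; t < T; ++t) high += in[t][b];
        pin += high / T;
    }
    pin /= n;

    // Din / Dout: transitions per cycle for each bit, averaged over bits.
    for (std::size_t b = 0; b < n; ++b) {
        double tr = 0;
        for (std::size_t t = 1; t < T; ++t) tr += (in[t][b] != in[t - 1][b]);
        din += tr / (T - 1);
    }
    din /= n;
    for (std::size_t b = 0; b < m; ++b) {
        double tr = 0;
        for (std::size_t t = 1; t < T; ++t) tr += (out[t][b] != out[t - 1][b]);
        dout += tr / (T - 1);
    }
    dout /= m;

    // SCin: probability of two distinct input bits being high simultaneously,
    // averaged over all pairs of distinct input bit positions.
    std::size_t pairs = 0;
    for (std::size_t a = 0; a < n; ++a)
        for (std::size_t b = a + 1; b < n; ++b, ++pairs) {
            double both = 0;
            for (std::size_t t = 0; t < T; ++t) both += (in[t][a] && in[t][b]);
            scin += both / T;
        }
    scin /= pairs;

    return {pin, din, scin, dout};
}
```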
KLEIN et al.: MULTI-MODEL ENGINE FOR HIGH-LEVEL POWER ESTIMATION ACCURACY OPTIMIZATION
2) Equation Coefficient Table (EqTab): Instead of relying on the overall average input/output transition densities (like 4DTab), the EqTab model considers the individual contribution of each input/output bit position. Let $d^{in}_i$ and $d^{out}_j$ be the transition densities measured at the $i$th input and $j$th output bit positions for a stream of input and output vectors, respectively. For simplicity, let us generically call them bit-wise transition densities. Let $n$ and $m$ be, respectively, the input and output vector bit-widths. Given a component, its power consumption is modeled by the following equation, where $c$ denotes a coefficient:

$$P = c_0 + \sum_{i=1}^{n} c_i\, d^{in}_i + \sum_{j=1}^{m} c_{n+j}\, d^{out}_j$$
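As a small illustration, the equation above can be evaluated as follows once the coefficients of a table entry have been retrieved; the function name and data layout are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Evaluate the EqTab power equation: P = c0 + sum_i c_i*d_in_i + sum_j c_{n+j}*d_out_j.
// 'coef' holds c0..c_{n+m}; 'din'/'dout' hold the bit-wise transition densities.
double eqtab_power(const std::vector<double>& coef,
                   const std::vector<double>& din,
                   const std::vector<double>& dout) {
    double p = coef[0];
    for (std::size_t i = 0; i < din.size(); ++i)  p += coef[1 + i] * din[i];
    for (std::size_t j = 0; j < dout.size(); ++j) p += coef[1 + din.size() + j] * dout[j];
    return p;
}
```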
Actually, the EqTab technique relies on a lookup table which is indexed with $(P_{in}, D_{in})$. For each entry in this table, instead of directly storing a power value, the corresponding entry actually stores the coefficients of the equation above. As a result, EqTab estimation consists of three steps: first, an RTL simulation is run and the bit-wise densities are collected, along with the properties $P_{in}$ and $D_{in}$; then the coefficients stored at entry $(P_{in}, D_{in})$ are returned (if $(P_{in}, D_{in})$ does not represent a valid entry, the closest valid point is used instead); finally, the returned coefficients and the collected bit-wise densities are combined according to the above equation.

As opposed to 4DTab, EqTab characterization employs a single input vector stream for each pair $(P_{in}, D_{in})$ and consists in determining the respective set of coefficients. To find a proper set of coefficients, a system of equations is built as follows. Let SW be a matrix with as many rows as the number of successive pairs of vectors in the stream (say, $p$ pairs) and with as many columns as the compound vector bit-width $(n + m)$. A row of matrix SW stores the bit-wise transition densities taken between a pair of successive vectors. A column stores the transition density of a given bit position along the input stream. Let PW be a $p \times 1$ matrix, where each entry stores the power consumed by a pair of input vectors. Characterization consists in first calculating the bit-wise transition densities for every successive pair of input vectors (storing them in a row of matrix SW) and measuring their resulting power consumption (storing it in an entry of matrix PW). Then, the set of fitting coefficients is obtained by solving the system of equations with standard regression techniques (e.g., least mean squares). Finally, such coefficients are stored at entry $(P_{in}, D_{in})$.

3) Enhanced Hamming Distance (eHD): Basically, the eHD macromodel is an equation that expresses power as a function of two distinct input signal properties: the Hamming distance and the number of stable bits between two successive input vectors (the technique does not employ output signals). Given two input vectors $v$ and $w$ with $n$ bits each, the Hamming distance ($hd$) and the number of stable bits ($sb$) with value "1" are defined, respectively, as

$$hd(v, w) = \sum_{k=1}^{n} v_k \oplus w_k \qquad\qquad sb(v, w) = \sum_{k=1}^{n} v_k \wedge w_k$$

As opposed to the previous techniques, the eHD macromodel calculates the power per cycle, as follows. Let $e_{i,j}$ be a switching event class representing the properties of a pair of vectors, where $i$ is their Hamming distance and $j$ is their number of stable bits. Let $P_{cyc}(t)$ be the power consumption at the $t$th simulation cycle and $n$ be the number of input bits. The eHD macromodel equation is defined as follows, where $p_{i,j}$ denotes the contribution of switching event $e_{i,j}$ to the power consumption and $\alpha_{i,j}(t)$ denotes whether or not such an event occurs in cycle $t$:

$$P_{cyc}(t) = \sum_{i=0}^{n} \sum_{j=0}^{n-i} p_{i,j}\, \alpha_{i,j}(t) \qquad (1)$$

In [8], $\alpha_{i,j}(t)$ is called an activator and is defined as follows:

$$\alpha_{i,j}(t) = \begin{cases} 1 & \text{if event } e_{i,j} \text{ occurs in cycle } t \\ 0 & \text{if event } e_{i,j} \text{ does not occur in cycle } t \end{cases}$$
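As an illustration of (1), the following sketch computes the Hamming distance and the stable-one count of each successive vector pair and accumulates the per-cycle contributions; the map from event classes to characterized per-event power values is an assumed data structure.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Event class (i, j): i = Hamming distance, j = number of stable "1" bits.
using EventClass = std::pair<int, int>;

// Total power of a stream under the eHD model. 'event_power' maps each
// characterized event class to its average per-cycle power contribution.
double ehd_power(const std::vector<std::vector<uint8_t>>& vec,
                 const std::map<EventClass, double>& event_power) {
    double total = 0.0;
    for (std::size_t t = 1; t < vec.size(); ++t) {
        int hd = 0, sb = 0;
        for (std::size_t k = 0; k < vec[t].size(); ++k) {
            hd += (vec[t][k] != vec[t - 1][k]);           // bit toggled
            sb += (vec[t][k] == 1 && vec[t - 1][k] == 1); // stable at "1"
        }
        // The activator alpha_{i,j}(t) is 1 only for the event class observed
        // in cycle t, so (1) reduces to a single table lookup per cycle.
        auto it = event_power.find({hd, sb});
        if (it != event_power.end()) total += it->second;
    }
    return total;
}
```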
Given an input stream, estimation starts by calculating $hd$ and $sb$ for each pair of successive vectors. Then, the activators associated with the events occurring at a given cycle are set and (1) returns the power consumption for that cycle. The overall power consumption is obtained by adding the contributions of all cycles.

Component characterization starts with the generation of random input streams. Then, for every stream, each pair of its successive vectors is applied to the component and the functions $hd$ and $sb$ are evaluated. The resulting power consumption is determined with a preexistent lower-level power model of the component. Finally, for all streams with the same $(i, j)$, the average power $p_{i,j}$ is calculated.

4) Analytical: The Analytical macromodel considers the same signal properties as 4DTab: the input signal spatial correlation $SC_{in}$, the input signal probability $P_{in}$, the input transition density $D_{in}$, and the output transition density $D_{out}$. The model is represented by an equation that combines these signal properties, weighted by fitting coefficients obtained through regression during characterization.
Power estimation in this technique consists of running an RTL simulation to collect the signal statistics $(P_{in}, D_{in}, SC_{in}, D_{out})$, followed by the evaluation of the macromodel equation. No other technique (like interpolation) is necessary, as the equation uses only the collected signal statistics and the coefficients obtained during power characterization.

Component characterization is performed in three steps. First, for each valid triple $(P_{in}, D_{in}, SC_{in})$, several input streams are generated. Second, every distinct stream is applied to the component and the resulting $D_{out}$ and power consumption are determined. Third, each resulting value is used, according to the macromodel equation, in a linear regression method to obtain the coefficients of the equation.
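A minimal sketch of the third characterization step, assuming a simple linear form P ≈ c0 + c1·Pin + c2·Din + c3·SCin + c4·Dout (an assumption made for illustration; the published Analytical model may combine the properties differently): each characterization stream contributes one row of statistics and one measured power value, and the coefficients are obtained by ordinary least squares, solved here with a small Gaussian elimination to keep the sketch self-contained.

```cpp
#include <array>
#include <cmath>
#include <utility>
#include <vector>

// One characterization sample: measured statistics plus reference power.
struct Sample { double pin, din, scin, dout, power; };

// Fit c0..c4 in P ~ c0 + c1*Pin + c2*Din + c3*SCin + c4*Dout by least squares.
std::array<double, 5> fit_analytical(const std::vector<Sample>& samples) {
    constexpr int K = 5;
    double A[K][K] = {};                 // normal matrix  X^T X
    double b[K] = {};                    // right-hand side X^T y
    for (const Sample& s : samples) {
        const double x[K] = {1.0, s.pin, s.din, s.scin, s.dout};
        for (int i = 0; i < K; ++i) {
            b[i] += x[i] * s.power;
            for (int j = 0; j < K; ++j) A[i][j] += x[i] * x[j];
        }
    }
    // Solve A c = b with Gauss-Jordan elimination and partial pivoting.
    for (int col = 0; col < K; ++col) {
        int piv = col;
        for (int r = col + 1; r < K; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[piv][col])) piv = r;
        for (int j = 0; j < K; ++j) std::swap(A[col][j], A[piv][j]);
        std::swap(b[col], b[piv]);
        for (int r = 0; r < K; ++r) {
            if (r == col) continue;
            const double f = A[r][col] / A[col][col];
            for (int j = col; j < K; ++j) A[r][j] -= f * A[col][j];
            b[r] -= f * b[col];
        }
    }
    std::array<double, 5> c{};
    for (int i = 0; i < K; ++i) c[i] = b[i] / A[i][i];
    return c;
}
```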
Fig. 2. Error distribution (4DTab): (a) Add ECLA32 and (b) MulS16.
Fig. 3. Error distribution (EqTab): (a) Add ECLA32 and (b) MulS16.
B. Conventional Single-Model Approach

To illustrate how estimation accuracy can largely diverge among power models, we employ two real-life components as examples (Add ECLA32 and MulS16). Both examples were characterized by the four macromodeling techniques summarized in Section IV-A (4DTab, EqTab, eHD, and Analytical).

An experiment was conducted to assess the average errors of each macromodel as a function of two input-stream parameters, namely, $P_{in}$ and $D_{in}$, using 0.1 as the discretization step. A set of 5000 streams was generated to properly cover the input space. Then the RTL macromodel estimates were compared to the gate-level estimates, for each $(P_{in}, D_{in})$ pair.

Figs. 2–5 show the error distribution on the $P_{in} \times D_{in}$ space for both examples, and for the four adopted techniques. Notice that, for a given component, the distinct macromodels lead to quite different average errors. For instance, regardless of the chosen component, the eHD macromodel leads to the highest average error for the input streams whose transition density is below 0.6. This means that 4DTab, EqTab, or Analytical would be a better choice for such streams. However, for the streams whose transition density is greater than 0.6, eHD exhibits higher accuracy, competitive with the other techniques.
C. Analysis of Macromodeling Limitations

Each macromodel has its merits in capturing power variation. However, each technique implies a different usage of parameters, which eventually hides some assumption. This subsection shows that such implicit assumptions are the very source of limitation and that they are difficult to overcome within the scope of a single macromodeling technique. In [15], Gupta and Najm describe a mathematical relation between $P_{in}$, $D_{in}$, and $SC_{in}$. Here, we use this relation to illustrate the drawbacks in macromodel selection, but for the sake of simplicity and without loss of generality, we restrict ourselves to the transition density $D_{in}$.

Let us first consider the 4DTab technique. It assumes that $P_{in}$, $D_{in}$, $SC_{in}$, and $D_{out}$ are uniformly distributed through all input/output signals, although some signals could have a higher impact on the power consumption than others. Such an assumption is clearly inadequate when irregularly structured circuits are addressed or when control signals are considered, since they may completely change the circuit's operational mode [9].
Fig. 4. Error distribution (eHD): (a) Add ECLA32 and (b) MulS16.
Fig. 5. Error distribution (Analytical): (a) Add ECLA32 and (b) MulS16.
To illustrate that, consider again the example MulS16 (a 16-bit multiplier used as a component of the stereo audio crossover design to be discussed in Section VI). We simulated one of its implementations with 100 distinct input streams and monitored one of its operands (16 inputs). For most streams, we observed that only four of the monitored inputs exhibited transition densities higher than zero. Let us now consider one of those simulation instances and its bit-wise transition densities. Only the last and the first three inputs actually switch for this stream. This means that part of the MulS16 circuit is not stimulated, as opposed to the uniform distribution assumed by 4DTab, which implies that every input bit switches with the same density $D_{in}$. We can therefore conclude that, despite having the same $D_{in}$, such rather different stimulation patterns are likely to lead to quite different power estimates.

EqTab overcomes this drawback by taking each input into account individually, which in principle should lead to better accuracy. However, as opposed to 4DTab, a single stream is used in the characterization process for each entry in the table. Since the number of possible streams grows exponentially with the input width for some chosen input statistics, this assumption represents a serious limiting factor, as illustrated by the following example. Let $A$ and $B$ be the 4-bit input operands of a Booth multiplier.
Consider a candidate characterization stream whose bit-wise transition densities differ between the two operands, where the first four elements of the bit-wise density vector refer to $A$ and the last four to $B$. Now, consider an alternative characterization stream obtained by swapping the operands $A$ and $B$. Although the overall transition density remains the same, the bit-wise transition densities of the two operand positions are exchanged.
Since the multiplication algorithm is likely to require drastically different amounts of computational effort, depending on whether the operand is the multiplicand (the value to be added up) or the multiplier (the number of times the multiplicand must be added), it is clear that the power behavior for these two characterization streams will be quite different, although only one of them would be captured in the macromodel.

In the eHD technique, although two distinct pairs of vectors with the same input signal statistics will probably result in distinct output signal statistics, they are modeled in exactly the same way, which is based on input information only. Let us illustrate the impact of such an assumption by means of the following example. We performed the characterization of the Add ECLA32 component and monitored each pair of successive vectors within the input characterization streams (62 streams with 2000 vectors each). We observed that a collection of 505 pairs had exactly the same input statistics, that is, the same Hamming distance and the same number of stable bits. For such a collection, the resulting power estimates lay in the range [100 µW, 1000 µW], with a mean value of 482.2 µW and a standard deviation of 213.2 µW. As compared to the gate-level reference estimates available for that component, the estimation errors span a correspondingly wide range. This means that the eHD technique may incur high estimation errors because different circuit behaviors cannot be distinguished by the same input signal statistics. If output statistics were included in the model, such distinct behaviors would be better captured.

Let us now consider the Analytical technique. This method uses the same parameters used by 4DTab. Again, there is an assumption that the parameters are uniformly distributed through all input/output signals. Therefore, this method is bound to encompass 4DTab's limitations.

Since the drawbacks of each method have been pointed out qualitatively, we have proper grounds to introduce our multi-model engine in Section V.

V. PROPOSED MULTI-MODELS

A. Multi-Model Engine Flow

Our multi-model approach consists of four phases, as depicted in Fig. 6: 1) individual macromodel creation; 2) individual macromodel evaluation; 3) multi-model creation; and 4) multi-model usage. The first three phases are performed only once for a given technology library (during library characterization).
Fig. 6. Multi-model engine flow.
On the left side of the figure lies the sequence generator, which produces two types of streams: training sets, used during the characterization process (Phase 1), and evaluation sets, used during model robustness evaluation (Phase 2). If a single set were used in Phases 1 and 2, it would capture the intrinsic errors only. To ensure an unbiased macromodeling mechanism (Phase 3), training and evaluation sets are generated according to distinct prespecified parameters.

1) Phase 1 (Individual Macromodel Creation): Basically, the sequence generator produces as many training sets as the number of components undergoing characterization, although some sets may be compliant with more than one component and may be reused. The individual macromodel creation phase uses the training sets in the characterization of every RTL component from the library, for each supported macromodeling technique. As an example, if the library has 100 RTL components and we are using 4 macromodeling techniques, this phase will result in 400 different macromodels (4 for each RTL component). Although this phase is apparently rather time consuming, given that different macromodels need to be constructed for each design, the time to build the models is actually determined by the technique that requires the highest effort during simulation and model construction. Since all information necessary to build all macromodels is readily available upon simulation completion, the total time will be only slightly higher than the time required for the most complex technique. Considering the four adopted techniques, most of the parameters they use can be shared, as well as the discretization step.
Thereby, the most complex technique requires a training set which can be seen as a superset of the training sets of all macromodeling techniques.

2) Phase 2 (Individual Macromodel Evaluation): Once all macromodels are generated for a given RTL component, they are assessed by the evaluation engine. This individual macromodel evaluation phase consists in first launching an RTL simulation of the component, which is stimulated by the evaluation set. As a consequence, distinct power estimates are obtained for each macromodel. Then, such estimates are compared to preexisting gate-level power estimates and an error value is computed. As a result, for each macromodel, a file reporting its robustness in the $(P_{in}, D_{in})$ space is produced. This information is the input of the multi-model engine, as described next.

3) Phase 3 (Multi-Model Creation): In this phase, the engine builds a mapping $\phi$ that represents a power estimate as a function of the input signal properties ($P_{in}$ and $D_{in}$) and of the power estimates associated with each individual macromodel (a distinct estimate $M_k$ for each macromodel $k$). The rationale for using only $P_{in}$ and $D_{in}$ is that, as shown in [18], $D_{in}$ is the parameter with the greatest effect on power variation, with a nearly linear relation to it. Moreover, Gupta and Najm [15] derived a property which shows a special relationship between $P_{in}$ and $D_{in}$. In this paper, we propose two definitions for the function $\phi$ (to be explained in Sections V-B and V-C): the first assumes that one of the individual macromodels leads to better accuracy than the others for a given value of the input signal properties, whereas the second captures possible correlations between individual macromodel estimates by using regression techniques.

4) Phase 4 (Multi-Model Usage): Once the $\phi$-functions have been built for each RTL component in the previous phase, this phase consists merely in the collection of input statistics and model parameters during simulation.
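The following sketch illustrates the bookkeeping behind Phase 2 under assumed data structures: every evaluation stream carries its (Pin, Din) signature, the gate-level reference power, and the estimate of each candidate macromodel, and the absolute relative errors are averaged per discretized (Pin, Din) cell to produce the robustness information consumed in Phase 3.

```cpp
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One evaluation stream, already simulated: its (Pin, Din) signature, the
// gate-level reference power, and the estimate of every individual macromodel.
struct EvalPoint {
    double pin, din, ref_power;
    std::map<std::string, double> estimates;   // macromodel name -> estimate
};

using Cell = std::pair<double, double>;        // discretized (Pin, Din) cell

// Average absolute relative error of each macromodel in each (Pin, Din) cell.
std::map<std::string, std::map<Cell, double>>
build_error_maps(const std::vector<EvalPoint>& points, double step = 0.1) {
    std::map<std::string, std::map<Cell, double>> err_sum;
    std::map<std::string, std::map<Cell, int>>    err_cnt;
    for (const EvalPoint& p : points) {
        Cell c{std::round(p.pin / step) * step, std::round(p.din / step) * step};
        for (const auto& [name, est] : p.estimates) {
            err_sum[name][c] += std::fabs(est - p.ref_power) / p.ref_power;
            err_cnt[name][c] += 1;
        }
    }
    for (auto& [name, cells] : err_sum)
        for (auto& [cell, sum] : cells) sum /= err_cnt[name][cell];
    return err_sum;                             // the per-model robustness report
}
```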
Fig. 7. Selection function for both modules: (a) Add ECLA32 and (b) MulS16.
B. Approach 1: Uncorrelated Multi-Modeling

Let each individual macromodel $k$ be associated with a mapping $M_k(P_{in}, D_{in})$ such that, for a given point of the input space, $M_k$ returns the power estimated by macromodel $k$. Assume that, for each point in the input space, there is a single macromodel whose accuracy is maximal. In the so-called uncorrelated multi-modeling (UMM) approach, we first build a selection function $S$ to map each point of the input space to the macromodel leading to the least error, say $M_s$, and then we invoke the function $M_s$ to obtain its power estimate, as follows.

Let $\Omega$ be the set of macromodels and let $\epsilon_k$ be the error computed for a given $M_k \in \Omega$. The selection function $S$ represents the mapping of a pair $(P_{in}, D_{in})$ to a macromodel $M_s$ such that $\epsilon_s = \min_{M_k \in \Omega} \epsilon_k$. Consider a mapping $\phi$ such that $\phi(P_{in}, D_{in})$ represents the multi-model power estimate. For the UMM approach, we define $\phi$ as follows:

$$\phi(P_{in}, D_{in}) = M_{S(P_{in}, D_{in})}(P_{in}, D_{in}) \qquad (2)$$

Since the function $S$ is highly dependent on a proper choice of input streams, the evaluation set is designed as a huge collection of input streams (approximately 5000 in the current implementation) uniformly distributed over the $(P_{in}, D_{in})$ space.
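An illustrative sketch of (2): given the per-model error maps produced during evaluation (as in the previous sketch), the macromodel with the least recorded error at the grid point closest to the observed (Pin, Din) is selected, and only that macromodel is invoked.

```cpp
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <utility>

using Cell = std::pair<double, double>;

// UMM estimate: choose the macromodel with the least recorded error at the
// grid point nearest to (pin, din), then invoke only that macromodel.
double umm_estimate(double pin, double din,
                    const std::map<std::string, std::map<Cell, double>>& error_maps,
                    const std::map<std::string, std::function<double(double, double)>>& models) {
    std::string best_model;
    double best_err = std::numeric_limits<double>::infinity();
    for (const auto& [name, cells] : error_maps) {
        // Nearest characterized cell for this model (least Euclidean distance).
        double cell_err = 0.0, best_d2 = std::numeric_limits<double>::infinity();
        for (const auto& [cell, err] : cells) {
            double dx = cell.first - pin, dy = cell.second - din;
            double d2 = dx * dx + dy * dy;
            if (d2 < best_d2) { best_d2 = d2; cell_err = err; }
        }
        if (cell_err < best_err) { best_err = cell_err; best_model = name; }
    }
    return models.at(best_model)(pin, din);   // M_{S(Pin,Din)}(Pin, Din)
}
```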
In brief, our multi-model engine associates a distinct selection function with each library component. Given an input stream, its properties $(P_{in}, D_{in})$ are employed to query the function $S$, which selects the macromodel returning the most accurate power estimate. If a given $(P_{in}, D_{in})$ is not a member of the domain of $S$, the closest member is chosen according to the least Euclidean distance.

Fig. 7(a) and (b) display the selection functions for modules Add ECLA32 and MulS16. Each symbol represents the most accurate macromodel selected for a given point of the input space. For instance, an EqTab symbol at a given point in Fig. 7(a) indicates that EqTab was the selected model for that point. Although the selection functions in Fig. 7(a) and (b) are different, a common pattern can be observed in both examples. The eHD macromodel is more accurate than the other techniques for higher transition densities, while the Analytical macromodel is more accurate for lower transition densities. As an example, there are points of the input space for which it is better to use the eHD model for Add ECLA32, since its expected average error ranges from 0% to 10%, while for EqTab and Analytical this error lies within 10%–20% and 40%–100%, respectively. Now, if we
consider the region around (0.3, 0.3), 4DTab should be used instead, as it leads to the lowest errors (10%–20%). As another example, consider the MulS16 design. As can be seen, the Analytical model is selected for most regions of the input space. Nevertheless, for some regions the other models should be used instead, since they have a lower expected error. Also, observe that, for this component, EqTab is not selected for any region of the input space.

C. Approach 2: Correlated Multi-Modeling

To obtain a power estimate, this approach also relies on the computed estimates of each available individual macromodel. However, instead of directly selecting the macromodel with maximal accuracy and using it to provide the power estimate, this approach employs a more refined power estimate function, which is built by means of regression analysis so as to correlate the input stream properties with the individual macromodel estimates, thereby further improving the resulting accuracy. That is why this approach is called correlated multi-modeling (CMM).

Recall that the mapping $\phi$ represents the multi-model power estimate. For the CMM approach, we define $\phi$ as follows:

$$\phi(P_{in}, D_{in}) = \mathbf{C} \cdot \mathbf{K} \qquad (3)$$

where $\mathbf{C} = [c_1\; c_2\; \cdots\; c_{N_{corr}}]$ is a row vector whose elements are the coefficients resulting from regression analysis and where $\mathbf{K}$ is a column vector whose elements represent correlations, as follows:

$$\mathbf{K} = [\,P_{in},\; D_{in},\; M_1, \ldots, M_n,\; P_{in} M_1, \ldots, D_{in} M_n,\; M_1 M_2, \ldots, M_{n-1} M_n,\; P_{in}^2,\; D_{in}^2,\; M_1^2, \ldots, M_n^2\,]^{T} \qquad (4)$$

In (4), note that, in order to effectively capture power variation, we first take into account the individual contributions of the input stream properties and of each individual macromodel estimate. Then, we correlate the pair $(P_{in}, D_{in})$ of input stream properties with the individual macromodel estimates. Next, we cross-correlate all individual macromodel estimates. Finally, we capture the self-correlations of the input stream properties and of the individual macromodel estimates.

Let $N_K$ represent the number of arguments of the function $\phi$ and recall that $n$ is the number of available individual macromodels. The total number of correlations, denoted by $N_{corr}$, can be calculated as follows:

$$N_{corr} = N_K + 2n + \binom{n}{2} + N_K \qquad (5)$$

Since $N_K = n + 2$, (5) can be rewritten as

$$N_{corr} = \frac{n^2 + 7n + 8}{2} \qquad (6)$$

Since the relationship between the chosen parameters is clearly not linear, we adopted standard nonlinear regression methods to obtain the coefficients in the row vector $\mathbf{C}$.

In brief, our multi-model engine builds a function $\phi$ for each library component. Given an input stream and a point in the input space, the individual power estimates are obtained as an intermediate step. Then, the final power estimate is computed by evaluating the function $\phi$ for the given point in the input space.

To close this section, let us illustrate the formulation by means of an example. Consider the set of macromodels reviewed in this paper: {4DTab, EqTab, eHD, Analytical}. The power estimate function for the CMM approach is then the instance of (3) and (4) with $n = 4$: a weighted combination of $P_{in}$, $D_{in}$, the four individual estimates, their products with $P_{in}$ and $D_{in}$, their pairwise cross-products, and the squared terms, for a total of 26 correlation terms.
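A sketch of the CMM evaluation under the reconstruction of (3) and (4) given above (the exact composition of the correlation vector is inferred from the textual description and should be taken as an assumption): the vector K is assembled from Pin, Din, and the individual estimates, and the dot product with the fitted coefficient row C yields the multi-model estimate.

```cpp
#include <cstddef>
#include <vector>

// Build the correlation vector K from (Pin, Din) and the n individual
// macromodel estimates: linear terms, (Pin, Din) x estimate products,
// estimate cross-products, and squared (self-correlation) terms.
std::vector<double> cmm_correlations(double pin, double din,
                                     const std::vector<double>& m) {
    std::vector<double> k;
    k.push_back(pin); k.push_back(din);
    for (double mi : m) k.push_back(mi);                        // individual terms
    for (double mi : m) { k.push_back(pin * mi); k.push_back(din * mi); }
    for (std::size_t i = 0; i < m.size(); ++i)                  // cross-correlations
        for (std::size_t j = i + 1; j < m.size(); ++j) k.push_back(m[i] * m[j]);
    k.push_back(pin * pin); k.push_back(din * din);             // self-correlations
    for (double mi : m) k.push_back(mi * mi);
    return k;
}

// CMM estimate: dot product of the fitted coefficient row C with K.
double cmm_estimate(const std::vector<double>& c, const std::vector<double>& k) {
    double p = 0.0;
    for (std::size_t i = 0; i < c.size() && i < k.size(); ++i) p += c[i] * k[i];
    return p;
}
```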
D. Discussion

It should be noted that, in general, there is a large intersection in the required parameters for a given set of supported macromodels (for instance, being the most complex macromodel, eHD contains the parameters required by 4DTab, EqTab, and Analytical). Therefore, the overhead of using the UMM approach is only slightly larger than that of the most complex macromodel. The small increment over the overhead of the dominant macromodel is due to the time required to apply the selection function once for each component. Note that, essentially, there is no extra overhead in macromodel execution: only one (but possibly distinct) macromodel function is invoked per component, exactly the same number of invocations a single-model engine would require. For the CMM approach, since the power estimation requires the computation of $M_k(P_{in}, D_{in})$ for each macromodel $k$, the overhead comprises one invocation of each macromodel function per component during the evaluation of the $\phi$-function.

VI. EXPERIMENTS

This section compares conventional single-model techniques to the proposed multi-model approaches with respect to estimation accuracy.
TABLE I SELECTED BENCHMARK DESIGNS
Reference gate-level estimates were obtained with Synopsys PrimePower. The adopted benchmark suite consists of six designs, all synthesized to a TSMC 0.25-µm technology library. Design names are listed in the first column of Table I, while their descriptions and areas are shown in the following columns. Two of them are single components from a part library (Add ECLA32 and MulS16). Four are complex in-house designs, extracted from real-life applications: an implementation of a differential equation algorithm (Diffeq) and implementations of a stereo audio crossover (Crossover1, Crossover2, Crossover3). These complex designs consist of several different components with varying bit-widths, such as adders, multipliers, and subtracters. Several estimations were performed for each design and for various distinct input streams. The simulation times, which include the overhead imposed by the macromodels, ranged from 56 to 126 s for these experiments.

A. Common Characterization Setup

To properly assess the distinct macromodeling techniques, the same vector generation procedure was used during characterization. We adopted the procedure described in [18], which allows input stream generation with highly accurate signal probability, transition density, and spatial correlation. As suggested in [18], we set each stream to have 2000 vectors. The adopted input-space discretization step was 0.1 for both $P_{in}$ and $D_{in}$. The characterization times required for generating the models were 92, 102, 232, and 246 minutes for 4DTab, EqTab, eHD, and Analytical, respectively.

B. Experimental Results

Fig. 8(a) and (b) show, respectively, the average and maximum errors resulting from each macromodeling approach. The designs (listed in Table I) lie on the horizontal axis, while the vertical axis presents the errors, expressed as percentages. For each design, a cluster of bars is shown, each bar representing a power macromodeling technique, as follows (from left to right): 4DTab, EqTab, eHD, Analytical, UMM, and CMM. An additional cluster of bars is added to the right-hand side of these figures. It shows an overall error average on the whole set of designs. Observe that the scales used in these figures differ from each other. These results were computed using the following equations:

$$\epsilon_{avg} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{P}_i - P_i|}{P_i} \times 100\% \qquad (7)$$

$$\epsilon_{max} = \max_{1 \le i \le N} \frac{|\hat{P}_i - P_i|}{P_i} \times 100\% \qquad (8)$$

where $N$ is the number of components in the design, and $\hat{P}_i$ and $P_i$ are the estimated and reference power values for the $i$th component, respectively.
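A short sketch of (7) and (8) for one design: the per-component relative errors are averaged and the worst case is taken.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct DesignErrors { double avg_pct, max_pct; };

// eps_avg and eps_max over the N components of one design, as in (7) and (8).
DesignErrors design_errors(const std::vector<double>& estimated,
                           const std::vector<double>& reference) {
    double sum = 0.0, worst = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        double e = std::fabs(estimated[i] - reference[i]) / reference[i] * 100.0;
        sum += e;
        if (e > worst) worst = e;
    }
    return {sum / reference.size(), worst};
}
```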
Fig. 8. Results obtained with the adopted power macromodeling techniques for the selected designs. (a) Average errors. (b) Maximum errors.
We make three observations from these figures. First, for all designs, there is a large variation in the errors among the single-model techniques (first four bars). For the average errors, this variation spans from 15% to 101%. Even more contrasting values can be seen in Fig. 8(b), ranging from 83% to 1453%. Note that the single-model approach may lead to unacceptable errors. Such high errors may compromise the overall quality of the estimates, which, in turn, may lead to wrong design decisions. For the adopted set of designs, the largest errors were obtained with eHD. This indicates that such a macromodel should not be used alone. However, this does not necessarily mean that such a technique should be discarded beforehand. Instead, we conclude that it is simply not adequate for some regions of the $(P_{in}, D_{in})$ space. Actually, as we have shown in Section IV, there are regions where eHD performed much better than the other techniques.
The second observation is that the estimates produced by the proposed multi-model approaches (last two bars) are far better than those produced by the single-model techniques. From the perspective of average errors, CMM performed better than UMM in most of the cases. The only exceptions were MulS16 and Crossover1, where UMM presents lower errors. This is due to the fact that, for the component MulS16, the Analytical macromodel obtained better results than the other techniques (4DTab, EqTab, and eHD) for a large part of the input space [refer to Fig. 7(b)], which affected the computation of the model coefficients during regression analysis. Notice, however, that even in these cases the multi-model approaches outperform the best single-model counterpart. Also, observe that in no other case does a single-model technique present average errors as low as those of the multi-model approaches.

In order to show the benefit of using the proposed multi-model approaches instead of conventional single-model alternatives, we use as figures of merit the following family of ratios:

$$r_{avg}^{k} = \frac{\epsilon_{avg}^{k}}{\epsilon_{avg}^{mm}} \qquad (9)$$

where $\epsilon_{avg}^{k}$ and $\epsilon_{avg}^{mm}$ are the average errors for the single-model technique $k$ and for the multi-model technique [as computed in (7)] and where $r_{avg}^{k}$ determines the range of benefit as compared to single-model techniques. On average, the obtained ratios show a substantial benefit over every individual single-model technique.

From the perspective of the maximum errors in Fig. 8(b), results similar to those presented in Fig. 8(a) are perceived, with the lowest maximum errors being observed for UMM and CMM. The only exception is MulS16, where Analytical presents lower errors than CMM. Also, notice that the overall maximum errors of UMM are better than CMM's for these experiments. Figures of merit [similar to those in (9)] can be used with respect to the maximum errors, as follows:

$$r_{max}^{k} = \frac{\epsilon_{max}^{k}}{\epsilon_{max}^{mm}} \qquad (10)$$

The difference from (9) lies basically in the computation of $\epsilon_{max}^{k}$ and $\epsilon_{max}^{mm}$, which now employs (8). On average, the resulting ratios again show a clear benefit over the single-model techniques.

The third observation from Fig. 8(a) and (b) is that each multi-model technique exhibits small variations across designs, as opposed to the conventional single-model approaches. The standard deviations of the average and maximum errors are much smaller for the multi-model approaches than the corresponding values found for the single-model approaches. This shows that the proposed approaches produce power estimates robustly, a desirable and important property that every power model should have.

To summarize our improvements in a single number for the average and maximum errors, we employ the ratios $r_{avg}$ and $r_{max}$ computed in (9) and (10). Note that the accuracy becomes 7 times better for the average errors, as compared to the conventional approaches, while the overall maximum estimation error is divided by 9.
Fig. 9. Input-space coverage and distribution of reference power values for Add ECLA32. (a) $P_{in} \times D_{in}$ input-space coverage. (b) Distribution of the reference power values.
C. Robustness Evaluation

In order to correlate the overall system accuracy with the estimation errors of a component, let us focus on the most complex designs (Diffeq, Crossover1, Crossover2, Crossover3). Those designs employ Add ECLA32 as one of their various components and, therefore, we select it for a more comprehensive robustness evaluation of the adopted power macromodeling techniques. We consider a model robust when it is capable of producing high-quality estimates for a wide range of different input streams.

In order to perform such an evaluation, we generated a number of input streams large enough to properly cover the input space. Fig. 9(a) pictorially shows this coverage, where the horizontal and vertical axes represent the input signal probability $P_{in}$ and the transition density $D_{in}$, respectively. Each point in this figure refers to a distinct input stream. Moreover, the distribution of the reference power values for these streams, along with the normal curve, is illustrated in Fig. 9(b). Each bar in this histogram corresponds to a distinct power value and its height corresponds to the number of occurrences of this specific power value.
Fig. 10. Model fitness of the adopted techniques for the Add ECLA32 design. (a) 4DTab’s fitness. (b) EqTab’s fitness. (c) eHD’s fitness. (d) Analytical’s fitness. (e) UMM’s fitness. (f) CMM’s fitness.
Fig. 11. Distribution of power values of the adopted techniques for the Add ECLA32 design. (a) 4DTab's distribution. (b) EqTab's distribution. (c) eHD's distribution. (d) Analytical's distribution. (e) UMM's distribution. (f) CMM's distribution.
The next step of the evaluation consisted of generating power estimates with all single and multi-model techniques, for all input streams from Fig. 9(a). Then these estimates were compared to the reference values, resulting in a so-called model fitness. The fitness plots of the adopted techniques are shown in Fig. 10(a)–(f). Notice that the scale used in Fig. 10(d) is different from the others.
As can be seen in Fig. 10(a)–(d), these figures make clear our claim that conventional single-model techniques have intrinsic limitations that affect their accuracy. Also, notice that the errors produced by these techniques are not random; an identifiable trend is detected for each method's estimates, although the trends are quite distinct from one another. These results corroborate the qualitative analysis of macromodeling limitations in Section IV.
The fitness plots of the proposed multi-model approaches are shown in Fig. 10(e) and (f). Observe that both UMM and CMM display a better fitness compared to the single-model techniques. Remarkably, the CMM estimates were much closer to the reference values than those of all other techniques.

We also present the distribution of the power values of the adopted techniques in Fig. 11(a)–(f). Note that there is a huge discrepancy between the techniques regarding the number of occurrences of each range of power values. Observe that the histograms in Fig. 11(e) and (f) (UMM and CMM) have a shape similar to the reference histogram in Fig. 9(b).

The large variation in the estimates and in the distribution of power values (Figs. 10 and 11) is evidence of the lack of robustness of a single macromodel. On the one hand, due to such a lack of robustness in the estimations, a single macromodel is bound to compromise the overall accuracy. On the other hand, the multi-model approaches overcome this lack of robustness, as the results revealed. As a remark, similar results were observed for the other designs, but are omitted for lack of space.

The overhead imposed on the simulation by the usage of our multi-model approach for power estimation is minimal. Once all power models have been built offline and the component from the library has been appropriately instrumented with its $\phi$-function, the call to it either returns only which model is to be used at that moment (UMM approach) or invokes each macromodel function only once (CMM approach). All remaining overhead is exactly the same as if a single-model approach were used. Thus, as can be noticed, the robustness improvement obtained by using our approaches pays off.

VII. CONCLUSION

We presented two alternative power estimation approaches which exploit multiple macromodeling techniques to improve accuracy at the expense of a negligible usage overhead. When compared to four distinct conventional macromodeling approaches, accuracy becomes 7 times better on average, while the maximum estimation error is divided by 9. We believe that multi-model estimation is the key to a wider acceptance of macromodeling in industry, by improving both power simulation performance (as proposed in [20] and [21]) and estimation accuracy (as proposed in this paper and in [17]).

ACKNOWLEDGMENT

The authors would like to thank their colleague Y. Y. Ju, who assisted them during the experimentation process. They would also like to thank Prof. X. Liu, from North Carolina State University, who provided them with his sequence generator.
REFERENCES

[1] E. Macii and M. Poncino, "Power macro-models for high-level power estimation," in Low-Power Electronics Design. Boca Raton, FL: CRC Press, 2005, ch. 39.
[2] M. Anton, I. Colonescu, E. Macii, and M. Poncino, "Fast characterization of RTL power macromodels," in Proc. IEEE ICECS, 2001, pp. 1591–1594.
[3] M. Barocci, L. Benini, A. Bogliolo, B. Ricco, and G. D. Micheli, "Lookup table power macro-models for behavioral library components," in Proc. IEEE Alessandro Volta Memorial Workshop Low-Power Des., Mar. 1999, pp. 173–181.
[4] S. Gupta and F. N. Najm, "Energy and peak-current per-cycle estimation at RTL," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 4, pp. 525–537, Apr. 2003.
[5] S. Gupta and F. N. Najm, "Power macromodeling for high level power estimation," in Proc. DAC, 1997, pp. 365–370.
[6] A. Raghunathan, S. Dey, and N. K. Jha, "Register-transfer level estimation techniques for switching activity and power consumption," in Proc. ICCAD, 1996, pp. 158–165.
[7] P. E. Landman and J. M. Rabaey, "Activity-sensitive architectural power analysis," IEEE Trans. Comput.-Aided Des. Integr. Circuits, vol. 15, no. 6, pp. 571–587, Jun. 1996.
[8] G. Jochens, L. Kruse, E. Schmidt, and W. Nebel, "A new parameterizable power macro-model for datapath components," in Proc. DATE, 1999, pp. 29–36.
[9] R. Corgnati, E. Macii, and M. Poncino, "Clustered table-based macromodels for RTL power estimation," in Proc. 9th Great Lakes Symp. VLSI (GLSVLSI), 1999, pp. 354–357.
[10] A. Bogliolo and L. Benini, "Robust RTL power macromodels," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 4, pp. 578–581, Dec. 1998.
[11] L. Benini, A. Bogliolo, E. Macii, M. Poncino, and M. Surmei, "Regression-based RTL power models for controllers," in Proc. 10th Great Lakes Symp. VLSI, 2000, pp. 147–152.
[12] S. Gupta and F. Najm, "Analytical models for RTL power estimation of combinational and sequential circuits," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 19, no. 7, pp. 808–814, Jul. 2000.
[13] G. Jochens, L. Kruse, E. Schmidt, A. Stammermann, and W. Nebel, "Power macro-modelling for firm-macro," in Proc. Int. Workshop Power Timing Model., Opt. Simulation (PATMOS), Sep. 2000, pp. 24–35.
[14] M. Bruno, A. Macii, and M. Poncino, "A statistical power model for non-synthetic RTL operators," in Proc. Int. Workshop Power Timing Model., Opt. Simulation (PATMOS), 2003, pp. 208–218.
[15] S. Gupta and F. N. Najm, "Power modeling for high-level power estimation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 1, pp. 18–29, Feb. 2000.
[16] H. Mehta, R. M. Owens, and M. J. Irwin, "Energy characterization based on clustering," in Proc. DAC, 1996, pp. 702–707.
[17] N. R. Potlapally, M. S. Hsiao, A. Raghunathan, G. Lakshminarayana, and S. T. Chakradhar, "Accurate power macro-modeling techniques for complex RTL circuits," in Proc. 18th Int. Conf. VLSI Des. (VLSID), Los Alamitos, CA, 2001, pp. 235–241.
[18] X. Liu and M. C. Papaefthymiou, "A Markov chain sequence generator for power macromodeling," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 23, no. 7, pp. 1048–1062, Jul. 2004.
[19] Q. Wu, C. Ding, C. Hsieh, and M. Pedram, "Statistical design of macromodels for RT-level power evaluation," in Proc. ASPDAC, 1997, pp. 523–528.
[20] N. Bansal, K. Lahiri, A. Raghunathan, and S. T. Chakradhar, "Power monitors: A framework for system-level power estimation using heterogeneous power models," in Proc. 18th Int. Conf. VLSI Des. (VLSID), Los Alamitos, CA, 2005, pp. 579–585.
[21] G. Beltrame, D. Sciuto, and C. Silvano, "Multi-accuracy power and performance transaction-level modeling," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 10, pp. 1830–1842, Oct. 2007.
[22] F. Klein, R. Leao, G. Araujo, L. Santos, and R. Azevedo, "An efficient framework for high-level power exploration," in Proc. 50th IEEE Int. Midw. Symp. Circuits Syst., 2007, pp. 1046–1049.
[23] OSCI, "SystemC 2.0 User's Guide, 2nd ed.," 2002. [Online]. Available: http://www.systemc.org
[24] S. R. Powell and P. M. Chau, "Estimating power dissipation of VLSI signal processing chips: The PFA technique," in Proc. VLSI Signal Process. IV, 1990, pp. 250–259.

Felipe Klein (S'07) received the B.Sc. degree in computer science from the Federal University of Santa Catarina, Santa Catarina, Brazil, in 2002, and the M.Sc. degree from the University of Campinas (UNICAMP), Sao Paulo, Brazil, in 2005, where he is pursuing the Ph.D. degree in computer science. From 2007 to 2008, he was a visiting scholar with the Embedded Systems Research Group, Dortmund University of Technology, Dortmund, Germany. His research interests include low-power design of embedded systems, multiprocessor platforms, and hardware/software codesign.
Roberto Leao received the B.Sc. and M.Sc. degrees in computer science from the Federal University of Santa Catarina, Santa Catarina, Brazil, in 2002 and 2008, respectively. Since 2005, he has been an Oracle Database Administrator (DBA) with Suntech Intelligent Solutions, Florianopolis, Brazil. In 2008, he became an Oracle Certified Professional (OCP) DBA 10g. His research interests include low-power design and high-level hardware synthesis.
Guido Araujo (M’08) received the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, in 1997. He is currently a Professor with the Department of Computer Science and Engineering, University of Campinas (UNICAMP), Sao Paulo, Brazil. He worked as a Consultant for Conexant Semiconductor Systems and Mindspeed Technologies from 1997 to 1999. His main research interests include code optimization, architecture simulation, and embedded system design. Dr. Araujo was a corecipient of Best Paper Awards at the 1997 ACM/IEEE Design Automation Conference, 2003 International Workshop on Software and Compilers for Embedded Systems (SCOPES’03), and SBAC-PAD’04. He was also awarded the 2002 University of Campinas Zeferino Vaz Award for his contributions to Computer Science research and teaching.
Luiz C. V. Santos (M'08) received the Electrical Engineering degree with honors from the Federal University of Parana, Parana, Brazil, in 1987, the M.S. degree in computer science from the Federal University of Rio Grande do Sul, Brazil, in 1990, and the Ph.D. degree from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1998. Since 2007, he has served as an Associate Professor with the Federal University of Santa Catarina, Florianopolis, Brazil, where he was an Assistant Professor from 1999 to 2006. Recently, his research work includes system-level and low-power design of embedded systems.
Rodolfo Azevedo (M'07) received the Ph.D. degree in computer science from the University of Campinas (UNICAMP), Sao Paulo, Brazil, in 2002. He currently works as a Professor with the Institute of Computing at UNICAMP. His main research interests include computer architecture, low power, ADLs, and system-level design, mainly using SystemC. Dr. Azevedo was a corecipient of Best Paper Awards at SBAC-PAD'04, for his work on the ArchC ADL, and at SBAC-PAD'08.