IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 10, OCTOBER 2007
Multi-Accuracy Power and Performance Transaction-Level Modeling

Giovanni Beltrame, Donatella Sciuto, Member, IEEE, and Cristina Silvano, Member, IEEE
Abstract—This paper introduces a modeling and simulation technique that extends transaction-level modeling (TLM) to support multi-accuracy models and power estimation. This approach provides different combinations of power and performance models, as well as the switching of model accuracy during simulation, allowing the designer to trade off simulation accuracy against speed at runtime. This is particularly useful during the exploration phase of a design, when the designer changes the features or the parameters of the design while trying to satisfy its constraints. Usually, only limited portions of a system are affected by a single parameter change, and therefore, it is possible to fast-simulate the uninteresting sections of the application. In particular, we show how to extend the TLM and modify the SystemC kernel to support multi-accuracy features. The proposed methodology has been tested on several benchmarks, among which is an MPEG4 encoder, showing that simulation speed can be increased by one order of magnitude. On the same benchmarks, we also show how it is possible to choose the optimal performance simulation accuracy for a given power model, maximizing simulation speed for the desired accuracy.

Index Terms—Embedded systems, modeling, power modeling and estimation, simulation.
I. INTRODUCTION

ESTABLISHED simulation techniques work well at the behavioral and architectural levels, but they are useful only for determining the functional correctness of a system. When the aim is to evaluate performance or power consumption, simulations at the register-transfer or gate level are needed, but these fail to deliver fast results. Models of systems can be very accurate, but designers also need feasible simulation times to validate their hypotheses and verify the design. With the introduction of transaction-level modeling (TLM), simulation times have shortened considerably while keeping an acceptable level of accuracy, raising the abstraction level that can be used for effective performance simulation. Nevertheless, TLM simulation speed, in particular when considering cycle-accurate models, does not keep up with the increasing complexity of Multi-Processor Systems-on-Chip (MPSoC). Furthermore, the design and tuning of an MPSoC often require many simulation runs of the application on the platform, after each minor optimization or modification of the application.

Manuscript received March 22, 2006; revised October 31, 2006. This paper was recommended by Associate Editor N. Chang. G. Beltrame is with ESTEC, European Space Agency, 2200 AG Noordwijk, The Netherlands (e-mail: [email protected]). D. Sciuto and C. Silvano are with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, 20133 Milan, Italy. Digital Object Identifier 10.1109/TCAD.2007.895790

Fig. 1. Example of the use of the multi-accuracy switching capabilities of the proposed approach.

This paper extends the methodology presented in [1], exploiting the TLM and the concept of reflection (from software engineering) to support the dynamic switching of multi-accuracy models of functionality, communication, and power consumption for the simulation of complex systems, like MPSoCs. In particular, to support the optimization and debugging phases of small kernels of the target application, we propose a methodology that uses a less accurate model for the uninteresting sections of the simulation and a refined model for the kernels under analysis. Current TLM simulators do not provide an effective way of switching the simulation accuracy at runtime, and there is no consistent way of modeling multi-accuracy systems. The proposed approach can be used to perform fast simulations on uninteresting parts of an application life span, trading off accuracy. Once an interesting kernel is reached, the simulation speed is lowered, restoring accuracy. A simple example of such behavior is shown in Fig. 1. Suppose that we have modified an application and we want to test the results of the modification, but the application is affected only during the interval δ1. During interval δ0, the simulation runs with low-accuracy modules and functional channels, going at full speed at the cost of accuracy. Accuracy is then reestablished during interval δ1, allowing the designer to reach the interval under analysis in considerably less time.

A simulator that estimates performance while varying system parameters addresses only one aspect of the design space exploration (DSE) phase at the system level. To efficiently support DSE, we need a high-level framework to model the system and to dynamically provide power characteristics in order to efficiently evaluate accurate power/performance tradeoffs. In this paper, we also propose decoupling power modeling from the rest of the system models. The basic assumption is that the power consumption of each component depends only on the component's own behavior, since the power estimation
framework is built over an architectural simulator that already takes into account the interaction between components. In this way, it is possible to have a library of power consumption models that are used to measure the power consumption "side effect" and perform energy-delay tradeoffs. In addition, a generic power model can be attached to several components or to components specified at different accuracy levels. The designer can then trade off between simulation accuracy and cost. The proposed infrastructure is flexible enough to specify most power models present in the literature and adds a negligible overhead. It is also possible to switch power modeling on and off, or to change its accuracy, at any time during simulation. The proposed methodology has been tested with several power models, both on single components and at the system level. The tradeoff between estimation accuracy and model complexity is shown for some of these models, together with the overhead introduced by the framework.

This paper is organized as follows: Section II describes the most commonly used simulation and power modeling methodologies at different levels of abstraction. Section III introduces the proposed technique for reducing the simulation times of TL models. Section IV presents the extension for power consumption estimation during simulation. Section V provides experimental evidence and considerations on model simulation accuracy. Finally, Section VI gives some concluding remarks.

II. RELATED WORK

A. Transaction-Level Simulation

To deal with the increasing complexity of MPSoCs, TLM [2] has been introduced in the past years as a modeling style to describe on-chip communication channels at a higher abstraction level with respect to the register transfer level (RTL). At the TLM level, IP modules can be modeled at a functional level, and the system bus behavior can be viewed as an abstract channel independent of the target bus architecture or protocol implementation.
Although TL models offer high simulation speed, in some cases, they do not capture enough details about the on-chip behavior. Recent works (like [3] and [4]) have been directed toward the application of some concepts from the TLM level at bus-cycle accuracy (BCA) to model on-chip communication architectures and to guarantee simulation speedup. A hierarchical modeling framework to support on-chip communication architectures has been proposed in [3], which is based on function calls instead of signal semantics to model AMBA2 and CoreConnect bus architectures. The approach in [5] models the AMBA2 bus at TL by using function calls for read/write operations, but also using SystemC clocked threads that can slow down the simulation speed. The approach proposed in [4] also models AMBA2 read/write transactions, but using low-level handshaking semantics to support cycle-level accuracy. More recently, the TLM approach has been extended in [6] to propose a new and faster transaction-based modeling abstraction level clock cycle accurate at transaction boundaries (CCATB) to explore the communication design space. The CCATB abstraction level tries to bridge the gap between the TLM and BCA levels, thus yielding an average performance speedup over pure BCA models.
Recently, commercial tools (such as in [7] and [8]) have started to support system modeling at the TLM level, in addition to lower level RTL modeling tools. Other recent research works, such as in [9], introduced variations and extensions to TLM to speed up system-level simulation. None of the aforementioned approaches is able to fully exploit the tradeoffs in terms of speedup and accuracy of the TLM with respect to BCA. In addition, current TLM simulators do not offer a consistent way of switching accuracy at runtime.

B. System-Level Power Modeling

Low-level power estimation tools (such as Synopsys Power Compiler) require synthesized RTL descriptions. Although these tools provide reasonable levels of accuracy, they require lengthy simulation times, becoming an unfeasible alternative for complex MPSoCs. The exploration of MPSoCs requires architectural-level power and performance simulators such as the SimplePower [10] and Wattch [11] platforms. The SimpleScalar framework and toolset [12] provides the basic simulation-based infrastructure to explore both processor architectures and memory subsystems. However, it does not support power analysis and has limited flexibility. Based on the SimpleScalar simulators, SimplePower [10] can be considered one of the first efforts to evaluate the different contributions to the energy budget at the system level. The SimplePower energy estimation environment consists of a compilation framework and an energy simulator that captures the cycle-accurate energy consumed by the SimpleScalar architecture, the memory system, and the buses. Although the proposed system-level framework is quite general, the exploration methodology reported in [10] is limited to a search over the space of the following parameters: cache size, block buffering, isolated sense amplifiers, pulsed word lines, and eight different compilation optimizations (such as loop unrolling).
More recently, the Wattch architectural-level framework has been proposed in [11] to analyze power with respect to performance tradeoffs with a reasonable level of accuracy when compared to the lower level estimation approaches. Wattch represents an extension of the SimpleScalar simulators to support power analysis at the architectural level. These tools focus on instruction-level parallelism and do not provide a sufficiently flexible framework to model MPSoC platforms. The MPARM multiprocessor simulation and analysis platform has been recently proposed in [13] for cycle-accurate power-performance estimation. The MPARM platform has been used in analyzing several on-chip communication architectures (such as AMBA from ARM and STBus from STMicroelectronics). In [13], the exploration results do not show any data on accuracy with respect to simulation speed. Moreover, the effort that is required to extend the MPARM platform has not been specified. For power estimation of multiprocessor systems, the approach proposed in [14] integrates a complete-system simulator (VirtuTech Simics [15]) with a microarchitectural simulator and power estimator (SimpleScalar/Wattch [11]). This approach focuses on estimates of performance and power of superscalar microprocessors in a complete-system simulation environment;
however, it does not consider the power dissipated by the interconnection, the external memory, and the peripherals. Other approaches, such as in [16] and [17], have recently analyzed the concurrent exploration of the design space considering both the compilation level and the architectural design space from the energy point of view. The power estimation framework proposed in [16] and [17] supports both hardware and software optimizations. The framework is based on SimpleScalar and SimplePower, where a high-level optimization module has been added to implement source-code optimizations. The focus of the work described in [16] and [17] is on energy estimation and optimization rather than the investigation of energy-delay tradeoffs. Multilevel simulation techniques were presented in [18], and they integrate an instruction-set simulator and a hardware emulator. However, they rely on the hardware implementation of the emulator, and the implementation process is rather demanding. In [19], Hines and Borriello describe a technique to present communication at different abstraction levels. The approach is interesting, but it is limited to communication and requires special interfaces and component modeling.

III. MULTI-ACCURACY TLM

To overcome some limitations of the previous approaches, we defined a flexible and efficient framework to support multilevel modeling and simulation for MPSoCs, which provides the capability of dynamically switching between a TLM description and a more accurate description of either a module or a communication channel. The basic idea consists of exploiting SystemC 2.0 TLM features together with remote procedure call (RPC) and introspection to support multilevel models in order to trade off simulation speed with accuracy at runtime. Multilevel models include both communication channels and generic processing modules. In general, the TLM focuses on the data flow of a system, defining abstract channels that transport data between modules.
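To make the abstract-channel idea concrete, the following plain C++ sketch (deliberately avoiding SystemC, whose interface and channel classes play this role in the actual framework) shows a primitive FIFO-like channel transporting whole transactions between modules; all names here are hypothetical illustrations, not the framework's API:

```cpp
#include <deque>

// Hypothetical sketch of a TLM-style abstract channel: modules exchange
// whole transactions through an interface, with no signal-level detail.
struct transaction { int addr; int data; };

class channel_if {                        // abstract channel interface
public:
    virtual void write(const transaction& t) = 0;
    virtual bool read(transaction& t) = 0;  // false when no transaction waits
    virtual ~channel_if() {}
};

class fifo_channel : public channel_if {  // primitive FIFO-like refinement
    std::deque<transaction> q;
public:
    void write(const transaction& t) override { q.push_back(t); }
    bool read(transaction& t) override {
        if (q.empty()) return false;
        t = q.front();
        q.pop_front();
        return true;
    }
};
```

A cycle-accurate bus model would implement the same interface, which is what makes the refinement transparent to the connected modules.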
These channels can be refined to provide different levels of accuracy. They span from primitive FIFO-like channels to fully modeled cycle-accurate buses. When modeling a complex system like the MPSoC, even TLM may show slow simulation speeds, particularly when using cycle-accurate models. This can represent a serious slowdown in the development process, in particular, when the designer wants to simulate only a small part of the application. The solution proposed in this paper consists of using a less-accurate model for the uninteresting sections of the simulation and the refined models for the kernels under analysis. This feature is not supported by the current TLM simulators, which do not provide an effective way of doing this at runtime. Furthermore, to the best of our knowledge, a consistent way of modeling systems at multiple accuracy levels does not exist yet. SystemC 2.0 implementation of TLM supports component hierarchies, i.e., modules can contain other modules. We exploit this hierarchy to provide a multi-accuracy simulation, and we use the concept of introspection (from software engineering) to allow runtime switching of the abstraction level of a module during SystemC simulation. The TLM separates modeling of
Fig. 2. Generation of SIDL stubs to allow asynchronous control of TLM entities.
computation and communication [2], and this is implemented in SystemC through two kinds of entities.

1) Modules: These entities model the actual functionality of the components in a system.
2) Channels: These entities model the communication between modules.

Our approach distinguishes between channels and modules because of their different nature. Modules can have arbitrary internal states that can be very different depending on the level of accuracy at which they are modeled, while channels only have to be consistent on a transaction basis.

A. Switching Models at Runtime

To switch the simulation speed and accuracy at runtime, we exploit the concept of introspection [20] (from software engineering), using an interface definition language, the SystemC interface definition language (SIDL) [21]. With SIDL and introspection, an external program can access services or variables of each simulation object. By defining a standard interface for hot switching, it is possible to determine or to change the accuracy level of each entity in the system at any given time during simulation. Interfaces specified in the SIDL language are compiled to generate a stub. The stub routes requests coming from outside the simulation environment to the various entities that are being simulated. These requests consist of reading variables, producing statistics, modifying a component's behavior, etc. One of the main advantages is that SIDL allows asynchronous external control of a simulation kernel, while its stubs are synchronized with the simulation, thus avoiding inconsistencies. SIDL's use and behavior are shown in Fig. 2. All these solutions have been implemented in a SystemC-based cosimulator, namely, StepNP [21]. In practice, each TLM entity (i.e., a channel or a module) extends a SocObject class that provides methods and structures that export services to the processes executing outside the simulation space.
These services let an external process read or modify the internal variables of the simulation. At the beginning of the simulation, each entity registers itself in an object request broker (ORB). The ORB acts as a name solver: It receives requests from the user or another program and routes them to the target TLM entity, as shown in Fig. 3. The novelty with respect to a standard RPC approach is that the ORB has to be synchronized with SystemC to avoid race conditions and inconsistent states in the simulator. When started, the ORB takes control of the simulation starting SystemC in a separate thread and uses the SystemC context methods to pause the model when needed, without affecting the simulation performance.
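The registration and routing mechanism described above can be sketched, in a much simplified form and with hypothetical names, as a name-solving registry that forwards external requests to the registered entities (the real ORB additionally synchronizes these requests with the SystemC kernel, which this sketch omits):

```cpp
#include <map>
#include <string>

// Hypothetical sketch of the introspection/ORB idea: every simulated entity
// derives from SocObject and exports readable variables; the ORB acts as a
// name solver, routing external requests to the registered target entity.
class SocObject {
public:
    virtual std::string read_variable(const std::string& name) = 0;
    virtual ~SocObject() {}
};

class CounterEntity : public SocObject {  // toy entity exporting one probe
    int count = 0;
public:
    void tick() { ++count; }
    std::string read_variable(const std::string& name) override {
        return name == "count" ? std::to_string(count) : "";
    }
};

class ObjectRequestBroker {
    std::map<std::string, SocObject*> registry;  // entity name -> object
public:
    void register_entity(const std::string& name, SocObject* obj) {
        registry[name] = obj;
    }
    std::string route(const std::string& entity, const std::string& var) {
        auto it = registry.find(entity);
        return it == registry.end() ? "" : it->second->read_variable(var);
    }
};
```

An external process would then issue requests such as route("cpu0", "count") without knowing where the entity lives inside the simulation.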
Fig. 3. Extension to TLM to support introspection.
Fig. 5. Internal architecture of a multilevel TLM split-transaction channel.
Fig. 4. Architecture of a multilevel channel.
To allow access to variables and signals, we created a subclass of sc_signal, called probe, that extends the object class, and all sc_signals have been redefined as probes.

B. Channels

Let us consider a channel: A TLM primitive channel provides one or more interfaces that can be accessed through ports by any module. A primitive channel can be refined into a hierarchical channel, i.e., a channel that can include processes, ports, and also other modules and subchannels, in addition to interfaces. It is possible to wrap the model of a channel inside another channel. Multiple channels, modeled at different abstraction levels, can be merged into a single entity, allowing multilevel simulation as shown in Fig. 4. In the following, we will refer to a channel as the container of a set of subchannels, i.e., models specified at different abstraction levels. Switching between subchannels introduces several issues.

1) We need a mechanism to route requests to the correct subchannel depending on the desired abstraction level.
2) We need to keep the system in a consistent state by avoiding any data loss during the switching of the abstraction level.
3) We need a mechanism to control and view the abstraction level of each component at runtime.

The first issue is the routing of the requests coming from the channel to the correct subchannel. As the channel interface method is called, the request is immediately sent to the currently active subchannel. To do so, each subchannel must present the same interface as the wrapping channel or use an appropriate converter/adapter solution. Considering a split-transaction bus model, this is done through the introduction of switchers and mergers. Switchers are bound to each master port of the channel and route transactions through the active subchannel; mergers get responses from both subchannels and route them to the correct destination, as shown in Fig. 5.
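The tagging discipline behind switchers and mergers can be sketched as follows; this is a simplified illustration with hypothetical names, omitting the bus protocol itself and keeping only the transaction table that records which subchannel must carry each response:

```cpp
#include <map>

// Hypothetical sketch of switcher/merger routing: a request issued while a
// given subchannel is active is tagged with that subchannel, and its
// response is later routed back through the same subchannel, even if the
// accuracy level has been switched in the meantime.
enum class Level { FUNCTIONAL, BCA };

class MultiLevelChannel {
    Level active = Level::FUNCTIONAL;
    std::map<int, Level> tag_table;            // transaction id -> subchannel
public:
    void set_level(Level l) { active = l; }    // runtime accuracy switch
    Level request(int txn_id) {                // switcher: tag, then route
        tag_table[txn_id] = active;
        return active;
    }
    Level response(int txn_id) {               // merger: route by stored tag
        Level l = tag_table.at(txn_id);
        tag_table.erase(txn_id);
        return l;
    }
};
```

Because the tag is recorded at request time, a transaction issued before a switch always completes on the subchannel it started on, which is what keeps the system state consistent.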
Note that both subchannels forward transactions to the merger that is bound to the destination slave, which, in turn, sends the transaction to the attached slave without performing any routing decision but keeping a table of the transactions, recording the
Fig. 6. Architecture of a switcher for a split-transaction bus.
source subchannel. On the other hand, mergers route transaction responses through the same subchannel the transaction was started from, using the information from the transaction table. When the transaction response arrives from the subchannel, it goes through the switcher interface and is forwarded to the external port, which is connected to the master external module, as shown in Fig. 6. Mergers analogously receive requests from an interface, and they route requests through their external ports, which are connected to the slave external modules. To keep the system in a consistent state, each subchannel has to complete all the transactions that have been routed through it. The wrapping channel enforces this behavior by keeping track of all transaction requests: Whenever the abstraction level is changed, the channel will start routing new requests to the appropriate subchannel, but all previous transactions will still be dealt with by the previously selected subchannel. This is done by keeping a status request table (see Fig. 5) and by associating each incoming request with a tag that specifies the subchannel that handles it. Routing of responses is performed according to the tag. As an example, let us consider a processor P that is communicating with a memory M. Suppose that P is multithreaded and that communication is modeled as split transaction and using two channels. Suppose also that P’s first thread (T1) is reading data from memory and waiting for a result, while P’s second thread (T2) is writing to another location in memory. Simulation is taking place at functional level. When T1 accesses the channel, the transaction is marked as belonging to the functional channel. Before the response to T1’s read request is back to P, the user decides to change the accuracy level of
Fig. 7. Average transaction duration when switching a channel from BCA to UTF: Outstanding transactions have to be completed to establish the expected behavior (indicated by the dashed lines).
the simulation to bus cycle accurate (BCA). T2's transaction is routed into a different channel than T1's and is marked accordingly. When the memory is ready to return the read value to P, the information will travel in the functional-level channel, as T1's transaction was marked functional, while T2's write will be propagated through the BCA channel. This means that channel models can be changed at any time without losing the state of the system, leaving the designer free to decide which part of the simulation should be considered at a higher level of detail. The subchannel that is modeled at high accuracy has an internal state that likely carries more information than the subchannel modeled at low accuracy. Therefore, whenever we switch from low accuracy to high accuracy, the missing part of the channel state has to be recreated. As an example, consider the channels as simple FIFO buffers. When we switch from one buffer to the other, the latter has to be filled before it can show its actual throughput, which might depend on the buffer size. Similarly, channels need to build their internal state. The actual accuracy loss depends on the specific channel models that are used and on the application used for testing. Assuming that switching points are sufficiently far from each other, we can suppose that after switching, the newly selected subchannel starts processing transactions in an empty state. To maintain accuracy, in the worst case, a number of transactions equal to the maximum contention of the subchannel has to be processed. The same principle applies when switching from high to low accuracy. The outstanding transactions in the high-accuracy channel need to be completed before the average transaction delay settles to the expected values for the low-accuracy channel. Fig. 7 shows how 102 simulation cycles have to be performed to reestablish the expected transaction delays when switching from BCA to untimed functional (UTF).

C. Modules

The implementation of generic multi-accuracy modules is less straightforward: It is necessary to keep track of the module state, and this can be significantly different for each accuracy level. As an example, a functionally modeled processor state contains much less information with respect to the same processor modeled at the internal bus signal level. It is not
useful to keep two instances of the same module running at different levels of abstraction to maintain state consistency, as the slower model would be the bottleneck for simulation speed. Nevertheless, it is possible to increase the simulation speed by desynchronizing the module from the rest of the system. If we consider a standard event-driven simulation, clock distribution plays a noticeable role in the overall simulation complexity. Every time the module modifies some signals, it generates events that are put into a queue and scheduled by the simulation kernel. The system clock is responsible for generating many of these events. What we propose is the internal desynchronization of a component: disregarding the system clock or other global synchronization signals and leaving each module to run at its own pace increases the simulation speed. Synchronization is regained only at module boundaries, i.e., when the module accesses interconnection channels. This can lead to a substantial speedup: In fact, running SystemC code without any clock wait state is like running native C++ code. By selectively removing wait() statements from the SystemC code, while keeping internal synchronization, it is possible to trade off simulation time with simulation accuracy. Models have to be properly designed to keep their internal synchronization when not using the clock. This is done through the use of dynamic sensitivity lists and synchronization events (i.e., the sc_event structure). Dynamic sensitivity lists do not interfere with the cycle-accurate simulation, but they can be deactivated at runtime to speed up the simulation. It is also possible to enforce synchronization every n clock cycles, triggering the wait() statement every n calls. The activation of wait() statements is controlled externally, through SIDL.
This is implemented through a redefinition of the wait() statements, using a value to determine the desired speed, i.e., the number of calls to wait() before execution is actually suspended.

void multiLevelWait() {
    static int speed = DEF_SPEED;
    static int counter = 0;
    if (speed > 0 && counter < speed) {
        counter++;
        return;
    } else {
        counter = 0;
        wait();
    }
}

The model is synchronized only after a certain number (speed in the code) of calls to the wait statement. This can be extended to the point of avoiding synchronization completely. Whenever the module tries to access an external channel, synchronization with the other modules is maintained by the TLM infrastructure: Transactions (split transactions) are blocking for the module (thread) that generated them. In our experiments, the model of a multithreaded ARM can double its speed when synchronizing every 100 000 cycles. We implemented a cycle-accurate instruction set simulator inside a SystemC wrapper, synchronizing to the clock only on request,
Fig. 8. Multilevel module with external control and view.
and leaving it to run at full speed when simulation accuracy is not needed, as shown in Fig. 8. This approach differs from the channel models in that it does not wrap two different objects in a single module. Rather, the module is appropriately disconnected from the clock signal. Hence, there is no need to keep track of the state of the module, and no routing is involved in interface calls, provided that the module can keep internal synchronization.

IV. MULTI-ACCURACY TL POWER ESTIMATION

Starting from multi-accuracy modeling, we also propose a TL power estimation layer, which is used to explore power-performance tradeoffs. Since each target system consists of interconnected components, which are modeled as TLM entities, we can assume that the power consumption of each component, for estimation purposes, depends only on the component's own behavior. For this reason, we built our power estimation framework over an existing TL simulator that already takes into account the interaction between components. Given this assumption, the energy consumption at the system level is given by the sum of the contributions of all the components of the system, and the power consumption can be derived consequently once the actual performance of the system is determined by simulation. It is worth noting that there is no assumption about the nature of a component: It can be specified as any TLM entity, either a channel or a module. The main idea consists of decoupling power modeling from the TL simulation, thus allowing power models to be attached to entities specified at different accuracy levels, or different power models (e.g., with different accuracies) to be used on the same entity. In the following, since we are discussing the power consumption of physical devices, the term component and the corresponding TLM entity are considered interchangeable.
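Under the stated assumption, the system-level energy is simply the sum of per-component contributions, each produced by a power model reading that component's probes. A minimal C++ sketch of this decoupling follows; the names and cost values are illustrative assumptions, not the actual framework API:

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the decoupled power layer: each component exposes
// probes (activity counters) gathered by the TL simulator; a power model
// maps probes to an energy contribution; system energy is the sum over all
// components.
struct Probes { long instructions; long idle_cycles; };

class PowerModel {
public:
    virtual double energy(const Probes& p) const = 0;
    virtual ~PowerModel() {}
};

// A generic instruction-cost model: the same model can be attached to a
// functional or a cycle-accurate processor, as long as the probes exist.
class InstructionCostModel : public PowerModel {
    double e_instr, e_idle;  // statistically fitted per-event energy costs
public:
    InstructionCostModel(double ei, double eid) : e_instr(ei), e_idle(eid) {}
    double energy(const Probes& p) const override {
        return p.instructions * e_instr + p.idle_cycles * e_idle;
    }
};

// System energy is the sum of the contributions of all components.
double system_energy(
    const std::vector<std::pair<const PowerModel*, Probes>>& comps) {
    double total = 0.0;
    for (const auto& c : comps)
        total += c.first->energy(c.second);
    return total;
}
```

Note that the same InstructionCostModel instance can serve several components, which is exactly the reuse across accuracy levels that the decoupling is meant to enable.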
Let us assume that the power model of a component can be seen as a black box and that it requires data coming from a generic TL simulator in addition to an optional parameter set (such as technological or statistically determined ones). This means that each component is seen as an object c associated with a type tc , which exports a set of data variables Pc called probes. A power model can then be represented by an abstract class, i.e., an information-hiding shell that protects the inner structure, but provides a well-defined external interface. This class can be extended to any power model for any kind of component, declaring the probes the component should export, and what components it works for. This has been implemented as a reflection mechanism: Models are organized as a type
Fig. 9. Outline of the processor component domain. Model search explores domain trees starting from the root.
hierarchy and declare what data they need. A type hierarchy enforces a typing graph, in this case, a tree defined as follows.

Definition 1: A domain D is defined as a set of component types d_i. The component domain tree T = (V, E) is defined as a directed acyclic graph, where V = {D_0, D_1, ..., D_n} is a set of domains and E = {e_{j,k}} is a set of edges. V_j = {D_k | ∃ e_{j,k}} is the set of domains connected by an edge to domain D_j, and D_0 = ⋃_{D_k ∈ V} D_k is the root of the tree, where the following properties hold for all i, j, k with i ≠ j ≠ k:

∃ e_{j,k} ⇔ D_k ⊆ D_j ∧ ∄ D_i : D_i ⊂ D_j ∧ D_k ⊂ D_i    (1)

D_j = ⋃_{D_k ∈ V_j} D_k    (2)

∅ = ⋂_{D_k ∈ V_j} D_k    (3)

∀ D_k ∃ E_k = {e_{0,f(1)}, e_{f(1),f(2)}, ..., e_{f(n),k}}    (4)

for some f : N → V with n ∈ N. By construction, a power model can be connected to any component belonging to its domain. The most general power model is the root model, which defines the basic domain handled by the framework. Up to now, domains are classified as processors, memories, and interconnection channels, but this does not represent a limitation. As an example, Fig. 9 outlines the processor domain. A model can be specified as belonging to the reduced instruction set computer (RISC) domain, thus covering all kinds of RISC processors, or as belonging to the ARM9 cycle-accurate domain, working only with cycle-accurate simulations of the ARM9 processor. As previously stated, the power models explicitly declare what data they need. This is done by specifying a set of probes for each model, i.e., data structures (identified by a name and a type) containing the performance data.

Definition 2: A power model m is a tuple

m = {P_m, R_m, s_m, f(P_m, R_m, s_m), g(P_m, R_m, s_m), D_m}

where P_m = {p_i} with i ∈ [0, p] is a set of parameters, R_m = {r_j} with j ∈ [0, r] is a set of probes coming from a TL simulator, s_m is an optional state of the model, f(P_m, R_m, s_m) is a function that provides a power estimate from the two sets,
g(Pm, Rm, sm) is a function that computes the next state, and Dm ∈ T is the domain associated with the power model. These definitions imply that a power model m can be associated with a component c of type tc that provides the probe set Pc if and only if tc ∈ Dm ∧ Rm ⊆ Pc. It is then possible to define very general to very specific models, and their applicability is specified by the domain and by the probes. This enables the use of the same power model with differently specified components, as long as they provide all the necessary probes. As an example, we shall consider a model defined for a generic ARM processor, requiring only the number of instructions that have been executed. Whether the processor is specified as functional or cycle accurate, the power model remains the same, associating a cost with each executed instruction. Among its parameters, there is one that specifies the technology or the type of ARM that is being considered. The accuracy of the model then depends on the accuracy of the underlying architectural simulator. The f(Pm, Rm, sm) function determines the output of the power model m. For the sake of completeness, the power models can have an internal state, making them behave like finite state machines, to model complex energy consumption behaviors. As an example, several processor models at the instruction level consider inter-instruction effects, and thus, they have to keep track of the last executed instruction. The function f can be of any kind; however, as we are interested in high-level models suitable for system-level exploration (such as [13], [22]–[25]), it is of the type

f = Σ_{i=0}^{p} ri pi + Σ_{i=0}^{r} si p(i+p)    (5)
where the parameters pi are usually statistically estimated and si = 0 ∀i, i.e., models do not require any internal state in most cases. In fact, most models require just the current state of the performance simulator to produce their output (see Section IV-A). This function includes memory models (number of accesses × access cost), processor models (executed instructions × cost + idle cost × cycle), and activity-based network models (data transferred × cost). In most cases, power is computed as an energy cost multiplied by an activity factor and scaled to the performance of the simulated system [26]. The obtained power is then adjusted by adding a leakage power and other effects that can be determined statistically.
Power models are added to the simulation infrastructure as shown in Fig. 10. Each model is an object that responds to external commands in the same way as TL entities. Each model listens to some signals (called probes) that are extracted from each TL entity using the proposed introspection mechanism. Each probe is accessible through SIDL without the need for specific support. These signals are used, together with other parameters, to produce power estimations associated with each TL entity. Registration associates a power model with a single TL component, and such an association can be created or destroyed
Fig. 10. Extension to TLM to support power modeling.
at runtime, depending on the accuracy needs of the designer. A model can register in two different modes.
1) Push: The simulator pushes data to the models every time a change happens in the probes, forcing the models to update their state.
2) Pull: The power model waits for an external event (e.g., the user asking for a power analysis) to get the latest probe values from the simulator.
The former mode is useful for stateful models, while the latter can be used by stateless ones (as most models are).

A. Model Library

Let us consider a very simple processor power model: a processor that has two power consumption states, an active state and an idle state. The model is described as

Rm = {load, t}
Pm = {idleP, activeP}
sm = g(·) = ∅
f(Rm, Pm) = [load × activeP + (1 − load) × idleP] × t
Dm = processor

where load is the average computational load of the processor (expressed as a number in [0, 1], which is assumed to be available as a probe), t is the simulation time, and idleP and activeP are the idle and active powers, respectively. It is worth noting that the model is stateless (sm = ∅) and is fit for every kind of processor (Dm = processor), given the appropriate parameters. This model, although very simple, is widely used and has been proved to have an error smaller than 10% when used on the ARM processor [22], [25]. The cost of adding this model to the framework is 129 lines of code (LOC). Executing the model with different parameters allows it to be connected to virtually any processor, modeled at any level of abstraction or accuracy.
Concerning memory models, analytical or statistical models can be used to determine a cost per access to memory cells, separating read from write accesses and typical-case from worst-case values. For system-level simulation, these costs are used to
determine the overall energy consumption of a memory block. In our approach, these costs are parameters, and the number of accesses represents the probes needed by the model
TABLE I POWER MODEL INTEGRATION EFFORT
Rm = {numR, numW, t}
Pm = {readE, writeE, leakage}
sm = g(·) = ∅
f(Rm, Pm) = numR × readE + numW × writeE + leakage × t
Dm = memory

where numR and numW are the number of read and write accesses, respectively, leakage is the leakage power, and readE and writeE are the access costs expressed as energy. We have estimated the parameters readE, writeE, and leakage by linear regression over a set of measurements on a 90-nm technology memory, obtaining results that exhibit a 3% absolute error. A similar model is defined for caches with roughly the same accuracy, and both require approximately 185 LOC.
It is also possible to define complex models of network-on-chip power consumption; as an example, the model for the STBus interconnection network proposed in [27] has been successfully integrated, but it requires an internal state and therefore works only in push mode. The model can be represented by the tuple

Rm = {requests[·], responses[·], C}
Pm = {nodes, B[·], Ps[·], Pr[·]}
sm = currentEnergy
g(sm, f(·)) = currentEnergy + f(Rm, Pm)
f(Rm, Pm) = Σ_{n∈nodes} (B[n] + Ps[n] · requests[n]/C + Pr[n] · responses[n]/C)
Dm = interconnect.stbus
where requests[·] and responses[·] are the active requests and responses in the current clock cycle of each node, C is the number of clock cycles executed so far, nodes is the set of nodes contained in the current instance of STBus, and B[·], Ps[·], and Pr[·] represent a database of parameters associated with every node n and, in particular, a base cost, packet sending power, and packet receiving power, respectively. A node n is identified by the tuple

n = ⟨i, t, rqr, rpr, p, CL, dps, Type⟩    (6)

where i is the number of masters, t is the number of slaves, rqr is the number of request links, rpr is the number of response links, p is the type of arbitration policy (STBus has seven arbitration policies), CL is the output pin capacitance, dps is the data-path size (in bits), and Type is the protocol mode of STBus. The parameters are calculated per node and are loaded from a library.
Considering power estimation at the system level, assuming unbiased models, errors do not sum (i.e., the average error is null). The overall estimation error is then given by the less-accurate model.
Adding a power model to the proposed framework is relatively simple, given its regular structure and coherent application programmer's interface (API). Several models have been ported to the framework, with the overall porting effort shown in Table I: The first column shows the model that has been ported; the second column shows the number of LOC written for the porting effort (including comments and blank lines); and the last two columns show the number of probes and parameters needed by each model. Even the most complex models require only a few probes. Overall, the integration effort of a model is very limited, proving the applicability of the approach. However, commercial models might not provide all the probes needed, or it might be difficult to access them. Therefore, the effort required might depend on the observability of the available models.

V. EXPERIMENTAL RESULTS
The proposed modeling approach has been tested using the StepNP simulation platform [21] with several benchmarks and a large multimedia application, namely, an MPEG4 encoder. The selected benchmarks are as follows:
1) VMUL: a parallel vector multiplication application that exploits both multicycle ARM instructions and accesses to memory;
2) PI: a parallel π calculator that makes full use of the processors' computing ability;
3) MemTest: a parallel mass-memory read/write application;
4) Sort: a sorting application that exploits bandwidth and uses little computing power;
5) MPEG4: a full-fledged MPEG4 encoder, which is used to prove the methodology on a real industrial application.
MemTest and Sort are suitable for testing the channel-switching methodology because they fully use the offered bandwidth, introducing a certain amount of contention, and they can show the accuracy trend when switching between two channels. PI and VMUL are appropriate for testing the desynchronization of the ARM processor because they focus on computing more than communication.
Experiments considered the simulation of a multiprocessor system-on-chip. The target architecture consists of five ARM processors, two levels of caches, local memories for each processor, an external bus, and a scratchpad. All components are connected by the STMicroelectronics interconnection network, STBus. All components are modeled at the timed
Fig. 11. Comparison of functional simulation time with respect to BCA simulation (including the overhead of switchers and mergers).
Fig. 13. Comparison of simulation performance for desynchronized ARM processor modules, as compared to BC accuracy.
Fig. 12. Comparison of TF simulation accuracy with respect to BCA.
Fig. 14. Execution time of the benchmarks with respect to BCA when applying both methodologies.
functional (TF) level, except for the processor and the interconnection network. Concerning channel modeling, the BCA model of the STBus interconnection network in crossbar mode (supplied by STMicroelectronics) has been simulated in parallel with a functional crossbar.

A. Multi-Accuracy Simulation

Fig. 11 shows the simulation speed of the aforementioned benchmarks when switching to TF and UTF simulation with respect to BCA simulation. For the selected benchmarks, simulation time is less than 50% on average when switching from BCA to TF, and it is reduced to 30% when switching to UTF. When considering complex applications like MPEG4, simulation time can be reduced to 20%. The overhead introduced by the switchers and mergers is negligible when compared to the actual simulation time, as shown in Fig. 11. It is worth noting that the greater the use of the communication channel by the application, the greater the difference between the functional and BCA models. Fig. 12 shows the effect of the accuracy change. Notice that the error is very low for benchmarks with low communication overhead, while it is higher for MPEG4 and Sort, which use the interconnection channel more intensively. TF simulation of channels is less accurate than BC, but the error remains acceptable and can be traded off for increased speed. UTF is not considered since it disregards accuracy completely in favor of fast simulation.
A second set of experiments was performed by using a desynchronized ARM processor module, i.e., the processor uses global synchronization signals only when communicating with other modules. Results are shown in Fig. 13 and prove that the "uninteresting" parts of the application can be simulated in half
the time when using our methodology. There is no noticeable overhead introduced when desynchronizing the ARM modules.
Extensive experimentation has been performed to determine the delay needed to regain high-accuracy simulation at the accuracy switch boundary. Accuracy switching was performed on all the benchmarks in different sections of the code, and the results show that no delay is needed: the sections executed in BCA mode maintain clock-cycle accuracy from the first cycle. Finally, by combining the channel and module methodologies, it is possible to gain up to one order of magnitude in simulation speed, as shown in Fig. 14, with negligible overhead.

B. Multi-Accuracy Power Modeling

In the following, we show how our framework can be used for TL power estimation. To prove the effectiveness of the approach, the framework infrastructure has been tested with two goals: determining the overhead of TL power estimation, and evaluating the tradeoff between power model complexity and simulator accuracy. Since system-level simulation and power estimation can be very time consuming, it is of high interest to evaluate the cost in terms of simulation performance when adding power estimation to TL simulation. The overall cost C in terms of complexity of running the power framework presented in Section IV can be computed as

C = O(nK + p), when using pull mode
C = O(αN(nK + p)), when using push mode    (7)
Fig. 15. StepNP simulation performance in thousands of transactions per second (kTPS) using a P4 3.2-GHz Linux machine with 1 GB of RAM.
where K is the cost of the most expensive model, n is the number of components, α is the activation coefficient for the most expensive model, N is the number of simulation cycles (or events, in the case of asynchronous simulation), and p is the total number of probes. Pull mode shows linear complexity in the number of components if each model has constant cost, which is true for most models [see (5)]; push mode shows polynomial complexity in the number of simulated cycles. This means that the pull mode does not hinder the performance of the underlying simulator, while push mode somewhat affects the simulation speed. This has been proved by running the StepNP simulator with the selected benchmarks, using both pull and push modes, as shown in Fig. 15. As expected, the pull mode does not affect the simulator performance in any noticeable way. Note that the push mode is applied to all power models, i.e., every instruction executed by the processor, every access to the memory, etc., forces the recalculation of the power consumption of each component. Nevertheless, the simulation performance remains acceptable.
The effect of the domain level of models and of the simulator abstraction level on which they are applied is of great interest to the designer. Knowing the effect of a chosen simulation accuracy on power estimation lets the designer identify an accuracy/simulation-time tradeoff: it is possible to choose the fastest simulator that obtains the required accuracy level for power estimation, and to choose the most appropriate model in the target component domain. All the experiments provided in the following have been performed on the selected benchmarks described above (PI, VMUL, MemTest, Sort, and MPEG4).
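The two registration modes can be sketched in a few lines. This is a hypothetical illustration (all class and probe names are invented, and Python stands in for the C++ framework), not the framework's actual API: in push mode the simulator drives the model on every probe change, while in pull mode the model samples the probes only when an estimate is requested.

```python
# Hypothetical sketch of push vs. pull model registration; names invented.
class Probe:
    def __init__(self, value=0):
        self.value = value
        self.listeners = []           # push-mode models watching this probe

    def write(self, value):           # called by the TL simulator
        self.value = value
        for model in self.listeners:  # push: every change updates the models
            model.update()

class PushModel:
    """Stateful model: must see every probe change (e.g., inter-instruction effects)."""
    def __init__(self, probe, cost):
        self.probe, self.cost, self.energy = probe, cost, 0.0
        probe.listeners.append(self)

    def update(self):
        self.energy += self.probe.value * self.cost  # g: accumulate state

class PullModel:
    """Stateless model: reads the latest probe value only on demand."""
    def __init__(self, probe, cost):
        self.probe, self.cost = probe, cost

    def estimate(self):
        return self.probe.value * self.cost          # f on current probes

probe = Probe()
push, pull = PushModel(probe, cost=2.0), PullModel(probe, cost=2.0)
for v in (1, 0, 1, 1):   # simulator activity: one probe change per cycle
    probe.write(v)
print(push.energy)       # accumulated over every change: (1+0+1+1)*2.0 = 6.0
print(pull.estimate())   # sampled once, from the last value only: 2.0
```

The sketch also shows why push mode scales with the number of simulated cycles N while pull mode does not: the push model does work on every `write`, the pull model only when queried.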
Considering processor cores, two models were used for an ARM processor: a simple parameterized approach (consisting of a generic model using average power values, which has been proved to be reliable for the ARM core [22]) and an assembly-level model [28], whose accuracy was compared with RTL simulation, as shown in Fig. 16. This figure shows that using these high-level models (associating costs with instructions) provides satisfactory accuracy for the ARM core, while the use of more accurate simulators increases the simulation cost with no advantage. In fact, assembly-level models require only the number and type of executed instructions to provide estimations, and therefore, even a functional simulator provides sufficient accuracy. This means that it is
Fig. 16. Relative accuracy of different CPU power models (as instruction level and RTL) as tested on different simulator accuracies (functional, cycle accurate, and RTL) using StepNP.
Fig. 17. Accuracy of different memory models (simple, interpolated) as tested on different simulators (functional, cycle accurate) using StepNP.
possible to obtain reasonable relative power estimations even using simple models for processor cores, whenever reasonably accurate instruction-level models are available.
Memory models behave differently. Since most models are access based, their accuracy depends on the precision of the estimate of the number of accesses to memory. Hence, the optimal simulator is cycle accurate; any higher level of detail will result in needless overhead. This behavior is shown in Fig. 17.
Considering interconnection networks, the STBus power model introduced in [27] was tested with different simulator accuracies. This model is more complex than the previously described ones, and its behavior is not easily predicted. It considers a base cost per clock cycle for each node configuration belonging to the network on chip; to this cost is added the number of packets received multiplied by a cost per forwarded cell. The factors influencing accuracy are then clock accuracy and a correct estimate of the number of packets sent through the network. A functional simulator thus produces an error that is proportional to the estimation error of the number of clock cycles and the number of cells sent through the network, and inversely proportional to the ratio between the base and transfer costs. Experimental activity using a TF simulator revealed that the estimation of clock cycles has an average 16% error, while the estimation error of the number of cells averages 31%. The total estimation error strongly depends on the configuration of each node (i.e., the base cost/transfer cost ratio) and was not computed.
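The inverse dependence on the base/transfer cost ratio can be checked with a toy calculation (all numbers and function names below are invented for illustration, not measured values): a per-node model of the form base + (cells/cycles) × transfer cost dampens a misestimated cell count when the base cost dominates.

```python
# Hypothetical illustration of how the base/transfer cost ratio scales the
# error induced by a misestimated cell count; all numbers are invented.
def node_power(base, transfer, cells, cycles):
    return base + (cells / cycles) * transfer

def rel_error(base, transfer, cells, cycles, cell_err):
    exact = node_power(base, transfer, cells, cycles)
    est = node_power(base, transfer, cells * (1 + cell_err), cycles)
    return abs(est - exact) / exact

# Same 31% cell-count error, two different base/transfer cost ratios:
lo_ratio = rel_error(base=1.0, transfer=10.0, cells=50, cycles=100, cell_err=0.31)
hi_ratio = rel_error(base=10.0, transfer=1.0, cells=50, cycles=100, cell_err=0.31)
print(lo_ratio > hi_ratio)  # True: a large base cost masks the cell-count error
```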
VI. CONCLUDING REMARKS

This paper presented an extension to TLM to support multi-accuracy models and power estimation. This approach also provides the ability to switch the accuracy of a given model during simulation, without affecting the consistency of the simulation. This allows the designer to trade off between simulation accuracy and speed at runtime. In particular, we have shown how to extend the TLM and to modify the SystemC kernel to support multilevel features, and how the proposed methods have been tested on several benchmarks, among which is an industrial MPEG4 encoder. Results show a negligible overhead and a simulation speed increase of up to an order of magnitude.
Other simulators provide fast functional simulation, but they are not as general as the approach proposed here. SimpleScalar and its multiprocessor extensions [29] are limited to instruction processing, while here, all the aspects of communication and functionality are considered multilevel. In addition, the ability to perform multilevel power modeling and the orthogonality of TL and power modeling are first introduced here. Future work includes the automatic identification of application sections to be quickly simulated and the integration of the framework with DSE algorithms.

APPENDIX
POWER FRAMEWORK IMPLEMENTATION

This appendix describes the implementation details of the power estimation framework that was built to test the model presented in this paper. A modular software architecture for system-level power estimation has been defined. It aims at maximizing the code reuse and the flexibility of the resulting power estimation framework. To obtain this goal, the proposed framework uses object-oriented paradigms and standard design patterns derived from software engineering, as outlined in the following.
The framework is composed of a set of basic building blocks.
1) Components: An abstraction of an object inside the simulator.
It provides information about its state by exporting a set of probes.
2) Probes: Objects that provide data taken from a component. They provide a consistent interface to both simple and structured data types.
3) PowerModels: Functions that, given a set of input variables (which have been derived through probes), compute the power/energy consumption associated with a component.
4) Parameters: Additional user-specified variables that are needed as input to models but are not strictly bound to the simulated component, for example, the process technology used.
5) PowerLibraries: A database of parameter and probe data, which is used to store information for easy and fast retrieval.
All of these components are bound together in the framework, and their use is possible through the framework API. A simplified diagram of the proposed architecture is shown in Fig. 18. The core of the framework architecture is the
Fig. 18. Framework architecture.
ModelManager: this object is responsible for finding the best match between models and components and for connecting them to perform the actual estimation. The ModelManager is independent of the models, the underlying simulation architecture, and the library type. For this purpose, it relies on two abstraction layers: the library layer and the simulator driver layer.
In order to let a PowerModel get data from a generic probe in a simple way, the abstraction of C++ streams was used. Generic streams were extended in an object-oriented fashion into ProbeStreams. These are created by a SimulatorDriver and passed to each PowerModel. The model is then able to read data from them as if they were typed communication streams, i.e., it can request ProbeValues, and these will be provided, blocking the PowerModel until data are available.
SimulatorDrivers require special attention since they represent one of the main advantages of the framework. In fact, they form an abstraction layer between the framework and the simulator by providing the necessary glue and exporting three main functions.
1) Basic connection tools: Establishing, checking, dropping, etc.
2) List components: The driver has to be able to connect to the simulator, extract the component list, and return it to the framework using the internal data structure.
3) Probe registration: Upon request, the driver will register a probe or a set of probes and return the corresponding ProbeStream.
The other building blocks of the framework architecture are components, PowerModels, and parameters. The components are data structures used to identify the objects inside the simulator. In practice, they are used only as probe placeholders, since the framework driver layer separates the components from their actual implementation inside the simulator.
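The blocking behavior of ProbeStreams can be sketched as follows. This is a hypothetical Python stand-in for the C++ streams (all names invented), not the actual implementation: a thread-safe queue plays the role of the stream, and the consuming model blocks on a read until the driver has produced a value.

```python
# Hypothetical sketch of a blocking ProbeStream; names are invented.
import queue
import threading

class ProbeStream:
    """Stream created by a SimulatorDriver and handed to a PowerModel."""
    def __init__(self):
        self._q = queue.Queue()

    def push(self, value):            # driver side: publish a probe value
        self._q.put(value)

    def read(self):                   # model side: block until data available
        return self._q.get()

stream = ProbeStream()
results = []

def model():                          # a PowerModel consuming probe values
    for _ in range(3):
        results.append(stream.read())

t = threading.Thread(target=model)
t.start()
for v in (10, 20, 30):                # the driver producing values
    stream.push(v)
t.join()
print(results)                        # [10, 20, 30]
```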
The component data structure hence includes the following:
1) name: used to identify the component against all the others during a simulation run;
2) type: identifies which PowerModels can be attached to this particular component;
3) probe list: a list of all the data that the component can provide about itself, identified by common names.
These names are used by the ModelManager to match a component to a model. If the component provides all the probes that the model needs, the former will be connected to the latter. Data are taken from the simulator through the ProbeStreams generated by the simulator driver.
PowerModels are abstract classes representing a function with data from a specific component and parameters as input and the power and energy consumption of that component as output. The base implementation provides everything that is
needed for the integration into the framework, but not the power-computing functions. This makes the addition of models extremely easy, with new models requiring around 50 SLOC or less. PowerModels are typed using the domain name system structure [30]: this specifies the class of components the PowerModel can handle. By design, a PowerModel can be connected to any component belonging to its domain. Depending on the domain level, a PowerModel can be connected to a single component (e.g., a specific off-the-shelf memory core) or to many. PowerModels also specify which data they need, and the framework takes care of finding out whether each component can actually provide those data. This information is held in the needed-probes list in the PowerModel data structure. In addition to the component data, the PowerModel may need some extra parameters (such as the process technology). These are represented with a special data structure and are usually stored in a database for easy retrieval.
PowerLibraries are used to store probe and parameter data in a database in a fully automatic way. A PowerLibrary is a layer between the PowerModels and a database, and it can be seen as a driver exporting basic functionalities like reading and writing probes and parameters.
To simplify the creation and association of models, the factory design pattern was used. The ModelManager calls the factory every time it has to create a model, passing a string containing the model type, and the factory returns a reference to the newly created object. The type hierarchy ensures that the ModelManager connects each model to the correct component. It is also possible to manually specify which model is connected to each component and which parameter set to use. In case no parameter set is specified, the default one is used. PowerModels are chosen from a library specified in the ModelFactory. Adding a new model is as simple as writing a new line in a configuration file, and several models are already available.
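The factory scheme and the domain matching can be sketched as follows, again as a hypothetical Python stand-in (all names invented) for the C++ ModelFactory: the manager asks the factory for a model by type string, and the check corresponding to tc ∈ Dm ∧ Rm ⊆ Pc keeps connections type-safe.

```python
# Hypothetical sketch of the ModelFactory and domain matching; names invented.
REGISTRY = {}                        # model-type string -> model class

def register(domain):
    def wrap(cls):
        cls.domain = domain
        REGISTRY[domain] = cls
        return cls
    return wrap

@register("processor.arm")
class ArmModel:
    needed_probes = {"instructions"}

def create_model(type_string):
    return REGISTRY[type_string]()   # factory: type string in, instance out

def can_connect(model, component_domain, component_probes):
    # tc in Dm (prefix match on the domain tree) and Rm a subset of Pc
    return (component_domain.startswith(model.domain)
            and model.needed_probes <= component_probes)

m = create_model("processor.arm")
print(can_connect(m, "processor.arm.arm9", {"instructions", "load"}))  # True
print(can_connect(m, "memory.sram", {"instructions"}))                 # False
```

The prefix match is a simplification of the domain tree of Definition 1: a dotted type string encodes the path from the root, so a model's domain contains every component whose type string extends it.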
This flexibility covers most of the designer's needs.

ACKNOWLEDGMENT

The authors would like to thank P. Paulin and all the SoCPA team at STMicroelectronics for their contributions.

REFERENCES

[1] G. Beltrame, D. Lyonnard, C. Pilkington, D. Sciuto, and C. Silvano, "Exploiting TLM and object introspection for system-level simulation," in Proc. DATE, Mar. 2006, pp. 100–105.
[2] L. Cai and D. Gajski, "Transaction level modeling: An overview," in Proc. Conf. CODES+ISSS, 2003, pp. 19–24.
[3] X. Zhu and S. Malik, "A hierarchical modeling framework for on-chip communication architectures," in Proc. IEEE/ACM ICCAD, 2002, pp. 663–671.
[4] O. Ogawa, S. B. de Noyer, P. Chauvet, K. Shinohara, Y. Watanabe, H. Niizuma, T. Sasaki, and Y. Takai, "A practical approach for bus architecture optimization at transaction level," in Proc. Conf. DATE, 2003, p. 20176.
[5] M. Caldari, M. Conti, M. Coppola, S. Curaba, L. Pieralisi, and C. Turchetti, "Transaction-level models for AMBA bus architecture using SystemC 2.0," in Proc. Conf. DATE, 2003, p. 20026.
[6] S. Pasricha, N. Dutt, and M. Ben-Romdhane, "Extending the transaction level modeling approach for fast communication architecture exploration," in Proc. 41st Annu. DAC, 2004, pp. 113–118.
[7] CoWare. (2006, Dec.). Platform Architect. [Online]. Available: http://www.coware.com/products/platformarchitect.php
[8] Synopsys. (2006, Dec.). CoCentric System Studio. [Online]. Available: http://www.synopsys.com/products/designware/system_studio/
[9] E. Viaud, F. Pecheux, and A. Greiner, “An efficient TLM/T modeling and simulation environment based on conservative parallel discrete event principles,” in Proc. DATE, 2006, pp. 1–6. [10] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye, “Energy-driven integrated hardware-software optimizations using simplepower,” in Proc. ISCA, 2000, pp. 95–106. [11] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A framework for architectural-level power analysis and optimizations,” in Proc. 27th Annu. Int. Symp. Comput. Architecture, 2000, pp. 83–94. [12] D. Burger, T. M. Austin, and S. Bennett, “Evaluating future microprocessors: The SimpleScalar Tool Set,” Univ. Wisconsin, Madison, Tech. Rep. CS-TR-1996-1308, 1996. [13] M. Loghi, M. Poncino, and L. Benini, “Cycle-accurate power analysis for multiprocessor systems-on-a-chip,” in Proc. 14th ACM Great Lakes Symp. VLSI, 2004, pp. 406–410. [14] J. Chen, M. Dubois, and P. Stenstrom, “Integrating complete-system and user-level performance/power simulators: The SimWattch approach,” in Proc. ISPASS, 2003, pp. 1–10. [15] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation platform,” Computer, vol. 35, no. 2, pp. 50–58, Feb. 2002. [16] N. Vijaykrishnan, “Evaluating integrated hardware-software optimizations using a unified energy estimation framework,” IEEE Trans. Comput., vol. 52, no. 1, pp. 59–76, Jan. 2003. [17] M. Kandemir, “Influence of compiler optimizations on system power,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 801–804, Dec. 2001. [18] R. Klein, “Miami: A hardware software co-simulation environment,” in Proc. 7th IEEE Int. Workshop RSP, 1996, p. 173. [19] K. Hines and G. Borriello, “Selective focus as a means of improving geographically distributed embedded system co-simulation,” in Proc. 8th Int. Workshop RSP—Shortening Path from Specification Prototype, 1997, p. 58. 
[20] C. Ghezzi and M. Jazayeri, Programming Language Concepts, 2nd ed. Hoboken, NJ: Wiley, 1986.
[21] P. G. Paulin, C. Pilkington, and E. Bensoudane, "StepNP: A system-level exploration platform for network processors," IEEE Des. Test Comput., vol. 19, no. 6, pp. 17–26, Nov./Dec. 2002. DOI: 10.1109/MDT.2002.1047740.
[22] J. Laurent, N. Julien, E. Senn, and E. Martin, "Functional level power analysis: An efficient approach for modeling the power consumption of complex processors," in Proc. Conf. DATE, 2004.
[23] T. T. Ye, L. Benini, and G. De Micheli, "Packetized on-chip interconnect communication analysis for MPSoC," in Proc. DATE, 2003, pp. 344–349.
[24] P. Shivakumar and N. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," Compaq Western Research Laboratory, Palo Alto, CA, WRL Res. Rep. 2001/2, Aug. 2001.
[25] T. Simunic, L. Benini, and G. De Micheli, "Cycle-accurate simulation of energy consumption in embedded systems," in Proc. 36th Annu. DAC, 1999, pp. 867–872.
[26] A. Chandrakasan, "Minimizing power consumption in digital CMOS circuits," Proc. IEEE, vol. 83, no. 4, pp. 498–523, Apr. 1995.
[27] A. Bona, V. Zaccaria, and R. Zafalon, "System level power modeling and simulation of high-end industrial network-on-chip," in Proc. Conf. Design, Autom. Test Eur., 2004, pp. 318–323.
[28] C. Brandolese, F. Salice, and D. Sciuto, "Static power modeling of 32-bit microprocessors," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 11, pp. 1306–1316, Nov. 2002.
[29] N. Manjikian, "Multiprocessor enhancements of the SimpleScalar tool set," SIGARCH Comput. Archit. News, vol. 29, no. 1, pp. 8–15, Mar. 2001.
[30] P. Mockapetris, DNS: The Domain Name System, Network Working Group, ISI, RFC 1034, Nov. 1987.
Giovanni Beltrame received the M.Sc. degree in electrical engineering and computer science from the University of Illinois, Chicago, in 2001, the Laurea degree in computer engineering from the Politecnico di Milano, Milan, Italy, in 2002, the M.S. degree in information technology from CEFRIEL, Milan, in 2002, and the Ph.D. degree in computer engineering from the Politecnico di Milano, in 2006. He is currently a Research Fellow at the European Space Agency. His research interests include modeling and design of embedded systems, artificial intelligence, and robotics.
Donatella Sciuto (S’84–M’87) received the Laurea degree in electronic engineering from the Politecnico di Milano, Milan, Italy, in 1984, the Ph.D. degree in electrical and computer engineering from the University of Colorado, Boulder, in 1988, and the MBA degree from the Scuola di Direzione Aziendale, Bocconi University, Milan, in 1991. She was an Assistant Professor with the University of Brescia, Dipartimento di Elettronica per l’Automazione, until 1992. Since 2000, she has been a Full Professor at the Politecnico di Milano. Her research interests include embedded systems design methodologies and architectures.
Cristina Silvano (M’99) received the Laurea degree in electronic engineering from Politecnico di Milano, Milan, Italy, in 1987, and the Ph.D. degree in computer engineering from the University of Brescia, Brescia, Italy, in 1999. She is currently an Associate Professor at the Department of Electronic Engineering and Computer Science, Politecnico di Milano. Her primary research interests are in the area of computer architectures and computer-aided design of digital systems.