Compiler-Directed High-Level Energy Estimation and Optimization

I. KADAYIF, M. KANDEMIR, G. CHEN, N. VIJAYKRISHNAN, M. J. IRWIN, and A. SIVASUBRAMANIAM
The Pennsylvania State University

The demand for high-performance architectures and powerful battery-operated mobile devices has accentuated the need for power optimization. While many power-oriented hardware optimization techniques have been proposed and incorporated in current systems, the increasingly critical power constraints have made it essential to look for software-level optimizations as well. The compiler can play a pivotal role in addressing the power constraints of a system as it wields a significant influence on the application’s runtime behavior. This paper presents a novel Energy-Aware Compilation (EAC) framework that estimates and optimizes energy consumption of a given code, taking as input the architectural and technological parameters, energy models, and energy/performance/code size constraints. The framework has been validated using a cycle-accurate architectural-level energy simulator and found to be within a 6% error margin while providing significant estimation speedup. The estimation speed of EAC is the key to the number of optimization alternatives that can be explored within a reasonable compilation time. As shown in this paper, EAC allows compiler writers and system designers to investigate power-performance tradeoffs of traditional compiler optimizations and to develop energy-conscious high-level code transformations.

Categories and Subject Descriptors: D.3.m [Programming Languages]: Miscellaneous
General Terms: Languages
Additional Key Words and Phrases: Energy-Aware Compilation (EAC), mobile devices

Some parts of this material appeared in the proceedings of the 5th Design Automation and Test in Europe Conference (DATE’02) held in Paris, France, between March 4th and 8th, 2002 [Kadayif et al. 2002]. This paper significantly improves over the DATE’02 paper by expanding the validation section to include trend validation for optimizations, by including a new section (Section 5) that shows how this framework can be used to guide energy-conscious compiler optimizations, and by discussing the limitations of the proposed framework. This work is supported in part by the NSF award #0093082.

Authors’ address: Department of Computer Science and Engineering, 111 IST Building, The Pennsylvania State University, University Park, PA, 16802; email: {kadayif,kandemir,guilchen,vijay,mji,anand}@cse.psu.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2005 ACM 1539-9087/05/1100-0819 $5.00
ACM Transactions on Embedded Computing Systems, Vol. 4, No. 4, November 2005, Pages 819–850.

1. INTRODUCTION
Power consumption is becoming a major design consideration for both high-end platforms and embedded/mobile computing devices [Benini et al. 1998; Catthoor


et al. 1998; Butts and Sohi 2000]. At the high end, the phenomenal increase in the level of integration and clock frequencies has escalated power dissipation, which exacerbates the heat-extraction problem (packaging and design of heat sinks) and raises concerns about system reliability and operating costs. Cooling systems need to be designed with peak power consumption in mind, not just the average power. At the other end of the spectrum, the unprecedented growth of battery-operated embedded/mobile devices makes power optimization important for conserving battery energy. Stretching the battery energy can significantly affect the functionality of such devices and their market acceptability [Lorch and Smith 1998]. Unfortunately, the growth in energy capacity of commercially available rechargeable batteries has woefully lagged behind the increasing energy consumption demands of powerful new embedded devices, motivating the research on optimizing for low power under battery-capacity constraints [Irwin and Vijaykrishnan 2000]. It is thus clear that power/energy estimation and optimization techniques are critical for the continued progress in computing. The last decade has witnessed several techniques at the circuit and architectural levels for power optimization. These techniques include transistor sizing, input ordering, circuit restructuring, scaling power supply voltages, gated-clock designs, partitioned memories, and sleep (low-power operating) modes [Catthoor et al. 1998; Lebeck et al. 2000; Moshovos 2003; Chandrakasan et al. 2001; Delaluz et al. 2001; Irwin and Vijaykrishnan 2000; Brooks et al. 2000; Baniasadi and Moshovos 2002; Toburen et al. 1998; Albonesi 1999; Gonzales and Horowitz 1996; Lu et al. 2000; Benini et al. 1998]. These techniques have been widely accepted and incorporated in current power-efficient systems.
However, the continued growth in power demands has made it essential to look for system-wide optimizations encompassing both the software and the hardware. Recently, there have been some forays into optimizing system power using compilation and runtime optimizations [e.g., see Benini et al. 1998 and the references therein]. The software influences the hardware power consumption in two ways. First, it determines the transitions at the gates of a circuit and the inputs [Panda and Dutt 1996] to the circuit, affecting both dynamic (affected by input transitions) and static (leakage component affected by circuit inputs) power consumption. Second, it can have a determining role in exploiting the effectiveness of low-power circuit/architectural techniques present in a design. There have been a few studies that have focused on specific low-level optimizations for reducing power, such as instruction scheduling [Tiwari et al. 1996] and register allocation [Gebotys 1997]. High-level (source-level) optimizations can complement these techniques and, more importantly, as shown in a few recent simulation-based studies (e.g., [Simunic et al. 1999; Kandemir et al. 2000]), can have a larger impact on the system power consumption. The compiler is a critical component in determining the types, order (to some extent), and number of instructions executed for a given application. Thus, it wields a significant influence on the power consumed by the system. However, most compiler optimizations focus on metrics such as performance and code size, and do not account for power considerations during optimizations or code generation. Performance optimization does not necessarily optimize power, as


Fig. 1. Energy-Aware Compilation (EAC) framework.

will be shown in some of our experiments in this paper. In some cases, one may be willing to sacrifice some amount of performance to reduce power dissipation, which can prolong battery life or save cooling costs. These new dimensions for optimization, such as energy consumption or peak power consumption, are no longer second-class citizens, and can take center stage, together with performance, in a large class of computing environments. Consequently, it is important to develop a configurable/parameterizable compilation framework that can generate code given different resource constraints (registers, cache sizes, etc.) to meet different optimization criteria, such as performance and energy. Further, the compiler should also be able to accommodate multiple criteria at the same time, e.g., generate code whose execution dissipates 50% less power than the optimal, with a slowdown of, at most, 10% from the optimal performance. All this requires a compiler that can quickly and accurately estimate the energy consumption of a given code, so that it can evaluate these trade-offs to generate code that satisfies such multiple constraints/criteria. In this paper, a novel Energy-Aware Compilation (EAC) framework that can estimate and optimize energy consumption of a given code is presented. This framework has the ability to estimate the energy consumption of a high-level (source) code given the architectural and technological parameters, energy models, and energy/performance constraints. This capability also allows us to apply high-level code and data transformations (both at the loop level and the procedure level) to optimize energy. In other words, the proposed compilation framework (Figure 1) can be used for either quick energy estimation (without performing any energy-oriented optimization) or energy optimization under several constraints.
This paper makes the following major contributions:
- It presents a low-cost high-level energy estimation model that can be incorporated into an optimizing compiler. Further, it discusses the necessary compiler analyses for extracting the required parameters for this energy model from the application code.
- It presents a validation of the compiler-directed energy estimation using a cycle-accurate architectural-level energy simulator for a simple architectural model.
- Using this model, it presents energy-constrained versions of iteration space tiling [Wolf and Lam 1991; Lam et al. 1991; Wolf and Chen 1996], a commonly used data locality (cache performance) optimization technique.


- It discusses a procedure-level optimization strategy based on ILP (integer linear programming) that considers energy, performance, and code size in a unified setting and presents experimental results.

Figure 1 shows the inputs to EAC and its outputs. This work proposes, to the best of our knowledge, the first complete energy-aware compilation framework (not tied to a specific application domain) that can estimate and optimize the energy consumption of a given application at the high (source) level under different constraints. Note that estimating energy at the high level is crucial because only then can the energy/performance tradeoffs involving high-level optimizations, such as tiling and other loop and data optimizations, be made and high-level optimizations that target energy be performed. In contrast, if we estimated the energy consumption at the assembly level, we could only evaluate a small number of alternative optimization strategies, the energy estimation process would be slow, and it would be difficult to integrate energy optimization strategies with commonly used source-level performance optimizations. The first version of the EAC framework (which works with a simple five-stage pipelined architecture) has been implemented using the SUIF [1994] compiler and evaluated using several array-dominated benchmark codes and a single-issue embedded processor architecture. Array-dominated codes, which consist of multiple nested loops, are very common in digital signal and media-processing applications. It is anticipated that, in many array-dominated codes, the highest energy gains will come from high-level code optimizations [Catthoor et al. 1998; Vijaykrishnan et al. 2000]. The EAC framework has been validated using a cycle-accurate architectural-level energy simulator and found to be within a 6% error margin while providing significant estimation speedup.
Our energy estimation and optimization framework can be used by compiler designers and system architects in the following ways:
- Given fixed architecture and technology parameters, various performance-oriented compiler optimization techniques can be evaluated from an energy viewpoint. For example, the energy impact of loop tiling, loop distribution, unroll-and-jam, and other loop and data transformations can be investigated using EAC.
- For specific compiler optimization techniques preferred for performance reasons, the potential energy savings from architectural enhancements/modifications can be evaluated (although this requires extending the basic EAC infrastructure with appropriate hardware models). Conversely, compiler optimizations designed to exploit energy-efficient architectural features can also be studied.
- When designing new energy-aware compiler optimization techniques, the impact of imminent changes in technology on the effectiveness of the proposed techniques can be explored. Further, the impact of optimizations on the energy consumption of different components of a system (e.g., on-chip or off-chip) can be evaluated.


- Given a specific set of operating constraints (energy and performance) and a set of high-level optimizations, the tool can be used to generate code meeting these constraints, whenever possible.
- Apart from compilation support, EAC-directed estimation is accurate enough to be useful in developing/evaluating power-conscious applications/architectures in several situations, without having to go through much more time-consuming execution-driven simulations [e.g., Vijaykrishnan et al. 2000].
- The speed of EAC-directed estimation also makes it an enabling technology for dynamic compilation (which is common in several ubiquitous/embedded environments), although this issue is not explicitly investigated in this paper. We refer the interested reader to Unnikrishnan et al. [2002].

The current EAC framework also has some limitations. It currently works only with integer data and does not take scalar variable accesses into account (i.e., it is tuned for array-based computations). It also assumes a simple processor architecture and makes the conservative assumption that all the branches of a conditional construct have the same probability of being taken at runtime. Prior work on high-level energy optimization focused on voltage scaling [Saputra et al. 2002; Xie et al. 2003; Hsu and Kremer 2003], signal-processing-specific architectures [Lorenz et al. 2002], and compilation offloading in Java-based environments [Chen et al. 2003]. Rele et al. [2002] and Zhang et al. [2003] focused on reducing leakage (static) energy consumption by turning off unused functional units. In comparison to these studies, the work presented here targets estimating the energy consumption of a given source-level code. Based on this estimation, the proposed framework can also be used to guide compiler optimizations in energy-sensitive application environments. The remainder of this paper is structured as follows.
The next section explains how software can influence energy consumption for a given architecture and presents the energy model embedded in EAC. Section 3 describes the necessary compiler analyses to extract the parameters that are needed by the energy model. Section 4 presents a validation of the compiler-based energy estimation using a cycle-accurate energy simulator. Section 5 proposes a loop-level and a procedure-level energy optimization technique and presents experimental data. Finally, Section 6 concludes with a summary and an outline of future research on this topic.

2. MODELING ENERGY CONSUMPTION OF SOFTWARE
Energy consumption is dependent on how the different components of a system are exercised by the software. In this work, we focus on dynamic energy consumption. The dynamic energy consumed in a system can be expressed as the sum of the energies consumed in the different components, such as the datapath, caches, clock network, buses, and main memory. The activity and, consequently, the energy consumed in these components are determined by the software being executed on the system. The software can modify the number of transitions in


Fig. 2. A summary of the major parameters for estimating the energy consumed by different hardware components.

the circuit nodes (note that this affects switching power) by altering the input patterns, reduce the effective capacitance by reducing the number of accesses to high-capacitance components (e.g., large off-chip memories), or scale the voltage/clock frequency [Hsu and Kremer 2003] to adapt the energy behavior of the code to the needs of the application. This paper focuses on a single-issue, five-stage [instruction fetch (IF), instruction decode/operand fetch (ID), execution (EXE), memory access (MEM), and write-back (WB) stages] pipelined datapath of typical embedded processors with a single-level on-chip cache. Currently, this is the only architectural model for which our compiler estimates energy. We selected this model mainly because we have access to an accurate energy simulator that generates the detailed energy behavior of a given application running on this architecture. This allows us to evaluate the accuracy of the compiler-directed energy estimation. We later discuss the additional challenges that arise when we move to more complex processor architectures. In the following, we explain the energy consumption in the individual components of the system and the impact of the software/compiler on these components with the help of our target architecture. The third column in Figure 2 gives a summary of the major architecture/circuit parameters whose values are used to estimate the energy consumption within the EAC framework.

2.1 Datapath
The energy consumed in a datapath is dependent on the number, types, and sequence of instructions executed. The type of an instruction determines the components in the datapath that are exercised, while the number of instructions determines the duration of the activity. The energy of each instruction was obtained for our target architecture by accounting for the energy consumed in each component exercised by the instruction.
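This per-instruction accounting can be sketched as follows. All energy values and per-instruction component lists here are hypothetical placeholders (in EAC they are derived from the cycle-accurate simulator), and the names `COMPONENT_ENERGY`, `INSTRUCTION_COMPONENTS`, `instruction_energy`, and `datapath_energy` are illustrative rather than taken from the paper:

```python
# Hypothetical energy (nJ) charged to each datapath component when exercised.
COMPONENT_ENERGY = {
    "fetch_logic": 0.10,
    "reg_file_read": 0.05,
    "alu": 0.08,
    "reg_file_write": 0.05,
    "pipeline_regs": 0.04,
    "mem_stage": 0.12,
}

# Which components each instruction class exercises; an integer add, for
# example, does not exercise the memory stage of the pipeline.
INSTRUCTION_COMPONENTS = {
    "add": ["fetch_logic", "reg_file_read", "alu", "reg_file_write",
            "pipeline_regs"],
    "lw":  ["fetch_logic", "reg_file_read", "alu", "mem_stage",
            "reg_file_write", "pipeline_regs"],
}

def instruction_energy(kind):
    """Energy of one dynamic instance of an instruction class."""
    return sum(COMPONENT_ENERGY[c] for c in INSTRUCTION_COMPONENTS[kind])

def datapath_energy(instruction_counts):
    """Total datapath energy: per-instruction energy times estimated count."""
    return sum(n * instruction_energy(kind)
               for kind, n in instruction_counts.items())
```

The compiler's role is to supply `instruction_counts`, the estimated number of executions of each instruction class, as described in Section 3.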
Whenever a component is exercised, there is switching activity in that particular component, contributing to dynamic energy consumption. For example, when an integer addition instruction is executed, energy is consumed in the instruction fetch logic in the first stage of the pipeline, in the register file when accessing the source operands


Fig. 3. Memory array.

in the decode stage, in the ALU when executing the operation, and, again, in the register file during write-back. It must be noted that energy is also consumed in the pipeline registers of the datapath and that the components in the memory stage of the pipeline are not exercised by this instruction. Using a cycle-accurate energy simulator, we are able to capture the activity of each individual component of the architecture that is exercised by a given instruction. This information is then fed to the compiler and used along with the estimated number of times each instruction executes to obtain the energy consumed in the datapath.

2.2 Cache
The energy consumed in a cache is dependent on the number of cache accesses, the number of misses, the cache configuration (e.g., associativity, capacity, line size), and the extent to which energy-efficient implementation techniques (e.g., sub-banking, bitline isolation [Ghose and Kamble 1999]) are utilized. There are two important components of the cache, namely, the tag and the data arrays, and both have an architecture similar to that shown in Figure 3. The major components that consume dynamic energy are the row and column decoders, wordlines, bitlines, and sense amplifiers. Energy is expended in the row decoders when a particular cache line is selected for a read or write operation, in the lines that activate each cell (wordlines) in a particular row of the cache, in the bitlines when values are written to or read from the cells, in the sense amplifiers that amplify the values read, and, finally, in the column decoders that select a part of the activated cache line. The energy consumed during reads and writes differs, since the voltage swings in the bitlines are different for the two operations (i.e., full swing for writes and limited swing for reads); in addition, the sense amplifiers are not used during the write cycle. Consequently, it is important to estimate the number of reads and writes separately.
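A minimal sketch of this accounting, with reads, writes, and misses costed separately; the per-access energies below are invented placeholders (in EAC they follow from the cache configuration and circuit parameters), and the constant and function names are illustrative:

```python
# Hypothetical per-event cache energies (nJ). Reads and writes differ because
# of the different bitline voltage swings, and because sense amplifiers are
# used only on reads; a miss adds the activity of refilling the line.
E_READ = 0.20    # per read access  (limited bitline swing + sense amps)
E_WRITE = 0.25   # per write access (full bitline swing, no sense amps)
E_MISS = 1.50    # extra energy per miss (line refill)

def cache_energy(n_reads, n_writes, n_misses):
    """Cache energy from compiler-estimated access and miss counts."""
    return n_reads * E_READ + n_writes * E_WRITE + n_misses * E_MISS
```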
The energy consumed in the caches is largely independent of the actual data accessed, and prior work has shown that the number of cache accesses is sufficient to model cache energy accurately [Ghose and Kamble 1999]. The proposed compilation framework estimates the number of cache misses and hits given the high-level


code, as explained in the next section. This information, along with the cache configuration parameters that determine the length of the bitlines and wordlines and the size of the decoders and sense amplifiers, can be used to evaluate the energy consumption. Further, we parameterize the energy model based on the extent of energy-efficient implementation techniques used to reduce the capacitance on the bitlines and wordlines. For example, EAC can also estimate energy consumption when the cache architecture is banked.

2.3 Main Memory
The organization of the main memory arrays is similar to that of the caches, but differs in two ways. First, the memory arrays have no tag comparison portion. Second, the basic cell used to implement memory storage (DRAM cells) is different from that used in on-chip caches (SRAM cells). Consequently, there is a difference in the energy consumed during read/write accesses. Further, DRAM memory architectures are usually partitioned into multiple banks that can be partially shut down during periods of inactivity. The energy consumed in the memory can be modeled fairly accurately by capturing, within our compiler, the number of accesses and the intervals between the accesses. Typically, the energy cost of accessing memory is larger than that of on-chip caches because of the additional costs associated with off-chip packaging capacitances and also because of the energy consumed in refreshing the DRAM cells. EAC has energy models for capturing the impact of low-power operating modes [Delaluz et al. 2001].

2.4 Buses
The buses are used to communicate addresses and data between the cache and the processor and between the main memory and the cache. The energy consumed on a given bus is dependent on the number of transactions on the bus, the bus capacitance, the bus width, and the switching activity on the bus. While the width of the bus and the bus capacitance are readily available once the design is finalized, the other two factors are a function of the software.
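As a rough illustration, the bus energy term can be sketched from these four factors using the classic (1/2)·C·V² switching cost per bus line; the capacitance and supply-voltage values below are invented placeholders, not parameters from the paper:

```python
def bus_energy(n_transactions, bus_width_bits, c_line_pf=20.0, vdd=3.3,
               switching_activity=0.5):
    """Bus energy in pJ.

    Each transaction switches, on average, switching_activity of the bus
    lines, and each switched line dissipates (1/2) * C * Vdd^2.
    c_line_pf and vdd are hypothetical design parameters fixed once the
    design is finalized; n_transactions and switching_activity are the two
    software-dependent factors.
    """
    e_per_line = 0.5 * c_line_pf * vdd ** 2   # pJ per switched line
    return n_transactions * bus_width_bits * switching_activity * e_per_line
```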
Our compiler estimates the number of transactions for the buses between the main memory and the cache and between the cache and the processor, and assumes a 50% switching activity, since data values are not known at the source level.

2.5 Clock Network
The components of the clock network that contribute to the energy consumption are the clock generation circuit [Phase-Locked Loop (PLL)], the clock distribution buffers and wires, and the clock-load presented to the clock network by the clocked components [Duarte et al. 2001]. The energy consumed in a single cycle depends on the parts of the clock network that are active. The PLL and the main clock distribution circuitry are normally active every clock cycle during execution. Therefore, our compiler captures the energy consumption of those two components by estimating the number of cycles that the code would take. However, the contribution of the clock-load varies based on the active components of the circuit, as determined by the software executing on the system. For example, the clock to the caches is gated (disabled) when a cache miss is being serviced.
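A minimal sketch of this accounting, with hypothetical per-cycle energies: the PLL and the distribution tree are charged every cycle, while each component's clock-load is charged only for the cycles in which that component is actually clocked (so a gated cache clock simply contributes fewer active cycles):

```python
# Hypothetical per-cycle clock energies (nJ); in EAC these come from the
# clock-network model [Duarte et al. 2001].
E_PLL = 0.05            # clock generation, active every cycle
E_DISTRIBUTION = 0.03   # buffers and wires, active every cycle
CLOCK_LOAD = {"datapath": 0.02, "cache": 0.04}  # per active cycle

def clock_energy(total_cycles, active_cycles):
    """active_cycles maps a component name to the number of cycles in which
    its clock was not gated."""
    base = total_cycles * (E_PLL + E_DISTRIBUTION)
    load = sum(CLOCK_LOAD[c] * n for c, n in active_cycles.items())
    return base + load
```

For instance, over 100 cycles with the cache clock gated for 20 of them, the cache clock-load is charged for only 80 cycles.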


EAC exploits the estimation techniques for the datapath and caches explained above to effectively account for this varying clock-load in a given cycle.

3. EXTRACTING PARAMETERS FOR ENERGY MODELS
In order to compute the energy expended in the different hardware units, the compiler should analyze the program and extract the application-dependent parameters required by the energy models. The second column in Figure 2 gives a list of these compiler-supplied parameters. In this section, we explain the techniques used to extract these parameters from the nested-loop-based codes considered in this work.

3.1 Current Models
The first step in developing the automated process involves identifying the high-level constructs used in these codes and correlating them with the actual machine instructions. The constructs that are vital to the studied codes include a typical loop, a nested loop, assignment statements, array references, and scalar variable references within and outside loops. To compute datapath energy, we need to estimate the number of assembly instructions of each type associated with the actual execution of these constructs. To achieve this, the assembly equivalents of several codes were obtained using our back-end compiler (a variant of gcc) with the O2-level optimization. Next, portions of the assembly code were correlated with the corresponding high-level constructs to extract the number and type of each instruction associated with a high-level construct. In order to simplify the correlation process and to partially isolate the impact of instruction choice due to low-level optimizations, assembly instructions with similar functionality and energy consumption are grouped together. For example, both branch-if-not-equal (bne) and branch-if-equal (beq) are grouped as a generic branch instruction under the name bne. In order to illustrate our parameter extraction process in more detail, we focus on some specifics of the following example constructs.
First, let us focus on a loop construct. Each loop construct is modeled to have a one-time overhead to load the loop index variable into a register and initialize it. Each loop also has an index comparison and an index increment (or decrement) overhead whose costs are proportional to the number of loop iterations (called the trip count, or trip). From correlating the high-level loop construct to the corresponding assembly code, each loop initialization is estimated to execute one load (lw) and one add (add) instruction (in general). Similarly, an estimate of trip + 1 load (lw), store-if-less-than (stl), and branch (bne) instructions is associated with the index-variable comparison. For the index-variable increment (resp. decrement), 2 × trip addition (resp. subtraction) and trip load, store, and jump instructions are estimated to be performed. Next, we consider extracting the number of instructions associated with array accesses. First, the number and types of instructions required to compute the address of the element are identified. This requires the evaluation of the base address of the array and the offset provided by the subscript(s). Our current implementation considers the dimensionality of the array in question, and computes the necessary instructions for obtaining


each subscript value. Computation of the subscript operations is modeled using multiple shift and addition/subtraction instructions (instead of multiplications), as this is the way our back-end compiler generates code when invoked with the O2 optimization flag. Finally, an additional load/store instruction is associated with reading/writing the corresponding array element. These correlations between high-level constructs and low-level assembly instructions are a first-level approximation for our simple architecture and array-dominated codes with the O2-level optimization, and were obtained through extensive analysis of a large number of code fragments. Our current calculation method does not take into account the energy spent in the datapath during stalls (due to branches or cache misses), as we assume the existence of clock gating [Chandrakasan and Brodersen 1995], which reduces stall energy significantly (i.e., it does not affect any trend observed in this study). In this study, we focus only on the data cache, as our high-level optimizations influence data cache energy and performance behavior more dramatically than the instruction cache. To compute the number of hits and misses, our current implementation uses the miss-estimation technique proposed by McKinley et al. [1996]. This approach first groups the array references according to the potential group-reuse (i.e., the type of reuse that originates from multiple references to the same array) between them. Then, for each representative reference, it calculates a reference cost (i.e., the estimated number of misses during a complete execution of the innermost loop). Basically, the reference cost of a given array reference with respect to a loop order is 1 if the reference has temporal reuse in the innermost loop; that is, the subscript functions of the reference are independent of the innermost loop index. The reference cost is trip/(cls/stride) if the reference has spatial reuse in the innermost loop.
In this expression, trip is the number of iterations of the innermost loop (trip count), cls is the cache line size in data items (array elements), and stride is the step size of the innermost loop multiplied by the coefficient of the loop index variable. Finally, if the reference in question exhibits neither temporal nor spatial reuse in the innermost loop, its reference cost is assumed to be equal to trip; that is, a cache miss is anticipated per loop iteration. After all the reference costs are calculated, the technique computes the loop cost (i.e., the total number of estimated misses due to nest execution) considering each reference in the nest. The overall loop cost of the nest is the sum of the contributions of each reference it contains. The contribution of a reference is its reference cost multiplied by the number of iterations of all the loops that enclose the reference, except the innermost one. This miss-calculation process is a good first-degree approximation if one does not consider internest data reuse. Note that this algorithm takes the cache line size into account (to determine the extent of spatial reuse) but does not consider associativity. As an extension to McKinley et al.’s algorithm, our compiler also distinguishes reads from writes, and computes the read and write misses separately. The sum of read and write misses also gives us the number of accesses to the main memory, a parameter necessary to compute the main memory energy. Estimating the number of execution cycles (which is necessary to compute the clock energy and performance) is not very difficult in our architecture as it


is a single-issue machine. Since each instruction (omitting stalls) requires one cycle to be initiated, the number of instructions is a lower bound on the number of cycles. To this lower bound, we add the number of estimated stall cycles (a fixed number of cycles for each estimated cache miss) to reach the final compile-time estimate of the number of clock cycles. The two parameters required here, namely, the number of instructions and the number of misses, are estimated by the compiler as explained above. Estimating the number of bus transactions is also relatively easy, as it is proportional to the number of cache and memory accesses, both of which are captured by the compiler during cache miss analysis. Our compiler also calculates the static code size (in terms of the number of assembly instructions), which will be used later in the paper when we evaluate multiconstraint optimization. The process of estimating static code size is very similar to that of execution-cycle estimation; the difference is that, in static code-size estimation, the compiler does not multiply the static estimates (for individual constructs) by the trip counts of the enclosing loops.

3.2 Limitations and Possible Enhancements
It should be emphasized that this paper is a first step toward compiler-based energy estimation and optimization. Consequently, it focuses on a rather simple embedded architecture and tries to measure the effectiveness of compiler-based analysis. The accuracy of the compiler-based estimates discussed above (and, of course, that of the overall energy estimate) can be improved by employing more sophisticated techniques. For example, a cache miss-estimation technique that takes into account cache associativity and conflict misses [e.g., Temam and Jalby 1993] or internest interactions [e.g., Cooper et al. 1996] can potentially lead to better cache, memory, and bus energy estimations.
Similarly, a technique that considers potential low-level instruction scheduling constraints at the high level (e.g., [Wolf and Chen 1996]) can give a better datapath energy estimation. Our approach is also conservative in handling the conditional constructs in the program code under consideration. Specifically, we assume that all the branches of an if-statement have the same probability of being taken. It should be mentioned, however, that if profile data on the execution frequencies of the different branches are available, our approach can take them into account. Clearly, this conservativeness introduces inaccuracies into our energy-estimation approach; all the experimental results presented in this paper already capture these inaccuracies. More refinements are essential to capture the influence of other compiler and architectural aspects. Here, we provide a brief discussion of how the influence of more sophisticated optimizations can be captured. A more detailed discussion of all possible optimizations is a broad area in itself and is beyond the scope of this work. First, we consider the influence of software pipelining [Lam 1988]. We can analyze each array reference and group array references into classes, as done in Wolf and Chen [1996]. Two references are in the same class if they reference the same location within a small number of iterations (e.g., U[i] and U[i+2]). This will allow us to evaluate the impact of a software


pipeliner, which would eliminate multiple memory references by keeping some values in registers across loop iterations. In addition, common subexpression elimination (CSE) can be taken into account when handling references with the same subscript expressions (e.g., U[2i−1] and V[2i−1]). Second, it is also important to model the register pressure more accurately. For example, the number of registers that are necessary to keep the pipeline running at full speed can be estimated [Wolf and Chen 1996]. Invariant memory references can be analyzed to determine the impact of loop-invariant variable optimizations on the register requirements. Once the total register demand is estimated, this value can be compared with the number of available registers and, if it is larger, spill code can be assumed for each register beyond those available. The energy estimation for this spill code can be performed with reasonable accuracy [Tiwari et al. 1996]. More sophisticated architectures than the one used in this work will also influence the accuracy of energy estimates. For example, support for speculative execution can make the task of estimating the number of instructions executed harder. Wrong-path executions can increase the number of instructions executed and memory references issued. However, this impact varies based on the accuracy of the predictors and the code characteristics. A first-order approximation of the energy consumed in wrong paths can be modeled from the bound estimates reported in Musoll [2000]. It should also be noted that there are several ongoing efforts to minimize the impact of wrong-path executions on energy without unduly sacrificing performance [e.g., Manne et al. 1998]. Once an estimation model is selected, the rest of our technique, which includes calculating energy and optimizing for energy, is independent of the instruction count, cache hit/miss, and bus transaction estimates.
In other words, our framework is general enough to accommodate different estimation strategies where available. In our current implementation, in cases where the loop bounds and array sizes are not known at compile time, we exploit the available profile information. At this point, we would like to discuss some of the limitations of our model in more detail and how these limitations can potentially be addressed. As explained earlier, we model a simple single-issue embedded architecture. While this type of architecture is in use in several important application domains, such as automobile control, smart phones, and signal processing, embedded processors are increasingly becoming more complex. In particular, many embedded systems are now using superscalar or VLIW architectures. Therefore, it is important to discuss what additional challenges could be introduced when we move to more sophisticated architectures. Our belief is that our approach is directly applicable to VLIW architectures, as the entire scheduling is performed by the compiler and our energy models can utilize this information. However, our models need to be enhanced to handle superscalar architectures. One of the major problems is the runtime (dynamic) instruction scheduling employed by superscalar architectures, which presents at least two issues: estimating the number of cycles (which will affect the clock power) and estimating the number of instructions issued per cycle (which will affect the energy consumed in the issue logic). In order to estimate these accurately, we need to model the processor resources. For example, the R10K can


execute two integer ALU operations, two FP ALU operations, and one memory operation per cycle; however, it can issue at most four instructions per cycle. This means that even if more ALU instructions are ready to be issued, the resource constraints would not allow it. Therefore, our static analysis should take these resource constraints into account and accurately estimate the number of instructions that can be issued in the same cycle. This can be achieved by enhancing our datapath model to include the resource constraints (for both the issue stage and the execution stage). Notice that automatic compiler analysis can also be useful in this enhanced model. Specifically, the compiler can analyze each loop nest and identify how much pressure it would put on the issue logic, and this information can be fed to our energy models. It should be emphasized that a prior work [Wolf and Chen 1996] reports accurate estimation of execution cycles (at compile time) assuming detailed resource and latency models for a superscalar architecture. We believe that energy models for superscalar architectures can be built upon such performance models. Another limitation is in the methods used for capturing metrics, such as the number of misses and the number of CPU instructions. As for the CPU instructions, while our current model is designed only for the O2 optimization level, it is possible to extend it to other optimization levels as well. In fact, if we have a total of k different optimization combinations (each combination contains a set of optimizations), we can repeat our high-level-construct-to-assembly mapping (which is necessary for calculating the datapath energy) for each combination. While this may take time, note that this is a one-time overhead; i.e., once we have the models in place, we can use them for any program we want to compile. Due to space concerns, in this paper we focus our attention on the O2 level only.
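A rough sketch of how such per-cycle resource limits could bound a compile-time issue estimate is shown below. This is our own simplification for illustration, not EAC's actual model; the class names and limits mimic the R10K constraints mentioned above:

```python
import math

def issue_cycles(mix, limits, issue_width):
    """Lower-bound cycle estimate for an instruction mix.

    mix    -- dict mapping instruction class to instruction count
    limits -- dict mapping instruction class to issue slots per cycle
    Each class needs at least count/limit cycles, and the overall issue
    width requires at least total_count/issue_width cycles."""
    per_class = max(math.ceil(n / limits[c]) for c, n in mix.items())
    total = math.ceil(sum(mix.values()) / issue_width)
    return max(per_class, total)

# R10K-like constraints: two integer ALU operations, two FP operations,
# and one memory operation per cycle, at most four instructions per cycle.
r10k_limits = {'int': 2, 'fp': 2, 'mem': 1}
print(issue_cycles({'int': 6, 'fp': 2, 'mem': 4}, r10k_limits, 4))  # 4: memory ops are the bottleneck
```

Here the single memory port, not the four-wide issue width, determines the estimate, which is exactly the kind of constraint an enhanced datapath model would have to capture.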
As far as the cache model is concerned, our framework can work with any cache model available. Obviously, the more accurate the cache model is (in capturing the number of accesses/hits/misses), the more reliable our energy estimation will be. However, one source of inaccuracy will remain no matter what cache behavior estimator is employed. The energy consumption is not just a function of the hits and misses but also a function of the contents of the data (the 0-1 patterns). Unfortunately, this information (that is, the actual data values) is not available to static analysis. However, prior research on cache energy [Su and Despain 1995; Kamble and Ghose 1997a, 1997b] indicates that estimations based on the number of hits and misses can be very accurate. While the bus energy consumption might be more dependent on the actual data values than the cache energy, again it is not possible to capture the actual data values transmitted within a static analyzer. The final issue that we want to discuss is the leakage (static) energy consumption. As circuit technology scales to ever smaller dimensions, the threshold voltages are also scaled, which significantly increases leakage consumption [Borkar 1999; Kaxiras et al. 2001; Flautner et al. 2002]. While our current approach models only dynamic energy consumption, it is possible to extend it to capture leakage consumption as well. This is possible since, for computing the clock power, we are capturing the number of execution cycles anyway. Assuming a per-cycle leakage consumption for each hardware component of interest, we can also model the leakage consumption. In fact, our belief is that the proposed


framework can even be extended to capture the impact of leakage-oriented compilation techniques.

4. VALIDATION

Estimating the energy consumption at a high level (source code level) is not very useful unless the estimation is accurate enough to guide high-level optimizations. Therefore, validating the compiler-directed energy estimation is of critical importance. In this section, we compare the compiler-estimated energy consumption to that obtained through a cycle-accurate energy simulator that uses transition-sensitive energy models for the datapath and analytical energy models for the other components. Transition-sensitive models quantify the energy consumption based on the current and previous data inputs to a circuit; hence, they are very accurate. However, they are also time-consuming and difficult to develop. While transition-sensitive models are essential for modeling datapaths accurately [Vijaykrishnan et al. 2000], analytical approaches based on activity-based models are sufficient for modeling caches, memories, and clock circuitry [Ghose and Kamble 1999]. Activity-based energy models assume a fixed (component-specific) energy consumption when the component is accessed, independent of the specific data input values. Our compiler-based estimation approach uses such activity-based energy models for all components, including the datapath. The difficulty of predicting data input sequences from the high-level source code mandates the use of the activity-based approach in EAC. In this section, we first perform this validation for different benchmarks to observe the error margin in the estimates across the different components of the system. Next, the validation is performed for one of the benchmarks when the type of high-level compiler optimization is varied. This is done to ensure that the trends due to optimizations are correctly predicted by EAC (so that the selection of suitable compiler optimizations to be applied can be achieved).
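The activity-based accounting that EAC uses can be sketched in a few lines: each access to a component costs a fixed, component-specific energy, independent of the data values involved. The per-event costs below are illustrative placeholders, not numbers from the paper:

```python
# Illustrative per-event energy costs (nJ); the values are placeholders.
PER_EVENT_ENERGY_NJ = {
    'cache_access': 0.5,
    'memory_access': 4.0,
    'bus_transaction': 0.8,
    'clock_cycle': 0.1,
}

def activity_based_energy(event_counts):
    """Total energy (nJ) = sum over event types of count x fixed per-event cost.
    This is the activity-based model: no dependence on actual data values."""
    return sum(PER_EVENT_ENERGY_NJ[event] * count
               for event, count in event_counts.items())
```

A transition-sensitive model, by contrast, would need the concrete input sequence to each circuit, which is precisely what is unavailable at the source level.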
We start by giving some information about our simulation environment. In this work, we use SimplePower, an architectural-level, cycle-accurate energy simulator. SimplePower [Vijaykrishnan et al. 2000] is an execution-driven power estimation tool and is publicly available. It is based on the architecture of a five-stage pipelined datapath. The instruction set architecture is a subset of the instruction set (the integer part) of SimpleScalar, which is a suite of publicly available tools to simulate modern microprocessors [Burger et al. 1996]. The major components of SimplePower are the SimplePower core, the RTL power estimation interface, technology-dependent switch capacitance tables, the cache/bus simulator, and the loader. The SimplePower core simulates the activities of all the functional units and calls the corresponding power estimation interfaces to find the switched capacitances. These interfaces can be configured to operate with energy tables based on different process technologies. Transition-sensitive [Vijaykrishnan et al. 2000], technology-dependent switch capacitance tables are available for the different functional units, such as adders, ALUs, multipliers, shifters, register files, pipeline registers, and multiplexors. The SimplePower core continues the simulation until a predefined program halt instruction is fetched. Once the simulator fetches this instruction, it continues executing all the instructions left in the pipeline, and then dumps the output. The cache simulator of SimplePower is interfaced with an analytical memory energy model derived from that proposed by Shiue and Chakrabarti [1999]. The memory energy is divided into that consumed by the cache decoders, the cache cell array, the buses between the cache and main memory, and the main memory. The components of the cache energy are computed using analytical energy formulations. We enhanced the simulator to accurately capture clock energy [Duarte et al. 2001]. The energy models utilized in the simulator are within an 8% error margin of measurements from real systems [Chen et al. 2001]. The energy simulator can work under the assumption of different supply voltages and process technologies. It can be configured using the command line to set the cache parameters, output the pipeline trace cycle-by-cycle, and dump the memory image. SimplePower provides the total number of execution cycles and the energy consumption in the different system components (datapath, clock, cache, memory, and buses).

Fig. 4. Base configuration parameters used in the experiments. The energy numbers are obtained by providing the (hardware) configuration parameters to our energy models.

All simulator-based results reported in this paper are obtained using the parameters given in Figure 4. The same configuration (shown in Figure 4) is also used in the compiler-directed energy estimation experiments. An important issue in estimation within the compiler is associating an accurate, activity-based energy cost with each type of instruction. To achieve this, we averaged energy values obtained by executing multiple (1000) instances of the same instruction with random data using the transition-sensitive simulator. The resulting values are stored as a table in which each entry lists an instruction type and the corresponding energy. Figure 5 gives the most frequently used instructions (by the back-end compiler) and the corresponding energy consumptions.
Note that the energy value given for an instruction (group) includes all the energy consumed in the different parts of the datapath. Such energy tables could be built for other target architectures using power measurements on an actual system [Tiwari et al. 1996]. Alternately, one could extract these numbers by embedding energy models in the architectural-level simulators that are normally available for many target platforms.
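The table-construction step just described can be sketched as follows. Here measure_instance(op) stands in for one transition-sensitive simulator run of instruction op on random operands; it is a placeholder callback, not a real SimplePower interface:

```python
def build_energy_table(instr_types, measure_instance, samples=1000):
    """Average the measured energy of `samples` random-data executions of
    each instruction type, mirroring how the per-instruction table is built."""
    return {op: sum(measure_instance(op) for _ in range(samples)) / samples
            for op in instr_types}

def datapath_energy(instr_counts, table):
    """Estimated datapath energy: per-type instruction count times the
    per-type average energy from the table."""
    return sum(table[op] * count for op, count in instr_counts.items())
```

Once the table exists, datapath estimation reduces to a count-times-cost sum, which is why it is a one-time overhead per target architecture and optimization setting.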


Fig. 5. Datapath energy costs (in picojoules) of the most frequently used instructions. The main sources of datapath energy consumption are the register files, pipeline registers, functional units, and datapath multiplexors; when averaged over different applications, these contribute 20, 40, 20, and 8% of the overall energy consumption, respectively.

Fig. 6. Benchmark codes used in the experiments.

4.1 Codes

Figure 6 lists the benchmark codes in our experimental suite. The first six codes are array-based versions of the corresponding DSPstone benchmarks [Willems and Zivojnovic 1996]. The remaining codes are array-based programs that are rewritten to use integer data instead of floating-point data (since our cycle-accurate energy simulator currently operates only with integer data). vpentai, tsfi, and tomcatvi are integer versions of the respective benchmark codes (vpenta, tsf, and tomcatv). The third column of the table in Figure 6 gives the input size used for each benchmark. This is the sum of the sizes of all the arrays manipulated by the application. Note that, when the array sizes change, the loop trip counts also change, which is taken into account by our compiler during energy estimation.

4.2 Validating Energy Consumption Estimates

Figure 7 compares the compiler-directed energy estimation with that of the simulator. We see that the average difference between the energy estimations is 12.21, 6.09, 3.84, 4.90, and 2.38% for the datapath, cache, main memory, buses, and clock network, respectively. Overall, the compiler-estimated total energy is within 5.9% of the simulator value. We now explore the causes of these error margins component by component. The datapath inaccuracies stem from two major factors. First, estimating the number of instructions executed requires accurate (high-level) modeling of the back-end code generation techniques. The inaccuracies in this factor were observed to contribute an average 5% error in EAC estimates across all the benchmarks


Fig. 7. Comparison of compiler-directed (EAC) and simulator-based (SB) energy estimation and percentage difference. The estimated energy consumptions are in units given below the name of each component.

used. As an example, depending on the absolute magnitudes of some variables that are involved in address calculations, we noted that the back-end compiler generates different code sequences (which our current implementation does not capture). Second, the impact of input transitions (i.e., transition sensitivity) in the datapath is not accurately captured by the compiler estimates. It is very difficult to predict the exact instruction sequence and associated data (both of which impact the transition activity of the components) at the high level. Hence, the resulting inaccuracies contribute 7.2% of the overall estimation error in the datapath (an underestimation). The main source of error for the cache and main memory energies is the inaccurate estimation of the number of cache hits and misses. These inaccuracies are a result of not taking conflict misses and misses due to scalar accesses into account. This can lead to an underestimation on the compiler's part, as exemplified in vpentai and biquad. The underestimation of cache energy for the fir benchmark, however, occurs due to mispredictions in the number of loads (which is also reflected in the datapath energy estimates). Further, our current miss-estimation framework does not model data locality across separate nested loops (i.e., internest locality). That is, it operates under the assumption that the cache is empty at the beginning of each nest execution. In contrast to the previous case, this can lead to an overestimation of misses, as in the adi code. The inaccuracies in the clock energy are a function of the inaccuracies in estimating the number of execution cycles, which, in turn, depends on predicting the number of instructions executed and the number of cache misses. Additional inaccuracies also accrue from not accounting for pipeline stalls due to data or control hazards.
Finally, the accuracy of the bus and memory energy estimates is affected by the estimated numbers of bus transactions and cache misses. It should be mentioned that the average values can sometimes hide the fact that the error range is quite large. As an example, in memory energy estimation the error range is [−6.58%, +18.80%]. In addition, this range is with respect


Fig. 8. Comparison of estimation times (in seconds) for EAC and the simulator. In the first column, the numbers within parentheses give the input size. The last four entries for the transition-sensitive simulator are extrapolated values.

to a simulator that itself has an error range. However, recall that our goal is not to be 100% accurate, but rather to obtain a reasonable approximation of the energy consumption and (more importantly) to rank different optimizations from the energy perspective. Therefore, we believe that the compiler-directed energy estimation approach is moderately accurate compared to the simulator-based approach. The loss in accuracy is traded for the ability to perform the estimates significantly faster. Figure 8 shows the absolute times (in seconds) required for obtaining the energy estimates for the different benchmarks. The last four entries for the transition-sensitive simulator are extrapolated values based on smaller input sizes. An important reason for the longer estimation times of the simulator is its cycle-accurate nature, which causes the estimation time to scale with the problem size. In contrast, the time taken by the compiler-based approach is independent of the problem size.

4.3 Trend Validation

A key requirement for the EAC estimates is to accurately capture the impact of different high-level optimizations on the energy consumption of different components. In this section, we use different versions of the matrix multiply code (mxm) to investigate whether EAC meets this requirement. We consider eight different versions of the code: the original (unoptimized) code and seven other versions, each optimized using a different combination of three high-level (loop-based) optimizations, namely, iteration space tiling (denoted T), linear loop transformation (denoted L), and unroll-and-jam (denoted U). We focus on these three high-level optimizations as they have been shown to be very effective in improving program performance through locality enhancement [Wolf and Chen 1996]. In versions that involve tiling, we set the tile size (blocking factor) to 20 for each loop, a value that we found to perform


Fig. 9. Comparison of compiler-directed (EAC) and simulator-based (SB) energy estimation and percentage difference. The estimated energy consumptions are in units given below the name of each component. Note that, in almost all cases, for a given component and two different versions of the code, EAC and the simulator select the same version.

well for the loops in our benchmarks from the performance angle. Similarly, with unroll-and-jam, we used an unrolling factor of 4, again a value that has been found to perform well. That is, an unrolling factor of 4 combined with a tile size of 20 generates the best performance across all tile sizes and unrolling factors tested. The results given in Figure 9 indicate that, as far as the trends are concerned, our estimation follows that of the simulator, giving an error margin of 6.15% on average (total energy). The average errors for the datapath, cache, main memory, bus, and clock network are 7.54, 11.80, 1.61, 4.48, and 5.31%, respectively. Due to the two-way associative cache used and the array padding employed, we observe a very accurate cache miss estimation (which is the major contributor to the accuracy of the main memory energy). In cases where unroll-and-jam is used, our approach underestimates the number of data accesses, as it does not currently take scalar accesses into account. Our current datapath energy estimation model does not distinguish loop permutation. Nevertheless, the compiler-directed estimation captures the energy trends across different optimizations quite well. We also performed a sensitivity analysis based on tile sizes and unrolling factors. The bar charts in Figure 10 give the datapath energy consumptions with different tile sizes and unrolling factors. The left bar chart is for the EAC-based estimation, whereas the right one is for the simulation. We see from these results that the compiler-based estimation follows the simulation-based estimation very well and exhibits the same trends. We focus here only on validation and do not discuss where the differences in the energy behaviors of the unoptimized and optimized codes come from. In general, optimized codes improve cache behavior and reduce memory system energy.
However, some optimizations (e.g., tiling) also degrade code locality (by increasing the number of instructions and reducing instruction reuse) [Kandemir et al. 2000]. Other than reinforcing that the EAC-based estimates are still moderately accurate, these results also show that the compiler-estimated energy can be used to predict the energy impact of different high-level optimizations. For example, in estimating the memory energy consumption, EAC orders the


Fig. 10. Energy estimations with different tile sizes and unrolling factors. Left, EAC-based estimation; right, simulator-based estimation.

optimizations as tiling, loop permutation, and unroll-and-jam (from the least to the most energy-consuming one), the same order returned by the simulator. The absolute energy consumption and trend validation results indicate that EAC can be used to capture the impact of high-level optimizations on the energy consumption of, and its distribution across, different components. While the validation results are shown for specific optimizations in this section, this estimation ability can be used to evaluate the energy impact of other high-level compiler optimizations as well. More importantly, this ability can be exploited to drive energy-oriented high-level code optimizations, as explained in the next section.

5. ENERGY OPTIMIZATIONS

In this section, we show how EAC can be used to guide energy-conscious, high-level compiler optimizations at both the nested loop level and the whole procedure level. Our presentation is in two parts. First, we present energy-conscious versions of tiling (a widely used loop-level optimization). We then focus on procedure-level energy optimization and present a strategy that can be used for compilation under multiple energy, performance, and code size constraints. We believe that an effort toward compiling a given code under multiple constraints is of extreme importance for a large class of embedded/portable and high-end systems.

5.1 Energy-Constrained Iteration Space Tiling

Energy constraints for compilers can be imposed for different reasons. In this section, we will focus on three different scenarios for energy-constrained loop tiling. While we focus here only on loop tiling, a similar analysis can also be performed for other high-level code transformations.

5.1.1 Thermal-Constrained Optimization. We consider an optimization to ensure that the power consumption in a particular system component does not exceed a specified limit.
Such an optimization is important, as uniform heat dissipation is desired across a chip, and increasing the power consumption in one particular component can create a thermal hot-spot in the chip [Viswanath et al. 2000; Skadron et al. 2003; Chandrakasan et al. 2001]. For example, previous


Fig. 11. EAC-estimated energy consumptions for different tiling strategies for the mxm code. In the first column, within the parentheses, are the indexes of the tiled loops. For each tiled loop, a tile size of 5 is used. The values are in units given below the title of each column. The reason that the datapath energy dominates is the fact that the codes are already optimized for minimizing the number of memory references. Since the first version has the highest execution time, it also incurs the largest clock energy.

research shows that tiling typically reduces the average power consumed in the memory and cache and increases the average power spent in the datapath [Kandemir et al. 2000]. The increase in datapath power is due to an increase in the number of instructions associated with the increased complexity of the loop structures (i.e., the increase in the number of loops and the increased complexity of the loop-bound expressions) and a reduction in the number of memory stall cycles. Thus, an increase in datapath energy coupled with a decrease in execution time results in an increase in average power consumption. Consequently, the datapath can become a hot-spot as the degree of tiling (i.e., the number of loops tiled) increases. Based on the relative size (area) occupied by the datapath compared to the rest of the on-chip resources, this problem can become critical. In order to address such a constraint, we implemented a constrained version of iteration space tiling in EAC. Given a loop nest to be tiled and a datapath power limit (thermal limit), this tiling strategy first uses linear loop transformations to obtain a new loop order such that the loop with the highest reuse is placed in the innermost position, the loop with the next highest reuse is placed in the next inner position, and so on. After that, the compiler considers a version in which the innermost loop is tiled and calculates the increase in datapath power (by evaluating both the energy and the number of execution cycles). If the calculated power is lower than the allowable thermal limit, the compiler continues with tiling and considers a version in which the two innermost loops are tiled. As before, if the calculated power is lower than the limit, it continues with tiling (the three innermost loops). If, at any point, the calculated datapath power is found to be higher than the allowable limit, the compiler stops and outputs the previous version (the one before the current tiling step).
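The incremental search just described can be sketched as follows. Here estimate_power(version) stands in for an EAC query that returns the estimated datapath power (nJ/cycle) of a candidate code version; the version names and power figures are ours, for illustration only:

```python
def constrained_tiling(versions, estimate_power, power_limit):
    """Tile one more loop at a time; return the last version whose estimated
    datapath power stays within the thermal limit."""
    chosen = versions[0]                 # start from the untiled nest
    for candidate in versions[1:]:
        if estimate_power(candidate) > power_limit:
            break                        # one more tiling step would exceed the limit
        chosen = candidate
    return chosen

# Candidate versions ordered by tiling degree, with illustrative power values:
order = ['mxm', 'mxm(j)', 'mxm(j+k)', 'mxm(i+j+k)']
power = {'mxm': 0.24, 'mxm(j)': 0.27, 'mxm(j+k)': 0.30, 'mxm(i+j+k)': 0.32}
print(constrained_tiling(order, power.get, 0.28))  # prints "mxm(j)"
```

The search is linear in the number of tiling degrees considered, which is what keeps the compilation-time cost of this constrained strategy low.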
As an example, Figure 11 gives the compiler-estimated energy consumption for different tiling strategies for the matrix multiply code. Our energy-constrained tiling implementation first transforms the loop from 'i,j,k' order to 'i,k,j' order (both listed from the outermost loop position to the innermost) to improve the data locality (the outermost tile loops are also ordered to exploit inter-tile locality), and then (in the worst case) considers


the following versions in this order: the original code (mxm), mxm(j), mxm(j+k), and mxm(i+j+k). Using this power-constrained incremental tiling strategy, if the datapath power limit is 0.28 nJ/cycle, we select the mxm(j) version, as going one more step [that is, to mxm(j+k)] results in a datapath power that is higher than the limit (see the third column in Figure 11 for the estimated datapath power). Note that the mxm(j) version is not the best version if we consider only performance. As can be seen from the second column in Figure 11, the mxm(j+k) version is the fastest code. The strategy explained above is based on transforming the loop using linear loop transformations [Wolf and Lam 1991] (prior to tiling) and considers tiling alternatives starting from the innermost positions. Such a strategy also limits the number of alternatives to be considered. It is also possible to develop a more general strategy in which all versions of the code with one tiled loop are considered before any versions in which two loops are tiled. For example, in the matrix multiply case, these would be mxm(i), mxm(j), and mxm(k). If any of these versions has a lower estimated datapath power than the specified limit, EAC considers versions where an additional loop is tiled (in an attempt to improve performance while meeting the power constraint). Assuming a datapath power limit of 0.26 nJ/cycle, such a strategy would first select mxm(i) (see the third column), would then consider mxm(i+j) and mxm(i+k), and would stop, as the datapath power dissipations for these two versions are higher than the limit.

5.1.2 Battery Capacity-Constrained Optimization. We consider optimizing the code to meet specific energy capacity limits of the battery in an embedded/mobile computing system. If we assume that the battery capacity is fixed and that the code must complete execution before the battery can be recharged, we need to find a suitable degree of loop tiling to meet this constraint.
Here, we start with the original untiled code and try alternative tiling strategies incrementally. Once the energy constraint is met, the compiler stops incremental tiling and returns the current version as the solution. Such scalable optimizations are particularly beneficial when the same code needs to be executed on different types of embedded architectures with varying battery capacities (e.g., laptops, palmtops, and smart cards). We can adapt the search strategy described above for our current problem. That is, we start with the versions in which only a single loop is tiled, and then work our way up to more aggressively tiled versions until we reach a version with the desired total energy consumption. In our matrix multiply example, assuming a total energy limit (battery capacity) of 41.5e-03 J, our approach first considers the untiled code and, since its total energy consumption is higher than the limit (see the last column in Figure 11), continues with the versions mxm(i), mxm(j), and mxm(k). As soon as it sees the mxm(j) version, with an energy consumption of 41.4e-03 J, it stops the search and returns this version as output.

While we have assumed the battery capacity to be fixed, in practice the battery capacity is affected by the degree of uniformity in the load and the average load exposed to the battery [Pedram and Wu 1999]. We can extend our constrained tiling strategy to account for such factors as well by predicting and correlating the spikes in power consumption to off-chip memory accesses and the intervals between these spikes [Rakhmatov et al. 2002]. For example, we can consider reducing the number of spikes to increase the efficiency of battery capacity, which, in turn, can be considered equivalent to lowering energy consumption. Thus, a constrained tiling strategy should now consider both the number of spikes in the current profile and the overall energy consumed. EAC can be used to perform such optimizations as well.

Another interesting observation from Figure 11 is the impact on clock energy. While the number of datapath operations increases as a result of tiling, causing an increase in clock energy, the reduction in the number of memory stall cycles (also a result of tiling) decreases the energy consumed in the clock generation and distribution circuitry.

5.1.3 Multiconstrained Optimization. In many embedded systems, the compiler can afford to spend more cycles in compilation (than in general-purpose systems), as the quality of the code is of critical importance. We can take advantage of this by adopting a multiconstrained tiling strategy based on exhaustive search. Using EAC, given a nested loop, we can generate a table such as the one shown in Figure 11. Using such a table, we can then select the tiled version that satisfies our energy/performance constraints. For instance, one strategy can select the version with the minimum clock network energy (mxm(j+k)), whereas another strategy can search for the version with the minimum cache-plus-memory energy value (mxm(i+j+k)). We can even focus on compilation strategies with more complex constraints. For example, we can search for the version with the minimum on-chip (datapath + caches + clock + on-chip buses) energy consumption under constraints that limit off-chip memory energy and execution cycles to be less than some preset thresholds.
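Such a table-driven multiconstrained selection is essentially a filter-then-minimize pass. In the sketch below, the per-version component energies and cycle counts are invented stand-ins for the Figure 11 estimates (chosen so that the minimum-clock and minimum-cache-plus-memory picks match the versions named in the text); the component names are assumptions, not EAC's actual data layout.

```python
# Each version maps to hypothetical per-component energy estimates (arbitrary
# units) plus execution cycles; real values would come from EAC's estimator.
TABLE = {
    #            datapath cache memory  bus  clock cycles
    "mxm":         (10.0,  8.0,  30.0, 4.0,  9.0,  900),
    "mxm(j)":      (12.0,  7.0,  20.0, 4.5,  7.5,  700),
    "mxm(j+k)":    (14.0,  6.5,  14.0, 5.0,  6.8,  650),
    "mxm(i+j+k)":  (16.0,  6.0,  11.0, 5.5,  7.2,  680),
}
FIELDS = ("datapath", "cache", "memory", "bus", "clock", "cycles")

def pick(table, objective, constraints=()):
    """Return the version minimizing `objective` among those satisfying
    every (field, upper_bound) constraint."""
    idx = {f: i for i, f in enumerate(FIELDS)}
    feasible = [v for v, row in table.items()
                if all(row[idx[f]] <= bound for f, bound in constraints)]
    if not feasible:
        raise ValueError("no version satisfies the constraints")
    return min(feasible, key=lambda v: objective(dict(zip(FIELDS, table[v]))))

# Minimize on-chip energy (datapath + cache + bus + clock) subject to
# off-chip memory energy <= 21 and execution cycles <= 800.
best = pick(TABLE,
            objective=lambda r: r["datapath"] + r["cache"] + r["bus"] + r["clock"],
            constraints=[("memory", 21.0), ("cycles", 800)])
```

Simpler strategies fall out of the same routine: passing `lambda r: r["clock"]` with no constraints picks the minimum-clock-energy version, and `lambda r: r["cache"] + r["memory"]` picks the minimum cache-plus-memory version.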
The important point to note here is that a table such as this gives us the flexibility to perform tradeoffs between performance, power, and energy, and even tradeoffs between the energy (or power) consumptions of different components. Using such a table and a search strategy, EAC can also compile a code under a given energy-delay product value. It should also be noted that, while we fixed the tile size here at a certain value (5 for each tiled loop) for each tiled version, a more sophisticated strategy can try a number of alternative tile sizes for each tiled version. Since our previous simulator-based characterization work [Kandemir et al. 2000] shows that, for a given tiled code, the best tile size from the performance perspective might, in general, be different from the most energy-efficient one, we believe that interesting power, energy, and performance tradeoffs and optimization opportunities exist when selecting tile sizes in codes running in energy-sensitive environments. A possible strategy in the case of tiling would be to construct a two-level loop where the outer loop enumerates different tile sizes and the inner loop iterates over different versions of the code. It should also be noted that, while we present only tiling results here, similar search-based strategies can also be used for other high-level optimizations, such as linear loop transformations, loop unrolling, loop distribution, and loop fusion.
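The two-level search suggested above can be sketched directly as a pair of nested loops. The tile sizes, version names, and toy estimator below are placeholders for calls into EAC's estimation engine; the numbers are invented purely for illustration.

```python
def two_level_search(tile_sizes, versions, estimate, energy_limit):
    """Outer loop over candidate tile sizes, inner loop over tiled versions;
    keep the feasible (size, version) pair with the fewest execution cycles.
    `estimate(size, version)` should return (energy, cycles)."""
    best, best_cycles = None, float("inf")
    for size in tile_sizes:            # e.g., [4, 5, 8, 16]
        for version in versions:       # e.g., ["mxm(j)", "mxm(j+k)"]
            energy, cycles = estimate(size, version)
            if energy <= energy_limit and cycles < best_cycles:
                best, best_cycles = (size, version), cycles
    return best

# A toy estimator standing in for EAC (values are invented): energy grows with
# the number of tiled loops and with distance from a "sweet spot" tile size,
# while cycles shrink with deeper tiling and larger tiles.
def toy_estimate(size, version):
    tiled_loops = version.count("+")
    energy = 40.0 + 0.5 * abs(size - 8) + 2.0 * tiled_loops
    cycles = 1000 - 40 * tiled_loops - 5 * size
    return energy, cycles
```

The search returns the fastest version/tile-size pair that stays under the energy limit, which in general need not use either the fastest version or the most energy-efficient tile size on its own.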

We want to emphasize that the discussion above implies that optimizing a given code under different constraints might, in general, generate different results. Our compiler framework allows us to study not only tradeoffs between overall energy consumption and performance, but also tradeoffs between the energy/power consumptions of different system components.

5.2 Integer Linear Programming-Based Procedure-Level Optimization

In this section, we focus on procedure-level optimization and present a framework based on integer linear programming (ILP). An ILP problem is a linear programming problem in which the optimization parameters (solution vector elements) are restricted to be integers. The specific version of ILP employed within EAC is called zero-one ILP (ZILP), where we further restrict each integer variable to be either zero or one. While ILP solutions can be time-consuming, the number of variables and constraints necessary for this study is fairly low, making this a viable approach. In all the experiments we performed, the solution times were negligible (0.3 seconds on average).

Given an input code, the procedure-level optimization tries to come up with an optimized code that satisfies given criteria. Prior source-level optimization techniques [Blume et al. 1994; Gupta et al. 1999] generally focus solely on performance, and thus try to minimize execution cycles. Prior work [Bodin et al. 1998] has used an ILP-based approach to optimize code under performance and size (code length) constraints. Our approach optimizes a given code under energy, performance, and size constraints and employs the compiler-directed energy estimation model discussed earlier. It consists of two parts: (1) an alternative-generation module and (2) an alternative-selection module.
The alternative-generation module outputs a number of alternatives for a given nested loop, each alternative having (potentially) a different (estimated) energy consumption, code size, and performance (execution cycles). Following the generation of alternatives for each nest, the selection module uses ILP to select an alternative for each loop nest such that all the constraints are met. The current implementation signals an error if it is not possible to satisfy all the constraints, prompting the user to relax some constraint(s) and attempt recompilation.

There can be several ways of generating alternatives; our current strategy uses possible combinations of a limited set of high-level optimizations. For optimizations where we can have a very large number of alternatives (e.g., by changing the tile size in tiling or the unrolling factor in unroll-and-jam), we restrict the parameters in question to a reasonable set of values; for example, limiting the maximum unrolling factor to 16 has been found to be a reasonable approach for many array-based codes [Wolf and Chen 1996]. While the technique used for generating the alternatives can critically influence the effectiveness of the selection step, a complete exploration of more sophisticated strategies is beyond the scope of this paper. Our interest in this section is in the alternative-selection module; in particular, we show how ZILP can be used to optimize a given code under specified constraints.

Compiler-Directed High-Level Energy Estimation and Optimization




We assume N different nests and that nest i (1 ≤ i ≤ N) has p(i) alternatives. We make the following definitions:

  E_ij : estimated energy consumption for alternative j of nest i,
  S_ij : estimated static code size for alternative j of nest i, and
  C_ij : estimated execution cycles for alternative j of nest i,

where 1 ≤ j ≤ p(i). To capture the energy breakdown across different components, we use E_ij^d, E_ij^c, E_ij^m, E_ij^b, and E_ij^n to denote (for alternative j of nest i) the estimated energy consumptions in the datapath, caches, main memory, buses, and clock network, respectively. We use zero-one integer variables y_ij to indicate whether the jth alternative is selected (in the final output code) for nest i or not. Specifically, if y_ij is 1, the jth alternative is selected for nest i; if y_ij is 0, it is not. Since, for each nest, the final code should contain only one alternative, the total energy consumption (E), total code size (S), and total execution cycles (C) of the resulting code can be expressed as follows:

  E = Σ_i Σ_j E_ij y_ij,
  S = Σ_i Σ_j S_ij y_ij,
  C = Σ_i Σ_j C_ij y_ij,

where E_ij = E_ij^d + E_ij^c + E_ij^m + E_ij^b + E_ij^n and Σ_j y_ij = 1 for each nest i. In the remainder of this section, we refer to these constraints as basic constraints. Each optimization problem that we address in the following is expressed by augmenting these basic constraints with some additional constraints and an objective function.

We now present a number of case studies using tomcatvi as our running example. This code has nine separate loop nests, and our experimentation with this code using EAC shows that the majority of its nests can have multiple versions with different energy/performance/size values. Figure 12 gives, for each loop nest, the EAC-estimated energy (E), static code size (S), and execution cycles (C) of its alternatives. For the purposes of this discussion, we identified the set of alternatives for each nest by considering different combinations of three high-level optimizations (loop permutation, tiling, and unroll-and-jam). For each nest, we eliminated the illegal combinations and the combinations that are duplicates (from the final estimation viewpoint) of already selected ones.
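The basic-constraint formulation above can be made concrete with a toy instance. The sketch below enumerates every assignment with exactly one alternative per nest (i.e., Σ_j y_ij = 1) and minimizes total energy E under bounds on S and C. EAC instead hands this formulation to an ILP solver; the brute-force enumeration and the two-nest data here are invented purely for illustration.

```python
from itertools import product

# Hypothetical (E, S, C) tuples per alternative for two nests; real values
# would come from EAC's estimation step (cf. Figure 12).
NESTS = [
    [(5.0, 40, 500), (4.2, 55, 420), (4.5, 60, 380)],   # nest 1
    [(3.0, 30, 300), (2.6, 48, 260)],                   # nest 2
]

def select(nests, size_limit, cycle_limit):
    """Pick exactly one alternative per nest (y_ij = 1 for a single j),
    minimizing total energy E subject to S <= size_limit and C <= cycle_limit.
    Returns (choice tuple, energy), or (None, inf) if nothing is feasible."""
    best, best_e = None, float("inf")
    for choice in product(*[range(len(alts)) for alts in nests]):
        e = sum(nests[i][j][0] for i, j in enumerate(choice))
        s = sum(nests[i][j][1] for i, j in enumerate(choice))
        c = sum(nests[i][j][2] for i, j in enumerate(choice))
        if s <= size_limit and c <= cycle_limit and e < best_e:
            best, best_e = choice, e
    return best, best_e
```

For problems of the size reported in the text (nine nests, a handful of alternatives each), even this exhaustive enumeration is cheap, which is consistent with the negligible solver times observed; an ILP solver simply scales further.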
After generating the versions, EAC uses the estimation method explained earlier to estimate the energy, static code size, and execution cycles for each version of each nest (as in Figure 12). It then takes into account the additional constraints and the objective function (it reads them from an input file) and builds the ZILP equations and inequalities. After that, it calls the procedural interface of lp_solve [Schwab 2004] (a publicly available ILP solver package) to solve the ILP problem, and uses the solution vector to select the appropriate version for each nest and construct the output code.

We now consider different compilation strategies (cases) whose objective functions and additional constraints (if any) are given in the second and third columns, respectively, of Figure 13. First, we explore three cases, C1, C2, and

Fig. 12. Characteristics of alternative versions of each nest in tomcatvi. Each tuple, given as (E, S, C), represents the total energy consumption (E), static code size (S), and execution cycles (C) for a given alternative version of a nest. Note that, in general, different nests have different numbers of alternatives. The entry (-,-,-) indicates that the corresponding alternative does not exist for the nest. The energy consumption is expressed in units of 1.0e-05 J, and the static code size is given as the number of machine instructions.

Fig. 13. Different compilation strategies and selected optimized versions for each nest (tomcatvi). E, S, and C represent, respectively, the total energy consumption, static code size, and execution cycles. Each Ai refers to an alternative for the corresponding nest.

C3, that optimize, respectively, performance (the traditional focus of compilation), overall system energy consumption, and the resulting code size, with no additional constraints. From Figure 13, we observe that different alternatives are preferred as the optimization criterion changes, resulting in a different overall code for each of the cases. Compiling for minimum size was observed to yield the code in which all the nests remain in their original, unoptimized form. Compared to this original version of the code, compiling for minimum energy resulted in 4.9% code bloat, a 33.5% reduction in energy consumption, and a 58.6% improvement in performance. Similarly, compiling for maximum

Fig. 14. Energy, performance, and static code size estimations for different compilation strategies (tomcatvi). The energy consumption is expressed in units of 1.0e-05 J, and the static code size is given as the number of machine instructions.

performance resulted in 4.1% code bloat, a 32.3% reduction in energy consumption, and a 59.2% improvement in performance. From these observations, we see that optimizing for different objective functions can generate different application behavior. While the performance and overall-energy optimizations seem similar, detailed analysis shows that the energy-optimized version decreases the on-chip energy consumption while aggravating the energy dissipated in the off-chip memory (see Figure 14).

Next, we impose additional constraints on these three cases to optimize for multiple criteria simultaneously, as shown in Cases 4, 5, and 6 in Figure 13 (C4, C5, and C6). For example, in Case 5, the code with the minimum energy consumption is generated while ensuring that the total number of execution cycles does not increase beyond a specified threshold and that the code bloat is again limited by a threshold. It should be noted that this is a very general optimization constraint. Such optimizations are essential in the following scenarios. Optimizing only for performance may not be useful in an energy-constrained environment; for example, if the application runs very fast but cannot complete because the battery capacity is exhausted, no useful work is done. Alternately, it is of no practical use if the application consumes very little energy but does not complete in a reasonable amount of time. Also, if the code bloat due to an optimization results in exceeding the limited memory capacity of a constrained system, the resulting performance and energy are of little consequence. Thus, it is essential to compile under multiple constraints. Note that, in Figure 13, C5 and C6 generate the same results. This is just a consequence of the particular bounds used in the constraints; different bounds could result in different versions being selected.

Our component-based energy estimation model also allows us to compile a given code under more complex constraints.
For example, considering

  E^d = Σ_i Σ_j E_ij^d y_ij,
  E^c = Σ_i Σ_j E_ij^c y_ij,
  E^m = Σ_i Σ_j E_ij^m y_ij,
  E^b = Σ_i Σ_j E_ij^b y_ij, and
  E^n = Σ_i Σ_j E_ij^n y_ij,

Fig. 15. Results when different objective functions are used. Each value represents the percentage increase in the indicated metric when a particular version (C1, C2, or C3) is used. Recall that C1, C2, and C3 minimize C, E, and S, respectively.

we can develop component-sensitive compilation strategies. First, in Case 7 (Figure 13), we optimize the energy consumed in the off-chip memory. We observe that such an optimization results in a 26% reduction in the average power dissipation in the off-chip package as compared to the pure energy optimization that targets the entire system (C2). Such optimizations can be of use when the system is composed of different chips with different packaging technologies. Hence, if a cheaper package is desired for memory modules to bring the system cost down, we may also need to impose a stricter limit on the power dissipated by these memory modules.

Thus far, we have used additional constraints that involve absolute bounds. EAC can also be used to compile a given code under relative constraints in cases where absolute constraints are not available. For example, in Case 8 (C8), we compile the code to minimize the energy consumed in the memory hierarchy (the caches and the main memory) under an additional constraint that limits the increase in datapath energy to 5% over that of the original code (assuming that E_1^d is the estimated datapath energy consumption of the original code). It should be emphasized that the estimation speed of EAC is the key to the number of optimization alternatives that can be explored within a fixed compilation time; it would be extremely time-consuming to explore such a solution space using a simulation-based approach.

Finally, Figure 15 shows, for all benchmark codes used in this study, the increase in two metrics when the third one is optimized. For example, the first two entries corresponding to fir indicate that, when C1 is used (i.e., performance is optimized), we incur a 4.11% (energy) increase over the most energy-efficient version and an 8.66% (memory space) increase over the most memory-space (size) efficient version.
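A Case 8-style relative constraint fits the same selection pattern: minimize cache-plus-memory energy while letting datapath energy grow at most 5% over that of the original code. In the sketch below, brute-force enumeration stands in for the ZILP solve and all per-alternative component energies are invented.

```python
from itertools import product

# Hypothetical per-alternative component energies (E_datapath, E_cache,
# E_memory) for two nests; alternative 0 of each nest is the original code.
ALTS = [
    [(2.0, 1.5, 6.0), (2.1, 1.2, 4.0)],   # nest 1
    [(1.0, 0.8, 3.0), (1.4, 0.6, 2.0)],   # nest 2
]

def select_relative(nests, slack=0.05):
    """Minimize memory-hierarchy energy (cache + memory) subject to the
    relative constraint E^d <= (1 + slack) * E^d(original code)."""
    orig_datapath = sum(alts[0][0] for alts in nests)
    best, best_mh = None, float("inf")
    for choice in product(*[range(len(alts)) for alts in nests]):
        datapath = sum(nests[i][j][0] for i, j in enumerate(choice))
        mem_hier = sum(nests[i][j][1] + nests[i][j][2]
                       for i, j in enumerate(choice))
        if datapath <= (1 + slack) * orig_datapath and mem_hier < best_mh:
            best, best_mh = choice, mem_hier
    return best
```

In this toy instance, the relative bound admits a memory-friendlier alternative for the first nest while rejecting the (even better for memory) alternatives that would push datapath energy past the 5% slack.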
These results clearly indicate that optimizing one metric can generate poor results for the other metrics, and that the EAC framework allows us to perform experiments with different objective functions and study the tradeoffs between energy, performance, and code size.

6. CONCLUDING REMARKS AND FUTURE WORK

It is essential for a compiler to be able to satisfy criteria other than just performance, such as reducing energy consumption and code size, that many of today's compilation frameworks do not target. Satisfying these multiple constraints requires a close interaction with the capabilities (number of available registers, cache sizes, etc.) and behavior (how many cycles does an instruction take? what is the energy consumption of a given instruction? etc.) of the underlying hardware. This paper presents the first framework—EAC—that interacts closely with the underlying system to generate code that can meet many of today's evolving compilation constraints.

EAC takes in an architecture description, technology parameters, and constraint specifications to generate the appropriate code segments in array-dominated applications. Validation results show that the proposed framework is accurate, falling within a 6% error margin (on average) of a cycle-accurate architectural-level energy simulator. With several detailed investigations using this framework, we have shown the importance of energy-aware compilation (optimizing performance does not necessarily translate to optimizing energy), how the compilation process can help alleviate the thermal/cooling issues of components in high-end processors, and how we can generate code based on multiple constraints (performance, energy consumption, code size, etc.). EAC is thus useful across a wide spectrum of computing environments, from high-end servers to resource-constrained embedded/mobile systems. Apart from its apparent impact on compiler research, EAC would be extremely useful for energy-conscious application developers by providing a rapid way of evaluating source-level algorithmic optimizations.
It can also be useful for system architects as a quick way of experimenting with architectural alternatives to understand their suitability to real workloads without having to go through extensive execution-driven simulations (although this would require extending the basic EAC infrastructure).

The current EAC framework also has some limitations. Basically, it currently works only with integer data and does not take scalar variable accesses into account (i.e., it is tuned for array-based computations). It also operates under the conservative assumption that all the branches of a conditional construct have the same probability of being taken at runtime. In the future, we will address these shortcomings. Our future research plans also include developing more accurate energy estimation strategies within EAC and investigating energy-constrained versions of a large class of optimizations. We also plan to extend the ILP-based approach to larger codes and integrate it with interprocedural analysis. Finally, we would like to extend the proposed approach to superscalar architectures.

REFERENCES

ALBONESI, D. H. 1999. Selective cache ways: On-demand cache resource allocation. In Proc. the 32nd International Symposium on Microarchitecture. 248–259.
BANIASADI, A. AND MOSHOVOS, A. 2002. Asymmetric-frequency clustering: A power-aware back-end for high-performance processors. In Proc. International Symposium on Low-Power Electronics and Design. 255–258.

BENINI, L., BOGLIOLO, A., CAVALLUCCI, S., AND RICCO, B. 1998. Monitoring system activity for OS-directed dynamic power management. In Proc. International Symposium on Low-Power Electronics and Design. 185–190.
BENINI, L., HODGSON, R., AND SIEGEL, P. 1998. System-level power estimation and optimization. In Proc. International Symposium on Low Power Electronics and Design, Monterey, CA.
BLUME, W., EIGENMANN, R., FAIGIN, K., GROUT, J., HOEFLINGER, J., PADUA, D., PETERSEN, P., POTTENGER, B., RAUCHWERGER, L., TU, P., AND WEATHERFORD, S. 1994. Polaris: The next generation in parallelizing compilers. In Proc. the Seventh Workshop on Languages and Compilers for Parallel Computing, Ithaca, New York. 10.1–10.18.
BODIN, F., CHAMSKI, Z., EISENBEIS, C., ROHOU, E., AND SEZNEC, A. 1998. GCDS: A compiler strategy for trading code size against performance in embedded applications. Tech. Rep. RR-3346 (Jan.), INRIA, Rocquencourt, France.
BORKAR, S. 1999. Design challenges of technology scaling. IEEE Micro 19, 4, 23–29.
BROOKS, D., TIWARI, V., AND MARTONOSI, M. 2000. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. the 27th Annual International Symposium on Computer Architecture.
BURGER, D., AUSTIN, T., AND BENNETT, S. 1996. Evaluating future microprocessors: The SimpleScalar tool set. Tech. Rep. CS-TR-96-103 (July), Computer Science Dept., University of Wisconsin, Madison, WI.
BUTTS, J. A. AND SOHI, G. 2000. A static power model for architects. In Proc. International Symposium on Microarchitecture.
CATTHOOR, F., WUYTACK, S., GREEF, E. D., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE, A. 1998. Custom Memory Management Methodology—Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Pub., Boston, MA.
CHANDRAKASAN, A. AND BRODERSEN, R. 1995. Low Power Digital CMOS Design. Kluwer Academic Pub., Boston, MA.
CHANDRAKASAN, A., BOWHILL, W. J., AND FOX, F. 2001. Design of High-Performance Microprocessor Circuits.
IEEE Press, Piscataway, NJ.
CHEN, G., KANG, B., KANDEMIR, M., VIJAYKRISHNAN, N., AND IRWIN, M. J. 2003. Energy-aware compilation and execution in Java-enabled mobile devices. In Proc. International Parallel and Distributed Processing Symposium, Nice, France.
CHEN, R., BAJWA, R., AND IRWIN, M. J. 2001. Architectural level power estimation and design experiments. ACM Transactions on Design Automation of Electronic Systems 6, 1 (Jan.), 50–66.
COOPER, K. D., KENNEDY, K., AND MCINTOSH, N. 1996. Cross-loop reuse analysis and its application to cache optimizations. In Proc. the 9th Workshop on Languages and Compilers for Parallel Computing, San Jose, CA.
DELALUZ, V., KANDEMIR, M., VIJAYKRISHNAN, N., SIVASUBRAMANIAM, A., AND IRWIN, M. J. 2001. DRAM energy management using software and hardware directed power mode control. In Proc. the 7th International Conference on High Performance Computer Architecture, Monterrey, Mexico.
DUARTE, D., VIJAYKRISHNAN, N., IRWIN, M. J., AND KANDEMIR, M. 2001. Formulation and validation of an energy dissipation model for clock generation circuitry and distribution network. In Proc. International Conference on VLSI Design.
FLAUTNER, K., KIM, N. S., MARTIN, S., BLAAUW, D., AND MUDGE, T. 2002. Drowsy caches: Simple techniques for reducing leakage. In Proc. the 29th International Symposium on Computer Architecture.
GEBOTYS, C. H. 1997. Low energy memory and register allocation using network flow. In Proc. Design Automation Conference, Anaheim, CA. 435–440.
GHOSE, K. AND KAMBLE, M. B. 1999. Reducing power in superscalar processor caches using subbanking, multiple line buffers, and bit-line segmentation. In Proc. International Symposium on Low Power Electronics and Design. 70–75.
GONZALEZ, R. AND HOROWITZ, M. 1996. Energy dissipation in general purpose processors. IEEE Journal of Solid-State Circuits 31, 9 (Sept.), 1277–1283.
GUPTA, R., PANDE, S., PSARRIS, K., AND SARKAR, V. 1999.
Compilation techniques for parallel systems. Parallel Computing 25, 13, 1741–1783.

HSU, C.-H. AND KREMER, U. 2003. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In Proc. ACM Conference on Programming Language Design and Implementation, San Diego, CA.
IRWIN, M. J. AND VIJAYKRISHNAN, N. 2000. Low power design: From soup to nuts. Tutorial notes, ISCA 2000.
KADAYIF, I., KANDEMIR, M., VIJAYKRISHNAN, N., IRWIN, M. J., AND SIVASUBRAMANIAM, A. 2002. EAC: A compiler framework for high-level energy estimation and optimization. In Proc. the 5th Design Automation and Test in Europe Conference, Paris, France.
KAMBLE, M. AND GHOSE, K. 1997a. Analytical energy dissipation models for low power caches. In Proc. International Symposium on Low Power Electronics and Design.
KAMBLE, M. AND GHOSE, K. 1997b. Energy efficiency of VLSI caches: A comparative study. In Proc. International Conference on VLSI Design.
KANDEMIR, M., VIJAYKRISHNAN, N., IRWIN, M. J., AND KIM, H. S. 2000. Experimental evaluation of energy behavior of iteration space tiling. In Workshop on Languages and Compilers for High Performance Computing, Yorktown Heights, NY.
KANDEMIR, M., VIJAYKRISHNAN, N., IRWIN, M. J., AND YE, W. 2000. Influence of compiler optimizations on system power. In Proc. the 37th Design Automation Conference, Los Angeles, CA.
KAXIRAS, S., HU, Z., AND MARTONOSI, M. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proc. the 28th Annual International Symposium on Computer Architecture.
LAM, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. the ACM Conference on Programming Language Design and Implementation, Atlanta, GA.
LAM, M., ROTHBERG, E., AND WOLF, M. 1991. The cache performance of blocked algorithms. In Proc. the 4th International Conference on Architectural Support for Programming Languages and Operating Systems.
LEBECK, A. R., FAN, X., ZENG, H., AND ELLIS, C. S. 2000. Power aware page allocation. In Proc.
the 9th International Conference on Architectural Support for Programming Languages and Operating Systems.
LORCH, J. R. AND SMITH, A. J. 1998. Software strategies for portable computer energy management. IEEE Personal Communications, 60–73.
LORENZ, M., WEHMEYER, L., AND DRAGER, T. 2002. Energy aware compilation for DSPs with SIMD instructions. In Proc. Conference on Language, Compiler and Tool Support for Embedded Systems, Berlin, Germany.
LU, Y.-H., BENINI, L., AND MICHELI, G. D. 2000. Operating-system directed power reduction. In Proc. ISLPED'00.
MANNE, S., KLAUSER, A., AND GRUNWALD, D. 1998. Pipeline gating: Speculation control for energy reduction. In Proc. International Symposium on Computer Architecture.
MCKINLEY, K., CARR, S., AND TSENG, C. 1996. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems.
MOSHOVOS, A. 2003. Checkpointing alternatives for high performance, power-aware processors. In Proc. International Symposium on Low-Power Electronics and Design. 318–321.
MUSOLL, E. 2000. Estimation of the upper-bound useless energy dissipation in a high-performance processor. In Kool Chips Workshop.
PANDA, P. R. AND DUTT, N. D. 1996. Reducing address bus transitions for low-power memory mapping. In Proc. European Design and Test Conference.
PEDRAM, M. AND WU, Q. 1999. Design considerations for battery powered electronics. In Proc. Design Automation Conference. 861–866.
RAKHMATOV, D., VRUDHULA, S., AND WALLACH, D. A. 2002. Battery lifetime prediction for energy-aware computing. In Proc. International Symposium on Low Power Electronics and Design, Monterey, CA.
RELE, S., PANDE, S., ONDER, S., AND GUPTA, R. 2002. Optimization of static power dissipation by functional units in superscalar processors. In Proc. International Conference on Compiler Construction, Grenoble, France.
SAPUTRA, H., KANDEMIR, M., VIJAYKRISHNAN, N., IRWIN, M. J., HU, J. S., HSU, C.-H., AND KREMER, U. 2002.
Energy-conscious compilation based on voltage scaling. In Proc. Conference on Language, Compiler and Tool Support for Embedded Systems, Berlin, Germany.

SCHWAB, H. 2004. lp_solve mixed integer linear program solver. ftp://ftp.es.ele.tue.nl/pub/lp_solve/.
SHIUE, W.-T. AND CHAKRABARTI, C. 1999. Memory exploration for low power, embedded systems. Tech. Rep. CLPE-TR-9-1999-20, Arizona State University, AZ.
SIMUNIC, T., BENINI, L., AND MICHELI, G. D. 1999. Cycle-accurate simulation of energy consumption in embedded systems. In Proc. ACM Design Automation Conference.
SKADRON, K., STAN, M. R., HUANG, W., VELUSAMY, S., SANKARANARAYANAN, K., AND TARJAN, D. 2003. Temperature-aware computer systems: Opportunities and challenges. IEEE Micro 23, 6 (Nov.), 52–61.
SU, C.-L. AND DESPAIN, A. M. 1995. Cache designs for energy efficiency. In Proc. the 28th Hawaii International Conference on System Sciences, Hawaii.
WILSON, R. ET AL. 1994. SUIF: An infrastructure for research on parallelizing and optimizing compilers. ACM SIGPLAN Notices 29, 12 (Dec.), 31–37.
TEMAM, O., GRANSTON, E. D., AND JALBY, W. 1993. To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In Proc. IEEE Supercomputing'93, Portland, OR.
TIWARI, V., MALIK, S., WOLFE, A., AND LEE, T. C. 1996. Instruction level power analysis and optimization of software. Journal of VLSI Signal Processing Systems 13, 2 (Aug.).
TOBUREN, M. C., CONTE, T. M., AND REILLY, M. 1998. Instruction scheduling for low power dissipation in high performance processors. In Proc. the Power Driven Micro-architecture Workshop (in conjunction with ISCA'98), Barcelona, Spain.
UNNIKRISHNAN, P., CHEN, G., KANDEMIR, M., AND MUDGETT, D. R. 2002. Dynamic compilation for energy adaptation. In Proc. the International Conference on Computer Aided Design, San Jose, CA.
VIJAYKRISHNAN, N., KANDEMIR, M., IRWIN, M. J., KIM, H. S., AND YE, W. 2000. Energy-driven integrated hardware-software optimizations using SimplePower. In Proc. the International Symposium on Computer Architecture, Vancouver, British Columbia, Canada.
VISWANATH, R., WAKHARKAR, V., WATWE, A., AND LEBONHEUR, V. 2000. Thermal performance challenges from silicon to systems. Intel Technology Journal Q3.
WILLEMS, M. AND ZIVOJNOVIC, V. 1996. DSP-compiler: Product quality for control oriented applications? In Proc. ICSPAT'96. 752–756.
WOLF, M. E., MAYDAN, D. E., AND CHEN, D.-K. 1996. Combining loop transformations considering caches and scheduling. In Proc. International Symposium on Microarchitecture, Paris, France. 274–286.
WOLF, M. AND LAM, M. 1991. A data locality optimizing algorithm. In Proc. the ACM Conference on Programming Language Design and Implementation. 30–44.
XIE, F., MARTONOSI, M., AND MALIK, S. 2003. Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proc. ACM Conference on Programming Language Design and Implementation.
ZHANG, W., KANDEMIR, M., VIJAYKRISHNAN, N., IRWIN, M. J., AND DE, V. 2003. Compiler support for reducing leakage energy consumption. In Proc. the 6th Design Automation and Test in Europe Conference, Munich, Germany.

Received May 2004; revised December 2004; accepted January 2005

ACM Transactions on Embedded Computing Systems, Vol. 4, No. 4, November 2005.
