COMPILER-DIRECTED FUNCTIONAL UNIT SHUTDOWN FOR MICROARCHITECTURE POWER OPTIMIZATION
BY SANTOSH B TALLI
A thesis submitted to the Graduate School in partial fulfillment of the requirements for the degree
Master of Science in Electrical Engineering
New Mexico State University
Las Cruces, New Mexico
September 2006
“Compiler-Directed Functional Unit Shutdown for Microarchitecture Power Optimization,” a thesis prepared by Santosh B Talli in partial fulfillment for the degree, Master of Science in Electrical Engineering, has been approved and accepted by the following:
Linda Lacey Dean of the Graduate School
Jeanine Cook Chair of the Examining Committee
Date
Committee in charge:
Dr. Jeanine Cook, Chair
Dr. Steve Stochaj
Dr. Mary Ballyk
To My Family: Mother, Father, Sister and Brother-in-law
ACKNOWLEDGEMENTS
I would like to thank my parents for their love and support. I thank my sister and brother-in-law for their support all through my graduate studies. I have no words to express my gratitude towards my advisor, Dr. Jeanine Cook, for her support and insightful suggestions. Her computer architecture course made me love the subject and changed the course of my career. Dr. Cook also gave me great help in choosing my courses in the Electrical Engineering Department. I am thankful to Dr. Steve Stochaj and Dr. Mary Ballyk for serving on my committee. Dr. Stochaj's computer performance analysis course helped me analyze my results better.
I thank my good friend Ram, for his constant help and support during my research. I thank him for the technical discussions and ideas. Finally I would like to thank all my friends at NMSU who made my stay memorable.
ABSTRACT

COMPILER-DIRECTED FUNCTIONAL UNIT SHUTDOWN FOR MICROARCHITECTURE POWER OPTIMIZATION
BY SANTOSH B TALLI
Master of Science in Electrical Engineering
New Mexico State University
Las Cruces, New Mexico, 2006
Dr. Jeanine Cook, Chair
Leakage power is a major concern in current microarchitectures, as it increases exponentially with decreasing transistor feature sizes. In this study, we present a technique called functional unit shutdown that reduces the static leakage power consumption of a microprocessor by power gating functional units when they are not in use. We use profile information to identify functional unit idle periods, which the compiler then uses to issue corresponding OFF/ON instructions. The decision to power gate during an idle period is based on a comparison between the energy consumed by leaving the units ON and the overhead and leakage energy involved in power cycling them. This comparison identifies short idle periods during which less power is consumed if a functional unit is left ON rather than power cycled. The results show that this technique saves up to 18% of the total energy with a performance degradation of 1%.
TABLE OF CONTENTS

                                                                           Page
LIST OF TABLES .............................................................. x
LIST OF FIGURES ............................................................. xi
1. INTRODUCTION AND MOTIVATION .............................................. 1
   1.1 CMOS Transistor and Leakage Power .................................... 1
   1.2 Superscalar Processors and ILP ....................................... 3
   1.3 Functional Unit Power Consumption .................................... 5
   1.4 Combination of Compiler and Hardware Techniques ...................... 7
2. RELATED WORK ............................................................. 9
   2.1 Clock Gating ......................................................... 9
   2.2 Static Power Reduction Techniques .................................... 11
       2.2.1 Hardware Techniques to Reduce Static Power ..................... 12
             2.2.1.1 Power Gating ........................................... 12
             2.2.1.2 Dual Threshold Voltage ................................. 13
             2.2.1.3 Gate Level Leakage Reduction (Input Vector Control) .... 13
       2.2.2 Compiler Directed Static Power Reductions ...................... 15
   2.3 Functional Unit Static Power Reduction ............................... 15
3. FUNCTIONAL UNIT SHUTDOWN ................................................. 19
   3.1 CFG Generation ....................................................... 19
   3.2 Energy Estimation .................................................... 23
       3.2.1 Overhead Energy and BreakEven Cycles ........................... 25
   3.3 FU Requirement Optimization .......................................... 27
       3.3.1 Complexity of Optimization: Local and Global ................... 28
       3.3.2 Local Optimizer ................................................ 30
       3.3.3 Global Optimizer ............................................... 33
   3.4 Processor Support .................................................... 36
   3.5 Performance Penalty .................................................. 38
   3.6 Variations of the Basic Algorithm .................................... 40
4. EXPERIMENTAL PLATFORM .................................................... 41
   4.1 The SimpleScalar Simulator ........................................... 40
   4.2 The Wattch Simulator ................................................. 41
   4.3 SPEC CPU2000 Benchmarks .............................................. 41
       4.3.1 Subset of SPEC CPU2000 Benchmarks .............................. 42
   4.4 Framework ............................................................ 42
5. RESULTS .................................................................. 44
   5.1 Effect of FU Shutdown ................................................ 44
   5.2 Energy Breakdown ..................................................... 51
   5.3 Performance Degradation .............................................. 53
   5.4 Sensitivity Analysis on BreakEven Cycles ............................. 55
   5.5 Depth of Global Optimization ......................................... 58
   5.6 Wait if Busy vs Busy ON .............................................. 59
   5.7 Conclusions and Future Work .......................................... 60
6. CONCLUSIONS .............................................................. 62
References .................................................................. 63
LIST OF TABLES

Table                                                                      Page
1.1  Static power dissipation by Functional Units ........................... 7
3.1  Instruction class FU requirements ...................................... 20
3.2  Average difference in FU requirements estimation and actual usage ...... 19
3.3  Total Number of FUs Used ............................................... 19
3.4  Number of basic blocks in each benchmark ............................... 30
3.5  Local optimizer energy estimation ...................................... 31
3.6  Time for local optimizer for all benchmarks ............................ 32
3.7  Energy saved versus depth of CFG optimization .......................... 33
3.8  Global optimizer energy estimation ..................................... 34
3.9  Optimization time for global2 CFG optimization ......................... 36
3.10 Optimization time for global4 CFG optimization ......................... 36
4.1  Subset of SPEC CPU2000 benchmarks ...................................... 42
4.2  Simulation parameters .................................................. 43
5.1  Dynamic benchmark instruction mix ...................................... 46
5.2  Average number of FUs used in a BB ..................................... 47
5.3  Total energy savings by implementing FU shutdown ....................... 51
5.4  Increase in energy due to performance for global2 ...................... 55
LIST OF FIGURES

Figure                                                                     Page
1.1  CMOS Inverter .......................................................... 2
2.1  Power Gating implementation ............................................ 13
3.1  BB instruction dependence tree ......................................... 21
3.2  CFG with FU requirements ............................................... 22
3.3  Illustration of BreakEven point ........................................ 26
3.4  Short FU idle period ................................................... 28
3.5  CFG to be optimized .................................................... 29
3.6a Exhaustive search over nodes 1, 4 and 5 ................................ 29
3.6b Exhaustive search on nodes 1, 2, 3, and 5 .............................. 29
3.7  Local optimization ..................................................... 31
3.8  Local optimizer algorithm .............................................. 32
3.9  Depth one sub-CFGs ..................................................... 33
3.10 Global optimizer algorithm ............................................. 35
3.11 Global optimizer time complexity, mcf .................................. 35
3.12 Compiler-inserted instructions ......................................... 37
5.1  Original percent energy breakdown ...................................... 44
5.2  Total energy savings (%) ............................................... 45
5.3a Different strategies for eon ........................................... 48
5.3b Different strategies for facerec ....................................... 48
5.3c Different strategies for fma3d ......................................... 49
5.3d Different strategies for mesa .......................................... 49
5.3e Different strategies for swim .......................................... 50
5.4a Total energy breakdown for vortex ...................................... 52
5.4b Total energy breakdown for INT benchmarks .............................. 52
5.4c Total energy breakdown for FP benchmarks ............................... 53
5.4d Total energy breakdown over all benchmarks ............................. 53
5.5  Execution time in clock cycles ......................................... 54
5.6  Energy breakup for different BE values for art ......................... 56
5.7a BE cycles sensitivity, art ............................................. 56
5.7b BE cycles sensitivity, vpr ............................................. 57
5.7c BE cycles sensitivity, facerec ......................................... 57
5.8  Energy consumption versus global optimizer depth ....................... 59
1. Introduction and Motivation
Decreasing CMOS transistor feature sizes have enabled higher processing speeds and more components on chip. However, this comes at the expense of increased static power dissipation in the form of transistor leakage current, which grows as transistor size decreases. Future technologies will have greater levels of on-chip integration and higher clock frequencies, making energy dissipation an even more critical design constraint. It is now estimated that static power dissipation accounts for about 40% of the total power of high-speed processing chips that use 65nm technology [11]. Moreover, with decreasing transistor feature sizes, static power dissipation in a microprocessor is increasing exponentially [12]. Power consumption is a crucial factor that determines the functionality and mobility of devices: the performance potential of a mobile device is limited by its power consumption, as the increasing levels of integration and clock frequencies required for high performance escalate power dissipation. Due to these factors, power optimization at various levels of microprocessor design becomes essential.
1.1 CMOS Transistor and Leakage Power
One of the most popular Metal Oxide Semiconductor Field Effect Transistor (MOSFET) technologies is the Complementary MOS (CMOS) technology. This technology makes use of both P and N channel devices in the same substrate material. Such devices are extremely useful, since the signal which turns a transistor of one type ON is used to turn a transistor of the other type OFF. This allows the design of logic devices using only simple switches, without the need for a pull-up resistor. Figure 1.1 shows a typical inverter implemented with CMOS technology. VDD is the supply voltage, Vin and Vout are the input and output voltages respectively, and CL is the load capacitance. In this case an input of logic 1 (VDD volts, the transistor supply voltage) switches the N transistor on and the P transistor off. Decreasing transistor feature sizes have the advantage of (1) reducing gate delay, resulting in an increased clock frequency and faster circuit operation, and (2) increasing transistor density, making chips smaller and reducing cost.
Figure 1.1: CMOS Inverter

CMOS circuits dissipate power by charging and discharging the various load capacitances (mostly gate and wire capacitance, but also drain and some source capacitances) whenever they switch. The charge moved, Q, is the capacitance, C, multiplied by the voltage change, Vgain. The current used, Iused, is the product of the charge moved and the switching frequency, f. Finally, the characteristic switching power, P, dissipated by a CMOS circuit is the product of the current used and the voltage swing:

P(dynamic) = Iused × Vgain = (Q × f) × Vgain = ((C × Vgain) × f) × Vgain = C × Vgain² × f

As opposed to dynamic power, which is due to the switching of devices, the main contributor to leakage power is the sub-threshold leakage current present in deep submicron MOS transistors operating in the weak inversion region. Sub-threshold leakage is the current that flows from drain to source even when the transistor is off (gate voltage less than the threshold voltage); the transistor begins to conduct at the threshold voltage. Sub-threshold leakage increases exponentially with decreasing threshold voltage (VT), and the continuous reduction of VT with technology scaling is making static (leakage) power increasingly significant. Hence, in recent
years computer architects have invented solutions to decrease power by relying on microarchitectural innovations. It is also important that the solutions developed do not lead to a significant degradation in performance. In this study, we primarily focus on leakage energy dissipation in high-performance microprocessors and develop a power aware design that reduces the leakage energy.
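The two power components discussed above can be captured in a small numerical sketch. The following Python snippet is illustrative only: the constants are made up, and the leakage function uses a simplified textbook exponential model rather than the thesis's measured parameters. It shows dynamic power following P = C·V²gain·f and sub-threshold leakage growing exponentially as VT shrinks.

```python
import math

def dynamic_power(c_load, v_gain, f, activity=1.0):
    """Switching power: P = activity * C * Vgain^2 * f (Section 1.1 formula)."""
    return activity * c_load * v_gain * v_gain * f

def subthreshold_leakage(i0, v_t, n=1.5, v_thermal=0.026):
    """Simplified sub-threshold leakage model:
    I_leak ~ I0 * exp(-V_T / (n * v_thermal)).
    Leakage grows exponentially as the threshold voltage V_T is scaled down."""
    return i0 * math.exp(-v_t / (n * v_thermal))

# 1 pF load, 1.0 V swing, 1 GHz clock -> 1 mW of switching power.
p_dyn = dynamic_power(c_load=1e-12, v_gain=1.0, f=1e9)

# Halving V_T (hypothetical values) increases leakage exponentially.
assert subthreshold_leakage(1e-6, v_t=0.2) > subthreshold_leakage(1e-6, v_t=0.4)
```

This is only meant to make the scaling trend concrete: doubling the frequency doubles dynamic power, while a linear reduction in VT multiplies leakage.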
1.2 Superscalar Processors and ILP
A scalar processor processes one data item at a time, whereas in a vector processor a single instruction operates simultaneously on multiple data items. A superscalar processor combines aspects of both: each instruction processes one data item, but multiple processing units allow multiple instructions to process separate data items at the same time. A superscalar processor has multiple functional units of the same type, along with additional circuitry to support issuing instructions to the units. The issue/scheduling unit reads instructions from memory and decides which ones can be run in parallel, dispatching them to the units.

Instruction Level Parallelism (ILP) is a measure of the number of instructions in a straight-line piece of code that can be performed simultaneously. To exploit ILP, the instructions that can be executed in parallel must be determined. Two instructions can be executed in parallel if they can execute simultaneously in a pipeline without causing stalls, provided they have sufficient resources at their disposal. The decision to issue multiple instructions at the same time is based on determining whether one instruction depends on another. The types of dependences that can exist between two instructions are data, name, and control dependences.

1. Data Dependence: A data dependence occurs when an instruction depends on the result of an earlier instruction. If two instructions are data dependent, they cannot be executed simultaneously, as the second instruction has to wait until the result of the first instruction is
produced. If both were executed in parallel, the second instruction might read an earlier value of the operand.

2. Name Dependence: A name dependence occurs when two instructions use the same register or memory location, referred to as a name, but there is no flow of data between them associated with that name. There are two types of name dependences between two instructions:

a. An anti-dependence occurs when an instruction requires the value of an operand that is later updated. The original ordering must be preserved to ensure that the first instruction reads the correct value.

b. An output dependence occurs when both instructions write the same register or memory location, so their ordering affects the final value of the operand. The ordering between the instructions must be preserved so that the final value corresponds to the one written by the second instruction.

3. Control Dependence: A control dependence occurs when a branch instruction determines which instruction executes next, which is known only from the outcome of the branch's execution. The next instruction to be executed is either the next sequential instruction (if the branch is not taken) or the instruction specified by the branch (if the branch is taken).

A hazard occurs whenever there is a dependence between instructions and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. Due to the dependence, program order must be preserved, that is, the order in which the instructions would execute if executed sequentially one at a time. Data hazards can be classified into the following three types based on the order of read and write accesses in the instructions.
1. Read after Write (RAW): An operand is read just after it is modified. If the first instruction has not finished writing the operand, the second instruction will read incorrect data.

2. Write after Read (WAR): An operand is written right after it is read. If the write finishes before the read, the reading instruction will incorrectly get the newly written value.

3. Write after Write (WAW): Two instructions write to the same operand. If the second instruction finishes first, the operand is left with an incorrect value.

Hazards can be avoided in certain cases, if not completely eliminated. There are various software and hardware techniques to avoid hazards and exploit parallelism by preserving program order only where it affects the outcome of the program. Among data hazards, WAR and WAW can be avoided by techniques such as register renaming, but RAW cannot be avoided: the pipeline must stall because an instruction has to read the operand modified by the earlier instruction. Control hazards can sometimes be avoided by speculating on whether the branch will be taken, using various branch prediction techniques. As these hazards cannot be eliminated completely, maximum ILP is not always achieved and the utilization rates of the various processor hardware components are below 100%, leading to idle periods in them.
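The three data-hazard classes above follow mechanically from the read and write sets of two in-order instructions. The sketch below uses a hypothetical instruction representation (dicts of register-name sets, not any real ISA encoding) to show the classification rule.

```python
def classify_hazards(first, second):
    """Classify data hazards between two instructions in program order.
    Each instruction is a dict with 'reads' and 'writes' register sets
    (an illustrative representation, not the thesis's actual IR)."""
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")   # true dependence: must stall or forward
    if first["reads"] & second["writes"]:
        hazards.add("WAR")   # anti-dependence: removable by renaming
    if first["writes"] & second["writes"]:
        hazards.add("WAW")   # output dependence: removable by renaming
    return hazards

# add r3, r1, r2  followed by  sub r4, r3, r1  -> RAW on r3
i1 = {"reads": {"r1", "r2"}, "writes": {"r3"}}
i2 = {"reads": {"r3", "r1"}, "writes": {"r4"}}
assert classify_hazards(i1, i2) == {"RAW"}
```

As the text notes, a renaming pass could eliminate any WAR/WAW results of this classification, while RAW dependences constrain issue order regardless.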
1.3 Functional Unit Power Consumption
Contemporary high-performance microprocessors are some of the most highly integrated, performance-driven state-of-the-art chips designed today. They require an extremely large number of transistors and high clock rates to meet their performance requirements, which leads to significant static and dynamic power dissipation. Current Intel processors have around 200 million transistors. Of the total processor power consumption, functional units contribute about 20% [17]. Dynamic power of the FUs can be reduced by using clock gating
[14, 11]. The other major static power dissipating units of a microprocessor are the cache memories, and techniques have been proposed to reduce their leakage [16, 17, 9], which leaves the FUs as the major static power dissipating components. The power consumption of the functional units at different technology parameters is shown in Table 1.1, taken from [7]. It can be seen that static power dissipation increases non-linearly with decreasing transistor sizes.
Table 1.1: Static power dissipation by functional units [7]

Simulation results have shown that functional unit utilization rates are typically low, with idle periods characterizing their use [5]. Superscalar processors achieve their high processing speeds by dynamically detecting and exploiting Instruction Level Parallelism (ILP), subsequently executing instructions in parallel on multiple functional units. As maximum ILP is rarely achieved, as discussed in Section 1.2, not all available functional units are fully utilized, leading to idle periods. Also, the applications being run may not utilize the functional units completely. For example, an application that is integer-intensive, with only a few floating-point operations, will underutilize the floating-point functional units. During these idle periods static power is dissipated. Our work is aimed at reducing the static power consumption of functional units during these idle periods by turning the power OFF. While turning the FUs OFF results in static energy savings, a performance penalty may be incurred when a unit is turned OFF but is needed for instruction execution. Additionally, a certain amount of dynamic energy is consumed to turn the FUs ON. Hence, it is important that turning OFF a FU does not affect
performance and that the static energy saved by turning it OFF is greater than the dynamic energy incurred to power cycle it. In our implementation, if the idle period of a functional unit is so short that the energy required to cycle from ON to OFF to ON (a power cycle) exceeds the energy consumed by leaving it ON during the idle period, the unit is left ON. We use an annotated Control Flow Graph (CFG) extracted from an application to detect the idle periods. These idle periods are then translated into compiler-inserted instructions that turn FUs OFF/ON, while hardware enables the actual OFF/ON operation. The advantages of using such a hardware-software approach in lieu of a purely hardware-based approach are discussed below. Although Energy = Power × Time, we use the terms power and energy interchangeably to suit the context.
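The gating decision just described reduces to a break-even comparison per idle period. The following sketch uses made-up energy units (per-cycle leakage and a single power-cycling overhead constant), not the thesis's measured values, to show the decision rule and the resulting break-even idle length.

```python
import math

def should_power_gate(idle_cycles, p_leak, e_overhead):
    """Gate the FU only if the leakage energy saved over the idle period
    exceeds the ON->OFF->ON power-cycling overhead.
    p_leak: leakage energy per idle cycle; e_overhead: cycling energy.
    Both are illustrative parameters, not measured thesis numbers."""
    return idle_cycles * p_leak > e_overhead

def breakeven_cycles(p_leak, e_overhead):
    """Shortest idle period for which power gating pays off."""
    return math.ceil(e_overhead / p_leak)

# With 1 unit of leakage per cycle and a cycling overhead of 20 units,
# idle periods shorter than 20 cycles are deliberately left ON.
assert not should_power_gate(idle_cycles=5, p_leak=1.0, e_overhead=20.0)
assert should_power_gate(idle_cycles=50, p_leak=1.0, e_overhead=20.0)
```

This is the comparison the compiler performs off-line for each profiled idle period; Section 3.2.1 develops the actual break-even analysis.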
1.4 Combination of Compiler and Hardware Techniques
Most of the early techniques to reduce static microprocessor power were hardware based [16, 17, 9]: once an idle period on a microprocessor component such as memory starts, a hardware counter counts the number of cycles, and if the counter reaches a set threshold value the component is turned OFF. The disadvantages of such a technique are that (1) the device may need to be accessed immediately after it is turned OFF; (2) energy savings are lost while the counter waits to reach the threshold value; and (3) the hardware counter consumes additional power. To overcome these disadvantages, we use a combination of compiler and hardware techniques. Instead of depending solely on hardware counters, the compiler produces information that is used to issue the OFF/ON directives. The advantages of this approach, which overcome the disadvantages of the hardware-based techniques described above, are given below.

Identifying Idle Regions Off-line: In order to turn the FUs OFF, their idle periods need to be detected. Primitive hardware techniques for power gating devices are based on keeping track of the idle period, as discussed above. In our technique, the compiler
examines all of the code off-line and identifies suitable regions for turning the FUs OFF. Furthermore, the compiler also identifies the type of FUs and determines the number of FUs that can be turned OFF without degrading performance or increasing total power consumption.

Ability to Hide Latency: When turning a FU OFF prior to entering an idle period, it must be ensured that all pending instructions have committed. Similarly, when turning a FU ON upon exiting an idle period, it must be turned ON sufficiently ahead of the instructions accessing it, as there is a latency for FU turn-ON. A hardware-based technique, driven only by counters, cannot handle these cases.

Variable-Length Idle Periods: Idle periods in a FU can be of variable length. If the idle period is long, turning the FU OFF saves power. But if the idle period is too short, the FU will have to be turned ON as soon as it is turned OFF. If the latter situation occurs frequently, there is little or no power savings while additional dynamic power is dissipated during power cycling of the FU. Hence, turning a FU OFF for a short idle period can lead to more overall power consumption than leaving it ON. In compiler-directed FU shutdown, the FU is turned OFF only if the compiler is sure that turning it OFF will save power, which nullifies the effect of very short idle periods.
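The contrast between a counter-based hardware policy and the compiler-directed policy can be made concrete with a toy accounting model. The sketch below is an illustration under assumed parameters (unit leakage per cycle, hypothetical idle-period lengths), not a simulation of either real mechanism: the counter policy leaks for `threshold` cycles of every idle period before gating, while the compiler policy gates profitable periods from their first cycle and deliberately leaves short ones ON.

```python
def counter_policy_leakage(idle_periods, threshold, p_leak):
    """Hardware-counter policy: the unit leaks while the counter counts
    up to `threshold` in every idle period (disadvantage (2) above)."""
    leak = 0.0
    for idle in idle_periods:
        leak += min(idle, threshold) * p_leak
    return leak

def compiler_policy_leakage(idle_periods, breakeven, p_leak):
    """Compiler-directed policy: idle periods known (from the profile) to
    be >= breakeven are gated immediately; shorter ones are left ON."""
    leak = 0.0
    for idle in idle_periods:
        if idle < breakeven:
            leak += idle * p_leak   # left ON on purpose
    return leak

# Hypothetical mix of short and long idle periods (in cycles).
periods = [3, 40, 100, 7]
assert compiler_policy_leakage(periods, breakeven=10, p_leak=1.0) < \
       counter_policy_leakage(periods, threshold=10, p_leak=1.0)
```

The model ignores turn-ON latency and cycling overhead; it only illustrates why off-line knowledge of idle-period lengths removes the counter's warm-up leakage.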
In this work, we identify FU idle periods and propose architectural techniques to reduce static power consumption during those periods by comparing the dynamic and static power consumed by leaving the functional units ON versus turning them OFF. The rest of the thesis is organized as follows: Chapter 2 presents prior work on reducing the dynamic and static power of a microprocessor in general and of the functional units in particular. Chapter 3 proposes our Compiler-Directed FU Shutdown methodology and Chapter 4 describes our experimental platform. In Chapter 5 we present the effectiveness of our technique in reducing static power, and we conclude in Chapter 6.
2. Related Work
The first efforts to reduce microprocessor power consumption focused on reducing dynamic power dissipation [14, 8, 15]. There are a number of power-aware architecture designs, many of which focus on reducing the power of various microarchitectural components [16, 17, 18, 9]. Here we focus mainly on FU power reduction techniques [6, 10, 24]. Approaches for reducing the dynamic power dissipated by functional units during idle periods through clock gating have been described in [5, 8, 13]. With decreasing transistor feature sizes, static (leakage) power dissipation is a major contributor to total power dissipation. In this chapter we review recent related work on dynamic and static FU power reduction techniques.
2.1 Clock Gating
One of the first efforts to reduce the power dissipation of functional units introduced the Clock Gating technique [14, 15, 8]. Clock gating is implemented in synchronous circuits to disable portions of a circuit when they are not actively performing computation, thereby reducing the dynamic power dissipation of the gated portions. The clock network in a microprocessor connects the clock to sequential elements such as flip-flops, latches, and dynamic logic gates, which are used in high-performance execution units and in array address decoders in cache memories. At a high level, gating the clock to a latch or a logic gate by ANDing the clock with a control signal prevents the unnecessary charging/discharging of the circuit's capacitances when the circuit is idle, saving the circuit's clock power. Initially, clock gating was applied to a functional unit only when none of the functional unit's stages were active. Clock gating techniques have since improved on this limitation, gaining the ability to disable individual stages of a functional unit that are not active [23].
One of the first attempts at clock gating was [14], which states that clock power is usually around 30-35% of total microprocessor power. Clock power is a major component of microprocessor power mainly because the clock is fed to most of the circuit blocks in the processor and switches every cycle. However, effective clock gating requires a methodology to determine which circuits are gated, when, and for how long. Clock gating schemes that either (1) result in frequent toggling of the clock-gated circuit between enabled and disabled states, or (2) apply clock gating to blocks so small that the clock-gating control circuitry is as large as the block itself, incur large overhead. This overhead may result in power dissipation higher than that without clock gating.
Pipeline balancing (PLB) is a technique that essentially outlines a predictive clock gating methodology [15]. PLB exploits the inherent variation of instruction level parallelism (ILP) within a program. It uses past program behavior and characteristics such as issue IPC to predict a program's ILP at the granularity of a 256-cycle window. If the degree of ILP in the next window is predicted to be lower than the width of the pipeline, PLB clock gates a cluster of pipeline components during that window, including not just the datapath but all associated control logic and clocks. Using a simulator based on an extension of the Alpha processor, [15] presents component and full-chip power and energy savings for single- and multi-threaded execution. Results show issue queue and execution unit power reductions of up to 23% and 13%, respectively, with an average performance loss of 1% to 2% on SPEC95 benchmarks.
In contrast to PLB’s predictive methodology (as it uses past program behavior and predicts ILP), [8] proposes a deterministic methodology called Deterministic Clock Gating (DCG). DCG is based on the key observation that for many of the stages in a modern pipeline, a circuit block’s usage in a specific cycle in the near future is deterministically known a few cycles
ahead of time. DCG exploits this advance knowledge to clock gate the unused blocks. In an out-of-order pipeline, whether these blocks will be used is known at the end of issue, based on the instructions issued. The execution units, pipeline latches of back-end stages after issue, L1 D-cache wordline decoders, and result bus drivers are clock gated. There is at least one cycle of register-read stage between issue and the stages using the execution units, D-cache wordline decoders, result bus drivers, and back-end pipeline latches. DCG exploits this one-cycle advance knowledge to clock gate the unused blocks without impacting the clock speed.
DCG's deterministic methodology has three key advantages over PLB's predictive methodology: (1) PLB's ILP prediction is not 100% accurate; if the predicted ILP is lower than the actual ILP, PLB ends up clock-gating useful blocks and incurs a performance loss, and vice versa. DCG guarantees no performance loss and no lost opportunity for blocks whose usage can be known in advance. (2) DCG clock gates at finer circuit and time granularities than PLB. (3) While PLB's prediction heuristics have to be fine-tuned, DCG uses no extra heuristics and is significantly simpler. Experimental results show an average 19.9% reduction in dynamic processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor when DCG is applied to execution units, pipeline latches, D-cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at a 2.9% performance loss. Clock gating techniques reduce processor power consumption by reducing the dynamic power of the functional units; in our work, we focus on static power.
2.2 Static Power Reduction Techniques
With decreasing transistor feature sizes, the static power consumption of high performance microprocessors is increasing exponentially [12]. Techniques have been proposed to reduce the static power consumption of various microprocessor components, including Level 1
and Level 2 Caches [16, 17, 18, 9] and Functional/Execution Units [10, 6, 19]. Reducing the static power consumption of these microarchitectural components requires additional hardware support, which is discussed below.
2.2.1 Hardware Techniques to Reduce Static Power
To reduce static leakage power, microarchitectural components must be put into a low leakage state. A component can be put into a low leakage state by (1) reducing the supply voltage of its transistors, (2) increasing the threshold voltage of its transistors, or (3) changing the gate inputs. The first technique is called Power Gating [4, 9], the second Dual Threshold Voltage (Vt) [20, 26], and the third Input Vector Control [25, 28].
2.2.1.1 Power Gating
In power gating, the power supply to the appropriate microarchitectural component is reduced or shut off during idle periods. Leakage power decreases as the supply voltage is reduced. While lowering the supply voltage to zero eliminates static power entirely, it also destroys the information stored in the transistors of cache memories. Hence, most proposed designs for reducing the leakage power of cache memories [9] put them into a low leakage state that retains the stored data by using Dynamic Voltage Scaling (DVS). For example, in 70nm transistor technology the nominal supply voltage is 1.0V; by using DVS the supply voltage can be reduced to 0.3V while retaining the stored data [9]. The power-gating approach achieves ultra low leakage power because the device is completely shut off. Sleep transistors are inserted into a logic gate to control the power supply to the gates of the transistors, as shown in Figure 2.1 [16] for an SRAM cell. When the signal LowVolt goes high, transistor P2 is switched ON and P1 OFF, and the supply voltage becomes VDDLow (0.3V).
Figure 2.1: Power Gating Implementation [16]

2.2.1.2 Dual Threshold Voltage
A microarchitectural component can also be put into a low leakage state by raising the threshold voltage (Vt) of its transistors during idle periods. The leakage power in this case does not go all the way to zero, as the transistor is still ON. Putting a transistor into a high Vt state decreases its leakage power but increases its latency. In [20], the problem of optimally assigning threshold voltages to transistors in a CMOS logic circuit is defined, and an efficient algorithm for its solution is given.
2.2.1.3 Gate Level Leakage Reduction (Input Vector Control)
In [25] a gate level leakage reduction technique is proposed for use during the logic design of CMOS circuits that already employ clock gating to reduce dynamic power. The original logic design of a multi-gate logic circuit is modified, using minimal additional circuitry, to force the combinational logic into a low leakage state during an idle period. Based on a library of gates characterized for leakage current, a low leakage input vector is determined by sampling random vectors. Parts of a circuit that are disabled by clock gating still dissipate leakage power; when a circuit is clock gated, its internal state is therefore set to this low leakage state for the duration of the idle period. Leakage power reductions of up to 54% have been achieved on the ISCAS-89 benchmark circuits.
In [27] the three techniques discussed above are compared in terms of their limits and benefits: leakage reduction potential, performance penalty, and area and power overhead. Power gating achieves the maximum possible leakage reduction, but at the cost of large overheads. Dual Vt has the lowest leakage energy savings but retains the internal state. Input vector control (IVC) performs better than dual Vt but changes the internal state, since the gate inputs are altered. Also, IVC can be applied only to circuits that are clock gated and equipped with front-end latches.
The approaches described in Sections 2.2.1.1 and 2.2.1.2 involve a turn-on latency: when the device is turned back ON it cannot be used immediately, as some time is needed for the circuitry to return to its operating condition. For the Dual Vt technique to have a noteworthy impact on leakage energy, Vt must be increased significantly; because current transistor feature sizes have a low Vt, the latency of implementing Dual Vt is high. Also, power gating can put a FU into an ultra low leakage state by gating the supply all the way to zero, whereas dual Vt cannot. Hence, in this work we assume that power gating is employed for FU shutdown.
2.2.2 Compiler Directed Static Power Reduction
Most of the initial static power reduction techniques used hardware counters to monitor idle periods, transitioning a component to a low leakage mode only after a fixed period of inactivity and thereby wasting leakage energy during that waiting period. To address this problem, compiler-based approaches have been proposed that dynamically change the turn-OFF periods [21, 22]. In compiler-based techniques, profile information is first collected for a program and is then used to generate compiler hints during the program’s execution. The profile information is collected (1) statically, by examining the program code; (2) dynamically, by executing the program; or (3) from previous program behavior. In [22] a compiler-based approach is used to reduce the static power of the Level 1 Instruction Cache: the last use of instructions is identified and the corresponding cache lines that contain them are placed into a low leakage mode. This approach proved competitive in energy and energy-delay product with hardware-based leakage control techniques. Using compiler-directed techniques, [21] saves 6.4% more L1 data cache leakage energy than a purely hardware based approach. Hence, in our approach we use compiler directives to issue the FU shutdown instructions.
2.3 Functional Unit Static Power Reduction
In [6] the potential to power gate functional units is evaluated using parameterized analytical equations that estimate the break-even point of power gating. There is an overhead energy associated with power gating a functional unit. The break-even point is the point at which the aggregate leakage energy saved by turning the FU OFF equals the energy overhead of switching the header device used for power gating OFF and ON. So, to save energy by turning OFF a functional unit, its idle period must last at least the break-even number of cycles. In this study the authors assume a perfect predictor that can predict the idle intervals of the functional units with no delay, and they turn OFF a functional unit after detecting a series of idle cycles. They also propose turning OFF the functional units when a mispredicted branch is detected, since the units will be idle while the instructions currently in the pipeline are flushed. Their results show that floating-point units can be put to sleep for up to 28% of the execution cycles at a performance loss of 2%; using the branch-misprediction-guided technique, the fixed-point units can be put to sleep for up to 6% more of the execution cycles than with the previous approach. This approach is purely hardware based, using hardware counters to detect FU idle periods, whereas ours is compiler directed; the disadvantages of the hardware-only approach are described in Section 1.3. Also, this technique reports only the percentage of time the FUs can be put to sleep, with no reference to the percentage of energy saved. Since power cycling a FU consumes energy, shutting down the FUs for 28% of the time does not necessarily lead to power savings. In our approach, FU shutdown is done based on the power savings, and the results are reported in terms of power savings as well.
The work most similar to ours is presented in [10], where FU static power dissipation is optimized by power gating the units during idle periods. This is a compiler driven FU shutdown technique in which FU turn OFF/ON directives are based on compiler hints. Program regions with low ILP, and thus low functional unit demand, are detected; because the compiler can examine all of the code off-line, it can identify suitable regions for turning the FUs OFF. Large subgraphs representing control structures (e.g., loops), called power blocks, are identified in the control flow graph. These blocks are then classified into hot blocks, whose execution frequencies are greater than a certain threshold, and cold blocks, which are the remaining blocks in the program. The functional unit usage in each block is also analyzed to identify the units that are
expected to be idle in that block. OFF and ON directives are placed in cold blocks adjacent to the hot blocks in which the unit is expected to be idle; this information is communicated to the hardware through special OFF and ON directives. FU idle regions may vary greatly in duration, and for this strategy to work well the idle regions should be long, because turning a FU OFF and ON costs additional energy that must be outweighed by the energy saved while the FU is OFF. The longer the idle period, the greater the energy savings from turning the FU OFF. To filter out short idle periods, they turn OFF the FU only after a certain number of clock cycles have elapsed following the turn-OFF directive. Such a strategy loses energy by leaving the FU ON for a duration in which shutting it OFF would save power. Also, the FU idle periods are detected purely from utilization rates within a basic block, and the overhead energy required for FU power cycling is not considered. In our approach we quantify the energy consumed by the FUs and the overhead energy of power cycling, and use this information to detect short and long idle periods and to drive FU shutdown.
In [24] a compiler based technique for optimizing leakage energy consumption in VLIW functional units is proposed. A data-flow analysis detects idle functional units along the control flow graph paths, and a leakage control mechanism then inserts the FU turn OFF/ON directives. FU idleness is defined at the basic block level by detecting whether a FU will be used by any operation in a basic block; all the FUs of a type are turned ON even if only one operation needs that type. In our implementation, we turn ON only the required number of FUs in a basic block. Two leakage control mechanisms are evaluated: (1) power gating, and (2) input vector control [25]. Input vector control is a gate level leakage reduction technique that exploits the state dependence of the leakage current and sets the logic gate inputs to the values with the minimum leakage current when the units are idle. The input vector control mechanism led to about 45% savings in leakage power, while the power gating mechanism did not perform well due to the re-activation time of the FU. Compared to our implementation, the energy overhead
incurred to transition the FU into a low leakage state is not considered, and the savings are reported only in terms of the leakage energy saved, not the total energy.
3. FU Shutdown
In Chapter 2, various techniques for reducing static processor power through FU shutdown are discussed. These techniques are either purely hardware based, or they report results only as the percentage of time the FUs spend in the OFF state; none of them account for the overhead energy or the total energy consumed. In our implementation, the FU shutdown directives are issued by the compiler, and we consider both the extra energy (overhead) incurred to power cycle the FUs and the total energy saved by shutting down FUs during their idle periods.
To implement FU shutdown, our algorithm first generates a Control Flow Graph (CFG) from the static representation of the program code that is subsequently annotated with initial FU requirements for each program basic block (BB). Our algorithm then analyzes the tradeoff between leaving FUs ON and turning FUs OFF when not utilized by consecutive BBs. Through this analysis, long and short idle periods are detected and FU requirements are optimized for minimum energy consumption. The FU requirements are then translated into compiler-generated instructions that are used during program execution to physically switch OFF/ON the FUs by using the power gating mechanism described in Section 2.2.1.1. The generation and annotation of the CFG, the energy estimation, and the FU requirement optimization algorithm are described in the following sections.
3.1 CFG Generation
We first generate a CFG from the compiled static representation of the program code. A CFG is an abstract representation of a program, where each node in the graph represents a basic block, i.e. a straight-line piece of code; jump targets start a block, and jumps end a block. The assembly code of a program is fed to a functional simulator, which is a fast and less detailed
simulator with no time accounting, and the branch instructions are identified. For every branch instruction fetched, its target address(es) and the number of times a branch instruction transitions to each of its target address(es) are captured. The target of a branch instruction starts a basic block (BB) and the branch instruction marks the end of a basic block. This information is encoded in a CFG, where each node represents a BB. The transition probabilities between the nodes (pictured on arcs in Figure 3.2) are generated by looking at the number of times a branch instruction transitions to its respective target address(es). These are used to guide our FU requirement optimization process which is discussed in Section 3.3.
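As a concrete illustration, the branch-transition counting described above could be sketched as follows. This is a simplified sketch, not the thesis toolchain: the trace format, the PC values, and the function name are all hypothetical, and blocks are keyed by the PC of the branch that ends them.

```python
from collections import defaultdict

def build_cfg(branch_trace):
    """Build CFG transition probabilities from a branch trace: a list of
    (branch_pc, target_pc) pairs captured by a functional simulator.
    Each branch ends a basic block; each target starts one."""
    counts = defaultdict(lambda: defaultdict(int))
    for branch_pc, target_pc in branch_trace:
        # Count how often the block ending at branch_pc transitions
        # to the block starting at target_pc.
        counts[branch_pc][target_pc] += 1

    cfg = {}
    for block, targets in counts.items():
        total = sum(targets.values())
        # Normalize counts into the arc transition probabilities.
        cfg[block] = {t: n / total for t, n in targets.items()}
    return cfg

# A hypothetical branch at 0x400 is taken to 0x500 three times
# and falls through to 0x404 once.
trace = [(0x400, 0x500), (0x400, 0x500), (0x400, 0x500), (0x400, 0x404)]
cfg = build_cfg(trace)
# cfg[0x400] == {0x500: 0.75, 0x404: 0.25}
```

The resulting probabilities are exactly what is pictured on the arcs of Figure 3.2 and are later used to order the children during optimization.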
Next we identify the initial (pre-optimized) FU requirements of each node and embed this information in the CFG. To accomplish this, a second analysis of the static code determines the number and class of instructions in each BB, from which we extract initial FU requirements. Table 3.1 shows the FU requirements for the different instruction classes; based on the class of an instruction, the required FU can be identified. Apart from identifying the types of FUs required by a BB, we also need to determine the required number of FUs of each type.

INTEGER ALU    INTEGER MULTIPLY    FLOATING-POINT ALU    FLOATING-POINT MULTIPLY
IntALU         IntMult             FloatADD              FloatMult
Control        IntDiv              Float Compare         FloatDiv
Memory                             Float Convert         FloatSQRT
Table 3.1. Instruction Class FU Requirements
Knowing the number of instructions within a BB is not enough to accurately determine the number of FUs required for instruction execution: due to dependencies between BB instructions, the raw number of instructions of a class cannot be taken as the BB’s FU requirement. An accurate determination could only be made through dynamic analysis, which is undesirable due to the complexity and time associated with gathering this information. Therefore, we estimate basic block FU requirements based on a static read-after-write (RAW) dependence analysis of instructions. The instruction dependences within each BB
are represented in a tree structure, where nodes represent instructions and edges between nodes indicate dependence. Nodes at the same level of the tree contain independent instructions. The level (depth) of the tree containing the maximum number of instructions is used to estimate the FU requirement. Consider the code sequence for a basic block and its dependence tree shown in Figure 3.1. Here, instructions 3, 5 and 6 are dependent on instruction 1, while instructions 3, 4 and 5 are mutually independent. Therefore, the Integer (INT) Add unit requirement for this block is 3, the number of instructions in Level 2. If the maximum number of independent instructions of a particular class exceeds the number of FUs of that class, the FU requirement is capped at the maximum number of FUs, as currently defined by the simulator.
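The level-based estimation can be sketched as follows. The representation is an illustrative assumption, not the thesis implementation: each instruction is reduced to a (destination, sources, FU class) triple, and the register names and function name are hypothetical.

```python
def estimate_fu_requirements(block, max_fus):
    """Estimate per-class FU requirements of a basic block from a static
    RAW dependence analysis. An instruction's tree level is one more than
    the deepest producer it reads from; the requirement for a class is the
    largest number of its instructions at any single level, capped at the
    number of FUs available in that class."""
    level_of_dest = {}   # last writer of each register -> its tree level
    per_level = {}       # (level, fu_class) -> instruction count
    for dest, srcs, fu_class in block:
        level = 1 + max((level_of_dest.get(s, 0) for s in srcs), default=0)
        level_of_dest[dest] = level
        per_level[(level, fu_class)] = per_level.get((level, fu_class), 0) + 1

    req = {}
    for (level, fu_class), n in per_level.items():
        # Max over levels, then cap at the simulator's FU count.
        req[fu_class] = min(max(req.get(fu_class, 0), n), max_fus[fu_class])
    return req

# Two independent producers at level 1, then three independent
# consumers at level 2 -- so the INT Add requirement is 3.
block = [
    ("r1", [],     "IntALU"),
    ("r2", [],     "IntALU"),
    ("r3", ["r1"], "IntALU"),
    ("r4", ["r2"], "IntALU"),
    ("r5", ["r1"], "IntALU"),
]
print(estimate_fu_requirements(block, {"IntALU": 4}))  # {'IntALU': 3}
```

With only 2 IntALU units configured, the same block's estimate would be capped at 2.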
Figure 3.1: BB Instruction Dependence Tree

Figure 3.2 shows a CFG where each node represents a BB and the arcs represent transitions between BBs. Each node is annotated with its FU requirements and each arc with its transition probability. A FU requirement of [a, b, c, d] indicates that the block requires a INT Add units, b INT Multiply units, c Floating Point (FP) Add units, and d FP Multiply units for execution. When control flows from block 1 to block 2, block 2 requires fewer FP FUs and one more INT FU than block 1. Therefore, three FP Adders and one FP Multiply unit can be turned OFF and one INT Multiply FU is turned ON during the execution of block 2. Since block 5 has the same requirement as block 2, no additional OFF/ON operations are required when transitioning from block 2 to block 5. Similarly, when control flows from block 4 to 6, the FP Multiply unit is left ON, while one unit each of INT Add, INT Multiply and FP Add are
switched ON. During our optimization process (Section 3.3), we traverse a node’s children in decreasing order of their transition probability magnitude. For example, the order of traversing the children of BB1 is 4, 3, 2.
Figure 3.2: CFG with FU Requirements

To understand the accuracy of our FU requirement estimation, we captured the actual FU usage from a dynamic execution of each benchmark. During the dynamic execution, we capture the maximum number and type of FUs used in each cycle of every BB. These requirements are BB accurate but not cycle accurate, as the actual usage of a BB is set to the maximum number of FUs used in any cycle during the BB’s execution. For example, if a BB has three INT Add instructions, of which two are executed in parallel followed by the third, the actual usage of this BB is set to 2 INT Add units.
Table 3.2 shows the average difference per basic block between our estimation and the actual FU usage for each FU class, and Table 3.3 shows the total number of FUs used in each class. The difference is weighted by the number of times the basic block appears in the benchmark. On average over all the benchmarks, our estimate is 0.84 units off per BB from the actual usage for INT Add units, and 0.11 units or less off per BB for the remaining unit types. The larger error for INT Add units could be due to
the high utilization of the INT Add units, which we under- or over-estimate. There is little difference for the other FUs, as their utilization rates are lower. In our configuration, we have 4 INT Add, 4 FP Add, 1 INT Multiply and 1 FP Multiply FUs.

Benchmark   FP ADD      FP MUL      INT ADD     INT MUL
art         0.168701    0.028857    0.895671    0.016648
eon         0.416667    0.03316     0.635236    0.010669
facerec     0.078638    0.010093    0.834735    0.015559
fma3d       0.053744    0.007489    0.737004    0.005066
gzip        0.008555    0.001901    0.804183    0.003802
mcf         0.013449    0.005764    0.985591    0.013449
mesa        0.134649    0.01567     0.916425    0.017411
swim        0.078209    0.011327    0.866235    0.018878
vortex      0.013162    0.001595    0.94466     0.005085
vpr         0.103314    0.019006    0.79922     0.009747
Average     0.106909    0.013486    0.841896    0.011631
Table 3.2. Average Difference in FU Requirements Estimation and Actual Usage
Benchmark   FP ADD        FP MUL        INT ADD        INT MUL
art         3919587391    2689207531    21228837988    307327
eon         3713080475    567995188     21000413699    63407706
facerec     13683497983   2450299717    39900565513    727551890
fma3d       9694908918    2992133452    38176021408    401983868
gzip        28            4             15431581015    560560
mcf         616951        16            36890004871    3446248
mesa        12545262675   2283794423    85334709953    948536882
swim        11416314764   1787411090    1173483769     5834419
vortex      127882464     27484515      60280421279    19888605
vpr         108204917     19780873      1349389262     231767
Table 3.3. Total Number of FUs Used
3.2 Energy Estimation
Our FU shutdown technique computes the total energy consumed by BB instructions as the sum of the dynamic and static (leakage) energy dissipated in every execution cycle by the individual FUs, plus the overhead energy associated with power cycling a FU when necessary. Note that this energy is an estimate, since the FU requirements are estimated from a static dependence analysis (and are not cycle accurate), and the number of cycles that FUs are OFF/ON is
estimated based on the average IPC (instructions committed per cycle) obtained from application execution profiling rather than the average IPC per instruction class. Wattch computes the power consumed by a FU (excluding overhead power), PFU, as:

PFU = PONused + PONunused    (1)

or

PFU = NONused * DFU + NONunused * DFU * LF    (2)
    = Dynamic Power + Static Power

where DFU is the dynamic or instantaneous power consumed by a FU; NONused is the number of FUs that are ON and in use; NONunused is the number of FUs that are ON but not in use; and LF is the Leak Factor, the ratio of static power consumption to total power. In Eqn 2, Wattch [2], the simulator used in this work, assumes that FUs that are ON but not in use consume only their leakage power, and that units which are OFF consume no power [14]. The values for INT and FP FU dynamic power and cycle time are taken directly from Wattch and are shown in Table 4.2. For current high performance microprocessors at the 65nm transistor technology node, the static/dynamic power ratio is about 2/3 [11], which corresponds to a static/(static + dynamic) ratio of 0.4. Hence, we assume:

Leak Factor (LF) = 0.4 (40% of total power)    (3)
Power cycling the FUs incurs an energy overhead, discussed in more detail in Section 3.2.1. The compiler-generated instructions (see Section 3.4) that work with the hardware to physically turn FUs OFF/ON also incur energy overhead for their execution. Therefore, these instructions must be counted and their energy (EOH) included in the total energy, which is computed as:

Etotal = PFU * time + EOH    (4)

where time = clock cycles * clock cycle time.
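The power and total-energy computations of Eqns 2 and 4 can be written directly as small functions. The numeric values below are placeholders for illustration only; Wattch's actual per-FU power and cycle-time values appear in Table 4.2.

```python
def fu_power(n_on_used, n_on_unused, d_fu, leak_factor=0.4):
    """Eqn 2: FUs that are ON and in use dissipate full dynamic power
    D_FU; FUs that are ON but idle dissipate only leakage, D_FU * LF.
    OFF units consume no power."""
    return n_on_used * d_fu + n_on_unused * d_fu * leak_factor

def total_energy(p_fu, cycles, cycle_time, e_oh):
    """Eqn 4: FU power integrated over the execution interval, plus the
    overhead energy EOH of the power-cycling instructions."""
    return p_fu * cycles * cycle_time + e_oh

# Hypothetical values: 2 busy units and 1 idle-but-ON unit at
# D_FU = 1.0 W give P_FU = 2.0 + 0.4 = 2.4 W.
p = fu_power(2, 1, 1.0)          # 2.4
e = total_energy(p, 10, 0.5, 0.1)  # 2.4 * 5 + 0.1 = 12.1
```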
3.2.1 Overhead Energy and BreakEven Cycles
Power cycling FUs incurs energy overhead, since power gating requires a header circuit (Section 2.2.1.1) to perform the physical switching. There is a time and an energy cost associated with turning this circuit ON (EOHon) and OFF (EOHoff), where the total energy overhead is EOH = EOHon + EOHoff; both are proportional to the size and capacitance of the FU. When we turn a FU OFF, the leakage energy saved per cycle increases as the supply voltage, VDD, gradually decreases. Conversely, when a FU is turned ON, the leakage energy savings decrease as VDD is charged back up. The breakeven point is the point at which the aggregate leakage energy savings equal the total switching overhead energy, ESAVEDaggregate = EOHon + EOHoff. This is illustrated in Figure 3.3, taken from [6]. At T1 the power gating circuit decides to power-gate the unit, and overhead energy is incurred until the OFF signal is delivered to the gate of the header device at T2, at which point the supply voltage starts to drop and leakage energy savings begin. At T3 the aggregate leakage energy savings equal the energy overhead of switching the device OFF and ON. At T4 the supply voltage saturates at 0 and the unit is completely OFF, with no leakage power dissipation. At T5 a signal to turn ON the unit is asserted, incurring overhead energy; the device starts to turn ON at T6 and is fully ON by T7. During T6 – T7, as the supply voltage is charged back up, the leakage energy savings per cycle gradually decrease to zero.
Since specific, often proprietary, information about individual FUs is required to precisely determine the breakeven point, we assume a BreakEven value of 20 cycles based on the work in [6]. It is shown in [6] that the BreakEven point is close to 10 clock cycles for transistor technologies in which static power accounts for about 33% of the total power, while our FU energy consumption values are based on those used by Wattch, where static power accounts for about 10%. We performed a sensitivity analysis relating the aggregate energy saved to the BreakEven cycles (Section 5.3): a value of 10 results in a large amount of FU power cycling, a value of 30 leads to very little power cycling, and a value of 20 shows appropriate power cycling activity. From equation (5), EOH is directly proportional to the BreakEven cycles. A BreakEven value of 20 cycles means that a FU must be powered OFF for more than 20 cycles for the aggregate leakage energy savings to exceed the total energy overhead; conversely, if a FU is powered OFF for fewer than 20 cycles, the overhead energy of power cycling exceeds the aggregate leakage energy saved, and total energy is minimized by leaving the FU ON during that period. We use the same BreakEven cycles for INT and FP units. Based on the FU power consumption values from Wattch (Table 4.2), the FP FU dynamic power per cycle is three times that of the INT FU, reflecting the FP FU’s greater complexity and capacitance. Therefore, although we use the same BreakEven value for INT and FP FUs, the difference in the energy consumed to turn them OFF is accounted for by their respective dynamic power values.
Figure 3.3: Illustration of Breakeven Point [6]

The energy overhead attributed to FU power cycling can be expressed in terms of leakage energy and BreakEven cycles, as:
EOH = PFU * LF * BE cycles * Cycle time    (5)

where BE cycles is the BreakEven cycles and Cycle time is the clock cycle time (Table 4.2); the other variables are defined in Eqn 2. When we use the concept of BreakEven cycles, we consider the overhead energy as an aggregate (EOHon + EOHoff) rather than as individual turn-OFF and turn-ON overheads. Hence, we assume that EOHon = EOHoff and compute the total overhead energy as:

Total EOH = Nturnon * EOHon + Nturnoff * EOHoff    (6)
          = (Nturnon + Nturnoff) * (EOH / 2)
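Eqns 5 and 6 and the break-even rule translate directly into code. This is a sketch with illustrative numeric values, not the thesis implementation.

```python
def overhead_energy(p_fu, leak_factor, be_cycles, cycle_time):
    """Eqn 5: aggregate overhead of one OFF+ON power cycle, expressed as
    the leakage energy that would be dissipated over BE cycles."""
    return p_fu * leak_factor * be_cycles * cycle_time

def worth_gating(idle_cycles, be_cycles=20):
    """A FU should be gated OFF only if its idle period exceeds the
    break-even point; otherwise the cycling overhead outweighs the
    leakage energy saved."""
    return idle_cycles > be_cycles

def total_overhead(n_turn_on, n_turn_off, e_oh):
    """Eqn 6, with the assumption EOHon = EOHoff = EOH / 2."""
    return (n_turn_on + n_turn_off) * (e_oh / 2)

# Hypothetical unit: P_FU = 1.0 W, LF = 0.4, BE = 20 cycles of 1.0 time
# units each -> one power cycle costs 8.0 energy units.
e_oh = overhead_energy(1.0, 0.4, 20, 1.0)   # 8.0
ok = worth_gating(25)                        # True: 25 > 20 cycles idle
cost = total_overhead(3, 3, e_oh)            # 6 switch events * 4.0 = 24.0
```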
3.3 FU Requirement Optimization
To maximize energy savings, our algorithm optimizes basic block functional unit requirements. The optimization depends on accurately detecting short FU idle periods, where the energy overhead of power cycling idle FUs is greater than the aggregate energy saved. Short idle periods occur as shown in Figure 3.4. The FU requirement of basic block 1 (BB1) is 2 INT Add, 1 INT Multiply, 1 FP Add and 1 FP Multiply units; BB2 requires only 2 INT Adders for its execution; BB3 has the same FU requirements as BB1. Assume that BB2 executes for 6 cycles (fewer than the BE cycles). In this example, the overhead energy required to power cycle the FUs (INT Multiply, FP Add, and FP Multiply) OFF for the execution of BB2 and back ON for BB3 is greater than the static leakage energy saved by turning these FUs OFF during BB2’s execution, since a FU must stay OFF for at least the BE cycles to break even. Our algorithm detects these cases in an application’s CFG, determines and compares the total energy dissipation of various FU configurations, and sets the basic block FU requirements to consume minimal energy. This may result in FUs remaining ON during short idle periods, especially when the idle FUs are FP units, since their switching overhead can be large due to their size and complexity (i.e., large capacitance). In Figure 3.4 BB2’s FU
configuration may be set to [2, 1, 1, 1] or [2, 1, 0, 0] rather than [2, 0, 0, 0] depending on the result of the energy analysis.
Figure 3.4: Short FU Idle Period
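The short-idle-period tradeoff above can be illustrated with a small sketch. The unit dynamic power and cycle time are placeholder assumptions; only the comparison between the fixed cycling overhead (Eqn 5) and the idle-length-dependent leakage savings matters.

```python
def keep_on_during_idle(d_fu, idle_cycles, be_cycles=20,
                        leak_factor=0.4, cycle_time=1.0):
    """Decide whether a FU that is idle for idle_cycles (e.g. during BB2,
    between two blocks that need it) should stay ON. The leakage saved by
    gating grows with the idle length, while the cycling overhead is
    fixed, so short idle periods favor leaving the unit ON."""
    e_saved = d_fu * leak_factor * idle_cycles * cycle_time
    e_overhead = d_fu * leak_factor * be_cycles * cycle_time  # Eqn 5
    return e_saved <= e_overhead

print(keep_on_during_idle(d_fu=1.0, idle_cycles=6))   # True: leave ON
print(keep_on_during_idle(d_fu=1.0, idle_cycles=40))  # False: gate OFF
```

With BB2's six-cycle execution from the example above, the function chooses to leave the FP units ON, matching the [2, 1, 1, 1] outcome of the energy analysis.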
3.3.1 Complexity of Optimization: Local and Global
Optimization based on an exhaustive search of the CFG yields a least-energy FU configuration (Figure 5.7). The exhaustive search tries all possible FU requirement combinations for each basic block and computes the total energy consumed by each combination using the energy equations of Section 3.2. The energy consumed changes with the energy overhead, EOH, for turning FUs OFF/ON and with the leakage energy spent if a FU is left ON (NONunused * DFU * LF * cycle-time). For each FU OFF/ON combination, we calculate the energy and check whether the present FU configuration consumes less total energy; if it does, this combination is set as the FU requirement of the corresponding node. For example, if the INT Add requirement for a sequence of nodes is (4-2-3-4) along one path and (4-3-4) along the other, as shown in Figure 3.5, the exhaustive search evaluates the energy consumption of all combinations of FU requirements. The search starts from node 1, evaluating all possible combinations on both paths, from (4-2-3-4) to (4-4-4-4) and from (4-3-4) to (4-4-4), and chooses the FU configuration that results in the least energy. The exhaustive search proceeds in decreasing order of transition probabilities; hence, nodes 1, 4 and 5 are optimized first, followed by nodes 1, 2, 3 and 5. Figures 3.6a and 3.6b
illustrate the various combinations tried for nodes 1, 4, 5 and 1, 2, 3, 5 respectively, wherein the node being worked on is enclosed by a dashed box.
Figure 3.5: CFG to be optimized
Figure 3.6a: Exhaustive search over nodes 1, 4 and 5
Figure 3.6b: Exhaustive Search on nodes 1, 2, 3, and 5
The complexity of this method is O(N^B), where N is the number of FUs and B is the number of basic blocks in the CFG. Because an exhaustive analysis of the CFG is computationally infeasible, we implement sub-optimal but computationally feasible solutions, the Local and Global Optimizers. Table 3.4 shows the number of basic blocks in each of the benchmarks we use.
Benchmark   # Basic Blocks
art         7722
eon         20387
facerec     28561
fma3d       49025
gzip        6772
mcf         5913
mesa        25550
swim        26972
vortex      27944
vpr         10639
Table 3.4. Number of basic blocks in each benchmark

3.3.2 Local Optimizer
This optimization is performed one node at a time. For example, if the actual INT Add unit requirement in sequential nodes is (4-2-3-2-4), the optimizer starts on node 2, since node 1 already requires the maximum number of available units, as shown in Figure 3.7. The optimizer sets the requirement of the second node to 3 and calculates the energy consumed. Setting the FU requirement to 3 reduces the overhead of switching a FU OFF from node 1 to node 2 and of switching a FU ON from node 2 to node 3, but consumes extra static energy to leave the unit ON. If the total energy consumed is less with the requirement set to 3 rather than 2, the requirement for node 2 is set to 3. We then optimize node 3 by setting its requirement to 4; if in the previous step the requirement of node 2 was set to 3, setting the requirement of node 3 to 4 would increase static energy. The optimizer then picks node 4, setting its requirement to 3, which
would reduce the overhead of switching a FU OFF from node 3 to node 4 and of switching a FU ON from node 4 to node 5. Setting the requirement of node 4 to 4 would have the same overhead cost as setting it to 3, but with an increase in static energy. As node 5 already requires the maximum number of units available, no optimization is done on it.
Figure 3.7: Local Optimization

Table 3.5 shows the total energy and each of its components for all INT FU configurations examined by the local optimizer, with the configuration with the least energy consumption marked with an asterisk. The values are obtained using the equations described in Sections 3.2 and 3.2.1, assuming an actual INT Add unit usage of (4-2-3-2-4). Figure 3.8 shows pseudo code for the local optimizer algorithm.

INT FU Config   EONused     EONunused   EOFF        EON         ETotal
4-2-3-2-4       2.325E-06   4.65E-07    4.278E-07   4.278E-07   3.6456E-06
4-3-3-2-4       2.325E-06   5.58E-07    2.852E-07   2.852E-07   3.4534E-06
4-3-4-2-4       2.325E-06   6.51E-07    4.278E-07   4.278E-07   3.8316E-06
4-3-3-3-4 *     2.325E-06   6.51E-07    1.426E-07   1.426E-07   3.2612E-06
4-3-3-4-4       2.325E-06   7.44E-07    1.426E-07   1.426E-07   3.3542E-06

Table 3.5. Local optimizer energy estimation (* = minimum-energy configuration)
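The component values in Table 3.5 follow a simple pattern that can be captured in a short model. The Python sketch below is an illustration, not the thesis's actual equations from Section 3.2: the constants, a per-switch energy of 1.426E-07 charged once as EOFF and once as EON, a per-unit static increment of 9.3E-08 for each extra unit left ON, and fixed EONused and baseline-idle terms for this example, are all inferred from the table.

```python
# Illustrative energy model for the running example (4-2-3-2-4).
# All constants are inferred from Table 3.5, not quoted from the thesis.
E_USED = 2.325e-6      # EONused: energy of busy units (fixed for this example)
E_IDLE_BASE = 4.65e-7  # EONunused when the config equals the actual usage
E_UNIT = 9.3e-8        # static energy of one extra unit left ON for one node
E_SWITCH = 1.426e-7    # cost per unit switched, charged as EOFF or as EON

ACTUAL = (4, 2, 3, 2, 4)  # actual INT Add requirement per node

def energy(config, actual=ACTUAL):
    """Total energy of a candidate per-node FU configuration."""
    # Extra static energy for units kept ON beyond the actual requirement.
    extra_static = sum(c - a for c, a in zip(config, actual)) * E_UNIT
    # Each unit switched OFF or ON between consecutive nodes costs E_SWITCH.
    switches = sum(abs(a - b) for a, b in zip(config, config[1:]))
    return E_USED + E_IDLE_BASE + extra_static + switches * E_SWITCH
```

Under these assumed constants the model reproduces every row of Table 3.5; for example, energy((4, 3, 3, 3, 4)) evaluates to about 3.2612E-06, the minimum-energy row.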
The complexity of this algorithm is O(N*B), where again N is the number of FUs and B is the number of basic blocks in the CFG. The primary advantage of the local optimizer is its relatively low complexity: computation/optimization time increases linearly with the number of basic blocks in the CFG. The main disadvantage is that, because it optimizes only one node at a time, higher-order combinations that may result in greater energy savings are never analyzed. Table 3.6 shows the time taken to locally optimize FU requirements for all of the benchmarks.
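The one-node-at-a-time pass can be sketched in a few lines. This is a hypothetical reconstruction: the energy model and its constants are inferred from Table 3.5 rather than taken from the thesis's equations, and the real optimizer of Figure 3.8 operates on CFG nodes rather than a flat sequence of blocks.

```python
# Assumed energy model, with constants inferred from Table 3.5.
E_USED, E_IDLE_BASE = 2.325e-6, 4.65e-7
E_UNIT, E_SWITCH = 9.3e-8, 1.426e-7

def energy(config, actual):
    extra_static = sum(c - a for c, a in zip(config, actual)) * E_UNIT
    switches = sum(abs(a - b) for a, b in zip(config, config[1:]))
    return E_USED + E_IDLE_BASE + extra_static + switches * E_SWITCH

def local_optimize(actual, max_units=4):
    """Visit one node at a time; raise its FU requirement whenever
    leaving extra units ON costs less than switching them OFF and ON."""
    config = list(actual)
    for i, a in enumerate(actual):
        if a == max_units:        # node already uses every unit: skip it
            continue
        candidates = []
        for cand in range(a, max_units + 1):
            config[i] = cand
            candidates.append((energy(config, actual), cand))
        config[i] = min(candidates)[1]   # keep the cheapest requirement
    return config
```

For the running example this pass yields (4-3-3-3-4), matching the minimum-energy row of Table 3.5.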
Benchmark   Time (minutes)
art          0.1
eon          3.65
facerec      1.2
fma3d        5.75
gzip         0.2
mcf          0.06
mesa         0.4
swim         0.32
vortex      17.75
vpr          0.6

Table 3.6. Time for local optimizer for all benchmarks
Figure 3.8: Local Optimizer Algorithm
3.3.3 Global Optimizer

To take advantage of higher-order optimizations in a computationally feasible way, we divide the CFG into smaller sub-CFGs of a specified depth and perform an exhaustive search for optimal FU requirements on each sub-CFG. The CFG of Figure 3.2 is shown with depth-one sub-CFGs in dashed boxes in Figure 3.9. The depth chosen for the sub-CFGs exhibits a trade-off between energy reduction and optimization time. Table 3.7 shows the percentage energy saved and the time taken to optimize locally and globally at depth 2 for some of the benchmarks. As the depth of optimization increases, the energy savings increase, with a corresponding increase in optimization time.

            % Energy Saved / Time (minutes)
Benchmark   Local          Global (depth 2)
art         1.364/0.1      1.365/0.3
gzip        15.9/0.2       16.1/0.4
mcf         18.56/0.4      18.7/0.4
vortex      11.9/17.75     14.8/547.5
vpr         4.3/0.5        4.7/1.5

Table 3.7. Energy saved versus depth of CFG optimization
Figure 3.9: Depth One sub-CFGs
The optimizer works as follows: if the INT Add unit requirement in sequential nodes is (4-2-3-2-4) and we assume a depth of 3, the optimization is performed over two sub-CFGs: (4-2-3-2) and (2-3-2-4). For the (4-2-3-2) sub-CFG, the FU requirements of these nodes are varied from (4-2-3-2) up to (4-4-4-4) to try all possible combinations, and the energy is computed for each combination as shown in Table 3.8. The FU configuration with the least energy consumption is chosen and marked with an asterisk. Comparing global with local optimization for this example, the FU configurations evaluated by the local algorithm, namely (4-3-3-2-4), (4-3-4-2-4), (4-3-3-3-4), and (4-3-3-4-4), are a subset of the combinations analyzed by the global optimizer, shown in the table. Figure 3.10 shows pseudo code for the global optimization algorithm.
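Under the same assumed energy model (constants inferred from Tables 3.5 and 3.8, not the thesis's equations), the exhaustive search over a short sequence can be sketched as follows; for real CFGs the thesis restricts this enumeration to sub-CFGs of fixed depth.

```python
from itertools import product

# Assumed energy model; constants inferred from Tables 3.5 and 3.8.
E_USED, E_IDLE_BASE = 2.325e-6, 4.65e-7
E_UNIT, E_SWITCH = 9.3e-8, 1.426e-7

def energy(config, actual):
    extra_static = sum(c - a for c, a in zip(config, actual)) * E_UNIT
    switches = sum(abs(a - b) for a, b in zip(config, config[1:]))
    return E_USED + E_IDLE_BASE + extra_static + switches * E_SWITCH

def global_optimize(actual, max_units=4):
    """Exhaustively try every per-node requirement between the actual
    usage and max_units, and return the minimum-energy configuration."""
    ranges = [range(a, max_units + 1) for a in actual]
    return min(product(*ranges), key=lambda cfg: energy(cfg, actual))
```

For the (4-2-3-2-4) example this enumeration selects (4-4-4-4-4), the configuration Table 3.8 identifies as cheapest; the exponential number of combinations per sub-CFG is what motivates limiting the search depth.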
INT FU Config   EONused     EONunused   EOFF        EON         ETotal
4-2-3-2-4       2.325E-06   4.65E-07    4.278E-07   4.278E-07   3.6456E-06
4-2-3-3-4       2.325E-06   5.58E-07    2.852E-07   2.852E-07   3.4534E-06
4-2-3-4-4       2.325E-06   6.51E-07    4.278E-07   4.278E-07   3.8316E-06
4-2-4-2-4       2.325E-06   5.58E-07    5.704E-07   5.704E-07   4.0238E-06
4-2-4-3-4       2.325E-06   6.51E-07    4.278E-07   4.278E-07   3.8316E-06
4-2-4-4-4       2.325E-06   7.44E-07    2.852E-07   2.852E-07   3.6394E-06
4-3-3-2-4       2.325E-06   5.58E-07    2.852E-07   2.852E-07   3.4534E-06
4-3-3-3-4       2.325E-06   6.51E-07    1.426E-07   1.426E-07   3.2612E-06
4-3-3-4-4       2.325E-06   7.44E-07    1.426E-07   1.426E-07   3.3542E-06
4-3-4-4-4       2.325E-06   8.37E-07    1.426E-07   1.426E-07   3.4472E-06
4-4-3-2-4       2.325E-06   6.51E-07    2.852E-07   2.852E-07   3.5464E-06
4-4-3-3-4       2.325E-06   7.44E-07    1.426E-07   1.426E-07   3.3542E-06
4-4-3-4-4       2.325E-06   8.37E-07    1.426E-07   1.426E-07   3.4472E-06
4-4-4-2-4       2.325E-06   7.44E-07    2.852E-07   2.852E-07   3.6394E-06
4-4-4-3-4       2.325E-06   8.37E-07    1.426E-07   1.426E-07   3.4472E-06
4-4-4-4-4 *     2.325E-06   9.30E-07    0.00E+00    0.00E+00    3.255E-06

Table 3.8. Global optimizer energy estimation (* = minimum-energy configuration)
The complexity of this algorithm is O(B * N^x), where B is the number of basic blocks in the CFG, N is the number of FUs, and x is proportional to the chosen sub-CFG depth and the depth of the original CFG. If the global optimization depth (sub-CFG depth) is small relative to the CFG depth, x