Improving Energy Consumption by Compiler Optimization Technique Register Pipelining

Stefan Steinke

Ruediger Schwarz

University of Dortmund Otto-Hahn-Str. 16 44221 Dortmund, Germany +49-231-755-6133

University of Dortmund Otto-Hahn-Str. 16 44221 Dortmund, Germany +49-231-755-6133

[email protected]

[email protected]

ABSTRACT

This paper presents a new approach for improving the energy consumption of compiler-generated software using register pipelining. This technique makes it possible to save memory load operations by temporarily keeping array elements in registers, which leads to a significant reduction of the data traffic between the processor and the data memory. Experimental results have shown that this technique saves up to 25% of the energy consumption, depending on the memory type and the system architecture. In this work, the technique of register pipelining is used for the first time to save energy in a C compiler for a common RISC processor.

For the reduction of energy consumption, a lot of technical improvements concerning the architecture of electronic systems are being investigated. Most of them belong to the area of hardware design, where e.g. clock gating reduces the number of active gates or bus encoding reduces the number of toggled bus lines. Besides changes in hardware design, software design issues are another promising approach. The advantage of software in general is that it can be developed and changed late in the development process. Furthermore, software changes are inexpensive and can be delivered as an update. Generating software code that consumes less energy has been a research issue for only six years [10]. Besides the optimization for code size and execution time, reducing energy consumption is a new optimization goal of compilers. This means that the processor and the other system components should use minimal energy during the execution of the generated program, regarding only those aspects that can be influenced by software.

Keywords Compiler, Register Pipelining, Energy Consumption.

1. INTRODUCTION

An increasing number of embedded systems control applications like game machines, digital audio players, PDAs or digital cameras. They are developed for a single application only. The requirements often include hard real-time constraints and efficiency, and for some of these systems the energy consumption is crucial.

For the generation of energy saving software, several steps have to be performed. In most cases, no detailed data is available from the vendors about the energy consumption of the processor during the execution of a single instruction. Therefore, a power model has to be developed and simulations or measurements must be performed to extract the necessary data as a base for the decision processes inside the compiler during code generation.

There are embedded systems like mobile phones in which the energy capacity is restricted and where standby time is an important technical feature for the buyers' decision. For other embedded systems, it is difficult or expensive to cool the system and therefore heat dissipation has to be minimized. Finally, the lifetime of all electronic systems can be extended by reducing the chip temperature. All these characteristics can be improved by reducing the energy consumption.

During program execution, memory accesses consume a high amount of energy. Especially in systems using RISC processors and off-chip memory, it seems promising to spend some research effort on the reduction of memory accesses. This is also being done to improve execution speed, because memory accesses are often slow compared to the processor speed. But even without saving clock cycles, it is advantageous to reduce memory accesses because the energy consumption is reduced. A known compiler optimization technique called register pipelining eliminates memory load accesses in loops by storing the data temporarily in unused registers. This is especially profitable when arrays are concerned. Beforehand, a detailed analysis of the array dependencies is required.


for (i = 1; i < 120; i++) {

2. RELATED WORK

An overview of hardware and software based methods for reducing the energy consumption of electronic systems can be found in [8]. Some of these can be integrated into the compiler and thus reduce the energy consumption without any hardware changes. The compiler optimization technique register pipelining [5] belongs to this class of techniques and deals with memory accesses. It reduces the number of memory accesses for arrays and ensures better usage of the memory hierarchy.

    a[i] = a[i-1] + 3;
}

Figure 1a. Source before register pipelining

R = a[0];
for (i = 1; i < 120; i++) {

In order to evaluate the contribution of register pipelining to the energy dissipation, the energy consumption of every single instruction has to be known by the compiler. In most cases, however, hardware vendors of general purpose processors publish neither this data nor a power model of the processor. Therefore, it is necessary to undertake measurements for all instructions of the individual processor. This was first done by Tiwari, who published detailed information on the energy consumption of a 486DX2 processor, a RISC processor and an embedded processor [9,10,11,12]. Later, measurements were also taken for an ARM processor by Sinevriotis [7].

    R = R + 3;
    a[i] = R;
}

Figure 1b. Source after register pipelining

This transformation is performed in the steps described below and is repeated from step 5 as long as beneficial optimizations can be executed.

In [10], Tiwari published a power model based on his measurements. In our approach, this model has been extended to include the energy used by the program and data memory. Especially for the ARM7TDMI processor, which was chosen for the experiments in our work, the energy used for an instruction fetch from memory as well as for data loads and stores has to be taken into consideration. Several experiments have shown that the energy consumed by the memory can be twice the energy consumed by the processor itself.

Step 1. Loop detection

First of all, the loops have to be detected, such as the for statement in Figure 1. This is done by a depth first search algorithm on the internal control flow graph. A control flow graph represents the instructions as nodes and their control dependencies as edges. The depth first search classifies the edges, and each back edge indicates an existing loop. Starting from this analysis, the whole loop and all involved edges can be determined.
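As an illustration of this back edge search (a simplified sketch with hypothetical data structures, not the actual compiler internals), a depth first search that keeps track of the nodes on its current path reports every edge leading back to such a node as a back edge and thus as a loop:

#include <stdio.h>

#define MAX_SUCCS 2

/* Hypothetical CFG node: one basic block with its successor edges. */
struct cfg_node {
    const char *name;
    struct cfg_node *succ[MAX_SUCCS];
    int num_succ;
    int visited;    /* already reached by the DFS */
    int on_stack;   /* currently on the DFS path  */
};

/* A back edge tail -> head indicates a loop with header 'head'. */
void dfs_back_edges(struct cfg_node *n)
{
    n->visited = n->on_stack = 1;
    for (int i = 0; i < n->num_succ; i++) {
        struct cfg_node *s = n->succ[i];
        if (!s->visited)
            dfs_back_edges(s);                         /* tree edge */
        else if (s->on_stack)
            printf("back edge %s -> %s: loop found\n", n->name, s->name);
    }                                                  /* other edges: no loop */
    n->on_stack = 0;
}

int main(void)
{
    /* CFG of the loop in Figure 1a: entry -> header, header <-> body, header -> exit. */
    struct cfg_node ex = { "exit" }, body = { "body" }, head = { "header" }, entry = { "entry" };
    entry.succ[0] = &head; entry.num_succ = 1;
    head.succ[0] = &body;  head.succ[1] = &ex; head.num_succ = 2;
    body.succ[0] = &head;  body.num_succ = 1;  /* the back edge */
    dfs_back_edges(&entry);
    return 0;
}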

Finally, the work of Panda [6] should be mentioned, in which the behavior of the memory and optimizations for memory accesses are described in detail. The use of scratch-pad memory is also part of his research.

Step 2. Search for induction variables and their limiting values

Induction variables like i in Figure 1 are variables whose values depend on the number of executed loop iterations. In this step, the induction variables are determined along with their step width as well as their lower and upper bounds, which are 1 and 120 in the above example. With the application of different algebraic transformations, a higher percentage of loops can be processed.
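For illustration only (a simplified, source-level view of what is determined on the intermediate representation; the names are hypothetical), the result of this step for the loop of Figure 1 can be summarized in a record like this:

#include <stdio.h>

/* Hypothetical summary of one induction variable of a loop. */
struct induction_var {
    const char *name;   /* variable name                                      */
    long lower;         /* value at loop entry                                */
    long upper;         /* limiting value from the loop condition (i < upper) */
    long step;          /* step width per iteration                           */
};

/* Iterations of "for (i = lower; i < upper; i += step)" for a positive step. */
long iteration_count(const struct induction_var *iv)
{
    if (iv->step <= 0 || iv->upper <= iv->lower)
        return 0;
    return (iv->upper - iv->lower + iv->step - 1) / iv->step;
}

int main(void)
{
    struct induction_var i = { "i", 1, 120, 1 };   /* the loop of Figure 1 */
    printf("%s: %ld iterations\n", i.name, iteration_count(&i));   /* prints 119 */
    return 0;
}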

3. OVERVIEW OF REGISTER PIPELINING

3.1 Principle

During compilation, a compiler replaces the variables of the programming language in an intermediate step by virtual registers and later, during the register allocation phase, by processor registers. Due to the limited number of real processor registers, some virtual registers have to be stored temporarily in memory, which is called spilling.

Step 3. Search for load and store operations

The load and store operations have to be determined in order to replace the loads with register accesses later on. For each load or store access concerning an array, the respective index function is computed, based on the induction variables at the loop entry. In the next step, different induction variables occurring in the index functions are transformed into one main variable if possible.
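A sketch of how such an index function could be represented for the simple affine case (hypothetical names; the real analysis works on the intermediate representation and covers more cases):

/* Hypothetical affine index function for an access like a[coeff*i + offset],
   where i is the main induction variable of the loop. */
struct index_fn {
    const char *array;   /* accessed array                        */
    long coeff;          /* factor of the main induction variable */
    long offset;         /* constant part of the index            */
};

/* Element index touched in iteration i. */
static inline long element_index(const struct index_fn *f, long i)
{
    return f->coeff * i + f->offset;
}

/* Figure 1: the load a[i-1] becomes { "a", 1, -1 }, the store a[i] becomes { "a", 1, 0 }. */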

Arrays are completely stored in memory and their elements are accessed with load/store instructions. Register pipelining is a technique which reduces these memory accesses. It eliminates data load operations for array elements that were already accessed in a previous loop pass by keeping the data temporarily in unused processor registers whenever this is possible. This is mainly done in loops, where improvements are very profitable.

Step 4. Array data flow analysis

The array data flow analysis used here is based on a method developed by Duesterwald [1]. The dependencies between the different memory accesses in the loop are evaluated.
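Continuing the index function sketch above (simplified compared with the framework of Duesterwald [1], which this step actually uses), the property that matters for register pipelining is a constant reuse distance between two accesses to the same array:

#include <stdlib.h>
#include <string.h>

/* Constant reuse distance in loop iterations between two affine accesses,
   or -1 if the distance is not constant (different arrays or coefficients).
   Uses struct index_fn from the sketch above. */
long reuse_distance(const struct index_fn *a, const struct index_fn *b)
{
    if (strcmp(a->array, b->array) != 0 || a->coeff != b->coeff)
        return -1;
    return labs(a->offset - b->offset);
}

/* Figure 1: the store a[i] and the load a[i-1] have reuse distance 1, so the
   value written in iteration i can be kept in a register for iteration i+1. */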

The main principle is shown in the C program given in Figure 1a, which is transformed into the program in Figure 1b using register pipelining. With this transformation, the number of memory accesses to the array a is reduced from 240 to 120, because the value of a[i-1] is kept temporarily in the register R.
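The "pipeline" in the name becomes visible for larger reuse distances. As an illustrative source-level sketch (not output of the described compiler), a loop that reads an element written two iterations earlier needs two registers which are shifted in every pass:

int b[120];

/* Before: each iteration loads b[i-2] from memory.
 *     for (i = 2; i < 120; i++)
 *         b[i] = b[i-2] + 3;
 */

/* After register pipelining with reuse distance 2: R0 and R1 act as a
 * two-stage register pipeline, so the loads of b[i-2] are eliminated. */
void pipeline_distance_two(void)
{
    int R0 = b[0];
    int R1 = b[1];
    for (int i = 2; i < 120; i++) {
        int tmp = R0 + 3;
        b[i] = tmp;
        R0 = R1;     /* shift the register pipeline by one iteration */
        R1 = tmp;
    }
}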

In our approach, this method was extended to cover multiple entry/multiple exit loops and the handling of non-affine index functions. Furthermore, it was extended to handle different kinds of loops, e.g. those with a step width greater than 1 and accesses with different data types.



Step 5. Calculation of free registers

In this work, register pipelining is performed after register allocation. This stage was chosen because the exact number of available registers can easily be determined by analyzing the instructions together with their processor registers. This is very important, since a possible spill, caused by a wrong decision during register pipelining, would cost much more than the optimization potentially saves.
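A minimal sketch of such a free register calculation (hypothetical instruction representation; the actual compiler works on its own internal data structures): every register read or written by an instruction of the loop body is marked as used, r13-r15 (SP, LR, PC) are reserved, and the remaining lower (r0-r7) and higher (r8-r12) registers are counted separately:

#define NUM_REGS 16                                       /* r0-r15 of the ARM7TDMI */
#define RESERVED ((1u << 13) | (1u << 14) | (1u << 15))   /* SP, LR, PC             */

/* Hypothetical post-register-allocation instruction: a bit mask of the
   registers it reads or writes. */
struct insn {
    unsigned reg_mask;
};

/* Counts the free lower (r0-r7) and higher (r8-r12) registers in a loop body. */
void count_free_regs(const struct insn *body, int n, int *free_lo, int *free_hi)
{
    unsigned used = RESERVED;
    for (int i = 0; i < n; i++)
        used |= body[i].reg_mask;

    *free_lo = *free_hi = 0;
    for (int r = 0; r < NUM_REGS; r++) {
        if (used & (1u << r))
            continue;                    /* register is in use or reserved */
        if (r < 8)
            (*free_lo)++;
        else
            (*free_hi)++;
    }
}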

$P_{sc} = \sum_{cells\ j} f(C_{load,j}, t_{transition,in}) \cdot TR_j$

From the compiler's point of view the short circuit power can be influenced in the same way as switching power.

4.1.3 Leakage power

Another aspect is the different types of registers. For example, the ARM7TDMI processor considered here provides 8 lower registers in the 16 bit instruction set, which can be used in all instructions, and 8 higher registers, which can only be addressed by a limited number of instructions.

The leakage power is the static power dissipated when a gate is not switching. It accounts for less than 1% of the total dissipation of active circuits, but this value increases when the processor is in an idle state (standby).

$P_{leakage,total} = \sum_{cells\ i} P_{leakage,i}$

This is taken into consideration in our compiler in order to allow different transformations depending on the type of free registers.

Leakage power can be influenced by the compiler by switching off parts of the circuits. This is called power management and depends on the built-in features of the individual processor.

Step 6. Optimization exploration and selection

In general, a number of transformations have to be explored. The choice of the most efficient one depends on:

- the control flow,
- the lifetime of the registers,
- the number and the type of additionally usable registers,
- the positions of the loads and stores of the different array elements.

4.2 Power Model

For the decision process inside the compiler, a power model reflecting the physical processes has to be developed. As shown in the previous sections, the compiler can influence the energy consumption by reducing the switching activity, by optimizing the encoding of the memory contents, and by using special features for deactivating parts of the circuit. Only the first possibility is surveyed in this work.

After exploring the different kinds of transformations and determining the number of free lower and higher registers, the benefit of all transformations is computed and the most efficient one is applied.
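A sketch of such a benefit computation (hypothetical structure and names; the energy values per instruction come from the measured data base described in Section 4): a candidate transformation is only worthwhile if the energy of the removed loads exceeds the energy of the added moves and initialization loads, and it must never need more than the free registers:

/* Hypothetical description of one candidate register pipelining transformation. */
struct candidate {
    long removed_loads_per_iter;   /* loads replaced by register accesses  */
    long added_moves_per_iter;     /* moves inserted to shift the pipeline */
    long init_loads;               /* one-time loads before the loop       */
    int  regs_needed;              /* registers occupied by the pipeline   */
};

/* Estimated energy benefit; a negative value means "do not apply".
   e_load and e_move are the measured energies of one load and one move. */
double energy_benefit(const struct candidate *c, long iterations,
                      double e_load, double e_move, int free_regs)
{
    if (c->regs_needed > free_regs)
        return -1.0;                       /* would cause spilling */
    double saved = (double)c->removed_loads_per_iter * iterations * e_load;
    double cost  = (double)c->added_moves_per_iter   * iterations * e_move
                 + (double)c->init_loads * e_load;
    return saved - cost;
}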

It seems obvious to choose processor instructions as the base for reducing switching activity. The compiler can calculate the costs either before code selection or later in the code generation process. Also, a sequence of instructions can be considered e.g. during register pipelining.

Some standard optimizations like copy propagation are included in the process and executed after the actual transformation. Following one register pipelining optimization run, the algorithm jumps back to check whether an additional optimization is applicable.

It has to be taken into consideration that no detailed circuit description or VHDL model can usually be obtained for generally available processors. Thus, the electrical measurement of the current consumed by the processor is the only way to obtain a satisfactory decision base for the compiler.

4. BASICS OF ENERGY CONSUMPTION

4.1 Physical Basis

As mentioned before, our work is based upon the power model developed by Tiwari [10], which distinguishes between basic costs and inter-instruction effects. Basic costs consist of the measured current during execution of a single instruction in a loop and an approximate amount added for stalls and cache misses. The change of the circuit state for a different instruction and resource constraints are summed up in the inter-instruction effects.
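In the spirit of this model (a sketch with placeholder tables, not measured data), the energy of an instruction sequence is the sum of the basic cost of each instruction plus an inter-instruction overhead for each pair of adjacent opcodes:

#define NUM_OPCODES 64   /* hypothetical opcode space */

double basic_cost[NUM_OPCODES];                /* measured energy per opcode   */
double inter_cost[NUM_OPCODES][NUM_OPCODES];   /* overhead per pair of opcodes */

/* Energy estimate for a trace of executed opcodes. */
double sequence_energy(const int *opcodes, long n)
{
    double total = 0.0;
    for (long k = 0; k < n; k++) {
        total += basic_cost[opcodes[k]];                       /* basic cost             */
        if (k > 0)
            total += inter_cost[opcodes[k - 1]][opcodes[k]];   /* inter-instruction cost */
    }
    return total;
}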

For measuring the energy consumption and optimizing the compiler, it is necessary to know the physical reasons for the dissipation of energy. In an electronic system, the processor and the memory, which are the focus of this research work, are usually manufactured using CMOS technology. In CMOS, three sources of dissipation have to be distinguished [8].

4.1.1 Switching power

Power is consumed for charging and discharging the load capacitances at the output of a CMOS cell. Additionally, there are net and gate capacitances in the circuit. In active CMOS circuits, switching power accounts for 70 to 90% of the total power dissipation; it can be computed with the formula for $P_{sw}$ given below.

Tiwari's research work shows that the basic costs constitute the major part of the total amount of energy consumed. This is fortunate, since the decisions in the compiler are mostly concerned with the choice of an isolated instruction, without knowledge of the whole sequence in which the instruction is included.

$P_{sw} = \frac{1}{2} V_{dd}^2 \sum_{nets\ i} C_{load,i} \cdot TR_i$   (TR = toggle rate)

Reducing switching power is the main target of improvements by the compiler. In order to achieve this, the compiler has to produce software with reduced switching activity both inside and outside the processor and memory.

For our work, we use a power model which only includes the power dissipation of the processor, $P_{proc}$, and the power dissipation of the memory, $P_{mem}$. The latter is necessary because the measurements for the selected ultra low power processor ARM7TDMI have shown a higher energy dissipation in the memory than in the processor itself. Since the number of memory accesses is influenced by the compiler, the energy for memory accesses was integrated into the power model (see the formula for $P_{total}$ below).

4.1.2 Short circuit power

The short circuit power occurs during the switching of the two complementary transistors used in CMOS technology. For a short time during a signal transition, both the P-channel and the N-channel transistor can be conductive simultaneously, and the resulting current $I_{sc}$ causes the dissipation of short circuit power. It can amount to up to 30% of the total dissipation for slow transitions.


$P_{total} = \sum_{executed\ instructions} (P_{proc} + P_{mem})$

This power model is integrated into the compiler and into a trace analyzer which computes the total amount of energy dissipated during the execution of the program under observation.
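A sketch of the corresponding computation in a trace analyzer (the field names, the fixed supply voltage and the cycle time are assumptions made here for illustration): for every executed instruction, the processor part and the memory part are obtained from the measured currents and the cycle counts and then summed up:

/* One executed instruction as seen by a trace analyzer (hypothetical layout). */
struct executed_insn {
    double i_proc_mA;    /* measured processor current (cf. Table 1)     */
    int    proc_cycles;  /* base cycles plus the extra cycles n and m    */
    double i_mem_mA;     /* current of the accessed memory (cf. Table 2) */
    int    mem_cycles;   /* cycles spent on memory accesses              */
};

/* Total energy of a trace; mA * V * ns yields pWs (10^-12 Ws) per term. */
double total_energy_pWs(const struct executed_insn *trace, long n,
                        double vdd_V, double t_cycle_ns)
{
    double e = 0.0;
    for (long k = 0; k < n; k++) {
        double e_proc = trace[k].i_proc_mA * vdd_V * trace[k].proc_cycles * t_cycle_ns;
        double e_mem  = trace[k].i_mem_mA  * vdd_V * trace[k].mem_cycles  * t_cycle_ns;
        e += e_proc + e_mem;
    }
    return e;
}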

In Table 1, the essential measurements are shown for the 16 Bit Thumb Instruction Set. Sinevriotis [7] did measurements only for the ARM instruction set, and no further measurements have been published. It can be seen that load and store instructions need about 20% more current than a move instruction between registers. To compare the energy, this has to be multiplied by the number of cycles, which depends on the memory type.

Table 1. Measured current of the processor

instruction          cycles      current (mA)
MOV Rd,Hs            1 + n       43.8
ADD Rd,#imm          1 + n       44.7
BGE                  3 + n       43.0
CMP Rd,Hs            1 + n       41.3
LDR Rd,[SP,#imm]     3 + n + m   48.8
STR Rd,[SP,#imm]     2 + n + m   57.7

n = number of additional cycles for an instruction fetch from program memory
m = number of additional cycles for a data load or store from data memory

5. COMPILER INTEGRATION

Several phases during compilation are possible for the integration of the register pipelining technique:

1. Implementation as a C-to-C transformation
2. Implementation on an intermediate representation level
3. Integration after the register allocation phase

Figure 2. Compiler phases

The different phases are shown in Figure 2, where the intermediate representation (IR-Code) is independent of the input language of the program and also independent of the processor. This is advantageous for reuse.

In our work, we decided not to implement register pipelining on the C code level or on the intermediate representation, because at these stages it is difficult to determine the benefit of the technique in advance. In the early phases, the number of available processor registers is unknown. This is an important issue, since the introduction of additional spilling will usually make the generated code worse instead of better.

A second reason is that other optimizations are working on the intermediate representation. They may interact with the register pipelining in such a way that its advantage is destroyed. In contrast to the first two alternatives, the processor instructions are known during the register allocation phase and the code will not be changed again later.

6. EXPERIMENTAL ENVIRONMENT AND RESULTS

For our experiments, we chose the ARM7TDMI processor, which is very common nowadays for embedded systems, especially for low power applications. This processor is a member of the family designed by Advanced RISC Machines and is licensed to and manufactured by a majority of the processor vendors. The ARM7TDMI is based on the ARM7 core, which is often combined with on-chip memory and I/O logic. It also includes an additional so-called Thumb Instruction Set, which is recommended by the vendor for power conscious applications.

In a second series of measurements, the current of the 512 kByte SRAM memory was measured for different word widths:

Table 2. Measured current of the 512 kByte off-chip RAM

mode                 bit width   additional cycles   current (mA)
Instruction fetch    16          1                   77.2
Data Read            32          3                   98.4
Data Write           32          3                   101.0

The current needed for a data read or write is about double the current the processor itself needs for a move instruction. For the on-chip memory, the current is included in the measurements of the processor current; on average, an additional current of 3 mA was observed. Based on these values, the amount of energy necessary for an access to the memory was computed and integrated into the data base inside the compiler.
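As a rough cross-check against Tables 1 and 2 (a back-of-the-envelope comparison, not an additional measurement): a data read draws 98.4 mA and a data write 101.0 mA from the off-chip RAM, compared with 43.8 mA for a register move executed by the processor, i.e. factors of roughly $98.4 / 43.8 \approx 2.2$ and $101.0 / 43.8 \approx 2.3$; the full energy comparison additionally has to account for the differing cycle counts.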

The experiments were performed using the ATMEL AT91M40400 processor, which features an additional 4 kByte of static on-chip RAM, and the ATMEL AT91EB01 evaluation board. The board includes the processor, 512 kByte of static off-chip RAM, Flash ROM and I/O components.

Register pipelining replaces loads with a pipeline of move instructions. Based on the measurements, Table 3 shows the number of moves which are equivalent in energy consumption to a 32 bit load instruction, depending on the memory type.

For the evaluation of register pipelining, a data base for the power model is necessary. Therefore, a series of measurements was performed to determine the energy consumption of the processor instructions of the 16 Bit Thumb Instruction Set (Table 1).

$n = \dfrac{P_{load} \cdot Cycles_{load}}{P_{move} \cdot Cycles_{move}}$


Table 3. Equivalent number of move instructions

Program memory   Data memory   Equivalent number of move instructions
on-chip          on-chip       3.4
on-chip          off-chip      16.2
off-chip         on-chip       1.6
off-chip         off-chip      3.5

The compiler can replace load instructions with the register pipeline up to this amount.

The register pipelining optimization is helpful in loops with accesses to array elements that were used in previous passes. Experiments were conducted with different benchmarks meeting this criterion. To show the efficiency of the technique, the following benchmarks were used:

The benchmarks kernel 1 (hydro fragment), kernel 5 (tri-diagonal elimination), kernel 7 (equation of state fragment), kernel 11 (first sum) and kernel 12 (first difference) are kernels of the Livermore benchmark suite [3], which was specially constructed for benchmarking loops. Due to the lack of support for the data type double in the current compiler version, the variables of that type were replaced by integers.

The benchmark biquad_N_sections is part of the DSPStone benchmark suite.

Figure 3. Benchmarks with program in off-chip RAM (energy in 10^-6 Ws, without and with register pipelining, for the kernel benchmarks, lattice and biquad_N_sections)

Depending on the number of load statements which are replaced by register pipelining, the improvement varies between 0 and 22% (cf. Figure 3). The improvement results from the removed load statements and from the index calculations for the array elements inside the considered loop which become unnecessary.

To show the dependency on the architecture of the system, the program was loaded into the internal 4 kByte on-chip RAM. Even though this could not be done with larger programs, it is common to use a ROM for storing the program code in embedded systems.

Figure 4. Benchmarks with program in on-chip RAM (energy in 10^-6 Ws, without and with register pipelining, for the kernel benchmarks, lattice and biquad_N_sections)

The results of this experiment are displayed in Figure 4 and show improvements between 4.7% and 26%. This is a remarkable result because register pipelining can easily be implemented.

7. CONCLUSION

The presented work implements register pipelining, which is performed after the register allocation phase in a C compiler. It has been shown that this method is capable of reducing the number of memory accesses, which reduces the total energy consumption of the system. Improvements of up to 26% could be achieved solely by improving the compiler.

Furthermore, it is shown that the energy consumption of the memory system has to be taken into consideration and that the type of memory influences the degree to which register pipelining can be applied.

The increasing role of embedded systems and the software included in them demands that optimizations of the energy consumption be studied from the perspective of software, too.

8. REFERENCES

[1] Duesterwald, E., Gupta, R. and Soffa, M. L., "A Practical Data Flow Framework for Array Reference Analysis and its Use in Optimizations", Proc. of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, NM, 1993.

[2] Kandemir, M., Vijaykrishnan, N., Irwin, M. J. and Ye, W., "Influence of Compiler Optimizations on System Power", Proc. of the 37th Design Automation Conference, Los Angeles, CA, 2000.

[3] Livermore Benchmarks: http://scicomp.ewha.ac.kr/netlib/benchmark/livermorec

[4] Macii, E., Pedram, M. and Somenzi, F., "High-Level Power Modeling, Estimation, and Optimization", IEEE Trans. on CAD of ICs and Systems, November 1998.

[5] Muchnick, S. S., "Advanced Compiler Design and Implementation", Morgan Kaufmann Publishers, San Francisco, California, 1997.

[6] Panda, P. R., Dutt, N. D. and Nicolau, A., "Memory Issues in Embedded Systems-On-Chip", Kluwer Academic Publishers, 1999.

[7] Sinevriotis, G. and Stouraitis, T., "Power Analysis of the ARM 7 Embedded Microprocessor", Proc. of the 9th Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Oct. 6-8, 1999.

[8] Synopsys, "Power Products Reference Manual", v3.5, Document Order Number: 23837-000 BA, 1996.

[9] Tiwari, V. and Lee, M., "Power Analysis of a 32-bit Embedded Microcontroller", VLSI Design Journal, 1998(7).

[10] Tiwari, V., Malik, S. and Wolfe, A., "Compilation Techniques for Low Energy: An Overview", Proc. of the 1994 IEEE Symposium on Low Power Electronics, San Diego, CA, 1994.

[11] Tiwari, V., Malik, S. and Wolfe, A., "Power Analysis of Embedded Software: A First Step towards Software Power Minimization", IEEE Trans. on VLSI Systems, December 1994.

[12] Tiwari, V., Malik, S. and Wolfe, A., "Instruction Level Power Analysis and Optimization of Software", Journal of VLSI Signal Processing Systems, August 1996.
