Power Reduction in VLIW Processor with Compiler Driven Bypass Network

Neeraj Goel, Anshul Kumar and Preeti Ranjan Panda
Department of Computer Science, Indian Institute of Technology Delhi
{neeraj, anshul, panda}@cse.iitd.ernet.in

Abstract
With the increase in issue width, the bypass control of a processor becomes more complex. Moreover, operands are read both from the register file and from the bypass network, and for a multi-port register file the read/write energy is much higher than for a single-port register file. Both redundant register reads/writes and bypass control area can be reduced with compiler hints for register bypass. In this work we suggest an innovative way to represent compiler bypass hints that serves both these purposes. Further, the bypass hints are used in the effective design of a multi-stage bypass network. Experiments on mediabench benchmarks show that with our approach (i) register file energy savings can be as much as 60%, and (ii) synthesis of the VLIW core saves 2-4% of the core area.

1. Introduction
In a pipelined scalar processor design, register bypass (also known as data forwarding) is a standard technique to increase performance, but as the number of issue slots increases, the complexity of the bypass circuit grows rapidly. It has been shown [1] that a bypass control circuit needs 2 * k * N^2 comparators, where N is the number of issue slots and k is the number of stages between the register read and write-back stages. Thus, in a VLIW processor the bypass network may contribute significantly to chip area as well as energy consumption. Research efforts to reduce chip area and energy consumption in processors with bypass networks have primarily followed two directions: (a) reducing the number of bypass paths [7, 8, 9, 10], and (b) reducing the number of reads/writes from/to the register file [3, 4, 5] by exploiting the bypass paths. In the first case, both chip area and energy consumption are reduced by providing only selected bypass paths. The penalty is paid in terms of an increased number of stall cycles. This penalty can be substantially reduced by carefully selecting the paths to be provided and appropriately scheduling and binding the instructions. While all of these works rely on instruction scheduling, binding may be done by the compiler [7, 10] or in hardware [8, 9]. Reducing the bypass network may also have a positive impact on clock cycle time.

In the second direction of research, the aim is to use the presence of the bypass network to save register file energy. The idea is to avoid reading from the register file those operands which can be obtained from the bypass network. The opportunities for avoiding register file reads may be detected by hardware [5] or by the compiler [3, 4]. The hardware approach has an overhead in terms of area, energy and possibly clock period, though the energy overhead is more than offset by the savings in register file energy. In the case of the compiler approach, the overhead is only the additional bits in the instruction from which the control signals to inhibit register file reads are derived. Using liveness analysis, the compiler can also identify situations in which writing into the register file can be avoided. Since the multi-port register file of a VLIW processor consumes 20-30% of the total power [2], reducing register reads/writes can save a significant amount of energy.

In this paper, we present a compiler-driven approach to saving register file energy which not only avoids extending the instruction size, but also reduces the bypass control hardware. Our technique exploits the fact that the register address field is not used when an operand is read from the bypass network, and is therefore available for carrying information from which the appropriate control signals can be derived. Though this may result in an increase in the number of clock cycles to execute a program, we show experimentally that this increase is negligible.
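As a rough sanity check of this quadratic growth, the following sketch evaluates the 2 * k * N^2 expression from [1]; the concrete N and k values are illustrative only:

def bypass_comparators(n_issue_slots: int, k_stages: int) -> int:
    # Comparators needed by a conventional bypass control circuit,
    # following the 2 * k * N^2 estimate of [1].
    return 2 * k_stages * n_issue_slots ** 2

# Illustrative values only: a 4-issue VLIW with 2 stages between
# register read and write-back already needs 64 comparators;
# doubling the issue width quadruples that count.
print(bypass_comparators(4, 2))   # 64
print(bypass_comparators(8, 2))   # 256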

2. Overview of the Approach
In a standard implementation of a processor, an operand may be read from the register file as well as from the bypass network. Which value is used as the operand is resolved by the control unit at run time. When the operand is available in the bypass registers, no register read is needed. Further, a register write can be avoided when the operand is short lived and is not used again after being read from the bypass. After scheduling, a compiler can mark all the operands which will get their value from the bypass. Using liveness analysis, the compiler can mark the operands (live ranges) which are used for the last time. Using this information, the compiler represents the instruction in a way that causes minimum penalty.

Figure 1. Multi-stage bypass pipeline (register file and FU feeding bypass registers Bypass, Bypass1, Bypass2 and Bypass3, with operand selection through Mux1 and Mux2)

2.1. Representation of Control Information
A crucial question while providing compiler support is how the bypass hints should be encoded. The hints must provide two pieces of information: first, whether an operand is read from the bypass or from the register file; second, from which bypass path the operand is read. Providing one bit per operand can indicate whether to read from the register file or from a bypass path. If no further information is given, computing the bypass path must still be performed in hardware; in this case there is no reduction in the complexity of the bypass control hardware, though register reads are avoided whenever data is available on a bypass path. If the technique of [3] is used to provide the bypass path information, more bits have to be added for each operand, leading to an increase in instruction size and, consequently, program size. The alternative we propose in this paper is to use the register file address space. As discussed before, when an operand is read from the bypass, the information in the register address field is redundant. Thus, instead of using extra bits for bypass, we encode the bypass information in the register address itself: part of the address space is allocated to the bypass registers and the rest is used for general-purpose registers. For example, if we previously had a register file with 64 registers, after using four addresses for bypass the effective register file size reduces to 60. This may cause generation of more spill code and hence some performance loss; we show experimental results in the next section to quantify it. The advantage of this representation is that we can address the bypass register in the instruction itself, with no modification of the ISA or instruction format as would be required by [3] and [4].
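The following is a minimal software sketch of this encoding, assuming, as in the later example of section 4, a 64-entry register address space whose top eight addresses (56-63) name bypass registers; the helper and the register names are illustrative, not the actual decode hardware:

NUM_REGS = 64          # architectural register address space
NUM_BYPASS = 8         # addresses 56..63 name bypass registers
BYPASS_BASE = NUM_REGS - NUM_BYPASS

def decode_operand(addr: int):
    # Returns (rf_inhibit, source): if the address falls in the bypass
    # range, the register file read is inhibited and the operand is
    # taken from the named bypass register instead.
    assert 0 <= addr < NUM_REGS
    if addr >= BYPASS_BASE:                 # top 3 address bits all ones
        return True, f"by{addr - BYPASS_BASE}"
    return False, f"r{addr}"

print(decode_operand(13))   # (False, 'r13')  -> normal register read
print(decode_operand(56))   # (True, 'by0')   -> read from bypass, RF read inhibited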

2.2. Multi-stage Bypass
It has been observed that most operand values are transient. A VLIW compiler in general schedules an operation as soon as its operands are available, and if in one particular cycle the number of ready operations exceeds the number of functional units, an operation is scheduled in the next cycle. Thus, there is a high probability that if the result of an operation is kept in the pipeline for a few more cycles, more operands can be read from the bypass paths. On the other hand, keeping results in the pipeline for more cycles makes the bypass network larger: the number of inputs to the bypass multiplexer gets multiplied, which may increase the clock cycle time. Designing the bypass network as a multi-stage unit helps in reducing the critical path. For a single-issue processor, a two-stage bypass is shown in figure 1, in which the result of the functional unit is kept in the pipeline for three cycles. Instead of sending the outputs of all four bypass registers to the same multiplexer, the task is divided among two bypass multiplexers. The first bypass multiplexer takes its inputs from the pipelined bypass registers. The second bypass multiplexer takes as inputs the output of the first bypass multiplexer and the result of the functional unit. Using this method, the number of inputs of Mux2 (in figure 1) increases only by one, irrespective of the number of additional pipeline stages. The advantage of the two-stage bypass is evident in a VLIW processor, where the bypass multiplexer becomes larger due to multiple functional units. For example, with four functional units the number of inputs to the bypass multiplexer would be five (one from the register file). If all the functional units keep their results for one more cycle, the number of multiplexer inputs grows to nine. This can lengthen the critical path, since the bypass multiplexer is part of it. In the multi-stage bypass approach, the number of inputs to the bypass multiplexer increases only by one. Multi-stage bypass requires more bypass registers to be addressed. Our register usage analysis shows that we can provide enough register address space for addressing the bypass registers with only a slight loss in performance.
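A behavioural sketch of this two-stage selection is given below; the three-deep bypass pipeline and the assumption that the register file value feeds Mux2 directly are our reading of figure 1, not an RTL description:

def read_operand(rf_value, fu_result, pipelined_results, hint):
    # hint is one of: "rf", "bypass", "bypass1", "bypass2", "bypass3".
    # Mux1 chooses only among the pipelined bypass registers, so adding
    # more pipeline stages grows Mux1 but leaves Mux2 at three inputs.
    if hint.startswith("bypass") and hint != "bypass":
        mux1_out = pipelined_results[int(hint[-1]) - 1]   # Mux1
    else:
        mux1_out = None
    if hint == "rf":
        return rf_value                                   # Mux2 input 1
    if hint == "bypass":
        return fu_result                                  # Mux2 input 2
    return mux1_out                                       # Mux2 input 3

# Operand kept in the pipeline for two extra cycles:
print(read_operand(rf_value=10, fu_result=42,
                   pipelined_results=[7, 8, 9], hint="bypass2"))   # 8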

2.3. Issues with Proposed Bypass Hints
As we represent bypass hints using the register address space, a few issues arise; they are discussed in the following subsections.

2.3.1. Performance Impact
To study the performance loss due to the smaller number of available general-purpose registers, various mediabench benchmarks were simulated for different register file sizes. For simulation and compilation we used the Trimaran compiler infrastructure [12]. The base VLIW machine was a 4-issue machine (3 integer functional units, one memory unit and one branch unit) with 64 registers. To compare the results across benchmarks, the number of execution cycles is normalized to the number of cycles taken by the 64-register VLIW. From figure 2 it can be seen that when the number of registers is reduced from 64 to 48, there is only a 1% increase in the number of cycles. In other words, as many as 16 registers can be used as bypass register addresses with only a 1% loss in performance.

Figure 2. Impact of register file size on performance (number of cycles, normalized to the 64-register machine, versus register file sizes from 16 to 72 for g721decode, mpeg2dec, pegwitdec, pegwitenc, unepic and the average)

2.3.2. Predicated Operations
In VLIW processors, predication is used to increase the size of basic blocks and thereby the available parallelism. Predicated operations are conditionally executed, and the conditions are determined at run time; whether a predicated operation will actually execute cannot be determined at compile time. This implies that compiler hints indicating whether the bypass path will be used can still be applied to predicated operations, but hints which specify the exact bypass path and bypass register cannot be given at compile time. Note that predication is generally not used in commercial processors; for example, Analog Devices' TigerSHARC, TI's TMS62XX and Sun's MAJC have no predication in their instruction sets, and ST's Lx has only partial predication support.

2.3.3. Exceptions
When the bypass register is given as the bypass hint, the information about the original source register is removed from the instruction. Because bypass registers are transient in nature, their values change every cycle. If an exception is raised by an instruction having bypass operands, or an interrupt occurs at that instruction, the operand value will no longer be in the bypass registers after returning from the interrupt/exception unless some micro-architectural change is made in the processor design. Exceptions and interrupts are handled in VLIW processors like in any other pipelined processor, with the difference that the operand addresses cannot be changed in hardware, as all registers are determined at compile time and there is no hardware support for renaming. Ozer et al. describe how traditional exception/interrupt handling techniques can be extended to VLIW processors [11]. In their technique, a copy of the register file is kept in a future file with a reorder buffer, or in a history buffer. Whenever control returns from an exception, the original state can be retrieved from either the history buffer or the future file with the reorder buffer. In our case, any of these methods can be extended so that the bypass registers are saved in the history buffer or future file just like regular registers. In this way, lost transient values can be retrieved at any instant. The extra logic required provides the connections and control so that the future file or history buffer can be read from (written to) by the bypass registers. A quantitative analysis of the power and performance impact of exception handling is beyond the scope of this work.

3. Compiler Support

3.1. Compiler Analysis
To inhibit redundant reads and writes, the compiler, during its analysis phase, identifies the operands which can be read from the bypass network as well as the writes which can be avoided. For avoiding reads, a list is maintained of the results available in each cycle; the operands whose values are available in that cycle are marked as bypass reads. For avoiding redundant writes, liveness information is required, as only those writes can be avoided whose values are not read in the future. In a standard compiler, after liveness analysis, the information about which variables are live at any program point is available; this analysis is done using standard data-flow analysis techniques.

Before bypass:  add.1 r13, r14, r8;  sub.1 r6, r13, r9;
After bypass:   add.1 r13, r14, r8;  sub.1 r6, by1, r9;

Figure 3. Register renaming for bypass

Figure 4(a). Original bypass control: the m bypass source addresses (Addr 1 ... Addr m) are compared against the operand address (m * n address comparisons in total), followed by operand selection through n 32-bit m:1 multiplexers

By using the liveness information, we can find when a variable is last used. A use of a variable is a last use if the value is either not used afterwards or is redefined before its next use. For marking the last use of operands, all analysis is done at the basic block level using global information. For each basic block, its live-out variables are obtained from liveness analysis. The operands which are read in that basic block but are not in its live-out set are marked as last-use operands. Similarly, operands which are redefined inside the basic block are marked as last use at the appropriate instruction.
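A minimal sketch of this last-use marking, assuming each basic block is available as a list of (defs, uses) pairs and its live-out set has already been computed by standard liveness analysis; the data layout is illustrative:

def mark_last_uses(block, live_out):
    # block: list of (defs, uses) per instruction, in program order.
    # Walk backwards: a use is a last use if the value is not live
    # afterwards, i.e. it is neither in live_out nor used again before
    # being redefined later in the block.
    live = set(live_out)
    last_uses = []                       # (instr_index, operand) pairs
    for i in range(len(block) - 1, -1, -1):
        defs, uses = block[i]
        live -= set(defs)
        for u in uses:
            if u not in live:
                last_uses.append((i, u))
        live |= set(uses)
    return last_uses

# r13 is defined, used once by the sub, and is not live out of the block:
block = [({"r13"}, {"r14", "r8"}), ({"r6"}, {"r13", "r9"})]
print(mark_last_uses(block, live_out={"r6", "r9", "r14", "r8"}))   # [(1, 'r13')]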

3.2. Code Generation
To generate code using the hint method suggested above, a compiler needs to be modified at two points. First, the number of free registers given to the register allocation pass becomes: total physical registers - bypass registers. This can be done by specifying fewer registers in the machine description of the processor given to the compiler. Second, registers are renamed to bypass registers. For this, a bypass register name is associated with the output of each FU. During final code generation, for operands which are marked as bypass reads, the register number is replaced by the bypass register number, as shown in figure 3. Note that no change is made to the write register address, so the value is still updated in the register file and can be used by another operation in any subsequent cycle. However, when the source of a bypass operand is marked as last use, the write to the register file can be omitted by giving a bypass register as the destination register address. Changing the register number in the read/write automatically leads to no read/write from the register file.
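A minimal sketch of the renaming step, assuming the scheduler has already recorded, for each bypass-read operand, the bypass register that will hold the producer's result; the instruction representation and names such as by1 are illustrative:

def rename_for_bypass(instrs, bypass_reads, last_use_writes):
    # instrs: list of dicts {"dst": "r6", "srcs": ["r13", "r9"]}.
    # bypass_reads: {(instr_index, src_reg): bypass_reg}, filled after scheduling.
    # last_use_writes: indices of instructions whose result is short lived and
    # consumed only through the bypass, so the RF write can be redirected.
    for i, ins in enumerate(instrs):
        ins["srcs"] = [bypass_reads.get((i, s), s) for s in ins["srcs"]]
        if i in last_use_writes:
            ins["dst"] = "by0"        # illustrative: write only to a bypass register
    return instrs

code = [{"dst": "r13", "srcs": ["r14", "r8"]},    # add.1 r13, r14, r8
        {"dst": "r6",  "srcs": ["r13", "r9"]}]    # sub.1 r6, r13, r9
print(rename_for_bypass(code, {(1, "r13"): "by1"}, last_use_writes=set()))
# sub.1 now reads by1 instead of r13, as in figure 3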

Figure 4(b). Modified bypass control: for each operand, an m-output decoder drives a 32-bit m:1 multiplexer (n such decoders and multiplexers) together with a register file inhibit bit

Figure 4. Bypass control for one operand of a functional unit

4. Micro-architecture Modification
With the new encoding scheme for bypass access, the bypass control unit becomes simpler. In a standard implementation of bypass control, the address of each bypass source is compared with the addresses of the current operands. The number of comparisons, in general, is m * n, where m is the number of bypass sources and n is the number of destinations. In the suggested approach these comparisons are not required, because the bypass register number is given in the instruction itself. The bypass control unit in this case consists of a decoding circuit which expands the information given for each operand. For example, 8 bypass registers require three address bits, so in place of 8 address comparisons a 3-to-8 decoder is needed per operand. To inhibit the register read/write, a signal is generated whenever an operand is read from the bypass. This signal is produced by a combinational circuit based on the address space used by the bypass registers. For example, in a 64-register RF with addresses 56 to 63 used for bypass registers, the three most significant bits of the operand address can be ANDed to generate the register file read/write inhibit signal.

4.1. Area Saving
To study the area savings, we used a parametrized RTL model of the VLIW processor core. This core contains the dispatch-decode and execute units of the processor; the register file and caches are outside the core and are not modeled in RTL. In the first approach, with no compiler hints, all address comparisons are done in hardware: one bit is generated from each comparison, indicating whether the operand is to be read from that bypass path, and such bits are produced for every bypass path and every operand. In the second case, the bypassing information is available in encoded form and a circuit is required to decode it. To obtain area values, the VHDL design was synthesized using Synopsys Design Compiler with 0.18um UMC libraries. The area reported is logic area only, with no routing area included. As discussed earlier, the number of comparisons increases with the number of functional units, so the advantage of compiler hints is expected to be larger when there are more issue slots. Table 1 shows the area of the VLIW processor cores under the two approaches, i.e., full bypass with no hints and full bypass with hints. From the table it can be seen that area gains of the order of 4% of the total core area result from the savings in bypass control overhead.

Table 1. Comparison of area values with/without bypass hints

Issue width | Full bypass, no hints (um^2) | Full bypass with hints (um^2) | % Area saved
3           | 3.28e5                       | 3.21e5                        | 2.03
4           | 3.62e5                       | 3.49e5                        | 3.43
5           | 6.42e5                       | 6.10e5                        | 4.9
6           | 7.29e5                       | 7.08e5                        | 2.89
7           | 8.37e5                       | 8.08e5                        | 3.46
8           | 9.36e5                       | 8.99e5                        | 4.0

5. Experiments and Results

5.1. Bypass Reads and Multi-stage Bypass Usage
Experiments were set up to quantify the usage of the bypass paths, including the usage of the pipelined bypass registers. The base processor configuration for these experiments is the same as in the previous experiment, i.e., a VLIW processor with 4 issue slots, 2 integer units, 2 memory units and one branch unit, each with single-cycle latency. The Trimaran infrastructure was used to simulate the mediabench benchmarks, and the run-time instruction trace generated by the Trimaran simulator was examined. From the trace, the total number of reads, the total number of operations, and the numbers of reads from bypass, bypass1 and bypass2 were calculated. Bypass reads were counted only within basic blocks, since the compiler too cannot detect bypass usage beyond a basic block. The bypass, bypass1 and bypass2 notations are the same as in figure 1. In this experiment we also counted the register file writes which can be avoided because their values are short lived and read from the bypass registers.
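A minimal sketch of this trace classification, assuming the trace records, for each source operand, the read cycle, the producer's write cycle and whether the producer lies in the same basic block; this trace format is an assumption, not Trimaran's actual output:

from collections import Counter

def classify_reads(trace):
    # trace: list of (read_cycle, write_cycle, same_basic_block) tuples,
    # one per source operand. A producer-consumer distance of 1 cycle
    # maps to bypass, 2 to bypass1, 3 to bypass2 (figure 1); anything
    # else, or a producer outside the basic block, comes from the RF.
    counts = Counter()
    for read_cycle, write_cycle, same_bb in trace:
        dist = read_cycle - write_cycle
        if same_bb and dist == 1:
            counts["bypass"] += 1
        elif same_bb and dist == 2:
            counts["bypass1"] += 1
        elif same_bb and dist == 3:
            counts["bypass2"] += 1
        else:
            counts["rf"] += 1
    return counts

print(classify_reads([(5, 4, True), (6, 4, True), (9, 4, True), (7, 1, False)]))
# rf: 2, bypass: 1, bypass1: 1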

The results of the experiments are shown in figures 5 and 6. Between 38% and 60% of the operands are available from the bypass path. The bypass1 path provides up to 11% of the operands, and bypass2 provides less than 5%, except for the mesatexgen application. Thus most operands are consumed either immediately (in the next cycle) or after waiting one more cycle. From figure 6 it can be observed that 44-80% of the register file writes become redundant due to reading from the bypass registers; if the writes avoided through bypass1 are also included, the total ranges from 54% to 80%. Note that in the base architecture each functional unit has a latency of one cycle, so all the bypass reads and avoided writes observed in the bypass1 latch are purely due to scheduling and operand lifetimes, not due to architectural constraints.

Figure 5. Bypass usage for mediabench benchmarks (percentage of reads served from bypass, bypass1 and bypass2 for each benchmark)

Figure 6. Avoided writes for mediabench benchmarks (percentage of register file writes avoided through bypass, bypass1 and bypass2 for each benchmark)

5.2. Energy Savings
In the proposed architecture, power is saved mainly by not reading the register file whenever data is available in the bypass registers. Energy is also saved by reducing the number of register file writes, by not writing values which are short lived and are read from the bypass. The total register file energy is computed from the number of reads/writes and their corresponding per-access energy. We counted the total number of register file reads and writes, as well as the bypass reads and avoided writes, by analyzing execution traces. The energy per read/write was obtained from the CACTI 4.0 SRAM energy model. The energy savings for the different benchmarks are shown in figure 7. From the bar chart it can be observed that the energy savings in the register file range from 29% to 45%; by additionally avoiding redundant writes, the savings range from 40% to 62%. Note that these figures do not include the savings due to the multi-stage bypass, i.e., bypass1 and bypass2.
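A minimal sketch of how the savings are computed from these counts; the per-access energies and trace counts below are placeholders, not the CACTI 4.0 values or measured traces used in the paper:

def rf_energy_saving(reads, writes, bypass_reads, avoided_writes,
                     e_read, e_write):
    # Baseline: every operand read and every result write hits the RF.
    baseline = reads * e_read + writes * e_write
    # With hints: bypass reads skip the RF; short-lived results whose
    # only consumers read them from the bypass also skip the RF write.
    with_hints = ((reads - bypass_reads) * e_read
                  + (writes - avoided_writes) * e_write)
    return 1.0 - with_hints / baseline

# Placeholder counts and energies, for illustration only:
print(rf_energy_saving(reads=1_000_000, writes=600_000,
                       bypass_reads=500_000, avoided_writes=350_000,
                       e_read=1.0, e_write=1.2))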

Figure 7. Energy saved in register file (percentage of register file energy saved per benchmark, shown for avoided bypass reads alone and for avoided bypass reads and writes together)

To calculate the power consumed in the bypass control circuit, a VCD trace was generated by simulating the VLIW processor core, and Synopsys PrimePower was used to compute the power consumption. We observed that the power consumed in the bypass control is a small fraction of the overall power, so no significant additional power is saved by the smaller bypass control.

6. Conclusions
In this work we have presented a compiler-based approach which reduces the energy of the register file and the area of the processor core without affecting the clock period. Using the register file address space, the compiler inserts hints that specify from which bypass register an operand will be read. Register file reads are disabled when an operand is read from the bypass hardware, which gives 40-60% energy savings in the register file. Because of the compiler's hints, the bypass control hardware is much simpler, leading to a 2-5% smaller VLIW core area. By exploiting the transient behavior of operands, further energy savings are expected if results are kept in the pipeline for one more cycle. We propose to investigate how the energy savings can be improved further by tuning instruction scheduling and register allocation to fully exploit the bypass architecture.

References
[1] J. Gray, A. Naylor, A. Abnous, and N. Bagherzadeh. Viper: A VLIW integer microprocessor. IEEE Journal of Solid-State Circuits, December 1993.
[2] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria. Instruction-level power estimation for embedded VLIW cores. In CODES, 2000.
[3] M. Sami et al. Low-power data forwarding for VLIW embedded architectures. IEEE Transactions on VLSI Systems, 10(5), May 2002.
[4] Krste Asanovic, Mark Hampton, Ronny Krashinsky, and Emmett Witchel. Power Aware Computing. Kluwer Academic/Plenum Publishers, June 2002.
[5] I. Park, M. D. Powell, and T. N. Vijaykumar. Reducing register ports for higher speed and lower energy. In International Symposium on Microarchitecture, pages 171-182, 2002.
[6] P. S. Ahuja, D. W. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor pipelines. In Proceedings of the International Symposium on Microarchitecture, 1995.
[7] Kevin Fan et al. Systematic register bypass customization for application-specific processors. In ASAP, June 2003.
[8] A. Shrivastava, N. Dutt, A. Nicolau, and E. Earlie. PBExplore: A framework for compiler-in-the-loop exploration of partial bypassing in embedded processors. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2005.
[9] S. Park et al. Bypass aware instruction scheduling for register file power reduction. In Proceedings of the Conference on Languages, Compilers, and Tool Support for Embedded Systems, 2006.
[10] Manjunath Kudlur, K. Fan, M. Chu, Rajiv Ravindran, N. Clark, and S. Mahlke. FLASH: Foresighted latency-aware scheduling heuristic for processors with customized datapaths. In CGO, 2004.
[11] E. Ozer et al. A fast interrupt handling scheme for VLIW processors. In PACT, 1998.
[12] Trimaran compiler infrastructure. www.trimaran.org.
