Decoupled State-Execute Architecture
Miquel Pericàs 1,2, Adrián Cristal 2, Ruben González 1, Alex Veidenbaum 3, and Mateo Valero 1,2
1 Computer Architecture Department, Technical University of Catalonia (UPC), Jordi Girona, 1-3, Mòdul D6 Campus Nord, 08034 Barcelona (SPAIN) {mpericas,adrian,gonzalez,mateo}@ac.upc.edu
2 Computer Sciences, Barcelona Supercomputing Center (BSC), Jordi Girona, 29, Edifici Nexus-II Campus Nord, 08034 Barcelona (SPAIN)
3 Department of Computer Science, University of California (UCI), 3019 Donald Bren Hall, Irvine, CA 92697-3435 (USA)
[email protected]
Abstract. The majority of register file designs follow one of two well-known approaches. Many modern high-performance processors (POWER4 [1], Pentium 4 [2]) use a merged register file that holds both architectural and rename registers. Other processors (e.g., Opteron [3]) use a Future File, with rename registers kept separately in reservation stations. Both approaches have issues that may limit their application in future microprocessors. The merged register file scales poorly in terms of power-performance, while the Future File pays a large penalty on branch mis-prediction recovery. In addition, the Future File requires the use of reservation stations, a less scalable mechanism. This paper proposes to combine the best aspects of the traditional Future File architecture with those of the merged physical register file. The key point is that the new architecture separates the processor state, in particular the registers, from the execution units in the pipeline back-end. It is therefore called the Decoupled State-Execute Architecture. The resulting register file can be accessed in the pipeline front-end and has several desirable properties that allow efficient application of several optimizations, most notably register file banking and a novel writeback filtering mechanism. With aggressive banking, only a 1.0% IPC degradation was observed, and the new writeback filtering technique further lowered the energy consumption. Together, the two optimizations remove approximately 80% of the energy consumed in the register file data array.
J. Labarta, K. Joe, and T. Sato (Eds.): ISHPC 2005 and ALPS 2006, LNCS 4759, pp. 68–78, 2008. © Springer-Verlag Berlin Heidelberg 2008
1 Introduction
Memory structures in microprocessors are one of the main sources of energy consumption [4]. As a consequence, great care has to be taken when designing structures that require large amounts of memory. One such structure is the large register file of modern out-of-order processors that use register renaming [5]. There are several approaches to intermediate value storage in registers in such dynamically scheduled architectures. One alternative is to implement a so-called merged register file, an approach first followed by the ES/9000 [6] and further developed by the R10000 [7]. The merged register file has to supply values for both computation and mis-prediction recovery. It is typically accessed after an instruction is scheduled to execute, even if source operand values were available much earlier. As a result, this file needs to be both large and heavily multiported, which increases its energy consumption.
The alternative is to use a Future File. In this approach the future file, whose size equals that of the logical register file, is kept in the pipeline front-end, while the rename registers correspond to storage in reservation stations. The future file contains the most recent values assigned to logical registers. The use of future registers is thus quite energy efficient. However, in the case of a branch mis-prediction, the architectural state must be recovered using the architectural register file at commit. With today's large memory latencies this approach can suffer a large IPC loss.
The Future File approach can be improved by providing direct access to the registers required for recovery. The architecture proposed here uses a single register file that contains all physical registers but is located in the front-end. Mis-prediction recovery can thus be done using a rename map stack, which checkpoints the rename map on each instruction that may require recovery. The new register file is called the Front-end Physical Register File, or FPRF. As the source operand registers of an instruction are renamed, it can be determined whether a source register holds a computed value. The FPRF is read only in this case, which significantly reduces its access frequency. The remaining source operand values come directly from executing instructions via reservation stations, which are also required in this architecture. With the register file in the front-end, the new architecture is called the Decoupled State-Execute Architecture, or DSE.
Due to the lower access frequency, the front-end register file can be large yet much more energy efficient. And because mis-prediction recovery is now fast, the DSE has better power-performance characteristics than the traditional approaches. The new register file organization is also more amenable to two important optimizations. Register file banking can be implemented easily, both because registers are accessed in the front-end and because the access frequency is reduced. In addition, an optimization that filters unnecessary writebacks to the register file can be performed efficiently in the DSE.
2 Related Work
The body of related work on register file design optimization is large. Many papers have proposed reorganizing the register file architecture to reduce the number of ports and thus the energy [8, 9, 10]. Other techniques have also been used to reorganize the register file. For instance, it is possible to distribute the register file based on the significance of register values [11]. Multilevel register files have also been proposed to reduce latency and save energy [12, 13, 14]. Clustered register files [15, 16, 17] have been used for the same reasons.
The Future File was proposed by Pleszkun and Smith in their 1985 work on precise exceptions [18]. The original proposal only provided operands to instructions via a logical register file in the front-end, hence the name Future File. More recent proposals for Future File design are capable of reading operands from both the Future File and an additional architectural register file, which stores committed values [19]. This is especially useful after an exception or mis-prediction, when a precise instruction state needs to be recovered, as it avoids having to reconstruct the register state from the ROB.
3 The Decoupled State-Execute Architecture
This section describes the DSE in more detail. The DSE pipeline attempts to provide an instruction with source operands as early as possible. Similar to the Future File, it provides available source register values in the pipeline front-end. However, in the DSE approach the registers are accessed after being remapped to a large physical register space. This has two implications:
1. Access to computed values in the front-end needs to be delayed until the rename stage has completed.
2. The number of registers in the front-end can be much larger than in the Future File.
Figure 1 shows the DSE microarchitecture.
[Figure 1: block diagram with the labels Rename MAP, Rename Stack, ARBITRATE, BANK 0 to BANK 3, FPRF, Integer Queue, FP Queue, Load/Store Queue, and FUs; arrows denote INSTRUCTION FLOW and WRITEBACK]
Fig. 1. The Decoupled State-Execute Architecture
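The front-end operand read described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions, not the authors' simulator: the class name `FrontEnd`, the identity initial mapping, the single ready bit per logical register, and the omission of register freeing and checkpointing are all choices made for illustration. It shows the one property the text relies on: the FPRF is read only for sources whose value has already been computed, and all other sources are passed on as tags to wait for in the reservation station.

```python
class FrontEnd:
    """Toy model of the DSE front-end operand read (illustrative names only)."""

    def __init__(self, num_logical=32, num_physical=160):
        # Assumption: logical register i starts mapped to physical register i.
        self.rename_map = list(range(num_logical))
        # One bit per logical register: does its current mapping hold a value?
        self.ready = [True] * num_logical
        self.free_list = list(range(num_logical, num_physical))
        self.fprf = [0] * num_physical
        self.fprf_reads = 0  # counts actual FPRF read accesses

    def dispatch(self, dest, srcs):
        """Rename one instruction and read only the sources that are ready."""
        operands = []
        for logical in srcs:
            phys = self.rename_map[logical]
            if self.ready[logical]:
                # Value already computed: read it from the FPRF now.
                self.fprf_reads += 1
                operands.append(("value", self.fprf[phys]))
            else:
                # Value still in flight: the reservation station waits for it.
                operands.append(("tag", phys))
        # Allocate a new physical register for the destination.
        new_phys = self.free_list.pop(0)
        self.rename_map[dest] = new_phys
        self.ready[dest] = False
        return new_phys, operands

    def writeback(self, phys, value):
        """Write a result and mark the logical register ready if still mapped."""
        self.fprf[phys] = value
        for logical, mapped in enumerate(self.rename_map):
            if mapped == phys:
                self.ready[logical] = True


fe = FrontEnd()
fe.fprf[1] = 7
print(fe.dispatch(dest=2, srcs=[1, 3]))  # both sources ready: two FPRF reads
print(fe.dispatch(dest=4, srcs=[2, 1]))  # r2 is in flight: one tag, one read
```

In this sketch the second instruction reads only one operand from the FPRF, because its other source is produced by the instruction dispatched just before it.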
Figure 2 shows the DSE pipeline. The total pipeline length of the DSE microarchitecture is one stage longer than it would have been without the Front-End Physical Register File. The FPRF access in the front-end requires two stages: arbitrate and operand read. The former stage is necessary to implement register file banking.
The DSE architecture maintains a bit for each logical register to mark whether it contains a computed value. An FPRF read is initiated only if the desired source operand value has been computed (is available). The arbitration stage (ARB) checks the source register designators of the N instructions that access the FPRF in the same cycle for bank conflicts. When a conflict is detected, all stages before the FPRF stage stall. Arbitration priority is given to older instructions to make sure that the front-end does not dispatch instructions out of order.
[Figure 2: pipeline diagram with the stages FETCH, DECODE, RENAME, ARB, FPRF, QUEUE, ISSUE, EXE, WB, COMMIT and the structures ICACHE, MAP, FPRF, IQ, FU, ROB]
Fig. 2. The Pipeline of the DSE Architecture
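The arbitration check can be sketched as follows. This is a minimal illustration, not the logic used in the evaluated design: the function name `arbitrate`, the bank selection by `reg % num_banks`, and the convention of returning how many of the oldest instructions may proceed are assumptions. The optional `read_sharing` flag anticipates the read sharing technique of Sec. 3.1, where several reads of the same register use one port.

```python
def arbitrate(group, num_banks=8, read_ports=2, read_sharing=True):
    """Return how many instructions of `group` may read the FPRF this cycle.

    `group` lists, oldest instruction first, the physical registers each
    instruction wants to read from the FPRF. Instructions are granted in
    order; the first one that would exceed a bank's read ports stalls,
    together with every younger instruction.
    """
    ports_used = [0] * num_banks
    granted_regs = set()          # registers already being read this cycle
    granted = 0
    for srcs in group:
        needed = [0] * num_banks
        new_regs = set()
        for reg in srcs:
            if read_sharing and (reg in granted_regs or reg in new_regs):
                continue          # share the port that already reads `reg`
            new_regs.add(reg)
            needed[reg % num_banks] += 1   # assumed bank interleaving
        if any(ports_used[b] + needed[b] > read_ports for b in range(num_banks)):
            break                 # bank conflict: stall this and younger ones
        for b in range(num_banks):
            ports_used[b] += needed[b]
        granted_regs |= new_regs
        granted += 1
    return granted


# Registers 0, 8 and 16 all fall into bank 0 under the assumed interleaving.
print(arbitrate([[0, 8], [16, 1]]))                      # second one conflicts
print(arbitrate([[0, 8], [0, 1]], read_sharing=True))    # register 0 is shared
print(arbitrate([[0, 8], [0, 1]], read_sharing=False))   # register 0 conflicts
```

Granting a prefix of the group, rather than any conflict-free subset, is what keeps dispatch in program order.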
Next, an instruction and the available operand values it read from the FPRF are inserted into an appropriate reservation station. A single, centralized reservation station can be used for all instructions; it can be implemented as an instruction queue in which the available source operand values are stored in the queue's payload RAM. This is the Queue stage.
As mentioned above, the DSE architecture has a lower access rate to the FPRF than a standard architecture has to a back-end physical register file, since only some source operands are available at this point in instruction execution. The number of integer operands obtained from the FPRF was observed to be near 40% of all required integer operands, while for floating-point operands this number decreases to around 20% (for the SPEC2000 benchmarks and the architecture described in Sec. 4). This has the potential to reduce the number of banks as well as the number of ports per bank in the FPRF. Note, however, that there need to be at least two read ports per bank so that an instruction can obtain both operands from the same bank without stalling.
In the event of a branch mis-prediction the DSE architecture behaves exactly like the MIPS R10000. The processor aborts all instructions along the mis-predicted path, restores the register mapping from the branch stack, and starts fetching instructions from the correct path.
3.1 Read Sharing
Accesses to the same logical register often appear several times in a short instruction sequence. For instance, code that manipulates objects on the stack normally sources the stack pointer register many times. Such register accesses cannot be distributed among different banks. When they appear, there is a high probability that a register access will result in a bank conflict. Such conflicts in the FPRF access can be reduced by using a technique known as read sharing [13]. Read sharing allows multiple reads of the same register by concurrently issued instructions to share a single local port. The impact of read sharing will be evaluated for the FPRF architecture.
3.2 Writeback Filtering
Some of the values written to physical registers may never be used in the future. Such values do not actually need to be written to the register file. There are two types of register values that need to be written: 1) values updating the state of an architectural register, and 2) register values that are needed in case of branch mis-prediction recovery. For instance, consider a logical register that is renamed twice in a short interval. The physical register allocated to the first instruction may not appear in any of the current mappings, and its value will not be needed by any future instruction. If this can be detected, then the write to this register can be eliminated. This is Writeback Filtering.
To implement Writeback Filtering the processor needs to check the mapped registers in all rename checkpoints plus the current mapping and decide whether a register writeback is necessary. Checkpoints need to be taken at all instructions that may cause a replay. There are many such instructions, but the vast majority are conditional branches and load operations. Registers that are not referenced anywhere are candidates to be filtered out during writeback. Writeback filtering can be integrated into the renaming logic to detect short-lived registers. A short-lived register is a register that is not referenced by any checkpoint in the rename stack (including the current rename map). This can be detected by computing an OR of all the rename maps.
In practice, the number of loads that need to be replayed (due to ordering violations) is very small, while the total number of loads is very large. Checkpointing all loads is thus expensive and very inefficient. The DSE architecture reduces the number of checkpoints by associating loads with older checkpoints. In case of a replay, execution restarts from an older point, but this is very infrequent and has almost no impact on performance. These older checkpoints will usually correspond to branches, although it is possible that a load is the oldest operation in the processor and that a previous branch in the instruction slice has already committed. To avoid checkpointing any loads, the architecture keeps an additional checkpoint associated with the last branch that has committed. This increases the size of the rename stack by one entry, instead of the large increase that storing all mappings due to loads would have led to.
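The short-lived register test above (an OR over all rename maps) can be sketched in a few lines of Python. This is an illustration under assumptions, not the hardware implementation: each rename map is modeled as a list of physical register numbers indexed by logical register, the OR is taken over one-hot bit vectors held in a Python integer, and the names `referenced_mask` and `needs_writeback` are hypothetical. The checkpoint list is assumed to include the extra checkpoint kept for the last committed branch.

```python
def referenced_mask(rename_maps):
    """OR together the physical registers referenced by every rename map.

    Bit p of the result is set if physical register p appears in at least
    one of the maps (current mapping or any checkpoint).
    """
    mask = 0
    for rename_map in rename_maps:
        for phys in rename_map:
            mask |= 1 << phys
    return mask


def needs_writeback(phys, current_map, checkpoints):
    """A result must be written to the FPRF only if some map references it."""
    mask = referenced_mask([current_map, *checkpoints])
    return bool((mask >> phys) & 1)


# Logical register 1 is renamed twice after the only checkpoint was taken:
# first to physical register 33, then to physical register 34.
checkpoint = [0, 1, 2, 3]      # mapping saved at an older branch
current_map = [0, 34, 2, 3]    # mapping after the second rename

for phys in (1, 33, 34):
    print(phys, needs_writeback(phys, current_map, [checkpoint]))
```

Here physical register 33 is short-lived: it is referenced neither by the checkpoint nor by the current mapping, so its writeback can be filtered, while registers 1 and 34 must still be written.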
4 Experimental Setup
The FPRF architecture was evaluated using a modified execution-driven simulator based on SimpleScalar [20]. The simulator executes binaries compiled for the Alpha ISA. The entire SPEC2000 suite was compiled using the Compaq/Digital cc compiler with the "-O2" optimization level. 100 million committed instructions are selected using SimPoint [21] and simulated. Results for the DSE architecture are presented and compared to the baseline out-of-order microarchitecture as well as to the banked, multi-ported register file described in [22]. The common parameters of all architectures are shown in Table 1.
Table 1. Common architecture parameters for all configurations
Parameter                     Value
Fetch/Issue/Commit Width      4 instructions/cycle
Branch Predictor              Combined bimodal + 2-level
I-L1 size                     32 KB, 4-way, 1 cycle latency
D-L1 size                     32 KB, 4-way, 2 rd/wr ports, 2 cycle latency
D-L2 size                     256 KB, 4-way, 2 rd/wr ports, 11 cycle latency
Memory Bus Width              32 bytes
Ports to the Register File    8 Read & 4 Write
Reorder Buffer Size           128
Memory latency                100
Integer Physical Registers    160
FP Physical Registers         160
Load/Store Queue              128 entries
Integer Queue                 32 entries
FP Queue                      32 entries
Integer Functional Units      4 (latency 1)
FP Functional Units           4 (latency 2)
To evaluate the DSE architecture the 5 configurations described below were studied (they are summarized in Table 2).
1. NON-BANKED baseline configuration: a processor with a fully-ported (8Rd/4Wr), centralized physical register file without banking. It has a four-stage front-end and a five-stage back-end pipeline with operand access after ISSUE.
2. NON-BANKED-WBF: NON-BANKED configuration with writeback filtering.
3. NON-BANKED-LONG: NON-BANKED configuration but with an additional stage in the front-end. This model has the same branch mis-prediction penalty as the DSE. Introducing this model allows us to evaluate how much IPC our proposal loses just due to banking stalls in the front-end.
4. DSE-BANKED: the DSE architecture with the FPRF, which has 8 banks and uses read sharing. Each bank has 2 read and 2 write ports. The banked register file is similar to the one in Tseng et al. [22].
5. DSE-BANKED-WBF: the DSE architecture with the DSE-BANKED configuration and writeback filtering.
Table 2. Configuration summary
Configuration      #Banks  Read Ports  Write Ports  Pipeline  Writeback
                           per Bank    per Bank     Length    Filtering
NON-BANKED         1       Unlimited   Unlimited    9         NO
NON-BANKED-WBF     1       Unlimited   Unlimited    9         YES
NON-BANKED-LONG    1       Unlimited   Unlimited    10        NO
DSE-BANKED         8       2           2            10        NO
DSE-BANKED-WBF     8       2           2            10        YES
In addition to these five configurations the model described in [22] was also implemented. Our implementation performed around 2% better than the original one as reported in that paper.
5 Performance Evaluation
This section reports on two major metrics of interest: IPC and energy consumption. Writeback Filtering is further analyzed in Sec. 5.3.
5.1 IPC
The instruction-level parallelism delivered by the DSE architecture, measured as IPC, is the focus of this section. For simplicity, only the SPECINT and SPECFP average IPC for all configurations is shown in Fig. 3.
[Figure 3: bar chart; y-axis: IPC (1.5 to 2.3); groups: SPECINT, SPECFP; series: NON-BANKED, NON-BANKED-WBF, NON-BANKED-LONG, DSE-BANKED, DSE-BANKED-WBF]
Fig. 3. Average IPC
The figure shows that the slowdown due to banking or the use of the FPRF is very small. Only a 1.12% average slowdown is observed for SPECINT. For SPECFP the losses are even lower, less than 0.85%. The reason for the small slowdowns is that the penalty due to the addition of two new stages in the front-end of the architecture is largely hidden by the latency tolerance of the out-of-order execution back-end. The addition of an arbitration stage in the back-end has a higher impact on IPC: our implementation of a "standard" banked register file (per [22]) resulted in an average IPC loss of over 2%.
The writeback filtering technique has little impact on performance; the difference is less than 0.1%. Writeback filtering prevents writes in the writeback stage, which only reduces traffic on result buses and write ports. These are not a bottleneck in the simulated architectures.
Finally, the figure shows that the banked architectures are very close to the non-banked architectures with the same pipeline length. This implies that, even with 2 read and 2 write ports per bank, the number of conflicts is very small. The number of conflicts in FPRF access has been measured. For the reads to the FPRF in SPECINT, 0.18% of all accesses resulted in a conflict. This is roughly 1 conflict for every 550 accesses. For SPECFP this number is a bit larger, 0.5%. The reason it is higher for SPECFP is that most FP instructions read two sources while many integer operations read only a single source.
5.2 Energy Consumption
One of the main benefits of banking is that it reduces the energy consumption in the register file. This section evaluates the energy requirements of the data arrays in the FPRF. The energy consumption in the register file is modeled per Rixner et al. [23]. The model shows that an access in the FPRF with 160 registers, 8Rd/4Wr ports, and no banking consumes 4.32 times more energy than an access in the banked FPRF (8 banks, 2Rd/2Wr ports per bank). The total FPRF access energy is then computed using the total number of FPRF accesses obtained in simulations. The results are shown in Fig. 4, averaged over all SPEC2000 benchmarks.
[Figure 4: bar chart; y-axis: relative amount of energy in the FPRF compared to NON-BANKED (0 to 100); x-axis: configurations NON-BANKED, NON-BANKED-WBF, NON-BANKED-LONG, DSE-BANKED, DSE-BANKED-WBF]
Fig. 4. Relative Energy Consumption
The impact of banking and writeback filtering on the FPRF energy consumption is clearly visible: the two techniques combined reduce the energy by 81.7%. Banking alone reduces the energy by 76%. Writeback filtering alone reduces it by about 20%. The latter has a smaller impact than banking because writeback filtering can only remove energy due to writes, while banking reduces the energy of both reads and writes.
In this discussion of energy consumption we have not analyzed components other than the FPRF. But the FPRF is not the only source of registers in this architecture. The simulated DSE architecture features a centralized reservation station with 32 entries. This structure requires four write ports (driven by CAMs) and a series of read ports used to issue the instructions. Energy-wise it can be expensive. However, the instruction queue can also be implemented as a set of distributed reservation stations, a scheme in which each reservation station is attached to a single functional unit. This scheme adds a little complexity and requires somewhat more buses to drive operands around, but its use of very small structures (fewer entries, fewer read/write ports) allows it to be very small and to consume little energy. For example, if four reservation stations are used, giving each one 8 entries and 2 write ports / 1 read port is still enough to support highly parallel execution.
5.3 Writeback Filtering
Fig. 5 shows the percentage of writebacks that were actually filtered in the banked FPRF, averaged over all SPECINT and all SPECFP benchmarks. For SPECINT, writeback filtering reduces the number of writebacks by approximately 29%. In the SPECFP benchmarks, the number of integer writes is reduced by 26% and the number of floating-point writes by 48%. The results show that more FP registers are no longer part of the state when they are written back and can therefore be filtered out.
[Figure 5: bar chart; y-axis: number of write accesses to the FPRF relative to BANKED-INT (SPECFP), 0 to 110; groups: SPECINT, SPECFP; series: BANKED-INT, BANKED-FP, BANKED-WBF-INT, BANKED-WBF-FP; bar annotations: 29%, 26%, 48%]
Fig. 5. The impact of Writeback Filtering in the FPRF
6 Conclusions
This paper proposes the Decoupled State-Execute Architecture (DSE), which combines the best features of the Future File and the centralized physical register file. The DSE places the physical register file in the processor front-end. This architecture decouples the processor state in registers from the execution back-end. It has been shown that this architecture allows a significant power-performance improvement: a very small IPC loss and a large energy reduction in register file access are achieved. The main reasons are the move of register access arbitration to the front-end and a large reduction in register file access frequency (only values available early are read). Two optimizations are applied to the DSE architecture: register file banking with read sharing and writeback filtering. Banking results in a minimal IPC loss, considerably less than in a previous proposal where the physical register file is in the back-end. This is due to a) lower access frequency and fewer conflicts and b) the reduced impact of arbitration when it is performed in the front-end.
Acknowledgments
This work has been supported by the Ministry of Science and Technology of Spain under contract TIN–2004–07739–C02–01 and the HiPEAC European Network of Excellence under contract IST-004408. This work was supported in part by the National Science Foundation under grant CNS–0220069.
References
1. Tendler, J., Dodson, S., Fields, S., Le, B.S.H.: Power4 system microarchitecture. IBM Journal of Research and Development 46(1) (2002)
2. Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., Roussel, P.: The microarchitecture of the Pentium 4 processor. Intel Technology Journal (2001)
3. Keltcher, C., McGrath, K., Ahmed, A., Conway, P.: The AMD Opteron processor for multiprocessor servers. IEEE Micro 23, 66–76 (2003)
4. Gowan, M.K., Biro, L.L., Jackson, D.B.: Power considerations in the design of the Alpha 21264. In: Proc. of the 35th Design Automation Conference (1998)
5. Tomasulo, R.M.: An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 25–33 (January 1967)
6. Liptay, J.: Design of the IBM Enterprise System/9000 high-end processor. IBM Journal of Research and Development 36(4) (July 1992)
7. Yeager, K.C.: The MIPS R10000 superscalar microprocessor. IEEE Micro 16, 28–41 (1996)
8. Zyuban, V., Kogge, P.: The energy complexity of register files. In: Intl. Symp. on Low Energy Electronics and Design, pp. 305–310 (1998)
9. Park, I., Powell, M.D., Vijaykumar, T.: Reducing register ports for higher speed and lower energy. In: Proc. of the 35th Annual Intl. Symposium on Microarchitecture (December 2002)
10. Kim, N.S., Mudge, T.: Reducing register ports using delayed write-back queues and operand pre-fetch. In: Proc. of the 17th ACM Intl. Conf. on Supercomputing (June 2003)
11. Gonzalez, R., Cristal, A., Ortega, D., Veidenbaum, A., Valero, M.: A content aware integer register file organisation. In: Proc. of the 31st Intl. Symp. on Computer Architecture (2004)
12. Cruz, J., González, A., Valero, M., Topham, N.: Multiple-banked register file architecture. In: Proc. of the 27th Intl. Symp. on Computer Architecture, pp. 316–325 (2000)
13. Balasubramonian, R., Dwarkadas, S., Albonesi, D.: Reducing the complexity of the register file in dynamic superscalar processors. In: Proc. of the 34th Intl. Symp. on Microarchitecture (2001)
14. Zalamea, J., Llosa, J., Ayguadé, E., Valero, M.: Two-level hierarchical register file organization for VLIW processors. In: Proc. of the 33rd Intl. Symp. on Microarchitecture (MICRO-33), pp. 137–146 (2000)
15. Palacharla, S., Jouppi, N., Smith, J.: Complexity-effective superscalar processors. In: Proc. of the 24th Intl. Symp. on Computer Architecture (1997)
16. Kessler, R.: The Alpha 21264 microprocessor. IEEE Micro 19 (March 1999)
17. Seznec, A., Toullec, E., Rochecouste, O.: Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors. In: Proc. of the 35th Intl. Symp. on Microarchitecture, pp. 383–394 (2002)
18. Smith, J.E., Pleszkun, A.R.: Implementation of precise interrupts in pipelined processors. In: Proc. of the 12th Intl. Symp. on Computer Architecture, pp. 34–44 (1985)
19. Johnson, M.: Superscalar Microprocessor Design. Prentice-Hall, Englewood Cliffs (1990)
20. Austin, T., Larson, E., Ernst, D.: SimpleScalar: an infrastructure for computer system modeling. IEEE Computer (2002)
21. Perelman, E., Hamerly, G., Biesbrouck, M.V., Sherwood, T., Calder, B.: Using SimPoint for accurate and efficient simulation. In: Proc. of the Intl. Conf. on Measurement and Modeling of Computer Systems (2003)
22. Tseng, J., Asanovic, K.: Banked multiported register files for high-frequency superscalar microprocessors. In: Proc. of the 30th Annual Intl. Symp. on Computer Architecture (2003)
23. Rixner, S., Dally, W.J., Khailany, B., Mattson, P.R., Kapasi, U.J., Owens, J.D.: Register organization for media processing. In: Proc. of the 6th Intl. Symp. on High Performance Computer Architecture, pp. 375–386 (2000)