
A Simple Superscalar Architecture

Crispin Cowan*, Jan Graczyk, J. Michael Bennett†, and Charles L. A. Clarke‡

Department of Computer Science and Engineering
Oregon Graduate Institute of Science and Technology
P.O. Box 91000, Portland, Oregon 97291-1000 USA
503-690-1265 FAX: 503-690-1553
[email protected] [email protected] [email protected] [email protected]

January 21, 1995

Abstract

We present a simple technique for instruction-level parallelism and analyze its performance impact. Our processor architecture economically encodes two instructions, one ALU and one load/store, into a single 32-bit instruction word. Using an existing RISC processor design as a starting point, we detail the instruction set, the pipeline design, and scheduling techniques. Implementation should require little or no additional hardware over the scalar processor, and may require less. Simulation results show up to 13% improvement in program execution time.

* Corresponding author. The work described in this paper was performed at the University of Western Ontario.
† Michael Bennett is a professor in the Computer Science Department, University of Western Ontario, London, Ontario, N6A 5B7 Canada. Jan Graczyk is a graduate of UWO.
‡ Computer Science Department, University of Waterloo, Waterloo, Ontario, N2L 3G1 Canada.


1 Introduction

Many new architectures have chosen to use a dynamic approach to issuing instructions in parallel, providing hardware that examines the instruction stream for opportunities. While effective, this approach requires a substantial investment in CPU hardware. Identifying instructions that can be issued in parallel is not simple to do at the hardware level, and this cost is reflected in the size of these devices [1, 5, 6, 11, 14].

The alternative approach has been to statically bind instructions together into long instruction words; hence the (Very) Long Instruction Word (VLIW) family of architectures. These architectures foundered on many problems, some of which included the difficulty of keeping all functional units sufficiently busy to justify their cost, and the difficulty of economically fabricating a device capable of loading wide instruction words with a satisfactory clock rate.

In this paper we present a compromise: the Narrow Instruction Word architecture (NIW). A carefully crafted instruction set encoding allows an ALU instruction to be encoded into the same 32-bit word as an independent load or store instruction. These two independent instructions, statically bound together by the compiler, are executed in parallel by the CPU. The benefit of such a design is that it reaps some of the advantages of parallel instruction execution without incurring the hardware expense of either dynamic instruction scheduling or long instruction words and multiple functional units.

The purpose of this paper is to evaluate the impact of the relatively simple NIW enhancements on a scalar architecture. In section 2, we detail the instruction set and pipeline architecture of the NIW, and show that the additional hardware required is minimal, and may even be less than a non-NIW machine. Section 3 describes the compiler optimization and scheduling techniques necessary to exploit this architecture. Section 4 presents simulation results comparing the performance of the NIW architecture to the DLX architecture, showing up to 13% performance improvement. Finally, section 5 summarizes our results and describes future research opportunities.

2 The NIW Architecture

The NIW architecture is a small modification of the DLX pipelined RISC architecture described in Hennessy and Patterson [12]. The DLX has not actually been implemented; it is a simplification of the MIPS R2000/R3000 architecture, designed for the pedagogical purposes of the textbook. Nevertheless, it is appropriate to describe the NIW as a modification of the DLX for the following reasons:

- By modifying an existing architecture, we can make direct performance comparisons without having to account for other differences.

- A cross compiler and simulator for the DLX are available, easing implementation for the NIW, so that real program measurements can be made.

- The similarity of the DLX to the MIPS R2000 makes it possible to port existing R2000 scheduling compilers to the NIW.

    I-type instruction
        Bits:   6       5    5    16
        Field:  Opcode  RS1  RD   Immediate

    R-type instruction
        Bits:   6       5    5    5    11
        Field:  Opcode  RS1  RS2  RD   Func

    J-type instruction
        Bits:   6       26
        Field:  Opcode  Offset added to PC

Figure 1: DLX Instruction Formats

In this section, we will describe the pertinent details of the DLX architecture, and the changes necessary in each case to implement the NIW architecture. The architecture presented is not actually intended for direct implementation. Instead, it is intended to measure the impact of the NIW enhancement on a scalar RISC architecture. We conclude by discussing the relative hardware cost of the NIW.

2.1 The Instruction Set

The DLX is a straightforward load/store RISC architecture. The only operands to an ALU instruction are registers and 16-bit immediate values. Load and store instructions have only one addressing mode: register indirect with 16-bit displacement. The DLX has three simple instruction formats with fixed fields, shown in figure 1, allowing fast instruction decoding and operand fetching. The I-type instructions support instructions with an immediate operand, the R-type instructions support all-register ALU operations, and the J-type instructions support branches with a 26-bit offset.

Figures 2, 3, and 4 show the DLX instruction encoding [9] for the I-type, R-type, and J-type instructions. In each table, an instruction's encoding is the sum of its column label and its row label. The R-type instructions all have an opcode of $00 (SPECIAL); the specific instruction to be executed is encoded in the Func field, shown in figure 3. Similarly, the floating point operations all have an opcode of $01 (FPARITH), and their function encodings are described in figure 4.

           $00      $08     $10    $18    $20   $28   $30
    $00    SPECIAL  ADDI    RFE    SEQI   LB    SB    SEQUI
    $01    FPARITH  ADDUI   TRAP   SNEI   LH    SH    SNEUI
    $02    J        SUBI    JR     SLTI   -     -     SLTUI
    $03    JAL      SUBUI   JALR   SGTI   LW    SW    SGTUI
    $04    BEQZ     ANDI    -      SLEI   LBU   -     SLEUI
    $05    BNEZ     ORI     -      SGEI   LHU   -     SGEUI
    $06    BFPT     XORI    -      -      LF    SF    -
    $07    BFPF     LHI     -      -      LD    SD    -

Figure 2: DLX Instruction Encoding: Main Opcodes

           $00    $08    $10    $18     $20    $28    $30
    $00    SLLI   -      SEQU   MULT    ADD    SEQ    MOVI2S
    $01    -      -      SNEU   MULTU   ADDU   SNE    MOVS2I
    $02    SRLI   -      SLTU   DIV     SUB    SLT    MOVF
    $03    SRAI   -      SGTU   DIVU    SUBU   SGT    MOVD
    $04    SLL    TRAP   SLEU   -       AND    SLE    MOVFP2I
    $05    -      -      SGEU   -       OR     SGE    MOVI2FP
    $06    SRL    -      -      -       XOR    -      -
    $07    SRA    -      -      -       -      -      -

Figure 3: DLX Instruction Encoding: Special Opcodes (Main opcode = $00)

           $00     $08      $10    $18
    $00    ADDF    CVTF2D   EQF    EQD
    $01    SUBF    CVTF2I   NEF    NED
    $02    MULTF   CVTD2F   LTF    LTD
    $03    DIVF    CVTD2I   GTF    GTD
    $04    ADDD    CVTI2F   LEF    LED
    $05    SUBD    CVTI2D   GEF    GED
    $06    MULTD   -        -      -
    $07    DIVD    -        -      -

Figure 4: DLX Instruction Encoding: Floating Point Opcodes (Main opcode = $01)

To allow a load/store instruction to be encoded along with a register-register ALU operation, we restrict the number of instructions and addressing modes to make room in the instruction encoding. The NIW removes three major classes of operations from the DLX instruction set:

1. Single Precision. Single precision floating point arithmetic is removed; double precision floating point arithmetic is retained. This eliminates a large number of similar instructions, differing only in the type of the operands, and a large number of conversion instructions.

2. Immediate Operands. Most of the ALU instructions that the DLX allows with immediate operands have been removed. This eliminates a large number of redundant instructions, differing only in the presence of an immediate operand. These instructions were not frequently used, except for addi (which the DLX uses for "load immediate value") and slli (which is used for scaling an index into two- or four-byte member tables).

3. Displacement Addressing. Load and store operations are restricted to simple register indirect. Displacement addressing is frequently used, but can be simulated at low cost by explicitly calculating addresses, due to NIW's available parallelism. See section 3.

The loss of single precision is not significant in an environment of rising expectations of precision, and has a precedent in the RS/6000 [11]. Immediate operands were removed from most instructions, but retained in those few where they were actually being used by the DLX compiler. Restricting immediate operands is just an extension of the classic RISC load/store philosophy. The loss of displacement addressing has a significant negative performance impact. However, this loss is largely mitigated through the use of parallelism, as described in section 3.

    I-type instruction
        Bits:   6       5    5    16
        Field:  Opcode  RS1  RD   Immediate

    R-type instruction
        Bits:   6       5    5    5    1    5    5
        Field:  Opcode  RS1  RS2  RD   L/S  TA   TD

    J-type instruction
        Bits:   6       26
        Field:  Opcode  Offset added to PC

Figure 5: NIW Instruction Formats

           $00     $08      $10    $18       $20     $28    $30    $38
    $00    ADD     BFPT     GED    LH        MULT    SEQ    SLEU   SRL
    $01    ADDD    BNEZ     GTD    LHI       MULTD   SEQU   SLL    SUB
    $02    ADDI    CVTD2I   J      LTD       MULTU   SGE    SLLI   SUBD
    $03    ADDU    CVTI2D   JAL    MOVD      NED     SGEU   SLT    SUBU
    $04    ADDUI   DIV      JALR   MOVFP2I   OR      SGT    SLTU   TRAP
    $05    AND     DIVD     JR     MOVI2FP   RFE     SGTU   SNE    XOR
    $06    BEQZ    DIVU     LB     MOVI2S    SB      SH     SNEU   OP RES
    $07    BFPF    EQD      LED    MOVS2I    SEQ     SLE    SRA    OP RES

Figure 6: NIW Opcode Encoding

Pruning the above instructions makes room to encode two instructions into a single 32-bit word. Figure 5 shows the modified NIW instruction format, and figure 6 lists the NIW instructions. Only the R-type has changed. The Func field has been removed, and the functions merged into the general Opcode field (made possible by the reduction in the number of instructions). There is now an L/S bit indicating whether a load or a store is to be executed in parallel with the ALU operation. The TA and TD fields specify the Transfer Address and Transfer Data registers, respectively, for the load/store instruction. Specifying R0 as both the TA and TD register inhibits the load/store operation.¹
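The R-type layout of figure 5 can be made concrete with a small packing routine. The following is a minimal sketch in Python, not the authors' implementation: the function names are hypothetical, the placement of Opcode in the most significant bits is an assumption (the paper gives only field order and widths), and so is the convention that L/S = 1 selects a store. The opcode value $00 for ADD is read from figure 6.

```python
# Field order and widths follow the NIW R-type format (figure 5).
# Assumption: fields are packed most significant first.
FIELDS = [("opcode", 6), ("rs1", 5), ("rs2", 5), ("rd", 5),
          ("ls", 1), ("ta", 5), ("td", 5)]


def encode_rtype(**values):
    """Pack the seven R-type fields into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_rtype(word):
    """Split a 32-bit instruction word back into its R-type fields."""
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def has_memory_op(fields):
    """R0 as both TA and TD inhibits the load/store half of the word."""
    return not (fields["ta"] == 0 and fields["td"] == 0)


# add r25,r14,r27 % sw (r14),r3, assuming L/S = 1 means store
word = encode_rtype(opcode=0x00, rs1=14, rs2=27, rd=25, ls=1, ta=14, td=3)
print(f"{word:032b}")
```

The field widths sum to 32 bits, so the ALU operation and the load/store share one word with no bits left over; this is why the Func field and the displacement had to be given up.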

¹ R0 is hard-wired to the value 0.

2.2 Pipeline

The five stages of the DLX pipeline are shown in figure 7. Each pipe stage takes one clock cycle to execute, and thus under optimal conditions (no stalls), the DLX issues one instruction per cycle. Measurements in section 4 show that the DLX actually executes at approximately 1.15 clocks-per-instruction.

    Stage  Description
    IF     Instruction Fetch
    ID     Instruction Decode and operand fetch (RS1 and RS2)
    EX     Execute; perform ALU ops and compute displacement addresses
    MEM    Memory access
    WB     Write Back; write RD back to register file

Figure 7: DLX Pipeline

    ALU Instruction
        IF    IMAR ← PC; IR ← Mem[IMAR]; PC ← PC + 4
        ID    A ← RS1; B ← RS2
        EX    ALUoutput ← A op B; or ALUoutput ← A op ((IR_16)^16 ## IR_16..31)
        MEM   ...
        WB    RD ← ALUoutput

    Load/Store Instruction
        IF    IMAR ← PC; IR ← Mem[IMAR]; PC ← PC + 4
        ID    A ← RS1; B ← RS2
        EX    DMAR ← A + ((IR_16)^16 ## IR_16..31); MDR ← B
        MEM   MDR ← Mem[DMAR]; or Mem[DMAR] ← MDR
        WB    RD ← MDR (loads only)

    Branch Instruction
        IF    IMAR ← PC; IR ← Mem[IMAR]; PC ← PC + 4
        ID    A ← RS1; B ← RS2
        EX    ALUoutput ← PC + ((IR_16)^16 ## IR_16..31); cond ← (RS1 op 0)
        MEM   if (cond) PC ← ALUoutput
        WB    ...

Figure 8: DLX Pipeline Usage

As seen in figure 8 [12, figure 6.4], ALU instructions do not use the MEM stage of the pipeline. Thus, if the memory instructions could be shortened, so that they only used 4 pipe stages, then the entire pipeline could be shortened. Currently, one fifth of the pipeline is idle more than 70% of the time.

There is not enough room in the NIW ALU/memory superscalar instruction to express a displacement value for the load/store address, so displacement addressing has been removed. With no displacement to add, there is no longer any need for memory instructions to use the EX stage of the pipeline. Instead, the NIW uses a four stage pipeline, as shown in figure 9. The IF and ID stages are similar to the corresponding DLX pipe stages, with the ID stage additionally decoding and fetching the TA and TD registers. The WB stage is also similar, with the TD register being written to the register file along with the RD register, as necessary. The major modification is to replace the EX and MEM stages with a combined stage called MEX, which processes the ALU component in parallel with the load/store component of a superscalar instruction. Figure 10 shows the usage of the NIW pipeline by the three classes of instructions: immediate (I-type), superscalar ALU-load/store (R-type), and branch (J-type).

    Stage  Description
    IF     Instruction Fetch
    ID     Instruction Decode and operand fetch (RS1, RS2, TA and TD)
    MEX    Execute; perform ALU ops and load/stores
    WB     Write Back; write RD and TD back to register file

Figure 9: NIW Pipeline

    ALU Immediate Instruction
        IF    IMAR ← PC; IR ← Mem[IMAR]; PC ← PC + 4
        ID    A ← RS1; B ← RS2; DMAR ← TA; MDR ← TD
        MEX   ALUoutput ← A op ((IR_16)^16 ## IR_16..31)
        WB    RD ← ALUoutput

    ALU and Load/Store Instruction
        IF    IMAR ← PC; IR ← Mem[IMAR]; PC ← PC + 4
        ID    A ← RS1; B ← RS2; DMAR ← TA; MDR ← TD
        MEX   ALUoutput ← A op B; and (MDR ← Mem[DMAR]; or Mem[DMAR] ← MDR)
        WB    RD ← ALUoutput; TD ← MDR

    Branch Instruction
        IF    IMAR ← PC; IR ← Mem[IMAR]; PC ← PC + 4
        ID    A ← RS1; B ← RS2; DMAR ← TA; MDR ← TD
        MEX   ALUoutput ← PC + ((IR_16)^16 ## IR_16..31); cond ← (RS1 op 0); if (cond) PC ← ALUoutput
        WB    ...

Figure 10: NIW Pipeline Usage
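The register transfers listed for the ALU and load/store instruction in figure 10 can be traced with a few lines of Python. This is a sketch under stated assumptions rather than a description of the authors' simulator: the function name and argument order are hypothetical, memory is modelled as a dictionary keyed by address, and the write-back order when RD and TD name the same register is an assumption, since the paper does not specify it.

```python
import operator


def execute_rtype(regs, mem, alu_op, rs1, rs2, rd, is_store, ta, td):
    """Apply one NIW R-type instruction to a register list and a memory dict."""
    # ID: every source operand is read before any result is written.
    a, b = regs[rs1], regs[rs2]
    dmar, mdr = regs[ta], regs[td]
    memory_active = not (ta == 0 and td == 0)  # R0/R0 inhibits the load/store

    # MEX: the ALU and the memory unit work side by side.
    alu_output = alu_op(a, b)
    if memory_active:
        if is_store:
            mem[dmar] = mdr
        else:
            mdr = mem[dmar]

    # WB: write TD (loads only), then RD. The order for RD == TD is an assumption.
    if memory_active and not is_store:
        regs[td] = mdr
    regs[rd] = alu_output
    regs[0] = 0  # R0 is hard-wired to zero


# add r25,r14,r27 % sw (r14),r3 with r14 = 1000, r27 = 4, r3 = 77
regs = [0] * 32
regs[14], regs[27], regs[3] = 1000, 4, 77
mem = {}
execute_rtype(regs, mem, operator.add, 14, 27, 25, True, 14, 3)
print(regs[25], mem[1000])  # prints: 1004 77
```

Because TA is read in ID, a memory operation sees the address register as it was before the paired ALU operation writes its result; the register-save sequence of section 3 relies on this.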

2.3 Hardware Requirements

Three prominent commercial processors that exploit instruction-level parallelism are the MIPS R4000 [1], the IBM/Motorola PowerPC [5], and the DEC Alpha AXP [14]. The R4000 is a superpipelined design, which uses additional hardware to provide buffering stages that allow instructions to flow through the pipeline at a multiple of the basic clock rate. The Alpha and the PowerPC are superscalar designs, and as such require additional hardware to decide at run time which instructions may be issued in parallel. The Alpha also requires additional functional units to support parallel operations of the same type. All three designs resulted in chips that use well over 1 million transistors.

In this context, the hardware requirements of the NIW design are modest. The additional hardware required to add NIW features to a DLX or R3000-like CPU is as follows:

Register File Ports

The ID stage requires two additional read ports to fetch the values of TA and TD. The WB stage requires one additional write port to write back the value of TD.

Decoding Logic

The opcode field is now densely encoded; almost all values are used, and so a full decoder will be required for this field.

Mitigating this cost is the fact that there are also significant reductions in hardware requirements implied by the NIW architecture. Unlike other LIW architectures, the NIW does not require any additional functional units or a wide instruction bus. By using the ALU and Memory units in parallel, the pipeline can be shortened from five stages to four. The combined hardware cost of the above changes should be quite minimal; very little additional silicon, if any, will be required to transform a pipelined RISC processor into an NIW processor.

3 Compiling for the NIW

Compiler techniques for the NIW architecture have not been the focus of our research. However, such an architecture is not completely unknown to the compiler community, and optimal algorithms exist for scheduling instructions for such an architecture [2, 3]. Rather than try to re-invent the wheel, we intend to adapt an existing instruction scheduling compiler to the NIW architecture [4, 13]. In particular, speculative execution algorithms [15] seem appropriate to use in a statically scheduled processor such as the NIW.

However, the constraints imposed by squeezing two instructions into a single 32-bit word present some unusual problems. Because only register-based ALU instructions can be executed in parallel with load/store instructions, it is important to avoid immediate constants in calculations. Examining compiler output for the DLX processor revealed that the constants 1, 2, and 4 cover most cases of immediate arguments to ALU operations. Thus, adopting the convention of placing the constants 1, 2, and 4 in registers (say, R30, R29, and R28) allows the convenient scheduling of most addi instructions as add instructions executed in parallel with load/store instructions.
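The constant-register convention amounts to a simple peephole rewrite, sketched below in Python. The function name, the regular expression, and the textual assembly syntax (`addi rd,rs,#imm`) are illustrative assumptions; only the mapping of the constants 1, 2, and 4 to R30, R29, and R28 comes from the text, where it is itself offered as an example.

```python
import re

# Convention suggested in the text: constants 1, 2 and 4 are kept in R30, R29 and R28.
CONSTANT_REGISTERS = {1: "r30", 2: "r29", 4: "r28"}

# Assumed assembly syntax: addi rd,rs,#imm
ADDI = re.compile(r"^addi\s+(r\d+),\s*(r\d+),\s*#(-?\d+)$")


def rewrite_addi(instruction):
    """Return a register-register add when the immediate is 1, 2 or 4."""
    match = ADDI.match(instruction.strip())
    if match is None:
        return instruction  # not an addi; leave untouched
    rd, rs, imm = match.group(1), match.group(2), int(match.group(3))
    const_reg = CONSTANT_REGISTERS.get(imm)
    if const_reg is None:
        return instruction  # stays I-type, so it cannot pair with a load/store
    return f"add {rd},{rs},{const_reg}"


for line in ["addi r5,r5,#4", "addi r14,r14,#-40", "sw (r14),r3"]:
    print(rewrite_addi(line))
```

Only the rewritten form is an R-type instruction, and therefore only it can share a word with a load or store; an addi with any other immediate still occupies an instruction word of its own.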

    ;; Adjust Stack Pointer
    add r14,r14,#-40
    ;; Save Registers
    add r25,r14,r27 % sw (r14),r3
    add r24,r25,r27 % sw (r25),r4
    add r25,r24,r27 % sw (r24),r5
    add r24,r25,r27 % sw (r25),r6
    add r25,r24,r27 % sw (r24),r7
    add r24,r25,r27 % sw (r25),r8
    add r25,r24,r27 % sw (r24),r9
    sw (r25),r10

Figure 11: NIW Optimized Register Save

Because displacement addressing is not available, all addresses must be computed explicitly. The loss of displacement addressing is significant when addressing stack variables and members of structures. Dynamically and statically allocated basic data types, as well as array members, are unaffected because they are accessed through simple register indirect addressing. To access the members of a structure, the compiler can compute the address of the specific structure member ahead of time, scheduling the requisite add instructions early in parallel with preceding load/store instructions. Figure 11 shows how to efficiently save and restore the register set to the stack using an inductive approach where the current address being saved to is computed in parallel with the previous register save.
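The inductive pattern of figure 11 is regular enough to generate mechanically. The Python sketch below emits the paired instructions for an arbitrary list of registers; the function name and parameters are hypothetical, and the defaults (r14 as stack pointer, r25 and r24 as alternating address temporaries, r27 as the register holding the address step) are simply read off the figure.

```python
def schedule_register_save(registers, base="r14", temps=("r25", "r24"), step="r27"):
    """Pair each store with the add that computes the next store address."""
    lines = []
    current = base
    for index, reg in enumerate(registers):
        store = f"sw ({current}),{reg}"
        if index == len(registers) - 1:
            # Last store: no further address is needed.
            lines.append(store)
        else:
            # Alternate between the two temporaries so the address being
            # stored through is never the one being overwritten.
            nxt = temps[index % 2]
            lines.append(f"add {nxt},{current},{step} % {store}")
            current = nxt
    return lines


for line in schedule_register_save([f"r{n}" for n in range(3, 11)]):
    print(line)
```

For r3 through r10 this reproduces the eight save instructions of figure 11: n registers are saved in n instruction words, the same count a scalar machine with displacement addressing would need.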

4 Analysis

This section presents our analysis of the impact of the NIW enhancement to the DLX architecture using three example programs. The programs were compiled using the DLX compiler (a port of gcc 1.36 [16]), and then hand-translated and scheduled for the NIW. No non-basic-block optimizations were used, and only the inner loops were optimized. The programs were run using simulators, and the run-time statistics compared to analyze the impact of the NIW modification. Further details on the performance analysis can be found in a related technical report [7].

While a larger number of sample programs would clearly be desirable, hand-translating a large number of programs is impractical, so larger performance studies must wait until a scheduling compiler has been ported to the NIW. We have, however, attempted to provide a balanced cross-section by providing two programs that are "favourable" to the architecture,

    #define N 5000
    int x[N];

    main()
    {
        int i;
        for(i=5; i--; ) g();
        printf("%d\n%d\n", x[N-2], x[N-1]);
    }

    g()
    {
        int i;
        x[0] = x[1] = 1;
        for(i=2; i