Simplifying Instruction Issue Logic in Superscalar Processors

Toshinori Sato¹,²  Itsujiro Arita¹

¹ Department of Artificial Intelligence, Kyushu Institute of Technology
² Center for Microelectronic Systems, Kyushu Institute of Technology
680-4 Kawazu, Iizuka, 820-8502 Japan
Phone: +81-948-29-7624, FAX: +81-948-29-7601
{tsato, [email protected]
Abstract

Modern microprocessors schedule instructions dynamically in order to exploit instruction-level parallelism. It is necessary to increase the instruction window size to improve instruction scheduling capability. However, it is difficult to increase the size without a serious impact on processor performance, since the instruction window is one of the dominant determiners of processor cycle time. The instruction window is critical because it is realized using content addressable memory (CAM). In general, RAMs are faster in access time and lower in power dissipation than CAMs. Therefore, it is desirable that the CAM instruction window be replaced by a RAM instruction window. This paper proposes such an instruction window, named the explicit data forwarding instruction window. The principle behind our proposal is to make result forwarding explicit. It is possible to dynamically construct explicit relationships between instructions, since each execution result is expected to be forwarded to only a limited number of dependent instructions. Simulation results show that the explicit data forwarding instruction window achieves a level of performance comparable to that of the conventional instruction window, while also providing the benefit of a shorter cycle time.

1 Introduction

Future microprocessors will rely on higher clock speed, wider instruction issue width, deeper pipeline stages, and larger instruction windows in order to improve performance. As width and size increase, scaling the clock speed becomes more difficult. Obstacles to clock speed scaling are register renaming logic, large register files, and operand bypass logic, as well as instruction window wakeup and select logic [5]. In this paper, we propose a new instruction wakeup logic through a different approach that is based on explicit data forwarding.

Two approaches allow instructions to be executed out-of-order. One is Tomasulo's algorithm and the other is Thornton's register scoreboarding. Modern superscalar processors use Tomasulo's algorithm because of its strong scheduling capability, which comes from its ability to remove anti- and output-dependences. Even when a register renaming technique is combined with Thornton's scoreboarding, Tomasulo's algorithm retains the additional advantage of result forwarding. However, this result forwarding significantly increases hardware complexity. It requires associative lookup logic, which becomes critical as the instruction window size increases. Since the associative lookup logic is a major obstacle to attaining high clock frequency, removing the result forwarding is a simple solution. However, this might lead to serious performance degradation. Recently, Cruz et al. [3] reported that only 22% of the values generated by the SPECint95 benchmarks are read more than once. The principle behind our proposal is that each execution result is forwarded to a limited number of dependent instructions using explicit data forwarding, removing the associative lookup logic. The instruction window can then be implemented using RAM instead of content addressable memory (CAM), thus improving the clock speed. In addition, RAM consumes less power than CAM.

2 EDF Instruction Window

This section proposes a simplified wakeup logic for a large instruction window, which we call an explicit data forwarding instruction window (EDF instruction window) [8]. First, we explain a conventional instruction window including register mapping hardware. After that, the EDF instruction window is described.

2.1 Terminology

Several definitions are given here to simplify future references in this section. Modern microprocessors fetch multiple instructions per cycle. Following instruction fetch, the instructions are decoded and issued into an instruction window. We use the term issue to indicate the process of placing the instructions into the instruction window. The instruction window consists of an instruction queue and a buffer, which maintains program order, such as a reorder buffer, and is operated as a FIFO buffer. The instructions remain in the instruction queue until their operands become ready. Once their dependences have been resolved, instruction dispatch logic schedules the instructions and then dispatches them to functional units. The instruction queue entries containing the dispatched instructions are deallocated so that new instructions may be issued. We use the term dispatch for moving the instructions from the instruction queue to the functional units, where they are executed. After execution is completed, the instructions remain in the instruction window until their preceding instructions have been retired from it. When the instructions reach the head of the instruction window, they are retired. The instructions may be completed out-of-order but are retired in-order. We use the term instruction queue for the structure holding the instructions waiting for dispatch. In contrast, the term instruction window is used for the structure holding the issued and completed instructions as well as the waiting ones.
2.2 Conventional instruction window

In order to eliminate anti- and output-dependences, modern dynamically scheduled processors perform register renaming. Register renaming is commonly implemented in either of two ways. One uses separate renaming registers, which are usually provided by the reorder buffer. The other combines the renaming registers with the architected registers in a single register file. We focus on the latter case, in particular as implemented in the MIPS R10000 [11].
[Figure 1: Instruction window — map table, free list, busy bit table, instruction queue (operation, tagL, tagR, dest, ready bits), and active list (old destination register)]

[Figure 2: Wakeup logic — result tags are compared (=?) against the tagL/tagR fields of every instruction queue entry to set the readyL/readyR bits]

[Figure 3: DMT entry — ID slots, each consisting of an identification field (id), a left/right field (L/R), and an empty bit (E)]

[Figure 4: Dataflow management table — the DMT is indexed by physical register number (rd); each entry's ID slots (id, L/R) point to instruction queue entries]
Figure 1 depicts the R10000's instruction window including its register mapping hardware. The register mapping hardware consists mainly of three structures: the map table, the active list, and the free list. By means of the map table, each logical register is mapped to a physical register. The destination register is mapped to a free physical register supplied by the free list, while operand registers are translated using the last mapping assigned to them. The old destination register is kept in the active list. When an instruction is retired, the old destination register, which was allocated by the previous instruction with the same logical destination register, is freed and placed on the free list. The translated operand registers are held in the instruction queue as tags, which are used for Tomasulo's algorithm. The busy bit table contains a bit indicating whether each physical register holds a valid value. It is used for initializing the ready bits in the instruction queue for ready operands. Figure 2 shows the conventional instruction queue, focusing on the instruction wakeup logic. Every queue entry has a tag field and a ready bit for each operand register. Every time an instruction is completed, the result tag associated with the instruction is broadcast to all instructions waiting in the instruction queue. Every instruction in the queue compares the result tag with its operand tags, and if they match, the ready bit for the operand is set. When all operands are ready, the instruction wakes up.
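As a rough illustration (a software sketch with invented names, not the paper's hardware), the broadcast-and-compare behavior of this CAM-style queue can be expressed as follows; every entry compares the broadcast result tag against both of its operand tags:

```python
# Sketch of CAM-style wakeup: each completed instruction broadcasts its
# result tag, and every queue entry runs two tag comparisons per cycle.
# This per-entry associative compare is what the proposal removes.

class CamQueueEntry:
    def __init__(self, tag_l, ready_l, tag_r, ready_r):
        self.tag_l, self.ready_l = tag_l, ready_l
        self.tag_r, self.ready_r = tag_r, ready_r

    def wakeup(self, result_tag):
        # Both comparators fire on every broadcast (the CAM cost).
        if self.tag_l == result_tag:
            self.ready_l = True
        if self.tag_r == result_tag:
            self.ready_r = True
        return self.ready_l and self.ready_r  # ready for selection

# An instruction waiting on physical register p7; its right operand
# (p3) was already ready at issue time via the busy bit table.
entry = CamQueueEntry(tag_l=7, ready_l=False, tag_r=3, ready_r=True)
print(entry.wakeup(5))  # unrelated result tag -> False
print(entry.wakeup(7))  # p7 completes -> True (instruction wakes up)
```

In hardware, every entry performs these comparisons in parallel on every completion, which is why the structure must be a CAM rather than a RAM.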
2.3 Proposed instruction window

In order to improve the scalability of the instruction queue by reducing the delay of the instruction wakeup logic, we propose the EDF instruction window. The main purpose of the EDF instruction window is to allow the use of RAM, which scales better than CAM. The EDF instruction window consists of a RAM instruction queue and a table named the dataflow management table (DMT).
[Figure 5: EDF instruction window — a dispatched instruction's result tag indexes the DMT, whose ID slots set the readyL/readyR bits of the corresponding instruction queue entries (opL, opR) to wake them up]
In order to replace CAM with RAM, the dependences between instructions must be described explicitly by some means. The DMT is a small register file that retains the dependences, which are constructed dynamically. Figure 3 shows an entry of the DMT. It has a number of ID slots, each of which consists of an identification field (denoted as id), a left/right identification field (denoted as L/R), and an empty bit (denoted as E). In Figure 3, the number of ID slots (ID) is two. While in the remainder of this section we assume that ID is two, we will vary ID in Section 3. Figure 4 depicts the DMT attached to the instruction window. It is indexed by physical register numbers, and each entry holds ID slots indicating specific instruction queue slots. Hence, a dependence from a source instruction I1 to a sink instruction I2 is expressed as a reference from the DMT entry associated with instruction I1 to instruction I2, which is registered in an ID slot of that entry. The unresolved dependences between instructions are registered in the DMT when each instruction is issued, and the DMT is referred to when instructions are completed. That is, dependences through already-ready operands are not registered in the DMT. The busy bit table (not depicted in Figure 4) is used for this check. The registration process is as follows. As shown in Figure 4, the DMT is indexed by the physical operand register number, and the ID associated with the instruction that requests the operand is registered in an id field. Since the instruction in the figure has two operands, its ID is registered in two entries. The identifiers (denoted as L and R in Figure 4), which indicate which of the instruction's operands is involved, are also held in the DMT. The empty bit in a DMT entry is set when a destination register is mapped to the new physical register associated with the entry, and is reset when an ID is registered in the DMT slot.

The reference process and the instruction wakeup that follows it are explained in Figure 5. Modern schedulers rely on knowing every instruction's execution latency, and they perform wakeup and selection in advance using this knowledge. When an instruction is dispatched, the DMT is indexed by the result tag of the instruction, i.e., the physical destination register number. From the DMT, the ID of each instruction that requests the execution result is obtained. Using the ID, the ready bit of the entry associated with that instruction is set. If all ready bits in the entry are set, the instruction is ready for execution (wakeup). As can be seen, there are no associative lookups in the instruction wakeup logic, and thus the instruction queue can be implemented using RAM. It is important to mention how branch mispredictions are processed. A branch misprediction leaves incorrect dependences in the DMT. Therefore, it is necessary to revert the table to the safe point where the speculation was initiated. This is easily handled: every time a branch is predicted, a checkpoint of the DMT is made, just as for the map table [11].
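The registration and reference processes above can be sketched in software (a hedged illustration with invented names, not the paper's implementation; the DMT here is a plain dictionary standing in for a RAM indexed by physical register number):

```python
# Sketch of the EDF scheme: the DMT is a RAM indexed by physical
# register number; each entry holds at most N_ID (id, L/R) slots.
# Registration happens at issue, reference (wakeup) at dispatch --
# no associative search anywhere.

N_ID = 2          # ID slots per DMT entry (varied in Section 3)

dmt = {}          # phys reg -> list of (queue_id, operand_side) slots
ready = {}        # queue_id -> {"L": bool, "R": bool}

def issue(queue_id, operands, busy):
    """Register unresolved dependences; return False on DMT overflow
    (where the blocking policy of Section 2.3.1 would stall issue)."""
    ready[queue_id] = {}
    for side, preg in operands.items():       # side is "L" or "R"
        if not busy.get(preg, False):         # value already available
            ready[queue_id][side] = True
            continue
        slots = dmt.setdefault(preg, [])
        if len(slots) == N_ID:
            return False                      # no free ID slot
        slots.append((queue_id, side))
        ready[queue_id][side] = False
    return True

def complete(dest_preg):
    """On dispatch/completion, index the DMT with the result tag and
    set ready bits directly -- a RAM read, not a CAM broadcast."""
    woken = []
    for queue_id, side in dmt.pop(dest_preg, []):
        ready[queue_id][side] = True
        if all(ready[queue_id].values()):
            woken.append(queue_id)
    return woken

busy = {7: True}                    # p7 not yet produced
issue(0, {"L": 7, "R": 3}, busy)    # depends on p7; p3 is ready
print(complete(7))                  # -> [0]: instruction 0 wakes up
```

Checkpointing for branch mispredictions would amount to saving and restoring copies of `dmt`, analogous to map-table checkpoints.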
2.3.1 Issue policies

In order to handle the case where there is no room in a specific DMT entry, we consider two policies for instruction issue. One is a blocking issue policy and the other is a non-blocking one. The blocking issue policy stalls the decode and issue of instructions following one whose unresolved dependences cannot be registered in the DMT. The instruction that causes the decoder interlock waits until its operands are ready and then obtains them from the register file. This blocking issue policy might have a severe impact on processor performance. One possible drawback is that it makes it difficult to utilize large instruction windows efficiently. However, we expect that a few ID slots are sufficient to prevent serious interlocking, since only 22% of the values generated by the SPECint95 benchmarks are read more than once [3].

The non-blocking issue policy continues the decode and issue of instructions until the instruction window is full. Instructions whose unresolved dependences are not registered in the DMT do not receive their operands directly from preceding instructions. In order for these instructions to wake up appropriately, an implicit wakeup scheme is used. The key point is that the dependences of every instruction at the instruction window head have been resolved¹. Therefore, any instruction that reaches the instruction window head can be dispatched unconditionally if an appropriate functional unit is free. Though the implicit wakeup scheme removes the undesirable decoder interlock, it presents a possible severe drawback of its own. Every instruction whose dependences are not registered in the DMT remains in the instruction queue until it is at the instruction window head, even when its dependences have been resolved. This delay of wakeup creates serious problems for mispredicted branch instructions: many useless instructions might be issued and dispatched, thus degrading processor performance. The other effect of the delay is increased pressure on the effective instruction window capacity. If the instruction window is full, the decode and issue of instructions are stalled, resulting in performance degradation. We will evaluate these effects in the following section.
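The difference between the two policies can be summarized as the action taken when DMT registration fails (a toy sketch; the names and return strings are illustrative, not from the paper):

```python
# Toy sketch of the two issue policies on DMT overflow.

def issue_action(dmt_has_slot, policy):
    """What the front end does with an instruction whose unresolved
    dependence cannot be registered in the DMT."""
    if dmt_has_slot:
        return "issue"                  # normal EDF wakeup via the DMT
    if policy == "blocking":
        # Stall decode/issue; the instruction later reads its operands
        # from the register file once they become ready.
        return "stall"
    # Non-blocking: issue anyway and rely on the implicit wakeup --
    # the instruction dispatches only on reaching the window head.
    return "issue-implicit"

print(issue_action(False, "blocking"))      # -> stall
print(issue_action(False, "non-blocking"))  # -> issue-implicit
```

The blocking policy trades front-end stalls for timely wakeup; the non-blocking policy trades delayed wakeup (until the window head) for an uninterrupted front end.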
2.3.2 Splitting the DMT

The access time of the DMT might have a significant impact on processor cycle time and thus should be as short as possible. The access time of a register file can be modeled as a logarithmic function of the number of read ports, and its area can be modeled as a function that grows with the square of the number of total

¹ Note that instructions are not removed from the instruction window until they are retired.
[Figure 6: Split dataflow management table (IW = NF = ID = 2) — (a) the base DMT with opL/opR ports and two ID slots per entry; (b) the split-DMT, duplicated per functional unit and divided into per-operand subsets with one ID slot each]
ports [2]. Based on this model, the access time and the area of the DMT are estimated as c1 × log(NF) and c2 × (NF + 2 × IW)², respectively, where NF is the number of functional units, IW is the instruction issue width, and c1 and c2 are constants. The other factor influencing the access time is ID. Any implementation with more than one ID slot incurs extra complexity when selecting one empty slot during the registration process and delivering valid slots during the reference process. This increases the access time of the DMT. In order to improve the access time, two implementation solutions can be used. One is duplicating the DMT by the number of read ports, which means every functional unit has its own dedicated DMT for reference. The other is dividing the DMT into two subsets, each dedicated to a single operand and with one ID slot per entry; this removes the implementation complexity explained above. We call the DMT using these solutions the split-DMT. Figure 6 shows the split-DMT when IW, NF, and ID are all two. The efficiency of the ID slots in the split-DMT might be lower than that of the base DMT. We will evaluate this effect in the following section. The access time of the split-DMT is improved and estimated as c3, where c3 is a constant smaller than c1. Unfortunately, the area is increased and estimated as c2 × (1 + IW)² × NF × 2.
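Under the stated model, and with illustrative constants (c1 = c2 = 1; the paper does not give concrete values), the area trade-off of the split-DMT can be computed directly:

```python
# The access-time/area model from above, with made-up constants.
import math

def dmt_access_time(nf, c1=1.0):
    return c1 * math.log(nf)            # grows with read-port count

def dmt_area(nf, iw, c2=1.0):
    return c2 * (nf + 2 * iw) ** 2

def split_dmt_area(nf, iw, c2=1.0):
    # Duplicated per functional unit and divided per operand.
    return c2 * (1 + iw) ** 2 * nf * 2

# Example configuration: NF = 8 functional units, IW = 8 issue width.
print(dmt_area(8, 8))        # (8 + 16)^2 = 576
print(split_dmt_area(8, 8))  # 9^2 * 8 * 2 = 1296
```

The split-DMT thus pays a larger area (1296 vs. 576 in this configuration) in exchange for the constant access time c3.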
3 Evaluation Methodology

In this section, we describe the evaluation methodology by explaining the processor model and benchmark programs. An execution-driven simulator is used for this study. We implemented this simulator using the SimpleScalar tool set (ver.3.0a) [1]. The SimpleScalar/PISA instruction set architecture (ISA) is based on the MIPS ISA. The simulator models a realistic, 8-way out-of-order execution superscalar processor. In the SimpleScalar processor, register renaming is performed on a register update unit (RUU) [9], which can easily model the R10000's register mapping hardware: the renaming registers, the active list, and the instruction queue share a single structure, the RUU. The number of RUU entries evaluated is 64 or 512. Each functional unit can execute any operation. The execution latency is 1 cycle except for multiplication (4 cycles) and division (12 cycles). A 4-port, non-blocking, 128KB, 32B-block, 2-way set-associative L1 data cache is used for data supply. It has a load latency of 1 cycle after the data address is calculated and a miss latency of 6 cycles. It is backed by an 8MB, 64B-block, direct-mapped L2 cache, which has a miss latency of 18 cycles for the first word plus 2 cycles for each additional word. No memory operation that follows a store whose data address is unknown can be executed. A 128KB, 32B-block, 2-way set-associative L1 instruction cache is used for instruction supply; it is also backed by the L2 cache, which is shared with the L1 data cache. For control
prediction, a 1K-entry 4-way set-associative branch target buffer, a 4K-entry gshare-type branch predictor, and an 8-entry return address stack are used. The branch predictor is updated at the instruction commit stage. The SPECint95 benchmark suite is used for this study. We focus on the performance of integer programs only, because it tends to be more difficult to obtain high levels of parallelism from these programs than from floating-point programs. The input files are modified to achieve a practical evaluation time. We use the object files provided by the University of Wisconsin-Madison [1], except for 132.ijpeg, which is compiled by GNU GCC (version 2.6.3) with the optimization option -O3. Each program is executed to completion.

[Figure 7: Processor performance comparison (blocking) — normalized IPC for ID = 1, 2, 3 and the conventional CAM window (C); (i) 64 entries, (ii) 512 entries]
4 Simulation Results

This section presents simulation results. First, the EDF window using the blocking issue policy is evaluated. After that, the influences of the non-blocking issue policy and of splitting the DMT are shown.
4.1 Baseline performance

For measuring performance, we use committed instructions per cycle (IPC) as the metric. Only useful instructions are considered when counting the IPC; we do not count nop instructions. We evaluate the usefulness of the EDF window by comparing it with the baseline performance. First, the EDF window using the blocking issue policy is evaluated. Table 1 shows how often instruction issue stalls due to the lack of ID slots in the DMT. The first column in the table shows the program name. The remaining two groups of three columns indicate the percentage of stall cycles for 64- and 512-entry instruction windows, respectively. For each group, the three columns present the results for ID of 1, 2, and 3, respectively. We can easily recognize the following. First, there are no significant differences in the percentage of stall cycles when the instruction window size changes. Second, when ID is 1, instruction issue is stalled frequently, at approximately 50% of all execution cycles. As ID increases to 2 and 3, the stall cycles are reduced to about 30% and 20%, respectively. This means one-fifth of the instruction issue slots are wasted even when ID is 3; in other words, the issue width effectively decreases from 8 to about 6 instructions. This might have a serious negative impact on processor performance. Next, we evaluate this impact. Figure 7 compares processor performance. The performance of the proposed RAM instruction window is normalized to that of the conventional CAM instruction window. For each group of four bars in the figure, the first three (from left to right) are for the EDF instruction window (denoted as 1, 2, and 3), and the last is for the conventional CAM instruction window (denoted as
Table 1: % stalls of instruction issue

                     64 entries                512 entries
program         ID=1    ID=2    ID=3      ID=1    ID=2    ID=3
099.go         40.46   10.68    3.09     40.47   11.10    3.58
124.m88ksim    54.04   35.51    6.46     54.04   35.58    5.99
126.gcc        40.34   18.39   10.24     40.36   18.60   10.62
129.compress   59.82   29.63   14.61     59.94   29.89   14.77
130.li         57.09   38.09   11.14     57.09   38.09   11.16
132.ijpeg      43.41   28.30   19.49     48.66   31.37   23.46
134.perl       54.60   31.56   22.65     54.62   31.58   23.04
147.vortex     52.21   22.56   18.92     52.26   23.53   20.37
C). For the proposed RAM instruction window, the left, middle, and right bars indicate the results for ID of 1, 2, and 3, respectively. First, we do not observe that the interlocking of decode and issue seriously affects processor performance. For the 64-entry instruction window, a DMT with three ID slots is required for the EDF instruction window to achieve processor performance comparable to that of the conventional CAM instruction window, except in the case of 134.perl. On average, it attains 94.9% of the conventional performance. Interestingly, in the case of 130.li, the processor performance of the RAM instruction window exceeds that of the CAM window. This is due to the indeterminate characteristics of out-of-order execution; for example, if the instruction issue stall reduces the number of useless instructions executed on a branch misprediction path, performance improves. When ID decreases to two, 91.3% of the conventional performance is achieved. When it decreases further to one, only 77.8% is achieved. Therefore, a DMT with two ID slots constitutes an effective tradeoff point. For the 512-entry instruction window, a DMT with three ID slots is likewise sufficient for the EDF instruction window to achieve processor performance comparable to that of the conventional CAM instruction window, except in the case of 124.m88ksim. On average, approximately 93% of the conventional performance is achieved when ID is two. Thus, it is again confirmed that a two-slot DMT constitutes an effective tradeoff point.
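The effective-issue-width argument in Section 4.1 can be sanity-checked with back-of-envelope arithmetic (a sketch, not a measurement): if issue stalls for a fraction s of all cycles, the effective width is roughly width × (1 − s).

```python
# With ~20% stall cycles at ID = 3, an 8-wide issue stage behaves
# like a ~6.4-wide one, consistent with the "8 to about 6" estimate.

def effective_width(width, stall_fraction):
    return width * (1 - stall_fraction)

print(effective_width(8, 0.20))  # ID = 3: ~20% stalls
print(effective_width(8, 0.50))  # ID = 1: ~50% stalls
```

This simple model ignores that stall cycles may overlap with cycles in which the window was full anyway, so it overstates the loss somewhat.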
4.2 Influence of issue policies

Figure 8 compares the EDF instruction window using the non-blocking issue policy with the conventional CAM instruction window in terms of processor performance. While the blocking issue policy is free from its possible drawback, the non-blocking issue policy suffers from its own. Except for 099.go, processor performance using the EDF instruction window cannot reach that achieved with the conventional CAM in-
[Figure 8: Processor performance comparison (non-blocking) — normalized IPC for ID = 1, 2, 3 and the conventional CAM window (C); (i) 64 entries, (ii) 512 entries]
[Figure 9: % increase of misspeculated instructions (non-blocking) — for ID = 1, 2, 3; (i) 64 entries, (ii) 512 entries]
Table 2: Instruction window occupancy latency (cycles)

                        64 entries                     512 entries
program         ID=1  ID=2  ID=3     C       ID=1  ID=2  ID=3     C
099.go          12.8   8.5   7.5   6.6       17.3  10.1   8.1   6.8
124.m88ksim     25.8  21.4  15.2   9.2       49.8  27.2  23.9  22.1
126.gcc         17.1  12.3  10.2   6.3       27.8  19.4  15.6   6.9
129.compress    24.5  14.0  10.8   7.3       39.5  25.8  22.9   9.7
130.li          25.1  18.2  11.0   6.1       41.4  28.4  17.2   6.1
132.ijpeg       24.0  18.9  16.5  10.4       83.6  66.2  49.1  24.5
134.perl        24.3  18.5  14.5   6.8       39.8  30.3  21.2   7.3
147.vortex      34.9  27.6  24.2   8.6       89.0  78.4  68.6  12.7
struction window. Even when the DMT has three ID slots, only 75.9% and 73.2% of the processor performance of the CAM window case is achieved for the 64- and 512-entry EDF instruction windows, respectively. This performance gap may not be easily compensated for by a higher clock frequency. Moreover, the performance degradation is more severe in a large instruction window than in a small one, which does not suit the goal of enlarging the instruction window. Hence, the non-blocking issue policy is not desirable for the EDF instruction window. The performance degradation is due mainly to the increase in instructions squashed by branch mispredictions, which is caused by the delay of instruction wakeup. Table 2 presents how long each instruction remains in the instruction window on average. The first column in the table shows the program name. The remaining two groups of four columns are for 64- and 512-entry instruction windows, respectively. For each group, the first three columns present the results for ID of 1, 2, and 3, respectively, while the last is for the conventional CAM window. As is easily observed, the delay increases considerably. Figure 9 shows the percentage increase of misspeculated instructions. In the case of the 512-entry EDF instruction window, they are increased by as much as 661% for 124.m88ksim. This increases pressure on the functional units, resulting in performance loss. In the remainder of this paper, the non-blocking issue policy is disregarded.
4.3 Influence of splitting the DMT

In Section 2.3.2, we considered ways to reduce the delay of the DMT. In this section, the split-DMT using the blocking issue policy is examined. Figure 10 compares the split-DMT with the original DMT. Each bar denoted as S indicates processor performance for the split-DMT. Recall that the split-DMT has two ID slots in every DMT entry, each dedicated to a single source operand. Therefore, its performance is expected to lie between that of the original one-slot and two-slot DMTs. Figure 10 confirms this expectation. In general, the split-DMT improves processor performance over the original one-slot DMT by approximately 5%. However, a performance gap of about 5% remains between the split-DMT and the original two-slot DMT. If the split-DMT model can be clocked more than 5% faster than the original DMT model, splitting the DMT is the better solution.
5 Related Work

The basic idea behind our proposal is to make result forwarding explicit. The Dualflow architecture [4] helps achieve this goal. It hybridizes control- and data-driven architectures: the instruction sequence is control-driven, while result forwarding between instructions is data-driven. The destinations of a result are described explicitly in each instruction, removing the associative lookup in the instruction wakeup logic while Dualflow performs out-of-order execution. The EDF instruction window is strongly influenced by Dualflow; we owe it the basic concept of the explicit description of data communication. One of the disadvantages of Du-
[Figure 10: Processor performance comparison (blocking/split) — normalized IPC for the split-DMT (S), ID = 1 and 2, and the conventional CAM window (C); (i) 64 entries, (ii) 512 entries]

alflow is the explosion of the program code. It has been reported in [4] that Dualflow increases the code size by more than 100%, with approximately 50% of dynamic instructions being useless. The EDF instruction window is most similar to the Direct Tag Search (DTS) algorithm proposed by Weiss et al. [10]. The DTS algorithm uses a tag search table indexed by result tags to perform result forwarding. The tag search table has only one reservation address for each tag, as in the case of the DMT with one ID slot. Weiss et al. evaluate the DTS algorithm on a scalar processor, while we evaluate the EDF window on a wide superscalar processor. For modern superscalar processors, branch prediction is an essential technique; Weiss et al. do not mention speculative execution, and thus the handling of mispredicted branches is not considered.
6 Conclusion

Future microprocessors will rely on higher clock speed, wider instruction issue width, deeper pipeline stages, and larger instruction windows in order to improve performance. As width and size increase, scaling the clock speed becomes more difficult. One of the obstacles to clock speed scaling is the instruction window wakeup logic. In this paper, we have proposed a simple wakeup logic for large instruction windows, named the explicit data forwarding instruction window. This logic is based on the observation that each execution result is forwarded to only a limited number of dependent instructions. The relationships between instructions are described explicitly, and thus the associative lookup for checking the relationships is removed. Simulation results show that the proposed RAM window achieves performance comparable to the conventional CAM window. Future study of the EDF instruction window will focus on removing the DMT by constructing dataflow information in the trace cache [6, 7]. This simplifies the decoding and issuing of instructions, allowing a faster clock frequency. To increase the number of destinations of result forwarding, dummy move instructions can be inserted. This reduces the interlocking of the decode and issue stages, thus enhancing the utilization of large instruction windows. We are also interested in the power dissipation of the instruction window. RAMs dissipate less power than CAMs do; thus it is of interest to evaluate the power requirements of each of the two window models. We expect that the EDF instruction window will be a promising candidate for large instruction windows in future microprocessors.
Acknowledgments

This work is supported in part by a Grant-in-Aid for Scientific Research (B) (No.12780273) from the Japan Society for the Promotion of Science and a grant from the Okawa Foundation for Information and Telecommunications (No.01-13), and was supported in part by a Grant-in-Aid for Encouragement of Young Scientists (No.12780273) from the Japan Society for the Promotion of Science. Toshinori Sato was supported in part by a grant from the Fukuoka Industry, Science & Technology Foundation (No.H12-1).
References

[1] D. Burger, T. M. Austin: The SimpleScalar tool set, version 2.0, ACM SIGARCH Computer Architecture News, vol.25, no.3, 1997.
[2] A. Capitanio, N. Dutt, A. Nicolau: Partitioned register files for VLIWs: a preliminary analysis of tradeoffs, 25th Int. Symp. on Microarchitecture, 1992.
[3] J-L. Cruz, A. González, M. Valero, N. Topham: Multiple-banked register file architectures, 27th Int. Symp. on Computer Architecture, 2000.
[4] M. Goshima, N. H. Ha, A. Agata, H. Mori, S. Tomita: Proposal of the Dualflow architecture, 12th Joint Symp. on Parallel Processing, 2000.
[5] S. Palacharla, N. P. Jouppi, J. E. Smith: Complexity-effective superscalar processors, 24th Int. Symp. on Computer Architecture, 1997.
[6] S. J. Patel, D. H. Friendly, Y. N. Patt: Critical issues regarding the trace cache fetch mechanism, Technical Report CSE-TR-335-97, Dept. of Electrical Engineering and Computer Science, University of Michigan, 1997.
[7] E. Rotenberg, S. Bennett, J. Smith: Trace cache: a low latency approach to high bandwidth instruction fetching, 29th Int. Symp. on Microarchitecture, 1996.
[8] T. Sato, Y. Nakamura, I. Arita: Revisiting direct tag search algorithm on superscalar processors, Workshop on Complexity-Effective Design, held in conjunction with 28th Int. Symp. on Computer Architecture, 2001.
[9] G. S. Sohi: Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers, IEEE Trans. Comput., vol.39, no.3, 1990.
[10] S. Weiss, J. E. Smith: Instruction issue logic in pipelined supercomputers, IEEE Trans. Comput., vol.C-33, no.11, 1984.
[11] K. C. Yeager: The MIPS R10000 superscalar microprocessor, IEEE Micro, April 1996.