Register Write Specialization, Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors

André Seznec, Eric Toullec, Olivier Rochecouste
{seznec,etoullec,orocheco}@irisa.fr
IRISA/INRIA, Rennes, France



Abstract

With the continuous shrinking of transistor size, processor designers face new difficulties in achieving high clock frequency. The register file read time, the wake-up and selection logic traversal delay and the bypass network transit delay, along with their respective power consumptions, constitute major difficulties for the design of wide-issue superscalar processors. In this paper, we show that transgressing a rule that has so far been applied in the design of all superscalar processors allows these difficulties to be reduced. Currently used general-purpose ISAs feature a single logical register file (and generally a floating-point register file). Up to now, all superscalar processors have allowed any general-purpose functional unit to read and write any physical general-purpose register. First, we propose Register Write Specialization, i.e., forcing distinct groups of functional units to write only into distinct subsets of the physical register file, thus limiting the number of write ports on each individual register. Register Write Specialization significantly reduces the access time, the power consumption and the silicon area of the register file without impairing performance. Second, we propose combining Register Write Specialization with Register Read Specialization for clustered superscalar processors. This limits the number of read ports on each individual register and simplifies both the wake-up logic and the bypass network. With an 8-way 4-cluster WSRS architecture, the complexities of the wake-up logic entry and of the bypass point are equivalent to those found on a conventional 4-way issue processor. More physical registers are needed in WSRS architectures. Nevertheless, using a WSRS architecture allows a dramatic reduction of the total silicon area devoted to the physical register file (by a factor of four to six). Its power consumption is more than halved and its read access time is shortened by one third. Some extra hardware and/or a few extra pipeline stages are needed for register renaming. The WSRS architecture induces constraints on the policy for allocating instructions to clusters. However, the performance of an 8-way 4-cluster WSRS architecture stands comparison with that of a conventional 8-way 4-cluster superscalar processor.

1 Introduction

The physical register file, the bypass network and the selection and wake-up logic have become a burden for the design of general-purpose dynamically scheduled superscalar processors [14]. Access to the register file is now pipelined across several cycles (e.g., on the Pentium 4 [10]). Pipelining is also considered for the wake-up and selection logic [18], and only limited fast-forwarding capability is implemented on clustered processors, e.g., the Alpha 21264 [11]. At constant issue width, this trend will continue with the advance of integration technology. It will be further emphasized on wider-issue processors. Unlike some VLIW ISAs (e.g., Multiflow [3]), the ISAs currently used on PCs, workstations and servers feature a single logical general-purpose register file, and generally a second logical register file for floating-point registers. From the code generator's perspective, this general-purpose register file is central to the architecture, since the operands of every integer operation are read from this register file and results are written to it. This central view of the general-purpose register file has also been adopted for the hardware implementation of the physical register file in dynamically scheduled superscalar processors. The following unwritten rule has always been applied: every general-purpose physical register can be the source or the result of any instruction executed on any integer functional unit. It has been applied for implementing processors using a centralized monolithic physical register file as well as for clustered processors implementing a distributed register file, e.g., the Alpha 21264. In the latter case, a copy of the physical register file is associated with each cluster (Figure 1). Each of these copies features only a fraction of the total number of read ports, but features the same number of write



This work was partially supported by an Intel grant.


Figure 1. Monolithic versus clustered register file organization: (a) a monolithic register file shared by all functional units; (b) a clustered register file, with one register file copy per cluster (C0-C3).


ports as in the monolithic register file case. In this paper, we show that this unwritten design rule can be transgressed on dynamically scheduled processors. The set of physical registers can be divided into distinct subsets that are read-connected (respectively write-connected) with only subsets of the entries (respectively the exits) of the functional units. The number of write and read ports on each individual physical register and the overall complexities of the physical register file, the bypass network and the wake-up logic are decreased.

Register write specialization Distinct clusters of functional units or distinct pools of functional units can be forced to write into distinct subsets of the physical registers (Figure 2). We refer to this principle as register write specialization. Whenever an instruction is executed on cluster Ci (or a pool of functional units), its result is written into a physical register from register subset Si. Each individual physical register is write-connected with only a subset of the functional units. Using register write specialization makes it possible to significantly reduce the silicon area devoted to the physical register file, to decrease its read access time and to reduce its power consumption. Register write specialization links register renaming with the allocation of instructions to clusters. However, for simple policies for allocating instructions to clusters, such as round-robin, it does not impair performance at all, provided that the number of physical registers is sufficiently increased. For more complex dynamic instruction allocation policies, a few extra pipeline stages may be needed for register write specialization.

WSRS architectures With a clustered superscalar processor, one can also constrain the clusters to read their operands from only a subset of the physical registers, provided that each instruction remains executable by at least one of the clusters. This allows the number of read ports on each individual register to be reduced. We refer to this method as register read specialization

Figure 3. A 4-cluster WSRS architecture

and to clustered architectures implementing the combination of register write specialization and register read specialization as WSRS (register Write Specialization, register Read Specialization) architectures. A 4-cluster WSRS architecture is illustrated in Figure 3. The first operand of an instruction executed on cluster C1 is read from a physical register belonging to subset S0 or to subset S1 (i.e., it has been produced by cluster C0 or cluster C1). Therefore, (a) the bypass point associated with the first operand entry of a cluster C1 functional unit has to be connected only with the output buses of the cluster C0 and C1 functional units, and (b) for checking the validity of this operand, the wake-up logic has to monitor only the executions on clusters C0 and C1. The complexity of the bypass point (respectively the wake-up logic entry) on the 4-cluster WSRS architecture is equivalent to the complexity of the bypass point (respectively the wake-up logic entry) of a conventional 2-cluster superscalar processor.

Figure 2. Register Write Specialization: (a) distinct clusters (C0-C3) write on distinct register subsets (S0-S3); (b) pools of functional units (Ld/St units, simple ALUs, complex ALUs, branch units) write on distinct register subsets.

2 Register Write Specialization

2.1 Register Write Specialization principle

Figure 2 illustrates register write specialization with a 4-cluster processor and with a processor relying on reservation stations servicing pools of identical functional units. The physical register file is partitioned into several distinct subsets of registers. Each cluster (resp. pool of functional units) can write into only one of the subsets. Assuming 2-way clusters able to produce up to 3 results per cycle (as on the Alpha EV6 [11]), each physical register can be built using four identical (4-read, 3-write) copies instead of four identical (4-read, 12-write) copies when register write specialization is not implemented.
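As a quick sanity check on the port-count arithmetic above, a minimal sketch (the constants mirror the 2-way, 3-results-per-cycle clusters assumed in the text; the function name is hypothetical):

```python
# Hypothetical sketch of the per-register write-port counts implied by the
# text, assuming 4 clusters that each produce up to 3 results per cycle.
CLUSTERS = 4
RESULTS_PER_CLUSTER = 3  # e.g., 2 ALU results + 1 load result

def write_ports_per_copy(write_specialization: bool) -> int:
    """Write ports each physical-register copy must support."""
    if write_specialization:
        # Only one cluster can write a given subset: 3 write ports.
        return RESULTS_PER_CLUSTER
    # Any of the 4 clusters may write any register: 4 * 3 = 12 write ports.
    return CLUSTERS * RESULTS_PER_CLUSTER

print(write_ports_per_copy(False))  # 12 -> (4-read, 12-write) copies
print(write_ports_per_copy(True))   # 3  -> (4-read, 3-write) copies
```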

Moreover, the physical registers from subset S1 are read-connected only with the first operand entries of the functional units of clusters C0 and C1 and with the second operand entries of the functional units of clusters C1 and C3. Therefore the register file of the 4-cluster WSRS architecture exhibits reduced silicon area, reduced power consumption and a shorter access time when compared with that of a conventional 4-cluster architecture. On the other hand, the allocation of instructions to clusters is strongly constrained on this 4-cluster WSRS architecture. This induces extra hardware logic for register renaming and for computing the allocation of instructions to clusters, as well as some extra pipeline stages before register renaming.

2.2 Register write specialization and register renaming

When register write specialization is used, register renaming is strongly dependent on the instruction allocation to clusters. The cluster (or the pool of functional units) that executes an instruction determines the register subset where the instruction result is written. In this paper we assume that instructions are first allocated to clusters and then renamed1. That is, once an instruction has been allocated to a cluster, register renaming has to take this constraint into account. We propose below two possible implementations of register renaming with register write specialization. Both implementations are derived from a register renaming process using a single set of registers, a map table and a free list. This renaming process (quite similar to the one used in many current processors) can be decomposed into three tasks: (A) dependency propagation within the group of instructions to be renamed in parallel,

Paper organization The remainder of the paper is organized as follows. Register write specialization is further analyzed in Section 2. Section 3 analyzes the 4-cluster WSRS architecture; in particular, register renaming and instruction allocation policies are detailed. Section 4 discusses the complexity advantages of WSRS architectures over conventional clustered architectures. Section 5 presents performance results confirming that combining register write specialization and register read specialization does not impair the performance of a 4-cluster processor architecture. Section 6 reviews previous related work on optimizing physical register files (e.g., virtual-physical registers [13], register caches [4, 1], read/write port arbitration [1]) and on reducing the critical path [18, 2] or power consumption [5, 8] of the wake-up and selection logic. On WSRS architectures, all these proposals can be applied at the cluster level. Finally, Section 7 summarizes this study.

(B) assignment of a free register to each instruction producing a register result,

1 The alternative solution (register renaming first, instruction allocation to clusters second) may lead to very unbalanced workloads on the clusters.


(C) read and update of the map table: the new map table is built from the old map table and the group of new free registers.
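A minimal behavioral sketch of tasks (A)-(C) for a baseline (non-specialized) renamer, with hypothetical data structures:

```python
# Minimal sketch (hypothetical data structures) of the three renaming tasks:
# (A) intra-group dependency propagation, (B) free-register assignment,
# (C) map-table read/update -- for a group of instructions renamed together.
def rename_group(group, map_table, free_list):
    """group: list of (dest_logical, [src_logicals]) tuples, in program order."""
    new_map = dict(map_table)            # (C) start from the old map table
    renamed = []
    for dest, srcs in group:
        # (A)+(C): sources are read from new_map, so a result produced
        # earlier in the same group is correctly propagated.
        phys_srcs = [new_map[s] for s in srcs]
        phys_dest = free_list.pop(0)     # (B) pick a free physical register
        new_map[dest] = phys_dest        # (C) update the map table
        renamed.append((phys_dest, phys_srcs))
    return renamed, new_map
```

For example, renaming the two dependent instructions `r1 <- r2` then `r3 <- r1` makes the second instruction read the physical register just assigned to `r1`, not the stale mapping.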

For static instruction allocation to clusters (e.g., round-robin), the cluster allocation of an instruction is known very early in the pipeline, and the final register renaming stage might not be delayed at all.

Task (C) is dependent on Tasks (A) and (B). Tasks (A) and (C) can be directly adapted to register write specialization. For both implementations, we assume that a list of free registers is maintained for each of the physical register file subsets. We also assume that a subset target vector V, representing the allocation of instructions to clusters, is available. That is, the allocation of instructions to clusters must precede the final pipeline stage of the register renaming process. Note that the instruction allocation policy may induce extra pipeline stages.

2.3 A deadlock issue and its workarounds

There is a possible deadlock issue when register write specialization is used. Suppose that the number of physical registers in some (or all) register subsets is smaller than the number of logical registers in the ISA. When, at a given point of the execution, all the physical registers of one of the register subsets represent architectural registers of the ISA, the logical-to-physical register renaming mechanism cannot rename any new instruction to the cluster/pool of functional units writing to this register subset. This situation might result in a deadlock. Such a deadlock cannot occur when each register subset features at least as many registers as there are logical registers in the ISA. However, for SMTs or for ISAs featuring very large numbers of registers (e.g., IA64), this might not be a realistic solution. Two possible workarounds can be considered: (a) the allocation of instructions to clusters may be in charge of avoiding the deadlock, or (b) an exception is raised whenever the deadlock is detected; moves that remap some of the logical registers onto the other register subsets are then issued.
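Workaround (a) amounts to a simple occupancy test before steering an instruction to a cluster; a sketch with hypothetical names:

```python
# Hypothetical sketch of workaround (a): before steering an instruction to a
# cluster, verify that the target register subset still has a physical
# register to spare beyond those holding architectural state or already
# allocated to in-flight instructions.
def subset_can_accept(subset_size, committed_mappings, inflight_allocs):
    """True if renaming one more instruction to this subset cannot deadlock."""
    free = subset_size - committed_mappings - inflight_allocs
    return free > 0

# A subset whose registers all hold architectural state must be avoided:
print(subset_can_accept(32, 32, 0))   # False -> steer elsewhere
print(subset_can_accept(88, 32, 40))  # True
```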

2.2.1 First implementation

If a physical register target is systematically assigned to each instruction, Task (B) can be implemented on a conventional superscalar processor as follows: let N be the number of instructions to be renamed in parallel; N free physical registers P1, ..., PN are picked from the free list, and register Pj is assigned to the jth instruction of the group to be renamed. This solution can be adapted to register write specialization as follows: N free registers are picked from each free list; then the assignment of the jth instruction to a cluster is used to select, among the registers picked, the single target register for that instruction. A major drawback of this solution is that it "wastes" many free registers. The unused free registers must be recycled. This recycling of free registers can be handled in pipelined mode for each of the free lists: 1) build the two lists of registers to be recycled (the registers freed by committed instructions and the registers that were not attributed to the group of instructions renamed on the previous cycle), 2) independently pack both lists, 3) merge them into a single list and 4) append this list to the free list. A residual problem is that a large number of free registers are not accessible while they are flowing through the recycling pipeline.
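A behavioral sketch of this over-provisioning scheme (names hypothetical; the real recycling is pipelined over several cycles, whereas here unused registers are recycled immediately):

```python
from collections import deque

# Sketch of the first implementation: N registers are picked from EVERY
# subset free list, one is kept per instruction according to its cluster,
# and the unused picks enter a recycling structure.
def rename_cycle(free_lists, clusters_of_group, recycling):
    n = len(clusters_of_group)
    # Pick n registers from each subset's free list (over-provisioning).
    picked = {s: [fl.popleft() for _ in range(n)] for s, fl in free_lists.items()}
    # Instruction j keeps the j-th register picked from its cluster's subset.
    targets = [picked[c][j] for j, c in enumerate(clusters_of_group)]
    # Unused picks are recycled (pipelined in the real design).
    for s, regs in picked.items():
        recycling[s].extend(r for r in regs if r not in targets)
    return targets
```

With two instructions steered to clusters 0 and 1, four registers are picked but only two are kept; the other two must travel through the recycling pipeline before becoming usable again, which is exactly the residual problem noted above.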

2.4 Performance considerations

Pipeline stalls Depending on the instruction allocation policy to clusters and on the size of the register subsets, register write specialization may induce some extra or different pipeline stalls compared with a conventional approach. However, the impact on performance should be very small, provided that the total number of physical registers is sufficiently increased to absorb the slight imbalance that can occur among the local register requirements. Let us illustrate this with a 4-cluster processor example. Assume that the ISA features 32 (logical) registers and that each cluster is able to accept up to 56 in-flight instructions (i.e., a total of 224 instructions). For a conventional approach, using a 256-entry register file guarantees that register renaming is never the stalling factor. When using register write specialization, the same applies if 88 entries per register subset are available: at most 32 registers are mapped to the current architectural registers, and 56 registers then remain available for renaming, i.e., one per possible in-flight instruction.

Pipeline depth Depending on the instruction allocation policy to clusters, register write specialization may (or may not) induce extra pipeline stages in the register renaming process. With a round-robin or pseudo-random allocation, the read of the free lists can be initiated very early in the pipeline. When using pools of functional units associated with dedicated reservation stations, the allocation of instructions to the pools can be stored in the instruction cache

2.2.2 Second implementation

The difficulty associated with free register recycling can be eliminated at the cost of a longer pipeline for Task (B). For the group of N instructions to be renamed in a single cycle, the exact number of registers required from each register subset is first computed from the subset target vector V. Then exactly that number of free registers is picked from each free list. These groups of registers are then expanded and merged using the subset target vector. Careful design should limit the extra pipeline length incurred by such a design to 2 or 3 cycles in Task (B).
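A behavioral sketch of the exact-count scheme (hypothetical names; V[j] gives the register subset assigned to the j-th instruction of the group):

```python
from collections import deque

# Sketch of the second implementation: count how many registers each subset
# must provide, pick exactly that many, then expand and merge the per-subset
# groups back into program order using the subset target vector V.
def rename_cycle_exact(free_lists, V):
    counts = {s: V.count(s) for s in free_lists}        # per-subset demand
    picked = {s: deque(free_lists[s].popleft() for _ in range(c))
              for s, c in counts.items()}               # exact pick, no waste
    # Expand/merge: the j-th instruction takes the next register of subset V[j].
    return [picked[s].popleft() for s in V]
```

No register is ever picked and discarded, so the recycling pipeline disappears, at the price of the counting and expand/merge steps that lengthen Task (B).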

as predecoded bits. In these cases, we can reasonably assume that no extra pipeline stage is needed for either of the two register renaming implementations we have proposed. Other allocation policies, particularly policies using dynamic register dependencies to allocate instructions to clusters [14], will induce extra pipeline stages: the allocation of instructions to clusters must be executed in parallel with dependency propagation, and Task (B) of the register renaming process is therefore delayed.
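The sizing argument of the pipeline-stalls paragraph above can be checked numerically (a small sketch; the constants are the ones used in the text):

```python
# Numeric check of the sizing argument: 4 clusters, a 32-register ISA, and
# 56 in-flight instructions accepted per cluster.
LOGICAL, INFLIGHT_PER_CLUSTER, CLUSTERS = 32, 56, 4

conventional = LOGICAL + INFLIGHT_PER_CLUSTER * CLUSTERS  # 256-entry file
per_subset   = LOGICAL + INFLIGHT_PER_CLUSTER             # 88 entries/subset
total_wsrs   = per_subset * CLUSTERS                      # 352 entries total

print(conventional, per_subset, total_wsrs)  # 256 88 352
```

The worst case for one subset is that all 32 architectural registers are mapped into it while its cluster holds 56 in-flight instructions, hence 88 entries per subset (352 in total, versus 256 for the conventional file).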

We assume that step A1 is executed in parallel with dependency propagation in register renaming. The complexity of the second step is equivalent to the complexity of updating a map table.

Extra pipeline stages On a WSRS architecture, with the first register renaming implementation proposed in 2.2, register renaming is delayed by step A2 (read and update of the bit vectors V1 and V2). In Section 5, we will assume that this translates into a single extra pipeline stage before renaming. With the second register renaming implementation proposed in 2.2, two further actions dependent on step A2 must also be performed before the late phase of register renaming: 1) computation of the number of registers to pick from each of the free lists and 2) read of the free lists followed by expansion and merging of the groups of free registers. In Section 5, we will assume that this translates into a total of three extra pipeline stages before renaming.

3 WSRS architecture

In this section, we first analyze the constraints on register renaming and instruction allocation to clusters on a 4-cluster WSRS architecture. Then we present the degrees of freedom that exist in this allocation.

3.1 A 4-cluster WSRS Architecture

A 4-cluster WSRS architecture is illustrated in Figure 3: functional units are grouped into four identical clusters C0, C1, C2 and C3, and the set of registers is split into four distinct subsets of physical registers S0, S1, S2 and S3. Register write specialization: any result produced on cluster Ci is written to register subset Si. Register read specialization: for any instruction executed on a given cluster, the first (resp. the second) operand is read from a fixed pair of register subsets.

3.3 Degrees of freedom for allocating instructions to clusters

We list here some degrees of freedom that can be exploited for allocating instructions to clusters on the 4-cluster WSRS architecture.

Notation Instructions often use immediate operands. However, this paper is only concerned with dynamic register operands and results. We will therefore refer to an instruction using two register operands as a dyadic instruction and to an instruction using one register operand as a monadic instruction, independently of whether an extra immediate operand is used.

Monadic instructions A large fraction of the instructions are either monadic or use no register operand. Monadic instructions offer a degree of freedom for the distribution of instructions among clusters, since they can be executed by two clusters. However, this may lead to a slight imbalance in the workload: chains of dependent monadic instructions are executed on a single cluster pair (either (C0, C1) or (C2, C3)).

Commutative monadic instructions Monadic instructions are instructions that use a single register operand and, possibly, an immediate as a second operand, e.g., the addition of a register and an immediate. The usual convention is to use the register as the first operand and the immediate as the second operand. If a functional unit is implemented in such a way that it can take its register operand on either the right or the left entry, then commutative monadic instructions can be executed by any of three clusters on the 4-cluster WSRS architecture.

Commutative dyadic instructions In optimized codes, the compiler tends to keep invariant operands in registers in order to avoid repeatedly loading the same data from the cache. On a 4-cluster WSRS architecture, this may unbalance the workload among the clusters. This phenomenon can be limited by exploiting the commutativity of many dyadic instructions (add, or,

The execution cluster of a dyadic instruction and its register subset target are determined by the register subsets where its operands are located.

3.2 Cluster Allocation and Register renaming

The instruction allocation to clusters and register renaming are strongly linked in the 4-cluster WSRS architecture. Once the register subset target has been determined, register renaming can be handled as described in 2.2. A simple rule, illustrated in Figure 3, determines the cluster that executes instruction I: the position of the first operand determines whether instruction I is executed on the top or bottom 2-cluster, and the position of the second operand determines whether instruction I is executed on the left or right 2-cluster. The computations of the two bits that represent the execution cluster number for instruction I are independent. They can be implemented as follows. At any cycle, two bit vectors V1 and V2 represent respectively the first and second bits of the subset numbers of the physical registers allocated to the logical registers (i.e., logical register Ri is mapped onto a physical register in subset number 2*V1[i] + V2[i]). Computation of the new values of vectors V1 and V2 is very similar to register renaming. For a group of N instructions to be renamed in parallel, it consists of two phases: (A1) propagation of (pseudo-)dependencies within the group, (A2) read and update of vectors V1 and V2.
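A minimal sketch of this cluster-selection rule; the vector names V1 and V2 and their indexing by logical register are assumptions (the original symbols do not survive in this text):

```python
# Sketch of the cluster-selection rule: the first operand's subset bit
# selects the top/bottom cluster pair, the second operand's subset bit
# selects the left/right pair; the two bits are computed independently.
def execution_cluster(V1, V2, first_src, second_src):
    """Return the cluster number (0..3) for a dyadic instruction."""
    top_bottom = V1[first_src]    # first bit of the first operand's subset
    left_right = V2[second_src]   # second bit of the second operand's subset
    return 2 * top_bottom + left_right
```

Because each bit depends on only one operand, the two lookups can proceed in parallel, which is what makes phase (A2) resemble a (cheap) map-table update.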

exclusive-or, ...). Commutative dyadic operations can be executed on two clusters provided that the two operands do not lie in the same register subset. This second degree of freedom can be exploited by inverting the two operands before cluster allocation in the pipeline.

Compared with a conventional superscalar architecture (Figure 1), the 4-cluster WSRS architecture presents a major difference: any physical register is connected with only half of the functional unit entries and can be written by only one fourth of the functional units. On a 4-cluster 8-way WSRS architecture, each physical register can be implemented using two (4-read, 3-write) register copies, instead of four (4-read, 3-write) register copies when using register write specialization alone, or four (4-read, 12-write) register copies on a conventional 4-cluster 8-way architecture. In this section, we try to quantitatively evaluate how this impacts the access time, power consumption and silicon area of the physical register file.

"Commutative" clusters In order to further increase the degree of freedom provided by commutative dyadic instructions, functional units can be implemented so as to execute instructions in two forms, with the operand order inverted (for instance, computing either A-B or -A+B). Any dyadic instruction with register operands in two different register subsets can then be executed on two clusters. Moreover, any monadic instruction can be executed on three of the four clusters.
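The degrees of freedom above can be summarized in a small sketch, assuming the subsets are numbered so that the high bit of the first operand's subset selects the top/bottom cluster pair and the low bit of the second operand's subset selects the left/right pair (this bit encoding is an assumption consistent with the rule of Section 3.2):

```python
def hi(s): return s >> 1   # high bit of a subset number (0..3)
def lo(s): return s & 1    # low bit of a subset number (0..3)

def allowed_clusters(p, q, commutative):
    """Clusters able to execute a dyadic instruction whose operands lie in
    register subsets p and q, with or without operand swapping."""
    clusters = {2 * hi(p) + lo(q)}        # normal operand order
    if commutative:
        clusters.add(2 * hi(q) + lo(p))   # swapped operand order
    return clusters

print(allowed_clusters(0, 3, False))  # {1}: one cluster without swapping
print(allowed_clusters(0, 3, True))   # {1, 2}: two clusters with swapping
print(allowed_clusters(2, 2, True))   # {2}: same subset -> still one cluster
```

This reproduces the two claims in the text: swapping buys a second cluster only when the operands lie in different subsets, and never helps when they share a subset.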

4 Complexity of the 4-cluster WSRS architecture

In this section, we compare an 8-way 4-cluster WSRS architecture with a conventional 8-way superscalar processor in terms of implementation complexity. Throughout this section, we use the example of a symmetric 4-cluster architecture. Each cluster is able to issue two instructions per cycle. On any cycle, up to 4 reads and 6 writes (4 ALU results and 2 load results) on the register file may be generated by each cluster. These parameters are similar to those found on the two-cluster Alpha 21264 [11].

4.2.1 Methodology

Silicon area estimation The silicon footprint of a multiported register file is dominated by the area devoted to memory cells [21]. When the number of ports is high, the size of a multiported memory cell is approximately a quadratic function of its number of access ports [19]. For a conventional multiported memory cell featuring R read ports and W write ports, 2W write bitlines, W write wordline wires, R read bitlines and R read wordline wires must cross the cell [21]. With w being the width of each wire (i.e., the width of the wire itself plus the distance to the neighboring wire), the area devoted to a register cell is given by:

4.2 Complexity of the register file

Area = w^2 (R + 2W)(R + W)    (1)

We use Formula 1 to report the silicon area devoted to representing a single bit of a physical register.

4.1 Constraints and extra hardware

Four identical clusters The 4-cluster WSRS architecture involves the use of identical clusters of functional units. Each cluster must be able to execute every kind of instruction. This might be an issue for complex integer instructions such as integer division or multiplication. Replicating dividers and multipliers on every cluster might be considered a waste of silicon. As an alternative to complete replication, a divider (resp. multiplier) can be shared between two adjacent clusters. Static arbitration between the two clusters should allow smooth sharing.

Power consumption and access time estimation In order to evaluate the peak power consumption and the access time of multiported register files in future superscalar processors, we used the CACTI 2.0 package [20]. Since CACTI 2.0 is devoted to evaluating the peak power consumption and access time of caches, we discarded the tag path in the measures presented here. We also modified CACTI 2.0 in order to take register write specialization into account.

Technology assumptions Due to the 4-6 year microprocessor design cycle, current research propositions cannot appear in products before 2006-2008. We therefore present this evaluation for a two-generation-ahead CMOS technology and a 10 GHz clock2. If the current trend of increasing clock frequency continues, one can reasonably expect to achieve frequencies in the 10 GHz range using this CMOS technology.

Considered configurations We report estimates for four 8-way issue superscalar architectures and a 4-way issue superscalar architecture.
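Assuming a cell-area model of the form Area = w^2 (R + 2W)(R + W), the per-bit gain of specialization can be checked numerically (the function name is hypothetical; the model form is an assumption consistent with the quadratic growth in port count noted above):

```python
# Numeric check of the assumed cell-area model Area = w^2 (R + 2W)(R + W):
# bitlines (R + 2W) times wordlines (R + W), scaled by the wire pitch squared.
def cell_area(R, W, w=1.0):
    """Relative bit-cell area for R read ports and W write ports."""
    return (w ** 2) * (R + 2 * W) * (R + W)

# Monolithic 8-way cell (16R,12W) versus one WSRS cell copy (4R,3W):
ratio = cell_area(16, 12) / cell_area(4, 3)  # (40*28) / (10*7) = 16.0
print(ratio)
```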

Need for more physical registers On a 4-cluster WSRS architecture, more physical registers are needed than on a conventional 4-cluster architecture, but each of them supports fewer access ports.


More complex register renaming pipeline It has been pointed out in Section 3.2 that, compared with the conventional architecture, the complete renaming process on the WSRS architecture (including the allocation of instructions to clusters) involves 1 to 3 extra pipeline stages (depending on the register renaming implementation). It also requires some extra hardware logic (extra free lists, a register recycling pipeline, ...).


2 "The technology scaling in CACTI 2.0 should work well down to ..." Norm Jouppi, private communication.

The four considered 8-way configurations are (1) noWS-M, a conventional 8-way architecture with a monolithic register file (Figure 1.a); (2) noWS-D, a conventional 4-cluster architecture with a distributed register file (Figure 1.b); (3) WS, a 4-cluster architecture featuring register Write Specialization (Figure 2.a); and (4) WSRS-S, a 4-cluster WSRS architecture (Figure 3). Estimates for a conventional 2-cluster 4-way architecture, noWS-2, are also presented. The Alpha 21264 features 80 physical integer registers. As future processors will feature deeper pipelines, we assume 128 physical integer registers for a conventional 4-way processor and twice as many for a conventional 8-way superscalar processor. A total of 512 registers is assumed for the WS and WSRS architectures, since more registers are needed in these architectures.

4.3 Complexity of the bypass network and of the wake-up and selection logic

For any instruction executed on a given cluster, its first (resp. second) operand can have been produced by only two of the four clusters. Therefore the bypass point at a functional unit entry is connected with only half of the result buses. The wake-up logic in a given cluster monitors only two clusters as possible producers of its first operands and two clusters as possible producers of its second operands.

4.3.1 Bypass network complexity

The bypass network allows the functional units to use the result of an operation as soon as it has been produced. As access to the register file will be pipelined on future generations of processors, the performance of the processor will depend dramatically on the bypass network. Two distinct issues must be distinguished: first, the ability of the bypass network to forward data to the functional unit entries; second, the fast-forwarding capability, i.e., the ability to use an instruction result as an operand of a dependent instruction on the very next cycle.

4.2.2 Estimates

Table 1 reports register file characteristics as well as estimates of access time, power consumption and silicon area for the five considered configurations. We report the following characteristics of the register files: the number of copies of each individual register, the number of read and write ports on each individual register copy and the total number of register subfiles. We also report an estimate of the relative size of the register file compared with the size of the register file of a 2-cluster 4-way issue processor. The number of pipeline stages needed to access the register file is also estimated, first assuming a very aggressive 10 GHz clock and then a less aggressive 5 GHz clock. An extra half cycle is assumed in order to drive the data to the functional units. Using this register read pipeline depth and assuming a complete bypass network, we also report the number of possible sources that must be arbitrated by a bypass point (see 4.3.1).

Bypass point complexity A complete bypass network connects all result buses to all functional unit entries. The cost of a complete bypass network is huge: if the read-write pipeline on the register file is X cycles long and if a register can be produced by U possible units, then, for each functional unit entry, up to U*X already computed results are potentially inaccessible from the register file. The bypass point logic must then choose among U*X + 1 possible sources for each operand. The complexity of the bypass point on a 4-cluster WSRS architecture benefits from two orthogonal factors compared with a conventional architecture. First, the register operand accessed by a functional unit entry can have been produced by only two clusters instead of four. Second, the register read pipeline is shorter on a 4-cluster WSRS architecture. As a consequence, assuming a complete bypass network, a bypass point on a 4-cluster WSRS architecture must choose among the same number of possible sources as a bypass point on a 2-cluster conventional architecture (see Table 1).
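A sketch of this source-count arithmetic, assuming the bypass point arbitrates among the in-flight results plus the register file read port itself (the "+ 1" term):

```python
# Numeric check of the bypass-point argument: with U result buses visible at
# an entry and an X-cycle register read pipeline, a bypass point arbitrates
# among U * X in-flight results plus the register file read (one source).
def bypass_sources(result_buses, pipeline_cycles):
    return result_buses * pipeline_cycles + 1

print(bypass_sources(12, 8))  # 97: monolithic 8-way, 8-cycle read pipeline
print(bypass_sources(12, 5))  # 61: monolithic 8-way, 5-cycle read pipeline
```

These two values match the noWS-M entries of Table 1 for the 10 GHz and 5 GHz pipelines, which supports this reading of the formula.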



Analysis: By reducing the number of ports on each individual register, register write specialization alone enables a dramatic complexity reduction of the overall register file in terms of silicon area, power consumption and access time. Using a WSRS architecture further halves the silicon area and further reduces the access time and the power consumption. Compared with a conventional 4-cluster 8-way architecture (noWS-D), the total silicon area of the physical register file is divided by more than six, despite the fact that the number of physical registers is doubled. Peak power consumption is more than halved and access time is reduced by about one third. This will allow the implementation of a shorter register read pipeline. Compared with the 2-cluster conventional architecture, the physical register file of the 4-cluster WSRS architecture scales very smoothly: (a) the read access time is in the same range, (b) the total silicon area is only increased by 75%, and (c) the power consumption only doubles.
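The headline ratios in this analysis follow directly from the Table 1 estimates; a quick arithmetic check (all numbers copied from Table 1):

```python
# Per configuration: (relative silicon area, nJ/cycle, access time in ns)
noWS_D = (11.2, 2.90, 0.52)   # conventional 4-cluster 8-way
wsrs   = (1.75, 1.25, 0.35)   # 4-cluster WSRS
noWS_2 = (1.0,  0.63, 0.34)   # conventional 2-cluster 4-way

area_ratio  = noWS_D[0] / wsrs[0]    # 6.4: area divided by more than six
power_ratio = noWS_D[1] / wsrs[1]    # ~2.32: more than halved
time_cut    = 1 - wsrs[2] / noWS_D[2]  # ~0.33: about one third shorter
print(round(area_ratio, 2), round(power_ratio, 2), round(time_cut, 2))

# Versus the 2-cluster baseline: area +75 %, power roughly doubles
print(wsrs[0] / noWS_2[0], round(wsrs[1] / noWS_2[1], 2))
```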

Fast-forwarding: Forwarding an operation result so that it can be used as an operand on the very next cycle is also a very challenging task. The transit delays between the functional units are becoming long compared with an ALU operation. On the other hand, a systematic delay of one (or more) cycle for forwarding a result to a dependent instruction impairs performance [4]. To a first approximation, the fast-forwarding delay increases with the distance between the producer of a register and its consumer. A second-order factor for this delay is the number of entries that have to be fed within the next cycle. Three possibilities, of increasing hardware complexity, are natural on a 4-cluster architecture:

                                  noWS-M   noWS-D   WS      WSRS    noWS-2
nb of registers                   256      256      512     512     128
register copies                   1        4        4       2       2
(R,W) ports per copy              (16,12)  (4,12)   (4,3)   (4,3)   (4,6)
physical subfiles                 1        4        4       4       2
nJ/cycle                          3.20     2.90     1.70    1.25    0.63
Access time (ns)                  0.71     0.52     0.40    0.35    0.34
Pipeline cycles: 10 GHz           8        6        5       4       4
sources per bypass point: 10 GHz  97       73       61      25      25
Pipeline cycles: 5 GHz            5        4        3       3       3
sources per bypass point: 5 GHz   61       49       37      19      19
Reg. bit area                     1120     1792     280     140     320
relative size                     7        11.2     3.50    1.75    1

Table 1. Estimates for different architecture configurations



- Fast-forwarding inside a single cluster: the WSRS architecture presents the advantage that, assuming random distribution of instructions to clusters (when some freedom is available), statistically two out of four possible consumers of a result will be located on the producer cluster, instead of only one out of four in a conventional architecture.

- Fast-forwarding inside pairs of adjacent clusters: on the WSRS architecture, statistically three out of four possible consumers of a result will be able to capture it on the very next cycle, instead of two out of four on a conventional 4-cluster architecture.

- Complete fast-forwarding: Figure 3 suggests a possible layout of the 4-cluster WSRS architecture where the consumer cluster is always close to (i.e., touches) the producer cluster. Such a layout may favor a simpler implementation of complete fast-forwarding than on a conventional 4-cluster architecture.
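The coverage fractions quoted for the first two fast-forwarding options can be checked with a small enumeration. The model below is our reading of the statistics, with a hypothetical cluster numbering: in the WSRS machine, a result is readable as a left operand on two clusters and as a right operand on two clusters (one of each pair being the producer's cluster), giving four equally likely "possible consumers"; in the conventional machine, a consumer may land on any of the four clusters.

```python
from fractions import Fraction

# Hypothetical numbering: producer on cluster 0.
wsrs_consumers = [0, 1,   # consumers using the result as their left operand
                  0, 2]   # consumers using the result as their right operand
conventional_consumers = [0, 1, 2, 3]  # any cluster may host the consumer

def covered(consumers, ff_set):
    """Fraction of possible consumers reachable by fast-forwarding."""
    return Fraction(sum(c in ff_set for c in consumers), len(consumers))

# Fast-forwarding inside the producer cluster only: 1/2 vs 1/4
print(covered(wsrs_consumers, {0}), covered(conventional_consumers, {0}))
# Fast-forwarding inside the producer's pair of adjacent clusters: 3/4 vs 1/2
print(covered(wsrs_consumers, {0, 1}), covered(conventional_consumers, {0, 1}))
```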

4.3.2 Wake-up logic complexity

In an out-of-order execution processor, an instruction cannot be issued before its operands are guaranteed to be valid in time (for loads, a cache hit is predicted). On each cycle, the wake-up logic entry associated with an instruction must monitor every possible source for any of its operands and check it against its register operand numbers: if an instruction features two register operands and if N possible sources can produce these register operands, then each wake-up logic entry implements 2 × N comparators. The wake-up logic (and these comparators in particular) is responsible for a significant part of the power consumption of the processor [9, 12]; limiting the number of comparators in each wake-up logic entry is therefore a major challenge. The wake-up logic response time also increases dramatically when the number of possible sources for an operand doubles from 4 to 8 (a 46% increase is reported in [14], assuming 0.18 μm CMOS technology).

For an instruction executed on a given cluster of a 4-cluster WSRS architecture, a given operand can only be produced by two of the four clusters in the processor: a wake-up logic entry on an 8-way 4-cluster WSRS architecture features only the same number of comparators as that of a 4-way issue conventional processor.

5 Performance evaluation

A WSRS architecture features a deeper pipeline than a conventional architecture, as well as strong constraints on the policy for allocating instructions to clusters. Therefore, assuming equal cycle time for a conventional clustered architecture and a 4-cluster WSRS architecture, one might expect some performance loss from the 4-cluster WSRS architecture. The preliminary simulations presented in this section show that, contrary to this intuition, the 4-cluster WSRS architecture stands the performance comparison with a conventional 4-cluster architecture.

5.1 Experimental framework

5.1.1 Sparc ISA

For our simulations, we used the Sparc ISA. Instructions using three register operands (i.e., indexed stores, ...) are translated at decode into two microoperations, as suggested in Section 3. The Sparc ISA features register windows. In our simulations, we considered that 4 register windows are mapped in the physical register file at the same time, i.e., a total of 80 logical general-purpose registers are used. An exception is taken on a window overflow.

5.2 General characteristics of the simulated architecture

8-way 4-cluster architectures are considered. All clusters are assumed identical. Each cluster is 2-way issue and features a single load/store unit, a single fully pipelined floating-point unit and two integer ALUs. Latencies of instructions are summarized in Table 2. Fast-forwarding is possible inside a single cluster; a one-cycle delay is needed to forward a result from one cluster to another.

inst           loads  ALU  mul/div  fadd/fmul  fdiv/fsqrt
lat. (cycles)  2      1    15       4          15

Table 2. Latencies for principal instructions

Our study focuses on the impact of the WSRS architecture on the performance of the execution core. Therefore, we made some simplifying hypotheses on the instruction fetch front-end of the processor. We assume that the front-end stages of the pipeline, up to the rename stage, deliver eight instructions/microoperations per cycle at a sustained rate. That is, our simulations ignore all the artefacts associated with irregular instruction fetch bandwidth. Realistic conditional branch prediction was simulated. A very large 2Bc-gskew branch predictor featuring 512 Kbits of storage was considered [17]. The size and accuracy of this branch predictor are equivalent to those of the branch predictor of the cancelled Alpha EV8 microprocessor [16]. Perfect branch target prediction was assumed, since target mispredictions for PC-relative branches can be corrected very early in the pipeline, procedure returns can be predicted almost perfectly with a return stack, and very few indirect jumps were encountered in our benchmark set. Load/store addresses were computed in order, loads bypassing stores whenever no conflict was encountered. The data memory hierarchy was modelled using the parameters reported in Table 3.

         size     latency    miss pen.   bandwidth
L1 D-$   32 Kb    2 cycles   12 cycles   4 W/cycle
L2 $     512 Kb   12 cycles  80 cycles   16 B/cycle

Table 3. Memory hierarchy characteristics

5.2.1 Simulated configurations

As a comparison point, we used a conventional 4-cluster superscalar processor using a round-robin allocation policy; 256 physical registers are assumed. The current trend for pipeline depth on high-end microprocessors is toward very deep pipelines (14 stages on EV8, 18 on Pentium 4). We assume a minimum 17-cycle misprediction penalty for this base case.

A processor featuring only register Write Specialization is also simulated. Round-robin instruction allocation to clusters is again assumed, with a total of 384 or 512 physical registers. Both register renaming strategies described in Section 2 were simulated with a minimum misprediction penalty of 16 cycles: the register read pipeline is one cycle shorter than on the conventional architecture (see Section 4.2). As simulation results did not exhibit any significant difference, we only display the simulation results for the second register renaming strategy.

For the 4-cluster WSRS architecture, we also assume a total of 384 or 512 physical registers. For the first register renaming strategy, we set the minimum misprediction penalty to 16 cycles, and for the second strategy to 18 cycles. These misprediction penalties take into account respectively one and three extra pipeline stages before renaming (see Section 3.2) and two pipeline stages saved on the register read (see Section 4.2). As with Write Specialization alone, simulation results for the two register renaming strategies were very close; we therefore only display results for the second strategy. We simulated two simple allocation policies on the WSRS architecture:

- random monadic, RM: for monadic instructions, the register operand determines the top or bottom bicluster; the left or right bicluster is randomly selected before register renaming.

- random "commutative" cluster, RC: functional units are assumed to be able to execute any instruction in two forms (e.g. A-B and -A+B), taking their first operand either on their left entry port or on their right entry port. The form of the instruction is first randomly selected. Then, for monadic instructions, two clusters are able to execute the instruction, and one of them is randomly selected.

5.3 Benchmark selection

We present simulation results for 7 SPECFP2000 benchmarks (wupwise, swim, mgrid, applu, galgel, equake and facerec) and 5 SPECINT2000 benchmarks (gzip, vpr, gcc, mcf and crafty) using the ref input sets. Codes were compiled with the following options: cc -xO3 -xarch=v8plusa -xCC, c++ -xO3 -xarch=v9, f77 -fast -xarch=v8plusa, f90 -fns=no -fast -xarch=v9. The initialisation phase of each application was skipped using a fast-forward mode, then caches and branch prediction structures were warmed for 20 million instructions. A slice of 10 million instructions was then simulated.

5.4 Simulation results

Figure 4 summarizes the performance results on the different benchmarks (measured in instructions per cycle).

5.4.1 Register Write Specialization only

As expected, the same level of performance is reached by the conventional architecture and by register write specialization alone on integer applications. For floating-point applications, a marginal performance improvement is consistently obtained when using register write specialization. This marginal performance increase is allowed by the larger instruction window permitted by an overall larger register set. This trend is further enhanced by increasing the overall number of physical registers from 384 to 512.

[Figure 4 plots IPC on the integer benchmarks (gzip, vpr, gcc, mcf, crafty) and the floating-point benchmarks (wupwise, swim, mgrid, applu, galgel, equake, facerec) for the RR 256, WSRR 384, WSRR 512, WSRS RC 384, WSRS RC 512 and WSRS RM 512 configurations.]

Figure 4. Performance results

5.4.2 WSRS architecture

On all our integer applications, the 4-cluster WSRS architecture performs slightly better than the conventional architecture. On the other hand, the 4-cluster WSRS architecture performed slightly worse than the conventional architecture on most floating-point applications, particularly the applications with relatively high IPCs. Nevertheless, when using the RC instruction allocation policy (column WSRS RC), performance always stays within a 3% margin of the base architecture. Note that increasing the total number of registers from 384 to 512 has a minor impact on performance.

Analysis: Two phenomena associated with the distribution of instructions to clusters have opposite impacts on performance. Round-robin allocation of instructions to clusters leads to a better balancing of the workload among the clusters than is achieved by the RM and RC policies on the WSRS architecture. On the other hand, the RM and RC policies statistically place instructions "closer" to the producer(s) of their operand(s) than the round-robin allocation policy does.

To characterize the unbalancing of the workload, we split the applications into groups of 128 instructions and measure the ratio of these groups that are unbalanced. We arbitrarily define a group as unbalanced whenever one of the four clusters gets fewer than 24 instructions or more than 40 instructions. We define the unbalancing degree of an application as the ratio of unbalanced instruction groups in the application. Figure 5 represents the unbalancing degrees on our set of benchmarks. The round-robin policy exhibits a perfect balancing degree. The RM policy uses fewer degrees of freedom than RC; therefore, in most cases, it exhibits the highest unbalancing degree. Floating-point benchmarks tend to exhibit higher unbalancing degrees than integer benchmarks. For instance, on the two high-IPC benchmarks that exhibit the most significant performance loss (facerec and wupwise), the unbalancing degree is close to 100%, while on the high-IPC integer benchmarks (gzip and crafty), the unbalancing degree is around 80%. In future research on allocation policies for WSRS architectures, we plan to study dynamic policies that trade off allocation of dependent instructions within a cluster against (local) workload balancing between clusters.

6 Related works

VLIW ISAs such as Multiflow [3] or, more recently, Lx [6] implement distinct logical register files that are accessed by different clusters of functional units. Whenever the operands of an operation lie in different register files, the compiler is responsible for inserting moves between the register files to enable the execution of the operation. This allows the implementation of wide-issue statically scheduled processors using silicon register files with a limited number of read and write ports. Our proposal tackles the implementation of dynamically scheduled wide-issue processors for current general-purpose ISAs featuring a single logical register file.

For an ISA featuring a single logical register file, Farkas et al. [7] proposed the use of two distinct physical register files, each of them associated with a subset of the ISA logical registers. Each physical register file is associated with a cluster of functional units. The main difficulty with this approach is that, whenever an instruction uses two logical operands mapped onto the two distinct subsets of the logical register file, moves have to be generated by the hardware between the two physical register files. The load balancing of the clusters is also very sensitive to code generation. However, in some sense this work is close to our proposals, since the unwritten rule we cited in the introduction is also transgressed.

[Figure 5 plots the unbalancing degrees (in %) of the WSRS RM and WSRS RC allocation policies on the integer benchmarks (gzip, vpr, gcc, mcf, crafty) and the floating-point benchmarks (wupwise, swim, mgrid, applu, galgel, equake, facerec).]

Figure 5. Unbalancing degrees

Previous research work on improving access time to the register file in out-of-order execution superscalar processors includes virtual-physical registers [13] (limiting the number of physical registers) and register caching [4] (caching the critical registers).

Monreal et al. [13] proposed virtual-physical registers: the allocation of the physical register is delayed until instruction execution or even result write-back. The renaming of registers is replaced by the allocation of a virtual stamp which is not directly connected with any physical location; a physical register is associated with the virtual stamp at instruction execution (or result write-back). This solution reduces the number of required physical registers, and therefore the silicon area of the physical register file and its power consumption.

Cruz et al. [4] remarked that many physical registers have to be accessible on the very next cycle, while many physical registers are never read at all, since they are used only once and are captured through the bypass network. They proposed to use a register file cache: only registers likely to be useful in the very next cycles are written into the register file cache. A complete copy of the register file is maintained, but it can feature a longer access time as well as fewer read ports. This organization results in low-latency register access while supporting a large number of physical registers. As it allows a shorter register read pipeline, it also decreases the number of possible sources at each bypass point.

Balasubramonian et al. [1] remarked that a physical register must stay alive long after its last read has been issued, that is, until this last read is validated; the contents of these alive critical registers can therefore migrate to an L2 register cache. As many instructions do not really use the read and write ports on the register file (monadic instructions, instructions with no result, operands captured on the bypass network), Balasubramonian et al. [1] also proposed to arbitrate at run time for register access ports, thus implementing fewer ports on the register file. However, this puts more pressure on another critical path in the processor: the wake-up and selection logic. Stark et al. [18] proposed to pipeline the wake-up and selection logic to address this electrical critical path. Brown et al. [2] further proposed to optimistically select any instruction that is fireable, removing the selection logic from the critical path. The critical path as well as the power consumption of the wake-up logic are addressed by Ernst and Austin [5] through eliminating tag comparison for one of the operands. Folegnani and Gonzalez [8] proposed to selectively disable part of the comparators in the wake-up logic. We would like to point out that all these techniques [13, 4, 1, 18, 2, 5, 8] are orthogonal to WSRS and can be applied at cluster level to WSRS architectures.

7 Conclusion

Scaling current superscalar designs toward wider-issue processors would result in a more than quadratic increase of the register file, bypass network and wake-up/selection logic complexities. In this paper, we have shown that these issues may be attacked by transgressing an unwritten rule that has so far been applied to all superscalar processor designs. All currently used general-purpose ISAs feature a single set of logical general-purpose registers. This central view of the general-purpose register file has also been adopted for the hardware implementation of the physical register file, i.e., any general-purpose physical register can be read or written by any integer functional unit. By using Register Write Specialization, i.e., by forcing each functional unit to write only to a fixed subset of the physical register file, one can dramatically decrease the complexity of the physical register file in a wide-issue superscalar processor. The number of write ports on each physical register is decreased, and therefore the silicon area, the power consumption and the access time of the register file are significantly decreased. Register Write Specialization does not impair performance for static policies of allocating instructions to clusters or functional units.

Second, we have proposed to combine register write specialization and register read specialization in a 4-cluster WSRS architecture. On an 8-way 4-cluster WSRS architecture, the complexities of the bypass points and of the wake-up logic entries are the same as those found in a conventional 4-way superscalar processor. Moreover, the complexity of the physical register file is even further reduced compared with using register write specialization alone (see Table 1 in Section 4). The 4-cluster WSRS architecture trades the complexity of the register file, the bypass network and the wake-up logic against degrees of freedom in allocating instructions to clusters and a more complex register renaming (at the cost of a few extra pipeline stages in register renaming). The location of the physical register operands restricts the set of clusters that can execute an instruction. However, monadic instructions can be executed on several clusters, and commutative dyadic operations can also be executed on several clusters. The performance study presented in this paper indicates that, by exploiting these degrees of freedom, simple policies for allocating instructions to clusters are able to reasonably balance the workload among the clusters. This performance study also indicates that, at equal cycle time, the 4-cluster WSRS architecture achieves performance levels in the same range as a conventional 4-cluster architecture using round-robin instruction allocation. Furthermore, in [15], we have also shown that the WSRS architecture can be extended to a 7-cluster architecture while maintaining the complexities of each individual wake-up logic entry and each bypass point, and still using only two (4-read, 3-write) copies of each individual physical register.

References

[1] Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi. Reducing the complexity of the register file in dynamic superscalar processors. In Proceedings of the 34th Annual International Symposium on Microarchitecture, December 2001.
[2] M. Brown, J. Stark, and Y. Patt. Select-free instruction scheduling logic. In Proceedings of the 34th Annual International Symposium on Microarchitecture, December 2001.
[3] Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth, and Paul K. Rodman. A VLIW architecture for a trace scheduling compiler. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), October 1987.
[4] José-Lorenzo Cruz, Antonio Gonzalez, Mateo Valero, and Nigel Topham. Multiple-banked register file architectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[5] D. Ernst and T. Austin. Efficient dynamic scheduling through tag elimination. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[6] Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, and Fred (Mark Owen) Homewood. Lx: A technology platform for customizable VLIW embedded processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[7] K. I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic. The multicluster architecture: Reducing cycle time through partitioning. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-97), December 1997.
[8] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In Proceedings of the 28th Annual International Symposium on Computer Architecture, June–July 2001.
[9] R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, September 1996.
[10] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 2001.
[11] Richard E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24–36, 1999.
[12] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), June 1998.
[13] Teresa Monreal, Antonio Gonzalez, Mateo Valero, José Gonzalez, and Victor Vinals. Dynamic register renaming through virtual-physical registers. Journal of Instruction-Level Parallelism, May 2000.
[14] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206–218, 1997.
[15] André Seznec. A path to complexity-effective wide-issue superscalar processors. Technical Report IRISA PI 1411, August 2001.
[16] André Seznec, Stephen Felix, Venkata Krishnan, and Yanos Sazeidès. Design tradeoffs for the EV8 branch predictor. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[17] André Seznec and Pierre Michaud. De-aliased hybrid branch predictors. Technical Report RR-3618, Inria, 1999.
[18] J. Stark, M. D. Brown, and Y. N. Patt. On pipelining dynamic instruction scheduling logic. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, December 2000.
[19] Marc Tremblay, Bill Joy, and Ken Shin. A three dimensional register file for superscalar processors. In Proceedings of the 28th Annual Hawaii International Conference on System Sciences, January 1995.
[20] Steven J. E. Wilton and Norman P. Jouppi. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, May 1996.
[21] Victor Zyuban and Peter Kogge. The energy complexity of register files. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED-98), August 1998.
