an application specific multi-port ram cell circuit for ... - CiteSeerX

AN APPLICATION SPECIFIC MULTI-PORT RAM CELL CIRCUIT FOR REGISTER RENAMING UNITS IN HIGH SPEED MICROPROCESSORS

Alessandro De Gloria* and Mauro Olivieri**
(*) University of Genoa, Italy, (**) University of Rome “La Sapienza”, Italy

ABSTRACT: We present a novel custom circuit for superscalar microprocessor renaming units and compare its performance with a conventional design, referring to an industrial 0.35 µm CMOS process. Speed and power consumption are significantly improved.

1. INTRODUCTION

High performance superscalar microprocessors rely on register renaming units that replace logical register tags with physical register tags, in order to increase parallelism among instructions [5]. During the renaming phase, the logical destination register of an instruction is renamed with a physical register, taken from a pool of available registers (free list), so that no two instructions with the same destination register enter the processor pipeline. Accordingly, a map table is maintained to translate the logical source registers of an instruction into (already renamed) physical register tags. By these operations, write-after-write and write-after-read (i.e. false) data dependencies are removed, and only read-after-write (i.e. true) dependencies are left in the instruction flow [6]. A detailed definition of the renaming algorithm is given in [6][2].

The renaming circuits are located in the processor pipeline immediately after instruction decoding (Fig. 1). From the performance point of view, the critical part of the circuit operation is mapping the already renamed logical register tags to the corresponding physical tags. Since superscalar processors fetch more than one instruction per cycle from memory, the register renaming circuits must be capable of coherently mapping several instructions per cycle. Recent studies have pointed out that this operation is one of the critical speed limits in future microprocessors [3][4] operating at clock frequencies near 1 GHz.
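The renaming step described above can be illustrated with a minimal behavioral sketch. It is not the circuit proposed in this paper: the function name, the tuple layout (dest, src1, src2) and the identity initial mapping are assumptions made for illustration, and the return of old tags to the free list is omitted.

```python
from collections import deque


def rename_sequential(instrs, num_logical, num_physical):
    """Rename (dest, src1, src2) logical-register triples one at a time.

    Hypothetical helper: assumes logical register i initially maps to
    physical register i and that the free list never runs empty.
    """
    map_table = list(range(num_logical))
    free_list = deque(range(num_logical, num_physical))
    renamed = []
    for dest, src1, src2 in instrs:
        # Sources are translated through the current (already renamed) mapping.
        p1, p2 = map_table[src1], map_table[src2]
        # The destination receives a fresh physical register from the free list.
        pd = free_list.popleft()
        map_table[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed
```

For the sequence r1 = r2 op r3; r4 = r1 op r5; r1 = r6 op r7, the two writes to r1 receive different physical registers, so the write-after-write and write-after-read dependencies disappear, while the second instruction still reads the physical register written by the first (the true dependency is kept).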
The task of name mapping can be accomplished by means of either a CAM-based approach or a RAM-based one [3][4]. Existing microprocessors implement either method [1][7][10][2], though the former is generally considered less scalable for future architecture trends [4]. We focus on the RAM-based method. VLSI works published on the subject are based on the integration of standard logic blocks [3][4][9]. We present a new custom circuit design and compare its performance with a standard solution by means of layout level Spice simulation, referring to an industrial 0.35 µm CMOS process. Results on improved speed and power consumption show the effectiveness of the proposed solution.

2. RENAMING UNIT STANDARD IMPLEMENTATION

Let N be the number of instructions fetched per cycle by the processor (the rename pool). Fig. 1 shows the architecture of a standard renaming unit implementation for N = 2 [4] (the priority encoding logic is only needed for N > 2). The map table is implemented by a static random access memory (RAM), where the logical register tag directly addresses the location containing the corresponding physical register tag. The number of memory locations is equal to the number of logical registers. In every clock cycle, the name mapping of the instruction source registers is read from the RAM, while the new mapping of the destination registers is written into the RAM. At the same time, the overwritten tags must be read and inserted in the free list buffer (footnote 2). Assuming instructions with 2 source registers and 1 destination register, the RAM has 3N read ports and N write ports.

The RAM also has the capability of saving a whole register-mapping configuration in a “shadow” RAM, for later restoring in case of speculative execution faults (e.g. branch misprediction [6]). This is accomplished by providing each memory cell with a shadow cell. For multiple branch prediction, there is a stack of shadow cells for each memory cell.

Since several instructions in the rename pool can have the same destination register, dedicated circuitry (Fig. 1 right) is needed to prevent concurrent writes to the same location of the RAM. In case different tags are tentatively written to the same location, a set of multiplexers selects the latest tag (in fetch order) as the valid one, and sends the other tags back to the free list buffer.

Furthermore, a dependence check unit is present (Fig. 1 bottom), to detect the cases where the registers being mapped are destination registers of an earlier instruction in the same rename pool. The unit compares each logical register tag being renamed with the logical destination registers of the earlier instructions in the rename pool. If there is a match, the output multiplexer selects as the valid mapping the physical register assigned to the destination register of the earlier instruction, instead of the physical register found in the RAM. In case of multiple matches, a priority encoding logic selects the latest match as the valid mapping. All the above operations must be completed in a single clock cycle.
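The single-cycle behavior described above (RAM read, dependence check with latest-match priority, and multi-write management) can be summarized by the following behavioral sketch. It is a functional model only, not a description of the circuit; the function name, argument layout and the order in which released tags are listed are assumptions made for illustration.

```python
def rename_group(map_table, free_tags, group):
    """Rename one group of (dest, src1, src2) logical tags in a single step.

    map_table: list indexed by logical tag, holding physical tags; it is
               updated in place (hypothetical representation of the RAM).
    free_tags: physical tags taken from the free list, one per instruction.
    Returns (renamed, released): the renamed triples and the tags sent
    back to the free list buffer.
    """
    # Every RAM read sees the table as it was at the start of the cycle.
    old = list(map_table)
    renamed = []
    for k, (dest, src1, src2) in enumerate(group):
        phys_srcs = []
        for src in (src1, src2):
            tag = old[src]  # mapping read from the RAM
            # Dependence check: compare with the destinations of earlier
            # instructions in the group; the latest match overrides.
            for j in range(k):
                if group[j][0] == src:
                    tag = free_tags[j]
            phys_srcs.append(tag)
        renamed.append((free_tags[k], phys_srcs[0], phys_srcs[1]))

    released = []
    for k, (dest, _, _) in enumerate(group):
        earlier_same = any(group[j][0] == dest for j in range(k))
        later_same = any(group[j][0] == dest for j in range(k + 1, len(group)))
        if not earlier_same:
            released.append(old[dest])      # overwritten tag read from the RAM
        if later_same:
            released.append(free_tags[k])   # superseded by a later write
        else:
            map_table[dest] = free_tags[k]  # latest write wins
    return renamed, released
```

With an identity map table of 8 entries, free tags 8, 9, 10 and the group (1, 2, 3), (4, 1, 5), (1, 6, 7), the second instruction reads tag 8 through the dependence check rather than the stale RAM content, and location 1 ends up holding tag 10, the latest write in fetch order.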
The map table memory and the dependence check logic operate concurrently, and the memory read access (plus the multi-write management logic and the output multiplexers) dominates the overall delay. The critical path of the memory read access is composed of the decoder delay, word-line drive time, bit-line drive time (operated by the memory cell) and sense amplifier delay. The standard implementation of a static multi-port memory cell, utilized in a register renaming architecture, is described in [4] and [9]. In [4], a set of simulation results is given, referring to 0.8, 0.35 and 0.18 µm CMOS processes. In order to have a fair comparison with our solution and to obtain comprehensive data, we re-designed the standard renaming circuit at layout level under a reference CMOS process (footnote 3). We included a stack of 4 shadow cells for each cell.

Footnote 2: Actually, a temporary buffer (sometimes called active list [10]) is used because the register may not be immediately re-usable. This detail is not relevant for our description.

Footnote 3: HCMOS6 0.35 µm, 5 metals, from ST Microelectronics.

3. NEW REGISTER RENAMING CIRCUIT DESIGN

Our new design is a memory cell circuit that integrates all the operations required by multi-instruction register rename mapping, which in the standard architecture are jointly performed by the memory, comparators, priority encoders and multiplexers. The circuit diagram of the custom cell is reported in Fig. 2 along with its layout, obtained under the reference CMOS process. The memory element is the inverter ring in the middle. The right part of the cell is a priority-based multiplexer that selects which data input is allowed to write into the memory element according to the write control signals wi. The left part of the cell is a special multiplexing circuit that selects the proper output according to the dependence check. The dependence check is implicitly performed by the pull-down n-tree, which detects concurrent read and write operations on the same cell and drives the output selection accordingly. The stack of 4 shadow cells, implemented as a 4-bit shift register, is omitted from the figure for the sake of simplicity. The complete renaming unit is thus greatly simplified, as it consists only of the map table RAM (Fig. 3) composed of the new cells.

Fig. 2 - Circuit of a 4-write-port specialized RAM cell for a renaming unit. Ui = bitline output to free list buffer (to sense amp.); Oi = output bitline, to sense amp.; di = input bitline; wi and ri = write and read wordlines from decoders. [Labels in the figure: pull-down NOR-logic n-tree; signals CK, d0-d3, w0-w3, r1-r8, o0-o7, u0-u3.]

4. PERFORMANCE RESULTS

We performed Spice simulations on layout-extracted netlists. The extracted netlists include both parallel-plate and fringe capacitance; long metal wire resistance has also been included. Operating conditions are typical process, 27 °C temperature, and 2.5 V supply. The simulated circuits incorporate the decoders and sense amplifiers needed for the proper operation of the memory. We used identical decoders and sense amplifiers for the standard circuit and the newly proposed one. The decoder implementation is based on the “pre-decode” approach [9]. The decoder outputs are low during the pre-charge phase of the memory cells. The sense amplifier design follows the scheme reported in [8]. Finally, we designed the additional logic modules required by the standard architecture (Fig. 1) as dynamic CMOS circuits.

Fig. 4 reports a sample simulation of the new custom circuit, from which the operation delay can be extracted. The first three panels from the top refer to a concurrent multiple read and write operation, while the last panel refers to a multi-write operation to the same location with priority management. The delay results for both the standard and the new implementation are summarized in Table 1. Signal delays have been measured between half-swing transitions. Timing results for the standard implementation very nearly match the numbers available from [4], except for the “bit-line + sense amplifier” delay.
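The delay figures in Table 1 can be cross-checked with a short calculation. The sketch below only re-adds the per-stage delays copied from Table 1 (all-read case); the dictionary layout and function names are illustrative assumptions, and the sums agree with the totals reported in the table to within 0.001 ns of rounding.

```python
# Per-stage delays in ns, copied from Table 1 (all-read case).
# "extra_logic" is the multi-write management + output MUX delay,
# which the new cell does not need (taken as 0 here).
DELAYS = {
    ("standard", 2): {"decoder": 0.192, "wordline": 0.134,
                      "bitline_sense": 0.120, "extra_logic": 0.170},
    ("standard", 4): {"decoder": 0.193, "wordline": 0.161,
                      "bitline_sense": 0.151, "extra_logic": 0.235},
    ("new", 2): {"decoder": 0.192, "wordline": 0.135,
                 "bitline_sense": 0.163, "extra_logic": 0.0},
    ("new", 4): {"decoder": 0.192, "wordline": 0.175,
                 "bitline_sense": 0.168, "extra_logic": 0.0},
}


def ram_access_time(stages):
    """Critical path of the RAM read: decoder + word-line + bit-line/sense amp."""
    return stages["decoder"] + stages["wordline"] + stages["bitline_sense"]


def unit_access_time(stages):
    """Whole-unit delay: RAM access plus any logic placed after the RAM."""
    return ram_access_time(stages) + stages["extra_logic"]


def speedup(n):
    """Ratio of standard to new whole-unit access time for n write ports."""
    return (unit_access_time(DELAYS[("standard", n)])
            / unit_access_time(DELAYS[("new", n)]))
```

With these rounded stage values the ratio of standard to new whole-unit access time comes to roughly 1.26 for N = 2 and 1.38 for N = 4.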

Fig. 1 - Conceptual view of a superscalar processor pipeline and scheme of a standard renaming unit with 2 write ports (N = 2). [Blocks shown: fetch logic, decode logic, rename logic, issue logic, register file, functional units and data cache in the pipeline; map table RAM, comparators, priority encoders, multiplexers and free list buffer in the renaming unit.]

Fig. 3 - Global renaming unit architecture utilizing the specialized cell. [Blocks shown: map table RAM (array of identical cells + decoders) and free list buffer; inputs: logical dest. regs, logical source regs; outputs: dest. phys. regs, physical source regs.]

Architecture                                  Standard    Standard    New         New
Number of write ports in RAM (i.e. N)         2           4           2           4
RAM cell + shadow stack area [µm × µm]        84×35       108×47      91×45       150×58
Memory array + decoders area [µm × µm]        664×1220    808×1704    706×1540    1060×2056
Whole unit* area [µm × µm]                    664×1300    808×1976    706×1540    1060×2056
Whole unit* power consumption [µW/MHz]        51.8        140.3       37.5        90.6
Decoder delay [ns]                            0.192       0.193       0.192       0.192
Word-line drive time [ns]                     0.134       0.161       0.135       0.175
Bit-line + sense amp delay (1 read) [ns]      0.110       0.120       0.98        0.101
Bit-line + sense amp delay (all read) [ns]    0.120       0.151       0.163       0.168
Total RAM access time (all read) [ns]         0.447       0.505       0.490       0.534
Multi-write manag. + output MUX [ns]          0.170       0.235       --          --
Whole unit* access time (all read) [ns]       0.617       0.740       0.490       0.534

Table 1 - Performance results for the new renaming circuit and the standard scheme, for a 32-logical/64-physical register architecture. (*) Whole unit = including dependence check and multi-write management.

The discrepancy in the “bit-line + sense amplifier” delay with respect to [4] is due to the different sense amplifier technology, which is not relevant for our study.

Despite the greater circuit complexity, the proposed cell circuit performs very much like the standard one in terms of single cell delay. The overall operation of the new renaming unit is significantly faster than the standard implementation (0.490 ns versus 0.617 ns for N = 2, and 0.534 ns versus 0.740 ns for N = 4), as the latter includes the additional delay of the multi-write management logic and the output multiplexers. Table 1 shows the speed improvement and the area overhead. Taking into account pipeline latch delay, a 1 GHz clock frequency is a feasible target for the proposed circuit.

Another interesting simulation result is that the total power consumption of the proposed renaming circuit is significantly lower than that of the standard implementation (37.5 versus 51.8 µW/MHz for N = 2, and 90.6 versus 140.3 µW/MHz for N = 4). A precise measure of the absolute power consumption should rely on long instruction traces of a real program, because the possible circuit operations and bit values depend on the instruction flow. We compared the power consumption in relative terms, referring to the average result over specific reference operations, always assuming that 2N reads and N writes are made every cycle in the map table RAM (N being the number of instructions fetched per cycle), and that the read data are all 0’s (i.e. worst case). The reference operations are: all reads and writes independent; each read dependent on the previous write and all writes independent; all reads independent and all writes to the same location. Table 1 also reports the average power consumption results.

5. CONCLUSIONS

We presented a new custom design of a memory cell circuit for register renaming units in superscalar microprocessors. For a limited area overhead, whose relevance is ever diminishing in the CMOS technology trend, the global unit’s operating speed and power consumption are improved. Layout level simulation shows that a 1 GHz target clock frequency is feasible in a 0.35 µm CMOS process for the proposed circuit design. The design of the renaming unit is highly scalable with respect to a traditional one, since changes in the number of physical or logical registers imply replicating the basic cell array, instead of re-designing the additional logic around the RAM. Due to its modularity, reduced power consumption and reduced critical delay, the approach is suited for future superscalar processor architectures, which demand that more instructions per cycle be processed by the renaming unit.

6. REFERENCES

[1] Gaddis, N. and Lotz, J., A 64-b Quad-Issue CMOS RISC Microprocessor, IEEE Journal of Solid State Circuits, 31(11), Nov. 1996.
[2] Grohoski, G.F., The organization of the IBM RISC System/6000 microprocessor, IBM Journal of Research and Development, 34(1), Jan. 1990.
[3] Palacharla, S., Jouppi, N. P. and Smith, J. E., Complexity effective superscalar processors, 24th International Symposium on Computer Architecture, Denver, Colorado, June 1997, IEEE-ACM. Available in ACM-SIGARCH Computer Architecture News, 25(2), May 1997.
[4] Palacharla, S., Jouppi, N. P. and Smith, J. E., Quantifying the complexity of superscalar processors, Technical Report CS-TR-96-1328, available at http://ww.cs.wisc.edu/trs.html, Univ. of Wisconsin-Madison, Nov. 1996.
[5] Sima, D., Superscalar Instruction Issue, IEEE Micro, 17(5), Sep. 1997.
[6] Smith, J.E. and Sohi, G., The microarchitecture of superscalar processors, Proc. of the IEEE, 83(12), Dec. 1995.
[7] Vasseghi, N., Yeager, K., Sarto, E. and Seddighnezhad, M., 200 MHz Superscalar processor, IEEE Journal of Solid State Circuits, 31(11), Nov. 1996.
[8] Wada, T., Rajan, S. and Przybylski, S. A., An analytical access time model for on chip cache memories, IEEE Journal of Solid State Circuits, 27(8), Aug. 1992.
[9] Weste, N. H. and Eshraghian, K., Principles of CMOS VLSI design, Addison-Wesley, 1993.
[10] Yeager, K., The MIPS R10000 superscalar microprocessor, IEEE Micro, 16(2), Apr. 1996.

Fig. 4 - Layout level Spice simulation results. a1_inp = decoder input; dec_out = decoder output; outx = output n. x; ux = output n. x to free list buffer; Sa_xx = port xx sense amp. output; wx = write wordline n. x; datum, datum_neg = memory cell content.

[Fig. 4 plot, titled “hspice netlist: rename memory”: four panels of node voltages (linear scale, 0 to 2.5 V) versus time (50.1 ns to 51 ns). Panel 1: v(a1_inp), v(dec_out), v(out_11), v(u_1). Panel 2: v(sa_out11), v(sa_out12), v(sa_out21), v(sa_out22), v(sa_out31), v(sa_out32). Panel 3: v(sa_out41), v(sa_out42), v(sa_u1), v(sa_u2), v(sa_u3), v(sa_u4). Panel 4: v(w0), v(w1), v(w2), v(w3), v(datum), v(datum_neg).]
