
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 34, NO. 11, NOVEMBER 1999

A Seventh-Generation x86 Microprocessor

Michael Golden, Steve Hesley, Alisa Scherer, Matthew Crowley, Member, IEEE, Scott C. Johnson, Member, IEEE, Stephan Meier, Dirk Meyer, Jerry D. Moench, Member, IEEE, Stuart Oberman, Member, IEEE, Hamid Partovi, Member, IEEE, Fred Weber, Scott White, Member, IEEE, Tim Wood, Member, IEEE, and John Yong

Abstract—An out-of-order, three-way superscalar x86 microprocessor with a 15-stage pipeline, organized to allow 600-MHz operation, can fetch, decode, and retire up to three x86 instructions per cycle to independent integer and floating-point schedulers. The schedulers can simultaneously dispatch up to nine operations to seven integer and three floating-point execution resources. A sophisticated, cell-based design technique and judicious application of custom circuitry permit the development of a processor with an aggressive architecture and high clock frequency with a rapid design cycle. Design-for-test techniques such as scan and clock bypassing permit straightforward testing and debugging of the part.

Index Terms—Cell-based design, custom design, flip-flop, register file, test and debug, x86 architecture.

I. INTRODUCTION

An out-of-order, three-way superscalar x86 microprocessor with a 15-stage pipeline, organized to allow 600-MHz operation, can fetch, decode, and retire up to three x86 instructions per cycle to independent integer and floating-point schedulers. The schedulers can simultaneously dispatch up to nine operations to seven integer and three floating-point execution resources. The cache subsystem and memory interface minimize effective memory latency and provide high-bandwidth data transfers to and from these execution resources.

The processor contains separate instruction and data caches, each 64 KB and two-way set associative. The data cache is banked and supports concurrent access by two loads or stores, each up to 64 b in length. The processor contains logic to directly control an external L2 cache. The L2 data interface is 64 b wide and supports bit rates up to 2/3 the processor clock rate. The system interface consists of a separate 64-b data bus. The die, shown in Fig. 1, is 1.84 cm² and contains 22 million transistors. Table I shows the technology features. C4 solder-bump flip-chip technology is used to assemble the die into a ceramic 575-pin ball grid array (BGA). Measurements are from initial silicon evaluation unless otherwise stated.

This paper is organized as follows. Section II describes the microarchitecture of the processor in more detail, including a discussion of the integer execution core; the high-performance

Manuscript received April 13, 1999; revised June 15, 1999. M. Golden, A. Scherer, S. Meier, S. Oberman, H. Partovi, F. Weber, T. Wood, and J. Yong are with Advanced Micro Devices, Sunnyvale, CA 94088 USA. S. Hesley, S. C. Johnson, D. Meyer, J. D. Moench, and S. White are with Advanced Micro Devices, Austin, TX 78741 USA. M. Crowley was with Advanced Micro Devices, Sunnyvale, CA 94088 USA. He is now with Rhombus Inc., Palo Alto, CA 94306 USA. Publisher Item Identifier S 0018-9200(99)08337-7.

Fig. 1. Processor die photograph.

TABLE I TECHNOLOGY FEATURES

floating-point unit (FPU), which is implemented as a coprocessor; and the interface between them. Section III describes the design methodology that permitted the creation of the processor by a relatively small design team with a rapid

0018–9200/99$10.00  1999 IEEE


Fig. 2. Block diagram of the processor.

design cycle. This methodology has its roots in cell-based ASIC design but permits the application of full-custom design techniques when they are warranted. Section IV details some of the more interesting circuits in the processor, including a high-performance flip-flop, several array structures, and a floating-point multiplier implemented using standard cells. Last, Section V describes a debug and test strategy that expedites the process of turning an initial tapeout and first silicon samples into a functionally correct and electrically robust commercial product.

II. ARCHITECTURE AND CONTROL

Fig. 2 shows the major blocks within the design. This section will provide a brief overview of each of these blocks and how they interact to effect execution of x86 instructions.

A. Instructions, MacroOps, and OP's

Due to the complexity of the x86 instruction set, the processor breaks instructions into one or more simple operations that can be directly understood by the execution core. DirectPath instructions are relatively simple and the most frequently used; the processor decodes and executes them in hardware without microcode assistance. The processor executes VectorPath instructions, the less common and more complex x86 instructions, using microcode. The term x86 instruction refers to information associated with an instruction in the x86 instruction set. This information may or may not be encoded in its original format. The machine converts each DirectPath x86 instruction into a single

MacroOp. VectorPath x86 instructions are decoded into one or more microlines. Microlines consist of three MacroOps retrieved from the microcode engine in a single cycle. The MacroOps emitted by the microcode engine form a superset of the MacroOps created from DirectPath instructions. A MacroOp may consist of one or two operations, or OP's. An OP is the minimum executable entity understood by the machine.

An example will make this clearer. Consider the x86 instruction ADD EAX, EBX. This is a DirectPath x86 instruction and maps to a single MacroOp. Only a single OP is required to execute this instruction. The instruction XOR EAX, [EBX+8] is also a DirectPath x86 instruction, so it too maps to a single MacroOp. However, two OP's are required to execute the instruction: a load using general-purpose register EBX as a base and an XOR of this load's result with the general-purpose register EAX. Last, consider the instruction AND [EBX], EAX. This is a DirectPath x86 instruction and a single MacroOp. Three primitive operations are required to perform the instruction: a load, a logical AND, and a store. However, in the processor's terminology, this instruction consists of only two OP's: the address calculation with its associated load and the AND. The store is not really an executable entity from the perspective of the machine's execution core.

B. Core Control

The Icache block contains the instruction cache and fetch logic. Its mission is to provide instruction bytes to the rest of


Fig. 3. Pipeline diagram of the processor.

the machine. It delivers up to 16 naturally aligned bytes per cycle.

After fetch, x86 macro-instructions follow one of two paths through the machine. The Fastpath Instruction Scan and Align block receives up to 16 bytes per cycle from the Icache. This block sifts through the bytes and identifies the boundaries of the variable-length DirectPath instructions. It then places these instructions into three fixed positions from which they are subsequently decoded and dispatched. The Microcode Engine, or MENG, receives the 16-byte fetch stream from the Icache in parallel with the scan and align block. This block decodes microcoded instructions and sequences through the low-level operations required to execute them. Scan/Align may send up to three DirectPath instructions in a cycle; the MENG may send a microline in a cycle.

The Integer Decode and Rename block contains a reorder buffer, a future file, and distributed reservation stations to hold operands while OP's wait to be scheduled. It also contains logic to decode the MacroOps emitted from Scan/Align or MENG one-to-one into a form more suitable for directly controlling the integer execution core. The MacroOps retain their integrity in this decoded form as they are scheduled, executed, and retired by the core.

The reorder buffer is allocated in lines, each holding up to three MacroOps. It has a depth of 24 lines, allowing up to 72 MacroOps in flight between dispatch and retire. Every MacroOp in the machine is allocated a slot in the reorder buffer. Tags passed around the processor for various uses identify the line and position of each OP in the main reorder buffer. As MacroOps are completed, their completion status (and sometimes their associated data or architecturally defined flags) is steered into the appropriate slot in the reorder buffer using the tag they were assigned at dispatch time. MacroOps may complete in any order; their results are held pending in the reorder buffer until they become eligible for retire.
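The tag-steered completion mechanism described above can be sketched behaviorally. The following Python model is an illustration only (the class, method names, and slot layout are invented for exposition, not taken from the hardware): a 24-line by 3-slot buffer accepts completions in any order, indexed by the (line, position) tag assigned at dispatch, and retires whole lines strictly in order.

```python
# Behavioral sketch (not AMD's implementation) of a tag-steered
# reorder buffer: 24 lines of 3 MacroOp slots, 72 MacroOps in flight.

LINES, SLOTS = 24, 3

class ReorderBuffer:
    def __init__(self):
        # Each slot is None (free) or a dict holding completion state.
        self.slots = [[None] * SLOTS for _ in range(LINES)]
        self.alloc_line = 0   # next line to allocate at dispatch
        self.retire_line = 0  # oldest line in the machine

    def dispatch_line(self, macroops):
        """Allocate one line (up to three MacroOps); return their tags."""
        assert len(macroops) <= SLOTS
        line = self.alloc_line
        tags = []
        for pos, op in enumerate(macroops):
            self.slots[line][pos] = {"op": op, "done": False, "result": None}
            tags.append((line, pos))  # tag = line and position in the buffer
        self.alloc_line = (line + 1) % LINES
        return tags

    def complete(self, tag, result=None):
        """Steer completion status (and data) into the tagged slot."""
        line, pos = tag
        self.slots[line][pos]["done"] = True
        self.slots[line][pos]["result"] = result  # held pending until retire

    def try_retire(self):
        """Retire the oldest line only if every MacroOp in it is done."""
        line = self.slots[self.retire_line]
        ops = [e for e in line if e is not None]
        if not ops or not all(e["done"] for e in ops):
            return []  # strictly in order: nothing retires yet
        retired = [e["op"] for e in ops]
        self.slots[self.retire_line] = [None] * SLOTS
        self.retire_line = (self.retire_line + 1) % LINES
        return retired
```

Note that `complete` may be called for any in-flight tag, in any order, while `try_retire` only ever examines the oldest line, mirroring the out-of-order completion and in-order retirement described in the text.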
Reorder buffer lines are retired strictly in order. A line is retired if it is the oldest line in the machine, and all MacroOps in that line have reported their completion status. The FPU retire queue, which is described later, is in some ways an extension of the reorder buffer. If the reorder buffer line being retired contains one or more FPU MacroOps, then a corresponding number, up to three per cycle, of FPU retire indications are signalled to the FPU retire queue. Since the

FPU retire queue is allocated in order, and the main reorder buffer retires MacroOps in order with a precise indication of the number of FPU MacroOps, no tagging is required on this interface. The FPU simply pops the number of retires indicated and commits them to real FPU state.

Floating-Point Decode and Rename decodes x87 floating-point, single-instruction multiple-data (SIMD) single-precision floating-point, and multimedia instructions; handles complications associated with the x87 floating-point stack architecture; and performs register renaming.

The integer execution core contains the Integer Instruction Scheduler, instruction-scheduling logic that picks OP's for execution based on their operand availability and issues them to functional units or address-generation units. The functional units perform transformations on data and return their results to the reorder buffer, while the address-generation units send calculated memory addresses to the Load/Store Unit for further processing. The Dcache block contains the data cache array and its associated logic, and the BIU contains the bus-interface logic. The Floating-Point Instruction Scheduler and the floating-point execution units are further described in Section II-D.
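The tag-free retire interface described in this subsection can be illustrated with a short behavioral sketch (Python used purely as pseudocode; the class, the `fpu_` naming convention, and the helper names are invented for illustration). Because both the FPU retire queue and the main reorder buffer operate strictly in order, a bare count is sufficient to keep them synchronized.

```python
# Sketch (not the actual hardware) of the tag-free FPU retire interface:
# the core signals only a count of retired FPU MacroOps, and the FPU
# pops that many entries from its in-order retire queue.
from collections import deque

class FpuRetireQueue:
    def __init__(self):
        self.queue = deque()           # in-flight FPU OP's, oldest first
        self.architectural_state = []  # committed FPU state (illustrative)

    def allocate(self, op):
        self.queue.append(op)          # allocated strictly in order

    def pop_retires(self, count):
        """Core signals how many FPU MacroOps retired this cycle (0-3)."""
        assert 0 <= count <= 3
        for _ in range(count):
            self.architectural_state.append(self.queue.popleft())

def retire_line(line_macroops, fpu_queue):
    """When the reorder buffer retires a line, count its FPU MacroOps
    and signal that count to the FPU; no tags cross the interface."""
    fpu_count = sum(1 for op in line_macroops if op.startswith("fpu_"))
    fpu_queue.pop_retires(fpu_count)
```

Since allocation and retirement are both in order on each side, the oldest entries popped by `pop_retires` are guaranteed to correspond to the FPU MacroOps in the retiring line.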

C. Pipeline Overview

Fig. 3 shows how the work is divided among pipeline stages in the machine. The top diagram shows the fetch and decode stages of the pipeline, along with the integer execute and memory access stages. The second diagram shows the pipe stages specific to floating-point and multimedia instructions. The following operations occur in each stage:

1) Fetch: The instruction cache and branch prediction tables are accessed in this cycle.

2) Scan: The scan block finds instruction boundaries and sends DirectPath instructions down the pipeline. Meanwhile, the MENG looks for x86 instructions within this window which require microcode emulation.

3) Align1: This is the first of two pipeline stages associated with putting DirectPath x86 instructions into fixed positions for later decode and dispatch. The MENG calculates microcode entry points in this cycle.


Fig. 4. FPU block diagram.

4) Align2: At the end of this cycle, the Align block may have placed up to three DirectPath x86 instructions into fixed positions. The MENG accesses the microcode ROM.

5) Edec/uRom: The early decode stage selects between the DirectPath instructions coming from Align2 and microcode MacroOps from the microcode ROM. It also decodes enough of each MacroOp to identify which registers and flag bits are required.

6) Idec/Rename: In this cycle, the processor resolves integer operand dependencies and reads the integer input operands if they are available. Integer MacroOps are decoded into a form more appropriate for the execution units. A final dispatch decision is made at the end of this cycle, based on whether there are sufficient buffers available to hold all the MacroOps.

7) Sched (FPU: Stk Rename): At this point, integer and floating-point MacroOps diverge in the pipeline. Integer OP's are scheduled for execution next cycle. Floating-point OP's have their floating-point stack operands mapped to registers.

8) EX (FPU: Reg Rename): The integer core executes OP's while the floating-point pipe performs register renaming.

9) Addr (FPU: Wr. Sched): Addresses are transported to the data cache. The floating-point pipeline writes the scheduler.

10) DC (FPU: Freg): The integer core accesses the data cache and transports results to the execution units. The floating-point unit makes scheduling decisions.

11) (FPU: FX0): The floating-point pipe reads operands from the register file.

12) (FPU: FX1-FX3): The floating-point unit executes floating-point OP's.

D. FPU Control

The processor's FPU is implemented as an out-of-order coprocessor responsible for executing all x87 floating-point instructions and a set of multimedia and SIMD single-precision floating-point instructions [1]. The FPU interfaces to the processor core, which sends it instructions and load data and guides the retirement of instructions. The FPU sends store data and completion status back to the core. Fig. 4 shows a block diagram of the FPU. The FPU contains 2.4 million transistors and uses 10.5 × 2.6 mm of die area in the 0.25-µm process.

The control for the FPU consists of an in-order front end that decodes and maps FPU MacroOps one-to-one into internal execution OP's. Although each FPU MacroOp maps to exactly one OP in the FPU, this discussion distinguishes between FPU MacroOps and FPU OP's. FPU MacroOps are associated with a single FPU OP but are manipulated by the processor core just like all other MacroOps. FPU OP's are the entities transformed and executed by the floating-point unit. A central scheduler dispatches the execution OP's into the execution pipes when their source operands are available. Pipe-tracking logic reports completion status to the core. A retire queue holds all in-flight OP's, maintains the register free list, and updates architectural state as the core retires MacroOps.

The front end is responsible for decoding up to three FPU MacroOps per cycle. This involves first mapping them into three-operand-format internal execution OP's with the stack-relative register references converted to absolute registers. Complex FPU instructions (e.g., transcendentals) are received from the core directly as a series of MacroOps from the MENG, which are easily mapped into FPU execution OP's. The


absolute register numbers of the execution OP's are renamed into physical register numbers. A mapper provides the most recent physical register mapped to each absolute source register, and destination physical register numbers are obtained from the free list. The last stage of the in-order front end inserts the renamed execution OP's into a 36-entry scheduler.

OP's are issued from the scheduler when their source registers are ready and the required execution resources are available. Sources may come from the register file, be bypassed directly from one of three result buses, or, in the case of memory operands, come from one of two load operand buses. Each source in the scheduler has to snoop a maximum of three buses, since it is determined ahead of time whether to snoop the three result buses or the two memory operand buses. Once an OP is issued, it proceeds to read the register file and then enters the appropriate execution pipe. The scheduler employs a compaction scheme that frees space in the scheduler as OP's are issued to the functional units.

On completion of an operation, the result is written to the destination register and completion status is sent to the core, which enables the core to retire the associated MacroOp. The retire queue holds up to 72 speculative OP's and is responsible for updating architectural state and placing the old destination registers back onto the free list when the associated MacroOps are retired.

Each read port of the register file is separately enabled to reduce power dissipation. Similarly, each functional unit is enabled only when it is performing true computation. Power is managed in the control queues by using valid bits to control conditional flip-flops, reducing power dissipation during periods without valid instructions.
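The renaming step described above can be sketched behaviorally. In this Python illustration (the class, tuple format, and register count are chosen for exposition; only the mapper/free-list scheme itself comes from the text), sources read the most recent mapping and each destination draws a fresh physical register, with the old destination mapping carried along so the retire logic can eventually return it to the free list.

```python
# Sketch (not the hardware) of renaming absolute registers to physical
# registers with a mapper and a free list, as described in the text.

class Renamer:
    def __init__(self, num_physical=88):  # the FPRF has 88 entries
        self.mapper = {}                  # absolute reg -> physical reg
        self.free_list = list(range(num_physical))

    def rename(self, op):
        """op: (opcode, dest_abs, src1_abs, src2_abs) in absolute registers.
        Returns (opcode, pdest, psrc1, psrc2, old_pdest)."""
        opcode, dest, src1, src2 = op
        # Sources read the most recent physical mapping (None here stands
        # in for a register still holding architectural state).
        psrc1 = self.mapper.get(src1)
        psrc2 = self.mapper.get(src2)
        # The destination gets a fresh physical register from the free
        # list; the old mapping is freed only when the OP retires.
        pdest = self.free_list.pop(0)
        old_pdest = self.mapper.get(dest)
        self.mapper[dest] = pdest
        return (opcode, pdest, psrc1, psrc2, old_pdest)
```

A dependent OP renamed after a producer automatically picks up the producer's newly allocated physical destination as its source, which is what allows the scheduler to match results to waiting sources by physical register number.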

E. FPU-Core Interface

Although the FPU is implemented on the same die as the integer core, it is designed as a coprocessor, with its own control structures decoupled from those in the integer core. The FPU is provided with up to three MacroOps per cycle by the core, which it decodes locally. Valid FPU instructions undergo register renaming and then are sent to the FPU scheduler, which makes its own scheduling decisions, independent of the core, based on source operands being available and all resource constraints being satisfied. The FPU notifies the core of the completion of FPU OP's, and the core proceeds to issue retire commands to the FPU when the associated MacroOps become nonspeculative.

These simple operations form the basis of the core-FPU interface, requiring three MacroOp dispatch buses to the FPU, three completion ports from the FPU indicating completion status, and three retire signals from the core, which tell the FPU to remove OP's from its retire queue. There is only one stall mechanism from the FPU to the core, which the FPU uses when its scheduler is full. This tells the core to cease issuing MacroOps to the FPU.

Floating-point load and store instructions require additional interface support. Floating-point loads are sent to both the core and the FPU. The core is responsible for computing the load address and performing the actual load operation. Up to three


load addresses may be computed simultaneously, and up to two loads may access the cache concurrently. The load data are sent from the data cache to the FPU via one of two load buses, where they are converted into the internal FPU data representation and written directly into the destination register of the floating-point load. A structure called the load-mapper in the FPU is used to map the instruction tag provided by the core, along with the load data, to the renamed destination register of the load OP. The floating-point load OP will not be scheduled by the FPU scheduler until the load data are received from the core and written.

In the case of a simple load (memory-to-register transfer only), the only work performed by the execute OP in the FPU is to notify the core that the load has completed. Since compound instructions (e.g., load from memory and add that value to a register) are supported by the x87 architecture, it is possible that the execute OP in the FPU will actually perform a computation on the load data directly. Through an optimization known as superforwarding, the FPU will convert a two-instruction sequence consisting of a simple load and a simple execute into a compound instruction, which reduces the combined latency by two cycles.

Floating-point stores are likewise sent to both the core and the FPU. The core computes the store address and awaits the arrival of the store data from the FPU. The FPU scheduler issues the store to the store pipe, which has a 64-bit datapath back to the core on which store data are sent following conversion/rounding.

Exceptions fall into one of three types: architectural traps, microarchitectural traps, and microarchitectural faults. They are signaled to the core concurrently with the sending of the completion status of the OP. For a trap, the instruction generating the trap is retired, and then the core signals an abort to the FPU, causing all speculative state in the FPU to be discarded.
For a fault exception, the abort is signaled as soon as all instructions up to the faulting instruction have been retired. The core then transfers control to a microcoded exception handler. The FPU receives all its microcode (which is used to implement complex floating-point instructions such as transcendentals, as well as by exception handlers) from the core as MacroOps—it does not contain a local microcode ROM.

F. Global Exception Mechanisms

There are four broad classes of exceptions: faults, traps, resyncs, and interrupts, which together serve a wide variety of architectural and microarchitectural needs. Faults are handled by retiring all MacroOps up to, but not including, the MacroOp that reported the fault, and then aborting all integer and FPU activity still present in the machine, thereby discarding their speculative results. After issuing the pipeline abort, an appropriate fault-handling microcode sequence is entered. Note that faults may be reported from both integer and FPU OP's and from load and store operations.

Traps are handled in a manner similar to faults, except that the trap-reporting MacroOp is also allowed to retire. Furthermore, if the trap-reporting MacroOp was part of a


microcoded instruction, all MacroOps in that x86 instruction are allowed to retire before the trap is taken. Traps are always deferred until the completion of the x86 instruction in which they were reported.

Resyncs are simply special microarchitectural traps that return execution to the x86 code stream right where it left off. By going through the internal mechanics of an exception, many anomalous pipeline conditions can be cleared.

Interrupts closely resemble traps, the key difference being that the source of the interrupt is asynchronous to the MacroOps being executed. When an external interrupt arises, it is internally attached to an x86 instruction as if an OP in that instruction had reported a trap condition. Once that entire x86 instruction has been retired, the pipeline is aborted and an appropriate interrupt-handling microcode sequence is entered.

While each exception type has its own peculiarities as described above, all exceptions share some common traits. They are reported (speculatively) to the central Integer Decode and Rename block, where they are put back in order and prioritized against all other processor activity. MacroOps retire and update real integer and FPU state in order up until the point that an exception should be taken. If and when an exception is taken, all pending operations in the processor are aborted, effectively discarding their speculative results. Note that a unit that reports an exception has no way of knowing if a subsequent pipeline abort is due to the exception that it reported or due to some earlier one in another part of the machine.

III. DESIGN METHODOLOGY

The processor is implemented with custom circuit designs, or macrocells, and standard cells. The design library contains slightly more than 100 logically unique standard cells, usually in several different drive strengths. Custom macros are used to implement large arrayed structures such as RAM's, ROM's, and register files.
Standard cells are used to implement the integer and floating-point datapaths, decoders, the bus interface, and all control logic.

Early in the project, the microarchitecture is defined in behavioral register-transfer level (RTL), and the chip is partitioned into functional blocks that are implemented in parallel. For this to be successful, block sizes and interblock physical and timing interfaces have to be defined and maintained throughout the project. The implementation team then designs each functional block in structural RTL, using custom macrocells and standard cells to implement the behavioral RTL model. The block is logically proven using formal logic verification and vector-based simulation. Internal tools are used to place the cells and to direct routing for critical nets. An industry-standard router makes the final connections, and route parasitics are extracted for use in timing analysis. Timing analysis draws upon the defined timing constraints for the block, and a block's resulting timing is made visible to other blocks when it improves on those constraints. Block sizes are chosen to make the most efficient use of the placement, route, extraction, and timing flow. This allows many blocks to progress in parallel but lets the designers expect a high-quality result when the blocks are integrated.


Many of the standard cells are designed to solve specific problems and take advantage of the fact that their usage can be well controlled. These cells have usage rules attached to them, which range from requiring mux select signals to be both logically and temporally one-hot to limiting the allowable noise on an input pin. The rules can be statically checked on the design database, letting an engineer know when to modify the design for robustness.

IV. CIRCUIT DESIGN

Although the implementation team designs the processor with a cell-based design methodology, they use proprietary standard and full-custom macrocells designed by the processor's circuit team. This section covers some of the more interesting choices made by the circuit team in constructing the cells. A description of the FPU arithmetic pipelines then illustrates the power of the cell-based design flow to produce high-performance circuits.

A. Standard Cell Storage Elements

The processor uses a variation of the pulsed flip-flop topology [2] as its principal latching element. In addition to its small total latency, defined as the sum of setup time and clock-to-output latency, this topology can incorporate complex logic in its first stage using a dynamic pulldown network, a feature heavily utilized throughout the processor to improve the timing of critical paths. Fig. 11 shows an enabled, four-way mux flip-flop. The muxed flop is functionally equivalent to the basic flop preceded by a multiplexer but improves the critical-path latency by as much as 12%.

Due to the dynamic nature of the first stage of the flip-flop, the coupling to its inputs must be tightly controlled. Consequently, designers use a proprietary computer-aided-design tool to determine the minimum allowable input signal strength based on the driver's distance from the flip-flop.

To manage races and relax the flop-to-flop minimum delay restriction, both clock skew and flip-flop hold time must be controlled.
To this end, the width of the one-shot on CLKPULSE, during which the flop is transparent, is minimized while guaranteeing correct operation with adequate margin. Circuit designers developed a statistical model based on local variation of devices and interconnect parasitics to determine the smallest CLKPULSE width that can safely capture the data at the input of the flop. The chosen width limits yield fallout due to failure to capture data, across all flip-flops and under extreme parametric conditions, to less than 0.1%. Master–slave flip-flops, which have near-zero hold-time characteristics but suffer from large latencies, are used in noncritical paths to further relax hold-time concerns.

B. On-Chip SRAM

The sub-1-ns static random access memory (SRAM) arrays are single cycle or pipelined. The single-cycle arrays dynamically decode the address and access the array in the same cycle. The first stage of decode is incorporated in the


Fig. 5. DC pipeline diagram.

edge-triggered, self-resetting address flops. These address flops generate monotonic outputs that drive NAND-type decoders.

By pipelining the arrays in the data cache (Dcache), the available time is fully utilized to access the largest possible Dcache with a three-cycle load latency. The three-cycle load latency maps to pipeline stages eight to ten. The logical address is calculated in pipe stage eight. If the segment base is zero, the address is transported to the Dcache in stage nine. If the segment base is nonzero, an additional cycle is required to calculate the linear address, but the load is still fully pipelined. The Dcache is accessed and the results transported to the execution units in stage ten. Stages nine and ten are shown in Fig. 5. The Dcache access begins in stage nine with the translation lookaside buffer CAM lookup, TAG address decode, and DATA address decode.

The area difference for the row decoder between the pipelined and single-cycle arrays is less than 1%. The saving in area is achieved by building the one-shot into the row decoder. In Fig. 6, if a one-shot were added to drive the NMOS pulldown gated by CLK, the row decoder would look more like the pulsed flip-flop. However, this one-shot would have to be added 128 times, once for each decoded row. Instead, the one-shot is removed and CLK is added to one of the first-stage two-to-four decoders. When CLK is low, the address propagates through the decoders to the NMOS pulldown stack. When CLK rises, the pulldown stack is active, but only for a short duration, since CLK rising deasserts all outputs of the two-to-four decoder. This effectively creates a one-shot for the row decoder flip-flop.

By adding an analogous structure for a test clock (TCLK), the row decoder flip-flops are made controllable, but not observable. This puts some restrictions on the test program but does not reduce coverage. The benefit of the area savings outweighs the restrictions of the test program.
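The implicit one-shot created by gating CLK into the first-stage decoder can be illustrated with a toy discrete-time model. This is purely an illustration, not a circuit simulation: the sample-based timing, the delay value, and the function names are invented, and the address is assumed stable and selecting the modeled row.

```python
# Toy model of the row decoder's implicit one-shot: because the decoded
# term follows NOT(CLK) with a few gate delays of lag, the pulldown
# stack (enabled while CLK is high AND the term is still asserted) sees
# only a brief, self-limited pulse. Delays are arbitrary sample units.

DECODER_DELAY = 3  # gate delays from a CLK edge to the decoder output

def wordline_waveform(clk):
    """clk: list of 0/1 samples. Returns the wordline pulse samples."""
    wordline = []
    for t in range(len(clk)):
        # Decoder output follows NOT(clk) with DECODER_DELAY lag.
        past_clk = clk[t - DECODER_DELAY] if t >= DECODER_DELAY else 0
        decoded = 0 if past_clk else 1
        # Pulldown fires only while CLK is high AND the decoded term has
        # not yet deasserted, so the pulse lasts about DECODER_DELAY.
        wordline.append(1 if clk[t] and decoded else 0)
    return wordline

clk = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # CLK rises at t = 4
pulse = wordline_waveform(clk)              # brief pulse starting at t = 4
```

The pulse width in this model is set entirely by the decoder's own propagation delay, which is the point of the scheme: no explicit one-shot has to be replicated per row.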
The speed penalty for the static row decoder is less than 1% of the delay from address to data output.

C. ROM Structures

The read-only memory (ROM) arrays are self-timed, edge-triggered, full-rail segmented structures. Each ROM array consists of one, two, or four 64-b-tall arrays connected by a super bitline. The full-rail segmented architecture has speed and power comparable to a reference-cell and static-load design, with smaller area. The segmented and twisted bitlines reduce

Fig. 6. DC row decoder.

the coupling to aggressors. Full-rail circuits eliminate the risk and complexity of matching circuits and races associated with small signals. D. Register Files The processor has two custom register files: the 88-entry, 90-b, five-read, five-write, floating-point register file (FPRF) [3] and the 24-entry, 32-b, nine-read, eight-write, combined integer future file and register file (IFFRF). Both register files avoid complex bypass circuitry by completing write operations before reads occur. The FPRF decodes the write address in the previous cycle, so that the write operation naturally completes during the read address decode. The IFFRF delays the read access until the write and tag comparisons complete. To reduce the routing and area cost, the write bitlines are single-ended. The low-voltage writeability issue is solved by the threetransistor configuration used for each write port, as shown in Fig. 7. 1) Integer Core Register File: The IFFRF superimposes speculative data in a future file on top of the architecturally committed data in the register file. All reads originate in the future file. In the event of a pipeline abort, data in the future file are resynchronized with the committed data in the register file. The IFFRF supports five write ports into the future file and three in the register file. Data written into the tag or data

array are immediately available for a read. A 7-b tag CAM produces data-forwarding information in the event that a write is pending and data for the execution units must be bypassed externally. Nine read address generators located 5 mm away from the IFFRF transmit dynamic, partially decoded addresses, causing a large variance in transmission time. A dynamic signal converter holds the address valid for the duration of the cycle. The latched signals are ANDed with a minimum-delay inhibit signal, determined by the completion of the data write, tag compare and latch, and bitline precharge, to generate the read enable (see Fig. 8). The read addresses wait for the deassertion of the inhibit signal if they arrive early, or flow through immediately if they arrive late. This avoids initiating a wrong-way read, which would reduce bitline margin and cause potential timing hazards.

Fig. 8. Integer register file inhibit circuit.

2) Floating-Point Register File: The floating-point register file has five read ports and five write ports that can operate simultaneously in a clock cycle. A pulsed write enable signal and flopped write data are driven directly to the register cells without any intervening logic. The register bit cell uses pulldowns on both sides of the cross-coupled-inverter storage node to speed up the write operation without the use of PMOS devices. This fast write permits the write-through of write data to a read of the same register in the same cycle without the use of special bypass circuitry. Contrast this with designs using single-ended writes [4], which do not allow the write-through operation, especially at lower voltages. The register file is self-timed and self-resetting and relies only on the falling edge of the clock to start the timing chain.

Fig. 7. Register file bit cell.

E. Phase-Locked Loop and On-Chip Clock

The phase-locked loop (PLL) operates with a 2.5-V power supply, internally regulated down to 1.6 V to satisfy oxide voltage stress limits. A high-precision bandgap circuit minimizes the variation of this internal supply voltage. Given the limited voltage headroom and the high-frequency target, the PLL is designed to maximize the voltage-controlled oscillator (VCO) control range. To ensure minimum static phase error over the maximum VCO control voltage range, the charge pump, shown in Fig. 9, regulates the DOWN current level based on the UP level. This avoids large current mismatches when the DOWN current source devices begin to exit saturation. The cycle compression (less than 25 ps) is optimized at the expense of accumulated phase error (less than 1 ns) by setting the loop natural frequency low.

The PLL clock is transported to the center of the chip. From the center of the chip, an eight-level binary tree distributes the clock to eight horizontal buffer slices. The final programmable drivers are connected to the metal-5 and metal-6 mesh grid in 66 columns across each buffer slice. The maximum simulated RC skew is 32 ps. When channel-length variation is taken into account, the simulated process skew is 96 ps.

F. FPU Arithmetic Pipelines

The FPU contains three execution pipelines: an add pipeline, a multiply pipeline, and a store pipeline. The add pipeline computes all x87 floating-point addition, subtraction, and compare operations; multimedia integer arithmetic logic unit (ALU) operations; and SIMD single-precision floating-point addition instructions. The multiply pipeline computes all x87 floating-point multiplication, remainder, division, and square-root operations; multimedia integer ALU and multiplication operations; and SIMD floating-point multiplications. The store pipeline processes true store operations, along with several special operations that support microcode routines. These pipelines are implemented without resorting to full-custom macrocells. Thoughtful architectural decisions, a few special-purpose standard cells, and careful guidance of placement and routing permit the operation of these functional units at high frequencies.

The add pipeline contains two floating-point adders, one for multimedia and SIMD floating-point instructions and one for x87 floating-point computation. The multimedia adder is based on a two-path implementation. The Far path is used for all effective additions and for effective subtractions where the exponents of the two operands differ by more than one. The dataflow for the Far path contains an exponent subtraction, an aligning right shift, and a full-width carry-propagate addition. The Close path is used for effective subtractions with exponents that differ by zero or one. The dataflow in the Close path consists of an exponent difference prediction, a carry-propagate addition, and a normalizing left shift. These two paths operate in parallel, with the correct path chosen at the end of the execution. Rounding is handled implicitly within the adder by computing multiple results (i.e., sum and sum + 1) in parallel and choosing the appropriate one.

Fig. 9. PLL charge-pump circuit.

The x87 adder has a structure similar to the multimedia adder, in that it also uses a two-path architecture. However, it rearranges the operations within each path to allow rounding to multiple precisions. In the first cycle, exponent difference computations are performed. In the second cycle, either the smaller operand is right-aligned (Far path) or both operands are prenormalized (Close path). Carry-propagate addition occurs in the third cycle, and low bits are selectively cleared in the fourth cycle.

The multiply pipeline contains a single 76 x 76-b multiplier, shown in Fig. 10. It employs radix-8 Booth encoding to generate 26 partial products in the first execution cycle. A binary tree of 4-2 compressors reduces the partial products to two, after which a rounding constant is added through

two parallel (3, 2) carry-save adders. These results are carry-assimilated in the third cycle, and the appropriate result is chosen in the fourth cycle. SIMD floating-point multiplications are also computed using this multiplier: two independent single-precision products are computed in parallel for SIMD operations. Division and square-root operations are computed using quadratically converging, multiplication-based algorithms, and they share the multiplier, which sets the constraints on the dimensions of the multiplier [5]. The x87 and SIMD floating-point multiplication operations are able to fill unused cycles during a division or square-root operation to provide maximum execution bandwidth. The latency and throughput of the various operations are shown in Table II.

V. TEST AND DEBUG FEATURES

The design includes scan chains and various clock modes to provide debug support and allow the use of automatic test pattern generation (ATPG) tools. Areas of the design that do not use full scan include macrocells, I/O cells, PLL


Fig. 10. Floating-point multiplier organization.
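The radix-8 Booth recoding used by the multiplier of Fig. 10 can be illustrated with a short sketch. This is an illustrative model only (the function and variable names are ours, not the chip's implementation); it shows why a 76-b operand yields ceil((76 + 1)/3) = 26 signed digits in [-4, 4], i.e., 26 partial products.

```python
def booth8_digits(y, n):
    """Recode an n-bit unsigned multiplier y into radix-8 signed digits.
    Each digit inspects a 4-bit overlapping window (three new bits plus
    one overlap bit), giving ceil((n + 1) / 3) digits in [-4, 4] --
    26 partial products for n = 76. Illustrative sketch only."""
    y <<= 1                          # append the implicit 0 below the LSB
    digits = []
    for i in range((n + 3) // 3):    # ceil((n + 1) / 3) digit positions
        w = (y >> (3 * i)) & 0xF     # 4-bit overlapping window b3 b2 b1 b0
        d = (w >> 1) + (w & 1)       # = 4*b3 + 2*b2 + b1 + b0
        if w >= 8:                   # MSB of the window carries the sign
            d -= 8                   # turns +4*b3 into -4*b3
        digits.append(d)
    return digits
```

Each partial product is then (digit x multiplicand) weighted by 8^i; the +-3x multiple is the "hard" multiple that radix-8 encoding must precompute with an adder.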

Fig. 11. Flop with built-in multiplexor functionality.
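The flop of Fig. 11 merges the scan multiplexor into the storage element, so the same flops serve both mission mode and scan shifting. A behavioral sketch of how such flops chain into a shift register (all signal names here are our own assumptions, not the chip's netlist):

```python
def clock_flops(chain, d_inputs, scan_in, scan_en):
    """One clock edge for a chain of mux-flops: when scan_en is asserted,
    the flops form a shift register fed by scan_in; otherwise each flop
    captures its functional D input. Behavioral sketch only."""
    if scan_en:
        return [scan_in] + chain[:-1]   # shift one bit position per clock
    return list(d_inputs)               # normal mission-mode capture
```

Shifting a pattern in, clocking once in mission mode, and shifting the result out is the capture-and-observe loop that the ATPG patterns rely on.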


TABLE II INSTRUCTION LATENCIES
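The quadratically converging, multiplication-based division behind the Table II latencies can be sketched with a Newton-Raphson reciprocal iteration. This is one common form of such algorithms, shown here only as a hedged illustration; the algorithm actually implemented is detailed in [5], and the names below are our own.

```python
from fractions import Fraction

def newton_divide(a, b, seed, iters):
    """Quadratic convergence: each iteration squares the relative error of
    the reciprocal estimate, so a seed good to a handful of bits reaches
    full precision in a few passes through the shared multiplier."""
    r, b = Fraction(seed), Fraction(b)   # seed: e.g., a small reciprocal ROM
    for _ in range(iters):
        r = r * (2 - b * r)              # two dependent multiplies per step
    return Fraction(a) * r               # final back-multiply yields a/b
```

Because every step is a multiply, independent x87 and SIMD multiplications can be interleaved into the same pipeline during a division, as described above.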

state elements, and JTAG-related logic. For debug, the chip also supports on-the-fly frequency variation, single-cycle step operation, and a stopped-clock mode. For PLL characterization, a scan chain, two high-speed pads, and one analog pad are used for extensive measurements of critical clock phase relationships and subblock operations. In addition, one scan chain is dedicated to programming self-timed pulses in the macrocells.

The macrocells are generally SRAM arrays, register files, or ROM’s. A built-in self-test (BIST) 13N march C algorithm tests the SRAM arrays. A write-recovery test is implemented by the BIST hardware once the 13N algorithm completes. At low frequency, the SRAM bit cells are tested for data retention. To apply the ATPG scan patterns, the PLL is bypassed so that the internal processor clock can be directly controlled from the pins at the edge of the device. During scanning, the


clock is held high, and scan data are shifted using two-phase nonoverlapping pulses on the scan clocks. The macrocells that are not fully scannable can also be tested with scan-based patterns. During scan operation, patterns are written at the I/O boundary of the nonscan macrocells. An intermediate signal then exercises the macrocell, allowing its contents to be observed and controlled. In this way, ATPG patterns can provide stuck-at fault coverage of the surrounding logic upstream and downstream of the macrocell. Test modes were added to support debug, manufacturing test, and characterization of the PLL itself. The PLL can be bypassed from either of two clocking sources. The first bypass clock source is the normal differential clock inputs. The second bypass clock source is another differential clock pair that was added to support a high-speed external clock generated by the test equipment manufacturer. The bypass mode is used primarily for applying the ATPG-produced scan patterns. The bypass clock is also used as the clock source during burn-in, since the lower operating limit of the PLL is above the desired burn-in speed. In addition to the bypass test mode, there are test modes that control the generation of the PLL clock. These include system clock modes and tester clock modes. The system clock modes provide the ability to perform system capture by stopping the internal clock at a specific cycle and capturing the state of the machine in the scan latches at a predetermined cycle. A JTAG-programmable register is loaded to select the system PLL test mode as well as the internal cycle at which to stop or capture state. An external pin is asserted to trigger the test mode. The state of the machine can also be captured by simply stopping the clock and shifting out the state.
The system capture mode was added out of concern that stopping the PLL instantaneously would cause excessive changes in power-supply current. The system capture mode gracefully slows the clock before stopping it. The complete state of the machine is not guaranteed in this mode, however, since writes may occur to the nonscan arrays after the state capture. The tester clock modes provide more flexibility than the system clock modes. In addition to stopping the clock at any internal clock cycle, there is a mode that provides clock stretching and a mode that generates a selected number of high-speed clock pulses from the PLL.
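The 13N march algorithm run by the BIST engine on the SRAM arrays can be illustrated with a classic march-C element sequence. This is a hedged sketch: the exact element list of the on-chip 13N variant is not given in the text, and all names below are ours.

```python
def march_c(mem, n):
    """March-C-style SRAM test: ascending and descending read/write
    sweeps that detect stuck-at and many coupling faults. Returns the
    set of failing addresses. Illustrative sketch only."""
    fails = set()
    def check(addr, expect):
        if mem[addr] != expect:
            fails.add(addr)
    up, down = range(n), range(n - 1, -1, -1)
    for a in up:
        mem[a] = 0                  # element 1: ascending w0
    for a in up:
        check(a, 0); mem[a] = 1     # element 2: ascending r0, w1
    for a in up:
        check(a, 1); mem[a] = 0     # element 3: ascending r1, w0
    for a in down:
        check(a, 0); mem[a] = 1     # element 4: descending r0, w1
    for a in down:
        check(a, 1); mem[a] = 0     # element 5: descending r1, w0
    for a in up:
        check(a, 0)                 # element 6: ascending r0
    return fails
```

Each sweep reads the value left by the previous element before overwriting it, which is what exposes a cell that cannot hold or change state.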

ACKNOWLEDGMENT

The authors gratefully acknowledge the technical contributions of the architecture, circuit, implementation, and product development teams that designed the processor.

REFERENCES

[1] S. Oberman et al., "AMD 3DNow! technology and the K6-2 microprocessor," in Proc. Hot Chips 10, Aug. 1998, pp. 245–254.
[2] H. Partovi et al., "Flow-through latch and edge-triggered flip-flop hybrid elements," in ISSCC Dig. Tech. Papers, Feb. 1996, pp. 138–139.
[3] M. Golden and H. Partovi, "A 500 MHz, write-bypassed, 88-entry, 90-bit register file," in Proc. Symp. VLSI Circuits, June 1999, pp. 105–108.
[4] B. Gieseke et al., "A 600 MHz superscalar RISC microprocessor with out-of-order execution," in ISSCC Dig. Tech. Papers, Feb. 1997, pp. 176–177.
[5] S. Oberman, "Floating point division and square root algorithms and implementation in the AMD-K7 microprocessor," in Proc. 14th IEEE Symp. Computer Arithmetic, Apr. 1999, pp. 106–115.

Michael Golden received the B.S. degree in computer engineering and the M.S.E.E. degree from the University of Illinois at Urbana-Champaign in 1990 and 1991, respectively. He received the Ph.D. degree in computer science and engineering from the University of Michigan, Ann Arbor, in 1995. He is a Member of Technical Staff at Advanced Micro Devices (AMD), Sunnyvale, CA. Since 1995, he has worked at NexGen and AMD on the K6 and K7 microprocessors. He is currently working on microprocessor circuit design.

Steve Hesley received the B.S.E.E. degree cum laude from Rice University, Houston, TX, in 1992. He then joined Advanced Micro Devices, Austin, TX, where he worked on three generations of the 29000 embedded microprocessor family before joining the K7 project. On K7, he designed data cache macro cells and was involved in the custom circuit design methodology. His current interests include PLL's and low-power circuits.

Alisa Scherer received the B.S. degree from the University of Michigan, Ann Arbor, in 1986 and the M.S. degree from the University of California at Berkeley in 1988, both in electrical engineering. Since 1994 she has worked at NexGen and Advanced Micro Devices, Sunnyvale, CA, on the K6 and K7 microprocessors.

Matthew Crowley (S’90–M’90) received the B.S. degree in electrical engineering from the University of Illinois, Urbana-Champaign, in 1990. From 1990 to 1993, he worked at Amdahl Corp., Fremont, CA, as a Design Engineer, where he was responsible for board-level signal integrity as well as high-performance package design for mainframe systems. In 1993, he joined NexGen, Milpitas, CA, as a Circuit Design Engineer responsible for I/O design on the Nx586 and Nx686 processors. As a Member of Technical Staff with Advanced Micro Devices (via NexGen acquisition), he led the design and implementation of several K6 custom blocks, including a high-speed RAM, phase-locked loop, and on-die temperature sensor. He was also the design and implementation lead for the K7 PLL. In June 1999, he joined Rhombus Inc., Palo Alto, CA, where he is currently a Member of Technical Staff.


Scott C. Johnson (S'90–M'91) received the B.S. degree in electrical engineering from the University of Texas at Arlington in 1991 and the M.S. degree in computer engineering from National Technical University, Fort Collins, CO, in 1999. He joined Advanced Micro Devices, Austin, TX, in 1992, where he worked on product test of microprocessors and chip sets. In 1995, he transitioned to layout and mask design. Since 1997, he has been with a design team working on microprocessor circuits. On K7, he designed the clock network. His current interests include SRAM and register file circuits. He has received two U.S. patents.

Stephan Meier received the B.S. degree in computer science and electrical engineering and the M.Eng. degree from Cornell University, Ithaca, NY, in 1989 and 1990, respectively. He is a Senior Member of Technical Staff at the California Microprocessor Division of Advanced Micro Devices, Sunnyvale, CA. He is currently responsible for the microarchitecture of the load-store unit on K8. Prior to working on K8, he was Co-architect of the K7 FPU, where his responsibilities included specifying and developing an RTL model of the control logic for the FPU as well as a gate-level implementation of the retire queue. His technical interests include computer architecture, the design and verification of microprocessors, and performance modeling. Mr. Meier is a member of Eta Kappa Nu.

Dirk Meyer is Vice President of Engineering for Advanced Micro Devices (AMD)’s Computation Products Group and was Co-director of K7 development from 1996 through 1998. He joined AMD in 1996 from Digital Equipment Corp., where he was Co-architect of the Alpha 21064 and 21264 microprocessors. He has 16 years of microprocessor design experience at AMD, Digital, and Intel, where he has been involved in the design of x86, Alpha, VAX, and embedded processors.

Jerry D. Moench (S’69–M’71) received the B.S.E.E. degree from South Dakota School of Mines and Technology, Rapid City, in 1971. From 1971 to 1983, he was with Motorola in Phoenix, AZ, and Austin, TX. His work at Motorola was on high-performance memory products, which include SRAM’s, DRAM’s, EPROM’s, and ROM’s. One of his most successful products was one of the first single 5-V, 64-K DRAM’s. He has received 34 patents. He has been with Advanced Micro Devices (AMD), Austin, since 1984, where he has worked on designing DRAM’s, programmable logic, and custom design of microprocessors. He is currently heading up the custom design effort on the K7 microprocessor and working with process development groups to determine the technology directions for future microprocessor designs. Mr. Moench was a Motorola Dan Noble Fellow in 1982. He also is an AMD Fellow.

Stuart Oberman (S’91–M’97) received the B.S. degree from the University of Iowa, Iowa City, in 1992 and the M.S. and Ph.D. degrees from Stanford University, Palo Alto, CA, in 1994 and 1996, respectively, all in electrical engineering. He is a Member of Technical Staff at Advanced Micro Devices, Sunnyvale, CA. He was an Architect of the AMD 3DNow! instruction set and the floating-point unit datapath architect for the K7 microprocessor. He is currently responsible for multimedia architecture in the next-generation AMD microprocessors. His technical interests include computer arithmetic, computer architecture, and VLSI design. Dr. Oberman is a member of Tau Beta Pi, Eta Kappa Nu, Sigma Xi, and the IEEE Computer Society.


Hamid Partovi (M'89) received the B.S.E.E. degree magna cum laude from the University of California, Berkeley, in 1981 and the M.S.E.E. degree from the University of Michigan, Ann Arbor, in 1983. He was with Advanced Micro Devices (AMD), Sunnyvale, CA, from 1983 to 1987, where he participated in the development of nonvolatile memories. He joined Digital Equipment Corp., Hudson, MA, in 1987, where he was involved in the design of memories and microprocessors. Most notably, he designed the caches of a 100-MHz VAX processor and a 10-ns, 1-Mbit commodity SRAM. After a year with Intergraph Corp.'s Advanced Processor Division, Palo Alto, CA, he joined NexGen, Inc., Milpitas, CA (later acquired by AMD) in 1994. Having participated in the design of the Nx586 and K6 processors, he co-led the circuit efforts of the K7 microprocessor. He is the author or co-author of 16 technical papers and has received 16 patents. Mr. Partovi is an AMD Fellow.

Fred Weber received the A.B. degree in physics from Harvard University, Cambridge, MA, in 1985. He is Vice President of Engineering for the Computation Products Group of Advanced Micro Devices (AMD), Sunnyvale, CA. He is responsible for all CPU and chipset design engineering in the California Microprocessor Division. He was the Co-leader for development of the AMD-K7. His technical interests include computer architecture, software/hardware optimization, and CAD.

Scott White (S’86–M’87) received the B.A.Sc. degree in systems design engineering from the University of Waterloo, Ont., Canada, in 1987. He is a Senior Member of Technical Staff at Advanced Micro Devices (AMD), Austin, TX. He has 13 years of experience in processor design, including both high-performance x86 and embedded RISC processors. Before joining AMD in 1989, he worked on large mainframe CPU design at Control Data Corp.

Tim Wood (S’83–M’83) received the B.S. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY. He is a Senior Member of Technical Staff at Advanced Micro Devices, Sunnyvale, CA, where he works on design and test development of x86compatible microprocessors.

John Yong received the B.S. degree from Stanford University, Stanford, CA, in 1994 and the M.S. degree from the University of Michigan, Ann Arbor, in 1996, both in electrical engineering. He has been with NexGen and Advanced Micro Devices, Sunnyvale, CA, since 1995 and is currently working on high-performance microprocessor design.