A 93MHz, X86 Microprocessor with On-Chip L2 Cache ... - IEEE Xplore

TP 10.2: A 93MHz, X86 Microprocessor with On-Chip L2 Cache Controller

Donald Draper, Matthew Crowley, Udeema Doppalapudi, Harold McFarland, Bill Mo, Hamid Partovi, David Puziol, Alisa Scherer, Eric Tosaya, Korbin Van Dyke, Anderson Vuong, Larry Widigen, Jonas Yip, Stanley Yu, David Roth

NexGen, Inc., Milpitas, CA

This 3.5M-transistor microprocessor uses a 0.5µm CMOS technology (Table 1). The die is mounted on the package with a solder-bump technology. Using custom and routed blocks with 5-layer metal, a die size of 14.1x14.1mm² is produced, as shown in Figure 1. At 4.0V and 25°C, the chip operates above 93MHz. With a 1MB cache of 12ns SRAMs, performance is over 120 Winstones.

The processor micro-architecture supports the issue and execution of x86 instructions at the rate of one per clock, as described in Reference [1]. Internally, these become multiple RISC operations issued and executed in an out-of-order superscalar fashion. The machine supports 14 macro-instructions active at one time; two of these may be transfer controls. Over 120 Winstones at 93MHz is attained by aggressive use of internal data-path resources, branch prediction, general-purpose and segment register renaming, and multiple stream pre-fetch. The level-zero caching includes write buffers and a branch prediction cache with the program counter pointing to the transfer control and the target address for the transfer. The on-chip level-one caches are separate instruction and data units of 16kB each. The unified, four-way set-associative secondary cache contains either 256kB or 1MB of data and tags. First- and second-level cache controllers are on the processor. The basic features of the architecture are shown in Figure 2.

A high-performance implementation of the x86 architecture presents several challenges: variable-length instructions, segmentation, data and code limit checking, and multiple modes. The microprocessor addresses these issues by providing special hardware to perform the operations quickly. While a PLA generates x86-to-RISC decode and assembly information, dedicated high-speed logic computes the length of the current instruction to allow rapid alignment of the following instruction. Selector register renaming and parallel protection and limit checking hardware accomplish high-speed implementation of the segmented memory model. Code limit checking, required on every instruction, is handled by specialized hardware.

Each internal RISC operation is normally processed in one clock. A dedicated unit processes register-to-register operations autonomously from other units. This capitalizes on register renaming provided by the instruction decoder, allowing work to proceed beyond stalled instructions, e.g., those waiting on an L1 cache miss. There is extensive branch prediction for both near (within the same code segment) and far (to another code segment) branches. Predicting the target address and the target bytes allows these macro-instructions to have only a one- or two-clock latency in the pipe, even though the complicated processing required to check the prediction may require several more cycles. The branch prediction cache uses CAM structures to detect writes, in conjunction with careful pipeline and fetch control, to support the x86 characteristic of self-modifying code.

A phase-locked loop doubles the system clock frequency and compensates for the on-chip clock tree and buffer delays. The clock is distributed through four levels of non-inverting buffers that are optimized to maintain pulse width and deliver edge rates less than 500ps worst case. The outputs of the third- and fourth-level buffers are gridded to equalize load and hence delay. Clock lines are routed with equivalent topologies to provide predictable delay. Estimated on-chip skew is less than 200ps.

The instruction decode PLA shown in Figure 3 comprises 40 inputs, 800 minterms, and 144 outputs and is accessed in 4.5ns. A matched delay line, including a maximally loaded bitline with a single ON cell, activates the sense amps in the AND plane and allows for the flow of data to the OR plane. Furthermore, a switched-ground technique is used in the AND plane to avoid the need to precondition the PLA inputs to the clock. The design incorporates cascode amplifiers in both planes that are tuned to respond to small voltage excursions on the bit lines.

The physical realization of the chip uses an internally developed floor-planning tool that automates tiling and placement of data path and random logic structures. After cell placements are completed, scan chain assignments are made. Clock and power routing are then completed semi-automatically. Overall chip placement and routing are completed hierarchically to keep the number of placeable objects manageable. Interconnect delay is then analyzed by an RC extraction tool. The logic is verified by logic simulation, and changes made for timing are compared to the original logic using a formal verification tool. The flow diagram is shown in Figure 4.

The die is mounted onto a 463-pin interstitial PGA by epoxy-underfilled flip-chip technology, as shown in Figure 5. A thermal grease provides a heat conduction path from the die to the aluminum lid and heat spreader. Worst-case power dissipation is 16W. An active (fanned) heat sink is attached, resulting in a total junction-to-ambient thermal resistance of 1.7°C/W. This limits the junction temperature rise to 27°C above ambient. Low-inductance decoupling capacitors are mounted under the lid with solder-ball array technology. Signal integrity of the package and the socket is verified using a 3-D electromagnetic field equation solver to meet the ground-bounce and transmission-line reflection requirements.
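Several of the figures quoted in the text can be cross-checked with simple arithmetic. The sketch below is illustrative only and uses just the numbers stated in the paper: a core clock of 93MHz produced by a PLL that doubles the system clock, on-chip skew under 200ps, a 4.5ns PLA access, 16W worst-case power, and 1.7°C/W junction-to-ambient thermal resistance. The function and constant names are invented for this example, and treating the core clock as exactly 93MHz is an assumption, since the paper states only that the chip operates above that frequency.

```python
# Back-of-the-envelope checks on figures quoted in the text.
# All names here are hypothetical; the numeric inputs come from the paper.

CORE_MHZ = 93.0          # assumed exact; the paper says "above 93MHz"
PLL_MULTIPLIER = 2       # the PLL doubles the system clock
SKEW_PS = 200.0          # upper bound on estimated on-chip skew
PLA_ACCESS_NS = 4.5      # instruction decode PLA access time
POWER_W = 16.0           # worst-case power dissipation
THETA_JA_C_PER_W = 1.7   # junction-to-ambient thermal resistance


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def system_clock_mhz(core_mhz: float, multiplier: int) -> float:
    """System (bus) clock implied by a core clock and a PLL multiplier."""
    return core_mhz / multiplier


def junction_rise_c(power_w: float, theta_ja: float) -> float:
    """Steady-state junction temperature rise above ambient in degrees C."""
    return power_w * theta_ja


if __name__ == "__main__":
    t_core = period_ns(CORE_MHZ)
    print(f"core period          : {t_core:.2f} ns")
    print(f"system clock         : {system_clock_mhz(CORE_MHZ, PLL_MULTIPLIER):.1f} MHz")
    print(f"skew / period        : {SKEW_PS / 1000.0 / t_core:.1%}")
    print(f"PLA access / period  : {PLA_ACCESS_NS / t_core:.1%}")
    print(f"junction rise        : {junction_rise_c(POWER_W, THETA_JA_C_PER_W):.1f} C")
```

The thermal product, 16W times 1.7°C/W, comes to about 27.2°C, which agrees with the stated 27°C rise above ambient. At 93MHz the clock period is roughly 10.75ns, so the 200ps skew bound is under 2% of a cycle and the 4.5ns PLA access occupies a little over 40% of one.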

Acknowledgments

The authors acknowledge the contributions of B. Lee, E. Sowadski, K. Schakel, P. Yu, R. Khanna, S. Fetherston, T. Roth, W. Stutz, D. Stiles, Y. Dao, and IBM Microelectronics for process support.

References

[1] Gwennap, L., Microprocessor Report, March 28, 1994.

Table 1: Process.

Gate oxide thickness           135Å
n-channel Leff                 0.46µm
p-channel Leff                 0.51µm
n-well technology              Retrograde
Substrate                      P-epi, p+ substrate
Contacted pitch, M1            1.4µm
Contacted pitch, M2, M3, M4    1.8µm
Contacted pitch, M5            4.8µm
Via technology                 Stacked

ISSCC95 / February 16, 1995 / Buena Vista and Sea Cliff / 2:00 PM

Figure 3: Instruction decode PLA. [Labels recovered from the figure: AND plane; OR plane; from matched delay line.]
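The two-plane structure of the instruction decode PLA can be illustrated with a small functional model. This is a minimal sketch of generic PLA logic and not the authors' circuit: it says nothing about the sense amps, matched delay line, or switched-ground technique, and the three-input personality used here is made up for illustration (the real array has 40 inputs, 800 minterms, and 144 outputs). All names are hypothetical.

```python
from typing import List, Optional, Sequence

# One product term (a row of the AND plane). For each input the entry is:
#   1    -> the input must be 1
#   0    -> the input must be 0
#   None -> the input is not connected to this term (don't care)
Term = Sequence[Optional[int]]


def and_plane(inputs: Sequence[int], terms: Sequence[Term]) -> List[bool]:
    """Return, for every product term, whether the inputs match it."""
    return [
        all(want is None or want == bit for want, bit in zip(term, inputs))
        for term in terms
    ]


def or_plane(term_hits: Sequence[bool],
             or_matrix: Sequence[Sequence[int]]) -> List[int]:
    """Each output is the OR of the product terms wired to it.

    or_matrix[k] lists the indices of the terms connected to output k.
    """
    return [int(any(term_hits[t] for t in wired)) for wired in or_matrix]


def pla_eval(inputs: Sequence[int],
             terms: Sequence[Term],
             or_matrix: Sequence[Sequence[int]]) -> List[int]:
    """Evaluate the PLA: AND plane first, then OR plane."""
    return or_plane(and_plane(inputs, terms), or_matrix)


# Hypothetical toy personality: 3 inputs, 3 product terms, 2 outputs.
TERMS = [
    (1, 0, None),   # term 0: in0 AND NOT in1
    (None, 1, 1),   # term 1: in1 AND in2
    (0, 0, 0),      # term 2: NOT in0 AND NOT in1 AND NOT in2
]
OR_MATRIX = [
    [0, 1],         # output 0 = term 0 OR term 1
    [1, 2],         # output 1 = term 1 OR term 2
]

if __name__ == "__main__":
    for vector in [(1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 1, 0)]:
        print(vector, "->", pla_eval(vector, TERMS, OR_MATRIX))
```

Scaling the model to the dimensions in the text would only mean a 40-wide input vector, 800 rows in `TERMS`, and 144 entries in `OR_MATRIX`; the evaluation order, AND plane followed by OR plane, is the same one the matched delay line enforces in hardware.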

Figure 4: VLSI design flow. [Labels recovered from the figure: gate/HDL netlist; scan chain; route; static timing analyzer.]

Figure 5: Flip-chip mounted die, module cross-section. [Labels recovered from the figure: heat sink; lid; grease; capacitor; MPU chip; bump; ceramic package.]

DIGEST OF TECHNICAL PAPERS


TP 10.2: A 93MHz, X86 Microprocessor with On-Chip L2 Cache Controller (Continued from page 173)

Figure 1: Chip micrograph. [Label recovered from the figure: off-chip drivers.]


1995 IEEE International Solid-State Circuits Conference
