TP 10.2: A 93MHz, x86 Microprocessor with On-Chip L2 Cache Controller
Donald Draper, Matthew Crowley, Udeema Doppalapudi, Harold McFarland, Bill Mo, Hamid Partovi, David Puziol, Alisa Scherer, Eric Tosaya, Korbin Van Dyke, Anderson Vuong, Larry Widigen, Jonas Yip, Stanley Yu, David Roth
NexGen, Inc., Milpitas, CA
This 3.5M-transistor microprocessor uses a 0.5µm CMOS technology (Table 1). The die is mounted on the package with a solder-bump technology. Using custom and routed blocks with 5-layer metal, a die size of 14.1 x 14.1mm² is produced as shown in Figure 1. At 4.0V and 25°C, the chip operates above 93MHz. With a 1MB cache of 12ns SRAMs, performance is over 120 Winstones.
The processor micro-architecture supports the issue and execution of x86 instructions at the rate of one per clock, as described in Reference [1]. Internally, these become multiple RISC operations issued and executed in an out-of-order superscalar fashion. The machine supports 14 macro-instructions active at one time: two of these may be transfer controls. Over 120 Winstones at 93MHz is attained by aggressive use of internal data-path resources, branch prediction, general-purpose and segment register renaming, and multiple stream pre-fetch. The level-zero caching includes write buffers and a branch prediction cache with the program counter pointing to the transfer control and the target address for the transfer. The on-chip level-one caches are separate instruction and data units of 16kB each. The unified, four-way set-associative secondary cache contains either 256kB or 1MB of data and tags. First- and second-level cache controllers are on the processor. The basic features of the architecture are shown in Figure 2.
A high-performance implementation of the x86 architecture presents several challenges: variable-length instructions, segmentation, data and code limit checking, and multiple modes. The microprocessor addresses these issues by providing special hardware to perform the operations quickly. While a PLA generates x86 to RISC decode and assembly information, dedicated high-speed logic computes the length of the current instruction to allow rapid alignment of the following instruction. Selector register renaming and parallel protection and limit checking hardware accomplish high-speed implementation of the segmented memory model. Code limit checking, required on every instruction, is handled by specialized hardware.
Each internal RISC operation is normally processed in one clock. A dedicated unit processes register to register operations autonomously from other units. This capitalizes on register renaming provided by the instruction decoder, allowing work to proceed beyond instructions stalled, e.g., due to an L1 cache miss.
There is extensive branch prediction for both near (within the same code segment) and far (to another code segment) branches. Predicting the target address and the target bytes allows these macro-instructions to have only a one- or two-clock latency in the pipe even though the complicated processing required to check the prediction may require several more cycles. The branch prediction cache uses CAM structures to detect writes in conjunction with careful pipeline and fetch control to support the x86 characteristic of self-modifying code.
A phase-locked loop doubles the system clock frequency and compensates for the on-chip clock tree and buffer delays. The clock is distributed through four levels of non-inverting buffers that are optimized to maintain pulse width and deliver edge rates less than 500ps worst case. The outputs of the third- and fourth-level buffers are gridded to equalize load and hence delay. Clock lines are routed with equivalent topologies to provide predictable delay. Estimated on-chip skew is less than 200ps.
The instruction decode PLA shown in Figure 3 comprises 40
inputs, 800 minterms, and 144 outputs and is accessed in 4.5ns. A matched delay line, including a maximally loaded bitline with a single ON cell, activates the sense amps in the AND plane and allows for the flow of data to the OR plane. Furthermore, a switched ground technique is used in the AND plane to avoid the need to precondition the PLA inputs to the clock. The design incorporates cascode amplifiers in both planes that are tuned to respond to small voltage excursions on the bit lines.
The physical realization of the chip uses an internally developed floor-planning tool that automates tiling and placement of data path and random logic structures. After cell placements are completed, scan chain assignments are made. Clock and power routing are then completed semi-automatically. Overall chip placement and routing are completed hierarchically to avoid too many placeable objects. Interconnect delay is then analyzed by an RC extraction tool. The logic is verified by logic simulation and timing changes are compared to the original logic using a formal verification tool. The flow diagram is shown in Figure 4.
The die is mounted onto a 463-pin interstitial PGA by epoxy underfilled flip-chip technology, as shown in Figure 5. A thermal grease provides a heat conduction path from the die to the aluminum lid and heat spreader. Worst-case power dissipation is 16W. An active (fanned) heat sink is attached resulting in a total junction-to-ambient thermal resistance of 1.7°C/W. This limits the junction temperature rise to 27°C above ambient. Low-inductance decoupling capacitors are mounted under the lid with solder-ball array technology. Signal integrity of the package and the socket is verified using a 3-D electromagnetic field equation solver to meet the ground-bounce and transmission line reflection requirements.
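As a quick arithmetic check of the thermal figures quoted above, the steady-state junction temperature rise is the product of dissipated power and junction-to-ambient thermal resistance. The sketch below uses only the two values stated in the text; the function and constant names are illustrative and not from the paper.

```python
def junction_rise_c(power_w: float, theta_ja_c_per_w: float) -> float:
    """Steady-state junction temperature rise above ambient, in degrees C."""
    return power_w * theta_ja_c_per_w


WORST_CASE_POWER_W = 16.0  # worst-case dissipation stated in the text
THETA_JA_C_PER_W = 1.7     # junction-to-ambient resistance stated in the text

rise = junction_rise_c(WORST_CASE_POWER_W, THETA_JA_C_PER_W)
print(f"{rise:.1f}")  # 27.2, consistent with the quoted 27 degree C rise
```

The same function gives an absolute junction temperature once an ambient is chosen; the paper does not state one, so any ambient value added here would be an assumption.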
Acknowledgments
The authors acknowledge the contributions of B. Lee, E. Sowadski, K. Schakel, P. Yu, R. Khanna, S. Fetherston, T. Roth, W. Stutz, D. Stiles, Y. Dao, and IBM Microelectronics for process support.
References
[1] Gwennap, L., Microprocessor Report, March 28, 1994.
Table 1: Process.
Gate oxide thickness      135Å
n-channel Leff            0.46µm
p-channel Leff            0.51µm
n-well technology         Retrograde
Substrate                 p-epi / p+ substrate
Contacted pitch, M1       1.4µm
Contacted pitch, M2-M4    1.8µm
Contacted pitch, M5       4.8µm
Via technology            Stacked
ISSCC95 / February 16, 1995 / Buena Vista and Sea Cliff / 2:00 PM
Figure 3: Instruction decode PLA. Labels: AND plane, OR plane, from matched delay line.
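To illustrate the two-plane structure named in Figure 3, here is a minimal behavioral sketch of a PLA: the AND plane evaluates product terms (minterms) over the inputs, and the OR plane combines the asserted product terms into outputs. The tiny 3-input example and all names (`ProductTerm`, `eval_pla`, `TERMS`, `CONNECTIONS`) are hypothetical. The actual decode PLA has 40 inputs, 800 minterms, and 144 outputs, and its contents are not given in the paper. The model captures the logic function only, not the sense-amp timing, matched delay line, or switched-ground circuitry.

```python
from typing import Dict, List

# A product term maps input index -> required value.
# Inputs not listed are "don't care" for that term.
ProductTerm = Dict[int, bool]


def eval_and_plane(terms: List[ProductTerm], inputs: List[bool]) -> List[bool]:
    """One bit per product term: True when every listed literal matches."""
    return [all(inputs[i] == v for i, v in term.items()) for term in terms]


def eval_or_plane(connections: List[List[int]], term_bits: List[bool]) -> List[bool]:
    """Each output is the OR of the product terms connected to it."""
    return [any(term_bits[t] for t in conn) for conn in connections]


def eval_pla(terms: List[ProductTerm], connections: List[List[int]],
             inputs: List[bool]) -> List[bool]:
    """Evaluate the AND plane, then feed its term bits to the OR plane."""
    return eval_or_plane(connections, eval_and_plane(terms, inputs))


# Hypothetical example: inputs (a, b, c), 3 product terms, 2 outputs.
TERMS: List[ProductTerm] = [
    {0: True, 1: False},   # a AND NOT b
    {1: True, 2: True},    # b AND c
    {0: False, 2: False},  # NOT a AND NOT c
]
CONNECTIONS: List[List[int]] = [
    [0, 1],  # out0 = term0 OR term1
    [1, 2],  # out1 = term1 OR term2
]

print(eval_pla(TERMS, CONNECTIONS, [True, False, False]))  # [True, False]
```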
Figure 4: VLSI design flow. Blocks: gate/HDL netlist, scan chain, route, static timing analyzer.
Figure 5: Flip chip mounted die, module cross-section. Labels: heat sink, lid, grease, capacitor, MPU chip, bump, ceramic package.
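The design flow of Figure 4 includes an interconnect delay analysis step based on RC extraction. The paper does not describe how its tool computes delay; as a stand-in, the sketch below uses the Elmore delay, a standard first-order estimate for an RC ladder, to show what kind of number such a step produces. The function name, the segment model (each series resistance followed by a capacitance to ground), and all numeric values are assumptions for illustration only.

```python
from typing import List, Tuple


def elmore_delay(segments: List[Tuple[float, float]], load_c: float = 0.0) -> float:
    """Elmore delay in seconds at the far end of an RC ladder.

    segments: (resistance_ohm, capacitance_farad) per wire segment, driver first.
    load_c:   extra capacitance at the far end, e.g. a receiving gate input.
    """
    if not segments:
        return 0.0
    caps = [c for _, c in segments]
    caps[-1] += load_c
    delay = 0.0
    upstream_r = 0.0
    for (r, _), c in zip(segments, caps):
        upstream_r += r          # total resistance from the driver to this node
        delay += upstream_r * c  # each node's capacitance charges through it
    return delay


# Hypothetical wire: four 50-ohm / 20fF segments driving a 10fF load.
wire = [(50.0, 20e-15)] * 4
print(f"{elmore_delay(wire, load_c=10e-15) * 1e12:.1f} ps")  # 12.0 ps
```

Splitting a wire into more segments refines the estimate of its distributed RC behavior, which is why extracted netlists typically represent long routes as multi-segment ladders or trees.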
DIGEST OF TECHNICAL PAPERS
Figure 1: Chip micrograph. Label: off-chip drivers.
1995 IEEE International Solid-State Circuits Conference