Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache

In designing the next generation of the Itanium 2 processor, Intel doubled the on-die, level-three cache to 6 Mbytes and increased frequency by 50 percent compared to the previous generation. Another goal was to keep the power dissipation of the new design within the same envelope as its predecessor.

Stefan Rusu, Harry Muljono, and Brian Cherkauer, Intel Corp.


The third-generation Itanium processor targets the high-performance server and workstation market. To serve it, the design team sought to provide higher performance through increased frequency and a larger L3 cache. At the same time, we had to limit the power dissipation to fit into the existing platform envelope. These considerations led to what we now call the Itanium 2 processor 6M: the latest generation of Itanium 2, which features a 6-Mbyte, 24-way set-associative, on-die L3 cache. Fabricated in a dual-Vt, 130-nm process, this processor has six copper interconnect layers and uses a fluorinated-silica-glass dielectric. The design implements a 2-bundle, 64-bit explicitly parallel instruction computing (EPIC) architecture and is fully compatible with previous implementations,1,2 including hardware support for in-order IA-32 binary execution. The 374-mm2 die contains 410 million transistors. The processor operates at 1.5 GHz from a 1.3 V core supply and is plug-in compatible with the existing Itanium 2 platforms. The worst-case power dissipation is 130 W, although the power dissipation on a typical server workload is 107 W. The front-side bus is 128 bits wide and provides a total bandwidth of 6.4 Gbytes/s in a 4-way multidrop bus configuration. Table 1 compares the main attributes of this processor to those of the previous implementation. The on-die cache size (6 Mbytes) and the transistor count (410 million) are the largest ever for a commercial microprocessor. Although this processor's frequency is 50 percent higher than that of the previous generation, the maximum power dissipation holds flat at 130 W to ensure the platform's backward compatibility.
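As a quick cross-check, the 6.4-Gbytes/s figure follows directly from the bus width and transfer rate quoted above. A minimal sketch of the arithmetic (our own back-of-envelope check, not an Intel tool):

```python
# Sanity check: FSB bandwidth = bus width x transfer rate.
# The 128-bit data bus is double-pumped from a 200-MHz clock,
# giving 400 megatransfers/s (figures from the article).
bus_width_bits = 128
transfers_per_second = 400e6            # 2 transfers per 200-MHz clock

bytes_per_transfer = bus_width_bits // 8          # 16 bytes
bandwidth = bytes_per_transfer * transfers_per_second

print(f"{bandwidth / 1e9:.1f} Gbytes/s")          # 6.4 Gbytes/s
```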


Architectural overview
Figure 1 shows the processor block diagram. The execution pipeline is similar to that of the previous implementation.2 A 128-entry, 65-bit, 20-port integer register file works with six 1-cycle integer and six 2-cycle multimedia units. These components have fully symmetric bypassing among themselves and with the level-one (L1) data cache. A 128-entry, 82-bit floating-point register file combines with two 4-cycle floating-point multiply-accumulate units with full bypassing among themselves.

Shaded areas in Figure 1 indicate arrays and buses protected by parity and error correction code. Parity protects the L1 instruction and data caches; the translation look-aside buffers; the L2 tag cache; and the 44-bit system address bus. Error correction code protects the L2 data array; the L3 data and tag caches; and the 128-bit system data bus. To prevent a single event upset from causing multiple bit errors in the same cache line, all caches use bit interleaving for adjacent cache lines in the physical layout. The processor uses a configurable error containment strategy, including early error indicators and a data poisoning mechanism. Availability features include extensive error-logging capabilities for system diagnostics. The enhanced thermal management circuit throttles instruction issue when the on-die temperature sensor exceeds a fuse-programmed threshold; this mechanism protects the processor from catastrophic overheating.
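To see why bit interleaving contains multi-bit upsets, consider two cache lines whose bits alternate column by column in the physical array: a single event upset that flips two physically adjacent bits then corrupts at most one bit in each line, which per-line ECC or parity can catch. The following toy model illustrates the idea; it is not the processor's actual array layout:

```python
# Toy model of bit interleaving between two cache lines A and B.
# Physical column i holds a bit of A when i is even and a bit of B
# when i is odd, so physically adjacent bits belong to different lines.
def interleave(line_a, line_b):
    physical = []
    for a_bit, b_bit in zip(line_a, line_b):
        physical += [a_bit, b_bit]
    return physical

def deinterleave(physical):
    return physical[0::2], physical[1::2]

line_a = [0] * 8
line_b = [1] * 8
physical = interleave(line_a, line_b)

# A single event upset flips two physically adjacent bits...
physical[4] ^= 1
physical[5] ^= 1

# ...but each logical line sees only a single-bit error, which
# per-line ECC (or parity plus recovery) can handle.
hit_a, hit_b = deinterleave(physical)
print(sum(x != y for x, y in zip(hit_a, line_a)))   # 1
print(sum(x != y for x, y in zip(hit_b, line_b)))   # 1
```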

Table 1. Itanium 2 processor evolution.

Attribute                        Itanium 2    Itanium 2 6M
Code name                        McKinley     Madison
Architecture                     EPIC         EPIC
CMOS process (nm)                180          130
Level-three cache (Mbytes)       3            6
Frequency (GHz)                  1            1.5
Supply voltage (V)               1.5          1.3
Maximum power dissipation (W)    130          130
Thermal design power (W)         100          107

[Figure 1. Processor block diagram. The figure shows the L1 instruction cache with its fetch/prefetch engine, instruction queue (8 bundles), branch prediction, and instruction TLB; IA-32 decode and control; 11 issue ports feeding branch (B), memory (M), integer (I), and floating-point (F) units; the register stack engine/remapping; branch and predicate registers; 128 integer and 128 floating-point registers; integer and multimedia units; floating-point units; the quad-port L1 data cache with ALAT, DTLB, and scoreboard, predicate, NaT, and exception logic; the L2 and L3 caches with their data and tag arrays; and the system bus (128-bit data, 6.4 Gbytes/s at 400 megatransfers/s) with its data and address paths. Shading marks parity-protected and ECC-protected structures.]

Figure 2 shows the die photo and highlights the location of the main functional blocks.3 The cache has three hierarchical levels. The L1 instruction and data caches are both 16-Kbyte, 4-way set associative. The L2 cache is a unified, 8-way, 256-Kbyte array, and the L3 is a unified, 24-way, set-associative, 6-Mbyte cache. The latencies of the first two cache levels are the same as in the previous generation. Interconnect constraints increased the L3 cache latency from 12 to 14 cycles. Table 2 summarizes the cache implementation details. The up to 50 percent increase in bandwidth for all caches comes from the higher core frequency.

[Figure 2. Itanium 2 processor die photo, showing the processor core, the L2 cache (256 Kbytes), the L3 tag and LRU blocks, the bus logic, the L3 cache (6 Mbytes), and the I/O pads.]
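The cache parameters above pin down the L3 address mapping: with 6-Mbyte capacity, 128-byte lines, and 24 ways, the cache has 2,048 sets, matching the 2,048-set LRU array described in the next section, and the bandwidth figures follow from bytes moved per cycle at 1.5 GHz. A short sketch of this back-of-envelope arithmetic (ours, not Intel's):

```python
# Back-of-envelope checks on the L3 cache parameters from Table 2.
capacity = 6 * 1024 * 1024      # 6-Mbyte L3 cache
line_size = 128                 # bytes per line
ways = 24

sets = capacity // (line_size * ways)
print(sets)                     # 2,048 sets

# Address split (offset | index | tag) for a physical address:
offset_bits = line_size.bit_length() - 1    # 7 bits select a byte in a line
index_bits = sets.bit_length() - 1          # 11 bits select a set
print(offset_bits, index_bits)

# 48 Gbytes/s of read bandwidth corresponds to one 256-bit (32-byte)
# chunk per cycle at the 1.5-GHz core clock.
print(32 * 1.5e9 / 1e9)         # 48.0 Gbytes/s
```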

L3 cache design
We built the L3 cache with 140 identical subarrays, tiled to fit the core's irregular shape. This use of subarrays optimizes die area utilization without constraining the core floorplan. Each subarray communicates directly with the bus unit and the L2 cache. Eliminating direct communication among the subarrays relaxes the clock skew constraints and makes tiling easier. A full cache line of 1,024 bits returns to the CPU in four consecutive cycles, starting with the critical 256-bit chunk. For a write, we collect all four chunks over a period of four clock cycles and write them at once in a single write operation.

Each subarray splits into two sub-banks, each containing 12 ways. A sub-bank further divides into eight 96 × 256 array blocks. A pair of these blocks shares a set of sensing units and write drivers. Using three address bits and a bank select signal, we read out 32 bits of data from a single block and send them on the data bus in 8-bit chunks over 4 clock cycles. All SRAM cells use greater-than-minimum-Leff transistors to reduce leakage. The SRAM bit cell area is 2.45 µm2.

The tag unit consists of the L3 tag cache and the L3 least-recently-used (LRU) blocks. The L3 tag has 32 tag bits, 3 state bits, and 7 error correction code bits. There are four subarrays in the L3 tag, each holding 6-way blocks. The L3 LRU array consists of 2,048 sets of 24 bits, twice as many as in the previous design. Each of the 24 bits in a set corresponds to the usage of a single cache way, and any access to a given way updates the corresponding LRU bit.
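A minimal model of the critical-chunk-first read return described at the start of this section: the 1,024-bit line comes back as four 256-bit chunks over four cycles, critical chunk first. The wrap-around order after the critical chunk is our assumption for illustration; the article specifies only that the critical chunk returns first:

```python
# Critical-chunk-first return order for a 1,024-bit L3 cache line.
# The line is delivered as four 256-bit (32-byte) chunks, one per cycle,
# starting with the chunk containing the requested address.
LINE_BYTES = 128
CHUNK_BYTES = 32
CHUNKS = LINE_BYTES // CHUNK_BYTES    # 4 chunks

def return_order(byte_offset_in_line):
    critical = byte_offset_in_line // CHUNK_BYTES
    # Wrap-around order after the critical chunk (our assumption).
    return [(critical + i) % CHUNKS for i in range(CHUNKS)]

print(return_order(0))     # [0, 1, 2, 3]
print(return_order(70))    # [2, 3, 0, 1] -- requested chunk first
```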

Table 2. Cache summary.

Attribute                    L1 instruction cache   L1 data cache         L2                   L3
Size                         16 Kbytes              16 Kbytes             256 Kbytes           6 Mbytes
Line size (bytes)            64                     64                    128                  128
No. of ways                  4                      4                     8                    24
Replacement policy           Least recently used    Not recently used     Not recently used    Not recently used
Latency (no. of cycles)      1                      Integer: 1; FP: NA    Integer: 5; FP: 6    14
Write policy                 NA                     Write through         Write back           Write back
Read bandwidth (Gbytes/s)    48                     24                    48                   48
Write bandwidth (Gbytes/s)   NA                     24                    48                   48

Power reduction
Figure 3 shows the power breakdown of the current design compared with that of its 180-nm predecessor. Both products fit in the same 130 W system power envelope. However, the current design has a 50 percent higher frequency, a 2× larger L3 cache, and a 3.5× increase in transistor leakage. To stay within the same power envelope, we had to reduce the active power from 90 to 74 percent of the total. Reaching this goal required aggressive dynamic-power management that focused on three areas: reduced clock loading and activity, lower contention power, and improved L3 cache power management.

The 180-nm design uses clocked deracers (as described by S. Naffziger et al.2) to fix min-delay problems by briefly blocking incoming paths while the receiving latch is open. To reduce the clock loading, we replaced about half of the approximately 50,000 deracers with static buffers. We also redesigned the dual-rail domino library cells to reduce their clock loading by sharing the clocked NMOS device among the complementary pull-down stacks. Strictly controlling precharge and discharge phase timing reduced contention power in footless domino gates.

We also identified additional clock-gating opportunities. Full-chip register-transfer-level (RTL) simulation of high- and low-power traces helped identify infrequently used logic in every functional unit. These portions of the circuit are potential candidates for clock gating. Switch-level simulation of the same traces at the functional-block level, combined with parasitic extraction data, helped determine the actual power savings available from each candidate. These ideas saved, on average, approximately 10 percent of the power, though the savings under maximum power were smaller. We built an RTL model that included the changes necessary for the clock-gating ideas, estimated the complexity of each change and its impact on schedule, and selected only the high return-on-investment changes for implementation.

Finally, we changed the L3 cache design to reduce the number of active subarrays necessary to access a cache line. In the prior implementation,4 each L3 cache subarray supplied two bits during a read or write access; this meant using four subarrays to access an 8-bit word. The new design revises the subarray access controls so that an entire 8-bit word is accessible from a single subarray. We automatically turn off unused subarrays to save power in the L3 cache on either a read or a write access. This redesign reduced the L3 active power by about 5 W.

[Figure 3. Power breakdown: Itanium 2 (a) and Itanium 2 6M (b). For the Itanium 2 (a): dynamic power, 90 percent; I/O power, 5 percent; all leakage (static), 5 percent. For the Itanium 2 6M (b): dynamic power, 74 percent; I/O power, 5 percent; core leakage (static), 7 percent; cache leakage, 14 percent.]
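Translating the Figure 3 percentages into watts makes the trade-off concrete. A small sketch of our own arithmetic on the published breakdown (using the 130-W maximum envelope; typical dissipation is lower):

```python
# Translating the Figure 3 power breakdown into watts within the
# shared 130-W envelope (our arithmetic on the published percentages).
envelope = 130.0    # W, same envelope for both generations

mckinley = {"dynamic": 0.90, "I/O": 0.05, "all leakage": 0.05}
madison  = {"dynamic": 0.74, "I/O": 0.05,
            "core leakage": 0.07, "cache leakage": 0.14}

for name, parts in (("Itanium 2", mckinley), ("Itanium 2 6M", madison)):
    for component, share in parts.items():
        print(f"{name:12s} {component:14s} {share * envelope:5.1f} W")

# Dynamic power drops from ~117 W to ~96 W, making room for the
# 3.5x increase in transistor leakage.
```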

Package and power delivery
Figure 4 is a photo of the processor package. The processor is flip-chip attached to a 12-layer organic ball-grid array (BGA) package with an integrated heat spreader; the package measures 42.5 mm on each side. The BGA attaches to an 8-layer, 611-pin interposer that houses additional components for server management functions and decoupling. Die-side interdigitated capacitors sit on the BGA substrate (underneath the heat spreader); these decouple the core supply and the front-side bus termination voltage. Ground return vias carry the image currents propagating in the reference planes. Placing these ground return vias within 1 mm of all signal vias minimizes inductive signal-return current loops. An edge connector delivers power to the processor interposer assembly and provides lower impedance than traditional power delivery using pins through the socket. A separate supply on the motherboard powers the I/O circuitry through the interposer pins.

[Figure 4. Package details: power delivery connector, server management components, flip-chip BGA package with integrated heat spreader, and interposer substrate.]

About 95 percent of the 7,877 die-level C4 bumps are for power delivery. The design includes a total decoupling capacitance of 670 nF located in the proximity of high-di/dt switching circuits and in the routing channels. We used accumulation-mode decoupling capacitors in preference to inversion-style capacitors to reduce the gate leakage by greater than 15× with only a 10 percent impact on the effective decoupling capacitance.

Front-side bus
The front-side bus (FSB) uses the GTL+ (Gunning Transceiver Logic) signaling scheme with a nominal voltage swing from 0.4 to 1.2 V to implement a 128-bit-wide, glueless, 4-way multiprocessor bus.5 The data bus uses a source-synchronous, double-pumped design that achieves 400 megatransfers/s (1 megatransfer/s = 1 Mbyte/s/pin), or an equivalent bandwidth of 6.4 Gbytes/s. The address and control lines use a 200-MHz common clock scheme. The input buffer uses a differential amplifier. For data signals, the differential amplifier compares the incoming data signal with a reference voltage generated on die. Centered on 750 mV, the reference voltage adjusts from 600 to 900 mV in 20-mV increments for marginality testing. The differential strobes also use the same differential amplifier.

The impedance and slew rate of the I/O buffers have digital compensation to account for process, voltage, and temperature variations. A single sampling point suffices to guide both the N and the P impedance compensation, thereby requiring only one external resistor. The compensation scheme can operate in either single-shot or continuous-update mode. To improve testability, the design also incorporates an I/O compensation register, which can read or override the impedance and slew bit settings generated by the compensation unit. To accurately measure pin timing parameters, the I/O circuit incorporates a built-in self-test called I/O loop back, which tests the pin timing by stressing the loop time from the output flip-flop to the input flip-flop through the pad.

Test, debug, and manufacturability features
The processor includes a large set of test, debug, and manufacturability features, including partial scan, observability registers (scan out), and programmable trigger generators. All the large memory arrays are directly accessible for testing from the external pads. Table 3 compares design-for-test and design-for-manufacturability features across the Itanium processor family. Enhanced manufacturability features include increasing the L3 cache repair from dual to quad redundancy and introducing the programmable weak-write test mode. Given the large on-die cache, both of these features increase the ability to identify and repair defects.

Table 3. Design-for-test and design-for-manufacturability feature summary.

Feature                                             Itanium      Itanium 2      Itanium 2 6M
Scan coverage (thousands)                           48           140            140
Scan-out coverage (thousands)                       5.5          24             24
Supports cache direct-access testing mode           Yes          Yes            Yes
L3 redundancy and repair                            NA           Dual           Quad
Weak-write test mode                                Fixed        Fixed          Programmable
Design-for-test I/O-loop-back capability            Basic        Limited        Enhanced
Dynamic frequency adjustment (shrink or stretch)    Multicycle   Single cycle   Multicycle
On-die processor monitors                           No           No             Yes

We also enhanced the I/O loop-back feature to enable better control and testing of the front-side bus. Features that enable better frequency validation and testing include multicycle dynamic frequency adjustment and regional clock tuning. Strategically located across the die, on-die process monitors provide critical feedback for frequency and marginality validation. These monitors consist of a collection of circuits that are individually sensitive to certain process parameters, such as Vt, IDsat, and channel length. They also monitor environmental variables, such as voltage and temperature. The monitors can track the variation in these parameters very precisely under a variety of testing and usage conditions. Designers located them in areas of interest, such as hot thermal zones, and near speed-critical functional units.
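Returning to the front-side bus receiver described above: the Vref marginality sweep spans 600 to 900 mV in 20-mV steps. A quick enumeration of the settings (our arithmetic; the control-field width is our inference, not stated in the article):

```python
# Reference-voltage sweep used for FSB receiver marginality testing:
# 600 mV to 900 mV in 20-mV increments around the 750-mV nominal.
low_mv, high_mv, step_mv = 600, 900, 20    # values from the article

settings = list(range(low_mv, high_mv + 1, step_mv))
print(len(settings))                                    # 16 settings
print(settings[0], settings[1], "...", settings[-1])    # 600 620 ... 900

# 16 settings fit in a 4-bit control field (our inference).
assert len(settings) <= 2 ** 4
```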


Figure 5. Itanium 2 processor 6M performance for enterprise applications.

SPECint base2000, 1 processor: Best RISC 1,077 (IBM eServer pSeries 690 with 1.7-GHz Power4+; source: www.spec.org); Itanium 2 6M 1,322 (HP Integrity Server rx2600 with one Itanium 2 6M at 1.5 GHz, 6-Mbyte L3 cache, HP-UX OS, and compilers).

SPECfp base2000, 1 processor: Best RISC 1,598 (IBM eServer pSeries 690 with 1.7-GHz Power4+; source: www.spec.org); Itanium 2 6M 2,119 (HP Integrity Server rx2600 with one Itanium 2 6M at 1.5 GHz, 6-Mbyte L3 cache, and Red Hat Linux AS2.1 operating system; submitted to SPEC).

TPC-C, 64 processors: Best RISC 768,839 tpmC (IBM eServer pSeries 690 Turbo 7040-681, with 32 IBM Power4+ processors at 1.7 GHz, running IBM AIX 5L V5.2 with IBM DB2 UDB 8.1 and 512-Gbyte RAM; TPC-C availability date: 29 February 2004); Itanium 2 6M 1,008,144 tpmC (HP Integrity Superdome with 64 Intel Itanium 2 6Ms, each at 1.5 GHz with 6-Mbyte L3 cache, running HP-UX 11i v2 64-bit with Oracle Database 10G Enterprise Edition and 1,024-Gbyte RAM; TPC-C availability date: 14 April 2004; source: www.tpc.org).

TPC-C price/performance, 64 processors: Best RISC $8.55/tpmC; Itanium 2 6M $8.33/tpmC (lower is better).

TPC-C, 4 processors: Best RISC 56,375 tpmC (HP AlphaServer ES45 with four Alpha processors at 1.25 GHz and 32-Gbyte memory; TPC-C availability date: 27 September 2002); Itanium 2 6M 136,111 tpmC (HP Integrity Server rx5670 with four Itanium 2 6Ms at 1.5 GHz, each with a 6-Mbyte L3 cache, 9-Gbyte memory, Red Hat Enterprise Linux Advanced Server 3, and Oracle Database 10g Enterprise Edition; TPC-C availability date: 5 March 2004; source: www.tpc.org).

TPC-C price/performance, 4 processors: Best RISC $9.44/tpmC; Itanium 2 6M $4.09/tpmC (lower is better).

SAP SD two-tier, 4 processors: Best RISC 420 SD users (AlphaServer ES45 with four Alpha 21264 processors at 1,000 MHz; source: www.sap.com/benchmark); Itanium 2 6M 860 SD users (HP Server rx5670 with four Itanium 2 6Ms at 1.5 GHz, each with a 6-Mbyte L3 cache, 24-Gbyte memory, HP-UX 11i, SAP rev 4.6C, and Oracle 9i database).

SPECweb99 SSL, 2 processors: Best RISC 1,008 (SunFire 280R with two UltraSparc III Cu processors at 1.2 GHz, each with an 8-Mbyte off-chip L2 cache, Solaris 9, Sun ONE Web Server 6.0 SP5, and 32-Gbyte RAM; published in April 2003); Itanium 2 6M 1,930 (HP Server rx2600 with two Itanium 2s at 1.5 GHz with 6-Mbyte L3 cache, 12-Gbyte memory, HP-UX, and Zeus 4.2r2; submitted to SPEC; source: www.spec.org).

SPECjbb2000, 4 processors: Best RISC 96,377 operations/s (IBM eServer pSeries 655 with four Power4+ processors at 1.7 GHz, 16-Gbyte memory, AIX 5L V5.2 APAR IY43549, and JVM J2RE 1.4.1 IBM AIX build cadev-20030410); Itanium 2 6M 116,466 operations/s (HP Server rx5670 with four Itanium 2 6Ms at 1.5 GHz, each with a 6-Mbyte L3 cache, 16-Gbyte memory, HP-UX 11i v2.0, and Hotspot JVM 1.4.2.00; source: www.spec.org).

Performance benchmarks
Figure 5 shows the Itanium 2 processor 6M's performance across enterprise benchmark applications. Currently, Itanium 2-based systems with one to 64 processors display superior results on a wide range of benchmarks, including TPC-C, SAP, and SPECjbb2000. The Itanium 2 processor 6M outperforms the fastest RISC processor with the same number of processors in a configuration. For TPC-C applications, 4-way Itanium 2-based systems achieve roughly 150 percent better performance at a 50 percent lower cost than the fastest RISC.
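These claims can be cross-checked against the Figure 5 data; the "150 percent better at 50 percent lower cost" statement rounds the following ratios (our arithmetic on the published numbers):

```python
# Checking the 4-way TPC-C comparison against the Figure 5 data.
risc_tpmc, itanium_tpmc = 56_375, 136_111
risc_cost, itanium_cost = 9.44, 4.09        # dollars per tpmC

perf_gain = itanium_tpmc / risc_tpmc - 1
cost_drop = 1 - itanium_cost / risc_cost

print(f"performance: {perf_gain:.0%} better")   # ~141% better
print(f"price/perf:  {cost_drop:.0%} lower")    # ~57% lower
```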

Figure 6 presents the Itanium 2 processor 6M results for high-performance and technical computing, speeding up the complex analysis of vast amounts of data and providing faster simulation and render times for very large models. The advanced EPIC architecture, along with the high-bandwidth front-side bus and the large on-die L3 cache, lets the Itanium 2 processor 6M outperform all other comparable microprocessors. The benchmarks in Figures 5 and 6 represent the latest data available as of this writing.


Figure 6. Itanium 2 processor 6M performance for high-performance and technical computing applications.

SPECfp_rate base2000, 32 processors: Best RISC 405 (AlphaServer GS1280 7/1150 with 32 Alpha 21364 processors at 1.15 GHz, Tru64 UNIX V5.1B (Rev. 2650) +IPK OS, and 256-Gbyte memory; source: www.spec.org); Itanium 2 6M 644 (SGI Altix 3000 with 32 Itanium 2 6Ms at 1.5 GHz, SGI ProPack OS, and 64-Gbyte memory; submitted to SPEC*).

Linpack 1,000, 1 processor: Best RISC 3.884 Gflops (IBM eServer pSeries 655 with one Power4+ processor at 1.7 GHz; source: http://www-1.ibm.com/servers/eserver/pseries/hardware/system_perf.pdf); Itanium 2 6M 5.285 Gflops (HP Integrity Server rx2600 with one Itanium 2 6M at 1.5 GHz).

Linpack HPC, 32 processors: Best RISC 143.3 Gflops (IBM eServer pSeries 690 with 32 Power4+ processors at 1.7 GHz; source: http://www-1.ibm.com/servers/eserver/pseries/hardware/system_perf.pdf); Itanium 2 6M 172.3 Gflops (SGI Altix 3000 with 32 Itanium 2 6Ms at 1.5 GHz and 256-Gbyte RAM*).

NAS Parallel, 32 processors: Best RISC 9.5 Gflops (IBM p690 with 32 Power4s at 1.3 GHz; sources: www.nas.nasa.gov/Software/NPB/ and http://www.csm.ornl.gov/~dunigan/sp4/index.html#mpi); Itanium 2 6M 17.2 Gflops (SGI Altix 3000 with 32 Itanium 2 processors at 1.5 GHz*).

NWChem, 32 processors: Best RISC 510 seconds; Itanium 2 6M 320 seconds (lower is better) (IBM p655 with 32 Power4s at 1.3 GHz, source: http://www.emsl.pnl.gov:2080/docs/nwchem/nwchem.html; SGI Altix 3000 with 32 Itanium 2 6Ms at 1.5 GHz*).

Star-CD, 32 processors: Best RISC 274 seconds; Itanium 2 6M 189 seconds (lower is better) (IBM p655 with 32 Power4s at 1.3 GHz, source: http://www.cd-adapco.com/support/bench/315/aclass.htm; SGI Altix 3000 with 32 Itanium 2 6Ms at 1.5 GHz*).

MM5, 32 processors: Best RISC 13.7 Gflops (IBM p690 with 32 Power4s at 1.3 GHz; sources: http://www.mmm.ucar.edu/mm5/mm5-home.html and http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20020218.html); Itanium 2 6M 24.1 Gflops (SGI Altix 3000 with 32 Itanium 2 6Ms at 1.5 GHz*).

Gaussian, 32 processors: Best RISC 406 seconds; Itanium 2 6M 180 seconds (lower is better) (IBM p655 with 32 Power4s at 1.3 GHz, source: http://publib-b.boulder.ibm.com/Redbooks.nsf; SGI Altix 3000 with 32 Itanium 2 6Ms at 1.5 GHz*).

* Source: www.sgi.com/newsroom/press_releases/2003/june/altix_benchmarks.pdf.

We will continue to push the Itanium 2 processor to higher frequencies and larger cache sizes. We expect to reach a 1.7-GHz frequency and a 9-Mbyte cache this year. Looking further ahead, we will implement a dual-core design with even larger caches next year.

Acknowledgment
This article presents the work of a talented and dedicated team, and we are privileged to represent their work. We also gratefully acknowledge our inheritance of the core microarchitecture and circuits from the McKinley team.

References
1. S. Rusu and G. Singer, "The First IA-64 Microprocessor," IEEE J. Solid-State Circuits, vol. 35, no. 11, Nov. 2000, pp. 1539-1544.
2. S. Naffziger et al., "The Implementation of the Itanium 2 Microprocessor," IEEE J. Solid-State Circuits, vol. 37, no. 11, Nov. 2002, pp. 1448-1460.
3. S. Rusu et al., "A 1.5-GHz 130-nm Itanium 2 Processor with 6-MB On-Die L3 Cache," IEEE J. Solid-State Circuits, vol. 38, no. 11, Nov. 2003, pp. 1887-1895.
4. D. Weiss et al., "The On-Chip 3MB Subarray Based 3rd Level Cache on an Itanium Microprocessor," ISSCC Digest of Technical Papers, IEEE Press, 2002, pp. 112-113.
5. H. Muljono et al., "A 400MT/s 6.4GB/s Multiprocessor Bus Interface," IEEE J. Solid-State Circuits, vol. 38, no. 11, Nov. 2003, pp. 1846-1856.

Dual Itanium 2 Processor Server Platform

Figure A shows a dual-processor, 2U rack-mount, high-density server (Intel SR870BH2). The chassis has an easily serviceable modular design and supports the hot swapping of fans, hard drives, and power supplies. The electronics bay includes the main board, the Peripheral Component Interconnect (PCI) riser subassembly, and airflow ducting for processor and memory cooling. A fan bay holding six hot-swappable, speed-controlled fans provides cooling. Five fans are necessary for proper cooling; an additional fan provides redundancy. The peripheral bay includes a Small Computer System Interface (SCSI) backplane board, two Ultra320 SCSI hot-swappable hard-drive bays, and a media bay that holds one slim-line CD-ROM, CD read-write, or DVD-ROM drive.

The SR870BH2 platform's power subsystem has a 650 W total output capacity, which comes from three 350 W power supply modules, two for power and one for redundancy. The power supplies include over-voltage, over-current, over-temperature, and short-circuit protection. Dual power cords provide AC input redundancy.

[Figure A. SR870BH2 chassis.]

Figure B shows the SR870BH2 main board block diagram. The chipset north bridge includes an E8870 scalable node controller (SNC) that controls the memory subsystem. The board also includes four E8870DH memory repeater hubs with double data rate (MRH-D) that each bridge between the SNC and two double-data-rate (DDR) memory channels, allowing for a total of eight dual in-line memory module (DIMM) slots supporting both 200- and 266-MHz DDR. Fully populated with 2-Gbyte DIMMs, the main board can accommodate up to 16 Gbytes of memory.

The chipset south bridge includes an 82870P2 64-bit PCI-X (extended PCI) controller hub (P64H2), an 82801DB I/O controller hub (ICH4), and an 82802AC firmware hub (FWH). The ICH4 includes integrated drive electronics and Universal Serial Bus controllers and legacy I/O support; it also provides many power management functions.

The main board includes two Socket-K processor sockets that support up to two Itanium 2 processors with a 400-megatransfers/s front-side bus with parity protection (on address and control signals) and ECC protection (on data signals). New, low-profile processor power pods fit into the slim 2U form factor.

The main board integrates an ATI Rage XL video controller, which includes 8 Mbytes of SDRAM and supports automatic disable if a user installs a PCI video board. Video connections are available on the front and back panels. For networking, this server uses an Intel 82546EB dual-channel network interface card, which provides support for full-duplex 10baseT, 100baseT, or 1000baseT Ethernet; wake- or alert-on-LAN technologies; and remote management.

An LSI Logic 53C1030 Ultra320 SCSI controller provides SCSI support. This controller supports dual SCSI channels, with channel A driving an external Very High Density Cabled Interconnect (VHDCI) connector and channel B driving an internal 68-pin connector. Both channels support low-voltage-differential signaling. The SCSI controller and network interface card use 64-bit, 133-MHz PCI-X bus mastering to assure high data throughput rates.

The SR870BH2 main board includes three debug connectors. The north and south bridges each have a port 80 card header; the main board also includes an in-target probe port.

[Figure B. SR870BH2 block diagram, showing the two Itanium 2 processors on the 6.4-Gbytes/s FSB; the SNC-M with four MRH-D memory hubs and FWH; the SIOH (1 Gbyte/s HL 2.0 links) with P64H2 controller hubs on PCI-X 133 and PCI-X 100 buses; the ICH4 with USB ports, ATA 100 IDE, PCI 33, VGA, and LPC Super I/O with serial ports and FWH; the dual Gigabit Ethernet LAN controller; the Ultra320 SCSI drives; and the baseboard management controller (BMC).]
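As a sanity check on the memory subsystem description above, the DIMM count and maximum capacity follow from the hub and channel counts; the one-DIMM-per-channel figure below is our inference from the stated eight-slot total:

```python
# Sanity check on the SR870BH2 memory subsystem described above.
mrh_d_hubs = 4          # memory repeater hubs (MRH-D)
channels_per_hub = 2    # DDR channels per hub
dimms_per_channel = 1   # our inference: 8 slots across 8 channels

dimm_slots = mrh_d_hubs * channels_per_hub * dimms_per_channel
max_memory = dimm_slots * 2     # Gbytes, with 2-Gbyte DIMMs

print(dimm_slots, "DIMM slots")    # 8
print(max_memory, "Gbytes max")    # 16
```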


Stefan Rusu is a senior principal engineer at Intel's Enterprise Processor Division, leading the technology and special-circuits design group for Itanium. His research interests include high-speed clocking, power distribution, I/O buffers, and low-power and high-speed circuit design techniques. Rusu has an MSEE from the Polytechnic Institute in Bucharest, Romania. He is a senior member of the IEEE and has served on the technical program committee for the European Solid-State Circuits Conference for many years. He holds 17 US patents.

Harry Muljono is a senior staff engineer at Intel. His research interests include I/O design and validation. Muljono has a BS in electrical engineering from the University of Portland and an MEng in electrical engineering from Cornell University. He holds 16 patents.



Brian Cherkauer is a design engineer at Intel’s Enterprise Processor Division, working on next-generation Itanium processors. His research interests include power estimation and power reduction techniques. Cherkauer has a BSEE from the State University of New York, Buffalo, and a PhD in electrical engineering from the University of Rochester.

Direct questions and comments about this article to Stefan Rusu, Intel, 2200 Mission College Blvd., M/S SC12-506, Santa Clara, CA 95052; [email protected].

For more information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib/.
