CELL PROCESSOR LOW-POWER DESIGN METHODOLOGY

Power consumption is a major challenge in VLSI design. Power-constrained designs must attack power reduction with many techniques and require tools to accurately predict the power consumption. These tools give designers feedback on the efficiency of the power management logic. Designing the first-generation Cell processor called for advanced reduction techniques and cycle-accurate power estimation. Correlations between hardware measurements and power estimates showed the effectiveness of this strategy.

Daniel Stasiak Rajat Chaudhry Dennis Cox Stephen Posluszny Jim Warnock Steve Weitzel Dieter Wendel Michael Wang

Power consumption is one of the biggest challenges in VLSI design. In the past, power has mainly been a concern for chips used in battery-powered devices. However, due to the continuous increases in frequency and leakage current that come from geometry scaling, power consumption is becoming a constraint even in wall-socket-powered devices. Packaging and thermal cooling costs are the biggest drivers for reducing power in such chips, especially chips manufactured in large quantities for price-sensitive products. Here, we present the basic methodology behind cycle-accurate power estimation. This forms a basis for explaining the techniques used to reduce power in the first-generation Cell processor [1], along with data that correlates our hardware measurements against power estimates.

Tight power constraints require designers to do more in VLSI designs to reduce power consumption. The Cell processor’s design includes a variety of circuit, architecture, and global techniques for power reduction.

0272-1732/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.

AC power reduction

The so-called AC component of power dissipation refers to the active power that the processor consumes (over and above the no-load, passive level) while executing an input workload. Roughly, we can break this down further into the following subcomponents:

• dynamic switching power dissipated (in the resistive paths of the circuits carrying current) because of the alternate charging and discharging of on-chip capacitances;
• short-circuit current incurred by a switching circuit, because of very brief periods within a switching event when both the pull-up PMOS device and the pull-down NMOS device can be on, creating spurious conducting paths from the supply voltage rail (VDD) to the ground rail (Gnd); and
• glitching power caused by spurious internal-device switching within combinational logic between successive clock pulses, before the logic computation has stabilized.

Of these, the first component is usually dominant; as such, the other components of active or AC power are often ignored.

ENERGY-EFFICIENT DESIGN

Architecture- and global chip-level reductions

Early architecture and global design decisions have a large impact on the final power of the design and can limit further possible reductions. The design team made several important decisions to reduce power on the Cell processor. For one, the design makes extensive use of static logic, reserving power-hungry dynamic logic for custom-designed arrays (as in caches and register file macros) and programmable logic arrays (for control logic functions) used in areas where performance was critical.

In addition, the power delivery to the chip must be at the lowest possible voltage to save power. When the sum of power supply tolerance, AC noise in the board and package, and internal chip current-resistance (IR) drop is a high percentage of VDD (the delivered supply voltage), the voltage V at the input C4 pins (later referred to as C4 for brevity) must increase to maintain a minimum voltage at the circuit for the desired performance. When the voltage at C4 increases, chip power rises by a factor that corresponds roughly to $V^{3.6}$, a characteristic observed through empirical measurements in the laboratory after chip bring-up (see also the discussion toward the end of this article, in the Results subsection). In other words, this exponent reflects the combined, experimentally observed effect of all components of active and passive power for this particular design. To further reduce power, the external power supply, package, and chip power distribution meet aggressive targets for tolerances, AC noise, and IR drop noise.

One of the clock distribution design goals was to significantly reduce power dissipation relative to previous designs [2, 3]. Early studies showed that three main categories of capacitance (clock load, lower-level twig and mesh wires, and grid clock buffers) determined much of the power dissipated by clock distribution. We thus monitored clock load, consisting mostly of macro gate capacitance and minor local-clock wires, for excessive capacitance. Clock twig wires, connecting clock loads to the clock grid, had widths tuned within reserved physical regions, thereby decreasing both their area and perimeter-wire capacitance components. For the chip, wire width tuning decreased clock twig wire capacitance by as much as 42 percent without sacrificing local-clock skew. Less clock load, smaller clock twig wire capacitance, and reduced lower-level grid wire capacitance allowed a reduction in grid buffer drive strengths. The Cell design achieved a further reduction in grid buffer drive strength by matching one of seven buffer drive strengths to the local grid and clock load capacitances. Together, these techniques lowered clock distribution power dissipation by more than 20 percent relative to previous designs [2].

The Cell processor also has low-power states included in its architecture and controlled by the operating system. The operating system can apply clock gating to individual processor cores and large sections, using various pause modes. Designers also provided a slow-mode power state that reduces the global-clock grid frequency to between one-half and one-tenth of full operation, lowering power while maintaining function at a lower performance.

IEEE MICRO

Circuit power reduction

As frequency increases, logic depth between latches decreases, and the number of latches and latch density increase. Latches thus account for more of the chip power in the Cell processor than in previous designs. A rich set of latches helps balance the conflicting demands of low-power flip-flops and low latency. The design uses pulse latches to minimize power. A standard flip-flop, as shown in Figure 1, can run in pulsed mode. In this case, the slave clock is pulsed while the master clock is held high. A “chicken switch” returns the latch to normal operation if race problems occur in hardware. Designers modified the local-clock buffer circuit to generate the pulse clock, as Figure 2 shows. The design also supports a nonscannable pulsed latch to lower power when longer hold times are allowable. Designers targeted the design of the local-clock buffer circuit to have minimal loading on the global-clock grid.

Clock gating. Clock gating is an important means for lowering AC power, and designers used it pervasively on the Cell processor to turn off clocks when latches do not require reloading. Clock gating occurs at a fine-grained level: each local-clock buffer has an independent clock-gating pin, such as clockgate_b in Figure 2.

Logic tuning. Designers used logic tuning tools to reach timing goals and reduce the total device width, which in turn reduces power. Custom logic macros used these design steps:

• Implement the design with the minimum number of devices (transistors) necessary to verify functionality.
• Run tuning to achieve required timing with accurate wire estimates.
• Run the tuner to reduce device width (area) with no increase in timing delay on the most timing-critical path.

Figure 1. Standard master-slave flip-flop design. (Schematic not reproduced; labeled signals include d, q, scan_in, scan_out, lclk, d1clk, and d2clk.)


DC power reduction

Designers reduced leakage power at global chip and circuit levels. The Cell processor does not use low-threshold-voltage (low-VT) devices, however. Rather, it employs an adaptive power supply to reduce the leakage power. Array memory cells and logic run at half the full chip frequency and use high-threshold-voltage (high-VT) devices. Medium-oxide decoupling capacitors reduce leakage power compared with thin-oxide device capacitances, and they provide higher capacitance density than thick-oxide device capacitances alone. Designers attacked leakage by removing function, optimizing for area, and tuning circuits to reduce device width.

Power estimation methodology

Designers also require power estimation tools that provide accurate estimates and timely feedback on the efficiency of their power management design. As the margin of error in power budgets decreases, the need for power estimation under dynamic operating conditions (workloads) increases. Such dynamic power estimation is required not only to quickly verify power management logic; it is also necessary because static estimates are too pessimistic. It is important to concentrate on realistic high-power test cases. Hardware-based thermal management solutions can handle pathologically high-power workloads by reducing the frequency or shutting down the system.

Figure 2. Clock pulsed-mode circuit for a flip-flop. (Schematic not reproduced; labeled signals include the global clock, clockgate_b, testhold_b, lclk, d1clk, and d2clk, and one portion is marked “Not used for nonscan latches.”)

NOVEMBER–DECEMBER 2005


Figure 3. Capet flow at different stages of design: early (a) and as the design matures (b). (Flow diagram not reproduced. In (a), net capacitance from Steiner estimates, area-based macro power models, and RTL simulation feed power analysis, which produces power data. In (b), net capacitance comes from 3D extraction and the macro power models are schematic based.)

Previous work

Practitioners have estimated power at various levels of abstraction. They obtain the most accurate estimates by running a Spice-like simulator on a transistor-level netlist, a strategy that is feasible only for a small circuit. Many researchers have worked on building accurate power models for design macros [4]. (A macro, in this context, refers to a smaller subcircuit component within a large unit inside a chip design.) Such power models are not very useful in estimating total chip power unless we have accurate information on the switching and clocking activity of the macros. Much work has also been done on measuring power at the architectural level [5]. Although architectural-level solutions are very fast, they are not very accurate and cannot give feedback on clock gating at the register-transfer level (RTL). The design of application-specific ICs has employed gate-level power estimation methodologies. Such methods do not work well for chips with many custom blocks.

In the first-generation Cell processor, designers have employed extensive fine-grained clock gating. They needed a tool to provide accurate power estimates as well as information on the power trade-off of adding clock-gating logic. The Cell processor consists of custom blocks and synthesized logic. The cycle-accurate power estimation tool (Capet) combines the benefit of transistor-level macro power models with switching and clocking activity from RTL simulations. Capet also models the power consumed in switching signal interconnect capacitance.

Capet approach

Capet estimates the AC power, which it defines as the power dissipated by the switching of node capacitances because of circuit activity. The switching power of a circuit is

$$P = \frac{1}{2} C V^{2} f \tag{1}$$

where C is the node’s total switched capacitance, V is the power supply voltage, and f is the frequency of switching. For a circuit design, V and f are generally fixed. To reduce power consumption, designers work on reducing the switched capacitance. In a power-efficient processor like the Cell, designers employ extensive fine-grained clock gating to reduce power. This reduces the switched capacitance by limiting the amount of switching in the latches and any unnecessary switching of combinational logic. The key inputs to power estimation, therefore, are the switching and clocking activity in a design.

The Capet methodology consists of monitoring the switching and clock activity of each macro in an RTL simulation and then applying that information to a transistor-level macro model to estimate the power. Capet repeats this process for each clock cycle to produce a cycle-by-cycle power estimate. Capet also estimates the power consumed by switching interconnect capacitance by monitoring the switching of each global net. It includes the gate capacitance of buffers on the net as part of interconnect capacitance.

The minimum requirements for running Capet are a functional VHDL model and a chip floorplan. In the early phase of design, designers can roughly estimate macro power models using macro area, with net capacitance calculated from Steiner estimates. This way, designers can run Capet early in the design for feedback, with a working VHDL (RTL) model, before detailed circuit schematics allow transistor-level analysis of the power characteristics of each macro. As Figure 3 shows, the estimates become more refined as the design progresses.
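As a small illustration of Equation 1, the sketch below evaluates the switching power for a node and shows the effect of reducing switched capacitance at fixed V and f. The capacitance, voltage, and frequency values are hypothetical placeholders chosen for easy arithmetic; they are not Cell processor data.

```python
def switching_power(c_farads, v_volts, f_hertz):
    """Equation 1: P = 1/2 * C * V^2 * f, returned in watts."""
    return 0.5 * c_farads * v_volts ** 2 * f_hertz


# Hypothetical node: 1 nF of switched capacitance at 1.0 V and 1 GHz.
baseline = switching_power(1e-9, 1.0, 1e9)

# With V and f fixed, halving the switched capacitance (for example,
# through clock gating) halves the switching power.
gated = switching_power(0.5e-9, 1.0, 1e9)

print(f"baseline: {baseline:.3f} W, gated: {gated:.3f} W")
```

Because power is linear in C, any reduction in switched capacitance translates directly into the same fractional reduction in this component of power.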

Macro power models

The node switching in a circuit block in a given cycle is proportional to the percentage of inputs switching and the amount of clock activity in the circuit. Our methodology assumes the switching factor of a circuit to be the percentage of inputs that change state between two consecutive clock cycles. Clock activity is the percentage of capacitive load driven in a given cycle with respect to the total clock load on the circuit. We build our macro power models at the transistor level using IBM’s Common Power Analysis Methodology (CPAM) tool [6]. CPAM runs pseudorandom vectors with different switching factors on the schematic diagram of the circuit; it covers two conditions: local-clock buffers off and on. The power model assumes that power is linear with switching factor and clock activity. CPAM provides power information at simulated switching factors. If, for a given clock cycle, SF is the switching factor of a circuit block and CLK is the clock activity of the circuit, the power consumed by the circuit in a cycle C is

$$P(C) = P_{clk0}(SF) + \left[ P_{clk100}(SF) - P_{clk0}(SF) \right] \times CLK$$

where $P_{clk0}(SF)$ is the power at input switching factor SF when clock activity is 0 percent, and $P_{clk100}(SF)$ is the power at input switching factor SF when clock activity is 100 percent. Figure 4 shows the two curves for $P_{clk0}$ and $P_{clk100}$. Currently we simulate macros only at SFs of 0 and 50 percent; therefore, our current models are purely linear with respect to SF. In the early design phase, we estimated the two power curves $P_{clk0}$ and $P_{clk100}$ using the area of the macro, deriving a power density by scaling from a previous technology or from similar macros that had complete schematic diagrams.

Figure 4. Power as a function of switching factor at 0 and 100 percent clock activity. (Plot not reproduced; it shows power P versus switching factor SF, with one curve for Pclk0 and one for Pclk100.)

Global signal net power

The macro power models characterize the switching power inside macros. To estimate the power consumption caused by the switching of global interconnect capacitance, we calculated the interconnect capacitance using Steiner estimates in the early design phase. Later in the design phase, we used a 3D extraction tool. Equation 1 gives the power consumed in switching global nets. We estimated the power consumed by buffers inserted on global nets by adding the gate capacitance of the buffers to the net capacitance. This approach neglects the shoot-through (or short-circuit) current from VDD to Gnd while the buffer is switching. Our experiments show that this shoot-through current is a negligible part of switching power.
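The macro power model and the global net term described above can be sketched as follows. This is a minimal illustration, not Capet or CPAM code: the function names, the tuple layout for the two characterized points (SF of 0 and 50 percent), and all numeric values are assumptions made for the example.

```python
def macro_power(sf, clk, p_clk0, p_clk100):
    """Per-cycle macro power from the linear model in the text.

    sf       -- input switching factor as a fraction (0.0 to 1.0)
    clk      -- clock activity as a fraction (0.0 to 1.0)
    p_clk0   -- (power at SF=0, power at SF=0.5) with clock activity 0%
    p_clk100 -- (power at SF=0, power at SF=0.5) with clock activity 100%
    """
    def at_sf(points):
        p_sf0, p_sf50 = points
        # Straight line through the two simulated switching factors.
        return p_sf0 + (p_sf50 - p_sf0) * (sf / 0.5)

    low = at_sf(p_clk0)      # Pclk0(SF)
    high = at_sf(p_clk100)   # Pclk100(SF)
    return low + (high - low) * clk


def net_power(c_net_farads, v_volts, f_hertz):
    """Equation 1 applied to switched global net capacitance.

    Buffer gate capacitance is assumed to be already added to c_net_farads.
    """
    return 0.5 * c_net_farads * v_volts ** 2 * f_hertz


# Hypothetical characterization data for one macro, in watts.
P_CLK0 = (0.010, 0.030)
P_CLK100 = (0.050, 0.090)

example = macro_power(sf=0.25, clk=0.4, p_clk0=P_CLK0, p_clk100=P_CLK100)
print(f"macro power: {example:.4f} W")
print(f"net power:   {net_power(1e-10, 1.0, 1e9):.4f} W")
```

With only two simulated switching factors, the sketch simply extends the same straight line for SF values above 50 percent; a model characterized at more points would interpolate between them instead.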

Measuring switching factors and clock activity

To calculate the total power, we monitor input switching factors and clock activity for each block instance in an RTL simulation for a given workload. For every macro instance, we calculated the input switching factor by observing the percentage of inputs that have changed state from the previous cycle. Clock activity comes from observing the number of clock buffers that are on in a given cycle. Since power depends on switching factor and clock activity, we must calculate it for every cycle of the simulation. (Averaging the switching factor and clock activity over the entire simulation would result in unacceptably large inaccuracy.) In a given cycle, we estimate the total power by using the following overall equation:

$$\mathrm{TotalPower}(C) = \sum \mathrm{MacroPower}(SF, CLK) + \frac{1}{2} C_{net} V^{2} f$$

where $C_{net}$ is the amount of switched global net capacitance.
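The per-cycle bookkeeping can be sketched as below. The helper names, the load weights, the input traces, and the power numbers are all hypothetical; the sketch only illustrates how switching factor and clock activity feed the macro model each cycle, and gives one possible illustration of why averaging the activity first gives a different answer under this model.

```python
def switching_factor(prev_inputs, curr_inputs):
    """Fraction of input bits that changed state since the previous cycle."""
    changed = sum(1 for a, b in zip(prev_inputs, curr_inputs) if a != b)
    return changed / len(curr_inputs)


def clock_activity(buffers_on, load_weights):
    """Fraction of the total clock load driven in this cycle."""
    driven = sum(w for on, w in zip(buffers_on, load_weights) if on)
    return driven / sum(load_weights)


def macro_power(sf, clk, p_clk0, p_clk100):
    """Linear macro model: Pclk0(SF) + [Pclk100(SF) - Pclk0(SF)] * CLK."""
    low = p_clk0[0] + (p_clk0[1] - p_clk0[0]) * (sf / 0.5)
    high = p_clk100[0] + (p_clk100[1] - p_clk100[0]) * (sf / 0.5)
    return low + (high - low) * clk


def total_power(macro_powers, c_net, v, f):
    """Sum of macro powers plus the switched global net term for one cycle."""
    return sum(macro_powers) + 0.5 * c_net * v ** 2 * f


# Hypothetical macro: characterized powers in watts, two local-clock buffers.
P_CLK0, P_CLK100 = (0.010, 0.030), (0.050, 0.090)
LOAD_WEIGHTS = [3, 1]

# Hypothetical RTL trace: input bits per cycle, then buffer states per cycle.
input_trace = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
buffer_trace = [[True, True], [False, False]]

sfs, clks, powers = [], [], []
for prev, curr, buffers in zip(input_trace, input_trace[1:], buffer_trace):
    sf = switching_factor(prev, curr)
    clk = clock_activity(buffers, LOAD_WEIGHTS)
    sfs.append(sf)
    clks.append(clk)
    powers.append(macro_power(sf, clk, P_CLK0, P_CLK100))

# Mean of the per-cycle estimates versus one estimate from averaged activity.
per_cycle_mean = sum(powers) / len(powers)
from_averages = macro_power(sum(sfs) / len(sfs), sum(clks) / len(clks),
                            P_CLK0, P_CLK100)
print(f"per-cycle mean: {per_cycle_mean:.4f} W, from averages: {from_averages:.4f} W")
```

In this toy trace the two numbers differ because the model multiplies a term that depends on SF by CLK, so the average of the products is not the product of the averages. Averaging also discards the cycle-by-cycle waveform itself.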

The measure of clock activity depends on the type of macro. Since each local-clock buffer drives a different amount of load, we cannot treat them as equal. For custom macros, designers provide a table that puts relative weights on each local-clock buffer. For synthesized blocks, we calculated the relative weights using the number of latch bits that each local-clock buffer drove.

Capet usage

We used the Capet methodology to estimate the power consumption of the first-generation Cell processor and to refine and verify the chip’s power management logic.

RTL workloads

Each core or functional unit on the chip must run at least three different types of workloads: idle, typical, and high power. The idle workload is very useful in ensuring that when the core should be at the lowest power state, it shuts off as many clock buffers as possible. Analyzing the results of the idle test case is very useful in catching the most obvious errors. After running Capet, designers look at the macros that have the highest power and clock activity, and try to reduce them. On some cores, they also used the tool to verify the power savings from issuing instructions at a slower rate.

Figure 5. Macro and net power for the typical workload. (Plot not reproduced; power versus number of cycles, 0 to 4,000.)

Figure 6. Total power for the idle, typical, and high-power workloads. (Plot not reproduced; power versus number of cycles, 0 to 4,000.)

Figure 7. Percentage of clock buffers active for idle, typical, and high-power workloads. (Bar chart not reproduced; the vertical axis shows active clock buffers in percent, from 0 to 50.)

Power grid and thermal analysis

We used Capet results for the high-power test case as a stimulus for power grid integrity and thermal analyses. We assigned each core the average power for the high-power workload for current-resistance (IR) drop analysis. Analysis for di/dt of the package used the cycle-by-cycle power for the high-power workload for each core. Capet gives a realistic workload for di/dt because it calculates the power for each macro on a cycle-by-cycle basis. Other designs assume conservative di/dt estimates, which yield a failing power grid analysis even when the hardware generally works. Power grid analysis using Capet di/dt estimates matches hardware results. For thermal analysis, we used the average power for high-power workloads to calculate the chip’s temperature map.
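To make the two uses of the high-power trace concrete, the sketch below derives an average power (the kind of input used for IR drop and thermal analysis) and a cycle-to-cycle current slope (the kind of input used for di/dt analysis) from a cycle-by-cycle power trace. The article does not give formulas for this step, so the conversion I = P / V, the finite-difference slope, and every number here are assumptions for illustration only.

```python
def average_power(power_trace):
    """Mean power over the trace, in watts."""
    return sum(power_trace) / len(power_trace)


def current_slopes(power_trace, vdd_volts, f_hertz):
    """Approximate di/dt between consecutive cycles, in amperes per second.

    Assumes I = P / VDD in each cycle and one clock period (1 / f) between
    samples, so the slope is the current step multiplied by the frequency.
    """
    currents = [p / vdd_volts for p in power_trace]
    return [(after - before) * f_hertz
            for before, after in zip(currents, currents[1:])]


# Hypothetical cycle-by-cycle power for one core on the high-power workload.
trace_watts = [2.0, 2.5, 4.0, 3.0]

avg = average_power(trace_watts)
slopes = current_slopes(trace_watts, vdd_volts=1.0, f_hertz=1e9)
worst = max(abs(s) for s in slopes)

print(f"average power: {avg:.3f} W")
print(f"worst-case |di/dt|: {worst:.3e} A/s")
```

The point of the sketch is that the worst-case slope comes from the actual cycle-to-cycle steps in the trace rather than from a blanket conservative bound.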

Full chip estimates

Because RTL simulation is not fast enough to run a complete program at chip level, we make those estimates by obtaining utilization information on the different cores from architectural simulation. Based on realistic programs, we assign utilization rates for the three different workloads (idle, typical, and high power) to each core, then use those results to estimate power at the chip level.
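One simple way to combine per-core results into a chip-level figure is a utilization-weighted sum, sketched below. The article does not spell out the exact combination rule, so the weighting scheme, the dictionary layout, the core names, and all numbers are assumptions for illustration.

```python
WORKLOADS = ("idle", "typical", "high")


def core_power(utilization, power):
    """Utilization-weighted power for one core.

    utilization -- fraction of time spent in each workload (should sum to 1)
    power       -- estimated power for each workload, in watts
    """
    total_fraction = sum(utilization[w] for w in WORKLOADS)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("utilization fractions must sum to 1")
    return sum(utilization[w] * power[w] for w in WORKLOADS)


def chip_power(cores):
    """Sum of the utilization-weighted power of every core."""
    return sum(core_power(c["utilization"], c["power"]) for c in cores.values())


# Hypothetical cores with made-up utilization rates and per-workload power.
cores = {
    "core_a": {
        "utilization": {"idle": 0.2, "typical": 0.6, "high": 0.2},
        "power": {"idle": 1.0, "typical": 3.0, "high": 5.0},
    },
    "core_b": {
        "utilization": {"idle": 0.5, "typical": 0.5, "high": 0.0},
        "power": {"idle": 2.0, "typical": 4.0, "high": 6.0},
    },
}

print(f"estimated chip power: {chip_power(cores):.2f} W")
```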

Results

Figures 5 and 6 show Capet results for one Cell processor core. Figure 5 shows the macro and net power for the typical workload, and Figure 6 shows the power waveforms for the idle, typical, and high-power test cases. Figure 7 shows the portion of clock buffers active for the three workloads. The runtime for Capet is similar to the runtime for an RTL simulation: for 4,000 clock cycles on a core with 20.9 million transistors, runtime was approximately 30 minutes.

To correlate these results, we carefully ran multiple chips under exactly the same temperature conditions as those of the simulations. AC power estimates are within ±20 percent of hardware measurements when adjusted for chip process, frequency, and voltage. Hardware measurements showed that power variation with respect to voltage had the following relationships:

$$\text{AC power} \propto V^{x}, \quad 2 < x < 3$$
$$\text{DC power} \propto V^{y}, \quad 4 < y < 5$$
$$\text{Total power} \propto V^{z}, \quad 3 < z < 4$$

Detailed analysis and interpretation of the overall power characteristics with regard to their dependence on supply voltage V are omitted in this article for brevity. Suffice it to say that we empirically determined the value of the overall exponent z to be 3.6, a value considerably higher than would be predicted by analytical modeling of the first-order effects of active (AC) and passive (DC) power dissipation. In particular, the fact that the observed value of exponent x was significantly more than the classical value of 2 (for capacitive switching power), as also observed in the earlier work reported by Zyuban et al. [7], was not a pleasant surprise! In a later analysis, we went back and used the measured relationship for AC power (that is, dependence on V as $V^{x}$, where 2 < x < 3) to recalculate the AC power per macro per cycle within Capet, to try to bridge the absolute accuracy gap between the predictive Capet model and the hardware measurements.

Capet is also excellent, in any case, for predicting relative power differences. In other words, the detailed hardware power measurements for two different workloads showed relative power differences that Capet predicted quite accurately.
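To give a feel for what the measured exponent means in practice, the sketch below compares the power increase implied by the empirical total-power exponent of 3.6 with the increase implied by the classical exponent of 2. The 5 percent voltage increase is a hypothetical example, not a figure from the measurements.

```python
def power_ratio(v_new, v_old, exponent):
    """Ratio of power at v_new to power at v_old, assuming P is proportional to V**exponent."""
    return (v_new / v_old) ** exponent


# Hypothetical case: supply voltage raised by 5 percent.
measured = power_ratio(1.05, 1.00, 3.6)   # empirical total-power exponent z
classical = power_ratio(1.05, 1.00, 2.0)  # classical switching-power exponent

print(f"z = 3.6: power x {measured:.3f}")
print(f"z = 2.0: power x {classical:.3f}")
```

Under these assumptions, a 5 percent voltage increase raises total power by roughly 19 percent, compared with about 10 percent for a pure $V^{2}$ dependence, which is why holding the delivered voltage to the minimum matters so much for this design.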

We employed a variety of methods to make the Cell processor a low-power design. Efficient power supply design, use of static logic, avoidance of low-VT devices in the design, clock gating, low-power latch design, and clock grid and circuit tuning were major contributors to lowering power. A new power estimation methodology (Capet) was necessary to understand the overall power and determine which techniques provided enough power reduction to be worth implementing. Hardware results show that the Cell’s power methodology gives excellent absolute power estimates (after post-hardware calibration measurements) and can define relative power trade-offs with design and logic changes.

References

1. D. Pham et al., “The Design and Implementation of a First-Generation Cell Processor,” IEEE Int’l Solid-State Circuits Conf. Digest of Papers (ISSCC 05), IEEE Press, 2005, pp. 184-185.
2. P.J. Restle et al., “A Clock Distribution Method for Microprocessors,” IEEE J. Solid-State Circuits, vol. 36, no. 5, May 2001, pp. 792-799.
3. P.J. Restle et al., “The Clock Distribution of the Power4 Microprocessor,” IEEE Int’l Solid-State Circuits Conf. Digest of Papers (ISSCC 02), IEEE Press, 2002, pp. 144-145.
4. S. Gupta and F.N. Najm, “Power Modeling for High-Level Power Estimation,” IEEE Trans. VLSI Systems, vol. 8, no. 1, Feb. 2000, pp. 18-29.
5. D. Brooks et al., “Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors,” IEEE Micro, vol. 20, no. 6, Nov.-Dec. 2000, pp. 26-44.
6. J.S. Neely et al., “CPAM: A Common Power Analysis Methodology for High-Performance VLSI Design,” IEEE Conf. Electrical Performance of Electronic Packaging, IEEE Press, 2000, pp. 303-306.
7. V. Zyuban et al., “Unified Architecture Level Energy-Efficiency Metric,” Proc. Great Lakes Symp. VLSI, ACM Press, 2002, pp. 24-29.

Daniel Stasiak is with the Systems and Technology Group (STG) at IBM in Austin, Texas, and served as the power team lead in the Cell processor development project.


Rajat Chaudhry is with STG at IBM in Austin, Texas, and was one of the lead developers for the Capet methodology within the Cell processor power estimation and management team.

Dennis Cox is a distinguished engineer within STG at IBM in Rochester, Minnesota, engaged in general research and development work related to the Engineering and Technology Services function within STG.

Stephen Posluszny is with STG at IBM in Austin, Texas, and is a member of the STI Design Center Tools and Methodology department.

Jim Warnock is a distinguished engineer within STG at IBM in Yorktown Heights, New York. He served as the circuit team lead for the Cell processor project.


Steve Weitzel is with STG at IBM in Austin, Texas, and he was a contributing member of the Cell power team.

Dieter Wendel is a distinguished engineer within the eServer hardware development team of STG at IBM in Boeblingen, Germany.

Michael Wang is within the PowerPC microprocessor development team, part of STG at IBM in Austin, Texas; he was a contributing member of the Cell power team.

Direct questions and comments about this article to Dan Stasiak, IBM Corp., 11501 Burnet Road, Bldg. 906, Austin, TX 78758; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
