Process-Tolerant Low-Power Adaptive Pipeline under Scaled-Vdd

Swaroop Ghosh, Pooja Batra, Keejong Kim, and Kaushik Roy
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN-47907
{ghosh3, pbatra, keejong, kaushik}@ecn.purdue.edu, Ph: 765-494-3372; Fax: 765-494-3371

Abstract: Designing low-power pipelines in modern high-performance microprocessors is becoming a challenging task due to the increasing process parameter fluctuations associated with scaled devices. Conventional low-power design techniques typically make the critical paths of pipeline stages sensitive to parametric variations, degrading the yield. We implement a low-power and robust pipeline design methodology that is suitable for aggressive voltage scaling while maintaining high-frequency operation. This is achieved by isolating the critical paths, making them predictable (by design), and ensuring that they are activated rarely. At scaled supply (with frequency unchanged), any possible delay errors (under 1-cycle operation) are predicted ahead of time and avoided by adaptively stretching the clock period to 2 cycles. The test chip implementing the design methodology for a two-stage pipeline in a 130nm process shows 40% power savings with only 13% performance loss (due to adaptive clock stretching operations) and ~9.4% area overhead.

I. Introduction

Power dissipation is becoming a bottleneck in modern high-performance general-purpose processors. Due to the increased leakage of scaled devices and the high frequency of operation, pipelines consume a major portion of processor power. Process parameter variations (both systematic and random) may cause parametric failures in the pipeline stages, leading to yield loss. Conventional wisdom dictates a conservative design approach (e.g., scaling up VDD or upsizing logic gates) to avoid a large number of chip failures. However, such techniques come at the cost of increased power and/or die area. Low-power design techniques (e.g., dual Vth [1], dual VDD, supply voltage scaling, etc.) increase the critical path delays and/or the number of critical paths in the combinational circuit, degrading the parametric yield. Therefore, pipeline design for robustness with respect to parametric variations and pipeline design for low power dissipation do not go hand in hand.

In this paper, we implement a design technique that achieves robustness with respect to timing failure and provides the opportunity for aggressive voltage scaling by critical path isolation. A test chip has been implemented and tested in 130nm technology to show the effectiveness of the technique. In particular, we (a) isolate and predict the set of possible paths that may become critical under process variations, (b) ensure that they are activated rarely, and (c) avoid possible delay failures in the critical paths by dynamically stretching the clock period to 2 cycles (assuming all standard operations are 1-cycle) when they are activated. This allows us to utilize the timing slack between critical and off-critical paths and operate the circuits at reduced supply. The off-critical paths are evaluated in 1 cycle, while the critical path activation is predicted ahead of time (by pre-decoding a few primary inputs) and evaluated in 2 cycles, letting us maintain high frequency and good parametric yield even under scaled supply voltage.

The above strategy (i.e., isolating the critical paths of any random logic, making their activation predictable and rare, and creating timing slack between critical and off-critical paths) is possible using the CRISTA design methodology presented in [2]. In this paper, we present the CRISTA design methodology for pipeline designs and present measured results to show the effectiveness of the technique. The technique can be used to design linear as well as non-linear pipelines to achieve low power and robustness at high operating frequencies.

The model of a superscalar pipeline [3] designed using the CRISTA technique is shown in Fig. 1. In this example, only three out of the eight pipeline stages (namely, ID, IQ and EX) have been selected for the proposed supply scaling and adaptive clock stretching operations, because CRISTA is easy to apply to these stages. At low supply voltage, decoders D1, D2 and D3 predict the activation of critical paths of the combinational logic stages by pre-decoding a few input states. If the output of one of the pre-decoders is asserted, then clock stretching is performed to avoid delay failures. One possible implementation of clock stretching involves stalling the pipeline by gating the pipeline clock with a one-pulse signal freeze generated from the decoder outputs, as shown in Fig. 1. The pipeline latches (L1-L8) are then operated with the gated clock.

Fig. 1: Application of CRISTA [2] in the basic superscalar pipeline design. Blocks: Fetch (IF), Decode (ID), Rename (RN), Issue Queue (IQ) Wakeup/Select, Reg File (RF), Execute (EX), Memory (Mem), Writeback (WB), with pipeline latches L1-L8. D1, D2 and D3 are pre-decoding logic; a one-pulse generator produces freeze, which gates CLK to form gCLK. The legend marks the circuits designed using CRISTA.

The advantages of the proposed pipeline are two-fold: (a) it operates at lower supply voltage for low power dissipation, and (b) it maintains the same frequency as the original pipeline without compromising robustness. This comes with a small area penalty (due to the pre-decoding and stalling logic) and a performance loss because of occasional clock stretching operations.

The paper is organized as follows. In Section II, we briefly describe the CRISTA design methodology for low power and robustness. In Section III, we present the application of this technique to a pipeline design. The implementation of a test chip containing a two-stage pipeline design to demonstrate the feasibility of the CRISTA methodology is elaborated in Section IV. Conclusions are drawn in Section V.

II. CRISTA Design Methodology

The design methodology for critical path isolation in combinational logic is elaborated in [2]; it is briefly described here because it is used for the design of the individual pipeline stages. First, the critical paths are isolated to a cofactor by proper control variable selection during Shannon partitioning. This cofactor is then partitioned further under area and delay constraints to reduce the activation probability of the critical paths. For example, suppose the original circuit is partitioned into cofactors CF11 and CF21 using control variable x1. Cofactor CF11 is active ~50% of the time (when x1=1), assuming x1 is logic ‘1’ 50% of the time. Successive partitioning of CF11 results in cofactors CF32 and CF42, each of which is active only 25% of the time. With another level of partitioning, we isolate the critical paths to CF53, which is active only 12.5% of the time.

Gate sizing [4] is then used for further isolation and to create enough timing slack between critical and off-critical paths. Note that this sizing approach selectively makes short paths faster and long paths slower to create a large timing slack between the set of long paths and the shorter paths. Timing slack is the enabling factor for supply reduction. For example, CF53 is downsized to make it more critical, while the other cofactors are selectively upsized to make them off-critical. This is contrary to the conventional sizing approach, where the critical paths are made faster and the off-critical paths are slowed down. Finally, the supply is scaled down such that off-critical cofactors operate in 1 cycle while critical cofactors are evaluated in a stretched clock period (2-cycle).
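The 50% / 25% / 12.5% activation figures quoted above follow from multiplying the signal probabilities of the control variables that select the critical cofactor. The short Python sketch below reproduces that arithmetic. It assumes the control variables are statistically independent, and the function name is hypothetical (it is not part of the CRISTA tool flow in [2]).

```python
def cofactor_activation_probability(control_probs):
    """Probability that all control variables selecting a cofactor are 1.

    Assumption (not stated in the paper): control variables are independent,
    so the joint probability is the product of the individual probabilities.
    """
    prob = 1.0
    for p in control_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("signal probabilities must lie in [0, 1]")
        prob *= p
    return prob


if __name__ == "__main__":
    # One, two and three levels of Shannon partitioning, each control
    # variable being logic '1' 50% of the time.
    for levels in (1, 2, 3):
        prob = cofactor_activation_probability([0.5] * levels)
        print(f"{levels} level(s): active {prob * 100:.1f}% of the time")
```

Choosing control variables with signal probabilities below 0.5 would shrink the activation probability faster than the three-level example suggests.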

Fig 2: Timing diagram. Labels: CLK; critical path and non-critical paths; slack S1 at VDD = nominal; slack S2 at VDD below nominal; performance penalty. Clock period = 2-cycle when critical path activation is predicted; clock period = 1-cycle otherwise.

Fig 3: Performance penalty (%) (a) versus signal probability for a critical cofactor with k=4 control variables (penalty grows as N increases), and (b) versus the number of control variables k for the critical cofactor at signal probability = 0.5 (curves for N=5 and N=10).

Let us consider the timing diagrams of the conventional and CRISTA designs, which can be the combinational logic stages of a pipeline. In the conventional design, a slack of S1 is maintained in the critical path to meet the required yield (Fig 2). In CRISTA, however, the critical path (shown by hatched lines in Fig 2) may fail the timing constraint, but the off-critical paths (shown by the dotted block) maintain a slack of S2 with respect to the clock period. Note that the slack of the critical path (S1) can be utilized in the conventional design for energy saving, whereas in CRISTA the slack of the off-critical paths (S2) can be used. CRISTA ensures that the activation condition of the critical paths can be determined and that adaptive clock stretching can be performed for correct functionality of the circuit. If the performance penalty due to clock stretching operations in CRISTA is p, then the energy-delay products of the conventional (EDPconv) and CRISTA (EDPCRISTA) designs are given by

$$\mathrm{EDP}_{conv} \propto C_{conv}\left(1-\frac{S_1}{T_c}\right)^{2} T_c\,;\qquad \mathrm{EDP}_{CRISTA} \propto C_{CRISTA}\left(1-\frac{S_2}{T_c}\right)^{2}\left(T_c + pT_c\right) \qquad (1)$$

The EDP ratio is given by

$$\frac{\mathrm{EDP}_{conv}}{\mathrm{EDP}_{CRISTA}} = \left(\frac{C_{conv}}{C_{CRISTA}}\right)\left(\frac{1-\frac{S_1}{T_c}}{1-\frac{S_2}{T_c}}\right)^{2}\left(\frac{1}{1+p}\right) = \left(\frac{C_{conv}}{C_{CRISTA}}\right)\left(\frac{1-S_1^{norm}}{1-S_2^{norm}}\right)^{2}\left(\frac{1}{1+p}\right) \qquad (2)$$

where S1norm (S2norm) is the slack S1 (S2) normalized with respect to Tc. Since CCRISTA ≥ Cconv due to design modifications and the addition of pre-decoding logic, and (1+p) > 1, the necessary condition for CRISTA to be better is S2norm > S1norm. From the analysis presented above, it can be concluded that the power saving in the CRISTA method mainly comes from the quadratic dependence of power on timing slack. Power reduces quadratically while the delay increases only linearly, letting us reduce the EDP. Simulations on a set of MCNC benchmark circuits (as single-stage pipelines) show 60% average power saving with only 18% area and 6% performance overhead.
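To make the trade-off in equation (2) concrete, the following Python sketch evaluates the ratio EDPconv / EDPCRISTA = (Cconv / CCRISTA) · ((1 − S1norm) / (1 − S2norm))² · 1 / (1 + p). The function name and the numbers in the example are illustrative assumptions only; they are not measured values from the paper.

```python
def edp_ratio(c_conv, c_crista, s1_norm, s2_norm, p):
    """Return EDP_conv / EDP_CRISTA as written in equation (2).

    c_conv, c_crista : switched capacitance of the two designs (same units)
    s1_norm, s2_norm : slacks normalized to the clock period Tc, in [0, 1)
    p                : performance penalty from clock stretching (p >= 0)
    A value above 1 means the CRISTA design has the lower EDP.
    """
    if not (0.0 <= s1_norm < 1.0 and 0.0 <= s2_norm < 1.0):
        raise ValueError("normalized slacks must lie in [0, 1)")
    if c_conv <= 0 or c_crista <= 0 or p < 0:
        raise ValueError("capacitances must be positive and p non-negative")
    slack_term = ((1.0 - s1_norm) / (1.0 - s2_norm)) ** 2
    return (c_conv / c_crista) * slack_term / (1.0 + p)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the formula.
    ratio = edp_ratio(c_conv=1.0, c_crista=1.18, s1_norm=0.05, s2_norm=0.30, p=0.06)
    print(f"EDP_conv / EDP_CRISTA = {ratio:.3f}")
```

With equal slacks the ratio cannot exceed 1 when CCRISTA ≥ Cconv, which is the necessary condition S2norm > S1norm stated above.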

III. Pipeline Design and Performance Analysis

In our pipeline design methodology, the maximum stage delay (computed using Statistical Static Timing Analysis [5]) is chosen as the target delay for all the stages. Next, the combinational circuit of one stage is picked at a time and the CRISTA methodology is applied as explained in Section II. The output is a list of cofactors, which is sized to meet the required delay target at lower supply voltage. These steps are repeated for each of the stages.

Fig. 4: (a) 4-bit carry-lookahead adder built from partial full adders (inputs A3-A0, B3-B0, Ci; signals G3-G0, P3-P0, C3-C0; sum outputs S); (b) 4-bit comparator built from 1-bit comparators (output A=B?). CLA critical path activation condition: P3P2P1P0Ci. CLA longest off-critical path activation condition: P3P2P1G0. Comparator critical path activation condition: A3B3.

Next, let us evaluate the performance of the new pipeline design. Consider an N-stage linear pipeline (similar to Fig. 1) after design and synthesis, where decoders D1, D2, ..., DN predict the activation of the critical cofactors of the individual stages. Clock stretching is required whenever the critical cofactor of any of the pipeline stages is activated. Under such circumstances, the pipeline is stalled by gating the clock. Let pi be the activation probability of the critical cofactor of the ith stage and ptotal be the probability of a clock stretching operation in the pipeline in each clock cycle. Further, we assume that the critical cofactors of all stages are activated by the same number of control variables k (i.e., the total number of primary inputs that are pre-decoded to predict the activation of the critical path of a pipeline stage is k). So

$$p_1 = p_2 = \dots = p_N = p = (\text{signal probability})^{k} \qquad (3)$$

Then ptotal is given by

$$p_{total} = 1-(1-p)^{N} \qquad (4)$$

If the ideal cycles-per-instruction (CPI) of the pipeline is CPIideal, then the new CPI is given by

$$CPI_{new} = CPI_{ideal} + p_{total}\cdot(\text{stall penalty}) \qquad (5)$$

The performance penalty due to occasional clock stretching operations is given by

$$\text{Perf. penalty} = \frac{CPI_{new}-CPI_{ideal}}{CPI_{new}} = \frac{p_{total}\cdot(\text{stall penalty})}{CPI_{ideal}+p_{total}\cdot(\text{stall penalty})} = \frac{p_{total}}{1+p_{total}} = \frac{1-(1-p)^{N}}{2-(1-p)^{N}} = \frac{1-\left(1-(\text{signal probability})^{k}\right)^{N}}{2-\left(1-(\text{signal probability})^{k}\right)^{N}} \qquad (6)$$

where the simplification to ptotal/(1+ptotal) takes CPIideal = 1 and a stall penalty of one cycle.

The performance penalty for different N and input signal probabilities is shown in Fig 3(a). In this plot, we assume that the critical cofactor of each stage is activated by 4 inputs (i.e., k = 4). It can be observed from this plot that if the control variables have low signal probabilities (~0.1-0.3), then the penalty can be restricted to within 10%. From equation (6), it can be noted that the penalty can be large for deep pipeline designs (i.e., large N). The following techniques can be used to reduce the performance penalty: (a) tune the control variable selection metric [2] during circuit partitioning to pick low-switching inputs as control variables; (b) reduce the activation probability of critical blocks further (i.e., by increasing k, as shown in Fig 3(b)); (c) select only power-hungry pipeline stages for application of this methodology; and (d) employ alternate means of implementing the adaptive clock stretching that stall only a part of the pipeline.
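Equations (3) to (6) are easy to evaluate numerically. The sketch below is a direct transcription in Python; the function names are hypothetical, and the defaults CPIideal = 1 and stall penalty = 1 cycle are the assumptions under which equation (6) reduces to ptotal / (1 + ptotal).

```python
def stretch_probability(signal_prob, k, n_stages):
    """Equations (3) and (4): probability that any stage requests a stretch."""
    p = signal_prob ** k                  # eq. (3): per-stage activation probability
    return 1.0 - (1.0 - p) ** n_stages    # eq. (4): at least one stage activated


def performance_penalty(signal_prob, k, n_stages, cpi_ideal=1.0, stall_penalty=1.0):
    """Equations (5) and (6): fractional penalty from clock stretching.

    cpi_ideal=1 and stall_penalty=1 are assumed defaults, matching the
    simplified form p_total / (1 + p_total) of equation (6).
    """
    p_total = stretch_probability(signal_prob, k, n_stages)
    cpi_new = cpi_ideal + p_total * stall_penalty   # eq. (5)
    return (cpi_new - cpi_ideal) / cpi_new          # eq. (6)


if __name__ == "__main__":
    # Sweep similar in spirit to Fig 3(a): k = 4, several pipeline depths.
    for n_stages in (2, 5, 10):
        for signal_prob in (0.1, 0.3, 0.5):
            penalty = performance_penalty(signal_prob, k=4, n_stages=n_stages)
            print(f"N={n_stages:2d} sp={signal_prob:.1f} penalty={penalty * 100:5.2f}%")
```

The sweep makes the two levers visible: the penalty grows with the pipeline depth N and falls quickly as the signal probability or the number of pre-decoded inputs k moves the per-stage activation probability down.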

IV. Implementation and Test Chip Measurement Results

To demonstrate the feasibility of this low-power design approach, we implemented a two-stage pipeline containing a 4-bit carry-lookahead adder (CLA) and a 4-bit comparator [6] as the pipeline stages in IBM 130nm technology. The schematics of the CLA and the comparator are shown in Fig 4. The CLA and the comparator have the following features, which make them amenable to supply voltage scaling and adaptive clock stretching operations:

(a) The critical path of the CLA is well defined and its occurrence is rare. For a 4-bit CLA, the critical path is activated if P3P2P1P0Ci=1, where P3-P0 are the propagate signals and Ci is the carry input. Moreover, one of the longest off-critical paths is activated when P3P2P1G0=1, where G0 is the generate signal. The activation probability of the critical path, given by p(P3P2P1P0Ci) = p(P3)p(P2)p(P1)p(P0)p(Ci), is low (~3.1% assuming that P3-P0 and Ci are logic ‘1’ 50% of the time), and there is also an inherent slack available between the critical and off-critical paths (~85ps with IBM 130nm devices).

(b) The comparator circuit exhibits similar properties to the CLA. Its critical path is activated when A^B=0, where A and B are 4-bit input strings and ‘^’ denotes the XOR operation. The critical path activation probability of the comparator, given by p(A^B) = p(A3^B3) p(A2^B2) p(A1^B1) p(A0^B0), is also low (~6.2% assuming that the inputs are logic ‘1’ 50% of the time). Here, the slack between the critical and off-critical paths is ~75ps (with IBM 130nm devices).

For both of the above circuits, critical path activation can be predicted by pre-decoding a few primary inputs. In the test chip, inputs Ci and A2 of the CLA and inputs A3 and B3 of the comparator are chosen heuristically [2] for pre-decoding (shown by the dashed box in Fig 11(a)). Note that only two inputs have been decoded to simplify the decoding logic.

Table-1: Test mode (proposed pipeline)
TM1  TM2  Mode
0    0    Data comes from LFSR. Only CLA performs 2-cycle operations
0    1    Fixed inputs to activate the longest off-critical path periodically
1    0    Fixed inputs to activate the critical path
1    1    Data comes from LFSR. Both CLA and comparator perform 2-cycle operations

Table-2: Test mode (conventional pipeline)
TM1  Mode
0    Data comes from LFSR
1    Fixed inputs to activate the critical path

Fig 5: (a) Schematic of the proposed and conventional two-stage pipeline designs (Table-1 and Table-2 explain the test mode signals), (b) die photo and GDS layout.

Fig 6: (a) Pipeline stalling logic, (b) simulations (130nm technology) showing adaptive clock stretching to 1-cycle/2-cycle.

Fig 7: Measurement of div64_freeze_syn showing adaptive clock stretching to 1-cycle/2-cycle (a) for TM1=TM2=1, (b) for TM1=0, TM2=1 (no 2-cycle operations).

Fig 5(a) shows the schematic of the proposed pipeline design. Activation of the critical paths of the CLA and the comparator is predicted by pre-decoders, and a freeze signal is generated to stall the pipeline (stalling is implemented by gating the clock with freeze). Test mode signals (TM1 and TM2) are used to apply either fixed patterns or LFSR-generated random patterns to the pipeline. The fixed pattern (with one toggling input) periodically activates either the critical or the longest off-critical path of the CLA. The meanings of the test mode signals are explained in Table-1 and Table-2. Measurements are taken at 1.2V supply with a 200MHz clock. The die photo and GDS layout of the pipelines and associated test circuitry are shown in Fig 5(b).
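The activation probabilities quoted in this section for the CLA (~3.1%) and the comparator (~6.2%) can be checked by exhaustive enumeration over all input combinations. The Python sketch below does this under two assumptions that are not spelled out in the paper: all input vectors are equally likely, and the propagate signal is defined as Pi = Ai XOR Bi (with generate Gi = Ai AND Bi). Function names are hypothetical.

```python
from itertools import product


def bits(value, width=4):
    """Little-endian list of the low `width` bits of `value`."""
    return [(value >> i) & 1 for i in range(width)]


def cla_critical_active(a, b, ci):
    """Critical path condition P3*P2*P1*P0*Ci = 1, assuming Pi = Ai XOR Bi."""
    propagate = [x ^ y for x, y in zip(bits(a), bits(b))]
    return all(propagate) and ci == 1


def cla_off_critical_active(a, b):
    """Longest off-critical condition P3*P2*P1*G0 = 1, assuming G0 = A0 AND B0."""
    a_bits, b_bits = bits(a), bits(b)
    propagate = [x ^ y for x, y in zip(a_bits, b_bits)]
    return all(propagate[1:]) and (a_bits[0] & b_bits[0]) == 1


def comparator_critical_active(a, b):
    """Comparator critical path: every bitwise XOR is 0, i.e. A equals B."""
    return (a ^ b) == 0


# Enumerate every 4-bit A, 4-bit B and carry-in, all equally likely (assumed).
cla_vectors = list(product(range(16), range(16), range(2)))
cmp_vectors = list(product(range(16), range(16)))

cla_prob = sum(cla_critical_active(a, b, ci) for a, b, ci in cla_vectors) / len(cla_vectors)
off_prob = sum(cla_off_critical_active(a, b) for a, b, _ in cla_vectors) / len(cla_vectors)
cmp_prob = sum(comparator_critical_active(a, b) for a, b in cmp_vectors) / len(cmp_vectors)

if __name__ == "__main__":
    print(f"CLA critical path activation:      {cla_prob * 100:.3f}%")
    print(f"CLA longest off-critical (P3P2P1G0): {off_prob * 100:.3f}%")
    print(f"Comparator critical path activation: {cmp_prob * 100:.3f}%")
```

Under these assumptions the enumeration gives 16/512 for the CLA and 16/256 for the comparator, in line with the rounded figures in the text.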

Fig 8: (a) Circuit to detect critical/longest off-critical path delay failure; (b) critical path of the conventional pipeline meets the target frequency at VDD=1.2V (64 activations of the critical path); (c) critical path of the conventional pipeline fails to meet the target frequency at VDD=1.13V. Signals: cla_out_m is the critical path output of the CLA in the proposed pipeline; cla_out_o is the critical path output of the CLA in the conventional pipeline.

Fig 6 shows the design, simulation and timing diagram of the pipeline stalling logic. Simulation of the gated clock (gclk) exhibits 1-cycle and 2-cycle operations (Fig 6(b)). Measurement results of div64_freeze_syn (the freeze signal after divide-by-64) under two different test conditions are shown in Fig 7(a) and Fig 7(b). Under random test patterns (TM1=TM2=1), many 2-cycle operations were detected; however, under off-critical path activation (TM1=0, TM2=1), no 2-cycle operations were detected (as expected), demonstrating the correctness of the pipeline stalling logic.

For power comparison, we first scale the supply of the conventional pipeline to use the timing slack S1 of the critical path (Fig 2) and determine the power consumption under random test patterns. Next, we scale the supply of the proposed pipeline to use the timing slack S2 between the critical and longest off-critical paths (Fig 2) and determine the power consumption again. In both cases, we only detect the critical or longest off-critical path failure of the CLA (because the delay of the CLA is larger than that of the comparator). The test circuitry to detect the critical/off-critical path delay failures is shown in Fig 8(a). The measurement steps and results are as follows:

(a) We activate the critical path of the CLA of the conventional pipeline and observe the critical path output div64_cla_o (i.e., the cla_out_o signal after divide-by-64). Measurement shows the expected switching of this signal (Fig 8(b)). Then, we reduce the supply to 1.13V, where the critical path shows a delay failure (div64_cla_o becomes idle, Fig 8(c)). This leads to 20% power saving.

(b) We activate the critical path and subsequently the longest off-critical path of the proposed pipeline (Fig 9(a) and (b)). The longest off-critical path shows frequent switching of div64_cla_m (i.e., the cla_out_m signal after divide-by-64). To utilize the timing slack of the off-critical path, we reduce the supply to 0.95V, at which point signal div64_cla_m ceases to switch (Fig 9(c)). However, to avoid failure of the longest off-critical path, we operate the circuit at 1V, saving 60% power.

Fig 9: Measurement of div64_cla_m. (a) Critical path of the proposed pipeline passes at VDD=1.2V (TM1=1, TM2=0; 64 activations of the critical path), (b) longest off-critical path passes at VDD=1.2V (TM1=0, TM2=1; 64 activations of the longest off-critical path), (c) longest off-critical path fails at VDD=0.95V (TM1=0, TM2=1).

Therefore, the proposed pipeline shows 40% extra saving, gained by using the slack of the off-critical path for supply reduction (Fig 10). The critical paths of the CLA and the comparator are evaluated in 2 cycles (as shown in Fig 6(b)) to avoid wrong computation. Functionality of the pipeline at nominal and reduced supply is verified by collecting the test responses.

The performance of the proposed pipeline is monitored using a counter (Fig 11(a)), which counts the number of freeze pulses (which stall the pipeline) for a given time period (controlled by the ENABLE signal). The output of the LSB of the counter is shown in Fig 11(b). We measure the number of clock stretching operations for one stage (i.e., the CLA) as well as for both stages (CLA and comparator) (Fig 11(c)). Simulation data shows a worst-case performance loss of 18% for the two-stage pipeline. Measurement results show a 10% penalty for one stage and 13% for both stages (Fig 11(d)). The difference between simulation and measurement data shown in Fig 11(c) may be due to an imperfect ENABLE signal (provided externally).

The power saving in the proposed technique comes at the cost of an area penalty, which is mainly due to the pre-decoders, the pipeline stalling logic and the gated clock generation. In our implementation, the area overhead is approximately 9.4%.
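The counter-based performance monitor described above can be mimicked with a small behavioral model: every operation takes one clock cycle, or two when a freeze pulse is issued, and the penalty is the fraction of clock cycles spent on freeze pulses. The Python sketch below is only an illustration; the per-operation stretch probability is a free, hypothetical parameter and does not model the actual pre-decoders or LFSR sequence of the test chip.

```python
import random


def simulate_counter_window(window_cycles, stretch_prob, seed=0):
    """Count freeze pulses over roughly `window_cycles` clock cycles.

    Each operation is stretched to 2 cycles with probability `stretch_prob`
    (hypothetical parameter), otherwise it completes in 1 cycle.
    Returns (freeze_pulses, elapsed_cycles); elapsed_cycles may exceed
    window_cycles by one when the last operation is stretched.
    """
    rng = random.Random(seed)   # seeded so that runs are repeatable
    elapsed = 0
    freeze_pulses = 0
    while elapsed < window_cycles:
        stretched = rng.random() < stretch_prob
        elapsed += 2 if stretched else 1
        freeze_pulses += int(stretched)
    return freeze_pulses, elapsed


def penalty_from_counter(freeze_pulses, elapsed_cycles):
    """Fraction of clock cycles lost to stretching (extra cycles / total cycles)."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    return freeze_pulses / elapsed_cycles


if __name__ == "__main__":
    for prob in (0.03125, 0.0625, 0.125):
        pulses, cycles = simulate_counter_window(100_000, prob, seed=42)
        estimate = penalty_from_counter(pulses, cycles)
        closed_form = prob / (1.0 + prob)   # equation (6) with CPI_ideal = 1
        print(f"stretch_prob={prob:.5f} simulated={estimate:.4f} eq6={closed_form:.4f}")
```

For long windows the simulated ratio approaches ptotal / (1 + ptotal) from equation (6), which is why a simple freeze-pulse counter is enough to estimate the performance penalty.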

Fig 10: Power measurement (TM1=TM2=1). Average power [µW] versus supply voltage [V], marking the original power, the 20% reduction (conventional pipeline), the 40% extra reduction (proposed pipeline), and the supply at which the longest off-critical path fails.

Fig 11: (a) Performance monitoring circuit (a 10-bit counter, gated by ENABLE, counting freeze pulses from the stalling logic), (b) measurement of the LSB of the counter, (c) number of 2-cycle (clock stretching) operations versus number of clock cycles, and (d) performance penalty (%) versus number of clock cycles. Curves in (c) and (d): CLA only (meas), Both stages (meas), Both stages (sim).

V. Conclusions

We proposed an adaptive pipeline design based on critical path isolation, which achieves low-power operation while being robust with respect to parametric delay failures. The proposed technique makes possible delay errors (that may occur under 1-cycle operation) predictable and rare. This leads to a robust pipeline design that allows us to reduce the supply voltage aggressively, using the predictability to prevent any delay violations (by providing an extra clock cycle when critical paths are activated) while maintaining high frequency and achieving good yield. The effectiveness of the proposed low-power design methodology is demonstrated by a two-stage pipeline designed in 130nm technology. We believe that this technique can find widespread application in future microprocessor pipelines, where variations will be more prominent and the requirement of low power dissipation at high frequencies may require adaptive clock stretching operations for each of the pipeline stages.

References

[1] A. Srivastava et al., Statistical optimization of leakage power considering process variations using dual-Vth and sizing, DAC, 2004.
[2] S. Ghosh et al., A new paradigm for low-power, variation-tolerant circuit synthesis using critical path isolation, ICCAD, 2006.
[3] S. Palacharla et al., Complexity-effective superscalar processors, ISCA, 1997.
[4] S. H. Choi et al., Novel sizing algorithm for yield improvement under process variation in nanometer, DAC, 2004.
[5] K. Kang et al., Statistical timing analysis using levelized covariance propagation, DATE, 2005.
[6] J. Rabaey, Digital Integrated Circuits, Prentice Hall, 1996.