IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 9, SEPTEMBER 2010


Voltage Scalable High-Speed Robust Hybrid Arithmetic Units Using Adaptive Clocking Swaroop Ghosh, Debabrata Mohapatra, Georgios Karakonstantis, and Kaushik Roy, Fellow, IEEE

Abstract—In this paper, we explore various arithmetic units for possible use in high-speed, high-yield ALUs operated at scaled supply voltage with adaptive clock stretching. We demonstrate that careful logic optimization of the existing arithmetic units (to create hybrid units) indeed makes them further amenable to supply voltage scaling. Such hybrid units result from mixing the right amount of fast arithmetic into the slower ones. Simulations on different hybrid adders and multipliers in BPTM 70 nm technology show 18%–50% improvements in power compared to standard adders, with only a 2%–8% increase in die-area at iso-yield. These optimized datapath units can be used to construct voltage scalable robust ALUs that can operate at high clock frequency with minimal performance degradation due to occasional clock stretching.

Index Terms—Adders, arithmetic logic unit, high-speed design, low power, multipliers, process variation tolerant design, supply voltage scaling.

I. INTRODUCTION

ARITHMETIC and Logic Units (ALUs) are the core of microprocessors where all computations are performed. The demand for performance at low power consumption in today's general purpose processors has put severe limitations on ALU design. ALUs are also among the most power hungry components in the processor and are often a possible location of hot-spots. The presence of multiple ALUs in superscalar pipelines further aggravates the power and thermal issues [3]. Technology scaling has resulted in faster devices, but at the same time die-to-die delay variations have also increased. Therefore, low-power ALU design while maintaining high yield under tighter delay constraints turns out to be a challenging problem. Supply voltage scaling is very effective in reducing the power dissipation due to the quadratic dependence of switching power and the exponential dependence of subthreshold leakage on supply voltage. Variable supply voltage and adaptive body biasing techniques have been proposed in [4] to jointly optimize the switching and leakage power of a multiply-accumulate unit.
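The quadratic dependence of switching power on supply voltage mentioned above can be made concrete with a first-order model. The sketch below is illustrative only; the activity factor and capacitance values are assumptions, not numbers from this work.

```python
def switching_power(c_eff: float, vdd: float, f_clk: float, alpha: float = 0.2) -> float:
    """Dynamic switching power: P = alpha * C_eff * Vdd^2 * f_clk."""
    return alpha * c_eff * vdd ** 2 * f_clk

def switching_saving(v_nominal: float, v_scaled: float) -> float:
    """Fractional switching-power saving from Vdd scaling at fixed frequency."""
    return 1.0 - (v_scaled / v_nominal) ** 2
```

By this model, scaling Vdd from 1.1 V to 1.0 V at a fixed clock frequency trims roughly 17% of the switching power, before any exponential leakage benefit is counted.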

Manuscript received October 31, 2007; revised March 11, 2008; accepted September 26, 2008. First published October 23, 2009; current version published August 25, 2010. This work was supported in part by the Focused Center Research Program—Gigascale System Research Center. D. Mohapatra, G. Karakonstantis, and K. Roy are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906 USA. S. Ghosh is with the Logic Technology Development, Advanced Memory Design Group, Intel Inc., Portland, OR 97006 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2009.2022531

In [1], [2], the power consumption is reduced by exploiting the fact that the critical paths of arithmetic units are exercised rarely. Therefore, the supply voltage can be scaled down (while maintaining the clock frequency) to utilize the timing slack available between the critical paths and the longest off-critical paths. The off-critical paths (or short paths) are evaluated at the rated frequency while the infrequent critical paths (or long paths) are evaluated in two clock cycles. This allows aggressive scaling of the supply voltage with minimal throughput degradation. In [1], the application of this methodology is shown for a ripple carry adder. A new adder called the Cascaded Carry Select Adder (C²SA) is proposed in [2], which improves the existing Carry Select Adder (CSA) to make it amenable to supply voltage scaling and adaptive clock stretching operation. Various adder families [5]–[9] have been proposed in the past to trade off speed, power, and area for possible use in ALUs. Of all the complex adder families, the ripple carry adder (RCA) is the most area and energy efficient but has the worst critical path delay. The CSA is faster than the RCA; however, it has a larger area due to logic duplication. The Kogge–Stone (KS) adder [6], on the other hand, is among the fastest adders but consumes large area and power. Therefore, a family of sparse tree adders has been proposed to reduce area at the cost of a slight increase in delay. Some examples of sparse tree adders are the Brent–Kung, Han–Carlson, Sklansky, and quaternary tree adders. In Brent–Kung (BK) [7], the forward tree computes the longest carry fast and the intermediate carries are computed by a backward tree. Han–Carlson (HC) [8] computes the even carries first and generates the odd carries using a backward tree. The quaternary tree adder (QTA) [9] performs the computation with significantly less overhead than the Han–Carlson adder. It combines the speed of the carry lookahead adder (CLA) with the lower complexity of the Carry Select Adder (CSA).
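For readers unfamiliar with the parallel-prefix formulation that underlies the KS, BK, and HC adders above, a bit-level Python model of Kogge–Stone carry computation is sketched below. This is an illustrative behavioral model, not the circuit netlist used in this work.

```python
def kogge_stone_carries(a: int, b: int, n: int = 32, cin: int = 0) -> list:
    """Compute all carries of an n-bit addition with a Kogge-Stone
    parallel prefix: log2(n) stages, each combining (g, p) pairs at
    doubling distances. Returns carries[0..n], carries[i] = carry into bit i."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]   # generate g_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)] # propagate p_i
    G, P = g[:], p[:]
    d = 1
    while d < n:
        # combine each (G, P) with the pair d positions to the right;
        # iterating high-to-low keeps the previous stage's values intact
        for i in range(n - 1, d - 1, -1):
            G[i] = G[i] | (P[i] & G[i - d])
            P[i] = P[i] & P[i - d]
        d *= 2
    # after the full prefix, (G[i], P[i]) spans bits [0..i]
    return [cin] + [G[i] | (P[i] & cin) for i in range(n)]
```

The same (g, p) prefix operator underlies BK and HC; the families differ only in which spans are computed directly and which are filled in by a backward tree, trading wiring density and area against depth.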
In [5], various tree adders have been compared in the energy-delay space. However, adders (or hybrid adders) have not been explored in terms of supply voltage scaling with adaptive clock stretching operation for low-power and high-yield ALUs. In this paper, we explore various topologies of adders (e.g., RCA, C²SA, BK, QTA) in terms of their amenability to the above-mentioned supply voltage scaling and adaptive clock stretching. We compare the power, area, and speed of these adders at scaled supply and iso-yield conditions to determine the best candidate for high speed and low power dissipation. We further propose careful optimizations to design hybrid adders that allow further scaling of the supply voltage and tolerance to process variations with a small area penalty. In hybrid adders, we mix a small amount of a fast adder into a strategically selected portion of a slow adder in order to make the single-cycle short paths faster, creating room for further voltage scaling. We also extend this study to the design of low-power multipliers.

1063-8210/$26.00 © 2009 IEEE


Fig. 1. (a) Basic structure of variable latency adder. (b) Adaptive clock stretching operation in variable latency adder (T is the clock period).

In summary, we make the following contributions in this paper:
• development and comparison of voltage scalable adder architectures with adaptive clock stretching for high-speed, low-power, and high-yield ALUs;
• proposal of a hybrid adder design methodology that optimizes the short latency off-critical paths of the adders to allow further scaling of the supply voltage with improved yield and tolerance to process variation;
• application and analysis of the hybrid technique for the design of low-power, process-variation-tolerant multipliers at scaled supply.
The rest of the paper is organized as follows. In Section II, we briefly discuss supply voltage scaling and adaptive clock stretching for low-power ALU design. We also explore different adder topologies for power, speed, and yield. We propose the hybrid adder design methodology in Section III. Low-power multipliers using adaptive clock stretching are presented in Section IV. In Section V, we use the hybrid adder design in the vector merging stage of the carry-save multiplier (CSM) architecture. The practical challenges and tradeoffs are discussed in Section VI and, finally, the conclusion is given in Section VII.
II. LOW-POWER VOLTAGE SCALABLE ADDER ARCHITECTURES
In this section, we first briefly discuss the concept of voltage scalable variable latency adders using adaptive clock stretching. Next, we develop different adder architectures based on the RCA, C²SA, BK, and QTA that enable supply voltage scaling and process variation tolerance with minimal impact on speed and area.
A. Variable Latency Adders Using Adaptive Clock Stretching
Variable latency adders are based on the fact that the critical paths are activated only occasionally. Therefore, the supply voltage can be lowered while maintaining the rated clock frequency. The off-critical paths are evaluated in one cycle (at the rated frequency) while the clock period is stretched to two cycles when the critical paths are activated.
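The pre-decoding step behind this scheme can be modeled behaviorally as follows. This is a Python sketch; the 5-bit decoded window is an illustrative choice, and the function names are ours.

```python
def propagate_bits(a: int, b: int, lo: int, hi: int):
    """Per-bit propagate signals p_i = a_i XOR b_i over the decoded window."""
    return [bool(((a >> i) & 1) ^ ((b >> i) & 1)) for i in range(lo, hi + 1)]

def needs_clock_stretch(a: int, b: int, lo: int = 13, hi: int = 17) -> bool:
    """Predict long-latency (critical path) activation: the incoming carry
    can ripple across the decoded window only if every decoded bit
    propagates. The check is conservative -- it may stretch when the real
    path is short, but it never misses a long path through the window."""
    return all(propagate_bits(a, b, lo, hi))
```

For uniformly random operands, each propagate bit is true with probability 1/2, so a 5-bit window asserts stretching on about 1 in 32 additions; the clock is stretched only on these rare fully-propagating patterns while all other inputs complete at the rated frequency.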
This allows us to exploit the timing slack between long and short paths for supply voltage scaling. A pre-decoder is required to predict the activation of critical paths based on the input pattern. The concept of supply voltage scaling and adaptive clock stretching is illustrated by taking an example of a 32-bit ripple carry adder [Fig. 1(a)]. Since decoder consumes area, only few intermediate bits are decoded to predict critical path activation. In Fig. 1(a), bit-13 through bit-17 is decoded to predict whether

the current input pattern can propagate the input carry (C12) through bit-17 or not. The decoding circuit is nothing but a set of XOR gates that determines whether is true or false (where p’s are propagate signal [11]). For this through is the longest critchoice of decoding, path through and through ical path whereas paths are the longest off-critical paths (both as well as carry propagations are detected by this pre-decoding logic). The adder supply is scaled down while keeping the frequency same such that the longest off-critical paths can be computed without any delay failure. For other longer paths, the decoder output is automatically asserted and a clock stretching operation is performed to avoid timing failure. Note that in this example, any path longer than 17 full adder delay is considered critical, and evaluated in stretched clock to avoid delay failure. Since the probability of any path being longer than 17 full adder delay is very small, this technique results in very small throughput penalty. For the sake of convenience, we use the term short paths or short latency paths to refer to longest off-critical paths. The critical paths will be referred as long paths or long latency paths. The adaptive clock stretching operation in adders is further elucidated in Fig. 1(b) with the help of timing diagram for three pipelined instructions. Let us assume that out of these three instructions, the second instruction activates the long path. Therefore, adaptive clock stretching should be performed during the execution of second instruction for correct functionality of the pipeline. The regular clock and adaptive clock is shown in Fig. 1(b) for the sake of clarity. Note that, the second instruction is fired at cycle-2 but evaluated in cycle-4 using the adaptive clock. This is achieved by gating the clock edge in cycle-3 based on the output of critical path prediction logic. The gated clock in Fig. 
1(b) suggests that we are essentially knocking off clock pulses "occasionally" to avoid failures at the rated clock frequency. Although better long latency path prediction can be made and the performance penalty reduced by decoding more bits, the power/area overhead increases with decoder size (Section VI-B). Based on our simulations, 6–10 bit decoding is optimal; therefore, in the rest of the paper, we decode 8 bits from the middle of the adder for analysis and simulation purposes. Fig. 2 shows the implementation details of the latency predictor block (LPB) that decodes the intermediate inputs to predict whether the current computation requires clock stretching. The D flip-flop shown is necessary because the LPB needs to remember the nature of the input latency in the previous clock cycle in order to generate the correct value of enable in the current


Fig. 2. Latency predictor block (LPB).

cycle. One of the aspects of the LPB that needs special mention is the use of a negative D latch along with a positive edge triggered D flip-flop. The circuit (shown in Fig. 2) allows computation of the enable signal before the next set of inputs arrives (at the next rising edge of the clock), an aspect which is extremely critical for the success of our design methodology. The value of enable latched in the negative clock cycle is used to determine whether the output register will be written at the next rising edge of the clock or be delayed by one clock cycle to implement clock stretching. Disabling of the write operation to the input and output registers is achieved by clock gating as shown in Fig. 1(b). While considering the predictor block, we make the important assumption that the latency detection circuitry is designed to be process tolerant by proper sizing of the transistors.
B. Voltage Scalable Variable Latency Adder Topologies
Fig. 3 shows the 32-bit C²SA, BK, and QTA adders with their long latency path and two short latency paths that determine supply voltage scaling. The long latency path is shown with a bold line whereas the short latency paths are shown with dashed lines. In the C²SA [2], cascading is done by dividing the 32 bits into chunks of {2, 2, 3, 4, 5, 2, 2, 3, 4, 5}. The partial sum is computed in parallel for a carry-in of 0 as well as a carry-in of 1 using RCAs. Next, the multiplexers select the appropriate sum based on the actual carry. Fig. 3(a) shows a 32-bit C²SA. In the tree adders [Fig. 3(b) and (c)], black squares denote the computation of propagate and generate (p, g) whereas grey squares denote computation of generate (g) only. The buffers are denoted by empty triangles. In the 32-bit BK adder [Fig. 3(b)], the carry propagation from the least significant bit to bit-31 forms the long latency path. The pre-decoder determines whether the carry will propagate between bit-16 and bit-23. If the carry path is broken [as shown by the dashed carry path in Fig. 3(b)], then the upper sum bits will be computed at the same time as the lower ones, and these paths will form the short latency paths. In the QTA, the pre-decoder likewise predicts the carry propagation between bit-16 and bit-23. In case the carry is not propagated, the lower and upper portions will be computed independently of each other and will form the short latency paths.
Intuitively, the RCA is expected to allow better supply voltage scaling because of the large timing slack present between its long latency and short latency paths. However, the speed of the adder itself is low. Tree adders, for example KS, are fast because they try to compute all paths in parallel. But in the process, they also reduce the timing slack between long and short latency paths. Further, the dense routing wires increase area as well as delay. Sparse tree adders like BK and QTA trade off the area with

Fig. 3. Long and short latency paths of various 32-bit adders. (a) C²SA. (b) Brent–Kung (the shaded bits have been pre-decoded for prediction of critical path activation). The dashed carry path between bit-15 and bit-23 illustrates the splitting of the long latency path by using the pre-decoding logic. The pre-decoder is not shown in the figure. (c) Long and short latency paths of the 32-bit QTA.

speed and also reduce the wiring overhead. The delay of the critical path also determines the adder's tolerance to process variation. Longer paths may experience less variation compared to shorter paths due to the cancellation effect (i.e., the average current drawn by the logic gates in a longer path remains the same under intra-die process variation). For analysis and comparison of the adders, we experimented with 32-bit and 64-bit RCA, C²SA, BK, and QTA adders synthesized


Fig. 4. Comparison of variable latency adders in terms of: (a) power dissipation, (b) % power saving, and (c) power–delay product (PDP).

Fig. 5. Hybrid adders. (a) RCA. (b) C²SA. The long and short latency paths are shown by solid and dashed lines, respectively.

using the Synopsys Design Compiler [12]. The simulation is done using HSPICE with BPTM 70 nm [13] devices. The process variation is modeled as lumped Vth variations due to inter- and intra-die process fluctuations. The (mean, sigma) of the inter-die and intra-die variation is taken to be (0, 40 mV) and (0, 20 mV), respectively. The Vth of a transistor is given by the sum of the nominal Vth and the change in Vth due to inter- and intra-die process variations. The operating frequency of the adders at nominal supply (1 V) is chosen such that the long paths meet a 95% yield target. In order to determine the yield, we run 5000 Monte Carlo simulations of the long latency critical paths. Supply voltage scaling is performed so that the long paths meet 100% yield while the short latency paths meet 95% yield with respect to their delay targets. As discussed earlier, adaptive clock stretching is used with prediction logic to avoid delay failures at reduced supply voltage.
C. Comparison in Terms of Power Dissipation
For power estimation, we applied a set of 1000 random test patterns to the adders to compute the average power using Nanosim [14]. Fig. 4(a) shows the power dissipation at the nominal supply voltage as well as at reduced supply voltage. At reduced supply, we maintain 95% yield with respect to the short latency path delay and 100% yield with respect to the long latency path delay. The adder is operated with variable latency using adaptive clock stretching to avoid delay failures at reduced supply. For the sake of comparison, we also plot the percentage saving in power for these adders [Fig. 4(b)]. The important points from this figure are: (a) the C²SA consumes the highest power due to its large area and switching capacitance, while the RCA consumes the smallest power; (b) the percentage power saving also increases as we move towards the slower adders. Fig. 4(c) shows the power-delay product (PDP) of the adders. The faster adders consume large power, resulting in increased

PDP whereas the slower adders consume less power, leading to a small PDP. The plot suggests that BK can be the best choice for low power with variable latency operation. The QTA is fast but consumes more power. The RCA is at one extreme with good robustness and reasonable PDP; however, it is too slow to be practical for high-performance ALUs. The C²SA is the fastest among the adders considered here; however, it is not practical for low-power ALU design because of its large power dissipation (even at scaled supply and variable latency operation).
III. HYBRID ADDERS
In the previous section, we presented an analysis of different adder topologies for low power and variation tolerance at scaled supply with clock stretching. In this section, we present hybrid adders that can either be used to improve the yield at scaled supply or to scale the supply voltage further at iso-yield.
A. Basic Idea
From Section II, it can be noted that the timing slack of the short paths is exploited to scale down the supply voltage while maintaining the required yield. Based on this observation, we propose hybrid adders that increase the timing slack of the short paths by making them faster. Note that this is counter-intuitive because, conventionally, the critical paths are optimized to make them faster for better yield. However, in this case we optimize the off-critical (short latency) paths because the yield and power saving at scaled supply voltage are determined by these paths. Fig. 5(a) shows the basic strategy for hybrid adder design using the RCA as an example, where the middle portion of the adder is replaced with a fast adder topology (Kogge–Stone in this case). The main idea is to compute the intermediate carries faster using a faster adder topology and create room for further voltage scaling. The middle bits are chosen because these bits


Fig. 6. (a) Comparison of standard adder (at nominal voltage as well as at low voltage using adaptive clock stretching) and hybrid adders in terms of power and operating voltage. (b) Comparison of extra power savings with 32-bit and 64-bit hybrid adders. (c) Area overhead.

are common to both sets of short paths [i.e., carries generated from the LSBs and ending in the middle, and carries generated from the middle and ending at the MSBs, as shown in Fig. 5(a)].
B. Design of Hybrid Adders
The following points should be noted for hybrid adder design: (a) the adder topology (for fast computation of intermediate carries) should be selected carefully because it increases the area overhead; (b) more timing slack can be obtained by making a larger number of intermediate carries faster, but this also increases the area overhead; and (c) the benefit of hybrid adder design diminishes if the original adder itself is very fast. Based on the aforementioned observations, we designed hybrid RCA and C²SA. The hybrid designs for a 32-bit adder width are shown in Fig. 5(a) and (b). In the RCA, the middle 8 bits (i.e., bit-12 to bit-19) are implemented with a fast adder to create more timing slack in the short latency paths. An 8-bit KS adder is used for this purpose. In the C²SA, we implemented the 9 intermediate bits (bit-11 to bit-19) with a KS adder, as shown in Fig. 5(b). This is done owing to the irregular bit partitioning of the C²SA. It is worth mentioning that, in the process of optimization, the hybrid adder becomes faster than the conventional adder because the intermediate carries are also accelerated. For example, the RCA becomes faster because the middle 8 bits are accelerated. This is certainly an added advantage because, for the iso-frequency scenario, this extra slack in the overall adder delay can either be used to improve the yield or to scale down the supply voltage further for power savings.
C. Simulation Results
For simulation, we follow a setup similar to that explained in Section II-B. The experiments have been done on both 32-bit and 64-bit RCA and C²SA. Note that, for the 64-bit hybrid RCA, we implemented the middle 8 bits with a fast adder to speed up the short latency paths.
However, for the 64-bit C²SA, we replaced the middle 10 bits with a KS adder (to maintain the uniform structure of the adder). The supply voltage reduction of the hybrid adders is performed in a conservative manner to ensure that 95% yield is maintained at scaled supply (at the rated frequency with adaptive clock stretching). For the sake of comparison, in Fig. 6(a) we plot the power savings of both the standard adder (at nominal supply as well as scaled supply with adaptive clock stretching) and the hybrid adder (at scaled supply with adaptive clock stretching). In the hybrid adders, the supply voltage can be

reduced further while maintaining similar or better yield. The operating voltages of the standard and hybrid adders are also shown in Fig. 6(a). The extra power saving of the 32-bit and 64-bit hybrid adders at reduced supply (compared with the standard variable latency adder at low voltage) is presented in Fig. 6(b). Simulations show that for 32-bit adders, 30%–50% extra power saving can be obtained using the hybrid design. For 64-bit adders, the power saving varies between 18%–20%. This is due to the fact that only 8 intermediate bits have been optimized for speed. More power saving can be obtained by optimizing 16 intermediate bits (at the cost of more area overhead). The power saving in the hybrid adders comes at the price of area overhead. Fig. 6(c) shows the area overhead for the 32-bit and 64-bit adders. For the adder examples illustrated in this paper, the overhead is within 10%. Note that the area overhead presented here does not account for the decoder overhead (for prediction of long latency path activation) since it is common to both the conventional and hybrid adders. Furthermore, the area overhead and power saving in hybrid adders can vary depending on the implementation choice of the intermediate 8 bits. For example, a KS implementation can be more area intensive but may provide large slack, whereas BK/HC can be balanced in area but provide less timing slack. At scaled supply, the throughput penalty of the standard adders and the hybrid adders is the same. This is because the signal probabilities of the primary inputs are the same and the same number of bits is decoded for adaptive clock stretching. In order to determine the throughput penalty due to adaptive clock stretching, we employed the adders in the execution stage of a five-stage pipelined processor. Simulations with the SPEC2000 benchmark programs show a throughput loss of 3% (this is discussed in Section VI).
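The slack created by the hybrid structure can be illustrated with a unit-delay model. This is a deliberate idealization we assume for illustration: each ripple bit costs one full-adder delay and the KS block is charged a log2-depth number of levels; real circuit delays will differ.

```python
from math import ceil, log2

def short_path_delay(n: int = 32, window=(12, 19), hybrid: bool = False) -> int:
    """Longest single-cycle (off-critical) path in full-adder delay units,
    for the short path that ripples from bit 0 to the end of the decoded
    middle window. In the hybrid adder the window is a KS block, charged
    an idealized ceil(log2(width)) levels instead of one unit per bit."""
    lo, hi = window
    if not hybrid:
        return hi + 1                       # ripple through bits 0..hi
    return lo + ceil(log2(hi - lo + 1))     # ripple to the block, then KS levels
```

Under this toy model, replacing bits 12–19 of a 32-bit RCA shortens the short path from 20 to 15 full-adder delays; slack of that kind is what permits the additional supply scaling.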
An important point to note is that, while designing the decoding circuitry for fast adders, we make sure that the delay of decoding is less than the short latency path delay by proper sizing of the decode logic. This ensures that the decision to stretch the clock period is taken beforehand.
IV. LOW POWER MULTIPLIERS AT SCALED SUPPLY
We have also applied the adaptive clock stretching technique to two classes of multipliers, namely the carry-save multiplier (CSM) and the Wallace tree multiplier (WTM) [11]. In Fig. 7, we show an N × N CSM and its critical paths. The methodology involved in splitting the critical path in the case of the multipliers is slightly different from that of the adders. For the adders considered


Fig. 7. Long and short latency paths of (a) N × N carry-save multiplier, and (b) 6 × 6 Wallace tree multiplier.

in Section II, a few middle order primary input bits were the inputs to the LPB. However, finding a low-complexity decoder that can predict in advance the long or short latency operation of the multiplier is extremely difficult due to the large overhead associated with a hardware implementation of such a decoder. Hence, we relax the constraint that the inputs to the LPB should be chosen from the primary inputs. Instead, we allow the LPB inputs to be the intermediate stage outputs of the multiplier. In the two types of multipliers that we considered, CSM and WTM, the final stage of multiplication consists of a vector merging adder (VMA). In both multiplier architectures, we consider the VMA to be implemented as a C²SA. We split the critical path in the multiplier using the internal bits which are inputs to the VMA. The resulting short latency paths (SLP1, SLP2) are shown in Fig. 7. Multipliers have many paths of similar delays, and under variations any one of these paths may become critical. Hence, we make the assumption that all potential critical paths in the presence of variability share the common carry propagation path of the VMA. The critical path delay (or the long latency path) is equal to the sum of the path delay through the non-VMA part of the circuit and the delay through the VMA. Hence, by breaking the critical path in the adder stage, we take care of the paths that have a probability of becoming critical under variations. Since we use the VMA inputs as the inputs to the LPB, there is a probability that sufficient time may not be available for the enable to be computed by the time the falling edge of the clock arrives. We circumvent this problem by using a negative D latch along with the positive edge triggered D flip-flop (instead of a negative edge triggered D flip-flop) in the LPB, as shown in Fig. 2. Our technique is also applied to the WTM, shown in Fig. 7(b). The WTM consists of a tree part and a VMA part. The tree part consists of several stages, each of which contributes an adder delay to the critical path. The critical path delay is thus the sum of the delays of the tree stages and the delay through the VMA. For instance, in a 16 × 16 WTM (using 3:2 compressors), there are 6 stages in the tree part and a 27-bit VMA. Hence, the critical path delay is the sum of six adder delays and the worst case carry propagation delay of the VMA. We note that there is great potential to be exploited in the case of the WTM because of the relatively large size of its VMA. This gives us greater scope for Vdd scaling, resulting in power savings. In Fig. 7(b), a 6 × 6 WTM, showing the tree part

and the VMA part, is illustrated. In a manner similar to the splitting of critical paths in the CSM, the critical path in the WTM is cut into SLP1 and SLP2, as shown in Fig. 7(b). The slack between the long latency path (LLP) and the maximum of SLP1 and SLP2 can be used for supply voltage scaling, as explained in Section II-A.
A. Simulation Results
In this section, we compare multipliers (12- and 16-bit CSM and 8-, 12-, and 16-bit WTM) implemented in the conventional and our proposed design. All the arithmetic units mentioned above were implemented in 70 nm BPTM technology [13]. The metrics used for comparison were parametric yield improvement, power dissipation, energy per computation (EPC), area overhead, and throughput penalty. We used VHDL to design the multipliers. The VHDL code was synthesized using the Synopsys Design Compiler [12]. In order to obtain the parametric yield in the presence of process variations, we ran Monte Carlo simulations in HSPICE, assuming a Gaussian Vth variation distribution of zero mean and a standard deviation of 40 mV. The power dissipation results were obtained by simulating 1000 random input vectors in Nanosim. In all the arithmetic units considered, the yield of our proposed design was found to be 100% under a nominal supply voltage of 1 V. This can be attributed to the fact that the short latency paths under variations do not exceed the one clock cycle bound, and the long latency path, if activated, is evaluated by the adaptive clocking scheme. In our simulations we consider two iso-yield conditions: 1) when the proposed design is operated at 1 V and 2) when the conventional design is operated at 1 V. In order to have an iso-yield of approximately 100% when the proposed design is operating at 1 V, the conventional design had to be operated at 1.1 V, which gives 24% and 29% power savings for the CSM and WTM, respectively, as shown in Fig. 8(a) and (b). However, for the conventional design operating at 1 V, if a certain yield target is desired (CSM yield 93%, WTM yield 96%), then the proposed design can operate at a lower supply. Table I shows the percentage EPC savings. Fig. 8(c) and (d) show the power dissipation under iso-yield conditions for the conventional design operating at 1 V and the proposed architecture at a scaled down supply (to meet the same yield target). An interesting observation is that the percentage power savings increase with an increase in the number of bits in the multipliers.
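The six-stage count for the 16 × 16 WTM follows from the 3:2 reduction recurrence, which can be checked with a short script (a sketch assuming ideal 3:2 compression at every stage):

```python
def wallace_stages(rows: int) -> int:
    """Number of 3:2-compressor stages needed to reduce `rows` partial
    products down to the two operands consumed by the vector merging adder."""
    stages = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3   # each group of 3 rows becomes 2
        stages += 1
    return stages
```

For 16 partial products the reduction sequence is 16 → 11 → 8 → 6 → 4 → 3 → 2, i.e., six stages, matching the critical path accounting above.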


Fig. 8. Power consumption under iso-yield conditions for (a) CSM and (b) WTM; (c) power consumption under iso-yield conditions and area overhead for CSM; (d) % power saving and % area overhead for WTM.

Fig. 9. (a) Long and short latency paths of an N × N carry-save multiplier. (b) Extra power saving. (c) Area overhead.

TABLE I EPC SAVINGS UNDER SUPPLY SCALING (@ISO-YIELD)

This is due to the fact that, with an increase in the length of the critical path, more timing slack can be exploited, in terms of supply scaling, after splitting the critical path. For example, to have an iso-yield of 96%, the proposed 16-bit WTM can be operated at a scaled down supply of 0.85 V.
V. HYBRID MULTIPLIERS
In the previous section, we presented multiplier designs for low power, process tolerance, and adaptive clock stretching operation. In this section, we discuss the hybrid multiplier design for low power and adaptive clock stretching.
A. Basic Idea
The design of the hybrid multiplier is based on the concept of hybrid adders. Fig. 9(a) shows an N × N bit CSM. The first N − 1 rows are carry-save stages while the final row is the vector-merge stage. In this implementation, the carry-save stages are computed fast, whereas the vector-merge stage is implemented by a

ripple carry adder (RCA) to simplify the design. The complication with the CSM is that there are many critical paths of similar length. One of the possible critical paths is illustrated in Fig. 9(a) by a bold line. The vector merging RCA is replaced with the hybrid RCA discussed in the previous section to speed up the off-critical paths [shown by the dashed line in Fig. 9(a)]. Note that, if the vector merge stage is implemented with any other adder topology (e.g., C²SA), then the corresponding hybrid adder design can be used to optimize the short latency paths.
B. Simulation Results
We performed simulations on both 32 × 32 and 64 × 64 bit carry-save multipliers. The vector merge adder is implemented with a hybrid RCA in which the middle 8 bits of the RCA (bits [12:19] for the 32-bit RCA and bits [28:35] for the 64-bit RCA) are accelerated using a KS adder. The simulation is performed using BPTM 70 nm devices with the test setup described in Section II-B. The supply voltage scaling is done such that both the conventional and hybrid multipliers maintain a yield target of 95% for the short latency paths and 100% for the long latency path under scaled supply voltage. The power dissipation is estimated using Nanosim for a set of 200 random patterns. The simulation results, shown in Fig. 9(b), indicate that 15%–25% extra power saving can be gained by using the hybrid multiplier. The extra area overhead is found to be only 0.2% [Fig. 9(c)].
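The Monte Carlo yield procedure used throughout these experiments can be mimicked with a toy model. Hedged: the linear delay-versus-Vth sensitivity `k` and the unit gate delays below are illustrative assumptions of ours, not BPTM-calibrated values.

```python
import random

def path_delay(gate_delays, inter_sigma=0.040, intra_sigma=0.020, k=0.5):
    """One Monte Carlo sample of a path delay under lumped Vth variation:
    a die-wide inter-die shift plus an independent intra-die shift per gate.
    Assumed first-order model: each gate's delay grows linearly with its
    Vth shift (sensitivity k per volt) -- an illustrative simplification."""
    dv_inter = random.gauss(0.0, inter_sigma)           # shared across the die
    total = 0.0
    for d in gate_delays:
        dv = dv_inter + random.gauss(0.0, intra_sigma)  # per-gate component
        total += d * (1.0 + k * dv)
    return total

def yield_estimate(gate_delays, t_clk, runs=5000):
    """Fraction of Monte Carlo samples whose path delay meets the clock period."""
    ok = sum(path_delay(gate_delays) <= t_clk for _ in range(runs))
    return ok / runs
```

Note how the intra-die terms partially average out along a long path while the inter-die term does not, which is the cancellation effect invoked for long paths in Section II.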


Fig. 10. (a) Throughput penalty versus number of decoding bits. (b) Throughput penalty for 8 decoding bits.

TABLE II PROCESSOR CONFIGURATION

Fig. 11. Throughput penalty versus area overhead and # of decoding bits (WTM 16 bits).

VI. DESIGN ISSUES AND CHALLENGES

In this section, we discuss some of the design challenges and issues involved in designing low-power hybrid arithmetic units using adaptive clock stretching.

A. Throughput Penalty

To determine the actual throughput penalty from a system-level perspective, we incorporated the proposed arithmetic units in a five-stage DLX pipeline. The throughput penalty was assessed by running SPEC2000 [15] benchmarks in the SimpleScalar [16] simulator. The results in Fig. 10(a) show an average throughput penalty of 3.03% with 10-bit decoding and less than 8% with 8-bit decoding. We also performed comprehensive simulations on the SPEC CINT 2000 benchmarks with "ref" inputs to determine the throughput loss with 8-bit input decoding. The configuration of the simulated processor is described in Table II. To apply the rare two-cycle operations to adders and multipliers, we considered all instructions that use them. If a long-latency add operation is detected, the adder is occupied for two cycles and the result is broadcast on the operand buses at the end of the second cycle. Similarly, a multiplication is normally executed in three clock cycles, whereas for long-latency operations (note that they are rare) the computation completes and the result is broadcast after six clock cycles. The simulation results [shown in Fig. 10(b)] indicate a maximum throughput loss of 2%.
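The penalties above come from cycle-accurate SimpleScalar simulation; as a rough sanity check, the first-order relationship between the frequency of long-latency operations and throughput loss can be sketched as follows. This is a hypothetical back-of-the-envelope model (the `op_fraction` value is an assumption, not a number from the paper).

```python
def throughput_penalty(p_long, extra_cycles, base_cpi=1.0, op_fraction=0.3):
    """First-order throughput-penalty estimate (illustrative only).

    p_long       : fraction of add/multiply operations flagged long-latency
    extra_cycles : additional cycles a long-latency op occupies the unit
    op_fraction  : assumed fraction of dynamic instructions using the unit
    base_cpi     : baseline cycles per instruction
    """
    extra = op_fraction * p_long * extra_cycles  # added cycles per instruction
    return extra / base_cpi                      # relative throughput loss

# e.g., if 5% of adds are flagged long-latency, each costing 1 extra cycle:
penalty = throughput_penalty(p_long=0.05, extra_cycles=1)  # 0.015, i.e., 1.5%
```

The model makes the key point of the section explicit: as long as long-latency operations remain rare, the throughput cost of occasional clock stretching stays in the low single digits, consistent with the measured 2%–3% penalties.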

B. Impact of the Number of Decoding Bits

Increasing the number of decoding bits reduces the throughput penalty because activation of the critical path can be predicted more accurately. However, the complexity and area overhead of the decode logic also increase. Fig. 11 shows this tradeoff between throughput penalty and area overhead for a 16-bit WTM. From the graph, we observe that the throughput penalty decreases as the number of decoding bits increases. Considering this tradeoff, we use 8-bit decoding throughout this work.

C. Impact of the Location of the Decoding Bits

If the decoding (for critical-path prediction) is done on the LSB side, the timing slack for supply scaling is reduced, since it is limited by the path that starts at the decoded bits and ends at the MSB. The reverse is true if the decoding is done on the MSB side. Therefore, to maximize the timing slack for supply scaling, we decode the middle 8 bits for critical-path prediction.

D. Requirement of Level-Converting Flops

Note that our analysis of voltage-scalable arithmetic units (standard as well as hybrid) does not consider level conversion at the low-voltage/nominal-voltage interface. We propose to use low-overhead level-converting flops [11] to minimize the area/power/delay overhead.

E. Modifications in the Design Flow

The application of low-voltage execution units also requires modifications to the standard design/verification and manufacturing test processes. The RTL development will require the addition and cost analysis of the pre-decoding logic and the clock-gating logic. For hybrid execution-unit design, the RTL should be modified to incorporate the fast architecture appropriately. The timing analysis and synthesis should consider the timing of both long- and short-latency paths to estimate the clock frequency under a yield constraint. The RTL verification requires adding extra test patterns to the delay-test suite to verify the decoding logic and the long- and short-latency paths.
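The middle-bit decoding for critical-path prediction can be illustrated with a simple model: a long carry chain can only ripple through the decoded window if every bit position in it produces a propagate signal (a_i XOR b_i = 1). The sketch below is illustrative only and follows the input-based elastic-clocking idea the paper builds on [10]; the exact decode logic in the paper may differ.

```python
def predict_long_latency(a, b, width=32, decode_bits=8):
    """Conservatively flag a possible long-latency (two-cycle) add.

    Decodes the middle `decode_bits` of the operands (e.g., bits
    [12:19] of a 32-bit add). If all positions in that window would
    propagate a carry, the long carry chain may be exercised, so the
    operation is flagged and the clock is stretched.
    """
    lo = (width - decode_bits) // 2            # window start, e.g., bit 12
    mask = ((1 << decode_bits) - 1) << lo      # middle-bit window
    propagate = (a ^ b) & mask                 # per-bit propagate signals
    return propagate == mask                   # all-propagate => flag long path
```

Decoding the middle window, rather than the LSB or MSB side, balances the two residual paths (LSBs up to the window, and window through the MSB), which is exactly the slack-maximization argument made above.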


VII. CONCLUSION

Variable-latency functional units using adaptive clock stretching allow aggressive scaling of the supply voltage while maintaining the rated frequency with little performance degradation. We developed and explored various adder and multiplier topologies/architectures that are amenable to aggressive supply voltage scaling with clock stretching while maintaining high yield and frequency. We also proposed hybrid adder designs (mixing a small amount of fast arithmetic into the slower ones) that can be used to improve yield or to scale the supply voltage further. We demonstrated that the hybrid designs indeed make the adders supply-voltage scalable with improved tolerance to process variation.

REFERENCES

[1] H. Suzuki, W. Jeong, and K. Roy, "Low power adder with adaptive supply voltage," in Proc. Int. Conf. Computer Design, 2003, pp. 103–106.
[2] Y. Chen, H. Li, K. Roy, and C. Koh, "Cascaded carry-select adder (C²SA): A new structure for low-power CSA design," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), 2005, pp. 195–200.
[3] S. K. Mathew, M. Anders, B. Bloechel, T. Nguyen, R. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 44–51, Jan. 2005.
[4] J. Kao, M. Miyazaki, and A. Chandrakasan, "A 175-mW multiply-accumulate unit using an adaptive supply voltage and body bias architecture," IEEE J. Solid-State Circuits, vol. 37, pp. 1545–1554, 2002.
[5] V. G. Oklobdzija, B. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, "Comparison of high-performance VLSI adders in energy-delay space," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 754–758, Jun. 2005.
[6] P. M. Kogge and H. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[7] R. P. Brent and H. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[8] T. D. Han and D. A. Carlson, "Fast area-efficient VLSI adders," in Proc. IEEE Symp. Computer Arithmetic, 1987, pp. 49–55.
[9] R. Woo, S. J. Lee, and H. J. Yoo, "A 670 ps, 64 bit dynamic low power adder design," in Proc. Int. Symp. Comput. Architect., 2000, pp. 128–131.
[10] D. Mohapatra, G. Karakonstantis, and K. Roy, "Low-power process-variation tolerant arithmetic units using input-based elastic clocking," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), 2007, pp. 74–79.
[11] J. Rabaey, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2003.
[12] Synopsys Design Compiler. [Online]. Available: www.synopsys.com
[13] BPTM 70 nm: Berkeley Predictive Technology Model. [Online]. Available: www.eas.asu.edu/~ptm
[14] SPEC 2000 Benchmarks. [Online]. Available: www.spec.org
[15] Simplescalar Tool Set. [Online]. Available: www.simplescalar.com
[16] Synopsys Nanosim. [Online]. Available: www.synopsys.com

Swaroop Ghosh received the B.E. degree (Hons.) in electrical engineering from the Indian Institute of Technology, Roorkee, India, in 2000, the M.S. degree from the University of Cincinnati, Cincinnati, OH, in 2004, and the Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN. He is a Senior Design Engineer in the Advanced Memory Design Group, Intel, Hillsboro, OR. From 2000 to 2002, he was with Mindtree Technologies Pvt. Ltd., Bangalore, India, as a VLSI Design Engineer. He spent the summer of 2006 in Intel's Test Technology group and the summer of 2007 in AMD's design-for-test group. His research interests include low-power, adaptive circuit and system design, fault-tolerant design, and digital testing for nanometer technologies.

Debabrata Mohapatra received the B.Tech. degree (Hons.) in electrical engineering with a minor in electronic communication engineering from the Indian Institute of Technology, Kharagpur, India, in 2005. He is currently working toward the Ph.D. degree at the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN. He spent the summer of 2008 in Qualcomm’s memory design group. His research interests include low-power, process-variation-aware circuit and system design for multimedia hardware in nanometer technologies.

Georgios Karakonstantis (S’05) received the Diploma degree in computer and communications engineering from the University of Thessaly, Volos, Greece, in 2005. He is currently working toward the Ph.D. degree at the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN. In summer 2008, he was with the Advanced Technology Group, Qualcomm Incorporated, San Diego, CA. His research interests include algorithm/architecture co-design for low-power and process-variation-tolerant digital signal processing circuits and wireless communication systems.

Kaushik Roy (S’83–M’95–SM’95–F’02) received the B.Tech. degree in electronics and electrical communications engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree from the Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, in 1990. He was with the Semiconductor Process and Design Center, Texas Instruments, Dallas, where he worked on field-programmable gate array architecture development and low-power circuit design. He joined the Electrical and Computer Engineering faculty of Purdue University, West Lafayette, IN, in 1993, where he is currently a Professor and holds the Roscoe H. George Chair of Electrical and Computer Engineering. His research interests include very-large-scale-integration (VLSI) design/computer-aided design for nanoscale silicon and non-silicon technologies, low-power electronics for portable computing and wireless communications, VLSI testing and verification, and reconfigurable computing. He has published more than 400 papers in refereed journals and conferences, holds eight patents, and is a coauthor of two books on low-power CMOS VLSI design. Dr. Roy received the National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the AT&T/Lucent Foundation Award, the 2005 SRC Technical Excellence Award, the SRC Inventors Award, Best Paper Awards at the 1997 International Test Conference, the 2000 IEEE International Symposium on Quality of IC Design, the 2003 IEEE Latin American Test Workshop, the 2003 IEEE Nano, the 2004 IEEE International Conference on Computer Design, and the 2006 IEEE/ACM International Symposium on Low Power Electronics and Design, as well as the 2005 IEEE Circuits and Systems Society Outstanding Young Author Award (with Chris Kim) and the 2006 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS Best Paper Award. He is a Purdue University Faculty Scholar, the Chief Technical Advisor of Zenasis Inc., and a Research Visionary Board Member of Motorola Laboratories (2002).
He has been on the editorial boards of IEEE Design and Test, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He was a Guest Editor for Special Issues on Low-Power VLSI in IEEE Design and Test (1994), the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS (June 2000), and IEE Proceedings–Computers and Digital Techniques (July 2002).
