Design of Multi-stage Latency Adders Using Detection and Sequence-Dependence between Successive Calculations

Xinghua Yang, Fei Qiao, Chang Liu, Qi Wei, Huazhong Yang
Institute of Circuits and Systems, Dept. of Electronic Engineering, Tsinghua University
Tsinghua National Laboratory for Information Science and Technology
Beijing 100084, P.R. China
Email: {yang-xh11, changliu11}@mails.tsinghua.edu.cn, {qiaofei, weiqi, yanghz}@tsinghua.edu.cn

Abstract: Multi-stage latency adders based on different prediction schemes have proved promising for enhancing circuit performance with negligible overhead. This paper presents a novel predictor that exploits both the detection of the carry-kill pattern and the sequence dependence between successive calculations. Detecting the carry-kill pattern of the input data lowers the probability of multi-cycle operation, and the sequence dependence between successive calculations is used to eliminate redundant cycles. The improved predictors have been inserted into a Ripple Carry Adder (RCA) and a multi-stage latency structure has been set up. Compared with previous predictors, the proposed one provides the same function with fewer prediction bits, which makes it more energy-efficient. Simulation results show that speedups of 2.41X-3.05X are achieved over the non-prediction counterpart. Furthermore, a design flow and an error control method are proposed for applying the adder to approximate computation, so that more performance improvement can be obtained by trading off some precision.

I.

INTRODUCTION

The operating frequency of a digital logic circuit is determined by its critical path. It can be increased by conservative approaches such as raising the supply voltage or resizing key logic gates, but at the expense of more power and area in practical designs [1-2]. These conventional methodologies are approaching their limit as CMOS technology scales down to the nanometer regime with an inevitably growing effect of process variation, which makes it difficult to improve the performance of arithmetic units under low operating voltage without incurring errors and with little overhead.

Given this restriction on further improving the performance of digital circuits, multi-stage latency adders based on different prediction methods have been proposed. The adder with elastic clocking in [3] targets low-power, process-variation-tolerant units and can also improve adder performance. Its prediction method uses only the current input data, so a large number of redundant cycles are generated when the data to be processed change slowly. Moreover, the design constraint that five bit-pairs (Ai, Bi) of the input data must be used for the prediction block makes it impractical to design short bit-length units or to extend the scheme to a multi-stage structure. In [4-6], a speculative function is proposed that allocates extra cycles to the adder when the prediction is wrong, and the effect of sequence dependence between successive calculations is considered. However, because the carry-kill pattern (Ai=Bi) of the input data is not detected, the probability of recalculation is high and part of the performance improvement is offset. To the best of our knowledge, no prior work has jointly used the detection of the carry-kill pattern of the input data and the dependence between successive calculations in the predictor and realized both in a corresponding circuit.

In this paper, we propose a multi-stage latency adder based on an improved prediction method. The contributions are as follows:

• The improved predictor takes fewer bits than [3] and is shown to be more suitable for multi-stage latency adders.

• The detection of the carry-kill pattern of the input data and the effect of sequence dependence between successive calculations are used jointly in the proposed predictor, resulting in a larger performance improvement.

• Speedups of 2.41X-3.05X over the non-prediction adder are achieved with reasonable area overheads, and an error control method is proposed for applying the adder to approximate computation, which obtains more performance improvement by trading off some precision.

The remainder of this paper is organized as follows: the proposed adder is illustrated in Section II. The design flow and the error control method for approximate computation are shown in Section III. Simulation results are presented in Section IV and conclusions are drawn in Section V.

II.

MULTI-STAGE ADDER BASED ON IMPROVED PREDICTOR

A. Related Work

The technique of Elastic Clocking based on Input Prediction (ECIP) in [3] is proposed for low-power and variation-aware design, and it can also improve performance at nominal supply voltage. Since the activation of the critical path in a Ripple Carry Adder (RCA) depends on a certain input pattern and this kind of activation is rare, a prediction function is proposed that exploits this feature to divide the whole critical path into two short-latency paths, as shown in Fig 1(a). The prediction block is based on (1):

$$P(L) = P\big((A_7 \oplus B_7) \cap \cdots \cap (A_{10} \oplus B_{10}) = 1\big) \tag{1}$$

With this prediction block, the operating frequency is determined by the latency of the short paths. When no carry-kill pattern (Ai=Bi) is detected, one more cycle is granted to the RCA through clock gating by a D flip-flop. However, because the sequence dependence between successive calculations is not used for the prediction, there are two limitations when this technique is extended to a multi-stage prediction adder. First, the number of bit-pairs (Ai, Bi) used for prediction has to be at least five in order to keep the activation probability low enough, as demonstrated in [3]. The margin for increasing the frequency of the adder therefore shrinks, since more bit-pairs lead to more latency in the short paths. Second, a large number of redundant cycles are generated when the input data change slowly, because the technique ignores the correlation among consecutive computations.

Fig. 1. Predictors from Previous Design in [3] and [6]: (a) ECIP; (b) 1-bit predictor (MSADD)

Fig. 2. Sequence Dependence between Successive Calculations [8]

Fig. 3. The Proposed Multi-stage Latency Adder: (a) Scheme of the Proposed Predictor; (b) General Structure of Multi-Stage Adder

This work was supported by National Science Foundation for Young Scientists of China (Grant No. 61306029) and National High Technology Research and Development Program of China ("863" program, Grant No. 2013AA014103).

978-1-4799-3432-4/14/$31.00 ©2014 IEEE

In [6], the Multi-Speculative Adder (MSADD) is proposed based on [10], and five different predictors are tested. Among them, the 1-bit predictor (1P), shown in Fig 1(b), takes advantage of the dependence between successive calculations and is used to speculate on the carry signals. More cycles are allocated to the adder if a misprediction occurs. However, redundant cycles also exist in this design: when part of the critical chain has already been broken (Ai=Bi), the speculation-correction process is unnecessary. The other four predictors, which estimate the carry from part of the current input data, also ignore this effect.

In both [3] and [6], no prediction method considers the detection of the carry-kill pattern (Ai=Bi) and the effect of dependence between successive calculations jointly. We have found that detecting the carry-kill pattern decreases the probability of multi-cycle operation, and that the sequence dependence can be exploited to eliminate redundant cycles and further improve performance.
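The prediction condition in (1) can be illustrated with a short Python sketch. It is not taken from [3]; the function names, the 16-bit operand width and the trial count are assumptions made for illustration. For independent, uniformly random operands, a window of k bit-pairs satisfies the condition with probability 2^-k, which is why fewer watched bit-pairs mean more frequent activation of the long path.

```python
import random


def window_propagates(a: int, b: int, lo: int, hi: int) -> bool:
    """True when A_i XOR B_i = 1 for every bit position lo <= i <= hi.

    This is the event tested in equation (1): no watched bit pair has
    A_i = B_i, so a carry could ripple through the whole window.
    """
    width = hi - lo + 1
    mask = (1 << width) - 1
    return (((a ^ b) >> lo) & mask) == mask


def activation_rate(lo: int, hi: int, bits: int = 16,
                    trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(L) for uniformly random operands.

    The operand width, trial count and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.getrandbits(bits)
        b = rng.getrandbits(bits)
        hits += window_propagates(a, b, lo, hi)
    return hits / trials


if __name__ == "__main__":
    # Bits 7..10 as written in (1): four pairs, so the estimate should be near 1/16.
    print(activation_rate(7, 10))
```

Under this model, each additional watched bit-pair halves the activation probability but lengthens the short path that sets the clock period.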


B. Sequence Dependence and Detection of Carry-Kill Pattern

Since successive data in an addition process are not independent, redundant cycles are generated when a computation that activates the long latency path does not actually need extra cycles [8], as shown in Fig 2. The second addition satisfies (1), so one more cycle is needed under the ECIP principle. However, one cycle is enough for the second addition if the first one is taken into account, because the carry-out bits at each position are identical in the first and second additions. This kind of sequence dependence is significant when the input data change slowly, a common situation when the data to be processed are collected from a sensor network for temperature monitoring or come from image processing. In MSADD [6], redundant cycles also exist because the carry-kill pattern of the input data is not detected, even though the prediction method uses the effect of sequence dependence in the processed data. Thus, in our design, both methods are combined in a corresponding circuit to eliminate redundant cycles. A detailed description of the proposed predictor is given in the next sub-section.

C. Proposed Circuit Architecture

In our design, the whole RCA is divided into several fragments with the improved predictor, as shown in Fig 3(a). The input of the Prediction Block (PB) consists of two kinds of signals derived from the original data. First, the current values of A4, A5 and B4, B5 are taken as the variables in Equation (1). Second, the last real value of the carry-out bit from FA5 to FA6 is preserved by a D flip-flop and fed back to be compared with the current carry-out bit to generate CompSignal, which is another input to the Prediction Block. With all of these signals, the Prediction Block outputs Enable and Carry_Sel. We take a two-stage latency adder as an example to illustrate the principle:

• If P(A4,5, B4,5) = 0, Enable is always logic '0' regardless of CompSignal, and Carry_Sel is high so that Carry_bit is Cout_Present. This indicates that the addition chain is broken, and the computation is executed in parallel in only one cycle.

• If P(A4,5, B4,5) = 1, the long latency path has been activated. During the first cycle, Carry_Sel is low so that Carry_bit selects Cout_Preserved, i.e. the last real carry-out bit. Before the next cycle arrives, once Cout_Present and CompSignal have settled, a CompSignal of logic '0' indicates that the guess is right, so Enable is logic '0' and the computation finishes in the first cycle. Otherwise, a CompSignal of logic '1' implies that the prediction is wrong, so Enable is set high before the next cycle and the registers are delayed by one cycle. During the second cycle, Carry_Sel is set high. Since the real value of Cout_Present has already settled in the first cycle, the computation finishes within these two cycles.

This improved predictor has been inserted into the RCA and a multi-stage scheme is set up as shown in Fig 3(b). The Enable signals from every predictor (PB) converge at the NOR gate to generate the Error signal. One cycle is enough for the computation if all Enable signals are logic '0'. Otherwise, two or more cycles are required to correct the misprediction.

It should be noted that in our design the final predictor consists of at most two bit-pairs from the input data and one feedback bit for speculation. This differs from the design constraint in [3] that at least five bit-pairs should be involved in equation (1) to guarantee a low performance penalty. Fig 4 shows the cycle consumption of the adder in [3] and of our proposed one, obtained by simulating $10^4$ data with uniform random distribution in Matlab. A two-stage structure is chosen for illustration, and the number of cycles for the adder in [3] decreases following $2^{-i}$, as shown in Fig 4(a).
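The two-stage principle above can be sketched as a behavioural Python model that counts cycles for a stream of additions. It is a simplified illustration, not the authors' Verilog or Matlab code. The function names, the parameters `split` and `k`, and the initial preserved carry of 0 are assumptions. An ECIP-style predictor is modelled as spending two cycles whenever the watched window fully propagates, while the proposed predictor spends two cycles only when, in addition, the preserved carry differs from the real carry.

```python
from typing import Iterable, Tuple


def window_propagates(a: int, b: int, split: int, k: int) -> bool:
    """True when all k bit pairs just below `split` satisfy A_i XOR B_i = 1."""
    mask = (1 << k) - 1
    return (((a ^ b) >> (split - k)) & mask) == mask


def carry_into(a: int, b: int, pos: int) -> int:
    """Exact carry entering bit position `pos` when computing a + b."""
    mask = (1 << pos) - 1
    return ((a & mask) + (b & mask)) >> pos


def simulate(pairs: Iterable[Tuple[int, int]], split: int, k: int) -> Tuple[int, int]:
    """Return (ecip_cycles, proposed_cycles) for a stream of operand pairs.

    Assumption: the preserved carry (Cout_Preserved) starts at 0 and is
    updated with the real carry after every addition.
    """
    preserved = 0
    ecip_cycles = 0
    proposed_cycles = 0
    for a, b in pairs:
        long_path = window_propagates(a, b, split, k)
        real_carry = carry_into(a, b, split)
        # ECIP-style rule: an extra cycle whenever the long path is activated.
        ecip_cycles += 2 if long_path else 1
        # Proposed rule: an extra cycle only if the speculated carry is wrong.
        mispredicted = long_path and (real_carry != preserved)
        proposed_cycles += 2 if mispredicted else 1
        preserved = real_carry
    return ecip_cycles, proposed_cycles


if __name__ == "__main__":
    # A slowly changing (here: constant) input stream that activates the long path.
    print(simulate([(63, 1)] * 5, split=6, k=2))
```

For the constant stream in the usage example, the model charges the proposed predictor one extra cycle for the first addition only, because afterwards the preserved carry matches the real one.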
The same decreasing trend appears in our design, as shown in Fig 4(b), where the predictor with two bit-pairs and one feedback bit has almost the same cycle consumption as the one in [3] with three bit-pairs. Moreover, as we have shown in [12], more redundant cycles are eliminated when the processed data change slowly or are highly dependent in practical applications. Thus, considering the efficiency of inserting the predictors into a multi-stage scheme and exploiting the sequence dependence, our proposed predictor uses at most two bit-pairs with one feedback bit for speculation.

III.

DESIGN FLOW AND ERROR CONTROL METHOD FOR APPROXIMATE COMPUTATION

As illustrated in Section II, the proposed predictor comprises some bit-pairs from the input data, used as the variables in equation (1), and one feedback bit. Involving more bits in equation (1) yields more performance improvement because the probability of multi-cycle operation gets lower, but it also adds area and power overhead to the design. Thus, an iterative design flow is proposed, as shown in Fig 5. The n-bit RCA should be divided into m stages with (m-1) predictors if the performance is expected to be accelerated m times. The iteration determines the number of bits involved in equation (1), and at most two bit-pairs are used, as explained before. The Performance Test in the iteration can be conducted in Matlab by counting the average number of cycles for a single computation. In practice, the threshold for Average Cycle can be defined between 1 and 2.

Fig. 4. Cycle Consumption of different predictors.

Fig. 5. Design Flow for Multi-stage Latency Adder

The proposed multi-stage latency adder can also be applied to approximate computation to obtain more performance improvement. The concept of approximate computation was proposed to speed up a system based on the fact that different parts of an algorithm's computation have different levels of significance. This property is quite common in image or voice processing, where the quality of the final output is evaluated by the human visual or hearing system, which has limited resolution. As illustrated in [7] and [11], accurate computation is applied to the significant part, while approximate computation is deliberately allocated to the insignificant part. However, without an error control mechanism a large deviation may be introduced, since the values of the higher bits become effectively random. In our design, for a single approximate computation in an m-stage adder with (m-1) predictors, the Enable signals from the lower predictors are removed from the NOR gate in Fig 3(b). With this error control method, the mean error of a single computation is kept at a low level because most of the randomness of the higher bits is eliminated, and more cycles are saved so that the performance can be further improved. Simulation results are shown in Section IV.

IV.

SIMULATION RESULT

A. Environment Setup for Simulation

A traditional RCA and our proposed adder are used to verify the methodology. All the designs are coded in Verilog and then synthesized by Design Compiler under SMIC 65nm standard CMOS technology, from which the highest operating frequency and the area are estimated. $10^4$ random input vectors are used to simulate the design and obtain the average number of cycles for a single computation. Finally, the mean error and the corresponding speedups are measured when approximation is introduced into the adder.

B. Performance Improvement

A 64-bit RCA is compared with our multi-stage latency adder with 4, 8 and 16 stages, as shown in Fig 6. The x-axis represents the number of bits involved in equation (1) and the y-axis represents the average delay. The delay and area of the traditional RCA are 3.58 ns and 1223 µm², respectively, as obtained from Design Compiler. To get the average delay of our design with different numbers of stages, the average number of cycles for one computation is obtained by simulating $10^4$ data in Matlab, and the delay of each design is obtained from Design Compiler. Multiplying the average cycle count by the delay gives the average delay shown in Fig 6. The maximum speedups for 4, 8 and 16 stages are 2.41, 2.93 and 3.05 times, respectively. The average overheads are 1.1%, 17.6% and 46.3%, respectively.

Fig. 6. Average Delay for 4, 8, 16 stages of our adder with 64-bits.

C. Trade-off between Performance and Precision

Approximate computation is introduced in our design by removing the Enable signals of the lower predictors from the NOR gate. This reduces the cycle consumption of one computation (more performance improvement) but introduces some error into the final output. To obtain the mean error and the corresponding speedups relative to the accurate design, $10^4$ input data following a uniform distribution are simulated in Matlab for our proposed adder in a 32-bit, 8-stage configuration with one bit-pair involved in equation (1), with different numbers of Enable signals from PB(1) to PB(7) in Fig 3(b) removed from the NOR gate. For the accurate design, in which no signals are removed, the cycle consumption is 19640, which is treated as the baseline for comparison.

In Fig 7, it can be seen that the cycle consumption decreases as more Enable signals are removed, while the mean error increases. This feature can be used in applications that tolerate some error, such as image processing, recognition or data mining. When four Enable signals are removed, performance improves by nearly 20% over the accurate baseline and the mean error is 383, a deviation of about $10^{-7}$ relative to $2^{32}$, which is acceptable in most of the applications mentioned.
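The figures quoted in this section can be cross-checked with a few lines of Python. This is only an arithmetic sketch based on the numbers reported above. It assumes that speedup is defined as the RCA delay divided by the average delay of the multi-stage adder, and the function names are illustrative.

```python
# Values reported in Section IV; everything else here is derived arithmetic.
RCA_DELAY_NS = 3.58                      # delay of the traditional 64-bit RCA
SPEEDUPS = {4: 2.41, 8: 2.93, 16: 3.05}  # maximum speedup per stage count


def implied_average_delay_ns(speedup: float, baseline_ns: float = RCA_DELAY_NS) -> float:
    """Average delay implied by a speedup, assuming speedup = baseline / average delay."""
    return baseline_ns / speedup


def average_cycles(total_cycles: int, additions: int) -> float:
    """Average number of cycles per addition over a simulated stream."""
    return total_cycles / additions


def relative_error(mean_error: float, width_bits: int) -> float:
    """Mean error expressed relative to the full range 2**width_bits."""
    return mean_error / 2 ** width_bits


if __name__ == "__main__":
    for stages, speedup in SPEEDUPS.items():
        print(stages, "stages:", round(implied_average_delay_ns(speedup), 3), "ns")
    print("accurate 32-bit, 8-stage design:", average_cycles(19640, 10 ** 4), "cycles/add")
    print("relative mean error:", relative_error(383, 32))
```

Under the stated assumption, the accurate 32-bit, 8-stage design averages just under two cycles per addition, and the mean error of 383 is on the order of $10^{-7}$ of the 32-bit range, in line with the text.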

Fig. 7. Percentage of Error for RCA and Proposed Adder in Approximate Computation

V.

CONCLUSION

In this paper, we proposed a new multi-stage latency adder based on an improved prediction method. Speedups of 2.41X-3.05X have been obtained, and approximate computation has been introduced to gain further performance improvement. In future work, we will apply our design to specific systems to further test the applicability of approximate computation and compare our adder with other multi-stage speculative adders.

REFERENCES

[1] Jianhua Liu, et al., "Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space," Design Automation Conference (ASP-DAC), Asia and South Pacific, 2007, pp. 609-615.
[2] I. Koren, Computer Arithmetic Algorithms, 2nd ed.
[3] Debabrata Mohapatra, et al., "Low-Power Process-Variation Tolerant Arithmetic Units Using Input-Based Elastic Clocking," Low Power Electronics and Design (ISLPED), 2007, pp. 74-79.
[4] A. Del Barrio, et al., "Applying speculation techniques to implement functional units," in Proc. Int. Conf. Comput. Des., 2008, pp. 74-80.
[5] Yongpan Liu, et al., "Design methodology of variable latency adders with multi-stage function speculation," ISQED, 11th International Symposium on, 2011, pp. 824-830.
[6] Alberto A. Del Barrio, et al., "Multispeculative Addition Applied to Datapath Synthesis," Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, December 2012, pp. 1817-1830.
[7] Swaroop Ghosh, et al., "CRISTA: A New Paradigm for Low Power, Variation Tolerant and Adaptive Circuit Synthesis Using Critical Path Isolation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 11, November 2007, pp. 1947-1956.
[8] Se Hun Kim, et al., "Modeling and Analysis of Image Dependence and Its Implications for Energy Savings in Error Tolerant Image Processing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 8, August 2011, pp. 1163-1172.
[9] A. K. Verma, et al., "Variable latency speculative addition: A new paradigm for arithmetic circuit design," in Proc. Des. Autom. Test Eur., 2008, pp. 1250-1255.
[10] S. M. Nowick, et al., "Design of a low-latency asynchronous adder using speculative completion," IEE Proc. Comput. Digit. Tech., vol. 143, no. 5, 1996, pp. 301-307.
[11] D. Mohapatra, et al., "Significance driven computation: A voltage scalable, variation aware, quality tuning motion estimator," ISLPED, 2009.
[12] Xinghua Yang, et al., "Design of variable latency adder based on present and transitional states prediction," Power and Timing Modeling, Optimization and Simulation (PATMOS), 23rd International Workshop, 2013.
