Energy Parsimonious Circuit Design through ... - Semantic Scholar

Energy Parsimonious Circuit Design through Probabilistic Pruning Avinash Lingamneni† , Christian Enz∗ , Jean-Luc Nagel∗ , Krishna Palem† and Christian Piguet∗ ∗

Centre Suisse d’Electronique et de Microtechnique (CSEM) SA, Jaquet- Droz 1, Neuchatel, Switzerland Email:[email protected], [email protected], [email protected] † NTU-Rice

Institute for Sustainable and Applied Infodynamics (ISAID) Rice University, Houston, USA Email: [email protected], [email protected]

Abstract—Inexact Circuits or circuits in which the accuracy of the output can be traded for energy or delay savings, have been receiving increasing attention of late due to invariable inaccuracies in designs as Moore’s law approaches the low nanometer range, and a concomitant growing desire for ultra low energy systems. In this paper, we present a novel designlevel technique called probabilistic pruning to realize inexact circuits. Unlike the previous techniques in literature which relied mostly on some form of scaling of operational parameters such as the supply voltage (Vdd ) to achieve energy and accuracy tradeoffs, our technique uses pruning of portions of circuits having a lower probability of being active, as the basis for performing architectural modifications resulting in significant savings in energy, delay and area. Our approach yields more savings when compared to any of the conventional voltage scaling schemes, for similar error values. Extensive simulations using this pruning technique in a novel logic synthesis based CAD framework on various architectures of 64-bit adders demonstrate that normalized gains as great as 2X-7.5X in the Energy-DelayArea product can be obtained, with a relative error percentage as low as 10−6 % up to 10%, when compared to corresponding conventionally correct designs.

I. I NTRODUCTION Realizing reliable computations from unreliable components has long been a focus of study [1] and is receiving greater than ever prominence today [2] as diminishing transistor sizes driven by Moore’s law are leading to increasing process variations. It is due to these process variations, arising as lithographic scaling lags behind device scaling, and the quest for ultra-low energy circuits, particularly in the domain of embedded systems, that exact computing, in which output of the desired circuit is precise, is yielding the way to inexact computing, wherein accuracy of the output of circuits can be traded in for significant savings in energy and/or delay parameters pioneered in [3] and subsequently used in [4], [5]. Almost all of the previous papers in literature realizing inexact circuits used some form of voltage scaling as the basis for reliability and energy tradeoffs [6], [4], [5]. In this paper, we introduce a novel design level technique called Probabilistic Pruning to realize inexact circuits. This technique focuses on architectural modifications to circuits based on pruning of the circuit elements having a lower probability of being active during operation. The amount of such pruning will be dictated by the application’s error tolerance, quantified through a metric. The term inexact design coined in this paper is an umbrella term for the previously proposed probabilistic [7], approximate [8] and stochastic [4] designs or in general, any

design in which accuracy of the output can be traded off for energy, performance and/or area benefits. The major contributions of this paper are summarized below: • We propose a novel design-level technique called Probabilistic Pruning to realize Inexact circuits along with the corresponding mathematical characterization in Section III. • We propose a logic synthesis based CAD framework in Section III incorporating the Probabilistic Pruning technique to help design inexact circuits achieving a faster time to design and fabricate than a conventional design methodology needed by previous approaches in literature ([7], [8]). • We show through extensive simulations in the proposed CAD framework that the potential savings in energy, delay and area obtained by the pruning technique in the context of 64-bit arithmetic adders can be significant, in Section IV. Our experiments demonstrate about 2X7.5X savings in energy-delay-area product, with corresponding relative error percentage of 10−6 %-10% while using weighted significance values motivated by binary representation of numbers, and about 2X-5.3X savings in energy-delay-area product with corresponding error rates of 0.01%-37% using an uniform significance values where the outputs are interpreted like unary numbers. • We infer some general principles for the application of probabilistic pruning technique to any general circuit beyond adders, and provide some future directions for the design of inexact circuits in Section V and VI. II. R ELATED W ORK Several papers in the past have investigated ways of overcoming the challenges to process and parameter variations threatening the sustained evolution of Moore’s law. Some of the prominent methods include using multicore architectures to increase parallelism without frequency/voltage scaling, designing for average case operation and using temporal and/or spatial redundancy to correct worst-case errors [9], [10], [11]. Exciting new research into novel materials for realizing circuits such as optoelectronics, memristors [12] and molecular electronics [13], [14] have also been investigated successfully. The question is of such great significance to our daily lives that even the articles in New York Times are actively discussing these issues [15]. One common principle in all of these approaches is to ensure that the device always

Preprint under review: Please do not copy or distribute set of inputs {I} and outputs{O} but with components {C’} where NC’ ≤ NC and wires {W’} where NW’ ≤ NW such that given V randomly chosen inputs, the average error

functions correctly, either by design, or through an errorcorrection mechanism. In a radical departure from these conventional approaches, it was shown in [3], [16] that error can be traded as a commodity as opposed to being viewed as an impediment to glean significant savings (typically energy) – in applications that can accommodate error. Fortunately, a large class of emerging applications, particularly in the domain of embedded and mobile systems, can tolerate varying amounts of errors, more so when it results in significant energy savings. A CMOS realization of this principle called PCMOS was given in [17] and was later, extended to realize a system level application through an SoC architecture [18]. Also, application of this principle in the context of DSP applications was realized using probabilistic arithmetic [7] where the errors were caused by (thermal) noise present in ultra-deep submicron technologies, and through approximate arithmetic in [8], [6] where the errors were deterministically introduced through critical path violation as a result of voltage scaling. This principle was applied more recently to processor modules through voltage scaling and slack redistribution in [4], [5]. However, while this unconventional principle opens up several novel directions for trading off accuracy for savings, there are serious impediments when one considers integrated circuits based on this principle. For example, one physical realization referred to as Biased Voltage Scaling (BiVOS) [7], [8] is seriously impeded since it involves significant overheads of routing multiple voltage planes, and by necessity for level shifters for parallel structures in which routing exists between all voltage planes. Another important drawback of all previous voltage scaling based works [4], [5], [7], [19] is that the finetuning of supply voltage at run-time based on the application requirements might not be feasible due to inherent variations present in the power supply routing [20] and by the large overhead generally required to ensure such an accurate finetuning is realized. In this paper, we overcome these drawbacks while exploiting the principle of trading accuracy for significant energy savings through a architectural redesign technique called probabilistic pruning, which also yields improvements in area as well as speed. Finally, we note that this technique yields savings which are significantly more and realized with zero overhead, when compared to the previously proposed voltage scaling-based techniques which have associated overheads.

Er(G 0 ) =

V X

pk × |Ok0 − Ok | ≤ σ

(1)

k=1

where Ok and Ok0 correspond to value of final output vectors 0 0 0 < Ok,1 , Ok,2 , . . . Ok,n > and < Ok,1 , Ok,2 , . . . Ok,n > of circuits G and G’ respectively for a given n-bit input vector Ik which occurs with a probability pk for 1 ≤ k ≤ V. In the unweighted case, |Oi − Oj | is the value of the difference between Oi and Oj treated as unary numbers. In the case where the output bits are weighted, without loss of generality, we could assign a weight ηj to the j th output bit Oj . In this case, Oi and Oj will be the difference between the corresponding binary numbers. In fact, starting with Section IV, we will be demonstrating the value of probabilistic pruning through circuits for integer addition and therefore, will be considering the case of weights ηj = 2j almost exclusively. Output: A pruned G 0 that is optimal in that there is no other G 00 satisfying the conditions above such that NC00 < NC0 . The average error computation metric used above is not limiting in any sense that it can conveniently replaced by any other error metric based on the application requirements in using the probabilistic pruning approach. In circuit design, it is considered to be meaningful to evaluate a design using a range of inputs drawn from a distribution and analyze error following approaches to average case analysis [21], and we will adopt this approach. Not surprisingly, it is easy to show that variants of this problem are NP-hard in general [22], [23]. However, our main goal in this paper is to demonstrate the value of applying probabilistic pruning to circuit design. Therefore, we will not emphasize the algorithmic nuances in this paper, but will rather use a simple-minded and (almost) brute force heuristic here, which is shown in Figure 1. The flowchart is detailed enough to where a pseudo-code for the heuristic can be easily developed using it as a basis. B. Defining Error We can broadly classify error resilient applications into two types : ones which have a bound on the total number of erroneous computations (such as number of incorrect memory address computations in a microprocessor) and others (such as computation of the value of a pixel by a graphics processor) which have bounds on the magnitude of error. While in the former type applications, each of the output receives equal importance or “significance” and errors are quantified through the error rate metric, the outputs in the latter applications have a certain importance or weights depending on the magnitude of error. We will state each in turn for convenience given V.

III. P ROBABILISTIC P RUNING FOR E NERGY /D ELAY /E RROR T RADEOFF IN I NEXACT C IRCUITS Probabilistic Pruning is a design level technique wherein we systematically “prune” or delete components and their associated wires along the paths of the circuit that have a lower probability of being active during circuit operation while staying within the error boundaries dictated by the application.

Error Rate =

A. A Formal Definition of Probabilistic Pruning A circuit can be represented as a directed acyclic graph whose nodes are components such as gates, inputs, or outputs and whose edges are wires. Given a circuit G with NC components, NI inputs, NO outputs and NW wires, our goal is to prune components in the paths such that the energy, area and speed are reduced while maintaining a bound on error, say σ. Let I be the set of all input nodes, O be the set of output nodes, C be the set of all components and W be the set of all wires. We now formulate an optimization problem of computing a circuit G’, which is a subgraph of G such that it has the same

Number of Erroneous Computations V0 = Total Number of Computations V

Relative Error Magnitude =

V 1 X |Ok − Ok0 | V Ok k=1

C. Proposed Logic Synthesis based CAD Flow The proposed logic synthesis CAD flow for realizing Inexact circuits through Probabilistic Pruning is shown in Figure 2. The main object of interest in the proposed CAD flow is the Probabilistic Pruner, which prunes the portions of designs with lower probability of being active and/or with the 2

Preprint under review: Please do not copy or distribute

NO

Uniform Probabilistic Pruning

Initial Design

Estimate Probability of being Active for each Path

Does the application require weighted circuits?

(based on Mathematical model or Simulations)

Prune the Circuit Elements on the path with the lowest Active Probability

NO

Compute Error in the Pruned Circuit (using Functional Simulation or Mathematical estimation)

(only if they are not on any other higher probability paths)

Is Error > Target Error

YES

Undo last Pruning

Pruned Design

YES

Prune the Circuit Elements on the path with the lowest (Active Probability * Significance) (only if they are not on any other higher (probability*significance) paths) Weighted Probabilistic Pruning

Fig. 1.

Flowchart for the Probabilistic Pruning

TABLE I S UMMARY OF VARIOUS C ONVENTIONAL A DDER A RCHITECTURE C HARACTERISTICS

Initial Design (.v or .vhd)

Error Estimator ( C/C++/ Matlab)

Type of Adder

Synthesis Probabilistic Pruner

.vhd/.v

Technology Library

(Synopsys DC Compiler or Cadence RC Compiler)

(.lib or .db)

.v .sdc

Testbench

Functional Simulation (ModelSim)

Place & Route

.v .spef

Power Estimation

(Cadence SoC Encounter)

.sdf

(Synopsys Power Compiler & Modelsim)

Final Design (Layout)

Fig. 2.

Area

Speed

Power

Ripple Carry Lowest Lowest Lowest Carry Skip Low Low Low Carry Select Medium Medium Medium Carry Increment Medium Medium Medium Sklansky High Highest High Brent-Kung High High High Kogge-Stone Highest Highest Highest Han-Carlson High Highest High Ladner-Fischer High Highest High Sparse Tree High Highest High

A Logic Synthesis based CAD flow for designing Inexact Circuits

Prefix Type Serial — Group Group Parallel Parallel Parallel Parallel Parallel Parallel

Prefix depending on the order of grouping of bits and carry propagation. The prefix networks of some common adders are shown in Figure 3. A brief overview of the characteristics of various conventional adders is given in Table I where we show the type of prefix structure and also contrast the relative area, speed and power intuitively.

least significance depending on the application requirement. The overall flowchart summarizing the Probabilistic Pruner is shown in Figure 1 where the pruner interacts with Modelsim based functional simulator and an Error estimator to decide the amount of pruning governed by the error margins dictated by the application.

B. A Brief Overview of Carry Path Probabilities in an Adder For notational convenience, we will use the symbols S, A and B to denote the Sum (output) and the two binary inputs to the adder. As all the paths between output Si and an input Aj or Bj , (∀j 6= i and 0 ≤ j ≤ N ) in existing in an N bit adder are due to the propagation of carry bits, we compute the various path probabilities in an adder using a variation of the carry path propagation results derived in [26], to form the basis of the pruning technique. A bit position ‘i’ is said to generate a carry if both Ai and Bi are equal to 1 and propagate a carry if exactly one of Ai or Bi is equal to 1. Hence, a sum output Si is affected by an input Aj or Bj (where j < i) only if there is a carry generated at j and the rest of the i − j bits propagate the carry. For example, if the summands A and B are chosen uniformly at random, the probability that a bit position j generates a carry is 1/4 and the probability that the rest of the i − j − 1 propagate the carry is 1/2i−j−1 . Hence, the probability of any particular path from an input Aj or Bj to an output Sum Si being active is 1/2i−j+1 .

IV. A PPLICATION OF P ROBABILISTIC P RUNING TO A RITHMETIC C IRCUIT VIA A DDERS Binary addition is a fundamental and most frequently used arithmetic operation on microprocessors, digital signal processors and other data-processing ASICs. Numerous algorithms and circuit architectures for binary addition have been proposed over the last 50 years and the usability of these architectures based on the specific constraints (delay/area/fanout/wiring tracks) has been extensively studied [24], [25]. We choose binary adders as the first platform for the application of our probabilistic pruning technique as their well-understood yet non-trivial architectures, ranging from serial to highly parallel, present us with valuable insights into application of the proposed technique on larger and more complex circuits as outlined in Section V-B. A. Conventional Adders Designs It is widely known that binary addition can be formulated as a prefix dependent function i.e. every output is dependent on all inputs of equal or lower magnitude, and every input influences all outputs of equal or higher magnitude. Based on this property, the functioning of a binary adder can be grouped into 3 stages: Precomputation, Prefix and Postcomputation. While the precomputation and postcomputation stages are similar in almost all adder architectures, the prefix computation generally defines the architecture of an adder and is classified into 3 types [24] : (a) Serial Prefix (b) Group Prefix (c) Parallel

C. Designing Adders using Probabilistic Pruning Based on the application requirement, we classify the probabilistic pruning technique into two forms : uniform probabilistic pruning (UPP ) for applications which require uniform circuits and weighted probabilistic pruning (WPP ) for applications which require circuits whose outputs have associated weights. We apply the UPP and WPP techniques on adders as follows: 3

Preprint under review: Please do not copy or distribute 15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

0 i:k

k-1:j

Gi:k Pi:k Gk-1:j

15

14

13

12

11

10

13:12

12

11

10

9

15:0 14:0 13:0 12:0 11:0 10:0

8

7

6

5

4

3

2

9:8

8:7

7:6

6:5

5:4

4:3

3:2

2:1

15:12 14:11 13:10

9:6

8:5

7:4

6:3

5:2

4:1

3:0

2:0

11:8 10:7

4

3

2

8:1

7:0

6:0

5:0

4:0

3:0

0

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

0

15:14

14:13

13:12

12:11

11:10

10:9

9:8

8:7

7:6

6:5

5:4

4:3

3:2

2:1

15:12

14:11

13:10

12:9

11:8

10:7

9:6

8:5

7:4

6:3

5:2

4:1

3:0

2:0

15:8

14:7

13:6

12:5

11:4

10:3

9:2

8:1

7:0

6:0

5:0

4:0

1:0

0:0

1:0

9:0

8:0

7:0

6:0

5:0

4:0

3:0

2:0

1:0

15:0 14:0 13:0 12:0 11:0 10:0

0:0

9:0

8:0

7:0

6:0

5:0

4:0

3:0

2:0

Kogge-Stone Adder (Parallel Prefix)

Prefix Networks of Some Adders

1

0

15

1:0

2:0

Pi:j

5:4

1:0

14

13

12

15:14

14:13 13:12 12:11

15:12

14:11 13:10

15:8

15:8 14:7 13:6 12:5 11:4 10:3 9:2

1

Carry Increment Adder (Group Prefix)

15:14 14:13 13:12 12:11 11:10 10:9

12:9

5

Gi:j

i:j

7:4

Fig. 3.

13

6

Gi:j Pi:j

Gi:j

6:4

11:8

Ripple Carry Adder (Serial Prefix)

14

7

10:8

15:12

15

8

Gi:k Pi:k Gk-1:j i:j

9:8

14:12

15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

9

i:j

k-1:j

Pi:j

Pk-1:j

i:j

i:k

Gi:j

14:7

13:6

11

10

9

8

7

6

5

4

11:10

10:9

9:8

8:7

7:6

6:5

5:4

4:3

12:9

11:8

10:7

9:6

8:5

7:4

6:3

5:2

4:1

12:5

11:4

10:3

9:2

8:1

3

3:2

2

2:1

1

0

1:0

0:0

1:0

0:0

Uniform Pruned Kogge-Stone Adder 15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

15:4 14:3 13:2 12:1 11:4 10:3

0

9:2

8:1

7:4

6:3

5:2

4:1

3:2

2:1

3

2

Weighted Pruned Kogge-Stone Adder 15:14

13:12

11:10

9:8

7:6

5:4

3:2

7:4

11:8

15:12

15

1:0

14

13

15:14 15:8

13:12

15:12 13:8

9:4

15:8 14:8 13:8 12:8 11:8 10:4

5:0

9:4

8:4

7:4

12

11

10

9

8

7

6

5

4

1

0

1:0

0:0

3:0

6:0

11:10

9:8

7:6

11:8

5:4

3:2

1:0

7:4

15:8

5:0

4:0

3:0

2:0

1:0

0:0

11:4

Uniform Pruned Brent-Kung Adder

13:4

Fig. 4. Example of adder architectures designed using Uniform Probabilistic Pruning

15:4 14:4 13:4

9:4

12:4 11:4 10:4

5:4

9:4

8:4

7:4

6:4

5:4

4:2

3:2

2:0

Weighted Pruned Brent-Kung Adder

1) Uniform Probabilistic Pruning based Adders: Applications for which all bit positions in an adder are treated with equal significance i.e. there is no notion of Most Significant Bit (MSB) or Least Significant Bit (LSB) and each bit position has equal significance. We use concepts from Section IV-B to calculate the path probabilities in each adder and apply the UPP technique as shown in Figure 1. Due to regular structure of the prefix networks, all components at the same level (or row) are on the paths with equal probability of being active. For example, the components on the 4th level of the Kogge-stone adder are propagating carry information from inputs Ai and Bi to output Si+8 and while the components on the 3rd level are propagating the carry information from inputs Ai and Bi to output Si+4 , the path probabilities of which are 1/29 and 1/25 respectively. Hence, we start pruning from the 4th level till we reach an error bound of the application. Examples of 16-bit Uniform Pruned Kogge-Stone and Brent-Kung adders are shown in Figure 4. It should be noted that while all the outputs of a pruned Kogge-Stone adder have equal amount of carry propagation paths due to its regular structure, the outputs of other parallel adders (such as Brent-Kung) have varying number of carry propagation paths. 2) Weighted Probabilistic Pruning based Adders: Starting from the LSB and moving towards the MSB, each bit position has a significance of 2 times higher than the previous bit

Fig. 5. Example of adder architectures designed using Weighted Probabilistic Pruning

position. While it is possible to use the different significance values for each bit position through the WPP technique, in this work for illustrative purposes , we “bin” the bits into four equal groups with each bin having k consecutive bits. We assume that the bits in the each bin have the same significance and are 2k times as significant as those in the bin that is immediately following it. Applying our WPP technique, we compute the (significance*path probability) product and prune the components with the least product value. Examples of resulting 16-bit Pruned Kogge-Stone and Brent-Kung adders are shown in Figure 5. V. E XPERIMENTAL R ESULTS AND A NALYSIS A. Methodology and Framework Our logic-synthesis based CAD methodology for applying probabilistic pruning technique on adders is based on Figure 2. We design the 64-bit adder circuits in VHDL which is sent to the Probabilistic Pruner where individual path probabilities are calculated and depending on the error tolerance of the application, circuit elements on paths with lower probabilities and/or lower significance will be pruned. Accurate error values 4

Preprint under review: Please do not copy or distribute Pruned Kogge-Stone Adders

of the resulting pruned circuits are computed through functional simulations using a testbench consisting of uniformly distributed 64-bit random numbers in Modelsim. We will then synthesize the VHDL code of the pruned circuit using Cadence RC Compiler, and perform Place & Route using Cadence SoC Encounter to obtain the layout along with the final netlist and parasitics. We then estimate the power consumption of the pruned circuit by back-annotation of the final netlist and parasitics using Synopsys Power Compiler. The required toggling frequency of each net and cell in the netlist for the Power Compiler are produced through the Switching Activity Interchange Format (.saif) file derived from the Value Change Dump (VCD) file. The VCD file is produced by simulation of the netlist along with the Standard Delay Format (SDF) file in ModelSim using the previously created testbench . All the designs are implemented using TSMC 180nm (Low Power) technology library, and to crosscheck the gains obtained, some of the designs were re-implemented in IBM 90nm (Normal Vt ) and TSMC 65nm (High Vt ) technologies as well. Also, the synthesis of designs was done targeting highest frequency of operation and lowest power consumption (or loose target frequency) separately, to analyze the gains achieved in each case.

Normalized Gains (Conventional / Pruned)

11 TSMC 180nm (Low Power Synthesis) IBM 90nm- Normal Vt (Low Power Synthesis) TSMC 65nm- High Vt (Low Power Synthesis) TSMC 180nm (Highest Frequency Synthesis) IBM 90nm- Normal Vt (Highest Frequency Synthesis) TSMC 65nm- High Vt (Highest Frequency Synthesis)

10 9 8 7 6 5 4

Low Power Synthesis

3 2 Highest Frequency Synthesis

1 0 1e-006

0.001

0.01

0.1

1

10

40

Relative Error(percentage)

Fig. 8. Energy-Delay-Area Product of Weighted Pruned Kogge-Stone adders implemented in different process technologies and synthesis constraints

with techniques such as adaptive body bias [27] to address the effects of parameter variations in the more significant portions of the circuits. Another observation regarding the probabilistic pruned circuits is that the error (both error rate and relative error magnitude) in probabilistic pruned adders rises sharply beyond a critical amount of pruning akin to the critical voltage scaling point problem mentioned in [5]. We anticipate that this can be fixed by combining a parameter variation approach (such as Vdd or Vt variation techniques) with our pruning technique.

B. Results and Analysis The central results, namely the normalized gains (Conventional/Pruned) for different metrics (area, delay, energy, energy-delay product and energy-delay-area product) obtained by applying WPP and UPP techniques on various 64-bit adders are summarized in Figure 6 and Figure 7 respectively. Due to space constraints, we only show parallel adders with the highest (Kogge-Stone, Han-Carlson) and least (Brent-Kung) amount of gains. From these graphs, it can be observed that: The (normalized) gains obtained for the applications employing weighted circuits is more than the corresponding uniform circuits for the same error percentage. While the normalized gains outlined for parallel adders in Figures 6, 7 are as high as 2X for less than 1% error rate or 10−6 % relative error, it is not possible to achieve such high savings in the context of serial adders. For example, WPP technique applied on a Ripple Carry Adder yielded savings of 1.25X, 1.5X and 2X for a high relative error of 19%, 33% and 56% respectively. Hence, Probabilistic Pruning yields much more significant gains in parallel circuits (with large number of available paths to prune) than the corresponding serial circuits (which generally have very few paths. Figure 8 outlines the results obtained for a pruned KoggeStone adder using three different technology libraries under two different synthesis constraints. From this, we can conclude that: For similar operating conditions, the gains achieved in the probabilistic pruned circuits is proportional to the ratio of circuit pruned to the original circuit. It is largely independent on the process technology being used and only depends on the logic synthesis constraints. Specifically, in addition to the observations highlighted through italicization in the paragraphs above that are of potential value to circuit design in general, we can also observe that probabilistic pruning is a design level technique which does not involve varying of circuit parameters during operation, the amount of error in a probabilistic pruned circuit is independent of varying of parameters (such as Vdd ) unlike other inexact circuits and is as robust as conventional circuits to process variations. The amount of such error is generally fixed at design time based on application requirements. The probabilistic pruning technique can be used in conjunction

VI. C ONCLUSION AND F UTURE D IRECTIONS To the best of our knowledge, this is the first attempt at an architectural design of circuits that explicitly trades error for savings. We believe to have convincingly shown through extensive simulations, that our Probabilistic Pruning technique achieves significantly better savings along all 3 dimensions – energy, delay and area – while having comparable error, over the conventional voltage scaling schemes. Other benefits include the lack of dependence of amount of error on any particular process technology and faster design time than conventional voltage scaling schemes as a result of easy integration into a logic synthesis based CAD flow. These gains achieved through the Probabilistic Pruning are relative in that they can be combined with standard techniques that achieve energy or performance gains or both, through absolute approaches. Specifically, this means that any technique that uses equal voltage (planes) throughout the datapath and yields correct results or voltage scaling to yield tolerable incorrect results can be extended through the insights in this paper to yield additional gains simultaneously along the energy, delay and area dimensions by using the probabilistic pruning technique. R EFERENCES [1] J. Von Neumann, “Probabilistic logics and the synthesis of reliable organisms from unreliable components,” in Automata Studies (C.E. Shannon and J. McCarthy eds.), Priceton Univ. Press, Princeton, N.J., 1956. [2] S. Borkar, “Designing reliable systems from unreliable components: The challenges of transistor variability and degradation,” IEEE Micro, vol. 25, no. 6, pp. 10–16, 2005. [3] K. V. Palem, “Energy aware algorithm design via probabilistic computing: from algorithms and models to Moore’s law and novel (semiconductor) devices,” in Proc. of the IEEE/ACM Intl. Conf. on Compilers, Architecture and Synthesis for Embedded Systems, 2003, pp. 113–117. [4] S. Narayanan, J. Sartori, R. Kumar, and D. Jones, “Scalable stochastic processors,” in Proc. of the Design, Automation and Test in Europe, March 2010. [5] A. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack redistribution for graceful degradation under voltage overscaling,” in Proc. of 15th IEEE/SIGDA Asia and South Pacific Design and Automation conference, January 2010.

5

Preprint under review: Please do not copy or distribute

7


6 5 4 3 2 1 0.0001 0.01 0.1 1 Relative Error (percentage)

8 7 6 5 4 3 2 1 0 1e-006

10

3

2

1

0 1e-006

10

Energy Area Energy-Delay Product Delay Energy-Delay-Area Product Energy Energy-Delay Product Energy-Delay-Area Product

7 6 6

5 5 44 3 3 2 2 1

10 1e-006 0 0.01

0.0001 0.01 0.1 1 Relative Error (percentage) 0.1 1 10

10

Pruned Ladner-Fischer Adders 6 Normalized (Conventional / Pruned) Normalized GainsGains (Conventional / Pruned)

78

Area Pruned Kogge-Stone Adders Delay

5 5


4 3 3 2 2 1

1 0 1e-006 0

40

0.01

Error Rate (percentage)

Fig. 7.

Area Pruned Han-Carlson Adders Delay

0.0001 0.01 0.1 Relative Error (percentage)

1

10

0.0001 0.01 0.1 1 10 Relative Error (percentage) 0.1 1 10 40

Pruned Sparse Tree Adders Normalized (Conventional / Pruned) Normalized GainsGains (Conventional / Pruned)

Pruned Han-Carlson Adders Normalized GainsGains (Conventional / Pruned) Normalized (Conventional / Pruned)

Area Delay Energy Energy-Delay Product Energy-Delay-Area Product

Normalized gains Vs Relative Error percentage of various Weighted Pruned 64-bit adders implemented in TSMC 180nm technology 9

Area Pruned Brent-Kung Adders Delay

4


3 3

22

1 1 0

0 0.1


0.01 0.1 1 10 Relative Error (percentage) 1 10 40


Normalized gains Vs Error Rate percentage of various Uniform Pruned 64-bit adders implemented in TSMC 180nm technology Pruned Han-Carlson Adders Area Delay Energy Energy-Delay Product Energy-Delay-Area Product

Pruned Ladner-Fischer Adders 7 Normalized Gains (Conventional / Pruned)

5 Normalized Gains (Conventional / Pruned)

0.0001 0.01 0.1 1 Relative Error (percentage)

Pruned Brent-Kung Adders 4



Pruned Sparse Tree Adders Normalized Gains (Conventional / Pruned)


8

0 1e-006

Fig. 6.

Pruned Han-Carlson Adders 9



Pruned Kogge-Stone Adders 9


6 3 Transactions on Computers, vol. 54, no. 9, pp. A study of limits,” IEEE [6] K. V. Palem, L. N. Chakrapani, Z. M. Kedem, A. Lingamneni, and K. K. 4 1123–1137, 2005. Muntimadugu, “Sustaining moore’s law in embedded computing through probabilistic and approximate design: retrospects and 5prospects,” in [17] S. Cheemalavagu, P. Korkmaz, and K. V. Palem, “Ultra low-energy computing via probabilistic algorithms and devices: CMOS device Proc. of Intl. Conf. on Compilers, Architecture, and Synthesis for 3 4 2 primitives and the energy-probability relationship,” in Proc. of the Intl. Embedded Systems. ACM, 2009, pp. 1–10. Conference on Solid State Devices and Materials, Sep. 2004, pp. 402– [7] J. George, B. Marr, B. E. S. Akgul, and K. Palem, “Probabilistic 3 403. arithmetic and energy efficient embedded signal processing,” in Proc. 2 [18] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. of the IEEE/ACM Intl. Conf. on Compilers, Architecture, and Synthesis Palem, and B. Seshasayee, “Ultra efficient embedded SoC architectures 2 1 for Embedded Systems, 2006, pp. 158–168. based on probabilistic CMOS technology,” in Proc. of the 9th Design [8] L. N. B. Chakrapani, K. K. Muntimadugu, A. Lingamneni, J. George, 1 Automation and Test in Europe, Mar. 2006, pp. 1110–1115. 1 and K. V. Palem, “Highly energy and performance efficient embedded [19] N. Banerjee, G. Karakonstantis, and K. Roy, “Process variation tolerant computing through approximately correct arithmetic: A mathematical low power DCT architecture,” in Proceedings of Design, Automation foundation 0and preliminary experimental validation,” in0 Proc. of the 0 0.01 Conf.0.1on Compilers, 1 10Architecture, 40 10 and 40 0.1 1 2007, pp.101 –6.40 Test in Europe Conference, Apr IEEE/ACM Intl.l and0.1Synthesis 1of Error Rate (percentage) Error Rate (percentage) Error Rateof(percentage) [20] M. Alioto and G. Palumbo, “Impact supply voltage variations on Embedded Systems, 2008. full adder delay: analysis and comparison,” IEEE Transactions on Very [9] K. Bowman, J. Tschanz, N. S. Kim, J. Lee, C. Wilkerson, S.-L. Large Scale Integration (VLSI) Systems, vol. 14, no. 12, pp. 1322 – Lu, T. Karnik, and V. De, “Energy-efficient and metastability-immune 1335, 2006. timing-error detection and recovery circuits for dynamic variation toler[21] R. Karp, “Probabilistic analysis of partitioning algorithms for the ance,” IEEE International Conference on Integrated Circuit Design and traveling-salesman problem in the plane,” Mathematics of Operations Technology and Tutorial, pp. 155–158, 2008. Research, vol. 2, no. 3, pp. 209–224, 1977. [10] J. Ray, J. C. Hoe, and B. Falsafi, “Dual use of superscalar datapath for [22] S. A. Cook, “The complexity of theorem-proving procedures,” Proceedtransient-fault detection and recovery,” in Proceedings of the 34th Annual ings of the third annual ACM symposium on Theory of computing, pp. IEEE/ACM International Symposium on Microarchitecture (MICRO), 151–158, 1971. 2001, pp. 214–224. [11] D. Ernst, N. S. Kim, S. Das, S. Pant, T. Pham, R. Rao, C. Ziesler, [23] R. M. Karp, “Reducibility among combinatorial problems,” Complexity of Computer Computations (Raymond E. Miller and James W. Thatcher D. Blaauw, T. Austin, and T. Mudge, “Razor: A low-power pipeline (editors)), pp. 85–103, 1972. based on circuit-level timing speculation,” in Proc. of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), [24] R. Zimmerman, “Binary adder architectures for cell-based vlsi and their synthesis,” Ph.D. dissertation, Swiss Federal Institute of Technolog, Oct. 2003, pp. 7–18. 1997. [12] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The [25] D. Harris, “A taxonomy of parallel prefix networks,” vol. 2, Nov 2003, missing memristor found,” Nature, vol. 453, pp. 80–83, March. pp. 2213 – 2217. [13] J. M. Tour and D. K. James, “Molecular electronic computing archi- [26] N. Pippenger, “Analysis of carry propagation in addition: An elementary tectures: A review,” in Handbook of Nanoscience, Engineering and approach,” Journal of Algorithms, vol. 42, pp. 317–313, 2002. Technology, Second Edition, I. Goddard, W. A., D. W. Brenner, S. E. [27] J. W. Tschanz, J. T. Kao, S. G. Narendra, R. Nair, D. A. Antoniadis, A. P. Lyshevski, and G. J. Iafrate, Eds. New York: CRC Press, 2007, pp. Chandrakasan, and V. De, “Adaptive body bias for reducing impacts 5.1–5.28. of die-to-die and within-die parameter variations on microprocessor [14] J. Yao, Z. Sun, L. Zhong, D. Natelson, and J. M. Tour, “Resistive frequency and leakage,” IEEE Journal Of Solid-State Circuits, pp. 1396– switches and memories from silicon oxide,” Nano Letters, August 2010. 1402, 2002. [15] “Advances offer path to further shrink computer chips,” New York Times, http://www.nytimes.com/2010/08/31/science/31compute.html, 2010. [16] K. V. Palem, “Energy aware computing through probabilistic switching:

6