Adv. Radio Sci., 5, 291–295, 2007 www.adv-radio-sci.net/5/291/2007/ © Author(s) 2007. This work is licensed under a Creative Commons License.

Advances in Radio Science

Design issues of arithmetic structures in adiabatic logic

Ph. Teichmann, J. Fischer, F. Chouard, and D. Schmitt-Landsiedel

Institute for Technical Electronics, Technical University Munich, Munich, Germany

Abstract. Since adiabatic logic uses a supply that combines supply voltage and clock signal in one line, adiabatic logic systems have a built-in micro-pipelined architecture. Because of this, the design constraints differ from those of static CMOS designs. Complex arithmetic building blocks, like multipliers, mainly consist of adders. Therefore, a comparison of adder structures is performed. Based on these results, multipliers and complex systems can be built. A Discrete Cosine Transformation (DCT) is taken as an example of an arithmetic system. Comparing an adiabatic logic implementation of a DCT to its static CMOS counterpart, an energy saving factor of more than 10 can be achieved with the adiabatic system.

1 Introduction

Adiabatic logic circuits are known for offering an energy dissipation below the $E = \frac{1}{2} C V_{DD}^2$ limit of static CMOS circuits. The Positive Feedback Adiabatic Logic (PFAL) family (Vetuli et al., 1996) has proven to be a reasonable choice for today's VLSI integration (Amirante, 2004; Fischer, 2006), as it is ultra low-power and robust against parameter variations. A chain of PFAL gates is operated via a four-phase clock which acts as both power supply and clocking line for the gates and is therefore called power-clock. As a consequence, cascaded gates form a micropipeline, which is an inherent property of adiabatic logic. This fact has to be observed when complex systems are investigated.

Arithmetic structures are basic building blocks in a variety of digital signal processing tasks. For example, adders are used to build more complex arithmetic functions like multipliers, filters, CORDICs etc. Many proposals for adders focusing on different design goals like power, area and performance have been presented for static CMOS circuits (Koren, 2002; Zimmermann, 1997; Beaumont-Smith and Lim, 2001). But PFAL's inherent properties make it desirable to analyze the known arithmetic structures with respect to their applicability in adiabatic logic.

This work first introduces the idea of adiabatic logic and the micropipeline in Sect. 2, followed by the results gained from the considerations on adders and multipliers in Sect. 3 and a case study of a DCT in Sect. 4. The results are summarized in Sect. 5.

2 Brief introduction of adiabatic logic

In static CMOS the fundamental limit for the energy dissipation per switching event of $E = \frac{1}{2} C V_{DD}^2$ can only be lowered by reducing the capacitance $C$ or by scaling the supply voltage $V_{DD}$. To overcome this limit, adiabatic circuits have been proposed that trade frequency for energy. By lowering the operating frequency, theoretically an asymptotic convergence to zero dissipation can be reached. The energy dissipation in a PFAL adiabatic logic gate is described by $E = \frac{RC}{T} C V_{DD}^2$, where $R$ is the path resistance, $C$ is the capacitance at the output and $T$ is the rise/fall time of the power-clock signal. A low threshold voltage $V_{th}$ on the one hand reduces the path resistance $R$ due to an improved gate overdrive voltage; on the other hand it increases the leakage currents, which limit the energy dissipation in the mid- and low-frequency range. PFAL circuits show a minimum in energy dissipation at the crossing point of the adiabatic losses ($\propto f$) and the leakage losses ($\propto 1/f$). For a 130 nm CMOS process this minimum is found around 100 MHz.

The PFAL logic family uses a four-phase power-clock signal that acts as both the supply voltage and the clocking signal (see Fig. 1) and inherently forces the circuits to be operated in a pipelined fashion. The pipeline style gives a first idea of how a system in adiabatic logic should be implemented, as overhead due to synchronization degrades the savings gained through the application of adiabatic logic. Furthermore, pipelining leads to a rise in latency.
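As a rough numerical illustration of the two energy expressions in this section, the following Python sketch compares the static CMOS limit with the adiabatic loss and locates the frequency at which the adiabatic and the leakage losses cross. All component values (10 kΩ path resistance, 10 fF load, a ramp time of one quarter of the power-clock period) and the coefficients `k_adiabatic` and `k_leak` are illustrative assumptions, not data from the process used in this work.

```python
import math


def cmos_switching_energy(c_load, v_dd):
    """Static CMOS limit per switching event: E = 1/2 * C * V_DD^2."""
    return 0.5 * c_load * v_dd ** 2


def adiabatic_switching_energy(r_path, c_load, v_dd, t_ramp):
    """Adiabatic loss per switching event: E = (R*C / T) * C * V_DD^2."""
    return (r_path * c_load / t_ramp) * c_load * v_dd ** 2


def energy_per_cycle(frequency, k_adiabatic, k_leak):
    """Simplified loss model: adiabatic part ~ f, leakage part ~ 1/f."""
    return k_adiabatic * frequency + k_leak / frequency


def optimum_frequency(k_adiabatic, k_leak):
    """Frequency where both loss terms are equal, which minimizes their sum."""
    return math.sqrt(k_leak / k_adiabatic)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    r_path, c_load, v_dd = 10e3, 10e-15, 1.2
    frequency = 100e6
    t_ramp = 1.0 / (4.0 * frequency)  # assumed: ramp lasts a quarter period

    e_cmos = cmos_switching_energy(c_load, v_dd)
    e_adiabatic = adiabatic_switching_energy(r_path, c_load, v_dd, t_ramp)
    print(f"static CMOS: {e_cmos:.3e} J, adiabatic: {e_adiabatic:.3e} J")
```

In this simplified model the minimum of `energy_per_cycle` lies exactly where the two terms are equal, which mirrors the crossing-point argument made in the text; real gates will deviate from the pure $f$ and $1/f$ dependence.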

Correspondence to: Ph. Teichmann ([email protected]) Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V.


Fig. 1. Two consecutive phases Φi and Φ(i+1) of an adiabatic four-phase power-clock are shown. When Φi is in its stable high phase, the data is evaluated in the succeeding stage, indicated by the arrow.

3 Binary adder structures

The inherent properties of adiabatic logic must be taken into account in the design of adiabatic arithmetic structures. Large overheads in structures result in suboptimal designs. One advantage of PFAL is its dual-rail signal representation, so a signal is inverted without additional gates. Subtractors are easily obtained from adders by swapping the signal rails of the subtrahend and feeding a logic one into the carry-in signal of the adder. The results for the adders gained in this section are valid for the subtractors as well.
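The step from adder to subtractor can be checked with a small behavioural model. The Python sketch below is an illustration only: it represents each dual-rail signal by a single bit, models the rail swap as a bitwise inversion of the subtrahend, and sets the carry-in to one, which is the usual two's complement identity A − B = A + NOT(B) + 1. The helper names (`full_adder`, `ripple_carry_add`, `ripple_carry_sub`, `to_bits`, `from_bits`) are chosen for this example.

```python
def full_adder(a, b, c_in):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equally long bit lists (least significant bit first)."""
    sum_bits = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def ripple_carry_sub(a_bits, b_bits):
    """Subtract by inverting the subtrahend and feeding a one into carry-in."""
    # In dual-rail logic the inversion is a swap of the two rails;
    # here it is modelled as a plain bit inversion.
    inverted_b = [1 - b for b in b_bits]
    return ripple_carry_add(a_bits, inverted_b, c_in=1)


def to_bits(value, width):
    """Convert a non-negative integer to a bit list, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Convert a bit list (LSB first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    difference, carry = ripple_carry_sub(to_bits(9, 4), to_bits(3, 4))
    print(from_bits(difference), carry)
```

The result is the difference modulo 2^N; in this model the carry-out is one whenever the minuend is not smaller than the subtrahend.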
Basically two groups of adders exist, namely the carry-propagate adders and the parallel-prefix adders (Sklansky, 1960; Kogge and Stone, 1973; Ladner and Fischer, 1980; Brent and Kung, 1982; Han and Carlson, 1987). The ripple-carry adder (RCA) is the simplest implementation of a carry-propagate adder, using N full-adder (FA) cells in static CMOS, where N is the input width of the data words. But in adiabatic logic, the rippling of the carry signal leads to an overhead of synchronizing buffer stages of $O(N^2)$, making the ripple-carry structure in adiabatic logic (Fig. 2) an improper choice for high bit widths N. Additionally, the adiabatic RCA takes N/4 clock cycles, so the latency rises relatively fast with the bit width. Only for several subsequent addition operations can the RCA be applied efficiently: used in a nested way, as pictured in Fig. 3, the absolute overhead of synchronizing buffers per arithmetic operation remains almost constant. Especially for butterfly structures, as used in the Discrete Cosine Transformation (DCT) in Sect. 4, the nested RCA is advantageous.

Fig. 2. An N=4 ripple-carry adder in adiabatic logic. Notice the overhead due to the synchronization buffers. Each input bit of A and B has to be buffered, leading to two buffers per bit position. The dashed box shows a clock domain of the power-clock Φi.

Fig. 3. For a nested RCA structure (here with two subsequent addition operations), the relative overhead due to synchronization is reduced. Please note that inputs C also have to be delayed via input buffers.

Other approaches try to reduce the critical path, i.e. the path of the rippling carry, to gain speed in static CMOS designs or to reduce the latency in adiabatic logic. As we are talking about synchronous, pipelined designs, approaches like the carry-bypass adder cannot be applied to adiabatic logic. The carry-select scheme splits the input words into smaller groups, i.e. two groups of size N/2 each, and pre-calculates the sums for both input carry alternatives. The incoming carry from the preceding group selects the appropriate output via a multiplexer. Such a design trades area, and thus energy, against speed in static CMOS and against latency in adiabatic logic. Additionally, an overhead from the multiplexers arises for such an adder. If we take the RCA from Fig. 2 and split the adder into two groups of 2 bits, the design uses 3 blocks consisting of 2 full-adder cells and three buffers each, one 6:3 multiplexer and buffers that synchronize the output bits of the lower block to the outputs of the multiplexer. Figure 4 shows the arrangement with a multiplexer of logic depth 1; this design uses more full-adders than the RCA in Fig. 2, but fewer buffers. If the RCA and the carry-select adder are compared for high bit widths N, the RCA suffers from a higher energy consumption. This estimation does not account for the energy dissipation caused in the multiplexer. But as only N/2 XOR gates are used for the multiplexer, the energy consumed by the multiplexer is negligible.
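The quadratic buffer overhead of the pipelined ripple-carry adder discussed in this section can be made concrete with a simple counting model. The Python sketch below rests on assumptions that are not spelled out in the text: bit position i waits i pipeline stages for its carry, each waiting stage costs two buffers (one for A, one for B, as stated for Fig. 2), and each sum bit is buffered until the most significant bit is ready so that all outputs leave together. The exact count in the actual design may differ; only the $O(N^2)$ trend and the N/4 cycle latency are taken from the text.

```python
def rca_sync_buffers(n_bits):
    """Count synchronization buffers of an N-bit pipelined RCA (assumed model)."""
    # Inputs A_i and B_i wait i stages for the carry: two buffers per stage.
    input_buffers = sum(2 * i for i in range(n_bits))
    # Sum bit i waits for the remaining stages so all outputs leave together.
    output_buffers = sum(n_bits - 1 - i for i in range(n_bits))
    return input_buffers + output_buffers


def rca_latency_cycles(n_bits):
    """One full-adder stage per power-clock phase, four phases per cycle."""
    return n_bits / 4


if __name__ == "__main__":
    for n_bits in (4, 8, 16, 32, 64):
        print(n_bits, "bits:", n_bits, "FA cells,",
              rca_sync_buffers(n_bits), "buffers,",
              rca_latency_cycles(n_bits), "cycles latency")
```

Under these assumptions the buffer count is 3N(N−1)/2, so the number of buffers per full-adder cell grows linearly with N while the number of full-adder cells itself grows only linearly.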


Fig. 4. A 4 bit RCA carry-select adder using a multiplexer with logic depth of 1.

To avoid the synchronizing overhead observed for the RCA, parallel-prefix schemes are investigated. The parallel-prefix adders (PPA) calculate the output of the m-th bit via a logical combination of the inputs 0 to m−1. Therefore, theoretically each output of the adder can be calculated using one complex gate. For practical reasons in stacking transistor devices, only a maximum of 3 or 4 inputs is combined at a time. In the literature, different schemes for the parallel-prefix algorithm are reported, differing in logical depth and fanout. Here a variety of PPA structures (Sklansky, Brent-Kung, Kogge-Stone, Han-Carlson) are evaluated to see which adder structure is desirable in adiabatic logic. For example, the Sklansky structure has the lowest logical depth, but its layout is expected to be irregular. The Han-Carlson structure offers a more regular layout at the price of a higher logical depth. Additionally, the RCA structure and a serial-prefix adder (SPA) are evaluated with respect to their energy consumption.

Estimations are performed by counting the number $X_i$ of each class of gates $i$ (e.g. INV, FA) and multiplying it with its corresponding mean energy value $E_i$. The whole system's energy dissipation $E$ is then calculated by

$$E = \sum_i X_i E_i. \qquad (1)$$

Gates are characterized by means of SPICE simulation; industrial 130 nm CMOS process parameters of low-$V_{th}$ devices are used. The gates are simulated at a frequency of 100 MHz and a supply voltage of $V_{DD} = 1.2$ V. In Fig. 5 the investigated adder structures are presented. The RCA rises in energy consumption quadratically with the bit width N. The parallel-prefix adders present the lowest energy consumption. The serial-prefix adder consumes the largest amount of energy and thus is no alternative.
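The estimation procedure of Eq. (1) amounts to a weighted sum over gate classes. The following Python sketch shows the bookkeeping; the gate counts and the mean energies (normalized to the energy of one buffer) are hypothetical placeholders, since the SPICE-derived values are not listed in this section.

```python
def estimate_energy(gate_counts, mean_energies):
    """Evaluate Eq. (1): E = sum over gate classes i of X_i * E_i."""
    return sum(count * mean_energies[gate]
               for gate, count in gate_counts.items())


if __name__ == "__main__":
    # Hypothetical mean energies per gate class, in units of one buffer energy.
    mean_energies = {"BUF": 1.0, "FA": 4.0}
    # Hypothetical gate counts X_i for a small adder.
    gate_counts = {"FA": 4, "BUF": 18}
    print(estimate_energy(gate_counts, mean_energies))
```

Keeping counts and per-gate energies in separate tables makes it easy to re-evaluate the same structure with energies characterized at another frequency or supply voltage.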

Fig. 5. Comparison of various adder structures up to 64 bits. The parallel-prefix adders show the lowest energy consumption. (Plot: energy per cycle [E_buf] versus bit width N for RCA, SPA and the Sklansky, Brent-Kung, Kogge-Stone and Han-Carlson PPAs.)
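To illustrate the parallel-prefix principle behind the PPA curves in Fig. 5, the sketch below models a Kogge-Stone style carry computation in Python. It is a behavioural model only, not the PFAL gate-level design evaluated in this work; the function name `kogge_stone_add` and the returned level count are choices made for this illustration.

```python
def kogge_stone_add(a, b, width):
    """Add two unsigned integers with a Kogge-Stone prefix network model.

    Returns (sum modulo 2**width, carry_out, number of prefix levels).
    """
    # Bitwise generate and propagate signals, least significant bit first.
    generate = [(a >> i) & (b >> i) & 1 for i in range(width)]
    propagate = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    half_sum = propagate[:]  # kept for the final sum computation

    distance = 1
    levels = 0
    while distance < width:
        new_generate = generate[:]
        new_propagate = propagate[:]
        for i in range(distance, width):
            # Combine the group ending at bit i with the group ending
            # 'distance' positions below it.
            new_generate[i] = generate[i] | (propagate[i] & generate[i - distance])
            new_propagate[i] = propagate[i] & propagate[i - distance]
        generate, propagate = new_generate, new_propagate
        distance *= 2
        levels += 1

    # After the last level generate[i] is the carry out of bits 0..i.
    result = 0
    for i in range(width):
        carry_in = generate[i - 1] if i > 0 else 0
        result |= (half_sum[i] ^ carry_in) << i
    return result, generate[width - 1], levels


if __name__ == "__main__":
    print(kogge_stone_add(100, 57, 8))
```

All carries are available after about log2(N) combination levels instead of N ripple steps, which is why prefix structures avoid the long chains of synchronization buffers seen in the RCA. The other schemes named in the text differ in how the groups are combined, trading logical depth against fan-out and wiring.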

In the detail view in Fig. 6 only the RCA and the most promising PPA structures are presented. Up to 8 bit input width, the RCA is the best choice.

For the decision between Sklansky and Han-Carlson, different properties have to be taken into consideration. As said before, the Han-Carlson structure has a more regular design, thus saving layout effort. Additionally, the Han-Carlson adder has a fixed maximum fan-out of 2. For the Sklansky structure, the layout effort will be higher, and the fan-out rises with N/2. So the decision which design is a suitable low-power adder structure will depend on the post-layout extraction, taking into account the rise of the energy consumption due to parasitic capacitances.

4 Discrete cosine transformation: a case study

In image processing the Discrete Cosine Transformation (DCT) is used to compress pictures. A DCT with minimal hardware amount is presented in (Heyne et al., 2006). As the DCT is a strongly parallel algorithm that can be efficiently implemented using the adiabatic micropipeline, it is a good demonstrator to show the energy savings gained by using adiabatic logic in comparison to static CMOS. The DCT is built of adders, subtractors and buffers, as can be seen in Fig. 7.

Different static CMOS implementations have been compared. Flip-flops are power-consuming components in a synchronous design in static CMOS. Different levels of pipelining have been investigated and compared to the adiabatic logic DCT implementation. The adiabatic reference design is implemented as a nested RCA structure; thus a minimum number of FA cells is used and the overhead due to synchronization is reduced by nesting. For static CMOS a 2D-pipelined carry-propagate adder (CPA) has been estimated first. To reduce the number of flip-flops, in V1 to V3

To reduce the number of flip-flops, in V1 to V3 a carry-save array (CSA) has been used to avoid skewing and deskewing at the inputs and outputs, respectively. In V1 a flip-flop is introduced after each CSA stage, in V2 after every second CSA stage, and in V3 no pipelining was introduced at all. Results are calculated according to

$$E_{AL} = X_{FA,AL} \cdot E_{FA,AL} + X_{BUF,AL} \cdot E_{BUF,AL} \tag{2}$$

$$E_{CMOS} = X_{FA,CMOS} \cdot E_{FA,CMOS} + X_{INV,CMOS} \cdot E_{INV,CMOS} + X_{FF,CMOS} \cdot E_{FF,CMOS} \tag{3}$$

where the index AL stands for adiabatic logic, FA for full adder, BUF for a buffer, INV for an inverter, and FF for a flip-flop.

Fig. 6. Comparison of RCA and most promising PPA adders, Han-Carlson and Sklansky PPA, up to 32 bits. The RCA exhibits the lowest energy consumption up to 8 bits. [Plot: energy per cycle in units of E_buf versus bit width N; curves: RCA, PPA Sklansky, PPA Han-Carlson.]

Fig. 7. Scheme of the DCT, where a Nop (No Operation) is a synchronizing buffer stage.

Fig. 8. Comparing the differing static CMOS DCT implementations to the adiabatic DCT. For static CMOS, first a 2D-pipelined carry-propagate adder (CPA) has been implemented and V1 to V3 show carry-save array (CSA) structures with different degrees of pipelining, where V3 has no pipelining at all. [Plot: ESF versus activity factor α; curves: CPA vs. Ref, CSA V1 vs. Ref, CSA V2 vs. Ref, CSA V3 vs. Ref.]
Especially for static CMOS, the activity α is an important parameter for energy dissipation, whereas in PFAL the activity has only a minor impact on the dissipation. To allow a fair comparison, activity has to be considered in the estimations via

$$E(\alpha) = E_{\alpha=0} + \alpha \left(E_{\alpha=1} - E_{\alpha=0}\right), \tag{4}$$

where $E_{\alpha=0}$ and $E_{\alpha=1}$ are the respective mean energy dissipation values for activity α=0 and α=1.

The results of the comparison are given in Fig. 8. All estimations are based on simulations in an industrial 130 nm CMOS process, at 100 MHz and with $V_{DD}$ = 1.2 V. The mean energy dissipation values for the used gates are summarized in Table 1 for α=0 and α=1.

Table 1. Simulated energy per cycle dissipation values for adiabatic and static CMOS gates. All estimations are performed for α=0 and α=1.

family   gate   dissipated energy per cycle
                α=0         α=1
AL       FA     2.10 fJ     2.67 fJ
AL       BUF    0.24 fJ     0.33 fJ
CMOS     FA     3.48 aJ     37.6 fJ
CMOS     INV    1.59 aJ     2.66 fJ
CMOS     FF     6.46 fJ     19.0 fJ

In Fig. 8 we see that the best implementation for static CMOS is V3, which uses no pipelining. There the energy saving factor

$$ESF = \frac{E_{diss,CMOS}}{E_{diss,AL}} \tag{5}$$

reaches its minimum value. Depending on the activity, the ESF reaches up to a factor of 30 for the highest activity of the system. Applying a realistic activity value of α=30% (Ye and Roy, 2001), an ESF of factor 10 can be gained. As the generation of the four-phase clock itself is lossy, the ESF must take into account the efficiency of the oscillator. Oscillators with an efficiency of 50% have been reported, and there is still room for further improvement of the oscillator efficiency. For a system-level comparison, the losses on the clock net in a static CMOS system have to be considered as well. This requires detailed information on the topology of the clock net. Predictions for high-performance systems show that almost as much energy is used in the clock distribution as in the logic itself. Flip-flops are needed at least at the inputs and outputs of the structure, leading to a slight increase of the dissipation even for the adder V3 on system level. Thus the factor of 10 presented here should also be a good indication for the savings achievable with adiabatic logic in a complete system.
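The estimation flow of Eqs. (2) to (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: X is taken to be the number of gates of each type, the per-gate energies are the Table 1 values, and the gate counts in `counts_al` and `counts_cmos` are hypothetical placeholders, not the gate counts of the DCT designs compared in Fig. 8. The function names are likewise chosen here for illustration.

```python
"""Sketch of the energy model in Eqs. (2)-(5) using the Table 1 values."""

# Mean dissipated energy per cycle in joules as (alpha=0, alpha=1), from Table 1.
TABLE1 = {
    ("AL", "FA"): (2.10e-15, 2.67e-15),
    ("AL", "BUF"): (0.24e-15, 0.33e-15),
    ("CMOS", "FA"): (3.48e-18, 37.6e-15),
    ("CMOS", "INV"): (1.59e-18, 2.66e-15),
    ("CMOS", "FF"): (6.46e-15, 19.0e-15),
}


def gate_energy(family, gate, alpha):
    """Eq. (4): linear interpolation between the alpha=0 and alpha=1 values."""
    e0, e1 = TABLE1[(family, gate)]
    return e0 + alpha * (e1 - e0)


def system_energy(family, counts, alpha):
    """Eqs. (2)/(3): sum over gate types of count times per-gate energy.

    Assumption: X in Eqs. (2)/(3) is interpreted as a gate count.
    """
    return sum(n * gate_energy(family, gate, alpha) for gate, n in counts.items())


def energy_saving_factor(counts_al, counts_cmos, alpha):
    """Eq. (5): ESF = E_diss,CMOS / E_diss,AL."""
    return system_energy("CMOS", counts_cmos, alpha) / system_energy("AL", counts_al, alpha)


# HYPOTHETICAL gate counts, chosen only to exercise the formulas;
# they are not taken from the paper.
counts_al = {"FA": 100, "BUF": 200}
counts_cmos = {"FA": 100, "INV": 50, "FF": 20}

for alpha in (0.1, 0.3, 1.0):
    esf = energy_saving_factor(counts_al, counts_cmos, alpha)
    print(f"alpha={alpha:.1f}  ESF={esf:.1f}")
```

Because the static CMOS gate energies in Table 1 grow much more strongly with α than the adiabatic ones, the ESF computed this way rises with the activity factor, in line with the trend described for Fig. 8.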

5 Conclusions

In this paper the inherent properties of adiabatic logic have been analyzed for arithmetic building blocks. Adder structures were investigated and compared for the first time, showing that due to the synchronization overhead of $O(N^2)$ the ripple-carry adder is only suitable for bit sizes N≤8. Parallel-prefix adders are a good choice, as the reduced depth of the adder also reduces the area and the synchronization overhead and thus the power consumption. The Han-Carlson and the Sklansky architectures show the lowest energy consumption of all parallel-prefix schemes. The Discrete Cosine Transformation is an arithmetic structure that makes use of adiabatic logic's inherent pipelining structure. The nested ripple-carry adder scheme allows major savings over comparable designs in static CMOS. For an activity factor of 30%, savings of around a factor of 10 can be achieved. Adiabatic logic is therefore a circuit family that allows the implementation of ultra-low-power digital signal-processing tasks.
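The quadratic growth of the synchronization overhead in a micropipelined ripple-carry adder can be made plausible with a simple counting model. The sketch below is an idealization and an assumption, not the netlist used in this work: it places one full adder per pipeline stage, so that the operand bits of position i have to be delayed by i buffer stages (skewing) and sum bit i has to be delayed by N-1-i stages (deskewing). The function name `rca_sync_buffers` is hypothetical.

```python
def rca_sync_buffers(n_bits):
    """Count synchronization buffers of an idealized n-bit micropipelined RCA.

    Assumed model: bit i is processed in pipeline stage i, so both operand
    bits a_i and b_i need i buffers, and sum bit s_i needs n_bits - 1 - i
    buffers to leave the adder together with the most significant bit.
    """
    skew = sum(2 * i for i in range(n_bits))              # input skewing
    deskew = sum(n_bits - 1 - i for i in range(n_bits))   # output deskewing
    return skew + deskew


for n in (4, 8, 16, 32):
    print(f"N={n}: {n} full adders, {rca_sync_buffers(n)} synchronization buffers")
```

In this model the number of full adders grows linearly with N, while the buffer count equals 3N(N-1)/2 and therefore grows quadratically, which is the kind of overhead that penalizes the ripple-carry adder at larger bit widths.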


References

Amirante, E.: Adiabatic Logic in Sub-Quartermicron CMOS Technologies, vol. 13 of Selected Topics of Electronics and Micromechatronics, Shaker, 2004.
Beaumont-Smith, A. and Lim, C.-C.: Parallel prefix adder design, in: 15th IEEE Symposium on Computer Arithmetic, 218–225, 2001.
Brent, R. P. and Kung, H. T.: A regular layout for parallel adders, IEEE Transactions on Computers, C-31, 260–264, 1982.
Fischer, J.: Adiabatische Schaltungen und Systeme in Deep-Submicron-CMOS-Technologien, vol. 24 of Selected Topics of Electronics and Micromechatronics, Shaker, 2006.
Han, T. and Carlson, D. A.: Fast area-efficient VLSI adders, in: Proceedings of the 8th Symp. Comp. Arithmetic, 49–56, 1987.
Heyne, B., Sun, C.-C., Goetze, J., and Ruan, S.-J.: A Computationally Efficient High-Quality CORDIC based Loeffler DCT, in: 14th European Signal Processing Conference (EUSIPCO), 2006.
Kogge, P. M. and Stone, H. S.: A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Transactions on Computers, C-22, 786–793, 1973.
Koren, I.: Computer Arithmetic Algorithms, A K Peters, second edn., 2002.
Ladner, R. E. and Fischer, M. J.: Parallel Prefix Computation, J. Assoc. Computing Machinery, 27, 831–838, 1980.
Sklansky, J.: Conditional-sum addition logic, IRE Transactions, EC-9, 226–231, 1960.
Vetuli, A., Pascoli, S. D., and Reyneri, L. M.: Positive feedback in adiabatic logic, Electronics Lett., 32, 1867–1869, 1996.
Ye, Y. and Roy, K.: QSERL: Quasi-Static Energy Recovery Logic, IEEE Journal of Solid-State Circuits, 36, 239–248, 2001.
Zimmermann, R.: Binary Adder Architectures for Cell-Based VLSI and their Synthesis, Ph.D. thesis, ETH Zurich, 1997.
