
INTERNATIONAL JOURNAL OF CIRCUIT THEORY AND APPLICATIONS Int. J. Circ. Theor. Appl. (2011) Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cta.757

From energy-delay metrics to constraints on the design of digital circuits

Massimo Alioto 1,2, Elio Consoli 3 and Gaetano Palumbo 3,∗,†

1 DII (Dipartimento di Ingegneria dell’Informazione), University of Siena, I-53100, Siena, Italy
2 Berkeley Wireless Research Center, EECS (Electrical Engineering and Computer Sciences) Department, University of California, Berkeley, CA 94704-1302, U.S.A.
3 DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi), University of Catania, I-95125, Catania, Italy

SUMMARY

In this paper, the adoption of general metrics of the energy-delay tradeoff is investigated to achieve energy-efficient design of digital CMOS very large-scale integrated circuits. Indeed, as shown in a preliminary analysis on the performance of various commercial microprocessors, a wide range of $E^i D^j$ metrics is typically adopted. Physical interpretation and interesting properties for the designs minimizing $E^i D^j$ metrics are provided together with the adoption of the Logical Effort theory to define practical design constraints. Two design examples in a 65-nm CMOS technology are also reported to exemplify the theoretical results.

Copyright © 2011 John Wiley & Sons, Ltd.

Received 3 May 2010; Revised 30 November 2010; Accepted 5 December 2010

KEY WORDS: energy-efficiency; low power; high speed; energy-delay tradeoff; VLSI; Logical Effort

1. INTRODUCTION

In the last few years, the continuous increase in energy consumption has become the major concern limiting the speed performance of digital very large-scale integrated (VLSI) circuits, to the extent that, even for high-speed systems, designs operate in a power-limited regime [1, 2]. Therefore, since the achievement of energy-efficiency must be the primary target [3, 4], a deep understanding of the energy-delay (E-D) tradeoff and the related design issues is crucial.

In the past, the minimization of figures of merit (FOMs) has been proposed to deal with the above issues, such as the energy-delay products $ED$ (when energy and speed are equally weighted) [5, 6] or $ED^2$ (when speed has priority over energy) [7]. The above FOMs can be easily generalized to the $E^i D^j$ class (equivalently to the E D representation [8–10]), where the exponents $i$ and $j$ allow for putting more weight on energy or speed, depending on whether $i > j$ or vice versa.

This paper deals with the design of digital CMOS circuits in the E-D space by exploiting the properties of $E^i D^j$ FOMs and several related considerations. In particular, Section 2 reports a comparison of several commercial microprocessors, to highlight that, even at the highest abstraction level, practical digital VLSI systems are designed to minimize a wide range of FOMs [11]. In Section 3, we discuss the properties of the designs that minimize the $E^i D^j$ FOMs, in terms of their relationship with the energy-efficient curve (EEC) and of the energy-to-delay sensitivity. Section 4 deals with the definition of practical design constraints, again by resorting to the properties of $E^i D^j$ FOMs. Exemplifying results, relative to circuits in a 65-nm CMOS technology, are reported in Sections 5 and 6, while conclusions are in Section 7. The paper also contains two appendixes: the first is focused on EEC-related properties, whereas the second reports a methodology to estimate the energy of a gate once its input capacitance is given.

∗ Correspondence to: Gaetano Palumbo, DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi), University of Catania, I-95125, Catania, Italy.
† E-mail: [email protected]

2. COMPARISON OF E-D IN COMMERCIAL MICROPROCESSORS

As concerns the microprocessor energy, let us define $E_{abs}$ as the dissipation in one clock cycle, i.e. the product of the average power $P$ and the clock period $T_{CK}$. Furthermore, define $N_{Cores}$ as the number of cores within the chip, $F_{CK}$ as the clock frequency and IPC as the maximum number of instructions per cycle that a core can execute in parallel. Thus, assuming that no stalls occur within each core and that all cores are always busy, the maximum number of instructions executable in 1 s is

$$\mathrm{IPS}_{max} = F_{CK} \cdot N_{Cores} \cdot \mathrm{IPC} \qquad (1)$$

Hence, the average time required to execute an instruction, which can be interpreted as the microprocessor delay to execute a single instruction, is $D_{abs} = 1/\mathrm{IPS}_{max}$. The above-mentioned parameters are reported in Table I for several commercial microprocessors (public data taken from their datasheets). Since the chips chosen for the analysis are not realized with the same technology and features, in order to ensure consistent results we have to normalize $E_{abs}$ and $D_{abs}$ according to scaling rules, as shown below. The generic delay of a CMOS gate can be expressed as

$$\text{gate delay} \propto \frac{C\,V_{DD}}{I} \qquad (2)$$

where $V_{DD}$ is the supply voltage, $C$ is the load capacitance and $I$ is the average current that (dis)charges it. Considering deep-submicron technologies, the dependence on the supply voltage is nearly canceled in the ratio $V_{DD}/I$ (due to velocity saturation [12]), whereas $C$ scales linearly with the feature size. From Equation (2), the gate delay therefore scales approximately linearly with feature size. Hence, by defining $S$ as the scaling factor with respect to a reference 65-nm technology, $D_{abs}$ scales by a factor $S_D = S$. The overall dynamic, $E_{DYN}$, and static, $E_{STAT}$, consumption of a group of gates can be expressed as [1, 6]

$$E_{DYN} \propto C_{TOT}\,V_{DD}^2 \qquad (3)$$

$$E_{STAT} \propto W\,V_{DD}\,T_{CK}\,e^{-V_{TH}/n} \qquad (4)$$

where $C_{TOT}$ is the generic (sum of) capacitance(s) to be (dis)charged, $W$ is the generic (sum of) size(s) of the leaking gates, $V_{TH}$ is the threshold voltage and $n$ is a technology-dependent parameter relative to sub-threshold features. It is not possible to separate the dynamic and static contributions to the overall dissipation reported in Table I. Despite this limitation, we consider an overall scaling factor for energy that is the average of the scaling factors of $E_{DYN}$ and $E_{STAT}$ (in current technologies, static consumption is closely approaching the dynamic one [13, 14]). In particular, $E_{DYN}$ scales by a factor $S \cdot S_{VDD}^2$, where $S_{VDD}$ is the scaling factor of the supply voltage using $V_{DD} = 1$ V as a reference. Instead, $E_{STAT}$ scales by a factor $S \cdot S_{VDD} \cdot S_{TCK}$, where $S_{TCK}$ is the scaling factor of the clock period using $F_{CK} = 1$ GHz as a reference. On the whole, the scaling factor on $E_{abs}$ is

$$S_E = \frac{S \cdot S_{VDD}^2 + S \cdot S_{VDD} \cdot S_{TCK}}{2} \qquad (5)$$

Finally, the normalized energy $E$ and delay $D$ of a given microprocessor are

$$D = \frac{D_{abs}}{S_D} \qquad (6)$$

$$E = \frac{E_{abs}}{S_E} \qquad (7)$$

Table I. Features of the compared commercial microprocessors.

Microprocessor            Tech. (nm)  F_CK (GHz)  IPC  N_Cores  V_DD (V)  Power (W)
Intel Core 2 Duo U7500    65          0.80        4    2        0.90      9.2
Athlon 64 X2 3800+ EE     90          2.00        2    2        1.35      35.0
Intel Core Duo X7900      65          2.80        4    2        1.25      44.0
Intel Xeon 5300           65          2.33        4    4        1.20      60.0
IBM Power6                65          5.00        4    2        1.00      130.0
Intel QX6800              65          2.93        4    4        1.25      130.0

Table II. Products from $ED^3$ to $E^3D$ for the compared commercial microprocessors.

Microprocessor            ED^3             ED^2             ED             E^2D             E^3D
                          (10^-41 J s^3)   (10^-30 J s^2)   (10^-19 J s)   (10^-29 J^2 s)   (10^-39 J^3 s)
Intel Core 2 Duo U7500    4534.3           290.2            18.6           2207.6           262398.1
Athlon 64 X2 3800+ EE     744.7            82.5             9.1            924.8*           93601.1*
Intel Core Duo X7900      139.2            31.2             7.0*           1092.6           170935.6
Intel Xeon 5300           50.8             19.0*            7.1            1861.5           490394.4
IBM Power6                67.7             27.1             10.8           4694.4           2034259.3
Intel QX6800              43.3*            20.3             9.5            4245.2           1893851.3

* Minimum value of the given FOM (gray cell in the original table).

[Figure 1: scatter plot of normalized energy E [nJ] (vertical axis, ticks from 10 to 45) versus normalized delay D [ps] (horizontal axis, ticks from 20 to 160), with one point per microprocessor: Intel U7500, Athlon 64 X2 3800, Intel X7900, Intel Xeon 5300, IBM Power 6, Intel QX6800.]

Figure 1. Location of the compared commercial microprocessors in the normalized E-D space.

The products from $ED^3$ to $E^3D$ are reported in Table II, where the cells in gray refer to the microprocessors exhibiting the minimum value of a given FOM. Moreover, Figure 1 illustrates the location of the analyzed microprocessors in the normalized E-D space. From inspection of Table II, the usual FOMs $ED$ and $ED^2$ do not cover the entire range of practical applications that are found in the market. More specifically, the fast Intel QX6800 (which minimizes $ED^3$) and the Athlon 64 (which minimizes $E^2D$ and $E^3D$) would not have been considered as practically high-performance systems by using the traditional FOMs $ED$ and $ED^2$. Therefore, the use of general $E^i D^j$ FOMs is highly useful also at the highest abstraction level (where we consider the performance of microprocessors as a whole), since they give a more realistic ranking of microprocessors.
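As a rough cross-check of Equations (1) and (5)–(7), the following Python sketch recomputes the normalized energy and delay from the Table I data and ranks the chips under a generic $E^i D^j$ FOM. It assumes $S$ = technology node / 65 nm, $S_{VDD} = V_{DD}$ / 1 V, $S_{TCK} = T_{CK}$ / 1 ns and $S_D = S$, which is our reading of the text; the function names and the data layout are illustrative and not taken from the paper.

```python
# Table I data: (name, tech_nm, f_ck_GHz, ipc, n_cores, vdd_V, power_W)
CHIPS = [
    ("Intel Core 2 Duo U7500", 65, 0.80, 4, 2, 0.90, 9.2),
    ("Athlon 64 X2 3800+ EE", 90, 2.00, 2, 2, 1.35, 35.0),
    ("Intel Core Duo X7900", 65, 2.80, 4, 2, 1.25, 44.0),
    ("Intel Xeon 5300", 65, 2.33, 4, 4, 1.20, 60.0),
    ("IBM Power6", 65, 5.00, 4, 2, 1.00, 130.0),
    ("Intel QX6800", 65, 2.93, 4, 4, 1.25, 130.0),
]


def normalized_energy_delay(tech_nm, f_ck_ghz, ipc, n_cores, vdd, power):
    """Return (E, D) in joules and seconds, normalized per Equations (5)-(7)."""
    f_ck = f_ck_ghz * 1e9
    t_ck = 1.0 / f_ck
    d_abs = 1.0 / (f_ck * n_cores * ipc)   # D_abs = 1 / IPS_max, Equation (1)
    e_abs = power * t_ck                   # energy dissipated in one clock cycle
    s = tech_nm / 65.0                     # assumed: scaling vs. the 65-nm reference
    s_vdd = vdd / 1.0                      # assumed: scaling vs. V_DD = 1 V
    s_tck = t_ck / 1e-9                    # assumed: scaling vs. F_CK = 1 GHz
    s_e = (s * s_vdd**2 + s * s_vdd * s_tck) / 2.0   # Equation (5)
    return e_abs / s_e, d_abs / s          # Equations (7) and (6), with S_D = S


def best_chip(i, j):
    """Name of the chip in CHIPS that minimizes E^i * D^j."""
    def fom(chip):
        e, d = normalized_energy_delay(*chip[1:])
        return e**i * d**j
    return min(CHIPS, key=fom)[0]
```

Under these assumptions, `best_chip(1, 3)` selects the Intel QX6800 and `best_chip(2, 1)` the Athlon 64, in line with Table II.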



3. DESIGN IN THE E-D SPACE AND $E^i D^j$ METRICS

For a digital circuit (a single gate or a path), the EEC is defined as the set of design points in the E-D space that exhibit minimum delay for a given energy dissipation (or vice versa). By definition, any design point above the EEC leads to a needlessly higher energy for the same speed performance (Figure 2). In [8–10] it was shown that, at a fixed load and supply voltage, the EEC has a hyperbolic shape

$$(E - E_0)(D - D_0) = E_0 D_0 \qquad (8)$$

where $E_0$ and $D_0$ are the energy and delay asymptotes. Considerations regarding $E_0$, $D_0$ and […] are in Appendix A.1. As shown in Appendix A.2, a design solution minimizing a FOM $E^i D^j$ lies on the EEC, i.e. this curve is made up of all points that minimize $E^i D^j$ for some $(i, j)$ (see Figure 2). Thus, by minimizing $E^i D^j$ for a limited number of pairs $(i, j)$, the EEC of a circuit can be extracted by simply interpolating such optimum points. From a practical perspective, an effective optimization algorithm (exploited in Section 6 to extract the EEC of a typical topology) can be based on the following properties:

(1) A binary search can be employed to identify minimum-$E^i D^j$ designs because of the practical convexity of $E^i D^j$ functionals in an $M$-dimensional design space.
(2) The design space to be explored can be progressively reduced. Indeed, assuming $j_1/i_1 < j_2/i_2$, a design optimizing $E^{i_1} D^{j_1}$ always features a smaller sizing than that optimizing $E^{i_2} D^{j_2}$.

An interesting property of minimum-$E^i D^j$ designs concerns the energy-to-delay sensitivity, which is equal to $-j/i$ (as shown in Appendix A.3) and holds when tuning any of the design variables.
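The $-j/i$ sensitivity property can be illustrated numerically. The sketch below is only a toy model: it assumes a hyperbolic curve $(E - E_0)(D - D_0) = K$ with hypothetical normalized values $E_0 = D_0 = K = 1$ (not data from the paper), locates the minimum-$E^i D^j$ point along the curve with a one-dimensional ternary search (a stand-in for the multi-dimensional search mentioned in property (1)), and evaluates the relative sensitivity $(\partial E/E)/(\partial D/D)$ at that point.

```python
import math

# Hypothetical, normalized curve parameters (illustrative only).
E0, D0, K = 1.0, 1.0, 1.0


def energy_on_curve(d):
    """Energy on the hyperbola (E - E0)(D - D0) = K, valid for d > D0."""
    return E0 + K / (d - D0)


def min_fom_delay(i, j, lo=D0 + 1e-6, hi=1e3, iters=200):
    """Ternary search for the delay minimizing i*ln(E) + j*ln(D) on the curve."""
    def cost(d):
        return i * math.log(energy_on_curve(d)) + j * math.log(d)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def energy_to_delay_sensitivity(d, rel=1e-6):
    """Central-difference estimate of (dE/E)/(dD/D) along the curve at delay d."""
    h = d * rel
    d_energy = energy_on_curve(d + h) - energy_on_curve(d - h)
    return (d_energy / energy_on_curve(d)) / (2.0 * h / d)
```

For instance, `energy_to_delay_sensitivity(min_fom_delay(1, 2))` should be close to $-2$, and a larger $j/i$ moves the optimum toward smaller delays, which is consistent with property (2) if a faster design corresponds to a larger sizing.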

4. DESIGN CONSTRAINTS: PROPER SELECTION OF INPUT CAPACITANCE AND LOGICAL EFFORT METHOD USAGE

4.1. Assumptions: fully variable $C_{IN}$ and energy breakdown

As shown in the recent works [4, 15], when dealing with energy-efficient design, the input capacitance, $C_{IN}$, of a logic circuit cannot be simply assumed as fixed. The adopted $C_{IN}$ value is also related to the architectural-level design strategies [15], and both the satisfaction of the E-D tradeoff required by the application and the extensive exploration of the potential in terms of energy and delay are strongly $C_{IN}$-dependent. Hence, in this paper, differently from [9, 10, 16], we consider $C_{IN}$ as an additional design variable to be fully optimized, just like all the other transistor sizes. Indeed, $C_{IN}$ corresponds to the width, $w_1$, of the transistors in the first gate of a circuit because, except for specific blocks like keepers, the channel length of any gate is at minimum size (a larger $L$ leads to both worse delay and energy).

[Figure 2: E-D plane showing the Energy-Efficient Curve (EEC) with its asymptotes $E_0$ and $D_0$ and the minimum energy $E_{min}$ (minimum size ensuring correct functionality). The minimum-$E^i D^j$ points range from Min $ED^5$ in the high-performance region, through Min $ED$ (energy and delay equally weighed), to Min $E^3D$ in the low-power region. All other design points lie above the EEC and are discarded.]

Figure 2. Energy-efficient curve and designs optimizing the metrics $E^i D^j$.

A second assumption, differently from [4, 9, 10, 15, 16], is to include the energy dissipated in charging and discharging the $C_{IN}$ of a topology and to exclude the energy dissipated in charging and discharging the external output load, $C_L$. Indeed, the first term is inherently related to the adopted circuit sizing ($C_{IN}$ is a further design knob in our work), whereas the latter term does not depend on the features of the topology [17].

The consequences of optimizing $C_{IN}$ over a wide exploration range deserve to be addressed:

• In general, a throughput increase can be achieved by increasing the degree of parallelism and/or by a more aggressive sizing of all the gates in the logic paths (e.g. when the serial part of the code is dominant and parallelization is not so effective). In the latter case, a $C_{IN}$ larger than medium values means that the topology is being sized to achieve a high speed (at the cost of a higher energy consumption). Even if the circuit imposes a larger load on the preceding logic stage (e.g. in a pipeline), in high-speed applications the speed penalty of the preceding logic stages can be outweighed by the speed improvement in the considered topology. This tradeoff cannot be explored unless a fully variable $C_{IN}$ is assumed.
• Conversely, when sizing to achieve low-power, low-speed operation, $C_{IN}$ can be strongly reduced.
Indeed, granted that the above tradeoff is still valid, low-power applications typically feature long cycle times and hence can easily tolerate slower stages and high logic depths (e.g. when no parallelization is adopted and the processing is actually done serially through single deep paths). In such a context, a slower topology can be tolerated in favor of its smaller energy dissipation.

Obviously, there always exist practical limits on the adoptable $C_{IN}$ values. Nevertheless, once the EEC is extracted as shown in the remainder of the paper, the designer can easily select the portion of interest according to practical constraints in terms of maximum allowed $C_{IN}$.

4.2. How to derive design constraints?

Given the general methodology for the EEC extraction suggested in Section 3, one needs to define practical design constraints (or bounds) that limit the space of design solutions. At the same time, one must guarantee that the selected bounds strictly contain the optimum sizings leading to the desired E-D tradeoffs. In [15] it is shown that Logical Effort (LE) designs lie above the EEC, i.e. they are not the most efficient possible designs. Even if (unlike [15]) the $C_{IN}$-related dissipation is included in this work, the same result still holds.‡ Nevertheless, the energy-to-delay sensitivity of LE designs can be exploited to determine design space bounds. More specifically, one can be interested in the portion of the EEC up to a certain minimum-$E^i D^j$ design point with $j/i = X$ (i.e. the portion of the EEC made up of energy-efficient designs that minimize FOMs with $j/i$ less than or equal to $X$) [9, 10]. Hence, the design bounds can be defined through the ‘limiting’ LE sizing exhibiting an energy-to-delay sensitivity with respect to $C_{IN}$ equal to $-X$. It is worth noting that if the searched $X$ is not large enough (say, smaller than 3), the bounds determined through LE will not be very close to the minimum-$E^i D^j$ design with $j/i = X$.

‡ As explained in [16], the minimum energy design under a given speed constraint is reached when the sensitivity with respect to ‘all’ the tunable variables is the same. Therefore, given that LE designs feature an infinite energy-to-delay sensitivity with respect to the sizes of internal transistors, but not with respect to those defining $C_{IN}$ (see Sections 4.3–4.5), the condition in [16] is not satisfied for LE designs, which thus are not energy-efficient.



4.3. Properties of the Logical Effort designs: speed

The LE method allows us to minimize the delay of a path from the knowledge of the topology, the input capacitance of the first stage, $C_{IN}$, and the external load capacitance of the path, $C_L$ [18]. In particular, given $C_{IN}$ and $C_L$, the optimized delay $D_{TOT}$ of a path of $N$ cascaded gates is [18]

$$D_{TOT} = N\sqrt[N]{G B H} + P = N\sqrt[N]{F} + P \qquad (9)$$

where $G$, $H$, $B$ and $P$ are the logical effort, electrical effort ($H = C_L/C_{IN}$), branching effort and parasitic delay of the entire path, respectively, and $F = GBH$ is the path effort. Thus, equivalently, one has

$$D_{TOT} = P(1 + k) \qquad (10)$$

$$k = \frac{N\sqrt[N]{G B}\,\sqrt[N]{C_L}}{P\,\sqrt[N]{C_{IN}}} \qquad (11)$$

where $k$ is the relative delay increment with respect to the ideal and practically inaccessible minimum path delay (i.e. the path parasitic delay $P$). From Equations (9)–(11), the sensitivity of the optimized path delay $D_{TOT}$ to $C_{IN}$ is given by

$$S^{D_{TOT}}_{C_{IN}} = \frac{\partial D_{TOT}}{\partial C_{IN}}\,\frac{C_{IN}}{D_{TOT}} = -\frac{1}{N}\,\frac{k}{k+1} \qquad (12)$$

which is a function of $C_{IN}$ only.

4.4. Properties of the Logical Effort designs: energy

As for the delay $D_{TOT}$, the energy $E_{TOT}$ of the LE sizing can be uniquely determined for a given $C_{IN}$ and $C_L$. In the following, we define $f_i$ as a factor that expresses the overall energy dissipation (both dynamic and static) of the $i$th gate in the path once its input capacitance, $C_i$, is given. An accurate estimation of $f_i$ is given in Appendix B. For the LE design of a path with $N$ stages, the input capacitance, $C_N$, and the overall energy dissipation, $E_N$, of the $N$th gate are given by [18]

$$C_N = \frac{g_N b_N \sqrt[N]{C_{IN}}}{\sqrt[N]{G B C_L}}\,C_L \qquad (13)$$

$$E_N = f_N C_N \qquad (14)$$

By iterating the above reasoning and going backward through the path, one finds that the input capacitance and energy dissipation of the $i$th gate (for the LE design) are

$$C_i = \prod_{j=i}^{N} g_j \prod_{j=i}^{N} b_j\,\frac{(C_{IN})^{\frac{N-i+1}{N}}}{(G B C_L)^{\frac{N-i+1}{N}}}\,C_L \qquad (15)$$

$$E_i = f_i C_i \qquad (16)$$

Therefore, the overall dissipation of the LE-sized path is

$$E_{TOT} = \sum_{i=1}^{N} f_i \left[\prod_{j=i}^{N} g_j \prod_{j=i}^{N} b_j\,\frac{(C_{IN})^{\frac{N-i+1}{N}}}{(G B C_L)^{\frac{N-i+1}{N}}}\,C_L\right] \qquad (17)$$
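Equations (9)–(17) can be exercised on a small example. The following Python sketch is a minimal illustration under stated assumptions: the per-stage efforts `g`, `b`, parasitic delays `p` and energy factors `f` are passed as plain lists, all quantities are in normalized units, and the function names are ours rather than the paper's. It computes the LE-sized input capacitances of Equation (15), the delay of Equation (9), the closed-form sensitivity of Equation (12) and the energy of Equation (17).

```python
import math


def le_input_caps(c_in, c_load, g, b):
    """Input capacitances C_1..C_N of an LE-sized path, per Equation (15)."""
    n = len(g)
    gb_path = math.prod(g) * math.prod(b)
    caps = []
    for i in range(1, n + 1):
        exponent = (n - i + 1) / n
        # Product of g_j * b_j from stage i to stage N.
        gb_tail = math.prod(g[i - 1:]) * math.prod(b[i - 1:])
        caps.append(gb_tail * (c_in / (gb_path * c_load)) ** exponent * c_load)
    return caps


def le_delay(c_in, c_load, g, b, p):
    """Optimized path delay, Equation (9): N * (G*B*H)^(1/N) + P."""
    n = len(g)
    path_effort = math.prod(g) * math.prod(b) * c_load / c_in
    return n * path_effort ** (1.0 / n) + sum(p)


def le_delay_sensitivity(c_in, c_load, g, b, p):
    """Closed-form sensitivity of D_TOT to C_IN, Equation (12)."""
    n = len(g)
    p_path = sum(p)
    k = (le_delay(c_in, c_load, g, b, p) - p_path) / p_path  # from Equation (10)
    return -k / (n * (k + 1.0))


def le_energy(c_in, c_load, g, b, f):
    """Energy of the LE-sized path, Equation (17): sum of f_i * C_i."""
    return sum(fi * ci for fi, ci in zip(f, le_input_caps(c_in, c_load, g, b)))


# Hypothetical 3-stage path (inverter, NAND2, NOR2) with textbook LE values.
G_STAGES = [1.0, 4.0 / 3.0, 5.0 / 3.0]
B_STAGES = [1.0, 1.0, 1.0]
P_STAGES = [1.0, 2.0, 2.0]
F_STAGES = [1.0, 1.0, 1.0]  # assumed unit energy factors
```

With these values, `le_input_caps(1.0, 64.0, G_STAGES, B_STAGES)` returns a first-stage capacitance equal to the imposed $C_{IN}$, and every stage bears the same effort $g_i b_i C_{i+1}/C_i$, as expected from an LE sizing.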



Although one cannot attain a simple expression like Equation (12), the sensitivity of energy to input capacitance can again be expressed as a function of $C_{IN}$ only

$$S^{E_{TOT}}_{C_{IN}} = \frac{\displaystyle\sum_{i=1}^{N} f_i\,\frac{N-i+1}{N}\left[\prod_{j=i}^{N} g_j \prod_{j=i}^{N} b_j\,\frac{(C_{IN})^{\frac{N-i+1}{N}}}{(G B C_L)^{\frac{N-i+1}{N}}}\,C_L\right]}{\displaystyle\sum_{i=1}^{N} f_i\left[\prod_{j=i}^{N} g_j \prod_{j=i}^{N} b_j\,\frac{(C_{IN})^{\frac{N-i+1}{N}}}{(G B C_L)^{\frac{N-i+1}{N}}}\,C_L\right]} \qquad (18)$$

4.5. Guidelines to deal with practical cases

According to the considerations in Section 4.2, the upper bound of $C_{IN}$ ($C_{IN,max}$) must be set to catch the minimum-$E^i D^j$ design limiting the portion of interest of the EEC. Ideally, $C_{IN,max}$ can be found by setting

$$S^{E_{TOT}}_{D_{TOT}} = \frac{S^{E_{TOT}}_{C_{IN}}}{S^{D_{TOT}}_{C_{IN}}} = -\frac{j}{i} \qquad (19)$$

and the definition of $C_{IN,max}$ also defines the upper bounds of the other design variables (i.e. transistor sizes), thanks to the LE method. Unfortunately, in general, formula (19) cannot be applied straightforwardly. Indeed, parameters $g_i$, $h_i$, $b_i$ and $p_i$ are often neither constant nor available in closed form as functions of $C_{IN}$, thus making it impossible to evaluate $C_{IN,max}$ from Equation (19). More specifically, in general one may have to deal with:

(a) keepers (affecting $g_i$ and $p_i$ in a non-linear way);
(b) gates with fixed sizes outside of the path under analysis (affecting $b_i$ in a non-linear way);
(c) gates having a size dependent on the design variables to be optimized§ (affecting $b_i$ in a non-linear way);
(d) local interconnection capacitances that depend on the sizes of the transistors in the path (affecting both $p_i$ and $h_i$ in a complex non-linear way);
(e) fixed resistive and capacitive contributions relative to interconnections (see [19] for a thorough analysis);
(f) branching effects related to reconvergent paths or multiple fanouts (affecting $b_i$ in a non-linear way).¶

Therefore, in general, $C_{IN,max}$ cannot be determined in a single step through formula (19): $g_i$, $h_i$, $b_i$ and $p_i$ can be found only by numerically solving a set of non-linear equations when applying the LE method for a given $C_{IN}$. In practice, for paths with purely static CMOS gates, with no branches, negligible resistive parasitics due to wires and assuming simple linear models for the parasitic capacitances of local wires, formulas (12) and (18) can be reasonably estimated and Equation (19) can be applied. Moreover, when weak keepers and small fixed-size branches are added but the limiting $j/i$ is high, such additional effects can be neglected.
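For the simple case mentioned above (static CMOS gates, no branches, constant $g_i$, $b_i$, $p_i$ and $f_i$), Equation (19) can be solved numerically for $C_{IN,max}$. The sketch below is a hypothetical illustration, not the authors' procedure: it evaluates Equations (12) and (18) in closed form and applies a bisection on a logarithmic $C_{IN}$ grid, relying on the observation that, under these idealized assumptions, the magnitude of the sensitivity ratio grows monotonically with $C_{IN}$. All names, the search interval and the example values are assumptions.

```python
import math


def sensitivities(c_in, c_load, g, b, p_path, f):
    """Return (S_D, S_E): Equations (12) and (18) for an LE-sized path."""
    n = len(g)
    gb_path = math.prod(g) * math.prod(b)
    k = n * (gb_path * c_load / c_in) ** (1.0 / n) / p_path   # Equation (11)
    s_d = -k / (n * (k + 1.0))                                # Equation (12)
    num = den = 0.0
    for i in range(1, n + 1):
        exponent = (n - i + 1) / n
        gb_tail = math.prod(g[i - 1:]) * math.prod(b[i - 1:])
        term = f[i - 1] * gb_tail * (c_in / (gb_path * c_load)) ** exponent * c_load
        num += exponent * term
        den += term
    return s_d, num / den                                     # Equation (18)


def c_in_max(ratio, c_load, g, b, p_path, f, lo=1e-3, hi=1e3, iters=200):
    """Bisection (log scale) for the C_IN at which S_E / S_D = -ratio, Equation (19)."""
    def residual(c):
        s_d, s_e = sensitivities(c, c_load, g, b, p_path, f)
        return s_e / s_d + ratio

    if residual(lo) <= 0.0 or residual(hi) >= 0.0:
        raise ValueError("target j/i is not bracketed by the search interval")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if residual(mid) > 0.0:
            lo = mid   # sensitivity ratio still too small: increase C_IN
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

As a usage example, `c_in_max(3.0, 64.0, [1.0, 4/3, 5/3], [1.0, 1.0, 1.0], 5.0, [1.0, 1.0, 1.0])` returns the normalized input capacitance of the limiting LE sizing for $j/i = 3$; a larger target $j/i$ yields a larger $C_{IN,max}$.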

§ More specifically, not all the transistors in the path have to be considered as variables to be optimized. Actually, only transistors lying in input-to-output paths should represent variables to be optimized in the E-D space, since they affect both energy and speed. On the contrary, there can exist some parts of the circuit whose size must simply be the minimum one (to save energy) guaranteeing a correct operation, since they affect only energy.
¶ Differently from what is discussed in the previous note, transistors in these side paths affect speed and hence their sizes must be considered as design variables to be optimized in the successive E-D space exploration. In any case, during the LE-based procedure described in this section to limit the design space bounds, such side paths must be sized to exhibit the same delay as the longest path.



In other cases, the need for iterative procedures arises. For instance, one can adopt the following cycle for increasing $C_{IN}$:
• under the current $C_{IN}$, (re)apply the LE method to find the other transistor sizes (a non-linear set of equations must be solved);
• (re)size those gates whose size depends on the above transistor sizes (see notes § and ¶);
• (re)simulate energy and delay;
• (re)extrapolate the $E_{TOT}$ vs $C_{IN}$ and $D_{TOT}$ vs $C_{IN}$ fitted curves and (re)compute the sensitivity (19) around the current $C_{IN}$ value;
E CIN |