Some Issues in Low Power Arithmetic for Fixed-Function DSP

Oscar Gustafsson and Lars Wanhammar
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
E-mail: {oscarg, larsw}@isy.liu.se

Abstract


In this paper we discuss some aspects of designing arithmetic circuits with low power consumption, focusing on digit-serial processing techniques, minimum-adder multipliers, and multiple constant multiplication. A review of the power dissipation sources in CMOS is given. Further, we discuss some methods to decrease the power dissipation at the algorithmic, arithmetic, and architecture levels.


1 Introduction

Digital signal processing algorithms are key components in communication systems. In this paper we discuss some of the key issues involved in the design and implementation of the arithmetic circuits for the inner loops of such algorithms. Generally, the throughput of the systems of concern here is fixed, and the design problem is viewed as a resource-adequacy problem. The throughput is therefore a fixed constraint within which we can trade off chip area against power consumption. Further, we assume that the required flexibility is low, i.e., the circuitry will perform only a single algorithm that is not required to be easily modified. Note that high flexibility in terms of modifiable algorithms is contradictory to highly efficient implementation. For example, a programmable signal processor is flexible, but highly inefficient in terms of throughput, chip area, and power consumption. In this restricted context we regard design efficiency and power consumption as the two most important aspects of this type of application. In this paper, however, we will focus on the latter.

In Section 2 we review the different sources of power consumption in CMOS circuits. In Section 3 we discuss various techniques to reduce the power consumption, concentrating on digit-serial processing, minimum-adder multipliers, and multiple constant multiplication techniques. Finally, in Section 4 we summarize our conclusions.

2 Energy Consumption

Minimizing the energy consumption of portable and battery-powered systems has become an important aspect of System-on-Chip design. In battery-powered systems the peak power that can be delivered is limited by the batteries, and we are interested in maximizing the runtime.

2.1 Power Dissipation in CMOS Circuits

CMOS technology is currently, and will continue to be, the dominant technology for VLSI and ULSI circuits. The feature size is rapidly shrinking towards 0.1 µm and below. Circuits implemented in modern CMOS processes are no longer characterized by the number of transistors on the chip; instead, the energy or power consumption is a more appropriate metric. The power dissipated in CMOS circuits consists of several components

$$P_{total} = P_{sw} + P_{sc} + P_{static} + P_{leakage} \quad (1)$$

where $P_{sw}$ is the power required to charge and discharge a capacitive load, $P_{sc}$ is the short-circuit power consumed during transitions, $P_{static}$ is the static power consumed by the device, and $P_{leakage}$ is the leakage power consumed by the device. $P_{sw}$ and $P_{sc}$ are present when a gate is changing state, while $P_{static}$ and $P_{leakage}$ are present regardless of state changes. $P_{sw}$ is [1]

$$P_{sw} = f_{CL} \, \alpha C_L \, V_{dd} \, \Delta V \quad (2)$$

where $f_{CL}$ is the frequency of operation (often the same as the clock frequency), $C_L$ is the equivalent switched capacitance of the whole circuit, $V_{dd}$ is the power supply voltage, $\Delta V$ is the voltage swing across the switched capacitance, and $\alpha$ is a switching activity factor based on the probability of a transition. The product $\alpha C_L$ is also referred to as the effective switched capacitance, $C_{eff}$. In most cases we use full swing, i.e., $\Delta V = V_{dd}$, which yields

$$P_{sw} = f_{CL} \, C_{eff} \, V_{dd}^2 \quad (3)$$
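As a minimal numeric illustration of (3), the following Python sketch computes the dynamic power for an invented set of circuit parameters; all values are assumptions chosen for the example and do not describe any particular process.

```python
# Worked example of eq. (3); all parameter values are assumptions.
f_cl  = 50e6     # clock frequency: 50 MHz
c_l   = 100e-12  # equivalent switched capacitance: 100 pF
alpha = 0.2      # switching activity factor
vdd   = 3.3      # supply voltage, full swing (delta-V = Vdd)

c_eff = alpha * c_l             # effective switched capacitance
p_sw  = f_cl * c_eff * vdd**2   # P_sw = f_CL * C_eff * Vdd^2
print(f"P_sw = {p_sw * 1e3:.1f} mW")  # prints: P_sw = 10.9 mW
```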

Hence, the power consumption can be reduced by using a lower power supply voltage. Unfortunately, this has the effect that the speed of the circuit is also reduced [1]–[3] according to

$$f_{CL} = \frac{k \, (V_{dd} - V_{th})^{\alpha}}{C_L \, V_{dd}} \quad (4)$$

For properly sized gates, $P_{sc}$ is less than 10% of $P_{total}$, although this share may increase with continued device scaling. $P_{static}$ is not present in pure CMOS circuits, but occurs in sense amplifiers, voltage references, and constant current sources. $P_{leakage}$ is due to leakage currents from reverse-biased PN junctions at the source and drain of MOS transistors, and to subthreshold currents. The junction leakage is proportional to device area and temperature, while the subthreshold leakage is strongly dependent on the device threshold voltages and becomes an important factor when power supply voltage scaling is used to lower the power. For systems with a high ratio of stand-by operation to active operation, $P_{leakage}$ may be the dominant factor in determining overall battery life.

Using a special low-power CMOS process can, of course, reduce the power consumption of digital circuits. Here, however, we will assume that the digital filters are implemented using a standard digital CMOS process. Reducing any of the factors in (1) will decrease the power consumption.

3 Design Techniques for Power Reduction

Opportunities for power minimization are available, and must be explored, at all levels of the design hierarchy [1], [2], [6]–[10]. Here we will only discuss some possibilities at the algorithmic, arithmetic, and hardware-architecture levels.

3.1 Algorithmic Level

At the algorithmic level, energy reduction techniques typically focus on minimizing the number of operations, weighted by the energy cost of those operations. Multirate techniques are efficient in terms of operations per sample [11]–[15]. Common to these techniques is that the effective number of arithmetic operations per sample is reduced. Wave digital filter structures [5], [16] have, besides their robustness, very low coefficient sensitivity, which reduces the required coefficient word length as well as the data word length. Hence, they are good low-power candidates. Other techniques try to reduce or minimize the arithmetic complexity, e.g., strength reduction [17], or to minimize the memory traffic. In some algorithms, the power consumption can be reduced by eliminating operations with known results altogether.

3.1.1 Architecture-Driven Power Supply Voltage Scaling

Architecture-driven power supply voltage scaling techniques exploit the quadratic dependency on $V_{dd}$ in (3) to reduce the power consumption [1]. In order to compensate for the associated reduction in circuit speed, (4), the inherent computational parallelism in the algorithm should be exploited. We have therefore proposed the following way to reduce the power consumption [5], [18] (a numeric sketch is given at the end of this subsection).
• Design a maximally fast and resource-minimal realization of the basic DSP algorithm.
• Convert any excess circuit speed into reduced power consumption by reducing the supply voltage.
From (4) it is evident that, from a speed point of view, a CMOS process with a low threshold voltage is to be preferred. In practice, many circuits are therefore manufactured in a process with two threshold voltages. Devices with a low threshold voltage are used for high-speed circuits, and devices with a high threshold voltage are used for low-speed circuits, since the latter have smaller leakage currents. Typically, the power supply voltage is in the range $2V_{th}$ to $3V_{th}$.
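The following sketch makes this trade-off concrete under the simple models (3) and (4). All parameter values (k, α, V_th, the capacitances, and the nominal 3.0 V supply) are illustrative assumptions, not process data: duplicating the datapath (N = 2) halves the speed required per unit, and a bisection search finds the lowest supply voltage that still meets it.

```python
# Hedged sketch of architecture-driven voltage scaling, eqs. (3) and (4).

def speed(vdd, vth=0.4, k=1.0, alpha=1.5, c_l=1.0):
    # Eq. (4): f_CL = k * (Vdd - Vth)^alpha / (C_L * Vdd)
    return k * (vdd - vth) ** alpha / (c_l * vdd)

def power(f, vdd, c_eff=1.0):
    # Eq. (3): P_sw = f_CL * C_eff * Vdd^2
    return f * c_eff * vdd ** 2

def scaled_vdd(target_f, vth=0.4):
    # Bisection for the lowest Vdd that still meets the target speed.
    lo, hi = vth + 1e-6, 3.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if speed(mid, vth) < target_f:
            lo = mid
        else:
            hi = mid
    return hi

f_req = speed(3.0)              # required throughput at the nominal 3.0 V
v2 = scaled_vdd(f_req / 2)      # two parallel units each run at half speed
ratio = 2 * power(f_req / 2, v2) / power(f_req, 3.0)
print(f"Vdd scaled from 3.0 V to {v2:.2f} V, power ratio {ratio:.2f}")
```

With these assumed numbers the supply drops to about 1.4 V and the total power to roughly 20% of the nominal value, despite the duplicated hardware.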

3.1.2 Maximally Fast Algorithms

DSP algorithms can be divided into iterative processing and block processing algorithms [5]. The maximal sample frequency for an iterative, recursive algorithm, described by a fully specified signal-flow graph, is [5], [19]

$$f_{max} = \min_i \left( \frac{N_i}{T_{op,i}} \right) \quad (5)$$

where $T_{op,i}$ is the total latency due to the arithmetic operations, and $N_i$ is the number of delay elements, in the directed loop $i$ [19]. The minimum sample period is also referred to as the iteration period bound. The iteration period bound can, of course, not be improved for a given algorithm. However, a new algorithm with a higher bound can often be derived from the original algorithm. We recognize from (5) that there are two possibilities to improve the bound.
• Reduce the operation latency in the critical loop.
• Introduce additional delay elements into the loop.
Removing superfluous arithmetic or logic operations from the loop can often reduce the operation latency [5]. Using low-sensitivity filter structures, e.g., wave digital filters, can also reduce the latency of the multipliers. Another efficient approach is to use cascade form structures [12], [20]. A new algorithm with more delay elements can be obtained in several ways, for example, by scattered look-ahead [21] or by frequency-masking techniques, which were first proposed by Lim. We do not advocate the former, since some finite word length problems may occur [22].
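The bound in (5) is straightforward to evaluate once the loops of the signal-flow graph are known. In the sketch below, each loop is described by its total operation latency and its number of delay elements; the example loop data are invented for illustration.

```python
# Minimal sketch: iteration period bound, eq. (5).
from fractions import Fraction

def f_max(loops):
    """Maximal sample frequency (relative to the clock rate):
    the minimum of N_i / T_op,i over all directed loops."""
    return min(Fraction(n_i, t_op_i) for t_op_i, n_i in loops)

# (total operation latency in clock cycles, number of delay elements)
loops = [(4, 1), (6, 2), (10, 2)]
bound = f_max(loops)
print(f"f_max = {bound} of f_clock, T_min = {1 / bound} clock cycles")
# prints: f_max = 1/5 of f_clock, T_min = 5 clock cycles
```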

3.2 Arithmetic

At the arithmetic level we have several options. We can divide the data into smaller digits requiring smaller chip area, as will be discussed in Section 3.2.1; exploit redundancies within and among operations, as will be discussed in Sections 3.2.2 and 3.2.3; use distributed arithmetic for complex multiplications and sums-of-products [5]; and use simplified/truncated multipliers [23]. Further, we almost always select fixed-point over floating-point arithmetic. The number representation is another area that offers power trade-offs [4], [5], [10]. Selecting sign-magnitude instead of two's-complement representation results in a significant power reduction if the input samples are uncorrelated and the dynamic range is minimized [1]. Minimization of the data and coefficient word lengths is also important in order to minimize the power consumption.

3.2.1 Digit-Serial Processing

Bit- and digit-serial processing techniques have received considerable attention during the last decade [5], [24], [25]. In bit-serial processing, one bit of the input word is processed each clock cycle. A bit-serial adder computing c = a + b is shown in Fig. 1(a), while a bit-serial serial/parallel multiplier computing b = αa is shown in Fig. 1(b).


Figure 1: Bit-serial (a) adder and (b) serial/parallel multiplier.

In digit-serial processing, a number of bits of the input word, a digit, is processed in parallel. If the digit-size, i.e., the number of bits processed concurrently, is one, the digit-serial system reduces to a bit-serial system, while for a digit-size equal to the word length the system reduces to a bit-parallel system. The motivation for digit-serial processing is to find an optimal trade-off between area and processing capacity.

Traditionally, digit-serial multipliers have been obtained either via unfolding of a bit-serial multiplier [26] or via folding of a bit-parallel multiplier [27]. The problem with these approaches is that the obtained circuits are not pipelineable at the bit-level. The reason for this is shown in Fig. 2, where a digit-serial adder obtained via unfolding of a bit-serial adder is shown. The recursive loop prohibits the insertion of pipelining to reduce the critical path to less than d full adders, but solutions to this problem have been proposed in [28]–[31].

Figure 2: Digit-serial adder with digit-size d obtained from unfolding a bit-serial adder.

The latency of digit-serial processing elements is conveniently described in terms of the number of clock cycles. An adder has a latency of zero clock cycles, while a serial/parallel multiplier has a latency of $\lceil W_f / d \rceil$ clock cycles, where $W_f$ is the number of fractional bits of the coefficient. The clock frequency is determined by the longest critical path, which basically is given by the number of adjacent operations. Introducing a pipelining stage increases the latency by one clock cycle, but the critical (electrical) path is shortened and, thus, the clock frequency increased. The optimal level of pipelining in maximally fast, digit-serial implementations is presented in [32]. A behavioral sketch of the bit-serial case is given below.
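Returning to the bit-serial adder of Fig. 1(a): it consists of one full adder and a carry flip-flop (D), and processes one bit per clock cycle, least significant bit first. The following Python sketch models its behavior only, not the circuit timing; the four-bit word length in the usage example is arbitrary.

```python
# Behavioral sketch of the bit-serial adder of Fig. 1(a).

def bit_serial_add(a_bits, b_bits):
    """Add two numbers presented LSB first as lists of bits."""
    carry = 0                # the carry flip-flop, cleared before each word
    s = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        s.append(total & 1)  # sum output of the full adder
        carry = total >> 1   # carry stored for the next clock cycle
    return s

# 5 + 3 with 4-bit words: 5 = [1,0,1,0] LSB first, 3 = [1,1,0,0] LSB first
print(bit_serial_add([1, 0, 1, 0], [1, 1, 0, 0]))  # [0, 0, 0, 1], i.e., 8
```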

3.2.2 Minimum-Adder Multipliers

Multiplication with a constant fixed-point number is a common operation in many DSP algorithms. The multiplier can be expressed as a network of shifts, adders, and subtractors. For bit-parallel arithmetic, the shifts can be realized without any gates using hard wiring. To obtain an efficient implementation with a small switched capacitance, it is therefore of interest to minimize the number of additions and subtractions. As the area cost is similar for additions and subtractions, we will refer to both as additions.

Representing a number using the canonic signed digit (CSD) representation is a common method when an area-efficient multiplier is required. A B-bit integer coefficient c is represented as

$$c = \sum_{i=0}^{B-1} b_i 2^i \quad (6)$$

where $b_i \in \{-1, 0, 1\}$ and $b_i b_{i+1} = 0$ for $0 \le i \le B-2$. The average number of non-zero bits is about B/3 for $0 \le c \le 2^B - 1$ [33]. CSD multipliers are based on sums of power-of-two multiples of the input word. However, other network topologies, not restricted to the CSD representation, are available that require an even smaller number of adders. An optimal method was described in [34]. The method is based on the graph representation of multipliers introduced in [35]. The search method involved was simplified and extended in [36]. In Fig. 3, a graph representation of a multiplication with 341 is shown. The vertices correspond to additions, while the edges correspond to multiplications with $\pm 2^n$, i.e., shifts and possible replacement of the subsequent addition by a subtraction. The value obtained in each vertex is known as a fundamental. Using CSD requires four adders, while only three adders are required using the graph representation. This concept has also been extended in [37] to a redundant type of adders known as carry-save adders [38]. A sketch contrasting the two approaches is given below.
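The following sketch recodes a coefficient into CSD digits and counts the implied adders (non-zero digits minus one), and then verifies the three-adder decomposition of 341 that Fig. 3 suggests via the fundamentals 5 and 261. The recoding routine is a standard CSD algorithm; the decomposition in the comments is read from the figure.

```python
def csd_digits(c):
    """CSD recoding of a non-negative integer, least significant digit first."""
    digits = []
    while c > 0:
        if c & 1:
            d = 2 - (c % 4)  # +1 if c = 1 (mod 4), -1 if c = 3 (mod 4)
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def csd_adders(c):
    """A CSD multiplier needs one adder per non-zero digit beyond the first."""
    return sum(1 for d in csd_digits(c) if d != 0) - 1

print(csd_adders(341))  # prints 4: the CSD multiplier needs four adders

# Graph multiplier of Fig. 3: only three adders via the fundamentals 5 and 261.
x = 1                    # any input value; shifts are free hard wiring
f5   = (x << 2) + x      # fundamental 5   (first adder)
f261 = (x << 8) + f5     # fundamental 261 (second adder)
y    = f261 + (f5 << 4)  # 341*x           (third adder)
assert y == 341 * x
```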


Figure 3: Graph representation of multiplication with 341. Fundamentals are in bold.

3.2.3 Multiple Constant Multiplication

In many DSP algorithms, one data value is multiplied by several constants. One typical example is the transposed form FIR filter, where the input data is multiplied with all the filter coefficients, as shown in Fig. 4.

Figure 4: Transposed form FIR filter.

The hardware cost can be decreased by utilizing redundancy between the coefficients [35], [39]–[45]. This is known as multiple constant multiplication (MCM). Previous work in this area can be divided into two classes of techniques. The first is based on pattern matching [40]–[45], and its results depend on the initial representation of the constants. These algorithms are often referred to as using subexpression sharing or subexpression elimination. The second class is independent of the representation and only considers the values obtained after each addition [35], [39]. The algorithm presented in [39] utilizes a graph representation and is shown to be optimal in terms of the minimum number of additions. A small example of the savings is sketched below.
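As a toy MCM illustration, consider the coefficient set {5, 21, 85}; the set is invented for this example and is not taken from the paper. With sharing, each new product reuses the previous fundamental at the cost of one extra addition, whereas separate CSD multipliers for the same set would need 1 + 2 + 3 = 6 adders.

```python
# Hedged sketch of multiple constant multiplication with shared fundamentals.
x = 1  # any input value; the shifts are free hard-wired operations

f5  = (x << 2) + x    # 5x:  one adder
f21 = (f5 << 2) + x   # 21x: reuses 5x, one more adder
f85 = (f21 << 2) + x  # 85x: reuses 21x, one more adder -> 3 adders in total

assert (f5, f21, f85) == (5 * x, 21 * x, 85 * x)
```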


Most of the algorithms in the first class can be applied to subexpression-sharing applications other than multiplication, such as the evaluation of Hadamard matrices. Note that by transposing the signal-flow graph, a sum-of-products network is obtained, i.e., the sum-of-products problem, e.g., for a direct form FIR filter, can be solved in the same way.

3.2.4 Number Representation

In older CMOS processes, the equivalent switched capacitance is due to devices and wires, and the dominant wire capacitance is between the wire and the substrate. In deep submicron CMOS technologies, the capacitance associated with the devices is relatively small, and the wire capacitance is instead dominated by the interwire capacitance, as illustrated in Fig. 5. For long buses it is therefore important to order the transmission of successive data on the bus so that the charging/discharging of the interwire capacitors is minimized. A useful measure to minimize is the sum of the Hamming distances between consecutively transmitted words. For speech signals, for example, it may be advantageous to use sign-magnitude representation on long buses, since most samples have small magnitudes, which causes only the least significant bits to vary, while in two's-complement representation almost all bits change when a sample value changes sign. This effect is illustrated by the sketch below.
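A minimal sketch of this effect: the sample sequence below is an invented stand-in for a small-magnitude, sign-alternating speech signal, and the eight-bit bus width is likewise an assumption. The toggle count approximates the bus activity that drives the interwire charging.

```python
# Hedged sketch: bus toggles for two number representations. The sample
# sequence and the 8-bit bus width are invented example values.
W = 8

def twos_complement(x):
    return x & ((1 << W) - 1)

def sign_magnitude(x):
    return abs(x) | ((1 << (W - 1)) if x < 0 else 0)

def toggles(words):
    """Sum of Hamming distances between successive bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

samples = [3, -2, 1, -1, 2, -3]  # small magnitudes, frequent sign changes
print(toggles([twos_complement(x) for x in samples]))  # prints 37
print(toggles([sign_magnitude(x) for x in samples]))   # prints 11
```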

3.3 Architectural Level

The architecture design starts from an operation schedule that is isomorphically mapped to a corresponding hardware structure. It is important that the schedule is maximally fast, so that any excess speed can be traded for lower power consumption by power supply voltage scaling.

3.3.1 Cyclic Scheduling

To achieve a maximally fast schedule it is necessary to schedule over several sample intervals by connecting several computation graphs [5], which yields a periodic schedule. Each computation graph represents the operations and the shimming delays within one sample interval. It may help to imagine the computation graphs drawn on a cylinder of circumference equal to the scheduling period, as shown in Fig. 6. Hence, the schedule has no beginning or ending that could restrict the scheduling of the operations. We denote this scheduling formulation as cyclic. It is the most general scheduling formulation, and it can be used to find maximally fast, resource-optimal schedules. To attain the minimum sample period, it is necessary to perform cyclic scheduling of the operations belonging to several successive sample intervals if
• the execution time for a processing element (PE) is longer than $T_{min}$, or
• the critical loop(s) contains more than one delay element.

Figure 5: Interwire and wire-to-substrate capacitance model.

Figure 6: Illustration of the cyclic operation schedule.

Generally, the critical loop should be at least as long as the longest execution time of any of the PEs in the loop. Bit-serial and digit-serial operations with execution times longer than the minimal sample period can thereby be completed within the scheduling period. The minimum number of sample periods to schedule over is

$$m = \left\lceil \frac{\max_i (T_{exe,i})}{T_{min}} \right\rceil \quad (7)$$

where $T_{exe,i}$ is the execution time of operation i. The execution time is $\lceil W_d / d \rceil$ clock cycles for a digit-serial adder. For a serial/parallel multiplier it is $\lceil W_d / d \rceil + \lceil W_f / d \rceil$ clock cycles, where $W_d$ is the data word length. These relations are collected in the sketch below.
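The sketch below collects (7) together with the execution-time expressions above; the word lengths, digit size, and T_min are invented example values.

```python
import math

def t_exe_add(w_d, d):
    """Execution time of a digit-serial adder, in clock cycles."""
    return math.ceil(w_d / d)

def t_exe_mul(w_d, w_f, d):
    """Execution time of a digit-serial serial/parallel multiplier."""
    return math.ceil(w_d / d) + math.ceil(w_f / d)

def periods_to_schedule(t_exe, t_min):
    """Minimum number of sample periods m to schedule over, eq. (7)."""
    return math.ceil(max(t_exe) / t_min)

w_d, w_f, d, t_min = 16, 12, 4, 5  # assumed word lengths, digit size, T_min
times = [t_exe_add(w_d, d), t_exe_mul(w_d, w_f, d)]
print(times, "->", periods_to_schedule(times, t_min), "sample periods")
# prints: [4, 7] -> 2 sample periods
```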

4 Conclusion

In this paper we have discussed some aspects of designing arithmetic circuits with low power consumption. We focused on digit-serial processing techniques, which allow an optimal trade-off between area and processing capacity. Further, minimum-adder multipliers and multiple constant multiplication were discussed as means to minimize the switched capacitance and the area. A review of the power dissipation sources in CMOS was also given, together with discussions on decreasing the power dissipation at the algorithmic, arithmetic, and architecture levels. As noted, power optimization must be performed at all levels of the design hierarchy. We proposed using maximally fast implementations and trading any excess speed for lower power consumption by lowering the power supply voltage.

References

[1] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer Academic, 1995.
[2] A. Bellaouar and M. I. Elmasry, Low Power Digital CMOS Design: Circuits and Systems, Kluwer, 1996.
[3] T. Njølstad and E. J. Aas, "Validation of an accurate and simple delay model and its application to voltage scaling," Proc. IEEE Int. Symp. Circuits Syst., Monterey, CA, vol. II, May 31–June 3, 1998, pp. 101–104.
[4] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, Englewood Cliffs, NJ, 1993.
[5] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[6] J. M. Rabaey and M. Pedram (Eds.), Low Power Design Methodologies, Kluwer, 1996.
[7] M. S. Elrabaa, I. S. Abu-Khater, and M. I. Elmasry, Advanced Low-Power Digital Circuit Techniques, Kluwer, 1997.
[8] L. Benini and G. De Micheli, Dynamic Power Management: Design Techniques and CAD Tools, Kluwer, Norwell, MA, 1998.
[9] B. Moyer, "Low-power design for embedded processors," Proc. IEEE, vol. 89, no. 11, pp. 1576–1587, Nov. 2001.
[10] M. Mehendale and S. D. Sherlekar, VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations, Kluwer Academic, 2001.
[11] H. Johansson, "Multirate single-stage and multistage structures for high-speed recursive digital filtering," Proc. IEEE Int. Symp. Circuits Syst., Orlando, FL, vol. 3, May 30–June 2, 1999, pp. 291–294.
[12] H. Johansson and L. Wanhammar, "Filter structures composed of allpass and FIR filters for interpolation and decimation with factors of two," IEEE Trans. Circuits Syst. II, vol. 46, no. 7, pp. 896–905, July 1999.
[13] T. Njølstad and T. A. Ramstad, "Low-power narrow transition-band FIR filters," Proc. IEEE Nordic Signal Processing Symp., Kolmården, Sweden, June 13–15, 2000, pp. 181–185.
[14] H. Johansson, "On high-speed recursive digital filters," Proc. X European Signal Processing Conf., Tampere, Finland, vol. 3, Sept. 5–8, 2000, pp. 1701–1704.
[15] G. Jovanovic-Dolecek (Ed.), Multirate Systems: Design and Applications, Idea Group Publishing, Hershey, PA, 2001.
[16] A. Fettweis, "Wave digital filters: Theory and practice," Proc. IEEE, vol. 74, no. 2, pp. 270–327, Feb. 1986.
[17] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.
[18] M. Vesterbacka, On Implementation of Maximally Fast Wave Digital Filters, Diss. No. 487, Linköping University, June 1997.
[19] M. Renfors and Y. Neuvo, "The maximal sampling rate of digital filters under hardware speed constraints," IEEE Trans. Circuits Syst., vol. CAS-28, no. 3, pp. 196–202, March 1981.
[20] H. Johansson and L. Wanhammar, "High-speed recursive filter structures composed of identical sub-filters and QMF banks with perfect magnitude reconstruction," IEEE Trans. Circuits Syst., vol. 46, no. 1, pp. 16–28, Jan. 1999.
[21] K. K. Parhi and D. G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters—Part I: Pipelining using scattered look-ahead and decomposition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-37, no. 7, pp. 1099–1117, July 1989.
[22] K. S. Arun and D. R. Wagner, "High-speed digital filtering: Structures and finite wordlength effects," J. VLSI Signal Processing, vol. 4, no. 4, pp. 355–370, Nov. 1992.
[23] M. J. Schulte, J. E. Stine, and J. G. Jansen, "Reduced power dissipation through truncated multiplication," Proc. IEEE Alessandro Volta Memorial Workshop on Low-Power Design, Como, Italy, Mar. 1999, pp. 61–69.
[24] S. G. Smith and P. B. Denyer, Serial Data Computation, Kluwer, 1988.
[25] R. I. Hartley and K. K. Parhi, Digit-Serial Computation, Kluwer, 1995.
[26] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Trans. Circuits Syst., vol. 38, pp. 358–375, Apr. 1991.
[27] K. K. Parhi, C.-Y. Wang, and A. P. Brown, "Synthesis of control circuits in folded pipelined DSP architectures," IEEE J. Solid-State Circuits, vol. 27, pp. 29–43, Jan. 1992.
[28] A. S. Ashur, M. K. Ibrahim, and A. Aggoun, "Systolic digit-serial multiplier," IEE Proc. Circuits, Devices, Syst., vol. 142, no. 1, pp. 14–20, Feb. 1996.
[29] Y.-N. Chang, J. H. Satyanarayana, and K. K. Parhi, "Systematic design of high-speed and low-power digit-serial multipliers," IEEE Trans. Circuits Syst. II, vol. 45, no. 12, pp. 1585–1596, Dec. 1998.
[30] O. Nibouche, A. Bouridane, M. Nibouche, and D. Crookes, "A new pipelined digit-serial multiplier," Proc. IEEE Int. Symp. Circuits Syst., Geneva, Switzerland, vol. 1, May 28–31, 2000, pp. 12–15.
[31] O. Gustafsson and L. Wanhammar, "Bit-level pipelinable general and fixed coefficient digit-serial/parallel multipliers based on shift-accumulation," in preparation.
[32] O. Gustafsson and L. Wanhammar, "Optimal logic level pipelining for digit-serial implementation of maximally fast recursive digital filters," National Conf. Radio Science (RVK), Kista, Sweden, June 10–13, 2002.
[33] A. Avizienis, "Signed-digit number representation for fast parallel arithmetic," IRE Trans. Electronic Comp., vol. 10, pp. 389–400, 1961.
[34] A. G. Dempster and M. D. Macleod, "Constant integer multiplication using minimum adders," IEE Proc. Circuits Devices Syst., vol. 141, no. 6, pp. 407–413, Oct. 1994.
[35] D. R. Bull and D. H. Horrocks, "Primitive operator digital filters," IEE Proc. G, vol. 138, pp. 401–412, June 1991.
[36] O. Gustafsson, A. G. Dempster, and L. Wanhammar, "Extended results for minimum-adder constant integer multipliers," Proc. IEEE Int. Symp. Circuits Syst., Phoenix, AZ, May 26–29, 2002.
[37] O. Gustafsson, H. Ohlsson, and L. Wanhammar, "Minimum-adder integer multipliers using carry-save adders," Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, vol. 2, May 6–9, 2001, pp. 709–712.
[38] T. G. Noll, "Carry-save architectures for high-speed digital signal processing," J. VLSI Signal Processing, vol. 3, pp. 121–140, 1991.
[39] A. G. Dempster and M. D. Macleod, "Use of minimum-adder multiplier blocks in FIR digital filters," IEEE Trans. Circuits Syst. II, vol. 42, pp. 569–577, Sept. 1995.
[40] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, "Synthesis of multiplierless FIR filters with minimum number of additions," Proc. IEEE/ACM Int. Conf. Computer-Aided Design, Los Alamitos, CA, 1995, pp. 668–671.
[41] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple constant multiplication: Efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. Computer-Aided Design, vol. 15, pp. 151–161, Feb. 1996.
[42] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits Syst. II, vol. 43, pp. 677–688, Oct. 1996.
[43] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde, and D. Durackova, "A new algorithm for elimination of common subexpressions," IEEE Trans. Computer-Aided Design, vol. 18, pp. 58–68, Jan. 1999.
[44] A. Yurdakul and G. Dündar, "Multiplierless realization of linear DSP transforms by using common two-term expressions," J. VLSI Signal Processing, vol. 22, pp. 163–172, Sept. 1999.
[45] O. Gustafsson and L. Wanhammar, "ILP modelling of the common subexpression sharing problem," in preparation.
